Convert web pages to a single PDF

In this tutorial I will list a few commands that can be used to convert a set of web pages (downloaded recursively) into a single PDF document.

First, download the web pages.

$ cd /tmp
$ mkdir wget
$ wget --mirror -w 2 -p --html-extension --convert-links -P /tmp/wget http://www.someweblink.com/somedir
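
The short flags in the wget command above can be hard to read at a glance. Here is a minimal sketch of the same download spelled with long option names. The function name mirror_site and the DRY_RUN variable are hypothetical additions for illustration, --adjust-extension is the newer spelling wget uses for --html-extension, and --no-parent is an optional extra (not in the original command) that keeps wget from climbing above somedir.

```shell
#!/bin/bash

# mirror_site URL DEST: hypothetical helper wrapping the wget call above.
# Set DRY_RUN=1 to print the command instead of running it.
mirror_site() {
  local url=$1 dest=$2
  local cmd=(
    wget
    --mirror              # recursive download with timestamping
    --wait=2              # same as -w 2: pause 2 seconds between requests
    --page-requisites     # same as -p: also fetch images, CSS, etc.
    --adjust-extension    # newer spelling of --html-extension
    --convert-links       # rewrite links so pages work offline
    --no-parent           # assumption: stay below the starting directory
    --directory-prefix="$dest"   # same as -P
    "$url"
  )
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# Example: DRY_RUN=1 mirror_site http://www.someweblink.com/somedir /tmp/wget
```

Running it with DRY_RUN=1 first is a cheap way to confirm the options before starting a long download.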

Change the working directory (cd) to the directory inside the downloaded folder whose HTML files you want to convert to PDF.

$ cd /tmp/wget/www.someweblink.com/somedir
$ find . -name '*.html' -exec wkhtmltopdf {} {}.pdf \;

This creates a PDF file for each HTML file in every sub-directory, recursively. Each output keeps the full input name, so index.html becomes index.html.pdf.
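
wkhtmltopdf may fail on individual pages, so it can be worth checking that every HTML file actually got a PDF. A minimal sketch, assuming the {}.pdf naming used in the find command above; the function name missing_pdfs is hypothetical.

```shell
#!/bin/bash

# missing_pdfs DIR: print every .html file under DIR that has no
# non-empty .html.pdf next to it (hypothetical helper name).
missing_pdfs() {
  local dir=${1:-.}
  find "$dir" -name '*.html' -print0 |
  while IFS= read -r -d '' page; do
    [ -s "$page.pdf" ] || printf '%s\n' "$page"
  done
}

# Example: missing_pdfs .
```

Any page it prints can be converted again by hand with wkhtmltopdf.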

Copy the PDF files to a single directory.

Note: Many of the files may be named index.html.pdf, so we must make sure one file does not overwrite another when they are copied into a single directory.

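#!/bin/bash

# Copy every PDF into newfolder, prefixing its name with its two parent directories.
mkdir -p newfolder
# Skip newfolder itself; -print0 and read -d '' keep names with spaces intact.
find . -path ./newfolder -prune -o -name '*.pdf' -print0 |
while IFS= read -r -d '' f
do
 filename=$(printf '%s\n' "$f" | awk -F'/' '{SL = NF-1; TL = NF-2; print $TL "_" $SL "_" $NF}')
 cp "$f" "newfolder/$filename"
done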

Save the script above as a shell file (file.sh) and run it with bash (bash file.sh) from your target directory. It copies all the PDF files, recursively, into newfolder and prefixes each file name with its parent folder names.

Note: If a file name starts with ".", the file is hidden inside the newfolder directory; use the "ls -al" command to list such files. If some file names do not start with ".", the files may be out of order. You can add ".00" or another prefix to the file names so that they list in the order you want.
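
To see where the leading "." comes from, it can help to run the awk expression from the script on a few sample paths. The paths below are made up for illustration, and flat_name is a hypothetical wrapper name.

```shell
#!/bin/bash

# flat_name PATH: apply the same awk expression as the copy script.
flat_name() {
  printf '%s\n' "$1" |
    awk -F'/' '{SL = NF-1; TL = NF-2; print $TL "_" $SL "_" $NF}'
}

# Two directories deep: both parent names are used.
flat_name ./docs/api/index.html.pdf    # prints docs_api_index.html.pdf

# One directory deep: the first field is the "." from "./",
# so the result is a hidden file.
flat_name ./docs/index.html.pdf        # prints ._docs_index.html.pdf
```

Only the last two directory names are kept, so files three or more levels deep that share those two names would still overwrite each other.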

Once the files are in order (all of them might start with "."), run the following command inside newfolder to join them.

$ pdfunite .*pdf merged.pdf
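
The .*pdf glob relies on the shell's sort order and skips any file that does not start with ".". A minimal sketch that gathers both hidden and visible PDFs in a fixed order before calling pdfunite; list_pdfs is a hypothetical helper, and the output name merged.pdf is taken from the command above.

```shell
#!/bin/bash

# list_pdfs DIR: print the PDFs directly inside DIR (hidden ones included),
# sorted byte-wise, one per line, leaving out a previous merged.pdf.
list_pdfs() {
  local dir=${1:-.}
  find "$dir" -maxdepth 1 -type f -name '*.pdf' ! -name 'merged.pdf' |
    LC_ALL=C sort
}

# Usage, assuming file names contain no newlines:
#   mapfile -t files < <(list_pdfs newfolder)
#   pdfunite "${files[@]}" newfolder/merged.pdf
```

With LC_ALL=C, names starting with "." sort before names starting with a letter or digit, so it is worth reviewing the list before merging.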

Conclusion

I hope you found this article useful.

References

  1. http://darrennewton.com/2011/10/30/mirror-site-and-convert-to-pdf/
  2. http://stackoverflow.com/questions/2507766/merge-convert-multiple-pdf-files-into-one-pdf
  3. http://stackoverflow.com/questions/643372/append-name-of-parent-folders-and-subfolders-to-the-names-of-the-multiple-files