A friend of mine emailed me the other day with a quick question: “What’s an easy way to convert an entire site to PDF? Are there tools for this?”
Why yes, yes there are. In fact, it’s pretty easy to do if you’re on a Mac or a Linux machine, using wget and wkhtmltopdf:
$ mkdir /wget
$ wget --mirror -w 2 -p --html-extension --convert-links -P /wget http://darrenknewton.com
$ cd /wget
$ find darrenknewton.com -name '*.html' -exec wkhtmltopdf {} {}.pdf \;
$ mkdir pdfs
$ find darrenknewton.com -name '*.pdf' -exec mv {} pdfs/ \;
Explanation
wget is a great little program to grab content from the web. It’s a web Swiss Army Knife®. wkhtmltopdf is another great piece of software, which converts HTML to PDF. It can take content from the web itself, or pick up files from your desktop.
You can install both packages easily with Homebrew. If you’re a MacPorts person then you’re out of luck with wkhtmltopdf and will need to pick up a binary.
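Before going further, it can help to confirm that both tools are actually on your PATH. Here is a minimal sketch, assuming a POSIX shell; require_tools is a hypothetical helper name of my own, not something either package provides:

```shell
# Sketch only: require_tools is a hypothetical helper, not part of wget or wkhtmltopdf.
require_tools() {
  missing=0
  for tool in "$@"; do
    # command -v succeeds only if the shell can resolve the name
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Report (without aborting) if either tool used in this post is absent
require_tools wget wkhtmltopdf || echo "install the missing tools before continuing" >&2
```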
In this example I am mirroring the site to my desktop using wget and then doing the conversion to PDF. I could probably rig up a way to pass URLs to wkhtmltopdf directly, but I think making a local copy first gives me some flexibility, and for a really big site I would be able to do the PDF conversion offline and at my leisure.
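For comparison, the direct-URL route might look roughly like the sketch below. wkhtmltopdf does accept a URL as its input argument; everything else here is an assumption for illustration: the helper names url_to_pdf_name and convert_urls, the idea of a text file with one URL per line, and the naming scheme for the output files.

```shell
# Sketch only: both function names and the one-URL-per-line file are assumptions.

# Turn a URL into a flat file name: drop the scheme and any trailing slash,
# then replace the remaining slashes with underscores.
url_to_pdf_name() {
  name=${1#*://}
  name=${name%/}
  printf '%s.pdf\n' "$(printf '%s' "$name" | tr '/' '_')"
}

# Read URLs from the file given as $1 and convert each one in turn.
convert_urls() {
  while IFS= read -r url; do
    [ -n "$url" ] || continue   # skip blank lines
    # </dev/null keeps the converter from consuming the URL list on stdin
    wkhtmltopdf "$url" "$(url_to_pdf_name "$url")" </dev/null
  done < "$1"
}
```

With a hypothetical urls.txt in the current directory, convert_urls urls.txt would write one PDF per URL. The catch is that you have to build the URL list yourself, which is exactly the job the mirror does for free.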
So first, let’s mirror the site with wget (I’m going to use this site as an example):
$ wget --mirror -w 2 -p --html-extension --convert-links -P /wget http://darrenknewton.com
This will spider the site and dump all of its files into /wget, a directory I made for this demo. When you spider my site, wget will create a directory called darrenknewton.com inside it. From within /wget, it looks like this:
$ ls -alh darrenknewton.com
drwxr-xr-x 12 shibuya staff 408B Oct 30 18:17 .
drwxr-xr-x 5 shibuya staff 170B Oct 30 18:26 ..
drwxr-xr-x 3 shibuya staff 102B Oct 30 18:17 about
-rw-r--r-- 1 shibuya staff 8.3K Oct 30 17:39 about.1.html
-rw-r--r-- 1 shibuya staff 6.6K Oct 30 09:31 atom.xml
drwxr-xr-x 6 shibuya staff 204B Oct 30 18:17 blog
-rw-r--r-- 1 shibuya staff 561B Oct 30 17:38 favicon.png
drwxr-xr-x 8 shibuya staff 272B Oct 30 17:39 images
-rw-r--r-- 1 shibuya staff 13K Oct 30 17:39 index.html
drwxr-xr-x 6 shibuya staff 204B Oct 30 17:39 javascripts
-rw-r--r-- 1 shibuya staff 22B Oct 30 09:31 robots.txt
drwxr-xr-x 3 shibuya staff 102B Oct 30 17:39 stylesheets
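If the wget one-liner looks cryptic, here it is again with each option spelled out in its long form. This is only a sketch: mirror_site is a hypothetical wrapper name, and the comments summarise the wget manual as I understand it.

```shell
# Sketch only: mirror_site is a hypothetical wrapper around the command from this post.
mirror_site() {
  url=$1    # site to mirror
  dest=$2   # directory to save the mirror into

  set -- --mirror                        # recursive download with timestamping and unlimited depth
  set -- "$@" --wait=2                   # -w 2: pause two seconds between requests
  set -- "$@" --page-requisites          # -p: also fetch the images, CSS and scripts each page needs
  set -- "$@" --html-extension           # save pages with an .html suffix (newer wget calls this --adjust-extension)
  set -- "$@" --convert-links            # rewrite links so they point at the local copies
  set -- "$@" --directory-prefix="$dest" # -P: where the mirror is written

  wget "$@" "$url"
}
```

Calling mirror_site http://darrenknewton.com /wget should behave the same as the command above.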
The .html files are all tucked away into directories, and we don’t want to go manually searching for them, so we will let find do it for us. The following snippet will find all of the .html files and pass them to wkhtmltopdf for conversion:
$ find darrenknewton.com -name '*.html' -exec wkhtmltopdf {} {}.pdf \;
This creates a PDF alongside each HTML file in the directory (index.html gets a sibling named index.html.pdf). So let’s grab those and dump them into their own directory:
$ mkdir pdfs
$ find darrenknewton.com -name '*.pdf' -exec mv {} pdfs/ \;
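One caveat with the mv step: it flattens everything into a single directory, so pages that share a file name in different directories (several index.html.pdf files, say) will overwrite one another. Here is a minimal sketch of one way around that, assuming a POSIX shell and file names without newlines. collect_pdfs and its naming scheme are my own illustration, not part of either tool.

```shell
# Sketch only: collect_pdfs is a hypothetical helper; the naming scheme is an assumption.
collect_pdfs() {
  src=${1%/}   # mirrored site directory, e.g. darrenknewton.com
  dest=$2      # flat output directory, e.g. pdfs
  mkdir -p "$dest"

  find "$src" -type f -name '*.pdf' | while IFS= read -r pdf; do
    rel=${pdf#"$src"/}                       # path relative to the site root
    flat=$(printf '%s' "$rel" | tr '/' '_')  # blog/index.html.pdf -> blog_index.html.pdf
    case $flat in
      *.html.pdf) flat=${flat%.html.pdf}.pdf ;;  # drop the doubled extension
    esac
    mv "$pdf" "$dest/$flat"
  done
}
```

Run from the directory that holds the mirror, collect_pdfs darrenknewton.com pdfs would stand in for the mkdir and find commands above. Folding the directory path into the file name keeps names distinct in the common case, though two unusual paths could still map to the same name.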
And there you go: you should have a directory full of PDFs.