PDFs from Webpages

While writing a paper I needed to take screenshots of webpages and include them as figures. I could take a screenshot, save it as a .png or .jpg, and call it good. However, these sorts of images do not provide a great deal of resolution and do not look clean in a published paper, at least in my experience. I could print the web page to PDF, but a PostScript-style PDF would basically create an image and wrap it in a PDF, which is not really any better than the screenshot solution.

With a quick internet search I uncovered two potential solutions: wkhtmltopdf and WebVector/CSSBox. I decided to try out wkhtmltopdf. It was simple and easy to install on Linux, and it produces great results with a bit of tinkering (see the manual). It should be noted, though, that wkhtmltopdf is based on the Qt port of WebKit, which has mostly not seen active development. I also tried WebVector (specifically on the website Pangloss), and it did not produce the expected results.

Playing around with wkhtmltopdf I was able to produce a PDF of several webpages I was looking to include. I used the following workflow after install:

I adjusted my browser window to fit the type of display I wanted. Then I measured it and used that measurement as my viewport size. Next I played around with the output settings for the PDF. I knew that I wanted really small margins because I was going to embed this PDF as a figure in another PDF. Then I kept adjusting the page size to find the cut-off point between the content I wanted and the footer. In the end the code I used looked like this:

wkhtmltopdf --no-print-media-type --viewport-size 1100 --page-height 6.25in --page-width 7.75in -T 1mm -B 1mm -R 1mm -L 1mm --load-media-error-handling skip --enable-javascript --javascript-delay 2000 sil-item.pdf

wkhtmltopdf --no-print-media-type --viewport-size 1100 --page-height 8in --page-width 7.75in -T 1mm -B 1mm -R 1mm -L 1mm --load-media-error-handling skip --enable-javascript --javascript-delay 2000 K-item.pdf

wkhtmltopdf --no-print-media-type --viewport-size 1100 --page-height 8.75in --page-width 7in -T 1mm -B 1mm -R 1mm -L 1mm --load-media-error-handling skip --enable-javascript --javascript-delay 2000 olac-item1.pdf
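Since the three commands differ only in page size and output name, the shared options could be collected in one small shell function. This is a minimal sketch, not the exact script I ran: the function name `webpage_to_pdf`, its argument order, the `DRY_RUN` switch, and the example URL are all hypothetical, and the page URL has to be supplied as the argument before the output file.

```shell
# Sketch: wrap the shared wkhtmltopdf options in one function.
# The function name, argument order, and DRY_RUN switch are hypothetical.
# Args: page URL, output file, page height, page width
webpage_to_pdf() {
  url=$1 out=$2 height=$3 width=$4

  # Rebuild the argument list with the same options used above.
  set -- wkhtmltopdf \
    --no-print-media-type \
    --viewport-size 1100 \
    --page-height "$height" \
    --page-width "$width" \
    -T 1mm -B 1mm -R 1mm -L 1mm \
    --load-media-error-handling skip \
    --enable-javascript \
    --javascript-delay 2000 \
    "$url" "$out"

  if [ -n "${DRY_RUN:-}" ]; then
    # Only print the command so it can be inspected first.
    echo "$*"
  else
    # Run wkhtmltopdf with the assembled arguments.
    "$@"
  fi
}

# Example call (the URL is a placeholder, not one of the pages I captured):
#   webpage_to_pdf "https://example.org/item" sil-item.pdf 6.25in 7.75in
```

Setting `DRY_RUN=1` before the call prints the assembled command instead of running it, which is handy while hunting for the right page height.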

This produced a nice PDF, but I needed to get rid of the second page. I did this with PDF-Shuffler and just deleted the second page. I wanted to include the final PDF in XLingPaper, which uses XeLaTeX on the back end. However, one of the current dependencies only allows PDFs of version 1.4 or older, so I had to “back version” the PDF to 1.4. I used the following code (adjusting for file names) to do that.

gs -q -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -o out.pdf in.pdf

My suspicion, though, is that passing the PDF through Ghostscript also removes some of the mark-up features that wkhtmltopdf had already put in the PDF. This may include some of the links and embedded fonts, but on this last point I am not 100 percent sure.
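To confirm that the conversion took, the version can be read from the first bytes of the file, since a PDF starts with a header such as `%PDF-1.4`. The sketch below is not part of my original workflow, and both function names are hypothetical. The second function is an untested alternative to the PDF-Shuffler step: it uses Ghostscript's `-dFirstPage` and `-dLastPage` options to keep only page 1 while writing PDF 1.4.

```shell
# Sketch: helper names are hypothetical, not part of the original workflow.

# Print the version number from a PDF's header, e.g. "1.4".
pdf_version() {
  # A PDF file begins with "%PDF-" followed by the version number.
  head -c 8 "$1" | sed -n 's/^%PDF-\([0-9]\.[0-9]\).*/\1/p'
}

# Write a PDF 1.4 copy of the input that contains only its first page.
# Args: input PDF, output PDF
first_page_pdf14() {
  gs -q -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
     -dFirstPage=1 -dLastPage=1 -o "$2" "$1"
}

# Example calls (file names as in the command above):
#   first_page_pdf14 in.pdf out.pdf
#   pdf_version out.pdf
```

Doing the page trim in Ghostscript would save a trip through a GUI tool, but the same caveat about lost links and fonts presumably applies.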

In an earlier version of this workflow I was using standard page sizes (A4, A3, etc.). Doing so required that I then crop the PDF so that the usable portion could be scaled properly in XLingPaper. To do this I used a separate tool to reduce the physical dimensions of the PDF pages. However, this step turned out to be unnecessary because I was able to specify the dimensions I needed directly in wkhtmltopdf.

Hugh Paterson III

My research interests include typological patterns in articulatory phonetics; User Experience design in language tools; and graph theory applied to language and linguistics.