PDFs from Webpages

While writing a paper I needed to take screenshots of webpages and include them as figures. I could take a screenshot, save it as a .png or .jpg, and call it good. However, raster images like these do not offer much resolution and do not look clean in a published paper; screenshots in particular tend to be low resolution, at least in my experience. I could print the webpage to PDF, but a PostScript-style PDF would basically create an image and wrap it in a PDF, which is not really any better than the screenshot.

With a quick internet search I discovered two potential solutions: wkhtmltopdf and WebVector/CSSBox. I decided to try wkhtmltopdf first. It was simple to install on Linux, and with a bit of tinkering (see its manual) it produces great results. It should be noted, though, that wkhtmltopdf is based on a WebKit engine that has mostly not seen ongoing development. I also tried WebVector (specifically on the Pangloss website), and it did not produce the expected results.

Playing around with wkhtmltopdf, I was able to produce PDFs of several webpages I wanted to include. I used the following workflow after installing it:

I adjusted my screen size to match the type of display I wanted, measured the resulting viewport with whatismyviewport.com, and used that value as the viewport size. Then I experimented with the PDF output settings. I knew I wanted very small margins because I was going to embed this PDF as a figure in another PDF. After that I kept adjusting the page size to find the cut-off point between the content I wanted and the footer. In the end the commands I used looked like this:

wkhtmltopdf --no-print-media-type --viewport-size 1100 --page-height 6.25in --page-width 7.75in -T 1mm -B 1mm -R 1mm -L 1mm --load-media-error-handling skip --enable-javascript --javascript-delay 2000 https://www.sil.org/resources/archives/52216 sil-item.pdf

wkhtmltopdf --no-print-media-type --viewport-size 1100 --page-height 8in --page-width 7.75in -T 1mm -B 1mm -R 1mm -L 1mm --load-media-error-handling skip --enable-javascript --javascript-delay 2000 https://scholarspace.manoa.hawaii.edu/handle/10125/30788 K-item.pdf

wkhtmltopdf --no-print-media-type --viewport-size 1100 --page-height 8.75in --page-width 7in -T 1mm -B 1mm -R 1mm -L 1mm --load-media-error-handling skip --enable-javascript --javascript-delay 2000 http://www.language-archives.org/item/oai:sil.org:52216 olac-item1.pdf
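The three commands above differ only in page height, page width, URL, and output name, so they could be wrapped in a small shell function. This is only a sketch under a few assumptions: it targets sh or bash, the function name render_page and the DRY_RUN switch are made up for illustration (they are not part of wkhtmltopdf), and it assumes the order of the options on the command line does not matter. By default it only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Sketch: factor out the options shared by the three commands above.

# Dry-run switch (hypothetical, not a wkhtmltopdf feature):
# 1 prints each command, 0 executes it.
DRY_RUN="${DRY_RUN:-1}"

# Options copied unchanged from the commands in the post.
COMMON_OPTS="--no-print-media-type --viewport-size 1100 -T 1mm -B 1mm -R 1mm -L 1mm --load-media-error-handling skip --enable-javascript --javascript-delay 2000"

# render_page HEIGHT WIDTH URL OUTPUT
render_page() {
    # COMMON_OPTS is left unquoted on purpose so the shell splits it
    # into separate arguments.
    set -- wkhtmltopdf $COMMON_OPTS --page-height "$1" --page-width "$2" "$3" "$4"
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

render_page 6.25in 7.75in https://www.sil.org/resources/archives/52216 sil-item.pdf
render_page 8in 7.75in https://scholarspace.manoa.hawaii.edu/handle/10125/30788 K-item.pdf
render_page 8.75in 7in http://www.language-archives.org/item/oai:sil.org:52216 olac-item1.pdf
```

Keeping the page size as arguments makes it easier to try several heights for one page while hunting for the cut-off above the footer.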

This produced a nice PDF, but I needed to get rid of the second page, so I deleted it with PDF-Shuffler. I wanted to include the final PDF in XLingPaper, which uses XeLaTeX on the back end. However, one of XLingPaper’s current dependencies only accepts PDFs of version 1.4 or older, so I had to “back version” the PDF to 1.4. I used the following command (adjusting the file names) to do that:

gs -q -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -o out.pdf in.pdf

My suspicion, though, is that passing the PDF through Ghostscript also removes some of the mark-up features that wkhtmltopdf puts in the PDF. This may include some of the links and embedded fonts, but on this last point I am not 100 percent sure.
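The page removal and the version change could probably be done in a single Ghostscript call, since gs accepts -dFirstPage and -dLastPage to limit which pages it processes. The sketch below assumes that only page 1 is wanted; the function names to_pdf14_first_page and pdf_version are hypothetical helpers, not part of Ghostscript. The second function only reads the version declared in the first line of the file, which is a quick sanity check rather than a full validation.

```shell
#!/bin/sh
# Sketch: keep only page 1 and write the result as PDF 1.4.

# to_pdf14_first_page IN OUT  (hypothetical helper name)
to_pdf14_first_page() {
    gs -q -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
       -dFirstPage=1 -dLastPage=1 -o "$2" "$1"
}

# pdf_version FILE  (hypothetical helper name)
# Prints the version from the file header, e.g. "1.4" for "%PDF-1.4".
pdf_version() {
    # The header occupies the first 8 bytes; characters 6-8 hold the version.
    head -c 8 "$1" | cut -c 6-8
}
```

With Ghostscript installed, this might be used as to_pdf14_first_page sil-item.pdf sil-item-14.pdf followed by pdf_version sil-item-14.pdf. Whether links and fonts survive this route any better than the separate steps is not something this sketch addresses.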

In an earlier version of this workflow I used standard page sizes (A4, A3, etc.). That meant I then had to crop the PDF so that the usable portion could be scaled properly in XLingPaper. To do this I used pdfresizer.com to reduce the physical dimensions of the PDF pages. This step turned out to be unnecessary once I was able to specify the dimensions I needed directly in wkhtmltopdf.

Hugh Paterson III
Collaborative Scholar

I specialize in bespoke research at the intersection of Linguistics, Law, Languages, and Technology; specifically utility and life-cycle management for information products in these spaces.
