PDFs from Webpages

In the course of writing a paper I needed to take screenshots of webpages and include them as figures. I could take a screenshot, save it as a .png or .jpg, and call it good. However, images like these do not offer much resolution and do not look clean in a published paper; screenshots in particular tend to be low resolution, at least in my experience. I could print the web page to PDF, but a PostScript-style PDF would basically create an image and wrap it in a PDF, which is not really any better than the screenshot solution.

With a quick internet search I uncovered two potential solutions: wkhtmltopdf and WebVector/CSSBox. I decided to try out wkhtmltopdf. It was simple to install on Linux, and with a bit of tinkering (see the manual) it produces great results. It should be noted, though, that wkhtmltopdf is based on WebKit, and the version it relies on has seen little active development. I also tried WebVector (specifically on the Pangloss website), and it did not produce the results I expected.

Playing around with wkhtmltopdf, I was able to produce PDFs of several webpages I was looking to include. I used the following workflow after installing it:

I adjusted my screen size to fit the type of display I wanted, measured it via whatismyviewport.com, and used that measurement as my viewport size. Then I played around with the output settings for the PDF. I knew that I wanted very small margins because I was going to embed this PDF as a figure in another PDF. I then kept adjusting the page size to find the cut-off point between the content I wanted and the footer. In the end the commands I used looked like this:

wkhtmltopdf --no-print-media-type --viewport-size 1100 --page-height 6.25in --page-width 7.75in -T 1mm -B 1mm -R 1mm -L 1mm --load-media-error-handling skip --enable-javascript --javascript-delay 2000 https://www.sil.org/resources/archives/52216 sil-item.pdf

wkhtmltopdf --no-print-media-type --viewport-size 1100 --page-height 8in --page-width 7.75in -T 1mm -B 1mm -R 1mm -L 1mm --load-media-error-handling skip --enable-javascript --javascript-delay 2000 https://scholarspace.manoa.hawaii.edu/handle/10125/30788 K-item.pdf

wkhtmltopdf --no-print-media-type --viewport-size 1100 --page-height 8.75in --page-width 7in -T 1mm -B 1mm -R 1mm -L 1mm --load-media-error-handling skip --enable-javascript --javascript-delay 2000 http://www.language-archives.org/item/oai:sil.org:52216 olac-item1.pdf
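The three commands above differ only in page height, page width, URL, and output file name, so the shared options could be wrapped in a small shell function. This is a minimal sketch, not what I actually ran: the function name capture_page and the DRY_RUN variable are hypothetical, and every flag is copied unchanged from the commands above.

```shell
# Sketch only: capture_page and DRY_RUN are hypothetical names.
# All wkhtmltopdf options are copied from the three commands above.
capture_page() {
    # $1 = page height, $2 = page width, $3 = URL, $4 = output file
    set -- wkhtmltopdf --no-print-media-type --viewport-size 1100 \
        --page-height "$1" --page-width "$2" \
        -T 1mm -B 1mm -R 1mm -L 1mm \
        --load-media-error-handling skip \
        --enable-javascript --javascript-delay 2000 \
        "$3" "$4"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$@"    # print the command instead of running it
    else
        "$@"         # run wkhtmltopdf with the assembled options
    fi
}

# Example calls (commented out so that loading this file runs nothing):
# capture_page 6.25in 7.75in https://www.sil.org/resources/archives/52216 sil-item.pdf
# capture_page 8in 7.75in https://scholarspace.manoa.hawaii.edu/handle/10125/30788 K-item.pdf
```

Setting DRY_RUN=1 before a call prints the assembled command, which makes it easy to compare against the hand-written versions while hunting for the right page height.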

This produced a nice PDF, but I needed to get rid of the second page. I did this with PDF-Shuffler by simply deleting the second page. I wanted to include the final PDF in XLingPaper, which uses XeLaTeX on the back end. However, one of its current dependencies only accepts PDFs of version 1.4 or older, so I had to “back version” the PDF to 1.4. I used the following command (adjusting the file names) to do that.

gs -q -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -o out.pdf in.pdf

My suspicion, though, is that passing the PDF through Ghostscript also removes some of the mark-up features that wkhtmltopdf puts in the PDF. This may include some of the links and embedded fonts, but on this last point I am not 100 percent sure.
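As a quick check on the Ghostscript step, the version a PDF declares can be read from its first line, which has the form %PDF-1.4. The sketch below does that, and also shows a possible alternative to the PDF-Shuffler step: Ghostscript's -dFirstPage and -dLastPage options should keep only the first page in the same pass. I did not use this variant myself, and the function names and the DRY_RUN variable are hypothetical.

```shell
# Sketch only: function names and DRY_RUN are hypothetical.

# Print the version from a PDF's header line, e.g. "1.4" for "%PDF-1.4".
pdf_version() {
    head -c 8 "$1" | sed -n 's/^%PDF-\([0-9]\.[0-9]\).*/\1/p'
}

# Rewrite $1 to $2 as PDF 1.4, keeping only page 1.
first_page_as_pdf14() {
    set -- gs -q -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
        -dFirstPage=1 -dLastPage=1 -o "$2" "$1"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$@"    # print the command instead of running it
    else
        "$@"         # run Ghostscript
    fi
}

# Example (commented out so that loading this file runs nothing):
# first_page_as_pdf14 sil-item.pdf sil-item-v14.pdf && pdf_version sil-item-v14.pdf
```

The header is only a first check, since a PDF can also declare its version elsewhere in the file, but it is a fast way to confirm that the conversion did something.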

In an earlier version of this workflow I was using standard page sizes (A4, A3, etc.). Doing so meant I then had to crop the PDF so that the usable portion could be scaled sensibly in XLingPaper. To do this I used pdfresizer.com to reduce the physical dimensions of the PDF pages. However, this step turned out to be unnecessary, because I was able to specify the dimensions I needed directly in wkhtmltopdf.

Hugh Paterson III
Collaborative Scholar

My research interests include Typological Patterns in articulatory phonetics, User Experience Design in language tools and Graph Theory applied to language and linguistic resource discovery.
