PDFs from Webpages
In the course of writing a paper I needed to take screenshots of webpages and include them as figures in the paper. I could take a screenshot, save it as a .png or .jpg, and call it good. However, these sorts of images do not provide a great deal of resolution and do not look clean in a published paper, especially screenshots, whose resolution is not very high, at least in my experience. I could print the web page to PDF, but a PostScript-style PDF would basically create an image and wrap it in a PDF. This is not really any better than the screenshot solution.
With a quick internet search I uncovered two potential solutions: wkhtmltopdf and WebVector/CSSBox. I decided to try out wkhtmltopdf. It was simple and easy to install on Linux, and it produces great results with a bit of tinkering (see its manual). It should be noted, though, that wkhtmltopdf is based on WebKit, which has mostly not seen active development. I also tried WebVector (specifically on the website Pangloss) and it did not produce the expected results.
Playing around with wkhtmltopdf, I was able to produce PDFs of several webpages I was looking to include. I used the following workflow after installing it:
I adjusted my screen size to fit the type of display I wanted, measured it via whatismyviewport.com, and used that measurement as my viewport size. Then I played around with the output settings for the PDF. I knew that I wanted really small margins because I was going to embed this PDF as a figure in another PDF. After that I kept adjusting the page size to find the cut-off point between the content I wanted and the footer. In the end the code I used looked like this:
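A minimal sketch of a command along these lines follows. It is not the exact command from this workflow: the URL, the output file name, the 1280x800 viewport, and the page dimensions are all placeholder assumptions, and the helper names `run` and `capture_page` are hypothetical. By default the script only prints the command so it can be checked first.

```shell
#!/bin/sh
# All values below are placeholders, not the ones used for the paper.
URL="https://example.com/"   # page to capture
OUT="figure.pdf"             # output file
VIEWPORT="1280x800"          # viewport size as measured
PAGE_W="340mm"               # custom page width
PAGE_H="400mm"               # custom page height; tune until the footer is cut off

# With DRY_RUN=1 (the default) the command is only printed; set DRY_RUN=0 to run it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Zero margins and a custom page size, so the PDF can be embedded as a figure.
capture_page() {
  run wkhtmltopdf \
    --viewport-size "$VIEWPORT" \
    --page-width "$PAGE_W" --page-height "$PAGE_H" \
    --margin-top 0 --margin-bottom 0 --margin-left 0 --margin-right 0 \
    "$URL" "$OUT"
}

capture_page
```

Content that does not fit within the chosen page height spills onto a following page, which is why the height is the value to tune when looking for the cut-off above the footer.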
This produced a nice PDF, but I needed to get rid of the second page. I did this with PDF-Shuffler by simply deleting the second page. I wanted to include the final PDF in XLingPaper, which uses XeLaTeX on the back end. However, one of its current dependencies only accepts PDFs of version 1.4 or older, so I had to “back version” the PDF to 1.4. I used the following code (adjusting for file names) to do that.
gs -q -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -o out.pdf in.pdf
My suspicion, though, is that passing the PDF through Ghostscript also removes some of the mark-up features that wkhtmltopdf puts into the PDF. This may include some of the links and embedded fonts, but on this last point I am not 100 percent sure.
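As a possible shortcut, Ghostscript can also select a page range, so dropping the second page and down-versioning might be combined into one pass. The sketch below assumes the wanted content is on page 1. The function names `first_page_as_pdf14` and `pdf_version` are hypothetical helpers, and the version check simply reads the file header, which normally states the PDF version.

```shell
#!/bin/sh
# Keep only page 1 and write PDF 1.4 in a single Ghostscript pass.
# $1 is the input PDF, $2 the output PDF.
first_page_as_pdf14() {
  gs -q -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
     -dFirstPage=1 -dLastPage=1 -o "$2" "$1"
}

# Print the version from a PDF header such as "%PDF-1.4".
pdf_version() {
  head -c 8 "$1" | sed 's/^%PDF-//'
}

# Usage, once in.pdf exists (placeholder file names):
#   first_page_as_pdf14 in.pdf out.pdf
#   pdf_version out.pdf
```

Checking the header this way is a quick test of whether the output should be acceptable to a tool that only takes PDF 1.4 or older. It says nothing about whether links or fonts survived the conversion.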
In an earlier version of this workflow I was using standard page sizes (A4, A3, etc.). Doing so required that I then crop the PDF so that the usable portion could be scaled usefully in XLingPaper. To do this I used pdfresizer.com to reduce the physical dimensions of the PDF pages. However, this step did not need to stay in the workflow because I was able to specify the dimensions I needed directly in