I recently came across a CC-licensed book whose web version is split across multiple pages (presumably to encourage purchase of the book – which is totally fair enough…but sorry, I want it in a single file). The book (free online, or pay for an ebook/hard copy) is the [handbook of data journalism]1, which looks excellent and which we should all support. So I was curious about how to grab all the URLs from the index page and gather the pages into a single printable document via R (this is within the license for the book material). There are plenty of other cases where being able to do something similar would be useful, so I thought I’d share my (pretty crude) code…there are definitely more elegant ways to do this.

I like to do everything from R, but in this case that means calling a couple of command line tools (which is fine). The easiest way, I think, is to use wget (which you have to install separately – and be careful when testing…I tried to download the internet in one test) to mirror the pages locally, and then another command line tool, wkhtmltopdf, to stitch them into a PDF. As usual, this “I don’t want to spend time opening each page” time-saver took longer than I expected, but I’ve already found another identical use case, so silver linings and all.
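
One aside on wget: be explicit about the scope of the recursion when you test (not doing so is how I ended up trying to download the internet). Something along the lines of the call below – a rough sketch rather than part of the workflow that follows – keeps a trial run contained, with -np stopping wget from climbing above the start directory and -l 1 capping the recursion depth:

system('wget -r -l 1 -np -p -k "http://datajournalismhandbook.org/1.0/en/index.html"')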

library(RCurl)
library(XML)
library(httr)

#check wget path set
Sys.which("wget")
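#(optional) fail fast if wget isn't found - Sys.which() returns "" when the tool isn't on the PATH
if(!nzchar(Sys.which("wget"))) stop("wget not found - install it and/or add it to your PATH")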

#Get the set of URLs we're interested in
url <- "http://datajournalismhandbook.org/1.0/en/index.html" #set the index page
doc <- htmlParse(url) #get the html from that page
links <- as.vector(xpathSApply(doc, "//a/@href")) #grab the urls from the html and save just them in a vector

#find the URLs in the list that are relative to the book's directory (i.e. no http://) and prepend the base URL
for(i in 1:length(links)){
  myurl <- links[[i]]
  links[[i]] <- ifelse(grepl("http",myurl),
                       myurl,paste0("http://datajournalismhandbook.org/1.0/en/",myurl))
                       }
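
#(an equivalent one-liner - grepl() and ifelse() are vectorised, so the loop above isn't strictly needed)
#links <- ifelse(grepl("http",links),links,paste0("http://datajournalismhandbook.org/1.0/en/",links))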

#store index
system("wget "http://datajournalismhandbook.org/1.0/en/index.html" -p -k -m")

#then append all subsequent pages
for(i in 1:length(links)){
  system(paste0("wget "",links[[i]],""-p -k -m"))
  #download.file(links[[i]],"test.html",mode = "append", extra = "-p -k -m ./") #no recursion (-r), but get pre-requisites, and (-k) convert for local viewing  
}
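
#(optional, and not what I ran: a politer version of the same loop, pausing between requests)
#for(i in 1:length(links)){
#  system(paste0('wget "',links[[i]],'" -p -k -m'))
#  Sys.sleep(1) #one second pause so we don't hammer the server
#}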

#get the list of local pages from the index
doc <- htmlParse("C:/Users/Documents/datajournalismhandbook.org/1.0/en/index.html") #get the html from that page
links_local <- as.vector(xpathSApply(doc, "//a/@href")) #grab the urls from the html and save just them in a vector

#find the URLs in the list that are relative (i.e. no http://) and prepend the local directory path
for(i in 1:length(links_local)){
  myurl <- links_local[[i]]
  links_local[[i]] <- ifelse(grepl("http",myurl),
                       myurl,paste0("C:/Users/Documents/datajournalismhandbook.org/1.0/en/",myurl))
}

#is this stored in order?
links_local #check order, looks good to me!

#do a little tidying
links_local <- links_local[3:length(links_local)]
links_local[1] <- "C:/Users/Documents/datajournalismhandbook.org/1.0/en/index.html"
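#(optional sanity check - every entry should now point at a file that exists on disk)
links_local[!file.exists(links_local)] #should come back empty, i.e. character(0)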

#if you haven't installed wkhtmltopdf yet, do that first: http://wkhtmltopdf.org/usage/wkhtmltopdf.txt
#then pass the full ordered list of pages to wkhtmltopdf in a single call so they all end up in one pdf
system(paste0("C:/Program" "Files/wkhtmltopdf/bin/wkhtmltopdf.exe --javascript-delay 100000 --footer-center [page]",paste0(links_local,collapse=" ")," datajournalismhandbook.pdf"))

system(paste0("C:/Program" "Files/wkhtmltopdf/bin/wkhtmltopdf.exe --disable-javascript --footer-center [page]",paste0(links_local,collapse=" ")," datajournalismhandbook.pdf"))

#of course the alternative is to just do that on the live pages
system(paste0("C:/Program" "Files/wkhtmltopdf/bin/wkhtmltopdf.exe --footer-center [page]",paste0(links,collapse=" ")," datajournalismhandbook.pdf"))

I am still getting an error…but it seems to have saved a formatted PDF with bookmarks. It's not the prettiest export in the world, but it will certainly do.

Footnotes

  1. http://datajournalismhandbook.org/1.0/en/index.html