Just to front-load this disclaimer again: I write (/hack together) horrible code, and this is very drafty, largely for my own future reference.

I have a set of HTML documents (I converted some from PDF, but that doesn't matter for this purpose). I want to extract some information about those documents, including:

1. Unique, or highly document-specific, terms (tf-idf)
2. Metadata (such as authorship, URL, date, etc.)
3. Topical information (how the pages cluster, and any key terms from those clusters)
4. Possible synonyms related to the page
5. Links the page is associated with (and, ideally, those pages)

And hey, look at that, there are a load of Python packages (bindings/wrappers/whatever) for that:

1. tf-idf
2. Metadata extraction: http://www.gnu.org/software/libextractor/
3. There are a few ways of doing this; LSA, as below, may be best for my purpose
4. We can partly do this through DBpedia Spotlight for keyword disambiguation, then look for synonyms of those. We might also be interested in collecting related information so that claims can be associated with their original concepts (this might also be useful if we wanted to do things like clustering search queries together)
5. httrack for Python: https://www.google.co.uk/search?q=httrack+python&ie=utf-8&oe=utf-8&rls=org.mozilla:en-US:official&client=firefox-a&gws_rd=cr&ei=vniHUp_wJMSRhQemyYDYCw …

Unfortunately, I'd already started doing it in R (oh YAY), so I've got some starter code there, which I either need to translate to Python, use the output of in Python, or run using something like RPy (which lets you run R through Python). Anyway, this is my drafty stuff so far (largely for my own future reference).

# General links of use are given in the code; Fridolin's post is also very useful for packages: http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
# load required libraries
library(tm)
library(ggplot2)
library(lsa)
library(XML)
library(RCurl)
library(SnowballC)
library(textir)
library(plyr)

#FOR EACH file in directory (or URL in list)
#Thanks to http://stackoverflow.com/questions/15016462/create-a-corpus-from-many-html-files-in-r

# get data
setwd("C:/Users/.../docs") # this folder has your HTML files
html <- list.files(pattern = "\\.(htm|html)$") # get just .htm and .html files
length(html)
#create a list to store the documents in
document_list <- vector("list",length(html))
for(rawfile in 1:length(html)){
filepath <- html[rawfile]
#Extract metadata, save this somewhere (a sketch of one way to do this is below, after the loop)

# READ AND PARSE SINGLE HTML FILE
#Thanks to http://www.r-bloggers.com/reading-html-pages-in-r-for-text-processing/
doc.html = htmlTreeParse(filepath, useInternalNodes = TRUE)
# Extract all the paragraphs (HTML tag is p, starting at
# the root of the document). Unlist flattens the list to
# create a character vector.
doc.text = unlist(xpathApply(doc.html, '//p', xmlValue))
# Replace all \n (newlines) by spaces
doc.text = gsub('\n', ' ', doc.text)
# Join all the elements of the character vector into a single
# character string, separated by spaces
doc.text = paste(doc.text, collapse = ' ')
#Removing non-ASCII characters thanks to http://stackoverflow.com/questions/9934856/removing-non-ascii-characters-from-data-files
Encoding(doc.text) <- "latin1"  # (just to make sure)
doc.text <- iconv(doc.text, "latin1", "ASCII", sub="")  # drop anything that won't convert
#Write the parsed text to the document list
document_list[[rawfile]] <- doc.text
}
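
# A rough sketch of the "extract metadata" placeholder above, using the XML
# package that is already loaded. The specific fields pulled out (the <title>
# element and the name/content pairs from <meta> tags) are assumptions about
# what these particular HTML files contain, not a general solution.
extract_metadata <- function(parsed_html) {
  title <- xpathSApply(parsed_html, "//title", xmlValue)
  meta_names <- xpathSApply(parsed_html, "//meta[@name]", xmlGetAttr, "name")
  meta_content <- xpathSApply(parsed_html, "//meta[@name]", xmlGetAttr, "content", NA)
  list(title = if (length(title) > 0) title[1] else NA,
       meta = setNames(as.list(meta_content), meta_names))
}
# e.g. inside the loop above, after creating metadata_list <- vector("list", length(html)):
# metadata_list[[rawfile]] <- extract_metadata(doc.html)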

#Make a Corpus out of that list
corpus <- Corpus(VectorSource(unlist(document_list)))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, function(x) removeWords(x, stopwords("english")))
corpus <- tm_map(corpus, stemDocument, language = "english")
corpus  # check corpus

#tf;idf
dtm <- DocumentTermMatrix(corpus, control = list(weighting = function(x) weightTfIdf(x, normalize = TRUE)))
#Select the most discriminatory words (NOTE: another way of using this list is as a lookup for all words within some other document, e.g. we see where users source from by the words they use. Even so, limiting to the top results is probably wise for efficiency)
#thanks to http://stackoverflow.com/questions/15506118/make-dataframe-of-top-n-frequent-terms-for-multiple-corpora-using-tm-package-in
#that example selects the most common words in each document, while I want to select the most discriminatory words from each
#convert DTM to matrix
#then convert to top words matrix
m <- as.matrix(t(dtm))  # transpose so that terms are rows and documents are columns
mdf <- as.data.frame(m)
#extract words unique to each document into a corpus of length(html) where each document receives a list of terms (plyr reshaping?)
#I can't work out how to spot these in my df...

#extract the top weighted words for each document into lists of length(html) (a sketch of one way to do both of these steps is below)
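# A sketch of one possible way to do the two steps above, working off the
# tf-idf matrix 'm' built earlier (terms as rows, documents as columns).
# top_n is an arbitrary cutoff, not a principled choice.
top_n <- 10
top_terms <- lapply(seq_len(ncol(m)), function(doc) {
  weights <- m[, doc]
  head(names(sort(weights, decreasing = TRUE)), top_n)
})
names(top_terms) <- colnames(m)
# "Unique" terms can then be approximated as top terms that show up in only
# one document's list
term_counts <- table(unlist(top_terms))
unique_terms <- lapply(top_terms, function(terms) terms[term_counts[terms] == 1])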

#basic lsa (just to test something out) thanks to Bodong http://bodongchen.com/blog/?p=301
# 2. MDS with raw term-document matrix: compute distance matrix
td.mat <- as.matrix(TermDocumentMatrix(corpus))
dist.mat <- dist(t(as.matrix(td.mat)))
dist.mat  # check distance matrix
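# Item 3 in the list at the top asks how the pages cluster; one quick check on
# the raw distance matrix is plain hierarchical clustering with base R's hclust
# (the "complete" linkage used here is just the default, not a considered choice)
doc_clusters <- hclust(dist.mat)
plot(doc_clusters)  # dendrogram of the documents
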
#LSA
# 3. MDS with LSA
td.mat.lsa <- lw_bintf(td.mat) * gw_idf(td.mat)  # weighting
lsaSpace <- lsa(td.mat.lsa)  # create LSA space
dist.mat.lsa <- dist(t(as.textmatrix(lsaSpace)))  # compute distance matrix
dist.mat.lsa  # check distance matrix
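# The comments above mention MDS, but only the distance matrices are computed so
# far. A minimal sketch of the MDS step itself, using base R's cmdscale and the
# ggplot2 already loaded; two dimensions is just an assumption for plotting.
fit <- cmdscale(dist.mat.lsa, eig = TRUE, k = 2)
mds_points <- data.frame(x = fit$points[, 1], y = fit$points[, 2],
                         doc = rownames(fit$points))
ggplot(mds_points, aes(x = x, y = y, label = doc)) +
  geom_point() +
  geom_text(vjust = -1)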

#keyword extraction, etc.
#DBpedia Spotlight; would Weka or Wikipedia Miner also be useful?
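# A very rough sketch of calling the DBpedia Spotlight web service from R using
# the RCurl and XML packages already loaded. The endpoint URL, the confidence
# parameter, and the shape of the XML response are all assumptions; check the
# current Spotlight documentation before relying on any of this.
spotlight_annotate <- function(text, confidence = 0.4) {
  response <- getForm("http://spotlight.dbpedia.org/rest/annotate",
                      text = text, confidence = as.character(confidence),
                      .opts = list(httpheader = c(Accept = "text/xml")))
  doc <- xmlParse(response, asText = TRUE)
  # the <Resource> elements should carry the disambiguated DBpedia URIs
  xpathSApply(doc, "//Resource", xmlGetAttr, "URI")
}
# e.g. spotlight_annotate(document_list[[1]])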

Check out http://texlexan.sourceforge.net/ too…