
Multiple Document Processing in R or Python

Just to put this disclaimer up front again: I write (/hack together) horrible code… and this is very drafty, largely for my own future reference.

I have a set of HTML documents (I converted some from PDF, but that doesn't matter for this purpose). I want to extract some information about those documents, including:

  1. Unique, or highly document-specific terms (tf-idf)
  2. Metadata (such as authorship, url, date, etc.)
  3. Topical information (how are pages clustered, any keyterms from those clusters)
  4. Possible synonyms related to the page
  5. Links the page is associated with (and ideally, those pages)

And hey, look at that, there are a load of Python packages (bindings/wrappers/whatever) for that:

  1. tf-idf
  2. Metadata extractor http://www.gnu.org/software/libextractor/
  3. There are a few ways of doing this, LSA as below may be best for my purpose
  4. We can partly do this through DBpedia Spotlight for keyword disambiguation, then look for synonyms of those (see the rough Spotlight sketch just after this list). We might also be interested in collecting related information so that claims can be associated with their original concepts (this might also be useful if we wanted to do things like clustering search queries together).
  5. httrack python https://www.google.co.uk/search?q=httrack+python&ie=utf-8&oe=utf-8&rls=org.mozilla:en-US:official&client=firefox-a&gws_rd=cr&ei=vniHUp_wJMSRhQemyYDYCw
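
Since the Spotlight step is the least obvious of those, here is a rough, untested sketch of what calling the DBpedia Spotlight REST annotate endpoint from R might look like (using RCurl and RJSONIO; the endpoint URL, the confidence value and the spotlight_annotate name are just my assumptions, so check them against the Spotlight docs):

# hypothetical helper: send some text to the Spotlight annotate endpoint and parse the JSON reply
library(RCurl)
library(RJSONIO)
spotlight_annotate <- function(text, confidence = 0.4){
  raw <- getForm("http://spotlight.dbpedia.org/rest/annotate",
                 text = text, confidence = confidence,
                 .opts = list(httpheader = c(Accept = "application/json")))
  fromJSON(raw)  # the "Resources" element should hold the annotated entities
}
# e.g. spotlight_annotate(doc.text) on one of the parsed documents below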

Unfortunately, I'd already started doing it in R (oh YAY), so I've got some starter code there, which I either need to translate to Python, use the output of in Python, or run using something like RPy (which lets you run R through Python). Anyway, this is my drafty stuff so far (largely for my future reference).


#general links of use given in code, also Fridolin's post here v useful for packages http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
# load required libraries
library(tm)
library(ggplot2)
library(lsa)
library(XML)
library(RCurl)
library(SnowballC)
library(textir)
library(plyr)

#FOR EACH file in directory (or URL in list)
#Thanks to http://stackoverflow.com/questions/15016462/create-a-corpus-from-many-html-files-in-r

# get data
setwd("C:/Users/.../docs") # this folder has your HTML files
html <- list.files(pattern="\\.(htm|html)$") # get just .htm and .html files
length(html)
#create a list to store the documents in
document_list <- vector("list",length(html))
for(rawfile in 1:length(html)){
filepath <- html[rawfile]
#Extract metadata, save this somewhere

# READ AND PARSE SINGLE HTML FILE
#Thanks to http://www.r-bloggers.com/reading-html-pages-in-r-for-text-processing/
doc.html = htmlTreeParse(filepath, useInternalNodes = TRUE)
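# A minimal sketch of the "extract metadata" step above, assuming what we want
# sits in the <title>, <meta name=...> tags and <a href=...> links of the parsed
# page (doc.title / doc.meta / doc.links are just illustrative names)
doc.title <- xpathSApply(doc.html, '//title', xmlValue)
doc.meta <- xpathSApply(doc.html, '//meta[@name]', xmlGetAttr, 'content')
doc.links <- xpathSApply(doc.html, '//a[@href]', xmlGetAttr, 'href')
# these could be stored in their own lists alongside document_list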
# Extract all the paragraphs (HTML tag is p, starting at
# the root of the document). Unlist flattens the list to
# create a character vector.
doc.text = unlist(xpathApply(doc.html, '//p', xmlValue))
# Replace all \n by spaces
doc.text = gsub('\\n', ' ', doc.text)
# Join all the elements of the character vector into a single
# character string, separated by spaces
doc.text = paste(doc.text, collapse = ' ')
#Removing non-ASCII characters thanks to http://stackoverflow.com/questions/9934856/removing-non-ascii-characters-from-data-files
Encoding(doc.text) <- "latin1"  # (just to make sure)
doc.text <- iconv(doc.text, "latin1", "ASCII", sub="")  # assign the result, or the conversion is lost
#Write the parsed text to the document list
document_list[[rawfile]] <- doc.text
}

#Make a Corpus out of that list
corpus <- Corpus(VectorSource(document_list))
corpus <- tm_map(corpus, tolower)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, function(x) removeWords(x, stopwords("english")))
corpus <- tm_map(corpus, stemDocument, language = "english")
corpus  # check corpus

#tf;idf
dtm <- DocumentTermMatrix(corpus, control = list(weighting = function(x) weightTfIdf(x, normalize = TRUE)))
#Select most discriminatory words (NOTE: Another way of using this list is as a lookup for all words within a given other document, e.g. we see where users source from by the words they use. Even so, limiting to top results is probably wise for efficiency)
#thanks to http://stackoverflow.com/questions/15506118/make-dataframe-of-top-n-frequent-terms-for-multiple-corpora-using-tm-package-in
#that example selects the most common words in each document, while I want to select the most discriminatory words from each
#convert DTM to matrix
#then convert to top words matrix
m = as.matrix(t(dtm))
mdf = as.data.frame(m)
#extract words unique to each document into a corpus of length(html) where each document receives a list of terms (plyr reshaping?)
#I can't work out how to spot these in my df...

#extract the top weighted words to each document into lists of length (html)
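# A rough sketch of the "top weighted words" step above, assuming we just want
# the n highest tf-idf terms per document (top_n and document_top_terms are
# names I've made up)
top_n <- 10
document_top_terms <- vector("list", length(html))
for(d in 1:length(html)){
  weights <- sort(m[, d], decreasing = TRUE)  # m has terms as rows, documents as columns
  document_top_terms[[d]] <- head(names(weights)[weights > 0], top_n)
}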

#basic lsa (just to test something out) thanks to Bodong http://bodongchen.com/blog/?p=301
#distance matrix
# 2. MDS with raw term-document matrix: compute distance matrix
td.mat <- as.matrix(TermDocumentMatrix(corpus))
dist.mat <- dist(t(as.matrix(td.mat)))
dist.mat  # check distance matrix
#LSA
# 3. MDS with LSA
td.mat.lsa <- lw_bintf(td.mat) * gw_idf(td.mat)  # weighting
lsaSpace <- lsa(td.mat.lsa)  # create LSA space
dist.mat.lsa <- dist(t(as.textmatrix(lsaSpace)))  # compute distance matrix
dist.mat.lsa  # check distance matrix
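# Bodong's post goes on to project these distances into 2D with classical MDS and
# plot them; a minimal, untested sketch of that step, labelling points by file name
fit <- cmdscale(dist.mat.lsa, eig = TRUE, k = 2)
mds.points <- data.frame(x = fit$points[, 1], y = fit$points[, 2], doc = html)
ggplot(mds.points, aes(x = x, y = y, label = doc)) + geom_point() + geom_text(size = 3)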

#keyword extraction, etc.
#dbpedia spotlight. weka Wikipedia miner useful?

Check out http://texlexan.sourceforge.net/ too...



This Post Has 2 Comments

  1. Simon Knight says:
    #general links of use given in code, also Fridolin's post here v useful for packages http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
    # load required libraries
    library(tm)
    library(ggplot2)
    library(lsa)
    library(XML)
    library(RCurl)
    library(SnowballC)
    library(textir)
    library(plyr)
    library(qdap)
    
    #FOR EACH file in directory (or URL in list)
    #Thanks to http://stackoverflow.com/questions/15016462/create-a-corpus-from-many-html-files-in-r
    
    # get data
    setwd("C:/Users/...documents/") # this folder has your HTML files 
    html <- list.files(pattern="\\.(htm|html)$") # get just .htm and .html files
    length(html)
    #create a list to store the documents in
    document_list <- vector("list",length(html))
    for(rawfile in 1:length(html)){
      filepath <- html[rawfile]
      
      #Extract metadata, save this somewhere
    
      # READ AND PARSE SINGLE HTML FILE
      #Thanks to http://www.r-bloggers.com/reading-html-pages-in-r-for-text-processing/
      doc.html = htmlTreeParse(filepath, useInternalNodes = TRUE)
      # Extract all the paragraphs (HTML tag is p, starting at
      # the root of the document). Unlist flattens the list to
      # create a character vector.
      doc.text = unlist(xpathApply(doc.html, '//p', xmlValue))
      # Replace all \n by spaces
      doc.text = gsub('\\n', ' ', doc.text)
      # Join all the elements of the character vector into a single
      # character string, separated by spaces
      doc.text = paste(doc.text, collapse = ' ')
      #Removing non-ASCII characters thanks to http://stackoverflow.com/questions/9934856/removing-non-ascii-characters-from-data-files
      Encoding(doc.text) <- "latin1"  # (just to make sure)
      doc.text <- iconv(doc.text, "latin1", "ASCII", sub="")  # assign the result, or the conversion is lost
      #remove whitespace
      doc.text <- stripWhitespace(doc.text)
      document_list[[rawfile]] <- doc.text
    }
    rm(doc.text, doc.html,filepath,rawfile)
    
    #Make a Corpus out of that list
    corpus <- Corpus(VectorSource(document_list))
    #Do stuff to the corpus. NOTE: You might want to consider regex instead of removing punctuation, e.g. for certain names/entities/hyphenated things etc
    corpus <- tm_map(corpus, tolower)
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, function(x) removeWords(x, stopwords("english")))
    corpus <- tm_map(corpus, stemDocument, language = "english")
    corpus <-tm_map(corpus,stripWhitespace)
    corpus  # check corpus
    
    #TO FIND UNIQUE TERMS IN DOCS; thanks to @amyaishab
    #Split corpus to term vector
    corpussplit <- tm_map(corpus, strsplit, "\\W")
    corpussplit <- tm_map(corpussplit, unlist)
    #create a place to store your unique terms
    document_unique_terms <- vector("list",length(html))
    #For each doc in corpus, select doc, compare to all other docs:
    for(split_doc in 1:length(html)){
      target_doc <- unlist(corpussplit[split_doc])
      #we want to create a vector of all other terms in the rest of the corpus
      #1st, remove corpus_docs from corpussplit (you can also set it to NULL (<-NULL) ...which would be a bad idea)
      corpus_docs <- corpussplit[-split_doc]
      #then combine the rest 
      corpus_docs <- unlist(corpus_docs)
      #now find uniques etc.
      split_doc_uniq <- target_doc[!target_doc %in% corpus_docs] # the words in target_doc that are not (!) in the rest of the corpus
      split_doc_uniq_freq <- table(split_doc_uniq) # frequencies of those unique words
      split_doc_uniq_freq <- sort(split_doc_uniq_freq, decreasing=T) # most frequent at the top, but this only gives you the numbers
      split_doc_uniq_freq <- paste(names(split_doc_uniq_freq), split_doc_uniq_freq, sep="\t") # this makes it so that both the words and their frequencies appear
      document_unique_terms[[split_doc]] <- split_doc_uniq_freq 
    }
    rm(split_doc, split_doc_uniq, split_doc_uniq_freq, target_doc, corpus_docs)
    
    #Checkout the qdap package
    
    #tf;idf
    dtm <- DocumentTermMatrix(corpus, control = list(weighting = function(x) weightTfIdf(x, normalize = TRUE)))
    #Select most discriminatory words (NOTE: Another way of using this list is as a lookup for all words within a given other document, e.g. we see where users source from by the words they use. Even so, limiting to top results is probably wise for efficiency)
    #thanks to http://stackoverflow.com/questions/15506118/make-dataframe-of-top-n-frequent-terms-for-multiple-corpora-using-tm-package-in
    #that example selects the most common words in each document, while I want to select the most discriminatory words from each
    #convert DTM to matrix
    #then convert to top words matrix
    m = as.matrix(t(dtm))
    mdf = as.data.frame(m)
    
    #lsa thanks to Bodong http://bodongchen.com/blog/?p=301
    #distance matrix
    # 2. MDS with raw term-document matrix: compute distance matrix
    td.mat <- as.matrix(TermDocumentMatrix(corpus))
    dist.mat <- dist(t(as.matrix(td.mat)))
    dist.mat  # check distance matrix
    #LSA
    # 3. MDS with LSA
    td.mat.lsa <- lw_bintf(td.mat) * gw_idf(td.mat)  # weighting
    lsaSpace <- lsa(td.mat.lsa)  # create LSA space
    dist.mat.lsa <- dist(t(as.textmatrix(lsaSpace)))  # compute distance matrix
    dist.mat.lsa  # check distance matrix
    
