Thesaurus (and dictionaries)

So I don’t think anyone has blogged about this topic (although the GitHub [README of the package]1 I’ll talk about is pretty good): using a thesaurus to look for matching strings in a text. This is just a sketch of some stuff I did a while ago; it’s unfinished, but I might come back to it at some point.

I’m currently looking at the sorts of things some participants said while engaged in some tasks. So, for example, in a data frame called ‘chat’ I have a column called ‘content’ which contains a load of messages, and I’m interested in whether certain key terms appear in it. I can write code like the below to (1) subset to the rows that contain a key term (grep), or (2) return a TRUE/FALSE vector indicating whether a term appears in that column for each row (grepl):

chat[grep("awful|terrible", chat$content, perl = TRUE), ]
chat$emotion <- grepl("awful|terrible", chat$content, perl = TRUE)
# column 16 is chat$content; grepl() is vectorised, so apply() over chat[16] is not needed
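To make the difference between the two concrete, here is a minimal sketch on a toy data frame. The data frame contents and the `hits` name are invented for illustration; only the ‘chat’ and ‘content’ names come from my real data.

```r
# Toy stand-in for the real 'chat' data frame (contents invented for illustration)
chat <- data.frame(
  user = c("a", "b", "c"),
  content = c("that was awful", "fine by me", "a terrible idea"),
  stringsAsFactors = FALSE
)

# (1) grep() returns the indices of matching rows, so it can subset directly
hits <- chat[grep("awful|terrible", chat$content, perl = TRUE), ]

# (2) grepl() returns one TRUE/FALSE per row, so it can be stored as a new column
chat$emotion <- grepl("awful|terrible", chat$content, perl = TRUE)
```

Here `hits` keeps only the first and third rows, while `chat$emotion` flags those same rows without dropping anything.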

Of course, you could extend the terms in lots of ways, including manually or using a corpus (and more regular expressions). There are also various ways to get new words programmatically; again a corpus is a good option, e.g. by using a term document matrix to look for key terms, or by looking for collocates. I was interested in whether there was an easy function to connect to a thesaurus and populate a list from that. Short answer: sort of. The rtematres package allows you to connect to a TemaTres server, a structured vocabulary database. I’ll run through how you can do that below, but unfortunately, as far as I can see, none of the thesauri are general ‘everyday language’ vocabularies (let’s ignore the additional complexities of electronic communication vocabularies!). The list of thesauri to connect to is here (yes, there is a Harry Potter thesaurus; no, no idea). For the sake of example I’m going to connect to the ‘Subjects of New Zealand Thesaurus’.
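Before getting to that, a quick aside on the corpus route. This is only a rough sketch in base R, assuming a plain word-frequency count is an acceptable stand-in for a proper term document matrix; the `messages` vector and the variable names are made up for illustration.

```r
# Invented example messages standing in for chat$content
messages <- c("that was awful", "awful, just awful", "a terrible idea")

# Lower-case, split on anything that is not a letter or apostrophe, drop empty strings
tokens <- unlist(strsplit(tolower(messages), "[^a-z']+"))
tokens <- tokens[nzchar(tokens)]

# Count each term and put the most frequent first, to eyeball candidate key terms
term_counts <- sort(table(tokens), decreasing = TRUE)
head(term_counts)
```

Frequent terms that look relevant can then be added to the search pattern by hand. Back to the thesaurus route, using rtematres: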

install.packages("rtematres")
library(rtematres)
rtematres.options()

gives you

> rtematres.options()
$tematres_url
[1] "http://tematres.befdata.biow.uni-leipzig.de/vocab/index.php"
$tematres_service_url
[1] "http://tematres.befdata.biow.uni-leipzig.de/vocab/services.php"

So, using the sonz example, we want to:

rtematres.options("tematres_url" = "http://www.vocabularyserver.com/sonz/index.php")
rtematres.options("tematres_service_url" = "http://www.vocabularyserver.com/sonz/services.php")

Then we can start to do things (note that some of this looks a bit different to the documentation on GitHub, but matches the CRAN version):

rtematres.api(task = "search", argument = "measurement")
rtematres.search("plant")
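Whatever list of terms comes back, the idea would be to feed it into the grepl approach from earlier. A minimal sketch, assuming the terms have already been pulled out into a plain character vector: `related_terms` is hypothetical and is not the actual return structure of the rtematres functions, which I haven’t reproduced here.

```r
# Hypothetical terms, standing in for whatever a thesaurus lookup returns
related_terms <- c("awful", "terrible", "dreadful")

# Join the terms with "|" and wrap them in word boundaries,
# so that "awfully" is not counted as a match for "awful"
pattern <- paste0("\\b(", paste(related_terms, collapse = "|"), ")\\b")

# Invented example messages standing in for chat$content
messages <- c("That was dreadful", "awfully good", "fine")
has_term <- grepl(pattern, messages, ignore.case = TRUE, perl = TRUE)
```

Terms containing regular expression metacharacters (brackets, dots and so on) would need escaping before being pasted into the pattern.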

As an aside, another way to do this would be to connect to the [Big Huge Thesaurus API]2 (quality?), e.g. using the [Python connector]3. So, that’s not quite what I was originally looking for (ideas on that?), but it might be something to come back to in the future…

Footnotes

  1. https://github.com/cpfaff/rtematres/blob/master/README.rmd

  2. http://words.bighugelabs.com/about.php

  3. https://pypi.python.org/pypi/pyhugeconnector