Thesaurus…
So I don’t think anyone has blogged about this topic (although the github [README of the package]1 I’ll talk about is pretty good): using a thesaurus to look for matching strings in a text. This is just a sketch of some stuff I did a while ago; it’s unfinished, but I might come back to it at some point. I’m currently looking at the sorts of things participants said while engaged in some tasks. So, for example, in a dataframe called ‘chat’ I have a column called ‘content’ which contains a load of messages, and I’m interested in the presence of some keyterms in it. I can write code like the below to (1) subset to the rows containing a keyterm (grep), or (2) return a T/F vector indicating whether a term appears in the relevant column for each row (grepl):
chat[grep("awful|terrible", chat$content, perl=TRUE), ]
chat$emotion <- apply(chat[16], 1, function(x) grepl("awful|terrible", x, perl=TRUE))
#column 16 is chat$content
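One thing to watch with a pattern like that: it is case-sensitive, and it will also match inside longer words (so "awfully" counts as a hit). Here is a minimal sketch of tightening it up. The data frame is made up for illustration (these are not my actual messages), and grepl is called directly on the column, since it is vectorised:

```r
# Toy stand-in for the 'chat' data frame; the messages are invented
chat <- data.frame(
  content = c("That was AWFUL", "Pretty good overall",
              "awfully nice of you", "a terrible task"),
  stringsAsFactors = FALSE
)

# Plain pattern: case-sensitive, and matches inside longer words too
plain <- grepl("awful|terrible", chat$content, perl = TRUE)

# \\b restricts matches to whole words; ignore.case = TRUE also catches "AWFUL"
strict <- grepl("\\b(awful|terrible)\\b", chat$content,
                perl = TRUE, ignore.case = TRUE)

# Compare the two side by side
data.frame(content = chat$content, plain = plain, strict = strict)
```

Whether "awfully" should count as a hit depends on what you are coding for, so treat the stricter pattern as an option rather than a fix.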
Of course, you could extend the terms in lots of ways, including manually or using a corpus (and more regular expressions). There are also various ways to get new words programmatically; again a corpus is a good option, e.g. by using a term document matrix to look for keyterms, or by looking for collocates. I was interested in whether there was an easy function to connect to a thesaurus and populate a list from that. Short answer: sort of. The rtematres package allows you to connect to a number of thesauri (yes, there is a Harry Potter thesaurus; no, no idea why). For the sake of example, I’m going to connect to the ‘Subjects of New Zealand Thesaurus’.
install.packages("rtematres")
library(rtematres)
rtematres.options()
gives you
> rtematres.options()
$tematres_url
[1] "http://tematres.befdata.biow.uni-leipzig.de/vocab/index.php"
$tematres_service_url
[1] "http://tematres.befdata.biow.uni-leipzig.de/vocab/services.php"
So, using the sonz example, we want to:
rtematres.options("tematres_url" = "http://www.vocabularyserver.com/sonz/index.php")
rtematres.options("tematres_service_url" = "http://www.vocabularyserver.com/sonz/services.php")
Then we can start to do things (note that some of this looks a bit different from the documentation on GitHub, but it matches the CRAN version):
rtematres.api(task = "search", argument = "measurement")
rtematres.search("plant")
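Whatever comes back from the thesaurus, the terms still need collapsing into a single pattern before grep/grepl can use them. Here is a minimal sketch of that step, assuming the terms have already been pulled out into a character vector. The `terms` vector below is made up (it is not real output from the SONZ server), and `terms_to_pattern` is a hypothetical helper of my own, not part of rtematres:

```r
# Hypothetical helper: collapse a vector of terms into one alternation pattern
terms_to_pattern <- function(terms) {
  # Escape regex metacharacters so each term is matched literally
  escaped <- gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", terms)
  # Join with | and wrap in word boundaries so "plants" does not match "planting"
  paste0("\\b(", paste(escaped, collapse = "|"), ")\\b")
}

# Made-up terms standing in for whatever the thesaurus search returns
terms <- c("plants", "native plants", "flora")
pattern <- terms_to_pattern(terms)

# Invented messages to try the pattern on
msgs <- c("Native plants were everywhere", "I like planting", "flora and fauna")
hits <- grepl(pattern, msgs, perl = TRUE, ignore.case = TRUE)
```

The word boundaries assume each term starts and ends with a letter or digit; terms ending in punctuation would need different handling.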
As an aside, another way to do this would be to connect to the [Big Huge Thesaurus API]2 (quality?), e.g. using the [Python connector]3. So, that’s not quite what I was originally looking for (ideas on that?), but it might be something to come back to in the future…