Cleaning essays in R (ish)


I’m currently working on a corpus of student essays. We want to run various analyses on these essays, but for that to work we need to have them in a fairly universal state, with some elements removed/modified for processing purposes.

As is entirely predictable, the essays vary a lot in format and structure, for example:

  1. Most are docx format, some are pdf
  2. Some include the student ID in the filename, many do not (astonishingly, some don’t include it anywhere in the text either!)
  3. Some are formatted in columns
  4. Some have included the essay prompt at the beginning of the text, most have not
  5. Student IDs and names are scattered across: Start/end sections, filename, headers/footers
  6. Many figures and tables are included along with captions – we can’t do anything meaningful with these in our tools at the moment; the figures are lost in conversion to .txt anyway, but the other remnants are noise for analysis.

I started trying to do some of the above cleaning manually, and realised that would take me about three days per batch – not viable. So, I moved on to trying to automate it. My thanks to Laura Allen at ASU for her incredibly useful advice on this (to be clear, my terrible code has nothing to do with her).

I went about trying to:

  1. Convert all documents to .txt
  2. Rename all documents to the student ID, and remove any references to their name or ID from the document text
  3. Remove any front- or tail-matter text surrounding the essay body
  4. Remove headings, captions, and other random short text elements
  5. Convert references and links to unigrams (otherwise e.g. individual names and dates become features, which we don’t want)
  6. Clean some punctuation (specifically dashes, hyphens, etc. – a massive hassle)
  7. Convert bullets and numbered lists to paragraph text.

And I wanted to do all of this in R… however, I realised that was going to be a challenge, so instead everything is called from within R/RStudio, with some system calls and a bit of embedded Python.
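For illustration, the conversion step (item 1 above) might look something like this – a minimal sketch, assuming `pandoc` and `pdftotext` are installed and on the PATH; the tools and the `essays` folder name are my assumptions, not taken from the original code:

```r
# Sketch: convert .docx via pandoc and .pdf via pdftotext, using system calls.
# Assumes both command-line tools are installed; "essays" is a placeholder path.
files <- list.files("essays", full.names = TRUE)

for (f in files) {
  out <- sub("\\.(docx|pdf)$", ".txt", f, ignore.case = TRUE)
  if (grepl("\\.docx$", f, ignore.case = TRUE)) {
    system(paste("pandoc -t plain", shQuote(f), "-o", shQuote(out)))
  } else if (grepl("\\.pdf$", f, ignore.case = TRUE)) {
    system(paste("pdftotext", shQuote(f), shQuote(out)))
  }
}
```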

I’ve copied the working code (which I will clean up at some point) below… as will be obvious, it is very much a draft.

A lot of it is from Stack Overflow, and I asked three questions there specifically:

  1. Removing spaces surrounding dashes in R
  2. Regular expression for matching a variety of types of numbered lists in R
  3. Finding short strings in new lines in an R character vector
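For anyone who doesn’t want to dig through those threads, here are hedged sketches of the kind of regex each question was after – my reconstructions, not necessarily the accepted answers:

```r
# Reconstructed sketches of the three regex problems above (not the SO answers).
essay <- "Real text here - with a spaced dash.\n1. a numbered item\n(2) another list style\nShort caption\nA longer line of actual essay prose that we definitely want to keep."

# 1. Collapse spaces surrounding dashes
essay <- gsub("\\s*-\\s*", "-", essay)

# 2. Strip a variety of numbered-list markers at line starts: "1.", "1)", "(1)"
essay <- gsub("(?m)^\\s*\\(?\\d+[.)]\\s*", "", essay, perl = TRUE)

# 3. Drop short lines (likely headings/captions), keeping lines over five words
lines <- unlist(strsplit(essay, "\n"))
lines <- lines[vapply(strsplit(lines, "\\s+"), length, integer(1)) > 5]
essay <- paste(lines, collapse = "\n")
```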

Late in the process I found the qdapRegex package, which has a load of canned regex patterns (including for URL and citation extraction), so I rewrote some of the code and trimmed it a bit. I’ve left most of the old stuff in, though, so you can see the drafting (and the utter mess… so many things to improve). The tryCatch wrappers were also added late, to identify where specific errors were coming from.
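To give a flavour of those two additions, here is a short sketch: qdapRegex really does provide `rm_url()` and `rm_citation()`, but the example text and the `clean_safely()` wrapper are mine, not from the posted code:

```r
library(qdapRegex)

# Canned patterns: one call each to strip URLs and parenthetical citations
txt <- "See https://example.com and the earlier finding (Smith, 2019) here."
txt <- rm_url(txt)
txt <- rm_citation(txt)

# A tryCatch wrapper that reports which file a cleaning error comes from
clean_safely <- function(path) {
  tryCatch(
    rm_url(paste(readLines(path, warn = FALSE), collapse = "\n")),
    error = function(e) {
      message("Cleaning failed for ", path, ": ", conditionMessage(e))
      NA_character_
    }
  )
}
```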

Challenge to the reader – can you improve it? Have I missed anything? (I’m aware headers/footers are currently imported, but I think they’re excluded by the short-line remover.)

The chunk below gets IDs for matching purposes:
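(The idea is roughly as follows – a hypothetical sketch, since the real ID format isn’t shown here; the seven-digit pattern and `essays` folder are assumptions:)

```r
# Sketch: pull a student ID out of each filename; NA flags files that need
# manual handling. The \\d{7,} pattern is a placeholder for the real ID format.
files <- list.files("essays", pattern = "\\.txt$", full.names = TRUE)

get_id <- function(fname) {
  m <- regmatches(fname, regexpr("\\d{7,}", fname))
  if (length(m) > 0) m else NA_character_
}

id_table <- data.frame(
  file = files,
  id   = vapply(basename(files), get_id, character(1)),
  stringsAsFactors = FALSE
)
```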

Then the actual cleaning and processing:
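(A condensed sketch of what that per-file pipeline does, assuming the ID comes from the matching step above – the real posted code is longer and messier:)

```r
# Sketch of the per-file pipeline: scrub the ID, strip links, normalise
# dashes, de-bullet lists, and drop short lines (headings/captions).
clean_essay <- function(path, id) {
  txt <- paste(readLines(path, warn = FALSE), collapse = "\n")
  txt <- gsub(id, "", txt, fixed = TRUE)                 # remove the student ID
  txt <- qdapRegex::rm_url(txt)                          # remove links
  txt <- gsub("\\s*[-\u2013\u2014]\\s*", "-", txt)       # normalise dash spacing
  txt <- gsub("(?m)^\\s*([-*\u2022]|\\(?\\d+[.)])\\s*",  # bullets/numbering
              "", txt, perl = TRUE)                      # -> paragraph text
  keep  <- function(l) length(strsplit(l, "\\s+")[[1]]) > 5
  lines <- Filter(keep, unlist(strsplit(txt, "\n")))     # short-line remover
  writeLines(lines, sub("\\.txt$", "_clean.txt", path))
}
```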



