Data cleaning in OpenRefine and R
One of the things I do in my Data Science for Innovation class is illustrate some of the issues we face in working with real datasets. Unlike many sample datasets, many datasets we encounter in authentic contexts are messy in various ways, or/and need wrangling into the shape we need for analysis. A great tool to illustrate some of the issues we face is OpenRefine. I confess, it isn’t a tool I’d used until I started teaching this module, and it’s likely not a tool many professional datascientists would continue using (I still use R for most of the bits and pieces I do even knowing about OpenRefine). What OpenRefine is great for is illustrating the issues and principles without getting bogged down in the specifics of “how do we code that”. It’s also incredibly useful because it is a tool I can imagine introducing to colleagues who don’t want to code. Having taught the OpenRefine way a couple of times, the feedback students gave was mostly positive, with a few wishing I’d selected a different tool. So, I took the OpenRefine tutorial, and worked out how you’d implement each step both in R, and in a spreadsheet (partially implemented so far as possible, solely for illustrative purposes). The three are embedded below and can be treated as CC-By.
Word document OpenRefine instructions
RPubs R instructions (click to view)
http://rpubs.com/sjgknight/cleaning
Spreadsheet illustration (a good example of the limitations of spreadsheets…)



Related Posts via Taxonomies
Data cleaning in OpenRefine and R by Simon Knight is licensed under a Creative Commons Attribution 4.0 International License.
Useful https://github.com/vpnagraj/rrefine
Hi. I am Fabio. Can you explain me this sentence:
You may want to turn off “guess data types”, particularly if you have data that contains
leading zeros in numbers or identifiers which are significant.
a bit more? Thanks,
Fabio
Hi Fabio
Thanks for the question, the issue with having ‘guess data types’ turned on, is that sometimes it will treat data different to how you want, for example, if you have a ‘StudentID’ field made up of numbers, (e.g. 4501980) this isn’t really a numeric data, there aren’t cases where we’d want to do maths on it, etc. so treating it as numeric would be wrong. Or, we might have a field with data that has 00035 00050, etc. and we want to preserve the leading 0s for some reason.