One of the things I do in my Data Science for Innovation class is illustrate some of the issues we face in working with real datasets. Unlike many sample datasets, the datasets we encounter in authentic contexts are often messy in various ways, and/or need wrangling into the shape we need for analysis. A great tool for illustrating these issues is OpenRefine. I confess, it isn’t a tool I’d used until I started teaching this module, and it’s likely not a tool many professional data scientists would continue using (I still use R for most of the bits and pieces I do, even knowing about OpenRefine). What OpenRefine is great for is illustrating the issues and principles without getting bogged down in the specifics of “how do we code that”. It’s also incredibly useful because it is a tool I can imagine introducing to colleagues who don’t want to code. I’ve taught the OpenRefine way a couple of times, and the feedback students gave was mostly positive, though a few wished I’d selected a different tool. So, I took the OpenRefine tutorial and worked out how you’d implement each step both in R and in a spreadsheet (the spreadsheet version is only partially implemented, as far as possible, and solely for illustrative purposes). All three are embedded below and can be treated as CC-BY.
Word document OpenRefine instructions
RPubs R instructions (click to view)
Spreadsheet illustration (a good example of the limitations of spreadsheets…)
Data cleaning in OpenRefine and R by Simon Knight is licensed under a Creative Commons Attribution 4.0 International License.