Blog

Data cleaning in OpenRefine and R

One of the things I do in my Data Science for Innovation class is illustrate some of the issues we face in working with real datasets. Unlike many sample datasets, many datasets we encounter in authentic contexts are messy in various ways, or/and need wrangling into the shape we need for analysis. A great tool to illustrate some of the issues we face is OpenRefine. I confess, it isn’t a tool I’d used until I started teaching this module, and it’s likely not a tool many professional datascientists would continue using (I still use R for most of the bits and pieces I do even knowing about OpenRefine). What OpenRefine is great for is illustrating the issues and principles without getting bogged down in the specifics of “how do we code that”. It’s also incredibly useful because it is a tool I can imagine introducing to colleagues who don’t want to code. Having taught the OpenRefine way a couple of times, the feedback students gave was mostly positive, with a few wishing I’d selected a different tool. So, I took the OpenRefine tutorial, and worked out how you’d implement each step both in R, and in a spreadsheet (partially implemented so far as possible, solely for illustrative purposes). The three are embedded below and can be treated as CC-By.

Word document OpenRefine instructions

https://goo.gl/mQC7oe

RPubs R instructions (click to view)

http://rpubs.com/sjgknight/cleaning

 

Spreadsheet illustration (a good example of the limitations of spreadsheets…)


Print pagePDF pageEmail page

CC BY 4.0 Data cleaning in OpenRefine and R by Simon Knight is licensed under a Creative Commons Attribution 4.0 International License.

This Post Has 3 Comments

  1. Fabio says:

    Hi. I am Fabio. Can you explain me this sentence:

    You may want to turn off “guess data types”, particularly if you have data that contains
    leading zeros in numbers or identifiers which are significant.

    a bit more? Thanks,

    Fabio

    • Simon Knight says:

      Hi Fabio
      Thanks for the question, the issue with having ‘guess data types’ turned on, is that sometimes it will treat data different to how you want, for example, if you have a ‘StudentID’ field made up of numbers, (e.g. 4501980) this isn’t really a numeric data, there aren’t cases where we’d want to do maths on it, etc. so treating it as numeric would be wrong. Or, we might have a field with data that has 00035 00050, etc. and we want to preserve the leading 0s for some reason.

Leave A Reply





%d bloggers like this: