
QMED Computes: Cleaning Up Messy Data

Workshop date: 2017-12-06

Author: Jon Pipitone, 2017

Google Doc: https://docs.google.com/document/d/1Th-NQPJsO7_BPjcRs9twOuCISq8QboK2TnG7K0yxxko/edit#

This work is licensed under a Creative Commons Attribution 4.0 International License.

Big Ideas

Here are the big takeaways I’d like you to leave with from this workshop:

  1. Messy data is inconsistent data. Fix it. Humans reconcile inconsistencies because they understand meaning; computers (and statistics) do not. Before you do any analysis, check each field in your dataset to see that it’s clean.

  2. Make the computer compute. Don’t do anything by hand that you can automate. This shows your work, and lets you fix problems later easily.

  3. Be able to explain your work. Decide what is raw data, and never edit it by hand. Instead, transform your data and record the steps so you can share, remix and replay them later. Save your OpenRefine scripts along with your raw and cleaned data.
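OpenRefine records your steps for you, but the principle stands on its own. As a rough illustration in plain Python (the file contents and the `species` column are made up for this sketch), the raw input is read but never modified, every cleaning step lives in code, and the cleaned output is written separately:

```python
import csv
import io

def clean_row(row):
    # Each cleaning step lives in code, so it is recorded and repeatable.
    row["species"] = row["species"].strip().lower()
    return row

def clean_csv(src, dst):
    """Read raw rows from src and write cleaned rows to dst; src is never edited."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        writer.writerow(clean_row(row))

# In a real project src and dst would be files in separate raw/ and clean/
# folders; StringIO stands in here so the sketch is self-contained.
raw = io.StringIO("id,species\n1, Rodent \n2,RODENT\n")
out = io.StringIO()
clean_csv(raw, out)
print(out.getvalue())
```

Because the raw data is untouched, you can delete the cleaned file at any time and regenerate it exactly.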

Learning objectives

  • Know what OpenRefine does and when to use it

  • Know what is meant by “messy data”

  • Know how to import and export your data from OpenRefine

  • Know how to explore your data and find inconsistencies using filters

  • Use clustering and facets to fix those inconsistencies

  • Use transforms to clean and split up columns

  • Save your steps to apply to other datasets and for reproducibility

Tasks

In this session we will be following the Data Carpentry workshop on OpenRefine:

http://www.datacarpentry.org/OpenRefine-ecology-lesson

This lesson includes all the big ideas you should take away from this workshop.

  1. To get started, please follow the instructions in the Setup section

  2. Watch the very clear videos on openrefine.org. They are quick overviews. Do not try to learn how to do everything from these videos. Focus on what can be done with this tool.

    Note: OpenRefine used to be called GoogleRefine.

    a. Watch the entire Introductory video

    b. Watch up to 5:45 of the Clean and Refine video. The stuff after this is complex, and not super relevant.

    c. The last video, Reconcile and Match, is something you can leave for later. It’s super cool stuff, but niche.

  3. Do the Data Carpentry exercises: http://www.datacarpentry.org/OpenRefine-ecology-lesson

  4. Consider how you would use this tool in a real project. Discuss with a partner and then see my answers below.

    a. What kinds of data are “messy”?

    b. What columns of your dataset should you check?

    c. How should you record how you cleaned up your data?

    d. How can you prevent data from getting messy in the first place?

Other resources

  • The OpenRefine Wiki: Check the wiki for other learning resources, FAQs, and examples of how other people are using it.

  • Data Carpentry OpenRefine Resources: This is included in the OpenRefine lessons, but you might have missed it. It includes pointers to other tutorials online.

  • Use of Spreadsheets for Research Data Collection and Preparation: A Primer

    This paper summarizes all the important points about how to structure your spreadsheets and record data in a uniform way. Revisit it to see how to use Data Validation to keep your data consistent during data entry.

  • Good enough practices in scientific computing

    You may not think you are doing “scientific computing” but welcome to the 21st century: you probably are. This paper outlines the basic strategies you should strive towards when you get going.

My Answers to Question 4

What kinds of data are “messy”?

Any. Especially anything generated by humans (e.g. data entry from surveys), or extracted from human-generated data (e.g. qualitative analysis of free-form data like medical charts).

That said, even machine generated data could be messy (since, you know, humans created the machines).

What columns of your dataset should you check?

Ideally, every single one. At the very least, do a quick facet on each column to look for outliers (not statistical outliers, but the inconsistencies that lead to outliers; we’ll talk about detecting and dealing with outliers in the Statistics workshop).
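A text facet is essentially a frequency count of every distinct value in a column. A rough Python equivalent (the column values are invented for this sketch) makes it concrete why faceting exposes inconsistencies:

```python
from collections import Counter

# Hypothetical values from a "sex" column; an OpenRefine text facet
# would show these same counts in its sidebar.
column = ["F", "F", "M", "f", "M ", "F", "female", "M"]

facet = Counter(column)
for value, count in facet.most_common():
    print(repr(value), count)
# The low-count values -- 'f', 'M ' (note the trailing space), 'female' --
# are exactly the inconsistencies a facet makes visible.
```

The `repr()` call matters: it makes otherwise invisible whitespace show up in the output.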

How should you record how you cleaned up your data?

First, this is a leading question, but yes, you need to record how you clean up your data (Why?).

I suggest the following:

  1. Every time you clean up your data, make an explicit record of the steps you took to do this. OpenRefine makes it easy because you can Extract your cleaning steps and save them into a separate file.

  2. Organize your raw data, cleaned data, and cleanup files into separate folders. For instance:

    data/
        clean/
            Portal_rodents.csv
            Portal_rodents.open-refine.json
        raw/
            Portal_rodents.csv
    README.md
    

    In this way, future you (or someone else) can reproduce your cleaned up data by loading up the OpenRefine script in the corresponding .json file that sits right next to it.

  3. If you had some manual steps to clean up your data (sometimes, you just have to), or have other clean-up scripts (we’ll talk about this in the R sessions), you should document every step somewhere. A README document that orients someone to your project is a good place to do that. The “Good Enough” paper has more info on this.
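The replay idea in step 2 can be sketched in plain Python. The step format below is a simplified stand-in, not OpenRefine’s actual extracted JSON schema, but the principle is the same: an ordered list of recorded operations, applied to a fresh copy of the raw data.

```python
import json

# Hypothetical, simplified step format (OpenRefine's real extracted
# JSON is richer): each entry names a column and an operation.
steps_json = '''
[
  {"column": "species", "op": "trim"},
  {"column": "species", "op": "lowercase"}
]
'''

OPS = {"trim": str.strip, "lowercase": str.lower}

def replay(rows, steps):
    # Apply each recorded step, in order, to every row.
    for step in steps:
        transform = OPS[step["op"]]
        for row in rows:
            row[step["column"]] = transform(row[step["column"]])
    return rows

rows = [{"species": "  Rodent "}, {"species": "RODENT"}]
print(replay(rows, json.loads(steps_json)))
```

Because the steps are data, not memory, anyone can rerun them later and get byte-for-byte the same cleaned output.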

How can you prevent data from getting messy in the first place?

If you’re creating data from scratch (say, running a study and collecting data from participants, or extracting psychiatric hospital funding data from paper copies of historical provincial budgets), you need to protect against messy data upfront. (You’re only human, after all.)

This is especially true if you are working on the project with other humans or you are working on the project over a long period of time (and trust me, it’ll take longer than you think).

  1. Make clear guidelines about what kind of data goes into each column. Make sure everyone is on the same page. Perhaps document this in your README (but beware, humans don’t like reading documents). For example: “In the ‘Dropped out’ column, ‘Y’ means the participant dropped out, and ‘N’ means they did not.” Sounds too simple to write down, but it can really help.

  2. Don’t use colours to indicate anything meaningful in your spreadsheets. They are pretty, but humans forget what different colours mean, and you will hate yourself for this later when you discover stats programs are colorblind.

  3. Use Data Validation controls in your spreadsheets. These are settings you create for each column that control which values are legal, and enforce it. This does wonders to exclude the invisible leading and trailing spaces, and to stop people entering “YES”, “y”, “yes”, or “1” in columns that should just have “Y”.

  4. Use a tool specifically designed for data collection, not a spreadsheet. E.g. REDCap. This is an advanced topic, but worth knowing that such tools exist, and can have built in data validation.
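Point 3 above boils down to rejecting anything outside a fixed set of legal values at entry time. A minimal sketch in Python (the Yes/No column and its allowed values are illustrative) shows how strict such a check needs to be:

```python
# Illustrative rule for a Yes/No column: only these exact values are legal.
ALLOWED = {"Y", "N"}

def validate(value):
    """Reject anything that is not exactly 'Y' or 'N', including values
    with invisible leading/trailing spaces and variants like 'yes' or '1'."""
    if value not in ALLOWED:
        raise ValueError(f"illegal value: {value!r}")
    return value

print(validate("Y"))  # an exact legal value passes through unchanged
for bad in ["Y ", "yes", "1"]:
    try:
        validate(bad)
    except ValueError as err:
        print(err)
```

Note that `"Y "` (with a trailing space) is rejected: exact matching is what keeps the whitespace problem from ever entering your data.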
