
QMED Computes: Cleaning Up Messy Data

Workshop date: 2017-12-06

Author: Jon Pipitone, 2017

Google Doc: https://docs.google.com/document/d/1Th-NQPJsO7_BPjcRs9twOuCISq8QboK2TnG7K0yxxko/edit#

This work is licensed under a Creative Commons Attribution 4.0 International License.

Big Ideas

Here are the big takeaways I’d like you to leave with from this workshop:

  1. Messy data is inconsistent data. Fix it. Humans reconcile inconsistencies because they understand meaning; computers (and statistics) do not. Before you do any analysis, check each field in your dataset to see that it’s clean.

  2. Make the computer compute. Don’t do anything by hand that you can automate. This shows your work, and lets you fix problems later easily.

  3. Be able to explain your work. Decide what is raw data, and never edit it by hand. Instead, transform your data and record the steps so you can share, remix and replay them later. Save your OpenRefine scripts along with your raw and cleaned data.
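OpenRefine records your steps for you, but the principle stands on its own. As a rough illustration in plain Python (the file contents and the `species` column are made up for this sketch), the raw input is read but never modified, every cleaning step lives in code, and the cleaned output is written separately:

```python
import csv
import io

def clean_row(row):
    # Each cleaning step lives in code, so it is recorded and repeatable.
    row["species"] = row["species"].strip().lower()
    return row

def clean_csv(src, dst):
    """Read raw rows from src and write cleaned rows to dst; src is never edited."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        writer.writerow(clean_row(row))

# In a real project src and dst would be files in separate raw/ and clean/
# folders; StringIO stands in here so the sketch is self-contained.
raw = io.StringIO("id,species\n1, Rodent \n2,RODENT\n")
out = io.StringIO()
clean_csv(raw, out)
print(out.getvalue())
```

Because the raw data is untouched, you can delete the cleaned file at any time and regenerate it exactly.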

Learning objectives

  • Know what OpenRefine does and when to use it

  • Know what is meant by “messy data”

  • Know how to import and export your data from OpenRefine

  • Know how to explore your data and find inconsistencies using filters

  • Use clustering and facets to fix those inconsistencies

  • Use transforms to clean and split up columns

  • Save your steps to apply to other datasets and for reproducibility

Tasks

In this session we will be following the Data Carpentry workshop on OpenRefine:

http://www.datacarpentry.org/OpenRefine-ecology-lesson

This lesson includes all the big ideas you should take away from this workshop.

  1. To get started, please follow the instructions in the Setup section

  2. Watch the very clear videos on openrefine.org. They are quick overviews. Do not try to learn how to do everything from these videos. Focus on what can be done with this tool.

    Note: OpenRefine used to be called GoogleRefine.

    a. Watch the entire Introductory video

    b. Watch up to 5:45 of the Clean and Refine video. The stuff after this is complex, and not super relevant.

    c. The last video, Reconcile and Match, is something you can leave for later. It’s super cool stuff, but niche.

  3. Do the Data Carpentry exercises: http://www.datacarpentry.org/OpenRefine-ecology-lesson

  4. Consider how you would use this tool in a real project. Discuss with a partner and then see my answers below.

    a. What kinds of data are “messy”?

    b. What columns of your dataset should you check?

    c. How should you record how you cleaned up your data?

    d. How can you prevent data from getting messy in the first place?

Other resources

  • The OpenRefine Wiki: Check the wiki for other learning resources, FAQs, and examples of how other people are using it.

  • Data Carpentry OpenRefine Resources: This is included in the OpenRefine lessons, but you might have missed it. It includes pointers to other tutorials online.

  • Use of Spreadsheets for Research Data Collection and Preparation: A Primer

    This paper summarizes all the important points about how to structure your spreadsheets and record data in a uniform way. Revisit it to see how to use Data Validation to keep your data consistent during data entry.

  • Good enough practices in scientific computing

    You may not think you are doing “scientific computing” but welcome to the 21st century: you probably are. This paper outlines the basic strategies you should strive towards when you get going.

My Answers to Question 4

What kinds of data are “messy”?

Any. Especially anything generated by humans (e.g. data entry from surveys), or extracted from human-generated data (e.g. qualitative analysis of free-form data like medical charts).

That said, even machine generated data could be messy (since, you know, humans created the machines).

What columns of your dataset should you check?

Ideally, every single one. At the very least, do a quick facet on each column to look for outliers (not statistical outliers, but the inconsistencies that lead to outliers; we’ll talk about detecting and dealing with outliers in the Statistics workshop).
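A text facet is essentially a frequency count of every distinct value in a column. A rough Python equivalent (the column values are invented for this sketch) makes it concrete why faceting exposes inconsistencies:

```python
from collections import Counter

# Hypothetical values from a "sex" column; an OpenRefine text facet
# would show these same counts in its sidebar.
column = ["F", "F", "M", "f", "M ", "F", "female", "M"]

facet = Counter(column)
for value, count in facet.most_common():
    print(repr(value), count)
# The low-count values -- 'f', 'M ' (note the trailing space), 'female' --
# are exactly the inconsistencies a facet makes visible.
```

The `repr()` call matters: it makes otherwise invisible whitespace show up in the output.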

How should you record how you cleaned up your data?

First, this is a leading question, but yes, you need to record how you clean up your data (Why?).

I suggest the following:

  1. Every time you clean up your data, make an explicit record of the steps you took to do this. OpenRefine makes it easy because you can Extract your cleaning steps and save them into a separate file.

  2. Organize your raw data, cleaned data, and cleanup files into separate folders. For instance:

    data/
        clean/
            Portal_rodents.csv
            Portal_rodents.open-refine.json
        raw/
            Portal_rodents.csv
    README.md
    

    In this way, future you (or someone else) can reproduce your cleaned up data by loading up the OpenRefine script in the corresponding .json file that sits right next to it.

  3. If you had some manual steps to clean up your data (sometimes, you just have to), or have other clean-up scripts (we’ll talk about this in the R sessions), you should document every step somewhere. A README document that orients someone to your project is a good place to do that. The “Good Enough” paper has more info on this.
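The replay idea in step 2 can be sketched in plain Python. The step format below is a simplified stand-in, not OpenRefine’s actual extracted JSON schema, but the principle is the same: an ordered list of recorded operations, applied to a fresh copy of the raw data.

```python
import json

# Hypothetical, simplified step format (OpenRefine's real extracted
# JSON is richer): each entry names a column and an operation.
steps_json = '''
[
  {"column": "species", "op": "trim"},
  {"column": "species", "op": "lowercase"}
]
'''

OPS = {"trim": str.strip, "lowercase": str.lower}

def replay(rows, steps):
    # Apply each recorded step, in order, to every row.
    for step in steps:
        transform = OPS[step["op"]]
        for row in rows:
            row[step["column"]] = transform(row[step["column"]])
    return rows

rows = [{"species": "  Rodent "}, {"species": "RODENT"}]
print(replay(rows, json.loads(steps_json)))
```

Because the steps are data, not memory, anyone can rerun them later and get byte-for-byte the same cleaned output.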

How can you prevent data from getting messy in the first place?

If you’re creating data from scratch (say, running a study and collecting data from participants, or extracting psychiatric hospital funding data from paper copies of historical provincial budgets), you need to protect against messy data upfront. (You’re only human, after all.)

This is especially true if you are working on the project with other humans or you are working on the project over a long period of time (and trust me, it’ll take longer than you think).

  1. Make clear guidelines about what kind of data goes into each column. Make sure everyone is on the same page. Perhaps document this in your README (but beware, humans don’t like reading documents). For example: “In the ‘Dropped out’ column, ‘Y’ means the participant dropped out, and ‘N’ means they did not.” Sounds too simple to write down, but it can really help.

  2. Don’t use colours to indicate anything meaningful in your spreadsheets. They are pretty, but humans forget what different colours mean, and you will hate yourself for this later when you discover stats programs are colorblind.

  3. Use Data Validation controls in your spreadsheets. These are settings you create for each column that control which values are legal, and enforce it. This does wonders to exclude the invisible leading and trailing spaces, and to stop people entering “YES”, “y”, “yes”, or “1” in columns that should just have “Y”.

  4. Use a tool specifically designed for data collection, not a spreadsheet. E.g. REDCap. This is an advanced topic, but worth knowing that such tools exist, and can have built in data validation.
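Point 3 above boils down to rejecting anything outside a fixed set of legal values at entry time. A minimal sketch in Python (the Yes/No column and its allowed values are illustrative) shows how strict such a check needs to be:

```python
# Illustrative rule for a Yes/No column: only these exact values are legal.
ALLOWED = {"Y", "N"}

def validate(value):
    """Reject anything that is not exactly 'Y' or 'N', including values
    with invisible leading/trailing spaces and variants like 'yes' or '1'."""
    if value not in ALLOWED:
        raise ValueError(f"illegal value: {value!r}")
    return value

print(validate("Y"))  # an exact legal value passes through unchanged
for bad in ["Y ", "yes", "1"]:
    try:
        validate(bad)
    except ValueError as err:
        print(err)
```

Note that `"Y "` (with a trailing space) is rejected: exact matching is what keeps the whitespace problem from ever entering your data.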
