marioa/OpenRefine.md

## OpenRefine.md

      
    Raw
  

              OpenRefine.md
            
          
    Data Cleaning with OpenRefine for Ecologists

This is a summarised version of the Data Carpentry's Data Cleaning with OpenRefine for Ecologists
 lesson.
Contents


Setup
Introduction
Working with OpenRefine
Filtering and Sorting with OpenRefine
Examining Numbers in OpenRefine
Scripts from OpenRefine
Exporting and Saving Data from OpenRefine
Other Resources in OpenRefine

Setup


Download dataset https://ndownloader.figshare.com/files/7823341. A teaching version
of the Portal Database. The data in this lesson is a subset of the teaching version that has been intentionally ‘messed up’ for this
lesson.
OpenRefine (Freebase Gridworks developed by Metaweb, then Google Refine) http://openrefine.org/download.html. Available as open source since 2010.

Introduction

Time: 10/0 minutes.
Objectives


Describe OpenRefine’s uses and applications.
Differentiate data cleaning from data organization.
Experiment with OpenRefine’s user interface.
Locate helpful resources to learn more about OpenRefine.

Motivation


You need to keep provenance of data.
You must export your data at the end, OpenRefine does not save back to the original data.
You can undo operations.
OpenRefine helps you clean data.
Can replay operations on multiple data sets.
Can use it to illustarte different clustering algorithms.

Before we get started


Will need a browser (Firefox or Google chrome).
Download and install OpenRefine http://openrefine.org/download.html.
Download the data file to be used.
Run OpeRefne - point to http://localhost:3333/ if it does not open automatically.

What is OpenRefine


Java program

Runs locally on your machine


Does not modify your original data.

Can undo actions
Keep track of the actions you have performed on your data and export
Auto saves so you can pause/resume analysis.


Standardise and clean data across your file.

It can help you


Get an overview of a data set
Resolve inconsistencies in a data set
Help you split data up into more granular parts
Match local data up to other data sets
Enhance a data set with data from other sources
Save a set of data cleaning steps to replay on multiple files
Can work with datasets of up to 100 000 rows.

Key Points


OpenRefine is a powerful, free and open source tool that can be used for data cleaning.
OpenRefine will automatically track any steps you take in working with your data.

Working with OpenRefine

Time: 15/20 minutes.
Objectives


Create a new OpenRefine project from a CSV file.
Recall what facets are and how they are used to sort and summarize data.
Recall what clustering is and how it is applied to group and edit typos.
Manipulate data using previous steps with undo/redo.
Employ drop-downs to split values from one column into multiple columns.
Employ drop-downs to remove white spaces from cells.

Reading in data


Open OpenRefine
Open the data file Portal_rodents_19772002_scinameUUIDs.csv

Click Create Project and select Get data from This Computer.
Click Choose Files and select the file Portal_rodents_19772002_scinameUUIDs.csv.

Click Open or double-click on the filename.


Click Next>> under the browse button to upload the data into OpenRefine.
OpenRefine gives you a preview - a chance to show you it understood the file.
If you change any of the parameters click Update Preview (bottom right). If this:

is the wrong file, click <<Start Over (upper left).
If all looks well, click Create Project>> (upper right).


Faceting


A ‘Facet’ groups all the like values that appear in a column
Allows you to filter the data by these values and edit values across many records at the same time.

It allows you to:

See a big picture of your data, and
Filter down to just the subset of rows that you want to change in bulk.

Text facets group all the identical text values in a column and lists each value with the number of records it appears in.
Here we will use faceting to look for potential errors in data entry in the scientificName column.

Scroll over to the scientificName column.
Click the down arrow and choose Facet > Text facet.
Gives every unique value in the scientificName column along with a count of occurrencess in the column.
Can sort by name or by count.


Do you notice any problems with the data? What are they?


Hover the mouse over one of the names in the Facet list.


You should see that you have an edit function available.


Can use edit to fix an error immediately.


OpenRefine will ask whether you want to apply the same correction to every value it finds like that one.


OpenRefine offers  better ways to find and fix these errors, which we’ll use instead.


Learn about these when we talk about clustering.


Can do other type of faceting:

Numeric facets
Timeline facets (for dates)
Custom facets
Scatterplot facets

Numeric and Scatterplot facets display graphs instead of lists of values.
Custom facets are a range of different types of facets. Some of the default custom facets are:

Word facet - breaks down text into words and counts & the number of records each word appears in
Duplicates facet - this results in a binary facet of ‘true’ or ‘false’. Rows appear in the:

‘true’ facet if the value in the selected column is an exact match for a value in the same column in another row


Text length facet - creates a numeric facet based on the length (number of characters) of the text in each row for the selected column.

Can be useful for spotting incorrect or unusual data in a field where specific lengths are expected


Facet by blank - a binary facet of ‘true’ or ‘false’. Rows appear in the ‘true’ facet if they have no data present in that column. This is useful when looking for rows missing key data.

Open refines limits the numbers of aggregates it will produce when faceting (2000 by default but can change; http://127.0.0.1:3333/preferences and changing the preference with key ui.browsing.listFacet.limit and value equal the new limit, such as 3000).
Exercise

Using faceting, find out how many years are represented in the census.


For yr do Facet > Text facet. A box will appear in the left panel showing that there are 26 unique entries in this column.


Is the column formatted as Number, Date, or Text? How does changing the format change the faceting display?


By default, the column yr is formatted as Text.
You can change the format by doing Edit cells > Common transforms > To number.
Doing Facet > Numeric facet creates a box in the left panel that shows a histogram of the number of entries per year.
Notice that the data is shown as a number, not a date.

If you instead transform the column to a date, the program will assume all entries are on January 1st of the year.


Which years have the most and least observations?


After creating a facet, click Sort by count in the facet box. The year with the most observations is 1997. The least is 1977.

Clustering

Look for similarities:

In the scientificName Text Facet we created in the step above, click the Cluster button.
In the resulting pop-up window, you can change the Method and the Keying Function.
Select the key collision method and metaphone3 keying function.


It should identify three clusters.
Click the Merge? box beside each, then click Merge Selected and Recluster to apply the corrections to the dataset.


Explore, but don’t Merge again; just Close when you’re done.

We’ll now see other operations that will help us detect and correct the remaining problems, and that have other, more general uses.
Split


Let us suppose we want to split the scientificName column into separate columns for genus and for species.
Click the down arrow at the top of the scientificName column. Choose Edit Column > Split into several columns....
In the pop-up, in the Separator box, replace the comma with a space.
Uncheck the box that says Remove this column.
Click OK. You’ll get some new columns called scientificName 1, scientificName 2, and so on.


Notice that in some cases scientificName 1 and scientificName 2 are empty.

Why is this?
What do you think we can do to fix this?


Exercise
Change the name of the second new column to “species”.
Solution


Click the down arrow and then Edit column > Rename this column.


Type “species” into the box that appears.


A pop-up will appear that says Another column already named species.


This is because there is another column where we’ve recorded the species abbreviation.

You can choose another name like speciesName for this column or
change the other species column name to speciesAbbreviation.


Undo / Redo


Click where it says Undo / Redo.


All the changes you have made to the data so far are listed here.


Click on the step that you want to go back to, in this case the previous step.


The added columns will disappear.


Notice that you can still click on the last step and make the columns reappear, and toggle back and forth between these states.
Leave the dataset in the state in which the scientificNames were clustered, but not yet split.

Trim Leading and Trailing Whitespace


In the header for the column scientificName, choose Edit cells > Common transforms > Trim leading and trailing whitespace.
Notice that the Split step has now disappeared from the Undo / Redo and is replaced with a Text transform on 3 cells.
Perform the same Split operation on scientificName that you undid earlier. This time you should only get two new columns. Why?

Go back to the unsplit view.
Key Points


Faceting and clustering approaches can identify errors or outliers in data.

Filtering and Sorting with OpenRefine

Time: 10/10 minutes
Objectives


Employ text filter or include/exclude to filter to a subset of rows.
Sort tables by a column.
Sort tables by multiple columns.

Filtering

Time: 10/10 minutes.

Click the down arrow next to scientificName > Text filter.


A scientificName facet will appear on the left margin.


Type in bai and press return.


There are 48 matching rows of the original 35549 rows (and these rows are selected for the subsequent steps).


At the top, change the view to Show 50 rows.


This way you will see all the matching rows.

Exercise

What scientific names (genus and species) are selected by this procedure?


Baiomys taylori and Chaetodipus baileyi.


How would you restrict this to one of the species selected?

make the search case sensitive
split the scientificName column into species and genus before filtering
include more letters in your filter


Excluding entries

Can filter further by include or exlude an entry in a facet.
Exercise
Use include / exclude to select only entries from one of these two species.
Go back to viewing both species.
Sort

You can sort data. Column > Sort....
Exercise
Sort by month. How can you ensure that months are in order?

If you try to re-sort a column that you have already used, the drop-down menu changes slightly, to > Sort without the ...

Sort > Sort... - This option enables you to modify your original sort.
Sort > Reverse - This option allows you to reverse the order of the sort.
Sort > Remove sort - This option allows you to undo your sort.


Exercise
Sort the data by plot. What year(s) were observations recorded for plot 1 in this filtered dataset.

In the plot column, select Sort... > numbers and select smallest first.
The years represented are 1990 and 1995.

Sorting by multiple columns


You can sort by multiple columns by performing sort on additional columns.

The sort depends on the order in which you select columns to sort.


To restart the sorting process with a particular column

Check the sort by this column alone box in the Sort pop-up menu or
remove the sort in the top level link.


Exercise
You might like to look for trends in your data by month of collection across years.

How do you sort your data by month?


For the mo column, click on Sort... and then numbers.


How would you do this differently if you were instead trying to see all of your entries in chronological order?


Remove existing sort or use sort by this column alone.
Sort by yr then 'mo' then dy.

Key Points


OpenRefine provides a way to sort and filter data without affecting the raw data.

Examining Numbers in OpenRefine

Time: 10/10 minutes.
Objectives


Transform a text column into a number column.
Identify and modify non-numeric values in a column using facets.
Use scatterplot facet to examine relationships among columns.

Numbers


By default columns are read in as text.
Previous sorts interpreted the text as numbers but not changed the data.
Can changed the data by Edit cells > Common transforms
Remove any text formatting.
Change RecordID by Edit cells > Common transforms > to number

Values go from left-justified to right-justified.


Exercise
Transform three more columns, including period, from text to numbers. Can all columns be transformed to numbers?

Only observations that include only numerals (0-9) can be transformed to numbers.
If you apply a number transformation to a column that doesn’t meet this criteria

Click the Undo / Redo tab, you will see a step that starts with Text transform on 0 cells.

This means that the data in that column was not transformed.
Font goes from black to green.


Numeric facet


Sometimes there are non-number values or blanks in a column which may represent errors in data entry.
We can do that with a Numeric facet.

Exercise

For a column you transformed to numbers, edit one or two cells, replacing


numbers with text (such as abc) or
blank (no number or text).


Use the pulldown menu to apply a numeric facet to the column you edited. The facet will appear in the left panel.
Notice that there are several checkboxes in this facet:


Numeric,
Non-numeric,
Blank, and
Error.


There are counts of the number of cells in each category.


You should see checks for Non-numeric and Blank if you changed some values.


Experiment with checking or unchecking these boxes to select subsets of your data.


When done examining the numeric data, remove this facet by clicking the x in the upper left corner of its panel.
Note that this does not undo the edits you made to the cells in this column.

If you want to reverse these edits, use the Undo / Redo function.


Scatterplot facet


Relate multiple columns representing numbers using the scatterplot facet.

Select a numeric column, for example recordID
Use the pulldown menu to Facet > Scatterplot facet.
A new window called Scatterplot Matrix appears.

There are plots for each pair of numeric columns organized in an upper right triangle.
Each plot has little dots for the cell values from each row.


Exercise

Examine the scatterplots overall. Do the patterns make sense?
Why does the scatterplot for recordID vs period have the pattern it does?

Examine pair of columns in detail


We can examine one pair of columns by clicking on its square in the Scatterplot Matrix.
A new facet with only that pair will appear in the left margin.

Exercise

Click in the scatterplot facet in the left margin and drag to highlight a rectangle.
This will subset the data to those entries.

Exercise

Click on the Scatterplot Matrix square for recordID and period to get that as a facet in the left margin.
Redo the Text filter on scientificName to show only entries including the letters bai.

Notice the change in the scatterplot.
It might be easier to see if you click export plot to put it on a new browser tab.


Key Points


OpenRefine also provides ways to get overviews of numerical data.

Scripts from OpenRefine

Time: 10/5 minutes.
Objectives


Describe how OpenRefine generates JSON (Java Script Object Notation) code.
Demonstrate ability to export JSON code from OpenRefine.
Save JSON code from an analysis.
Apply saved JSON code to an analysis.

Scripts


OpenRefine saves every change you make to the dataset.
Changes are saved in a format known as JSON (JavaScript Object Notation).
You can export this JSON script and apply it to other data files.

If you had multiple files to clean, and they all had the same type of errors

e.g. species name misspellings, leading white spaces


All files had the same column names
You can save the steps you took as a JSON script:

Open a new file to clean in OpenRefine,
aste in the script and run it.


Gives you a quick way to clean all of your related data.


Let's create a script:

In the Undo / Redo section
a. click Extract...
b. Select the steps that you want to apply to other datasets by clicking the check boxes.
Copy the code from the right hand panel and paste it into a text editor.


NotePad on Windows or TextEdit on Mac.
Make sure it saves as a plain text file.

In TextEdit, do this by selecting Format > Make plain text and save the file as a txt file.


Rename the project.


Download an uncleaned version of the dataset:

https://ndownloader.figshare.com/files/7823341 or
use the version of the raw dataset you saved to your computer.


Start a new project in OpenRefine with this file and name it something different from your existing project.


Click the Undo / Redo tab > Apply and paste in the contents of txt file with the JSON code.


Click Perform operations.


The dataset should now be the same as your other cleaned dataset.


Reproducible science


Now, that you know how scripts work, you may wonder how to use them in your own scientific research.
For inspiration, you can read more about the succesful application of the reproducible science principles:


Marwick et al. Computational Reproducibility in Archaeological Research: Basic Principles and a Case Study of Their Implementation, 2017.
Stewart Lowndes et al. Our path to better science in less time using open data science tools, Nature ecology and evolution, 2017.

Key Points


All changes are being tracked in OpenRefine

These can be used for scripts for future analyses or
Reproducing an analysis.


Scripts can (and should) be published together with the dataset as part of the digital appendix of the research output.

Exporting and Saving Data from OpenRefine

Time: 10/5 minutes.
Objectives


Save an OpenRefine project.
Export cleaned data from an OpenRefine project.

Saving


OpenRefine saves as you go along.

You can close your browser and resume your project at a later time.


Export

You can export a project:

Send to a collaborator: raw data -> cleaned data.
Publish this as part of your analysis.

To do this:

Click the Export button in the top right and select Export project.
A tar.gz file will download to your default Download directory.

The tar.gz extension tells you that this is a compressed file that this file contains multiple files.
You can double-click on the tar.gz file and it will expand into a directory.

On Windows, opening tar.gz files requires additional software such as:

7-zip or
WinZip.


Once installed associate with the type of file.


Look at the files that appear in this folder. What files are here? What information do you think these files contain?


You should see:

A history folder that contains a zip file for each applied change.
Each zip file contains

A change.txt file which records each individual transformation that you did to your data.


A data.zip file which includes a data.txt which is a copy of your raw data.
You may also see other files.


You can import an existing project into OpenRefine by:

Clicking Open... in the upper right Import Project
Select the tar.gz project file.

This project includes all of the raw data and cleaning steps that were part of the original project.
Exporting Cleaned Data

You can also export just your cleaned data, rather than the entire project.

Click Export and select the file type you want to export the data in.


Tab-separated values (tsv)
Comma-separated values (csv)


That file will be exported to your default Download directory.
The file can then be opened in:


Your favourite spreadsheet program or
imported into programs like R or Python.

Remember from our lesson on Spreadsheets that using widely-supported, non-proprietary file formats like tsv or csv improves the ability of yourself and others to use your data.
Key Points


Cleaned data or entire projects can be exported from OpenRefine.
Projects can be shared with collaborators, enabling them to see, reproduce and check all data cleaning steps you performed.

Other Resources in OpenRefine

Time: 5/5 minutes.
Refer people to the lesson page:

https://datacarpentry.org/OpenRefine-ecology-lesson/06-resources/


This work is licensed under a 
Creative Commons Attribution 4.0 International License.