This is a summarised version of the Data Carpentry's Data Cleaning with OpenRefine for Ecologists lesson.
- Setup
- Introduction
- Working with OpenRefine
- Filtering and Sorting with OpenRefine
- Examining Numbers in OpenRefine
- Scripts from OpenRefine
- Exporting and Saving Data from OpenRefine
- Other Resources in OpenRefine
- Download dataset https://ndownloader.figshare.com/files/7823341. A teaching version of the Portal Database. The data in this lesson is a subset of the teaching version that has been intentionally ‘messed up’ for this lesson.
- OpenRefine (Freebase Gridworks developed by Metaweb, then Google Refine) http://openrefine.org/download.html. Available as open source since 2010.
Time: 10/0 minutes.
- Describe OpenRefine’s uses and applications.
- Differentiate data cleaning from data organization.
- Experiment with OpenRefine’s user interface.
- Locate helpful resources to learn more about OpenRefine.
- You need to keep provenance of data.
- You must export your data at the end, OpenRefine does not save back to the original data.
- You can undo operations.
- OpenRefine helps you clean data.
- Can replay operations on multiple data sets.
- Can use it to illustarte different clustering algorithms.
- Will need a browser (Firefox or Google chrome).
- Download and install OpenRefine http://openrefine.org/download.html.
- Download the data file to be used.
- Run OpeRefne - point to http://localhost:3333/ if it does not open automatically.
- Java program
- Runs locally on your machine
- Does not modify your original data.
- Can undo actions
- Keep track of the actions you have performed on your data and export
- Auto saves so you can pause/resume analysis.
- Standardise and clean data across your file.
- Get an overview of a data set
- Resolve inconsistencies in a data set
- Help you split data up into more granular parts
- Match local data up to other data sets
- Enhance a data set with data from other sources
- Save a set of data cleaning steps to replay on multiple files
- Can work with datasets of up to 100 000 rows.
- OpenRefine is a powerful, free and open source tool that can be used for data cleaning.
- OpenRefine will automatically track any steps you take in working with your data.
Time: 15/20 minutes.
- Create a new OpenRefine project from a CSV file.
- Recall what facets are and how they are used to sort and summarize data.
- Recall what clustering is and how it is applied to group and edit typos.
- Manipulate data using previous steps with undo/redo.
- Employ drop-downs to split values from one column into multiple columns.
- Employ drop-downs to remove white spaces from cells.
- Open OpenRefine
- Open the data file
Portal_rodents_19772002_scinameUUIDs.csv
- Click
Create Project
and selectGet data
fromThis Computer
. - Click
Choose Files
and select the filePortal_rodents_19772002_scinameUUIDs.csv
.- Click
Open
or double-click on the filename.
- Click
- Click
Next>>
under the browse button to upload the data into OpenRefine. - OpenRefine gives you a preview - a chance to show you it understood the file.
- If you change any of the parameters click
Update Preview
(bottom right). If this:- is the wrong file, click
<<Start Over
(upper left). - If all looks well, click
Create Project>>
(upper right).
- is the wrong file, click
- Click
- A ‘Facet’ groups all the like values that appear in a column
- Allows you to filter the data by these values and edit values across many records at the same time.
It allows you to:
- See a big picture of your data, and
- Filter down to just the subset of rows that you want to change in bulk.
Text facet
s group all the identical text values in a column and lists each value with the number of records it appears in.
Here we will use faceting to look for potential errors in data entry in the scientificName
column.
- Scroll over to the
scientificName
column. - Click the down arrow and choose
Facet > Text
facet. - Gives every unique value in the
scientificName
column along with a count of occurrencess in the column. - Can sort by
name
or bycount
.
- Do you notice any problems with the data? What are they?
- Hover the mouse over one of the names in the Facet list.
- You should see that you have an edit function available.
- Can use
edit
to fix an error immediately.
-
OpenRefine will ask whether you want to apply the same correction to every value it finds like that one.
-
OpenRefine offers better ways to find and fix these errors, which we’ll use instead.
-
Learn about these when we talk about clustering.
Can do other type of faceting:
- Numeric facets
- Timeline facets (for dates)
- Custom facets
- Scatterplot facets
Numeric and Scatterplot facets display graphs instead of lists of values.
Custom facets are a range of different types of facets. Some of the default custom facets are:
- Word facet - breaks down text into words and counts & the number of records each word appears in
- Duplicates facet - this results in a binary facet of ‘true’ or ‘false’. Rows appear in the:
- ‘true’ facet if the value in the selected column is an exact match for a value in the same column in another row
- Text length facet - creates a numeric facet based on the length (number of characters) of the text in each row for the selected column.
- Can be useful for spotting incorrect or unusual data in a field where specific lengths are expected
- Facet by blank - a binary facet of ‘true’ or ‘false’. Rows appear in the ‘true’ facet if they have no data present in that column. This is useful when looking for rows missing key data.
Open refines limits the numbers of aggregates it will produce when faceting (2000 by default but can change; http://127.0.0.1:3333/preferences and changing the preference with key ui.browsing.listFacet.limit
and value equal the new limit, such as 3000).
Exercise
- Using faceting, find out how many years are represented in the census.
- For
yr
doFacet > Text
facet. A box will appear in the left panel showing that there are 26 unique entries in this column.
- Is the column formatted as Number, Date, or Text? How does changing the format change the faceting display?
- By default, the column
yr
is formatted as Text. - You can change the format by doing
Edit cells > Common transforms > To number
. - Doing
Facet > Numeric
facet creates a box in the left panel that shows a histogram of the number of entries per year. - Notice that the data is shown as a number, not a date.
- If you instead transform the column to a date, the program will assume all entries are on January 1st of the year.
- Which years have the most and least observations?
- After creating a facet, click Sort by count in the facet box. The year with the most observations is 1997. The least is 1977.
Look for similarities:
- In the
scientificName
Text Facet we created in the step above, click theCluster
button. - In the resulting pop-up window, you can change the
Method
and theKeying Function
. - Select the
key collision
method andmetaphone3
keying function.
- It should identify three clusters.
- Click the Merge? box beside each, then click
Merge Selected and Recluster
to apply the corrections to the dataset.
- Explore, but don’t Merge again; just
Clos
e when you’re done.
We’ll now see other operations that will help us detect and correct the remaining problems, and that have other, more general uses.
- Let us suppose we want to split the
scientificName
column into separate columns forgenus
and forspecies
. - Click the down arrow at the top of the
scientificName
column. ChooseEdit Column > Split into several columns...
. - In the pop-up, in the Separator box, replace the comma with a space.
- Uncheck the box that says
Remove this column
. - Click
OK
. You’ll get some new columns calledscientificName 1
,scientificName 2
, and so on.
- Notice that in some cases scientificName 1 and scientificName 2 are empty.
- Why is this?
- What do you think we can do to fix this?
Exercise Change the name of the second new column to “species”.
Solution
-
Click the down arrow and then
Edit column > Rename this column
. -
Type “species” into the box that appears.
-
A pop-up will appear that says Another column already named species.
-
This is because there is another column where we’ve recorded the species abbreviation.
- You can choose another name like
speciesName
for this column or - change the other species column name to
speciesAbbreviation
.
- You can choose another name like
- Click where it says
Undo / Redo
.
- All the changes you have made to the data so far are listed here.
- Click on the step that you want to go back to, in this case the previous step.
- The added columns will disappear.
- Notice that you can still click on the last step and make the columns reappear, and toggle back and forth between these states.
- Leave the dataset in the state in which the
scientificNames
were clustered, but not yet split.
- In the header for the column
scientificName
, chooseEdit cells > Common transforms > Trim leading and trailing whitespace
. - Notice that the
Split
step has now disappeared from theUndo / Redo
and is replaced with a Text transform on 3 cells. - Perform the same
Split
operation onscientificName
that you undid earlier. This time you should only get two new columns. Why?
Go back to the unsplit view.
- Faceting and clustering approaches can identify errors or outliers in data.
Time: 10/10 minutes
- Employ text filter or include/exclude to filter to a subset of rows.
- Sort tables by a column.
- Sort tables by multiple columns.
Time: 10/10 minutes.
- Click the down arrow next to
scientificName > Text
filter.
- A
scientificName
facet will appear on the left margin.
- Type in
bai
and press return.
- There are 48 matching rows of the original 35549 rows (and these rows are selected for the subsequent steps).
- At the top, change the view to Show 50 rows.
- This way you will see all the matching rows.
Exercise
- What scientific names (genus and species) are selected by this procedure?
Baiomys taylori
andChaetodipus baileyi
.
- How would you restrict this to one of the species selected?
- make the search case sensitive
- split the
scientificName
column into species and genus before filtering - include more letters in your filter
Can filter further by include
or exlude
an entry in a facet.
Exercise
Use include / exclude
to select only entries from one of these two species.
Go back to viewing both species.
You can sort data. Column > Sort...
.
Exercise
Sort by month. How can you ensure that months are in order?
- If you try to re-sort a column that you have already used, the drop-down menu changes slightly, to
> Sort
without the...
Sort > Sort...
- This option enables you to modify your original sort.Sort > Reverse
- This option allows you to reverse the order of the sort.Sort > Remove sort
- This option allows you to undo your sort.
Exercise
Sort the data by plot
. What year(s) were observations recorded for plot 1 in this filtered dataset.
- In the
plot
column, selectSort... > numbers
and select smallest first. - The years represented are 1990 and 1995.
- You can sort by multiple columns by performing sort on additional columns.
- The sort depends on the order in which you select columns to sort.
- To restart the sorting process with a particular column
- Check the
sort by this column alone
box in the Sort pop-up menu or - remove the sort in the top level link.
- Check the
Exercise
You might like to look for trends in your data by month of collection across years.
- How do you sort your data by month?
- For the
mo
column, click onSort...
and then numbers.
- How would you do this differently if you were instead trying to see all of your entries in chronological order?
- Remove existing sort or use
sort by this column alone
. - Sort by
yr
then 'mo' thendy
.
- OpenRefine provides a way to sort and filter data without affecting the raw data.
Time: 10/10 minutes.
- Transform a text column into a number column.
- Identify and modify non-numeric values in a column using facets.
- Use scatterplot facet to examine relationships among columns.
- By default columns are read in as text.
- Previous sorts interpreted the text as numbers but not changed the data.
- Can changed the data by
Edit cells > Common transforms
- Remove any text formatting.
- Change
RecordID
byEdit cells > Common transforms > to number
- Values go from left-justified to right-justified.
Exercise
Transform three more columns, including period
, from text to numbers. Can all columns be transformed to numbers?
- Only observations that include only numerals (0-9) can be transformed to numbers.
- If you apply a number transformation to a column that doesn’t meet this criteria
- Click the
Undo / Redo
tab, you will see a step that starts with Text transform on 0 cells.- This means that the data in that column was not transformed.
- Font goes from black to green.
- Click the
- Sometimes there are non-number values or blanks in a column which may represent errors in data entry.
- We can do that with a
Numeric
facet.
Exercise
- For a column you transformed to numbers, edit one or two cells, replacing
- numbers with text (such as abc) or
- blank (no number or text).
- Use the pulldown menu to apply a numeric facet to the column you edited. The facet will appear in the left panel.
- Notice that there are several checkboxes in this facet:
Numeric
,Non-numeric
,Blank
, andError
.
- There are counts of the number of cells in each category.
- You should see checks for
Non-numeric
andBlank
if you changed some values.
- Experiment with checking or unchecking these boxes to select subsets of your data.
- When done examining the numeric data, remove this facet by clicking the
x
in the upper left corner of its panel. - Note that this does not undo the edits you made to the cells in this column.
- If you want to reverse these edits, use the
Undo / Redo
function.
- If you want to reverse these edits, use the
- Relate multiple columns representing numbers using the
scatterplot
facet.- Select a numeric column, for example
recordID
- Use the pulldown menu to
Facet > Scatterplot
facet. - A new window called Scatterplot Matrix appears.
- There are plots for each pair of numeric columns organized in an upper right triangle.
- Each plot has little dots for the cell values from each row.
- Select a numeric column, for example
Exercise
- Examine the scatterplots overall. Do the patterns make sense?
- Why does the scatterplot for
recordID
vsperiod
have the pattern it does?
- We can examine one pair of columns by clicking on its square in the
Scatterplot Matrix
. - A new facet with only that pair will appear in the left margin.
Exercise
- Click in the scatterplot facet in the left margin and drag to highlight a rectangle.
- This will subset the data to those entries.
Exercise
- Click on the
Scatterplot Matrix
square forrecordID
andperiod
to get that as a facet in the left margin. - Redo the Text
filter
onscientificName
to show only entries including the lettersbai
.- Notice the change in the scatterplot.
- It might be easier to see if you click export plot to put it on a new browser tab.
- OpenRefine also provides ways to get overviews of numerical data.
Time: 10/5 minutes.
- Describe how OpenRefine generates JSON (Java Script Object Notation) code.
- Demonstrate ability to export JSON code from OpenRefine.
- Save JSON code from an analysis.
- Apply saved JSON code to an analysis.
- OpenRefine saves every change you make to the dataset.
- Changes are saved in a format known as JSON (JavaScript Object Notation).
- You can export this JSON script and apply it to other data files.
- If you had multiple files to clean, and they all had the same type of errors
- e.g. species name misspellings, leading white spaces
- All files had the same column names
- You can save the steps you took as a JSON script:
- Open a new file to clean in OpenRefine,
- aste in the script and run it.
- Gives you a quick way to clean all of your related data.
- If you had multiple files to clean, and they all had the same type of errors
Let's create a script:
- In the
Undo / Redo
section a. click Extract... b. Select the steps that you want to apply to other datasets by clicking the check boxes. - Copy the code from the right hand panel and paste it into a text editor.
NotePad
on Windows orTextEdi
t on Mac.- Make sure it saves as a plain text file.
- In
TextEdit
, do this by selectingFormat > Make plain text
andsave
the file as a txt file.
- In
-
Rename the project.
-
Download an uncleaned version of the dataset:
- https://ndownloader.figshare.com/files/7823341 or
- use the version of the raw dataset you saved to your computer.
-
Start a new project in OpenRefine with this file and name it something different from your existing project.
-
Click the
Undo / Redo tab > Apply
and paste in the contents of txt file with the JSON code. -
Click
Perform operations
. -
The dataset should now be the same as your other cleaned dataset.
- Now, that you know how scripts work, you may wonder how to use them in your own scientific research.
- For inspiration, you can read more about the succesful application of the reproducible science principles:
- Marwick et al. Computational Reproducibility in Archaeological Research: Basic Principles and a Case Study of Their Implementation, 2017.
- Stewart Lowndes et al. Our path to better science in less time using open data science tools, Nature ecology and evolution, 2017.
- All changes are being tracked in OpenRefine
- These can be used for scripts for future analyses or
- Reproducing an analysis.
- Scripts can (and should) be published together with the dataset as part of the digital appendix of the research output.
Time: 10/5 minutes.
- Save an OpenRefine project.
- Export cleaned data from an OpenRefine project.
- OpenRefine saves as you go along.
- You can close your browser and resume your project at a later time.
You can export a project:
- Send to a collaborator: raw data -> cleaned data.
- Publish this as part of your analysis.
To do this:
- Click the
Export
button in the top right and selectExport project
. - A
tar.gz
file will download to your default Download directory.- The
tar.gz
extension tells you that this is a compressed file that this file contains multiple files. - You can double-click on the
tar.gz
file and it will expand into a directory.- On Windows, opening tar.gz files requires additional software such as:
- 7-zip or
- WinZip.
- On Windows, opening tar.gz files requires additional software such as:
- Once installed associate with the type of file.
- The
- Look at the files that appear in this folder. What files are here? What information do you think these files contain?
- You should see:
- A history folder that contains a zip file for each applied change.
- Each zip file contains
- A
change.tx
t file which records each individual transformation that you did to your data.
- A
- A
data.zip
file which includes adata.txt
which is a copy of your raw data. - You may also see other files.
You can import an existing project into OpenRefine by:
- Clicking
Open...
in the upper rightImport Project
- Select the
tar.gz
project file.
This project includes all of the raw data and cleaning steps that were part of the original project.
You can also export just your cleaned data, rather than the entire project.
- Click
Export
and select the file type you want to export the data in.
- Tab-separated values (tsv)
- Comma-separated values (csv)
- That file will be exported to your default
Download
directory. - The file can then be opened in:
- Your favourite spreadsheet program or
- imported into programs like R or Python.
Remember from our lesson on Spreadsheets that using widely-supported, non-proprietary file formats like tsv or csv improves the ability of yourself and others to use your data.
- Cleaned data or entire projects can be exported from OpenRefine.
- Projects can be shared with collaborators, enabling them to see, reproduce and check all data cleaning steps you performed.
Time: 5/5 minutes.
Refer people to the lesson page:
This work is licensed under a Creative Commons Attribution 4.0 International License.