Skip to content

Instantly share code, notes, and snippets.

@marioa
Last active June 13, 2019 08:23
Show Gist options
  • Save marioa/1e7eabda16d6d39016120220d42481bf to your computer and use it in GitHub Desktop.
Save marioa/1e7eabda16d6d39016120220d42481bf to your computer and use it in GitHub Desktop.
Carpentries Ecology spreadsheets lesson.

Spreadsheets

This is a reduced version of the Carpentries ecology spreadsheet lesson. There are also Carpentry instructor notes.

Contents

Overview

The lesson covers:

  • Good data entry practices - formatting data tables in spreadsheets.
  • How to avoid common formatting mistake.
  • Approaches for handling dates in spreadsheets.
  • Basic quality control and data manipulation in spreadsheets.
  • Exporting data from spreadsheets.

The lesson does not cover data processing but the data wrangling (making it fit for purpose).

Many examples of spreadsheet packages:

Setup

  1. Download data:

Ernest, M., Brown, J., Valone, T., and White, E.P. (2017). Portal Project Teaching Database. Version 6. Figshare. DOI: 10.6084/m9.figshare.1314459.v6.

Data has been messed-up for this lesson.

Introduction

Time: 15 minutes.

Objectives

  • Implement best practices in data table formatting
  • Identify and address common formatting mistakes
  • Understand approaches for handling dates in spreadsheets
  • Utilize basic quality control features and data manipulation practices
  • Effectively export data from spreadsheet programs

Exercise:

  • How many people have used spreadsheets in their research?
  • How many people have accidentally done something that made them frustrated or sad?

Can use spreadsheets for:

  • Data entry
  • Organizing data
  • Subsetting and sorting data
  • Statistics
  • Plotting

Reproducibility can be an issue.

Formatting data tables in Spreadsheets

Time: 15 minutes.

Objectives

  • Describe best practices for data entry and formatting in spreadsheets.
  • Apply best practices to arrange variables and observations in a spreadsheet.

Cardinal rules of using spreadsheet programs for data:

  1. Put all your variables in columns - the thing you’re measuring, like ‘weight’ or ‘temperature’.
  2. Put each observation in its own row.
  3. Don’t combine multiple pieces of information in one cell. Sometimes it just seems like one thing, but think if that’s the only way you’ll want to be able to use or sort that data.
  4. Leave the raw data raw - don’t change it!
  5. Export the cleaned data to a text-based format like CSV (comma-separated values) format. This ensures that anyone can use the data, and is required by most data repositories.
  • Download the data set that is going to be used.
    • observations of a small mammal community in southern Arizona.
    • studying the effects of rodents and ants on the plant community.
    • study has been running for almost 40 years.
    • rodents are sampled on a series of 24 plots, with different experimental manipulations controlling which rodents are allowed to access which plots.

Real dataset used in about 100 publications. Has been simplified a bit for this course.

Exercise: [10 mins]

  1. Download the data.
  2. Open the data in a spreadsheet programme.
  3. Two tabs for data collected in 2013, 2014.
  4. Pair up, identify what is wrong with the spreadsheet. What steps would you take to clean-up the data. Do not modify the original data. Create a new tab and clean the data there.

Hadley Wickham, Tidy Data, Vol. 59, Issue 10, Sep 2014, Journal of Statistical Software. http://www.jstatsoft.org/v59/i10.

Some issues with the spreadsheet:

  • 2013 Tab,
    • DM species has the "g" units in the cells
    • Missing weights, other data entered but not the values - what does that mean?
    • Species DS table has Weights where the scales are not calibrated - what to do with those?
  • 2014 Tab,
    • Tables based on Plot, previously species
    • Plot 2 table has some uncalibrated values.
    • Plot 3 table has a missing year, not shown, when expanded it is 2015 - should it be 2014?
    • Plot 4 table has 1978 as the year, mistake?
    • Plot 4 table combined the gender and species (can change Data tab -> Text to Columns)

Key Points

  • Never modify your raw data. Always make a copy before making any changes.
  • Keep track of all of the steps you take to clean your data in a plain text file.
  • Organize your data according to tidy data principles.

Formatting problems

Time: 20 mins.

Objectives

  • Formatting problems

Common Spreadhseet Errors

  • Using multiple tables:
    • computer sees a row as a data entity.
    • You are artificially creating boundaries.
    • Use the same header in multiple places.
  • Using multiple tabs:
    • You may introduce inconsistences between tabs.
    • You will have to tell the data how to combine tabs.

Instead of using tabs you could add a new column. Talk about freezing rows/columns

  • Not filling in zeros:
    • Implicit assumption of values, not the same as having a value.
    • Computer interprets it as a missing value as a null value and not a zero.
  • Using problematic null values:
    • using -999 or other numerical values (or zero) to represent missing data.
    • Table 1, p8 of White et al, 2013 provide pros/cons different strategies for null values/
  • Using formatting to convey information:
    • Use formatting to convey semantics -> create a new field to encode this information.
  • Using formatting to make the data sheet look pretty:
    • e.g. merging cells, may produce problems when automatically reading data.
  • Placing comments or units in cells:
    • Cannot usually read comments in cells.
  • Entering more than one piece of information in a cell:
    • It makes automated analysis more difficult, have to separate - use different cells instead.
  • Using problematic field names:
    • Embedded spaces, use of - could cause problems, strange characters, etc. will add to the difficulty of any anallysis.
  • Using special characters in data:
    • May make your data harder to interpret.
  • Inclusion of metadata in data table:
    • e.g. a legend, information about provenance, a data dictionary, etc. You may not remember, it facilitates reuse of your data
    • Separate from the data, include a readme file (do not want to decouple too far), point back to the metadata.

Key Points

  • Avoid using multiple tables within one spreadsheet.
  • Avoid spreading data across multiple tabs.
  • Record zeros as zeros.
  • Use an appropriate null value to record missing data.
  • Don’t use formatting to convey information or to make your spreadsheet look pretty.
  • Place comments in a separate column.
  • Record units in column headers.
  • Include only one piece of information in a cell.
  • Avoid spaces, numbers and special characters in column headers.
  • Avoid special characters in your data.
  • Record metadata in a separate plain text file.

Dates as data

Time: 10 mins.

Objectives

  • Describe how dates are stored and formatted in spreadsheets.
  • Describe the advantages of alternative date formatting in spreadsheets.
  • Demonstrate best practices for entering dates in spreadsheets.

Exercise

Challenge: pulling month, day and year out of dates

  • In the dates tab of your spreadsheet you have the data from 2014 plot 3. There’s a Date collected column.
  • Let’s extract month, day and year from the dates to new columns. For this we can use the built in Excel functions
YEAR() MONTH() 
DAY()

(Make sure the new column is formatted as a number and not as a date.)

You can see that even though you wanted the year to be 2014, your spreadsheet program automatically interpreted it as 2015, the year you entered the data.

Exercise

Challenge: pulling hour, minute and second out of the current time

Current time and date are best retrieved using the functions NOW(), which returns the current date and time, and TODAY(), which returns the current date. The results will be formatted according to your computer’s settings.

  1. Extract the year, month and day from the current date and time string returned by the NOW() function.
  2. Calculate the current time using NOW()-TODAY().
  3. Extract the hour, minute and second from the current time using functions HOUR(), MINUTE() and SECOND().
  4. Press F9 to force the spreadsheet to recalculate the NOW() function, and check that it has been updated.

Solution

  1. To get the year, type =YEAR(NOW()) into any cell in your spreadsheet. To get the month, type =MONTH(NOW()). To get the day, type =DAY(NOW()).
  2. Typing =NOW()-TODAY() will result in a decimal value that is not easily human parsable to a clock-based time. You will need to use the strategies in the third part of this challenge to convert this decimal value to readable time.
  3. To extract the hour, type =HOUR(NOW()-TODAY()) and similarly for minute and second.
  • It is much safer to store dates with YEAR, MONTH, DAY in separate columns or as YEAR and DAY-OF-YEAR in separate columns.
  • Excel is unable to parse dates before 1900-01-01, cf. going from Julian to Gregorian Calendar, see this article.
  • Excel stores dates as numbers => can add numbers and it will get the answer right.
  • Adding months/years a bit more complex:
    • First we extract the single entities (day, month or year)
    • We can then add values to to that
    • Finally the complete date string is reconstructed using the DATE() function.

Exercise What happens to the dates in the “dates” tab of our workbook if we save this sheet in Excel (in csv format) and then open the file in a plain text editor (like TextEdit or Notepad)? What happens to the dates if we then open the csv file in Excel?

Answer

  • Click on the “dates” tab.
    • Double-click on any of the values in the Date collected column.
    • The dates display the year as 2015.
  • Select File -> Save As in Excel and in the drop down menu for file format select CSV UTF-8 (Comma delimited) (.csv). Click Save.
  • Navigate to the file in your finder application. Right click and select Open With. Choose a plain text editor application and view the file. Notice that the dates display as month/day without any year information.
  • Now right click on the file again and open with Excel.
    • Notice that the dates display with the current year, not 2015.
    • Exporting data from Excel and then importing it back into Excel fundamentally changed the data!

Key Points

  • Treating dates as multiple pieces of data rather than one makes them easier to handle.

Quality control

Time: 20 minutes

Objectives

  • Apply quality control techniques to identify errors in spreadsheets and limit incorrect data entry.

Quality assurnace

  • Quality assurance stops bad data from ever being entered by checking to see if values are valid during data entry.

Use Data Validation (Excel) or Validity (Libre Office Calc).

  1. Select the cells or column you want to validate
  2. On the Data tab select Data Validation
  3. In the Allow box select the kind of data that should be in the column.

Try out an example. Set the plot column in our spreadsheet to only allow plot values that are integers between 1 and 24.

Quality Control

  • Make sure you have saved a copy of your data.
  • Document the process. Write a readme, guidelines for writing a good readme.

Sorting

Can use this to find bad/suspect values.

Exercise

We’ve combined all of the tables from the messy data into a single table in a single tab. Download this semi-cleaned data file to your computer: survey_sorting_exercise (https://github.com/datacarpentry/spreadsheet-ecology-lesson/blob/gh-pages/data/survey_sorting_exercise.xlsx?raw=true).

Once downloaded, sort the Weight_grams column in your spreadsheet program from Largest to Smallest.

What do you notice?

Conditional formatting

Conditional formatting basically can do something like color code your values by some criteria or lowest to highest. This makes it easy to scan your data for outliers.

Exercise

  1. In the main Excel menu bar, click Home > Conditional Formatting... choose a formatting rule.
  2. Apply a 2-Color Scale formatting rule with the lowest values set to orange and the highest values set to yellow.
  3. Now we can scan through and different colors will stand out. Do you notice any strange values?

Solution

  • Cells that contain non-numerical values are not colored. This includes both the cells where the letter “g” was included and the empty cells.

Key Points

  • Always copy your original spreadsheet file and work with a copy so you don’t affect the raw data.
  • Use data validation to prevent accidentally entering invalid data.
  • Use sorting to check for invalid data.
  • Use conditional formatting (cautiously) to check for invalid data.

Exporting data

Time: 10 minutes

Objectives

  • Store spreadsheet data in universal file formats.
  • Export data from a spreadsheet to a CSV file.

Why?

  • It is a proprietary format, and it is possible that in the future, technology won’t exist (or will become sufficiently rare) to make it inconvenient, if not impossible, to open the file.
  • Other spreadsheet software may not be able to open files saved in a proprietary Excel format.
  • Different versions of Excel may handle data differently, leading to inconsistencies.
  • Journals and grant agencies are requiring you to deposit your data in a data repository, most of them don’t accept Excel format.

The above points also apply to other formats such as open data formats used by LibreOffice / Open Office. These formats are not static and do not get parsed the same way by different software packages.

Saving csv (comma seprarted values) or tsv (tab separated values) makes your code much more portable.

To save a file you have opened in Excel in CSV format:

  1. From the top menu select File and Save as.
  2. From the Format field, from the list, select Comma Separated Values (*.csv).
  3. Double check the file name and the location where you want to save it and hit Save.

Cross-platform Operability

For new lines:

  • Unix use a \n (line feed)
  • Windows use \r\n (carriage return + line feed)

May sometimes see spurious characters because of this. You need to take care that this does not happen to your data.

Key Points

  • Data stored in common spreadsheet formats will often not be read correctly into data analysis software, introducing errors into your data.
  • Exporting data from spreadsheets to formats like CSV or TSV puts it in a format that can be used consistently by most programs.

Creative Commons Licence
This work is licensed under a Creative Commons Attribution 4.0 International License.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment