Skip to content

Instantly share code, notes, and snippets.

@nmagee
Last active February 8, 2025 19:43
Show Gist options
  • Save nmagee/dd2dbb39426376914eb10618b94ae81b to your computer and use it in GitHub Desktop.
Save nmagee/dd2dbb39426376914eb10618b94ae81b to your computer and use it in GitHub Desktop.
Lab 3

DS2002 Lab 3 - Data Cleaning Scripts

Sample Data

  • Sample data - https://s3.amazonaws.com/ds2002-resources/labs/lab3-bundle.tar.gz - tar-zipped TSV
  • Stock Data - https://s3.amazonaws.com/ds2002-resources/labs/stock_data.tsv - TSV
  • Flight Log - https://s3.amazonaws.com/ds2002-resources/labs/flights.csv - CSV

1. Write a bash script to fetch and decompress a remote bundle

Write a bash script that can be run by passing the URL of a remote TAR bundle with it, like this:

./fetch_script.sh https://s3.amazonaws.com/ds2002-resources/labs/lab3-bundle.tar.gz

The script should:

  1. Fetch the remove bundle.
  2. Decompress the downloaded bundle.
  3. Convert the TSV file to a CSV file, resulting in a new CSV file.

2. Write a bash script that removes blank lines from a text file

Write a script that takes a local CSV file as one parameter, and removes any blank lines from the file.

The resulting output should be written to a new file given a name from a second parameter passed to the script.

Example - script and two parameters:

./remove_blanks.sh my-jumbled-file.csv my-clean-file.csv

A reminder that two ways to do this using bash are:

# awk can remove spaces
awk '!/^[[:space:]]*$/' myfile.tsv

# tr can remove spaces
cat myfile.tsv | tr -s '\n' > my_new_file.tsv

3. Write a Python3 script that cleans a CSV file

Write a script in Python that takes a local CSV filename as a parameter, and performs the following steps.

  1. Loads the file into a Pandas dataframe
  2. Removes all records where there are empty, NULL, or NaN values present in any column.
  3. Removes any duplicate records
  4. Validates rows have been removed by printing the row count between steps
  5. Saves the cleaned dataframe to a new CSV file.

Use the flights file above to test with.

Refer to the python file in this gist for Pandas/CSV reference

4. BONUS

Test/sample data is usually referred to as "synthetic data". Try one (or both) of the following:

  • Using Mockaroo or another online tool, generate a large dataset with some duplicate rows, missing data, etc. Using your scripts above clean the data file yourself.
  • Using a Python library (there are several) create your own synthetic data. Here is an example walkthrough.

Submit Your Work

Create a new GitHub gist and add each of these scripts as separate scripts within the same gist. Submit the URL for your gist into Canvas for grading.

import pandas as pd
# Basic CSV import
df = pd.read_csv('flights.csv')
# To count rows in a dataframe
df.count()
# To print out the dataframe
print(df.to_string())
# To remove any columns that are blank or missing:
df.dropna(inplace = True)
# To remove any duplicate rows
df.drop_duplicates(inplace = True)
# To write the dataframe out to a new CSV
df.to_csv('filename.csv')
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment