nmagee/README.md

## README.md

      
    Raw
  

              README.md
            
          
    DS2002 Lab 3 - Data Cleaning Scripts

Sample Data


Sample data - https://s3.amazonaws.com/ds2002-resources/labs/lab3-bundle.tar.gz - tar-zipped TSV
Stock Data - https://s3.amazonaws.com/ds2002-resources/labs/stock_data.tsv - TSV
Flight Log - https://s3.amazonaws.com/ds2002-resources/labs/flights.csv - CSV

1. Write a bash script to fetch and decompress a remote bundle

Write a bash script that can be run by passing the URL of a remote TAR bundle with it, like this:
./fetch_script.sh https://s3.amazonaws.com/ds2002-resources/labs/lab3-bundle.tar.gz

The script should:

Fetch the remove bundle.
Decompress the downloaded bundle.
Convert the TSV file to a CSV file, resulting in a new CSV file.

2. Write a bash script that removes blank lines from a text file

Write a script that takes a local CSV file as one parameter, and removes any blank lines from the file.
The resulting output should be written to a new file given a name from a second parameter passed to the script.
Example - script and two parameters:
./remove_blanks.sh my-jumbled-file.csv my-clean-file.csv

A reminder that two ways to do this using bash are:
# awk can remove spaces
awk '!/^[[:space:]]*$/' myfile.tsv

# tr can remove spaces
cat myfile.tsv | tr -s '\n' > my_new_file.tsv

3. Write a Python3 script that cleans a CSV file

Write a script in Python that takes a local CSV filename as a parameter, and performs the following steps.

Loads the file into a Pandas dataframe
Removes all records where there are empty, NULL, or NaN values present in any column.
Removes any duplicate records
Validates rows have been removed by printing the row count between steps
Saves the cleaned dataframe to a new CSV file.

Use the flights file above to test with.
Refer to the python file in this gist for Pandas/CSV reference
4. BONUS

Test/sample data is usually referred to as "synthetic data". Try one (or both) of the following:

Using Mockaroo or another online tool, generate a large dataset with some duplicate rows, missing data, etc. Using your scripts above clean the data file yourself.
Using a Python library (there are several) create your own synthetic data. Here is an example walkthrough.

Submit Your Work

Create a new GitHub gist and add each of these scripts as separate scripts within the same gist. Submit the URL for your gist into Canvas for grading.

  
## pandas-csv.py
import pandas as pd

# Basic CSV import
df = pd.read_csv('flights.csv')

# To count rows in a dataframe
df.count()

# To print out the dataframe
print(df.to_string())

# To remove any columns that are blank or missing:
df.dropna(inplace = True)

# To remove any duplicate rows
df.drop_duplicates(inplace = True)

# To write the dataframe out to a new CSV
df.to_csv('filename.csv')
	import pandas as pd

	# Basic CSV import
	df = pd.read_csv('flights.csv')

	# To count rows in a dataframe
	df.count()

	# To print out the dataframe
	print(df.to_string())

	# To remove any columns that are blank or missing:
	df.dropna(inplace = True)

	# To remove any duplicate rows
	df.drop_duplicates(inplace = True)

	# To write the dataframe out to a new CSV
	df.to_csv('filename.csv')