- Sample data - https://s3.amazonaws.com/ds2002-resources/labs/lab3-bundle.tar.gz - tar-zipped TSV
- Stock Data - https://s3.amazonaws.com/ds2002-resources/labs/stock_data.tsv - TSV
- Flight Log - https://s3.amazonaws.com/ds2002-resources/labs/flights.csv - CSV
Write a bash script that takes the URL of a remote TAR bundle as its only argument, like this:
./fetch_script.sh https://s3.amazonaws.com/ds2002-resources/labs/lab3-bundle.tar.gz
The script should:
- Fetch the remote bundle.
- Decompress the downloaded bundle.
- Convert the extracted TSV file to CSV, writing the result to a new CSV file.
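The three steps above could be sketched roughly as follows. This is one possible approach, not a reference solution. It assumes that `curl` and `tar` are installed, that the bundle unpacks one or more `.tsv` files into the current directory, and that no field contains a comma, quote, or embedded tab (so a plain tab-to-comma swap is safe). The function names `tsv_to_csv` and `main` are illustrative.

```shell
#!/usr/bin/env bash
# fetch_script.sh: download a tar.gz bundle, unpack it, convert TSV to CSV.
set -eu

# Convert one TSV file to CSV by replacing every tab with a comma.
# Assumption: fields never contain commas, quotes, or tabs.
tsv_to_csv() {
  local tsv_file="$1"
  local csv_file="${tsv_file%.tsv}.csv"   # same name, .csv extension
  tr '\t' ',' < "$tsv_file" > "$csv_file"
  echo "Wrote $csv_file"
}

main() {
  local url="$1"
  local bundle
  bundle="$(basename "$url")"             # e.g. lab3-bundle.tar.gz

  # -f: fail on HTTP errors, -sS: quiet but show errors, -L: follow redirects.
  curl -fsSL -o "$bundle" "$url"

  # Unpack the gzip-compressed tar archive into the current directory.
  tar -xzf "$bundle"

  # Assumption: the TSV file(s) land in the current directory.
  for f in *.tsv; do
    [ -e "$f" ] || continue               # skip if the glob matched nothing
    tsv_to_csv "$f"
  done
}

# Run only when exactly one URL is supplied; otherwise show usage.
if [ "$#" -eq 1 ]; then
  main "$1"
else
  echo "Usage: $0 <bundle-url>" >&2
fi
```

If the real data contains quoted fields or embedded commas, the `tr` swap would produce a malformed CSV, and a CSV-aware tool would be a safer choice.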
Write a script that takes a local CSV file as its first parameter and removes any blank lines from the file.
The resulting output should be written to a new file whose name is given by a second parameter passed to the script.
Example - script and two parameters:
./remove_blanks.sh my-jumbled-file.csv my-clean-file.csv
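As a reminder, two ways to do this in bash are:
# awk can drop empty and whitespace-only lines
awk '!/^[[:space:]]*$/' myfile.tsv
# tr can squeeze repeated newlines, which removes empty lines
# (lines containing only spaces or tabs are kept)
tr -s '\n' < myfile.tsv > my_new_file.tsv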
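As a rough illustration, the awk approach can be wrapped in a two-parameter script like the one below. This is a sketch rather than a model answer. The function name `remove_blanks` is illustrative, and the script assumes the input and output paths are different files, since redirecting onto the input would truncate it before awk reads it.

```shell
#!/usr/bin/env bash
# remove_blanks.sh: copy a CSV file, dropping empty and whitespace-only lines.
set -eu

# Write every line of the input file ($1) that contains at least one
# non-whitespace character to the output file ($2).
remove_blanks() {
  awk '!/^[[:space:]]*$/' "$1" > "$2"
}

# Run only when both file names are supplied; otherwise show usage.
if [ "$#" -eq 2 ]; then
  remove_blanks "$1" "$2"
  echo "Wrote cleaned file to $2"
else
  echo "Usage: $0 <input.csv> <output.csv>" >&2
fi
```

Because `[[:space:]]` also matches carriage returns, lines that contain only a Windows-style line ending are dropped as well.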
Write a script in Python that takes a local CSV filename as a parameter, and performs the following steps.
- Loads the file into a Pandas dataframe
- Removes all records where there are empty, NULL, or NaN values present in any column.
- Removes any duplicate records
- Validates rows have been removed by printing the row count between steps
- Saves the cleaned dataframe to a new CSV file.
Use the flights file above to test with.
Refer to the Python file in this gist for a Pandas/CSV reference.
Test/sample data is usually referred to as "synthetic data". Try one (or both) of the following:
- Using Mockaroo or another online tool, generate a large dataset with some duplicate rows, missing data, etc. Then clean the data file yourself using your scripts above.
- Using a Python library (there are several) create your own synthetic data. Here is an example walkthrough.
Create a new GitHub gist and add each of these scripts as a separate file within the same gist. Submit the URL for your gist in Canvas for grading.