Skip to content

Instantly share code, notes, and snippets.

@martinburch
Last active February 9, 2024 21:12
Show Gist options
  • Save martinburch/6057011 to your computer and use it in GitHub Desktop.
Save martinburch/6057011 to your computer and use it in GitHub Desktop.
One liners: commands to clean up your data using csvkit with csvpys on a Mac. ("So your data doesn't get the last laugh!")
#!/bin/bash
# Install csvkit with csvpys
# (csvpys hasn't been pulled back into the main csvkit repo that you can pip install)
git clone https://github.com/cypreess/csvkit.git
cd csvkit
python setup.py build
sudo python setup.py install
cd ~
# Depending on how you're using Python, you might need to add the install path to your PATH
nano ~/.bash_profile
# add /usr/local/share/python or whatever directory the tools are living in
# not sure? try sudo find / -name csvcut
# Reload your PATH
source ~/.bash_profile
# Normalize your data, if necessary
csvclean data.csv
# Slice off any columns you won't be needing, if necessary
csvcut -c 2,3,4,5 data_out.csv > cut.csv
# Remove any extra header rows that can't handle the python string operations you're about to throw
# This example deletes row 2
sed 2d cut.csv > sed.csv
# You can also do sed -i to do "in place" replacement, but sed is a dangerous tool
# so proceed carefully and don't mangle your data
# Do multiple cleanups at once
csvpys -s NEW-COL-NAME "c[2].strip()" -s SECOND-NEW-COL-NAME "c[3].strip()" -s THIRD-NEW-COL-NAME "int(c[4].strip())" -s FOURTH-NEW-COL-NAME "c[1].strip()" sed.csv > strip.csv
# NB: you can wrap a field in int() to make it an integer but not in strip() to remove whitespace
# Strip only works on objects that are already strings.
# Remove the old columns that csvpys keeps, and reorder any columns as necessary
csvcut -c 8,5,6,7 strip.csv > reorder.csv
# Not sure of your column IDs? Try csvcut -n strip.csv
# Integrity check: does number of rows match original number of rows?
wc -l reorder.csv
# should match
wc -l sed.csv
# Remove rows where a column (col 2 in this case) is empty
csvgrep -c 2 -r '^.' reorder.csv > blanks_removed.csv
# -r '^.' matches any character at the beginning of the field
# possible bug in csvkit: -m '' -i is not a valid way to match nonblanks
# Replace the header row with column names of your choosing
cat blanks_removed.csv | sed '1 s/.*/renamed_column,renamed_col2,third_renamed_column,col_no_four/' > headers.csv
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment