Skip to content

Instantly share code, notes, and snippets.

@freejoe76
Last active November 28, 2018 21:29
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save freejoe76/8e7bc51416a6ffad3f9bc2e7aa636822 to your computer and use it in GitHub Desktop.
Save freejoe76/8e7bc51416a6ffad3f9bc2e7aa636822 to your computer and use it in GitHub Desktop.

Csvkit intro

A walk-through on slimming down a CSV, getting some information about its fields, slimming down the columns we're working with and pulling information about those columns.

https://csvkit.readthedocs.io/en/1.0.2/scripts/csvcut.html

High-level look at the spreadsheet

Csvkit can work slow on large CSVs, so for the sake of this walk-through we're going to work with 5,000 rows of the CSV.

head -n 5000 2014.CSV > test.csv

See the meta information on the spreadsheet file, according to the filesystem:

stat test.csv

See some high-level analysis of the data in the spreadsheet itself:

csvstat test.csv

Pull out the columns we want to look at

This command pulls the first two columns:

csvcut test.csv -c 1,2

This command pulls the 5th, 7th, 9th:

csvcut test.csv -c 5,7,9

This command pulls the 5th, 7th, 9th and runs the high-level analysis of it. It uses the | character to connect the output of the csvcut command to the csvstat command. Using the | character (the pipe character) is something you'll do in other places too:

csvcut test.csv -c 5,7,9 | csvstat

Now, we're going to pull the contents of the fifth column and put that in its own CSV, for parsing later. We do that with the > character, which like the | is available to all programs. The > will take the output of the preceding command and put it in the place specified.

csvcut test.csv -c 5 > workstat.csv

Do some preliminary searches & analysis of the new spreadsheet

grep is a command that searches a file line-by-line for a particular string.

This command searches our filtered spreadsheet for every instance of "WA":

grep WA workstat.csv

wc is short for "word count," which counts words, lines and paragraphs. wc -l is a shortcut for line count, and paired with grep (with the |) we get a count of the number of lines a particular string shows up in:

grep WA workstat.csv | wc -l

Say we want to count more than one string. We could use the ; to put two separate commands on one line:

grep WA workstat.csv | wc -l; grep CO workstat.csv | wc -l

But say we needed to count five strings. That's a long command. We can use a for loop instead:

Basic for loop:

for STATE in 'WA' 'CO' 'AZ' 'CA'; do echo $STATE; done

For loop tuned to our spreadsheet:

for STATE in 'WA' 'CO' 'AZ' 'CA'; do grep $STATE workstat.csv | wc -l; done

Here's cat

cat is short for concatenate, and is a way you can combine multiple files. It's also a way you can show the contents of a particular file, like so:

cat us-state-postal-codes

More advanced commands

Our goal:

We're going to use cat in a for loop that will get increasingly complicated as it gets closer to everything we want it to do.

First: Print each state out, one by one.

for STATE in `cat us-state-postal-codes`; do echo $STATE; done

Next: Use grep to search for instances of each state.

for STATE in `cat us-state-postal-codes`; do grep $STATE workstat.csv; done

That wasn't useful in itself, but combine that command with our line-count command ( wc -l ) and we get the total number of instances of each state.

for STATE in `cat us-state-postal-codes`; do grep $STATE workstat.csv | wc -l; done

The problem with that is it doesn't include the actual state. So we echo out the state name.

for STATE in `cat us-state-postal-codes`; do echo $STATE; grep $STATE workstat.csv | wc -l; done

But that outputs on one line for the state, another line for the number. We want both on the same line, ideally in a format amenable for turning into a csv. That means sticking the grep inside the echo command like this:

for STATE in `cat us-state-postal-codes`; do echo "$STATE,`grep $STATE workstat.csv | wc -l`"; done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment