freejoe76/csvkit-intro.md

## csvkit-intro.md

      
    Raw
  

              csvkit-intro.md
            
          
    Csvkit intro

A walk-through on slimming down a CSV, getting some information about its fields, slimming down the columns we're working with and pulling information about those columns.
https://csvkit.readthedocs.io/en/1.0.2/scripts/csvcut.html
High-level look at the spreadsheet

Csvkit can work slow on large CSVs, so for the sake of this walk-through we're going to work with 5,000 rows of the CSV.
head -n 5000 2014.CSV > test.csv
See the meta information on the spreadsheet file, according to the filesystem:
stat test.csv
See some high-level analysis of the data in the spreadsheet itself:
csvstat test.csv
Pull out the columns we want to look at

This command pulls the first two columns:
csvcut test.csv -c 1,2
This command pulls the 5th, 7th, 9th:
csvcut test.csv -c 5,7,9
This command pulls the 5th, 7th, 9th and runs the high-level analysis of it. It uses the | character to connect the output of the csvcut command to the csvstat command. Using the | character (the pipe character) is something you'll do in other places too:
csvcut test.csv -c 5,7,9 | csvstat
Now, we're going to pull the contents of the fifth column and put that in its own CSV, for parsing later. We do that with the > character, which like the | is available to all programs. The > will take the output of the preceding command and put it in the place specified.
csvcut test.csv -c 5 > workstat.csv
Do some preliminary searches & analysis of the new spreadsheet

grep is a command that searches a file line-by-line for a particular string.
This command searches our filtered spreadsheet for every instance of "WA":
grep WA workstat.csv
wc is short for "word count," which counts words, lines and paragraphs. wc -l is a shortcut for line count, and paired with grep (with the |) we get a count of the number of lines a particular string shows up in:
grep WA workstat.csv | wc -l
Say we want to count more than one string. We could use the ; to put two separate commands on one line:
grep WA workstat.csv | wc -l; grep CO workstat.csv | wc -l
But say we needed to count five strings. That's a long command. We can use a for loop instead:
Basic for loop:
for STATE in 'WA' 'CO' 'AZ' 'CA'; do echo $STATE; done
For loop tuned to our spreadsheet:
for STATE in 'WA' 'CO' 'AZ' 'CA'; do grep $STATE workstat.csv | wc -l; done
Here's cat

cat is short for concatenate, and is a way you can combine multiple files. It's also a way you can show the contents of a particular file, like so:
cat us-state-postal-codes
More advanced commands

Our goal:
We're going to use cat in a for loop that will get increasingly complicated as it gets closer to everything we want it to do.
First: Print each state out, one by one.
for STATE in `cat us-state-postal-codes`; do echo $STATE; done
Next: Use grep to search for instances of each state.
for STATE in `cat us-state-postal-codes`; do grep $STATE workstat.csv; done
That wasn't useful in itself, but combine that command with our line-count command ( wc -l ) and we get the total number of instances of each state.
for STATE in `cat us-state-postal-codes`; do grep $STATE workstat.csv | wc -l; done
The problem with that is it doesn't include the actual state. So we echo out the state name.
for STATE in `cat us-state-postal-codes`; do echo $STATE; grep $STATE workstat.csv | wc -l; done
But that outputs on one line for the state, another line for the number. We want both on the same line, ideally in a format amenable for turning into a csv. That means sticking the grep inside the echo command like this:
for STATE in `cat us-state-postal-codes`; do echo "$STATE,`grep $STATE workstat.csv | wc -l`"; done