A walk-through on slimming down a CSV, getting some information about its fields, slimming down the columns we're working with and pulling information about those columns.
https://csvkit.readthedocs.io/en/1.0.2/scripts/csvcut.html
Csvkit can work slow on large CSVs, so for the sake of this walk-through we're going to work with 5,000 rows of the CSV.
head -n 5000 2014.CSV > test.csv
See the meta information on the spreadsheet file, according to the filesystem:
stat test.csv
See some high-level analysis of the data in the spreadsheet itself:
csvstat test.csv
This command pulls the first two columns:
csvcut test.csv -c 1,2
This command pulls the 5th, 7th, 9th:
csvcut test.csv -c 5,7,9
This command pulls the 5th, 7th, 9th and runs the high-level analysis of it. It uses the |
character to connect the output of the csvcut command to the csvstat command. Using the |
character (the pipe character) is something you'll do in other places too:
csvcut test.csv -c 5,7,9 | csvstat
Now, we're going to pull the contents of the fifth column and put that in its own CSV, for parsing later. We do that with the >
character, which like the |
is available to all programs. The >
will take the output of the preceding command and put it in the place specified.
csvcut test.csv -c 5 > workstat.csv
grep is a command that searches a file line-by-line for a particular string.
This command searches our filtered spreadsheet for every instance of "WA":
grep WA workstat.csv
wc
is short for "word count," which counts words, lines and paragraphs. wc -l
is a shortcut for line count, and paired with grep (with the |
) we get a count of the number of lines a particular string shows up in:
grep WA workstat.csv | wc -l
Say we want to count more than one string. We could use the ;
to put two separate commands on one line:
grep WA workstat.csv | wc -l; grep CO workstat.csv | wc -l
But say we needed to count five strings. That's a long command. We can use a for loop instead:
Basic for loop:
for STATE in 'WA' 'CO' 'AZ' 'CA'; do echo $STATE; done
For loop tuned to our spreadsheet:
for STATE in 'WA' 'CO' 'AZ' 'CA'; do grep $STATE workstat.csv | wc -l; done
cat
is short for concatenate, and is a way you can combine multiple files. It's also a way you can show the contents of a particular file, like so:
cat us-state-postal-codes
Our goal:
We're going to use cat
in a for loop that will get increasingly complicated as it gets closer to everything we want it to do.
First: Print each state out, one by one.
for STATE in `cat us-state-postal-codes`; do echo $STATE; done
Next: Use grep
to search for instances of each state.
for STATE in `cat us-state-postal-codes`; do grep $STATE workstat.csv; done
That wasn't useful in itself, but combine that command with our line-count command ( wc -l
) and we get the total number of instances of each state.
for STATE in `cat us-state-postal-codes`; do grep $STATE workstat.csv | wc -l; done
The problem with that is it doesn't include the actual state. So we echo out the state name.
for STATE in `cat us-state-postal-codes`; do echo $STATE; grep $STATE workstat.csv | wc -l; done
But that outputs on one line for the state, another line for the number. We want both on the same line, ideally in a format amenable for turning into a csv. That means sticking the grep
inside the echo
command like this:
for STATE in `cat us-state-postal-codes`; do echo "$STATE,`grep $STATE workstat.csv | wc -l`"; done