@briandailey
Created August 11, 2012 15:06
hackday nashville august 11, 2012
Test dataset: NPI file. 4GB uncompressed csv.
http://nppes.viva-it.com/NPI_Files.html
1. Briefly talk about *nix philosophy of small, simple tools working together to get a job done.
1. Doug McIlroy, inventor of pipes:
2. (i) Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new features.
(ii) Expect the output of every program to become the input to another, as yet unknown, program. Don't clutter output with extraneous information. Avoid stringently columnar or binary input formats. Don't insist on interactive input.
(iii) Design and build software, even operating systems, to be tried early, ideally within weeks. Don't hesitate to throw away the clumsy parts and rebuild them.
(iv) Use tools in preference to unskilled help to lighten a programming task, even if you have to detour to build the tools and expect to throw some of them out after you've finished using them.
3. Text is the universal interface.
2. Explain that I don't have a background in Perl, so Perl users may find this talk amusing.
3. wc -l
1. get line count on the file, just to see how many records we're dealing with.
2. This file has 3717914 records.
4. Head and tail the file.
1. Grab the headers with head -1 | tr , "\n" | less -N
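A rough sketch of steps 3 and 4 on a tiny made-up file. The file contents and the variable names are invented for illustration, and nl stands in for the interactive less -N:

```shell
# Hypothetical stand-in for the NPI file: one header line plus two records.
sample_csv=$(mktemp)
printf 'NPI,Last Name,Postal Code\n1,SMITH,37206\n2,JONES,37215\n' > "$sample_csv"

# Line count, header included (tr strips the padding some wc builds add).
line_count=$(wc -l < "$sample_csv" | tr -d ' ')
echo "$line_count"

# One header per line with its position, which is the N to give cut -fN or awk's $N.
headers=$(head -n 1 "$sample_csv" | tr , '\n' | nl)
echo "$headers"

rm -f "$sample_csv"
```

The numbered header list is what makes later steps possible: it tells you which field number holds the postal code.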
5. less (doesn't load the entire file into memory, allows browsing and searching to some extent)
6. cut -d, -fN
1. explain pitfall of quoted commas.
2. sneaky trick: sed 's/",/"|/g' only works because there are no unescaped quotes.
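To see the quoted-comma pitfall and the sed trick side by side, here is a minimal sketch on one made-up record (the record and variable names are assumptions, not taken from the NPI file):

```shell
# One invented record whose second field has a comma inside the quotes.
line='"1234567890","SMITH, JOHN","37206"'

# Naive split on every comma: field 3 is the tail of the name, not the postal code.
naive=$(printf '%s\n' "$line" | cut -d, -f3)
echo "$naive"

# Turn only the quote-comma field boundaries into quote-pipe, then split on pipes.
fixed=$(printf '%s\n' "$line" | sed 's/",/"|/g' | cut -d'|' -f3)
echo "$fixed"
```

This only holds up while no field contains a quote immediately followed by a comma, which is the same caveat as above.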
7. tr
1. tr '[:lower:]' '[:upper:]'
2. tr -d '\r' (dos -> unix)
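Both tr uses can be tried on throwaway input. A small sketch, with made-up strings:

```shell
# Upper-case a stream.
upper=$(printf 'east nashville\n' | tr '[:lower:]' '[:upper:]')
echo "$upper"

# Strip the carriage return from a DOS-style line and count the bytes that remain.
bytes=$(printf 'a,b\r\n' | tr -d '\r' | wc -c | tr -d ' ')
echo "$bytes"
```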
8. sed
1. stream editor, great for regex.
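2. print a specific line from the file - sed 'Nq;d' (sed '1q;d' prints line 1)
1. q - quit; on line N sed prints the current line and stops reading the file
2. d - delete the pattern space, so the lines before N are not printed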
3. sed can be a little slow.
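A minimal sketch of the line-picking idiom on four made-up lines (n and the input are assumptions for the demo):

```shell
# Pick line n from a stream: d discards every line before n,
# q prints line n and quits so the rest of the input is never read.
n=3
picked=$(printf 'one\ntwo\nthree\nfour\n' | sed "${n}q;d")
echo "$picked"
```

Quitting early matters on a multi-gigabyte file: a line near the top comes back without a scan to the end.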
9. awk
1. inspecting columns (same as cut)
2. use cases/examples:
1. using it to pad a column (e.g., a zipcode) with zeros
1. awk -F, '{ printf("%06d\n", $1) }'
3. checking number of fields
1. awk -F, '{ print NF }' | uniq
1. again with the quoted commas!
2. prove sed hack works:
1. pv npidata_20050523-20120709.csv | sed 's/",/"|/g' | awk 'BEGIN { FS="|" } { if (NF != 329) { print NF,$0} }' | uniq
4. pulling sample data
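The two awk use cases above, sketched on invented input. The five-digit width and the sample rows are assumptions; sort is added before uniq here because uniq only collapses adjacent duplicates:

```shell
# Pad an unquoted numeric first column with leading zeros to five digits.
padded=$(printf '2134,Boston\n37206,Nashville\n' | awk -F, '{ printf("%05d\n", $1) }')
echo "$padded"

# Tally how many records have each field count, printed as fields:records.
counts=$(printf 'a,b,c\na,b\na,b,c\n' |
  awk -F, '{ print NF }' | sort -n | uniq -c | awk '{ print $2 ":" $1 }')
echo "$counts"
```

A clean file shows a single field count in the tally; anything else points at stray delimiters.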
10. grep/ack
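1. ack is much like grep, but you can filter by file type, it's recursive by default, highlights matches by default, etc.
2. It's often faster when searching a source tree because it skips version-control directories and irrelevant file types; on a single large file, plain grep is usually at least as fast.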
11. create a sample for test runs, explain how samples may need to be random.
Some uses:
Create a pipe-delimited version of the file.
pv npidata_20050523-20120709.csv | sed 's/",/"|/g' > piped.txt
Took about ten minutes. Warning: some pipes were in the file to begin with. How did I find out?
awk 'BEGIN { FS="|" } { print NF }' piped.txt | uniq
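A small sketch of how a stray pipe shows up in the field counts. The two records, the expected count of 3, and the variable names are made up for the demo (the real file expects 329 fields):

```shell
# Two invented records; the second has a literal pipe inside a quoted field.
csv_file=$(mktemp)
piped_file=$(mktemp)
printf '"1","SMITH","37206"\n"2","A|B CLINIC","37215"\n' > "$csv_file"
sed 's/",/"|/g' "$csv_file" > "$piped_file"

# Report line number and field count for any record that is not 3 fields wide.
bad=$(awk -F'|' -v want=3 'NF != want { print NR ": " NF }' "$piped_file")
echo "$bad"

rm -f "$csv_file" "$piped_file"
```

Printing the line number along with NF makes it easy to jump to the offending record afterwards.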
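Truncate zip codes to five digits.
head -10 piped.txt | awk -v Q='"' 'BEGIN { FS="|"; OFS="|" } { $33 = substr($33, 1, 6) Q; print $0 }' | cut -d\| -f33
substr starts at 1 (awk strings are 1-indexed; a start of 0 behaves differently across awk implementations) and takes six characters: the opening quote plus five digits. Q puts the closing quote back.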
Passing quotes into awk is a bear, so we create a variable for that and pass it in.
Checking out output with cut is generally a good idea.
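The same truncation on a single made-up record, with the zip in field 3 instead of field 33 (the record and variable names are assumptions):

```shell
# Invented pipe-delimited record with a ZIP+4 value in the third field.
rec='"1"|"SMITH"|"372061234"'

# Keep the opening quote plus five digits, re-append the closing quote via Q,
# then use cut to look at just the field that was changed.
short=$(printf '%s\n' "$rec" |
  awk -v Q='"' 'BEGIN { FS = "|"; OFS = "|" } { $3 = substr($3, 1, 6) Q; print }' |
  cut -d'|' -f3)
echo "$short"
```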
How many doctors are practicing in east Nashville?
Process is something like:
1. Locate postal code field.
2. Go over file, filter for matching postal codes, and report back number of lines.
pv piped.txt | cut -d\| -f6,33 | grep \"37206 | wc -l
Voilà! 145 records, discovered in about 90 seconds. Take my word for it.
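The filter-and-count step can be sketched on three invented records, with the postal code in field 3 rather than 33. Anchoring the pattern at the start of the field is an addition for the demo, so that a 37206 elsewhere in a record cannot match:

```shell
# Three made-up pipe-delimited records; two are in 37206 (one as ZIP+4).
piped_file=$(mktemp)
printf '"1"|"SMITH"|"37206"\n"2"|"JONES"|"372061234"\n"3"|"DOE"|"37215"\n' > "$piped_file"

# Pull out only the postal code field, then count lines that start with "37206.
matches=$(cut -d'|' -f3 "$piped_file" | grep -c '^"37206')
echo "$matches"

rm -f "$piped_file"
```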
Create a sample dataset.
We can grab roughly 1/100th of the file (rand() <= .01 keeps each line with 1% probability).
awk 'BEGIN { srand() } rand() <= .01' piped.txt | wc -l
awk 'BEGIN { srand() } rand() <= .01' piped.txt > sample.txt
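One possible refinement, sketched on generated data: keep the header line in the sample and pass a fixed seed so the sample is repeatable. The seed value 42, the generated records, and the variable names are assumptions for the demo:

```shell
# Build a made-up file: one header line plus 10000 records.
data_file=$(mktemp)
sample_file=$(mktemp)
awk 'BEGIN { print "id|value"; for (i = 1; i <= 10000; i++) print i "|x" }' > "$data_file"

# Always keep line 1 (the header), and keep each other line with 1% probability.
# srand(42) fixes the seed; a bare srand() seeds from the clock, so two runs
# started in different seconds pick different lines.
awk 'BEGIN { srand(42) } NR == 1 || rand() <= .01' "$data_file" > "$sample_file"

first_line=$(head -n 1 "$sample_file")
sample_lines=$(wc -l < "$sample_file" | tr -d ' ')
echo "$first_line"
echo "$sample_lines"

rm -f "$data_file" "$sample_file"
```

The sample size lands near 1% of the input but varies from run to run unless the seed is fixed, which is why the count and the saved file from two separate unseeded runs need not agree.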