hackday nashville august 11, 2012
Test dataset: NPI file. 4GB uncompressed csv.
http://nppes.viva-it.com/NPI_Files.html
1. Briefly talk about *nix philosophy of small, simple tools working together to get a job done.
    1. Doug McIlroy, inventor of pipes:
    2. (i) Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new features.
       (ii) Expect the output of every program to become the input to another, as yet unknown, program. Don't clutter output with extraneous information. Avoid stringently columnar or binary input formats. Don't insist on interactive input.
       (iii) Design and build software, even operating systems, to be tried early, ideally within weeks. Don't hesitate to throw away the clumsy parts and rebuild them.
       (iv) Use tools in preference to unskilled help to lighten a programming task, even if you have to detour to build the tools and expect to throw some of them out after you've finished using them.
    3. Text is the universal interface.
2. Explain that I don't have a background in perl, so perl users may find this talk amusing.
3. wc -l
    1. get line count on the file, just to see how many records we're dealing with.
    2. This file has 3717914 records.
4. Head and tail the file.
    1. Grab the headers with head -1 | tr , "\n" | less -N
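A non-interactive variant of the header listing, as a rough sketch: number each column name so the positions can be handed to cut or awk later. The header line here is a made-up stand-in, and the sketch assumes no column name contains a comma.

```shell
# Hypothetical header line standing in for the first line of the CSV.
header='NPI,Entity Type Code,Replacement NPI'

# One column name per line, prefixed with its field number.
numbered=$(printf '%s\n' "$header" | tr ',' '\n' | awk '{ print NR ": " $0 }')

printf '%s\n' "$numbered"
```

On the real file the first stage would be head -1 on the CSV instead of the printf.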
5. less (doesn't load the entire file into memory, allows browsing and searching to some extent)
6. cut -d, -fN
    1. explain pitfall of quoted commas.
    2. sneaky trick: sed 's/",/"|/g' only works because there are no unescaped quotes.
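To make the quoted-comma pitfall concrete, a minimal sketch on an invented two-field record (not taken from the NPI file). Like the trick itself, it assumes every field is quoted and no field contains the sequence `",`.

```shell
# Hypothetical record: two quoted fields, the first one contains a comma.
line='"Smith, John","37206"'

# Naive split on commas: field 2 is the tail of the name, not the zip code.
naive=$(printf '%s\n' "$line" | cut -d, -f2)

# Rewrite only the commas that follow a closing quote, then split on pipes.
fixed=$(printf '%s\n' "$line" | sed 's/",/"|/g' | cut -d'|' -f2)

printf 'naive: [%s]\nfixed: [%s]\n' "$naive" "$fixed"
```

The naive split returns a fragment of the name; the sed rewrite gives cut a delimiter that only appears between fields.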
7. tr
    1. tr '[:lower:]' '[:upper:]'
    2. tr -d '\r' (dos -> unix)
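Both tr uses chained together, as a small sketch on an invented DOS-style line (the input is hypothetical):

```shell
# Hypothetical input line with a DOS line ending (CR LF).
clean=$(printf 'smith,nashville\r\n' | tr -d '\r' | tr '[:lower:]' '[:upper:]')

printf '%s\n' "$clean"
```

tr only reads standard input, so on a real file it sits in the middle of a pipe or takes the file through a redirect.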
8. sed
    1. stream editor, great for regex.
    2. print specific line from the file - sed '1q;d' (replace 1 with the line number you want)
        1. q - quit; sed prints the current line on its way out
        2. d - delete the line, so nothing before the target line is printed
    3. sed can be a little slow.
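A quick sketch of the print-one-line idiom on a made-up four-line input, next to a second spelling that does the same job with -n. Both stop reading at the target line, which matters on a 4GB file.

```shell
# Hypothetical four-line input standing in for the big CSV.
input=$(printf 'one\ntwo\nthree\nfour\n')

# Lines 1-2 are deleted by d; on line 3, q prints the line and quits.
third=$(printf '%s\n' "$input" | sed '3q;d')

# Same result: suppress default output, print line 3, then quit.
third_alt=$(printf '%s\n' "$input" | sed -n '3{p;q;}')

printf '%s %s\n' "$third" "$third_alt"
```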
9. awk
    1. inspecting columns (same as cut)
    2. use cases/examples:
        1. using it to pad a column (e.g., a zipcode) with zeros
            1. awk -F, '{ printf("%05d\n", $1) }'
    3. checking number of fields
        1. awk -F, '{ print NF }' | uniq
            1. again with the quoted commas!
        2. prove sed hack works:
            1. pv npidata_20050523-20120709.csv | sed 's/",/"|/g' | awk 'BEGIN { FS="|" } { if (NF != 329) { print NF,$0} }' | uniq
    4. pulling sample data
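The two awk use cases above, sketched on invented input. The field-count check tallies in an awk array instead of piping to uniq, since uniq only collapses adjacent duplicates; that swap is my choice here, not something from the talk.

```shell
# Zero-pad a numeric first column to five digits (hypothetical zip codes).
padded=$(printf '2134,Boston\n37206,Nashville\n' |
  awk -F, '{ printf("%05d\n", $1) }')

# Tally records per field count. More than one row in the result means
# something (here a quoted comma) is breaking the split.
counts=$(printf 'a,b,c\n"x, y",b,c\na,b,c\n' |
  awk -F, '{ n[NF]++ } END { for (k in n) print k, n[k] }' | sort -n)

printf '%s\n---\n%s\n' "$padded" "$counts"
```

Each output row of the tally is a field count followed by the number of records that have it.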
10. grep/ack
    1. ack is much like grep, but you can filter by file type, it's recursive by default, it highlights matches by default, etc.
    2. It can also be faster on source trees, since it skips version-control directories and other files you rarely want to search.
11. create a sample for test runs, explain how samples may need to be random.
Some uses:

Create a pipe-delimited version of the file.
    pv npidata_20050523-20120709.csv | sed 's/",/"|/g' > piped.txt
Took about ten minutes. Warning: some pipes were in the file to begin with. How did I find out?
    awk 'BEGIN { FS="|" } { print NF }' piped.txt | uniq
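
Truncate zipcodes to five characters.
    head -10 piped.txt | awk -v Q='"' 'BEGIN { FS="|"; OFS="|" } { $33 = substr($33, 1, 6) Q; print $0 }' | cut -d\| -f33
substr is 1-indexed: characters 1-6 are the opening quote plus five digits, and Q puts the closing quote back. (A start index of 0 behaves differently across awk implementations.)
Passing quotes into awk is a bear, so we create a variable for that and pass it in.
Checking the output with cut is generally a good idea.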
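The same truncation on a single invented record, as a sketch. The postal code is field 3 here purely to keep the example short; it is field 33 in the real file.

```shell
# Hypothetical pipe-delimited record; field 3 holds a quoted ZIP+4 value.
record='"1234567890"|"Smith"|"372061234"'

# Keep the opening quote plus five digits, then re-add the closing quote (Q).
short=$(printf '%s\n' "$record" |
  awk -v Q='"' 'BEGIN { FS = "|"; OFS = "|" } { $3 = substr($3, 1, 6) Q; print }')

printf '%s\n' "$short"
```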

How many doctors are practicing in east Nashville?
Process is something like:
    1. Locate postal code field.
    2. Go over file, filter for matching postal codes, and report back number of lines.
    pv piped.txt | cut -d\| -f6,33 | grep \"37206 | wc -l
Voilà! 145 records, discovered in about 90 seconds. Take my word for it.
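An alternative sketch that lets awk test the postal code field directly, so a match elsewhere on the line cannot inflate the count. The records and the field position (3 instead of 33) are invented for illustration.

```shell
# Hypothetical pipe-delimited sample; postal code is field 3 here.
sample=$(mktemp)
printf '%s\n' \
  '"1"|"Smith"|"372061234"' \
  '"2"|"Jones"|"37206"' \
  '"3"|"Brown"|"90210"' \
  '"4"|"37206 Main St"|"10001"' > "$sample"

# Count only records whose postal code field starts with "37206.
# n + 0 prints 0 instead of an empty line when nothing matches.
count=$(awk -F'|' '$3 ~ /^"37206/ { n++ } END { print n + 0 }' "$sample")
rm -f "$sample"

printf '%s\n' "$count"
```

Record 4 has 37206 in another field and is not counted.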

Create a sample dataset.
We can grab roughly 1/100th of the file.
    awk 'BEGIN { srand() } rand() <= .01' piped.txt | wc -l
    awk 'BEGIN { srand() } rand() <= .01' piped.txt > sample.txt
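A variation on the sampling one-liner, sketched on generated data: it always keeps the header line and uses a fixed seed so the sample is repeatable on the same awk. The seed value 42 and the input are arbitrary assumptions.

```shell
# Hypothetical input: a header line plus 1000 numbered records.
data=$(mktemp)
awk 'BEGIN { print "id"; for (i = 1; i <= 1000; i++) print i }' > "$data"

# NR == 1 keeps the header; every other line is kept with probability 0.01.
sample=$(awk 'BEGIN { srand(42) } NR == 1 || rand() <= 0.01' "$data")
rm -f "$data"

header=$(printf '%s\n' "$sample" | head -n 1)
rows=$(printf '%s\n' "$sample" | wc -l | tr -d ' ')

printf 'header: %s, lines kept: %s\n' "$header" "$rows"
```

The number of lines kept varies between awk implementations because their random generators differ; only the header is guaranteed.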