briandailey/gist:b2554e101323dcac7cf3 Secret

## gistfile1.txt
Test dataset: NPI file. 4GB uncompressed csv.


http://nppes.viva-it.com/NPI_Files.html


  1. Briefly talk about *nix philosophy of small, simple tools working together to get a job done.
    1. Doug McIlroy, inventor of pipes:
    2. (i) Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new features.

(ii) Expect the output of every program to become the input to another, as yet unknown, program. Don't clutter output with extraneous information. Avoid stringently columnar or binary input formats. Don't insist on interactive input.

(iii) Design and build software, even operating systems, to be tried early, ideally within weeks. Don't hesitate to throw away the clumsy parts and rebuild them.

(iv) Use tools in preference to unskilled help to lighten a programming task, even if you have to detour to build the tools and expect to throw some of them out after you've finished using them.
    3. Text is the universal interface.
  2. Explain that I don't have a background in perl, so perl users may find this talk amusing.
  3. wc -l
    1. get line count on the file, just to see how many records we're dealing with.
    2. This file has 3717914 records.
  4. Head and tail the file.
    1. Grab the headers with head -1 | tr , "\n" | less -N
  5. less (doesn't load the entire file into memory, allows browsing and searching to some extent)
  6. cut -d, -fN
    1. explain pitfall of quoted commas.
    2. sneaky trick: sed 's/",/"|/g' only works because there are no unescaped quotes.
  7. tr
    1. tr '[:lower:]' '[:upper:]'
    2. tr -d '\r' (dos -> unix)
  8. sed
    1. stream editor, great for regex.
    2. print a specific line from the file - sed 'Nq;d' (e.g. sed '1q;d' for line 1)
      1. q - quit at line N; sed prints the current line before exiting
      2. d - delete every other line, so nothing else is printed
    3. sed can be a little slow.
  9. awk
    1. inspecting columns (same as cut)
    2. use cases/examples:
      1. using it to pad a column (e.g., a zipcode) with zeros
        1. awk -F, '{ printf("%05d\n", $1) }'
    3. checking number of fields
      1. awk -F, '{ print NF }' | uniq
        1. again with the quoted commas!
        2. prove sed hack works:
          1. pv npidata_20050523-20120709.csv | sed 's/",/"|/g' | awk 'BEGIN { FS="|" } { if (NF != 329) { print NF,$0} }' | uniq
    4. pulling sample data
  10. grep/ack
    1. Ack is much like grep, but you can pass file type, it's automatically recursive, highlights by default, etc.
    2. It can also be faster on a source tree, since it skips version-control directories and other irrelevant files by default.
  11. create a sample for test runs, explain how samples may need to be random.
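
A few of the one-liners from the outline can be tried on a toy file before pointing them at the 4GB one. This is only a sketch: the three-line CSV and the variable names are made up for illustration and are not part of the NPI data.

```shell
# Hypothetical three-line CSV standing in for the NPI file.
tmp=$(mktemp)
printf '%s\n' 'npi,name,zip' '1,"Smith, John",501' '2,"Doe, Jane",37206' > "$tmp"

# List the header columns one per line (pipe into `less -N` to number them).
headers=$(head -n 1 "$tmp" | tr , '\n')

# Print only line 2: other lines are deleted (d), and sed quits (q) at line 2.
second=$(sed '2q;d' "$tmp")

# Zero-pad the last field of each data row to five digits.
padded=$(awk -F, 'NR > 1 { printf("%05d\n", $NF) }' "$tmp")

# Distinct comma-separated field counts; the quoted commas inflate the data rows.
counts=$(awk -F, '{ print NF }' "$tmp" | sort -u)

printf '%s\n' "$headers" "$second" "$padded" "$counts"
rm -f "$tmp"
```

The header has three fields but the data rows report four, which is the quoted-comma pitfall in miniature.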

Some uses:


Create a pipe-delimited version of the file.
pv npidata_20050523-20120709.csv | sed 's/",/"|/g' > piped.txt

Took about ten minutes. Warning: some pipes were in the file to begin with. How did I find out?

awk 'BEGIN { FS="|" } { print NF }' piped.txt | uniq
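
One way to catch stray pipes earlier is to look for them before converting, and to tally field counts afterwards rather than eyeballing `uniq` output (which only collapses adjacent duplicates). A minimal sketch on a made-up two-line file; the sample rows are hypothetical.

```shell
# Hypothetical quoted CSV; the second row already contains a pipe.
tmp=$(mktemp)
printf '%s\n' '"1","Smith, John","37206"' '"2","A|B Clinic","37207"' > "$tmp"

# Number of lines that contain a pipe before any conversion.
preexisting=$(grep -c '|' "$tmp")

# After conversion, report "field_count:number_of_lines" for each distinct count.
tally=$(sed 's/",/"|/g' "$tmp" \
  | awk -F'|' '{ print NF }' \
  | sort -n | uniq -c \
  | awk '{ print $2 ":" $1 }')

echo "lines with a pipe before conversion: $preexisting"
printf '%s\n' "$tally"
rm -f "$tmp"
```

A clean conversion would show a single field count; more than one means some rows need a closer look.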


Truncate zipcodes to five characters (six with the opening quote).
head -10 piped.txt | awk -v Q='"' 'BEGIN { FS="|"; OFS="|" } { $33 = substr($33, 1, 6) Q; print $0 }' | cut -d\| -f33

Passing quotes into awk is a bear, so we create a variable for that and pass it in.

Checking the output with cut is generally a good idea.
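
The truncation can be checked on a single made-up row first. This sketch assumes the zipcode sits in field 3 rather than 33, and relies on awk strings being 1-indexed, so the six characters start at position 1. It also shows that an escaped quote inside the awk program gives the same result as the `Q` variable.

```shell
# Hypothetical pipe-delimited row with a nine-digit zipcode in field 3.
line='"1234567893"|"SMITH"|"372061234"'

# Keep the opening quote plus five digits, then re-append the closing quote.
zip=$(printf '%s\n' "$line" \
  | awk -v Q='"' 'BEGIN { FS = OFS = "|" } { $3 = substr($3, 1, 6) Q; print $3 }')

# Same thing with the quote escaped inside the awk string instead of passed in.
zip_escaped=$(printf '%s\n' "$line" \
  | awk 'BEGIN { FS = OFS = "|" } { $3 = substr($3, 1, 6) "\""; print $3 }')

echo "$zip"
echo "$zip_escaped"
```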


How many doctors are practicing in east Nashville?
Process is something like:


  1. Locate postal code field.
  2. Go over file, filter for matching postal codes, and report back number of lines.
pv piped.txt | cut -d'|' -f6,33 | grep \"37206 | wc -l

Voilà! 145 records, discovered in about 90 seconds. Take my word for it.
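
A looser grep can also match the postal code when it shows up in some other column. As a sketch of a stricter variant, the match can be anchored to the postal code field inside awk. The four rows below are invented, and the zipcode is assumed to be field 3 instead of 33.

```shell
# Hypothetical pipe-delimited rows; row 3 has "37206 in the name, not the zip.
tmp=$(mktemp)
printf '%s\n' \
  '"1"|"SMITH"|"372061234"' \
  '"2"|"DOE"|"37207"' \
  '"3"|"37206 MAIN ST CLINIC"|"90210"' \
  '"4"|"LEE"|"37206"' > "$tmp"

# Field-anchored count: only the zip column is tested.
count=$(awk -F'|' '$3 ~ /^"37206/ { n++ } END { print n + 0 }' "$tmp")

# Whole-line grep for comparison; it also counts row 3.
loose=$(grep -c '"37206' "$tmp")

echo "anchored: $count, loose: $loose"
rm -f "$tmp"
```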

Create a sample dataset.
We can grab roughly 1/100th of the file.

awk 'BEGIN { srand() } rand() <= .01' piped.txt | wc -l
awk 'BEGIN { srand() } rand() <= .01' piped.txt > sample.txt
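
With a bare `srand()` the seed comes from the clock, so each run picks different lines, and the two commands above will not agree on a count. A sketch of a repeatable variant follows. The `sample_rows` helper, the fixed seed, and the choice to always keep line 1 as a header are assumptions added for illustration.

```shell
# Stand-in data: 10000 numbered lines, with line 1 playing the header.
tmp=$(mktemp)
awk 'BEGIN { for (i = 1; i <= 10000; i++) print i }' > "$tmp"

# sample_rows SEED FRACTION FILE
# Keeps line 1, then each later line with probability FRACTION.
sample_rows() {
  awk -v seed="$1" -v frac="$2" \
    'BEGIN { srand(seed) } NR == 1 || rand() <= frac' "$3"
}

a=$(sample_rows 42 0.01 "$tmp")
b=$(sample_rows 42 0.01 "$tmp")

# Same seed, same awk, same input: the two samples are identical.
[ "$a" = "$b" ] && echo "samples match"
printf '%s\n' "$a" | wc -l
rm -f "$tmp"
```

The sample size still varies around 1% of the input, since each line is an independent coin flip.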