Created September 28, 2011 15:51
Use associative arrays to aggregate data with Awk
BEGIN {
    # Input and output fields are comma-separated (CSV in, CSV out).
    FS = ",";
    OFS = ",";
}
{
    # For every record, bump the count and running total for the key in column 1.
    COUNTS[$1] += 1;
    TOTALS[$1] += $2;
}
END {
    # Roll the per-key counts and totals up into global figures.
    GLOBAL_COUNT = 0;
    GLOBAL_TOTAL = 0;
    for (ID in COUNTS) {
        GLOBAL_COUNT += COUNTS[ID];
        GLOBAL_TOTAL += TOTALS[ID];
    }
    GLOBAL_AVG = GLOBAL_TOTAL / GLOBAL_COUNT;

    # Emit one row per key with its count and average value.
    print "key", "count", "average";
    for (ID in COUNTS) {
        print ID, COUNTS[ID], (TOTALS[ID] / COUNTS[ID]);
    }
    print "global average: " GLOBAL_AVG;
    print "global count: " GLOBAL_COUNT;
}
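To see what the script produces, here is a minimal sketch run against three hypothetical input rows (the filename `sample.csv` and the data are made up; the inline awk program uses the same per-key aggregation logic as the gist, minus the global totals):

```shell
# Hypothetical input: key in column 1, numeric value in column 2.
printf 'a,10\na,20\nb,30\n' > sample.csv

# Count and sum per key in associative arrays, then print each key's
# count and average. Note that the iteration order of `for (k in ...)`
# is unspecified in awk, so rows may come out in any order.
awk 'BEGIN { FS = ","; OFS = "," }
     { COUNTS[$1] += 1; TOTALS[$1] += $2 }
     END {
       for (ID in COUNTS)
         print ID, COUNTS[ID], TOTALS[ID] / COUNTS[ID]
     }' sample.csv
```

For key `a` this yields a count of 2 and an average of 15; for `b`, a count of 1 and an average of 30.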
I used this script to process 100,480,507 records (2.73 GB) in 226 seconds (12.37 MB/s). The system was entirely IO bound during the processing. Interestingly enough, once all the data files were in the disk cache, the process was completely CPU bound and executed only slightly faster: 201 seconds (13.91 MB/s).
Still, hot damn that's fast.