@mumrah
Created September 28, 2011 15:51
Use associative arrays to aggregate data with Awk
# Aggregate key,value records using awk associative arrays.
BEGIN {
    FS = ",";   # input fields are comma-separated
    OFS = ",";  # emit comma-separated output, too
}
{
    # Accumulate a count and a running total per key ($1).
    COUNTS[$1] += 1;
    TOTALS[$1] += $2;
}
END {
    # Roll the per-key aggregates up into global figures.
    GLOBAL_COUNT = 0;
    GLOBAL_TOTAL = 0;
    for (ID in COUNTS) {
        GLOBAL_COUNT += COUNTS[ID];
        GLOBAL_TOTAL += TOTALS[ID];
    }
    GLOBAL_AVG = GLOBAL_TOTAL / GLOBAL_COUNT;

    # Per-key report: key,count,average.
    print "key", "count", "average";
    for (ID in COUNTS) {
        print ID, COUNTS[ID], (TOTALS[ID] / COUNTS[ID]);
    }
    print "global average:" GLOBAL_AVG;
    print "global count:" GLOBAL_COUNT;
}
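One caveat: plain awk visits the keys of for (ID in COUNTS) in an unspecified order, so the per-key rows can come out in any order. A minimal sketch, assuming GNU Awk 4.0 or later is available, of the reporting loop with ascending key order via the gawk-specific PROCINFO["sorted_in"] setting:

END {
    # gawk-only: force ascending string order when iterating array keys
    PROCINFO["sorted_in"] = "@ind_str_asc";
    for (ID in COUNTS) {
        print ID, COUNTS[ID], (TOTALS[ID] / COUNTS[ID]);
    }
}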
mumrah commented Sep 28, 2011

Assumes input files are formatted like:

key,value

And produces output:

key,count,average
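For a concrete run, here is a hypothetical session (the filenames sample.csv and aggregate.awk are made up for illustration); note that the per-key rows may appear in any order:

$ cat sample.csv
a,1
a,3
b,2

$ awk -f aggregate.awk sample.csv
key,count,average
a,2,2
b,1,2
global average:2
global count:3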

mumrah commented Sep 28, 2011

I used this script to process 100,480,507 records (2.73 GB) in 226 seconds, about 12.37 MB/s (2.73 GB ≈ 2796 MB; 2796 MB / 226 s). The system was entirely IO-bound during processing. Interestingly enough, once all the data files were in the disk cache, the process was completely CPU-bound and executed only slightly faster: 201 seconds (13.91 MB/s).

Still, hot damn that's fast.
