Created
June 17, 2010 11:42
split and xargs eg
# ps. writing this from my head so there are probably a few errors
# calculate total number of lines across the files
TOTAL_LINES=$(cat *.json | wc -l)
# calculate what 1/4 would be (for a quad core box), rounding up so that
# split produces at most 4 files and small inputs never give a size of 0
SPLIT_SIZE=$(( (TOTAL_LINES + 3) / 4 ))
# split the original input into files of this size
cat *.json | split -l "$SPLIT_SIZE" - SPLIT_
# run processing steps in parallel, 4 at a time, across the split files
# this allows the most cpu intensive piece, the grep and sort, to be parallelised
find SPLIT_* | xargs -P4 -I{} bash -c "grep -ioE 'g+o+a+l+' {} | sort > {}.sorted"
# combine output: merge the pre-sorted files, count each distinct match,
# then list the most frequent first
sort -m SPLIT_*.sorted | uniq -c | sort -rg > result
rm SPLIT_*
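
A self-contained way to try the same split / xargs / merge idea on toy data is sketched below. It is only a sketch, not the gist author's code: the `count_goals` helper, the temp directory layout, and the sample `.json` files are all made up for illustration, and it assumes GNU-style `split`, `xargs -P` and `grep -o` are available. The filename is passed to `sh -c` as `$1` instead of being pasted into the command string, which is a little safer with odd filenames.

```shell
#!/usr/bin/env bash
set -eu
export LC_ALL=C   # same collation for the per-chunk sorts and the final merge

# count_goals DIR [JOBS]: hypothetical helper that prints "count match" lines
count_goals() {
  local dir=$1 jobs=${2:-4} work total size
  work=$(mktemp -d)
  total=$(( $(cat "$dir"/*.json | wc -l) ))
  if [ "$total" -gt 0 ]; then
    # round up so split produces at most $jobs chunks
    size=$(( (total + jobs - 1) / jobs ))
    cat "$dir"/*.json | split -l "$size" - "$work/SPLIT_"
    # one grep + sort per chunk, up to $jobs running at a time
    printf '%s\n' "$work"/SPLIT_* |
      xargs -P "$jobs" -I{} sh -c 'grep -ioE "g+o+a+l+" "$1" | sort > "$1.sorted"' _ {}
    # merge the pre-sorted chunks, count duplicates, most frequent first
    sort -m "$work"/SPLIT_*.sorted | uniq -c | sort -rn | awk '{print $1, $2}'
  fi
  rm -rf "$work"
}

# toy input (made up): 5 lines across two files
demo=$(mktemp -d)
printf 'goal\ngooal\nno match here\n' > "$demo/a.json"
printf 'goal goal\ngooal\n' > "$demo/b.json"
count_goals "$demo"
rm -rf "$demo"
```

With 5 input lines and 4 jobs the chunk size rounds up to 2, so `split` writes 3 chunks and the merge step sees three small pre-sorted files.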