@matpalm
Created June 17, 2010 11:42
split and xargs eg
# ps. writing this from memory so there are probably a few errors
# calculate total number of lines across the files
TOTAL_LINES=`cat *.json | wc -l`
# calculate what 1/4 would be (for a quad core box), rounding up so that
# split produces at most 4 files
SPLIT_SIZE=$(( (TOTAL_LINES + 3) / 4 ))
# split the original input into files of this size
cat *.json | split -l "$SPLIT_SIZE" - SPLIT_
# run processing steps in parallel, 4 at a time, across the split files
# this allows the most CPU intensive piece, the grep and sort, to be parallelised
# (-I{} already runs one command per input file, so -n1 is not needed)
find SPLIT_* | xargs -P4 -I{} bash -c "grep -iPo 'g+o+a+l+' {} | sort > {}.sorted"
# combine output
sort -m *.sorted | uniq -c | sort -rg > result
rm SPLIT_*
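
A quick way to sanity-check the steps above is to run them on a tiny made-up input and compare against a plain serial run. The sketch below is only that: the scratch directory, the two sample `.json` files and the `serial_result` file are hypothetical and not part of the original script. It also makes a few assumptions of its own: `LC_ALL=C` is set so every `sort` uses the same ordering, `grep -E` is used in place of `-P` for portability, and the chunk size is rounded up so a tiny input never yields a size of zero.

```shell
#!/usr/bin/env bash
set -euo pipefail

# assumption: force one collation so per-chunk sorts and the final merge agree
export LC_ALL=C

# hypothetical scratch directory and sample data, not from the original gist
WORK_DIR=$(mktemp -d)
cd "$WORK_DIR"
printf 'what a goal\nGOOOAL!!\nno score here\ngoal\n' > a.json
printf 'goooal\nGoal and another goal\nnothing\n' > b.json

# serial baseline: one grep and one sort over everything
cat *.json | grep -ioE 'g+o+a+l+' | sort | uniq -c | sort -rn > serial_result

# split / xargs version, following the same steps as the script
TOTAL_LINES=$(cat *.json | wc -l)
SPLIT_SIZE=$(( (TOTAL_LINES + 3) / 4 ))   # ceiling of TOTAL_LINES / 4
cat *.json | split -l "$SPLIT_SIZE" - SPLIT_
find SPLIT_* | xargs -P4 -I{} bash -c "grep -ioE 'g+o+a+l+' {} | sort > {}.sorted"
sort -m SPLIT_*.sorted | uniq -c | sort -rn > result
rm SPLIT_*

# the two outputs should be identical; diff exits non-zero if they are not
diff serial_result result && echo "parallel and serial counts match"
```

A chunk with no matches just produces an empty `.sorted` file, which `sort -m` merges without complaint, so the comparison still holds when some chunks contain no hits.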