@albarrentine
Created December 5, 2016 06:23
Randomly shuffle a newline-delimited file that's larger than main memory
#!/usr/bin/env bash
# Exit immediately if any command fails.
set -e
if [ "$#" -lt 3 ]; then
    echo "Usage: chunked_shuffle filename parts outfile" >&2
    exit 1
fi
filename=$1
parts=$2
outfile=$3
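# Scatter: append each input line to one of `parts` bucket files
# (filename.0 ... filename.<parts-1>), chosen uniformly at random.
# Choose `parts` large enough that every bucket fits in memory.
awk -v parts="$parts" -v filename="$filename" 'BEGIN { srand() } { print > (filename "." int(rand() * parts)) }' "$filename"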
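tmp_outfile="$filename.out"
: > "$tmp_outfile"
# Gather: shuffle each bucket in memory and append it to the output.
for i in $(seq 0 $((parts - 1))); do
    # A bucket file is missing if no line was assigned to it.
    [ -e "$filename.$i" ] || continue
    shuf "$filename.$i" >> "$tmp_outfile"
    rm "$filename.$i"
done
mv "$tmp_outfile" "$outfile"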
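
A quick sanity check after running the script is to confirm that the output contains exactly the same lines as the input, only in a different order. The sketch below is one way to do that. `same_lines` and its `sl_` variables are hypothetical names, not part of the gist, and the helper assumes `mktemp` is available. It sorts both files, so it is best used on a sample or on a machine where an external sort of the full file is acceptable.

```shell
# Hypothetical helper (not part of the gist): succeeds when the two files
# contain the same lines with the same multiplicities, ignoring order.
same_lines() {
    sl_tmp=$(mktemp) || return 2
    # LC_ALL=C gives a byte-wise, locale-independent sort order.
    LC_ALL=C sort "$2" > "$sl_tmp"
    if LC_ALL=C sort "$1" | cmp -s - "$sl_tmp"; then
        sl_status=0
    else
        sl_status=1
    fi
    rm -f "$sl_tmp"
    return "$sl_status"
}
```

For example, after `chunked_shuffle numbers.txt 8 shuffled.txt`, running `same_lines numbers.txt shuffled.txt && echo ok` should print `ok` if no lines were lost or duplicated.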