Created
February 27, 2024 22:39
-
-
Save kjhealy/85f23c3ba158770ffa3ae09de2ef946a to your computer and use it in GitHub Desktop.
one-percent-stream-sample.sh
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
## Take an approximately 0.1 percent sample of lines from this gzipped | |
## csv file. We do this by having gzip stream to STDOUT and then the | |
## Perl one-liner does the sampling. On an 80GB file, the output will | |
## be ~80MB. Strictly speaking this is only roughly a 1% sample. | |
## Also, there are a few possible edge cases with the rough-and-ready | |
## sampling method, but they're not that likely to worry us given what | |
## we want to do. | |
gzip -cd giantfile.csv.gz | perl -ne 'print if (rand() < .001)' > sample.csv | |
## Don't forget to put back the column names | |
gzip -cd giantfile.csv.gz | head -n 1 > header.csv | |
cat header.tsv sample.csv > tmp.tsv && mv -f tmp.csv sample.csv | |
## You could do the sampling with awk, too, but it will be a bit slower | |
# awk 'BEGIN {srand()} !/^$/ { if (rand() <= .001 || FNR==1) print $0}' |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment