@helgejo
Created March 7, 2016 13:24
Sampling techniques
If the data is too large to load into RAM, there is a very simple way to sample it using streaming techniques: first randomly select the line numbers that will make up your sample, then read through the file again and extract those lines.
You can either do a plain random sample, or a stratified random sample if you have an output variable Y and want to preserve its class distribution in your sample.
Random sample
1/ Count the number of lines in your big file by reading it line by line; call this nb_lines.
2/ Generate a list of random numbers between 1 and nb_lines, called for instance selected_lines, corresponding to the line numbers you will select from the big file.
3/ Go through the original big data file again, select the lines whose numbers appear in selected_lines, and write them to a new file.
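The three steps above can be sketched in Python as a two-pass scan. The function name `two_pass_sample` and the assumption of a plain text file with one record per line (and no header) are mine, not from the original note:

```python
import random

def two_pass_sample(path, out_path, k, seed=0):
    """Write a random sample of k lines from path to out_path,
    without ever loading the whole file into memory."""
    # Pass 1: count the number of lines (nb_lines)
    with open(path) as f:
        nb_lines = sum(1 for _ in f)
    # Pick k distinct line numbers (0-based), i.e. selected_lines
    rng = random.Random(seed)
    selected = set(rng.sample(range(nb_lines), k))
    # Pass 2: stream through again, keeping only the selected lines
    with open(path) as f, open(out_path, "w") as out:
        for i, line in enumerate(f):
            if i in selected:
                out.write(line)
```

Only the set of selected line numbers is held in memory, so this works for files far larger than RAM.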
Stratified sample for a discrete output variable
You roughly do the same, except that you group the line numbers by class before the random selection.
1/ Generate a list of line numbers for each class (e.g. if Y has 3 classes, you get 3 lists of line numbers).
2/ For each class, randomly select line numbers from its list.
3/ Merge those selections into one list.
4/ Extract the lines in the merged list from the big original data file.
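The stratified variant can be sketched the same way. This assumes a delimited text file whose class label Y sits in one column (the column index, separator, and the function name `stratified_sample` are illustrative assumptions); `frac` is the fraction sampled from each class, so the class distribution is preserved:

```python
import random
from collections import defaultdict

def stratified_sample(path, out_path, frac, label_col=-1, sep=",", seed=0):
    """Sample a fraction frac of the lines of path, per class, into out_path."""
    # Pass 1: collect the line numbers belonging to each class
    lines_by_class = defaultdict(list)
    with open(path) as f:
        for i, line in enumerate(f):
            y = line.rstrip("\n").split(sep)[label_col]
            lines_by_class[y].append(i)
    # Random selection within each class, merged into one set
    rng = random.Random(seed)
    selected = set()
    for ids in lines_by_class.values():
        k = max(1, int(len(ids) * frac))
        selected.update(rng.sample(ids, k))
    # Pass 2: stream through again and write the selected lines
    with open(path) as f, open(out_path, "w") as out:
        for i, line in enumerate(f):
            if i in selected:
                out.write(line)
```

Because each class is sampled at the same rate, the output keeps roughly the same Y distribution as the original file.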
From https://www.linkedin.com/groups/77616/77616-6110315650918395908