@helgejo
Created March 7, 2016 13:24
Sampling techniques
If the data is too large to load into RAM, there is a very simple way to sample it using streaming techniques: first randomly select the line numbers that will make up your sample, then read through the file again and extract those lines.
You can either do a plain random sample, or a stratified random sample if you have an output variable Y and want to preserve its class distribution in your sample.
Random sample
1/ Count the number of lines in your big file by reading it line by line; call this nb_lines.
2/ Generate a list of random numbers between 1 and nb_lines, called for instance selected_lines, corresponding to the line numbers you will select from the big file.
3/ Go through the original big data file again, select the lines whose numbers appear in selected_lines, and write them to a new file.
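The three steps above can be sketched in Python as a two-pass scan. The function name `two_pass_sample` and the assumption of a plain text file with one record per line (and no header) are mine, not from the original note:

```python
import random

def two_pass_sample(path, out_path, k, seed=0):
    """Write a random sample of k lines from path to out_path,
    without ever loading the whole file into memory."""
    # Pass 1: count the number of lines (nb_lines)
    with open(path) as f:
        nb_lines = sum(1 for _ in f)
    # Pick k distinct line numbers (0-based), i.e. selected_lines
    rng = random.Random(seed)
    selected = set(rng.sample(range(nb_lines), k))
    # Pass 2: stream through again, keeping only the selected lines
    with open(path) as f, open(out_path, "w") as out:
        for i, line in enumerate(f):
            if i in selected:
                out.write(line)
```

Only the set of selected line numbers is held in memory, so this works for files far larger than RAM.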
Stratified sample for a discrete output variable
You roughly do the same, except that you group the line numbers by class before the random selection.
1/ Generate a list of line numbers for each class (e.g. if Y has 3 classes, you get 3 lists of line numbers).
2/ For each class, randomly select line numbers from its list.
3/ Merge those selections into one list.
4/ Extract the lines in the merged list from the big original data file.
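The stratified variant can be sketched the same way. This assumes a delimited text file whose class label Y sits in one column (the column index, separator, and the function name `stratified_sample` are illustrative assumptions); `frac` is the fraction sampled from each class, so the class distribution is preserved:

```python
import random
from collections import defaultdict

def stratified_sample(path, out_path, frac, label_col=-1, sep=",", seed=0):
    """Sample a fraction frac of the lines of path, per class, into out_path."""
    # Pass 1: collect the line numbers belonging to each class
    lines_by_class = defaultdict(list)
    with open(path) as f:
        for i, line in enumerate(f):
            y = line.rstrip("\n").split(sep)[label_col]
            lines_by_class[y].append(i)
    # Random selection within each class, merged into one set
    rng = random.Random(seed)
    selected = set()
    for ids in lines_by_class.values():
        k = max(1, int(len(ids) * frac))
        selected.update(rng.sample(ids, k))
    # Pass 2: stream through again and write the selected lines
    with open(path) as f, open(out_path, "w") as out:
        for i, line in enumerate(f):
            if i in selected:
                out.write(line)
```

Because each class is sampled at the same rate, the output keeps roughly the same Y distribution as the original file.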
From https://www.linkedin.com/groups/77616/77616-6110315650918395908