Nipsuli / two_pass_shuffle.py
Last active November 10, 2021 19:06
Two-pass shuffle implementation of the algorithm described in https://blog.janestreet.com/how-to-shuffle-a-big-dataset/
import contextlib
import tempfile
import random


def two_pass_shuffle(input_files, output_files, temp_file_count, header_lines=0):
    """
    two_pass_shuffle
    Shuffle data that is too large to shuffle in memory.
    Implementation based on: