@riccardo1980
Created December 2, 2018 11:05
Split Pandas data frame by hash of observation features
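import hashlib
import pandas as pd


def split_train_test_by_id(data, test_ratio=0.2, id_column=None, hash=hashlib.md5):
    """
    Reproducible train/test split: each row is assigned by hashing the string
    concatenation of its identifying columns (all columns if id_column is None).
    Returns the tuple (train, test).
    """
    def _test_set_check(identifier, test_ratio, hash):
        # On Python 3 the hash input must be bytes and digest()[-1] is an int in 0..255.
        return hash(identifier.encode('utf-8')).digest()[-1] < 256 * test_ratio

    if id_column is None:
        id_column = data.columns
    # One string id per row; reuse data's index so the boolean mask aligns in .loc.
    ids = pd.Series(data[id_column].astype(str).values.tolist(), index=data.index).str.join('')
    in_test_set = ids.apply(lambda id_: _test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]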
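
The point of hashing each row's identifier is that a row's assignment depends only on that row, so it should not change when more data is added later. The sketch below illustrates that property using only the standard library. The helper name `is_in_test_set` and the `row-N` identifiers are hypothetical stand-ins for the concatenated feature strings, and the helper assumes Python 3 (bytes input, integer digest bytes).

```python
import hashlib


def is_in_test_set(identifier, test_ratio=0.2, hash_fn=hashlib.md5):
    """Return True when the last digest byte falls below 256 * test_ratio."""
    last_byte = hash_fn(identifier.encode("utf-8")).digest()[-1]  # int in 0..255
    return last_byte < 256 * test_ratio


# Hypothetical row identifiers, standing in for concatenated feature strings.
initial_ids = ["row-%d" % i for i in range(1000)]
first_pass = {id_: is_in_test_set(id_) for id_ in initial_ids}

# Later, more rows arrive; the earlier rows are hashed exactly as before.
extended_ids = initial_ids + ["row-%d" % i for i in range(1000, 1500)]
second_pass = {id_: is_in_test_set(id_) for id_ in extended_ids}

# Every original row keeps its train/test assignment.
unchanged = all(first_pass[id_] == second_pass[id_] for id_ in initial_ids)

# The realised test share is only approximately test_ratio.
test_fraction = sum(second_pass.values()) / len(second_pass)
print(unchanged, round(test_fraction, 3))
```

Because only one digest byte is compared, the effective test share moves in steps of 1/256, and rows with identical identifying features always land in the same set.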