@BalazsHoranyi
Last active May 31, 2018 15:39
import ast  # needed for ast.literal_eval below

from distributed import Client, LocalCluster
import dask.dataframe as dd
import numpy as np
# 32 single-threaded workers, each capped at roughly 2 GB of memory
cluster = LocalCluster(ip='0.0.0.0', n_workers=32, threads_per_worker=1, diagnostics_port=8787, memory_limit=2e9)
client = Client(cluster)
print(client)
df = dd.read_parquet('parquet/')
print(f'found {len(df)} interactions')
# 'actor' and 'repo' hold stringified dicts; parse each one and pull out the login / repo name
df['user_id'] = df['actor'].apply(lambda x: ast.literal_eval(x).get('login', 'unknown'), meta=('x', 'U'))
df['repo_id'] = df['repo'].apply(lambda x: ast.literal_eval(x).get('name', 'unknown'), meta=('x', 'U'))
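
The two `apply` calls above assume every `actor` and `repo` value is a well-formed dict literal; a single malformed row would raise inside a worker. Below is a minimal sketch of a more defensive parser, using only the standard library. The helper name `extract_field` and the sample strings are hypothetical and are not part of the original gist.

```python
import ast


def extract_field(raw, key, default="unknown"):
    """Parse a stringified dict and return its value for `key`, or `default`.

    Hypothetical helper: falls back to `default` when `raw` is not a
    parseable dict literal instead of raising.
    """
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return default
    if not isinstance(parsed, dict):
        return default
    return parsed.get(key, default)


# Made-up sample values, shaped like the strings the script appears to expect.
actors = ["{'login': 'alice', 'id': 1}", "{'id': 2}", "not a dict", None]
user_ids = [extract_field(a, "login") for a in actors]
print(user_ids)
```

Assuming the helper is importable on the workers, it could stand in for the lambdas, for example `df['actor'].apply(extract_field, args=('login',), meta=('x', 'U'))`.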