Richard Vencu rvencu

rvencu / cah_stats_spark.py
Created July 16, 2021 20:05 — forked from rom1504/a_cah_to_parquet_pyspark.md
cah_stats_spark.py
'''
Compute some stats on the cah collection.

First, get the shard files with:
lynx -dump -hiddenlinks=listonly -nonumbers http://the-eye.eu/eleuther_staging/cah/ | grep cah | grep .csv > cah.csv
aria2c --dir=shards --auto-file-renaming=false --continue=true -i cah.csv -x 16 -s 16 -j 100
The download takes a few minutes.

Then install PySpark (pip install pyspark) and run this file. It also takes a few minutes to run.
'''
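
# A minimal sketch, not part of the original script: before starting Spark,
# it can help to confirm that the aria2c step described above actually left
# CSV shards on disk. The helper name `list_shards` is hypothetical, and the
# default "shards" directory is assumed from the --dir flag in the docstring.
import glob
import os


def list_shards(shard_dir="shards"):
    """Return the sorted paths of the .csv files found directly in shard_dir."""
    # glob returns an empty list when the directory is missing or has no
    # matching files, so callers can simply test the result for emptiness.
    return sorted(glob.glob(os.path.join(shard_dir, "*.csv")))


# Possible usage (commented out so that importing this file has no side effects):
# shards = list_shards()
# if not shards:
#     raise SystemExit("No CSV shards found; run the download commands first.")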