Reading Wikidata dumps via Spark
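%%time
# Takes around 30 minutes just to show df.head(). This is likely because multiLine
# mode parses each file as a single JSON document, so the dump is not read in parallel.
# `sql` is assumed to be an existing SparkSession (or SQLContext).
wikidata_dump_path = "/path/to/latest-all.json.bz2"
df = sql.read.option("multiLine", "true").json(wikidata_dump_path)
df.head()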
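A possible faster variant, offered as a sketch rather than a tested recipe: the Wikidata JSON dump is one large array with one entity per line, so it can be read as plain text, stripped of the array brackets and trailing commas, and parsed line by line without multiLine mode. The helper names `clean_dump_line` and `read_entities` are hypothetical, and `spark` is assumed to be an existing SparkSession.

```python
import json


def clean_dump_line(line):
    """Return the JSON text of one entity, or None for bracket/empty lines.

    Assumes the dump layout is "[", then one entity per line followed by a
    trailing comma, then "]".
    """
    text = line.strip().rstrip(",")
    if text in ("", "[", "]"):
        return None
    return text


def read_entities(spark, path):
    """Build a DataFrame from the dump without multiLine mode.

    `spark` is assumed to be an existing SparkSession; `path` points at the
    .json.bz2 dump. textFile() decompresses bz2 transparently.
    """
    lines = spark.sparkContext.textFile(path)
    cleaned = lines.map(clean_dump_line).filter(lambda s: s is not None)
    # DataFrameReader.json also accepts an RDD of JSON strings.
    return spark.read.json(cleaned)


# Local check of the cleaning step, no Spark needed:
sample = ["[\n", '{"id": "Q1"},\n', '{"id": "Q2"}\n', "]\n"]
entities = [json.loads(s) for s in map(clean_dump_line, sample) if s is not None]
print([e["id"] for e in entities])
```

Usage would be `df = read_entities(sql, wikidata_dump_path)`. Schema inference still has to scan the data, so passing an explicit schema to the reader may help further.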