@napsternxg
Created August 21, 2020 03:02
Reading Wikidata dumps via Spark
%%time
# Takes around 30 minutes just to show df.head()
# `sql` is assumed to be an existing SparkSession (or SQLContext) in this notebook.
wikidata_dump_path = "/path/to/latest-all.json.bz2"
df = sql.read.option("multiline", "true").json(wikidata_dump_path)
df.head()
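
The slow `head()` above is probably a consequence of `multiline` mode, which makes Spark treat the whole file as one JSON document, so it cannot be split across tasks. The Wikidata JSON dump is one large array, but each entity sits on its own line followed by a comma, so it can also be read line by line. A minimal sketch of that alternative follows. The helper names `parse_dump_line` and `read_entities` are hypothetical, and `spark` is assumed to be a SparkSession; neither appears in the original snippet, and the approach has not been timed against the full dump.

```python
import json


def parse_dump_line(line):
    """Parse one line of the dump into a dict.

    Returns None for the opening "[" and closing "]" of the outer array
    and for blank lines.
    """
    # Entity lines end with a comma because they are elements of one big array.
    line = line.strip().rstrip(",")
    if line in ("[", "]", ""):
        return None
    return json.loads(line)


def read_entities(spark, path):
    """Return an RDD of entity dicts (hypothetical helper).

    Assumes `spark` is a SparkSession and `path` points at a line-oriented
    dump such as latest-all.json.bz2.
    """
    lines = spark.sparkContext.textFile(path)
    return lines.map(parse_dump_line).filter(lambda entity: entity is not None)
```

With these definitions, `read_entities(spark, wikidata_dump_path).take(1)` should only need to decompress the start of the file rather than parse the entire dump. The result is an RDD of Python dicts, not a DataFrame, so a schema would still have to be supplied or inferred before using the DataFrame API.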