@ealmansi
Created October 16, 2015 17:30
Generate Wikipedia Revision Timestamps
// Bash command to launch the Spark shell on YARN with the spark-csv package loaded.
spark-shell --master yarn \
  --driver-memory 50G \
  --num-executors 5 \
  --executor-cores 2 \
  --executor-memory 4G \
  --packages com.databricks:spark-csv_2.10:1.2.0
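// Scala commands to run inside the Spark shell.
// Load the page metadata from Parquet and register it as a temporary table so it can be queried with SQL.
// Note: sqlContext.load is deprecated as of Spark 1.4 in favor of sqlContext.read.
val pageMetadataDF = sqlContext.load("/user/ealmansi/data/enwiki-20150901/parquet/page_metadata", "parquet")
pageMetadataDF.registerTempTable("page_metadata")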
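// Flatten each page's revisions array into one row per revision, keeping only pages in the
// main (ns = 0) and Category (ns = 14) namespaces, then write the result as CSV via spark-csv.
// Note: DataFrame.save is deprecated as of Spark 1.4 in favor of DataFrame.write.
sql("""SELECT
    page_id,
    title,
    ns,
    revision.revision_id,
    revision.timestamp
  FROM page_metadata
  LATERAL VIEW explode(revisions) r AS revision
  WHERE ns = 0 OR ns = 14""")
.save("/user/path/to/a/hadoop/filesystem/directory", "com.databricks.spark.csv")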
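
// Illustration only, not part of the original job: a minimal plain-Scala sketch of what
// LATERAL VIEW explode(revisions) does in the query above, with no Spark involved.
// The case classes, field names, and sample values are hypothetical and only mirror the
// columns the query selects; the real Parquet schema may differ.
case class Revision(revisionId: Long, timestamp: String)
case class Page(pageId: Long, title: String, ns: Int, revisions: Seq[Revision])

val samplePages = Seq(
  Page(1L, "Example article", 0, Seq(
    Revision(10L, "2015-01-01T00:00:00Z"),
    Revision(11L, "2015-02-01T00:00:00Z"))),
  // ns = 1 is a talk page, so the namespace filter drops it.
  Page(2L, "Talk:Example article", 1, Seq(
    Revision(20L, "2015-03-01T00:00:00Z")))
)

// One output row per (page, revision) pair. As with LATERAL VIEW without OUTER,
// a page whose revisions sequence is empty contributes no rows.
val revisionRows = for {
  page <- samplePages if page.ns == 0 || page.ns == 14
  revision <- page.revisions
} yield (page.pageId, page.title, page.ns, revision.revisionId, revision.timestamp)

// Prints two rows here, one for each revision of the ns = 0 sample page.
revisionRows.foreach(println)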