DBLP XML to Parquet using Spark
bin/spark-shell --packages com.databricks:spark-xml_2.11:0.5.0
scala> import com.databricks.spark.xml._
scala> val df = spark.read.option("rowTag", "article").xml("/Users/anastasios/test/dblp/dblp.xml")
scala> val dblp = df.select("_key", "author", "title", "journal", "year")
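
The session above stops at the column selection, so the Parquet step from the title is not shown. A minimal sketch of how it might look, continuing the same spark-shell session; the output path and the `articles` name are assumptions for illustration, not part of the original gist:

```scala
// Continues the spark-shell session above: `spark` is the shell's session and
// `dblp` is the DataFrame selected on the previous line.
// NOTE: the output path is a hypothetical placeholder, not from the original gist.
val outputPath = "/tmp/dblp-articles.parquet"

// Write the selected columns as Parquet, replacing output from any earlier run.
dblp.write.mode("overwrite").parquet(outputPath)

// Read the result back to confirm the schema and row count look sensible.
val articles = spark.read.parquet(outputPath)
articles.printSchema()
println(articles.count())
```

Using `mode("overwrite")` keeps the step re-runnable while experimenting; drop it if an existing output directory should cause the write to fail instead.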