@wbchn
Created May 12, 2016 03:22
Spark Starter

Notebook.1

Zeppelin: add a dependency

e.g., to add the spark-csv package:

%dep
z.reset()
z.addRepo("Spark Packages Repo").url("http://dl.bintray.com/spark-packages/maven")
z.load("com.databricks:spark-csv_2.11:1.4.0")

Zeppelin: CSV table query exception

No exception occurs in pyspark, but issuing a SQL query against the CSV-backed table in Zeppelin raises java.lang.ClassNotFoundException: com.databricks.spark.csv.CsvRelation$$anonfun$1$$anonfun$2

Fix: call cacheTable after registerTempTable:

df_parquet.registerTempTable("click_parquet")
sqlContext.cacheTable("click_parquet")
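The same workaround can be sketched as a small pyspark helper. This is only a sketch: it assumes a Spark 1.x `sqlContext` (as provided by the pyspark shell) and that the spark-csv package has been loaded as shown above. The helper name `register_cached_csv` and the `header`/`inferSchema` settings are illustrative choices, not part of the original notes.

```python
def register_cached_csv(sqlContext, path, table_name):
    """Load a CSV via spark-csv, register it as a temp table, then cache it.

    Caching right after registration is the workaround described above.
    """
    df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .options(header="true", inferSchema="true")  # assumed CSV layout
          .load(path))
    df.registerTempTable(table_name)
    # Cache immediately so later SQL queries read from the cached table.
    sqlContext.cacheTable(table_name)
    return df
```

In a notebook this might be called as `register_cached_csv(sqlContext, "/path/to/clicks.csv", "click_csv")`, where the path and table name are placeholders.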

IPython for pyspark

Of course, install IPython first: sudo pip-2.7 install ipython. Then start pyspark with IPYTHON=1 pyspark.

Add package for pyspark

e.g.: IPYTHON=1 pyspark --packages com.databricks:spark-csv_2.11:1.4.0

If you use --jars instead, make sure the jar files exist at the paths you pass.
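One way to catch a missing jar before launching pyspark is to validate the paths first. The sketch below is a hypothetical helper (`jars_argument` is not part of Spark) that builds the comma-separated value expected by --jars and fails early if any file is absent.

```python
import os


def jars_argument(jar_paths):
    """Return a comma-separated string for --jars, or raise if a jar is missing."""
    missing = [p for p in jar_paths if not os.path.isfile(p)]
    if missing:
        raise IOError("jar not found: " + ", ".join(missing))
    return ",".join(jar_paths)
```

The result can then be pasted into the command line, e.g. `IPYTHON=1 pyspark --jars <value>`.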
