Simeon Simeonov (ssimeonov)
➜ dev spark-1.4.1-bin-hadoop2.6/bin/spark-sql --packages "com.databricks:spark-csv_2.10:1.0.3,com.lihaoyi:pprint_2.10:0.3.4" --driver-memory 4g --conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=512m" --conf "spark.local.dir=/Users/sim/tmp" --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem
Ivy Default Cache set to: /Users/sim/.ivy2/cache
The jars for the packages stored in: /Users/sim/.ivy2/jars
:: loading settings :: url = jar:file:/Users/sim/dev/spark-1.4.1-bin-hadoop2.6/lib/spark-assembly-1.4.1-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
com.lihaoyi#pprint_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.databricks#spark-csv_2.10;1.0.3 in central
found org.apache.commons#commons-csv;1.1 in central
15/08/05 02:48:16 INFO SparkILoop: Created sql context (with Hive support)..
SQL context available as sqlContext.
15/08/05 02:48:16 INFO HiveContext: Initializing execution hive, version 0.13.1

Spark exceptions later on cause persistent I/O problems

When using spark-shell in local mode, I've observed the following behavior on a number of nodes:

  1. Some operation generates an exception related to Spark SQL processing via HiveContext.
  2. From that point on, nothing can be written to Hive with saveAsTable.
  3. A different, identically configured Spark version installed on the same machine may not exhibit the problem.
  4. A fresh, identically configured installation of the same Spark version on the same machine still exhibits the problem.

The behavior is difficult to reproduce reliably, but it shows up consistently during extended Spark SQL experimentation. The sketch below illustrates the pattern.
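A minimal sketch of the failure pattern, pasted into spark-shell. This is an illustration, not the original session: the triggering query, table name, and data are made up.

// Paste into spark-shell (Spark 1.4.x with Hive support)
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode
val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

// Step 1: some Spark SQL operation fails with a Hive-related exception,
// e.g., selecting from a table that does not exist.
try {
  ctx.sql("select * from no_such_table").collect()
} catch {
  case e: Exception => println(e)
}

// Step 2: writes that previously succeeded now fail with I/O errors,
// and keep failing for the lifetime of this installation.
val df = Seq((1, "a"), (2, "b")).toDF("id", "label")
df.write.mode(SaveMode.Overwrite).saveAsTable("io_error_repro")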

ssimeonov / code.scala
Last active September 24, 2018 06:22
SPARK-9343: DROP IF EXISTS throws if a table is missing
import org.apache.spark.sql.hive.HiveContext
val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._
// Table test is not present
ctx.tableNames

// Despite IF EXISTS, the drop below logs:
// ERROR Hive: NoSuchObjectException(message:default.test table not found)
ctx.sql("drop table if exists test")
ssimeonov / code.scala
Last active August 29, 2015 14:25
SPARK-9342 Spark SQL problems dealing with views
// This code is designed to be pasted in spark-shell in a *nix environment
// On Windows, replace sys.env("HOME") with a directory of your choice
import java.io.File
import java.io.PrintWriter
import org.apache.spark.sql.hive.HiveContext
val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._
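The snippet above is truncated. As a hypothetical continuation, a view round-trip such a test might exercise could look like the following; the table and view names are made up, and this is not the original gist code.

// Hypothetical sketch: back a Hive view with a saved table, then query through the view.
import org.apache.spark.sql.SaveMode
val df = Seq((1, "a"), (2, "b")).toDF("id", "label")
df.write.mode(SaveMode.Overwrite).saveAsTable("view_test_table")
ctx.sql("create view view_test_view as select id, label from view_test_table")
ctx.sql("select * from view_test_view").show()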
ssimeonov / code.scala
Last active August 29, 2015 14:25
I/O error in saveAsTable
// This code is pasted into spark-shell
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode
val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._
val devRoot = "/home/ubuntu/spx"
ctx.
jsonFile("file://" + devRoot + "/data/swoop-ml-nlp/dimensions/component_variations.jsonlines").
ssimeonov / a_shell_test.scala
Last active April 16, 2019 08:25
SPARK-9210 test: Spark SQL first() vs. first_value()
// This code is designed to be pasted in spark-shell in a *nix environment
// On Windows, replace sys.env("HOME") with a directory of your choice
import java.io.File
import java.io.PrintWriter
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode
val ctx = sqlContext.asInstanceOf[HiveContext]
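This snippet is also cut off. A hedged illustration of the comparison in the title, with made-up data rather than the original gist queries:

// Hypothetical sketch comparing the two functions.
import ctx.implicits._
val df = Seq((1, "b"), (1, "a"), (2, "c")).toDF("k", "v")
df.registerTempTable("kv")
// first() as a grouping aggregate...
ctx.sql("select k, first(v) from kv group by k").show()
// ...vs. first_value(), a HiveQL window function (window functions require HiveContext in 1.4).
ctx.sql("select k, first_value(v) over (partition by k order by v) from kv").show()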

Spark 1.4.0 regression: out-of-memory conditions on small data

A very simple Spark SQL COUNT operation succeeds in spark-shell for 1.3.1 and fails with a series of out-of-memory errors in 1.4.0.

The data in question is a single file of 88,283 JSON objects with at most 109 fields per object. Size on disk is 181 MB.

This gist includes the code and the full output from the 1.3.1 and 1.4.0 runs, including the command line showing how spark-shell is started.
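For reference, a minimal sketch of the operation described; the path is illustrative, and the actual code and full 1.3.1/1.4.0 output are in the gist itself.

// Load the JSON-lines file and count rows (same API on 1.3.1 and 1.4.0).
val df = sqlContext.jsonFile("file:///path/to/objects.jsonlines")
df.registerTempTable("objects")
sqlContext.sql("select count(*) from objects").show()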

➜ jq git:(master) ✗ make clean
rm -f jq
test -z "libjq.la " || rm -f libjq.la
rm -f ./so_locations
rm -rf .libs _libs
rm -f version.h .remake-version-h
rm -f *.o
test -z "tests/all.log" || rm -f tests/all.log
test -z "tests/all.trs" || rm -f tests/all.trs
test -z "test-suite.log" || rm -f test-suite.log

Ad hoc setup for a Swoop ML experimentation machine

Run the following:

curl 'https://gist.githubusercontent.com/ssimeonov/2319ecb00d825d6f5c78/raw/2bf43b3c5b766b9ce16f647fadbd7b423234f210/aws_ml_setup.sh' | bash -v

If the script exits without an error immediately after installing some packages, run it again.