Simeon Simeonov (ssimeonov)
➜ dev spark-1.4.1-bin-hadoop2.6/bin/spark-sql --packages "com.databricks:spark-csv_2.10:1.0.3,com.lihaoyi:pprint_2.10:0.3.4" --driver-memory 4g --conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=512m" --conf "spark.local.dir=/Users/sim/tmp" --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem
Ivy Default Cache set to: /Users/sim/.ivy2/cache
The jars for the packages stored in: /Users/sim/.ivy2/jars
:: loading settings :: url = jar:file:/Users/sim/dev/spark-1.4.1-bin-hadoop2.6/lib/spark-assembly-1.4.1-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
com.lihaoyi#pprint_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.databricks#spark-csv_2.10;1.0.3 in central
found org.apache.commons#commons-csv;1.1 in central
15/08/05 02:48:16 INFO SparkILoop: Created sql context (with Hive support)..
SQL context available as sqlContext.
15/08/05 02:48:16 INFO HiveContext: Initializing execution hive, version 0.13.1

Spark exceptions later on cause persistent I/O problems

When using spark-shell in local mode, I've observed the following behavior on a number of nodes:

  1. Some operation generates an exception related to Spark SQL processing via HiveContext.
  2. From that point on, nothing can be written to Hive with saveAsTable.
  3. A different, identically configured Spark version installed on the same machine may not exhibit the problem.
  4. A fresh, identically configured installation of the same Spark version on the same machine still exhibits the problem.

The behavior is difficult to reproduce reliably, but it shows up consistently during extended Spark SQL experimentation. The sketch below illustrates the pattern.
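A minimal sketch of the failure pattern, pasted into spark-shell. This is an illustration, not the original session: the triggering query, table name, and data are made up.

// Paste into spark-shell (Spark 1.4.x with Hive support)
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode
val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._

// Step 1: some Spark SQL operation fails with a Hive-related exception,
// e.g., selecting from a table that does not exist.
try {
  ctx.sql("select * from no_such_table").collect()
} catch {
  case e: Exception => println(e)
}

// Step 2: writes that previously succeeded now fail with I/O errors,
// and keep failing for the lifetime of this installation.
val df = Seq((1, "a"), (2, "b")).toDF("id", "label")
df.write.mode(SaveMode.Overwrite).saveAsTable("io_error_repro")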

ssimeonov / code.scala
Last active September 24, 2018 06:22
SPARK-9343: DROP IF EXISTS throws if a table is missing
import org.apache.spark.sql.hive.HiveContext
val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._
// Table test is not present
ctx.tableNames

// Despite IF EXISTS, the drop below logs:
// ERROR Hive: NoSuchObjectException(message:default.test table not found)
ctx.sql("drop table if exists test")
ssimeonov / code.scala
Last active August 29, 2015 14:25
SPARK-9342 Spark SQL problems dealing with views
// This code is designed to be pasted in spark-shell in a *nix environment
// On Windows, replace sys.env("HOME") with a directory of your choice
import java.io.File
import java.io.PrintWriter
import org.apache.spark.sql.hive.HiveContext
val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._
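The snippet above is truncated. As a hypothetical continuation, a view round-trip such a test might exercise could look like the following; the table and view names are made up, and this is not the original gist code.

// Hypothetical sketch: back a Hive view with a saved table, then query through the view.
import org.apache.spark.sql.SaveMode
val df = Seq((1, "a"), (2, "b")).toDF("id", "label")
df.write.mode(SaveMode.Overwrite).saveAsTable("view_test_table")
ctx.sql("create view view_test_view as select id, label from view_test_table")
ctx.sql("select * from view_test_view").show()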
ssimeonov / code.scala
Last active August 29, 2015 14:25
I/O error in saveAsTable
// This code is pasted into spark-shell
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode
val ctx = sqlContext.asInstanceOf[HiveContext]
import ctx.implicits._
val devRoot = "/home/ubuntu/spx"
ctx.
jsonFile("file://" + devRoot + "/data/swoop-ml-nlp/dimensions/component_variations.jsonlines").
ssimeonov / a_shell_test.scala
Last active April 16, 2019 08:25
SPARK-9210 test: Spark SQL first() vs. first_value()
// This code is designed to be pasted in spark-shell in a *nix environment
// On Windows, replace sys.env("HOME") with a directory of your choice
import java.io.File
import java.io.PrintWriter
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.SaveMode
val ctx = sqlContext.asInstanceOf[HiveContext]
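This snippet is also cut off. A hedged illustration of the comparison in the title, with made-up data rather than the original gist queries:

// Hypothetical sketch comparing the two functions.
import ctx.implicits._
val df = Seq((1, "b"), (1, "a"), (2, "c")).toDF("k", "v")
df.registerTempTable("kv")
// first() as a grouping aggregate...
ctx.sql("select k, first(v) from kv group by k").show()
// ...vs. first_value(), a HiveQL window function (window functions require HiveContext in 1.4).
ctx.sql("select k, first_value(v) over (partition by k order by v) from kv").show()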

Spark 1.4.0 regression: out-of-memory conditions on small data

A very simple Spark SQL COUNT operation succeeds in spark-shell for 1.3.1 and fails with a series of out-of-memory errors in 1.4.0.

The data in question is a single file of 88,283 JSON objects with at most 109 fields per object. Size on disk is 181 MB.

This gist includes the code and the full output from the 1.3.1 and 1.4.0 runs, including the command line showing how spark-shell is started.
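For reference, a minimal sketch of the operation described; the path is illustrative, and the actual code and full 1.3.1/1.4.0 output are in the gist itself.

// Load the JSON-lines file and count rows (same API on 1.3.1 and 1.4.0).
val df = sqlContext.jsonFile("file:///path/to/objects.jsonlines")
df.registerTempTable("objects")
sqlContext.sql("select count(*) from objects").show()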

➜ jq git:(master) ✗ make clean
rm -f jq
test -z "libjq.la " || rm -f libjq.la
rm -f ./so_locations
rm -rf .libs _libs
rm -f version.h .remake-version-h
rm -f *.o
test -z "tests/all.log" || rm -f tests/all.log
test -z "tests/all.trs" || rm -f tests/all.trs
test -z "test-suite.log" || rm -f test-suite.log

Ad hoc setup for a Swoop ML experimentation machine

Run the following:

curl 'https://gist.githubusercontent.com/ssimeonov/2319ecb00d825d6f5c78/raw/2bf43b3c5b766b9ce16f647fadbd7b423234f210/aws_ml_setup.sh' | bash -v

If the script exits without an error immediately after installing some packages, run it again.