Simeon Simeonov ssimeonov

ssimeonov / 0 mvn_output.txt
Last active Apr 15, 2017
xgboost-spark test error on Mac OSX Yosemite with gcc/g++-6 with OpenMP support
➜ jvm-packages git:(master) ✗ mvn -Dspark.version=2.1.0 package
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
[INFO] Scanning for projects...
[WARNING] Some problems were encountered while building the effective model for ml.dmlc:xgboost4j:jar:0.7
[WARNING] 'build.plugins.plugin.version' for org.codehaus.mojo:exec-maven-plugin is missing. @ line 40, column 29
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
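The missing-version warning can be silenced by pinning an explicit version for the plugin in jvm-packages/xgboost4j/pom.xml. A sketch of the fix (the version shown is one released version of the plugin, chosen here as an example, not prescribed by the log):

```xml
<!-- In the <build><plugins> section: pin exec-maven-plugin explicitly so
     Maven stops warning about the missing 'build.plugins.plugin.version'. -->
<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>exec-maven-plugin</artifactId>
  <version>1.6.0</version>
</plugin>
```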
ssimeonov / distributedFileListing.scala
Last active Jul 26, 2016
Distributed file listing using Spark and the Hadoop file system APIs
case class FInfo(
    path: String,
    parent: String,
    isDir: Boolean,
    size: Long,
    modificationTime: Long,
    partitions: Map[String, String]) {
  // @todo encoding issues
  def hasExt(ext: String): Boolean = path.endsWith(ext)
}
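The gist's listing logic is truncated in this preview. As a hedged sketch, one way the Hadoop file system API could populate `FInfo` records (`listInfos` and its details are illustrative names, not the gist's actual code):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative helper (not the gist's code): list one directory and convert
// each Hadoop FileStatus into the FInfo shape defined above.
def listInfos(dir: String): Seq[FInfo] = {
  val fs = FileSystem.get(new Configuration())
  fs.listStatus(new Path(dir)).toSeq.map { st =>
    FInfo(
      path = st.getPath.toString,
      parent = st.getPath.getParent.toString,
      isDir = st.isDirectory,
      size = st.getLen,
      modificationTime = st.getModificationTime,
      partitions = Map.empty // partition-value parsing omitted in this sketch
    )
  }
}
```

In the distributed version described by the gist's title, a sequence of directories would be parallelized with Spark and each executor would run a listing like this against its slice.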
ssimeonov / DataFrameFunctions.scala
Last active Sep 1, 2016
Querying DataFrame with SQL without explicit registration of a temporary table
object DataFrameFunctions {
  final val TEMP_TABLE_PLACEHOLDER = "~tbl~"

  /** Executes a SQL statement on the dataframe.
    * Behind the scenes, it registers and cleans up a temporary table.
    * @param df input dataframe
    * @param stmtTemplate SQL statement template that uses the value of
    *                     `TEMP_TABLE_PLACEHOLDER` for the table name.
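The preview cuts off before the method body. One plausible implementation consistent with the docstring (an assumption, not the gist's actual code; `registerTempTable`/`dropTempTable` are the Spark 1.x-era APIs this gist predates 2.0 with):

```scala
import org.apache.spark.sql.DataFrame

// Assumed implementation sketch: register the dataframe under a unique
// temporary name, substitute that name into the template, run the query,
// then drop the temporary table.
def sql(df: DataFrame, stmtTemplate: String): DataFrame = {
  val tableName = "tmp_" + java.util.UUID.randomUUID.toString.replace("-", "")
  df.registerTempTable(tableName)
  val result = df.sqlContext.sql(
    stmtTemplate.replace(TEMP_TABLE_PLACEHOLDER, tableName))
  result.schema // force analysis before the temporary table goes away
  df.sqlContext.dropTempTable(tableName)
  result
}
```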
spark_sql_test_failures.txt
➜ spark git:(master) ✗ build/sbt sql/test
Using /Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home as default JAVA_HOME.
Note, this will be overridden by -java-home if it is set.
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
[info] Loading global plugins from /Users/sim/.sbt/0.13/plugins
[info] Loading project definition from /Users/sim/dev/spx/spark/project/project
[info] Loading project definition from /Users/sim/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader/project
[warn] Multiple resolvers having different access mechanism configured with same name 'sbt-plugin-releases'. To avoid conflict, Remove duplicate project resolvers (`resolvers`) or rename publishing resolver (`publishTo`).
[info] Loading project definition from /Users/sim/dev/spx/spark/project
[info] Set current project to spark-parent (in build file:/Users/sim/dev/spx/spark/)
spark_test_failures.txt
[info] spark-streaming: found 30 potential binary incompatibilities (filtered 8)
[error] * method delaySeconds()Int in class org.apache.spark.streaming.Checkpoint does not have a correspondent in new version
[error] filter with: ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.streaming.Checkpoint.delaySeconds")
[error] * class org.apache.spark.streaming.receiver.ActorSupervisorStrategy does not have a correspondent in new version
[error] filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.streaming.receiver.ActorSupervisorStrategy")
[error] * object org.apache.spark.streaming.receiver.IteratorData does not have a correspondent in new version
[error] filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.streaming.receiver.IteratorData$")
[error] * class org.apache.spark.streaming.receiver.ByteBufferData does not have a correspondent in new version
[error] filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.streami
ContrivedAdd.scala
object ContrivedAdd {
  import shapeless._
  import record._
  import syntax.singleton._
  import shapeless.ops.record.Updater
  import scalaz._
  import Scalaz._

  case class S[L <: HList](total: Int, scratch: L)
shapeless-transformations.scala
package ss

object ContrivedAdd {
  import shapeless._
  import record._
  import syntax.singleton._
  import scalaz._
  import Scalaz._
ssimeonov / databricks.scala
Created Jan 8, 2016
Some improvements to Databricks' Scala notebook capabilities.
val ctx = sqlContext
import ctx.implicits._
// With nested structs, sometimes JSON is a much more readable form than display()
def showall(df: DataFrame, num: Int): Unit = df.limit(num).toJSON.collect.foreach(println)
def showall(sql: String, num: Int = 100): Unit = showall(ctx.sql(sql), num)
def hivePath(name: String) = s"/user/hive/warehouse/$name"
// Bug workaround
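A small usage sketch for the helpers above (the table name is made up; `hivePath` is repeated here so the snippet stands alone):

```scala
// hivePath simply prefixes the default Hive warehouse directory:
def hivePath(name: String): String = s"/user/hive/warehouse/$name"

println(hivePath("events")) // prints /user/hive/warehouse/events

// showall is meant to be called from a notebook cell, e.g.:
// showall("SELECT * FROM events", num = 10)
```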

# Scala .hashCode vs. MurmurHash3 for Spark's MLlib

This is a simple test of two hashing functions:

  • Scala's native implementation (obj.##), used in HashingTF
  • MurmurHash3, included in Scala, used by Vowpal Wabbit and many others

The test uses the aspell dictionary generated with the "insane" setting, which produces 676,547 entries, and explores the following grid:

  • Feature vector sizes: 2^18 through 2^22
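The core of the comparison can be sketched as follows. The non-negative modulo mirrors how a signed hash is folded into a vector index; the helper names are illustrative, not the gist's code:

```scala
import scala.util.hashing.MurmurHash3

// Map a term into a feature vector of size 2^18 with each hash function.
val numFeatures = 1 << 18

// Fold a possibly negative Int hash into [0, mod).
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

def nativeBucket(term: String): Int =
  nonNegativeMod(term.##, numFeatures) // Scala's native hash, as in HashingTF

def murmurBucket(term: String): Int =
  nonNegativeMod(MurmurHash3.stringHash(term), numFeatures)

println(nativeBucket("spark"))
println(murmurBucket("spark"))
```

Running both bucket functions over the dictionary and counting collisions per vector size reproduces the grid described above.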
spark_invalid_column_reference.txt
➜ dev spark-1.4.1-bin-hadoop2.6/bin/spark-sql --packages "com.databricks:spark-csv_2.10:1.0.3,com.lihaoyi:pprint_2.10:0.3.4" --driver-memory 4g --conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=512m" --conf "spark.local.dir=/Users/sim/tmp" --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem
Ivy Default Cache set to: /Users/sim/.ivy2/cache
The jars for the packages stored in: /Users/sim/.ivy2/jars
:: loading settings :: url = jar:file:/Users/sim/dev/spark-1.4.1-bin-hadoop2.6/lib/spark-assembly-1.4.1-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
com.lihaoyi#pprint_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.databricks#spark-csv_2.10;1.0.3 in central
found org.apache.commons#commons-csv;1.1 in central