Simeon Simeonov ssimeonov

ssimeonov / 0 mvn_output.txt
Last active Apr 15, 2017
xgboost-spark test error on Mac OSX Yosemite with gcc/g++-6 with OpenMP support
➜ jvm-packages git:(master) ✗ mvn -Dspark.version=2.1.0 package
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512M; support was removed in 8.0
[INFO] Scanning for projects...
[WARNING] Some problems were encountered while building the effective model for ml.dmlc:xgboost4j:jar:0.7
[WARNING] 'build.plugins.plugin.version' for org.codehaus.mojo:exec-maven-plugin is missing. @ line 40, column 29
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
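The missing-version warning can be silenced by pinning an explicit version for the plugin in jvm-packages/xgboost4j/pom.xml. A sketch of the fix (the version shown is one released version of the plugin, chosen here as an example, not prescribed by the log):

```xml
<!-- In the <build><plugins> section: pin exec-maven-plugin explicitly so
     Maven stops warning about the missing 'build.plugins.plugin.version'. -->
<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>exec-maven-plugin</artifactId>
  <version>1.6.0</version>
</plugin>
```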
ssimeonov / distributedFileListing.scala
Last active Jul 26, 2016
Distributed file listing using Spark and the Hadoop file system APIs
case class FInfo(
    path: String,
    parent: String,
    isDir: Boolean,
    size: Long,
    modificationTime: Long,
    partitions: Map[String, String]) {
  // @todo encoding issues
  def hasExt(ext: String): Boolean = path.endsWith(ext)
}
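The gist's listing logic is truncated in this preview. As a hedged sketch, one way the Hadoop file system API could populate `FInfo` records (`listInfos` and its details are illustrative names, not the gist's actual code):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative helper (not the gist's code): list one directory and convert
// each Hadoop FileStatus into the FInfo shape defined above.
def listInfos(dir: String): Seq[FInfo] = {
  val fs = FileSystem.get(new Configuration())
  fs.listStatus(new Path(dir)).toSeq.map { st =>
    FInfo(
      path = st.getPath.toString,
      parent = st.getPath.getParent.toString,
      isDir = st.isDirectory,
      size = st.getLen,
      modificationTime = st.getModificationTime,
      partitions = Map.empty // partition-value parsing omitted in this sketch
    )
  }
}
```

In the distributed version described by the gist's title, a sequence of directories would be parallelized with Spark and each executor would run a listing like this against its slice.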
ssimeonov / DataFrameFunctions.scala
Last active Sep 1, 2016
Querying DataFrame with SQL without explicit registration of a temporary table
object DataFrameFunctions {
  final val TEMP_TABLE_PLACEHOLDER = "~tbl~"

  /** Executes a SQL statement on the dataframe.
    * Behind the scenes, it registers and cleans up a temporary table.
    * @param df input dataframe
    * @param stmtTemplate SQL statement template that uses the value of
    *                     `TEMP_TABLE_PLACEHOLDER` for the table name.
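The preview cuts off before the method body. One plausible implementation consistent with the docstring (an assumption, not the gist's actual code; `registerTempTable`/`dropTempTable` are the Spark 1.x-era APIs this gist predates 2.0 with):

```scala
import org.apache.spark.sql.DataFrame

// Assumed implementation sketch: register the dataframe under a unique
// temporary name, substitute that name into the template, run the query,
// then drop the temporary table.
def sql(df: DataFrame, stmtTemplate: String): DataFrame = {
  val tableName = "tmp_" + java.util.UUID.randomUUID.toString.replace("-", "")
  df.registerTempTable(tableName)
  val result = df.sqlContext.sql(
    stmtTemplate.replace(TEMP_TABLE_PLACEHOLDER, tableName))
  result.schema // force analysis before the temporary table goes away
  df.sqlContext.dropTempTable(tableName)
  result
}
```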
spark_sql_test_failures.txt
➜ spark git:(master) ✗ build/sbt sql/test
Using /Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home as default JAVA_HOME.
Note, this will be overridden by -java-home if it is set.
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; support was removed in 8.0
[info] Loading global plugins from /Users/sim/.sbt/0.13/plugins
[info] Loading project definition from /Users/sim/dev/spx/spark/project/project
[info] Loading project definition from /Users/sim/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader/project
[warn] Multiple resolvers having different access mechanism configured with same name 'sbt-plugin-releases'. To avoid conflict, Remove duplicate project resolvers (`resolvers`) or rename publishing resolver (`publishTo`).
[info] Loading project definition from /Users/sim/dev/spx/spark/project
[info] Set current project to spark-parent (in build file:/Users/sim/dev/spx/spark/)
spark_test_failures.txt
[info] spark-streaming: found 30 potential binary incompatibilities (filtered 8)
[error] * method delaySeconds()Int in class org.apache.spark.streaming.Checkpoint does not have a correspondent in new version
[error] filter with: ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.streaming.Checkpoint.delaySeconds")
[error] * class org.apache.spark.streaming.receiver.ActorSupervisorStrategy does not have a correspondent in new version
[error] filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.streaming.receiver.ActorSupervisorStrategy")
[error] * object org.apache.spark.streaming.receiver.IteratorData does not have a correspondent in new version
[error] filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.streaming.receiver.IteratorData$")
[error] * class org.apache.spark.streaming.receiver.ByteBufferData does not have a correspondent in new version
[error] filter with: ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.streami
ContrivedAdd.scala
object ContrivedAdd {
  import shapeless._
  import record._
  import syntax.singleton._
  import shapeless.ops.record.Updater
  import scalaz._
  import Scalaz._

  case class S[L <: HList](total: Int, scratch: L)
shapeless-transformations.scala
package ss

object ContrivedAdd {
  import shapeless._
  import record._
  import syntax.singleton._
  import scalaz._
  import Scalaz._
ssimeonov / databricks.scala
Created Jan 8, 2016
Some improvements to Databricks' Scala notebook capabilities.
val ctx = sqlContext
import ctx.implicits._
// With nested structs, sometimes JSON is a much more readable form than display()
def showall(df: DataFrame, num: Int): Unit = df.limit(num).toJSON.collect.foreach(println)
def showall(sql: String, num: Int = 100): Unit = showall(ctx.sql(sql), num)
def hivePath(name: String) = s"/user/hive/warehouse/$name"
// Bug workaround
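A small usage sketch for the helpers above (the table name is made up; `hivePath` is repeated here so the snippet stands alone):

```scala
// hivePath simply prefixes the default Hive warehouse directory:
def hivePath(name: String): String = s"/user/hive/warehouse/$name"

println(hivePath("events")) // prints /user/hive/warehouse/events

// showall is meant to be called from a notebook cell, e.g.:
// showall("SELECT * FROM events", num = 10)
```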

# Scala .hashCode vs. MurmurHash3 for Spark's MLlib

This is a simple test of two hashing functions:

  • Scala's native implementation (obj.##), used in HashingTF
  • MurmurHash3, included in Scala, used by Vowpal Wabbit and many others

The test uses the aspell dictionary generated with the "insane" setting, which produces 676,547 entries, and explores the following grid:

  • Feature vector sizes: 2^18 through 2^22
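The core of the comparison can be sketched as follows. The non-negative modulo mirrors how a signed hash is folded into a vector index; the helper names are illustrative, not the gist's code:

```scala
import scala.util.hashing.MurmurHash3

// Map a term into a feature vector of size 2^18 with each hash function.
val numFeatures = 1 << 18

// Fold a possibly negative Int hash into [0, mod).
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

def nativeBucket(term: String): Int =
  nonNegativeMod(term.##, numFeatures) // Scala's native hash, as in HashingTF

def murmurBucket(term: String): Int =
  nonNegativeMod(MurmurHash3.stringHash(term), numFeatures)

println(nativeBucket("spark"))
println(murmurBucket("spark"))
```

Running both bucket functions over the dictionary and counting collisions per vector size reproduces the grid described above.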
spark_invalid_column_reference.txt
➜ dev spark-1.4.1-bin-hadoop2.6/bin/spark-sql --packages "com.databricks:spark-csv_2.10:1.0.3,com.lihaoyi:pprint_2.10:0.3.4" --driver-memory 4g --conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=512m" --conf "spark.local.dir=/Users/sim/tmp" --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem
Ivy Default Cache set to: /Users/sim/.ivy2/cache
The jars for the packages stored in: /Users/sim/.ivy2/jars
:: loading settings :: url = jar:file:/Users/sim/dev/spark-1.4.1-bin-hadoop2.6/lib/spark-assembly-1.4.1-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
com.lihaoyi#pprint_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.databricks#spark-csv_2.10;1.0.3 in central
found org.apache.commons#commons-csv;1.1 in central