Simeon Simeonov ssimeonov

➜ dev spark-1.4.1-bin-hadoop2.6/bin/spark-sql --packages "com.databricks:spark-csv_2.10:1.0.3,com.lihaoyi:pprint_2.10:0.3.4" --driver-memory 4g --conf "spark.driver.extraJavaOptions=-XX:MaxPermSize=512m" --conf "spark.local.dir=/Users/sim/tmp" --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem
Ivy Default Cache set to: /Users/sim/.ivy2/cache
The jars for the packages stored in: /Users/sim/.ivy2/jars
:: loading settings :: url = jar:file:/Users/sim/dev/spark-1.4.1-bin-hadoop2.6/lib/spark-assembly-1.4.1-hadoop2.6.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.databricks#spark-csv_2.10 added as a dependency
com.lihaoyi#pprint_2.10 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found com.databricks#spark-csv_2.10;1.0.3 in central
found org.apache.commons#commons-csv;1.1 in central

Scala .hashCode vs. MurmurHash3 for Spark's MLlib

This is a simple test of two hashing functions:

  • Scala's native implementation (obj.##), used in HashingTF
  • MurmurHash3, included in Scala, used by Vowpal Wabbit and many others

The test uses the aspell dictionary generated with the "insane" setting (download), which produces 676,547 entries, and explores the following grid:

  • Feature vector sizes: 2^18 through 2^22
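As a rough illustration of the kind of comparison described above, here is a minimal sketch in Ruby (not the gist's Scala code). It makes several assumptions: `java_hash_code` mimics Java's `String.hashCode`, which is what Scala's `##` should return for a plain string; `murmur3_32` is the reference MurmurHash3 x86_32 over UTF-8 bytes, which is not the same as Scala's `MurmurHash3.stringHash`; and the word list is a hypothetical stand-in for the aspell dictionary. All function names are illustrative.

```ruby
MASK32 = 0xFFFFFFFF

# Java-style String.hashCode over UTF-16 code units, as a signed 32-bit value.
# Assumption: this matches what Scala's `##` yields for a String.
def java_hash_code(str)
  h = str.encode("UTF-16BE").unpack("n*").reduce(0) { |acc, c| (31 * acc + c) & MASK32 }
  h >= 0x80000000 ? h - 0x100000000 : h
end

# 32-bit left rotation.
def rotl32(x, r)
  ((x << r) | (x >> (32 - r))) & MASK32
end

# Reference MurmurHash3 x86_32 over the string's bytes (unsigned result).
def murmur3_32(str, seed = 0)
  bytes = str.b.bytes
  h = seed & MASK32
  mix = lambda do |k|
    k = (k * 0xcc9e2d51) & MASK32
    k = rotl32(k, 15)
    (k * 0x1b873593) & MASK32
  end
  full = bytes.length - bytes.length % 4
  # Body: consume 4-byte little-endian blocks.
  bytes.first(full).each_slice(4) do |b0, b1, b2, b3|
    h ^= mix.call(b0 | (b1 << 8) | (b2 << 16) | (b3 << 24))
    h = rotl32(h, 13)
    h = (h * 5 + 0xe6546b64) & MASK32
  end
  # Tail: the remaining 1 to 3 bytes, if any.
  tail = bytes.drop(full)
  unless tail.empty?
    k = 0
    tail.each_with_index { |b, i| k |= b << (8 * i) }
    h ^= mix.call(k)
  end
  # Finalization mix.
  h ^= bytes.length
  h ^= h >> 16
  h = (h * 0x85ebca6b) & MASK32
  h ^= h >> 13
  h = (h * 0xc2b2ae35) & MASK32
  h ^ (h >> 16)
end

# Number of words that land in a bucket already taken by another word.
def collisions(words, num_features)
  buckets = words.map { |w| yield(w) % num_features }
  words.length - buckets.uniq.length
end

# Hypothetical word list; the gist uses the aspell dictionary instead.
words = %w[spark hashing feature vector murmur]
[2**18, 2**22].each do |n|
  native = collisions(words, n) { |w| java_hash_code(w) }
  murmur = collisions(words, n) { |w| murmur3_32(w) }
  puts "n=#{n}: hashCode collisions=#{native}, murmur3 collisions=#{murmur}"
end
```

Ruby's `%` already returns a non-negative result for a positive modulus, so negative hash codes map to valid bucket indices without extra handling.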
ssimeonov / super_simple_sqs.rb
Created April 4, 2011 01:33
This is a simple way to post a message to Amazon SQS w/o any gems (the queue must already exist)
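require "openssl"
require "base64"

module Amazon
  module Authentication
    SIGNATURE_VERSION = "2"
    # OpenSSL::Digest::Digest is gone from current Ruby releases; OpenSSL::Digest is the supported form
    @@digest = OpenSSL::Digest.new("sha256")

    # Returns the Base64-encoded HMAC-SHA256 signature of auth_string
    def sign(auth_string)
      Base64.encode64(OpenSSL::HMAC.digest(digester, aws_secret_access_key, auth_string)).strip
    end

    def digester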
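The signing step in this preview can be tried on its own. The following is a minimal standalone sketch, not the gist's module: `sign_request` is a hypothetical helper name, and the secret key, host, and query string are placeholder values. The string to sign follows the Signature Version 2 layout (HTTP verb, host, path, and canonical query string joined by newlines).

```ruby
require "openssl"
require "base64"

# Hypothetical helper: HMAC-SHA256 over the string to sign, Base64-encoded.
def sign_request(secret_key, string_to_sign)
  digest = OpenSSL::Digest.new("sha256")
  Base64.strict_encode64(OpenSSL::HMAC.digest(digest, secret_key, string_to_sign))
end

# Placeholder request parts, for illustration only.
string_to_sign = [
  "GET",
  "sqs.us-east-1.amazonaws.com",
  "/",
  "Action=SendMessage&SignatureMethod=HmacSHA256&SignatureVersion=2"
].join("\n")

signature = sign_request("example-secret-key", string_to_sign)
puts signature
```

`Base64.strict_encode64` never inserts line breaks, so no trailing `strip` is needed. For a 32-byte SHA-256 digest it yields the same text as `encode64(...).strip`.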
ssimeonov / gist:1425931
Created December 3, 2011 03:39
Cool bash color prompt for Git fans
# git will show dirty branches
GIT_PS1_SHOWDIRTYSTATE=true
# Directory shortening for prompt
# Edit this based on the directory structures you like to use
function shorten_dir() {
  d=$(pwd)
  d=${d/\/Users\/sim\/dev\/spx/SPX}
  d=${d/\/Users\/sim\/dev\/rails_projects/RP}
  d=${d/\/Users\/sim\/dev\/proj/PROJ}
ssimeonov / rescue_spec.rb
Created December 21, 2011 16:05
Dynamic Resque job handlers
require "spx_utils/resque"
describe SpxUtils::Resque do
  describe ".job_handler" do
    subject do
      SpxUtils::Resque.job_handler(:my_queue) do |x, y|
        x + y
      end
    end
ssimeonov / readme.md
Created June 20, 2012 01:25
Proto environment confusion with do.call()

Environment confusion in proto

Problem Description

This gist demonstrates a strange behavior in the proto package for R when functions are dynamically invoked via do.call. When do.call is invoked inside a proto object, variables that belong to a function's closure are not visible, even though a naive navigation of the environment chain shows these variables to be present. This problem does not occur when do.call is used outside of proto.

Reproducibility

The problem is isolated to a testthat test in test-cmd.r.

ssimeonov / proto_store_test.r
Created August 23, 2012 03:32
Storing & retrieving proto objects from MongoDB
add = function(., name, p) {
  data <- list(name=name, type=p$..Name, data=as.list(p))
  .$collection()$insert(data)
},
get = function(., name) {
  doc <- .$collection()$query_one(name=name)
  if (!is.null(doc)) {
    factory <- base:::get(doc$type, pos=1)
    do.call(factory$new, doc$data)
ssimeonov / test_that_util.r
Created August 24, 2012 06:47
test_that matcher utilities
# Find expression that created a variable
find_expr <- testthat:::find_expr
includes <- function(x, included) {
  if (length(included) > 0) {
    if (is.list(included)) {
      for (name in names(included)) {
        if (!all(x[[name]] == included[[name]])) {
          return(FALSE)
ssimeonov / my_data.r
Created September 7, 2012 01:56
RSpec hierarchical data structure evolution
=begin
Allows for a data structure to be safely evolved hierarchically during testing
Example:
context "with one try" do
  with_data tries: 1
  describe "with suppressed exceptions" do
    with_data raise_on_fail: false
    specify { retry_raise_hell(1, my_data).should be_nil }
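The idea in the example above, where each nested scope adds to the data it inherits without disturbing the outer scope, can be sketched in plain Ruby. This is a hypothetical illustration and not the gist's implementation: the `DataScope` class and its `nested` method are invented here, and only the names `with_data` and `my_data` come from the preview.

```ruby
# Hypothetical stand-in for the with_data / my_data helpers.
class DataScope
  attr_reader :my_data

  def initialize(parent_data = {})
    # Work on a frozen copy so the enclosing scope's data cannot be mutated.
    @my_data = parent_data.dup.freeze
  end

  # Layer overrides on top of the inherited data for this scope only.
  def with_data(overrides)
    @my_data = @my_data.merge(overrides).freeze
    self
  end

  # Open a child scope that starts from this scope's current data.
  def nested
    DataScope.new(@my_data)
  end
end

outer = DataScope.new.with_data(tries: 1)
inner = outer.nested.with_data(raise_on_fail: false)

p outer.my_data
p inner.my_data
```

Because `Hash#merge` returns a new hash, the inner scope sees both keys while the outer scope keeps only `tries`.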
ssimeonov / original_yahoo_response.html
Last active December 16, 2015 07:49
Anatomy of an online ad: initial response
<!-- Leaderboard slot on Yahoo!'s health network -->
<div id="yahoohealth-n-leaderboard-ad" style="border:0;margin-top:0;">
<div style="margin-bottom:9px;text-align:center">
<!-- Ad unit delivery script if scripting is available -->
<script language="JavaScript" type="text/javascript" src="http://us.adserver.yahoo.com/a?f=96843138&amp;p=health&amp;l=N&amp;c=r&amp;rs=cnp:healthline&amp;at=hlids%3Dx%26hlk1%3D8234384%26hlk2%3D8317112%26hlkwa%3D2790905%20refurl%3D%22%22">
</script>
<center>
<!-- Yahoo 1x1 tracking pixel delivery script -->
<script language="JavaScript" type="text/javascript" src="http://us.adserver.yahoo.com/a?f=96843138&amp;p=health&amp;l=FSRVY&amp;c=sr">
</script>