Reynold Xin rxin

@rxin
rxin / generated-assembly.txt
Created Feb 14, 2017
Processing trillion rows per second on a single machine: how can nested loop joins be this fast?
View generated-assembly.txt
Decoding compiled method 0x00007f4d0510f9d0:
Code:
[Entry Point]
[Verified Entry Point]
[Constants]
# {method} {0x00007f4ce9662458} 'join' '(JI)J' in 'Test'
0x00007f4d0510fb20: call 0x00007f4d1abd5a30 ; {runtime_call}
0x00007f4d0510fb25: data16 data16 nop WORD PTR [rax+rax*1+0x0]
0x00007f4d0510fb30: mov DWORD PTR [rsp-0x14000],eax
0x00007f4d0510fb37: push rbp
rxin / benchmark.scala
Last active Sep 10, 2015
Spark Parquet benchmark
View benchmark.scala
// Launch spark-shell
MASTER=local[4] bin/spark-shell --driver-memory 4G --conf spark.shuffle.memoryFraction=0.5 --packages com.databricks:spark-csv_2.10:1.2.0
// Read the DF in
val pdf = sqlContext.read.parquet("d_small_key.parquet")
sqlContext.setConf("spark.sql.shuffle.partitions", "8")
// Data reading
val start = System.currentTimeMillis
View CodegenTest.scala
package org.apache.spark.sql
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions._
object CodegenTest {
  def main(args: Array[String]): Unit = {
    val sc = SparkContext.getOrCreate()
    val sqlContext = new SQLContext(sc)
rxin / NaNTesting.java
Created Jul 21, 2015
NaN double vs float testing
View NaNTesting.java
package com.databricks.unsafe.util.benchmark;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
rxin / BinarySearch.java
Created Jul 19, 2015
binary search vs linear scan
View BinarySearch.java
package com.databricks.unsafe.util.benchmark;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
View gist:2b0ee3d18bf23531ca3a
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|2013| 1| 1| 517.0| 2.0| 830.0| 11.0| UA| N14228| 1545| EWR| IAH| 227.0| 1400| 5.0| 17.0|
|2013| 1| 1| 533.0| 4.0| 850.0| 20.0| UA| N24211| 1714| LGA| IAH| 227.0| 1416| 5.0| 33.0|
|2013| 1| 1| 542.0| 2.0| 923.0| 33.0| AA| N619AA| 1141| JFK| MIA| 160.0| 1089| 5.0| 42.0|
|2013| 1| 1| 544.0| -1.0| 1004.0| -18.0| B6| N804JB| 725| JFK| BQN| 183.0| 1576| 5.0| 44.0|
|2013| 1| 1| 554.0| -6.0| 812.0| -25.0| DL| N668DN| 461| LGA| ATL| 116.0| 762| 5.0| 54.0|
+----+-----+---+--------+---------+--------+---------+-------+--
View gist:577f7e15545a1edc6f88
In [1]: df = sqlContext.read.json("examples/src/main/resources/people.json")
In [2]: df.withColumn('a b', df.age)
Out[2]: DataFrame[age: bigint, name: string, a b: bigint]
In [3]: df.withColumn('a b', df.age).write.parquet('test-parquet.out')
15/06/03 01:14:56 ERROR InsertIntoHadoopFsRelation: Aborting job.
java.lang.RuntimeException: Attribute name "a b" contains invalid character(s) among " ,;{}() =". Please use alias to rename it.
at scala.sys.package$.error(package.scala:27)
rxin / benchmark.scala
Created Apr 22, 2015
quasiquote vs janino
View benchmark.scala
package org.apache.spark.sql.catalyst.expressions.codegen
import org.codehaus.janino.SimpleCompiler
object CodeGenBenchmark {
  def quasiquotes(): Unit = {
    import scala.reflect.runtime.{universe => ru}
    import scala.reflect.runtime.universe._
rxin / UnsafeBenchmark.arrayTraversal
Created Mar 13, 2015
Unsafe vs primitive array traversal speed
View UnsafeBenchmark.arrayTraversal
# {method} 'arrayTraversal' '()J' in 'com/databricks/unsafe/util/benchmark/UnsafeBenchmark'
0x000000010a8c9ae0: callq 0x000000010a2165ee ; {runtime_call}
0x000000010a8c9ae5: data32 data32 nopw 0x0(%rax,%rax,1)
0x000000010a8c9af0: mov %eax,-0x14000(%rsp)
0x000000010a8c9af7: push %rbp
0x000000010a8c9af8: sub $0x30,%rsp
0x000000010a8c9afc: mov (%rsi),%r13d
0x000000010a8c9aff: mov 0x18(%rsi),%rbp
0x000000010a8c9b03: mov 0x8(%rsi),%rbx
0x000000010a8c9b07: mov %rsi,%rdi
rxin / df.py
Last active Jan 26, 2017
DataFrame simple aggregation performance benchmark
View df.py
from pyspark.sql.functions import avg, col  # needed for col() and avg() below
data = sqlContext.load("/home/rxin/ints.parquet")
data.groupBy("a").agg(col("a"), avg("num")).collect()