Skip to content

Instantly share code, notes, and snippets.

View rxin's full-sized avatar

Reynold Xin rxin

View GitHub Profile
@rxin
rxin / ramdisk.sh
Last active December 13, 2023 02:40
ramdisk create/delete on Mac OS X.
#!/bin/bash
# From http://tech.serbinn.net/2010/shell-script-to-create-ramdisk-on-mac-os-x/
#
ARGS=2
E_BADARGS=99
if [ $# -ne $ARGS ] # correct number of arguments to the script;
then
@rxin
rxin / benchmark.scala
Created April 22, 2015 06:36
quasiquote vs janino
package org.apache.spark.sql.catalyst.expressions.codegen
import org.codehaus.janino.SimpleCompiler
object CodeGenBenchmark {
def quasiquotes(): Unit = {
import scala.reflect.runtime.{universe => ru}
import scala.reflect.runtime.universe._
@rxin
rxin / ByteBufferPerf.scala
Last active May 14, 2018 21:27
Comparison of performance over various approaches to read Java ByteBuffer. The best way is to use Unsafe, which also enables reading multiple primitive data types from the same buffer.
/**
* To compile:
* scalac -optimize ByteBufferPerf.scala
*
* JAVA_OPTS="-Xmx2g" scala IntArrayPerf 10
* 49 62 48 45 48 45 48 50 47 45
*
* JAVA_OPTS="-Xmx2g" scala ByteBufferPerf 10
* 479 491 484 480 484 481 477 477 472 473
@rxin
rxin / BinarySearch.java
Created July 19, 2015 05:23
binary search vs linear scan
package com.databricks.unsafe.util.benchmark;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
@rxin
rxin / generated-assembly.txt
Created February 14, 2017 12:40
Processing trillion rows per second on a single machine: how can nested loop joins be this fast?
Decoding compiled method 0x00007f4d0510f9d0:
Code:
[Entry Point]
[Verified Entry Point]
[Constants]
# {method} {0x00007f4ce9662458} 'join' '(JI)J' in 'Test'
0x00007f4d0510fb20: call 0x00007f4d1abd5a30 ; {runtime_call}
0x00007f4d0510fb25: data16 data16 nop WORD PTR [rax+rax*1+0x0]
0x00007f4d0510fb30: mov DWORD PTR [rsp-0x14000],eax
0x00007f4d0510fb37: push rbp
@rxin
rxin / df.py
Last active January 26, 2017 00:44
DataFrame simple aggregation performance benchmark
data = sqlContext.load("/home/rxin/ints.parquet")
data.groupBy("a").agg(col("a"), avg("num")).collect()
@rxin
rxin / gist:6896688
Last active December 25, 2015 01:39
take async
def takeAsync(num: Int): FutureAction[Seq[T]] = {
val promise = new CancellablePromise[Seq[T]]
promise.run {
val buf = new ArrayBuffer[T](num)
val totalParts = self.partitions.length
var partsScanned = 0
while (buf.size < num && partsScanned < totalParts && !promise.cancelled) {
// The number of partitions to try in this iteration. It is ok for this number to be
@rxin
rxin / update.sh
Last active December 21, 2015 01:09
Update Spark/Shark on EC2 AMI
set -e
set -o pipefail
/root/spark/bin/stop-all.sh
rm -rf ~/.ivy2/local/org.spark*
rm -rf ~/.ivy2/cache/org.spark*
cd /root/spark
git checkout master
@rxin
rxin / InsertPerf.scala
Last active December 19, 2015 03:49
Scala collection insert performance (2.9.3)
// 1001 381 384 384 383 384 407 404 409 407
object ArrayBufferBenchmark extends scala.testing.Benchmark {
def run = {
val len = 10 * 1000 * 1000
val a = new scala.collection.mutable.ArrayBuffer[Int](len)
package spark.util
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import scala.collection.mutable
import org.objectweb.asm.{ClassReader, MethodVisitor}
import org.objectweb.asm.commons.EmptyVisitor
import org.objectweb.asm.Opcodes._