Reynold Xin rxin
def testWrite(path: String): Long = {
  val startTime = System.currentTimeMillis()
  val out = new java.io.FileWriter(path)
  var i = 1
  val bytes = " " * (1024 * 1024)
  while (i < 1000) {
    out.write(bytes)
    i += 1
  }
  out.close
rxin / ampcamp-ecnu-2013-data.sh
Last active December 14, 2015 10:49
Scripts to help set up AMP Camp @ ECNU, March 2013
################################################################################
# Step 1. Download wiki traffic log.
# from
# https://s3.amazonaws.com/ampcamp/ampcamp-ecnu-2013/wikistats/part-00095.gz
# to
# https://s3.amazonaws.com/ampcamp/ampcamp-ecnu-2013/wikistats/part-00168.gz
# Note that 095 and 168 are both 0 bytes. The sole purpose of their existence is
# to verify the downloads.
# NOTE THAT THE FOLLOWING SCRIPT STARTS wget AS BACKGROUND PROCESSES.
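The range of files described in Step 1 can be enumerated programmatically. The following is a minimal sketch, not part of the original script: it assumes every part number between 00095 and 00168 exists and follows the same zero-padded naming, and the function name `wikistats_urls` is made up for illustration.

```python
# Base URL taken from the comments in the script above.
BASE_URL = "https://s3.amazonaws.com/ampcamp/ampcamp-ecnu-2013/wikistats"


def wikistats_urls(first=95, last=168):
    """Build the list of part-file URLs, inclusive of both end points.

    Assumption: part numbers are contiguous and zero-padded to 5 digits.
    """
    return [f"{BASE_URL}/part-{i:05d}.gz" for i in range(first, last + 1)]


urls = wikistats_urls()
print(len(urls))   # number of files in the range
print(urls[0])     # first file (an empty marker file, per the note above)
print(urls[-1])    # last file (also an empty marker file)
```

Such a list could then be handed to whatever download tool is preferred; the original script uses wget in the background.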
rxin / benchmark.scala
Last active September 10, 2015 06:09
Spark Parquet benchmark
// Launch spark-shell
MASTER=local[4] bin/spark-shell --driver-memory 4G --conf spark.shuffle.memoryFraction=0.5 --packages com.databricks:spark-csv_2.10:1.2.0
// Read the DF in
val pdf = sqlContext.read.parquet("d_small_key.parquet")
sqlContext.setConf("spark.sql.shuffle.partitions", "8")
// Data reading
val start = System.currentTimeMillis
rxin / CodegenTest.scala
Created August 20, 2015 05:54
code gen test
package org.apache.spark.sql
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions._
object CodegenTest {
  def main(args: Array[String]): Unit = {
    val sc = SparkContext.getOrCreate()
    val sqlContext = new SQLContext(sc)
rxin / NaNTesting.java
Created July 21, 2015 00:39
NaN double vs float testing
package com.databricks.unsafe.util.benchmark;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|2013|    1|  1|   517.0|      2.0|   830.0|     11.0|     UA| N14228|  1545|   EWR| IAH|   227.0|    1400| 5.0|  17.0|
|2013|    1|  1|   533.0|      4.0|   850.0|     20.0|     UA| N24211|  1714|   LGA| IAH|   227.0|    1416| 5.0|  33.0|
|2013|    1|  1|   542.0|      2.0|   923.0|     33.0|     AA| N619AA|  1141|   JFK| MIA|   160.0|    1089| 5.0|  42.0|
|2013|    1|  1|   544.0|     -1.0|  1004.0|    -18.0|     B6| N804JB|   725|   JFK| BQN|   183.0|    1576| 5.0|  44.0|
|2013|    1|  1|   554.0|     -6.0|   812.0|    -25.0|     DL| N668DN|   461|   LGA| ATL|   116.0|     762| 5.0|  54.0|
+----+-----+---+--------+---------+--------+---------+-------+--
In [1]: df = sqlContext.read.json("examples/src/main/resources/people.json")
In [2]: df.withColumn('a b', df.age)
Out[2]: DataFrame[age: bigint, name: string, a b: bigint]
In [3]: df.withColumn('a b', df.age).write.parquet('test-parquet.out')
15/06/03 01:14:56 ERROR InsertIntoHadoopFsRelation: Aborting job.
java.lang.RuntimeException: Attribute name "a b" contains invalid character(s) among " ,;{}() =". Please use alias to rename it.
at scala.sys.package$.error(package.scala:27)
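The error above asks for the column to be renamed before writing to Parquet. The following is a minimal sketch of a name sanitizer, not something from the original session: the helper `sanitize_column_name` is hypothetical, and treating every whitespace character as invalid is an assumption based on the characters shown in the error message.

```python
import re

# Characters listed in the error message: , ; { } ( ) = plus whitespace.
# Assumption: any whitespace character is treated as invalid here.
_INVALID = re.compile(r"[\s,;{}()=]")


def sanitize_column_name(name, replacement="_"):
    """Replace each character rejected in the error message with `replacement`."""
    return _INVALID.sub(replacement, name)


print(sanitize_column_name("a b"))
```

In the session above, the sanitized name could be passed to `withColumn` in place of `'a b'`, or applied with `alias` as the error message suggests.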
rxin / UnsafeBenchmark.arrayTraversal
Created March 13, 2015 07:38
Unsafe vs primitive array traversal speed
# {method} 'arrayTraversal' '()J' in 'com/databricks/unsafe/util/benchmark/UnsafeBenchmark'
0x000000010a8c9ae0: callq 0x000000010a2165ee ; {runtime_call}
0x000000010a8c9ae5: data32 data32 nopw 0x0(%rax,%rax,1)
0x000000010a8c9af0: mov %eax,-0x14000(%rsp)
0x000000010a8c9af7: push %rbp
0x000000010a8c9af8: sub $0x30,%rsp
0x000000010a8c9afc: mov (%rsi),%r13d
0x000000010a8c9aff: mov 0x18(%rsi),%rbp
0x000000010a8c9b03: mov 0x8(%rsi),%rbx
0x000000010a8c9b07: mov %rsi,%rdi
rxin / gist:6be132f46b72c27d8f89
Created November 1, 2014 21:22
test.scala on constructor parameter shadowing
class LegalPerson(name: String) {
  def aaaaaaaaaa = name
}
class DoomedPerson(name: String) extends LegalPerson(name) {
  def curName = name
}
rxin / dstat
Created September 15, 2014 08:06
High sys usage with Transparent Huge Pages (THP) enabled
date/time | used buff cach free|usr sys idl wai hiq siq| read writ| recv send
15-09 04:55:05|52.6G 56.8M 176G 11.4G| 4 2 82 12 0 0| 527M 0 | 0 198B
15-09 04:55:06|52.6G 56.8M 176G 10.9G| 3 2 80 15 0 0| 542M 64k| 581B 36k
15-09 04:55:07|52.6G 56.8M 177G 10.4G| 3 1 82 13 0 0| 535M 0 | 0 0
15-09 04:55:08|52.6G 56.8M 177G 9.87G| 2 1 85 12 0 0| 520M 0 | 0 506B
15-09 04:55:09|52.6G 56.8M 178G 9558M| 2 1 84 12 0 0| 549M 0 | 260B 520B
15-09 04:55:10|52.6G 56.8M 179G 9009M| 3 1 82 14 0 0| 557M 0 | 104B 594B
15-09 04:55:11|52.6G 56.8M 179G 8463M| 3 2 83 13 0 0| 530M 72k| 104B 272B
15-09 04:55:12|52.6G 56.8M 180G 7940M| 3 2 81 15 0 0| 532M 0 | 200B 888B
15-09 04:55:13|52.6G 56.8M 180G 7417M| 3 2 82 12 0 0| 510M 0 | 0 198B
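When reproducing a measurement like the one above, it helps to confirm whether THP is actually enabled on the machine. The following is a minimal sketch, not part of the original gist: it assumes the sysfs path used by mainline Linux kernels (some distributions have used a different path), and the function names are made up for illustration. The kernel reports the setting as a list of modes with the active one in brackets, for example `[always] madvise never`.

```python
def parse_thp_setting(text):
    """Return the bracketed (active) mode from text like '[always] madvise never'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_thp_setting(path="/sys/kernel/mm/transparent_hugepage/enabled"):
    """Read the active THP mode; returns None if the file is missing or unreadable.

    Assumption: `path` is the mainline-kernel location of the setting.
    """
    try:
        with open(path) as f:
            return parse_thp_setting(f.read())
    except OSError:
        return None


print(parse_thp_setting("always madvise [never]"))
```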