Cheng Lian (liancheng)
@liancheng
liancheng / arrow-schema-dsl.scala
Last active January 8, 2018 06:45
Simple Scala DSL for constructing Apache Arrow schemas.
package example

import scala.collection.JavaConverters._
import scala.language.implicitConversions

import org.apache.arrow.vector.types.pojo.{ArrowType, Field, FieldType, Schema}

trait FieldBuilder {
  def named(name: String): Field
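
The gist preview cuts off after the FieldBuilder trait. Below is a minimal sketch of how such a builder might be implemented and used; TypedFieldBuilder and the int32/utf8 helpers are assumptions for illustration, not the gist's actual DSL.

// Hypothetical builder: wraps an ArrowType and emits non-nullable fields.
case class TypedFieldBuilder(arrowType: ArrowType) extends FieldBuilder {
  def named(name: String): Field =
    new Field(name, FieldType.notNullable(arrowType), java.util.Collections.emptyList[Field]())
}

val int32 = TypedFieldBuilder(new ArrowType.Int(32, true))
val utf8  = TypedFieldBuilder(ArrowType.Utf8.INSTANCE)

// A two-column schema: (id: int32, name: utf8).
val schema = new Schema(Seq(int32.named("id"), utf8.named("name")).asJava)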

Keybase proof

I hereby claim:

  • I am liancheng on github.
  • I am liancheng (https://keybase.io/liancheng) on keybase.
  • I have a public key ASAVimRA8LFNh06-5t17L6yTHgQJp-j6gItxZLXhwVnD-Ao

To claim this, I am signing this object:

@liancheng
liancheng / scraper-repl.txt
Created March 14, 2016 04:02
Scraper REPL session example
$ ./build/sbt repl/run
...
@ context range 10 groupBy 'id agg (count('id), 'id + 1) having ('id > 0 and count('id) > 0) explain ()
# Logical plan
Filter: condition=$0 ==> [?output?]
├╴$0: ((`id` > 0:INT) AND (COUNT(`id`) > 0:INT))
╰╴UnresolvedAggregate: keys=[$0], projectList=[$1, $2] ==> [?output?]
├╴$0: `id`
@liancheng
liancheng / plan-tree.txt
Last active March 14, 2016 03:31
Scraper query plan explanation
@ context range 10 groupBy 'id agg count('id) having ('id > 0 and count('id) > 0) explain ()
# Logical plan
Filter: condition=$0 ==> [?output?]
├╴$0: ((`id` > 0:INT) AND (COUNT(`id`) > 0:INT))
╰╴UnresolvedAggregate: keys=[$0], projectList=[$1] ==> [?output?]
├╴$0: `id`
├╴$1: COUNT(`id`) AS ?alias?
╰╴LocalRelation: data=<local-data> ==> [`id`#0:BIGINT!]
# Analyzed plan
trait Expression

trait BinaryPredicate extends Expression {
  def left: Expression
  def right: Expression
}

case class Literal(value: Int) extends Expression

case class Lt(left: Expression, right: Expression) extends BinaryPredicate
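
A quick usage sketch (the eval function is an illustration added here, not from the gist): build a small predicate tree and fold it to a value.

// Hypothetical evaluator for this toy AST (not part of the gist):
// literals evaluate to their value, Lt to 1 or 0.
def eval(e: Expression): Int = e match {
  case Literal(v) => v
  case Lt(l, r)   => if (eval(l) < eval(r)) 1 else 0
}

assert(eval(Lt(Literal(1), Literal(2))) == 1)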
case class HiveSampleData(
    ClientID: String,
    QueryTime: String,
    Market: String,
    DevicePlatform: String,
    DeviceMake: String,
    DeviceModel: String,
    State: String,
    Country: String,
    SessionId: Long,
    SessionPageViewOrder: Long)

val mobiletxt = sc.textFile("file:///tmp/a.csv")
mobiletxt.count()

// Load the CSV with the SparkContext, parse each line into a HiveSampleData,
// and convert to a DataFrame via .toDF() (needs sqlContext.implicits._ in Spark 1.3+).
import sqlContext.implicits._
val mobile = sc.textFile("file:///tmp/a.csv")
  .map(_.split(","))
  .map(m => HiveSampleData(
    m(0), m(1), m(2), m(3), m(4), m(5), m(6), m(7),
    m(8).toLong, m(9).toLong))
  .toDF()
// Register table
mobile.registerTempTable("mobile")
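
Once registered, the table can be queried through the SQLContext; the aggregation below is an illustrative example, not from the gist.

// Example query against the registered temp table (Spark 1.x API,
// matching the registerTempTable call above).
val byMarket = sqlContext.sql(
  "SELECT Market, COUNT(*) AS views FROM mobile GROUP BY Market")
byMarket.show()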
[info] com.google.guava:guava:17.0
[info] +-com.fasterxml.jackson.module:jackson-module-scala_2.10:2.4.4 [S]
[info] | +-org.apache.spark:spark-core_2.10:1.3.0-SNAPSHOT [S]
[info] | +-org.apache.spark:spark-catalyst_2.10:1.3.0-SNAPSHOT [S]
[info] | | +-org.apache.spark:spark-sql_2.10:1.3.0-SNAPSHOT [S]
[info] | |
[info] | +-org.apache.spark:spark-sql_2.10:1.3.0-SNAPSHOT [S]
[info] |
[info] +-com.spotify:docker-client:2.7.5
[info] | +-org.apache.spark:spark-sql_2.10:1.3.0-SNAPSHOT [S]
test("save - append - ArrayType.containsNull") {
withTempPath { file =>
val df = Seq.empty[Tuple1[Seq[Int]]].toDF("arrayVal")
val nonNullSchema = StructType(df.schema.map {
case f @ StructField(_, a: ArrayType, _, _) =>
f.copy(dataType = a.copy(containsNull = false))
case f => f
})
sqlContext.createDataFrame(df.rdd, nonNullSchema).save(file.getCanonicalPath)
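    // (The gist preview truncates here; the lines below are a hedged guess at
    // the rest of the test. The read-back and assertion are assumptions, not
    // the gist's actual code.)
    val loaded = sqlContext.load(file.getCanonicalPath)
    assert(loaded.schema === nonNullSchema)
  }
}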
@liancheng
liancheng / data-sources-api.scala
Last active August 29, 2015 14:14
Data source API draft
/**
 * :: DeveloperApi ::
 * Base class for table scan operators.
 */
@DeveloperApi
abstract class Scan {
  def sqlContext: SQLContext

  /**
   * Returns an estimated size of the input of this scan operator in bytes.
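   * (Preview truncated here; the declaration below is a hedged guess based on
   * this doc comment. The name sizeInBytes is an assumption, not the gist's code.)
   */
  def sizeInBytes: Long
}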
@liancheng
liancheng / fn.scala
Created December 18, 2014 09:49
Scala function serialization
import java.io._

object Main {
  def main(args: Array[String]): Unit = {
    val stream = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(stream)

    def foo(): String => String = {
      val test = "hello"
      def bar(name: String): String = s"$test $name"
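      // (Preview truncated; a plausible completion follows, assuming the gist
      // goes on to serialize the returned closure. These lines are not the
      // gist's actual code.)
      bar
    }

    // foo() returns a closure capturing `test`; the Scala compiler emits a
    // serializable anonymous function class, so writeObject succeeds.
    out.writeObject(foo())
    out.close()
    println(s"Serialized closure: ${stream.size()} bytes")
  }
}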