Apache Spark

Installation

TBD

Basic Concepts

See: https://www.analyticsvidhya.com/blog/2017/01/scala/

  • Lazy operation: an operation that does not execute until a result is required (i.e., until an action is called)
  • SparkContext: the entry point to a Spark cluster; it holds the connection to the cluster manager
  • Driver and Worker: the driver runs the main() function of the application and creates the SparkContext; workers execute the tasks it schedules
  • In-memory computation: keeping data in RAM instead of on disk for fast processing

Spark has three data representations: RDD, DataFrame, and Dataset:

  • RDD: Resilient Distributed Dataset, a collection of elements partitioned across the nodes of a cluster for parallel processing.

  • Dataset: Also a distributed collection of data; a collection of JVM objects that can be manipulated using functional transformations (not available in Python or R).

  • DataFrame: Distributed collection of data organized into named columns (conceptually equivalent to a table in a relational database).

  • Transformation: operation applied to one RDD to create a second RDD; transformations are lazy.

  • Action: operation applied to an RDD that performs the computation and sends the result back to the driver.

  • Broadcast: a broadcast variable ships a read-only copy of data to all nodes.

  • Accumulator: an accumulator variable aggregates information across tasks (a sketch of these last four concepts follows this list).
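A minimal sketch of transformations, actions, broadcast variables, and accumulators, assuming a running SparkContext named sc (the data and variable names are illustrative):

// Transformation: lazy, returns a new RDD; nothing executes yet.
val nums    = sc.parallelize(1 to 100)
val squares = nums.map(n => n * n)

// Action: triggers the computation and returns a result to the driver.
println(squares.sum())

// Broadcast: read-only data copied once to every node.
val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two"))
println(nums.filter(n => lookup.value.contains(n)).count())

// Accumulator: tasks add to it; the driver reads the aggregate.
val evens = sc.longAccumulator("evens")
nums.foreach(n => if (n % 2 == 0) evens.add(1))
println(evens.value)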

ML Statistics

https://spark.apache.org/docs/latest/ml-statistics.html

Correlation

Calculating the correlation between two series of data is a common operation in statistics.
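A short sketch based on the example in the linked docs, assuming a SparkSession named spark is in scope. Correlation.corr computes a correlation matrix over a column of vectors (Pearson by default; Spearman is also supported):

import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

import spark.implicits._

// Each row holds one feature vector; toy data for illustration.
val data = Seq(
  Vectors.dense(1.0, 0.0, -2.0),
  Vectors.dense(4.0, 5.0, 0.0),
  Vectors.dense(6.0, 7.0, 0.5),
  Vectors.dense(9.0, 0.0, 1.0)
)
val df = data.map(Tuple1.apply).toDF("features")

// Pearson correlation matrix of the columns of "features".
val Row(coeff: Matrix) = Correlation.corr(df, "features").head
println(coeff)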

Hypothesis testing

Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically significant. Spark supports Pearson's chi-squared (χ²) tests for independence.
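A minimal sketch following the linked docs, again assuming a SparkSession named spark. ChiSquareTest runs one independence test per feature against the label:

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.ChiSquareTest

import spark.implicits._

// Toy (label, features) pairs for illustration.
val data = Seq(
  (0.0, Vectors.dense(0.5, 10.0)),
  (0.0, Vectors.dense(1.5, 20.0)),
  (1.0, Vectors.dense(1.5, 30.0)),
  (0.0, Vectors.dense(3.5, 30.0)),
  (0.0, Vectors.dense(3.5, 40.0)),
  (1.0, Vectors.dense(3.5, 40.0))
)
val df = data.toDF("label", "features")

// Result row holds p-values, degrees of freedom, and test statistics.
val chi = ChiSquareTest.test(df, "features", "label").head
println(s"pValues = ${chi.getAs[Vector](0)}")
println(s"degreesOfFreedom = ${chi.getSeq[Int](1).mkString("[", ",", "]")}")
println(s"statistics = ${chi.getAs[Vector](2)}")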

Pipelines

https://spark.apache.org/docs/latest/ml-pipeline.html

Functionality to help users create and tune practical machine learning pipelines.

  • DataFrame: the ML dataset abstraction, borrowed from Spark SQL; can hold a variety of data types.
  • Transformer: transforms one DataFrame into another DataFrame (e.g., by appending predictions).
  • Estimator: algorithm which can be fit on a DataFrame to produce a Transformer. A learning algorithm is an Estimator which trains on a DataFrame and produces a Model.
  • Pipeline: chains multiple Transformers and Estimators together to specify an ML workflow (sketched below).
  • Parameter: common API to specify parameters.
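A small sketch modeled on the linked docs: a three-stage text-classification pipeline (Tokenizer and HashingTF are Transformers, LogisticRegression is an Estimator; the data is illustrative and spark is an existing SparkSession):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Toy labeled documents: (id, text, label).
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Transformers: text -> words -> term-frequency feature vectors.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")

// Estimator: logistic regression over the feature vectors.
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)

// Pipeline chains the stages; fit() trains the Estimators and
// returns a PipelineModel (itself a Transformer).
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)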

Hello, World!

import org.apache.spark.{SparkConf, SparkContext}

object HelloWorld extends App {
  val conf = new SparkConf()
  conf.setMaster("local")
  conf.setAppName("dumb")
  val sc = new SparkContext(conf)

  val textFile = sc.textFile("src/main/resources/shakespeare.txt")

  // Word count: split lines into words, pair each with 1, sum per word.
  val counts = textFile.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
  counts.foreach(println)
  println(s"Distinct words: ${counts.count()}")

  sc.stop()
}

Initial experiments

import com.mongodb.spark._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("cic")
  .master("local")
  .config("spark.mongodb.input.uri", "mongodb://127.0.0.1:27017/plus.dispatch")
  .getOrCreate()

val sc = spark.sparkContext

val rdd = sc.loadFromMongoDB() // loads using the input URI configured in the SparkConf above

Basic Statistics

Given a set of data series, is there correlation between them? (See the Correlation section above.)

K Means

https://spark.apache.org/docs/2.2.0/mllib-clustering.html#k-means

  • 'K' is the number of desired clusters
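A short sketch following the linked RDD-based MLlib docs, assuming sc is in scope (the sample data file ships with the Spark distribution):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse whitespace-separated numeric rows into dense vectors.
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// Cluster into K = 2 groups, with up to 20 iterations.
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Lower within-set sum of squared errors indicates tighter clusters.
val WSSSE = clusters.computeCost(parsedData)
println(s"Within Set Sum of Squared Errors = $WSSSE")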

RDD

RDD stands for Resilient Distributed Dataset. You can perform functional operations on the object.

val rdd = sc.loadFromMongoDB()
println(rdd.count())
println(rdd.first.toJson) // print first element returned

// Perform filter operation on RDD
val rddf = rdd.filter(doc => {
  doc.getInteger("mcycle") > 0 &&
  doc.getInteger("minstret") > 0
})

EDF

EDF here means an explicitly-typed DataFrame: passing a case class to MongoSpark.load defines the schema explicitly.

import com.mongodb.spark.MongoSpark

case class Result(mcycle: Int, minstret: Int, testname: Int)
val edf = MongoSpark.load[Result](spark)
edf.printSchema()

Features

In ML, a feature is an individual measurable property of a phenomenon being observed. Choosing informative, discriminating, and independent features is a crucial step for effective algorithms. Features are usually numeric, but structural features such as strings or graphs are also used.

A set of numeric features can be described by a feature vector. A feature vector is an n-dimensional vector.
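As a sketch of how Spark builds feature vectors, assuming the edf dataset and its column names from the EDF section above: VectorAssembler packs numeric columns into a single n-dimensional vector column.

import org.apache.spark.ml.feature.VectorAssembler

// Assemble numeric columns into one "features" vector per row.
val assembler = new VectorAssembler()
  .setInputCols(Array("mcycle", "minstret"))
  .setOutputCol("features")

val withFeatures = assembler.transform(edf)
withFeatures.select("features").show(5)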

Linear Regression

https://www.cakesolutions.net/teamblogs/spark-mllib-linear-regression-example-and-vocabulary

Linear regression belongs to the family of regression algorithms. The goal is to find relationships and dependencies between a continuous variable 'y' (the label or target) and an n-dimensional feature vector (also called the observed data or input variables).
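A minimal sketch following the Spark ML docs, fitting an elastic-net-regularized linear model; the libsvm sample file ships with the Spark distribution and spark is an existing SparkSession:

import org.apache.spark.ml.regression.LinearRegression

// LIBSVM format: each row is a label plus a sparse feature vector.
val training = spark.read.format("libsvm")
  .load("data/mllib/sample_linear_regression_data.txt")

val lr = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.3)         // regularization strength
  .setElasticNetParam(0.8)  // mix of L1/L2 penalties

val lrModel = lr.fit(training)
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

// Summary statistics over the training data.
val summary = lrModel.summary
println(s"RMSE: ${summary.rootMeanSquaredError}  r2: ${summary.r2}")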
