Apache Spark

Installation

TBD

Basic Concepts

See: https://www.analyticsvidhya.com/blog/2017/01/scala/

  • Lazy operation: an operation that does not execute until a result is required (i.e., until an action is called)
  • SparkContext: the entry point to a Spark cluster; it holds the connection to the cluster manager
  • Driver and Worker: the driver runs the main() function of the application and creates the SparkContext; workers execute the tasks it schedules
  • In-memory computation: keeping data in RAM instead of on disk for fast processing

Spark has three data representations: RDD, DataFrame, and Dataset:

  • RDD: Resilient Distributed Dataset, a collection of elements partitioned across the nodes of a cluster for parallel processing.

  • Dataset: Also a distributed collection of data; a collection of JVM objects that can be manipulated using functional transformations (not available in Python or R).

  • DataFrame: Distributed collection of data organized into named columns (conceptually equivalent to a table in a relational database).

  • Transformation: operation applied to one RDD to create a second RDD; transformations are lazy.

  • Action: operation applied to an RDD that performs the computation and sends the result back to the driver.

  • Broadcast: a broadcast variable ships a read-only copy of data to all nodes.

  • Accumulator: an accumulator variable aggregates information across tasks (a sketch of these last four concepts follows this list).
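A minimal sketch of transformations, actions, broadcast variables, and accumulators, assuming a running SparkContext named sc (the data and variable names are illustrative):

// Transformation: lazy, returns a new RDD; nothing executes yet.
val nums    = sc.parallelize(1 to 100)
val squares = nums.map(n => n * n)

// Action: triggers the computation and returns a result to the driver.
println(squares.sum())

// Broadcast: read-only data copied once to every node.
val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two"))
println(nums.filter(n => lookup.value.contains(n)).count())

// Accumulator: tasks add to it; the driver reads the aggregate.
val evens = sc.longAccumulator("evens")
nums.foreach(n => if (n % 2 == 0) evens.add(1))
println(evens.value)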

ML Statistics

https://spark.apache.org/docs/latest/ml-statistics.html

Correlation

Calculating the correlation between two series of data is a common operation in statistics.
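A short sketch based on the example in the linked docs, assuming a SparkSession named spark is in scope. Correlation.corr computes a correlation matrix over a column of vectors (Pearson by default; Spearman is also supported):

import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

import spark.implicits._

// Each row holds one feature vector; toy data for illustration.
val data = Seq(
  Vectors.dense(1.0, 0.0, -2.0),
  Vectors.dense(4.0, 5.0, 0.0),
  Vectors.dense(6.0, 7.0, 0.5),
  Vectors.dense(9.0, 0.0, 1.0)
)
val df = data.map(Tuple1.apply).toDF("features")

// Pearson correlation matrix of the columns of "features".
val Row(coeff: Matrix) = Correlation.corr(df, "features").head
println(coeff)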

Hypothesis testing

Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically significant. Spark supports Pearson's chi-squared (χ²) tests for independence.
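A minimal sketch following the linked docs, again assuming a SparkSession named spark. ChiSquareTest runs one independence test per feature against the label:

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.ChiSquareTest

import spark.implicits._

// Toy (label, features) pairs for illustration.
val data = Seq(
  (0.0, Vectors.dense(0.5, 10.0)),
  (0.0, Vectors.dense(1.5, 20.0)),
  (1.0, Vectors.dense(1.5, 30.0)),
  (0.0, Vectors.dense(3.5, 30.0)),
  (0.0, Vectors.dense(3.5, 40.0)),
  (1.0, Vectors.dense(3.5, 40.0))
)
val df = data.toDF("label", "features")

// Result row holds p-values, degrees of freedom, and test statistics.
val chi = ChiSquareTest.test(df, "features", "label").head
println(s"pValues = ${chi.getAs[Vector](0)}")
println(s"degreesOfFreedom = ${chi.getSeq[Int](1).mkString("[", ",", "]")}")
println(s"statistics = ${chi.getAs[Vector](2)}")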

Pipelines

https://spark.apache.org/docs/latest/ml-pipeline.html

Functionality to help users create and tune practical machine learning pipelines.

  • DataFrame: the ML dataset abstraction, borrowed from Spark SQL; can hold a variety of data types.
  • Transformer: transforms one DataFrame into another DataFrame (e.g., by appending predictions).
  • Estimator: algorithm which can be fit on a DataFrame to produce a Transformer. A learning algorithm is an Estimator which trains on a DataFrame and produces a Model.
  • Pipeline: chains multiple Transformers and Estimators together to specify an ML workflow (sketched below).
  • Parameter: common API to specify parameters.
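A small sketch modeled on the linked docs: a three-stage text-classification pipeline (Tokenizer and HashingTF are Transformers, LogisticRegression is an Estimator; the data is illustrative and spark is an existing SparkSession):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Toy labeled documents: (id, text, label).
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

// Transformers: text -> words -> term-frequency feature vectors.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")

// Estimator: logistic regression over the feature vectors.
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)

// Pipeline chains the stages; fit() trains the Estimators and
// returns a PipelineModel (itself a Transformer).
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)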

Hello, World!

import org.apache.spark.{SparkConf, SparkContext}

object HelloWorld extends App {
  val conf = new SparkConf()
  conf.setMaster("local")
  conf.setAppName("dumb")
  val sc = new SparkContext(conf)

  val textFile = sc.textFile("src/main/resources/shakespeare.txt")

  // Word count: split lines into words, pair each with 1, sum per word.
  val counts = textFile.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
  counts.foreach(println)
  println(s"Distinct words: ${counts.count()}")

  sc.stop()
}

Initial experiments

import com.mongodb.spark._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("cic")
  .master("local")
  .config("spark.mongodb.input.uri", "mongodb://127.0.0.1:27017/plus.dispatch")
  .getOrCreate()

val sc = spark.sparkContext

val rdd = sc.loadFromMongoDB() // loads using the input URI configured in the SparkConf above

Basic Statistics

Given a set of data series, is there correlation between them? (See the Correlation section above.)

K Means

https://spark.apache.org/docs/2.2.0/mllib-clustering.html#k-means

  • 'K' is the number of desired clusters
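A short sketch following the linked RDD-based MLlib docs, assuming sc is in scope (the sample data file ships with the Spark distribution):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse whitespace-separated numeric rows into dense vectors.
val data = sc.textFile("data/mllib/kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// Cluster into K = 2 groups, with up to 20 iterations.
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Lower within-set sum of squared errors indicates tighter clusters.
val WSSSE = clusters.computeCost(parsedData)
println(s"Within Set Sum of Squared Errors = $WSSSE")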

RDD

RDD stands for Resilient Distributed Dataset. You can perform functional operations on the object.

val rdd = sc.loadFromMongoDB()
println(rdd.count())
println(rdd.first.toJson) // print first element returned

// Perform filter operation on RDD
val rddf = rdd.filter(doc => {
  doc.getInteger("mcycle") > 0 &&
  doc.getInteger("minstret") > 0
})

EDF

EDF here means an explicitly-typed DataFrame: passing a case class to MongoSpark.load defines the schema explicitly.

import com.mongodb.spark.MongoSpark

case class Result(mcycle: Int, minstret: Int, testname: Int)
val edf = MongoSpark.load[Result](spark)
edf.printSchema()

Features

In ML, a feature is an individual measurable property of a phenomenon being observed. Choosing informative, discriminating, and independent features is a crucial step for effective algorithms. Features are usually numeric, but structural features such as strings or graphs are also used.

A set of numeric features can be described by a feature vector. A feature vector is an n-dimensional vector.
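As a sketch of how Spark builds feature vectors, assuming the edf dataset and its column names from the EDF section above: VectorAssembler packs numeric columns into a single n-dimensional vector column.

import org.apache.spark.ml.feature.VectorAssembler

// Assemble numeric columns into one "features" vector per row.
val assembler = new VectorAssembler()
  .setInputCols(Array("mcycle", "minstret"))
  .setOutputCol("features")

val withFeatures = assembler.transform(edf)
withFeatures.select("features").show(5)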

Linear Regression

https://www.cakesolutions.net/teamblogs/spark-mllib-linear-regression-example-and-vocabulary

Linear regression belongs to the family of regression algorithms. The goal is to find relationships and dependencies between a continuous variable 'y' (the label or target) and an n-dimensional feature vector (also called the observed data or input variables).
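A minimal sketch following the Spark ML docs, fitting an elastic-net-regularized linear model; the libsvm sample file ships with the Spark distribution and spark is an existing SparkSession:

import org.apache.spark.ml.regression.LinearRegression

// LIBSVM format: each row is a label plus a sparse feature vector.
val training = spark.read.format("libsvm")
  .load("data/mllib/sample_linear_regression_data.txt")

val lr = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.3)         // regularization strength
  .setElasticNetParam(0.8)  // mix of L1/L2 penalties

val lrModel = lr.fit(training)
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

// Summary statistics over the training data.
val summary = lrModel.summary
println(s"RMSE: ${summary.rootMeanSquaredError}  r2: ${summary.r2}")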
