TBD
See: https://www.analyticsvidhya.com/blog/2017/01/scala/
- Lazy operation: operations that do not execute until we require results
- Spark Context: holds a context with Spark cluster manager
- Driver and Worker: a driver is in charge of running the main() function of an application and creating the SparkContext; workers execute the tasks the driver assigns to them
- In-memory computation: keeping the data in RAM, instead of HD for fast processing
Spark has three data representations: RDD, DataFrame, and Dataset:
- RDD: Resilient Distributed Dataset, a collection of elements distributed across multiple nodes in a cluster for parallel processing.
- Dataset: Also a distributed collection of data; a collection of JVM objects that can be manipulated using functional transformations (not available in Python or R).
- DataFrame: Distributed collection of data organized into named columns (conceptually equivalent to a table in a relational database).
- Transformation: operation applied on one RDD to create a second RDD.
- Action: operation applied to an RDD that performs computation and sends results back to the driver.
- Broadcast: use a broadcast variable to keep a read-only copy of data on all nodes.
- Accumulator: use an accumulator variable for aggregating information across nodes.
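A minimal sketch of broadcast and accumulator usage, assuming an existing SparkContext `sc` and made-up lookup data:

```scala
// Broadcast: a read-only lookup table shipped once to every node.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// Accumulator: write-only on workers, readable on the driver after an action.
val misses = sc.longAccumulator("misses")

val keys = sc.parallelize(Seq("a", "b", "c"))
val values = keys.map { k =>
  if (!lookup.value.contains(k)) misses.add(1)
  lookup.value.getOrElse(k, 0)
}

println(values.sum())   // action: forces the computation
println(misses.value)   // read on the driver after the action has run
```

Note that accumulator updates inside transformations may be applied more than once if a task is retried; Spark only guarantees exactly-once updates inside actions such as `foreach`.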
https://spark.apache.org/docs/latest/ml-statistics.html
Correlation
Calculating the correlation between two series of data is a common operation in statistics.
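A sketch of `Correlation.corr` from `spark.ml.stat`, assuming an existing SparkSession `spark` and a few made-up 2-D vectors:

```scala
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

// Each row holds one feature vector; column name "features" is required by corr().
val data = Seq(
  Vectors.dense(1.0, 10.0),
  Vectors.dense(2.0, 21.0),
  Vectors.dense(3.0, 29.0)
).map(Tuple1.apply)

val df = spark.createDataFrame(data).toDF("features")

// Default method is Pearson; pass "spearman" for rank correlation.
val Row(coeff: Matrix) = Correlation.corr(df, "features").head
println(s"Pearson correlation matrix:\n$coeff")
```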
Hypothesis testing
Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically significant. Spark ML uses Pearson's Chi-squared test for independence.
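A sketch of `ChiSquareTest` against a made-up labeled dataset, assuming an existing SparkSession `spark`:

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.stat.ChiSquareTest

// Made-up (label, feature vector) pairs.
val data = Seq(
  (0.0, Vectors.dense(0.5, 10.0)),
  (0.0, Vectors.dense(1.5, 20.0)),
  (1.0, Vectors.dense(1.5, 30.0)),
  (0.0, Vectors.dense(3.5, 30.0)),
  (1.0, Vectors.dense(3.5, 40.0))
)
val df = spark.createDataFrame(data).toDF("label", "features")

// Tests independence of every feature against the label;
// the result row holds pValues, degreesOfFreedom, and statistics.
val result = ChiSquareTest.test(df, "features", "label").head
println(result)
```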
https://spark.apache.org/docs/latest/ml-pipeline.html
Functionality to help users create and tune practical machine learning pipelines.
- DataFrame: the ML API uses DataFrame from Spark SQL as its ML dataset
- Transformer: Transforms one DataFrame into another DataFrame (with predictions).
- Estimator: Algorithm which can be fit on a DataFrame to produce a Transformer. A learning algorithm is an Estimator which trains on a DataFrame and produces a Model.
- Pipeline: Chains multiple Transformers and Estimators together to specify an ML workflow.
- Parameter: Common API to specify parameters
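These pieces fit together as in the following sketch (assuming a SparkSession `spark` and made-up training documents), where two Transformers feed an Estimator inside a Pipeline:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Made-up training documents: (id, text, label)
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
)).toDF("id", "text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")     // Transformer
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features") // Transformer
val lr = new LogisticRegression().setMaxIter(10)                              // Estimator
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Estimator.fit produces a PipelineModel, itself a Transformer.
val model = pipeline.fit(training)
```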
import org.apache.spark.{SparkConf, SparkContext}

object HelloWorld extends App {
  val conf = new SparkConf()
  conf.setMaster("local")
  conf.setAppName("dumb")
  val sc = new SparkContext(conf)
  val textFile = sc.textFile("src/main/resources/shakespeare.txt")
  val counts = textFile.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
  counts.foreach(println)
  println(s"Total words: ${counts.count()}")
}
import org.apache.spark.sql.SparkSession
import com.mongodb.spark._ // MongoDB Spark connector, provides loadFromMongoDB()

val spark = SparkSession.builder
  .appName("cic")
  .master("local")
  .config("spark.mongodb.input.uri", "mongodb://127.0.0.1:27017/plus.dispatch")
  .getOrCreate()
val sc = spark.sparkContext
val rdd = sc.loadFromMongoDB() // this loads the configuration specified in the SparkConf
Question: given two data series X and Y, is there a correlation between them?
https://spark.apache.org/docs/2.2.0/mllib-clustering.html#k-means
- 'K' is the number of desired clusters
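A minimal sketch of MLlib's k-means, assuming an existing SparkContext `sc` and a handful of made-up 2-D points:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical 2-D points forming two obvious clusters.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0),
  Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)
))

val k = 2             // 'K' is the number of desired clusters
val maxIterations = 20
val model = KMeans.train(points, k, maxIterations)
model.clusterCenters.foreach(println)
```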
RDD is a Resilient Distributed Dataset. You can perform functional operations on the object.
val rdd = sc.loadFromMongoDB()
println(rdd.count())
println(rdd.first.toJson) // print first element returned
// Perform filter operation on RDD
val rddf = rdd.filter(doc => {
  doc.getInteger("mcycle") > 0 &&
  doc.getInteger("minstret") > 0
})
EDF is an explicitly typed data frame, where documents are mapped to a case class:
case class Result(mcycle: Int, minstret: Int, testname: String)
val edf = MongoSpark.load[Result](spark)
edf.printSchema()
In ML, a feature is an individual measurable property of a phenomenon being observed. Choosing informative, discriminating, and independent features is a crucial step for effective algorithms. Features are usually numeric, but structural features such as strings or graphs are also used.
A set of numeric features can be described by a feature vector. A feature vector is an n-dimensional vector.
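In Spark, such a feature vector can be built with `ml.linalg.Vectors`; the feature values below are made up for illustration:

```scala
import org.apache.spark.ml.linalg.Vectors

// A hypothetical 3-dimensional feature vector:
// e.g. cycle count, instruction count, runtime in seconds.
val features = Vectors.dense(1234.0, 987.0, 0.45)
println(features.size) // dimension n = 3
```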
https://www.cakesolutions.net/teamblogs/spark-mllib-linear-regression-example-and-vocabulary
Linear regression belongs to the family of regression algorithms. The goal is to find relationships and dependencies between a continuous variable 'y' (the label or target) and an n-dimensional feature vector 'x' (also called observed data or input variables).
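A minimal sketch of `spark.ml`'s LinearRegression, assuming a SparkSession `spark` and a tiny made-up dataset where the label is roughly twice the feature:

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression

// Made-up (label, features) pairs: y ≈ 2 * x
val training = spark.createDataFrame(Seq(
  (2.1, Vectors.dense(1.0)),
  (3.9, Vectors.dense(2.0)),
  (6.0, Vectors.dense(3.0))
)).toDF("label", "features")

val lr = new LinearRegression().setMaxIter(10)
val model = lr.fit(training) // the Estimator produces a LinearRegressionModel
println(s"Coefficients: ${model.coefficients}, intercept: ${model.intercept}")
```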