Introducing Machine Learning using Spark-Scala and IntelliJ
________ ________ ___ __ ___
|\_____ \|\ __ \|\ \|\ \ |\ \
\|___/ /\ \ \|\ \ \ \/ /|\ \ \
/ / /\ \ __ \ \ ___ \ \ \
/ /_/__\ \ \ \ \ \ \\ \ \ \ \
|\________\ \__\ \__\ \__\\ \__\ \__\
\|_______|\|__|\|__|\|__| \|__|\|__|
Note: this gist was made on Linux; Windows is not covered.

machinelearningblog-graphic.png

Machine Learning using Spark-Scala - Hands-on

Tools used: IntelliJ, Spark • Programming languages: Scala

Project Status: Concept – Minimal or no implementation has been done yet, or the repository is only intended to be a limited example, demo, or proof-of-concept.

The following gist is intended for Data Engineers. It focuses on Spark and Scala for Machine Learning.
If you want to handle batch and real-time data processing, this gist is definitely worth checking.
We'll learn how to install and use Spark and Scala on a Linux system.
We'll learn the latest Spark 2.0 methods and updates to the MLlib library, working with Spark SQL and DataFrames. Please fork it if you find it relevant.

How this gist is structured

This gist is structured into 2 parts:

Part 1. Machine Learning using Spark-Scala (Linear Regression)

Important

Scala

. Scala is a general-purpose programming language.
. Scala was designed by Martin Odersky (École Polytechnique Fédérale de Lausanne).
. Scala source code compiles to Java bytecode and runs on the Java Virtual Machine (JVM).
. Java libraries can be used directly in Scala, as the sketch below illustrates.
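
A minimal illustration of that last point: java.time ships with the JDK, and Scala calls it with no wrapper or conversion layer.

// Java interop sketch: java.time is a plain Java (JDK) library.
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object JavaInteropExample extends App {
  val today = LocalDate.now()                          // direct Java API call
  val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")  // Java formatter object
  println(s"Today is ${today.format(fmt)}")            // Scala string interpolation
}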

Spark

. Spark is one of the most powerful Big Data tools.
. Spark runs programs up to 100x faster than Hadoop's MapReduce (for in-memory workloads).
. Spark can use data stored in Cassandra, Amazon S3, Hadoop's HDFS, etc.
. MapReduce requires files to be stored in HDFS; Spark does not.
. Spark can be up to 100x faster than MapReduce because it keeps intermediate results in memory, while MapReduce writes them to disk after each job.

Data Processing

. MapReduce (Hadoop) writes most data to disk after each Map and Reduce operation.
. Spark keeps most of the data in memory after each transformation.
. At the core of Spark there are Resilient Distributed Datasets also known as RDDs.
. An RDD has 4 main features:

  1. Distributed collection of data
  2. Fault-tolerant
  3. Parallel operations which are partitioned
  4. An RDD can use many data sources

. RDDs are immutable, cacheable and lazily evaluated.
. There are 2 types of RDD operations (see the sketch after this list):

  1. Transformations: recipes to follow
  2. Actions: perform the recipe's instructions and return a result
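
A minimal sketch of the distinction, runnable in spark-shell (where sc, the SparkContext, is predefined):

// Build an RDD from a local collection.
val rdd = sc.parallelize(1 to 10)

// Transformations are lazy: Spark only records the recipe, nothing executes yet.
val doubled = rdd.map(_ * 2)
val evens = doubled.filter(_ % 4 == 0)

// Actions run the recorded recipe and return a result to the driver.
println(evens.count())                    // 5
println(evens.collect().mkString(", "))   // 4, 8, 12, 16, 20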

Environment options for Scala and Spark

  1. Text editors, such as Sublime Text and Atom
  2. IDEs (Integrated Development Environments), such as IntelliJ and Eclipse
  3. Notebooks, such as Jupyter, Zeppelin and Databricks

Resources

I've uploaded a .zip file ** which contains useful slides related to Machine Learning, Spark and Scala.
https://bit.ly/2zkcrP7

Author

  • Isaac Arnault - Introducing Machine Learning using Spark-Scala - Related tags: #EC2 #TLS #AWSCLI #Linux
    ** © Perian Data

In this part of the gist, we'll run some Machine Learning algorithms using Spark-Scala.

Why use Machine Learning ?

Here are common use cases related to Machine Learning:

  • Fraud detection
  • Web search engines
  • Credit scoring
  • Prediction of equipment failures
  • Customer segmentation
  • Customer churn prediction
  • Image recognition
  • Financial forecasts

Machine Learning steps

  1. Data acquisition / ingestion
  2. Data cleaning / transformation
  3. Data testing (train / test split, sketched below)
  4. Model training / building
  5. Model testing
  6. Model deployment

Machine Learning types

  1. Supervised learning, from labeled data
  2. Unsupervised learning, from unlabeled data
  3. Reinforcement learning, from experience on data
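
Step 3 above usually means holding out part of the data for model testing later on. A minimal sketch, assuming a spark-shell session where spark (the SparkSession) is predefined; the file path is a placeholder:

// Load a libsvm-formatted dataset (placeholder path).
val data = spark.read.format("libsvm").load("url-path/sample_linear_regression_data.txt")

// Hold out 30% of the rows for model testing; the seed makes the split reproducible.
val Array(training, test) = data.randomSplit(Array(0.7, 0.3), seed = 42)

println(s"training rows: ${training.count()}, test rows: ${test.count()}")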

Machine Learning APIs

. Spark has 2 ML APIs (the package split is sketched after this list):

  1. RDD API
  2. Dataframe API
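
For orientation, the two APIs live under different packages; since Spark 2.0 the RDD-based API is in maintenance mode and the DataFrame-based API is the primary one. An import-only sketch:

// RDD-based API: under org.apache.spark.mllib (maintenance mode since Spark 2.0)
import org.apache.spark.mllib.regression.LabeledPoint

// DataFrame-based API: under org.apache.spark.ml (recommended, used in this gist)
import org.apache.spark.ml.regression.LinearRegression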

Raw data operations

Before fitting a model, make sure you follow these 3 steps:

  1. Extraction: selecting the pertinent variables
  2. Transformation: scaling, converting, preparing the dataframe (sketched after this list)
  3. Selection: select the correct model
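
For the Transformation step, Spark ML estimators such as LinearRegression expect the input features packed into a single vector column. A minimal sketch with made-up column names, assuming a spark-shell session where spark is predefined:

import org.apache.spark.ml.feature.VectorAssembler

// Hypothetical input: two numeric feature columns plus a label.
val df = spark.createDataFrame(Seq(
  (1.0, 0.5, 10.0),
  (2.0, 1.5, 20.0)
)).toDF("x1", "x2", "label")

// Pack the raw columns into the single "features" vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))
  .setOutputCol("features")

val prepared = assembler.transform(df).select("features", "label")
prepared.show()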

Metrics to consider for evaluating ML models

metrics.png


Working environment - Install IntelliJ

Go to https://www.jetbrains.com/idea/download/#section=linux and download Community edition.

🔴 See hint

intelliJ.png

  1. Extract the software from the tarball (.tgz) and start the application from the bin directory.
🔴 See hint

isaac-arnault-intelli-J.png

  2. Go to Configure > Plugins > from the Marketplace, install Scala.
🔴 See hint

isaac-arnault-intelli-J-2.png

  3. Restart IntelliJ to apply the changes.
🔴 See hint

isaac-arnault-intelli-J-3.png

  4. Create New Project > select "Scala" and Next > Name: ML_Spark_Scala

Location: create a folder named "Projects" in the IntelliJ directory and set the location to that folder.

🔴 See hint

isaac-arnault-intelli-J-6.png

  1. Click "Finish" to apply configs and create your first project.
🔴 See hint

isaac-arnault-Intelli-J-7.png

Now we are ready to start using IntelliJ.

A. Regression - Linear regression

We will apply a linear regression script to a given dataset, sample_linear_regression_data.txt.

  1. Check the datasets.md section of this gist to download the dataset and save it to a folder on your desktop.
  2. In IntelliJ, open the .txt file to get a quick view of the data.
  3. Save the program below on your Desktop as LinReg.scala.
🔴 LinReg.scala

import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession
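// Note: this script is written to be loaded into spark-shell with :load (step 5 below);
// defining main() and calling it at the bottom works as-is in the REPL.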

def main(): Unit = {
 // Create Session App
 val spark = SparkSession.builder().appName("LinearRegressionExample").getOrCreate()

 // May need to replace with full file path starting with file:///.
 val path = "url-path/sample_linear_regression_data.txt"

 // Training Data
 val training = spark.read.format("libsvm").load(path)
 training.printSchema()

 // Create new LinearRegression Object
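 // (up to 100 iterations, regularization strength 0.3, elastic-net mixing 0.8:
 //  closer to L1/lasso; 0.0 would be pure L2/ridge)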
 val lr = new LinearRegression().setMaxIter(100).setRegParam(0.3).setElasticNetParam(0.8)

 // Fit the model
 val lrModel = lr.fit(training)

 // Print the coefficients and intercept for linear regression
 println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

 // Summarize the model over the training set and print out some metrics
 val trainingSummary = lrModel.summary
 println(s"numIterations: ${trainingSummary.totalIterations}")
 println(s"objectiveHistory: ${trainingSummary.objectiveHistory.toList}")
 trainingSummary.residuals.show()
 println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
 println(s"r2: ${trainingSummary.r2}")

 spark.stop()
}
main()

  4. Open a Terminal window in IntelliJ and start Spark using $ ./spark-shell.
🔴 See on IntelliJ

isaac-arnault-spark-scala.png

  5. Once Spark has started, load the above program using > :load url-path/LinReg.scala.
🔵 Program output

isaac-arnault-intelli-J-7.png

RMSE: Root Mean Squared Error. It represents the sample standard deviation of the differences between predicted values and observed values (called residuals).

RSME.png
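
In symbols, the standard definition (with y_i the observed values and \hat{y}_i the predictions), presumably what RSME.png depicts:

\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}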

r2: r-squared, usually between 0 and 1; here it is 0.02, which is a poor score. The closer the score is to 1, the more of the variance the model explains.

r.png
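
For reference, the standard definition (with \bar{y} the mean of the observed values):

R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}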

I hope this gist gives you the basics you need to start a Machine Learning project using Spark and IntelliJ.
