Machine Learning using Spark-Scala - Hands-on
Tools used: IntelliJ, Spark • Programming languages: Scala
This gist is intended for Data Engineers. It focuses on
If we want to handle
real-time data processing, this gist is definitely worth checking.
We'll learn how to install and use
Scala on a
We'll learn the latest Spark 2.0 methods and updates to the MLlib library, working with Spark SQL and DataFrames.
Please fork it if you find it relevant.
How this gist is structured
This gist is structured into 2 parts:
Part 1. Machine Learning using Spark-Scala (Linear Regression)
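As a rough preview of what Part 1 works towards, here is a minimal sketch of a linear regression with the DataFrame-based API in Spark 2.x. The object name `LinearRegressionSketch`, the helper `fitToyModel` and the toy data (points lying on y = 2x + 1) are made up for illustration and are not taken from the gist's own code. It assumes the Spark SQL and MLlib libraries are on the classpath.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
import org.apache.spark.sql.SparkSession

object LinearRegressionSketch {

  // Fits a linear regression on a tiny, made-up dataset (hypothetical helper).
  def fitToyModel(spark: SparkSession): LinearRegressionModel = {
    // Toy (label, features) rows lying exactly on y = 2x + 1; not real data.
    val training = spark.createDataFrame(Seq(
      (3.0, Vectors.dense(1.0)),
      (5.0, Vectors.dense(2.0)),
      (7.0, Vectors.dense(3.0)),
      (9.0, Vectors.dense(4.0))
    )).toDF("label", "features")

    // "label" and "features" are the default column names expected by the estimator.
    new LinearRegression().setMaxIter(50).fit(training)
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation; on a cluster the master is usually set by spark-submit.
    val spark = SparkSession.builder()
      .appName("LinearRegressionSketch")
      .master("local[*]")
      .getOrCreate()

    val model = fitToyModel(spark)
    // With this toy data the slope should come out close to 2 and the intercept close to 1.
    println(s"Coefficients: ${model.coefficients}, intercept: ${model.intercept}")

    spark.stop()
  }
}
```

A real dataset would usually be loaded with `spark.read` and its feature columns assembled into a single vector column before fitting.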
Scala is a general purpose programming language.
Scala was designed by Martin Odersky (Ecole Polytechnique Fédérale de Lausanne).
Scala source code is intended to be compiled to Java bytecode to run on a Java Virtual Machine (JVM).
Java libraries can be used directly in Scala.
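To make the JVM interoperability concrete, here is a small sketch that calls classes from the Java standard library (`java.time` and `java.util`) straight from Scala, with no wrapper code. The object name `JavaInterop` and the helper `daysBetween` are hypothetical names chosen for this example.

```scala
import java.time.LocalDate
import java.time.temporal.ChronoUnit
import java.util.ArrayList

object JavaInterop {

  // Number of days between two ISO-8601 dates, computed with java.time.
  def daysBetween(start: String, end: String): Long =
    ChronoUnit.DAYS.between(LocalDate.parse(start), LocalDate.parse(end))

  // Fills a java.util.ArrayList from Scala and returns its size.
  def javaListSize(items: String*): Int = {
    val list = new ArrayList[String]()
    items.foreach(item => list.add(item))
    list.size()
  }

  def main(args: Array[String]): Unit = {
    println(daysBetween("2020-01-01", "2020-01-31"))
    println(javaListSize("Spark", "Scala"))
  }
}
```

Because both languages compile to the same bytecode, any Java library on the classpath can be imported the same way.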
Spark is one of the most powerful
Big Data tools.
Spark runs programs up to 100x faster than Hadoop's MapReduce.
Spark can use data stored in
MapReduce requires files to be stored in HDFS; Spark does not.
Spark can be up to 100x faster than MapReduce because of where intermediate results are kept.
MapReduce (Hadoop) writes most data to disk after each map and reduce operation.
Spark keeps most of the data in memory after each transformation.
At the core of Spark are Resilient Distributed Datasets, also known as RDDs.
An RDD has 4 main features:
- Distributed collection of data
- Fault tolerance
- Parallel operations on partitioned data
- Ability to use many data sources
RDDs are immutable, cacheable and lazily evaluated.
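The RDD properties above can be sketched as follows, assuming a local Spark 2.x installation. The object name `RddBasics` and the helper `squaresSummary` are hypothetical; the Spark calls used (`parallelize`, `map`, `cache`, `reduce`, `count`, `getNumPartitions`) are part of the standard RDD API.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object RddBasics {

  // Returns (number of partitions, sum of squares, element count) for 1..10.
  def squaresSummary(sc: SparkContext): (Int, Long, Long) = {
    // Distributed collection: the range is split into 4 partitions.
    val numbers = sc.parallelize(1 to 10, numSlices = 4)

    // Immutability: map returns a new RDD and leaves `numbers` unchanged.
    // cache() only marks the RDD for in-memory reuse; nothing is computed yet.
    val squares = numbers.map(n => n.toLong * n).cache()

    // The first action computes the squares and populates the cache.
    val total = squares.reduce(_ + _)
    // A later action can reuse the cached partitions instead of recomputing them.
    val count = squares.count()

    (numbers.getNumPartitions, total, count)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddBasics")
      .master("local[*]")
      .getOrCreate()

    println(squaresSummary(spark.sparkContext))

    spark.stop()
  }
}
```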
There are 2 types of RDD operations:
- Transformations: recipes to follow
- Actions: perform the recipe's instructions and return a result
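The difference between the two can be observed with a small sketch, assuming a local Spark 2.x installation. The object name `LazyEvaluation` and the helper `run` are hypothetical. A `LongAccumulator` counts how many times the `map` function actually executes: the count stays at zero until an action is called. Accumulator updates made inside transformations can be repeated if a task is retried, so this counter is for illustration only.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object LazyEvaluation {

  // Returns (map calls before the action, map calls after the action, collected result).
  def run(sc: SparkContext): (Long, Long, Seq[String]) = {
    val mapCalls = sc.longAccumulator("map calls")
    val words = sc.parallelize(Seq("spark", "scala", "rdd", "mllib"))

    // Transformations: only the recipe is recorded, no data is processed here.
    val upper = words
      .filter(_.length > 3)
      .map { w => mapCalls.add(1); w.toUpperCase }

    val before = mapCalls.sum

    // Action: Spark now executes the recipe and returns the result to the driver.
    val result = upper.collect().toSeq

    val after = mapCalls.sum
    (before, after, result)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LazyEvaluation")
      .master("local[*]")
      .getOrCreate()

    val (before, after, result) = run(spark.sparkContext)
    println(s"map calls before action: $before, after action: $after")
    println(result.mkString(", "))

    spark.stop()
  }
}
```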
Environment options for
- Text editors, such as
- IDEs (Integrated Development Environments), such as
- Notebooks, such as
I've uploaded a
.zip file ** which contains useful slides related to
- Isaac Arnault - Introducing Machine Learning using Spark-Scala - Related tags: #EC2 #TLS #AWSCLI #Linux
** © Perian Data