Udemy Hadoop Training Course Notes

Udemy - "The Ultimate Hands-On Hadoop - Tame Your Big Data!" Course Notes

What is Hadoop? "Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware" - Hortonworks

Features

  • Distributed storage: stores data across many hard drives & has backup copies distributed throughout as well so the failure of one hard drive does not result in lost data
  • Distributed processing: can parallelize operations across many cores

Hadoop History

  • Google published GFS (Google File System - the basis that Hadoop was created from) & MapReduce papers in 2003-2004
    • GFS inspired Hadoop's distributed storage
    • MapReduce inspired Hadoop's distributed processing
  • "Nutch," an open source web search engine, was being built at the same time; Yahoo later backed that work and hired Doug Cutting to continue it
  • Hadoop was primarily driven by Doug Cutting and Tom White in 2006
  • It was originally built for use in batch processing, although it has outgrown that specific use

Overview of the Hadoop Ecosystem

Core Hadoop Ecosystem

HDFS - Hadoop Distributed File System

  • Allows us to distribute the storage of data across our cluster
  • Makes the distributed system look like one single cohesive file system
  • Also maintains redundant copies of data across the cluster as mentioned above

YARN - Yet Another Resource Negotiator

  • Sits on top of HDFS
  • Where the data processing starts to come into play
  • Manages resources on the computing cluster - decides what gets to run tasks when, what nodes are available for extra work, which ones are not
  • The "heartbeat" that keeps the cluster running

MapReduce

  • Sits on top of YARN
  • Programming model that allows you to process your data across a cluster
  • Consists of mapping & reducing functions
    • Mappers have the ability to transform your data in parallel across your cluster in an efficient manner
    • Reducers then aggregate that distributed data together
  • Intuitively simple model but can be very versatile
  • MapReduce & YARN used to be the same thing in Hadoop, but they were split apart, which has enabled other applications to be built on top of YARN that solve the same problem as MapReduce more efficiently

Pig

  • A high-level scripting API with a SQL-like syntax that lets you chain queries together and get complex answers without writing Java or Python MapReduce code

Hive

  • Makes your distributed data look like a SQL database
  • Allows you to query HDFS with SQL syntax
  • Very useful interface if you are familiar with SQL

Ambari

  • Sits on top of everything and lets you see what's going on with your cluster, e.g. lets you visualize what's running, what systems you're using, how many resources, etc.
  • Also has some views that allow you to execute Hive queries, see what's in your Hive DB, etc.

Mesos

  • Not part of Hadoop proper
  • Basically an alternative to YARN
  • Solves the same problems as YARN but in different ways, pros & cons to using each one

Spark

  • Sits at the same level as MapReduce, i.e. on top of YARN
  • You write Spark scripts in Python, Java, or Scala (Scala preferred)
  • Extremely fast & under a lot of active development
  • Good tool for quickly, efficiently, and reliably processing data
  • Very exciting and powerful technology currently

Tez

  • Similar to Spark & uses some of the same ideas as Spark
  • Uses DAGs (Directed Acyclic Graphs) like Airflow
    • This gives it a leg up on MapReduce because it can produce more optimal plans for executing your queries

HBase

  • A NoSQL columnar data store that sits off to the side
  • A way of exposing the data in your cluster to transactional platforms
  • Very fast database meant for very large transaction rates

Storm

  • Sits alongside the main Hadoop technologies similar to HBase
  • A way of processing streaming data
  • Spark streaming solves the same problem, Storm just does it a little differently

Oozie

  • Sits off to the side
  • A way of scheduling jobs on your cluster

ZooKeeper

  • Sits off to the side like Oozie
  • Technology for coordinating everything on your cluster
  • Can keep track of which nodes are up, which nodes are down, etc.
  • Many applications rely on ZooKeeper to maintain reliable and consistent performance even when a node goes down

Sqoop

  • Ties your Hadoop cluster to relational databases - it's basically a connector
  • Anything that can talk to ODBC or JDBC can be imported by Sqoop into your HDFS file system

Flume

  • A way of transporting web logs to your cluster reliably and at very large scale

Kafka

  • Solves a similar problem to Flume, but a little more general purpose
  • Can collect data from any source & publish it into your cluster

External Data Storage

MySQL

  • A traditional SQL database
  • Using Sqoop you can also export data from your cluster to a SQL database

Cassandra & MongoDB

  • Like HBase, they are NoSQL data stores
  • Good choice for exposing data for real time usage, like in a webapp
  • Store data in key-value pairs

Query Engines

Drill

  • Allows you to write SQL queries that will work across a wide variety of NoSQL databases
  • Can talk to HBase, Cassandra, MongoDB and tie all results together

Hue

  • A way of interactively creating queries
  • Works well with Hive, HBase

Phoenix

  • Similar to Drill, it allows you to write SQL-style queries across the range of data storage technologies
  • Gives you ACID guarantees and OLTP

Presto

  • Another way to execute queries across your entire cluster

Zeppelin

  • Another query engine, takes more of a notebook-type approach

HDFS

Overview

  • The core of Hadoop
  • Made for handling very large files
  • Breaks these big files into blocks (128 MB is default block size)
    • This allows you to distribute storage and processing among many resources
  • Each block is stored multiple times for redundancy
  • Allows you to use commodity hardware - you don't need to buy expensive, highly available machines, because HDFS keeps redundant copies and can fail over to another node if a drive goes down

HDFS Architecture

  • Name node: a single node that keeps track of where all the distributed blocks live
    • Keeps a table of the locations where every copy of every block is stored
    • Also keeps an edit log that maintains what's being created/modified/stored
  • Data nodes: the nodes that actually store the blocks of data
  • Reading a file
    • First thing a client application will do is talk to the name node, which will tell the application where the data is stored
    • The client application then retrieves the blocks directly from the data nodes
  • Writing a file
    • First thing a client application will do is talk to the name node, which will then create a new entry for the file
    • The client application will send the file to a data node
    • The data nodes then talk to each other to replicate copies of all the blocks across the cluster
    • Once the data has been stored, the information goes back through the client application to the name node, where the locations of all the blocks are tracked
  • Name node resilience - since there is only one name node, you want to plan for the case that it fails
    • Back up metadata
      • simplest way to do it, name node writes to local disk and NFS (Network File System)
      • There will be some lag time between the name node failing and getting a new name node set up, so while this is the easiest method, it is not preferred if you need absolute 100% uptime
    • Secondary name node
      • Maintains a merged copy of the edit log you can restore from
    • HDFS Federation
      • Each name node manages a specific namespace volume
      • Less about reliability than being able to scale beyond a single name node
      • Hadoop is good for large files, not necessarily lots of small files; with many small files a single name node can reach a breaking point
    • HDFS High Availability
      • If you absolutely need 100% uptime
      • Hot standby name node using shared edit log
      • ZooKeeper tracks active name node
      • Uses extreme measures to ensure only one name node is used at a time

Using HDFS

  • UI (Ambari)
  • Command-line interface
  • HTTP/HDFS Proxies
  • Java interface (you don't necessarily need to write your code in Java, there are ways of translating C/Python/Scala/etc. code into Java - most of Hadoop is written in Java)
  • NFS Gateway

MapReduce in Hadoop

  • Distributes the processing of data on your cluster
  • Divides your data into partitions that are mapped (transformed) and reduced (aggregated) by mapper and reducer functions that you define
    • When you map data, you associate it with a certain key value that you care about and want to aggregate on
    • The key is the attribute you want to aggregate on
      • In the movie ratings example from the tutorial Section 2, Video 10 (S2,V10) the key would be the user IDs of the people rating movies
    • The value is the number you want to aggregate
      • In the movie ratings example this would be the movies that each user rated
    • You can have the same key appear multiple times, i.e. you can have the same person rate more than one movie
    • The reducer processes each key's values; it gets called once for each unique key
  • Resilient to failure - an application master monitors your mappers & reducers on each partition
  • MapReduce automatically sorts & groups the mapped data ("Shuffle & Sort")
    • Aggregates all the values by key & sorts the keys
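
As a concrete illustration of the mapper/reducer model above, here is a minimal sketch using Hadoop Streaming (the Python interface mentioned later in these notes). It assumes a hypothetical tab-delimited ratings file with userID, movieID, and rating fields, and counts how many movies each user rated.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming sketch of the movie-ratings example above:
# count how many movies each user rated. Assumes a hypothetical input
# file with tab-separated fields: userID, movieID, rating, timestamp.
import sys

def mapper():
    # Emit (userID, 1) for every rating we see.
    for line in sys.stdin:
        fields = line.strip().split('\t')
        if len(fields) >= 3:
            print('%s\t1' % fields[0])

def reducer():
    # Hadoop's shuffle & sort guarantees all lines for a key arrive together,
    # so we just sum the 1s until the key changes.
    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.strip().split('\t')
        if key != current_key:
            if current_key is not None:
                print('%s\t%d' % (current_key, count))
            current_key, count = key, 0
        count += int(value)
    if current_key is not None:
        print('%s\t%d' % (current_key, count))

if __name__ == '__main__':
    # Run as "ratings_count.py map" for the map phase, "ratings_count.py reduce" for reduce.
    mapper() if sys.argv[1] == 'map' else reducer()
```

You would submit this through the hadoop-streaming JAR, supplying the script as both the mapper and the reducer commands.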

Under-the-hood process

  • Process kicks off from a client node
  • The client node goes and talks to the YARN Resource Manager, which manages what gets run on what machines
  • While the client node is doing that, it's also copying any required files over to HDFS to use for the task
  • From there a MapReduce Application Master gets spun up by YARN, which is run underneath a NodeManager that keeps track of the health of the node, whether it is running or not, etc.
    • The Application Master is in charge of keeping track of all the different MapReduce tasks
    • It works with the Resource Manager to distribute the tasks across your cluster
  • The NodeManager talks to individual nodes that perform the MapReduce task - these nodes are managed by their own NodeManager
  • Your Resource Manager will try to use MapReduce nodes as close to the source of your data in HDFS as possible. If it can't use the same machine as the data blocks, it will try to use a machine as close as possible to limit data transmission

MapReduce Fundamentals

  • MapReduce is natively written in Java
  • Streaming allows interfacing to other languages (e.g. Python)
  • Failure handling
    • MapReduce & Hadoop give you a number of methods for dealing with failure
    • Application Master monitors worker tasks for errors or hanging
      • Restarts as needed, preferably on a different node
    • What if the Application Master goes down?
      • YARN can try to restart it
    • What if an entire node goes down?
      • This could be the Application Master
      • The Resource Manager will try to restart it
    • What if the Resource Manager goes down?
      • Can set up high availability (HA) using ZooKeeper to maintain a hot standby
  • MapReduce has actually been superseded by other applications such as Spark or Hive, so it isn't used quite as much as it used to be

Programming Hadoop with Pig

Pig Overview

  • MapReduce's biggest issue is development time - writing mappers & reducers by hand takes a long time
  • Pig provides a scripting language called Pig Latin, built on top of Hadoop, that makes running MapReduce jobs much easier
    • Pig Latin lets you use SQL-like syntax to define your map and reduce steps
    • Pig Latin is highly extensible with user-defined functions (UDFs)
  • Not much degradation in speed compared to MapReduce, because Pig can run on top of Tez, which is faster than MapReduce
    • Sidebar: Tez gains this speed by using DAGs to plan out jobs efficiently, whereas MapReduce only feeds mapping & reducing functions in series - the instructor said he's seen up to 10x improvement using Tez over MapReduce
  • A few ways of running Pig scripts
    • Grunt: command line interpreter, good for experimenting, small line-by-line jobs
    • Script: you can save a Pig script that you can run from command line
    • Ambari/Hue: there is an Ambari view for Pig that allows you to edit/save/load Pig scripts
  • Datasets in Pig are called relations; each has a schema (the names & types of your fields) that you supply when loading data
  • Pig expects tab delimited files by default, if you have a different delimiter you will need to use PigStorage to specify the delimiter
  • FOREACH/GENERATE allows you to create a new relation that transforms your data, e.g. converting a date column from text to formatted date
    • FOREACH goes through each row of your existing relation and GENERATE creates a new relation from that data
  • GROUP BY groups the data into a new relation similarly to how SQL Group By works
    • Group By creates "bags" for each group, which are tuples of all the grouped data
  • DESCRIBE command describes the schema of a given relation
  • FILTER BY applies a filter to a relation and gives a new relation
  • JOIN BY creates a join between relations
    • Syntax of a DESCRIBE command on a joined relation is: relation name::field (column) name
  • ORDER BY orders a relation by a given field to create a new relation
  • S3V19 shows an example of the above commands, 14:00 shows a script of the full process
  • In Ambari you can write & execute Pig scripts; to use Tez instead of MapReduce, simply check the "Execute on Tez" box at the top of the page next to the "Execute" button
  • Pig + Tez offers performance that puts it on par with Spark in some cases

List of Pig Commands

  • LOAD   STORE   DUMP
    • LOAD is used for reading
    • STORE is used for writing
    • DUMP prints out data to console
    • Example: STORE ratings INTO 'outRatings' USING PigStorage(':')
  • FILTER   DISTINCT   FOREACH/GENERATE   MAPREDUCE   STREAM   SAMPLE
    • FILTER filters data in a relation based on a user supplied Boolean expression
    • DISTINCT returns unique values in a relation
    • FOREACH/GENERATE creates a new relation from an existing relation by going through it line-by-line and transforming it in some way
    • MAPREDUCE can call explicit mappers & reducers on a relation
    • STREAM allows you to stream the output of Pig out to a process using stdin/stdout
    • SAMPLE can be used to create a random sample from your relation
  • JOIN   COGROUP   GROUP   CROSS   CUBE
    • JOIN joins two relations together on a common field (column)
    • COGROUP similar to JOIN, but it creates a separate tuple for each key in a sort of nested structure, whereas JOIN puts the resulting joined rows into a single tuple
    • GROUP a way of grouping data together for a given key (basically doing a reduce)
    • CROSS gives you all the combinations between two relations (Cartesian product)
      • An example of where this would be useful is creating a table of product similarities between products, where you can evaluate every product in a catalog against every other product in a catalog
    • CUBE similar to CROSS but it can take more than two columns (not used that often)
  • ORDER   RANK   LIMIT
    • ORDER allows you to sort data in a relation
    • RANK is similar to ORDER, but it assigns a rank value to each row
    • LIMIT only returns the first n rows
  • UNION   SPLIT
    • UNION takes two relations and unions them
    • SPLIT takes a relation and splits it into more than one relation

Diagnostic Pig Commands

  • DESCRIBE   EXPLAIN   ILLUSTRATE
    • DESCRIBE prints out the schema of a given relation, with the names and types of each field
    • EXPLAIN similar to SQL EXPLAIN PLAN that describes how Pig plans to execute a query
    • ILLUSTRATE goes a bit further than EXPLAIN and takes a sample from each relation and shows exactly what it is doing with each piece of data

User Defined Functions (UDFs) in Pig

  • REGISTER   DEFINE   IMPORT
    • REGISTER for importing .jar files with UDFs written in Java
    • DEFINE assigns names to functions imported using REGISTER
    • IMPORT is used for importing macros for Pig files - you can save reusable bits of Pig code that you save as macros and use in other Pig scripts

Other Pig Functions

  • AVG   CONCAT   COUNT   MAX   MIN   SIZE   SUM
    • AVG takes the average of a set of data in a bag
    • CONCAT concatenates data
    • COUNT gives the count of items in a bag
    • MAX gives the maximum value in a bag
    • MIN gives the minimum value in a bag
    • SIZE gives the size of a bag
    • SUM sums everything up in a bag

Pig Loaders

  • PigStorage   TextLoader   JsonLoader   AvroStorage   ParquetLoader   OrcStorage   HBaseStorage
    • PigStorage allows you to define a delimiter on each row when loading data
    • TextLoader loads up one line of input data or output data per row
    • JsonLoader loads JSON files
    • AvroStorage loads Avro data - a serialization/deserialization format popular with Hadoop because it carries a schema and is splittable across your Hadoop cluster
    • ParquetLoader loads Parquet data - a popular column-oriented data format
    • OrcStorage loads Orc data - a popular compressed data format
    • HBaseStorage loads HBase data

Programming Hadoop with Spark

Spark Overview

  • Spark is a popular and quickly emerging technology in the Hadoop ecosystem for processing massive amounts of data in a very extensible way
  • The official documentation calls it "a fast and general engine for large-scale data processing" (pretty generic)
  • Allows you to write scripts in Java/Scala/Python to do complex manipulations/transformations/analysis of your data
  • What sets it apart from Pig is that it has a rich ecosystem on top of it that allows you to do complex tasks such as machine learning, data mining, graph analysis, streaming data, etc.
  • It uses a Driver Program that talks to a Cluster Manager (YARN, Spark's own built-in cluster manager, or Mesos), which in turn talks to Executor nodes that each have a cache & the tasks they are responsible for
    • The cache is part of the performance because Spark is a memory based solution that tries to retain as much data as it can in RAM instead of going to disk in HDFS
    • Another key to its speed is its use of DAGs
  • It runs programs up to 100x faster than Hadoop MapReduce in memory or 10x faster on disk
  • DAG engine optimizes workflows
  • It's very usable, you can code in Java, Scala, or Python
  • Built around one main concept: the Resilient Distributed Dataset (RDD)
    • Spark 2.0 builds on top of RDDs with a structure called a DataSet, which puts a more SQL-like, structured layer on top of them
  • Libraries built on top of Spark Core:
    • Spark Streaming: allows you to use streaming data instead of batch processing
    • Spark SQL: much of the current development in Spark is focused towards this, which allows you to run Spark jobs using SQL or SQL-like syntax - it also furthers the DAG optimizations with SQL optimization
    • MLLib: a library of machine learning and data mining tools - you can create very high-level classes for machine learning analysis and Spark will break them down into mappers & reducers for you
    • GraphX: an extensible way of performing graph theory analysis
  • Spark is written in Scala
    • Python is ok for initial prototyping and exploration, but Scala should be the preferred language for production type applications
    • Scala's functional programming model is a good fit for distributed processing
    • Gives fast performance (Scala compiles to Java bytecode)
    • Less code & boilerplate than Java
    • Python is slow in comparison & uses more resources
      • Good thing is that Scala code in Spark looks a lot like Python code

Resilient Distributed Dataset (RDD)

  • Ensures that your jobs are evenly distributed across your cluster and are resilient to failure
  • RDDs are created by the SparkContext - the Spark shell creates a "sc" object for you
  • You can create RDDs from:
    • Hive
    • JDBC
    • Cassandra
    • HBase
    • Elasticsearch
    • JSON, CSV, sequence files, object files, various compressed formats

Transforming RDDs (similar to mappers)

  • map   flatmap   filter   distinct   sample   union   intersection   subtract   cartesian
  • map: apply some function to every input row of your RDD to create a new RDD that is transformed - 1 to 1 relationship between number of input rows and output rows
  • flatmap: allows you to apply a function to your input RDD the same as map, but you can have a different number of output rows compared to input rows, e.g. if you wanted to split rows or filter out rows
  • filter: filters out rows of an RDD based on some criteria
  • distinct: returns unique values in an RDD
  • sample: returns a random sample of an RDD
  • union, intersection, subtract, cartesian: operations to combine RDDs

RDD Actions

  • collect: take the results of an RDD and reduce it down to something that can fit in memory (something that can be returned as a Python object)
  • count: gives you a count of the number of rows in an RDD
  • countByValue: gives you a count of each unique value in an RDD
  • take: can be used to take a certain subset of rows (e.g. take the top 10 results)
  • top: can return just the top n rows of an RDD
  • reduce: lets you define a function that combines the values in an RDD down to a single result (reduceByKey does the same per unique key)
  • Nothing actually happens in your driver program until an action is called

Spark Methods and Commands

  • SparkConf(): Spark Configuration class - you can set the name of the application, set how the job gets distributed, what cluster it runs on, how much memory is allocated to each executor, etc.
  • SparkContext(SparkConf): sets up an RDD
  • spark-submit command runs your Spark job for you
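
A minimal PySpark sketch tying these pieces together - SparkConf/SparkContext, a couple of transformations, and an action. The input path and field layout (whitespace-separated userID, movieID, rating) are assumptions for illustration.

```python
from pyspark import SparkConf, SparkContext

# Build a histogram of rating values from a (hypothetical) ratings file on HDFS.
conf = SparkConf().setAppName("RatingsHistogram")
sc = SparkContext(conf=conf)

lines = sc.textFile("hdfs:///user/maria_dev/ml-100k/u.data")  # hypothetical path
ratings = lines.map(lambda line: (line.split()[2], 1))        # transformation: map to (rating, 1)
counts = ratings.reduceByKey(lambda a, b: a + b)              # transformation: sum values per key

# Nothing has executed yet - collect() is the action that triggers the job.
for rating, count in sorted(counts.collect()):
    print(rating, count)
```

You would launch this with spark-submit rather than running it as a plain Python script.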

Spark SQL - DataFrames and DataSets

  • Spark is going more towards using structured data

  • DataFrames extend RDDs

    • Contain Row objects
    • Can run SQL queries
    • Has a schema (leading to more efficient storage)
    • Read and write to JSON, Hive, parquet
    • Communicates with JDBC/ODBC, Tableau
  • Example DataFrame methods:

    • myResultDataFrame.show()
    • myResultDataFrame.select("someFieldName")
    • myResultDataFrame.filter(myResultDataFrame("someFieldName") > 200)
    • myResultDataFrame.groupBy(myResultDataFrame("someFieldName")).mean()
    • myResultDataFrame.rdd().map(mapperFunction)
  • In Spark 2.0 a DataFrame is really a DataSet of Row objects

  • DataSets can wrap known, typed data too - but this is mostly transparent in Python since Python is dynamically typed

    • Not a big issue with Python, but the "Spark 2.0 way" is to use DataSets instead of DataFrames when you can
  • DataSets are easily transportable between the various Spark 2.0 libraries

  • In Spark 2.0 you don't create a SparkContext, but rather a SparkSession, which starts a SparkContext and a SQLContext simultaneously
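
A minimal Spark 2.0 sketch of the DataFrame/DataSet workflow described above, using a SparkSession. The JSON file name and column names are assumptions for illustration.

```python
from pyspark.sql import SparkSession

# One SparkSession replaces the separate SparkContext + SQLContext.
spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

df = spark.read.json("people.json")   # DataFrame with an inferred schema
df.show()
df.select("name").show()
df.filter(df["age"] > 21).show()
df.groupBy("age").count().show()

# You can also register the DataFrame and fall back to plain SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 21").show()

spark.stop()
```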

Using Relational Data Stores with Hadoop

Hive: Distributing SQL Queries with Hadoop

  • Hive translates SQL queries to MapReduce or Tez jobs on your cluster

Why Hive?

  • Uses familiar SQL syntax
  • Interactive
  • Scalable - works with "big data" on a cluster
    • Most appropriate for data warehouse applications
  • Easy OLAP (Online Analytical Processing) queries - much easier than writing MapReduce in Java
  • Highly optimized
  • Highly extensible
    • UDFs
    • Thrift server
    • JDBC/ODBC driver
  • Good for online analytics across a large dataset, not the best option for real-time transactions

Why not Hive?

  • High latency - not appropriate for OLTP (Online Transaction Processing)
  • Stores data de-normalized (not a true relational DB under the hood)
  • SQL is limited in what it can do
    • Pig, Spark can do more complex operations
  • No transactions
  • No record-level updates, inserts, deletes

HiveQL

  • Pretty much MySQL with some extensions
    • For example: views
      • Can store results of a query into a view, which subsequent queries can use as a table
      • Not exactly the same as a materialized view, a view in Hive doesn't store a copy of the data anywhere
  • Allows you to specify how structured data is stored and partitioned

How Hive Works

Schema On Read

  • A traditional relational database uses Schema On Write - you define the schema before loading the data, and it is enforced as data is written to disk. Hive flips that: it takes unstructured data and applies a schema to it as it's being read
  • The actual data is just stored on a text file somewhere, all Hive does is impart structure to it as it's being read
  • Hive maintains a "metastore" that imparts a structure you define on unstructured data that is stored on HDFS, etc.

Hive Data Loading Commands

  • LOAD DATA: moves data from a distributed filesystem into Hive
    • Assumption is that the data is large since it's coming from a distributed system & making a copy would be a poor use of space
  • LOAD DATA LOCAL: copies data from your local filesystem into Hive
    • Assumption is that the data isn't too massive since it's coming from your local filesystem so it will make a copy & preserve the data on your local system
  • Managed vs External tables
    • Managed tables are completely owned by Hive, i.e. if you drop a managed table in Hive, that data is gone
    • External tables are able to be shared outside of Hive with other systems - when you create an external table you specify a location where the data actually lives & Hive doesn't take full ownership of it

Partitioning

  • If you have a massive dataset, you can store your data in partitioned subdirectories
    • Huge optimization if your queries are only on certain partitions

Ways to Use Hive

  • Interactive via hive> prompt/CLI
  • Saved query files
  • Through Ambari/Hue
  • Through JDBC/ODBC server
  • Through Thrift service
    • But Hive is not suitable for OLTP, HBase would be preferred for this
  • Via Oozie
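
Beyond the interfaces listed above, Spark can also act as a Hive client if it was built with Hive support. A minimal sketch, assuming a hypothetical ratings table already defined in the Hive metastore:

```python
from pyspark.sql import SparkSession

# Query Hive from Spark; assumes Spark was built with Hive support and a
# hypothetical "ratings" table exists in the metastore.
spark = (SparkSession.builder
         .appName("HiveExample")
         .enableHiveSupport()
         .getOrCreate())

# HiveQL runs as-is; Hive translates it to MapReduce or Tez jobs under the hood.
top_movies = spark.sql("""
    SELECT movieID, COUNT(*) AS rating_count
    FROM ratings
    GROUP BY movieID
    ORDER BY rating_count DESC
    LIMIT 10
""")
top_movies.show()
```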

Integrating MySQL & Hadoop

  • MySQL is a popular, free relational database that is not distributed
  • However it can be used for OLTP, so exporting data into MySQL can be useful
  • Existing data may exist in MySQL that you want to import into Hadoop
  • Sqoop (SQL + Hadoop) allows you to interact between SQL & Hadoop
    • It kicks off a MapReduce job to handle importing/exporting data
    • Mappers take data from SQL & import into HDFS
    • Spinning up the jobs can take some time so it's best used for large datasets, not optimal for small jobs
  • You can import data from SQL directly into Hive without going to HDFS first
  • You can do incremental imports to keep your relational database and Hadoop in sync
  • You can also export data directly from Hive into SQL
    • The target table must already exist in SQL with columns in expected order

Using Non-Relational Data Stores with Hadoop

Why NoSQL?

  • Relational databases are great for analytic work
  • However they may not be the best option for running queries across a massive data set very quickly
  • NoSQL databases give up the rich query language of SQL for the ability to run simple queries at great scale
  • These systems are built to scale horizontally forever and be very fast and very resilient
  • Scaling up SQL DBs to massive loads requires extreme measures

HBase

  • Non-relational, scalable database built on top of HDFS
  • Based on Google's BigTable
  • No query language, just an API
  • Available operations (CRUD)
    • Create
    • Read
    • Update
    • Delete
  • It can do these simple operations really quickly and at a massive scale
  • HBase automatically partitions your data on top of HDFS (Auto-sharding) into what are called Region Servers
  • HBase Data Model
    • Fast access to any given ROW
    • A ROW is referenced by a unique KEY
    • Each ROW has some small number of COLUMN FAMILIES
    • A COLUMN FAMILY may contain arbitrary COLUMNS
    • You can have a very large number of COLUMNS in a COLUMN FAMILY
    • Each CELL can have many VERSIONS with given timestamps
    • Sparse data is OK - missing columns in a row consume no storage
  • Ways to access HBase:
    • HBase shell
    • Java API
      • Wrappers for Python, Scala, etc.
    • Spark, Hive, Pig
      • Will typically use these services to manipulate/transform data then dump the results into HBase
    • REST service
    • Thrift service
    • Avro service
  • Integrating Pig with HBase
    • Must create HBase table ahead of time
    • Your relation must have a unique key as its first column, followed by subsequent columns as you want them saved in HBase
    • USING clause allows you to STORE into an HBase table
    • Can work at scale - HBase is transactional on rows
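
A minimal CRUD sketch against HBase using the third-party happybase Python client, which talks to HBase's Thrift service; the host, table name, and column family below are assumptions for illustration.

```python
import happybase

# Connect through HBase's Thrift server (must be running).
connection = happybase.Connection('localhost')

# One column family ("rating") that can hold an arbitrary number of columns.
if b'user_ratings' not in connection.tables():
    connection.create_table('user_ratings', {'rating': dict()})

table = connection.table('user_ratings')

# Create/Update: the row key is the user ID, one column per movie rated.
table.put(b'user:1', {b'rating:movie50': b'5', b'rating:movie172': b'4'})

# Read: fast access to a single row by its key.
print(table.row(b'user:1'))

# Delete a single column from the row.
table.delete(b'user:1', columns=[b'rating:movie172'])
```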

Cassandra

  • A distributed database with no single point of failure
  • Unlike HBase, there is no master node at all - every node runs exactly the same software and performs the same functions
  • Data model is similar to BigTable/HBase
  • It's non-relational, but has a limited CQL query language as its interface
  • Built for high availability, massive transaction rates, and high scalability
  • The CAP Theorem states that you can only have 2 out of 3: Consistency, Availability, and Partition-tolerance
    • Partition-tolerance is a requirement with "big data" so you really only get to choose between consistency and availability
  • Cassandra favors availability over consistency
    • It is "eventually consistent"
    • But you can specify your consistency requirements as part of your requests - so really it's "tunable consistency"
  • Cassandra is great for fast access to rows of information
  • Get the best of both worlds - replicate Cassandra to another ring that is used for analytics and Spark integration
  • Cassandra's API is CQL, which makes Cassandra look like an ordinary database driver to applications
  • CQL is like SQL but with some big limitations
    • No joins
      • Data must be de-normalized
      • So, it is still non-relational
    • All queries must be on some primary key
      • Secondary indices are supported, but primary keys will give you the best performance
  • CQLSH (CQL Shell) can be used on the command line to create tables, etc.
  • All tables must be in a keyspace - keyspaces are like databases
  • DataStax offers a Spark-Cassandra connector
    • Allows you to read and write Cassandra tables as DataFrames
    • Is smart about passing queries on those DataFrames to the appropriate level
  • Use cases:
    • Use Spark for analytics on data stored in Cassandra
    • Use Spark to transform data and store it into Cassandra for transactional use
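
A minimal CQL sketch using the DataStax Python driver (cassandra-driver); the keyspace, table, and replication settings are assumptions for illustration.

```python
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS movielens
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace('movielens')
session.execute("""
    CREATE TABLE IF NOT EXISTS users (
        user_id int PRIMARY KEY, age int, occupation text
    )
""")

# Writes and reads both go through CQL; queries must hit the primary key.
session.execute("INSERT INTO users (user_id, age, occupation) VALUES (%s, %s, %s)",
                (1, 33, 'engineer'))
row = session.execute("SELECT * FROM users WHERE user_id = %s", (1,)).one()
print(row.user_id, row.age, row.occupation)

cluster.shutdown()
```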

MongoDB

  • Popular in the corporate world
  • Name comes from managing HuMONGOus data
  • Unlike Cassandra, MongoDB favors consistency over availability
  • Document based data model
    • Looks like JSON
  • No real schema is enforced
    • You can have different fields in every document if you want to
    • No single "key" as in other databases
      • But you can create indices on any fields you want, or even combinations of fields
      • If you want to "shard," then you must do so on some index
    • Results in a lot of flexibility
  • MongoDB terminology
    • Databases
    • Collections
    • Documents
  • MongoDB architecture: Replication Sets
    • Single master
    • Maintains backup copies of your database instance
      • Secondaries can elect a new primary in seconds if your primary goes down
      • But make sure your operation log is long enough to give you time to recover the primary when it comes back
  • Replication Set Quirks
    • A majority of the servers in your set must agree on the primary
      • Even numbers of servers (2, 4, 6, etc.) don't work well
    • If you don't want to spend money on 3 servers you can set up an 'arbiter' node (but only one)
    • Apps must know about enough servers in the replica set to be able to reach one to learn who's primary
    • Replicas only address durability, not your ability to scale
      • Unless you can take advantage of reading from secondaries, which generally isn't recommended
      • And your DB will still go into read-only mode for a bit while a new primary is elected
    • Delayed secondaries can be set up as insurance against people doing dumb things
  • Sharding
    • Ranges of some indexed value you specify are assigned to different replica sets
  • Sharding Quirks
    • Auto-sharding sometimes doesn't work
      • Split storms, mongos processes restarted too often
    • You must have 3 config servers
      • And if any one goes down, your DB is down
      • This is on top of the single-master design of replica sets
    • MongoDB's loose document model can be at odds with effective sharding
  • MongoDB Positives
    • It's not just a NoSQL database - very flexible document model
    • Shell is a full JavaScript interpreter
    • Supports many indices
      • But only one can be used for sharding
      • More than 2-3 are still discouraged
      • Full text indices for text searches
      • Spatial indices
    • Built-in aggregation capabilities, MapReduce, GridFS
      • For some applications you might not need Hadoop at all
      • But MongoDB still integrates with Hadoop, Spark, and most languages
    • A SQL connector is available
      • But MongoDB still really isn't designed for joins and normalized data
  • MongoDB does not automatically index like Cassandra does; if you want an index you must create it yourself
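
A minimal pymongo sketch of the points above - a schema-less document model, a manually created index, and a simple query; the database, collection, and field names are assumptions for illustration.

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient('mongodb://localhost:27017/')
db = client['movielens']
users = db['users']

# Documents in the same collection do not need identical fields.
users.insert_one({'user_id': 1, 'name': 'Fred', 'occupation': 'engineer'})
users.insert_one({'user_id': 2, 'name': 'Wilma', 'hobbies': ['hive', 'pig']})

# Indexes are not created automatically - build one on the field you query by.
users.create_index([('user_id', ASCENDING)], unique=True)

print(users.find_one({'user_id': 1}))
print(users.count_documents({'occupation': 'engineer'}))
```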

Choosing a Database Technology

  • Integration Considerations
    • Different technologies have different connectors to other technologies
    • If you currently have a big job running in Spark, you'd probably want to use a database that works well with Spark
  • Scaling Requirements
    • Is your data going to grow unbounded over time?
    • What are your transaction rates?
    • High growth & high transaction rates would suggest distributed NoSQL DBs that can scale horizontally
  • Support Considerations
    • Do you have the expertise to spin up and configure the technology?
    • Does the technology offer paid, professional support?
  • Budget Considerations
    • Most technologies are open source installed on Linux machines, so the largest monetary considerations will be the cost of the servers and support
  • CAP Considerations
    • Need to pick 2 of 3 of: Consistency, Availability, & Partition-tolerance
    • These tradeoffs have become a little more loose recently
      • E.g. you can configure your Cassandra DB to be more consistent than it would be out of the box
  • Simplicity
    • Keep it simple - determine the minimum requirements for your business case

Some Examples

  • Building an internal phone directory app
    • Scale: limited
    • Consistency: eventual is fine
    • Availability requirements: not mission critical
    • MySQL (or another SQL based technology) is probably already installed within your organization
  • Want to mine web server logs for interesting patterns
    • What are the most popular times of day? What's the average session length, etc?
    • Not high transaction rates so importing data into HDFS & running a Spark job would suffice (unless you want to vend this data to a massive external audience)
  • Building a movie recommendation system
    • You have a big Spark job that produces movie recommendations for end users nightly
    • Something needs to vend this data to your web applications
    • You work for a huge company with massive scale
    • Downtime is not tolerated
    • Must be fast
    • Eventual consistency is OK
    • Cassandra would be a good choice here
  • Building a massive stock trading system
    • Consistency is more important than anything
    • "Big Data" is present
    • It's really, really important - so having access to professional support might be a good idea, and you have the budget to pay for it
    • MongoDB would be a good option

External Query Engines

Apache Drill

  • Apache Drill is a SQL query engine for a variety of non-relational databases and data files
    • Hive, MongoDB, HBase
    • Even flat JSON or Parquet files on HDFS, S3, Azure, Google cloud, local file system
  • Based on Google's Dremel
  • It's real SQL, not SQL-like
    • Based on SQL:2003 standard - any SQL query that you write can be run with Drill
    • And it has an ODBC/JDBC driver so other tools can connect to it just like any relational database
  • It's fast and pretty easy to set up
    • But there are still non-relational databases under the hood so joins will not be efficient
    • Allows SQL analysis of disparate data sources without having to transform and load it first
      • Internally data is represented as JSON and has no fixed schema
  • You can even do joins across different database technologies
    • Or even with flat JSON files
    • It is essentially SQL for your entire ecosystem

Apache Phoenix

  • Similar to Drill but it only works with HBase
  • A SQL driver for HBase that supports transactions
  • Fast, low-latency - OLTP support
  • Originally developed by Salesforce, then open-sourced
  • Exposes a JDBC connector for HBase
  • Supports secondary indices and user-defined functions
  • Integrates with MapReduce, Spark, Hive, Pig, and Flume
  • Why Phoenix?
    • It's really fast - you most likely won't pay a performance cost for having an extra layer on top of HBase
    • Why Phoenix and not Drill?
      • Should choose the right tool for the job - if you're only dealing with HBase and want a SQL interface, Phoenix would be a good choice
        • Developers working on Phoenix will likely pay closer attention to HBase optimization than those working on Drill
    • Why Phoenix & not HBase's native clients?
      • Apps, analysts, etc. may find SQL easier to work with
      • Phoenix can do the work of optimizing more complex queries for you
        • But remember, HBase is still fundamentally non-relational
  • Using Phoenix
    • Command line interface
    • Phoenix API for Java
    • JDBC Driver (thick client)
    • Phoenix Query Server (PQS) (thin client)
      • Intended to eventually enable non-JVM access
    • JARs for MapReduce, Hive, Pig, Flume, Spark

Presto

  • Distributing queries across different data stores
  • It's a lot like Drill
    • It can connect to many different "big data" databases and data stores at once and query across them
    • Familiar SQL syntax
    • Optimized for OLAP - analytical queries, data warehousing
  • Developed & still partially maintained by Facebook
  • Exposes JDBC, Command-Line & Tableau interfaces

Why Presto vs Drill?

  • It has a Cassandra connector
  • At Facebook: over 1,000 Facebook employees run more than 30,000 Presto queries daily that scan over a petabyte each day
  • A single Presto query can combine data from multiple sources, allowing for analytics across an entire organization
  • Presto allows for fast analytics without the need for expensive hardware or proprietary software

Presto Connectors

  • Cassandra
  • Hive
  • MongoDB
  • MySQL
  • Local files
  • Also: Kafka, JMX, PostgreSQL, Redis, Accumulo

Managing Your Cluster

YARN

  • Introduced in Hadoop 2
  • Separates problem of managing resources on your cluster from MapReduce
  • Enabled development of MapReduce alternatives (Spark, Tez) built on top of YARN
  • It mostly works under the hood, managing the usage of your cluster
    • You can write code against it, but there's no real need to these days with the other tools in the Hadoop ecosystem
  • YARN sits on top of HDFS & MapReduce/Spark/Tez sit on top of it
  • YARN is the compute layer for your cluster

How YARN Works

  • Your application talks to the Resource Manager to distribute work to your cluster
  • You can specify data locality - which HDFS block(s) do you want to process?
    • YARN will try to get your process on the same node that has your HDFS blocks
  • You can specify different scheduling options for applications
    • You can run more than one application at once on your cluster
    • Schedulers: FIFO, Capacity, and Fair
      • FIFO runs jobs in sequence, First In First Out
      • Capacity may run jobs in parallel if there's enough spare capacity
      • Fair may cut into a larger running job if you want to squeeze in a small one

Tez

  • Makes Hive, Pig, or MapReduce jobs run faster
  • It's an application framework that clients can code against as a replacement for MapReduce
  • Constructs Directed Acyclic Graphs (DAGs) for more efficient processing of distributed jobs
    • Relies on a more holistic view of your job & eliminates unnecessary steps and dependencies
  • Optimizes physical data flow and resource usage
  • All you need to do is tell Hive or Pig to use it (it probably does by default anyway)

Mesos

  • Another resource negotiator, not directly associated with Hadoop
  • Came out of UC Berkeley and was popularized at Twitter - it's a system that manages resources across your data centers
  • Not just for big data - it can allocate resources for web servers, small scripts, etc.
  • Meant to solve a more general problem than YARN - it's really a general container management system that manages more than just your Hadoop cluster
  • Mesos isn't really part of the Hadoop ecosystem, but it works well with it
    • Spark and Storm may run on Mesos instead of YARN
    • YARN may be integrated with Mesos using Myriad
      • So you don't necessarily need to partition your data center between your Hadoop cluster and everything else

Differences between Mesos and YARN

  • YARN is a monolithic scheduler - you give it a job and YARN figures out where to run it
  • Mesos is a two-tiered system
    • Mesos just makes offers of resources to your application (framework)
    • Your framework decides whether to accept or reject them
    • You also decide your own scheduling algorithm
  • YARN is optimized for long, analytical jobs like you see in Hadoop
  • Mesos is built to handle that, as well as long-lived processes (servers) and short-lived processes

How Mesos fits in

  • If you're looking for an architecture that you can code all of your organization's cluster applications against (not just Hadoop stuff) - Mesos can be really useful
    • Should also look at Kubernetes/Docker
  • If all you typically run is Spark and Storm, Mesos is an option
    • Although YARN is probably better, especially if all your data is on HDFS
    • Spark on Mesos is limited to one executor per slave (node)

When to use Mesos

  • If your organization as a whole has chosen to use Mesos to manage its computing resources in general
    • Then you can avoid partitioning off a Hadoop cluster by using Myriad
    • There is also a Hadoop on Mesos package for Cloudera that bypasses YARN completely
  • Otherwise, probably not

ZooKeeper

What is ZooKeeper?

  • It basically keeps track of information that must be synchronized across your cluster
    • Which node is the master?
    • What tasks are assigned to which workers?
    • Which workers are currently available?
  • It's a tool that applications can use to recover from partial failures on your cluster
  • An integral part of HBase, High-Availability (HA) MapReduce, Drill, Storm, Solr, and much more

Failure modes that ZooKeeper can help with

  • Master crashes - needs to fail over to a backup & ZooKeeper keeps track of who the new master is so there are not multiple masters
  • Worker crashes - its work needs to be redistributed
  • Network trouble - part of your cluster can't see the rest of it

ZooKeeper executes "primitive" operations in a distributed system

  • Master election
    • One node registers itself as a master and holds a "lock" on that data
    • Other nodes cannot become master until that lock is released
    • Only one node allowed to hold the lock at a time
  • Crash detection
    • "Ephemeral" data on a node's availability automatically goes away if the node disconnects, or fails to refresh itself after some time-out period
  • Group management
  • Metadata
    • List of outstanding tasks, task assignments
  • But ZooKeeper's API is not about these primitives
    • Instead they have built a more general purpose system that makes it easy for applications to implement them
    • Its API is really a small distributed file system
      • With strong consistency guarantees
      • Replace the concept of "file" with "znode" (the name that ZooKeeper gives to its nodes) and that is essentially what it is
    • ZooKeeper API: Create, delete, exists, setData, getData, getChildren

Notifications

  • A client can register for notifications on a znode
    • Avoids continuous polling
    • Example: register for notification on /master node - if it goes away, try to take over as the new master

Persistent and Ephemeral znodes

  • Persistent znodes remain stored until explicitly deleted
    • i.e. assignment of tasks to workers must persist even if master crashes
  • Ephemeral znodes go away if the client that created it crashes or loses its connection to ZooKeeper
    • i.e. if the master crashes, it should release its lock on the znode that indicates which node is the master
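
A minimal sketch of persistent vs ephemeral znodes and watches using the third-party kazoo Python client; the connection string and paths are assumptions for illustration.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

# Persistent znode: survives client crashes (e.g. task assignments).
zk.ensure_path('/tasks')
zk.create('/tasks/task-001', b'process logs', makepath=True)

# Ephemeral znode: disappears if this client dies - a crude master "lock".
if not zk.exists('/master'):
    zk.create('/master', b'worker-42', ephemeral=True)

# Watch: get notified instead of polling; fires when /master changes or goes away.
@zk.DataWatch('/master')
def on_master_change(data, stat):
    print('master is now:', data)

zk.stop()  # stopping the client removes its ephemeral znodes and watches
```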

ZooKeeper Architecture

  • Typically want to have more than one ZooKeeper server in case a server goes down
    • Called a ZooKeeper ensemble
  • ZooKeeper quorum - the number of servers that must agree for a process to continue
    • Recommended to have a minimum ZooKeeper ensemble of 5 with a quorum of 3
  • ZooKeeper has the same potential for consistency/availability issues that MongoDB has

Oozie

  • Orchestrates your Hadoop jobs
  • Burmese for "elephant keeper"
  • A system for running and scheduling Hadoop tasks

Workflows

  • A multi-stage Hadoop job
    • Might chain together MapReduce, Hive, Pig, sqoop, and distcp tasks
    • Other systems available via add-ons (like Spark)
  • A workflow is a DAG of actions
    • Specified via XML
    • So you can run actions that don't depend on each other in parallel

Steps to set up a workflow in Oozie

  • Make sure each action works on its own
  • Make a directory in HDFS for your job
  • Create your workflow.xml file and put it in your HDFS folder
  • Create job.properties defining any variables your workflow.xml needs
    • This goes on your local file system where you'll launch the job from
    • You could also set these properties within your XML

Oozie Coordinators

  • Schedules workflow execution
  • Launches workflows based on a given start time and frequency
  • Will also wait for required input data to become available
  • Run in exactly the same way as a workflow

Oozie Bundles

  • New in Oozie 3.0
  • A bundle is a collection of coordinators that can be managed together
  • Example: you may have a bunch of coordinators for processing log data in various ways
    • By grouping them in a bundle, you could suspend them all if there were some problem with the log collection

Zeppelin

  • A notebook interface for big data
  • Makes it easy for experimenting and visualizing data
  • Similar to iPython (Jupyter) notebooks
    • Lets you interactively run scripts/code against your data
    • Can interleave with nicely formatted notes
    • Can share notebooks with others on your cluster

Apache Spark Integration

  • Can run Spark code interactively (like you can on the Spark shell)
    • Speeds up development cycle
    • And allows easy experimentation and exploration of big data
  • Can execute SQL queries directly against SparkSQL
  • Query results may be visualized in charts and graphs
  • Makes Spark feel more like a data science tool
  • Zeppelin also integrates with many more tools than Spark

Feeding Data to Your Cluster

What is Streaming?

  • Up to this point, everything has been in reference to processing historical, existing big data
    • Sitting on HDFS
    • Sitting in a database
  • How does new data get into your cluster, especially if it's "Big Data"?
    • New log entries from web servers
    • New sensor data from IOT devices
    • New stock trades
  • Streaming lets you publish this data to your cluster in real time
    • And you can even process it in real time as it comes in

Two Problems:

  • How to get data from many different sources flowing into your cluster
  • Processing it when it gets there

Kafka

  • Kafka is a general-purpose publish/subscribe messaging system
  • Kafka servers store all incoming messages from publishers for some period of time, and publishes them to a stream of data called a topic
  • Kafka consumers subscribe to one or more topics, and receive data as it's published
  • A stream/topic can have many different consumers, all with their own position in the stream maintained
  • It's not only for Hadoop

How Kafka Scales

  • Kafka itself may be distributed among many processes on many servers
    • Will distribute the storage of stream data as well
  • Consumers may also be distributed
    • Consumers of the same group will have messages distributed amongst them
    • Consumers of different groups will get their own copy of each message
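
A minimal publish/subscribe sketch using the third-party kafka-python client; the broker address, topic, and consumer group are assumptions for illustration.

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish a message to a topic.
producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('weblogs', b'GET /index.html 200')
producer.flush()

# Consumers in the same group split the topic's messages between them;
# a different group gets its own full copy of the stream.
consumer = KafkaConsumer('weblogs',
                         bootstrap_servers='localhost:9092',
                         group_id='log-analyzers',
                         auto_offset_reset='earliest')
for message in consumer:
    print(message.topic, message.offset, message.value)
```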

Flume

  • Another way to stream data into your cluster
  • Made from the start with Hadoop in mind
    • Built-in sinks for HDFS and HBase
  • Originally made to handle log aggregation

Flume Agents

  • Source
    • Where data is coming from
    • Can optionally have Channel Selectors and Interceptors
  • Channel
    • How data is transferred (via memory or files)
  • Sink
    • Where data is going
    • Can be organized into sink groups
    • A sink can connect to only one channel
      • Channel is notified to delete a message once the sink processes it

Built-in Source Types

  • Spooling directory
  • Avro
  • Kafka
  • Exec
  • Thrift
  • Netcat
  • HTTP
  • Custom sources written in Java

Built-in Sink Types

  • HDFS
  • Hive
  • HBase
  • Avro
  • Thrift
  • Elasticsearch
  • Kafka
  • Custom sinks written in Java

Analyzing Streams of Data

Spark Streaming

Why Spark Streaming?

  • "Big Data" never stops
  • Analyze data streams in real time, instead of huge batch jobs daily
  • Analyzing streams of web log data to react to user behavior
  • Analyze streams of real-time sensor data for IoT applications

How it works: High Level

  • Data streams come in from whatever source(s) are connected
  • Spark cluster will have a number of receivers to receive the data
  • This data will be discretized by the Spark Streaming servers into RDDs based on a batch increment time
  • RDDs can then be output to other systems for processing and analysis
  • Not exactly real-time - data is processed in micro-batches; if you set the batch increment time to 1 second, it is essentially real-time from the user's perspective
  • This work can be distributed - processing of RDDs can happen in parallel on different worker nodes

DStreams (Discretized Streams)

  • Generates the RDDs for each time step and can produce output at each time step
  • Can be transformed and acted on in much the same way as RDDs
  • Or you can access their underlying RDDs if you need them
  • Actions are applied continuously over time as each new RDD comes in

Common stateless transformations on DStreams

  • Map
  • Flatmap
  • Filter
  • reduceByKey

Stateful data

  • You can also maintain a long-lived state on a DStream
    • For example: running totals, broken down by keys
    • Another example: aggregating session data in web activity

Windowed Transformations

  • Allow you to compute results across a longer time period than your batch interval
    • Example: top-sellers from the past hour
      • You might process data every second (the batch interval) but maintain a window of one hour
  • The window "slides" as time goes on to represent batches within the window interval

Batch interval vs Slide interval vs Window interval

  • The batch interval is how often data is captured into a DStream
  • The slide interval is how often a windowed transformation is computed
  • The window interval is how far back in time the windowed transformation goes
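
A minimal PySpark DStream sketch of these three intervals - 1-second batches, a 5-minute window, a 10-second slide; the socket source and durations are assumptions for illustration.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="WindowedWordCount")
ssc = StreamingContext(sc, 1)          # batch interval = 1 second
ssc.checkpoint("checkpoint")           # required for windowed/stateful operations

lines = ssc.socketTextStream("localhost", 9999)
pairs = lines.flatMap(lambda line: line.split()).map(lambda word: (word, 1))

# windowDuration = 300s, slideDuration = 10s; the second lambda "subtracts"
# old batches as they fall out of the window.
windowed_counts = pairs.reduceByKeyAndWindow(lambda a, b: a + b,
                                             lambda a, b: a - b,
                                             300, 10)
windowed_counts.pprint()

ssc.start()
ssc.awaitTermination()
```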

Structured Streaming

  • A new (as of 2017), higher-level API for streaming structured data
    • Available in Spark 2.0 & 2.1
  • Uses DataSets
    • Like a DataFrame but with more explicit type information
    • A DataFrame is really a DataSet[Row]
  • It essentially creates a DataFrame that continuously gets new DataSets appended to it
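
A minimal Structured Streaming sketch showing how the streaming query reads like ordinary DataFrame code; the socket source and port are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("StructuredStreamingExample").getOrCreate()

# An unbounded DataFrame with one "value" column; new rows keep getting appended.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

words = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
word_counts = words.groupBy("word").count()

# Continuously write the running aggregation to the console.
query = (word_counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```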

Advantages of Structured Streaming

  • Streaming code looks a lot like the equivalent non-streaming code
  • Structured data allows Spark to represent data more efficiently
  • SQL-style queries allow for query optimization opportunities & even better performance
  • Interoperability with other Spark components based on DataSets
    • MLLib is also moving toward DataSets as its primary API
  • DataSets in general are the direction Spark is moving

Apache Storm

  • Another framework for processing continuous streams of data on a cluster
    • Can run on top of YARN (like Spark)
  • Works on individual events, not micro-batches like Spark Streaming
    • If you need sub-second latency, Storm is the best solution

Storm Terminology

  • A stream consists of tuples that flow through spouts that are sources of stream data (Kafka for example)
  • Bolts process stream data as it's received
    • Transform, aggregate, write to databases/HDFS
  • A topology is a graph of spouts and bolts that process your stream

Storm Architecture

  • A Nimbus node sits at the highest level and acts as a sort of job tracker
  • The Nimbus node is connected to ZooKeeper nodes
  • The ZooKeeper nodes are connected to Supervisor nodes, which is where all the work is actually done

Developing Storm Applications

  • Usually done in Java
    • Although bolts may be directed through scripts in other languages
  • Storm Core
    • The lower-level API for Storm
    • "At-least-once" semantics
  • Trident
    • Higher level API for Storm that sits on top of Storm Core
    • "Exactly once" semantics
  • Storm runs your applications "forever" once submitted - until you explicitly stop them

Storm vs Spark Streaming

  • Spark streaming has the benefit of having the rest of Spark at your disposal
    • You can also write jobs in other languages such as Python or Scala, whereas Storm is mostly written in Java
  • But if you need truly real-time (sub-second) processing of events as they come in, Storm is the preferred tool
  • Core Storm offers "tumbling windows" in addition to "sliding windows"
  • Kafka + Storm is a popular combination

Flink

  • Name comes from the German for quick & nimble
  • Another stream processing engine - most similar to Storm
  • Can run on a standalone cluster or on top of YARN or Mesos
  • Highly scalable (1000s of nodes)
  • Fault-tolerant
    • Can survive failures while still guaranteeing exactly-once processing
    • Uses "state snapshots" to achieve this

Flink vs Spark Streaming vs Storm

  • Flink is faster than Storm (an order of magnitude faster)
  • Flink offers real-time streaming like Storm (but if you're using Trident with Storm, you're actually using micro-batches)
  • Flink offers a higher-level API like Trident or Spark, but while still doing real-time streaming
  • Flink has good Scala support like Spark Streaming
  • Flink has an ecosystem of its own like Spark
  • Flink can process data based on event times, not when data was received
    • Impressive windowing system
    • This, plus real-time streaming and exactly-once semantics is important for financial applications
  • But it's relatively young
  • All three seem to be converging on similar features and capabilities, so it becomes more a question of what fits best in your existing environment

Connectors

  • HDFS
  • Cassandra
  • Kafka
  • Others: Elasticsearch, NiFi, Redis, RabbitMQ

Other Technologies

Impala

  • Cloudera's alternative to Hive
  • Massively parallel SQL engine on Hadoop
  • Impala's always running, so you avoid the startup costs when starting a Hive query
    • Made for BI-style queries
  • Bottom line: Impala's often faster than Hive, but Hive offers more versatility
  • Consider using Impala instead of Hive if you're using Cloudera

Accumulo

  • Another BigTable clone (like HBase)
  • But offers a better security model
    • Cell-based access control
  • And server-side programming
  • Consider it for your NoSQL needs if you have complex security requirements
    • But make sure the systems that need to read this data can talk to it

Redis

  • A distributed in-memory data store (like memcache)
  • But it's more than a cache
  • Good support for storing data structures
  • Can persist data to disk
  • Can be used as a data store and not just a cache
  • Popular caching layer for web apps

Ignite

  • An "in-memory data fabric"
  • Think of it as an alternative to Redis
  • But it's closer to a database
    • ACID guarantees
    • SQL support
    • But it's all done in memory

Elasticsearch

  • A distributed document search and analytics engine
  • Really popular
    • Wikipedia, The Guardian, Stack Overflow
  • Can handle things like real-time search-as-you-type
  • When paired with Kibana, great for interactive exploration
  • Amazon offers an Elasticsearch Service

Kinesis (and the AWS ecosystem)

  • Amazon Kinesis is basically the AWS version of Kafka
  • Amazon has a whole ecosystem of its own
    • Elastic MapReduce (EMR)
    • S3 (distributed storage, similar to HDFS)
    • Elasticsearch Service / CloudSearch
    • DynamoDB (similar to Cassandra)
    • Amazon RDS (relational database similar to MySQL)
    • ElastiCache
    • AI / Machine Learning services
  • EMR in particular is an easy way to spin up a Hadoop cluster on demand

Apache NiFi

  • Directed graphs of data routing
    • Can connect to Kafka, HDFS, Hive
  • Web UI for designing complex systems
  • Often seen in the context of IoT sensors, and managing their data

Apache Falcon

  • A "data governance engine" that sits on top of Oozie
  • Included in Hortonworks
  • Like NiFi it allows construction of data processing graphs
  • But it's really meant to organize the flow of your data within Hadoop

Apache Slider

  • Deployment tool for apps on a YARN cluster
  • Allows monitoring of your apps
  • Allows growing or shrinking your deployment as it's running
  • Manages mixed configurations
  • Start/stop applications on your cluster

Designing Real World Systems

Hadoop Architecture Design

Working Backwards

  • Start with the end user's needs, not from where your data is coming from
    • Sometimes you need to meet in the middle
  • What sort of access patterns do you anticipate from your end users?
    • Analytical queries that span large date ranges?
    • Huge amounts of small transactions for very specific rows of data?
    • Both?
  • What availability do these users demand?
  • What consistency do these users demand?
  • The end user patterns are the most important aspect of designing these systems, you shouldn't start with a preferred technology and try to make it work for use cases that it's not designed for

Thinking about requirements

  • Just how big is your big data?
    • Do you really need a cluster?
  • How much internal infrastructure and expertise is available?
    • Should you use AWS or something similar?
    • Do systems you already know fit the bill?
  • What about data retention?
    • Do you need to keep data around forever? For auditing perhaps?
    • Or do you need to purge it often? For privacy?
  • What about security?
    • Legal concerns
  • Latency
    • How quickly do end users need to get a response?
      • Milliseconds? Then something like HBase or Cassandra will be needed
  • Timeliness
    • Can queries be based on day-old data? Minute-old?
      • Oozie scheduled jobs in Hive/Pig/Spark may cut it
    • Or must it be near real-time?
      • In that case, Spark Streaming/Storm/Flink with Kafka or Flume may be the answer

Future-proofing

  • Once you decide where to store your "big data," moving it will be really difficult later on
    • Think carefully before choosing proprietary solutions or cloud-based storage
  • Will business analysts want your data in addition to end users (or vice versa)?

Utilizing existing resources

  • Does your organization have existing components you can use?
    • Don't build a new data warehouse if you already have one
    • Rebuilding existing technology always has negative business value
  • What's the least amount of infrastructure you need to build?
    • Import existing data with Sqoop, etc. if you can
    • If relaxing a "requirement" saves a lot of time and money - at least ask