Udemy Hadoop Training Course Notes

Udemy - "The Ultimate Hands-On Hadoop - Tame Your Big Data!" Course Notes

What is Hadoop? "Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware" - Hortonworks

Features

  • Distributed storage: stores data across many hard drives & has backup copies distributed throughout as well so the failure of one hard drive does not result in lost data
  • Distributed processing: can parallelize operations across many cores

Hadoop History

  • Google published GFS (Google File System - the basis that Hadoop was created from) & MapReduce papers in 2003-2004
    • GFS inspired Hadoop's distributed storage
    • MapReduce inspired Hadoop's distributed processing
  • "Nutch," an open source web search engine, was being built at the same time; Yahoo later backed that work and hired Doug Cutting to continue it
  • Hadoop was primarily driven by Doug Cutting and Tom White in 2006
  • It was originally built for use in batch processing, although it has outgrown that specific use

Overview of the Hadoop Ecosystem

Core Hadoop Ecosystem

HDFS - Hadoop Distributed File System

  • Allows us to distribute the storage of data across our cluster
  • Makes the distributed system look like one single cohesive file system
  • Also maintains redundant copies of data across the cluster as mentioned above

YARN - Yet Another Resource Negotiator

  • Sits on top of HDFS
  • Where the data processing starts to come into play
  • Manages resources on the computing cluster - decides what gets to run tasks when, what nodes are available for extra work, which ones are not
  • The "heartbeat" that keeps the cluster running

MapReduce

  • Sits on top of YARN
  • Programming model that allows you to process your data across a cluster
  • Consists of mapping & reducing functions
    • Mappers have the ability to transform your data in parallel across your cluster in an efficient manner
    • Reducers then aggregate that distributed data together
  • Intuitively simple model but can be very versatile
  • MapReduce & YARN used to be the same thing in Hadoop, but they were split apart, which has enabled other applications to be built on top of YARN that solve the same problem as MapReduce more efficiently

Pig

  • A high-level scripting API with a SQL-like syntax that lets you chain queries together and get complex answers without writing Java or Python MapReduce code

Hive

  • Makes your distributed data look like a SQL database
  • Allows you to query HDFS with SQL syntax
  • Very useful interface if you are familiar with SQL

Ambari

  • Sits on top of everything and lets you see what's going on with your cluster, e.g. lets you visualize what's running, what systems you're using, how many resources, etc.
  • Also has some views that allow you to execute Hive queries, see what's in your Hive DB, etc.

Mesos

  • Not part of Hadoop proper
  • Basically an alternative to YARN
  • Solves the same problems as YARN but in different ways, pros & cons to using each one

Spark

  • Sits at the same level as MapReduce, i.e. on top of YARN
  • You write Spark scripts in Python, Java, or Scala (Scala preferred)
  • Extremely fast & under a lot of active development
  • Good tool for quickly, efficiently, and reliably processing data
  • Very exciting and powerful technology currently

Tez

  • Similar to Spark & uses some of the same ideas as Spark
  • Uses DAGs (Directed Acyclic Graphs) like Airflow
    • This gives it a leg up on MapReduce because it can produce more optimal plans for executing your queries

HBase

  • A NoSQL columnar data store that sits off to the side
  • A way of exposing the data in your cluster to transactional platforms
  • Very fast database meant for very large transaction rates

Storm

  • Sits alongside the main Hadoop technologies similar to HBase
  • A way of processing streaming data
  • Spark streaming solves the same problem, Storm just does it a little differently

Oozie

  • Sits off to the side
  • A way of scheduling jobs on your cluster

ZooKeeper

  • Sits off to the side like Oozie
  • Technology for coordinating everything on your cluster
  • Can keep track of which nodes are up, which nodes are down, etc.
  • Many applications rely on ZooKeeper to maintain reliable and consistent performance even when a node goes down

Sqoop

  • Ties your Hadoop cluster to relational databases - it's basically a connector
  • Anything that can talk to ODBC or JDBC can be imported by Sqoop into your HDFS file system

Flume

  • A way of transporting web logs to your cluster reliably and at very large scale

Kafka

  • Solves a similar problem to Flume, but a little more general purpose
  • Can collect data from any source & publish it into your cluster

External Data Storage

MySQL

  • A traditional SQL database
  • Using Sqoop you can also export data from your cluster to a SQL database

Cassandra & MongoDB

  • Like HBase, they are NoSQL data stores
  • Good choice for exposing data for real time usage, like in a webapp
  • Store data in key-value pairs

Query Engines

Drill

  • Allows you to write SQL queries that will work across a wide variety of NoSQL databases
  • Can talk to HBase, Cassandra, MongoDB and tie all results together

Hue

  • A way of interactively creating queries
  • Works well with Hive, HBase

Phoenix

  • Similar to Drill, it allows you to write SQL-style queries across the range of data storage technologies
  • Gives you ACID guarantees and OLTP

Presto

  • Another way to execute queries across your entire cluster

Zeppelin

  • Another query engine, takes more of a notebook-type approach

HDFS

Overview

  • The core of Hadoop
  • Made for handling very large files
  • Breaks these big files into blocks (128 MB is default block size)
    • This allows you to distribute storage and processing among many resources
  • Each block is stored multiple times for redundancy
  • Allows you to use commodity hardware - you don't need to buy expensive, highly available machines, because HDFS keeps redundant copies and can fail over to another node if a drive goes down

HDFS Architecture

  • Name node: a single node that keeps track of where all the distributed blocks live
    • Keeps a table of the locations where every copy of every block is stored
    • Also keeps an edit log that maintains what's being created/modified/stored
  • Data nodes: the nodes that actually store the blocks of data
  • Reading a file
    • First thing a client application will do is talk to the name node, which will tell the application where the data is stored
    • The client application then retrieves the blocks directly from the data nodes
  • Writing a file
    • First thing a client application will do is talk to the name node, which will then create a new entry for the file
    • The client application will send the file to a data node
    • The data nodes then talk to each other to replicate copies of all the blocks across the cluster
    • Once the data has been stored, the information goes back through the client application to the name node, where the locations of all the blocks are tracked
  • Name node resilience - since there is only one name node, you want to plan for the case that it fails
    • Back up metadata
      • simplest way to do it, name node writes to local disk and NFS (Network File System)
      • There will be some lag time between the name node failing and getting a new name node set up, so while this is the easiest method, it is not preferred if you need absolute 100% uptime
    • Secondary name node
      • Maintains a merged copy of the edit log you can restore from
    • HDFS Federation
      • Each name node manages a specific namespace volume
      • Less about reliability than being able to scale beyond a single name node
      • Hadoop is good for large files, not necessarily lots of small files; with many small files a single name node can reach a breaking point
    • HDFS High Availability
      • If you absolutely need 100% uptime
      • Hot standby name node using shared edit log
      • ZooKeeper tracks active name node
      • Uses extreme measures to ensure only one name node is used at a time

Using HDFS

  • UI (Ambari)
  • Command-line interface
  • HTTP/HDFS Proxies
  • Java interface (you don't necessarily need to write your code in Java, there are ways of translating C/Python/Scala/etc. code into Java - most of Hadoop is written in Java)
  • NFS Gateway

MapReduce in Hadoop

  • Distributes the processing of data on your cluster
  • Divides your data into partitions that are mapped (transformed) and reduced (aggregated) by mapper and reducer functions that you define
    • When you map data, you associate it with a certain key value that you care about and want to aggregate on
    • The key is the attribute you want to aggregate on
      • In the movie ratings example from the tutorial Section 2, Video 10 (S2,V10) the key would be the user IDs of the people rating movies
    • The value is the number you want to aggregate
      • In the movie ratings example this would be the movies that each user rated
    • You can have the same key appear multiple times, i.e. you can have the same person rate more than one movie
    • The reducer processes each key's values; it gets called once for each unique key
  • Resilient to failure - an application master monitors your mappers & reducers on each partition
  • MapReduce automatically sorts & groups the mapped data ("Shuffle & Sort")
    • Aggregates all the values by key & sorts the keys
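
As a concrete illustration of the mapper/reducer model above, here is a minimal sketch using Hadoop Streaming (the Python interface mentioned later in these notes). It assumes a hypothetical tab-delimited ratings file with userID, movieID, and rating fields, and counts how many movies each user rated.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming sketch of the movie-ratings example above:
# count how many movies each user rated. Assumes a hypothetical input
# file with tab-separated fields: userID, movieID, rating, timestamp.
import sys

def mapper():
    # Emit (userID, 1) for every rating we see.
    for line in sys.stdin:
        fields = line.strip().split('\t')
        if len(fields) >= 3:
            print('%s\t1' % fields[0])

def reducer():
    # Hadoop's shuffle & sort guarantees all lines for a key arrive together,
    # so we just sum the 1s until the key changes.
    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.strip().split('\t')
        if key != current_key:
            if current_key is not None:
                print('%s\t%d' % (current_key, count))
            current_key, count = key, 0
        count += int(value)
    if current_key is not None:
        print('%s\t%d' % (current_key, count))

if __name__ == '__main__':
    # Run as "ratings_count.py map" for the map phase, "ratings_count.py reduce" for reduce.
    mapper() if sys.argv[1] == 'map' else reducer()
```

You would submit this through the hadoop-streaming JAR, supplying the script as both the mapper and the reducer commands.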

Under-the-hood process

  • Process kicks off from a client node
  • The client node goes and talks to the YARN Resource Manager, which manages what gets run on what machines
  • While the client node is doing that, it's also copying any required files over to HDFS to use for the task
  • From there a MapReduce Application Master gets spun up by YARN, which is run underneath a NodeManager that keeps track of the health of the node, whether it is running or not, etc.
    • The Application Master is in charge of keeping track of all the different MapReduce tasks
    • It works with the Resource Manager to distribute the tasks across your cluster
  • The NodeManager talks to individual nodes that perform the MapReduce task - these nodes are managed by their own NodeManager
  • Your Resource Manager will try to use MapReduce nodes as close to the source of your data in HDFS as possible. If it can't use the same machine as the data blocks, it will try to use a machine as close as possible to limit data transmission

MapReduce Fundamentals

  • MapReduce is natively written in Java
  • Streaming allows interfacing to other languages (e.g. Python)
  • Failure handling
    • MapReduce & Hadoop give you a number of methods for dealing with failure
    • Application Master monitors worker tasks for errors or hanging
      • Restarts as needed, preferably on a different node
    • What if the Application Master goes down?
      • YARN can try to restart it
    • What if an entire node goes down?
      • This could be the Application Master
      • The Resource Manager will try to restart it
    • What if the Resource Manager goes down?
      • Can set up high availability (HA) using ZooKeeper to maintain a hot standby
  • MapReduce has actually been superseded by other applications such as Spark or Hive, so it isn't used quite as much as it used to be

Programming Hadoop with Pig

Pig Overview

  • MapReduce's biggest issue is development time - writing mappers & reducers by hand takes a long time
  • Pig provides a scripting language called Pig Latin, built on top of Hadoop, that makes running MapReduce jobs much easier
    • Pig Latin lets you use SQL-like syntax to define your map and reduce steps
    • Pig Latin is highly extensible with user-defined functions (UDFs)
  • Not much degradation in speed compared to MapReduce, because Pig can run on top of Tez, which is faster than MapReduce
    • Sidebar: Tez gains this speed by using DAGs to plan out jobs efficiently, whereas MapReduce only feeds mapping & reducing functions in series - the instructor said he's seen up to 10x improvement using Tez over MapReduce
  • A few ways of running Pig scripts
    • Grunt: command line interpreter, good for experimenting, small line-by-line jobs
    • Script: you can save a Pig script that you can run from command line
    • Ambari/Hue: there is an Ambari view for Pig that allows you to edit/save/load Pig scripts
  • Datasets in Pig are called relations; each has a schema (the names & types of your fields) that you supply when loading data
  • Pig expects tab delimited files by default, if you have a different delimiter you will need to use PigStorage to specify the delimiter
  • FOREACH/GENERATE allows you to create a new relation that transforms your data, e.g. converting a date column from text to formatted date
    • FOREACH goes through each row of your existing relation and GENERATE creates a new relation from that data
  • GROUP BY groups the data into a new relation similarly to how SQL Group By works
    • Group By creates "bags" for each group, which are tuples of all the grouped data
  • DESCRIBE command describes the schema of a given relation
  • FILTER BY applies a filter to a relation and gives a new relation
  • JOIN BY creates a join between relations
    • Syntax of a DESCRIBE command on a joined relation is: relation name::field (column) name
  • ORDER BY orders a relation by a given field to create a new relation
  • S3V19 shows an example of the above commands, 14:00 shows a script of the full process
  • In Ambari you can write & execute Pig scripts; to use Tez instead of MapReduce, simply check the "Execute on Tez" box at the top of the page next to the "Execute" button
  • Pig + Tez offers performance that puts it on par with Spark in some cases

List of Pig Commands

  • LOAD   STORE   DUMP
    • LOAD is used for reading
    • STORE is used for writing
    • DUMP prints out data to console
    • Example: STORE ratings INTO 'outRatings' USING PigStorage(':')
  • FILTER   DISTINCT   FOREACH/GENERATE   MAPREDUCE   STREAM   SAMPLE
    • FILTER filters data in a relation based on a user supplied Boolean expression
    • DISTINCT returns unique values in a relation
    • FOREACH/GENERATE creates a new relation from an existing relation by going through it line-by-line and transforming it in some way
    • MAPREDUCE can call explicit mappers & reducers on a relation
    • STREAM allows you to stream the output of Pig out to a process using stdin/stdout
    • SAMPLE can be used to create a random sample from your relation
  • JOIN   COGROUP   GROUP   CROSS   CUBE
    • JOIN joins two relations together on a common field (column)
    • COGROUP similar to JOIN, but it creates a separate tuple for each key in a sort of nested structure, whereas JOIN puts the resulting joined rows into a single tuple
    • GROUP a way of grouping data together for a given key (basically doing a reduce)
    • CROSS gives you all the combinations between two relations (Cartesian product)
      • An example of where this would be useful is creating a table of product similarities between products, where you can evaluate every product in a catalog against every other product in a catalog
    • CUBE similar to CROSS but it can take more than two columns (not used that often)
  • ORDER   RANK   LIMIT
    • ORDER allows you to sort data in a relation
    • RANK is similar to ORDER, but it assigns a rank value to each row
    • LIMIT only returns the first n rows
  • UNION   SPLIT
    • UNION takes two relations and unions them
    • SPLIT takes a relation and splits it into more than one relation

Diagnostic Pig Commands

  • DESCRIBE   EXPLAIN   ILLUSTRATE
    • DESCRIBE prints out the schema of a given relation, with the names and types of each field
    • EXPLAIN similar to SQL EXPLAIN PLAN that describes how Pig plans to execute a query
    • ILLUSTRATE goes a bit further than EXPLAIN and takes a sample from each relation and shows exactly what it is doing with each piece of data

User Defined Functions (UDFs) in Pig

  • REGISTER   DEFINE   IMPORT
    • REGISTER for importing .jar files with UDFs written in Java
    • DEFINE assigns names to functions imported using REGISTER
    • IMPORT is used for importing macros for Pig files - you can save reusable bits of Pig code that you save as macros and use in other Pig scripts

Other Pig Functions

  • AVG   CONCAT   COUNT   MAX   MIN   SIZE   SUM
    • AVG takes the average of a set of data in a bag
    • CONCAT concatenates data
    • COUNT gives the count of items in a bag
    • MAX gives the maximum value in a bag
    • MIN gives the minimum value in a bag
    • SIZE gives the size of a bag
    • SUM sums everything up in a bag

Pig Loaders

  • PigStorage   TextLoader   JsonLoader   AvroStorage   ParquetLoader   OrcStorage   HBaseStorage
    • PigStorage allows you to define a delimiter on each row when loading data
    • TextLoader loads up one line of input data or output data per row
    • JsonLoader loads JSON files
    • AvroStorage loads Avro data - a serialization/deserialization format popular with Hadoop because it carries a schema and is splittable across your Hadoop cluster
    • ParquetLoader loads Parquet data - a popular column-oriented data format
    • OrcStorage loads Orc data - a popular compressed data format
    • HBaseStorage loads HBase data

Programming Hadoop with Spark

Spark Overview

  • Spark is a popular and quickly emerging technology in the Hadoop ecosystem for processing massive amounts of data in a very extensible way
  • The official documentation calls it "a fast and general engine for large-scale data processing" (pretty generic)
  • Allows you to write scripts in Java/Scala/Python to do complex manipulations/transformations/analysis of your data
  • What sets it apart from Pig is that it has a rich ecosystem on top of it that allows you to do complex tasks such as machine learning, data mining, graph analysis, streaming data, etc.
  • It uses a Driver Program that talks to a Cluster Manager (YARN, Spark's own built-in cluster manager, or Mesos), which in turn talks to Executor nodes that each have a cache & the tasks they are responsible for
    • The cache is part of the performance because Spark is a memory based solution that tries to retain as much data as it can in RAM instead of going to disk in HDFS
    • Another key to its speed is its use of DAGs
  • It runs programs up to 100x faster than Hadoop MapReduce in memory or 10x faster on disk
  • DAG engine optimizes workflows
  • It's very usable, you can code in Java, Scala, or Python
  • Built around one main concept: the Resilient Distributed Dataset (RDD)
    • Spark 2.0 builds on top of RDDs with a structure called a DataSet, which puts a more SQL-like, structured layer on top of them
  • Libraries built on top of Spark Core:
    • Spark Streaming: allows you to use streaming data instead of batch processing
    • Spark SQL: much of the current development in Spark is focused towards this, which allows you to run Spark jobs using SQL or SQL-like syntax - it also furthers the DAG optimizations with SQL optimization
    • MLLib: a library of machine learning and data mining tools - you can create very high-level classes for machine learning analysis and Spark will break them down into mappers & reducers for you
    • GraphX: an extensible way of performing graph theory analysis
  • Spark is written in Scala
    • Python is ok for initial prototyping and exploration, but Scala should be the preferred language for production type applications
    • Scala's functional programming model is a good fit for distributed processing
    • Gives fast performance (Scala compiles to Java bytecode)
    • Less code & boilerplate than Java
    • Python is slow in comparison & uses more resources
      • Good thing is that Scala code in Spark looks a lot like Python code

Resilient Distributed Dataset (RDD)

  • Ensures that your jobs are evenly distributed across your cluster and are resilient to failure
  • RDDs are created by the SparkContext - the Spark shell creates a "sc" object for you
  • You can create RDDs from:
    • Hive
    • JDBC
    • Cassandra
    • HBase
    • Elasticsearch
    • JSON, CSV, sequence files, object files, various compressed formats

Transforming RDDs (similar to mappers)

  • map   flatmap   filter   distinct   sample   union   intersection   subtract   cartesian
  • map: apply some function to every input row of your RDD to create a new RDD that is transformed - 1 to 1 relationship between number of input rows and output rows
  • flatmap: allows you to apply a function to your input RDD the same as map, but you can have a different number of output rows compared to input rows, e.g. if you wanted to split rows or filter out rows
  • filter: filters out rows of an RDD based on some criteria
  • distinct: returns unique values in an RDD
  • sample: returns a random sample of an RDD
  • union, intersection, subtract, cartesian: operations to combine RDDs

RDD Actions

  • collect: take the results of an RDD and reduce it down to something that can fit in memory (something that can be returned as a Python object)
  • count: gives you a count of the number of rows in an RDD
  • countByValue: gives you a count of each unique value in an RDD
  • take: can be used to take a certain subset of rows (e.g. take the top 10 results)
  • top: can return just the top n rows of an RDD
  • reduce: lets you define a function that combines the values in an RDD down to a single result (reduceByKey does the same per unique key)
  • Nothing actually happens in your driver program until an action is called

Spark Methods and Commands

  • SparkConf(): Spark Configuration class - you can set the name of the application, set how the job gets distributed, what cluster it runs on, how much memory is allocated to each executor, etc.
  • SparkContext(SparkConf): sets up an RDD
  • spark-submit command runs your Spark job for you
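
A minimal PySpark sketch tying these pieces together - SparkConf/SparkContext, a couple of transformations, and an action. The input path and field layout (whitespace-separated userID, movieID, rating) are assumptions for illustration.

```python
from pyspark import SparkConf, SparkContext

# Build a histogram of rating values from a (hypothetical) ratings file on HDFS.
conf = SparkConf().setAppName("RatingsHistogram")
sc = SparkContext(conf=conf)

lines = sc.textFile("hdfs:///user/maria_dev/ml-100k/u.data")  # hypothetical path
ratings = lines.map(lambda line: (line.split()[2], 1))        # transformation: map to (rating, 1)
counts = ratings.reduceByKey(lambda a, b: a + b)              # transformation: sum values per key

# Nothing has executed yet - collect() is the action that triggers the job.
for rating, count in sorted(counts.collect()):
    print(rating, count)
```

You would launch this with spark-submit rather than running it as a plain Python script.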

Spark SQL - DataFrames and DataSets

  • Spark is going more towards using structured data

  • DataFrames extend RDDs

    • Contain Row objects
    • Can run SQL queries
    • Has a schema (leading to more efficient storage)
    • Read and write to JSON, Hive, parquet
    • Communicates with JDBC/ODBC, Tableau
  • Example DataFrame methods:

    • myResultDataFrame.show()
    • myResultDataFrame.select("someFieldName")
    • myResultDataFrame.filter(myResultDataFrame("someFieldName") > 200)
    • myResultDataFrame.groupBy(myResultDataFrame("someFieldName")).mean()
    • myResultDataFrame.rdd().map(mapperFunction)
  • In Spark 2.0 a DataFrame is really a DataSet of Row objects

  • DataSets can wrap known, typed data too - but this is mostly transparent in Python since Python is dynamically typed

    • Not a big issue with Python, but the "Spark 2.0 way" is to use DataSets instead of DataFrames when you can
  • DataSets are easily transportable between the various Spark 2.0 libraries

  • In Spark 2.0 you don't create a SparkContext, but rather a SparkSession, which starts a SparkContext and a SQLContext simultaneously
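
A minimal Spark 2.0 sketch of the DataFrame/DataSet workflow described above, using a SparkSession. The JSON file name and column names are assumptions for illustration.

```python
from pyspark.sql import SparkSession

# One SparkSession replaces the separate SparkContext + SQLContext.
spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

df = spark.read.json("people.json")   # DataFrame with an inferred schema
df.show()
df.select("name").show()
df.filter(df["age"] > 21).show()
df.groupBy("age").count().show()

# You can also register the DataFrame and fall back to plain SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 21").show()

spark.stop()
```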

Using Relational Data Stores with Hadoop

Hive: Distributing SQL Queries with Hadoop

  • Hive translates SQL queries to MapReduce or Tez jobs on your cluster

Why Hive?

  • Uses familiar SQL syntax
  • Interactive
  • Scalable - works with "big data" on a cluster
    • Most appropriate for data warehouse applications
  • Easy OLAP (Online Analytical Processing) queries - much easier than writing MapReduce in Java
  • Highly optimized
  • Highly extensible
    • UDFs
    • Thrift server
    • JDBC/ODBC driver
  • Good for online analytics across a large dataset, not the best option for real-time transactions

Why not Hive?

  • High latency - not appropriate for OLTP (Online Transaction Processing)
  • Stores data de-normalized (not a true relational DB under the hood)
  • SQL is limited in what it can do
    • Pig, Spark can do more complex operations
  • No transactions
  • No record-level updates, inserts, deletes

HiveQL

  • Pretty much MySQL with some extensions
    • For example: views
      • Can store results of a query into a view, which subsequent queries can use as a table
      • Not exactly the same as a materialized view, a view in Hive doesn't store a copy of the data anywhere
  • Allows you to specify how structured data is stored and partitioned

How Hive Works

Schema On Read

  • A traditional relational database uses Schema On Write - you define the schema before loading the data, and it is enforced as data is written to disk. Hive flips that: it takes unstructured data and applies a schema to it as it's being read
  • The actual data is just stored on a text file somewhere, all Hive does is impart structure to it as it's being read
  • Hive maintains a "metastore" that imparts a structure you define on unstructured data that is stored on HDFS, etc.

Hive Data Loading Commands

  • LOAD DATA: moves data from a distributed filesystem into Hive
    • Assumption is that the data is large since it's coming from a distributed system & making a copy would be a poor use of space
  • LOAD DATA LOCAL: copies data from your local filesystem into Hive
    • Assumption is that the data isn't too massive since it's coming from your local filesystem so it will make a copy & preserve the data on your local system
  • Managed vs External tables
    • Managed tables are completely owned by Hive, i.e. if you drop a managed table in Hive, that data is gone
    • External tables are able to be shared outside of Hive with other systems - when you create an external table you specify a location where the data actually lives & Hive doesn't take full ownership of it

Partitioning

  • If you have a massive dataset, you can store your data in partitioned subdirectories
    • Huge optimization if your queries are only on certain partitions

Ways to Use Hive

  • Interactive via hive> prompt/CLI
  • Saved query files
  • Through Ambari/Hue
  • Through JDBC/ODBC server
  • Through Thrift service
    • But Hive is not suitable for OLTP, HBase would be preferred for this
  • Via Oozie
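
Beyond the interfaces listed above, Spark can also act as a Hive client if it was built with Hive support. A minimal sketch, assuming a hypothetical ratings table already defined in the Hive metastore:

```python
from pyspark.sql import SparkSession

# Query Hive from Spark; assumes Spark was built with Hive support and a
# hypothetical "ratings" table exists in the metastore.
spark = (SparkSession.builder
         .appName("HiveExample")
         .enableHiveSupport()
         .getOrCreate())

# HiveQL runs as-is; Hive translates it to MapReduce or Tez jobs under the hood.
top_movies = spark.sql("""
    SELECT movieID, COUNT(*) AS rating_count
    FROM ratings
    GROUP BY movieID
    ORDER BY rating_count DESC
    LIMIT 10
""")
top_movies.show()
```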

Integrating MySQL & Hadoop

  • MySQL is a popular, free relational database that is not distributed
  • However it can be used for OLTP, so exporting data into MySQL can be useful
  • Existing data may exist in MySQL that you want to import into Hadoop
  • Sqoop (SQL + Hadoop) allows you to interact between SQL & Hadoop
    • It kicks off a MapReduce job to handle importing/exporting data
    • Mappers take data from SQL & import into HDFS
    • Spinning up the jobs can take some time so it's best used for large datasets, not optimal for small jobs
  • You can import data from SQL directly into Hive without going to HDFS first
  • You can do incremental imports to keep your relational database and Hadoop in sync
  • You can also export data directly from Hive into SQL
    • The target table must already exist in SQL with columns in expected order

Using Non-Relational Data Stores with Hadoop

Why NoSQL?

  • Relational databases are great for analytic work
  • However they may not be the best option for running queries across a massive data set very quickly
  • NoSQL databases give up the rich query language of SQL for the ability to run simple queries at great scale
  • These systems are built to scale horizontally forever and be very fast and very resilient
  • Scaling up SQL DBs to massive loads requires extreme measures

HBase

  • Non-relational, scalable database built on top of HDFS
  • Based on Google's BigTable
  • No query language, just an API
  • Available operations (CRUD)
    • Create
    • Read
    • Update
    • Delete
  • It can do these simple operations really quickly and at a massive scale
  • HBase automatically partitions your data on top of HDFS (Auto-sharding) into what are called Region Servers
  • HBase Data Model
    • Fast access to any given ROW
    • A ROW is referenced by a unique KEY
    • Each ROW has some small number of COLUMN FAMILIES
    • A COLUMN FAMILY may contain arbitrary COLUMNS
    • You can have a very large number of COLUMNS in a COLUMN FAMILY
    • Each CELL can have many VERSIONS with given timestamps
    • Sparse data is OK - missing columns in a row consume no storage
  • Ways to access HBase:
    • HBase shell
    • Java API
      • Wrappers for Python, Scala, etc.
    • Spark, Hive, Pig
      • Will typically use these services to manipulate/transform data then dump the results into HBase
    • REST service
    • Thrift service
    • Avro service
  • Integrating Pig with HBase
    • Must create HBase table ahead of time
    • Your relation must have a unique key as its first column, followed by subsequent columns as you want them saved in HBase
    • USING clause allows you to STORE into an HBase table
    • Can work at scale - HBase is transactional on rows
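
A minimal CRUD sketch against HBase using the third-party happybase Python client, which talks to HBase's Thrift service; the host, table name, and column family below are assumptions for illustration.

```python
import happybase

# Connect through HBase's Thrift server (must be running).
connection = happybase.Connection('localhost')

# One column family ("rating") that can hold an arbitrary number of columns.
if b'user_ratings' not in connection.tables():
    connection.create_table('user_ratings', {'rating': dict()})

table = connection.table('user_ratings')

# Create/Update: the row key is the user ID, one column per movie rated.
table.put(b'user:1', {b'rating:movie50': b'5', b'rating:movie172': b'4'})

# Read: fast access to a single row by its key.
print(table.row(b'user:1'))

# Delete a single column from the row.
table.delete(b'user:1', columns=[b'rating:movie172'])
```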

Cassandra

  • A distributed database with no single point of failure
  • Unlike HBase, there is no master node at all - every node runs exactly the same software and performs the same functions
  • Data model is similar to BigTable/HBase
  • It's non-relational, but has a limited CQL query language as its interface
  • Built for high availability, massive transaction rates, and high scalability
  • The CAP Theorem states that you can only have 2 out of 3: Consistency, Availability, and Partition-tolerance
    • Partition-tolerance is a requirement with "big data" so you really only get to choose between consistency and availability
  • Cassandra favors availability over consistency
    • It is "eventually consistent"
    • But you can specify your consistency requirements as part of your requests - so really it's "tunable consistency"
  • Cassandra is great for fast access to rows of information
  • Get the best of both worlds - replicate Cassandra to another ring that is used for analytics and Spark integration
  • Cassandra's API is CQL, which makes Cassandra look like an ordinary database driver to applications
  • CQL is like SQL but with some big limitations
    • No joins
      • Data must be de-normalized
      • So, it is still non-relational
    • All queries must be on some primary key
      • Secondary indices are supported, but primary keys will give you the best performance
  • CQLSH (CQL Shell) can be used on the command line to create tables, etc.
  • All tables must be in a keyspace - keyspaces are like databases
  • DataStax offers a Spark-Cassandra connector
    • Allows you to read and write Cassandra tables as DataFrames
    • Is smart about passing queries on those DataFrames to the appropriate level
  • Use cases:
    • Use Spark for analytics on data stored in Cassandra
    • Use Spark to transform data and store it into Cassandra for transactional use
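
A minimal CQL sketch using the DataStax Python driver (cassandra-driver); the keyspace, table, and replication settings are assumptions for illustration.

```python
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS movielens
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace('movielens')
session.execute("""
    CREATE TABLE IF NOT EXISTS users (
        user_id int PRIMARY KEY, age int, occupation text
    )
""")

# Writes and reads both go through CQL; queries must hit the primary key.
session.execute("INSERT INTO users (user_id, age, occupation) VALUES (%s, %s, %s)",
                (1, 33, 'engineer'))
row = session.execute("SELECT * FROM users WHERE user_id = %s", (1,)).one()
print(row.user_id, row.age, row.occupation)

cluster.shutdown()
```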

MongoDB

  • Popular in the corporate world
  • Name comes from managing HuMONGOus data
  • Unlike Cassandra, MongoDB favors consistency over availability
  • Document based data model
    • Looks like JSON
  • No real schema is enforced
    • You can have different fields in every document if you want to
    • No single "key" as in other databases
      • But you can create indices on any fields you want, or even combinations of fields
      • If you want to "shard," then you must do so on some index
    • Results in a lot of flexibility
  • MongoDB terminology
    • Databases
    • Collections
    • Documents
  • MongoDB architecture: Replication Sets
    • Single master
    • Maintains backup copies of your database instance
      • Secondaries can elect a new primary in seconds if your primary goes down
      • But make sure your operation log is long enough to give you time to recover the primary when it comes back
  • Replication Set Quirks
    • A majority of the servers in your set must agree on the primary
      • Even numbers of servers (2, 4, 6, etc.) don't work well
    • If you don't want to spend money on 3 servers you can set up an 'arbiter' node (but only one)
    • Apps must know about enough servers in the replica set to be able to reach one to learn who's primary
    • Replicas only address durability, not your ability to scale
      • Unless you can take advantage of reading from secondaries, which generally isn't recommended
      • And your DB will still go into read-only mode for a bit while a new primary is elected
    • Delayed secondaries can be set up as insurance against people doing dumb things
  • Sharding
    • Ranges of some indexed value you specify are assigned to different replica sets
  • Sharding Quirks
    • Auto-sharding sometimes doesn't work
      • Split storms, mongos processes restarted too often
    • You must have 3 config servers
      • And if any one goes down, your DB is down
      • This is on top of the single-master design of replica sets
    • MongoDB's loose document model can be at odds with effective sharding
  • MongoDB Positives
    • It's not just a NoSQL database - very flexible document model
    • Shell is a full JavaScript interpreter
    • Supports many indices
      • But only one can be used for sharding
      • More than 2-3 are still discouraged
      • Full text indices for text searches
      • Spatial indices
    • Built-in aggregation capabilities, MapReduce, GridFS
      • For some applications you might not need Hadoop at all
      • But MongoDB still integrates with Hadoop, Spark, and most languages
    • A SQL connector is available
      • But MongoDB still really isn't designed for joins and normalized data
  • MongoDB does not automatically index like Cassandra does; if you want an index you must create it yourself
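
A minimal pymongo sketch of the points above - a schema-less document model, a manually created index, and a simple query; the database, collection, and field names are assumptions for illustration.

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient('mongodb://localhost:27017/')
db = client['movielens']
users = db['users']

# Documents in the same collection do not need identical fields.
users.insert_one({'user_id': 1, 'name': 'Fred', 'occupation': 'engineer'})
users.insert_one({'user_id': 2, 'name': 'Wilma', 'hobbies': ['hive', 'pig']})

# Indexes are not created automatically - build one on the field you query by.
users.create_index([('user_id', ASCENDING)], unique=True)

print(users.find_one({'user_id': 1}))
print(users.count_documents({'occupation': 'engineer'}))
```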

Choosing a Database Technology

  • Integration Considerations
    • Different technologies have different connectors to other technologies
    • If you currently have a big job running in Spark, you'd probably want to use a database that works well with Spark
  • Scaling Requirements
    • Is your data going to grow unbounded over time?
    • What are your transaction rates?
    • High growth & high transaction rates would suggest distributed NoSQL DBs that can scale horizontally
  • Support Considerations
    • Do you have the expertise to spin up and configure the technology?
    • Does the technology offer paid, professional support?
  • Budget Considerations
    • Most technologies are open source installed on Linux machines, so the largest monetary considerations will be the cost of the servers and support
  • CAP Considerations
    • Need to pick 2 of 3 of: Consistency, Availability, & Partition-tolerance
    • These tradeoffs have become a little more loose recently
      • E.g. you can configure your Cassandra DB to be more consistent than it would be out of the box
  • Simplicity
    • Keep it simple - determine the minimum requirements for your business case

Some Examples

  • Building an internal phone directory app
    • Scale: limited
    • Consistency: eventual is fine
    • Availability requirements: not mission critical
    • MySQL (or another SQL based technology) is probably already installed within your organization
  • Want to mine web server logs for interesting patterns
    • What are the most popular times of day? What's the average session length, etc?
    • Not high transaction rates so importing data into HDFS & running a Spark job would suffice (unless you want to vend this data to a massive external audience)
  • Building a movie recommendation system
    • You have a big Spark job that produces movie recommendations for end users nightly
    • Something needs to vend this data to your web applications
    • You work for a huge company with massive scale
    • Downtime is not tolerated
    • Must be fast
    • Eventual consistency is OK
    • Cassandra would be a good choice here
  • Building a massive stock trading system
    • Consistency is more important than anything
    • "Big Data" is present
    • It's really, really important - so having access to professional support might be a good idea, and you have the budget to pay for it
    • MongoDB would be a good option

External Query Engines

Apache Drill

  • Apache Drill is a SQL query engine for a variety of non-relational databases and data files
    • Hive, MongoDB, HBase
    • Even flat JSON or Parquet files on HDFS, S3, Azure, Google cloud, local file system
  • Based on Google's Dremel
  • It's real SQL, not SQL-like
    • Based on SQL:2003 standard - any SQL query that you write can be run with Drill
    • And it has an ODBC/JDBC driver so other tools can connect to it just like any relational database
  • It's fast and pretty easy to set up
    • But there are still non-relational databases under the hood so joins will not be efficient
    • Allows SQL analysis of disparate data sources without having to transform and load it first
      • Internally data is represented as JSON and has no fixed schema
  • You can even do joins across different database technologies
    • Or even with flat JSON files
    • It is essentially SQL for your entire ecosystem

Apache Phoenix

  • Similar to Drill but it only works with HBase
  • A SQL driver for HBase that supports transactions
  • Fast, low-latency - OLTP support
  • Originally developed by Salesforce, then open-sourced
  • Exposes a JDBC connector for HBase
  • Supports secondary indices and user-defined functions
  • Integrates with MapReduce, Spark, Hive, Pig, and Flume
  • Why Phoenix?
    • It's really fast - you most likely won't pay a performance cost for having an extra layer on top of HBase
    • Why Phoenix and not Drill?
      • Should choose the right tool for the job - if you're only dealing with HBase and want a SQL interface, Phoenix would be a good choice
        • Developers working on Phoenix will likely pay closer attention to HBase optimization than those working on Drill
    • Why Phoenix & not HBase's native clients?
      • Apps, analysts, etc. may find SQL easier to work with
      • Phoenix can do the work of optimizing more complex queries for you
        • But remember, HBase is still fundamentally non-relational
  • Using Phoenix
    • Command line interface
    • Phoenix API for Java
    • JDBC Driver (thick client)
    • Phoenix Query Server (PQS) (thin client)
      • Intended to eventually enable non-JVM access
    • JARs for MapReduce, Hive, Pig, Flume, Spark

Presto

  • Distributing queries across different data stores
  • It's a lot like Drill
    • It can connect to many different "big data" databases and data stores at once and query across them
    • Familiar SQL syntax
    • Optimized for OLAP - analytical queries, data warehousing
  • Developed & still partially maintained by Facebook
  • Exposes JDBC, Command-Line & Tableau interfaces

Why Presto vs Drill?

  • It has a Cassandra connector
  • At Facebook: over 1,000 Facebook employees run more than 30,000 Presto queries daily that scan over a petabyte each day
  • A single Presto query can combine data from multiple sources, allowing for analytics across an entire organization
  • Presto allows for fast analytics without the need for expensive hardware or proprietary software

Presto Connectors

  • Cassandra
  • Hive
  • MongoDB
  • MySQL
  • Local files
  • Also: Kafka, JMX, PostgreSQL, Redis, Accumulo

Managing Your Cluster

YARN

  • Introduced in Hadoop 2
  • Separates problem of managing resources on your cluster from MapReduce
  • Enabled development of MapReduce alternatives (Spark, Tez) built on top of YARN
  • It mostly works under the hood, managing the usage of your cluster
    • You can write code against it, but there's no real need to these days with the other tools in the Hadoop ecosystem
  • YARN sits on top of HDFS & MapReduce/Spark/Tez sit on top of it
  • YARN is the compute layer for your cluster

How YARN Works

  • Your application talks to the Resource Manager to distribute work to your cluster
  • You can specify data locality - which HDFS block(s) do you want to process?
    • YARN will try to get your process on the same node that has your HDFS blocks
  • You can specify different scheduling options for applications
    • You can run more than one application at once on your cluster
    • Schedulers: FIFO, Capacity, and Fair
      • FIFO runs jobs in sequence, First In First Out
      • Capacity may run jobs in parallel if there's enough spare capacity
      • Fair may cut into a larger running job if you want to squeeze in a small one

Tez

  • Makes Hive, Pig, or MapReduce jobs run faster
  • It's an application framework that clients can code against as a replacement for MapReduce
  • Constructs Directed Acyclic Graphs (DAGs) for more efficient processing of distributed jobs
    • Relies on a more holistic view of your job & eliminates unnecessary steps and dependencies
  • Optimizes physical data flow and resource usage
  • All you need to do is tell Hive or Pig to use it (it probably does by default anyway)

Mesos

  • Another resource negotiator, not directly associated with Hadoop
  • Came out of UC Berkeley and was popularized at Twitter - it's a system that manages resources across your data centers
  • Not just for big data - it can allocate resources for web servers, small scripts, etc.
  • Meant to solve a more general problem than YARN - it's really a general container management system that manages more than just your Hadoop cluster
  • Mesos isn't really part of the Hadoop ecosystem, but it works well with it
    • Spark and Storm may run on Mesos instead of YARN
    • YARN may be integrated with Mesos using Myriad
      • So you don't necessarily need to partition your data center between your Hadoop cluster and everything else

Differences between Mesos and YARN

  • YARN is a monolithic scheduler - you give it a job and YARN figures out where to run it
  • Mesos is a two-tiered system
    • Mesos just makes offers of resources to your application (framework)
    • Your framework decides whether to accept or reject them
    • You also decide your own scheduling algorithm
  • YARN is optimized for long, analytical jobs like you see in Hadoop
  • Mesos is built to handle that, as well as long-lived processes (servers) and short-lived processes

How Mesos fits in

  • If you're looking for an architecture that you can code all of your organization's cluster applications against (not just Hadoop stuff) - Mesos can be really useful
    • Should also look at Kubernetes/Docker
  • If all you typically run is Spark and Storm, Mesos is an option
    • Although YARN is probably better, especially if all your data is on HDFS
    • Spark on Mesos is limited to one executor per slave (node)

When to use Mesos

  • If your organization as a whole has chosen to use Mesos to manage its computing resources in general
    • Then you can avoid partitioning off a Hadoop cluster by using Myriad
    • There is also a Hadoop on Mesos package for Cloudera that bypasses YARN completely
  • Otherwise, probably not

ZooKeeper

What is ZooKeeper?

  • It basically keeps track of information that must be synchronized across your cluster
    • Which node is the master?
    • What tasks are assigned to which workers?
    • Which workers are currently available?
  • It's a tool that applications can use to recover from partial failures on your cluster
  • An integral part of HBase, High-Availability (HA) MapReduce, Drill, Storm, Solr, and much more

Failure modes that ZooKeeper can help with

  • Master crashes - needs to fail over to a backup & ZooKeeper keeps track of who the new master is so there are not multiple masters
  • Worker crashes - its work needs to be redistributed
  • Network trouble - part of your cluster can't see the rest of it

ZooKeeper executes "primitive" operations in a distributed system

  • Master election
    • One node registers itself as a master and holds a "lock" on that data
    • Other nodes cannot become master until that lock is released
    • Only one node allowed to hold the lock at a time
  • Crash detection
    • "Ephemeral" data on a node's availability automatically goes away if the node disconnects, or fails to refresh itself after some time-out period
  • Group management
  • Metadata
    • List of outstanding tasks, task assignments
  • But ZooKeeper's API is not about these primitives
    • Instead they have built a more general purpose system that makes it easy for applications to implement them
    • Its API is really a small distributed file system
      • With strong consistency guarantees
      • Replace the concept of "file" with "znode" (the name that ZooKeeper gives to its nodes) and that is essentially what it is
    • ZooKeeper API: Create, delete, exists, setData, getData, getChildren

Notifications

  • A client can register for notifications on a znode
    • Avoids continuous polling
    • Example: register for notification on /master node - if it goes away, try to take over as the new master

Persistent and Ephemeral znodes

  • Persistent znodes remain stored until explicitly deleted
    • i.e. assignment of tasks to workers must persist even if master crashes
  • Ephemeral znodes go away if the client that created it crashes or loses its connection to ZooKeeper
    • i.e. if the master crashes, it should release its lock on the znode that indicates which node is the master
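
A minimal sketch of persistent vs ephemeral znodes and watches using the third-party kazoo Python client; the connection string and paths are assumptions for illustration.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

# Persistent znode: survives client crashes (e.g. task assignments).
zk.ensure_path('/tasks')
zk.create('/tasks/task-001', b'process logs', makepath=True)

# Ephemeral znode: disappears if this client dies - a crude master "lock".
if not zk.exists('/master'):
    zk.create('/master', b'worker-42', ephemeral=True)

# Watch: get notified instead of polling; fires when /master changes or goes away.
@zk.DataWatch('/master')
def on_master_change(data, stat):
    print('master is now:', data)

zk.stop()  # stopping the client removes its ephemeral znodes and watches
```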

ZooKeeper Architecture

  • Typically want to have more than one ZooKeeper server in case a server goes down
    • Called a ZooKeeper ensemble
  • ZooKeeper quorum - the number of servers that must agree for a process to continue
    • Recommended to have a minimum ZooKeeper ensemble of 5 with a quorum of 3
  • ZooKeeper has the same potential for consistency/availability issues that MongoDB has

Oozie

  • Orchestrates your Hadoop jobs
  • Burmese for "elephant keeper"
  • A system for running and scheduling Hadoop tasks

Workflows

  • A multi-stage Hadoop job
    • Might chain together MapReduce, Hive, Pig, sqoop, and distcp tasks
    • Other systems available via add-ons (like Spark)
  • A workflow is a DAG of actions
    • Specified via XML
    • So you can run actions that don't depend on each other in parallel

Steps to set up a workflow in Oozie

  • Make sure each action works on its own
  • Make a directory in HDFS for your job
  • Create your workflow.xml file and put it in your HDFS folder
  • Create job.properties defining any variables your workflow.xml needs
    • This goes on your local file system where you'll launch the job from
    • You could also set these properties within your XML

Oozie Coordinators

  • Schedules workflow execution
  • Launches workflows based on a given start time and frequency
  • Will also wait for required input data to become available
  • Run in exactly the same way as a workflow

Oozie Bundles

  • New in Oozie 3.0
  • A bundle is a collection of coordinators that can be managed together
  • Example: you may have a bunch of coordinators for processing log data in various ways
    • By grouping them in a bundle, you could suspend them all if there were some problem with the log collection

Zeppelin

  • A notebook interface for big data
  • Makes it easy for experimenting and visualizing data
  • Similar to iPython (Jupyter) notebooks
    • Lets you interactively run scripts/code against your data
    • Can interleave with nicely formatted notes
    • Can share notebooks with others on your cluster

Apache Spark Integration

  • Can run Spark code interactively (like you can on the Spark shell)
    • Speeds up development cycle
    • And allows easy experimentation and exploration of big data
  • Can execute SQL queries directly against SparkSQL
  • Query results may be visualized in charts and graphs
  • Makes Spark feel more like a data science tool
  • Zeppelin also integrates with many more tools than Spark

Feeding Data to Your Cluster

What is Streaming?

  • Up to this point, everything has been in reference to processing historical, existing big data
    • Sitting on HDFS
    • Sitting in a database
  • How does new data get into your cluster, especially if it's "Big Data"?
    • New log entries from web servers
    • New sensor data from IOT devices
    • New stock trades
  • Streaming lets you publish this data to your cluster in real time
    • And you can even process it in real time as it comes in

Two Problems:

  • How to get data from many different sources flowing into your cluster
  • Processing it when it gets there

Kafka

  • Kafka is a general-purpose publish/subscribe messaging system
  • Kafka servers store all incoming messages from publishers for some period of time, and publishes them to a stream of data called a topic
  • Kafka consumers subscribe to one or more topics, and receive data as it's published
  • A stream/topic can have many different consumers, all with their own position in the stream maintained
  • It's not only for Hadoop

How Kafka Scales

  • Kafka itself may be distributed among many processes on many servers
    • Will distribute the storage of stream data as well
  • Consumers may also be distributed
    • Consumers of the same group will have messages distributed amongst them
    • Consumers of different groups will get their own copy of each message
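
A minimal publish/subscribe sketch using the third-party kafka-python client; the broker address, topic, and consumer group are assumptions for illustration.

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish a message to a topic.
producer = KafkaProducer(bootstrap_servers='localhost:9092')
producer.send('weblogs', b'GET /index.html 200')
producer.flush()

# Consumers in the same group split the topic's messages between them;
# a different group gets its own full copy of the stream.
consumer = KafkaConsumer('weblogs',
                         bootstrap_servers='localhost:9092',
                         group_id='log-analyzers',
                         auto_offset_reset='earliest')
for message in consumer:
    print(message.topic, message.offset, message.value)
```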

Flume

  • Another way to stream data into your cluster
  • Made from the start with Hadoop in mind
    • Built-in sinks for HDFS and HBase
  • Originally made to handle log aggregation

Flume Agents

  • Source
    • Where data is coming from
    • Can optionally have Channel Selectors and Interceptors
  • Channel
    • How data is transferred (via memory or files)
  • Sink
    • Where data is going
    • Can be organized into sink groups
    • A sink can connect to only one channel
      • Channel is notified to delete a message once the sink processes it

Built-in Source Types

  • Spooling directory
  • Avro
  • Kafka
  • Exec
  • Thrift
  • Netcat
  • HTTP
  • Custom sources written in Java

Built-in Sink Types

  • HDFS
  • Hive
  • HBase
  • Avro
  • Thrift
  • Elasticsearch
  • Kafka
  • Custom sinks written in Java

Analyzing Streams of Data

Spark Streaming

Why Spark Streaming?

  • "Big Data" never stops
  • Analyze data streams in real time, instead of huge batch jobs daily
  • Analyzing streams of web log data to react to user behavior
  • Analyze streams of real-time sensor data for IoT applications

How it works: High Level

  • Data streams come in from whatever source(s) are connected
  • Spark cluster will have a number of receivers to receive the data
  • This data will be discretized by the Spark Streaming servers into RDDs based on a batch increment time
  • RDDs can then be output to other systems for processing and analysis
  • Not exactly real-time - data is processed in micro-batches; if you set the batch increment time to 1 second, it is essentially real-time from the user's perspective
  • This work can be distributed - processing of RDDs can happen in parallel on different worker nodes

DStreams (Discretized Streams)

  • Generates the RDDs for each time step and can produce output at each time step
  • Can be transformed and acted on in much the same way as RDDs
  • Or you can access their underlying RDDs if you need them
  • Actions are applied continuously over time as each new RDD comes in

Common stateless transformations on DStreams

  • Map
  • Flatmap
  • Filter
  • reduceByKey

Stateful data

  • You can also maintain a long-lived state on a DStream
    • For example: running totals, broken down by keys
    • Another example: aggregating session data in web activity

Windowed Transformations

  • Allow you to compute results across a longer time period than your batch interval
    • Example: top-sellers from the past hour
      • You might process data every second (the batch interval) but maintain a window of one hour
  • The window "slides" as time goes on to represent batches within the window interval

Batch interval vs Slide interval vs Window interval

  • The batch interval is how often data is captured into a DStream
  • The slide interval is how often a windowed transformation is computed
  • The window interval is how far back in time the windowed transformation goes
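
A minimal PySpark DStream sketch of these three intervals - 1-second batches, a 5-minute window, a 10-second slide; the socket source and durations are assumptions for illustration.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="WindowedWordCount")
ssc = StreamingContext(sc, 1)          # batch interval = 1 second
ssc.checkpoint("checkpoint")           # required for windowed/stateful operations

lines = ssc.socketTextStream("localhost", 9999)
pairs = lines.flatMap(lambda line: line.split()).map(lambda word: (word, 1))

# windowDuration = 300s, slideDuration = 10s; the second lambda "subtracts"
# old batches as they fall out of the window.
windowed_counts = pairs.reduceByKeyAndWindow(lambda a, b: a + b,
                                             lambda a, b: a - b,
                                             300, 10)
windowed_counts.pprint()

ssc.start()
ssc.awaitTermination()
```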

Structured Streaming

  • A new (as of 2017), higher-level API for streaming structured data
    • Available in Spark 2.0 & 2.1
  • Uses DataSets
    • Like a DataFrame but with more explicit type information
    • A DataFrame is really a DataSet[Row]
  • It essentially creates a DataFrame that continuously gets new DataSets appended to it
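
A minimal Structured Streaming sketch showing how the streaming query reads like ordinary DataFrame code; the socket source and port are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("StructuredStreamingExample").getOrCreate()

# An unbounded DataFrame with one "value" column; new rows keep getting appended.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

words = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
word_counts = words.groupBy("word").count()

# Continuously write the running aggregation to the console.
query = (word_counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```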

Advantages of Structured Streaming

  • Streaming code looks a lot like the equivalent non-streaming code
  • Structured data allows Spark to represent data more efficiently
  • SQL-style queries allow for query optimization opportunities & even better performance
  • Interoperability with other Spark components based on DataSets
    • MLLib is also moving toward DataSets as its primary API
  • DataSets in general are the direction Spark is moving

Apache Storm

  • Another framework for processing continuous streams of data on a cluster
    • Can run on top of YARN (like Spark)
  • Works on individual events, not micro-batches like Spark Streaming
    • If you need sub-second latency, Storm is the best solution

Storm Terminology

  • A stream consists of tuples that flow through spouts that are sources of stream data (Kafka for example)
  • Bolts process stream data as it's received
    • Transform, aggregate, write to databases/HDFS
  • A topology is a graph of spouts and bolts that process your stream

Storm Architecture

  • A Nimbus node sits at the highest level and acts as a sort of job tracker
  • The Nimbus node is connected to ZooKeeper nodes
  • The ZooKeeper nodes are connected to Supervisor nodes, which is where all the work is actually done

Developing Storm Applications

  • Usually done in Java
    • Although bolts may be directed through scripts in other languages
  • Storm Core
    • The lower-level API for Storm
    • "At-least-once" semantics
  • Trident
    • Higher level API for Storm that sits on top of Storm Core
    • "Exactly once" semantics
  • Storm runs your applications "forever" once submitted - until you explicitly stop them

Storm vs Spark Streaming

  • Spark streaming has the benefit of having the rest of Spark at your disposal
    • You can also write jobs in other languages such as Python or Scala, whereas Storm is mostly written in Java
  • But if you need truly real-time (sub-second) processing of events as they come in, Storm is the preferred tool
  • Core Storm offers "tumbling windows" in addition to "sliding windows"
  • Kafka + Storm is a popular combination

Flink

  • Name comes from the German for quick & nimble
  • Another stream processing engine - most similar to Storm
  • Can run on a standalone cluster or on top of YARN or Mesos
  • Highly scalable (1000s of nodes)
  • Fault-tolerant
    • Can survive failures while still guaranteeing exactly-once processing
    • Uses "state snapshots" to achieve this

Flink vs Spark Streaming vs Storm

  • Flink is faster than Storm (an order of magnitude faster)
  • Flink offers real-time streaming like Storm (but if you're using Trident with Storm, you're actually using micro-batches)
  • Flink offers a higher-level API like Trident or Spark, but while still doing real-time streaming
  • Flink has good Scala support like Spark Streaming
  • Flink has an ecosystem of its own like Spark
  • Flink can process data based on event times, not when data was received
    • Impressive windowing system
    • This, plus real-time streaming and exactly-once semantics is important for financial applications
  • But it's relatively young
  • All three seem to be converging on similar features and capabilities, so it becomes more a question of what fits best in your existing environment

Connectors

  • HDFS
  • Cassandra
  • Kafka
  • Others: Elasticsearch, NiFi, Redis, RabbitMQ

Other Technologies

Impala

  • Cloudera's alternative to Hive
  • Massively parallel SQL engine on Hadoop
  • Impala's always running, so you avoid the startup costs when starting a Hive query
    • Made for BI-style queries
  • Bottom line: Impala's often faster than Hive, but Hive offers more versatility
  • Consider using Impala instead of Hive if you're using Cloudera

Accumulo

  • Another BigTable clone (like HBase)
  • But offers a better security model
    • Cell-based access control
  • And server-side programming
  • Consider it for your NoSQL needs if you have complex security requirements
    • But make sure the systems that need to read this data can talk to it

Redis

  • A distributed in-memory data store (like memcache)
  • But it's more than a cache
  • Good support for storing data structures
  • Can persist data to disk
  • Can be used as a data store and not just a cache
  • Popular caching layer for web apps

Ignite

  • An "in-memory data fabric"
  • Think of it as an alternative to Redis
  • But it's closer to a database
    • ACID guarantees
    • SQL support
    • But it's all done in memory

Elasticsearch

  • A distributed document search and analytics engine
  • Really popular
    • Wikipedia, The Guardian, Stack Overflow
  • Can handle things like real-time search-as-you-type
  • When paired with Kibana, great for interactive exploration
  • Amazon offers an Elasticsearch Service

Kinesis (and the AWS ecosystem)

  • Amazon Kinesis is basically the AWS version of Kafka
  • Amazon has a whole ecosystem of its own
    • Elastic MapReduce (EMR)
    • S3 (distributed storage, similar to HDFS)
    • Elasticsearch Service / CloudSearch
    • DynamoDB (similar to Cassandra)
    • Amazon RDS (relational database similar to MySQL)
    • ElastiCache
    • AI / Machine Learning services
  • EMR in particular is an easy way to spin up a Hadoop cluster on demand

Apache NiFi

  • Directed graphs of data routing
    • Can connect to Kafka, HDFS, Hive
  • Web UI for designing complex systems
  • Often seen in the context of IoT sensors, and managing their data

Apache Falcon

  • A "data governance engine" that sits on top of Oozie
  • Included in Hortonworks
  • Like NiFi it allows construction of data processing graphs
  • But it's really meant to organize the flow of your data within Hadoop

Apache Slider

  • Deployment tool for apps on a YARN cluster
  • Allows monitoring of your apps
  • Allows growing or shrinking your deployment as it's running
  • Manages mixed configurations
  • Start/stop applications on your cluster

Designing Real World Systems

Hadoop Architecture Design

Working Backwards

  • Start with the end user's needs, not from where your data is coming from
    • Sometimes you need to meet in the middle
  • What sort of access patterns do you anticipate from your end users?
    • Analytical queries that span large date ranges?
    • Huge amounts of small transactions for very specific rows of data?
    • Both?
  • What availability do these users demand?
  • What consistency do these users demand?
  • The end user patterns are the most important aspect of designing these systems, you shouldn't start with a preferred technology and try to make it work for use cases that it's not designed for

Thinking about requirements

  • Just how big is your big data?
    • Do you really need a cluster?
  • How much internal infrastructure and expertise is available?
    • Should you use AWS or something similar?
    • Do systems you already know fit the bill?
  • What about data retention?
    • Do you need to keep data around forever? For auditing perhaps?
    • Or do you need to purge it often? For privacy?
  • What about security?
    • Legal concerns
  • Latency
    • How quickly do end users need to get a response?
      • Milliseconds? Then something like HBase or Cassandra will be needed
  • Timeliness
    • Can queries be based on day-old data? Minute-old?
      • Oozie scheduled jobs in Hive/Pig/Spark may cut it
    • Or must it be near real-time?
      • In that case, Spark Streaming/Storm/Flink with Kafka or Flume may be the answer

Future-proofing

  • Once you decide where to store your "big data," moving it will be really difficult later on
    • Think carefully before choosing proprietary solutions or cloud-based storage
  • Will business analysts want your data in addition to end users (or vice versa)?

Utilizing existing resources

  • Does your organization have existing components you can use?
    • Don't build a new data warehouse if you already have one
    • Rebuilding existing technology always has negative business value
  • What's the least amount of infrastructure you need to build?
    • Import existing data with Sqoop, etc. if you can
    • If relaxing a "requirement" saves a lot of time and money - at least ask