Hadoop Platform and Application

Hadoop Basics

Hadoop Ecosystem

Hadoop Stack Basics

What is Hadoop? (2005)

An open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware.

  • Scalability
  • Reliability

Keeps all data in raw format and applies a schema-on-read style.
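
A minimal schema-on-read sketch in Python, assuming PySpark is available; the path and field names are hypothetical. The schema is supplied when the raw files are read, not when they are written:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Schema is applied at read time; the files on disk stay in their raw format.
schema = StructType([
    StructField("user_id", LongType()),
    StructField("event", StringType()),
    StructField("ts", StringType()),
])

events = spark.read.schema(schema).json("/data/raw/events")  # hypothetical path
events.groupBy("event").count().show()
```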

The Apache Framework: Basic Modules

  1. Hadoop Common: libraries and utilities used by the other Hadoop modules
  2. Hadoop Distributed File System (HDFS): distributed file system that stores data
  3. Hadoop MapReduce: programming model for large-scale data processing
  4. Hadoop YARN: resource management platform responsible for managing compute resources in the cluster

Map Reduce Layer

(Image: the MapReduce layer in the Hadoop stack)

Hadoop Distributed File System (HDFS)

A distributed, scalable and portable file system written in Java for the Hadoop framework.

The Hadoop Zoo

Hadoop Ecosystem Major Components

Sqoop

Tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

HBASE

A key component of the Hadoop stack (data model sketched below)

  • Column-oriented database management system
  • Key-value store
  • Based on Google Bigtable
  • Can hold extremely large datasets
  • Dynamic data model
  • Not a relational DBMS
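
A conceptual sketch of the data model in Python (not the HBase client API): a sorted map from row key to column family to column qualifier to value.

```python
# Conceptual model only - HBase stores this as a sorted, distributed map.
table = {
    "user#1001": {                       # row key
        "info":    {"name": b"Alice",    # column family "info"
                    "city": b"Berlin"},
        "metrics": {"logins": b"42"},    # column family "metrics"
    },
    "user#1002": {
        "info": {"name": b"Bob"},        # rows can have different columns
    },
}

# A read addresses a cell by row key, column family and column qualifier.
print(table["user#1001"]["info"]["city"])  # b'Berlin'
```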

PIG

It's a scripting language

  • High-level programming on top of Hadoop MapReduce
  • Pig Latin
  • Data analysis problems as data flows
  • Originally developed at Yahoo in 2006

UDF: user-defined functions

Apache Hive

Data warehouse software that facilitates querying and managing large datasets residing in distributed storage.
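
Hive itself is queried with HiveQL via its CLI or JDBC; as a rough Python sketch of the same warehouse-style querying, Spark's Hive support can run SQL over a table in distributed storage (the page_views table is hypothetical):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("warehouse-query")
         .enableHiveSupport()   # read tables registered in the Hive metastore
         .getOrCreate())

daily_views = spark.sql("""
    SELECT dt, COUNT(*) AS views
    FROM page_views          -- hypothetical table
    GROUP BY dt
    ORDER BY dt
""")
daily_views.show()
```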

Oozie

  • Workflow scheduler system to manage Apache Hadoop jobs
  • Oozie Coordinator jobs
  • Supports MapReduce, Pig, Apache Hive and Sqoop

Zookeeper

  • Provides operational services for a Hadoop cluster
  • Centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services (see the sketch below)
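
A small sketch using the third-party kazoo client for Python (an assumption, not part of Hadoop); the ensemble address and znode path are hypothetical:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")  # hypothetical ensemble address
zk.start()

# Keep a small piece of shared configuration in a znode.
zk.ensure_path("/app/config")
zk.set("/app/config", b"feature_x=on")

data, stat = zk.get("/app/config")
print(data)  # b'feature_x=on'

zk.stop()
```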

Flume

  • Distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of log data.

Impala

Cloudera's open-source massively parallel processing (MPP) SQL query engine for data stored in a Hadoop cluster.

Spark

A fast and general engine for large-scale data processing

  • Multi-stage in-memory primitives provide performance up to 100 times faster for certain applications
  • Allows user programs to load data into a cluster's memory and query it repeatedly (see the sketch below)
  • Well-suited to machine learning
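
A minimal PySpark sketch of loading data into cluster memory and querying it repeatedly (the log path is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-and-requery").getOrCreate()

lines = spark.read.text("/data/logs/app.log").cache()  # keep in memory

# Both queries reuse the cached dataset instead of re-reading it from disk.
errors = lines.filter(lines.value.contains("ERROR")).count()
warnings = lines.filter(lines.value.contains("WARN")).count()
print(errors, warnings)
```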

Overview of Hadoop Stack

The Hadoop Distributed File System (HDFS) and HDFS2

Original HDFS Design Goals

  • Resilience
  • Scalability
  • Data locality for applications
  • Portability

By default, each HDFS block is replicated 3 times.

  • Single NameNode
  • Multiple DataNodes
    • Manage storage - blocks of data
    • Serve read/write requests from clients
    • Block creation, deletion, replication

HDFS in Hadoop 2

HDFS Federation

  • Increased namespace scalability
  • Performance
  • Isolation

How it works:

  • Multiple NameNode Servers
  • Multiple namespaces
  • Block pools

MapReduce Framework and YARN

  • Software framework for writing parallel data processing applications
  • A MapReduce job splits data into chunks
  • Map tasks process the data chunks
  • The framework sorts the map output
  • Reduce tasks use the sorted map output as input (see the word-count sketch below)
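
A word-count sketch of this flow in Python. A real job would submit the mapper and reducer as separate scripts through Hadoop Streaming; here the map, sort and reduce steps are simulated locally:

```python
import sys
from itertools import groupby

def mapper(lines):
    """Map: emit a (word, 1) pair for every word in the input chunk."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(sorted_pairs):
    """Reduce: pairs arrive sorted by key; sum the counts per word."""
    for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local simulation: map, then sort by key (the framework's shuffle/sort),
    # then reduce.
    for word, total in reducer(sorted(mapper(sys.stdin))):
        print(f"{word}\t{total}")
```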

Original MapReduce Framework

  • Single Master JobTracker
  • JobTracker schedules, monitors and re-executes failed tasks
  • One slave TaskTracker per cluster node
  • TaskTracker executes tasks per JobTracker requests

YARN - Next Generation of MapReduce

  • Separate resource management and job scheduling/monitoring
  • Global ResourceManager (RM)
  • NodeManager on each node
  • ApplicationMaster - one for each application

YARN

Additional features

  • High Availability ResourceManager
  • Timeline Server
  • Use of Cgroups
  • Secure Containers
  • YARN web services REST APIs (see the example below)
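
A hedged example of calling the ResourceManager's REST API from Python with the requests library; the hostname is hypothetical and 8088 is assumed to be the default RM web port:

```python
import requests

RM = "http://resourcemanager.example.com:8088"  # hypothetical host, default port

# Cluster-wide metrics and the currently running applications.
metrics = requests.get(f"{RM}/ws/v1/cluster/metrics").json()
apps = requests.get(f"{RM}/ws/v1/cluster/apps", params={"states": "RUNNING"}).json()

print("active nodes:", metrics["clusterMetrics"]["activeNodes"])
for app in (apps["apps"] or {}).get("app", []):
    print(app["id"], app["name"], f'{app["progress"]:.0f}%')
```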

The Apache Tez™ project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop Apache Hadoop YARN.

The Hadoop Execution Environment

DAG - Directed Acyclic Graph
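
A conceptual sketch (not the Tez API) of a DAG of tasks in Python: each task lists the tasks it depends on, and execution follows a topological order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical tasks: each maps to the set of tasks it depends on.
dag = {
    "load":      set(),
    "filter":    {"load"},
    "aggregate": {"filter"},
    "join":      {"load", "filter"},
    "write":     {"aggregate", "join"},
}

# Run every task only after all of its inputs have run.
for task in TopologicalSorter(dag).static_order():
    print("run", task)
```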

YARN, Tez, and Spark

Execution frameworks

Tez

  • Dataflow graphs
  • Custom data types
  • Can run complex DAG of tasks
  • Dynamic DAG changes
  • Resource usage efficiency

Spark

  • Advanced DAG execution engine
  • Supports cyclic data flow
  • In-memory computing
  • Java, Scala, Python, R
  • Existing optimized libraries

Hadoop Resource Scheduling

  • Resource management
  • Different kinds of scheduling algorithms
  • Types of parameters that can be controlled

Basic:

  • Default - First in First out (FIFO)
  • Fairshare
  • Capacity

Capacity Scheduler

  • Queues and sub-queues
  • Capacity Guarantee with elasticity
  • ACLs for security
  • Runtime changes/draining apps
  • Resource-based scheduling

Fairshare Scheduler

  • Balances out resource allocation among apps over time
  • Can organize into queues/sub-queues
  • Guarantee minimum shares
  • Limits per user/app
  • Weighted app priorities (see the sketch below)
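
A conceptual sketch of weighted fair sharing in Python (not the YARN scheduler implementation); the queue names and weights are hypothetical:

```python
def fair_shares(total_memory_mb, queue_weights):
    """Split cluster memory among queues in proportion to their weights."""
    total_weight = sum(queue_weights.values())
    return {queue: total_memory_mb * weight / total_weight
            for queue, weight in queue_weights.items()}

# "prod" has twice the weight of "dev", so it gets two thirds of the cluster.
print(fair_shares(120_000, {"prod": 2, "dev": 1}))
# {'prod': 80000.0, 'dev': 40000.0}
```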