Hadoop Platform and Application

Hadoop Basics

Hadoop Ecosystem

Hadoop Stack Basics

What is Hadoop? (2005)

An open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware.

  • Scalability
  • Reliability

Keeps all data in raw format and applies a schema-on-read style.
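
A minimal schema-on-read sketch in Python, assuming PySpark is available; the path and field names are hypothetical. The schema is supplied when the raw files are read, not when they are written:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Schema is applied at read time; the files on disk stay in their raw format.
schema = StructType([
    StructField("user_id", LongType()),
    StructField("event", StringType()),
    StructField("ts", StringType()),
])

events = spark.read.schema(schema).json("/data/raw/events")  # hypothetical path
events.groupBy("event").count().show()
```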

The Apache Framework: Basic Modules

  1. Hadoop Common: libraries and utilities used by the other Hadoop modules
  2. Hadoop Distributed File System (HDFS): distributed file system that stores data
  3. Hadoop MapReduce: programming model for large-scale data processing
  4. Hadoop YARN: resource management platform responsible for managing compute resources in the cluster

Map Reduce Layer

(Image: the MapReduce layer in the Hadoop stack)

Hadoop Distributed File System (HDFS)

A distributed, scalable and portable file system written in Java for the Hadoop framework.

The Hadoop Zoo

Hadoop Ecosystem Major Components

Sqoop

Tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

HBASE

A key component of the Hadoop stack (data model sketched below)

  • Column-oriented database management system
  • Key-value store
  • Based on Google Bigtable
  • Can hold extremely large datasets
  • Dynamic data model
  • Not a relational DBMS
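
A conceptual sketch of the data model in Python (not the HBase client API): a sorted map from row key to column family to column qualifier to value.

```python
# Conceptual model only - HBase stores this as a sorted, distributed map.
table = {
    "user#1001": {                       # row key
        "info":    {"name": b"Alice",    # column family "info"
                    "city": b"Berlin"},
        "metrics": {"logins": b"42"},    # column family "metrics"
    },
    "user#1002": {
        "info": {"name": b"Bob"},        # rows can have different columns
    },
}

# A read addresses a cell by row key, column family and column qualifier.
print(table["user#1001"]["info"]["city"])  # b'Berlin'
```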

PIG

It's a scripting language

  • High-level programming on top of Hadoop MapReduce
  • Pig Latin
  • Data analysis problems as data flows
  • Originally developed at Yahoo in 2006

UDF: user-defined functions

Apache Hive

Data warehouse software that facilitates querying and managing large datasets residing in distributed storage.
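
Hive itself is queried with HiveQL via its CLI or JDBC; as a rough Python sketch of the same warehouse-style querying, Spark's Hive support can run SQL over a table in distributed storage (the page_views table is hypothetical):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("warehouse-query")
         .enableHiveSupport()   # read tables registered in the Hive metastore
         .getOrCreate())

daily_views = spark.sql("""
    SELECT dt, COUNT(*) AS views
    FROM page_views          -- hypothetical table
    GROUP BY dt
    ORDER BY dt
""")
daily_views.show()
```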

Oozie

  • Workflow scheduler system to manage Apache Hadoop jobs
  • Oozie Coordinator jobs
  • Supports MapReduce, Pig, Apache Hive and Sqoop

Zookeeper

  • Provides operational services for a Hadoop cluster
  • Centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services (see the sketch below)
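
A small sketch using the third-party kazoo client for Python (an assumption, not part of Hadoop); the ensemble address and znode path are hypothetical:

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")  # hypothetical ensemble address
zk.start()

# Keep a small piece of shared configuration in a znode.
zk.ensure_path("/app/config")
zk.set("/app/config", b"feature_x=on")

data, stat = zk.get("/app/config")
print(data)  # b'feature_x=on'

zk.stop()
```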

Flume

  • Distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of log data.

Impala

Cloudera's open-source massively parallel processing (MPP) SQL query engine for data stored in a Hadoop cluster.

Spark

A fast and general engine for large-scale data processing

  • Multi-stage in-memory primitives provide performance up to 100 times faster for certain applications
  • Allows user programs to load data into a cluster's memory and query it repeatedly (see the sketch below)
  • Well-suited to machine learning
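
A minimal PySpark sketch of loading data into cluster memory and querying it repeatedly (the log path is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-and-requery").getOrCreate()

lines = spark.read.text("/data/logs/app.log").cache()  # keep in memory

# Both queries reuse the cached dataset instead of re-reading it from disk.
errors = lines.filter(lines.value.contains("ERROR")).count()
warnings = lines.filter(lines.value.contains("WARN")).count()
print(errors, warnings)
```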

Overview of Hadoop Stack

The Hadoop Distributed File System (HDFS) and HDFS2

Original HDFS Design Goals

  • Resilience
  • Scalability
  • Data locality for applications
  • Portability

By default, each HDFS block is replicated 3 times.

  • Single NameNode
  • Multiple DataNodes
    • Manage storage - blocks of data
    • Serve read/write requests from clients
    • Block creation, deletion, replication

HDFS in Hadoop 2

HDFS Federation

  • Increased namespace scalability
  • Performance
  • Isolation

How it works:

  • Multiple NameNode Servers
  • Multiple namespaces
  • Block pools

MapReduce Framework and YARN

  • Software framework for writing parallel data processing applications
  • A MapReduce job splits data into chunks
  • Map tasks process the data chunks
  • The framework sorts the map output
  • Reduce tasks use the sorted map output as input (see the word-count sketch below)
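
A word-count sketch of this flow in Python. A real job would submit the mapper and reducer as separate scripts through Hadoop Streaming; here the map, sort and reduce steps are simulated locally:

```python
import sys
from itertools import groupby

def mapper(lines):
    """Map: emit a (word, 1) pair for every word in the input chunk."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(sorted_pairs):
    """Reduce: pairs arrive sorted by key; sum the counts per word."""
    for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Local simulation: map, then sort by key (the framework's shuffle/sort),
    # then reduce.
    for word, total in reducer(sorted(mapper(sys.stdin))):
        print(f"{word}\t{total}")
```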

Original MapReduce Framework

  • Single Master JobTracker
  • JobTracker schedules, monitors and re-executes failed tasks
  • One slave TaskTracker per cluster node
  • TaskTracker executes tasks per JobTracker requests

YARN - Next Generation of MapReduce

  • Separate resource management and job scheduling/monitoring
  • Global ResourceManager (RM)
  • NodeManager on each node
  • ApplicationMaster - one for each application

YARN

Additional features

  • High Availability ResourceManager
  • Timeline Server
  • Use of Cgroups
  • Secure Containers
  • YARN web services REST APIs (see the example below)
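
A hedged example of calling the ResourceManager's REST API from Python with the requests library; the hostname is hypothetical and 8088 is assumed to be the default RM web port:

```python
import requests

RM = "http://resourcemanager.example.com:8088"  # hypothetical host, default port

# Cluster-wide metrics and the currently running applications.
metrics = requests.get(f"{RM}/ws/v1/cluster/metrics").json()
apps = requests.get(f"{RM}/ws/v1/cluster/apps", params={"states": "RUNNING"}).json()

print("active nodes:", metrics["clusterMetrics"]["activeNodes"])
for app in (apps["apps"] or {}).get("app", []):
    print(app["id"], app["name"], f'{app["progress"]:.0f}%')
```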

The Apache Tez™ project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop Apache Hadoop YARN.

The Hadoop Execution Environment

DAG - Directed Acyclic Graph
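
A conceptual sketch (not the Tez API) of a DAG of tasks in Python: each task lists the tasks it depends on, and execution follows a topological order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical tasks: each maps to the set of tasks it depends on.
dag = {
    "load":      set(),
    "filter":    {"load"},
    "aggregate": {"filter"},
    "join":      {"load", "filter"},
    "write":     {"aggregate", "join"},
}

# Run every task only after all of its inputs have run.
for task in TopologicalSorter(dag).static_order():
    print("run", task)
```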

YARN, Tez, and Spark

Execution frameworks

Tez

  • Dataflow graphs
  • Custom data types
  • Can run complex DAG of tasks
  • Dynamic DAG changes
  • Resource usage efficiency

Spark

  • Advanced DAG execution engine
  • Supports cyclic data flow
  • In-memory computing
  • Java, Scala, Python, R
  • Existing optimized libraries

Hadoop Resource Scheduling

  • Resource management
  • Different kinds of scheduling algorithms
  • Types of parameters that can be controlled

Basic:

  • Default - First in First out (FIFO)
  • Fairshare
  • Capacity

Capacity Scheduler

  • Queues and sub-queues
  • Capacity Guarantee with elasticity
  • ACLs for security
  • Runtime changes/draining apps
  • Resource-based scheduling

Fairshare Scheduler

  • Balances out resource allocation among apps over time
  • Can organize into queues/sub-queues
  • Guarantee minimum shares
  • Limits per user/app
  • Weighted app priorities (see the sketch below)
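
A conceptual sketch of weighted fair sharing in Python (not the YARN scheduler implementation); the queue names and weights are hypothetical:

```python
def fair_shares(total_memory_mb, queue_weights):
    """Split cluster memory among queues in proportion to their weights."""
    total_weight = sum(queue_weights.values())
    return {queue: total_memory_mb * weight / total_weight
            for queue, weight in queue_weights.items()}

# "prod" has twice the weight of "dev", so it gets two thirds of the cluster.
print(fair_shares(120_000, {"prod": 2, "dev": 1}))
# {'prod': 80000.0, 'dev': 40000.0}
```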