Hadoop Stack Basics
What is Hadoop (2005)
An open-source software framework for storage and large-scale processing of datasets on clusters of commodity hardware.
It keeps all data in its raw format and applies a schema only when the data is read (schema-on-read).
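The schema-on-read idea can be illustrated with a minimal sketch. The raw lines, the field names, and the helper `read_with_schema` are all hypothetical and chosen only for illustration; the point is that the stored text stays untouched and types are applied when it is read.

```python
import csv
import io

# Raw data is stored as-is; nothing about names or types is fixed at write time.
RAW_LINES = "2015-01-03,alice,42\n2015-01-04,bob,17\n"

# Hypothetical schema, chosen by the reader at query time.
SCHEMA = [("day", str), ("user", str), ("clicks", int)]

def read_with_schema(raw_text, schema):
    """Parse raw CSV text, applying field names and types only while reading."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        rows.append({name: cast(value) for (name, cast), value in zip(schema, fields)})
    return rows

records = read_with_schema(RAW_LINES, SCHEMA)
total_clicks = sum(r["clicks"] for r in records)
```

A different reader could apply a different schema to the same raw text without rewriting the stored data.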
The Apache Framework: Basic Modules
- Hadoop Common: libraries and utilities used by the other Hadoop modules
- Hadoop Distributed File System (HDFS): distributed file system that stores data
- Hadoop MapReduce: programming model for large-scale data processing
- Hadoop YARN: resource management platform responsible for managing compute resources in the cluster
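To make the MapReduce programming model more concrete, here is a small single-process sketch of a word count. It is not the Hadoop API: the function names `map_phase`, `shuffle`, and `reduce_phase` are assumptions made for illustration, and a real job would run the map and reduce steps in parallel across the cluster.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["the quick brown fox", "the lazy dog", "The end"])
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread over many machines.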
Map Reduce Layer
Hadoop Distributed File System (HDFS)
A distributed, scalable, and portable file system written in Java for the Hadoop framework.
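One way to picture what "distributed" means here is the sketch below, which cuts a file into fixed-size blocks and assigns each block to several nodes. The 128 MB block size and replication factor of 3 are assumed as commonly used defaults, the node names are hypothetical, and round-robin placement is a simplification (real HDFS placement is rack-aware).

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes); this is a toy policy, not HDFS's.
    """
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + i) % len(nodes)] for i in range(replication)
        ]
    return placement

MB = 1024 * 1024
blocks = split_into_blocks(300 * MB, 128 * MB)
layout = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"], replication=3)
```

Losing any single node still leaves two copies of every block, which is the reason for storing replicas on distinct machines.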
The Hadoop Zoo
Hadoop Ecosystem Major Components
Apache Sqoop
Tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Apache HBase
A key component of the Hadoop stack
- Column-oriented database management system
- Key-value store
- Based on Google Big Table
- Can hold extremely large data
- Dynamic data model
- Not a relational DBMS
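The data model described by these bullets (key-value access, column-oriented, dynamic columns) can be sketched as a toy in-memory store. The class `TinyColumnStore` and its methods are hypothetical and only mimic the shape of the model; they are not the HBase client API.

```python
from collections import defaultdict

class TinyColumnStore:
    """In-memory toy: row key -> "family:qualifier" -> value."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        # Columns are created on write; rows need not share the same columns.
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

    def scan(self, start_key, stop_key):
        """Yield rows whose keys fall in [start_key, stop_key), in sorted order."""
        for key in sorted(self._rows):
            if start_key <= key < stop_key:
                yield key, dict(self._rows[key])

store = TinyColumnStore()
store.put("user#001", "info:name", "Alice")
store.put("user#001", "info:email", "alice@example.com")
store.put("user#002", "info:name", "Bob")
store.put("user#002", "stats:logins", 7)  # a column user#001 does not have
```

Unlike a relational table, there is no fixed set of columns: each row carries only the columns that were written to it.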
Apache Pig
A scripting platform for data analysis
- High level programming on top of Hadoop MapReduce
- Scripts are written in a language called Pig Latin
- Data analysis problems as data flows
- Originally developed at Yahoo in 2006
- Supports user-defined functions (UDFs)
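Expressing "data analysis problems as data flows" means writing a sequence of named transformation steps, each feeding the next. The sketch below imitates that style in plain Python rather than Pig Latin; the `visits` relation and its fields are hypothetical, and the comments name the Pig Latin operators each step roughly corresponds to.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical input relation: (user, url, seconds_on_page)
visits = [
    ("alice", "/home", 12),
    ("bob", "/home", 3),
    ("alice", "/docs", 40),
    ("carol", "/docs", 25),
    ("bob", "/docs", 8),
]

# Step 1 (roughly FILTER): keep visits longer than 5 seconds.
long_visits = [v for v in visits if v[2] > 5]

# Step 2 (roughly GROUP): group the remaining visits by url.
# groupby needs its input sorted by the grouping key.
by_url = groupby(sorted(long_visits, key=itemgetter(1)), key=itemgetter(1))

# Step 3 (roughly FOREACH ... GENERATE): produce one output row per group.
visits_per_url = {url: len(list(rows)) for url, rows in by_url}
```

Each intermediate name holds the result of one step, which is what makes the script read as a flow of data rather than a single nested query.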
Apache Hive
Data warehouse software that facilitates querying and managing large datasets residing in distributed storage.
Apache Oozie
- Workflow scheduler system to manage Apache Hadoop jobs
- Oozie Coordinator jobs: recurring workflow jobs triggered by time and data availability
- Supports MapReduce, Pig, Apache Hive, and Sqoop jobs
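A workflow scheduler has to run each job only after the jobs it depends on have finished. The sketch below shows that idea with Python's standard `graphlib` module (Python 3.9+). It is not how Oozie workflows are written (those are defined in XML), and the action names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "import_with_sqoop": set(),
    "clean_with_pig": {"import_with_sqoop"},
    "aggregate_with_hive": {"clean_with_pig"},
    "export_with_sqoop": {"aggregate_with_hive"},
}

def run_workflow(dag, run_action):
    """Run every action only after all of its dependencies have run."""
    order = []
    for action in TopologicalSorter(dag).static_order():
        run_action(action)
        order.append(action)
    return order

# run_action is a no-op here; a real scheduler would submit a cluster job.
executed = run_workflow(workflow, run_action=lambda name: None)
```

A coordinator would wrap a workflow like this one and trigger it on a schedule or when its input data becomes available.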
Apache ZooKeeper
- Provides operational services for a Hadoop cluster
- Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
Apache Flume
- Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
Cloudera Impala
Cloudera's open-source massively parallel processing (MPP) SQL query engine for data stored in a Hadoop cluster
Apache Spark
A fast and general engine for large-scale data processing
- Multi-stage in-memory primitives can make certain applications up to 100 times faster than Hadoop MapReduce
- Allows user programs to load data into a cluster's memory and query it repeatedly
- Well-suited to machine learning
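The benefit of loading data into memory once and querying it repeatedly can be sketched in a few lines. This is a single-process toy, not the Spark API: `CachedDataset` and `slow_loader` are hypothetical names, and the loader stands in for an expensive read from distributed storage.

```python
class CachedDataset:
    """Toy dataset that loads its records once and keeps them in memory."""

    def __init__(self, loader):
        self._loader = loader
        self._records = None
        self.load_count = 0

    def records(self):
        if self._records is None:  # first access: pay the load cost once
            self._records = list(self._loader())
            self.load_count += 1
        return self._records

    def query(self, predicate):
        """Filter the in-memory records; no reload happens after the first call."""
        return [r for r in self.records() if predicate(r)]

def slow_loader():
    # Stand-in for reading from distributed storage.
    return range(1, 11)

data = CachedDataset(slow_loader)
evens = data.query(lambda x: x % 2 == 0)
large = data.query(lambda x: x > 7)
```

Iterative machine learning algorithms pass over the same data many times, so avoiding a reload on every pass is where most of the gain comes from.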