Learning Apache Hadoop Notes

Comprehensive Overview of Hadoop Ecosystem Components with Cloud Service Equivalents

Here's a concise table summarizing the key Hadoop ecosystem components along with their cloud service equivalents:

| Component | Purpose | Created by | Language Support | Limitations | Alternatives | Fit | GCP Service | AWS Service | Azure Service |
|---|---|---|---|---|---|---|---|---|---|
| Apache Hive | SQL-like data querying in Hadoop | Facebook | HiveQL | High latency for some queries | Presto | Batch processing | Dataproc | Amazon EMR | HDInsight |
| Apache Pig | Data transformations with high-level scripting | Yahoo | Pig Latin | Steeper learning curve | Hive, Spark | Data flow management | Dataproc | Amazon EMR | HDInsight |
| Apache Oozie | Manages and schedules Hadoop jobs | Yahoo | XML | Complex setup | Apache Airflow | Job scheduling | Composer (Airflow) | AWS Step Functions | Logic Apps |
| Hue | Web interface for Hadoop | Cloudera | GUI for HiveQL, Pig Latin | Dependent on Hadoop's performance | Command-line tools, third-party platforms | User interface | GCP console, Dataproc UI | AWS Management Console, AWS Glue | Azure portal, HDInsight apps |
| Apache HBase | Real-time read/write access on HDFS | Powerset | Java, REST, Avro, Thrift APIs | Complexity in management | Cassandra | Real-time querying | Bigtable | Amazon DynamoDB | Cosmos DB |
| Presto | SQL query engine for big data analytics | Facebook | SQL | Requires substantial memory for large datasets | Hive | Analytic queries | BigQuery | Amazon Athena | Synapse Analytics |
| Apache Sqoop | Bulk data transfer between Hadoop and databases | Cloudera | Command-line interface | Limited to simple SQL transformations | Apache Kafka | Data import/export | Dataflow | AWS Data Pipeline, AWS Glue | Data Factory |

This table encapsulates each component's essential details and the corresponding cloud services from Google Cloud Platform (GCP), Amazon Web Services (AWS), and Microsoft Azure as a quick reference. The sections below expand on each component.

Apache Hive:

  • Purpose: Enables SQL-like data querying and management within Hadoop.
  • Created by: Facebook, 2007.
  • Languages: HiveQL.
  • Limitations: High latency for some queries.
  • Alternatives: Presto for faster querying.
  • Fit: Suitable for batch processing frameworks like MapReduce and Spark.
  • Cloud Services:
    • GCP: Dataproc
    • AWS: Amazon EMR
    • Azure: HDInsight
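
To make the SQL-like querying concrete, here is a minimal sketch of running a HiveQL query from Java through HiveServer2's JDBC interface. The host, database, table, and credentials are placeholders rather than values from these notes, and the Hive JDBC driver must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Explicit driver registration (only needed for older hive-jdbc versions).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // HiveServer2 JDBC URL; 10000 is the default HiveServer2 port.
        String url = "jdbc:hive2://hive-server.example.com:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
             Statement stmt = conn.createStatement();
             // HiveQL looks like SQL but compiles to batch jobs, hence the higher latency.
             ResultSet rs = stmt.executeQuery(
                 "SELECT category, COUNT(*) AS cnt FROM sales GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString("category") + " -> " + rs.getLong("cnt"));
            }
        }
    }
}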

Apache Pig:

  • Purpose: Facilitates complex data transformations with a high-level scripting language.
  • Created by: Yahoo, 2006.
  • Languages: Pig Latin.
  • Limitations: Steeper learning curve.
  • Alternatives: Hive for SQL-like querying, Spark for in-memory processing.
  • Fit: Effective for data flow management in batch processes.
  • Cloud Services:
    • GCP: Dataproc
    • AWS: Amazon EMR
    • Azure: HDInsight
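
As a sketch of how Pig Latin expresses a data flow, the snippet below embeds a small script in Java through PigServer. The input path and field layout are assumptions for illustration only, and the Pig libraries must be available; in practice most Pig Latin is run as scripts via the pig command.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // ExecType.MAPREDUCE runs on the cluster; ExecType.LOCAL is handy for testing.
        PigServer pig = new PigServer(ExecType.LOCAL);
        // Pig Latin describes a data flow: load, group, then aggregate.
        pig.registerQuery("logs = LOAD 'input/access_log' USING PigStorage('\\t') "
                + "AS (userid:chararray, url:chararray, bytes:long);");
        pig.registerQuery("by_user = GROUP logs BY userid;");
        pig.registerQuery("totals = FOREACH by_user GENERATE group, SUM(logs.bytes);");
        pig.store("totals", "output/bytes_per_user");
        pig.shutdown();
    }
}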

Apache Oozie:

  • Purpose: Manages and schedules Hadoop jobs in workflows.
  • Created by: Yahoo, 2008.
  • Languages: XML.
  • Limitations: Complex setup.
  • Alternatives: Apache Airflow for more flexible scripting.
  • Fit: Integrates with Hadoop components for job scheduling.
  • Cloud Services:
    • GCP: Composer (managed Airflow)
    • AWS: AWS Step Functions
    • Azure: Logic Apps

Hue (Hadoop User Experience):

  • Purpose: Simplifies user interactions with Hadoop through a web interface.
  • Created by: Cloudera, 2009.
  • Languages: Supports HiveQL, Pig Latin via GUI.
  • Limitations: Dependent on Hadoop’s performance.
  • Alternatives: Command-line tools, third-party platforms.
  • Fit: Useful for non-command-line users.
  • Cloud Services:
    • GCP: GCP console and Dataproc jobs UI
    • AWS: AWS management console and AWS Glue
    • Azure: Azure portal and HDInsight applications

Apache HBase:

  • Purpose: Provides real-time read/write access to large datasets on HDFS.
  • Created by: Powerset (acquired by Microsoft), 2007.
  • Languages: Java, REST, Avro, Thrift APIs.
  • Limitations: Complexity in management.
  • Alternatives: Cassandra for easier scaling.
  • Fit: Ideal for real-time querying on large datasets.
  • Cloud Services:
    • GCP: Bigtable
    • AWS: Amazon DynamoDB
    • Azure: Cosmos DB
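
The sketch below shows the real-time read/write path with the HBase Java client. The table and column family names are placeholders, and the table is assumed to already exist.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("user_profiles"))) {
            // Write: rows are keyed, and columns live inside a column family.
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Austin"));
            table.put(put);
            // Read the row back with low latency (no MapReduce job involved).
            Result result = table.get(new Get(Bytes.toBytes("user-42")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"))));
        }
    }
}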

Presto:

  • Purpose: High-performance, distributed SQL query engine for big data analytics.
  • Created by: Facebook, 2012.
  • Languages: SQL.
  • Limitations: Requires substantial memory for large datasets.
  • Alternatives: Hive for Hadoop-specific environments.
  • Fit: Best for interactive analytic queries across multiple data sources.
  • Cloud Services:
    • GCP: BigQuery
    • AWS: Amazon Athena
    • Azure: Synapse Analytics

Apache Sqoop:

  • Purpose: Transfers bulk data between Hadoop and relational databases.
  • Created by: Cloudera, 2009.
  • Languages: Command-line interface.
  • Limitations: Limited to simple SQL transformations.
  • Alternatives: Apache Kafka for ongoing data ingestion.
  • Fit: Effective for batch imports and exports between HDFS and structured databases.
  • Cloud Services:
    • GCP: Dataflow
    • AWS: AWS Data Pipeline or AWS Glue
    • Azure: Data Factory

This overview provides a comprehensive look at each component's role, limitations, and the cloud services available for each, ensuring you can match the right tools to your specific cloud environment and data processing needs.

Hadoop

For an in-depth understanding and explanation of the Hadoop concepts and the interview practice questions, let’s break down each topic, including diagrams and examples where applicable.

1. Hadoop Distributed File System (HDFS)

Architecture:

  • NameNode: The master node that manages the file system namespace and controls access to files by clients. It maintains the file system tree and the metadata for all files and directories in the tree. This metadata is stored in memory for fast access.
  • DataNode: These nodes store the actual data in HDFS. A typical file is split into blocks (default size is 128MB in Hadoop 2.x), and these blocks are stored in a set of DataNodes.
  • Secondary NameNode: Periodically merges the NameNode's namespace image with the edit log (checkpointing) so the edit log does not grow without bound. It is not a hot standby, but its most recent checkpoint can help rebuild the namespace if the NameNode's metadata is lost.

Fault Tolerance:

  • Replication: Data blocks are replicated across multiple DataNodes based on the replication factor (default is 3). This ensures that even if one DataNode fails, two other copies of the data block remain available.
  • Heartbeats and Re-replication: DataNodes send heartbeats to the NameNode; if a DataNode fails to send a heartbeat for a specified amount of time, it is considered failed, and the blocks it hosted are replicated to other nodes.

Data Locality:

  • Optimizing data processing by moving computation to the data rather than the data to the computation. MapReduce tasks are scheduled on nodes where data blocks are located to reduce network congestion and increase processing speed.

Diagram for HDFS Architecture:

graph TD;
    NN[NameNode] -- Manages--> DN1[DataNode 1]
    NN -- Manages--> DN2[DataNode 2]
    NN -- Manages--> DN3[DataNode 3]
    SNN[Secondary NameNode] -- Syncs with--> NN
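
To show how clients interact with this architecture, here is a minimal sketch using the Hadoop FileSystem API: the NameNode resolves paths and block locations, while the bytes themselves stream to and from DataNodes. The NameNode address and file paths are placeholders.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/data/notes.txt");
            // Write a file; it is split into blocks and replicated across DataNodes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }
            // Override the default replication factor (3) for this one file.
            fs.setReplication(path, (short) 2);
            // Read it back and copy the contents to stdout.
            try (FSDataInputStream in = fs.open(path)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}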

2. MapReduce

Concept and Workflow:

  • Map Phase: Processes input data (usually in key-value pair format) to produce intermediate key-value pairs.
  • Shuffle and Sort: Intermediate data from mappers are shuffled (transferred) to reducers, and during this process, data is sorted by key.
  • Reduce Phase: Aggregates and processes intermediate key-value pairs to produce the final output.

Real-World Use Cases:

  • Large-scale data processing tasks like log analysis, word count, and data summarization.
  • Batch processing of vast amounts of structured and unstructured data.

Performance Optimization:

  • Combiner: A mini-reducer that performs local aggregation on the map output, which can significantly reduce the data transferred across the network (the WordCount sketch after this list registers its reducer as a combiner).
  • Partitioner: Custom partitioning of data can be used to distribute the workload evenly across reducers.
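
The classic WordCount job below sketches the full map, shuffle/sort, and reduce flow, with the reducer reused as a combiner for map-side aggregation; input and output paths come from the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts that the shuffle grouped and sorted by word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Submitted with hadoop jar, the job writes one (word, count) line per distinct word to the output directory.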

3. YARN (Yet Another Resource Negotiator)

Role in Hadoop:

  • Separates the resource management capabilities of Hadoop from the data processing components, providing more flexibility in data processing approaches and improving cluster utilization.

Components:

  • ResourceManager (RM): Manages the allocation of computing resources in the cluster, scheduling applications.
  • NodeManager (NM): The per-machine framework agent responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting it to the ResourceManager.
  • ApplicationMaster (AM): Per-application component that negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor tasks.

Diagram for YARN Architecture:

graph TD;
    RM[ResourceManager] -- Allocates resources --> AM[ApplicationMaster]
    NM[NodeManager] -- Manages --> Containers
    AM -- Negotiates with --> RM
    AM -- Communicates with --> NM
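
The sketch below uses the YARN client API to ask the ResourceManager for node and application reports, mirroring the roles in the diagram above. It assumes a yarn-site.xml pointing at the cluster is on the classpath.

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnStatusExample {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration()); // reads yarn-site.xml from the classpath
        yarn.start();

        // NodeManagers currently registered with the ResourceManager.
        List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " capacity=" + node.getCapability());
        }

        // Applications (each with its own ApplicationMaster) known to the ResourceManager.
        List<ApplicationReport> apps = yarn.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + " state=" + app.getYarnApplicationState());
        }

        yarn.stop();
    }
}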

Practice Questions:

How does Hadoop ensure data reliability and fault tolerance?

  • Answer: Hadoop ensures data reliability through its replication policy in HDFS, where data is stored on multiple DataNodes. Fault tolerance is achieved by automatic handling of failures, such as reassigning tasks from failed nodes to healthy ones and re-replicating data from nodes that are no longer accessible.

Compare and contrast HBase and Hive. When would you use one over the other?

  • HBase: A non-relational, NoSQL database that runs on top of HDFS and is used for real-time read/write access. Best suited for scenarios requiring high throughput and low latency data access, such as user profile management in web applications.
  • Hive: A data warehouse infrastructure built on top of Hadoop, providing a SQL-like interface to query data stored in HDFS. Ideal for data discovery, large-scale data mining, and batch processing with complex queries.
  • Use Case Decision: Use HBase for real-time querying and data updates; use Hive for complex queries over large datasets where response time is not critical.

When discussing real-world scenarios where Hadoop and Google Cloud Platform (GCP) can be synergistically integrated, it's essential to consider both the migration of traditional Hadoop environments to the cloud and the optimization of hybrid setups. Here, I'll provide a couple of scenarios and an architecture diagram to illustrate these integrations using GCP services.

Scenario 1: Migrating a Traditional Hadoop Setup to GCP

A common scenario is moving a traditional on-premise Hadoop setup to GCP to leverage cloud scalability, reliability, and managed services.

Use Case: A retail company using Hadoop for customer behavior analytics wants to migrate to GCP to handle increasing data volumes and incorporate advanced machine learning capabilities.

Steps:

  1. Assessment and Planning: Evaluate the existing Hadoop architecture, data volumes, and specific use cases.
  2. Data Migration: Use Google Cloud Storage (GCS) as a durable and scalable replacement for HDFS. Transfer data from HDFS to GCS using services like the Storage Transfer Service or gsutil (see the copy sketch after these steps).
  3. Compute Migration: Replace on-premise Hadoop clusters with Google Cloud Dataproc, which allows running Hadoop and Spark jobs at scale. It manages the deployment of clusters and integrates easily with other Google Cloud services.
  4. Advanced Analytics Integration: Utilize Google BigQuery for running SQL-like queries on large datasets and integrate Google Cloud AI and Machine Learning services to enhance analytical capabilities beyond traditional MapReduce jobs.
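
As a minimal sketch of step 2, the snippet below copies a directory from HDFS to GCS through the Hadoop FileSystem API. It assumes the Cloud Storage connector (gcs-connector) is on the classpath so that gs:// paths resolve; the NameNode address, bucket, and paths are placeholders, and at scale DistCp or the Storage Transfer Service would normally be used instead.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class HdfsToGcsCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path src = new Path("hdfs://namenode.example.com:8020/warehouse/events");
        Path dst = new Path("gs://example-analytics-bucket/warehouse/events");

        FileSystem srcFs = src.getFileSystem(conf);
        FileSystem dstFs = dst.getFileSystem(conf);

        // Recursively copy the directory; 'false' keeps the HDFS source in place.
        boolean ok = FileUtil.copy(srcFs, src, dstFs, dst, false, conf);
        System.out.println(ok ? "copy finished" : "copy failed");
    }
}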

Scenario 2: Optimizing a Hybrid Hadoop Environment

Some organizations prefer to keep some data on-premise for compliance or latency reasons while still leveraging cloud benefits.

Use Case: A financial institution uses on-premise Hadoop for sensitive financial data but wants to utilize cloud resources for less sensitive computational tasks.

Steps:

  1. Hybrid Connectivity: Establish a secure connection between the on-premise data center and GCP using Cloud VPN or Cloud Interconnect for seamless data exchange.
  2. Data Management: Use Google Cloud Storage for less sensitive data while keeping sensitive data on-premise. Employ Google Cloud’s Storage Transfer Service for periodic synchronization between on-premise HDFS and GCS.
  3. Distributed Processing: Use Cloud Dataproc for additional processing power during peak loads, connecting to both on-premise HDFS and GCS for data access.
  4. Monitoring and Management: Utilize Google Cloud Operations (formerly Stackdriver) for monitoring resources both on-premise and in the cloud.

Architecture Diagram

Here's a basic architecture diagram using Mermaid to visualize the hybrid Hadoop setup described in Scenario 2:

graph TB
    subgraph On-Premise
    HDFS[Hadoop HDFS] -- Data Sync --> VPN[Cloud VPN/Interconnect]
    end

    subgraph GCP
    VPN -- Secure Connection --> GCS[Google Cloud Storage]
    GCS --> Dataproc[Cloud Dataproc]
    Dataproc --> BigQuery[Google BigQuery]
    Dataproc --> AI[Google Cloud AI/ML]
    GCS --> Transfer[Storage Transfer Service]
    end

    style On-Premise fill:#f9f,stroke:#333,stroke-width:2px
    style GCP fill:#ccf,stroke:#333,stroke-width:2px

Discussion Points for Interview:

  • Benefits: Discuss the scalability, cost-effectiveness, and enhanced capabilities provided by GCP, such as machine learning integration and advanced analytics.
  • Challenges: Address potential challenges such as data security, network latency, and compliance issues when moving data between on-premise and cloud environments.
  • Best Practices: Highlight best practices for cloud migration, including incremental data migration, thorough testing before full-scale deployment, and continuous monitoring for performance and security.

These examples and the architecture diagram can help illustrate your understanding of integrating Hadoop with GCP in various operational scenarios during your interview.
