@descico
Last active September 4, 2017 09:28

Hadoop Lecture

  • What is "Hadoop"?
    • Distributed Compute Engine
    • Distributed Storage
  • Draw a picture of Hadoop and its ecosystem.
    • HDFS
    • YARN
    • MapReduce
    • Tez
    • Spark
    • Hive
    • HBase
    • ...
    • (depends on the audience's projects/environment)
  • Shallow Dive (Not Deep Dive)
    • HDFS
      • NameNode and DataNode
      • Replication, not RAID
      • Replicas also enable data locality
    • YARN
      • ResourceManager and NodeManager
      • Queue and Scheduler
      • Process path
        • Show the page from gihyo's Hadoop series.
    • Hive
      • HiveServer2, Hive Metastore
      • Where is data of Hive tables?
      • Partition
      • File formats: plain text and ORC
      • Stats, Optimizer, Vectorization
    • HBase
      • Master, RegionServer and ZooKeeper
      • Master is not used for usual data access.
      • (TBD)
    • Spark
      • RDD is ...
      • DataFrame
      • Spark SQL
      • Memory management (executor-memory and overhead)
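
The executor-memory and overhead item under Spark can be made concrete with a little arithmetic. The sketch below assumes Spark running on YARN with the documented default overhead rule (10% of executor memory, with a 384 MiB minimum). The function name `executor_container_mb` and the `yarn_increment_mb` parameter are hypothetical, introduced here for illustration only. The rounding step approximates how a YARN scheduler normalizes container requests, and the exact increment depends on the cluster's scheduler configuration.

```python
import math

# Spark's documented defaults for executor memory overhead on YARN.
MIN_OVERHEAD_MB = 384
OVERHEAD_FACTOR = 0.10


def executor_container_mb(executor_memory_mb, overhead_mb=None, yarn_increment_mb=1):
    """Estimate the YARN container size requested for one Spark executor.

    executor_memory_mb: the value passed as --executor-memory, in MiB.
    overhead_mb: explicit overhead; if None, the default rule is applied.
    yarn_increment_mb: hypothetical stand-in for the scheduler's allocation
        increment; requests are rounded up to a multiple of this value.
    """
    if overhead_mb is None:
        overhead_mb = max(MIN_OVERHEAD_MB, int(executor_memory_mb * OVERHEAD_FACTOR))
    requested = executor_memory_mb + overhead_mb
    # Round up to the next multiple of the allocation increment.
    return math.ceil(requested / yarn_increment_mb) * yarn_increment_mb


if __name__ == "__main__":
    for heap in (1024, 4096, 8192):
        print(heap, "MiB heap ->", executor_container_mb(heap, yarn_increment_mb=1024), "MiB container")
```

The point for the lecture is that the container YARN accounts for is larger than the `--executor-memory` value, so an executor that looks like it fits on a node may not once overhead and rounding are included.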