HBaseCon 2015 Notes
HBase 1.0
~1500 JIRAs
API standardization
Read availability via region replicas
Automatic tuning of global memstore and block cache sizes
Compressed blocks in block cache.
1.0.x and 1.1.x will have monthly releases.
Rolling upgrades supported across minor 1.x versions. Not guaranteed between 1.x and 2.x.
Upgrade path from 0.98.x:
Full client compatibility moving from 0.98.x to 1.x (but moving to the new client APIs is recommended)
Wire compatibility (a 0.98 client can talk to a 1.0 server). Rolling upgrade to 1.0 is possible.
1.0 can read 0.98 snapshots
Recommended:
Hadoop 2.5 or 2.6 (recommended), ZooKeeper 3.4 (won't run well on 3.3)
Support for JDK 7 and 8 (8 not as well tested as 7). JDK 6 is gone!
1.0 API
Connections are now user-managed
RegionLocator is a lightweight object for looking up region locations more flexibly
Autoflush removed, replaced with BufferedMutator (easier to use). Intended for many small writes. Buffered mutations are lost if the process dies. See the sketch below.
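A minimal sketch of the 1.0-style client API described above (the table, family, and qualifier names are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class NewApiSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Connections are created and closed by the caller now.
    try (Connection connection = ConnectionFactory.createConnection(conf)) {
      TableName table = TableName.valueOf("test_table");   // hypothetical table

      // Lightweight region lookup via RegionLocator.
      try (RegionLocator locator = connection.getRegionLocator(table)) {
        HRegionLocation loc = locator.getRegionLocation(Bytes.toBytes("row-1"));
        System.out.println("row-1 lives on " + loc.getHostname());
      }

      // BufferedMutator replaces autoflush=false: many small writes, batched client-side.
      // Anything still buffered is lost if this process dies before flush()/close().
      try (BufferedMutator mutator = connection.getBufferedMutator(table)) {
        for (int i = 0; i < 1000; i++) {
          Put put = new Put(Bytes.toBytes("row-" + i));
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(i));
          mutator.mutate(put);
        }
        mutator.flush();
      }
    }
  }
}
```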
Pinterest
Runs fully on AWS-provisioned hardware: hi1.4xlarge instances (moving to i2)
Running 0.94.x, planning to move to 1.x (with no downtime)
Using CMS + ParNew GC; hit JVM object promotion failures
Went off-heap for the block cache (BucketCache)
Ensured memory headroom for the memstore
Moved GC and Hadoop daemon logging under /var/log/ to a symlink to the first ephemeral disk
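A hedged sketch of that off-heap BucketCache setup; Pinterest's actual sizes were not given, so the values below are illustrative only, and these settings normally go in hbase-site.xml with -XX:MaxDirectMemorySize set on the RegionServer JVM:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class OffHeapBlockCacheSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Keep the block cache off the Java heap to shrink the heap and reduce
    // promotion-failure-style GC pauses.
    conf.set("hbase.bucketcache.ioengine", "offheap");
    // BucketCache capacity in MB (illustrative 16 GB, not Pinterest's number).
    conf.setInt("hbase.bucketcache.size", 16 * 1024);
    System.out.println("ioengine = " + conf.get("hbase.bucketcache.ioengine"));
  }
}
```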
Monitoring:
OpenTSDB for all metrics. Use opentsdb-r to mine metrics with R
Nagios-based alerting
Hotspot monitoring: tcpdump is useful (tcpdump -i eth0 -w - -s 0 tcp port 60020 | strings)
Capacity planning
salted keys for uniform distribution
migrated to dedicated cluster
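The "salted keys" item above refers to the common pattern of prefixing row keys with a small hash-derived salt; a minimal sketch (the bucket count and key format are made up):

```java
import java.util.Arrays;

public class SaltedKeySketch {
  // Hypothetical number of salt buckets; roughly the number of regions
  // you want sequential writes spread across.
  static final int BUCKETS = 16;

  // Prefix the row key with a one-byte salt derived from its hash so that
  // monotonically increasing keys are distributed uniformly across buckets.
  static byte[] salt(byte[] rowKey) {
    int bucket = (Arrays.hashCode(rowKey) & 0x7fffffff) % BUCKETS;
    byte[] salted = new byte[rowKey.length + 1];
    salted[0] = (byte) bucket;
    System.arraycopy(rowKey, 0, salted, 1, rowKey.length);
    return salted;
  }

  public static void main(String[] args) {
    byte[] key = "event-000001".getBytes();
    System.out.println("bucket for event-000001: " + salt(key)[0]);
  }
}
```

The usual trade-off: scans over a key range then have to fan out across all buckets.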
Disaster recovery:
ExportSnapshot to a backup HDFS cluster with throttling; distcp WAL logs hourly to HDFS. Clone the snapshot + WALPlayer to recover (sketch after the commands below).
hbasehbackup -t unique_index -d new_cluster -d 10
hbaserecover -t unique_index -ts DATE_TIME --source-cluster zennotifications
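A hedged sketch of that backup/restore recipe using the stock HBase tools (ExportSnapshot, clone snapshot, WALPlayer); the snapshot/table names and paths are invented, and the hbasebackup/hbaserecover commands above are Pinterest-internal wrappers not reproduced here:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class RestoreFromBackupSketch {
  public static void main(String[] args) throws Exception {
    // Backup side (run from the shell, shown here as comments):
    //   hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    //       -snapshot unique_index_snap -copy-to hdfs://backup-nn:8020/hbase -bandwidth 50
    //   plus an hourly distcp of the WAL directory to the backup cluster.

    // Recovery side: clone the exported snapshot into a new table ...
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      admin.cloneSnapshot("unique_index_snap", TableName.valueOf("unique_index_restored"));
    }

    // ... then replay the distcp'd WALs into it to catch up past the snapshot point:
    //   hbase org.apache.hadoop.hbase.mapreduce.WALPlayer \
    //       hdfs://backup-nn:8020/wal-archive unique_index unique_index_restored
  }
}
```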
HBase Perf Tuning (Lars)
HDFS Correctness
dfs.datanode.synconclose=true
mount ext4 with dirsync or use XFS
dfs.datanode.sync.behind.writes=true (syncs partial blocks to disk, best effort)
short-circuit read buffer size = 128K (allocated per open block reader, so keep it small)
distribute data across disks
dfs.block.size = 256M
ipc.tcpnodelay=true
datanode.max.xceivers=8192
datanode handler count = 8
RS settings
blockingStoreFiles=10 (small for read, large for write)
compactionThreshold=3 (small for read, large for write)
memstore flush size = 128M
majorcompaction jitter = 0.5
memstore.block.multiplier (block updates if a single memstore grows past this multiple of the flush size)
Don't use HDFS balancer
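For reference, a sketch spelling out the full property keys behind the shorthand in the two lists above (values copied from the notes; in practice these live in hdfs-site.xml / hbase-site.xml rather than being set in code):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class TuningKeysSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();

    // HDFS side (hdfs-site.xml)
    conf.setBoolean("dfs.datanode.synconclose", true);          // sync blocks on close
    conf.setBoolean("dfs.datanode.sync.behind.writes", true);   // best-effort sync of partial blocks
    conf.setLong("dfs.blocksize", 256L * 1024 * 1024);          // 256 MB blocks
    conf.setBoolean("ipc.server.tcpnodelay", true);             // disable Nagle on RPC
    conf.setInt("dfs.datanode.max.transfer.threads", 8192);     // a.k.a. "xceivers"
    conf.setInt("dfs.datanode.handler.count", 8);

    // RegionServer side (hbase-site.xml)
    conf.setInt("hbase.hstore.blockingStoreFiles", 10);         // small for reads, large for writes
    conf.setInt("hbase.hstore.compactionThreshold", 3);         // small for reads, large for writes
    conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);
    conf.setFloat("hbase.hregion.majorcompaction.jitter", 0.5f);
    // hbase.hregion.memstore.block.multiplier: how far a single memstore may grow
    // past the flush size before updates block (left at its default here).

    System.out.println(conf.get("hbase.hstore.blockingStoreFiles"));
  }
}
```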
Compression:
Block encoding: DATA_BLOCK_ENCODING => FAST_DIFF (good for scans)
Compression => SNAPPY
block.data.cachecompressed=true
HFile block size
BLOCK_SIZE=64K (increase for large scans)
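A sketch of applying the column-family settings above (FAST_DIFF block encoding, Snappy compression, 64 KB blocks) at table-creation time with the 1.0 admin API; the table and family names are made up:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

public class CreateTunedTable {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      HColumnDescriptor cf = new HColumnDescriptor("d")          // hypothetical family
          .setDataBlockEncoding(DataBlockEncoding.FAST_DIFF)     // good for scans
          .setCompressionType(Compression.Algorithm.SNAPPY)
          .setBlocksize(64 * 1024);                              // increase for large scans

      HTableDescriptor table = new HTableDescriptor(TableName.valueOf("scan_heavy"));
      table.addFamily(cf);
      admin.createTable(table);
    }
  }
}
```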
GC
Memstore and block cache data are relatively long-lived
suggest -Xmn512m (small eden space)
G1 not properly tested yet; the Java 1.8 implementation looks very promising.
RS Disk/JVM heap ratio
disk/heap ratio = (region size / memstore flush size) * replication factor * heap fraction for memstores * 2
With typical defaults (10 GB regions, 128 MB memstore flush size, replication factor 3, 40% of heap for memstores): 10240/128 * 3 * 0.4 * 2 = 192, i.e. 192 bytes on disk need 1 byte of heap
So 32 GB of heap can barely fill a 6 TB disk (32 GB * 192 ≈ 6 TB)
hbase.regionserver.maxlogs: HDFS block size * 0.95 * maxlogs should be > 0.4 * Java heap (the global memstore limit), so full WAL queues don't force premature flushes
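A small worked example of both rules of thumb, using the assumed default values from the ratio calculation above (the maxlogs value is hypothetical):

```java
public class HeapDiskRatioSketch {
  public static void main(String[] args) {
    // Assumed typical values (not stated verbatim in the talk notes): 10 GB regions,
    // 128 MB memstore flush size, 3x HDFS replication, 40% of heap for memstores.
    double regionSizeMb = 10 * 1024;
    double memstoreFlushMb = 128;
    double replicationFactor = 3;
    double heapFractionForMemstores = 0.4;

    // Bytes of disk that one byte of RegionServer heap can "carry".
    double diskPerHeapByte =
        (regionSizeMb / memstoreFlushMb) * replicationFactor * heapFractionForMemstores * 2;
    System.out.printf("disk/heap ratio: %.0f%n", diskPerHeapByte);            // ~192

    double heapGb = 32;
    System.out.printf("32 GB heap serves roughly %.1f TB of disk%n",
        heapGb * diskPerHeapByte / 1024);                                      // ~6 TB

    // WAL sizing: maxlogs * (0.95 * HDFS block size) should exceed the global
    // memstore limit (0.4 * heap) so a full WAL queue doesn't force early flushes.
    double hdfsBlockMb = 256;
    int maxlogs = 55;                                                          // hypothetical
    double walCapacityGb = maxlogs * 0.95 * hdfsBlockMb / 1024;
    System.out.printf("WAL capacity %.1f GB vs memstore limit %.1f GB%n",
        walCapacityGb, heapGb * 0.4);
  }
}
```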
HTrace
Trace span: metadata about work being done (timeline, id, description)
Sampling
Diagnose performance; swimlane views of calls
Clients (HBase, Hadoop), span receivers, htraced receiver
Spark on HBase
Long lived jobs ( spark streaming )
Fewer HDFS syncs in DAG workflows
Trying to address the MR-on-HBase use case.
newAPIHadoopRDD - TableInputFormat
saveAsNewAPIHadoopFile - TableOutputFormat
Table snapshots work out of the box
RDD-to-RegionServer locality needs better shuffling support upstream
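A minimal Java Spark sketch of the newAPIHadoopRDD + TableInputFormat read path mentioned above (the table name is hypothetical); writes go the analogous way through TableOutputFormat:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkScanSketch {
  public static void main(String[] args) {
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("hbase-scan-sketch"));

    Configuration hbaseConf = HBaseConfiguration.create();
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "events");   // hypothetical table

    // Each HBase region becomes one RDD partition; rows come back as Result objects.
    JavaPairRDD<ImmutableBytesWritable, Result> rows = sc.newAPIHadoopRDD(
        hbaseConf, TableInputFormat.class,
        ImmutableBytesWritable.class, Result.class);

    System.out.println("row count: " + rows.count());
    sc.stop();
  }
}
```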
Operator's Panel
Multi vs. Single Tenancy (Single Large vs. Multiple Tiny clusters)
Bloomberg/Google: Better to have fewer large cluster(s) than many small/isolated clusters
Flipboard, Dropbox: multiple small clusters
Facebook: Single Sharded cluster
Monitoring:
Bloomberg, Dropbox: JMX metrics, Hannibal, log scraping/mining, Graphite, 15s granularity event aggregation
Google, Facebook: TSDB-like internal system
Median and mean metrics hide important events (statistics lie!); high-percentile numbers are important; also look at minimums
Capacity Planning:
Over-provisioning is better
Mostly disk constraints and IOPS. HBase almost never blocks on CPU/memory.
Upgrade Path
Try to upgrade frequently
H/W Sizing
Think about ratios of resources (memory, CPU, disk)
FB, Dropbox, Google: we don't do specialized hardware sizing; it's mostly workload-specific. Homogenizing hardware helps.
Cloud vs. Bare Metal
noisy neighbor problem (network is always shared). Cloud can be painful as a result.
HBase with Marathon/Mesos
Node level:
Containers used to statically bundle everything
Marathon "runs" apps on Mesos; REST-based; handles deployment of Docker containers
Docker/Mesos deployment
Trivial integration, no code changes
Scale up and down based on load; react to traffic spikes
Smaller heaps, less GC-induced latency
HBase with Slider/YARN: Deathstar
Usually separate clusters for HBase vs. other workloads
Problem with separate clusters: want a common HDFS; network usage is non-uniform across apps
HBase managed using YARN + Slider
Provisioning Model
Test apps on test cluster
Once matured, move to the prod cluster
Common HDFS layer
Cluster setup managed via JSON config
Key Challenges:
RM failures mean HBase cluster downtime; no built-in HA. Solution: auto restarts
Slider had issues acknowledging container allocations
Zombie RegionServers
Long-running services are not exactly YARN's target use case. Logging is an unsolved problem. Solution: store logs locally (looking at ELK)
Rolling restarts for config changes not possible currently. SLIDER-226.
Data locality. Solution: allocate new RegionServers on DataNodes holding the other replicas.