HBaseCon 2015 Notes
HBase 1.0
1500 JIRAs
API standardization
Read availability using region replicas
Automatic tuning of global memstore and block cache sizes
Compressed blocks in block cache
1.0.x and 1.1.x will have monthly releases.
Rolling upgrades supported between minor 1.x versions; not guaranteed between 1.x and 2.x.
Upgrade path from 0.98.x:
Full client compatibility moving from 0.98.x to 1.x (but recommended to move to the new client APIs)
Wire compatibility (a 0.98 client can talk to a 1.0 server). Can do a rolling upgrade to 1.0.
1.0 can read 0.98 snapshots.
Recommended:
Hadoop 2.5 or 2.6, ZooKeeper 3.4 (won't run well on 3.3)
Support for JDK 7 and 8 (8 not as well tested as 7). JDK 6 is gone!
1.0 API
Connections are now user-managed.
RegionLocator is a lightweight object for talking to regions more flexibly.
Removed autoflush; replaced with BufferedMutator, which is easier to use. Use it when writes are small and many. Buffered mutations are lost if the process dies. (See the sketch below.)
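A minimal sketch of the 1.0-style client calls described above; the table, family, and qualifier names are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class NewApiSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Connections are created and closed by the caller in 1.0.
        try (Connection connection = ConnectionFactory.createConnection(conf)) {
            TableName table = TableName.valueOf("events"); // hypothetical table
            // Lightweight handle for region lookups.
            try (RegionLocator locator = connection.getRegionLocator(table)) {
                System.out.println(locator.getAllRegionLocations());
            }
            // BufferedMutator replaces autoflush-off HTable for many small writes.
            try (BufferedMutator mutator = connection.getBufferedMutator(table)) {
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
                mutator.mutate(put); // buffered client-side; lost if the process dies
                mutator.flush();     // explicit flush forces the RPC
            }
        }
    }
}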
On fully AWS-provisioned hardware: hi1.4xlarge (moving to i2)
Running 0.94.x, planning to move to 1.x (with no downtime)
Using CMS + ParNew GC. Hit JVM object promotion failures.
Went off-heap for the block cache (BucketCache).
Ensured memory space for the memstore.
Moved GC and daemon /var/log/ logging to a symlink onto the first ephemeral disk.
Monitoring:
TSDB for all metrics. Use opentsdb-R to mine metrics with R.
Nagios-based alerting
Hotspot monitoring: tcpdump is useful (tcpdump -i eth0 -w - -s 0 tcp port 60020 | strings)
Capacity planning
Salted keys for uniform distribution
Migrated to a dedicated cluster
Disaster recovery:
ExportSnapshot to an HDFS cluster with a bandwidth throttle; distcp WAL logs hourly to HDFS. Clone snapshot + WALPlayer to recover. (Sketch below.)
hbasebackup -t unique_index -d new_cluster -d 10
hbaserecover -t unique_index -ts DATE_TIME --source-cluster zennotifications
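The hbasebackup/hbaserecover commands above appear to be in-house wrappers. A hedged sketch of driving the stock tools they likely wrap (ExportSnapshot with a bandwidth cap, then WALPlayer to roll forward); the snapshot name, WAL path, and backup NameNode URL are all hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.WALPlayer;
import org.apache.hadoop.hbase.snapshot.ExportSnapshot;
import org.apache.hadoop.util.ToolRunner;

public class DrSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Ship a snapshot to the backup HDFS cluster with a bandwidth cap (MB/s).
        ToolRunner.run(conf, new ExportSnapshot(), new String[] {
            "-snapshot", "unique_index-20150508",      // hypothetical snapshot name
            "-copy-to", "hdfs://backup-nn:8020/hbase", // hypothetical backup cluster
            "-bandwidth", "10"});
        // Replay the distcp'd WALs onto the cloned table to roll forward.
        ToolRunner.run(conf, new WALPlayer(), new String[] {
            "/backup/wals/2015-05-08",                 // hypothetical WAL directory
            "unique_index"});
    }
}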
HBase Perf Tuning (Lars)
HDFS correctness
dfs.datanode.synconclose=true
Mount ext4 with dirsync, or use XFS
Syncs partial blocks to disk - best effort
dfs.datanode.sync.behind.writes=true
Short-circuit read buffer size = 128M
Distribute data across disks
dfs.block.size = 256M
ipc.tcpnodelay=true
datanode.max.xceivers=8192
Datanode handler count = 8
RS settings (full property names in the sketch after this list)
blockingStoreFiles = 10 (small for reads, large for writes)
compactionThreshold = 3 (small for reads, large for writes)
Memstore flush size = 128M
Major compaction jitter = 0.5
memstore.block.multiplier (allows a single memstore to grow by this multiplier)
Don't use the HDFS balancer
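For reference, a sketch spelling out the full hbase-site.xml keys that the shorthand above most likely refers to. These normally live in hbase-site.xml; setting them programmatically here is only for illustration, and the multiplier value is an assumption:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class RsTuningSketch {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        conf.setInt("hbase.hstore.blockingStoreFiles", 10);  // block writes past this many storefiles
        conf.setInt("hbase.hstore.compactionThreshold", 3);  // min files before a minor compaction
        conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);
        conf.setFloat("hbase.hregion.majorcompaction.jitter", 0.5f);
        conf.setInt("hbase.hregion.memstore.block.multiplier", 4); // assumed example value
    }
}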
Compression:
Block encoding: DATA_BLOCK_ENCODING => FAST_DIFF, good for scans
Compression => SNAPPY
block.data.cachecompressed=true
HFile block size
BLOCK_SIZE = 64K (increase for large scans; see the sketch below)
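A sketch of applying these column-family settings through the 1.0 admin API; the family and table names are hypothetical:

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

public class FamilyTuningSketch {
    public static void main(String[] args) {
        HColumnDescriptor family = new HColumnDescriptor("d"); // hypothetical family
        family.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF); // good for scans
        family.setCompressionType(Compression.Algorithm.SNAPPY);
        family.setBlocksize(64 * 1024); // 64K default; raise for large scans
        HTableDescriptor table = new HTableDescriptor(TableName.valueOf("events"));
        table.addFamily(family);
    }
}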
GC
Memstore and block cache are relatively long-lived
Suggests -Xmn512m (small eden space)
Haven't tested G1 properly yet; the Java 1.8 implementation looks very promising.
RS disk/JVM-heap ratio
disk/heap ratio = region size / memstore size * replication factor * heap fraction for memstores * 2
192 bytes on disk need 1 byte of heap
A 32 GB heap can barely fill a 6T disk
hbase.regionserver.maxlogs: HDFS block size * 0.95 * maxlogs should be > 0.4 * Java heap
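Worked example, assuming the 1.0 defaults (10 GB max region size, 128 MB memstore flush size, replication factor 3, 0.4 of the heap for memstores): 10240 / 128 * 3 * 0.4 * 2 = 192, which is where "192 bytes on disk per byte of heap" comes from; at that ratio a 32 GB heap covers about 32 GB * 192 ≈ 6 TB of disk.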
HTrace
Trace span: metadata about work being done - timeline, ID, description
Sampling
Diagnose perf, swimlanes on calls (see the sketch after this list)
Client (HBase, Hadoop), span receivers, htraced receiver
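A minimal client-side sketch against the htrace-core 3.x API bundled with HBase 1.0, assuming Sampler.ALWAYS and a span receiver (e.g. htraced) already configured; the span description is hypothetical:

import org.apache.htrace.Sampler;
import org.apache.htrace.Trace;
import org.apache.htrace.TraceScope;

public class TraceSketch {
    public static void main(String[] args) {
        // Wrap client-side work in a span; Sampler.ALWAYS traces every call,
        // in practice a probabilistic sampler keeps overhead low.
        TraceScope scope = Trace.startSpan("get-user-row", Sampler.ALWAYS);
        try {
            // ... issue HBase gets/puts here; child spans are linked automatically
        } finally {
            scope.close(); // span ends; the configured span receiver collects it
        }
    }
}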
Spark on HBase
Long-lived jobs (Spark Streaming)
Fewer HDFS syncs in DAG workflows
Trying to address the MR-on-HBase use case.
newAPIHadoopRDD - TableInputFormat (see the sketch after this list)
saveAsNewAPIHadoopFile - TableOutputFormat
Table snapshots work out of the box
RDD-to-RS locality needs better shuffling upstream
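A minimal sketch of the newAPIHadoopRDD + TableInputFormat path mentioned above, using the Java Spark API; the app and table names are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkHBaseScanSketch {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext(
            new SparkConf().setAppName("hbase-scan")); // hypothetical app name
        Configuration conf = HBaseConfiguration.create();
        conf.set(TableInputFormat.INPUT_TABLE, "events"); // hypothetical table
        // Expose a full-table scan as an RDD via the MapReduce input format.
        JavaPairRDD<ImmutableBytesWritable, Result> rows = jsc.newAPIHadoopRDD(
            conf, TableInputFormat.class,
            ImmutableBytesWritable.class, Result.class);
        System.out.println("rows: " + rows.count());
        jsc.stop();
    }
}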
Operators' Panel
Multi vs. single tenancy (single large vs. multiple tiny clusters)
Bloomberg/Google: better to have fewer large clusters than many small, isolated clusters
Flipboard, Dropbox: multiple small clusters
Facebook: single sharded cluster
Monitoring:
Bloomberg, Dropbox: JMX metrics, Hannibal, log scraping/mining, Graphite, 15s-granularity event aggregation
Google, Facebook: TSDB-like internal systems
Median and mean metrics hide important events (statistics lie!); high-percentile numbers are important; also look at minimums
Capacity planning:
Over-provisioning is better
Mostly disk-constrained (IOPS); HBase almost never blocks on CPU/memory
Upgrade path
Try to upgrade frequently
H/W sizing
Think about ratios of resources (mem/CPU/disk)
FB, Dropbox, Google: we don't do specialized hardware sizing; it's mostly workload-specific. Homogeneous H/W helps.
Cloud vs. bare metal
Noisy-neighbor problem (the network is always shared), so the cloud can be painful.
HBase with Marathon/Mesos
Node level:
Containers to statically link everything
Marathon "runs" apps on Mesos. REST-based. Handles deployment of Docker containers.
Docker/Mesos deployment
Trivial integration; no code.
Scale up and down based on load; react to traffic spikes.
Smaller heaps, less GC-induced latency.
HBase with Slider/YARN: Deathstar
Usually go with separate clusters for HBase and everything else.
Problem with separate clusters: want a common HDFS; non-uniform network usage across apps.
HBase managed using YARN + Slider
Provisioning model
Test apps on a test cluster
Once matured, move to the prod cluster
Common HDFS layer
JSON-config-managed cluster setup
Key challenges:
RM failure means HBase cluster downtime; no built-in HA. Solution: auto-restarts.
Slider had issues acknowledging container allocations
Zombie RSs
Long-running services aren't YARN's exact use case. Logging is an unsolved problem. Solution: store logs locally (looking at ELK).
Rolling restarts for config changes not currently possible. SLIDER-226.
Data locality. Solution: allocate the new RS on another replica's DN.