HBaseCon 2015 Notes
HBase 1.0
~1500 JIRAs
API standardization
Read availability via region replicas
Automatic tuning of global memstore and block cache sizes
Compressed blocks in block cache.
1.0.x and 1.1.x will have monthly releases.
Rolling upgrades supported across minor 1.x versions. Not guaranteed between 1.x and 2.x.
Upgrade path from 0.98.x:
Full client compatibility moving from 0.98.x to 1.x (but moving to the new client APIs is recommended)
Wire compatibility (a 0.98 client can talk to a 1.0 server). Rolling upgrade to 1.0 is possible.
1.0 can read 0.98 snapshots
Recommended:
Hadoop 2.5 or 2.6 (recommended), ZooKeeper 3.4 (won't run well on 3.3)
Support for JDK 7 and 8 (8 not as well tested as 7). JDK 6 is gone!
1.0 API
Connections are now user-managed
RegionLocator is a lightweight object for looking up region locations more flexibly
Autoflush removed, replaced with BufferedMutator (easier to use). Intended for many small writes. Buffered mutations are lost if the process dies. See the sketch below.
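A minimal sketch of the 1.0-style client API described above (the table, family, and qualifier names are hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class NewApiSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Connections are created and closed by the caller now.
    try (Connection connection = ConnectionFactory.createConnection(conf)) {
      TableName table = TableName.valueOf("test_table");   // hypothetical table

      // Lightweight region lookup via RegionLocator.
      try (RegionLocator locator = connection.getRegionLocator(table)) {
        HRegionLocation loc = locator.getRegionLocation(Bytes.toBytes("row-1"));
        System.out.println("row-1 lives on " + loc.getHostname());
      }

      // BufferedMutator replaces autoflush=false: many small writes, batched client-side.
      // Anything still buffered is lost if this process dies before flush()/close().
      try (BufferedMutator mutator = connection.getBufferedMutator(table)) {
        for (int i = 0; i < 1000; i++) {
          Put put = new Put(Bytes.toBytes("row-" + i));
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes(i));
          mutator.mutate(put);
        }
        mutator.flush();
      }
    }
  }
}
```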
Pinterest
Runs fully on AWS-provisioned hardware: hi1.4xlarge instances (moving to i2)
Running 0.94.x, planning to move to 1.x (with no downtime)
Using CMS + ParNew GC; hit JVM object promotion failures
Went off-heap for the block cache (BucketCache)
Ensured memory headroom for the memstore
Moved GC and Hadoop daemon logging under /var/log/ to a symlink to the first ephemeral disk
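A hedged sketch of that off-heap BucketCache setup; Pinterest's actual sizes were not given, so the values below are illustrative only, and these settings normally go in hbase-site.xml with -XX:MaxDirectMemorySize set on the RegionServer JVM:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class OffHeapBlockCacheSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Keep the block cache off the Java heap to shrink the heap and reduce
    // promotion-failure-style GC pauses.
    conf.set("hbase.bucketcache.ioengine", "offheap");
    // BucketCache capacity in MB (illustrative 16 GB, not Pinterest's number).
    conf.setInt("hbase.bucketcache.size", 16 * 1024);
    System.out.println("ioengine = " + conf.get("hbase.bucketcache.ioengine"));
  }
}
```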
Monitoring:
OpenTSDB for all metrics. Use opentsdb-r to mine metrics with R
Nagios-based alerting
Hotspot monitoring: tcpdump is useful (tcpdump -i eth0 -w - -s 0 tcp port 60020 | strings)
Capacity planning
salted keys for uniform distribution
migrated to dedicated cluster
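The "salted keys" item above refers to the common pattern of prefixing row keys with a small hash-derived salt; a minimal sketch (the bucket count and key format are made up):

```java
import java.util.Arrays;

public class SaltedKeySketch {
  // Hypothetical number of salt buckets; roughly the number of regions
  // you want sequential writes spread across.
  static final int BUCKETS = 16;

  // Prefix the row key with a one-byte salt derived from its hash so that
  // monotonically increasing keys are distributed uniformly across buckets.
  static byte[] salt(byte[] rowKey) {
    int bucket = (Arrays.hashCode(rowKey) & 0x7fffffff) % BUCKETS;
    byte[] salted = new byte[rowKey.length + 1];
    salted[0] = (byte) bucket;
    System.arraycopy(rowKey, 0, salted, 1, rowKey.length);
    return salted;
  }

  public static void main(String[] args) {
    byte[] key = "event-000001".getBytes();
    System.out.println("bucket for event-000001: " + salt(key)[0]);
  }
}
```

The usual trade-off: scans over a key range then have to fan out across all buckets.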
Disaster recovery:
ExportSnapshot to a backup HDFS cluster with throttling; distcp WAL logs hourly to HDFS. Clone the snapshot + WALPlayer to recover (sketch after the commands below).
hbasehbackup -t unique_index -d new_cluster -d 10
hbaserecover -t unique_index -ts DATE_TIME --source-cluster zennotifications
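A hedged sketch of that backup/restore recipe using the stock HBase tools (ExportSnapshot, clone snapshot, WALPlayer); the snapshot/table names and paths are invented, and the hbasebackup/hbaserecover commands above are Pinterest-internal wrappers not reproduced here:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class RestoreFromBackupSketch {
  public static void main(String[] args) throws Exception {
    // Backup side (run from the shell, shown here as comments):
    //   hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    //       -snapshot unique_index_snap -copy-to hdfs://backup-nn:8020/hbase -bandwidth 50
    //   plus an hourly distcp of the WAL directory to the backup cluster.

    // Recovery side: clone the exported snapshot into a new table ...
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      admin.cloneSnapshot("unique_index_snap", TableName.valueOf("unique_index_restored"));
    }

    // ... then replay the distcp'd WALs into it to catch up past the snapshot point:
    //   hbase org.apache.hadoop.hbase.mapreduce.WALPlayer \
    //       hdfs://backup-nn:8020/wal-archive unique_index unique_index_restored
  }
}
```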
HBase Perf Tuning (Lars)
HDFS Correctness
dfs.datanode.synconclose=true
mount ext4 with dirsync or use XFS
dfs.datanode.sync.behind.writes=true (syncs partial blocks to disk, best effort)
short-circuit read buffer size = 128K (allocated per open block reader, so keep it small)
distribute data across disks
dfs.block.size = 256M
ipc.tcpnodelay=true
datanode.max.xceivers=8192
datanode handler count = 8
RS settings
blockingStoreFiles=10 (small for read, large for write)
compactionThreshold=3 (small for read, large for write)
memstore flush size = 128M
majorcompaction jitter = 0.5
memstore.block.multiplier (block updates if a single memstore grows past this multiple of the flush size)
Don't use HDFS balancer
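For reference, a sketch spelling out the full property keys behind the shorthand in the two lists above (values copied from the notes; in practice these live in hdfs-site.xml / hbase-site.xml rather than being set in code):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class TuningKeysSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();

    // HDFS side (hdfs-site.xml)
    conf.setBoolean("dfs.datanode.synconclose", true);          // sync blocks on close
    conf.setBoolean("dfs.datanode.sync.behind.writes", true);   // best-effort sync of partial blocks
    conf.setLong("dfs.blocksize", 256L * 1024 * 1024);          // 256 MB blocks
    conf.setBoolean("ipc.server.tcpnodelay", true);             // disable Nagle on RPC
    conf.setInt("dfs.datanode.max.transfer.threads", 8192);     // a.k.a. "xceivers"
    conf.setInt("dfs.datanode.handler.count", 8);

    // RegionServer side (hbase-site.xml)
    conf.setInt("hbase.hstore.blockingStoreFiles", 10);         // small for reads, large for writes
    conf.setInt("hbase.hstore.compactionThreshold", 3);         // small for reads, large for writes
    conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);
    conf.setFloat("hbase.hregion.majorcompaction.jitter", 0.5f);
    // hbase.hregion.memstore.block.multiplier: how far a single memstore may grow
    // past the flush size before updates block (left at its default here).

    System.out.println(conf.get("hbase.hstore.blockingStoreFiles"));
  }
}
```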
Compression:
Block encoding: DATA_BLOCK_ENCODING => FAST_DIFF (good for scans)
Compression => SNAPPY
block.data.cachecompressed=true
HFile block size
BLOCK_SIZE=64K (increase for large scans)
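A sketch of applying the column-family settings above (FAST_DIFF block encoding, Snappy compression, 64 KB blocks) at table-creation time with the 1.0 admin API; the table and family names are made up:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

public class CreateTunedTable {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = conn.getAdmin()) {
      HColumnDescriptor cf = new HColumnDescriptor("d")          // hypothetical family
          .setDataBlockEncoding(DataBlockEncoding.FAST_DIFF)     // good for scans
          .setCompressionType(Compression.Algorithm.SNAPPY)
          .setBlocksize(64 * 1024);                              // increase for large scans

      HTableDescriptor table = new HTableDescriptor(TableName.valueOf("scan_heavy"));
      table.addFamily(cf);
      admin.createTable(table);
    }
  }
}
```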
GC
Memstore and block cache data are relatively long-lived
suggest -Xmn512m (small eden space)
G1 not properly tested yet; the Java 1.8 implementation looks very promising.
RS Disk/JVM heap ratio
disk/heap ratio = (region size / memstore flush size) * replication factor * heap fraction for memstores * 2
With typical defaults (10 GB regions, 128 MB memstore flush size, replication factor 3, 40% of heap for memstores): 10240/128 * 3 * 0.4 * 2 = 192, i.e. 192 bytes on disk need 1 byte of heap
So 32 GB of heap can barely fill a 6 TB disk (32 GB * 192 ≈ 6 TB)
hbase.regionserver.maxlogs: HDFS block size * 0.95 * maxlogs should be > 0.4 * Java heap (the global memstore limit), so full WAL queues don't force premature flushes
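A small worked example of both rules of thumb, using the assumed default values from the ratio calculation above (the maxlogs value is hypothetical):

```java
public class HeapDiskRatioSketch {
  public static void main(String[] args) {
    // Assumed typical values (not stated verbatim in the talk notes): 10 GB regions,
    // 128 MB memstore flush size, 3x HDFS replication, 40% of heap for memstores.
    double regionSizeMb = 10 * 1024;
    double memstoreFlushMb = 128;
    double replicationFactor = 3;
    double heapFractionForMemstores = 0.4;

    // Bytes of disk that one byte of RegionServer heap can "carry".
    double diskPerHeapByte =
        (regionSizeMb / memstoreFlushMb) * replicationFactor * heapFractionForMemstores * 2;
    System.out.printf("disk/heap ratio: %.0f%n", diskPerHeapByte);            // ~192

    double heapGb = 32;
    System.out.printf("32 GB heap serves roughly %.1f TB of disk%n",
        heapGb * diskPerHeapByte / 1024);                                      // ~6 TB

    // WAL sizing: maxlogs * (0.95 * HDFS block size) should exceed the global
    // memstore limit (0.4 * heap) so a full WAL queue doesn't force early flushes.
    double hdfsBlockMb = 256;
    int maxlogs = 55;                                                          // hypothetical
    double walCapacityGb = maxlogs * 0.95 * hdfsBlockMb / 1024;
    System.out.printf("WAL capacity %.1f GB vs memstore limit %.1f GB%n",
        walCapacityGb, heapGb * 0.4);
  }
}
```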
HTrace
Trace span: metadata about work being done (timeline, id, description)
Sampling
Diagnose performance; swimlane views of calls
Clients (HBase, Hadoop), span receivers, htraced receiver
Spark on HBase
Long lived jobs ( spark streaming )
Fewer HDFS syncs in DAG workflows
Trying to address the MR-on-HBase use case.
newAPIHadoopRDD - TableInputFormat
saveAsNewAPIHadoopFile - TableOutputFormat
Table snapshots work out of the box
RDD-to-RegionServer locality needs better shuffling support upstream
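A minimal Java Spark sketch of the newAPIHadoopRDD + TableInputFormat read path mentioned above (the table name is hypothetical); writes go the analogous way through TableOutputFormat:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkScanSketch {
  public static void main(String[] args) {
    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("hbase-scan-sketch"));

    Configuration hbaseConf = HBaseConfiguration.create();
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "events");   // hypothetical table

    // Each HBase region becomes one RDD partition; rows come back as Result objects.
    JavaPairRDD<ImmutableBytesWritable, Result> rows = sc.newAPIHadoopRDD(
        hbaseConf, TableInputFormat.class,
        ImmutableBytesWritable.class, Result.class);

    System.out.println("row count: " + rows.count());
    sc.stop();
  }
}
```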
Operator's Panel
Multi vs. Single Tenancy (Single Large vs. Multiple Tiny clusters)
Bloomberg/Google: Better to have fewer large cluster(s) than many small/isolated clusters
Flipboard, Dropbox: multiple small clusters
Facebook: Single Sharded cluster
Monitoring:
Bloomberg, Dropbox: JMX metrics, Hannibal, log scraping/mining, Graphite, 15s granularity event aggregation
Google, Facebook: TSDB-like internal system
Median and mean metrics hide important events (statistics lie!); high-percentile numbers are important; also look at minimums
Capacity Planning:
Over-provisioning is better
Mostly disk constraints and IOPS. HBase almost never blocks on CPU/memory.
Upgrade Path
Try to upgrade frequently
H/W Sizing
Think about ratios of resources (memory, CPU, disk)
FB, Dropbox, Google: we don't do specialized hardware sizing; it's mostly workload-specific. Homogenizing hardware helps.
Cloud vs. Bare Metal
noisy neighbor problem (network is always shared). Cloud can be painful as a result.
HBase with Marathon/Mesos
Node level:
Containers used to statically bundle everything
Marathon "runs" apps on Mesos; REST-based; handles deployment of Docker containers
Docker/Mesos deployment
Trivial integration, no code changes
Scale up and down based on load; react to traffic spikes
Smaller heaps, less GC-induced latency
HBase with Slider/YARN: Deathstar
Usually separate clusters for HBase vs. other workloads
Problem with separate clusters: want a common HDFS; network usage is non-uniform across apps
HBase managed using YARN + Slider
Provisioning Model
Test apps on test cluster
Once matured, move to the prod cluster
Common HDFS layer
Cluster setup managed via JSON config
Key Challenges:
RM failures mean HBase cluster downtime; no built-in HA. Solution: auto restarts
Slider had issues acknowledging container allocations
Zombie RegionServers
Long-running services are not exactly YARN's target use case. Logging is an unsolved problem. Solution: store logs locally (looking at ELK)
Rolling restarts for config changes not possible currently. SLIDER-226.
Data locality. Solution: allocate new RegionServers on DataNodes holding the other replicas.