Kafka, Flume, Spark sizing notes
Flume
=====
ingest : number of sources of origin for events / number of events (per source) <<<<* this is required
R.o.T : one aggregator for every 4-16 agents
so using the above info, work from the outer tier in to the final inner tier - and we know the final destination of events : HDFS?
R.o.T. HDFS can sink 1000 events per sink on the backend - we assume a "fan-in" type configuration (so many-to-one from outer to
inner tier)
Start with this rule of thumb:
# Cores = (# Sources + # Sinks) / 2
• If using Memory Channel, RAM should be sufficient to hold maximum channel capacity
• If using File Channel, more disks are better for throughput
- in both cases the channel needs to cover the expected resolution time for any downstream failure ... 2-4 hrs? (rough sizing sketch below)
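A minimal sketch of what that buffering window implies for a memory channel. Every input here is an assumption for illustration (16 agents per aggregator from the 1:16 ratio below, 1,000 events/s per agent, ~500-byte events, a 4-hour outage), not a measurement:

# Memory-channel sizing sketch - all inputs are assumed figures, not measured.
events_per_sec_per_agent = 1_000   # assumed ingest per upstream agent
agents_per_aggregator = 16         # 1:16 outer-tier fan-in (R.o.T.)
event_size_bytes = 500             # assumed average event size
outage_seconds = 4 * 3600          # worst case of the 2-4 hr resolution window

events_to_buffer = events_per_sec_per_agent * agents_per_aggregator * outage_seconds
ram_needed_gb = events_to_buffer * event_size_bytes / 1024**3
print(f"channel capacity ~ {events_to_buffer:,} events, ~ {ram_needed_gb:.0f} GB of RAM")

For multi-hour windows the result quickly outgrows a practical heap, which is one reason the file channel (with more disks for throughput) tends to be the safer choice for long outages.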
Example :
Number of tiers for an aggregation flow from 100 web servers
• Using ratio of 1:16 for outer tier, no load balancing or failover requirements
- Number of tier 1 agents = 100/16 ~ 7
• Using ratio of 1:4 for inner tier, no load balancing or failover
- Number of tier 2 agents = 7/4 ~ 2
Total of 2 tiers, 9 agents
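The same arithmetic as a quick sketch, combining the tier ratios with the cores rule of thumb; the per-agent source/sink counts in the last line are illustrative assumptions:

import math

# Tier sizing from the fan-in ratios above (1:16 outer tier, 1:4 inner tier).
def tier_sizes(num_sources, outer_ratio=16, inner_ratio=4):
    tier1 = math.ceil(num_sources / outer_ratio)   # agents facing the web servers
    tier2 = math.ceil(tier1 / inner_ratio)         # aggregators in front of HDFS
    return tier1, tier2

# Cores per agent from: # Cores = (# Sources + # Sinks) / 2
def cores_per_agent(sources, sinks):
    return math.ceil((sources + sinks) / 2)

print(tier_sizes(100))          # (7, 2) -> 2 tiers, 9 agents total
print(cores_per_agent(16, 4))   # 10 cores for a tier-1 agent with 16 sources, 4 sinks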
Kafka
=====
Amount of data to ingest : say requests/sec, or the total amount of data in TBs? Once we have total throughput we can calculate the number of
required partitions (for parallelism) based on throughput per partition for both producer and consumer (usually needs to be measured,
but R.o.T 10MB/s??) <<<<* this is required
Additional info ...
kafka heap - 5G max
64G/24vCPUs configs are most common
6 vdisks RAID 10 - xfs filesystem
RF2 - replication factor of 2
Kafka uses a very large number of files and a large number of sockets to communicate with the clients. All of this requires a relatively
high number of available file descriptors. You should increase your file descriptor count to at least 100,000
You measure the throughput that you can achieve on a single partition for production (call it p) and consumption (call it c). Let's say
your target throughput is t. Then you need at least max(t/p, t/c) partitions. The per-partition throughput that one can achieve
on the producer depends on configurations such as the batching size, compression codec, type of acknowledgement, replication factor, etc.
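A minimal sketch of that calculation; the target figure and the per-partition rates here are assumed numbers for illustration (p and c should really come from a benchmark):

import math

# partitions >= max(t/p, t/c)
target_mb_s = 500      # t: target topic throughput (assumed)
producer_mb_s = 10     # p: per-partition producer throughput (R.o.T. above)
consumer_mb_s = 20     # c: per-partition consumer throughput (assumed)

partitions = math.ceil(max(target_mb_s / producer_mb_s, target_mb_s / consumer_mb_s))
print(partitions)      # 50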
Spark
=====
ingest rate / amount of ingest - can configure Spark on the HDFS workers (reduces latency) or as a separate node label (not usually recommended?), then only
need the history server on the master nodes <<<<* this is required
additional info...
4-8 disks per node, up to 75% of RAM to Spark (anywhere from 8 GB up to ~200 GB per JVM; above ~200 GB the JVM becomes unstable, so run additional JVMs on the host.
Not an issue when using a VM), 8-16 cores per machine. Install/configure a standby master with ZK - most likely done via Cloudera Manager.
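A per-node sketch of those rules of thumb; the node RAM and core counts are assumed values for a hypothetical worker:

import math

# Per-node Spark sizing sketch using the rules of thumb above (assumed hardware).
node_ram_gb = 384                 # assumed RAM on the worker
node_cores = 16                   # 8-16 cores per machine (R.o.T.)
max_jvm_heap_gb = 200             # beyond ~200 GB a single JVM gets unstable

spark_ram_gb = node_ram_gb * 0.75                     # up to 75% of RAM to Spark
num_jvms = math.ceil(spark_ram_gb / max_jvm_heap_gb)  # split across extra JVMs (or VMs) if needed
heap_per_jvm_gb = spark_ram_gb / num_jvms
cores_per_jvm = node_cores // num_jvms

print(num_jvms, heap_per_jvm_gb, cores_per_jvm)       # 2 144.0 8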
https://legacy.gitbook.com/book/umbertogriffo/apache-spark-best-practices-and-tuning/details