Kafka, Flume, Spark sizing notes
Flume
=====
ingest : number of sources of origin for events / number of events (per source) <<<<* this is required
R.o.T : one aggregator for every 4-16 agents
so using the above info, work from the outer tier in to the final inner tier - and we know the final destination of events : HDFS?
R.o.T. HDFS can sink 1000 events per sink on the backend - we assume a "fan-in" type configuration (so many-to-one from outer to
inner tier)
Start with this rule of thumb:
# Cores = (# Sources + # Sinks) / 2
• If using Memory Channel, RAM should be sufficient to hold maximum channel capacity
• If using File Channel, more disks are better for throughput
- in both cases the channel needs to cover the expected resolution time for any downstream failure ... 2-4 hrs? (rough sizing sketch below)
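A minimal sketch of what that buffering window implies for a memory channel. Every input here is an assumption for illustration (16 agents per aggregator from the 1:16 ratio below, 1,000 events/s per agent, ~500-byte events, a 4-hour outage), not a measurement:

# Memory-channel sizing sketch - all inputs are assumed figures, not measured.
events_per_sec_per_agent = 1_000   # assumed ingest per upstream agent
agents_per_aggregator = 16         # 1:16 outer-tier fan-in (R.o.T.)
event_size_bytes = 500             # assumed average event size
outage_seconds = 4 * 3600          # worst case of the 2-4 hr resolution window

events_to_buffer = events_per_sec_per_agent * agents_per_aggregator * outage_seconds
ram_needed_gb = events_to_buffer * event_size_bytes / 1024**3
print(f"channel capacity ~ {events_to_buffer:,} events, ~ {ram_needed_gb:.0f} GB of RAM")

For multi-hour windows the result quickly outgrows a practical heap, which is one reason the file channel (with more disks for throughput) tends to be the safer choice for long outages.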
Example :
Number of tiers for an aggregation flow from 100 web servers
• Using ratio of 1:16 for outer tier, no load balancing or failover requirements
- Number of tier 1 agents = 100/16 ~ 7
• Using ratio of 1:4 for inner tier, no load balancing or failover
- Number of tier 2 agents = 7/4 ~ 2
Total of 2 tiers, 9 agents
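The same arithmetic as a quick sketch, combining the tier ratios with the cores rule of thumb; the per-agent source/sink counts in the last line are illustrative assumptions:

import math

# Tier sizing from the fan-in ratios above (1:16 outer tier, 1:4 inner tier).
def tier_sizes(num_sources, outer_ratio=16, inner_ratio=4):
    tier1 = math.ceil(num_sources / outer_ratio)   # agents facing the web servers
    tier2 = math.ceil(tier1 / inner_ratio)         # aggregators in front of HDFS
    return tier1, tier2

# Cores per agent from: # Cores = (# Sources + # Sinks) / 2
def cores_per_agent(sources, sinks):
    return math.ceil((sources + sinks) / 2)

print(tier_sizes(100))          # (7, 2) -> 2 tiers, 9 agents total
print(cores_per_agent(16, 4))   # 10 cores for a tier-1 agent with 16 sources, 4 sinks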
Kafka
=====
Amount of data to ingest : say requests/sec, or the total amount of data in TBs? Once we have total throughput we can calculate the number of
required partitions (for parallelism) based on throughput per partition for both producer and consumer (usually needs to be measured,
but R.o.T 10MB/s??) <<<<* this is required
Additional info ...
kafka heap - 5G max
64G/24vCPUs configs are most common
6 vdisks RAID 10 - xfs filesystem
RF2 - replication factor of 2
Kafka uses a very large number of files and a large number of sockets to communicate with the clients. All of this requires a relatively
high number of available file descriptors. You should increase your file descriptor count to at least 100,000
You measure the throughput that you can achieve on a single partition for production (call it p) and consumption (call it c). Let's say
your target throughput is t. Then you need at least max(t/p, t/c) partitions. The per-partition throughput that one can achieve
on the producer depends on configurations such as the batching size, compression codec, type of acknowledgement, replication factor, etc.
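A minimal sketch of that calculation; the target figure and the per-partition rates here are assumed numbers for illustration (p and c should really come from a benchmark):

import math

# partitions >= max(t/p, t/c)
target_mb_s = 500      # t: target topic throughput (assumed)
producer_mb_s = 10     # p: per-partition producer throughput (R.o.T. above)
consumer_mb_s = 20     # c: per-partition consumer throughput (assumed)

partitions = math.ceil(max(target_mb_s / producer_mb_s, target_mb_s / consumer_mb_s))
print(partitions)      # 50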
Spark
=====
ingest rate / amount of ingest - can configure Spark on the HDFS workers (reduces latency) or as a separate node label (not usually recommended?), then only
need the history server on the master nodes <<<<* this is required
additional info...
4-8 disks per node, up to 75% of RAM to Spark (anywhere from 8 GB up to ~200 GB per JVM; above ~200 GB the JVM becomes unstable, so run additional JVMs on the host.
Not an issue when using a VM), 8-16 cores per machine. Install/configure a standby master with ZK - most likely done via Cloudera Manager.
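A per-node sketch of those rules of thumb; the node RAM and core counts are assumed values for a hypothetical worker:

import math

# Per-node Spark sizing sketch using the rules of thumb above (assumed hardware).
node_ram_gb = 384                 # assumed RAM on the worker
node_cores = 16                   # 8-16 cores per machine (R.o.T.)
max_jvm_heap_gb = 200             # beyond ~200 GB a single JVM gets unstable

spark_ram_gb = node_ram_gb * 0.75                     # up to 75% of RAM to Spark
num_jvms = math.ceil(spark_ram_gb / max_jvm_heap_gb)  # split across extra JVMs (or VMs) if needed
heap_per_jvm_gb = spark_ram_gb / num_jvms
cores_per_jvm = node_cores // num_jvms

print(num_jvms, heap_per_jvm_gb, cores_per_jvm)       # 2 144.0 8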
https://legacy.gitbook.com/book/umbertogriffo/apache-spark-best-practices-and-tuning/details