Kafka producer - consumer - broker tuning
1.Producer:
1. request.required.acks=[0, 1, all/-1] - 0: no acknowledgement, very fast but messages can be lost; 1: acknowledged once the leader commits; all/-1: acknowledged only after all in-sync replicas have the message.
2. Use the async producer - set producer.type=async and register a callback for the acknowledgement.
3. Batch data - send multiple messages together, tuned via:
batch.num.messages
queue.buffering.max.ms
4. Compression for large messages - gzip and snappy are supported.
Very large files can be stored in a shared location instead, with only the file path logged by the Kafka producer.
5. Timeout/retry settings - the defaults may be too long, so tune them per use case via request.timeout.ms.
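The producer settings above, collected into a config sketch (property names are the old Scala-client era keys used in these notes; the values are illustrative assumptions, not recommendations):

```properties
# acknowledge once the leader commits (0 = none, all/-1 = all in-sync replicas)
request.required.acks=1
# async producer: send returns immediately, messages are batched in the background
producer.type=async
# flush a batch at 200 messages, or after 100 ms, whichever comes first
batch.num.messages=200
queue.buffering.max.ms=100
# compress payloads (gzip or snappy)
compression.codec=snappy
# fail fast instead of waiting on the longer default timeout
request.timeout.ms=10000
```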
2.Brokers:
1. Choose a high partition count - a consumer group cannot have more active consumers than partitions.
2. Aim for one partition per physical disk, to avoid an I/O bottleneck.
3. Load-balance partitions across brokers with the reassignment tool:
kafka-reassign-partitions.sh --generate (the plan), then --execute, then --verify.
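The rebalance flow in step 3, sketched as a command sequence (the ZooKeeper address, topic name, and broker ids are made-up placeholders; newer Kafka versions take --bootstrap-server instead of --zookeeper):

```
# describe which topics to move
cat > topics.json <<'EOF'
{"version": 1, "topics": [{"topic": "my-topic"}]}
EOF

# generate a candidate plan for brokers 0,1,2
kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --topics-to-move-json-file topics.json \
  --broker-list "0,1,2" --generate
# (copy the "Proposed partition reassignment" JSON from the output into plan.json)

# execute the proposed assignment, then check progress until it completes
kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --reassignment-json-file plan.json --execute
kafka-reassign-partitions.sh --zookeeper zk:2181 \
  --reassignment-json-file plan.json --verify
```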
Some parameters to tune:
num.io.threads - at least as many threads as there are disks.
log.flush.interval.messages / log.flush.interval.ms - a higher interval increases throughput, but risks data loss if the server crashes.
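A server.properties sketch for the broker knobs above, assuming a broker with two data disks (paths and values are illustrative):

```properties
# at least one I/O thread per disk
num.io.threads=2
# one log directory per physical disk to spread partition I/O
log.dirs=/disk1/kafka-logs,/disk2/kafka-logs
# flush less eagerly for throughput; rely on replication for durability
log.flush.interval.messages=10000
log.flush.interval.ms=1000
```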
3.Consumers:
1. Have as many consumers in a group as there are partitions.
2. Scale consumers to keep up with producer throughput.
3. Adding more consumers to a group improves throughput (up to the partition count); adding another consumer group does not - each group independently reads the full stream.
4. Checkpoint interval:
replica.high.watermark.checkpoint.interval.ms - a higher value improves performance, since checkpointing is done less frequently, at a slight risk of data loss.
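Points 1 and 3 above can be illustrated with a small simulation (plain Python, no Kafka client): partitions are dealt out round-robin within a group, so consumers beyond the partition count sit idle, while a second group gets its own independent assignment of all partitions.

```python
def assign(partitions, consumers):
    """Round-robin partition assignment within one consumer group.

    Returns {consumer: [partitions...]}; consumers beyond the
    partition count receive nothing and sit idle.
    """
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = list(range(4))  # a topic with 4 partitions

# 4 partitions, 6 consumers in group A: two consumers get nothing
group_a = assign(partitions, [f"a{i}" for i in range(6)])
idle = [c for c, ps in group_a.items() if not ps]
print(idle)  # ['a4', 'a5']

# a second group B independently receives all 4 partitions
group_b = assign(partitions, ["b0", "b1"])
print(group_b)  # {'b0': [0, 2], 'b1': [1, 3]}
```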
4.Pipeline performance:
The other systems in the pipeline should perform as well as Kafka - for example, when data is written to HDFS, a slow HDFS sink becomes the bottleneck.
Extra:
Kafka vs other messaging systems:
1. Decoupling of producer & consumer - data is persisted while consumers are down, so consumers can run periodically, like ETL jobs.
2. Data is not stored per consumer as in a queue - each message is saved once, and any number of consumers can read it independently with their own offsets.
3. Replication is on by default - not just in specialized cases with complicated configuration.
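Point 2 - one shared log with independent offsets per consumer group - in a minimal plain-Python sketch (no Kafka involved; the class and method names are made up for illustration):

```python
class Log:
    """A single append-only log shared by all consumer groups."""

    def __init__(self):
        self.messages = []
        self.offsets = {}  # group -> next offset to read

    def append(self, msg):
        self.messages.append(msg)

    def poll(self, group, max_records=10):
        """Each group reads from its own offset; reading removes nothing."""
        start = self.offsets.get(group, 0)
        batch = self.messages[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch

log = Log()
for m in ["m0", "m1", "m2"]:
    log.append(m)

print(log.poll("etl"))        # ['m0', 'm1', 'm2']
print(log.poll("analytics"))  # ['m0', 'm1', 'm2'] - same data, own offset
print(log.poll("etl"))        # [] - 'etl' has caught up
```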