This note covers setting up a MongoDB -> Kafka -> Spark Streaming pipeline for rapid prototyping.
PySpark is used for fast turnaround. PySpark's Kafka receiver relies on ZooKeeper, so the topic is created through ZooKeeper and consumer offsets are stored there as well.
$> brew install zookeeper kafka apache-spark jupyter
$> brew services start zookeeper
$> brew services start kafka
$> kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 5 --topic checklist
Check that the partitions were created.
$> kafka-topics --describe --zookeeper localhost:2181 --topic checklist
Topic:checklist PartitionCount:5 ReplicationFactor:1 Configs:
Topic: checklist Partition: 0 Leader: 0 Replicas: 0 Isr: 0
Topic: checklist Partition: 1 Leader: 0 Replicas: 0 Isr: 0
Topic: checklist Partition: 2 Leader: 0 Replicas: 0 Isr: 0
Topic: checklist Partition: 3 Leader: 0 Replicas: 0 Isr: 0
Topic: checklist Partition: 4 Leader: 0 Replicas: 0 Isr: 0
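With the topic in place, something has to feed it from MongoDB. A minimal producer sketch, assuming pymongo and kafka-python (neither is pinned by this note) and placeholder database/collection names:

```python
import json


def to_message(doc):
    """Serialize a MongoDB document to JSON bytes for Kafka.

    The ObjectId in _id is not JSON-serializable, so it is dropped here;
    keys are sorted so the payload is deterministic.
    """
    doc = dict(doc)
    doc.pop("_id", None)
    return json.dumps(doc, sort_keys=True).encode("utf-8")


def run():
    # Local imports: pymongo and kafka-python are assumptions of this sketch.
    from pymongo import MongoClient
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    # "prototyping" / "checklist" are placeholder names, not from the note.
    coll = MongoClient("localhost", 27017)["prototyping"]["checklist"]

    for doc in coll.find():
        producer.send("checklist", to_message(doc))
    producer.flush()


# run()  # uncomment with mongod and Kafka running locally
```

The producer's default partitioner will spread messages across the 5 partitions created above.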
Once the Spark Streaming code runs, it creates a new consumer group.
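A minimal consumer sketch of that Spark side, assuming Spark 1.x/2.x with the spark-streaming-kafka-0-8 artifact on the classpath (the group id checklist-group and the thread count are choices mirroring the topic above, not taken from the note):

```python
import json


def parse_record(value):
    """Decode one Kafka message value (a JSON string) into a dict; None if malformed."""
    try:
        return json.loads(value)
    except (TypeError, ValueError):
        return None


def main():
    # pyspark imports are kept local so the helper above is importable without Spark.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="checklist-stream")
    ssc = StreamingContext(sc, batchDuration=5)

    # createStream is the ZooKeeper-based receiver, which is why the group
    # shows up under kafka-consumer-groups --zookeeper below.
    stream = KafkaUtils.createStream(
        ssc,
        zkQuorum="localhost:2181",
        groupId="checklist-group",
        topics={"checklist": 5},  # 5 receiver threads, one per partition
    )

    # Messages arrive as (key, value) pairs; keep only well-formed JSON values.
    stream.map(lambda kv: parse_record(kv[1])) \
          .filter(lambda d: d is not None) \
          .pprint()

    ssc.start()
    ssc.awaitTermination()


# main()  # uncomment to start the stream against a running broker
```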
$> kafka-consumer-groups --zookeeper localhost:2181 --describe --group checklist-group
Note: This will only show information about consumers that use ZooKeeper (not those using the Java consumer API).
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID
checklist 0 25094 25094 0 checklist-group_Rydeen.local-1517642255383-8ec43f30
checklist 1 25387 25387 0 checklist-group_Rydeen.local-1517642255383-8ec43f30
checklist 2 24989 24989 0 checklist-group_Rydeen.local-1517642255383-8ec43f30
checklist 3 24887 24887 0 checklist-group_Rydeen.local-1517642255383-8ec43f30
checklist 4 25138 25138 0 checklist-group_Rydeen.local-1517642255383-8ec43f30
To replay the topic from the beginning, reset the group's offsets. Note: with --bootstrap-server this acts on offsets committed to Kafka; offsets kept in ZooKeeper by the old consumer live under /consumers instead. Without --execute the command is only a dry run that prints the planned offsets; add --execute (with the Spark job stopped, since the group must be inactive) to apply it.
$> kafka-consumer-groups --bootstrap-server localhost:9092 --reset-offsets --group checklist-group --all-topics --to-earliest --execute
To purge the topic between test runs, temporarily drop the retention to one second, give the broker a minute to delete the old segments, then remove the override so the default retention applies again.
$> kafka-configs --zookeeper localhost:2181 --alter --entity-type topics --add-config retention.ms=1000 --entity-name checklist
$> kafka-configs --zookeeper localhost:2181 --alter --entity-type topics --delete-config retention.ms --entity-name checklist
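To confirm the purge worked, compare earliest and latest offsets per partition; when they match, the log is empty. A sketch using the GetOffsetShell tool that ships with Kafka (assuming the Homebrew kafka-run-class wrapper; --time -2 is earliest, -1 is latest):

```shell
# Earliest offset per partition:
kafka-run-class kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic checklist --time -2

# Latest offset per partition:
kafka-run-class kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic checklist --time -1
```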