Spark Streaming 2.2.0 Earliest offset reset

Earliest Offset Reset Test Scenario

Env: Spark 2.2.0 with the Kafka 0.10 streaming integration

./spark-shell --packages org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/
         
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_161)
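
The gist does not include the streaming job itself. Below is a minimal sketch of a job that would print output in the format shown in the scenarios, using the Spark 2.2.0 / Kafka 0.10 direct-stream API. The topic name ('offset-reset-test'), broker address, 10-second batch interval, and the explicit commitAsync call are assumptions, not the original code; committing offsets for the group (explicitly like this, or via auto-commit) is what makes the resume behaviour in Scenario II possible.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val topic = "offset-reset-test"               // assumed topic name
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",   // assumed broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "group1",           // consumer group under test
  "auto.offset.reset"  -> "earliest",         // offset reset policy under test
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val ssc = new StreamingContext(sc, Seconds(10))  // 10s batches, matching the output timestamps below
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Array(topic), kafkaParams))

stream.foreachRDD { (rdd, time) =>
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val values  = rdd.map(_.value.toInt).collect().sorted
  println("-------------")
  println(s"${time.milliseconds} ms -> " +
    (if (values.isEmpty) "empty" else s"${values.head}...${values.last}"))
  println("-------------")
  // Commit the processed offsets so the group can resume from here after a restart (Scenario II)
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsets)
}

ssc.start()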

Scenario I

A client with a new consumer group subscribed to a new topic starts from the 'earliest' offset.

  • Create a new topic.
  • Publish records 10-20 to it (see the producer sketch after this list)
  • Start the streaming job with consumer group 1
  • Observe that the job consumes records 10-20
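
The 'publish records' steps can be driven from the same shell with the plain Kafka producer. This is a sketch with a hypothetical publish helper (topic and broker as assumed above), reused in the later scenarios:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Hypothetical helper: writes the integers first..last as string values to the test topic.
def publish(first: Int, last: Int, topic: String = "offset-reset-test"): Unit = {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")  // assumed broker
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  val producer = new KafkaProducer[String, String](props)
  (first to last).foreach { i =>
    producer.send(new ProducerRecord[String, String](topic, i.toString))
  }
  producer.close()  // flushes pending records before returning
}

publish(10, 20)  // Scenario I: records 10-20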

A new consumer group starts from the earliest available offset on its first run, given the 'earliest' offset reset policy.

-------------
1524397040000 ms -> 10...20
-------------
-------------
1524397050000 ms -> empty
-------------

Break job

Scenario II

An existing consumer group using the 'earliest' offset reset resumes from where it left off last time: offsets committed by the group take precedence over the reset policy (see the check after the output below).

  • Publish records 21-30
  • Restart the job with consumer group 1; offset reset = 'earliest'
  • Observe that the running job consumes records 21-30
-------------
1524397240000 ms -> 21...30
-------------
-------------
1524397250000 ms -> empty
-------------
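
The resume behaviour can be cross-checked from the shell: Kafka only falls back to auto.offset.reset when the group has no committed offset for a partition, so consumer group 1 should already have a committed position after Scenario I. A sketch of that check with the plain consumer API, assuming the group is named 'group1', the topic has a single partition, and the broker is as above:

import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

val checkProps = new Properties()
checkProps.put("bootstrap.servers", "localhost:9092")  // assumed broker
checkProps.put("group.id", "group1")                   // assumed name of "consumer group 1"
checkProps.put("key.deserializer", classOf[StringDeserializer].getName)
checkProps.put("value.deserializer", classOf[StringDeserializer].getName)

val consumer = new KafkaConsumer[String, String](checkProps)
val tp = new TopicPartition("offset-reset-test", 0)    // assumes a single-partition topic
// committed() returns null when the group has never committed an offset for this partition
println(s"committed offset for $tp: ${consumer.committed(tp)}")
consumer.close()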

Break job

Scenario III

A job with a new consumer group and the 'earliest' offset reset starts from the beginning of the topic.

  • Change the consumer group to 'group2'; auto-offset-reset = 'earliest'; restart the job (see the parameter sketch below)
  • Observe that the running job consumes records 10-30
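
The only change for this scenario is the consumer group name; the reset policy stays 'earliest'. A sketch of the adjusted parameters, assuming the kafkaParams map from the job sketch above:

// A brand-new consumer group has no committed offsets, so 'earliest' applies
// and the first batch re-reads the topic from the beginning (records 10-30 so far).
val kafkaParams2 = kafkaParams + ("group.id" -> "group2")
// Re-create the stream in the restarted shell with:
//   Subscribe[String, String](Array(topic), kafkaParams2)
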
-------------
1524397360000 ms -> 10...30
-------------
-------------
1524397370000 ms -> empty
-------------

Scenario IV

(Sanity check) The running job keeps consuming new data.

  • While the job runs, publish records 31-40 (see the snippet below)
  • Observe that the running job consumes records 31-40
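
Publishing the new records with the hypothetical publish helper from Scenario I:

publish(31, 40)  // Scenario IV: records 31-40, written while the job is running
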
-------------
1524397420000 ms -> empty
-------------
-------------
1524397430000 ms -> 31...40
-------------
-------------
1524397440000 ms -> empty
-------------

Break job

Scenario V

A new consumer group with auto-offset-reset = 'latest' starts at the end of the topic.

  • Change the consumer group to 'group3'; change the offset reset to 'latest'; restart the job (see the parameter sketch below)
  • Observe that the job does not consume any data in the first streaming iteration
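
This time both the group name and the reset policy change. A sketch of the adjusted parameters, again assuming the kafkaParams map from the job sketch above:

// New group with no committed offsets and auto.offset.reset = "latest":
// the consumer starts at the current end of the topic, so existing records 10-40 are skipped.
val kafkaParams3 = kafkaParams ++ Map(
  "group.id"          -> "group3",
  "auto.offset.reset" -> "latest"
)
// Re-create the stream in the restarted shell with:
//   Subscribe[String, String](Array(topic), kafkaParams3)
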
-------------
1524397560000 ms -> empty
-------------
-------------
1524397570000 ms -> empty
-------------

Scenario VI

(Sanity check) A running job started with auto-offset-reset = 'latest' consumes new data.

  • While the job with consumer group 'group3' and offset reset = 'latest' runs, publish records 41-50 (see the snippet below)
  • Observe that the running job consumes records 41-50
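
As in Scenario IV, with the hypothetical publish helper:

publish(41, 50)  // Scenario VI: records 41-50, written while the 'latest' job is running
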
-------------
1524397600000 ms -> empty
-------------
-------------
1524397610000 ms -> 41...50
-------------
-------------
1524397620000 ms -> empty
-------------