@dselans
Last active April 26, 2023 23:01
Notes about Pulsar

I compiled these while learning to run Apache Pulsar. The notes are written from the perspective of a Go developer who works primarily with Kafka, RabbitMQ and NATS. Some things may not be entirely correct. Sorry!

  1. Concept of “bundles”
  2. Each broker is in charge of certain bundles
  3. A bundle represents multiple topics
  4. Pulsar has a tunable automatic load shedder
  5. Possible to configure to auto shed bundles to less loaded brokers
  6. Bundles are automatically split when under heavy load
    1. Bundle is split into more bundles
      1. ^ How do you know that has happened?
  7. Pulsar’s built-in dashboard sucks
    1. Does not contain up-to-date information (has 1m+ lag for stats)
    2. Unable to view actual messages
    3. Use pulsar manager instead
    4. UPDATE: Oh, dashboard is deprecated. That explains it.
  8. By default, retention is not enabled for namespaces/topics
    1. Can be set via pulsar-admin or pulsarctl
  9. Partitions
    1. How do partitions per topic work?
      1. Same as Kafka - partitions are a unit of parallelism
    2. By default, a topic is created with 1 partition (known as a "non-partitioned topic")
    3. Partitioned topics must be explicitly created via admin API (pulsarctl etc)
    4. If you use a regular topic (with 1 partition) - you can have as many consumers as you want
    5. If you use a partitioned topic with a Failover subscription, it falls under the same requirements as Kafka: one active consumer per partition. Shared subscriptions are not bound by this.
  10. What is the ledger?
    1. A BookKeeper concept
    2. Pulsar stores data in BookKeeper ledgers
    3. Ledger contains metadata about the topics underneath it
    4. Unlike Kafka, Pulsar does not store data on brokers - data is stored on BookKeeper nodes
    5. Ledger == unit of storage in BookKeeper
  11. Pulsar supports server-side schema registry
    1. Does it support protobuf?
      1. Looks like it is possible to create a producer with protobuf support in golang
      2. Haven’t checked consumer - but probably yes
  12. Connectors
    1. Same as Kafka
      1. “source” == get data INTO pulsar
      2. “sink” == get data OUT of pulsar
  13. Replication factor
    1. Specify the replication factor via pulsarctl when creating the topic
    2. You can update replication factor via pulsarctl
    3. There is no concept of "replication"
    4. Data is stored on bookies (BookKeeper concept) based on satisfying quorum
    5. Possible to disable/enable "replication" per topic (maybe per namespace?) via pulsar admin api
  14. Can you set message speed per topic???
  15. No, doesn't seem like it
  16. What is a “cursor”?
    1. Same as “current offset” in Kafka
    2. You can change the cursor via pulsar admin api
  17. You can define “interceptors” for producers (in golang client)!!!
    1. I.e. a custom roundtripper
    2. Could attach some sort of basic validation - neat!
  18. Has concept of persistent and non-persistent topics
  19. Non-partitioned topics are automatically deleted after 60 seconds of inactivity - nice!
  20. There is no way to convert non-partitioned topics -> partitioned topics... - lame
    1. Have to delete non-partitioned topic and re-create as partitioned topic
  21. Pulsar comes with pulsar-perf — a tool to test pulsar performance
  22. Pulsar has a concept of “transactions”
  23. Think DB transactions - emit 5 messages as part of the same atomic operation.
  24. Pulsar golang client
    1. The pulsar golang client lib by default sets a 64MB memory limit!!!! - nice!
    // Limit of client memory usage (in byte). The 64M default can guarantee a high producer throughput
    // Config less than 0 indicates off memory limit.
    MemoryLimitBytes int64
    
    1. It gets better, client.CreateProducer() returns an interface already - niiiiice
    2. The native Go client is really well thought out. Has batching, has chunking support, has automatic (tunable) broker reconnect, has schema (incl. protobuf) and much more.
    3. It is possible to deliver messages with a delay or delayAt!
  25. Downside - the library does not contain management functionality (i.e. can't manage topics, subscriptions, etc.) - need to use another lib (pulsarctl)
  26. Pulsar has server-side dedupe
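
Going back to bundles (items 1-6 above): a rough sketch of how topic-to-bundle lookup and bundle splitting could work, assuming a bundle is a contiguous range of a 32-bit hash space and a topic belongs to whichever range its hashed name falls into. Everything here is made up for illustration and is not Pulsar's actual code: the `Bundle` and `Namespace` types, `NewNamespace`, `BundleFor`, `Split`, the broker names, and CRC32 as a stand-in hash.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sort"
)

// Bundle is a half-open range [Lo, Hi) of the 32-bit hash space served by one broker.
// uint64 is used so that Hi can hold 1<<32.
type Bundle struct {
	Lo, Hi uint64
	Broker string
}

// Namespace holds bundles sorted by Lo that cover [0, 1<<32) without gaps.
type Namespace struct {
	Bundles []Bundle
}

// NewNamespace cuts the hash space into n equal bundles and assigns them
// to the given brokers round-robin (hypothetical assignment policy).
func NewNamespace(n int, brokers []string) *Namespace {
	const space = uint64(1) << 32
	ns := &Namespace{}
	for i := 0; i < n; i++ {
		ns.Bundles = append(ns.Bundles, Bundle{
			Lo:     space * uint64(i) / uint64(n),
			Hi:     space * uint64(i+1) / uint64(n),
			Broker: brokers[i%len(brokers)],
		})
	}
	return ns
}

// hashTopic maps a topic name into the hash space (CRC32 is an assumption).
func hashTopic(topic string) uint64 {
	return uint64(crc32.ChecksumIEEE([]byte(topic)))
}

// indexFor returns the index of the bundle whose range contains the topic's hash.
func (ns *Namespace) indexFor(topic string) int {
	h := hashTopic(topic)
	return sort.Search(len(ns.Bundles), func(i int) bool { return ns.Bundles[i].Hi > h })
}

// BundleFor returns the bundle that currently owns the topic.
func (ns *Namespace) BundleFor(topic string) Bundle {
	return ns.Bundles[ns.indexFor(topic)]
}

// Split replaces bundle i with two halves of its range; both halves start
// out on the same broker as the original bundle.
func (ns *Namespace) Split(i int) {
	b := ns.Bundles[i]
	if b.Hi-b.Lo < 2 {
		return // range too small to split
	}
	mid := b.Lo + (b.Hi-b.Lo)/2
	out := append([]Bundle{}, ns.Bundles[:i]...)
	out = append(out,
		Bundle{Lo: b.Lo, Hi: mid, Broker: b.Broker},
		Bundle{Lo: mid, Hi: b.Hi, Broker: b.Broker})
	ns.Bundles = append(out, ns.Bundles[i+1:]...)
}

func main() {
	ns := NewNamespace(4, []string{"broker-a", "broker-b"})
	topic := "persistent://public/default/orders"
	fmt.Printf("before split: %d bundles, topic on %s\n", len(ns.Bundles), ns.BundleFor(topic).Broker)
	ns.Split(ns.indexFor(topic))
	b := ns.BundleFor(topic)
	fmt.Printf("after split: %d bundles, topic in [%#x, %#x) on %s\n", len(ns.Bundles), b.Lo, b.Hi, b.Broker)
}
```

In this sketch a topic always resolves to exactly one bundle, before and after a split, and both halves stay on the original broker. Moving one of the halves to a less loaded broker would be the load shedder's job (items 4-5).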

Update 04.26.2023

Recent findings after writing producer and consumer code:

  1. For high-speed production, async send works best
  2. To avoid hitting send timeouts, set SendTimeout (in ProducerOptions) when instantiating the producer
  3. The golang lib doesn't include admin API support - need to use a separate lib if you want to programmatically manage topics, subscriptions, etc.
  4. Not able to produce more than ~6K/s on a default k8s deployment (3 node cluster via official helm) -- something needs to be tuned probably
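
One way to see why async send wins (item 1): with a synchronous send, only one message per producer is in flight at a time, so throughput is capped at roughly one message per broker round trip. Async send lets many messages be in flight at once. A back-of-the-envelope sketch follows. The `maxRate` helper, the 2ms round trip and the in-flight count of 1000 are made-up illustration values, not measurements from the cluster above.

```go
package main

import (
	"fmt"
	"time"
)

// maxRate returns a rough upper bound on messages/second for one producer,
// given how many unacknowledged messages may be in flight and the round-trip
// time of one acknowledgement. It ignores batching, payload size and broker limits.
func maxRate(inFlight int, rtt time.Duration) int64 {
	if inFlight <= 0 || rtt <= 0 {
		return 0
	}
	return int64(inFlight) * int64(time.Second) / int64(rtt)
}

func main() {
	rtt := 2 * time.Millisecond // hypothetical round trip to the broker
	fmt.Println("sync send (1 in flight):    ", maxRate(1, rtt), "msg/s")
	fmt.Println("async send (1000 in flight):", maxRate(1000, rtt), "msg/s")
}
```

This is only a ceiling. Real numbers depend on batching, message size and how the brokers and bookies are tuned.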

Rudimentary Pulsar producer code here: https://github.com/batchcorp/event-generator/blob/main/output/pulsar.go#L41
