Both Yahoo Pulsar and Apache DistributedLog are built on Apache BookKeeper. They have different focuses but share many similarities in design principles and implementation details.
Pulsar is a full-fledged pub/sub messaging system that provides a very flexible messaging model, while DistributedLog focuses on building a replicated log store that offers the replicated log as a storage primitive for other applications and systems. In theory, Pulsar could use DistributedLog to build its messaging system.
Internally, Pulsar built a library called 'ManagedLedger' for interacting with Apache BookKeeper. ManagedLedger's implementation has much in common with DistributedLog's. The two are compared below:
Feature | ManagedLedger | DistributedLog |
---|---|---|
Read/Write Semantics | Single writer, Single reader | Single writer, Multiple readers |
Tailing Read Semantics | No tailing read semantics | Supports tailing reads. Applications don't have to close a log to read its data. |
Layout | A ManagedLedger is comprised of a list of Ledgers. | A Log is comprised of a list of Segments. Segment is the storage abstraction of a Ledger. |
Cursor | A ManagedLedger also maintains a list of Cursors. Each Cursor represents the consume point of a consumer. The updates of a Cursor are stored in a Ledger. | DistributedLog doesn't maintain any Cursors. |
Data Retention | Data written before the Cursors can be deleted, or expired after the configured time. | Data can be deleted by explicit truncation, or expired after a configured time. |
This proposal is to merge ManagedLedger into DistributedLog and co-develop the Replicated Log library for common usage.
The proposed interface for the merged ManagedLedger and DistributedLog will consist of two parts: a Log interface and a Cursor interface.
The Log interface will be based on the DistributedLog Log interface.
It includes the following operations:
- create log
- delete log
- open a writer to write records to the log
- open a reader to read records from the log
- be able to append records to the log synchronously or asynchronously
- be able to truncate the log either explicitly or based on a configured time period
- be able to read from a provided position in the log
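As a rough illustration of the operations listed above, the following is a minimal in-memory sketch. Every name in it (`SketchLog`, `InMemorySketchLog`, `truncateBefore`, `expireOlderThan`) is hypothetical and is not taken from the DistributedLog or ManagedLedger APIs. Log creation and deletion and separate writer/reader handles are left out, and positions are assumed to be simple increasing longs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical interface; names are illustrative only, not the DistributedLog API.
interface SketchLog {
    long append(byte[] record);                            // synchronous append, returns the record's position
    CompletableFuture<Long> appendAsync(byte[] record);    // asynchronous append
    List<byte[]> readFrom(long position, int maxRecords);  // read starting at a provided position
    void truncateBefore(long position);                    // explicit truncation
    void expireOlderThan(long cutoffMillis);               // time-based expiration
    long firstPosition();                                  // oldest position still retained
    long nextPosition();                                   // position the next append will receive
}

// Assumption: a single JVM-local list stands in for the replicated storage.
final class InMemorySketchLog implements SketchLog {
    private final List<byte[]> records = new ArrayList<>();
    private final List<Long> timestamps = new ArrayList<>();
    private long firstPosition = 0; // position of records.get(0)

    @Override
    public synchronized long append(byte[] record) {
        return append(record, System.currentTimeMillis());
    }

    // Package-private overload so a caller can control the timestamp.
    synchronized long append(byte[] record, long timestampMillis) {
        records.add(record.clone());
        timestamps.add(timestampMillis);
        return firstPosition + records.size() - 1;
    }

    @Override
    public CompletableFuture<Long> appendAsync(byte[] record) {
        return CompletableFuture.supplyAsync(() -> append(record));
    }

    @Override
    public synchronized List<byte[]> readFrom(long position, int maxRecords) {
        // Reads that start before the retained range begin at the first retained record.
        long start = Math.max(position, firstPosition);
        List<byte[]> out = new ArrayList<>();
        for (long p = start; p < nextPosition() && out.size() < maxRecords; p++) {
            out.add(records.get((int) (p - firstPosition)));
        }
        return out;
    }

    @Override
    public synchronized void truncateBefore(long position) {
        while (firstPosition < position && !records.isEmpty()) {
            dropOldest();
        }
    }

    @Override
    public synchronized void expireOlderThan(long cutoffMillis) {
        while (!records.isEmpty() && timestamps.get(0) < cutoffMillis) {
            dropOldest();
        }
    }

    @Override
    public synchronized long firstPosition() {
        return firstPosition;
    }

    @Override
    public synchronized long nextPosition() {
        return firstPosition + records.size();
    }

    private void dropOldest() {
        records.remove(0);
        timestamps.remove(0);
        firstPosition++;
    }
}
```

Keeping both truncation paths on the same interface mirrors the two default retention policies; a real implementation would truncate at segment granularity rather than record by record.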
The Cursor interface will be based on the ManagedCursor interface. A cursor indicates the position of a reader that is reading from the log.
A Cursor is a reader with position/offset tracking. A cursor supports several operations:
- be able to read next records after this cursor
- be able to seek and rewind the cursor
- be able to mark deletion on a cursor
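The cursor operations above could look roughly like the sketch below. `SketchCursor` and its methods are hypothetical names, not the ManagedCursor API. The sketch assumes that a plain `List<String>` stands in for the log, that the mark-delete position means "every record before this position has been acknowledged", and that rewinding returns the read position to the mark-delete position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical cursor; illustrative only, not the ManagedCursor API.
final class SketchCursor {
    private final List<String> log;    // stands in for the log being read
    private long readPosition;         // index of the next record to return
    private long markDeletePosition;   // records before this index are acknowledged

    SketchCursor(List<String> log) {
        this.log = log;
    }

    // Read up to maxRecords records after the cursor and advance it.
    List<String> readNext(int maxRecords) {
        List<String> out = new ArrayList<>();
        while (out.size() < maxRecords && readPosition < log.size()) {
            out.add(log.get((int) readPosition));
            readPosition++;
        }
        return out;
    }

    // Move the read position to an arbitrary position within the log.
    void seek(long position) {
        if (position < 0 || position > log.size()) {
            throw new IllegalArgumentException("position out of range: " + position);
        }
        readPosition = position;
    }

    // Replay everything that was read but not yet marked for deletion.
    void rewind() {
        readPosition = markDeletePosition;
    }

    // Acknowledge all records before the given position; never moves backwards.
    void markDelete(long position) {
        if (position > readPosition) {
            throw new IllegalArgumentException("cannot mark-delete unread records");
        }
        markDeletePosition = Math.max(markDeletePosition, position);
    }

    long markDeletePosition() {
        return markDeletePosition;
    }
}
```

Separating the read position from the mark-delete position is what lets a consumer re-read unacknowledged records after a restart or an explicit rewind.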
The default Log retention policies (without any cursors) will remain the same as in DistributedLog:
- Explicit Truncation
- Time-Based Expiration
Cursor management will use Explicit Truncation to truncate the Log and thereby satisfy cursor-based retention.
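One plausible way to derive the truncation point for cursor-based retention is to take the smallest mark-delete position across all cursors, since nothing before it is still needed by any consumer. The sketch below assumes positions are comparable longs; `CursorRetention` and `safeTruncationPoint` are hypothetical names and do not come from either library.

```java
import java.util.Collection;
import java.util.OptionalLong;

// Hypothetical helper; illustrative only.
final class CursorRetention {
    private CursorRetention() {}

    // Returns the position before which the log can be explicitly truncated,
    // or an empty result when there are no cursors. In the empty case the
    // default policies (explicit truncation, time-based expiration) apply.
    static OptionalLong safeTruncationPoint(Collection<Long> markDeletePositions) {
        return markDeletePositions.stream()
                .mapToLong(Long::longValue)
                .min();
    }
}
```

For example, with cursors whose mark-delete positions are 7, 3, and 12, the log could be truncated up to position 3, because the slowest cursor still needs everything from 3 onwards.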
The proposed changes will be:
1. Import the ManagedLedger code into DistributedLog.
2. Add the Cursor interface to the existing DistributedLog library and improve the Log interface. The new set of APIs should support both the ML and DLog feature sets.
3. Include the two current implementations of the Log and Cursor interfaces: the current ML implementation and the current DLog library implementation.
4. Release DLog with the new Cursor and Log API.
5. Pulsar will use the DLog API with the existing ML implementation.
6. Eventually merge the two implementations into one and provide a seamless upgrade for users of both implementations.
7. Pulsar can then leverage features like tailing reads to support read-only brokers, live topic migration, and so on.
- Through step 5, there is no real migration.
- At step 6, a backward compatible upgrade will be applied.