Skip to content

Instantly share code, notes, and snippets.

@shiv4289
Last active May 2, 2019 07:35
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save shiv4289/1618580ec7a8195646806950ae791b2f to your computer and use it in GitHub Desktop.
Save shiv4289/1618580ec7a8195646806950ae791b2f to your computer and use it in GitHub Desktop.

Terminology

Bookies manage data in a log-structured way, which is implemented using three kind of files:

  1. Journal : A journal file contains the BookKeeper transaction logs. Before any update takes place, a bookie ensures that a transaction describing the update is written to non-volatile storage. A new journal file is created once the bookie starts or the older journal file reaches the journal file size threshold.

  2. Entry Log : An entry log file manages the written entries received from BookKeeper clients. Entries from different ledgers are aggregated and written sequentially, while their offsets are kept as pointers in LedgerCache for fast lookup. A new entry log file is created once the bookie starts or the older entry log file reaches the entry log size threshold. Old entry log files are removed by the Garbage Collector Thread once they are not associated with any active ledger.

  3. Index File : An index file is created for each ledger, which comprises a header and several fixed-length index pages, recording the offsets of data stored in entry log files. Since updating index files would introduce random disk I/O, for performance consideration, index files are updated lazily by a Sync Thread running in the background. Before index pages are persisted to disk, they are gathered in LedgerCache for lookup.

  4. LedgerCache : A memory pool caches ledger index pages, which more efficiently manage disk head scheduling.

Recovery

An application first creates a ledger before writing to bookies through a local BookKeeper client instance. Upon creating a ledger, a BookKeeper client writes metadata about the ledger to ZooKeeper. Each ledger currently has a single writer. This writer has to execute a close ledger operation before any other client can read from it. If the writer of a ledger does not close a ledger properly because, for example, it has crashed before having the opportunity of closing the ledger, then the next client that tries to open a ledger executes a procedure to recover it. As closing a ledger consists essentially of writing the last entry written to a ledger to ZooKeeper, the recovery procedure simply finds the last entry written correctly and writes it to ZooKeeper.

Note that currently this recovery procedure is executed automatically upon trying to open a ledger and no explicit action is necessary. Although two clients may try to recover a ledger concurrently, only one will succeed, the first one that is able to create the close znode for the ledger.

https://bookkeeper.apache.org/archives/docs/r4.2.1/bookkeeperOverview.html apache/bookkeeper#1193

APIs

A set of bookies implements BookKeeper, and we use a quorum-based protocol to replicate data across the bookies. There are basically two operations to an existing ledger: read and append. Here is the complete API list:

  1. Create ledger: creates a new empty ledger;
  2. Open ledger: opens an existing ledger for reading;
  3. Add entry: adds a record to a ledger either synchronously or asynchronously;
  4. Read entries: reads a sequence of entries from a ledger either synchronously or asynchronously

There is only a single client that can write to a ledger. Once that ledger is closed or the client fails, no more entries can be added. (We take advantage of this behavior to provide our strong guarantees.)

Add Entry

When a bookie receives entries from clients to be written, these entries will go through the following steps to be persisted to disk:

  1. Append the entry in Entry Log, return its position { logId , offset } ;
  2. Update the index of this entry in Ledger Cache ;
  3. Append a transaction corresponding to this entry update in Journal ;
  4. Respond to BookKeeper client ; For performance reasons, Entry Log buffers entries in memory and commit them in batches, while Ledger Cache holds index pages in memory and flushes them lazily. We will discuss data flush and how to ensure data integrity in the following section 'Data Flush'.

Use cases

A simple use of BooKeeper is to implement a write-ahead transaction log. A server maintains an in-memory data structure (with periodic snapshots for example) and logs changes to that structure before it applies the change. The application server creates a ledger at startup and store the ledger id and password in a well known place (ZooKeeper maybe). When it needs to make a change, the server adds an entry with the change information to a ledger and apply the change when BookKeeper adds the entry successfully. The server can even use asyncAddEntry to queue up many changes for high change throughput. BooKeeper meticulously logs the changes in order and call the completion functions in order.

When the application server dies, a backup server will come online, get the last snapshot and then it will open the ledger of the old server and read all the entries from the time the snapshot was taken. (Since it doesn't know the last entry number it will use MAX_INTEGER). Once all the entries have been processed, it will close the ledger and start a new one for its use.

Bookkeeper Entries

BookKeeper is a replicated service to reliably log streams of records. In BookKeeper, servers are "bookies", log streams are "ledgers", and each unit of a log (aka record) is a "ledger entry"

A client library takes care of communicating with bookies and managing entry numbers. An entry has the following fields:

  1. Field Type Description
  2. Ledger number long The id of the ledger of this entry
  3. Entry number long The id of this entry
  4. last confirmed ( LC ) long id of the last recorded entry
  5. data byte[] the entry data (supplied by application)
  6. authentication code byte[] Message authentication code that includes all other fields of the entry

The client library generates a ledger entry. None of the fields are modified by the bookies and only the first three fields are interpreted by the bookies.

To add to a ledger, the client generates the entry above using the ledger number. The entry number will be one more than the last entry generated. The LC field contains the last entry that has been successfully recorded by BookKeeper. If the client writes entries one at a time, LC is the last entry id. But, if the client is using asyncAddEntry, there may be many entries in flight. An entry is considered recorded when both of the following conditions are met:

  1. the entry has been accepted by a quorum of bookies
  2. all entries with a lower entry id have been accepted by a quorum of bookies

Once all the other fields have been field in, the client generates an authentication code with all of the previous fields. The entry is then sent to a quorum of bookies to be recorded. Any failures will result in the entry being sent to a new quorum of bookies.

To read, the client library initially contacts a bookie and starts requesting entries. If an entry is missing or invalid (a bad MAC for example), the client will make a request to a different bookie. By using quorum writes, as long as enough bookies are up we are guaranteed to eventually be able to read an entry.

Bookkeeper Metadata Management

There are some meta data that needs to be made available to BookKeeper clients:

  1. The available bookies;
  2. The list of ledgers;
  3. The list of bookies that have been used for a given ledger;
  4. The last entry of a ledger;

We maintain this information in ZooKeeper. Bookies use ephemeral nodes to indicate their availability. Clients use znodes to track ledger creation and deletion and also to know the end of the ledger and the bookies that were used to store the ledger. Bookies also watch the ledger list so that they can cleanup ledgers that get deleted.

Bookkeeper Guarantees

  1. If an entry has been successfully recorded, it must be readable.
  2. If an entry is read once, it must always be available to be read.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment