To increase the likelihood that our planned WAL interfaces make sense, I'll step through how we get from the current APIs to the new ones, and how we build on them to make the other WAL improvements we have planned. At the moment, I'd like this to target branch-1 / master.

Step 1, clean up WAL interface (HBASE-10378 / HBASE-8610)

Current interface/class relationship in 0.98 / branch-1 / master

HLog <-- is_a -- FSHLog

becomes two interfaces plus the current implementation.

WAL covers simple interactions like append/sync/roll/close; FSHLog keeps our current RingBuffer logic.

WAL  <-- is_a -- FSHLog

WALProvider gives a WAL given an opaque identifier (we'll use region encoded names). We'll consolidate the current interaction between WAL and FS in a default provider that will rely on a single FSHLog instance.

WALProvider <-- is_a -- DefaultWALProvider
                                |
                              getWal(*)
                                |
                               \/
                             FSHLog
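
As a rough sketch of where this lands (signatures here are illustrative, not the final API; WALEdit is the existing edit class):

    import java.io.IOException;

    interface WAL {
      // append an edit for later sync; returns a sequence number
      long append(WALEdit edit) throws IOException;
      void sync() throws IOException;
      byte[][] rollWriter() throws IOException;
      void close() throws IOException;
    }

    interface WALProvider {
      // the identifier is opaque; we'll pass region encoded names
      WAL getWal(byte[] identifier) throws IOException;
      void close() throws IOException;
    }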

WALFactory serves as the injection point so we can change out the backing WALProvider without changing the rest of the code base.

HRegionServer uses WALFactory to instantiate general + meta WALs and passes them to Regions on open, similar to current behavior (with some TODOs in HRegionServer taken care of). The end result obviates HBASE-8610.

Current behavior is maintained because WALFactory always returns the same FSHLog.
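
Illustrative wiring only (the factory method names here are assumptions):

    // in HRegionServer startup; conf and serverName as usual
    WALFactory factory = new WALFactory(conf, serverName);
    WAL metaWal = factory.getMetaWAL();  // for hbase:meta
    WAL wal = factory.getWAL(regionInfo.getEncodedNameAsBytes());
    // with the default provider both calls reach the same FSHLog,
    // so current behavior is unchanged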

This step might cause API breaks in WAL management tools and coprocessors, due to class name changes for the former and HLog leaking into the latter. See the discussion on RB for HBASE-10378 about what level of effort we should use to avoid this.

Step 2, create RegionGroupingProvider and a starting strategy (HBASE-5699 / HBASE-6981)

Uses a RegionGroupingStrategy that maps Regions to keys for WALs

WALProvider <-- is_a -- RegionGroupingProvider
                               |
                           delegates_to
                               |
                              \/
                          WALProvider[]

RegionGroupingProvider creates a (configurable) number of WALProviders, say X, when it is created. To determine which provider a given WALEdit goes to, it'll rely on a RegionGroupingStrategy to provide f(WALEdit) -> [0, X).

Initially we'll provide StickyRoundRobinGroupingStrategy; essentially a reimplementation of HBASE-6981 from 0.89-fb: given a configuration parameter of X WALs, just assign regions to them using a one-up counter and modulus.
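
A minimal sketch of that strategy (the interface shape is an assumption, and it's keyed here by region encoded name rather than the full WALEdit for brevity):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.atomic.AtomicInteger;

    interface RegionGroupingStrategy {
      int group(String regionEncodedName);
    }

    class StickyRoundRobinGroupingStrategy implements RegionGroupingStrategy {
      private final int numGroups;                        // the X above
      private final AtomicInteger counter = new AtomicInteger(0);
      private final ConcurrentMap<String, Integer> assignments =
          new ConcurrentHashMap<String, Integer>();

      StickyRoundRobinGroupingStrategy(int numGroups) {
        this.numGroups = numGroups;
      }

      // first sighting of a region assigns the next slot; later calls stick
      public int group(String regionEncodedName) {
        Integer slot = assignments.get(regionEncodedName);
        if (slot == null) {
          Integer candidate = counter.getAndIncrement() % numGroups;
          slot = assignments.putIfAbsent(regionEncodedName, candidate);
          if (slot == null) {
            slot = candidate;
          }
        }
        return slot;
      }
    }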

At least for now, the RegionGroupingProvider will do a straight pass-through of everything. Any optimizations (like buffering edits until sync) will be left to another WAL implementation. That way they can be checked both before the grouping (by wrapping this implementation) or after the grouping (by being the provider that gets wrapped by this implementation) once we reach Step 7.

To maintain current behavior, we'll configure RegionGroupingProvider to use DefaultWALProvider with X=1.

Step 3, pluggable (HBASE-4529)

Instead of hard coding the above, make them configurable and default them to what we had hard coded.
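
For example, the factory could pull the provider class from configuration and fall back to today's hard-coded default (the configuration key here is hypothetical):

    // in WALFactory; conf is the usual Configuration
    Class<? extends WALProvider> clazz = conf.getClass(
        "hbase.wal.provider", DefaultWALProvider.class, WALProvider.class);
    WALProvider provider = clazz.newInstance();  // plus the usual reflection exception handling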

Up to this point, everything ultimately writes protobufs to the filesystem. How do we go from the current entry writer configuration (which tells us whether the entries in a WAL are written encrypted or not) to knowing which WALProvider was configured when the file was written? St.Ack suggested sticking with metadata at the head of the file.
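
For instance, the protobuf WALHeader we already write at the head of each file could carry the writer or provider class (the field shown is illustrative):

    // written once at the head of each file by the provider's writer
    WALHeader.Builder header = WALHeader.newBuilder()
        .setWriterClsName(writer.getClass().getName());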

Can you change WALProviders while recovery is still needed from a previous provider? For example, if we are doing a rolling restart and all of the RegionServers with a particular WALProvider needed for recovery are gone, what do we do? I think this won't be a problem until we have something that requires a different WAL Reader/Writer, like a ZK-backed WAL.

I think the answer to the above is that we involve the WALFactory in recovery and give it a means to ask each WALProvider configured on a system "can you recover this region?", kind of like the way Hadoop's io.serialization.Serialization.accept works. Implementations that just wrap others could then respond "no" and let things fall through to the basic file implementation.
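
Sketched out (the method names here, including createReader, are assumptions):

    // an accept-style hook on WALProvider, analogous to
    // Hadoop's Serialization.accept(Class)
    boolean canRecover(Path walFile) throws IOException;

    // then in WALFactory during recovery:
    for (WALProvider provider : configuredProviders) {
      if (provider.canRecover(walFile)) {
        return provider.createReader(walFile);
      }
    }
    throw new IOException("no configured WALProvider can recover " + walFile);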

Step 4, quantify trade offs

Things to measure: write throughput, write latency distribution, MTTR.

How many disks should we presume are available for writing? Do metrics look okay, or do we need to add some? Does the additional number of (shorter) split tasks impact MTTR?

Details here should help inform what changes from current behavior we enable by default.

Step 5, isolate in WAL module (new ticket)

Before we attempt further improvements (like improving recovery), we should move all the WAL related code into a module and clean it up. Right now there's tons of abstraction leakage across recovery work and replication. A module will give a more obvious boundary. In addition to letting us know what might be general purpose for the WAL, it'll make it easier for more aggressive WAL reimplementations to minimize changes outside of the WAL (e.g. HBASE-12259).

Step 6, allow recovery to leverage RegionGroupingStrategy (new ticket)

If the RegionGroupingStrategy is being used (even if wrapped by another WAL implementation), we should be able to use that in recovery to assign all of the regions that are grouped together in a given WAL to the same RS. I believe this means we can avoid splitting such a WAL entirely. I think we'll need to update assignment so that sets of regions can be assigned at once to leverage this.
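
A rough sketch of the shape this could take (every name here is hypothetical):

    // during recovery of a dead server, handled per surviving WAL
    Set<HRegionInfo> group = groupingStrategy.regionsInGroup(walFile);
    ServerName target = pickRecoveryTarget(group);
    // batch assignment keeps the whole group together, so the WAL can
    // be replayed in place rather than split
    assignmentManager.assign(target, group);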

Step 7, turn RingBuffer implementation into a wrapping implementation

(This was originally Step 2. However, wrapping the RingBufferProvider with the RegionGroupingProvider gives less lock contention than vice versa.)

Our current implementation:

WAL <-- is_a -- FSHLog

is actually doing two things: dealing with contention via the ring buffer and dealing with HDFS. In this step it becomes two WAL implementations, with one set to use the other as an opaque interface.

First, the part of FSHLog that deals with concurrency becomes a RingBufferWAL. Where the disruptor client currently writes to an HDFS client, it will instead write to a WAL pulled out of a delegate WALProvider.

WAL <-- is_a -- RingBufferWAL
                    |
                delegates_to
                    |
                   \/
                 WALProvider

Second, the part of FSHLog that deals with HDFS becomes a SimpleFSWAL.

WAL <-- is_a -- SimpleFSWAL

Current behavior is maintained by configuring RingBufferWAL to use a provider of SimpleFSWAL as the delegate.
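
Roughly (this elides the disruptor itself; the constructor and names are assumptions, building on the interfaces sketched in Step 1):

    import java.io.IOException;

    class RingBufferWAL implements WAL {
      private final WAL delegate;

      RingBufferWAL(WALProvider delegateProvider, byte[] identifier)
          throws IOException {
        this.delegate = delegateProvider.getWal(identifier);
      }

      public long append(WALEdit edit) throws IOException {
        // really: enqueue on the ring buffer; the disruptor consumer
        // makes this call once the event is handled
        return delegate.append(edit);
      }

      public void sync() throws IOException {
        delegate.sync();
      }

      public byte[][] rollWriter() throws IOException {
        return delegate.rollWriter();
      }

      public void close() throws IOException {
        delegate.close();
      }
    }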

Part of this cleanup should be changing the entry writer implementations (secure vs. non-secure; see the HBase book section 8.7, Transparent Server Side Encryption, on encrypting WALs): turn them into first-class WAL implementations to keep the number of configuration points down. We can make this change without breaking compatibility by deprecating the configuration names for the WAL writer implementations but still using them to build WALProviders that understand the same format as the reader/writers from those implementations.

Step 8, create PipelineSwitchingWAL (HBASE-10278)

Presumably we will have the tools to easily insert functionality and test improvements at this point.

WAL <-- is_a -- PipelineSwitchingWAL

Since this is the first additional FileSystem based WAL, we might find common pieces to pull out of SimpleFSWAL here.

Depending on what HDFS client details we need to ensure the hot-swappable writer doesn't get a pipeline on the same nodes, we might want to keep this just on later branches to avoid gymnastics. Keeping this as a separate implementation we can swap in also makes it easier for someone integrating a filesystem that isn't actually HDFS, since they can keep using SimpleFSWAL.

busbey commented Aug 27, 2014:

Please make sure comments go on the ASF review board for this issue.
