This document explains the infrastructural changes required in GlusterFS to support various data compliance features. Data management is a generic term that covers filesystem data handling and management activities such as locality-aware data placement, data tiering, BitRot detection and the like. The operational mechanism of these features is more or less similar with respect to the input operation set being worked on. Additionally, and more importantly, the order of operations (or traces) tends to be much more relaxed in nature, unlike replication, which relies on strict ordering of operations for correctness.
This document is split into two parts. The first part elaborates on the infrastructure design required for the correct functioning of various data classification mechanisms. Requirements for each sub-feature are presented briefly and correctness is argued as part of the design. Thereafter, the nature of changes for each component is listed and links to the appropriate task breakup etherpads are provided. The data classification mechanisms mentioned below are explained in this document.
Readers are recommended to read through the feature pages to get a good understanding of each feature. The rest of the document covers the use cases, design (and design decisions) and workflow for each of the classification mechanisms.
NOTE: As of now this document only covers the infrastructure design.
GlusterFS already has an effective change detection mechanism in the form of the changelog translator (also called the journal xlator). The changelog translator (xlator) is a kind of write-ahead logging mechanism that persists file operation (FOP) information in flat files (called journals). Journal records are just markers (or hints) signifying that a particular operation has been performed. Although journal records are merely hints, they carry enough information to be made sense of. GlusterFS geo-replication makes use of the change-logging infrastructure to perform remote replication of data. As of now the journal record format is very much tailored for this feature. As more consumers of the changelog come into existence, the record format needs to become generic enough to provide a single truthful source of filesystem changes.
The changelog translator alone does not fully solve the problem of identifying filesystem changes, as it is only a persistent data store of events/traces. There needs to be a mechanism to relay the persisted information to interested applications. libgfchangelog provides a mechanism to relay notifications on an interval basis, the interval being selected/configured in the changelog xlator. Although this mechanism for receiving filesystem notifications is effective, there are still areas of improvement:
- Rollover interval: This tunable controls the granularity (in seconds) at which changes can be relayed to the application (publishing). Sadly, this tunable is global and cannot be tuned on a per-application basis. This tight coupling proves to be harmful when there are multiple parties interested in receiving filesystem notifications: increasing/decreasing the value affects all parties. Moreover, such a mechanism induces consumption delays and user-perceivable lags. On a side note, if the application (consumer) consumes the changes at a lower rate than they are produced (producer), the notifications are queued. Since a notification is nothing but a set of processed changelog files (readable/understandable and persisted in a well-known directory structure), the queue can grow to a large size (as large as the filesystem on which the directory lives supports).
- Programming model: libgfchangelog uses a mix of "poll and notify" to relay filesystem changes (changelogs in particular) to the application. Although efficient enough, there is still room to improve performance by using a purely callback-based notification facility. Moreover, the application consuming the changelogs needs to be aware of the record format.
- Crash consistency: Journal records are persisted in the callback path, i.e., after the completion of the operation. This opens up a window where there can be missing journal entries in the event of a crash after the operation is successfully completed (and acknowledged to the client) but not yet persisted in the journal, since journal updates are not durable (for performance reasons). Applications relying solely on journals face the issue of missed notifications. To counter this, a one-time optimized filesystem crawl needs to be performed in order to collect the missing updates. But this comes at a cost. More on this in the Design section.
The above mentioned constraints restrict wide adoption by applications. The next section explains the changes to the current design that eliminate these constraints. Subsequent sections go into detail on how various data classification mechanisms (and other applications) benefit from the improved design. Later in this document, the changelog record format is discussed, along with how it might change in the future to become descriptive (and hence generic) enough for any consumer, such as synchronous replication and the like.
GlusterFS changelogs are flat files that record filesystem changes and are a form of metadata journaling. But unlike regular filesystem journals (which reuse the journal after successfully persisting filesystem data/metadata), they are made available for anyone to consume and make sense of. This implies journals are never reused/truncated.
Changelog updates are non-durable. This is done to avoid the latencies involved with synchronous/durable writes. Thus, a power failure may result in missed updates to the journal. A fully crash-consistent changelog would require logging in both the call path ("PRE-OP + journal data") and the callback path ("POST-OP"). Moreover, updates would need to be synchronous to guarantee crash consistency, which is very much the requirement for log-based synchronous replication (such as GlusterFS NSR). With NSR in the picture, things change a bit: POST-OP marking is not performed in the callback path, as data may still reside in the page cache unless synced (the pseudo-hole concept in the NSR spec). Designing changelog's record format with NSR in mind would be future proofing, but the following items need to be churned out before finalizing:
- Performance: Synchronous writes are expensive. Sync writes on spinning disks induce a huge performance penalty, especially when data and changelog(s) reside on the same filesystem. Having changelogs reside on an SSD would be the ideal way to achieve crash consistency without compromising performance.
- Development effort: Since NSR is still in its beta form (as of this writing) and is not the default replication strategy, is there a need to tie the record format to it now, or is it beneficial to have an accommodating design and fill the gaps later without resorting to invasive changes?
The rest of the document assumes journal updates to be non-durable and a non-descriptive record format. Changes required for an NSR-friendly record format are not covered in this document.
Due to non-durable journal updates, there needs to be a mechanism to find missed changelog entries after a node crash or during the time interval when change-logging was disabled. Logging updates in the call path might help in reducing the crash-consistency window to some extent, but there is no guarantee that it will. GlusterFS geo-replication relies on an optimized filesystem crawl to gather any missed updates. To avoid each consumer duplicating this effort, the filesystem crawl mechanism would be a part of libgfchangelog, and an interface would be provided to initiate the scan and collect updates for a specified time interval (a rough sketch of such an interface follows). Such a scan does not detect unlinks in a tree, which should be acceptable for most consumers (except those which rely on logs for lazy replication).
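A minimal sketch of what such a scan interface could look like is shown below. The function names, signatures and semantics are assumptions made purely for illustration; the actual libgfchangelog API is not finalized in this document.

    #include <sys/types.h>

    /* Hypothetical sketch: crawl the brick backend and synthesize
     * changelog-style entries for changes that fall within the
     * [start, end] time interval. Returns 0 on success, -1 on error. */
    int gf_changelog_scan_interval (char *brick,
                                    unsigned long start, unsigned long end);

    /* Hypothetical sketch: copy the next batch of synthesized entries
     * into @buf. Returns the number of bytes copied, 0 when the crawl
     * is exhausted, -1 on error. */
    ssize_t gf_changelog_scan_next (char *buf, size_t len);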
The changelog translator, in its current form, publishes changelogs (to consumers) after a configured time interval (rollover-time). The new design does away with this kind of interval-based publishing. Changelogs are ever growing (preallocated, maybe rolled over each day), recording changes in a binary format (ASCII for non-production workloads), thus reducing the number of bytes logged per record. Since there is no notification as such, consumers interact with the changelog (on the brick) via RPC (APIs exposed by the changelog translator), covered in the subsequent sections along with the API set. Journal updates are not atomic and may lead to journal corruption in the event of a power failure midway through an update. As a countermeasure, each journal record is followed by its checksum. It may be beneficial to log updates in memory and flush them to disk (checkpointing) on reaching a buffer limit or on fsync().
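To illustrate the checksummed-record idea, a record could be framed on disk roughly as below. The field names and widths are assumptions for this sketch; the actual binary format is not specified here.

    #include <stdint.h>

    /* Illustrative framing of a single journal record: a checksum of the
     * header and payload trails the record, so a torn (partial) write
     * after a power failure is detected on read-back. */
    struct journal_record {
            uint32_t len;       /* payload length in bytes                 */
            uint32_t fop;       /* numeric FOP identifier                  */
            uint32_t ts;        /* timestamp of the operation              */
            /* variable-length payload (GFIDs, entry names, ...) follows   */
            /* a uint32_t CRC32 of header + payload closes the record      */
    };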
With the new changelog design, journal updates and reads (for consumption) happen in parallel. The consumer (linked with libgfchangelog) registers a callback to be invoked on filesystem modification for a given brick (local to the node).
API prototype for register:
    int gf_changelog_lowlevel_register (char *brick, gf_callback_t *cbk,
                                        uint32_t bufsize, uint32_t parallel,
                                        void *cookie);
where parallel is the parallelism count the consumer is willing to use. The API returns a unique cookie that is used in subsequent API calls. Filesystem notifications can be streamed in parallel in cases where FOP ordering is not a necessity. This is beneficial for other file processing applications (such as backup utilities, etc.) that do not rely on FOP order for correct functionality. Once registered, cbk is invoked with a buffer (of length bufsize) containing one or more of the following structures:
    struct gf_event {
            g_event_type    type;
            glusterfs_fop_t fop;
            union {
                    struct g_ev_data     d;
                    struct g_ev_entry    e;
                    struct g_ev_metadata m;
            } u;
    };
type: event class (data, metadata or entry)
fop: numeric value of the file operation (valid when type is 'entry')
u: union encapsulating the structs for data, metadata and entry events
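For illustration, a consumer callback might walk the notification buffer as sketched below. The callback signature, the event-type constants (G_EVENT_DATA, G_EVENT_ENTRY), the brick path and the cookie handling are assumptions; only the register prototype and struct gf_event above are defined by this document.

    #include <stdint.h>

    /* Illustrative consumer callback; gf_callback_t's exact signature is
     * not spelled out in this document, so this shape is an assumption. */
    static void
    on_notify (void *cookie, char *buf, uint32_t len)
    {
            uint32_t used = 0;

            while (used + sizeof (struct gf_event) <= len) {
                    struct gf_event *ev = (struct gf_event *)(buf + used);

                    if (ev->type == G_EVENT_ENTRY) {
                            /* ev->fop is valid here; ev->u.e describes the
                             * entry operation (create, unlink, ...) */
                    } else if (ev->type == G_EVENT_DATA) {
                            /* ev->u.d describes the data modification */
                    } else {
                            /* metadata event: ev->u.m */
                    }

                    used += sizeof (struct gf_event);
            }
    }

    /* Registration with a parallelism count of 4; the brick path and the
     * buffer size are placeholders, and passing &cookie to receive the
     * unique cookie is an assumption about the API's calling convention. */
    static int
    start_consumer (void)
    {
            void *cookie = NULL;

            return gf_changelog_lowlevel_register ("/export/brick0",
                                                   on_notify, 4096, 4,
                                                   &cookie);
    }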
There is also a facility to stop and stall (pause) notifications.
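A hypothetical shape for those companion calls, purely for illustration (none of these names are finalized in this design):

    /* Hypothetical companion calls for the pause/stop facility above;
     * 'cookie' is the handle returned by the register call. */
    int gf_changelog_lowlevel_pause  (void *cookie);
    int gf_changelog_lowlevel_resume (void *cookie);
    int gf_changelog_lowlevel_stop   (void *cookie);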
The real workhorse of the notification infrastructure lives as a set of APIs in the changelog translator. The API prototypes may change during the development cycle and are not listed as of now. Explained below is the detailed interaction between libgfchangelog and the changelog translator and how notifications are streamed to the consumer.
The changelog translator maintains a per-connection queue containing references to threads that are interested in receiving notifications. A consumer can have multiple threads of execution and therefore the connection block has a notion of execution contexts. Registering to the queue is done via an API. Threads which are free register themselves for notification. On a successful register, the callback is invoked when there are events to be notified. Since changelog updates and reads happen in parallel, care needs to be taken to read only up to the last valid record. Changelog keeps track of the current size of the journal in memory, which is updated on every changelog update and initialized on translator load (init()).
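A minimal sketch of the "read only up to the last valid record" bookkeeping, assuming a hypothetical in-memory structure maintained by the translator (names are illustrative, not from the design):

    #include <stdint.h>
    #include <pthread.h>

    /* Hypothetical in-memory bookkeeping kept by the changelog translator. */
    struct changelog_journal {
            int             fd;        /* journal file descriptor            */
            uint64_t        size;      /* valid bytes: set in init(), bumped
                                          under lock on every record append  */
            pthread_mutex_t lock;
    };

    /* Readers serving consumer notifications never read past 'size', so a
     * record that is still being appended is never handed out half-written. */
    static uint64_t
    journal_readable_size (struct changelog_journal *j)
    {
            uint64_t sz;

            pthread_mutex_lock (&j->lock);
            sz = j->size;
            pthread_mutex_unlock (&j->lock);

            return sz;
    }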
Changelog keeps track of the point up to which each consumer has consumed events. This way a consumer can start where it left off (after a crash or on a restart). This meta information per consumer (client) can be persisted in the extended attributes of the journal. Since a consumer can be multi-threaded and notifications are streamed to a set of free threads, the consumer's position in the changelog needs to be carefully persisted. Consider the following scenario:
    s0: T1 | T2 | T3 | T4
    s1:    |    | T3 | T4
    s2:    | T2 |    | T4
    s3: T1 |    | T3 |
The above table is just a depiction of the intricacies involved. The actual algorithm may differ at the time of development.
Each time slot (s#) represents a set of free execution contexts (threads: T#) which is assigned a job mapping to a bunch of records in the changelog. Each thread is assigned an (offset, length) tuple to work on. A thread can report its completion in any subsequent time slot, depending on how fast or slow it is. This implies that the consumer position (basically, the offset up to which a consumer has processed the journal) needs to be updated only after all threads for a given time slot have reported completion. In the above representation, s0's completion happens at time slot s3; at the same time, s1 too reaches completion at time slot s3. Therefore, the implementation should take care of thread/client crash consistency and persist the correct position to the store. There is also the case where a thread (associated with a client/consumer) dies and for some reason is never able to spawn again (or spawns and dies before it can register itself). In such a case the thread count (or IDs) can be a part of the keep-alive payload. On identification of a faulting thread, its work can be delegated to a free thread from the current time slot.
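A simplified sketch of that offset-advancement rule, using hypothetical per-slot bookkeeping structures (as noted above, the actual algorithm may differ):

    #include <stdint.h>

    /* Hypothetical per-slot bookkeeping for one consumer. */
    struct slot {
            uint64_t     end_off;  /* offset past the last record assigned   */
            int          pending;  /* threads yet to report completion       */
            struct slot *next;     /* next (newer) time slot                 */
    };

    struct consumer {
            struct slot *oldest;   /* oldest slot not yet fully consumed     */
    };

    /* Stand-in for writing the consumer's position into the journal's
     * extended attributes, as described above. */
    void persist_consumer_offset (struct consumer *c, uint64_t off);

    /* Called when a worker thread finishes its (offset, length) chunk. */
    static void
    slot_complete (struct consumer *c, struct slot *s)
    {
            if (--s->pending > 0)
                    return;

            /* Advance the persisted position only over a contiguous run of
             * fully completed slots, so a crash never records an offset
             * beyond what has actually been consumed. */
            while (c->oldest && c->oldest->pending == 0) {
                    persist_consumer_offset (c, c->oldest->end_off);
                    c->oldest = c->oldest->next;
            }
    }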
Another important enhancement to the changelog translator is the introduction of an LFU (or LRU?) cache used to track hot/cold files. Data classification strategies make use of this information to migrate hot files to a faster store (typically SSD backed). Each object in the cache has a read and a write count, updated by the changelog translator during the respective read/write operations. A "current" view of the cache can be fetched via an API. Objects expiring from the cache are batched and notified to the consumer.
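For illustration only, each cache object could carry per-file access counters along these lines (the struct and field names are assumptions, not a finalized design):

    #include <stdint.h>
    #include <time.h>

    /* Illustrative hot/cold cache entry tracked by the changelog xlator. */
    struct cl_heat_entry {
            unsigned char gfid[16];    /* file identity (GFID)                 */
            uint64_t      read_cnt;    /* bumped on read FOPs                  */
            uint64_t      write_cnt;   /* bumped on write FOPs                 */
            uint32_t      freq;        /* LFU frequency bucket                 */
            time_t        last_access; /* used to age entries out of the cache */
    };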
The proposed changelog record format, as of this writing, is similar to the current format apart from the addition of a timestamp to each record.
gluster-devel thread: http://supercolony.gluster.org/pipermail/gluster-devel/2014-December/043152.html