sachin-j-joshi/Pravega-Simplified-Storage-Subsystem-(S3)-design-backup.md

## Pravega-Simplified-Storage-Subsystem-(S3)-design-backup.md

      
    Raw
  

              Pravega-Simplified-Storage-Subsystem-(S3)-design-backup.md
            
          
    Background

This Document discusses design for Pravega Simplified Storage Subsystem (S3).
This also covers design for PDP 34: CRWD Tier 2.
https://github.com/pravega/pravega/wiki/PDP-34:-CRWD-Tier-2
Design Objectives

Goals


Support additional cloud storage options by simplifying the API contract and requirements on the tier-2 storage bindings.
Eliminate need to implement complex fencing logic in tier-2 bindings. More specifically, provide stronger guarantees around isolation of writes from two different segment store instances. (In other words, two segment store instances will never write to a same underlying file.)
Support bindings that do not support append operations. Especially support storage bindings that only support plain (ie. not extended) Amazon S3 compatible APIs.
Leverage additional Merge/Concat capabilities effectively whenever underlying binding provides them with strong guarantees. (Eg multipart upload in S3 ECS, concating existing files in HDFS)
Allow storage bindings complete freedom to optimize their implementation by using additional capabilities effectively whenever/wherever underlying binding provides them with strong guarantees. (Eg conditional append in ECS/Azure/GCP, truncate)
Enable creation of additional background services like defragmenter that merge number of smaller tier-2 objects into large contiguous ones by leveraging concat operation where underlying storage provides it.
Enable basic support for additional admin functionality like self-repair and integrity checks (like fschk).
Enable basic architectural extension points/ mechanisms to implement importing and exporting data in Pravega from pre-existing external sources.

Non-goals


Ability to read data back without using Pravega or outside of Pravega is out of scope. The data written by Pravega must still be read back using Pravega.
While our goal is to minimize storage overhead and keep it below 1%, it is not our goal to completely eliminate a possibility of temporarily creating extra data, metadata or even some garbage data on tier-2. There may be small amounts of extra metadata or additional orphaned or temporary data stored on the tier-2 at any point in time during normal operations and/or failover scenarios.
Implementing import or export of pre-existing data in other formats (eg avro, porque) is out of scope.

Assumptions

It is helpful for us to explicitly state these assumption to better understand available design space and it’s constraints.
Throughput

It is a fundamental assumption of Pravega architecture that Tier-1 provides low latency using high end storage devices whereas Tier-2 provides cheap long-term storage at higher throughput. The storage system design will take advantage of following facts

Tier-2 is not on a critical path for write operation. Write to tier-2 is designed to be async.
Tail reads should be mostly served from cache. Tier-2 should not be on a critical path for tail read.
In case of historical/Batch reads, throughput matters substantially more than fast “first byte read time” and any latency in reading the first bytes is eventually amortized by higher throughput.
In addition, for sequential reads the data can be prefetched in large read operations to increase performance further.

Sequence of bytes

Segment as a sequence of bytes is only a conceptual construct. There is no hard requirement that the actual storage of data bytes  inside a storage object/file must match byte by byte at each offset. We may have any of the following present -

Multiple storage objects that correspond to single segment,
Additional headers/footers as separate metadata objects in addition to segment data,
Additional headers/footers plus segment data may all be included in same storage object,
Sequence of bytes that make a segment may be serialized as sequence of multiple records with their own headers, footers and additional fields.

Priorities and Tradeoffs

Any system design inevitably involves making tough choices when conflicting requirements cannot be achieved simultaneously. Below are our guidelines when making such choices.

Prefer consistency and correctness guarantees over higher performance.
Prefer write throughput over write latency.
Prefer focus on making normal performance fast and keeping it just-good-enough/tolerable/acceptable in exceptional situations.
Prefer writing big chunks of data over eagerly evicting cache.
Prefer not evicting system segments whenever possible.
Prefer admin initiated maintenance operations.
Prefer conservative approach to changing on-disk metadata formats by being backward and forward compatible.

Key Concepts

Chunk

A chunk is a basic unit of storage in tier-2. Conceptually it is a contiguous range of bytes. A segment is made up of sequence of non-overlapping chunks.
How chunks are stored:


A chunk is always stored contiguously on tier-2 as a single indivisible unit. It is never split into multiple tier-2 objects/files.
For current discussion, chunk is simply a serialized record which has a byte array field that contains user data. This serialized  record needs to be saved somewhere in actual tier-2 object. (To clarify, a chunk and underling file/object are not one and the same thing. There could be multiple chunks in the same file. See diagram below.)

Structure of a persited chunk:


For each chunk, the persisted data may include additional metadata in addition to user supplied bytes of data. We can think chunk as a serialized record or frame that has structure [chunk header| user data…….| chunk footer]. We should not unnecessarily constrain design choices by making strong assumptions (that additional headers/footers will never be written or required). However, for current version both header and footer happen to be empty.
In future, the data bytes saved to tier-2 may be different than original sequence of user supplied bytes to support additional features like encryption, compression, error/bit-rot resilience, interoperability with open formats (eg. Avro) etc.

Tier-2 storage providers operate only at chunk level


We require tier-2 bindings to provide only the following operations - Create, Delete, Write and Read (CRWD) chunks to/from underlying storage.
However, the specific binding may optimize its internal implementation by utilizing additional functionality like append. Append is not necessary, and not included in API contract. It is entirely up to tier-2 implementation to optimize using whatever primitives underlying storage provides.
In addition, optionally the operations to Merge (M) and Truncate(T) might be provided.
The List (L) operation is not required for normal operation but is generally available and may be required for implementing admin tools functionality.

The API contract is described later in the document.
Storage of chunks


Following tuple should completely and uniquely specify location of a persisted chunk on tier-2 <object/file name, start offset, length>. A single file may contain multiple chunks. In this case each chunk starts at different offset although the object name remains same.
If storage provider does not support append (eg. vanilla S3), then each chunk is written to a separate file/object on tier-2. In this special case, each chunk is separate object and each object contains just one chunk.
If storage provider does support conditional appends (or otherwise concurrency safe appends), then multiple chunks can be stored in a single file/object on tier-2. Instead of creating new object/file for each chunk, a new chunk is simply written/appended to an existing file/object.

Names are arbitrary.


Except possibly for the system segments, the actual chunk names used while persisting on tier-2 are arbitrary. There is no special meaning attached to the format of the name. The names could be UUIDs or may contain any combination of segment name, segment store epoch, offset etc. However the name of chunk is required to be globally unique.

All writes are immutable.


Once data is written to a chunk, that same chunk is not modified or appended to. A new chunk is created for the next write. For all practical purposes a single write to tier-2 by storage writer results in a single new chunk being created and saved. A chunk generally corresponds to single write operation (except when merged or truncated).

All writes to chunk are atomic.


If a write operation fails to write complete chunk, then system will not read inconsistent state or incomplete chunk. (explained in detail later).

Single Writer Guarantees.


It is guaranteed that two segment store instances will never write to a same underlying file. (By imposing restriction that a segment store instance never appends to file/object created by another instance. Explained in detail later)

Some size estimates/assumptions

It will be useful to have a rough estimate of number of chunks and segments for understanding design space. The intention here is to capture estimate of orders of magnitutes.

A chunk could be as small as one byte, but we prefer them to be really large. There is no upper limit on the size of the chunk.
The average size of chunk is expected to be low 100s of MB.
There can be upto 25K active segments per container with estimated average around 1K segments.
A segment can have upto 100s of chunks. The average number of chunks is expected to be order of magnitude lower with aggressive and efficient merging.
Thus segment can contain upto 100 MB * 100 = 10 GB. (With average around 100s of MB per segment)

Segment

A segment is made up of sequence of non-overlapping chunks.


Conceptually the segment metadata consists of a header describing various properties of a segment plus a linked list of chunk metadata records describing each chunk.

Below is how segment metadata records are organized.

Metadata is stored in table segment:


The segment metadata and layout information are stored using multiple key-value tuples using pinned table segment (as opposed to single record containing all metadata including chunk layout data of the segment).

There is one KV pair for the header for each segment. In addition to segment properties it contains pointer to first/last chunk metadata records.
There is one KV pair for each chunk. In addition to chunk properties it contains pointer to previous/next chunks.
NOTE - The pointer here means a string that can be used as a key to retrieve the full record by calling get on KV store.


Metadata is authoritative:


The segment metadata and layout information stored in the KV is the only source of truth considered while reading or appending data to a segment.

If metadata does not point to a chunk, then tier-2 data for that chunk is never considered part of a segment and therefore never used.
In other words, a write to chunk is not considered complete until metadata corresponding to it is also updated. It is not enough to write data to tier-2, metadata also needs to be committed for that data to be considered “persisted.


Metadata updates are atomic.


All metadata operations are performed using conditional updates and using a single transaction that updates multiple keys at once. This guarantees that concurrent operations will never leave the metadata in inconsistent state.

Metadata updates are fenced by tier-1:


Table segments internally use Tier-1. Therefore, a segment store instance is fenced in a failover scenario and it can no longer make change to segment metadata. This provides strong guarantees against data corruption by zombie segment stores.
More specifically the write operation involves two updates. Before write, a new chunk metadata record is created but not linked yet. In case of zombie writer, this first step will fail and will prevent unneeded write. In case of failover, the new segment store instance has way to know all the writes that were in flight and takes care of "sealing/closing" the chunks file allowing us to reconcile partially written data. When write is complete, this record is then updated and inserted into chunk list for a segment appropriately.

Metadata updates are efficient:


By splitting the data in multiple records, we avoid writing same data again and again but update only the records regarding modified (or last) chunk.
Problem of writing tiny metadata updates to tier-2 is also solved as table store will aggregate updates to metadata for number of segments into large writes to tier-2.(recall that table store saves data in system segments).

Metadata records are cached using write-through cache:


Given the “append only” only semantics of segments, a very few KV records are updated during segment metadata operations except for last chunk.
The segment level operations like concat, seal, unseal are relatively infrequent.
Therefore, metadata records are excellent candidates for being cached. (eg. Guava cache)

Metadata offers points of extensibility:


All metadata persisted should contain a version.
In future, Segment metadata may contain some additional fields – Eg. Access Control Lists (ACL)
In future, Chunk metadata may contain some additional fields – Eg. CRC/Checksums,

Metadata can be exported to or imported from a tier-2 object:


This is useful for backward compatibility, backup and disaster recovery
A snapshot of segment metadata (with metadata for all its chunks) can be exported to tier-2 on demand. Current header file format will be supported.
A snapshot of segment metadata (with metadata for all its chunks) can be imported from existing tier-2 snapshot on demand.
Automatic import of current header file format will be implemented.

Key Operations

Segment Merge and other Layout change operations

Segment Merge and other Layout change operations are segment level operations that change the number and layout of the chunks that make up a segment. Segment layout is specified by a linked list of chunk records. (recall here that records are stored in table store as KV pair. However, they are conceptually linked list nodes and they “point” to other records using string key as a “pointer”)
Concatenation


When two segments are merged via concat operation, the data stored on the tier-2 is not changed. Only the layout (sequence) of chunks is changed which is purely metadata operation. This is a simple operation consisting of concatenating two linked list of chunk metadata records by appending one linked list to other. Only last chunk metadata record of the target segment needs to “point” to first chunk metadata record of the source. The properties of target segment is updated in its header record to reflect result of concat and header record for the source is deleted. This is purely metadata operation.

Defrag


Multiple small chunks are copied to a big chunk using storage binding’s native capabilities (Eg, multipart upload for S3 or concat for HDFS).

The sublist corresponding copied chunks is replaced with single new chunk record corresponding to new large file/object. Metadata operation is like deleting nodes from middle of the list.
It is assumed that both S3 and HDFS implement concat as mostly metadata operation on their side without moving data.
Therefore, once again it ends up being mostly a metadata operation for Pravega.
A special corner case of defrag is when series of chunks are in the same file consecutively and chunks can be merged in situ without modifying. This case can be optimized by inline merge while writing new chunk.


Truncate


The chunks at the head of the segment that are situated completely below the new start offset are deleted. Metadata operation is like deleting nodes from start of the list.

Swap file/object [TBD/Experimental]


As name suggests is a temporary object/file used to quickly move large amounts of cache data to tier-2 to make room in segment store. Here chunks from multiple segments are stored in a single object.

Rolling Storage


Whenever there is a need to rollover the segment or when large write needs to be split in smaller writes, new chunks is added as needed and metadata updated.

Metadata-only Operations

Create

New segment header added to KV store. No tier-2 object created.
(If segment header record does not exist, then attempt is made to create and populate segment header and chunk records using files found on tier-2, before creating new segment header record.)
OpenRead/OpenWrite


If segment header record exists in the KV store then the layout information from chunk metadata is read and cached.
If segment header record does not exist, then attempt is made to create and populate segment header and chunk records using files found on tier-2.

Exists


Presence of segment header record which is marked active, indicates whether segment exists or not.
Presence of segment header record which is marked active, indicates whether segment exists or not.
If segment header record does not exist, then attempt is made to create and populate segment header and chunk records using files found on tier-2.

Seal


This is purely a metadata operation on segment header.

Unseal


This is purely a metadata operation on segment header.

Delete


Mark segment as deleted in segment header.
All tier-2 objects and chunk metadata is deleted asynchronously.

Data Operations

Read


The list of chunk metadata is used to identify the chunks containing data.
Using chunk metadata, the corresponding tier-2 object/file is read. For efficient lookup, the chunk metadata is cached.

Write


The list of chunk metadata is used to identify the last chunk if any.
A new chunk is written to the tier-2 file/object. (to either an existing file or new depending on underlying storage provider.)
The metadata for new chunk is added.
If the new chunk can be merged inline or triggers a merge, it is merged. Relevant fields header and affected chunk metadata is updated in a single transaction.

Data Operations in Burst Mode

Read


Data need not be read sequentially one chunk after another. All the relevant chunks could be read in parrallel. This is benefetial where underlying storage itself is highly distributed and replicated (Eg. HDFS, Cloud storages like S3) giving higher effective throughput.
Just like normal read The list of chunk metadata is used to identify the chunks containing data.
Using chunk metadata, the corresponding tier-2 object/file is read in parallel. For efficient lookup, the chunk metadata is cached.

Write


The burst mode write is useful when large range of data needs be moved to tier-2, in this case the range can be broken down into multiple chuncks and each chunk can be written in parallel. Thus providing larger effective throughput.

Backup and Restore of the segments.

Backup


List all segments
For each segment metadata is exported to a file (using well-known file convention)

Restore


Import segment metadata previously imported to a file.

Compatibility

Backward compatibility

Forward compatibility

Working with mixed versions (Upgrade/downgrade)

Migration

Key Questions Answered


How do we guarantee zombie segment store cannot write to tier-2?
Write operation is not considered complete until metadata about that write is committed. When tier-1 ledger is closed, the zombie segment store is fenced out and cannot write to tier-1. This means it cannot metadata in the table segment neither can it truncate the ledger. Whatever data written to tier-2 by zombie after its last successful metadata update will be simply ignored. This guarantees consistency for tier-2.


Will this break if tier-1 is changed to something other than Bookkeeper?
Yes. Unless its replacement also provides automatic fencing on closed ledgers.


How do we guarantee multiple segment stores never write to the same file?
Each segment store instance gets a unique id when instantiated (epoch in current case).  This id is used/included while generating a file/object name for a given segment.


You mentioned both Append and Merge. Aren’t we supposed to have CRWD only?
The API contract does not require append.  But is free to use append to improve performance by writing multiple chunks to the same underlying file/object. Merge is also optional but if efficient implemntation is available we should be able to leverage it.


There will be large number of small objects. Will this not be a problem?
No.


(a)	For the storage providers that do provide append operations - Same object/file will contain multiple consecutive chunks. Because of guaranteed Single Writer pattern no fencing gymnastics are necessary thereby providing fast write performance. Because same file is appended to the resultant file contains several chunks which improves read performance as well. The chunk metadata can be merged “inline”.
(b)	For the storage providers that do not provide append we defragment the segment by using native concat capabilities which are assumed to be efficiently implemented by underlying storage.
(c)	For reads – intermediate results for offset search will be aggressively cached.

Key Concerns

Bootstrap


[WIP] How do we make sure system is able to boot properly by loading table segment.

Design Details

Components


Component
Responsibilities
Collaborations


SegmentMetadataTableStore(SMTS)
Store all the metadata for the segments associated   with for the given container.   Data is stored as simple KV store.    Interface follows repository pattern.


SMTSCache
write through cache for SMTS entry


StorageAdapter
Translates segment operations to operations on   individual chunks.


StorageProvider (s)
Component that provides bindings for various storage providers.


StorageWriter
Periodically writes data to tier-2


StorageReader
Prefetches the data from tier-2


StorageDeframenter
Periodically merges set of small tier-2 objects into   a larger object.   Remove deleted objects from the tier-2.


API contracts

PersistentStorageProvider

/**
 * Defines an abstraction for Permanent Storage.
 * Note: not all operations defined here are needed.
 */
public interface PersistentStorageProvider extends AutoCloseable {
    /**
     * Gets a value indicating whether this Storage implementation supports truncate operation on underlying storage object.
     *
     * @return True or false.
     */
    boolean supportsTruncation();


    /**
     * Gets a value indicating whether this Storage implementation supports append operation on underlying storage object.
     *
     * @return True or false.
     */
    boolean supportsAppend();


    /**
     * Gets a value indicating whether this Storage implementation supports merge operation on underlying storage object.
     *
     * @return True or false.
     */
    boolean supportsConcat();

    /**
     * Determines whether named file/object exists in underlying storage.
     *
     * @param name Name of the storage object to check.
     * @return True if the object exists, false otherwise.
     */
    boolean exists(String name);

    /**
     * Creates a new file.
     *
     * @param name Name of the storage object to create.
     * @return CompletableFuture A CompletableFuture corresponding to completion of the operation.
     * @throws IOException
     */
    CompletableFuture create(String name) throws IOException;

    /**
     * Deletes a file.
     *
     * @param name Name of the storage object to delete.
     * @return CompletableFuture A CompletableFuture corresponding to completion of the operation.
     * @throws IOException
     */
    CompletableFuture delete(String name) throws IOException;


    /**
     * Reads a range of bytes from the underlying storage object.
     *
     * @param name Name of the storage object to read from.
     * @param fromOffset Offset in the file from which to start reading.
     * @param length Number of bytes to read.
     * @param buffer Byte buffer to which data is copied.
     * @param bufferOffset Offset in the buffer at which to start copying read data.
     * @return CompletableFuture A CompletableFuture containing number of bytes read.
     * @throws IOException
     * @throws NullPointerException
     * @throws IndexOutOfBoundsException
     */
    CompletableFuture<Integer> read(String name, long fromOffset, int length, byte[] buffer, int bufferOffset) throws IOException, NullPointerException, IndexOutOfBoundsException;


    /**
     * Writes the given data to the underlying storage object.
     *
     * @param name Name of the storage object to write to.
     * @param offset Offset in the file to start writing.
     * @param length Number of bytes to write.
     * @param data An InputStream representing the data to write.
     * @return CompletableFuture A CompletableFuture containing number of bytes written.
     */
    CompletableFuture<Integer> write(String name, long offset, int length, InputStream data) throws IOException;

    /**
     * Concatenates two or more files.
     *
     * @param target Name of the target file to concat to.
     * @param sources Array of strings of the names of existing files to be appended to the target. The files are appended in the same sequence the names are provided.
     * @return CompletableFuture A CompletableFuture corresponding to completion of the operation.
     * @throws IOException
     */
    CompletableFuture concat(String target, String... sources) throws IOException, UnsupportedOperationException;

    /**
     * Truncates
     *
     * @param name Name of the storage object to truncate.
     * @param offset Offset to truncate to.
     * @return CompletableFuture A CompletableFuture corresponding to completion of the operation.
     * @throws IOException
     */
    CompletableFuture truncate(String name, long offset) throws IOException, UnsupportedOperationException;
}
Data structures

Segment Metadata (1 per segment)


Key – segment name ,
Mutable – yes
Value -  Serialized record containing following info

Version
Status (sealed | deleted )
Storage Provider ID. (enum)
Start offset
End offset
Last chunk pointer
First chunk pointer


Chunk Metadata ( 1 per chunk, multiple )


Key – “chunk-” ,   where N is contiguous monotonically increasing chunk number starting with 0
Mutable – Mostly immutable , once written this data seldom changes.
Value -  Serialized record containing following info

Version
Chunk path/name on the tier-2 storage
Start offset ,
length
Status (sealed | deleted | merging )
Pointer to next chunk metadata record.
Pointer to previous chunk metadata record. (We don’t really need this yet)


Algorithms
Component	Responsibilities	Collaborations
SegmentMetadataTableStore(SMTS)	Store all the metadata for the segments associated with for the given container. Data is stored as simple KV store. Interface follows repository pattern.
SMTSCache	write through cache for SMTS entry
StorageAdapter	Translates segment operations to operations on individual chunks.
StorageProvider (s)	Component that provides bindings for various storage providers.
StorageWriter	Periodically writes data to tier-2
StorageReader	Prefetches the data from tier-2
StorageDeframenter	Periodically merges set of small tier-2 objects into a larger object. Remove deleted objects from the tier-2.