The main problem is locking and dealing with multiple computers sharing a repository. In general, s3 is eventually consistent. It does (as of Aug 4, 2015) have read-after-write consistency in all regions for all NEW files, but updates/deletions to existing files are still only eventually consistent. Additionally, listing of directories is also only eventually consistent, so you may not see a file in the directory listing for some time after it's created. (However, if you knew its name, you could read it immediately after it was created; this is the read-after-write consistency that s3 now has in all regions.)
Now, locking the repository when the repository itself is only eventually consistent is going to be problematic - ideally, coming up with a system that does not require locking would be best. Alternatively, the locking could be done out-of-band using something strongly consistent (ie: dynamodb?). I'm only going to consider the former approach, since extra dependencies and configuration are undesirable. In any case, it should be more "portable" between cloud providers than trying to depend on amazon-specific behavior - google cloud and azure bulk storage have different consistency models (AFAICT stronger consistency, but I'm not sure about rackspace and others).
This is a brain dump of one possible approach. It is incomplete and impractical in its current form:
- store every chunk on the file system directly. Ie: chunk <id> is stored in ./repository/chunks/<id>. Looking up chunks is now a file-system job.
- Archive references are stored in a manifest directory. Archives are dictionaries containing archive metadata and a list of chunks, and they are msg-packed and chunked as normal. The file in the manifest directory simply lists all the chunks necessary for the archive metadata. To "avoid" contention between multiple systems accessing the same repository at approximately the same time, an archive reference in the manifest directory is given a unique filename [the same sha256 id used currently?]
- A "transaction" is committed when a reference to that archive appears in the manifest. Archives references are created once, atomically, and list every chunk required for the archive metadata. In turn, the archive metadata lists every chunk required for the entire archive. Archives references are only created after every chunk in the archive, including all the chunks neccessary for the archive metadata, are commited on the filesystem. Backups simply create every chunk required, skipping chunks that already exist.
- There should be a local cache of which chunks are on the server. This cache can be the same as the existing chunk cache. If you don't have a chunk cache, you can list all of the chunks on the server to bootstrap your chunk cache.
- Unreferenced chunks (which can be determined by listing all the chunks in the file system and removing all the chunks referenced in all of the archives) can/should be cleaned up periodically. These chunks are either from aborted backups, or chunks that were previously referenced from now-deleted archives. This process involves listing all of the chunks, and downloading/parsing all of the archives listed in the manifest. There is a race condition where a previously unreferenced chunk may want to be removed at the same time that a parallel backup task is expecting to use it. The protocol for dealing with this race condition on an eventually consistent file system has yet to be determined. This step is somewhat OPTIONAL in that it doesn't make sense to do if the bucket is versioned (since deleting the chunk wouldn't save you anything; it would still be stored). Additionally, this is not something that every client necessarily needs to do; you may want most of your backup clients to be "dumb" and just dump chunks on the server, and limit your longer-term file maintenance to a specific client.
- 1 file per chunk is impractical, but without an atomic read-modify-write, or locking, I'm not sure how else to do it. For reference, if you plan to transition files to amazon glacier (economically), then you want files >= 200KiB [https://therub.org/2015/11/18/glacier-costlier-than-s3-for-small-files/]. However, keep in mind s3 is a little different than a "normal" file system, and has different limitations [http://stackoverflow.com/questions/394198/max-files-per-directory-in-s3]. Ideally there would be some mechanism to ensure files for long term storage are > 200kB, preferably something closer to the 5MB the current system uses. Ideally we would support backing up to amazon glacier directly (rather than only through an s3 bucket with lifecycle configurations that migrate data to glacier).
- using 1 file per chunk is not going to work practically - too many chunks, too much overhead. You have to consider that 1 chunk is not just the usual 64kiB (or soon: 1MiB) target chunk size, but can be way smaller if the input file is smaller. You can't really ignore that; in the end, this is something that has to be solved.
- that "eventually consistent" S3 property is scary. it's already hard enough to design such a system without that property.
NOTE: this whole discussion ignores encryption / data security. This needs to be addressed.
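To make the flat-layout idea above concrete, here is a minimal sketch (assuming boto3 and a hypothetical bucket/key layout, not actual borg code) of storing chunks as individual objects, committing an archive by writing a manifest entry, and bootstrapping the local chunk cache by listing keys:

```python
# Hypothetical sketch only: bucket name, key layout and helper names are assumptions.
import hashlib

import boto3

s3 = boto3.client("s3")
BUCKET = "my-backup-bucket"      # assumption
REPO = "repository"              # assumption: one prefix per repo

def put_chunk(chunk_id: str, data: bytes) -> None:
    """Store one chunk as one object under <repo>/chunks/<id>."""
    s3.put_object(Bucket=BUCKET, Key=f"{REPO}/chunks/{chunk_id}", Body=data)

def commit_archive(archive_meta_chunk_ids: list) -> str:
    """The 'transaction commit': after all chunks are uploaded, write a manifest
    entry listing the chunks that hold the archive metadata. Its name is the
    sha256 of its contents, so it is unique and only ever written once."""
    body = "\n".join(archive_meta_chunk_ids).encode()
    ref_id = hashlib.sha256(body).hexdigest()
    s3.put_object(Bucket=BUCKET, Key=f"{REPO}/manifest/{ref_id}", Body=body)
    return ref_id

def bootstrap_chunk_cache() -> set:
    """Rebuild the local chunk cache by listing every chunk key on the server
    (subject to the eventually consistent listing caveat discussed above)."""
    ids = set()
    pages = s3.get_paginator("list_objects_v2").paginate(
        Bucket=BUCKET, Prefix=f"{REPO}/chunks/")
    for page in pages:
        for obj in page.get("Contents", []):
            ids.add(obj["Key"].rsplit("/", 1)[-1])
    return ids
```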
- The main idea has some commonalities with git and bitcoin. It is essentially a merkle tree. In the simple case, where there is no contention, a linear path is formed from the first sector to the last sector. If there is contention, then "forks"/branches emerge. Forks can only start after a "commit".
- Use a sector based approach (bundle objects/chunks chronologically in a log) but allow for multiple simultaneous writers without explicit locking. If two computers are trying to write the "next" sector at the "same" time, then there is a fork. During a fork, there could be duplicate chunks written. One of the forks should "win", and any chunks written into the losing paths should be merged into the winning path (eventually - we may not detect a fork during the initial upload because of eventual consistency). Contention is avoided by using unique file names derived from the name of the previous sector (the "parent" sector) and the contents of the current sector (see below).
- Each sector would contain a fixed size header. At a minimum the header would contain the id of the previous sector.
- Immediately following the fixed size header is a separate section summarizing (in order) the contents of the sector (ADD CHUNK xyz, REFERENCE CHUNK abc, COMMIT). If chunks are added, this section should also include the offsets for those chunks. This will allow a complete index rebuild without downloading/parsing the entire file. See also: borgbackup/borg#474
- Both of the above headers (the fixed and the variable) serve as the metadata for the sector. This metadata is actually probably stored as a separate "file" because it needs to be accessed in order to rebuild the index from scratch. Storing it as a separate file may simplify the use of automatic lifecycle transitions to glacier for the actual chunk data while still allowing a bare-bones index check without requiring time-consuming and costly glacier restores.
- The metadata described above is hashed using the appropriate hash algorithm. This hash serves as the sector name. Because the hashed content contains the hash id of the parent node, a merkle-tree is formed. (See the sketch after this list.)
- The metadata allows you to determine the correct order of all of the available sectors by downloading the headers and re-constructing the merkle-tree. Note: we know how nodes are ordered along a path, but not their global ordering as would be the case in the "locked"/single-writer scenario.
- We introduce a new tag for "REFERENCED CHUNK". If an uploader detects that a fork has emerged, it can (optionally) choose to reference chunks in other forks. This helps restore some level of de-dup for clients that have started writing to a fork but detect that other writers to other forks have already included a necessary chunk in their fork. Because eventual consistency carries no hard timeline guarantees, the writers may never see the other chunks, so duplicate chunks will inevitably occur. Unfortunately this works both ways; if a writer can see a fork, it doesn't mean that some other reader will be able to see it (immediately - although it SHOULD (AFAICT) after the eventual consistency horizon). [NOTE TO SELF: it would be nice to use the read-after-write consistency to our advantage and simply explicitly list the sector ids, but because we are constantly pruning the head of the tree, the chunk/sector mappings change over time and the sector ids would eventually become stale, so don't do that]
- Because we cannot lock / force a unified global path due to eventual consistency, we choose to trade off some amount of de-duplication in order to ensure data consistency, and we also trade off the ability to instantly "remove" data. At a bare minimum we need to ensure that sectors are AT LEAST 24 hours "old" before we even think about removing them, as this is the "eventual consistency" horizon.
- As opposed to "traditional" borg sectors, there are NO delete tags. Chunks are only added to sectors, or referenced, they are never REMOVED. How can we delete / prune old data / archives? Instead of deleting chunks, we ONLY remove stale / unreferenced sectors from the head of the tree. This implies that we should prioritize re-playing any still referenced chunks from the head of the tree to the tail. Once those replayed chunks have passed the eventual consistency horizon, we can then remove any unreferenced sectors from the head of the tree.
- An unreferenced sector is a sector whose chunks are either not referenced or have been re-played in "newer" sectors.
- we can make larger sectors, making them more suitable for transitioning to glacier. Note: for index purposes, this is better aligned with a two-file sector approach, one file containing the metadata and another file containing the chunks (again, see borgbackup/borg#474 for a discussion). Then, we can at least rebuild the merkle-tree from the s3 data without having to pull data out of glacier. This may all be moot, however, since the merkle tree only contains references to chunks. Those actual chunks still contain the archive metadata, and they have to be parsed themselves at some point to determine which chunks are actually referenced or not. Dumb clients don't care whether chunks are referenced, but they do care about where the head/tail of the merkle tree is.
- dedup is imperfect, but works in the simple case where all writers see a consistent version of the repo.
- How/when to replay/resolve forks?
- How to generate/synchronize a consistent index?
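As a rough illustration of the sector naming above, here is a sketch under assumed field names (this is not an existing borg format): a sector's id is the hash of its metadata, which embeds the parent sector's id, so the sectors form a hash-linked chain and a fork is simply a parent with more than one child.

```python
# Hypothetical sketch of the sector naming / ordering scheme above; all field
# names and the zero-byte root sentinel are assumptions, not borg's format.
import hashlib

import msgpack  # borg already depends on msgpack

ROOT = b"\0" * 32  # assumed sentinel parent id for the very first sector

def sector_id(parent_id, entries):
    """entries is the ordered summary, e.g. (b'ADD', chunk_id, offset),
    (b'REF', chunk_id), (b'COMMIT',). Hashing the packed header, which
    embeds the parent id, is what links the sectors into a merkle chain."""
    header = msgpack.packb({"parent": parent_id, "entries": entries})
    return hashlib.sha256(header).digest()

def order_sectors(parents):
    """Given {sector_id: parent_id} from the downloaded headers, walk the
    parent pointers to recover every root-to-leaf path. More than one child
    of the same parent means a fork that still needs resolving/replaying."""
    children = {}
    for sid, parent in parents.items():
        children.setdefault(parent, []).append(sid)
    paths, frontier = [], [[ROOT]]
    while frontier:
        path = frontier.pop()
        kids = children.get(path[-1], [])
        if not kids:
            paths.append(path[1:])   # drop the root sentinel
        for kid in kids:
            frontier.append(path + [kid])
    return paths
```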
I guess it's a matter of partitioning what needs to be strongly consistent and what doesn't.
The chunk cache and repository index need to be consistent and shared between machines. Fundamentally they are both key->value stores mapping a chunk id to objects. The chunk cache and repository index update every time a segment is added. The chunk cache could also change without segments being added.
The Manifest also needs to be consistent. Right now it is stored in-band with the data, with a sentinel value for a chunk-id. It updates based on archive changes, which generally span many segment changes.
Right now the general approach is to serialize everything into segments, lock everything while changes are made, and rebuild all the index/caches on machines that are stale. This approach is taken because the index/chunks kv-store is not shared among machines. In theory, the indexes could be stored as part of the repository and shared (see borgbackup/borg#474 for example). On "traditional" backends, this could very well be a "mini borg repo" with traditional file-system locking. On an s3 backend, this is much better aligned with a true strongly consistent kv store, such as dynamodb.
If dynamodb was used as a kv store, and s3 for the segment storage, I think it would be a fairly straightforward problem. But of course, you then have a dependency on dynamodb, which is probably not what most people want when they say they want to use s3 as a backend.
The solution, therefore, is to paper-over the eventual consistency problem of s3 through some sort of clever use of versioning and read-after-write consistency and come up with a reliable kv-store for sharing index/chunk state between machines. HOW???? IS THIS EVEN POSSIBLE?
structure the "index" and "chunk" mini-repo's like the following:
s3://<bucketname>/<reponame>/<chunk-id>/index/<uniquenames>
s3://<bucketname>/<reponame>/<chunk-id>/cache/<uniquenames>
Where uniquenames follows a format similar to the one described above in solution 2. Ie: a merkle tree is formed, where forks determine a winner based on something like uniquename-a > uniquename-b ? uniquename-a : uniquename-b. FORK RESOLUTION STILL AN ISSUE. The unit of update becomes per-chunk. Note: index and cache potentially out-of-sync. Can we combine?
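A minimal sketch of that winner rule, assuming boto3 and the key layout above (bucket/repo names and the "lexicographically greatest name wins" tie-break are assumptions, and fork resolution remains an open issue): every reader lists the unique names under a chunk's prefix and deterministically picks the same one, so readers converge once listings settle.

```python
# Hypothetical sketch only; bucket, repo and the max() tie-break are assumptions.
import boto3

s3 = boto3.client("s3")

def current_index_entry(bucket, repo, chunk_id):
    """List the competing unique names under <repo>/<chunk-id>/index/ and pick
    a deterministic winner (here: the lexicographically greatest)."""
    prefix = f"{repo}/{chunk_id}/index/"
    names = []
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix)
    for page in pages:
        names += [obj["Key"][len(prefix):] for obj in page.get("Contents", [])]
    return max(names) if names else None
```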
Embrace eventual consistency. It is the only way. If you have an eventually consistent backend, then you need an eventually consistent datastructure: CRDTs.
Segments are canonical. The chunk cache and repository index are COMBINED and are composed of CRDTs. Chunks are "tagged" with SegmentIDs (a la OR-Sets), and each (chunk, segment) combo has a PN-counter. Garbage collection will have to occur at the SEGMENT level (for recognizing/fixing bad de-dup from concurrent writes), and at the chunk cache level (to clean up CRDT datastructures that would otherwise grow without bound). Vacuuming of unreferenced segments would occur after some time horizon. Chunks that are to be referenced, but have had a reference count of zero since (t - timehorizon*unknownfactor), MUST be added using a new segment.
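A minimal sketch of that combined structure (an illustration under my own assumptions, not borg code): chunks tagged with segment ids, a PN-counter per (chunk, segment) pair, and a state-based merge that is commutative, associative and idempotent, so replicas converge regardless of delivery order.

```python
# Hypothetical sketch of a combined chunk-cache/index CRDT; names are assumptions.
from collections import defaultdict

class ChunkIndexCRDT:
    def __init__(self, replica_id):
        self.replica_id = replica_id
        # (chunk_id, segment_id) -> {replica: count}; PN-counter = two G-counters
        self.pos = defaultdict(lambda: defaultdict(int))
        self.neg = defaultdict(lambda: defaultdict(int))

    def add_ref(self, chunk_id, segment_id):
        """Record one more reference to chunk_id carried by segment_id."""
        self.pos[(chunk_id, segment_id)][self.replica_id] += 1

    def drop_ref(self, chunk_id, segment_id):
        """Record the removal of one reference (never removes the tag itself)."""
        self.neg[(chunk_id, segment_id)][self.replica_id] += 1

    def refcount(self, chunk_id):
        """Total references across all segment tags for this chunk."""
        return sum(
            sum(incs.values()) - sum(self.neg[key].values())
            for key, incs in self.pos.items() if key[0] == chunk_id
        )

    def merge(self, other):
        """Pointwise max per replica: idempotent, order-independent merge."""
        for key, incs in other.pos.items():
            for rep, n in incs.items():
                self.pos[key][rep] = max(self.pos[key][rep], n)
        for key, decs in other.neg.items():
            for rep, n in decs.items():
                self.neg[key][rep] = max(self.neg[key][rep], n)
```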
Segments contain a log of chunks. However, a separate file would contain a log of CRDT deltas. Garbage collection of old deltas is TBD. I'd look at the Datanet design docs https://github.com/JakSprats/datanet for inspiration.
Using this system, the chunk-cache/repository-index metadata is eventually consistent and shared amongst all clients. No locking is necessary. Other than changing the implementation of those components, it would require minimal changes to borg.
The Manifest will have to be different, but I don't think it would cause too many issues.
This seems like the most logical way to go; assuming the CRDT structures can be efficiently implemented / cleaned up, it would be a logical choice even for non-s3-backed storage.
Here is how netflix addresses the eventual consistency problem (spoiler: dynamodb). http://techblog.netflix.com/2014/01/s3mper-consistency-in-cloud.html