@krisis
Last active July 28, 2017 22:35

Multipart upload backend format proposal

Both schemas below have the following properties:

  1. Crash-consistent during a completeMultipartUpload/abortMultipartUpload
  2. Concurrency-safe in presence of multiple concurrent uploads

These properties allow us to avoid fcntl(3)-based locking in shared mode with the FS backend.

Schema-1

.minio.sys.tmp
    ├── <uploadId> -----------------> created on initMultipartUpload
        ├── <eTag>.0 ---------------> created on initMultipartUpload
        ├── <eTag>.<partNumber> ----> created on putObjectPart

where <uploadId>/<eTag>.0 contains the following JSON object:

{
    "Bucket": "bucketName",
    "Object": "objectName"
}

Example

The following example shows 3 parts each for 2 concurrent uploads, with uploadIds

  1. 6e463bb8-35bd-4408-809e-78f509f558b3
  2. 7bed54f8-ad71-48b1-a4f8-bd2e2a8efbda

to mybucket/myobject.

.minio.sys.tmp
    ├── 6e463bb8-35bd-4408-809e-78f509f558b3
    │   ├── 467886be95c8ecfd71a2900e3f461b4f.0
    │   ├── 467886be95c8ecfd71a2900e3f461b4f.1
    │   ├── 467886be95c8ecfd71a2900e3f461b4f.2
    │   └── 467886be95c8ecfd71a2900e3f461b4f.3
    └── 7bed54f8-ad71-48b1-a4f8-bd2e2a8efbda
        ├── 467886be95c8ecfd71a2900e3f461b4f.0
        ├── 467886be95c8ecfd71a2900e3f461b4f.1
        ├── 467886be95c8ecfd71a2900e3f461b4f.2
        └── 467886be95c8ecfd71a2900e3f461b4f.3

6e463bb8-35bd-4408-809e-78f509f558b3/467886be95c8ecfd71a2900e3f461b4f.0 would contain

{
    "Bucket": "mybucket",
    "Object": "myobject"
}

Pros

  • Easier to support and debug, since an uploadId directory will have at most 10000 entries
  • The uploadId directory's modTime reflects the last updated part of a given upload, which simplifies detection of 'stale' uploads
  • The limited number of entries per uploadId directory makes listing during cleanup of stale upload parts faster

Schema-2 (flat namespace)

.minio.sys.tmp
    ├── <uploadId>.<eTag>.0 ---------------> created on initMultipartUpload
    ├── <uploadId>.<eTag>.<partNumber> ----> created on putObjectPart

where <uploadId>.<eTag>.0 contains the following JSON object:

{
    "Bucket": "bucketName",
    "Object": "objectName"
}

Example

The following example shows 3 parts each for 2 concurrent uploads, with uploadIds

  1. 6e463bb8-35bd-4408-809e-78f509f558b3
  2. 7bed54f8-ad71-48b1-a4f8-bd2e2a8efbda

to mybucket/myobject.

.minio.sys.tmp
    ├── 6e463bb8-35bd-4408-809e-78f509f558b3.467886be95c8ecfd71a2900e3f461b4f.0
    ├── 6e463bb8-35bd-4408-809e-78f509f558b3.467886be95c8ecfd71a2900e3f461b4f.1
    ├── 6e463bb8-35bd-4408-809e-78f509f558b3.467886be95c8ecfd71a2900e3f461b4f.2
    ├── 6e463bb8-35bd-4408-809e-78f509f558b3.467886be95c8ecfd71a2900e3f461b4f.3
    ├── 7bed54f8-ad71-48b1-a4f8-bd2e2a8efbda.467886be95c8ecfd71a2900e3f461b4f.0
    ├── 7bed54f8-ad71-48b1-a4f8-bd2e2a8efbda.467886be95c8ecfd71a2900e3f461b4f.1
    ├── 7bed54f8-ad71-48b1-a4f8-bd2e2a8efbda.467886be95c8ecfd71a2900e3f461b4f.2
    └── 7bed54f8-ad71-48b1-a4f8-bd2e2a8efbda.467886be95c8ecfd71a2900e3f461b4f.3

6e463bb8-35bd-4408-809e-78f509f558b3.467886be95c8ecfd71a2900e3f461b4f.0 contains

{
    "Bucket": "mybucket",
    "Object": "myobject"
}

Pros

  • A single directory holds all ongoing uploads, including single-PUT objects

Cons

  • The maximum supported object name length is limited by platform-specific path-segment length limits; on GNU/Linux the limit (NAME_MAX) is 255.
@balamurugana

balamurugana commented Jul 20, 2017

  1. In GNU/Linux, the path segment is limited to 255 characters. Having <bucket-name> and <object-name> embedded in the part name limits the object name length.
  2. My guess is that the metadata provided at init-multipart-upload of the object is stored as an extended attribute of the <upload-id> directory in schema-1; it's not clear how this is handled in schema-2.

@krisis

krisis commented Jul 21, 2017

@balamurugana,

  1. In GNU/Linux, the path segment is limited to 255 characters. Having <bucket-name> and <object-name> embedded in the part name limits the object name length.

Yes, I shall add this in Cons for Schema-2.

  1. My guess is that the metadata provided at init-multipart-upload of the object is stored as an extended attribute of the <upload-id> directory in schema-1; it's not clear how this is handled in schema-2.

This document doesn't describe how metadata (currently stored in fs.json) will be handled. If extended attributes (EAs) are supported by the disk filesystem, we can save the metadata in EAs; otherwise, user metadata is not supported in such a setup. In both schemas we can save user metadata as an EA on the first part until the time of completeMultipartUpload.

@krisis

krisis commented Jul 22, 2017

The recent revision moved the bucket and object names into a JSON file (i.e., the "part-0" file) to get around the path-segment maximum length restriction.

@fwessels

I would go with Schema-1 since it is a little cleaner while not being much more complicated (or expensive in terms of performance).

Also, the 255-character limitation of Schema-2 could be a nasty one that Schema-1 doesn't have.

@krisis

krisis commented Jul 28, 2017

Here are some cases of interest and what happens in the proposed multipart backend.
[N.B. The background append process needs to be rewritten to not use fs.json to track the parts uploaded so far.]

  1. When a completeMultipartUpload request and an abortMultipartUpload request arrive at different minio instances around the same time.
    In this case it is possible for both requests to return success, which is incorrect.

  2. When two or more completeMultipartUpload requests arrive at different minio instances around the same time with different parts to be committed.
    In this case the "last (successful) writer" wins; the others may fail while appending parts, since a successful completeMultipartUpload removes the parts.

  3. When a completeMultipartUpload and a putObjectPart request arrive at different minio instances around the same time.
    There are two possibilities:
    a. the uploaded part is present in the completeMultipartUpload request
    - if putObjectPart didn't complete in time, completeMultipartUpload would fail with a part-missing error.

    b. the uploaded part isn't present in the completeMultipartUpload request
    - in this case the uploaded part doesn't affect the result of completeMultipartUpload. It is possible that the uploadId directory and this part alone are not cleaned up.

@krisis

krisis commented Jul 28, 2017

To prevent the above cases from adding/removing entries of the form uploadId/eTag.partNumber during completeMultipartUpload, abortMultipartUpload, and putObjectPart requests, we can rename the uploadId directory to (say) uploadId-1 at the start of a completeMultipartUpload/abortMultipartUpload request. This ensures that the first completeMultipart/abortMultipart proceeds successfully, while subsequent/concurrent requests fail with NoSuchUploadId.
