@sergei-maertens
Last active July 26, 2019 09:51
Architecture of large-file uploads for DRC

(Large) file uploads in DRC

Definitions

Small file: < 3GB
Large file: > 3GB
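
The 3GB threshold is consistent with the < 4GB request body limit listed for inline uploads once base64 overhead is taken into account: base64 turns every 3 bytes into 4 characters. The document does not state this link explicitly, so treat the sketch below as an inference. `base64_encoded_size` is a hypothetical helper, and `GB` is assumed to mean 2**30 bytes.

```python
import base64

GB = 1024 ** 3  # assumption: the document does not say whether GB means 10**9 or 2**30


def base64_encoded_size(raw_size: int) -> int:
    """Return the length of the padded base64 encoding of `raw_size` bytes."""
    return 4 * ((raw_size + 2) // 3)


# Sanity check of the formula against the standard library on a small input.
assert base64_encoded_size(5) == len(base64.b64encode(b"x" * 5))

print(base64_encoded_size(3 * GB) / GB)  # 4.0
```

A 3GB file therefore already fills a 4GB body before any metadata is added.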

Possible situations:

Summary; each case is specified in more detail later in this document.

  • creating a new document:
    • small file -> inline, base64 encoded upload, together with metadata (= current situation)
    • large file -> multiple upload URLs, based on bestandsomvang
  • updating an existing document
    • updating only the metadata
    • updating the content itself with a small file -> inline
    • updating the content itself with a large file -> upload URLs

New document - small file

  1. POST /api/v1/enkelvoudiginformatieobjecten
  2. inhoud is base64 encoded
  3. the entire POST body must be < 4GB, or nginx/uwsgi throws an error (HTTP 413)
  4. [validation] the file size should match the provided bestandsgrootte field
  5. No BestandsDeel objects should be created
  6. The document is not locked, the entire creation takes place in a single API call and database transaction
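
The size validation in step 4 could look roughly like the sketch below. It assumes that bestandsgrootte refers to the size of the decoded file, not of the base64 string; `validate_inline_inhoud` is a hypothetical name, not part of the existing codebase.

```python
import base64
import binascii


def validate_inline_inhoud(inhoud_b64: str, bestandsgrootte: int) -> bytes:
    """Decode an inline base64 `inhoud` and check it against `bestandsgrootte`."""
    try:
        # validate=True rejects characters outside the base64 alphabet
        content = base64.b64decode(inhoud_b64, validate=True)
    except binascii.Error as exc:
        raise ValueError("inhoud is not valid base64") from exc

    # Assumption: bestandsgrootte is the decoded size in bytes.
    if len(content) != bestandsgrootte:
        raise ValueError(
            f"file size {len(content)} does not match bestandsgrootte {bestandsgrootte}"
        )
    return content
```

In a serializer this check would run before the document is saved, so that a mismatch fails the single API call and nothing is written.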

New document - large file

This flow applies when inhoud is empty; otherwise the previous situation (small file) applies.

  1. POST /api/v1/enkelvoudiginformatieobjecten with only metadata (and bestandsgrootte field)
    1. BestandsDeel objects are prepared
    2. the document is immediately assigned a lock ID to prevent others from messing with the content/upload
    3. the response contains the upload URLs and lock ID
  2. Client PUT /api/v1/bestandsdelen
    1. Upload inhoud
    2. [validation] inhoud.size must match grootte
    3. [validation] lock ID must be provided and match the lock set on the document itself
  3. If you retrieve the document, the response contains:
    • inhoud: null -> because the file uploads are not ready yet
    • locked: true
  4. Finalizing the upload is done by unlocking the document
    1. [validation] Validate that the lock ID is correct
    2. [validation] Validate that all part uploads are completed; if not, report which parts (URL and index) are incomplete
    3. Stitch the files together and assign to the document
    4. Remove the lock of the document
  5. Notifications: the create action notification should probably only be sent after the upload has completed, NOT when a locked document is initially created.
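
Steps 1 and 2 above can be sketched as follows. The document does not specify how bestandsgrootte is divided into parts, so the `max_part_size` parameter, the zero-based `index`, and the function names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BestandsDeel:
    index: int                      # assumption: parts are numbered from 0
    grootte: int                    # expected size of this part in bytes
    inhoud: Optional[bytes] = None  # stays None until the part is uploaded


def prepare_bestandsdelen(bestandsgrootte: int, max_part_size: int) -> List[BestandsDeel]:
    """Split the announced file size into parts of at most `max_part_size` bytes."""
    if bestandsgrootte <= 0 or max_part_size <= 0:
        raise ValueError("bestandsgrootte and max_part_size must be positive")
    parts: List[BestandsDeel] = []
    offset = 0
    while offset < bestandsgrootte:
        grootte = min(max_part_size, bestandsgrootte - offset)
        parts.append(BestandsDeel(index=len(parts), grootte=grootte))
        offset += grootte
    return parts


def upload_part(part: BestandsDeel, inhoud: bytes, lock: str, document_lock: str) -> None:
    """Apply the two validations of the part upload, then store the content."""
    if not lock or lock != document_lock:
        raise PermissionError("lock ID is missing or does not match the document lock")
    if len(inhoud) != part.grootte:
        raise ValueError(f"part {part.index}: got {len(inhoud)} bytes, expected {part.grootte}")
    part.inhoud = inhoud
```

For example, `prepare_bestandsdelen(10, 4)` yields three parts of 4, 4 and 2 bytes; each would be exposed to the client as one upload URL.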

Existing document - metadata only

Patch events

PATCH is straightforward; the existing machinery remains:

  1. Lock document
  2. PATCH only relevant metadata fields
    • [validation] inhoud may NOT be present
  3. Create new version and set same file as previous version
  4. Unlock document

PUT events

With PUT, the entire resource is replaced. The inhoud field can be left empty/absent.

  1. Lock document
  2. PUT resource, without inhoud
    • Create BestandsDeel for bestandsgrootte that's provided
    • response with the upload URLs in the body
  3. Client ignores the upload URLs - the parts remain empty
  4. Create new version and set same file as previous version if all upload parts are empty
  5. Unlock document
    • Discard the upload parts

Existing document - small file

Inline updates are allowed for small files

  1. Lock document
  2. PUT or PATCH to /api/v1/enkelvoudiginformatieobjecten/<uuid>
  3. inhoud is base64 encoded
  4. the entire PUT/PATCH body must be < 4GB, or nginx/uwsgi throws an error (HTTP 413)
  5. [validation] the file size should match the provided bestandsgrootte field (or be derived from it?)
  6. No BestandsDeel objects should be created
  7. Unlock document

Existing document - large file

Without inline file updates

  1. Lock document
  2. PUT or PATCH to /api/v1/enkelvoudiginformatieobjecten/<uuid>
    • include the bestandsgrootte: 100000 value to prepare BestandsDelen
    • response contains upload URLs
  3. Client PUT /api/v1/bestandsdelen
    1. Upload inhoud
    2. [validation] inhoud.size must match grootte
    3. [validation] lock ID must be provided and match the lock set on the document itself
  4. If you retrieve the document, the response contains:
    • inhoud: null -> because the file uploads are not ready yet
    • locked: true
  5. Finalizing the upload is done by unlocking the document
    1. [validation] Validate that the lock ID is correct
    2. [validation] Validate that all part uploads are completed; if not, report which parts (URL and index) are incomplete
    3. Stitch the files together and assign to the document
    4. Remove the lock of the document
  6. Notifications: the update or partial_update action notification should probably only be sent after the upload has completed, NOT when a locked document is initially updated with the metadata/preparation.
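
The finalization in step 5 can be sketched as below. This is a minimal in-memory illustration: the `BestandsDeel` dataclass, the `IncompleteUpload` exception and `unlock_and_finalize` are hypothetical names, and a real implementation would stream parts from storage instead of joining bytes in memory. The case where every part is empty (a metadata-only update) is deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BestandsDeel:
    index: int
    grootte: int
    inhoud: Optional[bytes] = None  # None means the part was not uploaded


class IncompleteUpload(Exception):
    """Raised on unlock when one or more parts are still missing."""

    def __init__(self, indexes: List[int]):
        super().__init__(f"incomplete parts: {indexes}")
        self.indexes = indexes


def unlock_and_finalize(parts: List[BestandsDeel], lock: str, document_lock: str) -> bytes:
    """Validate the lock, check completeness, and stitch parts in index order."""
    if lock != document_lock:
        raise PermissionError("incorrect lock ID")
    incomplete = [p.index for p in parts if p.inhoud is None]
    if incomplete:
        raise IncompleteUpload(incomplete)
    ordered = sorted(parts, key=lambda p: p.index)
    return b"".join(p.inhoud for p in ordered)
```

The returned bytes would be assigned to the document, after which the lock is removed and the notification can be sent.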

Summary:

The serializer for Documents must:

  • change behaviour on create based on inhoud being present or not. This means that the inhoud field is declared as optional.

  • metadata-only is detected as inhoud being empty/absent OR all upload parts being empty when the document is being unlocked.
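
The two rules above could be expressed as plain functions, as in the sketch below. It reads the metadata-only rule as: no inline inhoud was sent and none of the prepared parts (if any) received content. That reading is an interpretation, and both function names are hypothetical.

```python
from typing import Optional, Sequence


def uses_inline_upload(inhoud: Optional[str]) -> bool:
    """On create: non-empty inhoud means inline base64 upload, otherwise upload URLs."""
    return bool(inhoud)


def is_metadata_only(inhoud: Optional[str], parts_uploaded: Sequence[bool]) -> bool:
    """At unlock time: True when no new file content was supplied.

    `parts_uploaded` holds one flag per prepared BestandsDeel; it is empty
    when no parts were prepared (for example a PATCH without inhoud).
    """
    return not inhoud and not any(parts_uploaded)
```

With this reading, a PUT without inhoud whose upload URLs are all ignored and a PATCH without inhoud both end up keeping the file of the previous version.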

Complete action

This action is removed and merged into the unlock action.

Edge cases

Aborted upload

  1. Application locks document
  2. Upload is prepared (= BestandsDeel objects are created)
  3. What happens if an administrator forcibly unlocks the document?

Proposal: discard BestandsDeel objects, assign previous version of inhoud and unlock document.
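
A minimal sketch of this proposal, using an in-memory stand-in for the document. The `Document` dataclass, its fields and `force_unlock` are assumptions for illustration only; the real model and storage layer will differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Document:
    inhoud: Optional[bytes]
    lock: str = ""  # empty string means "not locked" in this sketch
    bestandsdelen: List[bytes] = field(default_factory=list)


def force_unlock(document: Document, previous_inhoud: Optional[bytes]) -> None:
    """Administrative unlock: drop pending parts and fall back to the previous content."""
    document.bestandsdelen.clear()      # discard BestandsDeel objects
    document.inhoud = previous_inhoud   # assign inhoud of the previous version
    document.lock = ""                  # unlock the document
```

Note that for a document that was newly created with upload URLs there is no previous version, so `previous_inhoud` would be `None`; the proposal does not say how that case should be handled.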
