
@snarlysodboxer
Created July 13, 2018 20:14
Notes on scaling file uploads, storage, retrieval, & background processing.

Scaling file uploads, storage, retrieval, & background processing

Storage

Object-store systems like OpenStack Swift, MinIO, and Ceph are worth considering. They offer advantages in scalability and accessibility, plus features like arbitrary object metadata. Most of them also expose S3-compatible APIs if that makes integration easier.
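As a sketch of the S3-compatible route, an upload can record the version ID the store returns (the endpoint URL, bucket, and credentials setup here are illustrative placeholders; a versioned bucket is assumed):

```python
def upload_versioned(bucket, key, body,
                     endpoint_url="http://localhost:9000"):
    """Store an object via an S3-compatible API and return its version ID.

    Assumes bucket versioning is enabled. The endpoint URL is a
    placeholder for e.g. a local MinIO instance; credentials come from
    the usual boto3 config/environment.
    """
    import boto3  # deferred so the sketch can be read without boto3 installed

    s3 = boto3.client("s3", endpoint_url=endpoint_url)
    resp = s3.put_object(Bucket=bucket, Key=key, Body=body)
    # Record this version ID in the HTTP server's database.
    return resp.get("VersionId")
```

The returned `VersionId` is what the HTTP server's database would store per the versioning notes below.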

If we instead need or want to read files from a traditionally mounted filesystem, GlusterFS is one option worth considering. With its Distributed Replicated configuration, storage space can be scaled by adding more disks, and access speed by adding more nodes. It's supported by Kubernetes, supports replicated applications via the ReadWriteMany access mode, and has systems for asynchronous replication, usually used to sync data to another datacenter.

Resource versioning
  • Using an object store: versioning is built in, record the version ID in the HTTP server's database.
  • Using a file store: too many files in a single directory can cause operational bottlenecks. A directory tree in a form such as user_id/resource_id/version_id/filename.txt allows the relative filepath to be calculated on the fly via information in the HTTP server's database.
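A minimal sketch of computing that relative path on the fly (the field names are just the ones from the example above):

```python
def relative_path(user_id: str, resource_id: str,
                  version_id: str, filename: str) -> str:
    """Build a user_id/resource_id/version_id/filename path, so no single
    directory accumulates an unbounded number of entries."""
    parts = (str(user_id), str(resource_id), str(version_id), filename)
    # Reject components that could escape the tree or nest unexpectedly.
    if any("/" in p or p in (".", "..") for p in parts):
        raise ValueError("path components must not contain separators")
    return "/".join(parts)

# relative_path("42", "7", "3", "filename.txt") -> "42/7/3/filename.txt"
```

Each component comes from the HTTP server's database, so the path never needs to be stored explicitly.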

Post-upload background processing

We want to guarantee that certain actions are taken for each uploaded file: virus scanning, parsing and storing metadata, resizing images, etc. A message broker (AKA message queue) with acknowledgments and at-least-once delivery is a strong design choice here.

Using a message broker could look something like this:
Upload logic in HTTP server
  • Receive PUT request from user, do not return yet
  • Store file
    • If failure between this step and the completion of the next two is a concern, consider:
      • (using an object store): set a short TTL on the object, after which it will be automatically removed, then extend that TTL to infinite once the next two steps succeed.
      • (using a file store): first take out an etcd distributed lock named after the filename, version, and user ID. If the next steps fail, the PUT request fails and the user retries; the lock already exists but no file does, so continue with the upload. If the next steps succeed, remove the lock.
  • Add file info to the database to support GET requests and the like. Use transactions for replica safety.
  • Send a message containing the object ID or filepath to each action's queue (e.g. the scan_for_viruses queue, the load_metadata queue, etc.), waiting for acknowledgment that each message was received.
  • Return success to PUT request
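The upload steps above can be sketched with in-memory stand-ins for the object store, database, and broker (the `FakeBroker`, queue names, and `handle_put` signature are all illustrative, not a real broker API):

```python
import uuid

class FakeBroker:
    """In-memory stand-in for a broker that confirms each publish."""
    def __init__(self):
        self.queues = {}

    def publish(self, queue, message):
        self.queues.setdefault(queue, []).append(message)
        return True  # a real broker client would return a publisher confirm

def handle_put(body, db, store, broker,
               action_queues=("scan_for_viruses", "load_metadata")):
    # 1. Store the file (object store stand-in: a dict keyed by object ID).
    object_id = str(uuid.uuid4())
    store[object_id] = body
    # 2. Add file info to the database so GET requests can find it.
    db[object_id] = {"size": len(body), "processed": []}
    # 3. Publish to each action's queue, requiring confirmation.
    for queue in action_queues:
        if not broker.publish(queue, object_id):
            raise RuntimeError("publish not confirmed; fail the PUT")
    # 4. Only now report success to the client.
    return {"status": 200, "object_id": object_id}
```

Calling `handle_put(b"hello", {}, {}, FakeBroker())` leaves one message per action queue and a database row before the 200 is returned, matching the ordering above.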
Queue logic in processing worker
  • Receive message from queue, do not acknowledge yet
  • Process (e.g. scan for viruses, load metadata, etc.)
  • Inform the broader system of completion. Avoid the tight coupling that connecting directly to the HTTP server's database would create. A couple of options are:
    • (using an object store): update the object's metadata with info like "scanned for viruses on 7/12/2018", etc.
    • (using an object store or file store): send messages to separate "done" queues to which the HTTP server subscribes, updating its database accordingly (e.g. the scan_for_viruses_done queue, etc.)
  • Acknowledge original message, profit!
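The worker side can be sketched the same way; the `receive`/`publish_done`/`ack` callables stand in for a real broker client's API (illustrative names, not a specific library):

```python
def run_worker(receive, process, publish_done, ack):
    """One delivery cycle: process first, report completion, then ack.

    Acking last means a crash anywhere earlier leaves the message
    unacknowledged, so the broker will redeliver it.
    """
    delivery_tag, object_id = receive()   # receive; do not acknowledge yet
    result = process(object_id)           # e.g. scan for viruses
    publish_done(object_id, result)       # e.g. to the scan_for_viruses_done queue
    ack(delivery_tag)                     # only now acknowledge, profit!
    return result
```

The ordering is the whole point: moving `ack` any earlier would turn a worker crash into a silently dropped message.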

Notes

  • Failure anywhere in the processing workers will cause the message broker to redeliver the message that was not acknowledged.
  • Failure anywhere in the upload logic will cause a failed PUT request and the user will retry.
  • Both sides can be run in replica and scaled horizontally.
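To illustrate the first note, here is a toy broker that redelivers any message that was delivered but never acknowledged (hypothetical; a real broker redelivers after a connection drop, nack, or timeout):

```python
from collections import deque

class RedeliveringQueue:
    """Toy at-least-once queue: unacked deliveries go back on the queue."""
    def __init__(self):
        self._pending = deque()
        self._unacked = {}   # delivery tag -> message
        self._next_tag = 0

    def publish(self, message):
        self._pending.append(message)

    def receive(self):
        """Deliver a message; it stays unacked until ack() is called."""
        self._next_tag += 1
        message = self._pending.popleft()
        self._unacked[self._next_tag] = message
        return self._next_tag, message

    def ack(self, tag):
        del self._unacked[tag]

    def redeliver_unacked(self):
        """What a broker does when a worker dies without acking."""
        while self._unacked:
            tag, message = self._unacked.popitem()
            self._pending.append(message)
```

Because the same message can therefore arrive twice, processing steps should be idempotent: re-scanning a file or re-writing the same metadata must be safe.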