Tooltool in S3

Requirements

A reliable, secure system for downloading large artifacts used during the build process.

Automated build processes and developers (with credentials, if required) download sets of related files using a "manifest" that specifies the sha512 digest and destination filename of each required file. Files are cached locally and downloaded only when absent from the cache. Digests are verified on every run. Last-accessed times are kept for all files.
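For illustration, a manifest entry might look like the following, shown here as the equivalent Python structure (the on-disk format is JSON; the exact field names are assumptions based on the description above):

```python
# One manifest entry per required file (the on-disk manifest is a JSON
# list; these field names are illustrative).
manifest = [
    {
        "filename": "gcc.tar.xz",      # destination filename
        "algorithm": "sha512",
        "digest": "19f6f4d5...e3a1",   # sha512 of the contents, truncated here
        "size": 84619520,              # bytes, as a quick sanity check
    },
]
```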

Each file has a visibility level -- public or internal. public files are available for anyone to download. internal files require identification as an employee or an internal automated build process. The name "internal" is chosen to correspond to the data-security level of this data.

Authorized users can upload new files in batches tagged with a username and a message (similar to a commit message). Each file in such a batch is tagged with its visibility level. Storage is deduplicated, so uploading a batch containing a file that already exists requires no additional storage. Existing files with conflicting visibility levels cause the batch to be rejected. It's possible to find all uploads which contained a given file.
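Concretely, a batch might carry a structure like this (again shown as a Python literal; the wire format would be JSON, and all field names here are assumptions):

```python
# Hypothetical shape of an upload batch.
batch = {
    "author": "djmitche",
    "message": "clang 3.6 builds for linux64",
    "files": {
        "clang.tar.xz": {
            "algorithm": "sha512",
            "digest": "19f6f4d5...e3a1",   # truncated for readability
            "size": 104857600,
            "visibility": "internal",      # or "public"
        },
    },
}
```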

Implementation

All files are stored in Amazon S3, across at least two regions. Access is managed by a RelengAPI blueprint.

To download a file, the client sends a request (with authentication, if required) to RelengAPI containing the digest and (optionally) a preferred AWS region name. If everything checks out, RelengAPI responds with a 302 redirect to a signed S3 download URL for the file. Direct, unauthenticated access to the files, even public ones, is not permitted. The signing only takes place if the authenticated user has the proper RelengAPI permissions, although anonymous users are allowed access to public files. The client then fetches the file -- if it is not in the local cache -- from the given URL. It's important that the client makes a request for every file in its manifest, even those available locally, as this allows accurate tracking of last-used dates.
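A minimal sketch of that client-side flow, assuming a hypothetical base URL and endpoint path (the real RelengAPI routes may differ):

```python
import hashlib
import os

import requests

# Hypothetical base URL; the real deployment path may differ.
RELENGAPI = "https://api.pub.build.mozilla.org/tooltool"

def fetch(entry, cache_dir, token=None, region=None):
    """Fetch one manifest entry via RelengAPI, using the local cache when possible."""
    path = os.path.join(cache_dir, entry["digest"])
    headers = {"Authorization": "Bearer " + token} if token else {}
    params = {"region": region} if region else {}
    # Always ask the server, even on a cache hit, so last-used dates stay accurate.
    resp = requests.get("%s/sha512/%s" % (RELENGAPI, entry["digest"]),
                        headers=headers, params=params, allow_redirects=False)
    resp.raise_for_status()
    if os.path.exists(path):
        # Digests are verified on every run, even for cached files.
        with open(path, "rb") as f:
            if hashlib.sha512(f.read()).hexdigest() == entry["digest"]:
                return path
        os.unlink(path)  # corrupt cache entry; fall through and re-download
    signed_url = resp.headers["Location"]  # the 302 target: a time-limited signed S3 URL
    data = requests.get(signed_url).content
    if hashlib.sha512(data).hexdigest() != entry["digest"]:
        raise RuntimeError("digest mismatch for %s" % entry["filename"])
    with open(path, "wb") as f:
        f.write(data)
    return path
```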

To upload a batch, the client sends a request (again with authentication) to RelengAPI containing a structure similar to a manifest, along with a batch message and per-file visibilities. RelengAPI verifies that the user has permission to upload files (a more restricted set of users than for downloads, and dependent on the requested visibilities), and responds with a structure containing, for each file, either a signed upload URL or an indication that the file already exists. The client then uses the upload URLs to upload the missing files. Once the upload is complete, the client sends a courtesy notification to RelengAPI indicating that the files are uploaded.
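The corresponding client-side sketch, again with hypothetical endpoint paths and response fields:

```python
import requests

RELENGAPI = "https://api.pub.build.mozilla.org/tooltool"  # hypothetical base URL

def upload_batch(batch, paths_by_digest, token):
    """Run the upload protocol described above (endpoint paths and response
    fields are assumptions, not the real RelengAPI API)."""
    headers = {"Authorization": "Bearer " + token}
    resp = requests.post(RELENGAPI + "/upload", json=batch, headers=headers)
    resp.raise_for_status()
    for digest, info in resp.json()["files"].items():
        if "put_url" in info:  # not yet in S3; upload via the signed URL
            with open(paths_by_digest[digest], "rb") as f:
                requests.put(info["put_url"], data=f).raise_for_status()
        # else: the file already exists; deduplication means nothing to transfer
    # Courtesy notification: lets the server verify the new files right away
    # instead of waiting for its next poll (the batch id field is an assumption).
    requests.get(RELENGAPI + "/upload/complete/%s" % resp.json()["id"],
                 headers=headers).raise_for_status()
```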

Internally, the RelengAPI blueprint keeps track of which files are available in which regions. On receipt of an upload request, none of the files are on S3 yet, and if the client fails they may never be. So the server polls for successful uploads, short-circuiting that poll when it receives a courtesy notification. When an upload is complete, the server verifies its digest and size before making it available for download. The blueprint also runs a regular task that looks for inadequately checked or replicated files: it downloads and hashes files as necessary, deleting any whose hashes don't match, and for files not yet in all regions it uses the S3 object-copy operation to copy them to the required regions. Once these verifications and copies are complete, the new locations are recorded in the DB and handed out to clients.
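A sketch of that server-side task, using the classic boto library (an assumption; the bucket-naming and DB helpers are hypothetical):

```python
import hashlib

import boto.s3  # classic boto, current at the time of writing (an assumption)

ALL_REGIONS = ["us-east-1", "us-west-2"]

def check_pending_upload(file_row, db):
    """Poll for a pending upload; on success, verify and replicate it.
    bucket_for() maps a region to its bucket name (hypothetical helper)."""
    src_region = file_row.upload_region  # where the signed PUT URL pointed
    src = boto.s3.connect_to_region(src_region).get_bucket(bucket_for(src_region))
    key = src.get_key(file_row.digest)
    if key is None:
        return  # client hasn't finished uploading; try again next poll
    body = key.get_contents_as_string()
    if key.size != file_row.size or hashlib.sha512(body).hexdigest() != file_row.digest:
        key.delete()  # bad upload: never make it available for download
        return
    db.record_instance(file_row.digest, src_region)  # hypothetical DB helper
    for region in ALL_REGIONS:
        if region != src_region:
            dest = boto.s3.connect_to_region(region).get_bucket(bucket_for(region))
            # Server-side object copy; the bytes never leave AWS.
            dest.copy_key(file_row.digest, bucket_for(src_region), file_row.digest)
            db.record_instance(file_row.digest, region)
```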

The administrative UI should allow deletion of individual files as well as manual changes to the visibility level of a file (all via properly authenticated API calls, of course).

Aging Out

At some point, files will go unused and we'll want to get rid of them. This won't be implemented in the first go-round, but the plan is as follows. Each file has an associated TTL (which can be specified for a whole batch or individually on upload). If a file has not been accessed within its TTL, it becomes eligible for deletion during a periodic cleaning process. When a file is referenced in multiple batches, the largest TTL wins. This means that a developer iterating on, say, fixing a compiler bug can upload a series of batches with a 1-day TTL, then re-upload the final batch containing the working compiler with a 1-year TTL. No actual data will be uploaded, but the TTLs of the associated files will be updated to at least 1 year.
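The eligibility rule and TTL-merging behavior amount to something like this (attribute names are illustrative):

```python
import datetime

def eligible_for_deletion(file_row, now=None):
    """True once the file's TTL has elapsed since its last access."""
    now = now or datetime.datetime.utcnow()
    return now - file_row.last_used > datetime.timedelta(seconds=file_row.ttl)

def apply_batch_ttl(file_row, batch_ttl):
    # A file referenced by several batches keeps the largest TTL, so
    # re-uploading a batch with a longer TTL extends the file's life
    # without transferring any data.
    file_row.ttl = max(file_row.ttl, batch_ttl)
```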

TODO

  • Server
    • Download endpoint with 302 redirects
    • Upload support
      • Add upload date
    • Verification of new uploads and courtesy notification endpoint
    • Permissions checking
    • Verify upload author matches the logged-in user
    • File visibility support
    • Verify ACLs, etc. on each object when verifying a new upload
    • Automatically replicate between regions
    • Bring usage and deployment docs up to date
    • API / UI
      • API for managing files (CRUD)
      • UI to browse files and batches
      • UI to upload files
      • UI to admin files (delete, change visibility, etc.)
    • (1137793) access and modification logging, into MozDef
    • Automatically create buckets on start
    • Add a config option to allow anonymous users to download public files
  • Client
    • Authentication support (token only)
    • Add optional support for specifying a download region
      • --region=auto to discover region from instance metadata
    • Replace upload functionality with upload batches
  • Later (so, file bugs..)
    • Track files' last-used date
    • Support for aging out
djmitche commented Mar 9, 2015

Per legal, the restricted files we know about need to be limited to employees, rather than contributors.
