Mathieu Boespflug wrote a blog post about proposed improvements to the central Haskell package repository, targeting high reliability and availability among other goals. In parallel, Chris Done proposed improvements to package security (full proposal and update email). After some discussion, it seems that these two proposals complement each other quite well. This document is intended to be a concrete proposal to address both goals.
There is some additional work ongoing in the community that we're aware of. Since that work hasn't been publicly announced yet, I'm not going to describe it in this document. Since it addresses one component of the system, it may fit in with this proposal. If others are aware of ongoing work and they'd like it to be compatible with this proposal, please bring it up for discussion.
- Allow package authors to release new versions of packages
- Allow package authors and maintainers to update metadata for packages
- Releases and updates are all authenticated and authorized to allow only a subset of people to perform these actions
- Provide for a natural transition from current Hackage hosting to this new system
- Provide for cryptographically secure distribution of packages and metadata
- Continue with a central authority on who is authorized for various actions
- Unlike the current Hackage approach, make this authorization system fully transparent, and allow non-authoritative sets of packages with different authorization rules
- Minimal centralization: only the decision of who is authorized to perform actions must be centralized
This proposal is based strongly on cryptography. Instead of reinventing the wheel (badly), we will reuse existing cryptographic algorithms. In particular, we must choose a cryptographic hash algorithm and a signature system. Possible proposals for each are SHA512 and GPG.
A "package version" consists of the following:
- A package name (following current package naming rules)
- A version number (following current version rules)
- A hash of the tarball contents
Each package version is stored in a single JSON file (to allow for hashing it)
Every package version must have at least one revision
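As a rough illustration of what such a file might look like, the following Python sketch builds a package version record and hashes its serialized form. The field names (`name`, `version`, `tarball_sha512`) and the canonical serialization (sorted keys, compact separators) are assumptions made for this example; the proposal does not fix either.

```python
import hashlib
import json


def make_package_version(name, version, tarball_bytes):
    """Build a package version record (field names are hypothetical)."""
    return {
        "name": name,
        "version": version,
        "tarball_sha512": hashlib.sha512(tarball_bytes).hexdigest(),
    }


def serialize(record):
    """Serialize deterministically so the same record always hashes the same."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def record_hash(record):
    """SHA-512 of the serialized JSON, usable as the record's identifier."""
    return hashlib.sha512(serialize(record)).hexdigest()


pv = make_package_version("foo", "1.2.3", b"example tarball contents")
print(record_hash(pv))
```

Serializing with sorted keys matters here: two tools that write the same fields in a different order would otherwise produce different hashes for what is logically the same record.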
Each tarball contents hash may be associated with one or more URLs from which the content may be downloaded. Implementations MUST implement support for HTTP and HTTPS. The signature MUST be checked upon downloading and, if it doesn't match, the downloaded content MUST be deleted from the user's system as untrusted.
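One way to read that requirement is sketched below in Python, assuming the check compares the downloaded bytes against the SHA-512 tarball contents hash from the package version file. The helper name `verify_download` and its file-based interface are hypothetical, and fetching over HTTP or HTTPS is left out.

```python
import hashlib
import os


def verify_download(path, expected_sha512):
    """Return True if the file at `path` matches the expected hash.

    On a mismatch the file is deleted, since it cannot be trusted.
    """
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # Read in chunks so large tarballs need not fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    if digest.hexdigest() == expected_sha512.lower():
        return True
    os.remove(path)
    return False
```

Because the content is identified by its hash, it does not matter which of the associated URLs served the file; any mirror that returns matching bytes is equally acceptable.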
Package version revision
A "package version revision" consists of the following:
- A package name
- A version number
- A revision number (integer >= 0)
- The contents of a cabal file for that revision
Each package version revision is stored in a single JSON file (to allow for hashing it).
There are lists of authorization rules, saying who can do different activities. These activities are:
- Release a package version
- Release a package version revision
- Grant these privileges to other people
Each authorization list is stored in a single JSON file (to allow for hashing it)
TODO: Perhaps we should have unique names for each authorization list and revision numbers, allowing someone to say "Hackage" and look up the newest revision for that list. The lists would also need to be identified by GPG public key fingerprint.
TODO: Figure out how complex these ACLs should be. Minimally: some users are allowed to upload package/versions, other users (Trustees, for instance) are allowed to upload revisions only. We may require multiple sign-offs, allow only some versions to be maintained by some people, etc. We probably want to define a DSL for this.
A package version, package version revision, and authorization list can each be signed via GPG. Each signature may be revoked using standard GPG revocation.
All of the content listed above can be uploaded arbitrarily. Cryptographic signatures are used to validate all data as coming from a trusted source. As such, no security need be placed on upload rights. Similarly, multiple storage mechanisms (such as Git) may be used for all of the above content. Content is identified uniquely via its hash, and therefore merging multiple storage repositories is a trivial copy.
End user tools may download the above data however desired (e.g., a Git clone).
Once data is downloaded, the end user tool MUST verify signatures and ignore data without a verified, authorized signature. Revocations MUST be respected when verifying signatures. This process works by:
- End user specifies an authorization list to use
- End user specifies a GPG public key to trust for signing that authorization list
- Tool finds all authorization lists matching those criteria
- Any lists matching those criteria with verified signatures are accepted and merged together
- All package versions and package version revisions authorized by that combined list and with verified signatures are accepted
This produces a set of verified, authorized package versions and package version revisions which can be used by the end user.
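The steps above can be sketched in Python as follows. Real GPG verification is out of scope for a short example, so signature checking is passed in as a `verify` callback. That callback, the dictionary shapes, and the field names (`list_name`, `signer`, `uploaders`, `package`) are all assumptions for illustration, and revocation handling is assumed to live inside `verify`.

```python
def accepted_content(auth_lists, items, list_name, trusted_key, verify):
    """Return the items that are both signed and authorized.

    auth_lists: dicts with "list_name", "signer", "uploaders" (hypothetical shape),
                where "uploaders" maps a package name to a set of key fingerprints.
    items:      dicts with "package" and "signer" (package versions or revisions).
    verify:     callable(obj) -> bool; stands in for GPG signature and
                revocation checking.
    """
    # Keep only lists with the requested name, signed by the trusted key,
    # whose signature verifies.
    matching = [
        a for a in auth_lists
        if a["list_name"] == list_name
        and a["signer"] == trusted_key
        and verify(a)
    ]

    # Merge all accepted lists into one combined set of rules.
    combined = {}
    for a in matching:
        for package, keys in a["uploaders"].items():
            combined.setdefault(package, set()).update(keys)

    # Accept only content that is authorized by the combined list
    # and carries a verified signature.
    return [
        item for item in items
        if item["signer"] in combined.get(item["package"], set())
        and verify(item)
    ]
```

This sketch models only a single "may upload for this package" permission. The distinction between releasing versions and releasing revisions, and the right to grant privileges to others, would need a richer rule format than shown here.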
The above setup allows multiple copies of a single package version or package version revision to come into existence. Upload tools should try to prevent this from happening, but the decentralized nature of this system naturally allows it to happen. If conflicts occur, the following (very arbitrary) system is used to disambiguate:
- Hash of each potential copy is taken
- Copy with lexicographically first hash is accepted, others are rejected
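A small Python sketch of this tie-break, assuming SHA-512 as the hash and comparing lowercase hex digests as strings; `pick_copy` is a hypothetical helper name.

```python
import hashlib


def pick_copy(copies):
    """Return the copy (bytes) whose SHA-512 hex digest sorts first."""
    return min(copies, key=lambda data: hashlib.sha512(data).hexdigest())
```

Because the result depends only on the contents, every client picks the same copy regardless of the order in which it encountered them.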
The above defines a set of files which requires no central authority to maintain, instead relying completely on signatures. Ultimately, authors and users will need to interact with this system. Due to the decentralized nature of things, many different systems could be built on top of this. Here's a proposal for the immediate-term Hackage replacement:
- There's a Git repository publicly cloneable that has all these data files
- There's a server that has commit access to that Git repository, and allows anyone to upload files to that repo
- No signature validation is performed at this phase, but some basic sanity checks may be applied: valid JSON data, URLs contain the right SHAs, backup download links to S3, etc
- We'll have a well respected private GPG key used to create the main authorization list. Users will be encouraged to trust that one by default, but there will be nothing requiring that behavior. Additionally, if that key is somehow compromised, it's possible to switch to a new key
- When installing packages: no need to even talk to the server, just clone the Git repo
- Git repo could be hosted in many different places
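The basic sanity checks mentioned for the upload server could look something like the Python sketch below. It covers only "is this valid JSON with the expected shape"; the required field names are hypothetical, and checks on URLs or backup download links are not attempted because the proposal does not define them in enough detail.

```python
import json

REQUIRED_FIELDS = {"name", "version", "tarball_sha512"}  # hypothetical field names


def sanity_check(raw_bytes):
    """Return a list of problems found in an uploaded package version file."""
    try:
        record = json.loads(raw_bytes.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as err:
        return ["not valid JSON: %s" % err]
    if not isinstance(record, dict):
        return ["top-level value is not a JSON object"]
    problems = []
    missing = REQUIRED_FIELDS - set(record)
    if missing:
        problems.append("missing fields: " + ", ".join(sorted(missing)))
    digest = record.get("tarball_sha512")
    if digest is not None and not (
        isinstance(digest, str)
        and len(digest) == 128
        and all(c in "0123456789abcdef" for c in digest)
    ):
        problems.append("tarball_sha512 is not a 128-character lowercase hex string")
    return problems
```

Note that none of this involves signatures: a file that passes these checks is merely well-formed, and it is still up to end user tools to decide whether it is trusted.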
The proposal above is purposely vague about the structure of the storage itself. Here's one approach we could take, to hopefully clarify the proposal above:
- We'd store the JSON for a specific package foo version 1.2.3 with hash DEADBEEF at packages/foo/1.2.3/DEADBEEF.json. There could be multiple individual JSON files in that directory.
- The signatures for that package would be stored at
- Similar paths would be used for revision, URL mapping, and authorization list files.
- Remember that for each of these files, there will be 0 or more signature files
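To make the layout concrete, here is a hypothetical Python helper that computes the path for a package version file following the packages/foo/1.2.3/DEADBEEF.json example above. The function name is an assumption, and paths for signatures, revisions, URL mappings, and authorization lists are deliberately left out because the proposal does not spell them out.

```python
import posixpath


def package_version_path(name, version, content_hash):
    """Path of a package version JSON file inside the repository."""
    return posixpath.join("packages", name, version, content_hash + ".json")


print(package_version_path("foo", "1.2.3", "DEADBEEF"))
# prints: packages/foo/1.2.3/DEADBEEF.json
```

Since the file name is derived from the content hash, two repositories holding the same record will store it at the same path, which is what makes merging them a plain copy.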