@snoyberg

snoyberg/package-proposal.md Secret

Last active Aug 29, 2015

Background

Mathieu Boespflug wrote a blog post about proposed improvements to the Haskell package central repository, targeting high reliability and availability among other goals. In parallel, Chris Done proposed improvements to package security (full proposal and update email). After some discussion, it seems that these two proposals complement each other quite well. This document is intended to be a concrete proposal to address both goals.

There is some additional work ongoing in the community that we're aware of. Since that work hasn't been publicly announced yet, I'm not going to describe it in this document. Since it addresses one component of the system, it may fit in with this proposal. If others are aware of ongoing work and they'd like it to be compatible with this proposal, please bring it up for discussion.

Goals

  • Allow package authors to release new versions of packages
  • Allow package authors and maintainers to update metadata for packages
  • Releases and updates are all authenticated and authorized to allow only a subset of people to perform these actions
  • Provide for a natural transition from current Hackage hosting to this new system
  • Provide for cryptographically secure distribution of packages and metadata
  • Continue with a central authority on who is authorized for various actions
    • Unlike the current Hackage approach, make this authorization system fully transparent, and allow non-authoritative sets of packages with different authorization rules
  • Minimal centralization: only the decision of who is authorized to perform actions must be centralized

Cryptography

This proposal is based strongly on cryptography. Instead of reinventing the wheel (badly), we will reuse existing cryptographic algorithms. In particular, we must choose a cryptographic hash algorithm and a signature system. Possible candidates are SHA512 for the hash and GPG for signatures.

Package version

A "package version" consists of the following:

  • A package name (following current package naming rules)
  • A version number (following current version rules)
  • A hash of the tarball contents

Each package version is stored in a single JSON file (to allow for hashing it)

Every package version must have at least one revision
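
As a rough illustration of the record described above, the sketch below serializes a package version to JSON and hashes the result with SHA512 (one of the candidates named in the Cryptography section). The field names (`name`, `version`, `tarball-hash`) and the canonical encoding (sorted keys, compact separators) are assumptions made for this example; the proposal does not fix a schema.

```python
import hashlib
import json


def package_version_json(name: str, version: str, tarball: bytes) -> bytes:
    """Serialize a package version record to deterministic JSON bytes.

    The field names are hypothetical; the proposal does not define a schema.
    """
    record = {
        "name": name,
        "version": version,
        "tarball-hash": hashlib.sha512(tarball).hexdigest(),
    }
    # Sorted keys and fixed separators keep the encoding stable, so the
    # same record always produces the same bytes (and the same hash).
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def content_hash(data: bytes) -> str:
    """Hex-encoded SHA512 of the given bytes."""
    return hashlib.sha512(data).hexdigest()


doc = package_version_json("foo", "1.2.3", b"example tarball bytes")
print(content_hash(doc))
```

A deterministic encoding matters here because the file is identified by its hash: two tools that serialize the same record differently would otherwise produce two distinct files.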

URL mapping

Each tarball contents hash may be associated with one or more URLs from which the content may be downloaded. Implementations MUST implement support for HTTP and HTTPS. The signature MUST be checked upon downloading and, if it doesn't match, the downloaded content MUST be deleted from the user's system as untrusted.
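
The check-then-delete rule could look roughly like the sketch below. It interprets the check as comparing the downloaded file against the tarball contents hash, which is an assumption on top of the text above; SHA512 is likewise only the candidate algorithm, and `verify_download` is a hypothetical helper name. Fetching over HTTP or HTTPS is left out so the example stays self-contained.

```python
import hashlib
import os
import tempfile


def verify_download(path: str, expected_sha512: str) -> bool:
    """Return True if the file at `path` hashes to `expected_sha512`.

    On a mismatch the file is deleted as untrusted and False is returned.
    """
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # Read in chunks so large tarballs need not fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    if digest.hexdigest() == expected_sha512.lower():
        return True
    os.remove(path)
    return False


# Demonstration with a temporary file standing in for a downloaded tarball.
with tempfile.TemporaryDirectory() as tmp:
    tarball = os.path.join(tmp, "foo-1.2.3.tar.gz")
    with open(tarball, "wb") as handle:
        handle.write(b"example tarball bytes")
    expected = hashlib.sha512(b"example tarball bytes").hexdigest()
    print(verify_download(tarball, expected))   # hash matches, file is kept
    print(verify_download(tarball, "0" * 128))  # hash differs, file is deleted
    print(os.path.exists(tarball))
```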

Package version revision

A "package version revision" consists of the following:

  • A package name
  • A version number
  • A revision number (integer >= 0)
  • The contents of a cabal file for that revision

Each package version revision is stored in a single JSON file (to allow for hashing it)

Authorization list

There are lists of authorization rules, saying who can do different activities. These activities are:

  • Release a package version
  • Release a package version revision
  • Grant these privileges to other people

Each authorization list is stored in a single JSON file (to allow for hashing it)

TODO: Perhaps we should have unique names for each authorization list and revision numbers, allowing someone to say "Hackage" and look up the newest revision for that list. The lists would also need to be identified by GPG public key fingerprint.

TODO: Figure out how complex these ACLs should be. Minimally: some users are allowed to upload package/versions, other users (Trustees, for instance) are allowed to upload revisions only. We may require multiple sign-offs, allow only some versions to be maintained by some people, etc. We probably want to define a DSL for this.
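
To make the minimal case from the TODO concrete, here is one possible shape for an authorization list: a mapping from GPG key fingerprint to the set of permitted activities. This is only a sketch, not the proposed format. The names `Action` and `AuthorizationList`, the placeholder fingerprints, and the flat (not per-package) rules are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Set


class Action(Enum):
    """The three activities listed in the Authorization list section."""
    RELEASE_VERSION = "release-version"
    RELEASE_REVISION = "release-revision"
    GRANT = "grant"


@dataclass
class AuthorizationList:
    # Maps a key fingerprint to the actions that key may perform.
    # Hypothetical structure: real rules would likely be per-package.
    grants: Dict[str, Set[Action]] = field(default_factory=dict)

    def allows(self, fingerprint: str, action: Action) -> bool:
        """True if the key is known and holds the given permission."""
        return action in self.grants.get(fingerprint, set())


acl = AuthorizationList(
    grants={
        # An uploader may release versions and revisions.
        "UPLOADER-FPR": {Action.RELEASE_VERSION, Action.RELEASE_REVISION},
        # A trustee may release revisions only.
        "TRUSTEE-FPR": {Action.RELEASE_REVISION},
    }
)
print(acl.allows("TRUSTEE-FPR", Action.RELEASE_VERSION))
```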

Signature

A package version, package version revision, and authorization list can each be signed via GPG. Each signature may be revoked using standard GPG revocation.

Storage

All of the content listed above can be uploaded arbitrarily. Cryptographic signatures are used to validate all data as coming from a trusted source. As such, no security need be placed on upload rights. Similarly, multiple storage mechanisms (such as Git) may be used for all of the above content. Content is identified uniquely via its hash, and therefore merging multiple storage repositories is a trivial copy.

End user tools may download the above data however desired (e.g., a Git clone).

Verifying signatures

Once data is downloaded, the end user tool MUST verify signatures and ignore data without a verified, authorized signature. Revocations MUST be respected when verifying signatures. This process works as follows:

  • End user specifies an authorization list to use
  • End user specifies a GPG public key to trust for signing that authorization list
  • Tool finds all authorization lists matching those criteria
  • Any lists matching those criteria with verified signatures are accepted and merged together
  • All package versions and package version revisions authorized by that combined list and with verified signatures are accepted

This produces a set of verified, authorized package versions and package version revisions which can be used by the end user.
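
The steps above can be sketched as two filters. This is a simplified model under several assumptions: an authorization list is reduced to a JSON object with an `authorized` array of key fingerprints (ignoring per-action rules), and signature checking is passed in as a function. The `toy_verify` stand-in is NOT real cryptography; an actual tool would call GPG and honor revocations there.

```python
import json
from typing import Callable, Iterable, List, Optional, Set, Tuple

# A signed item is a (payload, detached signature) pair.
Signed = Tuple[bytes, bytes]
# Hypothetical verifier standing in for GPG: returns the signer's key
# fingerprint when the signature is valid and unrevoked, otherwise None.
Verifier = Callable[[bytes, bytes], Optional[str]]


def merge_authorization_lists(
    lists: Iterable[Signed], trusted_key: str, verify: Verifier
) -> Set[str]:
    """Union the fingerprints from every list signed by the trusted key."""
    authorized: Set[str] = set()
    for payload, signature in lists:
        if verify(payload, signature) == trusted_key:
            authorized.update(json.loads(payload)["authorized"])
    return authorized


def accepted_items(
    items: Iterable[Signed], authorized: Set[str], verify: Verifier
) -> List[bytes]:
    """Keep only payloads whose verified signer is in the merged list."""
    return [p for p, s in items if verify(p, s) in authorized]


def toy_verify(payload: bytes, signature: bytes) -> Optional[str]:
    """Toy stand-in: the 'signature' is just b'<fingerprint>:ok'."""
    fingerprint, _, status = signature.decode().partition(":")
    return fingerprint if status == "ok" else None


lists = [
    (b'{"authorized": ["ALICE"]}', b"ROOT:ok"),
    (b'{"authorized": ["MALLORY"]}', b"OTHER:ok"),  # not the trusted key
]
authorized = merge_authorization_lists(lists, "ROOT", toy_verify)
items = [
    (b"foo-1.2.3", b"ALICE:ok"),
    (b"evil-0.1", b"MALLORY:ok"),  # signer is not authorized
    (b"bar-1.0", b"ALICE:bad"),    # signature does not verify
]
print(accepted_items(items, authorized, toy_verify))
```

Only the first item survives: the second is signed by a key that no trusted list authorizes, and the third fails verification.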

Resolving conflicts

The above setup allows multiple copies of a single package version or package version revision to come into existence. Upload tools should try to prevent this from happening, but the decentralized nature of this system naturally allows it to happen. If conflicts occur, the following (very arbitrary) system is used to disambiguate:

  • Hash of each potential copy is taken
  • Copy with lexicographically first hash is accepted, others are rejected
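
A minimal sketch of this tie-break, assuming SHA512 and lowercase hex encoding of the hash (the proposal fixes neither); `resolve_conflict` is a hypothetical name:

```python
import hashlib
from typing import Iterable


def resolve_conflict(copies: Iterable[bytes]) -> bytes:
    """Return the copy whose hex-encoded SHA512 sorts first.

    Raises ValueError if `copies` is empty.
    """
    return min(copies, key=lambda copy: hashlib.sha512(copy).hexdigest())


candidates = [b'{"revision": 0, "cabal": "A"}', b'{"revision": 0, "cabal": "B"}']
winner = resolve_conflict(candidates)
print(winner)
```

Because the choice depends only on the content of the copies, every client that sees the same set of conflicting files picks the same winner without any coordination.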

Usage

The above defines a set of files which requires no central authority to maintain, instead relying completely on signatures. Ultimately, authors and users will need to interact with this system. Due to the decentralized nature of things, many different systems could be built on top of this. Here's a proposal for the immediate-term Hackage replacement:

  1. There's a Git repository publicly cloneable that has all these data files
  2. There's a server that has commit access to that Git repository, and allows anyone to upload files to that repo
    • No signature validation is performed at this phase, but some basic sanity checks may be applied: valid JSON data, URLs contain the right SHAs, backup download links to S3, etc
  3. We'll have a well respected private GPG key used to create the main authorization list. Users will be encouraged to trust that one by default, but there will be nothing requiring that behavior. Additionally, if that key is somehow compromised, it's possible to switch to a new key
  4. When installing packages: no need to even talk to the server, just clone the Git repo
  5. Git repo could be hosted in many different places

File paths

The proposal above is purposely vague about the structure of the storage itself. Here's one approach we could take, to hopefully clarify the proposal above:

  • We'd store the JSON for a specific package foo version 1.2.3 with hash DEADBEEF at packages/foo/1.2.3/DEADBEEF.json. There could be multiple individual JSON files in that directory.
  • The signatures for that package would be stored at packages/foo/1.2.3/DEADBEEF/gpgfingerprint.asc
  • Similar paths would be used for revision, URL mapping, and authorization list files.
    • Remember that for each of these files, there will be 0 or more signature files
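
The layout in the first two bullets can be expressed as a pair of small path builders. The function names are made up for this example, and only the package version paths are covered, since the text does not spell out the paths for revisions, URL mappings, or authorization lists.

```python
from pathlib import PurePosixPath


def package_json_path(name: str, version: str, content_hash: str) -> PurePosixPath:
    """Path of the JSON file for one copy of a package version."""
    return PurePosixPath("packages") / name / version / f"{content_hash}.json"


def signature_path(
    name: str, version: str, content_hash: str, fingerprint: str
) -> PurePosixPath:
    """Path of one signer's detached signature for that JSON file."""
    return PurePosixPath("packages") / name / version / content_hash / f"{fingerprint}.asc"


print(package_json_path("foo", "1.2.3", "DEADBEEF"))
print(signature_path("foo", "1.2.3", "DEADBEEF", "gpgfingerprint"))
```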

@TobyGoodwin TobyGoodwin commented Apr 13, 2015

I want to express my support for the overall idea here. I have "business critical" haskell code that relies on the integrity of hackage. It's great that you are thinking about fixing the holes before there's a major security incident (at least that we know about...).

I think using GPG is a good plan. It works well for the Linux distros. Wonderful as the haskell community is, I doubt that it would succeed to implement PKI where everyone else has failed.

I'd strongly urge you to make the choice of cryptographic hash flexible: for example, everywhere that you have a field that is "crypto hash value", add a "crypto hash algorithm" field. (Possibly allow multiple hash values using different algorithms too?) This will encourage implementers to plan for the future. Hopefully SHA-512 has many years of life left, but history teaches us that all crypto hashes eventually succumb to the powerful combination of research and Moore's law.

@DaveCTurner DaveCTurner commented Apr 13, 2015

+1 from me. Definitely another important step in the direction of fully industrialising Haskell.

Use GPG. There's a lot of subtle and rare stuff it does that you'd have to reimplement otherwise; in particular, as new crypto algorithms arise in the future it will handle the transition nicely, so you don't have to make the crypto strength a design decision at this stage.

@magthe magthe commented Apr 17, 2015

I really like adding more meta data, hashes and multiple URIs for download, would be clear wins!
