@djspiewak
Last active June 16, 2023 15:10

A Workable Trust Model for sbt

I've been spending a lot of time recently thinking about trust as it relates to artifact publication. Some of this has led me to spend a bit more quality time with the implementations of sbt-pgp and sbt-gpg, some of it has led to raging despair and sadness, and some of it has led to what I think are productive lines of thought and potential action. This gist represents the last of those three.

All of this started with a question posed to me by Jakob Odersky: how can we make artifact verification usable and secure for everyone?

Note: I'm going to refer to the general library ecosystem as "Scala library authors". Really what I mean here is anyone who publishes to a public repository from which artifacts can be resolved, which in practice probably means Maven Central, Bintray, or a private Nexus.

Artifact Verification

Whenever a library author publishes to Maven Central, they are forced (by Sonatype's automated restrictions) to sign every file which comprises their published artifact. Bintray imposes no such restriction, and it's optional for private Nexus instances, but in general most people seem to do it anyway. Signing combines cryptographic hashing with public-key cryptography. Basically, if I sign a file with my GPG key, that signature contains a cryptographic checksum (such as SHA-1) of the file contents as well as a cryptographic assertion that the checksum was produced by the holder of my key. The cryptographic assertion is derived from the private key but can be verified by anyone, and the fact that it was derived from my private key can be checked by anyone who has my public key.

These signatures provide a strong cryptographic proof that the files in question were published by someone in control of my GPG private key and that the contents of those files have not been modified by some third party. This is a rather useful guarantee to provide in theory, since the files in question are compiled artifacts which will be included on someone's (perhaps many someones'!) classpaths. The artifacts represent code which random people will run, often in privileged server environments. If I were a nefarious individual attempting to compromise a decent chunk of the software industry, I would attempt to modify the contents of some popular upstream library and trick people into using my modified version rather than the original (which was presumably published by the trusted author of the library).

Now, we're going to leave aside the problem of how such a nefarious actor would trick people into using the modified version rather than the official one. There are several possibilities here, some (but not all!) of which are eliminated by the fact that Maven Central and Bintray both secure their connections with HTTPS. For the moment, let's just assume that a hypothetical nefarious actor is indeed capable of this kind of file injection, by whatever means. Remember that we're not necessarily dealing with hypotheticals here.

  • Very well-funded and highly motivated Russian agents have infiltrated the US political machine in several ways, many of which involved attacks on software infrastructure. There are some really interesting postmortem reports on the famous attack on the 2016 US presidential election, but this is far from the only instance. Microsoft recently reported an ongoing attack attempting to influence the 2018 US midterm elections.
  • There is strong evidence, mostly provided by the 2016 "Shadow Brokers" leak of their tooling, that the US intelligence agencies actively attempt to penetrate consumer software stacks for various purposes, with techniques that include hoarding zero-day vulnerabilities.
  • The Chinese government has been repeatedly caught attempting to exfiltrate intellectual property from US and European companies, often taking advantage of undisclosed software vulnerabilities and multi-stage attacks.
    • They've also been repeatedly caught attempting to infiltrate communications infrastructure with the apparent goal of monitoring dissidents.
  • Raw server infrastructure and computing power are in high demand from various sources, including the aforementioned nation-state actors, but also more banal threats attempting to skim CPU/GPU cycles for cryptocurrency mining or network infrastructure for launching DDoS attacks.

There are other examples, but these are just a few that come immediately to mind. And these are just the threats that we know of! As makers of software, we need to accept the fact that we are being targeted. The work we do must be secured, often against very motivated, well-funded and well-equipped adversaries, and our toolchain is definitely not too obscure or too difficult to exploit.

This argument applies especially strongly to libraries and frameworks, which provide an attacker with an avenue to infiltrate vast numbers of running software instances and companies. The Scala dependency ecosystem is quite vast and provides an exceptionally broad attack surface area with surprisingly significant downstream effects.

Artifact verification, in theory, allows us to reduce the problem of trusting an opaque binary JAR file that we're about to add to our classpath to a problem of trusting a person or organization. For example, I am confident in adding the specs2 compiled JAR to my classpath because I trust Eric Torreborre, and Eric Torreborre signed the JAR file (asterisk here; more on this in a bit). I trust Eric, therefore I trust things signed by Eric, therefore I trust specs2. As long as I'm able to somehow verify that Eric's key does in fact belong to Eric, and I trust that Eric's key has been appropriately safeguarded (i.e. no one else has stolen it), then I can trust Eric's JAR files.

Trust Rooting Rabbit Hole

Unfortunately, it's not so easy. There are two problems here:

  • How do I know that Eric's key is in fact in the possession of the real Eric Torreborre and no one else? Did I just get the key randomly off the internet? What if someone tricked me into downloading a different key that they control?
  • How do I actually verify the signatures of each and every one of my dependency JAR files? I'm certainly not going to do it all by hand; there are probably thousands of them! So now I need my build tool to do it for me, but how do I trust that my build tool is doing this verification correctly?

The first problem is really hard, and we're going to come back to it. The second problem is also extremely hard, and it is the gateway to a giant depressing rabbit hole of paranoia.

Here's the deal... The problem we're trying to solve is the fact that we don't necessarily trust artifacts resolved by sbt. Some nefarious actor may be tricking sbt into downloading compromised artifacts, and we want to close that loophole. Our build tool (sbt) needs to provide us with functionality which cryptographically verifies these artifacts and presumably sounds the alarm bells if something is off. But our build system (sbt) is itself composed of artifacts resolved by sbt! So if some nefarious actor is able to trick us into downloading compromised artifacts for our classpath, clearly they can also trick us into downloading compromised artifacts for our build tool, since those artifacts come from the same source (usually Maven Central)! Who watches the watchers?

So we need to somehow verify sbt itself and its upstream dependencies (which include Scala and Ivy, as well as a number of other libraries, not to mention all of sbt's own published artifacts). How can we do that? Well, the process of bootstrapping sbt and resolving all of its constituent core artifacts is handled by the sbt launcher, which is a self-contained executable JAR file downloaded from... Maven Central. Nuts.

Ok, so not only do we need the sbt launcher to verify artifacts, but we need something to verify the sbt launcher! The sbt launcher itself is downloaded by the sbt script we have installed on our systems. This script, in practice, comes from one of two sources: either Lightbend or Paul Phillips. But what if the script itself is compromised? How do we know that the nefarious actor hasn't tricked us into installing a script which verifies the sbt launcher incorrectly?

Well, we don't. Fortunately though, neither of the sbt script sources is Maven Central, Bintray, or anything at all related to the Scala ecosystem. They're shell scripts that people usually just curl onto their laptops or CI servers. That's still relatively easy to compromise, but it's at least an additional vector which must be attacked in addition to Maven Central/Bintray/Nexus/etc. Of course, this vector isn't necessarily hard to exploit (as was shown recently with Homebrew), but it's another stage.

Technically, we should continue down this rabbit hole further. We can't necessarily trust Homebrew without verifying it, and for that matter, how can we trust our shell, or our operating system, or our hardware? You see where this is going.

We've built a chain of hypothetical trust. As long as we get verification in place, we'll be able to trust the artifacts... if we trust sbt. We'll be able to trust sbt... if we trust sbt launcher. We'll be able to trust sbt launcher... if we trust the sbt script. And so on and so on.

Ultimately, I think it's reasonable to prune off this rabbit hole at the sbt script. We really have to trust curl and the package managers and so on. There's no other option. If curl is compromised, we're probably going to have bigger problems, and ultimately securing that stuff is not our job.

Making it Real

What we can do is secure sbt. We can make our hypothetical chain of trust a real chain of trust. Here's how.

sbt-launch.jar is signed upon publication and has been since sbt 0.10. This is good! We don't know anything about the key used to sign it. This is bad! It could be Joe Hacker from Virginia, for all we know. This is the very root of the trust ecosystem, and so we need to verify it.

The sbt script should hard code the fingerprint of the key used to sign sbt-launch.jar. Then, every time the script runs, right before it calls java -jar, it should use gpg to verify that the signature (contained in sbt-launch.jar.asc) is a) valid, and b) comes from the key which matches the hard-coded fingerprint. Even better, we could hardcode the public key itself into the script (GPG keys are not huge) to avoid truncated fingerprint attacks (which are quite common and surprisingly easy).

Of course, a nefarious actor could modify this hard-coded key in the sbt script, but we're trusting our package manager, our curl, our gpg, our $PATH, and our operating system, so we're going to assume that this attack vector is covered.

Once sbt-launch.jar has been verified, we can take things to the next stage. sbt-launch.jar itself needs to hard-code several keys, which it must use to verify all of the sbt core artifacts. This includes (just off the top of my head):

  • Scala
  • Ivy
  • The sprawling sbt binary modules
  • more stuff

All of these publisher keys need to be hard-coded into sbt-launch.jar (after appropriate verification by the sbt team), which is itself of course signed and verified by the sbt script. This is actually a surprisingly small amount of work – I could probably do it in an afternoon – and it won't need to change very often since sbt's upstream dependencies are relatively stable.
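
As a rough illustration (not sbt's actual implementation), pinning inside the launcher could amount to comparing full key fingerprints against a hard-coded allow list, for example with Bouncycastle's OpenPGP classes. The fingerprint value below is only a placeholder and the key-loading details are assumed:

import java.io.InputStream
import scala.jdk.CollectionConverters._
import org.bouncycastle.openpgp.PGPUtil
import org.bouncycastle.openpgp.jcajce.JcaPGPPublicKeyRingCollection
import org.bouncycastle.util.encoders.Hex

// Full (untruncated) fingerprints of the publisher keys the launcher trusts.
// Placeholder value; the real list would hold the Scala, Ivy, and sbt publisher keys.
val pinnedFingerprints: Set[String] =
  Set("27DDC0300B8C55E0FF6A14E8466304991E23B3D7").map(_.toUpperCase)

// True iff the supplied public key material matches one of the pinned fingerprints.
def isPinned(publicKeyMaterial: InputStream): Boolean = {
  val rings = new JcaPGPPublicKeyRingCollection(PGPUtil.getDecoderStream(publicKeyMaterial))
  rings.getKeyRings.asScala.exists { ring =>
    pinnedFingerprints.contains(Hex.toHexString(ring.getPublicKey.getFingerprint).toUpperCase)
  }
}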

This is only part of the work though. We still need to verify the artifacts themselves! This can be done via one of two mechanisms:

  • A magic, trusted (somehow...), top-level sbt plugin
  • A part of the sbt core

A magic top-level plugin would need to be verified by sbt-launch.jar in addition to all of the core sbt functionality, so I would submit that it's probably easier just to make this part of sbt core.

Either way, a checkArtifacts task needs to be exposed which verifies each dependency in turn using the associated .asc file, checking that the signature is valid and was made by a key which is trusted to publish that artifact (more on this below). If any artifacts fail this check, the task fails and the failures are printed. If everything passes, then the task passes. We can trust this task because we trust sbt-launch.jar to verify it, and we trust sbt-launch.jar because we trust the sbt script to verify it, and we trust the sbt script... because we do.
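
To make the shape of this concrete, here is a minimal sketch of how such a task might be wired (names are hypothetical, .asc files are assumed to be resolvable next to each jar, and the SignatureCheck helper is sketched after the next paragraph):

val checkArtifacts = taskKey[Unit]("Verify the signature of every resolved dependency")

checkArtifacts := {
  val log      = streams.value.log
  val report   = update.value
  val failures = report.allFiles.filterNot { jar =>
    val asc = new java.io.File(jar.getPath + ".asc")   // detached signature alongside the jar
    asc.exists() && SignatureCheck.verify(jar, asc)    // hypothetical verifier (sketched below)
  }
  failures.foreach(f => log.error(s"Unverifiable artifact: ${f.getName}"))
  if (failures.nonEmpty) sys.error(s"${failures.size} artifact(s) failed signature verification")
  else log.info(s"All ${report.allFiles.size} resolved artifacts verified")
}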

Note that sbt-launch.jar and checkArtifacts don't need to delegate to the external gpg executable. All we're doing is performing simple signature checks against hard-coded keys. Bouncycastle is a well-trusted Java library for performing these sorts of checks, and its failings as a basis for artifact signing are unrelated to signature verification. So we can just use that.
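
A minimal sketch of what that Bouncycastle-based check might look like, assuming detached, armored .asc signatures and a locally assembled collection of trusted public keys (the key ring path is purely illustrative):

import java.io.{File, FileInputStream}
import java.security.Security
import org.bouncycastle.jce.provider.BouncyCastleProvider
import org.bouncycastle.openpgp.{PGPPublicKeyRingCollection, PGPSignatureList, PGPUtil}
import org.bouncycastle.openpgp.jcajce.JcaPGPObjectFactory
import org.bouncycastle.openpgp.operator.jcajce.{JcaKeyFingerprintCalculator, JcaPGPContentVerifierBuilderProvider}

object SignatureCheck {
  Security.addProvider(new BouncyCastleProvider())

  // In the real design these would be the hard-coded / granted publisher keys; here we just load
  // whatever key ring happens to sit at an assumed path.
  private val trustedKeys: PGPPublicKeyRingCollection =
    new PGPPublicKeyRingCollection(
      PGPUtil.getDecoderStream(new FileInputStream(new File(sys.props("user.home"), ".sbt/trusted-keys.gpg"))),
      new JcaKeyFingerprintCalculator())

  def verify(artifact: File, signatureFile: File): Boolean = {
    val sigStream = PGPUtil.getDecoderStream(new FileInputStream(signatureFile))
    val sig = new JcaPGPObjectFactory(sigStream).nextObject().asInstanceOf[PGPSignatureList].get(0)
    val key = trustedKeys.getPublicKey(sig.getKeyID)   // null if the signing key isn't one we trust
    if (key == null) false
    else {
      sig.init(new JcaPGPContentVerifierBuilderProvider().setProvider("BC"), key)
      val in = new FileInputStream(artifact)
      try {
        val buf = new Array[Byte](8192)
        Iterator.continually(in.read(buf)).takeWhile(_ > 0).foreach(n => sig.update(buf, 0, n))
      } finally in.close()
      sig.verify()   // true iff the digest matches and was produced by the trusted key
    }
  }
}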

Anyway, our whole chain of trust comes down to trust of the sbt script and the gpg executable on our system. Sort of...

Who's Who?

There's a bit of a loophole that we've glossed over here. When the checkArtifacts task looks at a particular artifact, say org.specs2:specs2-core, it's relatively easy to verify that the specs2-core.jar.asc signature we got along with the specs2-core.jar file is valid, in that it does in fact describe the correct file contents. It's much harder to check the second part: that the signature was created by a key we trust to publish specs2. In other words, it's really, really hard for us to be sure that the key which signed specs2, claiming to be Eric Torreborre's, does in fact belong to Eric Torreborre and not to someone who just knows how to type those letters into gpg's key generator.

This is a hard problem. There have been a number of attempts to resolve this issue in the broader cryptography space. The most widely-cited is probably the GPG Web of Trust.

WoT was designed by a large number of very smart, very paranoid people over quite a long period of time. It's as close as you can get to bulletproof from a cryptography standpoint. There's just one problem: (almost) no one uses it. The GPG WoT is incredibly annoying from a usability standpoint, and its value diminishes catastrophically when people (like, say, nearly all of the Scala ecosystem library publishers) don't participate. To make matters worse, it requires participation not just from the library publishers, but also from all of the people consuming Scala libraries! Everyone needs to buy in, otherwise they can't verify the trustworthiness of keys and thus cannot verify the trustworthiness of the artifacts themselves.

Spoiler alert: this isn't going to happen.

To make matters even worse (as if they weren't already impossible), WoT-based validation of artifacts is missing a really critical component: authorization. GPG can theoretically use the WoT to verify that the key which signed specs2-core.jar does in fact belong to the real Eric Torreborre, but it can't verify that the real Eric Torreborre is someone we trust to publish specs2 specifically. Imagine that we were instead trying to verify scalatest-core.jar and we discovered that it was signed by the real Eric Torreborre (instead of the real Bill Venners). If we just rely on Web of Trust, our verification won't catch this situation, despite the fact that we most definitely do not trust Eric to be the one to publish ScalaTest (any more than we would trust Bill to publish Specs2). You see the problem. Even when it's working properly (which it almost never does), GPG's Web of Trust model just doesn't line up with what we need to verify in checkArtifacts.

So we need another answer. I've spent quite a bit of time thinking about this over the past month (really, over the past few years), and it occurred to me that we don't need to reinvent the wheel here. Verifying the trustworthiness of artifact signatures is actually very similar to a trust problem that all of us are extremely familiar with on a daily basis: verifying the trustworthiness of web hosts.

Certificate Authorities

SSL Certificate Authorities provide a workable solution for this problem. The idea is pretty simple: there are a small handful of root authorities who we "just trust". These authorities, represented by their public keys, are trusted to do anything. They are the ultimate source of truth on the internet when it comes to who is who. Their public keys are hard-coded into all major browsers and operating systems. Any website operator who wishes to obtain a trusted key for their website (say, scala-lang.org) must first go to one of these root certificate authorities and convince them that they are the real and trusted scala-lang.org organization. Certificates can be further delegated via subdomains. For example, the organization which controls the certificate for google.com is authorized to themselves authorize the mail.google.com domain, which itself could authorize the a.mail.google.com domain, and so on, all without consulting the root authority.

This is a very nice model, in general, and it's made all the more elegant by the natural nesting provided by DNS. It's certainly not without its problems – revocation is quite difficult and requires publication of increasingly-lengthy lists which browsers and operating systems must all know to "not trust" – but it is in general a very tested model developed by some very smart people.

I propose that we adopt this model. And fortunately, we already have a very natural root authority for the system: the key which was used to sign sbt-launch.jar. This is by definition the hard-coded root authority for the entire sbt ecosystem. It's the key that we just trust because we have to trust something, and so it forms the natural root of any certificate authority system for the ecosystem.

sbt's artifact verification should rely on signed statements by this root key and any keys which are delegated from it. A "signed statement" is likely in practice something like the following JSON (probably nospace formatted):

{
  "authorization": {
    "domain": "com.codecommit",
    "key": "27DDC0300B8C55E0FF6A14E8466304991E23B3D7",
    "grants": ["publication", "authorization", "revocation"]
  }
}

I would imagine "domain" would often correspond to a Maven groupId. In the above example, my personal key is being granted the authority to publish arbitrary artifacts under the "com.codecommit" groupId, as well as delegate authority to other keys under that same domain (e.g. I might authorize someone else to publish artifacts under that same groupId).

We can also have a special "domain" which is *, representing the root of the entire ecosystem. Thus, a root authority (in this case, the key which signs sbt-launch.jar) could delegate the ability to sign arbitrary authorities to additional keys, effectively authorizing additional root authorities. This seems likely to be useful when building compatible ecosystems with other, non-sbt build tools (more on this in a bit).

These signed statements, published in some way (we'll come back to this), allow us to form chains of trust. We can trust any artifact under some groupId g signed by key k iff there is some statement under domain d (where d covers g) signed by key k' which gives k authorization to publish under g, and k' is either the root key or there is some statement under domain d' signed by key k'' such that d' > d (that is, d' is an ancestor of d) and k'' is either the root key or the inductive case holds. Thus, verifying an artifact involves looking for a chain of cryptographically signed statements leading us back to the root key. These statements would of course be cached on the local system (the cache itself is not an attack vector, since we would verify the signatures upon each cache load). Incidentally, this chain forms a Merkle Tree.
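
Here is a small sketch of that inductive check, assuming the signature on each statement has already been verified and using hypothetical names throughout:

final case class Statement(domain: String, grantee: String, signer: String, grants: Set[String])

// d covers g when d is "*", equals g, or is an ancestor of g
// (e.g. "com.codecommit" covers "com.codecommit.shims").
def covers(d: String, g: String): Boolean =
  d == "*" || g == d || g.startsWith(d + ".")

// Can `key` authorize things under `domain`? Walk signer links back toward the root, guarding against cycles.
def canAuthorize(key: String, domain: String, stmts: List[Statement], root: String,
                 seen: Set[String] = Set.empty): Boolean =
  key == root || (!seen(key) && stmts.exists { s =>
    s.grantee == key && s.grants("authorization") && covers(s.domain, domain) &&
      canAuthorize(s.signer, s.domain, stmts, root, seen + key)
  })

// An artifact under groupId g signed by key k is trusted iff some statement grants k publication
// over a domain covering g, and that statement's signer chains back to the root.
def artifactTrusted(g: String, k: String, stmts: List[Statement], root: String): Boolean =
  stmts.exists { s =>
    s.grantee == k && s.grants("publication") && covers(s.domain, g) &&
      canAuthorize(s.signer, s.domain, stmts, root)
  }

// Example: the root (hypothetically the sbt-launch.jar signing key) delegates "*" to an
// intermediate, which in turn grants my key publication rights under com.codecommit.
val root = "ROOT-KEY"
val stmts = List(
  Statement("*", "INTERMEDIATE", root, Set("authorization", "publication", "revocation")),
  Statement("com.codecommit", "27DDC0300B8C55E0FF6A14E8466304991E23B3D7", "INTERMEDIATE", Set("publication"))
)
// artifactTrusted("com.codecommit", "27DDC0300B8C55E0FF6A14E8466304991E23B3D7", stmts, root) == true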

Publication

Publication of the signed statements can be done in any number of ways, though the easiest is simply tossing it into a .json file and putting that file onto an HTTP server. Here we can use a model similar to DNS, where there is a root published source of knowledge (which, note, doesn't need to itself be trusted since it is simply hosting cryptographic signatures!) that other sources can delegate to, providing caching and additional signatures as appropriate. I would envision most organizations and even a lot of private individuals might host delegating servers in this way.

Discovery of these servers could be done through configuration in an sbt setting, DNS-based discovery (e.g. the com.codecommit groupId corresponds to the codecommit.com domain, which could have a DNS entry pointing to a hosted authorities file), a combination of the two, or some as-yet-unconsidered option. One interesting possibility here would be using the <properties/> element of the Maven pom.xml file, which allows the insertion of arbitrary metadata about the author into the build. In this way, a published library could carry, self-contained, the authorization chain back to the root which gives the publisher the right to publish it. The hosting itself doesn't need to be trusted in any way because the statements themselves are signed, so untrusted mirrors are fine. All statements need to form a trusted chain to the root key, so they are cryptographically impossible to falsify.
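
For illustration only, the naive groupId-to-domain mapping mentioned above might look something like this (the actual discovery scheme, DNS record types, and well-known paths are all undecided):

def discoveryHost(groupId: String): Option[String] =
  groupId.split('.').toList match {
    case tld :: domain :: _ => Some(s"$domain.$tld")   // "com.codecommit" -> "codecommit.com"
    case _                  => None
  }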

Revocation

This merits some further thought, but obviously we will need a way of revoking keys that have become compromised. Again, HTTPS (and GPG) provides a model here: revocation lists synchronized through the same mechanism which synchronizes authorizations in the first place. Any path to the root which passes through a revoked key should be considered invalid.

One slight subtlety here is that revocation should be retroactive to a particular date. In this way, publications made before that date which rely on the revoked link in the Merkle Tree can still be considered valid, while publications after that date are invalidated, even if they occurred before the revocation itself was issued.
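
A tiny sketch of what date-scoped revocation could look like, with hypothetical types: a revoked link still validates anything published before the revocation's effective date, but nothing published on or after it.

import java.time.Instant

final case class Revocation(revokedKey: String, effectiveFrom: Instant)

def linkStillValid(signerKey: String, publishedAt: Instant, revocations: List[Revocation]): Boolean =
  !revocations.exists(r => r.revokedKey == signerKey && !publishedAt.isBefore(r.effectiveFrom))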

Alternate Roots

It might conceivably be useful to allow users to configure additional signing roots. I would imagine organizations might want to have some more control over this, or perhaps support for alternative ecosystems. Even just providing the tools for individuals who distrust centralized grant authorities is a worthwhile goal. Unfortunately, we cannot allow mutable configuration of signing roots from within the build tool, since this would corrupt the security model. Imagine if the signing roots were an sbt setting. Any drive-by plugin could modify that setting and corrupt the security of the whole chain.

Since we're already trusting the OS userspace (e.g. the sbt script itself), we can extend that trust to include the environment variables. Environment variables are not modifiable within the JVM (without using JNI), so we can safely use them to obtain an alternative root (or set of roots).
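
For example, the additional roots might be read once from an environment variable (name entirely hypothetical) rather than from any sbt setting, so nothing running inside the build can silently widen the trust set:

val extraRoots: List[String] =
  sys.env.get("SBT_EXTRA_TRUST_ROOTS")
    .map(_.split(':').map(_.trim).filter(_.nonEmpty).toList)
    .getOrElse(Nil)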

Bootstrapping

Unfortunately, some amount (read: a lot) of manual work is going to be involved in bootstrapping trust before checkArtifacts could be relied on to verify a build without false negatives. I'm sure most of the major Scala library authors could be brought on board very quickly, but there are hundreds of smaller, less-maintained libraries which would be harder to track down. Even more importantly, this affects the entire Java ecosystem, almost none of which use sbt.

We should probably consider any artifacts published before a certain timeframe (say, five or ten years ago) to be "grandfathered" and give their publishing key an automatic grant. Some library authors are simply not going to be reachable or interested in putting forth the effort to claim their groupId. More importantly, it is unlikely that any particularly old artifacts are involved in ongoing attacks, simply because such attacks would have either been uncovered or their underlying motivation obsoleted. Sonatype already does some cursory verification to ensure that someone has nominal rights to publish to the groupId before they grant credentials, so we can lean on that.

Beyond that, I think our immediate goal should be to get the Scala Community Build running with checkArtifacts. That's not a comprehensive test, but it does a good job of covering a very significant percentage of the Scala-relevant ecosystem. We should make sure it is very easy for people to report failing artifacts that are specific to their build, especially early on in this bootstrapping process. Eventually the onus of proper authorization can be shifted to the publishers, but for now it's on the tool makers.

The Broader Ecosystem

This is where things get really interesting. What has been described here is not so much a trust model for sbt as it is a trust model for the entire Maven ecosystem. Nothing in here is specific to Scala or to sbt. Really, what we're talking about is a distributed system, rooted by trust in build tools, for verifying the trustworthiness of arbitrary Maven artifacts.

I can imagine a system wherein each major build tool has its own root grant, allowing each to publish compatible authorizations. In this way, artifacts published by Maven could remain consumable and authorized within sbt, and artifacts published by sbt could be consumable and authorized within Gradle or Lein, and so on. Starting with sbt seems to make a lot of sense, since we have direct control here, but it makes a lot of sense to bring in the other build tools as soon as possible.

Rollout

The first step here would be to get the basic infrastructure in place. Decide on a very stable format for authorizations (we should consider permanent forward compatibility a hard requirement). Appropriately secure the root key (this shouldn't be sitting on anyone's laptop, or CI server for that matter). Carefully analyze the security model and threat vectors, then analyze them again. Get all the other major build tools on board, particularly Maven.

Once the basic infrastructure is in place, we can update the sbt scripts to verify sbt-launch.jar and get the sbt launcher itself to verify sbt. This will probably involve writing the checkArtifacts task and appropriately securing it in the core so it cannot be overridden like a normal task.

Then comes the hard part: getting people on-board. We should start encouraging people to add checkArtifacts to their CI build, starting with the Scala Community Build. Obviously, a lot of projects will fail on this task, so we'll need to do a lot of proactive grant issuance, especially to popular upstream projects.

Once we have a fairly critical mass of builds which pass checkArtifacts (including the entire SCB), we should take the next step and incorporate checkArtifacts into sbt's update task. This effectively changes the grant system from opt-in to opt-out. To that end, we can use an optional environment variable to convert check failures into warnings, if people wish to have an insecure build, but the default should definitely be security. This is obviously a compatibility-breaking feature, since builds could conceivably start failing, so we're talking about sbt 2.0 here.
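
As a rough sketch of that opt-out behavior (in reality this would live in sbt core rather than a build file, the environment variable name is hypothetical, and SignatureCheck is the helper sketched earlier):

update := {
  val report = update.value
  val log    = streams.value.log
  val strict = !sys.env.get("SBT_INSECURE_RESOLUTION").contains("1")
  val failed = report.allFiles.filterNot { jar =>
    val asc = new java.io.File(jar.getPath + ".asc")
    asc.exists() && SignatureCheck.verify(jar, asc)
  }
  if (failed.nonEmpty && strict)
    sys.error(s"${failed.size} artifact(s) failed signature verification")
  else
    failed.foreach(f => log.warn(s"Unverified artifact (insecure mode): ${f.getName}"))
  report
}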

Summary

To summarize what is being proposed here:

  1. Build a trust system similar to SSL CAs, rooted in trust of build tools
  2. Build tools themselves should be verified by hard-coded keys in their bootstrap mechanisms, thus delegating the trust to the OS package manager and userspace (which is itself shifting sand but we're not here to solve that problem)
  3. Propagate cryptographic grants using an untrusted distributed caching mechanism similar to DNS, perhaps incorporating metadata directly in the POM for self-contained reproducibility
  4. Get all the other major build tools on board, so we can secure the Maven ecosystem as a whole
  5. Start as opt-in with a secure build task which verifies dependencies and plugins
  6. Gradually transition to opt-out with enforced verification on dependency resolution, disabled by an environment variable

This is going to take several years to fully implement, but I think it's worth it given the vulnerability of the ecosystem and the damage which could be done by an adversary who is capable of compromising it.

@mdedetrich

So I am a bit late to the party here and I don't know what context this is, but we are dealing with a very similar problem in a totally different context, and that is the ASF (Apache Software Foundation).

As you may know, Pekko is currently in the ASF Incubator, and to put this bluntly, the way that the ASF does things is very different from how modern Scala/Java OSS works. Simplifying here, what the ASF sees as the source of truth is the source package (which is just a tar/zip of the project's sources). They don't just publish the sources, however; they also sign them. There are very strict rules about this, but basically the release needs to be built on a local machine (some people even build on an air-gapped machine), and the release manager signs an archive of the released sources, creating a detached signature. The important part here is that the public keys for these release managers are maintained in the project's KEYS file (https://svn.apache.org/repos/asf/kafka/KEYS is an example of one), and it's expected that, as a user of ASF software, you manually verify the sources you download against those public keys in the KEYS file. Only the existing committers/PMC members (i.e. the steering committee) of the project have permission to add more keys to an existing KEYS file, and there are techniques like web of trust to help out here (i.e. verifying a person exists in reality with an actual ID).

So where am I going with this? Long story short, there is a recurring discussion in the ASF about how to treat binaries, especially for projects which are libraries, where everyone resolves binaries rather than downloading sources (i.e. Pekko), and one thing that stuck out like a sore thumb is that we sign our jar artifacts as a standard practice, but no one really checks them. What I mean by that is that not even build tools like sbt or maven or gradle have an automatic way to resolve and verify the signatures, i.e. if someone downloads Pekko JARs via sbt from the official ASF repo then sbt should have the ability to verify those jars against the keys listed at https://github.com/apache/incubator-pekko/blob/main/KEYS.

While this doesn't solve all of the issues you describe in this gist, and it's a slight tangent, I do think that such a thing would be low-hanging fruit that goes a long way in addressing this general topic. On my bucket list of OSS things that I want to do, one is to add this capability to sbt. On a simple level you'd have a PartialFunction[ArtifactInfo, Seq[String]] where ArtifactInfo contains the group id, name, version, and anything else of use, and the Strings are the public keys to verify against, which lets you, for example, do something like this:

val keyBlock = "(?s)-----BEGIN PGP PUBLIC KEY BLOCK-----.*?-----END PGP PUBLIC KEY BLOCK-----".r
artifactSignatureVerification := {
  // `http.get` stands in for whatever HTTP client is used; a KEYS file can contain several key blocks
  case artifact if artifact.groupId == "org.apache.pekko" =>
    keyBlock.findAllIn(http.get("https://raw.githubusercontent.com/apache/incubator-pekko/main/KEYS").text).toSeq
}

Regarding building artifacts on CI, this is also something that was brought up in the ASF in the context of reproducible builds. The idea here is that you can use CI to build an artifact (contrary to how it's done now), but as part of the release process other people would build the same artifact locally and check that it matches what the CI built. This also assumes you are actually doing the staging + closing process as it's meant to be done (i.e. stage artifacts, people check, and then either promote or drop), and not how a lot of Scala OSS projects work.
