terabyte/amazon.md

## amazon.md

      
    Raw
  

              amazon.md
            
          
    Prologue

I wrote this answer on stackexchange, here:
https://stackoverflow.com/posts/12597919/
It was wrongly deleted for containing "proprietary information" years later.  I think that's bullshit so I am posting it here.  Come at me.
The Question

Amazon is a SOA system with 100s of services (or so says Amazon Chief Technology Officer Werner Vogels). How do they handle build and release?
Anyone know what they use or the overall structure? I'll bet it would be very interesting to read about their lessons learned.
I worked on Amazon's Build Team for approximately 3 years, from 2006 to 2009. Amazon has produced a world-class build and deployment system which is pretty difficult to imagine and unmatched in open source. It is really sad (and a personal goal of mine to change) that there is nothing even close that is freely available.
My Answer

It is my assertion that the details included in this post are, though possibly not obvious, inevitable discoveries of any sufficiently large engineering organization and the only way to build and deploy quality software in a massive SOAservice-oriented architecture (SOA) environment where one team blocking on another is considered more harmful than integration pain (for an alternative perspective, see any organization which hates topic branches and insists everyone commit directly to trunk and "integrate early" - works less well for SOA). As such, I do not consider this level of detail to be proprietary information, especially combined with the fact that it is 3 years old.

Reproducibility - They should guarantee the ability to reproduce any artifact that has ever been produced in the past (or any artifact tagged as "released" at a minimum). Furthermore, they should guarantee the ability to produce any artifact with a known delta - i.e. "this exact version, but with only this one bug fixed". This is critical to know you are only making the minimum change, and not introducing risk into a production system.
Consistency - In the past (though to a much lesser extent these days) Amazon had a lot of C/C++ code. Regardless of the language you use (but especially in languages where ABI compatabilitycompatibility is "a thing") you need to guarantee consistency. This means knowing that a particular set of artifacts all work together and are binary compatible. In C/C++ this means ABI compatibility and in Java this might mean that for all libraries on the classpath there is exactly one specific version which works with all the other jars on the classpath (i.e. spring 2.5 or 3.0 but not both, they have different APIs). Ideally, this also involves running unit tests (and possibly other acceptance tests) to confirm nothing is broken - when I was at Amazon tests were far less common than I HOPE they are now...
Change Management - The two features above mean a whole lot less if you cannot manage your changes. 100s of services run by individual teams with shared library dependencies inevitably means owners of shared code will need to perform migrations. Team X may need version 1.0 of your library, but team Y might need version 1.1, and the two versions might not be ABI or API compatible. Forcing you to make all your versions compatible forever is an undue burden. Forcing all your clients to suddenly migrate to your new API all at once is also an undue burden. Therefore, your build pipeline must allow some teams to consume older versions, while others consume newer versions. From the bottom up, each piece of your dependency graph must migrate to the new version, but anyone still on the old version must continue to be able to reproduce their build (including making bug fixes). This usually leads to having "major versions" and "minor versions", where minor version changes are non-breaking changes that are "automatically picked up" but major version changes are "picked up by request only". You then run a migration by having each piece of your dependency graph, from the bottom up, migrate to the new major version.

Amazon's build system is called Brazil (haha! Lots of things are called Brazil within Amazon, it is a crazily overloaded term)
The main build driver is a mess 'o Perl scripts that generate makefiles. The build system is bootstrapped by a minimal Perl script which assumes only base Perl deps and GCC are available, and downloads all other dependencies.
The build system is "data driven", meaning there are configuration files which explain what to build. Code is broken up into units called "packages", each has a configuration file which says what to build, what artifacts are produced, what the package depends on, and frequently details about how it is deployed as well. The build system can be run on the desktop to develop and test, and in the package builder as well (see next point)
The build system ensures that nothing is depended upon besides GCC/Glibc, and your explicitly listed dependencies, by ensuring nothing else is on your linker line / classpath / PERL5LIB / whatever for your language. THis This is critical to reproducibility. If some randorandom library in your home directory was accidentally depended upon, builds would succeed locally but then fail later for other people, and if those dependencies were not available elsewhere, could never be reproduced.
There is a massive scheduling build system called "Package Builder" which (rather than continuously) developers can request makes a release build. Every release build is reproducible and its artifacts are kept effectively forever (backed by S3 last I heard).
Each build requested is built "in a version set", which is a list of package versions that are known to be consistent together. To change the versions in a versionset, you "build package X against versionset Y" - then using all the dependencies listed in versionset Y, the new version is X is built. X could be multiple packages for lock-step changes (called a coordinated build). Only after all packages are build successfully (and, should they have tests, all tests run successfully), the artifacts are published and the version set is updated so future builds against that version set will use the new package versions as dependencies. Builds are serialized on version set, so only one build can occur against each version set at the same time, otherwise you might have two concurrent changes break something without the build system noticing. This means that if each team has their own version set (which makes sense) no one team is blocked on antoheranother team's builds.
There is a "default versionset" called live which, when you create a new versionset, is used as the source to figure out what versions to take. So if you own shared code, you "publish" it by building it against the live version set.
There is a deployment system run by a completely different team called Apollo (which was written to replace Houston - as in "Houston, we have a problem!" - it's a pretty funny story). Apollo is probably the single most critical piece of infrastructure. A deployment system such as Apollo must takes artifacts from a consistent set of versions produced by the build system (in Brazil's case, a version set) and transform them to artifacts ready to deploy (usually a trivial transform), then put them onto hosts, be they desktops or servers in data centers. Apollo has probably deployed petabytes of data since its inception.
Apollo uses network disk tricks to efficiently move bits around so it doesn't have to copy to each and every box, but then builds symlink trees to get the files symlinked into the place they need to be. Generally, applications will live in /apollo/ENVIRONMENT_NAME/{lib, bin, etc}. Most applications use a wrapper that adjusts the dynamic link path to include the environment's lib dir, etc. so only dependencies in your environment are used. In this way multiple apps with different dependencies (at different versions) can all run on the same machine, so long as they are in different "environments". This, like in step 4 above, is CRITICAL because if applications depend upon things outside the environment, then those applications are not running reproducibly and future deployments of the same packages may not behave the same way.
Apollo has startup and shutdown scripts which can be run before/during/after a deployment of an environment - think of it as a re-implementation of a domain-specific init.dinit.d. It's very similar and not particularly special, but important because you want to version your startup/shutdown proceedureprocedure just like you version the rest of your application.
I've heard descriptions and seen blog entries about many other large companies build systems, but to be honest, nothing even comes close to the amazing technology Amazon has produced. I would probably argue that what Google, Facebook, and most other companies of comparable size and larger do is at best objectively less good and at worst wasting millions of dollars of lost productivity. Say what you will about Amazon's frugality, their turnover, and their perks, but the tools available at Amazon make it a world-class place to build software. I hope to bring a similar environment to my current company some time soon =)