I wrote this answer on stackexchange, here: https://stackoverflow.com/posts/12597919/
It was wrongly deleted for containing "proprietary information" years later. I think that's bullshit so I am posting it here. Come at me.
> Amazon is a SOA system with 100s of services (or so says Amazon Chief Technology Officer Werner Vogels). How do they handle build and release? Anyone know what they use or the overall structure? I'll bet it would be very interesting to read about their lessons learned.

I worked on Amazon's Build Team for approximately 3 years, from 2006 to 2009. Amazon has produced a world-class build and deployment system which is pretty difficult to imagine and unmatched in open source. It is really sad (and a personal goal of mine to change) that there is nothing even close that is freely available.
It is my assertion that the details included in this post, though possibly not obvious, are inevitable discoveries of any sufficiently large engineering organization, and the only way to build and deploy quality software in a massive service-oriented architecture (SOA) environment where one team blocking on another is considered more harmful than integration pain (for an alternative perspective, see any organization which hates topic branches and insists everyone commit directly to trunk and "integrate early" - that works less well for SOA). As such, I do not consider this level of detail to be proprietary information, especially combined with the fact that it is 3 years old. Any build and deployment system operating at this scale needs to provide at least the following:
- Reproducibility - They should guarantee the ability to reproduce any artifact that has ever been produced in the past (or any artifact tagged as "released" at a minimum). Furthermore, they should guarantee the ability to produce any artifact with a known delta - i.e. "this exact version, but with only this one bug fixed". This is critical for knowing you are only making the minimum change, and not introducing risk into a production system.
- Consistency - In the past (though to a much lesser extent these days) Amazon had a lot of C/C++ code. Regardless of the language you use (but especially in languages where ABI compatibility is "a thing") you need to guarantee consistency. This means knowing that a particular set of artifacts all work together and are binary compatible. In C/C++ this means ABI compatibility, and in Java this might mean that for all libraries on the classpath there is exactly one specific version which works with all the other jars on the classpath (i.e. Spring 2.5 or 3.0 but not both, since they have different APIs). Ideally, this also involves running unit tests (and possibly other acceptance tests) to confirm nothing is broken - when I was at Amazon tests were far less common than I HOPE they are now...
- Change Management - The two features above mean a whole lot less if you cannot manage your changes. 100s of services run by individual teams with shared library dependencies inevitably means owners of shared code will need to perform migrations. Team X may need version 1.0 of your library, but team Y might need version 1.1, and the two versions might not be ABI or API compatible. Forcing you to make all your versions compatible forever is an undue burden. Forcing all your clients to suddenly migrate to your new API all at once is also an undue burden. Therefore, your build pipeline must allow some teams to consume older versions, while others consume newer versions. From the bottom up, each piece of your dependency graph must migrate to the new version, but anyone still on the old version must continue to be able to reproduce their build (including making bug fixes). This usually leads to having "major versions" and "minor versions", where minor version changes are non-breaking changes that are "automatically picked up" but major version changes are "picked up by request only". You then run a migration by having each piece of your dependency graph, from the bottom up, migrate to the new major version. There's a small sketch of this resolution rule just below.
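To make that migration model concrete, here's a toy Python sketch (my own illustration, not Amazon code - the package names and version numbers are invented) of the "pin the major version, float the minor/build version" resolution rule:

```python
# Toy sketch (my own illustration, not Amazon code) of the rule: consumers
# pin an interface ("major") version and automatically pick up newer builds
# within it; a new interface version is only taken by explicit request.
# Package names and version numbers are invented.

PUBLISHED = {
    # package -> interface version -> build versions, oldest first
    "libfoo": {
        "1.0": ["1.0.1001", "1.0.1002"],   # old interface, still reproducible
        "1.1": ["1.1.2001"],               # new, incompatible interface
    },
}

def resolve(package, interface):
    """Pick the newest build of `package` within the pinned interface version."""
    builds = PUBLISHED[package].get(interface)
    if not builds:
        raise LookupError(f"{package} has no builds for interface {interface}")
    return builds[-1]

# Team X has not migrated yet; Team Y has. Neither blocks the other, and
# Team X can still cut a bug-fix build against the old interface version.
print(resolve("libfoo", "1.0"))   # -> 1.0.1002
print(resolve("libfoo", "1.1"))   # -> 1.1.2001
```

The point is that both teams resolve to a consistent, reproducible answer without ever coordinating a flag day.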
Amazon's build system is called Brazil (haha! Lots of things are called Brazil within Amazon, it is a crazily overloaded term).
The main build driver is a mess 'o Perl scripts that generate makefiles. The build system is bootstrapped by a minimal Perl script which assumes only base Perl deps and GCC are available, and downloads all other dependencies.
The build system is "data driven", meaning there are configuration files which explain what to build. Code is broken up into units called "packages", each has a configuration file which says what to build, what artifacts are produced, what the package depends on, and frequently details about how it is deployed as well. The build system can be run on the desktop to develop and test, and in the package builder as well (see next point)
The build system ensures that nothing is depended upon besides GCC/Glibc and your explicitly listed dependencies, by ensuring nothing else is on your linker line / classpath / PERL5LIB / whatever for your language. This is critical to reproducibility. If some random library in your home directory was accidentally depended upon, builds would succeed locally but then fail later for other people, and if those dependencies were not available elsewhere, could never be reproduced.
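A minimal sketch of that isolation rule, assuming an invented workspace layout: the path handed to the compiler or runtime is assembled only from the declared dependencies, so nothing else on the machine can leak in.

```python
# Minimal sketch of the isolation rule, with an invented workspace layout:
# the path handed to the compiler/runtime is assembled ONLY from declared
# dependencies, so a stray library in your home directory can never be
# silently picked up.

import os

def build_classpath(workspace_root, declared_deps):
    """Join the lib directories of declared dependencies, and nothing else."""
    entries = []
    for dep in declared_deps:
        entries.append(os.path.join(workspace_root, "packages", dep, "lib"))
    return os.pathsep.join(entries)

# The same idea applies to a C/C++ linker line or PERL5LIB: the variable is
# overwritten from the dependency list, never extended from the ambient shell.
print(build_classpath("/workspace/MyService", ["SpringFramework-2.5", "log4j-1.2"]))
```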
There is a massive scheduling build system called "Package Builder" which developers can ask to perform a release build (rather than building continuously). Every release build is reproducible, and its artifacts are kept effectively forever (backed by S3, last I heard).
Each build requested is built "in a version set", which is a list of package versions that are known to be consistent together. To change the versions in a version set, you "build package X against version set Y" - then, using all the dependencies listed in version set Y, the new version of X is built. X could be multiple packages for lock-step changes (called a coordinated build). Only after all packages are built successfully (and, should they have tests, all tests run successfully), the artifacts are published and the version set is updated so future builds against that version set will use the new package versions as dependencies. Builds are serialized on version set, so only one build can occur against each version set at the same time, otherwise you might have two concurrent changes break something without the build system noticing. This means that if each team has their own version set (which makes sense) no one team is blocked on another team's builds.
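Here's a small, runnable toy model (again, not Package Builder itself) of the two properties just described: builds are serialized per version set, and the set only advances after the coordinated build and its tests succeed.

```python
# Toy model (not Package Builder) of building against a version set:
# builds on a given set are serialized, and the set only advances to the
# new build versions after the whole coordinated build and its tests pass.

import threading

class VersionSet:
    def __init__(self, name, versions):
        self.name = name
        self.versions = dict(versions)   # package -> build version
        self._lock = threading.Lock()    # one build at a time per version set

    def build(self, new_builds, build_and_test):
        with self._lock:                 # serialize builds on this set
            snapshot = dict(self.versions)           # consistent dependency view
            if not build_and_test(snapshot, new_builds):
                raise RuntimeError("build or tests failed; version set unchanged")
            self.versions.update(new_builds)         # publish only on success

def fake_build_and_test(deps, new_builds):
    print("building", sorted(new_builds), "against", deps)
    return True                          # pretend compilation and tests passed

vs = VersionSet("MyTeamVS", {"libfoo": "1.0.1001", "libbar": "2.0.3000"})
vs.build({"libfoo": "1.0.1002"}, fake_build_and_test)
print(vs.versions)                       # future builds now see libfoo 1.0.1002
```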
There is a "default versionset" called live which, when you create a new versionset, is used as the source to figure out what versions to take. So if you own shared code, you "publish" it by building it against the live version set.
There is a deployment system run by a completely different team called Apollo (which was written to replace Houston - as in "Houston, we have a problem!" - it's a pretty funny story). Apollo is probably the single most critical piece of infrastructure. A deployment system such as Apollo must take artifacts from a consistent set of versions produced by the build system (in Brazil's case, a version set), transform them into artifacts ready to deploy (usually a trivial transform), and then put them onto hosts, be they desktops or servers in data centers. Apollo has probably deployed petabytes of data since its inception.
Apollo uses network disk tricks to efficiently move bits around so it doesn't have to copy to each and every box, but then builds symlink trees to get the files symlinked into the place they need to be. Generally, applications will live in /apollo/ENVIRONMENT_NAME/{lib, bin, etc}. Most applications use a wrapper that adjusts the dynamic link path to include the environment's lib dir, etc., so only dependencies in your environment are used. In this way multiple apps with different dependencies (at different versions) can all run on the same machine, so long as they are in different "environments". This, like the build-time isolation described above, is CRITICAL, because if applications depend upon things outside the environment, then those applications are not running reproducibly and future deployments of the same packages may not behave the same way.
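The symlink-tree idea is easy to sketch. This is only an illustration (the store layout and file names are invented, and the real system is far more sophisticated), but it shows how an environment can be just a tree of links into a shared artifact store:

```python
# Illustration only: an "environment" as a tree of symlinks into a shared
# artifact store. The store layout and file names are invented, and the
# real system is far more sophisticated.

import os
import tempfile

def link_environment(root, store_dir, env_name, packages):
    """Create <root>/apollo/<env_name>/lib symlinks into the artifact store."""
    env_lib = os.path.join(root, "apollo", env_name, "lib")
    os.makedirs(env_lib, exist_ok=True)
    for package, build_version in packages.items():
        target = os.path.join(store_dir, package, build_version, "lib", package + ".so")
        link = os.path.join(env_lib, package + ".so")
        if os.path.lexists(link):
            os.remove(link)              # redeploy just repoints the link
        os.symlink(target, link)

root = tempfile.mkdtemp()                # stand-in for "/" so this runs anywhere
link_environment(root, "/shared/artifact-store", "MyServiceProd",
                 {"libfoo": "1.0.1002", "libbar": "2.0.3000"})
print(sorted(os.listdir(os.path.join(root, "apollo", "MyServiceProd", "lib"))))
```

Because no bits are copied per environment, many environments with conflicting dependency versions can coexist cheaply on one host.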
Apollo has startup and shutdown scripts which can be run before/during/after a deployment of an environment - think of it as a domain-specific re-implementation of init.d. It's very similar and not particularly special, but important because you want to version your startup/shutdown procedure just like you version the rest of your application.
I've heard descriptions and seen blog entries about many other large companies' build systems, but to be honest, nothing even comes close to the amazing technology Amazon has produced. I would probably argue that what Google, Facebook, and most other companies of comparable size and larger do is at best objectively less good and at worst wasting millions of dollars of lost productivity. Say what you will about Amazon's frugality, their turnover, and their perks, but the tools available at Amazon make it a world-class place to build software. I hope to bring a similar environment to my current company some time soon =)
Version Sets have "event" ids, which are named versions for that set and a pointer back to the parent (much like a git commit). When you build, by default the latest event is chosen to build against (if you're building locally, the "latest" is periodically updated) or you can choose to build from a different event (for example, if you need to patch a release).
The version set itself will have a name (e.g. "MyTeamsGreatService") as well as a link back to where the VS was created from (e.g. "live@1593214096"). As already hinted at, events in the version set are delimited by an "@". When you create your version set, you supply one or more "root" packages that act as anchors for all of the other packages. As dependencies get discarded, they are removed from the version set event.
You can also merge version sets. live typically has a lot of shared packages used by lots of teams (JVM, JDK, shared config, third party libraries, etc.). Instead of picking and choosing specific versions, you can get the latest that are built into live by importing which will pull all of the new package versions in and rebuild anything that depended on them and create a new version set event for your version set (this is such a common operation that it's possible to schedule it on a regular cadence to avoid stale dependencies). You can import from version sets other than live though, which is useful when sharing internal frameworks across close teams that are not intended to be published outside of that group of teams. That makes libraries shared within a multi-team group easy to build once and push out to all consumers.
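Conceptually, an import is "take the newer versions from the source set, then rebuild everything that depends on them." A simplified sketch (made-up package names, ignoring transitive rebuilds and event creation):

```python
# Simplified sketch of an import (made-up package names, ignoring
# transitive rebuilds and event creation): take the newer versions from
# the source version set, then figure out what must be rebuilt.

def import_from(source, target, reverse_deps):
    """Update `target` with newer versions from `source`.

    Returns the packages in `target` that need rebuilding because one of
    their dependencies changed. `reverse_deps` maps a package to the
    packages that depend on it.
    """
    changed = set()
    for package, version in source.items():
        if package in target and target[package] != version:
            target[package] = version
            changed.add(package)
    rebuild = set()
    for package in changed:
        rebuild.update(reverse_deps.get(package, ()))
    return rebuild

live = {"JDK": "8.0.500", "SpringFramework": "2.5.40"}
mine = {"JDK": "8.0.480", "SpringFramework": "2.5.40", "MyService": "1.0.7"}
print(import_from(live, mine, {"JDK": {"MyService"}}))   # -> {'MyService'}
```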
Hinted at above, but not detailed is the separation of interface version and build version. When you take a dependency on another package in Brazil, you specify its name and interface version. You never take a dependency on a specific build version (in fact, you can't). The package config provides the graph of packages and interface versions necessary to build, but the version set is what resolves those interface versions to specific build versions.
If you create a package, say the IDL for your service API or config for clients to connect, you specify the interface version (say, "1.0"). Every time you build, the generated package is given a concrete build version ("1.0.1593214096"). That build version is what's stored in the version set for each package node. Developers never manage the build version either as an author or consumer.
There's an implicit contract, then, that within an interface version all changes are backwards compatible. Internal tooling (service definitions, config, etc.) has all been built with this in mind. APIs, object fields, enumeration values, etc. can all be added without concern about affecting consumers (at build or runtime), but if APIs need to be modified in some incompatible way, the interface version is updated (maybe "1.1") and no existing consumers are affected, only consumers that update their dependency to the new interface version. Regardless, when you build a package all consumers are rebuilt to ensure build-time consistency.
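Put together, the resolution step looks something like this sketch (all identifiers hypothetical): the package config supplies (name, interface version) pairs, and the version set pins each of them to a concrete build version.

```python
# Sketch of the two-level resolution (all identifiers hypothetical): the
# package config declares (name, interface version); the version set pins
# each pair to one concrete build version.

DECLARED_DEPS = [("MyServiceIDL", "1.0"), ("ClientConfig", "2.1")]

VERSION_SET = {
    "MyServiceIDL": "1.0.1593214096",
    "ClientConfig": "2.1.1593200000",
}

def pin(declared, version_set):
    """Resolve declared interface versions to the build versions in the set."""
    resolved = {}
    for name, interface in declared:
        build = version_set[name]
        # The pinned build must belong to the declared interface version.
        if not build.startswith(interface + "."):
            raise ValueError(f"{name}: build {build} is not interface {interface}")
        resolved[name] = build
    return resolved

print(pin(DECLARED_DEPS, VERSION_SET))
```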
I left Amazon about a year before the OP was written, but the conclusion is spot on. Once you understand the build and deployment tools you first wonder how you ever did anything before, and then start to fear how you'll do anything once you leave. One major issue, in my experience, is that the Brazil packaging concept doesn't have critical mass, or would require boiling the ocean to get there. At Amazon, there's one package manager - Brazil. You want a Gem, NPM package, *.so, or JAR dependency? You add it to Brazil. Sometimes you go through great pain to add it to Brazil (especially when importing a new build system). But once the package is there, everything just works. If you look at something like BSD ports, you get a comprehensive set of packages from varying package managers merged into one, but you don't get version sets. MacPorts can get pretty close, since you can point to a specific SHA of package definitions along with fuzzy dependency definitions, but there's one primary set. Other package managers with fuzzy versions can't produce repeatable builds. In the Java space most dependency managers use strict versioning, which means it's extraordinarily difficult to regularly pick up small minor version or patch version changes.
Going into a bit more detail, Brazil gets dependency types right as well as relatively straightforward version conflict resolution. As OP said, Brazil is a thin set of bootstrap tools which forward control to a "build system" as soon as possible. A build system is just another package in Brazil with an executable in a consistent location, so it's versioned along with the rest of your version set. Brazil itself isn't versioned, but it is really so small that in practice it doesn't matter. Build tool dependencies are scoped to work with each other but are not added to the library path during compile by default. Compile-time dependencies are exactly that. Runtime dependencies are packages that will be deployed, but not added to the library path during compile. Both build and runtime dependencies ultimately get deployed. Test dependencies are just that, a separate test set of dependencies which also include compile dependencies. Dependencies are transitive.
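Here's a toy summary (my own reading of the scope rules above, with invented package names) of what each scope feeds: the compile path, the test path, and what ultimately gets deployed.

```python
# Toy summary of the dependency scopes (my own reading of the rules above;
# names are invented): what feeds the compile path, the test path, and
# what ends up deployed.

DEPENDENCIES = {
    "compile":    {"MyServiceModel", "SpringFramework"},
    "runtime":    {"log4j-config"},     # deployed, but not on the compile path
    "test":       {"JUnit"},            # test path only
    "build-tool": {"SomeBuildTool"},    # hypothetical build system package
}

def compile_path(deps):
    return set(deps["compile"])

def test_path(deps):
    return deps["compile"] | deps["test"]    # tests see compile deps too

def deployed(deps):
    return deps["compile"] | deps["runtime"] # test/build-tool deps are stripped

print(compile_path(DEPENDENCIES))
print(test_path(DEPENDENCIES))
print(deployed(DEPENDENCIES))
```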
Because brazil-build is primarily a package manager and defers to other build systems for generating the bits, package configuration is super simple: primarily dependencies, plus maybe some additional metadata describing the produced artifacts that will be added to any resolved library path. Unlike something like Maven, where dependency management and build logic are intermingled, language-native build tools are then used - as OP said, make or cmake for C/C++, rake for Ruby, ant for Java. By the time these tools are invoked, the dependency paths have already been added, so LD_LIBRARY_PATH or class paths don't need to be mucked with by developers. It also means that if you don't like your build tool, you can switch it out and none of your other dependencies need to change. When version sets are imported into Apollo, all of the test and build dependencies are stripped away and the remaining graph is passed in.
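That last step is just a graph walk that drops the non-deployable edges. A small sketch with an invented dependency graph:

```python
# Small sketch of that stripping step, with an invented dependency graph:
# the walk keeps compile and runtime edges and drops test and build-tool
# edges before handing the closure to the deployment system.

GRAPH = {
    # package -> list of (dependency, scope)
    "MyService":       [("MyServiceModel", "compile"), ("log4j-config", "runtime"),
                        ("JUnit", "test"), ("SomeBuildTool", "build-tool")],
    "MyServiceModel":  [("SpringFramework", "compile")],
    "log4j-config":    [],
    "SpringFramework": [],
    "JUnit":           [],
    "SomeBuildTool":   [],
}

DEPLOYABLE_SCOPES = {"compile", "runtime"}

def deployable_closure(root, graph):
    """Walk the graph from `root`, following only deployable edges."""
    keep, stack = set(), [root]
    while stack:
        package = stack.pop()
        if package in keep:
            continue
        keep.add(package)
        for dep, scope in graph[package]:
            if scope in DEPLOYABLE_SCOPES:
                stack.append(dep)
    return keep

# Contains MyService, MyServiceModel, log4j-config, SpringFramework only.
print(deployable_closure("MyService", GRAPH))
```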