martinamps/gist:c290bc01454d3c1b8c39

## gistfile1.txt


Rachel Potvin - Engineering Manager

Started career in video game industry.
Company worked on multiple games at once
One game per repo
One copy of the game engine in each repo, and the copies diverged
Features were wanted across the diverged game engines, and merge conflicts ensued.


One giant shared codebase
Google repository scale
1 billion files (includes generated source, config files, data files, documentation, copies in release branches, etc.)
9 million source files
2 billion lines of code
35 million commits
86 terabytes of content
45k commits/day


Repository Usage
25 thousand Googlers in dozens of offices around the world
15 thousand commits per day by humans
30 thousand commits per day by automated systems
800k QPS of reads at daily peak, 500k on average
Mostly from distributed build/test tools. See Bazel.io for a subset of these.

Perspective
Linux Kernel repo
15 million LOC
40,000 files
Google repo
15 million lines in 250k files changed per week by humans
2 Billion LOC, 9 million files
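A quick cross-check of the figures above, using only numbers from these notes (the variable names are made up for the sketch). It confirms that the human and automated commit counts add up to the 45k commits/day total, and shows how much larger the Google repo is than the Linux kernel repo.

```python
# All figures are taken from the notes above; names are illustrative.
human_commits_per_day = 15_000
automated_commits_per_day = 30_000
commits_per_day = human_commits_per_day + automated_commits_per_day

google_loc, google_files = 2_000_000_000, 9_000_000
linux_loc, linux_files = 15_000_000, 40_000

loc_ratio = google_loc / linux_loc        # lines of code, Google vs. Linux kernel
file_ratio = google_files / linux_files   # source files, Google vs. Linux kernel

print(commits_per_day)                      # 45000
print(round(loc_ratio), round(file_ratio))  # 133 225
```

Note also that the 15 million lines changed per week by humans is the same size as the entire Linux kernel figure quoted here.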


Google systems and workflows
Sync user workspace to repo (basically a fork)
Write Code
Code review
Commit
All code is reviewed before commit by humans and tooling
Each directory has a set of owners who must approve changes to their area of the repo
Tests and automated checks are performed before and after commit
Auto-rollback of a commit may occur in the case of widespread breakage
Google has a tree structure showing the owners who must give approval before a commit passes.
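A minimal sketch of how a per-directory owners tree could gate a commit. This is not Google's implementation: the `OWNERS` table, the user names, and the rule that owners of parent directories can also approve are all assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical owners table: directory -> set of usernames ("" is the repo root).
OWNERS = {
    "": {"root-admin"},
    "search": {"alice"},
    "search/indexing": {"bob"},
}

def owners_for(path):
    """Collect owners of every enclosing directory (assumed: parents may approve too)."""
    directory = PurePosixPath(path).parent
    found = set()
    for d in [directory, *directory.parents]:
        key = "" if str(d) == "." else str(d)
        found |= OWNERS.get(key, set())
    return found

def is_approved(changed_files, approvers):
    """A change passes only if every touched file has at least one approving owner."""
    approvers = set(approvers)
    return all(owners_for(f) & approvers for f in changed_files)

print(is_approved(["search/indexing/a.py"], {"alice"}))            # True
print(is_approved(["search/indexing/a.py", "ads/x.py"], {"bob"}))  # False
```

The second call fails because nobody who owns `ads/x.py` (only the root owner in this toy table) has approved.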

Piper
Stores a single, large repo
Implemented on top of standard Google infrastructure, replicated across 10 data centers worldwide
CitC
Cloud-based storage backend and a local file system view
Users see local changes overlaid on top of the full Piper source tree
Users can navigate and edit files across the entire codebase
Supports regular tooling on local machines, since it (sounds like it) is essentially NFS.
All writes are saved as CitC snapshots to make rollbacks easy; tooling works from snapshots.
Only modified files are stored in their workspace, but CitC allows you to see the entire codebase seamlessly
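The overlay idea described above can be sketched with a two-layer lookup: the workspace holds only modified files, and reads fall through to the full tree. This is a toy model using Python's `collections.ChainMap`, not how CitC is actually built; the file paths, contents, and helper names are made up.

```python
from collections import ChainMap

# Stand-in for the read-only source tree at some revision (made-up paths).
piper_tree = {
    "base/strings.h": "v1",
    "search/query.cc": "v1",
}

# Only files the user has modified live in the workspace.
workspace_changes = {}

# Lookups check the workspace first, then fall through to the full tree.
# Writes through a ChainMap go to the first mapping only.
view = ChainMap(workspace_changes, piper_tree)

def take_snapshot():
    """Copy the current set of modified files."""
    return dict(workspace_changes)

def roll_back(snapshot):
    """Restore the workspace to an earlier snapshot."""
    workspace_changes.clear()
    workspace_changes.update(snapshot)

before = take_snapshot()
view["search/query.cc"] = "local edit"

print(view["search/query.cc"])  # local edit
print(view["base/strings.h"])   # v1
print(workspace_changes)        # {'search/query.cc': 'local edit'}
```

Calling `roll_back(before)` would make `view["search/query.cc"]` read `v1` again, since the underlying tree was never modified.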

Tools
Critique
Code review
CodeSearch
Code browsing, exploration, understanding and archeology
Tricorder
Static analysis of code, surfaced in Critique and CodeSearch: code quality, test results, etc.
Can offer suggestions for fixes to common errors with one click acceptance
Presubmits
Customizable checks, testing, can block commit
TAP
Comprehensive testing before and after commit, auto-rollback
Allows teams to defend code against breaking changes from others
Rosie
Large scale change distribution and management
After teams make changes and tests have run, Rosie automatically submits the equivalent of a PR
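To make the Presubmits entry above concrete, here is a small sketch of customizable checks that can block a commit. The check functions and the `run_presubmits` helper are hypothetical; the notes do not describe the real presubmit interface.

```python
# Each check takes (path, text) and returns a list of failure messages.
def no_tabs(path, text):
    return [f"{path}: contains a tab"] if "\t" in text else []

def has_trailing_newline(path, text):
    return [] if text.endswith("\n") else [f"{path}: missing trailing newline"]

# Assumed: a team configures its own list of checks.
CHECKS = [no_tabs, has_trailing_newline]

def run_presubmits(change, checks=CHECKS):
    """Run every check on every changed file; an empty result lets the commit proceed."""
    return [
        message
        for path, text in change.items()
        for check in checks
        for message in check(path, text)
    ]

failures = run_presubmits({"a.py": "\tx = 1"})
print(failures)  # ['a.py: contains a tab', 'a.py: missing trailing newline']
```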

Google does trunk based development
Combined with a centralized repo, this defines the monolithic model
Piper users work at “head”, a consistent view of the codebase.
Commits are immediately visible and usable by other engineers
Branching is incredibly rare
Avoids painful merges
Branches are used for releases: a snapshot of trunk with optional cherry-picks.
Simple conditionals can mean different versions of code are executed in production.
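The last point, using conditionals rather than branches to select code paths, might look like the following sketch. The flag name, the `FLAGS` dictionary, and the two ranking functions are invented for the example; the notes do not say how flags are actually stored or read.

```python
# Hypothetical flag store; in practice this would come from configuration.
FLAGS = {"use_new_ranker": False}

def rank_old(results):
    return sorted(results)

def rank_new(results):
    return sorted(results, reverse=True)

def rank(results, flags=FLAGS):
    """Both code paths live at head; a flag decides which one runs."""
    if flags.get("use_new_ranker", False):
        return rank_new(results)
    return rank_old(results)

print(rank([2, 1, 3]))                            # [1, 2, 3]
print(rank([2, 1, 3], {"use_new_ranker": True}))  # [3, 2, 1]
```

Both versions are committed to trunk, so there is no long-lived branch to merge later; switching behavior is a configuration change.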


Advantages of a monolithic repository
Unified versioning
Single source of truth
No confusion about which is the authoritative version of a file
No forking of shared libraries
No painful cross-repo merging of copied code
No artificial boundaries between teams/projects
Supports gradual refactoring and reorganization of codebase
Changes to base libraries are instantly propagated through the dependency chain, greatly simplifying dependency management
No broken dependencies downstream (e.g. the diamond problem: D depends on B and C, which both depend on A but on differing versions of it)
Entire history of project remains intact and browsable
Extensive code sharing and reuse
Simplified dependency management
Atomic changes
Make large, backwards-incompatible changes easily
Change hundreds/thousands of files in a single consistent operation
Rename a class or function in a single commit, with no broken builds or tests
Large-scale refactoring
Codebase modernization
Single view of the codebase facilitates clean-up and modernization efforts
Can be centrally managed by dedicated specialists
e.g. updating the codebase to make use of C++11 features
Monolithic codebase captures all dependency information
Old APIs can be removed with confidence
Software errors or design mistakes can be found and fixed across the entire codebase, and the fixes coupled with new compiler warnings or presubmit checks
Collaboration across teams
Flexible team boundaries and code ownership
Code visibility and clear structure providing implicit team namespacing
Easier to reason about relationships between code
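As an illustration of the "atomic changes" point above (renaming a function everywhere in a single commit), here is a toy sketch over an in-memory tree. The file names and contents are made up, and a real rename tool would work on parsed code rather than regular expressions; this only shows the all-callers-updated-at-once idea.

```python
import re

def rename_symbol(tree, old, new):
    """Return a new tree with whole-word occurrences of `old` replaced by `new`."""
    pattern = re.compile(rf"\b{re.escape(old)}\b")
    return {path: pattern.sub(new, text) for path, text in tree.items()}

# Made-up repository contents: a library, a caller, and an unrelated name.
tree = {
    "lib/util.py": "def fetch_all():\n    return []\n",
    "app/main.py": "from lib.util import fetch_all\nitems = fetch_all()\n",
    "app/other.py": "prefetch_all = 1\n",
}

# One consistent operation: definition and every caller change together.
commit = rename_symbol(tree, "fetch_all", "list_all")

print(commit["app/main.py"])
```

Because the definition and its callers sit in one repository, no intermediate state exists in which the library has been renamed but a caller has not.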


Costs associated with this model
Tooling investments are valuable but can be costly
Development to scale tools
Cost of execution of computationally intensive tools (e.g. builds)
Codebase complexity is a risk to productivity
Encourages tons of sharing and reuse
Very easy to add dependencies
Unnecessary dependencies increase:
exposure to build breakage
binary sizes
costs for building/testing and maintenance
Code health must be a priority
Tools have been built to:
Find and remove unused/underused dependencies and dead code
Support large-scale clean-ups and refactoring
Google introduced API visibility, with default set to “private”
APIs must be explicitly marked as appropriate for use by others
APIs can be marked as deprecated
Lesson learned: Add these early to encourage sane/hygienic dependency structures.
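A rough sketch of what "visibility with a private default" could mean in a dependency check. The target names, the `VISIBILITY` table, and the prefix-matching rule are assumptions for illustration, not the actual build system's syntax or semantics.

```python
# Hypothetical table: target -> package prefixes allowed to depend on it.
# "*" means public. A target that is missing or has an empty list is private.
VISIBILITY = {
    "base/strings": ["*"],
    "search/internal": ["search"],
    "ads/secret": [],
}

def package_of(target):
    """Treat everything before the last '/' as the target's package."""
    return target.rsplit("/", 1)[0]

def may_depend(consumer, dependency):
    """Private by default: only the same package, or an explicitly listed one."""
    if package_of(consumer) == package_of(dependency):
        return True
    allowed = VISIBILITY.get(dependency, [])
    return any(
        prefix == "*" or consumer == prefix or consumer.startswith(prefix + "/")
        for prefix in allowed
    )

print(may_depend("ads/server", "base/strings"))     # True
print(may_depend("ads/server", "search/internal"))  # False
print(may_depend("search/ui/page", "search/internal"))  # True
```

With a default like this, a new dependency on another team's internals fails the check until the owning team explicitly widens visibility.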

Conclusions

Monolithic codebase != monolithic software design
Monolithic model of source management works well when coupled with an engineering culture of transparency and collaboration
Google has invested heavily in scalability and productivity tooling to support this model, due to the significant advantages it provides
This may or may not be the right approach for all companies
Google has shown this model can scale to a repo with 1 billion files and 35 million commits, and thousands of users around the globe

