## [Better developers] Where (and how) Git stores your files
This week, let's take a break from Python and talk a little bit
about Git
( http://t.dripemail2.com/c/eyJhY2NvdW50X2lkIjoiNjE2ODIxOCIsImRlbGl2ZXJ5X2lkIjoiMjMxMTgyMzY3OSIsInVybCI6Imh0dHA6Ly9naXQtc2NtLmNvbS8_X19zPXhzd2dxc3FiNHFuMml6d3c0aXloIn0 ).
I teach Git courses every few
months, and without fail people come into the class saying that
they have been using Git for a few months, and it seems to work
OK so long as they use the list of commands that their boss
provided. But if something happens that isn't on that list, and
if they cannot figure out what to do based on Stack Overflow,
they're sunk.

The goal of my Git course is not only to help them use the
different Git commands, but also to give them insights into
what's happening inside of Git, so that when things go wrong --
or appear to go wrong -- they can fix the problem, rather than
removing their local copy of the repository and cloning again.
Which is what a huge number of people do.

I want to point out that Git is one of the best tools I've ever
used, and has made me a better developer. And yet, I should also
point out that the user interface is exactly what you would
expect from a bunch of kernel hackers whose primary language is
C. The naming of Git features is terrible and inconsistent, the
number of options you can invoke is nearly infinite, and many of
the terms and commands were seemingly chosen because they clashed
with completely different commands used by other version-control
systems.

The thing is, once you understand how Git works, it suddenly
starts to make sense. And that's because Git doesn't do very
much at all: It's a specialized database, containing a very small
number of objects. And part of the genius of Git, in my opinion,
is that you can have a robust and fully operational
version-control system by implementing just a few ideas.

Indeed, you can think of Git as a database that contains just
four types of objects:

* blobs (i.e., file contents)
* trees (i.e., directories)
* tags
* commits

When you say "git commit", you're creating a new commit object.
That object points to a tree, and that tree then points to
additional trees and blobs. Assuming that your commit is not the
first in a repository, then it also points back to its parent.
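
To make those relationships concrete, here is a rough Python
sketch of the object graph. The class names (Blob, Tree, Commit)
and the list_files helper are mine, invented for illustration;
real Git links objects by SHA-1 rather than by Python references,
and a merge commit can have more than one parent, which this
sketch ignores.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class Blob:
    """File contents only; the file's name lives in the tree, not here."""
    data: bytes


@dataclass
class Tree:
    """A directory: maps names to blobs (files) or trees (subdirectories)."""
    entries: Dict[str, Union["Tree", Blob]] = field(default_factory=dict)


@dataclass
class Commit:
    """Points to one top-level tree and, unless it is the first, a parent."""
    tree: Tree
    message: str
    parent: Optional["Commit"] = None  # simplification: a single parent


def list_files(tree: Tree, prefix: str = "") -> List[str]:
    """Walk a tree recursively and return every file path it reaches."""
    paths: List[str] = []
    for name, child in sorted(tree.entries.items()):
        if isinstance(child, Tree):
            paths += list_files(child, prefix + name + "/")
        else:
            paths.append(prefix + name)
    return paths
```

Starting from any Commit, following "tree" gets you the files in
that snapshot, and following "parent" gets you the history.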

Let's create and go through a Git repository to see what I'm
talking about. On the command line, I'll create a new directory
and repository:

$ mkdir gitfun
$ cd gitfun
$ git init

Git responds by saying:

Initialized empty Git repository in /Users/reuven/Desktop/gitfun/.git/

Great! We now have a new repository!

Um, but what does that mean? It means that Git has configured a
few things, including the special ".git" directory, under which
things are stored. What is stored there? Well, right now,
there's not much to see. Looking at ".git/objects", which is
where Git stores things, we'll see two subdirectories, but no
actual objects.

$ ls .git/objects
info/ pack/

So, let's now create a new file in Git:

$ cat >> test1.txt
This is a test.
And a very good test it is!

$ git add test1.txt

$ git commit -m 'Added test1.txt'
[master (root-commit) 5816544] Added test1.txt
1 file changed, 2 insertions(+)
create mode 100644 test1.txt

In the above shell commands, I created a simple text file. Then
I staged it by using the "add" command -- what, you think that
there should be a "stage" command? But that would deprive
consultants of business opportunities! -- and then committed it
using "git commit".

The moment I did that, Git created a number of different objects.
Each object is represented in Git with a SHA-1 value. SHA-1 is a
hash function that doesn't guarantee that every file will have a
unique hash value, but it's close enough for all practical
purposes. If you had a way to deliberately create a file with a
given SHA-1, then Git would probably break -- but that's not
realistic, so far as I know, so we should be OK.
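
For the curious, here is a minimal Python sketch of how such an
ID can be computed. Git hashes a short header (the object type,
the size in bytes, and a NUL byte) followed by the content. The
function name git_object_id is my own, invented for this example;
it isn't part of any Git API.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Return the 40-character hex SHA-1 for an object of the given type.

    The hash covers a header ("<type> <size>" plus a NUL byte) and then
    the raw body.
    """
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    # An empty file always produces the same blob ID.
    print(git_object_id("blob", b""))
    # The file name plays no part: only the type and the bytes matter.
    text = b"This is a test.\nAnd a very good test it is!\n"
    print(git_object_id("blob", text))
```

Because the object type is part of the header, a blob and a
commit with identical bytes would still get different IDs.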

Git reported above that it created a new commit, and even gave us
the first few digits of its SHA-1, 5816544. We can see this more
clearly, and with a longer name, if we use "git log":

$ git log
commit 58165443eca522ef35bad68964fc09ec000449ef
Author: Reuven Lerner <reuven@lerner.co.il>
Date: Mon Jan 9 00:11:41 2017 +0200

Added test1.txt

We can thus see that our most recent commit has a SHA-1 that
starts with 5816544, and continues until we get a 40-character
SHA-1. But we don't have to type all of it: Git accepts as few
as the first four hex digits, so long as they're unique in our
repository.
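
Here is a small sketch of that kind of prefix lookup in Python.
The function name resolve_prefix is hypothetical, and I'm
assuming that the full IDs are already available as a list of
strings; Git's own lookup is more involved than this.

```python
from typing import Iterable


def resolve_prefix(prefix: str, known_ids: Iterable[str]) -> str:
    """Expand an abbreviated object ID to the one full ID it matches."""
    if len(prefix) < 4:
        raise ValueError("need at least 4 hex digits")
    wanted = prefix.lower()
    matches = [full for full in known_ids if full.startswith(wanted)]
    if not matches:
        raise LookupError(f"no object starts with {prefix!r}")
    if len(matches) > 1:
        raise LookupError(f"{prefix!r} is ambiguous ({len(matches)} matches)")
    return matches[0]


if __name__ == "__main__":
    # The three object IDs that appear in this message.
    ids = [
        "58165443eca522ef35bad68964fc09ec000449ef",
        "37675fc023b0863cd8a702041de28282caa17c1d",
        "797f7c1809e83fd6122cb4a247d345e7f5de4f5d",
    ]
    print(resolve_prefix("5816", ids))
```

With only a handful of objects, four digits are plenty; in a big
repository you'll need more of them before a prefix is unique.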

Where did Git store this object? Inside of .git/objects. But
because our repository might contain lots of objects, we aren't
going to store everything straight inside of .git/objects.
Rather, Git takes the first two characters of the SHA-1, and
uses that as the name of a subdirectory in which to store
objects. For example:

$ ls .git/objects
37/ 58/ 79/ info/ pack/

Our commit object is inside of the "58" directory:

$ ls .git/objects/58
165443eca522ef35bad68964fc09ec000449ef

So as you can see, knowing the SHA-1 of an object allows Git to
find it right away in our filesystem. That's one of the reasons
why Git is so fast; the file's contents tell Git where a file is.
And when the file changes? Then Git will create a new object,
with a new SHA-1, reflecting the hash value of the new contents.
And thus, Git stores separate copies of each version of each
file that you might have written.
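
If you'd like to see that layout in code, here is a sketch that
writes and reads "loose" objects the way I've described: the
first two hex digits name the subdirectory, the remaining 38 name
the file, and the file holds the zlib-compressed header and
content. The function names are hypothetical, and this ignores
the "pack" directory, where Git can also store objects in a
different format.

```python
import hashlib
import zlib
from pathlib import Path
from typing import Tuple


def loose_object_path(objects_dir: Path, object_id: str) -> Path:
    """Map an ID to <objects_dir>/<first 2 hex digits>/<remaining 38>."""
    return objects_dir / object_id[:2] / object_id[2:]


def write_loose_blob(objects_dir: Path, content: bytes) -> str:
    """Store content as a compressed loose blob and return its ID."""
    data = b"blob " + str(len(content)).encode("ascii") + b"\x00" + content
    object_id = hashlib.sha1(data).hexdigest()
    path = loose_object_path(objects_dir, object_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(data))
    return object_id


def read_loose_object(objects_dir: Path, object_id: str) -> Tuple[str, bytes]:
    """Return (type, body) for a loose object, checking the stored size."""
    raw = loose_object_path(objects_dir, object_id).read_bytes()
    header, _, body = zlib.decompress(raw).partition(b"\x00")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return obj_type, body
```

Point objects_dir at a scratch directory rather than a real
repository if you experiment with the writing half.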

You might have noticed that Git created two other directories
above, "37" and "79". Why are those there?

Well, because Git didn't just create a commit object. It also
created a tree object that sits between the commit and one or
more trees and blobs, plus a blob for the file itself. We can
use the low-level Git command cat-file, along with its "-p"
(pretty-print) option, to inspect these objects:

$ git cat-file -p 58165443eca522ef35bad68964fc09ec000449ef
tree 37675fc023b0863cd8a702041de28282caa17c1d
author Reuven Lerner <reuven@lerner.co.il> 1483913501 +0200
committer Reuven Lerner <reuven@lerner.co.il> 1483913501 +0200

Added test1.txt

In other words, what are the contents of our commit object? It
points to a tree object (SHA-1 37675f), and also contains
information about the author and committer (who are generally one
and the same), followed by the commit message. So the message is
actually part of the commit object, which means that if you
modify the message on a commit, you get a totally new commit
object with a new SHA-1.
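
A quick Python sketch shows why. It assembles the text of a
commit object from its fields and hashes it, so any change to the
text changes the ID. The commit_id function is my own
illustration, and I haven't checked that it reproduces the exact
ID shown above, since that depends on byte-for-byte details of
what Git stored.

```python
import hashlib
from typing import Sequence


def commit_id(tree: str, author: str, committer: str, message: str,
              parents: Sequence[str] = ()) -> str:
    """Hash a commit assembled from its header fields and message."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = "\n".join(lines).encode("utf-8")
    header = f"commit {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    who = "Reuven Lerner <reuven@lerner.co.il> 1483913501 +0200"
    tree = "37675fc023b0863cd8a702041de28282caa17c1d"
    original = commit_id(tree, who, who, "Added test1.txt\n")
    reworded = commit_id(tree, who, who, "Added test1.txt, finally\n")
    # Same tree, same author, same timestamp; only the message differs.
    print(original != reworded)
```

The same goes for the author, the timestamp and the parent: touch
any of them and you have a different commit.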

Where is this tree object stored? Well, it has a SHA-1. And
look, its SHA-1 starts with 37! What if we look in that
directory? Can you guess what will be there? (I know, it's
obvious when I say it...)

$ ls .git/objects/37
675fc023b0863cd8a702041de28282caa17c1d

And if we get the contents of our tree object, what do we find?

$ git cat-file -p 37675fc023b0863cd8a702041de28282caa17c1d
100644 blob 797f7c1809e83fd6122cb4a247d345e7f5de4f5d	test1.txt

See? Our tree object points to a blob. And if we look at the
blob:

$ git cat-file -p 797f7c1809e83fd6122cb4a247d345e7f5de4f5d
This is a test.
And a very good test it is!
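
What cat-file -p shows for a tree is a friendly rendering. Inside
the object, each entry is stored as the mode, a space, the name,
a NUL byte, and then the 20 raw bytes of the SHA-1. Here is a
sketch of encoding and decoding that format; the function names
are hypothetical, and the sorting is simplified (I'm assuming
plain files only, whereas Git has a special rule for ordering
subdirectories).

```python
from typing import List, Tuple

Entry = Tuple[str, str, str]  # (mode, name, 40-character hex ID)


def encode_tree(entries: List[Entry]) -> bytes:
    """Serialize entries into the raw body of a tree object."""
    body = b""
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1]):
        body += f"{mode} {name}".encode("utf-8") + b"\x00"
        body += bytes.fromhex(hex_id)  # 20 raw bytes, not 40 hex digits
    return body


def decode_tree(body: bytes) -> List[Entry]:
    """Parse the raw body of a tree object back into its entries."""
    entries: List[Entry] = []
    pos = 0
    while pos < len(body):
        nul = body.index(b"\x00", pos)
        mode, name = body[pos:nul].decode("utf-8").split(" ", 1)
        raw_id = body[nul + 1:nul + 21]
        entries.append((mode, name, raw_id.hex()))
        pos = nul + 21
    return entries


if __name__ == "__main__":
    entry = ("100644", "test1.txt",
             "797f7c1809e83fd6122cb4a247d345e7f5de4f5d")
    print(decode_tree(encode_tree([entry])))
```

Notice that the file name lives in the tree, not in the blob,
which is why renaming a file doesn't create a new blob.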

Now, what happens when I modify test1.txt, and then commit it?
The answer: None of the existing objects are affected. They
stay precisely the way they were before. But when we create a new
commit, it becomes our current commit (the one that HEAD refers
to), and is the basis for any new commits we make. The existing
commits remain around... well, basically forever.

For example:

$ cat >> test1.txt
Still a great file, right?

$ git add test1.txt

$ git commit -m 'Added amazing brilliance to our text file'
[master b6c4ec9] Added amazing brilliance to our text file
1 file changed, 1 insertion(+)

Notice that the SHA-1 returned by Git is different from the
previous one. If we look at it:

$ git cat-file -p b6c4ec9
tree eeca41cd12f46cd4c237f28c78b7e11762a0b22b
parent 58165443eca522ef35bad68964fc09ec000449ef
author Reuven Lerner <reuven@lerner.co.il> 1483914372 +0200
committer Reuven Lerner <reuven@lerner.co.il> 1483914372 +0200

Added amazing brilliance to our text file

Notice that our commit, since it isn't the first one in the
system (the "root" commit), has a "parent" field, pointing back
to the commit from which it came. But we still have a tree -- a
different tree object -- and the other standard stuff. Following
the tree along to the new file, we see:

$ git cat-file -p 909f2de7c8a572d91f06b188790416a2c195f0ed
This is a test.
And a very good test it is!
Still a great file, right?
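
The text that cat-file printed for our second commit is simple
enough to pick apart by hand. Here is a sketch of a parser that
pulls out the tree, the parents and the message. The name
parse_commit is hypothetical, and I'm assuming the plain layout
shown in this message: header lines, a blank line, then the
commit message.

```python
from typing import Any, Dict


def parse_commit(text: str) -> Dict[str, Any]:
    """Split commit text into header fields, parents and message."""
    headers, _, message = text.partition("\n\n")
    fields: Dict[str, Any] = {"parents": [], "message": message}
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            fields["parents"].append(value)  # a root commit has none
        else:
            fields[key] = value
    return fields


if __name__ == "__main__":
    second = (
        "tree eeca41cd12f46cd4c237f28c78b7e11762a0b22b\n"
        "parent 58165443eca522ef35bad68964fc09ec000449ef\n"
        "author Reuven Lerner <reuven@lerner.co.il> 1483914372 +0200\n"
        "committer Reuven Lerner <reuven@lerner.co.il> 1483914372 +0200\n"
        "\n"
        "Added amazing brilliance to our text file\n"
    )
    info = parse_commit(second)
    print(info["tree"])
    print(info["parents"])
```

Following "parents" from commit to commit is, in essence, what
git log does when it shows you history.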

But what if I'm nostalgic for the old version of the file? Is it
gone? Definitely not; so long as a commit in our history points
to it, Git holds onto it. I can even look it up, using the same
SHA-1 as before:

$ git cat-file -p 797f7c1809e83fd6122cb4a247d345e7f5de4f5d
This is a test.
And a very good test it is!
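
The reason the old version survives is that objects are stored
under the hash of their contents, so new contents can only ever
add an entry, never overwrite one. Here is a toy, in-memory
sketch of that idea; the ObjectStore class is invented for this
example and is not how Git is implemented.

```python
import hashlib
from typing import Dict


class ObjectStore:
    """A tiny content-addressed store: the key is the hash of the value."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        """Store content under its blob ID; existing entries are untouched."""
        header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
        key = hashlib.sha1(header + content).hexdigest()
        self._objects.setdefault(key, content)
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)


if __name__ == "__main__":
    store = ObjectStore()
    old = store.put(b"This is a test.\n")
    new = store.put(b"This is a test.\nStill a great file, right?\n")
    # Two versions, two objects; the old one is still retrievable.
    print(len(store), store.get(old))
```

Storing the same contents twice gives the same key, so identical
files cost nothing extra.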

Now, cat-file isn't the sort of thing you use every day with Git.
But it does let you see that Git manages to do a lot with just a
few objects.

Next time, I'll talk about branches in Git, and how they're far
simpler than you might think. (Unless you already think that
they're simple!) And of course, if you have questions (about Git
or anything else!) that you would like me to address, please
respond to this message. I've been overwhelmed with suggestions
and ideas, so it'll take a while to get to all of them, but I
promise that I will.

Until next week,

Reuven

Sign up for newsletter at: https://lerner.co.il/newsletter/