
@nicowilliams
Last active September 15, 2024 01:55
git noobie and power-user crash course

Guide to Git Noobies and would-be Power-Users

This is an attempt at a concise guide to git that can help noobies and power users think their way around git. We assume the user knows a little bit about Unix filesystems or, alternatively, programming.

We explain the most essential git concept and then explain a second tier of concepts in terms of that one essential concept, then a third tier of concepts that power users must know (e.g., rebasing). This approach should give the reader all the mental tools needed to make the most of git.

What this is NOT:

  • this is NOT a git cheat-sheet (though there will be examples);
  • this is NOT a guide for people who are new to computers in general.

What a Repository Notionally is: a Bag of Commits + Name Resolution Table

A lot of systems consist of resources identified by non-user-friendly names plus a name resolution service that provides user-friendly naming. For example:

  • A Unix filesystem is... a bag of files identified by {inode number, generation number}, and a name resolution service (directories).
  • The Internet is a bunch of connected nodes identified by IP addresses, and a name resolution service (DNS).

So it is with git. A git repository is:

  • a bag of "commits" identified by a cryptographic hash value and which organize into trees via parent commit references in each commit
  • a name resolution system to resolve symbolic "branches" and "tags" to individual commits

(Now, Git does not store things internally this way, but this is a simplified and equivalent conception of what Git does.)

A commit is just a set of changes bundled together along with a commit comment and references to one or more parent commits. That's it for repositories. There are local repositories, and remote ones too, but we'll get to that in a bit.

The key takeaway is that a git repository is a bag of commits with unfriendly names (the commit hashes, because they are meaningless, largish, and there can be so many!) plus a name resolution mechanism. This is just like a filesystem, or the Internet.

This bag-of-commits+name-resolution concept is the essential git concept. Git is not exactly like this internally, but that's OK.

All operations on a repository involve adding commits and/or manipulating the name resolution table. One normally never removes commits from a repository except by garbage collection / pruning (about which more later).

I said commits organize into a tree. This is the version control tree. Each commit, other than a root commit, refers to one parent, and possibly more "merge parents". One can reach the root of the tree from any leaf or interior node by following the parent link of each commit. Each commit represents the state of a version of the repository: all the files and their contents at that point in time. Branches of the tree represent... version control branches (whether named or not).

A repository can contain multiple disjoint commit trees.

Because the parent links of a commit are cryptographic hashes of the parent commits, commits are immutable: modifying a commit will modify its hash, and as that would break the tree, one does not modify commits -- instead of modifying commits, one adds commits.

There is no link from a commit to its children, however. One has to "know" what the leaves are or search for them the hard way, and commits have cryptographic, non-human-friendly names anyways, which is why we need name resolution for branches and tags.

This concept of a repository is the most important thing to know about git. Understanding this is essential to being a productive git user. It is not unlike knowing the concept of inodes and directories, and the concept of hard-links, in Unix filesystems -- one quickly learns about mv(1), ln(1), rm(1), and developers quickly learn the underlying POSIX functions (system calls) such as rename(2), link(2), and unlink(2).
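
The bag-of-commits+name-resolution model can be poked at directly. The following is a minimal sketch, assuming a throwaway repository in a temporary directory; the file name, commit messages, and the "Example User" identity are placeholders, not anything from a real project. It resolves a branch name to a commit hash, then follows the parent link by hand.

```shell
set -e
tmp=$(mktemp -d) && trap 'rm -rf "$tmp"' EXIT   # scratch directory (assumption)
cd "$tmp"
git init -q
git config user.name "Example User"             # placeholder identity
git config user.email "user@example.com"

echo one > file.txt && git add file.txt && git commit -q -m "first"
echo two > file.txt && git commit -q -am "second"

branch=$(git symbolic-ref --short HEAD)   # whatever default branch name git chose
head=$(git rev-parse "$branch")           # name resolution: branch name -> commit hash
first=$(git rev-parse "$branch^")         # follow the parent link one step
echo "$branch resolves to $head"
echo "its parent is $first"
git cat-file -p "$head" | grep '^parent'  # the parent hash is stored inside the commit
```

Note that nothing in the first commit points forward to the second; only the branch name lets us find the newest commit.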

On to the Second Tier of Git Concepts

With the essential concept of repository as "bag of commits forming trees + name resolution (branches, tags) tables" in mind, we can now move on to a few other important concepts in the git user experience:

  • remotes
  • workspace
  • the index / staging area
  • tags and branches
  • fast-forward push/fetch

Remotes

A remote is just... another -typically remote, as the name implies- repository with which a local repository can exchange commits and name resolution entries. Fetching is the act of adding commits (and name resolution entries) to a local repository from a remote one. Pushing is the inverse: the act of adding commits (and names) to a remote repository from a local one. Remotes are identified by a pair of URIs (fetch and push), and a user-friendly local alias name.

Typically the name of a remote from which a local was cloned is origin, though one can change this -- it's just a local name!

Local repositories that are copies of remote repositories are called "clones".

IMPORTANT!  All operations are on a _local_ repository *except* for
            `fetch` (or `pull`) and `push` operations.

            Note that `push` only modifies a remote repository, while
            `fetch` and `pull` only modify a local repository by adding
            content fetched from a remote one.

            A `git pull` is actually just a `fetch` followed by
            integrating the fetched commits into the currently
            checked-out branch: by fast-forward (`--ff-only`), by merge
            (`--no-rebase`), or by rebase (`--rebase`).  More on
            fast-forwarding, merging, and rebasing later.
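
A remote does not have to be on another machine, which makes the push/fetch split easy to try out. This is a rough sketch, assuming a bare repository in a temporary directory stands in for the remote; the branch name `main` on the remote side and the placeholder identity are choices made for the example.

```shell
set -e
tmp=$(mktemp -d) && trap 'rm -rf "$tmp"' EXIT
git init -q --bare "$tmp/remote.git"      # stands in for a remote server (assumption)
git init -q "$tmp/local"
cd "$tmp/local"
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"
git remote add origin "$tmp/remote.git"   # "origin" is just a local alias

echo hello > README && git add README && git commit -q -m "local only so far"
local_head=$(git rev-parse HEAD)
before=$(git ls-remote origin | wc -l)    # the remote has no names (refs) yet

git push -q origin HEAD:refs/heads/main   # push: adds commits and a name to the remote
remote_head=$(git --git-dir="$tmp/remote.git" rev-parse refs/heads/main)

git fetch -q origin                       # fetch: only ever updates the local repository
tracking=$(git rev-parse refs/remotes/origin/main)
echo "local:       $local_head"
echo "remote main: $remote_head"
echo "origin/main: $tracking"
```

The commit exists only locally until the push; afterwards all three names resolve to the same hash.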

Workspaces

The workspace is just the set of files checked out from a repository at some version (a "HEAD" commit, possibly resolved from a branch name or tag name). One can make changes to files in the workspace, then commit the changes, or throw them away -- the workspace is a work in progress: the user's work in progress.

The "index"

The index -or staging area- is a mechanism for only taking some of the extant changes in a workspace and making a commit out of them while leaving other changes extant. The primary utility of the index is to make it easier to split up work in progress into logical commits. The index is also useful for reviewing changes to be committed without concern for any other changes about to be made to the workspace by, say, background processes.
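
As a small illustration of the index, the sketch below changes two files but commits only one of them. The file names and the scratch repository are assumptions made for the example.

```shell
set -e
tmp=$(mktemp -d) && trap 'rm -rf "$tmp"' EXIT
cd "$tmp"
git init -q
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"

echo a > a.txt; echo b > b.txt
git add a.txt b.txt && git commit -q -m "base"

echo "a changed" >> a.txt
echo "b changed" >> b.txt
git add a.txt                             # stage only the change to a.txt
staged=$(git diff --cached --name-only)   # what the next commit will contain
git commit -q -m "change a only"
remaining=$(git status --porcelain)       # b.txt is still modified and uncommitted
echo "committed: $staged"
echo "pending:  $remaining"
```

The change to `b.txt` stays in the workspace, ready to become a separate logical commit.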

Tags and Branches

A tag is a name for a commit and is meant never to change once created (though tags can be force-changed).

A branch is a name for a branch on the version control tree. A branch is a name just like a tag, but it is meant to automatically "move" to the "latest" commit whenever one a) has that branch checked out in a workspace and b) adds commits to it.

That's the gist of tags vs. branches.

A fast-forward push/fetch is one where the commits being pushed or fetched extend the target branch without diverging from it: following the parent links from the would-be new head commit eventually reaches the current head commit of the branch being updated. In other words, the current head is an ancestor of the new head, so the branch name can simply move forward.
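
The difference between tags and branches, and the fast-forward condition, can both be checked in a scratch repository. This is a sketch under the assumption of a throwaway repository with a placeholder identity; the tag name `v1` is illustrative. It uses `git merge-base --is-ancestor` to ask whether one commit is reachable from another via parent links.

```shell
set -e
tmp=$(mktemp -d) && trap 'rm -rf "$tmp"' EXIT
cd "$tmp"
git init -q
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"

echo 1 > f && git add f && git commit -q -m "one"
branch=$(git symbolic-ref --short HEAD)
git tag v1                                # tag: a fixed name for this commit
tagged=$(git rev-parse v1)
echo 2 > f && git commit -q -am "two"     # the branch name moves; the tag does not

echo "tag v1: $(git rev-parse v1)"
echo "branch: $(git rev-parse "$branch")"

# Fast-forward check: is the old head (v1) reachable from the new head?
if git merge-base --is-ancestor v1 "$branch"; then ff=yes; else ff=no; fi
# The reverse direction is not a fast-forward: the new head is not behind v1.
if git merge-base --is-ancestor "$branch" v1; then back=yes; else back=no; fi
echo "v1 -> $branch is a fast-forward: $ff (reverse: $back)"
```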

That's It

That's it! That's git. That's all you need to know to get started on thinking your way through using git. Of course, you'll need a few git commands to get you started on using git, and we'll need to talk about power use of git.

Notice that we did not discuss here anything about "cherry-picking", "rebasing", or "merging" -- these operations follow naturally from the conceptual nature of a repository as "bag of commits organizing into trees + name resolution", and we'll discuss them below.

This is a good place to take a break. The reader might come back minutes or days later. The key is to remember the bag-of-commits+name-table conception of a git repository. To be sure, there's a lot more to git, but this concept will help the reader manage most everything else about git.

On Immutability

A repository's "bag of commits" is mostly add-only: one mostly only adds commits, rarely removes commits, and never modifies commits. This means that it's difficult to lose work that has been committed without losing the repository itself (do NOT rm -rf .git!! do NOT step on .git/). It is possible to lose track of commits -- after all, there can be so many of them, but relatively few branch/tag names that are meaningful to a human! This is where the reflog comes to the rescue: it keeps track of changes to a local repository, thus it can help one find past resolutions of branch/tag names, and possibly restore them (or give them new names). This means that one need never lose work, provided one commits frequently (extant, uncommitted changes to the workspace can be lost as they are not tracked).

The only exception to the rule that one mostly only adds commits is garbage collection: because a repository could get very large if one only ever adds to it, one does have to clean up occasionally, which means deleting commits that are not reachable from any branch/tag names. For local clones of remote repositories one should disable automatic garbage collection -- this makes the reflog much more useful.

It's important to understand that the bag of commits is immutable, with a cryptographic "Merkle hash tree" structure (every commit includes the cryptographic hash of its parent(s)), and this means that history can never be rewritten as such, but the name resolution table (branches and tags) can be modified destructively. It is by destructively changing which commits branches/tags point to that one gives the appearance of rewriting history!
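
Here is a minimal sketch of the reflog rescuing a commit that no branch points to any more. It assumes a scratch repository with a placeholder identity; the branch name `rescued` is made up for the example.

```shell
set -e
tmp=$(mktemp -d) && trap 'rm -rf "$tmp"' EXIT
cd "$tmp"
git init -q
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"

echo 1 > f && git add f && git commit -q -m "keep"
echo 2 > f && git commit -q -am "oops, about to lose this"
lost=$(git rev-parse HEAD)

git reset -q --hard HEAD^                 # the branch name now skips the second commit
git --no-pager reflog -n 2                # but the reflog remembers where HEAD was
git branch rescued 'HEAD@{1}'             # give the "lost" commit a name again
echo "rescued points at $(git rev-parse rescued)"
```

The commit was never modified or deleted; only the name resolution table changed, and the reflog recorded the old entry.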

Third Tier of Git Concepts: Cherry-picking, Rebasing, and Merging

The most important concept after the above is cherry-picking. To cherry-pick a commit is to apply its changes to a [possibly-]different version of the repository than it was originally meant for. This can yield conflicts. Some conflicts can be resolved automatically, while others require manual intervention. Once conflicts (if any) are resolved, a new commit is created from the workspace modified by applying the cherry-picked commit, with some of the original commit's metadata (author, commit message) used to construct the new commit. The new commit's hash will generally differ from the old one's, because the parent and the committer metadata differ; the exception is when the workspace's HEAD is the cherry-picked commit's parent and one asks for a fast-forward (git cherry-pick --ff), in which case the original commit is simply reused.
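
To make this concrete, the sketch below cherry-picks one commit from a branch onto another and compares the old and new commits. The branch name `feature`, the file names, and the scratch repository are assumptions for the example; the files are chosen so that no conflict arises.

```shell
set -e
tmp=$(mktemp -d) && trap 'rm -rf "$tmp"' EXIT
cd "$tmp"
git init -q
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"

echo base > base.txt && git add base.txt && git commit -q -m "base"
main=$(git symbolic-ref --short HEAD)

git checkout -q -b feature
echo fix > fix.txt && git add fix.txt && git commit -q -m "the fix"
picked=$(git rev-parse HEAD)

git checkout -q "$main"
echo other > other.txt && git add other.txt && git commit -q -m "unrelated work"
onto=$(git rev-parse HEAD)

git cherry-pick "$picked" >/dev/null      # apply "the fix" on top of a different parent
copy=$(git rev-parse HEAD)
echo "original: $picked"
echo "copy:     $copy ($(git log -1 --format=%s "$copy"))"
```

The copy carries the same message and changes, but it has a different parent and therefore a different hash.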

A merge consists of finding all the commits from a branch that are not in another, then applying all those commits' changes to the other, committing them all as one commit. The resulting "merge" commit records, as an additional parent, the head commit of the branch from which those changes were taken.

A rebase is very simple: a sequence of cherry-picks -- a script, practically. The idea is to replicate onto some branch a set of commits from another. This is done by finding the "merge base" for the two branches, listing all of the commits not in the target, changing the workspace to have the head of the target, then cherry-picking all those listed commits. A variant is to explicitly specify the merge base.

An interactive rebase is one where the sequence of commits to cherry-pick onto the target branch is shown to the user so that the user may alter the rebase script. Here the user has a number of options, such as stopping after some or each pick so that the user may do things like: test each commit, or add additional commits in the middle, or rewrite commits. Other options include: reordering the pick script, squashing commits (meaning: merge two or more contiguous commits into one), and so on.

A merge is really a variation on rebase, with all picked commits squashed into one and with metadata left about the merge parent (so that one may find it in the history).

The key building block of both rebase and merge is the cherry-pick operation!

As with making a normal commit with git commit, if one has checked out a branch, then at the end of a cherry-pick/merge/rebase operation the named branch will be updated to point to the head commit resulting from that operation.
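
The rebase described above can be sketched in a scratch repository. This is an illustration only, with made-up branch and file names and a placeholder identity: two commits on `feature` are re-created on top of a main branch that has moved on.

```shell
set -e
tmp=$(mktemp -d) && trap 'rm -rf "$tmp"' EXIT
cd "$tmp"
git init -q
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"

echo base > base.txt && git add base.txt && git commit -q -m "base"
main=$(git symbolic-ref --short HEAD)

git checkout -q -b feature
echo 1 > one.txt && git add one.txt && git commit -q -m "feature 1"
echo 2 > two.txt && git add two.txt && git commit -q -m "feature 2"
old_head=$(git rev-parse feature)

git checkout -q "$main"
echo m > main.txt && git add main.txt && git commit -q -m "main moved on"

git checkout -q feature
git merge-base "$main" feature            # the commit where the two branches diverged
git rebase -q "$main"                     # re-pick "feature 1" and "feature 2" onto main
new_head=$(git rev-parse feature)
git --no-pager log --oneline "$main"..feature   # the two re-created commits
```

The old commits are still in the bag (reachable through the reflog); the branch name `feature` has simply been moved to the new copies.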

Discussion of Implications of the Essential Git Concept

Now, let's talk about the implications of a repository as a bag of commits with name resolution.

  1. Branches and tags point to commits from which we can reach tree roots by following the parents. These are all "reachable" commits in that we can reach them just by knowing symbolic names of tags/branches.

  2. As long as you don't prune/garbage-collect unreachable commits, you can recover from errors in changing which commits a tag or branch points to. The only errors you can't recover from are: deleting extant workspace content that is neither in the index nor committed.

    This is generally true of version control systems: they can't know when to snapshot extant workspace changes without the user's help in identifying what changes they want to... commit to keeping.

    This is why users should git add to the index, and/or git commit, often. Users can fix issues "in post", but recreating lost work is harder, so git add/git commit often.

  3. Unlike Unix filesystems, where the {inode number, generation number} identifiers of files are not usable for opening files (mostly), in git the hash of a commit absolutely can be used to refer to it. This means that the name resolution from branches and tags to commits isn't entirely necessary -- it's only necessary due to humans' poor memory, as we can remember a few names easily and can't remember any pseudo-random 160- or 256-bit strings!

    It is perfectly legitimate to perform a number of operations on a repository by referring only to commits, and only at the end ensure that a name (branch or tag) can be used to find the resulting head commit. Power users do sometimes do this. When working in this way one sees Git warn of being in "detached HEAD mode".

  4. Reprising (1) and (2)... Git doesn't know what the full tree looks like without scanning all the commits in the bag to find leaves and then work its way up to root commits. There can be many, many leaf commits, and many interior node commits, all identified canonically by a hash.

    Of course, git always knows about local branches and tags, so git can always find those commits, and their paths to root commits, very quickly -- git does not have to scan all commits to do this.

  5. Naming is always critical, in git as on the Internet (DNS) as on filesystems as in programming (variable and member names).

  6. Immutability is not difficult to deal with. Commits are immutable, but branch/tag names are not. Almost all operations consist of creating a new commit and updating a branch to point to it. Even rebasing is just a method of constructing new commits from others and then updating a branch to point to the resulting new HEAD commit.

    Users who are familiar with data immutability in programming languages like Haskell or jq should find it easy to think about git. Alternatively, users who learn to think of git as explained here may have an easier time understanding data immutability in programming languages that feature it.
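
Point 3 above (working with commits only and naming the result at the end) can be sketched as follows. The scratch repository, placeholder identity, and the branch name `named-later` are assumptions for the example.

```shell
set -e
tmp=$(mktemp -d) && trap 'rm -rf "$tmp"' EXIT
cd "$tmp"
git init -q
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"

echo 1 > f && git add f && git commit -q -m "one"
git checkout -q --detach                  # HEAD now names a commit, not a branch
echo 2 > f && git commit -q -am "made with no branch checked out"

state=$(git rev-parse --abbrev-ref HEAD)  # prints "HEAD" when detached
tip=$(git rev-parse HEAD)
git branch named-later "$tip"             # only now give the new head commit a name
echo "state: $state, named-later -> $(git rev-parse named-later)"
```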

More Git concepts: objects, trees

In reality Git tracks "objects", which are basically files with an object type metadatum. All of these are always identified by a cryptographic hash. There are several types of objects:

  • blob (basically: a file's content that you've put under version control)
  • tree (basically: a special file mapping paths to object hashes)
  • commit (basically: a file whose format is that of a commit and which points to the new tree object for the state of the repository after applying the commit)

Do not confuse a tree object with a tree of commits!

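A tree object's content, as pretty-printed by git cat-file -p, looks like:

<mode> <type> <hash> <name>
...

where mode is a Unix-like file mode (permissions), type is the type of the object referred to (blob for a file, tree for a subdirectory), and hash is the hash of the object containing the contents expected of the entry called name (which is relative to the directory this tree object represents).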

A commit object's contents looks like:

tree <hash>
parent <hash>
author <name> <time>
committer <name> <time>

<commit comment>

For example:

$ git cat-file -p HEAD
tree d48c3c4ae796926c9298be68611fef0f20bb73b2
parent bafb43e58985b8db21a8c338a7637e6f0e35051f
author Jakub Jirutka <jakub@jirutka.cz> 1518133453 +0100
committer Nico Williams <nico@cryptonector.com> 1519146688 -0500

Build static binaries and deploy to GH Releases from Travis
$ 
$ git cat-file -p d48c3c4ae796926c9298be68611fef0f20bb73b2 | head -5
100644 blob 596615322fb3e97de6b2ce208e3557e4b416a972    .gitattributes
100644 blob 666749f5bffcd46dc21a55aaeb352952d51287c8    .gitignore
100644 blob 3fff3b881cb4875b084efac51074d649d5c75e09    .gitmodules
100644 blob 3ec64674a1118b81ad54837a73b7c4fb5f799dff    .travis.yml
100644 blob 198238e4a014021899a0b574337aa43f5341b52e    AUTHORS
$ git cat-file -p 198238e4a014021899a0b574337aa43f5341b52e | head -3
Created By:
Stephen Dolan        <mu@netsoc.tcd.ie>

$ 

Note that commit objects don't actually store diffs! Git will reconstruct diffs as needed from the actual objects in the actual trees referred to by the various commits. That's because reconstructing diffs is faster than reconstructing objects.
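
The object types can be inspected with `git cat-file -t`. The sketch below assumes a scratch repository containing a single placeholder `README`; it also checks that a blob's name is just the hash of the file's content, as computed by `git hash-object`.

```shell
set -e
tmp=$(mktemp -d) && trap 'rm -rf "$tmp"' EXIT
cd "$tmp"
git init -q
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"

echo "hello" > README && git add README && git commit -q -m "add README"

commit_type=$(git cat-file -t HEAD)            # the commit object itself
tree_type=$(git cat-file -t 'HEAD^{tree}')     # the tree the commit points to
blob_type=$(git cat-file -t HEAD:README)       # the file content named by that tree
blob_id=$(git rev-parse HEAD:README)
rehash=$(git hash-object README)               # hash computed from the workspace file
echo "$commit_type -> $tree_type -> $blob_type"
git cat-file -p 'HEAD^{tree}'                  # one "<mode> <type> <hash> <name>" line
```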

So Git is in fact not quite a bag of commits as described at the top of this gist! But that's OK: a "bag of commits" is good enough to take you to the Git power-user level. The key is that viewing Git as a "bag of commits" makes it easier to think about operations like cherry-picking, rebasing, and merging.

Consider git stash

git stash is a feature that allows you to put away your current, extant changes in the workspace and undo them. It's a lot like doing git diff, saving its output, then git checkout -f -- to reset the workspace to the contents of the HEAD commit.

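git stash does record the commit a stash was made on, but git stash pop and git stash apply will apply the stash to whatever HEAD happens to be checked out, and a stash is not an ordinary named branch. A better way to do what git stash does is: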

$ branch=$(git rev-parse --abbrev-ref HEAD)
$ git checkout -b stash_blahblah
$ git commit -am WIP
$ git checkout "$branch"

Then when you want to go back:

$ git checkout stash_blahblah # go back to the "stashed" branch "blahblah"
$ git reset HEAD^             # undo the WIP commit but leave changes in workspace

Here we see the power of understanding git as a bag of commits + name resolution: we can trivially build what git stash does, and we can do it better while also reducing the cognitive load of git as you now don't need to bother learning how to use git stash!

Actual Usage

<TBD. Besides the usual git init/clone, add, commit, and other commands, show detached head mode in order to make the student think of the bag of commits model. Also show a sequence of two cherry-picks; do it twice, once showing the -n option as the basis of squashing. Show one interactive rebase. Show one rebase --onto to further drive the point of how it computes a set of commits to pick.>

Init

$ mkdir foo
$ cd foo
$ git init
$ 

Clone

$ git clone https://github.com/stedolan/jq
$ cd jq
$ ...

Make changes, use the index, add commits

$ $EDITOR README
<write README>
$ git add -e README
<edit the diffs to pick out what to commit now>
$ # you can git add again if you like
$ git commit -m 'First commit!'
$ git add README # add remaining changes
$ git commit -m 'Second commit!'

Create a branch

$ git checkout -b new_branch_name
$ 

Undo workspace changes to specific files

$ git checkout -f -- filename
$ 

refspecs -- Dealing with remote branches

The namespace of branches on a remote repository is distinct from that in a (local) clone of that repository. Typically one keeps the two namespaces in sync: fetch everything, push everything. But this is not always what one wants to do. Sometimes you want to push a branch such that it has a different name on the remote end. Other times one might want to delete a branch on a remote. This is where refspecs come in.

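A refspec is just a string of the form <source>:<destination>. When pushing, the source is a local name and the destination is a remote name; when fetching, it is the other way around:

$ # Push the current HEAD as branch myfeature
$ git push origin HEAD:myfeature
$ # Delete remote branch myfeature
$ git push origin :myfeature
$ # Fetch remote branch foo as local branch bar
$ git fetch origin foo:bar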
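
The refspec forms above can be tried end to end against a local bare repository standing in for the remote. This is a sketch with assumed names: the scratch paths, the placeholder identity, and the local branch name `copy` are all illustrative. The push spells out `refs/heads/myfeature` so the destination is unambiguous.

```shell
set -e
tmp=$(mktemp -d) && trap 'rm -rf "$tmp"' EXIT
git init -q --bare "$tmp/remote.git"      # stands in for a remote server (assumption)
git init -q "$tmp/local"
cd "$tmp/local"
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"
git remote add origin "$tmp/remote.git"

echo x > f && git add f && git commit -q -m "work"
head=$(git rev-parse HEAD)

git push -q origin HEAD:refs/heads/myfeature   # local HEAD -> remote branch "myfeature"
git fetch -q origin myfeature:copy             # remote "myfeature" -> local branch "copy"
copy=$(git rev-parse copy)
git push -q origin :myfeature                  # empty source: delete remote "myfeature"
left=$(git ls-remote --heads origin myfeature) # empty once the remote branch is gone
echo "copy -> $copy; remote myfeature after delete: '${left}'"
```

Deleting the remote name does not touch the local branch `copy`: the two namespaces are separate.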

...
