How is git commit sha1 formed

Ok, I geeked out, and this is probably more information than you need. But it completely answers the question. Sorry. ☺

Locally, I'm at this commit:

$ git show
commit d6cd1e2bd19e03a81132a23b2025920577f84e37
Author: jnthn <jnthn@jnthn.net>
Date:   Sun Apr 15 16:35:03 2012 +0200

    When I added FIRST/NEXT/LAST, it was idiomatic but not quite so fast. This makes it faster. Another little bit of masak++'s program.

So that's the sha1 I want to reproduce. d6cd1e2bd19e03a81132a23b2025920577f84e37

When I started my investigations, I thought it was something like these things that went into a commit:

$ git cat-file commit HEAD
tree 9bedf67800b2923982bdf60c89c57ce6fd2d9a1c
parent de1eaf515ebea46dedea7b3ae0e5ebe3e1818971
author jnthn <jnthn@jnthn.net> 1334500503 +0200
committer jnthn <jnthn@jnthn.net> 1334500545 +0200

When I added FIRST/NEXT/LAST, it was idiomatic but not quite so fast. This makes it faster. Another little bit of masak++'s program.

That is

  • The source tree of the commit (which unravels to all the subtrees and blobs)
  • The parent commit sha1
  • The author info
  • The committer info (right, those are different!)
  • The commit message

But it turns out there is also a NUL-terminated header that gets prepended to this, containing the word "commit" and the length in bytes of all of the above information:

$ printf "commit %s\0" $(git cat-file commit HEAD | wc -c)
commit 327

(No, you can't see the NUL byte.)

Put this header and the rest of the information together:

$ (printf "commit %s\0" $(git cat-file commit HEAD | wc -c); git cat-file commit HEAD)
commit 327tree 9bedf67800b2923982bdf60c89c57ce6fd2d9a1c
parent de1eaf515ebea46dedea7b3ae0e5ebe3e1818971
author jnthn <jnthn@jnthn.net> 1334500503 +0200
committer jnthn <jnthn@jnthn.net> 1334500545 +0200

When I added FIRST/NEXT/LAST, it was idiomatic but not quite so fast. This makes it faster. Another little bit of masak++'s program.

...and what you get hashes to the right sha1!

$ (printf "commit %s\0" $(git cat-file commit HEAD | wc -c); git cat-file commit HEAD) | sha1sum
d6cd1e2bd19e03a81132a23b2025920577f84e37  -

Excellent writeup!

(and excellent title)

Thanks for this, very informative.

This is really cool! But then how about also guessing the tree hash? I'm trying to apply a patch completely out of context, sort of like a patch-rebase, and I need to fabricate what would be a valid hash for the given commit info I have in hand (while changing the original patch timestamp to something that is more recent than the new HEAD commit I'm patching over), so the patch goes through.

icyflame commented Aug 6, 2014

Informative.
Nerd Climate (out of 10) : tending to 10!

pchaigno commented Nov 4, 2014

Thanks for this, it has been very useful!

As @goldfeld, I'm trying to form the tree hash.
Any idea on how this one is formed?

Has anybody else noticed that git branches and commits look like the Bitcoin blockchain without "Proof of Work"? :)

Thanks for this!

Very interesting. Thank you.

For those wondering, creating the tree hash is a little more involved. Git will lie to you (a little bit) when you ask for the contents of a tree object.

git cat-file -p HEAD^{tree}

will produce something like

100644 blob f73693a16cdf594532ee4c423a46d32ce3430c4e    blah.txt
040000 tree 86c2509f4c12c5d3bf9a486925ed051666ee2d97    new_dir
100644 blob b5fd817de972cdb092b7dfbeeb1bedb4f05eb218    new_file.txt
100644 blob 0861b9114fba8c82892d89e53f2a34447bd4c9e7    newer_file.txt

But this is not how a tree object is saved before it is compressed. For one, there are no newlines in the uncompressed tree object, but I'm going to add them for output here.

tree 196\0
100644 blah.txt\0f73693a16cdf594532ee4c423a46d32ce3430c4e
40000 new_dir\086c2509f4c12c5d3bf9a486925ed051666ee2d97
100644 new_file.txt\0b5fd817de972cdb092b7dfbeeb1bedb4f05eb218
100644 newer_file.txt\00861b9114fba8c82892d89e53f2a34447bd4c9e7

Okay, this looks a little better, but there's still one more "lie" (and if you count the characters and compare to the 196 I added in the tree header, you can see what it is). Unlike commit objects, tree objects don't store sha1 hashes in plaintext. They are packed down to just 20 bytes. Each two-character hex pair is converted to a single byte, which is more like this:

tree 196\0
100644 blah.txt\0\xf7\x36\x93\xa1\x6c\xdf\x59\x45\x32\xee\x4c\x42\x3a\x46\xd3\x2c\xe3\x43\x0c\x4e
40000 new_dir\0\x86\xc2\x50\x9f\x4c\x12\xc5\xd3\xbf\x9a\x48\x69\x25\xed\x05\x16\x66\xee\x2d\x97
100644 new_file.txt\0\xb5\xfd\x81\x7d\xe9\x72\xcd\xb0\x92\xb7\xdf\xbe\xeb\x1b\xed\xb4\xf0\x5e\xb2\x18
100644 newer_file.txt\0\x08\x61\xb9\x11\x4f\xba\x8c\x82\x89\x2d\x89\xe5\x3f\x2a\x34\x44\x7b\xd4\xc9\xe7

So that is what you should be taking the sha1 hash of to create a tree object in git's object store.

Hope that helps!
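To make the packing step concrete, here is a small Ruby sketch of how one entry and a whole tree could be serialized and hashed. The helper names tree_entry, tree_body and tree_id are hypothetical, and the sketch assumes the entries are already in the order git wants (git sorts tree entries by name), so it illustrates the layout rather than replacing git's own tree writing.

```ruby
require 'digest/sha1'

# Hypothetical helpers, named here only for illustration.

# One entry: "<mode> <name>\0" followed by the sha1 packed into 20 raw bytes.
def tree_entry(mode, name, hex_sha1)
  "#{mode} #{name}\0".b + [hex_sha1].pack("H*")
end

# The tree body is just the entries back to back, with no separators or newlines.
def tree_body(entries)
  entries.map { |mode, name, hex| tree_entry(mode, name, hex) }.join.b
end

# Same header rule as for commits: "tree <size>\0" followed by the body, then SHA-1.
def tree_id(entries)
  body = tree_body(entries)
  Digest::SHA1.hexdigest("tree #{body.bytesize}\0".b + body)
end

entry = tree_entry("100644", "blah.txt", "f73693a16cdf594532ee4c423a46d32ce3430c4e")
puts entry.bytesize                       # "100644 blah.txt\0" plus 20 bytes of sha1
puts entry[-20, 20].unpack("H*").first    # the packed bytes turned back into hex
puts tree_id([])                          # the empty tree, as a sanity check
```

Note that directories use mode 40000 in the stored form, without the leading zero that git cat-file -p prints.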

In Ruby, you would open a file like this:

require 'zlib'
#  This will open that new_dir tree object above. Run it from inside .git/objects/86/,
#  since the first two hex digits of the sha1 name the directory.
#  File.binread reads the whole file in binary mode, which matters because it's zlib-compressed.
file = File.binread("c2509f4c12c5d3bf9a486925ed051666ee2d97")
content = Zlib::Inflate.inflate(file)
=> "tree 44\x00100644 sub_dir_file.txt\x00=\xFD\xC5\x9BF\xD2\xAA7*vz\xA1$\xDFq\xB5\xDDs\x10A"

And if you unpack those last 20 bytes to something prettier:

hash = content[-20..-1].unpack("H*").first
=> "3dfdc59b46d2aa372a767aa124df71b5dd731041"
content[0...-20] + hash
=> "tree 44\x00100644 sub_dir_file.txt\x003dfdc59b46d2aa372a767aa124df71b5dd731041"

MUCH better.

Here's the StackOverflow answer where I learned this: http://stackoverflow.com/questions/14790681/format-of-git-tree-object

Note that he adds in spaces and newlines for output as well.
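Going one step further, an inflated tree object can be split back into its entries. The sketch below is only an illustration: parse_tree is a made-up name, SAMPLE is the inflated content from the session above retyped as a literal, and there is no error handling for malformed input.

```ruby
# Hypothetical helper: split an inflated tree object into [mode, name, hex sha1] triples.
def parse_tree(raw)
  _header, body = raw.b.split("\0".b, 2)        # header is "tree <size>"
  entries = []
  until body.empty?
    mode_name, rest = body.split("\0".b, 2)     # "<mode> <name>", then the raw bytes
    mode, name = mode_name.split(" ", 2)
    entries << [mode, name, rest[0, 20].unpack("H*").first]
    body = rest[20..-1]                         # move on to the next entry
  end
  entries
end

# The inflated new_dir tree from the session above, retyped as a binary string.
SAMPLE = "tree 44\x00100644 sub_dir_file.txt\x00=\xFD\xC5\x9BF\xD2\xAA7*vz\xA1$\xDFq\xB5\xDDs\x10A".b

parse_tree(SAMPLE).each { |mode, name, hex| puts "#{mode} #{hex} #{name}" }
```

Splitting on the first NUL is safe here because file names cannot contain NUL, while the 20 sha1 bytes that follow can contain any value and so have to be taken by length.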

That is cool. Thanks.

ytrezq commented Oct 23, 2015

@masak what about the sha1 binary form that is used internally? Is the hex form simply base64 encoded?

Thanks :)

@ytrezq: it is base16 encoded: just a hex representation of the binary hash.
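A quick way to see the difference in Ruby, as a sketch: HEX_ID and RAW_ID are names picked just for this example, and the id is the commit sha1 from the top of the gist.

```ruby
HEX_ID = "d6cd1e2bd19e03a81132a23b2025920577f84e37"   # commit sha1 from the top of the gist
RAW_ID = [HEX_ID].pack("H*")                          # the 20-byte binary form

puts RAW_ID.bytesize                         # number of raw bytes
puts RAW_ID.unpack("H*").first == HEX_ID     # hex is a lossless two-characters-per-byte view
puts [RAW_ID].pack("m0")                     # base64 of the same bytes, only for contrast
```

Base16 spends two characters per byte, which is why the 20-byte hash shows up as 40 hex characters.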

danger89 commented Jun 9, 2016

Thanks, clear :)

Just used this information today. Thanks!

yeasy commented Jul 27, 2016

@Perlover blockchain is mostly a dynamic chain, while git is a DAG.
However, the content-based addressing idea is quite similar in both!

xtbl commented Sep 1, 2016

Thanks, awesome explanation.

Just came across this — thanks for the writeup! :D

Thanks, very clear.

firogh commented Feb 15, 2017

Cool and thanks.

👍

dalzuga commented Mar 31, 2017

Very nice!

jguevara commented Jul 2, 2017

Thanks, that proves that commit hashes are generated in a predictable and reproducible way. This info is useful for users of tools like subgit, which imports SVN repos into git.

Thanks for this; it saved me a lot of effort!
