How is git commit sha1 formed

Ok, I geeked out, and this is probably more information than you need. But it completely answers the question. Sorry.

Locally, I'm at this commit:

$ git show
commit d6cd1e2bd19e03a81132a23b2025920577f84e37
Author: jnthn <jnthn@jnthn.net>
Date:   Sun Apr 15 16:35:03 2012 +0200

    When I added FIRST/NEXT/LAST, it was idiomatic but not quite so fast. This makes it faster. Another little bit of masak++'s program.

So that's the sha1 I want to reproduce. d6cd1e2bd19e03a81132a23b2025920577f84e37

When I started my investigations, I assumed the sha1 was computed from something like the following contents of a commit:

$ git --no-replace-objects cat-file commit HEAD
tree 9bedf67800b2923982bdf60c89c57ce6fd2d9a1c
parent de1eaf515ebea46dedea7b3ae0e5ebe3e1818971
author jnthn <jnthn@jnthn.net> 1334500503 +0200
committer jnthn <jnthn@jnthn.net> 1334500545 +0200

When I added FIRST/NEXT/LAST, it was idiomatic but not quite so fast. This makes it faster. Another little bit of masak++'s program.

That is

  • The source tree of the commit (which unravels to all the subtrees and blobs)
  • The parent commit sha1
  • The author info
  • The committer info (right, those are different!)
  • The commit message

But it turns out there is also a NUL-terminated header that gets prepended to this, containing the word "commit" and the length in bytes of all of the above information:

$ printf "commit %s\0" $(git --no-replace-objects cat-file commit HEAD | wc -c)
commit 327

(No, you can't see the NUL byte.)

Put this header and the rest of the information together:

$ (printf "commit %s\0" $(git --no-replace-objects cat-file commit HEAD | wc -c); git --no-replace-objects cat-file commit HEAD)
commit 327tree 9bedf67800b2923982bdf60c89c57ce6fd2d9a1c
parent de1eaf515ebea46dedea7b3ae0e5ebe3e1818971
author jnthn <jnthn@jnthn.net> 1334500503 +0200
committer jnthn <jnthn@jnthn.net> 1334500545 +0200

When I added FIRST/NEXT/LAST, it was idiomatic but not quite so fast. This makes it faster. Another little bit of masak++'s program.

...and what you get hashes to the right sha1!

$ (printf "commit %s\0" $(git --no-replace-objects cat-file commit HEAD | wc -c); git --no-replace-objects cat-file commit HEAD) | sha1sum
d6cd1e2bd19e03a81132a23b2025920577f84e37  -
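
The same computation can be sketched in Python with nothing but hashlib. This is a minimal sketch, assuming the commit content is byte-for-byte what cat-file printed above (a single-line message with a trailing newline); git_object_sha1 is a name made up for this example, not a Git API.

```python
import hashlib


def git_object_sha1(obj_type: bytes, body: bytes) -> str:
    """Hash the NUL-terminated '<type> <length>' header followed by the body."""
    header = obj_type + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Assumption: these bytes are exactly what `git cat-file commit HEAD` printed.
body = (
    b"tree 9bedf67800b2923982bdf60c89c57ce6fd2d9a1c\n"
    b"parent de1eaf515ebea46dedea7b3ae0e5ebe3e1818971\n"
    b"author jnthn <jnthn@jnthn.net> 1334500503 +0200\n"
    b"committer jnthn <jnthn@jnthn.net> 1334500545 +0200\n"
    b"\n"
    b"When I added FIRST/NEXT/LAST, it was idiomatic but not quite so fast. "
    b"This makes it faster. Another little bit of masak++'s program.\n"
)

print(len(body))  # 327, the same count `wc -c` produced for the header
print(git_object_sha1(b"commit", body))
```

If the bytes really do match what Git stored, the second line printed should be the sha1 that git show reported.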

workplaylifecycle commented May 15, 2019

Awesome, but why not check the Git source code? That would involve less inference. Would that work?

zgauhar commented Jun 7, 2019

Thanks for a very informative article. I'm trying to reproduce some git commit hashes manually, meaning I just have a txt file containing git log output. I am wondering how to generate the first parent hash. For the very first commit, I have zero parents, so should I just put the sha1sum of 0?
Any example for the first and second parents would be extremely helpful.

Author

masak commented Jun 7, 2019

@zgauhar No, all the lines of the first commit are still there, except the line that starts with parent. The computation of the SHA-1 sum is otherwise the same.

Similarly, commits that are merges have two or more parent lines.
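
As a rough illustration of that answer, here is a sketch that serializes a root commit (no parent line), an ordinary commit (one parent line), and a merge (two parent lines). The identity, timestamps, and messages are hypothetical, commit_body and commit_sha1 are names invented for this example, and the tree is Git's well-known empty tree.

```python
import hashlib


def commit_body(tree, parents, author, committer, message):
    """Serialize a commit: one 'parent' line per parent, none for a root commit."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]
    lines += ["author " + author, "committer " + committer, "", message]
    return "\n".join(lines).encode("utf-8")


def commit_sha1(body):
    """Prefix the 'commit <length>' header plus NUL, then hash."""
    return hashlib.sha1(b"commit %d\0" % len(body) + body).hexdigest()


# Hypothetical identity and timestamp, for illustration only.
who = "A U Thor <author@example.com> 1559900000 +0000"
empty_tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"

root = commit_body(empty_tree, [], who, who, "first\n")
root_sha = commit_sha1(root)

# The root commit's hash becomes the parent of the second commit.
second = commit_body(empty_tree, [root_sha], who, who, "second\n")
second_sha = commit_sha1(second)

# A merge simply lists several parent lines.
merge = commit_body(empty_tree, [second_sha, root_sha], who, who, "merge\n")

print(root.decode())
print(second.decode())
```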

zgauhar commented Jun 7, 2019

Thanks, so we don't mention the parent at all in the very first commit. And that first commit (hash) becomes the parent for the second commit and so on?

zgauhar commented Jun 7, 2019

Secondly, if I have only one file in the whole project, say abc.txt, then will the tree hash be calculated as
tree 100\0
100644 abc.txt\0\xf7\x36\x93\xa1\x6c\xdf\x59\x45\x32\xf70\xf71\xf72\xf73\xf74\xf75\xf76\xf77\xf78\xf79\x360
with correct length and the file hash?

zgauhar commented Jun 7, 2019

0f73693a16cdf594532ee4c423a46d32ce3430c4e

How do you get
\xf7\x36\x93\xa1\x6c\xdf\x59\x45\x32\xf70\xf71\xf72\xf73\xf74\xf75\xf76\xf77\xf78\xf79\x360
from
f73693a16cdf594532ee4c423a46d32ce3430c4e

My concern is the 3-character values (f71, f72, etc.); as I understand it, a hex byte has only two digits. Shouldn't they rather be \xee\x4c\x42\x3a and so on? Or am I missing something?
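
For what it's worth, each pair of hex digits maps to exactly one raw byte, so a 40-digit hash becomes 20 bytes and the tail of the entry would indeed be \xee\x4c\x42\x3a and so on. A minimal sketch of a one-entry tree, assuming the file is a regular non-executable file (mode 100644) and that the hash quoted above is its blob hash:

```python
import hashlib

# Blob hash quoted in the comment above (40 hex digits).
blob_sha_hex = "f73693a16cdf594532ee4c423a46d32ce3430c4e"

# Two hex digits per byte: 40 digits -> 20 raw bytes.
raw = bytes.fromhex(blob_sha_hex)

# One tree entry: "<mode> <name>" + NUL + 20 raw hash bytes.
entry = b"100644 abc.txt\0" + raw

# The header carries the byte length of all entries together.
header = b"tree " + str(len(entry)).encode("ascii") + b"\0"

print(len(raw), len(entry))
print(hashlib.sha1(header + entry).hexdigest())
```

With a single entry like this one, the length in the header is the length of that entry (35 bytes here), not 100.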

jbarrick-mesosphere commented Sep 30, 2019

This is super helpful. Thanks a bunch!

gpltaylor commented Nov 13, 2019

I don't think I will ever need this knowledge, but somehow I don't think I could live without it :)
Really good writeup!

HectorRicardo commented May 24, 2020

Maybe I am missing something... but doesn't it also take into account the timestamp of the commit?
https://stackoverflow.com/questions/23791999/why-does-git-commit-amend-change-the-hash-even-if-i-dont-make-any-changes

Author

masak commented May 25, 2020

@HectorRicardo "author info" and "committer info" both contain timestamps.
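
A small sketch of why an amend with no other changes still produces a new hash: only the committer timestamp differs between the two bodies below. The identity, timestamps, and message are hypothetical, and commit_sha1 is a helper invented for this example.

```python
import hashlib


def commit_sha1(body: bytes) -> str:
    """Hash 'commit <length>' + NUL + body."""
    return hashlib.sha1(b"commit %d\0" % len(body) + body).hexdigest()


# Hypothetical commit; {ts} is the committer timestamp.
template = (
    "tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    "author A U Thor <author@example.com> 1590300000 +0000\n"
    "committer A U Thor <author@example.com> {ts} +0000\n"
    "\n"
    "same message\n"
)

original = commit_sha1(template.format(ts=1590300000).encode("utf-8"))
amended = commit_sha1(template.format(ts=1590300060).encode("utf-8"))

print(original != amended)  # True
```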

gebitang commented Jan 6, 2021

you are the MAN

jeffrade commented Feb 3, 2021

If you sign your commits, gpgsig will also be a part of the commit (see the example below):

$ git cat-file commit a401338d245961323815e32c94b9ca831c21e07b
tree 6751ff7d3dedafdeae175cefc968fe41e8aec928
parent 538e41375a1799f664fc54ffee70a911d611226e
author Brooke Kuhlmann <brooke@alchemists.io> 1610567835 -0700
committer Brooke Kuhlmann <brooke@alchemists.io> 1610567835 -0700
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE0UiFiNLe33PmLweh8rxJvE/7mkgFAl//UNEACgkQ8rxJvE/7
 mkivlA//TECiT4prHNA8woylOmDbktRWzTtDzXUson08VqhgLIxx8NWEehXUYP0/
 tF2ec10ED/n+qd1Ts035aJJxJGuNFkdFeTBUA3T+iQkLwg7MCWpnm83cPCPTGiTw
 Jk0G5fb0pV5QFY9qFMzBg5MzteyBD69i8Un02Tnu7yVIQsZ/+eFZETVfkYuCDq+R
 K8IPRUIITzN0CacHTi4K/NuAdhyYpZgyEnhamUXwpu4J3rVEOf90x1Vh1XbwW9yk
 1D7uoUKg0vz7FlYGyfd7y8ZNdDFF12Vq6UfcFyaU6x0jC3NqdgUsGEWLKnFWKsav
 8GPYWeUyJKVDThoiIKvESeaI6d7Fp2cnectX1/vO7xnsgtBC5DhbYyYmbFTUamwa
 I3U9+rAhufC+EH9YwyUeuFz0E06Vrp3htlj/S6w2hxOZAgfZiPt7EyAhtf7fqkBD
 gEZmUDQ3vRoEPCx1T0kvm69ZXapVQMuzlT9MbtJ9NiEw1SUTOYJCNUzy/fhPZDCS
 zWxexT6zGDq2oxhAkwciYHNtljreuYI02snXarqL9HqnKG4guEt44tyGNXwBK58g
 Hq9uq5bNMm0n0eRe7m6ab7UP0PhK8b+lFlWnoWzPMq4/2m2bwEy1DOI6NSRVPisy
 NUJkwD3dVGfXTkdteie5ALiV0u4qswFQOXO0vSV0Wd1DzmV5g9I=
 =uvLw
 -----END PGP SIGNATURE-----

Added Git metadata cloning article link

Provides additional insight into different kinds of cloning, especially
when you only care about repository metadata.
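
In the output above, gpgsig sits among the header lines, before the blank line that starts the message, and its continuation lines begin with a single space. A rough sketch of splitting such a commit into headers and message follows; parse_commit is a hypothetical helper, and the signature body in the sample is a placeholder, not real data.

```python
def parse_commit(raw: str):
    """Split a commit into (headers, message).

    Header lines that start with a space continue the previous header.
    """
    head, _, message = raw.partition("\n\n")
    headers = []
    for line in head.split("\n"):
        if line.startswith(" ") and headers:
            key, value = headers[-1]
            headers[-1] = (key, value + "\n" + line[1:])
        else:
            key, _, value = line.partition(" ")
            headers.append((key, value))
    return headers, message


# Placeholder signature; the real base64 lines are elided.
sample = (
    "tree 6751ff7d3dedafdeae175cefc968fe41e8aec928\n"
    "parent 538e41375a1799f664fc54ffee70a911d611226e\n"
    "gpgsig -----BEGIN PGP SIGNATURE-----\n"
    " \n"
    " (base64 lines elided)\n"
    " -----END PGP SIGNATURE-----\n"
    "\n"
    "Added Git metadata cloning article link\n"
)

headers, message = parse_commit(sample)
print([key for key, _ in headers])
print(message)
```

Note that the "empty" line inside the signature is a single space, which is why it does not end the header section.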

Author

masak commented Feb 4, 2021

@jeffrade Oh, interesting. I haven't looked at how that is modeled, but I wouldn't be surprised at all if that was in some sense considered "part of the commit comment", but then also filtered out by tools. Heh, that is totally a falsifiable claim, and I could be wrong. 😉

gjohnsonCO commented Mar 9, 2021

You can 'git init' a repo, create and commit a file, and end up with the same git hash every time with the following bash script:
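#!/bin/bash
# Usage: pass the name of a new directory as the first argument.
# The hash also depends on the configured user.name and user.email,
# so it only repeats for the same identity.

export GIT_COMMITTER_DATE="Fri, 3 Jul 2020 17:18:43 +0200"
export GIT_AUTHOR_DATE="Fri, 3 Jul 2020 17:18:43 +0200"

mkdir "$1" || exit 1
cd "$1" || exit 1
git init
# Write the line "hi 1." into the file foo.
echo "hi 1." > foo
git add foo
git commit -m 'initial' --date="Fri, 3 Jul 2020 17:18:43 +0200"
git log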

Author

masak commented Mar 9, 2021

@gjohnsonCO Cute!

Of course, that comes at the price of providing an incorrect commit date. Still, definitely useful to know — usually I've only thought of blobs and trees as being perfectly reproducible because of that ever-moving timestamp on commits.

Cheers!

silvestrst commented Apr 29, 2021

@masak , thank you, 9 years later - still very useful to some random developers on the internet :)

Author

masak commented Apr 29, 2021

I swear, I think this silly exploratory gist might end up being my legacy — the mark I made on the world. 😄

milahu commented Oct 13, 2021

in python:

dulwich/objects.py#L512

    def sha(self):
        """The SHA1 object that is the name of this object."""
        if self._sha is None or self._needs_serialization:
            # this is a local because as_raw_chunks() overwrites self._sha
            new_sha = sha1()
            new_sha.update(self._header())
            for chunk in self.as_raw_chunks():
                new_sha.update(chunk)
            self._sha = new_sha
        return self._sha

dulwich/objects.py#L155

def object_header(num_type: int, length: int) -> bytes:
    """Return an object header for the given numeric type and text length."""
    return object_class(num_type).type_name + b" " + str(length).encode("ascii") + b"\0"

for the chunks, see def _serialize in class Commit(ShaFile), class Tree(ShaFile), etc.

docs: https://www.samba.org/~jelmer/dulwich/docs/tutorial/file-format.html

refs: do_commit, ...

Konubinix commented Oct 13, 2021

Thanks for this gist. It helped me a lot in investigating a strange issue I have with two identical commits not having the same hash.

https://konubinix.eu/braindump/posts/9b9dc018-3a12-4fd4-960b-52737ac9f671/?title=why_the_same_git_commit_does_not_have_the_same_hash

By the way, are you aware of an alternative way of computing the hash that could explain why I could have two identical commits with different hashes?

Author

masak commented Oct 14, 2021

@Konubinix

Thanks for this gist. It helped me a lot in investigating a strange issue I have with two identical commits not having the same hash.

https://konubinix.eu/braindump/posts/9b9dc018-3a12-4fd4-960b-52737ac9f671/?title=why_the_same_git_commit_does_not_have_the_same_hash

Curious!

By the way, are you aware of an alternative way of computing the hash that could explain why I could have two identical commits with different hashes?

I am not aware of such an alternative way, but I can think of two possible reasons:

  • At some point, the canonical Git SHA-1 computation changed. (Like you yourself point out.)
  • Some non-canonical Git implementation was used to compute those hashes.

The way content addressing works, if the wrong SHA-1 hash was computed in the past, it will be much like those commits are not there; the SHA-1 is the unique identifier for finding the commit later — if it's wrong, then the commit simply isn't found. It's similar to storing a hashable object in a HashMap, and then the hash of that object changes. (Something that's not supposed to happen but which could.) Asking the HashMap whether it contains that object would get the result false.
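
The HashMap analogy can be tried out directly. This is a toy sketch using a Python dict in place of a HashMap; Key is a made-up class whose hash depends on a mutable field, which is exactly the thing one is not supposed to do.

```python
class Key:
    """A key whose hash depends on a mutable field (deliberately ill-behaved)."""

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        return isinstance(other, Key) and self.value == other.value


key = Key(1)
store = {key: "commit data"}
print(key in store)  # True

# Mutating the key changes its hash after it was stored.
key.value = 2
print(key in store)  # False
print(len(store))    # 1
```

The entry is still physically in the dict, but the lookup no longer finds it, much like an object filed under the wrong SHA-1.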

Konubinix commented Oct 14, 2021

Thanks for your answer. In the meantime, I found the issue. It is linked to a strange behavior of git (a bug): it does not show the gpg signature of a commit if there is a ref with the name of the hash in the repository, and git-filter-repo creates such a ref.
So the two commits were indeed different, but git cat-file did not show the difference.

I wrote the rest of the analysis and the conclusion in the note I linked above.

So, once again, this gist totally helped me understand what is going on.

Konubinix commented Oct 14, 2021

Well, it looks like I just found out about git replace the hard way ;-). It is what caused the commits to appear to be the same while one replaced the other.

Author

masak commented Oct 15, 2021

git replace

I am at a loss for words. When would this ever be a good idea? This seems to cross the line from "not a great API" to "let's corrupt our own data model".

Maybe there's something I'm missing. But this seems to break the invariant that if you find object o using the SHA-1 checksum S, then computing SHA-1(o) will give you S. That, to me, seemed to be the whole point of a content-addressable system.
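
To make the invariant concrete, here is a toy content-addressable store with a replacement table, loosely modeled on the idea behind git replace. This is only a sketch: ToyStore, object_id, and follow_replacements are names invented for this example and say nothing about how Git implements it internally.

```python
import hashlib


def object_id(kind: bytes, body: bytes) -> str:
    """Content address: sha1 of '<kind> <length>' + NUL + body."""
    header = kind + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


class ToyStore:
    def __init__(self):
        self.objects = {}
        self.replacements = {}  # original id -> replacement id

    def put(self, kind: bytes, body: bytes) -> str:
        oid = object_id(kind, body)
        self.objects[oid] = (kind, body)
        return oid

    def get(self, oid: str, follow_replacements: bool = True):
        if follow_replacements:
            oid = self.replacements.get(oid, oid)
        return self.objects[oid]


store = ToyStore()
old = store.put(b"blob", b"original\n")
new = store.put(b"blob", b"rewritten\n")
store.replacements[old] = new

results = {}
for follow in (True, False):
    kind, body = store.get(old, follow_replacements=follow)
    # Does hashing what we got back give the id we asked for?
    results[follow] = object_id(kind, body) == old
    print(follow, results[follow])
```

With replacements followed, the lookup by old returns content that hashes to new, so the invariant fails; with them ignored (the toy counterpart of --no-replace-objects), it holds again.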

Konubinix commented Oct 15, 2021

Hehe. Anyway, I warmly suggest you change the gist to provide --no-replace-objects in the git cat-file examples of the gist.

Author

masak commented Oct 15, 2021

I'm sorry, I find no such option, for example here or in my local Git install (v2.24.3).

Konubinix commented Oct 16, 2021

Author

masak commented Oct 18, 2021

Updated; maybe it helps some poor soul discover git replace quicker.

Konubinix commented Oct 18, 2021

You rock :-)

milahu commented Nov 27, 2021

in python

also see my verify_github_api.py which is simpler than the dulwich (git.py) code

i wanted to verify a source archive ("git tree") by commit hash
which is surprisingly hard, cos the github commit api is lossy
cos the timezones are missing (author timezone and committer timezone)

xerZV commented Dec 4, 2021

Noice
