@masak
Last active February 21, 2024 10:53
How is git commit sha1 formed

Ok, I geeked out, and this is probably more information than you need. But it completely answers the question. Sorry. ☺

Locally, I'm at this commit:

$ git show
commit d6cd1e2bd19e03a81132a23b2025920577f84e37
Author: jnthn <jnthn@jnthn.net>
Date:   Sun Apr 15 16:35:03 2012 +0200

    When I added FIRST/NEXT/LAST, it was idiomatic but not quite so fast. This makes it faster. Another little bit of masak++'s program.

So that's the sha1 I want to reproduce. d6cd1e2bd19e03a81132a23b2025920577f84e37

When I started my investigations, I thought it was something like these things that went into a commit:

$ git --no-replace-objects cat-file commit HEAD
tree 9bedf67800b2923982bdf60c89c57ce6fd2d9a1c
parent de1eaf515ebea46dedea7b3ae0e5ebe3e1818971
author jnthn <jnthn@jnthn.net> 1334500503 +0200
committer jnthn <jnthn@jnthn.net> 1334500545 +0200

When I added FIRST/NEXT/LAST, it was idiomatic but not quite so fast. This makes it faster. Another little bit of masak++'s program.

That is

  • The source tree of the commit (which unravels to all the subtrees and blobs)
  • The parent commit sha1
  • The author info
  • The committer info (right, those are different!)
  • The commit message

But it turns out there is also a NUL-terminated header that gets prepended to this, containing the word "commit" and the length in bytes of all of the above information:

$ printf "commit %s\0" $(git --no-replace-objects cat-file commit HEAD | wc -c)
commit 327

(No, you can't see the NUL byte.)

Put this header and the rest of the information together:

$ (printf "commit %s\0" $(git --no-replace-objects cat-file commit HEAD | wc -c); git --no-replace-objects cat-file commit HEAD)
commit 327tree 9bedf67800b2923982bdf60c89c57ce6fd2d9a1c
parent de1eaf515ebea46dedea7b3ae0e5ebe3e1818971
author jnthn <jnthn@jnthn.net> 1334500503 +0200
committer jnthn <jnthn@jnthn.net> 1334500545 +0200

When I added FIRST/NEXT/LAST, it was idiomatic but not quite so fast. This makes it faster. Another little bit of masak++'s program.

...and what you get hashes to the right sha1!

$ (printf "commit %s\0" $(git --no-replace-objects cat-file commit HEAD | wc -c); git --no-replace-objects cat-file commit HEAD) | sha1sum
d6cd1e2bd19e03a81132a23b2025920577f84e37  -
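
For readers who prefer Python to a shell pipeline, the same computation can be sketched as below. This is only a sketch: the helper name `git_object_sha1` is made up for illustration, and the commit body is transcribed by hand from the `cat-file` output above, so the printed hash only matches the real one if that transcription is byte-exact.

```python
import hashlib


def git_object_sha1(obj_type: str, body: bytes) -> str:
    """Hash a Git object the way the shell pipeline above does.

    The header is "<type> <length in bytes>" followed by a NUL byte,
    and it goes *before* the body.
    """
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Commit body transcribed from the cat-file output above (an assumption:
# the real object may differ in whitespace that the page does not show).
body = (
    b"tree 9bedf67800b2923982bdf60c89c57ce6fd2d9a1c\n"
    b"parent de1eaf515ebea46dedea7b3ae0e5ebe3e1818971\n"
    b"author jnthn <jnthn@jnthn.net> 1334500503 +0200\n"
    b"committer jnthn <jnthn@jnthn.net> 1334500545 +0200\n"
    b"\n"
    b"When I added FIRST/NEXT/LAST, it was idiomatic but not quite so fast. "
    b"This makes it faster. Another little bit of masak++'s program.\n"
)

print(len(body))                        # byte count that goes into the header
print(git_object_sha1("commit", body))  # compare with `git rev-parse HEAD`
```

The same helper works for the other object types: pass `"blob"` with a file's bytes, or `"tree"` with a serialized tree.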
@zgauhar

zgauhar commented Jun 7, 2019

Thanks, so we don't mention the parent at all in the very first commit. And that first commit (hash) becomes the parent for the second commit and so on?

@zgauhar

zgauhar commented Jun 7, 2019

Secondly, if I have only one file in the whole project, say abc.txt, then the tree hash will be calculated as
tree 100\0
100644 abc.txt\0\xf7\x36\x93\xa1\x6c\xdf\x59\x45\x32\xf70\xf71\xf72\xf73\xf74\xf75\xf76\xf77\xf78\xf79\x360
with correct length and the file hash?

@zgauhar

zgauhar commented Jun 7, 2019

0f73693a16cdf594532ee4c423a46d32ce3430c4e

How do you get
\xf7\x36\x93\xa1\x6c\xdf\x59\x45\x32\xf70\xf71\xf72\xf73\xf74\xf75\xf76\xf77\xf78\xf79\x360
from
f73693a16cdf594532ee4c423a46d32ce3430c4e

My concern is the three-character values (f71, f72, etc.); as I understand it, a hex byte contains only two digits. Shouldn't they rather be \xee\x4c\x42\x3a and so on? Or am I missing something?
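
A sketch of how a tree entry is laid out, assuming the usual tree format: `<mode> <name>`, a NUL byte, then the 20 raw bytes of the entry's SHA-1 (not its 40-character hex form). Each byte is exactly two hex digits, so `\xf7\x36\x93\xa1...\xee\x4c\x42\x3a...` is the expected shape, and three-digit groups like `\xf70` look like a search-and-replace accident. The function names here are hypothetical.

```python
import hashlib


def tree_body(entries):
    """Serialize (mode, name, hex_sha1) tuples into a tree object body.

    Entries are assumed to be already sorted the way Git expects.
    """
    out = b""
    for mode, name, hex_sha1 in entries:
        raw = bytes.fromhex(hex_sha1)  # 40 hex digits -> 20 raw bytes
        out += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0" + raw
    return out


def git_object_sha1(obj_type, body):
    """Prepend the "<type> <length>\\0" header and hash with SHA-1."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


blob_id = "f73693a16cdf594532ee4c423a46d32ce3430c4e"  # blob hash from the comment above
body = tree_body([("100644", "abc.txt", blob_id)])

print(body)                           # the entry, with the hash as raw bytes
print(len(body))                      # this length goes into the "tree <n>\0" header
print(git_object_sha1("tree", body))  # compare with `git rev-parse 'HEAD^{tree}'`
```

With a single 7-character file name the body is 6 + 1 + 7 + 1 + 20 = 35 bytes, so the header would be `tree 35\0` rather than `tree 100\0`.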

@jbarrick-mesosphere

This is super helpful. Thanks a bunch!

@gpltaylor

gpltaylor commented Nov 13, 2019

I don't think I will ever need this knowledge, but somehow I don't think I could live without it :)
Really good writeup!

@HectorRicardo

HectorRicardo commented May 24, 2020

Maybe I am missing something... but doesn't it also take into account the timestamp of the commit?
https://stackoverflow.com/questions/23791999/why-does-git-commit-amend-change-the-hash-even-if-i-dont-make-any-changes

@masak
Author

masak commented May 25, 2020

@HectorRicardo "author info" and "committer info" both contain timestamps.
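
To make that concrete, here is a small sketch with a made-up commit (the identity, tree and message are placeholders, not taken from any real repository). Changing only the committer timestamp by one second changes the resulting hash, which is why `git commit --amend` gives a new id even when nothing else changes.

```python
import hashlib


def commit_sha1(committer_time: int) -> str:
    """Hash a made-up commit whose only variable is the committer time."""
    lines = [
        b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904",  # the empty tree
        b"author A U Thor <author@example.com> 1334500503 +0200",
        f"committer A U Thor <author@example.com> {committer_time} +0200".encode("ascii"),
        b"",                 # blank line separates headers from the message
        b"Example commit",
        b"",                 # so the body ends with a newline
    ]
    body = b"\n".join(lines)
    header = f"commit {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


print(commit_sha1(1334500545))
print(commit_sha1(1334500546))  # one second later: a different hash
```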

@gebitang

gebitang commented Jan 6, 2021

you are the MAN

@jeffrade

jeffrade commented Feb 3, 2021

If you sign your commits, gpgsig will also be a part of the commit (see the example below):

$ git cat-file commit a401338d245961323815e32c94b9ca831c21e07b
tree 6751ff7d3dedafdeae175cefc968fe41e8aec928
parent 538e41375a1799f664fc54ffee70a911d611226e
author Brooke Kuhlmann <brooke@alchemists.io> 1610567835 -0700
committer Brooke Kuhlmann <brooke@alchemists.io> 1610567835 -0700
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE0UiFiNLe33PmLweh8rxJvE/7mkgFAl//UNEACgkQ8rxJvE/7
 mkivlA//TECiT4prHNA8woylOmDbktRWzTtDzXUson08VqhgLIxx8NWEehXUYP0/
 tF2ec10ED/n+qd1Ts035aJJxJGuNFkdFeTBUA3T+iQkLwg7MCWpnm83cPCPTGiTw
 Jk0G5fb0pV5QFY9qFMzBg5MzteyBD69i8Un02Tnu7yVIQsZ/+eFZETVfkYuCDq+R
 K8IPRUIITzN0CacHTi4K/NuAdhyYpZgyEnhamUXwpu4J3rVEOf90x1Vh1XbwW9yk
 1D7uoUKg0vz7FlYGyfd7y8ZNdDFF12Vq6UfcFyaU6x0jC3NqdgUsGEWLKnFWKsav
 8GPYWeUyJKVDThoiIKvESeaI6d7Fp2cnectX1/vO7xnsgtBC5DhbYyYmbFTUamwa
 I3U9+rAhufC+EH9YwyUeuFz0E06Vrp3htlj/S6w2hxOZAgfZiPt7EyAhtf7fqkBD
 gEZmUDQ3vRoEPCx1T0kvm69ZXapVQMuzlT9MbtJ9NiEw1SUTOYJCNUzy/fhPZDCS
 zWxexT6zGDq2oxhAkwciYHNtljreuYI02snXarqL9HqnKG4guEt44tyGNXwBK58g
 Hq9uq5bNMm0n0eRe7m6ab7UP0PhK8b+lFlWnoWzPMq4/2m2bwEy1DOI6NSRVPisy
 NUJkwD3dVGfXTkdteie5ALiV0u4qswFQOXO0vSV0Wd1DzmV5g9I=
 =uvLw
 -----END PGP SIGNATURE-----

Added Git metadata cloning article link

Provides additional insight into different kinds of cloning, especially
when you only care about repository metadata.
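
In the raw object, `gpgsig` sits with the other headers, before the blank line that starts the message, and each continuation line of the signature begins with a single space. Because it is part of the hashed bytes, the signature affects the commit id. A minimal parsing sketch, with a hypothetical function name and a shortened, fake signature:

```python
def parse_commit(raw: bytes):
    """Split a raw commit body into (headers, message).

    Headers end at the first blank line. A header line that starts with a
    space continues the previous header, which is how multi-line values
    such as gpgsig are stored.
    """
    head, _, message = raw.partition(b"\n\n")
    headers = []
    for line in head.split(b"\n"):
        if line.startswith(b" ") and headers:
            key, value = headers[-1]
            headers[-1] = (key, value + b"\n" + line[1:])
        else:
            key, _, value = line.partition(b" ")
            headers.append((key, value))
    return headers, message


raw = (
    b"tree 6751ff7d3dedafdeae175cefc968fe41e8aec928\n"
    b"parent 538e41375a1799f664fc54ffee70a911d611226e\n"
    b"author Brooke Kuhlmann <brooke@alchemists.io> 1610567835 -0700\n"
    b"committer Brooke Kuhlmann <brooke@alchemists.io> 1610567835 -0700\n"
    b"gpgsig -----BEGIN PGP SIGNATURE-----\n"
    b" \n"
    b" (signature data elided in this sketch)\n"
    b" -----END PGP SIGNATURE-----\n"
    b"\n"
    b"Added Git metadata cloning article link\n"
)

headers, message = parse_commit(raw)
print([key for key, _ in headers])
print(message)
```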

@masak
Author

masak commented Feb 4, 2021

@jeffrade Oh, interesting. I haven't looked at how that is modeled, but I wouldn't be surprised at all if that was in some sense considered "part of the commit comment", but then also filtered out by tools. Heh, that is totally a falsifiable claim, and I could be wrong. 😉

@gjohnsonCO

You can 'git init' a repo, create and commit a file, and end up with the same git hash every time with the following bash script:
#!/bin/bash

# Pin both dates so the commit object is byte-for-byte reproducible.
export GIT_COMMITTER_DATE="Fri, 3 Jul 2020 17:18:43 +0200"
export GIT_AUTHOR_DATE="Fri, 3 Jul 2020 17:18:43 +0200"

mkdir "$1"
cd "$1" || exit 1
git init
echo "hi 1." > foo
git add foo
git commit -m 'initial' --date="Fri, 3 Jul 2020 17:18:43 +0200"
git log
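
Note that the hash also depends on the configured user.name and user.email, since both end up in the author and committer lines. A rough sketch of the whole chain (blob, then tree, then a root commit with no parent line) in plain Python, with every input pinned. The function names, the identity `Example User <user@example.com>` and the epoch value are assumptions for illustration, so the result will not match the script above unless your Git identity and date are the same.

```python
import hashlib


def git_object_sha1(obj_type: str, body: bytes) -> str:
    """Prepend the "<type> <length>\\0" header and hash with SHA-1."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def root_commit_sha1(content: bytes, name: str, ident: str, when: str, message: str) -> str:
    """Hash a one-file root commit: blob -> tree -> commit (no parent line)."""
    blob_id = git_object_sha1("blob", content)
    tree = b"100644 " + name.encode("utf-8") + b"\0" + bytes.fromhex(blob_id)
    tree_id = git_object_sha1("tree", tree)
    commit = (
        f"tree {tree_id}\n"
        f"author {ident} {when}\n"
        f"committer {ident} {when}\n"
        f"\n"
        f"{message}\n"
    ).encode("utf-8")
    return git_object_sha1("commit", commit)


# 1593789523 is meant to be 2020-07-03 17:18:43 +0200 (hand-computed).
first = root_commit_sha1(b"hi 1.\n", "foo", "Example User <user@example.com>",
                         "1593789523 +0200", "initial")
second = root_commit_sha1(b"hi 1.\n", "foo", "Example User <user@example.com>",
                          "1593789523 +0200", "initial")
other = root_commit_sha1(b"hi 1.\n", "foo", "Someone Else <else@example.com>",
                         "1593789523 +0200", "initial")
print(first == second)  # same inputs, same hash
print(first == other)   # a different identity gives a different hash
```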

@masak
Author

masak commented Mar 9, 2021

@gjohnsonCO Cute!

Of course, that comes at the price of providing an incorrect commit date. Still, definitely useful to know — usually I've only thought of blobs and trees as being perfectly reproducible because of that ever-moving timestamp on commits.

Cheers!

@silvestrst

@masak , thank you, 9 years later - still very useful to some random developers on the internet :)

@masak
Author

masak commented Apr 29, 2021

I swear, I think this silly exploratory gist might end up being my legacy — the mark I made on the world. 😄

@milahu

milahu commented Oct 13, 2021

in python:

dulwich/objects.py#L512

    def sha(self):
        """The SHA1 object that is the name of this object."""
        if self._sha is None or self._needs_serialization:
            # this is a local because as_raw_chunks() overwrites self._sha
            new_sha = sha1()
            new_sha.update(self._header())
            for chunk in self.as_raw_chunks():
                new_sha.update(chunk)
            self._sha = new_sha
        return self._sha

dulwich/objects.py#L155

def object_header(num_type: int, length: int) -> bytes:
    """Return an object header for the given numeric type and text length."""
    return object_class(num_type).type_name + b" " + str(length).encode("ascii") + b"\0"

for the chunks, see class Commit(ShaFile), def _serialize, and class Tree(ShaFile), etc.

docs: https://www.samba.org/~jelmer/dulwich/docs/tutorial/file-format.html

refs: do_commit, ...

@Konubinix

Konubinix commented Oct 13, 2021

Thanks for this gist. It helped me a lot in investigating a strange issue I have with two identical commits not having the same hash.

https://konubinix.eu/blog/posts/9b9dc018-3a12-4fd4-960b-52737ac9f671/?title=why_the_same_git_commit_does_not_have_the_same_hash

By the way, are you aware of an alternative way of computing the hash that could explain why I could have two identical commits with different hashes?

@masak
Author

masak commented Oct 14, 2021

@Konubinix

Thanks for this gist. It helped me a lot in investigating a strange issue I have with two identical commits not having the same hash.

https://konubinix.eu/braindump/posts/9b9dc018-3a12-4fd4-960b-52737ac9f671/?title=why_the_same_git_commit_does_not_have_the_same_hash

Curious!

By the way, are you aware of an alternative way of computing the hash that could explain why I could have two identical commits with different hashes?

I am not aware of such an alternative way, but I can think of two possible reasons:

  • At some point, the canonical Git SHA-1 computation changed. (Like you yourself point out.)
  • Some non-canonical Git implementation was used to compute those hashes.

The way content addressing works, if the wrong SHA-1 hash was computed in the past, it will be much like those commits are not there; the SHA-1 is the unique identifier for finding the commit later — if it's wrong, then the commit simply isn't found. It's similar to storing a hashable object in a HashMap, and then the hash of that object changes. (Something that's not supposed to happen but which could.) Asking the HashMap whether it contains that object would get the result false.
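
The HashMap analogy can be sketched with a Python dict. The class `Obj` is a toy made up for this illustration, and its hash deliberately depends on mutable state, which real code should avoid.

```python
class Obj:
    """Toy object whose hash depends on mutable state (deliberately bad)."""

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        return isinstance(other, Obj) and self.value == other.value


o = Obj(1)
store = {o: "commit data"}  # stored under the hash computed from value == 1
o.value = 2                 # the object's hash changes after the fact
found = o in store          # the lookup now uses the new hash
print(found)
print(len(store))           # the entry is still there, just unreachable by key
```

The entry has not been deleted; it is simply filed under an address that no longer matches its content, which is the situation described above for a commit stored under a wrong SHA-1.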

@Konubinix

Thanks for your answer. In the meantime, I found the issue. It is linked to a strange behavior of git (a bug). It does not show the gpg signature of a hash if there is a ref with the name of the hash in the repository. And git-filter-repo creates such a ref.
So both commits were indeed different, but git cat-file did not show the difference.

I wrote the rest of the analysis and the conclusion in the note I linked above.

So, once again, this gist totally helped me understand what is going on.

@Konubinix

Well, it looks like I just found out about git replace the hard way ;-). It is what caused the commits to appear to be the same while one replaced the other.

@masak
Author

masak commented Oct 15, 2021

git replace

I am at a loss for words. When would this ever be a good idea? This seems to cross the line from "not a great API" to "let's corrupt our own data model".

Maybe there's something I'm missing. But this seems to break the invariant that if you find object o using the SHA-1 checksum S, then computing SHA-1(o) will give you S. That, to me, seemed to be the whole point of a content-addressable system.
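
One way to check that invariant directly is to re-hash the stored object file itself. `git replace` adds a ref under `refs/replace/` and leaves the original object untouched, so reading the file from disk sidesteps any replacement. The sketch below uses a hypothetical function name and only handles loose objects; it assumes the standard loose-object layout and does not look inside packfiles.

```python
import hashlib
import zlib
from pathlib import Path


def verify_loose_object(git_dir: str, object_id: str) -> bool:
    """Re-hash a loose object file and compare with the name it is stored under.

    Loose objects live at <git_dir>/objects/<first 2 hex>/<remaining 38 hex>
    and hold the zlib-compressed "<type> <length>\\0" header plus body.
    Packed objects are not handled here.
    """
    path = Path(git_dir) / "objects" / object_id[:2] / object_id[2:]
    data = zlib.decompress(path.read_bytes())
    return hashlib.sha1(data).hexdigest() == object_id
```

For example, `verify_loose_object(".git", "d6cd1e2bd19e03a81132a23b2025920577f84e37")` would check the commit from the gist, provided it is stored as a loose object in the current repository.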

@Konubinix

Konubinix commented Oct 15, 2021

Hehe. Anyway, I warmly suggest you change the gist to provide --no-replace-objects in the git cat-file examples of the gist.

@masak
Author

masak commented Oct 15, 2021

I'm sorry, I find no such option, for example here or in my local Git install (v2.24.3).

@Konubinix

Konubinix commented Oct 16, 2021 via email

@masak
Author

masak commented Oct 18, 2021

Updated; maybe it helps some poor soul discover git replace quicker.

@Konubinix

You rock :-)

@milahu

milahu commented Nov 27, 2021

in python

also see my verify_github_api.py which is simpler than the dulwich (git.py) code

i wanted to verify a source archive ("git tree") by commit hash
which is surprisingly hard, cos the github commit api is lossy
cos the timezones are missing (author timezone and committer timezone)

@xerZV

xerZV commented Dec 4, 2021

Noice

@lemanschik

lemanschik commented Apr 14, 2023

I liked this gist. I am a senior engineer, so I want to give you some magic because it was so entertaining for me. Try:

git init
git commit --allow-empty -m "Magic Root Commit"

Now the first commit of your repo is empty, and you may wonder why people would want that.

When you now fork that repo or create repos with the same pattern, they are compatible and you get fewer merge issues. You can also always clear histories and start from scratch while all existing copies stay valid.

I often start from such a repo with additional branches, so that all branches start from the first empty commit and I can switch between them linearly. Guess what: most of the time this gives zero merge issues ;)

Oh, and of course you can git clone -s ./magic-repo ./new-repo really fast with zero objects if needed, and do sparse checkouts via git worktree add, which allows you to combine multiple branches into a single worktree checkout dir holding refs to many branches checked out in parallel ;)

@masak
Author

masak commented Apr 14, 2023

@lemanschik Nice. I like it.

Although I should add, I have successfully merged together unrelated repositories where a magic root commit did not sit at the top. The thing that happened to me was that Git gave a weak warning like "warning: these histories seem unrelated", and then went on and merged the repositories anyway, into disjoint history graphs. YMMV.

@lemanschik

@masak I went with my own meta versioning system. I simply store the additional Git-compatible information as extra metadata. I go for content unification and then SHA-512, and do the same for large assets, since I can version blocks. I do not store the initial files; I store content blobs as fixed-size 20 MB blocks, as this looks like a magic number for performance at present. And as this does not depend on one block per file, it reduces space needs a lot.

I do it a bit like Docker's overlay file implementations, which also do content hashing at block level but are less deterministic and predictable. When you use Docker + BTRFS you come near to my feature set.
