@masak
Last active October 2, 2024 09:32
How is git commit sha1 formed

Ok, I geeked out, and this is probably more information than you need. But it completely answers the question. Sorry. ☺

Locally, I'm at this commit:

$ git show
commit d6cd1e2bd19e03a81132a23b2025920577f84e37
Author: jnthn <jnthn@jnthn.net>
Date:   Sun Apr 15 16:35:03 2012 +0200

    When I added FIRST/NEXT/LAST, it was idiomatic but not quite so fast. This makes it faster. Another little bit of masak++'s program.

So that's the sha1 I want to reproduce. d6cd1e2bd19e03a81132a23b2025920577f84e37

When I started my investigations, I thought it was something like these things that went into a commit:

$ git --no-replace-objects cat-file commit HEAD
tree 9bedf67800b2923982bdf60c89c57ce6fd2d9a1c
parent de1eaf515ebea46dedea7b3ae0e5ebe3e1818971
author jnthn <jnthn@jnthn.net> 1334500503 +0200
committer jnthn <jnthn@jnthn.net> 1334500545 +0200

When I added FIRST/NEXT/LAST, it was idiomatic but not quite so fast. This makes it faster. Another little bit of masak++'s program.

That is

  • The source tree of the commit (which unravels to all the subtrees and blobs)
  • The parent commit sha1
  • The author info
  • The committer info (right, those are different!)
  • The commit message

But it turns out there is also a NUL-terminated header that gets prepended to this, containing the word "commit" and the length in bytes of all of the above information:

$ printf "commit %s\0" $(git --no-replace-objects cat-file commit HEAD | wc -c)
commit 327

(No, you can't see the NUL byte.)
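To make the invisible NUL byte visible, here is a minimal Python sketch of the same header construction. The function name `object_header` is an assumption chosen for illustration, not Git's own code; the length 327 is the byte count printed above.

```python
def object_header(obj_type: str, length: int) -> bytes:
    """Build the '<type> <length>' prefix followed by a single NUL byte."""
    return f"{obj_type} {length}".encode("ascii") + b"\0"


header = object_header("commit", 327)
print(header)       # the bytes repr shows the trailing NUL as \x00
print(len(header))  # 'commit' + space + '327' + NUL
```

Printing the bytes object rather than writing it to a terminal is what makes the terminator show up.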

Put this header and the rest of the information together:

$ (printf "commit %s\0" $(git --no-replace-objects cat-file commit HEAD | wc -c); git --no-replace-objects cat-file commit HEAD)
commit 327tree 9bedf67800b2923982bdf60c89c57ce6fd2d9a1c
parent de1eaf515ebea46dedea7b3ae0e5ebe3e1818971
author jnthn <jnthn@jnthn.net> 1334500503 +0200
committer jnthn <jnthn@jnthn.net> 1334500545 +0200

When I added FIRST/NEXT/LAST, it was idiomatic but not quite so fast. This makes it faster. Another little bit of masak++'s program.

...and what you get hashes to the right sha1!

$ (printf "commit %s\0" $(git --no-replace-objects cat-file commit HEAD | wc -c); git --no-replace-objects cat-file commit HEAD) | sha1sum
d6cd1e2bd19e03a81132a23b2025920577f84e37  -
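The same computation can be sketched without a shell pipeline. The Python below is a minimal sketch, assuming the commit body is byte-for-byte what `git cat-file` printed above (a single-line message ending in one newline); `git_object_id` is a hypothetical helper name, not part of Git.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 over '<type> <len>' + NUL + body, mirroring the pipeline above."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Commit body transcribed from the `git cat-file` output above (assumed byte-exact).
commit_body = (
    b"tree 9bedf67800b2923982bdf60c89c57ce6fd2d9a1c\n"
    b"parent de1eaf515ebea46dedea7b3ae0e5ebe3e1818971\n"
    b"author jnthn <jnthn@jnthn.net> 1334500503 +0200\n"
    b"committer jnthn <jnthn@jnthn.net> 1334500545 +0200\n"
    b"\n"
    b"When I added FIRST/NEXT/LAST, it was idiomatic but not quite so fast. "
    b"This makes it faster. Another little bit of masak++'s program.\n"
)

print(len(commit_body))                      # byte count that goes into the header
print(git_object_id("commit", commit_body))  # compare with the sha1 from `git show`
print(git_object_id("blob", b""))            # same scheme applied to an empty blob
```

If the transcription matches the stored object exactly, the second line printed should equal the sha1 that `git show` reported; any difference in whitespace or encoding would change it.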
@masak
Author

masak commented Mar 9, 2021

@gjohnsonCO Cute!

Of course, that comes at the price of providing an incorrect commit date. Still, definitely useful to know — usually I've only thought of blobs and trees as being perfectly reproducible because of that ever-moving timestamp on commits.

Cheers!
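The point about the ever-moving timestamp can be illustrated with a small sketch. It assumes a simplified single-parent commit where author and committer are the same person and time; `commit_id` and the placeholder message are hypothetical, while the tree, parent, and identity values are borrowed from the gist's example.

```python
import hashlib


def commit_id(tree: str, parent: str, who: str, when: int, tz: str, message: str) -> str:
    """Serialize a single-parent commit and return its SHA-1 object id."""
    body = (
        f"tree {tree}\n"
        f"parent {parent}\n"
        f"author {who} {when} {tz}\n"
        f"committer {who} {when} {tz}\n"
        f"\n{message}\n"
    ).encode("utf-8")
    header = f"commit {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Values borrowed from the gist's example commit; the message is a placeholder.
args = dict(
    tree="9bedf67800b2923982bdf60c89c57ce6fd2d9a1c",
    parent="de1eaf515ebea46dedea7b3ae0e5ebe3e1818971",
    who="jnthn <jnthn@jnthn.net>",
    tz="+0200",
    message="Example message",
)

pinned_a = commit_id(when=1334500503, **args)
pinned_b = commit_id(when=1334500503, **args)
one_second_later = commit_id(when=1334500504, **args)

print(pinned_a == pinned_b)          # identical bytes in, identical id out
print(pinned_a == one_second_later)  # a one-second difference changes the id
```

Pinning the timestamps makes the commit id reproducible, at the cost of the dates no longer reflecting when the commit was really made.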

@silvestrst

@masak , thank you, 9 years later - still very useful to some random developers on the internet :)

@masak
Author

masak commented Apr 29, 2021

I swear, I think this silly exploratory gist might end up being my legacy — the mark I made on the world. 😄

@milahu

milahu commented Oct 13, 2021

in python:

dulwich/objects.py#L512

    def sha(self):
        """The SHA1 object that is the name of this object."""
        if self._sha is None or self._needs_serialization:
            # this is a local because as_raw_chunks() overwrites self._sha
            new_sha = sha1()
            new_sha.update(self._header())
            for chunk in self.as_raw_chunks():
                new_sha.update(chunk)
            self._sha = new_sha
        return self._sha

dulwich/objects.py#L155

def object_header(num_type: int, length: int) -> bytes:
    """Return an object header for the given numeric type and text length."""
    return object_class(num_type).type_name + b" " + str(length).encode("ascii") + b"\0"

for the chunks, see class Commit(ShaFile), def _serialize, and class Tree(ShaFile), etc.

docs: https://www.samba.org/~jelmer/dulwich/docs/tutorial/file-format.html

refs: do_commit, ...

@Konubinix

Konubinix commented Oct 13, 2021

Thanks for this gist. It helped me a lot in investigating a strange issue I have with two identical commits having different hashes.

https://konubinix.eu/blog/posts/9b9dc018-3a12-4fd4-960b-52737ac9f671/?title=why_the_same_git_commit_does_not_have_the_same_hash

By the way, are you aware of an alternative way of computing the hash that could explain why I could have two identical commits with different hashes?

@masak
Author

masak commented Oct 14, 2021

@Konubinix

Thanks for this gist. It helped me a lot in investigating a strange issue I have with two identical commits having different hashes.

https://konubinix.eu/braindump/posts/9b9dc018-3a12-4fd4-960b-52737ac9f671/?title=why_the_same_git_commit_does_not_have_the_same_hash

Curious!

By the way, are you aware of an alternative way of computing the hash that could explain why I could have two identical commits with different hashes?

I am not aware of such an alternative way, but I can think of two possible reasons:

  • At some point, the canonical Git SHA-1 computation changed. (Like you yourself point out.)
  • Some non-canonical Git implementation was used to compute those hashes.

The way content addressing works, if the wrong SHA-1 hash was computed in the past, it will be much like those commits are not there; the SHA-1 is the unique identifier for finding the commit later — if it's wrong, then the commit simply isn't found. It's similar to storing a hashable object in a HashMap, and then the hash of that object changes. (Something that's not supposed to happen but which could.) Asking the HashMap whether it contains that object would get the result false.
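The HashMap analogy can be reproduced in a few lines of Python. This is only a sketch of the analogy, not of Git itself; `MutableKey` is a hypothetical, deliberately badly behaved class whose hash depends on mutable state.

```python
class MutableKey:
    """A deliberately badly behaved key: its hash depends on mutable state."""

    def __init__(self, value: int) -> None:
        self.value = value

    def __hash__(self) -> int:
        return hash(self.value)

    def __eq__(self, other: object) -> bool:
        return isinstance(other, MutableKey) and self.value == other.value


key = MutableKey(1)
table = {key: "commit data"}

found_before = key in table  # looked up under the hash it was stored with
key.value = 2                # the hash changes after insertion
found_after = key in table   # the lookup now probes a different slot

print(found_before, found_after)
```

The entry is still physically in the dictionary, but it can no longer be reached through its key, which is the situation described above for an object stored under the wrong SHA-1.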

@Konubinix

Thanks for your answer. In the meantime, I found out the issue. It is linked to a strange behavior of git (a bug). It does not show the gpg signature of a hash if there is a ref with the name of the hash in the repository. And git-filter-repo creates such a ref.
Then, both commits were indeed different, but git cat-file did not show the difference.

I wrote the rest of the analysis and the conclusion in the note I linked above.

So, once again, this gist totally helped me understand what is going on.

@Konubinix

Well, it looks like I just found out about git replace the hard way ;-). It is what caused the commits to appear to be the same while one replaced the other.

@masak
Author

masak commented Oct 15, 2021

git replace

I am at a loss for words. When would this ever be a good idea? This seems to cross the line from "not a great API" to "let's corrupt our own data model".

Maybe there's something I'm missing. But this seems to break the invariant that if you find object o using the SHA-1 checksum S, then computing SHA-1(o) will give you S. That, to me, seemed to be the whole point of a content-addressable system.
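The invariant in question can be stated as a toy model. The sketch below is an assumption-laden illustration, not Git's implementation: `ToyStore` and its `replacements` table are hypothetical and only mimic the idea of looking up one id and being handed a different object.

```python
import hashlib


def object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 over '<type> <len>' + NUL + body, as described in the gist."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


class ToyStore:
    """Toy content-addressable store with an optional replacement table."""

    def __init__(self) -> None:
        self.objects = {}       # object id -> body
        self.replacements = {}  # original id -> replacement id

    def put(self, body: bytes) -> str:
        oid = object_id("blob", body)
        self.objects[oid] = body
        return oid

    def get(self, oid: str, follow_replacements: bool = True) -> bytes:
        if follow_replacements:
            oid = self.replacements.get(oid, oid)
        return self.objects[oid]


store = ToyStore()
original = store.put(b"original\n")
substitute = store.put(b"substitute\n")
store.replacements[original] = substitute

replaced = store.get(original)
raw = store.get(original, follow_replacements=False)

print(object_id("blob", replaced) == original)  # invariant broken by the replacement
print(object_id("blob", raw) == original)       # invariant holds when replacements are ignored
```

Ignoring the replacement table on lookup restores the property that hashing the returned object gives back the id that was asked for.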

@Konubinix

Konubinix commented Oct 15, 2021

Hehe. Anyway, I warmly suggest you change the gist to provide --no-replace-objects in the git cat-file examples of the gist.

@masak
Author

masak commented Oct 15, 2021

I'm sorry, I find no such option, for example here or in my local Git install (v2.24.3).

@Konubinix

Konubinix commented Oct 16, 2021 via email

@masak
Author

masak commented Oct 18, 2021

Updated; maybe it helps some poor soul discover git replace quicker.

@Konubinix

You rock :-)

@milahu

milahu commented Nov 27, 2021

in python

also see my verify_github_api.py which is simpler than the dulwich (git.py) code

i wanted to verify a source archive ("git tree") by commit hash
which is surprisingly hard, cos the github commit api is lossy
cos the timezones are missing (author timezone and committer timezone)

@xerZV

xerZV commented Dec 4, 2021

Noice

@lemanschik

lemanschik commented Apr 14, 2023

I liked this gist. I am a senior engineer, so I want to give you some magic, because it was so entertaining for me. Try:

git init
git commit --allow-empty -m "Magic Root Commit"

Now the first commit of your repo is empty, and you may wonder why people would want that.

When you now fork that repo, or create repos with the same pattern, they are compatible and you get fewer merge issues. You can also always clear the history and start from scratch while all existing copies stay valid.

I often start from such a repo with additional branches, so that all branches start from the same first empty commit and I can switch between them linearly. Guess what: most of the time this gives zero merge issues ;)

Oh, and of course you can run git clone -s ./magic-repo ./new-repo, which is really fast and needs zero objects, and do sparse checkouts via git worktree add, which allows you to combine multiple branches into a single worktree checkout dir holding refs to many branches checked out in parallel ;)
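As a side note on the empty root commit: a commit made with --allow-empty in a fresh repository points at Git's empty tree, and that tree's id follows from the same header-plus-body hashing described in the gist. The helper name below is an assumption for illustration. The commit id itself still depends on the author, committer, and timestamps, so two independently created root commits will generally have different ids.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 over '<type> <len>' + NUL + body."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# A tree with no entries serializes to zero bytes, so only the header is hashed.
empty_tree = git_object_id("tree", b"")
print(empty_tree)
```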

@masak
Author

masak commented Apr 14, 2023

@lemanschik Nice. I like it.

Although I should add, I have successfully merged together unrelated repositories where a magic root commit did not sit at the top. The thing that happened to me was that Git gave a weak warning like "warning: these histories seem unrelated", and then went on and merged the repositories anyway, into disjoint history graphs. YMMV.

@lemanschik

@masak I went with my own meta versioning system. I simply store the additional Git-compatible information as extra metadata, go for content unification, and then hash with SHA-512. I do the same for large assets, since I can version blocks. I do not store the initial files; I store content blobs in fixed-size 20 MB blocks, as this looks like a magic number for performance at present. And as this does not depend on one block per file, it reduces space needs a lot.

I do it a bit like Docker's overlay file implementations, which also do content hashing at block level, but less deterministically and predictably. When you use Docker + BTRFS you come near to my feature set.
