@masak
Last active October 2, 2024 09:32
How is git commit sha1 formed

Ok, I geeked out, and this is probably more information than you need. But it completely answers the question. Sorry. ☺

Locally, I'm at this commit:

$ git show
commit d6cd1e2bd19e03a81132a23b2025920577f84e37
Author: jnthn <jnthn@jnthn.net>
Date:   Sun Apr 15 16:35:03 2012 +0200

    When I added FIRST/NEXT/LAST, it was idiomatic but not quite so fast. This makes it faster. Another little bit of masak++'s program.

So that's the sha1 I want to reproduce: d6cd1e2bd19e03a81132a23b2025920577f84e37

When I started my investigations, I thought that only these things went into the hash of a commit:

$ git --no-replace-objects cat-file commit HEAD
tree 9bedf67800b2923982bdf60c89c57ce6fd2d9a1c
parent de1eaf515ebea46dedea7b3ae0e5ebe3e1818971
author jnthn <jnthn@jnthn.net> 1334500503 +0200
committer jnthn <jnthn@jnthn.net> 1334500545 +0200

When I added FIRST/NEXT/LAST, it was idiomatic but not quite so fast. This makes it faster. Another little bit of masak++'s program.

That is:

  • The sha1 of the commit's source tree (which in turn covers all the subtrees and blobs)
  • The parent commit sha1
  • The author info (name, email, timestamp, and timezone)
  • The committer info (right, those are different!), in the same format
  • The commit message
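
To make the layout concrete, here is a minimal sketch that assembles the same text by hand from those five parts. The values are copied from the cat-file output above; the function name commit_body is made up for illustration, and the sketch assumes the message is a single line ending in exactly one newline.

```shell
# Values copied from the cat-file output above.
tree=9bedf67800b2923982bdf60c89c57ce6fd2d9a1c
parent=de1eaf515ebea46dedea7b3ae0e5ebe3e1818971
author='jnthn <jnthn@jnthn.net> 1334500503 +0200'
committer='jnthn <jnthn@jnthn.net> 1334500545 +0200'
message="When I added FIRST/NEXT/LAST, it was idiomatic but not quite so fast. This makes it faster. Another little bit of masak++'s program."

# commit_body (hypothetical helper): print the four header lines,
# a blank separator line, then the message with one trailing newline.
# Assumption: the message contains no internal line breaks.
commit_body() {
    printf 'tree %s\nparent %s\nauthor %s\ncommitter %s\n\n%s\n' \
        "$tree" "$parent" "$author" "$committer" "$message"
}

commit_body
```

Every byte matters here: a different timestamp, timezone, or a stray trailing space would change the final hash.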

But it turns out there is also a NUL-terminated header that gets prepended to this, containing the word "commit" and the length in bytes of all of the above information:

$ printf "commit %s\0" $(git --no-replace-objects cat-file commit HEAD | wc -c)
commit 327

(No, you can't see the NUL byte.)
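
If you do want to see the NUL byte, a character dump helps. This is a small sketch, assuming a POSIX od is available; commit_header is a made-up helper name, and 327 is the length from the example above.

```shell
# commit_header LENGTH (hypothetical helper): print the header that
# precedes a commit body of LENGTH bytes, including the terminating NUL.
commit_header() {
    printf 'commit %s\0' "$1"
}

# od -c prints one character per column and renders the NUL byte as \0.
commit_header 327 | od -c
```

The header for this commit is 11 bytes long: the 6 letters of "commit", a space, the 3 digits of the length, and the NUL.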

Put this header and the rest of the information together:

$ (printf "commit %s\0" $(git --no-replace-objects cat-file commit HEAD | wc -c); git --no-replace-objects cat-file commit HEAD)
commit 327tree 9bedf67800b2923982bdf60c89c57ce6fd2d9a1c
parent de1eaf515ebea46dedea7b3ae0e5ebe3e1818971
author jnthn <jnthn@jnthn.net> 1334500503 +0200
committer jnthn <jnthn@jnthn.net> 1334500545 +0200

When I added FIRST/NEXT/LAST, it was idiomatic but not quite so fast. This makes it faster. Another little bit of masak++'s program.

...and what you get hashes to the right sha1!

$ (printf "commit %s\0" $(git --no-replace-objects cat-file commit HEAD | wc -c); git --no-replace-objects cat-file commit HEAD) | sha1sum
d6cd1e2bd19e03a81132a23b2025920577f84e37  -
@Konubinix

You rock :-)

@milahu

milahu commented Nov 27, 2021

in python

also see my verify_github_api.py which is simpler than the dulwich (git.py) code

I wanted to verify a source archive ("git tree") by commit hash, which is surprisingly hard because the GitHub commit API is lossy: the timezones are missing (author timezone and committer timezone).

@xerZV

xerZV commented Dec 4, 2021

Noice

@lemanschik

lemanschik commented Apr 14, 2023

I liked this gist. I am a senior engineer, so I want to give you some magic in return, because it was so entertaining for me. Try:

git init
git commit --allow-empty -m "Magic Root Commit"

Now the first commit of your repo is empty, and you may wonder why people would want that.

When you now fork that repo, or create repos with the same pattern, they are compatible and you get fewer merge issues. You can also always clear histories and start from scratch while all already existing copies stay valid.

I often start from such a repo with additional branches, so that all branches always start from the first empty commit and I can switch between them linearly. Guess what: most of the time this gives zero merge issues ;)

Oh, and of course you can git clone -s ./magic-repo ./new-repo really fast, with zero objects if needed, and do sparse checkouts via git worktree add, which allows you to combine multiple branches into a single worktree checkout dir holding refs to many branches checked out in parallel ;)

@masak
Author

masak commented Apr 14, 2023

@lemanschik Nice. I like it.

Although I should add, I have successfully merged together unrelated repositories where a magic root commit did not sit at the top. The thing that happened to me was that Git gave a weak warning like "warning: these histories seem unrelated", and then went on and merged the repositories anyway, into disjoint history graphs. YMMV.

@lemanschik

@masak I went with my own meta versioning system. I simply store the additional Git-compatible information as extra metadata. I go for content unification and then SHA-512, and do the same for large assets, since I can version blocks. I do not store the initial files; I store content blobs as fixed-size 20 MB blocks, as this looks like a magic number at present for performance. And since this does not depend on one block per file, it reduces space needs a lot.

I do it a bit like Docker's overlay filesystem implementations, which also do content hashing at block level, but less deterministically and predictably. When you use Docker + BTRFS you come near to my feature set.
