@natanlao
Last active December 27, 2023 07:04
Translating GitHub resource IDs to global node IDs

GitHub associates a unique resource ID (or "database ID" or just "ID") with each API-accessible resource. For example, each issue, repository, and user has a resource ID. In my limited experience with it, GitHub's REST API generally does not expose endpoints by which resources can be queried by ID (though it does have some undocumented endpoints). These resource IDs have been superseded by distinct global node IDs (node_id). GitHub's GraphQL API allows retrieval of a node by its global node ID, called a "direct node lookup".

As you can tell, you likely don't have much reason to interact with the older identifiers directly. I encountered this case when using the ZenHub API, which interfaces with repositories only by resource ID. In my case, I wanted to retrieve a list of recently-closed issues from a set of repositories identified only by their resource ID (and without their owner or name).

This would be possible using the undocumented repository information REST endpoint I mentioned above, but I wanted to use the GraphQL API because of the number of repositories I was interested in. GitHub's GraphQL API does not expose a means of querying repositories (or any object, to my understanding) by these resource IDs, so we need to convert them manually.

It turns out that the global node IDs are base64 encodings of a human-readable format. Take, for example, DataBiosphere/azul:

query {
  azul: repository(owner:"DataBiosphere", name:"azul") {
    id
    databaseId
  }
}

{
  "data": {
    "azul": {
      "id": "MDEwOlJlcG9zaXRvcnkxMzkwOTU1Mzc=",
      "databaseId": 139095537
    }
  }
}

If we decode the global node ID, we can see the pattern:

$ echo "MDEwOlJlcG9zaXRvcnkxMzkwOTU1Mzc=" | base64 --decode
010:Repository139095537

which we can infer is roughly equivalent to {type_id}:{type_name}{resource_id}. I'm pretty new to GraphQL, so I may have butchered some of the terminology.
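The decoding step can also be done in Python. This is a minimal sketch that assumes the `{type_id}:{type_name}{resource_id}` pattern inferred above holds and that the type name contains only letters; `parse_legacy_node_id` is a hypothetical helper name, not part of any GitHub library.

```python
import base64
import re

# Assumed layout, inferred from the single example above:
# digits, a colon, a letters-only type name, then the numeric resource ID.
LEGACY_PATTERN = re.compile(r"^(\d+):([A-Za-z]+)(\d+)$")


def parse_legacy_node_id(node_id):
    """Split a legacy global node ID into (type_id, type_name, resource_id)."""
    decoded = base64.b64decode(node_id).decode("ascii").strip()
    match = LEGACY_PATTERN.match(decoded)
    if match is None:
        raise ValueError(f"unexpected node ID layout: {decoded!r}")
    type_id, type_name, resource_id = match.groups()
    return type_id, type_name, int(resource_id)


print(parse_legacy_node_id("MDEwOlJlcG9zaXRvcnkxMzkwOTU1Mzc="))
# ('010', 'Repository', 139095537)
```

IDs that do not follow this layout raise a `ValueError` rather than being guessed at.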

In any case, we can reverse this operation to calculate a global node ID from some resource identified only by resource ID:

$ echo -n "010:Repository139095537" | base64
MDEwOlJlcG9zaXRvcnkxMzkwOTU1Mzc=

query {
  node(id:"MDEwOlJlcG9zaXRvcnkxMzkwOTU1Mzc=") {
    ... on Repository {
      nameWithOwner
    }
  }
}

{
  "data": {
    "node": {
      "nameWithOwner": "DataBiosphere/azul"
    }
  }
}

Interestingly, this will work even with the trailing newline (i.e., calling echo in the above example without the -n flag, which yields an ID ending in MzcK instead of Mzc=).
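For many repositories, the conversion is easy to script. The sketch below assumes every ID belongs to a repository and that the `010` prefix seen in the one example above applies to all of them. `legacy_node_id` and `build_nodes_query` are hypothetical helper names. The query uses the GraphQL API's `nodes(ids: ...)` field to look up several nodes in one request; treat the exact query shape as an untested assumption.

```python
import base64


def legacy_node_id(resource_id, type_name="Repository", type_id="010"):
    """Build a legacy global node ID from a resource (database) ID.

    The default type_id of "010" is taken from the single Repository
    example above; other resource types would need their own prefix.
    """
    raw = f"{type_id}:{type_name}{resource_id}"
    return base64.b64encode(raw.encode("ascii")).decode("ascii")


def build_nodes_query(resource_ids):
    """Return a GraphQL query string that looks up several repositories at once."""
    ids = ", ".join(f'"{legacy_node_id(rid)}"' for rid in resource_ids)
    return (
        "query { nodes(ids: [" + ids + "]) "
        "{ ... on Repository { nameWithOwner } } }"
    )


print(legacy_node_id(139095537))
# MDEwOlJlcG9zaXRvcnkxMzkwOTU1Mzc=
print(build_nodes_query([139095537]))
```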

@Davetbutler

This trick is cool. Any idea how the new global node ids will work? You can see some community questions about this here.

An example (for a random new repo) is:

"id": "R_kgDOHL0RdA"
"databaseId": 482152820

The issue I am currently facing is that there used to be this mapping (the one you point out above) from node_id --> database_id (and back again), but with the new node_ids this does not seem to be possible...

Any thoughts?

@kastner

kastner commented Apr 29, 2022

Hey @Davetbutler

The issue I am currently facing is that there used to be this mapping (the one you point out above) from node_id --> database_id (and back again), but with the new node_ids this does not seem to be possible...

I poked around a bit and it seems like it is still possible most of the time. As far as I can tell, they moved from Base64 encoding a string to a custom (?) variable length integer with custom-ish encoding. Here's some messy Ruby code showing how to convert between the two:

digitmap = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_".chars

## From new ID to database ID (chop off the first few chars that describe what type this is - e.g. `U_kg` for user)

"DM_w".chars.map {|c| digitmap.index(c)}.map{|n| "%06b" % n}.join[12,8].to_i(2)
# => 255

"DNCAA".chars.map {|c| digitmap.index(c)}.map{|n| "%06b" % n}.join[12,16].to_i(2)
# => 2048

"DOAAjHMA".chars.map {|c| digitmap.index(c)}.map{|n| "%06b" % n}.join[12,32].to_i(2)
# => 575280


## From database ID (or any positive integer above 127) to new ID format (you have to figure out the first few chars differently, I'll explain after):
("%064b" % 255).chars[-8..-1].join.scan(/.{1,6}/).map {|s| s.ljust(6, '0')}.map {|n| digitmap[n.to_i(2)]}.join
# => "_w"

("%064b" % 2048).chars[-16..-1].join.scan(/.{1,6}/).map {|s| s.ljust(6, '0')}.map {|n| digitmap[n.to_i(2)]}.join
# => "CAA"

("%064b" % 575280).chars[-32..-1].join.scan(/.{1,6}/).map {|s| s.ljust(6, '0')}.map {|n| digitmap[n.to_i(2)]}.join
# => "AAjHMA"

You'll notice some special values like -8 or [12,32]. This is related to the first part of the string - e.g. DN. That prefix is an indicator of how many chars/bytes/bits are coming up.
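For readers who prefer Python, here is a rough port of the Ruby one-liners above. It keeps the same assumptions: the leading type characters (e.g. `U_kg`) have already been chopped off, the next two characters are the length indicator, and the caller supplies the bit width (8, 16, or 32). `suffix_to_int` and `int_to_suffix` are hypothetical names, and like the Ruby version this is not guaranteed to work for every ID.

```python
DIGITMAP = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"


def suffix_to_int(suffix, width):
    """Decode e.g. "DM_w" (width 8) to a database ID.

    The first two characters (12 bits) are skipped, mirroring the
    `join[12, width]` slice in the Ruby code.
    """
    bits = "".join(format(DIGITMAP.index(c), "06b") for c in suffix)
    return int(bits[12:12 + width], 2)


def int_to_suffix(value, width):
    """Encode a database ID into the characters that follow the two-character prefix."""
    bits = format(value, "064b")[-width:]
    # Split into 6-bit groups, right-padding the last group with zeros.
    chunks = [bits[i:i + 6].ljust(6, "0") for i in range(0, len(bits), 6)]
    return "".join(DIGITMAP[int(chunk, 2)] for chunk in chunks)


print(suffix_to_int("DOAAjHMA", 32))  # 575280
print(int_to_suffix(575280, 32))      # AAjHMA
```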

This web page was super helpful in figuring all this out: https://carlmastrangelo.com/blog/lets-make-a-varint

However, I wouldn't rely on this always working, and I found some values that this didn't totally work for. And I'm all out of energy to hack more on this :)

Hope this helps!

@Davetbutler

Hey @kastner.

Super helpful, and it gives me a massive head start. Thanks. I wish they documented their process a bit more. GH suggests they will provide tools to help convert from new to old, etc., but I am sceptical about relying on their word here!

Thanks again. I wish there was more of a reputation system on GH, so I could kudos you for your help here! Hope I can return the favour with some help sometime in the future as it seems we have somewhat mutual interests.

@kastner

kastner commented May 13, 2022

:D Happy to help, and it was a very fun reverse engineering challenge!

@twavv

twavv commented Aug 12, 2022

FYI I think the IDs are messagepack with prefixes for type.
Repo R_kgDOHXTDzg becomes [0, 494191566].
Commit C_kwDOHXTDztoAKDg5Y2Q2YzA1YTE2YmVmNWUwM2M0OWU5ODVlY2Y0ZWEyZjI1NDQ4MGQ becomes [0, 494191566, '89cd6c05a16bef5e03c49e985ecf4ea2f254480d'] (the string is the SHA of the commit).

I think that's why most IDs start with a k: the array indicator in MessagePack is a single byte with value 144 + len (144 is 0x90, or 0b10010000). Each base64 digit represents six bits, so the first digit encodes 0b100100 = 36:

>>> encoding[0b100100]
'k'
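The same observation can be checked with Python's standard library. This sketch assumes the payload is URL-safe base64 over MessagePack bytes, as suggested above, and that the first array element is 0, as in the two decoded examples. `leading_chars` is a hypothetical helper name.

```python
import base64


def leading_chars(array_len, count=1):
    """Return the first base64 characters of a MessagePack array header.

    0x90 + array_len is the MessagePack "fixarray" header for arrays of
    up to 15 elements. The header is followed here by a zero byte, which
    assumes the first array element is 0.
    """
    header = bytes([0x90 + array_len, 0x00])
    return base64.urlsafe_b64encode(header).decode("ascii")[:count]


for n in range(5):
    print(n, leading_chars(n, count=2))
```

Under these assumptions, array lengths 0 through 3 all begin with `k`. A two-element array followed by zero gives `kg`, and a three-element array followed by zero gives `kw`.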

Anything with a size <= 3 will start with a k, which seems to be what happens. Most (all?) actually start with kw, which is a function of the array byte followed by a literal zero (not sure why, but the first element is always a zero).
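To try the MessagePack theory without installing anything, here is a small hand-rolled decoder. It is a sketch under the assumptions above: the part after the first underscore is unpadded URL-safe base64, and the payload uses only small arrays, unsigned integers, and strings. It covers just that subset of MessagePack; `decode_node_id` is a hypothetical name.

```python
import base64

UINT_SIZES = {0xCC: 1, 0xCD: 2, 0xCE: 4, 0xCF: 8}  # uint8 .. uint64
STR_LEN_SIZES = {0xD9: 1, 0xDA: 2}                  # str8, str16


def _unpack(buf, pos=0):
    """Decode one MessagePack value at buf[pos]; return (value, next_pos)."""
    tag = buf[pos]
    pos += 1
    if tag <= 0x7F:                       # positive fixint
        return tag, pos
    if 0x90 <= tag <= 0x9F:               # fixarray, length in low 4 bits
        items = []
        for _ in range(tag & 0x0F):
            item, pos = _unpack(buf, pos)
            items.append(item)
        return items, pos
    if 0xA0 <= tag <= 0xBF:               # fixstr, length in low 5 bits
        end = pos + (tag & 0x1F)
        return buf[pos:end].decode("utf-8"), end
    if tag in UINT_SIZES:                 # big-endian unsigned integer
        end = pos + UINT_SIZES[tag]
        return int.from_bytes(buf[pos:end], "big"), end
    if tag in STR_LEN_SIZES:              # string with explicit length field
        len_end = pos + STR_LEN_SIZES[tag]
        end = len_end + int.from_bytes(buf[pos:len_end], "big")
        return buf[len_end:end].decode("utf-8"), end
    raise ValueError(f"unsupported MessagePack tag 0x{tag:02x}")


def decode_node_id(node_id):
    """Split "R_kgDO..." into its type prefix and decoded payload."""
    prefix, _, payload = node_id.partition("_")
    padded = payload + "=" * (-len(payload) % 4)  # restore base64 padding
    value, _ = _unpack(base64.urlsafe_b64decode(padded))
    return prefix, value


print(decode_node_id("R_kgDOHXTDzg"))
# ('R', [0, 494191566])
```

The repository ID quoted earlier in this thread, `R_kgDOHL0RdA`, decodes the same way to `[0, 482152820]`, which matches its `databaseId`.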

@kastner

kastner commented Aug 12, 2022

@travigd I think you nailed it! awesome work :)
