rvagg/multicodec_pull_203_discuss.md

## multicodec_pull_203_discuss.md

      
This post builds on the discussion in multiformats/multicodec#203. It's posted here because it might be a bit too long and in-the-weeds, and I doubt many will actually read it all anyway! But I'd like it as a record of an ongoing broader discussion about these topics.

Working through the process of mapping blockchain block formats to IPLD has made me see this question slightly differently. First it was getting the full bitcoin chain format working, including the awkward segwit hacks, and now it's working with @i-norden through getting the full ethereum format mapped to IPLD (e.g.).
The primary goal we're trying to achieve with IPLD codec codes here (via CIDs usually) is describing what glasses we put on to see the data. We want to get something into the data model in memory from the raw binary we've been handed.
An obvious example of having different glasses would be these CIDs:

bafyreiblwimnjbqcdoeafiobk6q27jcw64ew7n2fmmhdpldd63edmjecde
bafkreiblwimnjbqcdoeafiobk6q27jcw64ew7n2fmmhdpldd63edmjecde

They're both valid, but the first says it's dag-cbor and the second says it's raw. IPLD just switches out its glasses when looking at the bytes. There are even some hacks going on in Filecoin-world to (ab)use raw in this way to get around some DAG-completeness problems.
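To make the difference concrete, here is a minimal Python sketch that unpacks the two CIDs above. It assumes a base32 multibase string (the leading `b`) and that the version, codec, hash-function and length fields each fit in a single byte, which holds for these two examples but not for CIDs in general. `peek_cid` and `CODEC_NAMES` are names made up for this illustration.

```python
import base64

# Codec codes from the multicodec table; only the two needed here.
CODEC_NAMES = {0x55: "raw", 0x71: "dag-cbor"}


def peek_cid(cid: str):
    """Return (version, codec code, multihash code, digest) for a base32 CIDv1.

    Assumption: every prefix field is a single-byte varint (value < 0x80).
    """
    if not cid.startswith("b"):
        raise ValueError("expected the base32 multibase prefix 'b'")
    body = cid[1:].upper()
    body += "=" * (-len(body) % 8)  # b32decode wants padded input
    raw = base64.b32decode(body)
    if any(byte >= 0x80 for byte in raw[:4]):
        raise ValueError("multi-byte varints are not handled in this sketch")
    version, codec, hash_code, digest_len = raw[:4]
    digest = raw[4:]
    if len(digest) != digest_len:
        raise ValueError("digest length does not match the multihash header")
    return version, codec, hash_code, digest


a = peek_cid("bafyreiblwimnjbqcdoeafiobk6q27jcw64ew7n2fmmhdpldd63edmjecde")
b = peek_cid("bafkreiblwimnjbqcdoeafiobk6q27jcw64ew7n2fmmhdpldd63edmjecde")

print(CODEC_NAMES[a[1]], CODEC_NAMES[b[1]])  # dag-cbor raw
print(a[3] == b[3])                          # True: same digest, different glasses
```

The only byte that differs between the two is the codec code, so the CID is telling us nothing new about the bytes themselves, only how to look at them.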
Another example might be the codecs bitcoin-tx and bitcoin-witness-commitment. They both deal with 64-byte blocks and decode them as tuples of two 32-byte values. However, when we put on the bitcoin-tx glasses for a 64-byte block we see [CID, CID], but when we put on the bitcoin-witness-commitment glasses we see [CID, Bytes]. The 32 bytes we see as a CID aren't even a proper CID; we make a CID emerge from those bytes by bringing knowledge of what the codec should be and what the hash function was.
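A rough Python sketch of those two pairs of glasses over the same 64 bytes is below. It is not the real codec code: the function names are hypothetical, and the choice to label every emerging link as bitcoin-tx with a dbl-sha2-256 multihash is my assumption for illustration. The codes `0xb1` (bitcoin-tx) and `0x56` (dbl-sha2-256) are taken from the multicodec table.

```python
import base64

BITCOIN_TX = 0xB1    # multicodec code for bitcoin-tx
DBL_SHA2_256 = 0x56  # multihash code for double SHA2-256


def varint(n: int) -> bytes:
    """Unsigned LEB128 varint, as used for the CID prefix fields."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def make_cid(codec: int, digest: bytes) -> str:
    """Wrap a bare digest in a CIDv1 and render it as base32."""
    raw = (varint(1) + varint(codec) + varint(DBL_SHA2_256)
           + varint(len(digest)) + digest)
    return "b" + base64.b32encode(raw).decode("ascii").lower().rstrip("=")


def split_halves(block: bytes):
    if len(block) != 64:
        raise ValueError("expected a 64-byte block")
    return block[:32], block[32:]


def as_bitcoin_tx(block: bytes):
    """bitcoin-tx glasses: both halves are seen as links."""
    left, right = split_halves(block)
    return [make_cid(BITCOIN_TX, left), make_cid(BITCOIN_TX, right)]


def as_witness_commitment(block: bytes):
    """bitcoin-witness-commitment glasses: a link and then plain bytes."""
    left, right = split_halves(block)
    return [make_cid(BITCOIN_TX, left), right]


block = bytes(range(64))  # made-up bytes, not real chain data
tx_view = as_bitcoin_tx(block)
wc_view = as_witness_commitment(block)
```

Neither view finds a CID in the block. Each one wraps 32 bare bytes in a prefix that the codec already knows, which is the "emerging" described above.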
To further push this "glasses" analogy: in the latest version of the Bitcoin codec I wrote, when putting on the bitcoin-block glasses to look at the header bytes, I made it see two CIDs that certainly aren't in the raw bytes in the way that you'd decode them outside of IPLD. The schema for the header has this:
```ipldsch
type BitcoinHeader struct {
  previousblockhash optional Bytes
  merkleroot Bytes
  parent optional &BitcoinHeader
  tx &BitcoinTransaction
  # ... other stuff here
}
```

Those two links emerge out of the two previous Bytes fields, which are left intact (mainly because they're useful as bytes for various reasons, and because they're byte-reversed from what we normally need from a hash digest!). The glasses I made just happen to see those things in the raw bytes even though a bitcoin purist would argue that they're not there.
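As a sketch of how those two links could be derived, the Python below takes an already-decoded header and adds `parent` and `tx` next to the untouched Bytes fields. `Link` is a stand-in for a real CID, `add_links` is a hypothetical helper rather than anything from the actual codec, and the input values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Stand-in for a CID: a codec name plus the hash digest it wraps."""
    codec: str
    digest: bytes


def digest_from_field(field: bytes) -> bytes:
    """The Bytes fields are byte-reversed relative to the digest we need."""
    return field[::-1]


def add_links(header: dict) -> dict:
    """Return a copy of the header with `parent` and `tx` links derived."""
    out = dict(header)  # the original Bytes fields are left intact
    prev = header.get("previousblockhash")
    if prev is not None:  # the field is optional in the schema
        out["parent"] = Link("bitcoin-block", digest_from_field(prev))
    out["tx"] = Link("bitcoin-tx", digest_from_field(header["merkleroot"]))
    return out


# Made-up values, not real chain data.
header = {
    "previousblockhash": bytes(range(32)),
    "merkleroot": bytes(range(32, 64)),
}
seen = add_links(header)
```

The derived links carry no information that isn't already in the two Bytes fields; they only restate it in a form the data model can traverse.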
To take another angle, back to our dag-cbor and raw CIDs above, there's another form that's just as valid:

bafireiblwimnjbqcdoeafiobk6q27jcw64ew7n2fmmhdpldd63edmjecde

This time it's plain cbor. It's not properly defined what a decoder should do to get this into the data model in some of the edge cases, such as what to do with tags. ipld-prime will bork at them, but another decoder could just skip over tags entirely or come up with another novel way of presenting them in the data model (an older JS one would instantiate its own custom objects in this case!).
So what we're doing in the case of dag-cbor is insisting that you see the bytes through those glasses and make the CIDs emerge out of the sections of bytes that are preceded by the appropriate tag (tag 42).
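The Python sketch below shows the two behaviours side by side with a deliberately tiny CBOR decoder. It handles only unsigned integers, byte strings, text strings, arrays, maps and tags, and it is not a conforming dag-cbor implementation. The `glasses` argument and the `Link` class are made up for this example, and "skip the tag" is just one of the possible plain-cbor behaviours mentioned above. In dag-cbor, a link is tag 42 wrapping a byte string whose first byte is `0x00`, followed by the binary CID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    cid_bytes: bytes


def read_head(buf: bytes, pos: int):
    """Read a CBOR initial byte and its argument; return (major, arg, pos)."""
    major, info = buf[pos] >> 5, buf[pos] & 0x1F
    pos += 1
    if info < 24:
        return major, info, pos
    if info > 27:
        raise ValueError("indefinite lengths are not handled in this sketch")
    size = 1 << (info - 24)  # 1, 2, 4 or 8 bytes of argument
    return major, int.from_bytes(buf[pos:pos + size], "big"), pos + size


def decode(buf: bytes, glasses: str, pos: int = 0):
    """Decode one item; return (value, position after the item)."""
    major, arg, pos = read_head(buf, pos)
    if major == 0:  # unsigned integer
        return arg, pos
    if major == 2:  # byte string
        return bytes(buf[pos:pos + arg]), pos + arg
    if major == 3:  # text string
        return buf[pos:pos + arg].decode("utf-8"), pos + arg
    if major == 4:  # array
        items = []
        for _ in range(arg):
            item, pos = decode(buf, glasses, pos)
            items.append(item)
        return items, pos
    if major == 5:  # map
        out = {}
        for _ in range(arg):
            key, pos = decode(buf, glasses, pos)
            out[key], pos = decode(buf, glasses, pos)
        return out, pos
    if major == 6:  # tag: this is where the glasses differ
        inner, pos = decode(buf, glasses, pos)
        if glasses != "dag-cbor":
            return inner, pos  # one possible plain-cbor choice: skip the tag
        if arg != 42 or not isinstance(inner, bytes) or inner[:1] != b"\x00":
            raise ValueError("dag-cbor only allows tag 42 around a CID")
        return Link(inner[1:]), pos
    raise ValueError(f"major type {major} is not handled in this sketch")


# {"link": 42(h'00' + CID)} with an all-zero, made-up digest.
cid_bytes = bytes([0x01, 0x71, 0x12, 0x20]) + bytes(32)
block = (bytes([0xA1, 0x64]) + b"link"
         + bytes([0xD8, 0x2A, 0x58, 0x25, 0x00]) + cid_bytes)

dag_view, _ = decode(block, "dag-cbor")  # {"link": Link(...)}
plain_view, _ = decode(block, "cbor")    # {"link": b"\x00..."}
```

Same bytes in, but only the dag-cbor glasses produce a link; the tag-skipping decoder sees a 37-byte byte string and nothing more.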
Back to the Software Heritage identifiers question - could it be the case that our glasses when viewing this data insist on seeing things in the data that the plain git* codec(s) won't? Maybe it's as simple as seeing a string field with the SWH URI for the resource, or a CID that points to a unique SWH object that you wouldn't see if you thought it was just plain git*.
I generally still think we should try to avoid using CIDs to signal where to get data from; that's not part of the CID spec in its current form. But I think we have seen a class of problems around CIDs that suggests they're not able to do all that people need them to do, and without additional extensions to the spec we either have to reject certain use-cases entirely (which really isn't great for IPLD) or overload other pieces of functionality to make it work.
But in this case, perhaps it's as simple as SWH data necessarily looking different to Git data even though the byte format may be the same?
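Purely as a thought experiment, that idea might look like the Python below: two views over the same blob content, where the SWH one also surfaces an identifier string. Both view shapes are invented here and don't correspond to any existing codec. The parts I'm relying on are that Git hashes a blob as SHA-1 over `blob <length>\0` plus the content, and that a Software Heritage content identifier has the form `swh:1:cnt:` followed by that same hex hash.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Hex SHA-1 that Git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def as_git(content: bytes) -> dict:
    """Hypothetical plain git* view: just the data."""
    return {"data": content}


def as_swh(content: bytes) -> dict:
    """Hypothetical SWH view: same data plus an identifier field."""
    return {"data": content, "swhid": "swh:1:cnt:" + git_blob_id(content)}


content = b"hello\n"
git_view = as_git(content)
swh_view = as_swh(content)
```

The bytes and the hash are identical in both cases; the only thing that changes is what the decoder chooses to present in the data model.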