ShreyanJain9/datarepos.md

## datarepos.md

      
ATProto Data Repositories, Demystified

If you’re an ATProto developer, you’ve probably heard the term “repo” come up at least once in discussions of the ATProto architecture. (Feel free to read this if you’re a normie too - I hope to impart at least some of the magic to you, though there’s a high chance most of this ends up flying over your head.) Most of the time you can ignore the details of how these data repositories work. But they are the beating heart of ATProto, so you may want to learn more about them, and knowing their inner workings in some detail may also help you use the protocol more effectively.
If you’ve touched even a bit of ATProto, you probably know that data is organized into typed ‘collections’ of ‘records’. Conceptually, these are not that different from what you might call ‘tables’ of ‘rows’ (or ‘records’) in SQL or a similar database - and that’s intentional. An ATProto repo is, essentially, a database containing your personal data, stored in a public, cryptographically verifiable way.
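To make the database analogy concrete, here is a toy sketch (not how a real repo is stored) that models a repo as nested dictionaries: collection name, then record key, then record. The collection names are real Bluesky ones, but the record key and the helper functions `list_rkeys` and `get_record` are made up for illustration.

```python
# Toy model only: a repo as "tables" (collections) of "rows" (records).
# The record key "3kexamplerkey1" is a made-up placeholder.
repo = {
    "app.bsky.feed.post": {
        "3kexamplerkey1": {
            "$type": "app.bsky.feed.post",
            "text": "hello world",
            "createdAt": "2024-01-01T00:00:00Z",
        },
    },
    "app.bsky.feed.like": {},
}


def list_rkeys(repo: dict, collection: str) -> list:
    """Roughly 'SELECT key FROM collection': every record key in a collection."""
    return sorted(repo.get(collection, {}))


def get_record(repo: dict, collection: str, rkey: str) -> dict:
    """Roughly 'SELECT * FROM collection WHERE key = rkey'."""
    return repo[collection][rkey]


post = get_record(repo, "app.bsky.feed.post", "3kexamplerkey1")
print(post["text"])
```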
%% I can try to explain how this leads to the particular structure Bluesky chose from the top down, or work from the bottom up in the actual implementation details.
Let’s start by talking a little bit about content-addressing. Content-addressing is a principle used by networks like IPFS and Nostr, in which the address used to identify content is based not on its location but on the content itself. That sounds a little abstract, but here’s what it actually means:
In a protocol like the web, the address of a webpage looks something like https://shreyanjain.net/some_document.html/. It works by specifying the server the document is hosted on (in this case, shreyanjain.net) and then the exact path of the document on that server. The contents of that document can change over time while the same URL keeps pointing to it.
In a content-addressing scheme like the one used by IPFS, the address of a piece of data is determined by that data’s hash. With a hash function like SHA-256, the hash of some sequence of bytes is a (for all intents and purposes) unique 256-bit integer corresponding to that sequence of bytes. Here, rather than requesting a piece of content by its location, we request it by its hash, and the nodes that have it respond to us with it. And because the hash function returns the same thing every time for the same data, we also get to verify, when we receive some content, that it produces the same address we used to request it - thereby confirming its authenticity.
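Here is a minimal sketch of that request-then-verify loop, using only Python’s standard library. It assumes a plain hex SHA-256 digest as the address (real IPFS and ATProto addresses are CIDs, which wrap the hash in extra metadata), and `ContentStore` and `fetch_verified` are names invented for this example.

```python
import hashlib


def address_of(content: bytes) -> str:
    """The 'address' of some bytes: the hex SHA-256 digest of those bytes."""
    return hashlib.sha256(content).hexdigest()


class ContentStore:
    """Stand-in for a node on the network that holds content by hash."""

    def __init__(self):
        self._blobs = {}

    def put(self, content: bytes) -> str:
        addr = address_of(content)
        self._blobs[addr] = content
        return addr

    def get(self, addr: str) -> bytes:
        return self._blobs[addr]


def fetch_verified(store: ContentStore, addr: str) -> bytes:
    """Request content by hash, then re-hash it to confirm we got the right bytes."""
    content = store.get(addr)
    if address_of(content) != addr:
        raise ValueError("content does not match the address it was requested by")
    return content


store = ContentStore()
addr = store.put(b"hello, content-addressing")
assert fetch_verified(store, addr) == b"hello, content-addressing"
```

Notice that the store itself never has to be trusted: if it hands back the wrong bytes, the hash check fails on our side.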
%% Talk a bit specifically about the IPLD data model, DAG-CBOR, and what a CID is or is that too in depth?
The blockstore is the core of the ATProto repo structure - it stores all the pieces of data that make up the repository. It’s a key-value store, where each key is a CID (content identifier) and each value is the DAG-CBOR data that hashes to that CID. There are three main types of blocks that go in the blockstore:
- Records; these are the bulk of the content in a repository and the only part that most applications will touch. This is the actual user-created data.
- MST Nodes; when we discuss the merkle search tree this will make more sense, but essentially these contain a mapping of record paths (the collection and record key portion of an at:// uri) to CIDs, which is both speedy to search and creates a unique root hash for unique repo states.
- Commits; these are the only part of the repo structure where any signing is actually involved (everything else relies only on hashes)! When people are introduced to ATProto, they often assume each individual record is signed by the signing key, or else that the whole repository is signed as a blob or something. The truth is much more elegant: commits are basically the merkle tree’s root hash signed with your signing key.
If all of that still sounds like some crazy magic at this point, well, it should, because it kind of is.
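To make it a little less magical, here is a toy blockstore holding all three kinds of blocks. It is a sketch under loud assumptions, not the real format: deterministic JSON stands in for DAG-CBOR, a hex SHA-256 digest stands in for a CID, a single flat node stands in for the whole merkle search tree, and the commit is left unsigned. `Blockstore`, `toy_cid`, and `build_repo` are hypothetical names.

```python
import hashlib
import json


def encode(value) -> bytes:
    # Stand-in for DAG-CBOR: deterministic JSON (sorted keys, fixed separators).
    return json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")


def toy_cid(block: bytes) -> str:
    # Stand-in for a real CID: just the hex SHA-256 of the block's bytes.
    return hashlib.sha256(block).hexdigest()


class Blockstore:
    """Key-value store: toy CID -> encoded block."""

    def __init__(self):
        self.blocks = {}

    def put(self, value) -> str:
        block = encode(value)
        cid = toy_cid(block)
        self.blocks[cid] = block
        return cid

    def get(self, cid: str):
        return json.loads(self.blocks[cid])


def build_repo(store: Blockstore, records: dict) -> str:
    """records maps 'collection/rkey' paths to record dicts; returns the commit's toy CID."""
    # 1. Records: each one becomes its own block.
    index = {path: store.put(record) for path, record in records.items()}
    # 2. One flat node mapping paths to CIDs (a real MST splits this into a tree).
    root = store.put({"entries": index})
    # 3. Commit: points at the root. A real commit carries more fields and a
    #    signature over its contents; signing is omitted in this sketch.
    return store.put({"data": root, "sig": None})


store = Blockstore()
before = build_repo(store, {"app.bsky.feed.post/abc": {"text": "hi"}})
after = build_repo(store, {"app.bsky.feed.post/abc": {"text": "hi, edited"}})
assert before != after  # changing one record changes the root, and so the commit
```

Because the commit points at the root and the root points at every record by hash, signing just the commit is enough to vouch for the whole repository.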