rebolyte/yjs.md

## yjs.md

      
    Raw
  

              yjs.md
            
          
    https://github.com/yjs/yjs/blob/main/INTERNALS.md
https://www.youtube.com/watch?v=0l5XgnQ6rB4 - internals

5:20 - defining a type gives you a view into the Items in the CRDT structure. The type is responsible for taking the Items (a linked list) and representing that to the user
7:15 - all types extend AbstractType
7:50 - everything is a linked list. He implemented a sequence CRDT, and then added maps and XML elements on top of that.
9:00 - codebase is small because basically everything is an Item.
10:40 - AbstractType always has a list CRDT, and the _start property tells you the first item. Also always has a _map property which maps previously-defined arbitrary string to CRDT, which is the last item in the list CRDT. (This is Replace Operation Manager in the whitepaper.)

14:06 - _map is a utility so if the AbstractType is actually storing a map you can look up the last value from the key


16:44 - to store tree of directories and their files, you could have a Map that is the file name, and then a list of Items. Each of those has content {AbstractContent}, which can be an AbstractType (recursive)
17:45 - an Item is just an item in the linked list

AbstractType#_item is null by default - if it’s top-level type, item will be null because it hasn’t been inserted anywhere. _item is the item that contains this type object (points back, links to its parent)
“The item and type object pair have a 1-1 mapping. The item's content field references the AbstractType object and the AbstractType object's _item field references the item.”


21:11 - AbstractContent - one item points to a bunch of content. If you insert into Item(id, ‘abc’), you’ll get Item(id, ‘a’), Item(id, ‘*’), Item(id, ‘bc’)

ID of an Item is the pair of the client ID and the sequence number (clock, lamport timestamp). This refers to content, not to an Item.

User types “a” = Item(1234, 0)
User then types “b” = Item(1234, 1)


27:30 - when instantiating Item, id refers to the first item in content. If it’s a bunch of characters, it points to the first. To reference other characters, I can refer to anything like ID(id.client, id.clock + 1)

This is same as Automerge, but it calls client ID agentId and clock number sequenceNumber`


29:30 - if I type “abc”, that would be clock 0, 1, 2. Then I delete one character, that does not generate a new clock. To mark a deletion you split the item so it’s only the deleted character, and set the deleted flag to true.

Then when I sync with other clients, you are going to send me all the Items I don’t know about, and it’s going to call Item#integrate


32:45 - how do I know what deletes you need to know about? At the beginning of a session you’re going to send all the delete operations to me. Yjs is a combo of all “delta CRDT” (sequential operation-based CRDT) with the exception that deletes are a “state CRDT.” This can be stored super efficiently because you don’t store deletes as operations on your doc (RLE encoding and variable encoding).

Say you deleted 50% of a document’s content, I could write (ID(123, 0), 123, 85732) - this says for client 123 at clock 0 to client 123 at operation 85732 will be deleted.
34:25 - why is the representation so small? Internally it’s encoded as (ID(123, 0), 1900), which says “starting at ID 123,0 delete 1900 items.” Then, if there is something still remaining, I could delete another 300 items - (ID(123, 0), 1900), (1901, 300). Then iterate through all the items in the internal list and mark them as deleted based on what I receive here.
This is just a bunch of numbers, so you can see diff between 1900 and 1901 is only 1, so I encode the diff betwene the two.
(ID(123, 0), 1900), (1901, 300), (2203, 56789) - I encode the ID, the length of the deletion length, and the the difference to the next deletion item (just 1 in this example), then the length of the deletion, then the difference to the next (just 4 in this ex).


38:40 - insert a character in YText - find the position in the linked list, and then patch in this item inside .integrate() - const right = this.left.right.
40:30 - metadata needed for conflict resolution. Properties origin and rightOrigin define where you originally inserted the item. This defines the unique order of the list.

If I insert X in “abXc”, the origin of X would be “b”. Same as predecessor in Automerge This is how it’s done in RGA.


53:42 - having rightOrigin adds a bit of extra data over the wire, but it means that the priority system gets much simpler and we get better performance when merging operations with a lot of prepending going on.
57:00 - all the characters in a string will have the same this.parent
57:41 - in .info, mark is fast-search-marker (skip list implementation with a height of 1). Cache the most recent 10 positions used.
1:01:20 - Integration: when instantiating Item, origin is going to be equal to left and rightOrigin is going to equal right

parentSub is used if this is a child of a map, and will refer to the key of the map (a string).
Conflict happens when users insert into same position at the same time. Takes O(C) where C is amount of concurrent operations at the position.
I have 3 characters that get inserted that all have the same origin. I see the first, insert it after its origin. Then I find out about the next character, I find the origin, I find the conflicting character, then I decide to put it to the left or to the right. Then I get the third character, and I have to scan through to find the right location (what this while loop does).
If second item comes in and inserts 1000 items it will still be fast, because those will almost definitely end up all in one Item
He applies updates from users with high client IDs first, then small client IDs. Because smaller will be inserted to the left, so you get to break condition pretty soon.


1:31:43 - done with history
1:33:19 - once conflict resolution complete and everything in the linked list, we call addStruct to store in our database. transaction.doc.store is as index of all items that are ever created.
1:33:50 - Every change happens inside a transaction. It starts with beforeState of what we began with, and an afterState of the result. We’ll need this to compute, eg, which items got merged. In Transaction#cleanupTransactions is where we do doc.emit(‘update’, …). If you want to revert a transaction you can undo everything it its DeleteSet.
1:41:00 - if you enable GC, deleted items get cleaned up - you can delete its content but you do need to have the length of what was there. This will still be a new GC(id, length) - an AbstractStruct. You can say things like clock 100 to 200 was all included in this deleted struct.
1:46:49 - a state vector is a binary encoding of all the client IDs and their lamport timestamps.
Snapshot is a state vector with a delete set. With that, you can revert to any point in time. You could take a snapshot before syncing with a client, then another after syncing, and you can compare the differences by populating the differences between the delete sets and the state vectors.
1:51:37 - on yjs.dev site - these versions only exist for me. If you visit this site you’ll see the stored state vectors on your side, stored in IndexDB.

https://github.com/yjs/website/blob/master/src/components/demos/demo-prosemirror.js


1:53:43 - struct store. If a user inserts a character, then tells you about it, you don’t scan through the whole linked list. He does a binary search on this stored struct DB, finds the item, and then can jump straight to it in the linked list.

One way to find an item is to go to the abstract type and then iterate through its content, usually you’d do this if you have an index position you want to populate. Can also use skip list here. Or 2) you want to insert between some items, and you’ll use the lamport timestamp to find it in the store.
Say I connect to you over the network, and I find that there’s 5 operations that you don’t know about from me, so I would go to this struct store, slice that part of the array in the store, and encode and send it to you.


2:16:10 - when initially syncing you send the slice of missing items and all the deleted items
2:18:20 - writeUpdateMessageFromTransaction assembles the update messages. First adds all the added structs and then all the deleted.

Case 1: I connect and have no data, so you send me either the whole linked list in order or you send me the arrays of structs. Then if I went offline and later reconnected, you’d need to resend all the deletes because now I’m ahead of you and you don’t know what I know about in terms of deletes.


2:20:01 - syncStep1, syncStep2 - in step 1 I’m going to send you my state vector (all items [clocks] that I know about) and the delete set. When you receive, you respond with step 2, which is the delete set - all the missing items that I don’t know about. Once I know about all the deletes that ever happened and all what I’m missing, I’m synced with you (you’re not synced with me at this point).

State vector is a reference to all the clocks of the clients, snapshot is SV + delete set at that point in time.


2:43:43 - review.
2:51:52 - map. In this._map the last item in the linked list is the current value of the map. If 2 users concurrently insert an item, you ignore all of them except the last, and mark all but the last as deleted. There are more efficient implementations but advantage here is that you can navigate back through the history of the type.
2:56:30 - review.
2:57:20 - y-protocols. Awareness protocols: you moved cursors around so much he doesn’t want to store in Yjs model. Stores references to content (unique IDs in the model). Uses a difference-based CRDT to propagate when there is a change.