Last active November 12, 2018 05:31
=== Background ===

For one issue at work, we tossed around the idea of generating UUIDs
clientside in order to ensure that we had a unique immutable id without
having to go to the server to get an ID back, primarily for p2p applications.

Obviously, if somebody's maliciously creating ids, they can choose arbitrary
values and create collisions. Additionally, one client could tell if another
client had already chosen a certain value for an id by trying to commit a
certain UUID to the db.

For example:

> Client 1 creates a "todo" with a certain UUID.
> Client 2 tries to create a "todo" with the same UUID.
> Client 2 now has some information about client 1's actions.

I posed the purely academic question of how we might be able to send arbitrary
messages over that channel, and the below is a possible incomplete spec that
attempts to propose a basic solution. To be perfectly clear: this is an
interesting solution to a toy problem.
=== Model ===

Each client can create a record (like a todo) with a certain UUID in this space:

----
0: 00000000-0000-0000-0000-000000000000
1: 00000000-0000-0000-0000-000000000001
...
n: ffffffff-ffff-ffff-ffff-ffffffffffff
----
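The space above is just the integers 0 through 2^128 - 1 written out in UUID form. A minimal sketch of that mapping (the helper names are mine, not part of any real API):

```javascript
// Convert a sequential index (as a BigInt) to its UUID in the space
// above, and back. Hypothetical helpers for illustration only.
function indexToUuid(n) {
  const hex = n.toString(16).padStart(32, "0");
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join("-");
}

function uuidToIndex(uuid) {
  return BigInt("0x" + uuid.replace(/-/g, ""));
}
```
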
Assume that if an ID hasn't been used already, you get a 201 CREATED back
from the server. If an ID has been used, you get a 409 CONFLICT back. All
GETs will 404 NOT FOUND, so in order to tell whether an ID has been used,
you have to try to create a "todo" with that ID. The interesting bit is
that by reading whether or not another client has created a record with
a UUID, you inevitably end up creating one if it doesn't already exist.

We can imagine a table of sequential UUIDs and whether they're taken.

----
uuid:   | ...0-0000001 | ...0-0000002 | ...0-0000003 | ...
taken?: | no           | yes          | yes          |
----
We can model the space of taken vs. untaken UUIDs as shared memory, where
the address is the sequential index of the UUID, and the data is whether
the record exists or not.

----
addr: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | ...
data: | 0 | 1 | 1 | 0 | 1 | 1 | 0 | ...
----

The "create a record" operation can be viewed in this model as a joint
"read and write true" operation. The interesting thing about this operation
is that any time you read a bit, that bit gets overwritten with a 1.
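Under the HTTP semantics above, the joint operation can be sketched as a single primitive. The in-memory `createTodo` here is a stand-in for the real POST endpoint (both names are my own):

```javascript
// Stand-in for the server's todo table and its create endpoint:
// returns 201 the first time an id is used, 409 thereafter.
const taken = new Set();
function createTodo(id) {
  if (taken.has(id)) return 409;
  taken.add(id);
  return 201;
}

// The channel's only primitive: read the bit at addr, then force it to 1.
// Returns the bit's value from *before* the implied write.
function readAndSetOne(addr) {
  return createTodo(addr) === 409 ? 1 : 0;
}
```

A 201 means the bit was 0 (and is now 1); a 409 means it was already 1.
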
=== First Pass: Shared Message Store ===

The first attempt I'm going for is the use of a sled of 1s at the bottom of
the memory space, a data structure I'm calling a "shared message store" that's
just beyond the sled of 1s, and a data heap that starts at the top of the
address space and grows downwards.

The rough overview is that when a client polls for messages, it "slides down"
the sled of 1s to try to find the shared message store, reads the whole shared
message store, performs manipulations to it, and rewrites it just beyond where
the previous one ended (which is now filled with 1s due to the read). A more
precise specification follows.

To make it easier on ourselves to start, we're going to start with the
assumption that each client acts atomically. This is of course a horrible
assumption, but I'll try to address it later.
( Message Store Memory Layout )

[ 0 | 00000001 | 01000100 | 00101110 | 01001001 | 01101010 | ... ]
  start  msg_count  head_ptr  dest_addr  msg_ptr  msg_size
| --------- header ---------- | ----- message record 1 ------ | -- message record 2 -..

( Main memory layout )

[ ones | shared message store | zeros | message_data ]
                                        ^
                                        head_ptr from message store
Glossary of fields:

start:     The start bit, indicating the beginning of the message store
msg_count: The number of messages in the message store
head_ptr:  The pointer to the next free location in the message data region.
           Gets updated when a new message is added to the message store.
dest_addr: The destination address of the message
msg_ptr:   The pointer to the message data for that message record
msg_size:  The length of the message data for that message record
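As a concreteness check, here's one way the layout above might serialize into bits. The field widths are my assumptions; the write-up doesn't pin exact sizes:

```javascript
// Assumed field widths (not specified in the write-up).
const MSG_COUNT_BITS = 8, PTR_BITS = 16, SIZE_BITS = 8, ADDR_BITS = 16;

// Serialize a message store into a flat bit array per the layout above:
// start bit, then the header, then fixed-size message records.
function serializeStore(store) {
  const bits = [];
  const push = (value, width) => {
    for (let i = width - 1; i >= 0; i--) bits.push((value >> i) & 1);
  };
  push(0, 1); // start bit: the single 0 that ends the sled of 1s
  push(store.messages.length, MSG_COUNT_BITS);
  push(store.headPtr, PTR_BITS);
  for (const m of store.messages) {
    push(m.destAddr, ADDR_BITS);
    push(m.msgPtr, PTR_BITS);
    push(m.msgSize, SIZE_BITS);
  }
  return bits;
}
```
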
-- Reading a message --

To read from the message store, read down the field of 1s until you find a
zero, then read the subsequent 2 bytes to get the message count. Since message
records are of a fixed size (2 bytes), we can now read the whole set of
message records. After reading, the field of 1s extends through where the
shared message store was written, so we re-copy the message store just past
the end of the newly invalidated block (removing any messages addressed to
that client).
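The read path can be sketched against a simulated bit array, still assuming atomic clients (the widths and helper names are my own):

```javascript
// Simulated shared memory: reading any address destructively sets it to 1.
const mem = new Array(1024).fill(0);
function readAndSet(addr) {
  const v = mem[addr];
  mem[addr] = 1;
  return v;
}

// Slide down the sled of 1s; returns the address of the first 0 found
// (the start bit, which the read has now overwritten with a 1).
function findStoreStart() {
  let addr = 0;
  while (readAndSet(addr) === 1) addr++;
  return addr;
}

// Destructively read an n-bit big-endian field starting at addr.
function readField(addr, n) {
  let v = 0;
  for (let i = 0; i < n; i++) v = (v << 1) | readAndSet(addr + i);
  return v;
}
```

After `findStoreStart` and the `readField` calls, everything up to the end of the old store is 1s, which is exactly why the store must be re-copied just beyond it.
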
-- Inserting a message --

Follow the steps for reading a message. Additionally, before rewriting the
message store, perform these steps:

1: Write the message data to the head_ptr address.
2: Update head_ptr to the end of the new data blob.
3: Add a new message record to the message store.
4: Update msg_count.
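The write side has one quirk worth making explicit: in this channel you can only ever set bits to 1 (by creating a record), so every structure must be written into a region of fresh zeros. A sketch, using the same simulated memory model:

```javascript
// Simulated shared memory; a real client would create records instead.
const mem = new Array(1024).fill(0);

// "Write" a value by setting only its 1 bits; 0 bits are left as
// untouched zeros. Only valid over a region of fresh zeros.
function writeField(addr, value, width) {
  for (let i = 0; i < width; i++) {
    const bit = (value >> (width - 1 - i)) & 1;
    if (bit) mem[addr + i] = 1;
  }
}
```

For example, `writeField(0, 5, 4)` leaves addresses 0..3 holding 0, 1, 0, 1.
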
=== Second Pass: Concurrency issues + locks ===

I'm going to attempt to tackle this by whittling down from strong guarantees
to weak ones. The concurrency models from strongest to weakest are:

1: Clients can execute an entire routine atomically.
2: Clients can execute a readwrite of k addresses atomically (with some limit k).
3: Clients can execute a readwrite of 1 address atomically.

The above protocol is robust under (1) but under no others, and thus will
have to be amended.
-- Tackling k < O(log(n)) --

First, we split up the shared message store into a static part and a dynamic part:

(head_ptr_record)
| 01    | 00100101 |
  start   head_ptr

We move the residual part of the message store to just beyond the head ptr:

(shared_message_store)
| message 1 | ... | message m-1 | message m | msg_count |

where each message is still of the form:

| dest_addr | msg_ptr | msg_size |

Our whole memory space now looks like:

[ ones | head_ptr_rec | zeros | message_store | message_data ]
We modify the start sequence such that a 00 means someone else has a lock on
the datastore, and a 01 means that you've successfully gained a lock on the
datastore. If a client tries to read the head_ptr_record and gets a 00 in the
start field, that implies that someone else has a lock and so nothing should
be done. The convenient part of this operation is that the lock is maintained
atomically: reading a lock maintains the lock, and reading a head_ptr_record
atomically commits a lock. This requires atomic manipulation of 2 + log2(n)
bits, where n is the size of the space of available uuids.
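Lock acquisition can be sketched as follows, assuming the whole head_ptr_record (2 + log2(n) bits) is readwritten atomically. The simulated memory and the 8-bit pointer width are my own simplifications:

```javascript
const PTR_BITS = 8; // assumed pointer width for the sketch
// Simulated shared memory: reads destructively set bits to 1.
const mem = new Array(256).fill(0);
function readAndSet(addr) {
  const v = mem[addr];
  mem[addr] = 1;
  return v;
}

// Slide down the sled to the head_ptr_record. Reading "01" in the start
// field commits the lock to us; reading "00" means someone else holds it.
function tryLock() {
  let addr = 0;
  while (readAndSet(addr) === 1) addr++; // first 0 = first bit of start field
  if (readAndSet(addr + 1) === 0) return null; // "00": lock held elsewhere
  let head = 0; // "01": we hold the lock; the pointer follows
  for (let i = 0; i < PTR_BITS; i++) head = (head << 1) | readAndSet(addr + 2 + i);
  return { recordAddr: addr, head };
}
```

Note that a failed attempt still consumes the two bits it read, which is part of why the lock holder must rewrite a fresh head_ptr_record past the extended sled when it's done.
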
-- Setting up the memory space --

This protocol does not handle the initial coordination of the clients. If the
memory space were blank at the onset of clients joining, then all clients
would loop consistently, each thinking that some other client held the lock.

To solve this, we'll add in a quick asymmetry to the specification: a client
that writes to the first word automatically gains the read/write lock, and
assumes the default head pointer state of the form [ 01 | top_of_address_space ].
Since a field of zeros is a valid shared message store, the client performs no
additional setup.
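The asymmetry might be sketched like so: whether the first readwrite of address 0 comes back 0 tells a client if it was first on the scene. The return shape and constants are my assumptions:

```javascript
const TOP_OF_ADDRESS_SPACE = 255; // stand-in for the real top of the space
const mem = new Array(256).fill(0);
function readAndSet(addr) {
  const v = mem[addr];
  mem[addr] = 1;
  return v;
}

// The first client to touch address 0 becomes the initial lock holder
// and assumes the default head pointer; everyone else joins unlocked.
function bootstrapOrJoin() {
  if (readAndSet(0) === 0) {
    return { haveLock: true, head: TOP_OF_ADDRESS_SPACE };
  }
  return { haveLock: false };
}
```
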
An extremely rough implementation of this version of the protocol can be
found in my scratchpad repo evinism/graveyard:

https://github.com/evinism/graveyard/blob/master/uuid_protocol.js
-- Tackling k=2 --

Really, the essence of the locking mechanism we need is the start sequence.
We can replace the whole lock operation with just the sled of 1s, followed by
a two-bit sequence. From a practical perspective, this makes it far less
likely that two clients will conflict horrifically if atomicity can only be
guaranteed to k=1, but that's little consolation when the jitter between
requests is not completely negligible.

If we wanted to go down this path, we could section off the memory into 3
sections and treat each one of them as a contiguous shared memory space.

- M1 for i % 3 == 0
- M2 for i % 3 == 1
- M3 for i % 3 == 2
We could arbitrarily assign:

M1: Locking mechanism (just the 01 sequence from the head_ptr data structure)
M2: Shared message store (identical to the non-concurrent construction)
M3: Message data

Since the locking mechanism can no longer store a pointer, we must still put
in a sled of 1s on both M1 and M2. This is the primary modification to get us
to k=2. Super simple.
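The sectioning above is just an address interleave. A sketch of the mapping between each section's logical addresses and the physical UUID index space:

```javascript
// Section s (0 for M1, 1 for M2, 2 for M3) owns every third physical
// address; each section then behaves like its own contiguous memory.
function toPhysical(section, logicalAddr) {
  return logicalAddr * 3 + section;
}
function toLogical(physicalAddr) {
  return { section: physicalAddr % 3, addr: Math.floor(physicalAddr / 3) };
}
```
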
=== Third Pass: Error correction ===

[ TODO ]

IDK yet. Obviously, there will be existing entries in the UUID space. We'll
have to work around them and mitigate the rare errors they incur. A simple
Hamming code should do the trick for now.

An optimal choice of ECC would probably take into account that errors are
over the Z-channel; that is, errors can only occur in one direction (a bit
can be flipped from 0 to 1, but never back). I'm not well-versed enough to
give an answer here, but a better code probably exists.
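For the "simple Hamming code" route, a Hamming(7,4) sketch that corrects any single flipped bit per codeword, which covers a lone pre-existing record turning one of our 0s into a 1. This is the generic textbook construction, not something from the spec above:

```javascript
// Hamming(7,4): 4 data bits -> 7-bit codeword, positions 1..7 laid out
// as p1, p2, d1, p3, d2, d3, d4 (parity bits at powers of two).
function encode([d1, d2, d3, d4]) {
  const p1 = d1 ^ d2 ^ d4; // covers positions 1, 3, 5, 7
  const p2 = d1 ^ d3 ^ d4; // covers positions 2, 3, 6, 7
  const p3 = d2 ^ d3 ^ d4; // covers positions 4, 5, 6, 7
  return [p1, p2, d1, p3, d2, d3, d4];
}

// Decode a 7-bit codeword with at most one flipped bit; the syndrome is
// the 1-based position of the error (0 means no error).
function decode(c) {
  const s1 = c[0] ^ c[2] ^ c[4] ^ c[6];
  const s2 = c[1] ^ c[2] ^ c[5] ^ c[6];
  const s3 = c[3] ^ c[4] ^ c[5] ^ c[6];
  const err = s1 + 2 * s2 + 4 * s3;
  const fixed = c.slice();
  if (err) fixed[err - 1] ^= 1;
  return [fixed[2], fixed[4], fixed[5], fixed[6]];
}
```

A Z-channel-aware code could do better than this, since Hamming spends its redundancy protecting against 1→0 flips that this channel can never produce.
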
Definitely still a work in progress