@tbg
Created June 5, 2015 18:02

The handling of the time signal in Cockroach is motivated by the fact that in
Spanner, the target consistency is linearizability: unrelated transactions must
commit with (database) timestamps whose ordering reflects the order in which
they committed in absolute (i.e. your wrist watch) time. For that to happen,
Spanner basically just "waits out" the clock skew when committing a transaction
(prior to returning to the client). Since they do that, they go to great
lengths to synchronize their clocks and to keep a good grip on the actual
maximal possible offset.
Cockroach only shoots for serializability (though we do offer linearizability,
if you're prepared to wait and to get the offset down), so while transactions
run with high isolation, you might run T1 and then T2 against different parts
of the cluster and get back timestamps that suggest that T2 committed before T1
when really it was the other way around. I have yet to see a use case where
this matters (if you're running both transactions in a causally related way,
there are simple ways to prevent this "anomaly"; see the sketch below), and it
means that the time signal is vastly less important than it is in Spanner.
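
To make that concrete, here's a minimal Go sketch of such a causality token.
All names in it (Timestamp, runTxn) are made up for illustration and are not
Cockroach's client API; the idea is simply that the client forwards the highest
commit timestamp it has seen into the next causally dependent transaction.

```go
package main

// Timestamp is a hypothetical hybrid timestamp: wall time plus a
// logical tie-breaker for events within the same wall time.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

func (t Timestamp) Less(o Timestamp) bool {
	return t.WallTime < o.WallTime ||
		(t.WallTime == o.WallTime && t.Logical < o.Logical)
}

// runTxn stands in for executing a transaction at a timestamp of at
// least minTS and returning its commit timestamp.
func runTxn(minTS Timestamp) Timestamp {
	// ... run the transaction, bumping its timestamp to >= minTS ...
	return minTS
}

func main() {
	var token Timestamp // highest commit timestamp observed so far

	t1 := runTxn(token) // T1 commits somewhere in the cluster
	if token.Less(t1) {
		token = t1
	}

	// T2 starts no lower than T1's commit timestamp, so T2's database
	// timestamp cannot order before T1's.
	token = runTxn(token)
}
```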
Now, to address your concerns:

I don't see the time signal as having a large influence on network partitions:
if you partition, only the majority side remains functional anyway. If you have
a specific concern, I'm happy to discuss it.
The time signal matters mostly if you're trying to access a key at a certain
database time and there's data with a timestamp in the near future of where you
want to read. Then you can't be sure whether that write happened in your
absolute past or not, but you need to know in order to be serializable. In
Cockroach, your read will retry (increasing its timestamp so that the write is
in its past), with some optimizations in place that are explained in our design
doc (to keep you from restarting over and over on busy keys). That period of
uncertainty is supplied by configuration (the epsilon, MaxOffset).
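
For illustration, here's a small Go sketch of that uncertainty check, assuming
a constant MaxOffset and plain wall-clock timestamps (Cockroach's actual
timestamps and restart bookkeeping are more involved; see the design doc):

```go
package uncertainty

import "time"

// MaxOffset is the configured epsilon: the maximum clock offset the
// cluster promises to stay within.
const MaxOffset = 200 * time.Millisecond

// shouldRestart reports whether a value written at writeTS falls into
// the reader's uncertainty window (readTS, readTS+MaxOffset]. Such a
// write may have happened in the reader's absolute past, so the read
// must restart above it to stay serializable.
func shouldRestart(readTS, writeTS time.Time) bool {
	return writeTS.After(readTS) && !writeTS.After(readTS.Add(MaxOffset))
}

// restartTimestamp is the timestamp to retry the read at: just above
// the uncertain write, which is then unambiguously in the read's past.
func restartTimestamp(writeTS time.Time) time.Time {
	return writeTS.Add(1 * time.Nanosecond)
}
```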
This is going to be a little in-depth, but the basic message is: Cockroach
trusts the MaxOffset, and if your clocks don't live up to that promise, you
might get some stale reads. By the way, Spanner breaks in the same way if the
offset bounds reported by their TrueTime API fail them. But Spanner has to wait
out the MaxOffset on every commit and we don't, so we get away with setting it
high enough for off-the-shelf clock synchronization and save you the atomic
clocks, at similar guarantees. That's a very good deal.
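
For contrast, here's a sketch of what Spanner-style commit wait costs on every
commit, again assuming a constant MaxOffset (TrueTime actually exposes an
uncertainty interval rather than a constant; commitAndWait and ack are made-up
names):

```go
package commitwait

import "time"

const MaxOffset = 200 * time.Millisecond

// commitAndWait sketches Spanner's commit wait: after choosing the
// commit timestamp, block until every clock in the system provably
// reads past it, and only then acknowledge the client. That wait buys
// linearizability, and it's the per-commit cost Cockroach avoids.
func commitAndWait(commitTS time.Time, ack func()) {
	if wait := time.Until(commitTS.Add(MaxOffset)); wait > 0 {
		time.Sleep(wait)
	}
	ack() // only now does the client learn the transaction committed
}
```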
MaxOffset is high by default (200ms), or you can set it manually. Either way,
Cockroach takes that value as authoritative, meaning that at the end of the day
the user needs to make sure it holds (with 200ms, that's hopefully not a real
burden). The cluster offset is continuously measured by each node, and if a
node finds itself outside of the safe interval, it will stop participating in
the action. That leaves a short interval of time in which a node may be
exceeding MaxOffset but not yet be aware of it.
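
Roughly, that self-check works like the following Go sketch; measure, stop, and
the check interval are illustrative stand-ins, not the actual implementation:

```go
package clockcheck

import (
	"log"
	"time"
)

const MaxOffset = 200 * time.Millisecond

// monitorOffset periodically compares the node's measured offset from
// the rest of the cluster against MaxOffset and pulls the node out of
// rotation once the promise no longer holds.
func monitorOffset(measure func() time.Duration, stop func()) {
	for range time.Tick(3 * time.Second) {
		if offset := measure(); offset > MaxOffset {
			log.Printf("measured offset %s exceeds MaxOffset %s; stopping node",
				offset, MaxOffset)
			stop()
			return
		}
	}
}
```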
So what if a node does exceed it without knowing? Well, then there's a chance
you might serve some stale reads from that replica on busy keys (if your clock
is behind and the future writes seem at a "safe distance" from the local point
of view). That presupposes that you're leading the replica group, though, in
which case you're also the one proposing the writes, and you won't let a write
through if the timestamps in it seem to disprove your assumption about the
MaxOffset; so most likely you'd only be able to see this during a leadership
change.

Now assume you don't really trust your clocks: what can you do? Well, you can
bump up the MaxOffset as far as you like. What you pay for that are more
restarts on busy keys when unsynchronized nodes work on them concurrently, but
that may be perfectly acceptable for your workload (and if you have really bad
clocks, then chances are you're not running in production).