NTPv5 Design Sketch

Overview

ΤΗ ΚΑΛΛΙΣΤΗΙ! ("to the fairest!")

This is a sketch of a proposed NTPv5 design (by no means a complete spec, but hopefully good enough to make the concepts clear). It's a fairly ambitious step forward from previous versions, almost but not quite a green-field design. I say "almost" because it retains a couple of limited backward-compatibility constraints:

  1. It must be possible to cleanly multiplex NTPv5 with NTPv4 (and earlier) on the same port. Basically this just means keeping the version field in the same place.

  2. It must be possible to discipline your clock from a mix of NTPv5 and NTPv4 sources.

That said, the reader should still find much that is familiar. We've learned a lot in the nearly 35 years that NTP has been in operation, but plenty of those lessons have been about what works and not just about what doesn't!

This design includes all the goodies — just about everything from the Trac wishlist is addressed here. Some highlights follow:

  • The packet structure has been completely reworked, aside from retaining the version field as previously mentioned.

  • Network Time Security is a mandatory and integral part of the protocol. NTS-related fields are part of the base packet rather than extension fields. All information that can be encrypted is, and clients do not send any information to servers beyond what is strictly necessary for the server to construct a response.

  • All communication in NTPv5 is unicast between a stateful client and a stateless server. Behavior analogous to symmetric mode of NTPv4 is implemented in NTPv5 by two endpoints each functioning as a client of the other. The broadcast mode has been eliminated.

  • NTPv4's interleaved timestamp functionality is replaced by a distinct "follow-up" message, similar to the message of the same name in PTP.

  • NTPv4's "refid" mechanism, intended to detect one-degree timing loops, is replaced by a new mechanism based on Bloom filters that works out to arbitrary degree.

  • Timestamps are now based on the TAI timescale, and time packets carry a UTC-TAI offset and more detailed information about recent or upcoming leap seconds.

  • I adopt PTP's 80-bit timestamp format, extended with an additional 16 bits of sub-nanosecond fraction.

  • Time packets carry both an absolute clock (subject to step adjustments) and a (stable) difference clock.

  • Time packets carry an unencrypted and unauthenticated correction field intended for manipulation by middleboxes. The function of this field is analogous to PTP's concept of transparent clocks. We define a value for the Router Alert IP option to signal to middleboxes that this behavior is desired.

  • Methods for selecting among time sources, filtering noise, and disciplining the local clock are not discussed in this sketch, and I propose that they no longer be a normative part of the standard. We specify only as much as is necessary to define protocol semantics and to ensure global stability among heterogeneous implementations.

Additions to NTS-KE

NTS-KE works the same way for NTPv5 as it does for NTPv4. We allocate a new NTS Next Protocol ID to represent NTPv5 and define new NTS-KE record types "New Cookie For NTPv5", "NTPv5 Server Negotiation", and "NTPv5 Port Negotiation" whose structure and semantics are identical to their NTPv4 counterparts.

We add one additional NTS-KE record, "Request Address-Bound NTPv5 Cookies", whose body is empty and whose critical bit is unset. When the client includes this record in an NTS-KE request, it is asking that the cookies that the server returns be cryptographically bound to the client's network address, such that the NTP server will reject them as invalid if the client sends them from any other address. Normally such a restriction is undesirable, because it interacts poorly with mobile clients and certain NATs and therefore would lead to a less reliable protocol. However, it makes possible certain features which would otherwise be unsafe.

Most of NTPv5 is designed to prevent DDoS amplification: responses are never larger than the requests they are responding to. However, the "follow-up" feature is an exception to this, since the server sends back two packets in response to a single packet from the client. Address-bound cookies prevent this from being abused, due to the difficulty of obtaining a cookie bound to a spoofed address. Servers will be mandated to send follow-up messages only to clients which present address-bound cookies, and clients will be advised to request address-bound cookies if and only if they plan to utilize the follow-up message feature.

Preliminaries on packet structure

Throughout this document I'll be using the TLS presentation language to describe packet structure; see RFC 8446, section 3. Of the two array styles specified there (angle brackets for arrays preceded by their length, square brackets for ones that are not), I use only the [] style, and the corresponding length field is always explicit elsewhere in the message.

To be friendly to hardware implementations, all variable-length fields are padded out to a multiple of 4 octets; the length field gives the unpadded length, which is then rounded up to the next multiple of four to obtain the length inclusive of padding. We'll use the following type to conveniently represent an opaque field that has padding at the end of it:

struct {
  opaque data[4]; // one 4-octet unit; a "padded" array is therefore always a multiple of 4 octets long
} padded;
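
To make the rounding rule concrete, here is a minimal sketch in Python. The pad() helper is just the notation used throughout this document's grammar, not a wire element:

def pad(n: int) -> int:
    """Round an unpadded length up to the next multiple of 4 octets."""
    return (n + 3) & ~3

# pad(0) == 0, pad(1) == 4, pad(4) == 4, pad(5) == 8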

Furthermore, all 1, 2, and 4-octet fields are generally arranged so as to be self-aligned, and longer fields are arranged to have 4-octet alignment. Timestamps, which follow the format of PTPv2, are a slight exception to this:

struct {
  int48 seconds; // SI seconds, has a range of about ±4.46 million years
  uint32 nanoseconds; // SI nanoseconds, ranging [0..999999999]
  uint16 fracs; // 1/65536ths of a nanosecond
} Timestamp;

Although they partly follow the rule by being 12 octets long and aligned to 4 octets everywhere they occur, the uint32 nanoseconds field has odd alignment.

In any case, the padding rules are exactly what is specified in this grammar; there is no implicit additional padding.
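
As an illustration, a Timestamp could be packed and unpacked as follows. This is a sketch of my own, assuming big-endian encoding and a two's-complement int48, consistent with the TLS presentation language:

import struct

def encode_timestamp(seconds: int, nanoseconds: int, fracs: int) -> bytes:
    """Pack a 12-octet Timestamp: int48 seconds, uint32 nanoseconds, uint16 fracs."""
    assert -(1 << 47) <= seconds < (1 << 47)
    assert 0 <= nanoseconds <= 999_999_999 and 0 <= fracs <= 0xFFFF
    return ((seconds & ((1 << 48) - 1)).to_bytes(6, "big")
            + struct.pack(">IH", nanoseconds, fracs))

def decode_timestamp(buf: bytes) -> tuple[int, int, int]:
    seconds = int.from_bytes(buf[0:6], "big")
    if seconds >= 1 << 47:  # sign-extend the int48
        seconds -= 1 << 48
    nanoseconds, fracs = struct.unpack(">IH", buf[6:12])
    return seconds, nanoseconds, fracs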

Top-level packet structure

Every NTPv5 packet has the same top-level format: two fixed octets giving the version number, a two-octet packet type, and then a type-dependent body.

enum {
  request(0),
  response(1),
  request_with_correction(2),
  response_with_correction(3),
  nts_nak(4),
  (65535)
} PacketType;

struct {
  /* Keep this octet fixed like this for all future versions. NTPv6
     and beyond should continue with VN=5 here, and give their actual
     version in the next octet. */
  uint8 legacy_li_vn_mode = 0xe8; // LI = 3 (unsync), VN = 5, Mode = 0

  /* Version field gets a whole octet from now on rather than just 3 bits */
  uint8 version = 5;

  PacketType packet_type;
  
  select(Packet.packet_type) {
    case request:                  NtsData;
    case response:                 NtsData;
    case request_with_correction:  NtsDataWithCorrection;
    case response_with_correction: NtsDataWithCorrection;
    case nts_nak:                  NtsNak;
  } body;
} Packet;

struct {
  uint8 cookie_length; // Always zero in response packets
  uint8 nonce_length;
  uint16 ciphertext_length;
  opaque unique_identifier[32];
  padded cookie[pad(cookie_length)]; // Always zero-length in response packets
  padded nonce[max(16, pad(nonce_length))]; // Padded up to a minimum of 16 octets
  padded ciphertext[pad(ciphertext_length)];
} NtsData;

struct {
  Correction correction; 
  NtsData nts_data;
} NtsDataWithCorrection;

struct {
  opaque unique_identifier[32];
} NtsNak;

The Packet structure is a cryptographic envelope: except for the Correction structure (fully defined later), which is intentionally manipulable by middleboxes, it exposes only as much information as is needed for the receiver to locate, authenticate, and decrypt the ciphertext.

If you are already familiar with NTS for NTPv4, then the meaning of the fields of NtsData should be self-explanatory. While they are now syntactically part of the base packet rather than being scattered across NTS extension fields, they have basically the same semantics as those corresponding extensions. There is only a slight difference in how the associated data for RFC 5116 is constructed. In NTPv4, the Associated Data is whatever appears on the wire from the start of the packet to the start of the NTS Authenticator and Encrypted Extensions Fields extension field. In NTPv5, the associated data is exactly the following structure:

struct {
  uint8 legacy_li_vn_mode = 0xe8;
  uint8 version = 5;
  PacketType packet_type;
  uint8 cookie_length;
  uint8 nonce_length;
  uint16 ciphertext_length;
  opaque unique_identifier[32];
  padded cookie[pad(cookie_length)];
} NtsAd;

The NtsAd structure is a pseudo-type which never crosses the wire as such, but is formed by copying the corresponding fields out of Packet and NtsData. Authenticating some of these fields is redundant, but helps guard against certain implementation mistakes.
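
For concreteness, a sketch of forming those associated-data octets (a hypothetical helper of my own; it reuses the pad() function from earlier and assumes big-endian integer encoding):

def build_nts_ad(packet_type: int, unique_identifier: bytes, cookie: bytes,
                 nonce_length: int, ciphertext_length: int) -> bytes:
    """Serialize the NtsAd pseudo-structure for use as RFC 5116 associated data."""
    assert len(unique_identifier) == 32
    ad = bytes([0xe8, 5])                          # legacy_li_vn_mode, version
    ad += packet_type.to_bytes(2, "big")           # PacketType
    ad += bytes([len(cookie), nonce_length])       # cookie_length, nonce_length
    ad += ciphertext_length.to_bytes(2, "big")
    ad += unique_identifier
    ad += cookie.ljust(pad(len(cookie)), b"\x00")  # cookie plus zero padding
    return ad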

Decrypting the ciphertext gives a structure of type NtsPlaintext:

enum {
  time_request(0),
  time_response(1),
  followup_response(2),
  source_request(3),
  source_response(4),
  (65535)
} MessageType;

struct {
  MessageType msg_type;
  uint8 cookie_len;
  uint8 num_cookies;
  /* num_cookies cookies, each with an unpadded length of cookie_len, and
     padding added to the end of each individual cookie. */
  padded cookies[num_cookies * pad(cookie_len)];
  select (NtsPlaintext.msg_type) {
    case time_request:          TimeRequest;
    case time_response:         TimeResponse;
    case source_request:        SourceRequest;
    case source_response:       SourceResponse;
    case followup_response:     FollowupResponse;
  } body;
} NtsPlaintext;

Here we have a bit more machinery for NTS cookies, and a message type and corresponding body where all the information actually related to time synchronization lives. In request packets, the cookies field contains placeholders, and in response packets it contains fresh cookies being provided to the client. Again, while I have made the syntactic change of moving these from extension fields into the base packet, the semantics are the same as they are in NTPv4.
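
A sketch of how a receiver might split the decrypted plaintext (a hypothetical parser; error handling omitted, pad() as before):

def parse_nts_plaintext(pt: bytes) -> tuple[int, list[bytes], bytes]:
    """Return (msg_type, cookies, body) from a decrypted NtsPlaintext."""
    msg_type = int.from_bytes(pt[0:2], "big")     # MessageType
    cookie_len, num_cookies = pt[2], pt[3]
    cookies, off = [], 4
    for _ in range(num_cookies):
        cookies.append(pt[off:off + cookie_len])  # strip per-cookie padding
        off += pad(cookie_len)
    return msg_type, cookies, pt[off:]            # body is type-dependent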

Time messages

Unless extensions are present, the body of a time request has only one bit of any interest; the rest is just anti-amplification padding. When bit 2 of the flags octet is set, the client is requesting that the server send a follow-up packet to provide a drivestamp (see "Follow-up messages" below).

struct {
  uint8 flags; /* bit 2: set = follow-up requested
                  bit 0-1, 3-7: unused, MUST be clear */
  opaque anti_amplification_padding[103]; // brings the fixed portion to 104 octets,
                                          // matching TimeResponse's fixed portion
  Extension extensions[]; // Each extension is self-delimiting, and we're at
                          // the end of the extension list when we're at the
                          // end of the structure
} TimeRequest;

Now finally we reach the meat: the TimeResponse structure where all the core time synchronization information lives. Inline comments briefly describe each field and then I'll circle back to highlight some details.

struct {
  uint8 flags; /* bit 0: set = synchronized, clear = unsynchronized
                  bit 1: set = leap insertion, clear = leap deletion
                  bit 2: set = expect follow-up
                  bit 3-7: unused, MUST be clear */

  /* Same meaning as in NTPv4 */
  int8 precision;

  /* Please don't send me an average of more than one packet per
     2**throttle seconds once you've stabilized */
  int8 throttle;
  
  /* Please don't send me more than one packet over any
     2**burst_throttle second interval */
  int8 burst_throttle;

  /* Randomly-generated value that identifies this server */
  opaque source_id[16];

  /* Incremented whenever we change upstream sources or whenever one
    of our upstream sources changes its own source_seqno. */
  uint32 source_seqno;

  /* These have no specified epoch; they get set arbitrarily on
     startup and never step. They may receive frequency discipline but
     never offset discipline (cf. RAD clocks). */
  Timestamp recv;
  Timestamp xmit;

  /* Add this amount to convert the recv and xmit timestamps to a TAI
     timestamp (relative to midnight 1970-01-01 TAI). This estimate
     gets an immediate step adjustment any time we receive a new data
     point. */
  Timestamp tai_loc_offset;

  /* Error estimates for tai_loc_offset (replaces root
     delay/dispersion) */
  Timestamp max_error;
  Timestamp rms_error;
  
  /* The TAI time of the last-announced leap event, which might be
     past or future. Flag bit 1 says whether it's an insertion or
     deletion */
  Timestamp leap_event;
  
  /* UTC-TAI offset after the completion of the above event */
  int32 utc_tai_offset;

  /* Counts the number of leap events that have ever historically
     occurred */
  uint32 leap_seqno;
  
  Extension extensions[]; // Each extension is self-delimiting, and we're at
                          // the end of the extension list when we're at the
                          // end of the structure
} TimeResponse;

The recv and xmit timestamps are captured at the same point as they are in NTPv4 (the recv timestamp the moment the request is received, the xmit timestamp as nearly as possible to the moment the response is sent). However, these timestamps don't have a defined epoch so they don't actually tell you the time as such: they're just timers that advance by one second per second of real time — difference clocks as opposed to absolute clocks. To get an absolute timestamp, you have to add tai_loc_offset to them.

tai_loc_offset represents the server's best available estimate of the offset between its local monotonic clock (what it samples when it captures recv and xmit timestamps) and TAI. It's allowed to jump around every time the server gets a new data point that improves its estimate. So you can't rely on xmit + tai_loc_offset to be at all stable; it can even move backward. In any computation that assumes stable monotonic progression, just use recv and xmit by themselves.
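
To make that concrete, here is a minimal sketch of the familiar on-wire calculation done this way (float seconds and names are my own; the spec deliberately doesn't mandate any particular estimator or discipline algorithm):

def estimate(t1: float, t2: float, t3: float, t4: float,
             tai_loc_offset: float) -> tuple[float, float]:
    """Estimate (offset of TAI relative to the client's clock, round-trip delay).
    t1/t4 are client timestamps at request send / response receipt;
    t2/t3 are the server's recv/xmit difference-clock timestamps."""
    offset = ((t2 - t1) + (t3 - t4)) / 2 + tai_loc_offset
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay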

While recv and xmit are guaranteed to never receive step adjustments, there is no guarantee about any higher derivatives. For example if a server thinks its clock is 5ppm slow, it can immediately speed it up by 5ppm; it does not have to gradually accelerate.

If the source_id field has changed from one time response to another, then recv and xmit timestamps taken before the change are no longer comparable to those taken after it. The server may have rebooted and lost its clock.

Since we've adopted PTPv2's timestamp format, we also adopt its epoch of January 1, 1970 TAI rather than the NTPv4 epoch of 1900.

max_error gives a maximum error bound on tai_loc_offset, and rms_error gives the square root of the expected squared error.

We don't specify what kind of estimator tai_loc_offset is, e.g. a minimum-mean-squared-error estimate versus a maximum likelihood estimate, and we don't require that the estimator be unbiased. However, since rms_error is based on mean-squared error, using any estimator other than a MMSE estimator requires reporting a larger rms_error than one otherwise might be able to.

The leap_event and utc_tai_offset fields along with the leap insertion/deletion flag bit are used in converting from TAI to UTC and notifying of upcoming or recent leap events. leap_event gives the TAI timestamp as of the conclusion of the most recently-announced leap second event, which might be in the future or might be in the past. utc_tai_offset gives the UTC-TAI offset as of the conclusion of that event, and the flag bit tells whether the leap event is/was an insertion or a deletion.
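
In other words, a client can recover the UTC-TAI offset in effect at any given TAI instant. A sketch under those field semantics (ignoring the instant inside an inserted leap second, which UTC cannot label uniquely):

def utc_tai_at(t_tai: float, leap_event: float, utc_tai_offset: int,
               insertion: bool) -> int:
    """UTC-TAI offset, in seconds, in effect at TAI time t_tai."""
    if t_tai >= leap_event:
        return utc_tai_offset          # the announced event has concluded
    # Before the event, undo its effect: an insertion lowers UTC-TAI by one.
    return utc_tai_offset + 1 if insertion else utc_tai_offset - 1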

The throttle and burst_throttle fields replace KoD RATE messages. Rather than the server sending KoDs to clients that query it too frequently (which burdens the server with maintaining a table to keep track of this), it simply communicates its policy on what query rate is acceptable and clients are expected to follow it.
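
A compliant client might schedule its queries like this (a hypothetical sketch; the randomization is my own choice, to avoid synchronized query storms, and is not part of the design):

import random

def next_query_delay(throttle: int, burst_throttle: int) -> float:
    """Seconds to wait before the next query: average at most one packet
    per 2**throttle seconds, and at least 2**burst_throttle seconds
    between any two packets."""
    mean = 2.0 ** throttle
    return max(2.0 ** burst_throttle, random.uniform(0.75 * mean, 1.25 * mean))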

The source_seqno field is part of NTPv5's loop-detection scheme. Its meaning and purpose will be explained in a later section.

Extensions have the same type-length-value format that they do in NTPv4:

struct {
  uint16 type;
  uint16 len;
  padded body[pad(len)];
} Extension;

But, as throughout NTPv5, the length field gives the unpadded length rather than the padded one. Unlike in NTPv4, NTPv5 extension bodies have no minimum length since we're free of the syntactic ambiguities that force RFC 7822 to require one.

This design defines no extension fields. Extension fields are only one of two ways that NTPv5 can be extended; new message types can be defined as well. Generally, extension fields should be preferred only when the information they carry is somehow coupled to the time data in the same packet (such as, to pick a silly example, adding additional bits of precision to a timestamp). Otherwise it is probably better to define a new message type instead.

Follow-up messages

When the client sets flag bit 2 in its TimeRequest and the server sets bit 2 in its TimeResponse, the response should be immediately followed by a FollowupResponse which provides a corrected xmit stamp. The corrected stamp can be based on a drivestamp from the emitting NIC and can therefore be more accurate than the original.

struct {
  Timestamp corrected_xmit;
} FollowupResponse;

Source messages and loop detection

NTPv5 uses a loop-detection mechanism based on Bloom filters that works for a large number of sources out to arbitrary degree. At startup, each server randomly assigns itself a 128-bit source_id. It then sends each of its upstream sources a SourceRequest message — just a bunch of padding, no meaningful fields.

struct {
  opaque anti_amplification_padding[532]; // matches the 532-octet SourceResponse
} SourceRequest;

It gets back a SourceResponse from each one.

struct {
  opaque source_id[16];
  uint32 source_seqno;
  opaque bloom_filter[512];
} SourceResponse;

Each source has populated a Bloom filter based on its own sources further upstream. Our newly-online server now builds a Bloom filter of its own. First, it takes the union (bitwise OR) of all the filters from upstream. Then it inserts the source_ids of its immediate upstream sources into the filter. Since source IDs are already uniformly random, the hash functions we employ can be very simple: split the 128-bit source_id into ten 12-bit values h_0, h_1, …, h_9, discarding the last 8 bits, and then for each i, set the h_i'th bit in the filter.
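
A sketch of the filter operations (my own illustration; it assumes the ten 12-bit slices are taken from the most-significant end of source_id and that bits are numbered little-endian within each filter octet, neither of which this sketch pins down normatively):

def bloom_insert(filt: bytearray, source_id: bytes) -> None:
    """Set the ten bits chosen by a 128-bit source_id in a 4096-bit filter."""
    v = int.from_bytes(source_id, "big")
    for i in range(10):
        h = (v >> (116 - 12 * i)) & 0xFFF      # h_0..h_9; final 8 bits unused
        filt[h >> 3] |= 1 << (h & 7)

def bloom_contains(filt: bytes, source_id: bytes) -> bool:
    """True if every bit for source_id is set: a probable timing loop."""
    v = int.from_bytes(source_id, "big")
    return all(filt[((v >> (116 - 12 * i)) & 0xFFF) >> 3]
               & (1 << (((v >> (116 - 12 * i)) & 0xFFF) & 7))
               for i in range(10))

def union(filters: list[bytes]) -> bytearray:
    """Bitwise OR of upstream filters, the starting point for our own."""
    out = bytearray(512)
    for f in filters:
        out = bytearray(a | b for a, b in zip(out, f))
    return out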

Once the new server has computed its filter and finished synchronizing, it comes online with a source_seqno of 0. It then watches the source_seqnos in the time responses from its sources. Whenever one of them changes, it sends a new SourceRequest to that server, recomputes its filter, and increments its own source_seqno. However, if the new filter that comes back includes our own source_id, this indicates a probable timing loop and we should drop that source.

The use of Bloom filters means that timing loops will always be detected but there will be occasional false positives. With the filter size chosen, we can scale to a few hundred transitive upstream sources before the probability of a false positive becomes non-negligible, which should be sufficient.
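
The standard Bloom filter estimate bears this out: with m = 4096 bits and k = 10 bits set per entry, the false-positive probability after n insertions is approximately (1 - e^(-kn/m))^k.

import math

def false_positive_rate(n: int, m: int = 4096, k: int = 10) -> float:
    return (1.0 - math.exp(-k * n / m)) ** k

# false_positive_rate(100) ≈ 2e-7
# false_positive_rate(300) ≈ 1.4e-3
# false_positive_rate(500) ≈ 3e-2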

Endpoints that act only as clients and not as servers don't have to bother doing any of this since they can't possibly be involved in timing loops.

Correction fields

The Correction structure is deliberately not encrypted or authenticated, and is intended for manipulation by middleboxes. It is similar in function to PTP's transparent clocks.

Packets which contain correction fields must also include a Router Alert IP option (RFCs 2113 and 2711) in order to signal to middleboxes that they want them modified. Middleboxes must rely strictly on this option (and not merely on something that looks like a syntactically valid NTPv5 packet going over the NTP port) to determine that they are handling NTPv5 traffic and that such modifications are desired.

struct {
  Timestamp correction;
  uint32 path_crc;
} Correction;

correction is initially 0 when a time request leaves the client. Compliant middleboxes increment it by the amount of time that the request spends queued. The receiving NTP server copies the correction field it receives into the response. Middleboxes on the return path decrement the response's correction field by the amount of time that the response spends queued.

path_crc is initially 0 when a time request leaves the client. Compliant middleboxes self-assign an identifier at random and add their identifier to the CRC (where "add" here means the CRC group operation). The receiving NTP server copies the path_crc field it receives into the response. Middleboxes on the return path subtract their self-assigned identifier from the path_crc. The client receiving the response verifies that the path_crc is 0. If not, there is a routing asymmetry and the correction should be discarded.

Correction fields can, obviously, be maliciously altered by a MitM in order to cause the client to obtain a bogus time estimate. Clients which pay attention to correction fields must set a limit on the largest correction they will accept. A correction whose absolute value is in excess of one half the measured RTT is definitely bogus and should always be rejected, but clients may choose to set bounds tighter than this.
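
Here's a sketch of both behaviors (my own illustration; it assumes the CRC group operation is XOR over the 32-bit values, which is self-inverse, so adding and subtracting an identifier are the same operation):

def middlebox_update(correction_ns: int, path_crc: int, ident: int,
                     queued_ns: int, outbound: bool) -> tuple[int, int]:
    """Apply a compliant middlebox's update to (correction, path_crc)."""
    correction_ns += queued_ns if outbound else -queued_ns
    return correction_ns, path_crc ^ ident        # XOR: add == subtract

def client_check(correction_ns: int, path_crc: int, rtt_ns: int) -> bool:
    """Accept the correction only if the path was symmetric and plausible."""
    return path_crc == 0 and abs(correction_ns) <= rtt_ns // 2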
