ΤΗ ΚΑΛΛΙΣΤΗΙ!
This is a sketch of a proposed NTPv5 design (by no means a complete spec, but hopefully good enough to make the concepts clear). It's a fairly ambitious step forward from previous versions, almost but not quite a greenfield design. I say "almost" because it retains a couple of limited backward-compatibility constraints:
- It must be possible to cleanly multiplex NTPv5 with NTPv4 (and earlier) on the same port. Basically this just means keeping the version field in the same place.
- It must be possible to discipline your clock from a mix of NTPv5 and NTPv4 sources.
That said, the reader should still find much that is familiar. We've learned a lot in the nearly 35 years that NTP has been in operation, but plenty of those lessons have been about what works and not just about what doesn't!
This design includes all the goodies — just about everything from the trac wishlist is addressed here. Some highlights follow:
- The packet structure has been completely reworked, aside from retaining the version field as previously mentioned.
- Network Time Security is a mandatory and integral part of the protocol. NTS-related fields are a part of the base packet rather than extensions. All information that can be encrypted is, and clients do not send any information to servers beyond what is strictly necessary for the server to construct a response.
- All communication in NTPv5 is unicast between a stateful client and a stateless server. Behavior analogous to the symmetric mode of NTPv4 is implemented in NTPv5 by two endpoints each functioning as a client of the other. The broadcast mode has been eliminated.
- NTPv4's interleaved timestamp functionality is replaced by a distinct "follow-up" message, similar to the functionality of the same name in PTP.
- NTPv4's "refid" mechanism, intended to detect one-degree timing loops, is replaced by a new mechanism based on Bloom filters that works out to arbitrary degree.
- Timestamps are now based on the TAI timescale, and time packets carry a UTC-TAI offset and more detailed information about recent or upcoming leap seconds.
- I adopt PTP's 80-bit timestamp format.
- Time packets carry both an absolute clock (subject to step adjustments) and a (stable) difference clock.
- Time packets carry an unencrypted and unauthenticated correction field intended for manipulation by middleboxes. The function of this field is analogous to PTP's concept of transparent clocks. We define a value for the Router Alert IP option to signal to middleboxes that this behavior is desired.
- Methods for selecting among time sources, filtering noise, and disciplining the local clock are not discussed in this sketch, and I propose that they no longer be a normative part of the standard. We specify only as much as is necessary to define protocol semantics and to ensure global stability among heterogeneous implementations.
NTS-KE works the same way for NTPv5 as it does for NTPv4. We allocate a new NTS Next Protocol ID to represent NTPv5 and define new NTS-KE record types "New Cookie For NTPv5", "NTPv5 Server Negotiation", and "NTPv5 Port Negotiation" whose structure and semantics are identical to their NTPv4 counterparts.
We add one additional NTS-KE record, "Request Address-Bound NTPv5 Cookies", whose body is empty and whose critical bit is unset. When the client includes this record in an NTS-KE request, it is asking that the cookies that the server returns be cryptographically bound to the client's network address, such that the NTP server will reject them as invalid if the client sends them from any other address. Normally such a restriction is undesirable, because it interacts poorly with mobile clients and certain NATs and therefore would lead to a less reliable protocol. However, it makes possible certain features which would otherwise be unsafe.
Most of NTPv5 is designed to prevent DDoS amplification: responses are never larger than the requests they are responding to. However, the "follow-up" feature is an exception to this, since the server sends back two packets in response to a single packet from the client. Address-bound cookies prevent this from being abused due to the difficulty of obtaining a cookie bound to a spoofed address. Servers will be mandated to send follow-up messages only to clients which present address-bound cookies, and clients will be advised to request address-bound cookies if and only if they plan to utilize the follow-up message feature.
Throughout this document I'll be using the TLS presentation language to describe packet structure; see RFC 8446, section 3. Of the two array styles specified there (angle brackets for arrays preceded by their length, square brackets for ones that are not), I use only the [] style, and the corresponding length field is always explicit elsewhere in the message.
To be friendly to hardware implementations, all variable-length fields are padded out to a multiple of 4 octets; the length field gives the unpadded length which is then rounded up to the next multiple of four to obtain length inclusive of padding. We'll use the following type to conveniently represent an opaque field that has padding at the end of it:
struct {
opaque data[4];
} padded;
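The rounding rule above can be sketched in a few lines of Python. The helper names `pad` and `pad_bytes` are illustrative, and filling the padding with zero octets is an assumption; this sketch does not say what the padding octets must contain.

```python
def pad(length: int) -> int:
    """Round an unpadded length up to the next multiple of 4 octets."""
    if length < 0:
        raise ValueError("length must be non-negative")
    return (length + 3) & ~3


def pad_bytes(data: bytes) -> bytes:
    """Append padding so the result is a multiple of 4 octets long.

    Assumption: padding octets are zero; the sketch leaves this open.
    """
    return data + b"\x00" * (pad(len(data)) - len(data))
```

The length field that accompanies such a field on the wire would carry `len(data)`, not `len(pad_bytes(data))`.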
Furthermore, all 1, 2, and 4-octet fields are generally arranged so as to be self-aligned, and longer fields are arranged to have 4-octet alignment. Timestamps, which follow the format of PTPv2, are a slight exception to this:
struct {
int48 seconds; // SI seconds, has a range of about ±4.46 million years
uint32 nanoseconds; // SI nanoseconds, ranging [0..999999999]
uint16 fracs; // 1/65536ths of a nanosecond
} Timestamp;
Although they partly follow the rule by being 12 octets long and
aligned to 4 octets everywhere they occur, the uint32 nanoseconds
has odd alignment.
In any case, the padding rules are exactly what is specified in this grammar; there is no implicit additional padding.
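As a rough illustration of the 12-octet Timestamp layout, here is a minimal Python sketch. It assumes network byte order (as in the TLS presentation language), a two's-complement encoding for the int48 seconds field, and that negative times are expressed with negative seconds and a non-negative nanoseconds field; none of these choices is spelled out in this sketch of the design. The function names are illustrative.

```python
import struct
from fractions import Fraction
from typing import Tuple


def pack_timestamp(seconds: int, nanoseconds: int, fracs: int) -> bytes:
    """Encode a Timestamp as 6 + 4 + 2 = 12 octets, big-endian."""
    if not -(1 << 47) <= seconds < (1 << 47):
        raise ValueError("seconds does not fit in int48")
    if not 0 <= nanoseconds <= 999_999_999:
        raise ValueError("nanoseconds out of range")
    if not 0 <= fracs <= 0xFFFF:
        raise ValueError("fracs out of range")
    # Assumption: int48 is two's complement.
    return seconds.to_bytes(6, "big", signed=True) + struct.pack("!IH", nanoseconds, fracs)


def unpack_timestamp(buf: bytes) -> Tuple[int, int, int]:
    """Decode 12 octets into (seconds, nanoseconds, fracs)."""
    if len(buf) != 12:
        raise ValueError("a Timestamp is exactly 12 octets")
    seconds = int.from_bytes(buf[:6], "big", signed=True)
    nanoseconds, fracs = struct.unpack("!IH", buf[6:])
    return seconds, nanoseconds, fracs


def timestamp_to_seconds(seconds: int, nanoseconds: int, fracs: int) -> Fraction:
    """Exact value in seconds; fracs counts 1/65536ths of a nanosecond."""
    return seconds + Fraction(nanoseconds * 65536 + fracs, 65536 * 10**9)
```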
Every NTPv5 packet has the same top-level format: two fixed octets giving the version number, a two-octet packet type, and then a type-dependent body.
enum {
request(0),
response(1),
request_with_correction(2),
response_with_correction(3),
nts_nak(4),
(65535)
} PacketType;
struct {
/* Keep this octet fixed like this for all future versions. NTPv6
and beyond should continue with VN=5 here, and give their actual
version in the next octet. */
uint8 legacy_li_vn_mode = 0xe8; // LI = 3 (unsync), VN = 5, Mode = 0
/* Version field gets a whole octet from now on rather than just 3 bits */
uint8 version = 5;
PacketType packet_type;
select(Packet.packet_type) {
case request: NtsData;
case response: NtsData;
case request_with_correction: NtsDataWithCorrection;
case response_with_correction: NtsDataWithCorrection;
case nts_nak: NtsNak;
} body;
} Packet;
struct {
uint8 cookie_length; //Always zero in response packets
uint8 nonce_length;
uint16 ciphertext_length;
opaque unique_identifier[32];
padded cookie[pad(cookie_length)]; //Always zero-length in response packets
padded nonce[max(16,pad(nonce_length))]; //Padded up to a minimum of 16 octets
padded ciphertext[pad(ciphertext_length)];
} NtsData;
struct {
Correction correction;
NtsData nts_data;
} NtsDataWithCorrection;
struct {
opaque unique_identifier[32];
} NtsNak;
The Packet structure is a cryptographic envelope: except for the Correction structure (fully defined later), which is intentionally manipulable by middleboxes, it exposes only as much information as is needed for the receiver to locate, authenticate, and decrypt the ciphertext.
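To illustrate how a receiver might multiplex legacy and new traffic on one port, here is a minimal Python sketch that inspects only the fixed header. The function name `classify` and its return shape are illustrative, and packet type 4 is labelled `nts_nak` following the PacketType enum.

```python
import struct

LEGACY_OCTET = 0xE8  # LI = 3, VN = 5, Mode = 0
PACKET_TYPES = {
    0: "request",
    1: "response",
    2: "request_with_correction",
    3: "response_with_correction",
    4: "nts_nak",
}


def classify(datagram: bytes):
    """Return (family, version, packet_type_name) for a received datagram."""
    if not datagram:
        raise ValueError("empty datagram")
    # The 3-bit VN field sits in the same place as in NTPv4 and earlier.
    vn = (datagram[0] >> 3) & 0x7
    if vn != 5:
        return ("legacy", vn, None)
    if len(datagram) < 4:
        raise ValueError("truncated NTPv5 header")
    legacy, version, packet_type = struct.unpack_from("!BBH", datagram)
    if legacy != LEGACY_OCTET:
        raise ValueError("first octet must be 0xe8 for NTPv5 and later")
    # The real version number lives in the second octet from now on.
    return ("ntpv5+", version, PACKET_TYPES.get(packet_type, "unknown"))
```

A server could dispatch on the first element of the result and hand legacy datagrams to an existing NTPv4 code path.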
If you are already familiar with NTS for NTPv4, then the meaning of the fields of NtsData should be self-explanatory. While they are now syntactically part of the base packet rather than being scattered across NTS extension fields, they have basically the same semantics as those corresponding extensions. There is only a slight difference in how the associated data for RFC 5116 is constructed. In NTPv4, the Associated Data is whatever appears on the wire from the start of the packet to the start of the NTS Authenticator and Encrypted Extension Fields extension field. In NTPv5, the associated data is exactly the following structure:
struct {
uint8 legacy_li_vn_mode = 0xe8;
uint8 version = 5;
PacketType packet_type;
uint8 cookie_length;
uint8 nonce_length;
uint16 ciphertext_length;
opaque unique_identifier[32];
padded cookie[pad(cookie_length)];
} NtsAd;
The NtsAd structure is a pseudo-type which never crosses the wire as such, but is formed by copying the corresponding fields out of Packet and NtsData. Authenticating some of these fields is redundant but helps guard against certain implementation mistakes.
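A minimal Python sketch of assembling the NtsAd octets from already-parsed fields follows. The function name and parameter names are illustrative; the cookie is passed in exactly as it appeared on the wire, padding included, so no assumption about the padding contents is needed.

```python
import struct


def pad(length: int) -> int:
    """Round up to the next multiple of 4 octets."""
    return (length + 3) & ~3


def build_nts_ad(packet_type: int, cookie_length: int, nonce_length: int,
                 ciphertext_length: int, unique_identifier: bytes,
                 padded_cookie: bytes) -> bytes:
    """Serialize the NtsAd pseudo-structure used as AEAD associated data."""
    if len(unique_identifier) != 32:
        raise ValueError("unique_identifier must be 32 octets")
    if len(padded_cookie) != pad(cookie_length):
        raise ValueError("cookie must be supplied with its wire padding")
    header = struct.pack(
        "!BBHBBH",
        0xE8,               # legacy_li_vn_mode
        5,                  # version
        packet_type,
        cookie_length,      # unpadded length, as on the wire
        nonce_length,
        ciphertext_length,
    )
    return header + unique_identifier + padded_cookie
```

The nonce and ciphertext themselves are not part of the associated data; only their lengths are.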
Decrypting the ciphertext gives a structure of type NtsPlaintext:
enum {
time_request(0),
time_response(1),
followup_response(2),
source_request(3),
source_response(4),
(65535)
} MessageType;
struct {
MessageType msg_type;
uint8 cookie_len;
uint8 num_cookies;
/* num_cookies cookies, each with an unpadded length of cookie_len, and
padding added to the end of each individual cookie. */
padded cookies[num_cookies * pad(cookie_len)];
select (NtsPlaintext.msg_type) {
case time_request: TimeRequest;
case time_response: TimeResponse;
case source_request: SourceRequest;
case source_response: SourceResponse;
case followup_response: FollowupResponse;
} body;
} NtsPlaintext;
Here we have a bit more machinery for NTS cookies, and a message type and corresponding body where all the information actually related to time synchronization lives. In request packets, the cookies field contains placeholders, and in response packets it contains fresh cookies being provided to the client. Again, while I have made the syntactic change of moving these from extension fields to the base packet, the semantics are the same as they are in NTPv4.
Unless extensions are present, the body of a time request has only one bit of any interest; the rest is just anti-amplification padding. When bit 2 of the flags octet is set, the client is requesting that the server send a follow-up packet to provide a drivestamp.
struct {
uint8 flags; /* bit 2: set = follow-up requested
bit 0-1, 3-7: unused, MUST be clear */
opaque anti_amplification_padding[103];
Extension extensions[]; // Each extension is self-delimiting, and we're at
// the end of the extension list when we're at the
// end of the structure
} TimeRequest;
Now finally we reach the meat: the TimeResponse structure, where all the core time synchronization information lives. Inline comments briefly describe each field, and then I'll circle back to highlight some details.
struct {
uint8 flags; /* bit 0: set = synchronized, clear = unsynchronized
bit 1: set = leap insertion, clear = leap deletion
bit 2: set = expect follow-up
bit 3-7: unused, MUST be clear */
/* Same meaning as in NTPv4 */
int8 precision;
/* Please don't send me an average of more than one packet per
2**throttle seconds once you've stabilized */
int8 throttle;
/* Please don't send me more than one packet over any
2**burst_throttle second interval */
int8 burst_throttle;
/* Randomly-generated value that identifies this server */
opaque source_id[16];
/* Incremented whenever we change upstream sources or whenever one
of our upstream sources changes its own source_seqno. */
uint32 source_seqno;
/* These have no specified epoch; they get set arbitrarily on
startup and never step. They may receive frequency discipline but
never offset discipline (cf. RAD clocks). */
Timestamp recv;
Timestamp xmit;
/* Add this amount to convert the recv and xmit timestamps to a TAI
timestamp (relative to midnight 1970-01-01 TAI). This estimate
gets an immediate step adjustment any time we receive a new data
point. */
Timestamp tai_loc_offset;
/* Error estimates for tai_loc_offset (replaces root
delay/dispersion) */
Timestamp max_error;
Timestamp rms_error;
/* The TAI time of the last-announced leap event, which might be
past or future. Flag bit 1 says whether it's an insertion or
deletion */
Timestamp leap_event;
/* UTC-TAI offset after the completion of the above event */
int32 utc_tai_offset;
/* Counts the number of leap events that have ever historically
occurred */
uint32 leap_seqno;
Extension extensions[]; // Each extension is self-delimiting, and we're at
// the end of the extension list when we're at the
// end of the structure
} TimeResponse;
The recv and xmit timestamps are captured at the same points as they are in NTPv4 (the recv timestamp at the moment the request is received, the xmit timestamp as nearly as possible to the moment the response is sent). However, these timestamps don't have a defined epoch, so they don't actually tell you the time as such: they're just timers that advance by one second per second of real time (difference clocks as opposed to absolute clocks). To get an absolute timestamp, you have to add tai_loc_offset to them.
tai_loc_offset represents the server's best available estimate of the offset between its local monotonic clock (what it samples when it captures recv and xmit timestamps) and TAI. It's allowed to jump around every time the server gets a new data point that improves its estimate. So you can't rely on xmit + tai_loc_offset to be at all stable; it can even move backward. In any computation that assumes stable monotonic progression, just use recv and xmit by themselves.
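One way a client might combine these fields is sketched below in Python, using integer nanoseconds. The names `t1` and `t4` (the client's own clock readings when it sent the request and received the response) are hypothetical, and the calculation simply mirrors the familiar NTPv4 on-wire delay formula under the usual symmetric-path assumption; this sketch does not prescribe any particular client algorithm.

```python
def round_trip_delay(t1: int, t4: int, recv: int, xmit: int) -> int:
    """Network round-trip time in ns.

    Uses only differences, so recv/xmit need no epoch and the server's
    step-prone tai_loc_offset never enters the computation.
    """
    return (t4 - t1) - (xmit - recv)


def estimate_tai_at_t4(t1: int, t4: int, recv: int, xmit: int,
                       tai_loc_offset: int) -> int:
    """Estimate of TAI (ns since the epoch) at the instant the client read t4.

    Assumption: the return path takes half of the round-trip delay.
    """
    delay = round_trip_delay(t1, t4, recv, xmit)
    return xmit + tai_loc_offset + delay // 2
```

Because `round_trip_delay` depends only on the difference clock, it stays meaningful across responses even if the server revises tai_loc_offset in between, provided source_id has not changed.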
While recv and xmit are guaranteed to never receive step adjustments, there is no guarantee about any higher derivatives. For example, if a server thinks its clock is 5 ppm slow, it can immediately speed it up by 5 ppm; it does not have to accelerate gradually.
If the source_id field has changed from one time response to another, then recv and xmit timestamps from before the change are no longer comparable to those from after it. The server may have rebooted and lost its clock.
Since we've adopted PTPv2's timestamp format, we also adopt their epoch of January 1, 1970 TAI rather than the NTPv4 epoch of 1900.
max_error gives a maximum error bound on tai_loc_offset, and rms_error gives the square root of the expected squared error. We don't specify what kind of estimator tai_loc_offset is, e.g. a minimum-mean-squared-error estimate versus a maximum likelihood estimate, and we don't require that the estimator be unbiased. However, since rms_error is based on mean-squared error, using any estimator other than an MMSE estimator requires reporting a larger rms_error than one otherwise might be able to.
The leap_event and utc_tai_offset fields, along with the leap insertion/deletion flag bit, are used in converting from TAI to UTC and notifying of upcoming or recent leap events. leap_event gives the TAI timestamp as of the conclusion of the most recently announced leap second event, which might be in the future or might be in the past. utc_tai_offset gives the UTC-TAI offset as of the conclusion of that event, and the flag bit tells whether the leap event is/was an insertion or a deletion.
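As a hedged illustration of how these three pieces of information might be combined, the Python sketch below picks the UTC-TAI offset that applies at a given TAI instant. It assumes the field really is UTC minus TAI (so a negative number today), that an insertion lowers that value by one second and a deletion raises it by one, and it works in whole seconds. It deliberately does not try to label the inserted second itself (23:59:60), and it knows nothing about leap events older than the one announced. The function name is illustrative.

```python
def utc_minus_tai_at(tai_seconds: int, leap_event: int,
                     utc_tai_offset: int, insertion: bool) -> int:
    """UTC-TAI offset (seconds) in force at the given TAI time.

    leap_event     : TAI second at which the announced leap event concludes
    utc_tai_offset : UTC-TAI once that event has concluded
    insertion      : flag bit 1 (True = insertion, False = deletion)
    """
    if tai_seconds >= leap_event:
        return utc_tai_offset
    # Before the event, undo its effect: an insertion made UTC fall one
    # more second behind TAI, a deletion made it catch up by one.
    return utc_tai_offset + 1 if insertion else utc_tai_offset - 1
```

Adding the returned offset to a TAI second count gives the corresponding UTC second count for instants on either side of the event.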
The throttle and burst_throttle fields replace KoD RATE messages. Rather than the server sending KoDs to clients that query it too frequently (which burdens the server with maintaining a table to keep track of this), it simply communicates its policy on what query rate is acceptable, and clients are expected to follow it.
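A client honoring this policy might do something like the following Python sketch. Both helper names are hypothetical, and treating burst_throttle as a minimum spacing between consecutive packets is one reading of the field comment rather than something this sketch mandates.

```python
def steady_poll_interval(preferred_interval: float, throttle: int) -> float:
    """Long-run polling interval in seconds.

    Never faster than one packet per 2**throttle seconds on average;
    throttle is a signed exponent, so it may be negative.
    """
    return max(preferred_interval, 2.0 ** throttle)


def earliest_next_send(last_send: float, burst_throttle: int) -> float:
    """Earliest time the next packet may be sent to this server.

    Spacing consecutive packets at least 2**burst_throttle seconds apart
    guarantees no window of that length contains more than one packet.
    """
    return last_send + 2.0 ** burst_throttle
```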
The source_seqno field is part of NTPv5's loop-detection scheme. Its meaning and purpose will be explained in a later section.
Extensions have the same type-length-value format that they do in NTPv4:
struct {
uint16 type;
uint16 len;
padded body[pad(len)];
} Extension;
But, as throughout NTPv5, the length field gives the unpadded length rather than the padded one. Unlike in NTPv4, NTPv5 extension bodies have no minimum length since we're free of the syntactic ambiguities that force RFC 7822 to require one.
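A minimal Python sketch of walking a list of such extensions is shown below. It assumes big-endian integers as elsewhere in the TLS presentation language; the type numbers used in any example are made up, since this design defines no extension fields.

```python
import struct


def pad(length: int) -> int:
    """Round up to the next multiple of 4 octets."""
    return (length + 3) & ~3


def parse_extensions(buf: bytes):
    """Split the tail of a message body into (type, unpadded_body) pairs.

    Parsing stops exactly at the end of the buffer; anything that does
    not fit the type-length-value framing is rejected.
    """
    extensions = []
    pos = 0
    while pos < len(buf):
        if len(buf) - pos < 4:
            raise ValueError("truncated extension header")
        ext_type, length = struct.unpack_from("!HH", buf, pos)
        pos += 4
        end = pos + pad(length)  # len is unpadded; the body occupies pad(len)
        if end > len(buf):
            raise ValueError("extension body overruns the message")
        extensions.append((ext_type, buf[pos:pos + length]))
        pos = end
    return extensions
```

Note that a zero-length body is accepted, matching the absence of a minimum length.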
This design defines no extension fields. Extension fields are only one of two ways that NTPv5 can be extended; new message types can be defined as well. Generally, extension fields should be preferred only when the information they carry is somehow coupled to the time data in the same packet (such as, to pick a silly example, adding additional bits of precision to a timestamp). Otherwise it is probably better to define a new message type instead.
When the client sets flag bit 2 in a TimeRequest and the server sets bit 2 in its TimeResponse, that response should be immediately followed by a FollowupResponse which provides a corrected xmit stamp. The corrected stamp can be based on a drivestamp from the emitting NIC and can therefore be more accurate than the original.
struct {
Timestamp corrected_xmit;
} FollowupResponse;
NTPv5 uses a loop-detection mechanism based on Bloom filters that works for a large number of sources out to arbitrary degree. At startup, each server randomly assigns itself a 128-bit source_id. It then sends each of its upstream sources a SourceRequest message, which is just a bunch of padding with no meaningful fields.
struct {
opaque anti_amplification_padding[532];
} SourceRequest;
It gets back a SourceResponse from each one.
struct {
opaque source_id[16];
uint32 source_seqno;
opaque bloom_filter[512];
} SourceResponse;
Each source has populated a Bloom filter based on its own sources further upstream. Our newly-online server now builds a Bloom filter of its own. First, it takes the union (bitwise OR) of all the filters from upstream. Then it inserts the source_ids of its immediate upstream sources into the filter. Since source IDs are already uniformly random, the hash functions we employ can be very simple. Split the 128-bit source_id into ten 12-bit values h_0, h_1, …, h_9, discarding the last 8 bits. Then for each i, set the h_i'th bit in the filter.
Once the new server has computed its filter and finished synchronizing, it comes online with a source_seqno of 0. It then watches the source_seqnos in the time responses from its sources. Whenever one of them changes, it sends a new SourceRequest to that server, recomputes its filter, and increments its own source_seqno. However, if the new filter that comes back includes our own source_id, this indicates a probable timing loop and we should drop that source.
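The filter construction and the loop check can be sketched in Python as follows. Two details are assumptions not fixed by this sketch of the design: the source_id is read as a big-endian integer with h_0 taken from its most significant bits, and bit 0 of the filter is the most significant bit of its first octet. Function names are illustrative.

```python
FILTER_OCTETS = 512  # 4096 bits, so a 12-bit value indexes one bit


def bit_positions(source_id: bytes):
    """Ten 12-bit indices h_0..h_9 taken from the top 120 bits of the ID."""
    if len(source_id) != 16:
        raise ValueError("source_id must be 16 octets")
    value = int.from_bytes(source_id, "big")
    return [(value >> (128 - 12 * (i + 1))) & 0xFFF for i in range(10)]


def insert(filt: bytearray, source_id: bytes) -> None:
    """Set the ten bits selected by source_id."""
    for h in bit_positions(source_id):
        filt[h >> 3] |= 0x80 >> (h & 7)


def contains(filt: bytes, source_id: bytes) -> bool:
    """True if all ten bits are set (may be a false positive)."""
    return all(filt[h >> 3] & (0x80 >> (h & 7)) for h in bit_positions(source_id))


def build_filter(upstream) -> bytes:
    """upstream: iterable of (source_id, bloom_filter) from SourceResponses."""
    filt = bytearray(FILTER_OCTETS)
    for source_id, their_filter in upstream:
        if len(their_filter) != FILTER_OCTETS:
            raise ValueError("bloom_filter must be 512 octets")
        for i, octet in enumerate(their_filter):
            filt[i] |= octet          # union with the upstream filter
        insert(filt, source_id)       # then add the immediate source itself
    return bytes(filt)
```

A server would treat `contains(their_filter, own_source_id)` returning true for a freshly fetched filter as the signal to drop that source.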
The use of Bloom filters means that timing loops will always be detected but there will be occasional false positives. With the filter size chosen, we can scale to a few hundred transitive upstream sources before the probability of a false positive becomes non-negligible, which should be sufficient.
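To put rough numbers on that claim, the standard Bloom-filter approximation can be evaluated for this filter's parameters (4096 bits, ten bits set per source). The Python sketch below uses the usual textbook formula, which treats the bit positions as independent and uniform; it is an estimate, not part of the proposed protocol.

```python
import math


def false_positive_probability(n_sources: int, m_bits: int = 4096, k: int = 10) -> float:
    """Approximate chance that an ID never inserted appears to be present.

    After n insertions of k bits each into m bits, a given bit is set with
    probability about 1 - exp(-k*n/m); a false positive needs k such bits.
    """
    fill = 1.0 - math.exp(-k * n_sources / m_bits)
    return fill ** k
```

Under this approximation the probability is on the order of 2e-7 with 100 transitive sources and on the order of 1e-3 with 300, which is consistent with the "few hundred" figure.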
Endpoints that act only as clients and not as servers don't have to bother doing any of this since they can't possibly be involved in timing loops.
The Correction structure is deliberately not encrypted or authenticated, and is intended for manipulation by middleboxes. It is similar in function to PTP's transparent clocks.
Packets which contain correction fields must also include a Router Alert IP option (RFC 2113 & 2711) in order to signal to middleboxes that they want them modified. Middleboxes must rely strictly on this option (and not merely on something looking like a syntactically-valid NTPv5 packet going over the NTP port) to detect that they are handling NTPv5 traffic and that such modifications are desired.
struct {
Timestamp correction;
uint32 path_crc;
} Correction;
correction is initially 0 when time requests initiate from the client. Compliant middleboxes increment it by the amount of time that the request spends queued. The receiving NTP server copies the correction field it receives into the response. Middleboxes on the return path decrement the response's correction field by the amount of time that the response spends queued.
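The bookkeeping described here can be traced with a small Python sketch. Queuing delays are given in nanoseconds, and the function name and list-based interface are purely illustrative.

```python
def correction_after_round_trip(request_queue_delays, response_queue_delays) -> int:
    """Value of the correction field as seen by the client, in ns."""
    correction = 0                       # client sends the request with 0
    for delay in request_queue_delays:   # each middlebox on the way out adds
        correction += delay
    # The server copies the field unchanged into its response.
    for delay in response_queue_delays:  # each middlebox on the way back subtracts
        correction -= delay
    return correction
```

The client therefore ends up with the difference between outbound and return queuing time, that is, a measure of queuing asymmetry rather than of total queuing delay.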
path_crc is initially 0 when time requests initiate from the client. Compliant middleboxes self-assign an identifier at random and add their identifier to the CRC (where "add" here means the CRC group operation). The receiving NTP server copies the path_crc field it receives into the response. Middleboxes on the return path subtract their self-assigned identifier from the path_crc. The client receiving the response verifies that the path_crc is 0. If not, there is a routing asymmetry and the correction should be discarded.
Correction fields can, obviously, be maliciously altered by a MitM in order to cause the client to obtain a bogus time estimate. Clients which pay attention to correction fields must set a limit on the largest correction they will accept. A correction whose absolute value is in excess of one half the measured RTT is definitely bogus and should always be rejected, but clients may choose to set bounds tighter than this.
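A client-side acceptance check along these lines might look like the following Python sketch. Values are integer nanoseconds; the function name and the optional `tighter_limit_ns` parameter are hypothetical, the latter standing in for whatever stricter local policy a client chooses.

```python
def correction_is_acceptable(correction_ns: int, path_crc: int, rtt_ns: int,
                             tighter_limit_ns=None) -> bool:
    """Decide whether a received correction may be used at all."""
    if path_crc != 0:
        # Request and response did not traverse the same middleboxes.
        return False
    if 2 * abs(correction_ns) > rtt_ns:
        # More than half the measured round trip: definitely bogus.
        return False
    if tighter_limit_ns is not None and abs(correction_ns) > tighter_limit_ns:
        return False
    return True
```

When the check fails, the client can still use the time response itself; it merely ignores the correction field.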