NTPv5 Design Sketch

Overview

ΤΗ ΚΑΛΛΙΣΤΗΙ! ("to the fairest!")

This is a sketch of a proposed NTPv5 design (by no means a complete spec, but hopefully good enough to make the concepts clear). It's a fairly ambitious step forward from previous versions, almost but not quite a green-field design. I say "almost" because it retains a couple of limited backward-compatibility constraints:

  1. It must be possible to cleanly multiplex NTPv5 with NTPv4 (and earlier) on the same port. Basically this just means keeping the version field in the same place.

  2. It must be possible to discipline your clock from a mix of NTPv5 and NTPv4 sources.

That said, the reader should still find much that is familiar. We've learned a lot in the nearly 35 years that NTP has been in operation, but plenty of those lessons have been about what works and not just about what doesn't!

This design includes all the goodies — just about everything from the Trac wishlist is addressed here. Some highlights follow:

  • The packet structure has been completely reworked, aside from retaining the version field as previously mentioned.

  • Network Time Security is a mandatory and integral part of the protocol. NTS-related fields are part of the base packet rather than extension fields. All information that can be encrypted is, and clients do not send any information to servers beyond what is strictly necessary for the server to construct a response.

  • All communication in NTPv5 is unicast between a stateful client and a stateless server. Behavior analogous to symmetric mode of NTPv4 is implemented in NTPv5 by two endpoints each functioning as a client of the other. The broadcast mode has been eliminated.

  • NTPv4's interleaved timestamp functionality is replaced by a distinct "follow-up" message, similar to the message of the same name in PTP.

  • NTPv4's "refid" mechanism, intended to detect one-degree timing loops, is replaced by a new mechanism based on Bloom filters that works out to arbitrary degree.

  • Timestamps are now based on the TAI timescale, and time packets carry a UTC-TAI offset and more detailed information about recent or upcoming leap seconds.

  • I adopt PTP's 80-bit timestamp format, extended with an additional 16 bits of sub-nanosecond fraction.

  • Time packets carry both an absolute clock (subject to step adjustments) and a (stable) difference clock.

  • Time packets carry an unencrypted and unauthenticated correction field intended for manipulation by middleboxes. The function of this field is analogous to PTP's concept of transparent clocks. We define a value for the Router Alert IP option to signal to middleboxes that this behavior is desired.

  • Methods for selecting among time sources, filtering noise, and disciplining the local clock are not discussed in this sketch, and I propose that they no longer be a normative part of the standard. We specify only as much as is necessary to define protocol semantics and to ensure global stability among heterogeneous implementations.

Additions to NTS-KE

NTS-KE works the same way for NTPv5 as it does for NTPv4. We allocate a new NTS Next Protocol ID to represent NTPv5 and define new NTS-KE record types "New Cookie For NTPv5", "NTPv5 Server Negotiation", and "NTPv5 Port Negotiation" whose structure and semantics are identical to their NTPv4 counterparts.

We add one additional NTS-KE record, "Request Address-Bound NTPv5 Cookies", whose body is empty and whose critical bit is unset. When the client includes this record in an NTS-KE request, it is asking that the cookies that the server returns be cryptographically bound to the client's network address, such that the NTP server will reject them as invalid if the client sends them from any other address. Normally such a restriction is undesirable, because it interacts poorly with mobile clients and certain NATs and therefore would lead to a less reliable protocol. However, it makes possible certain features which would otherwise be unsafe.

Most of NTPv5 is designed to prevent DDoS amplification: responses are never larger than the requests they are responding to. However, the "follow-up" feature is an exception to this, since the server sends back two packets in response to a single packet from the client. Address-bound cookies prevent this from being abused, due to the difficulty of obtaining a cookie bound to a spoofed address. Servers will be mandated to send follow-up messages only to clients which present address-bound cookies, and clients will be advised to request address-bound cookies if and only if they plan to utilize the follow-up message feature.

Preliminaries on packet structure

Throughout this document I'll be using the TLS presentation language to describe packet structure; see RFC 8446, section 3. Of the two array styles specified there (angle brackets for arrays preceded by their length, square brackets for ones that are not), I use only the [] style, and the corresponding length field is always explicit elsewhere in the message.

To be friendly to hardware implementations, all variable-length fields are padded out to a multiple of 4 octets; the length field gives the unpadded length, which is then rounded up to the next multiple of four to obtain the length inclusive of padding. We'll use the following type to conveniently represent an opaque field that has padding at the end of it:

struct {
  opaque data[4]; // one 4-octet unit; a "padded" array is therefore always a multiple of 4 octets long
} padded;
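
To make the rounding rule concrete, here is a minimal sketch in Python. The pad() helper is just the notation used throughout this document's grammar, not a wire element:

def pad(n: int) -> int:
    """Round an unpadded length up to the next multiple of 4 octets."""
    return (n + 3) & ~3

# pad(0) == 0, pad(1) == 4, pad(4) == 4, pad(5) == 8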

Furthermore, all 1, 2, and 4-octet fields are generally arranged so as to be self-aligned, and longer fields are arranged to have 4-octet alignment. Timestamps, which follow the format of PTPv2, are a slight exception to this:

struct {
  int48 seconds; // SI seconds, has a range of about ±4.46 million years
  uint32 nanoseconds; // SI nanoseconds, ranging [0..999999999]
  uint16 fracs; // 1/65536ths of a nanosecond
} Timestamp;

Although they partly follow the rule by being 12 octets long and aligned to 4 octets everywhere they occur, the uint32 nanoseconds field has odd alignment.

In any case, the padding rules are exactly what is specified in this grammar; there is no implicit additional padding.
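
As an illustration, a Timestamp could be packed and unpacked as follows. This is a sketch of my own, assuming big-endian encoding and a two's-complement int48, consistent with the TLS presentation language:

import struct

def encode_timestamp(seconds: int, nanoseconds: int, fracs: int) -> bytes:
    """Pack a 12-octet Timestamp: int48 seconds, uint32 nanoseconds, uint16 fracs."""
    assert -(1 << 47) <= seconds < (1 << 47)
    assert 0 <= nanoseconds <= 999_999_999 and 0 <= fracs <= 0xFFFF
    return ((seconds & ((1 << 48) - 1)).to_bytes(6, "big")
            + struct.pack(">IH", nanoseconds, fracs))

def decode_timestamp(buf: bytes) -> tuple[int, int, int]:
    seconds = int.from_bytes(buf[0:6], "big")
    if seconds >= 1 << 47:  # sign-extend the int48
        seconds -= 1 << 48
    nanoseconds, fracs = struct.unpack(">IH", buf[6:12])
    return seconds, nanoseconds, fracs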

Top-level packet structure

Every NTPv5 packet has the same top-level format: two fixed octets giving the version number, a two-octet packet type, and then a type-dependent body.

enum {
  request(0),
  response(1),
  request_with_correction(2),
  response_with_correction(3),
  nts_nak(4),
  (65535)
} PacketType;

struct {
  /* Keep this octet fixed like this for all future versions. NTPv6
     and beyond should continue with VN=5 here, and give their actual
     version in the next octet. */
  uint8 legacy_li_vn_mode = 0xe8; // LI = 3 (unsync), VN = 5, Mode = 0

  /* Version field gets a whole octet from now on rather than just 3 bits */
  uint8 version = 5;

  PacketType packet_type;
  
  select(Packet.packet_type) {
    case request:                  NtsData;
    case response:                 NtsData;
    case request_with_correction:  NtsDataWithCorrection;
    case response_with_correction: NtsDataWithCorrection;
    case nts_nak:                  NtsNak;
  } body;
} Packet;

struct {
  uint8 cookie_length; // Always zero in response packets
  uint8 nonce_length;
  uint16 ciphertext_length;
  opaque unique_identifier[32];
  padded cookie[pad(cookie_length)]; // Always zero-length in response packets
  padded nonce[max(16, pad(nonce_length))]; // Padded up to a minimum of 16 octets
  padded ciphertext[pad(ciphertext_length)];
} NtsData;

struct {
  Correction correction; 
  NtsData nts_data;
} NtsDataWithCorrection;

struct {
  opaque unique_identifier[32];
} NtsNak;

The Packet structure is a cryptographic envelope: except for the Correction structure (fully defined later), which is intentionally manipulable by middleboxes, it exposes only as much information as is needed for the receiver to locate, authenticate, and decrypt the ciphertext.

If you are already familiar with NTS for NTPv4, then the meaning of the fields of NtsData should be self-explanatory. While they are now syntactically part of the base packet rather than being scattered across NTS extension fields, they have basically the same semantics as those corresponding extensions. There is only a slight difference in how the associated data for RFC 5116 is constructed. In NTPv4, the Associated Data is whatever appears on the wire from the start of the packet to the start of the NTS Authenticator and Encrypted Extensions Fields extension field. In NTPv5, the associated data is exactly the following structure:

struct {
  uint8 legacy_li_vn_mode = 0xe8;
  uint8 version = 5;
  PacketType packet_type;
  uint8 cookie_length;
  uint8 nonce_length;
  uint16 ciphertext_length;
  opaque unique_identifier[32];
  padded cookie[pad(cookie_length)];
} NtsAd;

The NtsAd structure is a pseudo-type which never crosses the wire as such, but is formed by copying the corresponding fields out of Packet and NtsData. Authenticating some of these fields is redundant, but helps guard against certain implementation mistakes.
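
For concreteness, a sketch of forming those associated-data octets (a hypothetical helper of my own; it reuses the pad() function from earlier and assumes big-endian integer encoding):

def build_nts_ad(packet_type: int, unique_identifier: bytes, cookie: bytes,
                 nonce_length: int, ciphertext_length: int) -> bytes:
    """Serialize the NtsAd pseudo-structure for use as RFC 5116 associated data."""
    assert len(unique_identifier) == 32
    ad = bytes([0xe8, 5])                          # legacy_li_vn_mode, version
    ad += packet_type.to_bytes(2, "big")           # PacketType
    ad += bytes([len(cookie), nonce_length])       # cookie_length, nonce_length
    ad += ciphertext_length.to_bytes(2, "big")
    ad += unique_identifier
    ad += cookie.ljust(pad(len(cookie)), b"\x00")  # cookie plus zero padding
    return ad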

Decrypting the ciphertext gives a structure of type NtsPlaintext:

enum {
  time_request(0),
  time_response(1),
  followup_response(2),
  source_request(3),
  source_response(4),
  (65535)
} MessageType;

struct {
  MessageType msg_type;
  uint8 cookie_len;
  uint8 num_cookies;
  /* num_cookies cookies, each with an unpadded length of cookie_len, and
     padding added to the end of each individual cookie. */
  padded cookies[num_cookies * pad(cookie_len)];
  select (NtsPlaintext.msg_type) {
    case time_request:          TimeRequest;
    case time_response:         TimeResponse;
    case source_request:        SourceRequest;
    case source_response:       SourceResponse;
    case followup_response:     FollowupResponse;
  } body;
} NtsPlaintext;

Here we have a bit more machinery for NTS cookies, and a message type and corresponding body where all the information actually related to time synchronization lives. In request packets, the cookies field contains placeholders, and in response packets it contains fresh cookies being provided to the client. Again, while I have made the syntactic change of moving these from extension fields into the base packet, the semantics are the same as they are in NTPv4.
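
A sketch of how a receiver might split the decrypted plaintext (a hypothetical parser; error handling omitted, pad() as before):

def parse_nts_plaintext(pt: bytes) -> tuple[int, list[bytes], bytes]:
    """Return (msg_type, cookies, body) from a decrypted NtsPlaintext."""
    msg_type = int.from_bytes(pt[0:2], "big")     # MessageType
    cookie_len, num_cookies = pt[2], pt[3]
    cookies, off = [], 4
    for _ in range(num_cookies):
        cookies.append(pt[off:off + cookie_len])  # strip per-cookie padding
        off += pad(cookie_len)
    return msg_type, cookies, pt[off:]            # body is type-dependent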

Time messages

Unless extensions are present, the body of a time request has only one bit of any interest; the rest is just anti-amplification padding. When bit 2 of the flags octet is set, the client is requesting that the server send a follow-up packet to provide a drivestamp (see "Follow-up messages" below).

struct {
  uint8 flags; /* bit 2: set = follow-up requested
                  bit 0-1, 3-7: unused, MUST be clear */
  opaque anti_amplification_padding[103]; // brings the fixed portion to 104 octets,
                                          // matching TimeResponse's fixed portion
  Extension extensions[]; // Each extension is self-delimiting, and we're at
                          // the end of the extension list when we're at the
                          // end of the structure
} TimeRequest;

Now finally we reach the meat: the TimeResponse structure where all the core time synchronization information lives. Inline comments briefly describe each field and then I'll circle back to highlight some details.

struct {
  uint8 flags; /* bit 0: set = synchronized, clear = unsynchronized
                  bit 1: set = leap insertion, clear = leap deletion
                  bit 2: set = expect follow-up
                  bit 3-7: unused, MUST be clear */

  /* Same meaning as in NTPv4 */
  int8 precision;

  /* Please don't send me an average of more than one packet per
     2**throttle seconds once you've stabilized */
  int8 throttle;
  
  /* Please don't send me more than one packet over any
     2**burst_throttle second interval */
  int8 burst_throttle;

  /* Randomly-generated value that identifies this server */
  opaque source_id[16];

  /* Incremented whenever we change upstream sources or whenever one
    of our upstream sources changes its own source_seqno. */
  uint32 source_seqno;

  /* These have no specified epoch; they get set arbitrarily on
     startup and never step. They may receive frequency discipline but
     never offset discipline (cf. RAD clocks). */
  Timestamp recv;
  Timestamp xmit;

  /* Add this amount to convert the recv and xmit timestamps to a TAI
     timestamp (relative to midnight 1970-01-01 TAI). This estimate
     gets an immediate step adjustment any time we receive a new data
     point. */
  Timestamp tai_loc_offset;

  /* Error estimates for tai_loc_offset (replaces root
     delay/dispersion) */
  Timestamp max_error;
  Timestamp rms_error;
  
  /* The TAI time of the last-announced leap event, which might be
     past or future. Flag bit 1 says whether it's an insertion or
     deletion */
  Timestamp leap_event;
  
  /* UTC-TAI offset after the completion of the above event */
  int32 utc_tai_offset;

  /* Counts the number of leap events that have ever historically
     occurred */
  uint32 leap_seqno;
  
  Extension extensions[]; // Each extension is self-delimiting, and we're at
                          // the end of the extension list when we're at the
                          // end of the structure
} TimeResponse;

The recv and xmit timestamps are captured at the same point as they are in NTPv4 (the recv timestamp the moment the request is received, the xmit timestamp as nearly as possible to the moment the response is sent). However, these timestamps don't have a defined epoch so they don't actually tell you the time as such: they're just timers that advance by one second per second of real time — difference clocks as opposed to absolute clocks. To get an absolute timestamp, you have to add tai_loc_offset to them.

tai_loc_offset represents the server's best available estimate of the offset between its local monotonic clock (what it samples when it captures recv and xmit timestamps) and TAI. It's allowed to jump around every time the server gets a new data point that improves its estimate. So you can't rely on xmit + tai_loc_offset to be at all stable; it can even move backward. In any computation that assumes stable monotonic progression, just use recv and xmit by themselves.
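
To make that concrete, here is a minimal sketch of the familiar on-wire calculation done this way (float seconds and names are my own; the spec deliberately doesn't mandate any particular estimator or discipline algorithm):

def estimate(t1: float, t2: float, t3: float, t4: float,
             tai_loc_offset: float) -> tuple[float, float]:
    """Estimate (offset of TAI relative to the client's clock, round-trip delay).
    t1/t4 are client timestamps at request send / response receipt;
    t2/t3 are the server's recv/xmit difference-clock timestamps."""
    offset = ((t2 - t1) + (t3 - t4)) / 2 + tai_loc_offset
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay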

While recv and xmit are guaranteed to never receive step adjustments, there is no guarantee about any higher derivatives. For example if a server thinks its clock is 5ppm slow, it can immediately speed it up by 5ppm; it does not have to gradually accelerate.

If the source_id field has changed from one time response to another, then recv and xmit timestamps taken before the change are no longer comparable to those taken after it. The server may have rebooted and lost its clock.

Since we've adopted PTPv2's timestamp format, we also adopt its epoch of January 1, 1970 TAI rather than the NTPv4 epoch of 1900.

max_error gives a maximum error bound on tai_loc_offset, and rms_error gives the square root of the expected squared error.

We don't specify what kind of estimator tai_loc_offset is, e.g. a minimum-mean-squared-error estimate versus a maximum likelihood estimate, and we don't require that the estimator be unbiased. However, since rms_error is based on mean-squared error, using any estimator other than a MMSE estimator requires reporting a larger rms_error than one otherwise might be able to.

The leap_event and utc_tai_offset fields along with the leap insertion/deletion flag bit are used in converting from TAI to UTC and notifying of upcoming or recent leap events. leap_event gives the TAI timestamp as of the conclusion of the most recently-announced leap second event, which might be in the future or might be in the past. utc_tai_offset gives the UTC-TAI offset as of the conclusion of that event, and the flag bit tells whether the leap event is/was an insertion or a deletion.
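
In other words, a client can recover the UTC-TAI offset in effect at any given TAI instant. A sketch under those field semantics (ignoring the instant inside an inserted leap second, which UTC cannot label uniquely):

def utc_tai_at(t_tai: float, leap_event: float, utc_tai_offset: int,
               insertion: bool) -> int:
    """UTC-TAI offset, in seconds, in effect at TAI time t_tai."""
    if t_tai >= leap_event:
        return utc_tai_offset          # the announced event has concluded
    # Before the event, undo its effect: an insertion lowers UTC-TAI by one.
    return utc_tai_offset + 1 if insertion else utc_tai_offset - 1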

The throttle and burst_throttle fields replace KoD RATE messages. Rather than the server sending KoDs to clients that query it too frequently (which burdens the server with maintaining a table to keep track of this), it simply communicates its policy on what query rate is acceptable and clients are expected to follow it.
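
A compliant client might schedule its queries like this (a hypothetical sketch; the randomization is my own choice, to avoid synchronized query storms, and is not part of the design):

import random

def next_query_delay(throttle: int, burst_throttle: int) -> float:
    """Seconds to wait before the next query: average at most one packet
    per 2**throttle seconds, and at least 2**burst_throttle seconds
    between any two packets."""
    mean = 2.0 ** throttle
    return max(2.0 ** burst_throttle, random.uniform(0.75 * mean, 1.25 * mean))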

The source_seqno field is part of NTPv5's loop-detection scheme. Its meaning and purpose will be explained in a later section.

Extensions have the same type-length-value format that they do in NTPv4:

struct {
  uint16 type;
  uint16 len;
  padded body[pad(len)];
} Extension;

But, as throughout NTPv5, the length field gives the unpadded length rather than the padded one. Unlike in NTPv4, NTPv5 extension bodies have no minimum length since we're free of the syntactic ambiguities that force RFC 7822 to require one.

This design defines no extension fields. Extension fields are only one of two ways that NTPv5 can be extended; new message types can be defined as well. Generally, extension fields should be preferred only when the information they carry is somehow coupled to the time data in the same packet (such as, to pick a silly example, adding additional bits of precision to a timestamp). Otherwise it is probably better to define a new message type instead.

Follow-up messages

When the client sets flag bit 2 in its TimeRequest and the server sets bit 2 in its TimeResponse, the response should be immediately followed by a FollowupResponse which provides a corrected xmit stamp. The corrected stamp can be based on a drivestamp from the emitting NIC and can therefore be more accurate than the original.

struct {
  Timestamp corrected_xmit;
} FollowupResponse;

Source messages and loop detection

NTPv5 uses a loop-detection mechanism based on Bloom filters that works for a large number of sources out to arbitrary degree. At startup, each server randomly assigns itself a 128-bit source_id. It then sends each of its upstream sources a SourceRequest message — just a bunch of padding, no meaningful fields.

struct {
  opaque anti_amplification_padding[532]; // matches the 532-octet SourceResponse
} SourceRequest;

It gets back a SourceResponse from each one.

struct {
  opaque source_id[16];
  uint32 source_seqno;
  opaque bloom_filter[512];
} SourceResponse;

Each source has populated a Bloom filter based on its own sources further upstream. Our newly-online server now builds a Bloom filter of its own. First, it takes the union (bitwise OR) of all the filters from upstream. Then it inserts the source_ids of its immediate upstream sources into the filter. Since source IDs are already uniformly random, the hash functions we employ can be very simple: split the 128-bit source_id into ten 12-bit values h_0, h_1, …, h_9, discarding the last 8 bits, and then for each i, set the h_i'th bit in the filter.
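
A sketch of the filter operations (my own illustration; it assumes the ten 12-bit slices are taken from the most-significant end of source_id and that bits are numbered little-endian within each filter octet, neither of which this sketch pins down normatively):

def bloom_insert(filt: bytearray, source_id: bytes) -> None:
    """Set the ten bits chosen by a 128-bit source_id in a 4096-bit filter."""
    v = int.from_bytes(source_id, "big")
    for i in range(10):
        h = (v >> (116 - 12 * i)) & 0xFFF      # h_0..h_9; final 8 bits unused
        filt[h >> 3] |= 1 << (h & 7)

def bloom_contains(filt: bytes, source_id: bytes) -> bool:
    """True if every bit for source_id is set: a probable timing loop."""
    v = int.from_bytes(source_id, "big")
    return all(filt[((v >> (116 - 12 * i)) & 0xFFF) >> 3]
               & (1 << (((v >> (116 - 12 * i)) & 0xFFF) & 7))
               for i in range(10))

def union(filters: list[bytes]) -> bytearray:
    """Bitwise OR of upstream filters, the starting point for our own."""
    out = bytearray(512)
    for f in filters:
        out = bytearray(a | b for a, b in zip(out, f))
    return out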

Once the new server has computed its filter and finished synchronizing, it comes online with a source_seqno of 0. It then watches the source_seqnos in the time responses from its sources. Whenever one of them changes, it sends a new SourceRequest to that server, recomputes its filter, and increments its own source_seqno. However, if the new filter that comes back includes our own source_id, this indicates a probable timing loop and we should drop that source.

The use of Bloom filters means that timing loops will always be detected but there will be occasional false positives. With the filter size chosen, we can scale to a few hundred transitive upstream sources before the probability of a false positive becomes non-negligible, which should be sufficient.
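
The standard Bloom filter estimate bears this out: with m = 4096 bits and k = 10 bits set per entry, the false-positive probability after n insertions is approximately (1 - e^(-kn/m))^k.

import math

def false_positive_rate(n: int, m: int = 4096, k: int = 10) -> float:
    return (1.0 - math.exp(-k * n / m)) ** k

# false_positive_rate(100) ≈ 2e-7
# false_positive_rate(300) ≈ 1.4e-3
# false_positive_rate(500) ≈ 3e-2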

Endpoints that act only as clients and not as servers don't have to bother doing any of this since they can't possibly be involved in timing loops.

Correction fields

The Correction structure is deliberately not encrypted or authenticated, and is intended for manipulation by middleboxes. It is similar in function to PTP's transparent clocks.

Packets which contain correction fields must also include a Router Alert IP option (RFCs 2113 and 2711) in order to signal to middleboxes that they want them modified. Middleboxes must rely strictly on this option (and not merely on something that looks like a syntactically valid NTPv5 packet going over the NTP port) to determine that they are handling NTPv5 traffic and that such modifications are desired.

struct {
  Timestamp correction;
  uint32 path_crc;
} Correction;

correction is initially 0 when a time request leaves the client. Compliant middleboxes increment it by the amount of time that the request spends queued. The receiving NTP server copies the correction field it receives into the response. Middleboxes on the return path decrement the response's correction field by the amount of time that the response spends queued.

path_crc is initially 0 when a time request leaves the client. Compliant middleboxes self-assign an identifier at random and add their identifier to the CRC (where "add" here means the CRC group operation). The receiving NTP server copies the path_crc field it receives into the response. Middleboxes on the return path subtract their self-assigned identifier from the path_crc. The client receiving the response verifies that the path_crc is 0. If not, there is a routing asymmetry and the correction should be discarded.

Correction fields can, obviously, be maliciously altered by a MitM in order to cause the client to obtain a bogus time estimate. Clients which pay attention to correction fields must set a limit on the largest correction they will accept. A correction whose absolute value is in excess of one half the measured RTT is definitely bogus and should always be rejected, but clients may choose to set bounds tighter than this.
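
Here's a sketch of both behaviors (my own illustration; it assumes the CRC group operation is XOR over the 32-bit values, which is self-inverse, so adding and subtracting an identifier are the same operation):

def middlebox_update(correction_ns: int, path_crc: int, ident: int,
                     queued_ns: int, outbound: bool) -> tuple[int, int]:
    """Apply a compliant middlebox's update to (correction, path_crc)."""
    correction_ns += queued_ns if outbound else -queued_ns
    return correction_ns, path_crc ^ ident        # XOR: add == subtract

def client_check(correction_ns: int, path_crc: int, rtt_ns: int) -> bool:
    """Accept the correction only if the path was symmetric and plausible."""
    return path_crc == 0 and abs(correction_ns) <= rtt_ns // 2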
