@ericmj
Last active March 14, 2016 14:01
Hex new registry formats

A binary format for Hex's registry file, designed for compactness, easy extension and backwards compatibility.

The file begins with the header <<72, 69, 88>> (the ASCII bytes "HEX"). Following the header are a number of sections. A section has a defined layout made of fields; a field is a single named value whose type defines the field's size and how to interpret it. All sections begin with a single INT8 field signifying the type of section, followed by an INT16 field specifying the length of the section, excluding the type and length fields themselves.

New fields may be added to the end of a section in the future, so a client should only read the fields of a section that it knows about. If data is left in the section after the client has read the known fields, that data should be skipped until the end of the section. This way new fields can be added to sections without breaking existing clients.

Additionally, to ensure backwards compatibility, clients should skip section types that they do not know.
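The framing rules above can be sketched as a small section splitter. This is only an illustration: the module and function names are hypothetical, and big-endian integers are assumed because that is the default of the Elixir bit syntax used to define the types.

```elixir
defmodule RegistrySketch.Sections do
  # Hypothetical module; only the byte layout comes from the text above.

  # The file starts with the bytes <<72, 69, 88>>, the ASCII string "HEX".
  def split(<<72, 69, 88, rest::binary>>), do: sections(rest, [])
  def split(_other), do: {:error, :bad_header}

  # No bytes left: return the sections in file order.
  defp sections(<<>>, acc), do: {:ok, Enum.reverse(acc)}

  # INT8 type, INT16 length, then `length` bytes of body.
  defp sections(<<type, length::integer-size(16), body::binary-size(length), rest::binary>>, acc),
    do: sections(rest, [{type, body} | acc])

  # Anything else is a section that ends before its declared length.
  defp sections(_truncated, _acc), do: {:error, :truncated_section}

  # Keep only the section types this client understands; unknown types are skipped.
  def keep_known(sections, known_types),
    do: Enum.filter(sections, fn {type, _body} -> type in known_types end)
end
```

Because every section carries its own length, a client can step over a section without understanding its body, which is what makes skipping unknown types safe.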

Immediately following the header is the INIT section. Unlike the other sections, it appears only once. It holds the registry version and the log file's rebuild counter, size and checksum; see the log file endpoint specification for details. The major version is incremented for breaking changes, the minor version for non-breaking changes. After the INIT section there is no inherent ordering between sections, except that a DEPENDENCY section belongs to the most recently defined RELEASE, which belongs to the most recently defined PACKAGE. Other section types may be interspersed between these sections.

If an unsupported major version is found the client should stop and inform the user.
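The nesting rule described above (a DEPENDENCY belongs to the most recent RELEASE, which belongs to the most recent PACKAGE) amounts to a single pass over the sections. The sketch below assumes the sections have already been decoded into hypothetical tagged tuples; the tuple shapes and the module name are illustrative and not part of the format.

```elixir
defmodule RegistrySketch.Nesting do
  # Hypothetical tuple shapes standing in for decoded sections:
  #   {:package, id, name}, {:release, version}, {:dependency, package_id, requirement}
  # Any other tuple stands for a section type that may be interspersed.

  def group(sections) do
    sections
    |> Enum.reduce([], &attach/2)
    |> Enum.reverse()
    |> Enum.map(&finish/1)
  end

  # A PACKAGE starts a new entry.
  defp attach({:package, id, name}, acc), do: [%{id: id, name: name, releases: []} | acc]

  # A RELEASE belongs to the most recently defined PACKAGE.
  defp attach({:release, version}, [pkg | rest]),
    do: [%{pkg | releases: [%{version: version, deps: []} | pkg.releases]} | rest]

  # A DEPENDENCY belongs to the most recently defined RELEASE.
  defp attach({:dependency, dep_id, req}, [%{releases: [rel | rels]} = pkg | rest]),
    do: [%{pkg | releases: [%{rel | deps: [{dep_id, req} | rel.deps]} | rels]} | rest]

  # Other section types (and orphaned sections) do not affect the nesting.
  defp attach(_other, acc), do: acc

  # Restore file order for releases and dependencies.
  defp finish(pkg) do
    releases = for rel <- Enum.reverse(pkg.releases), do: %{rel | deps: Enum.reverse(rel.deps)}
    %{pkg | releases: releases}
  end
end
```

Since ownership is implied by position, RELEASE sections do not need to repeat the package id, which keeps the file compact.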

Below are the section layouts:

INIT <<0>>
  INT16         (length of section)
  INT8          (version number major)
  INT8          (version number minor)
  INT16         (log file rebuild counter)
  INT32         (log file size in bytes)
  BIN[64]       (log file checksum)

MIX CLIENT RELEASE <<1>> - Used by the Mix client to inform users of new releases
  INT16         (length of section)
  STRING        (version)
  STRING        (minimum elixir version)

PACKAGE <<2>> - Package definition
  INT16         (length of section)
  INT32         (package id) (1.)
  STRING        (package name)

RELEASE <<3>> - A package release
  INT16         (length of section)
  STRING        (version)
  BIN[32]       (checksum)
  ARRAY[STRING] (build tools)

DEPENDENCY <<4>> - A dependency of a package release
  INT16         (length of section)
  INT32         (package id)
  STRING        (requirement)
  STRING        (application name)
  BOOL          (optional)

The types:

INT8:        <<value>>
INT16:       <<value::integer-size(16)>>
INT32:       <<value::integer-size(32)>>
STRING:      <<size, value::binary-size(size)>> (2.)
BIN[size]:   <<value::binary-size(size)>>
BOOL:        <<0>> | <<1>>
ARRAY[type]: <<num_elements, ...>> (3.)

All integers are unsigned.

  1. Package ids are unique for the repository hosting the registry.

  2. Strings are UTF-8 encoded and have a maximum size of 255 bytes.

  3. An array starts with a single INT8 specifying the number of elements in the array, so an array has at most 255 elements. Following the INT8 are the elements of the array. The elements' type is given by type in ARRAY[type].
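As an illustration of the types and layouts above, the following sketch decodes a few field types and uses them on the bodies of a RELEASE and an INIT section (the bytes after the type and length fields). All module and function names are hypothetical, big-endian integers are assumed, and error handling for malformed input is left out.

```elixir
defmodule RegistrySketch.Types do
  # Hypothetical helpers; each field decoder returns {value, remaining_bytes}.

  # STRING: one size byte followed by that many bytes of UTF-8.
  def string(<<size, value::binary-size(size), rest::binary>>), do: {value, rest}

  # ARRAY[STRING]: one count byte followed by that many STRINGs.
  def string_array(<<count, rest::binary>>), do: take_strings(count, rest, [])

  defp take_strings(0, rest, acc), do: {Enum.reverse(acc), rest}

  defp take_strings(n, bytes, acc) do
    {value, rest} = string(bytes)
    take_strings(n - 1, rest, [value | acc])
  end

  # BOOL: a single byte, 0 or 1.
  def bool(<<0, rest::binary>>), do: {false, rest}
  def bool(<<1, rest::binary>>), do: {true, rest}

  # RELEASE body. Bytes after the known fields are ignored on purpose,
  # so fields added in the future do not break this decoder.
  def release(body) do
    {version, rest} = string(body)
    <<checksum::binary-size(32), rest::binary>> = rest
    {build_tools, _future_fields} = string_array(rest)
    %{version: version, checksum: checksum, build_tools: build_tools}
  end

  # INIT body. An unsupported major version stops decoding.
  def init(<<major, minor, rebuild::integer-size(16), size::integer-size(32),
             checksum::binary-size(64), _future_fields::binary>>, supported_major) do
    if major == supported_major do
      {:ok, %{version: {major, minor}, rebuild_counter: rebuild, log_size: size, log_checksum: checksum}}
    else
      {:error, {:unsupported_major_version, major}}
    end
  end
end
```

Returning the remaining bytes from each decoder lets fields be read in order, and whatever is left after the last known field can simply be dropped.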

This is the log file format for Hex's registry. Like the registry file format, it is built from sections; read the registry proposal for the definition of sections. The log file is append-only and should be used by clients to do efficient, incremental updates of their local registry.

ADD DEPENDENCY belongs to the latest ADD RELEASE.

Below are the section layouts:

ADD MIX CLIENT RELEASE <<0>> - Used by the Mix client to inform users of new releases
  INT16         (length of section)
  STRING        (version)
  STRING        (minimum elixir version)

ADD PACKAGE <<1>> - Package definition
  INT16         (length of section)
  INT32         (package id)
  STRING        (package name)

ADD RELEASE <<2>> - A package release
  INT16         (length of section)
  INT32         (package id)
  STRING        (version)
  BIN[32]       (checksum)
  ARRAY[STRING] (build tools)

ADD DEPENDENCY <<3>> - A dependency of a package release
  INT16         (length of section)
  INT32         (package id) (1.)
  STRING        (requirement)
  STRING        (application name)
  BOOL          (optional)

REMOVE MIX CLIENT RELEASE <<4>>
  INT16         (length of section)
  STRING        (version)

REMOVE PACKAGE <<5>>
  INT16         (length of section)
  INT32         (package id)

REMOVE RELEASE <<6>>
  INT16        (length of section)
  INT32        (package id)
  STRING       (version)

The types:

INT8:        <<value>>
INT16:       <<value::integer-size(16)>>
INT32:       <<value::integer-size(32)>>
STRING:      <<size, value::binary-size(size)>> (2.)
BIN[size]:   <<value::binary-size(size)>>
BOOL:        <<0>> | <<1>>
ARRAY[type]: <<num_elements, ...>> (3.)

All integers are unsigned.

  1. Package ids are unique for the repository hosting the registry.

  2. Strings are UTF-8 encoded and have a maximum size of 255 bytes.

  3. An array starts with a single INT8 specifying the number of elements in the array, so an array has at most 255 elements. Following the INT8 are the elements of the array. The elements' type is given by type in ARRAY[type].
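To show how a client might replay the log, here is a minimal sketch that applies a few entry types to an in-memory registry. The registry shape (a map keyed by package id) and all names are assumptions made for illustration. ADD DEPENDENCY, which needs the latest ADD RELEASE to be tracked, and the Mix client entries are left out to keep it short.

```elixir
defmodule RegistrySketch.Log do
  # Hypothetical registry shape: %{package_id => %{name: name, releases: %{version => deps}}}.
  # Entries are {type, body} pairs, i.e. sections without their type and length fields.

  def apply_entries(registry, entries), do: Enum.reduce(entries, registry, &apply_entry/2)

  # ADD PACKAGE <<1>>: INT32 package id, STRING package name.
  defp apply_entry({1, <<id::integer-size(32), size, name::binary-size(size), _::binary>>}, reg),
    do: Map.put(reg, id, %{name: name, releases: %{}})

  # ADD RELEASE <<2>>: INT32 package id, STRING version (checksum and build tools ignored here).
  defp apply_entry({2, <<id::integer-size(32), size, version::binary-size(size), _::binary>>}, reg),
    do: update_releases(reg, id, &Map.put(&1, version, []))

  # REMOVE PACKAGE <<5>>: INT32 package id.
  defp apply_entry({5, <<id::integer-size(32), _::binary>>}, reg), do: Map.delete(reg, id)

  # REMOVE RELEASE <<6>>: INT32 package id, STRING version.
  defp apply_entry({6, <<id::integer-size(32), size, version::binary-size(size), _::binary>>}, reg),
    do: update_releases(reg, id, &Map.delete(&1, version))

  # Unknown or unhandled entry types are skipped.
  defp apply_entry(_other, reg), do: reg

  # Entries that refer to a package id that is not in the registry are ignored.
  defp update_releases(reg, id, fun) do
    case reg do
      %{^id => pkg} -> Map.put(reg, id, %{pkg | releases: fun.(pkg.releases)})
      _ -> reg
    end
  end
end
```

Because the log is append-only, replaying only the newly fetched tail on top of the cached registry gives the same result as replaying the whole log.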

To serve the log file efficiently, the HTTP Range header should be used to fetch only the tail of the log that a client has not previously fetched.

If a client doesn't have a cached representation of the registry it should initially fetch the full registry file. Subsequent requests should do range requests on the log file. The registry file's INIT section holds the rebuild counter and size of the log file at the time the registry was fetched.

The log file size from the registry's INIT section should be used to fetch only the new parts of the log file. For example, if the registry sets the log file size to 500 bytes, a request with the header Range: bytes=499- should be sent. If the server responds with a Content-Length: 1 header, the registry is up to date. If not, the locally cached registry should be updated with the new data, but the first byte of the response body should be ignored (1.). The byte offsets used in the Range header should only fall on section boundaries; this should be automatic, since the repository will never host a log file with incomplete sections.

The rebuild counter should be compared to the x-rebuild-count header on the log file endpoint. If the values do not match the log file has been rebuilt and a full fetch of the registry should be performed. It is important to verify that the rebuild counters match before interpreting the bytes in the response body or you may read corrupted data because the byte offset in the Range header may not have been on a section boundary.
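The update procedure above can be sketched without committing to an HTTP client: one function builds the Range header and another decides what to do with the response. The names, the return values and the headers-as-map representation are assumptions for illustration. The range value is written with the `bytes=` unit that HTTP range requests use, and the body size stands in for the Content-Length check.

```elixir
defmodule RegistrySketch.Update do
  # Hypothetical helpers; the caller performs the request and passes in the response parts.

  # Start one byte before the known end of the log so the range is never out of bounds.
  def range_header(log_size) when log_size > 0, do: {"range", "bytes=#{log_size - 1}-"}

  # `headers` is assumed to be a map with lowercase names and string values.
  def handle_response(headers, body, rebuild_counter) do
    cond do
      # Check the rebuild counter before interpreting any bytes of the body.
      headers["x-rebuild-count"] != Integer.to_string(rebuild_counter) ->
        :full_fetch

      # Only the overlapping byte came back, so there is nothing new.
      byte_size(body) == 1 ->
        :up_to_date

      # Drop the overlapping byte; the remainder is whole new sections.
      true ->
        <<_overlap, new_sections::binary>> = body
        {:append, new_sections}
    end
  end
end
```

For a stored log size of 500 and rebuild counter 3, `range_header(500)` gives `{"range", "bytes=499-"}`, and a response whose `x-rebuild-count` is not `"3"` leads to a full fetch regardless of its body.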

To be able to verify the signature of the log file without fetching the full file, a special checksum scheme is used to generate the checksum that will be signed. For each section, the hash from the previous iteration is concatenated with the byte representation of the section and hashed with the SHA-512 hash function. The result is passed as the hash to the next iteration. In Elixir the algorithm would be defined like this:

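# `hash` is the checksum so far; the list holds the remaining raw section binaries.
def checksum([], hash),
  do: hash
def checksum([section | sections], hash),
  do: checksum(sections, :crypto.hash(:sha512, hash <> section))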

The INIT section in the full registry file holds the checksum of all log sections at the time the registry was built. This checksum should be used as the starting value when computing the new checksum after a partial range request of the log file.

Sections unknown to a client should be skipped, just like for the full registry file, with the important distinction that they should still be included when calculating the checksum.
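The point of chaining the hashes is that the checksum can be resumed from a stored value. The sketch below wraps the algorithm in a hypothetical module and uses `:crypto.hash/2` for SHA-512. It assumes that the byte representation of a section includes its type and length fields and that the very first hash value is the empty binary; the text specifies neither.

```elixir
defmodule RegistrySketch.Checksum do
  # `sections` are raw section binaries, known and unknown types alike.
  # `hash` is the checksum computed so far (assumed to start as the empty binary).
  def checksum(sections, hash) do
    Enum.reduce(sections, hash, fn section, acc ->
      :crypto.hash(:sha512, acc <> section)
    end)
  end
end

# Sections that were already in the log when the registry file was built.
old_sections = [<<1, 0, 1, 7>>, <<5, 0, 2, 8, 9>>]
# A section fetched later with a range request; its type is unknown to this client.
new_sections = [<<99, 0, 0>>]

# The checksum stored in INIT covers the old sections only.
stored = RegistrySketch.Checksum.checksum(old_sections, <<>>)

# Resuming from the stored checksum gives the same result as hashing the whole log.
resumed = RegistrySketch.Checksum.checksum(new_sections, stored)
full = RegistrySketch.Checksum.checksum(old_sections ++ new_sections, <<>>)
true = resumed == full
```

The unknown section type 99 is still fed into the hash, which is why unknown sections may be skipped when updating the registry but not when calculating the checksum.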

  1. One extra byte is fetched and ignored because many HTTP servers respond with 200 OK instead of 416 Range Not Satisfiable when the range is out of bounds. The 200 OK response would include a full response body that we don't care about.