Internet Draft J. Hohle
<ion-rfc.txt> RFC Editor
Category: Informational USC ISI
Expires December 2023 June 21, 2023
The Amazon Ion Specification
<ion-rfc.txt>
Status of this Memo
Distribution of this memo is unlimited.
This Internet-Draft is submitted to IETF pursuant to, and in full
conformance with, the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on December 23, 2023.
Copyright Notice
Copyright (c) 2013-2023 Amazon.com, Inc. and the persons identified as
the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document.
Abstract
Ion is a set of datatypes together with language-independent text and
binary data interchange formats designed for use within
Jonker, Goo, Hohle Expires December 2023 [Page 1]
Internet Draft Ion June 21, 2023
applications and as both an on-wire and an at-rest encoding. The
datatypes and text encoding are a superset of the JSON format. In
addition, Ion defines an isomorphic, space-efficient binary encoding.
This document specifies the Ion datatypes, the grammar of the text
representation, and encoding of the binary representation.
Table of Contents
1. Introduction ................................................... 4
1.1 Conventions Used in This Document ........................ 4
2. The Ion Data Model ............................................. 4
2.1. Primitive Types ........................................... 5
2.1.1. Null Values .......................................... 5
2.1.2. Boolean .............................................. 6
2.1.3. Integers ............................................. 6
2.1.4. Real Numbers ......................................... 6
2.1.4.1. Float ........................................... 6
2.1.4.2. Decimal ......................................... 6
2.1.5. Timestamps ........................................... 6
2.1.6. Strings .............................................. 6
2.1.7. Symbols .............................................. 6
2.1.8. Blobs ................................................ 7
2.1.9. Clobs ................................................ 7
2.2. Container Types ........................................... 7
2.2.1. Structures ........................................... 7
2.2.2. Lists ................................................ 7
2.2.3. S-Expressions ........................................ 8
2.2.4. Type Annotations ..................................... 8
2.3. Value Streams ............................................. 8
3. Text Encoding .................................................. 9
3.1. Nulls .................................................... 9
3.2. Booleans .................................................. 9
3.3. Integers .................................................. 10
3.4. Floats .................................................... 10
3.5. Decimals .................................................. 11
3.6. Timestamps ................................................ 11
3.7. Strings ................................................... 13
3.7.1 Long Strings .......................................... 13
3.7.2. Escape Characters .................................... 14
3.8. Symbols ................................................... 14
3.9. Blobs ..................................................... 15
3.10. Clobs .................................................... 16
3.11. Structs .................................................. 17
3.12. Lists .................................................... 17
3.13. S-Expressions ............................................ 18
3.14. Type Annotations ......................................... 18
4. Binary Encoding ................................................ 19
4.1. Basic Field Formats ....................................... 20
4.1.1. UInt and Int Fields .................................. 20
4.1.2. VarUInt and VarInt Fields ............................ 21
4.2. Typed Value Formats ....................................... 22
4.2.1. 0: Nulls ............................................. 23
4.2.1.1. NOP Padding ..................................... 23
4.2.2. 1: Booleans .......................................... 24
4.2.3. 2 & 3: Integers ...................................... 24
4.2.4. 4: Floats ............................................ 24
4.2.5. 5: Decimals .......................................... 25
4.2.6. 6: Timestamps ........................................ 26
4.2.7. 7: Symbols ........................................... 27
4.2.8. 8: Strings ........................................... 28
4.2.9. 9: Clobs ............................................. 28
4.2.10. 10: Blobs ........................................... 29
4.2.11. 11: Lists ........................................... 29
4.2.12. 12: S-Expressions ................................... 29
4.2.13. 13: Structures ...................................... 30
4.2.13.1. NOP Padding in struct Fields ................... 31
4.2.14. 14: Type Annotations ................................ 32
4.2.15. 15: Reserved ........................................ 33
4.3. Illegal Type Descriptors .................................. 33
5. Ion Symbols & Symbol Tables .................................... 34
5.1. Symbol Tables ............................................. 34
5.1.1. The Catalog .......................................... 35
5.1.2. Top-Level Semantics ................................. 35
5.1.3. System Symbols ....................................... 36
5.2. Ion Version Markers ....................................... 37
5.2.2. Local Symbol Tables .................................. 39
5.2.2.1. Imports ......................................... 40
5.2.2.2. Semantics ....................................... 41
5.2.3. Shared Symbol Tables ................................. 41
5.2.3.1. Semantics ....................................... 43
5.2.3.2. Versioning ...................................... 43
5.3. Symbol Zero ............................................... 43
5.4. Data Model ................................................ 44
5.5. Examples .................................................. 44
6. Ion Strings & Clobs ............................................ 46
6.1. Unicode Primer ............................................ 46
6.2. Ion String ................................................ 47
6.2.1. Text Format ......................................... 47
6.2.2. Binary Format ....................................... 49
6.3 Ion Clob ...................................................... 49
6.3.1. Text Format ......................................... 49
6.3.2. Binary Format ....................................... 50
7. Real Numbers ................................................... 50
7.1. Floats .................................................... 50
7.1.1. Encoding Considerations ............................. 51
7.1.2. Special Values ...................................... 51
7.1.3. Examples ............................................ 52
7.2. Decimals .................................................. 53
7.2.1. Data Model ........................................... 53
7.2.2. Text Format ......................................... 53
7.2.3. Binary Format ....................................... 53
8. Compression .................................................... 58
9. Security Considerations ........................................ 58
10. IANA Considerations ............................................ 58
9. Appendix A: Antlr v4 Grammar for Ion 1.0 Text .................. 59
10. References .................................................... 72
10.1. Normative References .................................... 72
10.2. Informative References ................................. 73
1. Introduction
The Amazon Ion specification has three parts:
o A set of data types
o A textual notation for values of those types
o A binary notation for values of those types
All three views are semantically isomorphic, meaning they can
represent exactly the same data structures, and an Ion processor can
transcode between the formats without loss of data. This allows
applications to optimize different areas for different uses - say,
using text for human readability and binary for compact persistence -
by transcribing between the formats with almost complete fidelity.
("Almost" because converting from text to binary does not preserve
whitespace and comments.)
The Ion text encoding is intended to be easy to read and write. It
may be more suitable for streaming applications since sequences don't
need to be length-prefixed. Whitespace is insignificant and is only
required where necessary to separate tokens. C-style comments (either
block or inline) are treated as whitespace, and are not part of the
binary encoding.
The binary encoding is much more compact and efficient. An important
feature is that parts of the whole can be accessed without
"preparation", meaning you don't have to load it into another form
before accessing the values.
1.1. Conventions Used in This Document
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in [RFC2119].
The grammatical rules in this document are to be interpreted as
described in [RFC5234].
2. The Ion Data Model
The semantic basis of Ion is an abstract data model, composed of a
set of primitive types and a set of recursively-defined container
types. All types support null values and user-defined type
annotations.
It's important to note that the data model is value-based and does
not include references. As a result, the data model can express data
hierarchies (values can be nested to arbitrary depth), but not
general directed graphs.
Here's an overview of the core data types:
o null - A generic null value
o bool - Boolean values
o int - Signed integers of arbitrary size
o float - Binary-encoded floating point numbers (IEEE 64-bit)
o decimal - Decimal-encoded real numbers of arbitrary precision
o timestamp - Date/time/timezone moments of arbitrary precision
o string - Unicode text literals
o symbol - Interned, Unicode symbolic atoms (aka identifiers)
o blob - Binary data of user-defined encoding
o clob - Text data of user-defined encoding
o struct - Unordered collections of tagged values
o list - Ordered collections of values
o sexp - Ordered collections of values with application-defined
semantics
2.1. Primitive Types
The Ion primitive types represent scalar values including nulls,
booleans, numbers, timestamps, character sequences, and symbols.
2.1.1. Null Values
Ion supports distinct null values for every core type, as well as a
separate null type that's distinct from all other types.
Null values exist for all core types, including the null type. The
null type has a single value, null.
As a historical aside, the null type exists primarily for
compatibility with JSON, which has only the untyped null value.
2.1.2. Booleans
The bool type may have a value of true, false, or null.
2.1.3. Integers
The int type consists of signed integers of arbitrary size.
2.1.4. Real Numbers
Ion supports both binary and lossless decimal encodings of real
numbers as, respectively, types float and decimal.
2.1.4.1. Floats
The float type denotes either 32-bit or 64-bit IEEE-754 floating-
point values; other sizes may be supported in future versions of this
specification.
2.1.4.2. Decimals
Because most decimal values cannot be represented exactly in binary
floating-point, float values may change "appearance" and precision
when being read or written. The precision of decimal values, by
contrast, is significant (including trailing zeros) and is preserved
through round-trips.
2.1.5. Timestamps
Timestamps represent a specific moment in time, always include a
local offset, and are capable of arbitrary precision.
Values that are precise only to the year, month, or date are assumed
to be UTC values with unknown local offset.
Zero and negative dates are not valid, so the earliest instant in
time that can be represented as a timestamp is Jan 01, 0001. As per
the W3C note, leap seconds cannot be represented.
Two timestamps are only equivalent if they represent the same instant
with the same offset and precision.
2.1.6. Strings
Ion string values are Unicode character sequences of arbitrary
length.
2.1.7. Symbols
Symbols are much like strings, in that they are Unicode character
sequences. The primary difference is the intended semantics: symbols
represent semantic identifiers as opposed to textual literal values.
Symbols are case sensitive.
Symbols may be shared between applications out of band or stored
separately from data at rest using symbol tables. A symbol table maps
the text of each symbol to a unique integer token.
2.1.8. Blobs
The blob type allows embedding of arbitrary raw binary data. Ion
treats such data as a single (though often very large) value. It does
no processing of such data other than passing it through intact.
2.1.9. Clobs
The clob type is similar to blob in that it holds uninterpreted
binary data; the difference is that the content is expected to be
text. Like blobs, clobs are a sequence of raw octets that are not
given any special interpretation. This guarantees that the value can
be transmitted without modification.
2.2. Container Types
Ion defines three container types: structures, lists, and S-
expressions. These types are defined recursively and may contain
values of any Ion type.
2.2.1. Structures
Structures are unordered collections of name/value pairs. The names
are symbol tokens, and the values are unrestricted. Each name/value
pair is called a field.
When two fields in the same struct have the same name we say there
are "repeated names" or (somewhat misleadingly) "repeated fields".
Implementations must preserve all such fields, i.e., they may not
discard fields that have repeated names. However, implementations may
reorder fields (the binary format identifies structs that are sorted
by symbolID), so certain operations may lead to nondeterministic
behavior.
Note that field names are symbol tokens, not symbol values, and thus
may not be annotated. The value of a field may be annotated like any
other value.
2.2.2. Lists
Lists are ordered collections of values. The contents of the list are
heterogeneous (that is, each element can have a different type).
Homogeneous lists are not supported by the core type system, but may
be imposed by schema validation tools.
2.2.3. S-Expressions
An S-expression (or symbolic expression) is much like a list in that
it's an ordered collection of values. However, the notation aligns
with Lisp syntax to connote use of application semantics like
function calls or programming-language statements. As such, correct
interpretation requires a higher-level context other than the raw Ion
parser and data model.
Ion does not define the interpretation of S-expressions or any
semantics beyond the pure sequence-of-values data model.
2.2.4. Type Annotations
Any Ion value can include one or more annotation symbols denoting the
semantics of the content. This can be used to:
o Annotate individual values with schema types, for validation
purposes.
o Associate a higher-level datatype (e.g. a Java class) during
serialization processes.
o Indicate the notation used within a blob or clob value.
o Apply other application semantics to a single value.
When multiple annotations are present, the Ion processor will
maintain their order. Duplicate annotation symbols are allowed but
discouraged.
Except for a small number of predefined system annotations, Ion
itself neither defines nor validates such annotations; that behavior
is left to applications or tools (such as schema validators).
It's important to understand that annotations are symbol tokens, not
symbol values. That means they do not have annotations themselves.
2.3. Value Streams
A value stream is a (potentially unbounded) sequence of Ion values in
either text or binary.
3. Ion Text Encoding
The Ion text encoding is a value stream that must be a valid sequence
of Unicode code points encoded as UTF-8. It has many similarities to
JSON and is, in fact, a proper superset of JSON. Ion extends JSON with
additional datatypes, syntax, containers, and annotations.
3.1. Nulls
The null type has a single value, denoted in the text format by the
keyword `null'. Null values for all core types are denoted by
suffixing the keyword with a period and the desired type. Thus we can
enumerate all possible null values as follows:
null
null.null // Identical to unadorned null
null.bool
null.int
null.float
null.decimal
null.timestamp
null.string
null.symbol
null.blob
null.clob
null.struct
null.list
null.sexp
The text format treats all of these as reserved tokens; to use those
same characters as a symbol, they must be enclosed in single-quotes:
null // The type is null
'null' // The type is symbol
null.list // The type is list
'null.int' // The type is symbol
(As a historical aside, the null type exists primarily for compatibility
with JSON, which has only the untyped null value.)
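As a non-normative illustration, the fourteen reserved null tokens
enumerated above can be derived from the names of the core types. The
Python sketch below is an assumption-laden helper, not part of this
specification; the names CORE_TYPES, null_tokens, and null_type are
hypothetical.

```python
# Names of the core Ion types, in the order used by the list above.
CORE_TYPES = ["null", "bool", "int", "float", "decimal", "timestamp",
              "string", "symbol", "blob", "clob", "struct", "list", "sexp"]


def null_tokens():
    """Return every reserved null token: bare `null` plus one per core type."""
    return ["null"] + ["null." + name for name in CORE_TYPES]


def null_type(token):
    """Return the type denoted by an unquoted null token, or None otherwise.

    A single-quoted token such as 'null.int' is a symbol, so it is
    deliberately not recognized here.
    """
    if token == "null":
        return "null"
    prefix, _, suffix = token.partition(".")
    if prefix == "null" and suffix in CORE_TYPES:
        return suffix
    return None
```

Under these assumptions, null_type("null.list") yields "list", while
the quoted form "'null.int'" is not treated as a null token.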
3.2. Booleans
The bool type is self-explanatory, but note that (as with all Ion types)
there's a null value. Thus the set of all Boolean values consists of the
following three reserved tokens:
null.bool
true
false
(As with the null values, one can single-quote those tokens to force
them to be parsed as symbols.)
3.3. Integers
The text format allows hexadecimal and binary (but not octal) notation,
but such notation will not be maintained during binary-to-text
conversions. It also allows for the use of underscores to
separate digits.
null.int // A null int value
0 // Zero. Surprise!
-0 // ...the same value with a minus sign
123 // A normal int
-123 // Another negative int
0xBeef // An int denoted in hexadecimal
0b0101 // An int denoted in binary
1_2_3 // An int with underscores
0xFA_CE // An int denoted in hexadecimal with underscores
0b10_10_10 // An int denoted in binary with underscores
+1 // ERROR: leading plus not allowed
0123 // ERROR: leading zeros not allowed (no support for
// octal notation)
1_ // ERROR: trailing underscore not allowed
1__2 // ERROR: consecutive underscores not allowed
0x_12 // ERROR: underscore can only appear between digits (the
// radix prefix is not a digit)
_1 // A symbol (ints cannot start with underscores)
In the text notation, integer values must be followed by one of the
fifteen numeric stop-characters: {}[](),\"\'\ \t\n\r\v\f.
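The integer rules above can be sketched as a single regular
expression. The following Python fragment is a minimal, non-normative
sketch: the names _ION_INT and parse_ion_int are hypothetical, and the
check for a trailing numeric stop-character is assumed to be done by
the surrounding tokenizer.

```python
import re

# One alternative per notation: hexadecimal, binary, then decimal.
# An underscore may appear only between two digits, and the decimal
# form may not have leading zeros. A leading plus sign is never allowed.
_ION_INT = re.compile(
    r"-?(?:0[xX][0-9A-Fa-f](?:_?[0-9A-Fa-f])*"
    r"|0[bB][01](?:_?[01])*"
    r"|0|[1-9](?:_?[0-9])*)"
)


def parse_ion_int(token):
    """Parse one Ion text integer token into a Python int.

    Raises ValueError for tokens that the examples above mark as errors
    (or that are not integers at all, such as the symbol _1).
    """
    if not _ION_INT.fullmatch(token):
        raise ValueError("not an Ion int: %r" % token)
    # With base 0, int() honors the 0x / 0b radix prefixes.
    return int(token.replace("_", ""), 0)
```

Because the parsed value is a plain integer, the original notation
(hexadecimal, binary, or underscores) is not retained, which matches
the statement that such notation is not maintained across conversions.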
3.4. Floats
In the text format, float values are denoted much like the decimal
formats in C or Java. As with JSON, extra leading zeros are not allowed.
Digits may be separated with an underscore.
null.float // A null float value
-0.12e4 // Type is float
0E0 // Zero as float
-0e0 // Negative zero float (distinct from positive zero)
The float type denotes either 32-bit or 64-bit IEEE-754 floating-point
values; other sizes may be supported in future versions of
this specification.
In the text notation, real values must be followed by one of the
fifteen numeric stop-characters: {}[](),\"\'\ \t\n\r\v\f.
Because most decimal values cannot be represented exactly in binary
floating-point, float values may change "appearance" and precision when
reading or writing Ion text.
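The two points above (negative zero is distinct from positive zero,
and decimal text may change "appearance" as a binary float) can be
observed with ordinary IEEE-754 doubles. The Python sketch below is
illustrative only; it uses Python's built-in float parser, whose
accepted syntax is similar to, but not the same as, the Ion float
grammar, and the helper name float_bits is hypothetical.

```python
import struct
from decimal import Decimal


def float_bits(value):
    """Return the IEEE-754 binary64 bit pattern of a float as 8 bytes."""
    return struct.pack(">d", value)


pos_zero = float("0E0")
neg_zero = float("-0e0")

# The two zeros compare equal numerically, but their encodings differ
# in the sign bit, so a faithful round-trip must keep them apart.
zeros_equal = pos_zero == neg_zero
zeros_same_bits = float_bits(pos_zero) == float_bits(neg_zero)

# The text "0.1" has no exact binary64 representation, so the stored
# value differs from the decimal value 0.1.
tenth_is_exact = Decimal(float("0.1")) == Decimal("0.1")
```

Here zeros_equal is true while zeros_same_bits and tenth_is_exact are
both false.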
3.5. Decimals
In the text format, decimal values use d instead of e to start the
exponent. Reals without an exponent are treated as decimal. As with
JSON, extra leading zeros are not allowed. Digits may be separated with
an underscore.
null.decimal // A null decimal value
0.123 // Type is decimal
-0.12d4 // Type is decimal
0D0 // Zero as decimal
0. // ...the same value with different notation
-0d0 // Negative zero decimal (distinct from positive zero)
-0. // ...the same value with different notation
-0d-1 // Decimal maintains precision: -0. != -0.0
123_456.789_012 // Decimal with underscores
123_._456 // ERROR: underscores may not appear next to the decimal point
12__34.56 // ERROR: consecutive underscores not allowed
123.456_ // ERROR: trailing underscore not allowed
-_123.456 // ERROR: underscore after negative sign not allowed
_123.456 // ERROR: the symbol '_123' followed by an unexpected dot
The precision of decimal values, including trailing zeros, is
significant and is preserved through round-trips.
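To make the precision rule concrete, the following non-normative
Python sketch maps Ion decimal text onto Python's decimal.Decimal,
which also retains trailing zeros and negative zero. The function name
ion_decimal is hypothetical, and the sketch assumes the token has
already been validated against the rules above.

```python
from decimal import Decimal


def ion_decimal(token):
    """Convert well-formed Ion decimal text to a Decimal.

    No validation is performed: underscores are dropped and the Ion
    exponent marker (d or D) is rewritten to the E that Decimal expects.
    """
    normalized = token.replace("_", "").replace("d", "E").replace("D", "E")
    return Decimal(normalized)


a = ion_decimal("-0.")      # negative zero, exponent 0
b = ion_decimal("-0d-1")    # negative zero, exponent -1

# Numerically equal, but the (sign, digits, exponent) triples differ,
# which is the distinction the Ion data model preserves.
numerically_equal = a == b
same_representation = a.as_tuple() == b.as_tuple()
```

In this sketch numerically_equal is true and same_representation is
false, mirroring the comment "-0. != -0.0" in the examples above.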
3.6. Timestamps
In the text format, timestamps follow the W3C note on date and time
formats, but they must end with the literal "T" if not at least
whole-day precision. Fractional seconds are allowed, with at least one
digit of precision and an unlimited maximum. Local-time offsets may be
represented as either hour:minute offsets from UTC, or as the literal
"Z" to denote a local time of UTC. They are required on timestamps with
time and are not allowed on date values.
Ion follows the "Unknown Local Offset Convention" of [RFC3339]:
If the time in UTC is known, but the offset to local time is unknown,
this can be represented with an offset of "-00:00". This differs
semantically from an offset of "Z" or "+00:00", which imply that UTC is
the preferred reference point for the specified time. RFC2822 describes
a similar convention for email.
Values that are precise only to the year, month, or date are assumed to
be UTC values with unknown local offset.
null.timestamp // A null timestamp value
2007-02-23T12:14Z // Seconds are optional, but local
// offset is not
2007-02-23T12:14:33.079-08:00 // A timestamp with millisecond
// precision and PST local time
2007-02-23T20:14:33.079Z // The same instant in UTC ("zero"
// or "Zulu")
2007-02-23T20:14:33.079+00:00 // The same instant, with explicit
// local offset
2007-02-23T20:14:33.079-00:00 // The same instant, with unknown
// local offset
2007-01-01T00:00-00:00 // Happy New Year in UTC, unknown local offset
2007-01-01 // The same instant, with days precision, unknown local offset
2007-01-01T // The same value, different syntax.
2007-01T // The same instant, with months precision, unknown local offset
2007T // The same instant, with years precision, unknown local offset
2007-02-23 // A day, unknown local offset
2007-02-23T00:00Z // The same instant, but more precise and in UTC
2007-02-23T00:00+00:00 // An equivalent format for the same value
2007-02-23T00:00:00-00:00 // The same instant, with seconds precision
2007 // Not a timestamp, but an int
2007-01 // ERROR: Must end with 'T' if not
// whole-day precision; this results
// in an invalid-numeric-stopper error
2007-02-23T20:14:33.Z // ERROR: Must have at least one digit of
// precision after the decimal point
Zero and negative dates are not valid, so the earliest instant in time
that can be represented as a timestamp is Jan 01, 0001. As per the W3C
note, leap seconds cannot be represented.
Two timestamps are only equivalent if they represent the same instant
with the same offset and precision. This means that the following are
not equivalent:
2000T // January 1st 2000, year precision, unknown local offset
2000-01-01T00:00:00Z // January 1st 2000, second precision, UTC
2000-01-01T00:00:00.000Z // January 1st 2000, millisecond precision, UTC
2000-01-01T00:00:00.000-00:00 // January 1st 2000, millisecond precision, negative zero local offset
In the text notation, timestamp values must be followed by one of the
fifteen numeric stop-characters: {}[](),\"\'\ \t\n\r\v\f.
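The equivalence rule above can be modeled by treating a timestamp as
a triple of instant, local offset, and precision. The Python sketch
below is one possible in-memory model, not a parser and not part of
this specification; the class IonTimestamp and its field names are
hypothetical, and an offset of None is an assumed way to model the
unknown local offset "-00:00".

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class IonTimestamp:
    instant: datetime               # the moment in time, normalized to UTC
    offset_minutes: Optional[int]   # local offset; None means unknown
    precision: str                  # "year", "month", "day", "minute", "second"
    fraction_digits: int = 0        # digits of fractional-second precision


jan1 = datetime(2000, 1, 1, tzinfo=timezone.utc)

# The four non-equivalent values from the example above.
t_year = IonTimestamp(jan1, None, "year")            # 2000T
t_second = IonTimestamp(jan1, 0, "second")           # 2000-01-01T00:00:00Z
t_millis = IonTimestamp(jan1, 0, "second", 3)        # ...T00:00:00.000Z
t_unknown = IonTimestamp(jan1, None, "second", 3)    # ...T00:00:00.000-00:00
```

The generated equality compares all four fields, so these values are
pairwise unequal even though they share the same instant. Comparing
only the instant field gives the looser "same moment" relation.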
3.7. Strings
In the text format, strings are delimited by double-quotes and follow
C/Java backslash-escape conventions (see below).
null.string // A null string value
"" // An empty string value
" my string " // A normal string
"\"" // Contains one double-quote character
"\uABCD" // Contains one Unicode character
xml::"<e a='v'>c</e>" // String with type annotation 'xml'
3.7.1. Long Strings
The text format supports an alternate syntax for "long strings",
including those that break across lines. Sequences bounded by three
single-quotes (''') can cross multiple lines and still count as a valid,
single string. In addition, any number of adjacent triple-quoted strings
are concatenated into a single value. The concatenation happens within
the Ion text parser and is neither detectable via the data model nor
applicable to the binary format. Note that comments are always treated
as whitespace, so concatenation still occurs when a comment falls
between two long strings.
( '''hello ''' // Sexp with one element
'''world!''' )
("hello world!") // The exact same sexp value
// This Ion value is a string containing three newlines. The
// serialized form's first newline is escaped into nothingness.
'''\
The first line of the string.
This is the second line of the string,
and this is the third line.
'''
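The concatenation of adjacent long strings can be sketched as follows.
This Python fragment is a deliberately simplified, non-normative
illustration: the name join_long_strings is hypothetical, only the
escaped-newline sequence is processed, and the sketch assumes that
neither the string contents nor any intervening comment contains three
consecutive single-quotes.

```python
import re

# Matches one triple-quoted segment; DOTALL lets it span several lines.
_LONG_SEGMENT = re.compile(r"'''(.*?)'''", re.DOTALL)


def join_long_strings(text):
    """Concatenate all adjacent long-string segments found in `text`.

    Whatever lies between segments (whitespace or comments) is skipped,
    reflecting that comments are treated as whitespace.
    """
    joined = "".join(_LONG_SEGMENT.findall(text))
    # A backslash immediately followed by a newline expands to nothing.
    return joined.replace("\\\n", "")
```

For instance, the two segments in the first example above join into
the single string "hello world!", whether or not a comment separates
them.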
3.7.2. Escape Characters
The Ion text format supports escape sequences only within quoted strings
and symbols. Ion supports most of the escape sequences defined by C++,
Java, and JSON.
The following sequences are allowed:
Unicode Code Point   Ion Escape   Meaning
U+0000               \0           NUL
U+0007               \a           alert BEL
U+0008               \b           backspace BS
U+0009               \t           horizontal tab HT
U+000A               \n           linefeed LF
U+000C               \f           form feed FF
U+000D               \r           carriage return CR
U+000B               \v           vertical tab VT
U+0022               \"           double quote
U+0027               \'           single quote
U+003F               \?           question mark
U+005C               \\           backslash
U+002F               \/           forward slash
nothing              \NL          escaped NL expands to nothing
U+00HH               \xHH         2-digit hexadecimal Unicode code point
U+HHHH               \uHHHH       4-digit hexadecimal Unicode code point
U+HHHHHHHH           \UHHHHHHHH   8-digit hexadecimal Unicode code point
Any other sequence following a backslash is an error.
Note that Ion does not support the following escape sequences:
o Java's extended Unicode markers, e.g., "\uuuXXXX"
o General octal escape sequences, \OOO
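The escape table above translates fairly directly into a decoder. The
Python sketch below is non-normative and makes several simplifying
assumptions: the names unescape, _SIMPLE, and _HEX_LENGTHS are
hypothetical, the input is the already-delimited body of a string or
symbol, only a bare linefeed is treated as an escaped newline, and
surrogate code points are not given any special handling.

```python
import string

# Single-character escapes from the table; the escaped newline maps to "".
_SIMPLE = {
    "0": "\0", "a": "\a", "b": "\b", "t": "\t", "n": "\n",
    "f": "\f", "r": "\r", "v": "\v", '"': '"', "'": "'",
    "?": "?", "\\": "\\", "/": "/", "\n": "",
}
# Hexadecimal escapes and the number of hex digits each one requires.
_HEX_LENGTHS = {"x": 2, "u": 4, "U": 8}


def unescape(body):
    """Expand Ion escape sequences in `body`; raise ValueError on bad ones."""
    out = []
    i = 0
    while i < len(body):
        if body[i] != "\\":
            out.append(body[i])
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash")
        code = body[i + 1]
        if code in _SIMPLE:
            out.append(_SIMPLE[code])
            i += 2
        elif code in _HEX_LENGTHS:
            count = _HEX_LENGTHS[code]
            digits = body[i + 2:i + 2 + count]
            if len(digits) != count or any(d not in string.hexdigits for d in digits):
                raise ValueError("malformed \\%s escape" % code)
            out.append(chr(int(digits, 16)))
            i += 2 + count
        else:
            # Any other sequence following a backslash is an error.
            raise ValueError("unsupported escape: \\%s" % code)
    return "".join(out)
```

General octal escapes are rejected because a digit other than 0 after
the backslash falls through to the final error branch.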
3.8. Symbols
In the text format, symbols are delimited by single-quotes and use the
same escape characters as [Strings].
A subset of symbols called identifiers can be denoted in text without
single-quotes. An identifier is a sequence of ASCII letters, digits, or
the characters $ (dollar sign) or _ (underscore), not starting with
a digit.
null.symbol // A null symbol value
'myVar2' // A symbol
myVar2 // The same symbol
myvar2 // A different symbol
'hi ho' // Symbol requiring quotes
'\'ahoy\'' // A symbol with embedded quotes
'' // The empty symbol
Within S-expressions, the rules for unquoted symbols include another set
of tokens: operators. An operator is an unquoted sequence of one or more
of the following nineteen ASCII characters: !#%&*+-./;<=>?@^`|~
Operators and identifiers can be juxtaposed without whitespace:
( 'x' '+' 'y' ) // S-expression with three symbols
( x + y ) // The same three symbols
(x+y) // The same three symbols
(a==b&&c==d) // S-expression with seven symbols
Note that the data model does not distinguish between identifiers,
operators, or other symbols, and that - as always - the binary format
does not retain whitespace.
See Section 5, "Ion Symbols & Symbol Tables", for more details about
symbol representations and symbol tables.
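The way identifiers and operators separate without whitespace inside
an S-expression can be sketched with a two-alternative regular
expression. The Python fragment below is non-normative and handles
only unquoted identifiers and operators; quoted symbols, numbers,
nested containers, and comments are outside its scope, and the names
_TOKEN and unquoted_symbols are hypothetical.

```python
import re

# First alternative: an identifier (ASCII letters, digits, $ or _, not
# starting with a digit). Second alternative: a run of one or more of
# the nineteen operator characters.
_TOKEN = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*|[!#%&*+\-./;<=>?@^`|~]+")


def unquoted_symbols(sexp_body):
    """Split the inside of an S-expression into unquoted symbol texts."""
    return _TOKEN.findall(sexp_body)
```

Applied to the body of the last example above, a==b&&c==d, this
yields seven symbols, because each maximal run of operator characters
forms a single operator symbol.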
3.9. Blobs
In the text format, blob values are denoted as [RFC4648]-compliant
Base64 text within two pairs of curly braces.
When parsing blob text, an error must be raised if the data:
o Contains characters outside of the Base64 character set.
o Contains a padding character (=) anywhere other than at the end.
o Is terminated by an incorrect number of padding characters.
Within blob values, whitespace is ignored and comments are not allowed.
The / character is always considered part of the Base64 data and the
* character is invalid for Base64 encoding.
// A null blob value
null.blob
// A valid blob value with zero padding characters.
{{
+AB/
}}
// A valid blob value with one required padding character.
{{ VG8gaW5maW5pdHkuLi4gYW5kIGJleW9uZCE= }}
// ERROR: Incorrect number of padding characters.
{{ VG8gaW5maW5pdHkuLi4gYW5kIGJleW9uZCE== }}
// ERROR: Padding character within the data.
{{ VG8gaW5maW5pdHku=Li4gYW5kIGJleW9uZCE= }}
// A valid blob value with two required padding characters.
{{ dHdvIHBhZGRpbmcgY2hhcmFjdGVycw== }}
// ERROR: Invalid character within the data.
{{ dHdvIHBhZGRpbmc_gY2hhcmFjdGVycw= }}
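The three error conditions listed above can be checked before
decoding. The following Python sketch is non-normative: the names
_BLOB_DATA and parse_blob are hypothetical, the input is assumed to be
a single non-null blob token, and the sketch does not check whether
unused trailing bits in the final Base64 group are zero.

```python
import base64
import re

# Zero or more complete 4-character groups, optionally followed by one
# final group carrying exactly the padding its length requires.
_BLOB_DATA = re.compile(
    r"(?:[A-Za-z0-9+/]{4})*"
    r"(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"
)


def parse_blob(token):
    """Decode the text form of a blob, e.g. '{{ +AB/ }}', into bytes."""
    token = token.strip()
    if not (token.startswith("{{") and token.endswith("}}")):
        raise ValueError("blob must be wrapped in {{ and }}")
    # Whitespace inside the braces is ignored.
    data = "".join(token[2:-2].split())
    if not _BLOB_DATA.fullmatch(data):
        raise ValueError("invalid Base64 characters or padding")
    return base64.b64decode(data)
```

Because comments are not allowed inside a blob, a sequence such as //
is simply treated as Base64 data here and is accepted or rejected on
that basis alone.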
3.10. Clobs
In the text format, clob values use similar syntax to blob, but the data
between braces must be one string. The string may only contain legal
7-bit ASCII characters, using the same escaping syntax as string and
symbol values. This guarantees that the value can be transmitted
unscathed while remaining generally readable (at least for western
language text). Like blobs, clobs disallow comments everywhere within
the value.
Section 6, "Ion Strings & Clobs", gives details on the subtle, but
profound, differences between Ion strings and clobs.
null.clob // A null clob value
{{ "This is a CLOB of text." }}
shift_jis ::
{{
'''Another clob with user-defined encoding, '''
'''this time on multiple lines.'''
}}
{{
// ERROR
"comments not allowed"
}}
Note that the shift_jis type annotation above is, like all type
annotations, application-defined. Ion does not interpret or validate
that symbol; that's left to the application.
3.11. Structures
In the text format, structures are wrapped by curly braces, with a colon
between each name and value, and a comma between the fields. For the
purposes of JSON compatibility, it's also legal to use strings for field
names, but they are converted to symbol tokens by the parser.
null.struct // A null struct value
{ } // An empty struct value
{ first : "Tom" , last: "Riddle" } // Structure with two fields
{"first":"Tom","last":"Riddle"} // The same value with confusing style
{center:{x:1.0, y:12.5}, radius:3} // Nested struct
{ x:1, } // Trailing comma is legal in Ion (unlike JSON)
{ "":42 } // A struct value containing a field with an empty name
{ x:1, x:null.int } // WARNING: repeated name 'x' leads to undefined behavior
{ x:1, , } // ERROR: missing field between commas
Note that field names are symbol tokens, not symbol values, and thus may
not be annotated. The value of a field may be annotated like any other
value. Syntactically the field name comes first, then annotations, then
the content.
{ annotation:: field_name: value } // ERROR
{ field_name: annotation:: value } // Okay
3.12. Lists
In the text format, lists are bounded by square brackets and elements
are separated by commas.
null.list // A null list value
[] // An empty list value
[1, 2, 3] // List of three ints
[ 1 , two ] // List of an int and a symbol
[a , [b]] // Nested list
[ 1.2, ] // Trailing comma is legal in Ion (unlike JSON)
[ 1, , 2 ] // ERROR: missing element between commas
3.13. S-Expressions
In the text format, S-expressions are bounded by parentheses.
S-expressions also allow unquoted operator symbols in addition to the
unquoted identifier symbols allowed everywhere.
null.sexp // A null S-expression value
() // An empty expression value
(cons 1 2) // S-expression of three values
([hello][there]) // S-expression containing two lists
(a+-b) ( 'a' '+-' 'b' ) // Equivalent; three symbols
(a.b;) ( 'a' '.' 'b' ';') // Equivalent; four symbols
Although Ion S-expressions use a syntax similar to Lisp expressions, Ion
does not define their interpretation or any semantics at all, beyond the
pure sequence-of-values data model indicated above.
3.14. Type Annotations
In the text format, type annotations are denoted by a non-null symbol
token and double-colons preceding any value. Multiple annotations on
the same value are separated by double-colons:
int32::12 // Suggests 32 bits as
// end-user type
degrees::'celsius'::100 // you can have multiple
// annotations on a value
'my.custom.type' :: { x : 12 , y : -1 } // Gives a struct a
// user-defined type
{ field: something::'another thing'::value } // Field's name must
// precede annotations
// of its value
jpeg :: {{ ... }} // Indicates the blob
// contains jpeg data
bool :: null.int // A very misleading
// annotation on the
// integer null
'' :: 1 // An empty annotation
null.symbol :: 1 // ERROR: type annotation
// cannot be null
Except for a small number of predefined system annotations, Ion itself
neither defines nor validates such annotations; that behavior is left to
applications or tools (such as schema validators).
It's important to understand that annotations are symbol tokens, not
symbol values. That means they do not have annotations themselves. In
particular, the text `a::c' is a single value consisting of three
textual tokens (a symbol, a double-colon, and another symbol); the first
symbol token is an annotation on the value, and the second is the
content of the value.
4. Ion Binary Encoding
The Ion binary encoding is a compact and efficient value stream. In the
binary format, a value stream always starts with a binary version marker
(BVM) that specifies the precise Ion version used to encode the data
that follows:
7 0 7 0 7 0 7 0
+------+-------+-------+------+
binary version marker | 0xE0 | major | minor | 0xEA |
+------+-------+-------+------+
The four-octet BVM also acts as a "magic cookie" to distinguish Ion
binary data from other formats, including Ion text data. Its first octet
(in sequence from the beginning of the stream) is 0xE0 and its fourth
octet is 0xEA. The second and third octets contain major and minor
version numbers. The only valid BVM, identifying Ion 1.0, is
0xE0 0x01 0x00 0xEA.
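As a non-normative illustration, a decoder might recognize the BVM
roughly as in the following Python sketch. The function name parse_bvm
and its error handling are assumptions made for this example; they are
not defined by Ion.

```python
def parse_bvm(data: bytes) -> tuple[int, int]:
    """Return (major, minor) from the four-octet binary version marker."""
    if len(data) < 4:
        raise ValueError("a BVM is four octets long")
    if data[0] != 0xE0 or data[3] != 0xEA:
        raise ValueError("not a binary version marker")
    major, minor = data[1], data[2]
    if (major, minor) != (1, 0):
        # Per the text above, the only valid BVM identifies Ion 1.0.
        raise ValueError(f"unsupported Ion version {major}.{minor}")
    return major, minor
```

For example, parse_bvm(bytes([0xE0, 0x01, 0x00, 0xEA])) yields (1, 0).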
An Ion value stream starts with a BVM, followed by zero or more values
which contain the actual data. These values are generally referred to as
"top-level values".
               31                      0
              +-------------------------+
value stream  |  binary version marker  |
              +-------------------------+
              :          value          :
              +=========================+
                           :
              +=========================+
              :  binary version marker  :
              +=========================+
              :          value          :
              +=========================+
                           :
A value stream can contain other, perhaps different, BVMs interspersed
with the top-level values. Each BVM resets the decoder to the
appropriate initial state for the given version of Ion. This allows the
stream to be constructed by concatenating data from different sources,
without requiring transcoding to a single version of the format.
Note: The BVM is not a value and should not be visible to or manipulable
by the user; it is internal data managed by and for the
Ion implementation.
4.1. Basic Field Formats
Binary-encoded Ion values consist of one or more fields, and the
fields use a small number of basic formats (separate from the Ion types
visible to users).
4.1.1. UInt and Int Fields
UInt and Int fields represent fixed-length unsigned and signed integer
values. These field formats are always used in some context that clearly
indicates the number of octets in the field.
             7                       0
            +-------------------------+
UInt field  |          bits           |
            +-------------------------+
            :          bits           :
            +=========================+
                         :
            +=========================+
            :          bits           :
            +=========================+
             n+7                     n
UInts are sequences of octets, interpreted as big-endian.
             7   6                   0
            +---+---------------------+
Int field   |   |        bits         |
            +---+---------------------+
              ^
              |
              +--sign
            +=========================+
            :          bits           :
            +=========================+
                         :
            +=========================+
            :          bits           :
            +=========================+
             n+7                     n
Ints are sequences of octets, interpreted as sign-and-magnitude
big-endian integers (with the sign on the highest-order bit of the first
octet). This means that the representations of 123456 and -123456 should
only differ in their sign bit. (See
http://en.wikipedia.org/wiki/Signed_number_representation for more
info.)
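The two fixed-length formats can be sketched in Python as follows. This
is a non-normative illustration; the function names decode_uint and
decode_int are hypothetical. Note that a Python int cannot carry a
negative zero, so this sketch maps an Int field holding -0 to 0.

```python
def decode_uint(octets: bytes) -> int:
    """Interpret the octets as a big-endian unsigned integer."""
    return int.from_bytes(octets, "big")


def decode_int(octets: bytes) -> int:
    """Interpret the octets as a sign-and-magnitude big-endian integer."""
    if not octets:
        return 0
    # The sign is the highest-order bit of the first octet.
    sign_bit = 0x80 << (8 * (len(octets) - 1))
    magnitude = int.from_bytes(octets, "big") & ~sign_bit
    negative = bool(octets[0] & 0x80)
    return -magnitude if negative else magnitude
```

With these definitions, the octets 01 E2 40 decode to 123456 and the
octets 81 E2 40 decode to -123456; the two encodings differ only in the
sign bit.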
4.1.2. VarUInt and VarInt Fields
VarUInt and VarInt fields represent self-delimiting, variable-length
unsigned and signed integer values. These field formats are always used
in a context that does not indicate the number of octets in the field;
the last octet (and only the last octet) has its high-order bit set to
terminate the field.
                7   6                   0
               +===+=====================+
VarUInt field  : 0 :        bits         :
               +===+=====================+
                            :
                n+7 n+6                 n
               +---+---------------------+
               | 1 |        bits         |
               +---+---------------------+
VarUInts are a sequence of octets. The high-order bit of the last octet
is one, indicating the end of the sequence. All other high-order bits
must be zero.
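A minimal, non-normative Python sketch of VarUInt decoding follows. The
name decode_varuint and the (value, next position) return convention
are assumptions of this example.

```python
def decode_varuint(data: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode a VarUInt starting at pos; return (value, next position)."""
    value = 0
    while True:
        if pos >= len(data):
            raise ValueError("unterminated VarUInt")
        octet = data[pos]
        pos += 1
        # Each octet contributes its low seven bits, most significant first.
        value = (value << 7) | (octet & 0x7F)
        if octet & 0x80:  # the high-order bit marks the last octet
            return value, pos
```

For instance, the two octets 0F D0 decode to 2000, and the single octet
8E decodes to 14.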
                7   6   5              0        n+7 n+6                 n
              +===+                            +---+
VarInt field  : 0 :        payload        ...  | 1 |        payload
              +===+                            +---+
                  +---+-----------------+          +=====================+
                  |   |    magnitude    |   ...    :      magnitude      :
                  +---+-----------------+          +=====================+
                ^   ^                            ^
                |   |                            |
                |   +--sign                      +--end flag
                +--end flag
VarInts are sign-and-magnitude integers, like Ints. Their layout is
complicated, as there is one special leading bit (the sign) and one
special trailing bit (the terminator). In the above diagram, we put the
two concepts on different layers.
The high-order bit in the top layer is an end-of-sequence marker. It
must be set on the last octet in the representation and clear in all
other octets. The second-highest order bit (0x40) is a sign flag in the
first octet of the representation, but part of the extension bits for
all other octets. For single-octet VarInt values, this collapses
down to:
                             7   6   5          0
                           +---+---+-------------+
single octet VarInt field  | 1 |   |  magnitude  |
                           +---+---+-------------+
                                 ^
                                 |
                                 +--sign
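The two layers described above can be sketched in Python as follows.
This is a non-normative illustration; the name decode_varint is
hypothetical. The sign is returned separately from the magnitude so
that a negative zero remains distinguishable from a positive zero.

```python
def decode_varint(data: bytes, pos: int = 0) -> tuple[int, bool, int]:
    """Decode a VarInt; return (magnitude, negative, next position)."""
    if pos >= len(data):
        raise ValueError("empty VarInt")
    octet = data[pos]
    pos += 1
    negative = bool(octet & 0x40)  # sign flag, first octet only
    magnitude = octet & 0x3F       # six magnitude bits in the first octet
    while not octet & 0x80:        # end flag is set only on the last octet
        if pos >= len(data):
            raise ValueError("unterminated VarInt")
        octet = data[pos]
        pos += 1
        magnitude = (magnitude << 7) | (octet & 0x7F)
    return magnitude, negative, pos
```

For example, the single octet C1 decodes to magnitude 1 with the sign
flag set (that is, -1), and the octets 01 80 decode to 128.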
4.2. Typed Value Formats
A value consists of a one-octet type descriptor, possibly followed by a
length in octets, possibly followed by a representation.
        7       4 3       0
       +---------+---------+
value  |    T    |    L    |
       +---------+---------+======+
       :     length [VarUInt]     :
       +==========================+
       :      representation      :
       +==========================+
The type descriptor octet has two subfields: a four-bit type code T, and
a four-bit length L.
Each value of T identifies the format of the representation, and
generally (though not always) identifies an Ion datatype. Each type code
T defines the semantics of its length field L as described below.
The length value - the number of octets in the representation field(s) -
is encoded in L and/or length fields, depending on the magnitude and on
some particulars of the actual type. The length field is empty (taking
up no octets in the message) if we can store the length value inside L
itself. If the length field is not empty, then it is a single VarUInt
field. The representation may also be empty (no octets) in some cases,
as detailed below.
Unless otherwise defined, the length of the representation is encoded
as follows:
o If the value is null (for that type), then L is set to 15.
o If the representation is less than 14 bytes long, then L is set to the
length, and the length field is omitted.
o If the representation is at least 14 bytes long, then L is set to 14,
and the length field is set to the representation length, encoded as a
VarUInt field.
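The default length rules above can be sketched in Python as follows.
This is a non-normative illustration; the name read_header and its
return tuple are assumptions of this example. It implements only the
default rules, so types that define L differently (as described in the
sections below) would need their own handling.

```python
def read_header(data: bytes, pos: int = 0) -> tuple[int, bool, int, int]:
    """Return (type code, is_null, representation length, start of body)."""
    descriptor = data[pos]
    type_code, l_field = descriptor >> 4, descriptor & 0x0F
    pos += 1
    if l_field == 15:          # null of this type: no length, no body
        return type_code, True, 0, pos
    if l_field < 14:           # the length is stored in L itself
        return type_code, False, l_field, pos
    length = 0                 # L == 14: a VarUInt length field follows
    while True:
        octet = data[pos]
        pos += 1
        length = (length << 7) | (octet & 0x7F)
        if octet & 0x80:
            return type_code, False, length, pos
```

For example, the octet 85 describes a value with type code 8 and a
five-octet representation, while the octets 8E 8E describe a value with
type code 8 and a fourteen-octet representation.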
4.2.1. 0: Nulls
7 4 3 0
+---------+---------+
Null value | 0 | 15 |
+---------+---------+
Values of type null always have empty lengths and representations. The
only valid L value is 15, representing the only value of this type,
null.null.
4.2.1.1. NOP Padding
7 4 3 0
+---------+---------+
NOP Pad | 0 | L |
+---------+---------+======+
: length [VarUInt] :
+--------------------------+
| ignored octets |
+--------------------------+
In addition to null.null, the null type code is used to encode padding
that has no operation (NOP padding). This can be used for "binary
whitespace" when alignment of octet boundaries is needed or to support
in-place editing. Such encodings are not considered values and are
ignored by the processor.
In this encoding, L specifies the number of octets that should
be ignored.
The following is a single byte NOP pad. The NOP padding typedesc bytes
are counted as padding:
0x00
The following is a two byte NOP pad:
0x01 0xFE
Note that the single byte of "payload" 0xFE is arbitrary and ignored by
the parser.
The following is a 16 byte NOP pad:
0x0E 0x8E 0x00 ... <12 arbitrary octets> ... 0x00
NOP padding is valid anywhere a value can be encoded, except for within
an annotation wrapper. NOP padding in struct requires additional
encoding considerations.
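As a non-normative illustration, the total size of a NOP pad (including
its type descriptor and any length field) might be computed as in the
following Python sketch. The name nop_pad_size is hypothetical.

```python
def nop_pad_size(data: bytes, pos: int = 0) -> int:
    """Return the total number of octets occupied by the NOP pad at pos."""
    descriptor = data[pos]
    l_field = descriptor & 0x0F
    if descriptor >> 4 != 0 or l_field == 15:
        # Type code 0 with L == 15 is null.null, not padding.
        raise ValueError("not NOP padding")
    if l_field < 14:
        return 1 + l_field     # descriptor plus L ignored octets
    length, count = 0, 0       # L == 14: a VarUInt length follows
    while True:
        octet = data[pos + 1 + count]
        count += 1
        length = (length << 7) | (octet & 0x7F)
        if octet & 0x80:
            return 1 + count + length
```

Applied to the three examples above, this sketch reports sizes of 1, 2,
and 16 octets respectively.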
4.2.2. 1: Booleans
7 4 3 0
+---------+---------+
Bool value | 1 | rep |
+---------+---------+
Values of type bool always have empty lengths, and their representation
is stored in the typedesc itself (rather than after the typedesc). A
representation of 0 means false; a representation of 1 means true; and a
representation of 15 means null.bool.
4.2.3. 2 & 3: Integers
Values of type int are stored using two type codes: 2 for positive
values and 3 for negative values. Both codes use a UInt subfield to
store the magnitude.
7 4 3 0
+---------+---------+
Int value | 2 or 3 | L |
+---------+---------+======+
: length [VarUInt] :
+==========================+
: magnitude [UInt] :
+==========================+
Zero is always stored as positive; negative zero is illegal.
If L is 0 the value is zero, and there are no length or magnitude
subfields. As a result, when T is 3, both L and the magnitude subfield
must be non-zero.
With either type code 2 or 3, if L is 15, then the value is null.int and
the magnitude is empty. Note that this implies there are two equivalent
binary representations of null integer values.
4.2.4. 4: Floats
7 4 3 0
+---------+---------+
Float value | 4 | L |
+---------+---------+-----------+
| representation [IEEE-754] |
+-------------------------------+
Floats are encoded as big endian octets of their IEEE 754 bit patterns.
The L field of floats encodes the size of the IEEE-754 value.
o If L is 4, then the representation is 32 bits (4 octets).
o If L is 8, then the representation is 64 bits (8 octets).
There are two exceptions for the L field:
o If L is 0, then the value is 0e0 and representation is empty.
Note: This is not to be confused with -0e0 which is a distinct value
and in current Ion must be encoded as a normal IEEE float
bit pattern.
o If L is 15, then the value is null.float and the representation is
empty.
Note: Ion 1.0 only supports 32-bit and 64-bit float values (i.e. L
size 4 or 8), but future versions of the standard may support
16-bit and 128-bit float values.
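The float rules above can be sketched in Python with the standard
struct module. This is a non-normative illustration; the name
decode_float is hypothetical, and null.float is modeled as None.

```python
import struct
from typing import Optional


def decode_float(l_field: int, rep: bytes) -> Optional[float]:
    """Decode a float from its L field and representation octets."""
    if l_field == 15:
        return None                           # null.float
    if l_field == 0:
        return 0.0                            # 0e0, empty representation
    if l_field == 4:
        return struct.unpack(">f", rep)[0]    # 32-bit IEEE 754, big-endian
    if l_field == 8:
        return struct.unpack(">d", rep)[0]    # 64-bit IEEE 754, big-endian
    raise ValueError("illegal L for a float value")
```

Because -0e0 must be written as a full IEEE bit pattern, the four octets
80 00 00 00 with L equal to 4 decode to negative zero, whereas L equal
to 0 always denotes positive zero.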
4.2.5. 5: Decimals
7 4 3 0
+---------+---------+
Decimal value | 5 | L |
+---------+---------+======+
: length [VarUInt] :
+--------------------------+
| exponent [VarInt] |
+--------------------------+
| coefficient [Int] |
+--------------------------+
Decimal representations have two components: exponent (a VarInt) and
coefficient (an Int). The decimal's value is coefficient * 10 ^
exponent.
The length of the coefficient subfield is the total length of the
representation minus the length of exponent. The subfield should not be
present (that is, it has zero length) when the coefficient's value is
(positive) zero.
There are two exceptions for the L field:
1. If L is 0, then the value is `0.` (aka `0d0`), and there are no length,
exponent, or coefficient subfields.
2. If L is 15, then the value is `null.decimal` and there are no length,
exponent, or coefficient subfields.
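As a non-normative illustration, a decimal representation could be
decoded into Python's decimal.Decimal as follows. The name
decode_decimal is hypothetical; the input is the representation only
(the octets after the type descriptor and any length field), and
null.decimal is assumed to be handled by the caller.

```python
from decimal import Decimal


def decode_decimal(rep: bytes) -> Decimal:
    """Decode the exponent [VarInt] and coefficient [Int] of a decimal."""
    if not rep:
        return Decimal("0")            # L == 0: the value 0d0
    # Exponent: VarInt with sign flag 0x40 and end flag 0x80.
    octet = rep[0]
    pos = 1
    exp_negative = bool(octet & 0x40)
    exponent = octet & 0x3F
    while not octet & 0x80:
        octet = rep[pos]
        pos += 1
        exponent = (exponent << 7) | (octet & 0x7F)
    if exp_negative:
        exponent = -exponent
    # Coefficient: sign-and-magnitude Int filling the rest; empty means 0.
    tail = rep[pos:]
    negative, magnitude = False, 0
    if tail:
        negative = bool(tail[0] & 0x80)
        sign_bit = 0x80 << (8 * (len(tail) - 1))
        magnitude = int.from_bytes(tail, "big") & ~sign_bit
    digits = tuple(int(d) for d in str(magnitude))
    return Decimal((1 if negative else 0, digits, exponent))
```

For example, the representation C1 0F (exponent -1, coefficient 15)
decodes to 1.5. Building the Decimal from a (sign, digits, exponent)
tuple keeps a negative-zero coefficient distinct from positive zero.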
4.2.6. 6: Timestamps
7 4 3 0
+---------+---------+
Timestamp value | 6 | L |
+---------+---------+========+
: length [VarUInt] :
+----------------------------+
| offset [VarInt] |
+----------------------------+
| year [VarUInt] |
+----------------------------+
: month [VarUInt] :
+============================+
: day [VarUInt] :
+============================+
: hour [VarUInt] :
+==== ====+
: minute [VarUInt] :
+============================+
: second [VarUInt] :
+============================+
: fraction_exponent [VarInt] :
+============================+
: fraction_coefficient [Int] :
+============================+
Timestamp representations have 7 components, where 5 of these components
are optional depending on the precision of the timestamp. The 2
non-optional components are offset and year. The 5 optional components
are (from least precise to most precise): month, day, hour and minute,
second, fraction_exponent and fraction_coefficient. All of these 7
components are in Coordinated Universal Time (UTC).
The offset denotes the local-offset portion of the timestamp, in minutes
difference from UTC.
The hour and minute are considered a single component, that is, it is
illegal to have hour but not minute (and vice versa).
The fraction_exponent and fraction_coefficient denote the fractional
seconds of the timestamp as a decimal value. The fractional seconds'
value is coefficient * 10 ^ exponent. It must be greater than or equal
to zero and less than 1. A missing coefficient defaults to zero.
Fractions whose coefficient is zero and exponent is greater than -1 are
ignored. The following hex encoded timestamps are equivalent:
68 80 0F D0 81 81 80 80 80 // 2000-01-01T00:00:00Z with no fractional seconds
69 80 0F D0 81 81 80 80 80 80 // The same instant with 0d0 fractional seconds and implicit zero coefficient
6A 80 0F D0 81 81 80 80 80 80 00 // The same instant with 0d0 fractional seconds and explicit zero coefficient
69 80 0F D0 81 81 80 80 80 C0 // The same instant with 0d-0 fractional seconds
69 80 0F D0 81 81 80 80 80 81 // The same instant with 0d1 fractional seconds
Conversely, none of the following are equivalent:
68 80 0F D0 81 81 80 80 80 // 2000-01-01T00:00:00Z with no fractional seconds
69 80 0F D0 81 81 80 80 80 C1 // 2000-01-01T00:00:00.0Z
69 80 0F D0 81 81 80 80 80 C2 // 2000-01-01T00:00:00.00Z
If a timestamp representation has a component of a certain precision,
each of the less precise components must also be present or else the
representation is illegal. For example, a timestamp representation that
has a fraction_exponent and fraction_coefficient component but not the
month component, is illegal.
Note: The component values in the binary encoding are always in UTC,
while components in the text encoding are in the local time! This means
that transcoding requires a conversion between UTC and local time.
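The component layout and the fraction-equivalence rule above can be
sketched in Python as follows. This is a non-normative illustration: the
name decode_timestamp and the dictionary keys are assumptions of this
example, the input is the representation only (the octets after the
type descriptor), range validation is omitted, and a negative-zero
offset is not distinguished from zero.

```python
def decode_timestamp(rep: bytes) -> dict:
    """Decode timestamp components; values are in UTC, as in the encoding."""
    pos = 0

    def varuint() -> int:
        nonlocal pos
        value = 0
        while True:
            octet = rep[pos]
            pos += 1
            value = (value << 7) | (octet & 0x7F)
            if octet & 0x80:
                return value

    def varint() -> int:
        nonlocal pos
        octet = rep[pos]
        pos += 1
        negative = bool(octet & 0x40)
        magnitude = octet & 0x3F
        while not octet & 0x80:
            octet = rep[pos]
            pos += 1
            magnitude = (magnitude << 7) | (octet & 0x7F)
        return -magnitude if negative else magnitude

    fields = {"offset_minutes": varint(), "year": varuint()}
    # Hour and minute form one component: both are present or neither is.
    for group in (("month",), ("day",), ("hour", "minute"), ("second",)):
        if pos == len(rep):
            return fields
        for name in group:
            fields[name] = varuint()
    if pos == len(rep):
        return fields
    exponent = varint()
    tail = rep[pos:]
    coefficient = 0                    # a missing coefficient means zero
    if tail:
        sign_bit = 0x80 << (8 * (len(tail) - 1))
        coefficient = int.from_bytes(tail, "big") & ~sign_bit
    # A zero coefficient with an exponent greater than -1 is ignored.
    if coefficient != 0 or exponent <= -1:
        fields["fraction"] = (coefficient, exponent)
    return fields
```

With this sketch, the five encodings listed above as equivalent decode
to the same dictionary, while the encodings ending in C1 and C2 retain
distinct fraction entries of (0, -1) and (0, -2).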
4.2.7. 7: Symbols
7 4 3 0
+---------+---------+
Symbol value | 7 | L |
+---------+---------+======+
: length [VarUInt] :
+--------------------------+
| symbol ID [UInt] |
+--------------------------+
In the binary encoding, all Ion symbols are stored as integer symbol IDs
whose text values are provided by a symbol table. If L is zero then the
symbol ID is zero and the length and symbol ID fields are omitted.
See Ion Symbols for more details about symbol representations and
symbol tables.
4.2.8. 8: Strings
7 4 3 0
+---------+---------+
String value | 8 | L |
+---------+---------+======+
: length [VarUInt] :
+==========================+
: representation [UTF8] :
+==========================+
These are always sequences of Unicode characters, encoded as a sequence
of UTF-8 octets. If L is zero then the string is the empty string "" and
the length and representation fields are omitted.
4.2.9. 9: Clobs
7 4 3 0
+---------+---------+
Clob value | 9 | L |
+---------+---------+======+
: length [VarUInt] :
+==========================+
: data [Bytes] :
+==========================+
Values of type clob are encoded as a sequence of octets that should be
interpreted as text with an unknown encoding (and thus opaque to
the application).
Zero-length clobs are legal, so L may be zero.
4.2.10. 10: Blobs
7 4 3 0
+---------+---------+
Blob value | 10 | L |
+---------+---------+======+
: length [VarUInt] :
+==========================+
: data [Bytes] :
+==========================+
This is a sequence of octets with no interpretation (and thus opaque to
the application).
Zero-length blobs are legal, so L may be zero.
4.2.11. 11: Lists
7 4 3 0
+---------+---------+
List value | 11 | L |
+---------+---------+======+
: length [VarUInt] :
+==========================+
: value :
+==========================+
:
The representation fields of a list value are simply nested Ion values.
When L is 15, the value is null.list and there's no length or nested
values. When L is 0, the value is an empty list, and there's no length
or nested values.
Because values indicate their total lengths in octets, it is possible to
locate the beginning of each successive value in constant time.
4.2.12. 12: S-Expressions
7 4 3 0
+---------+---------+
Sexp value | 12 | L |
+---------+---------+======+
: length [VarUInt] :
+==========================+
: value :
+==========================+
:
Values of type sexp are encoded exactly as are list values, except with
a different type code.
4.2.13. 13: Structures
Structs are encoded as sequences of symbol/value pairs. Since all
symbols are encoded as positive integers, we can omit the typedesc on
the field names and just encode the integer value.
                7       4 3       0
               +---------+---------+
Struct value   |   13    |    L    |
               +---------+---------+======+
               :     length [VarUInt]     :
               +======================+===+==================+
               : field name [VarUInt] :        value         :
               +======================+======================+
                           :                      :
Binary-encoded structs support a special case where the fields are known
to be sorted such that the field-name integers are increasing. This
state exists when L is one. Thus:
o When L is 0, the value is an empty struct, and there's no length or
nested fields.
o When L is 1, the struct has at least one symbol/value pair, the length
field exists, and the field name integers are sorted in
increasing order.
o When L is 14, the length field exists, and no assertion is made about
field ordering.
o When L is 15, the value is null.struct, and there's no length or
nested fields.
o When 1 < L < 14 then there is no length field as L is enough to
represent the struct size, and no assertion is made about
field ordering.
Note: Because VarUInts depend on end tags to indicate their lengths,
finding the succeeding value requires parsing the field name prefix.
However, VarUInts are a more compact representation than Int values.
4.2.13.1. NOP Padding in struct Fields
NOP Padding in struct values requires additional consideration of the
field name element. If the "value" of a struct field is the NOP pad,
then the field name is ignored. This means that it is not possible to
encode padding in a struct value that is less than two bytes.
Implementations should use symbol ID zero as the field name to emphasize
the lack of meaning of the field name. For more general details about
the semantics of symbol ID zero, refer to Ion Symbols.
For example, consider the following empty struct with three bytes
of padding:
0xD3 0x80 0x01 0xAC
In the above example, the struct declares that it is three bytes large,
and its content is the pair of symbol ID zero followed by a pad that is
two bytes large (note that the last octet 0xAC is completely arbitrary
and never interpreted by an implementation).
The following is an example of struct with a single field with four
total bytes of padding:
0xD7 0x84 0x81 "a" 0x80 0x02 0x01 0x02
The above is equivalent to {name:"a"}.
The following is also an empty struct, with a two byte pad:
0xD2 0x8F 0x00
In the above example, the field name of symbol ID 15 is ignored
(regardless of whether it is a valid symbol ID).
The following is malformed because there is an annotation "wrapping" a
NOP pad, which is not allowed generally for annotations:
// {$0:name::<NOP>}
0xD5 0x80 0xE3 0x81 0x84 0x00
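As a non-normative illustration, the fields of a struct body could be
enumerated while skipping NOP padding as in the following Python sketch.
The names read_varuint, value_end, and struct_fields are hypothetical.
The input is the struct's content only (the octets after its type
descriptor and any length field), and the sketch does not validate
nested values such as annotation wrappers.

```python
def read_varuint(data: bytes, pos: int) -> tuple[int, int]:
    """Decode a VarUInt at pos; return (value, next position)."""
    value = 0
    while True:
        octet = data[pos]
        pos += 1
        value = (value << 7) | (octet & 0x7F)
        if octet & 0x80:
            return value, pos


def value_end(data: bytes, pos: int) -> int:
    """Return the position just past the value (or NOP pad) at pos."""
    type_code, l_field = data[pos] >> 4, data[pos] & 0x0F
    pos += 1
    if l_field == 15 or type_code == 1:
        return pos                     # nulls and bools have no body
    if l_field == 14 or (type_code == 13 and l_field == 1):
        length, pos = read_varuint(data, pos)
    else:
        length = l_field
    return pos + length


def struct_fields(body: bytes) -> list[tuple[int, bytes]]:
    """Return (field name symbol ID, encoded value) pairs, minus padding."""
    fields, pos = [], 0
    while pos < len(body):
        symbol_id, pos = read_varuint(body, pos)
        end = value_end(body, pos)
        is_nop = body[pos] >> 4 == 0 and body[pos] & 0x0F != 15
        if not is_nop:                 # the field name of a pad is ignored
            fields.append((symbol_id, body[pos:end]))
        pos = end
    return fields
```

Applied to the content of the examples above, the padded empty structs
yield no fields, and the struct equivalent to {name:"a"} yields a single
field with symbol ID 4.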
4.2.14. 14: Type Annotations
This special type code doesn't map to an Ion value type, but instead is
a wrapper used to associate annotations with content.
Annotations are a special type that wrap content identified by the other
type codes. The annotations themselves are encoded as integer
symbol IDs.
7 4 3 0
+---------+---------+
Annotation wrapper | 14 | L |
+---------+---------+======+
: length [VarUInt] :
+--------------------------+
| annot_length [VarUInt] |
+--------------------------+
| annot [VarUInt] | ...
+--------------------------+
| value |
+--------------------------+
The length field L indicates the length from the beginning of the
annot_length field to the end of the enclosed value. Because at least
one annotation and exactly one content field must exist, L is at least 3
and is never 15.
The annot_length field contains the length of the (one or more)
annot fields.
It is illegal for an annotation to wrap another annotation atomically,
i.e., annotation(annotation(value)). However, it is legal to have an
annotation on a container that holds annotated values. Note that it is
possible to enforce the illegality of annotation(annotation(value))
directly in a grammar, but we have not chosen to do that in
this document.
Furthermore, it is illegal for an annotation to wrap a NOP Pad since
this encoding is not an Ion value. Thus, the following sequence
is malformed:
0xE3 0x81 0x84 0x00
Note: Because L cannot be zero, the octet 0xE0 is not a valid type
descriptor. Instead, that octet signals the start of a binary
version marker.
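The wrapper layout and the two restrictions above can be sketched in
Python as follows. This is a non-normative illustration; the names
read_varuint and parse_annotation_wrapper are hypothetical, and the
enclosed value is returned undecoded.

```python
def read_varuint(data: bytes, pos: int) -> tuple[int, int]:
    """Decode a VarUInt at pos; return (value, next position)."""
    value = 0
    while True:
        octet = data[pos]
        pos += 1
        value = (value << 7) | (octet & 0x7F)
        if octet & 0x80:
            return value, pos


def parse_annotation_wrapper(data: bytes, pos: int = 0) -> tuple[list[int], bytes]:
    """Return (annotation symbol IDs, encoded inner value)."""
    descriptor = data[pos]
    l_field = descriptor & 0x0F
    pos += 1
    if descriptor >> 4 != 14:
        raise ValueError("not an annotation wrapper")
    if l_field < 3 or l_field == 15:
        raise ValueError("illegal L for an annotation wrapper")
    if l_field == 14:
        length, pos = read_varuint(data, pos)
    else:
        length = l_field
    end = pos + length                 # measured from annot_length onward
    annot_length, pos = read_varuint(data, pos)
    annot_end = pos + annot_length
    annotations = []
    while pos < annot_end:
        symbol_id, pos = read_varuint(data, pos)
        annotations.append(symbol_id)
    if not annotations or pos != annot_end or pos >= end:
        raise ValueError("malformed annotation wrapper")
    inner = data[pos]
    if inner >> 4 == 14:
        raise ValueError("an annotation may not wrap another annotation")
    if inner >> 4 == 0 and inner & 0x0F != 15:
        raise ValueError("an annotation may not wrap NOP padding")
    return annotations, data[pos:end]
```

For example, the octets E4 81 84 21 0C parse as the annotation with
symbol ID 4 applied to a one-octet int value, while the malformed
sequence E3 81 84 00 shown above is rejected.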
4.2.15. 15: Reserved
The remaining type code, 15, is reserved for future use and is not legal
in Ion 1.0 data.
4.3. Illegal Type Descriptors
The preceding sections define valid type descriptor octets, composed of
a type code (T) in the upper four bits and a length field (L) in the
lower four bits. As mentioned, many possible combinations are illegal
and must cause parsing errors.
The following table enumerates the illegal type descriptors in Ion
1.0 data.
T     L                     Reason

1     [2-14]                For bool values, L is used to encode the
                            value, and may be 0 (false), 1 (true), or 15
                            (null.bool).

3     [0]                   The int 0 is always stored with type code 2.
                            Thus, type code 3 with L equal to zero
                            is illegal.

4     [1-3],[5-7],[9-14]    For float values, only 32-bit and 64-bit
                            IEEE-754 values are supported. Additionally,
                            0e0 and null.float are represented with L
                            equal to 0 and 15, respectively.

6     [0-1]                 For timestamp values, a VarInt offset and
                            VarUInt year are required. Thus, type code 6
                            with L equal to zero or one is illegal.

14    [0]*,[1-2],[15]       Annotation wrappers must have one
                            annot_length field, at least one annot
                            field, and exactly one value field. Null
                            annotation wrappers are illegal.
                            Note: Since 0xE0 signals the start of the
                            BVM, encountering this octet where a type
                            descriptor is expected should only cause
                            parsing errors when it is not followed by
                            the rest of the BVM octet sequence.

15    [0-15]                The type code 15 is illegal in Ion 1.0 data.
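The table above can be expressed as a small predicate. The following
Python sketch is a non-normative illustration; the function name
is_illegal_type_descriptor is hypothetical. It flags 0xE0 as illegal and
leaves it to the caller to check first whether that octet begins a BVM.

```python
def is_illegal_type_descriptor(octet: int) -> bool:
    """Return True if the octet is an illegal Ion 1.0 type descriptor."""
    type_code, l_field = octet >> 4, octet & 0x0F
    if type_code == 1:                 # bool: only 0, 1, and 15 are legal
        return 2 <= l_field <= 14
    if type_code == 3:                 # negative int: zero uses type code 2
        return l_field == 0
    if type_code == 4:                 # float: 0e0, 32-bit, 64-bit, or null
        return l_field not in (0, 4, 8, 15)
    if type_code == 6:                 # timestamp: offset and year required
        return l_field <= 1
    if type_code == 14:                # annotation wrapper
        return l_field <= 2 or l_field == 15
    return type_code == 15             # reserved type code
```

Under this predicate, 48 of the 256 possible octets are illegal type
descriptors.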
5. Ion Symbols & Symbol Tables
Ion symbols are critical to the notation's performance and
space-efficiency.
In Ion binary, all symbols are represented as integers. These integers
are symbol IDs whose corresponding text forms are defined by a
symbol table.
In Ion text, symbols are represented in three ways:
o Quoted symbol: a sequence of zero or more characters between
single-quotes, e.g., 'hello', 'a symbol', '123', ''. This
representation can denote any symbol text.
o Identifier: an unquoted sequence of one or more ASCII letters, digits,
or the characters $ (dollar sign) or _ (underscore), not starting with
a digit.
o Operator: an unquoted sequence of one or more of the following
nineteen ASCII characters:
!#%&*+-./;<=>?@^`|~
Operators can only be used as (direct) elements of an S-expression. In
any other context those characters require single-quotes.
A subset of identifiers have special meaning:
o Symbol Identifier: an identifier that starts with $ (dollar sign)
followed by one or more digits. These identifiers directly represent
the symbol's integer symbol ID, not the symbol's text. This form is
not typically visible to users, but they should be aware of the
reserved notation so they don't attempt to use it for other purposes.
By convention, symbols starting with $ should be reserved for system
tools, processing frameworks, and the like, and should be avoided by
applications and end users. In particular, the symbol $ion and all
symbols starting with $ion_ are reserved for use by the Ion notation and
by related standards.
5.1. Symbol Tables
There are two kinds of symbol tables: shared and local.
A shared symbol table is intended for use from multiple data sources.
Each shared table is uniquely identified by a (string) name and (int)
version number. Shared symbol tables are key to the compactness of Ion
binary data, extracting the text of frequently used symbols (field
names, enumerations, keywords, etc.) out of individual documents and
into a common data structure.
A system symbol table is a shared table with the name "$ion". Every
version of the Ion notation maps to a specific version of the system
symbol table.
A local symbol table is for the sole use of a well-defined scope of Ion
data. Since it does not need to be referenced from other contexts, it
has no name or version number. Local tables may import symbols defined
in one or more shared tables or import the symbols in the previously
defined local symbol table (but not both). Local tables also accumulate
any other symbols encountered within their scoped data, such as when
encoding Ion text into binary.
At any point during processing, there is a current symbol table, which
is either a local symbol table or a system symbol table. At the start of
input, the current symbol table is initialized to be the system symbol
table for Ion 1.0. The current symbol table is only changed in two
circumstances: (1) encountering a system identifier at the top-level,
or (2) encountering a local symbol table at the top-level.
5.1.1. The Catalog
This specification refers to a "catalog". That's simply an abstraction
for the set of available Ion shared symbol tables. It's not necessarily
a static set: one could implement a catalog that pulls symbol tables
from a network repository, or one that has application symbol tables
"compiled in", or (very likely) some composition of these techniques.
The mechanism by which shared symbol tables are acquired is irrelevant
to this specification.
5.1.2. Top-Level Semantics
Symbol tables only (meaningfully) occur at the top level of a data
stream or datagram. An Ion data stream is structured as follows:
o An initial Ion Version Marker is required in binary data, and optional
(but highly recommended) in text. All Ion text implicitly starts with
$ion_1_0 when not explicitly provided. Every (top-level) IVM switches
the parser to the indicated Ion version and sets the current symbol
table to the indicated Ion system symbol table.
o Every top-level value (and all the hierarchical data within it) is
interpreted with respect to the current symbol table at the point
where the value starts.
o Every top-level local symbol table becomes the current symbol table
for the value(s) following it. A local table may be injected or
extended by the implementation during processing of the rest of
the stream.
o Certain top-level values such as IVMs and local symbol tables are
referred to as system values; all other values are referred to as user
values. An Ion implementation may give applications the ability to
"skip over" the system values, since they are generally irrelevant to
the semantics of the user data.
5.1.3. System Symbols
The version included in the system identifier is independent of the
version of the implied system symbol table (named "$ion"). Each version
of the Ion specification defines the corresponding system symbol table
version. Ion 1.0 uses the "$ion" symbol table, version 1, and future
versions of Ion will use larger versions of the "$ion" symbol table.
$ion_1_1 will probably use version 2, while $ion_2_0 might use
version 5.
Applications and users should never have to care about these symbol
table versions, since they are never explicit in user data: this
specification disallows (by ignoring) imports named "$ion".
Here are the system symbols for Ion 1.0.
Symbol ID    Symbol Name
    1        $ion
    2        $ion_1_0
    3        $ion_symbol_table
    4        name
    5        version
    6        imports
    7        symbols
    8        max_id
    9        $ion_shared_symbol_table
Equivalently:
$ion_shared_symbol_table::
{
name: "$ion", version: 1,
symbols:
[ "$ion", "$ion_1_0", "$ion_symbol_table", "name", "version",
"imports", "symbols", "max_id", "$ion_shared_symbol_table"
]
}
5.2. Ion Version Markers
In Ion text, the Ion Version Marker (IVM) is represented by the
following symbol.
$ion_1_0
This stand-alone symbol is recommended at the start of Ion text data. It
identifies a specific major/minor version of the Ion notation. It resets
the current symbol table to be the corresponding system symbol table,
and simultaneously switches the parser into the appropriate mode for
parsing the right version of Ion notation.
A version marker can also occur at non-initial positions at the top
level, and it will have the same effect; when encountered below
top-level, it has no processing effect and is treated as an ordinary
user value.
IVMs do not have annotations. The input ann::$ion_1_0 is not a version
marker; it is a symbol with an annotation.
In Ion binary, a special sequence of bytes represents the IVM.
E0 01 00 EA
This sequence of bytes can only appear at the top-level, much like the
text IVM, and can occur at non-initial positions as well. Note that this
particular form is equivalent to its textual counterpart $ion_1_0 and
has the same processing semantics, but is a special encoding artifact in
the binary format.
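As a minimal sketch of how a reader might recognize the binary IVM, the
following Python compares four octets against the sequence given above.
The names BINARY_IVM_1_0 and is_binary_ivm are hypothetical, and the
caller is assumed to pass an offset that lies on a top-level value
boundary.

```python
# The four octets of the Ion 1.0 binary version marker, as listed above.
BINARY_IVM_1_0 = bytes([0xE0, 0x01, 0x00, 0xEA])


def is_binary_ivm(stream: bytes, offset: int = 0) -> bool:
    """Return True if the four octets at `offset` are the binary IVM.

    Hypothetical helper: `offset` is assumed to be the start of a
    top-level value, since the marker has no meaning anywhere else.
    """
    return stream[offset:offset + 4] == BINARY_IVM_1_0
```

A truncated stream simply fails the comparison, so no separate length
check is needed in this sketch.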
At the top-level, any encoding of $ion_1_0 that does not match the forms
specified above is a system value that has no processing semantics
(a NOP).
Below are examples of the symbol $ion_1_0 that are not interpreted
as IVMs:
// explicitly quoted
'$ion_1_0'
// explicitly quoted with some newline escapes
'$ion_\
1\
_\
0'
// symbol ID mapping $ion_1_0 declared in the system symbol table
$2
$ion_symbol_table::
{
symbols:["$ion_1_0"]
}
// a locally declared symbol ID mapping to $ion_1_0
$10
It is important to round-trip the forms above correctly. Here is an
example of IVMs mixed with these NOP encodings:
// IVM
$ion_1_0
$ion_symbol_table::
{
symbols:["a"]
}
// not the IVM
'$ion_1_0'
// also not the IVM
$2
// maps to "a"
$10
The above is equivalent to the following, more concise Ion:
$ion_1_0
a
Here is a bad example of re-encoding the previous example in a
naive way:
// IVM
$ion_1_0
$ion_symbol_table::
{
symbols:["a"]
}
// quoted form improperly got converted to an IVM
$ion_1_0
// ERROR! the following symbol ID is not defined
$10
The problem with the above example is that the conversion of '$ion_1_0'
to $ion_1_0 changed it from being a NOP to an IVM which resets the
current symbol table to the system symbol table.
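The distinction between an IVM and its NOP look-alikes can be sketched in
Python as follows. This is an illustration only: is_text_ivm and
write_symbol are hypothetical names, the reader is assumed to expose the
raw (still quoted or unquoted) token text, and write_symbol covers only
the IVM case, not the general quoting rules for symbols.

```python
def is_text_ivm(raw_token: str, annotations: tuple = (), depth: int = 0) -> bool:
    """True only for the bare, unannotated, top-level token $ion_1_0.

    `raw_token` is the token exactly as written, so the quoted form
    '$ion_1_0' and the SID form $2 both compare unequal.
    """
    return depth == 0 and not annotations and raw_token == "$ion_1_0"


def write_symbol(text: str, depth: int = 0) -> str:
    """Render a user symbol, quoting it if the bare form would be an IVM.

    Sketch only: every other symbol is returned unchanged, which ignores
    the general rules about when a symbol needs quotes.
    """
    if depth == 0 and text == "$ion_1_0":
        return "'$ion_1_0'"
    return text
```

Keeping the quotes on output preserves the NOP, so the local symbol table
that follows remains in effect for later symbol IDs.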
5.2.2. Local Symbol Tables
A local symbol table defines symbols through two mechanisms, both of
which are optional.
First, it imports the symbols from one or more shared symbol tables,
offsetting symbol IDs appropriately so they do not overlap. Instead of
importing the symbols from shared symbol tables, a local symbol table
may import the current symbol table.
Second, it defines local symbols similarly to shared tables. The latter
aspect is generally not managed by users: the system uses this form in
the binary encoding to record local symbols encountered during parsing.
// a local symbol table that resets the context, imports some shared
// symbol tables and adds three local symbols
$ion_symbol_table::
{
imports:[ { name: "com.amazon.ols.symbols.offer",
version: 1,
max_id: 75 },
// ...
],
symbols:[ "rock", "paper", "scissors" ]
}
// a local symbol table that adds two local symbols to the context
$ion_symbol_table::
{
imports:$ion_symbol_table,
symbols:[ "lizard", "spock" ]
}
When immediately following an explicit system ID, a top-level struct
whose first annotation is $ion_symbol_table is interpreted as a local
symbol table. If the struct is null (null.struct) then it is treated as
if it were an empty struct.
The imports field should be the symbol $ion_symbol_table or a list as
specified in the following section.
The symbols field should be a list of strings. If the field is missing
or has any other type, it is treated as if it were an empty list.
Null elements in the symbols list declare unknown symbol text ("gaps")
for its SID within the sequence. Any element of the list that is not a
string must be interpreted as if it were null. Any SIDs that refer to
null slots in a local symbol table are equivalent to symbol zero.
Any other field (including, for example, name or version) is ignored.
5.2.2.1. Imports
A local symbol table implicitly imports the system symbol table that is
active at the point where the local table is encountered.
If the value of the imports field is the symbol $ion_symbol_table, then
all of the symbol ID assignments in the current symbol table are
imported into the new local table. Thus, if the current symbol table was
the system symbol table, then processing is identical to having no
imports field value.
If the value of the imports field is a list, each element of the list
must be a struct; each element that is null or is not a struct
is ignored.
Each import (including the implicit system table import) allocates a
contiguous, non-overlapping sequence of symbol IDs. The system symbols
start at 1, each import starts one past the end of the previous import,
and the local symbols start immediately after the last import. The size
of each import's subsequence is defined by the max_id on the import
statement, regardless of the actual size of the referenced table.
Import structs in an import list are processed in order as follows:
o If no name field is defined, or if it is not a non-empty string, the
import clause is ignored.
o If the name field is "$ion", the import clause is ignored.
o If no version field is defined, or if it is null, not an int, or less
than 1, act as if it is 1.
o If a max_id field is defined but is null, not an int, or less than
zero, act as if it is undefined.
o Select a shared symbol table instance as follows:
o Query the catalog to retrieve the specified table by name
and version.
o If an exact match is not found:
o If max_id is undefined, implementations MUST raise an error and
halt processing.
o Otherwise query the catalog to retrieve the table with the given
name and the greatest version available.
o If no table has been selected, substitute a dummy table containing
max_id undefined symbols.
o If max_id is undefined, set it to the largest symbol ID of the
selected table (which will necessarily be an exact match).
o Allocate the next max_id symbol IDs to this imported symbol table.
After processing imports, a number of symbol IDs will have been
allocated, including at least those of a system symbol table. This
number is always well-defined, and any local symbols will be numbered
immediately beyond that point. We refer to the smallest local symbol ID
as the local min_id.
Note: This specification allows a local table to declare multiple
imports with the same name, perhaps even the same version.
Such a situation provides redundant data and allocates unnecessary
symbol IDs but is otherwise harmless.
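The import-processing steps above can be sketched in Python. This is one
possible reading of the algorithm, not a reference implementation. The
SharedTable class, the catalog shape (a dict from table name to a dict of
versions) and the function name resolve_imports are assumptions made for
illustration; import structs are modeled as plain dicts.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# The nine Ion 1.0 system symbols, occupying SIDs 1 through 9.
SYSTEM_SYMBOLS = [
    "$ion", "$ion_1_0", "$ion_symbol_table", "name", "version",
    "imports", "symbols", "max_id", "$ion_shared_symbol_table",
]


@dataclass
class SharedTable:
    """Hypothetical in-memory form of a shared symbol table."""
    name: str
    version: int
    symbols: List[Optional[str]]  # None marks a gap


# Hypothetical catalog shape: table name -> {version -> SharedTable}.
Catalog = Dict[str, Dict[int, SharedTable]]


def _is_int(value) -> bool:
    return isinstance(value, int) and not isinstance(value, bool)


def resolve_imports(imports: list, catalog: Catalog) -> List[Optional[str]]:
    """Return symbol texts indexed by SID - 1; None means unknown text."""
    sid_text: List[Optional[str]] = list(SYSTEM_SYMBOLS)
    for clause in imports:
        if not isinstance(clause, dict):
            continue  # null or non-struct elements are ignored
        name = clause.get("name")
        if not isinstance(name, str) or name == "" or name == "$ion":
            continue  # unusable name, or an explicit "$ion" import
        version = clause.get("version")
        if not _is_int(version) or version < 1:
            version = 1
        max_id = clause.get("max_id")
        if not _is_int(max_id) or max_id < 0:
            max_id = None  # treated as undefined
        versions = catalog.get(name, {})
        table = versions.get(version)
        if table is None:
            if max_id is None:
                raise LookupError(f"no exact match for {name!r} version {version}")
            # Fall back to the greatest available version, if any.
            table = versions[max(versions)] if versions else None
        symbols = table.symbols if table is not None else []
        if max_id is None:
            max_id = len(symbols)
        # Truncate or pad with unknown text so exactly max_id SIDs are used.
        sid_text.extend(symbols[:max_id])
        sid_text.extend([None] * (max_id - len(symbols)))
    return sid_text
```

With this representation the local min_id is len(resolve_imports(...)) + 1.
A missing table with a declared max_id contributes only None entries,
which corresponds to the dummy table described above.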
5.2.2.2. Semantics
When mapping from symbol ID to string, there is no ambiguity. However,
due to unavailable imports, certain IDs may appear to be undefined when
binary data is decoded. Any symbol ID outside the range of the local
symbol table (or the system symbol table if no local symbol table is
defined) under which it is encoded MUST raise an error.
When mapping from string to symbol ID, there may be multiple assigned
IDs; implementations MUST select the lowest known ID. If an imported
table is unavailable, this may cause selection of a greater ID than
would be the case otherwise. This restriction ensures that symbols
defined by system symbol tables can never be mapped to other IDs.
Put another way, string-to-SID mappings have the following precedence:
o The system table is always consulted first.
o Each imported table is consulted in the order of import.
o Local symbols are last.
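The precedence above falls out naturally if the symbol texts are kept in
one list ordered by SID. The following Python sketch assumes such a list
(system symbols first, then imports, then local symbols); the function
name sid_for_text is hypothetical.

```python
from typing import List, Optional


def sid_for_text(sid_text: List[Optional[str]], text: str) -> Optional[int]:
    """Return the lowest SID whose text equals `text`, or None.

    `sid_text[i]` holds the text for SID i + 1, with None for unknown
    text. Scanning from the front visits the system table, then each
    import in order, then local symbols, so the first hit is the lowest
    known ID.
    """
    for index, candidate in enumerate(sid_text):
        if candidate == text:
            return index + 1
    return None
```

An implementation that needs faster lookups could build a dict once, as
long as it keeps only the first SID seen for each text.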
5.2.3. Shared Symbol Tables
This section defines the serialized form of shared symbol tables.
Unlike local symbol tables, the Ion parser does not intrinsically
recognize or process this data; it is up to higher-level specifications
or conventions to define how shared symbol tables are communicated.
$ion_shared_symbol_table::
{
name: "com.amazon.ols.symbols.offer",
version: 1,
imports: // For informational purposes only.
[
{ name:"..." , version:1 },
// ...
],
symbols:
[
"fee", "fie", "foe", /* ... */ "hooligan"
]
}
A shared symbol table is serialized as a struct with the annotation
$ion_shared_symbol_table.
The name field should be a string with length at least one. If the
field has any other value, then materialization of this symbol table
must fail.
The version field should be an int and at least 1. If the field is
missing or has any other value, it is treated as if it were 1.
The imports field is for informational purposes only in shared tables.
It asserts that this table contains a superset of the strings in each
of the named tables. It makes no assertion about any relationship
between symbol IDs in this table and the imports, only that the symbols'
text occurs here. An implementation MAY issue a warning if these claims
don't match what's in the symbols field.
The symbols field should be a list of strings. If the field is missing
or has any other type, it is treated as if it were an empty list.
Null elements declare undefined symbol IDs ("gaps") within the sequence;
implementations must handle requests for such symbols the same as if the
requested ID were beyond the end of the list. Any element of the list that is
not a string must be interpreted as if it were null.
A few things worth noting:
o Shared symbol tables do not make use of a max_id field since the
largest SID is implicit in the symbols list. If a max_id field exists,
it must be ignored.
o A shared table isn't coupled to any particular system table, so it can
be used in any context.
o The algorithm for SID assignment differs between shared and local
tables. SIDs in shared tables always start at one. SIDs in local tables
are always offset by the sum of the sizes of the system symbol table and
all imported tables.
5.2.3.1. Semantics
Symbol IDs are assigned to the symbol strings in order of their
appearance in the list: the first element has symbol ID 1 (aka $1), the
last has the symbol ID equal to the length of the list.
When mapping from symbol ID to string, a simple index into the list is
all that's needed.
When mapping from string to symbol ID, there may be multiple associated
IDs (the same string could appear twice as children of the symbols
field). Implementations MUST select the lowest known ID, and all other
associated IDs MUST be handled as if undefined.
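A minimal Python sketch of materializing a shared symbol table under the
rules of this section follows. The struct is modeled as a plain dict and
load_shared_table is a hypothetical name. Turning later duplicates into
gaps is one reading of the "handled as if undefined" rule, not the only
possible one.

```python
from typing import Dict, List, Optional, Tuple


def load_shared_table(struct: dict) -> Tuple[str, int, List[Optional[str]], Dict[str, int]]:
    """Return (name, version, symbols, text_to_sid) for a shared table.

    `symbols[i]` is the text for SID i + 1, or None for a gap.
    """
    name = struct.get("name")
    if not isinstance(name, str) or len(name) < 1:
        raise ValueError("shared symbol table requires a non-empty string name")
    version = struct.get("version")
    if not isinstance(version, int) or isinstance(version, bool) or version < 1:
        version = 1  # missing or malformed versions are treated as 1
    raw = struct.get("symbols")
    raw = raw if isinstance(raw, list) else []
    # Any element that is not a string is treated as null (a gap).
    symbols: List[Optional[str]] = [s if isinstance(s, str) else None for s in raw]
    text_to_sid: Dict[str, int] = {}
    for sid, text in enumerate(symbols, start=1):
        if text is None:
            continue
        if text in text_to_sid:
            symbols[sid - 1] = None  # later duplicate: handled as undefined
        else:
            text_to_sid[text] = sid
    # A max_id field, if present, is deliberately ignored.
    return name, version, symbols, text_to_sid
```

The returned list gives the SID-to-text mapping by simple indexing, and
the dict gives the text-to-SID mapping with the lowest ID winning.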
5.2.3.2. Versioning
A shared symbol table with version greater than one should usually be a
strict extension of the immediately preceding version, but Ion does not
(and in reality cannot) enforce this. Symbols may be removed, but they
cannot be renumbered or given different text. This ensures that when
version N is requested, any version larger than N can be used without
changing semantics. However, if symbols become undefined then some
extant data may become unreadable when an exact-match import cannot
be found.
The use of symbol tables that violate these restrictions will lead to
undefined and potentially incorrect interpretation of Ion data.
Therefore implementations should enforce these restrictions at
appropriate points.
Version N+1 of a table MAY be the same as version N.
5.3. Symbol Zero
SID zero (i.e. `$0`) is a special symbol that is not assigned text by any
symbol table, even the system symbol table. Symbol zero always has unknown
text, and can be useful in synthesizing symbol identifiers where the text
image of a symbol is not known in a particular operating context.
It is important to note that `$0` is only semantically equivalent to itself
and to locally-declared SIDs with unknown text. It is not semantically
equivalent to SIDs with unknown text from shared symbol tables, so
replacing such SIDs with `$0` is a destructive operation to the semantics
of the data.
5.4. Data Model
An important consideration for symbols is what semantics they have in
the Ion data model. Any symbol which has the same text image as another
symbol irrespective of the ID integer or the shared symbol table (if
applicable) used to encode it is considered to be equivalent.
Ion symbols may have text that is unknown. That is, there is no binding
to a (potentially empty) sequence of text. This can happen as a result
of not having access to a shared symbol table being imported, or having
a symbol table (shared or local) that contains a null slot.
When operating on data that contains symbols with unknown text, it is
important to not treat them as equivalent unless any of the
following hold:
o Symbols with unknown text declared in a local symbol table are all
equivalent to one another and to SID 0.
o For symbols defined from shared symbol table imports, symbols are
equivalent only if all of the following hold:
- The name of the table that the symbols were imported from is the
same string.
- The position in the table that the symbols were imported from is the
same spot. Note that this is not the same as the local SID value,
but can be calculated from the SIDs by the allocation
algorithm above.
o SID 0 is only equivalent to itself.
A processor encountering a symbol with unknown text and a valid SID
other than $0 MAY produce an error because this means that the context
of the data is missing, however any implementation that chooses not to
MUST conform to the above semantics with respect to round-tripping data.
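The equivalence rules in this section can be sketched in Python. The
SymbolToken type and its fields are hypothetical: text is None when the
text is unknown, and import_location is set only for symbols that came
from a shared symbol table import (table name and position within that
table). Locally declared gaps and $0 carry no import location.

```python
from typing import NamedTuple, Optional, Tuple


class SymbolToken(NamedTuple):
    """Hypothetical model of a symbol value in the data model."""
    text: Optional[str]                                # None = unknown text
    import_location: Optional[Tuple[str, int]] = None  # (table name, position)


def symbols_equivalent(a: SymbolToken, b: SymbolToken) -> bool:
    """Compare two symbols under the rules described above."""
    if a.text is not None or b.text is not None:
        # Known text decides equivalence, regardless of SID or source table.
        return a.text == b.text
    # Both texts are unknown. Local gaps and $0 have no import location and
    # are equivalent to one another; imported symbols must agree on both the
    # table name and the position.
    return a.import_location == b.import_location
```

Under this model, replacing an imported symbol of unknown text with $0
changes the result of the comparison, which is why the text above calls
that replacement destructive.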
5.5. Examples
A typical text document looks like:
$ion_1_0
$ion_symbol_table::
{
imports:[{ name:"com.amazon.ols.symbols.offer", version:1 },
{ name:"com.amazon.ims3.symbols.submission", version:1 }]
}
// Here's the user data, one or more top-level values.
submission::{ /* ... */ local_symbol /* ... */ }
submission::{ /* ... */ 'another one' /* ... */ }
The example above shows a local table with imports but no symbols. This
is a typical scenario for human-authored data. When parsing this text,
the local table will be extended on the fly to contain any new symbols.
Here's the same data printed after parsing, in which the local table has
been extended with symbols encountered in the user data.
$ion_1_0
$ion_symbol_table::
{
imports:[{ name:"com.amazon.ols.symbols.offer", version:1, max_id:75 },
{ name:"com.amazon.ims3.symbols.submission", version:1, max_id:100 }],
symbols:["local_symbol", "another one"]
}
submission::{ /* ... */ local_symbol /* ... */ }
submission::{ /* ... */ 'another one' /* ... */ }
Since $ion_1_0 defines nine symbols ($1 through $9), the offer table
covers IDs $10 through $84, the submission table covers IDs $85
through $184, and local symbols start at $185.
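The SID ranges in this example follow directly from the allocation rule
in the Imports section. As a worked check, the short Python sketch below
recomputes them; the function name sid_ranges is hypothetical.

```python
from typing import List, Tuple


def sid_ranges(system_size: int, import_max_ids: List[int]) -> Tuple[List[Tuple[int, int]], int]:
    """Return the inclusive SID range of each import and the local min_id."""
    ranges = []
    next_sid = system_size + 1  # the system symbols occupy 1..system_size
    for max_id in import_max_ids:
        ranges.append((next_sid, next_sid + max_id - 1))
        next_sid += max_id
    return ranges, next_sid


# Ion 1.0 has 9 system symbols; the imports declare max_id 75 and 100.
print(sid_ranges(9, [75, 100]))  # ([(10, 84), (85, 184)], 185)
```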
Here's the same data as above serialized with a local symbol table being
"flushed" between each top-level value.
$ion_1_0
$ion_symbol_table::
{
imports:[{ name:"com.amazon.ols.symbols.offer", version:1, max_id:75 },
{ name:"com.amazon.ims3.symbols.submission", version:1, max_id:100 }],
symbols:["local_symbol"]
}
submission::{ /* ... */ local_symbol /* ... */ }
$ion_symbol_table::
{
imports:$ion_symbol_table,
symbols:["another one"]
}
submission::{ /* ... */ 'another one' /* ... */ }
In this case, the first local symbol table generated only needs to add
one new local symbol for the top-level value being serialized in its
context. The second symbol table adds a subsequent symbol to the context
for the value immediately following it. This pattern of local symbol
tables allows top-level values to be written to a stream without knowing
all symbols ahead of time.
5.3.x Annotating local symbol tables
Although a local symbol table struct may have multiple annotations, its
first annotation must be $ion_symbol_table in order to be interpreted as
a local symbol table.
The following will be interpreted as a valid local symbol table:
$ion_symbol_table::annotated::
{
symbols:["a", "b"]
}
The example below, however, will be interpreted as a simple struct with
two annotations:
annotated::$ion_symbol_table::
{
symbols:["a", "b"]
}
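The two cases above differ only in which annotation comes first. A
minimal Python sketch of that check, with the hypothetical name
is_local_symbol_table and annotations modeled as a list of strings:

```python
from typing import Sequence


def is_local_symbol_table(annotations: Sequence[str], is_struct: bool, depth: int = 0) -> bool:
    """True for a top-level struct whose first annotation is $ion_symbol_table."""
    return (
        depth == 0
        and is_struct
        and len(annotations) > 0
        and annotations[0] == "$ion_symbol_table"
    )
```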
6. Ion Strings & Clobs
This document clarifies the semantics of the Amazon Ion string and clob
data types with respect to escapes and the [Unicode][2] standard.
As of the date of this writing, the Unicode Standard is on
[version 10.0][3]. This specification is written against that version.
6.1. Unicode Primer
The Unicode standard specifies a large set of code points, the Universal
Character Set (UCS); each code point is an integer in the range 0 (0x0)
through 1,114,111 (0x10FFFF) inclusive. Throughout this document, the notation
U+HHHH and U+HHHHHHHH refer to the Unicode code point HHHH and HHHHHHHH
respectively as a hexadecimal ordinal. This notation follows the Unicode
standard convention.
Traditionally, from a programmer's perspective, a code point can be thought
of as a character, but there is sometimes a subtle distinction. For
example, in Java, the char type is an unsigned, 16-bit integer, which is
normally used to hold UTF-16 code units (e.g. [java.lang.CharSequence][4]).
For the Unicode code point Mathematical Bold Capital "A" (code point
U+0001D400), this is encoded in a UTF-16 string as two units: 0xD835
followed by 0xDC00. So in this case, Java's UTF-16 representation actually
uses two character (i.e. char) values to represent one Unicode code point.
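The two code units can be derived with the standard UTF-16 surrogate
arithmetic. The following Python sketch shows the calculation; the
function name utf16_code_units is hypothetical.

```python
from typing import List


def utf16_code_units(code_point: int) -> List[int]:
    """Return the UTF-16 code units for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {code_point:#x}")
    if code_point < 0x10000:
        return [code_point]  # one unit is enough for U+0000..U+FFFF
    offset = code_point - 0x10000            # 20 significant bits remain
    high = 0xD800 + (offset >> 10)           # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits
    return [high, low]


print([hex(unit) for unit in utf16_code_units(0x1D400)])  # ['0xd835', '0xdc00']
```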
This document attempts to avoid using the term character when referring to
Unicode code points. The reasoning for this is partly stated above, but
also has to do with the overloaded nature of the term (e.g. a user
character or grapheme). For more details, consult section [3.4 of the
Unicode Standard][5].
Another interesting aspect of the UCS is a block of code points that is
reserved exclusively for use in the UTF-16 encoding (i.e. surrogate code
points). As such, strictly speaking, no encoding of Unicode is allowed to
represent the code points in the inclusive range U+D800 to U+DFFF. In the
UTF-16 case, these code points are only allowed to be used in the encoding
to specify characters in the U+00010000 to U+0010FFFF range. Refer to
sections [3.8 and 3.9 of the Unicode Standard][5] for details.
6.2. Ion String
The Ion String data type is a sequence of Unicode code points. The Ion
semantics of this are agnostic to any particular Unicode encoding (e.g.
UTF-16, UTF-8), except for the concrete syntax specification of the Ion
binary and text formats.
6.2.1. Text Format
See the [grammar][6] for a formal definition of the Ion Text encoding for
the string type.
Multiple Ion long string literals that are separated only by zero or more
whitespace characters are concatenated automatically. For example, the
following two blocks of Ion text syntax are semantically equivalent. Note
that short string literals do not exhibit this behavior.
"1234" '''Hello''' '''World'''
"1234" "HelloWorld"
Each individual long string literal must be a valid Unicode character
sequence when unescaped. The following examples are invalid due to
splitting Unicode escapes, an escaped surrogate pair, and a common escape,
respectively.
'''\u''' '''1234'''
'''\U0000''' '''1234'''
'''\uD800''' '''\uDC00'''
'''\''' '''n'''
Within long string literals unescaped newlines are normalized such that
U+000D U+000A pairs (CARRIAGE RETURN and LINE FEED respectively) and U+000D
are replaced with U+000A. This is to facilitate compatibility across
operating systems.
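A minimal sketch of this normalization in Python, assuming it is applied
to the raw body of a long string literal before escape sequences are
processed. The function name normalize_newlines is hypothetical.

```python
import re


def normalize_newlines(raw_body: str) -> str:
    """Replace each CR LF pair and each lone CR with a single LF."""
    return re.sub(r"\r\n?", "\n", raw_body)
```

Because the pattern consumes an optional LF after each CR, a CR LF pair
yields one LF rather than two.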
Normalization can be subverted by using a combination of escapes:
CARRIAGE RETURN only:
'''one\r\
two'''
CARRIAGE RETURN and LINE FEED:
'''one\r
two'''
Escaped newlines are not replaced with any characters (i.e. the newline is
removed). In addition, the following table describes the string escape
sequences that have direct code point replacement for all quoted string and
symbol forms.
Unicode Code Point Ion Escape Semantics
U+0007 \a BEL (alert)
U+0008 \b BS (backspace)
U+0009 \t HT (tab)
U+000A \n LF (linefeed)
U+000C \f FF (form feed)
U+000D \r CR (carriage return)
U+000B \v VT (vertical tab)
U+0022 \" double quote
U+0027 \' single quote
U+003F \? question mark
U+005C \\ backslash
U+002F \/ forward slash
U+0000 \0 NUL (null character)
For the Unicode ordinal string escapes, \U, \u, and \x, the escape must
be followed by a number of hexadecimal digits as described below.
Unicode Code Point  Ion Escape  Semantics
U+HHHHHHHH          \UHHHHHHHH  8-digit hexadecimal Unicode code point
U+HHHH              \uHHHH      4-digit hexadecimal Unicode code point;
                                equivalent to \U0000HHHH
U+00HH              \xHH        2-digit hexadecimal Unicode code point;
                                equivalent to \u00HH and \U000000HH
Ion does not specify the behavior when escape sequences are used to
denote invalid Unicode code points or surrogate code points (used only
for UTF-16). It is highly recommended that Ion implementations reject
such escape sequences as they are not proper Unicode as specified by the
standard. To this point, consider the Ion string sequence
"\uD800\uDC00". A compliant parser may throw an exception because
surrogate characters are specified outside of the context of UTF-16,
accept the string as a technically invalid sequence of two Unicode code
points (i.e. U+D800 and U+DC00), or interpret it as the single Unicode
code point U+00010000. In this regard, the Ion string data type does not
conform to the Unicode specification. A strict Unicode implementation of
the Ion text format should not accept such sequences.
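The strict behavior recommended above might look like the following
Python sketch. It handles only the three ordinal escapes and leaves every
other escape untouched for a later pass; the function name
decode_ordinal_escapes is hypothetical, and rejecting surrogates is the
recommended choice rather than a requirement.

```python
import re

# A backslash followed by an ordinal escape, or by any other single
# character (so that an escaped backslash is consumed as a pair).
_ESCAPE = re.compile(
    r"\\(?:U([0-9A-Fa-f]{8})|u([0-9A-Fa-f]{4})|x([0-9A-Fa-f]{2})|(.))",
    re.DOTALL,
)


def decode_ordinal_escapes(body: str) -> str:
    """Expand \\U, \\u and \\x escapes, rejecting invalid code points."""

    def replace(match):
        digits = match.group(1) or match.group(2) or match.group(3)
        if digits is None:
            if match.group(4) in "Uux":
                raise ValueError(f"malformed escape at offset {match.start()}")
            return match.group(0)  # some other escape, left untouched here
        code_point = int(digits, 16)
        if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
            raise ValueError(f"escape denotes invalid code point U+{code_point:04X}")
        return chr(code_point)

    return _ESCAPE.sub(replace, body)
```

A parser taking the more permissive routes described above would replace
the second ValueError with its own handling of surrogate values.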
6.2.2. Binary Format
The Ion binary format encodes the string data type directly as a sequence
of UTF-8 octets. A strict, Unicode compliant implementation of Ion should
not allow invalid UTF-8 sequences (e.g. surrogate code points, overlong
values, and values outside of the inclusive range, U+0000 to U+0010FFFF).
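In Python, the built-in UTF-8 codec in strict mode already enforces these
constraints, so a strict reader could be sketched as below. The function
name decode_ion_binary_string is hypothetical, and the payload is assumed
to be the string's octets with the type descriptor and length already
removed.

```python
def decode_ion_binary_string(payload: bytes) -> str:
    """Decode string octets, raising UnicodeDecodeError on invalid UTF-8.

    Python's strict UTF-8 decoder rejects encoded surrogates, overlong
    forms, and values above U+10FFFF.
    """
    return payload.decode("utf-8", errors="strict")
```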
6.3. Ion Clob
An Ion clob type is similar to the blob type except that the denotation in
the Ion text format uses an ASCII-based string notation rather than a
base64 encoding to denote its binary value. It is important to make the
distinction that clob is a sequence of raw octets and string is a sequence
of Unicode code points.
6.3.1. Text Format
See the [grammar][6] for a formal definition of the Ion Text encoding for
the clob type.
Similar to string, adjoining long string literals within an Ion clob are
concatenated automatically. Within a clob, only one short string literal or
multiple long string literals are allowed. For example, the following two
blocks of Ion text syntax are semantically equivalent.
{{ '''Hello''' '''World''' }}
{{ "HelloWorld" }}
The rules for the quoted strings within a clob are similar to those for
the string type, with the following exceptions. Unicode newline
characters in long strings and all verbatim ASCII characters are
interpreted as their ASCII octet values. Non-printable ASCII and
non-ASCII Unicode code points are not allowed unescaped in the string
bodies. Furthermore, the following table describes the clob string
escape sequences that have direct octet replacement for all strings.
Octet Ion Escape Semantics
0x07 \a ASCII BEL (alert)
0x08 \b ASCII BS (backspace)
0x09 \t ASCII HT (tab)
0x0A \n ASCII LF (line feed)
0x0C \f ASCII FF (form feed)
0x0D \r ASCII CR (carriage return)
0x0B \v ASCII VT (vertical tab)
0x22 \" ASCII double quote
0x27 \' ASCII single quote
0x3F \? ASCII question mark
0x5C \\ ASCII backslash
0x2F \/ ASCII forward slash
0x00 \0 ASCII NUL (null character)
The clob escape \x must be followed by two hexadecimal digits. Note that
clob does not support the \u and \U escapes since it represents an octet
sequence and not a Unicode encoding.
Octet Ion Escape Semantics
0xHH \xHH 2-digit hexadecimal octet
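The escape tables above suggest a simple way to print a clob in the short
string form. The Python sketch below is one possible writer, not a
prescribed algorithm: the name clob_to_text is hypothetical, and it uses
named escapes only where one is needed, falling back to \x for every
other octet outside printable ASCII.

```python
# Octets that are written with a named escape in this sketch.
_NAMED_ESCAPES = {
    0x00: "\\0", 0x07: "\\a", 0x08: "\\b", 0x09: "\\t", 0x0A: "\\n",
    0x0B: "\\v", 0x0C: "\\f", 0x0D: "\\r", 0x22: '\\"', 0x5C: "\\\\",
}


def clob_to_text(octets: bytes) -> str:
    """Render octets as an Ion text clob using a short string literal."""
    parts = []
    for octet in octets:
        if octet in _NAMED_ESCAPES:
            parts.append(_NAMED_ESCAPES[octet])
        elif 0x20 <= octet <= 0x7E:
            parts.append(chr(octet))          # printable ASCII, verbatim
        else:
            parts.append(f"\\x{octet:02x}")   # everything else as \xHH
    return '{{ "' + "".join(parts) + '" }}'


print(clob_to_text(b"Hello"))  # {{ "Hello" }}
```

Octets from a non-ASCII code page fall into the last branch, so such data
prints almost entirely as \x escapes.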
It is important to note that clob is a binary type that is designed for
binary values that are either text encoded in a code page that is ASCII
compatible or should be octet editable by a human (escaped string syntax
vs. base64 encoded data). Clearly non-ASCII based encodings will not be
very readable (e.g. the clob for the EBCDIC encoded string representing
"hello" could be denoted as {{ "\x88\x85\x93\x93\x96" }}).
6.3.2. Binary Format
A clob is represented directly as the octet values of the clob value.
7. Real Numbers
Ion supports two types of real numbers: floats and decimals.
Both the text and binary representations of an Ion value stream may be
compressed in one or more GZIP [RFC 1952] members.
7.1. Floats
Ion supports IEEE-754 binary floating point values using the IEEE-754
32-bit (binary32) and 64-bit (binary64) encodings. In the data model,
all floating point values are treated as though they are binary64 (all
binary32 encoded values can be represented exactly in binary64).
7.1.1. Encoding Considerations
In text, binary float is represented using familiar base-10 digits.
While this is convenient for human representation, there is no explicit
notation for expressing a particular floating point value as binary32 or
binary64. Furthermore, many base-10 real numbers have no terminating
representation in base-2 (this document calls such values irrational
with respect to base-2) and cannot be expressed exactly in either binary
floating point encoding (e.g. 1.1e0).
Because of this asymmetry, the following rules for Ion text float
notation MUST be observed when round-tripping to Ion binary:
o Any text notation that can be exactly represented as binary32 MAY be
encoded as either binary32 or binary64 in Ion binary.
o Any text notation that can only be exactly represented as binary64
MUST be encoded as binary64 in Ion binary.
o Any text notation that has no exact representation (i.e. irrational
in base-2 or more precision than the binary64 mantissa), MUST be
encoded as binary64. This is to ensure that irrational numbers or
truncated values are represented in the highest fidelity of the
float data type.
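The three rules above reduce to one question: is the text notation
exactly representable as binary32? The Python sketch below answers it
with exact rational arithmetic. The function name can_encode_as_binary32
is hypothetical, and the input is assumed to be a finite numeric literal
(not nan, +inf or -inf).

```python
import struct
from fractions import Fraction


def can_encode_as_binary32(text: str) -> bool:
    """True if the decimal text is exactly representable as binary32.

    When this returns False, the rules above require binary64.
    """
    exact = Fraction(text)  # exact rational value of the notation
    try:
        packed = struct.pack(">f", float(text))  # round to binary32
        return Fraction(struct.unpack(">f", packed)[0]) == exact
    except OverflowError:
        return False  # magnitude does not fit in binary32
```

A writer may still choose binary64 when this returns True, since the
first rule permits either encoding.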
When encoding a decimal real number that is irrational in base-2 or has
more precision than can be stored in binary64, the exact binary64 value
is determined by using the IEEE-754 round-to-nearest mode with a
round-half-to-even as the tie-break. This mode/tie-break is the common
default used in most programming environments and is discussed in detail in
"Correctly Rounded Binary-Decimal and Decimal-Binary Conversions" (see
http://ampl.com/REFS/rounding.pdf). This conversion algorithm is illustrated
in a straightforward way in Clinger's Algorithm (see
http://www.cesura17.net/~will/professional/research/papers/howtoread.pdf).
When encoding a binary32 or binary64 value in text notation, an
implementation MAY want to consider the approach described in "Printing
Floating-Point Numbers Quickly and Accurately" (see
http://www.cs.indiana.edu/~dyb/pubs/FP-Printing-PLDI96.pdf).
7.1.2. Special Values
The IEEE-754 binary floating point encoding supports special non-number
values. These are represented in the binary format as per the encoding
rules of the IEEE-754 specification, and are represented in text by the
following keywords:
o nan - denotes the not a number (NaN) value.
o +inf - denotes positive infinity.
o -inf - denotes negative infinity.
The Ion data model considers all encodings of positive infinity to be
equivalent to one another and all encodings of negative infinity to be
equivalent to one another. Thus, an implementation encoding +inf or -inf
in Ion binary MAY choose to encode it using the binary32 or
binary64 form.
The IEEE-754 specification has many encodings of NaN, but the Ion data
model considers all encodings of NaN (i.e. all forms of signaling or
quiet NaN) to be equivalent. Note that the text keyword nan does not map
to any particular encoding, the only requirement is that an
implementation emit a bit-pattern that represents an IEEE-754 NaN value
when converting to binary (e.g. the binary64 bit pattern of
0x7FF8000000000000).
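Since all NaN encodings are equivalent in the data model, a reader only
needs to classify a bit pattern, not preserve it. The Python sketch below
tests a binary64 bit pattern for NaN; the names is_nan_bits64 and
bits_of are hypothetical.

```python
import struct


def is_nan_bits64(bits: int) -> bool:
    """True if the 64-bit pattern is any IEEE-754 NaN (quiet or signaling)."""
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return exponent == 0x7FF and fraction != 0


def bits_of(value: float) -> int:
    """Return the binary64 bit pattern of a Python float."""
    return int.from_bytes(struct.pack(">d", value), "big")
```

An all-ones exponent with a zero fraction is an infinity, which is why
the fraction check is required.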
An important consideration is that NaN is not treated in a consistent
manner between programming environments. For example, Java defines a
single canonical NaN value, whose bit pattern (0x7FF8000000000000 for a
double) is a quiet NaN under the IEEE 754-2008 convention. In C/C++, on
the other hand, NaN is mostly platform defined, but on platforms that
support it, the NAN macro is a quiet NaN. In general, common programming
environments give testing routines for NaN, but no consistent way to
represent it.
7.1.3. Examples
To illustrate the text/binary round-tripping rules above, consider the
following examples.
The Ion text literal 2.147483647e9 overflows the 23 bits of significand
in binary32 and MUST be encoded in Ion binary as a binary64 value. The
Ion binary encoding for this text literal is as follows:
0x48 0x41 0xDF 0xFF 0xFF 0xFF 0xC0 0x00 0x00
The base-2 irrational literal 1.2e0 following the rounding and encoding
rules MUST be encoded in Ion binary as:
0x48 0x3F 0xF3 0x33 0x33 0x33 0x33 0x33 0x33
Although the textual representation of 1.2e0 itself is irrational, its
canonical form in the data model is not (based on the rounding rules),
thus the following text forms all map to the same binary64 value:
// the most human-friendly representation
1.2e0
// the exact textual representation in base-10 for the binary64 value
// 1.2e0 represents
1.1999999999999999555910790149937383830547332763671875e0
// a shortened, irrational version, but still the same value
1.1999999999999999e0
// a lengthened, irrational version that is still the same value
1.19999999999999999999999999999999999999999999999999999999e0
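The byte sequences and the equivalence of the text forms in this section
can be checked with a few lines of Python, relying on the fact that
Python's float() performs correctly rounded decimal-to-binary64
conversion. The function name encode_binary64 is hypothetical, and the
leading 0x48 octet is taken from the examples in this section.

```python
import struct


def encode_binary64(text: str) -> bytes:
    """Encode a float literal as the leading octet 0x48 plus 8 big-endian octets."""
    return b"\x48" + struct.pack(">d", float(text))


FORMS_OF_1_2 = [
    "1.2e0",
    "1.1999999999999999555910790149937383830547332763671875e0",
    "1.1999999999999999e0",
    "1.19999999999999999999999999999999999999999999999999999999e0",
]

print(encode_binary64("2.147483647e9").hex())  # 4841dfffffffc00000
print(encode_binary64("1.2e0").hex())          # 483ff3333333333333
print(len({encode_binary64(form) for form in FORMS_OF_1_2}))  # 1
```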
7.2. Decimals
Ion supports a decimal numeric type to allow accurate representation of
base-10 floating point values such as currency amounts. An Ion Decimal
has arbitrary precision and scale. This representation preserves
significant trailing zeros when converting between text and binary forms.
Decimals are supported in addition to the traditional base-2 floating
point type. This avoids the loss of exactness often incurred when storing
a decimal fraction as a binary fraction. Many common decimal numbers with
relatively few digits cannot be represented as a terminating
binary fraction.
7.2.1. Data Model
Ion decimals follow the IBM Hursley Lab General Decimal Arithmetic
Specification (see: http://speleotrove.com/decimal/decarith.html), which
defines an abstract decimal data model (see:
http://speleotrove.com/decimal/damodel.html) represented by the following
3-tuple:
(<sign 0|1>, <coefficient: unsigned integer>, <exponent: integer>)
Decimals should be considered equivalent if and only if their data model
tuples are equivalent, where exponents of +0 and -0 are considered
equivalent. All forms of positive zero are distinguished only by the
exponent. All forms of negative zero, which are distinct from all forms
of positive zero, also are distinguished only by the exponent.
7.2.2. Text Format
The Hursley rules for describing a finite value converting from textual
notation must be followed. The Hursley rules for describing a special
value are not followed; the rules for the following special values are
not applicable to Ion decimals:
o infinity
o nan
Specifically, the rules for getting the integer coefficient from the
decimal-part (digits preceding the exponent) of the textual
representation are specified as follows.
If the decimal-part included a decimal point the exponent is then
reduced by the count of digits following the decimal point (which may
be zero) and the decimal point is removed. The remaining string of
digits has any leading zeros removed (except for the rightmost digit)
and is then converted to form the coefficient which will be zero or
positive.
Where X is any unsigned integer, all of the following formulae can be
demonstrated to be equivalent using the text conversion rules and the
data model.
// Exponent implicitly zero
X.
// Exponent explicitly zero
Xd0
// Exponent explicitly negative zero (equivalent to zero).
Xd-0
Other equivalent representations include the following, where Y is the
number of digits in X.
// There are Y digits past the decimal point in the
// decimal-part, making the exponent zero. One leading zero
// is removed.
0.XdY
For example, all of the following text Ion decimal representations are
equivalent to each other.
0.
0d0
0d-0
0.0d1
Additionally, all of the following are equivalent to each other (but not
to any forms of positive zero).
-0.
-0d0
-0d-0
-0.0d1
Because all forms of zero are distinctly identified by the exponent, the
following are not equivalent to each other.
// Exponent implicitly zero.
0.
// Exponent explicitly 5.
0d5
All of the following are equivalent to each other.
42.
42d0
42d-0
4.2d1
0.42d2
However, the following are not equivalent to each other.
// Text converted to 42.
0.42d2
// Text converted to 42.0
0.420d2
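The conversion rules and examples in this section can be checked with a
short Python sketch. It is an illustration only, under stated
assumptions: parse_ion_decimal is a hypothetical helper, it does not
handle underscores between digits, and it does not reject leading zeros
as a real Ion parser would.

```python
import re

# sign, integer digits, optional fraction, optional 'd' exponent
_DECIMAL_RE = re.compile(r"(-?)(\d+)(?:\.(\d*))?(?:[dD]([+-]?\d+))?")


def parse_ion_decimal(text):
    """Return the (sign, coefficient, exponent) tuple for an Ion text decimal."""
    match = _DECIMAL_RE.fullmatch(text)
    if match is None:
        raise ValueError("not a decimal: %r" % text)
    sign_text, int_digits, frac_digits, exp_text = match.groups()
    if frac_digits is None and exp_text is None:
        raise ValueError("a decimal needs a '.' or a 'd' exponent: %r" % text)
    frac_digits = frac_digits or ""
    # int("-0") is 0, so exponents of +0 and -0 collapse to the same value.
    exponent = int(exp_text) if exp_text else 0
    # Reduce the exponent by the count of digits after the decimal point.
    exponent -= len(frac_digits)
    # int() discards leading zeros from the remaining digit string.
    coefficient = int(int_digits + frac_digits)
    sign = 1 if sign_text == "-" else 0
    return (sign, coefficient, exponent)


forty_two = [parse_ion_decimal(t) for t in ("42.", "42d0", "42d-0", "4.2d1", "0.42d2")]
```

Every entry of forty_two is the tuple (0, 42, 0), whereas 0.420d2 yields
(0, 420, -1), matching the 42. versus 42.0 distinction above.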
7.2.3. Binary Format
The encoding of Ion decimals, which follows the decimal data model
described above, is specified in [Ion Binary Encoding].
The following binary encodings of decimal values are all equivalent
to 0d0.
+-----------------+------------+-------------+
| type descriptor | exponent | coefficient |
| | (VarInt) | (Int) |
+-----------------+------------+-------------+
Most compact encoding of 0d0
+-----------------+
: 0x50 :
+-----------------+
Explicit encoding of 0d0
+-----------------+------------+-------------+
: 0x52 : 0x80 : 0x00 |
+-----------------+------------+-------------+
Explicit encoding of 0d-0
+-----------------+------------+-------------+
: 0x52 : 0xC0 : 0x00 |
+-----------------+------------+-------------+
0d0 with overpadded coefficient
+-----------------+------------+-------------+
: 0x53 : 0x80 : 0x00 0x00 |
+-----------------+------------+-------------+
0d0 with overpadded exponent and coefficient
+-----------------+------------+-------------+
: 0x54 : 0x00 0x80 : 0x00 0x00 |
+-----------------+------------+-------------+
Note: The last two examples demonstrate overpadded encodings of the
exponent and coefficient subfields. Overpadded encodings such as these
are possible for any decimal and are always equivalent to the
unpadded encoding.
The following binary encodings of decimal values are equivalent to -0d0
(but not to 0d0).
+-----------------+------------+-------------+
| type descriptor | exponent | coefficient |
| | (VarInt) | (Int) |
+-----------------+------------+-------------+
Explicit encoding of -0d0
+-----------------+------------+-------------+
: 0x52 : 0x80 : 0x80 |
+-----------------+------------+-------------+
Explicit encoding of -0d-0
+-----------------+------------+-------------+
: 0x52 : 0xC0 : 0x80 |
+-----------------+------------+-------------+
Finally, the following binary encodings of decimal values are equivalent
to 42d0.
+-----------------+------------+-------------+
| type descriptor | exponent | coefficient |
| | (VarInt) | (Int) |
+-----------------+------------+-------------+
Explicit encoding of 42d0
+-----------------+------------+-------------+
: 0x52 : 0x80 : 0x2A |
+-----------------+------------+-------------+
Explicit encoding of 42d-0
+-----------------+------------+-------------+
: 0x52 : 0xC0 : 0x2A |
+-----------------+------------+-------------+
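The encodings shown in this section can be decoded with the following
Python sketch. It is a minimal illustration, not a conforming reader: it
assumes the VarInt and Int layouts of the Ion binary encoding (VarInt:
sign in bit 6 of the first octet, high bit set on the final octet; Int:
sign in the high bit of the first octet), handles only length codes 0
through 13, and treats an absent coefficient as positive zero. The names
read_varint and decode_decimal are hypothetical.

```python
def read_varint(data, pos):
    """Read a VarInt at data[pos]; return (sign, magnitude, next_pos)."""
    first = data[pos]
    sign = 1 if first & 0x40 else 0   # bit 6 of the first octet is the sign
    magnitude = first & 0x3F          # low six bits of the first octet
    while not data[pos] & 0x80:       # the high bit marks the final octet
        pos += 1
        magnitude = (magnitude << 7) | (data[pos] & 0x7F)
    return sign, magnitude, pos + 1


def decode_decimal(encoded):
    """Decode one binary decimal into a (sign, coefficient, exponent) tuple."""
    descriptor = encoded[0]
    if descriptor >> 4 != 0x5:
        raise ValueError("not a decimal type descriptor")
    length = descriptor & 0x0F
    if length > 13:
        raise ValueError("null and long-form lengths are not handled here")
    body = encoded[1:1 + length]
    if len(body) != length:
        raise ValueError("truncated decimal")
    if not body:
        return (0, 0, 0)              # 0x50 is the compact form of 0d0
    exp_sign, exp_magnitude, pos = read_varint(body, 0)
    exponent = -exp_magnitude if exp_sign else exp_magnitude
    octets = body[pos:]
    if not octets:
        return (0, 0, exponent)       # assumption: absent coefficient is +0
    sign = 1 if octets[0] & 0x80 else 0
    magnitude = int.from_bytes(bytes([octets[0] & 0x7F]) + octets[1:], "big")
    return (sign, magnitude, exponent)


zero = decode_decimal(bytes([0x54, 0x00, 0x80, 0x00, 0x00]))
negative_zero = decode_decimal(bytes([0x52, 0xC0, 0x80]))
forty_two = decode_decimal(bytes([0x52, 0x80, 0x2A]))
```

Because overpadding and a negative-zero exponent do not change the
resulting tuple, every encoding in each group above decodes to the same
value.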
8. Compression
Both the text and binary representations of an Ion value stream may be
compressed in one or more GZIP [RFC 1952] members.
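As an illustration only, the following Python sketch writes an Ion text
stream as two concatenated GZIP members and reads it back using the
standard gzip module. The helper names compress_as_members and
decompress_stream are hypothetical.

```python
import gzip


def compress_as_members(chunks):
    """Compress each chunk as its own GZIP member and concatenate them."""
    return b"".join(gzip.compress(chunk) for chunk in chunks)


def decompress_stream(data):
    """Decompress one or more concatenated GZIP members."""
    return gzip.decompress(data)


ion_text = [b'{name: "a"} ', b'{name: "b"}']
stream = compress_as_members(ion_text)
restored = decompress_stream(stream)
```

The restored bytes are the concatenation of the original chunks, so a
reader sees a single Ion value stream regardless of how many members
were used.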
9. Security Considerations
Unlike JSON, Ion is not a subset of an existing programming language and
cannot be "eval()"-ed. However, Ion data can represent programs in an
application-specific language implemented by a consumer. When Ion is
treated purely as a data stream, this poses no security risk.
When Ion is used to implement a programming language, it neither provides
nor prevents the safeguards found in any other programming language.
10. IANA Considerations
No actions are required from IANA as a result of the publication of
this document.
The MIME media type for Ion is application/ion.
Type name: application
Subtype name: ion
Required parameters: n/a
Optional parameters: charset
Encoding considerations: binary
Security considerations: See [RFC Ion], Section 9.
Interoperability considerations: Described in [RFC Ion].
Published specification: [RFC Ion]
Applications that use this media type:
Ion has been used to exchange data between applications written in all
of these programming languages: C, C++, Java, Perl, Python, Ruby, and
JavaScript.
Additional information:
Magic number(s): 0xE0 0x01 0x00 0xEA
File extension(s): .ion, .10n, .ion.gz, .10n.gz
Person & email address to contact for further information:
ion@amazon.com
Intended usage: COMMON
Restrictions on usage: none
Author:
Originally written by Todd Jonker, Almann Goo, and Jonathan Hohle
Translated to RFC by Jonathan Hohle
Change controller: Amazon.com, Inc.
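A consumer might use the magic number above to tell binary Ion from
other input. The following Python sketch is one possible approach, not a
required procedure: sniff_ion_format is a hypothetical helper, the GZIP
check uses the two identification octets from RFC 1952, and treating
everything else as text is an assumption of this sketch.

```python
ION_BINARY_MAGIC = bytes([0xE0, 0x01, 0x00, 0xEA])
GZIP_MAGIC = bytes([0x1F, 0x8B])


def sniff_ion_format(data):
    """Classify the leading octets of a stream as 'binary', 'gzip', or 'text'."""
    if data.startswith(ION_BINARY_MAGIC):
        return "binary"
    if data.startswith(GZIP_MAGIC):
        return "gzip"
    # Assumption: anything else is handed to a text parser, which may reject it.
    return "text"


kind = sniff_ion_format(bytes([0xE0, 0x01, 0x00, 0xEA, 0x50]))
```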
11. Appendix A: ANTLR v4 Grammar for Ion 1.0 Text
// Ion Text 1.0 ANTLR v4 Grammar
//
// The following grammar does not encode all of the Ion semantics, in particular:
//
// * Timestamps are syntactically defined but the rules of ISO 8601 need to be
// applied (especially regarding day rules with months and leap years).
// * Non $ion_1_0 version markers are not trapped (e.g. $ion_1_1, $ion_2_0)
// * Edge cases around Unicode semantics:
// - ANTLR specifies only four hex digit Unicode escapes and on Java operates
// on UTF-16 code units (this is a flaw in ANTLR).
// - The grammar doesn't validate unpaired surrogate escapes in symbols or strings
// (e.g. "\udc00")
grammar IonText;
// note that EOF is a concept for the grammar, technically Ion streams
// are infinite
top_level
: (ws* top_level_value)* ws* value? EOF
;
top_level_value
: annotation+ top_level_value
| delimiting_entity
// numeric literals (if followed by something), need to be followed by
// whitespace or a token that is either quoted (e.g. string) or
// starts with punctuation (e.g. clob, struct, list)
| numeric_entity ws
| numeric_entity quoted_annotation value
| numeric_entity delimiting_entity
// literals that are unquoted symbols or keywords have a similar requirement
// as the numerics above, they have different productions because the
// rules for numerics are the same in s-expressions, but keywords
// have different rules between top-level and s-expressions.
| keyword_entity ws
| keyword_entity quoted_annotation value
| keyword_entity keyword_delimiting_entity
;
value
: annotation* entity
;
entity
: numeric_entity
| delimiting_entity
| keyword_entity
;
delimiting_entity
: quoted_text
| SHORT_QUOTED_CLOB
| LONG_QUOTED_CLOB
| BLOB
| list
| sexp
| struct
;
keyword_delimiting_entity
: delimiting_entity
| numeric_entity
;
keyword_entity
: any_null
| BOOL
| SPECIAL_FLOAT
| IDENTIFIER_SYMBOL
// note that this is because we recognize the type names for null
// they are ordinary symbols on their own
| TYPE
;
numeric_entity
: BIN_INTEGER
| DEC_INTEGER
| HEX_INTEGER
| TIMESTAMP
| FLOAT
| DECIMAL
;
annotation
: symbol ws* COLON COLON ws*
;
quoted_annotation
: QUOTED_SYMBOL ws* COLON COLON ws*
;
list
: L_BRACKET ws* value ws* (COMMA ws* value)* ws* (COMMA ws*)? R_BRACKET
| L_BRACKET ws* R_BRACKET
;
sexp
: L_PAREN (ws* sexp_value)* ws* value? R_PAREN
;
sexp_value
: annotation+ sexp_value
| sexp_delimiting_entity
| operator
// much like at the top level, numeric/identifiers/keywords
// have similar delimiting rules
| numeric_entity ws
| numeric_entity quoted_annotation value
| numeric_entity sexp_delimiting_entity
| sexp_keyword_entity ws
| sexp_keyword_entity quoted_annotation value
| sexp_keyword_entity sexp_keyword_delimiting_entity
| NULL ws
| NULL quoted_annotation value
| NULL sexp_null_delimiting_entity
;
sexp_delimiting_entity
: delimiting_entity
;
sexp_keyword_delimiting_entity
: sexp_delimiting_entity
| numeric_entity
| operator
;
sexp_null_delimiting_entity
: delimiting_entity
| NON_DOT_OPERATOR+
;
sexp_keyword_entity
: typed_null
| BOOL
| SPECIAL_FLOAT
| IDENTIFIER_SYMBOL
// note that this is because we recognize the type names for null
// they are ordinary symbols on their own
| TYPE
;
operator
: (DOT | NON_DOT_OPERATOR)+
;
struct
: L_CURLY ws* field (ws* COMMA ws* field)* ws* (COMMA ws*)? R_CURLY
| L_CURLY ws* R_CURLY
;
field
: field_name ws* COLON ws* annotation* entity
;
any_null
: NULL
| typed_null
;
typed_null
: NULL DOT NULL
| NULL DOT TYPE
;
field_name
: symbol
| SHORT_QUOTED_STRING
| (ws* LONG_QUOTED_STRING)+
;
quoted_text
: QUOTED_SYMBOL
| SHORT_QUOTED_STRING
| (ws* LONG_QUOTED_STRING)+
;
symbol
: IDENTIFIER_SYMBOL
// note that this is because we recognize the type names for null
// they are ordinary symbols on their own
| TYPE
| QUOTED_SYMBOL
;
ws
: WHITESPACE
| INLINE_COMMENT
| BLOCK_COMMENT
;
////////////////////////////////////////////////////////////////////////
// Ion Punctuation
////////////////////////////////////////////////////////////////////////
L_BRACKET : '[';
R_BRACKET : ']';
L_PAREN : '(';
R_PAREN : ')';
L_CURLY : '{';
R_CURLY : '}';
COMMA : ',';
COLON : ':';
DOT : '.';
NON_DOT_OPERATOR
: [!#%&*+\-/;<=>?@^`|~]
;
////////////////////////////////////////////////////////////////////////
// Ion Whitespace / Comments
////////////////////////////////////////////////////////////////////////
WHITESPACE
: WS+
;
INLINE_COMMENT
: '//' .*? (NL | EOF)
;
BLOCK_COMMENT
: '/*' .*? '*/'
;
////////////////////////////////////////////////////////////////////////
// Ion Null
////////////////////////////////////////////////////////////////////////
NULL
: 'null'
;
TYPE
: 'bool'
| 'int'
| 'float'
| 'decimal'
| 'timestamp'
| 'symbol'
| 'string'
| 'clob'
| 'blob'
| 'list'
| 'sexp'
| 'struct'
;
////////////////////////////////////////////////////////////////////////
// Ion Bool
////////////////////////////////////////////////////////////////////////
BOOL
: 'true'
| 'false'
;
////////////////////////////////////////////////////////////////////////
// Ion Timestamp
////////////////////////////////////////////////////////////////////////
TIMESTAMP
: DATE ('T' TIME?)?
| YEAR '-' MONTH 'T'
| YEAR 'T'
;
fragment
DATE
: YEAR '-' MONTH '-' DAY
;
fragment
YEAR
: '000' [1-9]
| '00' [1-9] DEC_DIGIT
| '0' [1-9] DEC_DIGIT DEC_DIGIT
| [1-9] DEC_DIGIT DEC_DIGIT DEC_DIGIT
;
fragment
MONTH
: '0' [1-9]
| '1' [0-2]
;
fragment
DAY
: '0' [1-9]
| [1-2] DEC_DIGIT
| '3' [0-1]
;
fragment
TIME
: HOUR ':' MINUTE (':' SECOND)? OFFSET
;
fragment
OFFSET
: 'Z'
| PLUS_OR_MINUS HOUR ':' MINUTE
;
fragment
HOUR
: [01] DEC_DIGIT
| '2' [0-3]
;
fragment
MINUTE
: [0-5] DEC_DIGIT
;
// note that W3C spec requires a digit after the '.'
fragment
SECOND
: [0-5] DEC_DIGIT ('.' DEC_DIGIT+)?
;
////////////////////////////////////////////////////////////////////////
// Ion Int
////////////////////////////////////////////////////////////////////////
BIN_INTEGER
: '-'? '0' [bB] BINARY_DIGIT (UNDERSCORE? BINARY_DIGIT)*
;
DEC_INTEGER
: '-'? DEC_UNSIGNED_INTEGER
;
HEX_INTEGER
: '-'? '0' [xX] HEX_DIGIT (UNDERSCORE? HEX_DIGIT)*
;
////////////////////////////////////////////////////////////////////////
// Ion Float
////////////////////////////////////////////////////////////////////////
SPECIAL_FLOAT
: PLUS_OR_MINUS 'inf'
| 'nan'
;
FLOAT
: DEC_INTEGER DEC_FRAC? FLOAT_EXP
;
fragment
FLOAT_EXP
: [Ee] PLUS_OR_MINUS? DEC_DIGIT+
;
////////////////////////////////////////////////////////////////////////
// Ion Decimal
////////////////////////////////////////////////////////////////////////
DECIMAL
: DEC_INTEGER DEC_FRAC? DECIMAL_EXP?
;
fragment
DECIMAL_EXP
: [Dd] PLUS_OR_MINUS? DEC_DIGIT+
;
////////////////////////////////////////////////////////////////////////
// Ion Symbol
////////////////////////////////////////////////////////////////////////
QUOTED_SYMBOL
: SYMBOL_QUOTE SYMBOL_TEXT SYMBOL_QUOTE
;
fragment
SYMBOL_TEXT
: (TEXT_ESCAPE | SYMBOL_TEXT_ALLOWED)*
;
// non-control Unicode and not single quote or backslash
fragment
SYMBOL_TEXT_ALLOWED
: '\u0020'..'\u0026' // no C0 control characters and no U+0027 single quote
| '\u0028'..'\u005B' // no U+005C backslash
| '\u005D'..'\uFFFF' // should be up to U+10FFFF
| WS_NOT_NL
;
IDENTIFIER_SYMBOL
: [$_a-zA-Z] ([$_a-zA-Z] | DEC_DIGIT)*
;
////////////////////////////////////////////////////////////////////////
// Ion String
////////////////////////////////////////////////////////////////////////
SHORT_QUOTED_STRING
: SHORT_QUOTE STRING_SHORT_TEXT SHORT_QUOTE
;
LONG_QUOTED_STRING
: LONG_QUOTE STRING_LONG_TEXT LONG_QUOTE
;
fragment
STRING_SHORT_TEXT
: (TEXT_ESCAPE | STRING_SHORT_TEXT_ALLOWED)*
;
fragment
STRING_LONG_TEXT
: (TEXT_ESCAPE | STRING_LONG_TEXT_ALLOWED)*?
;
// non-control Unicode and not double quote or backslash
fragment
STRING_SHORT_TEXT_ALLOWED
: '\u0020'..'\u0021' // no C0 control characters and no U+0022 double quote
| '\u0023'..'\u005B' // no U+005C backslash
| '\u005D'..'\u{10FFFF}'
| WS_NOT_NL
;
// non-control Unicode (newlines are OK)
fragment
STRING_LONG_TEXT_ALLOWED
: '\u0020'..'\u005B' // no C0 control characters and no U+005C backslash
| '\u005D'..'\u{10FFFF}'
| WS
;
fragment
TEXT_ESCAPE
: COMMON_ESCAPE | HEX_ESCAPE | UNICODE_ESCAPE
;
////////////////////////////////////////////////////////////////////////
// Ion CLOB
////////////////////////////////////////////////////////////////////////
SHORT_QUOTED_CLOB
: LOB_START WS* SHORT_QUOTE CLOB_SHORT_TEXT SHORT_QUOTE WS* LOB_END
;
LONG_QUOTED_CLOB
: LOB_START (WS* LONG_QUOTE CLOB_LONG_TEXT*? LONG_QUOTE)+ WS* LOB_END
;
fragment
CLOB_SHORT_TEXT
: (CLOB_ESCAPE | CLOB_SHORT_TEXT_ALLOWED)*
;
fragment
CLOB_LONG_TEXT
: CLOB_LONG_TEXT_NO_QUOTE
| '\'' CLOB_LONG_TEXT_NO_QUOTE
| '\'\'' CLOB_LONG_TEXT_NO_QUOTE
;
fragment
CLOB_LONG_TEXT_NO_QUOTE
: (CLOB_ESCAPE | CLOB_LONG_TEXT_ALLOWED)
;
// non-control ASCII and not double quote or backslash
fragment
CLOB_SHORT_TEXT_ALLOWED
: '\u0020'..'\u0021' // no U+0022 double quote
| '\u0023'..'\u005B' // no U+005C backslash
| '\u005D'..'\u007F'
| WS_NOT_NL
;
// non-control ASCII (newlines are OK)
fragment
CLOB_LONG_TEXT_ALLOWED
: '\u0020'..'\u0026' // no U+0027 single quote
| '\u0028'..'\u005B' // no U+005C backslash
| '\u005D'..'\u007F'
| WS
;
fragment
CLOB_ESCAPE
: COMMON_ESCAPE | HEX_ESCAPE
;
////////////////////////////////////////////////////////////////////////
// Ion BLOB
////////////////////////////////////////////////////////////////////////
BLOB
: LOB_START (BASE_64_QUARTET | WS)* BASE_64_PAD? WS* LOB_END
;
fragment
BASE_64_PAD
: BASE_64_PAD1
| BASE_64_PAD2
;
fragment
BASE_64_QUARTET
: BASE_64_CHAR WS* BASE_64_CHAR WS* BASE_64_CHAR WS* BASE_64_CHAR
;
fragment
BASE_64_PAD1
: BASE_64_CHAR WS* BASE_64_CHAR WS* BASE_64_CHAR WS* '='
;
fragment
BASE_64_PAD2
: BASE_64_CHAR WS* BASE_64_CHAR WS* '=' WS* '='
;
fragment
BASE_64_CHAR
: [0-9a-zA-Z+/]
;
////////////////////////////////////////////////////////////////////////
// Common Lexer Primitives
////////////////////////////////////////////////////////////////////////
fragment LOB_START : '{{';
fragment LOB_END : '}}';
fragment SYMBOL_QUOTE : '\'';
fragment SHORT_QUOTE : '"';
fragment LONG_QUOTE : '\'\'\'';
// Ion does not allow leading zeros for base-10 numbers
fragment
DEC_UNSIGNED_INTEGER
: '0'
| [1-9] (UNDERSCORE? DEC_DIGIT)*
;
fragment
DEC_FRAC
: '.'
| '.' DEC_DIGIT (UNDERSCORE? DEC_DIGIT)*
;
fragment
DEC_DIGIT
: [0-9]
;
fragment
HEX_DIGIT
: [0-9a-fA-F]
;
fragment
BINARY_DIGIT
: [01]
;
fragment
PLUS_OR_MINUS
: [+\-]
;
fragment
COMMON_ESCAPE
: '\\' COMMON_ESCAPE_CODE
;
fragment
COMMON_ESCAPE_CODE
: 'a'
| 'b'
| 't'
| 'n'
| 'f'
| 'r'
| 'v'
| '?'
| '0'
| '\''
| '"'
| '/'
| '\\'
| NL
;
fragment
HEX_ESCAPE
: '\\x' HEX_DIGIT HEX_DIGIT
;
fragment
UNICODE_ESCAPE
: '\\u' HEX_DIGIT_QUARTET
| '\\U000' HEX_DIGIT_QUARTET HEX_DIGIT
| '\\U0010' HEX_DIGIT_QUARTET
;
fragment
HEX_DIGIT_QUARTET
: HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
fragment
WS
: WS_NOT_NL
| '\u000A' // line feed
| '\u000D' // carriage return
;
fragment
NL
: '\u000D\u000A' // carriage return + line feed
| '\u000D' // carriage return
| '\u000A' // line feed
;
fragment
WS_NOT_NL
: '\u0009' // tab
| '\u000B' // vertical tab
| '\u000C' // form feed
| '\u0020' // space
;
fragment
UNDERSCORE
: '_'
;
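The grammar header notes that timestamps are checked only syntactically,
so a lexically valid token such as 2023-02-30T still needs a calendar
check for month lengths and leap years. The following Python sketch
shows one way to apply that check after lexing. The function name
has_valid_calendar_date is hypothetical, and the sketch relies on the
proleptic Gregorian calendar of Python's datetime module.

```python
import datetime
import re

_DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def has_valid_calendar_date(timestamp_text):
    """Return False if a day-precision timestamp names a nonexistent date."""
    match = _DATE_RE.match(timestamp_text)
    if match is None:
        # Year or year-month precision: there is no day field to check.
        return True
    year, month, day = map(int, match.groups())
    try:
        datetime.date(year, month, day)
    except ValueError:
        return False
    return True


leap_day_ok = has_valid_calendar_date("2024-02-29T")
non_leap_day_ok = has_valid_calendar_date("2023-02-29T")
```

Here leap_day_ok is true and non_leap_day_ok is false, even though both
tokens match the TIMESTAMP rule.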
12. References
12.1. Normative References
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC4234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
Specifications: ABNF", RFC 4234, October 2005.
12.2. Informative References
[RFC1952] Deutsch, P., "GZIP file format specification version 4.3",
RFC 1952, May 1996.
[RFC2822] Resnick, P., Ed., "Internet Message Format", RFC 2822,
April 2001.
[RFC3339] Klyne, G. and C. Newman, "Date and Time on the Internet:
Timestamps", RFC 3339, July 2002.
[RFC4648] Josefsson, S., "The Base16, Base32, and Base64 Data
Encodings", RFC 4648, October 2006.
Authors' Addresses
Jonathan Hohle
15987 N 114th Way
Scottsdale, AZ 85255
Tel: 480 323 5799 (Jonathan Hohle)
EMail: jonhohle@gmail.com
Questions about the technical content of this specification can be
sent by email to:
The Ion Team <ion@amazon.com>