
Integer-convertible character literals

Introduction

Swift’s String type is designed for Unicode correctness and abstracts away the underlying binary representation of the string to model it as a Collection of grapheme clusters. This is an appropriate string model for human-readable text, as to a human reader, the atomic unit of a string is (usually) the extended grapheme cluster. When treated this way, many logical string operations “just work” the way users expect.

However, it is also common in programming to need to express values which are intrinsically numeric, but have textual meaning when taken as an ASCII value. We propose adding a new literal syntax that uses single quotes (') and is transparently convertible to Swift’s integer types. This syntax, though not the conversion behavior, will extend to all “scalar” text literals, up to and including Character, and will become the preferred literal syntax for these types.

Motivation

For both correctness and efficiency, [UInt8] (or another integer array type) is usually the most appropriate representation for an ASCII string. (See Stop converting Data to String for a discussion on why String is an inappropriate representation.)

A major pain point of integer arrays is that they lack a clear and readable literal form. In C, 'a' is a character literal, equivalent to the integer 97. Swift has no such equivalent, requiring awkward spellings like UInt8(ascii: "a"). Alternatives, like spelling out the values in hex or decimal directly, are even worse. This harms the readability of code, and is one of the sore points of bytestring processing in Swift. Compare this C declaration with its Swift equivalent:

static char const hexcodes[16] = {
    '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
    'a', 'b', 'c', 'd', 'e', 'f'
};

let hexcodes = [
    UInt8(ascii: "0"), UInt8(ascii: "1"), UInt8(ascii: "2"), UInt8(ascii: "3"),
    UInt8(ascii: "4"), UInt8(ascii: "5"), UInt8(ascii: "6"), UInt8(ascii: "7"),
    UInt8(ascii: "8"), UInt8(ascii: "9"), UInt8(ascii: "a"), UInt8(ascii: "b"),
    UInt8(ascii: "c"), UInt8(ascii: "d"), UInt8(ascii: "e"), UInt8(ascii: "f")
]

Sheer verbosity can be reduced by applying “clever” higher-level constructs such as

let hexcodes = [
    "0", "1", "2", "3",
    "4", "5", "6", "7",
    "8", "9", "a", "b",
    "c", "d", "e", "f"
].map { UInt8(ascii: $0) }

or even

let hexcodes = Array(UInt8(ascii: "0") ... UInt8(ascii: "9")) + 
               Array(UInt8(ascii: "a") ... UInt8(ascii: "f"))

though this comes at the expense of an even higher noise-to-signal ratio, as we are forced to reference concepts such as function mapping, concatenation, range construction, Array materialization, and run-time type conversion, when all we wanted to express was a fixed set of hardcoded values.

In addition, the init(ascii:) initializer only exists on UInt8. If you are working with other types, such as Int8 (common when dealing with C APIs that take char), things get much more awkward. Consider scanning through a char* buffer as an UnsafeBufferPointer<Int8>:

for scalar in int8buffer {
    switch scalar {
    case Int8(UInt8(ascii: "a")) ... Int8(UInt8(ascii: "f")):
        // lowercase hex letter
        break
    case Int8(UInt8(ascii: "A")) ... Int8(UInt8(ascii: "F")):
        // uppercase hex letter
        break
    case Int8(UInt8(ascii: "0")) ... Int8(UInt8(ascii: "9")):
        // hex digit
        break
    default:
        // something else
        break
    }
}

Aside from being ugly and verbose, transforming Unicode.Scalar literals also sacrifices compile-time guarantees. The statement let char: UInt8 = 1989 is a compile-time error, whereas let char: UInt8 = .init(ascii: "߅") is a run-time error.
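
To see the difference concretely, here is a minimal sketch of today's behavior (the exact diagnostic wording is our assumption):

let a: UInt8 = 1989               // error: integer literal '1989' overflows when stored into 'UInt8'
let b: UInt8 = .init(ascii: "߅")  // compiles, but traps at run time: U+07C5 is outside the ASCII range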

ASCII scalars are inherently textual, so it should be possible to express them with a textual literal without requiring layers upon layers of transformations. Just as using String for binary data runs counter to Swift’s stated design goals of safety and efficiency, forcing users to express basic data values in such a convoluted and unreadable way runs counter to our design goal of expressiveness.

Integer character literals would provide benefits to String users. One of the future directions for String is to provide performance-sensitive or low-level users with direct access to code units. Having numeric character literals for use with this API is hugely motivating. Furthermore, improving Swift’s bytestring ergonomics is an important part of our long term goal of expanding into embedded platforms.

Proposed solution

Let's do the obvious thing here, and conform Swift’s integer literal types to ExpressibleByUnicodeScalarLiteral. These conversions will only be valid for the ASCII range U+0 ..< U+128; unicode scalar literals outside of that range will be invalid and treated similarly to the way we currently diagnose overflowing integer literals. This is a conservative limitation which we believe is warranted, as allowing transparent unicode conversion to integer types carries major encoding pitfalls we want to protect users from.

| ExpressibleBy   | UnicodeScalarLiteral | ExtendedGraphemeClusterLiteral | StringLiteral |
|-----------------|----------------------|--------------------------------|---------------|
| UInt8:, …, Int: | yes*                 | no                             | no            |
| Unicode.Scalar: | yes                  | no                             | no            |
| Character:      | yes (inherited)      | yes                            | no            |
| String:         | no*                  | no*                            | yes           |
| StaticString:   | no*                  | no*                            | yes           |

Cells marked with an asterisk * indicate behavior that is different from the current language behavior.

As we are introducing a separate literal syntax 'a' for “scalar” text objects, and making it the preferred syntax for Unicode.Scalar and Character, it will no longer be possible to initialize Strings or StaticStrings from unicode scalar literals or character literals. To users, this will have no discernible impact, as double-quoted literals will simply be inferred as string literals.

This proposal will have no impact on custom ExpressibleBy conformances; however, integer types UInt8 through Int will now be available as source types provided by the ExpressibleByUnicodeScalarLiteral.init(unicodeScalarLiteral:) initializer. For these specializations, the initializer will be responsible for enforcing the compile-time ASCII range check on the unicode scalar literal.
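
As a sketch of what this would permit, consider a hypothetical ASCIIByte wrapper that opts into an integer literal payload:

struct ASCIIByte: ExpressibleByUnicodeScalarLiteral {
    var value: UInt8
    // Under this proposal, UInt8 is a permitted UnicodeScalarLiteralType;
    // the compiler enforces the ASCII range check before this initializer runs.
    init(unicodeScalarLiteral value: UInt8) {
        self.value = value
    }
}

let comma: ASCIIByte = ','  // value == 44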

| init()          | unicodeScalarLiteral | extendedGraphemeClusterLiteral | stringLiteral |
|-----------------|----------------------|--------------------------------|---------------|
| :UInt8, …, :Int | yes*                 | no                             | no            |
| :Unicode.Scalar | yes                  | no                             | no            |
| :Character      | yes (upcast)         | yes                            | no            |
| :String         | yes (upcast)         | yes (upcast)                   | yes (upcast)  |
| :StaticString   | yes (upcast)         | yes (upcast)                   | yes           |

The ASCII range restriction will only apply to single-quote literals coerced to an integer type. Any valid Unicode.Scalar can be written as a single-quoted unicode scalar literal, and any valid Character can be written as a single-quoted character literal.

|                 | 'a'    | 'é'    | 'β'    | '𓀎'     | '👩‍✈️'  | "ab" |
|-----------------|--------|--------|--------|---------|-------|------|
| :String         |        |        |        |         |       | "ab" |
| :Character      | 'a'    | 'é'    | 'β'    | '𓀎'     | '👩‍✈️'  |      |
| :Unicode.Scalar | U+0061 | U+00E9 | U+03B2 | U+1300E |       |      |
| :UInt32         | 97     |        |        |         |       |      |
| :UInt16         | 97     |        |        |         |       |      |
| :UInt8          | 97     |        |        |         |       |      |
| :Int8           | 97     |        |        |         |       |      |
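
A few of the rows above, written out as code (a sketch of the proposed behavior):

let s: String = "ab"         // double-quoted literals remain string literals
let c: Character = '👩‍✈️'      // any valid Character can be single-quoted
let v: Unicode.Scalar = 'é'  // any valid Unicode.Scalar (U+00E9)
let b: UInt8 = 'a'           // 97; a non-ASCII scalar like 'é' would be a compile-time error here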

With these changes, the hex code example can be written much more naturally:

let hexcodes: [UInt8] = [
    '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 
    'a', 'b', 'c', 'd', 'e', 'f'
]

for scalar in int8buffer {
    switch scalar {
    case 'a' ... 'f':
        // lowercase hex letter
        break
    case 'A' ... 'F':
        // uppercase hex letter
        break
    case '0' ... '9':
        // hex digit
        break
    default:
        // something else
        break
    }
}

Choice of single quotes

We propose to adopt the 'x' syntax for all textual literal types up to and including ExtendedGraphemeClusterLiteral, but not including StringLiteral. These literals will be used to express integer types, Character, Unicode.Scalar, and types like UTF16.CodeUnit in the standard library.

The default inferred literal type for let x = 'a' will be Character, following the principle of least surprise. This also allows for a natural user-side syntax for differentiating methods overloaded on both Character and String.
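
For instance, given hypothetical draw(_:) overloads distinguished only by argument type:

func draw(_ c: Character) { print("character: \(c)") }
func draw(_ s: String)    { print("string: \(s)") }

draw('a')    // selects the Character overload
draw("a")    // selects the String overload
let x = 'a'  // x is inferred as Character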

Single-quoted literals will be inferred to be integer types in cases where a Character or Unicode.Scalar overload does not exist, but an integer overload does. This can lead to strange spellings such as '1' + '1' == 98. However, we foresee problems arising from this to be quite rare, as the type system will almost always catch such mistakes, and very few users are likely to express a String with two literals instead of the much more obvious "11".
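
Spelled out (a sketch; since Character defines no + operator, the integer interpretation wins):

let n = '1' + '1'   // 49 + 49 == 98, because '1' is U+0031
let s = "1" + "1"   // "11", as before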

Use of single quotes for character/scalar literals is heavily precedented in other languages, including C, Objective-C, C++, Java, and Rust, although different languages have slightly differing ideas about what a “character” is. We choose the single quote syntax specifically because it reinforces the notion that strings and character values are different: the former is a sequence, the latter is a scalar (and “integer-like”). Character types also don't support string literal interpolation, which is another reason to move away from double quotes.

Single quotes in Swift, a historical perspective

In Swift 1.0, we wanted to reserve single quotes for some yet-to-be-determined syntactical purpose. However, today, pretty much all of the things we once thought we might want single quotes for have already found homes in other parts of the Swift syntactical space. For example, the syntax for multi-line string literals uses triple quotes ("""), and string interpolation syntax uses standard double-quote syntax. With the passage of SE-0200, raw-mode string literals settled into the #""# syntax. In current discussions around regex literals, most people seem to prefer slashes (/).

At this point, it is clear that the early syntactic conservatism was unwarranted. We do not foresee another use for this syntax, and given the strong precedent for character literals in other languages, it is natural to put it to that use.

Existing double quote initializers for characters

We propose deprecating the double quote literal form for Character and Unicode.Scalar types and slowly migrating them out of Swift.

let c1: Character = "f"  // deprecated
let c2 = 'f'             // preferred

Detailed design

The only standard library change will be to add {UInt8, Int8, ..., Int} to the list of allowed Self.UnicodeScalarLiteralType types. (This entails conforming the integer types to _ExpressibleByBuiltinUnicodeScalarLiteral.) The ASCII range checking will be performed at compile-time in the typechecker, in essentially the same way that overflow checking for ExpressibleByIntegerLiteral.IntegerLiteralType types works today.

protocol ExpressibleByUnicodeScalarLiteral {
    associatedtype UnicodeScalarLiteralType: 
        {StaticString, ..., Unicode.Scalar} + {UInt8, Int8, ..., Int}
    
    init(unicodeScalarLiteral: UnicodeScalarLiteralType)
}
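
The range check then behaves like today's integer overflow diagnostics. A sketch (the exact diagnostic wording is our assumption):

let x: Int8  = 300   // error today: integer literal '300' overflows when stored into 'Int8'
let y: UInt8 = 'β'   // analogous error under this proposal: U+03B2 is outside the ASCII range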

The default inferred type for all single-quoted literals will be Character, addressing a longstanding pain point in Swift, where Characters had no dedicated literal syntax.

typealias UnicodeScalarLiteralType           = Character
typealias ExtendedGraphemeClusterLiteralType = Character 

This will have no source-level impact, as all double-quoted literals get their default inferred type from the StringLiteralType typealias, which currently overshadows ExtendedGraphemeClusterLiteralType and UnicodeScalarLiteralType. The UnicodeScalarLiteralType typealias will remain meaningless, but ExtendedGraphemeClusterLiteralType typealias will now be used to infer a default type for single-quoted literals.
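
In other words, default inference would shake out as follows (a sketch):

let s = "a"  // inferred as String, via StringLiteralType, exactly as today
let c = 'a'  // inferred as Character, via ExtendedGraphemeClusterLiteralType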

Source compatibility

This proposal could be done in a way that is strictly additive, but we feel it is best to deprecate the existing double quote initializers for characters, and the UInt8.init(ascii:) initializer.

Here is a specific sketch of a deprecation policy:

  • Continue accepting these in Swift 5 mode with no change.

  • Introduce the new syntax support into Swift 5.1.

  • Swift 5.1 mode would start producing deprecation warnings (with a fixit to change double quotes to single quotes).

  • The Swift 5 to 5.1 migrator would change the syntax (by virtue of applying the deprecation fixits).

  • Swift 6 would not accept the old syntax.

During the transition period, "a" will remain a valid unicode scalar literal, so it will be possible to initialize integer types with double-quoted ASCII literals.

let ascii: Int8 = "a" // produces a deprecation warning

However, as this will only be possible in new code and will produce a deprecation warning from the outset, it should not be a problem.

Effect on ABI stability

All changes except deprecating the UInt8.init(ascii:) initializer are either additive, or limited to the type checker, parser, or lexer. Removing String and StaticString’s ExpressibleByUnicodeScalarLiteral and ExpressibleByExtendedGraphemeClusterLiteral conformances would otherwise be ABI-breaking, but this can be implemented entirely in the type checker, since source literals are a compile-time construct.

Removing UInt8.init(ascii:) would break ABI, but this is not necessary to implement the proposal; it’s merely housekeeping.

Effect on API resilience

None.

Alternatives considered

Integer initializers

Some have proposed extending the UInt8(ascii:) initializer to other integer types (Int8, UInt16, … , Int). However, this forgoes compile-time validity checking, and entails a substantial increase in API surface area for questionable gain.
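
For contrast, here is roughly what that rejected alternative would look like; this extension is hypothetical, and the precondition wording is our assumption:

extension Int8 {
    // Hypothetical mirror of UInt8.init(ascii:). The range check can only
    // fire at run time, unlike the compile-time check on single-quoted literals.
    init(ascii scalar: Unicode.Scalar) {
        precondition(scalar.value < 128, "scalar value is not ASCII")
        self.init(scalar.value)
    }
}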

Lifting the ASCII range restriction

Some have proposed allowing any unicode scalar literal whose codepoint index does not overflow the target integer type to be convertible to that integer type. Consensus was that this is an easy source of unicode encoding bugs, and provides little utility to the user. If people change their minds in the future, this restriction can always be lifted in a source and ABI compatible way.

Single-quoted ASCII strings

Some have proposed allowing integer array types to be expressible by multi-character ASCII strings such as 'abcd'. We consider this to be out of scope of this proposal, as well as unsupported by precedent in C and related languages.
