Pitch: Unicode Named Character Escape Sequence

Introduction

This proposal adds a new \N{name} escape sequence to Swift string literals, where name is the name of a Unicode character.

Discussion

The Unicode named character escape sequence was previously discussed here:

Background

Each Unicode character is assigned a unique code point, a number in the range U+0000 to U+10FFFF, and a name, consisting of uppercase letters (A–Z), digits (0–9), hyphens, and spaces. For example, the Unicode character for the letter “A” used in English has the code point U+0041 and the name LATIN CAPITAL LETTER A. The term scalar value defines the subset of Unicode code points that aren’t surrogate code points (U+D800 to U+DFFF).

In Swift, a string literal may include a character directly ("A") or using the \u{n} escape sequence, where n is a 1–8 digit hexadecimal number corresponding to the scalar value ("\u{0041}"). A string literal may also include a character by interpolation (let letterA = "\u{0041}"; "\(letterA)").

Motivation

In Swift, it can be cumbersome to work with Unicode characters that are non-printing, confusable, or have difficulty rendering in the editor. This difficulty can inhibit developer productivity and cause programming errors.

Non-Printing Characters

Non-printing characters can’t be seen or directly interacted with in most editors (including Xcode), which makes them difficult to work with in code.

For example, the 👩‍👧 emoji is a sequence comprising 👩 (WOMAN U+1F469) + (ZERO WIDTH JOINER U+200D) + 👧 (GIRL U+1F467). The middle character, zero-width joiner (ZWJ), is a non-printing control character that changes the way glyphs are shaped for adjacent characters rather than having a distinct rendering itself.

There are currently a few different strategies for working with non-printing characters:

One approach is to use the \u{n} escape sequence in a string literal, passing the scalar value for each character.

// Unicode Scalar Value Escape
"\u{1F469}\u{200D}\u{1F467}" // "👩‍👧"

This achieves the desired results, but the use of opaque numerical constants makes the code difficult to understand.
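To make the opacity concrete, the literal's contents can be inspected at run time. The following is a minimal sketch, assuming Swift 5 or later, where the standard library exposes `Unicode.Scalar.Properties.name`. It recovers the names only after the fact, which is exactly the information the literal itself fails to convey to a reader.

```swift
// Inspect the scalars that make up the emoji ZWJ sequence.
let family = "\u{1F469}\u{200D}\u{1F467}"

for scalar in family.unicodeScalars {
    // Hex code point, zero-padded to at least four digits.
    let hex = String(scalar.value, radix: 16, uppercase: true)
    let padding = String(repeating: "0", count: max(0, 4 - hex.count))
    // `properties.name` is nil for scalars that have no name,
    // such as control characters.
    print("U+\(padding)\(hex)", scalar.properties.name ?? "<unnamed>")
}
// Prints:
// U+1F469 WOMAN
// U+200D ZERO WIDTH JOINER
// U+1F467 GIRL
```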

Another approach is to assign character values to constants, using a combination of variable names and comments to clarify intent, and interpolate those values:

// Commented Declaration + Interpolation
let woman: Character = "\u{1F469}" // WOMAN
let zwj: Character = "\u{200D}" // ZERO WIDTH JOINER
let girl: Character = "\u{1F467}" // GIRL
"\(woman)\(zwj)\(girl)" // "👩‍👧"

This approach is more understandable than the previous one but requires more work on the part of the developer. More concerning, however, is that this approach can lead to difficult-to-track-down bugs if the comments and behavior conflict, whether due to a mistake initially or an erroneous change later on.

An ideal solution would combine the compiler checking of the former approach with the semantic clarity of the latter approach. This could be achieved by adding support for a new escape sequence, \N{name}, which allows Unicode characters to be included into string literals by name:

// Proposed \N Escape Sequence
"\N{WOMAN}\N{ZERO WIDTH JOINER}\N{GIRL}" // "👩‍👧"

Unicode 11.0 specifies 777 emoji ZWJ sequences, including 👩‍👧, that platforms are encouraged to support. Vendors may also choose to support other emoji ZWJ sequences, such as Microsoft’s “Hipster Cat”, which comprises 🐱 (CAT FACE U+1F431) + ZWJ + 👓 (EYEGLASSES U+1F453), and is only currently supported on Windows platforms.

In addition to Emoji, ZWJ is used in Arabic script and Indic scripts, including Devanagari and Kannada. Incidentally, Arabic script provides another example of the difficulty a developer faces when working with text in code: handling directionality.

Directional Formatting Characters

Arabic script is written right-to-left (RTL) whereas Latin script is written left-to-right (LTR). When working with text containing, for example, both Arabic and Latin script, the use of non-printing, directional formatting characters like RIGHT-TO-LEFT MARK U+200F (RLM) may be necessary to achieve the desired results.

As with the previous ZWJ example, the proposed \N{name} escape sequence offers a solution that’s both understandable to the developer and checked by the compiler:

// Unicode Scalar Value Escape
"The phrase is مرحبا بالعالم!\u{200F} in Arabic."

// Commented Declaration + Interpolation
let rlm: Character = "\u{200F}" // RIGHT-TO-LEFT MARK
"The phrase is مرحبا بالعالم!\(rlm) in Arabic."

// Proposed \N Escape Sequence
"The phrase is مرحبا بالعالم!\N{RIGHT-TO-LEFT MARK} in Arabic."

Confusable Characters

Even if a character is printing, its glyph may be ambiguous.

Unicode Technical Report #36 describes how characters in single-, mixed-, and whole-script contexts may be confused for another character. This phenomenon is demonstrated well by the Confusables Unicode Utility.

Correct handling of confusables is most important in security applications, such as for preventing hostname spoofing in URLs. However, confusable characters can be problematic in code as well. For example, consider the following selection from the 24 characters comprising Unicode’s Punctuation, Dash [Pd] category:

*   U+002D  HYPHEN-MINUS           -
*   U+2010  HYPHEN                 ‐
*   U+2011  NON-BREAKING HYPHEN    ‑
*   U+2012  FIGURE DASH            ‒
*   U+2013  EN DASH                –
*   U+2014  EM DASH                —
*   U+2015  HORIZONTAL BAR         ―
*   U+2E3A  TWO-EM DASH            ⸺
*   U+2E3B  THREE-EM DASH          ⸻

Most programming fonts are unable to distinguish these characters. If a developer decides to include a character directly into a string literal, the original meaning may be lost in subsequent changes. A developer may not recognize a code convention for using en dash (–) to delimit range bounds, and instead type a hyphen-minus (-) somewhere else in the project.
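As an illustration of how such a mix-up could be surfaced today, here is a small sketch that lists every dash-like scalar in a string by name. It assumes Swift 5 or later for `Unicode.Scalar.Properties`; the sample text is made up for the example.

```swift
// Hypothetical sample text mixing an en dash and a hyphen-minus.
let text = "pages 10\u{2013}20 and 30-40"

// Collect the names of all scalars in the Punctuation, Dash [Pd] category.
let dashNames = text.unicodeScalars
    .filter { $0.properties.generalCategory == .dashPunctuation }
    .map { $0.properties.name ?? "<unnamed>" }

print(dashNames) // ["EN DASH", "HYPHEN-MINUS"]
```

A check like this only helps after the fact; it does not make the literal itself any easier to read.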

Consider the following four options for including an en dash in code, including the proposed \N{name} escape sequence:

// Direct
"–"

// Unicode Scalar Value Escape
"\u{2013}"

// Commented Declaration + Interpolation
let enDash: Character = "\u{2013}" // EN DASH
"\(enDash)"

// Proposed \N Escape Sequence
"\N{EN DASH}"

Another example of confusable characters is cross-script homographs. Unicode defines separate code points for LATIN CAPITAL LETTER A (U+0041), CYRILLIC CAPITAL LETTER A (U+0410), and GREEK CAPITAL LETTER ALPHA (U+0391). However, these characters are indiscernible in most fonts.

The proposed \N{name} escape sequence can be helpful for distinguishing between homographs like these:

"A == \u{0041} == \N{LATIN CAPITAL LETTER A}"
"А == \u{0410} == \N{CYRILLIC CAPITAL LETTER A}"
"Α == \u{0391} == \N{GREEK CAPITAL LETTER ALPHA}"
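The distinction matters because the three letters are different characters to Swift, even though they render alike. A minimal sketch using only the existing \u{n} escape (and, as an assumption, Swift 5 or later for `properties.name`):

```swift
let latinA = "\u{0041}"
let cyrillicA = "\u{0410}"
let greekAlpha = "\u{0391}"

// The strings look the same in most fonts but do not compare equal.
print(latinA == cyrillicA)  // false
print(latinA == greekAlpha) // false

// Their scalar names are the only reliable way to tell them apart.
let homographNames = [latinA, cyrillicA, greekAlpha].map { string in
    string.unicodeScalars.first?.properties.name ?? "<unnamed>"
}
print(homographNames)
// ["LATIN CAPITAL LETTER A", "CYRILLIC CAPITAL LETTER A", "GREEK CAPITAL LETTER ALPHA"]
```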

Design

The \N{} escape sequence is supported in a few programming languages. We propose to model the design according to these existing implementations.

Python

Python defines a \N{name} escape sequence. Support for name aliases was added in Python 3.3.

Perl

Perl defines a \N{} escape sequence that accepts code points with a U+ prefix, such as \N{U+0041}. Including the statement use charnames qw( :full ); allows Perl code to pass name arguments to \N{} as well.

Objective-C / Swift / Foundation

The \N syntax can be found when calling the (NS)String method applyingTransform(_:reverse:) with the .toUnicodeName transform:

import Foundation

"🍩".applyingTransform(.toUnicodeName, reverse: false) // \N{DOUGHNUT}
"\\N{DOUGHNUT}".applyingTransform(.toUnicodeName, reverse: true) // 🍩
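Until the compiler supports the escape sequence, the reverse transform can approximate it at run time. The helper below is a hypothetical sketch, not part of Foundation: the name `character(named:)` is an assumption, and it relies on the reverse transform behaving as shown above. Unlike the proposed escape sequence, nothing here is checked at compile time.

```swift
import Foundation

/// Hypothetical helper: resolves a Unicode character name at run time
/// by round-tripping through the `.toUnicodeName` transform.
func character(named name: String) -> Character? {
    let escaped = "\\N{\(name)}"
    guard let result = escaped.applyingTransform(.toUnicodeName, reverse: true),
          result.count == 1 else {
        // Either the transform failed or the name was not resolved
        // to exactly one character.
        return nil
    }
    return result.first
}

if let doughnut = character(named: "DOUGHNUT") {
    print(doughnut)
}
```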

Implementation

The data required to implement this feature is provided by the ICU library. However, the Swift compiler doesn't currently link libICU, and doing that may not be straightforward.

An alternative approach would be to embed this data from the Unicode Character Database (UCD) directly, using the XML representation described in Unicode Standard Annex #42. As part of the build process, this XML file could be downloaded, parsed, and used to generate a static array declaration in code that's used by the compiler to do character name lookups.
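A rough sketch of what the generated declaration and its lookup might look like follows. Everything here is an assumption made for illustration: the type `NameEntry`, the table `generatedNameTable`, and the function `lookupScalar(named:)` are hypothetical, the six entries stand in for the full generated data, and the compiler itself would need an equivalent in its own implementation language. The table is sorted by name so that lookups can use binary search.

```swift
/// One entry of the hypothetical generated table.
struct NameEntry {
    let name: String
    let scalar: UInt32
}

// Illustrative excerpt only; a real table would be generated from the
// UCD at build time and must be kept sorted by name.
let generatedNameTable: [NameEntry] = [
    NameEntry(name: "EN DASH", scalar: 0x2013),
    NameEntry(name: "GIRL", scalar: 0x1F467),
    NameEntry(name: "LATIN CAPITAL LETTER A", scalar: 0x0041),
    NameEntry(name: "RIGHT-TO-LEFT MARK", scalar: 0x200F),
    NameEntry(name: "WOMAN", scalar: 0x1F469),
    NameEntry(name: "ZERO WIDTH JOINER", scalar: 0x200D),
]

/// Binary search over the sorted table; returns nil for unknown names.
func lookupScalar(named name: String) -> Unicode.Scalar? {
    var low = 0
    var high = generatedNameTable.count
    while low < high {
        let mid = low + (high - low) / 2
        let candidate = generatedNameTable[mid].name
        if candidate == name {
            return Unicode.Scalar(generatedNameTable[mid].scalar)
        } else if candidate < name {
            low = mid + 1
        } else {
            high = mid
        }
    }
    return nil
}
```

An unknown name returns nil, which is the point at which a compiler would emit a diagnostic rather than silently producing a wrong character.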

To get a sense of what this entails, here's a link to the directory of UCD XML files for Unicode 11.0: https://www.unicode.org/Public/11.0.0/ucdxml/.

In terms of code impact, Unicode 11.0 has 137,374 characters. Estimating that each character name is between 16 and 32 bytes, we could expect this to require around 2 – 4 megabytes.
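The estimate can be reproduced with a quick calculation. This sketch counts only the name bytes, under the 16 to 32 byte assumption above; it ignores the code point values and any index structure, so the real footprint would differ.

```swift
let characterCount = 137_374

// Lower and upper bounds on the storage for names alone, in bytes.
let lowerBoundBytes = characterCount * 16 // 2_197_984
let upperBoundBytes = characterCount * 32 // 4_395_968

// Expressed in megabytes (1 MB = 1_000_000 bytes): roughly 2.2 and 4.4.
print(Double(lowerBoundBytes) / 1_000_000, Double(upperBoundBytes) / 1_000_000)
```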

It should also be possible to reference Unicode characters by normative formal name aliases, of which there are currently a few hundred.

Source Compatibility

This is a purely additive change. The syntax proposed is not currently valid Swift.

Effect on ABI Stability

None

Effect on API Resilience

None

Documentation Impact

The Swift Programming Language would need to update the “Special Characters in String Literals” section in its Strings and Characters chapter to document the new escape sequence.

In addition, documentation for the applyingTransform(_:reverse:) method would need to be updated to note support for the \N escape sequence in Swift.

Alternatives to Consider

As part of the pitch process, we are especially interested in soliciting feedback and suggestions for the following:

Using \U{name} instead of \N{name}

In terms of spelling, the strength of \N{name} comes entirely from the precedent set by the aforementioned languages that currently implement this functionality. The letter "N" is a weak mnemonic for "Unicode character name". It also resembles the more common, but unrelated, \n escape sequence, which might create confusion for developers.

An alternative spelling to consider for this proposal is \U{name}. The letter "U" reinforces its relation to "Unicode" and provides case symmetry with the existing, related \u{n} escape sequence.

Unfortunately, searching for \N usage in code is difficult (GitHub, for example, strips the backslash character in code search). Therefore, we don't have any data about the prevalence of \N in the wild to help make a determination of how strong the existing convention is.

Supporting Named Sequences

Unicode also provides a database of named sequences, as described by Unicode Standard Annex #34. Essentially, these are common extended grapheme clusters that are treated like characters (unique name, part of the Unicode namespace), but comprise multiple code points instead of just one. For example, the named sequence LATIN SMALL LETTER I WITH MACRON AND GRAVE (ī̀) is defined by U+012B followed by U+0300. A list of named sequences in Unicode 11.0 is provided by the data file NamedSequences.txt.

Adding support for named sequences would add complexity, as the compiler would have to treat these differently than normal characters (named sequences can't be used for Unicode scalar literals). It's unclear whether named sequences should be supported in the initial implementation, deferred until later, or implemented separately with a new escape sequence.
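The difference is visible with the existing \u{n} escape. In this short sketch, the named sequence from the example above forms a single Character but consists of two scalars, so it cannot initialize a Unicode.Scalar:

```swift
// LATIN SMALL LETTER I WITH MACRON AND GRAVE, spelled out by scalar value.
let namedSequence = "\u{012B}\u{0300}"

print(namedSequence.count)                // 1 (one extended grapheme cluster)
print(namedSequence.unicodeScalars.count) // 2 (two scalars)

// A Character literal may contain multiple scalars...
let asCharacter: Character = "\u{012B}\u{0300}"

// ...but a Unicode.Scalar literal may not; this line would not compile:
// let asScalar: Unicode.Scalar = "\u{012B}\u{0300}"
```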

Supporting Emoji Sequences

Related to the previous point, there are also several hundred named Emoji sequences; see emoji-sequences.txt and emoji-zwj-sequences.txt.

We're less inclined to include these in an initial implementation, but are interested in gauging demand for this functionality later on.
