String case folding and normalization APIs

Introduction

We propose to add APIs for case folding and normalization of Unicode strings to the standard library.

Swift-evolution thread: Discussion thread topic for that proposal

Motivation

The standard library offers Unicode-aware processing of strings by default. It relies on system ICU libraries to implement that support, but for reasons outlined in SE-0211, these APIs cannot be used in a straightforward way by end users.

The same Unicode extended grapheme cluster (known in Swift as a "character") can have more than one underlying representation. For example, Ä (00C4 LATIN CAPITAL LETTER A WITH DIAERESIS) and Ä (0041 LATIN CAPITAL LETTER A + 0308 COMBINING DIAERESIS) are equivalent. Unicode-aware, locale-independent string comparisons must rely on normalization to ignore any "irrelevant" differences and on case folding to ignore case. (Case folding is an operation defined by the Unicode standard that's related to case conversion but stable across Unicode revisions, language-neutral, and context-insensitive.)

In Swift, the standard library offers Unicode-aware, locale-independent string comparisons, while Foundation augments these facilities with locale-specific comparisons. Since Swift 4.2, the standard library uses "Fast C Contiguous" (FCC) normalization when it performs (case-sensitive) string comparisons and sorting, producing results that are generally consistent with user expectations:

let x = "\u{00C4}"         // Ä
let y = "\u{0041}\u{0308}" // Ä
x == y                     // true

(Case-insensitive string comparisons require Foundation.)
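
As a small illustration of the point about multiple representations, the sketch below inspects the two strings from the example above using views that already exist in the standard library. The variable names are ours and purely illustrative.

```swift
// Two spellings of "Ä": precomposed vs. base letter + combining mark.
let precomposed = "\u{00C4}"
let decomposed = "\u{0041}\u{0308}"

// Each is a single Character (extended grapheme cluster)...
let characterCounts = (precomposed.count, decomposed.count)

// ...but the underlying scalars and UTF-8 byte counts differ.
let precomposedScalars = precomposed.unicodeScalars.map { $0.value }
let decomposedScalars = decomposed.unicodeScalars.map { $0.value }
let utf8Lengths = (precomposed.utf8.count, decomposed.utf8.count)

// String comparison ignores the difference; scalar-level comparison does not.
let equalAsStrings = precomposed == decomposed
let equalAsScalars = precomposedScalars == decomposedScalars

print(characterCounts, precomposedScalars, decomposedScalars, utf8Lengths)
print(equalAsStrings, equalAsScalars)
```

Code that hands either string to another system (as bytes or as scalars) will therefore emit different data unless it normalizes first.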

For many users, these facilities may be sufficient. However, there are clear use cases that motivate direct access to normalization and case folding functionality. Consider the following scenario:

A recent pull request improves the swift-corelibs-foundation implementation of HTTPCookie to align with IETF standards. One improvement was the use of lowercased() prior to domain matching; this is sufficient for ASCII domain names.

However, suppose we wished (as might well be advisable for a language such as Swift that provides Unicode-aware string handling by default) to support internationalized domain names (IDNs). Conversion of IDNs to their ASCII form requires use of the Nameprep and Punycode algorithms; it's possible to implement Punycode in Swift-native code, but Nameprep requires case folding and NFKC normalization and can't feasibly be implemented in Swift!

This example, we believe, is not one of a kind. In general, code that needs to interoperate with other systems by handling non-ASCII text can be expected to transform input or produce output in ways that ignore "unwanted" differences in representation, where what's "unwanted" is specified by the relevant standard and may differ from Swift's adopted FCC normalization. This is precisely why Unicode defines four normalization forms that decompose code points either canonically or compatibly (while ICU also provides access to two other "fast" algorithms specified in Unicode Technical Note #5, FCC being one of them):

  • Normalization Form D (NFD), canonical decomposition
  • Normalization Form C (NFC), canonical decomposition followed by canonical composition
  • Normalization Form KD (NFKD), compatibility decomposition
  • Normalization Form KC (NFKC), compatibility decomposition followed by canonical composition
  • "Fast C or D" (FCD)
  • "Fast C Contiguous" (FCC)

(See the proposed implementation for an introductory description of canonical and compatibility decomposition and these normalization forms.)
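
To make the canonical/compatibility distinction concrete, here is a hedged sketch that produces some of the standard forms using Foundation's existing string properties, not the API proposed here. It assumes Foundation is available; FCD and FCC are not shown because this sketch has no Foundation call to map them to. The helper scalars(of:) is a hypothetical name used only for display.

```swift
import Foundation

// Hypothetical helper: scalar values as hex strings, for display only.
func scalars(of s: String) -> [String] {
    return s.unicodeScalars.map { String($0.value, radix: 16, uppercase: true) }
}

let aDiaeresis = "\u{00C4}"   // Ä, precomposed
let ligature = "\u{FB01}"     // LATIN SMALL LIGATURE FI

// Canonical forms: NFD splits Ä into A + combining diaeresis; NFC recomposes it.
let nfd = aDiaeresis.decomposedStringWithCanonicalMapping
let nfc = nfd.precomposedStringWithCanonicalMapping

// The ligature has only a compatibility decomposition, so NFC keeps it
// while NFKC replaces it with the two letters "f" and "i".
let ligatureNFC = ligature.precomposedStringWithCanonicalMapping
let ligatureNFKC = ligature.precomposedStringWithCompatibilityMapping

print(scalars(of: nfd), scalars(of: nfc))
print(scalars(of: ligatureNFC), scalars(of: ligatureNFKC))
```

The second pair is the kind of difference that standards such as Nameprep care about: a compatibility form erases distinctions that a canonical form preserves.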

Other standards reference one or more of these normalization forms that conforming implementations should then use when processing text. We propose to expose these normalization forms and associated facilities in the standard library to enable such code to be written in native Swift.

Proposed solution

In SE-0211, a significant number of Unicode properties were exposed to users; they adhere closely to the Unicode standard, and (as a result) not all of them correspond exactly with likely expectations of users not familiar with the standard. The property isEmoji, for example, does not align perfectly with what the average smartphone user might consider to be emoji for various reasons too complex to cover here. Since these properties were intended for advanced use only, and since there were so many of them, they were added to a separate Unicode.Scalar.Properties structure rather than Unicode.Scalar itself.

An important question here is whether case folding and normalization are appropriate to be exposed to users on String itself. Several reasons suggest that this is the case:

  • Unlike properties such as isEmoji, users unfamiliar with the Unicode standard are unlikely to have subtly erroneous preconceptions as to what case folding and normalization are.
  • There are not so many new APIs, as there were in SE-0211, such that their addition would unduly bloat the existing String API.
  • The case conversion operations uppercased() and lowercased() certainly have their place, but where users arbitrarily choose one or the other for language-independent caseless comparison (such as in the case of domain matching discussed above), the more appropriate facility is case folding, which would ideally be presented to users as an option exactly where case conversions are available.
  • In Foundation, NSString provides locale-sensitive case and diacritic folding in the method folding(options:locale:), but the standard library does not provide the Unicode-defined locale-independent counterpart as it does for case conversion.
  • Many (if not most) common languages provide Unicode normalization facilities, and often as methods available on their counterpart to the String type. These languages are not exclusively "low level" and many do not also provide the full complement of ICU APIs, suggesting that normalization in particular has more use cases than perhaps other Unicode facilities:
Language      Normalization method
C#            str.Normalize(form)
Java          Normalizer.normalize(str, form)
JavaScript    str.normalize(form)
Python        unicodedata.normalize(form, str)
Rust          str.nfc(), str.nfd(), etc.

Therefore, we propose adding two methods to StringProtocol (and, therefore, to the only two types that are permitted to conform to that protocol, String and Substring):

// Existing methods:
func lowercased() -> String
func uppercased() -> String

// Proposed methods
func caseFolded() -> String
func normalized(_ form: Unicode.NormalizationForm) -> String

The available Unicode normalization forms will be defined as follows:

extension Unicode {
  /* non-frozen */ public enum NormalizationForm {
    case nfd
    case nfc
    case nfkd
    case nfkc
    case fcd
    case fcc
  }
}
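
For readers who want a feel for the call site, the following is a rough approximation of the proposed normalized(_:) built on Foundation. It is a sketch, not the proposed implementation: SketchNormalizationForm and sketchNormalized(_:) are hypothetical names, the extension is on String only, and .fcd and .fcc are left out because this sketch has no Foundation call to map them to.

```swift
import Foundation

// Hypothetical stand-in for the proposed Unicode.NormalizationForm.
enum SketchNormalizationForm {
    case nfd, nfc, nfkd, nfkc
}

extension String {
    // Hypothetical stand-in for the proposed normalized(_:).
    func sketchNormalized(_ form: SketchNormalizationForm) -> String {
        switch form {
        case .nfd:  return decomposedStringWithCanonicalMapping
        case .nfc:  return precomposedStringWithCanonicalMapping
        case .nfkd: return decomposedStringWithCompatibilityMapping
        case .nfkc: return precomposedStringWithCompatibilityMapping
        }
    }
}

// Fullwidth Latin letters A and B (U+FF21, U+FF22).
let fullwidth = "\u{FF21}\u{FF22}"

// Reads as "string, normalized NFKC"; NFKC maps fullwidth letters to ASCII.
let narrowed = fullwidth.sketchNormalized(.nfkc)
print(narrowed)
```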

Detailed design

The detailed design, including proposed documentation, is available in the implementation linked above. We describe some detailed design considerations below.

Case folding

The proposed case folding API is modeled after the existing methods uppercased() and lowercased(). In line with Unicode properties recently added in SE-0211 (changesWhenUppercased, changesWhenLowercased, changesWhenCaseFolded), it is spelled caseFolded() (note the use of camel case).

The implementation calls an ICU API to perform the Unicode-defined, language-independent, context-insensitive default case folding. Documentation will emphasize what sort of usage the case folding method is meant to support (caseless string matching) and what it's not recommended for (natural language text for human consumption).
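
The need for a dedicated folding operation can be seen with the existing methods alone. In the sketch below (variable names are ours), lowercasing fails to match two spellings that uppercasing does match, because U+00DF (ß) uppercases to "SS" while "SS" does not lowercase back to "ß". Unicode's default case folding maps ß to "ss", so a folded comparison would not depend on which conversion the author happened to pick.

```swift
let a = "Straße"
let b = "STRASSE"

// Lowercasing keeps "ß" in one string and produces "ss" in the other.
let matchLower = a.lowercased() == b.lowercased()

// Uppercasing maps "ß" to "SS", so the two happen to agree.
let matchUpper = a.uppercased() == b.uppercased()

print(a.lowercased(), b.lowercased(), matchLower)
print(a.uppercased(), b.uppercased(), matchUpper)
```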

No option is provided to use Turkic mappings (for dotless I's). Although linguistically important and supported by the underlying ICU API, it is only one of many possible language-specific case mappings. Providing this one mapping option would dilute the message that the function is (as the Unicode standard says) intended to be a language-neutral facility. Python, which also exposes a case folding API (str.casefold()), similarly does not provide a specific Turkic option.

Normalization

The normalization API is modeled after rounded(_:), taking one unlabeled option. Some may wonder whether a preposition such as "using" might be useful as an argument label. However, in our view, it does not read any better or worse, so we settled on the more concise option. Texts that discuss Unicode normalization often refer to "NFC strings" and "NFKC strings," and string.normalized(.nfc) can be read "string, normalized NFC" just as value.rounded(.down) is intended to be read "value, rounded down." (Another possible argument label, form, was also considered; however, it is strictly redundant with the parameter type, Unicode.NormalizationForm.)

We also considered whether normalization form names ought to be written out in some way rather than using the term-of-art abbreviation. Such a design was disfavored because it did not improve clarity but did increase verbosity. Specifically:

  • .nfd is more readable at a glance than .canonicalDecomposition.
  • .nfd and .nfkd are more visually distinct from each other than are .canonicalDecomposition and .compatibilityDecomposition.
  • A reader who is unfamiliar with Unicode normalization is no more likely to understand the term "canonical decomposition" than they are to understand "NFD": both are terms of art, and the latter is more easily googled.
  • A reader who has some familiarity with Unicode normalization is more likely to associate "NFD" with the relevant algorithm than "canonical decomposition" (or even "form D").
  • Corresponding properties added to Unicode.Scalar.Properties refer to these algorithms by their abbreviations: isNFDInert, isNFKDInert, etc.
  • The written-out name that would be used instead of .nfkc is .compatibilityDecompositionFollowedByCanonicalComposition, which is absurd.
  • The written-out name that would be used instead of .fcd would be .fastCOrD, which is not more readable and itself still uses the abbreviations "C" and "D."

We also considered whether there should be a default option for normalization form. In other languages that provide such a default option, it is NFC, and users familiar with Unicode normalization may expect the same default in Swift if one exists. In Swift, however, strings use FCC normalization before comparison, and Swift users unfamiliar with Unicode normalization may expect the default option for explicit normalization to be the same as that for comparison. It is not particularly cumbersome to choose one or another normalization form explicitly; moreover, doing so adds clarity for the reader and avoids the question of which normalization form ought to be the default.

The implementation calls an ICU API to perform the indicated normalization. Prior to doing so, it will call another ICU API to check quickly if the given input is definitely already normalized; if so, it will return self. In practice, most strings will already be normalized, for most normalization forms.
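
The check-then-normalize strategy described above can be approximated as follows. This is only a sketch under stated assumptions: it uses an ASCII-only test as a conservative stand-in for ICU's quick check (ASCII text is unchanged by NFC), it falls back to Foundation's precomposedStringWithCanonicalMapping rather than calling ICU directly, and the function name nfcWithFastPath is hypothetical.

```swift
import Foundation

// Hypothetical helper: NFC normalization with a conservative fast path.
func nfcWithFastPath(_ s: String) -> String {
    // ASCII-only text is already in NFC, so return the input unchanged.
    if s.utf8.allSatisfy({ $0 < 0x80 }) {
        return s
    }
    // Otherwise, fall back to a full normalization pass.
    return s.precomposedStringWithCanonicalMapping
}

let untouched = nfcWithFastPath("example.com")
let composed = nfcWithFastPath("\u{0041}\u{0308}")
print(untouched, composed.unicodeScalars.count)
```

A real quick check would accept far more inputs than ASCII, but the shape is the same: a cheap test first, and the more expensive transformation only when the test is inconclusive.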

No separate API is currently proposed here for checking whether a string is already normalized (although, if a clear use case arises, one can be added later). Thus far, the envisioned use cases for the standard library APIs proposed here would perform such a check, then normalize the string if required or use it as-is otherwise. However, Swift.String is a copy-on-write (CoW) type, and the proposed implementation of normalized(_:) already performs such a check, so a separate check would offer little additional benefit for these use cases.

Source compatibility

The proposed changes are strictly additive; there is no effect on source compatibility.

Effect on ABI stability

The proposed changes are strictly additive; there is no effect on the ABI of existing facilities.

Effect on API resilience

The Unicode.NormalizationForm enumeration is defined as non-@frozen so that future normalization forms can be added to Swift when they come into being.

Alternatives considered

An alternative to adding case folding to String (which may still have independent value) is to add caseFoldMapping to Unicode.Scalar.Properties to complement lowercaseMapping and others. Its absence appears to be an oversight. However, as discussed above, the conceptually related case conversion operations are already used by some users where case folding would be the preferred solution, and those are available as String methods. Moreover, Foundation offers a locale-aware counterpart as an NSString method. Therefore, we consider it appropriate to expose caseFolded() where lowercased() and uppercased() are available, and to allow users to case-fold entire strings rather than to proceed through each code point to concatenate mapped strings.

Normalization can be exposed in a Normalizer type and not as a method on String. However, for the reasons outlined above, we feel that such a method is appropriate on String and does not cause undue API bloat.

Finally, we also considered including a titlecased() method. It would largely be similar in implementation to the case folding API (with the exception that a UBreakIterator is required to identify the beginning of words). However, the result of titlecasing by locale-independent Unicode rules is unsatisfactory for human consumption in obvious ways. Consider, for example:

# Written in (actual, real) Python, which *does* expose this API:
"we're titlecasing!".title()
# "We'Re Titlecasing!"

Discussions in the Python community reveal pervasive dissatisfaction and misunderstanding. However, as Guido van Rossum has pointed out, such discrepancies between the Unicode standard and non-expert user expectations cannot actually be reconciled even by means of alternative rules: for example, no rule could correctly titlecase both "we're" and "O'Brien." Although such an API would expose some useful functionality (around digraphs, for example), a reasonable conclusion would be that something so prone to mismatched user expectations is not an appropriate general-use API for the standard library. Moreover, for digraphs, the titlecase mapping can be obtained through the Unicode.Scalar.Properties property titlecaseMapping.
