String case folding and normalization APIs


Proposal: SE-NNNN
Authors: Xiaodi Wu, Author 2
Review Manager: TBD
Status: TBD

During the review process, add the following fields as needed:

Implementation: apple/swift#17933
Decision Notes: Rationale, Additional Commentary
Bugs: SR-NNNN, SR-MMMM

Introduction

We propose to add APIs for case folding and normalization of Unicode strings to
the standard library.
Swift-evolution thread: Discussion thread topic for that proposal
Motivation

The standard library offers Unicode-aware processing of strings by default. It
relies on system ICU libraries to implement that support, but for reasons
outlined in SE-0211,
these APIs cannot be used in a straightforward way by end users.
The same Unicode extended grapheme cluster (known in Swift as a "character")
can have more than one underlying representation. For example, Ä (00C4 LATIN
CAPITAL LETTER A WITH DIAERESIS) and Ä (0041 LATIN CAPITAL LETTER A + 0308
COMBINING DIAERESIS) are equivalent. Unicode-aware, locale-independent string
comparisons must rely on normalization to ignore any "irrelevant"
differences and on case folding to ignore case. (Case folding is an
operation defined by the Unicode standard that's related to case conversion
but stable across Unicode revisions, language-neutral, and context-insensitive.)
In Swift, the standard library offers Unicode-aware, locale-independent string
comparisons, while Foundation augments these facilities with locale-specific
comparisons. Since Swift 4.2, the
standard library uses "Fast C Contiguous" (FCC) normalization
when it performs (case-sensitive) string comparisons and sorting, producing
results that are generally consistent with user expectations:
```swift
let x = "\u{00C4}"         // Ä
let y = "\u{0041}\u{0308}" // Ä
x == y                     // true
```
(Case-insensitive string comparisons require Foundation.)
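To make the distinction concrete, the following sketch uses only existing standard library APIs (the constant names are ours) to inspect the two representations of Ä. It illustrates that the strings compare equal and each contain one Character, even though their scalars and UTF-8 bytes differ.

```swift
let precomposed = "\u{00C4}"          // Ä as a single scalar
let decomposed = "\u{0041}\u{0308}"   // A followed by COMBINING DIAERESIS

// Equal as strings, and each is exactly one Character...
print(precomposed == decomposed)              // true
print(precomposed.count, decomposed.count)    // 1 1

// ...but the underlying scalars and UTF-8 bytes are different.
print(precomposed.unicodeScalars.count)       // 1
print(decomposed.unicodeScalars.count)        // 2
print(Array(precomposed.utf8))                // [195, 132]
print(Array(decomposed.utf8))                 // [65, 204, 136]
```

Code that hands the raw bytes to another system therefore needs a way to pick one representation explicitly.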
For many users, these facilities may be sufficient. However, there are clear
use cases that motivate direct access to normalization and case folding
functionality. Consider the following scenario:
A recent pull request
improves the swift-corelibs-foundation implementation of HTTPCookie to align
with IETF standards. One improvement was the use of lowercased() prior to
domain matching; this is sufficient for ASCII domain names.
However, suppose we wished (as might well be advisable for a language such as
Swift that provides Unicode-aware string handling by default) to support
internationalized domain names (IDNs). Conversion of IDNs to their ASCII form
requires use of the Nameprep and Punycode algorithms; it's possible to
implement Punycode in Swift-native code, but Nameprep requires case folding and
NFKC normalization and can't feasibly be implemented in Swift!
This example, we believe, is not unique. In general, code that will
need to interoperate with other systems by handling non-ASCII text can be
expected to transform input or produce output in ways that ignore "unwanted"
differences in representation, where what's "unwanted" is specified by the
relevant standard and may differ from Swift's adopted FCC normalization. This is
precisely why Unicode defines four normalization forms that decompose code
points either canonically or compatibly (while ICU also provides access
to two other "fast" algorithms specified in Unicode Technical Note #5, FCC being
one of them):

- Normalization Form D (NFD), canonical decomposition
- Normalization Form C (NFC), canonical decomposition followed by canonical
  composition
- Normalization Form KD (NFKD), compatibility decomposition
- Normalization Form KC (NFKC), compatibility decomposition followed by
  canonical composition
- "Fast C or D" (FCD)
- "Fast C Contiguous" (FCC)

(See the proposed implementation
for an introductory description of canonical and compatibility decomposition and
these normalization forms.)
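The effect of the four Unicode-defined forms can be observed today through Foundation, which exposes them as NSString-derived properties. The sketch below assumes Foundation is available and uses a hypothetical helper, scalarValues(_:). It is not the API proposed here, and FCD and FCC have no Foundation counterpart.

```swift
import Foundation

// Hypothetical helper: the Unicode scalar values of a string, for inspection.
func scalarValues(_ s: String) -> [UInt32] {
    s.unicodeScalars.map { $0.value }
}

let aDiaeresis = "\u{00C4}"   // Ä, precomposed
let ligature = "\u{FB01}"     // ﬁ, LATIN SMALL LIGATURE FI

// Canonical forms: NFD splits Ä into A + COMBINING DIAERESIS; NFC recomposes it.
let nfd = aDiaeresis.decomposedStringWithCanonicalMapping
let nfc = nfd.precomposedStringWithCanonicalMapping
print(scalarValues(nfd))   // [65, 776]
print(scalarValues(nfc))   // [196]

// Compatibility forms also replace the ligature with "f" + "i";
// canonical forms leave it untouched.
print(scalarValues(ligature.precomposedStringWithCanonicalMapping))       // [64257]
print(scalarValues(ligature.precomposedStringWithCompatibilityMapping))   // [102, 105]
print(scalarValues(ligature.decomposedStringWithCompatibilityMapping))    // [102, 105]
```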
Other standards reference one or more of these normalization forms that
conforming implementations should then use when processing text. We propose to
expose these normalization forms and associated facilities in the standard
library to enable such code to be written in native Swift.
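As a small illustration of why lowercased() falls short for caseless matching beyond ASCII, the sketch below uses only existing standard library APIs. The strings are chosen for illustration and are not drawn from the HTTPCookie pull request.

```swift
let stored = "Straße"
let incoming = "STRASSE"

// Lowercasing both sides does not make them match: "straße" vs. "strasse".
print(stored.lowercased() == incoming.lowercased())   // false

// The scalar properties from SE-0211 show why: U+00DF (ß) is left alone by
// lowercasing but is changed by case folding.
let sharpS: Unicode.Scalar = "\u{00DF}"
print(sharpS.properties.changesWhenLowercased)   // false
print(sharpS.properties.changesWhenCaseFolded)   // true
```

The properties report that folding would change the scalar, but the standard library currently offers no way to obtain the folded string itself.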
Proposed solution

In SE-0211,
a significant number of Unicode properties were exposed to users; they adhere
closely to the Unicode standard, and (as a result) not all of them correspond
exactly with likely expectations of users not familiar with the standard. The
property isEmoji, for example, does not align perfectly with what the average
smartphone user might consider to be emoji for various reasons too complex to
cover here. Since these properties were intended for advanced use only, and
since there were so many of them, they were added to a separate
Unicode.Scalar.Properties structure rather than Unicode.Scalar itself.
An important question here is whether case folding and normalization are
appropriate to be exposed to users on String itself. Several reasons suggest
that this is the case:

- Unlike properties such as isEmoji, users unfamiliar with the Unicode
  standard are unlikely to have subtly erroneous preconceptions as to what case
  folding and normalization are.
- Far fewer new APIs are proposed here than in SE-0211, so their addition
  would not unduly bloat the existing String API.
- The case conversion operations uppercased() and lowercased() certainly
  have their place, but where users arbitrarily choose one or the other for
  language-independent caseless comparison (such as in the case of domain
  matching discussed above), the more appropriate facility is case folding,
  which would ideally be presented to users as an option exactly where case
  conversions are available.
- In Foundation, NSString provides locale-sensitive case and diacritic folding
  in the method folding(options:locale:), but the standard library does not
  provide the Unicode-defined locale-independent counterpart as it does for
  case conversion.
- Many (if not most) common languages provide Unicode normalization facilities,
  often as methods on their counterpart to the String type. These languages
  are not exclusively "low level," and many do not provide the full complement
  of ICU APIs, suggesting that normalization in particular has more use cases
  than perhaps other Unicode facilities:


| Language | Normalization method |
| --- | --- |
| C# | `str.Normalize(form)` |
| Java | `Normalizer.normalize(str, form)` |
| JavaScript | `str.normalize(form)` |
| Python | `unicodedata.normalize(form, str)` |
| Rust | `str.nfc()`, `str.nfd()`, etc. |

Therefore, we propose adding two methods to StringProtocol (and, therefore, to
the only two types that are permitted to conform to that protocol, String and
Substring):
```swift
// Existing methods:
func lowercased() -> String
func uppercased() -> String

// Proposed methods:
func caseFolded() -> String
func normalized(_ form: Unicode.NormalizationForm) -> String
```
The available Unicode normalization forms will be defined as follows:
```swift
/* non-frozen */ public enum NormalizationForm {
  case nfd
  case nfc
  case nfkd
  case nfkc
  case fcd
  case fcc
}
```
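To give a feel for how the proposed shape reads at the call site, here is a rough approximation that can be compiled today. It is only a sketch: SketchNormalizationForm and sketchNormalized(_:) are hypothetical names, the body relies on Foundation rather than on the ICU calls in the proposed implementation, and FCD and FCC are left out because Foundation has no counterpart for them.

```swift
import Foundation

// Hypothetical stand-in for the proposed Unicode.NormalizationForm
// (without .fcd and .fcc, which Foundation cannot express).
enum SketchNormalizationForm {
    case nfd, nfc, nfkd, nfkc
}

extension StringProtocol {
    // Hypothetical approximation of the proposed normalized(_:),
    // built on Foundation's NSString-derived properties.
    func sketchNormalized(_ form: SketchNormalizationForm) -> String {
        let string = String(self)
        switch form {
        case .nfd:  return string.decomposedStringWithCanonicalMapping
        case .nfc:  return string.precomposedStringWithCanonicalMapping
        case .nfkd: return string.decomposedStringWithCompatibilityMapping
        case .nfkc: return string.precomposedStringWithCompatibilityMapping
        }
    }
}

let text = "A\u{0308}\u{FB01}n"                       // "Äﬁn", with Ä decomposed
let nfc = text.sketchNormalized(.nfc)                 // called on a String
let nfkc = text.dropLast().sketchNormalized(.nfkc)    // called on a Substring

print(nfc.unicodeScalars.map { $0.value })    // [196, 64257, 110]
print(nfkc.unicodeScalars.map { $0.value })   // [196, 102, 105]
```

Because the method lives on StringProtocol, it is available on both String and Substring, mirroring lowercased() and uppercased().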
Detailed design

The detailed design, including proposed documentation, is available in the
implementation linked above. We describe some detailed design considerations
below.
Case folding

The proposed case folding API is modeled after the existing methods
uppercased() and lowercased(). In line with Unicode properties recently
added in SE-0211 (changesWhenUppercased, changesWhenLowercased,
changesWhenCaseFolded), it is spelled caseFolded() (note the use of camel
case).
The implementation calls an ICU API to perform the Unicode-defined,
language-independent, context-insensitive default case folding. Documentation
will emphasize what sort of usage the case folding method is meant to support
(caseless string matching) and what it's not recommended for (natural language
text for human consumption).
No option is provided to use Turkic mappings (for dotless I's). Although
linguistically important and supported by the underlying ICU API, it is only one
of many possible language-specific case mappings. Providing this one mapping
option dilutes the message that the function is (as the Unicode standard says)
intended to be a language-neutral facility. Python, which also exposes a case
folding API (str.casefold()), similarly does not provide a specific Turkic
option.
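The intended usage, caseless matching, might look roughly like the sketch below. Since caseFolded() does not exist yet, the hypothetical sketchCaseFolded() stands in for it using Foundation's folding(options:locale:) with a nil locale. That is an assumption made for illustration only: Foundation's folding is not guaranteed to agree with Unicode default case folding outside ASCII, so only ASCII inputs are shown.

```swift
import Foundation

extension StringProtocol {
    // Hypothetical stand-in for the proposed caseFolded(); this uses
    // Foundation's folding, not the ICU default case folding proposed here.
    func sketchCaseFolded() -> String {
        String(self).folding(options: .caseInsensitive, locale: nil)
    }
}

// Hypothetical helper: caseless matching by comparing folded forms.
func caselessMatch(_ a: String, _ b: String) -> Bool {
    a.sketchCaseFolded() == b.sketchCaseFolded()
}

print(caselessMatch("Example.COM", "example.com"))   // true
print(caselessMatch("example.com", "example.org"))   // false
```

The folded strings are meant to be compared or used as keys, not shown to users.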
Normalization

The normalization API is modeled after rounded(_:), taking one unlabeled
option. Some may wonder whether a preposition such as "using" might be useful as
an argument label. However, in our view, it does not read any better or worse,
so we settled on the more concise option. Texts that discuss Unicode
normalization often refer to "NFC strings" and "NFKC strings," and
string.normalized(.nfc) can be read "string, normalized NFC" just as
value.rounded(.down) is intended to be read "value, rounded down." (Another
possible argument label, form, was also considered; however, it is strictly
redundant with the parameter type, Unicode.NormalizationForm.)
We also considered whether normalization form names ought to be written out
in some way rather than using the term-of-art abbreviation. Such a design was
disfavored because it did not improve clarity but did increase verbosity.
Specifically:

- .nfd is more readable at a glance than .canonicalDecomposition.
- .nfd and .nfkd are more visually distinct from each other than are
  .canonicalDecomposition and .compatibilityDecomposition.
- A reader who is unfamiliar with Unicode normalization is no more likely to
  understand the term "canonical decomposition" than they are to understand
  "NFD": both are terms of art, and the latter is more easily googled.
- A reader who has some familiarity with Unicode normalization is more likely to
  associate "NFD" with the relevant algorithm than "canonical decomposition" (or
  even "form D").
- Corresponding properties added to Unicode.Scalar.Properties refer to these
  algorithms by their abbreviations: isNFDInert, isNFKDInert, etc.
- The written-out name that would be used instead of .nfkc is
  .compatibilityDecompositionFollowedByCanonicalComposition, which is absurd.
- The written-out name that would be used instead of .fcd would be
  .fastCOrD, which is not more readable and itself still uses the
  abbreviations "C" and "D."

We also considered whether there should be a default option for normalization
form. In other languages that provide such a default option, it is NFC, and
users familiar with Unicode normalization may expect the same default in Swift
if one exists. In Swift, however, strings use FCC normalization before
comparison, and Swift users unfamiliar with Unicode normalization may expect
the default option for explicit normalization to be the same as that for
comparison. It is not particularly cumbersome to choose one or another
normalization form explicitly; moreover, it adds clarity for the reader and
avoids this issue regarding which normalization form ought to be the default.
The implementation calls an ICU API to perform the indicated normalization.
Prior to doing so, it will call another ICU API to check quickly if the given
input is definitely already normalized; if so, it will return self. In
practice, most strings will already be normalized, for most normalization
forms.
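The "check first, normalize only if needed" pattern can be sketched as follows. This is not the proposed implementation: the ICU quick check is replaced here by a much cruder test (ASCII-only text is unchanged by every Unicode normalization form), the function name is hypothetical, and the slow path uses Foundation instead of ICU.

```swift
import Foundation

// Hypothetical sketch of NFC normalization with a cheap fast path.
func sketchNormalizedNFC(_ input: String) -> String {
    // Fast path: ASCII-only strings are already in NFC (and in every other
    // normalization form), so the input can be returned as-is.
    if input.utf8.allSatisfy({ $0 < 0x80 }) {
        return input
    }
    // Slow path: let Foundation perform the actual normalization.
    return input.precomposedStringWithCanonicalMapping
}

let ascii = sketchNormalizedNFC("hello")
let composed = sketchNormalizedNFC("A\u{0308}")

print(ascii)                                      // hello
print(composed.unicodeScalars.map { $0.value })   // [196]
```

A real quick check accepts far more input than this ASCII test does, which is why most strings are expected to take the fast path in practice.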
No separate API is currently proposed here for checking whether a string is
already normalized (although one can be added later if a clear use case
arises). Thus far, the envisioned use cases for the standard library APIs
proposed here would perform such a check, then normalize if required or use
the string as-is otherwise. However, Swift.String is a copy-on-write (CoW)
type, and the proposed implementation of normalized(_:) already performs such a
check, so calling it directly achieves the same result.
Source compatibility

The proposed changes are strictly additive; there is no effect on source
compatibility.
Effect on ABI stability

The proposed changes are strictly additive; there is no effect on the ABI of
existing facilities.
Effect on API resilience

The Unicode.NormalizationForm enumeration is defined as non-@frozen so that
future normalization forms can be added to Swift when they come into being.
Alternatives considered

An alternative to adding case folding to String (which may still have
independent value) is to add caseFoldMapping to Unicode.Scalar.Properties to
complement lowercaseMapping and others. Its absence appears to be an
oversight. However, as discussed above, the conceptually related case conversion
operations are already used by some users where case folding would be the
preferred solution, and those are available as String methods. Moreover,
Foundation offers a locale-aware counterpart as an NSString method. Therefore,
we consider it appropriate to expose caseFolded() where lowercased() and
uppercased() are available, and to allow users to case-fold entire strings
rather than to proceed through each code point to concatenate mapped strings.
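For comparison, this is roughly what the per-scalar approach looks like today with the existing lowercaseMapping property. A caseFoldMapping property, if it were added, would presumably be used the same way; it does not exist, so it is not used in this sketch.

```swift
let input = "\u{00C4}PFEL"   // "ÄPFEL"

// Map each scalar individually, then concatenate the resulting strings.
let mapped = input.unicodeScalars
    .map { $0.properties.lowercaseMapping }
    .joined()

print(mapped)                         // äpfel
print(mapped == input.lowercased())   // true
```

A whole-string method spares every caller from writing this loop.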
Normalization could instead be exposed in a Normalizer type rather than as a
method on String. However, for the reasons outlined above, we feel that such a
method is appropriate on String and does not cause undue API bloat.
Finally, we also considered including a titlecased() method. It would largely
be similar in implementation to the case folding API (with the exception that a
UBreakIterator is required to identify the beginning of words). However, the
result of titlecasing by locale-independent Unicode rules is unsatisfactory for
human consumption in obvious ways. Consider, for example:
```python
# Written in (actual, real) Python, which *does* expose this API:
"we're titlecasing!".title()
# "We'Re Titlecasing!"
```
Discussions in the Python community reveal pervasive dissatisfaction and
misunderstanding. However, as Guido van Rossum has pointed out,
such discrepancies between the Unicode standard and non-expert user expectations
cannot actually be reconciled even by means of alternative rules: for example,
no rule could correctly titlecase both "we're" and "O'Brien." Although such an
API would expose some useful functionality (around digraphs, for example), a
reasonable conclusion would be that something so prone to mismatched user
expectations is not an appropriate general-use API for the standard library.
Moreover, for digraphs, the titlecase mapping can be obtained through the
Unicode.Scalar.Properties property titlecaseMapping.
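For instance, the digraph case can already be handled with the existing scalar properties, as in this short sketch (the constant names are ours):

```swift
// U+01C6 LATIN SMALL LETTER DZ WITH CARON (dž) has distinct titlecase
// and uppercase forms.
let digraph: Unicode.Scalar = "\u{01C6}"

let titlecase = digraph.properties.titlecaseMapping   // ǅ (U+01C5)
let uppercase = digraph.properties.uppercaseMapping   // Ǆ (U+01C4)

print(titlecase.unicodeScalars.map { $0.value })   // [453]
print(uppercase.unicodeScalars.map { $0.value })   // [452]
```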