@jtbandes jtbandes/unicode-id-op.md Secret
Created Sep 18, 2016

Unicode pre-proposal discussion

Background and motivation

To ease lexing/parsing and avoid user confusion, the names of custom identifiers (type names, variable names, etc.) and operators in Swift can be composed of (mostly) separate sets of characters.

Using terminology from The Swift Programming Language (TSPL):

identifier-head/operator-head are characters which can begin an identifier or operator.

identifier-character/operator-character are characters which can appear anywhere in an identifier or operator (these are supersets of the -head sets).

https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/LexicalStructure.html

(Note also that some particular arrangements of characters are reserved; for instance, $ followed by digits for an implicit closure parameter, and "If an operator doesn’t begin with a dot, it can’t contain a dot elsewhere." There are also special characters in the language which are neither identifiers nor operators, such as: ()[]{},:@#)
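Here is a small sketch of two of those reserved arrangements in use, assuming a Swift 3 or later toolchain. The operator name .+. and its element-wise meaning are made up for illustration; the point is only that an operator starting with a dot may contain further dots, and that $ followed by digits names an implicit closure parameter.

```swift
// Hypothetical element-wise addition operator. Because it begins with a dot,
// it is allowed to contain another dot later on.
infix operator .+. : AdditionPrecedence

func .+. (lhs: [Int], rhs: [Int]) -> [Int] {
    // $0 and $1 are the implicit closure parameters: "$" followed by digits.
    return zip(lhs, rhs).map { $0 + $1 }
}

let sums = [1, 2, 3] .+. [10, 20, 30]
```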

Prior discussion on swift-evolution

"Request to add middle dot (U+00B7) as operator character?"
https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151214/003176.html

"Free the '$' Symbol!"
https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20151228/005133.html

"Proposal: Allow Single Dollar Sign as Valid Identifier"
https://github.com/apple/swift-evolution/pull/354

Chris Lattner has said:

"...our current operator space (particularly the unicode segments covered) is not super well considered. It would be great for someone to take a more systematic pass over them to rationalize things."

"We need a token to be unambiguously an operator or identifier - we can have different rules for the leading and subsequent characters though."

Current state of affairs

Swift's identifier-head and identifier-character mostly conform to the recommendations in http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2010/n3146.html
https://github.com/apple/swift/blob/08e7963/lib/Parse/Lexer.cpp#L421-L489

The allowed operator characters include "Unicode math, symbol, arrow, dingbat, and line/box drawing chars"; however, I don't believe this aligns with any particular spec: https://github.com/apple/swift/blob/08e7963/include/swift/AST/Identifier.h#L87-L121
https://github.com/apple/swift/commit/a2341a4

Identifiers/operators elsewhere

There is a Unicode Standard Annex, "identifier and pattern syntax" (http://unicode.org/reports/tr31/), which defines the properties ID_Start and ID_Continue.

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%3AID_Continue%3A%5D

ECMAScript 2015 "ES6"

Uses ID_Start and ID_Continue, as well as Other_ID_Start / Other_ID_Continue. http://www.ecma-international.org/ecma-262/6.0/#sec-names-and-keywords

Haskell

Distinguishes identifiers/operators by their general category (such as "any Unicode lowercase letter", "any Unicode symbol or punctuation", etc.).
http://www.fileformat.info/info/unicode/category/index.htm

In particular, identifiers can start with any lowercase letter or _, and may contain any letter/digit/'/_. This would seem to include letters like δ and Я, and digits like ٢.

https://www.haskell.org/onlinereport/syntax-iso.html
https://github.com/ghc/ghc/blob/714bebff44076061d0a719c4eda2cfd213b7ac3d/compiler/parser/Lexer.x#L1949-L1973
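To make the Haskell-style rule concrete, here is a rough sketch of it written in Swift, assuming Swift 5 or later for Unicode.Scalar.Properties. The function names are hypothetical, and this follows the simplified description above rather than GHC's actual lexer.

```swift
// An identifier may start with "_" or any lowercase letter (category Ll).
func canStartHaskellStyleIdentifier(_ scalar: Unicode.Scalar) -> Bool {
    return scalar == "_" || scalar.properties.generalCategory == .lowercaseLetter
}

// It may continue with any letter, any decimal digit, "'" or "_".
func canContinueHaskellStyleIdentifier(_ scalar: Unicode.Scalar) -> Bool {
    if scalar == "_" || scalar == "'" { return true }
    switch scalar.properties.generalCategory {
    case .uppercaseLetter, .lowercaseLetter, .titlecaseLetter,
         .modifierLetter, .otherLetter, .decimalNumber:
        return true
    default:
        return false
    }
}
```

Under this sketch δ (U+03B4) can start an identifier, while Я (U+042F) and ٢ (U+0662) can only appear after the first character.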

Current problems

Weird identifier code points

The current identifier-character set contains many characters which wouldn't make good identifiers:

  • 11 entire planes of characters (U+30000–U+3FFFD through U+D0000–U+DFFFD) which are currently unassigned.
  • The middle dot · which looks like an operator.
  • Many non-combining "modifiers" and accent marks, such as ´ and ¨ and ꓻ which don't really make sense on their own.
  • "Tone marks" from various languages, including ˫ (similar to a box-drawing character ├ which is an operator).
  • The "Greek question mark" ; (U+037E), which looks just like a semicolon (see below)
  • Symbols which are simply not linguistic, such as ۞ and ༒.

https://goo.gl/tyn0Cz
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5Ba-zA-Z%0D%0A_%0D%0A%5Cu00A8%0D%0A%5Cu00AA%0D%0A%5Cu00AD%0D%0A%5Cu00AF%0D%0A%5Cu00B2-%5Cu00B5%0D%0A%5Cu00B7-%5Cu00BA%0D%0A%5Cu00BC-%5Cu00BE%0D%0A%5Cu00C0-%5Cu00D6%0D%0A%5Cu00D8-%5Cu00F6%0D%0A%5Cu00F8-%5Cu00FF%0D%0A%5Cu0100-%5Cu02FF%0D%0A%5Cu0370-%5Cu167F%0D%0A%5Cu1681-%5Cu180D%0D%0A%5Cu180F-%5Cu1DBF%0D%0A%5Cu1E00-%5Cu1FFF%0D%0A%5Cu200B-%5Cu200D%0D%0A%5Cu202A-%5Cu202E%0D%0A%5Cu203F-%5Cu2040%0D%0A%5Cu2054%0D%0A%5Cu2060-%5Cu206F%0D%0A%5Cu2070-%5Cu20CF%0D%0A%5Cu2100-%5Cu218F%0D%0A%5Cu2460-%5Cu24FF%0D%0A%5Cu2776-%5Cu2793%0D%0A%5Cu2C00-%5Cu2DFF%0D%0A%5Cu2E80-%5Cu2FFF%0D%0A%5Cu3004-%5Cu3007%0D%0A%5Cu3021-%5Cu302F%0D%0A%5Cu3031-%5Cu303F%0D%0A%5Cu3040-%5CuD7FF%0D%0A%5CuF900-%5CuFD3D%0D%0A%5CuFD40-%5CuFDCF%0D%0A%5CuFDF0-%5CuFE1F%0D%0A%5CuFE30-%5CuFE44%0D%0A%5CuFE47-%5CuFFFD%0D%0A%5CU00010000-%5CU0001FFFD%0D%0A%5CU00020000-%5CU0002FFFD%0D%0A%5CU00030000-%5CU0003FFFD%0D%0A%5CU00040000-%5CU0004FFFD%0D%0A%5CU00050000-%5CU0005FFFD%0D%0A%5CU00060000-%5CU0006FFFD%0D%0A%5CU00070000-%5CU0007FFFD%0D%0A%5CU00080000-%5CU0008FFFD%0D%0A%5CU00090000-%5CU0009FFFD%0D%0A%5CU000A0000-%5CU000AFFFD%0D%0A%5CU000B0000-%5CU000BFFFD%0D%0A%5CU000C0000-%5CU000CFFFD%0D%0A%5CU000D0000-%5CU000DFFFD%0D%0A%5CU000E0000-%5CU000EFFFD%5D%0D%0A%5B0-9%0D%0A%5Cu0300-%5Cu036F%0D%0A%5Cu1DC0-%5Cu1DFF%0D%0A%5Cu20D0-%5Cu20FF%0D%0A%5CuFE20-%5CuFE2F%5D
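A quick way to check individual code points against the current rules is a plain range lookup. The sketch below copies only a few of the identifier-head ranges from the set linked above, so it is deliberately incomplete, and the names are made up.

```swift
// A partial list of the current identifier-head ranges, for illustration only.
let someIdentifierHeadRanges: [ClosedRange<UInt32>] = [
    0x00B7...0x00BA,   // includes U+00B7 MIDDLE DOT
    0x0100...0x02FF,   // includes U+02EB, the tone mark that resembles a box-drawing character
    0x0370...0x167F,   // includes U+037E GREEK QUESTION MARK
    0x30000...0x3FFFD, // one of the entire planes
]

// True if the scalar falls in any of the ranges listed above.
func isInListedIdentifierHeadRanges(_ scalar: Unicode.Scalar) -> Bool {
    return someIdentifierHeadRanges.contains { $0.contains(scalar.value) }
}
```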

Weird operator code points

The current operator-character set has a lot of characters that are clearly operator-esque (≈ ∈ ⊕ ⊅), but some things are not so obviously desirable:

  • Box-drawing characters
  • Combining accents and other characters
  • Various symbols, e.g. ⚄ and ♄ (this category also overlaps with emoji)
  • Braille patterns such as ⠟ — should they not be treated as letter-like (thus identifiers)?
  • A plethora of arrows

https://goo.gl/s136Nh
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%2F%3D%5C-%2B%21*%25%3C%3E%5C%26%7C%5C%5E~%3F%0D%0A%5Cu00A1-%5Cu00A7%0D%0A%5Cu00A9%5Cu00AB%0D%0A%5Cu00AC%0D%0A%5Cu00AE%0D%0A%5Cu00B0-%5Cu00B1%0D%0A%5Cu00B6%0D%0A%5Cu00BB%0D%0A%5Cu00BF%0D%0A%5Cu00D7%0D%0A%5Cu00F7%0D%0A%5Cu2016-%5Cu2017%0D%0A%5Cu2020-%5Cu2027%0D%0A%5Cu2030-%5Cu203E%0D%0A%5Cu2041-%5Cu2053%0D%0A%5Cu2055-%5Cu205E%0D%0A%5Cu2190-%5Cu23FF%0D%0A%5Cu2500-%5Cu2775%0D%0A%5Cu2794-%5Cu2BFF%0D%0A%5Cu2E00-%5Cu2E7F%0D%0A%5Cu3001-%5Cu3003%0D%0A%5Cu3008-%5Cu3030%5D%0D%0A%5B%5Cu0300-%5Cu036F%0D%0A%5Cu1DC0-%5Cu1DFF%0D%0A%5Cu20D0-%5Cu20FF%0D%0A%5CuFE00-%5CuFE0F%0D%0A%5CuFE20-%5CuFE2F%0D%0A%5CU000E0100-%5CU000E01EF%5D

Code points which are both

A handful of characters are accepted both as identifier-head and operator-head (which seems pointless and might have been unintentional):

U+3021–U+3029, Suzhou numerals 〡〢〣〤〥〦〧〨〩 https://en.wikipedia.org/wiki/Suzhou_numerals
U+302A–U+302F, ideographic & hangul tone marks 〪 〫 〬 〭 〮 〯

let 〨 = 2
infix operator <〨>

(Note that infix operator 〨 doesn't work because the lexer greedily treats this as an identifier. Also, interestingly, the corresponding ideographic zero 〇 is only an identifier char.)

https://goo.gl/lZcMqO
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%5ba-zA-Z%0d%0a_%0d%0a%5cu00A8%0d%0a%5cu00AA%0d%0a%5cu00AD%0d%0a%5cu00AF%0d%0a%5cu00B2-%5cu00B5%0d%0a%5cu00B7-%5cu00BA%0d%0a%5cu00BC-%5cu00BE%0d%0a%5cu00C0-%5cu00D6%0d%0a%5cu00D8-%5cu00F6%0d%0a%5cu00F8-%5cu00FF%0d%0a%5cu0100-%5cu02FF%0d%0a%5cu0370-%5cu167F%0d%0a%5cu1681-%5cu180D%0d%0a%5cu180F-%5cu1DBF%0d%0a%5cu1E00-%5cu1FFF%0d%0a%5cu200B-%5cu200D%0d%0a%5cu202A-%5cu202E%0d%0a%5cu203F-%5cu2040%0d%0a%5cu2054%0d%0a%5cu2060-%5cu206F%0d%0a%5cu2070-%5cu20CF%0d%0a%5cu2100-%5cu218F%0d%0a%5cu2460-%5cu24FF%0d%0a%5cu2776-%5cu2793%0d%0a%5cu2C00-%5cu2DFF%0d%0a%5cu2E80-%5cu2FFF%0d%0a%5cu3004-%5cu3007%0d%0a%5cu3021-%5cu302F%0d%0a%5cu3031-%5cu303F%0d%0a%5cu3040-%5cuD7FF%0d%0a%5cuF900-%5cuFD3D%0d%0a%5cuFD40-%5cuFDCF%0d%0a%5cuFDF0-%5cuFE1F%0d%0a%5cuFE30-%5cuFE44%0d%0a%5cuFE47-%5cuFFFD%0d%0a%5cU00010000-%5cU0001FFFD%0d%0a%5cU00020000-%5cU0002FFFD%0d%0a%5cU00030000-%5cU0003FFFD%0d%0a%5cU00040000-%5cU0004FFFD%0d%0a%5cU00050000-%5cU0005FFFD%0d%0a%5cU00060000-%5cU0006FFFD%0d%0a%5cU00070000-%5cU0007FFFD%0d%0a%5cU00080000-%5cU0008FFFD%0d%0a%5cU00090000-%5cU0009FFFD%0d%0a%5cU000A0000-%5cU000AFFFD%0d%0a%5cU000B0000-%5cU000BFFFD%0d%0a%5cU000C0000-%5cU000CFFFD%0d%0a%5cU000D0000-%5cU000DFFFD%0d%0a%5cU000E0000-%5cU000EFFFD%5d%26%5b%2f%3d%5c-%2b%21%2a%25%3C%3E%5c%26%7c%5c%5e~%3f%0d%0a%5cu00A1-%5cu00A7%0d%0a%5cu00A9%5cu00AB%0d%0a%5cu00AC%0d%0a%5cu00AE%0d%0a%5cu00B0-%5cu00B1%0d%0a%5cu00B6%0d%0a%5cu00BB%0d%0a%5cu00BF%0d%0a%5cu00D7%0d%0a%5cu00F7%0d%0a%5cu2016-%5cu2017%0d%0a%5cu2020-%5cu2027%0d%0a%5cu2030-%5cu203E%0d%0a%5cu2041-%5cu2053%0d%0a%5cu2055-%5cu205E%0d%0a%5cu2190-%5cu23FF%0d%0a%5cu2500-%5cu2775%0d%0a%5cu2794-%5cu2BFF%0d%0a%5cu2E00-%5cu2E7F%0d%0a%5cu3001-%5cu3003%0d%0a%5cu3008-%5cu3030%5d]
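The overlap can also be reproduced by intersecting the two range lists directly. This sketch uses only the ranges near U+3000 from the two sets linked above; the helper name commonRanges is hypothetical.

```swift
// Nearby identifier-head and operator-head ranges, copied from the sets above.
let identifierHeadRanges: [ClosedRange<UInt32>] = [
    0x3004...0x3007, 0x3021...0x302F, 0x3031...0x303F,
]
let operatorHeadRanges: [ClosedRange<UInt32>] = [
    0x3001...0x3003, 0x3008...0x3030,
]

// Returns every non-empty pairwise intersection of the two range lists.
func commonRanges(_ lhs: [ClosedRange<UInt32>],
                  _ rhs: [ClosedRange<UInt32>]) -> [ClosedRange<UInt32>] {
    var result: [ClosedRange<UInt32>] = []
    for a in lhs {
        for b in rhs where a.overlaps(b) {
            let lower = max(a.lowerBound, b.lowerBound)
            let upper = min(a.upperBound, b.upperBound)
            result.append(lower...upper)
        }
    }
    return result
}

let both = commonRanges(identifierHeadRanges, operatorHeadRanges)
```

For these inputs the only overlap is U+3021–U+302F, which is exactly the Suzhou numerals plus the tone marks.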

In addition to the numerals and tone marks above, many (all?) combining marks are accepted as identifier-character and operator-character. These may be necessary for natural-looking words in some languages, but they don't seem necessary for operators.

Also present in both sets are the variation selectors 1 through 256 (U+FE00–U+FE0F, U+E0100–U+E01EF). It seems they are of limited use for the operator characters, unless you count the emoji: http://www.unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt

https://goo.gl/VKrisf
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[%5ba-zA-Z%0d%0a_%0d%0a%5cu00A8%0d%0a%5cu00AA%0d%0a%5cu00AD%0d%0a%5cu00AF%0d%0a%5cu00B2-%5cu00B5%0d%0a%5cu00B7-%5cu00BA%0d%0a%5cu00BC-%5cu00BE%0d%0a%5cu00C0-%5cu00D6%0d%0a%5cu00D8-%5cu00F6%0d%0a%5cu00F8-%5cu00FF%0d%0a%5cu0100-%5cu02FF%0d%0a%5cu0370-%5cu167F%0d%0a%5cu1681-%5cu180D%0d%0a%5cu180F-%5cu1DBF%0d%0a%5cu1E00-%5cu1FFF%0d%0a%5cu200B-%5cu200D%0d%0a%5cu202A-%5cu202E%0d%0a%5cu203F-%5cu2040%0d%0a%5cu2054%0d%0a%5cu2060-%5cu206F%0d%0a%5cu2070-%5cu20CF%0d%0a%5cu2100-%5cu218F%0d%0a%5cu2460-%5cu24FF%0d%0a%5cu2776-%5cu2793%0d%0a%5cu2C00-%5cu2DFF%0d%0a%5cu2E80-%5cu2FFF%0d%0a%5cu3004-%5cu3007%0d%0a%5cu3021-%5cu302F%0d%0a%5cu3031-%5cu303F%0d%0a%5cu3040-%5cuD7FF%0d%0a%5cuF900-%5cuFD3D%0d%0a%5cuFD40-%5cuFDCF%0d%0a%5cuFDF0-%5cuFE1F%0d%0a%5cuFE30-%5cuFE44%0d%0a%5cuFE47-%5cuFFFD%0d%0a%5cU00010000-%5cU0001FFFD%0d%0a%5cU00020000-%5cU0002FFFD%0d%0a%5cU00030000-%5cU0003FFFD%0d%0a%5cU00040000-%5cU0004FFFD%0d%0a%5cU00050000-%5cU0005FFFD%0d%0a%5cU00060000-%5cU0006FFFD%0d%0a%5cU00070000-%5cU0007FFFD%0d%0a%5cU00080000-%5cU0008FFFD%0d%0a%5cU00090000-%5cU0009FFFD%0d%0a%5cU000A0000-%5cU000AFFFD%0d%0a%5cU000B0000-%5cU000BFFFD%0d%0a%5cU000C0000-%5cU000CFFFD%0d%0a%5cU000D0000-%5cU000DFFFD%0d%0a%5cU000E0000-%5cU000EFFFD%5d%0d%0a%5b0-9%0d%0a%5cu0300-%5cu036F%0d%0a%5cu1DC0-%5cu1DFF%0d%0a%5cu20D0-%5cu20FF%0d%0a%5cuFE20-%5cuFE2F%5d%26%5b%2f%3d%5c-%2b%21%2a%25%3C%3E%5c%26%7c%5c%5e~%3f%0d%0a%5cu00A1-%5cu00A7%0d%0a%5cu00A9%5cu00AB%0d%0a%5cu00AC%0d%0a%5cu00AE%0d%0a%5cu00B0-%5cu00B1%0d%0a%5cu00B6%0d%0a%5cu00BB%0d%0a%5cu00BF%0d%0a%5cu00D7%0d%0a%5cu00F7%0d%0a%5cu2016-%5cu2017%0d%0a%5cu2020-%5cu2027%0d%0a%5cu2030-%5cu203E%0d%0a%5cu2041-%5cu2053%0d%0a%5cu2055-%5cu205E%0d%0a%5cu2190-%5cu23FF%0d%0a%5cu2500-%5cu2775%0d%0a%5cu2794-%5cu2BFF%0d%0a%5cu2E00-%5cu2E7F%0d%0a%5cu3001-%5cu3003%0d%0a%5cu3008-%5cu3030%5d%0d%0a%5b%5cu0300-%5cu036F%0d%0a%5cu1DC0-%5cu1DFF%0d%0a%5cu20D0-%5cu20FF%0d%0a%5cuFE00-%5cuFE0F%0d%0a%5cuFE20-%5cuFE2F%0d%0a%5cU000E0100-%5cU000E01EF%5d]

Code points which should be illegal

There are several surprising non-printing characters, including:

  • U+2064 INVISIBLE PLUS is currently an identifier
  • U+200B ZERO WIDTH SPACE is currently an identifier

No good will come of these.
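Both of the scalars named above have the Unicode general category Cf ("format"), so one possible signal for rejecting them is a category check. This is a sketch assuming Swift 5 or later; the function name is hypothetical.

```swift
// True for invisible formatting characters (general category Cf).
func isFormatCharacter(_ scalar: Unicode.Scalar) -> Bool {
    return scalar.properties.generalCategory == .format
}

let invisiblePlusIsFormat = isFormatCharacter("\u{2064}")
let zeroWidthSpaceIsFormat = isFormatCharacter("\u{200B}")
```

A blanket ban on Cf would also catch U+200D ZERO WIDTH JOINER, which emoji sequences rely on, so a real rule would need some exceptions.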

Categories which are split between identifiers and operators

  • Emoji and symbols: most of the newer emoji are identifiers, but many emoji/pictographs are operators, especially those from "Miscellaneous Symbols". The results are hilariously illogical:

    • ☹️ is an operator, but 🙂 is an identifier.
    • ✌️ is an operator, but 🤘 is an identifier.
    • ▶️ is an operator, but 🔼 is an identifier.
    • ✳️ is an operator, but 🔯 is an identifier.
    • ✈️ is an operator, but 🛩 is an identifier.
    • ♠️ is an operator, but 🂡 is an identifier. (Presumably, 🂡 = A ♠️ 🂠!)

    (But the counterintuitive examples extend outside the emoji too: + is an operator, while ₊ and ⁺ are identifiers.)

  • Currency symbols: ¢ £ ¤ ¥ are operators, but ₪ € ₱ ₹ ฿ and many others are identifiers, and $ is allowed in an identifier.
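The currency split is easy to see next to Unicode's own classification: every symbol in that bullet has the same general category, Sc. A sketch assuming Swift 5 or later, with made-up names and the symbols written as escapes:

```swift
// Currently operators: ¢ £ ¤ ¥
let operatorCurrencies: [Unicode.Scalar] = ["\u{00A2}", "\u{00A3}", "\u{00A4}", "\u{00A5}"]
// Currently identifier characters: ₪ € ₱ ₹ ฿ $
let identifierCurrencies: [Unicode.Scalar] = [
    "\u{20AA}", "\u{20AC}", "\u{20B1}", "\u{20B9}", "\u{0E3F}", "$",
]

// True if Unicode classifies the scalar as a currency symbol (Sc).
func isCurrencySymbol(_ scalar: Unicode.Scalar) -> Bool {
    return scalar.properties.generalCategory == .currencySymbol
}

let allAreCurrency = (operatorCurrencies + identifierCurrencies).allSatisfy(isCurrencySymbol)
```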

Missing characters

A handful of characters are neither operators nor identifiers. This list mostly makes sense (reserved characters and whitespace), but I wonder about a few which seem like they could easily be operators: ⑊ ⑀ ﹅ etc.

https://goo.gl/U0GVNn
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5B%5Cu0001-%5CU0010FFFF%5D-%5B%5B%2F%3D%5C-%2B!*%25%3C%3E%5C%26%7C%5C%5E~%3F%0D%0A%5Cu00A1-%5Cu00A7%0D%0A%5Cu00A9%5Cu00AB%0D%0A%5Cu00AC%0D%0A%5Cu00AE%0D%0A%5Cu00B0-%5Cu00B1%0D%0A%5Cu00B6%0D%0A%5Cu00BB%0D%0A%5Cu00BF%0D%0A%5Cu00D7%0D%0A%5Cu00F7%0D%0A%5Cu2016-%5Cu2017%0D%0A%5Cu2020-%5Cu2027%0D%0A%5Cu2030-%5Cu203E%0D%0A%5Cu2041-%5Cu2053%0D%0A%5Cu2055-%5Cu205E%0D%0A%5Cu2190-%5Cu23FF%0D%0A%5Cu2500-%5Cu2775%0D%0A%5Cu2794-%5Cu2BFF%0D%0A%5Cu2E00-%5Cu2E7F%0D%0A%5Cu3001-%5Cu3003%0D%0A%5Cu3008-%5Cu3030%5D%0D%0A%5B%5Cu0300-%5Cu036F%0D%0A%5Cu1DC0-%5Cu1DFF%0D%0A%5Cu20D0-%5Cu20FF%0D%0A%5CuFE00-%5CuFE0F%0D%0A%5CuFE20-%5CuFE2F%0D%0A%5CU000E0100-%5CU000E01EF%5D%5Ba-zA-Z%0D%0A_%0D%0A%5Cu00A8%0D%0A%5Cu00AA%0D%0A%5Cu00AD%0D%0A%5Cu00AF%0D%0A%5Cu00B2-%5Cu00B5%0D%0A%5Cu00B7-%5Cu00BA%0D%0A%5Cu00BC-%5Cu00BE%0D%0A%5Cu00C0-%5Cu00D6%0D%0A%5Cu00D8-%5Cu00F6%0D%0A%5Cu00F8-%5Cu00FF%0D%0A%5Cu0100-%5Cu02FF%0D%0A%5Cu0370-%5Cu167F%0D%0A%5Cu1681-%5Cu180D%0D%0A%5Cu180F-%5Cu1DBF%0D%0A%5Cu1E00-%5Cu1FFF%0D%0A%5Cu200B-%5Cu200D%0D%0A%5Cu202A-%5Cu202E%0D%0A%5Cu203F-%5Cu2040%0D%0A%5Cu2054%0D%0A%5Cu2060-%5Cu206F%0D%0A%5Cu2070-%5Cu20CF%0D%0A%5Cu2100-%5Cu218F%0D%0A%5Cu2460-%5Cu24FF%0D%0A%5Cu2776-%5Cu2793%0D%0A%5Cu2C00-%5Cu2DFF%0D%0A%5Cu2E80-%5Cu2FFF%0D%0A%5Cu3004-%5Cu3007%0D%0A%5Cu3021-%5Cu302F%0D%0A%5Cu3031-%5Cu303F%0D%0A%5Cu3040-%5CuD7FF%0D%0A%5CuF900-%5CuFD3D%0D%0A%5CuFD40-%5CuFDCF%0D%0A%5CuFDF0-%5CuFE1F%0D%0A%5CuFE30-%5CuFE44%0D%0A%5CuFE47-%5CuFFFD%0D%0A%5CU00010000-%5CU0001FFFD%0D%0A%5CU00020000-%5CU0002FFFD%0D%0A%5CU00030000-%5CU0003FFFD%0D%0A%5CU00040000-%5CU0004FFFD%0D%0A%5CU00050000-%5CU0005FFFD%0D%0A%5CU00060000-%5CU0006FFFD%0D%0A%5CU00070000-%5CU0007FFFD%0D%0A%5CU00080000-%5CU0008FFFD%0D%0A%5CU00090000-%5CU0009FFFD%0D%0A%5CU000A0000-%5CU000AFFFD%0D%0A%5CU000B0000-%5CU000BFFFD%0D%0A%5CU000C0000-%5CU000CFFFD%0D%0A%5CU000D0000-%5CU000DFFFD%0D%0A%5CU000E0000-%5CU000EFFFD%5D%0D%0A%5B0-9%0D%0A%5Cu0300-%5Cu036F%0D%0A%5Cu1DC0-%5Cu1DFF%0D%0A%5Cu20D0-%5Cu20FF%0D%0A%5CuFE20-%5CuFE2F%5D%5D%5D

Solutions

Still up for discussion — please reply to this thread!

Adopting (X)ID_Start/Continue for identifiers, or a simpler solution like Haskell's use of "letter" categories, might work well.

(I've given up hope of finding some kind of "perfect" solution — how can it be possible, when ᛏ is a letter, yet ↑ is not?)

Making the choice of operator characters more logical/standards-based would be nice (not just a set of ranges). However, Haskell's approach of using all punctuation & symbols is probably not right for Swift:

https://goo.gl/Ud4KqY
http://unicode.org/cldr/utility/unicodeset.jsp?a=%5B%5B-%2F%3D%2B!*%25%3C%3E%5C%26%7C%5C%5E~?%5Cu00A1-%5Cu00A7%5Cu00A9%5Cu00AB%5Cu00AC%5Cu00AE%5Cu00B0-%5Cu00B1%5Cu00B6%5Cu00BB%5Cu00BF%5Cu00D7%5Cu00F7%5Cu2016-%5Cu2017%5Cu2020-%5Cu2027%5Cu2030-%5Cu203E%5Cu2041-%5Cu2053%5Cu2055-%5Cu205E%5Cu2190-%5Cu23FF%5Cu2500-%5Cu2775%5Cu2794-%5Cu2BFF%5Cu2E00-%5Cu2E7F%5Cu3001-%5Cu3003%5Cu3008-%5Cu3030%5Cu0300-%5Cu036F%5Cu1DC0-%5Cu1DFF%5Cu20D0-%5Cu20FF%5CuFE00-%5CuFE0F%5CuFE20-%5CuFE2F%5CU000E0100-%5CU000E01EF%5D%5D&b=%5B%5B:Currency_Symbol:%5D%5B:Modifier_Symbol:%5D%5B:Math_Symbol:%5D%5B:Other_Symbol:%5D%5B:Connector_Punctuation:%5D%5B:Dash_Punctuation:%5D%5B:Close_Punctuation:%5D%5B:Final_Punctuation:%5D%5B:Initial_Punctuation:%5D%5B:Other_Punctuation:%5D%5B:Open_Punctuation:%5D%5D
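One concrete way a blanket "all symbols and punctuation" rule clashes with Swift (my own observation, not something the comparison linked above spells out) is that it would sweep in characters the language reserves for its own syntax. A sketch assuming Swift 5 or later, with hypothetical names:

```swift
// True if the scalar is in any Unicode symbol (S*) or punctuation (P*) category.
func isSymbolOrPunctuation(_ scalar: Unicode.Scalar) -> Bool {
    switch scalar.properties.generalCategory {
    case .mathSymbol, .currencySymbol, .modifierSymbol, .otherSymbol,
         .connectorPunctuation, .dashPunctuation, .openPunctuation,
         .closePunctuation, .initialPunctuation, .finalPunctuation,
         .otherPunctuation:
        return true
    default:
        return false
    }
}

// Characters that Swift reserves for its own syntax.
let reserved: [Unicode.Scalar] = ["(", ")", "[", "]", "{", "}", ",", ":", "@", "#"]
let reservedCaught = reserved.filter(isSymbolOrPunctuation)
```

Every reserved character in that list is caught, and so is the underscore (connector punctuation), so any category-based rule for operators would need an explicit exclusion list.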

I'm not really sure what to do with emoji — they're a very cute novelty feature, but I don't know what the motivation is for including these as valid operators/identifiers.

At the least, we should try to gather them all into one of the two categories. My inclination would be to keep them as identifiers, which would mean moving the following out of the operator category:

https://goo.gl/CBJEKX
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B%5B%3AEmoji%3A%5D%26%5B%5B%2F%3D%5C-%2B%21*%25%3C%3E%5C%26%7C%5C%5E~%3F%0D%0A%5Cu00A1-%5Cu00A7%0D%0A%5Cu00A9%5Cu00AB%0D%0A%5Cu00AC%0D%0A%5Cu00AE%0D%0A%5Cu00B0-%5Cu00B1%0D%0A%5Cu00B6%0D%0A%5Cu00BB%0D%0A%5Cu00BF%0D%0A%5Cu00D7%0D%0A%5Cu00F7%0D%0A%5Cu2016-%5Cu2017%0D%0A%5Cu2020-%5Cu2027%0D%0A%5Cu2030-%5Cu203E%0D%0A%5Cu2041-%5Cu2053%0D%0A%5Cu2055-%5Cu205E%0D%0A%5Cu2190-%5Cu23FF%0D%0A%5Cu2500-%5Cu2775%0D%0A%5Cu2794-%5Cu2BFF%0D%0A%5Cu2E00-%5Cu2E7F%0D%0A%5Cu3001-%5Cu3003%0D%0A%5Cu3008-%5Cu3030%5D%0D%0A%5B%5Cu0300-%5Cu036F%0D%0A%5Cu1DC0-%5Cu1DFF%0D%0A%5Cu20D0-%5Cu20FF%0D%0A%5CuFE00-%5CuFE0F%0D%0A%5CuFE20-%5CuFE2F%0D%0A%5CU000E0100-%5CU000E01EF%5D%5D%5D

Concurrently-discussable topics

There are a few relevant topics that came to mind, which I think are worth discussing around the same time.

Dollar signs ($)

$ is currently allowed in identifiers, but it can't begin an identifier except for the magic implicit closure params ($0, $1, ...) and LLDB/REPL-related uses.

It's arguable, but I feel that $ would be more effective as an operator character than an identifier character. There's precedent in Haskell for operators like <$>, and being able to replicate these in Swift would be nice.
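For comparison, here is roughly what has to be written today. Since $ is not an operator character, <$> cannot be declared, so this sketch uses <^> as a stand-in. The spelling, the precedence group, and the array-only signature are all assumptions made for illustration.

```swift
// "<^>" stands in for Haskell's "<$>", which cannot be declared while "$"
// is not an operator character. The precedence here is chosen arbitrarily.
infix operator <^> : AdditionPrecedence

// Applies a function to every element of an array.
func <^> <A, B>(transform: (A) -> B, values: [A]) -> [B] {
    return values.map(transform)
}

let double: (Int) -> Int = { $0 * 2 }
let doubled = double <^> [1, 2, 3]
```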

Diagnostics improvements

Regardless of what ends up being the ultimate solution, it would be great to improve diagnostics for cases when the wrong types of characters are used.

infix operator abc produces the error "'abc' is considered to be an identifier, not an operator". That's not too bad.

let +++ = 3 produces "expected pattern".

let $foo = 3 produces "expected numeric value following '$'".

Security and сοnfuѕаbIе characters

Confusable characters (e vs. е, o vs. ο, ; vs. ;) are an issue not taken lightly in the world of web security (cf. domain names). I haven't found much information about whether this has been considered a major security issue in programming languages, but I would think so (one can imagine such characters being introduced to a codebase subtly over time, hiding malicious functionality).

It'd be pretty cool if Swift could detect whether two identifiers might be confusable, and produce a warning.

http://www.unicode.org/reports/tr36/#Recommendations_General
http://unicode.org/reports/tr39/#Confusable_Detection
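A very rough sketch of what such a check could look like, in the spirit of the "skeleton" comparison from the confusable-detection reference above. The mapping table is tiny and hand-picked, every name is hypothetical, and a real implementation would use the full confusables data and normalize the strings as well.

```swift
// Tiny, hand-picked lookalike table. Keys are written as escapes so they
// can't be mistaken for their Latin counterparts.
let lookalikes: [Character: Character] = [
    "\u{0430}": "a", // CYRILLIC SMALL LETTER A
    "\u{0435}": "e", // CYRILLIC SMALL LETTER IE
    "\u{03BF}": "o", // GREEK SMALL LETTER OMICRON
    "\u{0441}": "c", // CYRILLIC SMALL LETTER ES
    "\u{0455}": "s", // CYRILLIC SMALL LETTER DZE
]

// Replaces each character with its lookalike, if the table has one.
func skeleton(_ identifier: String) -> String {
    return String(identifier.map { lookalikes[$0] ?? $0 })
}

// Two distinct identifiers might be confused if their skeletons match.
func mightBeConfused(_ a: String, _ b: String) -> Bool {
    return a != b && skeleton(a) == skeleton(b)
}

// "scope" spelled with Cyrillic dze and es plus a Greek omicron.
let suspicious = mightBeConfused("scope", "\u{0455}\u{0441}\u{03BF}pe")
```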
