@johnwcowan
Last active July 19, 2021 18:52
XCCS 3.0 additions

Introduction

This document is a rough specification of proposed additions to XCCS, the internal character set of Interlisp and various other Xerox products and the direct ancestor of Unicode. Since no one else is maintaining it, the Medley Project has taken over the effort. The initial objective here is to be able to make use of more modern fonts, for which we need full XCCS-to-Unicode mappings, and to be able to handle the most important characters required to write modern languages.

At present, we have Unicode mappings only for the characters for which we have Medley bitmap fonts. This is supposedly a subset of XCCS 2.0, though in fact some mapped characters aren't in XCCS 2.0 (we will pretend they are). One job, now underway, is to produce mappings for all the XCCS 2.0 (1990) characters.

XCCS is inherently a 16-bit character set, and there is no reasonable way to expand it beyond that point, so it is not possible to map every Unicode character (although matters are not as simple as a 1-1 mapping, either). But there is also no reason why XCCS should remain frozen at 2.0.

Architecturally, since compatibility with ISO 2022 is no longer an issue, we can now make use of character sets 01-20, 7F, and 80-A0, as well as character codes 00-20, 7F, and 80-A0 in already assigned character sets. Character sets 01L and 02 are assigned to ASCII characters with the Meta bit set (128 characters) and to the 112 possible function keys respectively.
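The layout above can be sketched in Python. This is only an illustration under stated assumptions: that a 16-bit XCCS code carries the character set number in its high byte and the character code in its low byte, and that a Meta character in set 01L keeps its ASCII code in the low byte. The function names are hypothetical and are not part of Medley.

```python
def split_xccs(code: int) -> tuple[int, int]:
    """Split a 16-bit code into (character set, character code).

    Assumes high byte = character set, low byte = character code.
    """
    if not 0 <= code <= 0xFFFF:
        raise ValueError("XCCS codes are 16 bits")
    return code >> 8, code & 0xFF


def is_newly_usable_charset(charset: int) -> bool:
    """Character sets 01-20, 7F, and 80-A0 (set 00 is already Latin)."""
    return 0x01 <= charset <= 0x20 or charset == 0x7F or 0x80 <= charset <= 0xA0


def is_newly_usable_code(char: int) -> bool:
    """Character codes 00-20, 7F, and 80-A0 within an assigned set."""
    return 0x00 <= char <= 0x20 or char == 0x7F or 0x80 <= char <= 0xA0


def classify(code: int) -> str:
    """Label the two pseudo-character sets named in the text."""
    charset, char = split_xccs(code)
    if charset == 0x01 and char < 0x80:
        return "meta"          # 01L: ASCII with the Meta bit set
    if charset == 0x02:
        # The text does not say which 112 codes are used, so the
        # whole set is treated alike in this sketch.
        return "function-key"
    return "other"
```

For example, under these assumptions `classify(0x0141)` would label Meta-A as `"meta"`.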

New Characters

I want to add 14,365 new CJK characters, namely the 20,721 Unihan Core 2020 characters less the 6,356 JIS X 0208 kanji characters, which are already encoded. Unihan Core 2020 is considered to be the minimal character list for Japanese and the national variants of Chinese and Korean when it is not possible to provide all 93,000 Unicode CJK characters (a number that keeps growing). These will be allocated to the upper half of charsets 30-7F (unassigned), plus part of the unassigned "Chinese" region, specifically A1-B1. That frees up the 35 "Chinese" charsets B2-D4 (8925 characters) for other uses.
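The arithmetic in this paragraph can be checked with a few lines of Python. The figure of 255 usable codes per character set is inferred from "35 charsets (8925 characters)"; the figure of 127 codes for an upper half (80-FE) is an assumption of this sketch, not something the text states.

```python
CODES_PER_SET = 255         # inferred: 8925 / 35
CODES_PER_UPPER_HALF = 127  # assumption: codes 80-FE

unihan_core_2020 = 20_721
jis_x_0208_kanji = 6_356
new_cjk = unihan_core_2020 - jis_x_0208_kanji

# Proposed homes for the new CJK characters.
upper_half_sets = 0x7F - 0x30 + 1          # charsets 30-7F
chinese_sets_used = 0xB1 - 0xA1 + 1        # charsets A1-B1
cjk_capacity = (upper_half_sets * CODES_PER_UPPER_HALF
                + chinese_sets_used * CODES_PER_SET)

# "Chinese" charsets left over for other uses.
freed_sets = 0xD4 - 0xB2 + 1               # charsets B2-D4
freed_chars = freed_sets * CODES_PER_SET

print(new_cjk, cjk_capacity, freed_sets, freed_chars)
```

Under these assumptions the proposed region is large enough for the new characters, with a small margin.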

That leaves us 128 character sets (32640 characters) for new and extended scripts. Here is my tentative proposal for what to include.

About 3500 total emoji. Character sets 60-6F (4080 characters) to leave room for expansion.

About 0 European-I characters (Latin, Greek, Cyrillic, Armenian, Georgian - XCCS 2.0)

About 0 European-II characters (Runic, Gothic - XCCS 2.0). These would not be included if they weren't already in XCCS 2.0.

About 256 Middle Eastern-I characters (Arabic - XCCS 2.0). Character set D0.

868 South Asian characters (Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam - XCCS 2.0). Character set E5 for Devanagari (already allocated), E6-E9 for the rest.

537 South and Central Asian-II characters (Thaana, Sinhala, Mongolian, Tibetan). Character sets 2D-2F.

About 500 Southeast Asian characters (Thai, Lao, Myanmar, Khmer). Character sets D1-D3.

256 East Asian (non-CJK) characters (Bopomofo, Hiragana, Katakana, Hangul - XCCS 2.0). Character set D4.

172 African characters (Ethiopic, Tifinagh). Note: Unicode Ethiopic characters are divided into onset + vowel pairs for XCCS. Character set EA.

710 American characters (Canadian Syllabics). Character sets EB-ED.

About 0 Plane 0 symbol characters

About 512 rendering/presentation characters (Unicode - XCCS 2.0)
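As a rough sanity check on the list above, the following Python sketch compares each stated character count with the capacity of the character sets named for it. It assumes 255 usable codes per character set (inferred from the 60-6F and B2-D4 figures) and covers only the entries that give both a count and a range.

```python
CODES_PER_SET = 255  # assumption, inferred from 16 sets = 4080 characters

# (group, stated count, first charset, last charset), as given in the list.
PROPOSED = [
    ("Emoji", 3500, 0x60, 0x6F),
    ("Middle Eastern-I", 256, 0xD0, 0xD0),
    ("South Asian", 868, 0xE5, 0xE9),
    ("South and Central Asian-II", 537, 0x2D, 0x2F),
    ("Southeast Asian", 500, 0xD1, 0xD3),
    ("East Asian (non-CJK)", 256, 0xD4, 0xD4),
    ("African", 172, 0xEA, 0xEA),
    ("American", 710, 0xEB, 0xED),
]


def capacity(first: int, last: int) -> int:
    """Number of codes available in charsets first..last inclusive."""
    return (last - first + 1) * CODES_PER_SET


# Spare codes per group; a negative value means the count exceeds the range.
slack = {name: capacity(first, last) - count
         for name, count, first, last in PROPOSED}

for name, spare in slack.items():
    print(f"{name}: {spare:+d}")
```

Under the 255-per-set assumption, a count of 256 in a single character set comes out one code short, which suggests those two figures are round budgets rather than exact counts.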

Medley vs XCCS 2.0

Note: nnn/mmm means that nnn of the mmm characters in XCCS 2.0 have mappings.

Character set 0 (Latin): 255/255 characters

Character set 21 (Symbols 1): 138/151 characters

Character set 22 (Symbols 2): 69/180 characters

Character set 23 (Extended Latin): 66/87 characters

Character set 24L (Hiragana): 83/83 characters

Character set 24R (Bopomofo): 45/45 characters

Character set 25L (Katakana): 91/91 characters

Character set 26 (Greek): 109/109 characters

Character set 27 (Cyrillic): 104/180 characters

Character set 28L (Forms): 64/64 characters

Character set 28R (Mosaic): 0/63 characters

Character set 29 (Runic/Gothic): 0/154 characters

Character set 2A (Ext. Cyrillic): 0/126 characters

Character set 2EL (Decorated Rules): 0/24 characters

Character set 2FL (Vertical symbols): 0/104 characters

Character set 74L (Symbols 3): 0/27 characters

Character set 76L (Symbols 4): 0/68 characters

Character set E0 (Arabic): 157/157 characters

Character set E1 (Hebrew): 92/92 characters

Character set E2 (IPA): 21/149 characters

Character set E3 (Hangul): 51/54 characters

Character set E4L (Georgian): 80/80 characters

(more to come)
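The nnn/mmm notation is easy to tally mechanically. The Python sketch below parses a few of the lines above (copied as sample data, not the full list) and reports overall coverage and the character sets with no mappings at all. The regular expression and function name are illustrative choices, not an existing tool.

```python
import re

SAMPLE = """\
Character set 21 (Symbols 1): 138/151 characters
Character set 22 (Symbols 2): 69/180 characters
Character set 28R (Mosaic): 0/63 characters
Character set E2 (IPA): 21/149 characters
"""

LINE = re.compile(r"Character set (\w+) \((.+?)\):? (\d+)/(\d+) characters")


def parse_line(line: str):
    """Return (charset, name, mapped, total), or None if the line differs."""
    m = LINE.match(line.strip())
    if m is None:
        return None
    charset, name, mapped, total = m.groups()
    return charset, name, int(mapped), int(total)


rows = [r for r in map(parse_line, SAMPLE.splitlines()) if r is not None]
mapped_total = sum(r[2] for r in rows)
grand_total = sum(r[3] for r in rows)
unmapped_sets = [r[0] for r in rows if r[2] == 0]

print(f"{mapped_total}/{grand_total} mapped; no mappings yet: {unmapped_sets}")
```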

Alternative idea: modified UCS-2

An alternative to XCCS 3.0 is to adopt UCS-2 (Plane 0 of Unicode) as the internal representation. Plane 0 is basically full, so it cannot hold all of the UnihanCore2020 and emoji lists discussed above: we would need room for 1904 additional UnihanCore2020 characters and 1542 additional emoji characters. We also need 128+ Meta pseudo-characters and 112 function-key pseudo-characters. Fortunately, we can grab 3686 characters from the Private Use Area (6400 characters) for this purpose and convert them to real Unicode when reading and writing files and when rendering fonts.
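A minimal Python sketch of this idea follows. It first checks the budget against the Private Use Area (U+E000 to U+F8FF), then shows the kind of translation that would happen at file boundaries. The two table entries are hypothetical assignments chosen purely for illustration; no actual assignments have been proposed here.

```python
PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF
pua_size = PUA_LAST - PUA_FIRST + 1

# Unihan Core extras + emoji extras + Meta + function keys.
needed = 1904 + 1542 + 128 + 112
remaining = pua_size - needed

# HYPOTHETICAL assignments: internal private-use code -> real Unicode.
INTERNAL_TO_UNICODE = {
    0xE000: "\U00020BB7",  # an arbitrary supplementary-plane ideograph
    0xE001: "\U0001F600",  # an arbitrary supplementary-plane emoji
}
UNICODE_TO_INTERNAL = {
    ord(ch): chr(code) for code, ch in INTERNAL_TO_UNICODE.items()
}


def to_external(text: str) -> str:
    """Replace borrowed private-use codes with real Unicode (when writing)."""
    return text.translate(INTERNAL_TO_UNICODE)


def to_internal(text: str) -> str:
    """Replace mapped real Unicode characters with 16-bit codes (when reading)."""
    return text.translate(UNICODE_TO_INTERNAL)


print(needed, remaining)
```

A real table would also need a policy for files that already contain private-use code points in the borrowed range, which this sketch ignores.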
