johnwcowan/XCCS-3.0.md

## XCCS-3.0.md

      
Introduction

This document is a rough specification of proposed additions to XCCS,
the internal character set of Interlisp and various other Xerox products
and the direct ancestor of Unicode.  Since no one else is maintaining it,
the Medley Project has taken over the effort.  The initial objective here
is to be able to make use of more modern fonts,
for which we need full XCCS-to-Unicode mappings,
and to be able to handle the most important characters
required to write modern languages.
At present, we have Unicode mappings for only
the characters for which we have Medley bitmap fonts.
This is supposedly a subset of
XCCS 2.0,
though in fact some mapped characters are not in XCCS 2.0
(we will pretend they are).
One job, which is now underway, is to produce mappings for all the
XCCS 2.0 (1990)
characters.
XCCS is inherently a 16-bit character set,
and there is no reasonable way to expand it
beyond that point, so it is not possible to map
every Unicode character (although matters
are not as simple as a 1-1 mapping, either).
But there is also no reason why XCCS should remain frozen at 2.0.
Architecturally, since compatibility with ISO 2022
is no longer an issue,
we can now make use of character sets 01-20, 7F, and 80-A0,
as well as character codes 00-20, 7F, and 80-A0 in already
assigned character sets.  Character sets 01L and 02 are
assigned to ASCII characters with the Meta bit set (128 characters)
and to the 112 possible function keys respectively.
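
The ranges above can be sketched as simple byte tests. This is only an illustration, assuming a 16-bit XCCS code is laid out as a character-set byte followed by a character-code byte; the function names are hypothetical and not part of Medley.

```python
def split_xccs(code: int) -> tuple[int, int]:
    """Split a 16-bit XCCS code into (character set, character code)."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("XCCS codes are 16-bit values")
    return code >> 8, code & 0xFF


def is_reclaimed_charset(charset: int) -> bool:
    """True for character sets 01-20, 7F, and 80-A0."""
    return 0x01 <= charset <= 0x20 or charset == 0x7F or 0x80 <= charset <= 0xA0


def is_reclaimed_code(char: int) -> bool:
    """True for character codes 00-20, 7F, and 80-A0 inside a character set."""
    return 0x00 <= char <= 0x20 or char == 0x7F or 0x80 <= char <= 0xA0


charset, char = split_xccs(0x0141)  # character set 01, character code 41
print(f"charset {charset:02X}, code {char:02X}, reclaimed set: {is_reclaimed_charset(charset)}")
```

The two predicates are kept separate because the text treats whole character sets and individual codes within assigned sets as two different sources of free space.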

New Characters

I want to add 14,365 new CJK characters, namely the 20,721
Unihan Core 2020
characters less the 6,356 JIS X 0208 kanji characters, which are already encoded.
Unihan Core 2020 is considered to be the minimal character list for
Japanese and the national variants of Chinese and Korean when it is
not possible to provide all of the roughly 93,000 CJK characters in Unicode (a number that keeps growing).
These will be allocated to the upper half of charsets 30-7F (unassigned),
plus part of the unassigned "Chinese" region, specifically A1-B1.
That frees up the 35 "Chinese" charsets B2-D4 (8925 characters) for other uses.
That leaves us 128 character sets (32640 characters) for new and extended scripts.
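
The arithmetic behind these figures can be checked with a short sketch. It assumes 255 usable codes per character set (00-FE), which is inferred from the figure of 8925 characters for 35 character sets, and 127 codes in an upper half (80-FE); both constants are assumptions, not something the proposal states.

```python
CODES_PER_CHARSET = 255  # assumption: codes 00-FE, inferred from 8925 / 35
UPPER_HALF_CODES = 127   # assumption: codes 80-FE


def charset_count(first: int, last: int) -> int:
    """Number of character sets in the inclusive range first..last."""
    return last - first + 1


# New CJK characters: Unihan Core 2020 minus the JIS X 0208 kanji already encoded.
new_cjk = 20721 - 6356

# Room in the upper halves of character sets 30-7F plus all of A1-B1.
upper_half_capacity = charset_count(0x30, 0x7F) * UPPER_HALF_CODES
chinese_capacity = charset_count(0xA1, 0xB1) * CODES_PER_CHARSET
cjk_capacity = upper_half_capacity + chinese_capacity

# "Chinese" character sets B2-D4 freed for other uses.
freed_charsets = charset_count(0xB2, 0xD4)
freed_characters = freed_charsets * CODES_PER_CHARSET

print(f"new CJK characters: {new_cjk}")
print(f"capacity for them:  {cjk_capacity}")
print(f"freed: {freed_charsets} character sets, {freed_characters} characters")
```

Under these assumptions the new CJK characters fit in the proposed region with a small margin to spare.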
Here is my tentative proposal for what to include.

* About 3500 total emoji.
  Character sets 60-6F (4080 characters) to leave room
  for expansion.
* About 0 European-I characters (Latin, Greek, Cyrillic, Armenian, Georgian - XCCS 2.0)
* About 0 European-II characters (Runic, Gothic - XCCS 2.0).  These would not be
  included if they weren't already in XCCS 2.0.
* About 256 Middle Eastern-I characters (Arabic - XCCS 2.0)
  Character set D0.
* 868 South Asian characters (Devanagari, Bengali, Gurmukhi, Gujarati, Oriya,
  Tamil, Telugu, Kannada, Malayalam - XCCS 2.0).  Character set E5 for Devanagari
  (already allocated), E6-E9 for the rest.
* 537 South and Central Asian-II characters (Thaana, Sinhala, Mongolian, Tibetan).
  Character sets 2D-2F.
* About 500 Southeast Asian characters (Thai, Lao, Myanmar, Khmer).
  Character sets D1-D3.
* 256 East Asian (non-CJK) characters (Bopomofo, Hiragana, Katakana, Hangul - XCCS 2.0).
  Character set D4.
* 172 African characters (Ethiopic, Tifinagh).
  Note: Unicode Ethiopic characters are divided into onset + vowel pairs for XCCS.
  Character set EA.
* 710 American characters (Canadian Syllabics)
  Character sets EB-ED.
* About 0 Plane 0 symbol characters
* About 512 rendering/presentation characters (Unicode - XCCS 2.0)

Medley vs XCCS 2.0

Note: nnn/mmm means that nnn of the mmm characters in XCCS 2.0
have mappings.

* Character set 0 (Latin): 255/255 characters
* Character set 21 (Symbols 1): 138/151 characters
* Character set 22 (Symbols 2): 69/180 characters
* Character set 23 (Extended Latin): 66/87 characters
* Character set 24L (Hiragana): 83/83 characters
* Character set 24R (Bopomofo): 45/45 characters
* Character set 25L (Katakana): 91/91 characters
* Character set 26 (Greek): 109/109 characters
* Character set 27 (Cyrillic): 104/180 characters
* Character set 28L (Forms): 64/64 characters
* Character set 28R (Mosaic): 0/63 characters
* Character set 29 (Runic/Gothic): 0/154 characters
* Character set 2A (Ext. Cyrillic): 0/126 characters
* Character set 2EL (Decorated Rules): 0/24 characters
* Character set 2FL (Vertical symbols): 0/104 characters
* Character set 74L (Symbols 3): 0/27 characters
* Character set 76L (Symbols 4): 0/68 characters
* Character set E0 (Arabic): 157/157 characters
* Character set E1 (Hebrew): 92/92 characters
* Character set E2 (IPA): 21/149 characters
* Character set E3 (Hangul): 51/54 characters
* Character set E4L (Georgian): 80/80 characters

(more to come)
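
To see where the remaining mapping work lies, the nnn/mmm figures listed so far can be tallied. This is a rough sketch whose table is copied from the list above; the variable and function names are made up for illustration.

```python
# (character set, name, mapped, total), copied from the list above.
COVERAGE = [
    ("0", "Latin", 255, 255), ("21", "Symbols 1", 138, 151),
    ("22", "Symbols 2", 69, 180), ("23", "Extended Latin", 66, 87),
    ("24L", "Hiragana", 83, 83), ("24R", "Bopomofo", 45, 45),
    ("25L", "Katakana", 91, 91), ("26", "Greek", 109, 109),
    ("27", "Cyrillic", 104, 180), ("28L", "Forms", 64, 64),
    ("28R", "Mosaic", 0, 63), ("29", "Runic/Gothic", 0, 154),
    ("2A", "Ext. Cyrillic", 0, 126), ("2EL", "Decorated Rules", 0, 24),
    ("2FL", "Vertical symbols", 0, 104), ("74L", "Symbols 3", 0, 27),
    ("76L", "Symbols 4", 0, 68), ("E0", "Arabic", 157, 157),
    ("E1", "Hebrew", 92, 92), ("E2", "IPA", 21, 149),
    ("E3", "Hangul", 51, 54), ("E4L", "Georgian", 80, 80),
]


def percent(mapped: int, total: int) -> float:
    """Share of a character set's characters that already have mappings."""
    return 100.0 * mapped / total


mapped_total = sum(mapped for _, _, mapped, _ in COVERAGE)
grand_total = sum(total for _, _, _, total in COVERAGE)
unmapped_sets = [charset for charset, _, mapped, _ in COVERAGE if mapped == 0]

for charset, name, mapped, total in COVERAGE:
    print(f"{charset:>3} {name:<17} {mapped:>3}/{total:<3} {percent(mapped, total):5.1f}%")
print(f"overall {mapped_total}/{grand_total} ({percent(mapped_total, grand_total):.1f}%)")
print("no mappings yet:", ", ".join(unmapped_sets))
```

Since the list is marked as incomplete, the overall figure only covers the character sets listed so far.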

Alternative idea: modified UCS-2

An alternative to XCCS 3.0 is to adopt UCS-2 (Plane 0 of Unicode)
as the internal representation.  Plane 0 is basically full,
so it can't represent all of the UnihanCore2020 and emoji lists
discussed above: doing so requires 1904 additional UnihanCore2020 and
1542 additional emoji characters.  We also need 128+ Meta
pseudo-characters and 112 function-key pseudo-characters.
Fortunately, we can take 3686 characters from the Private Use Area
(6400 characters) for this purpose and convert them to
real Unicode when reading and writing files and when rendering fonts.
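
A minimal sketch of how this could work, assuming the borrowed code points are handed out sequentially from the start of the Private Use Area (U+E000 to U+F8FF). The table builder and the two sample characters are hypothetical, chosen only because they lie outside Plane 0.

```python
PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF
PUA_SIZE = PUA_LAST - PUA_FIRST + 1  # 6400 code points

# Budget from the text; "Meta" uses 128 as the lower bound of "128+".
NEEDED = {"UnihanCore2020": 1904, "emoji": 1542, "Meta": 128, "function keys": 112}
needed_total = sum(NEEDED.values())


def build_tables(overflow: str, first: int = PUA_FIRST):
    """Give each character outside Plane 0 an internal private-use code point."""
    if first + len(overflow) - 1 > PUA_LAST:
        raise ValueError("not enough Private Use Area code points")
    to_internal = {ord(ch): first + i for i, ch in enumerate(overflow)}
    to_external = {pua: cp for cp, pua in to_internal.items()}
    return to_internal, to_external


# Two arbitrary emoji outside Plane 0, used only as sample data.
to_internal, to_external = build_tables("\U0001F600\U0001F680")

internal = "a\U0001F600".translate(to_internal)  # what is held in memory
external = internal.translate(to_external)       # what is written to a file

print(f"{needed_total} of {PUA_SIZE} private-use code points needed")
print([f"U+{ord(ch):04X}" for ch in internal])
```

Every character of the internal string then fits in 16 bits, and translating with the reverse table restores real Unicode at the file or rendering boundary. The Meta and function-key pseudo-characters have no real Unicode equivalents, so this sketch leaves them out.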