@jtanx
Created December 17, 2013 06:59
Looking up non-Unicode format 4 TrueType character maps (cmap; ShiftJIS, PRC, Big5, Wansung, Johab)

Reference material: http://www.microsoft.com/typography/otspec/cmap.htm

Convert the Unicode string using the appropriate encoding (e.g. ShiftJIS). For each Unicode code point, store the corresponding multibyte character sequence in big-endian order, i.e. with the first byte of the sequence in the high-order position.
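A minimal sketch of this conversion step in Python, assuming the built-in `shift_jis` codec is an acceptable stand-in for the font's ShiftJIS encoding. The function name `to_cmap_codes` is hypothetical and only used for illustration.

```python
def to_cmap_codes(text, codec="shift_jis"):
    """Return one integer lookup value per Unicode code point in `text`.

    Each character is encoded on its own, and its bytes are packed with
    the first byte in the high-order position.
    """
    codes = []
    for ch in text:
        raw = ch.encode(codec)  # raises UnicodeEncodeError if unmappable
        codes.append(int.from_bytes(raw, "big"))
    return codes


if __name__ == "__main__":
    # Single-byte characters simply end up in the low byte.
    print([hex(c) for c in to_cmap_codes("Aコ")])
```

Encoding character by character keeps a one-to-one mapping between code points and lookup values, which is what the cmap lookup expects.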

E.g.: Suppose you convert the character 'コ' (U+30B3) using ShiftJIS encoding. This gives the bytes [0x83, 0x52]. Store the result in a WORD (or a larger variable) to give the value 0x8352. Use this value to look up the glyph ID in the cmap.
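The lookup itself might be sketched as below. This assumes the format 4 subtable has already been parsed into plain Python lists named after the fields in the cmap specification (`endCode`, `startCode`, `idDelta`, `idRangeOffset`, `glyphIdArray`); the function name and the sample segment data are made up for illustration and do not come from a real font.

```python
def lookup_format4(code, end_codes, start_codes, id_deltas,
                   id_range_offsets, glyph_id_array):
    """Return the glyph ID for `code`, or 0 (missing glyph) if unmapped."""
    seg_count = len(end_codes)
    for i in range(seg_count):
        if code > end_codes[i]:
            continue  # segments are sorted by endCode; keep searching
        if code < start_codes[i]:
            return 0  # falls in a gap between segments
        if id_range_offsets[i] == 0:
            return (code + id_deltas[i]) & 0xFFFF
        # idRangeOffset is a byte offset measured from its own position
        # in the idRangeOffset array; glyphIdArray follows that array.
        idx = id_range_offsets[i] // 2 + (code - start_codes[i]) - (seg_count - i)
        if idx < 0 or idx >= len(glyph_id_array):
            return 0
        gid = glyph_id_array[idx]
        return (gid + id_deltas[i]) & 0xFFFF if gid != 0 else 0
    return 0


# Hypothetical two-segment table: 0x8340..0x8396 maps to glyphs 1..87,
# followed by the mandatory terminating 0xFFFF segment.
END = [0x8396, 0xFFFF]
START = [0x8340, 0xFFFF]
DELTA = [(1 - 0x8340) & 0xFFFF, 1]
RANGE_OFFSET = [0, 0]

if __name__ == "__main__":
    print(lookup_format4(0x8352, END, START, DELTA, RANGE_OFFSET, []))
```

A linear scan is used here for clarity; the specification's `searchRange`, `entrySelector` and `rangeShift` fields exist to support a binary search over the same segments.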

Conversion between character sets can be done in a number of ways, for example with ICU. On Windows, the Unicode and character set functions, such as WideCharToMultiByte, may be used instead. Depending on the language, there may also be built-in character set conversion utilities.
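As one example of built-in utilities, Python's standard codecs can do the conversion. The table below is a rough sketch that maps the Windows platform (platform ID 3) encoding IDs from the cmap specification to Python codec names. The choice of codec for each ID is an assumption: the Python codecs are close relatives of the encodings named in the specification, not guaranteed exact matches, so a given font may need a different one (for instance `cp932` instead of `shift_jis`).

```python
import codecs

# Platform ID 3 encoding IDs -> assumed Python codec names.
CODEC_FOR_ENCODING_ID = {
    2: "shift_jis",  # ShiftJIS
    3: "gbk",        # PRC (assumed superset of GB 2312)
    4: "big5",       # Big5
    5: "euc_kr",     # Wansung (assumed: KS C 5601 via EUC-KR)
    6: "johab",      # Johab
}


def encode_for_cmap(ch, encoding_id):
    """Encode one character with the codec assumed for `encoding_id`."""
    codec = CODEC_FOR_ENCODING_ID[encoding_id]
    codecs.lookup(codec)  # raises LookupError if this build lacks the codec
    return ch.encode(codec)


if __name__ == "__main__":
    print(encode_for_cmap("コ", 2).hex())
```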
