@arrowtype
Last active October 8, 2021 13:15
The basics of working with unicode values in Python

Unicode values in Python

Unicode values can be written either as decimal integers (“A” is 65, “B” is 66, etc.) or in hexadecimal (“A” is 0x41, “B” is 0x42, etc.). Both notations refer to the same number.

When scripting with RoboFont or FontTools, one early hurdle is that the two notations come up in different contexts. For example, integers are often used in scripts, but hex values are shown in UIs and in the TTX output of cmap (the table that maps Unicode values to glyphs). So it's helpful to know how to convert between them for different types of work.
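As a rough illustration of that mismatch, here is a minimal sketch that takes an integer-keyed mapping (the kind a script tends to work with) and shows the same data with hex keys (closer to what a UI or a cmap dump displays). The `example_cmap` dictionary and the `cmap_as_hex` helper are made up for this example; they are not part of RoboFont or FontTools.

```python
# Hypothetical cmap-style mapping: integer code point -> glyph name.
example_cmap = {65: "A", 66: "B", 33: "exclam"}


def cmap_as_hex(cmap):
    """Return the same mapping with hex-string keys, sorted by code point."""
    return {hex(code): name for code, name in sorted(cmap.items())}


print(cmap_as_hex(example_cmap))
# {'0x21': 'exclam', '0x41': 'A', '0x42': 'B'}
```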

To go from a single-character string to its Unicode integer, you can use ord(), like:

>>> ord("A")
65

To go from an integer to a hex string, you can use hex(), like:

>>> hex(65)
'0x41'

To go from an integer (written in decimal or as a hex literal) to a string, you can use chr(), like:

>>> chr(0x41)
'A'

>>> chr(65)
'A'
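Since ord() and chr() each handle one character at a time, a whole string needs a loop. A small sketch, with helper names (`to_code_points`, `from_code_points`) invented for this example:

```python
def to_code_points(text):
    """Convert a string into a list of integer code points."""
    return [ord(ch) for ch in text]


def from_code_points(codes):
    """Convert a list of integer code points back into a string."""
    return "".join(chr(code) for code in codes)


codes = to_code_points("AB!")
print(codes)                    # [65, 66, 33]
print(from_code_points(codes))  # AB!
```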

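A hex literal such as 0x0083 is already an integer in Python, so no real conversion is needed; int() returns it unchanged and the interpreter displays it in decimal, like: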

>>> int(0x0083)
131

>>> int(0x41)
65
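The int() calls above work on hex literals, which Python already treats as integers. When the hex value arrives as text instead (for example, copied from a UI), int() needs the base as a second argument. A short sketch:

```python
# A hex literal and its decimal form are the same integer.
print(0x41 == 65)         # True

# Hex digits stored in a string must be parsed with base 16.
print(int("0083", 16))    # 131
print(int("41", 16))      # 65

# With base 16, the "0x" prefix is accepted too.
print(int("0x41", 16))    # 65
```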
@okay-type

okay-type commented Mar 31, 2020

# also useful

z = ord('A')

# lowercase hex value
x = f'{z:0>4x}'
print(x) 
# > 0041

# uppercase hex value
x = f'{z:0>4X}'
print(x)
# > 0041 (same as above, since 0x41 contains no letter digits)
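Building on the same format spec, here is a sketch of a helper that writes a character in the common U+XXXX notation. The name `u_plus` is just for illustration.

```python
def u_plus(char):
    """Format a single character as U+XXXX (at least four uppercase hex digits)."""
    return f"U+{ord(char):04X}"


print(u_plus("A"))           # U+0041
print(u_plus("\u00e9"))      # U+00E9
print(u_plus("\U0001F4A9"))  # U+1F4A9
```

The width of 4 is a minimum, so code points above U+FFFF simply get more digits.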

@okay-type

# and in robofont

z = ord('!')

# decimal to glyphname
from glyphNameFormatter import GlyphName
glyphName = GlyphName(z).getName()
print(glyphName)
# > exclam

@okay-type

# and 

import unicodedata
x = unicodedata.category('A')
print(x)
# > Lu

# the letter categories are:
# Ll -- lowercase letter
# Lu -- uppercase letter
# Lt -- titlecase letter
# Lm -- modifier letter
# Lo -- other letter
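To see those categories in practice, here is a small sketch that checks one sample character per letter category, plus a digit and a punctuation mark for contrast. The sample characters are my own picks.

```python
import unicodedata

samples = {
    "A": "Lu",       # LATIN CAPITAL LETTER A
    "a": "Ll",       # LATIN SMALL LETTER A
    "\u01c5": "Lt",  # LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
    "\u02b0": "Lm",  # MODIFIER LETTER SMALL H
    "\u3042": "Lo",  # HIRAGANA LETTER A
    "1": "Nd",       # decimal digit
    "!": "Po",       # other punctuation
}

for char, expected in samples.items():
    print(f"U+{ord(char):04X}", unicodedata.category(char), expected)
```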

@nedbat

nedbat commented Mar 31, 2020

I'm not sure you'll need it for what you do, but all Unicode code points have names, and Python can tell you what they are:

>>> import unicodedata
>>> unicodedata.name("\U0001EE01")
'ARABIC MATHEMATICAL BEH'
>>> unicodedata.name("\U0001F4A9")
'PILE OF POO'
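Two related calls may be handy here: unicodedata.lookup() goes the other way (name to character), and unicodedata.name() accepts a default value. The default matters because, in Python's unicodedata, some code points (control characters such as newline, for instance) have no name and raise ValueError otherwise. A sketch, with `safe_name` as a made-up helper name:

```python
import unicodedata


def safe_name(char, fallback="<unnamed>"):
    """Return the Unicode name of char, or fallback if it has none."""
    return unicodedata.name(char, fallback)


print(safe_name("A"))    # LATIN CAPITAL LETTER A
print(safe_name("\n"))   # <unnamed>

# lookup() is the reverse of name().
print(unicodedata.lookup("PILE OF POO") == "\U0001F4A9")  # True
```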

@lianghai

lianghai commented Apr 2, 2020

I recommend unicodedata2 (https://github.com/mikekap/unicodedata2) instead of the standard library module unicodedata, because the standard module's Unicode data is tied to your Python version and often lags behind the latest Unicode release.

Also, fontTools.unicodedata (https://github.com/fonttools/fonttools/blob/master/Lib/fontTools/unicodedata/__init__.py) is my favorite wrapper around unicodedata. It uses unicodedata2 under the hood when available and provides some useful additional tools, such as .script(char: str) -> str for the Unicode character property Script (https://www.unicode.org/reports/tr24/), and two functions for converting between Unicode Script codes and OTL script tags: .ot_tags_from_script(script_code: str) -> List[str] and .ot_tag_to_script(tag: str) -> str.
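A minimal sketch of how those functions might be used together, assuming fontTools is installed. The `describe_script` helper is invented for this example, and the import is guarded so the snippet still runs (returning None) without fontTools.

```python
try:
    from fontTools import unicodedata as ft_unicodedata
except ImportError:  # fontTools is a third-party package and may be missing
    ft_unicodedata = None


def describe_script(char):
    """Return (script code, OTL script tags) for char, or None without fontTools."""
    if ft_unicodedata is None:
        return None
    script_code = ft_unicodedata.script(char)
    return script_code, ft_unicodedata.ot_tags_from_script(script_code)


print(describe_script("A"))
```

With fontTools available, the Latin letter "A" should report the script code 'Latn' along with its OTL tag 'latn'.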

@arrowtype
Author

Thanks for such helpful additions, everyone! These are great pieces of related advice.

@arrowtype
Author

Note to self: If you’re converting a hex string like '0x000D' to an int...

You can use int() on a string with the prefix 0x, but you need to tell it to use 0 as the base:

>>> int('0x51', 0)
81
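One caveat with base 0: it relies on the prefix to pick the base, so it won't handle bare hex digits like '0041' or the U+0041 style. Here is a sketch of a more forgiving parser; `parse_code_point` is a hypothetical helper name, and it assumes the input is always meant as hex.

```python
def parse_code_point(text):
    """Parse 'U+0041', '0x41', or bare hex digits like '0041' into an int."""
    cleaned = text.strip()
    if cleaned.upper().startswith("U+"):
        cleaned = cleaned[2:]
    # Base 16 accepts an optional "0x" prefix as well as bare hex digits.
    return int(cleaned, 16)


print(parse_code_point("0x000D"))   # 13
print(parse_code_point("0041"))     # 65
print(parse_code_point("U+1F4A9"))  # 128169
```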

