@ramalho
Created May 8, 2020 07:20
Functions to create an inverted index to find Unicode characters by name
"""
``char_index`` builds an inverted index mapping words to sets of Unicode
characters which contain that word in their names. For example::

    >>> index = char_index(32, 65)
    >>> sorted(index['SIGN'])
    ['#', '$', '%', '+', '<', '=', '>']
    >>> sorted(index['DIGIT'])
    ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
    >>> index['DIGIT'] & index['EIGHT']
    {'8'}

"""

import sys
import re
import unicodedata
from typing import Dict, Set, Iterator, cast

# Raw string avoids the invalid-escape warning that '\w+' triggers.
RE_WORD = re.compile(r'\w+')


def tokenize(text: str) -> Iterator[str]:
    """return iterable of uppercased words"""
    for match in RE_WORD.finditer(text):
        yield match.group().upper()
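

def char_index(start: int = 32, end: int = 0) -> Dict[str, Set[str]]:
    """map each word to the set of named characters in range(start, end)"""
    # end == 0 is a sentinel meaning "up to the last Unicode code point".
    if end == 0:
        end = sys.maxunicode + 1
    index: Dict[str, Set[str]] = {}
    for char in (chr(i) for i in range(start, end)):
        # Code points without a name yield '' and are skipped.
        if name := unicodedata.name(char, ''):
            for word in tokenize(name):
                index.setdefault(word, cast(Set[str], set())).add(char)
    return index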
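
The doctest above intersects two index entries by hand (``index['DIGIT'] & index['EIGHT']``). A minimal sketch of how that idea could be generalized to a multi-word query follows. ``find_chars`` is a hypothetical helper that is not part of the gist, and ``char_index`` is repeated here in condensed form only so the example runs on its own.

```python
import re
import unicodedata
from typing import Dict, Optional, Set

RE_WORD = re.compile(r'\w+')


def char_index(start: int = 32, end: int = 0x110000) -> Dict[str, Set[str]]:
    """Condensed copy of the gist's function, repeated for self-containment."""
    index: Dict[str, Set[str]] = {}
    for code in range(start, end):
        char = chr(code)
        # Unnamed code points give '', which produces no words.
        for word in RE_WORD.findall(unicodedata.name(char, '')):
            index.setdefault(word.upper(), set()).add(char)
    return index


def find_chars(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Hypothetical helper: characters whose names contain every query word."""
    result: Optional[Set[str]] = None
    for word in RE_WORD.findall(query):
        found = index.get(word.upper(), set())
        # Copy on first use so the index's own sets are never mutated.
        result = set(found) if result is None else result & found
    # An empty query matches nothing rather than everything.
    return result if result is not None else set()


index = char_index(32, 128)
for char in sorted(find_chars(index, 'latin capital letter a')):
    print(f'U+{ord(char):04X}\t{char}\t{unicodedata.name(char)}')
```

Returning an empty set for an empty query is a design choice made for this sketch; returning every indexed character would be an equally defensible convention.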