Skip to content

Instantly share code, notes, and snippets.

@oconnor663 oconnor663/
Last active Jul 19, 2016

What would you like to do?
the alphabet of all Twitter-safe characters
#! /usr/bin/env python3
import sys
import unicodedata
import urllib.request
# We want to use every possible code point we can. That means starting at 0 and
# going all the way up to 0x10ffff, the largest encodable value. Because
# Twitter does NFC Unicode normalization, we need to omit characters that don't
# have NFC_Quick_Check=Yes property. We also need to omit characters that
# Twitter might strip, as well as the surrogate characters, which aren't legal
# to encode.
# The file DerivedNormalizationProps.txt defines all the code points with
# NFC_Quick_Check values of No or Maybe. Parse these out, so that we can
# exclude them from the Twitter alphabet.
derived_normalization_props_url = \
props = urllib.request.urlopen(derived_normalization_props_url).read().decode()
non_quick_check_code_points = set()
for line in props.splitlines():
# Strip comments.
comment_start = line.find('#')
if comment_start != -1:
line = line[:comment_start]
# Skip unrelated lines.
if 'NFC_QC' not in line:
# Parse out the code point or range of code points.
hex_points = line.split(';')[0].strip().split('..')
points = [int(point, 16) for point in hex_points]
# Add single code points, or every code point in the given range
# (inclusive).
if len(points) == 1:
for i in range(points[0], points[1]+1):
bad_unicode_categories = {
"Cc", # control characters
"Cf", # format characters
"Cs", # surrogate characters
"Zl", # line separators
"Zp", # paragraph separators
"Zs", # space separators
# Iterate over all possible characters, and if they're NFC_Quick_Check=Yes and
# not in one of the bad categories, print them to stdout.
for i in range(0x110000):
if i in non_quick_check_code_points:
c = chr(i)
if unicodedata.category(c) in bad_unicode_categories:
This file has been truncated, but you can view the full file.

This comment has been minimized.

Copy link
Owner Author

commented Jul 19, 2016

I've checked this with Unicode 9.0.0, and I think there are no changes :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.