@bjhomer
Last active August 29, 2015 14:00
Tokenizer comparison
Input: @"This is a puppy that is happy. 🐶❄️ = 😀🐶. 你好,我喜欢汉字。"
kCFStringTokenizerUnitWordBoundary results:
token:[This]
token:[ ]
token:[is]
token:[ ]
token:[a]
token:[ ]
token:[puppy]
token:[ ]
token:[that]
token:[ ]
token:[is]
token:[ ]
token:[happy]
token:[.]
token:[ ]
token:[🐶]
token:[❄️]
token:[ ]
token:[=]
token:[ ]
token:[😀]
token:[🐶]
token:[.]
token:[ ]
token:[你]
token:[好]
token:[,]
token:[我]
token:[喜欢]
token:[汉字]
token:[。]
kCFStringTokenizerUnitWord results:
token:[This]
token:[is]
token:[a]
token:[puppy]
token:[that]
token:[is]
token:[happy]
token:[🐶]
token:[️] // <-- That's the failed attempt to tokenize <U+2744 SNOWFLAKE><U+FE0F VARIATION SELECTOR-16>
token:[😀🐶]
token:[你]
token:[好]
token:[我]
token:[喜欢]
token:[汉字]
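
The stray token in the kCFStringTokenizerUnitWord results is easier to understand once the snowflake emoji is broken into its code points. The following is a minimal sketch in Python (not the CoreFoundation code that produced the output above); the helper name `describe` is hypothetical and only used for illustration. It shows that ❄️ is a two-code-point sequence, which is what the comment on that token refers to.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each code point in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


# The snowflake emoji from the input string, written with explicit escapes:
# a base character followed by an emoji-presentation variation selector.
snowflake = "\u2744\ufe0f"

for code_point, name in describe(snowflake):
    print(code_point, name)
# prints:
# U+2744 SNOWFLAKE
# U+FE0F VARIATION SELECTOR-16
```

A tokenizer that treats the two code points as separate units can keep or drop either one independently, which is consistent with the lone token shown in the kCFStringTokenizerUnitWord list.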