Created December 1, 2022 01:10
Python normalization comparison
import unicodedata

from unidecode import unidecode  # third-party package: pip install unidecode


def normalize(text: str) -> str:
    # Decompose accented letters (NFD), then drop every non-ASCII code point.
    return (
        unicodedata.normalize('NFD', text)
        .encode('ascii', 'ignore')
        .decode('ascii')
    )


text = 'zażółć gęślą jaźń, kožušček 北亰 François aaßaa aßb'

print(normalize(text))
# zazoc gesla jazn, kozuscek Francois aaaa ab
# (ł, ß and 北亰 have no NFD decomposition, so they are dropped entirely)
print(unidecode(text))
# zazolc gesla jazn, kozuscek Bei Jing Francois aassaa assb
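The difference between the two outputs comes from characters that NFD cannot split into an ASCII base letter plus combining marks. The sketch below is one way to list them using only the standard library. The helper names `strip_accents` and `lost_in_ascii` are hypothetical and are not part of the original snippet.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """NFD-decompose, then drop combining marks but keep every base character."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def lost_in_ascii(text: str) -> list:
    """Characters still outside ASCII after accents are stripped.

    These are the ones that encode('ascii', 'ignore') silently removes.
    """
    return sorted({ch for ch in strip_accents(text) if ord(ch) > 127})


sample = "zażółć gęślą jaźń, kožušček 北亰 François aaßaa aßb"
print(strip_accents(sample))
print(lost_in_ascii(sample))  # sorted by code point: ß, ł, 亰, 北
```

For these characters, a transliteration table such as the one in `unidecode` is needed if they should map to ASCII rather than disappear.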