Last active
August 7, 2020 06:06
iconv has no way to stop ? from overwriting illegal characters, so I want to keep invalid characters but still transliterate and remove diacritics.
--most extreme (makes everything lowercase)
diff non-ascii (bat non-ascii | sed 's/[[=a=]]/a/g; s/[[=b=]]/b/g; s/[[=c=]]/c/g; s/[[=d=]]/d/g; s/[[=e=]]/e/g; s/[[=f=]]/f/g; s/[[=g=]]/g/g; s/[[=h=]]/h/g; s/[[=i=]]/i/g; s/[[=j=]]/j/g; s/[[=k=]]/k/g; s/[[=l=]]/l/g; s/[[=m=]]/m/g; s/[[=n=]]/n/g; s/[[=o=]]/o/g; s/[[=p=]]/p/g; s/[[=q=]]/q/g; s/[[=r=]]/r/g; s/[[=s=]]/s/g; s/[[=t=]]/t/g; s/[[=u=]]/u/g; s/[[=v=]]/v/g; s/[[=w=]]/w/g; s/[[=x=]]/x/g; s/[[=y=]]/y/g; s/[[=z=]]/z/g' | psub) | wc -l
96514
-- removes pretty much everything but has some weird stuff:
pip3 install --user unidecode
diff non-ascii (bat non-ascii | unidecode | psub) | wc -l
10924
--only removes diacriticals (this one is the most technically correct)
perl -CS -MUnicode::Normalize -pne '$_=NFKD($_);s/\p{InDiacriticals}//g'
diff non-ascii (bat non-ascii | perl -CS -MUnicode::Normalize -pne '$_=NFKD($_);s/\p{InDiacriticals}//g' | psub) | wc -l
9198
diff non-ascii <(bat non-ascii | recode -f UTF-8..ASCII) | wc -l
4930
diff non-ascii <(bat non-ascii | sed 'y/āáǎàçēéěèīíǐìōóǒòūúǔùǖǘǚǜüĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛÜ/aaaaceeeeiiiioooouuuuuuuuuAAAAEEEEIIIIOOOOUUUUUUUUU/') | wc -l
2638
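Note that the numbers above are `diff | wc -l` totals. In diff's default output they include hunk headers and `---` separators as well as the `<` and `>` lines, so they are larger than the number of lines each filter actually changed. If a per-line count is wanted, something like the following might do. It is a sketch for POSIX sh/bash, and `count_changed_lines` is a hypothetical helper name, not an existing command.

```shell
# count_changed_lines FILE CMD [ARGS...]
# Hypothetical helper: runs CMD as a filter over FILE and prints how many
# lines of FILE come out different (changed or dropped).
count_changed_lines() {
  file=$1
  shift
  # "diff FILE -" compares FILE against the filter output on stdin.
  # In diff's default format, every line taken from FILE starts with "<",
  # so counting those gives the number of affected input lines.
  "$@" < "$file" | diff "$file" - | grep -c '^<'
}

# Example use:
#   count_changed_lines non-ascii recode -f UTF-8..ASCII
```

Counting only the `<` lines keeps the figure comparable across filters, since it does not depend on how diff happens to group neighbouring changes into hunks.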