@chapmanjacobd
Last active August 7, 2020 06:06
iconv doesn't have a way to prevent ? from overwriting illegal characters, so I want to keep invalid characters but still transliterate and remove diacritics.
--most extreme (makes everything lowercase)
diff non-ascii (bat non-ascii | sed 's/[[=a=]]/a/g; s/[[=b=]]/b/g; s/[[=c=]]/c/g; s/[[=d=]]/d/g; s/[[=e=]]/e/g; s/[[=f=]]/f/g; s/[[=g=]]/g/g; s/[[=h=]]/h/g; s/[[=i=]]/i/g; s/[[=j=]]/j/g; s/[[=k=]]/k/g; s/[[=l=]]/l/g; s/[[=m=]]/m/g; s/[[=n=]]/n/g; s/[[=o=]]/o/g; s/[[=p=]]/p/g; s/[[=q=]]/q/g; s/[[=r=]]/r/g; s/[[=s=]]/s/g; s/[[=t=]]/t/g; s/[[=u=]]/u/g; s/[[=v=]]/v/g; s/[[=w=]]/w/g; s/[[=x=]]/x/g; s/[[=y=]]/y/g; s/[[=z=]]/z/g' | psub) | wc -l
96514
--transliterates pretty much everything to ASCII but has some weird output:
pip3 install --user unidecode
diff non-ascii (bat non-ascii | unidecode | psub) | wc -l
10924
--only removes diacriticals (this one is the most technically correct)
perl -CS -MUnicode::Normalize -pne '$_=NFKD($_);s/\p{InDiacriticals}//g'
diff non-ascii (bat non-ascii | perl -CS -MUnicode::Normalize -pne '$_=NFKD($_);s/\p{InDiacriticals}//g' | psub) | wc -l
9198
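A minimal Python sketch of the same idea as the perl one-liner above, assuming that `\p{InDiacriticals}` corresponds to the Combining Diacritical Marks block (U+0300 to U+036F). The function name `strip_diacritics` is hypothetical and not part of the original notes. It decomposes with NFKD, drops only code points in that block, and leaves every other non-ASCII character in place instead of replacing it with `?`.

```python
import unicodedata

# Assumption: mirror Perl's \p{InDiacriticals}, taken here to be the
# Combining Diacritical Marks block, U+0300..U+036F.
DIACRITICS_START = 0x0300
DIACRITICS_END = 0x036F


def strip_diacritics(text: str) -> str:
    """Decompose with NFKD, then drop combining marks in U+0300..U+036F.

    Characters without a decomposition (for example CJK) pass through
    unchanged, so nothing is replaced with '?'.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(
        ch for ch in decomposed
        if not DIACRITICS_START <= ord(ch) <= DIACRITICS_END
    )
```

For example, `strip_diacritics("nǐ hǎo")` gives `"ni hao"`. Because NFKD is a compatibility decomposition, it also rewrites things like the `ﬁ` ligature to `fi`, which may or may not be wanted.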
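--recode, forcing conversion to ASCII (bash process substitution from here on; the commands above use fish's psub)
diff non-ascii <(bat non-ascii | recode -f UTF-8..ASCII) | wc -l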
4930
--sed y: maps only the listed accented vowels and ç, one character to one character
diff non-ascii <(bat non-ascii | sed 'y/āáǎàçēéěèīíǐìōóǒòūúǔùǖǘǚǜüĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛÜ/aaaaceeeeiiiioooouuuuuuuuuAAAAEEEEIIIIOOOOUUUUUUUUU/') | wc -l
2638
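The `sed 'y/…/…/'` mapping above can be sketched in Python with `str.maketrans`, which builds the same one-to-one character table. The names `SRC`, `DST`, `TABLE`, and `map_listed` are hypothetical. The NFC normalization of `SRC` is an added safeguard, on the assumption that a copy-pasted string might arrive in decomposed form and break the one-to-one alignment.

```python
import unicodedata

# Same character lists as the sed y/// command. SRC is normalized to NFC so
# every accented letter is a single code point and lines up with DST.
SRC = unicodedata.normalize(
    "NFC",
    "āáǎàçēéěèīíǐìōóǒòūúǔùǖǘǚǜüĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛÜ",
)
DST = "aaaaceeeeiiiioooouuuuuuuuuAAAAEEEEIIIIOOOOUUUUUUUUU"

# str.maketrans raises ValueError if the two strings differ in length.
TABLE = str.maketrans(SRC, DST)


def map_listed(text: str) -> str:
    """Replace only the listed characters; everything else is untouched."""
    return text.translate(TABLE)
```

Anything not in `SRC`, such as `ñ` or CJK text, passes through unchanged, which is consistent with this variant producing the smallest diff.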