@kylebgorman
Created March 19, 2020 21:34
Converts English-like UTF-8 text to ASCII
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize;
use open ":encoding(utf8)";
binmode STDIN, ":encoding(utf8)";
binmode STDOUT, ":encoding(ascii)";
binmode STDERR, ":encoding(utf8)";
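while (<>) {
    # Replaces en and em dashes with "--" and curly quotes with
    # Penn Treebank-style ASCII quotes.
    s/\x{2013}/--/g;
    s/\x{2014}/--/g;
    s/\x{2018}/`/g;
    s/\x{2019}/'/g;
    s/\x{201c}/``/g;
    s/\x{201d}/''/g;
    # Applies compatibility decomposition (NFKD) so that accented letters
    # become a base letter followed by combining marks.
    $_ = NFKD($_);
    # Removes the combining marks, leaving the base letters.
    s/\p{NonspacingMark}//g;
    # Skips any line that still contains non-ASCII characters.
    next if (/[^[:ascii:]]/);
    # Penn Treebank-ifies the remaining straight double quotes.
    s/^\"/``/;
    s/([ (\[{<])"/$1``/g;
    s/"/''/g;
    print;
}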
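
To check the effect of these substitutions on a single string, the same steps can be wrapped in a function. This is only a sketch: `to_ascii` is a hypothetical helper name that does not appear in the script above, and returning nothing for a line that is still non-ASCII is an assumption made here to mirror the script's `next`.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize;

# Hypothetical helper (not part of the original script): applies the same
# substitutions to one string and returns the result, or an empty return
# (undef in scalar context) where the script would skip the line.
sub to_ascii {
    my ($line) = @_;
    # Dashes and curly quotes.
    $line =~ s/[\x{2013}\x{2014}]/--/g;
    $line =~ s/\x{2018}/`/g;
    $line =~ s/\x{2019}/'/g;
    $line =~ s/\x{201c}/``/g;
    $line =~ s/\x{201d}/''/g;
    # Decomposes accented letters, then drops the combining marks.
    $line = NFKD($line);
    $line =~ s/\p{NonspacingMark}//g;
    # Gives up on anything that is still not ASCII.
    return if $line =~ /[^[:ascii:]]/;
    # Penn Treebank-style handling of straight double quotes.
    $line =~ s/^"/``/;
    $line =~ s/([ (\[{<])"/$1``/g;
    $line =~ s/"/''/g;
    return $line;
}

# Sample input written with escapes: curly quotes, e-acute, em dash, i-diaeresis.
my $out = to_ascii("\x{201c}Caf\x{e9}\x{201d} \x{2014} na\x{ef}ve");
print "$out\n";    # prints: ``Cafe'' -- naive

# CJK characters have no ASCII decomposition, so the line is rejected.
my $skipped = to_ascii("\x{4e2d}\x{6587}");
print defined $skipped ? "kept\n" : "skipped\n";    # prints: skipped
```

Returning a value rather than printing inside the loop makes each step easy to test, at the cost of departing from the original one-pass filter design.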