@takdavid
Created December 14, 2016 15:54
#!/usr/bin/perl
use Unicode::UCD 'charinfo';
use Unicode::Normalize 'NFD', 'NFC', 'isComp2nd', 'isNonStDecomp';
use charnames ':full';
use Encode;
use HTML::Entities;
#use Getopt::Std;
#our ($opt_f, $opt_r);
#getopts('fr');
binmode(STDOUT, ':utf8');

sub usage
{
    print <<eof
Usage:
./unicount.pl <infile >outfile
Ordered by counts:
./unicount.pl <infile | sort -r -n >outfile
eof
}

my %count;        # decomposed letter => number of occurrences
my $composition;  # letter being assembled: a starter plus its combining characters
my $line;
my @charcodes;
while ($line = <>)
{
    # decode the UTF-8 bytes and any HTML entities, then decompose canonically
    $line = Encode::decode("UTF-8", $line);
    $line = decode_entities($line);
    $line = NFD($line);
    @charcodes = unpack("U0U*", $line);
    for ($i=0; $i<=$#charcodes; $i++)
    {
        $charcode = $charcodes[$i];
        $chr = substr $line, $i, 1;
        # combining characters together with the starter
        if (isComp2nd($charcode) || isNonStDecomp($charcode))
        {
            $composition .= $chr;
        }
        else
        {
            # a new starter: count the finished letter and begin the next one
            $count{$composition}++;
            $composition = $chr;
        }
    }
    # count the last letter of the line
    $count{$composition}++;
}
# drop the empty key created before the first starter was seen
delete $count{''};

# one output row per letter: count, letter, hex code points, character names
while (($composition, $cnt) = each(%count))
{
    @charcodes = ();
    @charnames = ();
    foreach $chr (split(//, $composition))
    {
        $charcode = unpack("U0U*", $chr);
        push @charcodes, sprintf('%04X', $charcode);
        push @charnames, charnames::viacode($charcode);
    }
    # do not print raw whitespace control characters in the letter column
    $composition = '' if ($composition =~ /[\r\t\n]/);
    printf "%d\t%s\t%s\t%s\n", $cnt, $composition, join(' + ', @charcodes), join(' + ', @charnames);
}

takdavid commented Dec 14, 2016

This script counts the letters in its standard input, where a letter means a Unicode base character together with its combining characters (diacritics etc.). Characters are first decomposed by canonical equivalence (NFD), which also puts combining characters into a canonical order, so letters that look the same but are encoded differently are counted as the same letter.
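The effect of the NFD step can be seen in isolation. The following is a minimal sketch, not part of the script: the variable names are made up for illustration, and it only relies on `NFD` from `Unicode::Normalize`, which the script already uses.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize 'NFD';

# Two encodings of the same visible letter (illustrative variable names):
my $precomposed = "\x{00E1}";    # LATIN SMALL LETTER A WITH ACUTE
my $decomposed  = "a\x{0301}";   # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT

# As raw strings they differ...
print "raw strings equal: ", ($precomposed eq $decomposed ? "yes" : "no"), "\n";

# ...but after canonical decomposition they are the same code point sequence.
print "NFD strings equal: ", (NFD($precomposed) eq NFD($decomposed) ? "yes" : "no"), "\n";

# NFD also reorders stacked combining marks canonically, so the typing order
# of a dot above (U+0307) and a dot below (U+0323) no longer matters.
my $one = NFD("q\x{0307}\x{0323}");
my $two = NFD("q\x{0323}\x{0307}");
print "code points: ", join(' ', map { sprintf '%04X', ord($_) } split //, $one), "\n";
print "reordered equal: ", ($one eq $two ? "yes" : "no"), "\n";
```

Without this step, the two spellings of the same letter would end up under different hash keys and be counted separately.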

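The input is expected to be UTF-8; HTML entities in it are decoded before counting. The output is also UTF-8, unsorted, and tab-separated. The columns are: count; normalized (decomposed) letter; hex code points of its characters joined by +; standard names of those characters joined by +. If a letter contains a tab, carriage return, or line feed, the letter column is left empty.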
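Because the rows come out in hash order, any ordering has to be applied afterwards, for example with `sort -r -n`. As an alternative, the sorting could be done inside Perl. The sketch below is an assumption about how that might look, not something the script does: the `%count` contents are hypothetical sample data shaped like the script's hash, and the tie-break rule is an arbitrary choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data shaped like the script's %count hash
# (decomposed letter => number of occurrences).
my %count = (
    "r"         => 4,
    "t"         => 2,
    "u\x{030B}" => 1,
    "a\x{0301}" => 1,
);

# Sort by descending count; break ties by comparing the keys as strings.
my @ordered = sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count;

binmode(STDOUT, ':utf8');
for my $letter (@ordered) {
    # Hex code points of every character in the letter, joined by " + ".
    my @hex = map { sprintf '%04X', ord($_) } split //, $letter;
    printf "%d\t%s\t%s\n", $count{$letter}, $letter, join(' + ', @hex);
}
```

Sorting in Perl makes the tie order independent of the locale that `sort` runs under, at the cost of holding all keys in a second list.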

For example, the command:

echo 'árvítűrő tükörfúrógép' | perl unicount.pl | sort -r -n

would output:

4	r	0072	LATIN SMALL LETTER R
2	t	0074	LATIN SMALL LETTER T
1	v	0076	LATIN SMALL LETTER V
1	ű	0075 + 030B	LATIN SMALL LETTER U + COMBINING DOUBLE ACUTE ACCENT
1	ü	0075 + 0308	LATIN SMALL LETTER U + COMBINING DIAERESIS
1	ú	0075 + 0301	LATIN SMALL LETTER U + COMBINING ACUTE ACCENT
1	p	0070	LATIN SMALL LETTER P
1	ő	006F + 030B	LATIN SMALL LETTER O + COMBINING DOUBLE ACUTE ACCENT
1	ö	006F + 0308	LATIN SMALL LETTER O + COMBINING DIAERESIS
1	ó	006F + 0301	LATIN SMALL LETTER O + COMBINING ACUTE ACCENT
1	k	006B	LATIN SMALL LETTER K
1	í	0069 + 0301	LATIN SMALL LETTER I + COMBINING ACUTE ACCENT
1	g	0067	LATIN SMALL LETTER G
1	f	0066	LATIN SMALL LETTER F
1	é	0065 + 0301	LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
1	á	0061 + 0301	LATIN SMALL LETTER A + COMBINING ACUTE ACCENT
1	 	0020	SPACE
1		000A	LINE FEED
