@jimregan
Created February 25, 2012 11:39
Generates LanguageTool XML rules for words that are sometimes written as separate words but ought to be written as one. LGPL. (From this thread: https://sourceforge.net/mailarchive/forum.php?thread_name=4F2E7385.5070404%40wp.pl&forum_name=languagetool-devel)
<rule id="NA_WZAJEM" name="„na wzajem” (nawzajem)">
<pattern>
<token>na</token>
<token>wzajem</token>
</pattern>
<message>Ten wyraz zwykle pisze się łącznie: <suggestion>\1\2</suggestion>.</message>
<short>Prawdopodobna literówka</short>
<example correction="nawzajem" type="incorrect">Oni kochają się <marker>na wzajem</marker>.</example>
<example type="correct">Oni kochają się nawzajem.</example>
</rule>
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
use Getopt::Long;
#uncomment to use unidecode
#use Text::Unidecode;
my $lang = 'pl';
my $input = '';
my $output = '';
my $encoding = 'UTF-8';
my $fhin = *STDIN;
my $fhout = *STDOUT;
my $res = GetOptions(
'lang=s' => \$lang,
'in=s' => \$input,
'out=s' => \$output,
'enc=s' => \$encoding
);
if ($input ne '') {
open ($fhin, '<', $input) or die "Cannot open $input for reading: $!\n";
}
if ($output ne '') {
open ($fhout, '>', $output) or die "Cannot open $output for writing: $!\n";
}
binmode($fhin, ":encoding($encoding)");
binmode($fhout, ":encoding(UTF-8)");
my %message = (
'pl' => 'Ten wyraz zwykle pisze się łącznie:',
'en' => 'Did you mean',
);
my %short = (
'pl' => 'Prawdopodobna literówka',
'en' => 'Possible typo',
);
while (<$fhin>) {
chomp;
my ($incorrect, $example) = split/\t/, $_;
# We could probably check the example for the incorrect form, but it
# seems better (or, at least, easier) to just crap out on the line.
if ($incorrect !~ / /) {
# Report on STDERR so the message never ends up in the generated XML
print STDERR "Error: no spaces in incorrect form: $_\n";
next;
}
# The correct form is the incorrect minus spaces
my $correct = $incorrect;
$correct =~ s/ //g;
# Uncomment for unidecode
# my $id = uc(unidecode($incorrect));
# without unidecode, this should be ok for Polish
# comment these two lines to use unidecode
my $id = uc($incorrect);
$id =~ tr/ĄĆĘŁŃÓŚŻŹ/ACELNOSZZ/;
# Do this anyway
$id =~ s/ /_/g;
# Build both examples first, so that a bad input line is skipped before
# any part of the rule has been written.
my $outincor = $example;
my $outcor = $example;
if ($example =~ /(\Q$incorrect\E)/i) {
my $m = $1;
$outincor =~ s#\Q$m\E#<marker>$m</marker>#;
$outcor =~ s#\Q$m\E#$correct#;
} elsif ($example =~ /(\Q$correct\E)/i) {
my $m = $1;
$outincor =~ s#\Q$m\E#<marker>$incorrect</marker>#;
} else {
print STDERR "Error: example contains neither correct nor incorrect phrase: $example\n";
next;
}
print $fhout " <rule id=\"$id\" name=\"„$incorrect” ($correct)\">\n";
my @parts = split/ /, $incorrect;
print $fhout " <pattern>\n";
for my $part (@parts) {
print $fhout " <token>$part</token>\n";
}
print $fhout " </pattern>\n";
print $fhout " <message>$message{$lang} <suggestion>";
# One back-reference (\1, \2, ...) per token, joined without spaces
for my $i (1 .. @parts) {
print $fhout "\\$i";
}
print $fhout "</suggestion>.</message>\n";
print $fhout " <short>$short{$lang}</short>\n";
print $fhout " <example correction=\"$correct\" type=\"incorrect\">${outincor}.</example>\n";
print $fhout " <example type=\"correct\">${outcor}.</example>\n";
print $fhout " </rule>\n";
}
na wzajem Oni kochają się na wzajem
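
The script reads one entry per line: the incorrect, space-separated form, then a tab, then an example sentence. As a rough sketch of how such a line maps onto the rule id, the joined correction and the `\1\2` suggestion, the snippet below mirrors the script's logic in isolation. The helper name `derive_rule_parts` is hypothetical and does not appear in the script itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;

# Hypothetical helper (not part of the original script): derive the rule id,
# the joined correction and the suggestion back-references from one
# tab-separated input line.
sub derive_rule_parts {
    my ($line) = @_;
    my ($incorrect, $example) = split /\t/, $line, 2;

    # Skip lines without an example or without a space in the incorrect form
    return unless defined $example && $incorrect =~ / /;

    # The correct form is the incorrect form minus spaces
    (my $correct = $incorrect) =~ s/ //g;

    # Upper-case, drop Polish diacritics, turn spaces into underscores
    my $id = uc $incorrect;
    $id =~ tr/ĄĆĘŁŃÓŚŻŹ/ACELNOSZZ/;
    $id =~ tr/ /_/;

    # One back-reference per token, concatenated: \1\2...
    my @parts = split / /, $incorrect;
    my $suggestion = '';
    $suggestion .= "\\$_" for 1 .. scalar @parts;

    return ($id, $correct, $suggestion, $example);
}

binmode(STDOUT, ':encoding(UTF-8)');
my ($id, $correct, $suggestion) =
    derive_rule_parts("na wzajem\tOni kochają się na wzajem");
print "$id $correct $suggestion\n";   # prints: NA_WZAJEM nawzajem \1\2
```

The real script is driven by its `--lang`, `--in`, `--out` and `--enc` options, and falls back to STDIN and STDOUT when `--in` or `--out` is omitted.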
@jimregan

I should note that LanguageTool already has a rule for this example; it just happens to stick out in my memory. (I spent a long time trying to find the word, only to eventually discover that it should have been written as one word, and that form was in my dictionary.)
