@hoehrmann
Created April 21, 2012 20:59
Extract German noun inflections from Wiktionary (quick and dirty)
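#!perl -w
use strict;
use warnings;
# The pattern below contains a literal "Ü", so the source must be read as
# UTF-8. `use utf8` replaces the deprecated `use encoding 'utf-8'`.
use utf8;
use MediaWiki::DumpFile::Pages;
use YAML::XS;

my $pages = MediaWiki::DumpFile::Pages
  ->new('dewiktionary-20120416-pages-meta-current.xml');

# Matches the "Deutsch Substantiv Übersicht" template and captures the eight
# case/number forms by name (NS = Nominativ Singular, NP = Nominativ Plural, ...).
# Any leading parameters before "Nominativ Singular" are skipped.
my $re = qr/
  \{\{Deutsch\s+Substantiv\s+Übersicht\s+
  (\|.*?\s+)*
  \|Nominativ\s+Singular\s*=\s*(?<NS>.*?)\s+
  \|Nominativ\s+Plural\s*=\s*(?<NP>.*?)\s+
  \|Genitiv\s+Singular\s*=\s*(?<GS>.*?)\s+
  \|Genitiv\s+Plural\s*=\s*(?<GP>.*?)\s+
  \|Dativ\s+Singular\s*=\s*(?<DS>.*?)\s+
  \|Dativ\s+Plural\s*=\s*(?<DP>.*?)\s+
  \|Akkusativ\s+Singular\s*=\s*(?<AS>.*?)\s+
  \|Akkusativ\s+Plural\s*=\s*(?<AP>.*?)\s+
  \}\}/x;

while (defined(my $page = $pages->next)) {
  my $text = $page->revision->text;
  if ($text =~ $re) {
    # Emit one YAML document per matching page, keyed by capture name.
    print Dump \%+;
  } elsif (index($text, 'Deutsch Substantiv Übersicht') >= 0) {
    # The template is present but the pattern did not match;
    # uncomment the next line to inspect such a page.
    # die $text;
  }
}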