@amir
Last active July 26, 2022 07:15
#!/usr/bin/perl
use open qw(:std :utf8);
# Program to filter Wikipedia XML dumps to "clean" text consisting only of lowercase
# letters (a-z, converted from A-Z), and spaces (never consecutive).
# All other characters are converted to spaces. Only text which normally appears
# in the web browser is displayed. Tables are removed. Image captions are
# preserved. Links are converted to normal text. Digits are spelled out.
# Written by Matt Mahoney, June 10, 2006. This program is released to the public domain.
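# Modified by Amir Mohammad Saied, Feb 23, 2013, for UTF-8 and "fa" wiki support.
# Note: in this modified version, Latin letters, ASCII digits, and ASCII punctuation
# are replaced by spaces, and Persian digits (U+06F0 to U+06F9) are spelled out in
# Persian, so the output is Persian text rather than the a-z text described above.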
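# Usage sketch (the script and file names below are placeholders, not part of the
# original gist). Input is read from the files named on the command line, or from
# STDIN if none are given; the filtered text is written to STDOUT:
#   perl wikifil-fa.pl fawiki-pages-articles.xml > fawiki.txt
#   bzcat fawiki-pages-articles.xml.bz2 | perl wikifil-fa.pl > fawiki.txt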
binmode(STDOUT, ":utf8");
$/=">"; # input record separator
while (<>) {
if (/<text /) {$text=1;} # remove all but between <text> ... </text>
if (/#redirect/i) {$text=0;} # remove #REDIRECT
if ($text) {
# Remove any text not normally visible
if (/<\/text>/) {$text=0;}
s/<.*>//; # remove xml tags
s/&amp;/&/g; # decode XML character entities
s/&lt;/</g;
s/&gt;/>/g;
s/<ref[^<]*<\/ref>//g; # remove references <ref...> ... </ref>
s/<[^>]*>//g; # remove xhtml tags
s/\[https?:[^] ]*/[/g; # remove normal url (http or https), preserve visible text
s/\|thumb//ig; # remove images links, preserve caption
s/\|left//ig;
s/\|right//ig;
s/\|\d+px//ig;
s/\[\[image:[^\[\]]*\|//ig;
s/\[\[category:([^|\]]*)[^]]*\]\]/[[$1]]/ig; # show categories without markup
s/\[\[[a-z\-]*:[^\]]*\]\]//g; # remove links to other languages
s/\[\[[^\|\]]*\|/[[/g; # remove wiki url, preserve visible text
s/{{[^}]*}}//g; # remove {{icons}} and {tables}
s/{[^}]*}//g;
s/\[//g; # remove [ and ]
s/\]//g;
s/&[^;]*;/ /g; # remove any remaining character entities
# convert to lowercase letters and spaces, spell digits
$_=" $_ ";
tr/A-Z/a-z/;
s/\x{6F0}/ \x{635}\x{641}\x{631} /g;
s/\x{6F1}/ \x{6CC}\x{6A9} /g;
s/\x{6F2}/ \x{62F}\x{648} /g;
s/\x{6F3}/ \x{633}\x{647} /g;
s/\x{6F4}/ \x{686}\x{647}\x{627}\x{631} /g;
s/\x{6F5}/ \x{67E}\x{646}\x{62C} /g;
s/\x{6F6}/ \x{634}\x{634} /g;
s/\x{6F7}/ \x{647}\x{641}\x{62A} /g;
s/\x{6F8}/ \x{647}\x{634}\x{62A} /g;
s/\x{6F9}/ \x{646}\x{647} /g;
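# Replace Latin letters, ASCII digits and punctuation (the range *-[ spans 0x2A to 0x5B),
# guillemets, and Arabic/Persian punctuation with spaces. The guillemets are written as
# escapes so the line does not depend on how the source file is decoded.
tr/a-z*-[]%()#'_}{\x{BB}\x{AB}$\\^|&!\x{61F}\x{60C}\x{66C}\x{61B}\x{2014}\x{2013}\x{66A}\x{66B}/ /;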
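# tr does not take regex quantifiers, so "\n+" only matched the literal characters
# newline and "+". Map newlines, tabs, and spaces to a space and squeeze runs (/s)
# so that spaces are never consecutive.
tr/\n\t / /s;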
chop;
print $_;
}
}