@jimregan
jimregan / eg.xml
Created August 17, 2014 21:07
Example
<mwdictionary>
<mwpardefs>
<mwpardef n="adj+adj=adj">
<e>
<p>
<l>
<w><lemma n="1"/><s n="adj"/><s n="f"/><s n="sg"/></w>
<b/>
<w><lemma n="2"/><s n="adj"/><s n="f"/><s n="sg"/></w>
</l>
jimregan / rozdzial1.xml
Last active August 29, 2015 14:07
Pan Tadeusz, tagged with WCRF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chunkList SYSTEM "ccl.dtd">
<chunkList>
<chunk id="l1" type="p">
<sentence id="s1">
<tok>
<orth>Litwo</orth>
<lex disamb="1"><base>Litwa</base><ctag>subst:sg:voc:f</ctag></lex>
</tok>
<ns/>
jimregan / bigram-counts.pl
Last active August 29, 2015 14:15
Tesseract gle_uncial bits
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
binmode STDOUT, ":utf8";
binmode STDIN, ":utf8";
my $last = '';
jimregan / expected-output.xml
Created February 25, 2012 11:39
Generates LanguageTool XML rules for words that are sometimes written as two words but should be written as one. LGPL. (From this thread: https://sourceforge.net/mailarchive/forum.php?thread_name=4F2E7385.5070404%40wp.pl&forum_name=languagetool-devel)
<rule id="NA_WZAJEM" name="„na wzajem” (nawzajem)">
<pattern>
<token>na</token>
<token>wzajem</token>
</pattern>
<message>Ten wyraz zwykle pisze się łącznie: <suggestion>\1\2</suggestion>.</message>
<short>Prawdopodobna literówka</short>
<example correction="nawzajem" type="incorrect">Oni kochają się <marker>na wzajem</marker>.</example>
<example type="correct">Oni kochają się nawzajem.</example>
</rule>
jimregan / apertium_aprilfirst.cc
Created February 25, 2012 11:20
April 1st filter for Apertium
/*
* Copyright (C) 2005 Universitat d'Alacant / Universidad de Alicante
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License as
* published by the Free Software Foundation; either version 2 of the
* License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of
jimregan / rect64tomagick.pl
Created February 25, 2012 12:21
Perl subs for getting an ImageMagick region from a Picasa rect64
sub padrect {
my $rin = shift;
my $rout;
if (length($rin) == 16) {
return $rin;
} elsif (length($rin) > 16) {
return $rout; # can't process, return undef
} else {
my $diff = (16 - length($rin));
for (my $i = 0; $i < $diff; $i++) {
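
The excerpt above is cut off, so the following is only a rough sketch of the general idea, written in Python rather than Perl. It assumes that a Picasa rect64 is 16 hex digits encoding left, top, right, and bottom as 16-bit fractions of the image size, and that short values have lost leading zeros. The function name `rect64_to_geometry` and its `width`/`height` parameters are hypothetical and do not come from the gist.

```python
def rect64_to_geometry(rect64: str, width: int, height: int) -> str:
    """Turn a rect64 hex string into an ImageMagick-style WxH+X+Y geometry.

    Assumption: the four 16-bit fields are left, top, right, bottom,
    each expressed as a fraction of 65535 of the image dimension.
    """
    if len(rect64) > 16:
        # Mirrors the excerpt's "can't process" branch.
        raise ValueError("rect64 is longer than 16 hex digits")
    # Assumption: missing digits are leading zeros.
    padded = rect64.rjust(16, "0")
    left, top, right, bottom = (
        int(padded[i:i + 4], 16) / 65535 for i in range(0, 16, 4)
    )
    x = round(left * width)
    y = round(top * height)
    w = round(right * width) - x
    h = round(bottom * height) - y
    return f"{w}x{h}+{x}+{y}"
```

Under these assumptions, `rect64_to_geometry("80008000", 1000, 500)` describes the top-left quarter of a 1000x500 image, and the resulting string could be passed to an ImageMagick crop or region option.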
jimregan / spotlight.diff
Created February 26, 2012 01:50
Diff against DBpedia Spotlight to get it to run for Polish.
Index: demo/pom.xml
===================================================================
--- demo/pom.xml (revision 367)
+++ demo/pom.xml (working copy)
@@ -24,7 +24,8 @@
<parent>
<artifactId>spotlight</artifactId>
<groupId>org.dbpedia.spotlight</groupId>
- <version>${dbpedia.spotlight.version}</version>
+ <version>0.6</version>
jimregan / blacklistedURIPatterns.pl.txt
Created February 26, 2012 15:01
Blacklisted URI patterns for pl.wikipedia
^Lista_.+
^Wikiprojekt:.+
^Portal:.+
.+\(ujednoznacznienie\)$
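
As a rough illustration of how such a pattern list might be applied, here is a minimal Python sketch that checks page titles against it. It assumes the parentheses in the last pattern are meant literally (so they are escaped here) and that any single match excludes a title. The helper name `is_blacklisted` is hypothetical, and how DBpedia Spotlight actually consumes this file is not shown in the listing.

```python
import re

# Patterns from the list above; the parentheses in the last one are
# escaped on the assumption that they should match literal "(" and ")".
PATTERNS = [
    r"^Lista_.+",
    r"^Wikiprojekt:.+",
    r"^Portal:.+",
    r".+\(ujednoznacznienie\)$",
]
COMPILED = [re.compile(p) for p in PATTERNS]


def is_blacklisted(title: str) -> bool:
    """Return True if the title matches at least one blacklist pattern."""
    return any(rx.search(title) for rx in COMPILED)


if __name__ == "__main__":
    for title in ["Lista_miast_w_Polsce", "Zamek_(ujednoznacznienie)", "Warszawa"]:
        print(title, is_blacklisted(title))
```

In this sketch, list pages and disambiguation pages are filtered out, while an ordinary article title such as `Warszawa` passes through.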
jimregan / hallucination.ttl
Created March 1, 2012 15:42
RDFa data mirage
@prefix og: <http://ogp.me/ns#> .
@prefix fb: <http://ogp.me/ns/fb#> .
@prefix zimbiofb: <http://ogp.me/ns/fb/zimbiofb#> .
<http://www.zimbio.com/photos/Aiste+Paskeviciute/Luck+Attitude+Launch+Party+3/XFhu3CPOWRW>
fb:app_id "137068566357971" ;
og:site_name "Zimbio";
og:type zimbiofb:photostream ;
og:url <http://www.zimbio.com/photos/Aiste+Paskeviciute/Luck+Attitude+Launch+Party+3/XFhu3CPOWRW> ;
og:title "Aiste Paskeviciute Photostream" ;
#!/usr/bin/perl
use warnings;
use strict;
use Encode::Escape;
use utf8;
binmode STDIN, ":utf8";
binmode STDERR, ":utf8";