Neven Jovanović nevenjovanovic

@nevenjovanovic
nevenjovanovic / greek-not-recognized-morpheus.md
Last active December 16, 2018 19:46
A list of Greek words not recognized by the Morpheus parser (online and in CLTK)

21 Greek words not recognized by the Morpheus parser in a prose set of 3,863 word forms

Data: 60 brief passages of Ancient Greek prose, from Herodotus to Plotinus, with 9,278 words in total and 3,863 distinct word forms.

The texts (in plain text format) are published at [https://bitbucket.org/nevenjovanovic/hellenismos-hypostates/src/master/pos_txt/], in the directories p1, p2, and p3.

The tokenized and cleaned-up XML version (words in w, punctuation in pc, names of source files as @id; combined diacritics and letters replaced with precomposed characters where necessary) is in the same repository: [https://bitbucket.org/nevenjovanovic/hellenismos-hypostates/src/master/pos_txt/tokenizedp/grctxt.xml].
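
As a rough illustration of how totals like those above could be recomputed from a tokenized file of this kind, here is a minimal Python sketch. It relies only on what is stated here (words in `w`, punctuation in `pc`, precomposed characters). The inline sample, the element carrying `id`, and the function names `local_name` and `count_word_forms` are assumptions made for the example, not a description of the actual layout of `grctxt.xml`.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical miniature sample; the real file's root element and
# nesting are not known here.
SAMPLE = """<texts>
  <text id="sample1"><w>καὶ</w> <w>τὸν</w> <w>λόγον</w><pc>,</pc> <w>καὶ</w> <w>τὸ</w> <w>ἔργον</w><pc>.</pc></text>
</texts>"""


def local_name(tag):
    """Strip a '{namespace}' prefix, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def count_word_forms(xml_text):
    """Return (total words, distinct word forms) found in <w> elements."""
    root = ET.fromstring(xml_text)
    words = [
        # NFC turns letter + combining diacritic into the precomposed form,
        # so both spellings of a word count as one form.
        unicodedata.normalize("NFC", (el.text or "").strip())
        for el in root.iter()
        if local_name(el.tag) == "w"
    ]
    words = [w for w in words if w]  # ignore empty <w/> elements
    return len(words), len(set(words))


total, distinct = count_word_forms(SAMPLE)
print(total, distinct)
```

Matching on the local element name keeps the sketch usable whether or not the file declares a namespace; punctuation in `pc` is skipped simply because only `w` elements are collected.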

The words were sent to the online Morpheus parser at [http://morph.perseids.org/analysis/word?lang=grc&engine=morpheusgrc&word=], using the XQuery script [https://bitbucket.org/nevenjovanovic/hellenismos-hypostates/src/master/scripts/ParsePerseusGetHeadwordFromDB.xq].
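
The linked XQuery script is what was actually used. Purely as an illustration of the request pattern, the following Python sketch shows how a word form could be appended to the address given above. The names `MORPHEUS_BASE` and `morpheus_url` are hypothetical, the NFC normalization step is an assumption, and the sketch builds the URL without sending a request or assuming anything about the response format.

```python
import unicodedata
from urllib.parse import quote

# Base address as given in the text; the word form is appended to it.
MORPHEUS_BASE = (
    "http://morph.perseids.org/analysis/word"
    "?lang=grc&engine=morpheusgrc&word="
)


def morpheus_url(word):
    """Build a request URL for one word form (illustrative helper)."""
    # Assumption: send precomposed (NFC) characters, matching the
    # cleaned-up tokenized text.
    normalized = unicodedata.normalize("NFC", word)
    # Percent-encode the UTF-8 bytes so the URL is plain ASCII.
    return MORPHEUS_BASE + quote(normalized, safe="")


for form in ["λόγος", "ἔργον"]:
    print(morpheus_url(form))
```

A word that the parser does not recognize would then have to be detected from whatever the service returns, which this sketch deliberately leaves out.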

nevenjovanovic / pannonius.md
Last active November 13, 2018 18:16
A selection of bibliography on Janus Pannonius, with an emphasis on Hungarian works

Selected bibliography on Janus Pannonius

nevenjovanovic / 16croala-testforunicodeblocks.xsl
Last active May 1, 2016 18:11
An XSL stylesheet that tests for the presence of characters from a given Unicode block (in this case, Cyrillic) and reports a message with the name of each file containing such characters. Useful for cleaning up OCR and correcting homographs.
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:tei="http://www.tei-c.org/ns/1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
exclude-result-prefixes="tei">
<xsl:output method = "xml" indent="yes" omit-xml-declaration="no" />
<!-- 16croala-testforunicodeblocks: test text//text() nodes for characters from certain Unicode blocks -->
<xsl:template match="//*:text//text()">
<xsl:if test="matches(., '[\p{IsCyrillic}\p{IsCyrillicSupplement}\p{IsCyrillicExtended-A}\p{IsCyrillicExtended-B}]')">
<xsl:message>Characters from Cyrillic Unicode blocks in <xsl:value-of select="base-uri(.)"/></xsl:message>
nevenjovanovic / getcamena2.py
Last active April 30, 2016 16:44
A Python script to scrape links to XML documents from the CAMENA Thesaurus project pages, adding links to a text file
"""getcamena2.py: Parse a list of CAMENA Thesaurus htmls, write to file only the links ending in .xml."""
__author__ = 'Neven Jovanovic'
__copyright__ = "Neven Jovanovic, Zagreb, Hrvatska"
__credits__ = ["Neven Jovanovic"]
__license__ = "CC-BY"
__version__ = "0.0.1"
__maintainer__ = "Neven Jovanovic"
nevenjovanovic / getcamena.py
Created April 30, 2016 16:32
A Python script to download pages from the CAMENA project, parse them and then follow only the links to XML documents
"""getcamena.py: Parse a list of CAMENA htmls, download links ending with .xml."""
__author__ = 'Neven Jovanovic'
__copyright__ = "Neven Jovanovic, Zagreb, Hrvatska"
__credits__ = ["Neven Jovanovic"]
__license__ = "CC-BY"
__version__ = "0.0.2"
nevenjovanovic / gkhcr-uvod.txt
Created March 2, 2016 07:04
GKHCR on Hellenism
1 Greek literature of the Hellenistic and Imperial periods
Salopek 158-187
Salopek – Sironić – 5–157 : 158–187 (30 pp.)
Đurić, Istorija helenske književnosti – 665–764 (100 pp.)
Tronski 227–322 (100 pp.)
1.1 Periodization
Sironić: 1,300 years, from Homer to 529 (Justinian closes the philosophical school in Athens)