@kylepjohnson
kylepjohnson / lat-livy.txt
Last active December 28, 2020 07:06
Latin text for CLTK demonstrations
Iam primum omnium satis constat Troia capta in ceteros saevitum esse Troianos, duobus, Aeneae Antenorique, et vetusti iure hospitii et quia pacis reddendaeque Helenae semper auctores fuerant, omne ius belli Achiuos abstinuisse; casibus deinde variis Antenorem cum multitudine Enetum, qui seditione ex Paphlagonia pulsi et sedes et ducem rege Pylaemene ad Troiam amisso quaerebant, venisse in intimum maris Hadriatici sinum, Euganeisque qui inter mare Alpesque incolebant pulsis Enetos Troianosque eas tenuisse terras. Et in quem primo egressi sunt locum Troia vocatur pagoque inde Troiano nomen est: gens universa Veneti appellati. Aeneam ab simili clade domo profugum sed ad maiora rerum initia ducentibus fatis, primo in Macedoniam venisse, inde in Siciliam quaerentem sedes delatum, ab Sicilia classe ad Laurentem agrum tenuisse. Troia et huic loco nomen est. Ibi egressi Troiani, ut quibus ab immenso prope errore nihil praeter arma et naues superesset, cum praedam ex agris agerent, Latinus rex Aboriginesque qui tum ea
kylepjohnson / grc-thucydides.txt
Last active December 28, 2020 07:06
Ancient Greek text for CLTK demonstrations
Θουκυδίδης Ἀθηναῖος ξυνέγραψε τὸν πόλεμον τῶν Πελοποννησίων καὶ Ἀθηναίων, ὡς ἐπολέμησαν πρὸς ἀλλήλους, ἀρξάμενος εὐθὺς καθισταμένου καὶ ἐλπίσας μέγαν τε ἔσεσθαι καὶ ἀξιολογώτατον τῶν προγεγενημένων, τεκμαιρόμενος ὅτι ἀκμάζοντές τε ᾖσαν ἐς αὐτὸν ἀμφότεροι παρασκευῇ τῇ πάσῃ καὶ τὸ ἄλλο Ἑλληνικὸν ὁρῶν ξυνιστάμενον πρὸς ἑκατέρους, τὸ μὲν εὐθύς, τὸ δὲ καὶ διανοούμενον. κίνησις γὰρ αὕτη μεγίστη δὴ τοῖς Ἕλλησιν ἐγένετο καὶ μέρει τινὶ τῶν βαρβάρων, ὡς δὲ εἰπεῖν καὶ ἐπὶ πλεῖστον ἀνθρώπων. τὰ γὰρ πρὸ αὐτῶν καὶ τὰ ἔτι παλαίτερα σαφῶς μὲν εὑρεῖν διὰ χρόνου πλῆθος ἀδύνατα ἦν, ἐκ δὲ τεκμηρίων ὧν ἐπὶ μακρότατον σκοποῦντί μοι πιστεῦσαι ξυμβαίνει οὐ μεγάλα νομίζω γενέσθαι οὔτε κατὰ τοὺς πολέμους οὔτε ἐς τὰ ἄλλα. φαίνεται γὰρ ἡ νῦν Ἑλλὰς καλουμένη οὐ πάλαι βεβαίως οἰκουμένη, ἀλλὰ μεταναστάσεις τε οὖσαι τὰ πρότερα καὶ ῥᾳδίως ἕκαστοι τὴν ἑαυτῶν ἀπολείποντες βιαζόμενοι ὑπό τινων αἰεὶ πλειόνων. τῆς γὰρ ἐμπορίας οὐκ οὔσης, οὐδ᾽ ἐπιμειγνύντες ἀδεῶς ἀλλήλοις οὔτε κατὰ γῆν οὔτε διὰ θαλάσσης, νεμόμενοί τε τὰ αὑτῶν ἕκαστοι ὅσον ἀποζ
(cltk) AMAC02Z92FELVCG:cltkv1 kyle.p.johnson$ mv ~/cltk_data ~/cltk_data_bak
(cltk) AMAC02Z92FELVCG:cltkv1 kyle.p.johnson$ poetry run python src/cltkv1/nlp.py
Do you want to download file 'https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.ar.vec' to '/Users/kyle.p.johnson/cltk_data/arb/embeddings/fasttext/wiki.ar.vec'? [y/n] y
100%|█████████████████████████████████████| 1.61G/1.61G [02:39<00:00, 10.1MiB/s]
Do you want to download file 'https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.arc.vec' to '/Users/kyle.p.johnson/cltk_data/arc/embeddings/fasttext/wiki.arc.vec'? [y/n] y
100%|█████████████████████████████████████| 8.66M/8.66M [00:00<00:00, 10.9MiB/s]
Do you want to download file 'https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.got.vec' to '/Users/kyle.p.johnson/cltk_data/got/embeddings/fasttext/wiki.got.vec'? [y/n] y
100%|█████████████████████████████████████| 6.94M/6.94M [00:00<00:00, 10.3MiB/s]
Do you want to download file 'https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/
kylepjohnson / gist:d40215b380be4b050b5cc1ceac09e369
Last active November 15, 2019 06:39
stanfordnlp, Old French FileNotFoundError
(cltk) AMAC02Z92FELVCG:cltkv1 kyle.p.johnson$ poetry run ipython
Python 3.7.5 (default, Oct 31 2019, 20:57:45)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.9.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import stanfordnlp
In [2]: stanfordnlp.__version__
Out[2]: '0.2.0'
$ ipython
Python 3.7.4 (v3.7.4:e09359112e, Jul 8 2019, 14:54:52)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.8.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: from cltk.corpus.latin.wordnet import WordNetCorpusReader
In [2]: LWN = WordNetCorpusReader()
In [3]: uirtus = LWN.lemma('uirtus', 'n', 'n-s---fn3-')
kylepjohnson / pipeline_example.py
Last active August 27, 2019 03:21
MWV to illustrate proposed new CLTK data types and use in an "NLP object"
"""An example of a proposed NLP pipeline system. Goals are to allow for:
1. default NLP pipeline for any given language
2. users to override default pipeline
3. users to choose alternative code (classes/methods/functions) w/in the CLTK
4. users to use their own custom code (inheriting or replacing those w/in CLTK)
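The four goals above can be sketched roughly as follows. This is a minimal illustration and not the CLTK implementation: the `NLP` class, the `DEFAULT_PIPELINES` registry, and the three process functions are all hypothetical names introduced here. It assumes only that a pipeline is an ordered list of callables, each consuming the previous one's output.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

# Assumption: a "process" is any callable taking the previous step's output.
Process = Callable[[Any], Any]


def lowercase(text: str) -> str:
    """Lowercase the input text."""
    return text.lower()


def whitespace_tokenize(text: str) -> List[str]:
    """Split text on whitespace."""
    return text.split()


def strip_punct_tokenize(text: str) -> List[str]:
    """A user-supplied alternative tokenizer that also trims punctuation."""
    return [tok.strip(".,;:") for tok in text.split()]


# Hypothetical registry: one default pipeline per language code (goal 1).
DEFAULT_PIPELINES: Dict[str, List[Process]] = {
    "lat": [lowercase, whitespace_tokenize],
}


@dataclass
class NLP:
    """Hypothetical "NLP object": runs a default or user-supplied pipeline."""

    language: str
    custom_pipeline: Optional[List[Process]] = None

    @property
    def pipeline(self) -> List[Process]:
        # Goals 2-4: a user-supplied list replaces the language default.
        if self.custom_pipeline is not None:
            return self.custom_pipeline
        return DEFAULT_PIPELINES[self.language]

    def analyze(self, text: str) -> Any:
        """Feed the text through each process in order."""
        data: Any = text
        for process in self.pipeline:
            data = process(data)
        return data


default_tokens = NLP("lat").analyze("Arma virumque cano")
custom_tokens = NLP("lat", [lowercase, strip_punct_tokenize]).analyze(
    "Arma, virumque cano."
)
```

Because the pipeline is an ordinary list, swapping in an alternative or custom step is just passing a different list. Both calls here yield the same three lowercase tokens.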
#!/bin/bash
set -e
apt-get update -q
apt-get upgrade -q -y
apt-get install -y software-properties-common
add-apt-repository ppa:webupd8team/java < /dev/null
apt-get update -q
echo debconf shared/accepted-oracle-license-v1-1 select true | sudo debconf-set-selections
echo debconf shared/accepted-oracle-license-v1-1 seen true | sudo debconf-set-selections
Running python-crfsuite-0.9.5/setup.py -q bdist_egg --dist-dir /tmp/easy_install-lbeiimu8/python-crfsuite-0.9.5/egg-dist-tmp-3yelhlin
cc1plus: warning: command line option ‘-std=c99’ is valid for C/ObjC but not for C++
pycrfsuite/_pycrfsuite.cpp: In function ‘void __Pyx__ExceptionSave(PyThreadState*, PyObject**, PyObject**, PyObject**)’:
pycrfsuite/_pycrfsuite.cpp:14140:21: error: ‘PyThreadState {aka struct _ts}’ has no member named ‘exc_type’; did you mean ‘curexc_type’?
*type = tstate->exc_type;
^~~~~~~~
curexc_type
pycrfsuite/_pycrfsuite.cpp:14141:22: error: ‘PyThreadState {aka struct _ts}’ has no member named ‘exc_value’; did you mean ‘curexc_value’?
*value = tstate->exc_value;
^~~~~~~~~
kylepjohnson / subscript-non-pairs
Created June 5, 2017 01:27
Pairs of subscript chars w/o
'\u1fbc', # ᾼ Greek Capital Letter Alpha with Prosgegrammeni
'\u0391', # Α Greek Capital Letter Alpha

'\u1fcc', # ῌ Greek Capital Letter Eta with Prosgegrammeni
'\u0397', # Η Greek Capital Letter Eta

'\u1ffc', # ῼ Greek Capital Letter Omega with Prosgegrammeni
'\u03a9', # Ω Greek Capital Letter Omega

'\u1f88', # ᾈ Greek Capital Letter Alpha with Psili and Prosgegrammeni
import os
with open(os.path.expanduser('~/Downloads/subscript-non-pairs')) as fo:
    text = fo.read()
pairs = text.split('\n\n')
map_sub_nosub = {}
for pair in pairs:
    key, val = pair.split('\n')
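The preview above ends before the loop body does anything with `key` and `val`. As a hedged sketch of where such a script could go, the following builds a mapping from each character with prosgegrammeni to its plain counterpart. It assumes the data layout is two lines per pair with a blank line between pairs. It parses an inline sample instead of the file in `~/Downloads`, and `parse_char` and `build_map` are hypothetical helpers, not part of the original gist.

```python
import ast
from typing import Dict

# Inline sample in the assumed layout of 'subscript-non-pairs':
# two lines per pair, pairs separated by a blank line.
SAMPLE = (
    "'\\u1fbc', # ᾼ Greek Capital Letter Alpha with Prosgegrammeni\n"
    "'\\u0391', # Α Greek Capital Letter Alpha\n"
    "\n"
    "'\\u1fcc', # ῌ Greek Capital Letter Eta with Prosgegrammeni\n"
    "'\\u0397', # Η Greek Capital Letter Eta\n"
)


def parse_char(line: str) -> str:
    """Return the character encoded by the quoted escape at the start of a line."""
    # Drop the trailing comment, then the optional trailing comma.
    literal = line.split("#", 1)[0].strip().rstrip(",").strip()
    # literal_eval turns the text "'\u1fbc'" into the one-character string.
    return ast.literal_eval(literal)


def build_map(text: str) -> Dict[str, str]:
    """Map each character with subscript to its counterpart without one."""
    mapping: Dict[str, str] = {}
    for pair in text.strip().split("\n\n"):
        with_sub, without_sub = pair.split("\n")
        mapping[parse_char(with_sub)] = parse_char(without_sub)
    return mapping


map_sub_nosub = build_map(SAMPLE)
```

Parsing the quoted escape with `ast.literal_eval` keeps the keys and values as single characters, where the raw `pair.split('\n')` would keep whole source lines, comments included.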