"""Compute complexity metrics from Universal Dependencies.
Usage: python3 udstyle.py [OPTIONS] FILE...
--parse=LANG parse texts with Stanza; provide 2 letter language code
--output=FILENAME write result to a tab-separated file.
--persentence report per sentence results, not mean per document
Reported metrics:
- LEN: mean sentence length in words (excluding punctuation).
- MDD: mean dependency distance (Gibson, 1998).
- NDD: normalized dependency distance (Lei & Jockers, 2018).
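As an illustration of the MDD metric listed above, here is a minimal sketch, not the script's own implementation. It assumes the common definition of dependency distance as the absolute difference between the position of a word and the position of its head, averaged over all non-root words. The function name `mean_dependency_distance` and the list-of-heads input format are hypothetical; whether the preview's script also excludes punctuation for this metric is not shown.

```python
def mean_dependency_distance(heads):
    """Return the mean absolute distance between each word and its head.

    heads[i] is the 1-based position of the head of the word at
    position i + 1; 0 marks the root, which has no head and is skipped.
    (Hypothetical input format, loosely following the CoNLL-U HEAD column.)
    """
    distances = [abs(head - position)
                 for position, head in enumerate(heads, start=1)
                 if head != 0]
    if not distances:
        # A one-word sentence has no dependencies to measure.
        return 0.0
    return sum(distances) / len(distances)


# "The dog barked loudly": The -> dog, dog -> barked, barked = root,
# loudly -> barked.
example_heads = [2, 3, 0, 3]
mdd = mean_dependency_distance(example_heads)
```

Every dependency in the example links adjacent words, so each distance is 1; longer-range attachments raise the mean.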
# -*- coding: UTF-8 -*-
"""Preprocessing of text files.
Writes one paragraph per line, and normalizes punctuation & whitespace.
No sentence or word tokenization.
Usage: preprocess.py [FILE]
or: preprocess.py --batch FILES...
By default, produce cleaned version given a single filename to standard output.
Diagnostic information is written to standard error.
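The kind of cleanup this docstring describes might look roughly like the sketch below. It is an assumption-laden illustration, not the script itself: the function name `clean`, the choice to treat blank lines as paragraph boundaries, and the decision to normalize only curly quotes are all hypothetical.

```python
import re

# Assumed punctuation normalization: map curly quotes to straight ones.
QUOTES = str.maketrans({
    '\u2018': "'", '\u2019': "'",
    '\u201c': '"', '\u201d': '"',
})


def clean(text):
    """Return text with one paragraph per line and collapsed whitespace."""
    result = []
    # Assumption: paragraphs are separated by one or more blank lines.
    for paragraph in re.split(r'\n\s*\n', text):
        paragraph = paragraph.translate(QUOTES)
        # Join wrapped lines and collapse runs of spaces and tabs.
        paragraph = ' '.join(paragraph.split())
        if paragraph:
            result.append(paragraph)
    return '\n'.join(result)


cleaned = clean('First  line\nof a \u201cparagraph\u201d.\n\n\nSecond\tparagraph.\n')
```

No sentence or word tokenization takes place here either; the text is only rejoined and normalized.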
DBNLti_id DBNLpers_id YearFirstPublished YearEditionPublished Edition Woman Born Died AuthorOrigin DBNLgeb_land_code DBNLgenre DBNLsubgenre Author Title Filename ti_id_set WPAuthor AuthorInCanon2002 TitleInCanon2002 InBasisbibliotheek2008 AuthorDBRDMatches AuthorNLWikipedia2019Matches DBNLSecRefsAuthor DBNLSecRefsTitle holding lending GNTpages
kist001leve01 kist001 1800 1800 1ste druk 0 1758 1841 Woerden proza roman Willem Kist Het leven, gevoelens en zonderlinge reize van den landjonker Govert Hendrik Godefroi van Blankenheim tot den Stronk (2 delen) kist001leve01_01.xml kist001leve01 0 0 0 0 1 19 1 0 0 4
wolf016gesc01 deke001 1802 1802 1ste druk 1 1741 1804 Amstelveen proza roman Aagje Deken Geschrift eener bejaarde vrouw wolf016gesc01_01.xml wolf016gesc01 Aagje Deken 1 0 0 1 21 131 6 0 0 0
stre001char01 stre001 1804 1804 1ste druk 1 1760 1828 Amsterdam proza briefroman Naatje van Streek-Brinkman Charakters en lotgevallen van Adelson, Héloïse en Elius stre001char01_01.xml stre001char01 0 0 0 0 0 13 0 0
"""Apply PCA to a CSV file and plot its datapoints (one per line).
The first column should be a category (determines the color of each datapoint),
the second a label (shown alongside each datapoint)."""
import sys
import pandas
import pylab as pl
from sklearn import preprocessing
from sklearn.decomposition import PCA
# This is a comment
FROM ubuntu:20.04
MAINTAINER Andreas van Cranenburgh <a.w.vancranenburgh@uva.nl>
RUN ln -fs /usr/share/zoneinfo/Europe/Amsterdam /etc/localtime
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y \
build-essential \
curl \
git \
python3 \
"""A baseline Bag-of-Words text classification.
Usage: python3 classify.py <train.txt> <test.txt> [--svm] [--tfidf] [--bigrams]
train.txt and test.txt should contain one "document" per line,
first token should be the label.
The default is to use regularized Logistic Regression and relative frequencies.
Pass --svm to use Linear SVM instead.
Pass --tfidf to use tf-idf instead of relative frequencies.
Pass --bigrams to use bigrams instead of unigrams.
"""
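The input format and the default features described in this docstring can be illustrated with a small standard-library sketch. It covers only parsing and relative-frequency features, not the classifier, and it is not taken from the script: the helper names `parse_line` and `relative_frequencies` are hypothetical, and whitespace tokenization is an assumption.

```python
from collections import Counter


def parse_line(line):
    """Split one line into (label, tokens); the first token is the label.

    Raises ValueError on an empty line, which has no label.
    """
    label, *tokens = line.split()
    return label, tokens


def relative_frequencies(tokens, bigrams=False):
    """Map each unigram (or bigram) to its share of the document."""
    if bigrams:
        # Pair each token with its successor, e.g. "good film".
        tokens = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {feature: count / total for feature, count in counts.items()}


label, tokens = parse_line('pos a good good film')
features = relative_frequencies(tokens)
bigram_features = relative_frequencies(tokens, bigrams=True)
```

Dividing by the document total keeps long and short documents comparable, which is the usual reason to prefer relative frequencies over raw counts in a baseline.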
"""Extract metadata from Project Gutenberg RDF catalog into a Python dict.
Based on https://bitbucket.org/c-w/gutenberg/
>>> md = readmetadata()
>>> md[123]
{'LCC': {'PS'},
'author': u'Burroughs, Edgar Rice',
'authoryearofbirth': 1875,
'authoryearofdeath': 1950,