@andreasvc
andreasvc / assignment2.ipynb
Created March 8, 2024 08:08
Assignment 2 of Distant Reading course: Topic Modeling
@andreasvc
andreasvc / 1027.txt.mrg.gz
Last active December 30, 2022 00:35
A tutorial on using tree fragments for text classification. http://nbviewer.ipython.org/gist/andreasvc/9467e27680d8950045b2
"""Compute complexity metrics from Universal Dependencies.
Usage: python3 udstyle.py [OPTIONS] FILE...
  --parse=LANG       parse texts with Stanza; provide 2 letter language code
  --output=FILENAME  write result to a tab-separated file.
  --persentence      report per sentence results, not mean per document
Reported metrics:
- LEN: mean sentence length in words (excluding punctuation).
- MDD: mean dependency distance (Gibson, 1998).
- NDD: normalized dependency distance (Lei & Jockers, 2018).
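As a rough illustration of the MDD metric listed above, here is a minimal sketch, not the script's own code. It assumes each sentence is given as a list of 1-based head indices in the CoNLL-U convention (0 marks the root) and takes the dependency distance of a token to be the absolute difference between its position and its head's position. The function name and the `skip` parameter (for example, positions of punctuation tokens) are hypothetical.

```python
def mean_dependency_distance(heads, skip=()):
    """Mean absolute distance between each token and its head.

    heads[i] is the 1-based position of the head of token i + 1;
    0 marks the root, which has no incoming dependency and is ignored.
    Positions listed in `skip` (an assumption, e.g. punctuation) are ignored too.
    """
    distances = [
        abs(head - position)
        for position, head in enumerate(heads, start=1)
        if head != 0 and position not in skip
    ]
    return sum(distances) / len(distances) if distances else 0.0


# "She gave him a book": She->gave, gave=root, him->gave, a->book, book->gave
print(mean_dependency_distance([2, 0, 2, 5, 2]))
```

A per-document value would then be the mean of this quantity over sentences, or the raw per-sentence values if something like `--persentence` is requested.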
# -*- coding: UTF-8 -*-
"""Preprocessing of text files.
Writes one paragraph per line, and normalizes punctuation & whitespace.
No sentence or word tokenization.
Usage: preprocess.py [FILE]
or: preprocess.py --batch FILES...
By default, produce cleaned version given a single filename to standard output.
Diagnostic information is written to standard error.
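The cleaning step described above could look roughly like the following sketch. It is an assumption-laden illustration rather than the actual preprocess.py: paragraphs are assumed to be separated by blank lines, and "normalizing punctuation" is reduced here to mapping curly quotes to straight ones. The names `QUOTES` and `clean` are hypothetical.

```python
import re

# Assumed punctuation normalization: curly quotes become straight quotes.
QUOTES = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})


def clean(text):
    """Return text with one paragraph per line and collapsed whitespace."""
    # Assumption: a paragraph boundary is one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text.translate(QUOTES))
    # Collapse every run of whitespace, including line breaks, to one space.
    lines = (" ".join(paragraph.split()) for paragraph in paragraphs)
    return "\n".join(line for line in lines if line)


print(clean("\u201cHello,\nworld.\u201d\n\n\nSecond   paragraph.\n"))
```

No sentence or word tokenization takes place; the only line breaks left in the output are paragraph boundaries.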
DBNLti_id DBNLpers_id YearFirstPublished YearEditionPublished Edition Woman Born Died AuthorOrigin DBNLgeb_land_code DBNLgenre DBNLsubgenre Author Title Filename ti_id_set WPAuthor AuthorInCanon2002 TitleInCanon2002 InBasisbibliotheek2008 AuthorDBRDMatches AuthorNLWikipedia2019Matches DBNLSecRefsAuthor DBNLSecRefsTitle holding lending GNTpages
kist001leve01 kist001 1800 1800 1ste druk 0 1758 1841 Woerden proza roman Willem Kist Het leven, gevoelens en zonderlinge reize van den landjonker Govert Hendrik Godefroi van Blankenheim tot den Stronk (2 delen) kist001leve01_01.xml kist001leve01 0 0 0 0 1 19 1 0 0 4
wolf016gesc01 deke001 1802 1802 1ste druk 1 1741 1804 Amstelveen proza roman Aagje Deken Geschrift eener bejaarde vrouw wolf016gesc01_01.xml wolf016gesc01 Aagje Deken 1 0 0 1 21 131 6 0 0 0
stre001char01 stre001 1804 1804 1ste druk 1 1760 1828 Amsterdam proza briefroman Naatje van Streek-Brinkman Charakters en lotgevallen van Adelson, Héloïse en Elius stre001char01_01.xml stre001char01 0 0 0 0 0 13 0 0
@andreasvc
andreasvc / pca.py
Last active October 20, 2021 17:57
Apply PCA to a CSV file and plot its datapoints (one per line). Usage: pca.py <csv_file>
"""Apply PCA to a CSV file and plot its datapoints (one per line).
The first column should be a category (determines the color of each datapoint),
the second a label (shown alongside each datapoint)."""
import sys
import pandas
import pylab as pl
from sklearn import preprocessing
from sklearn.decomposition import PCA
@andreasvc
andreasvc / longest non-taboo sequence.ipynb
Last active February 18, 2021 10:41
Find the longest sequence of tokens in a text without any taboo n-grams
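Since the notebook itself cannot be displayed, the following is only a sketch of one way to approach the task in the description, not the notebook's method. It assumes the text is a list of tokens and the taboo n-grams are a set of tuples, possibly of mixed lengths, and uses a sliding window whose left edge moves past the start of any taboo n-gram that ends at the current position. The function name is hypothetical.

```python
def longest_non_taboo(tokens, taboo):
    """Return the longest contiguous run of tokens containing no taboo n-gram.

    taboo is a set of tuples of tokens; ties are resolved in favor of the
    earliest run.
    """
    lengths = {len(ngram) for ngram in taboo if ngram}
    start = best_start = best_end = 0
    for end in range(1, len(tokens) + 1):
        for n in lengths:
            if n <= end and tuple(tokens[end - n:end]) in taboo:
                # A taboo n-gram occupies tokens[end - n:end]; the window
                # has to begin after the first token of that n-gram.
                start = max(start, end - n + 1)
        if end - start > best_end - best_start:
            best_start, best_end = start, end
    return tokens[best_start:best_end]


tokens = "a b c d e b c f g h".split()
print(longest_non_taboo(tokens, {("b", "c"), ("g",)}))
```

Each position is checked once per distinct n-gram length, so the running time is proportional to the number of tokens times the number of distinct lengths.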
@andreasvc
andreasvc / Dockerfile
Created September 8, 2020 12:29
docker-compose example
# This is a comment
FROM ubuntu:20.04
LABEL maintainer="Andreas van Cranenburgh <a.w.vancranenburgh@uva.nl>"
RUN ln -fs /usr/share/zoneinfo/Europe/Amsterdam /etc/localtime
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y \
    build-essential \
    curl \
    git \
    python3 \
@andreasvc
andreasvc / bowclassify.py
Last active July 8, 2020 15:07
A baseline Bag-of-Words text classification
"""A baseline Bag-of-Words text classification.
Usage: python3 bowclassify.py <train.txt> <test.txt> [--svm] [--tfidf] [--bigrams]
train.txt and test.txt should contain one "document" per line,
first token should be the label.
The default is to use regularized Logistic Regression and relative frequencies.
Pass --svm to use Linear SVM instead.
Pass --tfidf to use tf-idf instead of relative frequencies.
Pass --bigrams to use bigrams instead of unigrams.
"""
@andreasvc
andreasvc / metainfo.py
Last active May 23, 2020 16:39
Extract metadata from Project Gutenberg RDF catalog into a Python dict.
"""Extract metadata from Project Gutenberg RDF catalog into a Python dict.
Based on https://bitbucket.org/c-w/gutenberg/
>>> md = readmetadata()
>>> md[123]
{'LCC': {'PS'},
'author': u'Burroughs, Edgar Rice',
'authoryearofbirth': 1875,
'authoryearofdeath': 1950,