Skip to content

Instantly share code, notes, and snippets.

View metadata.tsv
We can make this file beautiful and searchable if this error is corrected: It looks like row 4 should actually have 27 columns, instead of 25. in line 3.
DBNLti_id DBNLpers_id YearFirstPublished YearEditionPublished Edition Woman Born Died AuthorOrigin DBNLgeb_land_code DBNLgenre DBNLsubgenre Author Title Filename ti_id_set WPAuthor AuthorInCanon2002 TitleInCanon2002 InBasisbibliotheek2008 AuthorDBRDMatches AuthorNLWikipedia2019Matches DBNLSecRefsAuthor DBNLSecRefsTitle holding lending GNTpages
kist001leve01 kist001 1800 1800 1ste druk 0 1758 1841 Woerden proza roman Willem Kist Het leven, gevoelens en zonderlinge reize van den landjonker Govert Hendrik Godefroi van Blankenheim tot den Stronk (2 delen) kist001leve01_01.xml kist001leve01 0 0 0 0 1 19 1 0 0 4
wolf016gesc01 deke001 1802 1802 1ste druk 1 1741 1804 Amstelveen proza roman Aagje Deken Geschrift eener bejaarde vrouw wolf016gesc01_01.xml wolf016gesc01 Aagje Deken 1 0 0 1 21 131 6 0 0 0
stre001char01 stre001 1804 1804 1ste druk 1 1760 1828 Amsterdam proza briefroman Naatje van Streek-Brinkman Charakters en lotgevallen van Adelson, Héloïse en Elius stre001char01_01.xml stre001char01 0 0 0 0 0 13 0 0
andreasvc / longest non-taboo sequence.ipynb
Last active February 18, 2021 10:41
Find the longest sequence of tokens in a text without any taboo n-grams
View longest non-taboo sequence.ipynb
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
andreasvc / Dockerfile
Created September 8, 2020 12:29
docker-compose example
View Dockerfile
# This is a comment
FROM ubuntu:20.04
MAINTAINER Andreas van Cranenburgh <>
RUN ln -fs /usr/share/zoneinfo/Europe/Amsterdam /etc/localtime
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y \
build-essential \
curl \
git \
python3 \
import random
from timeit import timeit
import re
import re2
re_ip = re.compile(br'\d+\.\d+\.\d+\.\d+')
re2_ip = re2.compile(br'\d+\.\d+\.\d+\.\d+')
lines = ['.'.join(str(random.randint(1, 255)) for _ in range(4)).encode('utf8')
for _ in range(16000)]
import datetime
def addseconds(timestamp, seconds):
"""Take timestamp as string and add seconds to it.
>>> addseconds('00:01:45,667', 1)
>>> addseconds('00:01:45,667', 0.5)
View detectedlangs_not_nl.tsv
filename lang confidence read_bytes
train/neg/3706_2.txt en 81.0 1268
train/neg/9466_1.txt en 99.0 1066
train/neg/6464_2.txt en 99.0 1248
train/neg/14850_2.txt en 99.0 1128
train/neg/4674_2.txt en 99.0 1306
train/neg/7036_1.txt fy 68.0 997
train/neg/7454_2.txt en 63.0 688
train/neg/4856_2.txt en 99.0 1363
train/neg/12096_2.txt en 99.0 1339
andreasvc /
Last active October 3, 2019 17:12
Apply polyglot language detection recursively
"""Apply polyglot language detection to all .txt files under current directory
(searched recursively), write report in tab-separated file detectedlangs.tsv.
import os
from glob import glob
from polyglot.detect import Detector
from polyglot.detect.base import UnknownLanguage
def main():
andreasvc /
Created September 17, 2019 09:11
More Python exercises
  1. Write a function char_freq() that takes a string and builds a frequency listing of the characters contained in it. Represent the frequency listing as a Python dictionary. Try it with something like char_freq("abbabcbdbabdbdbabababcbcbab").

  2. Write a function char_freq_table() that take a file name as argument, builds a frequency listing of the characters contained in the file, and prints a sorted and nicely formatted character frequency table to the screen.

  3. The third person singular verb form in English is distinguished by the suffix -s, which is added to the stem of the infinitive form: run -> runs. A simple set of rules can be given as follows:

    a. If the verb ends in y, remove it and add ies b. If the verb ends in o, ch, s, sh, x or z, add es c. By default just add s


Python exercises

  1. Define a function max() that takes two numbers as arguments and returns the largest of them. Use the if-then-else construct available in Python. (It is true that Python has the max() function built in, but writing it yourself is nevertheless a good exercise).

  2. Define a function max_of_three() that takes three numbers as arguments and returns the largest of them.

  3. Define a function that computes the length of a given list or string. (It is true that Python has the len() function built in, but writing it yourself is nevertheless a good exercise).

  4. Write a function that takes a character (i.e. a string of length 1) and returns True if it is a vowel, False otherwise.

andreasvc /
Last active July 8, 2020 15:07
A baseline Bag-of-Words text classification
"""A baseline Bag-of-Words text classification.
Usage: python3 <train.txt> <test.txt> [--svm] [--tfidf] [--bigrams]
train.txt and test.txt should contain one "document" per line,
first token should be the label.
The default is to use regularized Logistic Regression and relative frequencies.
Pass --svm to use Linear SVM instead.
Pass --tfidf to use tf-idf instead of relative frequencies.
Pass --bigrams to use bigrams instead of unigrams.