@andreasvc
andreasvc / assignment2.ipynb
Created March 8, 2024 08:08
Assignment 2 of Distant Reading course: Topic Modeling
DBNLti_id DBNLpers_id YearFirstPublished YearEditionPublished Edition Woman Born Died AuthorOrigin DBNLgeb_land_code DBNLgenre DBNLsubgenre Author Title Filename ti_id_set WPAuthor AuthorInCanon2002 TitleInCanon2002 InBasisbibliotheek2008 AuthorDBRDMatches AuthorNLWikipedia2019Matches DBNLSecRefsAuthor DBNLSecRefsTitle holding lending GNTpages
kist001leve01 kist001 1800 1800 1ste druk 0 1758 1841 Woerden proza roman Willem Kist Het leven, gevoelens en zonderlinge reize van den landjonker Govert Hendrik Godefroi van Blankenheim tot den Stronk (2 delen) kist001leve01_01.xml kist001leve01 0 0 0 0 1 19 1 0 0 4
wolf016gesc01 deke001 1802 1802 1ste druk 1 1741 1804 Amstelveen proza roman Aagje Deken Geschrift eener bejaarde vrouw wolf016gesc01_01.xml wolf016gesc01 Aagje Deken 1 0 0 1 21 131 6 0 0 0
stre001char01 stre001 1804 1804 1ste druk 1 1760 1828 Amsterdam proza briefroman Naatje van Streek-Brinkman Charakters en lotgevallen van Adelson, Héloïse en Elius stre001char01_01.xml stre001char01 0 0 0 0 0 13 0 0
andreasvc / longest non-taboo sequence.ipynb
Last active February 18, 2021 10:41
Find the longest sequence of tokens in a text without any taboo n-grams
andreasvc / Dockerfile
Created September 8, 2020 12:29
docker-compose example
# This is a comment
FROM ubuntu:20.04
MAINTAINER Andreas van Cranenburgh <a.w.vancranenburgh@uva.nl>
RUN ln -fs /usr/share/zoneinfo/Europe/Amsterdam /etc/localtime
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y \
build-essential \
curl \
git \
python3 \
import random
from timeit import timeit
import re
import re2
re_ip = re.compile(br'\d+\.\d+\.\d+\.\d+')
re2_ip = re2.compile(br'\d+\.\d+\.\d+\.\d+')
lines = ['.'.join(str(random.randint(1, 255)) for _ in range(4)).encode('utf8')
for _ in range(16000)]
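The preview above stops after building the test data. A minimal sketch of how the comparison might continue, assuming the goal is to time both compiled patterns over `lines` with `timeit`; the helper `count_matches`, the repeat count, and the fallback to `re` when `re2` is not installed are assumptions for illustration, not part of the original gist:

```python
import random
import re
from timeit import timeit

try:
    import re2  # third-party RE2 binding, as imported in the snippet above
except ImportError:
    re2 = re  # assumption: fall back so the sketch still runs without re2

pattern = br'\d+\.\d+\.\d+\.\d+'
re_ip = re.compile(pattern)
re2_ip = re2.compile(pattern)

# Same synthetic data as in the snippet: 16000 random dotted quads as bytes.
lines = ['.'.join(str(random.randint(1, 255)) for _ in range(4)).encode('utf8')
         for _ in range(16000)]


def count_matches(regex, data):
    """Hypothetical helper: count the lines that the compiled regex matches."""
    return sum(1 for line in data if regex.match(line))


for name, regex in [('re', re_ip), ('re2', re2_ip)]:
    seconds = timeit(lambda: count_matches(regex, lines), number=10)
    print('%s: %.3f s for 10 passes' % (name, seconds))
```

Timings depend heavily on the pattern and the input, so numbers from a sketch like this say little about other workloads.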
import datetime
def addseconds(timestamp, seconds):
    """Take timestamp as string and add seconds to it.
    >>> addseconds('00:01:45,667', 1)
    '00:01:46,667'
    >>> addseconds('00:01:45,667', 0.5)
    '00:01:46,167'
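The preview cuts off before the function body. One possible body that satisfies the two doctests shown, as a sketch only: it assumes the timestamp always has the form `HH:MM:SS,mmm` and uses `strptime`/`strftime`, which may differ from what the original gist does.

```python
import datetime


def addseconds(timestamp, seconds):
    """Take timestamp as string and add seconds to it.

    >>> addseconds('00:01:45,667', 1)
    '00:01:46,667'
    >>> addseconds('00:01:45,667', 0.5)
    '00:01:46,167'
    """
    fmt = '%H:%M:%S,%f'  # assumed format: hours:minutes:seconds,milliseconds
    # %f accepts one to six digits and pads on the right, so '667' is 667 ms.
    parsed = datetime.datetime.strptime(timestamp, fmt)
    shifted = parsed + datetime.timedelta(seconds=seconds)
    # strftime emits six digits for %f; drop the last three to keep milliseconds.
    return shifted.strftime(fmt)[:-3]
```

Because the calculation goes through `datetime`, a shift past midnight wraps around to the next day rather than producing an hour value of 24 or more.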
filename lang confidence read_bytes
train/neg/3706_2.txt en 81.0 1268
train/neg/9466_1.txt en 99.0 1066
train/neg/6464_2.txt en 99.0 1248
train/neg/14850_2.txt en 99.0 1128
train/neg/4674_2.txt en 99.0 1306
train/neg/7036_1.txt fy 68.0 997
train/neg/7454_2.txt en 63.0 688
train/neg/4856_2.txt en 99.0 1363
train/neg/12096_2.txt en 99.0 1339
andreasvc / detectlang.py
Last active October 3, 2019 17:12
Apply polyglot language detection recursively
"""Apply polyglot language detection to all .txt files under current directory
(searched recursively), write report in tab-separated file detectedlangs.tsv.
"""
import os
from glob import glob
from polyglot.detect import Detector
from polyglot.detect.base import UnknownLanguage
def main():
andreasvc / exercises.md
Created September 17, 2019 09:11
More Python exercises
  1. Write a function char_freq() that takes a string and builds a frequency listing of the characters contained in it. Represent the frequency listing as a Python dictionary. Try it with something like char_freq("abbabcbdbabdbdbabababcbcbab").

  2. Write a function char_freq_table() that takes a file name as argument, builds a frequency listing of the characters contained in the file, and prints a sorted and nicely formatted character frequency table to the screen.

  3. The third person singular verb form in English is distinguished by the suffix -s, which is added to the stem of the infinitive form: run -> runs. A simple set of rules can be given as follows:

    a. If the verb ends in y, remove it and add ies
    b. If the verb ends in o, ch, s, sh, x or z, add es
    c. By default just add s
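For reference, exercises 1 and 3 could be approached roughly as follows. This is one possible sketch, not a model answer: the name `make_3sg_form` is hypothetical (the exercise text shown does not name the function), and the rules are applied exactly as listed, so irregular verbs are not handled.

```python
def char_freq(text):
    """Exercise 1: map each character in text to the number of times it occurs."""
    freq = {}
    for char in text:
        freq[char] = freq.get(char, 0) + 1
    return freq


def make_3sg_form(verb):
    """Exercise 3 (hypothetical name): apply rules a, b and c in that order."""
    if verb.endswith('y'):
        # Rule a: drop the final y and add ies.
        return verb[:-1] + 'ies'
    if verb.endswith(('o', 'ch', 's', 'sh', 'x', 'z')):
        # Rule b: add es after o, ch, s, sh, x or z.
        return verb + 'es'
    # Rule c: by default just add s.
    return verb + 's'


print(char_freq("abbabcbdbabdbdbabababcbcbab"))
for verb in ['try', 'brush', 'run', 'fix']:
    print(verb, '->', make_3sg_form(verb))
```

The rules are deliberately simple, so they overgenerate: a verb such as "play" would come out as "plaies" under rule a.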

Python exercises

  1. Define a function max() that takes two numbers as arguments and returns the largest of them. Use the if-then-else construct available in Python. (It is true that Python has the max() function built in, but writing it yourself is nevertheless a good exercise).

  2. Define a function max_of_three() that takes three numbers as arguments and returns the largest of them.

  3. Define a function that computes the length of a given list or string. (It is true that Python has the len() function built in, but writing it yourself is nevertheless a good exercise).

  4. Write a function that takes a character (i.e. a string of length 1) and returns True if it is a vowel, False otherwise.
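The four exercises above could be sketched like this. The names `max_of_two`, `length` and `is_vowel` are assumptions chosen for illustration; in particular, `max_of_two` is used instead of `max()` so the built-in is not shadowed, and only a, e, i, o and u are treated as vowels.

```python
def max_of_two(a, b):
    """Exercise 1: return the larger of two numbers using if/else."""
    if a > b:
        return a
    else:
        return b


def max_of_three(a, b, c):
    """Exercise 2: return the largest of three numbers."""
    return max_of_two(a, max_of_two(b, c))


def length(sequence):
    """Exercise 3: count the items of a list or string without len()."""
    count = 0
    for _ in sequence:
        count += 1
    return count


def is_vowel(char):
    """Exercise 4: True if char is a single vowel (a, e, i, o, u), else False."""
    # The length check matters: the empty string is 'in' every string.
    return len(char) == 1 and char.lower() in 'aeiou'


print(max_of_two(3, 7), max_of_three(4, 9, 2), length('hello'), is_vowel('e'))
```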