Andreas van Cranenburgh andreasvc

## assignment2.ipynb

      
              2 files
            
          
              0 forks
            
          
              0 comments
            
          
              0 stars
            
          
                andreasvc
                / assignment2.ipynb
            
            
              Created
              March 8, 2024 08:08
            
              
                Assignment 2 of Distant Reading course: Topic Modeling
              
          
        Loading

      Sorry, something went wrong. Reload?
      Sorry, we cannot display this file.
      Sorry, this file is invalid so it cannot be displayed.
      
          Viewer requires iframe.
      
    
## metadata.tsv
DBNLti_id	DBNLpers_id	YearFirstPublished	YearEditionPublished	Edition	Woman	Born	Died	AuthorOrigin	DBNLgeb_land_code	DBNLgenre	DBNLsubgenre	Author	Title	Filename	ti_id_set	WPAuthor	AuthorInCanon2002	TitleInCanon2002	InBasisbibliotheek2008	AuthorDBRDMatches	AuthorNLWikipedia2019Matches	DBNLSecRefsAuthor	DBNLSecRefsTitle	holding	lending	GNTpages
kist001leve01	kist001	1800	1800	1ste druk	0	1758	1841	Woerden		proza	roman	Willem Kist	Het leven, gevoelens en zonderlinge reize van den landjonker Govert Hendrik Godefroi van Blankenheim tot den Stronk (2 delen)	kist001leve01_01.xml	kist001leve01		0	0	0	0	1	19	1	0	0	4
wolf016gesc01	deke001	1802	1802	1ste druk	1	1741	1804	Amstelveen		proza	roman	Aagje Deken	Geschrift eener bejaarde vrouw	wolf016gesc01_01.xml	wolf016gesc01	Aagje Deken	1	0	0	1	21	131	6	0	0	0
stre001char01	stre001	1804	1804	1ste druk	1	1760	1828	Amsterdam		proza	briefroman	Naatje van Streek-Brinkman	Charakters en lotgevallen van Adelson, Héloïse en Elius	stre001char01_01.xml	stre001char01		0	0	0	0	0	13	0	0

## longest non-taboo sequence.ipynb

      
              2 files
            
          
              0 forks
            
          
              0 comments
            
          
              0 stars
            
          
                andreasvc
                / longest non-taboo sequence.ipynb
            
            
              Last active
              February 18, 2021 10:41
            
              
                Find the longest sequence of tokens in a text without any taboo n-grams
              
          
        Loading

      Sorry, something went wrong. Reload?
      Sorry, we cannot display this file.
      Sorry, this file is invalid so it cannot be displayed.
      
          Viewer requires iframe.
      
    
## Dockerfile
# This is a comment
FROM ubuntu:20.04
MAINTAINER Andreas van Cranenburgh <a.w.vancranenburgh@uva.nl>
RUN ln -fs /usr/share/zoneinfo/Europe/Amsterdam /etc/localtime
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y \
        build-essential \
        curl \
        git \
        python3 \

## ip_re_bench.py
import random
from timeit import timeit
import re
import re2

re_ip = re.compile(br'\d+\.\d+\.\d+\.\d+')
re2_ip = re2.compile(br'\d+\.\d+\.\d+\.\d+')

lines = ['.'.join(str(random.randint(1, 255)) for _ in range(4)).encode('utf8')
                for _ in range(16000)]

## datetime_example.py
import datetime


def addseconds(timestamp, seconds):
    """Take timestamp as string and add seconds to it.

    >>> addseconds('00:01:45,667', 1)
    '00:01:46,667'
    >>> addseconds('00:01:45,667', 0.5)
    '00:01:46,167'

## detectedlangs_not_nl.tsv

          
            filename
            lang
            confidence
            read_bytes

            
              train/neg/3706_2.txt
              en
              81.0
              1268

            
              train/neg/9466_1.txt
              en
              99.0
              1066

            
              train/neg/6464_2.txt
              en
              99.0
              1248

            
              train/neg/14850_2.txt
              en
              99.0
              1128

            
              train/neg/4674_2.txt
              en
              99.0
              1306

            
              train/neg/7036_1.txt
              fy
              68.0
              997

            
              train/neg/7454_2.txt
              en
              63.0
              688

            
              train/neg/4856_2.txt
              en
              99.0
              1363

            
              train/neg/12096_2.txt
              en
              99.0
              1339

## detectlang.py
"""Apply polyglot language detection to all .txt files under current directory
(searched recursively), write report in tab-separated file detectedlangs.tsv.
"""
import os
from glob import glob
from polyglot.detect import Detector
from polyglot.detect.base import UnknownLanguage


def main():

## exercises.md

      
              1 file
            
          
              0 forks
            
          
              0 comments
            
          
              0 stars
            
          
                andreasvc
                / exercises.md
            
            
              Created
              September 17, 2019 09:11
            
              
                More Python exercises
              
          
Write a function char_freq() that takes a string and builds a frequency listing
of the characters contained in it. Represent the frequency listing as a Python dictionary.
Try it with something like char_freq("abbabcbdbabdbdbabababcbcbab").


Write a function char_freq_table() that take a file name as argument, builds a frequency listing of the characters contained in the file, and prints a sorted and nicely formatted character frequency table to the screen.


The third person singular verb form in English is distinguished by the suffix -s, which is added to the stem of the infinitive form: run -> runs. A simple set of rules can be given as follows:
a. If the verb ends in y, remove it and add ies
b. If the verb ends in o, ch, s, sh, x or z, add es
c. By default just add s


## exercises.md

      
              1 file
            
          
              0 forks
            
          
              0 comments
            
          
              0 stars
            
          
                andreasvc
                / exercises.md
            
            
              Last active
              September 18, 2019 13:39
            
          
    Python exercises


Define a function max() that takes two numbers as arguments and returns the largest of them. Use the if-then-else construct available in Python. (It is true that Python has the max() function built in, but writing it yourself is nevertheless a good exercise).


Define a function max_of_three() that takes three numbers as arguments and returns the largest of them.


Define a function that computes the length of a given list or string. (It is true that Python has the len() function built in, but writing it yourself is nevertheless a good exercise).


Write a function that takes a character (i.e. a string of length 1) and returns True if it is a vowel, False otherwise.
	DBNLti_id DBNLpers_id YearFirstPublished YearEditionPublished Edition Woman Born Died AuthorOrigin DBNLgeb_land_code DBNLgenre DBNLsubgenre Author Title Filename ti_id_set WPAuthor AuthorInCanon2002 TitleInCanon2002 InBasisbibliotheek2008 AuthorDBRDMatches AuthorNLWikipedia2019Matches DBNLSecRefsAuthor DBNLSecRefsTitle holding lending GNTpages
	kist001leve01 kist001 1800 1800 1ste druk 0 1758 1841 Woerden proza roman Willem Kist Het leven, gevoelens en zonderlinge reize van den landjonker Govert Hendrik Godefroi van Blankenheim tot den Stronk (2 delen) kist001leve01_01.xml kist001leve01 0 0 0 0 1 19 1 0 0 4
	wolf016gesc01 deke001 1802 1802 1ste druk 1 1741 1804 Amstelveen proza roman Aagje Deken Geschrift eener bejaarde vrouw wolf016gesc01_01.xml wolf016gesc01 Aagje Deken 1 0 0 1 21 131 6 0 0 0
	stre001char01 stre001 1804 1804 1ste druk 1 1760 1828 Amsterdam proza briefroman Naatje van Streek-Brinkman Charakters en lotgevallen van Adelson, Héloïse en Elius stre001char01_01.xml stre001char01 0 0 0 0 0 13 0 0
	# This is a comment
	FROM ubuntu:20.04
	MAINTAINER Andreas van Cranenburgh <a.w.vancranenburgh@uva.nl>
	RUN ln -fs /usr/share/zoneinfo/Europe/Amsterdam /etc/localtime
	ENV DEBIAN_FRONTEND=noninteractive
	RUN apt-get update && apt-get install -y \
	build-essential \
	curl \
	git \
	python3 \
	import random
	from timeit import timeit
	import re
	import re2

	re_ip = re.compile(br'\d+\.\d+\.\d+\.\d+')
	re2_ip = re2.compile(br'\d+\.\d+\.\d+\.\d+')

	lines = ['.'.join(str(random.randint(1, 255)) for _ in range(4)).encode('utf8')
	for _ in range(16000)]
	import datetime


	def addseconds(timestamp, seconds):
	"""Take timestamp as string and add seconds to it.

	>>> addseconds('00:01:45,667', 1)
	'00:01:46,667'
	>>> addseconds('00:01:45,667', 0.5)
	'00:01:46,167'
filename	lang	confidence	read_bytes
train/neg/3706_2.txt	en	81.0	1268
train/neg/9466_1.txt	en	99.0	1066
train/neg/6464_2.txt	en	99.0	1248
train/neg/14850_2.txt	en	99.0	1128
train/neg/4674_2.txt	en	99.0	1306
train/neg/7036_1.txt	fy	68.0	997
train/neg/7454_2.txt	en	63.0	688
train/neg/4856_2.txt	en	99.0	1363
train/neg/12096_2.txt	en	99.0	1339
	"""Apply polyglot language detection to all .txt files under current directory
	(searched recursively), write report in tab-separated file detectedlangs.tsv.
	"""
	import os
	from glob import glob
	from polyglot.detect import Detector
	from polyglot.detect.base import UnknownLanguage


	def main():