Andreas van Cranenburgh andreasvc

## assignment2.ipynb

      
Created March 8, 2024 08:08

Assignment 2 of Distant Reading course: Topic Modeling
    
## 1027.txt.mrg.gz

      
Last active December 30, 2022 00:35

A tutorial on using tree fragments for text classification. http://nbviewer.ipython.org/gist/andreasvc/9467e27680d8950045b2
        
    
## udstyle.py
"""Compute complexity metrics from Universal Dependencies.

Usage: python3 udstyle.py [OPTIONS] FILE...
  --parse=LANG          parse texts with Stanza; provide 2 letter language code
  --output=FILENAME     write result to a tab-separated file.
  --persentence         report per sentence results, not mean per document
Reported metrics:
  - LEN:  mean sentence length in words (excluding punctuation).
  - MDD:  mean dependency distance (Gibson, 1998).
  - NDD:  normalized dependency distance (Lei & Jockers, 2018).
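
The LEN and MDD metrics listed above can be illustrated with a minimal sketch. It is not taken from udstyle.py: the function names and the input format (a list of 1-based head indices with 0 for the root, plus a list of UPOS tags) are assumptions for illustration, and this version counts every non-root dependency, punctuation included, which the script itself may handle differently.

```python
def mean_dependency_distance(heads):
    """Mean absolute distance between each token and its head.

    heads[i] is the 1-based position of the head of token i+1;
    0 marks the root, which has no dependency distance.
    """
    distances = [abs(position - head)
                 for position, head in enumerate(heads, start=1)
                 if head != 0]
    return sum(distances) / len(distances) if distances else 0.0


def sentence_length(upos_tags):
    """Number of tokens, excluding punctuation (UPOS tag PUNCT)."""
    return sum(1 for tag in upos_tags if tag != 'PUNCT')


# "She gave him a book ." with "gave" as the root.
heads = [2, 0, 2, 5, 2, 2]
tags = ['PRON', 'VERB', 'PRON', 'DET', 'NOUN', 'PUNCT']
print(mean_dependency_distance(heads))  # 2.0
print(sentence_length(tags))  # 5
```

Averaging these per-sentence values over a document would give the per-document means that the script reports by default.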

## preprocess.py
# -*- coding: UTF-8 -*-
"""Preprocessing of text files.
Writes one paragraph per line, and normalizes punctuation & whitespace.
No sentence or word tokenization.

Usage: preprocess.py [FILE]
or: preprocess.py --batch FILES...

By default, produce cleaned version given a single filename to standard output.
Diagnostic information is written to standard error.
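
As a rough illustration of "one paragraph per line" with normalized punctuation and whitespace, the sketch below assumes that paragraphs are separated by blank lines and that normalization means mapping curly quotes and dashes to ASCII. Both choices, and the function name `preprocess`, are assumptions; the actual rules in preprocess.py are not shown in this preview.

```python
import re

# Assumed punctuation mapping: curly quotes and dashes to ASCII equivalents.
PUNCTUATION = str.maketrans({
    '\u2018': "'", '\u2019': "'",
    '\u201c': '"', '\u201d': '"',
    '\u2013': '-', '\u2014': '-',
})


def preprocess(text):
    """Return text with one paragraph per line and collapsed whitespace."""
    text = text.translate(PUNCTUATION)
    # Assume one or more blank lines separate paragraphs.
    paragraphs = re.split(r'\n\s*\n', text)
    # Join the words of each paragraph with single spaces.
    lines = [' '.join(paragraph.split()) for paragraph in paragraphs]
    return '\n'.join(line for line in lines if line)


raw = 'First  line\ncontinues.\n\n\n\u201cSecond\u201d paragraph.\n'
print(preprocess(raw))
```

No sentence or word tokenization happens here either: each paragraph is only flattened to a single line.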

## metadata.tsv
DBNLti_id	DBNLpers_id	YearFirstPublished	YearEditionPublished	Edition	Woman	Born	Died	AuthorOrigin	DBNLgeb_land_code	DBNLgenre	DBNLsubgenre	Author	Title	Filename	ti_id_set	WPAuthor	AuthorInCanon2002	TitleInCanon2002	InBasisbibliotheek2008	AuthorDBRDMatches	AuthorNLWikipedia2019Matches	DBNLSecRefsAuthor	DBNLSecRefsTitle	holding	lending	GNTpages
kist001leve01	kist001	1800	1800	1ste druk	0	1758	1841	Woerden		proza	roman	Willem Kist	Het leven, gevoelens en zonderlinge reize van den landjonker Govert Hendrik Godefroi van Blankenheim tot den Stronk (2 delen)	kist001leve01_01.xml	kist001leve01		0	0	0	0	1	19	1	0	0	4
wolf016gesc01	deke001	1802	1802	1ste druk	1	1741	1804	Amstelveen		proza	roman	Aagje Deken	Geschrift eener bejaarde vrouw	wolf016gesc01_01.xml	wolf016gesc01	Aagje Deken	1	0	0	1	21	131	6	0	0	0
stre001char01	stre001	1804	1804	1ste druk	1	1760	1828	Amsterdam		proza	briefroman	Naatje van Streek-Brinkman	Charakters en lotgevallen van Adelson, Héloïse en Elius	stre001char01_01.xml	stre001char01		0	0	0	0	0	13	0	0
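
A file in this format can be read with the standard library alone. The sketch below is an assumption about usage, not code from the gist: `read_metadata` is a hypothetical helper, and the sample is abbreviated to five of the columns above, with values copied from the second data row.

```python
import csv
import io


def read_metadata(fileobj):
    """Map each DBNLti_id to its row (a dict keyed by column name)."""
    reader = csv.DictReader(fileobj, delimiter='\t', quoting=csv.QUOTE_NONE)
    return {row['DBNLti_id']: row for row in reader}


# Abbreviated sample: only a few of the columns in metadata.tsv.
header = ['DBNLti_id', 'YearFirstPublished', 'Woman', 'Author', 'Title']
row = ['wolf016gesc01', '1802', '1', 'Aagje Deken',
       'Geschrift eener bejaarde vrouw']
sample = io.StringIO('\t'.join(header) + '\n' + '\t'.join(row) + '\n')

metadata = read_metadata(sample)
print(metadata['wolf016gesc01']['Author'])  # Aagje Deken
```

All values arrive as strings, so numeric columns such as YearFirstPublished need an explicit `int()` conversion. Quoting is disabled because titles may contain quote characters.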

## pca.py
"""Apply PCA to a CSV file and plot its datapoints (one per line).

The first column should be a category (determines the color of each datapoint),
the second a label (shown alongside each datapoint)."""
import sys
import pandas
import pylab as pl
from sklearn import preprocessing
from sklearn.decomposition import PCA

## longest non-taboo sequence.ipynb

      
Last active February 18, 2021 10:41

Find the longest sequence of tokens in a text without any taboo n-grams
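
The notebook itself cannot be displayed here, so the following is only one possible way to solve the stated problem, not the author's method. It is a single-pass sliding window: whenever a taboo n-gram ends at the current position, the window start moves just past the first token of that n-gram. The function name and the representation of taboo n-grams as a set of token tuples are assumptions.

```python
def longest_clean_span(tokens, taboo):
    """Return (start, end) such that tokens[start:end] is the longest
    slice containing none of the n-grams in `taboo` (a set of tuples).
    Ties are resolved in favor of the earliest span."""
    lengths = sorted({len(ngram) for ngram in taboo if ngram})
    best_start, best_end = 0, 0
    start = 0
    for end in range(1, len(tokens) + 1):
        for n in lengths:
            # A taboo n-gram occupying tokens[end - n:end] forces the
            # window to begin after that n-gram's first token.
            if n <= end and tuple(tokens[end - n:end]) in taboo:
                start = max(start, end - n + 1)
        if end - start > best_end - best_start:
            best_start, best_end = start, end
    return best_start, best_end


tokens = 'a b c d e f'.split()
start, end = longest_clean_span(tokens, {('c', 'd')})
print(tokens[start:end])  # ['a', 'b', 'c']
```

The running time is proportional to the number of tokens times the number of distinct n-gram lengths, since each position is checked once per length.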
              
          
    
## Dockerfile
# This is a comment
FROM ubuntu:20.04
LABEL maintainer="Andreas van Cranenburgh <a.w.vancranenburgh@uva.nl>"
RUN ln -fs /usr/share/zoneinfo/Europe/Amsterdam /etc/localtime
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get install -y \
        build-essential \
        curl \
        git \
        python3 \

## bowclassify.py
"""A baseline Bag-of-Words text classification.

Usage: python3 bowclassify.py <train.txt> <test.txt> [--svm] [--tfidf] [--bigrams]
train.txt and test.txt should contain one "document" per line,
first token should be the label.
The default is to use regularized Logistic Regression and relative frequencies.
Pass --svm to use Linear SVM instead.
Pass --tfidf to use tf-idf instead of relative frequencies.
Pass --bigrams to use bigrams instead of unigrams.
"""
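
The input format and the default features described in this docstring can be sketched with the standard library. The classifier itself is left out, and the function names are hypothetical rather than taken from the script: the sketch only shows how a line splits into a label and a document, and how relative frequencies of unigrams or bigrams could be computed.

```python
from collections import Counter


def read_documents(lines):
    """Split each line into a label (first token) and its document tokens."""
    labels, documents = [], []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip empty lines
        labels.append(tokens[0])
        documents.append(tokens[1:])
    return labels, documents


def relative_frequencies(tokens, bigrams=False):
    """Relative frequency of each unigram, or of each bigram if requested."""
    items = list(zip(tokens, tokens[1:])) if bigrams else list(tokens)
    counts = Counter(items)
    total = sum(counts.values())
    return {item: count / total for item, count in counts.items()}


labels, documents = read_documents(['pos good good film', 'neg bad film'])
print(labels)  # ['pos', 'neg']
print(relative_frequencies(documents[0], bigrams=True))
```

Feature dictionaries of this kind would still have to be turned into vectors before training Logistic Regression or a Linear SVM.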

## metainfo.py
"""Extract metadata from Project Gutenberg RDF catalog into a Python dict.

Based on https://bitbucket.org/c-w/gutenberg/

>>> md = readmetadata()
>>> md[123]
{'LCC': {'PS'},
 'author': u'Burroughs, Edgar Rice',
 'authoryearofbirth': 1875,
 'authoryearofdeath': 1950,