import pdfquery
import re
from lxml import etree as ET
import urllib.request
import urllib.error
# https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt
with open('words_alpha.txt', 'r') as f:
    words = set([x.strip() for x in f])
words.add('embeddings')
import pdfquery
import re
from lxml import etree as ET
import urllib.request
import urllib.error
from collections import Counter
import random
import numpy as np
import fileinput
import multiprocessing
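import pandas as pd
import re
from typing import List
# Patterns are raw strings so regex escapes such as \w and \/ are not treated as string escapes.
re_author_split = re.compile(r' and |, ')
# A plain word or phrase wrapped in braces, e.g. {BERT}.
re_curly_brace = re.compile(r'{([A-Za-z0-9 ]+)}')
acceptable_chars = r'[\'`\/:\-()?\w\s\d.,]+'
# A line break together with the spaces around it.
re_newline = re.compile(r'[ ]*\n[ ]*')
# Inline LaTeX emphasis such as {\em text} or {\it text}.
re_inline_italics = re.compile(r'{\\(?:em|it) (' + acceptable_chars + ')}')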
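The snippet above is cut off before the patterns are used, so the following is only a minimal sketch of how they might be applied to LaTeX-flavoured metadata. The helper names `clean_title` and `split_authors` are hypothetical and do not come from the original file; the patterns are repeated so the example runs on its own.

```python
import re
from typing import List

# Patterns repeated from the snippet above, written as raw strings.
re_author_split = re.compile(r' and |, ')
re_curly_brace = re.compile(r'{([A-Za-z0-9 ]+)}')
acceptable_chars = r'[\'`\/:\-()?\w\s\d.,]+'
re_newline = re.compile(r'[ ]*\n[ ]*')
re_inline_italics = re.compile(r'{\\(?:em|it) (' + acceptable_chars + ')}')


def clean_title(raw: str) -> str:
    """Hypothetical helper: join wrapped lines and unwrap simple markup."""
    text = re_newline.sub(' ', raw)            # line break plus padding -> one space
    text = re_inline_italics.sub(r'\1', text)  # {\em foo} -> foo
    text = re_curly_brace.sub(r'\1', text)     # {BERT} -> BERT
    return text.strip()


def split_authors(raw: str) -> List[str]:
    """Hypothetical helper: split an author string on ', ' and ' and '."""
    return [name for name in re_author_split.split(raw) if name]


if __name__ == '__main__':
    print(clean_title('A {BERT} model for\n  {\\em low resource} parsing'))
    print(split_authors('Lieke Gelderloos, Grzegorz Chrupała and Afra Alishahi'))
```

The italics pattern runs before the curly-brace pattern because `re_curly_brace` does not accept a backslash, so it would never match a `{\em ...}` group on its own.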
UID,title,authors,abstract,keywords,session
1,Learning to Understand Child-directed and Adult-directed Speech,Lieke Gelderloos|Grzegorz Chrupała|Afra Alishahi,"Speech directed to children differs from adult-directed speech in linguistic aspects such as repetition, word choice, and sentence length, as well as in aspects of the speech signal itself, such as prosodic and phonemic variation. Human language acquisition research indicates that child-directed speech helps language learners. This study explores the effect of child-directed speech when learning to extract semantic information from speech directly. We compare the task performance of models trained on adult-directed speech (ADS) and child-directed speech (CDS). We find indications that CDS helps in the initial stages of learning, but eventually, models trained on ADS reach comparable task performance, and generalize better. The results suggest that this is at least partially due to linguistic rather than acoustic properties of the two registers, as we s
import pandas as pd
import re
from datetime import datetime
from collections import defaultdict, OrderedDict
import yaml
re_session_extract = re.compile(r'\w+ (\w+) (\d+), (\d+) (\d+\w) ([\w\d\s:\-.,()]+-\d+) (\d+):(\d\d) UTC(.*)')
def extract_date(x):
import glob
import itertools
from lxml import etree
from networkx.utils import UnionFind
from tqdm import tqdm
def elems_same(elem1, elem2):
    return elem1.attrib == elem2.attrib
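The rest of this file is not shown. The imports (`itertools`, `UnionFind`) suggest that elements are compared pairwise and merged into groups, but that is an assumption. The sketch below illustrates the idea with the standard library only: `xml.etree.ElementTree` stands in for `lxml`, a small hand-written `UnionFind` class stands in for `networkx.utils.UnionFind`, and `group_equal_elements` is a hypothetical name.

```python
import itertools
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree


def elems_same(elem1, elem2):
    # Same check as in the snippet above: equal attribute dictionaries.
    return elem1.attrib == elem2.attrib


class UnionFind:
    """Minimal stand-in for networkx.utils.UnionFind (not its API)."""

    def __init__(self, items):
        self.parent = {item: item for item in items}

    def find(self, item):
        # Walk up to the root, halving the path as we go.
        while self.parent[item] != item:
            self.parent[item] = self.parent[self.parent[item]]
            item = self.parent[item]
        return item

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

    def groups(self):
        out = {}
        for item in self.parent:
            out.setdefault(self.find(item), []).append(item)
        return sorted(out.values())


def group_equal_elements(elems):
    """Hypothetical helper: group element indices whose attributes match."""
    uf = UnionFind(range(len(elems)))
    for (i, a), (j, b) in itertools.combinations(enumerate(elems), 2):
        if elems_same(a, b):
            uf.union(i, j)
    return uf.groups()


if __name__ == '__main__':
    root = ET.fromstring('<r><a x="1"/><b x="2"/><c x="1"/></r>')
    print(group_equal_elements(list(root)))
```

The pairwise loop is quadratic in the number of elements. For large inputs, hashing each element's attributes and grouping by key would avoid comparing every pair.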
import json
import sqlite3
from time import sleep
SCHEMA = """CREATE TABLE IF NOT EXISTS planes (
    hex TEXT NOT NULL,
    unix_time INT NOT NULL,
    flight TEXT,
    category TEXT);