@nsaphra
nsaphra / txt2giza.py
Last active August 29, 2015 14:00
Generate GIZA++ input files from segmented parallel text files, with the option to add onto previous input files.
#!/usr/bin/python
import argparse
from collections import defaultdict
parser = argparse.ArgumentParser(description='Generate GIZA++ input files from '
'segmented parallel text files.')
parser.add_argument('-s', '--src_in', help='Source input file')
parser.add_argument('-t', '--tgt_in', help='Target input file')
parser.add_argument('-p', '--prev_out', default=None, help='Previous output files prefix')
parser.add_argument('-o', '--out', help='Prefix for output files')
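The preview stops at argument parsing. A minimal sketch of the core conversion, assuming GIZA++'s usual conventions (word IDs starting at 2, with ID 1 reserved for NULL, and three-line sentence records: count, source IDs, target IDs) — the helper names here are mine, not from the gist:

```python
from collections import defaultdict

def build_vocab(sentences):
    """Map each token to a GIZA++ word ID (2..N; ID 1 is reserved for NULL)."""
    counts = defaultdict(int)
    for sent in sentences:
        for tok in sent:
            counts[tok] += 1
    vocab = {tok: i for i, tok in enumerate(sorted(counts), start=2)}
    return vocab, counts

def to_snt_block(src_sent, tgt_sent, src_vocab, tgt_vocab):
    """One bitext record: occurrence count, source word IDs, target word IDs."""
    return '\n'.join([
        '1',
        ' '.join(str(src_vocab[t]) for t in src_sent),
        ' '.join(str(tgt_vocab[t]) for t in tgt_sent),
    ])
```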
@nsaphra
nsaphra / Find.jl
Created October 14, 2014 21:37
Filesystem find one-liner
find(path::AbstractString, exec, filterfcn) = Dict(name => exec(name) for name in filter(filterfcn, readdir(path)))
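For readers outside Julia, a rough Python equivalent of the same one-liner (the parameter names are mine):

```python
import os

def find(path, exec_fn, filter_fn):
    """Apply exec_fn to every directory entry that passes filter_fn,
    returning {name: result}. Non-recursive, like readdir."""
    return {name: exec_fn(name) for name in os.listdir(path) if filter_fn(name)}
```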
@nsaphra
nsaphra / concatenate_corpus.py
Created February 17, 2015 17:29
Concatenate all the files in a directory, recursively, and print their contents.
#!/usr/bin/python
from collections import defaultdict
import json
import os
import argparse
import gzip
import sys
import codecs
from time import asctime
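The preview shows only the imports. The recursive walk itself could look like the following sketch (one plausible reading, assuming plain-text and .gz inputs as the gzip import suggests):

```python
import gzip
import os
import sys

def concatenate(root):
    """Walk root recursively and print the contents of every file."""
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # transparently handle gzipped files alongside plain text
            opener = gzip.open if name.endswith('.gz') else open
            with opener(path, 'rt') as f:
                sys.stdout.write(f.read())
```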
@nsaphra
nsaphra / LispParser.jl
Last active March 2, 2016 14:49
Simple Lisp parser for the RC pair-programming interview.
type SyntaxNode
    label::AbstractString
    parent::SyntaxNode
    children::Array{SyntaxNode}
    # TODO No error handling when going up a level with undefined parent.
    SyntaxNode() = (
        x = new();
        x.label = "";
        x.children = [];
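The Julia preview cuts off mid-constructor. For the parsing idea itself, a compact sketch of the same recursive structure in Python (tokenize on parentheses, then build a nested-list tree — this is a generic s-expression parser, not the gist's code):

```python
def tokenize(text):
    """Split an s-expression string into '(', ')', and atom tokens."""
    return text.replace('(', ' ( ').replace(')', ' ) ').split()

def parse(tokens):
    """Consume tokens from the front, returning a nested-list syntax tree."""
    tok = tokens.pop(0)
    if tok == '(':
        node = []
        while tokens[0] != ')':
            node.append(parse(tokens))
        tokens.pop(0)  # drop the closing ')'
        return node
    return tok

# parse(tokenize("(+ 1 (* 2 3))")) -> ['+', '1', ['*', '2', '3']]
```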
@nsaphra
nsaphra / zipf.py
Created April 19, 2017 15:50
Discrete power-law (Zipf) distribution sampler.
import numpy as np
from scipy import stats

def zipf(size, exponent):
    # start ranks at 1 to avoid dividing by zero; pmf proportional to rank**-exponent
    x = np.arange(1, size + 1, dtype='float')
    pmf = 1.0 / (x ** exponent)
    pmf /= pmf.sum()
    # rv_discrete takes the support and pmf as a single (xk, pk) tuple
    return stats.rv_discrete(values=(np.arange(size), pmf))
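A quick, self-contained usage sketch (the exponent and sizes here are arbitrary; note that `rv_discrete` expects its support and pmf together as a `values=(xk, pk)` tuple):

```python
import numpy as np
from scipy import stats

ranks = np.arange(1, 101, dtype='float')        # ranks 1..100
pmf = 1.0 / ranks                               # exponent = 1
pmf /= pmf.sum()                                # normalize to a proper pmf
dist = stats.rv_discrete(values=(np.arange(100), pmf))
samples = dist.rvs(size=1000, random_state=0)   # low ranks dominate
```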
@nsaphra
nsaphra / naughtandcrosses.py
Created March 6, 2017 19:14
Recurse Center interview code.
class NoughtsAndCrosses:
NOUGHT = "O"
CROSS = "X"
EMPTY = " "
STALEMATE = "Nobody"
def __init__(self):
self.board = [[self.EMPTY] * 3, [self.EMPTY] * 3, [self.EMPTY] * 3]
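The preview stops after board setup. A winner check for such a 3x3 board (a hypothetical helper in the same style, not taken from the gist) could look like:

```python
def winner(board, empty=' '):
    """Return the mark completing a row, column, or diagonal, else None."""
    lines = [list(row) for row in board]                           # rows
    lines += [[board[r][c] for r in range(3)] for c in range(3)]   # columns
    lines.append([board[i][i] for i in range(3)])                  # main diagonal
    lines.append([board[i][2 - i] for i in range(3)])              # anti-diagonal
    for line in lines:
        if line[0] != empty and line.count(line[0]) == 3:
            return line[0]
    return None
```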

Keybase proof

I hereby claim:

  • I am nsaphra on github.
  • I am nsaphra (https://keybase.io/nsaphra) on keybase.
  • I have a public key ASCpyzsqtJYqR6IjSCnoPwSjrInpOg35MPypGR9l_pvTcQo

To claim this, I am signing this object:

@nsaphra
nsaphra / tf.sh
Last active November 24, 2017 16:24
Activate a conda Jupyter notebook in tmux, for use on a server that times out sessions after each notebook start.
#!/bin/bash
if [ "$TERM" != "screen" ]
then
    if type tmux >/dev/null 2>&1
    then
        tmux att || tmux \
            new -s tensorflow -n shell \; \
            neww -n notebook "source activate tensorflow; cd Documents/dynamic_curriculum; jupyter notebook" \; \
            neww -n dir "cd Documents/dynamic_curriculum"
    fi
fi
@nsaphra
nsaphra / shuffle_corpus.py
Created July 9, 2018 14:59
Given a corpus split across two parallel files, one holding tokens and the other the corresponding POS tags, shuffle both files simultaneously so each token line stays aligned with its tag line.
# -*- coding: utf-8 -*-
import os
from random import shuffle
import argparse
parser = argparse.ArgumentParser(description='shuffle a corpus such that the tags and the original tokenized text still align')
parser.add_argument('--unshuffled_dir', type=str)
parser.add_argument('--shuffled_dir', type=str)
parser.add_argument('--tag_suffix', type=str, default='.tag')
args = parser.parse_args()
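The aligned shuffle itself amounts to zipping the two line lists, shuffling once, and unzipping. A sketch (the function name and in-memory approach are mine):

```python
import random

def shuffle_aligned(token_lines, tag_lines, seed=None):
    """Shuffle two parallel line lists with one permutation so they stay aligned."""
    assert len(token_lines) == len(tag_lines)
    paired = list(zip(token_lines, tag_lines))
    random.Random(seed).shuffle(paired)
    tokens, tags = zip(*paired)
    return list(tokens), list(tags)
```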
@nsaphra
nsaphra / token_type_counter.py
Created September 20, 2018 15:21
Count the types and tokens in a file.
import sys

types = set()
token_count = 0
for i, line in enumerate(sys.stdin):
    if i % 1000 == 0:
        sys.stderr.write('.')  # progress dots go to stderr, not the output
    tokens = line.strip().split()
    types.update(tokens)
    token_count += len(tokens)
print('types: %d\ttokens: %d' % (len(types), token_count))