Filip Ginter (fginter)
ID,FORM,LEMMA,UPOS,XPOS,FEATS,HEAD,DEPREL,DEPS,MISC=range(10)
def read_conll(inp,max_sent=0,drop_tokens=True,drop_nulls=True):
    comments=[]
    sent=[]
    yielded=0
    for line in inp:
        line=line.strip()
        if line.startswith("#"):
            comments.append(line)
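The preview above cuts off mid-function. Here is a hedged sketch of how the rest of the reader likely continues, assuming sentences end at a blank line and that drop_tokens/drop_nulls filter multiword-token (`1-2`) and empty-node (`1.1`) ID lines; everything past the lines shown above is a reconstruction, not the gist's actual code:

```python
ID,FORM,LEMMA,UPOS,XPOS,FEATS,HEAD,DEPREL,DEPS,MISC=range(10)

def read_conll(inp, max_sent=0, drop_tokens=True, drop_nulls=True):
    # Yields (comments, sentence) pairs from a CoNLL-U file object.
    # The body below the "#" branch is an assumed completion of the gist.
    comments = []
    sent = []
    yielded = 0
    for line in inp:
        line = line.strip()
        if line.startswith("#"):
            comments.append(line)
        elif not line:  # blank line terminates a sentence
            if sent:
                yield comments, sent
                yielded += 1
                if max_sent and yielded >= max_sent:
                    return
                comments, sent = [], []
        else:
            cols = line.split("\t")
            if drop_tokens and "-" in cols[ID]:  # multiword token range like "1-2"
                continue
            if drop_nulls and "." in cols[ID]:   # empty node like "1.1"
                continue
            sent.append(cols)
    if sent:  # input may not end with a blank line
        yield comments, sent
```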
@fginter
fginter / gist:2d4662faeef79acdb772
Last active August 31, 2020 06:55
Super-fast sort - uniq for ngram counting

The problem:

  • 1.3TB data with 5B lines in a 72GB .gz file
  • Need to sort the lines and get a count for each unique line, basically a sort | uniq -c
  • Have a machine with 24 cores, 128GB of memory, but not 1.3TB of free disk space
  • Solution: sort | uniq -c with lots of non-standard options and pigz to take care of compression

Here's the sort part; uniq I used as usual.

INPUT=$1

OUTPUT=${INPUT%.gz}.sorted.gz
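The preview stops after the variable setup. A hedged sketch of what the full pipeline probably looked like: the specific flags (--parallel=24, -S for the in-memory buffer, --compress-program=pigz so sort's temporary spill files are compressed) fit the 24-core/128GB machine described above but are assumptions, not the gist's actual command.

```shell
#!/bin/bash
# Hypothetical reconstruction; only INPUT/OUTPUT come from the gist.
INPUT=$1
OUTPUT=${INPUT%.gz}.sorted.gz

pigz -dc "$INPUT" \
  | LC_ALL=C sort --parallel=24 -S 100G --compress-program=pigz \
  | pigz -c > "$OUTPUT"
```

Compressing the spill files with --compress-program is what lets the job run without 1.3TB of free scratch space; uniq -c then consumes the sorted stream as usual.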

@fginter
fginter / gist:7cb32b22aabdc499428d
Created June 26, 2014 07:40
make remote tracking branch
git checkout -b branchname
git push -u origin branchname
import argparse
if __name__=="__main__":
    parser = argparse.ArgumentParser(description='Train')
    parser.add_argument('-p', '--processes', type=int, default=4, help='How many processes to run?')
    parser.add_argument('-o', '--output', required=True, help='Name of the output model.')
    parser.add_argument('input', nargs='?', help='Training file name, or nothing for training on stdin')
    args = parser.parse_args()
# Get some code that does something in a class. Note that the code is a string.
code="""
def to_string(self):
    print "x=", self.x
"""
# Print that code into a temporary Python module
with open("zzz.py","wt") as py:
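The snippet breaks off as the module is written. A hedged, Python 3 sketch of the whole trick, writing the code string to a module file, importing it, and attaching the generated function to a class; the class name `Point` and the attachment step are illustrative assumptions (the gist also uses Python 2 print syntax, modernized here so the sketch runs):

```python
import os
import sys
import tempfile

# The code to generate, as a string (print() is Python 3; the gist used Python 2).
code = """
def to_string(self):
    print("x=", self.x)
"""

# Write the string into a module file named zzz.py, as in the gist
# (placed in a temp directory here so the sketch is self-contained).
moddir = tempfile.mkdtemp()
with open(os.path.join(moddir, "zzz.py"), "wt") as py:
    py.write(code)

sys.path.insert(0, moddir)  # make the temp module importable
import zzz

class Point:  # hypothetical class to receive the generated method
    def __init__(self, x):
        self.x = x

Point.to_string = zzz.to_string  # attach the generated function as a method
Point(42).to_string()            # prints: x= 42
```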
@fginter
fginter / test_sm.py
Created May 19, 2014 14:05
test of shared memory array / numpy integration
import multiprocessing
import numpy
#Demonstrates shared memory numpy arrays with no synchronization between processes
def increment(s_arr):
    #A function for a single process
    #increment every element in s_arr by 1.0
    #s_arr is a shared array from multiprocessing
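The preview ends before the function body. A hedged completion of what test_sm.py likely does: view the shared buffer through numpy and mutate it in place, with `lock=False` so the concurrent writes are deliberately unsynchronized, as the comment above says. The element count and number of processes are illustrative assumptions.

```python
import multiprocessing
import numpy

def increment(s_arr):
    # View the shared multiprocessing buffer as a numpy array
    # (no copy) and increment every element by 1.0 in place.
    arr = numpy.frombuffer(s_arr, dtype=numpy.float64)
    arr += 1.0

if __name__ == "__main__":
    # 10 doubles in shared memory, no lock: writes race on purpose.
    s_arr = multiprocessing.Array("d", 10, lock=False)
    procs = [multiprocessing.Process(target=increment, args=(s_arr,))
             for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # With races, some increments may be lost; values end up between 1.0 and 4.0.
    print(numpy.frombuffer(s_arr, dtype=numpy.float64))
```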