Marcin Junczys-Dowmunt emjotde

auto objFromIds = [](similarity::IdType id, const std::vector<similarity::IdType>& input) {
  return new similarity::Object(id, -1, input.size() * sizeof(similarity::IdType), &input[0]);
};
similarity::initLibrary(0, LogChoice::LIB_LOGSTDERR, 0);
std::unique_ptr<similarity::Space<float>> space(new similarity::SpaceSparseJaccard<float>());
similarity::ObjectVector data; // doesn't free anything, just a vector of Object*
data.push_back(objFromIds(0, {0, 1, 2, 3, 4, 5, 6, 7, 8, 9})); // delete me
data.push_back(objFromIds(1, {2, 3, 4, 5, 6, 7})); // delete me
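For intuition about what a Jaccard space measures on the two ID sets above, here is a minimal Python sketch. It assumes the distance is one minus the Jaccard similarity (intersection size over union size); `jaccard_distance` is a hypothetical helper written for illustration and is not part of the library used in the snippet.

```python
from typing import Iterable


def jaccard_distance(a: Iterable[int], b: Iterable[int]) -> float:
    """Return 1 - |A ∩ B| / |A ∪ B| for two collections of integer IDs.

    Hypothetical helper for illustration only; the convention for two
    empty sets (distance 0.0) is an assumption.
    """
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Both sets are empty: treat them as identical.
        return 0.0
    intersection = set_a & set_b
    return 1.0 - len(intersection) / len(union)


# The two ID sets from the snippet above.
first = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
second = [2, 3, 4, 5, 6, 7]

# Six shared IDs out of ten distinct IDs overall.
print(jaccard_distance(first, second))
```

The second set is a subset of the first, so the intersection has 6 elements and the union has 10, which gives a distance of about 0.4 under this convention.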
emjotde / hugging2marian.py
Last active April 7, 2020 04:09
Convert Huggingface BERT-style model to Marian Transformer Encoder
import numpy as np
import sys
import yaml
import argparse
from transformers import BertModel
parser = argparse.ArgumentParser(description='Convert Huggingface Bert model to Marian weight file.')
parser.add_argument('--bert', help='Path to Huggingface Bert PyTorch model', required=True)
parser.add_argument('--marian', help='Output path for Marian weight file', required=True)
emjotde / nele.doc-level.txt
Created November 7, 2019 16:02
Totally cherry-picked document-level vs. sentence-level MT output for https://www.zeit.de/gesellschaft/zeitgeschehen/2018-11/chronische-schmerzen-borreliose-diagnose-ungeklaert, translated with a long-sequence translation system and with a normal sentence-level translation system.
Chronic Pain: Nele Has Pain
Ever since Nele was seven, her legs and joints have been aching.
Every single day.
To this day, doctors don't quite understand why.
Some believe it's simulated.
Someone hits Nele's knee with a hammer, with full force, over and over again.
"It's a dull pain," she says, not knowing if "dull" is a good word.
She's known this feeling for 26 years.
Every hour, every minute, it's there.
On bad days, it feels as if a backhoe is driving over Nele's legs.
emjotde / fixRapid.pl
Last active March 3, 2019 19:42
Perl script for removing documents with missing German umlauts from the WMT19 Rapid corpus; expects TSV on stdin and produces TSV on stdout.
use strict;
use utf8;
binmode(STDIN, ":utf8");
binmode(STDOUT, ":utf8");
sub checkAndPrint {
  my @doc = @_;
  my @temp = @doc;
  shift(@temp); shift(@temp); shift(@temp); shift(@temp); shift(@temp); # remove first 5 lines, seems to be boilerplate and title sections with umlauts
emjotde / transformer.big.sh
Last active June 8, 2019 12:30
WMT 2018 hyperparameters
#!/bin/bash -v
WORKSPACE=19000
SEED=0
HDFS=/hdfs/$PHILLY_VC/marcinjd
MARIAN=$HDFS/bins/marian
DATA_DIR=$HDFS/WMT.paracrawl
LOG_DIR=$PHILLY_LOG_DIR
MODEL_DIR=$PHILLY_MODEL_DIR
emjotde / keywords.cpp
Last active September 26, 2016 15:25
Function keywords a la Python for C++
#include <typeinfo>
#include <typeindex>
#include <type_traits>
#include <cstdlib>
#include <cstdint>
#include <string>
#include <tuple>
#include <iostream>
static constexpr uint32_t crc_table[256] = {
emjotde / gist:9cf3a14e164e86274340
Created December 3, 2015 20:08 — forked from karpathy/gist:587454dc0146a6ae21fc
An efficient, batched LSTM.
"""
This is a batched LSTM forward and backward pass
"""
import numpy as np
import code
class LSTM:
  @staticmethod
  def init(input_size, hidden_size, fancy_forget_bias_init = 3):