alvations

## make-15AUG19.log
$ make -j $(nproc)
Scanning dependencies of target nccl_install
Scanning dependencies of target marian_version
Scanning dependencies of target pathie-cpp
Scanning dependencies of target SQLiteCpp
Scanning dependencies of target libyaml-cpp
Scanning dependencies of target zlib
[  0%] Running cpp protocol buffer compiler on sentencepiece_model.proto
[  1%] Running cpp protocol buffer compiler on sentencepiece.proto
[  2%] Running cpp protocol buffer compiler on sentencepiece_model.proto

## train-15AUG19.log
[2019-08-15 08:31:02] [marian] Marian v1.7.8 c65c26d6 2019-08-11 18:27:00 +0100
[2019-08-15 08:31:02] [marian] Running on walle3 as process 24138 with command line:
[2019-08-15 08:31:02] [marian] /home/xyz/marian-dev/build/marian --model /disk2/models/xx-yy-r0/model.npz --type transformer --train-sets /disk2/data/xx-yy/train.sk /disk2/data/xx-yy/train.en --vocabs /disk2/models/xx-yy-r0/vocab.src.spm /disk2/models/xx-yy-r0/vocab.trg.spm --dim-vocabs 32000 32000 --mini-batch-fit --mini-batch 1000 --maxi-batch 1000 --valid-freq 10000 --save-freq 10000 --disp-freq 500 --valid-metrics ce-mean-words perplexity bleu-detok --valid-sets /disk2/data/xx-yy/valid.sk /disk2/data/xx-yy/valid.en --quiet-translation --beam-size 6 --normalize=0.6 --valid-mini-batch 16 --early-stopping 5 --cost-type=ce-mean-words --log /disk2/models/xx-yy-r0/train.log --valid-log /disk2/models/xx-yy-r0/valid.log --enc-depth 6 --dec-depth 6 --transformer-preprocess n --transformer-postprocess da --tied-embeddings-all --dim-emb 1024 --transforme

## big.txt
The Project Gutenberg EBook of The Adventures of Sherlock Holmes
by Sir Arthur Conan Doyle
(#15 in our series by Sir Arthur Conan Doyle)

Copyright laws are changing all over the world. Be sure to check the
copyright laws for your country before downloading or redistributing
this or any other Project Gutenberg eBook.

This header should be the first thing seen when viewing this Project
Gutenberg file.  Please do not remove it.  Do not change or edit the

## x.py
from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential([
    Dense(32, input_shape=(784,)),
    Activation('relu'),
    Dense(10),
    Activation('softmax'),
])

## x.lua
$ th

  ______             __   |  Torch7
 /_  __/__  ________/ /   |  Scientific computing for Lua.
  / / / _ \/ __/ __/ _ \  |
 /_/  \___/_/  \__/_//_/  |  https://github.com/torch
                          |  http://torch.ch

th> torch.Tensor{1,2,3}
 1

## toxicdataset.py
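import torch
from torch.utils.data import Dataset
from gensim.corpora import Dictionary  # assumption: Dictionary is gensim's

class ToxicDataset(Dataset):
    def __init__(self, texts, labels):
        # texts: tokenized documents (gensim's Dictionary expects lists of str).
        self.texts = texts
        # Build the vocabulary once, then reserve ids 0 and 1 for the special tokens.
        special_tokens = {'<pad>': 0, '<unk>': 1}
        self.vocab = Dictionary(texts)
        self.vocab.patch_with_special_tokens(special_tokens)

        # Vectorize labels
        self.labels = torch.tensor(labels)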

## tsundoku_vocab.py
import os
from argparse import Namespace
from collections import Counter
import json
import re
import string

import numpy as np
import pandas as pd
import torch

## surnames_with_splits.csv

          
nationality,nationality_index,split,surname
Arabic,15,train,Totah
Arabic,15,train,Abboud
Arabic,15,train,Fakhoury
Arabic,15,train,Srour
Arabic,15,train,Sayegh
Arabic,15,train,Cham
Arabic,15,train,Haik
Arabic,15,train,Kattan
Arabic,15,train,Khouri

## language-never-random.txt
                       Language is never, ever, ever, random

                                                               ADAM KILGARRIFF


Abstract
Language users never choose words randomly, and language is essentially
non-random. Statistical hypothesis testing uses a null hypothesis, which

## time_ngrams.md

      
Last active October 18, 2018 04:02

Zipping might not be as fast as we thought...

How fast can we make the ngrams() function in NLTK?

The Stack Overflow question at https://stackoverflow.com/q/21883108/610569 suggests:
def zipgrams(sequence, n):
    """ From https://stackoverflow.com/q/21883108/610569"""
    return zip(*[sequence[i:] for i in range(n)])
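
One rough way to check whether zipping really is fast is to time it against an alternative. The sketch below uses only the standard library. `teegrams` is a hypothetical comparison function (not from the gist or from NLTK) that staggers `itertools.tee` copies instead of slicing, and the 100,000-item synthetic token list is an arbitrary choice. Timings will vary by machine, so treat this as a starting point rather than a benchmark result.

```python
from itertools import islice, tee
from timeit import timeit


def zipgrams(sequence, n):
    """Slice-based n-grams, as in the snippet above (needs an indexable sequence)."""
    return zip(*[sequence[i:] for i in range(n)])


def teegrams(iterable, n):
    """Hypothetical alternative: n-grams from any iterable via itertools.tee."""
    iterators = tee(iterable, n)
    # Advance the i-th copy by i items so the copies are staggered.
    shifted = [islice(it, i, None) for i, it in enumerate(iterators)]
    return zip(*shifted)


tokens = "language is never ever ever random".split()
print(list(zipgrams(tokens, 2)))

# Rough timing on a synthetic token list; numbers vary by machine.
words = list(range(100_000))
for fn in (zipgrams, teegrams):
    seconds = timeit(lambda: list(fn(words, 3)), number=5)
    print(f"{fn.__name__}: {seconds:.3f}s for 5 runs")
```

Unlike `zipgrams`, the `tee`-based variant also accepts one-shot iterators such as generators, at the cost of extra buffering inside `tee`.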