Knut O. Hellan khellan
@khellan
khellan / README.md
Last active Mar 30, 2020
Sentencepiece 0.1.85 for Python 3.8 on OSX/Mac
View README.md

Download the file and install it:

pipenv install <path to local wheel>

There you go.

khellan / batch_deleter.py
Created Sep 21, 2018
Batchwise deletion of rows with malformed HBase row keys. The loop does not stop on its own once everything is deleted, so it needs monitoring.
View batch_deleter.py
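import happybase
# HBASE_MASTER_IP, TABLE_NAME and COLUMN_NAMES are placeholders to fill in.
connection = happybase.Connection(HBASE_MASTER_IP)
table = connection.table(TABLE_NAME)
# Loops forever: each pass deletes up to 10000 rows whose key contains a tab (\x09).
while True:
    batch = table.batch()
    for key, _ in table.scan(columns=[COLUMN_NAMES], filter="RowFilter(=, 'regexstring:.*\x09.*')", limit=10000):
        batch.delete(key)
    batch.send()
    print(key)  # most recently deleted key, as a progress indicator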
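Since the script above never terminates, here is a minimal sketch of a variant that returns once a scan finds nothing left to delete. It is not the author's code: the function name `delete_matching_rows` and its parameters are hypothetical, and it assumes the table object behaves like a happybase `Table` (`scan` with `columns`, `filter`, `limit`; `batch` with `delete` and `send`).

```python
def delete_matching_rows(table, row_filter, columns=None, batch_size=10000):
    """Delete rows matching row_filter in batches and return how many were deleted.

    `table` is assumed to offer the happybase Table interface:
    scan(columns=..., filter=..., limit=...) and batch().
    """
    total = 0
    while True:
        batch = table.batch()
        count = 0
        # Collect up to batch_size matching row keys into one batch.
        for key, _ in table.scan(columns=columns, filter=row_filter, limit=batch_size):
            batch.delete(key)
            count += 1
        if count == 0:
            # Nothing matched on this pass, so the cleanup is done.
            return total
        batch.send()
        total += count


# Hypothetical usage against a real cluster (placeholders as in the gist):
#   import happybase
#   table = happybase.Connection(HBASE_MASTER_IP).table(TABLE_NAME)
#   delete_matching_rows(table, "RowFilter(=, 'regexstring:.*\x09.*')")
```

Passing the table in as an argument keeps the loop testable without a running HBase instance. It would still loop indefinitely if the deletes were silently not applied, so some monitoring remains sensible.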
View firstnames.txt
Abigail - Nabby, Abby, Gail
Abraham - Abe, Bram
Adelaida - Ida, Idly
Alan - Al
Alastair - Al, Alex
Albert - Al, Bert
Alexander - Alex, Lex, Xander, Sander, Sandy
Alexandra - Alex, Ali, Lexie, Sandy
Alfred - Al, Alf, Alfie, Fred, Fredo
Alonzo - Lonnie
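The preview does not say how this file is consumed. As an illustration only, a line format like the one above ("Name - Nick, Nick, ...") could be loaded into a lookup table as sketched here; `parse_nicknames` is a hypothetical helper, not part of the gist.

```python
def parse_nicknames(lines):
    """Map each given name to its list of nicknames.

    Expects lines shaped like "Abigail - Nabby, Abby, Gail".
    Blank lines and lines without the " - " separator are skipped.
    """
    nicknames = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, sep, rest = line.partition(" - ")
        if not sep:
            continue
        nicknames[name.strip()] = [n.strip() for n in rest.split(",") if n.strip()]
    return nicknames


sample = [
    "Abigail - Nabby, Abby, Gail",
    "Alan - Al",
]
table = parse_nicknames(sample)
```

With a real file, the same function can be fed an open file handle, since it only iterates over lines.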
View names.input
Satya Nadella
B Turner
Lisa Brummel
Rupert Bader
Janet Kennedy
Jordan Levin
Horacio Rrez
Christophe Capossela
Angela Jones
David Aucsmith
View gist:6d34eacb25cb3a30eb3e7568ff9d9e61
package no.companybook.extraction.tables;
import org.junit.Test;
import java.util.HashSet;
import java.util.Set;
import static org.junit.Assert.*;
public class PersonTest {
khellan / settings.py
Last active May 31, 2016
Frontera scrapy fetch error
View settings.py
2016-05-31 21:08:31 [scrapy] INFO: Scrapy 1.1.0 started (bot: cb_crawl)
2016-05-31 21:08:31 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'cb_crawl.spiders', 'DOWNLOAD_TIMEOUT': 60, 'ROBOTSTXT_OBEY': True, 'DEPTH_LIMIT': 10, 'CONCURRENT_REQUESTS_PER_DOMAIN': 1, 'CONCURRENT_REQUESTS': 256, 'RETRY_ENABLED': False, 'SPIDER_MODULES': ['cb_crawl.spiders'], 'AUTOTHROTTLE_START_DELAY': 0.25, 'REACTOR_THREADPOOL_MAXSIZE': 20, 'BOT_NAME': 'cb_crawl', 'AJAXCRAWL_ENABLED': True, 'COOKIES_ENABLED': False, 'USER_AGENT': 'cb crawl (+http://www.companybooknetworking.com)', 'SCHEDULER': 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler', 'REDIRECT_ENABLED': False, 'AUTOTHROTTLE_ENABLED': True, 'DOWNLOAD_DELAY': 0.25}
2016-05-31 21:08:31 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.throttle.AutoThrottle']
2016-05-31 21:08:31 [scrapy] INFO: Enabled downloader middlewares
khellan / word2vec_optimized.py
Last active Jun 22, 2018
A version of the optimized word2vec that doesn't require access to the training data when restoring the saved model. Run python tensorflow/tensorflow/models/embedding/word2vec_optimized.py --save_path=/Users/knut/data/wiki/model --embedding_size=500 --use --interactive to test.
View word2vec_optimized.py
# Copyright 2015 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
khellan / word2vec.py
Created Nov 30, 2015
TensorFlow word2vec with model loading
View word2vec.py
"""Multi-threaded word2vec mini-batched skip-gram model.
Trains the model described in:
(Mikolov, et. al.) Efficient Estimation of Word Representations in Vector Space
ICLR 2013.
http://arxiv.org/abs/1301.3781
This model does traditional minibatching.
The key ops used are:
* placeholder for feeding in tensors for each example.
khellan / JRuby 1.6.7 double resume
Created Jun 7, 2012
Double resume in JRuby. Note that the result in JRuby varies between runs, so it seems to be timing-sensitive.
View JRuby 1.6.7 double resume
ruby -v
jruby 1.6.7 (ruby-1.9.2-p312) (2012-02-22 3e82bc8) (Java HotSpot(TM) 64-Bit Server VM 1.7.0_01) [linux-amd64-java]
ruby test/double_resume.rb
Loaded suite test/double_resume
Started
E
Finished in 0.157000 seconds.
1) Error:
test_0001_should_raise_double_resume(ResumingFiberSpec):
khellan / gobbler.erl
Created May 15, 2012
Stepwise introduction to a distributed erlang message loop
View gobbler.erl
-module(gobbler).
-behaviour(gen_server).
-export([code_change/3, handle_call/3, handle_cast/2, handle_info/2]).
-export([init/1, start_link/0, terminate/2]).
-export([count/0, increment/0, stop/0]).
count() -> gen_server:call(?MODULE, count).