Adam Bittlingmayer bittlingmayer

@bittlingmayer
bittlingmayer / README.md
Last active April 23, 2020 14:18
Split a file for train and test (randomly but without shuffling it or otherwise changing the order)
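
A minimal sketch of the idea in this description, not the gist's own code: each line is independently assigned to the train or test set with a fixed probability, so the lines in each output keep their original relative order. The function name `split_lines` and the parameters `test_fraction` and `seed` are hypothetical.

```python
import random


def split_lines(lines, test_fraction=0.1, seed=None):
    """Randomly assign each line to train or test, preserving input order."""
    rng = random.Random(seed)
    train, test = [], []
    for line in lines:
        # One independent draw per line; nothing is shuffled or reordered.
        if rng.random() < test_fraction:
            test.append(line)
        else:
            train.append(line)
    return train, test


if __name__ == "__main__":
    sample = ["line %d" % i for i in range(20)]
    train, test = split_lines(sample, test_fraction=0.25, seed=0)
    print(len(train), len(test))
```

Because every line is drawn independently, the test set size is only approximately `test_fraction` of the input; that is the price of a single pass that never reorders the file.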

Keybase proof

I hereby claim:

  • I am bittlingmayer on github.
  • I am bittlingmayer (https://keybase.io/bittlingmayer) on keybase.
  • I have a public key ASDgJRCUjWeRFQLnx9CrY6VkCOzEaYoqf8tcPAmISJfnCwo

To claim this, I am signing this object:

<!DOCTYPE html SYSTEM "">
<html lang="de"><head><title>Ausverkauf bei Italien-Bonds</title><meta content="Anleger fürchten den EU-Kurs der neuen Italienregierung und verkaufen ihre Staatsanleihen. Das hat auch Auswirkungen auf andere EU-Staaten. " name="description"/><meta content="" name="keywords"/><meta content="http://app.handelsblatt.com/images/flaggen-italiens-und-der-eu/22593738/2-formatOriginal.jpg" property="outbrain:image"/>
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "NewsArticle",
"headline": "Ausverkauf bei Italien-Bonds",
"image": {
bittlingmayer / README.md
Last active December 2, 2017 11:10
mozilla/Readability #412

See mozilla/readability#412

To set up:

npm install

Modify function _cleanClasses(node) in node_modules/readability/Readability.js

bittlingmayer / ft_wiki_preproc.py
Last active March 4, 2019 22:56
fastText pre-trained vectors preprocessing [moved to ftio.wiki.preproc - pip install ftio / https://github.com/SignalN/ftio]
# See https://github.com/facebookresearch/fastText/blob/master/get-wikimedia.sh
#
# From https://github.com/facebookresearch/fastText/issues/161:
#
# We now have a script called 'get-wikimedia.sh', that you can use to download and
# process a recent wikipedia dump of any language. This script applies the preprocessing
# we used to create the published word vectors.
#
# The parameters we used to build the word vectors are the default skip-gram settings,
# except with a dimensionality of 300 as indicated on the top of the list of word
bittlingmayer / download.sh
Last active November 3, 2017 17:46
Download fastText pre-trained models for many languages [moved to ftio/wiki/download.sh - pip install ftio / https://github.com/SignalN/ftio]
# See https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
for (( i=1; i<=$#; i++ )); do
    wget -c "https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.${!i}.zip"
done
# For example:
# ./download.sh bg el ka hy ru fa es fr de it pt ar tr pl ko
# If interrupted, it will not restart automatically, but when rerun it resumes from where it stopped (wget -c).
bittlingmayer / ngrams.py
Last active September 2, 2017 11:16
ngrams.py [moved to language.ngrams - pip install language / https://github.com/SignalN/language/]
def __ngrams(s, n=3):
    # Raw n-grams on sequences
    # If given a string, it will return char-level n-grams.
    # If given a list of words, it will return word-level n-grams.
    return list(zip(*[s[i:] for i in range(n)]))

def ngrams(s, n=3):
    # Does not take n-grams across word boundaries (' ')
    # If a word is shorter than n, the n-gram is the word.
    unpack = lambda l: sum(l, [])
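
The preview cuts off inside `ngrams`, so the following is only a sketch of the behaviour its comments describe, not the author's implementation. The names `raw_ngrams` and `word_bounded_ngrams` are hypothetical, and returning the word-bounded n-grams as strings is an assumption.

```python
def raw_ngrams(seq, n=3):
    """N-grams over any sliceable sequence (string or list), as tuples."""
    return list(zip(*[seq[i:] for i in range(n)]))


def word_bounded_ngrams(text, n=3):
    """Character n-grams that never span a space."""
    grams = []
    for word in text.split(" "):
        if not word:
            continue  # skip empty pieces produced by repeated spaces
        if len(word) < n:
            grams.append(word)  # short word: the word itself is the n-gram
        else:
            grams.extend("".join(g) for g in raw_ngrams(word, n))
    return grams


print(raw_ngrams("abcd", 3))
print(raw_ngrams(["a", "b", "c", "d"], 2))
print(word_bounded_ngrams("to be fast", 3))
```

Splitting on spaces first and taking n-grams per word is what keeps any n-gram from crossing a word boundary.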
bittlingmayer / fasttext_similarity.py
Created June 10, 2017 12:50
Similarity for two files output by fastText print-word-vectors or print-sentence-vectors
"""
Takes two files produced by fastText's print-word-vectors or print-sentence-vectors and compares the vectors by similarity.
(See https://github.com/facebookresearch/fastText.)
This can be useful for benchmarking output or even generating benchmark data.
For example:
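
The docstring preview is truncated at this point. Separately from the gist's own code, here is a rough sketch of one way such a comparison could work. It assumes each line has the form `token v1 v2 ... vn` (a single token followed by floats) and uses cosine similarity, which the source does not confirm. The names `parse_vectors`, `cosine`, and `compare` are hypothetical.

```python
import math


def parse_vectors(lines):
    """Parse lines of the assumed form 'token v1 v2 ... vn'."""
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        pairs.append((parts[0], [float(x) for x in parts[1:]]))
    return pairs


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def compare(lines_a, lines_b):
    """Compare vectors found at the same position in both inputs."""
    a, b = parse_vectors(lines_a), parse_vectors(lines_b)
    return [(ta, tb, cosine(va, vb)) for (ta, va), (tb, vb) in zip(a, b)]


print(compare(["cat 1 0", "dog 1 1"], ["kitten 2 0", "car -1 1"]))
```

Pairing by line position is the simplest choice when both files were produced from aligned inputs; matching by token instead would need a dictionary lookup.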
bittlingmayer / shplit.py
Created June 2, 2017 16:17
Shplit.py - shuffle+split a data file
# Shplit.py - shuffle+split a data file
#
# Positional Arguments:
# 1: the filename
# 2: the split factor
#
# The filename must have an extension.
#
# Example:
#
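
The script preview ends here. As an illustration only, and not the gist's code, a shuffle-then-split step might look like the sketch below. It assumes the "split factor" is the fraction of lines that goes to the first output; `shuffle_split` and `seed` are hypothetical names.

```python
import random


def shuffle_split(lines, factor, seed=None):
    """Shuffle a copy of the lines, then cut it at the given fraction."""
    rng = random.Random(seed)
    shuffled = list(lines)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * factor)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    first, second = shuffle_split(["a", "b", "c", "d", "e"], 0.8, seed=0)
    print(len(first), len(second))
```

Unlike an order-preserving split, this gives exact output sizes, at the cost of holding the whole file in memory and discarding the original order.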
bittlingmayer / README.md
Last active May 29, 2017 11:31
Amazon Reviews Sentiment with fastText [example]

This code uses fastText supervised learning to predict output labels from input text.

Approach

This is the baseline code. I have not changed anything.

Preprocessing

I applied lowercasing, so "This is a TEST!" becomes "this is a test!".
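
The lowercasing step can be sketched in a couple of lines. This is an illustration under the assumption that the data is processed one line at a time; `lowercase_lines` is a hypothetical name, not a function from the gist.

```python
def lowercase_lines(lines):
    """Yield each input line lowercased; punctuation and spacing are untouched."""
    for line in lines:
        yield line.lower()


print(list(lowercase_lines(["This is a TEST!"])))  # ['this is a test!']
```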

Parameters