Adam Bittlingmayer bittlingmayer

## README.md

      
Last active April 23, 2020 14:18

Split a file for train and test (randomly but without shuffling it or otherwise changing the order)

Split a file without shuffling

Often we want a random sample for test. Usually that's done by shuffling. But occasionally we want to preserve the order in train.
This script removes a random sample without otherwise changing the order. It shuffles the original, takes a random sample for test, and then removes all lines that occur in the sample from train. (See https://stackoverflow.com/questions/4366533/how-to-remove-the-lines-which-appear-on-file-b-from-another-file-a.)
. split.sh example.txt
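
The idea described above can be sketched in Python. This is not the contents of split.sh: it is a minimal illustration that picks test lines by position rather than by shuffling and then filtering by line content, and the names `split_preserving_order`, `test_size` and `seed` are hypothetical. One difference worth noting: filtering by content, as the README describes, also drops any duplicate of a sampled line from train, whereas this sketch removes only the sampled positions.

```python
import random


def split_preserving_order(lines, test_size, seed=None):
    """Return (train, test); both keep the original relative order."""
    rng = random.Random(seed)
    # Choose which positions go to test instead of shuffling the lines themselves.
    test_idx = set(rng.sample(range(len(lines)), test_size))
    train = [line for i, line in enumerate(lines) if i not in test_idx]
    test = [line for i, line in enumerate(lines) if i in test_idx]
    return train, test


lines = [f"line {i}" for i in range(10)]
train, test = split_preserving_order(lines, test_size=3, seed=0)
print(len(train), len(test))
```

Passing a fixed `seed` makes the split reproducible, which helps when the same train/test files need to be regenerated later.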


## keybase.md

      
Created October 7, 2019 05:24

Keybase proof

I hereby claim:

I am bittlingmayer on github.
I am bittlingmayer (https://keybase.io/bittlingmayer) on keybase.
I have a public key ASDgJRCUjWeRFQLnx9CrY6VkCOzEaYoqf8tcPAmISJfnCwo

To claim this, I am signing this object:

  
## original.html
<!DOCTYPE html SYSTEM "">
<html lang="de"><head><title>Ausverkauf bei Italien-Bonds</title><meta content="Anleger fürchten den EU-Kurs der neuen Italienregierung und verkaufen ihre Staatsanleihen. Das hat auch Auswirkungen auf andere EU-Staaten. " name="description"/><meta content="" name="keywords"/><meta content="http://app.handelsblatt.com/images/flaggen-italiens-und-der-eu/22593738/2-formatOriginal.jpg" property="outbrain:image"/>
<script type="application/ld+json">

    {
    "@context": "http://schema.org",
    "@type": "NewsArticle",
    "headline": "Ausverkauf bei Italien-Bonds",

        "image": {

## README.md

      
Last active December 2, 2017 11:10

mozilla/Readability #412

See mozilla/readability#412

To set up:
npm install

Modify function _cleanClasses(node) in node_modules/readability/Readability.js

  
## ft_wiki_preproc.py
# See https://github.com/facebookresearch/fastText/blob/master/get-wikimedia.sh
#
# From https://github.com/facebookresearch/fastText/issues/161:
#
# We now have a script called 'get-wikimedia.sh', that you can use to download and
# process a recent wikipedia dump of any language. This script applies the preprocessing
# we used to create the published word vectors.
#
# The parameters we used to build the word vectors are the default skip-gram settings,
# except with a dimensionality of 300 as indicated on the top of the list of word

## download.sh
# See https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

# Download the vectors for every language code passed as an argument
for lang in "$@"; do
    wget -c "https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.${lang}.zip"
done

# For example:
# ./download.sh bg el ka hy ru fa es fr de it pt ar tr pl ko

# If interrupted, the script does not restart on its own, but when it is re-run, wget -c resumes each download where it stopped.

## ngrams.py
def __ngrams(s, n=3):
    # Raw n-grams on sequences
    # If given a string, it will return char-level n-grams.
    # If given a list of words, it will return word-level n-grams.
    return list(zip(*[s[i:] for i in range(n)]))

def ngrams(s, n=3):
    # Does not take n-grams across word boundaries (' ')
    # If a word is shorter than n, the n-gram is the word.
    unpack = lambda l: sum(l, [])
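
The preview of ngrams.py is cut off above, so the following is only a sketch of the behaviour its comments describe, not the author's implementation. The names `raw_ngrams` and `word_bounded_ngrams` are hypothetical, and a plain loop stands in for the `unpack` helper.

```python
def raw_ngrams(s, n=3):
    # Sliding window over any sequence: characters of a string or items of a list.
    return list(zip(*[s[i:] for i in range(n)]))


def word_bounded_ngrams(text, n=3):
    # Character n-grams that never span a space.
    grams = []
    for word in text.split(' '):
        if not word:
            continue  # skip empty pieces produced by repeated spaces
        if len(word) < n:
            # A word shorter than n contributes itself as a single gram.
            grams.append(tuple(word))
        else:
            grams.extend(raw_ngrams(word, n))
    return grams


print(raw_ngrams(["a", "b", "c", "d"], 2))
print(word_bounded_ngrams("to the moon"))
```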

## fasttext_similarity.py
"""

Takes two files produced by fastText's print-word-vectors or print-sentence-vectors and compares the vectors by similarity.

(See https://github.com/facebookresearch/fastText.)

This can be useful for benchmarking output or even generating benchmark data.

For example:
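
The author's own example is cut off in this preview. As a separate illustration, the comparison described above might look like the sketch below. It rests on assumptions: each line holds a token followed by space-separated floats, the two files are aligned line by line, and the similarity measure is cosine similarity (the docstring does not say which measure the script uses). All function names are hypothetical.

```python
import math


def parse_vector_line(line):
    # Assumed layout: a token followed by space-separated floats.
    token, *values = line.split()
    return token, [float(v) for v in values]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def compare_lines(lines_a, lines_b):
    # Pair the two outputs line by line and score each pair.
    results = []
    for line_a, line_b in zip(lines_a, lines_b):
        token, vec_a = parse_vector_line(line_a)
        _, vec_b = parse_vector_line(line_b)
        results.append((token, cosine_similarity(vec_a, vec_b)))
    return results


sample_a = ["cat 1.0 0.0", "dog 0.0 2.0"]
sample_b = ["cat 1.0 0.0", "dog 3.0 0.0"]
print(compare_lines(sample_a, sample_b))
```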

## shplit.py
#   Shplit.py - shuffle+split a data file
#
#   Positional Arguments:
#       1: the filename
#       2: the split factor
#
#   The filename must have an extension.
#
#   Example:
#
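
The header above is truncated before its example, so here is a hedged sketch of a shuffle+split rather than the author's code. It assumes the split factor is the fraction of lines that goes into the first part, and it assumes the extension is required so that output names can be derived from it; the `.train`/`.test` naming and both function names are hypothetical.

```python
import os
import random


def shuffle_split(lines, factor, seed=None):
    # factor is assumed to be the fraction of lines placed in the first part.
    rng = random.Random(seed)
    shuffled = list(lines)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * factor)
    return shuffled[:cut], shuffled[cut:]


def output_names(filename):
    # Hypothetical naming scheme: data.txt -> data.train.txt and data.test.txt.
    root, ext = os.path.splitext(filename)
    if not ext:
        raise ValueError("the filename must have an extension")
    return f"{root}.train{ext}", f"{root}.test{ext}"


rows = [f"row {i}" for i in range(10)]
first, second = shuffle_split(rows, 0.8, seed=42)
print(len(first), len(second), output_names("data.txt"))
```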

## README.md

      
Last active May 29, 2017 11:31

Amazon Reviews Sentiment with fastText [example]

This code uses fastText supervised learning to predict output labels from input text.

Approach

This is the baseline code. I have not changed anything.

Preprocessing

I applied lowercasing, so "This is a TEST!" becomes "this is a test!".
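
A minimal sketch of that lowercasing step, assuming the training lines use fastText's default `__label__` prefix and that label tokens should be left untouched (the README does not say how labels were handled). The function name `lowercase_text` is hypothetical.

```python
def lowercase_text(line, label_prefix="__label__"):
    # Lowercase every token except those that look like fastText labels.
    tokens = line.split(' ')
    return ' '.join(t if t.startswith(label_prefix) else t.lower() for t in tokens)


print(lowercase_text("This is a TEST!"))
print(lowercase_text("__label__2 This is a TEST!"))
```
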
Parameters