Word segmentation in a nutshell.

README.md

A combinatorial question: how many segmentations of a word of length n are there?

Intuition:

helloworld
h|elloworld
he|lloworld
...
h|e|lloworld
...
h|e|l|l|o|w|o|r|l|d

You can place a single bar in any of the n - 1 positions between characters (n - 1 choose 1), two bars in (n - 1 choose 2) ways, and so forth. This is the number of k-combinations, summed over all k.

Another way to look at it: we have n - 1 slots to place bars into. For each slot, we can decide whether it stays empty or gets a bar, so the total number of distinct outcomes amounts to 2 ^ (n - 1).

You can also view the space as the set of all binary strings of length n - 1.

The number of segmentations is exponential in the word length, hence not suitable for any brute force method.
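
To make the count concrete, here is a small sketch (not part of the gist's code) that enumerates every segmentation by deciding, for each of the n - 1 slots, whether to place a bar:

```python
from itertools import product

def all_segmentations(word):
    """Enumerate every segmentation by deciding, for each of the
    n - 1 slots between characters, whether to place a bar."""
    n = len(word)
    results = []
    for bars in product([False, True], repeat=n - 1):
        parts, start = [], 0
        for i, bar in enumerate(bars, start=1):
            if bar:
                parts.append(word[start:i])
                start = i
        parts.append(word[start:])
        results.append(parts)
    return results

segs = all_segmentations('hello')
print(len(segs))  # 2 ** (5 - 1) = 16
```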

We can't be too greedy here. For example:

lovesnails

Should not automatically be resolved to

love snails

but

loves nails

needs to be considered, too.

One can use a metric (which is introduced later) to find the most probable segmentation. The computation is simple, fewer than 10 lines:

def split_pairs(word):
    return [(word[:i + 1], word[i + 1:]) for i in range(len(word))]

def segment(word, word_seq_fitness=None):
    if not word or not word_seq_fitness:
        return []
    all_segmentations = [[first] + segment(rest, word_seq_fitness=word_seq_fitness)
                         for (first, rest) in split_pairs(word)]
    return max(all_segmentations, key=word_seq_fitness)
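
As a quick sanity check, here is a self-contained toy run of the snippet above; the word list and fitness metric are made up for illustration:

```python
# Hypothetical word list; a real model would use corpus frequencies.
KNOWN = {'love', 'loves', 'snails', 'nails'}

def split_pairs(word):
    return [(word[:i + 1], word[i + 1:]) for i in range(len(word))]

def segment(word, word_seq_fitness=None):
    if not word or not word_seq_fitness:
        return []
    all_segmentations = [[first] + segment(rest, word_seq_fitness=word_seq_fitness)
                         for (first, rest) in split_pairs(word)]
    return max(all_segmentations, key=word_seq_fitness)

def toy_fitness(words):
    # Reward known words, penalize fragment count (toy metric).
    return sum(w in KNOWN for w in words) - len(words)

print(segment('lovesnails', word_seq_fitness=toy_fitness))  # ['love', 'snails']
```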

The problem now shifts to the question of which words have meaning. Choosing a dictionary list might result in a poor model, since a lot of real-world strings do not make it into the dictionary; for example, bbconline could be problematic.

A solution is to create a probabilistic model of the English language.

In the simplest case, we need a list of words and associated counts, representing each word's frequency in some corpus. The task is then to calculate the likelihood that a given sequence of words has been drawn from the distribution that generated this corpus (and whose frequencies we know).

Bayes Rule

Determine the probability of each word in the candidate segmentation and multiply their probabilities. This is Bayes' rule with the assumption that each event (the occurrence of a word) is independent of the others.

The independence assumption is strong and quite wrong: words are not independent of each other at all. (Is this alone enough to call it a bag-of-words model?)

This is also called naive Bayes, and it is called naive because it makes the wrong, but useful, independence assumption: a data point has a number of features, all assumed independent.
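
Under this assumption, scoring a candidate segmentation reduces to a sum of log probabilities. A toy sketch (the counts are made up for illustration):

```python
import math

# Toy unigram counts; the numbers are made up for illustration.
counts = {'who': 120, 'represents': 30, 'whore': 5, 'presents': 40}
total = sum(counts.values())

def log_prob(words):
    """Sum of log10 unigram probabilities, i.e. the log of the product
    of P(w) under the naive independence assumption."""
    return sum(math.log10(counts[w] / float(total)) for w in words)

a = log_prob(['who', 'represents'])
b = log_prob(['whore', 'presents'])
print(a > b)  # True: the segmentation built from frequent words wins
```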

For natural language, n-grams should also be considered, since words tend to appear in groups. One can also model occurrences within texts without a fixed order (e.g. with a sequence memoizer).

We restrict n to 5 here.
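
For example, a bigram model (the n = 2 case) conditions each word on its predecessor. A toy sketch with a made-up corpus and add-one smoothing:

```python
import math
from collections import Counter

# Toy corpus; in practice one would use large n-gram counts.
corpus = 'the quick brown fox jumps over the lazy dog the quick fox'.split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_log_prob(words):
    """log P(w1) + sum of log P(w_i | w_{i-1}), with add-one smoothing."""
    vocab = len(unigrams)
    score = math.log10(unigrams[words[0]] / float(sum(unigrams.values())))
    for prev, cur in zip(words, words[1:]):
        score += math.log10((bigrams[(prev, cur)] + 1.0) / (unigrams[prev] + vocab))
    return score

# 'the quick' occurs in the corpus, 'the dog' does not.
print(bigram_log_prob(['the', 'quick']) > bigram_log_prob(['the', 'dog']))  # True
```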

Implementation

See segmentation.py.

Drawbacks, Improvements

A bigger corpus does not necessarily mean that the analysis gets better. There is also the problem of words that do not appear in the corpus at all.
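
segmentation.py addresses this with a length-dependent penalty for unseen words: the longer an unknown string, the less probable it is as a single word, which nudges the model toward splitting it into known words.

```python
def unknown_word_prob(word, total):
    # Penalty from segmentation.py's OneGramDist.__call__: the probability
    # shrinks by a factor of 10 for every character beyond the second.
    return 1.0 / (total * 10 ** (len(word) - 2))

total = 1000000  # hypothetical corpus size
print(unknown_word_prob('xy', total) > unknown_word_prob('xyzzyx', total))  # True
```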

gutenbergfreq.py
#!/usr/bin/env python
# coding: utf-8
 
import collections
import glob
import os
import re
import string
 
 
if __name__ == '__main__':
    # pattern = re.compile('[\W_]+')
    # for path in glob.glob('/media/mtc/Data/tmp/gberg-*'):
    #     counter = collections.Counter()
    #     outfile = '%s.freq' % path
    #     print(path, outfile)
    #     if os.path.exists(outfile):
    #         continue
    #     with open(path) as handle:
    #         for line in handle:
    #             line = line.lower()
    #             for token in line.split():
    #                 token = pattern.sub('', token)
    #                 if token:
    #                     counter[token] += 1
    #     with open('%s.freq' % path, 'w') as output:
    #         for key, value in counter.items():
    #             output.write('%s\t%s\n' % (key, value))

    counter = collections.Counter()
    for i, path in enumerate(glob.glob('/media/mtc/Data/tmp/gberg-*freq')):
        with open(path) as handle:
            print(path, len(counter), i)
            for line in handle:
                token, frequency = line.strip().split()
                frequency = int(frequency)
                if len(token) > 30:
                    continue
                else:
                    counter[token] += frequency

    with open('./gutenberg.freq', 'w') as output:
        for key, value in counter.items():
            output.write('%s\t%s\n' % (key, value))
segmentation.py
#!/usr/bin/env python
# coding: utf-8
# http://stackoverflow.com/q/19675106/89391
 
import functools
import math
import os
import sys
import gzip
import urllib
 
 
def split_pairs(word):
    return [(word[:i + 1], word[i + 1:]) for i in range(len(word))]


def segment(word, word_seq_fitness=None):
    if not word or not word_seq_fitness:
        return []
    all_segmentations = [[first] + segment(rest, word_seq_fitness=word_seq_fitness)
                         for (first, rest) in split_pairs(word)]
    # print(len(all_segmentations))
    return max(all_segmentations, key=word_seq_fitness)


class OneGramDist(dict):
    """
    1-gram probability distribution for corpora.
    Source: http://norvig.com/ngrams/count_1w.txt
    """
    def __init__(self, filename='count_1w_cleaned.txt'):
        self.total = 0
        print('building probability table...')
        _open = open
        if filename.endswith('gz'):
            _open = gzip.open
        with _open(filename, 'rt') as handle:
            for line in handle:
                word, count = line.strip().split('\t')
                self[word] = int(count)
                self.total += int(count)

    def __call__(self, word):
        try:
            result = float(self[word]) / self.total
            # print(word, result)
        except KeyError:
            # result = 1.0 / self.total
            return 1.0 / (self.total * 10**(len(word) - 2))
            # return sys.float_info.min
        return result


def onegram_log(onegrams, words):
    """
    Use the log trick to avoid tiny quantities.
    http://machineintelligence.tumblr.com/post/4998477107/the-log-sum-exp-trick
    """
    result = functools.reduce(lambda x, y: x + y,
                              (math.log10(onegrams(w)) for w in words))
    print(words, result)
    return result


if __name__ == '__main__':
    sentence = ''.join(sys.argv[1:]).lower()
    if not sentence:
        sentence = 'thequickbrown'
    # onegrams = OneGramDist(filename='count_10M_gb.txt')
    onegrams = OneGramDist(filename='count_1M_gb.txt.gz')
    # onegrams = OneGramDist(filename='count_1w.txt')
    onegram_fitness = functools.partial(onegram_log, onegrams)
    print(sentence)
    print(segment(sentence, word_seq_fitness=onegram_fitness))
segmentation_examples.txt
Examples (using 1M or 10M frequency list from Gutenberg):
 
$ python segmentation.py penisland
...
(['p', 'en', 'island'], -15.638518785248923)
(['pe', 'nis', 'land'], -15.108322797049299)
(['pen', 'island'], -8.375691641879506)
(['peni', 'sl', 'and'], -14.687786244444453)
(['penis', 'land'], -9.672682030503244)
(['penisl', 'and'], -15.235864890453465)
(['penisla', 'nd'], -20.26133790488834)
(['penislan', 'd'], -24.41870399665061)
(['penisland'], -16.709351998325303)
['pen', 'island']
 
$ python segmentation.py therapistfinder
...
(['t', 'he', 'rapist', 'finder'], -24.748890963614368)
(['th', 'era', 'pist', 'finder'], -21.212173151805697)
(['the', 'rapist', 'finder'], -15.237632013697695)
(['ther', 'a', 'pist', 'finder'], -19.589487264674233)
(['thera', 'pist', 'finder'], -19.16492932696031)
(['therap', 'ist', 'finder'], -17.992610282256244)
(['therapi', 'st', 'finder'], -24.47141671545925)
(['therapis', 'tf', 'in', 'der'], -21.371078882293887)
(['therapist', 'finder'], -13.623547247250375)
(['therapistf', 'in', 'der'], -22.861482061222056)
(['therapistfi', 'nder'], -25.98413509261641)
(['therapistfin', 'der'], -23.10614521275257)
(['therapistfind', 'er'], -24.395855144970255)
(['therapistfinde', 'r'], -30.418703996650606)
(['therapistfinder'], -22.709351998325303)
['therapist', 'finder']
 
$ python segmentation.py whorepresents
...
(['w', 'ho', 'represents'], -18.14739742951673)
(['wh', 'ore', 'presents'], -15.423500721603162)
(['who', 'represents'], -7.398759138614212)
(['whor', 'ep', 'resents'], -20.08562578090216)
(['whore', 'presents'], -10.445817549886325)
(['whorep', 'resents'], -15.673004770048102)
(['whorepr', 'esents'], -23.816644005322647)
(['whorepre', 'sents'], -23.098557710539556)
(['whorepres', 'ents'], -23.31936871896465)
(['whoreprese', 'nts'], -25.459662604329512)
(['whorepresen', 'ts'], -24.458042989379624)
(['whorepresent', 's'], -28.418703996650606)
(['whorepresents'], -20.709351998325303)
['who', 'represents']
 
It's funny to see it *work* even on gibberish (Kauderwelsch):
 
$ python segmentation.py zugarrivesatgaredunord
 
(['z', 'ug', 'arrives', 'at', 'gare', 'du', 'nord'], -32.50373110714555)
(['zu', 'g', 'arrives', 'at', 'gare', 'du', 'nord'], -29.771130940523193)
(['zug', 'arrives', 'at', 'gare', 'du', 'nord'], -27.6931533335704)
(['zuga', 'rrivesatgaredunord'], -37.43176816321052)
(['zugar', 'rives', 'at', 'gare', 'du', 'nord'], -35.27313676138066)
(['zugarr', 'ive', 'sat', 'gare', 'du', 'nord'], -35.62167660702414)
(['zugarri', 'vesat', 'gare', 'du', 'nord'], -37.06730572297619)
(['zugarriv', 'es', 'at', 'gare', 'du', 'nord'], -36.19798266766783)
(['zugarrive', 'sat', 'gare', 'du', 'nord'], -34.778801974903615)
(['zugarrives', 'at', 'gare', 'du', 'nord'], -34.347608226189784)
(['zugarrivesa', 'tg', 'are', 'du', 'nord'], -36.61899603139459)
(['zugarrivesat', 'gare', 'du', 'nord'], -34.06742498500572)
(['zugarrivesatg', 'are', 'du', 'nord'], -31.83761040103289)
(['zugarrivesatga', 'red', 'un', 'ord'], -34.25155677971655)
(['zugarrivesatgar', 'e', 'du', 'nord'], -34.7329574229286)
(['zugarrivesatgare', 'du', 'nord'], -32.353490844484725)
(['zugarrivesatgared', 'un', 'ord'], -33.482848511720604)
(['zugarrivesatgaredu', 'nord'], -30.97651206109257)
(['zugarrivesatgaredun', 'ord'], -32.2095721169088)
(['zugarrivesatgareduno', 'rd'], -33.55078317272377)
(['zugarrivesatgaredunor', 'd'], -32.66984483306061)
(['zugarrivesatgaredunord'], -29.71588408160526)
['zug', 'arrives', 'at', 'gare', 'du', 'nord']
test_segmentation.py
#!/usr/bin/env python
# coding: utf-8
 
import unittest
from segmentation import *
 
class SegmentationTests(unittest.TestCase):
    """
    Tests for word segmentation functions.
    """
    def test_split_pairs(self):
        self.assertEqual(len(split_pairs("hello")), 5)
        self.assertEqual(split_pairs("hi"), [("h", "i"), ("hi", "")])
        self.assertEqual(split_pairs("yay"), [("y", "ay"),
                                              ("ya", "y"),
                                              ("yay", "")])


if __name__ == '__main__':
    unittest.main()
 
