Create a gist now

Instantly share code, notes, and snippets.

@miku /.gitignore
Last active Feb 6, 2017

What would you like to do?
Word segmentation in a nutshell.


Combinatorical question, how many segmentations of a word of length n are there?



You can have a single bar in n places (n choose 1), two bars in n choose 2 places and so forth. This is also known as the number of k-combinations for all k.

Another way to look at it. We have n - 1 slots to place bars into. For each slot, we can decide, whether it stays empty or not. So the complete number of decisions we can make amounts to 2 ^ (n - 1).

You can also view the space as the number of all binary numbers of length n - 1.

The number of segmentations is exponential, hence not suitable for any brute force method at all.

We can't be too greedy in this, for example:


Should not automatically resolved to

love snails


loves nails

needs to be considered, too.

One can use a metric (which is introduced later) to find the most probable segmentation. The computation is simple, less than 10 lines:

def split_pairs(word):
    return [(word[:i + 1], word[i + 1:]) for i in range(len(word))]

def segment(word, word_seq_fitness=None):
    if not word or not word_seq_fitness:
        return []
    all_segmentations = [[first] + segment(rest)] 
                         for (first, rest) in split_pairs(word)]
    return max(all_segmentations, key=word_seq_fitness)

The problem now shifts to the question, what words do have meaning. Choosing a dictionary list might end up with a poor model, since a lot of real world string do not make it into the dictionary, for example the bbconline could be problematic.

A solution is to create a probabilistic model of the english language.

In the simplest case, we need a list of words and associated counts, representing the words frequency in some corpus. Now the task is to calculate the likelihood, that a given sequence of words has been drawn from the distribution, that generated this corpus (and of which we know the frequencies).

Bayes Rule

Determine the probability of each word in the candidate segmentation and multiply their probabilities. This is Bayes Rule, with the assumtions that each event (occurence of a word) is independent of the others.

The independence assumptions is strong and quite wrong, words are not independent of each other at all. (Is this alone enough to call it a bag of words model?)

This is also called naive Bayes, and it is called naive, because it make the wrong, but useful independence assumption. A data point has a number of features, all independent.

For natural language, nGrams should be also considered, since words tend to appear in groups. One can also model occurences within texts (without a necessary order, e.g. sequence memoizer).

We restrict n to 5 here.



Drawbacks, Improvements

A bigger corpus does not mean, that the analysis gets better. There is the problem of words, that are not in the corpus.

#!/usr/bin/env python
# coding: utf-8
import collections
import glob
import os
import re
import string
if __name__ == '__main__':
# pattern = re.compile('[\W_]+')
# for path in glob.glob('/media/mtc/Data/tmp/gberg-*'):
# counter = collections.Counter()
# outfile = '%s.freq' % path
# print(path, outfile)
# if os.path.exists(outfile):
# continue
# with open(path) as handle:
# for line in handle:
# line = line.lower()
# for token in line.split():
# token = pattern.sub('', token)
# if token:
# counter[token] += 1
# with open('%s.freq' % path, 'w') as output:
# for key, value in counter.iteritems():
# output.write('%s\t%s\n' % (key, value))
counter = collections.Counter()
for i, path in enumerate(glob.glob('/media/mtc/Data/tmp/gberg-*freq')):
with open(path) as handle:
print(path, len(counter), i)
for line in handle:
token, frequency = line.strip().split()
frequency = int(frequency)
if len(token) > 30:
counter[token] += frequency
with open('./gutenberg.freq', 'w') as output:
for key, value in counter.iteritems():
output.write('%s\t%s\n' % (key, value))
#!/usr/bin/env python
# coding: utf-8
import functools
import math
import os
import sys
import gzip
import urllib
def split_pairs(word):
return [(word[:i + 1], word[i + 1:]) for i in range(len(word))]
def segment(word, word_seq_fitness=None):
if not word or not word_seq_fitness:
return []
all_segmentations = [[first] + segment(rest, word_seq_fitness=word_seq_fitness)
for (first, rest) in split_pairs(word)]
# print(len(all_segmentations))
return max(all_segmentations, key=word_seq_fitness)
class OneGramDist(dict):
1-gram probability distribution for corpora.
def __init__(self, filename='count_1w_cleaned.txt'): = 0
print('building probability table...')
_open = open
if filename.endswith('gz'):
_open =
with _open(filename) as handle:
for line in handle:
word, count = line.strip().split('\t')
self[word] = int(count) += int(count)
def __call__(self, word):
result = float(self[word]) /
# print(word, result)
except KeyError:
# result = 1.0 /
return 1.0 / ( * 10**(len(word) - 2))
# return sys.float_info.min
return result
def onegram_log(onegrams, words):
Use the log trick to avoid tiny quantities.
result = functools.reduce(lambda x, y: x + y,
(math.log10(onegrams(w)) for w in words))
print(words, result)
return result
if __name__ == '__main__':
sentence = ''.join(sys.argv[1:]).lower()
if not sentence:
sentence = 'thequickbrown'
# onegrams = OneGramDist(filename='count_10M_gb.txt')
onegrams = OneGramDist(filename='count_1M_gb.txt.gz')
# onegrams = OneGramDist(filename='count_1w.txt')
onegram_fitness = functools.partial(onegram_log, onegrams)
print(segment(sentence, word_seq_fitness=onegram_fitness))
Examples (using 1M or 10M frequency list from Gutenberg):
$ python penisland
(['p', 'en', 'island'], -15.638518785248923)
(['pe', 'nis', 'land'], -15.108322797049299)
(['pen', 'island'], -8.375691641879506)
(['peni', 'sl', 'and'], -14.687786244444453)
(['penis', 'land'], -9.672682030503244)
(['penisl', 'and'], -15.235864890453465)
(['penisla', 'nd'], -20.26133790488834)
(['penislan', 'd'], -24.41870399665061)
(['penisland'], -16.709351998325303)
['pen', 'island']
$ python therapistfinder
(['t', 'he', 'rapist', 'finder'], -24.748890963614368)
(['th', 'era', 'pist', 'finder'], -21.212173151805697)
(['the', 'rapist', 'finder'], -15.237632013697695)
(['ther', 'a', 'pist', 'finder'], -19.589487264674233)
(['thera', 'pist', 'finder'], -19.16492932696031)
(['therap', 'ist', 'finder'], -17.992610282256244)
(['therapi', 'st', 'finder'], -24.47141671545925)
(['therapis', 'tf', 'in', 'der'], -21.371078882293887)
(['therapist', 'finder'], -13.623547247250375)
(['therapistf', 'in', 'der'], -22.861482061222056)
(['therapistfi', 'nder'], -25.98413509261641)
(['therapistfin', 'der'], -23.10614521275257)
(['therapistfind', 'er'], -24.395855144970255)
(['therapistfinde', 'r'], -30.418703996650606)
(['therapistfinder'], -22.709351998325303)
['therapist', 'finder']
$ python whorepresents
(['w', 'ho', 'represents'], -18.14739742951673)
(['wh', 'ore', 'presents'], -15.423500721603162)
(['who', 'represents'], -7.398759138614212)
(['whor', 'ep', 'resents'], -20.08562578090216)
(['whore', 'presents'], -10.445817549886325)
(['whorep', 'resents'], -15.673004770048102)
(['whorepr', 'esents'], -23.816644005322647)
(['whorepre', 'sents'], -23.098557710539556)
(['whorepres', 'ents'], -23.31936871896465)
(['whoreprese', 'nts'], -25.459662604329512)
(['whorepresen', 'ts'], -24.458042989379624)
(['whorepresent', 's'], -28.418703996650606)
(['whorepresents'], -20.709351998325303)
['who', 'represents']
It's funny to see it *work* even with Kauderwelsch:
$ python zugarrivesatgaredunord
(['z', 'ug', 'arrives', 'at', 'gare', 'du', 'nord'], -32.50373110714555)
(['zu', 'g', 'arrives', 'at', 'gare', 'du', 'nord'], -29.771130940523193)
(['zug', 'arrives', 'at', 'gare', 'du', 'nord'], -27.6931533335704)
(['zuga', 'rrivesatgaredunord'], -37.43176816321052)
(['zugar', 'rives', 'at', 'gare', 'du', 'nord'], -35.27313676138066)
(['zugarr', 'ive', 'sat', 'gare', 'du', 'nord'], -35.62167660702414)
(['zugarri', 'vesat', 'gare', 'du', 'nord'], -37.06730572297619)
(['zugarriv', 'es', 'at', 'gare', 'du', 'nord'], -36.19798266766783)
(['zugarrive', 'sat', 'gare', 'du', 'nord'], -34.778801974903615)
(['zugarrives', 'at', 'gare', 'du', 'nord'], -34.347608226189784)
(['zugarrivesa', 'tg', 'are', 'du', 'nord'], -36.61899603139459)
(['zugarrivesat', 'gare', 'du', 'nord'], -34.06742498500572)
(['zugarrivesatg', 'are', 'du', 'nord'], -31.83761040103289)
(['zugarrivesatga', 'red', 'un', 'ord'], -34.25155677971655)
(['zugarrivesatgar', 'e', 'du', 'nord'], -34.7329574229286)
(['zugarrivesatgare', 'du', 'nord'], -32.353490844484725)
(['zugarrivesatgared', 'un', 'ord'], -33.482848511720604)
(['zugarrivesatgaredu', 'nord'], -30.97651206109257)
(['zugarrivesatgaredun', 'ord'], -32.2095721169088)
(['zugarrivesatgareduno', 'rd'], -33.55078317272377)
(['zugarrivesatgaredunor', 'd'], -32.66984483306061)
(['zugarrivesatgaredunord'], -29.71588408160526)
['zug', 'arrives', 'at', 'gare', 'du', 'nord']
#!/usr/bin/env python
# coding: utf-8
import unittest
from segmentation import *
class SegmentationTests(unittest.TestCase):
Tests for word segmentation functions.
def test_split_pairs(self):
self.assertEquals(len(split_pairs("hello")), 5)
self.assertEquals(split_pairs("hi"), [("h", "i"), ("hi", "")])
self.assertEquals(split_pairs("yay"), [("y", "ay"),
("ya", "y"),
("yay", "")])
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment