Word segmentation in a nutshell.

README.md

A combinatorial question: how many segmentations of a word of length n are there?

Intuition:

helloworld
h|elloworld
he|lloworld
...
h|e|lloworld
...
h|e|l|l|o|w|o|r|l|d

You can place a single bar in any of the n - 1 positions between characters (n - 1 choose 1), two bars in (n - 1 choose 2) ways, and so forth. This is the number of k-combinations, summed over all k.

Another way to look at it: we have n - 1 slots to place bars into. For each slot, we can decide whether it stays empty or gets a bar, so the total number of distinct outcomes amounts to 2 ^ (n - 1).

You can also view the space as the set of all binary strings of length n - 1.

The number of segmentations is exponential in the word length, hence not suitable for any brute force method.
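
To make the count concrete, here is a small sketch (not part of the gist's code) that enumerates every segmentation by deciding, for each of the n - 1 slots, whether to place a bar:

```python
from itertools import product

def all_segmentations(word):
    """Enumerate every segmentation by deciding, for each of the
    n - 1 slots between characters, whether to place a bar."""
    n = len(word)
    results = []
    for bars in product([False, True], repeat=n - 1):
        parts, start = [], 0
        for i, bar in enumerate(bars, start=1):
            if bar:
                parts.append(word[start:i])
                start = i
        parts.append(word[start:])
        results.append(parts)
    return results

segs = all_segmentations('hello')
print(len(segs))  # 2 ** (5 - 1) = 16
```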

We can't be too greedy here. For example:

lovesnails

Should not automatically be resolved to

love snails

but

loves nails

needs to be considered, too.

One can use a metric (which is introduced later) to find the most probable segmentation. The computation is simple, fewer than 10 lines:

def split_pairs(word):
    return [(word[:i + 1], word[i + 1:]) for i in range(len(word))]

def segment(word, word_seq_fitness=None):
    if not word or not word_seq_fitness:
        return []
    all_segmentations = [[first] + segment(rest, word_seq_fitness=word_seq_fitness)
                         for (first, rest) in split_pairs(word)]
    return max(all_segmentations, key=word_seq_fitness)
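
As a quick sanity check, here is a self-contained toy run of the snippet above; the word list and fitness metric are made up for illustration:

```python
# Hypothetical word list; a real model would use corpus frequencies.
KNOWN = {'love', 'loves', 'snails', 'nails'}

def split_pairs(word):
    return [(word[:i + 1], word[i + 1:]) for i in range(len(word))]

def segment(word, word_seq_fitness=None):
    if not word or not word_seq_fitness:
        return []
    all_segmentations = [[first] + segment(rest, word_seq_fitness=word_seq_fitness)
                         for (first, rest) in split_pairs(word)]
    return max(all_segmentations, key=word_seq_fitness)

def toy_fitness(words):
    # Reward known words, penalize fragment count (toy metric).
    return sum(w in KNOWN for w in words) - len(words)

print(segment('lovesnails', word_seq_fitness=toy_fitness))  # ['love', 'snails']
```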

The problem now shifts to the question of which words have meaning. Choosing a dictionary list might result in a poor model, since a lot of real-world strings do not make it into the dictionary; for example, bbconline could be problematic.

A solution is to create a probabilistic model of the English language.

In the simplest case, we need a list of words and associated counts, representing each word's frequency in some corpus. The task is then to calculate the likelihood that a given sequence of words has been drawn from the distribution that generated this corpus (and whose frequencies we know).

Bayes Rule

Determine the probability of each word in the candidate segmentation and multiply their probabilities. This is Bayes' rule with the assumption that each event (the occurrence of a word) is independent of the others.

The independence assumption is strong and quite wrong: words are not independent of each other at all. (Is this alone enough to call it a bag-of-words model?)

This is also called naive Bayes, and it is called naive because it makes the wrong, but useful, independence assumption: a data point has a number of features, all assumed independent.
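
Under this assumption, scoring a candidate segmentation reduces to a sum of log probabilities. A toy sketch (the counts are made up for illustration):

```python
import math

# Toy unigram counts; the numbers are made up for illustration.
counts = {'who': 120, 'represents': 30, 'whore': 5, 'presents': 40}
total = sum(counts.values())

def log_prob(words):
    """Sum of log10 unigram probabilities, i.e. the log of the product
    of P(w) under the naive independence assumption."""
    return sum(math.log10(counts[w] / float(total)) for w in words)

a = log_prob(['who', 'represents'])
b = log_prob(['whore', 'presents'])
print(a > b)  # True: the segmentation built from frequent words wins
```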

For natural language, n-grams should also be considered, since words tend to appear in groups. One can also model occurrences within texts without a fixed order (e.g. with a sequence memoizer).

We restrict n to 5 here.
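
For example, a bigram model (the n = 2 case) conditions each word on its predecessor. A toy sketch with a made-up corpus and add-one smoothing:

```python
import math
from collections import Counter

# Toy corpus; in practice one would use large n-gram counts.
corpus = 'the quick brown fox jumps over the lazy dog the quick fox'.split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_log_prob(words):
    """log P(w1) + sum of log P(w_i | w_{i-1}), with add-one smoothing."""
    vocab = len(unigrams)
    score = math.log10(unigrams[words[0]] / float(sum(unigrams.values())))
    for prev, cur in zip(words, words[1:]):
        score += math.log10((bigrams[(prev, cur)] + 1.0) / (unigrams[prev] + vocab))
    return score

# 'the quick' occurs in the corpus, 'the dog' does not.
print(bigram_log_prob(['the', 'quick']) > bigram_log_prob(['the', 'dog']))  # True
```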

Implementation

See segmentation.py.

Drawbacks, Improvements

A bigger corpus does not necessarily mean that the analysis gets better. There is also the problem of words that do not appear in the corpus at all.
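
segmentation.py addresses this with a length-dependent penalty for unseen words: the longer an unknown string, the less probable it is as a single word, which nudges the model toward splitting it into known words.

```python
def unknown_word_prob(word, total):
    # Penalty from segmentation.py's OneGramDist.__call__: the probability
    # shrinks by a factor of 10 for every character beyond the second.
    return 1.0 / (total * 10 ** (len(word) - 2))

total = 1000000  # hypothetical corpus size
print(unknown_word_prob('xy', total) > unknown_word_prob('xyzzyx', total))  # True
```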

gutenbergfreq.py
#!/usr/bin/env python
# coding: utf-8
 
import collections
import glob
import os
import re
import string
 
 
if __name__ == '__main__':
    # pattern = re.compile('[\W_]+')
    # for path in glob.glob('/media/mtc/Data/tmp/gberg-*'):
    #     counter = collections.Counter()
    #     outfile = '%s.freq' % path
    #     print(path, outfile)
    #     if os.path.exists(outfile):
    #         continue
    #     with open(path) as handle:
    #         for line in handle:
    #             line = line.lower()
    #             for token in line.split():
    #                 token = pattern.sub('', token)
    #                 if token:
    #                     counter[token] += 1
    #     with open('%s.freq' % path, 'w') as output:
    #         for key, value in counter.items():
    #             output.write('%s\t%s\n' % (key, value))

    counter = collections.Counter()
    for i, path in enumerate(glob.glob('/media/mtc/Data/tmp/gberg-*freq')):
        with open(path) as handle:
            print(path, len(counter), i)
            for line in handle:
                token, frequency = line.strip().split()
                frequency = int(frequency)
                if len(token) > 30:
                    continue
                else:
                    counter[token] += frequency

    with open('./gutenberg.freq', 'w') as output:
        for key, value in counter.items():
            output.write('%s\t%s\n' % (key, value))
segmentation.py
#!/usr/bin/env python
# coding: utf-8
# http://stackoverflow.com/q/19675106/89391
 
import functools
import math
import os
import sys
import gzip
import urllib
 
 
def split_pairs(word):
    return [(word[:i + 1], word[i + 1:]) for i in range(len(word))]


def segment(word, word_seq_fitness=None):
    if not word or not word_seq_fitness:
        return []
    all_segmentations = [[first] + segment(rest, word_seq_fitness=word_seq_fitness)
                         for (first, rest) in split_pairs(word)]
    # print(len(all_segmentations))
    return max(all_segmentations, key=word_seq_fitness)


class OneGramDist(dict):
    """
    1-gram probability distribution for corpora.
    Source: http://norvig.com/ngrams/count_1w.txt
    """
    def __init__(self, filename='count_1w_cleaned.txt'):
        self.total = 0
        print('building probability table...')
        _open = open
        if filename.endswith('gz'):
            _open = gzip.open
        with _open(filename, 'rt') as handle:
            for line in handle:
                word, count = line.strip().split('\t')
                self[word] = int(count)
                self.total += int(count)

    def __call__(self, word):
        try:
            result = float(self[word]) / self.total
            # print(word, result)
        except KeyError:
            # result = 1.0 / self.total
            return 1.0 / (self.total * 10**(len(word) - 2))
            # return sys.float_info.min
        return result


def onegram_log(onegrams, words):
    """
    Use the log trick to avoid tiny quantities.
    http://machineintelligence.tumblr.com/post/4998477107/the-log-sum-exp-trick
    """
    result = functools.reduce(lambda x, y: x + y,
                              (math.log10(onegrams(w)) for w in words))
    print(words, result)
    return result


if __name__ == '__main__':
    sentence = ''.join(sys.argv[1:]).lower()
    if not sentence:
        sentence = 'thequickbrown'
    # onegrams = OneGramDist(filename='count_10M_gb.txt')
    onegrams = OneGramDist(filename='count_1M_gb.txt.gz')
    # onegrams = OneGramDist(filename='count_1w.txt')
    onegram_fitness = functools.partial(onegram_log, onegrams)
    print(sentence)
    print(segment(sentence, word_seq_fitness=onegram_fitness))
segmentation_examples.txt
Examples (using 1M or 10M frequency list from Gutenberg):
 
$ python segmentation.py penisland
...
(['p', 'en', 'island'], -15.638518785248923)
(['pe', 'nis', 'land'], -15.108322797049299)
(['pen', 'island'], -8.375691641879506)
(['peni', 'sl', 'and'], -14.687786244444453)
(['penis', 'land'], -9.672682030503244)
(['penisl', 'and'], -15.235864890453465)
(['penisla', 'nd'], -20.26133790488834)
(['penislan', 'd'], -24.41870399665061)
(['penisland'], -16.709351998325303)
['pen', 'island']
 
$ python segmentation.py therapistfinder
...
(['t', 'he', 'rapist', 'finder'], -24.748890963614368)
(['th', 'era', 'pist', 'finder'], -21.212173151805697)
(['the', 'rapist', 'finder'], -15.237632013697695)
(['ther', 'a', 'pist', 'finder'], -19.589487264674233)
(['thera', 'pist', 'finder'], -19.16492932696031)
(['therap', 'ist', 'finder'], -17.992610282256244)
(['therapi', 'st', 'finder'], -24.47141671545925)
(['therapis', 'tf', 'in', 'der'], -21.371078882293887)
(['therapist', 'finder'], -13.623547247250375)
(['therapistf', 'in', 'der'], -22.861482061222056)
(['therapistfi', 'nder'], -25.98413509261641)
(['therapistfin', 'der'], -23.10614521275257)
(['therapistfind', 'er'], -24.395855144970255)
(['therapistfinde', 'r'], -30.418703996650606)
(['therapistfinder'], -22.709351998325303)
['therapist', 'finder']
 
$ python segmentation.py whorepresents
...
(['w', 'ho', 'represents'], -18.14739742951673)
(['wh', 'ore', 'presents'], -15.423500721603162)
(['who', 'represents'], -7.398759138614212)
(['whor', 'ep', 'resents'], -20.08562578090216)
(['whore', 'presents'], -10.445817549886325)
(['whorep', 'resents'], -15.673004770048102)
(['whorepr', 'esents'], -23.816644005322647)
(['whorepre', 'sents'], -23.098557710539556)
(['whorepres', 'ents'], -23.31936871896465)
(['whoreprese', 'nts'], -25.459662604329512)
(['whorepresen', 'ts'], -24.458042989379624)
(['whorepresent', 's'], -28.418703996650606)
(['whorepresents'], -20.709351998325303)
['who', 'represents']
 
It's funny to see it *work* even on gibberish (Kauderwelsch):
 
$ python segmentation.py zugarrivesatgaredunord
 
(['z', 'ug', 'arrives', 'at', 'gare', 'du', 'nord'], -32.50373110714555)
(['zu', 'g', 'arrives', 'at', 'gare', 'du', 'nord'], -29.771130940523193)
(['zug', 'arrives', 'at', 'gare', 'du', 'nord'], -27.6931533335704)
(['zuga', 'rrivesatgaredunord'], -37.43176816321052)
(['zugar', 'rives', 'at', 'gare', 'du', 'nord'], -35.27313676138066)
(['zugarr', 'ive', 'sat', 'gare', 'du', 'nord'], -35.62167660702414)
(['zugarri', 'vesat', 'gare', 'du', 'nord'], -37.06730572297619)
(['zugarriv', 'es', 'at', 'gare', 'du', 'nord'], -36.19798266766783)
(['zugarrive', 'sat', 'gare', 'du', 'nord'], -34.778801974903615)
(['zugarrives', 'at', 'gare', 'du', 'nord'], -34.347608226189784)
(['zugarrivesa', 'tg', 'are', 'du', 'nord'], -36.61899603139459)
(['zugarrivesat', 'gare', 'du', 'nord'], -34.06742498500572)
(['zugarrivesatg', 'are', 'du', 'nord'], -31.83761040103289)
(['zugarrivesatga', 'red', 'un', 'ord'], -34.25155677971655)
(['zugarrivesatgar', 'e', 'du', 'nord'], -34.7329574229286)
(['zugarrivesatgare', 'du', 'nord'], -32.353490844484725)
(['zugarrivesatgared', 'un', 'ord'], -33.482848511720604)
(['zugarrivesatgaredu', 'nord'], -30.97651206109257)
(['zugarrivesatgaredun', 'ord'], -32.2095721169088)
(['zugarrivesatgareduno', 'rd'], -33.55078317272377)
(['zugarrivesatgaredunor', 'd'], -32.66984483306061)
(['zugarrivesatgaredunord'], -29.71588408160526)
['zug', 'arrives', 'at', 'gare', 'du', 'nord']
test_segmentation.py
#!/usr/bin/env python
# coding: utf-8
 
import unittest
from segmentation import *
 
class SegmentationTests(unittest.TestCase):
    """
    Tests for word segmentation functions.
    """
    def test_split_pairs(self):
        self.assertEqual(len(split_pairs("hello")), 5)
        self.assertEqual(split_pairs("hi"), [("h", "i"), ("hi", "")])
        self.assertEqual(split_pairs("yay"), [("y", "ay"),
                                              ("ya", "y"),
                                              ("yay", "")])


if __name__ == '__main__':
    unittest.main()
 
