Skip to content

Instantly share code, notes, and snippets.

@kylebgorman
Created January 25, 2018 19:58
Show Gist options
  • Star 2 You must be signed in to star a gist
  • Fork 2 You must be signed in to fork a gist
  • Save kylebgorman/7d406f577ef1922b2dd3a5ac52752dea to your computer and use it in GitHub Desktop.
Save kylebgorman/7d406f577ef1922b2dd3a5ac52752dea to your computer and use it in GitHub Desktop.
Python string processing with edit transducers
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"# String processing with Pynini edit transducers\n",
"\n",
"<a href=\"mailto:kbg@google.com\">Kyle Gorman</a> & <a href=\"mailto:rws@google.com\">Richard Sproat</a><br>Google, Inc.<br>111 8th Ave., New York, NY 10011"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"String _alignment_ and _matching_ are used in bioinformatics, information retrieval, and natural language processing. For example, RNA and DNA bases are aligned to detect mutations, and string matching is used to detect plagiarism, correct spelling mistakes, and provide \"translation memories\" to assist human translators. Efficient algorithms exist for these problems, but their implementations tend to be complex and use ad hoc designs, limiting code reuse."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"In what follows, we describe a data structure called the _edit transducer_ and show how it can be used for string problems. An edit transducer is an abstract data structure&mdash;more specifically, a _weighted finite-state transducer_ (WFST)&mdash;which expresses:\n",
"\n",
"* the ways in which strings may differ from each other, and\n",
"* the costs associated with these differences.\n",
"\n",
"An edit transducer and an operation called _composition_ is first used to create a _lattice_&mdash;itself a WFST&mdash;aligning pairs of strings. We then use the lattice to compute the _Levenshtein distance_, a widely-used metric of string similarity, or to build a _Levenshtein automaton_ used to find similar strings."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Our implementation of edit transducers allows users to specify the costs for the various types of edits. For instance, if we're looking for mutations of a gene in a larger genome, we might specify that insertions (i.e, bases present in the genome not matching any of those in the gene) are essentially \"free\", but deletions (the inverse) have a non-zero cost. Few existing implementations of string algorithms support this. For example, <a href=\"https://docs.python.org/2/library/difflib.html#difflib.SequenceMatcher\">`difflib`</a>, part of the Python core library, implements the textbook algorithm for computing <a href=\"https://en.wikipedia.org/wiki/Levenshtein_distance\">Levenshtein distance</a> (Wagner & Fischer 1974), but it only supports fixed edit costs. The same is true of Python implementations of \n",
"<a href=\"https://en.wikipedia.org/wiki/Levenshtein_automaton\">Levenshtein automaton</a> (Schulz & Mihov 2002) available <a href=\"http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata\">here</a> and <a href=\"http://julesjacobs.github.io/2015/06/17/disqus-levenshtein-simple-and-fast.html\">here</a>."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Rather than implementing these algorithms from scratch, we use <a href=\"http://pynini.opengrm.org\">Pynini</a>, a Python library which uses <a href=\"http://openfst.org\">OpenFst</a>, a C++ 11 library for WFSTs, as a backend. Both Pynini and OpenFst were developed at Google for use in speech and language products. These libraries allow us to we delegate the most difficult work to compiled code, carefully tested and extensively optimized, and minimize the risk of programmer error."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"In what follows, we provide a brief introduction to finite-state machines (see <a href=\"https://www.oreilly.com/ideas/how-to-get-superior-text-processing-in-python-with-pynini\">here</a> for a gentler, more detailed introduction). Then, we describe the construction of an edit transducer. Finally, we show how the edit transducer can be used to compute Levenshtein distance and implement Levenshtein automata."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Throughout this tutorial, we apply various optimizations to the WFSTs using the `optimize` method and functions like `prune` and `synchronize`. While this produces smaller, more efficient WFSTs, our motivation is primarily pedagogical: optimized WFSTs are usually easier to read and understand. We reserve general discussion of WFST optimization for future work."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Before beginning, we import all the necessary libraries, including Pynini:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"from __future__ import division\n",
"\n",
"import string # Demonstration only.\n",
"\n",
"from pynini import acceptor # Demonstration only.\n",
"from pynini import compose\n",
"from pynini import invert\n",
"from pynini import isomorphic # Demonstration only.\n",
"from pynini import NO_STATE_ID\n",
"from pynini import prune\n",
"from pynini import shortestdistance\n",
"from pynini import shortestpath\n",
"from pynini import string_map\n",
"from pynini import synchronize\n",
"from pynini import transducer\n",
"from pynini import union"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Formal preliminaries"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"A finite-state machine is an abstract machine whose behavior can be described by transitions between a finite set of states. For instance, a <a href=\"https://www.oreilly.com/ideas/how-to-get-superior-text-processing-in-python-with-pynini\">gumball machine</a> is essentially a two-state machine where the transitions include \"inserting a coin\", \"turning the knob\", and \"yielding a gumball\". In software, finite-state machines are used to implement regular expression matching. \n",
"A regular expression represents a (possibly infinite) set of strings. Such sets are naturally expressed as a _finite-state acceptor_ (or FSA). Consider the regular expression `ba+`, which matches `ba`, `baa`, `baaa` , and so on. In Pynini, we can express this (somewhat verbosely) as follows:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.36.0 (20140111.2315)\n",
" -->\n",
"<!-- Title: FST Pages: 1 -->\n",
"<svg width=\"200pt\" height=\"84pt\"\n",
" viewBox=\"0.00 0.00 200.00 84.00\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 80)\">\n",
"<title>FST</title>\n",
"<polygon fill=\"white\" stroke=\"none\" points=\"-4,4 -4,-80 196,-80 196,4 -4,4\"/>\n",
"<!-- 0 -->\n",
"<g id=\"node1\" class=\"node\"><title>0</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" stroke-width=\"2\" cx=\"18\" cy=\"-21\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"18\" y=\"-17.3\" font-family=\"Times,serif\" font-size=\"14.00\">0</text>\n",
"</g>\n",
"<!-- 1 -->\n",
"<g id=\"node2\" class=\"node\"><title>1</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"92\" cy=\"-21\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"92\" y=\"-17.3\" font-family=\"Times,serif\" font-size=\"14.00\">1</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge1\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M36.063,-21C44.1928,-21 54.1239,-21 63.2953,-21\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"63.5594,-24.5001 73.5593,-21 63.5593,-17.5001 63.5594,-24.5001\"/>\n",
"<text text-anchor=\"middle\" x=\"55\" y=\"-24.8\" font-family=\"Times,serif\" font-size=\"14.00\">b</text>\n",
"</g>\n",
"<!-- 2 -->\n",
"<g id=\"node3\" class=\"node\"><title>2</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"170\" cy=\"-21\" rx=\"18\" ry=\"18\"/>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"170\" cy=\"-21\" rx=\"22\" ry=\"22\"/>\n",
"<text text-anchor=\"middle\" x=\"170\" y=\"-17.3\" font-family=\"Times,serif\" font-size=\"14.00\">2</text>\n",
"</g>\n",
"<!-- 1&#45;&gt;2 -->\n",
"<g id=\"edge2\" class=\"edge\"><title>1&#45;&gt;2</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M110.246,-21C118.328,-21 128.217,-21 137.561,-21\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"137.752,-24.5001 147.752,-21 137.752,-17.5001 137.752,-24.5001\"/>\n",
"<text text-anchor=\"middle\" x=\"129\" y=\"-24.8\" font-family=\"Times,serif\" font-size=\"14.00\">a</text>\n",
"</g>\n",
"<!-- 2&#45;&gt;2 -->\n",
"<g id=\"edge3\" class=\"edge\"><title>2&#45;&gt;2</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M162.683,-41.466C161.78,-51.3101 164.219,-60 170,-60 173.613,-60 175.921,-56.6055 176.923,-51.6563\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"180.428,-51.5937 177.317,-41.466 173.433,-51.3233 180.428,-51.5937\"/>\n",
"<text text-anchor=\"middle\" x=\"170\" y=\"-63.8\" font-family=\"Times,serif\" font-size=\"14.00\">a</text>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<vector Fst at 0x7f88dc016b90>"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sheep = acceptor(\"b\") + acceptor(\"a\").plus\n",
"sheep.optimize(True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We say that a string matches this regular expression if, by beginning at the start state (here, 0) and taking the appropriate transitions, we can each match the symbols in the string, terminating at the final state (2) once the string has been consumed. For instance, for the string `baaa`, we start in state 0, consume the `b` via transition to state 1, consume the first `a` via transition to state 2, and then consume the remaining `a` symbols via self-transitions at state 2, a final state. However, a single `b` would not match because the path would terminate at state 1, which is not a final state, and `moo` would not match because there is no arc labeled `m` leaving the start state."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"_Finite-state transducers_ (FSTs) are a generalization of FSAs which represent the relations between two (possibly infinite) sets of strings. Regular expression substitution (e.g., Python's built-in `re.sub`) can be implemented with FSTs, though the FST formulation is much more expressive. In an FST, each transition is labeled with a pair of (input, output) labels. For instance, the following FST represents the relation `(b:m)(a:o)+`:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.36.0 (20140111.2315)\n",
" -->\n",
"<!-- Title: FST Pages: 1 -->\n",
"<svg width=\"224pt\" height=\"84pt\"\n",
" viewBox=\"0.00 0.00 224.00 84.00\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 80)\">\n",
"<title>FST</title>\n",
"<polygon fill=\"white\" stroke=\"none\" points=\"-4,4 -4,-80 220,-80 220,4 -4,4\"/>\n",
"<!-- 0 -->\n",
"<g id=\"node1\" class=\"node\"><title>0</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" stroke-width=\"2\" cx=\"18\" cy=\"-21\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"18\" y=\"-17.3\" font-family=\"Times,serif\" font-size=\"14.00\">0</text>\n",
"</g>\n",
"<!-- 1 -->\n",
"<g id=\"node2\" class=\"node\"><title>1</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"106\" cy=\"-21\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"106\" y=\"-17.3\" font-family=\"Times,serif\" font-size=\"14.00\">1</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge1\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M36.4034,-21C48.2541,-21 64.1831,-21 77.7083,-21\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"77.916,-24.5001 87.916,-21 77.916,-17.5001 77.916,-24.5001\"/>\n",
"<text text-anchor=\"middle\" x=\"62\" y=\"-24.8\" font-family=\"Times,serif\" font-size=\"14.00\">b:m</text>\n",
"</g>\n",
"<!-- 2 -->\n",
"<g id=\"node3\" class=\"node\"><title>2</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"194\" cy=\"-21\" rx=\"18\" ry=\"18\"/>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"194\" cy=\"-21\" rx=\"22\" ry=\"22\"/>\n",
"<text text-anchor=\"middle\" x=\"194\" y=\"-17.3\" font-family=\"Times,serif\" font-size=\"14.00\">2</text>\n",
"</g>\n",
"<!-- 1&#45;&gt;2 -->\n",
"<g id=\"edge2\" class=\"edge\"><title>1&#45;&gt;2</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M124.403,-21C135.14,-21 149.224,-21 161.835,-21\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"161.976,-24.5001 171.976,-21 161.976,-17.5001 161.976,-24.5001\"/>\n",
"<text text-anchor=\"middle\" x=\"148\" y=\"-24.8\" font-family=\"Times,serif\" font-size=\"14.00\">a:o</text>\n",
"</g>\n",
"<!-- 2&#45;&gt;2 -->\n",
"<g id=\"edge3\" class=\"edge\"><title>2&#45;&gt;2</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M185.63,-41.0663C184.472,-51.0736 187.262,-60 194,-60 198.317,-60 201.013,-56.3366 202.089,-51.0725\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"205.588,-51.1607 202.37,-41.0663 198.591,-50.964 205.588,-51.1607\"/>\n",
"<text text-anchor=\"middle\" x=\"194\" y=\"-63.8\" font-family=\"Times,serif\" font-size=\"14.00\">a:o</text>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<vector Fst at 0x7f88dc016d88>"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sheep_to_cow = transducer(\"b\", \"m\") + transducer(\"a\", \"o\").plus\n",
"sheep_to_cow.optimize(True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"If the input here is `baaa`, for example, then we transition from state 0 to 1, 1 to 2, and then 2 to 2 (twice), generating the output string `mooo`."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Finally, a _weighted finite-state transducer_ (WFST) adds weights to the transitions (and final states), so that each pair of strings in the relation has an associated weight. In this tutorial, we use real-valued weights. The _shortest distance_ algorithm is used to find the weight of the lowest-weight path through an WFST; the closely-related _shortest path_ algorithm returns the lowest-weight path(s). In Pynini, all machines are represented as WFSTs; an unweighted FST is simply an WFST where all weights are zero or one, and a FSA is simply a FST where all transitions have the same input and output labels."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Edit transducers"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"An edit transducer is a WFST and thus denotes a weighted relation. For simplicity, we assume that this relation has the domain and range $\\Sigma^*$ where $\\Sigma$ is the _alphabet_. We assume further that each pair of strings in the relation (that is, each _path_) is associated with a non-negative real number weight, and the weight is zero if and only if the input string $s_i$ and the output string $s_o$ are equal."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The edit transducer expresses the various ways in which a string can match, or fail to match, another string drawn from the same alphabet. We decompose this into four character-level operations:\n",
"\n",
"* When an input character and output character are the same, they _match_; e.g., `a` matches `a`\n",
"* An output character not present in the input is _inserted_\n",
"* An input character not present in the output is _deleted_\n",
"* When an input character and an output character are different, the latter is _substituted_ for the former; e.g., `b` is substituted for `a`\n",
"\n",
"We now proceed to implement these operations as WFSTs."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Implementing the edit transducer"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We implement the match operation using an acceptor representing the (unweighted) union over the alphabet:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.36.0 (20140111.2315)\n",
" -->\n",
"<!-- Title: FST Pages: 1 -->\n",
"<svg width=\"126pt\" height=\"51pt\"\n",
" viewBox=\"0.00 0.00 126.00 51.00\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 47)\">\n",
"<title>FST</title>\n",
"<polygon fill=\"white\" stroke=\"none\" points=\"-4,4 -4,-47 122,-47 122,4 -4,4\"/>\n",
"<!-- 0 -->\n",
"<g id=\"node1\" class=\"node\"><title>0</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" stroke-width=\"2\" cx=\"18\" cy=\"-21\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"18\" y=\"-17.3\" font-family=\"Times,serif\" font-size=\"14.00\">0</text>\n",
"</g>\n",
"<!-- 1 -->\n",
"<g id=\"node2\" class=\"node\"><title>1</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"96\" cy=\"-21\" rx=\"18\" ry=\"18\"/>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"96\" cy=\"-21\" rx=\"22\" ry=\"22\"/>\n",
"<text text-anchor=\"middle\" x=\"96\" y=\"-17.3\" font-family=\"Times,serif\" font-size=\"14.00\">1</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge1\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M36.2456,-21C44.3279,-21 54.2171,-21 63.5606,-21\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"63.7524,-24.5001 73.7524,-21 63.7524,-17.5001 63.7524,-24.5001\"/>\n",
"<text text-anchor=\"middle\" x=\"55\" y=\"-24.8\" font-family=\"Times,serif\" font-size=\"14.00\">a</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge2\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M32.9086,-10.5392C38.2739,-7.11518 44.6448,-3.74452 51,-2 56.9627,-0.363206 63.1713,-1.2777 68.9806,-3.40022\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"67.7384,-6.68677 78.2663,-7.87559 70.7776,-0.38094 67.7384,-6.68677\"/>\n",
"<text text-anchor=\"middle\" x=\"55\" y=\"-5.8\" font-family=\"Times,serif\" font-size=\"14.00\">b</text>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<vector Fst at 0x7f88dc016ae8>"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# For simplicity, we assume a two-character alphabet {\"a\", \"b\"}.\n",
"match = union(\"a\", \"b\")\n",
"match.optimize(True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Note that an arc labeled `a`, for example, is shorthand for `a:a`."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We implement the insertion operation as a weighted relation from the empty set (here denoted by an empty string) to the above union. Note that when we \"insert\", the input side arc is labeled `<epsilon>`, a reserved symbol which matches no other symbol:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.36.0 (20140111.2315)\n",
" -->\n",
"<!-- Title: FST Pages: 1 -->\n",
"<svg width=\"196pt\" height=\"56pt\"\n",
" viewBox=\"0.00 0.00 196.00 55.67\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 51.6721)\">\n",
"<title>FST</title>\n",
"<polygon fill=\"white\" stroke=\"none\" points=\"-4,4 -4,-51.6721 192,-51.6721 192,4 -4,4\"/>\n",
"<!-- 0 -->\n",
"<g id=\"node1\" class=\"node\"><title>0</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" stroke-width=\"2\" cx=\"18\" cy=\"-25.6721\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"18\" y=\"-21.9721\" font-family=\"Times,serif\" font-size=\"14.00\">0</text>\n",
"</g>\n",
"<!-- 1 -->\n",
"<g id=\"node2\" class=\"node\"><title>1</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"166\" cy=\"-25.6721\" rx=\"18\" ry=\"18\"/>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"166\" cy=\"-25.6721\" rx=\"22\" ry=\"22\"/>\n",
"<text text-anchor=\"middle\" x=\"166\" y=\"-21.9721\" font-family=\"Times,serif\" font-size=\"14.00\">1</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge1\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M36.0338,-25.6721C59.8836,-25.6721 103.652,-25.6721 133.522,-25.6721\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"133.836,-29.1722 143.836,-25.6721 133.836,-22.1722 133.836,-29.1722\"/>\n",
"<text text-anchor=\"middle\" x=\"90\" y=\"-29.4721\" font-family=\"Times,serif\" font-size=\"14.00\">&lt;epsilon&gt;:a/1</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge2\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M32.9086,-15.2113C38.2739,-11.7873 44.6448,-8.41662 51,-6.67209 84.43,2.5046 95.3502,1.66289 129,-6.67209 132.057,-7.42928 135.14,-8.49857 138.149,-9.74838\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"136.848,-13.007 147.377,-14.1946 139.887,-6.70082 136.848,-13.007\"/>\n",
"<text text-anchor=\"middle\" x=\"90\" y=\"-10.4721\" font-family=\"Times,serif\" font-size=\"14.00\">&lt;epsilon&gt;:b/1</text>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<vector Fst at 0x7f88dc016e68>"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# We use a \"1\" as a non-zero weight.\n",
"insert = transducer(\"\", match, weight=1)\n",
"insert.optimize(True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The deletion operation is implemented by the inverse of the insert operation WFST:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.36.0 (20140111.2315)\n",
" -->\n",
"<!-- Title: FST Pages: 1 -->\n",
"<svg width=\"196pt\" height=\"56pt\"\n",
" viewBox=\"0.00 0.00 196.00 55.67\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 51.6721)\">\n",
"<title>FST</title>\n",
"<polygon fill=\"white\" stroke=\"none\" points=\"-4,4 -4,-51.6721 192,-51.6721 192,4 -4,4\"/>\n",
"<!-- 0 -->\n",
"<g id=\"node1\" class=\"node\"><title>0</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" stroke-width=\"2\" cx=\"18\" cy=\"-25.6721\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"18\" y=\"-21.9721\" font-family=\"Times,serif\" font-size=\"14.00\">0</text>\n",
"</g>\n",
"<!-- 1 -->\n",
"<g id=\"node2\" class=\"node\"><title>1</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"166\" cy=\"-25.6721\" rx=\"18\" ry=\"18\"/>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"166\" cy=\"-25.6721\" rx=\"22\" ry=\"22\"/>\n",
"<text text-anchor=\"middle\" x=\"166\" y=\"-21.9721\" font-family=\"Times,serif\" font-size=\"14.00\">1</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge1\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M36.0338,-25.6721C59.8836,-25.6721 103.652,-25.6721 133.522,-25.6721\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"133.836,-29.1722 143.836,-25.6721 133.836,-22.1722 133.836,-29.1722\"/>\n",
"<text text-anchor=\"middle\" x=\"90\" y=\"-29.4721\" font-family=\"Times,serif\" font-size=\"14.00\">a:&lt;epsilon&gt;/1</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge2\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M32.9086,-15.2113C38.2739,-11.7873 44.6448,-8.41662 51,-6.67209 84.43,2.5046 95.3502,1.66289 129,-6.67209 132.057,-7.42928 135.14,-8.49857 138.149,-9.74838\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"136.848,-13.007 147.377,-14.1946 139.887,-6.70082 136.848,-13.007\"/>\n",
"<text text-anchor=\"middle\" x=\"90\" y=\"-10.4721\" font-family=\"Times,serif\" font-size=\"14.00\">b:&lt;epsilon&gt;/1</text>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<vector Fst at 0x7f88dc016c38>"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"delete = transducer(match, \"\", weight=1)\n",
"# Or, equivalently: delete = invert(insert)\n",
"delete.optimize(True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The substitution operation can be expressed as the cross-product of the alphabet with itself (the `synchronize` function helps to remove unnecessary epsilon arcs). Note that this machine permits self-substitution (i.e., \"substituting\" a character for itself), which will be optimized away at a later stage:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.36.0 (20140111.2315)\n",
" -->\n",
"<!-- Title: FST Pages: 1 -->\n",
"<svg width=\"146pt\" height=\"106pt\"\n",
" viewBox=\"0.00 0.00 146.00 106.01\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 102.012)\">\n",
"<title>FST</title>\n",
"<polygon fill=\"white\" stroke=\"none\" points=\"-4,4 -4,-102.012 142,-102.012 142,4 -4,4\"/>\n",
"<!-- 0 -->\n",
"<g id=\"node1\" class=\"node\"><title>0</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" stroke-width=\"2\" cx=\"18\" cy=\"-45.0121\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"18\" y=\"-41.3121\" font-family=\"Times,serif\" font-size=\"14.00\">0</text>\n",
"</g>\n",
"<!-- 1 -->\n",
"<g id=\"node2\" class=\"node\"><title>1</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"116\" cy=\"-45.0121\" rx=\"18\" ry=\"18\"/>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"116\" cy=\"-45.0121\" rx=\"22\" ry=\"22\"/>\n",
"<text text-anchor=\"middle\" x=\"116\" y=\"-41.3121\" font-family=\"Times,serif\" font-size=\"14.00\">1</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge1\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M28.2982,-60.0485C33.8579,-67.5549 41.6576,-75.8952 51,-80.0121 62.3878,-85.0304 67.4696,-84.6934 79,-80.0121 84.795,-77.6594 90.1696,-73.8762 94.9041,-69.6491\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"97.6549,-71.8468 102.229,-62.2903 92.6937,-66.9085 97.6549,-71.8468\"/>\n",
"<text text-anchor=\"middle\" x=\"65\" y=\"-86.8121\" font-family=\"Times,serif\" font-size=\"14.00\">a:a/1</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge2\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M36.1629,-45.0121C49.3087,-45.0121 67.8059,-45.0121 83.5798,-45.0121\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"83.9658,-48.5122 93.9658,-45.0121 83.9657,-41.5122 83.9658,-48.5122\"/>\n",
"<text text-anchor=\"middle\" x=\"65\" y=\"-48.8121\" font-family=\"Times,serif\" font-size=\"14.00\">a:b/1</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge3\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M32.9086,-34.5513C38.2739,-31.1273 44.6448,-27.7567 51,-26.0121 63.0005,-22.7179 66.9206,-23.0201 79,-26.0121 82.0569,-26.7693 85.1404,-27.8386 88.1486,-29.0884\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"86.8485,-32.3471 97.3766,-33.5346 89.8869,-26.0409 86.8485,-32.3471\"/>\n",
"<text text-anchor=\"middle\" x=\"65\" y=\"-29.8121\" font-family=\"Times,serif\" font-size=\"14.00\">b:a/1</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge4\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M26.711,-29.1522C32.1156,-19.977 40.2887,-9.23969 51,-4.01215 62.1836,1.44591 67.6633,1.12033 79,-4.01215 86.1192,-7.23524 92.3887,-12.639 97.6167,-18.4832\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"95.1012,-20.9415 104.082,-26.5621 100.567,-16.5676 95.1012,-20.9415\"/>\n",
"<text text-anchor=\"middle\" x=\"65\" y=\"-7.81215\" font-family=\"Times,serif\" font-size=\"14.00\">b:b/1</text>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<vector Fst at 0x7f88dc047068>"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"substitute = synchronize(transducer(match, match, weight=1))\n",
"substitute.optimize(True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We now compute the union of these four operations:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.36.0 (20140111.2315)\n",
" -->\n",
"<!-- Title: FST Pages: 1 -->\n",
"<svg width=\"196pt\" height=\"274pt\"\n",
" viewBox=\"0.00 0.00 196.00 274.46\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 270.463)\">\n",
"<title>FST</title>\n",
"<polygon fill=\"white\" stroke=\"none\" points=\"-4,4 -4,-270.463 192,-270.463 192,4 -4,4\"/>\n",
"<!-- 0 -->\n",
"<g id=\"node1\" class=\"node\"><title>0</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" stroke-width=\"2\" cx=\"18\" cy=\"-125.463\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"18\" y=\"-121.763\" font-family=\"Times,serif\" font-size=\"14.00\">0</text>\n",
"</g>\n",
"<!-- 1 -->\n",
"<g id=\"node2\" class=\"node\"><title>1</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"166\" cy=\"-125.463\" rx=\"18\" ry=\"18\"/>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"166\" cy=\"-125.463\" rx=\"22\" ry=\"22\"/>\n",
"<text text-anchor=\"middle\" x=\"166\" y=\"-121.763\" font-family=\"Times,serif\" font-size=\"14.00\">1</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge1\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M18.1579,-143.69C18.0985,-169.21 22.3139,-214.878 51,-236.463 78.7009,-257.306 100.96,-256.847 129,-236.463 153.628,-218.559 161.747,-183.419 164.252,-157.452\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"167.755,-157.54 164.997,-147.311 160.774,-157.028 167.755,-157.54\"/>\n",
"<text text-anchor=\"middle\" x=\"90\" y=\"-255.263\" font-family=\"Times,serif\" font-size=\"14.00\">&lt;epsilon&gt;:a/1</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge2\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M21.306,-143.451C24.5996,-160.863 32.4662,-186.351 51,-198.463 80.0195,-217.427 99.561,-216.769 129,-198.463 143.864,-189.22 152.772,-171.831 158.008,-156.343\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"161.427,-157.128 160.944,-146.544 154.721,-155.118 161.427,-157.128\"/>\n",
"<text text-anchor=\"middle\" x=\"90\" y=\"-215.263\" font-family=\"Times,serif\" font-size=\"14.00\">&lt;epsilon&gt;:b/1</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge3\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M28.2982,-140.499C33.8579,-148.005 41.6576,-156.346 51,-160.463 82.723,-174.442 96.8796,-173.503 129,-160.463 134.795,-158.11 140.17,-154.327 144.904,-150.099\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"147.655,-152.297 152.229,-142.741 142.694,-147.359 147.655,-152.297\"/>\n",
"<text text-anchor=\"middle\" x=\"90\" y=\"-174.263\" font-family=\"Times,serif\" font-size=\"14.00\">a:&lt;epsilon&gt;/1</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge4\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M36.0338,-125.463C59.8836,-125.463 103.652,-125.463 133.522,-125.463\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"133.836,-128.963 143.836,-125.463 133.836,-121.963 133.836,-128.963\"/>\n",
"<text text-anchor=\"middle\" x=\"90\" y=\"-129.263\" font-family=\"Times,serif\" font-size=\"14.00\">a:a</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge5\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M32.9086,-115.002C38.2739,-111.578 44.6448,-108.207 51,-106.463 84.43,-97.2859 95.3502,-98.1276 129,-106.463 132.057,-107.22 135.14,-108.289 138.149,-109.539\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"136.848,-112.798 147.377,-113.985 139.887,-106.491 136.848,-112.798\"/>\n",
"<text text-anchor=\"middle\" x=\"90\" y=\"-110.263\" font-family=\"Times,serif\" font-size=\"14.00\">a:b/1</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge6\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M25.8645,-109.01C31.1066,-98.7671 39.4137,-86.4405 51,-80.4626 81.8078,-64.5673 97.7548,-65.4454 129,-80.4626 137.037,-84.3253 143.841,-90.9526 149.297,-97.9514\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"146.469,-100.015 155.075,-106.195 152.201,-95.9971 146.469,-100.015\"/>\n",
"<text text-anchor=\"middle\" x=\"90\" y=\"-84.2626\" font-family=\"Times,serif\" font-size=\"14.00\">b:&lt;epsilon&gt;/1</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge7\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M20.9599,-107.436C23.9965,-89.3181 31.6787,-62.2914 51,-49.4626 79.8802,-30.2868 99.7063,-30.9246 129,-49.4626 144.631,-59.3546 153.608,-78.0718 158.689,-94.4552\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"155.356,-95.5349 161.373,-104.255 162.107,-93.6855 155.356,-95.5349\"/>\n",
"<text text-anchor=\"middle\" x=\"90\" y=\"-53.2626\" font-family=\"Times,serif\" font-size=\"14.00\">b:a/1</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge8\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M18.2156,-107.394C18.2264,-82.0981 22.5266,-36.8332 51,-15.4626 78.7261,5.3471 100.932,4.88413 129,-15.4626 153.451,-33.1876 161.6,-68.0137 164.162,-93.7522\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"160.678,-94.1007 164.93,-103.805 167.658,-93.5671 160.678,-94.1007\"/>\n",
"<text text-anchor=\"middle\" x=\"90\" y=\"-19.2626\" font-family=\"Times,serif\" font-size=\"14.00\">b:b</text>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<vector Fst at 0x7f88dc047030>"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"ops = union(match, delete, insert, substitute)\n",
"ops.optimize(True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"source": [
"The `ops` transducer currently only accepts zero or one elements of $\\Sigma$, whereas we want a transducer over the infinite set $\\Sigma^{*}$. To do this, we compute its (_concatenative_) _closure_:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"edit = ops.closure().optimize(True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"### Applying the edit transducer"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"To \"apply\" the relation expressed by the edit transducer, we first _compose_ the input string $s_i$ with the left side of the edit transducer. The composition operation is a generalization of set intersection. When we compose an acceptor with the left size of a transducer, we are essentially intersecting the set denoted by the acceptor (here a single string) with the domain of the transducer, producing a WFST denoting a relation from $\\{s_i\\}$ to $\\Sigma^{*}$."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We then compose the resulting transducer with the output string $s_o$. When we compose a the right size of a transducer an acceptor, we are intersecting the set denoted by the range of the transducer with the set denoted by the acceptor (here a single string). This produces a WFST denoting a relation from $\\{s_i\\}$ to $\\{s_o\\}$. We refer to this as the _lattice_:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"s_i = \"abba\"\n",
"s_o = \"baba\"\n",
"\n",
"lattice = compose(compose(s_i, edit), s_o)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Every path in this lattice is a possible alignment of the string `abba` to the string `baba`, and the cost of that path is the edit distance for that alignment. when visualizing the lattice, which can be a relatively large WFST, it is sometimes helpful to first \"prune\" it so that it contains only the optimal paths:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.36.0 (20140111.2315)\n",
" -->\n",
"<!-- Title: FST Pages: 1 -->\n",
"<svg width=\"612pt\" height=\"137pt\"\n",
" viewBox=\"0.00 0.00 612.00 137.43\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(0.916168 0.916168) rotate(0) translate(4 146)\">\n",
"<title>FST</title>\n",
"<polygon fill=\"white\" stroke=\"none\" points=\"-4,4 -4,-146 664,-146 664,4 -4,4\"/>\n",
"<!-- 0 -->\n",
"<g id=\"node1\" class=\"node\"><title>0</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" stroke-width=\"2\" cx=\"18\" cy=\"-72\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"18\" y=\"-68.3\" font-family=\"Times,serif\" font-size=\"14.00\">0</text>\n",
"</g>\n",
"<!-- 1 -->\n",
"<g id=\"node2\" class=\"node\"><title>1</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"162\" cy=\"-72\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"162\" y=\"-68.3\" font-family=\"Times,serif\" font-size=\"14.00\">1</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge1\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M36.1285,-72C60.3298,-72 104.789,-72 133.607,-72\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"133.811,-75.5001 143.811,-72 133.811,-68.5001 133.811,-75.5001\"/>\n",
"<text text-anchor=\"middle\" x=\"90\" y=\"-75.8\" font-family=\"Times,serif\" font-size=\"14.00\">a:&lt;epsilon&gt;/1</text>\n",
"</g>\n",
"<!-- 2 -->\n",
"<g id=\"node3\" class=\"node\"><title>2</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"162\" cy=\"-18\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"162\" y=\"-14.3\" font-family=\"Times,serif\" font-size=\"14.00\">2</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;2 -->\n",
"<g id=\"edge2\" class=\"edge\"><title>0&#45;&gt;2</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M33.5998,-62.3802C38.9334,-59.1239 45.1056,-55.6459 51,-53 78.4631,-40.672 111.569,-30.7611 134.241,-24.6751\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"135.377,-27.9959 144.164,-22.077 133.604,-21.2242 135.377,-27.9959\"/>\n",
"<text text-anchor=\"middle\" x=\"90\" y=\"-56.8\" font-family=\"Times,serif\" font-size=\"14.00\">&lt;epsilon&gt;:b/1</text>\n",
"</g>\n",
"<!-- 3 -->\n",
"<g id=\"node4\" class=\"node\"><title>3</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"213\" cy=\"-124\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"213\" y=\"-120.3\" font-family=\"Times,serif\" font-size=\"14.00\">3</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;3 -->\n",
"<g id=\"edge3\" class=\"edge\"><title>0&#45;&gt;3</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M33.1349,-81.7742C38.5247,-85.124 44.8533,-88.6231 51,-91 96.1885,-108.474 152.265,-117.286 184.903,-121.239\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"184.691,-124.737 195.024,-122.399 185.488,-117.783 184.691,-124.737\"/>\n",
"<text text-anchor=\"middle\" x=\"90\" y=\"-115.8\" font-family=\"Times,serif\" font-size=\"14.00\">a:b/1</text>\n",
"</g>\n",
"<!-- 9 -->\n",
"<g id=\"node5\" class=\"node\"><title>9</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"264\" cy=\"-72\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"264\" y=\"-68.3\" font-family=\"Times,serif\" font-size=\"14.00\">9</text>\n",
"</g>\n",
"<!-- 1&#45;&gt;9 -->\n",
"<g id=\"edge4\" class=\"edge\"><title>1&#45;&gt;9</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M180.419,-72C195.618,-72 217.933,-72 235.536,-72\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"235.781,-75.5001 245.781,-72 235.781,-68.5001 235.781,-75.5001\"/>\n",
"<text text-anchor=\"middle\" x=\"213\" y=\"-75.8\" font-family=\"Times,serif\" font-size=\"14.00\">b:b</text>\n",
"</g>\n",
"<!-- 7 -->\n",
"<g id=\"node6\" class=\"node\"><title>7</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"264\" cy=\"-18\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"264\" y=\"-14.3\" font-family=\"Times,serif\" font-size=\"14.00\">7</text>\n",
"</g>\n",
"<!-- 2&#45;&gt;7 -->\n",
"<g id=\"edge5\" class=\"edge\"><title>2&#45;&gt;7</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M180.419,-18C195.618,-18 217.933,-18 235.536,-18\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"235.781,-21.5001 245.781,-18 235.781,-14.5001 235.781,-21.5001\"/>\n",
"<text text-anchor=\"middle\" x=\"213\" y=\"-21.8\" font-family=\"Times,serif\" font-size=\"14.00\">a:a</text>\n",
"</g>\n",
"<!-- 4 -->\n",
"<g id=\"node7\" class=\"node\"><title>4</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"408\" cy=\"-77\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"408\" y=\"-73.3\" font-family=\"Times,serif\" font-size=\"14.00\">4</text>\n",
"</g>\n",
"<!-- 3&#45;&gt;4 -->\n",
"<g id=\"edge6\" class=\"edge\"><title>3&#45;&gt;4</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M231.363,-122.508C261.365,-119.55 324.541,-111.638 375,-94 377.637,-93.0782 380.322,-91.9647 382.959,-90.7537\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"384.789,-93.75 392.129,-86.1088 381.626,-87.5055 384.789,-93.75\"/>\n",
"<text text-anchor=\"middle\" x=\"264\" y=\"-123.8\" font-family=\"Times,serif\" font-size=\"14.00\">b:a/1</text>\n",
"</g>\n",
"<!-- 9&#45;&gt;4 -->\n",
"<g id=\"edge12\" class=\"edge\"><title>9&#45;&gt;4</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M282.129,-72.6031C306.33,-73.4553 350.789,-75.0207 379.607,-76.0355\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"379.694,-79.5406 389.811,-76.3947 379.94,-72.5449 379.694,-79.5406\"/>\n",
"<text text-anchor=\"middle\" x=\"336\" y=\"-78.8\" font-family=\"Times,serif\" font-size=\"14.00\">&lt;epsilon&gt;:a/1</text>\n",
"</g>\n",
"<!-- 7&#45;&gt;4 -->\n",
"<g id=\"edge9\" class=\"edge\"><title>7&#45;&gt;4</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M281.651,-21.7509C303.755,-27.1045 343.644,-38.0826 375,-54 378.435,-55.744 381.908,-57.8336 385.235,-60.0324\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"383.394,-63.0176 393.576,-65.9455 387.442,-57.3069 383.394,-63.0176\"/>\n",
"<text text-anchor=\"middle\" x=\"336\" y=\"-57.8\" font-family=\"Times,serif\" font-size=\"14.00\">b:&lt;epsilon&gt;/1</text>\n",
"</g>\n",
"<!-- 8 -->\n",
"<g id=\"node10\" class=\"node\"><title>8</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"408\" cy=\"-18\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"408\" y=\"-14.3\" font-family=\"Times,serif\" font-size=\"14.00\">8</text>\n",
"</g>\n",
"<!-- 7&#45;&gt;8 -->\n",
"<g id=\"edge10\" class=\"edge\"><title>7&#45;&gt;8</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M281.055,-11.9409C286.085,-10.3266 291.706,-8.80786 297,-8 331.27,-2.77004 340.73,-2.77004 375,-8 377.068,-8.31557 379.186,-8.73961 381.299,-9.23544\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"380.372,-12.6104 390.945,-11.9409 382.262,-5.87048 380.372,-12.6104\"/>\n",
"<text text-anchor=\"middle\" x=\"336\" y=\"-11.8\" font-family=\"Times,serif\" font-size=\"14.00\">b:b</text>\n",
"</g>\n",
"<!-- 5 -->\n",
"<g id=\"node8\" class=\"node\"><title>5</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"552\" cy=\"-49\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"552\" y=\"-45.3\" font-family=\"Times,serif\" font-size=\"14.00\">5</text>\n",
"</g>\n",
"<!-- 4&#45;&gt;5 -->\n",
"<g id=\"edge7\" class=\"edge\"><title>4&#45;&gt;5</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M425.847,-73.6781C450.187,-68.8786 495.432,-59.9571 524.335,-54.2579\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"525.041,-57.6861 534.175,-52.3176 523.687,-50.8184 525.041,-57.6861\"/>\n",
"<text text-anchor=\"middle\" x=\"480\" y=\"-73.8\" font-family=\"Times,serif\" font-size=\"14.00\">b:b</text>\n",
"</g>\n",
"<!-- 6 -->\n",
"<g id=\"node9\" class=\"node\"><title>6</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"638\" cy=\"-49\" rx=\"18\" ry=\"18\"/>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"638\" cy=\"-49\" rx=\"22\" ry=\"22\"/>\n",
"<text text-anchor=\"middle\" x=\"638\" y=\"-45.3\" font-family=\"Times,serif\" font-size=\"14.00\">6</text>\n",
"</g>\n",
"<!-- 5&#45;&gt;6 -->\n",
"<g id=\"edge8\" class=\"edge\"><title>5&#45;&gt;6</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M570.405,-49C580.589,-49 593.751,-49 605.682,-49\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"605.744,-52.5001 615.744,-49 605.744,-45.5001 605.744,-52.5001\"/>\n",
"<text text-anchor=\"middle\" x=\"593\" y=\"-52.8\" font-family=\"Times,serif\" font-size=\"14.00\">a:a</text>\n",
"</g>\n",
"<!-- 8&#45;&gt;5 -->\n",
"<g id=\"edge11\" class=\"edge\"><title>8&#45;&gt;5</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M426.028,-19.5924C447.867,-21.905 486.772,-26.9402 519,-36 521.204,-36.6197 523.463,-37.3438 525.711,-38.1273\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"524.704,-41.4895 535.294,-41.8036 527.211,-34.954 524.704,-41.4895\"/>\n",
"<text text-anchor=\"middle\" x=\"480\" y=\"-39.8\" font-family=\"Times,serif\" font-size=\"14.00\">b:&lt;epsilon&gt;/1</text>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<vector Fst at 0x7f88dc016b58>"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"prune(lattice, weight=0)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"If we take the path $[0, 3, 4, 5, 6]$, along the top of the graph, we obtain the following alignment, with two matches and two substitutions:\n",
"\n",
" abba\n",
" baba\n",
"\n",
"If, however, we take the path $[0, 2, 7, 8, 5, 6]$, along the bottom of the graph, we obtain the following alignment, with two matches, one insertion, and one deletion:\n",
"\n",
" _abba\n",
" bab_a\n",
"\n",
"Both of these paths have the same cost (2) and thus are present in the pruned lattice of optimal paths. Note of course that the optimal pat ths depend on the costs we assign to the different edit operations. If, for instance, we had assigned a zero cost to insertions and a unit cost to deletions and substitution, the former path above would no longer be optimal and the latter path would have a cost of 1."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Edit transducers for larger vocabularies"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The above example used a small (two-symbol) alphabet, as was the resulting transducer. However, the size of the transducer will grow quadratically as a function of the alphabet size. While this is still manageable for the 27 lowercase characters of English or the <a href=\"https://en.wikipedia.org/wiki/ASCII#Printable_characters\">95 printable ASCII characters</a>&mdash;the resulting machine will have 9,215 arcs using the above formulation&mdash;but more than _four billion_ arcs will be needed for the 65,536 code points of the Unicode <a href=\"https://en.wikipedia.org/wiki/Basic_Multilingual_Plane\">Basic Multilingual Plane</a>. This is clearly not feasible."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Fortunately, there is a way to trick to avoid quadratic growth: we factor the edit transducer into two smaller machines. Above, we computed the lattice $L$ for a pair of strings strings $s_i, s_o$ as follows:\n",
"\n",
"$$L = s_i~\\circ~\\mathcal{E}~\\circ~s_o$$\n",
"\n",
"where $\\mathcal{E}$ is the edit transducer and $\\circ$ is the composition operator. We factor $\\mathcal{E}$ into two smaller transducers $\\mathcal{E}_i$ and $\\mathcal{E}_o$ such that $\\mathcal{E}_i~\\circ~\\mathcal{E}_o$ is <a href=\"http://www.openfst.org/twiki/bin/view/FST/IsomorphicDoc\">isomorphic</a> to the original $\\mathcal{E}$. We can compute a new lattice $L'$ as follows:\n",
"\n",
"$$L' = (s_i~\\circ~\\mathcal{E}_i)~\\circ~(\\mathcal{E}_o~\\circ~s_o)$$\n",
"\n",
"Because $\\mathcal{E}_i~\\circ~\\mathcal{E}_o$ is isomorphic to $\\mathcal{E}$, and \n",
"<a href=\"https://en.wikipedia.org/wiki/Composition_of_relations\">composition is associative</a>, $L$ and $L'$ will also be isomorphic."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Not all regular relations can be factored into two smaller machines, but it is straightforward to do so for the edit transducer above. The basic insight is that instead of enumerating all the possible substitutions (or insertions, or deletions), we treat edits as a two-part process. In the case of substitution, we first map each input symbol to a special `<substition>` symbol in the left factor $\\mathcal{E}_i$, and then mapping this special symbol to an output symbol in the right factor $\\mathcal{E}_o$. Similar tricks can be applied to the insertion and deletion operations. As a result, the size of the two factors is a linear &mdash;rather than quadratic&mdash;function of alphabet size."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We now proceed to construct the two factors, focusing on the input-side factor $\\mathcal{E}_i$ first and then transforming it to generate $\\mathcal{E}_o$. For both factors, the match operation is implemented as before: an unweighted union over the alphabet. Note that we halve the costs in $\\mathcal{E}_i$ because the cost will also be incurred a second time as the path traverses $\\mathcal{E}_o$."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The insert operation maps from epsilon to a reserved symbol `<insert>` with a cost of 1/2:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.36.0 (20140111.2315)\n",
" -->\n",
"<!-- Title: FST Pages: 1 -->\n",
"<svg width=\"246pt\" height=\"50pt\"\n",
" viewBox=\"0.00 0.00 246.00 50.00\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 46)\">\n",
"<title>FST</title>\n",
"<polygon fill=\"white\" stroke=\"none\" points=\"-4,4 -4,-46 242,-46 242,4 -4,4\"/>\n",
"<!-- 0 -->\n",
"<g id=\"node1\" class=\"node\"><title>0</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" stroke-width=\"2\" cx=\"18\" cy=\"-21\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"18\" y=\"-17.3\" font-family=\"Times,serif\" font-size=\"14.00\">0</text>\n",
"</g>\n",
"<!-- 1 -->\n",
"<g id=\"node2\" class=\"node\"><title>1</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"216\" cy=\"-21\" rx=\"18\" ry=\"18\"/>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"216\" cy=\"-21\" rx=\"22\" ry=\"22\"/>\n",
"<text text-anchor=\"middle\" x=\"216\" y=\"-17.3\" font-family=\"Times,serif\" font-size=\"14.00\">1</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge1\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M36.0508,-21C68.9511,-21 141.636,-21 183.546,-21\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"183.745,-24.5001 193.745,-21 183.745,-17.5001 183.745,-24.5001\"/>\n",
"<text text-anchor=\"middle\" x=\"115\" y=\"-24.8\" font-family=\"Times,serif\" font-size=\"14.00\">&lt;epsilon&gt;:&lt;insert&gt;/0.5</text>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<vector Fst at 0x7f88dc047148>"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# The square brackets cause the string to be interpreted as a \"reserved\" symbol.\n",
"i_insert = transducer(\"\", \"[<insert>]\", weight=1/2)\n",
"i_insert.optimize(True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The delete and the substitute operations map from the input alphabet to `<delete>` and to `<substitute>`, respectively, with a cost of 1/2:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.36.0 (20140111.2315)\n",
" -->\n",
"<!-- Title: FST Pages: 1 -->\n",
"<svg width=\"200pt\" height=\"56pt\"\n",
" viewBox=\"0.00 0.00 200.00 55.96\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 51.9589)\">\n",
"<title>FST</title>\n",
"<polygon fill=\"white\" stroke=\"none\" points=\"-4,4 -4,-51.9589 196,-51.9589 196,4 -4,4\"/>\n",
"<!-- 0 -->\n",
"<g id=\"node1\" class=\"node\"><title>0</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" stroke-width=\"2\" cx=\"18\" cy=\"-25.9589\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"18\" y=\"-22.2589\" font-family=\"Times,serif\" font-size=\"14.00\">0</text>\n",
"</g>\n",
"<!-- 1 -->\n",
"<g id=\"node2\" class=\"node\"><title>1</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"170\" cy=\"-25.9589\" rx=\"18\" ry=\"18\"/>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"170\" cy=\"-25.9589\" rx=\"22\" ry=\"22\"/>\n",
"<text text-anchor=\"middle\" x=\"170\" y=\"-22.2589\" font-family=\"Times,serif\" font-size=\"14.00\">1</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge1\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M36.2069,-25.9589C60.9195,-25.9589 106.873,-25.9589 137.7,-25.9589\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"137.914,-29.459 147.913,-25.9589 137.913,-22.459 137.914,-29.459\"/>\n",
"<text text-anchor=\"middle\" x=\"92\" y=\"-29.7589\" font-family=\"Times,serif\" font-size=\"14.00\">a:&lt;delete&gt;/0.5</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge2\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M32.9086,-15.4981C38.2739,-12.0741 44.6448,-8.70347 51,-6.95895 86.1444,2.68835 97.6246,1.80347 133,-6.95895 136.057,-7.71613 139.14,-8.78543 142.149,-10.0352\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"140.848,-13.2939 151.377,-14.4814 143.887,-6.98768 140.848,-13.2939\"/>\n",
"<text text-anchor=\"middle\" x=\"92\" y=\"-10.7589\" font-family=\"Times,serif\" font-size=\"14.00\">b:&lt;delete&gt;/0.5</text>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<vector Fst at 0x7f88dc0470d8>"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"i_delete = transducer(match, \"[<delete>]\", weight=1/2)\n",
"# We have to optimize twice to get the minimal machine.\n",
"i_delete.optimize(True).optimize(True)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.36.0 (20140111.2315)\n",
" -->\n",
"<!-- Title: FST Pages: 1 -->\n",
"<svg width=\"218pt\" height=\"57pt\"\n",
" viewBox=\"0.00 0.00 218.00 57.49\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 53.4865)\">\n",
"<title>FST</title>\n",
"<polygon fill=\"white\" stroke=\"none\" points=\"-4,4 -4,-53.4865 214,-53.4865 214,4 -4,4\"/>\n",
"<!-- 0 -->\n",
"<g id=\"node1\" class=\"node\"><title>0</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" stroke-width=\"2\" cx=\"18\" cy=\"-27.4865\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"18\" y=\"-23.7865\" font-family=\"Times,serif\" font-size=\"14.00\">0</text>\n",
"</g>\n",
"<!-- 1 -->\n",
"<g id=\"node2\" class=\"node\"><title>1</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"188\" cy=\"-27.4865\" rx=\"18\" ry=\"18\"/>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"188\" cy=\"-27.4865\" rx=\"22\" ry=\"22\"/>\n",
"<text text-anchor=\"middle\" x=\"188\" y=\"-23.7865\" font-family=\"Times,serif\" font-size=\"14.00\">1</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge1\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M36.0333,-27.4865C64.0803,-27.4865 120.392,-27.4865 155.753,-27.4865\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"156,-30.9866 166,-27.4865 155.999,-23.9866 156,-30.9866\"/>\n",
"<text text-anchor=\"middle\" x=\"101\" y=\"-31.2865\" font-family=\"Times,serif\" font-size=\"14.00\">a:&lt;substitute&gt;/0.5</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge2\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M32.9086,-17.0257C38.2739,-13.6017 44.6448,-10.231 51,-8.48652 93.859,3.27847 107.859,2.19935 151,-8.48652 154.057,-9.2437 157.14,-10.313 160.149,-11.5628\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"158.848,-14.8214 169.377,-16.009 161.887,-8.51525 158.848,-14.8214\"/>\n",
"<text text-anchor=\"middle\" x=\"101\" y=\"-12.2865\" font-family=\"Times,serif\" font-size=\"14.00\">b:&lt;substitute&gt;/0.5</text>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<vector Fst at 0x7f88dc0472d0>"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"i_substitute = transducer(match, \"[<substitute>]\", weight=1/2)\n",
"# Ditto.\n",
"i_substitute.optimize(True).optimize(True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Then, we put the four left-factor operations together:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.36.0 (20140111.2315)\n",
" -->\n",
"<!-- Title: FST Pages: 1 -->\n",
"<svg width=\"246pt\" height=\"260pt\"\n",
" viewBox=\"0.00 0.00 246.00 259.65\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 255.652)\">\n",
"<title>FST</title>\n",
"<polygon fill=\"white\" stroke=\"none\" points=\"-4,4 -4,-255.652 242,-255.652 242,4 -4,4\"/>\n",
"<!-- 0 -->\n",
"<g id=\"node1\" class=\"node\"><title>0</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" stroke-width=\"2\" cx=\"18\" cy=\"-120.652\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"18\" y=\"-116.952\" font-family=\"Times,serif\" font-size=\"14.00\">0</text>\n",
"</g>\n",
"<!-- 1 -->\n",
"<g id=\"node2\" class=\"node\"><title>1</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"216\" cy=\"-120.652\" rx=\"18\" ry=\"18\"/>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"216\" cy=\"-120.652\" rx=\"22\" ry=\"22\"/>\n",
"<text text-anchor=\"middle\" x=\"216\" y=\"-116.952\" font-family=\"Times,serif\" font-size=\"14.00\">1</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge1\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M19.4459,-138.803C21.0713,-160.476 27.4316,-195.896 51,-212.652 97.3657,-245.615 132.012,-244.722 179,-212.652 198.754,-199.17 207.669,-173.079 211.692,-152.116\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"215.184,-152.449 213.341,-142.015 208.276,-151.321 215.184,-152.449\"/>\n",
"<text text-anchor=\"middle\" x=\"115\" y=\"-240.452\" font-family=\"Times,serif\" font-size=\"14.00\">&lt;epsilon&gt;:&lt;insert&gt;/0.5</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge2\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M23.9904,-137.631C28.7121,-150.319 37.1763,-166.775 51,-174.652 75.714,-188.734 153.921,-188.074 179,-174.652 189.268,-169.157 197.113,-159.296 202.818,-149.491\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"205.92,-151.111 207.463,-140.629 199.721,-147.861 205.92,-151.111\"/>\n",
"<text text-anchor=\"middle\" x=\"115\" y=\"-188.452\" font-family=\"Times,serif\" font-size=\"14.00\">a:a</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge3\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M33.752,-129.815C38.9978,-132.581 45.0758,-135.254 51,-136.652 106.367,-149.722 123.35,-148.462 179,-136.652 181.668,-136.086 184.378,-135.308 187.053,-134.398\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"188.443,-137.611 196.444,-130.666 185.858,-131.106 188.443,-137.611\"/>\n",
"<text text-anchor=\"middle\" x=\"115\" y=\"-149.452\" font-family=\"Times,serif\" font-size=\"14.00\">a:&lt;delete&gt;/0.5</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge4\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M35.9187,-118.768C40.781,-118.308 46.095,-117.881 51,-117.652 107.827,-114.994 122.16,-115.288 179,-117.652 180.527,-117.716 182.09,-117.794 183.668,-117.884\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"183.859,-121.406 194.079,-118.613 184.347,-114.423 183.859,-121.406\"/>\n",
"<text text-anchor=\"middle\" x=\"115\" y=\"-121.452\" font-family=\"Times,serif\" font-size=\"14.00\">a:&lt;substitute&gt;/0.5</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge5\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M31.2815,-108.495C36.8258,-103.839 43.7531,-99.0677 51,-96.652 104.97,-78.6622 124.552,-80.1665 179,-96.652 182.853,-97.8186 186.671,-99.5486 190.299,-101.555\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"188.515,-104.568 198.839,-106.948 192.253,-98.6495 188.515,-104.568\"/>\n",
"<text text-anchor=\"middle\" x=\"115\" y=\"-100.452\" font-family=\"Times,serif\" font-size=\"14.00\">b:b</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge6\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M23.7207,-103.334C28.3252,-90.1523 36.7632,-72.914 51,-64.652 100.204,-36.098 129.068,-37.3916 179,-64.652 189.749,-70.5206 197.768,-81.125 203.474,-91.5395\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"200.341,-93.0997 207.904,-100.519 206.619,-90.0024 200.341,-93.0997\"/>\n",
"<text text-anchor=\"middle\" x=\"115\" y=\"-68.452\" font-family=\"Times,serif\" font-size=\"14.00\">b:&lt;delete&gt;/0.5</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge7\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M19.1347,-102.526C20.3983,-80.0168 26.3439,-42.4231 51,-24.652 97.1507,8.61147 132.242,7.75157 179,-24.652 199.84,-39.0939 208.634,-67.1855 212.335,-89.2661\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"208.877,-89.8124 213.738,-99.2264 215.809,-88.8359 208.877,-89.8124\"/>\n",
"<text text-anchor=\"middle\" x=\"115\" y=\"-28.452\" font-family=\"Times,serif\" font-size=\"14.00\">b:&lt;substitute&gt;/0.5</text>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<vector Fst at 0x7f88dc047298>"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"i_ops = union(match, i_insert, i_delete, i_substitute)\n",
"i_ops.optimize(True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"For the right factor $\\mathcal{E}_o$, we need an insert operation which maps from `<insert>` to the alphabet, a delete operation which maps from `<delete>` to epsilon, and so on. Rather than building these directly, we can construct the entire set of right operations by `invert`ing the left factor and then swapping `<insert>` and `<delete>` with `relabel_pairs`. This first requires us to look up the integer labels associated with these special symbols:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.36.0 (20140111.2315)\n",
" -->\n",
"<!-- Title: FST Pages: 1 -->\n",
"<svg width=\"248pt\" height=\"262pt\"\n",
" viewBox=\"0.00 0.00 248.00 262.15\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 258.15)\">\n",
"<title>FST</title>\n",
"<polygon fill=\"white\" stroke=\"none\" points=\"-4,4 -4,-258.15 244,-258.15 244,4 -4,4\"/>\n",
"<!-- 0 -->\n",
"<g id=\"node1\" class=\"node\"><title>0</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" stroke-width=\"2\" cx=\"18\" cy=\"-123.15\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"18\" y=\"-119.45\" font-family=\"Times,serif\" font-size=\"14.00\">0</text>\n",
"</g>\n",
"<!-- 1 -->\n",
"<g id=\"node2\" class=\"node\"><title>1</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"218\" cy=\"-123.15\" rx=\"18\" ry=\"18\"/>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"218\" cy=\"-123.15\" rx=\"22\" ry=\"22\"/>\n",
"<text text-anchor=\"middle\" x=\"218\" y=\"-119.45\" font-family=\"Times,serif\" font-size=\"14.00\">1</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge1\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M19.4459,-141.301C21.0713,-162.975 27.4316,-198.395 51,-215.15 98.0901,-248.629 133.278,-247.722 181,-215.15 200.754,-201.668 209.669,-175.578 213.692,-154.614\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"217.184,-154.947 215.341,-144.514 210.276,-153.819 217.184,-154.947\"/>\n",
"<text text-anchor=\"middle\" x=\"116\" y=\"-242.95\" font-family=\"Times,serif\" font-size=\"14.00\">&lt;delete&gt;:&lt;epsilon&gt;/0.5</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge2\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M23.9904,-140.13C28.7121,-152.817 37.1763,-169.274 51,-177.15 76.1001,-191.453 155.529,-190.782 181,-177.15 191.268,-171.655 199.113,-161.795 204.818,-151.99\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"207.92,-153.609 209.463,-143.127 201.721,-150.359 207.92,-153.609\"/>\n",
"<text text-anchor=\"middle\" x=\"116\" y=\"-190.95\" font-family=\"Times,serif\" font-size=\"14.00\">a:a</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge3\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M33.752,-132.313C38.9978,-135.079 45.0758,-137.752 51,-139.15 107.232,-152.425 124.481,-151.145 181,-139.15 183.668,-138.584 186.378,-137.806 189.053,-136.896\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"190.443,-140.11 198.444,-133.164 187.858,-133.605 190.443,-140.11\"/>\n",
"<text text-anchor=\"middle\" x=\"116\" y=\"-151.95\" font-family=\"Times,serif\" font-size=\"14.00\">&lt;insert&gt;:a/0.5</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge4\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M35.9187,-121.267C40.781,-120.806 46.095,-120.38 51,-120.15 108.715,-117.451 123.272,-117.749 181,-120.15 182.527,-120.214 184.09,-120.293 185.668,-120.383\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"185.859,-123.905 196.079,-121.111 186.347,-116.922 185.859,-123.905\"/>\n",
"<text text-anchor=\"middle\" x=\"116\" y=\"-123.95\" font-family=\"Times,serif\" font-size=\"14.00\">&lt;substitute&gt;:a/0.5</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge5\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M31.2815,-110.993C36.8258,-106.337 43.7531,-101.566 51,-99.1504 105.813,-80.8795 125.701,-82.4073 181,-99.1504 184.853,-100.317 188.671,-102.047 192.299,-104.053\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"190.515,-107.067 200.839,-109.447 194.253,-101.148 190.515,-107.067\"/>\n",
"<text text-anchor=\"middle\" x=\"116\" y=\"-102.95\" font-family=\"Times,serif\" font-size=\"14.00\">b:b</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge6\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M23.5412,-105.806C28.0746,-92.361 36.5051,-74.6358 51,-66.1504 100.862,-36.9608 130.399,-38.2608 181,-66.1504 191.941,-72.1806 200.016,-83.0908 205.714,-93.7636\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"202.641,-95.4499 210.121,-102.953 208.953,-92.4227 202.641,-95.4499\"/>\n",
"<text text-anchor=\"middle\" x=\"116\" y=\"-69.9504\" font-family=\"Times,serif\" font-size=\"14.00\">&lt;insert&gt;:b/0.5</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge7\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M18.9873,-105.056C20.0679,-82.1389 25.7919,-43.4371 51,-25.1504 97.7679,8.77646 133.622,7.91903 181,-25.1504 202.299,-40.0172 211.034,-68.9487 214.596,-91.5399\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"211.159,-92.2544 215.928,-101.715 218.1,-91.3453 211.159,-92.2544\"/>\n",
"<text text-anchor=\"middle\" x=\"116\" y=\"-28.9504\" font-family=\"Times,serif\" font-size=\"14.00\">&lt;substitute&gt;:b/0.5</text>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<vector Fst at 0x7f88dc0470a0>"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"o_ops = invert(i_ops)\n",
"syms = o_ops.input_symbols()\n",
"insert_label = syms.find(\"<insert>\")\n",
"delete_label = syms.find(\"<delete>\")\n",
"o_ops.relabel_pairs(ipairs=((insert_label, delete_label),\n",
" (delete_label, insert_label)))"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Finally, we compute the closures of the two sets of operations and optimize:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"e_i = i_ops.closure().optimize(True)\n",
"e_o = o_ops.closure().optimize(True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Just to be sure, let's confirm that the composition of the two factors is in fact isomorphic to the single-machine edit transducer:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"factored_edit = compose(e_i, e_o)\n",
"factored_edit.optimize(True)\n",
"assert isomorphic(edit, factored_edit)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"And, as we can see below, this construction produced a pruned lattice isomorphic to the one above:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.36.0 (20140111.2315)\n",
" -->\n",
"<!-- Title: FST Pages: 1 -->\n",
"<svg width=\"612pt\" height=\"142pt\"\n",
" viewBox=\"0.00 0.00 612.00 142.01\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(0.916168 0.916168) rotate(0) translate(4 151)\">\n",
"<title>FST</title>\n",
"<polygon fill=\"white\" stroke=\"none\" points=\"-4,4 -4,-151 664,-151 664,4 -4,4\"/>\n",
"<!-- 0 -->\n",
"<g id=\"node1\" class=\"node\"><title>0</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" stroke-width=\"2\" cx=\"18\" cy=\"-68\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"18\" y=\"-64.3\" font-family=\"Times,serif\" font-size=\"14.00\">0</text>\n",
"</g>\n",
"<!-- 1 -->\n",
"<g id=\"node2\" class=\"node\"><title>1</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"162\" cy=\"-122\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"162\" y=\"-118.3\" font-family=\"Times,serif\" font-size=\"14.00\">1</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge1\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M33.5998,-77.6198C38.9334,-80.8761 45.1056,-84.3541 51,-87 78.4631,-99.328 111.569,-109.239 134.241,-115.325\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"133.604,-118.776 144.164,-117.923 135.377,-112.004 133.604,-118.776\"/>\n",
"<text text-anchor=\"middle\" x=\"90\" y=\"-116.8\" font-family=\"Times,serif\" font-size=\"14.00\">&lt;epsilon&gt;:b/1</text>\n",
"</g>\n",
"<!-- 2 -->\n",
"<g id=\"node3\" class=\"node\"><title>2</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"162\" cy=\"-68\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"162\" y=\"-64.3\" font-family=\"Times,serif\" font-size=\"14.00\">2</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;2 -->\n",
"<g id=\"edge2\" class=\"edge\"><title>0&#45;&gt;2</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M36.1285,-68C60.3298,-68 104.789,-68 133.607,-68\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"133.811,-71.5001 143.811,-68 133.811,-64.5001 133.811,-71.5001\"/>\n",
"<text text-anchor=\"middle\" x=\"90\" y=\"-71.8\" font-family=\"Times,serif\" font-size=\"14.00\">a:&lt;epsilon&gt;/1</text>\n",
"</g>\n",
"<!-- 3 -->\n",
"<g id=\"node4\" class=\"node\"><title>3</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"213\" cy=\"-18\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"213\" y=\"-14.3\" font-family=\"Times,serif\" font-size=\"14.00\">3</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;3 -->\n",
"<g id=\"edge3\" class=\"edge\"><title>0&#45;&gt;3</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M33.1196,-58.1858C38.5077,-54.8318 44.8392,-51.3402 51,-49 96.1846,-31.8367 152.262,-23.8032 184.902,-20.3393\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"185.418,-23.8053 195.023,-19.3331 184.726,-16.8396 185.418,-23.8053\"/>\n",
"<text text-anchor=\"middle\" x=\"90\" y=\"-52.8\" font-family=\"Times,serif\" font-size=\"14.00\">a:b/1</text>\n",
"</g>\n",
"<!-- 8 -->\n",
"<g id=\"node5\" class=\"node\"><title>8</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"264\" cy=\"-122\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"264\" y=\"-118.3\" font-family=\"Times,serif\" font-size=\"14.00\">8</text>\n",
"</g>\n",
"<!-- 1&#45;&gt;8 -->\n",
"<g id=\"edge4\" class=\"edge\"><title>1&#45;&gt;8</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M180.419,-122C195.618,-122 217.933,-122 235.536,-122\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"235.781,-125.5 245.781,-122 235.781,-118.5 235.781,-125.5\"/>\n",
"<text text-anchor=\"middle\" x=\"213\" y=\"-125.8\" font-family=\"Times,serif\" font-size=\"14.00\">a:a</text>\n",
"</g>\n",
"<!-- 7 -->\n",
"<g id=\"node6\" class=\"node\"><title>7</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"264\" cy=\"-68\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"264\" y=\"-64.3\" font-family=\"Times,serif\" font-size=\"14.00\">7</text>\n",
"</g>\n",
"<!-- 2&#45;&gt;7 -->\n",
"<g id=\"edge5\" class=\"edge\"><title>2&#45;&gt;7</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M180.419,-68C195.618,-68 217.933,-68 235.536,-68\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"235.781,-71.5001 245.781,-68 235.781,-64.5001 235.781,-71.5001\"/>\n",
"<text text-anchor=\"middle\" x=\"213\" y=\"-71.8\" font-family=\"Times,serif\" font-size=\"14.00\">b:b</text>\n",
"</g>\n",
"<!-- 4 -->\n",
"<g id=\"node7\" class=\"node\"><title>4</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"408\" cy=\"-64\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"408\" y=\"-60.3\" font-family=\"Times,serif\" font-size=\"14.00\">4</text>\n",
"</g>\n",
"<!-- 3&#45;&gt;4 -->\n",
"<g id=\"edge6\" class=\"edge\"><title>3&#45;&gt;4</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M231.171,-19.2294C244.995,-20.4084 264.874,-22.5314 282,-26 324.184,-34.5436 333.97,-40.0004 375,-53 377.013,-53.6377 379.091,-54.3131 381.176,-55.003\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"380.177,-58.3598 390.771,-58.2507 382.421,-51.7293 380.177,-58.3598\"/>\n",
"<text text-anchor=\"middle\" x=\"264\" y=\"-29.8\" font-family=\"Times,serif\" font-size=\"14.00\">b:a/1</text>\n",
"</g>\n",
"<!-- 8&#45;&gt;4 -->\n",
"<g id=\"edge11\" class=\"edge\"><title>8&#45;&gt;4</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M279.714,-113.058C285.058,-109.989 291.208,-106.661 297,-104 304.88,-100.38 351.161,-83.8025 381.127,-73.1572\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"382.331,-76.444 390.585,-69.8018 379.99,-69.8469 382.331,-76.444\"/>\n",
"<text text-anchor=\"middle\" x=\"336\" y=\"-107.8\" font-family=\"Times,serif\" font-size=\"14.00\">b:&lt;epsilon&gt;/1</text>\n",
"</g>\n",
"<!-- 9 -->\n",
"<g id=\"node10\" class=\"node\"><title>9</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"408\" cy=\"-129\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"408\" y=\"-125.3\" font-family=\"Times,serif\" font-size=\"14.00\">9</text>\n",
"</g>\n",
"<!-- 8&#45;&gt;9 -->\n",
"<g id=\"edge10\" class=\"edge\"><title>8&#45;&gt;9</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M282.129,-122.844C306.33,-124.037 350.789,-126.229 379.607,-127.65\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"379.651,-131.156 389.811,-128.153 379.995,-124.164 379.651,-131.156\"/>\n",
"<text text-anchor=\"middle\" x=\"336\" y=\"-130.8\" font-family=\"Times,serif\" font-size=\"14.00\">b:b</text>\n",
"</g>\n",
"<!-- 7&#45;&gt;4 -->\n",
"<g id=\"edge9\" class=\"edge\"><title>7&#45;&gt;4</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M281.463,-63.0654C286.411,-61.8054 291.885,-60.6297 297,-60 331.407,-55.7643 340.4,-57.8459 375,-60 376.537,-60.0957 378.112,-60.2169 379.701,-60.3572\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"379.777,-63.8861 390.097,-61.4886 380.535,-56.9272 379.777,-63.8861\"/>\n",
"<text text-anchor=\"middle\" x=\"336\" y=\"-63.8\" font-family=\"Times,serif\" font-size=\"14.00\">&lt;epsilon&gt;:a/1</text>\n",
"</g>\n",
"<!-- 5 -->\n",
"<g id=\"node8\" class=\"node\"><title>5</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"552\" cy=\"-99\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"552\" y=\"-95.3\" font-family=\"Times,serif\" font-size=\"14.00\">5</text>\n",
"</g>\n",
"<!-- 4&#45;&gt;5 -->\n",
"<g id=\"edge7\" class=\"edge\"><title>4&#45;&gt;5</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M425.969,-66.7188C447.746,-70.3965 486.585,-77.5929 519,-87 521.048,-87.5943 523.151,-88.2582 525.254,-88.9605\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"524.288,-92.3319 534.882,-92.4086 526.648,-85.7418 524.288,-92.3319\"/>\n",
"<text text-anchor=\"middle\" x=\"480\" y=\"-90.8\" font-family=\"Times,serif\" font-size=\"14.00\">b:b</text>\n",
"</g>\n",
"<!-- 6 -->\n",
"<g id=\"node9\" class=\"node\"><title>6</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"638\" cy=\"-99\" rx=\"18\" ry=\"18\"/>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"638\" cy=\"-99\" rx=\"22\" ry=\"22\"/>\n",
"<text text-anchor=\"middle\" x=\"638\" y=\"-95.3\" font-family=\"Times,serif\" font-size=\"14.00\">6</text>\n",
"</g>\n",
"<!-- 5&#45;&gt;6 -->\n",
"<g id=\"edge8\" class=\"edge\"><title>5&#45;&gt;6</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M570.405,-99C580.589,-99 593.751,-99 605.682,-99\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"605.744,-102.5 615.744,-99 605.744,-95.5001 605.744,-102.5\"/>\n",
"<text text-anchor=\"middle\" x=\"593\" y=\"-102.8\" font-family=\"Times,serif\" font-size=\"14.00\">a:a</text>\n",
"</g>\n",
"<!-- 9&#45;&gt;5 -->\n",
"<g id=\"edge12\" class=\"edge\"><title>9&#45;&gt;5</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M425.847,-125.441C450.187,-120.299 495.432,-110.74 524.335,-104.633\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"525.114,-108.046 534.175,-102.555 523.668,-101.197 525.114,-108.046\"/>\n",
"<text text-anchor=\"middle\" x=\"480\" y=\"-124.8\" font-family=\"Times,serif\" font-size=\"14.00\">b:&lt;epsilon&gt;/1</text>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<vector Fst at 0x7f88dc0471b8>"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"i_edit = compose(s_i, e_i)\n",
"o_edit = compose(e_o, s_o)\n",
"prune(compose(i_edit, o_edit), weight=0)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We now proceed to wrap this into a class called `EditTransducer`. We expose the following parameters to the user:\n",
" \n",
"* the alphabet (an iterable of strings)\n",
"* the cost of insertion (defaulting to 1)\n",
"* the cost of deletion (defaulting to 1)\n",
"* the cost of substitution (defaulting to 1)\n",
"\n",
"In addition to a constructor which builds the two factors, we add a method which builds the lattice for a pair of strings, and a static method which checks that a lattice is well-formed (i.e., has a valid start state). If the lattice is illformed, this indicates that one or more characters in the input and/or output string is not in the provided alphabet. Since most operations will be undefined in this case, an exception is thrown. This class is given below:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"DEFAULT_INSERT_COST = 1\n",
"DEFAULT_DELETE_COST = 1\n",
"DEFAULT_SUBSTITUTE_COST = 1\n",
"\n",
"\n",
"class LatticeError(Exception):\n",
" pass\n",
"\n",
"\n",
"class EditTransducer(object):\n",
" \"\"\"Factored edit transducer.\n",
" \n",
" This class stores the two factors of an finite-alphabet edit transducer and\n",
" supports insertion, deletion, and substitution operations with user-specified\n",
" costs.\n",
"\n",
" Note that the cost of substitution must be less than the cost of insertion\n",
" plus the cost of deletion or no optimal path will include substitution.\n",
" \"\"\"\n",
"\n",
" # Reserved labels for edit operations.\n",
" DELETE = \"<delete>\"\n",
" INSERT = \"<insert>\"\n",
" SUBSTITUTE = \"<substitute>\"\n",
"\n",
" def __init__(self,\n",
" alphabet,\n",
" insert_cost=DEFAULT_INSERT_COST,\n",
" delete_cost=DEFAULT_DELETE_COST,\n",
" substitute_cost=DEFAULT_SUBSTITUTE_COST):\n",
" \"\"\"Constructor.\n",
"\n",
" Args:\n",
" alphabet: edit alphabet (an iterable of strings).\n",
" insert_cost: the cost for the insertion operation.\n",
" delete_cost: the cost for the deletion operation.\n",
" substitute_cost: the cost for the substitution operation.\n",
" \"\"\"\n",
" # Left factor; note that we divide the edit costs by two because they also\n",
" # will be incurred when traversing the right factor.\n",
" match = union(*alphabet).optimize(True)\n",
" i_insert = transducer(\"\", \"[{}]\".format(self.INSERT),\n",
" weight=insert_cost / 2).optimize(True)\n",
" i_delete = transducer(match, \"[{}]\".format(self.DELETE),\n",
" weight=delete_cost / 2).optimize(True)\n",
" i_substitute = transducer(match, \"[{}]\".format(self.SUBSTITUTE),\n",
" weight=substitute_cost / 2).optimize(True)\n",
" i_ops = union(match, i_insert, i_delete, i_substitute).optimize(True)\n",
" # Right factor; this is constructed by inverting the left factor (i.e.,\n",
" # swapping the input and output labels), then swapping the insert and delete\n",
" # labels on what is now the input side.\n",
" o_ops = invert(i_ops)\n",
" syms = o_ops.input_symbols()\n",
" insert_label = syms.find(self.INSERT)\n",
" delete_label = syms.find(self.DELETE)\n",
" o_ops.relabel_pairs(ipairs=((insert_label, delete_label),\n",
" (delete_label, insert_label)))\n",
" # Computes the closure for both sets of ops.\n",
" self._e_i = i_ops.closure().optimize(True)\n",
" self._e_o = o_ops.closure().optimize(True)\n",
" \n",
" @staticmethod\n",
" def check_wellformed_lattice(lattice):\n",
" \"\"\"Raises an error if the lattice is empty.\n",
"\n",
" Args:\n",
" lattice: A lattice FST.\n",
"\n",
" Raises:\n",
" LatticeError: Lattice is empty.\n",
" \"\"\"\n",
" if lattice.start() == NO_STATE_ID:\n",
" raise LatticeError(\"Lattice is empty\")\n",
"\n",
" def _create_lattice(self, iset, oset):\n",
" \"\"\"Creates edit lattice for a pair of input/output strings or acceptors.\n",
"\n",
" Args:\n",
" iset: input string or acceptor\n",
" oset: output string or acceptor.\n",
"\n",
" Returns:\n",
" A lattice FST.\n",
" \"\"\"\n",
" l_i = compose(iset, self._e_i)\n",
" l_o = compose(self._e_o, oset)\n",
" lattice = compose(l_i, l_o)\n",
" EditTransducer.check_wellformed_lattice(lattice)\n",
" return lattice"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## From edit transducers to Levenshtein distance"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The above class constructs the two factors of the edit transducer, and then applies them to input/output strings or sets to create the lattice. But the lattice does not direclty measure the similarity of the input and output string. To compute that, we compute the _shortest distance_ from all final state to the start state&mdash;the cost of the shortest path through the lattice. With default weights, this will simply be the minimal number of insertions, deletions, and substitutions needed to align the two strings; this metric is known as _Levenshtein distance_. We implement this below by deriving a new class, `LevenshteinDistance`, from the `EditTransducer` base class:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"class LevenshteinDistance(EditTransducer):\n",
" \"\"\"Edit transducer augmented with a distance calculator.\"\"\"\n",
"\n",
" def distance(self, iset, oset):\n",
" \"\"\"Computes minimum distance.\n",
"\n",
" This method computes, for a pair of input/output strings or acceptors, the\n",
" minimum edit distance according to the underlying edit transducer.\n",
"\n",
" Args:\n",
" iset: input string or acceptor.\n",
" oset: output string or acceptor.\n",
"\n",
" Returns:\n",
" Minimum edit distance according to the edit transducer.\n",
" \"\"\"\n",
" lattice = self._create_lattice(iset, oset)\n",
" start_state = lattice.start()\n",
" # The shortest cost from all final states to the start state is\n",
" # equivalent to the cost of the shortest path.\n",
" return float(shortestdistance(lattice, reverse=True)[start_state])"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The following example shows computation of edit distance for the strings `abba`, `baba`, which as we saw above, is `2.0` using the default costs."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"ld = LevenshteinDistance((\"a\", \"b\"))\n",
"assert ld.distance(s_i, s_o) == 2.0"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## From Levenshtein distance to Levenshtein automata"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Imagine we have a query string and wish to find the string or strings in a fixed _lexicon_ it is closest to, in terms of edit distance. This problem arises in many applications, including search and spelling correction. Naïvely, we could simply use `LevenshteinDistance` to compute the distance between the query and each word in the lexicon, though this approach scales very poorly. Schulz & Mihov (2002) describe an efficient alternative using a _Levenshtein automaton_. Now, quite a bit of ink has been spilled trying to\n",
"<a href=\"http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html\">understand Schulz & Mihov's paper</a>, but their idea is a straighforward extension of our `EditTransducer` base class."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Like an edit transducer, a Levenshtein automaton is a WFST representing a weighted relation over sets of strings in which domain of this relation is $\\Sigma^{*}$. However, in a Levenshtein automaton, the range is a subset thereof defined by a fixed, user-specified _lexicon_, a finite set of strings. The relation itself is specified by an edit transducer. Therefore, if we compose an input string with the Levenshtein transducer and prune all non-optimal paths, we obtain a lattice representing the alignment of the input string to all of the \"closest\" strings in the lexicon (according to the edit distance)."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"Using the two-factor edit transducer above, we can implement first this by compiling the lexicon into an FSA using Pynini's `string_map` function, which constructs the finite-state equivalent of a <a href=\"https://en.wikipedia.org/wiki/Trie\">prefix tree</a>:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [
{
"data": {
"image/svg+xml": [
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n",
"<!DOCTYPE svg PUBLIC \"-//W3C//DTD SVG 1.1//EN\"\n",
" \"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\">\n",
"<!-- Generated by graphviz version 2.36.0 (20140111.2315)\n",
" -->\n",
"<!-- Title: FST Pages: 1 -->\n",
"<svg width=\"612pt\" height=\"279pt\"\n",
" viewBox=\"0.00 0.00 612.00 279.19\" xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\">\n",
"<g id=\"graph0\" class=\"graph\" transform=\"scale(0.78866 0.78866) rotate(0) translate(4 350)\">\n",
"<title>FST</title>\n",
"<polygon fill=\"white\" stroke=\"none\" points=\"-4,4 -4,-350 772,-350 772,4 -4,4\"/>\n",
"<!-- 0 -->\n",
"<g id=\"node1\" class=\"node\"><title>0</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" stroke-width=\"2\" cx=\"18\" cy=\"-221\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"18\" y=\"-217.3\" font-family=\"Times,serif\" font-size=\"14.00\">0</text>\n",
"</g>\n",
"<!-- 3 -->\n",
"<g id=\"node2\" class=\"node\"><title>3</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"92\" cy=\"-296\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"92\" y=\"-292.3\" font-family=\"Times,serif\" font-size=\"14.00\">3</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;3 -->\n",
"<g id=\"edge1\" class=\"edge\"><title>0&#45;&gt;3</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M31.2138,-233.723C42.3486,-245.321 58.9786,-262.644 71.8747,-276.078\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"69.5163,-278.675 78.9666,-283.465 74.5661,-273.827 69.5163,-278.675\"/>\n",
"<text text-anchor=\"middle\" x=\"55\" y=\"-263.8\" font-family=\"Times,serif\" font-size=\"14.00\">r</text>\n",
"</g>\n",
"<!-- 2 -->\n",
"<g id=\"node3\" class=\"node\"><title>2</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"92\" cy=\"-221\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"92\" y=\"-217.3\" font-family=\"Times,serif\" font-size=\"14.00\">2</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;2 -->\n",
"<g id=\"edge2\" class=\"edge\"><title>0&#45;&gt;2</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M36.063,-221C44.1928,-221 54.1239,-221 63.2953,-221\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"63.5594,-224.5 73.5593,-221 63.5593,-217.5 63.5594,-224.5\"/>\n",
"<text text-anchor=\"middle\" x=\"55\" y=\"-224.8\" font-family=\"Times,serif\" font-size=\"14.00\">g</text>\n",
"</g>\n",
"<!-- 1 -->\n",
"<g id=\"node4\" class=\"node\"><title>1</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"92\" cy=\"-143\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"92\" y=\"-139.3\" font-family=\"Times,serif\" font-size=\"14.00\">1</text>\n",
"</g>\n",
"<!-- 0&#45;&gt;1 -->\n",
"<g id=\"edge3\" class=\"edge\"><title>0&#45;&gt;1</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M30.8893,-208.12C42.1453,-195.926 59.2213,-177.427 72.3004,-163.258\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"74.882,-165.621 79.093,-155.899 69.7383,-160.873 74.882,-165.621\"/>\n",
"<text text-anchor=\"middle\" x=\"55\" y=\"-188.8\" font-family=\"Times,serif\" font-size=\"14.00\">c</text>\n",
"</g>\n",
"<!-- 8 -->\n",
"<g id=\"node9\" class=\"node\"><title>8</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"166\" cy=\"-316\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"166\" y=\"-312.3\" font-family=\"Times,serif\" font-size=\"14.00\">8</text>\n",
"</g>\n",
"<!-- 3&#45;&gt;8 -->\n",
"<g id=\"edge8\" class=\"edge\"><title>3&#45;&gt;8</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M109.699,-300.639C118.241,-303.011 128.842,-305.956 138.469,-308.63\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"137.819,-312.082 148.391,-311.386 139.693,-305.338 137.819,-312.082\"/>\n",
"<text text-anchor=\"middle\" x=\"129\" y=\"-309.8\" font-family=\"Times,serif\" font-size=\"14.00\">o</text>\n",
"</g>\n",
"<!-- 7 -->\n",
"<g id=\"node7\" class=\"node\"><title>7</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"166\" cy=\"-261\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"166\" y=\"-257.3\" font-family=\"Times,serif\" font-size=\"14.00\">7</text>\n",
"</g>\n",
"<!-- 2&#45;&gt;7 -->\n",
"<g id=\"edge6\" class=\"edge\"><title>2&#45;&gt;7</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M108.269,-229.483C117.72,-234.733 130.061,-241.59 140.819,-247.566\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"139.311,-250.732 149.753,-252.529 142.711,-244.613 139.311,-250.732\"/>\n",
"<text text-anchor=\"middle\" x=\"129\" y=\"-245.8\" font-family=\"Times,serif\" font-size=\"14.00\">r</text>\n",
"</g>\n",
"<!-- 6 -->\n",
"<g id=\"node8\" class=\"node\"><title>6</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"166\" cy=\"-207\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"166\" y=\"-203.3\" font-family=\"Times,serif\" font-size=\"14.00\">6</text>\n",
"</g>\n",
"<!-- 2&#45;&gt;6 -->\n",
"<g id=\"edge7\" class=\"edge\"><title>2&#45;&gt;6</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M110.063,-217.682C118.402,-216.061 128.636,-214.071 138.002,-212.25\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"138.911,-215.638 148.059,-210.294 137.575,-208.767 138.911,-215.638\"/>\n",
"<text text-anchor=\"middle\" x=\"129\" y=\"-217.8\" font-family=\"Times,serif\" font-size=\"14.00\">o</text>\n",
"</g>\n",
"<!-- 5 -->\n",
"<g id=\"node5\" class=\"node\"><title>5</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"166\" cy=\"-143\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"166\" y=\"-139.3\" font-family=\"Times,serif\" font-size=\"14.00\">5</text>\n",
"</g>\n",
"<!-- 1&#45;&gt;5 -->\n",
"<g id=\"edge4\" class=\"edge\"><title>1&#45;&gt;5</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M110.063,-143C118.193,-143 128.124,-143 137.295,-143\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"137.559,-146.5 147.559,-143 137.559,-139.5 137.559,-146.5\"/>\n",
"<text text-anchor=\"middle\" x=\"129\" y=\"-146.8\" font-family=\"Times,serif\" font-size=\"14.00\">h</text>\n",
"</g>\n",
"<!-- 4 -->\n",
"<g id=\"node6\" class=\"node\"><title>4</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"166\" cy=\"-86\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"166\" y=\"-82.3\" font-family=\"Times,serif\" font-size=\"14.00\">4</text>\n",
"</g>\n",
"<!-- 1&#45;&gt;4 -->\n",
"<g id=\"edge5\" class=\"edge\"><title>1&#45;&gt;4</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M106.543,-132.279C116.77,-124.182 131.023,-112.898 142.895,-103.5\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"145.359,-106.013 151.027,-97.062 141.014,-100.525 145.359,-106.013\"/>\n",
"<text text-anchor=\"middle\" x=\"129\" y=\"-120.8\" font-family=\"Times,serif\" font-size=\"14.00\">a</text>\n",
"</g>\n",
"<!-- 11 -->\n",
"<g id=\"node12\" class=\"node\"><title>11</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"246\" cy=\"-143\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"246\" y=\"-139.3\" font-family=\"Times,serif\" font-size=\"14.00\">11</text>\n",
"</g>\n",
"<!-- 5&#45;&gt;11 -->\n",
"<g id=\"edge11\" class=\"edge\"><title>5&#45;&gt;11</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M184.311,-143C193.669,-143 205.469,-143 216.171,-143\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"216.376,-146.5 226.376,-143 216.376,-139.5 216.376,-146.5\"/>\n",
"<text text-anchor=\"middle\" x=\"205\" y=\"-146.8\" font-family=\"Times,serif\" font-size=\"14.00\">e</text>\n",
"</g>\n",
"<!-- 10 -->\n",
"<g id=\"node10\" class=\"node\"><title>10</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"246\" cy=\"-86\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"246\" y=\"-82.3\" font-family=\"Times,serif\" font-size=\"14.00\">10</text>\n",
"</g>\n",
"<!-- 4&#45;&gt;10 -->\n",
"<g id=\"edge9\" class=\"edge\"><title>4&#45;&gt;10</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M184.311,-86C193.669,-86 205.469,-86 216.171,-86\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"216.376,-89.5001 226.376,-86 216.376,-82.5001 216.376,-89.5001\"/>\n",
"<text text-anchor=\"middle\" x=\"205\" y=\"-89.8\" font-family=\"Times,serif\" font-size=\"14.00\">m</text>\n",
"</g>\n",
"<!-- 9 -->\n",
"<g id=\"node11\" class=\"node\"><title>9</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"246\" cy=\"-31\" rx=\"18\" ry=\"18\"/>\n",
"<text text-anchor=\"middle\" x=\"246\" y=\"-27.3\" font-family=\"Times,serif\" font-size=\"14.00\">9</text>\n",
"</g>\n",
"<!-- 4&#45;&gt;9 -->\n",
"<g id=\"edge10\" class=\"edge\"><title>4&#45;&gt;9</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M181.306,-75.9122C192.851,-67.7716 209.3,-56.1728 222.59,-46.8017\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"224.629,-49.6471 230.784,-41.0239 220.595,-43.9262 224.629,-49.6471\"/>\n",
"<text text-anchor=\"middle\" x=\"205\" y=\"-67.8\" font-family=\"Times,serif\" font-size=\"14.00\">i</text>\n",
"</g>\n",
"<!-- 13 -->\n",
"<g id=\"node14\" class=\"node\"><title>13</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"246\" cy=\"-263\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"246\" y=\"-259.3\" font-family=\"Times,serif\" font-size=\"14.00\">13</text>\n",
"</g>\n",
"<!-- 7&#45;&gt;13 -->\n",
"<g id=\"edge13\" class=\"edge\"><title>7&#45;&gt;13</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M184.311,-261.444C193.669,-261.684 205.469,-261.986 216.171,-262.261\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"216.289,-265.765 226.376,-262.522 216.469,-258.767 216.289,-265.765\"/>\n",
"<text text-anchor=\"middle\" x=\"205\" y=\"-265.8\" font-family=\"Times,serif\" font-size=\"14.00\">u</text>\n",
"</g>\n",
"<!-- 12 -->\n",
"<g id=\"node13\" class=\"node\"><title>12</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"246\" cy=\"-206\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"246\" y=\"-202.3\" font-family=\"Times,serif\" font-size=\"14.00\">12</text>\n",
"</g>\n",
"<!-- 6&#45;&gt;12 -->\n",
"<g id=\"edge12\" class=\"edge\"><title>6&#45;&gt;12</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M184.311,-206.778C193.669,-206.658 205.469,-206.507 216.171,-206.37\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"216.422,-209.867 226.376,-206.239 216.332,-202.867 216.422,-209.867\"/>\n",
"<text text-anchor=\"middle\" x=\"205\" y=\"-210.8\" font-family=\"Times,serif\" font-size=\"14.00\">u</text>\n",
"</g>\n",
"<!-- 14 -->\n",
"<g id=\"node15\" class=\"node\"><title>14</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"246\" cy=\"-320\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"246\" y=\"-316.3\" font-family=\"Times,serif\" font-size=\"14.00\">14</text>\n",
"</g>\n",
"<!-- 8&#45;&gt;14 -->\n",
"<g id=\"edge14\" class=\"edge\"><title>8&#45;&gt;14</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M184.311,-316.888C193.669,-317.368 205.469,-317.973 216.171,-318.522\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"216.21,-322.028 226.376,-319.045 216.568,-315.037 216.21,-322.028\"/>\n",
"<text text-anchor=\"middle\" x=\"205\" y=\"-321.8\" font-family=\"Times,serif\" font-size=\"14.00\">q</text>\n",
"</g>\n",
"<!-- 16 -->\n",
"<g id=\"node17\" class=\"node\"><title>16</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"324\" cy=\"-86\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"324\" y=\"-82.3\" font-family=\"Times,serif\" font-size=\"14.00\">16</text>\n",
"</g>\n",
"<!-- 10&#45;&gt;16 -->\n",
"<g id=\"edge16\" class=\"edge\"><title>10&#45;&gt;16</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M265.787,-86C274.346,-86 284.67,-86 294.178,-86\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"294.43,-89.5001 304.43,-86 294.43,-82.5001 294.43,-89.5001\"/>\n",
"<text text-anchor=\"middle\" x=\"285\" y=\"-89.8\" font-family=\"Times,serif\" font-size=\"14.00\">e</text>\n",
"</g>\n",
"<!-- 15 -->\n",
"<g id=\"node16\" class=\"node\"><title>15</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"324\" cy=\"-29\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"324\" y=\"-25.3\" font-family=\"Times,serif\" font-size=\"14.00\">15</text>\n",
"</g>\n",
"<!-- 9&#45;&gt;15 -->\n",
"<g id=\"edge15\" class=\"edge\"><title>9&#45;&gt;15</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M264.246,-30.5462C273.152,-30.3118 284.252,-30.0197 294.398,-29.7527\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"294.59,-33.2489 304.495,-29.487 294.406,-26.2514 294.59,-33.2489\"/>\n",
"<text text-anchor=\"middle\" x=\"285\" y=\"-34.8\" font-family=\"Times,serif\" font-size=\"14.00\">t</text>\n",
"</g>\n",
"<!-- 17 -->\n",
"<g id=\"node18\" class=\"node\"><title>17</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"324\" cy=\"-143\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"324\" y=\"-139.3\" font-family=\"Times,serif\" font-size=\"14.00\">17</text>\n",
"</g>\n",
"<!-- 11&#45;&gt;17 -->\n",
"<g id=\"edge17\" class=\"edge\"><title>11&#45;&gt;17</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M265.787,-143C274.346,-143 284.67,-143 294.178,-143\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"294.43,-146.5 304.43,-143 294.43,-139.5 294.43,-146.5\"/>\n",
"<text text-anchor=\"middle\" x=\"285\" y=\"-146.8\" font-family=\"Times,serif\" font-size=\"14.00\">s</text>\n",
"</g>\n",
"<!-- 18 -->\n",
"<g id=\"node19\" class=\"node\"><title>18</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"324\" cy=\"-206\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"324\" y=\"-202.3\" font-family=\"Times,serif\" font-size=\"14.00\">18</text>\n",
"</g>\n",
"<!-- 12&#45;&gt;18 -->\n",
"<g id=\"edge18\" class=\"edge\"><title>12&#45;&gt;18</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M265.787,-206C274.346,-206 284.67,-206 294.178,-206\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"294.43,-209.5 304.43,-206 294.43,-202.5 294.43,-209.5\"/>\n",
"<text text-anchor=\"middle\" x=\"285\" y=\"-209.8\" font-family=\"Times,serif\" font-size=\"14.00\">d</text>\n",
"</g>\n",
"<!-- 19 -->\n",
"<g id=\"node20\" class=\"node\"><title>19</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"324\" cy=\"-263\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"324\" y=\"-259.3\" font-family=\"Times,serif\" font-size=\"14.00\">19</text>\n",
"</g>\n",
"<!-- 13&#45;&gt;19 -->\n",
"<g id=\"edge19\" class=\"edge\"><title>13&#45;&gt;19</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M265.787,-263C274.346,-263 284.67,-263 294.178,-263\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"294.43,-266.5 304.43,-263 294.43,-259.5 294.43,-266.5\"/>\n",
"<text text-anchor=\"middle\" x=\"285\" y=\"-266.8\" font-family=\"Times,serif\" font-size=\"14.00\">y</text>\n",
"</g>\n",
"<!-- 20 -->\n",
"<g id=\"node21\" class=\"node\"><title>20</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"324\" cy=\"-320\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"324\" y=\"-316.3\" font-family=\"Times,serif\" font-size=\"14.00\">20</text>\n",
"</g>\n",
"<!-- 14&#45;&gt;20 -->\n",
"<g id=\"edge20\" class=\"edge\"><title>14&#45;&gt;20</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M265.787,-320C274.346,-320 284.67,-320 294.178,-320\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"294.43,-323.5 304.43,-320 294.43,-316.5 294.43,-323.5\"/>\n",
"<text text-anchor=\"middle\" x=\"285\" y=\"-323.8\" font-family=\"Times,serif\" font-size=\"14.00\">u</text>\n",
"</g>\n",
"<!-- 21 -->\n",
"<g id=\"node22\" class=\"node\"><title>21</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"410\" cy=\"-29\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"410\" y=\"-25.3\" font-family=\"Times,serif\" font-size=\"14.00\">21</text>\n",
"</g>\n",
"<!-- 15&#45;&gt;21 -->\n",
"<g id=\"edge21\" class=\"edge\"><title>15&#45;&gt;21</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M343.642,-29C354.409,-29 368.178,-29 380.299,-29\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"380.41,-32.5001 390.41,-29 380.41,-25.5001 380.41,-32.5001\"/>\n",
"<text text-anchor=\"middle\" x=\"365\" y=\"-32.8\" font-family=\"Times,serif\" font-size=\"14.00\">h</text>\n",
"</g>\n",
"<!-- 22 -->\n",
"<g id=\"node23\" class=\"node\"><title>22</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"410\" cy=\"-86\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"410\" y=\"-82.3\" font-family=\"Times,serif\" font-size=\"14.00\">22</text>\n",
"</g>\n",
"<!-- 16&#45;&gt;22 -->\n",
"<g id=\"edge22\" class=\"edge\"><title>16&#45;&gt;22</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M343.642,-86C354.409,-86 368.178,-86 380.299,-86\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"380.41,-89.5001 390.41,-86 380.41,-82.5001 380.41,-89.5001\"/>\n",
"<text text-anchor=\"middle\" x=\"365\" y=\"-89.8\" font-family=\"Times,serif\" font-size=\"14.00\">m</text>\n",
"</g>\n",
"<!-- 23 -->\n",
"<g id=\"node24\" class=\"node\"><title>23</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"410\" cy=\"-143\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"410\" y=\"-139.3\" font-family=\"Times,serif\" font-size=\"14.00\">23</text>\n",
"</g>\n",
"<!-- 17&#45;&gt;23 -->\n",
"<g id=\"edge23\" class=\"edge\"><title>17&#45;&gt;23</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M343.642,-143C354.409,-143 368.178,-143 380.299,-143\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"380.41,-146.5 390.41,-143 380.41,-139.5 380.41,-146.5\"/>\n",
"<text text-anchor=\"middle\" x=\"365\" y=\"-146.8\" font-family=\"Times,serif\" font-size=\"14.00\">h</text>\n",
"</g>\n",
"<!-- 24 -->\n",
"<g id=\"node25\" class=\"node\"><title>24</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"410\" cy=\"-204\" rx=\"19.4968\" ry=\"19.4968\"/>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"410\" cy=\"-204\" rx=\"23.4965\" ry=\"23.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"410\" y=\"-200.3\" font-family=\"Times,serif\" font-size=\"14.00\">24</text>\n",
"</g>\n",
"<!-- 18&#45;&gt;24 -->\n",
"<g id=\"edge24\" class=\"edge\"><title>18&#45;&gt;24</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M343.642,-205.556C353.166,-205.329 365.038,-205.047 376.042,-204.785\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"376.306,-208.28 386.22,-204.542 376.139,-201.281 376.306,-208.28\"/>\n",
"<text text-anchor=\"middle\" x=\"365\" y=\"-209.8\" font-family=\"Times,serif\" font-size=\"14.00\">a</text>\n",
"</g>\n",
"<!-- 25 -->\n",
"<g id=\"node26\" class=\"node\"><title>25</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"410\" cy=\"-265\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"410\" y=\"-261.3\" font-family=\"Times,serif\" font-size=\"14.00\">25</text>\n",
"</g>\n",
"<!-- 19&#45;&gt;25 -->\n",
"<g id=\"edge25\" class=\"edge\"><title>19&#45;&gt;25</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M343.642,-263.444C354.409,-263.7 368.178,-264.028 380.299,-264.317\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"380.329,-267.818 390.41,-264.557 380.496,-260.82 380.329,-267.818\"/>\n",
"<text text-anchor=\"middle\" x=\"365\" y=\"-267.8\" font-family=\"Times,serif\" font-size=\"14.00\">e</text>\n",
"</g>\n",
"<!-- 26 -->\n",
"<g id=\"node27\" class=\"node\"><title>26</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"410\" cy=\"-322\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"410\" y=\"-318.3\" font-family=\"Times,serif\" font-size=\"14.00\">26</text>\n",
"</g>\n",
"<!-- 20&#45;&gt;26 -->\n",
"<g id=\"edge26\" class=\"edge\"><title>20&#45;&gt;26</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M343.642,-320.444C354.409,-320.7 368.178,-321.028 380.299,-321.317\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"380.329,-324.818 390.41,-321.557 380.496,-317.82 380.329,-324.818\"/>\n",
"<text text-anchor=\"middle\" x=\"365\" y=\"-324.8\" font-family=\"Times,serif\" font-size=\"14.00\">e</text>\n",
"</g>\n",
"<!-- 27 -->\n",
"<g id=\"node28\" class=\"node\"><title>27</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"492\" cy=\"-29\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"492\" y=\"-25.3\" font-family=\"Times,serif\" font-size=\"14.00\">27</text>\n",
"</g>\n",
"<!-- 21&#45;&gt;27 -->\n",
"<g id=\"edge27\" class=\"edge\"><title>21&#45;&gt;27</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M429.555,-29C439.179,-29 451.154,-29 461.97,-29\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"462.269,-32.5001 472.269,-29 462.269,-25.5001 462.269,-32.5001\"/>\n",
"<text text-anchor=\"middle\" x=\"453\" y=\"-32.8\" font-family=\"Times,serif\" font-size=\"14.00\">n</text>\n",
"</g>\n",
"<!-- 28 -->\n",
"<g id=\"node29\" class=\"node\"><title>28</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"492\" cy=\"-86\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"492\" y=\"-82.3\" font-family=\"Times,serif\" font-size=\"14.00\">28</text>\n",
"</g>\n",
"<!-- 22&#45;&gt;28 -->\n",
"<g id=\"edge28\" class=\"edge\"><title>22&#45;&gt;28</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M429.555,-86C439.179,-86 451.154,-86 461.97,-86\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"462.269,-89.5001 472.269,-86 462.269,-82.5001 462.269,-89.5001\"/>\n",
"<text text-anchor=\"middle\" x=\"453\" y=\"-89.8\" font-family=\"Times,serif\" font-size=\"14.00\">b</text>\n",
"</g>\n",
"<!-- 29 -->\n",
"<g id=\"node30\" class=\"node\"><title>29</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"492\" cy=\"-143\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"492\" y=\"-139.3\" font-family=\"Times,serif\" font-size=\"14.00\">29</text>\n",
"</g>\n",
"<!-- 23&#45;&gt;29 -->\n",
"<g id=\"edge29\" class=\"edge\"><title>23&#45;&gt;29</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M429.555,-143C439.179,-143 451.154,-143 461.97,-143\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"462.269,-146.5 472.269,-143 462.269,-139.5 462.269,-146.5\"/>\n",
"<text text-anchor=\"middle\" x=\"453\" y=\"-146.8\" font-family=\"Times,serif\" font-size=\"14.00\">i</text>\n",
"</g>\n",
"<!-- 30 -->\n",
"<g id=\"node31\" class=\"node\"><title>30</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"492\" cy=\"-263\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"492\" y=\"-259.3\" font-family=\"Times,serif\" font-size=\"14.00\">30</text>\n",
"</g>\n",
"<!-- 25&#45;&gt;30 -->\n",
"<g id=\"edge30\" class=\"edge\"><title>25&#45;&gt;30</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M429.555,-264.536C439.179,-264.296 451.154,-263.996 461.97,-263.726\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"462.359,-267.217 472.269,-263.468 462.184,-260.219 462.359,-267.217\"/>\n",
"<text text-anchor=\"middle\" x=\"453\" y=\"-267.8\" font-family=\"Times,serif\" font-size=\"14.00\">r</text>\n",
"</g>\n",
"<!-- 31 -->\n",
"<g id=\"node32\" class=\"node\"><title>31</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"492\" cy=\"-322\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"492\" y=\"-318.3\" font-family=\"Times,serif\" font-size=\"14.00\">31</text>\n",
"</g>\n",
"<!-- 26&#45;&gt;31 -->\n",
"<g id=\"edge31\" class=\"edge\"><title>26&#45;&gt;31</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M429.555,-322C439.179,-322 451.154,-322 461.97,-322\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"462.269,-325.5 472.269,-322 462.269,-318.5 462.269,-325.5\"/>\n",
"<text text-anchor=\"middle\" x=\"453\" y=\"-325.8\" font-family=\"Times,serif\" font-size=\"14.00\">f</text>\n",
"</g>\n",
"<!-- 32 -->\n",
"<g id=\"node33\" class=\"node\"><title>32</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"574\" cy=\"-29\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"574\" y=\"-25.3\" font-family=\"Times,serif\" font-size=\"14.00\">32</text>\n",
"</g>\n",
"<!-- 27&#45;&gt;32 -->\n",
"<g id=\"edge32\" class=\"edge\"><title>27&#45;&gt;32</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M511.555,-29C521.179,-29 533.154,-29 543.97,-29\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"544.269,-32.5001 554.269,-29 544.269,-25.5001 544.269,-32.5001\"/>\n",
"<text text-anchor=\"middle\" x=\"531\" y=\"-32.8\" font-family=\"Times,serif\" font-size=\"14.00\">e</text>\n",
"</g>\n",
"<!-- 33 -->\n",
"<g id=\"node34\" class=\"node\"><title>33</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"574\" cy=\"-86\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"574\" y=\"-82.3\" font-family=\"Times,serif\" font-size=\"14.00\">33</text>\n",
"</g>\n",
"<!-- 28&#45;&gt;33 -->\n",
"<g id=\"edge33\" class=\"edge\"><title>28&#45;&gt;33</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M511.555,-86C521.179,-86 533.154,-86 543.97,-86\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"544.269,-89.5001 554.269,-86 544.269,-82.5001 544.269,-89.5001\"/>\n",
"<text text-anchor=\"middle\" x=\"531\" y=\"-89.8\" font-family=\"Times,serif\" font-size=\"14.00\">e</text>\n",
"</g>\n",
"<!-- 34 -->\n",
"<g id=\"node35\" class=\"node\"><title>34</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"574\" cy=\"-146\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"574\" y=\"-142.3\" font-family=\"Times,serif\" font-size=\"14.00\">34</text>\n",
"</g>\n",
"<!-- 29&#45;&gt;34 -->\n",
"<g id=\"edge34\" class=\"edge\"><title>29&#45;&gt;34</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M511.555,-143.696C521.179,-144.057 533.154,-144.506 543.97,-144.911\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"544.145,-148.42 554.269,-145.298 544.407,-141.425 544.145,-148.42\"/>\n",
"<text text-anchor=\"middle\" x=\"531\" y=\"-148.8\" font-family=\"Times,serif\" font-size=\"14.00\">r</text>\n",
"</g>\n",
"<!-- 35 -->\n",
"<g id=\"node36\" class=\"node\"><title>35</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"574\" cy=\"-262\" rx=\"19.4968\" ry=\"19.4968\"/>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"574\" cy=\"-262\" rx=\"23.4965\" ry=\"23.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"574\" y=\"-258.3\" font-family=\"Times,serif\" font-size=\"14.00\">35</text>\n",
"</g>\n",
"<!-- 30&#45;&gt;35 -->\n",
"<g id=\"edge35\" class=\"edge\"><title>30&#45;&gt;35</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M511.555,-262.768C520.029,-262.662 530.327,-262.533 540.05,-262.412\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"540.307,-265.909 550.262,-262.284 540.219,-258.91 540.307,-265.909\"/>\n",
"<text text-anchor=\"middle\" x=\"531\" y=\"-265.8\" font-family=\"Times,serif\" font-size=\"14.00\">e</text>\n",
"</g>\n",
"<!-- 36 -->\n",
"<g id=\"node37\" class=\"node\"><title>36</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"574\" cy=\"-323\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"574\" y=\"-319.3\" font-family=\"Times,serif\" font-size=\"14.00\">36</text>\n",
"</g>\n",
"<!-- 31&#45;&gt;36 -->\n",
"<g id=\"edge36\" class=\"edge\"><title>31&#45;&gt;36</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M511.555,-322.232C521.179,-322.352 533.154,-322.502 543.97,-322.637\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"544.226,-326.141 554.269,-322.766 544.313,-319.141 544.226,-326.141\"/>\n",
"<text text-anchor=\"middle\" x=\"531\" y=\"-325.8\" font-family=\"Times,serif\" font-size=\"14.00\">o</text>\n",
"</g>\n",
"<!-- 37 -->\n",
"<g id=\"node38\" class=\"node\"><title>37</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"660\" cy=\"-27\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"660\" y=\"-23.3\" font-family=\"Times,serif\" font-size=\"14.00\">37</text>\n",
"</g>\n",
"<!-- 32&#45;&gt;37 -->\n",
"<g id=\"edge37\" class=\"edge\"><title>32&#45;&gt;37</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M593.642,-28.5561C604.409,-28.2998 618.178,-27.972 630.299,-27.6834\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"630.496,-31.1798 640.41,-27.4426 630.329,-24.1817 630.496,-31.1798\"/>\n",
"<text text-anchor=\"middle\" x=\"617\" y=\"-32.8\" font-family=\"Times,serif\" font-size=\"14.00\">s</text>\n",
"</g>\n",
"<!-- 38 -->\n",
"<g id=\"node39\" class=\"node\"><title>38</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"660\" cy=\"-86\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"660\" y=\"-82.3\" font-family=\"Times,serif\" font-size=\"14.00\">38</text>\n",
"</g>\n",
"<!-- 33&#45;&gt;38 -->\n",
"<g id=\"edge38\" class=\"edge\"><title>33&#45;&gt;38</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M593.642,-86C604.409,-86 618.178,-86 630.299,-86\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"630.41,-89.5001 640.41,-86 630.41,-82.5001 630.41,-89.5001\"/>\n",
"<text text-anchor=\"middle\" x=\"617\" y=\"-89.8\" font-family=\"Times,serif\" font-size=\"14.00\">r</text>\n",
"</g>\n",
"<!-- 39 -->\n",
"<g id=\"node40\" class=\"node\"><title>39</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"660\" cy=\"-147\" rx=\"19.4968\" ry=\"19.4968\"/>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"660\" cy=\"-147\" rx=\"23.4965\" ry=\"23.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"660\" y=\"-143.3\" font-family=\"Times,serif\" font-size=\"14.00\">39</text>\n",
"</g>\n",
"<!-- 34&#45;&gt;39 -->\n",
"<g id=\"edge39\" class=\"edge\"><title>34&#45;&gt;39</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M593.642,-146.222C603.166,-146.335 615.038,-146.477 626.042,-146.608\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"626.179,-150.109 636.22,-146.729 626.262,-143.11 626.179,-150.109\"/>\n",
"<text text-anchor=\"middle\" x=\"617\" y=\"-150.8\" font-family=\"Times,serif\" font-size=\"14.00\">e</text>\n",
"</g>\n",
"<!-- 40 -->\n",
"<g id=\"node41\" class=\"node\"><title>40</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"660\" cy=\"-323\" rx=\"19.4965\" ry=\"19.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"660\" y=\"-319.3\" font-family=\"Times,serif\" font-size=\"14.00\">40</text>\n",
"</g>\n",
"<!-- 36&#45;&gt;40 -->\n",
"<g id=\"edge40\" class=\"edge\"><title>36&#45;&gt;40</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M593.642,-323C604.409,-323 618.178,-323 630.299,-323\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"630.41,-326.5 640.41,-323 630.41,-319.5 630.41,-326.5\"/>\n",
"<text text-anchor=\"middle\" x=\"617\" y=\"-326.8\" font-family=\"Times,serif\" font-size=\"14.00\">r</text>\n",
"</g>\n",
"<!-- 41 -->\n",
"<g id=\"node42\" class=\"node\"><title>41</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"744\" cy=\"-23\" rx=\"19.4968\" ry=\"19.4968\"/>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"744\" cy=\"-23\" rx=\"23.4965\" ry=\"23.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"744\" y=\"-19.3\" font-family=\"Times,serif\" font-size=\"14.00\">41</text>\n",
"</g>\n",
"<!-- 37&#45;&gt;41 -->\n",
"<g id=\"edge41\" class=\"edge\"><title>37&#45;&gt;41</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M679.607,-26.0923C688.68,-25.6498 699.862,-25.1043 710.297,-24.5953\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"710.559,-28.0868 720.376,-24.1036 710.217,-21.0951 710.559,-28.0868\"/>\n",
"<text text-anchor=\"middle\" x=\"702\" y=\"-29.8\" font-family=\"Times,serif\" font-size=\"14.00\">s</text>\n",
"</g>\n",
"<!-- 42 -->\n",
"<g id=\"node43\" class=\"node\"><title>42</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"744\" cy=\"-88\" rx=\"19.4968\" ry=\"19.4968\"/>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"744\" cy=\"-88\" rx=\"23.4965\" ry=\"23.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"744\" y=\"-84.3\" font-family=\"Times,serif\" font-size=\"14.00\">42</text>\n",
"</g>\n",
"<!-- 38&#45;&gt;42 -->\n",
"<g id=\"edge42\" class=\"edge\"><title>38&#45;&gt;42</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M679.607,-86.4538C688.68,-86.6751 699.862,-86.9479 710.297,-87.2024\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"710.294,-90.7032 720.376,-87.4482 710.464,-83.7053 710.294,-90.7032\"/>\n",
"<text text-anchor=\"middle\" x=\"702\" y=\"-91.8\" font-family=\"Times,serif\" font-size=\"14.00\">t</text>\n",
"</g>\n",
"<!-- 43 -->\n",
"<g id=\"node44\" class=\"node\"><title>43</title>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"744\" cy=\"-323\" rx=\"19.4968\" ry=\"19.4968\"/>\n",
"<ellipse fill=\"none\" stroke=\"black\" cx=\"744\" cy=\"-323\" rx=\"23.4965\" ry=\"23.4965\"/>\n",
"<text text-anchor=\"middle\" x=\"744\" y=\"-319.3\" font-family=\"Times,serif\" font-size=\"14.00\">43</text>\n",
"</g>\n",
"<!-- 40&#45;&gt;43 -->\n",
"<g id=\"edge43\" class=\"edge\"><title>40&#45;&gt;43</title>\n",
"<path fill=\"none\" stroke=\"black\" d=\"M679.607,-323C688.68,-323 699.862,-323 710.297,-323\"/>\n",
"<polygon fill=\"black\" stroke=\"black\" points=\"710.376,-326.5 720.376,-323 710.376,-319.5 710.376,-326.5\"/>\n",
"<text text-anchor=\"middle\" x=\"702\" y=\"-326.8\" font-family=\"Times,serif\" font-size=\"14.00\">t</text>\n",
"</g>\n",
"</g>\n",
"</svg>\n"
],
"text/plain": [
"<vector Fst at 0x7f88dc0477a0>"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lexicon = (\"gruyere\", \"cheshire\", \"roquefort\",\n",
" \"camembert\", \"gouda\", \"caithness\")\n",
"compiled_lexicon = string_map(lexicon)\n",
"compiled_lexicon"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We can then use this machine in place of $s_o$, the output-side string used to construct the lattice. Recall that the full two-factor lattice for two pairs of strings $s_i$ and $s_o$ was given by:\n",
"\n",
"$$L' = (s_i~\\circ~\\mathcal{E}_i)~\\circ~(\\mathcal{E}_o~\\circ~s_o)$$\n",
"\n",
"In a Levenshtein transducer, we simply replace $s_o$ with a lexicon acceptor. Assuming the lexicon is fixed, it is not necessary to recompute the $\\mathcal{E}_o~\\circ~s_o$ term; instead, we compiling the lexicon and composing the right-hand side of the right factor with it, and then store the result for reuse. Once we have computed the lattice $L'$, it is straightforward to compute the Levenshtein distance to the nearest element(s) in the lexicon. Pynini's `shortestpath` function can be used to compute (one of) the nearest strings in the lexicon. To obtain all of the closest strings, we simply prune the lattice of all suboptimal paths, optimize, and then enumerate the strings in the pruned lattice.\n",
"This is accomplished below by deriving a new class, `LevenshteinDistance`, from `LevenshteinDistance`:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": true,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"class LevenshteinAutomaton(LevenshteinDistance):\n",
" \"\"\"Edit transducer with a fixed output lexicon and closest-match methods.\"\"\"\n",
"\n",
" def __init__(self,\n",
" alphabet,\n",
" lexicon,\n",
" insert_cost=DEFAULT_INSERT_COST,\n",
" delete_cost=DEFAULT_DELETE_COST,\n",
" substitute_cost=DEFAULT_SUBSTITUTE_COST):\n",
" super(LevenshteinAutomaton, self).__init__(alphabet, insert_cost,\n",
" delete_cost, substitute_cost)\n",
" # Compiles lexicon and composes the right factor with it.\n",
" compiled_lexicon = string_map(lexicon)\n",
" self._l_o = compose(self._e_o, compiled_lexicon)\n",
" self._l_o.optimize(True)\n",
"\n",
" def _create_levenshtein_automaton_lattice(self, query):\n",
" \"\"\"Constructs a lattice for a query string.\n",
"\n",
" Args:\n",
" query: input string or acceptor.\n",
"\n",
" Returns:\n",
" A lattice FST.\n",
" \"\"\"\n",
" l_i = compose(query, self._e_i)\n",
" lattice = compose(l_i, self._l_o)\n",
" EditTransducer.check_wellformed_lattice(lattice)\n",
" return lattice\n",
"\n",
" def closest_match(self, query):\n",
" \"\"\"Returns the closest string to the query in the lexicon.\n",
"\n",
" This method computes, for an input string or acceptor, the closest string\n",
" in the lexicon according to the underlying edit transducer. In the case of\n",
" a tie (i.e., where there are multiple closest strings), only one will be\n",
" returned; tie breaking is deterministic but difficult to reason about and\n",
" thus should be considered unspecified.) The `closest_matches` method can be\n",
" used to enumerate all the ties.\n",
"\n",
" Args:\n",
" query: input string or acceptor.\n",
"\n",
" Returns:\n",
" The closest string in the lexicon.\n",
" \"\"\"\n",
" lattice = self._create_levenshtein_automaton_lattice(query)\n",
" # For implementation reasons, the shortest path (when k = 1) is in reverse\n",
" # state order, so we perform a topological sort ahead of time.\n",
" return shortestpath(lattice).topsort().stringify()\n",
"\n",
" def closest_matches(self, query):\n",
" \"\"\"Returns all of the closest strings to the query in the lexicon.\n",
"\n",
" This method returns, for an input string or acceptor, the closest strings\n",
" in the lexicon according to the underlying edit transducer. A string is\n",
" \"closest\" if it has the same edit distance as the closest string. The order\n",
" in which the strings are returned is deterministic but difficult to reason\n",
" about and thus should be considered unspecified.\n",
"\n",
" Args:\n",
" query: input string or acceptor.\n",
"\n",
" Returns:\n",
" An iterator of the closest strings in the lexicon.\n",
" \"\"\"\n",
" lattice = self._create_levenshtein_automaton_lattice(query)\n",
" # Prunes all paths whose weights are worse than the best path.\n",
" lattice.prune(weight=0).project(True).optimize(True)\n",
" return lattice.paths().iter_ostrings()"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"The following example shows the Levenshtein automaton in action:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false,
"deletable": true,
"editable": true
},
"outputs": [],
"source": [
"la = LevenshteinAutomaton(string.ascii_lowercase, lexicon)\n",
"assert la.closest_match(\"rockford\") == \"roquefort\"\n",
"assert next(la.closest_matches(\"cheesesure\")) == \"cheshire\""
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Possible extensions"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"We have made the above code available as <a href=\"http://github.com/kylebgorman/EditTransducer\">EditTransducer</a>, an MIT-licensed Python library, and encourage readers to tinker with it. Here are several possible extensions, which we leave as exercises for the reader:\n",
"\n",
"* The edit transducer could be made to output the optimal alignments, to help users visualize the process.\n",
"* Currently, the edit transducer assigns the same cost to all insertions, to all deletions, and to all substitutions, respectively. However, one could also make it possible to vary the costs depending on the input and/or outputs. For instance, if we are attempting to expand abbreviations like `mtn` to full words like `mountain`, we might make insertion of vowels cheaper than that of consonants.\n",
"* In some applications, especially computational genomics, it is considered desirable to use an <a href=\"https://en.wikipedia.org/wiki/Gap_penalty#Affine\">_affine gap penalty_</a> assigning separate costs for \"opening\" a gap&mdash;i.e., starting a sequence of insertions or deletions&mdash;and for \"continuing\" the gap&mdash;i.e., for each insertion and deletion in the \"gap\". The edit transducer could be extended to support affine gap penalties.\n",
"* <a href=\"http://norvig.com/spell-correct.html\">This blog post</a> describes a spelling correction system which essentially consists of a limited form of a Levenshtein automaton (the implementation only supports two or fewer edits); in the case that there is more than one closest match in the lexicon, it selects the word with the highest frequency according to some user-provided corpus. `LevenshteinAutomaton` could be extended in various ways so as to support word-frequency-based tie-breaking.\n",
"* When we compose strings with the edit transducer factors, the strings are first implicitly compiled as FSAs where each arc is a byte in the input string; as is the default in Pynini, strings are assumed to be UTF-8 encoded unless otherwise specified. Thus when we compute the edit distance, we are computing it by comparing the underlying bytes between the input and output strings. However, Pynini also supports an\n",
"alternative modes of string compilation, including one in which <a href=\"http://openfst.cs.nyu.edu/twiki/bin/view/GRM/PyniniStringDoc\">each label is a Unicode codepoint</a>). This may be more appropriate for languages where glyphs are multibyte sequences. This mode could be optionally enabled in the constructor of `EditTransducer` and `LevenshteinAutomaton` and then used when compiling the lattice (and in the case of `LevenshteinAutomaton`, when compiling the lexicon)."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## Conclusions\n",
"\n",
"We have shown how to construct an edit transducer and to use it for computing edit distance and approximate string. The edit transducer formulation of these (and related) problems is efficient: composition with the factored edit transducer is linear in the size of the input strings or sets, and the shortest distance and path computations are quadratic in the size of the lattice. They are also expressive in that they support user-defined costs and could be extended to support arbitrary edit operations. Finally, by implementing these algorithms using generic finite-state operations, we delegate much of the difficult work to compiled code, carefully tested and extensively optimized, exposed by Pynini."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": true,
"editable": true
},
"source": [
"## References\n",
"\n",
"Gorman, K. 2016. Pynini: A Python library weighted finite-state grammar compilation.\n",
"In _Proceedings of the ACL Workshop on Statistical NLP and Weighted Automata_, pages 75-80.\n",
"\n",
"Schulz, K., and Mihov, S. 2002. Fast string correction with Levenshtein-automata.\n",
"_International Journal of Document Analysis and Recognition_ 5(1): 67-85.\n",
"\n",
"Wagner, R. A., and Fischer, M. J. 1974. The string-to-string correction problem.\n",
"_Journal of the ACM_ 21(1): 168-173."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment