This second post illustrates how to implement Norvig's spelling corrector in Java 9 following a functional style.
The next post [Implementing Norvig's Algo in Kotlin](../spelling-jvm-3-kotlin-implementation) presents an idiomatic Kotlin implementation of Norvig's spelling corrector.
The previous post ([Norvig's Approach to Spelling Correction](../spelling-jvm-1-introduction)) discusses Norvig's approach to generating typo corrections.
If you haven't already, taking a cursory look at Norvig's Python script may be useful in digesting this Java implementation.
Everything having to do with spelling correction revolves around a curated list of valid words we refer to as the the dictionary.
In our dictionary implementation we accompany every word with an integer rank indicating how
frequently the word is used in relation to all other words. The lower the rank, the more
frequently used the word is. Thus, for instance, the has rank 1
while triose has rank
106295
.
The dictionary declaration is:
https://gist.github.com/f7be3def0a1db84b22423ceddaae8323
In order to hold word splits we define an inner data class WordSplit as follows:
https://gist.github.com/d7dcdbdcd8c3f2257ff66ab973623d1e
Note: no getter/setter or other boilerplate needed. This is an immutable data class visible to the
SpellingCorrector
and its tests only.
Armed with this definition we can now formulate the rather trivial implementation of splits
:
https://gist.github.com/852bbad457f945490f770e7408f00813
The `split()` method takes a string (the typo) and returns all possible pairs of word splits. Thus, for our beloved but misspelt hero _dlbert_ we get the equivalent of:https://gist.github.com/1c211fe4f2b4220681db51d3aa8bb109
By the way, all four edit implementations take a List<WordSplit>
argument rather than a simple
List<String>
. All of them return a Stream<String>
, too.
IntStream.rangeClosed(int startInclude, int endInclusive)
generates an IntStream
. In our
example this means all integers between 0
and 6
. Because we want to generate a Stream<String>
(not an IntStream
) we apply boxed()
to the range IntStream
so it becomes a
Stream<Integer>
.
Subsequently, each successive integer value is used as a substring index for splitting the word
into a pair of left
and right
fragments. This Stream<String>
is finally collected as a
List<WordSplit>
.
Generating all possible deletes for a word is also quite simple:
https://gist.github.com/2382d7274c46ad371d665939bfacc41e
Here, we stream over the input List<WordPair>
selecting only the ones where right
is not
empty (which is equivalent to all splits but the very first one). Next, we concatenate each
word split into a string formed by the left
part followed by the "tail" of the right
one.
https://gist.github.com/223fb7de948c493d6d9e4491dd40b041
Inserts are the opposite of deletes and while their implementation is still quite simple:
https://gist.github.com/c9b7cb05ca8dd1e3dc033ba4cfbc83e3
In this method our friend LETTERS
makes its first appearance. Again for the sake of simplicity we
deal only with lowercase ASCII characters. There is no support for diacritics (whose implementation
is not complicated anyways). LETTERS
is statically defined as:
https://gist.github.com/fc347f27bc0a7403cbf3f03096f46d20
For our canine friend _dgbert_ `inserts()` would yield the equivalent of:Splitting on an empty regex results in each string's character becoming a 1-char string on its own right.
https://gist.github.com/85c03e29e234073e0e5f216ce80af765
Implementing transposition requires the right
fragment to have more than one character in length,
which makes sense as transposing involves not one but two letters:
https://gist.github.com/9abae5594fcc7afe588fc50caa4e7529
For our heroine _alice_ `transposes()` yields the equivalent of:https://gist.github.com/7b7eef6b4b1b7e4137056bccf5ece3e0
The replace
edit is very prolific requiring the substitution of every letter in the typo for
each letter in the alphabet. It also requires the right segment not to be empty (as something
needs to be replaced). Because we're nesting the processing of each letter inside the processing
of every typo character we need to apply flatMap
rathen than map
:
https://gist.github.com/fde980b4c57ca9f5b857bd36eb742742
<img src="/images/posts/spelling-jvm/small-asok.png" style="float: left;" margin-right: 4px;>
For our little friend asok replaces()
yields the equivalent of:
https://gist.github.com/46738b2b2fb729677aa6f2ba77feeeb0
To cover all our bases we need to run and then concatenate all four edits. For this we define a static list of edits as follows:https://gist.github.com/e52bc689ba7d392061add01837022137
This is an instance of a common functional pattern where code is treated as data. Here, we have a
list of functions mapping from List<WordSplit>
to Stream<String>
(which is, precisely, the
type of transformation our four edits implement). In this context Java method references shine as
they, effectively, provide us with a data-centric view of executable code.
We make use of this list of functions to implement edit1
as follows:
https://gist.github.com/3cf1b84a81394eadb131c34a9146a9ae
Because the four edits are independent of one another we can parallelize their execution. This is
transparently handled by Java by means of the Stream.parallelStream()
method.
Also, because each edit returns a nested Stream<String>
we need to use flatMap
instead of
map
.
As it happens, edit1
is not yet done when all edits have been concatenated.
Once edit1
generates all possible edits we need to screen them for duplicates as well as for
actual appearance in the dictionary. We also need to order the resulting valid words so the most
frequently used ones are shown first. This is all achieved as follows:
https://gist.github.com/2436610b65ee459f0feedb7817d84df1
As customary, results are collected into a List<String>
suitable for displaying suggestions to
the user.
Despite generating lots of possible suggestions, edit1
can fail to yield a dictionary word
In this case we assume our typo stems not from one but two transcription mistakes.
edit2
is quite simply defined in terms of edit1
as follows:
https://gist.github.com/23894a58f20e690adfff28287462b294
Here we see our friend pack()
in action once again to ensure word uniqueness, validity and
order.
Again, we use flatMap
because we generate candidate suggestions from the nested edit1
's Stream
of results.
We're almost there now. Our workhorse method getCorrections()
method looks as follows:
https://gist.github.com/6777bf4275e0e3c6c2376f511306a217
Note we return Optional<List<String>>
rather than a plain List<String>
. Why?
We need to express the absence of suggestions not because the typo doesn't match anything known
but, on the contrary, because it was handed a valid dictionary word for which the notion of
suggestion simply doesn't make sense. In this scenario we return Optional.empty()
.
It could be argued that returning List.empty()
would be appropriate to indicate such
inapplicability.
In reality, however, it is perfectly possible for a word absent from the dictionary (.i.e, a typo)
not to resemble any known word. In this case we still need to make it explicit we're dealing with
a typo. Thus, List.empty()
is not semantically equivalent to Optional.empty()
.
Yes, we are ;-)
The complete Java implementation has 300+ lines including comments and empty lines. Its neutron star-dense incarnation comes down to 120 lines.
Compare that to Norvig's feat of packing the whole thing in 44 lines of Python code!
Java's verbosity is precisely one of the most important reason alternative JVM languages have emerged. In the next 3 posts we'll see how the Kotlyn, Scala and Xtend versions improve upon Java in terms of economy of expression and clarity of intent.
Next in our plate: Implementing Norvig's Algo in Kotlin