kirel/draft.markdown

## draft.markdown

      
    Raw
  

              draft.markdown
            
          
    Preface

This is an overview over the inner workings of Detexify. Extended knowlege in pattern recognition or
machine learning is
not necessary as I will explain some basics but understanding of linear algebra will definitely help.
I have to note that I am not an expert, either. I more or less stumbled into this because
of this project. Experts in this field may safely skip the first section.
What's the problem?

The specific problem in this case is "What's the LaTeX code for this symbol?".
But we can generalise that. We have a bunch of objects that can be partitioned into k categories.
Suppose we look at a certain object and now the question is "What category is this object from?".
While human beings are quite good at deciding this kind of questions computers find it rather difficult.
The first thing to do is to give the computer some understanding of the objects that should be classified.
This is done by mapping the objects into the so called feature space. A feature is some property of the object
that can be quantized for example if the objects are people I may choose their age in years as one feature.
It is often possible to choose the n-dimensional rational vector space as the feature space. fig.1 shows a two
dimensional feature space with objects of categories A, B and C mapped into it. The problem of classification
of a new object is then the decision which the nearest category is. This of course means that the computer must
know some samples of each category (which is called training data) to base its decision on.
(This is called statistical pattern recognition - there are other approaches but they will not be described here as Detexify is based on the statistical one.)
fig.1
A    x     x
  x    x                x     B
    x               x    x
                  x   x    x
      x              x
  x x
  x   x                 x   x
C     X              x    x
                  A* x    
                            x
                      x        x

Detexify's implementation

The feature space (1)

The objects in Detexify are symbols but what the computer gets as raw data are strokes of ink (polylines).
These look basically like [[{x:3,y:5,timestamp:123}, {x:10,y:45,timestamp:125}]] - an Array of Strokes and each Stroke
is an Array of Hashes that contain position and timestamp. From these strokes a set of features is extracted (which is called the feature vector and will be described later).
The feature space is the n-dimensional rational vector space (n=15 at the time of writing).
Classification

The classification in Detexify is implemented with the
k-nearest neighbor algorithm (almost).
To classify an unknown pattern the training data is first sorted based on the euclidean distance in the feature space to this pattern. Usually then the k nearest neighbours are examined and the pattern is probably of the class most common among them but since I want 5 entries in the hit list of symbols
nearest neighbours are examined until symbols of 5 different classes are found. The score of a symbol (class) is then the number of occurences among these nearest neighbours.
The feature space (2)

The chosen features are an important aspect of pattern recognition. When reading papers about handwritten character recognition
I realized that there does not seem to be any consent about what the best feature set for a given alphabet of characters is.
Everyone just develops an ad-hoc solution. Of course the requirements of the feature set are pretty clear. It must separate different classes sufficiently and abstract from inner-class variation.
My solution is a Potpourri of approaches I found in different papers especially on chinese and mathematical symbol recognition.
Before extracting any features I normalize (preprocess) the strokes in two ways. First the strokes are scaled and translated to fit and be centered in [0,1]x[0,1]. Next the points are redistributed. The browser creates strokes homogeneous in time by firing a mousemove event every few milliseconds but instead I want strokes to be homogeneous in space so that each pair of neighbouring points has the same distance to each other.
The features I use (at the time of writing) are number of strokes, point density and directional features. Point density means simply that I count how many points of the strokes fall into a certain area of [0,1]x[0,1] - for example [0,1]x[0,0.2]. The directional features I use can best be described with the help of some ascii art:
fig.2
NW N NE   |         x --› x   |
  \|/     |   v    ^  w       |   v -› NE
W -o- E   |      /            |
  /|\     |    x              |   w -› E
SW S SE   |     \             |

For two succeeding points in a stroke the vector between them is assigned one of N, NE, E, E, SE, S, SW, W and NW.
This way a histogram of orientations is obtained each entry containing the number of stroke segments between two succeding
points heading into a specific direction.
All the features extracted are numeric values and thus can be concatenated into a rational vector - this is the feature vector.
Reasons for choices made

Handwritten character recognition bears immanent difficulties.
There is a large variety in different writers output in such things as
stroke number and stroke order but also many other things. Another difficulty is the sheer number of LaTeX symbols
(the comprehensive list contains over 4000). I chose a k-nearest neighbours classifier because I thought that would support
some of the variety in different writers output. To understand that please look at fig.1 again.
Class A might be any symbol but there are two predominant ways to write it (A and A*) which results in distribution in the feature
space like in fig.1. But if training samples of both ways are available the classifier should be able to recognize the correct
class anyway. The chosen features are all histogram features which abstract from stroke order. This combination works pretty well
so far - better than I expected actually.
Foresight

Of course there is a lot room for improvement. The training samples are saved as full strokes so a change in features used is
possible at any moment. The vast amount of training data you already provided also makes a good test set for different classifiers.
A few samples could be used as test samples while the rest remains as training data. A new classifier can then be tested against
the test samples and the real class of the test samples can be compared to the classifiers output.
All in all I think Detexify is a solid base to improve on.