
Honors Project Report

Anhad Jai Singh,
Computer Science and Engineering,
Centre for Visual Information Technology (CVIT)

Problem Statement

The problem we set out to tackle was to create a classifier that is faster in practice than most present-day classifiers and compact enough to run on today's mobile devices within their limited memory and storage constraints. The classifier should be capable of handling a large number of classes and samples.

The classifiers we compete against are (i) a One Vs. One classifier with linear SVMs and (ii) a One Vs. Rest classifier with linear SVMs. These classifiers are compact enough to use within the limited resources available on mobile devices, and their accuracies are good enough to compete with more complex non-linear classifiers on a large class of problems.

Synopsis of Solution

We propose a novel tree-based classifier that competes with both of the aforementioned classifiers, aiming to be more compact than One Vs. One while requiring fewer computations at test time than both One Vs. Rest and One Vs. One.

The Process

Here we expound upon the different approaches and methodologies we tried and experimented with while designing and constructing the classifier. All the code and the history of the project can be found on GitHub at https://github.com/ffledgling/library.

Datasets Used

We use the following two datasets:

  1. Optical Recognition of Handwritten Digits Data Set
    Data Set Characteristics:  Multivariate
    Number of Instances: 5620
    Attribute Characteristics: Integer
    Number of Attributes: 64
    Date Donated: 1998-07-01
Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
  2. Letter Recognition Data Set
    Data Set Characteristics:  Multivariate
    Number of Instances: 20000
    Attribute Characteristics: Integer
    Number of Attributes: 16
    Date Donated: 1991-01-01

Initial Approach

We start by creating a baseline tree-based classifier. The classifier is built by equi-partitioning the original list of classes: we compute all such partitions and then pick the best one based on accuracy. The accuracy in question is that of a (liblinear-based) linear SVM that separates the original space by treating the split as a two-class classification problem.

The two partitions are then treated as sub-problems and solved recursively using the same approach, until the set of classes to partition at a node is a singleton and no further classification is possible. This acts as the base case for the recursion.
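
The following is a minimal sketch of this recursive construction, assuming scikit-learn's liblinear-backed LinearSVC; the function and field names are illustrative, not the library's actual API, and accuracy here is measured on the training split for brevity.

    from itertools import combinations

    import numpy as np
    from sklearn.svm import LinearSVC

    def candidate_bipartitions(classes):
        """Yield ways of splitting `classes` into two non-empty sets.

        Restricting `size` to len(classes) // 2 only would give the strict
        equi-partitioning variant described above.
        """
        classes = sorted(classes)
        for size in range(1, len(classes) // 2 + 1):
            for left in combinations(classes, size):
                if 2 * size == len(classes) and classes[0] not in left:
                    continue  # skip mirror duplicates of equal-sized splits
                yield set(left), set(classes) - set(left)

    def build_tree(X, y, classes):
        if len(classes) == 1:                     # base case: singleton set
            return {"leaf": next(iter(classes))}
        best = None
        for left, right in candidate_bipartitions(classes):
            labels = np.isin(y, list(left)).astype(int)  # two-class proxy problem
            clf = LinearSVC().fit(X, labels)
            acc = clf.score(X, labels)
            if best is None or acc > best[0]:            # keep the best split
                best = (acc, clf, left, right)
        acc, clf, left, right = best
        mask = np.isin(y, list(left))
        return {"clf": clf,
                "left": build_tree(X[mask], y[mask], left),
                "right": build_tree(X[~mask], y[~mask], right)}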

Although a good starting point, this approach has a few problems. We list and elucidate these problems here because they form the basis for the other methodologies and experiments we perform; each of the newer approaches attempts to solve some of the problems posed by this one.

I. Computation Time

The amount of computation required to generate a tree with this approach is huge, even for small datasets with a nominal number of classes. When attempted naively, roughly 2^(N-1) partitions, and as many linear SVMs, are computed at the first partitioning step alone, assuming N classes. Accounting for all the partitions computed down the tree, this grows to on the order of 2^N partitions and linear SVMs. This is of course the worst possible outcome computationally. On a fairly modern Intel® Core™ i5-2520M (2.5GHz, 3MB L3, 1333MHz FSB) CPU this took close to 20 minutes to run for Dataset I, which has 10 classes and 5620 samples.

II. Accuracy

The accuracy obtained with this initial approach was acceptable: close enough to compete with the classifiers we set as benchmarks, but outperformed by more complex classifiers such as Random Forests and LDA.

III. Separability

Not all data and datasets are created equal, and as a result many natural datasets are not cleanly separable by linear means. This problem is aggravated when one tries to force data samples into partitions of equal size, which is exactly what the naive approach described above does.

Class Partitioning Strategies

I. Pairwise

The pairwise strategy is an interesting one, and the first improvement we tried over the naive approach. It attempts to reduce the number of linear classifiers trained and tested at each node by approximating the equipartition approach. Partitioning the class labels into two sets can be treated as a decision problem where each class label needs to be assigned to one of two possible sets; the overall problem then simply reduces to picking the 'best' partitioning of the given class labels from the candidate partitionings.

To approximate the naive method we still test quite a few possible partitionings, but we eliminate, at the outset, partitionings that would not have yielded good results. We use a constructive approach to create the candidate partitions: we pick two classes, train a linear SVM on those two classes, and then use this classifier to place every class into a partition based on which side of the hyperplane the majority of its samples lie. We then train a fresh linear classifier on the resulting partition. We do this for all N-choose-2 possible pairs and pick the 'best' among all the candidates.
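
Below is a minimal sketch of one such pairwise partitioning step, again assuming scikit-learn's LinearSVC; the names and the tie-handling are illustrative.

    from itertools import combinations

    import numpy as np
    from sklearn.svm import LinearSVC

    def pairwise_partition(X, y, classes):
        """Return the best (accuracy, classifier, left, right) over all seed pairs."""
        best = None
        for a, b in combinations(sorted(classes), 2):
            seed_mask = np.isin(y, [a, b])
            seed = LinearSVC().fit(X[seed_mask], (y[seed_mask] == a).astype(int))
            # Assign every class to the side of the hyperplane where the
            # majority of its samples fall.
            left, right = set(), set()
            for c in classes:
                votes = seed.predict(X[y == c])
                (left if votes.mean() >= 0.5 else right).add(c)
            if not left or not right:
                continue                         # degenerate split, skip it
            labels = np.isin(y, list(left)).astype(int)
            clf = LinearSVC().fit(X, labels)     # fresh SVM on the induced split
            acc = clf.score(X, labels)
            if best is None or acc > best[0]:
                best = (acc, clf, left, right)
        return best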

This approach is faster than the naive approach by orders of magnitude: it requires O(N^2) classifiers per node versus the O(2^N) required by the older approach. The results with pairwise partitioning are very close to, or better than, those obtained by the naive approach (within about ±5%), which makes it a definite improvement. Further reductions in computation are possible using memoization, since this approach lends itself to memoization very cleanly; more details can be found in the Computational Optimizations section.

II. GMM Based

Although much faster than the naive approach, the pairwise partitioning approach still left a lot to be desired in terms of speed, and, on certain datasets, accuracy. We therefore tried a completely different approach to partition selection in order to separate classes in a novel way.

For this we use a GMM. We assume that our data is made up of only two classes; these are proxy classes for the partitions that we actually want to create from the set of class labels. We then fit a Gaussian Mixture Model (GMM) with two Gaussians to the data (using EM). Once the model is fit, we identify which classes are biased towards a particular Gaussian by classifying a few samples from every class with the GMM. All classes that bias towards the same Gaussian are treated as part of the same partition. A linear SVM is then trained to create a separating hyperplane between the two partitions; this is the linear SVM finally used for classification.

This approach performed extremely well in terms of speed on both datasets; it was a full order of magnitude faster on the larger Dataset II. The accuracy was comparable to, although close to the lower end of, that returned by the naive and pairwise approaches on both Dataset I and Dataset II.

The problem with this approach is that the underlying assumption, that the data to be classified cleanly fits a mixture of Gaussians, can be flawed at times, and in such cases can lead to poor results or no results at all. For example, the first implementation of this approach left Dataset II with a null set and a universal set as its two partitions, because all the classes were biased towards only one of the two Gaussians. This was eventually fixed by forcing a partition whenever this happens: the less strongly biased classes are clubbed together into one partition and the remaining, highly biased ones are grouped into the other.
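
A minimal sketch of the GMM-based partition selection, including a forced-split fallback of the kind described above, is given below; it assumes scikit-learn's GaussianMixture (EM-based) and LinearSVC, and splitting the forced partition exactly in half by bias strength is an illustrative choice rather than the library's exact rule.

    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.svm import LinearSVC

    def gmm_partition(X, y, classes):
        gmm = GaussianMixture(n_components=2).fit(X)
        # Per-class bias: fraction of the class's samples assigned to Gaussian 1.
        bias = {c: gmm.predict(X[y == c]).mean() for c in classes}
        left = {c for c, b in bias.items() if b < 0.5}   # biased towards Gaussian 0
        right = set(classes) - left
        if not left or not right:
            # All classes biased towards one Gaussian: force a split by grouping
            # the least-biased classes into one partition and the rest into the other.
            ranked = sorted(classes, key=lambda c: abs(bias[c] - 0.5))
            half = len(ranked) // 2
            left, right = set(ranked[:half]), set(ranked[half:])
        labels = np.isin(y, list(left)).astype(int)
        clf = LinearSVC().fit(X, labels)                 # final separating hyperplane
        return clf, left, right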

III. KNN Based

Given the problems of modelling all data with GMMs, we sought another, more flexible approach. The idea we propose is to use KNN to segregate the classes into two partitions. This method does not assume any underlying distribution of the data, but should still be fairly fast, like fitting GMMs, and just like the approach above it requires training only a single linear SVM and a KNN-based classifier instead of the large number used by the earlier approaches. We have not yet tried KNN on either dataset, but we expect performance as good as, if not better than, that of the GMM-based approach, especially for Dataset II.

IV. Exhaustive Search

The final approach, exhaustive search, is not better than any of the aforementioned approaches on any front except accuracy. We try all possible combinations of all possible partitions at every possible node. This produces on the order of N · (N choose 2) trees and a comparable number of linear classifiers, from which we pick the best tree based on accuracy. This tree, obtained after much time and effort, serves as the gold standard and the upper bound on what any classifier using the general structure we are working on can achieve.

The code to find the best tree exhaustively does not currently exist, but it is fairly easy to write, and doing so is one of the high-priority tasks when this project moves forward.

Selection Strategies

The selection strategies are what pick the 'best' partition, based on certain parameters that are computed for each candidate. The parameters we take into consideration are:

  1. Accuracy - Accuracy of the classifier on its own split.
  2. Overlap - Measure of how clean the separation by the classifier is; the fewer mis-classified samples from the given classes, the better.
  3. Balance - Measure of how uniform the split is; more uniform splits keep the tree more balanced and as a result limit its maximum depth.
  4. Purity - Measure of how pure the split is, specifically for classes that do not overlap in training.
  5. Margin - Measure of the margin of the linear classifier.

All these parameters measure different aspects of the classifier that may or may not be relevant to the overall structure of the tree, depending on the kind of tree being built. It is possible to specify and use different objective functions; for example, one function we use is simply the product of all the above parameters, each normalized to lie between 0 and 1. This is a good starting point that gives decent results for both datasets we use.
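
As a minimal sketch of that product objective, assuming each parameter has already been normalized to [0, 1] so that higher is better (a raw overlap count, for instance, would be inverted before being passed in); the function name is illustrative.

    def split_score(accuracy, overlap, balance, purity, margin):
        """Product objective: any single poor parameter drags the score down.

        Each argument is assumed to be pre-normalized to [0, 1] with higher
        values being better.
        """
        return accuracy * overlap * balance * purity * margin

At each node, the candidate partition with the highest score would then be selected.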

Computational Optimizations

There are several system and code level optimizations that can be done to improve the overall speed and efficiency of the library, the broad ideas and details are listed out in this section.

Parallelization

A lot of the training process for many of the approaches lends itself to parallelization. Approaches that train multiple classifiers at each node in order to choose partitions use identical training and testing data; only certain input parameters to the training functions change. All the training and testing data is read-only by definition, since none of the functions need to alter it. Both properties are huge advantages that can be leveraged for parallelization. The current parallelization relies on Python's multiprocessing module, which has quite a few drawbacks; for one, multiprocessing duplicates the input data instead of sharing memory to access the read-only objects, and the overhead of copying data offsets much of the benefit gained by parallel computation. Lower-level languages such as C and C++, which do not have Python's threading drawbacks and are more easily adapted to shared memory, are better suited to this task. An eventual rewrite of the library in C, after the prototyping and exploration is complete, is something to consider strongly.
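
A minimal sketch of parallelizing the per-pair candidate evaluation with Python's multiprocessing module follows. X_train and y_train are assumed to be module-level, read-only globals set once at start-up, so that forked worker processes inherit them rather than receiving a pickled copy per task; all names are illustrative.

    import multiprocessing as mp
    from itertools import combinations

    import numpy as np
    from sklearn.svm import LinearSVC

    X_train, y_train = None, None   # set once at start-up, read-only afterwards

    def _evaluate_pair(pair):
        """Train one candidate seed SVM for a pair of classes and score it."""
        a, b = pair
        mask = np.isin(y_train, [a, b])
        labels = (y_train[mask] == a).astype(int)
        clf = LinearSVC().fit(X_train[mask], labels)
        return pair, clf.score(X_train[mask], labels)

    def best_pair_parallel(classes, processes=4):
        pairs = list(combinations(sorted(classes), 2))
        with mp.Pool(processes=processes) as pool:
            scores = pool.map(_evaluate_pair, pairs)   # one task per pair
        return max(scores, key=lambda s: s[1])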

Memoization

Many of the approaches, when implemented naively, recompute information already computed by one of the parent nodes; storing this information and re-using it directly should save a great many computation cycles. The memoization branch in the Git repository is a first step in this direction and implements basic memoization, although there is a lot of room for improvement.

As an example, in the pairwise partitioning approach, linear SVMs for all possible pairs already exist for every node below the root, because the root node is forced to compute them all anyway. Memoizing these results avoids repeating the same computation down the tree.
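
A minimal sketch of such a cache, keyed by the (order-normalized) pair of class labels so that any node deeper in the tree re-uses the model the root already trained; the names are illustrative.

    import numpy as np
    from sklearn.svm import LinearSVC

    _pair_cache = {}   # (class_a, class_b) -> trained pairwise LinearSVC

    def pairwise_svm(X, y, a, b):
        """Return the pairwise SVM for (a, b), training it only once."""
        a, b = sorted((a, b))              # normalize the key's orientation
        if (a, b) not in _pair_cache:
            mask = np.isin(y, [a, b])
            _pair_cache[(a, b)] = LinearSVC().fit(X[mask], (y[mask] == a).astype(int))
        return _pair_cache[(a, b)]

The same idea extends to whole sub-trees: a cache keyed by the set of classes a sub-tree classifies would let duplicate sub-problems share one solution.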

Similarly, because overlapping classes are common to both the left and right partitions, it sometimes happens that sub-trees further down the tree are essentially duplicates of one another. That is, sub-trees solving the same sub-problem, classifying the same original set of classes, appear in different parts of the tree. This becomes more common for smaller and smaller sub-trees; for example, one is very likely to spot four or five identical classifiers for two-letter separation in Dataset II.

Storing these classifiers and then accessing them directly will reduce the number of nodes that actually need to be computed in the tree.

Reducing Memory Overhead

As mentioned previously, Python does not allow a lot of flexibility when it comes to accessing and sharing objects in shared memory. Being forced to copy data around with multiprocessing, to get around Python's Global Interpreter Lock, also adds a lot of unnecessary overhead. This problem might be solved using Python's memory-mapped file module, mmap, or using multiprocessing's shared or "managed" objects such as arrays, but it would be best to sidestep these issues by using a lower-level language such as C or C++.
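
One possible mitigation within Python, sketched below under the assumption that the feature matrix is float64, is to copy the data once into a multiprocessing shared ctypes array and give every worker a NumPy view onto it instead of its own copy; the helper name is illustrative.

    import ctypes
    import multiprocessing as mp

    import numpy as np

    def make_shared(X):
        """Copy X once into shared memory and return (buffer, numpy view)."""
        shared = mp.Array(ctypes.c_double, X.size, lock=False)   # read-only use
        view = np.frombuffer(shared, dtype=np.float64).reshape(X.shape)
        view[:] = X                                              # one-time copy
        return shared, view

Worker processes would then rebuild their NumPy view from the inherited buffer rather than receiving a pickled copy of the data per task.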

Visualising Results

Visualising the results of any sort of classification is important for understanding the underlying problem being solved. Visualisation is also an important tool for monitoring the progress of any long-running process, especially one as unpredictable as classifier training. For these reasons, special effort was made to add visualisation tools to the library itself. Currently, the classifier does all of the following to aid visualisation:

  1. Textual representation of the classifier tree is printed out after the entire classifier is trained, with basic metrics, accuracies and partitions.

  2. We log node progress as a list of JSON objects, in a file called live.json by default (see the sketch after this list).

  3. The above log file is then used to visualise the tree in a browser, as it is being built, using the d3 JavaScript data-visualisation library.

  4. The final log of JSON objects can be stored/saved for future re-use for the purposes of visualisation. The tree can be reconstructed without having to re-run the classifier training itself.
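
As a minimal sketch of the per-node logging, one JSON object is appended per trained node; the exact field names used in live.json are the library's own, so the ones below are only illustrative.

    import json

    def log_node(node_id, parent_id, left, right, accuracy, log_file="live.json"):
        """Append one JSON record describing a freshly trained node."""
        record = {
            "id": node_id,
            "parent": parent_id,
            "left_classes": sorted(left),
            "right_classes": sorted(right),
            "accuracy": accuracy,
        }
        with open(log_file, "a") as fh:
            fh.write(json.dumps(record) + "\n")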

Please look at the README.md file included with the repository for more information on how to use this particular feature.

Future Work

There is room for a lot more work. Some of it is captured in the various todo-*.mkd files included in the repository; the higher-level goals are captured in the TODO.mkd file in the same repository.

The overall ideas that are yet to be explored are:

  • Picking an optimal objective function that takes into account the aforementioned classifier properties and helps choose a well-balanced yet clean partition.
  • To increase the chances of finding a good solution and competing alternatives, look into diversity-enhancing optimization methods and techniques.
  • Look into using A* search and similar look-ahead based techniques to improve the overall accuracy of the tree-based classifier. Other approaches that do not assume a local maximum is the global maximum are also good to have.
  • Other possible partitioning strategies, for example KNN, Mean Shift, LDA, or PCA.
  • Improve the parallelization, in both quality and speed, so that the library scales linearly with the number of cores it has access to, both on a single machine and across multiple nodes in a cluster.
  • Look into porting the library to C or C++ to resolve the shared-memory management issues.
  • Look into using a GPU-based liblinear for training the classifier; offloading work to the GPU could be much faster.
  • Try different kinds of datasets (images, audio, text) instead of the niche handwritten-character datasets.
  • Explore the limits of the library by training it on datasets with an order of magnitude more classes and samples.