
Honors Project Report

Anhad Jai Singh,
Computer Science and Engineering,
Centre for Visual Information Technology (CVIT)

Problem Statement

The problem we set out to tackle was to create a classifier that is faster in practice than most present-day classifiers and compact enough to run on today's mobile devices within their limited memory and storage constraints. The classifier should be capable of handling a large number of classes and samples.

The classifiers we compete against are (i) a One Vs. One classifier with linear SVMs and (ii) a One Vs. Rest classifier with linear SVMs. These classifiers are compact enough to use within the limited resources available on mobile devices, and their accuracies are good enough to compete with more complex non-linear classifiers on a large class of problems.

Synopsis of Solution

We propose a novel tree-based classifier that competes with both of the aforementioned classifiers, aiming to be more compact than One Vs. One while requiring fewer computations at test time than both One Vs. Rest and One Vs. One.

The Process

Here we expound upon the different approaches and methodologies we tried and experimented with while designing and constructing the classifier. All the code and the history of the project can be found on GitHub at https://github.com/ffledgling/library.

Datasets Used

We use the following two datasets:

  1. Optical Recognition of Handwritten Digits Data Set
    Data Set Characteristics:  Multivariate
    Number of Instances: 5620
    Attribute Characteristics: Integer
    Number of Attributes: 64
    Date Donated: 1998-07-01
Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
  2. Letter Recognition Data Set
    Data Set Characteristics:  Multivariate
    Number of Instances: 20000
    Attribute Characteristics: Integer
    Number of Attributes: 16
    Date Donated: 1991-01-01

Initial Approach

We start by creating a baseline tree-based classifier. The classifier is built by equi-partitioning the original list of classes: we compute all such partitions and then pick the best one based on accuracy. The accuracy in question is that of a (liblinear-based) linear SVM that separates the original space by treating the split as a two-class classification problem.

The two partitions are then treated as sub-problems and solved recursively using the same approach, until the set of classes to partition at a node is a singleton and no further classification is possible. This acts as the base case for the recursion.
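
The following is a minimal sketch of this recursive construction, assuming scikit-learn's liblinear-backed LinearSVC; the function and field names are illustrative, not the library's actual API, and accuracy here is measured on the training split for brevity.

    from itertools import combinations

    import numpy as np
    from sklearn.svm import LinearSVC

    def candidate_bipartitions(classes):
        """Yield ways of splitting `classes` into two non-empty sets.

        Restricting `size` to len(classes) // 2 only would give the strict
        equi-partitioning variant described above.
        """
        classes = sorted(classes)
        for size in range(1, len(classes) // 2 + 1):
            for left in combinations(classes, size):
                if 2 * size == len(classes) and classes[0] not in left:
                    continue  # skip mirror duplicates of equal-sized splits
                yield set(left), set(classes) - set(left)

    def build_tree(X, y, classes):
        if len(classes) == 1:                     # base case: singleton set
            return {"leaf": next(iter(classes))}
        best = None
        for left, right in candidate_bipartitions(classes):
            labels = np.isin(y, list(left)).astype(int)  # two-class proxy problem
            clf = LinearSVC().fit(X, labels)
            acc = clf.score(X, labels)
            if best is None or acc > best[0]:            # keep the best split
                best = (acc, clf, left, right)
        acc, clf, left, right = best
        mask = np.isin(y, list(left))
        return {"clf": clf,
                "left": build_tree(X[mask], y[mask], left),
                "right": build_tree(X[~mask], y[~mask], right)}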

Although a good starting point, this approach has a few problems. We list and elucidate these problems here because they form the basis for the other methodologies and experiments we perform; each of the newer approaches attempts to solve some of the problems posed by this one.

I. Computation Time

The amount of computation required to generate a tree with this approach is huge, even for small datasets with a nominal number of classes. When attempted naively, roughly 2^(N-1) partitions, and as many linear SVMs, are computed at the first partitioning step alone, assuming N classes. Accounting for all the partitions computed down the tree, this grows to on the order of 2^N partitions and linear SVMs. This is of course the worst possible outcome computationally. On a fairly modern Intel® Core™ i5-2520M (2.5GHz, 3MB L3, 1333MHz FSB) CPU this took close to 20 minutes to run for Dataset I, which has 10 classes and 5620 samples.

II. Accuracy

The accuracy obtained with this initial approach was acceptable: close enough to compete with the classifiers we set as benchmarks, but outperformed by more complex classifiers such as Random Forests and LDA.

III. Separability

Not all data and datasets are created equal, and as a result many natural datasets are not cleanly separable by linear means. This problem is aggravated when one tries to force data samples into partitions of equal size, which is exactly what the naive approach described above does.

Class Partitioning Strategies

I. Pairwise

The pairwise strategy is an interesting one, and the first improvement we tried over the naive approach. It attempts to reduce the number of linear classifiers trained and tested at each node by approximating the equipartition approach. Partitioning the class labels into two sets can be treated as a decision problem where each class label needs to be assigned to one of two possible sets; the overall problem then simply reduces to picking the 'best' partitioning of the given class labels from the candidate partitionings.

To approximate the naive method we still test quite a few possible partitionings, but we eliminate, at the outset, partitionings that would not have yielded good results. We use a constructive approach to create the candidate partitions: we pick two classes, train a linear SVM on those two classes, and then use this classifier to place every class into a partition based on which side of the hyperplane the majority of its samples lie. We then train a fresh linear classifier on the resulting partition. We do this for all N-choose-2 possible pairs and pick the 'best' among all the candidates.
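
Below is a minimal sketch of one such pairwise partitioning step, again assuming scikit-learn's LinearSVC; the names and the tie-handling are illustrative.

    from itertools import combinations

    import numpy as np
    from sklearn.svm import LinearSVC

    def pairwise_partition(X, y, classes):
        """Return the best (accuracy, classifier, left, right) over all seed pairs."""
        best = None
        for a, b in combinations(sorted(classes), 2):
            seed_mask = np.isin(y, [a, b])
            seed = LinearSVC().fit(X[seed_mask], (y[seed_mask] == a).astype(int))
            # Assign every class to the side of the hyperplane where the
            # majority of its samples fall.
            left, right = set(), set()
            for c in classes:
                votes = seed.predict(X[y == c])
                (left if votes.mean() >= 0.5 else right).add(c)
            if not left or not right:
                continue                         # degenerate split, skip it
            labels = np.isin(y, list(left)).astype(int)
            clf = LinearSVC().fit(X, labels)     # fresh SVM on the induced split
            acc = clf.score(X, labels)
            if best is None or acc > best[0]:
                best = (acc, clf, left, right)
        return best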

This approach is faster than the naive approach by orders of magnitude: it requires O(N^2) classifiers per node versus the O(2^N) required by the older approach. The results with pairwise partitioning are very close to, or better than, those obtained by the naive approach (within about ±5%), which makes it a definite improvement. Further reductions in computation are possible using memoization, since this approach lends itself to memoization very cleanly; more details can be found in the Computational Optimizations section.

II. GMM Based

Although much faster than the naive approach, the pairwise partitioning approach still left a lot to be desired in terms of speed, and, on certain datasets, accuracy. We therefore tried a completely different approach to partition selection in order to separate classes in a novel way.

For this we use a GMM. We assume that our data is made up of only two classes; these are proxy classes for the partitions that we actually want to create from the set of class labels. We then fit a Gaussian Mixture Model (GMM) with two Gaussians to the data (using EM). Once the model is fit, we identify which classes are biased towards a particular Gaussian by classifying a few samples from every class with the GMM. All classes that bias towards the same Gaussian are treated as part of the same partition. A linear SVM is then trained to create a separating hyperplane between the two partitions; this is the linear SVM finally used for classification.

This approach performed extremely well in terms of speed on both datasets; it was a full order of magnitude faster on the larger Dataset II. The accuracy was comparable to, although close to the lower end of, that returned by the naive and pairwise approaches on both Dataset I and Dataset II.

The problem with this approach is that the underlying assumption, that the data to be classified cleanly fits a mixture of Gaussians, can be flawed at times, and in such cases can lead to poor results or no results at all. For example, the first implementation of this approach left Dataset II with a null set and a universal set as its two partitions, because all the classes were biased towards only one of the two Gaussians. This was eventually fixed by forcing a partition whenever this happens: the less strongly biased classes are clubbed together into one partition and the remaining, highly biased ones are grouped into the other.
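
A minimal sketch of the GMM-based partition selection, including a forced-split fallback of the kind described above, is given below; it assumes scikit-learn's GaussianMixture (EM-based) and LinearSVC, and splitting the forced partition exactly in half by bias strength is an illustrative choice rather than the library's exact rule.

    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.svm import LinearSVC

    def gmm_partition(X, y, classes):
        gmm = GaussianMixture(n_components=2).fit(X)
        # Per-class bias: fraction of the class's samples assigned to Gaussian 1.
        bias = {c: gmm.predict(X[y == c]).mean() for c in classes}
        left = {c for c, b in bias.items() if b < 0.5}   # biased towards Gaussian 0
        right = set(classes) - left
        if not left or not right:
            # All classes biased towards one Gaussian: force a split by grouping
            # the least-biased classes into one partition and the rest into the other.
            ranked = sorted(classes, key=lambda c: abs(bias[c] - 0.5))
            half = len(ranked) // 2
            left, right = set(ranked[:half]), set(ranked[half:])
        labels = np.isin(y, list(left)).astype(int)
        clf = LinearSVC().fit(X, labels)                 # final separating hyperplane
        return clf, left, right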

III. KNN Based

Given the problems of modelling all data with GMMs, we sought another, more flexible approach. The idea we propose is to use KNN to segregate the classes into two partitions. This method does not assume any underlying distribution of the data, but should still be fairly fast, like fitting GMMs, and just like the approach above it requires training only a single linear SVM and a KNN-based classifier instead of the large number used by the earlier approaches. We have not yet tried KNN on either dataset, but we expect performance as good as, if not better than, that of the GMM-based approach, especially for Dataset II.

IV. Exhaustive Search

The final approach, exhaustive search, is not better than any of the aforementioned approaches on any front except accuracy. We try all possible combinations of all possible partitions at every possible node. This produces on the order of N · (N choose 2) trees and a comparable number of linear classifiers, from which we pick the best tree based on accuracy. This tree, obtained after much time and effort, serves as the gold standard and the upper bound on what any classifier using the general structure we are working on can achieve.

The code to find the best tree exhaustively does not currently exist, but it is fairly easy to write, and doing so is one of the high-priority tasks when this project moves forward.

Selection Strategies

The selection strategies are what pick the 'best' partition, based on certain parameters that are computed for each candidate. The parameters we take into consideration are:

  1. Accuracy - Accuracy of the classifier on its own split.
  2. Overlap - Measure of how clean the separation by the classifier is; the fewer mis-classified samples from the given classes, the better.
  3. Balance - Measure of how uniform the split is; more uniform splits keep the tree more balanced and as a result limit its maximum depth.
  4. Purity - Measure of how pure the split is, specifically for classes that do not overlap in training.
  5. Margin - Measure of the margin of the linear classifier.

All these parameters measure different aspects of the classifier that may or may not be relevant to the overall structure of the tree, depending on the kind of tree being built. It is possible to specify and use different objective functions; for example, one function we use is simply the product of all the above parameters, each normalized to lie between 0 and 1. This is a good starting point that gives decent results for both datasets we use.
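
As a minimal sketch of that product objective, assuming each parameter has already been normalized to [0, 1] so that higher is better (a raw overlap count, for instance, would be inverted before being passed in); the function name is illustrative.

    def split_score(accuracy, overlap, balance, purity, margin):
        """Product objective: any single poor parameter drags the score down.

        Each argument is assumed to be pre-normalized to [0, 1] with higher
        values being better.
        """
        return accuracy * overlap * balance * purity * margin

At each node, the candidate partition with the highest score would then be selected.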

Computational Optimizations

There are several system and code level optimizations that can be done to improve the overall speed and efficiency of the library, the broad ideas and details are listed out in this section.

Parallelization

A lot of the training process for many of the approaches lends itself to parallelization. Approaches that train multiple classifiers at each node in order to choose partitions use identical training and testing data; only certain input parameters to the training functions change. All the training and testing data is read-only by definition, since none of the functions need to alter it. Both properties are huge advantages that can be leveraged for parallelization. The current parallelization relies on Python's multiprocessing module, which has quite a few drawbacks; for one, multiprocessing duplicates the input data instead of sharing memory to access the read-only objects, and the overhead of copying data offsets much of the benefit gained by parallel computation. Lower-level languages such as C and C++, which do not have Python's threading drawbacks and are more easily adapted to shared memory, are better suited to this task. An eventual rewrite of the library in C, after the prototyping and exploration is complete, is something to consider strongly.
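
A minimal sketch of parallelizing the per-pair candidate evaluation with Python's multiprocessing module follows. X_train and y_train are assumed to be module-level, read-only globals set once at start-up, so that forked worker processes inherit them rather than receiving a pickled copy per task; all names are illustrative.

    import multiprocessing as mp
    from itertools import combinations

    import numpy as np
    from sklearn.svm import LinearSVC

    X_train, y_train = None, None   # set once at start-up, read-only afterwards

    def _evaluate_pair(pair):
        """Train one candidate seed SVM for a pair of classes and score it."""
        a, b = pair
        mask = np.isin(y_train, [a, b])
        labels = (y_train[mask] == a).astype(int)
        clf = LinearSVC().fit(X_train[mask], labels)
        return pair, clf.score(X_train[mask], labels)

    def best_pair_parallel(classes, processes=4):
        pairs = list(combinations(sorted(classes), 2))
        with mp.Pool(processes=processes) as pool:
            scores = pool.map(_evaluate_pair, pairs)   # one task per pair
        return max(scores, key=lambda s: s[1])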

Memoization

Many of the approaches, when implemented naively, recompute information already computed by one of the parent nodes; storing this information and re-using it directly should save a great many computation cycles. The memoization branch in the Git repository is a first step in this direction and implements basic memoization, although there is a lot of room for improvement.

As an example, in the pairwise partitioning approach, linear SVMs for all possible pairs already exist for every node below the root, because the root node is forced to compute them all anyway. Memoizing these results avoids repeating the same computation down the tree.
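
A minimal sketch of such a cache, keyed by the (order-normalized) pair of class labels so that any node deeper in the tree re-uses the model the root already trained; the names are illustrative.

    import numpy as np
    from sklearn.svm import LinearSVC

    _pair_cache = {}   # (class_a, class_b) -> trained pairwise LinearSVC

    def pairwise_svm(X, y, a, b):
        """Return the pairwise SVM for (a, b), training it only once."""
        a, b = sorted((a, b))              # normalize the key's orientation
        if (a, b) not in _pair_cache:
            mask = np.isin(y, [a, b])
            _pair_cache[(a, b)] = LinearSVC().fit(X[mask], (y[mask] == a).astype(int))
        return _pair_cache[(a, b)]

The same idea extends to whole sub-trees: a cache keyed by the set of classes a sub-tree classifies would let duplicate sub-problems share one solution.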

Similarly, because overlapping classes are common to both the left and right partitions, it sometimes happens that sub-trees further down the tree are essentially duplicates of one another. That is, sub-trees solving the same sub-problem, classifying the same original set of classes, appear in different parts of the tree. This becomes more common for smaller and smaller sub-trees; for example, one is very likely to spot four or five identical classifiers for two-letter separation in Dataset II.

Storing these classifiers and then accessing them directly will reduce the number of nodes that actually need to be computed in the tree.

Reducing Memory Overhead

As mentioned previously, Python does not allow a lot of flexibility when it comes to accessing and sharing objects in shared memory. Being forced to copy data around with multiprocessing, to get around Python's Global Interpreter Lock, also adds a lot of unnecessary overhead. This problem might be solved using Python's memory-mapped file module, mmap, or using multiprocessing's shared or "managed" objects such as arrays, but it would be best to sidestep these issues by using a lower-level language such as C or C++.
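
One possible mitigation within Python, sketched below under the assumption that the feature matrix is float64, is to copy the data once into a multiprocessing shared ctypes array and give every worker a NumPy view onto it instead of its own copy; the helper name is illustrative.

    import ctypes
    import multiprocessing as mp

    import numpy as np

    def make_shared(X):
        """Copy X once into shared memory and return (buffer, numpy view)."""
        shared = mp.Array(ctypes.c_double, X.size, lock=False)   # read-only use
        view = np.frombuffer(shared, dtype=np.float64).reshape(X.shape)
        view[:] = X                                              # one-time copy
        return shared, view

Worker processes would then rebuild their NumPy view from the inherited buffer rather than receiving a pickled copy of the data per task.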

Visualising Results

Visualising the results of any sort of classification is important for understanding the underlying problem being solved. Visualisation is also an important tool for monitoring the progress of any long-running process, especially one as unpredictable as classifier training. For these reasons, special effort was made to add visualisation tools to the library itself. Currently, the classifier does all of the following to aid visualisation:

  1. Textual representation of the classifier tree is printed out after the entire classifier is trained, with basic metrics, accuracies and partitions.

  2. We log node progress as a list of JSON objects, in a file called live.json by default (see the sketch after this list).

  3. The above log file is then used to visualise the tree in a browser, as it is being built, using the d3 JavaScript data-visualisation library.

  4. The final log of JSON objects can be stored/saved for future re-use for the purposes of visualisation. The tree can be reconstructed without having to re-run the classifier training itself.
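
As a minimal sketch of the per-node logging, one JSON object is appended per trained node; the exact field names used in live.json are the library's own, so the ones below are only illustrative.

    import json

    def log_node(node_id, parent_id, left, right, accuracy, log_file="live.json"):
        """Append one JSON record describing a freshly trained node."""
        record = {
            "id": node_id,
            "parent": parent_id,
            "left_classes": sorted(left),
            "right_classes": sorted(right),
            "accuracy": accuracy,
        }
        with open(log_file, "a") as fh:
            fh.write(json.dumps(record) + "\n")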

Please look at the README.md file included with the repository for more information on how to use this particular feature.

Future Work

There is room for a lot more work. Some of it is captured in the various todo-*.mkd files included in the repository; the higher-level goals are captured in the TODO.mkd file in the same repository.

The overall ideas that are yet to be explored are:

  • Picking an optimal objective function that takes into account the aforementioned classifier properties and helps choose a well-balanced yet clean partition.
  • To increase the chances of finding a good solution and competing alternatives, look into diversity-enhancing optimization methods and techniques.
  • Look into using A* search and similar look-ahead based techniques to improve the overall accuracy of the tree-based classifier. Other approaches that do not assume a local maximum is the global maximum are also good to have.
  • Other possible partitioning strategies, for example KNN, Mean Shift, LDA, or PCA.
  • Improve the parallelization, in both quality and speed, so that the library scales linearly with the number of cores it has access to, both on a single machine and across multiple nodes in a cluster.
  • Look into porting the library to C or C++ to resolve the shared-memory management issues.
  • Look into using a GPU-based liblinear for training the classifier; offloading work to the GPU could be much faster.
  • Try different kinds of datasets (images, audio, text) instead of the niche handwritten-character datasets.
  • Explore the limits of the library by training it on datasets with an order of magnitude more classes and samples.