micmn/features.md Secret

## features.md

      
    Raw
  

              features.md
            
          
    Features refactor

This work in progress document gathers requirements, design choices and implementation ideas
for the features refactor and in particular concerning these aspects:
immutable features, views, iterators, preprocessors and cross-validation.

Immutable features
Preprocessors
Iterators
On-the-fly evaluation
Cross-validation
Make it happen

Immutable features

Features should be immutable and offer two methods to construct new features:


view(Vector<index_t>) -> Features*:
creates a features object with a subset added on top;


preprocess(Preprocessor*) -> Features*:
creates a features object with a preprocessor added on top.


The view() method should create a new Features instance which shares
the same data but with a new Subset added on top of the SubsetStack:
Features* Features::view(SGVector<index_t> idx)
    auto feats = this->duplicate() // shallow copy
    feats->add_subset(idx)
    return feats
The preprocess() method is similar (see Preprocessors).

SGVector/Matrix as argument passed by value

Passing around that SGVector<index_t> idx object seems dangerous since
it could be modified somewhere else by the caller with the effect of changing the view?
In fact this is more general problem caused by the sgvector/matrix shallow copy 'ctor
(see get_feature_matrix).

DenseFeatures::get_feature_matrix()

Accessing the underlying matrix of dense features is common in current algorithms,
while certain usages can be replaced by iterators, others can't (i.e. algorithms
that need matrix decomposition, solution of linear systems etc.).
The current method get_feature_matrix() breaks features immutability since
the copy 'ctor of SGMatrix takes a shallow-copy of the matrix, changing the
method to return a const ref (get_feature_matrix() -> const SGMatrix<T>&)
doesn't help alone since one can always take a non-const copy of it and mutate the matrix.
So here the option are two: return a deep copy of the matrix
(but it wastes memory if the matrix is not changed) or changing
the copy 'ctor to make a deep copy and implement a move 'ctor.
I advocate the latter since it would also eliminate this
way of breaking immutability, i.e.:
SGVector idx = ...
feats->view(std::move(idx))
Which belongs to the more general category: put a mutable object into an
immutable one and keep a reference to it to modify it later.
But I recognize it's a pretty big change and I can't judge the impact on the codebase,
also I'm not aware of possible hard-requirements for a shallow copy 'ctor, in such
a case, and without cloning objects into immutables 'ctor and getters, we could
facilitate immutability but not enforce it.
Preprocessors

The new API should offer two alternatives:

process on-the-fly the features in the iterators

preproc=ZeroMeanPP(inplace=true)
feats_zero_mean=feats.preprocess(preproc) // compute the mean of the data
feats_zero_mean.dot_iterator( ... ) // removes mean on the fly in the iterator

create a new features instance

preproc=ZeroMeanPP(inplace=false)
feats_zero_mean=feats.preprocess(preproc) // creates new features object
feats_zero_mean.dot_iterator( ... )
My view is that we could let the on-the-fly behaviour be the default and if the
user wants to apply the preprocessors he does something like (apply to subsets too):
feats->preprocess(pre_1)->preprocess(pre_2).eval()
Preprocessors concatenation

An issue is with preprocessors concatenation:
Features* out = feats->preprocess(pre_1)->preprocess(pre_2)
Here feats->preprocess(pre_1) returns a pointer to a new features object that can't be freed
(in fact the same applies to view(), although it seems odd that views are concatenated...).
Since view()/preprocess() create a new resource they should simply return a smart pointer
(Some, problems with swig?) to manage it. A related thing is view()/preprocess() return type that in this case
is Some<Features>, a quite annoying fact is that you can't have covariant return types with
smart pointers so one has to do:
Some<DenseFeatures<float64_t>> feats = ...
auto preprocessed_feats = feats->preprocess(preprocessor) // auto = Some<Features>
// CASTING HELL:
auto dense_preprocessed_feats = static_cast<Some<DenseFeatures<float64_t>>>(preprocessed_feats)
// or?
auto dense_preprocessed_feats = wrap(static_cast<DenseFeatures<float64_t>*>(preprocessed_feats.get()))
// do something that requires DenseFeatures...
Using covariant types one could avoid this overriding Features::preprocess()/view/() in
every subclass and to the cast there, but as said this is not possible with smart pointers.
A possible workaround (drawbacks?) could be declaring the functions non-virtual while
dynamic dispatching is still achieved internally by duplicate():
@non-virtual
Some<Features> Features::view(SGVector<index_t> idx)
    auto feats = wrap(this->duplicate()) // shallow copy
    feats->add_subset(idx)
    return feats
    
@non-virtual
Some<DenseFeatures<T>> DenseFeatures::view(SGVector<index_t> idx)
    auto feats = wrap(
        static_cast<DenseFeatures<T>*>(
            Features::view(idx).get()))
    return feats
And then:
Some<DenseFeatures<float64_t>> feats = ...
auto preprocessed_feats = feats->preprocess(preprocessor) // auto = Some<DenseFeatures<float64_t>>
// do something that requires DenseFeatures...
Iterators

Algorithms can't random access features, instead they should use iterators,
i.e. (STL-way):
DotFeatures *features = ...;
SGVector w = ...;
for (const auto& v : DotIterator(features))
    v.add(w, 0.1) // w = w+w*feats[i]*0.1
Following the STL's approach (begin()/end()) the iterator's interface should be:
class DotIterator
    DotIterator(DotFeatures* features);
    iterator begin();
    iterator end();
    
    class iterator
        iterator(DotFeatures* features, index_t idx);
        iterator& operator++();
        bool operator!=();
        feature operator*();
    
    class feature
        feature(DotFeatures* features, index_t idx);
        void add(const SGVector& v, float64_t alpha=1.0);
To make it possible to use range-based for loops, iterators must have operator*()
so a possibility is using a feature mock class like above that could be implemented as:
class iterator
    feature m_feature;
    
    iterator(DotFeatures* features, index_t idx) :
        m_feature(features, idx)
    
    iterator& operator++()
        m_feature.idx++;
        return *this;
    
    feature operator*()
        return m_feature;
    
class feature {
    index_t idx;
    DotFeatures* features;
    
    feature(DotFeatures* f, index_t i) :
        m_features(f), idx(i)
    
    void add(const SGVector& v, float64_t alpha=1.0)
        m_features->do_something(m_idx, ...)
Also SGVector/Matrix should offer an iterator interface:
template <typename T>
class SGVector
    iterator begin();
    iterator end();
    
    class iterator
        T& operator*();
        ...

template <typename T>
class SGMatrix
    iterator begin();
    iterator end();
    
    class iterator
        SGVector<T> operator*(); // non-owning SGVector, get_column(idx)
        ...
A common pattern is iterating over many containers at once, i.e.
the perceptron algorithm iterates over three containers:
features, train labels and output labels.
for idx in range(num_vec):
    predicted_label = features->dot(idx, w) + bias
    true_label = train_labels[idx]
    output[idx] = predicted_label
Hence a zip-like function will be one of the first things needed (and probably
other common iter functions too, so it might be useful investigating
into an external library).
On-the-fly evaluation

On-the-fly evaluation could by achieved by (dense case):

the i-th feature vector is requested by i.e. an iterator
the requested index i gets transformed by the subsets stack into i'
the i'-th feature vector is fetched and passed through the preprocessors stack
the resulting vector is returned

Some methods trigger features evaluation (i.e. get_feature_matrix()).
Cross-validation

Immutable features (and labels) are inherently thread-safe and thus multithreaded xval
could be implemented easily by means of the view method:
#pragma omp parallel for
for (index_t i=0; i < num_subsets; ++i)
    auto machine = m_machine->clone()
    auto subset_idx = m_splitting_strategy->generate_subset_inverse(i)
    auto features = m_features->view(subset_idx)
    auto labels = m_labels->view(subset_idx)
    machine->set_labels(labels)
    machine->train(features)
    ...
Make it happen

Pick a features class (i.e. DenseFeatures):

Make the features/labels getters const (+ address this)
Code that use getters (i.e. get_feature_matrix()) should use const ref instead of copying
Remove the features/labels setters/mutators or make them private
for internal implementation (add_subset etc.)
Fix failing unit-tests
Implement iterators

Getters constness + drop or make private setters/mutators


Features:

make private all the preprocessor/subset methods for internal implementation
drop load (there is already a 'ctor)


DotFeatures: no changes needed


DenseFeatures:

free_*(features/feature_matrix/etc) ?
make const get_feature_vector/matrix, drop set_feature_vector/matrix, steal_feature_matrix
obtain_from_dot -> 'ctor?
drop set_num_features/vectors, reshape?


[more details coming...]
Failing unit-tests

 27 - unit-StochasticGBMachine (Fixed)
 32 - unit-LeastAngleRegression (Fixed)
 46 - unit-LMNNImpl (Fixed)
 47 - unit-LMNN (Fixed)
167 - unit-CARTree (Fixed)
 
 58 - unit-Block (Failed)
 74 - unit-StreamingDenseFeaturesTest (Failed)
 80 - unit-StreamingHashedDenseFeaturesTest (Failed)
122 - unit-LogPlusOne (Failed)
123 - unit-RescaleFeatures (Failed)
163 - unit-RandomCARTree (OTHER_FAULT)
165 - unit-C45ClassifierTree (Failed)
166 - unit-ID3ClassifierTree (Failed)
168 - unit-RandomForest (Failed)
170 - unit-CHAIDTree (SEGFAULT)
178 - unit-BaggingMachine (Failed)
185 - unit-GaussianProcessClassification (SEGFAULT)
227 - unit-KMeans (Failed)