@micmn · Last active August 23, 2017 22:26

Features refactor

This work-in-progress document gathers requirements, design choices, and implementation ideas for the features refactor, in particular concerning these aspects: immutable features, views, iterators, preprocessors, and cross-validation.

Immutable features

Features should be immutable and offer two methods to construct new features:

  • view(SGVector<index_t>) -> Features*: creates a features object with a subset added on top;

  • preprocess(Preprocessor*) -> Features*: creates a features object with a preprocessor added on top.

The view() method should create a new Features instance which shares the same data but with a new Subset added on top of the SubsetStack:

Features* Features::view(SGVector<index_t> idx)
{
    auto feats = this->duplicate(); // shallow copy: shares the data
    feats->add_subset(idx);
    return feats;
}
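
Usage would then look like this (a sketch; train_idx/test_idx are hypothetical index vectors):

    auto train_feats = feats->view(train_idx); // shares data with feats
    auto test_feats = feats->view(test_idx);   // independent subset stack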

The preprocess() method is similar (see Preprocessors).

SGVector/Matrix as argument passed by value

Passing around that SGVector<index_t> idx object seems dangerous, since the caller could modify it somewhere else and thereby change the view. In fact, this is a more general problem caused by the SGVector/SGMatrix shallow-copy constructor (see get_feature_matrix()).
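
To illustrate the danger, a minimal sketch (assuming the current shallow-copy behaviour, where copies share the underlying buffer):

    SGVector<index_t> idx(2);
    idx[0] = 0; idx[1] = 5;
    auto view_feats = feats->view(idx); // the view stores a shallow copy of idx
    idx[0] = 3;                         // the caller mutates the shared buffer...
                                        // ...and the view now selects a different subset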

DenseFeatures::get_feature_matrix()

Accessing the underlying matrix of dense features is common in current algorithms; while certain usages can be replaced by iterators, others can't (e.g. algorithms that need matrix decompositions, solutions of linear systems, etc.).

The current method get_feature_matrix() breaks feature immutability, since the copy constructor of SGMatrix takes a shallow copy of the matrix. Changing the method to return a const reference (get_feature_matrix() -> const SGMatrix<T>&) doesn't help on its own, since one can always take a non-const copy of it and mutate the matrix.

So here there are two options: return a deep copy of the matrix (which wastes memory if the matrix is never changed), or change the copy constructor to make a deep copy and implement a move constructor.

I advocate the latter, since it would also eliminate this way of breaking immutability, e.g.:

SGVector idx = ...
feats->view(std::move(idx))

This belongs to the more general category: putting a mutable object into an immutable one and keeping a reference to it in order to modify it later.

But I recognize it's a pretty big change and I can't judge the impact on the codebase. I'm also not aware of possible hard requirements for a shallow-copy constructor; in such a case, without cloning objects in the constructors and getters of immutable classes, we could facilitate immutability but not enforce it.
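
For concreteness, a simplified, self-contained sketch of the proposed copy/move semantics (the real SGVector is reference-counted; member names here are illustrative):

    #include <algorithm>
    #include <cstdint>

    using index_t = int32_t;

    template <typename T>
    class SGVector
    {
    public:
        explicit SGVector(index_t len) : vlen(len), vector(new T[len]) {}

        // deep copy: the new vector owns its own buffer
        SGVector(const SGVector& other)
            : vlen(other.vlen), vector(new T[other.vlen])
        {
            std::copy(other.vector, other.vector + other.vlen, vector);
        }

        // move: steal the buffer and leave the source empty
        SGVector(SGVector&& other) noexcept
            : vlen(other.vlen), vector(other.vector)
        {
            other.vlen = 0;
            other.vector = nullptr;
        }

        ~SGVector() { delete[] vector; }

    private:
        index_t vlen;
        T* vector;
    };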

Preprocessors

The new API should offer two alternatives:

  • process the features on the fly in the iterators:

        preproc = ZeroMeanPP(inplace=true)
        feats_zero_mean = feats.preprocess(preproc) // compute the mean of the data
        feats_zero_mean.dot_iterator( ... )         // removes the mean on the fly in the iterator

  • create a new features instance:

        preproc = ZeroMeanPP(inplace=false)
        feats_zero_mean = feats.preprocess(preproc) // creates a new features object
        feats_zero_mean.dot_iterator( ... )

My view is that we could make the on-the-fly behaviour the default; if the user wants to eagerly apply the preprocessors (this applies to subsets too), they would do something like: feats->preprocess(pre_1)->preprocess(pre_2)->eval()
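
A sketch of the intended semantics (eval() is the materialization step proposed here, not an existing API):

    auto lazy = feats->preprocess(pre_1)->preprocess(pre_2); // nothing computed yet
    auto eager = lazy->eval(); // applies pre_1, then pre_2, once, and stores the result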

Preprocessors concatenation

An issue is with preprocessors concatenation:

Features* out = feats->preprocess(pre_1)->preprocess(pre_2)

Here feats->preprocess(pre_1) returns a pointer to a new features object that can never be freed, since the intermediate pointer is lost (in fact the same applies to view(), although it seems odd that views would be concatenated...).

Since view()/preprocess() create a new resource, they should simply return a smart pointer (Some; problems with SWIG?) to manage it. A related issue is the return type of view()/preprocess(), which in this case is Some<Features>; a quite annoying fact is that you can't have covariant return types with smart pointers, so one has to do:

Some<DenseFeatures<float64_t>> feats = ...
auto preprocessed_feats = feats->preprocess(preprocessor) // auto = Some<Features>
// CASTING HELL:
auto dense_preprocessed_feats = static_cast<Some<DenseFeatures<float64_t>>>(preprocessed_feats)
// or?
auto dense_preprocessed_feats = wrap(static_cast<DenseFeatures<float64_t>*>(preprocessed_feats.get()))
// do something that requires DenseFeatures...

With covariant return types one could avoid this by overriding Features::preprocess()/view() in every subclass and doing the cast there, but as said this is not possible with smart pointers. A possible workaround (drawbacks?) could be declaring the functions non-virtual, so that each subclass hides the base version, while dynamic dispatch is still achieved internally by duplicate():

// non-virtual
Some<Features> Features::view(SGVector<index_t> idx)
{
    auto feats = wrap(this->duplicate()); // shallow copy
    feats->add_subset(idx);
    return feats;
}

// non-virtual: hides Features::view in the subclass
template <typename T>
Some<DenseFeatures<T>> DenseFeatures<T>::view(SGVector<index_t> idx)
{
    auto feats = wrap(
        static_cast<DenseFeatures<T>*>(
            Features::view(idx).get()));
    return feats;
}

And then:

Some<DenseFeatures<float64_t>> feats = ...
auto preprocessed_feats = feats->preprocess(preprocessor) // auto = Some<DenseFeatures<float64_t>>
// do something that requires DenseFeatures...

Iterators

Algorithms shouldn't access features randomly; instead they should use iterators, e.g. (the STL way):

DotFeatures* features = ...;
SGVector w = ...;
for (const auto& v : DotIterator(features))
    v.add(w, 0.1); // w += 0.1 * feats[i]

Following the STL approach (begin()/end()), the iterator's interface should be:

class DotIterator
    DotIterator(DotFeatures* features);
    iterator begin();
    iterator end();
    
    class iterator
        iterator(DotFeatures* features, index_t idx);
        iterator& operator++();
        bool operator!=(const iterator& other);
        feature operator*();
    
    class feature
        feature(DotFeatures* features, index_t idx);
        void add(const SGVector& v, float64_t alpha=1.0);

To make range-based for loops possible, iterators must have operator*(), so a possibility is to use a feature proxy class like the one above, which could be implemented as:

class iterator
    feature m_feature;
    
    iterator(DotFeatures* features, index_t idx) :
        m_feature(features, idx)
    
    iterator& operator++()
        ++m_feature.m_idx;
        return *this;
    
    feature operator*()
        return m_feature;
    
class feature
    DotFeatures* m_features;
    index_t m_idx;
    
    feature(DotFeatures* f, index_t i) :
        m_features(f), m_idx(i)
    
    void add(const SGVector& v, float64_t alpha=1.0)
        m_features->do_something(m_idx, ...)
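
The operator!=() declared in the interface above is not shown; it would naturally compare indices (sketch, in the same pseudocode style):

    bool iterator::operator!=(const iterator& other)
        return m_feature.m_idx != other.m_feature.m_idx;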

Also SGVector/Matrix should offer an iterator interface:

template <typename T>
class SGVector
    iterator begin();
    iterator end();
    
    class iterator
        T& operator*();
        ...

template <typename T>
class SGMatrix
    iterator begin();
    iterator end();
    
    class iterator
        SGVector<T> operator*(); // non-owning SGVector, get_column(idx)
        ...
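
Column-wise traversal would then look like this (a sketch, assuming the column-iterator semantics above):

    SGMatrix<float64_t> m = ...;
    for (auto col : m)      // col is a non-owning SGVector<float64_t>
        for (auto& x : col) // element-wise access within the column
            x *= 2.0;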

A common pattern is iterating over several containers at once, e.g. the perceptron algorithm iterates over three containers: features, train labels, and output labels.

for idx in range(num_vec):
    predicted_label = features->dot(idx, w) + bias
    true_label = train_labels[idx]
    output[idx] = predicted_label

Hence a zip-like function will be one of the first things needed (and probably other common iteration utilities too, so it might be worth investigating an external library).
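
A minimal, self-contained sketch of what a two-range zip could look like (hypothetical names, unrelated to any existing Shogun API; C++17):

    #include <iostream>
    #include <tuple>
    #include <vector>

    // Minimal two-range zip: iterates both ranges in lockstep.
    template <typename A, typename B>
    class zip2
    {
    public:
        zip2(A& a, B& b) : m_a(a), m_b(b) {}

        class iterator
        {
        public:
            iterator(typename A::iterator ia, typename B::iterator ib)
                : m_ia(ia), m_ib(ib) {}
            iterator& operator++() { ++m_ia; ++m_ib; return *this; }
            bool operator!=(const iterator& o) const
            { return m_ia != o.m_ia && m_ib != o.m_ib; } // stop at the shorter range
            auto operator*() { return std::tie(*m_ia, *m_ib); }
        private:
            typename A::iterator m_ia;
            typename B::iterator m_ib;
        };

        iterator begin() { return {m_a.begin(), m_b.begin()}; }
        iterator end() { return {m_a.end(), m_b.end()}; }

    private:
        A& m_a;
        B& m_b;
    };

    int main()
    {
        std::vector<double> predictions{0.3, -1.2, 0.8};
        std::vector<int> labels{1, -1, 1};
        for (auto&& [pred, label] : zip2(predictions, labels)) // C++17 CTAD
            std::cout << pred << " " << label << "\n";
    }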

On-the-fly evaluation

On-the-fly evaluation could be achieved as follows (dense case):

  1. the i-th feature vector is requested, e.g. by an iterator
  2. the requested index i is transformed by the subset stack into i'
  3. the i'-th feature vector is fetched and passed through the preprocessor stack
  4. the resulting vector is returned

Some methods trigger feature evaluation (e.g. get_feature_matrix()).
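
A sketch of steps 2-4 (helper names like fetch_column() and apply_to_vector() are illustrative assumptions, not an existing API):

    template <typename T>
    SGVector<T> DenseFeatures<T>::get_feature_vector(index_t i) const
    {
        // step 2: map the requested index through the subset stack
        index_t real_i = m_subset_stack->subset_idx_conversion(i);
        // step 3: fetch the raw column and pass it through the preprocessor stack
        SGVector<T> vec = fetch_column(real_i);
        for (auto* p : m_preprocessor_stack)
            vec = p->apply_to_vector(vec);
        // step 4: return the resulting vector
        return vec;
    }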

Cross-validation

Immutable features (and labels) are inherently thread-safe, and thus multithreaded cross-validation could easily be implemented by means of the view() method:

#pragma omp parallel for
for (index_t i=0; i < num_subsets; ++i)
    auto machine = m_machine->clone()
    auto subset_idx = m_splitting_strategy->generate_subset_inverse(i)
    auto features = m_features->view(subset_idx)
    auto labels = m_labels->view(subset_idx)
    machine->set_labels(labels)
    machine->train(features)
    ...

Make it happen

Pick a features class (e.g. DenseFeatures):

  1. Make the features/labels getters const (+ address this)
  2. Code that uses getters (e.g. get_feature_matrix()) should use const references instead of copying
  3. Remove the features/labels setters/mutators, or make them private for the internal implementation (add_subset etc.)
  4. Fix failing unit tests
  5. Implement iterators

Getters constness + dropping setters/mutators or making them private
  • Features:

    • make all the preprocessor/subset methods private, for the internal implementation
    • drop load() (there is already a constructor)
  • DotFeatures: no changes needed

  • DenseFeatures:

    • free_*(features/feature_matrix/etc.)?
    • make get_feature_vector/matrix const; drop set_feature_vector/matrix and steal_feature_matrix
    • obtain_from_dot -> constructor?
    • drop set_num_features/vectors, reshape?

[more details coming...]

Failing unit-tests
 27 - unit-StochasticGBMachine (Fixed)
 32 - unit-LeastAngleRegression (Fixed)
 46 - unit-LMNNImpl (Fixed)
 47 - unit-LMNN (Fixed)
167 - unit-CARTree (Fixed)
 
 58 - unit-Block (Failed)
 74 - unit-StreamingDenseFeaturesTest (Failed)
 80 - unit-StreamingHashedDenseFeaturesTest (Failed)
122 - unit-LogPlusOne (Failed)
123 - unit-RescaleFeatures (Failed)
163 - unit-RandomCARTree (OTHER_FAULT)
165 - unit-C45ClassifierTree (Failed)
166 - unit-ID3ClassifierTree (Failed)
168 - unit-RandomForest (Failed)
170 - unit-CHAIDTree (SEGFAULT)
178 - unit-BaggingMachine (Failed)
185 - unit-GaussianProcessClassification (SEGFAULT)
227 - unit-KMeans (Failed)
karlnapf commented Aug 9, 2017

  • "Note: passing around that SGVector<index_t> idx object seems dangerous given that it could be modified somewhere else by the caller changing the view?"How would you do that otherwise? I mean you could just copy the indices?
  • As said in email, preprocessors either should work on the fly or one has to accept the expensive operation they cause
  • Currently Preprocessor::apply(Features*) -> Features*, in the dense case, modifies the underlying matrix and also it creates a new instance of Features which shares the same memory of the input (why?). ........ God knows, well spotted!
  • Definitely offer range based iterators!
  • +1 for the SGMatrix iterator
  • +1 for zipping. But a wrapper might do that no?
