This work in progress document gathers requirements, design choices and implementation ideas for the features refactor and in particular concerning these aspects: immutable features, views, iterators, preprocessors and cross-validation.
Features should be immutable and offer two methods to construct new features:
-
view(Vector<index_t>) -> Features*
: creates a features object with a subset added on top; -
preprocess(Preprocessor*) -> Features*
: creates a features object with a preprocessor added on top.
The view()
method should create a new Features
instance which shares
the same data but with a new Subset
added on top of the SubsetStack
:
Features* Features::view(SGVector<index_t> idx)
auto feats = this->duplicate() // shallow copy
feats->add_subset(idx)
return feats
The preprocess()
method is similar (see Preprocessors).
Passing around that SGVector<index_t> idx
object seems dangerous since
it could be modified somewhere else by the caller with the effect of changing the view?
In fact this is more general problem caused by the sgvector/matrix shallow copy 'ctor
(see get_feature_matrix).
Accessing the underlying matrix of dense features is common in current algorithms, while certain usages can be replaced by iterators, others can't (i.e. algorithms that need matrix decomposition, solution of linear systems etc.).
The current method get_feature_matrix()
breaks features immutability since
the copy 'ctor of SGMatrix
takes a shallow-copy of the matrix, changing the
method to return a const ref (get_feature_matrix() -> const SGMatrix<T>&
)
doesn't help alone since one can always take a non-const copy of it and mutate the matrix.
So here the option are two: return a deep copy of the matrix (but it wastes memory if the matrix is not changed) or changing the copy 'ctor to make a deep copy and implement a move 'ctor.
I advocate the latter since it would also eliminate this way of breaking immutability, i.e.:
SGVector idx = ...
feats->view(std::move(idx))
Which belongs to the more general category: put a mutable object into an immutable one and keep a reference to it to modify it later.
But I recognize it's a pretty big change and I can't judge the impact on the codebase, also I'm not aware of possible hard-requirements for a shallow copy 'ctor, in such a case, and without cloning objects into immutables 'ctor and getters, we could facilitate immutability but not enforce it.
The new API should offer two alternatives:
- process on-the-fly the features in the iterators
preproc=ZeroMeanPP(inplace=true)
feats_zero_mean=feats.preprocess(preproc) // compute the mean of the data
feats_zero_mean.dot_iterator( ... ) // removes mean on the fly in the iterator
- create a new features instance
preproc=ZeroMeanPP(inplace=false)
feats_zero_mean=feats.preprocess(preproc) // creates new features object
feats_zero_mean.dot_iterator( ... )
My view is that we could let the on-the-fly behaviour be the default and if the
user wants to apply the preprocessors he does something like (apply to subsets too):
feats->preprocess(pre_1)->preprocess(pre_2).eval()
An issue is with preprocessors concatenation:
Features* out = feats->preprocess(pre_1)->preprocess(pre_2)
Here feats->preprocess(pre_1)
returns a pointer to a new features object that can't be freed
(in fact the same applies to view()
, although it seems odd that views are concatenated...).
Since view()/preprocess()
create a new resource they should simply return a smart pointer
(Some
, problems with swig?) to manage it. A related thing is view()/preprocess()
return type that in this case
is Some<Features>
, a quite annoying fact is that you can't have covariant return types with
smart pointers so one has to do:
Some<DenseFeatures<float64_t>> feats = ...
auto preprocessed_feats = feats->preprocess(preprocessor) // auto = Some<Features>
// CASTING HELL:
auto dense_preprocessed_feats = static_cast<Some<DenseFeatures<float64_t>>>(preprocessed_feats)
// or?
auto dense_preprocessed_feats = wrap(static_cast<DenseFeatures<float64_t>*>(preprocessed_feats.get()))
// do something that requires DenseFeatures...
Using covariant types one could avoid this overriding Features::preprocess()/view/()
in
every subclass and to the cast there, but as said this is not possible with smart pointers.
A possible workaround (drawbacks?) could be declaring the functions non-virtual while
dynamic dispatching is still achieved internally by duplicate()
:
@non-virtual
Some<Features> Features::view(SGVector<index_t> idx)
auto feats = wrap(this->duplicate()) // shallow copy
feats->add_subset(idx)
return feats
@non-virtual
Some<DenseFeatures<T>> DenseFeatures::view(SGVector<index_t> idx)
auto feats = wrap(
static_cast<DenseFeatures<T>*>(
Features::view(idx).get()))
return feats
And then:
Some<DenseFeatures<float64_t>> feats = ...
auto preprocessed_feats = feats->preprocess(preprocessor) // auto = Some<DenseFeatures<float64_t>>
// do something that requires DenseFeatures...
Algorithms can't random access features, instead they should use iterators, i.e. (STL-way):
DotFeatures *features = ...;
SGVector w = ...;
for (const auto& v : DotIterator(features))
v.add(w, 0.1) // w = w+w*feats[i]*0.1
Following the STL's approach (begin()/end()) the iterator's interface should be:
class DotIterator
DotIterator(DotFeatures* features);
iterator begin();
iterator end();
class iterator
iterator(DotFeatures* features, index_t idx);
iterator& operator++();
bool operator!=();
feature operator*();
class feature
feature(DotFeatures* features, index_t idx);
void add(const SGVector& v, float64_t alpha=1.0);
To make it possible to use range-based for loops, iterators must have operator*()
so a possibility is using a feature
mock class like above that could be implemented as:
class iterator
feature m_feature;
iterator(DotFeatures* features, index_t idx) :
m_feature(features, idx)
iterator& operator++()
m_feature.idx++;
return *this;
feature operator*()
return m_feature;
class feature {
index_t idx;
DotFeatures* features;
feature(DotFeatures* f, index_t i) :
m_features(f), idx(i)
void add(const SGVector& v, float64_t alpha=1.0)
m_features->do_something(m_idx, ...)
Also SGVector/Matrix
should offer an iterator interface:
template <typename T>
class SGVector
iterator begin();
iterator end();
class iterator
T& operator*();
...
template <typename T>
class SGMatrix
iterator begin();
iterator end();
class iterator
SGVector<T> operator*(); // non-owning SGVector, get_column(idx)
...
A common pattern is iterating over many containers at once, i.e. the perceptron algorithm iterates over three containers: features, train labels and output labels.
for idx in range(num_vec):
predicted_label = features->dot(idx, w) + bias
true_label = train_labels[idx]
output[idx] = predicted_label
Hence a zip-like function will be one of the first things needed (and probably other common iter functions too, so it might be useful investigating into an external library).
On-the-fly evaluation could by achieved by (dense case):
- the i-th feature vector is requested by i.e. an iterator
- the requested index i gets transformed by the subsets stack into i'
- the i'-th feature vector is fetched and passed through the preprocessors stack
- the resulting vector is returned
Some methods trigger features evaluation (i.e. get_feature_matrix()
).
Immutable features (and labels) are inherently thread-safe and thus multithreaded xval could be implemented easily by means of the view method:
#pragma omp parallel for
for (index_t i=0; i < num_subsets; ++i)
auto machine = m_machine->clone()
auto subset_idx = m_splitting_strategy->generate_subset_inverse(i)
auto features = m_features->view(subset_idx)
auto labels = m_labels->view(subset_idx)
machine->set_labels(labels)
machine->train(features)
...
Pick a features class (i.e. DenseFeatures):
- Make the features/labels getters const (+ address this)
- Code that use getters (i.e. get_feature_matrix()) should use const ref instead of copying
- Remove the features/labels setters/mutators or make them private for internal implementation (add_subset etc.)
- Fix failing unit-tests
- Implement iterators
-
Features:
- make private all the preprocessor/subset methods for internal implementation
- drop load (there is already a 'ctor)
-
DotFeatures: no changes needed
-
DenseFeatures:
- free_*(features/feature_matrix/etc) ?
- make const get_feature_vector/matrix, drop set_feature_vector/matrix, steal_feature_matrix
- obtain_from_dot -> 'ctor?
- drop set_num_features/vectors, reshape?
[more details coming...]
27 - unit-StochasticGBMachine (Fixed)
32 - unit-LeastAngleRegression (Fixed)
46 - unit-LMNNImpl (Fixed)
47 - unit-LMNN (Fixed)
167 - unit-CARTree (Fixed)
58 - unit-Block (Failed)
74 - unit-StreamingDenseFeaturesTest (Failed)
80 - unit-StreamingHashedDenseFeaturesTest (Failed)
122 - unit-LogPlusOne (Failed)
123 - unit-RescaleFeatures (Failed)
163 - unit-RandomCARTree (OTHER_FAULT)
165 - unit-C45ClassifierTree (Failed)
166 - unit-ID3ClassifierTree (Failed)
168 - unit-RandomForest (Failed)
170 - unit-CHAIDTree (SEGFAULT)
178 - unit-BaggingMachine (Failed)
185 - unit-GaussianProcessClassification (SEGFAULT)
227 - unit-KMeans (Failed)