@vinx13
Last active March 24, 2018 12:28

Enforce constness of SGMatrix / SGVector

Currently, one can always mutate a const matrix or vector by creating a shadow copy, which breaks the semantics of const references. We can make the copy constructor perform a deep copy and add a move constructor. The move constructor should take ownership of the underlying memory and leave the original object uninitialized.

For example

SGMatrix<T>::SGMatrix(const SGMatrix<T>& orig);          // deep copy
SGMatrix<T>& SGMatrix<T>::operator=(const SGMatrix<T>&); // deep copy
SGMatrix<T>::SGMatrix(SGMatrix<T>&&);                    // move ctor
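
As a rough sketch (the member names matrix, num_rows and num_cols are taken from the current SGMatrix, while the interaction with reference counting is glossed over), the move constructor could look like:

template <class T>
SGMatrix<T>::SGMatrix(SGMatrix<T>&& orig)
{
    // steal the underlying buffer instead of copying it
    matrix = orig.matrix;
    num_rows = orig.num_rows;
    num_cols = orig.num_cols;

    // leave the moved-from object uninitialized so it no longer
    // shares (or frees) the memory
    orig.matrix = nullptr;
    orig.num_rows = 0;
    orig.num_cols = 0;
}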

Preprocessor

Preprocessors have a unified three-stage API:

(constructor) // initialize and set up parameters
void fit(CFeatures *); // fit to training features; preprocessors that do not require training data have an empty implementation
Some<CFeatures> apply(CFeatures *); // apply to features by creating a new instance, maybe 'transform' would be a better name
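
A minimal sketch of a preprocessor following this API; the class name and the helper functions are hypothetical and only illustrate the fit/apply split:

class CMeanCenter : public CPreprocessor
{
public:
    CMeanCenter() : CPreprocessor() {}   // set up parameters

    void fit(CFeatures* features) override
    {
        // estimate the column means from the training features
        m_mean = compute_mean(features);   // hypothetical helper
    }

    Some<CFeatures> apply(CFeatures* features) override
    {
        // return a new feature instance with the mean subtracted;
        // the input features are left untouched
        return subtract_mean(features, m_mean);   // hypothetical helper
    }

private:
    SGVector<float64_t> m_mean;
};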

We will remove all the preprocessor machinery from CFeatures (I am not sure whether we should do this). To apply multiple preprocessors to a feature object, one can use a pipeline.

Pipeline

A pipeline is a convenient wrapper around a collection of preprocessors and a machine as the final stage.

class Pipeline {
    void add_preprocessor(CPreprocessor *);
    void add_machine(CMachine *);
    void fit(CFeatures *); // fit preprocessors on features one by one, and train the machine on transformed features
    void fit(CFeatures *, CLabels *); // fit preprocessors, and then fit the machine with features and training labels
    Some<CLabels> predict(CFeatures *);
};
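
A usage sketch of this interface, assuming the some<T>() factory for Some<T> and a few existing preprocessor/machine classes:

auto pipeline = some<Pipeline>();
pipeline->add_preprocessor(some<CNormOne>());    // any CPreprocessor
pipeline->add_preprocessor(some<CPCA>());
pipeline->add_machine(some<CLibLinear>());

pipeline->fit(train_features, train_labels);     // fit preprocessors, then train the machine
auto predictions = pipeline->predict(test_features);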

A convenience method to create a pipeline, which needs some template tricks (a variadic-template sketch is given below).

Some<Pipeline> make_pipeline(CPreprocessor *, CPreprocessor *, ..., CMachine *);

Or we can have a vector of preprocessors:

Some<Pipeline> make_pipeline(std::vector<CPreprocessor *> preprocessors, CMachine *);
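
A sketch of the variadic-template trick mentioned above: a helper peels off preprocessors one by one (preserving their order) and a final overload consumes the machine. Names and structure are only illustrative.

namespace detail
{
    inline void add_stages(Pipeline* pipeline, CMachine* machine)
    {
        // the last argument is the machine, which ends the recursion
        pipeline->add_machine(machine);
    }

    template <class... Rest>
    void add_stages(Pipeline* pipeline, CPreprocessor* first, Rest... rest)
    {
        pipeline->add_preprocessor(first);   // preserve the given order
        add_stages(pipeline, rest...);
    }
}

template <class... Args>
Some<Pipeline> make_pipeline(Args... args)
{
    auto pipeline = some<Pipeline>();
    detail::add_stages(pipeline.get(), args...);
    return pipeline;
}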

View

Some<CFeatures> CFeatures::view(SGVector index);
Some<CLabels> CLabels::view(SGVector index);

The view method creates a new instance of the features or labels, which shadow-copies the underlying data. A subset is added to the subset stack of the new instance. After this, we will make all subset APIs in CFeatures and CLabels private.

Taken from @micmn's design, we need to solve the issue of covariant return types.

// non-virtual
Some<Features> Features::view(SGVector<index_t> idx); // create a new instance

// non-virtual, hides the base version and does the type cast
Some<DenseFeatures<T>> DenseFeatures<T>::view(SGVector<index_t> idx)
{
    auto feats = wrap(
        static_cast<DenseFeatures<T>*>(
            Features::view(idx).get()));
    return feats;
}
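
Usage would then look roughly like this (assuming some existing DenseFeatures<float64_t>* features):

SGVector<index_t> idx(3);
idx[0] = 0; idx[1] = 4; idx[2] = 7;

// new instance that shadow-copies the data and carries its own subset;
// `features` itself and its subset stack are untouched
auto view = features->view(idx);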

Feature Iterator

There is a set of old iteration APIs in CDotFeatures:

void* get_feature_iterator(int32_t vector_index);
bool get_next_feature(int32_t& index, float64_t& value, void* iterator);
void free_feature_iterator(void* iterator);

These APIs expose data as raw pointers. We will adapt them to the new iterator design in DotIterator. This also involves refactoring LibLinear, where they are mostly used.
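
Assuming DotIterator keeps its current shape (range-based iteration where each element exposes dot() and add()), the old raw-pointer loops could be rewritten along these lines; this is a sketch, not the final API:

SGVector<float64_t> w(features->get_dim_feature_space());
float64_t sum = 0;

for (const auto& vec : DotIterator(features))
{
    sum += vec.dot(w);   // dot product with a dense vector
    vec.add(0.1, w);     // w += 0.1 * feature vector
}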

Refactor non-const methods of features

We could start with CDotFeatures, which is the superclass of many other feature types. There are many non-const methods, for example:

virtual float64_t dot(int32_t vec_idx1, CDotFeatures* df, int32_t vec_idx2)=0;
virtual void add_to_dense_vec(float64_t alpha, int32_t vec_idx1, float64_t* vec2, int32_t vec2_len, bool abs_val=false)=0;

They are non-const because they call other non-const methods to get the actual feature vector, e.g. get_feature_vector(int32_t num, int32_t& len, bool& dofree) in the CDenseFeatures case, which may compute the feature vector on the fly (using the subclass implementation) and then cache it. We can make the cache mutable so that these methods can be const.

We will add locks to CCache for thread safety, enabling concurrent access to features.
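
A self-contained illustration of the mutable-cache idea (deliberately not Shogun code): a const getter can still fill the cache because the cache and its lock are declared mutable, and the lock makes concurrent const calls safe.

#include <mutex>
#include <unordered_map>
#include <vector>

class CachedFeatures
{
public:
    const std::vector<double>& get_feature_vector(int num) const
    {
        std::lock_guard<std::mutex> lock(m_cache_lock);
        auto it = m_cache.find(num);
        if (it == m_cache.end())
            it = m_cache.emplace(num, compute_vector(num)).first;   // compute on the fly, then cache
        return it->second;
    }

private:
    std::vector<double> compute_vector(int num) const
    {
        return std::vector<double>(16, static_cast<double>(num));   // stand-in for the real computation
    }

    mutable std::unordered_map<int, std::vector<double>> m_cache;   // may change under const methods
    mutable std::mutex m_cache_lock;                                // guards concurrent access
};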

Untemplated Matrix and Vector

We use expression templates for lazy evaluation. Exp is the base class of all expressions.

template <class Derived>
class Exp {
public:
	Derived& self();
	const Derived& self() const;

	typename Derived::untemplated_result_type eval();

	template <class PType>
	typename Derived::result_type<PType> eval_templated();
};

It has the following direct subtypes that define the corresponding result_type and untemplated_result_type.

// MatrixExp represents a expression that evaluates to a Matrix
template <class Derived>
class MatrixExp: public Exp<MatrixExp<Derived>> {
public:
	using untemplated_result_type = Matrix;

	template <class PType>
	using result_type = SGMatrix<PType>;
};

// ScalarExp and VectorExp can be defined similarly
template <class Derived>
class VectorExp;

// untemplated_result_type and result_type are both primitive types in ScalarExp
template <class Derived>
class ScalarExp;

We have several expression types, e.g.

template<class OP, class E> class UnaryMatrixExp;
template<class OP, class E> class UnaryVectorExp;
template<class OP, class E1, class E2> class BinaryMatrixExp;
template<class OP, class E1, class E2> class BinaryVectorExp;

Matrix can be implicitly converted to a UnaryMatrixExp that evaluates to itself, and likewise for Vector.

A possible implementation of BinaryMatrixExp is:

template<class OP, class E1, class E2>
class BinaryMatrixExp: public MatrixExp<BinaryMatrixExp<OP, E1, E2>> {
public:
	BinaryMatrixExp(const Exp<E1>& left, const Exp<E2>& right)
		: left(left), right(right) {}

	template<class PType>
	result_type<PType> eval_templated() {
		return OP::apply<PType>(
			left.eval_templated<PType>(),
			right.eval_templated<PType>());
	}

	untemplated_result_type eval() {
		switch (ptype()) {
		case FLOAT64:
			return untemplated_result_type(
				eval_templated<float64_t>());
		...
		}
	}

private:
	const Exp<E1>& left;
	const Exp<E2>& right;
};

We need to wrap linalg functions as static methods, e.g.

struct MatrixAdd {
	typedef Matrix result_type;
	template<class T>
	static SGMatrix<T> apply(
		const SGMatrix<T>&, 
		const SGMatrix<T>&
	);
};
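
For instance, apply could simply delegate to the existing eager linalg implementation (a sketch; whether linalg::add can be called directly with const references is an assumption):

template <class T>
SGMatrix<T> MatrixAdd::apply(const SGMatrix<T>& a, const SGMatrix<T>& b)
{
	// delegate to the non-lazy linalg backend
	return linalg::add(a, b);
}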

Then we need to provide factory methods that construct expressions depending on the result type of the operation and the number of arguments. We will define several overloaded versions that return different types of expressions, e.g.

template<class OP, class E1, class E2,
	class = typename std::enable_if<
		std::is_same<Matrix, typename OP::result_type>::value>::type>
BinaryMatrixExp<OP, E1, E2> makeExp(const Exp<E1>&, const Exp<E2>&);

Finally, we can define linalg APIs over expressions. For example, we can overload operators like

template<class E1, class E2>
auto operator+(const MatrixExp<E1>& left, const MatrixExp<E2>& right) {
	return makeExp<MatrixAdd>(left, right);
}

The lazy expression supports implicit evaluation using the assignment operator.

// operator= has to be a member, e.g. in Matrix (and likewise in Vector):
template<class E>
Matrix& Matrix::operator=(const Exp<E>& exp) {
	*this = exp.eval();
	return *this;
}

An example usage:

Matrix X;
Vector w, b;
// initialize these variables in some way ...

auto wx_exp = w * X;      // lazy, nothing is evaluated yet
auto y_exp = wx_exp + b;  // still lazy

Vector y;
y = y_exp;                // implicit evaluation via the assignment operator
// or trigger evaluation explicitly
y = y_exp.eval();