CatBoost C++ library

Kiwi.com has open sourced a new library for applying CatBoost models, written in C++ with reasonably good optimizations and no external dependencies. This means you can use it in your projects to apply CatBoost models without depending on the huge official CatBoost library. Moreover, it is easy to integrate this library into your own build process.

History

In 2009 Yandex introduced the MatrixNet algorithm and started using it for ranking web pages. The algorithm itself is gradient boosting on decision trees, and the main advantage of this implementation is that it is a great deal easier to tune parameters for MatrixNet than for XGBoost or other competing implementations.

Until 2017 nobody but Yandex and CERN could use MatrixNet, because it was a proprietary product. But times changed in 2017, when Yandex released CatBoost. CatBoost is a very similar algorithm and implementation, but it supports categorical features and is an open source library, so you can use it in your own products. I was working at Yandex at the time and can say that it was a great success for the company. CatBoost is supported by Apple CoreML, and lots of companies use this library for machine learning tasks.

So it is no surprise that when it came to solving prediction problems at Kiwi.com, where I have been working as a C++ developer, I decided to use a CatBoost model as a baseline. In the end it turned out to be the best ML algorithm in a long list of competitors. But there was a problem: it is not easy to integrate the original CatBoost library into your project if you are not using ya.make (and you aren't). I would love to see ya.make open sourced: it is a great build system with a nice command line interface, but we live in the real world. So I had two options:

  1. Take the Yandex balancer project, extract the CMake-related code, and then implement build scripts for CatBoost.
  2. Implement the CatBoost applying library myself.

Yes, I know that implementing everything myself is not generally a good solution, but in this case it was not hard to do, so I decided to implement CatBoost applying from scratch.

Problem

Gradient boosting on decision trees is a very simple thing in terms of applying the model. But let's start with the basics. Suppose we have a feature space $x \in \mathbb{R}^n$ and a training dataset $(x_i, y_i),\ i \in [0..N]$.

(Figure: gradient boosting)

We want to construct an approximation function $f(x)$ of the form $f(x) = \sum_{i=0}^{M} g_i(x)$,

where each $g_i(x)$ is computed using the following algorithm. For each $g_i$ we select some number of features (let's say $k$) and prepare a vector of border values $b \in \mathbb{R}^k$. Then we compare each of the selected features with its border: each comparison yields one bit, the $k$ bits together form an index, and we return the value at that index from a table of $2^k$ precomputed leaf values. For example, with $k = 2$, if the first comparison succeeds and the second fails, the index is $01_2 = 1$.

Training this model is a very tricky process, but applying it is simple. So let's get started.

Naive implementation

We need a data structure for a single tree, and then we can write a simple apply function. For example:

```cpp
#include <vector>

struct Tree {
    std::vector<float> borders;  // border value for each tree level
    std::vector<int> indexes;    // feature index compared at each level
    std::vector<double> values;  // 2^depth leaf values

    double apply(const std::vector<float>& x) const {
        // Each comparison contributes one bit to the index of the leaf value.
        unsigned idx = 0;
        unsigned one = 1;
        for (size_t i = 0; i < borders.size(); ++i) {
            if (x[indexes[i]] > borders[i])
                idx |= one;
            one <<= 1;
        }
        return values[idx];
    }
};
```

The full model is a vector of such trees. This is the most obvious solution, but it has a lot of disadvantages, and the main one is the most problematic: memory access is not cache friendly with such a structure.
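
To make that concrete, here is a minimal sketch of applying the whole model with this naive layout (the `Model` wrapper and plain summation are my assumptions for illustration, not the actual catboost-cxx API):

```cpp
#include <vector>

// Hypothetical wrapper: the model prediction is the sum of all tree outputs.
struct Model {
    std::vector<Tree> trees;

    double apply(const std::vector<float>& x) const {
        double result = 0.0;
        for (const Tree& tree : trees)
            result += tree.apply(x);  // each tree contributes one leaf value
        return result;
    }
};
```

Every `tree.apply` call chases three separate vectors, so a model with hundreds of trees touches memory in a scattered pattern.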

Here I want to say a few words about the memory model of current CPUs. Typically a CPU has several cache layers: values already in the L1 cache can be used almost immediately, but to read data from RAM the CPU has to stall and wait until the data is loaded into the cache. This happens because memory runs at much lower effective speed than the CPU, so we cannot just load data from RAM without waiting. One consequence is that it makes no sense to load only a single word, so the CPU loads a whole block of bytes called a cache line. Therefore, if our data is laid out sequentially in RAM, we get better performance.

Memory optimization

So, to do better, we need to store the borders and indexes in the same array. That is easy to do, so let's do it:

```cpp
#include <vector>

struct Tree {
    // Border and feature index stored side by side, so one cache line
    // brings in everything needed for a comparison.
    struct SimpleCompare {
        float border;
        int index;

        bool operator()(const std::vector<float>& x) const {
            return x[index] > border;
        }
    };

    std::vector<SimpleCompare> cmps;
    std::vector<double> values;

    double apply(const std::vector<float>& x) const {
        unsigned idx = 0;
        unsigned one = 1;
        for (size_t i = 0; i < cmps.size(); ++i) {
            if (cmps[i](x))
                idx |= one;
            one <<= 1;
        }
        return values[idx];
    }
};
```

Even this version works much better, but that is not all. It is better still to write the SimpleCompare structures from all the trees into one shared array, accessed through a single vector, and likewise to store all the leaf values in one array. In that case each tree only needs to store its number of levels (and its offsets into the shared arrays), and we operate on memory strictly sequentially. This layout allows us to be faster than a CatBoost model compiled into C++ code, which I think is not bad at all!
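
A minimal sketch of that flattened layout, under my own naming (the real catboost-cxx structures may differ):

```cpp
#include <vector>

// Hypothetical flattened model: one contiguous array of comparisons and
// one contiguous array of leaf values shared by all trees.
struct FlatModel {
    struct SimpleCompare {
        float border;
        int index;
    };

    std::vector<SimpleCompare> cmps;  // comparisons of all trees, back to back
    std::vector<double> values;       // leaf tables of all trees, back to back
    std::vector<int> depths;          // number of levels per tree

    double apply(const std::vector<float>& x) const {
        double result = 0.0;
        size_t cmp_pos = 0;  // running offset into cmps
        size_t val_pos = 0;  // running offset into values
        for (int depth : depths) {
            unsigned idx = 0;
            for (int level = 0; level < depth; ++level, ++cmp_pos) {
                if (x[cmps[cmp_pos].index] > cmps[cmp_pos].border)
                    idx |= 1u << level;
            }
            result += values[val_pos + idx];
            val_pos += size_t(1) << depth;  // each tree has 2^depth leaves
        }
        return result;
    }
};
```

All reads now walk forward through two arrays, which is exactly the access pattern caches and prefetchers like.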

But here we come to the next problem: we reload the border and index values for every single sample we predict. That is not efficient, and we can do better.

From scalars to vectors

To be even faster we need to use SIMD (Single Instruction, Multiple Data) instructions. SIMD was introduced in the early 1970s for supercomputers and has since been implemented in lots of CPUs. For the Intel x86 series, we can say that all CPUs starting from the Pentium MMX support some SIMD instruction set.

What exactly is SIMD for? It is a very simple technology: these instructions load several values (typically 2, 4, 8, or 16) into a register and operate on all of them simultaneously. For example:

```cpp
float x[4], y[4], z[4];
for (int i = 0; i < 4; ++i)
    z[i] = x[i] + y[i];
```

This loop can be implemented with a single instruction that executes in very few cycles (depending on the CPU). So we can compare 4 values, add 4 values, and generally process values in groups of four.
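
For illustration, here is how the same addition looks with SSE intrinsics (a sketch of the general technique, not code from the library):

```cpp
#include <xmmintrin.h>  // SSE intrinsics

void add4(const float* x, const float* y, float* z) {
    __m128 vx = _mm_loadu_ps(x);           // load 4 floats from x
    __m128 vy = _mm_loadu_ps(y);           // load 4 floats from y
    _mm_storeu_ps(z, _mm_add_ps(vx, vy));  // add all 4 lanes at once
}
```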

You can look into the details in the catboost-cxx repository; here I only want to describe the approach in general.

We have two ways to parallelize computation over the decision trees. The first is to push 4 values at a time through the layers of one tree (vertical parallelization); the second is to load 4 trees at the same time and apply them in parallel. The second way is very simple: all we need is to replace the scalars with SSE registers. The only tricky part is this line:

```cpp
if (cmp) idx |= one;
```

should be replaced with

```cpp
idx |= (one & cmp);
```
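
This works because SSE comparisons produce a per-lane mask of all ones or all zeros rather than a boolean, so AND-ing the mask with `one` sets the bit exactly in the lanes where the comparison held. A sketch of one level of four trees processed together (my own helper, assuming each tree's selected feature value and border have already been gathered into registers):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Update the leaf indexes of four trees at once. `x` holds the feature
// value selected by each tree, `border` the matching border values, and
// `one` the current level bit (1, 2, 4, ...) broadcast to all lanes.
__m128i update_indexes(__m128i idx, __m128 x, __m128 border, __m128i one) {
    // _mm_cmpgt_ps yields 0xFFFFFFFF in lanes where x > border, else 0.
    __m128i cmp = _mm_castps_si128(_mm_cmpgt_ps(x, border));
    return _mm_or_si128(idx, _mm_and_si128(cmp, one));
}
```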

But when it comes to buckets (batches of examples), the Yandex implementation is a great deal faster, and to be comparable we need a little more parallelism. So I parallelized along another axis: a group of examples. We can load 4 values from 4 examples at once and compare them with a broadcast border. This leaves the library only about 3 times slower on batches while staying about 3 times faster on single values. The final numbers can be found on the catboost-cxx results page.
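
A sketch of that idea, with hypothetical names (the feature matrix layout is my assumption):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Process one tree level for four examples at once. `features[e]` points
// to the feature vector of example e; `index` and `border` describe the
// comparison at this level; `one` is the level bit broadcast to all lanes.
__m128i level_for_batch(__m128i idx, const float* const features[4],
                        int index, float border, __m128i one) {
    // Gather the same feature from four examples into one register.
    __m128 x = _mm_set_ps(features[3][index], features[2][index],
                          features[1][index], features[0][index]);
    __m128 b = _mm_set1_ps(border);  // broadcast the border to all lanes
    __m128i cmp = _mm_castps_si128(_mm_cmpgt_ps(x, b));
    return _mm_or_si128(idx, _mm_and_si128(cmp, one));
}
```

Now the border and index are loaded once per level and reused for all four examples instead of being reloaded per example.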

Some results (lower is better; the last column is catboost-cxx relative to the best competitor, in percent):

| Benchmark | CatBoost code | CatBoost lib | catboost-cxx nosse | catboost-cxx | catboost-cxx / best (%) |
| --- | --- | --- | --- | --- | --- |
| msrank | 0.409145 | 0.587546 | 0.438624 | 0.236725 | 57.8 |
| creditgermany | 0.000653028 | 0.0010128 | 0.000732899 | 0.000421047 | 64.5 |
| cordna | 0.355515 | 0.538532 | 0.360194 | 0.214148 | 60.2 |
| msrank bucket | 0.413497 | 0.050909 | 0.201874 | 0.12726 | 250 |
| creditgermany bucket | 0.000688076 | 0.000117064 | 0.0003829 | 0.000259876 | 222 |
| cordna bucket | 0.38238 | 0.0517659 | 0.198211 | 0.133639 | 258 |

Limitations and future

The library does not currently support multi-target models; I'm going to implement them in the near future. Categorical features are not supported either, and maybe I'll implement them too (but it is not the main priority).

The performance could also be better, and I'll keep trying to make it even faster.
