Kiwi.com has open sourced a new CatBoost model-applying library written in C++, with reasonably good optimizations and no dependencies. This means you can apply CatBoost models in your projects without pulling in the huge official catboost library as a dependency. Moreover, it is easy to integrate this library into your build process.
In 2009 Yandex introduced the MatrixNet algorithm and started using it for ranking web pages. The algorithm itself is gradient boosting on decision trees, and the main advantage of this implementation is that it is a great deal easier to tune parameters for MatrixNet than for XGBoost or other competing implementations.
Still, until 2017 nobody but Yandex and CERN could use MatrixNet, because it was a proprietary product. But times changed in 2017 and Yandex released CatBoost. CatBoost is a very similar algorithm and implementation, but it supports categorical features and is an open source library, so you can use it in your own products. I was working at Yandex at that time and can say it was a great success for the company. CatBoost is supported by Apple CoreML, and lots of companies use the library for machine learning tasks.
So it is no surprise that when it came to solving prediction problems at Kiwi.com, where I've been working as a C++ developer, I decided to use a CatBoost model as the baseline. In the end it turned out to be the best ML algorithm in a long list of competitors. But there was a problem: it is not easy to integrate the original CatBoost library into your project if you are not using ya.make (and you aren't). I'd love to have an open source ya.make, it is a great build system with a nice command line interface, but we live in the real world. So I had two options:
- Take the Yandex balancer project, extract the CMake-related code, and then implement build scripts for CatBoost.
- Implement the CatBoost applying library myself.
Yes, I know that implementing everything myself is not the best solution, but it is not hard to do, so I decided to implement CatBoost applying myself.
Gradient boosting on decision trees is a very simple thing when it comes to applying the model. But let's start with the basics. Suppose we have a feature space $x \in \mathbb{R}^n$ and a training dataset $(x_i, y_i),\ i \in [0..N]$.
We want to construct an approximation function $f(x)$ of the form $f(x) = \sum_{i=0}^{M} g_i(x)$,
where each $g_i(x)$ is a decision tree. CatBoost builds oblivious trees: every level of a tree holds a single condition of the form x[index] > border, and the outcomes of these comparisons form a binary index into the array of leaf values.
Training this model is a tricky process, but applying it is simple. So let's get started.
We need a data structure for a single tree, and then we can write a simple apply function. For example:
#include <vector>

// One oblivious tree: level i compares feature x[indexes[i]] against
// borders[i], and the comparison results form a binary index into values.
struct Tree {
    std::vector<float> borders;
    std::vector<int> indexes;
    std::vector<double> values;

    double apply(const std::vector<float>& x) const {
        unsigned idx = 0;
        unsigned one = 1;
        for (size_t i = 0; i < borders.size(); ++i) {
            if (x[indexes[i]] > borders[i])
                idx |= one;
            one <<= 1;
        }
        return values[idx];
    }
};
The full model is just a vector of such trees. It is the most obvious solution, but it has a serious drawback: with this layout, memory access is not cache friendly.
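For completeness, here is a minimal sketch of this naive whole-model apply; the Model struct is my own illustration of the idea, not the catboost-cxx API:

#include <vector>

struct Model {
    std::vector<Tree> trees;  // the Tree struct defined above

    // f(x) is simply the sum of the outputs of all trees.
    double apply(const std::vector<float>& x) const {
        double result = 0.0;
        for (const Tree& tree : trees)
            result += tree.apply(x);
        return result;
    }
};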
Here I want to say a few words about the memory model of current CPUs. A CPU typically has several caching layers: values already in the L1 cache can be used almost immediately, but to read data from RAM the CPU has to stall and wait until the data has been loaded into the cache. This happens because memory runs at much lower frequencies than the CPU, so we can't just load data from RAM, we have to wait. One consequence is that it makes no sense to load only one word of data, so the CPU loads a whole cache line at once. So, if our data lies sequentially in RAM, we get better performance.
So, to do better we should store the borders and indexes in the same array. That's easy to do, so let's do it:
struct Tree {
    // Border and feature index for one tree level are stored together,
    // so they sit next to each other in memory.
    struct SimpleCompare {
        float border;
        int index;
        bool operator()(const std::vector<float>& x) const {
            return x[index] > border;
        }
    };
    std::vector<SimpleCompare> cmps;
    std::vector<double> values;

    double apply(const std::vector<float>& x) const {
        unsigned idx = 0;
        unsigned one = 1;
        for (size_t i = 0; i < cmps.size(); ++i) {
            if (cmps[i](x))
                idx |= one;
            one <<= 1;
        }
        return values[idx];
    }
};
Even this version works much better, but that is not all. It is better still to put the SimpleCompare structures of all the trees into one array and access them through a single vector, and likewise to put all the leaf values into one array. In that case each tree only needs to store its number of levels, and we traverse memory strictly sequentially. This layout allows us to be faster than a CatBoost model compiled into C++ code, which I think is not bad at all!
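To illustrate the idea, here is a minimal sketch of such a flattened layout; the FlatModel struct, its field names, and the per-tree depth bookkeeping are my own illustration, not the actual catboost-cxx data structures:

#include <cstddef>
#include <vector>

struct FlatModel {
    struct Compare {
        float border;
        int index;
    };
    std::vector<Compare> cmps;     // comparisons of all trees, tree after tree
    std::vector<double> values;    // leaf values of all trees, tree after tree
    std::vector<unsigned> depths;  // number of levels in each tree

    double apply(const std::vector<float>& x) const {
        double result = 0.0;
        size_t cmpOffset = 0;
        size_t valueOffset = 0;
        for (unsigned depth : depths) {
            unsigned idx = 0;
            for (unsigned level = 0; level < depth; ++level) {
                const Compare& c = cmps[cmpOffset + level];
                idx |= static_cast<unsigned>(x[c.index] > c.border) << level;
            }
            result += values[valueOffset + idx];
            cmpOffset += depth;
            valueOffset += static_cast<size_t>(1) << depth;  // 2^depth leaves per tree
        }
        return result;
    }
};

Both loops walk cmps and values strictly forward, so all memory accesses except the feature lookups are sequential.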
But here we come to the next problem: we reload the border and index values for every sample whose output we want to predict. That is not efficient, and we can do better.
To get even faster we need to use SIMD (Single Instruction, Multiple Data) instructions. SIMD was introduced in the early 1970s for supercomputers and has since been implemented in lots of CPUs. For the Intel x86 family, we can say that every CPU starting from the Pentium MMX supports some SIMD instruction set.
What exactly is SIMD for? It is a very simple technology: these instructions load several values (typically 2, 4, 8, or 16) into registers and operate on them simultaneously. For example:
float x[4], y[4], z[4];
for (int i = 0; i < 4; ++i)
    z[i] = x[i] + y[i];
This loop can be implemented with a single instruction that executes in a very short time (depending on the CPU). So we can compare 4 values at once, add 4 values at once, and generally process values in groups of 4.
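For instance, with SSE intrinsics the same 4-element addition could be written explicitly as below (add4 is just an illustrative helper name; a compiler will often auto-vectorize the scalar loop above into the same code):

#include <xmmintrin.h>  // SSE intrinsics

void add4(const float* x, const float* y, float* z) {
    __m128 vx = _mm_loadu_ps(x);     // load 4 floats from x
    __m128 vy = _mm_loadu_ps(y);     // load 4 floats from y
    __m128 vz = _mm_add_ps(vx, vy);  // one instruction adds all 4 pairs
    _mm_storeu_ps(z, vz);            // store the 4 results to z
}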
You can look into the details in the catboost-cxx repository, but here I want to describe the approach in general.
There are two ways to parallelize computation on the decision trees. The first is to process values four at a time across the levels of a tree (vertical parallelization); the second is to load 4 trees at the same time and apply them in parallel. The second way is very simple: all we need is to replace scalars with SSE registers. The only tricky part is:
if (cmp) idx |= one;
should be replaced with
idx |= (one & cmp);
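As a rough sketch of this idea (my own illustration, assuming 4 oblivious trees of the same depth; the real catboost-cxx code is organized differently), applying 4 trees to one sample with SSE could look like this. The vector comparison yields an all-ones mask per lane, so idx |= (one & cmp) works exactly like the scalar branchless form:

#include <emmintrin.h>  // SSE2 intrinsics
#include <array>
#include <vector>

struct QuadTree {
    // Per level: 4 feature indexes and 4 borders, one per tree.
    std::vector<std::array<int, 4>> indexes;
    std::vector<std::array<float, 4>> borders;
    // Per tree: its own table of leaf values.
    std::array<std::vector<double>, 4> values;

    void apply(const std::vector<float>& x, double out[4]) const {
        __m128i idx = _mm_setzero_si128();
        __m128i one = _mm_set1_epi32(1);
        for (size_t level = 0; level < borders.size(); ++level) {
            // Gather the 4 feature values and load the 4 borders of this level.
            __m128 f = _mm_set_ps(x[indexes[level][3]], x[indexes[level][2]],
                                  x[indexes[level][1]], x[indexes[level][0]]);
            __m128 b = _mm_loadu_ps(borders[level].data());
            // Each lane becomes an all-ones mask where x[index] > border.
            __m128i cmp = _mm_castps_si128(_mm_cmpgt_ps(f, b));
            // Branchless: idx |= (one & cmp) for all 4 trees at once.
            idx = _mm_or_si128(idx, _mm_and_si128(one, cmp));
            one = _mm_slli_epi32(one, 1);
        }
        alignas(16) int leaf[4];
        _mm_store_si128(reinterpret_cast<__m128i*>(leaf), idx);
        for (int t = 0; t < 4; ++t)
            out[t] = values[t][leaf[t]];  // scalar lookup of each tree's leaf
    }
};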
But when it comes to buckets (batches of samples), the Yandex implementation is a great deal faster, and to be comparable we needed a bit more parallelism. So I went parallel along another axis as well: a group of examples. We can load 4 values from 4 examples at once and compare them with a vector of borders. This makes our library only about 3 times slower on batches while staying about 3 times faster on single values. The final results can be found on the catboost-cxx results page:
| Dataset | CatBoost code | CatBoost lib | catboost-cxx nosse | catboost-cxx | catboost-cxx / best (%) |
|---|---|---|---|---|---|
| msrank | 0.409145 | 0.587546 | 0.438624 | 0.236725 | 57.8 |
| creditgermany | 0.000653028 | 0.0010128 | 0.000732899 | 0.000421047 | 64.5 |
| cordna | 0.355515 | 0.538532 | 0.360194 | 0.214148 | 60.2 |
| msrank bucket | 0.413497 | 0.050909 | 0.201874 | 0.12726 | 250 |
| creditgermany bucket | 0.000688076 | 0.000117064 | 0.0003829 | 0.000259876 | 222 |
| cordna bucket | 0.38238 | 0.0517659 | 0.198211 | 0.133639 | 258 |
Currently the library does not support multi-target models; I'm going to implement them in the near future. It also does not support categorical features, and maybe I'll implement them too (but that is not the main priority).
The performance could still be better, and I'll try to make it even faster.