noureddin/performance-separate-interface-implementation.md

## performance-separate-interface-implementation.md

      
    Raw
  

              performance-separate-interface-implementation.md
            
          
    Performance of C (and C++) functions: separate implementation and interface, or implementation in header?

Summary: In C, implementing a function in a header file, rather than a separate .c file, provides about 200% performance gain. Maybe it's because the compiler is inlining them. But you may not gain that much in nontrivial projects.

In C++, you need also to apply link-time optimization, with the compiler switch -flto, to have that 200% performance gain.
Note 1: In many projects, especially large ones, the bottleneck is not those non-inlined functions, and separating the implementation from the interface makes compilation faster, thus easier to develop.
Note 2: Algorithmic optimization is most important (chose better algorithms). If you really need more performance, investigate other possibilities, like parallelizing your code, vectorize some parts of it, etc.
Overview

Let us write a small C class for matrices, with a matrix initializing, destroying, multiplying, and printing functions. And let us benchmark them to see how can we make it most performant. Obviously we need to choose a performant algorithm; e.g., using Strassen algorithm is far better than multiplying matrices the naïve way. But I will do it the naïve way just to focus on one aspect only, namely, the performance impact of separating the implementation from the interface to different translation units.
I will compile using GCC (gcc and g++) with the highest level of optimization (-O3). I am using GCC 5.4.0.
I do the benchmarking using the function-like macro time_test from this header file, called timetest.h:
#include <stdio.h>    // printf
#include <sys/time.h> // everything else

// https://stackoverflow.com/a/1861493
typedef unsigned long long timestamp_t;
static timestamp_t get_timestamp() {
    struct timeval now;  gettimeofday (&now, NULL);
    return now.tv_usec + (timestamp_t)now.tv_sec * 1000000;
}

#define time_test(opname, n, ...) ({                \
        long double time = 0;                       \
        for (unsigned i = n; i; --i) {              \
            timestamp_t t0 = get_timestamp();       \
            __VA_ARGS__                             \
            timestamp_t t1 = get_timestamp();       \
            time += (t1 - t0) / 1000000.0L;         \
        }                                           \
        time /= n;                                  \
        fprintf(stderr, "%s, repeated %u times, "   \
                        "takes %Lf on average\n",   \
                opname, n, time);                   \
    })
A C class with information hiding (encapsulation)

Let us see: we want to completely separate the interface from the implementation. We can do so by declaring an opaque data type and some functions in the .h file, then implement them all in the .c file.
matrix1.h:
#ifndef _MATRIX1_H_
#define _MATRIX1_H_ 1
#include <stddef.h>  // size_t

typedef struct mat mat;
mat *mat_new   (size_t nrow, size_t ncol, double fillvalue);
void mat_delete(mat *m);
mat *mat_mul   (mat const * const a, mat const * const b);
void mat_print (mat const * const m);

#endif  // ! _MATRIX1_H_
matrix1.c:
#include "matrix1.h"
#include <stdio.h>   // printf
#include <stdlib.h>  // malloc, free

struct mat {
    size_t nrow, ncol;
    double vals[];
};

#define M(m, i, j) m->vals[j + i * m->ncol]

mat *mat_new(size_t nrow, size_t ncol, double fillvalue)
{
    if (!nrow || !ncol) return NULL;
    mat *m = malloc(sizeof(mat) + sizeof(double) * nrow * ncol);
    if (!m) return NULL;
    m->nrow = nrow; m->ncol = ncol;
    for (size_t i = 0; i < nrow * ncol; ++i)
        m->vals[i] = fillvalue;
    return m;
}

void mat_delete(mat *m) { free(m); m = NULL; }

mat *mat_mul(mat const * const a, mat const * const b)
{
    if (!a || !b || a->ncol != b->nrow) return NULL;
    mat *r = mat_new(a->nrow, b->ncol, 0);
    if (!r) return NULL;
    for (size_t i = 0; i < r->nrow; ++i)
        for (size_t j = 0; j < r->ncol; ++j)
            for (size_t k = 0; k < a->ncol; ++k)
                M(r, i, j) += M(a, i, k) * M(b, k, j);
    return r;
}

void mat_print (mat const * const m)
{
    if (!m) return;
    for (size_t i = 0; i < m->nrow; ++i, puts(""))
        for (size_t j = 0; j < m->ncol; ++j)
            printf("%g\t", M(m,i,j));
}
main1.c:
#include "matrix1.h"
#include "timetest.h"

int main()
{
    time_test("separate interface and implementation", 5,
        mat *a = mat_new(10, 1000000, 2);
        mat *b = mat_new(1000000, 10, 3);
        mat *r = mat_mul(a, b);
        mat_print(r);  // prints 10x10 matrix of 6e6
        mat_delete(a); mat_delete(b); mat_delete(r);
    );
}
Now compile and run with the following command:
gcc -o 1 main1.c matrix1.c -O3 && ./1 > /dev/null
The result on my machine:
separate interface and implementation, repeated 5 times, takes 0.858798 on average

A C class without hiding: the implementation is in the header file

There is no matrix2.c this time; everything is in the header file.
Because in C, information hiding is done thru separating the implementation to a different translation unit, we no longer have information hiding.
We'll create matrix2.h and main2.c as the following.
matrix2.h:
#ifndef _MATRIX2_H_
#define _MATRIX2_H_ 1
#include <stddef.h>  // size_t
#include <stdio.h>   // printf
#include <stdlib.h>  // malloc, free

typedef struct mat mat;
struct mat {
    size_t nrow, ncol;
    double vals[];
};

mat *mat_new   (size_t nrow, size_t ncol, double fillvalue);
void mat_delete(mat *m);
mat *mat_mul   (mat const * const a, mat const * const b);
void mat_print (mat const * const m);

#define M(m, i, j) m->vals[j + i * m->ncol]

mat *mat_new(size_t nrow, size_t ncol, double fillvalue)
{
    if (!nrow || !ncol) return NULL;
    mat *m = malloc(sizeof(mat) + sizeof(double) * nrow * ncol);
    if (!m) return NULL;
    m->nrow = nrow; m->ncol = ncol;
    for (size_t i = 0; i < nrow * ncol; ++i)
        m->vals[i] = fillvalue;
    return m;
}

void mat_delete(mat *m) { free(m); m = NULL; }

mat *mat_mul(mat const * const a, mat const * const b)
{
    if (!a || !b || a->ncol != b->nrow) return NULL;
    mat *r = mat_new(a->nrow, b->ncol, 0);
    if (!r) return NULL;
    for (size_t i = 0; i < r->nrow; ++i)
        for (size_t j = 0; j < r->ncol; ++j)
            for (size_t k = 0; k < a->ncol; ++k)
                M(r, i, j) += M(a, i, k) * M(b, k, j);
    return r;
}

void mat_print (mat const * const m)
{
    if (!m) return;
    for (size_t i = 0; i < m->nrow; ++i, puts(""))
        for (size_t j = 0; j < m->ncol; ++j)
            printf("%g\t", M(m,i,j));
}

#endif  // ! _MATRIX2_H_
main2.c:
#include "matrix2.h"
#include "timetest.h"

int main()
{
    time_test("implementation is in the header file", 5,
        mat *a = mat_new(10, 1000000, 2);
        mat *b = mat_new(1000000, 10, 3);
        mat *r = mat_mul(a, b);
        mat_print(r);  // prints 10x10 matrix of 6e6
        mat_delete(a); mat_delete(b); mat_delete(r);
    );
}
Now compile and run with the following command:
gcc -o 2 main2.c -O3 && ./2 > /dev/null
The result on my machine:
implementation is in the header file, repeated 5 times, takes 0.450042 on average

About half the time. Wow!
Now, let us see what C++ will buy us.
A C++ class with the implementation in a separate .cpp file

matrix1.hpp:
#ifndef _MATRIX1_HPP_
#define _MATRIX1_HPP_ 1
#include <cstddef>  // size_t

class mat {
    public:
        mat(size_t nrow, size_t ncol, double fillvalue);
        mat();
        ~mat();
        mat& operator=(const mat& rval);
        void mul(const mat& a, const mat& b);
        void print() const;
        double& operator()(size_t i, size_t j);
        const double& operator()(size_t i, size_t j) const;
    private:
        size_t nrow, ncol;
        double* vals;
};

#endif  // ! _MATRIX1_HPP_
matrix1.cpp:
#include "matrix1.hpp"
#include <cstdio>  // std::printf

mat::mat(size_t nrow, size_t ncol, double fillvalue)
{
    this->nrow = nrow;
    this->ncol = ncol;
    this->vals = new double[nrow*ncol];
    for (size_t i = 0; i < nrow * ncol; ++i)
        this->vals[i] = fillvalue;
}

mat::mat()
{
    nrow = 0;
    ncol = 0;
    vals = 0;
}

mat::~mat()
{
    delete[] vals;
    vals = 0;
}

mat& mat::operator=(const mat& rval)
{
    delete[] vals;
    this->nrow = rval.nrow;
    this->ncol = rval.ncol;
    this->vals = new double[nrow*ncol];
    for (size_t i = 0; i < nrow * ncol; ++i)
        this->vals[i] = rval.vals[i];
    return *this;
}

void mat::mul(const mat& a, const mat& b)
{
    if (a.ncol != b.nrow) {
        *this = mat();
        return;
    }
    *this = mat(a.nrow, b.ncol, 0);
    for (size_t i = 0; i < this->nrow; ++i)
        for (size_t j = 0; j < this->ncol; ++j)
            for (size_t k = 0; k < a.ncol; ++k)
                (*this)(i, j) += a(i, k) * b(k, j);
    return;
}

void mat::print() const
{
    for (size_t i = 0; i < nrow; ++i, puts(""))
        for (size_t j = 0; j < ncol; ++j)
            std::printf("%g\t", (*this)(i,j));
}

double& mat::operator()(size_t i, size_t j)
{ return vals[j + i*ncol]; }

const double& mat::operator()(size_t i, size_t j) const
{ return vals[j + i*ncol]; }
main1.cpp:
#include "matrix1.hpp"
#include "timetest.h"

int main()
{
    time_test("separate interface and implementation", 5,
        mat a(10, 1000000, 2);
        mat b(1000000, 10, 3);
        mat r; r.mul(a, b);
        r.print();  // prints 10x10 matrix of 6e6
    );
}
Now compile and run with the following command:
g++ -o 1++ main1.cpp matrix1.cpp -O3 && ./1++ > /dev/null
The result on my machine:
separate interface and implementation, repeated 5 times, takes 0.869421 on average

A C++ class with the implementation in the header file

Again, there is no matrix2.cpp this time; everything is in the header file.
But unlike C, we still have information hiding, but we still have the problem that changing an implementation triggers recompiling of a lot of files.
We'll create matrix2.h and main2.c as the following.
matrix2.hpp:
#ifndef _MATRIX2_HPP_
#define _MATRIX2_HPP_ 1
#include <cstdio>   // std::printf
#include <cstddef>  // size_t

class mat {
    public:
        mat(size_t nrow, size_t ncol, double fillvalue);
        mat();
        ~mat();
        mat& operator=(const mat& rval);
        void mul(const mat& a, const mat& b);
        void print() const;
        double& operator()(size_t i, size_t j);
        const double& operator()(size_t i, size_t j) const;
    private:
        size_t nrow, ncol;
        double* vals;
};

mat::mat(size_t nrow, size_t ncol, double fillvalue)
{
    this->nrow = nrow;
    this->ncol = ncol;
    this->vals = new double[nrow*ncol];
    for (size_t i = 0; i < nrow * ncol; ++i)
        this->vals[i] = fillvalue;
}

mat::mat()
{
    nrow = 0;
    ncol = 0;
    vals = 0;
}

mat::~mat()
{
    delete[] vals;
    vals = 0;
}

mat& mat::operator=(const mat& rval)
{
    delete[] vals;
    this->nrow = rval.nrow;
    this->ncol = rval.ncol;
    this->vals = new double[nrow*ncol];
    for (size_t i = 0; i < nrow * ncol; ++i)
        this->vals[i] = rval.vals[i];
    return *this;
}

void mat::mul(const mat& a, const mat& b)
{
    if (a.ncol != b.nrow) {
        *this = mat();
        return;
    }
    *this = mat(a.nrow, b.ncol, 0);
    for (size_t i = 0; i < this->nrow; ++i)
        for (size_t j = 0; j < this->ncol; ++j)
            for (size_t k = 0; k < a.ncol; ++k)
                (*this)(i, j) += a(i, k) * b(k, j);
    return;
}

void mat::print() const
{
    for (size_t i = 0; i < nrow; ++i, puts(""))
        for (size_t j = 0; j < ncol; ++j)
            std::printf("%g\t", (*this)(i,j));
}

double& mat::operator()(size_t i, size_t j)
{ return vals[j + i*ncol]; }

const double& mat::operator()(size_t i, size_t j) const
{ return vals[j + i*ncol]; }

#endif  // ! _MATRIX2_HPP_
main2.cpp:
#include "matrix2.hpp"
#include "timetest.h"

int main()
{
    time_test("implementation is in the header file", 5,
        mat a(10, 1000000, 2);
        mat b(1000000, 10, 3);
        mat r; r.mul(a, b);
        r.print();  // prints 10x10 matrix of 6e6
    );
}
Now compile and run with the following command:
g++ -o 2++ main2.cpp -O3 && ./2++ > /dev/null
The result on my machine:
implementation is in the header file, repeated 5 times, takes 0.872273 on average

No noticeable difference! But an interesting fact is that increasing the number of tries (try 10 or 100) will affect only this result: it'll execute in half the time (as the C sister implementation). I think it's a kind of optimization.
Update:

Let's compile with like-time optimization; compile and run with the following command:
g++ -o 2++ main2.cpp -O3 -flto && ./2++ > /dev/null
The result on my machine:
implementation is in the header file, repeated 5 times, takes 0.472132 on average

Wow! Again we have the same performance as the C implementation, with the added benefit of pseudo-encapsulation of C++.
Note 1: Adding -flto doesn't have a noticeable effect on our second attempt in C; it may be already the best performance.
Note 2: The functions must be in the same header; they can't be in a separate translation units, unlike the first attempt (but see: Single Compilation Unit).
Conclusion

In C, if your application is performance critical, and you've already optimized your algorithms and chosen suitable data structures etc, you may sacrifice somethings for performance, by putting the implementation in the header, but only if that's the bottleneck.
If you want to do the same in C++, you can put them in the same header, and add the switch -flto to the compiler.
See also: Dynamically allocated arrays: 1D vs 2D.
See also: Single Compilation Unit.