Skip to content

Instantly share code, notes, and snippets.

@GuillaumeDua
Last active April 18, 2024 10:50
Show Gist options
  • Star 34 You must be signed in to star a gist
  • Fork 3 You must be signed in to fork a gist
  • Save GuillaumeDua/0239fda353264b67ddcb39b5d9a01105 to your computer and use it in GitHub Desktop.
Save GuillaumeDua/0239fda353264b67ddcb39b5d9a01105 to your computer and use it in GitHub Desktop.
C++ legacy inheritance vs CRTP + std::variant

C++ : Polymorphic inheritance without vtable

In this article, we will see how to use CRTP, std::variant and std::visit to increase our code performances.

Table of content

  1. Introduction
  2. What's wrong with vTables ?
  3. A way out ?
    1. CRTP
    2. std::variant + std::visit
  4. Benchmark
  5. Pros & cons

Motivation :

Inheritance and vTable were used for many years to create interface in C++ polymorphic class
What if ... there were another way to do this ?
easier, cleaner, faster and more reliable

Introduction

Inheritance is a mechanism that allows developers to create a hierarchy between classes, using "is-a" relationships. The class being inherited from is called the parent class (or base class), and the class inheriting is called the child class (or derived class).

This is useful for many purposes, such like :

  • Code reusability.
    A child class will inherit datas and functions, so there's no need to duplicate code. Thus, the code base is faster to create, and easier to read & maintain.

  • Make the code more "human friendly"
    An "Is-a" relationship is meaningful for anyone who is familiar with OOP. It allows to create a classe hierarchy that best reflects the way we mentally organize informations. A cat is a feline, a feline is an animal, and animal is a life-form, etc.

class life_form{};
class animal : public life_form{};
class feline : public animal{};
class cat    : public feline{};

void func()
{
    cat my_cat;
}
  • Static polymorphism

Static polymorphism is a polymorphism resolved at compile time.


In the previous code snippet, my_cat is variable of cat type.
my_cat is also a feline, an animal, and a life_form.

void feed(animal & any_animal)
{
    // feed `any_animal`
}

void func()
{
    cat my_cat;

    feed(my_cat);
}
  • Runtime polymorhpsim & vTable
    Ah, here's the big thing.

Runtime polymorphism is a polymorphism resolved at runtime. How ? Using vTables. Virtual tables (vTable) is a lookup table of functions pointers used to resolve function calls in a dynamic (late) binding way. When compiling a class, the compiler (at compile time thus) creates a static array that contains one entry for each virtual function that the class can call.

struct animal
{
    virtual void move_forward() = 0;
};
struct fish : public animal
{
    void move_forward() override
    {
        // swim using fins
    }
};
struct snake : public animal
{
    void move_forward() override
    {
        // use serpentine method
    }
};
struct feline : public animal
{
    void move_forward() override
    {
        // walk using paws
    }
};

void func()
{
    using pointer_type = std::unique_ptr<animal>;

    pointer_type my_animal(new snake());

    my_animal->move_forward(); // use serpentine method
}

What's wrong with vTables ?

As mentionned in the previous section, a virtual function call typically means a call via a function pointer stored in a vTable.

  • Pros : Allows runtime polymorphism
  • Cons : Bad performance impact

Why ?

When called, a virtual function require first to read the adress of the function. Thus, while the function instruction are loaded into the memory, the CPU will idle, waiting.

A way out ?

Great ! We now can summarize our list of requierement, and dive into code experiments !

Requierements :

  • Interface
  • Polymorphism
  • Minimal amount of easy-to-read code
  • Better performances than vtables

Let's use the following use-case :

  • An actor that can receive, queue, and reacts to messages, and update itself as well.

Actor scenario :

  1. Receives and queues 3 messages, one after another
  2. Updates itself
  3. Handles queued messages

Additionaly, we want actors to be polymorphic, in order to handle a collection or them.
And we want to be able to add or remove them dynamically from the collection.

For example purpose, we will use interger (int) as message type, just like :

using message_type = int;

Using inheritance, our abstract class may looks like this :

struct actor
{
    virtual ~actor() = default;

    virtual void update() = 0;

    void handle_queued_messages() { /* handle queued message one after another */ }
    void receive_message(message_type && msg)
    {   // queue `msg` into `pending_messages`
        pending_messages.emplace(std::forward<message_type>(msg));
    }

private:
    std::queue<message_type> pending_messages;
    virtual void handle_one_message(message_type && msg) = 0;
};
struct A : actor
{
    void update() override { /* impl ... */ }
    void handle_one_message(message_type && msg) override { /* impl ... */ }
};
struct B : actor
{
    void update() override { /* impl ... */ }
    void handle_one_message(message_type && msg) override { /* impl ... */ }
};
using container_type = std::vector<std::unique_ptr<using_inheritance::actor>>;

container_type actors;
// fill `actors` with A-s and B-s

for (auto & active_actor : actors)
{   // broadcast messages ...
    active_actor->receive_message(41);
    active_actor->receive_message(42);
    active_actor->receive_message(43);
}

for (auto & active_actor : actors)
{
    active_actor->update();
    active_actor->handle_all_messages();
}

In the code snippet above, we can see two issues :

  • Pointers indirection when dereferencing active_actor for each member function call.
  • VTable usage with pure virtual functions calls : update() and handle_one_message(msg).

CRTP to the rescue ?

Curiously recurring template pattern (CRTP), is a C++ idiom in which a class derive from a template class instanciation that use the first one as template argument.
It allows safe, static downcasting, from the base class into the derived one.

If you want more informations about CRTP, please consider reading this blog serie, from fluentcpp.com.

template <typename T>
class base{};

class derived : public base<derived>
{};

The main advantage of doing such thing is that from the base class perspective, the derived object is itself but downcasted.
Thus, by design, because the base is always inherited from by its template parameter, we can use static_cast instead of dynamic_cast.


In summary, using static_cast, the base class can access the derived class by downcasting itself into the derived class.

template <typename T>
struct base
{
    void do_stuff()
    {
        T & as_derived = static_cast<T&>(*this);
        // do stuffs with `as_derived`
    }
};

Also, we need to use two CRTP best-practices :

  • Base class has private constructor
  • Base class is friend with its derived class

This way, we ensure that the template base class will always be instanciated by the class it is derived from.

This is legal (compiling) code. Check it on godbolt.

template <typename T>
struct base
{};

struct impl_1 : public base<impl_1>
{};

struct impl_2 : public base<impl_1> // oops ! Shoud be base<impl_2>
{};

But if we use the tricks mentioned above, impl_2 does not compile anymore.

template <typename T>
class base
{
    base() = default;
    friend T;
};

Looks good so far. Let's see our current implementation progress (collapsible, click to expand) :

Interface

template <typename T>
struct actor
{
    void update()
    {
        as_underlying().update();
    }

    void handle_all_messages()
    {
        while (!pending_messages.empty())
        {
            auto message = std::move(pending_messages.front());
            pending_messages.pop();
            handle_one_message(std::move(message));
        }
    }

    void receive_message(message_type && msg)
    {
        pending_messages.emplace(std::forward<message_type>(msg));
    }

private:
    friend T;
    actor() = default;

    std::queue<message_type> pending_messages;

    inline T & as_underlying()
    {
        return static_cast<T&>(*this);
    }

    void handle_one_message(message_type && msg)
    {
        as_underlying().handle_one_message(std::forward<message_type>(msg));
    }
};

Derived classes

struct A : actor<A>
{
    using actor::actor;

    void update(){ /* impl ...*/ }

private:
    friend struct actor<A>;

    void handle_one_message(message_type && msg){ /* impl ...*/ }
};

struct B : actor<B>
{
    using actor::actor;

    void update(){ /* impl ...*/ }

private:
    friend struct actor<B>;

    void handle_one_message(message_type && msg){ /* impl ...*/ }
};

Let's have a look to our checklist of requierements :

  • Interface
  • Polymorphism
  • Minimal amount of easy-to-read code
  • Better performances than vtables

What about polymorphism ?

CRTP looks great, but in opposition to inheritance, we lost polymorphism.
Remember, we need to get multiple implementation of actor into a container.

using container_type = std::vector<std::unique_ptr<actor</* ? */>>>;

Also, what if we could avoid the usage of pointers as container's value_type ?

std::variant to the rescue

Well, we know our implementation types at compile time.
So, an all designated solution might be to use std::variant. Let's try this :

template <typename ... Ts>
using poly_T = std::variant<Ts...>;

using container_type = std::vector
<
    poly_T<A, B>
>;

So we can get the following usage : (Test it on Godbolot)

container_type actors
{
    A{},
    B{},
    A{}
    /* etc ... */
};

actors.emplace_back(A{});
actors.emplace_back(B{});

Before we can check our "polymorphism" from our list of requierement,
we need to find a synthax to call member functions.

Once again, the STL provides an all designated solution, using std::variant's std::visit.

template <class R, class Visitor, class... Variants>
constexpr R visit(Visitor&& my_visitor, Variants&&... my_var);

std::visit applies the visitor my_visitor to the std::variant my_var
The Visitor is any callable that covers every possible alternatives of Variants

Thus, in order to interact with our std::variant, we can define visitors in many ways.
In our specific case, we just need generic lambdas that were introduce with C++14.

Implementing our use case, we can write the following code :

container_type actors; /* contains many A-s and B-s */

for (auto & active_actor : actors)
{   // broadcast messages ...
    std::visit([](auto & act)
    {
        act.receive_message(41);
        act.receive_message(42);
        act.receive_message(43);
    }, active_actor);
}

for (auto & active_actor : actors)
{   // update, then handle pending messages
    std::visit([](auto & act)
    {
        act.update();
        act.handle_all_messages();
    }, active_actor);
}

Another great advantage of this design is that we can handle std::variant's template parameters differently (here, A and B),
without using dynamic_cast.

What about specific cases ?

Sometimes, you may want to have additional code for a specific type.

In legacy code (using inheritance), we would do the following ugly thing :

struct base
{
    virtual ~base() = default;
};
struct A : public base{};
struct B : public base{};

void handle_base_value(base * value)
{
    if (A * value_as_A_ptr = dynamic_cast<A*>(value))
        // deal with value_as_A_ptr
        ;
    else if (B * value_as_B_ptr = dynamic_cast<B*>(value))
        // deal with value_as_B_ptr
        ;
    else
        // other types
        ;
}
void func()
{
    base * my_value = new A();

    handle_base_value(my_value);
}

Using std::variant and std::visit we don't need dynamic_cast and pointers anymore, because types are known at compile time.

for (auto & active_actor : actors)
{
    std::visit([](auto & act)
    {
        using T = std::decay_t<decltype(arg)>;

        if constexpr (std::is_same_v<T, A>)
            // `act` is an `A`
            ;
        else if constexpr (std::is_same_v<T,B>)
            // `act` is a `B`
        else
        {   // other types
            static_assert(always_false<T>::value, "non-exhaustive visitor!");
            // or handle deal with act in another way
        }

    }, active_actor);
}

Alternatively, we still can define visitor types the old way, defining all operator() overloads so it is exhaustive.

using visitor_type = struct
{
    void operator()(using_CRTP_and_variants::A &){}
    void operator()(using_CRTP_and_variants::B &){}
};

for (auto & active_actor : actors)
{
    std::visit(visitor_type{}, active_actor);
}

In order to avoid if-constexpr and reduce the amount of code, another alternative is to use the convinient overloaded lambas trick :

template<class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
template<class... Ts> overloaded(Ts...) -> overloaded<Ts...>;
for (auto & active_actor : actors)
{
    std::visit(overloaded
    {
        [](A & arg){ /* `arg` is an A */ },
        [](B & arg){ /* `arg` is a  B */ },
        [](auto & ){ /* other types   */ }
    }, active_actor);
}

The shorter the better, right ?

Let's have a look to our checklist :

  • Interface
  • Polymorphism
  • Minimal amount of easy-to-read code
  • Better performances than inheritance

Performances

Let's try the snippet (see below) on quick-bench.com, using C++20 standard and O3 optimization level.

Compiler STL CTRP + variants vtable how much faster? link
Clang 7.0 Libc++ (LLVM) 345 502 1.5 times test it !
GCC 8.2 libstdC++ (GNU) 281 667 2.4 times faster test it !

From 1.5 to 2.4 times faster. Now we can check our "Better performances" checkbox.

  • Interface
  • Polymorphism
  • Minimal amount of easy-to-read code
  • Better performances than inheritance
See the benchmark complete snippet

// gcc   : CRTP_and_variant is 2.5 times faster (libstdc++, GNU)
// clang : CRTP_and_variant is 1.4 times faster (libc++, LLVM)

#include <queue>
#include <variant>
#include <iostream>
#include <memory>

using message_type = int;

namespace using_CRTP_and_variants
{
    template <typename T>
    struct actor
    {
        void update()
        {
            as_underlying().update();
        }

        void handle_all_messages()
        {	// internal states only
            while (!pending_messages.empty())
            {
                auto message = std::move(pending_messages.front());
                pending_messages.pop();
                handle_one_message(std::move(message));
            }
        }

        void receive_message(message_type && msg)
        {
            pending_messages.emplace(std::forward<message_type>(msg));
        }

    private:
        friend T;
        actor() = default;

        std::queue<message_type> pending_messages;

        inline T & as_underlying()
        {
            return static_cast<T&>(*this);
        }
        inline T const & as_underlying() const
        {
            return static_cast<T const &>(*this);
        }

        void handle_one_message(message_type && msg)
        {
            as_underlying().handle_one_message(std::forward<message_type>(msg));
        }
    };

    struct A : actor<A>
    {
        using actor::actor;

        void update()
        {
            //std::cout << "A : update()\n";
        }

    private:
        friend struct actor<A>;

        void handle_one_message(message_type && msg)
        {
            //std::cout << "A : handle_one_message : " << msg << '\n';
        }
    };
    struct B : actor<B>
    {
        using actor::actor;

        void update()
        {
            //std::cout << "B : update()\n";
        }

    private:
        friend struct actor<B>;

        void handle_one_message(message_type && msg)
        {
            //std::cout << "B : handle_one_message : " << msg << '\n';
        }
    };
}

namespace using_inheritance
{
    struct actor
    {
        virtual ~actor() = default;
        virtual void update() = 0;

        void handle_all_messages()
        {	// internal states only
            while (!pending_messages.empty())
            {
                auto message = std::move(pending_messages.front());
                pending_messages.pop();
                handle_one_message(std::move(message));
            }
        }

        void receive_message(message_type && msg)
        {
            pending_messages.emplace(std::forward<message_type>(msg));
        }

    private:

        std::queue<message_type> pending_messages;

        virtual void handle_one_message(message_type && msg) = 0;
    };

    struct A : actor
    {
        void update() override
        {
            //std::cout << "A : update()\n";
        }
        void handle_one_message(message_type && msg) override
        {
            //std::cout << "A : handle_one_message : " << msg << '\n';
        }
    };
    struct B : actor
    {
        void update() override
        {
            //std::cout << "B : update()\n";
        }
        void handle_one_message(message_type && msg) override
        {
            //std::cout << "B : handle_one_message : " << msg << '\n';
        }
    };
}


template <typename ... Ts>
using poly_T = std::variant<Ts...>;

static void test_CRTP_and_variants(benchmark::State& state) {

    using container_type = std::vector<poly_T
    <
        using_CRTP_and_variants::A,
        using_CRTP_and_variants::B>
    >;

    container_type actors
    {
        using_CRTP_and_variants::A{},
        using_CRTP_and_variants::B{},
        using_CRTP_and_variants::A{},
        using_CRTP_and_variants::B{},
        using_CRTP_and_variants::A{},
        using_CRTP_and_variants::B{},
        using_CRTP_and_variants::A{},
        using_CRTP_and_variants::B{},
        using_CRTP_and_variants::A{},
        using_CRTP_and_variants::B{}
    };

    for (auto _ : state) {
        for (auto & active_actor : actors)
        {	// broadcast messages ...
            std::visit([](auto & act)
            {
                act.receive_message(41);
                act.receive_message(42);
                act.receive_message(43);
            }, active_actor);
        }

        for (auto & active_actor : actors)
        {
            std::visit([](auto & act)
            {
                act.update();
                act.handle_all_messages();
            }, active_actor);
        }
        benchmark::DoNotOptimize(actors);
    }
}
// Register the function as a benchmark
BENCHMARK(test_CRTP_and_variants);

static void test_inheritance(benchmark::State& state) {

    using container_type = std::vector<std::unique_ptr<using_inheritance::actor>>;

    container_type actors;
    {
        actors.emplace_back(std::make_unique<using_inheritance::A>());
        actors.emplace_back(std::make_unique<using_inheritance::B>());
        actors.emplace_back(std::make_unique<using_inheritance::A>());
        actors.emplace_back(std::make_unique<using_inheritance::B>());
        actors.emplace_back(std::make_unique<using_inheritance::A>());
        actors.emplace_back(std::make_unique<using_inheritance::B>());
        actors.emplace_back(std::make_unique<using_inheritance::A>());
        actors.emplace_back(std::make_unique<using_inheritance::B>());
        actors.emplace_back(std::make_unique<using_inheritance::A>());
        actors.emplace_back(std::make_unique<using_inheritance::B>());
    }

    for (auto _ : state) {
        for (auto & active_actor : actors)
        {	// broadcast messages ...
            active_actor->receive_message(41);
            active_actor->receive_message(42);
            active_actor->receive_message(43);
        }

        for (auto & active_actor : actors)
        {
            active_actor->update();
            active_actor->handle_all_messages();
        }
    }
}
BENCHMARK(test_inheritance);

Pros & cons

Let's do a quick recap.
According to our design, we can define an interface, create many polymorphic implementations with a minimal amount of readable code.
As a result, we end with better performances than the old fashion way "inheritance with vtable".

However, is it worthy ?

Pros :

  • no more vtable
  • std::visit flexibility
  • Better performances

Cons :

  • sizeof(std::variant<...>)

We can already see an issue : std::variants<...>'s size on the stack. Indeed, the size of a std::variant is slightly greater than the type with the largest alignement it contains, as it must also store the information of which type it currently contains.

  • CRTP can hide/shadow functions

CRTP can hide/shadow member functions, and thus lead to unexpected behaviors.
In order to avoid this, as saw previously, we used explicit base contructor call and friendship.

However, according to Herb Sutter's talks "Thoughts on a more powerful and simpler C++", we may find a way to solve this in the future using metaclasses. Maybe.

We still can do some static checks using SFINAE with std::void_t and std::experimental::is_detected

Thank you for reading

Wrote by Guillaume Dua.
Special thanks to O. Libre for reviewing this paper.

Want to read more ?
New articles incoming soon on gist.github.

@GuillaumeDua
Copy link
Author

GuillaumeDua commented Nov 19, 2019

@xNWDD, Thanks for you feedback. Could you be more explicit ??
If you notice any error, send me any proof using godbolt and I'd be glad to fix it.

[edit] as well as quick-bench

@ArtBlnd
Copy link

ArtBlnd commented Jan 21, 2020

First, vtable isn't slow. that means you are NOT understanding about micro-architectures. modern micro-architectures ain't stupid as you thought.

Second, you didn't check the cost of allocation(which is new on std::unique_ptr)

Third, its related to seconds issue. You can statically make a room for std::vector for benchmarking or testing. and only initialize it (not allocating it) by calling new (Type)();

Forth, the cost of function it-self is too small. (empty function) so benchmark will not emit correct result. (too high ERROR)

Fifth, there is better solution than making a Poly-T like llvm's dyn_cast(https://llvm.org/doxygen/Casting_8h_source.html) which is faster than dynmaic_cast on C++. so that's CLEARLY not a good way to make containers.

@xNWDD
Copy link

xNWDD commented Jan 21, 2020

Actually agree with most of what @ArtBlnd except that from my point of view (opinion) vtables are slow (although I think the idea is that it is not slower than a memory fetch, to which I agree).
I wanted to get to write and do a deeper analysis of this but actually forgot during holidays, and I think If I go any deeper into this I won't get to finish it anytime soon, things I wanted to mention:

  • Whether std::visit and virtual are implemented through vtables is implementation detail. Usually:
    • std::visit instantiations create a different vtable for each std::visit
    • virtual creates an vtable for each class implementation
  • Regarding semantics and implementation details of stuff, I've seen a lot of confusion:
    • std::visit provides runtime-polymorphism using early-binding and dynamic dispatch
    • virtual provides runtime-polymorphism using late-binding and dynamic dispatch
    • Due to how architectures have worked for long, both have the same conceptual cost since:
      • std::visit call(K+load(m)) has the vtable address known at compile time and fetches an offset to the vtable from memory
      • Virtual call(load(m)+K) fetches the address to the vtable from memory, then adds the offset which is known at compile-time
    • Because all of this, strictly at implementation level std::visit differs only by providing an early-binding approach to dynamic dispatching because it must know all available overloads at compile time.
  • About std::visit flexibility i think it all boils down to:
    • std::visit has more flexibility because it allows for POD dynamic dispatch
    • virtual has more flexibility because it brings a late-binding approach to dynamic dispatch (so it works nice when expecting the users to add custom stuff to closed-source code later on)
  • Better performance: It's not really thanks to std::visit it's mostly because
    • Handle_all_messages function actually performing a vcall or not (because of how you implemented it, the function will probably be inlined there since what function will be compiled is known at compile time due to crtp)
    • vcall batching (the code using std::visit combines 2 vcalls using std::visit that would otherwise require a caché hit)
    • The extra dereferences/misses due to how each test is allocated (check _inplace)
    • I have provided some alternatives showing down my general point http://quick-bench.com/cFuGS0L4d0MqwFki8Fe_zirGPGA

My general point is that you are actually comparing two vtable usages and the performance benefits you get are 1) because using CRTP allows for further optimizations (since the compiler knows more at compile time) and 2) because you are batching virtual calls at std::visit point (at some point you call multiple virtual functions when you could call a virtual function specific to that class that calls non-virtual functions).

@ArtBlnd
Copy link

ArtBlnd commented Jan 22, 2020

@xNWDD that's what exact what I want to say in detail. (except about allocation one). vtables are slow (yes its slow compare to static IAT jumps or inlined functions) but not that slow as he is saying.

@xNWDD
Copy link

xNWDD commented Jan 23, 2020

@ArtBlnd Didn't mention allocation cost itself because the library used to benchmark either runs for too many iterations or setup time isn't measured (this can be tested by adding a sleep of 2~3 seconds to the setup, just before/after allocations happen).
So in the end, neither allocation nor the adding additional indirection cost are going to reflect significantly in benchmark numbers at quick-bench where the object memory is likely to live in L1. This is likely very different in many real world scenarios where the allocation cost and extra indirection will probably have a waaay higher impact (depending on actual scenario details and how often the data is hot).

@GuillaumeDua
Copy link
Author

@ArtBlnd Thank you for your feed back, I appreciate.
It's always good to see peoples that are willing to shares ideas.

I'll answer/react points after points hereunder.

Vtables :
It's a common fact that vtables are slow, because of it adds pointers indirections (when fetching the address of the function to call) that cannot be optimized so far by compilers.
As a compiler engineer yourself, you may have fresh news about it that i'll be eager to read.

Cost of allocation :
As you may read in the quick-bench, initialization is outside the measured scope for (auto _ : state) {}.
Second, you didn't check the cost of allocation(which is new on std::unique_ptr)

Placement new :
Not sure I understood your point here about placement new.

Function cost :
Indeed, the goal of this benchmark is to measure an architecture pattern barebone.
So only the call to a function matter, not the cost of the function itself. As long as the functional behavior is the same.
Can you tell me more about :

(too high ERROR)
?

LLVM's dyn_cast :
llvm::dyn_cast is pretty interesting. Unfortunatly, it's not part of the standard, and I did not read any proposal related to so far.

Containers design :
Well, as you may read in the related paper, the goal here was to test different implementations for an architecture barebone.
So far, the result is used on few corporate, closed-source projects that actually match the expecting performances, architecture flexibility, and resilience.

@GuillaumeDua
Copy link
Author

@xNWDD Thanks for you reply, do not worry about the delay, better later than never.

Well, my point is basically to avoid using, as much as I can, dynamic memory, dynamic types and pointers,
so the compiler is more likely to optimize it.
However, It should not be done by adding over-complicated boilerplates.

**VirtualVariant : **
Not sure I got your point here. std::variant is defined as a type-safe union.
If you remove everything that makes the implementation safe and convenient, what I read is that it only remains a waste of memory.
Your only guarantee here is a sufficient storage. As you may see, it's not the same interface than std::variant.

@ArtBlnd
Copy link

ArtBlnd commented Jan 25, 2020

Vtables :
It's a common fact that vtables are slow, because of it adds pointers indirections (when fetching the address of the function to call) that cannot be optimized so far by compilers.
As a compiler engineer yourself, you may have fresh news about it that i'll be eager to read.

No, vtables are not slow as you thought since you are using modern micro-architectures.(OoO architectures) those these indirection load will be executed before executing call instruction.
and vtables can be optimized with devirtualization. see LLVM's devirtualization.
http://blog.llvm.org/2017/03/devirtualization-in-llvm-and-clang.html
or you can improve performance by using __declspec(novtable) on msvc. which statically link function pointer to object itself. but larger object size.

Cost of allocation :
As you may read in the quick-bench, initialization is outside the measured scope for (auto _ : state) {}.
Second, you didn't check the cost of allocation(which is new on std::unique_ptr)
Placement new :
Not sure I understood your point here about placement new.

Yes, I misunderstood the quick-bench's infrastructure. but still there is a cost using heap memory on it, which is cache-misses (TLB, L1). if you allocate object on difference location of memory. it will cause huge performance decrement. so allocating on the same place is fair. std::variant has small storage(or static? most of standard container has SSO storage) so it will not stored another heap storage, means object will stored in vector's memory it-self.

Function cost :
Indeed, the goal of this benchmark is to measure an architecture pattern barebone.
So only the call to a function matter, not the cost of the function itself. As long as the functional behavior is the same.

Those are NOT same on OoO architectures. the cost of indirection will be much reduces in the real-world story. its related to vtable issues. the cost of vtables in this benchmark is much increased than real-world. its some kind of comparing ARM64 and x64 with increments and its same behavior. saying ARM64 is faster.

do you think

jmp address 

and

push address
ret

is the same?

second one will rune all BTB cache and its branch predictions, call trace. I am saying its NOT same. but it looks same behavior. it will cause huge slow down after executing seconds one. but not much on itself. OoO is not that easy.

LLVM's dyn_cast :
llvm::dyn_cast is pretty interesting. Unfortunatly, it's not part of the standard, and I did not read any proposal related to so far.

It still generates vtables or RTTI like something on your case. but there is already case using custom RTTI on LLVM infrastructure. that is dyn_cast / isa model. it is lot more faster then traditional dynamic_cast. also provides flexibility. unlike this approach.

your approach is also not a standard yet. it doesn't matter its standard. the point is how useful is this on real-world.

@ArtBlnd
Copy link

ArtBlnd commented Jan 25, 2020

I am not saying your approach is all wrong. the usage and its beneficial point, benchmarks are not correct. this approach is very beneficial if you can assume its type, and execute it as its original type which has static calls.(still beneficial switching around types) there is beneficial facts that compilers can optimize(like inlining, unrolling. overall easy to calculate correct execution costs while static analysis.) when it has static calls.

@xNWDD
Copy link

xNWDD commented Jan 25, 2020

@GuillaumeDua yeah, I definitely understand the sentiment of wanting to avoid dynamic/runtime things and the CRTP approach is certainly good but the wording and benchmarks are wrong and misleading, which is important for me because students (such as my students) will and have been misled by this:

  • Things like "variant+crtp+std::visit vs vtable" and "no more vtable" are worded wrong, since both std::visit and virtual are "language concepts" and an vtable is an implementation detail. Also, both virtual and std::visit are in concept "equally dynamic operations" that do the same "dynamic-dispatching" work in run-time, the only difference being in binding-time (one binds at compile-time and the other binds at dynamic-dispatch but both binding strategies have the same exact cost as discussed earlier). Furthermore, in current tool-chains both virtual and std::visit use vtables (and I expect this will remain as-is for a long time since there probably too much work to be done with the graphics library, concepts, coroutines and stuff).
  • Expressions like "better performance than inheritance" are also misleading since this hints "look, don't use inheritance because it will be slower" when this is actually not the case at all. As of now you get better performance when you want to dynamic dispatch by using inheritance and a sane memory layout than you get by using variant with visit. This is understandable because it is a newer feature of the language most people don't even know about yet. Is it possible that in the future compiler engineers optimize std::variant/std::visit to the point where it is faster in most scenarios than virtual? Maybe, but I don't rely on compilers becoming "sufficiently smart" and I don't believe it is viable to get an edge because of later binding time with current architectures.
  • The general wording and phrasing sounds like: "vtables are slow and bad, so let's avoid them" which I like, but then you introduce "variant/visit" as a way to provide "polymorphism" without even mentioning the fact that this will still do dynamic dispatching using vtables at runtime visiting a relative offset in the very exact way this is done with virtual. This is very misleading.
  • Benchmarks are shown but don't provide the information needed for accurate conclusions and the bold statements, this is normal since benchmarking is very hard but the conclusions are taken after comparing apples to an orange juice with anise, without stopping to dissect how does each detail impact on performance and whether it is really "not using virtual at all" that gives the code a performance edge. Just for reference:
    • Apples: virtual as dynamic dispatch mechanism
    • Orange: std::visit as a different dynamic dispatch mechanism
    • Juice: different memory layout with different levels of indirection and non-prefetechable order with additional padding that will waste even more memory "std::variant" (vs.:random heap location + unique_ptr)
    • With anise: "with CRTP" (vs.:without crtp and making every single function virtual)
  • Because this is a very important and sensible topic in game development and has been a hot for a few years since Jonathan Blow started his crusade, I believe it is very important that students don't get this wrong. The real takeaway from your benchmarks I think it is: You pay for what you use (as many dynamic dispatchers as you use), regardless of the language construct you use (but virtual has been optimized for a longer time) and are always rewarded run-time performance by helping the compiler optimize (by using CRTP to, for example, avoid calling the same virtual neighbors), the quick-bench I provided expands on this by providing visual info showing that:
    • A huge amount (100%~300%) of performance lost is lost because of CRTP (which can be used in virtual, with the same complexity than variant/visit). CRTP helps a lot the compiler to optimize since most of the time (when using calls within the same class) it will know what path to take, can inline and won't need to devirtualize (which doesn't work well enough imho yet). However, using CRTP will also inflate compile time, which may be important in some scenarios where rapid iteration is required.
    • A significant but small amount of performance (15~25%) is lost because of storage, indirections and memory layout variant vs unique_ptr (which can be implemented for virtual, as long as you're ok with sacrificing multiple-inheritance and keeping the exact same memory safety than inheritance itself).
    • An also significant but small amount of performance (60%) is earned by using virtual instead of std::visit/std::variant because of implementation detail in current toolchains.

You can check that the difference in performance is mostly because of crtp (which can be use with inheritance too) by checking out the using_crtp_inheritance example which does not use VirtualVariant or alloca or anything else and has pretty much the same code complexity (if not less because of pre-knowledge most programmers are expected to have) and almost the same performance (~20% difference).

Furthermore in test_CRTP_inheritance_stack you can also see that around HALF of that 20% difference is because of the overhead unique_ptr brings over just dereferencing a plain pointer in a predictable, sequential memory address.

Then again, in test_CRTP_inheritance_inplace (the only one using VirtualVariant) I try to make the comparison "as fair as it gets" by allocating memory in "the same way, layout and level of indirection" that the original std::variant+std::visit+crtp example does.
This works by using in-place memory just like union/variant do to contain object data and overloading the -> operator to get a pointer to the data, the way I implemented this is unsafe and breaks alignment, but didn't want to spend much more time.
I understand this may sound crazy in the most formal programming circles, but in the end is what compilers do and it may surprise you how common this technique is in performance optimization for game development.
Also, keep in mind that using "VirtualVariant" was not a point itself, it is only used in the last example to showcase that it is possible to get dynamic dispatching playing nice with virtual, crtp and in-place allocation. And I am sure it is possible to implement this in a safer way (at least as safe as std::variant and std::visit are).

@xNWDD
Copy link

xNWDD commented Jan 25, 2020

Also about what @ArtBlnd says regarding "what's standard" and what isn't, I hate to say this because I have loved CRTP for years (I think is a fantastic way of hinting the compiler stuff, even more so when combined with virtual to improve compile-time and guaranteed "1-virtual call at most per call")... But this has been possible and useful for a very, very long time and most of the people I respect (peers, devs, students...) regard the technique as a "that stupid, annoying, weird, pointless, rarely-seen thing, I don't understand why anyone would use this useless messy piece of shit".

@ArtBlnd
Copy link

ArtBlnd commented Jan 25, 2020

@xNWDD I agree with you. that is the important point that CRTP is compiler friendly. vtables are not. it really does not matter using std::visit, variant or not.

@GuillaumeDua
Copy link
Author

@ArtBlnd I agree that our point of views differs,
because we both have different requirement for our code.
As mentioned before, my point was to test different shapes of codes for a similar functional need,
measuring performances for both.
After-what, picking one for my most recurring needs, from a performance, as well as ease-to-use perspective.

So far, the results I got on several plateforms/OS/compilers I daily work with are fine. It's about the same performance trends as shown on the snippet exposed on quick-bench.

About :

it doesn't matter its standard. the point is how useful is this on real-world.
Well, it does. In many companies, you cannot simply use your favorite tech environment, such as libraries, OS, compilers, archs, etc.
What's important here is to understand that the truth is not absolute, as you need to respect your companies/clients requirements.
Also, even standard is not safe here. Take embed development for instance : you may not be allowed to use exception, or bigger part of the STL (as well as C++ features) according to guidelines.

this approach is very beneficial if you can assume its type, and execute it as its original type which has static calls
Indeed ! That's the whole point. Create a statical, compiler-friendly design that is still convinient to use.
Take FSM implemented using std::visit and std::variant` for example. It's really convinient to use

@xNWDD Thanks for your long reply, I'll check it ASAP.
As we both teach C++, what I can suggest is that we collaborate about this paper.
May I suggest that you create a review and/or pull request ?

Same for you @ArtBlnd, what about "channeling" our discussion into a great Gist ? Feel free to help ;-)

@Hochheilige
Copy link

Hello @GuillaumeDua I was looking for information about CRTP + std::variant and luckily found this page. I read it and have a question: Why should we write friend T in Base class and friend struct Base<A> or friend struct Base<B> in heritors?

Also thank you for article, I found it really useful! :)

@GuillaumeDua
Copy link
Author

GuillaumeDua commented Jan 9, 2021

@Hochheilige : Thanks for reading.

Short answer

This is not mandatory.
It depends on which data you are willing to share between base and derived class, and which way.

Long answer

Let's take this example :

template <typename T>
struct crtp{};

The purpose of CRTP is to create a context where you ensure that :

  • The base class knows which is the derived one.

So in crtp, such instruction is legale :

static_cast<T&>(*this)

Why this is convinient is the lack of tight-coupling though, thus the ability to extends class using as CRTP as generic class features.

However, the CRTP, as a base class, cannot access private member of the derived one.

This is the same as :

class base { int i; };
class derived : public base {
    void usage() {
        const auto value = derived{}.i; // error
    }  
};

Which results in clang in the following error message:

error: 'i' is a private member of 'base'

But if you use the friend keywork :

class base
{
    friend class derived;
    int i;
};
class derived : public base
{
    void usage()
    {
        const auto value = derived{}.i; // OK
    }  
};

So you might be willing to use friendship the other way, in order to use CRTP as a class feature.
See live example on godbolt here


NB What's convinient is that CRTP might be extended with extra logic using if constexpr instructions or SFINAE using the detection idiom in general maners.
For instance :

template <typename T>
struct print_feature
{
   void print(std::ostream & os)
   {
       if constexpr (detect::has_shift_op_to_ostream_v<T>) // or maybe some `std::detect` impl, depending on standard release
          os << static_cast<T&>(*this);
       else
          os << "not implemented yet"; // or another default behavior : static_assert(false, ...), etc.
   }
};

@Hochheilige
Copy link

@GuillaumeDua Thank you! I was confused because in my case I have no private fields in base class. But in your case can we use protected: int i ? I ask this just to make sure :)

And I have one more question about static_cast<T&>(*this) is that good to cast to T* or the better way is to cast to T& or it is all depends only on my vision on concrete task?

@GuillaumeDua
Copy link
Author

GuillaumeDua commented Jan 10, 2021

@Hochheilige Glad to help. I also teach C++ as a school mentor and professional trainer.

I see very little real cases where you'd like to use protected encapsulation.
As you can read in the CppCoreGuideline, rule C.133 :

C.133: Avoid protected data
Reason
protected data is a source of complexity and errors. protected data complicates the statement of invariants. protected data inherently violates the guidance against putting data in base classes, which usually leads to having to deal with virtual inheritance as well.

About static_cast<T&>(*this) , you should - almost - always avoid manipulating raw pointers.
Programming is all about writing a "story", so you'd better keep it as clear as possible, thus avoid multiples & costly indirections.

In general, i'd say Teach Yourself C++ in ∞ Days, using Godbolt and other tools

@GuillaumeDua
Copy link
Author

@Hochheilige

Reading code you produced, you also can use concepts for static polymorphism, if your compiler supports C++20.
https://en.cppreference.com/w/cpp/language/constraints

@Hochheilige
Copy link

@GuillaumeDua Thank you for your explanation and for all links it is really useful and I feel like I've taken new look on C++ after each your comment!

I'm not sure that I'm ready to concepts and C++20 because I even don't know most of C++17 features, but found your advice about concepts interesting and going to try it.

Thank you again!

@GuillaumeDua
Copy link
Author

@Hochheilige Concepts are way easier that previous SFINAE + detection idiom boilerplates.
I'll write a paper about this very topic when I'll have enough time :)

Keep in mind that in a general manner, C++ tends to become easier, so I'd advise you to use latest standards as often as possible.
Also, I keep seeing students and professionals in my training that are currently learnin stuffs that already became deprecated - if not removed - in newest standards ! Like std::auto_ptr for instance.

@Hochheilige
Copy link

@GuillaumeDua Thanks again, really waiting for your next article :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment