Skip to content

Instantly share code, notes, and snippets.

@andreasWallner
Last active May 14, 2023 18:19
Show Gist options
  • Save andreasWallner/9c86a115a5b9fde91d6a to your computer and use it in GitHub Desktop.
Save andreasWallner/9c86a115a5b9fde91d6a to your computer and use it in GitHub Desktop.
A short benchmark of dynamic_cast

Topic

After a discussion about the performance implications of dynamic_cast, I had some time on my hands and wanted to check it myself (at least to some degree). That prompted the measurements below, where dynamic_cast is compared against an implementation that uses class member that cary an unique identifier to check. The functionality of the implementation is not the same as with dynamic_cast, but it satisfies the application that was discussed:

class Base {
    public:
        virtual ~Base() {}
        volatile int m_value;
        const types m_type;
    protected:
        Base(types type) : m_type(type), m_value(0) {}
};

class Derived : public Base {
    public:
        Derived() : Base(OurType) {}
        void increment() { m_value++; }
};

class OtherDerived : public Base {
    public:
        OtherDerived() : Base(OtherType) {}
        void doSomeThing() { m_value--; }
};

The testcase was to pass a Base* to a function, decide in the function if the object is an instance of Derived or OtherDerived and then call increment() or exit straight away:

void doDynamic(Base* base) {
    Derived* derived = dynamic_cast<Derived*>(base);
    if(derived)
        derived->increment();
}

void doStatic(Base* base) {
    if(base->m_type == OurType)
        static_cast<Derived*>(base)->increment();
}

The design with type IDs itself was chosen because it seemed to me much more optimized, would be sufficient for the discussed application and is simple to implement and test. As it stands I would not be too proud if I would write an application this way (but I still might...).

The complete benchmark code can be found at the bottom and is separated into two source files, the first with the benchmark code itself, the second with some helper functions for time measurement, output etc. Since I was lazy and did not want to have to write more than necessary, I included the one source file in the other (but somehow I have to think about http://xkcd.com/292/ all the time now...)

Compilation & Testing

All code was compiled with

g++ bench.cpp -o bench -O3

gcc was

gcc (Gentoo 4.7.3-r1 p1.4, pie-0.5.5) 4.7.3

host system for the tests was a

Linux xxx 3.12.13-gentoo #5 SMP Mon May 12 23:50:28 CEST 2014 i686 Genuine Intel(R) CPU T2400 @ 1.83GHz GenuineIntel GNU/Linux

Then the built program was simply executed. As can be seen in the code, no corrections where made for spurious outliers, the test was simply repeated until there were no (or at least very few) outliers.

Disassembly was generated using radare2:

> radare2 bench
e asm.syntax=pseudo
aa
pdf@sym._Z9doDynamicP4Base
pdf@sym._Z8doStaticP4Base

with additional manual comments after ';;'.

First trys

The first try was to run the inner loop within the tested functions:

void doDynamic(Base* base) {
    for(int cnt = 0; cnt < INNER_LOOP; cnt++) {
        Derived* derived = dynamic_cast<Derived*>(base);
        if(derived)
            derived->increment();
    }
}

void doStatic(Base* base) {
    for( int cnt = 0; cnt < INNER_LOOP; cnt++) {
        if(base->m_type == OurType)
            static_cast<Derived*>(base)->increment();
    }
}

In the dynamic_cast case this works out nicely. In the case with typeid it is not so easy, as the compiler is smart enough to pull the type check out of the loop...

The next test looked more promising, but the enum used led to problems:

    enum types {
        OurType,
        OtherType
    };

In this case 'OurType' will be assigned the integer 0, which makes the test in doStatic simpler than it would be for the general case (a TEST instruction can be used instead of a CMP with a constant).

Real Comparison

The core code that is being compared looks like this:

void __attribute__((noinline)) doDynamic(Base* base) {
    Derived* derived = dynamic_cast<Derived*>(base);
    if(derived)
        derived->increment();
}

void __attribute__((noinline)) doStatic(Base* base) {
    if(base->m_type == OurType)
        static_cast<Derived*>(base)->increment();
}

void __attribute__((noinline)) doDerived(Derived* derived) {
    derived->increment();
}

For the dynamic case we do a dynamic cast, check if the result is 0, and if not we invoke the method of the casted object. In the static case we make a lookup into the objects type and use that to decide if we have the correct object or not.

The noinline attribute is neccesary because gcc can not be trusted not to inline the function call (and then do the same optimization as above). I guess thats a good thing though...

With this we can do a simple measurement, where the output look something like (minus the raw values, for those see the end of the file):

Measurement resolution: 1ns

Results for dynamic:

min:     39980011ns
max:     45938170ns
avg:     4.02109e+07ns
stddev:  7.77441e+06ns
range:   5958159ns
range %: 0.148173%

Results for static:

min:     4375487ns
max:     4460343ns
avg:     4.38304e+06ns
stddev:  147143ns
range:   84856ns
range %: 0.0193601%

Results for derived:

min:     3281912ns
max:     3338762ns
avg:     3.28731e+06ns
stddev:  117995ns
range:   56850ns
range %: 0.0172938%


Adjusted for loop, operation & function call overhead:
    dynamic avg: 3.69236e+07ns
    static avg : 1.09573e+06ns
    derived avg: 0ns

Taking the average, there is a factor of ~9.1 between the two codes. If we look at just the times that decision & casting take, we look at a factor of >30.

The internals

Let us have a look at the assembly code:

/ (fcn) sym._Z9doDynamicP4Base 60
|          0x080491b0    83ec1c       sub esp, 0x1c
|          0x080491b3    8b442420     mov eax, [esp+0x20]
|          0x080491b7    85c0         test eax, eax
|      ,=< 0x080491b9    742d         je 0x80491e8
|      |   0x080491bb    c744240c000. mov dword [esp+0xc], 0x0
|      |   0x080491c3    c7442408689. mov dword [esp+0x8], sym._ZTI7Derived ;  0x08049d68
|      |   0x080491cb    c7442404609. mov dword [esp+0x4], sym._ZTI4Base ;  0x08049d60
|      |   0x080491d3    890424       mov [esp], eax
|      |   ; CODE (CALL) XREF from 0x08048ba0 (fcn.08048b96)
|      |   0x080491d6    e8c5f9ffff   call sym.imp.__dynamic_cast
|      |      sym.imp.__dynamic_cast()
|      |   0x080491db    85c0         test eax, eax
|     ,==< 0x080491dd    7409         je 0x80491e8
|     ||   0x080491df    8b5004       mov edx, [eax+0x4]
|     ||   0x080491e2    83c201       add edx, 0x1
|     ||   0x080491e5    895004       mov [eax+0x4], edx
|     ``-> 0x080491e8    83c41c       add esp, 0x1c
\          0x080491eb    c3           ret
/ (fcn) sym._Z8doStaticP4Base 26
|          0x080491f0    8b442404     mov eax, [esp+0x4]
|          0x080491f4    83780801     cmp dword [eax+0x8], 0x1
|      ,=< 0x080491f8    7406         je 0x8049200
|      |   0x080491fa    f3c3         repe ret
|      |   0x080491fc    8d742600     lea esi, [esi]
|      `-> 0x08049200    8b5004       mov edx, [eax+0x4]
|          0x08049203    83c201       add edx, 0x1
|          0x08049206    895004       mov [eax+0x4], edx
\          0x08049209    c3           ret

We see that the case with type id is a simple test, jump and then the inlined increment() call. In the case with the dynamic cast we have to set up the registers with the values to call the implementation function of dynamic_cast, and the call the function itself. The call can not be inlined, since dynamic_cast is in a shared library and will be linked at runtime.

Even if dynamic_cast could be inlined, it would still be much slower, reason being that we know much more about our class structure than dynamic_cast could ever assume. It basically has to deal with (from C++ Standard Draft N2857, Chapter 5.2.7 [expr.dynamic.cast]):

  • multiple inheritance
  • virtual inheritance
  • errors like multiple non-virtual appearance of the same base class (therefore it has to walk the complete object ever dynamic_cast)
  • access control

For simple cases gcc has a few dynamic_cast implementations that do not cover the full spectrum (I fear one of those is used here...).

Comparing the non-sucessful case

Changes to test this case:

...
void __attribute__((noinline)) doDerived(Derived* derived) {
    //derived->increment();
    asm volatile("");
}

int main(int argc, char* argv[]) {
    //Base* derived = new Derived();
    Base* derived = new OtherDerived();
...

An interesting thing is comparing the case where we call the function with a class that can not be casted to the expected class. Lets have a look at the numbers here:

Results for dynamic:

min:     61891106ns
max:     69293025ns
avg:     6.19973e+07ns
stddev:  8.63447e+06ns
range:   7401919ns
range %: 0.119391%


Results for static:

min:     3281842ns
max:     3343511ns
avg:     3.28736e+06ns
stddev:  128454ns
range:   61669ns
range %: 0.0187594%

Results for derived:

min:     3281772ns
max:     3323956ns
avg:     3.28706e+06ns
stddev:  108004ns
range:   42184ns
range %: 0.0128333%

We see an even bigger difference in the two implementations, the averages have about a factor 19 between them. In this case it is not useful to compare the corrected values since there is so little overhead from the noop function to the one check the numbers are not realistic.

Open questions

  • Is the implementation with type IDs viable or just plain ugly and evil?
  • how does performance change with different class hierarchies (wide, deep, both)?
  • how does performance look on other platformat/compilers/systems?
  • does branch prediction influence our result?

The takeaway message

If you can avoid using dynamic_cast, then do so. On cast not compared here is the qobject_cast of QT, which is implemented without dynamic_cast, but needs the QT metaobject system to get type information and therefore only works on classes derived from QObject (and needs QT...). But since it does not have to cover all the special cases, it should be fast than dynamic_cast, though not as fast as no type-guessing at all. There are also a host of other alternatives, some listed below. Whether those are viable will highly depend on your use-case, you might be ok with dynamic_cast event though it is slower (but hey, it checks more stuff...).

Interesting stuff on the internet to look at:

Complete Measurement results

Measurement resolution: 1ns

Results for dynamic:

min:     39980011ns
max:     45938170ns
avg:     4.02109e+07ns
stddev:  7.77441e+06ns
range:   5958159ns
range %: 0.148173%

raw values:
        45938170 40699028 40878939 40973364 41518266 40757345 40648113 40617104 40778926 40834519
        40728779 40608165 40683941 40717954 40645668 40742329 40742120 40580018 39994259 40096926
        40104329 40078139 40128564 40207834 44474089 40023663 40119973 39991745 40076742 40109846
        40110475 40043846 40032253 40119764 40055371 40096856 40075554 40152660 40092805 40112081
        40542793 40181434 40004386 40072621 40075066 40051739 40127517 40054253 40137504 40080094
        40094551 40072970 40103840 40089592 40119555 40080793 40073041 40150284 40039447 40118088
        40086380 40092945 40028831 40052297 40060888 40218240 40445713 40112431 40106424 40065498
        40039656 40107961 40102164 40054463 40117459 40078278 40097275 40119345 40130868 40142462
        40103491 40126818 40130171 40093154 40094481 40141066 40045663 40094342 40074437 40029110
        40560044 40099859 40007180 40065637 40117878 40141554 40051598 40127516 40025129 40113618
        40025478 40097206 40041821 40110685 40095040 40080792 40142741 40084214 40081421 40085682
        40109707 40107263 40118298 40100488 40106354 40151960 40068849 40037910 40156152 40110754
        40086241 40094621 40128983 40074158 40086869 40081770 40086938 40075205 40070107 40159015
        40089941 40081631 40092177 40107472 40053834 40118087 40099300 40033091 40100208 40071434
        40548520 40142811 40069897 40112221 40068640 40052577 40125561 40081771 40135478 40157897
        40069757 40007319 39985739 40140717 40098951 40078069 40093992 40090152 40037211 40069338
        40067453 40076462 40117040 40093993 40064030 40189466 40145885 40059980 40031834 40105098
        40026526 40103771 40137015 40104469 40084564 40119065 40128564 40076811 40118088 40111313
        40069967 40117739 40022824 40074228 40147631 40096926 40091339 40048735 40074298 40103351
        40530361 40117738 39980011 40144348 40099300 40141973 40090222 40125840 40053485 40061167

Results for static:

min:     4375487ns
max:     4460343ns
avg:     4.38304e+06ns
stddev:  147143ns
range:   84856ns
range %: 0.0193601%

raw values:
        4393086 4375766 4389245 4375766 4375906 4389734 4375766 4389245 4375766 4390921
        4375696 4375766 4399301 4375765 4397975 4375766 4391619 4375765 4395251 4375766
        4375766 4397277 4375696 4389594 4375766 4391829 4375766 4375765 4389175 4375766
        4389175 4375766 4389245 4375766 4419695 4375766 4375766 4392946 4375766 4391619
        4375696 4389733 4375766 4389245 4375766 4375696 4390781 4375836 4389664 4375766
        4389454 4375627 4375766 4389384 4375766 4389594 4375766 4389175 4375835 4389175
        4375766 4375766 4389384 4375766 4389455 4375766 4389454 4375696 4375766 4390781
        4375696 4389385 4375766 4389245 4375836 4389105 4375835 4375696 4389384 4375836
        4389315 4375696 4389454 4375766 4375766 4390432 4375766 4389943 4375836 4389105
        4375766 4419416 4375766 4375766 4389943 4375766 4389733 4375766 4389315 4375766
        4389803 4375766 4375696 4389873 4375766 4389524 4375836 4389384 4375765 4375696
        4389315 4375836 4389035 4375766 4390432 4375696 4389664 4375766 4375696 4389035
        4375766 4389175 4375836 4389455 4375766 4375766 4389244 4375766 4389035 4375766
        4390502 4375766 4389943 4375835 4375836 4390572 4375766 4389175 4375976 4390292
        4375696 4375766 4460343 4375766 4392247 4375766 4389594 4375766 4429124 4375766
        4375487 4389873 4375766 4389454 4375835 4393574 4375766 4375766 4393505 4375766
        4398464 4375766 4389803 4375696 4389174 4375696 4375696 4392247 4375697 4389524
        4375765 4389524 4375696 4389245 4375766 4375766 4390781 4375766 4389803 4375696
        4389663 4375766 4375766 4391200 4375697 4389803 4375765 4388686 4375766 4389384
        4375696 4375696 4390711 4375766 4389105 4375766 4389664 4375696 4375766 4389385

Results for derived:

min:     3281912ns
max:     3338762ns
avg:     3.28731e+06ns
stddev:  117995ns
range:   56850ns
range %: 0.0172938%

raw values:
        3282121 3295810 3282191 3282191 3295880 3282121 3282191 3338762 3282261 3282122
        3297206 3282261 3282191 3295879 3282121 3282191 3295810 3282121 3282192 3282261
        3295810 3282261 3282122 3295670 3282121 3282121 3295880 3282121 3282191 3297067
        3282191 3282191 3296089 3282122 3282121 3295530 3282261 3282192 3296857 3282191
        3282191 3295530 3282122 3282261 3295880 3282260 3281912 3295670 3282261 3282191
        3300419 3282191 3282121 3295879 3282192 3282261 3295740 3282471 3282191 3295530
        3282121 3282261 3295879 3282191 3282191 3296159 3282191 3282261 3296857 3282192
        3282121 3295879 3282191 3282330 3295530 3282121 3282121 3295739 3282331 3282191
        3299232 3282401 3282191 3324304 3282121 3282122 3297206 3282191 3282261 3297765
        3282191 3282051 3282191 3296228 3282261 3282191 3295810 3282122 3282331 3297975
        3282470 3282121 3296088 3282121 3282261 3295600 3282191 3282261 3295669 3282121
        3282191 3295739 3282121 3282122 3295740 3282121 3282260 3295879 3282052 3281982
        3295600 3282191 3282331 3295810 3282680 3282330 3296438 3282261 3282122 3295949
        3282261 3282191 3295810 3282121 3282121 3295600 3282121 3282191 3295809 3282122
        3282191 3295879 3282122 3282121 3295460 3282191 3282191 3295810 3282261 3282121
        3296996 3282192 3282051 3296228 3282261 3282192 3297695 3282191 3282122 3324374
        3282331 3282121 3296369 3282191 3282122 3282191 3296228 3282052 3282401 3295810
        3282261 3282261 3295879 3282121 3282192 3296019 3282122 3282331 3295740 3282191
        3282121 3295530 3282192 3282261 3295880 3282191 3282192 3296019 3282121 3282191
        3296857 3282122 3282191 3296299 3282261 3282820 3295879 3282121 3282121 3295879

Adjusted for loop, operation & function call overhead:
        dynamic avg: 3.69236e+07ns
        static avg : 1.09573e+06ns
        derived avg: 0ns

Factors:
        normal: 9.17421
        corr  : 33.6977

Measurement source code

bench.cpp

#include <iostream>
#include <string>

const int INNER_LOOP = 1000000;
const int OUTER_LOOP = 200;

enum types {
    OurType = 1,
    OtherType
};

class Base {
    public:
        virtual ~Base() {}
        volatile int m_value;
        const types m_type;
    protected:
        Base(types type) : m_type(type), m_value(0) {}
};

class Derived : public Base {
    public:
        Derived() : Base(OurType) {}
        void increment() { m_value++; }
};

class OtherDerived : public Base {
    public:
        OtherDerived() : Base(OtherType) {}
        void doSomeThing() { m_value--; }
};

void __attribute__((noinline)) doDynamic(Base* base) {
    Derived* derived = dynamic_cast<Derived*>(base);
    if(derived)
        derived->increment();
}

void __attribute__((noinline)) doStatic(Base* base) {
    if(base->m_type == OurType)
        static_cast<Derived*>(base)->increment();
}

void __attribute__((noinline)) doDerived(Derived* derived) {
    derived->increment();
    //asm volatile("");
}

#include "stuff.cpp" // I know... Raptors will come and eat me for this

int main(int argc, char* argv[]) {
    Base* derived = new Derived();
    //Base* derived = new OtherDerived();

    Clock::TimePoint startDyn[OUTER_LOOP], startStat[OUTER_LOOP], startDer[OUTER_LOOP];
    Clock::TimePoint endDyn[OUTER_LOOP], endStat[OUTER_LOOP], endDer[OUTER_LOOP];

    std::cout << "Measurement resolution: " << Clock::Resolution() << "ns" << std::endl << std::endl;

    for( int i = 0; i < 10; ++i)
        doDynamic(derived);
    for( int i = 0; i < OUTER_LOOP; ++i) {
        startDyn[i] = Clock::Now();
        for( int cnt = 0; cnt < INNER_LOOP; cnt++)
            doDynamic(derived);
        endDyn[i] = Clock::Now();
    }

    for( int i = 0; i < 10; ++i)
        doStatic(derived);
    for( int i = 0; i < OUTER_LOOP; ++i) {
        startStat[i] = Clock::Now();
        for( int cnt = 0; cnt < INNER_LOOP; cnt++)
            doStatic(derived);
        endStat[i] = Clock::Now();
    }

    for( int i = 0; i < 10; ++i)
        doDerived((Derived*)derived);
    for( int i = 0; i < OUTER_LOOP; ++i) {
        startDer[i] = Clock::Now();
        for( int cnt = 0; cnt < INNER_LOOP; cnt++)
            doDerived((Derived*)derived);
        endDer[i] = Clock::Now();
    }

    double dy = printResult( "dynamic", startDyn, endDyn);
    double st = printResult( "static", startStat, endStat);
    double de = printResult( "derived", startDer, endDer);

    std::cout << "Adjusted for loop, operation & function call overhead:" << std::endl;
    std::cout << "\tdynamic avg: " << dy-de << "ns" << std::endl;
    std::cout << "\tstatic avg : " << st-de << "ns" << std::endl;
    std::cout << "\tderived avg: 0ns" << std::endl;
    std::cout << std::endl;
    std::cout << "Factors:" << std::endl;
    std::cout << "\tnormal: " << dy / st << std::endl;
    std::cout << "\tcorr  : " << (dy-de)/(st-de) << std::endl;
}

stuff.cpp

#include <time.h>
#include <stdint.h>
#include <limits>
#include <cmath>

std::ostream& operator<<( std::ostream& stream, const Base& base) {
    stream << "Type: " << base.m_type << "  Value: " << base.m_value << std::endl;
    return stream;
}

/////////////////////////////////////////////////////////////////////////////////////
// taken from the hayai library (https://github.com/nickbruun/hayai) by Nick Bruun //
/////////////////////////////////////////////////////////////////////////////////////
class Clock
{
public:
    /// Time point.

    /// Opaque representation of a point in time.
    typedef struct timespec TimePoint;


    /// Get the current time as a time point.

    /// @returns the current time point.
    static TimePoint Now()
    {
        TimePoint result;
        # if defined(CLOCK_MONOTONIC_RAW)
            clock_gettime(CLOCK_MONOTONIC_RAW, &result);
        # elif defined(CLOCK_MONOTONIC)
            clock_gettime(CLOCK_MONOTONIC, &result);
        # elif defined(CLOCK_REALTIME)
            clock_gettime(CLOCK_REALTIME, &result);
        # else
            clock_gettime((clockid_t)-1, &result);
        # endif
        return result;
    }


    /// Get the duration between two time points.

    /// @param startTime Start time point.
    /// @param endTime End time point.
    /// @returns the number of nanoseconds elapsed between the two time
    /// points.
    static int64_t Duration(const TimePoint& startTime,
                            const TimePoint& endTime)
    {
        TimePoint timeDiff;

        timeDiff.tv_sec = endTime.tv_sec - startTime.tv_sec;
        if (endTime.tv_nsec < startTime.tv_nsec)
        {
            timeDiff.tv_nsec = endTime.tv_nsec + 1000000000L -
                startTime.tv_nsec;
            timeDiff.tv_sec--;
        }
        else
            timeDiff.tv_nsec = endTime.tv_nsec - startTime.tv_nsec;

        return timeDiff.tv_sec * 1000000000L + timeDiff.tv_nsec;
    }

    static int64_t Resolution()
    {
        TimePoint result;

        # if defined(CLOCK_MONOTONIC_RAW)
            clock_getres(CLOCK_MONOTONIC_RAW, &result);
        # elif defined(CLOCK_MONOTONIC)
            clock_getres(CLOCK_MONOTONIC, &result);
        # elif defined(CLOCK_REALTIME)
            clock_getres(CLOCK_REALTIME, &result);
        # else
            clock_getres((clockid_t)-1, &result);
        # endif

        return result.tv_sec * 1000000000L + result.tv_nsec;
    }
};

double printResult( std::string test, Clock::TimePoint start[OUTER_LOOP], Clock::TimePoint end[OUTER_LOOP]) {
    double avg = 0;
    double variance = 0;
    double stddev = 0;
    int64_t duration[OUTER_LOOP];
    int64_t min = std::numeric_limits<int64_t>::max();
    int64_t max = std::numeric_limits<int64_t>::min();
    int64_t range;

    for( int i = 0; i < OUTER_LOOP; ++i) {
        duration[i] = Clock::Duration(start[i], end[i]);
        avg += duration[i];

        if( min > duration[i])
            min = duration[i];
        if( max < duration[i])
            max = duration[i];
    }
    avg /= OUTER_LOOP;

    for( int i = 0; i < OUTER_LOOP; ++i)
        variance += std::pow(duration[i] - avg, 2);
    stddev = std::sqrt(variance);

    std::cout << "Results for " << test << ":" << std::endl;
    std::cout << std::endl;
    std::cout << "min:     " << min << "ns" << std::endl;
    std::cout << "max:     " << max << "ns" << std::endl;
    std::cout << "avg:     " << avg << "ns" << std::endl;
    std::cout << "stddev:  " << stddev << "ns" << std::endl;
    std::cout << "range:   " << max - min << "ns" << std::endl;
    std::cout << "range %: " << (max - min)/avg << "%" << std::endl;

    std::cout << std::endl;
    std::cout << "raw values: ";

    for( int i = 0; i < OUTER_LOOP; ++i) {
        if( (i % 10) == 0)
            std::cout << std::endl << "\t";
        std::cout << duration[i] << " ";
    }

    std::cout << std::endl;
    std::cout << std::endl;

    return avg;
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment