Skip to content

Instantly share code, notes, and snippets.

@topin89
Last active March 20, 2023 19:21
Show Gist options
  • Save topin89/84d1f78511dee9d32782b96638d94b6b to your computer and use it in GitHub Desktop.
Save topin89/84d1f78511dee9d32782b96638d94b6b to your computer and use it in GitHub Desktop.
Shape area count benchmark

Edit:

  1. Please disregard MSVC Results. I missed that CMake generates an app in $builddir/Release, and not in $builddir. MSVC's results were so close to clang's because they were, in fact, clang's. But don't worry it doesn't matter much, because
  2. This benchmark was designed on assuption that since we access shapes in predicatable way, cache won't give us much trouble. Well, cmuratori and AcheretoTM pointed that it's probably wrong. I checked, and I was wrong. Now virtual calls are even faster, so please disregard everything else as well. Now, the code don't generate all 20 MiB objects for each case, but iterate over 8192 (about 128 KB, half of my L2 cache for a single core) 2560 times. That's the same 20 MiB iterations. Also, vectors are made before every loop and destroyed after every measurement. So, it's always cold start for every loop and low memory consumption for the bench.

Recently, Case Muratori made an article and a video named "Clean" Code, Horrible Performance

I agree that sometimes, a particular software architecture that is regarded as better (that is, Clean Code) in general can reduce performance quite a bit. I also agree that the Clean Code example can be misleading and make people think it's fine even in cases like shape area size. There is a reason why shaders are not written in object-oriented style after all.

Sometimes after the video was published, Casey and Robert (author of Clean Code) engaged in discussion on the merits of different architecture and on this day (16 Mar 2023) Bob said about "a hypothetical compiler that produces identical binary code irrespective of whether the input is operand or operation primal." Casey responded with essentially "it's hypothetical, modern compilers can't do that" and it is as far from reality as proposing "A hypothetical language where the dependency cost is always the same". I think this is a bit unfair since the previous points were more about architecture in general, and Bob agreed that there are cases where it's harmful for performance to use OO style with virtual functions.

About four days ago (13 Mar 2023), I made my own benchmark to be absolutely sure things are as they are, as well as to test different approaches to the problem.

Along with Casey's code, I tested:

  • std::variant containing four different classes with no base class:
class square_no_vt
{
public:
    square_no_vt(f32 SideInit) : Side(SideInit) {}
    f32 Area() {return Side*Side;}

private:
    f32 Side;
};

class rectangle_no_vt
{
public:
    rectangle_no_vt(f32 WidthInit, f32 HeightInit) : Width(WidthInit), Height(HeightInit) {}
    f32 Area() {return Width*Height;}

private:
    f32 Width, Height;
};

class triangle_no_vt
{
public:
    triangle_no_vt(f32 BaseInit, f32 HeightInit) : Base(BaseInit), Height(HeightInit) {}
    f32 Area() {return 0.5f*Base*Height;}

private:
    f32 Base, Height;
};

class circle_no_vt
{
public:
    circle_no_vt(f32 RadiusInit) : Radius(RadiusInit) {}
    f32 Area() {return Pi32*Radius*Radius;}

private:
    f32 Radius;
};

using shape_variant = std::variant<square_no_vt, rectangle_no_vt, triangle_no_vt, circle_no_vt>;


template<typename ShapeVariant>
f32 TotalAreaVariant(u32 ShapeCount, ShapeVariant *Shapes) noexcept
{
    f32 Accum = 0.0f;
    for(u32 ShapeIndex = 0; ShapeIndex < ShapeCount; ++ShapeIndex)
    {
        Accum += std::visit([](auto shape){return shape.Area();}, Shapes[ShapeIndex]);
    }

    return Accum;
}
  • std::variant with one base class with optimized Area() function
class shape_base_no_vt_opt{
public:
    shape_base_no_vt_opt(shape_type shape_index_, f32 side1_, f32 side2_) :
        shape_index(shape_index_), side1(side1_), side2(side2_)
    {}
    f32 Area() { return c_table[shape_index] * side1 * side2; }
private:

    static constexpr f32 c_table[4] = {1.0f, 1.0f, 0.5f, Pi32};

    shape_type shape_index;
    f32 side1;
    f32 side2;
};

class square_no_vt_opt : public shape_base_no_vt_opt
{
public:
    square_no_vt_opt(f32 SideInit) : shape_base_no_vt_opt(Shape_Square, SideInit, SideInit) {}

};

class rectangle_no_vt_opt: public shape_base_no_vt_opt
{
public:
    rectangle_no_vt_opt(f32 WidthInit, f32 HeightInit) : shape_base_no_vt_opt(Shape_Rectangle, WidthInit, HeightInit) {}
};

class triangle_no_vt_opt: public shape_base_no_vt_opt
{
public:
    triangle_no_vt_opt(f32 BaseInit, f32 HeightInit) : shape_base_no_vt_opt(Shape_Triangle, BaseInit, HeightInit) {}
};

class circle_no_vt_opt: public shape_base_no_vt_opt
{
public:
    circle_no_vt_opt(f32 RadiusInit) : shape_base_no_vt_opt(Shape_Circle, RadiusInit, RadiusInit) {}
};

using shape_variant_opt = std::variant<square_no_vt_opt, rectangle_no_vt_opt, triangle_no_vt_opt, circle_no_vt_opt>;

// same TotalAreaVariant as above
  • Finally, another shape_base overloaded classes, with an intermediate base class with optimized area function
class shape_base_opt: public shape_base{
public:
    shape_base_opt(shape_type shape_index_, f32 side1_, f32 side2_) :
        shape_index(shape_index_), side1(side1_), side2(side2_)
    {}
    virtual f32 Area() { return c_table[shape_index] * side1 * side2; }
private:

    static constexpr f32 c_table[4] = {1.0f, 1.0f, 0.5f, Pi32};

    shape_type shape_index;
    f32 side1;
    f32 side2;
};

class square_opt : public shape_base_opt
{
public:
    square_opt(f32 SideInit) : shape_base_opt(Shape_Square, SideInit, SideInit) {}

};

class rectangle_opt: public shape_base_opt
{
public:
    rectangle_opt(f32 WidthInit, f32 HeightInit) : shape_base_opt(Shape_Rectangle, WidthInit, HeightInit) {}
};

class triangle_opt: public shape_base_opt
{
public:
    triangle_opt(f32 BaseInit, f32 HeightInit) : shape_base_opt(Shape_Triangle, BaseInit, HeightInit) {}
};

class circle_opt: public shape_base_opt
{
public:
    circle_opt(f32 RadiusInit) : shape_base_opt(Shape_Circle, RadiusInit, RadiusInit) {}
};

// used for original shapes as well
f32 TotalAreaVTBL(u32 ShapeCount, shape_base **Shapes)
{
    f32 Accum = 0.0f;
    for(u32 ShapeIndex = 0; ShapeIndex < ShapeCount; ++ShapeIndex)
    {
        Accum += Shapes[ShapeIndex]->Area();
    }

    return Accum;
}

Edit: there's new measurements from Windows machine:

Windows 10 (different machine, so I can't compare it with the above benchmarks):

  • clang 15.0.6 (MSVC runtime)

Release:

Bench: virtual functions
        Regular  loop, time: 154 ms area: 7.50355e+14
        Unrolled loop, time: 148 ms area: 7.39093e+14

Bench: switch case
        Regular  loop, time: 131 ms area: 7.50355e+14
        Unrolled loop, time: 119 ms area: 7.39093e+14

Bench: lookup table
        Regular  loop, time: 18 ms area: 7.45486e+14
        Unrolled loop, time: 12 ms area: 7.60836e+14

Bench: std::variant
        Regular  loop, time: 133 ms area: 7.50355e+14
        Unrolled loop, time: 121 ms area: 7.39093e+14

Bench: std::variant optimized
        Regular  loop, time: 19 ms area: 7.50355e+14
        Unrolled loop, time: 18 ms area: 7.39093e+14

Bench: virtual functions optimized
        Regular  loop, time: 41 ms area: 7.50355e+14
        Unrolled loop, time: 39 ms area: 7.39093e+14

Debug:

Bench: virtual functions
        Regular  loop, time: 251 ms area: 7.50355e+14
        Unrolled loop, time: 209 ms area: 7.39093e+14

Bench: switch case
        Regular  loop, time: 278 ms area: 7.50355e+14
        Unrolled loop, time: 237 ms area: 7.39093e+14

Bench: lookup table
        Regular  loop, time: 120 ms area: 7.45486e+14
        Unrolled loop, time: 114 ms area: 7.60836e+14

Bench: std::variant
        Regular  loop, time: 684 ms area: 7.50355e+14
        Unrolled loop, time: 687 ms area: 7.39093e+14

Bench: std::variant optimized
        Regular  loop, time: 687 ms area: 7.50355e+14
        Unrolled loop, time: 701 ms area: 7.39093e+14

Bench: virtual functions optimized
        Regular  loop, time: 99 ms area: 7.50355e+14
        Unrolled loop, time: 74 ms area: 7.39093e+14
  • MSVC 2022 17.4.3

Release:

Bench: virtual functions
        Regular  loop, time: 153 ms area: 7.50355e+14
        Unrolled loop, time: 142 ms area: 7.39093e+14

Bench: switch case
        Regular  loop, time: 125 ms area: 7.50355e+14
        Unrolled loop, time: 91 ms area: 7.39093e+14

Bench: lookup table
        Regular  loop, time: 21 ms area: 7.45486e+14
        Unrolled loop, time: 18 ms area: 7.60836e+14

Bench: std::variant
        Regular  loop, time: 124 ms area: 7.50355e+14
        Unrolled loop, time: 79 ms area: 7.39093e+14

Bench: std::variant optimized
        Regular  loop, time: 116 ms area: 7.50355e+14
        Unrolled loop, time: 79 ms area: 7.39093e+14

Bench: virtual functions optimized
        Regular  loop, time: 41 ms area: 7.50355e+14
        Unrolled loop, time: 37 ms area: 7.39093e+14

Debug:

Bench: virtual functions
        Regular  loop, time: 265 ms area: 7.50355e+14
        Unrolled loop, time: 233 ms area: 7.39093e+14

Bench: switch case
        Regular  loop, time: 398 ms area: 7.50355e+14
        Unrolled loop, time: 384 ms area: 7.39093e+14

Bench: lookup table
        Regular  loop, time: 243 ms area: 7.45486e+14
        Unrolled loop, time: 232 ms area: 7.60836e+14

Bench: std::variant
        Regular  loop, time: 894 ms area: 7.50355e+14
        Unrolled loop, time: 887 ms area: 7.39093e+14

Bench: std::variant optimized
        Regular  loop, time: 993 ms area: 7.50355e+14
        Unrolled loop, time: 1027 ms area: 7.39093e+14

Bench: virtual functions optimized
        Regular  loop, time: 115 ms area: 7.50355e+14
        Unrolled loop, time: 91 ms area: 7.39093e+14

Ubuntu 20.04, chrooted in Ubuntu 22.04

  • gcc 12.1.0

Release:

Bench: virtual functions
        Regular  loop, time: 141 ms area: 7.45026e+14
        Unrolled loop, time: 128 ms area: 7.36963e+14

Bench: switch case
        Regular  loop, time: 83 ms area: 7.45026e+14
        Unrolled loop, time: 56 ms area: 7.36963e+14

Bench: lookup table
        Regular  loop, time: 21 ms area: 7.5042e+14
        Unrolled loop, time: 12 ms area: 7.65487e+14

Bench: std::variant
        Regular  loop, time: 78 ms area: 7.45026e+14
        Unrolled loop, time: 54 ms area: 7.36963e+14

Bench: std::variant optimized
        Regular  loop, time: 68 ms area: 7.45026e+14
        Unrolled loop, time: 29 ms area: 7.36963e+14

Bench: virtual functions optimized
        Regular  loop, time: 45 ms area: 7.45026e+14
        Unrolled loop, time: 31 ms area: 7.36963e+14

Debug:

Bench: virtual functions
        Regular  loop, time: 224 ms area: 7.45026e+14
        Unrolled loop, time: 186 ms area: 7.36963e+14

Bench: switch case
        Regular  loop, time: 236 ms area: 7.45026e+14
        Unrolled loop, time: 197 ms area: 7.36963e+14

Bench: lookup table
        Regular  loop, time: 66 ms area: 7.5042e+14
        Unrolled loop, time: 47 ms area: 7.65487e+14

Bench: std::variant
        Regular  loop, time: 1092 ms area: 7.45026e+14
        Unrolled loop, time: 1076 ms area: 7.36963e+14

Bench: std::variant optimized
        Regular  loop, time: 1098 ms area: 7.45026e+14
        Unrolled loop, time: 1087 ms area: 7.36963e+14

Bench: virtual functions optimized
        Regular  loop, time: 83 ms area: 7.45026e+14
        Unrolled loop, time: 57 ms area: 7.36963e+14

Ubuntu 20.04

  • gcc 11.1.0

Release:

Bench: virtual functions
        Regular  loop, time: 128 ms area: 7.45026e+14
        Unrolled loop, time: 128 ms area: 7.36963e+14

Bench: switch case
        Regular  loop, time: 88 ms area: 7.45026e+14
        Unrolled loop, time: 55 ms area: 7.36963e+14

Bench: lookup table
        Regular  loop, time: 19 ms area: 7.5042e+14
        Unrolled loop, time: 10 ms area: 7.65487e+14

Bench: std::variant
        Regular  loop, time: 114 ms area: 7.45026e+14
        Unrolled loop, time: 108 ms area: 7.36963e+14

Bench: std::variant optimized
        Regular  loop, time: 115 ms area: 7.45026e+14
        Unrolled loop, time: 111 ms area: 7.36963e+14

Bench: virtual functions optimized
        Regular  loop, time: 43 ms area: 7.45026e+14
        Unrolled loop, time: 28 ms area: 7.36963e+14

Debug:

Bench: virtual functions
        Regular  loop, time: 227 ms area: 7.45026e+14
        Unrolled loop, time: 188 ms area: 7.36963e+14

Bench: switch case
        Regular  loop, time: 231 ms area: 7.45026e+14
        Unrolled loop, time: 198 ms area: 7.36963e+14

Bench: lookup table
        Regular  loop, time: 65 ms area: 7.5042e+14
        Unrolled loop, time: 48 ms area: 7.65487e+14

Bench: std::variant
        Regular  loop, time: 1140 ms area: 7.45026e+14
        Unrolled loop, time: 1165 ms area: 7.36963e+14

Bench: std::variant optimized
        Regular  loop, time: 1150 ms area: 7.45026e+14
        Unrolled loop, time: 1151 ms area: 7.36963e+14

Bench: virtual functions optimized
        Regular  loop, time: 79 ms area: 7.45026e+14
        Unrolled loop, time: 59 ms area: 7.36963e+14
  • clang 15.0.7

Release:

Bench: virtual functions
        Regular  loop, time: 136 ms area: 7.45026e+14
        Unrolled loop, time: 127 ms area: 7.36963e+14

Bench: switch case
        Regular  loop, time: 119 ms area: 7.45026e+14
        Unrolled loop, time: 112 ms area: 7.36963e+14

Bench: lookup table
        Regular  loop, time: 19 ms area: 7.5042e+14
        Unrolled loop, time: 15 ms area: 7.65487e+14

Bench: std::variant
        Regular  loop, time: 116 ms area: 7.45026e+14
        Unrolled loop, time: 110 ms area: 7.36963e+14

Bench: std::variant optimized
        Regular  loop, time: 122 ms area: 7.45026e+14
        Unrolled loop, time: 117 ms area: 7.36963e+14

Bench: virtual functions optimized
        Regular  loop, time: 67 ms area: 7.45026e+14
        Unrolled loop, time: 31 ms area: 7.36963e+14

Debug:

Bench: virtual functions
        Regular  loop, time: 208 ms area: 7.45026e+14
        Unrolled loop, time: 178 ms area: 7.36963e+14

Bench: switch case
        Regular  loop, time: 290 ms area: 7.45026e+14
        Unrolled loop, time: 264 ms area: 7.36963e+14

Bench: lookup table
        Regular  loop, time: 105 ms area: 7.5042e+14
        Unrolled loop, time: 91 ms area: 7.65487e+14

Bench: std::variant
        Regular  loop, time: 586 ms area: 7.45026e+14
        Unrolled loop, time: 603 ms area: 7.36963e+14

Bench: std::variant optimized
        Regular  loop, time: 619 ms area: 7.45026e+14
        Unrolled loop, time: 631 ms area: 7.36963e+14

Bench: virtual functions optimized
        Regular  loop, time: 60 ms area: 7.45026e+14
        Unrolled loop, time: 48 ms area: 7.36963e+14

So, lets compare it by categories, for regular loops:

gcc 12 gcc 11 clang 15 Lin clang 15 Win MSVC
virtual functions 141 128 136 154 152
switch case 83 88 119 131 130
lookup table 21 19 19 18 21
std::variant 78 114 116 133 124
std::variant optimized 68 115 122 19 118
virtual functions optimized 45 43 67 41 45

and for unrolled loops:

gcc 12 gcc 11 clang 15 Lin clang 15 Win MSVC
virtual functions 128 128 127 148 139
switch case 56 55 112 119 90
lookup table 12 10 15 12 18
std::variant 54 108 110 121 79
std::variant optimized 29 111 117 18 78
virtual functions optimized 31 28 31 39 40

Note, Win machine is not the same as Lin, so there will be no direct comparison. I don't think it's worth comparing Debug versions with each other, so I'll leave it as an exercise for a reader.

So, conclusions:

  • Yes, counting area of 20 megashapes all placed in memory is not the same as counting area of 8 kiloshapes 2.5k times. All data fitting in at least L3 cache can significantly change results of a benchmark
  • Nevertheless, virtual shapes with that have one single base class counts faster than naive shapes with 4 different area functions
  • Lookup table version gained from about 2x to just a bit speed, depending on a compiler. 2x is for gcc at least on Linux, MSVC gives no significant speedup
  • Effect of unrolling switches and variants is compiler and platform dependent, but never to a degree the LuT code has.
  • On most platforms and compilers in the table, variant and visit is about the same speed as switch, except for GCC 11.
  • On most platform, the "optimized variant" code has no speedup compared to naive implementation. GCC 12 somehow improved on unrolled loops, and only Clang on windows showed exactly the same performance as the LuT code. I'm not sure why. My guess is, Windows linker can eliminate same code (Partial-redundancy elimination) and clang can take advantage of it and throw away all branches, leaving single function aside. But, again, I may be completely wrong and clang 15.0.6 can do it everywhere, but clang 15.0.7 has some regression and so it can't optimize it like that. Or some clang defaults on Linux and Windows are different enough so it works on Win10 but not on Ubuntu 20.04.
  • Finally, optimized version of shapes with single middle-class (let's call it that) is still faster than four naive classes each with a different area counting function. About 3 to 4 times faster now. Using the smae function for both cases. My guess is, it's branch prediction at work, but I can't neither proof nor disproof it. All I want to say, it's not really fair to compare optimized procedural code with unoptimized OO, especially since it's still 2 to 4 times faster.

I think I should clarify lots of stuff:

  • I totally agree OO is way less optimizable then Operation First, procedural style. At least, that's what my personal experience is
  • I agree that to OO is less understandable than procedural. I mean, just try to understand how FizzBuzzEnterpriseEdition even works. I tried, and I failed to find where main is. Maybe one or two level of indirection is fine, but five or more are pain
  • Therefore, I agree that it's almost impossible to find optimization opportunities in OO code, with IDE always pointing to base class and not the one class I need, and code being dispersed across way too many files
  • I agree there is no place for OO in tight loops that should work fast with lots of data
  • I do not agree you can't have same performance with Operand First approach. variant, visit and clang on visit shows that it's at least possible with modern C++ compilers. Yes, no base class, no vtables, but the structure of a code is about the same. Should we use it or not is beside the point (but probably we shouldn't except for some strange cases)
  • I do not agree we should not use clean code approach. Yet, do think that we should limit levels of indirection only for user API, and using rarely and carefully in a code that we fully control.

Well, that's all for now.

cmake_minimum_required(VERSION 3.0.0)
project(unclean VERSION 0.1.0)
set (CMAKE_CXX_STANDARD 17)
add_executable(shape_bench main.cpp)
#define _USE_MATH_DEFINES
#include <math.h>
#include <stdint.h>
#include <chrono>
#include <iostream>
#include <random>
#include <stdexcept>
#include <type_traits>
#include <vector>
#include <variant>
using f32 = float;
using u32 = uint32_t;
static constexpr f32 Pi32 = M_PI;
static constexpr size_t num_of_shapes = 8192;
static constexpr size_t num_of_outer_loops = 1024*1024*20 / num_of_shapes;
// static constexpr size_t num_of_shapes = 1024*1024*20;
// utilities
class TimeMeasure {
public:
TimeMeasure() = default;
void begin(){ start_point = std::chrono::steady_clock::now(); }
void end(){ end_point = std::chrono::steady_clock::now(); }
int64_t ms(){
return std::chrono::duration_cast<std::chrono::milliseconds>(end_point - start_point).count();
}
double ms_per_tick(int64_t ticks) {
return static_cast<double>(ms()) / ticks;
}
private:
std::chrono::steady_clock::time_point start_point;
std::chrono::steady_clock::time_point end_point;
};
// from https://stackoverflow.com/a/34111095
template <typename Kind, typename... Kinds>
constexpr bool any_of_types(){
/* The following expands to :
* std::is_same_v<Kind, Kind0> || std::is_same_v<Kind, Kind1> || ... */
if constexpr ((std::is_same_v<Kind, Kinds> || ...)) {
return true;
}
return false;
};
// This should really be in the standard
template< typename T>
class RandomNumberGen{
public:
RandomNumberGen(T min = 14, T max = 9001, size_t seed = 42):
rng(seed), uni(min, max) {}
T operator()(){ return uni(rng);}
private:
using uniform_dist = std::conditional_t<
any_of_types<T, float, double, long double>(),
std::uniform_real_distribution<T>, std::uniform_int_distribution<T>
>;
std::mt19937 rng;
uniform_dist uni;
};
// main dish
enum shape_type : u32
{
Shape_Square,
Shape_Rectangle,
Shape_Triangle,
Shape_Circle,
Shape_Count,
};
struct shape_union
{
shape_type Type;
f32 Width;
f32 Height;
};
f32 const CTable[Shape_Count] = {1.0f, 1.0f, 0.5f, Pi32};
class shape_base
{
public:
shape_base() {}
virtual f32 Area() = 0;
virtual ~shape_base() = default;
};
class square : public shape_base
{
public:
square(f32 SideInit) : Side(SideInit) {}
virtual f32 Area() {return Side*Side;}
private:
f32 Side;
};
class rectangle : public shape_base
{
public:
rectangle(f32 WidthInit, f32 HeightInit) : Width(WidthInit), Height(HeightInit) {}
virtual f32 Area() {return Width*Height;}
private:
f32 Width, Height;
};
class triangle : public shape_base
{
public:
triangle(f32 BaseInit, f32 HeightInit) : Base(BaseInit), Height(HeightInit) {}
virtual f32 Area() {return 0.5f*Base*Height;}
private:
f32 Base, Height;
};
class circle : public shape_base
{
public:
circle(f32 RadiusInit) : Radius(RadiusInit) {}
virtual f32 Area() {return Pi32*Radius*Radius;}
private:
f32 Radius;
};
f32 TotalAreaVTBL(u32 ShapeCount, shape_base **Shapes)
{
f32 Accum = 0.0f;
for(u32 ShapeIndex = 0; ShapeIndex < ShapeCount; ++ShapeIndex)
{
Accum += Shapes[ShapeIndex]->Area();
}
return Accum;
}
f32 TotalAreaVTBL4(u32 ShapeCount, shape_base **Shapes)
{
f32 Accum0 = 0.0f;
f32 Accum1 = 0.0f;
f32 Accum2 = 0.0f;
f32 Accum3 = 0.0f;
u32 Count = ShapeCount/4;
while(Count--)
{
Accum0 += Shapes[0]->Area();
Accum1 += Shapes[1]->Area();
Accum2 += Shapes[2]->Area();
Accum3 += Shapes[3]->Area();
Shapes += 4;
}
f32 Result = (Accum0 + Accum1 + Accum2 + Accum3);
return Result;
}
f32 GetAreaSwitch(shape_union Shape)
{
f32 Result = 0.0f;
switch(Shape.Type)
{
case Shape_Square: {Result = Shape.Width*Shape.Width;} break;
case Shape_Rectangle: {Result = Shape.Width*Shape.Height;} break;
case Shape_Triangle: {Result = 0.5f*Shape.Width*Shape.Height;} break;
case Shape_Circle: {Result = Pi32*Shape.Width*Shape.Width;} break;
case Shape_Count: {} break;
}
return Result;
}
f32 TotalAreaSwitch(u32 ShapeCount, shape_union *Shapes)
{
f32 Accum = 0.0f;
for(u32 ShapeIndex = 0; ShapeIndex < ShapeCount; ++ShapeIndex)
{
Accum += GetAreaSwitch(Shapes[ShapeIndex]);
}
return Accum;
}
f32 TotalAreaSwitch4(u32 ShapeCount, shape_union *Shapes)
{
f32 Accum0 = 0.0f;
f32 Accum1 = 0.0f;
f32 Accum2 = 0.0f;
f32 Accum3 = 0.0f;
ShapeCount /= 4;
while(ShapeCount--)
{
Accum0 += GetAreaSwitch(Shapes[0]);
Accum1 += GetAreaSwitch(Shapes[1]);
Accum2 += GetAreaSwitch(Shapes[2]);
Accum3 += GetAreaSwitch(Shapes[3]);
Shapes += 4;
}
f32 Result = (Accum0 + Accum1 + Accum2 + Accum3);
return Result;
}
f32 GetAreaUnion(shape_union Shape)
{
f32 Result = CTable[Shape.Type]*Shape.Width*Shape.Height;
return Result;
}
f32 TotalAreaUnion(u32 ShapeCount, shape_union *Shapes)
{
f32 Accum = 0.0f;
for(u32 ShapeIndex = 0; ShapeIndex < ShapeCount; ++ShapeIndex)
{
Accum += GetAreaUnion(Shapes[ShapeIndex]);
}
return Accum;
}
f32 TotalAreaUnion4(u32 ShapeCount, shape_union *Shapes)
{
f32 Accum0 = 0.0f;
f32 Accum1 = 0.0f;
f32 Accum2 = 0.0f;
f32 Accum3 = 0.0f;
ShapeCount /= 4;
while(ShapeCount--)
{
Accum0 += GetAreaUnion(Shapes[0]);
Accum1 += GetAreaUnion(Shapes[1]);
Accum2 += GetAreaUnion(Shapes[2]);
Accum3 += GetAreaUnion(Shapes[3]);
Shapes += 4;
}
f32 Result = (Accum0 + Accum1 + Accum2 + Accum3);
return Result;
}
class square_no_vt
{
public:
square_no_vt(f32 SideInit) : Side(SideInit) {}
f32 Area() {return Side*Side;}
private:
f32 Side;
};
class rectangle_no_vt
{
public:
rectangle_no_vt(f32 WidthInit, f32 HeightInit) : Width(WidthInit), Height(HeightInit) {}
f32 Area() {return Width*Height;}
private:
f32 Width, Height;
};
class triangle_no_vt
{
public:
triangle_no_vt(f32 BaseInit, f32 HeightInit) : Base(BaseInit), Height(HeightInit) {}
f32 Area() {return 0.5f*Base*Height;}
private:
f32 Base, Height;
};
class circle_no_vt
{
public:
circle_no_vt(f32 RadiusInit) : Radius(RadiusInit) {}
f32 Area() {return Pi32*Radius*Radius;}
private:
f32 Radius;
};
using shape_variant = std::variant<square_no_vt, rectangle_no_vt, triangle_no_vt, circle_no_vt>;
class shape_base_no_vt_opt{
public:
shape_base_no_vt_opt(shape_type shape_index_, f32 side1_, f32 side2_) :
shape_index(shape_index_), side1(side1_), side2(side2_)
{}
f32 Area() { return c_table[shape_index] * side1 * side2; }
private:
static constexpr f32 c_table[4] = {1.0f, 1.0f, 0.5f, Pi32};
shape_type shape_index;
f32 side1;
f32 side2;
};
class square_no_vt_opt : public shape_base_no_vt_opt
{
public:
square_no_vt_opt(f32 SideInit) : shape_base_no_vt_opt(Shape_Square, SideInit, SideInit) {}
};
class rectangle_no_vt_opt: public shape_base_no_vt_opt
{
public:
rectangle_no_vt_opt(f32 WidthInit, f32 HeightInit) : shape_base_no_vt_opt(Shape_Rectangle, WidthInit, HeightInit) {}
};
class triangle_no_vt_opt: public shape_base_no_vt_opt
{
public:
triangle_no_vt_opt(f32 BaseInit, f32 HeightInit) : shape_base_no_vt_opt(Shape_Triangle, BaseInit, HeightInit) {}
};
class circle_no_vt_opt: public shape_base_no_vt_opt
{
public:
circle_no_vt_opt(f32 RadiusInit) : shape_base_no_vt_opt(Shape_Circle, RadiusInit, RadiusInit) {}
};
using shape_variant_opt = std::variant<square_no_vt_opt, rectangle_no_vt_opt, triangle_no_vt_opt, circle_no_vt_opt>;
template<typename ShapeVariant>
f32 TotalAreaVariant(u32 ShapeCount, ShapeVariant *Shapes) noexcept
{
f32 Accum = 0.0f;
for(u32 ShapeIndex = 0; ShapeIndex < ShapeCount; ++ShapeIndex)
{
Accum += std::visit([](auto shape){return shape.Area();}, Shapes[ShapeIndex]);
}
return Accum;
}
template<typename ShapeVariant>
f32 TotalAreaVariant4(u32 ShapeCount, ShapeVariant *Shapes) noexcept
{
f32 Accum0 = 0.0f;
f32 Accum1 = 0.0f;
f32 Accum2 = 0.0f;
f32 Accum3 = 0.0f;
u32 Count = ShapeCount/4;
while(Count--)
{
Accum0 += std::visit([](auto shape){return shape.Area();}, Shapes[0]);
Accum1 += std::visit([](auto shape){return shape.Area();}, Shapes[1]);
Accum2 += std::visit([](auto shape){return shape.Area();}, Shapes[2]);
Accum3 += std::visit([](auto shape){return shape.Area();}, Shapes[3]);
Shapes += 4;
}
f32 Result = (Accum0 + Accum1 + Accum2 + Accum3);
return Result;
}
class shape_base_opt: public shape_base{
public:
shape_base_opt(shape_type shape_index_, f32 side1_, f32 side2_) :
shape_index(shape_index_), side1(side1_), side2(side2_)
{}
virtual f32 Area() { return c_table[shape_index] * side1 * side2; }
private:
static constexpr f32 c_table[4] = {1.0f, 1.0f, 0.5f, Pi32};
shape_type shape_index;
f32 side1;
f32 side2;
};
class square_opt : public shape_base_opt
{
public:
square_opt(f32 SideInit) : shape_base_opt(Shape_Square, SideInit, SideInit) {}
};
class rectangle_opt: public shape_base_opt
{
public:
rectangle_opt(f32 WidthInit, f32 HeightInit) : shape_base_opt(Shape_Rectangle, WidthInit, HeightInit) {}
};
class triangle_opt: public shape_base_opt
{
public:
triangle_opt(f32 BaseInit, f32 HeightInit) : shape_base_opt(Shape_Triangle, BaseInit, HeightInit) {}
};
class circle_opt: public shape_base_opt
{
public:
circle_opt(f32 RadiusInit) : shape_base_opt(Shape_Circle, RadiusInit, RadiusInit) {}
};
shape_base* generate_shape_base(){
static RandomNumberGen<size_t> next_shape{0, 3};
static RandomNumberGen<f32> next_argument{};
switch(next_shape()){
case 0: return new square{next_argument()};
case 1: return new rectangle{next_argument(), next_argument()};
case 2: return new triangle{next_argument(), next_argument()};
case 3: return new circle{next_argument()};
}
abort(); // silencing no return warning
}
shape_union generate_shape_union(){
static RandomNumberGen<size_t> next_shape{0, 3};
static RandomNumberGen<f32> next_argument{};
switch(next_shape()){
case 0: {
f32 edge = next_argument();
return shape_union{Shape_Square, edge, edge};
};
case 1: return shape_union{Shape_Rectangle, next_argument(), next_argument()};
case 2: return shape_union{Shape_Triangle, next_argument(), next_argument()};
case 3: {
f32 edge = next_argument();
return shape_union{Shape_Circle, edge, edge};
};
}
abort(); // silencing no return warning
}
shape_variant generate_shape_variant(){
static RandomNumberGen<size_t> next_shape{0, 3};
static RandomNumberGen<f32> next_argument{};
switch(next_shape()){
case 0: return square_no_vt{next_argument()};
case 1: return rectangle_no_vt{next_argument(), next_argument()};
case 2: return triangle_no_vt{next_argument(), next_argument()};
case 3: return circle_no_vt{next_argument()};
}
abort(); // silencing no return warning
}
shape_variant_opt generate_shape_variant_opt(){
static RandomNumberGen<size_t> next_shape{0, 3};
static RandomNumberGen<f32> next_argument{};
switch(next_shape()){
case 0: return square_no_vt_opt{next_argument()};
case 1: return rectangle_no_vt_opt{next_argument(), next_argument()};
case 2: return triangle_no_vt_opt{next_argument(), next_argument()};
case 3: return circle_no_vt_opt{next_argument()};
}
abort(); // silencing no return warning
}
shape_base* generate_shape_base_opt(){
static RandomNumberGen<size_t> next_shape{0, 3};
static RandomNumberGen<f32> next_argument{};
switch(next_shape()){
case 0: return new square_opt{next_argument()};
case 1: return new rectangle_opt{next_argument(), next_argument()};
case 2: return new triangle_opt{next_argument(), next_argument()};
case 3: return new circle_opt{next_argument()};
}
abort(); // silencing no return warning
}
template<typename ShapeGenerator>
auto generate_shapes(ShapeGenerator generate_shape) -> std::vector<decltype(generate_shape())>{
std::vector<decltype(generate_shape())> result;
result.reserve(num_of_shapes);
for(size_t i=0; i < num_of_shapes; ++i){
result.push_back(generate_shape());
}
return result;
}
template<typename ShapeGenerator, typename AreaCount, typename AreaCountLoopUnrolled>
void make_bench(ShapeGenerator gen_shape, AreaCount TotalArea, AreaCountLoopUnrolled TotalArea4, char const * const benchname){
TimeMeasure tm_common;
TimeMeasure tm_loop_unrolled;
f32 area_common = 0;
f32 area_unrolled = 0;
{
auto shapes = generate_shapes(gen_shape);
tm_common.begin();
for(size_t i = 0; i < num_of_outer_loops; ++i){
area_common += TotalArea(shapes.size(), shapes.data());
}
tm_common.end();
if constexpr (std::is_pointer_v<std::decay_t<decltype(shapes[0])>>){
for(const auto shape: shapes ){
delete shape;
}
}
}
{
auto shapes = generate_shapes(gen_shape);
tm_loop_unrolled.begin();
for(size_t i = 0; i < num_of_outer_loops; ++i){
area_unrolled += TotalArea4(shapes.size(), shapes.data());
}
tm_loop_unrolled.end();
if constexpr (std::is_pointer_v<std::decay_t<decltype(shapes[0])>>){
for(const auto shape: shapes ){
delete shape;
}
}
}
std::cout << "Bench: " << benchname << '\n';
std::cout << "\tRegular loop, time: " << tm_common.ms() << " ms area: " << area_common << '\n';
std::cout << "\tUnrolled loop, time: " << tm_loop_unrolled.ms() << " ms area: " << area_unrolled << "\n\n";
}
int main(int, char**) {
make_bench(generate_shape_base, TotalAreaVTBL, TotalAreaVTBL4, "virtual functions");
make_bench(generate_shape_union, TotalAreaSwitch, TotalAreaSwitch4, "switch case");
make_bench(generate_shape_union, TotalAreaUnion, TotalAreaUnion4, "lookup table");
make_bench(generate_shape_variant, TotalAreaVariant<shape_variant>, TotalAreaVariant4<shape_variant>, "std::variant");
make_bench(generate_shape_variant_opt, TotalAreaVariant<shape_variant_opt>, TotalAreaVariant4<shape_variant_opt>, "std::variant optimized");
make_bench(generate_shape_base_opt, TotalAreaVTBL, TotalAreaVTBL4, "virtual functions optimized");
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment