Skip to content

Instantly share code, notes, and snippets.

@rianhunter
Last active December 21, 2023 23:25
  • Star 20 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
Star You must be signed in to star a gist
Save rianhunter/0be8dc116b120ad5fdd4 to your computer and use it in GitHub Desktop.
Indirect vs Direct Function Call Overhead in C/C++
/*
This benchmark shows how indirect function calls have nearly
the same overhead costs as direct function calls.
This is comparing apples to apples, factoring out the savings
due to inlining optimizations that direct calls usually afford.
From this, it seems that inlining and other generic interprocedual
optimizations are the main drivers of direct function call optimization,
not the direct call itself.
In other words, unless you implement your functions in header files
or compile with LTO (Link Time Optimization), there is virtually no benefit
to using direct function calls, aka static dispatch.
*/
#include <limits.h>
int foo(int a) {
return a;
}
int direct_version() {
int i, b = 0;
for (i = 0; i < INT_MAX; ++i) {
b = foo(b);
}
return b;
}
int indirect_version(int (*fn)(int)) {
int i, b = 0;
for (i = 0; i < INT_MAX; ++i) {
b = fn(b);
}
return b;
}
int main(int argc, char *argv[]) {
if (argc == 2 && argv[1][0] == 'd') {
return direct_version();
}
else {
return indirect_version(&foo);
}
}
/*
$ cc -O3 -fno-inline -fomit-frame-pointer call_overhead.c
$ time ./a.out d
real 0m5.199s
user 0m5.155s
sys 0m0.021s
$ time ./a.out s
real 0m5.251s
user 0m5.215s
sys 0m0.018s
*/
@abainbridge
Copy link

I like your code - clear and to-the-point. However, I think it is a little misleading in the general case. To understand more I looked at the assembly generated for x86. After I cleaned it up to make it more readable, it looked like this:

_foo:
    movl    4(%esp), %eax       ; Copy argument from stack into eax,
                                ; which is normally used to store
                                ; the return value from a function
                                ; in x86.
    ret


_direct_version:
    subl    $4, %esp            ; Allocate 4 bytes of stack space.
                                ; This space will be used to hold
                                ; the argument when we call foo().

    movl    $2147483647, %edx   ; edx is the 'i' variable of the 
                                ; for loop. Initialized to MAX_INT

    xorl    %eax, %eax          ; eax is the 'b' variable. That xor
                                ; will set eax to 0.
L3:
    movl    %eax, (%esp)        ; Copy 'b' onto the stack space
                                ; reserved to hold the argument
                                ; for foo().
    call    _foo                ; Call the function
    subl    $1, %edx            ; i--
    jne L3                      ; if (result of subtract above != 0) goto L3;
    addl    $4, %esp
    ret


_indirect_version:
    pushl   %esi
    pushl   %ebx
    xorl    %eax, %eax
    movl    $2147483647, %ebx
    subl    $20, %esp
    movl    32(%esp), %esi
L8:
    movl    %eax, (%esp)
    call    *%esi
    subl    $1, %ebx
    jne L8
    addl    $20, %esp
    popl    %ebx
    popl    %esi
    ret

Differences between the direct and indirect versions are:

  1. The direct version uses 3 instructions to setup before it gets to the for-loop.The indirect version uses 6.
  2. The loop itself is 4 instructions in both cases, but the direct version uses 3 registers (eax, esp and edx) while the indirect version uses 4 (eax, esp, esi, and ebx). If there were no more registers free, the indirect version would have to add extra code to move variables on and off the stack.

The extra setup overhead doesn't matter much, unless the loop count is tiny. But the extra register use does matter. In real code, register contention is often a problem - it is more of a problem on x86 than instruction sets with more registers, but I don't think we should ignore this cost in any case.

To investigate the cost, I changed your code to use additional copies of foo(). My code is below. Timing the resulting executable, I found that the indirect version was 3.4x slower.

#include <limits.h>

int foo(int a) {
  return a;
}

int bar(int a) {
  return a;
}

int baz(int a) {
  return a;
}

int direct_version() {
  int i, b = 0;
  for (i = 0; i < INT_MAX; ++i) {
      b = foo(b) + bar(b) + baz(b);
  }
  return b;
}

int indirect_version(int (*fn)(int), int (*fn2)(int), int (*fn3)(int)) {
  int i, b = 0;

  for (i = 0; i < INT_MAX; ++i) {
    b = fn(b) + fn2(b) + fn3(b);
  }

  return b;
}

int main(int argc, char *argv[]) {
  if (argc == 2 && argv[1][0] == 'd') {
    return direct_version();
  }
  else {
    return indirect_version(&foo, &bar, &baz);
  }
}

@abainbridge
Copy link

A even simpler change makes the problem even worse. I modified the for-loop to ignore the return value of foo(). Then, in the direct case, the compiler understood that there is no point in calling the function, since it can see that the foo() has no side-effects. So the for-loop became a no-op. But it didn't make the same decision for the indirect version because it didn't know if the function pointed to has any side effects.

Obviously the performance difference then is massive.

This illustrates the general problem that indirect functions make it harder for the compiler to apply its optimizations.

Here are the functions I modified:

int direct_version() {
  int i, b = 0;
  for (i = 0; i < INT_MAX; ++i) {
      foo(b);
  }
  return b;
}

int indirect_version(int (*fn)(int)) {
  int i, b = 0;

  for (i = 0; i < INT_MAX; ++i) {
    fn(b);
  }

  return b;
}

@cruxrebels
Copy link

Thanks for this @rianhunter and the response @abainbridge 👍

@jzulauf-lunarg
Copy link

Additionally, if the function call can be inlined, the indirect version has function call overhead, while the direct version does not.

Where this gets interesting is where the reason for the indirection is the presence or absence of a conditional (or conditional tree) to select a function. I.e. "how many 'if's does one need to eliminate to justify the performance penalties of the function call indirection?"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment