Skip to content

Instantly share code, notes, and snippets.

@zeux
Last active January 27, 2024 11:45
Show Gist options
  • Save zeux/3ce4fcc3a43072b4315abde95319ecb6 to your computer and use it in GitHub Desktop.
Save zeux/3ce4fcc3a43072b4315abde95319ecb6 to your computer and use it in GitHub Desktop.
How does clang 2.7 hold up in 2021?

A friend recently learned about Proebsting's law and mentioned it to me off hand. I knew about the law's existence but I never really asked myself - do I believe in it?

For people who aren't aware, Proebsting's law states:

Compiler Advances Double Computing Power Every 18 Years

Which is to say, if you upgrade your compiler every 18 years, you would expect on average your code to double in performance on the same hardware.

Let's C about this

It occurred to me that I could try to do an experiment. I could take a modern compiler and compare performance of generated code - along with perhaps a few other metrics - vs a 20-year-old one.

At least this was my initial intention; however I've long wanted to do another experiment which is to figure out how LLVM has changed over the years. To combine these two I wanted to get an old version of LLVM and test it against a modern version.

To make this experiment a bit more interesting, I was going to test LLVM 1.0 - unfortunately, it only comes with 32-bit Linux binaries that I wasn't able to get to work fully due to lack of 32-bit system headers, and it segfaulted when compiling one of the source files. So we're going to test two versions of LLVM:

  • LLVM 2.7. This is the first release of LLVM that contains a version of Clang that can compile C++ code.
  • LLVM 11. This is the latest stable release of LLVM that I happen to have available.

LLVM 2.7 was released in April 2010, which was 11 years ago. So we wouldn't quite expect a 2x speedup according to Proebsting's law - only a 1.5x one.

We're going to compare these compilers on compile time and run time axis as follows:

  • Using an amalgamated version of https://github.com/zeux/meshoptimizer library, we're going to build libmeshoptimizer.o several times for each compiler, with and without optimizations (-O0 through -O3), and note the build time.
  • Using the resulting optimized .o file we're going to compile the meshoptimized demo program using modern clang, run it on a Stanford dragon mesh and compare timings for various algorithms.

The reason why we're going to compile the demo program separately is that demo program uses STL and I don't want to find versions of STL that are compatible with these older compilers.

Note: I'm aware that this is not a rigorous or a scientific way to analyze the law; the law itself is also a bit tongue in cheek so who cares? Don't read too much into the results.

Let's go!

Building library code

I've downloaded a binary release of LLVM 2.7 from https://releases.llvm.org/; LLVM 11 comes with Ubuntu 20. I'm running everything using WSL2 on a Linux partition to make sure the performance numbers are representative of real hardware.

Each compiler is used to build all meshoptimizer source (8.5 KLOC) as a single translation unit to simplify the build process, in four configurations: -O0, -Os -DNDEBUG, -O2 -DNDEBUG and -O3 -DNDEBUG.

Build time comparison:

Build LLVM 2.7 LLVM 11
O0 0.236s 0.267s (+13%)
Os 0.540s 0.992s (+84%)
O2 0.618s 1.350s (+118%)
O3 0.658s 1.443s (+119%)

Object size comparison:

Build LLVM 2.7 LLVM 11
O0 229.5 KB 215.3 KB (-6%)
Os 80.9 KB 74.4 KB (-8%)
O2 86.2 KB 106.9 KB (+24%)
O3 85.5 KB 111.9 KB (+30%)

Based on this analysis we can observe that the debug compilation throughput was not impacted very significantly - over 10 years of development time clang+llvm got 15% slower in debug builds, which is not surprising and not particularly alarming. Release mode, however, is noticeably slower - 2.2x slower in O2/O3.

In terms of output size, the numbers look healthy - O2/O3 builds got ~25-30% larger but that by itself isn't a problem as long as we see matching performance increases - in Os, where size is important, the binary got 8% smaller.

Runtime: basics in O0/Os/O2/O3

The problem when comparing runtime is that it's not clear what specific build we need to compare, and what code we need to benchmark. meshoptimizer comes with lots of algorithms that have various performance characteristics. It would be interesting to analyze all of them, but since this article doesn't promise to be scientific, we're going to pick a few algorithms and measure them in all build configurations, and then select one configuration to dig into Proebsting's law further.

To get a basic understanding, let's pick just three algorithms - vertex cache optimization, simplification and index decompression. We're going to look closer into performance of other algorithms later, but it would be good to get a sense of the differences between the versions on a small set.

Vertex cache optimization:

Build LLVM 2.7 LLVM 11
O0 506ms 482ms (-5%)
Os 176ms 167ms (-5%)
O2 175ms 181ms (+3%)
O3 174ms 183ms (+5%)

Simplification:

Build LLVM 2.7 LLVM 11
O0 761ms 741ms (-3%)
Os 376ms 335ms (-11%)
O2 379ms 325ms (-14%)
O3 366ms 318ms (-13%)

Index decompression:

Build LLVM 2.7 LLVM 11
O0 21.3ms 18.9ms (-11%)
Os 7.0ms 4.6ms (-34%)
O2 5.1ms 4.6ms (-9%)
O3 5.2ms 4.6ms (-12%)

The picture that is beginning to emerge here seems rather grim. We see speedups in the 10-15% range in optimized builds, with an exception of index decompress in Os that seems more like an outlier, where likely -Os inlining heuristics in LLVM 11 result in the same code across different optimization levels; we also see speedups in the 5% range in unoptimized builds.

Now, it's important that in addition to the disclaimer about the comparison not being particularly scientific the reader also understands one extra detail - all algorithms in meshoptimizer are carefully optimized. This isn't a run-of-the-mill C++ code - this is the code that was studied under various profilers and tweaked until, while it remained reasonably concise, the performance was deemed worthy.

It is possible in theory that code that's less carefully optimized exhibits different behavior, or that the benchmarks chosen here are simply not as amenable to compiler optimization as they could be - the lack of prominent difference between different optimization levels is also noteworthy (although O3 in particular has been stufied before in academic research and the value of that mode was inconclusive).

To try to get a more complete picture, let's now look at more algorithms and compare them in O2 build only.

Runtime: algorithms in O2

We're going to first take a look at a more complete set of algorithms from meshoptimizer library; this isn't every single algorithm in existence as some of the algorithms have performance characteristics that aren't very distinct compared to other algorithms already presented here. This also excludes vertex decompression which is going to be mentioned separately.

Algorithm LLVM 2.7 LLVM 11
Reindex 92ms 86ms (-7%)
Cache 175ms 183ms (+4%)
CacheFifo 49ms 48ms (-2%)
Overdraw 57ms 52ms (-8%)
Stripify 46ms 36ms (-20%)
Meshlets 519ms 545ms (+5%)
Adjacency 250ms 188ms (-25%)
Simplify 380ms 323ms (-15%)
SimplifySloppy 61ms 45ms (-26%)
SpatialSort 22ms 19ms (-14%)
IndexEncode 29ms 26ms (-11%)
IndexDecode 5.2ms 4.6ms (-12%)

Overall the picture here is not very different from what we've already established - LLVM 11 seems to produce code that's 10-15% faster on most benchmarks. There are a couple outliers where the performance gain is more substantial, up to 25%, and a couple benchmarks where LLVM 11 actually generates consistently slower code, up to 5% - this is not a measurement error.

I've reran the outliers using -O3 with the following results, that made the gap a bit less wide but still substantial:

Algorithm LLVM 2.7 LLVM 11
Stripify 44ms 35ms (-20%)
Adjacency 212ms 174ms (-18%)
SimplifySloppy 52ms 44ms (-15%)

These gains are certainly welcome, although it is unfortunate that they seem to come at the cost of 2x slower compilation. This takes me back to "The death of optimizing compilers" by Daniel J. Bernstein - I wonder if there's a happier middle ground that can be found, one where the compiler gives more control over optimization decisions to the developer and allows tuning the code to reach gains that can be seen here at a lower complexity and compilation performance cost.

Runtime: SIMD

All of the algorithms presented before were scalar, implemented using portable C++. While portions of some of these can be vectorized in theory, in practice clang 11 even at -O3 struggles with generating efficient SIMD code for most/all of them.

meshoptimizer does have several algorithms that have first-class SIMD versions, implemented using SSE/NEON/Wasm intrinsics. Their performance was compared using codecbench, a utility that comes with meshoptimizer and outputs performance in GB/sec - so the numbers in the following tables are reversed, larger is better.

Algorithm LLVM 2.7 LLVM 11
vertex decode 2.3 GB/s 3.0 GB/s (+30%)
filter-oct8 2.6 GB/s 2.8 GB/s (+8%)
filter-oct12 4.1 GB/s 4.2 GB/s (+2%)
filter-quat12 2.4 GB/s 2.6 GB/s (+8%)
filter-exp 13.2 GB/s 13.6 GB/s (+3%)

All of the filters are typical SIMD streaming kernels - there's no branches or complex data dependencies. Perhaps unsurprisingly, the delta in performance of the compiled code is thus not very significant. The vertex decode is substantially more complicated - it contains function calls, branches, mix of scalar and vector instructions and in general can be more challenging for the optimizer.

It's worth noting that on this particular example, using -O3 with LLVM 2.7 brings the performance up from 2.3 GB/s to 2.7 GB/s, while having no effect on LLVM 11 - bringing the delta between LLVM 11 and LLVM 2.7 back to ~10% range.

It's undoubtedly possible to find examples of loops that LLVM 2.7 couldn't vectorize (by virtue of not having an autovectorizer) and LLVM 11 can - unfortunately, my experience even on streamlined kernels like the aforementioned filters force me to maintain a deep distrust towards the auto-vectorizer (out of the 4 filter kernels above, clang 11 can not vectorize even a single one, and gcc 10 can only vectorize 'exp' - one out of 4). I would claim that any gains due to auto-vectorization can't be counted as significant until programmers are given better tools to make these optimizations more predictable and reliable.

Conclusion?

The overall picture seems to be as follows.

LLVM 11 tends to take 2x longer to compile code with optimizations, and as a result produces code that runs 10-20% faster (with occasional outliers in either direction), compared to LLVM 2.7 which is more than 10 years old. This may be a general rule, something specific to highly tuned code, or something specific to meshoptimizer algorithms.

Without spending more than an evening it's hard to disambiguate the reasons. And this post definitely doesn't pretend to be a thorough research - it's just a fun little study of how competitive clang 2.7 looks like in 2021. Without a doubt, the amazing community behind LLVM didn't spend the last decade for naught - but if you still believe in the sufficiently smart optimizing compiler, it may be time to reconsider the extent to which you can rely on the compiler to make your code faster year after year, as if anything Proebsting's law should probably be reformulated as:

Compiler Advances Double Computing Power Every 50 Years, And The Interval Keeps Growing

It's important to recognize that there are many forces that together define the rate at which software performance changes - between hardware getting faster (yes, even in the last 10 years, despite what Herb Sutter's "Free Lunch Is Over" would make you believe), compilers getting better, software development practices frequently getting out of hand and a large discrepancy between the expertise of the software developers wrt optimization, compiler advances are just one, rather small, piece of the puzzle. Perhaps Daniel Bernstein was right after all.

@eyalroz
Copy link

eyalroz commented Jan 31, 2021

But were you using -march=native -mtune=native?

@hbayindir
Copy link

But were you using -march=native -mtune=native?

It's not practical to use -march and -mtune if you're distributing the closed source binary at the end of the day, unless you have specific builds for every processor generation or ability to dictate the hardware that the binary will run.

@zeux
Copy link
Author

zeux commented Jan 31, 2021

Not to mention that clang doesn't support mtune until clang 12.

@alistra
Copy link

alistra commented Jan 31, 2021

Was the binary release for 2.7 bootstrapped from 2.7? Chances are they may have recompiled it with a newer version.

Maybe it is worth to retry on bootstrapped compilers (build a stage 3 yourself)?

@zeux
Copy link
Author

zeux commented Jan 31, 2021

clang 2.7 binaries are dated April 18 2010. I don't know if they were compiled by gcc or LLVM backend, but either way that's only going to impact compilation speeds and only by so much. I'm not adventurous enough to attempt to rebootstrap a 10-year old compiler...

@AntonErtl
Copy link

Proebsting was not serious about the numbers, and his point was, in his own words: "Perhaps this means Programming Language Research should be concentrating on something other than optimizations. Perhaps programmer productivity is a more fruitful arena."

In any case, thank you for showing again that optimization advanced even more slowly than Proebsting claimed.

@Reflexe
Copy link

Reflexe commented Jan 31, 2021

Thanks for the post. The only note i have is that testing a very optimized sw could be misleading when benchmarking compilers.
Also, from my experience, most code compiled in the real world is also unoptimized (e.g. optimized for being readable rather being as fast as possible).

@AntonErtl
Copy link

Sounds plausible. In my own testing (Fig. 1 of What every compiler writer should know about programmers), the least programmer-optimized variants of the tsp program (tsp1-tsp3) do indeed see quite a bit of speedup from gcc-2.7.2.3 -O3 (1997) to gcc-5.2.0 -O3 (2015), while the differences are pretty small from tsp4 onwards. clang-3.5 -O3 (2014) was about a factor of 2 slower then gcc-2.7.2.3 -O3 for tsp1-tsp2, and as fast as gcc-5.2 for tsp3, whatever you want to make out of that.

While most code may not be programmer-optimized, the performance of most code is irrelevant; and typically, if performance is relevant, programmers tend to optimize for performance.

@eyalroz
Copy link

eyalroz commented Jan 31, 2021

While most code may not be programmer-optimized, the performance of most code is irrelevant; and typically, if performance is relevant, programmers tend to optimize for performance.

Programmers, especially in C++, often rely on the compiler to optimize certain things away which, if kept as-is, would tank performance. So, compilers must at the very least "optimize the obvious" - and they still don't get all of those cases. Sometimes the language or the ABI prevent them from doing so, sometimes - it just hasn't happened yet. I've filed a few perf bugs about such issues myself, against GCC mostly - and I'm no compilation expert.

@AntonErtl
Copy link

Sure, if the compilers offer reliable optimizations, it's ok for programmers to rely on them. But when it comes to code where performance is relevant, you find out that compiler optimizations are hit-and-miss. Even if a compiler can perform the optimization in principle, there may have some reason (maybe in your code) or other that prevents it from performing it (cf. zeux' comments on auto-vectorization). You can now try to flatten these bumps until the compiler under consideration performs the desired optimization, but another compiler version might stumble over something else. Or you write the code in a (typically more low-level) way that does not demand as much of the compiler. The latter is easier and more reliable in my experience.

@AntonErtl
Copy link

AntonErtl commented Feb 1, 2021

Actually, low-level is not the issue, explicitness is. In particular, wrt vectorization, I argue for an explicit, but more high-level approach than auto-vectorization: See section 2 of Software Vector Chaining.

@toddlipcon
Copy link

It would be interesting to try this analysis on a more "typical" program - eg Chrome or something. I'd wager that your meshoptimizer library has already been significantly hand-tuned to be "easy to optimize", and wouldn't take really benefit from significant improvements in code layout, PGO/LTO, devirtualization, etc that help bigger bloated programs quite a bit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment