This article was first written during the Unity Fee-asco of 2023. As such, many Unity ECS community members were seeking alternative high-performance solutions, and were asking me about potential places to migrate some technology I have developed.
Unfortunately, most of the ECS solutions were inadequate for facilitating the technology, yet it hasn’t been easy to explain why. That’s why this series exists. I’ll explain several ECS concepts that are often overlooked in many of the ECS projects I have seen. Many of these concepts will help your ECS not just thrive in tech demos, but achieve next-level performance in real large-scale games with high complexity.
- Part 1 – Memory Matters
- Part 2 – High-Ratio Optimizations
- Part 3 – Relationships
First off, you might have heard that you should read adjacent values in memory for better cache efficiency, which is faster. But if you don’t know what a cache line is, what a prefetcher is, or how many of each there are, it is best that you study up on these things. There are lots of resources out there on this topic. The 2D Arrays section of this article is one I wrote that briefly covers it.
There are a few takeaways from this. First, make sure that when iterating through entities, you are iterating through the actual component values, not indices into another array. The latter is no different from an OOP pointer lookup. Second, your ECS should allow iterating through multiple components in tandem, because this allows expressing operations that connect various engine features in a very fast way. That last part is important for achieving scale in complexity in real games.
This does eliminate many types of ECS implementations. An archetype ECS with multi-component iteration is likely the way to go here. That’s not to say there aren’t other ways to achieve this, but I’ll be assuming this type of ECS for the remainder of the series.
If you already have this type of ECS, great! But you haven’t escaped my criticism yet. This next one is brutal. Not even your “recommended practices” guide you wrote for your users is safe.
Whereas spatial locality refers to adjacent addresses in memory being good for cache, temporal locality is all about how long that data stays in cache. A cache line stays in cache until a newly requested cache line kicks it out (referred to as eviction).
In an ECS, because you are iterating over many entities and loading many different component values, you are going to be making lots of cache line requests that evict other cache lines. Memory can cycle in and out of cache very quickly, which can easily introduce memory bandwidth bottlenecks. The solution is to load your components fewer times.
For example, let’s say you have the components Acceleration, Velocity, and Position. You have one system that reads Acceleration and writes to Velocity. And you have another system that reads Velocity and writes to Position. Unless you have some system in between these that further modifies Velocity, this is wasteful. You are loading Velocity into cache twice, as it is probably getting evicted by other systems, or even by the sheer number of entities being processed. If your system instead writes to Position for each entity immediately after calculating the new Velocity, you only have to load Velocity once. Thus, bandwidth is decreased and becomes less of a bottleneck slowing your game down.
The takeaway here is that overly granular systems can hurt performance. You have to consider the operations your game actually needs, and be willing to refactor, combining or splitting systems, to achieve the right balance of performance and flexibility.
This also means that some systems will want to read more and more components, so you should consider combining components as well, especially when those components are only used by a single system.
Lastly, smaller data sizes can help a lot, even if they cost extra CPU instructions to convert. Benchmark within the full context of your game to see.
So yeah. Check your examples to see if you are encouraging good practices or not.
Now we get into the big offender. Let’s talk about temporary allocations. For this, I’ll propose an example.
Suppose for each entity you need to perform a raycast query against some sort of spatial structure, and each hit provides back a RaycastResult. You need a temporary list to store these, so you allocate one. But then, as you iterate each entity, you keep reallocating this list. Depending on the allocation strategy, you may be hitting a very slow allocator that has to switch into the OS’s privileged mode of execution, both to allocate when you start using the list and to free it when you stop. That ain’t good, so to fix it, you might instead use a bump allocator that just keeps allocating and then frees the memory all at once later (or recycles it). But now you’ve introduced a new problem: each allocation uses new memory addresses, and that means new cache lines that have to evict old cache lines. That chews up bandwidth.
What you really want to do is reuse that memory for every entity, and just clear the list each time. If your ECS can’t facilitate a means to do that, you better fix that!
But now, let’s suppose you took this and the previous advice to heart: after processing all the raycasting for a single entity and no longer needing the list, in the same system you then decide to perform distance queries against the spatial structure, which provide back DistanceResult values. Now you need a new list for this, but it would be a lot nicer if you could use the same list memory as the raycasts, so that you suffered less cache eviction.
Depending on your language of choice, good luck!
The most common game development language is C++. In that language, a developer might cast the pointer to the list memory into a new type. And while most of the time the code will work correctly, it will be completely by luck because such behavior is undefined.
C++ has a rule called “strict aliasing” which basically states that once an allocated memory address has some instance of some data type placed at it, the compiler can assume that the type at that address never changes until the address is freed again. If it were an int, it will always be an int.
Take this code and compile it with GCC 6.4 using -O3 -std=c++14, and then again with GCC 7.5.
```cpp
#include <iostream>
#include <memory>

struct PreventInlines
{
    virtual float* convertVoidToFloatPtr(int* address)
    {
        return reinterpret_cast<float*>(address);
    }
};

struct DerivedPreventer : public PreventInlines
{
    float* convertVoidToFloatPtr(int* address) override
    {
        return reinterpret_cast<float*>(address);
    }
};

std::unique_ptr<PreventInlines> makePreventer(const int* ptr)
{
    return std::make_unique<DerivedPreventer>();
}

int main()
{
    int arrayPtr[2];
    arrayPtr[0] = 5;
    auto preventInlines = makePreventer(arrayPtr);
    auto floatPtr = preventInlines->convertVoidToFloatPtr(arrayPtr);
    floatPtr++;        // floatPtr now aliases arrayPtr[1]
    arrayPtr[1] = 10;
    *floatPtr = 0.0f;  // float write over int storage: strict aliasing violation
    std::cout << arrayPtr[0] << " , " << arrayPtr[1] << std::endl;
    return 0;
}
```
GCC 6.4 prints:

```
5 , 10
```

Whereas GCC 7.5 prints:

```
5 , 0
```
This is not a bug in GCC 6.4; it just happens that GCC 7.5 is better at optimizing and arrives at the result of what actually happens at runtime. But the reason they differ is strict aliasing. The whole behavior is undefined, and if we added a bunch of other code, we could get a 10 in the output even with newer compilers.
That’s a problem when we want to recycle memory. And while I highlighted the issue with temporary allocations, the same problem arises if you try to reuse memory allocations for different entity archetypes or other similar potential optimizations.
Oddly enough, unmanaged C# doesn’t have a strict aliasing rule, and consequently doesn’t run into this problem. Unity’s ECS code leverages this in quite a few circumstances, and Burst allows for explicitly specifying whether or not pointers are the same, different, or potentially either, giving control of optimization to the developer.
The last bit about memory in an ECS is that you should strive to expose it directly to the user as a low-level API option. This gives the user the freedom to reason about multiple Entities at once using SIMD. It also lets the user avoid loading some component instances into the cache if they are conditionally not needed.
So, how good is your ECS? Is it friendly to the memory ninjas? Or are you getting owned by your language of choice and encapsulation?
In the next article, we’ll cover high-ratio optimizations and uncover how your filtering features should be more than just user convenience.
I'm out of touch with what the best introductory material is, but for a technical read, this is one of the best resources out there. Despite its age, most of the information there is still relevant. https://people.freebsd.org/~lstewart/articles/cpumemory.pdf