Your ECS Probably Still Sucks - Practical Tips for a Next Level ECS

Your ECS Probably Still Sucks: Part 1 – Memory Matters

This article was first written during the Unity Fee-asco of 2023. As such, many Unity ECS community members were seeking alternative high-performance solutions, and were asking me about potential places to migrate some technology I have developed.

Unfortunately, most of the ECS solutions were inadequate for facilitating the technology, yet it hasn’t been easy to explain why. That’s why this series exists. I’ll explain several ECS concepts that are often overlooked in many of the ECS projects I have seen. Many of these concepts will help your ECS not just thrive in tech demos, but achieve next-level performance in real large-scale games with high complexity.

Outline

  • Part 1 – Memory Matters
  • Part 2 – High-Ratio Optimizations
  • Part 3 – Relationships

Spatial Locality

First off, you might have heard that you should read adjacent values in memory for better cache efficiency, which is faster. But if you don’t know what a cache line is, or what a prefetcher is, or how many of them there are, it is best that you study up on these things. There are lots of resources out there on this topic. The 2D Arrays section of an article I wrote briefly covers it.

There’s a few takeaways from this. First, make sure that when iterating through entities, you are iterating through actual component values, and not indices into another array. The latter is no different than an OOP pointer lookup. Second, your ECS should allow iterating through multiple components in tandem at the same time, because this allows expressing game complexity operations between various engine features in a very fast way. That last part is important for achieving scale in complexity for real games.

This does eliminate many types of ECS implementations. An archetype ECS with multi-component iteration is likely the way to go here. That’s not to say there aren’t other ways to achieve this, but I’ll be assuming this type of ECS for the remainder of the series.

If you already have this type of ECS, great! But you haven’t escaped my criticism yet. This next one is brutal. Not even your “recommended practices” guide you wrote for your users is safe.

Temporal Locality

Whereas spatial locality refers to adjacent addresses in memory being cache-friendly, temporal locality is all about how long data stays in cache. A cache line stays in cache until a newly requested cache line kicks it out (referred to as eviction).

In an ECS, because you are iterating over a lot of entities and loading lots of different component values, you are going to be making lots of cache line requests that evict other cache lines. This causes memory to cycle in and out of cache potentially very fast, and can easily introduce memory bandwidth bottlenecks. The solution is to load your components into cache fewer times.

For example, let’s say you have the components Acceleration, Velocity, and Position. You have one system that reads Acceleration and writes to Velocity. And you have another system that reads Velocity and writes to Position. Unless you have some system in-between these that further modifies Velocity, this is wasteful. You are loading Velocity into cache twice, as it is probably getting evicted by other systems, or even by the sheer number of entities being processed. If your system instead writes to Position for each entity immediately after calculating the new Velocity, you only have to load Velocity once. Thus, bandwidth is decreased and becomes less of a bottleneck slowing your game down.
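A minimal sketch of the combined approach, assuming components are exposed as flat arrays per archetype (the system functions are hypothetical):

struct Acceleration { float x, y, z; };
struct Velocity     { float x, y, z; };
struct Position     { float x, y, z; };

// Granular version: two systems, so Velocity streams through cache twice.
void accelerationSystem(Velocity* v, const Acceleration* a, int count, float dt)
{
    for (int i = 0; i < count; i++)
    {
        v[i].x += a[i].x * dt;
        v[i].y += a[i].y * dt;
        v[i].z += a[i].z * dt;
    }
}

void velocitySystem(Position* p, const Velocity* v, int count, float dt)
{
    for (int i = 0; i < count; i++)
    {
        p[i].x += v[i].x * dt;
        p[i].y += v[i].y * dt;
        p[i].z += v[i].z * dt;
    }
}

// Combined version: each Velocity is loaded once, updated, consumed for the
// Position update, and written back while it is still in cache.
void motionSystem(Position* p, Velocity* v, const Acceleration* a, int count, float dt)
{
    for (int i = 0; i < count; i++)
    {
        v[i].x += a[i].x * dt;
        v[i].y += a[i].y * dt;
        v[i].z += a[i].z * dt;
        p[i].x += v[i].x * dt;
        p[i].y += v[i].y * dt;
        p[i].z += v[i].z * dt;
    }
}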

The takeaway here is that overly granular systems can hurt performance. You have to consider the operations your game actually needs, and be willing to refactor, combining or splitting systems to achieve the right balance of performance and flexibility.

This also means that some systems will want to read more and more components, so you should consider combining components as well, especially if said components are only used by a single system.

Lastly, smaller data sizes can help a lot, even if they cost more CPU instructions to convert them. Benchmark it within the full context of your game to see.

So yeah. Check your examples to see if you are encouraging good practices or not.

Temporary Memory

Now we get into the big offender. Let’s talk about temporary allocations. For this, I’ll propose an example.

Suppose for each entity you need to perform a raycast query against some sort of spatial structure, and each hit provides back a RaycastResult. You need a temporary list to store these, so you allocate one. But then, as you iterate each entity, you keep reallocating this list. Depending on the allocation strategy, you may be hitting a very slow allocator that has to switch into the OS’s privileged execution mode, both to allocate when you start using the list and to free when you stop. That ain’t good, so to fix it, you might instead use a bump allocator that just keeps allocating and then frees the memory all at once later (or recycles it). But now you’ve introduced a new problem: each allocation uses new memory addresses, and that means new cache lines that have to evict old cache lines. That chews up bandwidth.

What you really want to do is reuse that memory for every entity, and just clear the list each time. If your ECS can’t facilitate a means to do that, you better fix that!
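A minimal sketch of the hoist-and-clear pattern, where SpatialStructure and raycastAll are hypothetical stand-ins for whatever your engine provides:

#include <vector>

struct SpatialStructure;  // some spatial acceleration structure
struct RaycastResult { float distance; };

// Hypothetical query that appends every hit for the given entity to outHits.
void raycastAll(const SpatialStructure& spatial, int entityIndex,
                std::vector<RaycastResult>& outHits);

void processEntities(const SpatialStructure& spatial, int entityCount)
{
    // Hoist the list out of the loop and clear() it each iteration.
    // clear() keeps the capacity, so the same memory (and the same cache
    // lines) get reused for every entity instead of reallocating.
    std::vector<RaycastResult> hits;
    for (int i = 0; i < entityCount; i++)
    {
        hits.clear();
        raycastAll(spatial, i, hits);
        // ... consume hits ...
    }
}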

But now, let’s suppose that you took this and the previous advice to heart and after processing all the raycasting for a single entity and no longer needing the list, in the same system you then decide to perform distance queries against the spatial structure, which provides back DistanceResult values. Now you need a new list for this, but it would be a lot nicer if you could use the same list memory as the raycasts, so that you had less cache eviction.

Depending on your language of choice, good luck!

The most common game development language is C++. In that language, a developer might cast the pointer to the list memory into a new type. And while most of the time the code will work correctly, it will be completely by luck because such behavior is undefined.

C++ has a rule called “strict aliasing” which basically states that once an allocated memory address has some instance of some data type placed at it, the compiler can assume that the type at that address never changes until the address is freed again. If it were an int, it will always be an int.

Take this code, and compile it with GCC 6.4 with -O3 -std=c++14 and then again with GCC 7.5.

#include <iostream>
#include <memory>

struct PreventInlines
{
    // Virtual destructor so destroying the derived instance through the
    // base pointer below is well-defined.
    virtual ~PreventInlines() = default;

    virtual float* convertVoidToFloatPtr(int* address)
    {
        return reinterpret_cast<float*>(address);
    }
};

struct DerivedPreventer : public PreventInlines
{
    float* convertVoidToFloatPtr(int* address) override
    {
        return reinterpret_cast<float*>(address);
    }
};

// Returns the derived converter through a base pointer so the compiler is
// discouraged from inlining the cast and reasoning about it.
std::unique_ptr<PreventInlines> makePreventer(const int*)
{
    return std::make_unique<DerivedPreventer>();
}

int main()
{
    int arrayPtr[2];
    arrayPtr[0] = 5;

    auto preventInlines = makePreventer(arrayPtr);

    // floatPtr aliases arrayPtr, but strict aliasing lets the compiler
    // assume it does not, because the pointed-to types differ.
    auto floatPtr = preventInlines->convertVoidToFloatPtr(arrayPtr);
    floatPtr++;
    arrayPtr[1] = 10;
    *floatPtr = 0.0f;  // undefined behavior: a float write on top of an int

    std::cout << arrayPtr[0] << ", " << arrayPtr[1] << std::endl;
    return 0;
}

GCC 6.4 prints:

5, 10

Whereas GCC 7.5 prints:

5, 0

This is not a bug in GCC 6.4; it just happens that GCC 7.5 optimizes better, and its output matches the order in which the writes actually happen. But the reason they are different is strict aliasing. The whole behavior is undefined, and if we added a bunch of other code, we could get a 10 in the output even with newer compilers.

That’s a problem when we want to recycle memory. And while I highlighted the issue with temporary allocations, this has the same problem if you are trying to reuse memory allocations for different entity archetypes or other similar potential optimizations.

Oddly enough, unmanaged C# doesn’t have a strict aliasing rule, and consequently doesn’t run into this problem. Unity’s ECS code leverages this in quite a few circumstances, and Burst allows for explicitly specifying whether or not pointers are the same, different, or potentially either, giving control of optimization to the developer.

Memory Access

The last bit about memory in an ECS is that you should strive to expose it directly to the user as a low-level API option. This gives the user the freedom to reason about multiple Entities at once using SIMD. It also lets the user avoid loading some component instances into the cache if they are conditionally not needed.
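As a sketch of what that enables, assume a hypothetical Health component and an API that hands the user a chunk’s raw component array; the user can then process four entities per iteration with SIMD (this example assumes an x86 target):

#include <immintrin.h>  // SSE intrinsics

struct Health { float value; };

// The ECS hands over the raw component array for a chunk instead of invoking
// a per-entity callback, so the user is free to vectorize.
void applyPoison(Health* healths, int count, float damage)
{
    __m128 d = _mm_set1_ps(damage);
    int i = 0;
    for (; i + 4 <= count; i += 4)  // four entities per iteration
    {
        __m128 h = _mm_loadu_ps(&healths[i].value);
        _mm_storeu_ps(&healths[i].value, _mm_sub_ps(h, d));
    }
    for (; i < count; i++)          // scalar tail
        healths[i].value -= damage;
}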

Conclusion

So, how good is your ECS? Is it friendly to the memory ninjas? Or are you getting owned by your language of choice and encapsulation?

In the next article, we’ll cover high-ratio optimizations and uncover how your filtering features should be more than just user convenience.

Your ECS Probably Still Sucks: Part 2 – High-Ratio Optimizations

You got your archetype ECS with direct memory access and fast iteration through your data. You have a few systems crunching through a large number of entities. You’ve run your tech demo through the benchmarks, and it looks like you are getting great performance and full CPU utilization. You’ve hit the performance ceiling, right?

Not even close!

There’s still a potential order of magnitude of performance still out there, and you’re going to need it.

Outline

  • Part 1 – Memory Matters
  • Part 2 – High-Ratio Optimizations
  • Part 3 – Relationships

The Filter Is the Problem

Let’s suppose you have a component called MagicNumber whose value ranges between 0 and 9. You have a million entities with this component. This component is a critical concept to your game, as the number completely changes the behavior of the entity. Each behavior is handled by several systems. Consequently, the first thing each of these systems do is check if the magic number is the one they want. Roughly 90% of the time, it isn’t, but the magic number still had to be loaded into cache, and CPU cycles had to be spent checking. Even though we have amazing cache efficiency, we’re still being wasteful.

We could have a list of entities for each magic number, but then all our accesses would require randomly looking up entities, which is usually pretty expensive.

This type of problem doesn’t just apply to this scenario. Any “sparse” event that can happen in a frame and affects only a small subset of entities can run into it. This is pretty common, especially as games grow in complexity. Sometimes brute-forcing it is good enough, and can be faster than the list-based approach. But what if we could do even better?

Queries Got It Right

In an ECS, when you perform a query, you don’t have to check every entity to see if it has the components you are after. Instead, the query is able to find the selected entities quickly based on grouping patterns (in our case, archetypes). We are able to skip over entire groups of entities just by recognizing their archetype doesn’t match.

But as has been shown, archetypes aren’t the only thing we need to filter by with our logic. If your ECS doesn’t have tools for efficiently skipping over groups of entities based on other considerations, you will want to fix that. The rest of this article will show you some of the mechanisms you can add that will help accomplish this.

Lastly, remember your goal isn’t to perform all this skipping for the user, but rather provide the user with the tools for them to do these optimizations.

Value Groups

In addition to grouping entities by archetype, another thing you can do is group entities by specific component values. These components need to be special, as changing their values would have to trigger regrouping of entities. If in our MagicNumber example, the magic numbers rarely or never change, then this technique could be applied to that problem.

Unity’s ECS calls these “shared components”.

Slice Aggregate Values

What happens if the sparse set of entities we are searching for can’t easily be grouped without costing major organizational overhead? We need a technique to skip over entities without reordering them.

Pick a power of 2. I like 64. Now let’s say that for every 64 entities, there’s a bool that specifies whether any of those entities meet some search criteria. The criteria match very sparsely, so at least half of these bool values are false. Instead of checking every entity, we can first check this bool, and if it is false, skip 64 entities. If none of the entities meet the criteria, then we are doing 64 times less work!

It may still technically be an O(n) algorithm in terms of complexity, but this two-level hierarchy concept where we reason about a fixed length slice of adjacent entities at once can offer massive speedups. Of course, this only works if the length of the slice is small enough to likely skip entities, but still large enough to skip a lot of work.

Ideally, the bool values we are using as a fast filter are all adjacent in memory so that we can blast through them quickly. However, they don’t have to just be bools either. They could be min or max values, bounding boxes, sums, or even bitmasks.
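Here is the bool variant as a minimal sketch; the aggregate array and the processEntity callback are hypothetical:

#include <cstddef>

constexpr std::size_t kSliceSize = 64;

// Hypothetical per-entity work, only invoked for slices that pass the filter.
void processEntity(std::size_t entityIndex);

void processSparse(const bool* sliceHasCandidates, std::size_t entityCount)
{
    for (std::size_t base = 0; base < entityCount; base += kSliceSize)
    {
        // One load and compare skips 64 entities at a time.
        if (!sliceHasCandidates[base / kSliceSize])
            continue;

        std::size_t end = base + kSliceSize < entityCount ? base + kSliceSize
                                                          : entityCount;
        for (std::size_t i = base; i < end; i++)
            processEntity(i);  // the per-entity criteria check still runs here
    }
}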

Unity provides this type of potential optimization via “chunk components”. In Unity, entities and their components are stored in “chunks”, where each chunk can hold up to 128 entities. However, chunk components are actually stored in separate “meta chunks”. There are several other ways to implement something like this, and perhaps I’ll propose what I would do in a future article.

Change Detection

When it comes to detecting changes, I see a lot of ECS implementations keep some change tracker per entity. Some of them use bool values, which only works if only one system is allowed to write to the component. Most implementations with this feature use versioning integers and rollover-aware comparisons.

However, doing this per entity has a per-entity cost. Sometimes, it is appropriate. But for many cases, using the previous technique with a change version per slice can have significant benefits, especially for lots of simple ECS systems.

What makes version numbers slightly different is that there’s more of an incentive to make them automatic within the ECS implementation. I’d also encourage a special version number for detecting if any of the entities in the chunk changed archetype. Unity’s ECS has chunk per-component version numbers as well as an archetype version number.
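A rollover-aware comparison fits in a couple of lines. This sketch is in the same spirit as Unity’s approach, though the exact form here is my own:

#include <cstdint>

// Rollover-aware "newer than" test: subtracting in unsigned space and
// reinterpreting as signed keeps the comparison correct even after the
// 32-bit counter wraps, as long as the two versions are within 2^31
// increments of each other.
bool didChangeSince(uint32_t sliceVersion, uint32_t lastSystemVersion)
{
    return static_cast<int32_t>(sliceVersion - lastSystemVersion) > 0;
}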

Granular Bit Filtering

So far, we mostly covered the case where we can skip over a whole slice of entities. But what happens if every single slice has on average 2 entities we are looking for? If our slice was 64 entities wide, there’s still potentially a 32X performance gap to close. Let’s close it!

Suppose for each slice, we have a 64 bit integer. However, we will really use this as a bitfield, where each entity in the slice corresponds to one of these bits. How many loop iterations would it take to count the number of entities that have their bit set in a single slice? And if there are 3 bits set, how many iterations would it take to find and process those 3 entities?

The answers: 1 and 3 respectively.

Modern processors have special hardware for dealing with problems formatted this way, allowing the processor to instantly answer questions like:

“How many bits are set?”

“Is this bit set?”

“What is the index of the first bit set?”

In addition, there’s hardware for flipping either all the bits at once or individual bits.

This already plays nicely with the slice value optimizations discussed earlier. However, an additional specialization that can be added to an ECS is to allow these bits to move with their associated entities when regrouping occurs.

Unity uses the IEnableableComponent interface for this, though at the time of writing, their implementation is quite unpolished.

Parallelism

This concept of splitting everything into slices has an additional benefit: it offers natural boundaries for parallelizing algorithms. It is quite common for an operation to require one particular step to be single-threaded, and if that step has to iterate through every entity, it can become a bottleneck. By using the above techniques, you can drastically reduce the number of iterations the single-threaded part has to run through, while all the other parts of the algorithm use all the CPU cores to crunch through the entities.

Conclusion

So how good is your ECS? Does it have great skip opportunities for the speedrunners? Or has your iteration devolved into an autoscroller?

In the next article, we’ll cover relationship caching and explore why static archetypes might not be as good as you think.

Your ECS Probably Still Sucks: Part 3 – Relationships

Relationships are that thing that ECS architectures are “bad at”, unless they are built in, in which case the implementation provides a false sense of performance when in actuality it is no better than OOP relationships.

Most people don’t actually understand how to handle relationships in DOD. If you are one of those people, we’re about to fix that.

Outline

  • Part 1 – Memory Matters
  • Part 2 – High-Ratio Optimizations
  • Part 3 – Relationships

Relationships ARE Important

A common thing I see people working with an ECS do is try to get rid of relationships as much as possible. But relationships are everywhere. Parent-child relationships allow for complex structures. Skinned meshes need to deform with bones, which may be part of a skeleton that has to know about animation. Physics objects might have joints that drive their interactions. AI needs to know about the various objects that are around it or that it is manipulating. Relationships are often an essential part of fun gameplay. Don’t compromise on fun!

The Secret

It is true that relationships can be one of the most expensive things in an ECS, but most people try to treat the symptom and not the cause. Most people try to remove relationships. But the existence of relationships isn’t the problem. They are just components sitting in memory.

The issue is that a system may need data from both ends of the relationship at the same time. And this leads to one end of the relationship being accessed in a cache-unfriendly way. This is what you want to reduce.

When optimizing relationships, focus less on optimizing data, and more on optimizing data flow.

Archetypes Matter

Sometimes relationships are persistent, and sometimes they are temporary and dynamic. Depending on the game, any entity could have hundreds of different possible relationships, but few will actually have them because such relationships are driven by player actions. Additionally, when relationships do exist, they tend to exist for multiple frames at a time.

Because of this, it is highly recommended that your archetype ECS support archetype changes on entities for forming and breaking relationships. Yes, such changes can be expensive. But not allowing changes means that the entity has to be prepared for every possible relationship, which drastically increases memory usage and polling.

Relationships tend to require special logic when first formed and when broken. Perhaps this is even the only logic associated with the relationship. There’s no reason to run this logic every frame. Instead, use addition and removal of components to track these changes. You also want special types of components that linger around when an entity is destroyed, so that a system can react to the change.

Additionally, if your ECS allows cloning an existing entity, you want to reform relationships with the new entity, but if you clone all the components exactly, you won’t be able to detect this. Therefore, you should define components that never get cloned. Having these be the same as the ones that linger around is perfectly suitable. Unity defines cleanup components for these purposes.

Sticky Value Caching

Let’s suppose we have EntityA which has a component called StickyValue. Now we have EntityB which has a relationship with EntityA. There’s a system that needs several of EntityB’s components along with EntityA’s StickyValue. How do we optimize this?

The real question is, how do we make it so that we need StickyValue less often?

The easiest solution would be to put the fetching of StickyValue inside a branch. Sure, branches slow things down a little, but they are still way better than a cache miss.

But the system might need StickyValue every frame, because it is part of some formula involving other components on EntityB that change every frame. If StickyValue changes every frame and this system is the only system that reads it, fetching it may be the best option.

Most of the time when this situation comes up, StickyValue doesn’t change every frame. Sure, it might even be written every frame, but we can still drastically improve performance.

The key here is that EntityB doesn’t have any information about the state of StickyValue until it needs it, but the system that processes EntityA to write StickyValue can know everything. Therefore, instead of making it EntityB’s responsibility to retrieve StickyValue, it should be EntityA’s responsibility to push StickyValue, and only when it differs from the previous frame, thus reducing the number of frames on which the random access occurs.

This means that when EntityB forms a relationship with EntityA, EntityA learns that EntityB depends on StickyValue, and a component on EntityA stores this info. When a system writes to StickyValue, it first checks whether the new StickyValue differs from the old one. Only then does it look up EntityB and copy StickyValue over. This means EntityB needs a component to cache StickyValue. Call it StickyValueCache.
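A minimal sketch of the push, where every name is a hypothetical stand-in following the example above:

#include <cstdint>

// All names here are hypothetical, following the example in the text.
struct StickyValue      { float value; };
struct StickyValueCache { float value; };       // lives on EntityB
struct StickyObserver   { uint32_t entityB; };  // lives on EntityA; set when
                                                // the relationship forms

// Hypothetical random-access write to another entity's component.
void writeComponent(uint32_t entity, StickyValueCache newValue);

// The system that owns StickyValue pushes only on an actual change, so the
// cache-unfriendly lookup of EntityB happens on very few frames.
void updateStickyValue(StickyValue& sticky, const StickyObserver& observer,
                       float newValue)
{
    if (sticky.value == newValue)
        return;  // common case: no change, no random access
    sticky.value = newValue;
    writeComponent(observer.entityB, StickyValueCache{ newValue });
}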

Now the system that reads EntityB and StickyValue never needs to transfer data across the dependency, because the data is already in a component that is being loaded in a cache-efficient way.

As for the system that writes to StickyValue on EntityA, while it does get a little more expensive, most of the time, it detects no changes to StickyValue and never needs to load EntityB, which means most of the time this system is also exclusively using a cache-efficient code path.

Use Other Tools Too

It is important to consider the tools that have already been discussed when working with relationships. Good filtering can reduce relationship polling. Using 1 bit components can improve the chances of accidental cache hits just by having so many different component instances represented in a single cache line (512 for 64 byte cache lines).

Also, relationships are often better expressed implicitly inside other data structures, such as bounding-volume hierarchies or grids. Make sure your ECS has a way to store these data structures.

Reliability

When it comes to relationships, it is very important that the execution order is predictable and reliable, otherwise debugging issues can be an absolute nightmare. Thread-safety isn’t enough. You want some level of determinism to ensure that assumptions made on either side of the relationship hold true so that you can skip passing data across all the time.

Conclusion

So how good is your ECS? Is your communication clean? Or are your entities shooting blind?

This concludes discussion regarding ECS architectures. Future articles will discuss various facets of a game engine powered by an ECS and the considerations required for designing such an engine.

From the comments: “Can I ask you for some other resources to better understand how caches work?”

I'm out of touch with what the best introductory material is, but for a technical read, this is one of the best resources out there. Despite its age, most of the information there is still relevant. https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
