Your ECS Probably Still Sucks - Practical Tips for a Next Level ECS

Your ECS Probably Still Sucks: Part 1 – Memory Matters

This article was first written during the Unity Fee-asco of 2023. As such, many Unity ECS community members were seeking alternative high-performance solutions, and were asking me about potential places to migrate some technology I have developed.

Unfortunately, most of the ECS solutions were inadequate for facilitating the technology, yet it hasn’t been easy to explain why. That’s why this series exists. I’ll explain several ECS concepts that are often overlooked in many of the ECS projects I have seen. Many of these concepts will help your ECS not just thrive in tech demos, but achieve next-level performance in real large-scale games with high complexity.

Outline

  • Part 1 – Memory Matters
  • Part 2 – High-Ratio Optimizations
  • Part 3 – Relationships

Spatial Locality

First off, you might have heard that you should read adjacent values in memory for better cache efficiency, which is faster. But if you don’t know what a cache line is, or what a prefetcher is, or how many of them there are, it is best that you study up on these things. There are lots of resources out there on this topic. The 2D Arrays section of an article I wrote briefly covers it.

There’s a few takeaways from this. First, make sure that when iterating through entities, you are iterating through actual component values, and not indices into another array. The latter is no different than an OOP pointer lookup. Second, your ECS should allow iterating through multiple components in tandem at the same time, because this allows expressing game complexity operations between various engine features in a very fast way. That last part is important for achieving scale in complexity for real games.

This does eliminate many types of ECS implementations. An archetype ECS with multi-component iteration is likely the way to go here. That’s not to say there aren’t other ways to achieve this, but I’ll be assuming this type of ECS for the remainder of the series.

If you already have this type of ECS, great! But you haven’t escaped my criticism yet. This next one is brutal. Not even your “recommended practices” guide you wrote for your users is safe.

Temporal Locality

Whereas spatial locality refers to adjacent addresses in memory being cache-friendly, temporal locality is all about how long data stays in cache. A cache line stays in cache until a newly requested cache line kicks it out (referred to as eviction).

In an ECS, because you are iterating over a lot of entities and loading lots of different component values, you are going to be making lots of cache line requests that evict other cache lines. This causes memory to cycle in and out of cache potentially very fast, and can easily introduce memory bandwidth bottlenecks. The solution is to load your components into cache fewer times.

For example, let’s say you have the components Acceleration, Velocity, and Position. You have one system that reads Acceleration and writes to Velocity. And you have another system that reads Velocity and writes to Position. Unless you have some system in-between these that further modifies Velocity, this is wasteful. You are loading Velocity into cache twice, as it is probably getting evicted by other systems, or even by the sheer number of entities being processed. If your system instead writes to Position for each entity immediately after calculating the new Velocity, you only have to load Velocity once. Thus, bandwidth is decreased and becomes less of a bottleneck slowing your game down.
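A minimal sketch of the combined approach, assuming components are exposed as flat arrays per archetype (the system functions are hypothetical):

struct Acceleration { float x, y, z; };
struct Velocity     { float x, y, z; };
struct Position     { float x, y, z; };

// Granular version: two systems, so Velocity streams through cache twice.
void accelerationSystem(Velocity* v, const Acceleration* a, int count, float dt)
{
    for (int i = 0; i < count; i++)
    {
        v[i].x += a[i].x * dt;
        v[i].y += a[i].y * dt;
        v[i].z += a[i].z * dt;
    }
}

void velocitySystem(Position* p, const Velocity* v, int count, float dt)
{
    for (int i = 0; i < count; i++)
    {
        p[i].x += v[i].x * dt;
        p[i].y += v[i].y * dt;
        p[i].z += v[i].z * dt;
    }
}

// Combined version: each Velocity is loaded once, updated, consumed for the
// Position update, and written back while it is still in cache.
void motionSystem(Position* p, Velocity* v, const Acceleration* a, int count, float dt)
{
    for (int i = 0; i < count; i++)
    {
        v[i].x += a[i].x * dt;
        v[i].y += a[i].y * dt;
        v[i].z += a[i].z * dt;
        p[i].x += v[i].x * dt;
        p[i].y += v[i].y * dt;
        p[i].z += v[i].z * dt;
    }
}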

The takeaway here is that overly granular systems can hurt performance. You have to consider the operations your game actually needs, and be willing to refactor, combining or splitting systems to achieve the right balance of performance and flexibility.

This also means that some systems will want to read more and more components, so you should consider combining components as well, especially if said components are only used by a single system.

Lastly, smaller data sizes can help a lot, even if they cost more CPU instructions to convert them. Benchmark it within the full context of your game to see.

So yeah. Check your examples to see if you are encouraging good practices or not.

Temporary Memory

Now we get into the big offender. Let’s talk about temporary allocations. For this, I’ll propose an example.

Suppose for each entity you need to perform a raycast query against some sort of spatial structure, and each hit provides back a RaycastResult. You need a temporary list to store these, so you allocate one. But then, as you iterate each entity, you keep reallocating this list. Depending on the allocation strategy, you may be hitting a very slow allocator that has to switch into the OS’s privileged execution mode, both to allocate when you start using the list and to free when you stop. That ain’t good, so to fix it, you might instead use a bump allocator that just keeps allocating and then frees the memory all at once later (or recycles it). But now you’ve introduced a new problem: each allocation uses new memory addresses, and that means new cache lines that have to evict old cache lines. That chews up bandwidth.

What you really want to do is reuse that memory for every entity, and just clear the list each time. If your ECS can’t facilitate a means to do that, you better fix that!
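A minimal sketch of the hoist-and-clear pattern, where SpatialStructure and raycastAll are hypothetical stand-ins for whatever your engine provides:

#include <vector>

struct SpatialStructure;  // some spatial acceleration structure
struct RaycastResult { float distance; };

// Hypothetical query that appends every hit for the given entity to outHits.
void raycastAll(const SpatialStructure& spatial, int entityIndex,
                std::vector<RaycastResult>& outHits);

void processEntities(const SpatialStructure& spatial, int entityCount)
{
    // Hoist the list out of the loop and clear() it each iteration.
    // clear() keeps the capacity, so the same memory (and the same cache
    // lines) get reused for every entity instead of reallocating.
    std::vector<RaycastResult> hits;
    for (int i = 0; i < entityCount; i++)
    {
        hits.clear();
        raycastAll(spatial, i, hits);
        // ... consume hits ...
    }
}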

But now, let’s suppose that you took this and the previous advice to heart and after processing all the raycasting for a single entity and no longer needing the list, in the same system you then decide to perform distance queries against the spatial structure, which provides back DistanceResult values. Now you need a new list for this, but it would be a lot nicer if you could use the same list memory as the raycasts, so that you had less cache eviction.

Depending on your language of choice, good luck!

The most common game development language is C++. In that language, a developer might cast the pointer to the list memory into a new type. And while most of the time the code will work correctly, it will be completely by luck because such behavior is undefined.

C++ has a rule called “strict aliasing” which basically states that once an allocated memory address has some instance of some data type placed at it, the compiler can assume that the type at that address never changes until the address is freed again. If it were an int, it will always be an int.

Take this code, and compile it with GCC 6.4 with -O3 -std=c++14 and then again with GCC 7.5.

#include <iostream>
#include <memory>

struct PreventInlines
{
    // Virtual destructor so destroying the derived instance through the
    // base pointer below is well-defined.
    virtual ~PreventInlines() = default;

    virtual float* convertVoidToFloatPtr(int* address)
    {
        return reinterpret_cast<float*>(address);
    }
};

struct DerivedPreventer : public PreventInlines
{
    float* convertVoidToFloatPtr(int* address) override
    {
        return reinterpret_cast<float*>(address);
    }
};

// Returns the derived converter through a base pointer so the compiler is
// discouraged from inlining the cast and reasoning about it.
std::unique_ptr<PreventInlines> makePreventer(const int*)
{
    return std::make_unique<DerivedPreventer>();
}

int main()
{
    int arrayPtr[2];
    arrayPtr[0] = 5;

    auto preventInlines = makePreventer(arrayPtr);

    // floatPtr aliases arrayPtr, but strict aliasing lets the compiler
    // assume it does not, because the pointed-to types differ.
    auto floatPtr = preventInlines->convertVoidToFloatPtr(arrayPtr);
    floatPtr++;
    arrayPtr[1] = 10;
    *floatPtr = 0.0f;  // undefined behavior: a float write on top of an int

    std::cout << arrayPtr[0] << ", " << arrayPtr[1] << std::endl;
    return 0;
}

GCC 6.4 prints:

5, 10

Whereas GCC 7.5 prints:

5, 0

This is not a bug in GCC 6.4; it just happens that GCC 7.5 optimizes better, and its output matches the order in which the writes actually happen. But the reason they are different is strict aliasing. The whole behavior is undefined, and if we added a bunch of other code, we could get a 10 in the output even with newer compilers.

That’s a problem when we want to recycle memory. And while I highlighted the issue with temporary allocations, this has the same problem if you are trying to reuse memory allocations for different entity archetypes or other similar potential optimizations.

Oddly enough, unmanaged C# doesn’t have a strict aliasing rule, and consequently doesn’t run into this problem. Unity’s ECS code leverages this in quite a few circumstances, and Burst allows for explicitly specifying whether or not pointers are the same, different, or potentially either, giving control of optimization to the developer.

Memory Access

The last bit about memory in an ECS is that you should strive to expose it directly to the user as a low-level API option. This gives the user the freedom to reason about multiple Entities at once using SIMD. It also lets the user avoid loading some component instances into the cache if they are conditionally not needed.
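As a sketch of what that enables, assume a hypothetical Health component and an API that hands the user a chunk’s raw component array; the user can then process four entities per iteration with SIMD (this example assumes an x86 target):

#include <immintrin.h>  // SSE intrinsics

struct Health { float value; };

// The ECS hands over the raw component array for a chunk instead of invoking
// a per-entity callback, so the user is free to vectorize.
void applyPoison(Health* healths, int count, float damage)
{
    __m128 d = _mm_set1_ps(damage);
    int i = 0;
    for (; i + 4 <= count; i += 4)  // four entities per iteration
    {
        __m128 h = _mm_loadu_ps(&healths[i].value);
        _mm_storeu_ps(&healths[i].value, _mm_sub_ps(h, d));
    }
    for (; i < count; i++)          // scalar tail
        healths[i].value -= damage;
}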

Conclusion

So, how good is your ECS? Is it friendly to the memory ninjas? Or are you getting owned by your language of choice and encapsulation?

In the next article, we’ll cover high-ratio optimizations and uncover how your filtering features should be more than just user convenience.

Your ECS Probably Still Sucks: Part 2 – High-Ratio Optimizations

You got your archetype ECS with direct memory access and fast iteration through your data. You have a few systems crunching through a large number of entities. You’ve run your tech demo through the benchmarks, and it looks like you are getting great performance and full CPU utilization. You’ve hit the performance ceiling, right?

Not even close!

There’s still a potential order of magnitude of performance still out there, and you’re going to need it.

Outline

  • Part 1 – Memory Matters
  • Part 2 – High-Ratio Optimizations
  • Part 3 – Relationships

The Filter Is the Problem

Let’s suppose you have a component called MagicNumber whose value ranges between 0 and 9. You have a million entities with this component. This component is a critical concept to your game, as the number completely changes the behavior of the entity. Each behavior is handled by several systems. Consequently, the first thing each of these systems do is check if the magic number is the one they want. Roughly 90% of the time, it isn’t, but the magic number still had to be loaded into cache, and CPU cycles had to be spent checking. Even though we have amazing cache efficiency, we’re still being wasteful.

We could have a list of entities for each magic number, but then all our accesses would require randomly looking up entities, which is usually pretty expensive.

This type of problem doesn’t just apply to this scenario. Any “sparse” event that can happen in a frame and affects only a small subset of entities can run into it. This is pretty common, especially as games grow in complexity. Sometimes brute-forcing it is good enough, and can be faster than the list-based approach. But what if we could do even better?

Queries Got It Right

In an ECS, when you perform a query, you don’t have to check every entity to see if it has the components you are after. Instead, the query is able to find the selected entities quickly based on grouping patterns (in our case, archetypes). We are able to skip over entire groups of entities just by recognizing their archetype doesn’t match.

But as has been shown, archetypes aren’t the only thing we need to filter by with our logic. If your ECS doesn’t have tools for efficiently skipping over groups of entities based on other considerations, you will want to fix that. The rest of this article will show you some of the mechanisms you can add that will help accomplish this.

Lastly, remember your goal isn’t to perform all this skipping for the user, but rather provide the user with the tools for them to do these optimizations.

Value Groups

In addition to grouping entities by archetype, another thing you can do is group entities by specific component values. These components need to be special, as changing their values would have to trigger regrouping of entities. If in our MagicNumber example, the magic numbers rarely or never change, then this technique could be applied to that problem.

Unity’s ECS calls these “shared components”.

Slice Aggregate Values

What happens if the sparse set of entities we are searching for can’t easily be grouped without costing major organizational overhead? We need a technique to skip over entities without reordering them.

Pick a power of 2. I like 64. Now let’s say that for every 64 entities, there’s a bool that specifies whether any of those entities meet some search criteria. The criteria match very sparsely, so at least half of these bool values are false. Instead of checking every entity, we can first check this bool, and if it is false, skip 64 entities. If none of the entities meet the criteria, then we are doing 64 times less work!

It may still technically be an O(n) algorithm in terms of complexity, but this two-level hierarchy concept where we reason about a fixed length slice of adjacent entities at once can offer massive speedups. Of course, this only works if the length of the slice is small enough to likely skip entities, but still large enough to skip a lot of work.

Ideally, the bool values we are using as a fast filter are all adjacent in memory so that we can blast through them quickly. However, they don’t have to just be bools either. They could be min or max values, bounding boxes, sums, or even bitmasks.
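Here is the bool variant as a minimal sketch; the aggregate array and the processEntity callback are hypothetical:

#include <cstddef>

constexpr std::size_t kSliceSize = 64;

// Hypothetical per-entity work, only invoked for slices that pass the filter.
void processEntity(std::size_t entityIndex);

void processSparse(const bool* sliceHasCandidates, std::size_t entityCount)
{
    for (std::size_t base = 0; base < entityCount; base += kSliceSize)
    {
        // One load and compare skips 64 entities at a time.
        if (!sliceHasCandidates[base / kSliceSize])
            continue;

        std::size_t end = base + kSliceSize < entityCount ? base + kSliceSize
                                                          : entityCount;
        for (std::size_t i = base; i < end; i++)
            processEntity(i);  // the per-entity criteria check still runs here
    }
}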

Unity provides this type of potential optimization via “chunk components”. In Unity, entities and their components are stored in “chunks”, where each chunk can hold up to 128 entities. However, chunk components are actually stored in separate “meta chunks”. There are several other ways to implement something like this, and perhaps I’ll propose what I would do in a future article.

Change Detection

When it comes to detecting changes, I see a lot of ECS implementations keep some change tracker per entity. Some of them use bool values, which only works if only one system is allowed to write to the component. Most implementations with this feature use versioning integers and rollover-aware comparisons.

However, doing this per entity has a per-entity cost. Sometimes, it is appropriate. But for many cases, using the previous technique with a change version per slice can have significant benefits, especially for lots of simple ECS systems.

What makes version numbers slightly different is that there’s more of an incentive to make them automatic within the ECS implementation. I’d also encourage a special version number for detecting if any of the entities in the chunk changed archetype. Unity’s ECS has chunk per-component version numbers as well as an archetype version number.
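A rollover-aware comparison fits in a couple of lines. This sketch is in the same spirit as Unity’s approach, though the exact form here is my own:

#include <cstdint>

// Rollover-aware "newer than" test: subtracting in unsigned space and
// reinterpreting as signed keeps the comparison correct even after the
// 32-bit counter wraps, as long as the two versions are within 2^31
// increments of each other.
bool didChangeSince(uint32_t sliceVersion, uint32_t lastSystemVersion)
{
    return static_cast<int32_t>(sliceVersion - lastSystemVersion) > 0;
}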

Granular Bit Filtering

So far, we mostly covered the case where we can skip over a whole slice of entities. But what happens if every single slice has on average 2 entities we are looking for? If our slice was 64 entities wide, there’s still potentially a 32X performance gap to close. Let’s close it!

Suppose for each slice, we have a 64 bit integer. However, we will really use this as a bitfield, where each entity in the slice corresponds to one of these bits. How many loop iterations would it take to count the number of entities that have their bit set in a single slice? And if there are 3 bits set, how many iterations would it take to find and process those 3 entities?

The answers: 1 and 3 respectively.

Modern processors have special hardware for dealing with problems formatted this way, allowing the processor to instantly answer questions like:

“How many bits are set?”

“Is this bit set?”

“What is the index of the first bit set?”

In addition, there’s hardware for flipping either all the bits at once or individual bits.

This already plays nicely with the slice value optimizations discussed earlier. However, an additional specialization that can be added to an ECS is to allow these bits to move with their associated entities when regrouping occurs.

Unity uses the IEnableableComponent interface for this, though at the time of writing, their implementation is quite unpolished.

Parallelism

This concept of splitting everything into slices has an additional benefit: it offers natural boundaries for parallelizing algorithms. It is quite common for an operation to require one particular step to be single-threaded, and if that step has to iterate through every entity, it can become a bottleneck. By using the above techniques, you can drastically reduce the number of iterations the single-threaded part has to run through, while all the other parts of the algorithm use all the CPU cores to crunch through the entities.

Conclusion

So how good is your ECS? Does it have great skip opportunities for the speedrunners? Or has your iteration devolved into an autoscroller?

In the next article, we’ll cover relationship caching and explore why static archetypes might not be as good as you think.

Your ECS Probably Still Sucks: Part 3 – Relationships

Relationships are that thing that ECS architectures are “bad at”, unless they are built in, in which case the implementation provides a false sense of performance when in actuality it is no better than OOP relationships.

Most people don’t actually understand how to handle relationships in DOD. If you are one of those people, we’re about to fix that.

Outline

  • Part 1 – Memory Matters
  • Part 2 – High-Ratio Optimizations
  • Part 3 – Relationships

Relationships ARE Important

A common thing I see people working with an ECS do is try to get rid of relationships as much as possible. But relationships are everywhere. Parent-child relationships allow for complex structures. Skinned meshes need to deform with bones, which may be part of a skeleton that has to know about animation. Physics objects might have joints that drive their interactions. AI needs to know about the various objects that are around it or that it is manipulating. Relationships are often an essential part of fun gameplay. Don’t compromise on fun!

The Secret

It is true that relationships can be one of the most expensive things in an ECS, but most people try to treat the symptom and not the cause. Most people try to remove relationships. But the existence of relationships isn’t the problem. They are just components sitting in memory.

The issue is that a system may need data from both ends of the relationship at the same time. And this leads to one end of the relationship being accessed in a cache-unfriendly way. This is what you want to reduce.

When optimizing relationships, focus less on optimizing data, and more on optimizing data flow.

Archetypes Matter

Sometimes relationships are persistent, and sometimes they are temporary and dynamic. Depending on the game, any entity could have hundreds of different possible relationships, but few will actually have them because such relationships are driven by player actions. Additionally, when relationships do exist, they tend to exist for multiple frames at a time.

Because of this, it is highly recommended that your archetype ECS support archetype changes on entities for forming and breaking relationships. Yes, such changes can be expensive. But not allowing changes means that the entity has to be prepared for every possible relationship, which drastically increases memory usage and polling.

Relationships tend to require special logic when first formed and when broken. Perhaps this is even the only logic associated with the relationship. There’s no reason to run this logic every frame. Instead, use addition and removal of components to track these changes. You also want special types of components that linger around when an entity is destroyed, so that a system can react to the change.

Additionally, if your ECS allows cloning an existing entity, you want to reform relationships with the new entity, but if you clone all the components exactly, you won’t be able to detect this. Therefore, you should define components that never get cloned. Having these be the same as the ones that linger around is perfectly suitable. Unity defines cleanup components for these purposes.

Sticky Value Caching

Let’s suppose we have EntityA which has a component called StickyValue. Now we have EntityB which has a relationship with EntityA. There’s a system that needs several of EntityB’s components along with EntityA’s StickyValue. How do we optimize this?

The real question is, how do we make it so that we need StickyValue less often?

The easiest solution would be to put the fetching of StickyValue inside a branch. Sure, branches slow things down a little, but they are still way better than a cache miss.

But the system might need StickyValue every frame, because it is part of some formula involving other components on EntityB that change every frame. If StickyValue changes every frame and this system is the only system that reads it, fetching it may be the best option.

Most of the time when this situation comes up, StickyValue doesn’t change every frame. Sure, it might even be written every frame, but we can still drastically improve performance.

The key here is that EntityB doesn’t have any information about the state of StickyValue until it needs it, but the system that processes EntityA to write StickyValue can know everything. Therefore, instead of making it EntityB’s responsibility to retrieve StickyValue, it should be EntityA’s responsibility to push StickyValue, and only when it differs from the previous frame, thus reducing the number of frames on which the random access occurs.

This means that when EntityB forms a relationship with EntityA, EntityA learns that EntityB depends on StickyValue, and a component on EntityA stores this info. When a system writes to StickyValue, it first checks whether the new StickyValue differs from the old one. Only then does it look up EntityB and copy StickyValue over. This means EntityB needs a component to cache StickyValue. Call it StickyValueCache.
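A minimal sketch of the push, where every name is a hypothetical stand-in following the example above:

#include <cstdint>

// All names here are hypothetical, following the example in the text.
struct StickyValue      { float value; };
struct StickyValueCache { float value; };       // lives on EntityB
struct StickyObserver   { uint32_t entityB; };  // lives on EntityA; set when
                                                // the relationship forms

// Hypothetical random-access write to another entity's component.
void writeComponent(uint32_t entity, StickyValueCache newValue);

// The system that owns StickyValue pushes only on an actual change, so the
// cache-unfriendly lookup of EntityB happens on very few frames.
void updateStickyValue(StickyValue& sticky, const StickyObserver& observer,
                       float newValue)
{
    if (sticky.value == newValue)
        return;  // common case: no change, no random access
    sticky.value = newValue;
    writeComponent(observer.entityB, StickyValueCache{ newValue });
}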

Now the system that reads EntityB and StickyValue never needs to transfer data across the dependency, because the data is already in a component that is being loaded in a cache-efficient way.

As for the system that writes to StickyValue on EntityA, while it does get a little more expensive, most of the time, it detects no changes to StickyValue and never needs to load EntityB, which means most of the time this system is also exclusively using a cache-efficient code path.

Use Other Tools Too

It is important to consider the tools that have already been discussed when working with relationships. Good filtering can reduce relationship polling. Using 1 bit components can improve the chances of accidental cache hits just by having so many different component instances represented in a single cache line (512 for 64 byte cache lines).

Also, relationships are often better expressed implicitly inside other data structures, such as bounding-volume hierarchies or grids. Make sure your ECS has a way to store these data structures.

Reliability

When it comes to relationships, it is very important that the execution order is predictable and reliable, otherwise debugging issues can be an absolute nightmare. Thread-safety isn’t enough. You want some level of determinism to ensure that assumptions made on either side of the relationship hold true so that you can skip passing data across all the time.

Conclusion

So how good is your ECS? Is your communication clean? Or are your entities shooting blind?

This concludes discussion regarding ECS architectures. Future articles will discuss various facets of a game engine powered by an ECS and the considerations required for designing such an engine.

From the comments: “Can I ask you for some other resources to better understand how caches work?”

I'm out of touch with what the best introductory material is, but for a technical read, this is one of the best resources out there. Despite its age, most of the information there is still relevant. https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
