John Calsbeek Nexuapex

## GPUOptimizationForGameDev.md

      
silvesthu / GPUOptimizationForGameDev.md
Last active May 12, 2024 22:42 (1 file, 96 forks, 11 comments, 1046 stars)

GPU Optimization for GameDev

Graphics Pipeline / GPU Architecture Overview


2011 - A trip through the Graphics Pipeline 2011
2015 - Life of a triangle - NVIDIA's logical pipeline
2015 - Render Hell 2.0
2016 - How bad are small triangles on GPU and why?
2017 - GPU Performance for Game Artists
2019 - Understanding the anatomy of GPUs using Pokémon
2020 - GPU ARCHITECTURE RESOURCES


## latency.txt
Latency Comparison Numbers (~2012)
----------------------------------
L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns                      14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000   ns        3 us
Send 1K bytes over 1 Gbps network       10,000   ns       10 us
Read 4K randomly from SSD*             150,000   ns      150 us          ~1GB/sec SSD
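
The comparison column in the table is plain division of the listed latencies. A minimal sketch, assuming the values are copied verbatim from the table; the constant and function names are illustrative and not part of the original file:

```cpp
#include <cassert>

// Latencies copied from the table above, in nanoseconds.
constexpr double kL1CacheRefNs = 0.5;
constexpr double kL2CacheRefNs = 7.0;
constexpr double kMainMemoryRefNs = 100.0;
constexpr double kSsdRandomRead4KNs = 150000.0;

// How many times slower `slowNs` is than `fastNs`.
double slowdown(double slowNs, double fastNs) {
    assert(fastNs > 0.0);
    return slowNs / fastNs;
}

// Converts nanoseconds to microseconds, as in the table's third column.
double nsToUs(double ns) {
    return ns / 1000.0;
}
```

`slowdown(kL2CacheRefNs, kL1CacheRefNs)` reproduces the 14x figure and `slowdown(kMainMemoryRefNs, kL1CacheRefNs)` the 200x figure. Dividing the listed 100 ns by 7 ns gives roughly 14x rather than 20x, so the "20x L2 cache" entry is probably best read as a rule of thumb rather than an exact ratio.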

## effective_modern_cmake.md

      
mbinna / effective_modern_cmake.md
Last active May 11, 2024 09:23 (1 file, 272 forks, 59 comments, 2549 stars)

Effective Modern CMake

Getting Started

For a brief user-level introduction to CMake, watch C++ Weekly, Episode 78, Intro to CMake by Jason Turner. LLVM’s CMake Primer provides a good high-level introduction to the CMake syntax. Go read it now.
After that, watch Mathieu Ropert’s CppCon 2017 talk Using Modern CMake Patterns to Enforce a Good Modular Design (slides). It provides a thorough explanation of what modern CMake is and why it is so much better than “old school” CMake. The modular design ideas in this talk are based on the book [Large-Scale C++ Software Design](https://www.amazon.de/Large-Scale-Soft

  
## FastUniformLoadWithWaveOps.txt
In shader programming, you often need every pixel in a compute shader group (tile) to iterate over the same array in memory.
Tiled deferred lighting is the most common case: each 8x8 tile loops over a light list culled for that tile.

Simplified HLSL code looks like this:

Buffer<float4> lightDatas;
Texture2D<uint2> lightStartCounts;
RWTexture2D<float4> output;

[numthreads(8, 8, 1)]
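
The HLSL preview is cut off before the loop itself. As a rough CPU-side sketch of the access pattern the text describes, assuming each tile stores a (start, count) pair into a shared light array: the `LightRange` struct, `shadePixel`, `tilesPerRow` and the scalar light value are hypothetical stand-ins, not the author's code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical CPU mirror of one lightStartCounts texel: a range into lightDatas.
struct LightRange {
    uint32_t start;
    uint32_t count;
};

constexpr uint32_t kTileSize = 8;  // matches [numthreads(8, 8, 1)]

// Accumulates every light in the list culled for the pixel's tile.
// Each light is reduced to a single float here to keep the sketch short.
float shadePixel(uint32_t x, uint32_t y, uint32_t tilesPerRow,
                 const std::vector<LightRange>& lightStartCounts,
                 const std::vector<float>& lightDatas) {
    uint32_t tile = (y / kTileSize) * tilesPerRow + (x / kTileSize);
    assert(tile < lightStartCounts.size());
    LightRange range = lightStartCounts[tile];
    float sum = 0.0f;
    for (uint32_t i = 0; i < range.count; ++i) {
        sum += lightDatas[range.start + i];
    }
    return sum;
}
```

All 64 pixels of a tile compute the same `tile` index, so they read the same `range` and the same light entries in the same order. That shared address stream is the property the file name refers to as a uniform load.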

## iggy_focus.cpp
enum FocusDir
{
   DIR_W,
   DIR_E,
   DIR_N,
   DIR_S
};

static FocusDir get_quadrant(float dx, float dy)
{

## custom_game_engines_small_study.md

      
raysan5 / custom_game_engines_small_study.md
Last active May 7, 2024 19:39 (1 file, 59 forks, 142 comments, 1305 stars)

A small state-of-the-art study on custom engines

CUSTOM GAME ENGINES: A Small Study


A couple of weeks ago I played (and finished) A Plague Tale, a game by Asobo Studio. I was really captivated by the game, not only by the beautiful graphics but also by the story and the locations. I decided to look into the game's tech and was surprised to see it was developed with a custom engine by a relatively small studio. I know some companies use custom engines, but it's very difficult to find a detailed market study with that kind of information curated and kept up to date. Hence this article.
Nowadays lots of companies choose engines like Unreal or Unity for their games (or that's what a lot of people think) because d

  
## Tex2DCatmullRom.hlsl
// The following code is licensed under the MIT license: https://gist.github.com/TheRealMJP/bc503b0b87b643d3505d41eab8b332ae

// Samples a texture with Catmull-Rom filtering, using 9 texture fetches instead of 16.
// See http://vec3.ca/bicubic-filtering-in-fewer-taps/ for more details
float4 SampleTextureCatmullRom(in Texture2D<float4> tex, in SamplerState linearSampler, in float2 uv, in float2 texSize)
{
    // We're going to sample a 4x4 grid of texels surrounding the target UV coordinate. We'll do this by rounding
    // down the sample location to get the exact center of our "starting" texel. The starting texel will be at
    // location [1, 1] in the grid, where [0, 0] is the top left corner.
    float2 samplePos = uv * texSize;
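
The preview stops right after `samplePos`. A one-dimensional sketch of the steps the comments describe, under stated assumptions: the floor-based starting texel follows the comment's wording, the weights are the standard Catmull-Rom spline weights, and merging the two middle taps is one common way to let bilinear filtering cut 4 taps per axis to 3 (hence 9 fetches instead of 16). The names `startingTexel`, `catmullRomWeights` and `mergeMiddleTaps` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>

struct TexelPos {
    float center;  // center of the starting texel, in texel units
    float frac;    // offset of the sample from that center, in [0, 1)
};

// Rounds the sample position down to the center of the starting texel.
TexelPos startingTexel(float u, float texSize) {
    float samplePos = u * texSize;
    float center = std::floor(samplePos - 0.5f) + 0.5f;
    return {center, samplePos - center};
}

// Catmull-Rom weights for the four texels around the sample.
std::array<float, 4> catmullRomWeights(float f) {
    assert(f >= 0.0f && f <= 1.0f);
    float w0 = f * (-0.5f + f * (1.0f - 0.5f * f));
    float w1 = 1.0f + f * f * (-2.5f + 1.5f * f);
    float w2 = f * (0.5f + f * (2.0f - 1.5f * f));
    float w3 = f * f * (-0.5f + 0.5f * f);
    return {w0, w1, w2, w3};
}

struct MergedTap {
    float weight;  // combined weight of the two middle texels
    float offset;  // where to sample between them, in texel units
};

// Folds the two middle taps into one bilinear fetch.
MergedTap mergeMiddleTaps(float w1, float w2) {
    float w12 = w1 + w2;
    return {w12, w2 / w12};
}
```

The four weights always sum to 1, and at `f = 0` they collapse to `{0, 1, 0, 0}`, so a sample taken exactly at a texel center returns that texel unchanged.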

## cool-game-programming-blogs.opml
<?xml version="1.0" encoding="UTF-8"?>
<opml version="1.0">
    <head>
        <title>Graphics, Games, Programming, and Physics Blogs</title>
    </head>
    <body>
        <outline text="Tech News" title="Tech News">
            <outline type="rss" text="Ars Technica" title="Ars Technica" xmlUrl="http://feeds.arstechnica.com/arstechnica/index/" htmlUrl="https://arstechnica.com"/>
            <outline type="rss" text="Polygon - Full" title="Polygon - Full" xmlUrl="http://www.polygon.com/rss/index.xml" htmlUrl="https://www.polygon.com/"/>
            <outline type="rss" text="Road to VR" title="Road to VR" xmlUrl="http://www.roadtovr.com/feed" htmlUrl="https://www.roadtovr.com"/>

## rh_grow.c
// This can grow a Robin Hood linear probing hash table near word-at-a-time memcpy speeds. If you're confused why I use 'keys'
// to describe the hash values, it's because my favorite perspective on Robin Hood (which I learned from Paul Khuong)
// is that it's just a sorted gap array which is MSB bucketed and insertion sorted per chain:
// https://pvk.ca/Blog/2019/09/29/a-couple-of-probabilistic-worst-case-bounds-for-robin-hood-linear-probing/
// The more widely known "max displacement" picture of Robin Hood hashing also has strengths since the max displacement
// can be stored very compactly. You can see a micro-optimized example of that here for small tables where the max displacement
// can fit in 4 bits: Sub-nanosecond Searches Using Vector Instructions, https://www.youtube.com/watch?v=paxIkKBzqBU
void grow(Table *table) {
	u64 exp = 64 - table->shift;
	// We grow the table downward in place by a factor of 2 (not counting the overflow area at table->end).
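
The body of `grow` is not shown here. As a small illustration of the "MSB bucketed" view only, assuming a key's home slot is its top `64 - shift` bits (so the table has `2^exp` slots): `homeSlot` and `homeSlotAfterGrow` are hypothetical helpers, not part of the original file.

```cpp
#include <cassert>
#include <cstdint>

// Home slot of a 64-bit hash: its most significant (64 - shift) bits.
uint64_t homeSlot(uint64_t key, unsigned shift) {
    assert(shift >= 1 && shift <= 63);
    return key >> shift;
}

// Doubling the table uses one more high bit, i.e. shift decreases by one.
// The new home slot is always 2 * old or 2 * old + 1.
uint64_t homeSlotAfterGrow(uint64_t key, unsigned shift) {
    assert(shift >= 2 && shift <= 63);
    return homeSlot(key, shift - 1);
}
```

Because the new slot is the old slot with one extra bit appended, keys that were in sorted order before the grow stay in sorted order after it. That is consistent with the comment's claim that growing can proceed as a near-linear copy instead of a full rehash.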

## perfect-quantization-dxt-endpoints.txt
Perfect Quantization of DXT endpoints
-------------------------------------

One of the issues that affect the quality of most DXT compressors is the way floating point colors are rounded.

For example, stb_dxt does:

    max16 =  (unsigned short)(stb__sclamp((At1_r*yy - At2_r*xy)*frb+0.5f,0,31) << 11);
    max16 |= (unsigned short)(stb__sclamp((At1_g*yy - At2_g*xy)*fg +0.5f,0,63) << 5);
    max16 |= (unsigned short)(stb__sclamp((At1_b*yy - At2_b*xy)*frb+0.5f,0,31) << 0);
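
The pattern in those three lines is: scale to the channel's range, add 0.5, truncate, clamp, then shift into place. A minimal sketch of that rounding for a normalized color, assuming inputs in [0, 1] and assuming `stb__sclamp` truncates to an integer before clamping; `quantizeChannel` and `packRgb565` are illustrative names, not stb_dxt functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Rounds v * maxLevel to the nearest integer and clamps to [0, maxLevel].
int quantizeChannel(float v, int maxLevel) {
    assert(maxLevel > 0);
    int q = static_cast<int>(v * static_cast<float>(maxLevel) + 0.5f);
    return std::min(std::max(q, 0), maxLevel);
}

// Packs a color into the 5:6:5 layout used by DXT endpoints.
uint16_t packRgb565(float r, float g, float b) {
    int packed = (quantizeChannel(r, 31) << 11) |
                 (quantizeChannel(g, 63) << 5) |
                 (quantizeChannel(b, 31) << 0);
    return static_cast<uint16_t>(packed);
}
```

This picks the nearest value on the 5-bit or 6-bit grid, which is the rounding step the text singles out as a quality issue.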