John Calsbeek Nexuapex

## rh_grow.c
// This can grow a Robin Hood linear probing hash table near word-at-a-time memcpy speeds. If you're confused why I use 'keys'
// to describe the hash values, it's because my favorite perspective on Robin Hood (which I learned from Paul Khuong)
// is that it's just a sorted gap array which is MSB bucketed and insertion sorted per chain:
// https://pvk.ca/Blog/2019/09/29/a-couple-of-probabilistic-worst-case-bounds-for-robin-hood-linear-probing/
// The more widely known "max displacement" picture of Robin Hood hashing also has strengths since the max displacement
// can be stored very compactly. You can see a micro-optimized example of that here for small tables where the max displacement
// can fit in 4 bits: Sub-nanosecond Searches Using Vector Instructions, https://www.youtube.com/watch?v=paxIkKBzqBU
void grow(Table *table) {
	u64 exp = 64 - table->shift;
	// We grow the table downward in place by a factor of 2 (not counting the overflow area at table->end).
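
The "sorted gap array" view described in the comment above can be sketched as follows. This is a minimal sketch, not the author's `grow` code: `SketchTable`, `sketch_insert` and `sketch_find` are hypothetical names, the table size is fixed, key 0 is assumed to mean "empty", and a small overflow area after the last bucket is assumed so that probing never wraps around.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t u64;

/* Hypothetical sizes: 2^4 home slots plus an 8-slot overflow area. */
enum { SKETCH_EXP = 4, SKETCH_CAP = (1 << SKETCH_EXP) + 8 };

typedef struct {
    u64 keys[SKETCH_CAP]; /* 0 marks an empty slot (assumption) */
} SketchTable;

/* The home slot is taken from the most significant bits of the key. */
static size_t sketch_home(u64 key) {
    return (size_t)(key >> (64 - SKETCH_EXP));
}

/* Walk from the home slot while the stored keys are smaller than ours. */
static size_t sketch_probe(const SketchTable *t, u64 key) {
    size_t i = sketch_home(key);
    while (i < SKETCH_CAP && t->keys[i] != 0 && t->keys[i] < key) i++;
    return i;
}

static int sketch_find(const SketchTable *t, u64 key) {
    size_t i = sketch_probe(t, key);
    return i < SKETCH_CAP && t->keys[i] == key;
}

/* Insertion-sort the key into its chain. Returns 0 if the table is full
   in that region, in which case a real table would have to grow. */
static int sketch_insert(SketchTable *t, u64 key) {
    assert(key != 0); /* 0 is reserved for "empty" in this sketch */
    size_t i = sketch_probe(t, key);
    if (i == SKETCH_CAP) return 0;
    if (t->keys[i] == key) return 1; /* already present */
    size_t gap = i;                  /* find the next empty slot */
    while (gap < SKETCH_CAP && t->keys[gap] != 0) gap++;
    if (gap == SKETCH_CAP) return 0;
    /* Shift the larger keys up by one slot and drop the new key in. */
    memmove(&t->keys[i + 1], &t->keys[i], (gap - i) * sizeof(u64));
    t->keys[i] = key;
    return 1;
}
```

Because the home slot comes from the top bits, keeping each chain sorted keeps the occupied slots of the whole array in ascending key order, which is what makes a mostly sequential, memcpy-like grow pass plausible.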

## custom_game_engines_small_study.md

      
raysan5 / custom_game_engines_small_study.md (last active April 23, 2024 13:41; 1 file, 59 forks, 142 comments, 1302 stars)
A small state-of-the-art study on custom engines

CUSTOM GAME ENGINES: A Small Study


A couple of weeks ago I played (and finished) A Plague Tale, a game by Asobo Studio. I was really captivated by it, not only by the beautiful graphics but also by the story and the locations. I decided to look into the game's tech and was surprised to see that it was developed with a custom engine by a relatively small studio. I know some companies use custom engines, but it's very difficult to find a detailed market study with that kind of information curated and kept up to date. Hence this article.
Nowadays lots of companies choose engines like Unreal or Unity for their games (or that's what a lot of people think) because d

  
## perfect-quantization-dxt-endpoints.txt
Perfect Quantization of DXT endpoints
-------------------------------------

One of the issues that affect the quality of most DXT compressors is the way floating point colors are rounded.

For example, stb_dxt does:

    max16 =  (unsigned short)(stb__sclamp((At1_r*yy - At2_r*xy)*frb+0.5f,0,31) << 11);
    max16 |= (unsigned short)(stb__sclamp((At1_g*yy - At2_g*xy)*fg +0.5f,0,63) << 5);
    max16 |= (unsigned short)(stb__sclamp((At1_b*yy - At2_b*xy)*frb+0.5f,0,31) << 0);
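
To see why the `+0.5f` rounding above is not always optimal, here is a minimal sketch. It assumes the decoder expands a 5-bit channel to 8 bits by bit replication, `(q << 3) | (q >> 2)`; the function names are hypothetical and are not part of stb_dxt. Under that assumption the decoded levels are not evenly spaced, so the nearest decoded level is sometimes one step away from what uniform rounding picks.

```c
/* Assumed decoder behaviour: expand 5 bits to 8 by bit replication. */
static int expand5(int q) {
    return (q << 3) | (q >> 2);
}

/* Uniform rounding in the style of the lines above; f is in [0, 1]. */
static int quantize5_round(float f) {
    int q = (int)(f * 31.0f + 0.5f);
    return q < 0 ? 0 : (q > 31 ? 31 : q);
}

static float abs_diff(float a, float b) {
    return a > b ? a - b : b - a;
}

/* Hypothetical alternative: start from the rounded guess, then keep
   whichever of q-1, q, q+1 decodes closest to the 8-bit target. */
static int quantize5_nearest(float f) {
    float target = f * 255.0f;
    int q = quantize5_round(f);
    int best = q;
    float best_err = abs_diff((float)expand5(q), target);
    for (int c = q - 1; c <= q + 1; c += 2) {
        if (c < 0 || c > 31) continue;
        float err = abs_diff((float)expand5(c), target);
        if (err < best_err) {
            best_err = err;
            best = c;
        }
    }
    return best;
}
```

For example, with `f = 28.6f / 255.0f`, uniform rounding picks 3 (decoded as 24), while the nearest decoded level is 4 (decoded as 33).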

## GPUOptimizationForGameDev.md

      
silvesthu / GPUOptimizationForGameDev.md (last active April 19, 2024 04:21; 1 file, 94 forks, 11 comments, 1043 stars)

GPU Optimization for GameDev

Graphics Pipeline / GPU Architecture Overview


2011 - A trip through the Graphics Pipeline 2011
2015 - Life of a triangle - NVIDIA's logical pipeline
2015 - Render Hell 2.0
2016 - How bad are small triangles on GPU and why?
2017 - GPU Performance for Game Artists
2019 - Understanding the anatomy of GPUs using Pokémon
2020 - GPU ARCHITECTURE RESOURCES


## asan_clang_cl.md

      
pervognsen / asan_clang_cl.md (last active June 21, 2023 16:57; 1 file, 1 fork, 5 comments, 23 stars)

I was told by @mmozeiko that Address Sanitizer (ASAN) works on Windows now. I'd tried it a few years ago with no luck, so this was exciting news to hear.
It was a pretty smooth experience, but with a few gotchas I wanted to document.
First, download and run the LLVM installer for Windows: https://llvm.org/builds/
Then download and install the VS extension if you're a Visual Studio 2017 user like I am.
It's now very easy to use Clang to build your existing MSVC projects since there's a cl compatible frontend:

  
## FramentShaderWaveCoherency.txt
#version 450
#extension GL_ARB_separate_shader_objects : enable
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_ballot : enable
#extension GL_KHR_shader_subgroup_vote : enable
#extension GL_KHR_shader_subgroup_arithmetic : enable

layout(location = 0) out vec4 outColor;

//#define VISUALIZE_WAVES

## FastUniformLoadWithWaveOps.txt
In shader programming, you often want every pixel in a compute shader group (tile) to iterate over the same array in memory.
Tiled deferred lighting is the most common case: each 8x8 tile loops over a light list culled for that tile.

Simplified HLSL code looks like this:

Buffer<float4> lightDatas;
Texture2D<uint2> lightStartCounts;
RWTexture2D<float4> output;

[numthreads(8, 8, 1)]

## microsoft_craziness.h
//
// Author:   Jonathan Blow
// Version:  1
// Date:     31 August, 2018
//
// This code is released under the MIT license, which you can find at
//
//          https://opensource.org/licenses/MIT
//
//

## avx_sigh.md

      
rygorous / avx_sigh.md (last active September 21, 2023 07:33; 1 file, 3 forks, 0 comments, 66 stars)

why doesn't radfft support AVX on PC?

So there's two separate issues here: using instructions added in AVX and using 256-bit wide vectors. The former turns out to be much easier than the latter for our use case.
Problem number 1 was that you positively need to put AVX code in a separate file with different compiler settings (/arch:AVX for VC++, -mavx for GCC/Clang) that make all SSE code emitted also use VEX encoding, and at the time radfft was written there was no way in CDep to set compiler flags for just one file, just for the overall build.
[There's the GCC "target" annotations on individual funcs, which in principle fix this, but I ran into nasty problems with this for several compiler versions, and VC++ has no equivalent, so we're not currently using that and just sticking with different compilation units.]
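
For reference, the GCC/Clang per-function annotation mentioned in the aside looks roughly like the sketch below. It only illustrates the mechanism and is not radfft code: `add_f32`, `add_f32_avx` and `add_f32_scalar` are hypothetical names, and the runtime check uses the GCC/Clang builtin `__builtin_cpu_supports`. On other compilers or non-x86 targets the sketch falls back to the scalar loop.

```c
#include <stddef.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define SKETCH_HAVE_AVX_TARGET 1

/* Only this function is compiled with AVX enabled; the rest of the
   file keeps the default compiler settings. */
__attribute__((target("avx")))
static void add_f32_avx(float *dst, const float *a, const float *b, size_t n) {
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++) dst[i] = a[i] + b[i]; /* scalar tail */
}
#endif

static void add_f32_scalar(float *dst, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++) dst[i] = a[i] + b[i];
}

/* Dispatch at run time so the binary still works on CPUs without AVX. */
void add_f32(float *dst, const float *a, const float *b, size_t n) {
#ifdef SKETCH_HAVE_AVX_TARGET
    if (__builtin_cpu_supports("avx")) {
        add_f32_avx(dst, a, b, n);
        return;
    }
#endif
    add_f32_scalar(dst, a, b, n);
}
```

Keeping the AVX code in a separate compilation unit, as the text describes, sidesteps the compiler-version problems mentioned above at the cost of some build-system support.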
The other issue is to do with CPU power management.

  
## TinyCRT.h

//
// TinyCRT, revamp and TinyWin support by Don Williamson, 2011
// Based on http://www.codeproject.com/KB/library/tlibc.aspx and LIBCTINY by Matt Pietrek
//

#pragma once


#ifdef USE_DEFAULT_CRT