Skip to content

Instantly share code, notes, and snippets.

// This can grow a Robin Hood linear probing hash table near word-at-a-time memcpy speeds. If you're confused why I use 'keys'
// to describe the hash values, it's because my favorite perspective on Robin Hood (which I learned from Paul Khuong)
// is that it's just a sorted gap array which is MSB bucketed and insertion sorted per chain:
// https://pvk.ca/Blog/2019/09/29/a-couple-of-probabilistic-worst-case-bounds-for-robin-hood-linear-probing/
// The more widely known "max displacement" picture of Robin Hood hashing also has strengths since the max displacement
// can be stored very compactly. You can see a micro-optimized example of that here for small tables where the max displacement
// can fit in 4 bits: Sub-nanosecond Searches Using Vector Instructions, https://www.youtube.com/watch?v=paxIkKBzqBU
void grow(Table *table) {
u64 exp = 64 - table->shift;
// We grow the table downward in place by a factor of 2 (not counting the overflow area at table->end).
@raysan5
raysan5 / custom_game_engines_small_study.md
Last active April 23, 2024 13:41
A small state-of-the-art study on custom engines

CUSTOM GAME ENGINES: A Small Study

a_plague_tale

A couple of weeks ago I played (and finished) A Plague Tale, a game by Asobo Studio. I was really captivated by the game, not only by the beautiful graphics but also by the story and the locations in the game. I decided to investigate a bit about the game tech and I was surprised to see it was developed with a custom engine by a relatively small studio. I know there are some companies using custom engines but it's very difficult to find a detailed market study with that kind of information curated and updated. So this article.

Nowadays lots of companies choose engines like Unreal or Unity for their games (or that's what lot of people think) because d

Perfect Quantization of DXT endpoints
-------------------------------------
One of the issues that affect the quality of most DXT compressors is the way floating point colors are rounded.
For example, stb_dxt does:
max16 = (unsigned short)(stb__sclamp((At1_r*yy - At2_r*xy)*frb+0.5f,0,31) << 11);
max16 |= (unsigned short)(stb__sclamp((At1_g*yy - At2_g*xy)*fg +0.5f,0,63) << 5);
max16 |= (unsigned short)(stb__sclamp((At1_b*yy - At2_b*xy)*frb+0.5f,0,31) << 0);

I was told by @mmozeiko that Address Sanitizer (ASAN) works on Windows now. I'd tried it a few years ago with no luck, so this was exciting news to hear.

It was a pretty smooth experience, but with a few gotchas I wanted to document.

First, download and run the LLVM installer for Windows: https://llvm.org/builds/

Then download and install the VS extension if you're a Visual Studio 2017 user like I am.

It's now very easy to use Clang to build your existing MSVC projects since there's a cl compatible frontend:

@sebbbi
sebbbi / FramentShaderWaveCoherency.txt
Last active November 28, 2018 15:51
FramentShaderWaveCoherency test shader (Vulkan 1.1)
#version 450
#extension GL_ARB_separate_shader_objects : enable
#extension GL_KHR_shader_subgroup_basic : enable
#extension GL_KHR_shader_subgroup_ballot : enable
#extension GL_KHR_shader_subgroup_vote : enable
#extension GL_KHR_shader_subgroup_arithmetic : enable
layout(location = 0) out vec4 outColor;
//#define VISUALIZE_WAVES
@sebbbi
sebbbi / FastUniformLoadWithWaveOps.txt
Last active February 15, 2024 08:41
Fast uniform load with wave ops (up to 64x speedup)
In shader programming, you often run into a problem where you want to iterate an array in memory over all pixels in a compute shader
group (tile). Tiled deferred lighting is the most common case. 8x8 tile loops over a light list culled for that tile.
Simplified HLSL code looks like this:
Buffer<float4> lightDatas;
Texture2D<uint2> lightStartCounts;
RWTexture2D<float4> output;
[numthreads(8, 8, 1)]
//
// Author: Jonathan Blow
// Version: 1
// Date: 31 August, 2018
//
// This code is released under the MIT license, which you can find at
//
// https://opensource.org/licenses/MIT
//
//

why doesn't radfft support AVX on PC?

So there's two separate issues here: using instructions added in AVX and using 256-bit wide vectors. The former turns out to be much easier than the latter for our use case.

Problem number 1 was that you positively need to put AVX code in a separate file with different compiler settings (/arch:AVX for VC++, -mavx for GCC/Clang) that make all SSE code emitted also use VEX encoding, and at the time radfft was written there was no way in CDep to set compiler flags for just one file, just for the overall build.

[There's the GCC "target" annotations on individual funcs, which in principle fix this, but I ran into nasty problems with this for several compiler versions, and VC++ has no equivalent, so we're not currently using that and just sticking with different compilation units.]

The other issue is to do with CPU power management.

//
// TinyCRT, revamp and TinyWin support by Don Williamson, 2011
// Based on http://www.codeproject.com/KB/library/tlibc.aspx and LIBCTINY by Matt Pietrek
//
#pragma once
#ifdef USE_DEFAULT_CRT