Matt MattPD

## transpose.md

      
pervognsen / transpose.md (last active May 17, 2020 04:35; 1 file, 0 forks, 0 comments, 2 stars)

The trick to designing transpose algorithms for both small and large problems is to recognize their simple recursive structure.
For a matrix A, let's denote its transpose by T(A) as a shorthand. First, suppose A is a 2x2 matrix:
    [A00 A01]
A = [A10 A11]

Then we have:
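For a 2x2 matrix the transpose swaps the two off-diagonal entries, and the same pattern applied to blocks, T(A) = [T(A00) T(A10); T(A01) T(A11)], gives a recursive algorithm. A minimal C sketch of that idea, assuming a square row-major matrix whose side `n` is a power of two; the function name `transpose_rec` and the stride parameters `a_ld` and `t_ld` are illustrative assumptions, not the gist author's code:

```c
#include <stddef.h>

/* Write T(A) into t, where a and t are n x n row-major blocks with
 * row strides a_ld and t_ld. Assumes n is a power of two. */
void transpose_rec(const double *a, size_t a_ld,
                   double *t, size_t t_ld, size_t n)
{
    if (n == 1) {           /* a 1x1 block is its own transpose */
        t[0] = a[0];
        return;
    }
    size_t h = n / 2;       /* side of each quadrant */

    /* top-left of T is T(A00) */
    transpose_rec(a, a_ld, t, t_ld, h);
    /* bottom-left of T is T(A01): the off-diagonal blocks trade places */
    transpose_rec(a + h, a_ld, t + h * t_ld, t_ld, h);
    /* top-right of T is T(A10) */
    transpose_rec(a + h * a_ld, a_ld, t + h, t_ld, h);
    /* bottom-right of T is T(A11) */
    transpose_rec(a + h * a_ld + h, a_ld, t + h * t_ld + h, t_ld, h);
}
```

In practice the recursion would likely stop at a larger base case (for example a small tile transposed with a plain loop) rather than at 1x1; the sketch only shows the block structure.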

  
## signal-on-instruction-count.c
#define RUN_ME /*
exec cc -g -ggdb -O2 -W -Wall -std=c99 $0 -o "$(basename $0 .c)"
*/

/*
 * Copyright 2020 Paul Khuong
 * SPDX-License-Identifier:  BSD-2-Clause
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions

## libbts.c
#define RUN_ME /*
exec cc -O2 -W -Wall -std=c99 -shared $0 -o "$(basename $0 .c).so" -fPIC
*/

/*
 * Copyright 2019 Paul Khuong
 * SPDX-License-Identifier:  BSD-2-Clause
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions

## cache-counters-rant.md

      
travisdowns / cache-counters-rant.md (created October 13, 2019 16:46; 1 file, 1 fork, 0 comments, 19 stars)

Discussion of x86 L1D related cache counters

The counters that are the easiest to understand and the best for making ratios that are internally consistent (i.e., always fall in the range 0.0 to 1.0) are the mem_load_retired events, e.g., mem_load_retired.l1_hit and mem_load_retired.l1_miss.
These count at the instruction level, i.e., over the universe of retired instructions. For example, you could make a reasonable hit ratio from mem_load_retired.l1_hit / mem_inst_retired.all_loads and it will be sane (it will never indicate a hit rate of more than 100%).
That ratio isn't perfect though, in that it may not reflect the true costs of cache misses and the behavior of the program, for at least the following reasons:

- It applies only to loads and can't catch misses imposed by stores (AFAICT there is no event that counts store misses).
- It only counts loads that retire: a lot of the load activity in your process may be due to loads on a speculative path that never retire. Loads on a speculative path may bring in data that is never used, causing misses and d
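The hit ratio described above is plain arithmetic on two counter values. A minimal sketch, assuming the two counts have already been collected by some other means (the helper name `l1_hit_ratio` and the sentinel return value are hypothetical, not part of any tool):

```c
#include <stdint.h>

/* Fraction of retired loads that hit in L1D, computed as
 * mem_load_retired.l1_hit / mem_inst_retired.all_loads.
 * Returns -1.0 when no loads retired, so callers can tell
 * "no data" apart from a 0% hit rate. */
double l1_hit_ratio(uint64_t l1_hit, uint64_t all_loads)
{
    if (all_loads == 0)
        return -1.0;
    return (double)l1_hit / (double)all_loads;
}
```

Because both counts are taken over retired load instructions, the result stays within 0.0 to 1.0 for consistent inputs; it still says nothing about stores or speculative loads, as noted above.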


## GPUOptimizationForGameDev.md

      
              1 file
            
          
              95 forks
            
          
              11 comments
            
          
              1043 stars
            
          
                silvesthu
                / GPUOptimizationForGameDev.md
            
            
              Last active
              May 7, 2024 20:43
            
              
                GPU Optimization for GameDev
              
          
    GPU Optimization for GameDev

Graphics Pipeline / GPU Architecture Overview

- 2011 - A trip through the Graphics Pipeline 2011
- 2015 - Life of a triangle - NVIDIA's logical pipeline
- 2015 - Render Hell 2.0
- 2016 - How bad are small triangles on GPU and why?
- 2017 - GPU Performance for Game Artists
- 2019 - Understanding the anatomy of GPUs using Pokémon
- 2020 - GPU ARCHITECTURE RESOURCES


## multidimensional_array_views.md

      
pervognsen / multidimensional_array_views.md (last active March 24, 2024 02:09; 1 file, 0 forks, 1 comment, 39 stars)

Multi-dimensional array views for systems programmers

As C programmers, most of us think of pointer arithmetic for multi-dimensional arrays in a nested way:
The address for a 1-dimensional array is base + x.
The address for a 2-dimensional array is base + x + y*x_size for row-major layout and base + y + x*y_size for column-major layout.
The address for a 3-dimensional array is base + x + (y + z*y_size)*x_size for row-column-major layout.
And so on.
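The nested formulas above can be written down directly as index helpers. This is a small sketch; the function names are made up for illustration, and each returns the element offset to add to `base`:

```c
#include <stddef.h>

/* 1-D: offset is just x. */
size_t index_1d(size_t x)
{
    return x;
}

/* 2-D row-major: x varies fastest, rows are x_size elements apart. */
size_t index_2d_row_major(size_t x, size_t y, size_t x_size)
{
    return x + y * x_size;
}

/* 2-D column-major: y varies fastest, columns are y_size elements apart. */
size_t index_2d_col_major(size_t x, size_t y, size_t y_size)
{
    return y + x * y_size;
}

/* 3-D: the (y, z) pair is first flattened to a row number,
 * which is then used as the y coordinate of a 2-D row-major index. */
size_t index_3d(size_t x, size_t y, size_t z, size_t x_size, size_t y_size)
{
    return index_2d_row_major(x, index_2d_row_major(y, z, y_size), x_size);
}
```

Writing the 3-D case as two applications of the 2-D helper makes the nesting explicit: it expands to `x + (y + z*y_size)*x_size`.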

  
## Quirks of C.md

      
fay59 / Quirks of C.md (last active January 23, 2024 04:24; 1 file, 33 forks, 14 comments, 447 stars)

Here's a list of mildly interesting things about the C language that I learned mostly by consuming Clang's ASTs. Although surprises are getting sparser, I might continue to update this document over time.
There are many more mildly interesting features of C++, but the language is literally known for being weird, whereas C is usually considered smaller and simpler, so this is (almost) only about C.
1. Combined type and variable/field declaration, inside a struct scope [https://godbolt.org/g/Rh94Go]

struct foo {
   struct bar {
      int x;
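As a hedged illustration of this quirk: in C (unlike C++), a struct tag declared inside another struct is not scoped to the enclosing struct, so `struct bar` stays usable at file scope. The field name `b` and the function `bar_x_plus_one` below are hypothetical and are not taken from the gist:

```c
struct foo {
    struct bar {
        int x;
    } b;    /* declares the type struct bar and the field b in one go */
};

/* struct bar is visible here even though it was declared inside struct foo. */
int bar_x_plus_one(struct bar v)
{
    return v.x + 1;
}
```

Compiled as C++, the same declaration would make `bar` a member type that has to be named `foo::bar` outside of `foo`.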

  
## Matrix.md

      
nadavrot / Matrix.md (last active May 5, 2024 08:37; 7 files, 74 forks, 17 comments, 860 stars)

Efficient matrix multiplication

High-Performance Matrix Multiplication

This is a short post that explains how to write a high-performance matrix
multiplication program on modern processors. In this tutorial I will use a
single core of the Skylake-client CPU with AVX2, but the principles in this post
also apply to other processors with different instruction sets (such as AVX512).
Intro

Matrix multiplication is a mathematical operation that defines the product of

  
## avx_sigh.md

      
rygorous / avx_sigh.md (last active September 21, 2023 07:33; 1 file, 3 forks, 0 comments, 66 stars)

why doesn't radfft support AVX on PC?

So there's two separate issues here: using instructions added in AVX and using 256-bit wide vectors. The former turns out to be much easier than the latter for our use case.
Problem number 1 was that you positively need to put AVX code in a separate file with different compiler settings (/arch:AVX for VC++, -mavx for GCC/Clang) that make all SSE code emitted also use VEX encoding, and at the time radfft was written there was no way in CDep to set compiler flags for just one file, just for the overall build.
[There's the GCC "target" annotations on individual funcs, which in principle fix this, but I ran into nasty problems with this for several compiler versions, and VC++ has no equivalent, so we're not currently using that and just sticking with different compilation units.]
The other issue is to do with CPU power management.
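For readers unfamiliar with the per-function GCC "target" annotation mentioned above, here is a minimal sketch of what it looks like, assuming GCC or Clang on x86. The function names are invented for illustration, this is not radfft code, and (as the note above says) the approach is not what radfft currently uses:

```c
#include <stddef.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_TARGET_ATTR 1
#else
#define HAVE_X86_TARGET_ATTR 0
#endif

/* Baseline version, compiled with the file's default flags. */
static float sum_generic(const float *v, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

#if HAVE_X86_TARGET_ATTR
/* Only this function is compiled as if -mavx were given, so the
 * compiler may emit VEX-encoded instructions for it alone. */
__attribute__((target("avx")))
static float sum_avx(const float *v, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}
#endif

/* Pick the AVX version only when the running CPU reports AVX support. */
float sum_dispatch(const float *v, size_t n)
{
#if HAVE_X86_TARGET_ATTR
    if (__builtin_cpu_supports("avx"))
        return sum_avx(v, n);
#endif
    return sum_generic(v, n);
}
```

The alternative described above, a separate compilation unit built with `-mavx` or `/arch:AVX`, achieves the same separation at file granularity and also works with VC++.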

  
## ll.bnf
// ### [ Lexical part ] ########################################################

_ascii_letter_upper
	: 'A' - 'Z'
;

_ascii_letter_lower
	: 'a' - 'z'
;