Skip to content

Instantly share code, notes, and snippets.

The trick to designing transpose algorithms for both small and large problems is to recognize their simple recursive structure.

For a matrix A, let's denote its transpose by T(A) as a shorthand. First, suppose A is a 2x2 matrix:

    [A00 A01]
A = [A10 A11]

Then we have:

@pkhuong
pkhuong / signal-on-instruction-count.c
Created February 18, 2020 02:21
Trigger a unix signal based on the HW instruction counter PMC
#define RUN_ME /*
exec cc -g -ggdb -O2 -W -Wall -std=c99 $0 -o "$(basename $0 .c)"
*/
/*
* Copyright 2020 Paul Khuong
* SPDX-License-Identifier: BSD-2-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
@pkhuong
pkhuong / libbts.c
Last active October 18, 2022 09:01
minimal BTS tracing wrapper for linux perf
#define RUN_ME /*
exec cc -O2 -W -Wall -std=c99 -shared $0 -o "$(basename $0 .c).so" -fPIC
*/
/*
* Copyright 2019 Paul Khuong
* SPDX-License-Identifier: BSD-2-Clause
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
@travisdowns
travisdowns / cache-counters-rant.md
Created October 13, 2019 16:46
Discussion of x86 L1D related cache counters

The counters that are the easiest to understand and the best for making ratios that are internally consistent (i.e., always fall in the range 0.0 to 1.0) are the mem_load_retired events, e.g., mem_load_retired.l1_hit and mem_load_retired.l1_miss.

These count at the instruction level, i.e., the universe of retired instructions. For example, could make a reasonable hit ratio from mem_load_retired.l1_hit / mem_inst_retired.all_loads and it will be sane (never indicate a hit rate more than 100%, for example).

That one isn't perfect though, in that it may not reflect the true costs of cache misses and the behavior of the program for at least the following reasons:

  • It appplies only to loads and can't catch misses imposed by stores (AFAICT there is no event that counts store misses).
  • It only counts loads that retire - a lot of the load activity in your process may be due to loads on a speculative path that never retire. Loads on a speculative path may bring in data that is never used, causing misses and d

Multi-dimensional array views for systems programmers

As C programmers, most of us think of pointer arithmetic for multi-dimensional arrays in a nested way:

The address for a 1-dimensional array is base + x. The address for a 2-dimensional array is base + x + y*x_size for row-major layout and base + y + x*y_size for column-major layout. The address for a 3-dimensional array is base + x + (y + z*y_size)*x_size for row-column-major layout. And so on.

@fay59
fay59 / Quirks of C.md
Last active January 23, 2024 04:24
Quirks of C

Here's a list of mildly interesting things about the C language that I learned mostly by consuming Clang's ASTs. Although surprises are getting sparser, I might continue to update this document over time.

There are many more mildly interesting features of C++, but the language is literally known for being weird, whereas C is usually considered smaller and simpler, so this is (almost) only about C.

1. Combined type and variable/field declaration, inside a struct scope [https://godbolt.org/g/Rh94Go]

struct foo {
   struct bar {
 int x;
@nadavrot
nadavrot / Matrix.md
Last active May 5, 2024 08:37
Efficient matrix multiplication

High-Performance Matrix Multiplication

This is a short post that explains how to write a high-performance matrix multiplication program on modern processors. In this tutorial I will use a single core of the Skylake-client CPU with AVX2, but the principles in this post also apply to other processors with different instruction sets (such as AVX512).

Intro

Matrix multiplication is a mathematical operation that defines the product of

why doesn't radfft support AVX on PC?

So there's two separate issues here: using instructions added in AVX and using 256-bit wide vectors. The former turns out to be much easier than the latter for our use case.

Problem number 1 was that you positively need to put AVX code in a separate file with different compiler settings (/arch:AVX for VC++, -mavx for GCC/Clang) that make all SSE code emitted also use VEX encoding, and at the time radfft was written there was no way in CDep to set compiler flags for just one file, just for the overall build.

[There's the GCC "target" annotations on individual funcs, which in principle fix this, but I ran into nasty problems with this for several compiler versions, and VC++ has no equivalent, so we're not currently using that and just sticking with different compilation units.]

The other issue is to do with CPU power management.

@mewmew
mewmew / ll.bnf
Last active January 16, 2024 15:38
A BNF grammar for LLVM IR assembly
// ### [ Lexical part ] ########################################################
_ascii_letter_upper
: 'A' - 'Z'
;
_ascii_letter_lower
: 'a' - 'z'
;