Jake Taylor yupferris

## prefetch_and_hoist.c
// Example: Opcode dispatch in a bytecode VM. Assume the opcode case dispatching is mispredict heavy,
// and that pc, ins, next_ins, next_opcase are always in registers.

#define a ((ins >> 8) & 0xFF)
#define b ((ins >> 16) & 0xFF)
#define c ((ins >> 24) & 0xFF)

// Version 1: Synchronous instruction fetch and opcode dispatch. The big bottleneck is that given how light
// the essential work is for each opcode case (e.g. something like ADD is typical), you're dominated
// by the cost of the opcode dispatch branch mispredicts. When there's a mispredict, the pipeline restarts

## shift_dfa.md

      
pervognsen / shift_dfa.md, last active July 7, 2024 06:26

Shift-based DFAs

A traditional table-based DFA implementation looks like this:
uint8_t table[NUM_STATES][256];

uint8_t run(const uint8_t *start, const uint8_t *end, uint8_t state) {
    for (const uint8_t *s = start; s != end; s++)
        state = table[state][*s];
    return state;
}
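
As a concrete instance of the table-driven loop above, here is a minimal sketch with a made-up two-state DFA that tracks whether it has seen an even or odd number of `'a'` bytes. The `NUM_STATES` value, the `EVEN`/`ODD` state names, and the `build_parity_table` and `run_str` helpers are assumptions for illustration and are not part of the original gist.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical DFA: two states tracking the parity of 'a' bytes seen so far. */
enum { EVEN = 0, ODD = 1, NUM_STATES = 2 };

static uint8_t table[NUM_STATES][256];

/* Fill the transition table: every byte keeps the current state,
   except 'a', which flips it. */
static void build_parity_table(void) {
    for (int state = 0; state < NUM_STATES; state++) {
        for (int byte = 0; byte < 256; byte++)
            table[state][byte] = (uint8_t)state;
        table[state]['a'] = (uint8_t)(state ^ 1);
    }
}

/* Same loop as in the text: one dependent table load per input byte. */
static uint8_t run(const uint8_t *start, const uint8_t *end, uint8_t state) {
    for (const uint8_t *s = start; s != end; s++)
        state = table[state][*s];
    return state;
}

/* Hypothetical convenience wrapper: run the DFA over a C string from EVEN. */
static uint8_t run_str(const char *text) {
    assert(text != NULL);
    const uint8_t *p = (const uint8_t *)text;
    return run(p, p + strlen(text), EVEN);
}
```

After calling `build_parity_table()`, `run_str("banana")` ends in `ODD` because the input contains three `'a'` bytes. Note that each iteration's load depends on the previous iteration's result, so the loop is bound by load latency rather than throughput.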


## vhs.hlsl
/**
 * (c) 2021 FMS_Cat, MIT License
 * Original shader: https://www.shadertoy.com/view/MdffD7
 * I dumbass don't know what it says despite it's my own shader
 */

Texture2D shaderTexture;
SamplerState samplerState;

cbuffer PixelShaderSettings {

## lut-reading-materials.md

      
Ravenslofty / lut-reading-materials.md, last active April 19, 2024 15:40

FlowMap: An Optimal Technology Mapping Algorithm for Delay Optimization in Lookup-Table Based FPGA Designs

- the classic paper
- computes LUT mappings through maximum-flow
- produces optimal depth designs
- but needs major area recovery, e.g. flow-pack
- implemented in the Yosys flowmap pass.

On Area/Depth Trade-off in LUT-based FPGA Technology Mapping

- relaxing requirement of one logic gate to one LUT allows area recovery
- duplicates logic to produce more LUT mapping opportunities


## rast.c
// ---- triangle rasterizer

#define SUBPIXEL_SHIFT  8
#define SUBPIXEL_SCALE  (1 << SUBPIXEL_SHIFT)

static RADINLINE S64 det2x2(S32 a, S32 b, S32 c, S32 d)
{
   S64 r = (S64) a*d - (S64) b*c;
   return r >> SUBPIXEL_SHIFT;
}

## RISC-V.md

      
erincandescent / RISC-V.md, created July 25, 2019 23:32

Foreword

This document was originally written several years ago. At the time I was working as an execution core verification engineer at Arm. The following points are coloured heavily by working in and around the execution cores of various processors. Apply a pinch of salt; points contain varying degrees of opinion.
It is still my opinion that RISC-V could be much better designed; though I will also say that if I was building a 32 or 64-bit CPU today I'd likely implement the architecture to benefit from the existing tooling.
Mostly based upon the RISC-V ISA spec v2.0. Some updates have been made for v2.2.
Original Foreword: Some Opinion

The RISC-V ISA has pursued minimalism to a fault. There is a large emphasis on minimizing instruction count, normalizing encoding, etc. This pursuit of minimalism has resulted in false orthogonalities (such as reusing the same instruction for branches, calls and returns) and a requirement for superfluous instructions which impacts code density both in terms of size and

  
## 68000_instruction_timings.md

      
emoon / 68000_instruction_timings.md, created April 24, 2019 11:31

68000 instruction timings

When I started to write some pure 68000 code, I couldn't find a nice doc that covered all the opcodes I used with a nice diagram for the cycle counts, so I made one.
This info has been assembled / hacked up using this tool https://github.com/emoon/68k_documentation_gen with a bunch of manual work and some automation for generating the cycle tables.
If you find any errors in this (I'm sure there are plenty, but it has been useful for me), please contact me or, even better, do a PR :)
ABCD

Operation:      Source₁₀ + Destination₁₀ + X → Destination
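
To make the operation line concrete, here is a rough C sketch of packed-BCD addition with an extend bit. It assumes both operands hold valid packed BCD (each nibble in 0..9). The `abcd_add` name and the `carry_out` parameter are hypothetical, and the sketch models only the arithmetic on valid inputs, not the full condition-code behavior of the real instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sketch of ABCD-style arithmetic: adds two packed-BCD bytes plus the
   extend bit x (0 or 1). Assumes both inputs are valid packed BCD.
   Returns the packed-BCD result and stores the decimal carry in *carry_out. */
static uint8_t abcd_add(uint8_t src, uint8_t dst, unsigned x, unsigned *carry_out) {
    assert(x <= 1 && carry_out != NULL);

    unsigned lo = (src & 0x0Fu) + (dst & 0x0Fu) + x;  /* low digits plus extend */
    unsigned hi = (src >> 4) + (dst >> 4);            /* high digits */

    if (lo > 9) {        /* low digit went past 9: carry into the high digit */
        lo -= 10;
        hi += 1;
    }
    *carry_out = 0;
    if (hi > 9) {        /* high digit went past 9: decimal carry out */
        hi -= 10;
        *carry_out = 1;
    }
    return (uint8_t)((hi << 4) | lo);
}
```

For example, `abcd_add(0x19, 0x28, 0, &carry)` yields `0x47` with `carry` set to 0, matching 19 + 28 = 47 in decimal.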

  
## asan_clang_cl.md

      
pervognsen / asan_clang_cl.md, last active June 7, 2024 10:42

I was told by @mmozeiko that Address Sanitizer (ASAN) works on Windows now. I'd tried it a few years ago with no luck, so this was exciting news to hear.
It was a pretty smooth experience, but with a few gotchas I wanted to document.
First, download and run the LLVM installer for Windows: https://llvm.org/builds/
Then download and install the VS extension if you're a Visual Studio 2017 user like I am.
It's now very easy to use Clang to build your existing MSVC projects since there's a cl compatible frontend:

  
## ukkonen.rs
use std::str;

// Store positions in packed (u32) form; this limits us to under 4GB of
// payload but makes the data structures a bit more compact.
struct PackedPos(u32);

impl PackedPos {
    fn from(pos: usize) -> PackedPos {
        assert!(pos <= std::u32::MAX as usize);
        PackedPos(pos as u32)

## multidimensional_array_views.md

      
pervognsen / multidimensional_array_views.md, last active March 24, 2024 02:09

Multi-dimensional array views for systems programmers

As C programmers, most of us think of pointer arithmetic for multi-dimensional arrays in a nested way:
The address for a 1-dimensional array is base + x.
The address for a 2-dimensional array is base + x + y*x_size for row-major layout and base + y + x*y_size for column-major layout.
The address for a 3-dimensional array is base + x + (y + z*y_size)*x_size for row-column-major layout.
And so on.
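
The nested formulas above can be written out as small C helpers. This is a minimal sketch: the function names `index_1d`, `index_2d_row_major`, `index_2d_col_major`, `index_3d`, and `load_3d` are made up for illustration, and each index function returns an element offset to add to `base`.

```c
#include <assert.h>
#include <stddef.h>

/* Offsets are in elements, to be added to the array's base pointer. */

static size_t index_1d(size_t x) { return x; }

/* Row-major 2D: x varies fastest, rows are x_size elements apart. */
static size_t index_2d_row_major(size_t x, size_t y, size_t x_size) {
    return x + y * x_size;
}

/* Column-major 2D: y varies fastest, columns are y_size elements apart. */
static size_t index_2d_col_major(size_t x, size_t y, size_t y_size) {
    return y + x * y_size;
}

/* 3D, nested form: (y + z*y_size) picks the row, then x within that row. */
static size_t index_3d(size_t x, size_t y, size_t z,
                       size_t x_size, size_t y_size) {
    return x + (y + z * y_size) * x_size;
}

/* Hypothetical example: read element (x, y, z) of a flat float array. */
static float load_3d(const float *base, size_t x, size_t y, size_t z,
                     size_t x_size, size_t y_size) {
    assert(base != NULL && x < x_size && y < y_size);
    return base[index_3d(x, y, z, x_size, y_size)];
}
```

With `x_size = 4` and `y_size = 3`, element (1, 2, 1) sits at offset 1 + (2 + 1*3)*4 = 21. Each added dimension wraps the previous expression in one more multiply-add, which is why the nesting grows with rank.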