Dougall Johnson dougallj

## 0-readme.txt
Usage: evaluate a ternary bitwise function with the values a=0xf0, b=0xcc, c=0xaa.
On AVX-512 you can pass the result directly to VPTERNLOGD. On other platforms,
look up the value in the following tables to find a short, equivalent sequence of
operations.
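
As a minimal sketch of the usage described above, the table index can be computed by evaluating the function directly on the three constants. The helper name `ternary_imm` and the three example functions are illustrative assumptions, not part of the original tables.

```python
# Truth-table constants from the readme: across the 8 bit positions,
# a, b and c take every combination of input bits exactly once.
A, B, C = 0xF0, 0xCC, 0xAA

def ternary_imm(f):
    """Return the 8-bit table index for a bitwise function f(a, b, c).

    Hypothetical helper for illustration; masks to 8 bits because
    Python's ~ produces negative integers.
    """
    return f(A, B, C) & 0xFF

# Example functions chosen for illustration only:
bitselect = ternary_imm(lambda a, b, c: (a & b) | (~a & c))          # a ? b : c
xor3 = ternary_imm(lambda a, b, c: a ^ b ^ c)                        # three-way xor
majority = ternary_imm(lambda a, b, c: (a & b) | (a & c) | (b & c))  # 2-of-3 vote

print(hex(bitselect), hex(xor3), hex(majority))  # 0xca 0x96 0xe8
```

The resulting byte is the value to pass to VPTERNLOGD on AVX-512, or the key to look up in the tables on other platforms.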

For A64/SVE/Neon see https://gist.github.com/dougallj/10c3ffdbd07229db2cc8b0430d7ccd39

The tables here are:
* agx: "not" and all binary operations (as used in Apple GPUs, but possibly useful elsewhere):

## 0-readme.txt
Usage: evaluate a ternary bitwise function with the values a=0xf0, b=0xcc, c=0xaa.
On Intel you can pass the result directly to VPTERNLOGD. On A64, look up the value
in the following tables to find a short, equivalent sequence of operations.

Entries selected for throughput, not latency (though generally they seem to be
optimal for both).

I've only used a couple of entries and found them to be correct. Sorry if there are
errors. Note that SVE changed the operand order of bsl (why???), so that's svbsl.
Generally names are a mix between the opcodes and what I found readable (mostly

## a-readme.txt
Raw data. These were dumped from iPhones/iPads using wall-timers, not
perf-counters. They contain some likely issues and inconsistencies that
haven't been fully investigated. Mostly correct, but it's worth
double-checking anything odd. (For example, "TBL (two register table)"
can have better throughput than is listed sometimes, as can some other
three-operand SIMD things iirc.)

The goal is to find the fastest rate at which an instruction can run. If
there are multiple rows with the same label, the "correct" value is the
minimum. For example:
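
A minimal sketch of this minimum-per-label rule, assuming the raw data has already been parsed into `(label, value)` pairs. The function name `best_rates` and the rows below are made up for illustration and are not measurements from the data.

```python
def best_rates(rows):
    """Collapse repeated labels to their minimum measured value.

    rows: iterable of (label, value) pairs; the lowest value per label
    is treated as the "correct" one, per the rule described above.
    """
    best = {}
    for label, value in rows:
        if label not in best or value < best[label]:
            best[label] = value
    return best

# Hypothetical rows, not taken from the real dumps:
rows = [
    ("EXAMPLE_OP_A", 0.25),
    ("EXAMPLE_OP_A", 0.17),
    ("EXAMPLE_OP_B", 0.50),
]

print(best_rates(rows))
```

Taking the minimum fits wall-timer data because interference from the rest of the system can only make a run look slower, never faster.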

## aarch64_amx.py
# IDA (disassembler) and Hex-Rays (decompiler) plugin for Apple AMX
#
# WIP research. (This was edited to add more info after someone posted it to
# Hacker News. Click "Revisions" to see full changes.)
#
# Copyright (c) 2020 dougallj


# Based on Python port of VMX intrinsics plugin:
# Copyright (c) 2019 w4kfu - Synacktiv

## gist:9211fd24c3759f7f340dede28929c659
N_BITS = 8
MASK = (1 << N_BITS) - 1

class Ternary:
    def __init__(self, ones, unknowns):
        self.ones = ones & MASK
        self.unknowns = unknowns & MASK
        assert (self.ones & self.unknowns) == 0, (bin(self.ones), bin(self.unknowns))

    def __add__(self, other):

## asm.s
global _time_load
global _cache_flush
global _run_attempt

extern _bools
extern _values
extern _pointers

section .text

## draw-patterns.c
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image_write.h"

#define WIDTH_IN_BLOCKS 29
#define HEIGHT_IN_BLOCKS 28

#define PADDING 4

#define BLOCK_WIDTH (4 * 4)
#define BLOCK_HEIGHT (4 * 4)