
@bjacob
bjacob / README.md
Last active February 16, 2024 20:30
Attempt at ukernel fallback to codegen

This is to document a short-lived attempt at solving #15784 by implementing the idea laid out in the original issue description. It changes the mmt4d ukernel to return a second value, a status code, and changes the mmt4d-to-ukernel lowering to create an scf.if based on that status code:

%62:2 = iree_codegen.ukernel.generic "iree_uk_mmt4d" ins(%59, %60 : tensor<1x?x16x1xf32>, tensor<1x?x16x1xf32>) outs(%61 : tensor<1x1x16x16xf32>) (%c1, %c1, %dim, %c16_i32, %c16_i32, %c1_i32, %c1281_i32 : index, index, index, i32, i32, i32, i32) fn_def_attrs {hal.import.bitcode = true, hal.import.cconv = 1 : i32, hal.import.fields = ["processor_data"]} strided_outer_dims(1) -> tensor<1x1x16x16xf32>, i32
%63 = arith.cmpi eq, %62#1, %c0_i32 : i32
%64 = scf.if %63 -> (tensor<1x1x16x16xf32>) {
  scf.yield %62#0 : tensor<1x1x16x16xf32>
} else {
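The preview above cuts off inside the else branch, which would hold the codegen fallback. The control-flow pattern itself is simple; here is a minimal C sketch of the same idea, with entirely hypothetical names (not the real IREE ukernel API): a specialized kernel returns a status code, and the caller runs a generic path whenever the status is nonzero.

```c
#include <stddef.h>

/* Hypothetical status codes: 0 means the ukernel handled the tile. */
enum { UK_OK = 0, UK_UNSUPPORTED = 1 };

/* Stand-in for the specialized ukernel: here it only supports K divisible
 * by 4, and reports UK_UNSUPPORTED otherwise, without touching the
 * accumulator. */
static int mmt4d_ukernel(const float *lhs, const float *rhs, float *acc,
                         size_t k) {
  if (k % 4 != 0) return UK_UNSUPPORTED;
  for (size_t i = 0; i < k; ++i) *acc += lhs[i] * rhs[i];
  return UK_OK;
}

/* Generic path, playing the role of the codegen fallback. */
static void mmt4d_fallback(const float *lhs, const float *rhs, float *acc,
                           size_t k) {
  for (size_t i = 0; i < k; ++i) *acc += lhs[i] * rhs[i];
}

/* Mirrors the scf.if: keep the ukernel result when the status is 0,
 * otherwise recompute with the fallback. */
float mmt4d_with_fallback(const float *lhs, const float *rhs, float acc,
                          size_t k) {
  int status = mmt4d_ukernel(lhs, rhs, &acc, k);
  if (status != UK_OK) mmt4d_fallback(lhs, rhs, &acc, k);
  return acc;
}
```

The key property, as in the scf.if above, is that the fallback branch is only taken when the status operand says the ukernel declined the work.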
@bjacob
bjacob / README.md
Created January 29, 2024 22:04
Putting the "LLVM loop unrolling for ukernel bitcode" idea to rest

Problem statement

Microkernels have variants for various M0 tile sizes, such as M0 ∈ {1,2,4,8,16}, and sometimes for a few other similar parameters.

We need to generate microkernel code for each such variant, with some fully-unrollable for loops properly unrolled each time.

Currently this is done in microkernel source code, at the price of some boilerplate in the source and of inflated bitcode embedded into iree-compile. For instance, here is how we generate 5 tile-functions differing only in the M0 value: https://github.com/openxla/iree/blob/1c83020136b9d3d56da692036e5bbcb2b4586ebf/runtime/src/iree/builtins/ukernel/arch/x86_64/mmt4d_x86_64_avx512_base.c#L12-L61
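The boilerplate pattern in question can be sketched in plain C (illustrative names only, not the actual IREE ukernel API): a macro instantiates one function per M0 value, and because M0 is a compile-time constant in each instantiation, the compiler can fully unroll the M0 loop.

```c
/* One tile function per M0 value; the M0 loop bound is a compile-time
 * constant in each instantiation, so it is fully unrollable. The body is
 * simplified to a matrix-vector product for brevity. */
#define DEFINE_TILE_FUNC(M0)                                               \
  static void tile_func_##M0(const float *lhs, const float *rhs,           \
                             float *acc, int k) {                          \
    for (int i = 0; i < (M0); ++i)   /* unrollable: M0 is a constant */    \
      for (int j = 0; j < k; ++j)                                          \
        acc[i] += lhs[i * k + j] * rhs[j];                                 \
  }

DEFINE_TILE_FUNC(1)
DEFINE_TILE_FUNC(2)
DEFINE_TILE_FUNC(4)
DEFINE_TILE_FUNC(8)
DEFINE_TILE_FUNC(16)
```

The per-M0 duplication visible here is exactly the source boilerplate and bitcode inflation the problem statement refers to.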

@bjacob
bjacob / README.md
Last active April 19, 2024 02:31
IREE / MLIR / Linalg tutorial

Introduction

This tutorial is about IREE and MLIR, and specifically about the MLIR Linalg dialect.

What is MLIR?

MLIR is a programming language, but MLIR in itself is almost an empty shell. What it really provides is a framework for defining MLIR dialects, which are where the features come from.

@bjacob
bjacob / README.md
Last active January 23, 2024 15:55
%%{ init: {"theme": "neutral" } }%%
graph TD;
matmulontensors-- CPUMaterializeEncoding -->mmt4dontensors;
mmt4dontensors-- CPULowerToUKernels -->ukernelontensors;
ukernelontensors-- IREEComprehensiveBufferize -->ukernelonmemref;
ukernelonmemref-- LowerUKernelOpsToCalls -->ukernelcall;
ukernelcall-- ConvertToLLVM -->codegenll;
codegenll-->bitcodelinking;
genericsource-- clang -emit-llvm --> genericbitcode -- llvm-link --> ukernelbitcode;
@bjacob
bjacob / README.md
Last active January 22, 2024 20:11
%%{
  init: {
    'theme': 'base',
    'themeVariables': {
      'primaryColor': '#BB2528',
      'primaryTextColor': '#fff',
      'primaryBorderColor': '#7C0000',
      'lineColor': '#F8B229',
      'secondaryColor': '#006100',
@bjacob
bjacob / README.md
Last active February 29, 2024 17:18
Exploring IREE CPU microkernels on a simple matmul example

Basic setup, command lines

Source file: matmul.mlir:

func.func @matmul_dynamic(%lhs: tensor<?x?xf32>, %rhs: tensor<?x?xf32>, %acc: tensor<?x?xf32>) -> tensor<?x?xf32> {
  %result = linalg.matmul ins(%lhs, %rhs: tensor<?x?xf32>, tensor<?x?xf32>) outs(%acc: tensor<?x?xf32>) -> tensor<?x?xf32>
  return %result: tensor<?x?xf32>
}
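For reference, linalg.matmul accumulates into its outs operand; the semantics of the function above correspond to the following plain-C sketch (row-major layouts assumed):

```c
/* result = acc + lhs * rhs, matching linalg.matmul's accumulate-into-outs
 * semantics. lhs is MxK, rhs is KxN, acc is MxN, all row-major. */
void matmul_dynamic(int M, int N, int K, const float *lhs, const float *rhs,
                    float *acc) {
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n)
      for (int k = 0; k < K; ++k)
        acc[m * N + n] += lhs[m * K + k] * rhs[k * N + n];
}
```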
2024-01-12T08:35:37.345767-05:00 hocher kernel: [344097.857664] igc 0000:0c:00.0 eno1: NIC Link is Down
2024-01-12T08:39:20.513466-05:00 hocher kernel: [344321.027521] igc 0000:0c:00.0 eno1: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
2024-01-12T14:26:30.629809-05:00 hocher kernel: [365151.293589] igc 0000:0c:00.0 eno1: PCIe link lost, device now detached
2024-01-12T14:26:30.629824-05:00 hocher kernel: [365151.293600] ------------[ cut here ]------------
2024-01-12T14:26:30.629826-05:00 hocher kernel: [365151.293601] igc: Failed to read reg 0xc030!
2024-01-12T14:26:30.629827-05:00 hocher kernel: [365151.293641] WARNING: CPU: 12 PID: 221077 at drivers/net/ethernet/intel/igc/igc_main.c:6583 igc_rd32+0xa4/0xc0 [igc]
2024-01-12T14:26:30.629827-05:00 hocher kernel: [365151.293654] Modules linked in: tls xt_MASQUERADE xt_tcpudp xt_mark nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfcomm ccm snd_seq_dummy snd_hrtimer nf_tables libcrc32c nfnetlink cmac algif_hash algif_skciphe
@bjacob
bjacob / README.md
Last active January 9, 2024 16:49
Trying out IREE CPU code generation on some fusions

This is a tutorial on using IREE to compile a small workload, akin to a fusion of ops in a neural-network inference workload, to x86 CPU code, and on running and benchmarking it.

Before we start: MLIR, dialects, IREE front-ends.

IREE is an MLIR compiler. MLIR is a general framework, in which the actual operations are defined by specific dialects. We are going to focus on one specific dialect in this tutorial, Linalg, because in practice it is the common denominator that all front-ends (PyTorch, TensorFlow, TensorFlow Lite, ONNX...) ultimately go through.

So this tutorial is going to be about how to express some common workloads in Linalg, and about experimenting with IREE to compile these Linalg programs to x86 code and run them.

// -----// IR Dump Before LLVMCPUVectorLowering (iree-llvmcpu-vector-lowering) //----- //
func.func @repro_dispatch_0_generic_11008x32_i32xf32xf32xf32xf32xf32() {
%cst = arith.constant 0.000000e+00 : f32
%c0_i32 = arith.constant 0 : i32
%c1 = arith.constant 1 : index
%c8 = arith.constant 8 : index
%c32 = arith.constant 32 : index
%c0 = arith.constant 0 : index
%0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : memref<32x11008xi32, #hal.descriptor_type<storage_buffer>>
memref.assume_alignment %0, 64 : memref<32x11008xi32, #hal.descriptor_type<storage_buffer>>
define internal i32 @run_forward_dispatch_38_generic_11008x32_i32xf32xf32xf32xf32xf32(ptr noalias nonnull align 16 %0, ptr noalias nonnull align 16 %1, ptr noalias nonnull align 16 %2) #0 !dbg !1578 {
%4 = load %iree_hal_executable_dispatch_state_v0_t, ptr %1, align 8, !dbg !1579
%5 = extractvalue %iree_hal_executable_dispatch_state_v0_t %4, 9, !dbg !1579
%6 = load i32, ptr %5, align 4, !dbg !1579
%7 = getelementptr i32, ptr %5, i32 1, !dbg !1580
%8 = load i32, ptr %7, align 4, !dbg !1580
%9 = zext i32 %6 to i64, !dbg !1581
%10 = zext i32 %8 to i64, !dbg !1582
%11 = extractvalue %iree_hal_executable_dispatch_state_v0_t %4, 10, !dbg !1583
%12 = load ptr, ptr %11, align 8, !dbg !1583