Oscar Barenys oscarbg

## TinyGrad-notes.md

      
fxkamd / TinyGrad-notes.md (1 file, 0 forks, 14 comments, 9 stars). Last active April 26, 2024 15:34.

Observations about HSA and KFD backends in TinyGrad

This is Felix Kuehling, long-time KFD driver architect. I started looking into the TinyGrad source code yesterday, focusing on ops_kfd.py, ops_hsa.py and driver/hsa.py, to understand how TinyGrad talks to our HW and to help with the ongoing debugging effort from the top down. This analysis is based on this commit: https://github.com/tinygrad/tinygrad/tree/3de855ea50d72238deac14fc05cda2a611497778
I'm intrigued by the use of Python for low-level programming. I think I can learn something from your use of ctypes and clang2py for fast prototyping and test development. I want to share some observations based on my initial review.
ops_kfd looks pretty new, and based on my long experience working on KFD I see many problems with it. It is interesting, but probably not relevant to the most pressing problems at hand, so I'll cover it last.
ops_hsa uses ROCr APIs to manage GPU memory, create a user-mode AQL queue for GPU kernel dispatch, perform async SDMA copies, and do signal-based synchronization with barrier packets.

  
## benchmark_7900XTX_10142023.txt
(tf) root@rocm:~/tmp# python benchmark.py
2023-10-14 15:02:22.116047: E external/local_xla/xla/stream_executor/plugin_registry.cc:93] Invalid plugin kind specified: DNN
2023-10-14 15:02:22.348480: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI AVX512_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-14 15:02:23.756833: I external/local_xla/xla/stream_executor/rocm/rocm_gpu_executor.cc:787] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-14 15:02:23.982269: I external/local_xla/xla/stream_executor/rocm/rocm_gpu_executor.cc:787] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-10-14 15:02:23.9823

## simd_partition.metal
//  Created by Timothy Davison on 2023-06-21.
//
// This is a Metal implementation of subgroupPartitionNV. You use it to find a mask of
// the other threads in a simd-group with the same value (a partition of the simd-group about
// a set of values).
//
// Feel free to use this in your code. Please share any fixes or ideas to make it faster.
//
// Khronos docs on subgroup partitioning:
// - https://github.com/KhronosGroup/GLSL/blob/master/extensions/nv/GL_NV_shader_subgroup_partitioned.txt
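
The partitioning idea described in the comment above can be illustrated on the CPU. The following is only a minimal Python sketch of the semantics, not the Metal implementation from this gist: the function name `simd_partition` is hypothetical, each list index stands in for a lane of the simd-group, and it assumes (as in the Khronos extension) that a lane's own bit is included in its mask.

```python
def simd_partition(values):
    """Emulate subgroup partitioning on a list of per-lane values.

    Returns one integer bitmask per lane. Bit i of a lane's mask is set
    when lane i holds the same value as that lane (assumption: the lane's
    own bit is included, as in GL_NV_shader_subgroup_partitioned).
    """
    masks = {}
    # First pass: accumulate a bitmask of lanes for every distinct value.
    for lane, value in enumerate(values):
        masks[value] = masks.get(value, 0) | (1 << lane)
    # Second pass: every lane reads back the mask of its own value.
    return [masks[value] for value in values]


# Lanes 0 and 2 hold 7, lanes 1 and 3 hold 3.
lane_values = [7, 3, 7, 3]
for lane, mask in enumerate(simd_partition(lane_values)):
    print(f"lane {lane}: value={lane_values[lane]} mask={mask:04b}")
```

A GPU version has to obtain the same result with simd-group operations rather than a dictionary, which is where the performance questions raised in the comment come in.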

## sve2.md

      
zingaburga / sve2.md (1 file, 9 forks, 40 comments, 68 stars). Last active June 4, 2024 08:54.

ARM’s Scalable Vector Extensions: A Critical Look at SVE2 For Integer Workloads

Scalable Vector Extensions (SVE) is ARM’s latest SIMD extension to their instruction set, which was announced back in 2016.  A follow-up SVE2 extension was announced in 2019, designed to incorporate all functionality from ARM’s current primary SIMD extension, NEON (aka ASIMD).
Despite being announced 5 years ago, there is currently no generally available CPU which supports any form of SVE (which excludes the [Fugaku supercomputer](https://www.fujitsu.com/global/about/innovation/

  
## clpeak.txt
% ./clpeak
[mvk-info] MoltenVK version 1.1.5, supporting Vulkan version 1.1.189.
	The following 72 Vulkan extensions are supported:
		VK_KHR_16bit_storage v1
		VK_KHR_8bit_storage v1
		VK_KHR_bind_memory2 v1
		VK_KHR_create_renderpass2 v1
		VK_KHR_dedicated_allocation v3
		VK_KHR_depth_stencil_resolve v1
		VK_KHR_descriptor_update_template v1

## aarch64_amx.py
# IDA (disassembler) and Hex-Rays (decompiler) plugin for Apple AMX
#
# WIP research. (This was edited to add more info after someone posted it to
# Hacker News. Click "Revisions" to see full changes.)
#
# Copyright (c) 2020 dougallj


# Based on Python port of VMX intrinsics plugin:
# Copyright (c) 2019 w4kfu - Synacktiv

## QEMU_ON_M1.md

      
citruz / QEMU_ON_M1.md (2 files, 34 forks, 66 comments, 196 stars). Last active June 6, 2024 08:29.

Create Ubuntu and Windows VMs with QEMU on Apple Silicon

Running Linux and Windows on M1 with QEMU


30.11.2020: Updated with the new patchseries and instructions for Windows


02.12.2020: Added tweaks


08.12.2020: Updated with patchseries v4


31.01.2020: Updated with patchseries v6


## README.en.md

      
niw / README.en.md (2 files, 29 forks, 116 comments, 247 stars). Last active July 5, 2024 14:28.

How to run Windows 10 on ARM or Ubuntu for ARM64 in QEMU on Apple Silicon Mac

Here are easy steps to try Windows 10 on ARM or Ubuntu for ARM64
on your Apple Silicon Mac. Enjoy!

NOTE: This reflects the state as of 10/1/2021.

Running Windows 10 on ARM


Install Xcode from App Store or install Command Line Tools on your Mac


## isa.txt
platform: 7.5
ext: 7p5
name: HSW
1 add add 0x40 Addition
    0xfc0 u8 i8 u16 i16 u32 i32 , 0xfc0 u8 i8 u16 i16 u32 i32
    0x20000 f32 , 0xfc0 u8 i8 u16 i16 u32 i32
    0x20000 f32 , 0x20000 f32
    0x40000 f64 , 0x40000 f64
3 addc addc 0x4e Addition with Carry
    0x400 u32 , 0x400 u32

## b.bat
@echo off
setlocal
cd %~dp0
call vcvars amd64
..\..\bin\win32\nasm -f win64 -g -o histo_asm.obj histo_asm.nas || exit /b 1
cl /Zi /O2 /nologo histotest.cpp histo_asm.obj || exit /b 1