Chiro Liang chiro2001

## README.md

      
chiro2001 / README.md
Created May 7, 2026 10:54
x265 SVE2: USDOT_H vs SDOT_H instruction count comparison (real hardware)

x265 SVE2 8-tap Vertical Luma Filter: USDOT_H vs SDOT_H

Instruction Count Comparison (Real Hardware)

Assumes both USDOT_H and SDOT_H exist as single-cycle SVE2 hardware instructions.
Only computation instructions are counted (no loads/stores).

Per Dot-product Call (row pair × coefficient slot)

| Operation | USDOT_H (r1) | SDOT_H (r3) |

  
## 关于 LoongArch 汇编如何加载标号地址.md

      
chiro2001 / 关于 LoongArch 汇编如何加载标号地址.md
Created May 7, 2026 10:44

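How do you load the address of a label into a register in LoongArch assembly?
For example, to load the address of the label finish into register a4, the RISC-V version looks like this: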
test:
    lla a4, finish
finish:
    nop

  
## easyconnect-clash-redirect.js
module.exports.parse = async (
  { content, name, url },
  { axios, yaml, notify }
) => {
  const extra = {
    proxies: [
      {"name": "easyconn", "type": "socks5", "server": "127.0.0.1", "port": 1080},
      {"name": "easyconn_http", "type": "http", "server": "127.0.0.1", "port": 8888}
    ],
    rules: [

## core_final.txt
=== x265 SVE2 8-tap Vertical Luma Filter (final) ===
File: source/common/aarch64/filter-prim-sve.cpp (#else // !HIGH_BIT_DEPTH, 223 lines)

================================================================
① USDOT_H EMULATION (accumulate variant)
================================================================
13 SVE2 instructions. Real USDOT_H: 1 instruction.

static inline svint16_t usdot_h_sve(svint16_t acc, svuint8_t a, svint8_t b)
// → svuzp1/2 + svunpklo + svmul+svadd → svadd_s16(acc, dot)
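
The emulation above boils down to a 2-way dot product per 16-bit lane. As a minimal scalar sketch of that behavior, assuming each output lane `i` accumulates `a[2i]*b[2i] + a[2i+1]*b[2i+1]` with unsigned 8-bit `a` and signed 8-bit `b`, something like the following could serve as a reference when checking the SVE2 version. The name `usdot_h_ref` is hypothetical, and the wrap-around on 16-bit overflow is an assumption, since USDOT_H is only assumed to exist here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of the assumed 2-way USDOT_H semantics:
//   out[i] = acc[i] + a[2i] * b[2i] + a[2i+1] * b[2i+1]
// `a` holds unsigned 8-bit values, `b` signed 8-bit values, lanes are int16.
// usdot_h_ref is a hypothetical helper, not part of x265 or ACLE.
std::vector<int16_t> usdot_h_ref(const std::vector<int16_t>& acc,
                                 const std::vector<uint8_t>& a,
                                 const std::vector<int8_t>& b)
{
    assert(a.size() == b.size());
    assert(a.size() == 2 * acc.size());

    std::vector<int16_t> out(acc.size());
    for (std::size_t i = 0; i < acc.size(); ++i) {
        // Widen to 32 bits first so both products are exact.
        int32_t dot = int32_t(a[2 * i]) * int32_t(b[2 * i])
                    + int32_t(a[2 * i + 1]) * int32_t(b[2 * i + 1]);
        int32_t sum = int32_t(acc[i]) + dot;
        // Assumption: the 16-bit lane wraps on overflow.
        out[i] = static_cast<int16_t>(static_cast<uint16_t>(sum));
    }
    return out;
}
```

With the 8-tap luma coefficients used elsewhere in these files (largest magnitude 58), a single pair sum stays below 2 × 255 × 58 = 29580, so one call on a zero accumulator does not wrap.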

## k.cpp
#else // !HIGH_BIT_DEPTH

#include "constants.h"
#include <arm_sve.h>

namespace {

// ===================================================================
// USDOT_H EMULATION: u8×s8 → s16, 2-way dot, accumulate
// Simulation: 13 SVE2 insns.  Real USDOT_H: 1 insn.

## neon_vs_sve2_analysis.txt
=== x265 SVE2 vs NEON i8mm Compute Instruction Comparison ===
Target: interp_8tap_vert_pp_16x16, 8-bit pixels

================================================================
① NEON i8mm per 4 output rows (16-wide block, disassembled from -O2 binary)
================================================================
Source: filter-neon-i8mm.cpp, compiled with g++ 15.2.0 -O2 -march=armv8.2-a+dotprod+i8mm
Counted from actual disassembly of `interp8_vert_pp_i8mm<16,16>` hot inner loop.

| Operation | Instruction | Count | Notes |

## test_8_12.cpp
#include <cstring>
#include <cstdio>
#include <cstdint>
#include <arm_sve.h>
#define IF 6
namespace X{const int16_t gl[4][8]={{0,0,0,64,0,0,0,0},{-1,4,-10,58,17,-5,1,0},{-1,4,-11,40,40,-11,4,-1},{0,1,-5,17,58,-10,4,-1}};}

static inline svint16_t uh(svint16_t ac,svuint8_t a,svint8_t b){
    svuint8_t z=svdup_u8(0);svint8_t zs=svdup_s8(0);
    svuint16_t ae=svunpklo_u16(svuzp1_u8(a,z)),ao=svunpklo_u16(svuzp2_u8(a,z));

## avx2_vs_sve2_report.md

      
chiro2001 / avx2_vs_sve2_report.md
Last active May 7, 2026 11:13
AVX2 vs SVE2 Dynamic Instruction Count: interp_8tap_vert_pp_16x16

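AVX2 vs SVE2 vs SVE2P3: Dynamic Instruction Count Comparison

Kernel: interp_8tap_vert_pp_16x16 (8-tap vertical luma filter, pixel→pixel, 16×16 block)
Counting method: dynamic instructions, i.e. every instruction actually executed during a single kernel call
Definition of "SIMD compute instructions": all SIMD instructions, including data movement (zip/uzp/unpk/permute/extract) and zero initialization (svdup). Excluded: memory access (load/store/svwhilelt), scalar instructions (lea/shl/mov), RET, and loop branches.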

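For orientation, the kernel being counted can be written as a plain scalar loop. The sketch below is a reference model only, not the x265 source: it assumes the usual HEVC convention in which the 8 taps cover three rows above and four rows below the output row, with rounding offset 32 and a right shift of 6 (matching `#define IF 6` in test_8_12.cpp). The coefficient table is copied from that file; the names `kLumaFilter` and `interp_8tap_vert_pp_ref` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// 8-tap luma coefficients as listed in test_8_12.cpp (table gl).
static const int16_t kLumaFilter[4][8] = {
    {0, 0, 0, 64, 0, 0, 0, 0},
    {-1, 4, -10, 58, 17, -5, 1, 0},
    {-1, 4, -11, 40, 40, -11, 4, -1},
    {0, 1, -5, 17, 58, -10, 4, -1},
};

// Scalar pixel-to-pixel vertical filter for 8-bit samples.
// `src` points at the first output row; rows src[-3 * srcStride] through
// src[(height + 3) * srcStride] must be readable (assumed tap alignment).
void interp_8tap_vert_pp_ref(const uint8_t* src, std::ptrdiff_t srcStride,
                             uint8_t* dst, std::ptrdiff_t dstStride,
                             int width, int height, int coeffIdx)
{
    assert(coeffIdx >= 0 && coeffIdx < 4);
    const int16_t* c = kLumaFilter[coeffIdx];
    const int shift = 6;                  // each coefficient row sums to 64
    const int offset = 1 << (shift - 1);  // round to nearest

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int sum = 0;
            for (int t = 0; t < 8; ++t)
                sum += c[t] * src[(y + t - 3) * srcStride + x];
            int val = (sum + offset) >> shift;
            // Clip to the 8-bit pixel range.
            dst[y * dstStride + x] = uint8_t(std::min(std::max(val, 0), 255));
        }
    }
}
```

For a 16×16 block this inner loop runs 16 × 16 × 8 multiply-accumulates, which is the work that the SIMD variants compared in this report pack into vector instructions.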

## IntelPIN.cmake
# Website: https://software.intel.com/content/www/us/en/develop/articles/pin-a-binary-instrumentation-tool-downloads.html
# License: https://software.intel.com/sites/landingpage/pintool/pinlicense.txt
# This snippet: https://gist.github.com/mrexodia/f61fead0108603d04b2ca0ab045e0952
# TODO: Linux support

# Thanks to Francesco for showing me this method
CPMAddPackage(
    NAME IntelPIN
    VERSION 3.18
    URL https://software.intel.com/sites/landingpage/pintool/downloads/pin-3.18-98332-gaebd7b1e6-msvc-windows.zip