Chiro Liang chiro2001

chiro2001 / README.md
Created May 7, 2026 10:54
x265 SVE2: USDOT_H vs SDOT_H instruction count comparison (real hardware)

x265 SVE2 8-tap Vertical Luma Filter: USDOT_H vs SDOT_H

Instruction Count Comparison (Real Hardware)

This comparison assumes that both USDOT_H and SDOT_H exist as single-cycle SVE2 hardware instructions. Only computation instructions are counted; loads and stores are excluded.
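
To make the assumed USDOT_H semantics concrete, here is a minimal scalar sketch; it is not the author's kernel. It assumes that each 16-bit lane accumulates the two products of one adjacent pair of unsigned 8-bit and signed 8-bit elements, and that the result wraps modulo 2^16 rather than saturating. The name `usdot_h_ref` and the `std::vector` interface are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of the ASSUMED USDOT_H behaviour (hypothetical helper):
//   out[i] = acc[i] + a[2i] * b[2i] + a[2i+1] * b[2i+1]
// with `a` unsigned 8-bit, `b` signed 8-bit, and the sum reduced modulo 2^16.
std::vector<int16_t> usdot_h_ref(const std::vector<int16_t>& acc,
                                 const std::vector<uint8_t>& a,
                                 const std::vector<int8_t>& b)
{
    // Two 8-bit elements feed each 16-bit accumulator lane.
    assert(a.size() == b.size() && a.size() == 2 * acc.size());

    std::vector<int16_t> out(acc.size());
    for (std::size_t i = 0; i < acc.size(); ++i) {
        // Widen before multiplying so the products themselves cannot overflow.
        const int32_t dot = int32_t(a[2 * i]) * int32_t(b[2 * i]) +
                            int32_t(a[2 * i + 1]) * int32_t(b[2 * i + 1]);
        // Add in unsigned arithmetic, then truncate to 16 bits (wrap-around).
        const uint32_t wrapped = uint32_t(int32_t(acc[i])) + uint32_t(dot);
        out[i] = static_cast<int16_t>(static_cast<uint16_t>(wrapped));
    }
    return out;
}
```

A model like this can serve as a reference when checking an emulation sequence lane by lane. If the real instruction saturates or pairs elements differently, the model has to change accordingly.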

Per Dot-product Call (row pair × coefficient slot)

| Operation | USDOT_H (r1) | SDOT_H (r3) |

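How do you load the address of a label into a register in LoongArch assembly?

For example, to load the address of the label finish into register a4, the RISC-V version looks like this: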

test:
    lla a4, finish
finish:
    nop
chiro2001 / easyconnect-clash-redirect.js
Created May 7, 2026 10:44
Clash Mixin scripts for easyconnect
module.exports.parse = async (
{ content, name, url },
{ axios, yaml, notify }
) => {
const extra = {
proxies: [
{"name": "easyconn", "type": "socks5", "server": "127.0.0.1", "port": 1080},
{"name": "easyconn_http", "type": "http", "server": "127.0.0.1", "port": 8888}
],
rules: [
chiro2001 / core_final.txt
Created May 7, 2026 10:44
x265 SVE2 luma filter v12: narrow_mz2_sim + deinterleave, pure SVE2
=== x265 SVE2 8-tap Vertical Luma Filter (final) ===
File: source/common/aarch64/filter-prim-sve.cpp (#else // !HIGH_BIT_DEPTH, 223 lines)
================================================================
① USDOT_H EMULATION (accumulate variant)
================================================================
13 SVE2 instructions. Real USDOT_H: 1 instruction.
static inline svint16_t usdot_h_sve(svint16_t acc, svuint8_t a, svint8_t b)
// → svuzp1/2 + svunpklo + svmul+svadd → svadd_s16(acc, dot)
chiro2001 / core_final.txt
Created May 7, 2026 10:44
x265 SVE2 usdot+sdot dual kernel (pure SVE2, VL=256, verified 8/8)
=== x265 SVE2 8-tap Vertical Luma Filter (final) ===
File: source/common/aarch64/filter-prim-sve.cpp (#else // !HIGH_BIT_DEPTH, 223 lines)
================================================================
① USDOT_H EMULATION (accumulate variant)
================================================================
13 SVE2 instructions. Real USDOT_H: 1 instruction.
static inline svint16_t usdot_h_sve(svint16_t acc, svuint8_t a, svint8_t b)
// → svuzp1/2 + svunpklo + svmul+svadd → svadd_s16(acc, dot)
chiro2001 / k.cpp
Created May 7, 2026 10:44
x265 SVE2 usdot+sdot pure sim (v3: bias external, 8/8 verified)
#else // !HIGH_BIT_DEPTH
#include "constants.h"
#include <arm_sve.h>
namespace {
// ===================================================================
// USDOT_H EMULATION: unsigned u8 × signed s8 → s16, 2-way dot, accumulate
// Simulation: 13 SVE2 insns. Real USDOT_H: 1 insn.
chiro2001 / neon_vs_sve2_analysis.txt
Created May 7, 2026 10:43
x265 SVE2 vs NEON i8mm compute instruction comparison (interp_8tap_vert_pp_16x16)
=== x265 SVE2 vs NEON i8mm Compute Instruction Comparison ===
Target: interp_8tap_vert_pp_16x16, 8-bit pixels
================================================================
① NEON i8mm per 4 output rows (16-wide block, disassembled from -O2 binary)
================================================================
Source: filter-neon-i8mm.cpp, compiled with g++ 15.2.0 -O2 -march=armv8.2-a+dotprod+i8mm
Counted from actual disassembly of `interp8_vert_pp_i8mm<16,16>` hot inner loop.
| Operation | Instruction | Count | Notes |
chiro2001 / test_8_12.cpp
Created May 7, 2026 10:43
SVE2 unroll comparison: 4-row, 8-row, 12-row, 16-row variants (all 4/4 PASS)
#include <cstring>
#include <cstdio>
#include <cstdint>
#include <arm_sve.h>
#define IF 6
namespace X{const int16_t gl[4][8]={{0,0,0,64,0,0,0,0},{-1,4,-10,58,17,-5,1,0},{-1,4,-11,40,40,-11,4,-1},{0,1,-5,17,58,-10,4,-1}};}
static inline svint16_t uh(svint16_t ac,svuint8_t a,svint8_t b){
svuint8_t z=svdup_u8(0);svint8_t zs=svdup_s8(0);
svuint16_t ae=svunpklo_u16(svuzp1_u8(a,z)),ao=svunpklo_u16(svuzp2_u8(a,z));
chiro2001 / avx2_vs_sve2_report.md
Last active May 7, 2026 11:13
AVX2 vs SVE2 Dynamic Instruction Count: interp_8tap_vert_pp_16x16

AVX2 vs SVE2 vs SVE2P3: Dynamic Instruction Count Comparison

Kernel: interp_8tap_vert_pp_16x16 (8-tap vertical luma filter, pixel→pixel, 16×16 block)
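
For orientation, the arithmetic that such a kernel performs can be written as a scalar loop. The following is a minimal sketch under stated assumptions, not the x265 source: the coefficient table and the shift of 6 are taken from the test_8_12.cpp preview, the rounding offset of 32 and the clamp to 0..255 are assumed for 8-bit pixels, and the function name `interp_8tap_vert_pp_ref` and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Luma coefficient table as listed in the test_8_12.cpp preview.
static const int16_t kLumaFilter[4][8] = {
    { 0, 0,   0, 64,  0,   0, 0,  0},
    {-1, 4, -10, 58, 17,  -5, 1,  0},
    {-1, 4, -11, 40, 40, -11, 4, -1},
    { 0, 1,  -5, 17, 58, -10, 4, -1}};

// Scalar sketch of an 8-tap vertical pixel-to-pixel filter (hypothetical name).
// `src` points at the first row of the block and must have 3 readable rows
// above it and 4 readable rows below the last block row.
void interp_8tap_vert_pp_ref(const uint8_t* src, std::ptrdiff_t srcStride,
                             uint8_t* dst, std::ptrdiff_t dstStride,
                             int width, int height, int coeffIdx)
{
    assert(coeffIdx >= 0 && coeffIdx < 4);
    const int16_t* c = kLumaFilter[coeffIdx];
    const int shift = 6;                  // matches "#define IF 6" in the preview
    const int offset = 1 << (shift - 1);  // assumed round-to-nearest offset

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int sum = 0;
            // Taps cover rows y-3 .. y+4 of the same column.
            for (int t = 0; t < 8; ++t)
                sum += int(src[(y + t - 3) * srcStride + x]) * c[t];
            const int val = (sum + offset) >> shift;
            dst[y * dstStride + x] = uint8_t(std::min(255, std::max(0, val)));
        }
    }
}
```

For a 16×16 block this inner body runs 256 times with 8 multiply-accumulates each, which is the work the SIMD variants compress into dot-product or multiply-add instructions.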

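Counting method: dynamic instructions, i.e. every instruction actually executed during a single kernel call.

Definition of "SIMD compute instruction": all SIMD instructions, including data movement (zip/uzp/unpk/permute/extract) and zero initialization (svdup). Excluded: memory access (load/store/svwhilelt), scalar instructions (lea/shl/mov), RET, and loop branches.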


chiro2001 / IntelPIN.cmake
Created November 17, 2022 12:14 — forked from mrexodia/IntelPIN.cmake
IntelPIN.cmake
# Website: https://software.intel.com/content/www/us/en/develop/articles/pin-a-binary-instrumentation-tool-downloads.html
# License: https://software.intel.com/sites/landingpage/pintool/pinlicense.txt
# This snippet: https://gist.github.com/mrexodia/f61fead0108603d04b2ca0ab045e0952
# TODO: Linux support
# Thanks to Francesco for showing me this method
CPMAddPackage(
NAME IntelPIN
VERSION 3.18
URL https://software.intel.com/sites/landingpage/pintool/downloads/pin-3.18-98332-gaebd7b1e6-msvc-windows.zip