Assumes both USDOT_H and SDOT_H exist as single-cycle SVE2 hardware instructions. Counting only computation instructions (no loads/stores).
| Operation | USDOT_H (r1) | SDOT_H (r3) |
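How do I load the address of a label into a register in LoongArch assembly?
For example, I want to load the address of the label finish into register a4. In RISC-V it is written like this: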
test:
lla a4, finish
finish:
nop
| module.exports.parse = async ( | |
| { content, name, url }, | |
| { axios, yaml, notify } | |
| ) => { | |
| const extra = { | |
| proxies: [ | |
| {"name": "easyconn", "type": "socks5", "server": "127.0.0.1", "port": 1080}, | |
| {"name": "easyconn_http", "type": "http", "server": "127.0.0.1", "port": 8888} | |
| ], | |
| rules: [ |
| === x265 SVE2 8-tap Vertical Luma Filter (final) === | |
| File: source/common/aarch64/filter-prim-sve.cpp (#else // !HIGH_BIT_DEPTH, 223 lines) | |
| ================================================================ | |
| ① USDOT_H EMULATION (accumulate variant) | |
| ================================================================ | |
| 13 SVE2 instructions. Real USDOT_H: 1 instruction. | |
| static inline svint16_t usdot_h_sve(svint16_t acc, svuint8_t a, svint8_t b) | |
| // → svuzp1/2 + svunpklo + svmul+svadd → svadd_s16(acc, dot) |
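The emulation summarized above can be checked against a scalar reference. The following is a minimal sketch, assuming the 2-way semantics implied by the `svuzp1`/`svuzp2` split: each 16-bit lane `i` accumulates `a[2i]*b[2i] + a[2i+1]*b[2i+1]`, with `a` unsigned and `b` signed. The name `usdot_h_ref` and the `std::vector` interface are hypothetical and do not appear in the original file; USDOT_H itself is a hypothetical instruction here.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of the ASSUMED USDOT_H semantics (not taken from the x265 source):
// lane i of the result = acc[i] + a[2i]*b[2i] + a[2i+1]*b[2i+1],
// where a holds unsigned 8-bit values and b holds signed 8-bit values.
std::vector<int16_t> usdot_h_ref(const std::vector<int16_t>& acc,
                                 const std::vector<uint8_t>& a,
                                 const std::vector<int8_t>& b)
{
    std::vector<int16_t> out(acc.size());
    for (std::size_t i = 0; i < acc.size(); ++i) {
        // Widen before multiplying so the products cannot overflow 8 bits.
        const int32_t even = int32_t(a[2 * i]) * int32_t(b[2 * i]);
        const int32_t odd  = int32_t(a[2 * i + 1]) * int32_t(b[2 * i + 1]);
        // Narrow the sum back to the 16-bit lane width.
        out[i] = static_cast<int16_t>(int32_t(acc[i]) + even + odd);
    }
    return out;
}
```

Comparing the output of an intrinsic-based emulation against such a reference on random inputs is one way to validate the emulation, provided the assumed lane pairing matches it.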
| #else // !HIGH_BIT_DEPTH | |
| #include "constants.h" | |
| #include <arm_sve.h> | |
| namespace { | |
| // =================================================================== | |
| // USDOT_H EMULATION: unsigned u8 × signed s8 → s16, 2-way dot, accumulate | |
| // Simulation: 13 SVE2 insns. Real USDOT_H: 1 insn. |
| === x265 SVE2 vs NEON i8mm Compute Instruction Comparison === | |
| Target: interp_8tap_vert_pp_16x16, 8-bit pixels | |
| ================================================================ | |
| ① NEON i8mm per 4 output rows (16-wide block, disassembled from -O2 binary) | |
| ================================================================ | |
| Source: filter-neon-i8mm.cpp, compiled with g++ 15.2.0 -O2 -march=armv8.2-a+dotprod+i8mm | |
| Counted from actual disassembly of `interp8_vert_pp_i8mm<16,16>` hot inner loop. | |
| | Operation | Instruction | Count | Notes | |
| #include <cstring> | |
| #include <cstdio> | |
| #include <cstdint> | |
| #include <arm_sve.h> | |
| #define IF 6 | |
| namespace X{const int16_t gl[4][8]={{0,0,0,64,0,0,0,0},{-1,4,-10,58,17,-5,1,0},{-1,4,-11,40,40,-11,4,-1},{0,1,-5,17,58,-10,4,-1}};} | |
| static inline svint16_t uh(svint16_t ac,svuint8_t a,svint8_t b){ | |
| svuint8_t z=svdup_u8(0);svint8_t zs=svdup_s8(0); | |
| svuint16_t ae=svunpklo_u16(svuzp1_u8(a,z)),ao=svunpklo_u16(svuzp2_u8(a,z)); |
Kernel: interp_8tap_vert_pp_16x16 (8-tap vertical luma filter, pixel→pixel, 16×16 block)
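Counting method: dynamic instructions, i.e. every instruction actually executed during a single kernel call
Definition of "SIMD compute instruction": all SIMD instructions, including data movement (zip/uzp/unpk/permute/extract) and zero initialization (svdup). Excluded: memory access (load/store/svwhilelt), scalar (lea/shl/mov), RET, and loop branches.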
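The counting rule above can be expressed as a small filter over a dynamic trace. This is a sketch only: the `Kind` categories, the `Executed` record, and both function names are hypothetical, and each trace entry is assumed to have been classified by hand beforehand.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical categories mirroring the definition in the text.
enum class Kind { SimdArith, SimdMove, SimdZeroInit, Memory, Scalar, Return, LoopBranch };

// One dynamically executed instruction, already classified by hand.
struct Executed {
    std::string mnemonic;
    Kind kind;
};

// Counted: SIMD arithmetic, data movement, and zero initialization.
// Excluded: memory access, scalar instructions, RET, and loop branches.
bool is_simd_compute(Kind k)
{
    return k == Kind::SimdArith || k == Kind::SimdMove || k == Kind::SimdZeroInit;
}

// Count the SIMD compute instructions in one kernel invocation's trace.
std::size_t count_simd_compute(const std::vector<Executed>& trace)
{
    std::size_t n = 0;
    for (const Executed& e : trace) {
        if (is_simd_compute(e.kind)) {
            ++n;
        }
    }
    return n;
}
```

Because the trace is dynamic, an instruction inside a loop appears once per executed iteration, so the count reflects work actually done rather than static code size.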
| # Website: https://software.intel.com/content/www/us/en/develop/articles/pin-a-binary-instrumentation-tool-downloads.html | |
| # License: https://software.intel.com/sites/landingpage/pintool/pinlicense.txt | |
| # This snippet: https://gist.github.com/mrexodia/f61fead0108603d04b2ca0ab045e0952 | |
| # TODO: Linux support | |
| # Thanks to Francesco for showing me this method | |
| CPMAddPackage( | |
| NAME IntelPIN | |
| VERSION 3.18 | |
| URL https://software.intel.com/sites/landingpage/pintool/downloads/pin-3.18-98332-gaebd7b1e6-msvc-windows.zip |