Nathan Moin Vaziri nmoinvaz

  • Los Angeles, California
nmoinvaz / zlib-ng-pr2291-strstart-lookahead-struct-experiment.md
Created May 12, 2026 06:44
zlib-ng PR #2291 struct-local experiment for strstart/lookahead

zlib-ng PR #2291 — struct-local experiment for strstart/lookahead

Follow-up to PR #2291 (nv/develop/deflate-strategy-locals-hoist). After the scalar-local hoist landed, tried two variants in deflate_quick.c to see whether packing the two uint32_t fields into a struct could coax the compiler into 64-bit load/store transfers.
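The struct-packing idea can be sketched as follows. This is a minimal illustration, not the code from either variant in the gist: `pos_pair`, `mini_state`, and the helper names are hypothetical, and whether a given compiler actually merges the two 32-bit accesses into one 64-bit load/store depends on the target and optimization level.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pair type: two adjacent uint32_t fields, 8 bytes total,
 * so a single 64-bit transfer could cover both. */
typedef struct {
    uint32_t strstart;
    uint32_t lookahead;
} pos_pair;

/* Hypothetical stand-in for deflate_state; only what the sketch needs. */
typedef struct {
    int other_state;
    pos_pair pos;
} mini_state;

/* Copy both fields into a local with one struct assignment. */
static inline pos_pair pos_load(const mini_state *s) {
    return s->pos;
}

/* Write both fields back with one struct assignment. */
static inline void pos_store(mini_state *s, pos_pair p) {
    s->pos = p;
}

/* Consume n bytes of lookahead, touching only the local copy. */
static inline pos_pair pos_advance(pos_pair p, uint32_t n) {
    assert(n <= p.lookahead);
    p.strstart += n;
    p.lookahead -= n;
    return p;
}
```

A caller would load the pair once before the inner loop, advance the local copy inside it, and store it back before any call that reads or mutates the fields through the state pointer.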

Machine

  • Apple M5, macOS 26.4.1, Apple clang 21.0.0
  • AArch64 native, x86_64 cross-compile via CMAKE_OSX_ARCHITECTURES=x86_64
  • CMake Release, -D BUILD_SHARED_LIBS=OFF
nmoinvaz / zlib-ng-deflate-struct-hoist-benefits.md
Last active May 12, 2026 04:33
zlib-ng deflate-struct-hoist: codegen and benchmark analysis

zlib-ng: deflate-struct-hoist — codegen and benchmark analysis

Branch: nv/develop/deflate-struct-hoist
Commit: c435d01e, "Hoist deflate bit-emit state into deflate_emit_hot local"
Base: upstream/develop (48087450)

Change

Introduces a deflate_emit_hot struct (bi_buf, bi_valid, pending_buf, pending) and DEFLATE_EMIT_HOT_LOAD/STORE macros that cache bit-emit state in a local at the top of hot loops and write it back once on exit. Converts put_byte/put_short/put_short_msb/put_uint32/put_uint32_msb/put_uint64 from macros to static inline functions taking a deflate_emit_hot *. Cold-path callers in deflate.c (zlib/gzip header and trailer writers, name/comment loops) and deflate_stored.c bracket their put_* clusters with LOAD/STORE.
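The load/emit/store pattern described above might look roughly like this. It is a sketch under assumptions: the field names come from the description, but the field types, the macro argument order, the `mini_deflate_state` stand-in, and the `write_two_fields` caller are all hypothetical and are not copied from the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for deflate_state; only the bit-emit fields. */
typedef struct {
    uint64_t bi_buf;
    int32_t bi_valid;
    unsigned char *pending_buf;
    uint32_t pending;
} mini_deflate_state;

/* Local cache of the bit-emit state (field types are assumptions). */
typedef struct {
    uint64_t bi_buf;
    int32_t bi_valid;
    unsigned char *pending_buf;
    uint32_t pending;
} deflate_emit_hot;

/* Assumed macro shape: (state pointer, local struct). */
#define DEFLATE_EMIT_HOT_LOAD(s, h) do { \
    (h).bi_buf = (s)->bi_buf; \
    (h).bi_valid = (s)->bi_valid; \
    (h).pending_buf = (s)->pending_buf; \
    (h).pending = (s)->pending; \
} while (0)

/* pending_buf itself never changes here, so it is not written back. */
#define DEFLATE_EMIT_HOT_STORE(s, h) do { \
    (s)->bi_buf = (h).bi_buf; \
    (s)->bi_valid = (h).bi_valid; \
    (s)->pending = (h).pending; \
} while (0)

/* Inline functions operating on the local cache instead of the state. */
static inline void put_byte(deflate_emit_hot *h, unsigned char c) {
    h->pending_buf[h->pending++] = c;
}

/* Little-endian 16-bit write, as in zlib's put_short. */
static inline void put_short(deflate_emit_hot *h, uint16_t w) {
    put_byte(h, (unsigned char)(w & 0xff));
    put_byte(h, (unsigned char)(w >> 8));
}

/* Hypothetical cold-path caller bracketing a put_* cluster. */
static void write_two_fields(mini_deflate_state *s, unsigned char b, uint16_t w) {
    deflate_emit_hot h;
    assert(s->pending_buf != NULL);
    DEFLATE_EMIT_HOT_LOAD(s, h);
    put_byte(&h, b);
    put_short(&h, w);
    DEFLATE_EMIT_HOT_STORE(s, h);
}
```

Because the put_* helpers only touch the local struct, the compiler is free to keep its fields in registers between the LOAD and the STORE.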

nmoinvaz / zlib-ng-benchmark-chunkmemset.cc
Created May 11, 2026 22:22
zlib-ng: Microbenchmark for chunkmemset_safe variants (compares byte-by-byte tail vs widened bit-decomposed stores)
/* benchmark_chunkmemset.cc -- benchmark chunkmemset_safe variants
* Copyright (C) 2026 Nathan Moinvaziri
* For conditions of distribution and use, see copyright notice in zlib.h
*/
#include <benchmark/benchmark.h>
extern "C" {
# include "zbuild.h"
# include "zutil_p.h"
@nmoinvaz
nmoinvaz / benchmark_dist1.cc
Last active May 10, 2026 06:51
zlib-ng PR #2286: microbenchmark of memset replacements for the dist=1 path in CHUNKMEMSET
/* benchmark_dist1.cc -- compare strategies for the dist=1 path in CHUNKMEMSET */
#include <benchmark/benchmark.h>
extern "C" {
# include "zbuild.h"
# include "zutil.h"
}
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
@nmoinvaz
nmoinvaz / deflate_quick.patch
Last active May 8, 2026 07:55
zlib-ng: hoist strstart and lookahead in deflate_quick (~12% level 1)
diff --git a/deflate_quick.c b/deflate_quick.c
index 6b84388e..41086525 100644
--- a/deflate_quick.c
+++ b/deflate_quick.c
@@ -49,13 +49,19 @@ Z_INTERNAL block_state deflate_quick(deflate_state *s, int flush) {
unsigned char *window;
unsigned last = (flush == Z_FINISH) ? 1 : 0;
+ /* Hold strstart and lookahead in registers across the inner loop.
+ * Sync to s before fill_window (which mutates them) and quick_*_block
@nmoinvaz
nmoinvaz / zlib-ng-insert-string-roll-spill.md
Created May 8, 2026 06:21
zlib-ng: eliminate s->ins_h spill in insert_string_roll inner loop

zlib-ng: eliminate s->ins_h spill in insert_string_roll inner loop

Background

Inspecting AArch64 disassembly of zlib-ng's level-9 deflate path showed that the rolling-hash insert loop emitted a redundant store of the running hash to memory on every byte processed. The cause was visible in the C source and the fix is small and self-contained.
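For illustration only, here is one common way a per-byte store of a running hash arises, and the usual remedy. The preview does not include the actual root cause or patch, so everything below is an assumption: `mini_state`, the hash constants, and both function names are hypothetical, and the hash formula is a generic shift-xor-mask rolling hash rather than zlib-ng's exact definition.

```c
#include <stdint.h>

/* Illustrative sizes, not zlib-ng's real configuration. */
#define HASH_BITS  10u
#define HASH_SIZE  (1u << HASH_BITS)
#define HASH_MASK  (HASH_SIZE - 1u)
#define HASH_SLIDE 5u

/* Hypothetical stand-in for deflate_state. */
typedef struct {
    uint32_t ins_h;          /* running hash */
    const uint8_t *window;   /* input bytes */
    uint16_t *head;          /* hash table: hash -> most recent position */
} mini_state;

/* Updates s->ins_h through the pointer on every byte. The byte-typed
 * window read may alias any object, so the compiler may have to
 * complete the store to s->ins_h each iteration. */
static void insert_roll_spill(mini_state *s, uint32_t str, uint32_t count) {
    for (uint32_t i = 0; i < count; i++) {
        s->ins_h = ((s->ins_h << HASH_SLIDE) ^ s->window[str + i]) & HASH_MASK;
        s->head[s->ins_h] = (uint16_t)(str + i);
    }
}

/* Keeps the running hash in a local and writes it back once on exit. */
static void insert_roll_local(mini_state *s, uint32_t str, uint32_t count) {
    uint32_t h = s->ins_h;
    for (uint32_t i = 0; i < count; i++) {
        h = ((h << HASH_SLIDE) ^ s->window[str + i]) & HASH_MASK;
        s->head[h] = (uint16_t)(str + i);
    }
    s->ins_h = h;
}
```

Both functions leave the state in the same condition; the second simply gives the compiler a value it can hold in a register for the whole loop.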

Root cause

nmoinvaz / zlib-ng-pr2281-cache-v2-results.md
Last active May 7, 2026 23:58
zlib-ng PR #2281 — apt-cache v2 results: composite actions + matrix.packages gate

zlib-ng PR #2281 — apt-cache v2 results

PR: zlib-ng/zlib-ng#2281, "[CI] Cache Ubuntu .deb packages to speed up installing dependencies."

Follow-up to https://gist.github.com/nmoinvaz/978f248ea7d528c954e3d52fb8dc99c0 — measuring the v2 design after the author refactored into composite actions and adopted the matrix.packages gate.

TL;DR

v2 is a clean win across the board. Total apt-install time across Ubuntu jobs is now 71% lower than on develop, compared with 47% lower in v1. Default-only jobs no longer pay cache overhead, and warm cache hits skip the write-back step entirely.

nmoinvaz / zlib-ng-pr2281-cache-default-packages.md
Created May 7, 2026 18:47
zlib-ng PR #2281 — apt-cache analysis: skip cache on default-package Ubuntu jobs

zlib-ng PR #2281 — apt-cache analysis and recommendation

PR: zlib-ng/zlib-ng#2281, "[CI] Cache Ubuntu .deb packages to speed up installing dependencies."

TL;DR

The cache works: it cuts total apt-install time across Ubuntu jobs roughly in half. But cmake.yml enables the cache on every Ubuntu job, including the 23 jobs that only install the default libgtest-dev libbenchmark-dev package set. For those, cache-action overhead exceeds the savings. configure.yml already gates on matrix.packages; mirroring that gate in cmake.yml is a one-line fix.

Methodology

nmoinvaz / zlib-ng-check-lens-bench-results.md
Created April 23, 2026 00:58
zlib-ng zng_check_lens: SIMD vs SWAR vs scalar benchmark (spun out of PR #2267)

zng_check_lens: SIMD vs SWAR vs scalar

Validity-check benchmark for the zng_check_lens(lens, codes) function proposed in the PR #2267 discussion. All three variants scan lens[0..codes-1] and return -1 if any entry exceeds MAX_BITS (15). Input is all-valid (random values in [0, 15]) so the worst case — a full scan with no early exit — is measured.
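The check itself is simple enough to sketch in scalar and SWAR form. The versions below are illustrative and are not the benchmarked implementations: the `uint16_t` element type, the return value of 0 on success, and both function names are assumptions. They accumulate over the whole input without an early exit, matching the full-scan case described above.

```c
#include <stdint.h>
#include <string.h>

#define MAX_BITS 15

/* Scalar sketch: OR every entry together, then test once. */
static int check_lens_scalar(const uint16_t *lens, unsigned codes) {
    uint16_t acc = 0;
    for (unsigned i = 0; i < codes; i++)
        acc |= lens[i];
    return (acc > MAX_BITS) ? -1 : 0;
}

/* SWAR sketch: process four 16-bit entries per 64-bit word.
 * An entry exceeds 15 exactly when any of its bits 4..15 is set, so
 * OR-ing all words and masking with 0xFFF0 per lane detects it.
 * The mask is identical in every lane, so byte order does not matter. */
static int check_lens_swar(const uint16_t *lens, unsigned codes) {
    const uint64_t high_bits = UINT64_C(0xFFF0FFF0FFF0FFF0);
    uint64_t acc = 0;
    unsigned i = 0;

    for (; i + 4 <= codes; i += 4) {
        uint64_t v;
        memcpy(&v, lens + i, sizeof(v));  /* unaligned-safe 8-byte read */
        acc |= v;
    }
    /* Remaining 0..3 entries land in the low lane of the accumulator. */
    for (; i < codes; i++)
        acc |= lens[i];

    return (acc & high_bits) ? -1 : 0;
}
```

OR-accumulation works here because the test only asks whether any high bit is set anywhere, so individual values never need to be compared.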

Variants:

nmoinvaz / zlib-ng-count-lengths-swar-results.md
Last active April 23, 2026 01:18
zlib-ng count_lengths: SWAR vs SIMD benchmark investigation (spun out of PR #2267)

count_lengths: SWAR vs SIMD

Investigation spun out of the PR #2267 discussion on zlib-ng: can the SIMD paths in count_lengths (inftrees.c) be replaced with a SWAR implementation using zng_memread_8?

SWAR design

Mirror the pair-interleaved 8-bit-lane structure of the active SIMD path. Two pairs of uint64_t accumulators (s1_lo/s1_hi,