Skip to content

Instantly share code, notes, and snippets.

View ccgibson's full-sized avatar

Chad Gibson ccgibson

View GitHub Profile
@ccgibson
ccgibson / reasoning_preservation_compare_v2.ts
Last active May 24, 2026 19:27
Reasoning preservation test — Neuralwatt vs Moonshot AI (via OpenRouter). 3 variants × N trials with chat_template_kwargs.preserve_thinking opt-in and content-vs-reasoning breakdown.
// Reasoning preservation test — Neuralwatt vs Moonshot AI (via OpenRouter).
// Now covers BOTH Kimi (preserve_thinking) and GLM (clear_thinking) families.
//
// Minimal modifications from the customer's original script:
// 1. Tests both K2.6 and GLM-5.1 (same probe, family-correct kwarg name).
// 2. Sends chat_template_kwargs.<kwarg> so we actually exercise the opt-in.
// 3. Runs N trials per variant (default 5) so a single sample landing on the
// model's "refuse to repeat secret" mode doesn't cause a false negative.
// 4. Scores content AND reasoning separately so we can distinguish
// "template didn't preserve it" (neither has 'arnold') from
@ccgibson
ccgibson / reasoning_preservation_compare.py
Created May 23, 2026 23:17
Reasoning preservation comparison: Neuralwatt vs Moonshot (via OpenRouter). K2.6 + preserve_thinking=true + Variant B.
#!/usr/bin/env python3
"""Compare reasoning preservation: Neuralwatt vs Moonshot AI (via OpenRouter).
Sends the same chat-completion request to two endpoints back-to-back and
reports how each handles a prior-turn assistant ``reasoning`` field on the
"no tool calls" + trailing-user shape (Variant B in our docs / your tests).
The two endpoints are configured to be as apples-to-apples as possible:
Neuralwatt: POST https://api.neuralwatt.com/v1/chat/completions