// Reasoning preservation test: Neuralwatt vs Moonshot AI (via OpenRouter).
// Now covers BOTH Kimi (preserve_thinking) and GLM (clear_thinking) families.
//
// Minimal modifications from the customer's original script:
// 1. Tests both K2.6 and GLM-5.1 (same probe, family-correct kwarg name).
// 2. Sends chat_template_kwargs.<kwarg> so we actually exercise the opt-in.
// 3. Runs N trials per variant (default 5) so a single sample landing on the
//    model's "refuse to repeat secret" mode doesn't cause a false negative.
// 4. Scores content AND reasoning separately so we can distinguish
//    "template didn't preserve it" (neither has 'arnold') from
#!/usr/bin/env python3
"""Compare reasoning preservation: Neuralwatt vs Moonshot AI (via OpenRouter).

Sends the same chat-completion request to two endpoints back-to-back and
reports how each handles a prior-turn assistant ``reasoning`` field on the
"no tool calls" + trailing-user shape (Variant B in our docs / your tests).

The two endpoints are configured to be as apples-to-apples as possible:

    Neuralwatt: POST https://api.neuralwatt.com/v1/chat/completions