@davidsvaughn
davidsvaughn / threshold_tuning.md
Created August 19, 2025 13:38
Threshold Tuning in vLLM : 2 Hacks

Yes: you can keep a plain CausalLM behind vLLM and still get a tunable yes/no cutoff with zero architecture changes. There are two clean patterns that work with the OpenAI-compatible server:


Option A (recommended): Score → threshold in your client

  1. Ask vLLM for the next-token logprobs only (no real decoding: set `max_tokens=1`, `temperature=0`).
  2. Read the logprobs of the "yes" token and the "no" token at that step.
  3. Use a threshold on the logprob difference $\Delta = \log p(\text{yes}) - \log p(\text{no})$.
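The three steps above can be sketched with the OpenAI Python client against a vLLM server. The `base_url`, `model` name, and the single-token spellings `" yes"`/`" no"` are assumptions here; check your tokenizer for how "yes"/"no" actually tokenize under your prompt template.

```python
def yes_no_delta(top_logprobs, yes_token=" yes", no_token=" no", floor=-100.0):
    """Compute delta = log p(yes) - log p(no) from a {token: logprob} map.

    Tokens missing from the returned top-k list fall back to `floor`,
    a stand-in for "vanishingly unlikely".
    """
    lp_yes = top_logprobs.get(yes_token, floor)
    lp_no = top_logprobs.get(no_token, floor)
    return lp_yes - lp_no


def classify(prompt, tau=0.0, base_url="http://localhost:8000/v1", model="my-model"):
    """Request next-token logprobs only (max_tokens=1, temperature=0) from a
    vLLM OpenAI-compatible server, then threshold delta against tau.

    `model` and `base_url` are placeholders for your deployment.
    """
    from openai import OpenAI  # pip install openai

    client = OpenAI(base_url=base_url, api_key="EMPTY")
    resp = client.completions.create(
        model=model,
        prompt=prompt,
        max_tokens=1,     # score the very next token, no real decoding
        temperature=0,
        logprobs=20,      # return top-20 token logprobs at that step
    )
    # Legacy completions API: top_logprobs is a list (one dict per generated
    # token); we only generated one token, so take index 0.
    top = resp.choices[0].logprobs.top_logprobs[0]
    return yes_no_delta(top) >= tau
```

Raising `tau` above 0 makes the classifier more conservative about saying "yes"; sweep it on a held-out set (e.g. to trade precision against recall) without ever touching the model.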