Yep — you can keep a plain CausalLM behind vLLM and still get a tunable yes/no cutoff with zero architecture changes. You’ve got two clean patterns that work with the OpenAI-compatible server:
- Ask vLLM for the next-token logprobs only, with no real decoding (`max_tokens=1`, `temperature=0`).
- Read the logprob of the "yes" token and the "no" token at that step.
- Threshold on the logprob difference $\Delta = \log p(\text{yes}) - \log p(\text{no})$ (see the sketch after this list).