@yoavg
Created September 9, 2024 20:23
Is telling a model to "not hallucinate" absurd?

Can you tell an LLM "don't hallucinate" and expect it to work? My gut reaction was "oh, this is so silly", but upon some reflection, it really isn't. There is actually no reason why it shouldn't work, especially if the model was preference-fine-tuned on instructions with "don't hallucinate" in them, and if it is a recent commercial model, it likely was.

What does an LLM need in order to follow an instruction? It needs two things:

  1. an ability to perform the task. Something in its parameters/mechanism should be indicative of the task objective, in a way that can be influenced. (In our case, it should "know" when it hallucinates, and/or should be able to change or adapt its behavior to reduce the chance of hallucinations.)
  2. an ability to ground the instruction: the model should be able to associate the requested behavior with its parameters/mechanisms. (In our case, the model should associate "don't hallucinate" with the behavior related to 1).

Number (2) is easy to achieve with fine-tuning, assuming (1) exists. Does (1) exist? There is evidence that yes, it does. Presumably, "retrieving from memory" and "improvising an answer" are two different model behaviors, which use different internal mechanisms. Indeed, we can probe a model's inner layers and infer whether it is "lying" [1] or whether "the question is unanswerable" [2]. These are very much related to "hallucinations". And if we can detect this from the outside, why couldn't a model use the same signal internally when fine-tuned on contrastive examples, as happens in preference fine-tuning? Another possible trainable behavior to reduce hallucinations is to make the output distribution sharper, in order to reduce the chance of sampling a wrong token at random (only if the answer is in the parameters, of course).
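To make the "sharper output distribution" point concrete, here is a minimal sketch in plain Python. It is only an illustration under assumptions: the logits are made-up numbers for three hypothetical candidate tokens, and temperature scaling is used as a stand-in for whatever sharpening a fine-tuned model might learn internally. It is not a claim about how any actual model does it.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens the result."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate answer tokens.
# Assumption: index 0 is the answer the model "knows", the others are wrong.
logits = [2.0, 1.0, 0.5]

base = softmax(logits, temperature=1.0)
sharp = softmax(logits, temperature=0.5)  # stand-in for a learned sharpening

# Probability that random sampling picks a wrong token.
p_wrong_base = 1.0 - base[0]
p_wrong_sharp = 1.0 - sharp[0]

print(f"P(wrong token), original distribution:  {p_wrong_base:.3f}")
print(f"P(wrong token), sharpened distribution: {p_wrong_sharp:.3f}")
```

Sharpening only helps when the top-ranked token is already the right one, which matches the caveat above: if the answer is not in the parameters, a sharper distribution just makes the model more confidently wrong.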

So, given these pieces of information, yes, an LLM can be trained to reduce hallucinations upon request. And given the prominence and popularity of the term, strong new models likely were trained for exactly that. This is not an absurd or silly instruction. Maybe it's absurd in the sense that you have to explicitly request it, and that the models weren't trained to always reduce hallucinations. On the other hand, maybe always trying to avoid hallucinations has some other undesired consequences, which model trainers and product managers would like to avoid. (Actually, if you are in a position to know of such undesired consequences, and are free to tell the world about them, I would be really curious to learn more!)

Footnotes

  1. https://arxiv.org/abs/2304.13734

  2. https://arxiv.org/abs/2310.11877

@shauray8
I agree with @impredicative's statement. Larger models handle uncertainty better, and allowing them the freedom to reject flawed inputs significantly reduces hallucinations. Also, using the “h-word” explicitly might force the model into hallucinating more, as evident from https://arxiv.org/abs/2402.07896. So, it's always better to instruct the model in different ways.