yoavg/instruct-to-not-hallucinate.md

## instruct-to-not-hallucinate.md

      
## Is telling a model to "not hallucinate" absurd?

Can you tell an LLM "don't hallucinate" and expect it to work? My gut reaction was "oh, this is so silly", but upon some reflection, it really isn't. There is actually no reason why it shouldn't work, especially if the model was preference-fine-tuned on instructions with "don't hallucinate" in them, and if it is a recent commercial model, it likely was.
What does an LLM need in order to follow an instruction? It needs two things:

1. An ability to perform the task. Something in its parameters/mechanism should be indicative of the task objective, in a way that can be influenced. (In our case, it should "know" when it hallucinates, and/or should be able to change or adapt its behavior to reduce the chance of hallucinations.)
2. An ability to ground the instruction: the model should be able to associate the requested behavior with its parameters/mechanisms. (In our case, the model should associate "don't hallucinate" with the behavior related to (1).)

Number (2) is easy to achieve with fine-tuning, assuming (1) exists. Does (1) exist? There is evidence that yes, it does. Presumably, "retrieving from memory" and "improvising an answer" are two different model behaviors, which use different internal mechanisms. Indeed, we can probe a model's inner layers and infer whether it is "lying"¹ or whether "the question is unanswerable"². These are very much related to "hallucinations". And if we can do it from the outside, then why can't a model use the same signal internally when fine-tuned on contrastive examples, as happens in preference fine-tuning? Another possible trainable behavior to reduce hallucinations is to make the output distribution sharper, in order to reduce the chance of sampling a wrong answer at random (this helps only if the answer is in the parameters, of course).
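To make the "sharper output distribution" point concrete, here is a minimal sketch. It uses a decoding temperature as a stand-in for whatever sharpening a fine-tuned model might learn internally; that substitution is my assumption, not a claim about how any real model does it. The logits are made up, and `softmax` and `off_top_mass` are hypothetical helper names.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def off_top_mass(probs):
    """Probability that a single random sample is NOT the most likely option."""
    return 1.0 - max(probs)

# Hypothetical logits over four candidate answers (illustrative values only).
# Index 0 plays the role of the answer that is "in the parameters".
logits = [3.0, 2.0, 1.0, 0.5]

for t in (1.0, 0.5, 0.25):
    probs = softmax(logits, temperature=t)
    print(f"T={t}: P(sample != top answer) = {off_top_mass(probs):.3f}")
```

The chance of sampling something other than the top answer shrinks as the distribution gets sharper. Note that this only reduces errors when the top answer is the right one; if the model's most likely answer is itself wrong, sharpening just makes it wrong more consistently.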
So, given these pieces of information, yes, an LLM can be trained to reduce hallucinations upon request. And given the prominence and popularity of the term, strong new models likely were trained for exactly that. This is not an absurd or silly instruction. Maybe it's absurd in the sense that you have to explicitly request it, and that the models weren't trained to always reduce hallucinations. On the other hand, maybe always trying to avoid hallucinations has some other undesired consequences, which model trainers and product managers would like to avoid. (Actually, if you are in a position to know of such undesired consequences, and are free to tell the world about them, I would be really curious to learn more!)
Footnotes

1. https://arxiv.org/abs/2304.13734
2. https://arxiv.org/abs/2310.11877