@shahbazsyed
Last active May 8, 2023 12:18
Running notes on reinforcement learning from human feedback

Notes on RLHF

Three stages of training an LLM

  1. Pretraining: an LLM is pretrained on indiscriminate web data
  2. Supervised finetuning (SFT): the pretrained language model (PLM) is then finetuned on higher-quality data
  3. RLHF: the finetuned model is further polished using RLHF to make it appropriate for a broad audience

Pretraining is the most resource-intensive phase; SFT and RLHF can be seen as unlocking existing capabilities of the pretrained model that are hard for users to access via prompting alone.

There are two types of data required besides the scraped web data used for pretraining:

Demonstration data: used for SFT; contains positive examples of the form (prompt, response). The goal is to optimize the PLM to generate the kind of responses we are looking for. This is also known as behavior cloning.

Comparison data: used for RLHF; contains both positive and negative examples of the form (prompt, winning_response, losing_response)

Both demonstration data and comparison data are generally collected via manual annotation from highly qualified workers who understand the tasks. This is crucial, as labels from annotators who do not understand the task cannot help the model.
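As a rough illustration, records in the two datasets might be represented as follows (a minimal sketch; the class and field names are assumptions for illustration, not taken from any particular published dataset):

```python
from dataclasses import dataclass

@dataclass
class DemonstrationExample:
    """A single SFT training example: a prompt and a human-written response."""
    prompt: str
    response: str

@dataclass
class ComparisonExample:
    """A single reward-model training example: two responses ranked by a labeler."""
    prompt: str
    winning_response: str  # the response the labeler preferred
    losing_response: str   # the response the labeler rejected
```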

The problem with SFT demonstration data is that while it shows what the response to a prompt should be, it does not indicate how good or bad a given response is. This is where RLHF comes into the picture.

RLHF consists of two parts:

  1. A Reward Model (RM) that acts as a scoring function for the responses the SFT model generates to prompts
  2. Optimizing the LLM to generate responses that achieve a high score from the RM

Reward Model

It is recommended to initialize the RM from the SFT model, so that the scoring model is at least as capable as the model whose responses it scores. Since scoring is a classification or regression task, the RM is finetuned on the comparison data described above.
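A standard way to do this finetuning is a pairwise ranking loss that pushes the score of the winning response above the score of the losing one. A minimal PyTorch sketch, assuming a `reward_model` that maps tokenized (prompt + response) input ids to a scalar score per example (the model interface and argument names are illustrative assumptions):

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, winning_input_ids, losing_input_ids):
    """Pairwise ranking loss for finetuning the RM on comparison data.

    reward_model: any module mapping tokenized (prompt + response) input ids
                  to one scalar score per example (an assumption for illustration).
    winning_input_ids / losing_input_ids: batched token ids for the prompt
                  concatenated with the winning / losing response.
    """
    score_w = reward_model(winning_input_ids)  # shape [batch]
    score_l = reward_model(losing_input_ids)   # shape [batch]
    # The winning response should score higher; maximize log-sigmoid of the margin.
    return -F.logsigmoid(score_w - score_l).mean()
```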

The SFT model and the RM are then combined in a reinforcement learning setup using Proximal Policy Optimization (PPO) to produce an optimized variant of the SFT model. For a new prompt, the SFT model generates a response using its current policy; the RM scores this response, and the policy is updated to maximize that score, yielding the RLHF-tuned LLM.

However, an important constraint is that the model resulting from this phase should not stray too far from the model resulting from the SFT phase. The intuition is that there are many possible responses for any given prompt, the vast majority of which the RM has never seen before. For many of those unseen (prompt, response) pairs, the RM might mistakenly assign an extremely high or low score. Without this constraint, the optimization might drift toward responses with extremely high scores even though they are not actually good responses.
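One common way to enforce this constraint (used in InstructGPT-style setups) is to subtract a KL penalty between the current policy and the frozen SFT model from the RM score before the PPO update. Below is a minimal sketch using a sequence-level approximation of the KL term; the function name, coefficient value, and tensor shapes are illustrative assumptions:

```python
import torch

def shaped_reward(rm_score, policy_logprobs, sft_logprobs, beta=0.02):
    """Combine the RM score with a KL penalty that keeps the policy close to the SFT model.

    rm_score:        [batch] RM score for each full (prompt, response) pair.
    policy_logprobs: [batch, seq] log-probs of the sampled response tokens under the current policy.
    sft_logprobs:    [batch, seq] log-probs of the same tokens under the frozen SFT model.
    beta:            KL penalty coefficient (the value here is an illustrative assumption).
    """
    # Monte Carlo estimate of the KL divergence on the sampled tokens, summed over the sequence.
    kl = (policy_logprobs - sft_logprobs).sum(dim=-1)
    # Responses that drift far from the SFT model are penalized even if the RM scores them highly.
    return rm_score - beta * kl
```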

Why does RLHF work?

Three possible hypotheses:

  1. The diversity hypothesis: during SFT, the model's output is expected to closely match the demonstrated responses. This can be counterproductive when the same answer could be given in several different but equally valid forms. RL allows us to reward any of the equally valid answers to a given prompt rather than a single demonstrated one.

  2. The negative feedback hypothesis: demonstration data (SFT) gives the model only positive signals; RL also allows us to give the model negative signals.

  3. The hallucination hypothesis: there are primarily three modes of interaction with an LLM: (a) text-grounded, where we provide the model with a text and an instruction and expect the answer to be fully grounded in the provided text; (b) knowledge-seeking, where we provide the model with a question or instruction and expect a truthful answer based on the model's internal knowledge; (c) creative, where we provide the model with a question or instruction and expect some creative output.

The argument for using RL rests mostly on the knowledge-seeking mode. The core issue is that in such situations we want to encourage the model to answer based on its internal knowledge, but we do not know what that internal knowledge contains. Hallucinations can therefore occur when the model does not know the answer but is still forced to respond, which is effectively what SFT trains it to do.

RLHF and Hallucination

There are two hypotheses for why LLMs hallucinate:

  1. because they lack an understanding of the cause and effect of their actions; this may be addressed by treating response generation as causal intervention
  2. because there is a mismatch between the LLM's internal knowledge and the labeler's internal knowledge. In theory, the human labeler can include all the context they know with each prompt to teach the model to use only the existing knowledge. However, this is impossible in practice.

The InstructGPT paper showed that RLHF made hallucination worse; further investigation is needed before drawing a firm conclusion.

Tip: Based on the assumption that LLMs know what they know, some people try to reduce hallucination with prompts, e.g. adding "Answer as truthfully as possible, and if you're unsure of the answer, say 'Sorry, I don't know'". Making LLMs respond concisely also seems to help with hallucination – the fewer tokens LLMs have to generate, the less chance they have to make things up.
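A minimal sketch of prepending such an instruction to a user question (the template wording mirrors the tip above; the function and constant names are hypothetical, and no specific API or chat format is assumed):

```python
# Hypothetical prompt template for the hallucination-reduction tip above.
SYSTEM_INSTRUCTION = (
    "Answer as truthfully as possible, and if you're unsure of the answer, "
    "say \"Sorry, I don't know\". Keep the answer concise."
)

def build_prompt(user_question: str) -> str:
    """Prepend the truthfulness instruction to the user's question."""
    return f"{SYSTEM_INSTRUCTION}\n\nQuestion: {user_question}\nAnswer:"

print(build_prompt("Who wrote the InstructGPT paper?"))
```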

