Self-Rewarding Language Models

Quickly scanned https://arxiv.org/abs/2401.10020 . Quite interesting work. The paper's idea is to have a single language model that both answers questions (responds to prompts) and evaluates its own answers. Iterative Direct Preference Optimization (DPO) training is used to improve both capabilities of the model.

The authors tried different LLM-as-a-judge prompts to generate a reward score for each answer. A particular additive 5-point scoring prompt turned out to be the most effective. The two-step inference pipeline (answering questions + evaluating the answers) also produces an extra dataset of <question, winning answer, losing answer> triples.
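
The sketch below shows one way such preference pairs could be generated; it is a minimal illustration, not the paper's exact procedure. The `model.generate` and `model.score_with_judge_prompt` calls are hypothetical stand-ins for sampling and judging, and the judge prompt only paraphrases the paper's additive 5-point scheme.

```python
# Sketch: sample candidate answers, judge them with the same model,
# and keep the best/worst pair as preference data.

JUDGE_PROMPT = (
    "Review the user's question and the response below. "
    "Score the response additively, awarding up to 5 points "
    "(one point per criterion: relevance, coverage, usefulness, "
    "clarity, and expert quality). End with 'Score: <total>'.\n\n"
    "Question: {question}\nResponse: {answer}"
)

def build_preference_pairs(model, questions, num_candidates=4):
    """For each question, sample several candidate answers, let the same
    model score them via the judge prompt, and keep the highest- and
    lowest-scoring answers as a <question, winner, loser> triple."""
    pairs = []
    for q in questions:
        candidates = [model.generate(q) for _ in range(num_candidates)]
        scored = sorted(
            (model.score_with_judge_prompt(
                JUDGE_PROMPT.format(question=q, answer=a)), a)
            for a in candidates
        )
        (worst_score, worst), (best_score, best) = scored[0], scored[-1]
        if best_score > worst_score:  # ties carry no preference signal
            pairs.append({"question": q,
                          "winning_answer": best,
                          "losing_answer": worst})
    return pairs
```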

This AI-generated dataset of preference pairs is fed back to the model through a training pipeline using Direct Preference Optimization.
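
For reference, here is a minimal sketch of the standard DPO objective applied to these pairs, assuming per-sequence log-probabilities have already been computed for the current policy and for a frozen reference copy of the model (function and argument names are mine, not the paper's):

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_win, policy_logp_lose,
             ref_logp_win, ref_logp_lose, beta=0.1):
    """Standard DPO loss: increase the policy's preference margin for the
    winning answer over the losing one, relative to the frozen reference."""
    policy_margin = policy_logp_win - policy_logp_lose  # log pi(y_w|x) - log pi(y_l|x)
    ref_margin = ref_logp_win - ref_logp_lose            # same margin under the reference model
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```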

The inference and training pipelines are connected to form a closed-loop, iterative process (a sketch of the loop follows the list below).

  • Each iteration generates better AI-feedback training data and, subsequently, a better model.
  • The evaluation shows very promising results, outperforming Claude 2, Gemini Pro, and GPT-4 on selected benchmarks.
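
A minimal sketch of that closed loop, reusing the hypothetical helpers from the earlier snippets (`build_preference_pairs`, plus assumed `sample_prompts` and `dpo_train` functions); the paper starts from a seed-fine-tuned model and runs a small number of iterations:

```python
def self_rewarding_loop(model, prompt_pool, num_iterations=3):
    """Iteratively improve the model on its own feedback: M_t answers and
    judges new prompts, and DPO on the resulting pairs yields M_{t+1}."""
    for _ in range(num_iterations):
        questions = sample_prompts(prompt_pool)           # fresh prompts each round
        pairs = build_preference_pairs(model, questions)  # model answers and judges itself
        model = dpo_train(model, pairs)                   # train the next iteration with DPO
    return model
```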

The paper has some room for improvement.

  1. Figure 1 does not accurately reflect the entire workflow. For example, a fixed model is used to generate prompts, but this is not shown in the figure. The preference pairs should be drawn as a matrix instead of a vector in the diagram. The bootstrapping workflow (using the seed instruction-following and evaluation datasets) should also be reflected.

  2. The authors do not explain why a fixed model is used to generate prompts instead of the self-rewarding model itself.

  3. The authors also tried another form of AI feedback data, (question, best answer) pairs, coupled with supervised fine-tuning. However, it did not result in any performance improvement. It would be better to explore why, or at least propose this as future work.

  4. Fundamentally, the paper does not directly compare (or comment on) self-rewarding versus independent rewarding. The same iterative process could still be applied with an independent reward model.
