
Reinforcement Learning for Language Models

Yoav Goldberg, April 2023.

Why RL?

With the release of the ChatGPT model and follow-up large language models (LLMs), there was a lot of discussion of the importance of "RLHF training", that is, "reinforcement learning from human feedback". I was puzzled for a while as to why RL (Reinforcement Learning) is better than learning from demonstrations (a.k.a. supervised learning) for training language models. Shouldn't learning from demonstrations (or, in language-model terminology, "instruction fine-tuning": learning to imitate human-written answers) be sufficient? I came up with a theoretical argument that was somewhat convincing. But I came to realize there is an additional argument which not only supports the case for RL training, but practically requires it, in particular for models like ChatGPT. This additional argument is spelled out in (the first half of) a talk by John Schulman from OpenAI. This post pretty much
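(To make the contrast concrete, here is a minimal, purely illustrative sketch of the two training signals being compared. This is my framing for this excerpt, not Schulman's or the post's, and all names and numbers are toy placeholders.)

```python
import math

def sft_loss(demo_token_probs):
    # Imitation / instruction fine-tuning: maximize the probability the model
    # assigns to a *human-written* demonstration, token by token (cross-entropy).
    return -sum(math.log(p) for p in demo_token_probs)

def rlhf_objective(sample_reward, sample_log_prob, ref_log_prob, beta=0.1):
    # RLHF: score the model's *own* sampled answer with a learned reward model,
    # minus a KL-style penalty keeping the policy close to the pre-RL reference.
    # (Actually optimizing this requires a policy-gradient method such as PPO,
    # which is not shown here.)
    return sample_reward - beta * (sample_log_prob - ref_log_prob)

print(sft_loss([0.9, 0.8, 0.7]))        # toy demonstration token probabilities
print(rlhf_objective(1.2, -3.0, -2.5))  # toy reward and log-probabilities
```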

On virtual spaces (for scientific conferences)

The title is a bit broad. What I am going to write about is gather.town. I complained quite a bit re how recent conferences used the gather platform. Here I try to be more constructive, and explain why I think things were bad, and also how I think they can be improved (substantially).

I think gather.town is a fantastic interface, and I think it was misused (or badly used) in some recent xACL conferences (EACL 2021, EMNLP 2020). It is really disappointing, as there is so much potential, which was not only left unfulfilled, but in some cases made things worse than not having gather at all. This post will try to explain what I think was bad, and how I think things can be improved.

@yoavg
yoavg / instruct-to-not-hallucinate.md
Created September 9, 2024 20:23
Is telling a model to "not hallucinate" absurd?

Can you tell an LLM "don't hallucinate" and expect it to work? My gut reaction was "oh, this is so silly", but upon some reflection, it really isn't. There is actually no reason why it shouldn't work, especially if the model was preference-fine-tuned on instructions containing "don't hallucinate", and if it is a recent commercial model, it likely was.

What does an LLM need in order to follow an instruction? It needs two things:

  1. an ability to perform the task. Something in its parameters/mechanism should be indicative of the task objective, in a way that can be influenced. (In our case, it should "know" when it hallucinates, and/or should be able to change or adapt its behavior to reduce the chance of hallucinations.)
  2. an ability to ground the instruction: the model should be able to associate the requested behavior with its parameters/mechanisms. (In our case, the model should associate "don't hallucinate" with the behavior related to 1).
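To make point 2 (grounding) concrete, here is a hypothetical preference-tuning record of the kind alluded to above. The field names (prompt / chosen / rejected) follow common preference-data conventions (e.g. as used for DPO-style tuning), and every string in it, including the rejected answer's DOI, is invented for illustration.

```python
# A single hypothetical preference pair: the instruction contains "don't
# hallucinate", and the preferred answer declines to guess rather than
# inventing a citation. All strings here are made up for illustration.
preference_pairs = [
    {
        "prompt": "Give me the DOI of the Smith et al. 2017 parsing paper. Don't hallucinate.",
        "chosen": "I'm not confident about that DOI, so I'd rather not guess. "
                  "The paper's ACL Anthology page would list it.",
        "rejected": "Sure, the DOI is 10.1234/acl.2017.5678.",  # fabricated
    },
]

for pair in preference_pairs:
    print(pair["prompt"], "=>", pair["chosen"])
```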
@yoavg
yoavg / multi-llm-agents.md
Last active January 14, 2025 12:14
What makes multi-agent LLM systems multi-agent?

Are multi-LLM-agent systems a thing? Yes they are. But.

Yoav Goldberg, Nov 24, 2024

This piece started with a pair of twitter and bluesky posts:

let's talk about "agents" (in the LLM sense). there's a lot of buzz around "multi-agent" systems where agents collaborate but... i don't really get how it differs from thinking of a single agent with multiple modes of operation. what are the benefits of modeling as multi-agent?

— (((ل()(ل() 'yoav))))👾 (@yoavgo) November 23, 2024
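As a toy illustration of the question in the post (entirely my own sketch, not something from the original thread): two "agents" that are really one underlying model called with different system prompts, i.e. a single agent with multiple modes of operation. `call_llm` is a hypothetical stand-in for whatever completion API one uses.

```python
def call_llm(system_prompt: str, user_message: str) -> str:
    # Hypothetical stand-in: a real system would call an LLM API here.
    return f"[{system_prompt.split('.')[0]}] reply to: {user_message}"

# Two "agents" that share one underlying model and differ only in their prompt,
# i.e. the same model operating in two modes.
MODES = {
    "writer": "You are a helpful writer. Produce a draft.",
    "critic": "You are a harsh reviewer. Point out flaws.",
}

def run_agent(mode: str, message: str) -> str:
    return call_llm(MODES[mode], message)

draft = run_agent("writer", "an abstract about multi-agent LLM systems")
print(run_agent("critic", draft))
```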
175504 אז (so/then)
182482 יום (day)
198859 טוב (good)
202605 את (direct-object marker / you, fem.)
204134 כמה (how much/many)
220754 בוקר (morning)
230920 לילה (night)
244401 רק (only)
245847 עוד (more/still)
261671 מי (who)
@yoavg
yoavg / lm_example
Created May 22, 2015 23:43
Unreasonable Effectiveness of LMs
# The unreasonable effectiveness of Character-level Language Models
## (and why RNNs are still cool)

### [Yoav Goldberg](http://www.cs.biu.ac.il/~yogo)
@yoavg
yoavg / ACL.js
Created September 9, 2015 08:07
Updated the zotero ACL translator to include titles as well as author names in selection list.
{
  "translatorID": "f4a5876a-3e53-40e2-9032-d99a30d7a6fc",
  "label": "ACL",
  "creator": "Nathan Schneider, Yoav Goldberg",
  "target": "^https?://(www[.])?aclweb\\.org/anthology/[^#]+",
  "minVersion": "1.0.8",
  "maxVersion": "",
  "priority": 100,
  "inRepository": true,
  "translatorType": 4,
@yoavg
yoavg / ngram-lm-leak.ipynb
Created February 27, 2018 22:51
Simple 4gram-lm also "leak secrets"

# 4gram language models share secrets too...
_Yoav Goldberg, 28 Feb, 2018._

In [a recent research paper](https://arxiv.org/pdf/1802.08232.pdf) titled "The Secret Sharer:
@yoavg
yoavg / stochastic-critique.md
Last active January 5, 2025 10:43
A criticism of Stochastic Parrots

A criticism of "On the Dangers of Stochastic Parrots: Can Languae Models be Too Big"

Yoav Goldberg, Jan 23, 2021.

The FAccT paper "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" by Bender, Gebru, McMillan-Major and Shmitchell has been at the center of a controversy recently. The final version is now out and, owing a lot to this controversy, will undoubtedly become very widely read. I read an earlier draft of the paper, and I think that the new and updated final version is much improved in many ways: kudos to the authors for this upgrade. I also agree with and endorse most of the content. This is important stuff, you should read it.

However, I do find some aspects of the paper (and the resulting discourse around it and around the technology) to be problematic. These weren't clear to me when I initially read the first draft several months ago, but they have become very clear to me now. These points are for the most part