@Xom
Last active December 1, 2022 22:48

Language Models in Games: A Survey (as of 2022-11-30)

From Wikipedia:

AI Dungeon is a text adventure game which uses artificial intelligence to generate content. The game's first version was made available on Google Colaboratory in May 2019, and its second version was released in December 2019. The AI model was then upgraded in July 2020.

I started reading about AI Dungeon two months ago, and I thought it would be interesting to use an LLM (large language model, such as GPT) to derive conventional gamestate (such as health points) from arbitrary text. (There are several ways to get a boolean or numeric output from an LLM, the simplest of which is to prompt it to answer yes or no.) I found a number of existing examples, which can be classified by what kind of gamestate and interface they have.
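The yes/no trick can be sketched as follows. This is a minimal illustration, not any particular game's implementation; `llm` here is a stand-in for a real completion-API call.

```python
def llm(prompt: str) -> str:
    """Stand-in for a real completion API call; returns a canned answer here."""
    return "Yes"

def ask_bool(question: str, context: str) -> bool:
    """Derive a boolean from arbitrary text by forcing a yes/no answer."""
    prompt = f"{context}\n\nQuestion: {question}\nAnswer yes or no:"
    answer = llm(prompt).strip().lower()
    return answer.startswith("yes")

print(ask_bool("Is the hero still alive?",
               "The hero limps away from the dragon, bleeding."))
```

In practice you would also constrain the model's output (e.g. limit it to a few tokens) so the answer is cheap to parse.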

I classify data into three kinds: conventional state, textual state, and display data. I define textual state as textual data that interacts with game mechanics via LLM. Text that doesn't interact with game mechanics, such as names in Crusader Kings, is display data. A conventional computer game has only conventional state. AI Dungeon and its clones, such as NovelAI and KoboldAI, have only textual state.

A simple example of a game with both conventional state and textual state is Medieval Problems, by the same company as AI Dungeon. The main difference from AI Dungeon is that the gamestate includes the numeric stats "Prosperity", "Military", "Stability", and "Happiness", and the text generated by the game in response to the player is also fed to an LLM to determine numeric effects on the stats, such as +15% or -10%.
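That loop from narration back to numeric stats might look like the following. This is a guess at the general shape, not Medieval Problems' actual prompts; `llm` is a stand-in for a completion API.

```python
import re

def llm(prompt: str) -> str:
    """Stand-in for a completion API; a real game would call a hosted model."""
    return "-10%"

def stat_effect(narration: str, stat: str) -> float:
    """Ask the model for a percentage effect of the narration on one stat."""
    prompt = (f"{narration}\n\n"
              f"Effect on {stat}, as a percentage like +15% or -10%:")
    m = re.search(r"([+-]?\d+)\s*%", llm(prompt))
    return int(m.group(1)) / 100 if m else 0.0

stats = {"Prosperity": 50, "Military": 50, "Stability": 50, "Happiness": 50}
delta = stat_effect("Bandits raid the granary.", "Prosperity")
stats["Prosperity"] = round(stats["Prosperity"] * (1 + delta))
```

The regex fallback to 0.0 matters: the model won't always answer in the requested format, and the game needs a safe default when parsing fails.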

Currently the most popular such game on Steam is AI Roguelite. It has more complicated state than Medieval Problems, featuring many objects with both numeric stats and a textual description, both of which influence the text generation. The stats influence the LLM by causing words like "expertly" to be inserted into prompts.
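Injecting stats into prompts as words can be as simple as a threshold table. The thresholds and adverbs below are invented for illustration, not taken from AI Roguelite.

```python
def skill_adverb(skill: int) -> str:
    """Map a numeric stat onto a word inserted into the prompt,
    so the stat shapes the generated text. Thresholds are arbitrary."""
    if skill >= 80:
        return "expertly"
    if skill >= 50:
        return "competently"
    if skill >= 20:
        return "clumsily"
    return "ineptly"

def attack_prompt(attacker: str, skill: int, target: str) -> str:
    """Build a generation prompt whose wording reflects the numeric stat."""
    return f"{attacker} {skill_adverb(skill)} attacks {target}. Describe what happens:"

print(attack_prompt("The rogue", 85, "the goblin"))
```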

Another difference between Medieval Problems and AI Roguelite is that Medieval Problems takes arbitrary text input from the player, whereas AI Roguelite mostly presents the player with discrete options (except in a few cases that the player can choose to ignore). This makes Medieval Problems mostly a toy, whereas AI Roguelite has the potential to be a game of skill (given difficulty tuning, which its current stage of development is too early for). How arbitrary text input precludes a game of skill is illustrated in the video "How I broke Medieval Problems", in which the player simply gives commands like "increase prosperity" instead of anything more concrete. More generally, in most cases it's probably too easy to figure out the best possible text input.

One way to substitute discrete choices for arbitrary text input is to generate a few textual options. It's more interesting to ask the player which of two text inputs is better than to ask for the best possible text input. I made a simple game, Flavor Battle, that presents each player with a roster of character descriptions, and asks the player to pick a lineup for battles in which each side's character description is fed to an LLM to output a battle description (which is then fed back to the LLM to determine the winner):
[screenshot of Flavor Battle]
The prompts used in Flavor Battle are here. For an invite to the game, message me at Xom#7240 on Discord.
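The two-stage pipeline can be sketched as follows. These are not the actual Flavor Battle prompts (those are linked above); `llm` is a stand-in that branches on the prompt so the sketch runs deterministically.

```python
def llm(prompt: str) -> str:
    """Stand-in for a completion API call, branching on the prompt kind."""
    if "Who won" in prompt:
        return "A"
    return "A fierce duel unfolds..."

def battle(desc_a: str, desc_b: str) -> str:
    """Two-stage pipeline: generate a battle narration from both character
    descriptions, then feed the narration back to pick a winner."""
    narration = llm(f"Character A: {desc_a}\nCharacter B: {desc_b}\n"
                    "Describe their battle:")
    verdict = llm(f"{narration}\nWho won, A or B? Answer with one letter:")
    return "A" if verdict.strip().upper().startswith("A") else "B"
```

Splitting narration and judgment into two calls keeps the battle description free-form while still yielding a machine-readable winner.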

Cicero is an innovative AI for the board game Diplomacy that negotiates via textual dialogue. Since Cicero can't be convinced to act against its own interests, but rather can only be convinced to coordinate mutually beneficial actions, figuring out the best way to talk to Cicero has diminishing returns and doesn't break the game.

On the other hand, trying to predict Cicero's future actions based on the dialogue is a novel and interesting challenge to face in a game. That is, figuring out the effect of text when it becomes a prompt is not the novel part, and is already in the games mentioned earlier; the novel part is that Cicero's messages are influenced by hidden state, creating the challenge of inferring the hidden state from the messages.

(In this case, there's an extra wrinkle in that the hidden state is Cicero's future action as computed based on the dialogue.
The inputs for computing Cicero's action are: dialogue + past and current board state.
The inputs for computing Cicero's messages are: dialogue + past and current board state + Cicero's action as computed from the dialogue and board state so far.)
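The dependency structure above can be written out as a dataflow sketch. The function bodies are placeholders standing in for Cicero's actual models; only the shape of the inputs and outputs comes from the description above.

```python
def plan_action(dialogue, board_history):
    """Stand-in for the planning model: dialogue + board state -> action."""
    return "hold Paris"

def write_messages(dialogue, board_history, action):
    """Stand-in for the message model: dialogue + board state + planned
    action -> outgoing messages."""
    return [f"I intend to {action}."]

def cicero_turn(dialogue, board_history):
    # The planned action depends only on the dialogue and board state...
    action = plan_action(dialogue, board_history)
    # ...while the messages additionally depend on that planned action,
    # which is the hidden state the human player tries to infer.
    messages = write_messages(dialogue, board_history, action)
    return action, messages
```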

A less complicated example would be if, in a game like AI Roguelite, an object had hidden numeric stats influencing a LLM, and the player could try to infer the value of the object based on generated text. The hidden state could also be textual, rather than conventional; a trivial example would be a guess-the-prompt game in which the player sees a generated image and tries to guess the prompt text.

(See also Radio Commander, a game pre-dating LLMs.)

Text-to-text functions can also be interesting, especially ones with multiple text inputs, such as in Things, which has "You use [input a] to combine [input b] with [input c] to form [output]!". AI Roguelite and Flavor Battle have similar item crafting functions.
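A multi-input function in that style is just a prompt template over several text slots. This is an invented sketch of the pattern, not Things' actual prompt; `llm` is a stand-in returning a canned crafting result.

```python
def llm(prompt: str) -> str:
    """Stand-in for a completion API call."""
    return "a grappling hook"

def craft(tool: str, a: str, b: str) -> str:
    """Multi-input text-to-text function in the style of
    'You use [tool] to combine [a] with [b] to form [output]!'"""
    prompt = (f"You use {tool} to combine {a} with {b} to form what? "
              "Answer with a short noun phrase:")
    return llm(prompt).strip()

result = craft("rope", "a hook", "a stick")
print(f"You use rope to combine a hook with a stick to form {result}!")
```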

Appendix: miscellaneous notes on LLM capabilities

The following table summarizes https://beta.openai.com/docs/models/gpt-3:

Parameter count*  VRAM      Good at
2.7B              8 GB      Parsing text, simple classification, address correction, keywords
6.7B              12–16 GB  Moderate classification, semantic search classification
13B               24–32 GB  Language translation, complex classification, text sentiment, summarization, chatbot
175B              ~800 GB   Complex intent, cause and effect, summarization for audience, creative writing

*B = billion

AI Dungeon has a paid version that uses a 175B model, which reportedly became much less intelligent after some SFW tuning.
NovelAI has 13B and 20B models.
KoboldAI has models up to 20B.
AI Roguelite has models up to 13B.
Flavor Battle uses KoboldAI models, and is noticeably better with 20B.
Cicero uses a 2.7B model to generate a large number of candidate messages, then uses an ensemble of filtering models to select good ones.
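Cicero's over-generate-then-filter approach can be sketched as follows. The generator and scorer below are trivial placeholders; the real system samples from a 2.7B model and scores with an ensemble of learned filters (e.g. for consistency with the planned action).

```python
def generate_candidates(prompt: str, n: int) -> list:
    """Stand-in for sampling n candidate messages from a small generator."""
    return [f"message {i}" for i in range(n)]

def score(candidate: str) -> float:
    """Stand-in for an ensemble of filter models, combined into one score.
    Real filters might reject nonsense or plan-inconsistent messages."""
    return float(len(candidate))

def best_message(prompt: str, n: int = 16) -> str:
    """Over-generate with a cheap model, then let filters pick the best."""
    return max(generate_candidates(prompt, n), key=score)
```

The point of the pattern is that generation is cheap and filtering is discriminative, so quality comes from selection rather than from a single expensive generation.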

LLMs can only write what they know. They're much better at typical fantasy scenarios than at unusual science-fiction settings. They have trouble keeping track of possibility-restricting conditions, such as physical invulnerability. They get confused—when told that a sea captain has a rifle and a siren has vampire teeth, the sea captain might lash out with her tail and the siren might fire bullets from her teeth. And, of course, it's a mixed blessing that LLMs can confabulate details not given to them.

I can think of a few strategies for handling confabulation. One is to detect and filter out confabulations you don't want to deal with, and retry generation. You might try to detect special cases, like AI Roguelite, which tries to detect when an enemy changes their attitude and stops fighting. You probably want to hold on to generated text as textual state even if you also derive conventional state from it. Or you might make your game so simple that there's no narrative continuation, like Flavor Battle, where there's no continuity between battles except the counting of wins and losses, so it doesn't matter if a wizard blows up the arena along with the opponent.
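The detect-and-retry strategy is essentially a loop with a rejection check and a fallback. The check below is a naive keyword filter for illustration; a real game might instead ask a classifier model, as in the yes/no trick earlier. `llm` is again a stand-in.

```python
def llm(prompt: str) -> str:
    """Stand-in for a completion API call."""
    return "The wizard hurls a fireball at the slime."

def is_acceptable(text: str, banned_topics: list) -> bool:
    """Naive confabulation check; a real game might ask a classifier model."""
    return not any(topic in text.lower() for topic in banned_topics)

def generate_with_retry(prompt: str, banned_topics: list, tries: int = 3) -> str:
    """Discard generations that introduce confabulations the game can't
    handle, retrying up to a fixed budget before falling back."""
    for _ in range(tries):
        text = llm(prompt)
        if is_acceptable(text, banned_topics):
            return text
    return "You pause, unsure what happens next."  # safe fallback
```

The fallback line matters: without it, a stubborn model could stall the game indefinitely.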
