Skip to content

Instantly share code, notes, and snippets.

@yuchenlin
Last active July 19, 2024 03:33
Show Gist options
  • Save yuchenlin/2b1f444fc3e9aed973b4b69f9b3926c0 to your computer and use it in GitHub Desktop.
Save yuchenlin/2b1f444fc3e9aed973b4b69f9b3926c0 to your computer and use it in GitHub Desktop.

ZebraLogic: Benchmarking the Logical Reasoning Ability of Language Models

zebra_banner
image

LLMs excel at information-seeking and creative writing tasks. They have significantly improved in math and coding too. But how do they perform in logical reasoning?

To evaluate the logical reasoning abilities of LLMs, we have created a benchmark named ZebraLogic. Each example is a Logic Grid Puzzle, also known as a Zebra Puzzle. In each puzzle, we are given N houses (numbered 1 to N from left to right) and M features for each house. There are N distinct values for each feature, and each house must have a unique value for each feature. Given a list of clues, one should be able to deduce a unique correct assignment of values. The logic grid puzzle is a typical Constraint Satisfaction Problem (CSP) and is often used to test humans' logical reasoning abilities in exams such as the Law School Admission Test (LSAT).


Links


An Example of ZebraLogic Data πŸ¦“

Here is an example of a 2x3 puzzle (2 houses x 3 features):

ZebraLogic Bench Example; id=[lgp-test-2x3-1]: ⬇️

 There are 2 houses, numbered 1 to 2 from left to right. 
 Each house is occupied by a different person. 
 Each house has a unique attribute for each of the following characteristics:
 - Each person has a unique name: **Arnold, Eric**
 - People own unique car models: **ford f150, tesla model 3**
 - The people keep unique animals: **cat, horse**
 
 **Clues**:
 1. Eric is directly left of the person who owns a Tesla Model 3.
 2. The person who keeps horses is in the first house.

Reasoning steps:

  • From Clue 1, we know that Eric is to the left of someone, so he must be the owner of House 1, because House 2 is the rightmost house.
  • Additionally, we know that the person in House 1 must be Arnold, and he owns a Tesla Model 3. Thus, Eric owns a Ford F150.
  • From Clue 2, we know that Eric keeps horses in House 1, which means the other house must keep cats. Finally, we arrive at the unique solution to this puzzle.

The solution presented in table format:

Houses Name CarModel Animal
1 Eric ford f150 horse
2 Arnold tesla model 3 cat

Evaluation Method πŸ“

We programmatically created 1,000 such puzzles, with sizes ranging from 2x2 to 6x6, and there are 40 puzzles for each size. We test large language models (LLMs) by providing one-shot example with reasoning steps and the JSON-formatted solution. We instruct the LLMs to first output their reasoning and then present their answers in the same format as shown in the in-context example.

Metrics

We have two major metrics: puzzle-level accuracy and cell-wise accuracy. For each puzzle of size NxM, there are NxM cells to fill in, and we compute the cell-wise accuracy as the proportion of correctly filled cells. A puzzle is counted as a puzzle-level success only when all cells are filled with correct values. Additionally, we divide the 1000 puzzles into two subsets: easy and hard puzzles, based on their sizes.

Easy vs Hard Puzzles

For an NxM-size Zebra puzzle (N houses and M features), the probability of correctly guessing the assignment for each feature by random chance is $\frac{1}{N!}$. Consequently, the probability of getting all cells correct by randomly guessing is $\left(\frac{1}{N!}\right)^{M}$. The logarithmic values are shown in the following table:

N ⬇️ M=2 M=3 M=4 M=5 M=6
2 -0.602060 -0.903090 -1.204120 -1.505150 -1.806180
3 -1.556303 -2.334454 -3.112605 -3.890756 -4.668908
4 -2.760422 -4.146634 -5.520845 -6.901056 -8.281267
5 -4.158362 -6.237544 -8.316725 -10.395906 -12.475087
6 -5.714665 -8.571997 -11.429330 -14.286662 -17.143995

We set a threshold for the logarithmic value at -3. Therefore, all puzzles smaller than 3x3 are considered easy, while those of size 3x3 and larger are considered hard.


Results πŸ“ˆ

Humans can solve the puzzles by strategically reasoning with the constraints presented in the clues, using deliberate thinking methods such as reductio ad absurdum and the process of elimination. However, LLMs are still weak in such logical reasoning tasks. The best LLM, Claude 3.5 Sonnet, can only solve 33.4% of all puzzles and just 12.4% of the hard puzzles. Smaller language models with 7 to 10 billion parameters struggle to solve hard puzzles and also exhibit low accuracy on easy puzzles.

Our results indicate that LLMs still lack several abilities commonly required for complex logical reasoning: counterfactual thinking, reflective reasoning, structured memorization, and compositional generalization, etc.

image

View all results on our leaderboard https://huggingface.co/spaces/allenai/ZebraLogic

Greedy decoding vs. Sampling?

Recent research shows that greedy decoding usually leads to better performance in hard reasoning tasks. However, in our case, some models can degenerate when they generate reasoning steps (e.g., start to repeatedly decode the same sentences). Thus, we also use sampling with a 0.5 temperature for some models. A few models get higher acc when using sampling but most models have better perforamnce in greedy decoding.


Puzzle Creation 🏭

Zebra puzzles can be synthetically generated by programs:

  • We start by defining a set of features and their possible values (e.g., the feature CarModel might have values like Tesla Model 3, Ford F150, etc.).
  • Next, we establish the clue types and their language templates, which include placeholders for values to be filled in. Each clue type is logically structured to describe a type of constraint that can involve multiple variables.
  • To create a ZebraLogic example, we randomly assign values on a sampled grid as the solution. Then, we enumerate all possible clues that can describe the relation among variables.
  • By iteratively removing clues through weighted sampling, we continuously check if the remaining set of clues can uniquely lead to the above solution.
  • Finally, we represent the puzzle with prompting templates to form the inputs for the LLMs.

The used clue types are as follows:

  • Found_At: the tea drinker lives in House 3
  • Not_At: the musician does not drink tea
  • Same_House: the musician drinks tea
  • Direct_Left/Right: the green house is directly to the left/right of the white house
  • Side_By_Side: the coffee drinker and the tea drinker are next to each other.
  • Left/Right_Of: A is somewhere to the left/right of B
  • One/Two_between: There is one/two house between A and B.

Future Directions πŸ”œ

  • More reasoning methods: We are interested in evaluting LLM agents (e.g., ReAct, Reflexion, SwiftSage). Also, we'll explore advanced prompting and fine-tuning methods like Tree of Thoughts, Flow of Reasoning, etc.
  • More evaluation methods: We are considering trying multiple-choice format for faster evaluation.
  • Fine-tuning with Logic Puzzles: Can fine-tuning with synthetic logical reasoning tasks improve the general abilities of LLMs?
  • Analyze the internal reasoning mechanism of LLMs: how do LLMs reason correctly and incorrectly.
  • More tasks: We'll add more types of logic puzzles that require a more diverse set of reasoning abilities in the evaluation.

Citations

@misc{zebralogic2024,
    title={ZebraLogic: Benchmarking the Logical Reasoning Ability of Language Models},
    author={Bill Yuchen Lin, Ronan Le Bras, Yejin Choi},
    url={https://huggingface.co/spaces/allenai/ZebraLogic},
    year={2024}
}

@article{dziri2024faith,
  title={Faith and fate: Limits of transformers on compositionality},
  author={Nouha Dziri and Ximing Lu and Melanie Sclar and Xiang Lorraine Li and Liwei Jian and Bill Yuchen Lin and Peter West and Chandra Bhagavatula and Ronan Le Bras and Jena D. Hwang and Soumya Sanyal and Sean Welleck and Xiang Ren and Allyson Ettinger and Za{\"i}d Harchaoui and Yejin Choi},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}

@yuchenlin
Copy link
Author

test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment