@VictorTaelin
Last active May 2, 2024 07:48
A::B Prompting Challenge: $10k to prove me wrong!

CHALLENGE

Develop an AI prompt that solves random 12-token instances of the A::B problem (defined here), with 90%+ success rate.
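
For readers unfamiliar with the problem: the A::B system rewrites a sequence of the four tokens A#, #A, B#, #B using four rules (quoted verbatim in a comment near the end of this page): A# #A and B# #B cancel, while A# #B and B# #A swap. Below is a minimal reference reducer, sketched in Python; the function and variable names are mine and not part of the challenge tooling.

# Sketch of a reference reducer for the A::B problem.
# The rule set is the one quoted in mateon1's comment further down;
# names here are illustrative only, not part of the challenge tooling.
RULES = {
    ("A#", "#A"): [],            # A# #A -> nothing
    ("B#", "#B"): [],            # B# #B -> nothing
    ("A#", "#B"): ["#B", "A#"],  # A# #B -> #B A#
    ("B#", "#A"): ["#A", "B#"],  # B# #A -> #A B#
}

def reduce_ab(tokens):
    """Apply the rewrite rules (leftmost pair first) until none applies.

    Returns the normal form and the number of rewrite steps taken."""
    tokens = list(tokens)
    steps = 0
    changed = True
    while changed:
        changed = False
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            if pair in RULES:
                tokens[i:i + 2] = RULES[pair]
                steps += 1
                changed = True
                break
    return tokens, steps

if __name__ == "__main__":
    problem = "A# B# #B A# A# #B #B A# A# #B A# A#".split()
    solution, steps = reduce_ab(problem)
    print(" ".join(solution))  # #B #B #B A# A# A# A# A# A# A#  (12 steps with this leftmost-first strategy)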

RULES

1. The AI will be given a <problem/> to solve.

We'll use your prompt as the SYSTEM PROMPT, and a specific instance of the problem as the PROMPT, inside XML tags. Example:

<problem>A# B# #B A# A# #B #B A# A# #B A# A#</problem>

2. The AI must end the answer with a <solution/>.

The solution must occur INSIDE the AI's answer (1 inference call), in literal form (not code), inside XML tags. Example:

... work space ...
... work space ...
... work space ...
... work space ...
<solution>#B #B #B A# A# A# A# A# A# A#</solution>
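
For intuition, applying the rewrite rules shown under CHALLENGE by hand takes the example problem from rule 1 to this example solution (the intermediate states below are my own working, condensed):

A# B# #B A# A# #B #B A# A# #B A# A#
->  A# A# A# #B #B A# A# #B A# A#        (the adjacent B# #B pair cancels)
->  ...                                  (each #B then swaps leftward past the A#'s via A# #B -> #B A#)
->  #B #B #B A# A# A# A# A# A# A#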

3. The AI answer can use up to 32K tokens.

The AI answer can use up to 32K tokens, which gives it room to work on the solution step-by-step, review mistakes, create local scratchpads, and anything else you want it to do before arriving at the final answer.

4. You can use ANY public GPT model.

You can select any public model released before this date to test your prompt on, as long as it is based on the GPT (transformer) architecture. Adaptations (such as MoE) are tolerated, as long as the answer is fully generated by the attention mechanism, forward passes etc. Other architectures aren't allowed, including SAT solvers and the like. When the model is proprietary and the underlying architecture isn't clear, it will not be allowed.

I recommend gpt-4-0314, gpt-4-turbo-preview or claude-3-opus-20240229, with temperature=0.0. Open-source models are allowed too. No fine-tuning or training on the problem is allowed. No internet access or code interpretation allowed. The answer must be self-contained in a single inference call.

Note: pay attention to your chosen model's output limit. 12-token instances take up to 36 steps to complete, which may not fit the limit. (If there is no answer in the output, it will be considered a miss.)

5. Your prompt may include ANYTHING, up to 8K tokens.

All prompting techniques are allowed. You can ask the AI to work step-by-step, to use in-context scratchpads, to review mistakes, to use anchor points, and so on. You can include papers, code, and as many examples as you want. You can offer it money, affection, or threaten its friends, if that's your thing. In short, there are absolutely no limits, other than the 8K tokens and common sense (nothing that will get me banned!).

6. Keep it fun! No toxicity, spam or harassment.

Especially not asking me to bet on stuff. Except if you want me to bet that crows can solve this problem. I'd absolutely take it.

EVALUATION

The input problem will consist of a random 12-token instance of the A::B problem, with difficulties ranging from 0 to 24 necessary steps. Then, we'll check if the answer includes, in-context, the correct solution, as described.
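
Concretely, the check only needs to pull the literal <solution/> out of the answer text and compare it, token for token, with the normal form of the input. A rough sketch of such a check in Python (this is not the official evaluator, which is linked in the Final Update below; reduce_ab is the reference reducer sketched under CHALLENGE, and the module name is hypothetical):

import re
from ab_reducer import reduce_ab  # hypothetical module holding the reducer sketched above

def check_answer(problem: str, answer: str) -> bool:
    """True iff the answer contains the correct literal <solution/> tag."""
    matches = re.findall(r"<solution>(.*?)</solution>", answer, re.DOTALL)
    if not matches:
        return False  # no solution tag in the output counts as a miss
    claimed = matches[-1].split()
    expected, _steps = reduce_ab(problem.split())
    return claimed == expected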

PARTICIPATING

Submit your prompt in a Gist, either via DM or as a reply to this tweet, including the selected model and configuration. We'll take submissions in chronological order. To test a submission, we'll use the Gist contents as the system prompt, and the problem as the prompt.

UPDATE

To participate, you must post the Keccak256 of your prompt as a comment to this Gist.

The deadline is next Wednesday (April 10), 12am (Brasilia time).

You must include the model name and settings in the comment.

You can post as many entries as you want. Make a new comment for each attempt.

Use the Keccak256 hash function (NOT SHA3 nor SHA256).
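
Keccak-256 is the pre-NIST variant used by Ethereum; it differs from SHA3-256 only in the padding byte, so the two digests will not match. One way to compute it, sketched here with pycryptodome (any correct Keccak-256 implementation gives the same hex digest; note the later comment pointing out that a trailing newline at the end of the file changes the hash):

# Sketch: Keccak-256 of a prompt file using pycryptodome (pip install pycryptodome).
# This is Keccak-256, NOT SHA3-256 and NOT SHA-256.
from Crypto.Hash import keccak

def keccak256_of_file(path: str) -> str:
    k = keccak.new(digest_bits=256)
    with open(path, "rb") as f:
        k.update(f.read())  # hashes the bytes exactly as stored, trailing newline included
    return k.hexdigest()

print(keccak256_of_file("prompt.txt"))  # "prompt.txt" is a placeholder file name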

Reveal your prompt on the deadline day, in a NEW comment.

Submissions on Twitter will NOT be considered from now on. (Submissions sent BEFORE this update will still be considered.)

The earliest comment here containing a Keccak256 corresponding to a valid solution will be the winner.

A valid solution is a prompt that passes 45 out of 50 attempts on the evaluator.

The evaluator will be made public and open source right after the deadline.

Payment will be made in BTC or ETH.

Comment template:

PROMPT: <keccak256_of_your_prompt>
MODEL: <model-name-here>
TEMPERATURE: <temperature-here>
<additional-configs-here>

COMMENTS

The original challenge was 7-token! Moving goalposts?

That 7-token instance was merely an example, and nothing else. It didn't occur to me people would take it as the challenge. If you did, I apologize. My original statement was absolutely wrong, and GPTs can, obviously, solve 7-token instances. This should be obvious enough. There are only 16K 7-token instances. You could even fit all solutions in a prompt if you wanted to. For what it's worth, 12 tokens is still REALLY low. It takes an average of 6 steps to solve it.
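
Both figures are easy to sanity-check: there are 4^7 = 16,384 possible 7-token instances, and the average step count for random 12-token instances can be estimated empirically. A rough sketch, reusing the reference reducer from the CHALLENGE section (module name hypothetical; the step count reported follows the leftmost-first strategy used there):

# Sketch: sanity-check the instance count and the average difficulty claim.
import random
from ab_reducer import reduce_ab  # hypothetical module with the reducer sketched above

TOKENS = ["A#", "#A", "B#", "#B"]

print(4 ** 7)  # 16384 distinct 7-token instances

samples = 10_000
total_steps = 0
for _ in range(samples):
    problem = [random.choice(TOKENS) for _ in range(12)]
    _normal_form, steps = reduce_ab(problem)
    total_steps += steps
print(total_steps / samples)  # empirical average number of rewrite steps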

Human kids can't solve that problem either!

It takes at most 36 steps to solve a 12-token instance, and an average of 6 steps. So, quite literally, most 12-token instances can be solved by swapping 6 pairs. This is trivial. It is way simpler than addition, which kids do just fine. Even long division, which is way harder than this, is taught in 3rd grade. I'm confident the average human kid can absolutely learn to solve this problem by themselves, while no GPT will ever tackle 20-token instances without using a code interpreter.

But humans would make errors and fix them!

So... ask the AI to review and fix its errors.

Why not just connect GPT to Wolfram Alpha to solve math problems?

Because Wolfram Alpha is just a collection of existing mathematical solutions designed by humans. While it allows a GPT to tackle known problems (like this), it doesn't give it the cognitive capability to solve NEW problems, which is what most people are looking for, when they talk about AGI.

You're underestimating progress! AI will solve this eventually!

I believe that! This is not about AIs. This is specifically about GPTs, an architecture published in 2017, with specific limitations that make this problem intractable for them, no matter how much you scale them.

So what are you trying to show?

That these pesky AIs will never do LOGIC like we do! JK; just that GPTs aren't, and never will be, able to perform long symbolic manipulations and, thus, long reasoning processes, which, I believe, are necessary to learn new concepts and prove hard mathematical theorems (which often take years of continuous work). That says nothing about novel architectures, especially those in the direction of making transformers less rigid.

It is well-known that GPTs can't do arithmetic. You're stating the obvious!

Correct! (Also: you're probably not the target demographic.)

Then why all the fuss?

I don't know. Ask Musk. This is not a paper. This is a tweet.

So you're confident nobody will win the challenge?

Actually not! I've seen some pretty interesting prompting techniques, which is why I made it 12 tokens: to make a solution possible, in good faith. Even if someone wins, that doesn't mean GPTs can solve this problem in general. A 20-token instance takes at most 100 operations, which a dedicated human can absolutely do with near-100% success rate. Human calculators in the Mathematical Tables Project performed 20-digit multiplications with >99% success rate.


Final Update

The challenge has been beaten.

For those interested in claiming the $2.5k prize, please reveal your prompt, run the evaluator below (which will test it on 12 instances, of which 10 must be correct), and report back on the cost. I'll DM the winner to ask for their ETH/BTC address.

Evaluator: https://github.com/VictorTaelin/ab_challenge_eval

The winning prompt by Bob (@futuristfrog on X) is also on that repository.

Thanks all again. Have a great day!

@choltha commented Apr 11, 2024

I tried to make it work with Sonnet but it just made too many mistakes. No modification on top improved it to the 10/12 level required.
That's why I loved that the run with Gemini (which is even cheaper than Claude Sonnet) nailed it.

Too late, sadly, but Gemini Pro 1.5 seems to offer a really good price/result ratio. It worked with a slightly modified prompt based on @aniemerg 's prompt mentioned above. I used openrouter.ai / https://openrouter.ai/models/google/gemini-pro-1.5 for easy access. Prompt: https://gist.github.com/choltha/4ccbd0f59ecb1912e2bdf82df1cd9f27 Result (weird formatting, but correct solution): https://gist.github.com/choltha/e37570e9d139c5aa711aa6ac58ca6144

@GameDevGitHub

I tried to make it work with Sonnet but it just made too many mistakes. No modification on top improved it to the 10/12 level required. That's why I loved that the run with Gemini (which is even cheaper than Claude Sonnet) nailed it.

Too late, sadly, but Gemini Pro 1.5 seems to offer a really good price/result ratio. It worked with a slightly modified prompt based on @aniemerg 's prompt mentioned above. I used openrouter.ai / https://openrouter.ai/models/google/gemini-pro-1.5 for easy access. Prompt: https://gist.github.com/choltha/4ccbd0f59ecb1912e2bdf82df1cd9f27 Result (weird formatting, but correct solution): https://gist.github.com/choltha/e37570e9d139c5aa711aa6ac58ca6144

Is this a submission? Checked the Hash from https://gist.github.com/VictorTaelin/8ec1d8a0a3c87af31c25224a1f7e31ec?permalink_comment_id=5018035#gistcomment-5018035 but it didn't match. (I may have made a mistake)

@VictorTaelin (Author)

@choltha can you clarify if this is the prompt you submitted? the hash doesn't match

seems like @GameDevGitHub's prompt is the only (pre-deadline) to reach >90% after all. if I'm mistaken please let me know

@IronChariot

Is the comment re: >90% related to the $2.5k part of the challenge? If so I may have misunderstood the goal as defined in the Final Update section above...

If it's not related then don't mind me...

@thakkarparth007

@IronChariot model does quite well but for long outputs it runs out of token budget of 4096 tokens. It gets 8/12, with 1 incorrect (surprisingly it's instance 5, difficulty 10 -- somewhere in between) and 3 token exhaustions (instance 8, 10, 11)

@IronChariot

@IronChariot model does quite well but for long outputs it runs out of token budget of 4096 tokens. It gets 8/12, with 1 incorrect (surprisingly it's instance 5, difficulty 10 -- somewhere in between) and 3 token exhaustions (instance 8, 10, 11)

Interesting! Did you run it just once?

@thakkarparth007

yeah

@ismellpillows commented Apr 12, 2024

https://gist.github.com/VictorTaelin/8ec1d8a0a3c87af31c25224a1f7e31ec?permalink_comment_id=5015154#gistcomment-5015154

2be1675917d32c068cec1782d64a80120b81af284174d855c3bf435f0ba3c901

model: gpt-4-0314
temperature: 0

reveal: https://gist.github.com/ismellpillows/db1497c3961820625f3970c07742f008
evaluator log (11/12): https://gist.github.com/ismellpillows/4649df23e8a549b7133150c75b3c61ce
cost: $1.85 total, $0.1542 per instance

note: if you are checking the keccak256, include an empty line at end of file

@choltha commented Apr 12, 2024

@choltha can you clarify if this is the prompt you submitted? the hash doesn't match

seems like @GameDevGitHub's prompt is the only (pre-deadline) to reach >90% after all. if I'm mistaken please let me know

Here is the raw prompt:
https://gist.githubusercontent.com/choltha/175ed7ff16b2abe7d08dd2dcd11cff09/raw/ef4a18bd5c64a78755375aebce4a4f66b423c58f/raw%2520prompt

@GameDevGitHub commented Apr 12, 2024

@choltha can you clarify if this is the prompt you submitted? the hash doesn't match
seems like @GameDevGitHub's prompt is the only (pre-deadline) to reach >90% after all. if I'm mistaken please let me know

Here is the raw prompt: https://gist.githubusercontent.com/choltha/175ed7ff16b2abe7d08dd2dcd11cff09/raw/ef4a18bd5c64a78755375aebce4a4f66b423c58f/raw%2520prompt

Congrats you won!

@ismellpillows

What is the winning cost?

@VictorTaelin (Author)

Cool! Congratulations @choltha. Will be sending you $2.5k for winning the second part of the challenge. Send me your addresses via DM on X.

@nikhilsaraf

PROMPT: 234be2ce2d2609a105c68d3dfb73fa7205ce2c24a08d67dd03667ebdc76515f9
MODEL: Chat-GPT (GPT-4)
TEMPERATURE: standard Chat-GPT settings for GPT-4

https://gist.github.com/nikhilsaraf/d849367bf45f031b67514e30196fd19d

@VictorTaelin did you get a chance to try my solution for the $2.5k prize?

@choltha commented Apr 12, 2024

@VictorTaelin @GameDevGitHub Thanks! I will get in touch with @aniemerg as it was essentially his work.

@choltha commented Apr 12, 2024

What is the winning cost?

Model: Claude 3 -- $15/M input tokens, $75/M output tokens
as per https://console.anthropic.com/settings/logs
12 requests used a total of 108,128 tokens in + 15,211 tokens out, which is $1.62 in + $1.14 out = $2.76 total
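
The arithmetic behind those figures, for reference (token counts and per-million-token prices as reported above):

# Sketch: reproduce the reported cost from the token counts above.
in_tokens, out_tokens = 108_128, 15_211
in_cost = in_tokens * 15 / 1_000_000    # ~$1.62
out_cost = out_tokens * 75 / 1_000_000  # ~$1.14
print(round(in_cost, 2), round(out_cost, 2), round(in_cost + out_cost, 2))  # 1.62 1.14 2.76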

@choltha commented Apr 12, 2024

PROMPT: 234be2ce2d2609a105c68d3dfb73fa7205ce2c24a08d67dd03667ebdc76515f9 MODEL: Chat-GPT (GPT-4) TEMPERATURE: standard Chat-GPT settings for GPT-4

https://gist.github.com/nikhilsaraf/d849367bf45f031b67514e30196fd19d

@VictorTaelin did you get a chance to try my solution for the $2.5k prize?

I tested it for you, it didn't work (aborted run 2 as it did not get anywhere and was just burning tokens) https://gist.github.com/choltha/b62cfbc7b62caf79fdf55885346af145

@choltha commented Apr 12, 2024

Thanks! My entry above costs less though https://gist.github.com/VictorTaelin/8ec1d8a0a3c87af31c25224a1f7e31ec?permalink_comment_id=5020702#gistcomment-5020702

Yes and your solution eval was past the deadline.
Best of luck next time!

@ismellpillows commented Apr 12, 2024

My prompt keccak256 was posted 5 days ago, definitely submitted before Wednesday.
https://gist.github.com/VictorTaelin/8ec1d8a0a3c87af31c25224a1f7e31ec?permalink_comment_id=5015154#gistcomment-5015154

@choltha commented Apr 12, 2024

My prompt keccak256 was posted 5 days ago, definitely submitted before Wednesday. https://gist.github.com/VictorTaelin/8ec1d8a0a3c87af31c25224a1f7e31ec?permalink_comment_id=5015154#gistcomment-5015154

Solution >eval< was past the deadline.

@ismellpillows commented Apr 12, 2024

Whichever prompt, among all submitted by Wednesday, achieves the lowest cost of solving 12 instances (of each difficulty level)

I thought these were the rules?

guys you need to reveal your prompts! if by tomorrow 13:00 (Brasilia time) nobody else does, it will default to @IronChariot

If you are going by this, my comment posting the prompt is before yours

but I’m not debating bc of the money lol, you were already chosen! Congrats!!

@nikhilsaraf

PROMPT: 234be2ce2d2609a105c68d3dfb73fa7205ce2c24a08d67dd03667ebdc76515f9 MODEL: Chat-GPT (GPT-4) TEMPERATURE: standard Chat-GPT settings for GPT-4

https://gist.github.com/nikhilsaraf/d849367bf45f031b67514e30196fd19d

@VictorTaelin did you get a chance to try my solution for the $2.5k prize?

I tested it for you, it didn't work (aborted run 2 as it did not get anywhere and was just burning tokens) https://gist.github.com/choltha/b62cfbc7b62caf79fdf55885346af145

Thank you for running it. However, the output I get when I run it in ChatGPT is very different and much smaller (see below). Could this be because the input that the runner provided did not contain <problem> tags as was defined in the requirement? I don’t see <problem> tags in the output you shared. Curious to understand why the runner provided it in a different format than the requirement. Thanks!

Input I provided in ChatGPT after entering my prompt:

<problem>#B #B #B #A #B #B #A #B #B B# B# B#</problem>

Output from ChatGPT:

The simplified form of the problem instance #B #B #B #A #B #B #A #B #B B# B# B# is #B #B #B #A #B #B #A #B #B B# B# B#. No further reduction was possible according to the rules provided.

Screenshot from ChatGPT: [IMG_9119]

@choltha commented Apr 13, 2024

PROMPT: 234be2ce2d2609a105c68d3dfb73fa7205ce2c24a08d67dd03667ebdc76515f9 MODEL: Chat-GPT (GPT-4) TEMPERATURE: standard Chat-GPT settings for GPT-4

https://gist.github.com/nikhilsaraf/d849367bf45f031b67514e30196fd19d

@VictorTaelin did you get a chance to try my solution for the $2.5k prize?

I tested it for you, it didn't work (aborted run 2 as it did not get anywhere and was just burning tokens) https://gist.github.com/choltha/b62cfbc7b62caf79fdf55885346af145

Thank you for running it. However, the output I get when I run it in ChatGPT is very different and much smaller (see below). Could this be because the input that the runner provided did not contain <problem> tags as was defined in the requirement? I don’t see <problem> tags in the output you shared. Curious to understand why the runner provided it in a different format than the requirement. Thanks!

Input I provided in ChatGPT after entering my prompt:

<problem>#B #B #B #A #B #B #A #B #B B# B# B#</problem>

Output from ChatGPT:

The simplified form of the problem instance #B #B #B #A #B #B #A #B #B B# B# B# is #B #B #B #A #B #B #A #B #B B# B# B#. No further reduction was possible according to the rules provided.

The problem you use is trivial, as it's problem = solution.
Check the log https://gist.github.com/choltha/ea3ac9ba4de95aa75b58071547bb5684 and try any of the example problems there (and compare to the respective solutions) and see if that works for you.

@choltha commented Apr 13, 2024

Whichever prompt, among all submitted by Wednesday, achieves the lowest cost of solving 12 instances (of each difficulty level)

I thought these were the rules?

guys you need to reveal your prompts! if by tomorrow 13:00 (Brasilia time) nobody else does, it will default to @IronChariot

If you are going by this, my comment posting the prompt is before yours

but I’m not debating bc of the money lol, you were already chosen! Congrats!!

Thanks.

I want to make this as clear as possible, so:
My prompt link was only a clarification; my actual eval log was posted before that (and it included the prompt as well). I just additionally posted the raw prompt again because there seemed to be a problem getting the prompt out of the log in a way that exactly matches the hash (probably too few or too many newlines above/below; the actual prompt is identical).

@choltha commented Apr 13, 2024

Besides this organisational confusion, I want to tell you I really liked your solution. One of my own tries took a similar route with token replacement, but I only tested it with Sonnet and it didn't work there, so I didn't follow up on it. I was a bit too hell-bent on making it work with Sonnet instead of Opus.

This was one example test with Sonnet: https://gist.github.com/choltha/383814226639933ecdb4eb505b88e361
If you wonder about the weird format, that's because I was using DSPy https://dspy-docs.vercel.app/docs/building-blocks/optimizers#automatic-few-shot-learning to solve this programmatically.

@choltha commented Apr 13, 2024

This was one run where Sonnet actually got it right, though not at the 10/12 rate when I ran a full test run:
https://gist.github.com/choltha/49d565a7a95c4fb8e75d10ae31451aab (only the very last one at the bottom is done by the AI; everything above is examples).

@nikhilsaraf

PROMPT: 234be2ce2d2609a105c68d3dfb73fa7205ce2c24a08d67dd03667ebdc76515f9 MODEL: Chat-GPT (GPT-4) TEMPERATURE: standard Chat-GPT settings for GPT-4

https://gist.github.com/nikhilsaraf/d849367bf45f031b67514e30196fd19d

@VictorTaelin did you get a chance to try my solution for the $2.5k prize?

I tested it for you, it didn't work (aborted run 2 as it did not get anywhere and was just burning tokens) https://gist.github.com/choltha/b62cfbc7b62caf79fdf55885346af145

Thank you for running it. However, the output I get when I run it in ChatGPT is very different and much smaller (see below). Could this be because the input that the runner provided did not contain <problem> tags as was defined in the requirement? I don’t see <problem> tags in the output you shared. Curious to understand why the runner provided it in a different format than the requirement. Thanks!
Input I provided in ChatGPT after entering my prompt:

<problem>#B #B #B #A #B #B #A #B #B B# B# B#</problem>

Output from ChatGPT:

The simplified form of the problem instance #B #B #B #A #B #B #A #B #B B# B# B# is #B #B #B #A #B #B #A #B #B B# B# B#. No further reduction was possible according to the rules provided.

The problem you use is trivial, as it's problem = solution. Check the log https://gist.github.com/choltha/ea3ac9ba4de95aa75b58071547bb5684 and try any of the example problems there (and compare to the respective solutions) and see if that works for you.

Thanks for pointing that out. I’ve tried with more complex problems as well, but here is instance 11 from the link you shared; please see the second problem/solution in the attached screenshot.

[Screenshot: IMG_9127]

@choltha commented Apr 14, 2024

Code Interpreter ran something there, which might be a code solver. That was explicitly not allowed, as it makes the task trivial.
Try again with GPT-4 without Code Interpreter.

@nikhilsaraf

Codeinterpreter did run something there, which might be a code solver. That was explicitly not allowed, as it makes the task trivial. Try again with GPT-4 without Codeinterpreter.

Ok, got it, I wasn't aware that was not allowed until now. Thank you.

The challenge says that any input was allowed, so hypothetically if I had included the code as part of my prompt, would that have been allowed? (quoted below for easy reference)

  1. Your prompt may include ANYTHING, up to 8K tokens.
    All prompting techniques are allowed. You can ask the AI to work step-by-step, to use in-context scratchpads, to review mistakes, to use anchor points, and so on. You can include papers, code, and as many examples as you want. You can offer it money, affection, or threaten its friends, if that's your thing. In short, there are absolutely no limits, other than the 8K tokens and common sense (nothing that will get me banned!).

@mateon1 commented Apr 15, 2024

From what I understand, yes, if the code was just used as a prompt and no actual evaluation happened.
Prompting GPTs with code can actually be quite powerful; one of my favorite prompts is to make the models roleplay a Linux terminal, with an AI escape hatch. If I had attempted this challenge (I didn't come across it earlier; I don't use X/itter), I would probably have done so by writing Python (pseudo)code, specifying an intermediate result format and then printing the solution, possibly bluffing about nonexistent library functions to reduce prompt size, then providing a few example execution logs.

Here's an example prompt I threw together in 10 minutes

Sadly, I'm low on money and credits right now, so I didn't spend too much effort optimizing this and didn't try it with any of the truly powerful models. I can confirm that it does not work with Mixtral models (neither 8x22B text nor 8x7B instruct), but I think something close could absolutely work.

You are a linux terminal providing AI capabilities to programs.
$ pwd
/home/aiden

$ cat >solve.py <<EOF
import re
import ai

ai.describe("""
You are given a sequence of tokens: 'A#', '#A', 'B#', '#B', in some order.
You may rewrite neighboring tokens according to the following rules:
A# #A -> nothing
A# #B -> #B A#
B# #A -> #A B#
B# #B -> nothing

You must apply these rewrite rules until no more changes are possible
""")
problem = re.match(r"<problem>(.*)</problem>", input()).group(1).split()
stuck = False
while True:
    print(problem)
    position = ai.pick(problem)
    if position < 0: break
    front, middle, back = problem[:position], problem[position:position+2], problem[position+2:]
    replacement = ai.apply(middle)
    print(front, "(", middle, ")", back, "->", replacement)
    problem = front + replacement + back

print("<solution>" + problem + "</solution>")
EOF

$ python solve.py
<problem>A# A# B# #A B# #B #A</problem>
A# A# B# #A B# #B #A
A# A# B# #A ( B# #B ) #A -> ""
A# A# B# #A #A
A# A# ( B# #A ) #A -> "#A B#"
A# A# #A B# #A
A# ( A# #A ) B# #A -> ""
A# B# #A
A# ( B# #A ) -> "#A B#"
A# #A B#
( A# #A ) B# -> ""
B#
<solution>B#</solution>

$ python solve.py

With this prompt Mixtral performs many invalid replacements, usually ( A# A# ) into whatever, ( #A B# ) -> #B A#, ( A# A# ) -> #B A# A#, and similar.
Representing the solution trace better, or clarifying/simplifying the algorithm, or being more forceful with the replacement rules are all possible improvements to the prompt.
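
For comparison, here is a deterministic stand-in for the bluffed ai.pick and ai.apply calls, i.e. the computation the model is being asked to emulate in-context (a sketch; the names mirror the pseudocode above):

# Sketch: concrete versions of the bluffed ai.pick / ai.apply calls.
RULES = {
    ("A#", "#A"): [],
    ("A#", "#B"): ["#B", "A#"],
    ("B#", "#A"): ["#A", "B#"],
    ("B#", "#B"): [],
}

def pick(problem):
    """Index of the leftmost rewritable pair, or -1 if the sequence is in normal form."""
    for i in range(len(problem) - 1):
        if (problem[i], problem[i + 1]) in RULES:
            return i
    return -1

def apply(pair):
    """Replacement (possibly empty) for a rewritable pair of tokens."""
    return RULES[(pair[0], pair[1])]

Plugged into the solve.py loop above, this reaches the same <solution>B#</solution> for the example problem, though possibly via a different order of rewrites than the trace shows.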

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment