@VictorTaelin
Last active May 2, 2024 07:48
A::B Prompting Challenge: $10k to prove me wrong!

CHALLENGE

Develop an AI prompt that solves random 12-token instances of the A::B problem (defined here), with 90%+ success rate.
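
For readers unfamiliar with the problem: the A::B system rewrites a sequence of the four tokens A#, #A, B#, #B using four rules (quoted verbatim in a comment near the end of this page): A# #A and B# #B cancel, while A# #B and B# #A swap. Below is a minimal reference reducer, sketched in Python; the function and variable names are mine and not part of the challenge tooling.

# Sketch of a reference reducer for the A::B problem.
# The rule set is the one quoted in mateon1's comment further down;
# names here are illustrative only, not part of the challenge tooling.
RULES = {
    ("A#", "#A"): [],            # A# #A -> nothing
    ("B#", "#B"): [],            # B# #B -> nothing
    ("A#", "#B"): ["#B", "A#"],  # A# #B -> #B A#
    ("B#", "#A"): ["#A", "B#"],  # B# #A -> #A B#
}

def reduce_ab(tokens):
    """Apply the rewrite rules (leftmost pair first) until none applies.

    Returns the normal form and the number of rewrite steps taken."""
    tokens = list(tokens)
    steps = 0
    changed = True
    while changed:
        changed = False
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            if pair in RULES:
                tokens[i:i + 2] = RULES[pair]
                steps += 1
                changed = True
                break
    return tokens, steps

if __name__ == "__main__":
    problem = "A# B# #B A# A# #B #B A# A# #B A# A#".split()
    solution, steps = reduce_ab(problem)
    print(" ".join(solution))  # #B #B #B A# A# A# A# A# A# A#  (12 steps with this leftmost-first strategy)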

RULES

1. The AI will be given a <problem/> to solve.

We'll use your prompt as the SYSTEM PROMPT, and a specific instance of the problem as the PROMPT, inside XML tags. Example:

<problem>A# B# #B A# A# #B #B A# A# #B A# A#</problem>

2. The AI must end the answer with a <solution/>.

The solution must occur INSIDE the AI's answer (1 inference call), in literal form (not code), inside XML tags. Example:

... work space ...
... work space ...
... work space ...
... work space ...
<solution>#B #B #B A# A# A# A# A# A# A#</solution>
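
For intuition, applying the rewrite rules shown under CHALLENGE by hand takes the example problem from rule 1 to this example solution (the intermediate states below are my own working, condensed):

A# B# #B A# A# #B #B A# A# #B A# A#
->  A# A# A# #B #B A# A# #B A# A#        (the adjacent B# #B pair cancels)
->  ...                                  (each #B then swaps leftward past the A#'s via A# #B -> #B A#)
->  #B #B #B A# A# A# A# A# A# A#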

3. The AI answer can use up to 32K tokens.

The AI answer can use up to 32K tokens, which gives it room to work on the solution step-by-step, review mistakes, create local scratchpads, and anything else you want it to do before arriving at the final answer.

4. You can use ANY public GPT model.

You can select any public model released before this date to test your prompt on, as long as it is based on the GPT (transformer) architecture. Adaptations (such as MoE) are tolerated, as long as the answer is fully generated by the attention mechanism, forward passes etc. Other architectures aren't allowed, including SAT solvers and the like. When the model is proprietary and the underlying architecture isn't clear, it will not be allowed.

I recommend gpt-4-0314, gpt-4-turbo-preview or claude-3-opus-20240229, with temperature=0.0. Open-source models are allowed too. No fine-tuning or training on the problem is allowed. No internet access or code interpretation allowed. The answer must be self-contained in a single inference call.

Note: pay attention to your chosen model's output limit. 12-token instances take up to 36 steps to complete, which may not fit the limit. (If there is no answer in the output, it will be considered a miss.)

5. Your prompt may include ANYTHING, up to 8K tokens.

All prompting techniques are allowed. You can ask the AI to work step-by-step, to use in-context scratchpads, to review mistakes, to use anchor points, and so on. You can include papers, code, and as many examples as you want. You can offer it money, affection, or threaten its friends, if that's your thing. In short, there are absolutely no limits, other than the 8K tokens and common sense (nothing that will get me banned!).

6. Keep it fun! No toxicity, spam or harassment.

Especially not asking me to bet on stuff. Except if you want me to bet that crows can solve this problem. I'd absolutely take it.

EVALUATION

The input problem will consist of a random 12-token instance of the A::B problem, with difficulties ranging from 0 to 24 necessary steps. Then, we'll check if the answer includes, in-context, the correct solution, as described.
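
Concretely, the check only needs to pull the literal <solution/> out of the answer text and compare it, token for token, with the normal form of the input. A rough sketch of such a check in Python (this is not the official evaluator, which is linked in the Final Update below; reduce_ab is the reference reducer sketched under CHALLENGE, and the module name is hypothetical):

import re
from ab_reducer import reduce_ab  # hypothetical module holding the reducer sketched above

def check_answer(problem: str, answer: str) -> bool:
    """True iff the answer contains the correct literal <solution/> tag."""
    matches = re.findall(r"<solution>(.*?)</solution>", answer, re.DOTALL)
    if not matches:
        return False  # no solution tag in the output counts as a miss
    claimed = matches[-1].split()
    expected, _steps = reduce_ab(problem.split())
    return claimed == expected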

PARTICIPATING

Submit your prompt in a Gist, either via DM or as a reply to this tweet, including the selected model and configuration. We'll take submissions in chronological order. To test a submission, we'll use the Gist contents as the system prompt, and the problem as the prompt.

UPDATE

To participate, you must post the Keccak256 of your prompt as a comment to this Gist.

The deadline is next Wednesday (April 10), 12am (Brasilia time).

You must include the model name and settings in the comment.

You can post as many entries as you want. Make a new comment for each attempt.

Use the Keccak256 hash function (NOT SHA3 nor SHA256).
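
Keccak-256 is the pre-NIST variant used by Ethereum; it differs from SHA3-256 only in the padding byte, so the two digests will not match. One way to compute it, sketched here with pycryptodome (any correct Keccak-256 implementation gives the same hex digest; note the later comment pointing out that a trailing newline at the end of the file changes the hash):

# Sketch: Keccak-256 of a prompt file using pycryptodome (pip install pycryptodome).
# This is Keccak-256, NOT SHA3-256 and NOT SHA-256.
from Crypto.Hash import keccak

def keccak256_of_file(path: str) -> str:
    k = keccak.new(digest_bits=256)
    with open(path, "rb") as f:
        k.update(f.read())  # hashes the bytes exactly as stored, trailing newline included
    return k.hexdigest()

print(keccak256_of_file("prompt.txt"))  # "prompt.txt" is a placeholder file name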

Reveal your prompt on the deadline day, in a NEW comment.

Submissions on Twitter will NOT be considered from now on. (Submissions sent BEFORE this update will still be considered.)

The earliest comment here containing a Keccak256 corresponding to a valid solution will be the winner.

A valid solution is a prompt that passes 45 out of 50 attempts on the evaluator.

The evaluator will be made public and open source right after the deadline.

Payment will be made in BTC or ETH.

Comment template:

PROMPT: <keccak256_of_your_prompt>
MODEL: <model-name-here>
TEMPERATURE: <temperature-here>
<additional-configs-here>

COMMENTS

The original challenge was 7-token! Moving goalposts?

That 7-token instance was merely an example, and nothing else. It didn't occur to me people would take it as the challenge. If you did, I apologize. My original statement was absolutely wrong, and GPTs can, obviously, solve 7-token instances. This should be obvious enough. There are only 16K 7-token instances. You could even fit all solutions in a prompt if you wanted to. For what it's worth, 12 tokens is still REALLY low. It takes an average of 6 steps to solve it.
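
Both figures are easy to sanity-check: there are 4^7 = 16,384 possible 7-token instances, and the average step count for random 12-token instances can be estimated empirically. A rough sketch, reusing the reference reducer from the CHALLENGE section (module name hypothetical; the step count reported follows the leftmost-first strategy used there):

# Sketch: sanity-check the instance count and the average difficulty claim.
import random
from ab_reducer import reduce_ab  # hypothetical module with the reducer sketched above

TOKENS = ["A#", "#A", "B#", "#B"]

print(4 ** 7)  # 16384 distinct 7-token instances

samples = 10_000
total_steps = 0
for _ in range(samples):
    problem = [random.choice(TOKENS) for _ in range(12)]
    _normal_form, steps = reduce_ab(problem)
    total_steps += steps
print(total_steps / samples)  # empirical average number of rewrite steps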

Human kids can't solve that problem either!

It takes at most 36 steps to solve a 12-token instance, and an average of 6 steps. So, quite literally, most 12-token instances can be solved by swapping 6 pairs. This is trivial. It is way simpler than addition, which kids do just fine. Even long division, which is way harder than this, is taught in 3rd grade. I'm confident the average human kid can absolutely learn to solve this problem by themselves, while no GPT will ever tackle 20-token instances without using a code interpreter.

But humans would make errors and fix them!

So... ask the AI to review and fix its errors.

Why not just connect GPT to Wolfram Alpha to solve math problems?

Because Wolfram Alpha is just a collection of existing mathematical solutions designed by humans. While it allows a GPT to tackle known problems (like this), it doesn't give it the cognitive capability to solve NEW problems, which is what most people are looking for, when they talk about AGI.

You're underestimating progress! AI will solve this eventually!

I believe that! This is not about AIs. This is specifically about GPTs, an architecture published in 2017, with specific limitations that make this problem intractable for them, no matter how much you scale them.

So what are you trying to show?

That these pesky AIs will never do LOGIC like we do! JK; just that GPTs aren't, and never will be, able to perform long symbolic manipulations and, thus, long reasoning processes, which, I believe, are necessary to learn new concepts and prove hard mathematical theorems (which often take years of continuous work). That says nothing about novel architectures, especially those in the direction of making transformers less rigid.

It is well-known that GPTs can't do arithmetic. You're stating the obvious!

Correct! (Also: you're probably not the target demographic.)

Then why all the fuss?

I don't know. Ask Musk. This is not a paper. This is a tweet.

So you're confident nobody will win the challenge?

Actually not! I've seen some pretty interesting prompting techniques, which is why I made it 12 tokens: to make a solution possible, in good faith. Even if someone wins, that doesn't mean GPTs can solve this problem in general. A 20-token instance takes at most 100 operations, which a dedicated human can absolutely do with near-100% success rate. Human calculators in the Mathematical Tables Project performed 20-digit multiplications with >99% success rate.


Final Update

The challenge has been beaten.

For those interested in claiming the $2.5k prize, please reveal your prompt, run the evaluator below (which will test it on 12 instances, of which 10 must be correct), and report back on the cost. I'll DM the winner to ask for their ETH/BTC address.

Evaluator: https://github.com/VictorTaelin/ab_challenge_eval

The winning prompt by Bob (@futuristfrog on X) is also on that repository.

Thanks all again. Have a great day!

@choltha commented Apr 11, 2024

I tried to make it work with Sonnet but it just made too many mistakes. No modification on top improved it to the 10/12 level required.
That's why I loved that the run with Gemini (which is even cheaper than Claude Sonnet) nailed it.

Too late, sadly, but Gemini Pro 1.5 seems to offer a really good price/result ratio. It worked with a slightly modified prompt based on @aniemerg 's prompt mentioned above. I used openrouter.ai / https://openrouter.ai/models/google/gemini-pro-1.5 for easy access. Prompt: https://gist.github.com/choltha/4ccbd0f59ecb1912e2bdf82df1cd9f27 Result (weird formatting, but correct solution): https://gist.github.com/choltha/e37570e9d139c5aa711aa6ac58ca6144

@GameDevGitHub

I tried to make it work with Sonnet but it just made too many mistakes. No modification on top improved it to the 10/12 level required. That's why I loved that the run with Gemini (which is even cheaper than Claude Sonnet) nailed it.

Too late, sadly, but Gemini Pro 1.5 seems to offer a really good price/result ratio. It worked with a slightly modified prompt based on @aniemerg 's prompt mentioned above. I used openrouter.ai / https://openrouter.ai/models/google/gemini-pro-1.5 for easy access. Prompt: https://gist.github.com/choltha/4ccbd0f59ecb1912e2bdf82df1cd9f27 Result (weird formatting, but correct solution): https://gist.github.com/choltha/e37570e9d139c5aa711aa6ac58ca6144

Is this a submission? Checked the Hash from https://gist.github.com/VictorTaelin/8ec1d8a0a3c87af31c25224a1f7e31ec?permalink_comment_id=5018035#gistcomment-5018035 but it didn't match. (I may have made a mistake)

@VictorTaelin (Author)

@choltha can you clarify if this is the prompt you submitted? the hash doesn't match

seems like @GameDevGitHub's prompt is the only (pre-deadline) to reach >90% after all. if I'm mistaken please let me know

@IronChariot

Is the comment re: >90% related to the $2.5k part of the challenge? If so I may have misunderstood the goal as defined in the Final Update section above...

If it's not related then don't mind me...

@thakkarparth007

@IronChariot model does quite well but for long outputs it runs out of token budget of 4096 tokens. It gets 8/12, with 1 incorrect (surprisingly it's instance 5, difficulty 10 -- somewhere in between) and 3 token exhaustions (instance 8, 10, 11)

@IronChariot

@IronChariot model does quite well but for long outputs it runs out of token budget of 4096 tokens. It gets 8/12, with 1 incorrect (surprisingly it's instance 5, difficulty 10 -- somewhere in between) and 3 token exhaustions (instance 8, 10, 11)

Interesting! Did you run it just once?

@thakkarparth007

yeah

@ismellpillows commented Apr 12, 2024

https://gist.github.com/VictorTaelin/8ec1d8a0a3c87af31c25224a1f7e31ec?permalink_comment_id=5015154#gistcomment-5015154

2be1675917d32c068cec1782d64a80120b81af284174d855c3bf435f0ba3c901

model: gpt-4-0314
temperature: 0

reveal: https://gist.github.com/ismellpillows/db1497c3961820625f3970c07742f008
evaluator log (11/12): https://gist.github.com/ismellpillows/4649df23e8a549b7133150c75b3c61ce
cost: $1.85 total, $0.1542 per instance

note: if you are checking the keccak256, include an empty line at end of file

@choltha commented Apr 12, 2024

@choltha can you clarify if this is the prompt you submitted? the hash doesn't match

seems like @GameDevGitHub's prompt is the only (pre-deadline) to reach >90% after all. if I'm mistaken please let me know

Here is the raw prompt:
https://gist.githubusercontent.com/choltha/175ed7ff16b2abe7d08dd2dcd11cff09/raw/ef4a18bd5c64a78755375aebce4a4f66b423c58f/raw%2520prompt

@GameDevGitHub commented Apr 12, 2024

@choltha can you clarify if this is the prompt you submitted? the hash doesn't match
seems like @GameDevGitHub's prompt is the only (pre-deadline) to reach >90% after all. if I'm mistaken please let me know

Here is the raw prompt: https://gist.githubusercontent.com/choltha/175ed7ff16b2abe7d08dd2dcd11cff09/raw/ef4a18bd5c64a78755375aebce4a4f66b423c58f/raw%2520prompt

Congrats you won!

@ismellpillows

What is the winning cost?

@VictorTaelin (Author)

Cool! Congratulations @choltha. Will be sending you $2.5k for winning the second part of the challenge. Send me your addresses via DM on X.

@nikhilsaraf

PROMPT: 234be2ce2d2609a105c68d3dfb73fa7205ce2c24a08d67dd03667ebdc76515f9
MODEL: Chat-GPT (GPT-4)
TEMPERATURE: standard Chat-GPT settings for GPT-4

https://gist.github.com/nikhilsaraf/d849367bf45f031b67514e30196fd19d

@VictorTaelin did you get a chance to try my solution for the $2.5k prize?

@choltha commented Apr 12, 2024

@VictorTaelin @GameDevGitHub Thanks! I will get in touch with @aniemerg as it was essentially his work.

@choltha commented Apr 12, 2024

What is the winning cost?

Model: Claude 3 -- $15/M input tokens, $75/M output tokens
as per https://console.anthropic.com/settings/logs
12 requests used a total of 108,128 tokens in + 15,211 tokens out, which is $1.62 in + $1.14 out = $2.76 total
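
The arithmetic behind those figures, for reference (token counts and per-million-token prices as reported above):

# Sketch: reproduce the reported cost from the token counts above.
in_tokens, out_tokens = 108_128, 15_211
in_cost = in_tokens * 15 / 1_000_000    # ~$1.62
out_cost = out_tokens * 75 / 1_000_000  # ~$1.14
print(round(in_cost, 2), round(out_cost, 2), round(in_cost + out_cost, 2))  # 1.62 1.14 2.76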

@choltha commented Apr 12, 2024

PROMPT: 234be2ce2d2609a105c68d3dfb73fa7205ce2c24a08d67dd03667ebdc76515f9 MODEL: Chat-GPT (GPT-4) TEMPERATURE: standard Chat-GPT settings for GPT-4

https://gist.github.com/nikhilsaraf/d849367bf45f031b67514e30196fd19d

@VictorTaelin did you get a chance to try my solution for the $2.5k prize?

I tested it for you, it didn't work (aborted run 2 as it did not get anywhere and was just burning tokens) https://gist.github.com/choltha/b62cfbc7b62caf79fdf55885346af145

@choltha commented Apr 12, 2024

Thanks! My entry above costs less though https://gist.github.com/VictorTaelin/8ec1d8a0a3c87af31c25224a1f7e31ec?permalink_comment_id=5020702#gistcomment-5020702

Yes and your solution eval was past the deadline.
Best of luck next time!

@ismellpillows commented Apr 12, 2024

My prompt keccak256 was posted 5 days ago, definitely submitted before Wednesday.
https://gist.github.com/VictorTaelin/8ec1d8a0a3c87af31c25224a1f7e31ec?permalink_comment_id=5015154#gistcomment-5015154

@choltha commented Apr 12, 2024

My prompt keccak256 was posted 5 days ago, definitely submitted before Wednesday. https://gist.github.com/VictorTaelin/8ec1d8a0a3c87af31c25224a1f7e31ec?permalink_comment_id=5015154#gistcomment-5015154

Solution >eval< was past the deadline.

@ismellpillows commented Apr 12, 2024

Whichever prompt, among all submitted by Wednesday, achieves the lowest cost of solving 12 instances (of each difficulty level)

I thought these were the rules?

guys you need to reveal your prompts! if by tomorrow 13:00 (Brasilia time) nobody else does, it will default to @IronChariot

If you are going by this, my comment posting the prompt is before yours

but I’m not debating bc of the money lol, you were already chosen! Congrats!!

@nikhilsaraf

PROMPT: 234be2ce2d2609a105c68d3dfb73fa7205ce2c24a08d67dd03667ebdc76515f9 MODEL: Chat-GPT (GPT-4) TEMPERATURE: standard Chat-GPT settings for GPT-4

https://gist.github.com/nikhilsaraf/d849367bf45f031b67514e30196fd19d

@VictorTaelin did you get a chance to try my solution for the $2.5k prize?

I tested it for you, it didn't work (aborted run 2 as it did not get anywhere and was just burning tokens) https://gist.github.com/choltha/b62cfbc7b62caf79fdf55885346af145

Thank you for running it. However, the output I get when I run it in ChatGPT is very different and much smaller (see below). Could this be because the input that the runner provided did not contain <problem> tags as was defined in the requirement? I don’t see <problem> tags in the output you shared. Curious to understand why the runner provided it in a different format than the requirement. Thanks!

Input I provided in ChatGPT after entering my prompt:

<problem>#B #B #B #A #B #B #A #B #B B# B# B#</problem>

Output from ChatGPT:

The simplified form of the problem instance #B #B #B #A #B #B #A #B #B B# B# B# is #B #B #B #A #B #B #A #B #B B# B# B#. No further reduction was possible according to the rules provided.

Screenshot from ChatGPT: [IMG_9119]

@choltha commented Apr 13, 2024

PROMPT: 234be2ce2d2609a105c68d3dfb73fa7205ce2c24a08d67dd03667ebdc76515f9 MODEL: Chat-GPT (GPT-4) TEMPERATURE: standard Chat-GPT settings for GPT-4

https://gist.github.com/nikhilsaraf/d849367bf45f031b67514e30196fd19d

@VictorTaelin did you get a chance to try my solution for the $2.5k prize?

I tested it for you, it didn't work (aborted run 2 as it did not get anywhere and was just burning tokens) https://gist.github.com/choltha/b62cfbc7b62caf79fdf55885346af145

Thank you for running it. However, the output I get when I run it in ChatGPT is very different and much smaller (see below). Could this be because the input that the runner provided did not contain <problem> tags as was defined in the requirement? I don’t see <problem> tags in the output you shared. Curious to understand why the runner provided it in a different format than the requirement. Thanks!

Input I provided in ChatGPT after entering my prompt:

<problem>#B #B #B #A #B #B #A #B #B B# B# B#</problem>

Output from ChatGPT:

The simplified form of the problem instance #B #B #B #A #B #B #A #B #B B# B# B# is #B #B #B #A #B #B #A #B #B B# B# B#. No further reduction was possible according to the rules provided.

The problem you use is trivial, as it's problem = solution.
Check the log https://gist.github.com/choltha/ea3ac9ba4de95aa75b58071547bb5684 and try any of the example problems there (and compare to the respective solutions) and see if that works for you.

@choltha commented Apr 13, 2024

Whichever prompt, among all submitted by Wednesday, achieves the lowest cost of solving 12 instances (of each difficulty level)

I thought these were the rules?

guys you need to reveal your prompts! if by tomorrow 13:00 (Brasilia time) nobody else does, it will default to @IronChariot

If you are going by this, my comment posting the prompt is before yours

but I’m not debating bc of the money lol, you were already chosen! Congrats!!

Thanks.

I want to make this as clear as possible, so:
My prompt link was only a clarification; my actual eval log was posted before that (and it included the prompt as well). I just additionally posted the raw prompt again because there seemed to be a problem getting the prompt out of the log in a way that exactly matches the hash (probably too few or too many newlines above/below; the actual prompt is identical).

@choltha commented Apr 13, 2024

Besides this organisational confusion, I want to tell you I really liked your solution. One of my own tries took a similar route with token replacement, but I only tested it with Sonnet and it didn't work there, so I didn't follow up on it. I was a bit too hell-bent on making it work with Sonnet instead of Opus.

This was one example test with Sonnet: https://gist.github.com/choltha/383814226639933ecdb4eb505b88e361
If you wonder about the weird format, that's because I was using DSPy https://dspy-docs.vercel.app/docs/building-blocks/optimizers#automatic-few-shot-learning to solve this programmatically.

@choltha commented Apr 13, 2024

This was one run where Sonnet actually got it right, though not at the 10/12 rate when I ran a full test run:
https://gist.github.com/choltha/49d565a7a95c4fb8e75d10ae31451aab (only the very last one at the bottom is done by the AI; everything above is examples).

@nikhilsaraf

PROMPT: 234be2ce2d2609a105c68d3dfb73fa7205ce2c24a08d67dd03667ebdc76515f9 MODEL: Chat-GPT (GPT-4) TEMPERATURE: standard Chat-GPT settings for GPT-4

https://gist.github.com/nikhilsaraf/d849367bf45f031b67514e30196fd19d

@VictorTaelin did you get a chance to try my solution for the $2.5k prize?

I tested it for you, it didn't work (aborted run 2 as it did not get anywhere and was just burning tokens) https://gist.github.com/choltha/b62cfbc7b62caf79fdf55885346af145

Thank you for running it. However, the output I get when I run it in ChatGPT is very different and much smaller (see below). Could this be because the input that the runner provided did not contain <problem> tags as was defined in the requirement? I don’t see <problem> tags in the output you shared. Curious to understand why the runner provided it in a different format than the requirement. Thanks!
Input I provided in ChatGPT after entering my prompt:

<problem>#B #B #B #A #B #B #A #B #B B# B# B#</problem>

Output from ChatGPT:

The simplified form of the problem instance #B #B #B #A #B #B #A #B #B B# B# B# is #B #B #B #A #B #B #A #B #B B# B# B#. No further reduction was possible according to the rules provided.

The problem you use is trivial, as it's problem = solution. Check the log https://gist.github.com/choltha/ea3ac9ba4de95aa75b58071547bb5684 and try any of the example problems there (and compare to the respective solutions) and see if that works for you.

Thanks for pointing that out. I’ve tried with more complex problems as well, but here is instance 11 from the link you shared; please see the second problem/solution in the attached screenshot.

[Screenshot: IMG_9127]

@choltha commented Apr 14, 2024

Code Interpreter ran something there, which might be a code solver. That was explicitly not allowed, as it makes the task trivial.
Try again with GPT-4 without Code Interpreter.

@nikhilsaraf

Codeinterpreter did run something there, which might be a code solver. That was explicitly not allowed, as it makes the task trivial. Try again with GPT-4 without Codeinterpreter.

Ok, got it, I wasn't aware that was not allowed until now. Thank you.

The challenge says that any input was allowed, so hypothetically if I had included the code as part of my prompt, would that have been allowed? (quoted below for easy reference)

  1. Your prompt may include ANYTHING, up to 8K tokens.
    All prompting techniques are allowed. You can ask the AI to work step-by-step, to use in-context scratchpads, to review mistakes, to use anchor points, and so on. You can include papers, code, and as many examples as you want. You can offer it money, affection, or threaten its friends, if that's your thing. In short, there are absolutely no limits, other than the 8K tokens and common sense (nothing that will get me banned!).

@mateon1 commented Apr 15, 2024

From what I understand, yes, if the code was just used as a prompt and no actual evaluation happened.
Prompting GPTs with code can actually be quite powerful; one of my favorite prompts is to make the models roleplay a Linux terminal, with an AI escape hatch. If I had attempted this challenge (I didn't come across it earlier; I don't use X/itter), I would probably have done so by writing Python (pseudo)code, specifying an intermediate result format and then printing the solution, possibly bluffing about nonexistent library functions to reduce prompt size, then providing a few example execution logs.

Here's an example prompt I threw together in 10 minutes

Sadly, I'm low on money and credits right now, so I didn't spend too much effort optimizing this and didn't try it with any of the truly powerful models. I can confirm that it does not work with Mixtral models (neither 8x22B text nor 8x7B instruct), but I think something close could absolutely work.

You are a linux terminal providing AI capabilities to programs.
$ pwd
/home/aiden

$ cat >solve.py <<EOF
import re
import ai

ai.describe("""
You are given a sequence of tokens: 'A#', '#A', 'B#', '#B', in some order.
You may rewrite neighboring tokens according to the following rules:
A# #A -> nothing
A# #B -> #B A#
B# #A -> #A B#
B# #B -> nothing

You must apply these rewrite rules until no more changes are possible
""")
problem = re.match(r"<problem>(.*)</problem>", input()).group(1).split()
stuck = False
while True:
    print(problem)
    position = ai.pick(problem)
    if position < 0: break
    front, middle, back = problem[:position], problem[position:position+2], problem[position+2:]
    replacement = ai.apply(middle)
    print(front, "(", middle, ")", back, "->", replacement)
    problem = front + replacement + back

print("<solution>" + problem + "</solution>")
EOF

$ python solve.py
<problem>A# A# B# #A B# #B #A</problem>
A# A# B# #A B# #B #A
A# A# B# #A ( B# #B ) #A -> ""
A# A# B# #A #A
A# A# ( B# #A ) #A -> "#A B#"
A# A# #A B# #A
A# ( A# #A ) B# #A -> ""
A# B# #A
A# ( B# #A ) -> "#A B#"
A# #A B#
( A# #A ) B# -> ""
B#
<solution>B#</solution>

$ python solve.py

With this prompt Mixtral performs many invalid replacements, usually ( A# A# ) into whatever, ( #A B# ) -> #B A#, ( A# A# ) -> #B A# A#, and similar.
Representing the solution trace better, or clarifying/simplifying the algorithm, or being more forceful with the replacement rules are all possible improvements to the prompt.
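
For comparison, here is a deterministic stand-in for the bluffed ai.pick and ai.apply calls, i.e. the computation the model is being asked to emulate in-context (a sketch; the names mirror the pseudocode above):

# Sketch: concrete versions of the bluffed ai.pick / ai.apply calls.
RULES = {
    ("A#", "#A"): [],
    ("A#", "#B"): ["#B", "A#"],
    ("B#", "#A"): ["#A", "B#"],
    ("B#", "#B"): [],
}

def pick(problem):
    """Index of the leftmost rewritable pair, or -1 if the sequence is in normal form."""
    for i in range(len(problem) - 1):
        if (problem[i], problem[i + 1]) in RULES:
            return i
    return -1

def apply(pair):
    """Replacement (possibly empty) for a rewritable pair of tokens."""
    return RULES[(pair[0], pair[1])]

Plugged into the solve.py loop above, this reaches the same <solution>B#</solution> for the example problem, though possibly via a different order of rewrites than the trace shows.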

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment