
@koratkar
Last active January 29, 2025 00:22
OpenAI's first version of Operator plays a simplified version of the first level of Montezuma's Revenge with minimal assistance from me. This is an essay on it, along with the chat history.

Update: this draft is now a post

OpenAI's first version of their computer-use agent "Operator" was released to brief fanfare, and with obvious, glaring issues.

However, it is highly capable, and its present issues seem to stem mostly from its inability to see. While this is of course a decisive limitation for a computer-use agent, Operator's intended use case in this early roll-out seems to be filling out forms, and its ability to detect the location of text on screens is nearly flawless. It instead struggles with spatial relationships and contrasting colors. I assume OpenAI is using some form of RL to generate computer-use data, combined with self-supervised training on that data, as was the case with Gato.

This is of particular interest to me due to Metaculus's Date of Weakly General AI market, which has seemed resolved in spirit since 2020's foreboding AI flurries, chiefly GPT-3 and Agent57, both of which displayed considerable and uncanny generality despite their heavy constraints. We're now in a full-on snowstorm, clambering to assemble cognitive benchmarks AI can't solve, and yet the market still remains open, presumably due to its insanely difficult criterion: a weakly general AI must be a unified system capable of learning Montezuma's Revenge in under 100 hours of real-time play, along with scoring above the 75th percentile on the SAT, exceeding 90% on WinoGrande, and passing a 2-hour-long Turing test. The last three have long since been saturated; only the most fiendishly difficult Atari game remains.

My intuition is that a scaled-up Gato or frankensteined Dreamer-v3 could probably muster its way through Montezuma's Revenge without special training, since the main challenge for RL systems of the 2010s wasn't the complexity of the game mechanics but that no world models existed to predict the game's sparse rewards. With systems of the 2020s now integrating RL with world modeling from unsupervised learning, the remaining issues have dwindled to pesky engineering problems, not a conceptual wall.

Language models show stunning out-of-distribution generalization in-context, and are even capable of episodic-memory-style recall over their contexts. While teaching entirely new skills in-context remains a struggle, fine-tuning and reinforcement learning atop LMs now press at the barrier of superhuman (competition) coding ability with o3, and crush pared-down tasks attempting to test skill acquisition.

Testing OpenAI's Operator over the last few days suggests that the Montezuma's Revenge criterion is now satisfied per its original intention, despite Operator's limitations.

I created a simple 2D platformer game of similar "abstract complexity" to the first level of Montezuma's Revenge using Claude 3.5 Sonnet, modifying the game such that Operator would be able to recognize artifacts despite its limited visual ability. Operator impressed me not with its ability to play, but with its ability to correct its mistakes, try new strategies, and apply my advice to update its approach. In terms of complexity or abstraction, exploring all the rooms in a novel environment such as Montezuma's Revenge seems within the scope of its abilities, though the visual components and "motor complexity" of the game seem far beyond its reach.
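The key accommodation, as the transcript below suggests, was pairing slow, hard-to-see movement with a textual state indicator the agent can read instead of tracking pixels. The game's actual source isn't reproduced here, so this is a minimal hypothetical sketch of that design; all names and values (`LADDER_SPEED`, `TOP_LEVEL_Y`, etc.) are illustrative assumptions, not taken from the real game:

```javascript
// Sketch of ladder logic with a text-based progress indicator.
// Each "UP" press moves the square only slightly (nearly invisible
// on screen), but the status text flips once the top is reached,
// giving a vision-limited agent a reliable, readable success signal.
const LADDER_SPEED = 2;   // pixels per "UP" press: slow by design
const TOP_LEVEL_Y = 100;  // hypothetical y-coordinate of the top platform

function pressUp(s) {
  if (!s.onLadder) return s;
  const y = Math.max(s.y - LADDER_SPEED, TOP_LEVEL_Y);
  const level = y <= TOP_LEVEL_Y ? "TOP LEVEL" : s.level;
  return { ...s, y, level };
}

function statusText(s) {
  return `You are on the: ${s.level}`;
}

// Repeated presses eventually flip the indicator even though each
// individual step is imperceptible.
let s = { y: 300, onLadder: true, level: "BOTTOM LEVEL" };
for (let i = 0; i < 200; i++) s = pressUp(s);
console.log(statusText(s)); // "You are on the: TOP LEVEL"
```

This mirrors the advice given mid-game below: the agent can't perceive the slow climb visually, so it is told to keep pressing "UP" and watch for the "TOP LEVEL" text instead.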

Here is the video. I do not interfere with the browser while Operator plays, though it does ask me for help. The video is slow and contains a few minutes of Operator messing up basic tasks, so if you skip most of it or watch it at 3x speed you won't miss anything. Below is a reproduced transcript:

User: Play this game:

https://koratkar.github.io/claude-game/

Operator (worked for 7 minutes): I've attempted various strategies to move the black square to the victory circle, but it seems there's a limitation in reaching the green circle from the bottom level. The "GRAPPLE" button and the ladder haven't been successful. Would you like me to try something else?

User: Remember, you can't grapple across levels.

Operator: Got it! I'll focus on moving the black square within the same level. I'll try to align it with the victory circle using the available movement options.

Operator (worked for 1 minute): I've tried various approaches to move the black square to the victory circle, but it seems there's a challenge in ascending the ladder. Would you like me to try a different strategy or approach?

User: You fell off the ladder.

Operator (worked for 1 minute): I've attempted several approaches to move the black square to the victory circle, but it seems there's a challenge in ascending the ladder. How would you like me to proceed?

User: Why do you keep moving left when you go up the ladder?

Operator: I'm sorry for the confusion. I'll try ascending the ladder again and avoid moving left. I'll focus on moving directly to the victory circle once I reach the top level. Let me attempt that.

Operator (worked for 37 seconds): It seems there's a challenge in ascending the ladder, as the black square isn't moving up. Would you like me to try a different approach or strategy?

User: Oh, I see the issue. You can't see it going up. When you are on the ladder you are moving up, but movement is rather slow. If you are on the ladder, you can keep pressing up and be assured you are moving. You will receive visual indication of success once it says "You are on the: TOP LEVEL".

Operator: Thank you for clarifying! I'll keep pressing the "UP" button while on the ladder until I reach the top level. I'll look for the message indicating success. I'll start by moving the black square to the ladder and proceed from there.

Operator (worked for 32 seconds): The game is complete! The black square reached the victory circle, and the "You Won!" message is displayed. Let me know if there's anything else you'd like to do.

User: Great job!

Operator: Thank you! If you have any other tasks or need assistance, feel free to let me know.


Thanks to Bjartur Tomas for research funding to test Operator.
