
[arxiv] OpenDevin: An Open Platform for AI Software Developers as Generalist Agents

Source

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, Graham Neubig

Software is one of the most powerful tools that we humans have at our disposal; it allows a skilled programmer to interact with the world in complex and profound ways. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and affect change in their surrounding environments. In this paper, we introduce OpenDevin, a platform for the development of powerful and flexible AI agents that interact with the world in similar ways to those of a human developer: by writing code, interacting with a command line, and browsing the web. We describe how the platform allows for the implementation of new agents, safe interaction with sandboxed environments for code execution, coordination between multiple agents, and incorporation of evaluation benchmarks. Based on our currently incorporated benchmarks, we perform an evaluation of agents over 15 challenging tasks, including software engineering (e.g., SWE-Bench) and web browsing (e.g., WebArena), among others. Released under the permissive MIT license, OpenDevin is a community project spanning academia and industry with more than 1.3K contributions from over 160 contributors and will improve going forward.

URL: https://huggingface.co/papers/2407.16741

Conversation

▲ Please explain the paper to a reader with no relevant technical or research background; be clear and concise.

This paper introduces OpenDevin, a platform designed to help create AI agents that can interact with the world in a similar way to human software developers. Imagine an AI that can write code, browse the web, and use a command line, just like a programmer.

OpenDevin provides tools and a safe environment for these AI agents to learn and work. It allows them to:

  • Write and execute code: The AI can create and run programs, just like a human developer.
  • Browse the web: The AI can navigate websites, click buttons, and gather information.
  • Use a command line: The AI can interact with a computer's operating system using commands.
  • Work together: Multiple AI agents can collaborate on tasks, each specializing in a different area.

The paper also describes how OpenDevin is being used to evaluate these AI agents on various tasks, including:

  • Software engineering: Fixing bugs in code, writing programs, and solving real-world problems.
  • Web browsing: Completing tasks on websites, like finding information or making purchases.
  • General assistance: Answering questions, performing calculations, and interacting with the operating system.

OpenDevin is an open-source project, meaning anyone can contribute to its development. The authors hope that this platform will help advance research in AI agents and lead to the creation of more powerful and helpful AI systems in the future.

▲ Tell me more about web browsing tasks

The paper highlights how OpenDevin enables AI agents to perform web browsing tasks, just like a human would. Here's a breakdown:

What kind of tasks?

Think of the things you do online:

  • Finding information: Searching for specific topics, reading articles, comparing products.
  • Making purchases: Adding items to a cart, filling out forms, completing transactions.
  • Interacting with services: Booking flights, managing accounts, filling out surveys.

How does OpenDevin help?

OpenDevin provides the AI agent with:

  • A web browser: The agent can access and interact with websites just like a human user.
  • A set of actions: The agent can perform specific actions like clicking buttons, typing text, navigating between pages, and even uploading files.
  • Observations: The agent receives information about the website it's interacting with, such as the text on the page, the structure of the website, and even screenshots.

Example:

Imagine an AI agent tasked with finding the price of a specific product on an online store. It could:

  1. Navigate to the website: Using the goto action, the agent would go to the online store's homepage.
  2. Search for the product: The agent could use the fill action to type the product name into a search bar and then click the search button.
  3. Locate the product page: The agent would analyze the search results using the click action to navigate to the specific product page.
  4. Extract the price: The agent would read the page content it observes (e.g., the page text or accessibility tree returned by the browser) to identify and extract the price from the product page (see the sketch below).
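To make this concrete, here is a small sketch of the kind of action sequence an agent could emit, written with BrowserGym-style primitives; the URL, element IDs, and product name are invented for illustration:

    # Hypothetical browsing action sequence, one action per agent step.
    # (Element IDs like "a51" would come from the page's accessibility tree; they are made up here.)
    actions = [
        'goto("https://example-store.com")',   # open the store's homepage
        'fill("a51", "wireless mouse")',       # type the product name into the search box
        'click("b12")',                        # click the search button / a result link
    ]
    for action in actions:
        print("agent step:", action)
    # After the final step, the agent reads the returned page text / accessibility
    # tree observation and extracts the price from it.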

Challenges:

  • Complexity: Websites are diverse and constantly changing, making it difficult for AI agents to navigate and understand them.
  • Security: AI agents need to be carefully designed to avoid malicious actions or accessing sensitive information.
  • Human-like interaction: AI agents need to be able to understand and respond to human instructions and feedback.

OpenDevin's contribution:

OpenDevin provides a framework for developing and evaluating AI agents that can perform web browsing tasks. This platform helps researchers and developers overcome the challenges mentioned above, paving the way for more sophisticated and capable AI agents in the future.


arXiv:2407.16741v1 [cs.SE] 23 Jul 2024

OpenDevin: An Open Platform for AI Software Developers as Generalist Agents

Xingyao Wang1,10, Boxuan Li2, Yufan Song2, Bowen Li, Jaskirat Singh7, Mingchen Zhuge6, Jiayi Pan4, Yueqi Song2, Frank F. Xu2, Xiangru Tang3, Hoang H. Tran8, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian3, Yanjun Shao3, Niklas Muennighoff5, Yizhe Zhang, Binyuan Hui9, Junyang Lin9, Robert Brennan10, Hao Peng1, Heng Ji1, Graham Neubig2,10

8HCMUT, 9Alibaba, 10All Hands AI

xingyao6@illinois.edu, gneubig@cs.cmu.edu

Abstract

Software is one of the most powerful tools that we humans have at our disposal; it allows a skilled programmer to interact with the world in complex and profound ways. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and affect change in their surrounding environments. In this paper, we introduce OpenDevin, a platform for the development of powerful and flexible AI agents that interact with the world in similar ways to those of a human developer: by writing code, interacting with a command line, and browsing the web. We describe how the platform allows for the implementation of new agents, safe interaction with sandboxed environments for code execution, coordination between multiple agents, and incorporation of evaluation benchmarks. Based on our currently incorporated benchmarks, we perform an evaluation of agents over 15 challenging tasks, including software engineering (e.g., SWE-BENCH) and web browsing (e.g., WEBARENA), among others. Released under the permissive MIT license, OpenDevin is a community project spanning academia and industry with more than 1.3K contributions from over 160 contributors and will improve going forward.

Code https://github.com/OpenDevin/OpenDevin
Benchmark https://hf.co/spaces/OpenDevin/evaluation
Slack https://bit.ly/OpenDevin-Slack

Introduction

Powered by large language models (LLMs; [5, 20, 40, 60]), user-facing AI systems (such as ChatGPT) have become increasingly capable of performing complex tasks such as accurately responding to user queries, solving math problems, and generating code. In particular, AI agents, systems that can perceive and act upon the external environment, have recently received ever-increasing research focus. They are moving towards performing complex tasks such as developing software [21], navigating real-world websites [79], doing household chores [1], or even performing scientific research [4, 56]. As AI agents become capable of tackling complex problems, their development and evaluation have also become challenging. There are numerous recent efforts to create open-source frameworks that facilitate the development of agents [7, 16, 67]. These agent frameworks generally include: 1) interfaces through which agents interact with the world (such as JSON-based function calls or code execution), 2) environments in which agents operate, and 3) interaction mechanisms for human-agent or agent-agent communication. These frameworks streamline and ease the development process in various ways (Tab. 1, §C).

When designing AI agents, we can also consider how humans interact with the world. The most powerful way in which humans currently interact with the world is through software: software powers every aspect of our lives, supporting everything from the logistics of basic needs to the advancement of science, technology, and AI itself. Given the power of software, as well as the existing tooling around its efficient development, use, and deployment, it provides an ideal interface for AI agents to interact with the world in complex ways. However, building agents that can effectively develop software comes with its own unique challenges. How can we enable agents to effectively create and modify code in complex software systems? How can we provide them with tools to gather information on the fly to debug problems or collect task-requisite information? How can we ensure that development is safe and avoids negative side effects on users' systems?

In this paper, we introduce OpenDevin, a community-driven platform designed for the development of generalist and specialist AI agents that interact with the world through software. It features:

  • An interaction mechanism which allows user interfaces, agents, and environments to interact through an event stream architecture that is powerful and flexible (§2.1).
  • An environment that consists of a sandboxed operating system and a web browser that the agents can utilize for their tasks (§2.2).
  • An interface allowing the agent to interact with the environment in a manner similar to actual software engineers (§2.3). We provide the capability for agents to (a) create complex software, (b) execute the code, and (c) browse websites to collect information.
  • Multi-agent delegation, allowing multiple specialized agents to work together (§2.4).
  • Evaluation framework, facilitating the evaluation of agents across a wide range of tasks (§4).

Importantly, OpenDevin is not just a conceptual framework, but it also includes a comprehensive and immediately usable implementation of agents, environments, and evaluations. As of this writing, OpenDevin includes an agent hub with over 10 implemented agents (§3), including a strong generalist agent implemented based on the CodeAct architecture [63], with additions for web browsing [52] and code editing [72]. Interaction with users is implemented through a chat-based user interface that visualizes the agent’s current actions and allows for real-time feedback (Fig. 1, §D). Furthermore, the evaluation framework currently supports 15 benchmarks, which we use to evaluate our agents (§4).

Released under a permissive MIT license allowing commercial use, OpenDevin is poised to support a diverse array of research and real-world applications across academia and industry. OpenDevin has gained significant traction, with 28K GitHub stars and more than 1.3K contributions from over 160 contributors. We envision OpenDevin as a catalyst for future research innovations and diverse applications driven by a broad community of practitioners.

OpenDevin Architecture

We describe, using OpenDevin, (1) how to define and implement an agent (§2.1), (2) how each action execution leads to an observation (§2.2), (3) how to reliably manage and extend commonly used skills for agents (§2.3), and (4) how to compose multiple agents together for task solving (§2.4). Fig. 3 provides an overview.

Agent Definition and Implementation

An agent can perceive the state of the environment (e.g., prior actions and observations) and produce an action for execution while solving a user-specified task.

The State and Event Stream. In OpenDevin, the state is a data structure that encapsulates all relevant information for the agent's execution. A key component of this state is the event stream, a chronological collection of past actions and observations, including the agent's own actions and user interactions (e.g., instructions, feedback). The state extends beyond just the event stream: it also incorporates auxiliary information for the agent's operation, such as the accumulated cost of LLM calls, metadata to track multi-agent delegation (§2.4), and other execution-related parameters.

1 While initially inspired by the AI software engineer Devin [9], OpenDevin has quickly evolved to support a much wider range of applications beyond software engineering through diverse community contributions.

Figure 1: The OpenDevin user interface (UI, §D) allows users to view files, check executed bash commands/Python code, observe the agent's browser activity, and directly interact with the agent.

Actions

Inspired by CodeAct [63], OpenDevin connects an agent with the environment through a core set of general actions. The actions IPythonRunCellAction and CmdRunAction enable the agent to execute arbitrary Python code and bash commands inside the sandbox environment (e.g., a securely isolated Linux operating system). BrowseInteractiveAction enables interaction with a web browser through a domain-specific language for browsing introduced by BrowserGym [12]. This action space, based on programming languages (PL), is powerful and flexible enough to perform any task with tools in different forms (e.g., Python functions, REST APIs, etc.) [63] while being reliable and easy to maintain.

Figure 2: Minimal example of implementing an agent in OpenDevin.

class MinimalAgent:
    def reset(self) -> None:
        self.system_message = "You are a helpful assistant ..."

    def step(self, state: State):
        # Build the LLM prompt from the system message plus the event stream
        # (the history of prior actions and observations).
        messages: list[dict[str, str]] = [
            {'role': 'system', 'content': self.system_message}
        ]
        for prev_action, obs in state.history:
            messages.append(get_action_message(prev_action))
            messages.append(get_observation_message(obs))
        # use llm to generate response (e.g., thought, action)
        response = self.llm.do_completion(messages)
        # parse the response into an action to be executed in the runtime
        action = self.parse_response(response)
        if self.is_finish_command(action):
            return AgentFinishAction()
        elif self.is_bash_command(action):
            return CmdRunAction(command=action.command)
        elif self.is_python_code(action):
            return IPythonRunCellAction(code=action.code)
        elif self.is_browser_action(action):
            return BrowseInteractiveAction(code=action.code)
        else:
            return MessageAction(content=action.message)

Observations

Observations describe environmental changes that the agent observes. They may or may not be caused by the agent's actions: an observation can be either 1) a natural language message from the user instructing the agent to perform certain tasks, or 2) the execution outcome of the agent's previous action (e.g., a code execution result, an accessibility tree [35], a screenshot of the web page, etc.).

Figure 3: OpenDevin consists of three main components: 1) an agent abstraction, through which the community can contribute different agent implementations (§2.1) to the AgentHub (§3); 2) an event stream that tracks the history of actions and observations (Event Stream: List[Action_1, Observation_1, Action_2, ...]); and 3) an agent runtime that executes agent actions into observations (Agent: State (History) -> Action; Runtime: Action -> Observation) inside a computer sandbox accessed via the SSH protocol. The figure also shows the user-interface options (command line, web UI, IDE plugins, multi-turn interaction) and example <execute_bash>, <execute_ipython>, and <execute_browse> interactions.

Implement a New Agent. The agent abstraction is designed to be simple yet powerful, allowing users to create and customize agents for various tasks easily. The core of the agent abstraction lies in the step function, which takes the current state as input and generates an appropriate action based on the agent's logic. Simplified example code for the agent abstraction is illustrated in Fig. 2. By providing this abstraction, OpenDevin allows users to focus on defining the desired agent behavior and logic without worrying about the low-level details of how actions are executed (§2.2).
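To make the event-stream idea concrete, here is a minimal sketch of how a history of actions and observations could be represented; the class names mirror those used in the paper, but the exact fields are illustrative assumptions rather than OpenDevin's actual data model:

    from dataclasses import dataclass, field

    @dataclass
    class Event:
        source: str  # "user" or "agent"

    @dataclass
    class IPythonRunCellAction(Event):
        code: str = ""

    @dataclass
    class IPythonRunCellObservation(Event):
        output: str = ""

    @dataclass
    class EventStream:
        events: list[Event] = field(default_factory=list)  # chronological history

        def add(self, event: Event) -> None:
            self.events.append(event)

    # One agent step: an action followed by the observation it produced.
    stream = EventStream()
    stream.add(IPythonRunCellAction(source="agent", code="print(1 + 1)"))
    stream.add(IPythonRunCellObservation(source="agent", output="2"))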

Agent Runtime: How Execution of Actions Results in Observations

Agent Runtime provides a general environment that equips the agent with an action space comparable to that of human software developers, enabling OpenDevin agents to tackle a wide range of software development and web-based tasks, including complex software development workflows, data analysis projects, web browsing tasks, and more. It allows the agent to access a bash terminal to run code and command line tools, utilize a Jupyter notebook for writing and executing code on-the-fly, and interact with a web browser for web-based tasks (e.g., information seeking).

Linux SSH Sandbox. For each task session, OpenDevin spins up a securely isolated docker container sandbox, where all the bash commands from the agent are executed. OpenDevin connects to the sandbox through SSH protocol, executes arbitrary commands from the agent, and returns the execution results as observations to the agent. A configurable workspace directory containing files the user wants the agent to work on is mounted into that secure sandbox for OpenDevin agents to access.
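As a rough illustration of this pattern (not OpenDevin's actual implementation, which connects to the sandbox over SSH), the sketch below starts an isolated container with a mounted workspace and runs one agent-issued command inside it via docker exec; the image name and workspace path are placeholders:

    import subprocess

    # Start an isolated sandbox container with the user's workspace mounted.
    subprocess.run([
        "docker", "run", "-d", "--name", "od-sandbox",
        "-v", "/path/to/workspace:/workspace",
        "ubuntu:22.04", "sleep", "infinity",
    ], check=True)

    # Execute a bash command from the agent inside the sandbox and capture the
    # output, which would be returned to the agent as an observation.
    result = subprocess.run(
        ["docker", "exec", "od-sandbox", "bash", "-lc", "ls /workspace"],
        capture_output=True, text=True,
    )
    print(result.stdout)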

Jupyter IPython. The Linux sandbox also supports running an interactive Jupyter server, which can be used by the agent for interactive python code execution [19] and debugging.

Web Browser. OpenDevin implements a Chromium browser based on Playwright [47]. It interfaces with agents using a set of browser action primitives defined by BrowserGym [12, 52], such as navigation, clicking, typing, and scrolling. The full set of actions is detailed in §I. After executing these actions, the browser runtime provides a rich set of observations about the current state of the browser, including the HTML, the DOM, an accessibility tree [35], a screenshot, the opened tabs, etc. These observations can also be augmented with configurable attributes that allow agents to better understand web page observations, such as a set-of-marks on the screenshot [15, 71], visible-element marking, the focused element, interactable-element marking, in-viewport element filtering [79], etc.
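A minimal sketch of this kind of browser runtime using Playwright's Python API (a simplified stand-in for the BrowserGym-based implementation; the URL is a placeholder):

    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()

        # Execute a navigation action, then collect several observation types.
        page.goto("https://example.com")
        html = page.content()                    # raw HTML observation
        ax_tree = page.accessibility.snapshot()  # accessibility-tree observation
        page.screenshot(path="page.png")         # screenshot observation

        print(page.title(), len(html), ax_tree["role"] if ax_tree else None)
        browser.close()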

Table 1: Comparison of different AI agent frameworks (§C). SWE refers to ‘software engineering’.

| Framework | Domain | GUI | Standardized Tool Library | Built-in Sandbox & Code Execution | Built-in Web Browser | Multi-agent Collaboration | Human-AI Collaboration | AgentHub | Evaluation Framework | Agent QC |
|---|---|---|---|---|---|---|---|---|---|---|
| AutoGPT [14] | General | ✓ | ✗ | ✗* | ✗* | ✗ | ✗ | ✓ | ✗ | ✓ |
| LangChain [6] | General | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ |
| MetaGPT [16] | General | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ |
| AutoGen [67] | General | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| AutoAgents [7] | General | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ |
| Agents [80] | General | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ |
| Xagents [61] | General | ✓ | ✓ | ✗ | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ |
| OpenAgents [69] | General | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ |
| GPTSwarm [83] | General | ✗ | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ |
| AutoCodeRover [78] | SWE | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| SWE-Agent [72] | SWE | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| OpenDevin | General | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

2.3 Agent Skills: The Extensible Agent-Computer Interface

SWE-Agent [72] highlights the importance of a carefully crafted Agent-Computer Interface (ACI, i.e., specialized tools for particular tasks) in successfully solving complex tasks. However, creating, maintaining, and distributing a wide array of tools can be a daunting engineering challenge, especially when we want to make these tools available to different agent implementations (§3). To tackle these challenges, we build AgentSkills, a toolbox designed to enhance the capabilities of agents, offering utilities not readily available through basic bash commands or Python code.

Easy to create and extend tools. AgentSkills is designed as a Python package consisting of different utility functions (i.e., tools) that are automatically imported into the Jupyter IPython environment (§2.2). The ease of defining a Python function as a tool lowers the barrier for community members to contribute new tools to the skill library. The generality of Python packages also allows different agent implementations to easily leverage these tools through one of our core action IPythonRunCellAction (§2.1).

Rigorously tested and maintained. We follow best practices in software engineering and write extensive unit tests for tools in AgentSkills to ensure their reliability and usability.

Inclusion criteria and philosophy. In the AgentSkills library, we do not aim to wrap every possible Python package and re-teach agents their usage (e.g., LLMs already know the pandas library and can use it to read a CSV file, so we do not need to create a tool that teaches the agent to read the same file format). We only add a new skill when: (1) it is not readily achievable for the LLM to write the code directly (e.g., editing code by replacing certain lines), and/or (2) it involves calling an external model (e.g., calling a speech-to-text model, or a model for code editing [51]).

Currently supported skills. The AgentSkills library includes file-editing utilities adapted from SWE-Agent [72], like edit_file, which allows modifying an existing file starting from a specified line, and the scrolling functions scroll_up and scroll_down for viewing different parts of a file. It also contains tools that support reading multi-modal documents, like parse_image and parse_pdf for extracting information from images using vision-language models (e.g., GPT-4V) and reading text from PDFs, respectively. A complete list of supported skills can be found in §H.
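To illustrate how low the bar for a new skill is, here is a hypothetical skill written in the same style; it is not part of the actual AgentSkills library:

    def count_lines(file_path: str) -> int:
        """Hypothetical skill: report how many lines a file contains.

        Because AgentSkills functions are auto-imported into the Jupyter
        environment, an agent could call count_lines('app.py') from an
        IPythonRunCellAction once a skill like this were registered.
        """
        with open(file_path, 'r', encoding='utf-8') as f:
            num_lines = sum(1 for _ in f)
        print(f'{file_path} has {num_lines} lines.')
        return num_lines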

2.4 Agent Delegation: Cooperative Multi-agent Interaction

OpenDevin allows interactions between multiple agents as well. To this end, we use a special action type, AgentDelegateAction, which enables an agent to delegate a specific subtask to another agent. For example, the generalist CodeActAgent, which has limited support for web browsing, can use AgentDelegateAction to delegate web browsing tasks to the specialized BrowsingAgent to perform more complex browsing activities (e.g., navigating the web, clicking buttons, submitting forms, etc.).
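A minimal sketch of how a generalist agent might hand off a browsing subtask; AgentDelegateAction is the action type named above, but its exact fields are not specified here, so the ones below are assumptions:

    from dataclasses import dataclass, field

    @dataclass
    class AgentDelegateAction:
        agent: str                                   # name of the agent to delegate to
        inputs: dict = field(default_factory=dict)   # subtask payload for the delegate

    def choose_action(task: str):
        # Hand web-centric subtasks to the specialized BrowsingAgent rather than
        # attempting them with the generalist's limited browsing support.
        if "browse" in task.lower() or "website" in task.lower():
            return AgentDelegateAction(agent="BrowsingAgent", inputs={"task": task})
        return None  # otherwise fall back to ordinary code/bash actions as in Fig. 2

    print(choose_action("Browse the project website and find the install command"))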

AgentHub: A Hub of Community-Contributed Agents

Based on our agent abstraction (§2.1), OpenDevin supports a wide range of community-contributed agent implementations that end users can choose from and that act as baselines for different agent tasks.

CodeAct Agent. CodeActAgent is the default generalist agent based on the CodeAct framework [63]. At each step, the agent can (1) converse, communicating with humans in natural language to ask for clarification, confirmation, etc., or (2) perform the task by executing code (a.k.a. CodeAct), including bash commands, Python code, or the browser-specific programming language (§2.2). This general action space allows the agent (v1.5 and above) to perform various tasks, including editing files, browsing the web, running programs, etc.

Browsing Agent. We implemented a generalist web agent called BrowsingAgent to serve as a simple yet effective baseline for web agent tasks. The agent is similar to that of WebArena [79], but with improved observations and actions, using only zero-shot prompting. At each step, the agent prompts the LLM with the task description, a description of the browsing action space, the current browser observation as an accessibility tree, previous actions, and an action-prediction example with chain-of-thought reasoning. The expected response from the LLM contains chain-of-thought reasoning plus the predicted next actions, including the option to finish the task and convey the result to the user. Full prompts are in §J. The agent can be extended to create more capable web agents, or called by other agents through delegation (§2.4) to enable browsing capability.
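Schematically, the per-step prompt the browsing agent assembles could look like the following; the section labels are placeholders, and the real prompt templates are in §J:

    def build_browsing_prompt(task, action_space_desc, ax_tree, previous_actions, example):
        # Assemble the zero-shot prompt from the pieces described above.
        return "\n\n".join([
            f"# Task\n{task}",
            f"# Available browser actions\n{action_space_desc}",
            f"# Current page (accessibility tree)\n{ax_tree}",
            f"# Previous actions\n{previous_actions}",
            f"# Example with chain-of-thought reasoning\n{example}",
            "Think step by step, then output the next action(s).",
        ])

    print(build_browsing_prompt("Find the price of item X",
                                "click(id), fill(id, text), goto(url)",
                                "[root] ...", "goto('https://example.com')", "..."))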

GPTSwarm Agent. GPTSwarm [83] pioneers the use of optimizable graphs to construct agent systems, unifying language agent frameworks through modularity. Each node represents a distinct operation, while edges define collaboration and communication pathways. This design allows automatic optimization of nodes and edges, driving advancements in creating multi-agent systems.

Micro Agent(s). In addition, OpenDevin enables the creation of micro agents, agents specialized toward a particular task. A micro agent reuses most of the implementation of an existing generalist agent (e.g., CodeActAgent). It is designed to lower the barrier to agent development, so community members can share specialized prompts that work well for their particular use cases. Without programming, a user can create a micro agent by providing the agent's name, description, the schema for its inputs and outputs, and optionally a specialized prompt (e.g., example demonstrations showing how to perform a particular task) that gears the generalist agent toward a specific task; examples include the CommitWriterAgent for generating git commit messages and the TypoFixerAgent for correcting typos across an entire code repository.
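For example, a micro-agent definition could look roughly like the following; the exact schema is not given in the paper, so these field names are illustrative:

    # Hypothetical micro-agent definition that reuses the generalist CodeActAgent.
    commit_writer_agent = {
        "name": "CommitWriterAgent",
        "description": "Writes a concise git commit message for staged changes.",
        "base_agent": "CodeActAgent",             # generalist agent being reused
        "inputs": {"diff": "string"},             # schema for inputs
        "outputs": {"commit_message": "string"},  # schema for outputs
        "prompt": (
            "You are an expert at writing git commit messages. Given the staged "
            "diff, produce a one-line summary followed by a short body explaining "
            "the motivation for the change."
        ),
    }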

Evaluation

To systematically track progress in building generalist digital agents, as listed in Tab. 2, we integrate 15 established benchmarks into OpenDevin.

These benchmarks cover software engineering, web browsing, and miscellaneous assistance. In this section, we compare OpenDevin to open-source reproducible baselines that do not perform manual prompt engineering tailored specifically to the benchmark content. For brevity, we use 'OD' as shorthand for OpenDevin in the rest of this section.

Table 2: Benchmarks integrated into OpenDevin.

| Category | Benchmark | Required Capability |
|---|---|---|
| Software | SWE-Bench [21] | Fixing GitHub issues |
| Software | HumanEvalFix [37] | Fixing bugs |
| Software | BIRD [27] | Text-to-SQL |
| Software | BioCoder [58] | Bioinformatics coding |
| Software | ML-Bench [57] | Machine learning coding |
| Software | Gorilla APIBench [46] | Software API calling |
| Software | ToolQA [81] | Tool use |
| Web | WebArena [79] | Goal planning & realistic browsing |
| Web | MiniWoB++ [30] | Short trajectories on synthetic web pages |
| Misc. Assistance | GAIA [34] | Tool use, browsing, multi-modality |
| Misc. Assistance | GPQA [50] | Graduate-level Google-proof Q&A |
| Misc. Assistance | AgentBench [31] | Operating system interaction (bash) |
| Misc. Assistance | MINT [64] | Multi-turn math and code problems |
| Misc. Assistance | Entity Deduction Arena [77] | State tracking & strategic planning |
| Misc. Assistance | ProofWriter [55] | Deductive logic reasoning |

Result Overview

In OpenDevin, our goal is to develop general digital agents capable of interacting with the world through software interfaces (as exemplified by the code actions described in §2.1). We recognize that a software agent should excel not only in code editing but also in web browsing and various auxiliary tasks, such as answering questions about code repositories or conducting online research.

Table 3: Selected evaluation results for OpenDevin agents (§4)

See Tab. 4 (software), Tab. 5 (web), Tab. 6 (miscellaneous assistance) for full results across benchmarks.

| Agent | Model | Software (§4.2) | Web (§4.3) | Misc. (§4.4) |
|---|---|---|---|---|
| SWE-Agent [72] | gpt-4-1106-preview | 18.0 | - | - |
| AutoCodeRover [78] | gpt-4-0125-preview | 19.0 | - | - |
| Aider [13] | gpt-4o & claude-3-opus | 26.3 | - | - |
| Moatless Tools [85] | claude-3.5-sonnet | 26.7 | - | - |
| Agentless [68] | gpt-4o | 27.3 | - | - |
| Lemur [70] | Lemur-chat-70b | - | 5.3 | - |
| Patel et al. [45] | Trained 72B w/ synthetic data | - | 9.4 | - |
| AutoWebGLM [24] | Trained 7B w/ human/agent annotation | - | 18.2 | - |
| Auto Eval & Refine [42] | GPT-4 + Reflexion w/ GPT-4V | - | 20.2 | - |
| WebArena Agent [79] | gpt-4-turbo | - | 14.4 | - |
| AutoGPT [14] | gpt-4-turbo | - | - | 13.2 |
| Few-shot Prompting + Chain-of-Thought [50] | Llama-2-70b-chat | - | - | 28.1 |
| Chain-of-Thought [50] | gpt-3.5-turbo-16k | - | - | 29.6 |
| Chain-of-Thought [50] | gpt-4 | - | - | 38.8 |
| OD CodeActAgent v1.8 | gpt-4o-mini-2024-07-18 | 6.3 | 8.3 | 53.1 |
| OD CodeActAgent v1.8 | gpt-4o-2024-05-13 | 22.0 | 14.5 | - |
| OD CodeActAgent v1.8 | claude-3-5-sonnet | 26.0 | 15.3 | 52.0 |
| OD GPTSwarm v1.0 | gpt-4o-2024-05-13 | - | - | 32.1 |

* Numbers are reported from CodeActAgent v1.5.

Tab. 3 showcases a curated set of evaluation results. While OpenDevin agents may not achieve top performance in every category, they are designed with generality in mind. Notably, the same CodeAct agent, without any modifications to its system prompt, demonstrates competitive performance across three major task categories: software development, web interaction, and miscellaneous tasks. This is particularly significant when compared to the baseline agents, which are typically designed and optimized for specific task categories.

Software Engineering

Next, we report results specifically for software engineering benchmarks in Tab. 4.

SWE-Bench

SWE-bench [21] is designed to assess agents' abilities to solve real-world GitHub issues, such as bug reports or feature requests. The agent interacts with the repository and attempts to fix the provided issue through file editing and code execution. The agent-modified code repository is then tested against a test suite that incorporates new tests added by human developers' fixes for the same issue. Each test instance is accompanied by a piece of "hint text" consisting of natural language suggestions for how to solve the problem. Throughout this paper, we report all results without using hint text. A canonical subset, SWE-bench Lite, was created to facilitate accessible and efficient testing. For cost-saving reasons we default to using this subset for testing (running the complete set of 2,294 instances would cost roughly $6.9k, using a conservative estimate of $3 per instance).

Result: As shown in Tab. 4, our most recent version, the CodeActAgent v1.8 generalist, using claude-3.5-sonnet, achieves a resolve rate of 26%, competitive with other open-source agents specialized for software development. We also evaluated CodeActAgent v1.8 using gpt-4o-mini. While it only solves 6.3% of the problems, it does so at less than 1% of the cost of the other models.

HumanEvalFix

HumanEvalFix [37] tasks agents with fixing a bug in a provided function with the help of provided test cases. The bugs are constructed so that one or more test cases fail. We focus on the Python subset of the benchmark and allow models to solve the bugs by self-debugging over multiple turns, incorporating feedback from test execution. We follow the setup from Muennighoff et al. [37] using pass@k [8].

Table 4: OpenDevin Software Engineering evaluation results (§4.2)

| Agent | Model | Success Rate (%) | Avg. Cost ($) |
|---|---|---|---|

SWE-Bench Lite [21], 300 instances
| SWE-Agent [72] | gpt-4-1106-preview | 18.0 | 1.67 |
| AutoCodeRover [78] | gpt-4-0125-preview | 19.0 | - |
| Aider [13] | gpt-4o & claude-3-opus | 26.3 | - |
| OD CodeActAgent v1.5, Generalist | gpt-4o-2024-05-13 | 16.7 | 1.50 |
| OD CodeActAgent v1.5, Generalist | gpt-4o-mini-2024-07-18 | 7.0 | 0.01 |
| OD CodeActAgent v1.8, Generalist | gpt-4o-2024-05-13 | 22.0 | 1.72 |
| OD CodeActAgent v1.8, Generalist | claude-3-5-sonnet@20240620 | 26.0 | 1.10 |

HumanEvalFix [37], 164 instances
| Prompting, 0-shot [11, 32, 36, 37, 66, 84] | BLOOMZ-176B | 16.6 | - |
| Prompting, 0-shot | OctoCoder-15B | 30.4 | - |
| Prompting, 0-shot | DeepSeekCoder-33B-Instruct | 47.5 | - |
| Prompting, 0-shot | StarCoder2-15B | 48.6 | - |
| SWE-Agent, 1-shot [72] | gpt-4-turbo | 87.7 | - |
| OD CodeActAgent v1.5, Generalist, 0-shot | gpt-3.5-turbo-16k-0613 | 20.1 | 0.11 |
| OD CodeActAgent v1.5, Generalist, 0-shot | gpt-4o-2024-05-13 | 79.3 | 0.14 |

BIRD [27], 300 instances
| Prompting, 0-shot | CodeLlama-7B-Instruct | 18.3 | - |
| Prompting, 0-shot | CodeQwen-7B-Chat | 31.3 | - |
| OD CodeActAgent v1.5, Generalist | gpt-4-1106-preview | 42.7 | 0.19 |
| OD CodeActAgent v1.5, Generalist | gpt-4o-2024-05-13 | 47.3 | 0.11 |

ML-Bench [57], 68 instances
| Prompting + BM25, 0-shot | gpt-3.5-turbo | 11.0 | - |
| Prompting + BM25, 0-shot | gpt-4-1106-preview | 22.1 | - |
| Prompting + BM25, 0-shot | gpt-4o-2024-05-13 | 26.2 | - |
| SWE-Agent [72] | gpt-4-1106-preview | 42.6 | 1.91 |
| Aider [13] | gpt-4o | 64.4 | - |
| OD CodeActAgent v1.5, Generalist | gpt-4o-2024-05-13 | 76.5 | 0.25 |
| OD CodeActAgent v1.5, Generalist | gpt-4-1106-preview | 58.8 | 1.22 |
| OD CodeActAgent v1.5, Generalist | gpt-3.5-turbo-16k-0613 | 13.2 | 0.12 |

BioCoder (Python) [58], 157 instances
| Prompting, 0-shot | gpt-3.5-turbo | 11.0 | - |
| Prompting, 0-shot | gpt-4-1106-preview | 12.7 | - |
| OD CodeActAgent v1.5, Generalist | gpt-4o-2024-05-13 | 27.5 | 0.13 |

BioCoder (Java) [58], 50 instances
| Prompting, 0-shot | gpt-3.5-turbo | 4.1 | - |
| Prompting, 0-shot | gpt-4-1106-preview | 6.4 | - |
| OD CodeActAgent v1.5, Generalist | gpt-4o-2024-05-13 | 44.0 | 0.11 |

Gorilla APIBench [46], 1775 instances
| Prompting, 0-shot [2, 39, 41, 62] | claude-v1 | 8.7 | - |
| Prompting, 0-shot | gpt-4-0314 | 21.2 | - |
| Prompting, 0-shot | gpt-3.5-turbo-0301 | 29.7 | - |
| Gorilla, finetuned for API calls, 0-shot [46, 62] | llama-7b | 75.0 | - |
| OD CodeActAgent v1.5, Generalist | gpt-3.5-turbo-0125 | 21.6 | 0.002 |
| OD CodeActAgent v1.5, Generalist | gpt-4o-2024-05-13 | 36.4 | 0.04 |

ToolQA [81], 800 instances
| Prompting, 0-shot [23, 33, 39, 65] | ChatGPT + CoT | 5.1 | - |
| Prompting, 0-shot [23, 33, 39, 65] | ChatGPT | 5.6 | - |
| Prompting, 0-shot [23, 33, 39, 65] | Chameleon | 10.6 | - |
| ReAct, 0-shot [39, 73] | gpt-3.5-turbo | 36.8 | - |
| ReAct, 0-shot [39, 73] | gpt-3 | 43.1 | - |
| OD CodeActAgent v1.5, Generalist | gpt-3.5-turbo-0125 | 2.3 | 0.03 |
| OD CodeActAgent v1.5, Generalist | gpt-4o-2024-05-13 | 47.2 | 0.91 |

Results: In Tab. 4, OpenDevin CodeActAgent successfully fixes 79.3% of the bugs in the Python split. This is significantly better than all non-agentic approaches, almost doubling the performance of StarCoder2-15B [28, 32]. While SWE-Agent achieves 87.7%, Yang et al. [72] provide the model with a full demonstration of a successful trajectory fixing one of the bugs in the test dataset.
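For reference, the unbiased pass@k estimator from Chen et al. [8] can be computed as follows (a standard formulation, not code taken from the OpenDevin repository):

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k: n = samples generated, c = samples passing all tests."""
        if n - c < k:
            return 1.0
        # 1 minus the probability that all k drawn samples are failures.
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    print(pass_at_k(n=10, c=4, k=1))  # 0.4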

4.2.3 ML-Bench

ML-Bench [57] evaluates agents’ ability to solve machine learning tasks across 18 GitHub repositories. The benchmark comprises 9,641 tasks spanning 169 diverse ML problems, requiring agents to generate bash scripts or Python code in response to user instructions. In the sandbox environment, agents can iteratively execute commands and receive feedback, allowing them to understand the repository context and fulfill user requirements progressively. Following the setup from the original paper, we perform agent evaluation on the quarter subset of ML-Bench.

Results: As shown in Table 4, OpenDevin agents with GPT-4o achieve the highest success rate of 76.47% on ML-Bench, outperforming SWE-Agent (42.64%). Performance drops with less capable models. These results demonstrate the effectiveness of OpenDevin agents on complex ML tasks. We also notice that agents show potential for reducing hallucination and syntax errors compared to non-agent approaches in the ML-LLM-Bench setting [57].

4.2.4 Gorilla APIBench

Gorilla APIBench [46] evaluates agents’ abilities to use APIs. It incorporates tasks on TorchHub, TensorHub, and HuggingFace. During the evaluation, models are given a question related to API usage, such as "identify an API capable of converting spoken language in a recording to text." Correctness is evaluated based on whether the model’s API call is in the correct domain.

Results: As shown in Table 4, OpenDevin using GPT-4o, with a success rate of 36.4%, outperforms baselines not specifically finetuned for API calling. While Gorilla shows higher performance on APIBench, Patil et al. [46] finetune this model for API calling in particular.

4.2.5 ToolQA

ToolQA [81] evaluates agents’ abilities to use external tools. This benchmark includes tasks on various topics like flight status, coffee price, Yelp data, and Airbnb data, requiring the use of various tools such as text tools, database tools, math tools, graph tools, code tools, and system tools. It features two levels: easy and hard. Easy questions focus more on single tool usage, while hard questions emphasize reasoning. For evaluation, the easy subset is used to assess tool use capabilities.

Results: Compared to all baselines, OpenDevin with GPT-4o shows the highest performance. We notice that agents perform better on tasks related to CSV and database tool usage but require improvement on math and calculator tool usage.

4.2.6 BioCoder

BioCoder [58] is a repository-level code generation benchmark that evaluates agents’ performance on bioinformatics-related tasks, specifically the ability to retrieve and accurately utilize context. The original prompts contain the relevant context of the code; however, in this study, we have removed them to demonstrate the capability of OpenDevin to perform context retrieval, self-debugging, and reasoning in multi-turn interactions. BioCoder consists of 157 Python and 50 Java functions, each targeting a specific area in bioinformatics, such as proteomics, genomics, and other specialized domains. The benchmark targets real-world code by generating code in existing repositories where the relevant code has been masked out.

Results: Table 4 shows that OpenDevin, using GPT-4o, achieves a success rate of 44.0%. This outperforms all prompting-based non-agent baselines, with GPT-4 alone only achieving 6.4%. BioCoder proves to be a particularly challenging benchmark for non-agent methods, as they did not incorporate any repository-level retrieval methods, making these models lack access to crucial repo-level information such as global variables and function declarations.

Table 5: OpenDevin Web Browsing Evaluation Results (§4.3)

| Agent | Model | Success Rate (%) | Avg. Cost ($) |
|---|---|---|---|

WebArena [79], 812 instances
| Lemur [70] | Lemur-chat-70b | 5.3 | - |
| Patel et al. [45] | Trained 72B with self-improvement synthetic data | 9.4 | - |
| AutoWebGLM [24] | Trained 7B with human/agent hybrid annotation | 18.2 | - |
| Auto Eval & Refine [42] | GPT-4 + Reflexion w/ GPT-4V reward model | 20.2 | - |
| WebArena Agent [79] | Llama3-chat-8b | 3.3 | - |
| WebArena Agent [79] | Llama3-chat-70b | 7.0 | - |
| WebArena Agent [79] | gpt-3.5-turbo | 6.2 | - |
| WebArena Agent [79] | gpt-4-turbo | 14.4 | - |
| OD BrowsingAgent v1.0 | gpt-3.5-turbo-0125 | 5.2 | 0.02 |
| OD BrowsingAgent v1.0 | gpt-4o-mini-2024-07-18 | 8.5 | 0.01 |
| OD BrowsingAgent v1.0 | gpt-4o-2024-05-13 | 14.8 | 0.15 |
| OD BrowsingAgent v1.0 | claude-3-5-sonnet-20240620 | 15.5 | 0.10 |
| OD CodeActAgent v1.8, Generalist, via delegation to BrowsingAgent v1.0 | gpt-4o-mini-2024-07-18 | 8.3 | - |
| OD CodeActAgent v1.8, Generalist, via delegation to BrowsingAgent v1.0 | gpt-4o-2024-05-13 | 14.5 | - |
| OD CodeActAgent v1.8, Generalist, via delegation to BrowsingAgent v1.0 | claude-3-5-sonnet-20240620 | 15.3 | - |

BIRD

BIRD [27] is a benchmark for text-to-SQL tasks aimed at translating natural language into executable SQL, designed for realistic and large-scale database environments. We select 300 samples from the dev set to integrate into OpenDevin and evaluate execution accuracy. The setting is extended to allow multi-turn interaction, so the agent can arrive at the final SQL query by observing SQL execution results and correcting earlier attempts.

Results: OpenDevin with GPT-4o achieves an execution accuracy of 47.3% on a subset of BIRD. This demonstrates the potential of OpenDevin as a SQL agent, outperforming approaches using prompting with code LLMs like CodeLlama-7B-Instruct (18.3%) and CodeQwen-7B-Chat (31.3%).
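A rough sketch of the execution-feedback loop described here, using sqlite3 and a placeholder in place of the LLM call (the table, data, and refine_query function are hypothetical):

    import sqlite3

    def refine_query(question: str, previous_sql: str | None, error: str | None) -> str:
        # Placeholder for an LLM call that proposes (or repairs) a SQL query.
        return "SELECT name FROM users WHERE age > 30"

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", [("Ada", 36), ("Bob", 25)])

    sql, error = None, None
    for turn in range(5):                       # bounded multi-turn interaction
        sql = refine_query("Which users are older than 30?", sql, error)
        try:
            rows = conn.execute(sql).fetchall() # execution result fed back as an observation
            print(rows)                         # [('Ada',)]
            break
        except sqlite3.Error as e:
            error = str(e)                      # the agent observes the failure and retries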

Web Browsing

WebArena

WebArena [79] is a self-hostable, execution-based web agent benchmark that allows agents to freely choose paths to complete tasks. It includes 812 human-curated task instructions across various domains such as shopping, forums, developer platforms, and content management systems. Each task is paired with a handwritten test case to verify agent success by checking web page element status or textual answers.

Results: The BrowsingAgent achieves competitive performance among agents that use LLMs with domain-general prompting techniques. Some agents, like AutoWebGLM, require manual effort tailored to the WebArena task domain. This highlights the performance trade-off between a generalist and a domain-tailored specialist web agent; OpenDevin opts for the more general browsing agent.

MiniWoB

MiniWoB++ [30] is an interactive web benchmark with built-in reward functions. Tasks are initialized on 125 minimalist web interfaces and are easier than those of WebArena, requiring fewer steps and providing low-level task directions. Some environments in MiniWoB++ require vision capability for successful completion.

Results

From Tab. 5 we see that our generalist BrowsingAgent finishes nearly half of the tasks without any adaptation to the environment. However, due to the synthetic nature of MiniWoB++, state-of-the-art agents explicitly trained on these environments with reinforcement learning and/or human behavior cloning have almost saturated performance. This shows an even bigger trade-off between generalist and specialist agents than on the more general WebArena benchmark.

4.4 Miscellaneous Assistance

Results for miscellaneous assistance benchmarks are reported in Tab. 6. In particular, we report results for:

4.4.1 GAIA

GAIA [34] evaluates agents’ general task-solving skills, covering different real-world scenarios. It requires various agent capabilities, including reasoning, multi-modal understanding, web browsing, and coding. GAIA consists of 466 curated tasks across three levels. Setting up GAIA is traditionally challenging due to the complexity of integrating various tools with the agent, but OpenDevin’s infrastructure (e.g., runtime §2.2, tool library §2.3) simplifies the integration significantly.

Results

In our experiments, we achieve a score of 32.1 on the GAIA level-1 validation set, significantly improving over the original AutoGPT [14]. GAIA scores are sensitive to multimodal input support and web navigation skills, suggesting further improvements as OpenDevin's infrastructure matures.

4.4.2 GPQA

GPQA [50] evaluates agents’ ability for coordinated tool use when solving challenging graduate-level problems. It consists of 448 curated and difficult multiple-choice questions in biology, physics, and chemistry. Tool use (e.g., python) and web search are often useful to assist agents in answering these questions since they provide accurate calculations that LLMs are often incapable of and access to information outside of the LLM’s parametric knowledge base.

Results

Results are shown in Tab. 6 and 7. We observe that OpenDevin's integrated support for diverse tool use (e.g., Python for calculations) as well as web search (for finding relevant facts) allows the resulting agent to better solve complex multi-step problems, surpassing the prior state of the art by 9.6% and 12.3% on the main and diamond subsets of GPQA [50], respectively.

4.4.3 AgentBench

AgentBench [31] evaluates agents’ reasoning and decision-making abilities in a multi-turn, open-ended generation setting. We selected the code-grounded operating system (OS) subset with 144 tasks. Agents from OpenDevin interact directly with the task-specific OS using bash commands in a multi-turn manner, combining interaction and reasoning to automate task completion.

Results

In our experiments (Tab. 6), OpenDevin CodeActAgent v1.5 achieves a score of 57.6% on AgentBench using gpt-4o, outperforming the 42.4% baseline using gpt-4 from the original paper. Interestingly, when employing weaker models such as gpt-3.5-turbo, OpenDevin agents generally underperform the original baseline agents. This finding suggests that generalist agents, like those implemented in OpenDevin, require a certain threshold of foundation model capability, particularly instruction following, to function effectively.

4.4.4 MINT

MINT [64] is a benchmark designed to evaluate agents' ability to solve challenging tasks through multi-turn interactions using tools and natural language feedback simulated by GPT-4. We use the coding and math subsets used in Eurus [76] for evaluation. We follow the same setting as the original paper and allow the agent to interact for up to five turns, with two chances to propose solutions.

Results

As shown in Tab. 6, OpenDevin agents achieve performance comparable to the default agent in the original benchmark, with a performance improvement on the math subset.

Table 6: OpenDevin miscellaneous assistance evaluation results (§4.4)

| Agent | Model | Success Rate (%) | Avg. Cost ($) |
|---|---|---|---|

GAIA [34], level-1 validation set
| AutoGPT [14] | gpt-4-turbo | 13.2 | - |
| OD GPTSwarm v1.0 | gpt-4-0125-preview | 30.2 | 0.110 |
| OD GPTSwarm v1.0 | gpt-4o-2024-05-13 | 32.1 | 0.050 |

GPQA [50], diamond set, 198 instances (refer to §F, Tab. 7 for other subsets)
| Human [50] | Expert human | 81.3 | - |
| Human [50] | Non-expert human | 21.9 | - |
| Few-shot Prompting + Chain-of-Thought [50] | Llama-2-70b-chat | 28.1 | - |
| Few-shot Prompting + Chain-of-Thought [50] | gpt-3.5-turbo-16k | 29.6 | - |
| Few-shot Prompting + Chain-of-Thought [50] | gpt-4 | 38.8 | - |
| OD CodeActAgent v1.5 | gpt-3.5-turbo-16k | 27.9 | 0.012 |
| OD CodeActAgent v1.5 | gpt-4 (azure:2024-02-15-preview) | 51.8 | 0.501 |
| OD CodeActAgent v1.5 | gpt-4o-2024-05-13 | 53.1 | 0.054 |
| OD CodeActAgent v1.8 | claude-3-5-sonnet-20240620 | 52.0 | 0.065 |

AgentBench [31], OS (bash) subset, 144 instances
| AgentBench Baseline Agent [31] | gpt-4 | 42.4 | - |
| AgentBench Baseline Agent [31] | gpt-3.5-turbo | 32.6 | - |
| OD CodeActAgent v1.5, Generalist | gpt-4o-2024-05-13 | 57.6 | 0.085 |
| OD CodeActAgent v1.5, Generalist | gpt-3.5-turbo-0125 | 11.8 | 0.006 |

MINT [64], math subset, 225 instances
| MINT Baseline Agent | gpt-4-0613 | 65.8 | - |
| OD CodeActAgent v1.5, Generalist | gpt-4o-2024-05-13 | 77.3 | 0.070 |
| OD CodeActAgent v1.5, Generalist | gpt-3.5-turbo-16k-0613 | 33.8 | 0.048 |

MINT [64], code subset, 136 instances
| MINT Baseline Agent | gpt-4-0613 | 59.6 | - |
| OD CodeActAgent v1.5, Generalist | gpt-4o-2024-05-13 | 50.0 | 0.087 |
| OD CodeActAgent v1.5, Generalist | gpt-3.5-turbo-16k-0613 | 5.2 | 0.030 |

ProofWriter [55], 600 instances
| Few-shot Prompting + Chain-of-Thought [43] | gpt-4 | 68.1 | - |
| Logic-LM [43] | gpt-4 + symbolic solver | 79.6 | - |
| OD CodeActAgent v1.5, Generalist | gpt-4o-2024-05-13 | 78.8 | - |

Entity Deduction Arena [77], 200 instances
| Human | - | 21.0 | - |
| Zero-shot Prompting [77] | gpt-4-0314 | 40.0 | - |
| Zero-shot Prompting [77] | gpt-3.5-turbo-0613 | 27.0 | - |
| OD CodeActAgent v1.5, Generalist | gpt-4o-2024-05-13 | 38.0 | - |
| OD CodeActAgent v1.5, Generalist | gpt-3.5-turbo-16k-0613 | 24.0 | - |

ProofWriter

ProofWriter [55] is a synthetic dataset created to assess the deductive reasoning abilities of LLMs. As in Logic-LM [43], we focus on the most challenging subset, which contains 600 instances requiring 5-hop reasoning. To minimize the impact of potential errors in semantic parsing, we use the logical forms provided by Logic-LM. Results. In Tab. 6, the OpenDevin agent employs a symbolic solver to solve the task, achieving performance comparable to the state-of-the-art neuro-symbolic approach (i.e., Logic-LM [43]).

Entity Deduction Arena

Entity Deduction Arena (EDA) [77] evaluates agents' ability to deduce unknown entities through strategic questioning, akin to the 20 Questions game. This benchmark tests the agent's state tracking, strategic planning, and inductive reasoning capabilities over multi-turn conversations. We evaluate two datasets, "Things" and "Celebrities", each comprising 100 instances, and report the average success rate over the two datasets. Results. Tab. 6 shows that CodeActAgent yields performance comparable to the results reported in the original paper [77].

Conclusion

We introduce OpenDevin, a community-driven platform that enables the development of agents that interact with the world through software interfaces. By providing a powerful interaction mechanism, a safe sandboxed environment, essential agent skills, multi-agent collaboration capabilities, and a comprehensive evaluation framework, OpenDevin accelerates research innovations and real-world applications of agentic AI systems. Despite challenges in developing safe and reliable agents (§A), we are excited about our vibrant community and look forward to OpenDevin’s continued evolution.

Acknowledgments

We extend our heartfelt gratitude to all contributors who have made OpenDevin possible. Over 160 individuals have made more than 1,300 contributions, demonstrating the power of collaborative development. We are deeply appreciative of everyone who has participated in discussions, raised issues, suggested improvements, and contributed code or documentation.

References

[1] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as i can, not as i say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.

[2] Anthropic. Claude, 2022. URL https://www.anthropic.com/index/introducing-claude.

[3] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.

[4] Daniil A Boiko, Robert MacKnight, Ben Kline, and Gabe Gomes. Autonomous chemical research with large language models. Nature, 624(7992):570–578, 2023.

[5] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3):1–45, 2024.

[6] Harrison Chase. LangChain, October 2022. URL https://github.com/langchain-ai/langchain.

[7] Guangyao Chen, Siwei Dong, Yu Shu, Ge Zhang, Jaward Sesay, Börje F. Karlsson, Jie Fu, and Yemin Shi. Autoagents: A framework for automatic agent generation, 2024.

[8] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

[9] Cognition.ai. Introducing devin, the first ai software engineer. URL https://www.cognition.ai/blog/introducing-devin.

[10] Jiaxi Cui, Zongjian Li, Yang Yan, Bohua Chen, and Li Yuan. Chatlaw: Open-source legal large language model with integrated external knowledge bases, 2023.

A full list of contributors can be found at https://github.com/OpenDevin/OpenDevin/graphs/contributors, and our stargazers at https://github.com/OpenDevin/OpenDevin/stargazers.

[11] DeepSeek-AI, Qihao Zhu, Daya Guo, Zhihong Shao, Dejian Yang, Peiyi Wang, Runxin Xu, Y. Wu, Yukun Li, Huazuo Gao, Shirong Ma, Wangding Zeng, Xiao Bi, Zihui Gu, Hanwei Xu, Damai Dai, Kai Dong, Liyue Zhang, Yishi Piao, Zhibin Gou, Zhenda Xie, Zhewen Hao, Bingxuan Wang, Junxiao Song, Deli Chen, Xin Xie, Kang Guan, Yuxiang You, Aixin Liu, Qiushi Du, Wenjun Gao, Xuan Lu, Qinyu Chen, Yaohui Wang, Chengqi Deng, Jiashi Li, Chenggang Zhao, Chong Ruan, Fuli Luo, and Wenfeng Liang. Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence, 2024. URL https://arxiv.org/abs/2406.11931. 8
[12] Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. Workarena: How capable are web agents at solving common knowledge work tasks?, 2024. 3, 4
[13] Paul Gauthier. How aider scored sota 26.3% on swe bench lite | aider. https://aider.chat/2024/05/22/swe-bench-lite.html. Accessed: 2024-06-05. 7, 8
[14] Significant Gravitas. Auto-gpt: An autonomous gpt-4 experiment, 2023. URL https://github.com/Significant-Gravitas/Auto-GPT. 5, 7, 11, 12, 23
[15] Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. arXiv preprint arXiv:2401.13919, 2024. 4
[16] Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. Metagpt: Meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, 2023. 1, 5, 23
[17] Dong Huang, Qingwen Bu, Jie M. Zhang, Michael Luck, and Heming Cui. Agentcoder: Multi-agent-based code generation with iterative testing and optimisation, 2024. 23
[18] Peter C Humphreys, David Raposo, Tobias Pohlen, Gregory Thornton, Rachita Chhaparia, Alistair Muldal, Josh Abramson, Petko Georgiev, Adam Santoro, and Timothy Lillicrap. A data-driven approach for learning to control computers. In International Conference on Machine Learning, pages 9466–9482. PMLR, 2022. 10
[19] IPython. Jupyter and the future of IPython — IPython. URL https://ipython.org. 4
[20] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024. 1
[21] Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can Language Models Resolve Real-world Github Issues? In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=VTF8yNQM66. 1, 6, 7, 8, 22, 24
[22] Geunwoo Kim, Pierre Baldi, and Stephen McAleer. Language models can solve computer tasks. Advances in Neural Information Processing Systems, 36, 2024. 11
[23] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022. 8
[24] Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, et al. Autowebglm: Bootstrap and reinforce a large language model-based web navigating agent. arXiv preprint arXiv:2404.03648, 2024. 7, 10
[25] Hareton K. N. Leung and Lee J. White. A study of integration testing and software regression at the integration level. In Proceedings of the Conference on Software Maintenance, ICSM 1990, San Diego, CA, USA, 26-29 November, 1990, pages 290–301. IEEE, 1990. doi: 10.1109/ICSM.1990.131377. URL https://doi.org/10.1109/ICSM.1990.131377. 24


[26] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large scale language model society. arXiv preprint arXiv:2303.17760, 2023. 23
[27] Jinyang Li, Binyuan Hui, GE QU, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, Xuanhe Zhou, Chenhao Ma, Guoliang Li, Kevin Chang, Fei Huang, Reynold Cheng, and Yongbin Li. Can LLM already serve as a database interface? a BIg bench for large-scale database grounded text-to-SQLs. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. 6, 8, 10, 22
[28] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. Starcoder: may the source be with you!, 2023. URL https://arxiv.org/abs/2305.06161. 8
[29] Tao Li, Gang Li, Zhiwei Deng, Bryan Wang, and Yang Li. A zero-shot language agent for computer control with structured reflection. arXiv preprint arXiv:2310.08740, 2023. 11
[30] Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration. In International Conference on Learning Representations (ICLR), 2018. URL https://arxiv.org/abs/1802.08802. 6, 10, 22
[31] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688, 2023. 6, 11, 12, 22
[32] Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. Starcoder 2 and the stack v2: The next generation, 2024. URL https://arxiv.org/abs/2402.19173. 8
[33] Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models. Advances in Neural Information Processing Systems, 36, 2024. 8
[34] Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for general AI assistants. CoRR, abs/2311.12983, 2023. doi: 10.48550/ARXIV.2311.12983. URL https://doi.org/10.48550/arXiv.2311.12983. 6, 11, 12, 22
[35] Mozilla. Accessibility tree - MDN Web Docs Glossary: Definitions of Web-related terms | MDN. URL https://developer.mozilla.org/en-US/docs/Glossary/Accessibility_tree. 4

[36] Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng-Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, and Edward Raffel. Crosslingual generalization through multitask finetuning, 2023. URL https://arxiv.org/abs/2211.01786. 8
[37] Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, and Shayne Longpre. Octopack: Instruction tuning code large language models, 2024. 6, 7, 8, 22
[38] Y Nakajima. Babyagi, 2023. URL https://github.com/yoheinakajima/babyagi. 23
[39] OpenAI. Chatgpt: May 2024 version. https://www.openai.com/chatgpt, 2024. Accessed: 2024-05-29. 8, 23
[40] OpenAI. Hello gpt-4o. https://openai.com/index/hello-gpt-4o/, 2024. Accessed: 2024-05-15. 1
[41] OpenAI: Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, et al. Gpt-4 technical report, 2024. 8, 23

[42] Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, and Alane Suhr. Autonomous evaluation and refinement of digital agents. arXiv preprint arXiv:2404.06474, 2024. 7, 10, 23
[43] Liangming Pan, Alon Albalak, Xinyi Wang, and William Yang Wang. Logic-lm: Empowering large language models with symbolic solvers for faithful logical reasoning. arXiv preprint arXiv:2305.12295, 2023. 12
[44] Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior, 2023. 23
[45] Ajay Patel, Markus Hofmarcher, Claudiu Leoveanu-Condrei, Marius-Constantin Dinu, Chris Callison-Burch, and Sepp Hochreiter. Large language models can self-improve at web agent tasks. arXiv preprint arXiv:2405.20309, 2024. 7, 10
[46] Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive apis. arXiv preprint arXiv:2305.15334, 2023. 6, 8, 9, 22
[47] Playwright. Fast and reliable end-to-end testing for modern web apps | Playwright. URL https://playwright.dev/. 4
[48] Chen Qian, Xin Cong, Wei Liu, Cheng Yang, Weize Chen, Yusheng Su, Yufan Dang, Jiahao Li, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. Communicative agents for software development, 2023. 23
[49] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis. CoRR, abs/2307.16789, 2023. doi: 10.48550/ARXIV.2307.16789. URL https://doi.org/10.48550/arXiv.2307.16789. 3
[50] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. GPQA: A Graduate-Level Google-Proof Q&A Benchmark. arXiv preprint arXiv:2311.12022, 2023. 6, 7, 11, 12, 22, 25
[51] Aman Sanger. Near-instant full-file edits. https://www.cursor.com/blog/instant-apply. Accessed: 2024-06-05. 5
[52] ServiceNow. BrowserGym: a Gym Environment for Web Task Automation. URL https://github.com/ServiceNow/BrowserGym. 2, 4
[53] Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, and Kristina N Toutanova. From pixels to ui actions: Learning to follow instructions via graphical user interfaces. Advances in Neural Information Processing Systems, 36:34354–34370, 2023. 11
[54] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024. 23
[55] Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. ProofWriter: Generating implications, proofs, and abductive statements over natural language. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 3621–3634, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-acl.317. URL https://aclanthology.org/2021.findings-acl.317. 6, 12, 22


[56] Xiangru Tang, Qiao Jin, Kunlun Zhu, Tongxin Yuan, Yichi Zhang, Wangchunshu Zhou, Meng Qu, Yilun Zhao, Jian Tang, Zhuosheng Zhang, et al. Prioritizing safeguarding over autonomy: Risks of LLM agents for science. arXiv preprint arXiv:2402.04247, 2024.
[57] Xiangru Tang, Yuliang Liu, Zefan Cai, Yanjun Shao, Junjie Lu, Yichi Zhang, Zexuan Deng, Helan Hu, Kaikai An, Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Liang Chen, Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao Chang, Yin Fang, Yujia Qin, Wangchunshu Zhou, Yilun Zhao, Arman Cohan, and Mark Gerstein. ML-Bench: Evaluating large language models and agents for machine learning tasks on repository-level code, 2024. URL https://arxiv.org/abs/2311.09835.
[58] Xiangru Tang, Bill Qian, Rick Gao, Jiakang Chen, Xinyun Chen, and Mark B Gerstein. BioCoder: a benchmark for bioinformatics code generation with large language models. Bioinformatics, 40(Supplement_1):i266–i276, 06 2024. ISSN 1367-4811.
[59] Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. Medagents: Large language models as collaborators for zero-shot medical reasoning, 2024.
[60] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[61] XAgent Team. Xagent: An autonomous agent for complex task solving, 2023.
[62] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[63] Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable Code Actions Elicit Better LLM Agents. In ICML, 2024.
[64] Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji. MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback. In ICLR, 2024.
[65] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
[66] BigScience Workshop: Teven Le Scao, Angela Fan, Christopher Akiki, et al. Bloom: A 176b-parameter open-access multilingual language model, 2023. URL https://arxiv.org/abs/2211.05100.

[67] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework, 2023. 1, 5, 23
[68] Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying LLM-based software engineering agents, 2024. 7


[69] Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, et al. Openagents: An open platform for language agents in the wild. arXiv preprint arXiv:2310.10634, 2023. 5
[70] Yiheng Xu, Hongjin Su, Chen Xing, Boyu Mi, Qian Liu, Weijia Shi, Binyuan Hui, Fan Zhou, Yitao Liu, Tianbao Xie, et al. Lemur: Harmonizing natural language and code for language agents. arXiv preprint arXiv:2310.06830, 2023. 7, 10
[71] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441, 2023. 4
[72] John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. Swe-agent: Agent-computer interfaces enable automated software engineering, 2024. 2, 5, 7, 8, 23
[73] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WE_vluYUL-X. 8
[74] Yining Ye, Xin Cong, Shizuo Tian, Jiannan Cao, Hao Wang, Yujia Qin, Yaxi Lu, Heyang Yu, Huadong Wang, Yankai Lin, et al. Proagent: From robotic process automation to agentic process automation. arXiv preprint arXiv:2311.10751, 2023. 23
[75] Lifan Yuan, Yangyi Chen, Xingyao Wang, Yi R. Fung, Hao Peng, and Heng Ji. CRAFT: customizing llms by creating and retrieving from specialized toolsets. CoRR, abs/2309.17428, 2023. doi: 10.48550/ARXIV.2309.17428. URL https://doi.org/10.48550/arXiv.2309.17428. 3
[76] Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, and Maosong Sun. Advancing llm reasoning generalists with preference trees, 2024. 11
[77] Yizhe Zhang, Jiarui Lu, and Navdeep Jaitly. Probing the multi-turn planning capabilities of llms via 20 question games. 2024. 6, 12, 22
[78] Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. Autocoderover: Autonomous program improvement, 2024. 5, 7, 8, 23
[79] Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. In The Twelfth International Conference on Learning Representations, 2023. 1, 4, 6, 7, 10, 22
[80] Wangchunshu Zhou, Yuchen Eleanor Jiang, Long Li, Jialong Wu, Tiannan Wang, Shi Qiu, Jintian Zhang, Jing Chen, Ruipu Wu, Shuai Wang, Shiding Zhu, Jiyu Chen, Wentao Zhang, Xiangru Tang, Ningyu Zhang, Huajun Chen, Peng Cui, and Mrinmaya Sachan. Agents: An open-source framework for autonomous language agents, 2023. 5, 23
[81] Yuchen Zhuang, Yue Yu, Kuan Wang, Haotian Sun, and Chao Zhang. Toolqa: A dataset for llm question answering with external tools. Advances in Neural Information Processing Systems, 36, 2024. 6, 8, 9, 22
[82] Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R Ashley, Róbert Csordás, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann, Kazuki Irie, et al. Mindstorms in natural language-based societies of mind. arXiv preprint arXiv:2305.17066, 2023. 23
[83] Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jurgen Schmidhuber. Language agents as optimizable graphs. arXiv preprint arXiv:2402.16823, 2024. 5, 6, 22, 23


[84] Terry Yue Zhuo, Armel Zebaze, Nitchakarn Suppattarachai, Leandro von Werra, Harm de Vries, Qian Liu, and Niklas Muennighoff. Astraios: Parameter-efficient instruction tuning code large language models, 2024. URL https://arxiv.org/abs/2401.00788. 8
[85] Albert Örwall. Moatless tools. URL https://github.com/aorwall/moatless-tools. 7

Author Contributions

This work was an open-source collaborative effort across multiple institutions. We employed a point-based system to determine contributions and award authorships, with technical contributions tracked and measured in units of pull requests (PRs). Xingyao Wang led the project, coordinating overall development and paper writing efforts. Detailed contributions were as follows:

Agent Development (§3) Xingyao Wang led the implementation of the CodeAct and CodeActSWE agents. Frank F. Xu led the development of web browsing agents. Mingchen Zhuge orchestrated the integration of the GPTSwarm agent. Robert Brennan and Boxuan Li led the development of the Micro Agent.
Architectural Development (Fig. 3) Robert Brennan initiated the architecture design. Boxuan Li, Frank F. Xu, Xingyao Wang, Yufan Song, and Mingzhang Zheng further refined and expanded the architecture. Boxuan Li implemented the initial version of integration tests, maintained the agentskills library, managed configurations, and resolved resource leaks in evaluation. Frank F. Xu developed the web browsing environment for both agent execution and evaluation and integrated it with both agent and front-end user interfaces. Xingyao Wang authored the initial code for the agentskills library and the Docker sandbox. Yufan Song implemented cost tracking for evaluation, while Mingzhang Zheng developed an image-agnostic docker sandbox for more stable SWE-Bench evaluation.
Benchmarking, Integration, and Code Review Boxuan Li and Yufan Song led benchmark integration efforts, including coordination, evaluation, and code review. Yufan Song also helped track PR contributions. Graham Neubig, Xingyao Wang, Mingzhang Zheng, Robert Brennan, Hoang H. Tran, Frank F. Xu, Xiangru Tang, Fuqiang Li, and Yanjun Shao provided additional support in integration and code reviews. Specific benchmark contributions included:
  • SWE-Bench [21]: Bowen Li and Xingyao Wang
  • WebArena [79] and MiniWob++ [30]: Frank F. Xu
  • GAIA [34]: Jiayi Pan (integration) and Mingchen Zhuge (GPTSwarm evaluation)
  • API-Bench [46] and ToolQA [81]: Yueqi Song
  • HumanEvalFix [37]: Niklas Muennighoff and Xiangru Tang
  • ProofWriter [55]: Ren Ma
  • MINT [64]: Hoang H. Tran
  • AgentBench [31]: Fuqiang Li
  • BIRD [27]: Binyuan Hui
  • GPQA [50]: Jaskirat Singh
  • BioCoder [58]: Xiangru Tang and Bill Qian
  • ML-Bench [57]: Xiangru Tang and Yanjun Shao
  • Entity-Deduction-Arena [77]: Yizhe Zhang

Advising: Graham Neubig advised the project, providing guidance, resources, and substantial paper edits. Heng Ji and Hao Peng offered additional project advice and assisted with paper writing. Junyang Lin contributed advisory support and sponsored resources. All authors contributed to the result discussions and manuscript preparation.

Limitations and Future Work

We are excited about the foundations our vibrant community has laid in OpenDevin and look forward to its continued evolution. We identify several directions for future work:

  • Enhanced multi-modality support. While our current implementation already supports a wide range of file formats through predefined agent skills, we are interested in enabling multi-modality in a principled way through standard IPython and browser integration, such as viewing images and videos with a vision-language model through a browser or processing XLSX files with code.
  • Stronger agents. Current agents still struggle with complex tasks, and we are interested in building better agents through both training and inference time techniques.

For more details, please refer to OpenDevin/OpenDevin#1917.

Web browsing improvements

Thanks to OpenDevin’s extensible architecture, orthogonal components that improve agents can be integrated easily. For example, Auto Eval & Refine [42], an agent retry-on-error strategy that combines Reflexion [54] prompts with task-completion reward models, will be integrated as an optional component attached to our browsing agent, as sketched below.
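The following is a minimal sketch of what such a retry-on-error wrapper could look like. It is illustrative only: the run_episode, reward_model, and reflect callables are hypothetical placeholders, not OpenDevin or Auto Eval & Refine APIs.

```python
# Sketch: retry a browsing episode when a task-completion reward model judges
# it unsuccessful, feeding a Reflexion-style self-critique into the next try.
from typing import Callable


def run_with_retries(
    run_episode: Callable[[str], str],     # hypothetical: runs the browsing agent, returns a trajectory summary
    reward_model: Callable[[str], float],  # hypothetical: scores task completion in [0, 1]
    reflect: Callable[[str], str],         # hypothetical: produces a verbal self-critique of a failed trajectory
    task: str,
    max_retries: int = 3,
    threshold: float = 0.5,
) -> str:
    hint = ""
    trajectory = ""
    for _ in range(max_retries):
        prompt = task + ("\n\nLessons from previous attempt:\n" + hint if hint else "")
        trajectory = run_episode(prompt)
        if reward_model(trajectory) >= threshold:
            return trajectory          # the reward model considers the task done
        hint = reflect(trajectory)     # Reflexion-style feedback for the next attempt
    return trajectory
```

Because the wrapper only needs a way to run an episode and score it, it can be attached to the browsing agent as an optional component without changing the agent itself.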

Stable Runtime

Currently, OpenDevin’s default sandbox environment, the SSH sandbox, relies on the pexpect library to pass commands and messages between the sandbox and the agent controller. This approach does not suit other types of sandboxes (e.g., commercial cloud sandboxes) and is not stable enough. In the future, we plan to use the EventStream as the bridge between the agent controller and the sandbox, which will be more stable, sandbox-agnostic, and enable better observability.
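As a rough illustration of the intended design, the sketch below shows an event stream in which the controller publishes action events and a subscribed sandbox publishes observation events back. The Event and EventStream classes here are a simplified, hypothetical sketch, not the actual OpenDevin interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Event:
    source: str   # e.g. "agent" or "runtime" (illustrative labels)
    kind: str     # e.g. "run_command" or "command_output"
    payload: dict


class EventStream:
    """Toy pub/sub stream; a real design would persist events for observability."""

    def __init__(self) -> None:
        self.history: List[Event] = []
        self._subscribers: List[Callable[[Event], None]] = []

    def subscribe(self, callback: Callable[[Event], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, event: Event) -> None:
        self.history.append(event)          # every event is recorded and inspectable
        for callback in self._subscribers:
            callback(event)


# The agent controller publishes action events; whichever sandbox implementation
# subscribes (local Docker, SSH, or a commercial cloud sandbox) executes them and
# publishes observations back, which is what makes the bridge sandbox-agnostic.
stream = EventStream()
stream.subscribe(lambda e: print(f"[{e.source}] {e.kind}: {e.payload}"))
stream.publish(Event(source="agent", kind="run_command", payload={"command": "ls"}))
```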

Automatic workflow generation

Currently, OpenDevin’s workflows still require a substantial amount of handcrafting. We believe that graph-based frameworks such as GPTSwarm [83] and LangGraph [6] could serve as alternative solutions for building agents. In particular, when agents are constructed as graphs, as in GPTSwarm, it becomes easier to incorporate various optimization methods (e.g., reinforcement learning, meta-prompting). We consider these methods to lay the groundwork for promising solutions to automatic workflow generation in future versions.
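As a toy illustration of the graph-based idea, the sketch below represents a workflow as nodes with successor edges, so that the topology itself becomes something an optimizer could later rewire. The Node class and execute helper are hypothetical and mirror neither GPTSwarm’s nor LangGraph’s APIs.

```python
from typing import Callable, Dict, List


class Node:
    """One step of a workflow; edges to successors form the (here linear) graph."""

    def __init__(self, name: str, run: Callable[[Dict], Dict]):
        self.name, self.run = name, run
        self.successors: List["Node"] = []

    def then(self, nxt: "Node") -> "Node":
        self.successors.append(nxt)
        return nxt


def execute(start: Node, state: Dict) -> Dict:
    # A graph like this could be rewired or re-weighted by an optimizer
    # (e.g., reinforcement learning or meta-prompting) instead of handcrafted.
    node = start
    while node is not None:
        state = node.run(state)
        node = node.successors[0] if node.successors else None
    return state


plan = Node("plan", lambda s: {**s, "plan": f"steps for {s['task']}"})
code = Node("write_code", lambda s: {**s, "code": "# patch goes here"})
test = Node("run_tests", lambda s: {**s, "passed": True})
plan.then(code).then(test)

print(execute(plan, {"task": "fix the failing unit test"}))
```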

Ethics Statement

Most AI agents today are still research artifacts and lack the ability to reliably perform complex, long-horizon tasks in the real world. However, as their performance continues to improve and they are increasingly deployed in the real world, they have the potential to significantly boost productivity while also posing security risks to society. OpenDevin helps mitigate these risks by:

  1. Enabling systematic evaluation of these agents, which can identify and address risks before they are widely deployed.
  2. Facilitating human-agent interaction rather than allowing agents to operate autonomously without oversight.
  3. More importantly, we hope OpenDevin allows researchers worldwide to access the best suites of agents to conduct frontier safety research towards building safe and helpful agents.

Related Work

The breakthroughs in large language models (LLMs) like ChatGPT [39] and GPT-4 [41] have significantly enhanced the capabilities of autonomous agents across various domains [10, 44, 59, 74]. These advances have spurred a multitude of generalist agent proposals [14, 38, 67] aimed at performing diverse user tasks and have gained attention from both developers and broader audiences. Notable works such as Auto-GPT [14] harness LLMs for task completion by decomposing user goals into executable steps. Multi-agent collaboration systems leverage LLMs for elements like role-playing and task-solving capabilities [26, 61, 80, 82], with MetaGPT [16] emphasizing standardized operating procedures, and AutoGen [67] providing a conversation framework for interactive systems. AGENTS [80] and AutoAgents [7] offer new paradigms for customizable agent architecture, while XAgent [61] and GPTSwarm [83] introduce complex management systems and optimizable graphs, respectively, for enhanced agent operations.

Software development, a front-runner in applying LLM-based agents, has seen advancements in frameworks for facilitating the development processes [16, 48]. Innovations such as ChatDev [48] automate the software development lifecycle akin to the waterfall model, and AutoCodeRover [78] addresses GitHub issues via code search and abstract syntax tree manipulation. AgentCoder [17] iteratively refines code generation with integrated testing and feedback, while SWE-Agent [72] integrates LLMs for automated Github issue fixing, streamlining software engineering.

D Graphical User Interface

Besides running from the command line, OpenDevin features a rich graphical user interface that visualizes the agent’s current actions (e.g., browsing the web, executing bash commands or Python code, etc.) and allows for real-time feedback from the user. Screenshots of the UI are shown in Fig. 1. The user may interrupt the agent at any moment to provide additional feedback, comments, or instructions while the agent is working. This user interface directly connects with the event streams (§2.1) to control and visualize the agents and runtime, making it agent- and runtime-agnostic.

E Quality Control: Integration Tests for Agents

Integration tests [25] have long been used by software developers to ensure software quality. Unlike large language models with a simple input-output schema, agents are typically complex pieces of software in which minor errors can easily be introduced during development and hurt final task performance. While running the full evaluation suite (§4) is the ultimate measure of performance degradation, running it for every code change can be prohibitively slow and expensive.

In OpenDevin, we pioneer an end-to-end agent test framework that tests prompt regression, actions, and sandbox environments. It combines integration testing from software engineering and foundation model mocking for deterministic behavior to prevent the accidental introduction of bugs during agent development.

Defining an integration test

The integration test framework for OpenDevin is structured to validate end-to-end functionality by automating task execution and result verification. Developers define tasks and expected results; for instance, a task might involve correcting typos in a document named "bad.txt". Upon task execution through OpenDevin, outputs are compared against a predefined "gold file" to ensure accuracy.
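A minimal sketch of such a test case is shown below. The run_agent helper merely stands in for driving the real agent end-to-end (in the actual framework the agent controller runs with a mocked LLM, as described next), and the file contents are invented for illustration.

```python
import filecmp
import tempfile
from pathlib import Path


def run_agent(task: str, workspace: Path) -> None:
    """Hypothetical stand-in for running the agent on `task` inside `workspace`.
    For illustration we simply apply the expected fix directly."""
    target = workspace / "bad.txt"
    target.write_text(target.read_text().replace("teh", "the"))


def test_fix_typos() -> None:
    # Developer-defined task input and the expected "gold" result.
    workspace = Path(tempfile.mkdtemp())
    (workspace / "bad.txt").write_text("teh quick brown fox\n")
    gold = Path(tempfile.mkdtemp()) / "bad.txt"
    gold.write_text("the quick brown fox\n")

    run_agent("Fix all typos in bad.txt", workspace)

    # The agent's output must match the predefined gold file exactly.
    assert filecmp.cmp(workspace / "bad.txt", gold, shallow=False)


test_fix_typos()
```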

Mocking LLM for deterministic behavior

Addressing the challenge of non-determinism in large language models (LLMs) and the associated high costs, the framework intercepts all LLM calls and supplies predefined responses based on exact prompt matches. This method not only ensures consistency in test outcomes but also reduces operational costs by minimizing the reliance on real LLMs.
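A simplified sketch of this interception is shown below, assuming recorded responses are stored as JSON files keyed by a hash of the exact prompt. The MockLLM class and the storage layout are illustrative assumptions, not the actual OpenDevin test harness.

```python
import hashlib
import json
from pathlib import Path


class MockLLM:
    """Answers completion calls from recorded prompt/response pairs keyed by an
    exact match on the prompt, so agent tests are deterministic and free."""

    def __init__(self, response_dir: Path):
        self.response_dir = response_dir

    def completion(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        path = self.response_dir / f"{key}.json"
        if not path.exists():
            # A breaking prompt change: the pair must be regenerated once with a
            # real LLM and committed; afterwards the test runs offline again.
            raise KeyError(f"No recorded response for this prompt; regenerate {path.name}")
        return json.loads(path.read_text())["response"]
```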

Regenerate LLM responses on breaking changes

Prompt-response pairs are managed through a script that generates and stores these pairs when new tests are introduced or existing prompts are modified. For routine tests, the framework attempts to reuse existing LLM responses by slightly adjusting the prompts. Substantial changes that affect task handling require regeneration of these pairs using real LLMs.

Benefits of integration tests

The framework offers several advantages:

  1. Prompt regression testing: Stored prompt-response pairs facilitate change tracking and provide a reference for new team members to understand LLM interactions.
  2. Multi-platform support: Tests are automatically scheduled for every pull request and commit on the main branch, running across multiple platforms, environments, and agents, including Linux and Mac, and in local, SSH, and exec sandboxes.
  3. Comprehensive error detection: The framework captures errors in prompt generation, message passing, and sandbox execution, thereby maintaining high test coverage.

F Additional Results For GPQA Benchmark

We showcase more detailed results, including performance on the other subsets of the GPQA benchmark, in Tab. 7.

G In-context Demonstration for CodeActSWEAgent

The prompt is re-adopted from the SWE-agent’s released trajectory (https://github.com/princeton-nlp/SWE-agent/tree/main/trajectories/demonstrations). The prompt can be found at https://github.com/OpenDevin/OpenDevin/blob/main/agenthub/codeact_swe_agent/prompt.py.

Running a SWE-Bench Lite [21] evaluation with gpt-4o costs around 600 USD.

Table 7: Accuracy on GPQA by subset (%) and average cost ($).

| Evaluation Method and Model | Diamond Set | Main Set | Extended Set | Avg Cost ($) |
|---|---|---|---|---|
| Expert Human Validators | 81.2 | 72.5 | 65.4 | N/A |
| Non-Expert Human Validators | 21.9 | 30.5 | 33.9 | N/A |
| Few-Shot CoT Llama-2-70B-chat | 28.1 | 29.1 | 30.4 | N/A |
| Few-Shot CoT GPT-3.5-turbo-16k | 29.6 | 28.0 | 28.2 | N/A |
| Few-Shot CoT GPT-4 | 38.8 | 39.7 | 38.7 | N/A |
| GPT-4 with search (backoff to CoT on abstention) | 38.8 | 41.0 | 39.4 | N/A |
| OpenDevin + CodeActAgent v1.5 + GPT3.5-turbo | 27.9 | 23.4 | 26.1 | 0.012 |
| OpenDevin + CodeActAgent v1.5 + GPT4-turbo | 51.8 | 47.4 | 42.4 | 0.501 |
| OpenDevin + CodeActAgent v1.5 + GPT4o | 53.1 | 49.3 | 52.8 | 0.054 |

Supported AgentSkills

As of OpenDevin v0.6, we support the following list of skills. Please refer to the source code for the most up-to-date list of skills: https://github.com/OpenDevin/OpenDevin/blob/main/opendevin/runtime/plugins/agent_skills/agentskills.py

def open_file(path: str, line_number: Optional[int] = None) -> None:

Opens the file at the given path in the editor. If line_number is provided, the window will be moved to include that line.

Args:

  • path: str: The path to the file to open.
  • line_number: Optional[int]: The line number to move to.

def goto_line(line_number: int) -> None:

Moves the window to show the specified line number.

Args:

  • line_number: int: The line number to move to.

def scroll_down() -> None:

Moves the window down by 100 lines.

Args:

  • None

def scroll_up() -> None:

Moves the window up by 100 lines.

Args:

  • None

def create_file(filename: str) -> None:


Creates and opens a new file with the given name.

Args:

  • filename: str: The name of the file to create.

def edit_file(start: int, end: int, content: str) -> None:

Edit a file.

It replaces lines start through end (inclusive) with the given text content in the open file. Remember, the file must be open before editing.

Args:

  • start: int: The start line number. Must satisfy start >= 1.
  • end: int: The end line number. Must satisfy start <= end <= number of lines in the file.
  • content: str: The content to replace the lines with.

def search_dir(search_term: str, dir_path: str = './') -> None:

Searches for search_term in all files in dir_path. If dir_path is not provided, searches in the current directory.

Args:

  • search_term: str: The term to search for.
  • dir_path: Optional[str]: The path to the directory to search.

def search_file(search_term: str, file_path: Optional[str] = None) -> None:

Searches for search_term in file_path. If file_path is not provided, searches in the current open file.

Args:

  • search_term: str: The term to search for.
  • file_path: Optional[str]: The path to the file to search.

def find_file(file_name: str, dir_path: str = './') -> None:

Finds all files with the given name in the specified directory.

Args:

  • file_name: str: The name of the file to find.
  • dir_path: Optional[str]: The path to the directory to search.

def parse_pdf(file_path: str) -> None:

Parses the content of a PDF file and prints it.

Args:

  • file_path: str: The path to the file to open.

def parse_docx(file_path: str) -> None:

Parses the content of a DOCX file and prints it.

Args:

  • file_path: str: The path to the file to open.

def parse_latex(file_path: str) -> None:

Parses the content of a LaTeX file and prints it.

Args:

  • file_path: str: The path to the file to open.

def parse_audio(file_path: str, model: str = 'whisper-1') -> None:

Parses the content of an audio file and prints it.

Args:

  • file_path: str: The path to the audio file to transcribe.
  • model: Optional[str]: The audio model to use for transcription. Defaults to 'whisper-1'.

def parse_image(file_path: str, task: str = 'Describe this image as detail as possible.') -> None:

Parses the content of an image file and prints the description.

Args:

  • file_path: str: The path to the file to open.
  • task: Optional[str]: The task description for the API call. Defaults to 'Describe this image as detail as possible.'.

def parse_video(file_path: str, task: str = 'Describe this image as detail as possible.', frame_interval: int = 30) -> None:

Parses the content of a video file and prints the description.

Args:

  • file_path: str: The path to the video file to open.
  • task: Optional[str]: The task description for the API call. Defaults to 'Describe this image as detail as possible.'.
  • frame_interval: Optional[int]: The interval between frames to analyze. Defaults to 30.

def parse_pptx(file_path: str) -> None:

Parses the content of a pptx file and prints it.

Args:

  • file_path: str: The path to the file to open.
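To make the calling convention concrete, the following is a hedged sketch of how an agent might combine a few of these skills inside a single IPython cell. In practice the skills are pre-loaded into the agent’s IPython environment; the explicit import is shown only to keep the sketch self-contained, and the file path and line numbers are made up for illustration.

```python
# Illustrative IPython cell an agent might emit using the AgentSkills above.
from opendevin.runtime.plugins.agent_skills.agentskills import (
    edit_file,
    open_file,
    search_dir,
)

search_dir("TODO", "./src")               # find files that still mention TODO (example query)
open_file("./src/app.py", 42)             # open the (hypothetical) file with the window around line 42
edit_file(42, 42, "    return total\n")   # replace line 42 of the open file with a fix
```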

BrowserGym Actions

The following are all the supported actions defined in BrowserGym as of v0.3.4. The actions can be categorized into several types and can be configured to use only a subset of the functionality. There are agent control actions, navigation actions, page element-based actions, coordinate-based actions, as well as tab-related actions. We use these actions from the BrowserGym library as our main browsing action primitives.

send_msg_to_user(text: str)

Sends a message to the user.

Examples:

  • send_msg_to_user("Based on the results of my search, the city was built in 1751.")

report_infeasible(reason: str)

Notifies the user that their instructions are infeasible.

Examples:

  • report_infeasible("I cannot follow these instructions because there is no email field in this form.")

noop(wait_ms: float = 1000)

Do nothing, and optionally wait for the given time (in milliseconds).

Examples:

  • noop()
  • noop(500)

fill(bid: str, value: str)

Fill out a form field. It focuses the element and triggers an input event with the entered text. It works for <input>, <textarea> and [contenteditable] elements.

Examples:

  • fill('237', 'example value')
  • fill('45', "multi-line\nexample")
  • fill('a12', "example with \"quotes\"")

check

Ensure a checkbox or radio element is checked.

Examples:

check('55')

uncheck

Ensure a checkbox or radio element is unchecked.

Examples:

uncheck('a5289')

select_option

Select one or multiple options in a <select> element. You can specify option value or label to select. Multiple options can be selected.

Examples:

select_option('a48', "blue")
select_option('c48', ["red", "green", "blue"])

click

Click an element.

bid: str
button: Literal["left", "middle", "right"] = "left"
modifiers: list[Literal["Alt", "Control", "Meta", "Shift"]] = []

Examples:

click('a51')
click('b22', button="right")
click('48', button="middle", modifiers=["Shift"])

dblclick

Double click an element.

bid: str
button: Literal["left", "middle", "right"] = "left"
modifiers: list[Literal["Alt", "Control", "Meta", "Shift"]] = []

Examples:

  • dblclick('12')
  • dblclick('ca42', button="right")
  • dblclick('178', button="middle", modifiers=["Shift"])

hover

Hover over an element.

bid: str

Examples:

hover('b8')

press

Focus the matching element and press a combination of keys. It accepts the logical key names that are emitted in the keyboardEvent.key property of the keyboard events: Backquote, Minus, Equal, Backslash, Backspace, Tab, Delete, Escape, ArrowDown, End, Enter, Home, Insert, PageDown, PageUp, ArrowRight, ArrowUp, F1 - F12, Digit0 - Digit9, KeyA - KeyZ, etc. You can alternatively specify a single character you'd like to produce such as "a" or "#". Following modification shortcuts are also supported: Shift, Control, Alt, Meta.

bid: str
key_comb: str

Examples:

  • press('88', 'Backspace')
  • press('a26', 'Control+a')
  • press('a61', 'Meta+Shift+t')

def focus(bid: str):

Focus the matching element.

Examples:

focus('b455')

def clear(bid: str):

Clear the input field.

Examples:

clear('996')

def drag_and_drop(from_bid: str, to_bid: str):

Perform a drag & drop. Hover the element that will be dragged. Press left mouse button. Move mouse to the element that will receive the drop. Release left mouse button.

Examples:

drag_and_drop('56', '498')

def scroll(delta_x: float, delta_y: float):

Scroll horizontally and vertically. Amounts in pixels, positive for right or down scrolling, negative for left or up scrolling. Dispatches a wheel event.

Examples:

scroll(0, 200)
scroll(-50.2, -100.5)

def mouse_move(x: float, y: float):

Move the mouse to a location. Uses absolute client coordinates in pixels. Dispatches a mousemove event.

Examples:

mouse_move(65.2, 158.5)

def mouse_up(x: float, y: float, button: Literal["left", "middle", "right"] = "left"):

Move the mouse to a location then release a mouse button. Dispatches mousemove and mouseup events.

Examples:

mouse_up(250, 120, 'right')
mouse_up(47, 252)

def mouse_down(x: float, y: float, button: Literal["left", "middle", "right"] = "left"):

Move the mouse to a location then press and hold a mouse button. Dispatches mousemove and mousedown events.

Examples:

mouse_down(140.2, 580.1)
mouse_down(458, 254.5, 'middle')

def mouse_click(x: float, y: float, button: Literal["left", "middle", "right"] = "left"):

Move the mouse to a location and click a mouse button. Dispatches mousemove, mousedown and mouseup events.

Examples:

mouse_click(887.2, 68)
mouse_click(56, 712.56, 'right')

def mouse_dblclick(x: float, y: float, button: Literal["left", "middle", "right"] = "left"):

Move the mouse to a location and double click a mouse button. Dispatches mousemove, mousedown and mouseup events.

Examples:

mouse_dblclick(5, 236)
mouse_dblclick(87.5, 354, 'right')

def mouse_drag_and_drop(from_x: float, from_y: float, to_x: float, to_y: float):

Drag and drop from a location to a location. Uses absolute client coordinates in pixels. Dispatches mousemove, mousedown and mouseup events.

Examples:

mouse_drag_and_drop(10.7, 325, 235.6, 24.54)

def keyboard_press(key: str):

Press a combination of keys. Accepts the logical key names that are emitted in the keyboardEvent.key property of the keyboard events: Backquote, Minus, Equal, Backslash, Backspace, Tab, Delete, Escape, ArrowDown, End, Enter, Home, Insert, PageDown, PageUp, ArrowRight, ArrowUp, F1 - F12, Digit0 - Digit9, KeyA - KeyZ, etc. You can alternatively specify a single character you'd like to produce such as "a" or "#". Following modification shortcuts are also supported: Shift, Control, Alt, Meta.

Examples:

keyboard_press('Backspace')
keyboard_press('Control+a')
keyboard_press('Meta+Shift+t')

def keyboard_up(key: str):

Release a keyboard key. Dispatches a keyup event. Accepts the logical key names that are emitted in the keyboardEvent.key property of the keyboard events: Backquote, Minus, Equal, Backslash, Backspace, Tab, Delete, Escape, ArrowDown, End, Enter, Home, Insert, PageDown, PageUp, ArrowRight, ArrowUp, F1 - F12, Digit0 - Digit9, KeyA - KeyZ, etc. You can alternatively specify a single character you'd like to produce such as "a" or "#".

Examples:

keyboard_up('Shift')
keyboard_up('c')

def keyboard_down(key: str):

Press and holds a keyboard key. Dispatches a keydown event. Accepts the logical key names that are emitted in the keyboardEvent.key property of the keyboard events: Backquote, Minus, Equal, Backslash, Backspace, Tab, Delete, Escape, ArrowDown, End, Enter, Home, Insert, PageDown, PageUp, ArrowRight, ArrowUp, F1 - F12, Digit0 - Digit9, KeyA - KeyZ, etc. You can alternatively specify a single character such as "a" or "#".

Examples:

keyboard_down('Shift')
keyboard_down('c')

keyboard_type(text: str):

Types a string of text through the keyboard. Sends a keydown, keypress/input, and keyup event for each character in the text. Modifier keys DO NOT affect keyboard_type. Holding down Shift will not type the text in upper case.

Examples:

keyboard_type('Hello world!')

keyboard_insert_text(text: str):

Insert a string of text in the currently focused element. Dispatches only input event, does not emit the keydown, keyup or keypress events. Modifier keys DO NOT affect keyboard_insert_text. Holding down Shift will not type the text in upper case.

Examples:

keyboard_insert_text('Hello world!')

goto(url: str):

Navigate to a url.

Examples:

goto('http://www.example.com')

def go_back():

Navigate to the previous page in history.

Examples: go_back()

def go_forward():

Navigate to the next page in history.

Examples: go_forward()

def new_tab():

Open a new tab. It will become the active one.

Examples: new_tab()

def tab_close():

Close the current tab.

Examples: tab_close()

def tab_focus(index: int):

Bring tab to front (activate tab).

Examples: tab_focus(2)

def upload_file(bid: str, file: str | list[str]):

Click an element and wait for a "filechooser" event, then select one or multiple input files for upload. Relative file paths are resolved relative to the current working directory. An empty list clears the selected files.

Examples:

upload_file("572", "my_receipt.pdf")upload_file("63", ["/home/bob/Documents/image.jpg", "/home/bob/Documents/file.zip"])

def mouse_upload_file(x: float, y: float, file: str | list[str]):

Click a location and wait for a "filechooser" event, then select one or multiple input files for upload. Relative file paths are resolved relative to the current working directory. An empty list clears the selected files.

Examples:

mouse_upload_file(132.1, 547, "my_receipt.pdf")
mouse_upload_file(328, 812, ["/home/bob/Documents/image.jpg", "/home/bob/Documents/file.zip"])

Browsing Agent Details

The following shows an example prompt containing all the information the agent needs at the current step to predict the next browsing actions. Note that we also instruct the agent to predict multiple actions in one turn if it thinks they are meant to be executed sequentially without any feedback from the page. This can save turns for common workflows that consist of a sequence of actions on the same page with no observation change, such as filling in the username and password and submitting on a login page, as illustrated below; the full example prompt then follows.
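As a hedged illustration of such a multi-action prediction (the element bids and values below are made up), the agent’s response for a login page might look like:

```
In order to accomplish my goal I need to log in. Filling the username and
password and clicking the submit button require no intermediate feedback from
the page, so I predict them together.

fill('u12', 'alice')
fill('p13', 'hunter2')
click('s14')
```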

Instructions

Review the current state of the page and all other information to find the best possible next action to accomplish your goal. Your answer will be interpreted and executed by a program, make sure to follow the formatting instructions.

Goal:

Browse localhost:8000, and tell me the ultimate answer to life. Do not ask me for confirmation at any point.

Action Space

16 different types of actions are available.

noop(wait_ms: float = 1000)

Examples:

noop()

noop(500)

send_msg_to_user(text: str)

Examples:

send_msg_to_user('Based on the results of my search, the city was built in 1751.')

scroll(delta_x: float, delta_y: float)

Examples:

scroll(0, 200)
scroll(-50.2, -100.5)

fill(bid: str, value: str)

Examples:

fill('237', 'example value')
fill('45', 'multi-line\nexample')
fill('a12', 'example with "quotes"')

select_option(bid: str, options: str | list[str])

Examples:

select_option('48', 'blue')
select_option('48', ['red', 'green', 'blue'])

click(bid: str, button: Literal['left', 'middle', 'right'] = 'left', modifiers: list[typing.Literal['Alt', 'Control', 'Meta', 'Shift']] = [])

Examples:

click('51')
click('b22', button='right')
click('48', button='middle', modifiers=['Shift'])

dblclick(bid: str, button: Literal['left', 'middle', 'right'] = 'left', modifiers: list[typing.Literal['Alt', 'Control', 'Meta', 'Shift']] = [])

Examples:

dblclick('12')
dblclick('ca42', button='right')
dblclick('178', button='middle', modifiers=['Shift'])

hover(bid: str)

Examples:

hover('b8')

press(bid: str, key_comb: str)

Examples:

press('88', 'Backspace')
press('a26', 'Control+a')
press('a61', 'Meta+Shift+t')

focus(bid: str)

Examples: focus('b455')

clear(bid: str)

Examples: clear('996')

drag_and_drop(from_bid: str, to_bid: str)

Examples: drag_and_drop('56', '498')

upload_file(bid: str, file: str | list[str])

Examples: upload_file('572', 'my_receipt.pdf')
upload_file('63', ['/home/bob/Documents/image.jpg', '/home/bob/Documents/file.zip'])

go_back()

Examples: go_back()

go_forward()

Examples: go_forward()

goto(url: str)

Examples: goto('http://www.example.com')

Multiple actions can be provided at once. Example:

fill('a12', 'example with "quotes"')

click('51')

click('48', button='middle', modifiers=['Shift'])

Multiple actions are meant to be executed sequentially without any feedback from the page.

Don't execute multiple actions at once if you need feedback from the page.

Current Accessibility Tree:

RootWebArea 'The Ultimate Answer', focused
  [8] heading 'The Ultimate Answer'
  [9] paragraph ''
    StaticText 'Click the button to reveal the answer to life, the universe, and everything.'
  [10] button 'Click me', clickable

Previous Actions

goto('http://localhost:8000')

Here is an example with a chain of thought of a valid action when clicking on a button:

"In order to accomplish my goal I need to click on the button with bid 12"

click("12")

And an example response to the above prompt is:


In order to accomplish my goal, I need to click on the button with bid 10 to reveal the answer to life, the universe, and everything.

click("10")

For the evaluation on the WebArena benchmark, since some of the tasks check for an exact match on the agent’s message back to the user, we add the following instruction so that the agent replies with only a concise answer string when messaging the user, preventing it from failing a test due to extra text:

Here is another example with a chain of thought of a valid action when providing a concise answer to the user:

"In order to accomplish my goal I need to send the information asked back to the user. This page lists the information of HP Inkjet Fax Machine, which is the product identified in the objective. Its price is $279.49. I will send a message back to the user with the answer."

send_msg_to_user("$279.49")
