R2E: Turning any GitHub Repository into a Programming Agent Environment

UC Berkeley
*Equal Contribution


Current code generation systems are evaluated on static benchmarks like HumanEval, which consist of isolated code snippets and lack real-world aspects of programming such as dealing with large codebases, dependencies, and execution environments. While GitHub repositories provide a rich source of real-world codebases, evaluating code generation systems on them is challenging due to the lack of test harnesses associated with the code. We present R2E, a scalable framework that turns any GitHub repository into an environment for programming agents. These environments can be used for benchmarking programming agents that interact with an interpreter on repository-level problems. Further, R2E also enables collecting large-scale execution traces for improving LLMs themselves.


R2E is a framework that turns any GitHub repository into a programming agent environment. This is achieved by a novel equivalence test generation technique that combines program analysis with LLMs. These environments can be used for benchmarking both static LLMs and dynamic programming agents that interact with an interpreter on real-world (potentially unseen) codebases. Furthermore, R2E environments can be used to improve LLMs themselves by fine-tuning models on execution traces from such real-world codebases.

👆 See R2E turn a few GitHub repositories into a benchmark.

Test Harness Generation: The 🗝️ to Environments

The R2E framework generates test harnesses for any GitHub repository by following these key design principles:
  • Equivalence Tests, not Output Prediction. R2E decouples input generation from output prediction by generating equivalence tests that use the ground truth function implementation to produce the expected outputs, a much simpler task than predicting the output of a function.
  • Harnesses, not I/O Pairs. R2E generates test harnesses that contain the test cases along with a setup that prepares additional resources such as database connections, external files, and configurations. This is a departure from the I/O examples with primitive types used in traditional benchmarks.
  • Sliced Context, not Entire Repositories. R2E uses a novel dependency-slicing-based prompt construction approach that provides the minimal repository context required to understand the function implementation. Empirically, we find it to be both cheaper and more effective than using the entire repository context.
  • Execution and Coverage for Quality. The generated test harnesses are run in a self-equivalence mode to filter wrong or stochastic tests. Additionally, we evaluate the quality of the generated tests using branch coverage.
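To make the first, second, and fourth principles concrete, here is a minimal sketch of what an equivalence test harness could look like. It is not R2E's actual generated output: the functions `reference_load_scores` and `candidate_load_scores`, the `EquivalenceHarness` class, and the `run_harness` helper are all hypothetical names invented for illustration. The harness prepares an external file in its setup, asks the reference implementation for the expected value, and compares the candidate against it.

```python
import json
import os
import tempfile
import unittest


def reference_load_scores(path):
    """Hypothetical ground-truth function, standing in for code from a repository."""
    with open(path) as f:
        records = json.load(f)
    return sorted((r["name"], r["score"]) for r in records if r["score"] >= 50)


def candidate_load_scores(path):
    """Hypothetical implementation under test, e.g. written by an LLM."""
    with open(path) as f:
        records = json.load(f)
    passing = [r for r in records if r["score"] >= 50]
    return sorted((r["name"], r["score"]) for r in passing)


class EquivalenceHarness(unittest.TestCase):
    # Implementation being checked; run_harness() swaps this attribute.
    candidate = staticmethod(candidate_load_scores)

    def setUp(self):
        # Harness setup: create the external file the function depends on.
        self.tmpdir = tempfile.TemporaryDirectory()
        self.path = os.path.join(self.tmpdir.name, "scores.json")
        rows = [
            {"name": "a", "score": 72},
            {"name": "b", "score": 31},
            {"name": "c", "score": 50},
        ]
        with open(self.path, "w") as f:
            json.dump(rows, f)

    def tearDown(self):
        self.tmpdir.cleanup()

    def test_matches_reference(self):
        # No expected output is hard-coded: the reference computes it.
        expected = reference_load_scores(self.path)
        actual = self.candidate(self.path)
        self.assertEqual(expected, actual)


def run_harness(candidate):
    """Run the harness against `candidate`; return True if every test passes."""
    EquivalenceHarness.candidate = staticmethod(candidate)
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(EquivalenceHarness)
    result = unittest.TestResult()
    suite.run(result)
    return result.wasSuccessful()
```

Calling `run_harness(reference_load_scores)` corresponds to the self-equivalence mode described above: a test that fails when the reference is compared with itself is either wrong or non-deterministic and can be discarded.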

The following figure depicts an example test harness generated by R2E.

Test Generation

We evaluate the quality of the generated test harnesses by comparing the validity and coverage of the generated tests. The following table compares the quality of tests generated in a simple output prediction setting against our equivalence test generation setting. As we see, the equivalence tests achieve better validity because input generation is decoupled from output prediction. We also study different context creation approaches and find that sliced context is the most effective (see paper for details).
Strategy        Validity   Coverage   Validity   Coverage
Output Pred.    35.43%     87.59%     30.68%     82.54%
Equivalence     52.37%     88.18%     35.01%     79.65%
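The sliced context mentioned above can be illustrated with a small sketch. This is a simplified stand-in and not R2E's dependency slicer: it uses only Python's standard `ast` module, looks at a single module, and follows only references to top-level functions and classes. The names `slice_context` and `MODULE` are hypothetical.

```python
import ast


def slice_context(source: str, target: str) -> str:
    """Return only the top-level definitions that `target` transitively references."""
    tree = ast.parse(source)
    # Map each top-level function or class name to its AST node.
    defs = {
        node.name: node
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    }
    needed = set()
    worklist = [target]
    while worklist:
        name = worklist.pop()
        if name in needed or name not in defs:
            continue
        needed.add(name)
        # Every identifier used inside this definition is a potential dependency.
        for node in ast.walk(defs[name]):
            if isinstance(node, ast.Name) and node.id in defs:
                worklist.append(node.id)
    # Emit the kept definitions in their original source order.
    kept = [node for node in tree.body if getattr(node, "name", None) in needed]
    return "\n\n".join(ast.unparse(node) for node in kept)


# A toy module: `area` depends on `square`, while `unused` is irrelevant.
MODULE = '''
def square(x):
    return x * x

def unused(x):
    return x - 1

def area(side):
    return square(side)
'''

sliced = slice_context(MODULE, "area")
print(sliced)
```

The printed slice contains `square` and `area` but not `unused`, so a prompt built from it is shorter than one built from the whole module. A slicer for real repositories would also need to follow imports across files, attribute accesses, and global variables, which this sketch ignores. Note that `ast.unparse` requires Python 3.9 or later.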

Evaluating Real-world Code Generation

R2E environments enable evaluating code generation systems. In particular, we can evaluate both single-round "static" code generation from LLMs and multi-round programming agents. We construct R2E-Eval1, a large-scale benchmark spanning 137 repositories that can be used to evaluate code generation systems. The figure on the bottom left compares the single-round performance of current LLMs, and the figure on the bottom right depicts the performance of self-repair agents over multiple rounds.
Pass@1 on R2E-Eval1
Repair Rate on R2E-Eval1
As we see, model performance on R2E-Eval1 is considerably lower than on traditional benchmarks like HumanEval, highlighting the challenges of real-world codebases. Next, we find that self-repair agents can significantly improve the performance of LLMs using oracle mistake feedback and compiler interactions over multiple turns. GPT-4 can fix more than 30% of the mistakes after 3 turns! We will release a larger and more challenging benchmark (R2E-Eval2) soon with more detailed results. Stay tuned!
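The multi-turn self-repair setting can be sketched as a loop in which failing equivalence tests are fed back to the model. The code below is only an illustration under several assumptions: `scripted_model` is a hypothetical stub that stands in for an LLM call, `reference_abs` stands in for a repository's ground-truth function, and the feedback string format is invented here.

```python
from typing import Callable, Optional, Tuple


def reference_abs(x: int) -> int:
    """Hypothetical ground-truth implementation."""
    return abs(x)


def run_equivalence_tests(code: str) -> Optional[str]:
    """Execute candidate code; return a feedback message, or None if all tests pass."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # a real environment would sandbox this step
        for x in range(-3, 4):
            expected = reference_abs(x)
            actual = namespace["my_abs"](x)
            if actual != expected:
                return f"my_abs({x}) returned {actual}, expected {expected}"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def self_repair(
    generate: Callable[[Optional[str]], str], max_turns: int = 3
) -> Tuple[Optional[str], int]:
    """Ask for code, test it, and pass failures back for up to `max_turns` turns."""
    feedback = None
    for turn in range(1, max_turns + 1):
        code = generate(feedback)
        feedback = run_equivalence_tests(code)
        if feedback is None:
            return code, turn
    return None, max_turns


def scripted_model(feedback: Optional[str]) -> str:
    """Stub model: the first attempt is buggy, the attempt after feedback is fixed."""
    if feedback is None:
        return "def my_abs(x):\n    return x\n"
    return "def my_abs(x):\n    return -x if x < 0 else x\n"


fixed_code, turns_used = self_repair(scripted_model)
print(turns_used)
```

With this stub, the first attempt fails on a negative input, the feedback message is passed back, and the second attempt passes, so the loop stops after two turns.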

Beyond Code Generation with Environments

R2E environments enable applications beyond code generation. Powered by equivalence tests, R2E can inherently verify program transformations such as code optimizations, refactoring, and transpilation. These tests simply check whether the transformed version of the code is equivalent to the original code under a set of automatically generated high-coverage test cases. Below we demonstrate how R2E environments can be used to optimize real-world code:
Here, we show a demo of how a programming agent optimizes code in an arbitrary Python repository (PythonRobotics). R2E enables this by first building an environment for the repository and generating equivalence tests that verify whether the agent's code changes are correct. The agent then interacts with the environment to optimize the code by 1.8x 🚀!
In a more complex example, we show an agent optimizing a method from R2E's own repository!! In fact, this was an optimization opportunity suggested in our code review! Again, R2E scales to arbitrary repositories and generates high-coverage equivalence tests for functions and methods. R2E additionally simplifies the execution of generated or edited code from LLMs in a repository setting. Failing equivalence tests provide feedback that lets the agent attempt multiple rounds to reach the optimal solution, with speedups of up to 239x on adversarial performance tests!
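The idea of verifying an optimization with equivalence tests and then measuring the gain can be sketched as follows. This toy example is not taken from the demos above: the two prefix-sum functions, the random input generator, and the `equivalent` and `speedup` helpers are all hypothetical, and the timing approach is an assumption rather than R2E's performance-test setup.

```python
import itertools
import random
import timeit


def prefix_sums_original(values):
    """Quadratic-time original: re-sums the prefix at every position."""
    return [sum(values[: i + 1]) for i in range(len(values))]


def prefix_sums_optimized(values):
    """Linear-time rewrite, standing in for an agent's proposed edit."""
    return list(itertools.accumulate(values))


def equivalent(original, transformed, trials=50, seed=0):
    """Check that both functions agree on randomly generated integer inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        values = [rng.randint(-100, 100) for _ in range(rng.randint(0, 30))]
        if original(values) != transformed(values):
            return False
    return True


def speedup(original, transformed, values, repeats=5):
    """Ratio of the best-of-`repeats` running times (original / transformed)."""
    t_original = min(timeit.repeat(lambda: original(values), number=1, repeat=repeats))
    t_transformed = min(timeit.repeat(lambda: transformed(values), number=1, repeat=repeats))
    return t_original / t_transformed


if equivalent(prefix_sums_original, prefix_sums_optimized):
    ratio = speedup(prefix_sums_original, prefix_sums_optimized, list(range(2000)))
    print(f"equivalent; measured speedup: {ratio:.1f}x")
else:
    print("transformation rejected: outputs differ from the original")
```

The equivalence check gates the timing step, so a faster but incorrect rewrite is rejected before its speedup is ever reported. Integer inputs are used so that exact equality is a safe comparison; floating-point code would need a tolerance.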


    title={R2E: Turning any Github Repository into a Programming Agent Environment},
    author={Naman Jain and Manish Shetty and Tianjun Zhang and King Han and Koushik Sen and Ion Stoica},
    booktitle={ICML 2024},