ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development

📄 Paper | 🌐 Blog | 🤗 Dataset

ABC-Bench is a benchmark for Agentic Backend Coding. It evaluates whether code agents can explore real repositories, edit code, configure environments, deploy containerized services, and pass external end-to-end API tests (HTTP-based integration tests) across realistic backend stacks.


📰 News

  • 2026/01/20: Evaluation harness and the full ABC-Bench dataset released on Hugging Face.
  • 2026/01/19: Blog post released, detailing benchmark construction and baseline results.
  • 2026/01/16: Public preprint released: ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development.

[Figure: ABC-Bench main results]


🚀 Why ABC-Bench?

  • End-to-End Lifecycle: repository exploration → code editing/implementation → environment setup → containerized deployment → external end-to-end API verification.
  • Real-World Diversity: 224 tasks curated from 127 MIT-licensed repositories, spanning 8 languages and 19 frameworks.
  • Environment-Aware Tasks: 92 tasks require autonomous environment configuration and containerized service startup.
  • Automated Construction: built via ABC-Pipeline with minimal manual intervention, enabling scalable task creation and future expansions.
  • Challenging Baselines: even state-of-the-art models remain far from fully reliable.

📊 Benchmark Composition

[Figure: ABC-Bench main results]


βš–οΈ Evaluation Protocol

[Figure: ABC-Bench main results]


💾 Dataset Access

Download the full benchmark (tasks, build assets, verification suites) on Hugging Face:
👉 🤗 OpenMOSS-Team/ABC-Bench

After downloading, set --dataset-path to the local dataset root directory.
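One way to fetch the dataset is with the Hugging Face CLI (a sketch; the local directory name `./abc-bench` is an arbitrary choice, not a required path):

```shell
# Download the ABC-Bench dataset snapshot into a local directory.
# Requires the Hugging Face CLI: pip install -U "huggingface_hub[cli]"
huggingface-cli download OpenMOSS-Team/ABC-Bench \
  --repo-type dataset \
  --local-dir ./abc-bench
```

The directory passed to `--local-dir` then serves as the `--dataset-path` value below.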


⚡ Quickstart

1. Prerequisites

  • Docker
  • Python ≥ 3.10

2. Install Terminal-Bench CLI

pip install terminal-bench

Verify:

tb --help

3. Run Evaluation

Replace <DATASET_PATH> with your local dataset root directory downloaded from Hugging Face.

tb run \
  --dataset-path <DATASET_PATH> \
  --agent openhands \
  --model openai/GPT-5 \
  --n-attempts 3 \
  --global-agent-timeout-sec 3600 \
  --global-test-timeout-sec 1800 \
  --n-concurrent 30 \
  --run-id demo
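With `--n-attempts 3`, each task is tried several times, and a task counts as resolved if any attempt passes. The helper below sketches that aggregation; the `(task_id, passed)` pair format is a hypothetical shape for illustration, not the harness's actual output schema:

```python
from collections import defaultdict


def resolved_rate(attempts: list[tuple[str, bool]]) -> float:
    """Fraction of tasks resolved in at least one attempt.

    `attempts` holds one (task_id, passed) pair per attempt -- an
    assumed shape, not terminal-bench's real result format.
    """
    by_task: dict[str, bool] = defaultdict(bool)
    for task_id, passed in attempts:
        # A task is resolved once any of its attempts passes.
        by_task[task_id] = by_task[task_id] or passed
    return sum(by_task.values()) / len(by_task) if by_task else 0.0


# Example: 3 attempts on each of 2 tasks; "a" passes once, "b" never.
runs = [("a", False), ("a", True), ("a", False),
        ("b", False), ("b", False), ("b", False)]
print(resolved_rate(runs))  # 0.5
```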

🧠 SFT Models

We provide two supervised fine-tuned (SFT) models specialized for agentic backend coding tasks:

👉 🤗 OpenMOSS-Team/Qwen3-8B-ABC

👉 🤗 OpenMOSS-Team/Qwen3-32B-ABC


🤝 Contributing

Pull requests and issues are welcome. For substantial changes (new scripts, new baselines, major doc updates), please open an issue first.


πŸ“ Citation

@misc{yang2026abcbenchbenchmarkingagenticbackend,
      title={ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development}, 
      author={Jie Yang and Honglin Guo and Li Ji and Jiazheng Zhou and Rui Zheng and Zhikai Lei and Shuo Zhang and Zhiheng Xi and Shichun Liu and Yuxin Wang and Bo Wang and Yining Zheng and Tao Gui and Xipeng Qiu},
      year={2026},
      eprint={2601.11077},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2601.11077}, 
}

πŸ™ Acknowledgements

ABC-Bench is built from MIT-licensed open-source repositories. We thank the maintainers and contributors whose work makes realistic evaluation possible.
