ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development

📄 Paper | 🌐 Blog | 🤗 Dataset

ABC-Bench is a benchmark for Agentic Backend Coding. It evaluates whether code agents can explore real repositories, edit code, configure environments, deploy containerized services, and pass external end-to-end API tests (HTTP-based integration tests) across realistic backend stacks.


📰 News

  • 2026/01/20: Evaluation harness and the full ABC-Bench dataset released on Hugging Face.
  • 2026/01/19: Blog post released, detailing benchmark construction and baseline results.
  • 2026/01/16: Public preprint released: ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development.

[Figure: ABC-Bench main results]


🚀 Why ABC-Bench?

  • End-to-End Lifecycle: repository exploration → code editing/implementation → environment setup → containerized deployment → external end-to-end API verification.
  • Real-World Diversity: 224 tasks curated from 127 MIT-licensed repositories, spanning 8 languages and 19 frameworks.
  • Environment-Aware Tasks: 92 tasks require autonomous environment configuration and containerized service startup.
  • Automated Construction: built via ABC-Pipeline with minimal manual intervention, enabling scalable task creation and future expansions.
  • Challenging Baselines: even state-of-the-art models remain far from fully reliable.

📊 Benchmark Composition

[Figure: ABC-Bench main results]


βš–οΈ Evaluation Protocol

[Figure: ABC-Bench main results]


💾 Dataset Access

Download the full benchmark (tasks, build assets, verification suites) on Hugging Face:
👉 🤗 OpenMOSS-Team/ABC-Bench

After downloading, set --dataset-path to the local dataset root directory.
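One way to fetch the dataset is with the Hugging Face CLI (a sketch; the local directory name `./abc-bench` is an arbitrary choice, not a required path):

```shell
# Download the ABC-Bench dataset snapshot into a local directory.
# Requires the Hugging Face CLI: pip install -U "huggingface_hub[cli]"
huggingface-cli download OpenMOSS-Team/ABC-Bench \
  --repo-type dataset \
  --local-dir ./abc-bench
```

The directory passed to `--local-dir` then serves as the `--dataset-path` value below.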


⚡ Quickstart

1. Prerequisites

  • Docker
  • Python ≥ 3.10

2. Install Terminal-Bench CLI

pip install terminal-bench

Verify:

tb --help

3. Run Evaluation

Replace <DATASET_PATH> with your local dataset root directory downloaded from Hugging Face.

tb run \
  --dataset-path <DATASET_PATH> \
  --agent openhands \
  --model openai/GPT-5 \
  --n-attempts 3 \
  --global-agent-timeout-sec 3600 \
  --global-test-timeout-sec 1800 \
  --n-concurrent 30 \
  --run-id demo
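With `--n-attempts 3`, each task is tried several times, and a task counts as resolved if any attempt passes. The helper below sketches that aggregation; the `(task_id, passed)` pair format is a hypothetical shape for illustration, not the harness's actual output schema:

```python
from collections import defaultdict


def resolved_rate(attempts: list[tuple[str, bool]]) -> float:
    """Fraction of tasks resolved in at least one attempt.

    `attempts` holds one (task_id, passed) pair per attempt -- an
    assumed shape, not terminal-bench's real result format.
    """
    by_task: dict[str, bool] = defaultdict(bool)
    for task_id, passed in attempts:
        # A task is resolved once any of its attempts passes.
        by_task[task_id] = by_task[task_id] or passed
    return sum(by_task.values()) / len(by_task) if by_task else 0.0


# Example: 3 attempts on each of 2 tasks; "a" passes once, "b" never.
runs = [("a", False), ("a", True), ("a", False),
        ("b", False), ("b", False), ("b", False)]
print(resolved_rate(runs))  # 0.5
```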

🧠 SFT Models

We provide two supervised fine-tuned (SFT) models specialized for agentic backend coding tasks:

👉 🤗 OpenMOSS-Team/Qwen3-8B-ABC

👉 🤗 OpenMOSS-Team/Qwen3-32B-ABC


🤝 Contributing

Pull requests and issues are welcome. For substantial changes (new scripts, new baselines, major doc updates), please open an issue first.


πŸ“ Citation

@misc{yang2026abcbenchbenchmarkingagenticbackend,
      title={ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development}, 
      author={Jie Yang and Honglin Guo and Li Ji and Jiazheng Zhou and Rui Zheng and Zhikai Lei and Shuo Zhang and Zhiheng Xi and Shichun Liu and Yuxin Wang and Bo Wang and Yining Zheng and Tao Gui and Xipeng Qiu},
      year={2026},
      eprint={2601.11077},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2601.11077}, 
}

πŸ™ Acknowledgements

ABC-Bench is built from MIT-licensed open-source repositories. We thank the maintainers and contributors whose work makes realistic evaluation possible.
