
Integrate ae-agent into ArtEval benchmark #131

Open

Couen wants to merge 3 commits into sys-intelligence:main from Couen:feature/ae-agent-arteval-bench

Conversation

@Couen commented Feb 12, 2026 (Collaborator)

Summary

  • Sync ae-agent runner logic from the standalone ae-agent repo into benchmarks/arteval_bench/src/agents/ae_agent.
  • Wire ae_agent into benchmarks/arteval_bench/src/main.py so it can be selected via -a ae_agent (a dispatch sketch follows this list).
  • Update benchmarks/arteval_bench/src/run_eval_in_env.py to treat ae_agent as a long-running agent: pass tasks via /agent/current_task.txt, stream live logs, and support Anthropic Foundry env vars.
  • Add README_ae_agent.md and run_ae_agent.sh under benchmarks/arteval_bench/data/benchmark to document and simplify running ArtEval with ae_agent.
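A minimal sketch of how the -a ae_agent selection could look in main.py, assuming each agent module under src/agents exposes a run_eval entry point (the commit message below names run_eval for ae_agent); KNOWN_AGENTS and the exact run_eval signature are illustrative, not taken from this PR:

```python
# Hypothetical sketch of the -a/--agent dispatch in
# benchmarks/arteval_bench/src/main.py; actual names in the PR may differ.
import argparse
import importlib

KNOWN_AGENTS = ("ae_agent", "claude_sdk")  # assumed set of selectable agents

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--input", required=True, help="tasks .jsonl file")
    parser.add_argument("-a", "--agent", choices=KNOWN_AGENTS, required=True)
    parser.add_argument("-m", "--model", required=True)
    parser.add_argument("-o", "--output", required=True)
    args = parser.parse_args()

    # Each agent is assumed to live under src/agents/<name> and expose
    # run_eval(...); the keyword arguments here are illustrative.
    agent = importlib.import_module(f"agents.{args.agent}")
    agent.run_eval(args.input, model=args.model, output_dir=args.output)

if __name__ == "__main__":
    main()
```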

Details

  • ae_agent now shares the same long-running behavior as claude_sdk (48h timeout, _agent_eval removal before run and re-upload before evaluation, container kept running for inspection).
  • For ae_agent we avoid passing large task strings directly through the shell by uploading the task to /agent/current_task.txt (see the sketch after this list).
  • Foundry environments are supported via ANTHROPIC_FOUNDRY_API_KEY, ANTHROPIC_FOUNDRY_BASE_URL, and CLAUDE_CODE_USE_FOUNDRY.
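A minimal sketch of the long-running flow described above, driving the docker CLI through subprocess; run_ae_agent_task and the ae-agent command line are illustrative stand-ins for the PR's actual plumbing in run_eval_in_env.py:

```python
# Hypothetical sketch of the ae_agent path in run_eval_in_env.py; the PR's
# actual container handling may differ.
import os
import subprocess
import tempfile

AE_AGENT_TIMEOUT_S = 48 * 60 * 60  # 48h, matching the claude_sdk behavior

def run_ae_agent_task(container_id: str, task_text: str) -> None:
    # Avoid passing a large task string through the shell: write it to a
    # temp file and copy it to a well-known path inside the container.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write(task_text)
        host_path = f.name
    subprocess.run(
        ["docker", "cp", host_path, f"{container_id}:/agent/current_task.txt"],
        check=True,
    )

    # Forward Anthropic Foundry configuration when present on the host.
    env_flags = []
    for key in ("ANTHROPIC_FOUNDRY_API_KEY",
                "ANTHROPIC_FOUNDRY_BASE_URL",
                "CLAUDE_CODE_USE_FOUNDRY"):
        if key in os.environ:
            env_flags += ["-e", f"{key}={os.environ[key]}"]

    # Run the agent with live log streaming (inherited stdout/stderr); the
    # container is left running afterwards so it can be inspected.
    subprocess.run(
        ["docker", "exec", *env_flags, container_id,
         "ae-agent", "--task-file", "/agent/current_task.txt"],  # hypothetical CLI
        check=True,
        timeout=AE_AGENT_TIMEOUT_S,
    )
```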

Testing

  • python benchmarks/arteval_bench/src/main.py -i benchmarks/arteval_bench/data/benchmark/arteval_tasks.jsonl -a ae_agent -m claude-sonnet-4-5-20250929 -o ./outputs/ae_agent_smoke_test (basic smoke test).
  • benchmarks/arteval_bench/data/benchmark/run_ae_agent.sh (helper wrapper around the same command).

Made with Cursor

@Couen requested a review from xuafeng on February 12, 2026 at 10:01
@Couen force-pushed the feature/ae-agent-arteval-bench branch 2 times, most recently from 030f7af to f5c3bab on February 12, 2026 at 10:11
@Couen force-pushed the feature/ae-agent-arteval-bench branch from f5c3bab to b99ebc4 on February 12, 2026 at 10:16
bastoica and others added 2 commits February 12, 2026 10:18
…rmat

- Add run_eval.py and main.py to ae_agent for running tasks on host (env=local)
  or in Docker; run_eval(env, ...) is the single entry point.
- Expand utils.py with helpers for main/run_eval (safe_task_id, env_from_item,
  resolve_project_path, Tee, write_task_report, compute_and_write_summary).
- Update ae_agent README with host mode usage and new file descriptions.
- Unify arteval_tasks.jsonl to new format: artifact_id, artifact_dir,
  artifact_readme, artifact_url, env, gpu; remove evaluator/expected_score
  (a parsing sketch follows the sample task lines below).
- Ignore duplicate task list copies (arteval_tasks copy*.jsonl) in .gitignore.

Co-authored-by: Cursor <cursoragent@cursor.com>
{"artifact_id": "osdi24_anvil", "artifact_dir": "osdi24_anvil", "artifact_readme": "osdi24_anvil/anvil/README.md", "artifact_url": "https://github.com/anvil-verifier/anvil", "evaluator": "osdi24_anvil/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "sosp23_acto", "artifact_dir": "sosp23_acto", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/xlab-uiuc/acto", "evaluator": "sosp23_acto/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "eurosys25_egwalker", "artifact_dir": "eurosys25_egwalker", "artifact_readme": "eurosys25_egwalker/egwalker/README.md", "artifact_url": "https://github.com/josephg/egwalker-paper", "evaluator": "eurosys25_egwalker/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"} No newline at end of file
{"artifact_id": "sosp24_wasabi", "artifact_dir": "sosp24_wasabi", "artifact_readme": "sosp24_wasabi/wasabi/README.md", "artifact_url": "https://github.com/bastoica/wasabi/tree/sosp24-ae", "env": "bastoica/ae-agent-ubuntu24.04:latest", "gpu": false}
Collaborator commented:
This is a benchmark. I think we need evaluator and expected_score, right?

@xuafeng commented Feb 13, 2026 (Collaborator)

@Couen Please continue to work on this PR.

