
Integrate ae-agent into ArtEval benchmark #131

Open

Couen wants to merge 3 commits into sys-intelligence:main from Couen:feature/ae-agent-arteval-bench

Conversation

@Couen commented Feb 12, 2026 (Collaborator)

Summary

  • Sync ae-agent runner logic from the standalone ae-agent repo into benchmarks/arteval_bench/src/agents/ae_agent.
  • Wire ae_agent into benchmarks/arteval_bench/src/main.py so it can be selected via -a ae_agent (a dispatch sketch follows this list).
  • Update benchmarks/arteval_bench/src/run_eval_in_env.py to treat ae_agent as a long-running agent: pass tasks via /agent/current_task.txt, stream live logs, and support Anthropic Foundry env vars.
  • Add README_ae_agent.md and run_ae_agent.sh under benchmarks/arteval_bench/data/benchmark to document and simplify running ArtEval with ae_agent.
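A minimal sketch of how the -a ae_agent selection could look in main.py, assuming each agent module under src/agents exposes a run_eval entry point (the commit message below names run_eval for ae_agent); KNOWN_AGENTS and the exact run_eval signature are illustrative, not taken from this PR:

```python
# Hypothetical sketch of the -a/--agent dispatch in
# benchmarks/arteval_bench/src/main.py; actual names in the PR may differ.
import argparse
import importlib

KNOWN_AGENTS = ("ae_agent", "claude_sdk")  # assumed set of selectable agents

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--input", required=True, help="tasks .jsonl file")
    parser.add_argument("-a", "--agent", choices=KNOWN_AGENTS, required=True)
    parser.add_argument("-m", "--model", required=True)
    parser.add_argument("-o", "--output", required=True)
    args = parser.parse_args()

    # Each agent is assumed to live under src/agents/<name> and expose
    # run_eval(...); the keyword arguments here are illustrative.
    agent = importlib.import_module(f"agents.{args.agent}")
    agent.run_eval(args.input, model=args.model, output_dir=args.output)

if __name__ == "__main__":
    main()
```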

Details

  • ae_agent now shares the same long-running behavior as claude_sdk (48h timeout, _agent_eval removal before run and re-upload before evaluation, container kept running for inspection).
  • For ae_agent we avoid passing large task strings directly through the shell by uploading the task to /agent/current_task.txt (see the sketch after this list).
  • Foundry environments are supported via ANTHROPIC_FOUNDRY_API_KEY, ANTHROPIC_FOUNDRY_BASE_URL, and CLAUDE_CODE_USE_FOUNDRY.
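A minimal sketch of the long-running flow described above, driving the docker CLI through subprocess; run_ae_agent_task and the ae-agent command line are illustrative stand-ins for the PR's actual plumbing in run_eval_in_env.py:

```python
# Hypothetical sketch of the ae_agent path in run_eval_in_env.py; the PR's
# actual container handling may differ.
import os
import subprocess
import tempfile

AE_AGENT_TIMEOUT_S = 48 * 60 * 60  # 48h, matching the claude_sdk behavior

def run_ae_agent_task(container_id: str, task_text: str) -> None:
    # Avoid passing a large task string through the shell: write it to a
    # temp file and copy it to a well-known path inside the container.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write(task_text)
        host_path = f.name
    subprocess.run(
        ["docker", "cp", host_path, f"{container_id}:/agent/current_task.txt"],
        check=True,
    )

    # Forward Anthropic Foundry configuration when present on the host.
    env_flags = []
    for key in ("ANTHROPIC_FOUNDRY_API_KEY",
                "ANTHROPIC_FOUNDRY_BASE_URL",
                "CLAUDE_CODE_USE_FOUNDRY"):
        if key in os.environ:
            env_flags += ["-e", f"{key}={os.environ[key]}"]

    # Run the agent with live log streaming (inherited stdout/stderr); the
    # container is left running afterwards so it can be inspected.
    subprocess.run(
        ["docker", "exec", *env_flags, container_id,
         "ae-agent", "--task-file", "/agent/current_task.txt"],  # hypothetical CLI
        check=True,
        timeout=AE_AGENT_TIMEOUT_S,
    )
```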

Testing

  • python benchmarks/arteval_bench/src/main.py -i benchmarks/arteval_bench/data/benchmark/arteval_tasks.jsonl -a ae_agent -m claude-sonnet-4-5-20250929 -o ./outputs/ae_agent_smoke_test (basic smoke test).
  • benchmarks/arteval_bench/data/benchmark/run_ae_agent.sh (helper wrapper around the same command).

Made with Cursor

@Couen requested a review from xuafeng on February 12, 2026 at 10:01
@Couen force-pushed the feature/ae-agent-arteval-bench branch 2 times, most recently from 030f7af to f5c3bab on February 12, 2026 at 10:11
@Couen force-pushed the feature/ae-agent-arteval-bench branch from f5c3bab to b99ebc4 on February 12, 2026 at 10:16
bastoica and others added 2 commits February 12, 2026 10:18
…rmat

- Add run_eval.py and main.py to ae_agent for running tasks on host (env=local)
  or in Docker; run_eval(env, ...) is the single entry point.
- Expand utils.py with helpers for main/run_eval (safe_task_id, env_from_item,
  resolve_project_path, Tee, write_task_report, compute_and_write_summary).
- Update ae_agent README with host mode usage and new file descriptions.
- Unify arteval_tasks.jsonl to new format: artifact_id, artifact_dir,
  artifact_readme, artifact_url, env, gpu; remove evaluator/expected_score
  (a parsing sketch follows the sample task lines below).
- Ignore duplicate task list copies (arteval_tasks copy*.jsonl) in .gitignore.

Co-authored-by: Cursor <cursoragent@cursor.com>
{"artifact_id": "osdi24_anvil", "artifact_dir": "osdi24_anvil", "artifact_readme": "osdi24_anvil/anvil/README.md", "artifact_url": "https://github.com/anvil-verifier/anvil", "evaluator": "osdi24_anvil/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "sosp23_acto", "artifact_dir": "sosp23_acto", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/xlab-uiuc/acto", "evaluator": "sosp23_acto/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"}
{"artifact_id": "eurosys25_egwalker", "artifact_dir": "eurosys25_egwalker", "artifact_readme": "eurosys25_egwalker/egwalker/README.md", "artifact_url": "https://github.com/josephg/egwalker-paper", "evaluator": "eurosys25_egwalker/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"} No newline at end of file
{"artifact_id": "sosp24_wasabi", "artifact_dir": "sosp24_wasabi", "artifact_readme": "sosp24_wasabi/wasabi/README.md", "artifact_url": "https://github.com/bastoica/wasabi/tree/sosp24-ae", "env": "bastoica/ae-agent-ubuntu24.04:latest", "gpu": false}
Collaborator commented:
This is a benchmark. I think we need evaluator and expected_score, right?

@xuafeng commented Feb 13, 2026 (Collaborator)

@Couen Please continue to work on this PR.

