Integrate ae-agent into ArtEval benchmark #131
Open
Couen wants to merge 3 commits into sys-intelligence:main
Conversation
Couen force-pushed from 030f7af to f5c3bab
Couen force-pushed from f5c3bab to b99ebc4
…rmat

- Add run_eval.py and main.py to ae_agent for running tasks on host (env=local) or in Docker; run_eval(env, ...) is the single entry point.
- Expand utils.py with helpers for main/run_eval (safe_task_id, env_from_item, resolve_project_path, Tee, write_task_report, compute_and_write_summary).
- Update ae_agent README with host mode usage and new file descriptions.
- Unify arteval_tasks.jsonl to new format: artifact_id, artifact_dir, artifact_readme, artifact_url, env, gpu; remove evaluator/expected_score.
- Ignore duplicate task list copies (arteval_tasks copy*.jsonl) in .gitignore.

Co-authored-by: Cursor <cursoragent@cursor.com>
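For orientation, a minimal sketch of what the single `run_eval(env, ...)` entry point could look like. Only the names `run_eval` and `safe_task_id`, the `env=local` host mode, and the host-vs-Docker split come from the commit message above; every signature and body below is an illustrative assumption, not the PR's actual code.

```python
import json
import subprocess
from pathlib import Path


def safe_task_id(artifact_id: str) -> str:
    """Assumed helper: sanitize an artifact id for use in file and directory names."""
    return "".join(c if c.isalnum() or c in "-_" else "_" for c in artifact_id)


def run_eval(env: str, task: dict, output_dir: Path) -> None:
    """Assumed single entry point: run one task on the host or inside Docker."""
    task_dir = output_dir / safe_task_id(task["artifact_id"])
    task_dir.mkdir(parents=True, exist_ok=True)
    cmd = ["python", "main.py", "--task", json.dumps(task)]  # placeholder invocation
    if env == "local":
        # Host mode (env=local): run the agent directly on this machine.
        subprocess.run(cmd, cwd=task_dir, check=True)
    else:
        # Docker mode: treat env as an image name, per the new task format
        # (e.g. "bastoica/ae-agent-ubuntu24.04:latest").
        subprocess.run(["docker", "run", "--rm", env, *cmd], check=True)
```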
xuafeng reviewed Feb 13, 2026
| {"artifact_id": "osdi24_anvil", "artifact_dir": "osdi24_anvil", "artifact_readme": "osdi24_anvil/anvil/README.md", "artifact_url": "https://github.com/anvil-verifier/anvil", "evaluator": "osdi24_anvil/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"} | ||
| {"artifact_id": "sosp23_acto", "artifact_dir": "sosp23_acto", "artifact_readme": "sosp23_acto/acto/README.md", "artifact_url": "https://github.com/xlab-uiuc/acto", "evaluator": "sosp23_acto/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"} | ||
| {"artifact_id": "eurosys25_egwalker", "artifact_dir": "eurosys25_egwalker", "artifact_readme": "eurosys25_egwalker/egwalker/README.md", "artifact_url": "https://github.com/josephg/egwalker-paper", "evaluator": "eurosys25_egwalker/_agent_eval/main.py", "expected_score": 4, "docer_env": "bastoica/ae-agent-ubuntu24.04:latest"} No newline at end of file | ||
| {"artifact_id": "sosp24_wasabi", "artifact_dir": "sosp24_wasabi", "artifact_readme": "sosp24_wasabi/wasabi/README.md", "artifact_url": "https://github.com/bastoica/wasabi/tree/sosp24-ae", "env": "bastoica/ae-agent-ubuntu24.04:latest", "gpu": false} |
Collaborator
This is a benchmark. I think we need evaluator and expected_score, right?
Collaborator
@Couen Please continue to work on this PR.
Summary
- Add the agent under `benchmarks/arteval_bench/src/agents/ae_agent`.
- Register it in `benchmarks/arteval_bench/src/main.py` so it can be selected via `-a ae_agent` (a hedged sketch of such a registry follows this list).
- Update `benchmarks/arteval_bench/src/run_eval_in_env.py` to treat ae_agent as a long-running agent: pass tasks via `/agent/current_task.txt`, stream live logs, and support Anthropic Foundry env vars.
- Add `README_ae_agent.md` and `run_ae_agent.sh` under `benchmarks/arteval_bench/data/benchmark` to document and simplify running ArtEval with ae_agent.
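A minimal sketch of what agent selection via `-a ae_agent` could look like in main.py, assuming a simple name-to-module registry; the `AGENTS` dict and the import layout are illustrative guesses, not the PR's code.

```python
import argparse

from agents import ae_agent  # assumed module path, per the Summary bullet above

AGENTS = {"ae_agent": ae_agent}  # hypothetical name-to-module registry

parser = argparse.ArgumentParser()
parser.add_argument("-a", "--agent", choices=AGENTS, default="ae_agent")
parser.add_argument("-i", "--input", required=True, help="task JSONL file")
parser.add_argument("-m", "--model", help="model name, e.g. claude-sonnet-4-5-20250929")
parser.add_argument("-o", "--output", required=True, help="output directory")
args = parser.parse_args()

agent = AGENTS[args.agent]  # selected via -a ae_agent
```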
Details

- … (`_agent_eval` removal before the run and re-upload before evaluation; the container is kept running for inspection).
- Tasks are passed to the agent via `/agent/current_task.txt`.
- Supports `ANTHROPIC_FOUNDRY_API_KEY`, `ANTHROPIC_FOUNDRY_BASE_URL`, and `CLAUDE_CODE_USE_FOUNDRY` (see the sketch after this list).
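An illustrative sketch of the long-running-agent protocol described above, assuming the agent runs in an already-started container: hand over the task via `/agent/current_task.txt`, forward the three Foundry variables, and stream logs live. The container name, task payload, and `agent-start` command are placeholders, not the PR's actual code.

```python
import os
import subprocess

container = "ae_agent_run"  # hypothetical container name
task_json = '{"artifact_id": "sosp24_wasabi", "env": "local", "gpu": false}'  # example payload

# Hand the current task to the agent via the agreed-upon file path.
subprocess.run(
    ["docker", "exec", "-i", container, "sh", "-c", "cat > /agent/current_task.txt"],
    input=task_json.encode(),
    check=True,
)

# Forward the Anthropic Foundry variables named above into the agent process.
env_flags = []
for var in ("ANTHROPIC_FOUNDRY_API_KEY", "ANTHROPIC_FOUNDRY_BASE_URL",
            "CLAUDE_CODE_USE_FOUNDRY"):
    if var in os.environ:
        env_flags += ["-e", f"{var}={os.environ[var]}"]
subprocess.run(["docker", "exec", *env_flags, container, "agent-start"],
               check=True)  # "agent-start" is a placeholder command

# Stream the container's logs live instead of waiting for completion.
subprocess.run(["docker", "logs", "-f", container], check=False)
```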
Testing

- `python benchmarks/arteval_bench/src/main.py -i benchmarks/arteval_bench/data/benchmark/arteval_tasks.jsonl -a ae_agent -m claude-sonnet-4-5-20250929 -o ./outputs/ae_agent_smoke_test` (basic smoke test).
- `benchmarks/arteval_bench/data/benchmark/run_ae_agent.sh` (helper wrapper around the same command).

Made with Cursor