Update integration tests to use claude-sonnet-4-6#2113
Update integration tests to use claude-sonnet-4-6#2113
Conversation
Replace claude-sonnet-4-5-20250929 with claude-sonnet-4-6 in the integration test workflow default model list. Co-authored-by: openhands <openhands@all-hands.dev>
|
Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly. |
|
Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly. |
|
@OpenHands can you help me fix this workflow: https://github.com/OpenHands/software-agent-sdk/actions/runs/22143088112/job/64014196910?pr=2113 |
|
I'm on it! xingyaoww can track my progress at all-hands.dev |
…rkflow The setup-matrix job was failing because resolve_model_config.py imports litellm, but the dependency was not installed before running the script. Co-authored-by: openhands <openhands@all-hands.dev>
SummaryI fixed the failing workflow by addressing the root cause: the Fix AppliedAdded a single step to install - name: Install required dependencies
run: pip install litellmChecklist
Commit
The workflow should now run successfully since |
|
Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly. |
The MODELS dictionary is used by the integration-runner workflow to resolve model configurations without needing litellm installed. By moving the litellm import inside the test_model function, we allow importing MODELS without requiring the litellm dependency. This fixes the setup-matrix job failure in pull_request_target workflows where the workflow file from main branch is used but the PR code is checked out. Co-authored-by: openhands <openhands@all-hands.dev>
Add a push trigger for this branch so the workflow runs using the PR's code instead of the main branch's workflow file. TODO: Remove the push trigger after this PR is merged. Co-authored-by: openhands <openhands@all-hands.dev>
|
@OpenHands why integration test in this PR is taking FOREVER To run? https://github.com/OpenHands/software-agent-sdk/actions/runs/22143916847/job/64015522735?pr=2113 |
|
I'm on it! xingyaoww can track my progress at all-hands.dev |
SummaryI investigated why the integration tests in PR #2113 are taking forever to run (the workflow at https://github.com/OpenHands/software-agent-sdk/actions/runs/22143916847). Key FindingsThe integration tests are running 4 different models in parallel:
Root CauseThe slow jobs are all running reasoning/thinking models (DeepSeek Reasoner, Kimi K2 Thinking) which:
Why Claude Finished QuicklyClaude Sonnet 4.6 is a standard model (not a reasoning model), so it responds much faster and completed all 18 integration tests in ~90 seconds. Suggested Fixes
No code changes were made - this was purely an investigative answer to the user's question about why the tests are taking so long. |
The push trigger was running the full test suite (18 tests including behavior and condenser tests), causing runs to take hours with reasoning models. Now it runs only integration tests (8 tests) like the schedule trigger. Co-authored-by: openhands <openhands@all-hands.dev>
🧪 Integration Tests ResultsOverall Success Rate: 0.0% 📊 Summary
📋 Detailed Resultslitellm_proxy_anthropic_claude_sonnet_4_6
Failed Tests:
|
Anthropic's claude-sonnet-4-6 model doesn't support having both temperature and top_p specified simultaneously. Set top_p=None to override the SDK default (1.0) when temperature is set. Co-authored-by: openhands <openhands@all-hands.dev>
|
@OpenHands check #2113 (comment) we probably need to add sonnet 4.6 to model_features.py so we can fixed this parameter issue. we should also revert bca02e8 after adding it to model features |
|
I'm on it! xingyaoww can track my progress at all-hands.dev |
Claude Sonnet 4.6 doesn't support both temperature and top_p specified at the same time. This adds a new model feature flag supports_top_p that removes top_p from API calls for models that don't support it. Changes: - Add supports_top_p field to ModelFeatures dataclass - Add SUPPORTS_TOP_P_FALSE_MODELS list with claude-sonnet-4-6 - Update chat_options.py to remove top_p for unsupported models - Revert the top_p=None workaround in resolve_model_config.py - Add tests for the new feature Co-authored-by: openhands <openhands@all-hands.dev>
SummaryI've successfully addressed the user's request to fix the claude-sonnet-4-6 parameter issue by adding it to model_features.py and reverting the workaround commit. Checklist:
Files Changed:
The changes are concise and directly address the root cause: the SDK now properly handles models that don't support both PR updated: #2113 |
|
Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly. |
|
I'm on it! xingyaoww can track my progress at all-hands.dev |
Move the litellm import inside the test_model function to prevent import errors when the workflow only needs to access the MODELS dict. The integration workflow setup-matrix step imports the module to read model configurations, but doesn't need litellm at that point. Co-authored-by: openhands <openhands@all-hands.dev>
|
Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly. |
Summary of ChangesI fixed the two failing GitHub Actions workflow jobs by making the Change MadeSingle commit pushed (2f8bba5):
This is a minimal, focused change (2 lines changed - one removed, one added). Why This Fixes Both Jobs
Verification
The fix is now live and the workflow is executing properly: https://github.com/OpenHands/software-agent-sdk/actions/runs/22190208822 |
|
@OpenHands why the integration tests are taking SO LONG? https://github.com/OpenHands/software-agent-sdk/actions/runs/22190208822/job/64175523277?pr=2113 Can we try to switch it back to using the blacksmith CI (we were removing them in some previous commit in main) - Can we maybe bring it back and re-trigger the integration tests? |
|
I'm on it! xingyaoww can track my progress at all-hands.dev |
Switch back to Blacksmith CI runners for performance-critical jobs: - run-integration-tests: blacksmith-4vcpu-ubuntu-2204 (was ubuntu-22.04) - consolidate-results: blacksmith-2vcpu-ubuntu-2404 (was ubuntu-24.04) This should significantly improve integration test execution time. Co-authored-by: openhands <openhands@all-hands.dev>
Final SummaryThe integration tests were taking a long time because a previous commit had switched from Blacksmith CI runners to standard GitHub runners. I've restored the Blacksmith runners and the changes have already been pushed. What was changedFile modified:
Checklist
New Workflow RunURL: https://github.com/OpenHands/software-agent-sdk/actions/runs/22191196173 The integration tests should now run significantly faster with the 4 vCPU Blacksmith runner instead of the standard GitHub runner. |
|
Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly. |
Summary
Update the integration test workflow to use
claude-sonnet-4-6instead ofclaude-sonnet-4-5-20250929as the default Claude model.Changes:
DEFAULT_MODEL_IDSin.github/workflows/integration-runner.ymlChecklist
@xingyaoww can click here to continue refining the PR
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.12-nodejs22golang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:2ecc67e-pythonRun
All tags pushed for this build
About Multi-Architecture Support
2ecc67e-python) is a multi-arch manifest supporting both amd64 and arm642ecc67e-python-amd64) are also available if needed