
feat: add SLURM integration test workflow #28

Open
timurcarstensen wants to merge 95 commits into main from slurm-integration-tests

Conversation

@timurcarstensen (Collaborator) commented Jan 30, 2026

Add a GitHub Actions workflow that sets up a real SLURM cluster with Apptainer on a GPU runner to test the schedule_evals workflow end-to-end.

Relevant files:

  • The workflow that sets up SLURM & Apptainer on an AWS EC2 GPU instance is in .github/workflows/build-and-push-apptainer.yml
  • tests/integration contains everything needed to understand our SLURM integration testing. The tests cover (1) dry-run sbatch script generation and setup of the run directory (see the sketch below), (2) dataset download for all datasets required by the tasks in task-groups.yaml, and (3) end-to-end scheduling and execution of the first task in every task group (to save time). The last of these exercises both lm-eval and lighteval tasks.

Regarding test-suite support: this bumps lighteval to the latest version (installed from the GitHub repo) and adapts the launch arguments accordingly.
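For orientation, here is a minimal pytest-style sketch of stage (1), the dry-run check. The entry point name `schedule-evals`, its flags, and the file layout are assumptions for illustration only; the actual tests live in tests/integration.

```python
# Minimal sketch of stage (1): dry-run sbatch generation. The CLI name, flags,
# and file layout below are illustrative assumptions, not the project's API.
import subprocess
from pathlib import Path


def test_dry_run_generates_sbatch_script(tmp_path: Path) -> None:
    # Set up the run directory without submitting anything to SLURM.
    subprocess.run(
        ["schedule-evals", "--dry-run", "--run-dir", str(tmp_path)],
        check=True,
    )
    scripts = list(tmp_path.glob("**/*.sbatch"))
    assert scripts, "dry run should produce at least one sbatch script"
    content = scripts[0].read_text()
    assert content.startswith("#!/bin/bash")  # plain batch script
    assert "#SBATCH" in content  # contains SLURM directives
```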

In more detail:

- New workflow: slurm-integration.yml
  - Sets up SLURM (slurmctld + slurmd) on AWS GPU runner
  - Installs Apptainer and builds test container
  - Pre-downloads tiny-gpt2 model and arc_easy dataset
  - Runs integration test that validates full workflow

- New test container: tests/integration/ci.def
  - Based on PyTorch with CUDA support
  - Includes lm_eval and dependencies

- New integration test: tests/integration/test_slurm.py
  - Submits a real SLURM job via schedule_evals
  - Waits for completion and validates the results JSON (see the wait-and-validate sketch after this list)

- Updated clusters.yaml with CI cluster configuration

Switch to CPU instances (i7ie) since GPU quota is not available:
- Remove GPU/GRES configuration from SLURM setup
- Update test to support --dry-run mode for CPU-only testing
- Validate sbatch script generation without actual job execution
- Update CI cluster config to use debug partition

Full GPU testing can be re-enabled later when quota is available.
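For reference, a minimal sketch of the wait-and-validate step mentioned above, using only the standard library and squeue. The timeout, polling interval, and results-file schema are assumptions, not the actual implementation in test_slurm.py.

```python
# Sketch of waiting for a submitted SLURM job and checking its results JSON.
# The 45-min timeout, polling interval, and "results" key are illustrative.
import json
import subprocess
import time
from pathlib import Path


def wait_for_job(job_id: str, timeout_s: int = 2700, poll_s: int = 30) -> None:
    """Poll squeue until the job leaves the queue or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        out = subprocess.run(
            ["squeue", "--noheader", "--jobs", job_id],
            capture_output=True, text=True, check=False,
        ).stdout.strip()
        if not out:  # job is no longer pending or running
            return
        time.sleep(poll_s)
    raise TimeoutError(f"SLURM job {job_id} did not finish within {timeout_s}s")


def validate_results(results_path: Path) -> None:
    results = json.loads(results_path.read_text())
    assert "results" in results, "results JSON should contain a 'results' section"
```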
@timurcarstensen (Collaborator, Author)

@JeniaJitsev this is the base PR that #37 is built upon. Once this one is merged, I'll merge #37.

@geoalgo (Collaborator) left a comment:


Can you give the current cost of running the pipeline on AWS once? (I see that the timeout is currently set to 45 min; at 45 min the cost would be about $0.40 per commit, which would be a bit annoying.)

Do we need a GPU for testing at all? If we could run on a CPU machine, it would be much cheaper.

uv pip install --system --break-system-packages nltk

# Pre-load lighteval registry to trigger tinyBenchmarks data download at build time
/opt/uv-tools/lighteval/bin/python -c "from lighteval.tasks.registry import Registry; Registry.load_all_task_configs(load_multilingual=False)"
Collaborator


Can't we use uvx instead of hardcoding the path here?

Collaborator Author


Best we can do is: $UV_TOOL_DIR/lighteval/bin/python -c "from lighteval.tasks.registry import Registry; Registry.load_all_task_configs(load_multilingual=True)"

@timurcarstensen (Collaborator, Author)

> Can you give the current cost of running the pipeline on AWS once?

@geoalgo one test run currently takes about 15 minutes. The instance I use costs about 0.30-0.72 USD per hour, depending on whether we use spot or on-demand instances (spot is fine for now). That works out to 0.075-0.18 USD per run, plus some service charge for the runs-on service, so at most about 0.25 USD per run.
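For reference, a quick back-of-the-envelope check of that estimate, using only the figures quoted above:

```python
# Per-run cost from the figures above: ~15 min per run at 0.30-0.72 USD/hour
# (spot vs. on-demand), plus a small charge for the runs-on service.
run_hours = 15 / 60
for label, hourly_usd in [("spot", 0.30), ("on-demand", 0.72)]:
    print(f"{label}: {run_hours * hourly_usd:.3f} USD per run")
# spot: 0.075 USD per run, on-demand: 0.180 USD per run -> under 0.25 USD total
```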

@geoalgo (Collaborator) commented Feb 5, 2026

OK, thanks. How hard is it to deactivate AWS? Is it enough to just comment out the GitHub Action?

I am a bit worried about the complexity hit we are taking here, including needing to manage an AWS account (compared to having a manual / semi-automatic integration test that runs on our own clusters, for instance).

We could merge it, but we should have a very easy way to remove it.

@timurcarstensen (Collaborator, Author)

> OK, thanks. How hard is it to deactivate AWS? Is it enough to just comment out the GitHub Action?
>
> I am a bit worried about the complexity hit we are taking here, including needing to manage an AWS account (compared to having a manual / semi-automatic integration test that runs on our own clusters, for instance).
>
> We could merge it, but we should have a very easy way to remove it.

It now runs automatically on PRs to main, and only for changes to the paths I defined in the workflow file, so it's very easy to disable. Yes, you could just do a semi-automatic setup where it runs on one of the clusters we have access to. I would say the overhead of having an AWS account is quite minimal.

