Conversation

deanq commented Dec 19, 2025

Summary

Implemented a comprehensive fitness check system that validates worker readiness at startup. It includes 7 built-in system checks (memory, disk, network, GPU) plus a simple API for users to register custom checks. All checks log clear results in production, and validation thresholds scale automatically with machine size.

Custom User-Defined Checks

Users can register custom fitness checks using a simple decorator:

```python
import runpod

@runpod.serverless.register_fitness_check
def check_model_files():
    """Verify required model files exist."""
    from pathlib import Path
    model_path = Path("/models/my-model.safetensors")
    if not model_path.exists():
        raise RuntimeError(f"Model not found: {model_path}")

@runpod.serverless.register_fitness_check
async def check_external_api():
    """Verify external API is accessible."""
    import aiohttp
    timeout = aiohttp.ClientTimeout(total=5)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get("https://api.example.com/health") as resp:
            if resp.status != 200:
                raise RuntimeError(f"API health check failed: {resp.status}")

def handler(job):
    return {"output": "success"}

runpod.serverless.start({"handler": handler})
```

Key Features:

  • Simple decorator: @runpod.serverless.register_fitness_check
  • Supports both sync and async functions (see the registry sketch below)
  • User checks run before built-in system checks
  • Raise exceptions on failure (no return value needed)

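Under the hood, a registry like this only needs a list and a dispatch on `inspect.iscoroutinefunction`. The sketch below illustrates the general pattern; names such as `_FITNESS_CHECKS` and the use of `print` instead of the SDK's logger are assumptions, not the actual internals of `rp_fitness.py`:

```python
import inspect
import sys

# Hypothetical registry; the real rp_fitness.py internals may differ.
_FITNESS_CHECKS = []

def register_fitness_check(func):
    """Decorator: add a sync or async callable to the check registry."""
    _FITNESS_CHECKS.append(func)
    return func  # returned unchanged so the function remains usable

async def run_fitness_checks():
    """Run every registered check; fail fast on the first exception."""
    for check in _FITNESS_CHECKS:
        try:
            if inspect.iscoroutinefunction(check):
                await check()  # async checks are awaited
            else:
                check()  # sync checks run inline
        except Exception as exc:
            print(f"Fitness check '{check.__name__}' failed: {exc}")
            sys.exit(1)
```
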
Documentation: Complete guide at docs/serverless/worker_fitness_checks.md with examples for GPU validation, file checks, environment variables, and external service connectivity.

System Fitness Checks

All Workers (3 checks):

  1. Memory Check - Validates sufficient RAM available

    • Default: 4GB minimum
    • Configurable: `RUNPOD_MIN_MEMORY_GB`
  2. Disk Space Check - Validates adequate disk space

    • Default: 10% of total disk must be free
    • Automatically scales with machine size (see the sketch after this list)
    • Configurable: `RUNPOD_MIN_DISK_PERCENT`
  3. Network Connectivity Check - Validates internet access

    • Default: 5s timeout to 8.8.8.8:53
    • Configurable: `RUNPOD_NETWORK_CHECK_TIMEOUT`

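For a sense of scale, the disk and network checks above can be expressed with only the standard library. This is a rough sketch assuming the documented env-var names, not the SDK's actual implementation:

```python
import os
import shutil
import socket

def check_disk_percent():
    """Fail if free space on / drops below RUNPOD_MIN_DISK_PERCENT."""
    min_percent = float(os.environ.get("RUNPOD_MIN_DISK_PERCENT", "10.0"))
    usage = shutil.disk_usage("/")
    free_percent = usage.free / usage.total * 100
    if free_percent < min_percent:
        raise RuntimeError(
            f"Only {free_percent:.1f}% of disk is free (minimum: {min_percent}%)"
        )

def check_network():
    """Fail if a TCP connection to 8.8.8.8:53 cannot be opened in time."""
    timeout = float(os.environ.get("RUNPOD_NETWORK_CHECK_TIMEOUT", "5"))
    with socket.create_connection(("8.8.8.8", 53), timeout=timeout):
        pass  # connection succeeded; the socket closes on exit
```
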
GPU Workers (4 additional checks):

  4. CUDA Version Check - Validates CUDA driver version (see the parsing sketch after this list)

    • Default: CUDA 11.8+
    • Configurable: `RUNPOD_MIN_CUDA_VERSION`
  5. CUDA Initialization Check - Verifies GPU accessibility

    • Tests actual device initialization
    • Validates each GPU's memory allocation
    • Catches runtime failures early
  6. GPU Compute Benchmark - Quick performance validation

    • Matrix multiplication test
    • Default: 100ms max execution time
    • Configurable: `RUNPOD_GPU_BENCHMARK_TIMEOUT`
  7. GPU Binary Test - Comprehensive GPU health via native binary

    • CUDA driver availability
    • NVML initialization
    • Memory allocation per GPU

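The CUDA version check's fallback path (per the commit history below) parses the `CUDA Version: X.Y` header from plain `nvidia-smi` output rather than querying the driver version. A hedged sketch of that parsing, not the SDK's exact code:

```python
import re
import subprocess

def get_cuda_version():
    """Parse 'CUDA Version: X.Y' from nvidia-smi's standard header output."""
    output = subprocess.check_output(
        ["nvidia-smi"], stderr=subprocess.DEVNULL, text=True
    )
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", output)
    if not match:
        raise RuntimeError("Could not parse CUDA version from nvidia-smi")
    return tuple(int(part) for part in match.group(1).split("."))

# e.g. get_cuda_version() >= (11, 8) mirrors the RUNPOD_MIN_CUDA_VERSION gate
```
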
Behavior

  • Execution: Runs once at worker startup, before any jobs are accepted (see the sketch after this list)
  • Failure: Worker exits immediately with code 1 and is marked unhealthy
  • Success: All checks pass, worker begins accepting jobs
  • CPU Workers: GPU checks skip silently (same code works for CPU/GPU)
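
In worker terms, the fail-fast wiring is just "checks first, then the job loop". A simplified sketch, reusing `run_fitness_checks` from the registry sketch above (`job_loop` is a hypothetical stand-in for the real loop):

```python
async def run_worker(config):
    # run_fitness_checks (sketched earlier) calls sys.exit(1) on the
    # first failing check, so the job loop is never reached on failure.
    await run_fitness_checks()
    print("All fitness checks passed.")
    await job_loop(config)  # hypothetical: the normal job-taking loop
```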

Example Production Logs

GPU Worker Startup:
```
[info]--- Starting Serverless Worker | Version 1.8.2.dev ---
[info]Running 7 fitness check(s)...
[info]Memory check passed: 181.49GB available (of 187.85GB total)
[info]Disk space check passed: 9.88GB free (98.8% available)
[info]Network connectivity passed: Connected to 8.8.8.8 (23ms)
[info]CUDA version check passed: 12.7 (minimum: 11.8)
[info]CUDA initialization passed: 2 device(s) initialized successfully
[info]GPU compute benchmark passed: Matrix multiply completed in 42ms
[info]GPU binary test passed: 2 GPU(s) healthy (CUDA 12.7)
[info]All fitness checks passed.
```

CPU Worker Startup:
```
[info]--- Starting Serverless Worker | Version 1.8.2.dev ---
[info]Running 3 fitness check(s)...
[info]Memory check passed: 359.41GB available (of 377.18GB total)
[info]Disk space check passed: 9.88GB free (98.8% available)
[info]Network connectivity passed: Connected to 8.8.8.8 (23ms)
[info]All fitness checks passed.
```

Configuration

All thresholds are configurable via environment variables:

```dockerfile
# Dockerfile
ENV RUNPOD_MIN_MEMORY_GB=8.0
ENV RUNPOD_MIN_DISK_PERCENT=15.0
ENV RUNPOD_MIN_CUDA_VERSION=12.0
ENV RUNPOD_NETWORK_CHECK_TIMEOUT=10
ENV RUNPOD_GPU_BENCHMARK_TIMEOUT=2
```

Or in Python:
```python
import os
os.environ["RUNPOD_MIN_MEMORY_GB"] = "8.0"
os.environ["RUNPOD_MIN_DISK_PERCENT"] = "15.0"
```

Test Coverage

  • Memory availability detection
  • Disk space validation across different disk sizes (example sketch below)
  • Network connectivity testing
  • CUDA version parsing and comparison
  • GPU device initialization verification
  • GPU compute performance benchmarking
  • Edge cases for cross-platform compatibility
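
As an illustration of how the percentage-based disk logic can be unit-tested, one can stub `shutil.disk_usage` so the check sees an arbitrary disk size. This mirrors the intent of the suite rather than its actual contents, and reuses `check_disk_percent` from the earlier sketch:

```python
from unittest import mock

import pytest

def test_disk_check_scales_with_disk_size():
    # Simulate a 1 TB disk with only 5% free; the default 10% threshold
    # should fail regardless of the absolute free space (50 GB).
    fake_usage = mock.Mock(total=1_000_000_000_000, free=50_000_000_000)
    with mock.patch("shutil.disk_usage", return_value=fake_usage):
        with pytest.raises(RuntimeError):
            check_disk_percent()
```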

Implementation Details

  • Auto-registers checks at worker startup
  • Deferred registration avoids circular imports
  • Graceful degradation on missing tools
  • Comprehensive error messages for debugging
  • Platform-aware (Linux /proc/meminfo fallback for memory)
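
The `/proc/meminfo` fallback can be as small as scanning for `MemAvailable`; a sketch under that assumption (the SDK's exact parsing may differ):

```python
def available_memory_gb():
    """Read MemAvailable from /proc/meminfo (Linux only), in GB."""
    with open("/proc/meminfo") as meminfo:
        for line in meminfo:
            if line.startswith("MemAvailable:"):
                kb = int(line.split()[1])  # value is reported in kB
                return kb / (1024 ** 2)  # kB -> GB
    raise RuntimeError("MemAvailable not found in /proc/meminfo")
```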

deanq added 30 commits December 12, 2025 13:19
Implement a health validation system for serverless workers that runs at startup
before handler initialization. Allows users to register sync and async validation
functions via the @runpod.serverless.register_fitness_check decorator.

Key features:
- Decorator-based registration API with arbitrary function registry
- Support for both synchronous and asynchronous fitness checks
- Automatic async/sync detection and execution
- Production-only execution (skips in local/test modes)
- Comprehensive error logging with exception details
- Fail-fast behavior: logs error and exits with code 1 on first failure

Files:
- runpod/serverless/modules/rp_fitness.py: Core implementation
- runpod/serverless/__init__.py: Export register_fitness_check
- runpod/serverless/worker.py: Integration into run_worker()
- docs/serverless/worker_fitness_checks.md: User documentation
- tests/test_serverless/test_modules/test_fitness.py: Test suite
- docs/serverless/worker.md: Added See Also section with documentation link
- Add comprehensive Fitness Checks section to architecture.md with execution flow, key functions, and performance characteristics
- Update system architecture diagram to show fitness check node in worker startup flow
- Add Fitness Check Flow sequence diagram showing registration and execution paths
- Add Fitness Check Contract to Integration Points section
- Update High-Level Flow to include fitness check validation step
- Add fitness check overhead metrics to Performance Characteristics
- Update Table of Contents and References with fitness check entries
- Add Worker Fitness Checks section to README.md with practical example
- Include link to detailed fitness check documentation
- Update Last Updated date to 2025-12-13
- Remove empty core/__pycache__/ directory
- Add gpu_test.c: CUDA binary for GPU memory allocation testing
- Add compile_gpu_test.sh: Docker-based build script with CUDA support
- Add _binary_helpers.py: Utility to locate package-bundled binaries
- Supports environment variable override for binary path
- Add rp_gpu_fitness.py: GPU health check using native binary or nvidia-smi fallback
- Auto-registers GPU check on import when GPUs are detected
- Validates GPU driver, NVML, enumeration, and memory allocation
- Gracefully skips on CPU-only workers
- Modify rp_fitness.py: Auto-register GPU check during module initialization
- Add pre-compiled gpu_test binary to package data
- Include binary documentation (README.md)
- Add MANIFEST.in to ensure binary is included in source distribution
- Update pyproject.toml: Add package-data configuration
- Update setup.py: Add package_data for setuptools
- Guarantees binary availability when installing from any git branch
- Add test_gpu_fitness.py: Unit tests for GPU fitness check logic
- Add test_gpu_fitness_integration.py: Integration tests with mocked binary
- Add mock_gpu_test fixture: Simulated GPU test binary for testing
- Covers binary path resolution, output parsing, error handling
- Tests both native binary and fallback code paths
- Set RUNPOD_SKIP_GPU_CHECK env var in subprocess calls
- Improves benchmark consistency by avoiding GPU check overhead
- Enhances output parsing to handle debug messages gracefully
- Ensures performance tests measure import time, not GPU detection
- Maintains benchmark reliability across GPU and CPU-only systems
- Update worker_fitness_checks.md: Add GPU memory allocation test section
- Add gpu_binary_compilation.md: Complete guide to building gpu_test binary
- Document auto-registration, configuration, and performance characteristics
- Include troubleshooting guide and deployment examples
- Provide version compatibility matrix for CUDA and NVML
- Move auto-registration from module import time to first run of run_fitness_checks()
- Prevents circular import issues where rp_gpu_fitness couldn't import RunPodLogger
- Use _ensure_gpu_check_registered() guard to register GPU check once on demand
- Maintains same functionality but with proper import ordering
RunPodLogger uses warn() method, not warning(). Update both files to use
the correct method name to prevent AttributeError at runtime.
Change log.warning() to log.warn() for RunPodLogger API consistency.
This was causing AttributeError during worker startup.
Change from relative import .rp_cuda to correct path ..utils.rp_cuda
Implements 5 new automatic fitness checks:
- Memory availability (default: 4GB minimum)
- Disk space (default: 10GB minimum)
- Network connectivity (8.8.8.8:53)
- CUDA version validation (GPU workers only)
- GPU compute benchmark (GPU workers only)

All checks run automatically at worker startup; the worker exits immediately if any fail.
- Add _ensure_system_checks_registered() function
- Register system checks when run_fitness_checks() starts
- Add _reset_registration_state() for testing
- Add RUNPOD_SKIP_AUTO_SYSTEM_CHECKS environment variable for tests
Adds comprehensive CUDA initialization check that verifies:
- Device initialization succeeds (not just available)
- Each device has accessible memory
- Tensor allocation works on all devices
- Fallback support for PyTorch and CuPy

Catches runtime CUDA initialization failures early at worker startup
instead of during job processing. Includes 7 new tests covering:
- Successful PyTorch/CuPy initialization
- Device count validation
- Memory accessibility checks
- Device allocation failures
- Graceful fallback behavior
Adds comprehensive documentation for the new CUDA initialization check:
- Explains what the check validates
- Shows expected log output
- Provides failure scenario example
- Updates check summary to list all 6 checks (3 base + 3 GPU)
- Clarifies check runs after CUDA version check
Lowers the default disk space threshold from 10GB to 1GB to accommodate
smaller container environments while still catching critically low disk
conditions. Users can override with RUNPOD_MIN_DISK_GB environment
variable if more space is required.

- Default: RUNPOD_MIN_DISK_GB=1.0 (was 10.0)
- Configurable: RUNPOD_MIN_DISK_GB=20.0 for higher requirements
Prevents a confusing error message from appearing in logs on CPU-only
serverless endpoints. Changed subprocess.check_output() to use
stderr=subprocess.DEVNULL instead of shell=True, which also improves
security by avoiding shell injection risks.

Change:
- subprocess.check_output('nvidia-smi', shell=True)
+ subprocess.check_output(['nvidia-smi'], stderr=subprocess.DEVNULL)

This ensures the shell error '/bin/sh: 1: nvidia-smi: not found' does
not leak to the logs when GPU detection runs on non-GPU workers.
Behavior is unchanged - still returns True/False as before.
- Fixed nvidia-smi fallback parsing incorrect CUDA version
- Changed from querying --query-gpu=driver_version (returns 500+)
- Now parses 'CUDA Version: X.Y' from nvidia-smi standard output
- This fixes confusing logs that showed driver version instead of CUDA version
- Updated test to use realistic nvidia-smi output format
- Added edge case tests for version extraction validation
- Ensures workers report correct CUDA version (11.x-12.x not 500+)
- Changed from static 1GB minimum to percentage-based check
- Default: 10% of total disk must be free (scales with disk size)
- Configurable via RUNPOD_MIN_DISK_PERCENT environment variable
- Benefits:
  - Small disks (50GB): ~5GB free required
  - Large disks (1TB): ~100GB free required
- Automatically scales to machine resources
- Updated tests to verify percentage-based logic
…dation

- Changed from MIN_DISK_GB to MIN_DISK_PERCENT configuration
- Document 10% default scaling across different disk sizes
- Add examples showing 100GB, 1TB, and 10TB scaling
- Update configuration examples in dockerfile and Python
- Explain automatic scaling benefits
- Simplified disk check to only verify root (/) filesystem
- In containers, /tmp is just a subdirectory of / (same disk)
- Eliminates duplicate log messages with identical results
- Updated tests to verify only root filesystem is checked
- Updated documentation to reflect container behavior
- Reduces check overhead with single disk_usage() call
github-advanced-security bot left a comment

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

The test assertions were expecting the old API (shell=True) but the
implementation was changed to use a list and stderr=subprocess.DEVNULL.
Updated test assertions to match the actual implementation.
deanq requested a review from Copilot December 19, 2025 00:47

Copilot AI left a comment


Pull request overview

This PR implements a comprehensive fitness check system for worker startup validation, consisting of user-defined custom checks and 7 built-in system resource checks (memory, disk, network, and GPU-specific validations). The system validates worker readiness before accepting jobs, with intelligent thresholds that scale automatically and clear production logging.

Key Changes:

  • Added user-facing decorator API (@runpod.serverless.register_fitness_check) for custom health checks
  • Implemented 7 automatic system checks (3 for all workers, 4 additional for GPU workers)
  • Included GPU test binary with native CUDA memory allocation validation
  • Comprehensive documentation and examples for user-defined checks

Reviewed changes

Copilot reviewed 28 out of 31 changed files in this pull request and generated 4 comments.

Show a summary per file

| File | Description |
| --- | --- |
| runpod/serverless/modules/rp_fitness.py | Core fitness check registration and execution system |
| runpod/serverless/modules/rp_system_fitness.py | Built-in system resource checks (memory, disk, network, CUDA) |
| runpod/serverless/modules/rp_gpu_fitness.py | GPU-specific health checks using native binary and fallback |
| runpod/serverless/__init__.py | Exported register_fitness_check to public API |
| runpod/serverless/worker.py | Integrated fitness checks into worker startup flow |
| docs/serverless/worker_fitness_checks.md | Complete user documentation with examples |
| tests/test_serverless/test_modules/test_fitness.py | Unit tests for core fitness system |
| tests/test_serverless/test_modules/test_system_fitness.py | Unit tests for system resource checks |
| tests/test_serverless/test_modules/test_gpu_fitness.py | Unit tests for GPU check system |
| build_tools/gpu_test.c | Native CUDA binary source for GPU memory testing |


- Use configurable GPU_BENCHMARK_TIMEOUT instead of hardcoded 100ms threshold
- Change test assertion from >= to == to prevent accidental check registration
- Remove empty TestFallbackExecution class (covered by integration tests)
- Make error message limit configurable via RUNPOD_GPU_MAX_ERROR_MESSAGES env var

All 83 fitness check tests pass with these changes.
```python
_system_checks_registered = True
return

_system_checks_registered = True
```

Check notice (Code scanning / CodeQL): Unused global variable. The global variable '_system_checks_registered' is not used.
- Remove unused 'inspect' import from rp_gpu_fitness.py
- Remove unused 'call' import from test files (test_fitness.py, test_gpu_fitness.py, test_system_fitness.py)
- Add explanatory comment to empty except clause in rp_gpu_fitness.py

All 83 fitness check tests pass with these changes.
deanq marked this pull request as ready for review December 19, 2025 02:02
Eliminate unnecessary nvidia-smi call in _run_gpu_test_fallback(). The function
was calling is_available() before immediately trying nvidia-smi --list-gpus,
resulting in redundant GPU detection. Direct attempt to list GPUs handles all
failure cases without the pre-check.

Also clean up ambiguous variable name 'l' → 'line' in list comprehension.
- Add explanatory comment to empty except clause in gpu_test output parsing
- Change test assertion from >= to == to catch regressions in error detection

These changes address CodeQL and Copilot feedback on PR #472 to improve code clarity
and test assertion specificity.
deanq requested a review from Copilot December 19, 2025 04:20

Copilot AI left a comment


Pull request overview

Copilot reviewed 28 out of 31 changed files in this pull request and generated 1 comment.



- Remove unused asyncio import from rp_fitness.py
- Remove unused Any import from rp_system_fitness.py
- Remove unused imports from test files (os from test_fitness.py, os and subprocess from test_system_fitness.py, _run_gpu_test_fallback from test_gpu_fitness.py)
- Add explanatory comments to empty except blocks in test_gpu_fitness_integration.py fixture cleanup
- Test assertion already uses == 6 (previously addressed)
- Remove unused mock_wait_for variable assignment in test_fitness_check_with_timeout
- Remove unused mock_exec variable assignment in test_gpu_check_runs_in_correct_order

These variables were captured but never used in the test logic.
- CLI: Add explicit re-exports for config, ssh, and get_pod_ssh_ip_port (F401)
- Pod commands: Replace bare except with specific OSError handling (E722)
- FastAPI: Remove duplicate Job import from worker_state (F811)
- RPC Job: Use isinstance() instead of type() for dict comparison (E721)
- Llama2 template: Add ruff noqa directive for template placeholder code (F821)
- Download tests: Rename duplicate test_download_file to test_download_file_with_content_disposition (F811)

All ruff checks now pass without errors.

```python
# Auto-detect async vs sync using inspect
if inspect.iscoroutinefunction(check_func):
    await check_func()
```

Maybe not something for right now, but I feel like it'd be nice to have some instrumentation for timing with these - like out of the box, debug logs will display how long each fitness check took.
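
One way to get that out of the box would be a small timing wrapper around each check; a hypothetical sketch, not part of this PR:

```python
import inspect
import time

async def run_check_timed(check):
    """Run one fitness check and log how long it took (debug level)."""
    start = time.monotonic()
    if inspect.iscoroutinefunction(check):
        await check()
    else:
        check()
    elapsed_ms = (time.monotonic() - start) * 1000
    print(f"[debug]{check.__name__} took {elapsed_ms:.0f}ms")
```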
