Conversation

deanq commented Dec 19, 2025

Summary

Implemented a comprehensive fitness check system that validates worker readiness at startup. It includes 7 built-in system checks (memory, disk, network, GPU) plus a simple API for users to register custom checks. All checks log clear results in production, and validation thresholds scale automatically with machine size.

Custom User-Defined Checks

Users can register custom fitness checks using a simple decorator:

```python
import runpod

@runpod.serverless.register_fitness_check
def check_model_files():
    """Verify required model files exist."""
    from pathlib import Path
    model_path = Path("/models/my-model.safetensors")
    if not model_path.exists():
        raise RuntimeError(f"Model not found: {model_path}")

@runpod.serverless.register_fitness_check
async def check_external_api():
    """Verify external API is accessible."""
    import aiohttp
    timeout = aiohttp.ClientTimeout(total=5)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get("https://api.example.com/health") as resp:
            if resp.status != 200:
                raise RuntimeError(f"API health check failed: {resp.status}")

def handler(job):
    return {"output": "success"}

runpod.serverless.start({"handler": handler})
```

Key Features:

  • Simple decorator: @runpod.serverless.register_fitness_check
  • Supports both sync and async functions (see the registry sketch below)
  • User checks run before built-in system checks
  • Raise exceptions on failure (no return value needed)

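Under the hood, a registry like this only needs a list and a dispatch on `inspect.iscoroutinefunction`. The sketch below illustrates the general pattern; names such as `_FITNESS_CHECKS` and the use of `print` instead of the SDK's logger are assumptions, not the actual internals of `rp_fitness.py`:

```python
import inspect
import sys

# Hypothetical registry; the real rp_fitness.py internals may differ.
_FITNESS_CHECKS = []

def register_fitness_check(func):
    """Decorator: add a sync or async callable to the check registry."""
    _FITNESS_CHECKS.append(func)
    return func  # returned unchanged so the function remains usable

async def run_fitness_checks():
    """Run every registered check; fail fast on the first exception."""
    for check in _FITNESS_CHECKS:
        try:
            if inspect.iscoroutinefunction(check):
                await check()  # async checks are awaited
            else:
                check()  # sync checks run inline
        except Exception as exc:
            print(f"Fitness check '{check.__name__}' failed: {exc}")
            sys.exit(1)
```
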
Documentation: Complete guide at docs/serverless/worker_fitness_checks.md with examples for GPU validation, file checks, environment variables, and external service connectivity.

System Fitness Checks

All Workers (3 checks):

  1. Memory Check - Validates sufficient RAM available

    • Default: 4GB minimum
    • Configurable: `RUNPOD_MIN_MEMORY_GB`
  2. Disk Space Check - Validates adequate disk space

    • Default: 10% of total disk must be free
    • Automatically scales with machine size (see the sketch after this list)
    • Configurable: `RUNPOD_MIN_DISK_PERCENT`
  3. Network Connectivity Check - Validates internet access

    • Default: 5s timeout to 8.8.8.8:53
    • Configurable: `RUNPOD_NETWORK_CHECK_TIMEOUT`

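For a sense of scale, the disk and network checks above can be expressed with only the standard library. This is a rough sketch assuming the documented env-var names, not the SDK's actual implementation:

```python
import os
import shutil
import socket

def check_disk_percent():
    """Fail if free space on / drops below RUNPOD_MIN_DISK_PERCENT."""
    min_percent = float(os.environ.get("RUNPOD_MIN_DISK_PERCENT", "10.0"))
    usage = shutil.disk_usage("/")
    free_percent = usage.free / usage.total * 100
    if free_percent < min_percent:
        raise RuntimeError(
            f"Only {free_percent:.1f}% of disk is free (minimum: {min_percent}%)"
        )

def check_network():
    """Fail if a TCP connection to 8.8.8.8:53 cannot be opened in time."""
    timeout = float(os.environ.get("RUNPOD_NETWORK_CHECK_TIMEOUT", "5"))
    with socket.create_connection(("8.8.8.8", 53), timeout=timeout):
        pass  # connection succeeded; the socket closes on exit
```
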
GPU Workers (4 additional checks):

  4. CUDA Version Check - Validates CUDA driver version (see the parsing sketch after this list)

    • Default: CUDA 11.8+
    • Configurable: `RUNPOD_MIN_CUDA_VERSION`
  5. CUDA Initialization Check - Verifies GPU accessibility

    • Tests actual device initialization
    • Validates each GPU's memory allocation
    • Catches runtime failures early
  6. GPU Compute Benchmark - Quick performance validation

    • Matrix multiplication test
    • Default: 100ms max execution time
    • Configurable: `RUNPOD_GPU_BENCHMARK_TIMEOUT`
  7. GPU Binary Test - Comprehensive GPU health via native binary

    • CUDA driver availability
    • NVML initialization
    • Memory allocation per GPU

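The CUDA version check's fallback path (per the commit history below) parses the `CUDA Version: X.Y` header from plain `nvidia-smi` output rather than querying the driver version. A hedged sketch of that parsing, not the SDK's exact code:

```python
import re
import subprocess

def get_cuda_version():
    """Parse 'CUDA Version: X.Y' from nvidia-smi's standard header output."""
    output = subprocess.check_output(
        ["nvidia-smi"], stderr=subprocess.DEVNULL, text=True
    )
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", output)
    if not match:
        raise RuntimeError("Could not parse CUDA version from nvidia-smi")
    return tuple(int(part) for part in match.group(1).split("."))

# e.g. get_cuda_version() >= (11, 8) mirrors the RUNPOD_MIN_CUDA_VERSION gate
```
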
Behavior

  • Execution: Runs once at worker startup, before any jobs are accepted (see the sketch after this list)
  • Failure: Worker exits immediately with code 1 and is marked unhealthy
  • Success: All checks pass, worker begins accepting jobs
  • CPU Workers: GPU checks skip silently (same code works for CPU/GPU)
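
In worker terms, the fail-fast wiring is just "checks first, then the job loop". A simplified sketch, reusing `run_fitness_checks` from the registry sketch above (`job_loop` is a hypothetical stand-in for the real loop):

```python
async def run_worker(config):
    # run_fitness_checks (sketched earlier) calls sys.exit(1) on the
    # first failing check, so the job loop is never reached on failure.
    await run_fitness_checks()
    print("All fitness checks passed.")
    await job_loop(config)  # hypothetical: the normal job-taking loop
```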

Example Production Logs

GPU Worker Startup:
```
[info]--- Starting Serverless Worker | Version 1.8.2.dev ---
[info]Running 7 fitness check(s)...
[info]Memory check passed: 181.49GB available (of 187.85GB total)
[info]Disk space check passed: 9.88GB free (98.8% available)
[info]Network connectivity passed: Connected to 8.8.8.8 (23ms)
[info]CUDA version check passed: 12.7 (minimum: 11.8)
[info]CUDA initialization passed: 2 device(s) initialized successfully
[info]GPU compute benchmark passed: Matrix multiply completed in 42ms
[info]GPU binary test passed: 2 GPU(s) healthy (CUDA 12.7)
[info]All fitness checks passed.
```

CPU Worker Startup:
```
[info]--- Starting Serverless Worker | Version 1.8.2.dev ---
[info]Running 3 fitness check(s)...
[info]Memory check passed: 359.41GB available (of 377.18GB total)
[info]Disk space check passed: 9.88GB free (98.8% available)
[info]Network connectivity passed: Connected to 8.8.8.8 (23ms)
[info]All fitness checks passed.
```

Configuration

All thresholds are configurable via environment variables:

```dockerfile
# Dockerfile
ENV RUNPOD_MIN_MEMORY_GB=8.0
ENV RUNPOD_MIN_DISK_PERCENT=15.0
ENV RUNPOD_MIN_CUDA_VERSION=12.0
ENV RUNPOD_NETWORK_CHECK_TIMEOUT=10
ENV RUNPOD_GPU_BENCHMARK_TIMEOUT=2
```

Or in Python:
```python
import os
os.environ["RUNPOD_MIN_MEMORY_GB"] = "8.0"
os.environ["RUNPOD_MIN_DISK_PERCENT"] = "15.0"
```

Test Coverage

  • Memory availability detection
  • Disk space validation across different disk sizes (example sketch below)
  • Network connectivity testing
  • CUDA version parsing and comparison
  • GPU device initialization verification
  • GPU compute performance benchmarking
  • Edge cases for cross-platform compatibility
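
As an illustration of how the percentage-based disk logic can be unit-tested, one can stub `shutil.disk_usage` so the check sees an arbitrary disk size. This mirrors the intent of the suite rather than its actual contents, and reuses `check_disk_percent` from the earlier sketch:

```python
from unittest import mock

import pytest

def test_disk_check_scales_with_disk_size():
    # Simulate a 1 TB disk with only 5% free; the default 10% threshold
    # should fail regardless of the absolute free space (50 GB).
    fake_usage = mock.Mock(total=1_000_000_000_000, free=50_000_000_000)
    with mock.patch("shutil.disk_usage", return_value=fake_usage):
        with pytest.raises(RuntimeError):
            check_disk_percent()
```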

Implementation Details

  • Auto-registers checks at worker startup
  • Deferred registration avoids circular imports
  • Graceful degradation on missing tools
  • Comprehensive error messages for debugging
  • Platform-aware (Linux /proc/meminfo fallback for memory)
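
The `/proc/meminfo` fallback can be as small as scanning for `MemAvailable`; a sketch under that assumption (the SDK's exact parsing may differ):

```python
def available_memory_gb():
    """Read MemAvailable from /proc/meminfo (Linux only), in GB."""
    with open("/proc/meminfo") as meminfo:
        for line in meminfo:
            if line.startswith("MemAvailable:"):
                kb = int(line.split()[1])  # value is reported in kB
                return kb / (1024 ** 2)  # kB -> GB
    raise RuntimeError("MemAvailable not found in /proc/meminfo")
```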

deanq added 30 commits December 12, 2025 13:19
Implement a health validation system for serverless workers that runs at startup
before handler initialization. Allows users to register sync and async validation
functions via the @runpod.serverless.register_fitness_check decorator.

Key features:
- Decorator-based registration API with arbitrary function registry
- Support for both synchronous and asynchronous fitness checks
- Automatic async/sync detection and execution
- Production-only execution (skips in local/test modes)
- Comprehensive error logging with exception details
- Fail-fast behavior: logs error and exits with code 1 on first failure

Files:
- runpod/serverless/modules/rp_fitness.py: Core implementation
- runpod/serverless/__init__.py: Export register_fitness_check
- runpod/serverless/worker.py: Integration into run_worker()
- docs/serverless/worker_fitness_checks.md: User documentation
- tests/test_serverless/test_modules/test_fitness.py: Test suite
- docs/serverless/worker.md: Added See Also section with documentation link
- Add comprehensive Fitness Checks section to architecture.md with execution flow, key functions, and performance characteristics
- Update system architecture diagram to show fitness check node in worker startup flow
- Add Fitness Check Flow sequence diagram showing registration and execution paths
- Add Fitness Check Contract to Integration Points section
- Update High-Level Flow to include fitness check validation step
- Add fitness check overhead metrics to Performance Characteristics
- Update Table of Contents and References with fitness check entries
- Add Worker Fitness Checks section to README.md with practical example
- Include link to detailed fitness check documentation
- Update Last Updated date to 2025-12-13
- Remove empty core/__pycache__/ directory
- Add gpu_test.c: CUDA binary for GPU memory allocation testing
- Add compile_gpu_test.sh: Docker-based build script with CUDA support
- Add _binary_helpers.py: Utility to locate package-bundled binaries
- Supports environment variable override for binary path
- Add rp_gpu_fitness.py: GPU health check using native binary or nvidia-smi fallback
- Auto-registers GPU check on import when GPUs are detected
- Validates GPU driver, NVML, enumeration, and memory allocation
- Gracefully skips on CPU-only workers
- Modify rp_fitness.py: Auto-register GPU check during module initialization
- Add pre-compiled gpu_test binary to package data
- Include binary documentation (README.md)
- Add MANIFEST.in to ensure binary is included in source distribution
- Update pyproject.toml: Add package-data configuration
- Update setup.py: Add package_data for setuptools
- Guarantees binary availability when installing from any git branch
- Add test_gpu_fitness.py: Unit tests for GPU fitness check logic
- Add test_gpu_fitness_integration.py: Integration tests with mocked binary
- Add mock_gpu_test fixture: Simulated GPU test binary for testing
- Covers binary path resolution, output parsing, error handling
- Tests both native binary and fallback code paths
- Set RUNPOD_SKIP_GPU_CHECK env var in subprocess calls
- Improves benchmark consistency by avoiding GPU check overhead
- Enhances output parsing to handle debug messages gracefully
- Ensures performance tests measure import time, not GPU detection
- Maintains benchmark reliability across GPU and CPU-only systems
- Update worker_fitness_checks.md: Add GPU memory allocation test section
- Add gpu_binary_compilation.md: Complete guide to building gpu_test binary
- Document auto-registration, configuration, and performance characteristics
- Include troubleshooting guide and deployment examples
- Provide version compatibility matrix for CUDA and NVML
- Move auto-registration from module import time to first run of run_fitness_checks()
- Prevents circular import issues where rp_gpu_fitness couldn't import RunPodLogger
- Use _ensure_gpu_check_registered() guard to register GPU check once on demand
- Maintains same functionality but with proper import ordering
RunPodLogger uses warn() method, not warning(). Update both files to use
the correct method name to prevent AttributeError at runtime.
Change log.warning() to log.warn() for RunPodLogger API consistency.
This was causing AttributeError during worker startup.
Change from relative import .rp_cuda to correct path ..utils.rp_cuda
Implements 5 new automatic fitness checks:
- Memory availability (default: 4GB minimum)
- Disk space (default: 10GB minimum)
- Network connectivity (8.8.8.8:53)
- CUDA version validation (GPU workers only)
- GPU compute benchmark (GPU workers only)

All checks run automatically at worker startup; the worker exits immediately if any fail.
- Add _ensure_system_checks_registered() function
- Register system checks when run_fitness_checks() starts
- Add _reset_registration_state() for testing
- Add RUNPOD_SKIP_AUTO_SYSTEM_CHECKS environment variable for tests
Adds comprehensive CUDA initialization check that verifies:
- Device initialization succeeds (not just available)
- Each device has accessible memory
- Tensor allocation works on all devices
- Fallback support for PyTorch and CuPy

Catches runtime CUDA initialization failures early at worker startup
instead of during job processing. Includes 7 new tests covering:
- Successful PyTorch/CuPy initialization
- Device count validation
- Memory accessibility checks
- Device allocation failures
- Graceful fallback behavior
Adds comprehensive documentation for the new CUDA initialization check:
- Explains what the check validates
- Shows expected log output
- Provides failure scenario example
- Updates check summary to list all 6 checks (3 base + 3 GPU)
- Clarifies check runs after CUDA version check
Lowers the default disk space threshold from 10GB to 1GB to accommodate
smaller container environments while still catching critically low disk
conditions. Users can override with RUNPOD_MIN_DISK_GB environment
variable if more space is required.

- Default: RUNPOD_MIN_DISK_GB=1.0 (was 10.0)
- Configurable: RUNPOD_MIN_DISK_GB=20.0 for higher requirements
Prevents a confusing error message from appearing in logs on CPU-only
serverless endpoints. Changed subprocess.check_output() to use
stderr=subprocess.DEVNULL instead of shell=True, which also improves
security by avoiding shell injection risks.

Change:
- subprocess.check_output('nvidia-smi', shell=True)
+ subprocess.check_output(['nvidia-smi'], stderr=subprocess.DEVNULL)

This ensures the shell error '/bin/sh: 1: nvidia-smi: not found' does
not leak to the logs when GPU detection runs on non-GPU workers.
Behavior is unchanged - still returns True/False as before.
- Fixed nvidia-smi fallback parsing incorrect CUDA version
- Changed from querying --query-gpu=driver_version (returns 500+)
- Now parses 'CUDA Version: X.Y' from nvidia-smi standard output
- This fixes confusing logs that showed driver version instead of CUDA version
- Updated test to use realistic nvidia-smi output format
- Added edge case tests for version extraction validation
- Ensures workers report correct CUDA version (11.x-12.x not 500+)
- Changed from static 1GB minimum to percentage-based check
- Default: 10% of total disk must be free (scales with disk size)
- Configurable via RUNPOD_MIN_DISK_PERCENT environment variable
- Benefits:
  - Small disks (50GB): ~5GB free required
  - Large disks (1TB): ~100GB free required
- Automatically scales to machine resources
- Updated tests to verify percentage-based logic
…dation

- Changed from MIN_DISK_GB to MIN_DISK_PERCENT configuration
- Document 10% default scaling across different disk sizes
- Add examples showing 100GB, 1TB, and 10TB scaling
- Update configuration examples in dockerfile and Python
- Explain automatic scaling benefits
- Simplified disk check to only verify root (/) filesystem
- In containers, /tmp is just a subdirectory of / (same disk)
- Eliminates duplicate log messages with identical results
- Updated tests to verify only root filesystem is checked
- Updated documentation to reflect container behavior
- Reduces check overhead with single disk_usage() call
github-advanced-security bot left a comment

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

The test assertions were expecting the old API (shell=True) but the
implementation was changed to use a list and stderr=subprocess.DEVNULL.
Updated test assertions to match the actual implementation.
deanq requested a review from Copilot December 19, 2025 00:47

Copilot AI left a comment


Pull request overview

This PR implements a comprehensive fitness check system for worker startup validation, consisting of user-defined custom checks and 7 built-in system resource checks (memory, disk, network, and GPU-specific validations). The system validates worker readiness before accepting jobs, with intelligent thresholds that scale automatically and clear production logging.

Key Changes:

  • Added user-facing decorator API (@runpod.serverless.register_fitness_check) for custom health checks
  • Implemented 7 automatic system checks (3 for all workers, 4 additional for GPU workers)
  • Included GPU test binary with native CUDA memory allocation validation
  • Comprehensive documentation and examples for user-defined checks

Reviewed changes

Copilot reviewed 28 out of 31 changed files in this pull request and generated 4 comments.

Show a summary per file

| File | Description |
| --- | --- |
| runpod/serverless/modules/rp_fitness.py | Core fitness check registration and execution system |
| runpod/serverless/modules/rp_system_fitness.py | Built-in system resource checks (memory, disk, network, CUDA) |
| runpod/serverless/modules/rp_gpu_fitness.py | GPU-specific health checks using native binary and fallback |
| runpod/serverless/__init__.py | Exported register_fitness_check to public API |
| runpod/serverless/worker.py | Integrated fitness checks into worker startup flow |
| docs/serverless/worker_fitness_checks.md | Complete user documentation with examples |
| tests/test_serverless/test_modules/test_fitness.py | Unit tests for core fitness system |
| tests/test_serverless/test_modules/test_system_fitness.py | Unit tests for system resource checks |
| tests/test_serverless/test_modules/test_gpu_fitness.py | Unit tests for GPU check system |
| build_tools/gpu_test.c | Native CUDA binary source for GPU memory testing |


- Use configurable GPU_BENCHMARK_TIMEOUT instead of hardcoded 100ms threshold
- Change test assertion from >= to == to prevent accidental check registration
- Remove empty TestFallbackExecution class (covered by integration tests)
- Make error message limit configurable via RUNPOD_GPU_MAX_ERROR_MESSAGES env var

All 83 fitness check tests pass with these changes.
```python
_system_checks_registered = True
return

_system_checks_registered = True
```

Check notice (Code scanning / CodeQL): Unused global variable. The global variable '_system_checks_registered' is not used.
- Remove unused 'inspect' import from rp_gpu_fitness.py
- Remove unused 'call' import from test files (test_fitness.py, test_gpu_fitness.py, test_system_fitness.py)
- Add explanatory comment to empty except clause in rp_gpu_fitness.py

All 83 fitness check tests pass with these changes.
deanq marked this pull request as ready for review December 19, 2025 02:02
Eliminate unnecessary nvidia-smi call in _run_gpu_test_fallback(). The function
was calling is_available() before immediately trying nvidia-smi --list-gpus,
resulting in redundant GPU detection. Direct attempt to list GPUs handles all
failure cases without the pre-check.

Also clean up ambiguous variable name 'l' → 'line' in list comprehension.
- Add explanatory comment to empty except clause in gpu_test output parsing
- Change test assertion from >= to == to catch regressions in error detection

These changes address CodeQL and Copilot feedback on PR #472 to improve code clarity
and test assertion specificity.
deanq requested a review from Copilot December 19, 2025 04:20

Copilot AI left a comment


Pull request overview

Copilot reviewed 28 out of 31 changed files in this pull request and generated 1 comment.



- Remove unused asyncio import from rp_fitness.py
- Remove unused Any import from rp_system_fitness.py
- Remove unused imports from test files (os from test_fitness.py, os and subprocess from test_system_fitness.py, _run_gpu_test_fallback from test_gpu_fitness.py)
- Add explanatory comments to empty except blocks in test_gpu_fitness_integration.py fixture cleanup
- Test assertion already uses == 6 (previously addressed)
- Remove unused mock_wait_for variable assignment in test_fitness_check_with_timeout
- Remove unused mock_exec variable assignment in test_gpu_check_runs_in_correct_order

These variables were captured but never used in the test logic.
- CLI: Add explicit re-exports for config, ssh, and get_pod_ssh_ip_port (F401)
- Pod commands: Replace bare except with specific OSError handling (E722)
- FastAPI: Remove duplicate Job import from worker_state (F811)
- RPC Job: Use isinstance() instead of type() for dict comparison (E721)
- Llama2 template: Add ruff noqa directive for template placeholder code (F821)
- Download tests: Rename duplicate test_download_file to test_download_file_with_content_disposition (F811)

All ruff checks now pass without errors.

```python
# Auto-detect async vs sync using inspect
if inspect.iscoroutinefunction(check_func):
    await check_func()
```

Maybe not something for right now, but I feel like it'd be nice to have some instrumentation for timing with these - like out of the box, debug logs will display how long each fitness check took.
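
One way to get that out of the box would be a small timing wrapper around each check; a hypothetical sketch, not part of this PR:

```python
import inspect
import time

async def run_check_timed(check):
    """Run one fitness check and log how long it took (debug level)."""
    start = time.monotonic()
    if inspect.iscoroutinefunction(check):
        await check()
    else:
        check()
    elapsed_ms = (time.monotonic() - start) * 1000
    print(f"[debug]{check.__name__} took {elapsed_ms:.0f}ms")
```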
