Skip to content

Cloud Runner Improvements - LTS Candidate - S3 Locking, Aws Local Stack (Pipelines), Testing Improvements, Rclone storage support, Provider plugin system#731

Open
frostebite wants to merge 234 commits intomainfrom
cloud-runner-develop
Open

Cloud Runner Improvements - LTS Candidate - S3 Locking, Aws Local Stack (Pipelines), Testing Improvements, Rclone storage support, Provider plugin system#731
frostebite wants to merge 234 commits intomainfrom
cloud-runner-develop

Conversation

@frostebite
Copy link
Member

@frostebite frostebite commented Sep 8, 2025

Summary

Major improvements to Cloud Runner with LocalStack support, rclone storage provider, dynamic provider plugin system, and enhanced CI testing capabilities.

I have contacted LocalStack to regain access to ECS mocking functionality again, but for now mocking myself with local-docker for AWS workflows.

Changes

New Features

  • LocalStack Support: Full support for LocalStack (local AWS emulator) with per-service endpoint configuration
  • Rclone Storage Provider: Experimental storage backend using rclone, enabling any rclone-supported remote (S3, GCS, Azure, SFTP, etc.)
  • Provider Plugin System: Dynamic provider loading from GitHub repositories, local paths, or NPM packages
  • aws-local CI Mode: AWS_FORCE_PROVIDER=aws-local validates AWS CloudFormation templates while executing via local-docker (no LocalStack Pro required)
  • Resource Tracking: New diagnostic feature for monitoring disk usage and container resource allocations
  • Windows Support: Windows execution path for local runs
  • Fully self-hosted K8s and AWS testing: via k3ds and LocalStack AWS

New Action Inputs

Input Description
resourceTracking Enable disk usage and allocation logging
awsEndpoint Base AWS endpoint (for LocalStack)
awsCloudFormationEndpoint CloudFormation-specific endpoint override
awsEcsEndpoint ECS-specific endpoint override
awsKinesisEndpoint Kinesis-specific endpoint override
awsCloudWatchLogsEndpoint CloudWatch Logs-specific endpoint override
awsS3Endpoint S3-specific endpoint override
storageProvider Storage backend: s3 (default) or rclone
rcloneRemote Rclone remote path (e.g., myremote:bucket/path)
cloneDepth Git clone depth for repository (default: 50, use 0 for full clone)
cloudRunnerRepoName Unity builder repo name (default: game-ci/unity-builder, useful for forks)

Improvements

  • AWS Client Factory: Centralized factory for AWS clients with endpoint configuration
  • Container Endpoint Transformation: Automatic localhost → host.docker.internal / container hostname conversion for AWS/K8s containers
  • Kubernetes Endpoint Normalization: Pods can reach LocalStack via shared Docker network
  • Retry/Backoff: More robust AWS interactions with retry logic
  • Logging & Caching: Enhanced logging and caching mechanisms
  • S3-based Workspace Locking: Shared workspace coordination via S3

Bug Fixes (includes #686)

  • AWS Secrets Fix: CloudFormation now works correctly when no secrets are specified (previously failed with empty Secrets: block)
  • Image Version Format: imageRollingVersion now supports dot versions (e.g., "3.1.0")

Testing

  • New integrity workflows for K8s (k3d) and AWS (LocalStack) validation
  • Rclone integration tests with LocalStack S3 backend
  • CI-specific Jest configuration (jest.ci.config.js)
  • Removed legacy Cloud Runner CI pipeline

Documentation

  • Provider loader guide (src/model/cloud-runner/providers/README.md)

CI Testing Modes

Mode Environment Variable Behavior
Full AWS AWS_FORCE_PROVIDER=aws Uses AWS provider with LocalStack (requires Pro for ECS)
AWS-Local AWS_FORCE_PROVIDER=aws-local Validates CloudFormation templates, executes via local-docker
Auto (unset) Auto-detects LocalStack and falls back to local-docker

Related PRs

Checklist

  • Read the contribution guide and accept the code of conduct
  • Docs (If new inputs or outputs have been added or changes to behavior that should be documented. Please make a PR in the documentation repo)
  • Readme (updated or not needed)
  • Tests (added, updated or not needed)

Summary by CodeRabbit

  • New Features

    • Resource-tracking/diagnostics, LocalStack testing support, rclone-backed storage, dynamic provider loading, configurable clone depth and builder repo.
  • Enhancements

    • Safer caching with disk-pressure guards, improved CI/test workflows and Jest CI script, stronger Kubernetes log/diagnostics, Windows command handling, more resilient remote/git operations.
  • Chores

    • Removed legacy CI pipeline and cleaned up related workflows.

✏️ Tip: You can customize this high-level summary in your review settings.

- Implemented a primary attempt to pull LFS files using GIT_PRIVATE_TOKEN.
- Added a fallback mechanism to use GITHUB_TOKEN if the initial attempt fails.
- Configured git to replace SSH and HTTPS URLs with token-based authentication for the fallback.
- Improved error handling to log specific failure messages for both token attempts.

This change ensures more robust handling of LFS file retrieval in various authentication scenarios.
- Added permissions for packages, pull-requests, statuses, and id-token to enhance workflow capabilities.
- This change improves the CI pipeline's ability to manage pull requests and access necessary resources.
…ation

- Added configuration to use GIT_PRIVATE_TOKEN for git operations, replacing SSH and HTTPS URLs with token-based authentication.
- Improved error handling to ensure GIT_PRIVATE_TOKEN availability before attempting to pull LFS files.
- This change streamlines the process of pulling LFS files in environments requiring token authentication.
…entication

- Enhanced the process of configuring git to use GIT_PRIVATE_TOKEN and GITHUB_TOKEN by clearing existing URL configurations before setting new ones.
- Improved the clarity of the URL replacement commands for better readability and maintainability.
- This change ensures a more robust setup for pulling LFS files in environments requiring token authentication.
… pipeline

- Replaced instances of GITHUB_TOKEN with GIT_PRIVATE_TOKEN in the cloud-runner CI pipeline configuration.
- This change ensures consistent use of token-based authentication across various jobs in the workflow, enhancing security and functionality.
…L unsetting

- Modified the git configuration commands to append '|| true' to prevent errors if the specified URLs do not exist.
- This change enhances the reliability of the URL clearing process in the RemoteClient class, ensuring smoother execution during token-based authentication setups.
…tion

- Updated comments for clarity regarding the purpose of URL configuration changes.
- Simplified the git configuration commands by removing redundant lines while maintaining functionality for HTTPS token-based authentication.
- This change enhances the readability and maintainability of the RemoteClient class's git setup process.
# Conflicts:
#	dist/index.js
#	dist/index.js.map
#	jest.config.js
#	yarn.lock
…logs; tests: retained workspace AWS assertion (#381)
…nd log management; update builder path logic based on provider strategy
…sed on provider strategy and credentials; update binary files
…ained markers; hooks: include AWS S3 hooks on aws provider
…t:ci script; fix(windows): skip grep-based version regex tests; logs: echo CACHE_KEY/retained markers; hooks: include AWS hooks on aws provider
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/model/cloud-runner/cloud-runner.ts`:
- Around line 31-33: Add a blank line before comment blocks to satisfy ESLint:
insert an empty line immediately above the comment starting with "When true,
validates AWS CloudFormation templates even when using local-docker execution"
that precedes the static property validateAwsTemplates, and likewise add a blank
line before the other comment block reported by ESLint later in the file (the
one around the second flagged comment). Ensure both comment blocks have a
preceding empty line so the linter no longer reports "Expected line before
comment."
- Around line 44-45: ResourceTracking.logAllocationSummary('setup') is not
best-effort and can throw (e.g., if CloudRunner.buildParameters is undefined or
CloudRunnerLogger.log errors), potentially failing setup; wrap the call in a
try/catch and log the caught error without rethrowing so setup continues (mirror
the behavior of ResourceTracking.logDiskUsageSnapshot which uses runAndLog), or
alternatively wrap both ResourceTracking.logAllocationSummary and
ResourceTracking.logDiskUsageSnapshot in a single try/catch to ensure any
logging failures don’t abort setup; locate calls to
ResourceTracking.logAllocationSummary and CloudRunner.logDiskUsageSnapshot in
cloud-runner.ts and add the try/catch around
ResourceTracking.logAllocationSummary (and optionally both) using
CloudRunnerLogger.log or process logger to record the error.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @.github/workflows/cloud-runner-integrity.yml:
- Around line 836-840: Replace the insecure "curl https://rclone.org/install.sh
| sudo bash" step in the "Install and configure rclone with LocalStack S3"
workflow step by pinning a specific rclone release and installing it explicitly:
define a pinned VERSION variable (e.g., RCLONE_VERSION), download the
corresponding release artifact (tarball or zip) from rclone's official GitHub
releases URL, verify its checksum or signature, extract the rclone binary and
move it to /usr/local/bin with sudo, then run "rclone version" to confirm;
remove any use of the remote install script and ensure the workflow documents
the pinned version and verification step.
🧹 Nitpick comments (1)
.github/workflows/cloud-runner-integrity.yml (1)

51-54: Consider updating kubectl to a more recent version.

kubectl v1.31.0 is from April 2024. The current stable release is v1.35.0 (December 2025), which provides improved Kubernetes compatibility and support for more recent features.

Proposed fix
       - name: Set up kubectl
         uses: azure/setup-kubectl@v4
         with:
-          version: 'v1.31.0'
+          version: 'v1.35.0'

frostebite and others added 2 commits January 28, 2026 06:48
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update kubectl to v1.34.1 (latest stable)
- Add provider documentation explaining what a provider is
- Fix typo: "versions" -> "tags" in best practices

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@frostebite
Copy link
Member Author

Addressed review feedback from @GabLeRoux in commit 4393f04:

  • ✅ Updated kubectl to v1.34.1 (latest stable)
  • ✅ Added provider documentation explaining what a provider is and why it's useful
  • ✅ Fixed text: 'Use specific branches or tags' (was 'versions')

Note: The setup-localstack action suggestion doesn't apply - we're using docker directly to start LocalStack, not the action.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Fix all issues with AI agents
In @.github/workflows/cloud-runner-integrity.yml:
- Around line 844-854: The heredoc fed to cat via the line "cat >
~/.config/rclone/rclone.conf << 'EOF'" is indented by the YAML file and
therefore writes leading spaces into rclone.conf; replace the indented heredoc
with a non-indented literal (so EOF and the block start at column 0) or write
the same content using a whitespace-stripping approach such as piping through
sed (e.g. generate the same block after the current heredoc but run sed -e
's/^[[:space:]]*//' before redirecting) or use printf to emit each config line
into ~/.config/rclone/rclone.conf to ensure no leading spaces are written.

In `@src/model/cloud-runner/providers/README.md`:
- Around line 200-208: Add a new "Security Considerations" subsection
immediately after the existing "Error Handling" section that warns about
executing third‑party code and provides concrete mitigations; include guidance
to verify sources, pin versions/tags/commit SHAs, audit external provider code,
run builds with least privilege and isolation, monitor dependency advisories,
and prefer built‑in providers for production, and reference
ProviderLoader.loadProvider usage as an example to illustrate pinning (e.g.,
prefer explicit version specifiers like user/repo@v1.2.3 over default branch
loading).
- Around line 216-222: The README is inconsistent: best practice `#1` requires
specifying a branch/tag but the examples call
ProviderLoader.loadProvider('https://github.com/user/my-provider',
buildParameters) and ProviderLoader.loadProvider('user/repo', buildParameters)
without one; update those example calls to include an explicit branch or tag
argument (e.g., pass the branch/tag param to ProviderLoader.loadProvider) and
align the prose to state that specifying a branch or tag is required (or, if you
prefer to keep optional, change the best-practice text to “prefer”/“recommend”).
Ensure you update both example occurrences and the Best Practices item so
ProviderLoader.loadProvider usage and guidance are consistent.
- Around line 26-32: The TypeScript examples import ProviderLoader using a
relative path ('./provider-loader') which is incorrect for users importing
externally; update all occurrences to use the package export (e.g., import
ProviderLoader from the package entry or the full module path such as
'@game-ci/unity-builder/model/cloud-runner/providers/provider-loader') so
examples resolve for external consumers; search for examples referencing
'./provider-loader' and replace them with the package import, ensuring examples
that show ProviderLoader.loadProvider (and variables like
awsProvider/k8sProvider) keep the same usage but with the corrected import path.
- Around line 99-135: The README example uses local import paths that won't
resolve for external consumers and doesn't explicitly state the required export
pattern for dynamic loading; update the snippet to import ProviderInterface and
BuildParameters from the published package (e.g., import { ProviderInterface,
BuildParameters } from '@game-ci/unity-builder') instead of
'./provider-interface' and './build-parameters', and explicitly note/use "export
default class MyProvider implements ProviderInterface" so the runtime can
dynamically load the provider (referencing the MyProvider class,
ProviderInterface, and BuildParameters names in the README).
🧹 Nitpick comments (4)
src/model/cloud-runner/providers/README.md (3)

5-5: Consider removing the hardcoded branch from the GitHub link.

The link to ProviderInterface uses a hardcoded main branch. If this file is being added to a different branch or the repository's default branch changes, the link may break.

📝 Suggested improvement
-A **provider** is a pluggable backend that Cloud Runner uses to run builds and workflows. Examples include **AWS**, **Kubernetes**, or local execution. Each provider implements the [ProviderInterface](https://github.com/game-ci/unity-builder/blob/main/src/model/cloud-runner/providers/provider-interface.ts), which defines the common lifecycle methods (setup, run, cleanup, garbage collection, etc.).
+A **provider** is a pluggable backend that Cloud Runner uses to run builds and workflows. Examples include **AWS**, **Kubernetes**, or local execution. Each provider implements the [ProviderInterface](../../cloud-runner/providers/provider-interface.ts), which defines the common lifecycle methods (setup, run, cleanup, garbage collection, etc.).

Alternatively, use a branch-agnostic GitHub URL format or a relative path to keep the link valid across branches.


162-165: Remove or update the hardcoded provider list.

The example shows a specific list of built-in providers that may become outdated as the codebase evolves. This could mislead developers if new providers are added or existing ones are removed.

📝 Suggested improvements

Option 1: Remove the specific list and let the code speak for itself:

 ```typescript
 // Get list of built-in providers
 const providers = ProviderLoader.getAvailableProviders();
-console.log(providers); // ['aws', 'k8s', 'test', 'local-docker', 'local-system', 'local']
+console.log(providers); // Returns array of available built-in provider names

Option 2: Add a note indicating the list may vary:
```diff
 ```typescript
 // Get list of built-in providers
 const providers = ProviderLoader.getAvailableProviders();
-console.log(providers); // ['aws', 'k8s', 'test', 'local-docker', 'local-system', 'local']
+console.log(providers); // Example: ['aws', 'k8s', 'test', 'local-docker', 'local-system', 'local']
+                        // Note: Actual list may vary depending on the version
</details>

---

`22-78`: **Add error handling examples to usage section.**

Best practice `#2` (line 219) recommends wrapping provider loading in try-catch blocks, but none of the usage examples (lines 22-78) demonstrate this pattern. Given the complexity of loading from multiple sources and potential failure modes, showing error handling would be valuable.


<details>
<summary>📝 Suggested addition</summary>

Add an error handling example after line 78:

```markdown
### Error Handling Example

```typescript
import { ProviderLoader } from '@game-ci/unity-builder';

async function loadProviderSafely(source: string, buildParameters: any) {
  try {
    const provider = await ProviderLoader.loadProvider(source, buildParameters);
    return provider;
  } catch (error) {
    console.error(`Failed to load provider from ${source}:`, error.message);
    
    // Fallback to local provider
    console.log('Falling back to local provider...');
    return await ProviderLoader.loadProvider('local', buildParameters);
  }
}

// Usage
const provider = await loadProviderSafely('user/custom-provider@main', buildParameters);
</details>

</blockquote></details>
<details>
<summary>.github/workflows/cloud-runner-integrity.yml (1)</summary><blockquote>

`299-324`: **Consider extracting repeated cleanup logic into a reusable script or composite action.**

The K8s cleanup logic (crictl cleanup, PVC/PV deletion, secret cleanup) is duplicated across ~15 steps. While some duplication in workflows is acceptable, this volume creates maintenance burden. Consider extracting to a shell script in the repository that can be called from each cleanup step.

<details>
<summary>Example approach</summary>

Create `.github/scripts/k8s-cleanup.sh`:
```bash
#!/bin/bash
# K8s test cleanup script
kubectl delete jobs --all --ignore-not-found=true -n default || true
# ... rest of cleanup logic

Then in workflow:

      - name: Clean up after test
        if: always()
        run: .github/scripts/k8s-cleanup.sh

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@src/model/build-parameters.ts`:
- Line 213: The cloneDepth assignment should validate and sanitize
CloudRunnerOptions.cloneDepth before use: parse it with Number.parseInt, then if
Number.isNaN(parsed) or parsed < 0 (or not an integer) replace it with a
sensible default (e.g. a DEFAULT_CLONE_DEPTH constant or 1); update the
cloneDepth property assignment in build-parameters.ts (the cloneDepth field and
CloudRunnerOptions.cloneDepth reference) to use the validated/fallback value so
downstream git operations never receive NaN or a negative depth.

In `@src/model/cloud-runner/providers/aws/aws-cloud-formation-templates.ts`:
- Around line 22-26: The getSecretDefinitionTemplate function currently returns
a snippet starting with the top-level "Secrets:" key which causes duplicate YAML
keys when called per-secret; change getSecretDefinitionTemplate to return only
the list item block (the "- Name: '...'\n  ValueFrom: !Ref ...") without the
"Secrets:" header, and update the caller loop that inserts these snippets to
either (a) create a single "Secrets:" header once and concatenate all list-item
snippets under it, or (b) emit the header on the first insertion only and append
subsequent list items — locate getSecretDefinitionTemplate to modify its return
string and the code that calls insertAtTemplate('p3 - container def', ...) to
ensure the "Secrets:" header is produced exactly once.

In `@src/model/image-tag.ts`:
- Line 41: Validate containerRegistryImageVersion before assigning to
this.imageRollingVersion by checking it matches a Docker tag-safe pattern (e.g.,
start with an alphanumeric/underscore and only contain alphanumerics, dots,
underscores or dashes, max length ~128) and reject values containing disallowed
characters like '/' or spaces; update the assignment site (the code where
this.imageRollingVersion = containerRegistryImageVersion) to perform this check
and throw or return a clear error when the value is invalid so failures are
explicit and early.
🧹 Nitpick comments (2)
src/model/cloud-runner/remote-client/index.ts (2)

213-213: Simplify async return statement.

The function is already async, so wrapping the return value in a Promise constructor is unnecessary. This can be simplified.

♻️ Suggested simplification
-    return new Promise((result) => result(``));
+    return ``;

309-311: Address static analysis hints for naming and formatting.

ESLint flags the variable name depthArg (should be depthArgument) and Prettier flags the line formatting.

♻️ Suggested fix
-      const depthArg = CloudRunnerOptions.cloneDepth !== '0' ? `--depth ${CloudRunnerOptions.cloneDepth}` : '';
+      const depthArgument = CloudRunnerOptions.cloneDepth !== '0' ? `--depth ${CloudRunnerOptions.cloneDepth}` : '';
       await CloudRunnerSystem.Run(
-        `git clone ${depthArg} ${CloudRunnerFolders.targetBuildRepoUrl} ${path.basename(CloudRunnerFolders.repoPathAbsolute)}`.trim(),
+        `git clone ${depthArgument} ${CloudRunnerFolders.targetBuildRepoUrl} ${path.basename(
+          CloudRunnerFolders.repoPathAbsolute,
+        )}`.trim(),
       );

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/model/cloud-runner/providers/aws/index.ts (1)

163-171: Unconditional wait for cleanup stack deletion when it may not exist.

The DeleteStackCommand for the cleanup stack (line 151) is only sent when CloudRunnerOptions.useCleanupCron is true, but this waitUntilStackDeleteComplete for the cleanup stack is always executed. If useCleanupCron is false, the cleanup stack was never created, so this wait will either fail or timeout unnecessarily.

🔧 Proposed fix
     await waitUntilStackDeleteComplete(
       {
         client: CF,
         maxWaitTime: stackWaitTimeSeconds,
       },
       {
         StackName: taskDef.taskDefStackName,
       },
     );
+    if (CloudRunnerOptions.useCleanupCron) {
       await waitUntilStackDeleteComplete(
         {
           client: CF,
           maxWaitTime: stackWaitTimeSeconds,
         },
         {
           StackName: `${taskDef.taskDefStackName}-cleanup`,
         },
       );
+    }
src/model/cloud-runner/remote-client/index.ts (1)

363-363: Fix contradictory assertion message.

Line 363 asserts fs.existsSync(path.join('.git', 'lfs')) but the message says "LFS folder should not exist before caching". Either the assertion or message is incorrect.

If LFS folder should NOT exist at this point:

-    assert(fs.existsSync(path.join(`.git`, `lfs`)), 'LFS folder should not exist before caching');
+    assert(!fs.existsSync(path.join(`.git`, `lfs`)), 'LFS folder should not exist before caching');
🤖 Fix all issues with AI agents
In `@src/model/cloud-runner/providers/k8s/kubernetes-task-runner.ts`:
- Line 154: The checks inside KubernetesTaskRunner that inspect error?.message
should use the normalized errorMessage variable (set earlier) instead of
accessing error.message directly; update both occurrences (the condition near
the isRunning/continueStreaming branch and the later check around line 204) to
use a safe include against errorMessage (e.g. (errorMessage ||
'').includes('previous terminated container')) so string rejections from
CloudRunnerSystem.Run are handled consistently and won't silently fail.
🧹 Nitpick comments (12)
src/model/cloud-runner/tests/cloud-runner-s3-steps.test.ts (3)

22-24: Remove or replace trivial test with meaningful assertion.

This test asserts true === true, which provides no validation. If the intent is to verify the file loads without errors, the import/parse step already does that. Consider removing this test or replacing it with a meaningful assertion (e.g., verifying exported functions exist).


72-126: Consider extracting repeated credential setup in customJob YAML.

The AWS credential configuration block is duplicated three times across test-s3-pull-cache, test-s3-upload-cache, and test-s3-upload-build steps. This increases maintenance burden and risk of inconsistency.

Consider moving credential setup to a shared initialization step or relying on environment variables passed to containers rather than repeating aws configure in each step.


134-135: Remove redundant shouldRunS3 check.

This code path is only reachable when shouldRunS3 is true (guarded at line 43), making the inner check redundant.

-        // Only run S3 operations if environment supports it
-        if (shouldRunS3) {
+        // Run S3 verification operations
+        {
src/model/cloud-runner/providers/aws/aws-job-stack.ts (1)

21-30: Extract getStackWaitTime() to a shared module to avoid duplication.

This function and DEFAULT_STACK_WAIT_TIME_SECONDS are duplicated identically in three files: aws-job-stack.ts, aws-base-stack.ts, and index.ts. Consider extracting them to aws-client-factory.ts or a dedicated utility module.

♻️ Example consolidation

In aws-client-factory.ts:

const DEFAULT_STACK_WAIT_TIME_SECONDS = 600;

export function getStackWaitTime(): number {
  const overrideValue = Number(process.env.CLOUD_RUNNER_AWS_STACK_WAIT_TIME ?? '');
  if (!Number.isNaN(overrideValue) && overrideValue > 0) {
    return overrideValue;
  }

  return DEFAULT_STACK_WAIT_TIME_SECONDS;
}

Then import it in the other files:

import { getStackWaitTime } from './aws-client-factory';
src/model/cloud-runner/providers/aws/index.ts (1)

105-109: Consider removing or documenting the unused factory method calls.

getECS() and getKinesis() are called but their return values are discarded. If the intent is to pre-initialize clients for early failure detection, add a comment explaining this. Otherwise, remove these calls since the factory will lazily initialize clients when actually needed.

   ResourceTracking.logAllocationSummary('aws workflow');
   await ResourceTracking.logDiskUsageSnapshot('aws workflow (host)');
-  AwsClientFactory.getECS();
   const CF = AwsClientFactory.getCloudFormation();
-  AwsClientFactory.getKinesis();
src/model/cloud-runner/providers/k8s/index.ts (2)

162-200: Test cleanup block uses dynamic import unnecessarily.

The dynamic import of CloudRunnerSystem at line 168-170 is unnecessary since CloudRunnerSystem is already imported at the module level (line 19). While past reviews marked this as addressed, the dynamic import pattern persists. Use the existing module-level import directly.

♻️ Suggested fix
       if (process.env['cloudRunnerTests'] === 'true') {
         try {
           CloudRunnerLogger.log('Cleaning up old images in k3d node before pulling new image...');
-          const { CloudRunnerSystem: CloudRunnerSystemModule } = await import(
-            '../../services/core/cloud-runner-system'
-          );

           // Aggressive cleanup: remove stopped containers and non-Unity images
           // ... rest of code using CloudRunnerSystem instead of CloudRunnerSystemModule

And update line 188:

-              await CloudRunnerSystemModule.Run(cmd, true, true);
+              await CloudRunnerSystem.Run(cmd, true, true);

206-282: Redundant dynamic import in image cache validation block.

Similar to the cleanup block, line 208-210 dynamically imports CloudRunnerSystem when it's already available at module scope. This adds unnecessary overhead and complexity.

♻️ Suggested fix
         if (process.env['cloudRunnerTests'] === 'true' && image.includes('unityci/editor')) {
           try {
-            const { CloudRunnerSystem: CloudRunnerSystemModule2 } = await import(
-              '../../services/core/cloud-runner-system'
-            );
-
             // Check if image is cached on agent node (where pods run)
-            const agentImageCheck = await CloudRunnerSystemModule2.Run(
+            const agentImageCheck = await CloudRunnerSystem.Run(
               `docker exec k3d-unity-builder-agent-0 sh -c "crictl images | grep -q unityci/editor && echo 'cached' || echo 'not_cached'" || echo 'not_cached'`,

Apply similar changes to all CloudRunnerSystemModule2.Run calls in this block.

src/model/cloud-runner/tests/e2e/cloud-runner-end2end-retaining.test.ts (1)

141-170: Cleanup logic is thorough but duplicates earlier block.

The cache cleanup logic (lines 141-170) largely duplicates the workspace cleanup logic (lines 111-138). Consider extracting a shared helper function to reduce duplication.

♻️ Suggested refactor pattern
async function safeCleanupDirectory(dirPath: string, logger: typeof CloudRunnerLogger) {
  if (!fs.existsSync(dirPath)) return;
  
  try {
    logger.log(`Cleaning up: ${dirPath}`);
    await CloudRunnerSystem.Run(`chmod -R u+w ${dirPath} 2>/dev/null || true`);
    await CloudRunnerSystem.Run(`rm -rf ${dirPath} 2>/dev/null || true`);
    await CloudRunnerSystem.Run(`find ${dirPath} -type f -delete 2>/dev/null || true`);
    await CloudRunnerSystem.Run(`find ${dirPath} -type d -empty -delete 2>/dev/null || true`);
  } catch (error: any) {
    logger.log(`Failed to cleanup: ${error.message}`);
  }
}
src/model/cloud-runner/cloud-runner.ts (1)

156-168: Dynamic provider loading with fallback is acceptable but consider logging level.

The dynamic provider loading approach with fallback to LocalCloudRunner is reasonable for extensibility. However, silently falling back could mask configuration errors in production.

Consider using CloudRunnerLogger.logWarning instead of CloudRunnerLogger.log for the fallback message to make it more visible.

src/model/cloud-runner/tests/e2e/cloud-runner-end2end-caching.test.ts (1)

102-136: Cleanup logic duplicates the retaining test pattern.

This cleanup block is nearly identical to the one in cloud-runner-end2end-retaining.test.ts. Consider extracting a shared test utility for cache cleanup to reduce duplication across e2e tests.

src/model/cloud-runner/remote-client/index.ts (1)

397-400: Remove unconventional void 0 statement.

The void 0; statement at line 399 is unconventional for a no-op. The comment alone suffices, or use an empty block.

     } catch {
       /* no-op: best-effort git lfs pull without tokens may fail */
-      void 0;
     }
src/model/cloud-runner/providers/k8s/kubernetes-storage.ts (1)

61-75: Avoid creating redundant KubeConfig instances; reuse the initialized client's kubeconfig.

While both kubeClient and the independently created KubeConfig at line 62 use loadFromDefault() and thus load the same default cluster config, creating separate KubeConfig and StorageV1Api instances each time this method is called is inefficient and violates DRY.

Instead, add StorageV1Api to the Kubernetes class initialization (in index.ts), or pass the KubeConfig instance as a parameter to this method to reuse it. The same pattern is repeated at line 187.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/model/docker.ts`:
- Around line 58-60: The image detection logic (isAlpineBasedImage /
commandPrefix) misses tagged and registry-qualified refs (e.g., alpine:3.19 or
ghcr.io/rclone/rclone:1.66); update detection to normalize or pattern-match the
image string instead of exact equality: extract the repository/name portion or
use a regex to match (^(.*/)?alpine(:|$) or (^(.*/)?rclone/rclone(:|$)) so
images with tags or registry prefixes are detected as Alpine-based, then set
commandPrefix to /bin/sh accordingly; update the comment placement to include
the required blank line before it.
🧹 Nitpick comments (1)
src/model/cloud-runner/providers/docker/index.ts (1)

140-141: Add blank line before the comment to satisfy ESLint.

Static analysis reports a missing blank line before the comment.

🔧 Proposed fix
     const entrypointFilePath = `start.sh`;
+
     // Use #!/bin/sh for POSIX compatibility (Alpine-based images like rclone/rclone don't have bash)
     const fileContents = `#!/bin/sh

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@src/model/cloud-runner/providers/docker/index.ts`:
- Around line 97-106: The set endpointEnvironmentNames in
src/model/cloud-runner/providers/docker/index.ts is missing the INPUT_* variants
for CloudFormation, ECS, Kinesis, and CloudWatch Logs; update the Set used in
the localhost rewriting logic (symbol: endpointEnvironmentNames) to include
'INPUT_AWSCLOUDFORMATIONENDPOINT', 'INPUT_AWSECSENDPOINT',
'INPUT_AWSKINESISENDPOINT', and 'INPUT_AWSCLOUDWATCHLOGSENDPOINT', and apply the
same additions to the equivalent sets/logic in
src/model/cloud-runner/providers/k8s/kubernetes-job-spec-factory.ts and
src/model/cloud-runner/providers/aws/aws-task-runner.ts so containers using
INPUT_* endpoint env vars are rewritten to host.docker.internal consistently.

frostebite and others added 6 commits January 29, 2026 16:47
Reverts cosmetic changes that renamed workflow_id to workflowId in GitHub
API calls. The GitHub REST API uses workflow_id, so we keep the eslint
camelcase suppression comments to match the official API convention.

Also restores the getCheckStatus() method that was removed.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…s, versioning.test.ts

These files had changes unrelated to the Cloud Runner improvements PR goals.
Reverting to main branch state.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ovider

The rclone/rclone image is Alpine-based and only has /bin/sh, not /bin/bash.
This fixes exit code 127 errors when running rclone commands in containers.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The previous implementation fetched ALL PR refs with:
  git fetch origin +refs/pull/*:refs/remotes/origin/pull/*

This is extremely slow for repos with many PRs (700+ PRs in unity-builder).
Now fetches only the specific PR ref needed, e.g., for pull/731/merge:
  git fetch origin +refs/pull/731/merge:... +refs/pull/731/head:...

This should significantly speed up the Cloud Runner integrity tests.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tests are already covered by cloud-runner-integrity.yml

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@frostebite frostebite changed the title Cloud Runner Improvements - S3 Locking, Aws Local Stack (Pipelines), Testing Improvements, Rclone storage support, Provider plugin system Cloud Runner Improvements - LTS Candidate - S3 Locking, Aws Local Stack (Pipelines), Testing Improvements, Rclone storage support, Provider plugin system Jan 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants