Skip to content

Conversation

@mpryc
Copy link
Contributor

@mpryc mpryc commented Dec 3, 2025

Original PR: #2034

Author: Tiger Kaovilai <tkaovila@redhat.com>
Date:   Fri Nov 28 09:17:23 2025 -0500

    refactor: update runtime permissions in failure analysis documentation and scripts

    - Replace `--allowedTools` with `--add-dir` for granting directory access in the analysis script.
    - Enhance documentation to clarify the use of `--add-dir` and `--allowedTools` for bypassing sandbox CWD restrictions.
    - Ensure consistent usage of CLI flags across the `analyze_failures.sh` script and design documentation.

    These changes improve clarity and functionality in the failure analysis process, ensuring proper access to necessary directories during runtime.

    Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

Why the changes were made

How to test the changes made

Original PR: openshift#2034

Author: Tiger Kaovilai <tkaovila@redhat.com>
Date:   Fri Nov 28 09:17:23 2025 -0500

    refactor: update runtime permissions in failure analysis documentation and scripts

    - Replace `--allowedTools` with `--add-dir` for granting directory access in the analysis script.
    - Enhance documentation to clarify the use of `--add-dir` and `--allowedTools` for bypassing sandbox CWD restrictions.
    - Ensure consistent usage of CLI flags across the `analyze_failures.sh` script and design documentation.

    These changes improve clarity and functionality in the failure analysis process, ensuring proper access to necessary directories during runtime.

    Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 3, 2025

Walkthrough

Adds Claude integration for post-E2E failure analysis: new static permissions, docs and design, CI and Dockerfile changes to install Claude and Node, Makefile test-e2e changes to run analysis, a new analyze_failures.sh preprocessing + Claude invocation script, and an additional flake ignore pattern.

Changes

Cohort / File(s) Summary
Configuration
\.claude/config.json
New JSON permissions object with top-level allow and deny arrays listing permitted and denied commands/actions for Claude integration.
Documentation & Design
CLAUDE.md, docs/design/claude-prow-failure-analysis_design.md
Adds repository guidance and a comprehensive design doc covering Claude/Vertex AI integration, artifact preprocessing, prompts/output format, credential flows, security, rollout, and testing plans.
CI Makefile
Makefile
Updates test-e2e to capture test exit code, set Google/Vertex-related env vars conditionally on OPENSHIFT_CI, invoke analyze_failures.sh with the exit code, and exit with the original code.
CI Image
build/ci-Dockerfile
Switches to multi-arch kubectl install using ARG TARGETARCH, makes the analysis script executable, and installs Node.js plus the Claude CLI.
Failure Analysis Script
tests/e2e/scripts/analyze_failures.sh
New script that checks prerequisites, redacts secrets, preprocesses large artifacts (subagent or grep extraction), builds a structured Claude prompt, runs Claude with timeouts, saves redacted analysis to claude-failure-analysis.md, and exits with the original test exit code.
Flake Filtering
tests/e2e/lib/flakes.go
Adds an ignore pattern for a hypershift-related message to the errorIgnorePatterns list used in flake detection.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Areas requiring extra attention:

  • tests/e2e/scripts/analyze_failures.sh — preprocessing logic, redaction regexes, timeouts, fallback paths, and artifact handling.
  • Makefile — conditional credential handling, environment propagation, and correct exit-code behavior in CI.
  • build/ci-Dockerfile — multi-arch kubectl download correctness and Node.js/Claude CLI install steps.
  • \.claude/config.json — ensure permission entries align with runtime expectations and the analysis script's CLI usage.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between fa6ce87 and 416fc4c.

📒 Files selected for processing (1)
  • tests/e2e/lib/flakes.go (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • tests/e2e/lib/flakes.go
🔇 Additional comments (1)
tests/e2e/lib/flakes.go (1)

24-24: LGTM! Appropriate addition to error ignore patterns.

The new ignore pattern correctly filters an expected informational message when the hypershift plugin skips execution in non-hypershift scenarios. The error checking for HostedControlPlane CRD is anticipated behavior when hypershift is not installed or not being used, making this pattern suitable for the ignore list.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot requested review from sseago and weshayutin December 3, 2025 15:53
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 3, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (5)
tests/e2e/scripts/analyze_failures.sh (2)

56-72: Tighten large‑log preprocessing to avoid duplication and follow shell best practices

Two points worth adjusting here:

  1. Avoid masking get_file_size exit status (SC2155)
    Declaring and assigning file_size in one line can hide a failure from get_file_size. Prefer:
-    local file_size=$(get_file_size "$log_file")
+    local file_size
+    file_size=$(get_file_size "$log_file")
  1. Prevent double‑processing must‑gather logs
    The first find "${ARTIFACT_DIR}" ... walk will already include ${ARTIFACT_DIR}/must-gather/**, and then you loop over ${ARTIFACT_DIR}/must-gather again. That can duplicate work and inflate token usage for big runs. One simple fix is to exclude must-gather from the first search:
-        done < <(find "${ARTIFACT_DIR}" -maxdepth 4 -name "*.log" -type f 2>/dev/null | while read f; do
+        done < <(find "${ARTIFACT_DIR}" -maxdepth 4 -name "*.log" -type f ! -path "${ARTIFACT_DIR}/must-gather/*" 2>/dev/null | while read f; do

This keeps per‑test logs and must‑gather logs logically separated in the two loops and avoids redundant subagent calls.

Also applies to: 119-167


21-22: Make exit‑code handling and temp‑file cleanup more robust

The failure‑analysis behavior is correct, but a couple of small tweaks will harden the script:

  1. Initialize and quote EXIT_CODE defensively

Right now EXIT_CODE=$1 and later if [ $EXIT_CODE -ne 0 ]; then and exit $EXIT_CODE. This is fine when the caller always passes a numeric code, but it is easy to make the script more bullet‑proof:

-EXIT_CODE=$1
+EXIT_CODE=${1:-0}
...
-if [ $EXIT_CODE -ne 0 ]; then
+if [ "$EXIT_CODE" -ne 0 ]; then
...
-exit $EXIT_CODE
+exit "$EXIT_CODE"
  1. Use a quoted single‑quoted trap to avoid premature expansion (SC2064)
-    TEMP_OUTPUT=$(mktemp)
-    trap "rm -f $TEMP_OUTPUT" EXIT
+    TEMP_OUTPUT=$(mktemp)
+    trap 'rm -f "$TEMP_OUTPUT"' EXIT

This ensures the trap always refers to the actual temp file path and behaves correctly even if the variable were changed later.

Also applies to: 169-181, 188-193, 365-367, 372-373, 432-432

docs/design/claude-prow-failure-analysis_design.md (1)

629-648: Update test-e2e Makefile snippet to match the current implementation

The Makefile example here has diverged from the real test-e2e target (flag names, provider/credentials wiring, and CI‑only Claude setup). For someone copying this snippet, that mismatch will be confusing and could break local runs.

Recommend regenerating this section from the live Makefile (including the EXIT_CODE handling and analyze_failures.sh call) so the design doc remains an accurate reference.

build/ci-Dockerfile (1)

8-22: CI image wiring is correct; consider pinning Node/Claude versions

The changes correctly:

  • Ensure tests/e2e/scripts/analyze_failures.sh is executable for CI.
  • Switch kubectl installation to a multi‑arch flow using TARGETARCH.
  • Add Node.js and the global @anthropic-ai/claude-code CLI needed by the analysis script.

To improve reproducibility of CI images, consider pinning:

  • A specific Node.js major/minor (via a fixed NodeSource channel or explicit package version).
  • A specific @anthropic-ai/claude-code version (e.g. npm install -g @anthropic-ai/claude-code@<version>).

That will make it easier to audit and debug CI behavior over time when dependencies evolve.

Makefile (1)

860-886: EXIT_CODE capture and Claude invocation look good; optionally initialize EXIT_CODE explicitly

The updated test-e2e target correctly:

  • Runs ginkgo with the extended args (including $(HCP_EXTERNAL_ARGS)).
  • Captures the test exit status into EXIT_CODE via ... || EXIT_CODE=$$?.
  • In OpenShift CI, wires up Vertex AI env vars from /var/run/oadp-credentials/* and invokes analyze_failures.sh with the captured code.
  • Exits with the original test result so Prow status is preserved.

For extra robustness against odd environments where EXIT_CODE might already be set, you could initialize it in this recipe shell before running ginkgo:

test-e2e: test-e2e-setup install-ginkgo
	EXIT_CODE=0; \
	ginkgo run ... || EXIT_CODE=$$?; \
	...
	exit $$EXIT_CODE

The current behavior is still correct in normal CI usage; this is just a defensive hardening.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between fcd0541 and fa6ce87.

📒 Files selected for processing (6)
  • .claude/config.json (1 hunks)
  • CLAUDE.md (1 hunks)
  • Makefile (1 hunks)
  • build/ci-Dockerfile (1 hunks)
  • docs/design/claude-prow-failure-analysis_design.md (1 hunks)
  • tests/e2e/scripts/analyze_failures.sh (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • build/ci-Dockerfile
  • CLAUDE.md
  • tests/e2e/scripts/analyze_failures.sh
  • Makefile
  • docs/design/claude-prow-failure-analysis_design.md
🪛 LanguageTool
docs/design/claude-prow-failure-analysis_design.md

[style] ~12-~12: Consider a different adjective to strengthen your wording.
Context: ...analysis is time-consuming and requires deep domain knowledge of Velero, CSI snapsho...

(DEEP_PROFOUND)


[grammar] ~26-~26: Use a hyphen to join words.
Context: ...n't block test result reporting) ## Non Goals - Live cluster diagnostics during...

(QB_NEW_EN_HYPHEN)


[grammar] ~956-~956: Ensure spelling is correct
Context: ...ly more complex plumbing Decision: Chose Option B for better error isolation and...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~1075-~1075: Ensure spelling is correct
Context: ...original test code - Vertex AI timeout (>10min): Script logs timeout, preserves test r...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[style] ~1256-~1256: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase.
Context: ...ifact Sets Question: How to handle very large must-gather archives or many per-test l...

(EN_WEAK_ADJECTIVE)


[style] ~1259-~1259: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase.
Context: ...Claude Code CLI may truncate or fail on very large inputs - Some test runs generate extens...

(EN_WEAK_ADJECTIVE)

🪛 markdownlint-cli2 (0.18.1)
docs/design/claude-prow-failure-analysis_design.md

434-434: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


517-517: Headings must start at the beginning of the line

(MD023, heading-start-left)


527-527: Headings must start at the beginning of the line

(MD023, heading-start-left)


531-531: Headings must start at the beginning of the line

(MD023, heading-start-left)


532-532: Headings must start at the beginning of the line

(MD023, heading-start-left)


533-533: Headings must start at the beginning of the line

(MD023, heading-start-left)


534-534: Headings must start at the beginning of the line

(MD023, heading-start-left)


566-566: Code block style
Expected: fenced; Actual: indented

(MD046, code-block-style)


596-596: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


623-623: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3

(MD001, heading-increment)


632-632: Hard tabs
Column: 1

(MD010, no-hard-tabs)


633-633: Hard tabs
Column: 1

(MD010, no-hard-tabs)


634-634: Hard tabs
Column: 1

(MD010, no-hard-tabs)


635-635: Hard tabs
Column: 1

(MD010, no-hard-tabs)


636-636: Hard tabs
Column: 1

(MD010, no-hard-tabs)


637-637: Hard tabs
Column: 1

(MD010, no-hard-tabs)


638-638: Hard tabs
Column: 1

(MD010, no-hard-tabs)


639-639: Hard tabs
Column: 1

(MD010, no-hard-tabs)


640-640: Hard tabs
Column: 1

(MD010, no-hard-tabs)


641-641: Hard tabs
Column: 1

(MD010, no-hard-tabs)


642-642: Hard tabs
Column: 1

(MD010, no-hard-tabs)


643-643: Hard tabs
Column: 1

(MD010, no-hard-tabs)


644-644: Hard tabs
Column: 1

(MD010, no-hard-tabs)


645-645: Hard tabs
Column: 1

(MD010, no-hard-tabs)


646-646: Hard tabs
Column: 1

(MD010, no-hard-tabs)


647-647: Hard tabs
Column: 1

(MD010, no-hard-tabs)


757-757: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


788-788: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


821-821: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


850-850: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


873-873: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Shellcheck (0.11.0)
tests/e2e/scripts/analyze_failures.sh

[warning] 62-62: Declare and assign separately to avoid masking return values.

(SC2155)


[warning] 366-366: Use single quotes, otherwise this expands now rather than when signalled.

(SC2064)

🔇 Additional comments (3)
.claude/config.json (1)

1-45: Permissions policy matches intended read‑only, sandboxed usage

Allow/deny lists are coherent with the log‑analysis use case and correctly block destructive and networked actions (rm, curl/wget, kubectl apply/delete, docker, git push, WebSearch/WebFetch). No issues from a security/CI safety standpoint.

tests/e2e/scripts/analyze_failures.sh (1)

27-44: Overall flow, safety checks, and redaction look solid

The script’s structure is sound:

  • It gates analysis on a non‑zero test exit code and explicit CLI/Vertex configuration checks.
  • It keeps analysis read‑only and bounded via timeout (60s per subagent, 600s overall) and early exits that preserve the original test result.
  • All user‑visible Claude output is passed through redact_secrets, which substantially reduces credential‑leakage risk in the generated markdown.

No blocking issues here from a correctness or security perspective.

Also applies to: 169-181, 368-424

CLAUDE.md (1)

1-239: Claude integration overview is consistent with the implementation

The description of how Claude hooks into Prow (artifacts used, where claude-failure-analysis.md appears, and the opt‑out/credential behavior) matches the bash script and Makefile wiring. This should be a useful high‑level entry point for maintainers.

@weshayutin
Copy link
Contributor

/test 4.20-e2e-test-aws

@weshayutin
Copy link
Contributor

/test all

weshayutin
weshayutin previously approved these changes Dec 9, 2025
Copy link
Contributor

@weshayutin weshayutin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/LGTM

@weshayutin
Copy link
Contributor

/retest

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 9, 2025
@openshift-ci
Copy link

openshift-ci bot commented Dec 9, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mpryc, weshayutin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD f8b06e8 and 2 for PR HEAD fa6ce87 in total

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD 7a8aba0 and 1 for PR HEAD fa6ce87 in total

@weshayutin
Copy link
Contributor

/retest

@openshift-ci-robot
Copy link

/retest-required

Remaining retests: 0 against base HEAD f75a38d and 0 for PR HEAD fa6ce87 in total

@sseago
Copy link
Contributor

sseago commented Dec 15, 2025

/override ci/prow/4.20-e2e-test-cli-aws

@openshift-ci
Copy link

openshift-ci bot commented Dec 15, 2025

@sseago: Overrode contexts on behalf of sseago: ci/prow/4.20-e2e-test-cli-aws

Details

In response to this:

/override ci/prow/4.20-e2e-test-cli-aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kaovilai
Copy link
Member

/override ci/prow/4.20-e2e-test-cli-aws

@openshift-ci
Copy link

openshift-ci bot commented Dec 15, 2025

@kaovilai: Overrode contexts on behalf of kaovilai: ci/prow/4.20-e2e-test-cli-aws

Details

In response to this:

/override ci/prow/4.20-e2e-test-cli-aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@weshayutin
Copy link
Contributor

/test 4.20-e2e-test-hcp-aws

@openshift-ci-robot
Copy link

/hold

Revision fa6ce87 was retested 3 times: holding

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 15, 2025
@weshayutin weshayutin removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 15, 2025
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Dec 15, 2025
@openshift-ci
Copy link

openshift-ci bot commented Dec 15, 2025

New changes are detected. LGTM label has been removed.

@kaovilai
Copy link
Member

/retest

ai-retester: The e2e test for Mongo application DATAMOVER failed because the PersistentVolumeClaim binding failed during restore, causing the pod to never become ready and eventually time out. A volume conflict arose as a different Pod claimed the PVC from previous run.

comment for /pull/2038

@kaovilai
Copy link
Member

/retest

ai-retester: The end-to-end test e2e-test-aws failed because the Parks application Native-Snapshots subtest hit a timeout while waiting for the restify container to become ready after the restore. The analysis mentions further investigation for issues, relating it to Image pull issues, Init container failure, or etc. In general, there was one or more failure points for the backup and restore process to execute correctly for successful build result.

comment for /pull/2038

@openshift-ci
Copy link

openshift-ci bot commented Dec 16, 2025

@mpryc: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/4.20-e2e-test-cli-aws 416fc4c link true /test 4.20-e2e-test-cli-aws
ci/prow/4.20-e2e-test-aws 416fc4c link true /test 4.20-e2e-test-aws

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants