PR to analyze failures based on the Tigers PR. #2038

mpryc · 2025-12-03T15:53:35Z

Original PR: #2034

Author: Tiger Kaovilai <tkaovila@redhat.com>
Date:   Fri Nov 28 09:17:23 2025 -0500

    refactor: update runtime permissions in failure analysis documentation and scripts

    - Replace `--allowedTools` with `--add-dir` for granting directory access in the analysis script.
    - Enhance documentation to clarify the use of `--add-dir` and `--allowedTools` for bypassing sandbox CWD restrictions.
    - Ensure consistent usage of CLI flags across the `analyze_failures.sh` script and design documentation.

    These changes improve clarity and functionality in the failure analysis process, ensuring proper access to necessary directories during runtime.

    Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

Why the changes were made

How to test the changes made

Original PR: openshift#2034 Author: Tiger Kaovilai <tkaovila@redhat.com> Date: Fri Nov 28 09:17:23 2025 -0500 refactor: update runtime permissions in failure analysis documentation and scripts - Replace `--allowedTools` with `--add-dir` for granting directory access in the analysis script. - Enhance documentation to clarify the use of `--add-dir` and `--allowedTools` for bypassing sandbox CWD restrictions. - Ensure consistent usage of CLI flags across the `analyze_failures.sh` script and design documentation. These changes improve clarity and functionality in the failure analysis process, ensuring proper access to necessary directories during runtime. Signed-off-by: Tiger Kaovilai <tkaovila@redhat.com>

coderabbitai · 2025-12-03T15:53:46Z

Walkthrough

Adds Claude integration for post-E2E failure analysis: new static permissions, docs and design, CI and Dockerfile changes to install Claude and Node, Makefile test-e2e changes to run analysis, a new analyze_failures.sh preprocessing + Claude invocation script, and an additional flake ignore pattern.

Changes

Cohort / File(s)	Summary
Configuration `\.claude/config.json`	New JSON permissions object with top-level `allow` and `deny` arrays listing permitted and denied commands/actions for Claude integration.
Documentation & Design `CLAUDE.md`, `docs/design/claude-prow-failure-analysis_design.md`	Adds repository guidance and a comprehensive design doc covering Claude/Vertex AI integration, artifact preprocessing, prompts/output format, credential flows, security, rollout, and testing plans.
CI Makefile `Makefile`	Updates `test-e2e` to capture test exit code, set Google/Vertex-related env vars conditionally on `OPENSHIFT_CI`, invoke `analyze_failures.sh` with the exit code, and exit with the original code.
CI Image `build/ci-Dockerfile`	Switches to multi-arch `kubectl` install using `ARG TARGETARCH`, makes the analysis script executable, and installs Node.js plus the Claude CLI.
Failure Analysis Script `tests/e2e/scripts/analyze_failures.sh`	New script that checks prerequisites, redacts secrets, preprocesses large artifacts (subagent or grep extraction), builds a structured Claude prompt, runs Claude with timeouts, saves redacted analysis to `claude-failure-analysis.md`, and exits with the original test exit code.
Flake Filtering `tests/e2e/lib/flakes.go`	Adds an ignore pattern for a hypershift-related message to the `errorIgnorePatterns` list used in flake detection.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Areas requiring extra attention:

tests/e2e/scripts/analyze_failures.sh — preprocessing logic, redaction regexes, timeouts, fallback paths, and artifact handling.
Makefile — conditional credential handling, environment propagation, and correct exit-code behavior in CI.
build/ci-Dockerfile — multi-arch kubectl download correctness and Node.js/Claude CLI install steps.
\.claude/config.json — ensure permission entries align with runtime expectations and the analysis script's CLI usage.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between fa6ce87 and 416fc4c.

📒 Files selected for processing (1)

tests/e2e/lib/flakes.go (1 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

tests/e2e/lib/flakes.go

🔇 Additional comments (1)

tests/e2e/lib/flakes.go (1)

24-24: LGTM! Appropriate addition to error ignore patterns.

The new ignore pattern correctly filters an expected informational message when the hypershift plugin skips execution in non-hypershift scenarios. The error checking for HostedControlPlane CRD is anticipated behavior when hypershift is not installed or not being used, making this pattern suitable for the ignore list.

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (5)

tests/e2e/scripts/analyze_failures.sh (2)
56-72: Tighten large‑log preprocessing to avoid duplication and follow shell best practices

Two points worth adjusting here:

Avoid masking get_file_size exit status (SC2155)
Declaring and assigning file_size in one line can hide a failure from get_file_size. Prefer:
-    local file_size=$(get_file_size "$log_file")
+    local file_size
+    file_size=$(get_file_size "$log_file")
Prevent double‑processing must‑gather logs
The first find "${ARTIFACT_DIR}" ... walk will already include ${ARTIFACT_DIR}/must-gather/**, and then you loop over ${ARTIFACT_DIR}/must-gather again. That can duplicate work and inflate token usage for big runs. One simple fix is to exclude must-gather from the first search:
-        done < <(find "${ARTIFACT_DIR}" -maxdepth 4 -name "*.log" -type f 2>/dev/null | while read f; do
+        done < <(find "${ARTIFACT_DIR}" -maxdepth 4 -name "*.log" -type f ! -path "${ARTIFACT_DIR}/must-gather/*" 2>/dev/null | while read f; do
This keeps per‑test logs and must‑gather logs logically separated in the two loops and avoids redundant subagent calls.

Also applies to: 119-167

21-22: Make exit‑code handling and temp‑file cleanup more robust

The failure‑analysis behavior is correct, but a couple of small tweaks will harden the script:

Initialize and quote EXIT_CODE defensively

Right now EXIT_CODE=$1 and later if [ $EXIT_CODE -ne 0 ]; then and exit $EXIT_CODE. This is fine when the caller always passes a numeric code, but it is easy to make the script more bullet‑proof:
-EXIT_CODE=$1
+EXIT_CODE=${1:-0}
...
-if [ $EXIT_CODE -ne 0 ]; then
+if [ "$EXIT_CODE" -ne 0 ]; then
...
-exit $EXIT_CODE
+exit "$EXIT_CODE"
Use a quoted single‑quoted trap to avoid premature expansion (SC2064)
-    TEMP_OUTPUT=$(mktemp)
-    trap "rm -f $TEMP_OUTPUT" EXIT
+    TEMP_OUTPUT=$(mktemp)
+    trap 'rm -f "$TEMP_OUTPUT"' EXIT
This ensures the trap always refers to the actual temp file path and behaves correctly even if the variable were changed later.

Also applies to: 169-181, 188-193, 365-367, 372-373, 432-432
docs/design/claude-prow-failure-analysis_design.md (1)

629-648: Update test-e2e Makefile snippet to match the current implementation

The Makefile example here has diverged from the real test-e2e target (flag names, provider/credentials wiring, and CI‑only Claude setup). For someone copying this snippet, that mismatch will be confusing and could break local runs.

Recommend regenerating this section from the live Makefile (including the EXIT_CODE handling and analyze_failures.sh call) so the design doc remains an accurate reference.

build/ci-Dockerfile (1)

8-22: CI image wiring is correct; consider pinning Node/Claude versions

The changes correctly:

Ensure tests/e2e/scripts/analyze_failures.sh is executable for CI.

Switch kubectl installation to a multi‑arch flow using TARGETARCH.

Add Node.js and the global @anthropic-ai/claude-code CLI needed by the analysis script.

To improve reproducibility of CI images, consider pinning:

A specific Node.js major/minor (via a fixed NodeSource channel or explicit package version).

A specific @anthropic-ai/claude-code version (e.g. npm install -g @anthropic-ai/claude-code@<version>).

That will make it easier to audit and debug CI behavior over time when dependencies evolve.
Makefile (1)
860-886: EXIT_CODE capture and Claude invocation look good; optionally initialize EXIT_CODE explicitly

The updated test-e2e target correctly:

Runs ginkgo with the extended args (including $(HCP_EXTERNAL_ARGS)).

Captures the test exit status into EXIT_CODE via ... || EXIT_CODE=$$?.

In OpenShift CI, wires up Vertex AI env vars from /var/run/oadp-credentials/* and invokes analyze_failures.sh with the captured code.

Exits with the original test result so Prow status is preserved.

For extra robustness against odd environments where EXIT_CODE might already be set, you could initialize it in this recipe shell before running ginkgo:
test-e2e: test-e2e-setup install-ginkgo
	EXIT_CODE=0; \
	ginkgo run ... || EXIT_CODE=$$?; \
	...
	exit $$EXIT_CODE
The current behavior is still correct in normal CI usage; this is just a defensive hardening.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between fcd0541 and fa6ce87.

📒 Files selected for processing (6)

.claude/config.json (1 hunks)
CLAUDE.md (1 hunks)
Makefile (1 hunks)
build/ci-Dockerfile (1 hunks)
docs/design/claude-prow-failure-analysis_design.md (1 hunks)
tests/e2e/scripts/analyze_failures.sh (1 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

build/ci-Dockerfile
CLAUDE.md
tests/e2e/scripts/analyze_failures.sh
Makefile
docs/design/claude-prow-failure-analysis_design.md

🪛 LanguageTool

docs/design/claude-prow-failure-analysis_design.md

[style] ~12-~12: Consider a different adjective to strengthen your wording.
Context: ...analysis is time-consuming and requires deep domain knowledge of Velero, CSI snapsho...

(DEEP_PROFOUND)

[grammar] ~26-~26: Use a hyphen to join words.
Context: ...n't block test result reporting) ## Non Goals - Live cluster diagnostics during...

(QB_NEW_EN_HYPHEN)

[grammar] ~956-~956: Ensure spelling is correct
Context: ...ly more complex plumbing Decision: Chose Option B for better error isolation and...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

[grammar] ~1075-~1075: Ensure spelling is correct
Context: ...original test code - Vertex AI timeout (>10min): Script logs timeout, preserves test r...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

[style] ~1256-~1256: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase.
Context: ...ifact Sets Question: How to handle very large must-gather archives or many per-test l...

(EN_WEAK_ADJECTIVE)

[style] ~1259-~1259: As an alternative to the over-used intensifier ‘very’, consider replacing this phrase.
Context: ...Claude Code CLI may truncate or fail on very large inputs - Some test runs generate extens...

(EN_WEAK_ADJECTIVE)

🪛 markdownlint-cli2 (0.18.1)

docs/design/claude-prow-failure-analysis_design.md

434-434: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

517-517: Headings must start at the beginning of the line

(MD023, heading-start-left)

527-527: Headings must start at the beginning of the line

(MD023, heading-start-left)

531-531: Headings must start at the beginning of the line

(MD023, heading-start-left)

532-532: Headings must start at the beginning of the line

(MD023, heading-start-left)

533-533: Headings must start at the beginning of the line

(MD023, heading-start-left)

534-534: Headings must start at the beginning of the line

(MD023, heading-start-left)

566-566: Code block style
Expected: fenced; Actual: indented

(MD046, code-block-style)

596-596: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

623-623: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3

(MD001, heading-increment)

632-632: Hard tabs
Column: 1

(MD010, no-hard-tabs)

633-633: Hard tabs
Column: 1

(MD010, no-hard-tabs)

634-634: Hard tabs
Column: 1

(MD010, no-hard-tabs)

635-635: Hard tabs
Column: 1

(MD010, no-hard-tabs)

636-636: Hard tabs
Column: 1

(MD010, no-hard-tabs)

637-637: Hard tabs
Column: 1

(MD010, no-hard-tabs)

638-638: Hard tabs
Column: 1

(MD010, no-hard-tabs)

639-639: Hard tabs
Column: 1

(MD010, no-hard-tabs)

640-640: Hard tabs
Column: 1

(MD010, no-hard-tabs)

641-641: Hard tabs
Column: 1

(MD010, no-hard-tabs)

642-642: Hard tabs
Column: 1

(MD010, no-hard-tabs)

643-643: Hard tabs
Column: 1

(MD010, no-hard-tabs)

644-644: Hard tabs
Column: 1

(MD010, no-hard-tabs)

645-645: Hard tabs
Column: 1

(MD010, no-hard-tabs)

646-646: Hard tabs
Column: 1

(MD010, no-hard-tabs)

647-647: Hard tabs
Column: 1

(MD010, no-hard-tabs)

757-757: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

788-788: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

821-821: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

850-850: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

873-873: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Shellcheck (0.11.0)

tests/e2e/scripts/analyze_failures.sh

[warning] 62-62: Declare and assign separately to avoid masking return values.

(SC2155)

[warning] 366-366: Use single quotes, otherwise this expands now rather than when signalled.

(SC2064)

🔇 Additional comments (3)

.claude/config.json (1)

1-45: Permissions policy matches intended read‑only, sandboxed usage

Allow/deny lists are coherent with the log‑analysis use case and correctly block destructive and networked actions (rm, curl/wget, kubectl apply/delete, docker, git push, WebSearch/WebFetch). No issues from a security/CI safety standpoint.

tests/e2e/scripts/analyze_failures.sh (1)

27-44: Overall flow, safety checks, and redaction look solid

The script’s structure is sound:

It gates analysis on a non‑zero test exit code and explicit CLI/Vertex configuration checks.

It keeps analysis read‑only and bounded via timeout (60s per subagent, 600s overall) and early exits that preserve the original test result.

All user‑visible Claude output is passed through redact_secrets, which substantially reduces credential‑leakage risk in the generated markdown.

No blocking issues here from a correctness or security perspective.

Also applies to: 169-181, 368-424

CLAUDE.md (1)

1-239: Claude integration overview is consistent with the implementation

The description of how Claude hooks into Prow (artifacts used, where claude-failure-analysis.md appears, and the opt‑out/credential behavior) matches the bash script and Makefile wiring. This should be a useful high‑level entry point for maintainers.

weshayutin · 2025-12-03T21:19:51Z

/test 4.20-e2e-test-aws

weshayutin · 2025-12-04T18:05:03Z

/test all

weshayutin

/LGTM

weshayutin · 2025-12-09T17:39:44Z

/retest

openshift-ci · 2025-12-09T17:40:05Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mpryc, weshayutin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [mpryc]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2025-12-09T22:05:35Z

/retest-required

Remaining retests: 0 against base HEAD f8b06e8 and 2 for PR HEAD fa6ce87 in total

openshift-ci-robot · 2025-12-12T00:05:52Z

/retest-required

Remaining retests: 0 against base HEAD 7a8aba0 and 1 for PR HEAD fa6ce87 in total

weshayutin · 2025-12-12T22:02:03Z

/retest

openshift-ci-robot · 2025-12-15T14:16:20Z

/retest-required

Remaining retests: 0 against base HEAD f75a38d and 0 for PR HEAD fa6ce87 in total

sseago · 2025-12-15T14:39:54Z

/override ci/prow/4.20-e2e-test-cli-aws

openshift-ci · 2025-12-15T14:40:29Z

@sseago: Overrode contexts on behalf of sseago: ci/prow/4.20-e2e-test-cli-aws

Details

In response to this:

/override ci/prow/4.20-e2e-test-cli-aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

kaovilai · 2025-12-15T14:57:57Z

/override ci/prow/4.20-e2e-test-cli-aws

openshift-ci · 2025-12-15T14:58:31Z

@kaovilai: Overrode contexts on behalf of kaovilai: ci/prow/4.20-e2e-test-cli-aws

Details

In response to this:

/override ci/prow/4.20-e2e-test-cli-aws

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

weshayutin · 2025-12-15T15:33:02Z

/test 4.20-e2e-test-hcp-aws

openshift-ci-robot · 2025-12-15T17:19:52Z

/hold

Revision fa6ce87 was retested 3 times: holding

openshift-ci · 2025-12-15T17:28:04Z

New changes are detected. LGTM label has been removed.

kaovilai · 2025-12-15T21:04:23Z

/retest

ai-retester: The e2e test for Mongo application DATAMOVER failed because the PersistentVolumeClaim binding failed during restore, causing the pod to never become ready and eventually time out. A volume conflict arose as a different Pod claimed the PVC from previous run.

comment for /pull/2038

kaovilai · 2025-12-16T00:29:25Z

/retest

ai-retester: The end-to-end test e2e-test-aws failed because the Parks application Native-Snapshots subtest hit a timeout while waiting for the restify container to become ready after the restore. The analysis mentions further investigation for issues, relating it to Image pull issues, Init container failure, or etc. In general, there was one or more failure points for the backup and restore process to execute correctly for successful build result.

comment for /pull/2038

openshift-ci · 2025-12-16T04:02:17Z

@mpryc: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/4.20-e2e-test-cli-aws	`416fc4c`	link	true	`/test 4.20-e2e-test-cli-aws`
ci/prow/4.20-e2e-test-aws	`416fc4c`	link	true	`/test 4.20-e2e-test-aws`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci bot requested review from sseago and weshayutin December 3, 2025 15:53

weshayutin mentioned this pull request Dec 3, 2025

Add automated Claude failure analysis for Prow CI #2034

Draft

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 3, 2025

coderabbitai bot reviewed Dec 3, 2025

View reviewed changes

weshayutin previously approved these changes Dec 9, 2025

View reviewed changes

openshift-ci bot assigned weshayutin Dec 9, 2025

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Dec 9, 2025

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 15, 2025

weshayutin removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 15, 2025

update to ignore hypershift error

416fc4c

weshayutin dismissed their stale review via 416fc4c December 15, 2025 17:27

openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Dec 15, 2025

PR to analyze failures based on the Tigers PR. #2038

Are you sure you want to change the base?

PR to analyze failures based on the Tigers PR. #2038

Conversation

mpryc commented Dec 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why the changes were made

How to test the changes made

Uh oh!

coderabbitai bot commented Dec 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

weshayutin commented Dec 3, 2025

Uh oh!

weshayutin commented Dec 4, 2025

Uh oh!

weshayutin left a comment

Choose a reason for hiding this comment

Uh oh!

weshayutin commented Dec 9, 2025

Uh oh!

openshift-ci bot commented Dec 9, 2025

Uh oh!

openshift-ci-robot commented Dec 9, 2025

Uh oh!

openshift-ci-robot commented Dec 12, 2025

Uh oh!

weshayutin commented Dec 12, 2025

Uh oh!

openshift-ci-robot commented Dec 15, 2025

Uh oh!

sseago commented Dec 15, 2025

Uh oh!

openshift-ci bot commented Dec 15, 2025

Uh oh!

kaovilai commented Dec 15, 2025

Uh oh!

openshift-ci bot commented Dec 15, 2025

Uh oh!

weshayutin commented Dec 15, 2025

Uh oh!

openshift-ci-robot commented Dec 15, 2025

Uh oh!

openshift-ci bot commented Dec 15, 2025

Uh oh!

kaovilai commented Dec 15, 2025

Uh oh!

kaovilai commented Dec 16, 2025

Uh oh!

openshift-ci bot commented Dec 16, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

mpryc commented Dec 3, 2025 •

edited

Loading

coderabbitai bot commented Dec 3, 2025 •

edited

Loading