[CELEBORN-2230] SparkUtils#shouldReportShuffleFetchFailure method should retrieve the number of task failures from TaskSetManager #3556

leixm · 2025-12-04T02:52:32Z

What changes were proposed in this pull request?

Retrieve the number of task failures from TaskSetManager in the SparkUtils#shouldReportShuffleFetchFailure method.

Why are the changes needed?

At https://github.com/apache/celeborn/blob/main/client-spark/spark-3/src/main/java/org/apache/spark/shuffle/celeborn/SparkUtils.java#L484 we count every task attempt in the "UNKNOWN" or "FAILED" state as a failure, but Spark may not count an attempt in the FAILED state toward the task's failure count. In our production environment, task attempts frequently fail because of container preemption. Such failures should not be counted, yet the existing logic counts them, which makes it easier for stageRerun to be triggered prematurely. Obtaining the failure count for a task from the TaskSetManager is therefore more accurate.
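
To illustrate the difference described above, here is a minimal, self-contained sketch. It does not use Spark or Celeborn classes: `Attempt`, `countByStatus`, and `countLikeScheduler` are hypothetical names introduced only for this example, and the `countsTowardsTaskFailures` flag stands in for the idea that some failure reasons (such as container preemption) are not counted by the scheduler.

```java
import java.util.Arrays;
import java.util.List;

final class FailureCountSketch {
  // Hypothetical stand-in for a finished task attempt; not a Spark or Celeborn type.
  static final class Attempt {
    final String status;
    final boolean countsTowardsTaskFailures;

    Attempt(String status, boolean countsTowardsTaskFailures) {
      this.status = status;
      this.countsTowardsTaskFailures = countsTowardsTaskFailures;
    }
  }

  // Status-based counting: every attempt in FAILED or UNKNOWN state is a failure.
  static int countByStatus(List<Attempt> attempts) {
    int count = 0;
    for (Attempt a : attempts) {
      if (a.status.equals("FAILED") || a.status.equals("UNKNOWN")) {
        count++;
      }
    }
    return count;
  }

  // Scheduler-style counting: only failed attempts flagged as counting are included.
  static int countLikeScheduler(List<Attempt> attempts) {
    int count = 0;
    for (Attempt a : attempts) {
      if (a.status.equals("FAILED") && a.countsTowardsTaskFailures) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    List<Attempt> attempts = Arrays.asList(
        new Attempt("FAILED", false), // e.g. container preempted
        new Attempt("FAILED", false), // e.g. container preempted
        new Attempt("FAILED", true)); // genuine task failure
    System.out.println(countByStatus(attempts));      // prints 3
    System.out.println(countLikeScheduler(attempts)); // prints 1
  }
}
```

With a failure limit of 4, the status-based count would already be one failure away from the limit, while the scheduler-style count would not.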

Does this PR resolve a correctness bug?

No.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs.

leixm · 2025-12-04T02:52:53Z

cc @AngersZhuuuu @turboFei @RexXiong

Copilot

Pull request overview

This PR improves the accuracy of task failure counting in the shouldReportShuffleFetchFailure method by retrieving failure counts directly from Spark's TaskSetManager instead of manually counting failed task attempts. This addresses an issue where the previous implementation incorrectly counted failures for tasks in the "FAILED" state, which Spark may not always count depending on the failure reason (e.g., container preemption).

Key changes:

- Added getTaskFailureCount method to retrieve failure counts from TaskSetManager's internal numFailures array
- Removed manual failure counting logic that iterated through task attempts in "FAILED" and "UNKNOWN" states
- Updated logging messages to clearly distinguish between previous failure count and total failure count
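
As a rough illustration of the first key change, the sketch below reads a private `numFailures` array by reflection and returns -1 when anything goes wrong. It is not the PR's implementation: `FakeTaskSetManager` is a hypothetical stand-in for Spark's TaskSetManager so that the example runs without Spark on the classpath, and the use of plain Java reflection is an assumption.

```java
import java.lang.reflect.Field;

final class NumFailuresReflectionSketch {
  // Hypothetical stand-in for Spark's TaskSetManager; only the field name is borrowed.
  static final class FakeTaskSetManager {
    private final int[] numFailures;

    FakeTaskSetManager(int[] numFailures) {
      this.numFailures = numFailures;
    }
  }

  // Returns the failure count recorded for the task at `index`, or -1 on any error.
  static int getTaskFailureCount(Object taskSetManager, int index) {
    try {
      Field field = taskSetManager.getClass().getDeclaredField("numFailures");
      field.setAccessible(true);
      int[] counts = (int[]) field.get(taskSetManager);
      if (index < 0 || index >= counts.length) {
        return -1; // index does not belong to this task set
      }
      return counts[index];
    } catch (ReflectiveOperationException | RuntimeException e) {
      // Missing field, inaccessible field, null manager, or unexpected field type.
      return -1;
    }
  }

  public static void main(String[] args) {
    Object manager = new FakeTaskSetManager(new int[] {0, 2, 1});
    System.out.println(getTaskFailureCount(manager, 1));  // prints 2
    System.out.println(getTaskFailureCount(manager, 10)); // prints -1
  }
}
```

Returning a sentinel instead of throwing keeps the caller in control of what to do when the internal field cannot be read, for example after a Spark upgrade renames it.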

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
client-spark/spark-3/src/main/java/org/apache/spark/shuffle/celeborn/SparkUtils.java	Added `getTaskFailureCount` helper method and refactored `shouldReportShuffleFetchFailure` to use TaskSetManager's failure count
client-spark/spark-2/src/main/java/org/apache/spark/shuffle/celeborn/SparkUtils.java	Same changes as spark-3 version, maintaining consistency across Spark versions

Copilot · 2025-12-06T07:14:43Z

client-spark/spark-3/src/main/java/org/apache/spark/shuffle/celeborn/SparkUtils.java

+        if (previousFailureCount < 0) {
+          return true;
+        }
+        if (previousFailureCount + 1 >= maxTaskFails || !hasRunningAttempt) {


[nitpick] The comment mentions comparing with (maxTaskFails - 1) for equivalence, but the actual code uses previousFailureCount + 1 >= maxTaskFails. While mathematically equivalent, the implementation differs from the explanation. Consider either updating the comment to match the code or adjusting the code to match the comment explanation for consistency.
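
The equivalence mentioned in this comment can be checked with a small sketch. The class and method names are hypothetical, and the check only covers small non-negative values, which is the realistic range for failure counts; at the extremes of `int`, the two forms could differ because of overflow.

```java
final class ThresholdEquivalenceSketch {
  // Form used in the code under review.
  static boolean codeForm(int previousFailureCount, int maxTaskFails) {
    return previousFailureCount + 1 >= maxTaskFails;
  }

  // Form described in the code comment.
  static boolean commentForm(int previousFailureCount, int maxTaskFails) {
    return previousFailureCount >= maxTaskFails - 1;
  }

  // Returns true if both forms agree for every pair of values in [0, limit] x [1, limit].
  static boolean agreeOnRange(int limit) {
    for (int previous = 0; previous <= limit; previous++) {
      for (int max = 1; max <= limit; max++) {
        if (codeForm(previous, max) != commentForm(previous, max)) {
          return false;
        }
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(agreeOnRange(100)); // prints true
  }
}
```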

Copilot · 2025-12-06T07:14:43Z

client-spark/spark-3/src/main/java/org/apache/spark/shuffle/celeborn/SparkUtils.java

+        //       so we compare with (maxTaskFails - 1) which is equivalent to
+        //       (previousFailureCount + 1) >= maxTaskFails
+        int previousFailureCount = getTaskFailureCount(taskSetManager, taskInfo.index());
+        if (previousFailureCount < 0) {


The error handling for previousFailureCount < 0 returns true to trigger FetchFailed, but lacks a comment explaining this fail-safe behavior. Consider adding a brief comment explaining that returning true when failure count cannot be determined is a conservative safety measure to trigger FetchFailed and prevent silent failures.

Suggested change

-        if (previousFailureCount < 0) {
+        if (previousFailureCount < 0) {
+          // Fail-safe: If the previous failure count cannot be determined, conservatively trigger FetchFailed
+          // to prevent silent failures and ensure the error is handled.
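
The decision discussed in this thread can be sketched as a small pure function. This is an illustration under assumptions, not the PR's code: the class and method names are hypothetical, and it assumes that the branch shown in the diff reports the fetch failure while the fall-through does not.

```java
final class FetchFailureDecisionSketch {
  // previousFailureCount: failures already recorded for the task, or -1 if unknown.
  // maxTaskFails: the configured limit on failures per task.
  // hasRunningAttempt: whether another attempt of the same task is still running.
  static boolean shouldReport(
      int previousFailureCount, int maxTaskFails, boolean hasRunningAttempt) {
    if (previousFailureCount < 0) {
      // Fail-safe: the count could not be determined, so report conservatively.
      return true;
    }
    // The current failure would bring the total to previousFailureCount + 1.
    return previousFailureCount + 1 >= maxTaskFails || !hasRunningAttempt;
  }

  public static void main(String[] args) {
    System.out.println(shouldReport(-1, 4, true)); // prints true (unknown count)
    System.out.println(shouldReport(2, 4, true));  // prints false (retries remain)
    System.out.println(shouldReport(3, 4, true));  // prints true (limit reached)
    System.out.println(shouldReport(0, 4, false)); // prints true (no other attempt running)
  }
}
```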

turboFei · 2025-12-06T07:15:20Z

client-spark/spark-3/src/main/java/org/apache/spark/shuffle/celeborn/SparkUtils.java

+   * @return the number of previous failed attempts, or -1 if an error occurs
+   */
+  @VisibleForTesting
+  protected static int getTaskFailureCount(TaskSetManager taskSetManager, int index) {


I see it is annotated with @VisibleForTesting.

Could you cover it in a unit test?

github-actions · 2026-01-06T08:41:09Z

This PR is stale because it has been open 20 days with no activity. Remove stale label or comment or this will be closed in 10 days.

SparkUtils#shouldReportShuffleFetchFailure method should retrieve the…

f116b87

… number of task failures from TaskSetManager

github-actions bot added module:client module:spark labels Dec 4, 2025

leixm added 5 commits December 4, 2025 10:57

fix

c73fa25

fix

e91feda

fix

50bf4e4

fix

f826cc3

fix

658ff0a

turboFei requested a review from Copilot December 6, 2025 07:14

Copilot AI reviewed Dec 6, 2025

turboFei reviewed Dec 6, 2025

github-actions bot added the stale label Jan 6, 2026
