
Conversation

@cristiand391 (Member) commented on Jan 6, 2026

What does this PR do?

Updates the MetadataTransfer class to track errors during retrieve/deploy polling.

The polling is done using sfdx-core's PollingClient without specifying retryLimit, so it runs until a timeout happens:
https://github.com/forcedotcom/sfdx-core/blob/5564069767b85a96e73f8bf88dbdd3d7e4b5da03/src/status/pollingClient.ts#L131
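
For context, here is a minimal sketch of how such a client can be set up; the helper name, frequency, and timeout values are illustrative, not the exact MetadataTransfer code:

import { PollingClient, StatusResult } from '@salesforce/core';
import { Duration } from '@salesforce/kit';

// Hypothetical helper: create a PollingClient with no retryLimit, so it only
// stops when the poll callback reports completion or the timeout elapses.
const pollUntilDoneOrTimeout = async (
  poll: () => Promise<StatusResult>,
  waitMinutes: number
): Promise<void> => {
  const client = await PollingClient.create({
    poll,
    frequency: Duration.milliseconds(500), // illustrative; SDR computes its own frequency
    timeout: Duration.minutes(waitMinutes), // maps to the --wait flag value
  });
  await client.subscribe();
};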

During a retrieve/deploy with sf, if the Metadata API starts constantly returning one of the retryable errors (backend or metadata issue?), the polling keeps going until the timeout (the --wait flag value) is reached and then throws a generic "client timed out" error without much info about the real issue.

This PR adds an error tracker that counts consecutive errors during polling checks (see the sketch below), allowing it to:

  1. throw if the same error has been retried X times in a row
  2. still retry intermittent, flaky API responses during a long-running poll
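
A minimal sketch of the idea, assuming a checkStatus stand-in for the Metadata API status call; the names and the limit value are illustrative, not the exact SDR implementation:

import { StatusResult } from '@salesforce/core';

const ERROR_RETRY_LIMIT = 25; // illustrative; the real limit is configurable (see below)

let consecutiveErrorRetries = 0;
let lastPollError: Error | undefined;

// Stand-in for the real Metadata API status call made by MetadataTransfer.
declare function checkStatus(): Promise<{ done: boolean }>;

const poll = async (): Promise<StatusResult> => {
  try {
    const status = await checkStatus();
    consecutiveErrorRetries = 0; // any valid response (even InProgress) resets the counter
    return { completed: status.done };
  } catch (e) {
    lastPollError = e instanceof Error ? e : new Error(String(e));
    consecutiveErrorRetries += 1;
    // Stop polling once the same failure repeats too many times in a row;
    // the caller then surfaces lastPollError instead of a generic timeout.
    return { completed: consecutiveErrorRetries >= ERROR_RETRY_LIMIT };
  }
};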

What issues does this PR fix or reference?

@W-18203875@

Functionality Before

Because SDR did not define a retry limit, consecutive API errors were retried until the timeout and a generic timeout error was thrown.

Functionality After

SDR tracks consecutive errors, throws after 25 consecutive errors during polling, and allows customizing the limit via an env var.

(screenshot attached)
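
A hypothetical sketch of how the limit could be resolved from the env var; the PR's actual calculateErrorRetryLimit may differ in its details:

const DEFAULT_ERROR_RETRY_LIMIT = 25;

// Hypothetical resolver: fall back to the default when the env var is unset
// or not a positive integer.
export const resolveErrorRetryLimit = (): number => {
  const raw = process.env.SF_METADATA_POLL_ERROR_RETRY_LIMIT;
  const parsed = raw ? Number.parseInt(raw, 10) : NaN;
  return Number.isInteger(parsed) && parsed > 0 ? parsed : DEFAULT_ERROR_RETRY_LIMIT;
};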

cristiand391 and others added 4 commits January 6, 2026 12:43
Increase timeout from 1 to 3 seconds in retry limit tests to ensure
the retry limit is reached before timeout on all platforms. The
1-second timeout was causing race conditions on Windows where only
16 retries completed instead of the expected 20 due to execution
overhead.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replace total retry limit with consecutive error retry limit to prevent
infinite loops from repeated errors while allowing long-running operations
to poll indefinitely until timeout.

Key changes:
- Track consecutive retryable errors separately from normal polling
- Reset counter on successful status check
- Default limit of 25 consecutive errors (configurable via SF_METADATA_POLL_ERROR_RETRY_LIMIT)
- Remove PollingClient retryLimit to allow unlimited normal polling
- Add error message for retry limit exceeded without wrapper duplication

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@cristiand391 cristiand391 requested a review from a team as a code owner January 6, 2026 20:31
this.errorRetryLimit = calculateErrorRetryLimit(this.logger);
this.errorRetryLimitExceeded = undefined;

// Set a very high retryLimit for PollingClient to prevent it from stopping on errors
@cristiand391 (Member, Author) commented:

todo for me:
remove this comment, Claude was still setting a high retryLimit after the last changes

const err = e as Error | SfError;

// Don't wrap the error retry limit exceeded error
if (err instanceof SfError && err.message.includes('consecutive retryable errors')) {
@cristiand391 (Member, Author) commented:

avoid wrapping mostly because the final error printed to the user had duplicate message parts like "Metadata API request failed: Metadata API request failed"
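
A rough sketch of that rethrow-vs-wrap decision; the function name and wrapper message are illustrative, not the exact SDR code:

import { SfError } from '@salesforce/core';

const toUserError = (e: unknown): SfError => {
  const err = e instanceof Error ? e : new Error(String(e));
  // The retry-limit error is already user-facing; wrapping it again would
  // duplicate the "Metadata API request failed" prefix in the final output.
  if (err instanceof SfError && err.message.includes('consecutive retryable errors')) {
    return err;
  }
  return new SfError(`Metadata API request failed: ${err.message}`);
};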

try {
mdapiStatus = await this.checkStatus();
// Reset error counter on successful status check
this.consecutiveErrorRetries = 0;
@cristiand391 (Member, Author) commented:

successful status != successful operation

a successful status check means the Metadata API returned a valid response; it can still be InProgress.
The errors caught in the catch block are exceptions thrown by jsforce (network errors, parsing failures, etc.)

error: e,
count: this.errorRetryLimit,
};
return { completed: true };
@cristiand391 (Member, Author) commented:

completed: true to signal the polling client to stop
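
A minimal sketch of this flag-then-throw pattern, assuming illustrative field names: the poll callback records the failure and returns completed: true so the PollingClient stops, and the caller surfaces the recorded error afterwards.

import { SfError } from '@salesforce/core';

type RetryLimitExceeded = { error: Error; count: number };

let errorRetryLimitExceeded: RetryLimitExceeded | undefined;

// Called by the transfer code after PollingClient.subscribe() resolves.
const assertRetryLimitNotExceeded = (): void => {
  if (errorRetryLimitExceeded) {
    throw new SfError(
      `Polling stopped after ${errorRetryLimitExceeded.count} consecutive retryable errors: ${errorRetryLimitExceeded.error.message}`
    );
  }
};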

@cristiand391 changed the title from "fix(poll): add dynamic retry limits for polling W-18203875" to "fix(poll): track consequent errors during polling W-18203875" on Jan 6, 2026
cristiand391 and others added 2 commits January 7, 2026 13:42
Increase the default consecutive error retry limit from 25 to 1000
to provide more tolerance for intermittent network issues during
long-running deploy/retrieve operations.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@iowillhoit (Contributor) commented:

Looks good to me! Tested this by disconnecting wifi mid deploy/retrieve. ENOTFOUND is one of the retry-able errors outlined here

  • Default retries is 1000 (deploy) (screenshot)
  • Overrode the default with SF_METADATA_POLL_ERROR_RETRY_LIMIT=50 (screenshot)
  • When it reaches the limit, it throws an error with the last known error (screenshot)
  • Same with retrieves (screenshot)

@iowillhoit iowillhoit merged commit 588ed78 into main Jan 7, 2026
3 checks passed
@iowillhoit iowillhoit deleted the cd/retry-limit branch January 7, 2026 21:07