Skip to content

Do not crash on metrics errors#6854

Open
dmerand wants to merge 1 commit intodlm-log-metrics-errorsfrom
dlm-handle-metrics-errors
Open

Do not crash on metrics errors#6854
dmerand wants to merge 1 commit intodlm-log-metrics-errorsfrom
dlm-handle-metrics-errors

Conversation

@dmerand
Copy link
Contributor

@dmerand dmerand commented Feb 12, 2026

WHY are these changes introduced?

Resolves incident #inv-18682 (February 2026 OTLP failures)

When the OTLP telemetry endpoint is unavailable or experiencing failures, the CLI currently crashes with unhandled errors, forcing users to run commands with SHOPIFY_CLI_NO_ANALYTICS=1 as a workaround. This violates the principle that telemetry should be transparent to users - it should never impact CLI functionality.

Related to: PR #6851 (adds Bugsnag reporting for telemetry failures)

WHAT is this pull request doing?

Changes:

Always resolve the OTEL metrics emitter - There is no need to halt execution/reject the emitting process if OTEL reporting fails. This is a non-critical process, and we should allow things to continue to function normally even if this fails

HOW is this fixing things?

  • Commands exit with EXIT_CODE=0 now as we are always resolving. Previously, they were exiting with 1
1. CLI command finishes                                                                                                          
     ↓                                                                                                                             
  2. postrun hook (cli-kit/.../hooks/postrun.ts)                                                                                   
     → calls reportAnalyticsEvent()                                                                                                
     ↓                                                                                                                             
  3. reportAnalyticsEvent (analytics.ts:44)                                                                                        
     → Promise.all([doMonorail(), doOpenTelemetry()])                                                                              
     → has try/catch, BUT...                                                                                                       
     ↓                                                                                                                             
  4. doOpenTelemetry → recordMetrics (otel-metrics.ts:58)                                                                          
     → calls recorder.otel.record() on counters/histograms                                                                         
     ↓                                                                                                                             
  5. BaseOtelService.record() (BaseOtelService.ts:~140)                                                                            
     → instrument.record(value, labels)                                                                                            
     → void meterProvider.forceFlush({}).catch(() => {})   ← FIRE-AND-FORGET                                                       
     ↓                                                                                                                             
  6. MeterProvider.forceFlush()                                                                                                    
     ↓                                                                                                                             
  7. InstantaneousMetricReader.onForceFlush()  [throttled to 1000ms]                                                               
     → this.collect({})  — gathers accumulated metrics                                                                             
     → this._exporter.export(resourceMetrics, callback)                                                                            
     ↓                                                                                                                             
  8. OTLPMetricExporter sends HTTP request to OTEL collector                                                                       
     → callback fires with { code: SUCCESS | FAILED, error? } 
     → previously rejected any errors, now we log/report them, but always resolve

Copy link
Contributor Author

dmerand commented Feb 12, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@dmerand dmerand mentioned this pull request Feb 12, 2026
3 tasks
@github-actions
Copy link
Contributor

github-actions bot commented Feb 12, 2026

Coverage report

St.
Category Percentage Covered / Total
🟡 Statements 78.91% 14557/18448
🟡 Branches 73.23% 7223/9863
🟡 Functions 79.12% 3705/4683
🟡 Lines 79.25% 13763/17366

Test suite run success

3775 tests passing in 1455 suites.

Report generated by 🧪jest coverage report action from 10c96e1

@dmerand dmerand force-pushed the dlm-handle-metrics-errors branch 2 times, most recently from b53ecc5 to dfa63b2 Compare February 12, 2026 18:18
@dmerand dmerand closed this Feb 12, 2026
@dmerand dmerand reopened this Feb 12, 2026
@alexanderMontague alexanderMontague changed the base branch from dlm-log-metrics-errors to graphite-base/6854 February 13, 2026 15:17
@alexanderMontague alexanderMontague force-pushed the dlm-handle-metrics-errors branch from dfa63b2 to dccc1c6 Compare February 13, 2026 16:36
@alexanderMontague alexanderMontague changed the base branch from graphite-base/6854 to dlm-log-metrics-errors February 13, 2026 16:36
@github-actions
Copy link
Contributor

Differences in type declarations

We detected differences in the type declarations generated by Typescript for this branch compared to the baseline ('main' branch). Please, review them to ensure they are backward-compatible. Here are some important things to keep in mind:

  • Some seemingly private modules might be re-exported through public modules.
  • If the branch is behind main you might see odd diffs, rebase main into this branch.

New type declarations

packages/cli-kit/dist/public/node/vendor/otel-js/service/FailSafeOtelService.d.ts
import type { MetricAttributes } from '@opentelemetry/api';
import { OtelService, OnRecordCallback } from './types.js';
/**
 * Wraps an OtelService to ensure telemetry failures never crash the CLI.
 *
 * This service attempts to record metrics on a "fail-safe" basis - if recording
 * fails, the error is logged (in verbose mode) but not propagated.
 * This ensures telemetry infrastructure issues (like OTLP endpoint outages) don't
 * block developer workflows.
 *
 * Only wraps the methods that are actually called from outside the service.
 * Initialization errors are handled by the caller.
 *
 * Related: incident #inv-18682 (February 2026 OTLP failures)
 */
export declare class FailSafeOtelService {
    private readonly inner;
    private hasLoggedError;
    constructor(inner: OtelService);
    get serviceName(): string;
    record<TAttributes extends MetricAttributes = any>(...args: Parameters<OnRecordCallback<TAttributes>>): void;
    private logError;
}

Existing type declarations

We found no diffs with existing type declarations

@alexanderMontague alexanderMontague force-pushed the dlm-handle-metrics-errors branch from dccc1c6 to 8363f46 Compare February 13, 2026 16:42
@alexanderMontague alexanderMontague marked this pull request as ready for review February 13, 2026 17:17
@alexanderMontague alexanderMontague requested a review from a team as a code owner February 13, 2026 17:17
@github-actions
Copy link
Contributor

We detected some changes at packages/*/src and there are no updates in the .changeset.
If the changes are user-facing, run pnpm changeset add to track your changes and include them in the next release CHANGELOG.

Caution

DO NOT create changesets for features which you do not wish to be included in the public changelog of the next CLI release.

test('resolves without rejecting when export error is undefined', async () => {
const exporter = createMockExporter(ExportResultCode.FAILED)
const {reader, provider} = createReaderWithProvider(exporter)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should perhaps test that the diag.error comes through in this and the above case.

@@ -41,13 +41,12 @@ export class InstantaneousMetricReader extends MetricReader {
diag.error('PeriodicExportingMetricReader: metrics collection errors', ...errors)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This isn't anything we did in this PR, but this class name isn't correct any more -- perhaps a miss when vendoring?

await provider.shutdown()
})

test('resolves without rejecting on export failure', async () => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also add a test case for export returning synchronous exceptions?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants