Skip to content

fix: allow for complete crash#258

Merged
jason-lynch merged 2 commits intomainfrom
fix/PLAT-331/proper-crash
Jan 27, 2026
Merged

fix: allow for complete crash#258
jason-lynch merged 2 commits intomainfrom
fix/PLAT-331/proper-crash

Conversation

@jason-lynch
Copy link
Member

@jason-lynch jason-lynch commented Jan 25, 2026

Summary

The workflows worker does not exit until its context is canceled. This commit fixes an issue where certain errors, such as those originating from Etcd or the various watches, could lead to the app shutting down without canceling the worker's context. This put the server in a partially shut down state, where it would still try to run workflows but fail from missing services in our DI system.

Now, the App manages a separate context for services, and the worker manages its own context as well. This allows for a complete crash in case of error, and improves our shutdown process by waiting until all the shutdown methods have been tried before canceling the service context.

Testing

The app should function normally. In extreme situations that prompt a crash, such as the one described in PLAT-331, the server should shut down completely and restart itself rather than partially shutting down.

PLAT-331

Summary by CodeRabbit

Release Notes

  • Chores
    • Enhanced structured logging throughout system components with component-scoped identifiers for improved observability.
    • Improved graceful shutdown coordination with better error propagation across services.
    • Optimized log verbosity by adjusting diagnostic messages to appropriate detail levels.
    • Refined service lifecycle management for more predictable startup and shutdown sequencing.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link

coderabbitai bot commented Jan 25, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

  • 🔍 Trigger a full review
📝 Walkthrough

Walkthrough

This PR introduces comprehensive logging infrastructure across server components and refactors the App lifecycle to implement error propagation through a new ErrorProducer interface. Changes include adding logger fields with component scoping to multiple services, reducing log verbosity by converting Info-level logs to Debug across several components, and introducing service context management with dedicated error handling channels in the App struct.

Changes

Cohort / File(s) Summary
Logging Infrastructure
server/internal/api/server.go, server/internal/etcd/embedded.go, server/internal/etcd/remote.go, server/internal/workflows/worker.go
Added logger field to Server and Worker structs; initialized with component metadata for scoped logging; added debug logs during startup and shutdown phases
Log Level Adjustments
server/internal/api/apiv1/post_init_handlers.go, server/internal/etcd/provide.go, server/internal/migrate/runner.go, server/internal/monitor/service.go, server/internal/scheduler/service.go
Converted Info-level logs to Debug for diagnostic messages: parsed timestamps, etcd mode transitions, migration status, monitor lifecycle, and scheduled job events
App Lifecycle & Error Handling
server/internal/app/app.go
Introduced ErrorProducer interface; added errCh, serviceCtx, and serviceCtxCancel to App struct; implemented error propagation mechanism via addErrorProducer helper; refactored lifecycle to use dedicated service context for internal operations while maintaining external context for shutdown coordination; replaced public Shutdown with private shutdown method featuring graceful shutdown with configurable grace period
Worker Integration
server/internal/workflows/provide.go, server/internal/workflows/worker.go
Updated NewWorker to accept logger as first parameter; added ctx and cancel fields to Worker; implemented cancellable context lifecycle in Start/Shutdown; added debug logging for worker lifecycle events

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • rshoemaker

Poem

🐰 A logger hops through the code so bright,
Components log with debug-level delight,
Error channels flow where services roam,
Context trees grow in the App's new home!
With graceful shutdown, we hop toward the night. 🌙

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix: allow for complete crash' directly relates to the main objective of enabling complete application shutdown on errors, though it's somewhat abstract and could be more specific.
Description check ✅ Passed The PR description covers all required template sections: Summary (clear explanation of the issue and fix), Changes (implicit in the summary), Testing (app should function normally, server should fully shut down on crash), Checklist (mostly absent), and Notes for Reviewers. However, the checklist is not explicitly completed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@jason-lynch
Copy link
Member Author

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Jan 25, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@server/internal/workflows/worker.go`:
- Around line 55-63: Start currently overwrites w.cancel on re-entry; change
Worker.Start to guard and only assign the active cancel after the underlying
worker successfully started: first check if w.cancel != nil (or if w.ctx != nil)
and return an error or no-op to prevent re-entry, then create a localCtx,
localCancel := context.WithCancel(ctx) and call w.worker.Start(localCtx); if
that returns an error call localCancel() and return the error, otherwise set
w.ctx = localCtx and w.cancel = localCancel. Reference Worker.Start, w.ctx,
w.cancel and w.worker.Start when making the change.

@jason-lynch jason-lynch force-pushed the fix/PLAT-403/cancel-panic branch from 1fc7985 to 7264b7e Compare January 25, 2026 19:13
@jason-lynch jason-lynch force-pushed the fix/PLAT-331/proper-crash branch 3 times, most recently from e1a27fd to a12872f Compare January 25, 2026 19:28
@jason-lynch jason-lynch force-pushed the fix/PLAT-403/cancel-panic branch from 7264b7e to 9b24f17 Compare January 26, 2026 14:44
@jason-lynch jason-lynch force-pushed the fix/PLAT-331/proper-crash branch from a12872f to 3afed69 Compare January 26, 2026 14:44
Copy link
Contributor

@rshoemaker rshoemaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approved.

The workflows worker does not exit until its context is canceled. This
commit fixes an issue where certain errors, such as those originating
from Etcd or the various watches, could lead to the app shutting down
without canceling the worker's context. This put the server in a
partially shut down state, where it would still try to run workflows
but fail from missing services in our DI system.

Now, the `App` manages a separate context for services, and the worker
manages its own context as well. This allows for a complete crash in
case of error, and improves our shutdown process by waiting until all
the shutdown methods have been tried before canceling the service
context.

PLAT-331
Simplifies our startup and shut down log output by limiting it to
general process status and services that are opening ports.

PLAT-331
@jason-lynch jason-lynch force-pushed the fix/PLAT-331/proper-crash branch from 3afed69 to 49fe78a Compare January 27, 2026 13:49
@jason-lynch jason-lynch changed the base branch from fix/PLAT-403/cancel-panic to main January 27, 2026 13:50
@jason-lynch jason-lynch merged commit aceb011 into main Jan 27, 2026
3 checks passed
@jason-lynch jason-lynch deleted the fix/PLAT-331/proper-crash branch January 27, 2026 14:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants