Conversation

@hawkw (Member) commented Dec 19, 2025

This branch builds on #9492 by adding alert requests to fault management cases. This is a mechanism to allow a sitrep to specify that a set of alerts should exist. UUIDs and payloads for these alerts are specified in the sitrep. We don't want entries in the alert table to be created immediately when a sitrep is inserted, as that sitrep may not be made current. If the alert dispatcher operates on alerts created in sitreps that were not made current, it could dispatch duplicate or spurious alerts. Instead, we indirect the creation of alert records by having the sitrep insertion create alert requests, and if that sitrep becomes current, a background fm_rendezvous task reconciles the requested alerts in the sitrep with the actual alert table. Eventually, this task will be responsible for updating other rendezvous tables based on the current sitrep.
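Schematically, the flow looks something like the sketch below (the types and names here are illustrative only, not the actual datastore API):

    // Illustrative sketch of the alert-request flow; names are stand-ins
    // rather than the real omicron types.
    use serde_json::Value;
    use uuid::Uuid;

    // One requested alert, recorded as part of the sitrep when it is inserted.
    struct AlertRequest {
        alert_id: Uuid, // UUID chosen for the alert when the sitrep was built
        class: String,
        payload: Value,
    }

    struct Sitrep {
        id: Uuid,
        alert_requests: Vec<AlertRequest>,
        // cases, ereport assignments, ...
    }

    // Stand-in for idempotently ensuring a row in the alert table.
    trait AlertTable {
        fn ensure_alert(&mut self, id: Uuid, class: &str, payload: &Value);
    }

    // Runs in the `fm_rendezvous` background task, only ever against the
    // sitrep that has actually been made current.
    fn reconcile_alerts(current: &Sitrep, alerts: &mut dyn AlertTable) {
        for req in &current.alert_requests {
            // Re-running against the same sitrep (or racing with another
            // Nexus) is safe because ensuring an existing alert is a no-op.
            alerts.ensure_alert(req.alert_id, &req.class, &req.payload);
        }
    }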

I also did a bit of refactoring of the alert class types so that the structured enum of alert classes could be used by the sitrep.

This change was originally factored out from #9346, but I ultimately ended up rewriting a lot of it.

@hawkw requested review from davepacheco and smklein, Dec 19, 2025
@hawkw self-assigned this, Dec 19, 2025
@hawkw added the nexus and fault-management labels, Dec 19, 2025
@hawkw changed the title from "[nexus] fm alert requests and rendezvous task" to "[nexus] FM alert requests and rendezvous task", Dec 19, 2025
@smklein assigned smklein and unassigned hawkw, Jan 5, 2026
@smklein (Collaborator) left a comment:

Overall structure looks good, but I've got some questions about lifetimes of things

lookup_ereports_assigned_to_fm_case
ON omicron.public.fm_ereport_in_case (sitrep_id, case_id);

CREATE TABLE IF NOT EXISTS omicron.public.fm_alert_request (
@smklein (Collaborator) commented:

Just because it's a little quirky - could we document the lifetime of entries in the fm_alert_request table?

You said in the PR description that these get inserted "at sitrep insertion time", and then they may or may not be used depending on whether the sitrep gets activated. When do we remove them?

-- UUID of the sitrep in which the alert is requested.
sitrep_id UUID NOT NULL,
-- UUID of the sitrep in which the alert request was created.
requested_sitrep_id UUID NOT NULL,
@smklein (Collaborator) commented:

To confirm: This value of requested_sitrep_id is static, and the sitrep_id value will be the one changing as the fm_alert_request is carried from sitrep to sitrep, right?

@hawkw (Member, Author) commented:

That's correct --- is there something that would have made the comment clearer here?

@smklein (Collaborator) commented:

This feels like a nitpick; maybe just:

    -- UUID of the ongoing sitrep in which the alert is requested.
...
    -- UUID of the original sitrep in which the alert request was created.

(it's clear when seeing the pattern in the other tables, but it has been a few weeks since I looked at this code, and I had to check things)

variant.as_str().starts_with("test.") && !variant.is_test()
})
.collect::<Vec<_>>();
assert_eq!(
@smklein (Collaborator) commented:

Nice quality-of-life test, though it makes me wonder if test. as a stringified prefix should just imply variant.is_test() for us

@smklein (Collaborator) commented:

ehh I'm seeing this is just moved code; consider this alternate is_test suggestion just musing rather than a request.
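For concreteness, the alternate suggestion would amount to deriving is_test from the stringified name so the two can never disagree; a rough sketch, with AlertClass and its variants standing in for the real enum:

    // Sketch only: derive `is_test` from the stringified class name.
    // `AlertClass` and its variants here are stand-ins for the real enum.
    enum AlertClass {
        TestFoo,
        InstanceStopRequested,
    }

    impl AlertClass {
        fn as_str(&self) -> &'static str {
            match self {
                AlertClass::TestFoo => "test.foo",
                AlertClass::InstanceStopRequested => "instance.stop_requested",
            }
        }

        fn is_test(&self) -> bool {
            self.as_str().starts_with("test.")
        }
    }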

Comment on lines +82 to +84
// XXX(eliza): is it better to allocate all of these into a big array
// and do a single `INSERT INTO` query, or iterate over them one by one
// (not allocating) but insert one at a time?
@smklein (Collaborator) commented:

Something to consider: There are a ton of sources of conflicts due to raciness here:

  • If we have multiple Nexus instances trying to perform rendezvous for a single sitrep simultaneously, they'll potentially overlap
  • They could also be operating on distinct sitreps (e.g., a slow Nexus is doing rendezvous for "what it thinks is the latest sitrep", but it becomes out-of-date immediately after it starts the rendezvous process) and they'll have partially overlapping sets of alerts

I think this would require a batched version of alert_create (maybe alerts_create?) to use on_conflict...do_nothing.

However, I think I'm okay with the usual: "Let's keep it simple now, instrument it, and optimize it later when we need to"
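For reference, a batched alerts_create along those lines might look roughly like the following Diesel sketch. The table, column, and type names are assumptions for the sketch, not the actual omicron schema; it relies on Diesel's on_conflict(...).do_nothing() and assumes the postgres, uuid, and serde_json features.

    use diesel::prelude::*;

    diesel::table! {
        alert (id) {
            id -> Uuid,
            class -> Text,
            payload -> Jsonb,
        }
    }

    #[derive(Insertable)]
    #[diesel(table_name = alert)]
    struct NewAlert {
        id: uuid::Uuid,
        class: String,
        payload: serde_json::Value,
    }

    fn alerts_create(
        conn: &mut PgConnection,
        new_alerts: &[NewAlert],
    ) -> QueryResult<usize> {
        diesel::insert_into(alert::table)
            .values(new_alerts)
            // An alert that was already created (by an earlier run, or by a
            // concurrent Nexus doing rendezvous for an overlapping sitrep)
            // conflicts on the primary key and is simply skipped.
            .on_conflict(alert::id)
            .do_nothing()
            .execute(conn)
    }

Since the alert UUIDs are specified in the sitrep itself, the primary-key conflict target should be enough to make concurrent rendezvous runs idempotent.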

let class = class.into();
match self
.datastore
.alert_create(&opctx, id, class, payload.clone())
@smklein (Collaborator) commented:

I don't think we're deleting alert records yet - AFAICT, we're marking them dispatched, but leaving rows in CRDB for them - but when we do, this will be something we need to consider.

  • Suppose we want to delete an alert record from cockroachdb
  • Suppose there is a really laggy Nexus somewhere, running this rendezvous task. It's stuck doing rendezvous for a very old sitrep.
  • If we do "actual SQL DELETE" of the alert, this background task could theoretically bring it back to life (which would be a bug)

I don't think this problem has been totally solved for blueprints either - I'm not seeing such guards in reconcile_blueprint_rendezvous_tables - but from a discussion with @jgallagher, the priority there was lower, because the rendezvous tables for blueprints are much lower-churn than they presumably will be for alerts.

I wrote up an issue for this on the blueprint side with #9592, but I think it'll be relevant here much sooner, especially as each alert is injecting an arbitrary JSON payload, which means the table is going to grow in size more quickly.

.internal_context("failed to delete case ereport assignments")
})?;

// Delete alert requests
@smklein (Collaborator) commented:

I know we currently happen to be calling fm_sitrep_delete_all from a context where we're deleting orphan sitreps, but theoretically someone could call this function for "just an old sitrep", including one that was not orphaned.

Is it possible that this would be deleting records concurrently with the fm_rendezvous background task?

Looking at the background task, it looks like we read the "alert request" values in-memory from the watch channel, so I guess this would go something like:

  • The background task reads a sitrep and starts acting on it
  • Someone calls fm_sitrep_delete_all, passing the ID of the sitrep that the background task is working on. Note: this might be happening on a different Nexus instance from where the background task is executing.
  • The background task is still acting upon a sitrep that we deleted from the database

@smklein (Collaborator) commented:

(maybe this is fine, but kinda just checking in on "what guards what" with respect to row lifetimes)

@hawkw (Member, Author) commented:

I don't actually think there's anything particularly wrong with that; it should always be safe to delete any sitrep as long as it is not the current sitrep according to CRDB. Everything that is driven by data stored in a sitrep operates on the in-memory sitrep loaded by the fm_sitrep_load task, so it should actually be totally fine to delete the sitrep: any Nexus operating from that sitrep is operating on it from memory rather than from CRDB --- we won't see parts of a sitrep disappear on us as long as we've already loaded it.

That said, I do actually think there's a separate issue around deleting a sitrep *while* the fm_sitrep_loader task is in the process of loading that sitrep. In that case, it's possible to read an incomplete sitrep if the sitrep is deleted concurrently with loading it. We could change the loader task to re-load the metadata record after loading all the other components of the sitrep, to check that the sitrep wasn't in the process of being deleted while it was being loaded. However, we currently delete the sitrep metadata record as the last step in deleting the sitrep (as it's what indicates that the delete operation is "completed"); this means that we could have a situation where a load overlaps with the beginning of a delete but doesn't see the delete finish, and therefore loads an incomplete sitrep.

I think this could be fixed by adding a time_deleted field to the sitrep metadata (or something, could just be a bool), and having the delete operation start by setting that field, and then go and delete all the subordinate records, and then delete the metadata record. This way, the loader task could detect if it raced with a delete by re-loading the metadata record again and checking if its time_deleted was set or if it just no longer exists at all.
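Roughly, with stand-in types (none of these are the real datastore names), the load-side check could look like:

    use chrono::{DateTime, Utc};
    use uuid::Uuid;

    struct SitrepMetadata {
        id: Uuid,
        // Proposed field: set at the *start* of a delete, before any
        // subordinate rows are removed; the metadata row itself is still
        // deleted last, as today.
        time_deleted: Option<DateTime<Utc>>,
    }

    struct Sitrep {
        metadata: SitrepMetadata,
        // cases, ereport-to-case assignments, alert requests, ...
    }

    enum LoadError {
        // The metadata row disappeared or was tombstoned while we were
        // loading the rest of the sitrep, so what we read may be torn.
        DeletedDuringLoad,
        Other(String),
    }

    // Stand-in for whatever the datastore exposes to the loader task.
    trait SitrepStore {
        fn read_sitrep(&self, id: Uuid) -> Result<Sitrep, LoadError>;
        fn fetch_metadata(&self, id: Uuid) -> Result<Option<SitrepMetadata>, LoadError>;
    }

    fn load_sitrep_checked(db: &impl SitrepStore, id: Uuid) -> Result<Sitrep, LoadError> {
        // 1. Read the metadata record and all subordinate records.
        let sitrep = db.read_sitrep(id)?;

        // 2. Re-read the metadata record. If it's gone, or `time_deleted` is
        //    now set, a delete started while we were loading; bail out rather
        //    than publish a potentially incomplete sitrep.
        match db.fetch_metadata(id)? {
            Some(meta) if meta.time_deleted.is_none() => Ok(sitrep),
            _ => Err(LoadError::DeletedDuringLoad),
        }
    }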

Does that make sense to you?

@smklein (Collaborator) commented:

It does make sense, as a way to prevent reading a partial sitrep. Though, just to be explicit, it doesn't prevent the "loaded sitrep" from being immediately deleted after loading (nothing can prevent that).

I also think the issue here is shared with blueprints, if I'm reading nexus/src/app/background/tasks/blueprint_load.rs correctly:

  • blueprint_target_get_current_full calls:
    • blueprint_current_target_only (which loads the current target)
    • Then it calls blueprint_read on that target

There's even a comment about the case of concurrent deletion, but I think it's somewhat wrong:

// The blueprint for the current target cannot be deleted while it is
// the current target, but it's possible someone else (a) made a new
// blueprint the target and (b) deleted the blueprint pointed to by our
// `target` between the above query and the below query. In such a case,
// this query will fail with an "unknown blueprint ID" error. This
// should be rare in practice.

It's possible that the blueprint target change + blueprint deletion both happen in the body of blueprint_read, which would not necessarily return an "unknown blueprint ID" error. It could just result in a torn blueprint, with many rows of data missing.

@smklein (Collaborator) commented, quoting the above:

I think this could be fixed by adding a time_deleted field to the sitrep metadata (or something, could just be a bool), and having the delete operation start by setting that field, and then go and delete all the subordinate records, and then delete the metadata record. This way, the loader task could detect if it raced with a delete by re-loading the metadata record again and checking if its time_deleted was set or if it just no longer exists at all.

To be clear: I do like the idea of "identifying the start + end of the delete", and being able to look for either marker as a way to prevent reading torn sitreps

@smklein (Collaborator) commented:

Also, gonna mention it because it would be a very straightforward fix:

Making "sitrep delete" and "sitrep load" transactions would also fix this issue. Since sitreps are immutable, there should be no contention, except in this case we care about, where someone is attempting to a do a load / deletion of the same sitrep concurrently.

@smklein (Collaborator) commented:

Filed #9594 for this case

@smklein assigned hawkw and unassigned smklein, Jan 5, 2026
