Conversation

@hawkw (Member) commented Dec 19, 2025

This branch builds on #9492 by adding alert requests to fault management cases. This is a mechanism to allow a sitrep to specify that a set of alerts should exist. UUIDs and payloads for these alerts are specified in the sitrep. We don't want entries in the alert table to be created immediately when a sitrep is inserted, as that sitrep may not be made current. If the alert dispatcher operates on alerts created in sitreps that were not made current, it could dispatch duplicate or spurious alerts. Instead, we indirect the creation of alert records by having the sitrep insertion create alert requests, and if that sitrep becomes current, a background fm_rendezvous task reconciles the requested alerts in the sitrep with the actual alert table. Eventually, this task will be responsible for updating other rendezvous tables based on the current sitrep.
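Schematically, the flow looks something like the sketch below (the types and names here are illustrative only, not the actual datastore API):

    // Illustrative sketch of the alert-request flow; names are stand-ins
    // rather than the real omicron types.
    use serde_json::Value;
    use uuid::Uuid;

    // One requested alert, recorded as part of the sitrep when it is inserted.
    struct AlertRequest {
        alert_id: Uuid, // UUID chosen for the alert when the sitrep was built
        class: String,
        payload: Value,
    }

    struct Sitrep {
        id: Uuid,
        alert_requests: Vec<AlertRequest>,
        // cases, ereport assignments, ...
    }

    // Stand-in for idempotently ensuring a row in the alert table.
    trait AlertTable {
        fn ensure_alert(&mut self, id: Uuid, class: &str, payload: &Value);
    }

    // Runs in the `fm_rendezvous` background task, only ever against the
    // sitrep that has actually been made current.
    fn reconcile_alerts(current: &Sitrep, alerts: &mut dyn AlertTable) {
        for req in &current.alert_requests {
            // Re-running against the same sitrep (or racing with another
            // Nexus) is safe because ensuring an existing alert is a no-op.
            alerts.ensure_alert(req.alert_id, &req.class, &req.payload);
        }
    }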

I also did a bit of refactoring of the alert class types so that the structured enum of alert classes could be used by the sitrep.

This change was originally factored out from #9346, but I ultimately ended up rewriting a lot of it.

@hawkw requested review from davepacheco and smklein, Dec 19, 2025
@hawkw self-assigned this, Dec 19, 2025
@hawkw added the nexus and fault-management labels, Dec 19, 2025
@hawkw changed the title from "[nexus] fm alert requests and rendezvous task" to "[nexus] FM alert requests and rendezvous task", Dec 19, 2025
@smklein assigned smklein and unassigned hawkw, Jan 5, 2026
@smklein (Collaborator) left a comment:

Overall structure looks good, but I've got some questions about lifetimes of things

lookup_ereports_assigned_to_fm_case
ON omicron.public.fm_ereport_in_case (sitrep_id, case_id);

CREATE TABLE IF NOT EXISTS omicron.public.fm_alert_request (
@smklein (Collaborator) commented:

Just because it's a little quirky - could we document the lifetime of entries in the fm_alert_request table?

You said in the PR description that these get inserted "at sitrep insertion time", and then they may or may not be used depending on whether the sitrep gets activated. When do we remove them?

-- UUID of the sitrep in which the alert is requested.
sitrep_id UUID NOT NULL,
-- UUID of the sitrep in which the alert request was created.
requested_sitrep_id UUID NOT NULL,
@smklein (Collaborator) commented:

To confirm: This value of requested_sitrep_id is static, and the sitrep_id value will be the one changing as the fm_alert_request is carried from sitrep to sitrep, right?

@hawkw (Member, Author) commented:

That's correct --- is there something that would have made the comment clearer here?

@smklein (Collaborator) commented:

This feels like a nitpick; maybe just:

    -- UUID of the ongoing sitrep in which the alert is requested.
...
    -- UUID of the original sitrep in which the alert request was created.

(it's clear when seeing the pattern in the other tables, but it has been a few weeks since I looked at this code, and I had to check things)

variant.as_str().starts_with("test.") && !variant.is_test()
})
.collect::<Vec<_>>();
assert_eq!(
@smklein (Collaborator) commented:

Nice quality-of-life test, though it makes me wonder if test. as a stringified prefix should just imply variant.is_test() for us

@smklein (Collaborator) commented:

ehh I'm seeing this is just moved code; consider this alternate is_test suggestion just musing rather than a request.
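For concreteness, the alternate suggestion would amount to deriving is_test from the stringified name so the two can never disagree; a rough sketch, with AlertClass and its variants standing in for the real enum:

    // Sketch only: derive `is_test` from the stringified class name.
    // `AlertClass` and its variants here are stand-ins for the real enum.
    enum AlertClass {
        TestFoo,
        InstanceStopRequested,
    }

    impl AlertClass {
        fn as_str(&self) -> &'static str {
            match self {
                AlertClass::TestFoo => "test.foo",
                AlertClass::InstanceStopRequested => "instance.stop_requested",
            }
        }

        fn is_test(&self) -> bool {
            self.as_str().starts_with("test.")
        }
    }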

Comment on lines +82 to +84
// XXX(eliza): is it better to allocate all of these into a big array
// and do a single `INSERT INTO` query, or iterate over them one by one
// (not allocating) but insert one at a time?
@smklein (Collaborator) commented:

Something to consider: There are a ton of sources of conflicts due to raciness here:

  • If we have multiple Nexus instances trying to perform rendezvous for a single sitrep simultaneously, they'll potentially overlap
  • They could also be operating on distinct sitreps (e.g., a slow Nexus is doing rendezvous for "what it thinks is the latest sitrep", but it becomes out-of-date immediately after it starts the rendezvous process) and they'll have partially overlapping sets of alerts

I think this would require a batched version of alert_create (maybe alerts_create?) to use on_conflict...do_nothing.

However, I think I'm okay with the usual: "Let's keep it simple now, instrument it, and optimize it later when we need to"
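For reference, a batched alerts_create along those lines might look roughly like the following Diesel sketch. The table, column, and type names are assumptions for the sketch, not the actual omicron schema; it relies on Diesel's on_conflict(...).do_nothing() and assumes the postgres, uuid, and serde_json features.

    use diesel::prelude::*;

    diesel::table! {
        alert (id) {
            id -> Uuid,
            class -> Text,
            payload -> Jsonb,
        }
    }

    #[derive(Insertable)]
    #[diesel(table_name = alert)]
    struct NewAlert {
        id: uuid::Uuid,
        class: String,
        payload: serde_json::Value,
    }

    fn alerts_create(
        conn: &mut PgConnection,
        new_alerts: &[NewAlert],
    ) -> QueryResult<usize> {
        diesel::insert_into(alert::table)
            .values(new_alerts)
            // An alert that was already created (by an earlier run, or by a
            // concurrent Nexus doing rendezvous for an overlapping sitrep)
            // conflicts on the primary key and is simply skipped.
            .on_conflict(alert::id)
            .do_nothing()
            .execute(conn)
    }

Since the alert UUIDs are specified in the sitrep itself, the primary-key conflict target should be enough to make concurrent rendezvous runs idempotent.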

let class = class.into();
match self
.datastore
.alert_create(&opctx, id, class, payload.clone())
@smklein (Collaborator) commented:

I don't think we're deleting alert records yet - AFAICT, we're marking them dispatched, but leaving rows in CRDB for them - but when we do, this will be something we need to consider.

  • Suppose we want to delete an alert record from cockroachdb
  • Suppose there is a really laggy Nexus somewhere, running this rendezvous task. It's stuck doing rendezvous for a very old sitrep.
  • If we do "actual SQL DELETE" of the alert, this background task could theoretically bring it back to life (which would be a bug)

I don't think this problem has been totally solved for blueprints either - I'm not seeing such guards in reconcile_blueprint_rendezvous_tables - but from a discussion with @jgallagher, the priority there was lower, because the rendezvous tables for blueprints are much lower-churn than they presumably will be for alerts.

I wrote up an issue for this on the blueprint side with #9592, but I think it'll be relevant here much sooner, especially as each alert is injecting an arbitrary JSON payload, which means the table is going to grow in size more quickly.

.internal_context("failed to delete case ereport assignments")
})?;

// Delete alert requests
@smklein (Collaborator) commented:

I know we currently happen to be calling fm_sitrep_delete_all from a context where we're deleting orphan sitreps, but theoretically someone could call this function for "just an old sitrep", including one that was not orphaned.

Is it possible that this would be deleting records concurrently with the fm_rendezvous background task?

Looking at the background task, it looks like we read the "alert request" values in-memory from the watch channel, so I guess this would go something like:

  • The background task reads a sitrep and starts acting on it
  • Someone calls fm_sitrep_delete_all, passing the ID of the sitrep that the background task is working on. Note: this might be happening on a different Nexus instance from where the background task is executing.
  • The background task is still acting upon a sitrep that we deleted from the database

@smklein (Collaborator) commented:

(maybe this is fine, but kinda just checking in on "what guards what" with respect to row lifetimes)

@hawkw (Member, Author) commented:

I don't actually think there's anything particularly wrong with that; it should always be safe to delete any sitrep as long as it is not the current sitrep according to CRDB. Everything that is driven by data stored in a sitrep operates on the in-memory sitrep loaded by the fm_sitrep_load task, so it should actually be totally fine to delete the sitrep: any Nexus operating from that sitrep is operating on it from memory rather than from CRDB --- we won't see parts of a sitrep disappear on us as long as we've already loaded it.

That said, I do actually think there's a separate issue around deleting a sitrep *while* the fm_sitrep_loader task is in the process of loading that sitrep. In that case, it's possible to read an incomplete sitrep if the sitrep is deleted concurrently with loading it. We could change the loader task to re-load the metadata record after loading all the other components of the sitrep, to check that the sitrep wasn't in the process of being deleted while it was being loaded. However, we currently delete the sitrep metadata record as the last step in deleting the sitrep (as it's what indicates that the delete operation is "completed"); this means that we could have a situation where a load overlaps with the beginning of a delete but doesn't see the delete finish, and therefore loads an incomplete sitrep.

I think this could be fixed by adding a time_deleted field to the sitrep metadata (or something, could just be a bool), and having the delete operation start by setting that field, and then go and delete all the subordinate records, and then delete the metadata record. This way, the loader task could detect if it raced with a delete by re-loading the metadata record again and checking if its time_deleted was set or if it just no longer exists at all.
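Roughly, with stand-in types (none of these are the real datastore names), the load-side check could look like:

    use chrono::{DateTime, Utc};
    use uuid::Uuid;

    struct SitrepMetadata {
        id: Uuid,
        // Proposed field: set at the *start* of a delete, before any
        // subordinate rows are removed; the metadata row itself is still
        // deleted last, as today.
        time_deleted: Option<DateTime<Utc>>,
    }

    struct Sitrep {
        metadata: SitrepMetadata,
        // cases, ereport-to-case assignments, alert requests, ...
    }

    enum LoadError {
        // The metadata row disappeared or was tombstoned while we were
        // loading the rest of the sitrep, so what we read may be torn.
        DeletedDuringLoad,
        Other(String),
    }

    // Stand-in for whatever the datastore exposes to the loader task.
    trait SitrepStore {
        fn read_sitrep(&self, id: Uuid) -> Result<Sitrep, LoadError>;
        fn fetch_metadata(&self, id: Uuid) -> Result<Option<SitrepMetadata>, LoadError>;
    }

    fn load_sitrep_checked(db: &impl SitrepStore, id: Uuid) -> Result<Sitrep, LoadError> {
        // 1. Read the metadata record and all subordinate records.
        let sitrep = db.read_sitrep(id)?;

        // 2. Re-read the metadata record. If it's gone, or `time_deleted` is
        //    now set, a delete started while we were loading; bail out rather
        //    than publish a potentially incomplete sitrep.
        match db.fetch_metadata(id)? {
            Some(meta) if meta.time_deleted.is_none() => Ok(sitrep),
            _ => Err(LoadError::DeletedDuringLoad),
        }
    }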

Does that make sense to you?

@smklein (Collaborator) commented:

It does make sense, as a way to prevent reading a partial sitrep. Though, just to be explicit, it doesn't prevent the "loaded sitrep" from being immediately deleted after loading (nothing can prevent that).

I also think the issue here is shared with blueprints, if I'm reading nexus/src/app/background/tasks/blueprint_load.rs correctly:

  • blueprint_target_get_current_full calls:
    • blueprint_current_target_only (which loads the current target)
    • Then it calls blueprint_read on that target

There's even a comment about the case of concurrent deletion, but I think it's somewhat wrong:

// The blueprint for the current target cannot be deleted while it is
// the current target, but it's possible someone else (a) made a new
// blueprint the target and (b) deleted the blueprint pointed to by our
// `target` between the above query and the below query. In such a case,
// this query will fail with an "unknown blueprint ID" error. This
// should be rare in practice.

It's possible that the blueprint target change + blueprint deletion both happen in the body of blueprint_read, which would not necessarily return an "unknown blueprint ID" error. It could just result in a torn blueprint, with many rows of data missing.

@smklein (Collaborator) commented, quoting the above:

I think this could be fixed by adding a time_deleted field to the sitrep metadata (or something, could just be a bool), and having the delete operation start by setting that field, and then go and delete all the subordinate records, and then delete the metadata record. This way, the loader task could detect if it raced with a delete by re-loading the metadata record again and checking if its time_deleted was set or if it just no longer exists at all.

To be clear: I do like the idea of "identifying the start + end of the delete", and being able to look for either marker as a way to prevent reading torn sitreps

@smklein (Collaborator) commented:

Also, gonna mention it because it would be a very straightforward fix:

Making "sitrep delete" and "sitrep load" transactions would also fix this issue. Since sitreps are immutable, there should be no contention, except in this case we care about, where someone is attempting to a do a load / deletion of the same sitrep concurrently.

@smklein (Collaborator) commented:

Filed #9594 for this case

@smklein assigned hawkw and unassigned smklein, Jan 5, 2026
