[nexus] FM alert requests and rendezvous task #9552
Conversation
smklein left a comment
Overall structure looks good, but I've got some questions about lifetimes of things
```sql
    lookup_ereports_assigned_to_fm_case
    ON omicron.public.fm_ereport_in_case (sitrep_id, case_id);

CREATE TABLE IF NOT EXISTS omicron.public.fm_alert_request (
```
Just because it's a little quirky - could we document the lifetime of entries in the fm_alert_request table?
You said in the PR description that these get inserted "at sitrep insertion time", and then they may or may not be used depending on whether the sitrep gets activated. When do we remove them?
```sql
    -- UUID of the sitrep in which the alert is requested.
    sitrep_id UUID NOT NULL,
    -- UUID of the sitrep in which the alert request was created.
    requested_sitrep_id UUID NOT NULL,
```
To confirm: This value of `requested_sitrep_id` is static, and the `sitrep_id` value will be the one changing as the `fm_alert_request` is carried from sitrep to sitrep, right?
That's correct --- is there something that would have made the comment clearer here?
This feels like a nitpick; maybe just:
```sql
-- UUID of the ongoing sitrep in which the alert is requested.
...
-- UUID of the original sitrep in which the alert request was created.
```
(it's clear when seeing the pattern in the other tables, but it has been a few weeks since I looked at this code, and I had to check things)
```rust
        variant.as_str().starts_with("test.") && !variant.is_test()
    })
    .collect::<Vec<_>>();
assert_eq!(
```
Nice quality-of-life test, though it makes me wonder if `test.` as a stringified prefix should just imply `variant.is_test()` for us
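For what it's worth, a minimal sketch of that musing, using an illustrative stand-in enum rather than the real alert class type (variant names and strings here are made up):

```rust
// Illustrative stand-in for the real alert class enum; only the shape of
// the two methods referenced in the test above matters.
enum AlertClass {
    TestFoo,
    InstanceStop,
}

impl AlertClass {
    fn as_str(&self) -> &'static str {
        match self {
            AlertClass::TestFoo => "test.foo",
            AlertClass::InstanceStop => "instance.stop",
        }
    }

    /// Derive "is this a test-only variant?" from the stringified prefix,
    /// so the prefix and the predicate can never drift apart (which is
    /// what the test above is guarding against).
    fn is_test(&self) -> bool {
        self.as_str().starts_with("test.")
    }
}
```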
ehh I'm seeing this is just moved code; consider this alternate `is_test` suggestion just musing rather than a request.
```rust
// XXX(eliza): is it better to allocate all of these into a big array
// and do a single `INSERT INTO` query, or iterate over them one by one
// (not allocating) but insert one at a time?
```
Something to consider: There are a ton of sources of conflicts due to raciness here:
- If we have multiple Nexus instances trying to perform rendezvous for a single sitrep simultaneously, they'll potentially overlap
- They could also be operating on distinct sitreps (e.g., a slow Nexus is doing rendezvous for "what it thinks is the latest sitrep", but it becomes out-of-date immediately after it starts the rendezvous process) and they'll have partially overlapping sets of alerts
I think this would require a batched version of `alert_create` (maybe `alerts_create`?) to use `on_conflict...do_nothing`.
However, I think I'm okay with the usual: "Let's keep it simple now, instrument it, and optimize it later when we need to"
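Something like the following could work for that batched version, sketched with Diesel's `on_conflict_do_nothing`; the table definition and types here are illustrative stand-ins, not the actual schema or datastore API:

```rust
use diesel::prelude::*;

diesel::table! {
    // Minimal stand-in for the real alert table; just the columns needed
    // for the sketch. (Assumes diesel's `uuid` and `serde_json` features.)
    alert (id) {
        id -> Uuid,
        class -> Text,
        payload -> Jsonb,
    }
}

#[derive(Insertable)]
#[diesel(table_name = alert)]
struct NewAlert {
    id: uuid::Uuid,
    class: String,
    payload: serde_json::Value,
}

/// Hypothetical batched `alerts_create`: insert every requested alert in a
/// single statement, and silently skip rows whose IDs already exist, so
/// two Nexus instances doing rendezvous for overlapping sitreps don't
/// conflict with each other.
fn alerts_create(conn: &mut PgConnection, alerts: &[NewAlert]) -> QueryResult<usize> {
    diesel::insert_into(alert::table)
        .values(alerts)
        .on_conflict_do_nothing()
        .execute(conn)
}
```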
```rust
let class = class.into();
match self
    .datastore
    .alert_create(&opctx, id, class, payload.clone())
```
I don't think we're deleting alert records yet - AFAICT, we're marking them dispatched, but leaving rows in CRDB for them - but when we do, this will be something we need to consider.
- Suppose we want to delete an alert record from cockroachdb
- Suppose there is a really laggy Nexus somewhere, running this rendezvous task. It's stuck doing rendezvous for a very old sitrep.
- If we do "actual SQL DELETE" of the alert, this background task could theoretically bring it back to life (which would be a bug)
I don't think this problem has been totally solved for blueprints either - I'm not seeing such guards in `reconcile_blueprint_rendezvous_tables` - but from a discussion with @jgallagher, the priority there was lower, because the rendezvous tables for blueprints are much lower-churn than they presumably will be for alerts.
I wrote up an issue for this on the blueprint side with #9592, but I think it'll be relevant here much sooner, especially as each alert is injecting an arbitrary JSON payload, which means the table is going to grow in size more quickly.
```rust
            .internal_context("failed to delete case ereport assignments")
    })?;

// Delete alert requests
```
I know we currently happen to be calling `fm_sitrep_delete_all` from a context where we're deleting orphan sitreps, but theoretically someone could call this function for "just an old sitrep", including one that was not orphaned.
Is it possible that this would be deleting records concurrently with the `fm_rendezvous` background task?
Looking at the background task, it looks like we read the "alert request" values in-memory from the watch channel, so I guess this would be like:
- Background task reads a sitrep, starts acting on it
- Someone calls `fm_sitrep_delete_all`, and passes the ID of the sitrep going through the background task. Note: this might be happening on a different Nexus instance from where the background task is executing.
- The background task is still acting upon a sitrep that we deleted from the database
(maybe this is fine, but kinda just checking in on "what guards what" with respect to row lifetimes)
I don't actually think there's anything particularly wrong with that; it should always be safe to delete any sitrep as long as it is not the current sitrep according to CRDB. Everything that is driven by data stored in a sitrep operates on the in-memory sitrep loaded by the `fm_sitrep_load` task, so it should actually be totally fine to delete the sitrep: any Nexus operating from that sitrep is operating on it from memory rather than from CRDB --- we won't see parts of a sitrep disappear on us as long as we've already loaded it.
That said, I do think there's a separate issue around deleting a sitrep *while the `fm_sitrep_loader` task is currently in the process of loading that sitrep*. In that case, it's possible to read an incomplete sitrep if the sitrep is deleted concurrently with loading it. We could change the loader task to re-load the metadata record after loading all the other components of the sitrep, to check that the sitrep wasn't in the process of being deleted while it was being loaded. However, we currently delete the sitrep metadata record as the last step in deleting the sitrep (as it's what indicates that the delete operation is "completed"); this means that we could have a situation where a load overlaps with the beginning of a delete but doesn't see the delete finish, and therefore loads an incomplete sitrep.
I think this could be fixed by adding a `time_deleted` field to the sitrep metadata (or something, could just be a bool), and having the delete operation start by setting that field, and then go and delete all the subordinate records, and then delete the metadata record. This way, the loader task could detect if it raced with a delete by re-loading the metadata record again and checking if its `time_deleted` was set or if it just no longer exists at all.
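To make that load-side check concrete, here's a rough sketch with purely illustrative types and stubbed-out queries (the real loading would go through the datastore, not free functions like these):

```rust
use chrono::{DateTime, Utc};
use uuid::Uuid;

// Illustrative stand-ins for the real sitrep model types.
struct SitrepMetadata {
    id: Uuid,
    time_deleted: Option<DateTime<Utc>>,
}

struct Sitrep {
    metadata: SitrepMetadata,
    // ...cases, ereport assignments, alert requests, etc.
}

enum LoadError {
    NotFound,
    /// The metadata row was tombstoned (or vanished) while we were reading
    /// the subordinate records, so whatever we read may be torn.
    DeletedWhileLoading,
}

// Stubs standing in for the actual datastore queries.
fn fetch_metadata(_id: Uuid) -> Option<SitrepMetadata> {
    todo!()
}
fn fetch_components(_metadata: SitrepMetadata) -> Sitrep {
    todo!()
}

/// 1. Read the metadata record and bail if it's already tombstoned.
/// 2. Read all of the subordinate records.
/// 3. Re-read the metadata record: if `time_deleted` is now set, or the
///    row is gone entirely, a delete raced with the load and the result
///    must be discarded.
fn load_sitrep(id: Uuid) -> Result<Sitrep, LoadError> {
    let metadata = fetch_metadata(id).ok_or(LoadError::NotFound)?;
    if metadata.time_deleted.is_some() {
        return Err(LoadError::NotFound);
    }
    let sitrep = fetch_components(metadata);
    match fetch_metadata(id) {
        Some(m) if m.time_deleted.is_none() => Ok(sitrep),
        _ => Err(LoadError::DeletedWhileLoading),
    }
}
```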
Does that make sense to you?
It does make sense, as a way to prevent reading a partial sitrep. Though, just to be explicit, it doesn't prevent the "loaded sitrep" from being immediately deleted after loading (nothing can prevent that).
I also think the issue here is shared with blueprints, if I'm reading nexus/src/app/background/tasks/blueprint_load.rs correctly:
- `blueprint_target_get_current_full` calls `blueprint_current_target_only` (which loads the current target)
- Then it calls `blueprint_read` on that target
There's even a comment about the case of concurrent deletion, but I think it's somewhat wrong:
omicron/nexus/db-queries/src/db/datastore/deployment.rs, lines 2140 to 2145 in 0c0bdf4:
```rust
// The blueprint for the current target cannot be deleted while it is
// the current target, but it's possible someone else (a) made a new
// blueprint the target and (b) deleted the blueprint pointed to by our
// `target` between the above query and the below query. In such a case,
// this query will fail with an "unknown blueprint ID" error. This
// should be rare in practice.
```
It's possible that the blueprint target change + blueprint deletion happens in the body of `blueprint_read`, which would not necessarily return an "unknown blueprint ID" error. It could just result in a torn blueprint, with many rows of data missing.
> I think this could be fixed by adding a `time_deleted` field to the sitrep metadata (or something, could just be a bool), and having the delete operation start by setting that field, and then go and delete all the subordinate records, and then delete the metadata record. This way, the loader task could detect if it raced with a delete by re-loading the metadata record again and checking if its `time_deleted` was set or if it just no longer exists at all.
To be clear: I do like the idea of "identifying the start + end of the delete", and being able to look for either marker as a way to prevent reading torn sitreps
Also, gonna mention it because it would be a very straightforward fix:
Making "sitrep delete" and "sitrep load" transactions would also fix this issue. Since sitreps are immutable, there should be no contention, except in the case we care about, where someone is attempting to do a load / deletion of the same sitrep concurrently.
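That might look roughly like the following on the delete side (a minimal sketch; the table names are partly guessed, only one subordinate table is shown, and the real datastore has its own connection/transaction helpers):

```rust
use diesel::prelude::*;
use diesel::sql_types;

/// Delete a sitrep's subordinate rows and its metadata record in one
/// transaction, so a concurrent transactional load can never observe a
/// half-deleted sitrep.
fn sitrep_delete(conn: &mut PgConnection, sitrep_id: uuid::Uuid) -> QueryResult<()> {
    conn.transaction(|conn| {
        // Subordinate rows first...
        diesel::sql_query(
            "DELETE FROM omicron.public.fm_alert_request WHERE sitrep_id = $1",
        )
        .bind::<sql_types::Uuid, _>(sitrep_id)
        .execute(conn)?;

        // ...then the metadata record that marks the sitrep as existing
        // (table name here is a guess).
        diesel::sql_query("DELETE FROM omicron.public.fm_sitrep WHERE id = $1")
            .bind::<sql_types::Uuid, _>(sitrep_id)
            .execute(conn)?;
        Ok(())
    })
}
```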
Filed #9594 for this case
Co-authored-by: Sean Klein <sean@oxide.computer>
This branch builds on #9492 by adding alert requests to fault management cases. This is a mechanism to allow a sitrep to specify that a set of alerts should exist. UUIDs and payloads for these alerts are specified in the sitrep. We don't want entries in the `alert` table to be created immediately when a sitrep is inserted, as that sitrep may not be made current. If the alert dispatcher operates on alerts created in sitreps that were not made current, it could dispatch duplicate or spurious alerts. Instead, we indirect the creation of alert records by having the sitrep insertion create alert requests, and if that sitrep becomes current, a background `fm_rendezvous` task reconciles the requested alerts in the sitrep with the actual `alert` table. Eventually, this task will be responsible for updating other rendezvous tables based on the current sitrep.

I also did a bit of refactoring of the alert class types so that the structured enum of alert classes could be used by the sitrep.
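As a rough illustration of the indirection described above (names here are stand-ins, not the actual task or datastore API):

```rust
use uuid::Uuid;

/// Illustrative shape of a single alert request carried by a sitrep; the
/// real data comes from the `fm_alert_request` rows loaded as part of the
/// in-memory sitrep.
struct AlertRequest {
    alert_id: Uuid,
    class: String,
    payload: serde_json::Value,
}

// Stub standing in for the datastore call that creates an alert record if
// one with this UUID doesn't already exist.
fn ensure_alert(_id: Uuid, _class: &str, _payload: &serde_json::Value) {
    todo!()
}

/// Hedged sketch of what one activation of the rendezvous step does with
/// the *current* sitrep's alert requests: because each request fixes the
/// alert's UUID and payload up front, re-running this on every activation
/// is idempotent, and requests from sitreps that never become current are
/// never turned into alert rows at all.
fn reconcile_alert_requests(requests: &[AlertRequest]) {
    for req in requests {
        ensure_alert(req.alert_id, &req.class, &req.payload);
    }
}
```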
This change was originally factored out from #9346, but I ultimately ended up rewriting a lot of it.