-
Notifications
You must be signed in to change notification settings - Fork 0
Recreate firewall on unhealthy condition #63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Gerrit91
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for coming up with a PR for this.
controllers/set/delete.go
Outdated
|
|
||
| for _, fw := range fws { | ||
| fw := fw | ||
| if c.isFirewallUnhealthy(fw) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With some small changes on your newly introduced struct I think something like this would be clearer and really point out that this is about timeouts:
status := evaluateFirewallConditions(fw)
switch {
case status.CreateTimeout || status.HealthTimeout:
r.Log.Info("firewall health or creation timeout exceeded, deleting from set", "firewall-name", fw.Name)
err := c.deleteFirewalls(r, fw)
if err != nil {
return nil, err
}
result = append(result, fw)
}The isProgressing that's used in setStatus would also not be required anymore as it can be derived in case all other cases do not match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it, what do you think now?
controllers/set/status.go
Outdated
| if fw.Status.Phase == v2.FirewallPhaseCreating && timeSinceReconcile > allocationTimeout { | ||
| c.log.Info("create timeout reached") | ||
| return firewallConditionStatus{CreateTimeout: true} | ||
| } | ||
|
|
||
| if seedConnected && unhealthyTimeout != 0 && created && timeSinceReconcile > unhealthyTimeout { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check if allocationTimeout is set to be able to disable this check
integration/integration_test.go
Outdated
| fw.Status.ControllerStatus = &v2.ControllerConnection{} | ||
| } | ||
| //add a fake concile so the unhealty firewall gets deleted | ||
| fw.Status.ControllerStatus.SeedUpdated.Time = time.Now().Add(-20 * 24 * time.Hour) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
duplicate of L1984 ?
integration/integration_test.go
Outdated
| fw.Status.ControllerStatus = &v2.ControllerConnection{} | ||
| } | ||
| //add a fake concile so the unhealty firewall gets deleted | ||
| fw.Status.ControllerStatus.SeedUpdated.Time = time.Now().Add(-20 * 24 * time.Hour) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better reuse the existing time constants for health and create timeout and add or substract a specific amount of time, otherwise it is hard to follow what you are trying to test
controllers/set/status.go
Outdated
| } | ||
|
|
||
| // duration after which a firewall in the creation phase will be recreated, exceeded | ||
| if fw.Status.Phase == v2.FirewallPhaseCreating && timeSinceReconcile > allocationTimeout { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allocation timestamp must be used for checking the creation timeout because the firewall-controller was never able to connect in this phase?
Co-authored-by: Stefan Majer <stefan.majer@f-i-ts.de>
Gerrit91
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, I will try it out in our test environment.
Just very small comments.
Co-authored-by: Gerrit <Gerrit91@users.noreply.github.com>
Co-authored-by: Gerrit <Gerrit91@users.noreply.github.com>
Co-authored-by: Gerrit <Gerrit91@users.noreply.github.com>
Co-authored-by: Gerrit <Gerrit91@users.noreply.github.com>
|
Test needs adaption (fake one of the unhealthy conditions). |
|
I changed a line in the code to only apply health timeout once we have a non-zero seed reconcile timestamp and made possible to specify 0s as timeout which translates to disabling the deletion. |
|
|
||
| if created && time.Since(pointer.SafeDeref(fw.Status.MachineStatus).AllocationTimestamp.Time) < c.c.GetFirewallHealthTimeout() { | ||
| r.Target.Status.ProgressingReplicas++ | ||
| case statusReport.CreateTimeout || statusReport.HealthTimeout: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking about this again, I think this introduces a behavioral change. Unfortunately, this stuff is quite tricky. 🙈
Now, a FirewallSet becomes only unhealthy when a create or health timeout occurs. But it should become unhealthy already before a timeout occurs, such that a user/operator has the information about something not being okay before the firewall gets removed from the set.
Maybe we need to rethink this again. It's complex because there are two applications (firewall-controller and FCM) with different reconcile intervals. Also reconcile intervals are not very clear at first glance.
- If the firewall-controller stops reconciling the
Firewallin the seed (condition typeSeedConnected), this would cause changes like new prefixes, rate limits, etc. are not getting applied anymore. - If the firewall-controller stops reconciling the
FirewallMonitorin the shoot (condition typeConnected), this indicates that noServices orClusterWideNetworkPolicys are getting reconciled anymore. - The
Firewallgets reconciled by the firewall-controller at least in a hard-coded interval of three minutes, setting the seed updated timestamps for the nextFirewallMonitorupdate. - The firewall-controller reconciles the
FirewallMonitorwith theIntervaldefined in theFirewallspec (default 10s). As this happens pretty often, the sync into theFirewallstatus done by the monitor controller is not propagated all the time. - The
Firewallstatus gets updated by the FCM at least every two minutes, syncing the timestamps in theFirewallwith the values from theFirewallMonitor. This also triggers a reconcile of theFirewallSet(and hence, a check for unhealthy timeout).
In reality, the reconciles happen a bit more often, but these should be the maximum times.
Probably I get a bit crazy now because it's getting theoretical. Here is small figure showing the write intervals of the FirewallMonitor by the firewall-controller (fc, every two minutes for seed updates, we neglect the 10s of the FirewallMonitor for simplicity) and the sync times into the FirewallSet by the FCM (every three minutes).
fc (write) w w w w
| | | |
t (minutes) 0--1--2--3--4--5--6--7--8--9--10--
| | | | | |
FCM (read) r r r r r r
- As can be seen, the update times in the
FirewallSetresource will contain t={0, 2, 6, 8} - The update times alternate between 2 and 4 minutes
- I hope I am not mistaken, but in general it can be said that:
- The times between updates are multiples of the shorter interval that are closest to the longer interval
- For example: If the FCM reads in 5 minutes, then it alternates between 3 and 6 minutes
So, for this example with the given reconcile times, we should assume a Firewall to be unhealthy if the time since the last reconcile is larger than four minutes.
Then, during a reconcile of the FirewallSet and a health timeout of 20 minutes, we can delete a firewall when, e.g. unhealthy since 12:04:00 (maybe it became unhealthy even at 12:00 but we saw that maximum drift between the update timestamps can be 4 minutes) and the current time is >= 12:24:00 (time since last reconcile + maximum update time + unhealthy timeout).
For the creation timeout it does not need to be so complicated, I think, because the cluster did either not exist or there is another running firewall in place, there is not a big risk to delete a firewall. Maximum sync times are very fast in this case (10s). So we can just use the allocation timestamp to calculate the timeout, otherwise it's progressing.
What do you think?
|
Okay the issue with using FirewallPhaseRunning:
So either we go ahead and fix when a fw is running( I wouldn't do that) or we say hey there is a new fw condition when the fw was ready once i.e. it finished progressing |
|
Is it an issue if the firewall is phoned home and entered the running phase and the firewall is unhealthy until the firewall controller connects? It should not take longer than a minute anyway? |
|
No, it's not an issue. During the window between phoned-home and firewall-controller-connecting, the FirewallHealthy condition(It's a new one) hasn't been set yet (it's only set once ALL conditions are met for the first time) |
|
Okay, I see now where you want to go, I will comment in the code. |
| SetFirewallStatusFromMonitor(r.Target, mon) | ||
|
|
||
| if isAllConditionsMet(r.Target) { | ||
| cond := v2.NewCondition(v2.FirewallHealthy, v2.ConditionTrue, "Healthy", "All firewall conditions have been met.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be done in SetFirewallStatusFromMonitor. The case needs to be handled when FirewallNoControllerConnectionAnnotation is set in this same function. Otherwise, for these firewalls the health timeout would not work (i.e. when metal-api reports machine dead).
| type firewallConditionStatus struct { | ||
| IsReady bool | ||
| CreateTimeout bool | ||
| HealthTimeout bool | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think with your new condition we can also make use of it in this file. Maybe I prepare a pull request against this branch for you to give you an idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, tag me in it if you are ready.
Co-authored-by: Gerrit <Gerrit91@users.noreply.github.com>
Description
Closes #62.
This pr introduces the functionality for deleting firewalls if they exceed the firewallHealthTimeout which for now is set to 20 minutes.
Integration tests where added to make sure everything works as intended.
CA were updated, otherwise it is not possible to deploy to mini-lab.