Skip to content

Conversation

@Honigeintopf
Copy link
Collaborator

@Honigeintopf Honigeintopf commented Nov 4, 2024

Description

Closes #62.

This pr introduces the functionality for deleting firewalls if they exceed the firewallHealthTimeout which for now is set to 20 minutes.
Integration tests where added to make sure everything works as intended.

CA were updated, otherwise it is not possible to deploy to mini-lab.

@Honigeintopf Honigeintopf requested a review from a team as a code owner November 4, 2024 14:31
@Honigeintopf Honigeintopf linked an issue Nov 4, 2024 that may be closed by this pull request
@Honigeintopf Honigeintopf changed the title Firewall health check Firewall delete on unhealthy condition Nov 4, 2024
@Honigeintopf Honigeintopf requested a review from Gerrit91 November 4, 2024 14:33
@Gerrit91 Gerrit91 changed the title Firewall delete on unhealthy condition Recreate firewall on unhealthy condition Nov 4, 2024
Copy link
Contributor

@Gerrit91 Gerrit91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for coming up with a PR for this.


for _, fw := range fws {
fw := fw
if c.isFirewallUnhealthy(fw) {
Copy link
Contributor

@Gerrit91 Gerrit91 Nov 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With some small changes on your newly introduced struct I think something like this would be clearer and really point out that this is about timeouts:

status := evaluateFirewallConditions(fw)

switch {
case status.CreateTimeout || status.HealthTimeout:
    r.Log.Info("firewall health or creation timeout exceeded, deleting from set", "firewall-name", fw.Name)

    err := c.deleteFirewalls(r, fw)
    if err != nil {
        return nil, err
    }

    result = append(result, fw)
}

The isProgressing that's used in setStatus would also not be required anymore as it can be derived in case all other cases do not match.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it, what do you think now?

@github-project-automation github-project-automation bot moved this to Review in Development Jun 5, 2025
@Gerrit91 Gerrit91 removed the status in Development Jun 13, 2025
@Gerrit91 Gerrit91 moved this to Upcoming in Development Oct 20, 2025
Comment on lines 38 to 43
if fw.Status.Phase == v2.FirewallPhaseCreating && timeSinceReconcile > allocationTimeout {
c.log.Info("create timeout reached")
return firewallConditionStatus{CreateTimeout: true}
}

if seedConnected && unhealthyTimeout != 0 && created && timeSinceReconcile > unhealthyTimeout {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check if allocationTimeout is set to be able to disable this check

fw.Status.ControllerStatus = &v2.ControllerConnection{}
}
//add a fake concile so the unhealty firewall gets deleted
fw.Status.ControllerStatus.SeedUpdated.Time = time.Now().Add(-20 * 24 * time.Hour)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicate of L1984 ?

fw.Status.ControllerStatus = &v2.ControllerConnection{}
}
//add a fake concile so the unhealty firewall gets deleted
fw.Status.ControllerStatus.SeedUpdated.Time = time.Now().Add(-20 * 24 * time.Hour)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better reuse the existing time constants for health and create timeout and add or substract a specific amount of time, otherwise it is hard to follow what you are trying to test

}

// duration after which a firewall in the creation phase will be recreated, exceeded
if fw.Status.Phase == v2.FirewallPhaseCreating && timeSinceReconcile > allocationTimeout {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allocation timestamp must be used for checking the creation timeout because the firewall-controller was never able to connect in this phase?

Honigeintopf and others added 2 commits January 22, 2026 09:32
Co-authored-by: Stefan Majer <stefan.majer@f-i-ts.de>
@Honigeintopf Honigeintopf requested a review from Gerrit91 January 22, 2026 12:27
Copy link
Contributor

@Gerrit91 Gerrit91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I will try it out in our test environment.

Just very small comments.

Honigeintopf and others added 3 commits January 23, 2026 12:37
Co-authored-by: Gerrit <Gerrit91@users.noreply.github.com>
Co-authored-by: Gerrit <Gerrit91@users.noreply.github.com>
Co-authored-by: Gerrit <Gerrit91@users.noreply.github.com>
@Gerrit91 Gerrit91 moved this from Upcoming to In Progress in Development Jan 26, 2026
Co-authored-by: Gerrit <Gerrit91@users.noreply.github.com>
@Gerrit91
Copy link
Contributor

Test needs adaption (fake one of the unhealthy conditions).

@Honigeintopf
Copy link
Collaborator Author

Honigeintopf commented Feb 4, 2026

I changed a line in the code to only apply health timeout once we have a non-zero seed reconcile timestamp and made possible to specify 0s as timeout which translates to disabling the deletion.


if created && time.Since(pointer.SafeDeref(fw.Status.MachineStatus).AllocationTimestamp.Time) < c.c.GetFirewallHealthTimeout() {
r.Target.Status.ProgressingReplicas++
case statusReport.CreateTimeout || statusReport.HealthTimeout:
Copy link
Contributor

@Gerrit91 Gerrit91 Feb 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking about this again, I think this introduces a behavioral change. Unfortunately, this stuff is quite tricky. 🙈

Now, a FirewallSet becomes only unhealthy when a create or health timeout occurs. But it should become unhealthy already before a timeout occurs, such that a user/operator has the information about something not being okay before the firewall gets removed from the set.

Maybe we need to rethink this again. It's complex because there are two applications (firewall-controller and FCM) with different reconcile intervals. Also reconcile intervals are not very clear at first glance.

  • If the firewall-controller stops reconciling the Firewall in the seed (condition type SeedConnected), this would cause changes like new prefixes, rate limits, etc. are not getting applied anymore.
  • If the firewall-controller stops reconciling the FirewallMonitor in the shoot (condition type Connected), this indicates that no Services or ClusterWideNetworkPolicys are getting reconciled anymore.
  • The Firewall gets reconciled by the firewall-controller at least in a hard-coded interval of three minutes, setting the seed updated timestamps for the next FirewallMonitor update.
  • The firewall-controller reconciles the FirewallMonitor with the Interval defined in the Firewall spec (default 10s). As this happens pretty often, the sync into the Firewall status done by the monitor controller is not propagated all the time.
  • The Firewall status gets updated by the FCM at least every two minutes, syncing the timestamps in the Firewall with the values from the FirewallMonitor. This also triggers a reconcile of the FirewallSet (and hence, a check for unhealthy timeout).

In reality, the reconciles happen a bit more often, but these should be the maximum times.

Probably I get a bit crazy now because it's getting theoretical. Here is small figure showing the write intervals of the FirewallMonitor by the firewall-controller (fc, every two minutes for seed updates, we neglect the 10s of the FirewallMonitor for simplicity) and the sync times into the FirewallSet by the FCM (every three minutes).

fc (write)  w        w        w        w   
            |        |        |        |  
t (minutes) 0--1--2--3--4--5--6--7--8--9--10--
            |     |     |     |     |     |
FCM (read)  r     r     r     r     r     r
  • As can be seen, the update times in the FirewallSet resource will contain t={0, 2, 6, 8}
  • The update times alternate between 2 and 4 minutes
  • I hope I am not mistaken, but in general it can be said that:
    • The times between updates are multiples of the shorter interval that are closest to the longer interval
    • For example: If the FCM reads in 5 minutes, then it alternates between 3 and 6 minutes

So, for this example with the given reconcile times, we should assume a Firewall to be unhealthy if the time since the last reconcile is larger than four minutes.

Then, during a reconcile of the FirewallSet and a health timeout of 20 minutes, we can delete a firewall when, e.g. unhealthy since 12:04:00 (maybe it became unhealthy even at 12:00 but we saw that maximum drift between the update timestamps can be 4 minutes) and the current time is >= 12:24:00 (time since last reconcile + maximum update time + unhealthy timeout).

For the creation timeout it does not need to be so complicated, I think, because the cluster did either not exist or there is another running firewall in place, there is not a big risk to delete a firewall. Maximum sync times are very fast in this case (10s). So we can just use the allocation timestamp to calculate the timeout, otherwise it's progressing.

What do you think?

@Gerrit91 Gerrit91 mentioned this pull request Feb 9, 2026
@Honigeintopf
Copy link
Collaborator Author

Okay the issue with using FirewallPhaseRunning:

  1. Phase = Running (machine phoned home)
  2. But Connected, SeedConnected, DistanceConfigured haven't been set to True yet (monitor not updated)
  3. !allConditionsMet is true even though conditions never degraded - they were never fully met in the first place

So either we go ahead and fix when a fw is running( I wouldn't do that) or we say hey there is a new fw condition when the fw was ready once i.e. it finished progressing

@Gerrit91
Copy link
Contributor

Is it an issue if the firewall is phoned home and entered the running phase and the firewall is unhealthy until the firewall controller connects? It should not take longer than a minute anyway?

@Honigeintopf
Copy link
Collaborator Author

No, it's not an issue. During the window between phoned-home and firewall-controller-connecting, the FirewallHealthy condition(It's a new one) hasn't been set yet (it's only set once ALL conditions are met for the first time)

@Gerrit91
Copy link
Contributor

Okay, I see now where you want to go, I will comment in the code.

SetFirewallStatusFromMonitor(r.Target, mon)

if isAllConditionsMet(r.Target) {
cond := v2.NewCondition(v2.FirewallHealthy, v2.ConditionTrue, "Healthy", "All firewall conditions have been met.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be done in SetFirewallStatusFromMonitor. The case needs to be handled when FirewallNoControllerConnectionAnnotation is set in this same function. Otherwise, for these firewalls the health timeout would not work (i.e. when metal-api reports machine dead).

Comment on lines +11 to +15
type firewallConditionStatus struct {
IsReady bool
CreateTimeout bool
HealthTimeout bool
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think with your new condition we can also make use of it in this file. Maybe I prepare a pull request against this branch for you to give you an idea.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, tag me in it if you are ready.

Co-authored-by: Gerrit <Gerrit91@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

Firewall health check

3 participants