Recreate firewall on unhealthy condition #63

Honigeintopf · 2024-11-04T14:31:09Z

Description

Closes #62.

This PR introduces functionality for deleting firewalls that exceed the firewallHealthTimeout, which for now is set to 20 minutes.
Integration tests were added to make sure everything works as intended.

The CA certificates were updated; otherwise it is not possible to deploy to mini-lab.

Allow configuration of firewall create and health timeout gardener-extension-provider-metal#487

Gerrit91

Thanks for coming up with a PR for this.

controllers/set/delete.go

controllers/set/status.go

Gerrit91 · 2024-11-06T14:03:33Z

controllers/set/delete.go


 	for _, fw := range fws {
-		fw := fw
+		if c.isFirewallUnhealthy(fw) {


With some small changes on your newly introduced struct I think something like this would be clearer and really point out that this is about timeouts:

status := evaluateFirewallConditions(fw)
switch {
case status.CreateTimeout || status.HealthTimeout:
	r.Log.Info("firewall health or creation timeout exceeded, deleting from set", "firewall-name", fw.Name)
	err := c.deleteFirewalls(r, fw)
	if err != nil {
		return nil, err
	}
	result = append(result, fw)
}

The isProgressing that's used in setStatus would also no longer be required, as it can be derived when none of the other cases match.
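To illustrate how "progressing" could fall out of the default branch, here is a minimal sketch. The `replicaCounts` type and the `count` function are hypothetical stand-ins for the FirewallSet status handling, not the code in this PR.

```go
package main

import "fmt"

// firewallConditionStatus mirrors the struct discussed in this review.
type firewallConditionStatus struct {
	IsReady       bool
	CreateTimeout bool
	HealthTimeout bool
}

// replicaCounts is a hypothetical stand-in for the FirewallSet status counters.
type replicaCounts struct {
	Ready, Progressing, Unhealthy int
}

// count classifies every status exactly once. "Progressing" is the default
// branch, so no separate isProgressing flag is needed.
func count(statuses []firewallConditionStatus) replicaCounts {
	var c replicaCounts
	for _, s := range statuses {
		switch {
		case s.CreateTimeout || s.HealthTimeout:
			c.Unhealthy++
		case s.IsReady:
			c.Ready++
		default:
			c.Progressing++
		}
	}
	return c
}

func main() {
	fmt.Printf("%+v\n", count([]firewallConditionStatus{
		{IsReady: true},
		{HealthTimeout: true},
		{}, // neither ready nor timed out: counted as progressing
	}))
}
```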

I changed it, what do you think now?

controllers/set/status.go

majst01 · 2025-10-28T15:10:53Z

controllers/set/status.go

+	if fw.Status.Phase == v2.FirewallPhaseCreating && timeSinceReconcile > allocationTimeout {
+		c.log.Info("create timeout reached")
+		return firewallConditionStatus{CreateTimeout: true}
+	}
+
+	if seedConnected && unhealthyTimeout != 0 && created && timeSinceReconcile > unhealthyTimeout {


Check if allocationTimeout is set to be able to disable this check

majst01 · 2025-10-28T15:13:45Z

integration/integration_test.go

+						fw.Status.ControllerStatus = &v2.ControllerConnection{}
+					}
+					//add a fake concile so the unhealty firewall gets deleted
+					fw.Status.ControllerStatus.SeedUpdated.Time = time.Now().Add(-20 * 24 * time.Hour)


duplicate of L1984 ?

integration/integration_test.go

majst01 · 2025-10-28T15:16:37Z

integration/integration_test.go

+						fw.Status.ControllerStatus = &v2.ControllerConnection{}
+					}
+					//add a fake concile so the unhealty firewall gets deleted
+					fw.Status.ControllerStatus.SeedUpdated.Time = time.Now().Add(-20 * 24 * time.Hour)


Better to reuse the existing time constants for the health and create timeouts and add or subtract a specific amount of time; otherwise it is hard to follow what you are trying to test.

majst01 · 2025-10-28T15:20:24Z

controllers/set/status.go

+	}
+
+	// duration after which a firewall in the creation phase will be recreated, exceeded
+	if fw.Status.Phase == v2.FirewallPhaseCreating && timeSinceReconcile > allocationTimeout {


Shouldn't the allocation timestamp be used for checking the creation timeout, because the firewall-controller was never able to connect in this phase?

Co-authored-by: Stefan Majer <stefan.majer@f-i-ts.de>

Gerrit91

Looks good, I will try it out in our test environment.

Just very small comments.

controllers/set/status.go

Co-authored-by: Gerrit <Gerrit91@users.noreply.github.com>

controllers/set/status.go

Co-authored-by: Gerrit <Gerrit91@users.noreply.github.com>

Gerrit91 · 2026-01-27T12:24:11Z

Test needs adaptation (fake one of the unhealthy conditions).

Honigeintopf · 2026-02-04T10:00:23Z

I changed a line in the code to only apply the health timeout once we have a non-zero seed reconcile timestamp, and made it possible to specify 0s as the timeout, which disables the deletion.

Gerrit91 · 2026-02-09T10:41:00Z

controllers/set/status.go

-
-		if created && time.Since(pointer.SafeDeref(fw.Status.MachineStatus).AllocationTimestamp.Time) < c.c.GetFirewallHealthTimeout() {
-			r.Target.Status.ProgressingReplicas++
+		case statusReport.CreateTimeout || statusReport.HealthTimeout:


After thinking about this again, I think this introduces a behavioral change. Unfortunately, this stuff is quite tricky. 🙈

Now, a FirewallSet only becomes unhealthy when a create or health timeout occurs. But it should become unhealthy before a timeout occurs, so that a user/operator knows that something is not okay before the firewall gets removed from the set.

Maybe we need to rethink this again. It's complex because there are two applications (firewall-controller and FCM) with different reconcile intervals. Also reconcile intervals are not very clear at first glance.

If the firewall-controller stops reconciling the Firewall in the seed (condition type SeedConnected), changes like new prefixes, rate limits, etc. are no longer applied.

If the firewall-controller stops reconciling the FirewallMonitor in the shoot (condition type Connected), this indicates that no Services or ClusterWideNetworkPolicys are getting reconciled anymore.

The Firewall gets reconciled by the firewall-controller at least in a hard-coded interval of three minutes, setting the seed updated timestamps for the next FirewallMonitor update.

The firewall-controller reconciles the FirewallMonitor with the Interval defined in the Firewall spec (default 10s). As this happens pretty often, the sync into the Firewall status done by the monitor controller is not propagated all the time.

The Firewall status gets updated by the FCM at least every two minutes, syncing the timestamps in the Firewall with the values from the FirewallMonitor. This also triggers a reconcile of the FirewallSet (and hence, a check for unhealthy timeout).

In reality, the reconciles happen a bit more often, but these should be the maximum times.

I am probably getting a bit carried away now because this is becoming theoretical. Here is a small figure showing the write intervals of the FirewallMonitor by the firewall-controller (fc, every three minutes for seed updates; we neglect the 10s of the FirewallMonitor for simplicity) and the sync times into the FirewallSet by the FCM (every two minutes).

fc (write)   w        w        w        w
             |        |        |        |
t (minutes)  0--1--2--3--4--5--6--7--8--9--10--
             |     |     |     |     |     |
FCM (read)   r     r     r     r     r     r

As can be seen, the update times in the FirewallSet resource will contain t={0, 2, 6, 8}

The update times alternate between 2 and 4 minutes

I hope I am not mistaken, but in general it can be said that:

The times between updates are multiples of the shorter interval that are closest to the longer interval

For example: If the FCM reads in 5 minutes, then it alternates between 3 and 6 minutes

So, for this example with the given reconcile times, we should assume a Firewall to be unhealthy if the time since the last reconcile is larger than four minutes.

Then, during a reconcile of the FirewallSet with a health timeout of 20 minutes, we can delete a firewall when, for example, it has been unhealthy since 12:04:00 (it may have become unhealthy as early as 12:00, but we saw that the maximum drift between the update timestamps can be 4 minutes) and the current time is >= 12:24:00 (time of the last reconcile + maximum update time + unhealthy timeout).

For the creation timeout it does not need to be so complicated, I think. Either the cluster did not exist yet or there is another running firewall in place, so there is no big risk in deleting a firewall. Maximum sync times are very fast in this case (10s). So we can just use the allocation timestamp to calculate the timeout; otherwise it's progressing.

What do you think?

Honigeintopf · 2026-02-10T10:54:51Z

Okay the issue with using FirewallPhaseRunning:

Phase = Running (machine phoned home)
But Connected, SeedConnected, DistanceConfigured haven't been set to True yet (monitor not updated)
!allConditionsMet is true even though conditions never degraded - they were never fully met in the first place

So either we change when a firewall counts as running (I wouldn't do that), or we introduce a new firewall condition that is set once the firewall has been ready, i.e. once it has finished progressing.

Gerrit91 · 2026-02-10T11:36:06Z

Is it an issue if the firewall has phoned home and entered the running phase, and is unhealthy until the firewall-controller connects? It should not take longer than a minute anyway.

Honigeintopf · 2026-02-10T12:02:59Z

No, it's not an issue. During the window between phoning home and the firewall-controller connecting, the FirewallHealthy condition (it's a new one) hasn't been set yet (it's only set once ALL conditions are met for the first time).
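A reduced sketch of the idea: the condition acts as a latch, so the health timeout can only start counting for firewalls that were fully healthy at least once. The `firewall` type, the `EverHealthy` field and the method names are hypothetical, and the assumption that the latch stays set afterwards is mine, not something this thread confirms.

```go
package main

import "fmt"

// firewall is a hypothetical, heavily reduced stand-in for the Firewall resource.
type firewall struct {
	Connected          bool
	SeedConnected      bool
	DistanceConfigured bool
	// EverHealthy models the new FirewallHealthy condition (assumed to stay set).
	EverHealthy bool
}

func (f *firewall) allConditionsMet() bool {
	return f.Connected && f.SeedConnected && f.DistanceConfigured
}

// observe latches EverHealthy the first time all conditions are met.
func (f *firewall) observe() {
	if f.allConditionsMet() {
		f.EverHealthy = true
	}
}

// degraded reports whether the firewall was healthy once but is not anymore.
// A firewall that has only phoned home is not degraded, it is still progressing.
func (f *firewall) degraded() bool {
	return f.EverHealthy && !f.allConditionsMet()
}

func main() {
	fw := &firewall{}
	fw.observe()
	fmt.Println(fw.degraded()) // phoned home, monitor not updated yet

	fw.Connected, fw.SeedConnected, fw.DistanceConfigured = true, true, true
	fw.observe()
	fw.SeedConnected = false
	fw.observe()
	fmt.Println(fw.degraded()) // was healthy once, now lost the seed connection
}
```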

Gerrit91 · 2026-02-10T12:31:04Z

Okay, I see now where you want to go, I will comment in the code.

Gerrit91 · 2026-02-10T12:32:56Z

controllers/firewall/reconcile.go

 		SetFirewallStatusFromMonitor(r.Target, mon)
+
+		if isAllConditionsMet(r.Target) {
+			cond := v2.NewCondition(v2.FirewallHealthy, v2.ConditionTrue, "Healthy", "All firewall conditions have been met.")


This should be done in SetFirewallStatusFromMonitor. The case needs to be handled when FirewallNoControllerConnectionAnnotation is set in this same function. Otherwise, for these firewalls the health timeout would not work (i.e. when metal-api reports machine dead).

api/v2/types_firewall.go

Gerrit91 · 2026-02-10T13:26:01Z

controllers/set/status.go

+type firewallConditionStatus struct {
+	IsReady       bool
+	CreateTimeout bool
+	HealthTimeout bool
+}


I think we can also make use of your new condition in this file. Maybe I will prepare a pull request against this branch to give you an idea.

Okay, tag me in it if you are ready.

Co-authored-by: Gerrit <Gerrit91@users.noreply.github.com>

Honigeintopf added 6 commits October 25, 2024 14:49

Update Certs

2072462

Update Readme to include "-n firewall"

65cc193

Created test to check if unhealty firewall is replaced when unhealthy

68d79ea

Added delte after healthtimeout is exceeded, still need to adjust int…

fd71798

…egration tests

Added integration tests and deletion of fw after unhealthytimeout

0de0032

refactor

9605a18

Honigeintopf requested a review from a team as a code owner November 4, 2024 14:31

Honigeintopf linked an issue Nov 4, 2024 that may be closed by this pull request

Firewall health check #62

Open

Honigeintopf changed the title ~~Firewall health check~~ Firewall delete on unhealthy condition Nov 4, 2024

Honigeintopf requested a review from Gerrit91 November 4, 2024 14:33

Fix Refactoring

c6b5758

Gerrit91 changed the title ~~Firewall delete on unhealthy condition~~ Recreate firewall on unhealthy condition Nov 4, 2024

Gerrit91 reviewed Nov 6, 2024

Honigeintopf added 7 commits November 7, 2024 12:18

Finish refactor

2fa826d

Updated allocation timeout to longer than created timeout

47f4029

Check if firewall is creating before setting allocation timeout

21d648c

Updated with seed

4d9affd

update integration test

0262546

Adjust test to not use retry on conflict

fe0994c

Merge branch 'main' into firewall-health-check

3c98792

vknabel added this to Development Jun 5, 2025

github-project-automation bot moved this to Review in Development Jun 5, 2025

Gerrit91 removed the status in Development Jun 13, 2025

Gerrit91 moved this to Upcoming in Development Oct 20, 2025

Merge branch 'main' into firewall-health-check

41371c9

majst01 reviewed Oct 28, 2025

majst01 reviewed Oct 28, 2025

Honigeintopf and others added 2 commits January 22, 2026 09:32

Update integration/integration_test.go

aec1033

Co-authored-by: Stefan Majer <stefan.majer@f-i-ts.de>

check for allocation timeout set

15bdf7b

Honigeintopf requested a review from Gerrit91 January 22, 2026 12:27

Merge branch 'main' into firewall-health-check

0510288

Gerrit91 reviewed Jan 23, 2026

Honigeintopf and others added 3 commits January 23, 2026 12:37

Update controllers/set/status.go

8a4f4dc

Co-authored-by: Gerrit <Gerrit91@users.noreply.github.com>

Update controllers/set/status.go

31b364e

Co-authored-by: Gerrit <Gerrit91@users.noreply.github.com>

Update controllers/set/status.go

6e4d69c

Co-authored-by: Gerrit <Gerrit91@users.noreply.github.com>

Gerrit91 moved this from Upcoming to In Progress in Development Jan 26, 2026

Gerrit91 reviewed Jan 27, 2026

Gerrit91 mentioned this pull request Jan 27, 2026

Allow configuration of firewall create and health timeout metal-stack/gardener-extension-provider-metal#487

Open

Apply suggestions from code review

3d92644

Co-authored-by: Gerrit <Gerrit91@users.noreply.github.com>

Honigeintopf added 3 commits February 4, 2026 10:19

set seed reconcile time

1323e3f

remove annotation of fw to set reconcile connected but never reconciled.

8472f4e

only apply health timeoput if we actually have a seed connected once

8cf61d4

allow 0s timeout to disable health timeout

851fb18

Gerrit91 reviewed Feb 9, 2026

set health timeout if cond not met and fw phase running

f4574a6

Gerrit91 mentioned this pull request Feb 9, 2026

Update dependencies #83

Draft

Honigeintopf added 2 commits February 10, 2026 11:56

new condition foir fw

f6afb92

use monitor specific conditions

d6f38a2

Gerrit91 reviewed Feb 10, 2026

Update api/v2/types_firewall.go

1e2328f

Co-authored-by: Gerrit <Gerrit91@users.noreply.github.com>

3 participants
