Add RabbitMQ version upgrade and queue type migration basic support #526

Open
lmiccini wants to merge 10 commits into openstack-k8s-operators:main from lmiccini:rmq4upg

@lmiccini lmiccini commented Jan 27, 2026

Implement an automatic storage wipe for RabbitMQ upgrades that involve
incompatible data format changes (major or minor version changes). The
upgrade process safely handles:

- Version upgrades (e.g., 3.9 → 4.0) and downgrades (4.0 → 3.9)
- Queue type migration from Mirrored to Quorum queues
- Automatic Quorum queue enforcement when upgrading to RabbitMQ 4.x

Storage wipe process:
1. Delete RabbitMQCluster to terminate all pods
2. Wait for pod termination to prevent data corruption
3. Patch PV reclaim policies to "Delete" for automatic cleanup
4. Delete PVCs (triggers Kubernetes to wipe underlying storage)
5. Verify PV cleanup before cluster recreation

Version tracking uses Status.CurrentVersion (controller-managed) and
the rabbitmq-version label (user-specified target). Patch version
updates (3.13.0 → 3.13.1) skip storage wipe and perform in-place upgrades.
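
The rule described above (wipe on a major or minor change, in-place upgrade on a patch change) can be sketched as a version comparison. This is an illustration only: `storageWipeNeeded` and `majorMinor` are hypothetical names, and the PR's own `requiresStorageWipe` may parse versions differently.

```go
package main

import (
	"strconv"
	"strings"
)

// majorMinor extracts the major and minor components from a version string
// such as "3.13.1" or "4.0". A missing minor component is treated as 0.
func majorMinor(v string) (int, int, error) {
	parts := strings.SplitN(strings.TrimSpace(v), ".", 3)
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0, err
	}
	minor := 0
	if len(parts) > 1 {
		if minor, err = strconv.Atoi(parts[1]); err != nil {
			return 0, 0, err
		}
	}
	return major, minor, nil
}

// storageWipeNeeded reports whether moving from current to target changes the
// major or minor version. Patch-only changes return false. Unparseable input
// is returned as an error instead of silently choosing a destructive default.
func storageWipeNeeded(current, target string) (bool, error) {
	cMaj, cMin, err := majorMinor(current)
	if err != nil {
		return false, err
	}
	tMaj, tMin, err := majorMinor(target)
	if err != nil {
		return false, err
	}
	return cMaj != tMaj || cMin != tMin, nil
}
```

Under this sketch, 3.9 to 4.0 and 4.0 to 3.9 both require a wipe, while 3.13.0 to 3.13.1 does not.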

Jira: https://issues.redhat.com/browse/OSPRH-22219

Depends-On: openstack-k8s-operators/openstack-operator#1805

openshift-ci bot commented Jan 27, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lmiccini

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@lmiccini lmiccini requested review from stuggi and removed request for viroel January 27, 2026 06:44
@lmiccini lmiccini force-pushed the rmq4upg branch 5 times, most recently from dfeb678 to 9be78b8 Compare January 27, 2026 09:32
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/d9308d8b92f146ef93533d3002d49ade

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 32m 04s
podified-multinode-edpm-deployment-crc FAILURE in 1h 01m 24s
cifmw-crc-podified-edpm-baremetal FAILURE in 1h 15m 20s

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/645d32e739b74209b129928de9e5af66

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 27m 41s
podified-multinode-edpm-deployment-crc FAILURE in 1h 03m 01s
cifmw-crc-podified-edpm-baremetal FAILURE in 1h 12m 32s

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/203560e945fd459095e9862a1dfebfa2

openstack-k8s-operators-content-provider NODE_FAILURE Node request 100-0008160425 failed in 0s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/0565bc902cc540cbbb4a71ec46c402cf

openstack-k8s-operators-content-provider NODE_FAILURE Node request 100-0008160510 failed in 0s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/1940f844c9064cc9ab906819f6a331ab

openstack-k8s-operators-content-provider NODE_FAILURE Node request 100-0008160517 failed in 0s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/f45533ae9ce047e08720f67fdfa9bb2d

openstack-k8s-operators-content-provider NODE_FAILURE Node request 100-0008160529 failed in 0s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider

@lmiccini lmiccini force-pushed the rmq4upg branch 7 times, most recently from 451d65d to 397e29b Compare February 10, 2026 16:38
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/baa1ae88035349b8b7136b862bdae9c9

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 32m 31s
podified-multinode-edpm-deployment-crc FAILURE in 1h 01m 16s
cifmw-crc-podified-edpm-baremetal FAILURE in 1h 15m 49s

@softwarefactory-project-zuul

This change depends on a change that failed to merge.

Change openstack-k8s-operators/openstack-operator#1805 is needed.

This commit fixes the webhook validation that was blocking automatic
queue type migration when upgrading to RabbitMQ 4.0.

Changes:

1. Removed strict validation blocking queueType: Mirrored on RabbitMQ 4.x
   - The validation was running before Default() function
   - This prevented the automatic override from Mirrored → Quorum
   - Default() and controller logic handle the enforcement instead

2. Enhanced DefaultForUpdate() to override Mirrored → Quorum
   - Previously only set default when queueType was nil/empty
   - Now also overrides when queueType is explicitly set to Mirrored
   - Only applies when target-version annotation is 4.0+

3. Updated test expectations
   - Changed test to verify automatic override instead of rejection
   - Test now confirms webhook overrides Mirrored → Quorum on 4.0

This allows OpenStackControlPlane to update RabbitMQ instances without
validation errors, while still ensuring Quorum queues are enforced on
RabbitMQ 4.0 through automatic webhook defaulting.
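
The defaulting behaviour described in this commit can be sketched roughly as follows. The function names and the plain-string signature are assumptions for illustration; the actual `DefaultForUpdate()` operates on the CR types and reads the target-version annotation. Leaving an empty queue type untouched below 4.0 is also an assumption of this sketch.

```go
package main

import (
	"strconv"
	"strings"
)

const (
	queueTypeMirrored = "Mirrored"
	queueTypeQuorum   = "Quorum"
)

// isRabbitMQ4OrLater reports whether the major component of version is >= 4.
// An empty or unparseable version returns false, so nothing is overridden.
func isRabbitMQ4OrLater(version string) bool {
	major, err := strconv.Atoi(strings.SplitN(strings.TrimSpace(version), ".", 2)[0])
	return err == nil && major >= 4
}

// defaultQueueType returns the queue type to persist after defaulting.
// For a 4.0+ target, an unset or Mirrored queue type becomes Quorum;
// every other combination is returned unchanged.
func defaultQueueType(queueType, targetVersion string) string {
	if !isRabbitMQ4OrLater(targetVersion) {
		return queueType
	}
	if queueType == "" || queueType == queueTypeMirrored {
		return queueTypeQuorum
	}
	return queueType
}
```

Because the override happens during defaulting rather than validation, a request that still carries `queueType: Mirrored` is corrected instead of rejected.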

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/046d2c93850e458a9f6e1896319ffe95

openstack-k8s-operators-content-provider FAILURE in 9m 55s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/05dc903ed05f4bf8a2dfb29b1ed22e14

openstack-k8s-operators-content-provider FAILURE in 12m 12s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/d04167cc63d941689f1fc0d5d2235734

openstack-k8s-operators-content-provider FAILURE in 10m 53s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/468b712d2bd546b39686e571cb38cb7e

openstack-k8s-operators-content-provider FAILURE in 13m 26s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider

@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/485010bad94d4e8fb154b5c02bd35832

openstack-k8s-operators-content-provider FAILURE in 9m 40s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider

@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/openstack-operator for 1805,cf85737c125747b561114c5f9c89485c2da5a240

@lmiccini

recheck

Prioritize Spec.QueueType over Status.QueueType when determining the
quorum queue setting for RabbitMQ transport URLs. This fixes a race
condition where TransportURL reconciles before Status.QueueType is set
during RabbitMQ upgrades/recreations (e.g., 3.9→4.0 with storage wipe).

Previously, TransportURL would default to quorum=false during the ~13
second window between cluster creation and Status.QueueType update,
causing services to create classic queues on RabbitMQ 4.0 clusters
configured for quorum queues. When services later reconnected with
quorum=true, they would fail with PRECONDITION_FAILED errors.

Spec.QueueType is set immediately when the CR is created and represents
the configured queue type, making it the reliable source of truth during
cluster initialization.
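
A minimal sketch of the precedence this commit describes, assuming plain string fields and hypothetical helper names (`effectiveQueueType`, `useQuorum`); the real TransportURL controller reads these values from the RabbitMQ CR:

```go
package main

// effectiveQueueType prefers the spec value, which is available as soon as
// the CR is created, and falls back to the status value only when the spec
// does not set a queue type.
func effectiveQueueType(specQueueType, statusQueueType string) string {
	if specQueueType != "" {
		return specQueueType
	}
	return statusQueueType
}

// useQuorum reports whether the transport URL should request quorum queues.
func useQuorum(specQueueType, statusQueueType string) bool {
	return effectiveQueueType(specQueueType, statusQueueType) == "Quorum"
}
```

With this ordering, a freshly recreated cluster whose status is still empty yields `quorum=true` as long as the spec already says `Quorum`.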

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

@lmiccini

recheck

When reconciling an existing RabbitMQ 3.9 cluster with a new operator
version that tracks Status.CurrentVersion, the initialization logic
incorrectly prioritized the target-version annotation over detecting
the existing cluster.

Bug scenario:
1. RabbitMQ 3.9 cluster exists (old operator, no CurrentVersion tracking)
2. New operator starts reconciling
3. openstack-operator sets target-version: "4.0" annotation
4. Controller sees annotation and initializes CurrentVersion = "4.0"
5. requiresStorageWipe("4.0", "4.0") returns FALSE
6. Storage wipe is SKIPPED
7. Cluster updates to RabbitMQ 4.0 image with old 3.9 storage
8. RabbitMQ 4.0 fails to boot: "classic_mirrored_queue_version: required feature flag not enabled!"

Root cause:
The initialization logic at lines 203-207 checked for the annotation
FIRST and used it as initialVersion. It only checked for existing
clusters when NO annotation was present.

Fix:
Changed priority order to ALWAYS check for existing RabbitMQCluster
first, regardless of annotation presence:

1. Check if RabbitMQCluster exists
   - If exists → initialize CurrentVersion = "3.9" (backwards compat)
   - Triggers storage wipe for 3.9→4.0 upgrade
2. If cluster doesn't exist (new deployment)
   - Use target-version annotation if present
   - Otherwise use DefaultRabbitMQVersion (4.0)

The annotation is the TARGET version (where we want to go), not the
CURRENT version (where we are). We should only use it as initialVersion
for brand new deployments without an existing cluster.
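
The corrected priority order can be sketched as a small pure function. The function name and constants below are hypothetical, and the "3.9" fallback for pre-existing clusters mirrors the backwards-compatibility assumption stated in the fix above:

```go
package main

const (
	// legacyRabbitMQVersion is assumed for clusters created before
	// Status.CurrentVersion tracking existed.
	legacyRabbitMQVersion = "3.9"
	// defaultRabbitMQVersion stands in for DefaultRabbitMQVersion.
	defaultRabbitMQVersion = "4.0"
)

// initialCurrentVersion picks the value used to initialize CurrentVersion.
// An existing cluster always wins over the target-version annotation,
// because the annotation describes where to go, not what is deployed.
func initialCurrentVersion(clusterExists bool, targetVersionAnnotation string) string {
	if clusterExists {
		return legacyRabbitMQVersion
	}
	if targetVersionAnnotation != "" {
		return targetVersionAnnotation
	}
	return defaultRabbitMQVersion
}
```

In the bug scenario above, `initialCurrentVersion(true, "4.0")` yields "3.9", so the later comparison against the "4.0" target reports a version change and the storage wipe runs.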

Added test coverage:
- Verifies existing cluster detection works correctly
- Confirms CurrentVersion initializes to "3.9" when cluster exists
- Validates storage wipe is triggered for 3.9→4.0 upgrade
- Ensures upgrade completes successfully with clean storage

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@softwarefactory-project-zuul
Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/e96a7b1bf9344b76a7b7ad7b8ed4beb3

openstack-k8s-operators-content-provider FAILURE in 9m 04s
⚠️ podified-multinode-edpm-deployment-crc SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider
⚠️ cifmw-crc-podified-edpm-baremetal SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider

…g 3.9→4.0

This commit fixes a critical bug where CurrentVersion was updated too early
during storage wipe upgrades, making it impossible to observe the
intermediate upgrade state and causing test failures.

Problem:
When a RabbitMQ cluster required a storage wipe for upgrade (e.g., 3.9 → 4.0),
the controller was updating CurrentVersion to the target version immediately
after storage wipe completion, before the new cluster was even created. This
caused:
1. CurrentVersion to not reflect the actually deployed version during upgrade
2. Tests to be unable to observe the upgrade process
3. Race conditions where the old cluster was marked as "ready" instead of the
   new cluster

Solution:
1. Introduced new "WaitingForCluster" upgrade phase to track post-wipe state
2. Deferred CurrentVersion update until the new cluster is actually ready
3. Added 200ms delay after storage wipe before cluster recreation for
   observability
4. Updated cluster ready logic to detect WaitingForCluster phase and update
   CurrentVersion only when the new cluster is confirmed ready

Controller Changes:
- After storage wipe completes, set UpgradePhase = "WaitingForCluster"
- Keep CurrentVersion at old version (e.g., "3.9") until new cluster is ready
- When cluster becomes ready with UpgradePhase = "WaitingForCluster":
  - Update CurrentVersion to target version (e.g., "4.0")
  - Clear UpgradePhase
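
The phase handling described above can be sketched as two transitions on a status struct. The struct and function names are hypothetical stand-ins for the CR status and controller code; only the field names `CurrentVersion` and `UpgradePhase` and the phase value come from this commit message.

```go
package main

// upgradeStatus is a simplified stand-in for the relevant CR status fields.
type upgradeStatus struct {
	CurrentVersion string
	UpgradePhase   string
}

const phaseWaitingForCluster = "WaitingForCluster"

// onStorageWipeComplete records that the wipe finished. CurrentVersion is
// deliberately left at the old version until the new cluster is ready.
func onStorageWipeComplete(s *upgradeStatus) {
	s.UpgradePhase = phaseWaitingForCluster
}

// onClusterReady finalizes the upgrade only when a wipe was pending.
// It returns true when CurrentVersion was advanced to targetVersion.
func onClusterReady(s *upgradeStatus, targetVersion string) bool {
	if s.UpgradePhase != phaseWaitingForCluster {
		return false
	}
	s.CurrentVersion = targetVersion
	s.UpgradePhase = ""
	return true
}
```

Between the two calls the status reads `CurrentVersion: "3.9"` with `UpgradePhase: "WaitingForCluster"`, which is the intermediate state the updated tests wait for.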

Test Fixes:
Updated three tests to wait for WaitingForCluster phase before simulating
the new cluster as ready:
- "should require storage wipe and update Status.CurrentVersion after upgrade"
- "should require storage wipe for downgrade"
- "should automatically migrate to Quorum queues and wipe cluster"

This ensures CurrentVersion accurately represents the deployed RabbitMQ version
throughout the upgrade lifecycle, making upgrades observable and testable.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>