
Conversation

@tsivaprasad
Contributor

[DRAFT]

@tsivaprasad tsivaprasad marked this pull request as draft February 3, 2026 18:41
coderabbitai bot commented Feb 3, 2026

📝 Walkthrough

Added partial recovery documentation and implemented automatic cleanup of orphaned database instances during host failure recovery. A new CleanupInstance activity removes stale instance records and is integrated into host removal event handling. The PostgreSQL service gains improved timeout error detection for better handling of failure scenarios.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Partial Recovery Documentation**<br>`changes/unreleased/Added-20260204-000115.yaml`, `docs/disaster-recovery/partial-recovery.md` | Added a comprehensive partial recovery guide documenting a five-phase procedure for recovering from partial cluster failures, including removal, verification, restoration, and rejoining steps, with a troubleshooting section. |
| **CleanupInstance Activity**<br>`server/internal/workflows/activities/cleanup_instance.go`, `server/internal/workflows/activities/activities.go` | Introduced a new activity to clean up orphaned database instances, with input/output types and execution methods; registered the activity in the activities registry. |
| **Error Handling & Recovery Workflow**<br>`server/internal/orchestrator/swarm/postgres_service.go`, `server/internal/workflows/common.go` | Enhanced timeout error detection in the PostgreSQL service and integrated instance cleanup logic into host removal event handling for Delete operations. |

Poem

🐰 When hosts do fail and troubles brew,
Our cleanup tasks know what to do,
With phases five and timeout care,
We mend the clusters everywhere,
From chaos blooms recovery's grace! 🌱

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description is incomplete, containing only '[DRAFT]' without the required template sections (Summary, Changes, Testing, Checklist, Notes for Reviewers). | Complete the PR description by filling in all required template sections: provide a concise summary, list high-level changes, document testing steps, check all checklist items, and add notes for reviewers if applicable. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main changes: adding documentation for partial recovery with guides for host failure recovery and orphaned instance cleanup. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch PLAT-313-recovery-approach-when-swarm-quorum-is-lost

Important

Action Needed: IP Allowlist Update

If your organization protects your Git platform with IP whitelisting, please add the new CodeRabbit IP address to your allowlist:

  • 136.113.208.247/32 (new)
  • 34.170.211.100/32
  • 35.222.179.152/32

Reviews will stop working after February 8, 2026 if the new IP is not added to your allowlist.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@docs/disaster-recovery/partial-recovery.md`:
- Around line 179-183: Replace the placeholder comments with concrete psql
commands that show how to insert test data on the primary and verify it on the
replica; for example, add a psql INSERT on n1 targeting the example database and
test_table (e.g., INSERT INTO test_table (id, value) VALUES (...)) and a psql
SELECT on n2 to confirm the replicated row (e.g., SELECT * FROM test_table WHERE
value = 'recovery_test'); ensure you include host/port/user flags (-h, -p, -U)
and use the same database name (example) and table name (test_table) as shown in
the suggested improvement so readers can run the steps end-to-end.
- Around line 413-416: Replace the placeholder comments "# Insert on n1" and "#
Verify on n3" with concrete psql commands: on the primary node run a psql
command to INSERT a test row into the example database (e.g., use psql -h host-1
-p 5432 -U admin -d example -c "INSERT INTO test_table (id, value) VALUES
(...);"), and on the restored node run a psql SELECT to verify the row exists
(e.g., psql -h host-3 -p 5432 -U admin -d example -c "SELECT * FROM test_table
WHERE value = '...';"); ensure the SQL matches the test_table schema and uses a
unique value (e.g., 'full_recovery_test') so the verification step is
unambiguous.

In `@server/internal/orchestrator/swarm/postgres_service.go`:
- Around line 171-185: The timeout branch currently falls through and lets
ServiceRemove proceed even though scale-down may not have completed; update the
isTimeoutError(err) case in the same function (the scale-down/ServiceRemove
sequence) to return a descriptive error instead of continuing, e.g., return an
error indicating the scale-down timed out (wrap the original err), so callers
must handle the timeout (until the TODO RemovingHosts/context distinction is
implemented); reference isTimeoutError, the scale-down logic, and the subsequent
ServiceRemove call when making the change.
🧹 Nitpick comments (1)

server/internal/workflows/activities/cleanup_instance.go (1)

27-31: MaxAttempts: 1 means no retries on transient failures.

With MaxAttempts: 1, any transient failure (network blip, etcd leader election) will fail the cleanup permanently. Consider allowing a small number of retries with backoff for resilience.

Optional: Allow limited retries

```diff
 	options := workflow.ActivityOptions{
 		Queue: utils.HostQueue(a.Config.HostID),
 		RetryOptions: workflow.RetryOptions{
-			MaxAttempts: 1,
+			MaxAttempts: 3,
 		},
 	}
```

Comment on lines +179 to +183
```bash
# Connect to the database on n1
# Insert test data
# Query on n2 to verify replication
```

⚠️ Potential issue | 🟡 Minor

Placeholder comments should be replaced with actual commands or removed.

The replication verification step contains only placeholder comments without actual commands. Users following this guide won't have actionable instructions.

Consider providing specific psql commands or linking to relevant documentation.

Suggested improvement

````diff
 ```bash
-# Connect to the database on n1
-# Insert test data
-# Query on n2 to verify replication
+# Connect to n1 and insert test data
+psql -h host-1 -p 5432 -U admin -d example -c \
+    "INSERT INTO test_table (id, value) VALUES (1, 'recovery_test');"
+
+# Verify replication on n2
+psql -h host-2 -p 5432 -U admin -d example -c \
+    "SELECT * FROM test_table WHERE value = 'recovery_test';"
````

🧰 Tools

🪛 markdownlint-cli2 (0.20.0)

[warning] 179-179: Code block style
Expected: indented; Actual: fenced

(MD046, code-block-style)


Comment on lines +413 to +416
```bash
# Insert on n1
# Verify on n3
```

⚠️ Potential issue | 🟡 Minor

Placeholder comments in verification step.

Similar to Step 2.3, this section has placeholder comments instead of actionable commands.

Suggested improvement

````diff
 ```bash
-# Insert on n1
-# Verify on n3
+# Insert on n1
+psql -h host-1 -p 5432 -U admin -d example -c \
+    "INSERT INTO test_table (id, value) VALUES (2, 'full_recovery_test');"
+
+# Verify on restored n3
+psql -h host-3 -p 5432 -U admin -d example -c \
+    "SELECT * FROM test_table WHERE value = 'full_recovery_test';"
````

🧰 Tools

🪛 markdownlint-cli2 (0.20.0)

[warning] 413-413: Code block style
Expected: indented; Actual: fenced

(MD046, code-block-style)


Comment on lines +171 to 185

```diff
 switch {
 case errors.Is(err, docker.ErrNotFound):
 	// Service is already deleted.
 	return nil
-} else if err != nil {
+case isTimeoutError(err):
+	// NOTE: This assumes timeout always indicates an unreachable host.
+	// This is currently intended for remove-host --force operations.
+	// If timeout occurs during normal operations (e.g., temporary network
+	// issues), this could incorrectly proceed with removal.
+	//
+	// TODO: Add RemovingHosts to resource.Context to safely distinguish
+	// between force-removal scenarios and transient network issues.
+case err != nil:
 	return fmt.Errorf("failed to scale down postgres service before removal: %w", err)
 }
```

⚠️ Potential issue | 🟠 Major

Timeout case falls through to ServiceRemove without guaranteed scale-down completion.

When a timeout occurs, the code proceeds to ServiceRemove without confirmation that containers have stopped. The comment on lines 161-164 explicitly states the scale-down is performed "to guarantee that containers have stopped before this function returns." The timeout case bypasses this guarantee.

While the TODO acknowledges this is intended for --force operations, without the proposed RemovingHosts context field, there's no way to distinguish force-removal from transient network issues. This could lead to:

  1. Attempting ServiceRemove while containers are still running
  2. Resource cleanup errors downstream

Consider returning an error for timeout cases until the RemovingHosts context is implemented, or explicitly document that callers must handle potential race conditions.

Alternative: Return error until context distinction is available

```diff
 	case isTimeoutError(err):
-		// NOTE: This assumes timeout always indicates an unreachable host.
-		// This is currently intended for remove-host --force operations.
-		// If timeout occurs during normal operations (e.g., temporary network
-		// issues), this could incorrectly proceed with removal.
-		//
-		// TODO: Add RemovingHosts to resource.Context to safely distinguish
-		// between force-removal scenarios and transient network issues.
+		// TODO: Once RemovingHosts is added to resource.Context, allow
+		// proceeding for force-removal scenarios only.
+		return fmt.Errorf("timeout scaling down postgres service; host may be unreachable: %w", err)
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```suggestion
	switch {
	case errors.Is(err, docker.ErrNotFound):
		// Service is already deleted.
		return nil
	case isTimeoutError(err):
		// TODO: Once RemovingHosts is added to resource.Context, allow
		// proceeding for force-removal scenarios only.
		return fmt.Errorf("timeout scaling down postgres service; host may be unreachable: %w", err)
	case err != nil:
		return fmt.Errorf("failed to scale down postgres service before removal: %w", err)
	}
```

> - A host becomes unreachable (hardware failure, network isolation, VM crash, etc.)
> - The Docker Swarm cluster is still functional on remaining hosts
> - The Control Plane etcd cluster maintains quorum on remaining hosts
> - You need to remove the failed host and continue operations

Member

I don't think we need to list this. It's more of a statement than a requirement, because they can't modify their database until they complete this process.

Suggested change

```diff
-- You need to remove the failed host and continue operations
```


> Partial recovery is appropriate when:
>
> - A host becomes unreachable (hardware failure, network isolation, VM crash, etc.)

Member

We should be clear that this is for when you've permanently lost a host. Network partitions and VM crashes are typically temporary, and the host will be reachable again once the network is restored or the VM is started again.


> Before starting the recovery process, ensure you have:
>
> - Access to the Control Plane API (via `cp-req` CLI or direct API calls)

Member

Please remove the references to cp-req in this document, including the code examples. That's something we have for development, but it wouldn't be usable by a customer without some modifications. It's fine to just have curl examples for now.
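
For example, the guide could show curl equivalents along these lines. This is a minimal sketch assuming the Control Plane API is reachable at `localhost:8080`; the hosts path and `force` query parameter are illustrative guesses, while `/v1/databases` is the endpoint named later in this review.

```bash
# Assumed base URL; adjust host, port, and auth for your deployment.
CP_API="http://localhost:8080"

# List databases (the /v1/databases endpoint is mentioned below).
curl -s "$CP_API/v1/databases"

# Force-remove the failed host. The path and query parameter are
# hypothetical stand-ins for `cp-req remove-host host-3 --force`.
curl -s -X DELETE "$CP_API/v1/hosts/host-3?force=true"
```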

> - Access to the Control Plane API (via `cp-req` CLI or direct API calls)
> - SSH access to healthy cluster hosts for Docker Swarm management
> - The failed host identified by its host ID (e.g., `host-3`)
> - Knowledge of which databases have nodes on the failed host

Member

Non-blocking: we could say that you can get this information from the /v1/databases endpoint. The response will include every instance, and each instance will have a host_id.
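
For example, the prerequisite could be made actionable with a one-liner like this sketch. That instances carry a `host_id` is stated above; the surrounding `.databases[].instances[]` JSON shape is an assumption.

```bash
# List every instance that lived on the failed host. The host_id field is
# described in this review comment; the JSON shape around it is assumed.
curl -s "$CP_API/v1/databases" \
  | jq '.databases[].instances[] | select(.host_id == "host-3")'
```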


> ### Step 1.1: Force Remove the Host from Control Plane
>
> When a host is unreachable, use the `--force` flag to remove it from the Control Plane cluster. This operation will:

Member

Suggested change

```diff
-When a host is unreachable, use the `--force` flag to remove it from the Control Plane cluster. This operation will:
+When a host is unrecoverable, use the `--force` flag to remove it from the Control Plane cluster. This operation will:
```

> ```sh
> cp1-req update-database example < full-spec.json
> ```

> By default, the new node syncs data from n1. You can specify a different source using the `source_node` property:

Member

It's not accurate to say it always defaults to n1. For example, that wouldn't be the case if you were restoring n1. The details aren't super important, so I'd just say that the Control Plane will automatically choose a source node if you don't specify one, and they can use the source_node property if they want to use a specific source node.
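
If the doc keeps a concrete example, it could be hedged along these lines. Only the `source_node` property name comes from this thread; the `.nodes.n3` path below is an assumed spec layout.

```bash
# Hypothetical: pin the sync source for the rebuilt node instead of letting
# the Control Plane choose one. The .nodes.n3 path is an assumed layout for
# full-spec.json; only the source_node property is confirmed here.
jq '.nodes.n3.source_node = "n2"' full-spec.json > full-spec-patched.json
```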

> ```sh
> cp1-req remove-host host-3 --force
> ```


Member

Removing a host is an asynchronous operation. The response here will contain a task ID for the overall removal process. If the host is running any databases, this response will also include task IDs for each database update it performs. It's worth mentioning how they can monitor the progress of these operations here.
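
For instance, the doc could close the loop with a short polling sketch like this. The `/v1/tasks` path and `.status` field are hypothetical; only the task IDs in the removal response are described in this comment.

```bash
# Hypothetical: poll the removal task until it finishes. The /v1/tasks
# endpoint and .status field are illustrative, not confirmed API details;
# the task ID comes from the remove-host response described above.
TASK_ID="<task-id-from-remove-host-response>"
watch -n 5 "curl -s $CP_API/v1/tasks/$TASK_ID | jq '.status'"
```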


> ## Troubleshooting
>
> ### "Peer URLs already exists" Error During Join

Member

What were the conditions where you ran into this? This shouldn't happen unless something else failed, and I would prefer to address that failure rather than advise users to do this, because this is potentially dangerous.

> ```diff
>  	// Service is already deleted.
>  	return nil
> -} else if err != nil {
> +case isTimeoutError(err):
> ```

Member

Your NOTE provides a good reason not to do this, so let's see if we can come up with a better solution. ServiceScale is calling a few API operations under the hood. We should understand where/why we're timing out and see if we can improve the conditions that we're watching for, or perhaps return an appropriate sentinel error as we do with ErrNotFound.

> ```go
> // If this is an instance resource, we also need to clean up
> // the instance record from etcd since the normal Delete()
> // lifecycle method couldn't run on the removed host.
> if event.Resource.Identifier.Type == database.ResourceTypeInstance {
> ```

Member

I'm OK with this workaround, but please add a TODO comment above this that we should remove this in PLAT-398.
