fix: persist mount metadata across master switch #650

Closed
jlon wants to merge 3 commits into CurvineIO:main from jlon:fix/mount-master-switch-loss

Contributor

@jlon jlon commented Feb 10, 2026

Summary

  • flush mount/unmount journal entries before returning success to avoid mount metadata loss during leader switch
  • fix mount update-mode existence check to use curvine mount-path index
  • expose standby master fs in mini cluster test helper for failover assertions
  • add mount regression and coverage tests in curvine-tests (failover, mount manager, mount table)
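The first bullet can be illustrated with a minimal sketch. Everything here (`Journal`, `mount`, `replay`, the text entry format) is hypothetical and is not Curvine's actual journal API; it only shows why a success response should wait for the flush. An entry that is still buffered when the leader changes is invisible to whoever replays the durable journal.

```rust
use std::collections::HashMap;
use std::io::{self, Write};

/// Hypothetical journal: entries are buffered in memory until `flush`.
struct Journal<W: Write> {
    sink: W,              // stands in for durable, shared journal storage
    pending: Vec<String>, // entries not yet durable
}

impl<W: Write> Journal<W> {
    fn new(sink: W) -> Self {
        Journal { sink, pending: Vec::new() }
    }

    fn append(&mut self, entry: &str) {
        self.pending.push(entry.to_string());
    }

    /// Write all pending entries to the sink; clear the buffer only on success.
    fn flush(&mut self) -> io::Result<()> {
        for entry in &self.pending {
            writeln!(self.sink, "{}", entry)?;
        }
        self.sink.flush()?;
        self.pending.clear();
        Ok(())
    }
}

/// Hypothetical mount handler: success is returned only after the flush.
fn mount<W: Write>(journal: &mut Journal<W>, cv_path: &str, ufs_path: &str) -> io::Result<()> {
    journal.append(&format!("mount {} {}", cv_path, ufs_path));
    journal.flush()?; // if this fails, the caller never sees a false success
    Ok(())
}

/// Rebuild a cv_path -> ufs_path table from durable journal bytes,
/// roughly what a standby might do when it takes over.
fn replay(durable: &[u8]) -> HashMap<String, String> {
    let text = String::from_utf8_lossy(durable);
    let mut table = HashMap::new();
    for line in text.lines() {
        let mut parts = line.split_whitespace();
        match (parts.next(), parts.next(), parts.next()) {
            (Some("mount"), Some(cv), Some(ufs)) => {
                table.insert(cv.to_string(), ufs.to_string());
            }
            (Some("umount"), Some(cv), None) => {
                table.remove(cv);
            }
            _ => {} // ignore malformed lines in this sketch
        }
    }
    table
}
```

With a `Vec<u8>` as the sink, an entry that was only appended does not show up in `replay(&journal.sink)`, while one written through `mount` does.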

Verification

  • cargo clippy --all-targets --jobs 2 -- --deny=warnings --allow clippy::uninlined-format-args
  • cargo test -p curvine-tests --test mount_failover_test --test mount_manager_test --test mount_table_test -- --nocapture

Contributor Author

jlon commented Feb 11, 2026

Additional note for this PR update (commit: d340099):

Why this change

After fixing the mount-loss issue, we also need clear diagnostics for MountTable::restore() so that restore-time failures are visible instead of silent.

What was added

File: curvine-server/src/master/mount/mount_table.rs

  • Failed to load mount table from metadata store:
    • mount restore failed: unable to load mount table from metadata store, err=...
  • Restore start with total entries:
    • mount restore started: <N> entries loaded from metadata store
  • Empty-table case:
    • mount restore completed: no entries found
  • Per-entry success:
    • mount restore entry succeeded: mount_id=..., cv_path=..., ufs_path=...
  • Per-entry failure:
    • mount restore entry failed: mount_id=..., cv_path=..., ufs_path=..., err=...
  • Final summary:
    • all success: mount restore completed successfully: restored=..., failed=0
    • partial failure: mount restore completed with errors: restored=..., failed=...

Behavioral scope

  • This commit is observability-only for the restore path.
  • No fail-fast behavior change was introduced in this commit.
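As a rough illustration of the log lifecycle listed above, here is a standalone sketch. It is not the code in mount_table.rs: `MountInfo`, `restore_mounts`, the `restore_one` callback, and the `Vec<String>` log sink are all hypothetical stand-ins (the real code uses `info!` and its own types). It follows the later revision from the review thread, where the empty-table path emits only the completion line, and it keeps going after a per-entry failure since no fail-fast behavior was introduced.

```rust
/// Hypothetical stand-in for a persisted mount entry.
#[derive(Debug, Clone)]
struct MountInfo {
    mount_id: u32,
    cv_path: String,
    ufs_path: String,
}

/// Restore every entry, pushing one log line per lifecycle event into `log`.
/// Returns (restored, failed). A failed entry does not stop the loop.
fn restore_mounts<F>(
    mounts: Vec<MountInfo>,
    mut restore_one: F,
    log: &mut Vec<String>,
) -> (usize, usize)
where
    F: FnMut(&MountInfo) -> Result<(), String>,
{
    let total = mounts.len();
    if total == 0 {
        // Empty table: a single completion line, no start line.
        log.push("mount restore completed: no entries found".to_string());
        return (0, 0);
    }
    log.push(format!(
        "mount restore started: {} entries loaded from metadata store",
        total
    ));

    let (mut restored, mut failed) = (0, 0);
    for mnt in &mounts {
        match restore_one(mnt) {
            Ok(()) => {
                restored += 1;
                log.push(format!(
                    "mount restore entry succeeded: mount_id={}, cv_path={}, ufs_path={}",
                    mnt.mount_id, mnt.cv_path, mnt.ufs_path
                ));
            }
            Err(err) => {
                failed += 1;
                log.push(format!(
                    "mount restore entry failed: mount_id={}, cv_path={}, ufs_path={}, err={}",
                    mnt.mount_id, mnt.cv_path, mnt.ufs_path, err
                ));
            }
        }
    }

    if failed == 0 {
        log.push(format!(
            "mount restore completed successfully: restored={}, failed=0",
            restored
        ));
    } else {
        log.push(format!(
            "mount restore completed with errors: restored={}, failed={}",
            restored, failed
        ));
    }
    (restored, failed)
}
```

Returning the counts alongside the log lines keeps the summary easy to assert on in tests, which is one way the start count and the final `restored`/`failed` numbers can be correlated.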

Validation

  • mount_table_test
  • mount_manager_test
  • mount_failover_test
    All passed.

if total == 0 {
    info!("mount restore completed: no entries found");
    return;
}

Contributor

else {
    info!(
        "mount restore started: {} entries loaded from metadata store",
        total
    );
}

Contributor Author

Keeping this outside else is intentional. We want a clear restore lifecycle log even when metadata load succeeds with zero entries, so operators can distinguish "restore executed and found 0 mounts" from "restore never ran".

Contributor Author

Updated in 0b5a30e. The start log now runs only when total > 0, so the empty-table path emits only a single completion line.

};

let total = mounts.len();
info!(

Contributor

delete

Contributor Author

total is kept to support restore observability and summary correlation. It is used to report how many entries were loaded from metadata before per-entry restore, which helps diagnose partial-restore cases.

Contributor Author

Updated in 0b5a30e. Removed redundant temp vars in the restore loop and simplified logging flow.

for mnt in mounts {
    let mount_id = mnt.mount_id;
    let cv_path = mnt.cv_path.clone();
    let ufs_path = mnt.ufs_path.clone();

Contributor

Delete the three lines above. They are only used for printing logs, so they are unnecessary.

Contributor Author

I am keeping these logs on purpose. They provide low-cost but important startup visibility for master failover: load success/failure, empty-table case, and restore progress are distinct states in production troubleshooting.

Contributor Author

Updated in 0b5a30e. Empty restore now logs once (no entries found) and avoids duplicate-looking startup lines.

@jlon jlon force-pushed the fix/mount-master-switch-loss branch from 0b5a30e to 3fba9dd on February 11, 2026 10:17
@jlon jlon closed this Feb 12, 2026
2 participants