@wshwsh12 wshwsh12 commented Feb 11, 2026

Summary

This PR hardens shard lookup in client-c by validating shards_map hits against the epoch index (shards) before returning a cached shard.

Problem

ShardCacheForOneIndex keeps two indexes:

  • shards_map: keyed by range end key (used by searchCachedShard)
  • shards: keyed by (shard_id, shard_epoch) (used by getRPCContext)

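In a stale-cache sequence, these two indexes can diverge: shards_map can still hold an entry whose (shard_id, shard_epoch) has already been dropped from shards.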
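To make the divergence concrete, here is a minimal sketch of a two-index cache. The types `ShardVerID`, `Shard`, and `TwoIndexCache`, and the helpers `dropEpochOnly` and `hasDanglingRangeEntry`, are hypothetical stand-ins for illustration, not the real client-c definitions. It assumes a drop removes only the epoch entry.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <tuple>

// Hypothetical, simplified stand-ins; not the real client-c types.
struct ShardVerID {
    uint64_t id;
    uint64_t epoch;
    bool operator<(const ShardVerID & rhs) const {
        return std::tie(id, epoch) < std::tie(rhs.id, rhs.epoch);
    }
};

struct Shard {
    ShardVerID ver_id;
    std::string start_key;
    std::string end_key;
};
using ShardPtr = std::shared_ptr<Shard>;

struct TwoIndexCache {
    std::map<std::string, ShardPtr> shards_map; // range end key -> shard
    std::map<ShardVerID, ShardPtr> shards;      // (id, epoch) -> shard

    void insert(const ShardPtr & s) {
        shards_map[s->end_key] = s;
        shards[s->ver_id] = s;
    }

    // Assumption: models a drop that removes only the epoch entry.
    void dropEpochOnly(const ShardVerID & v) { shards.erase(v); }

    // True when some range entry has no matching (id, epoch) entry.
    bool hasDanglingRangeEntry() const {
        for (const auto & kv : shards_map)
            if (shards.find(kv.second->ver_id) == shards.end())
                return true;
        return false;
    }
};
```

After `insert` the two indexes agree. After `dropEpochOnly` the range entry is left dangling, which is the state the trigger path below depends on.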

Detailed Trigger Logic (Before This PR)

A concrete trigger path is:

  1. Fulltext remote read builds tasks via buildCopTaskForFullText.
  2. Task rebuild/retry calls locateKey, which first calls searchCachedShard.
  3. searchCachedShard trusts shards_map directly, so a stale entry can still be returned.
  4. Because locateKey got a cache hit, it does not call loadShardByKey (no meta reload).
  5. Request send then calls getRPCContext, which checks shards by (id, epoch).
  6. If that epoch entry has already been dropped, getRPCContext returns empty and ShardClient throws "not in shard cache".
  7. The retry path catches the exception, backs off, and rebuilds tasks.
  8. Rebuild hits the same stale shards_map entry again, so it still does not reload meta.
  9. This repeats until copNextMaxBackoff is exhausted (effectively retry-until-timeout for this path).

So in this bug path, retries do not refresh metadata; they loop on stale cache hits until timeout.
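The loop can be modeled with a small simulation. This is a sketch under stated assumptions, not the client-c code: `StaleCacheModel` and `sendWithRetry` are hypothetical, the attempt budget stands in for copNextMaxBackoff, the end-key lookup uses `upper_bound`, and the refreshed `(1, 6)` entry is made up for the demo.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <utility>

using VerID = std::pair<uint64_t, uint64_t>; // (shard_id, shard_epoch)

// Hypothetical model of the pre-PR lookup path.
struct StaleCacheModel {
    std::map<std::string, VerID> shards_map; // range end key -> (id, epoch)
    std::set<VerID> shards;                  // live (id, epoch) entries
    int meta_reloads = 0;

    // Pre-PR behavior: any shards_map hit is trusted as-is.
    bool searchCachedShard(const std::string & key, VerID & out) const {
        auto it = shards_map.upper_bound(key); // first end key > key
        if (it == shards_map.end())
            return false;
        out = it->second;
        return true;
    }

    VerID locateKey(const std::string & key) {
        VerID v;
        if (searchCachedShard(key, v))
            return v;          // cache hit: the meta reload is skipped
        ++meta_reloads;        // stands in for loadShardByKey
        v = VerID{1, 6};       // hypothetical refreshed (id, epoch)
        shards_map["z"] = v;   // hypothetical range end key
        shards.insert(v);
        return v;
    }

    bool getRPCContext(const VerID & v) const { return shards.count(v) != 0; }
};

// Returns the attempt that succeeded, or -1 when the budget is exhausted.
int sendWithRetry(StaleCacheModel & cache, const std::string & key, int max_attempts) {
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        VerID v = cache.locateKey(key);
        if (cache.getRPCContext(v))
            return attempt;
        // "not in shard cache": back off, then rebuild tasks and try again.
    }
    return -1;
}
```

With a stale entry in `shards_map` and no matching entry in `shards`, `sendWithRetry` uses its whole budget and `meta_reloads` stays at zero. With an empty cache, the first attempt reloads and succeeds.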

Root Cause

searchCachedShard accepted shards_map entries without checking whether the corresponding (id, epoch) still exists in shards.

Fix

When searchCachedShard hits shards_map, validate that (shard_id, shard_epoch) exists in shards:

  • if yes: return cached shard;
  • if no: treat as cache miss (return nullptr) so locateKey can fall back to meta reload.
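A minimal sketch of the validated lookup follows. It is not the actual patch: `Shard`, `VerID`, and `ShardCacheSketch` are hypothetical simplified types, and the range check assumes bounded ranges with an exclusive end key.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical, simplified types for illustration only.
struct Shard {
    uint64_t id;
    uint64_t epoch;
    std::string start_key;
    std::string end_key;
};
using ShardPtr = std::shared_ptr<Shard>;
using VerID = std::pair<uint64_t, uint64_t>; // (shard_id, shard_epoch)

struct ShardCacheSketch {
    std::map<std::string, ShardPtr> shards_map; // range end key -> shard
    std::map<VerID, ShardPtr> shards;           // (id, epoch) -> shard

    ShardPtr searchCachedShard(const std::string & key) const {
        auto it = shards_map.upper_bound(key); // first end key > key
        if (it == shards_map.end() || key < it->second->start_key)
            return nullptr; // no cached range covers the key
        const ShardPtr & hit = it->second;
        // Validation step: the hit must still exist in the epoch index.
        if (shards.find(VerID{hit->id, hit->epoch}) == shards.end())
            return nullptr; // stale hit: report a miss so the caller reloads meta
        return hit;
    }
};
```

Returning `nullptr` for a stale hit reuses the existing miss path in locateKey, so the caller needs no separate stale-entry handling.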

Tests

Updated repro test in:

  • src/test/shard_cache_test.cc

The test covers:

  • stale/drop sequence,
  • behavior when meta reload is unavailable,
  • recovery after simulating refreshed shard metadata.
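The three scenarios can be sketched against a toy model as below. This does not reproduce src/test/shard_cache_test.cc: `FakeMeta`, `CacheUnderTest`, and `runScenarios` are hypothetical names, and the model only mirrors the validated lookup described in the Fix section.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <utility>

using VerID = std::pair<uint64_t, uint64_t>; // (shard_id, shard_epoch)

// Hypothetical stand-in for the metadata source consulted on a cache miss.
struct FakeMeta {
    bool available;
    std::string end_key;
    VerID ver;
};

struct CacheUnderTest {
    std::map<std::string, VerID> shards_map; // range end key -> (id, epoch)
    std::set<VerID> shards;                  // live (id, epoch) entries
    FakeMeta meta{false, "", VerID{0, 0}};

    void insert(const std::string & end_key, const VerID & v) {
        shards_map[end_key] = v;
        shards.insert(v);
    }
    void dropEpoch(const VerID & v) { shards.erase(v); }

    // Returns true and sets `out` when the key resolves to a usable shard.
    bool locateKey(const std::string & key, VerID & out) {
        auto it = shards_map.upper_bound(key);
        if (it != shards_map.end() && shards.count(it->second) != 0) {
            out = it->second; // validated cache hit
            return true;
        }
        if (!meta.available)
            return false;     // miss, and nothing to reload from
        insert(meta.end_key, meta.ver);
        out = meta.ver;
        return true;
    }
};

// Walks the three listed scenarios; returns true if all of them hold.
bool runScenarios() {
    CacheUnderTest c;
    const VerID old_ver{1, 5};
    const VerID new_ver{1, 6};
    c.insert("m", old_ver);
    c.dropEpoch(old_ver);                            // stale/drop sequence
    VerID got;
    const bool miss_without_meta = !c.locateKey("k", got);
    c.meta = FakeMeta{true, "m", new_ver};           // simulate refreshed metadata
    const bool recovered = c.locateKey("k", got) && got == new_ver;
    return miss_without_meta && recovered;
}
```

Each step checks a state transition rather than a retry count, so the sketch does not depend on backoff timing.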

Related Issue

Signed-off-by: wshwsh12 <793703860@qq.com>
@ti-chi-bot ti-chi-bot bot added the dco-signoff: yes Indicates the PR's author has signed the dco. label Feb 11, 2026
ti-chi-bot bot commented Feb 11, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign pingyu for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.


coderabbitai bot commented Feb 11, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

@ti-chi-bot ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 11, 2026
@wshwsh12 wshwsh12 changed the title kv: relax shard-cache consistency check to epoch presence kv: validate shard-map hits against epoch index to avoid stale retries Feb 11, 2026