kv: validate shard-map hits against epoch index to avoid stale retries #228
+102
−3
Summary
This PR hardens shard lookup in `client-c` by validating `shards_map` hits against the epoch index (`shards`) before returning a cached shard.
Problem
`ShardCacheForOneIndex` keeps two indexes:
- `shards_map`: keyed by range end key (used by `searchCachedShard`)
- `shards`: keyed by `(shard_id, shard_epoch)` (used by `getRPCContext`)
In a stale-cache sequence, these two indexes can diverge.
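
To make the divergence concrete, here is a minimal sketch of a two-index cache. The names `Shard`, `TwoIndexCache`, `dropFromEpochIndexOnly`, and `indexesDiverged` are hypothetical stand-ins for illustration; the real `ShardCacheForOneIndex` in `client-c` may be laid out differently.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins; the real types in client-c may differ.
struct Shard {
    uint64_t id;
    uint64_t epoch;
    std::string start_key;
    std::string end_key;
};
using ShardPtr = std::shared_ptr<Shard>;
using ShardVerID = std::pair<uint64_t, uint64_t>; // (shard_id, shard_epoch)

struct TwoIndexCache {
    std::map<std::string, ShardPtr> shards_map; // keyed by range end key
    std::map<ShardVerID, ShardPtr> shards;      // keyed by (shard_id, shard_epoch)

    // Inserts into both indexes, keeping them consistent.
    void insert(const ShardPtr & s) {
        assert(s != nullptr);
        shards_map[s->end_key] = s;
        shards[ShardVerID{s->id, s->epoch}] = s;
    }

    // Assumed stale-cache step: only the epoch index is updated,
    // so the shards_map entry is left behind.
    void dropFromEpochIndexOnly(const ShardVerID & ver) { shards.erase(ver); }
};

// True if some shards_map entry has no matching (id, epoch) in shards.
bool indexesDiverged(const TwoIndexCache & cache) {
    for (const auto & kv : cache.shards_map) {
        const ShardPtr & s = kv.second;
        if (cache.shards.count(ShardVerID{s->id, s->epoch}) == 0)
            return true;
    }
    return false;
}
```

After `insert`, both indexes agree; after `dropFromEpochIndexOnly`, `indexesDiverged` reports the kind of inconsistency this PR guards against.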
Detailed Trigger Logic (Before This PR)
A concrete trigger path is:
1. The path starts in `buildCopTaskForFullText`.
2. It reaches `locateKey`, which first calls `searchCachedShard`.
3. `searchCachedShard` trusts `shards_map` directly, so a stale entry can still be returned.
4. Because `locateKey` got a cache hit, it does not call `loadShardByKey` (no meta reload).
5. The request then goes to `getRPCContext`, which checks `shards` by `(id, epoch)`.
6. `getRPCContext` returns empty and `ShardClient` throws "not in shard cache".
7. The retry hits the same stale `shards_map` entry again, so it still does not reload meta.
8. This repeats until `copNextMaxBackoff` is exhausted (effectively retry-until-timeout for this path).
So in this bug path, retries do not refresh metadata; they loop on stale cache hits until timeout.
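
The loop can be modeled in a few lines. This is a simplified sketch of the pre-fix behavior, not the actual `client-c` code: the function names mirror the ones in this description, but their bodies, the `StaleCacheModel` type, `runRetries`, and the attempt counter standing in for the `copNextMaxBackoff` budget are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical, simplified model of the pre-fix lookup path.
struct Shard { uint64_t id; uint64_t epoch; std::string end_key; };
using ShardVerID = std::pair<uint64_t, uint64_t>; // (shard_id, shard_epoch)

struct StaleCacheModel {
    std::map<std::string, Shard> shards_map; // keyed by range end key
    std::map<ShardVerID, Shard> shards;      // keyed by (shard_id, shard_epoch)
    int meta_reloads = 0;                    // counts simulated loadShardByKey calls

    // Pre-fix behavior: trust shards_map without consulting shards.
    const Shard * searchCachedShard(const std::string & key) const {
        auto it = shards_map.upper_bound(key); // first entry with end key > key
        return it == shards_map.end() ? nullptr : &it->second;
    }

    // A meta reload happens only on a cache miss.
    ShardVerID locateKey(const std::string & key) {
        if (const Shard * s = searchCachedShard(key))
            return ShardVerID{s->id, s->epoch};
        ++meta_reloads; // stands in for loadShardByKey
        return ShardVerID{0, 0};
    }

    // Succeeds only if (id, epoch) is present in the epoch index.
    bool getRPCContext(const ShardVerID & ver) const { return shards.count(ver) != 0; }
};

// Returns the number of failed attempts before the retry budget runs out.
int runRetries(StaleCacheModel & cache, const std::string & key, int max_attempts) {
    assert(max_attempts >= 0);
    int failures = 0;
    for (int i = 0; i < max_attempts; ++i) {
        ShardVerID ver = cache.locateKey(key);
        if (cache.getRPCContext(ver))
            break;
        ++failures; // "not in shard cache"; the real client would back off here
    }
    return failures;
}
```

With an entry present in `shards_map` but absent from `shards`, every attempt fails and `meta_reloads` stays at zero, which is the retry-until-timeout pattern listed in the steps.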
Root Cause
`searchCachedShard` accepted `shards_map` entries without checking whether the corresponding `(id, epoch)` still exists in `shards`.
Fix
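
In outline, the change adds an epoch-index check to the `shards_map` hit path. The sketch below shows one way such a check could look; `ValidatedCache` and the simplified `Shard` type are hypothetical, and the real `searchCachedShard` signature in `client-c` may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-ins; the real types in client-c may differ.
struct Shard { uint64_t id; uint64_t epoch; std::string end_key; };
using ShardPtr = std::shared_ptr<Shard>;
using ShardVerID = std::pair<uint64_t, uint64_t>; // (shard_id, shard_epoch)

struct ValidatedCache {
    std::map<std::string, ShardPtr> shards_map; // keyed by range end key
    std::map<ShardVerID, ShardPtr> shards;      // keyed by (shard_id, shard_epoch)

    // Returns the cached shard only if its (id, epoch) is still in shards;
    // otherwise reports a miss so the caller can reload metadata.
    ShardPtr searchCachedShard(const std::string & key) const {
        auto it = shards_map.upper_bound(key); // first entry with end key > key
        if (it == shards_map.end())
            return nullptr; // plain miss
        const ShardPtr & hit = it->second;
        assert(hit != nullptr);
        if (shards.find(ShardVerID{hit->id, hit->epoch}) == shards.end())
            return nullptr; // stale hit, treated as a miss
        return hit;
    }
};
```

Returning `nullptr` for a stale hit reuses the existing miss path, so callers need no new error handling in this sketch.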
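When `searchCachedShard` hits `shards_map`, validate that `(shard_id, shard_epoch)` exists in `shards`:
- If it exists, return the cached shard.
- If it does not, treat the hit as a cache miss (return `nullptr`) so `locateKey` can fall back to meta reload.
Tests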
Updated repro test in:
`src/test/shard_cache_test.cc`
The test covers:
Related Issue