Skip to content

Fix flaky tests#8409

Open
janezpodhostnik wants to merge 3 commits intomasterfrom
janez/flaky-test-fixes
Open

Fix flaky tests#8409
janezpodhostnik wants to merge 3 commits intomasterfrom
janez/flaky-test-fixes

Conversation

@janezpodhostnik
Copy link
Contributor

@janezpodhostnik janezpodhostnik commented Feb 11, 2026

This fix is three fixes in a trench coat:

  1. Fix flakiness of compactor tests. The root cause was that the wal still had a lock on the test dir when the test exited, because we explicitly decide not to wait for it in the code. I added a max 10ms wait, which makes the tests 100% stable (up from 75% stable) (tested on 100 runs on my machine)
  2. network/test/cohort2/unicast_authorization_test.go the reasoning is in the comments. I'm least sure about this one.
  3. connection/cache.go. The problem was that wg.done() was being called after wg was already finished. I added a lock on the closeRequested and wg, which is probably going to be a slight performance hit, since we used to use atomics for the closeRequested

Summary by CodeRabbit

  • Tests

    • Refactored concurrent access testing with batched approach for improved performance validation.
    • Updated error tolerance in authorization tests to handle edge cases.
    • Improved test resilience with proper public method accessors.
  • Bug Fixes

    • Enhanced shutdown logic with bounded timeout for checkpoint operations to improve stability.

@janezpodhostnik janezpodhostnik requested a review from a team as a code owner February 11, 2026 14:32
@github-actions
Copy link
Contributor

github-actions bot commented Feb 11, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

@janezpodhostnik janezpodhostnik self-assigned this Feb 11, 2026
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 11, 2026

📝 Walkthrough

Walkthrough

Test stability improvements and code refactoring across multiple components. Changes include batching concurrent test operations, replacing direct field access with public methods, adding shutdown timing delays, and loosening assertions to handle race conditions.

Changes

Cohort / File(s) Summary
Connection Cache Tests
engine/access/rpc/connection/cache_test.go, engine/access/rpc/connection/connection_test.go
Refactored concurrent test from single massive pattern to 100 batches of 1000 calls with per-batch invalidation; replaced direct atomic field access with public CloseRequested() method call.
Ledger Compactor Shutdown
ledger/complete/compactor.go
Added bounded 10ms delay in shutdown logic to wait for ongoing checkpoint completion before proceeding, improving test stability without delaying shutdown when checkpointing is active.
Network Authorization Test
network/test/cohort2/unicast_authorization_test.go
Loosened Unicast call assertion to tolerate potential "stream reset" errors, reflecting race condition handling where receiver may reset before sender closure.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • jordanschalm
  • zhangchiqing
  • tarakby

Poem

🐰 In batches we test, not all at once,
A thousand at a time, we flush the dunce!
With shorter waits and gentler assertions,
Our flaky tests find stable, true diversions.
Hop, hop, stability found! ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Merge Conflict Detection ⚠️ Warning ❌ Merge conflicts detected (37 files):

⚔️ AGENTS.md (content)
⚔️ cmd/access/node_builder/access_node_builder.go (content)
⚔️ cmd/observer/node_builder/observer_builder.go (content)
⚔️ cmd/verification_builder.go (content)
⚔️ docs/agents/GoDocs.md (content)
⚔️ docs/agents/OperationalDoctrine.md (content)
⚔️ engine/access/access_test.go (content)
⚔️ engine/access/ingestion/engine.go (content)
⚔️ engine/access/ingestion/engine_test.go (content)
⚔️ engine/access/ingestion/tx_error_messages/tx_error_messages_engine.go (content)
⚔️ engine/access/ingestion/tx_error_messages/tx_error_messages_engine_test.go (content)
⚔️ engine/access/ingestion2/engine_test.go (content)
⚔️ engine/access/ingestion2/finalized_block_processor.go (content)
⚔️ engine/access/rpc/connection/cache_test.go (content)
⚔️ engine/access/rpc/connection/connection_test.go (content)
⚔️ engine/testutil/mock/nodes.go (content)
⚔️ engine/testutil/nodes.go (content)
⚔️ engine/verification/assigner/blockconsumer/consumer.go (content)
⚔️ engine/verification/assigner/blockconsumer/consumer_test.go (content)
⚔️ engine/verification/fetcher/chunkconsumer/consumer.go (content)
⚔️ engine/verification/fetcher/chunkconsumer/consumer_test.go (content)
⚔️ fvm/evm/emulator/config.go (content)
⚔️ fvm/evm/emulator/emulator.go (content)
⚔️ fvm/evm/evm_test.go (content)
⚔️ fvm/evm/testutils/accounts.go (content)
⚔️ fvm/evm/types/errors.go (content)
⚔️ ledger/complete/compactor.go (content)
⚔️ module/jobqueue/component_consumer.go (content)
⚔️ module/jobqueue/component_consumer_test.go (content)
⚔️ module/jobqueue/consumer.go (content)
⚔️ module/jobqueue/consumer_behavior_test.go (content)
⚔️ module/jobqueue/consumer_test.go (content)
⚔️ module/state_synchronization/indexer/indexer.go (content)
⚔️ module/state_synchronization/indexer/indexer_test.go (content)
⚔️ module/state_synchronization/requester/execution_data_requester.go (content)
⚔️ module/state_synchronization/requester/execution_data_requester_test.go (content)
⚔️ network/test/cohort2/unicast_authorization_test.go (content)

These conflicts must be resolved before merging into master.
Resolve conflicts locally and push changes to this branch.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Fix flaky tests' accurately summarizes the main objective of the pull request, which is to address test flakiness across multiple test cases.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch janez/flaky-test-fixes
⚔️ Resolve merge conflicts (beta)
  • Auto-commit resolved conflicts to branch janez/flaky-test-fixes
  • Create stacked PR with resolved conflicts
  • Post resolved changes as copyable diffs in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov-commenter
Copy link

codecov-commenter commented Feb 11, 2026

Codecov Report

❌ Patch coverage is 75.00000% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
ledger/complete/compactor.go 75.00% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@network/test/cohort2/unicast_authorization_test.go`:
- Around line 466-470: Fix the comment typo above the require.ErrorContains
check: change "strem reset" to "stream reset" in the inline comment that
explains the receiver-reset case (the comment immediately preceding the
require.ErrorContains(u.T(), err, "stream reset", ...)). Ensure the corrected
spelling matches the string being asserted so the comment and assertion stay
consistent.

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
"fmt"
"sync"

lru "github.com/hashicorp/golang-lru/v2"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm attempting to fix this in another PR by completely refactoring the connection cache:
#7859

it needs some more rounds of review though.

my preference is to leave this logic as is, and push that PR forward to fix the tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a look 👀.

Would you be ok with adding a delay in the test to temporarily address the issue?

Are you ok with the other two?

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@engine/access/rpc/connection/cache_test.go`:
- Around line 180-188: The test spawns goroutines that call require.NoError
which calls t.FailNow and must only run in the main test goroutine; replace the
in-goroutine require.NoError usage in the goroutine that calls
cache.GetConnected (and the similar case in the first subtest near
cachedConn.AddRequest/connectFn) by either using assert.NoError and returning
from the goroutine on error or sending the error back on a channel/atomic and
having the main test goroutine fail; specifically update the anonymous goroutine
that calls cache.GetConnected("foo", cfg, nil, connectFn) to check the error
without t.FailNow (use assert.NoError from testify or propagate the err to the
main goroutine and handle fail there) and bail the goroutine early if an error
occurs.
🧹 Nitpick comments (1)
engine/access/rpc/connection/cache_test.go (1)

202-204: Lower bound assertion is very loose — consider tightening.

Since each of the 100 batches starts after the previous batch's invalidate("foo"), every batch must create at least one new connection. The lower bound could be int32(numBatches) instead of int32(1), which would make the test more meaningful without risking flakiness.

Proposed tighter bound
-		assert.Greater(t, callCount.Load(), int32(1))
+		assert.GreaterOrEqual(t, callCount.Load(), int32(numBatches))

Comment on lines +180 to +188
go func() {
defer wg.Done()
cachedConn, err := cache.GetConnected("foo", cfg, nil, connectFn)
require.NoError(t, err)

done := cachedConn.AddRequest()
time.Sleep(1 * time.Millisecond)
done()
}()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, locate and examine the test file
find . -name "cache_test.go" -path "*/engine/access/rpc/connection/*"

Repository: onflow/flow-go

Length of output: 102


🏁 Script executed:

# Examine the specific lines mentioned (180-188 and surrounding context)
sed -n '170,195p' engine/access/rpc/connection/cache_test.go

Repository: onflow/flow-go

Length of output: 664


🏁 Script executed:

# Also check line 149 mentioned in the note
sed -n '140,160p' engine/access/rpc/connection/cache_test.go

Repository: onflow/flow-go

Length of output: 736


🏁 Script executed:

# Search for all goroutines + require patterns in this file
rg -n "go func" -A 15 engine/access/rpc/connection/cache_test.go | head -100

Repository: onflow/flow-go

Length of output: 2885


🏁 Script executed:

# Check imports to see if both require and assert are available
head -30 engine/access/rpc/connection/cache_test.go

Repository: onflow/flow-go

Length of output: 785


require.NoError inside a goroutine can cause undefined test behavior.

require.NoError calls t.FailNow(), which invokes runtime.Goexit(). The testing package explicitly states this must only be called from the goroutine running the test function. In a spawned goroutine, it can cause a panic, a hang, or a silently passing test.

Use assert.NoError and return early instead, or collect the error via a channel/atomic.

Note: the same pattern exists at line 149 in the unchanged first subtest.

Proposed fix
 				go func() {
 					defer wg.Done()
 					cachedConn, err := cache.GetConnected("foo", cfg, nil, connectFn)
-					require.NoError(t, err)
+					if !assert.NoError(t, err) {
+						return
+					}
 
 					done := cachedConn.AddRequest()
 					time.Sleep(1 * time.Millisecond)
 					done()
 				}()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
go func() {
defer wg.Done()
cachedConn, err := cache.GetConnected("foo", cfg, nil, connectFn)
require.NoError(t, err)
done := cachedConn.AddRequest()
time.Sleep(1 * time.Millisecond)
done()
}()
go func() {
defer wg.Done()
cachedConn, err := cache.GetConnected("foo", cfg, nil, connectFn)
if !assert.NoError(t, err) {
return
}
done := cachedConn.AddRequest()
time.Sleep(1 * time.Millisecond)
done()
}()
🤖 Prompt for AI Agents
In `@engine/access/rpc/connection/cache_test.go` around lines 180 - 188, The test
spawns goroutines that call require.NoError which calls t.FailNow and must only
run in the main test goroutine; replace the in-goroutine require.NoError usage
in the goroutine that calls cache.GetConnected (and the similar case in the
first subtest near cachedConn.AddRequest/connectFn) by either using
assert.NoError and returning from the goroutine on error or sending the error
back on a channel/atomic and having the main test goroutine fail; specifically
update the anonymous goroutine that calls cache.GetConnected("foo", cfg, nil,
connectFn) to check the error without t.FailNow (use assert.NoError from testify
or propagate the err to the main goroutine and handle fail there) and bail the
goroutine early if an error occurs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants