Fix flaky tests by janezpodhostnik · Pull Request #8409 · onflow/flow-go

janezpodhostnik · 2026-02-11T14:32:20Z

This fix is three fixes in a trench coat:

- Fix flakiness of the compactor tests. The root cause was that the WAL still held a lock on the test dir when the test exited, because the code explicitly does not wait for it. I added a wait of at most 10ms, which makes the tests 100% stable (up from 75%), measured over 100 runs on my machine.
- `network/test/cohort2/unicast_authorization_test.go`: the reasoning is in the comments. I'm least sure about this one.
- `connection/cache.go`: the problem was that `wg.Done()` was being called after `wg` had already finished. I added a lock around `closeRequested` and `wg`, which is probably a slight performance hit, since `closeRequested` used to be an atomic.
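
For readers unfamiliar with the `connection/cache.go` race, here is a minimal sketch of the locking idea described above. It is not the code from this PR: `trackedConn`, its fields, and the two-value `AddRequest` signature are hypothetical and only illustrate why guarding the close flag and the `sync.WaitGroup` with one mutex prevents `Add`/`Done` from racing with `Wait`.

```go
package main

import (
	"fmt"
	"sync"
)

// trackedConn is a hypothetical stand-in for a cached connection.
// One mutex guards both the close flag and every wg.Add call.
type trackedConn struct {
	mu             sync.Mutex
	closeRequested bool
	wg             sync.WaitGroup
}

// AddRequest registers an in-flight request. It returns a done callback and
// whether the request was accepted (false once Close has been requested).
func (c *trackedConn) AddRequest() (done func(), ok bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.closeRequested {
		return func() {}, false
	}
	c.wg.Add(1)
	var once sync.Once // makes a repeated done() call harmless
	return func() { once.Do(c.wg.Done) }, true
}

// CloseRequested reports whether Close has been called.
func (c *trackedConn) CloseRequested() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.closeRequested
}

// Close rejects new requests, then waits for the in-flight ones.
func (c *trackedConn) Close() {
	c.mu.Lock()
	c.closeRequested = true
	c.mu.Unlock()
	c.wg.Wait() // no Add can start after the flag is set
}

func main() {
	conn := &trackedConn{}
	var workers sync.WaitGroup
	for i := 0; i < 100; i++ {
		workers.Add(1)
		go func() {
			defer workers.Done()
			if done, ok := conn.AddRequest(); ok {
				done()
			}
		}()
	}
	conn.Close() // may run while requests are still being added
	workers.Wait()

	_, ok := conn.AddRequest()
	fmt.Println("accepted after close:", ok) // accepted after close: false
}
```

The trade-off is the one mentioned above: every `AddRequest` now takes a mutex instead of reading an atomic.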

Summary by CodeRabbit

Tests
- Refactored concurrent access testing with batched approach for improved performance validation.
- Updated error tolerance in authorization tests to handle edge cases.
- Improved test resilience with proper public method accessors.
Bug Fixes
- Enhanced shutdown logic with bounded timeout for checkpoint operations to improve stability.

github-actions · 2026-02-11T14:32:36Z

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

coderabbitai · 2026-02-11T14:32:51Z

📝 Walkthrough

Test stability improvements and code refactoring across multiple components. Changes include batching concurrent test operations, replacing direct field access with public methods, adding shutdown timing delays, and loosening assertions to handle race conditions.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Connection Cache Tests `engine/access/rpc/connection/cache_test.go`, `engine/access/rpc/connection/connection_test.go` | Refactored concurrent test from single massive pattern to 100 batches of 1000 calls with per-batch invalidation; replaced direct atomic field access with public `CloseRequested()` method call. |
| Ledger Compactor Shutdown `ledger/complete/compactor.go` | Added bounded 10ms delay in shutdown logic to wait for ongoing checkpoint completion before proceeding, improving test stability without delaying shutdown when checkpointing is active. |
| Network Authorization Test `network/test/cohort2/unicast_authorization_test.go` | Loosened Unicast call assertion to tolerate potential "stream reset" errors, reflecting race condition handling where receiver may reset before sender closure. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

jordanschalm
zhangchiqing
tarakby

Poem

🐰 In batches we test, not all at once,
A thousand at a time, we flush the dunce!
With shorter waits and gentler assertions,
Our flaky tests find stable, true diversions.
Hop, hop, stability found! ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Merge Conflict Detection | ⚠️ Warning | ❌ Merge conflicts detected (37 files): ⚔️ `AGENTS.md` (content) ⚔️ `cmd/access/node_builder/access_node_builder.go` (content) ⚔️ `cmd/observer/node_builder/observer_builder.go` (content) ⚔️ `cmd/verification_builder.go` (content) ⚔️ `docs/agents/GoDocs.md` (content) ⚔️ `docs/agents/OperationalDoctrine.md` (content) ⚔️ `engine/access/access_test.go` (content) ⚔️ `engine/access/ingestion/engine.go` (content) ⚔️ `engine/access/ingestion/engine_test.go` (content) ⚔️ `engine/access/ingestion/tx_error_messages/tx_error_messages_engine.go` (content) ⚔️ `engine/access/ingestion/tx_error_messages/tx_error_messages_engine_test.go` (content) ⚔️ `engine/access/ingestion2/engine_test.go` (content) ⚔️ `engine/access/ingestion2/finalized_block_processor.go` (content) ⚔️ `engine/access/rpc/connection/cache_test.go` (content) ⚔️ `engine/access/rpc/connection/connection_test.go` (content) ⚔️ `engine/testutil/mock/nodes.go` (content) ⚔️ `engine/testutil/nodes.go` (content) ⚔️ `engine/verification/assigner/blockconsumer/consumer.go` (content) ⚔️ `engine/verification/assigner/blockconsumer/consumer_test.go` (content) ⚔️ `engine/verification/fetcher/chunkconsumer/consumer.go` (content) ⚔️ `engine/verification/fetcher/chunkconsumer/consumer_test.go` (content) ⚔️ `fvm/evm/emulator/config.go` (content) ⚔️ `fvm/evm/emulator/emulator.go` (content) ⚔️ `fvm/evm/evm_test.go` (content) ⚔️ `fvm/evm/testutils/accounts.go` (content) ⚔️ `fvm/evm/types/errors.go` (content) ⚔️ `ledger/complete/compactor.go` (content) ⚔️ `module/jobqueue/component_consumer.go` (content) ⚔️ `module/jobqueue/component_consumer_test.go` (content) ⚔️ `module/jobqueue/consumer.go` (content) ⚔️ `module/jobqueue/consumer_behavior_test.go` (content) ⚔️ `module/jobqueue/consumer_test.go` (content) ⚔️ `module/state_synchronization/indexer/indexer.go` (content) ⚔️ `module/state_synchronization/indexer/indexer_test.go` (content) ⚔️ `module/state_synchronization/requester/execution_data_requester.go` (content) ⚔️ `module/state_synchronization/requester/execution_data_requester_test.go` (content) ⚔️ `network/test/cohort2/unicast_authorization_test.go` (content). These conflicts must be resolved before merging into `master`. | Resolve conflicts locally and push changes to this branch. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Fix flaky tests' accurately summarizes the main objective of the pull request, which is to address test flakiness across multiple test cases. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

codecov-commenter · 2026-02-11T14:35:55Z

Codecov Report

❌ Patch coverage is 75.00000% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ledger/complete/compactor.go | 75.00% | 1 Missing ⚠️ |

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@network/test/cohort2/unicast_authorization_test.go`:
- Around line 466-470: Fix the comment typo above the require.ErrorContains
check: change "strem reset" to "stream reset" in the inline comment that
explains the receiver-reset case (the comment immediately preceding the
require.ErrorContains(u.T(), err, "stream reset", ...)). Ensure the corrected
spelling matches the string being asserted so the comment and assertion stay
consistent.

network/test/cohort2/unicast_authorization_test.go

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

peterargue · 2026-02-11T17:02:03Z

engine/access/rpc/connection/cache.go

 	"fmt"
 	"sync"

 	lru "github.com/hashicorp/golang-lru/v2"


I'm attempting to fix this in another PR by completely refactoring the connection cache: #7859

It needs some more rounds of review, though.

My preference is to leave this logic as is and push that PR forward to fix the tests.

I'll take a look 👀.

Would you be ok with adding a delay in the test to temporarily address the issue?

Are you ok with the other two?

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@engine/access/rpc/connection/cache_test.go`:
- Around line 180-188: The test spawns goroutines that call require.NoError
which calls t.FailNow and must only run in the main test goroutine; replace the
in-goroutine require.NoError usage in the goroutine that calls
cache.GetConnected (and the similar case in the first subtest near
cachedConn.AddRequest/connectFn) by either using assert.NoError and returning
from the goroutine on error or sending the error back on a channel/atomic and
having the main test goroutine fail; specifically update the anonymous goroutine
that calls cache.GetConnected("foo", cfg, nil, connectFn) to check the error
without t.FailNow (use assert.NoError from testify or propagate the err to the
main goroutine and handle fail there) and bail the goroutine early if an error
occurs.

🧹 Nitpick comments (1)

engine/access/rpc/connection/cache_test.go (1)
202-204: Lower bound assertion is very loose — consider tightening.

Since each of the 100 batches starts after the previous batch's invalidate("foo"), every batch must create at least one new connection. The lower bound could be int32(numBatches) instead of int32(1), which would make the test more meaningful without risking flakiness.
Proposed tighter bound
-		assert.Greater(t, callCount.Load(), int32(1))
+		assert.GreaterOrEqual(t, callCount.Load(), int32(numBatches))
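
The reasoning behind the tighter bound can be checked with a toy model. The sketch below is not the flow-go connection cache: `toyCache`, `getOrCreate`, `invalidate`, and `runBatches` are hypothetical names, and the cache is deliberately simplified so that creation happens under a single mutex. It only shows that when every batch ends with an invalidation, each batch has to create at least one new entry.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// toyCache is a hypothetical, much simplified stand-in for the connection cache.
type toyCache struct {
	mu    sync.Mutex
	conns map[string]int
}

// getOrCreate returns the cached value for key, calling create only on a miss.
func (c *toyCache) getOrCreate(key string, create func() int) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.conns[key]; ok {
		return v
	}
	v := create()
	c.conns[key] = v
	return v
}

// invalidate drops the cached value for key.
func (c *toyCache) invalidate(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.conns, key)
}

// runBatches issues callsPerBatch concurrent lookups per batch, invalidates
// the key after each batch, and returns how many times create was called.
func runBatches(numBatches, callsPerBatch int) int32 {
	cache := &toyCache{conns: make(map[string]int)}
	var callCount atomic.Int32
	create := func() int { return int(callCount.Add(1)) }

	for b := 0; b < numBatches; b++ {
		var wg sync.WaitGroup
		for i := 0; i < callsPerBatch; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				cache.getOrCreate("foo", create)
			}()
		}
		wg.Wait()
		cache.invalidate("foo") // the next batch must create a fresh entry
	}
	return callCount.Load()
}

func main() {
	const numBatches = 100
	got := runBatches(numBatches, 1000)
	fmt.Println("at least one creation per batch:", got >= numBatches)
}
```

In this toy model the count is exactly `numBatches`; a real cache that allows concurrent creation could legitimately exceed it, which is why the suggested assertion is `GreaterOrEqual` rather than `Equal`.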

coderabbitai · 2026-02-13T17:42:49Z

engine/access/rpc/connection/cache_test.go

+				go func() {
+					defer wg.Done()
+					cachedConn, err := cache.GetConnected("foo", cfg, nil, connectFn)
+					require.NoError(t, err)
+
+					done := cachedConn.AddRequest()
+					time.Sleep(1 * time.Millisecond)
+					done()
+				}()


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, locate and examine the test file
find . -name "cache_test.go" -path "*/engine/access/rpc/connection/*"

Repository: onflow/flow-go

Length of output: 102

🏁 Script executed:

# Examine the specific lines mentioned (180-188 and surrounding context)
sed -n '170,195p' engine/access/rpc/connection/cache_test.go

Repository: onflow/flow-go

Length of output: 664

🏁 Script executed:

# Also check line 149 mentioned in the note
sed -n '140,160p' engine/access/rpc/connection/cache_test.go

Repository: onflow/flow-go

Length of output: 736

🏁 Script executed:

# Search for all goroutines + require patterns in this file
rg -n "go func" -A 15 engine/access/rpc/connection/cache_test.go | head -100

Repository: onflow/flow-go

Length of output: 2885

🏁 Script executed:

# Check imports to see if both require and assert are available
head -30 engine/access/rpc/connection/cache_test.go

Repository: onflow/flow-go

Length of output: 785

require.NoError inside a goroutine can cause undefined test behavior.

require.NoError calls t.FailNow(), which invokes runtime.Goexit(). The testing package explicitly states this must only be called from the goroutine running the test function. In a spawned goroutine, it can cause a panic, a hang, or a silently passing test.

Use assert.NoError and return early instead, or collect the error via a channel/atomic.

Note: the same pattern exists at line 149 in the unchanged first subtest.

Proposed fix

 go func() {
 	defer wg.Done()
 	cachedConn, err := cache.GetConnected("foo", cfg, nil, connectFn)
-	require.NoError(t, err)
+	if !assert.NoError(t, err) {
+		return
+	}
 	done := cachedConn.AddRequest()
 	time.Sleep(1 * time.Millisecond)
 	done()
 }()

Fix flaky tests

7f62e96

janezpodhostnik requested a review from a team as a code owner February 11, 2026 14:32

janezpodhostnik self-assigned this Feb 11, 2026

janezpodhostnik requested review from a team, fxamacker, holyfuchs, peterargue and zhangchiqing February 11, 2026 14:33

coderabbitai bot reviewed Feb 11, 2026

Update network/test/cohort2/unicast_authorization_test.go

cd439b6

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

peterargue reviewed Feb 11, 2026

alternate fix for TestConcurrentConnections

b2f89d7

janezpodhostnik requested a review from peterargue February 13, 2026 17:40

coderabbitai bot reviewed Feb 13, 2026

Fix flaky tests #8409
janezpodhostnik wants to merge 3 commits into master from janez/flaky-test-fixes

3 participants
