perf: cache serialized query plans to avoid per-partition serialization #3246

andygrove · 2026-01-22T21:33:35Z

Which issue does this PR close?

N/A - Performance optimization

Rationale for this change

The CometExec.getCometIterator() method serializes the protobuf query plan to bytes every time it's called. Since this method is called inside mapPartitions* for each partition, the same plan was being serialized N times for a query with N partitions. This causes unnecessary CPU overhead and GC pressure.

What changes are included in this PR?

operators.scala - Added caching infrastructure:
- serializeNativePlan(nativePlan: Operator): Array[Byte] - helper to serialize a plan once
- New getCometIterator overload accepting pre-serialized Array[Byte] instead of Operator
- Refactored existing method to use the new helper
CometExecUtils.scala - Updated getNativeLimitRDD:
- Serialize the limit plan once before mapPartitionsWithIndexInternal
- Pass pre-serialized bytes to each partition
CometTakeOrderedAndProjectExec.scala - Updated doExecuteColumnar():
- Serialize the topK plan once before the local topK mapping
- Serialize the topKAndProjection plan once before the final mapping

How are these changes tested?

Existing tests pass:

CometNativeSuite: 4 tests
CometExecSuite: 89 tests (includes TakeOrderedAndProjectExec tests)
CometShuffleSuite: 41 tests

🤖 Generated with Claude Code

Replace ByteArrayOutputStream with direct CodedOutputStream serialization to eliminate unnecessary allocations during query plan serialization. This optimization: - Pre-allocates exact buffer size using getSerializedSize() - Eliminates ByteArrayOutputStream's internal buffer resizing - Removes defensive array copying from toByteArray() - Applies to 5 hot paths called per-partition during query execution For a query with 1000 partitions, this eliminates 5000+ unnecessary allocations and array copies, significantly reducing GC pressure. Changes: - operators.scala: getCometIterator() and convertBlock() - CometNativeWriteExec.scala: serializedPlanOpt() and doExecute() - ParquetFilters.scala: createNativeFilters()

Add caching for serialized protobuf query plans to avoid serializing the same plan repeatedly for every partition in CometExecIterator. Changes: - Add serializeNativePlan() helper method in CometExec - Add getCometIterator() overload accepting pre-serialized bytes - Update CometExecUtils.getNativeLimitRDD to serialize once - Update CometTakeOrderedAndProjectExec to serialize plans once For a query with N partitions, this eliminates N-1 redundant protobuf serializations per affected code path, reducing CPU overhead and GC pressure during query execution. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

codecov-commenter · 2026-01-22T21:56:49Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.11%. Comparing base (f09f8af) to head (cdf9dfa).
⚠️ Report is 868 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #3246      +/-   ##
============================================
+ Coverage     56.12%   60.11%   +3.98%     
- Complexity      976     1428     +452     
============================================
  Files           119      170      +51     
  Lines         11743    15799    +4056     
  Branches       2251     2604     +353     
============================================
+ Hits           6591     9497    +2906     
- Misses         4012     4983     +971     
- Partials       1140     1319     +179

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

hsiang-c

Moving serialization out of mapPartitionsWithIndexInternal LGTM

wForget

Thanks @andygrove

andygrove · 2026-01-23T13:30:49Z

Thanks for the reviews @hsiang-c @wForget

andygrove and others added 3 commits January 22, 2026 12:24

fix

0a6f97d

andygrove marked this pull request as draft January 22, 2026 21:33

upmerge

cdf9dfa

andygrove marked this pull request as ready for review January 22, 2026 22:58

hsiang-c approved these changes Jan 23, 2026

View reviewed changes

andygrove requested review from mbutrovich, parthchandra and wForget January 23, 2026 01:52

wForget approved these changes Jan 23, 2026

View reviewed changes

andygrove merged commit 5d6ed61 into apache:main Jan 23, 2026
118 checks passed

andygrove deleted the cache-protobuf branch January 23, 2026 13:30

andygrove mentioned this pull request Jan 23, 2026

Comet 0.13.0 Release (January 2026) #2845

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: cache serialized query plans to avoid per-partition serialization #3246

perf: cache serialized query plans to avoid per-partition serialization #3246

andygrove commented Jan 22, 2026

Uh oh!

codecov-commenter commented Jan 22, 2026 •

edited

Loading

Uh oh!

hsiang-c left a comment

Uh oh!

wForget left a comment

Uh oh!

Uh oh!

andygrove commented Jan 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

perf: cache serialized query plans to avoid per-partition serialization #3246

perf: cache serialized query plans to avoid per-partition serialization #3246

Conversation

andygrove commented Jan 22, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

How are these changes tested?

Uh oh!

codecov-commenter commented Jan 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

hsiang-c left a comment

Choose a reason for hiding this comment

Uh oh!

wForget left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

andygrove commented Jan 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov-commenter commented Jan 22, 2026 •

edited

Loading