Skip to content

Conversation

@andygrove
Copy link
Member

Which issue does this PR close?

N/A - Performance optimization

Rationale for this change

The CometExec.getCometIterator() method serializes the protobuf query plan to bytes every time it's called. Since this method is called inside mapPartitions* for each partition, the same plan was being serialized N times for a query with N partitions. This causes unnecessary CPU overhead and GC pressure.

What changes are included in this PR?

  1. operators.scala - Added caching infrastructure:

    • serializeNativePlan(nativePlan: Operator): Array[Byte] - helper to serialize a plan once
    • New getCometIterator overload accepting pre-serialized Array[Byte] instead of Operator
    • Refactored existing method to use the new helper
  2. CometExecUtils.scala - Updated getNativeLimitRDD:

    • Serialize the limit plan once before mapPartitionsWithIndexInternal
    • Pass pre-serialized bytes to each partition
  3. CometTakeOrderedAndProjectExec.scala - Updated doExecuteColumnar():

    • Serialize the topK plan once before the local topK mapping
    • Serialize the topKAndProjection plan once before the final mapping

How are these changes tested?

Existing tests pass:

  • CometNativeSuite: 4 tests
  • CometExecSuite: 89 tests (includes TakeOrderedAndProjectExec tests)
  • CometShuffleSuite: 41 tests

🤖 Generated with Claude Code

andygrove and others added 3 commits January 22, 2026 12:24
Replace ByteArrayOutputStream with direct CodedOutputStream serialization
to eliminate unnecessary allocations during query plan serialization.

This optimization:
- Pre-allocates exact buffer size using getSerializedSize()
- Eliminates ByteArrayOutputStream's internal buffer resizing
- Removes defensive array copying from toByteArray()
- Applies to 5 hot paths called per-partition during query execution

For a query with 1000 partitions, this eliminates 5000+ unnecessary
allocations and array copies, significantly reducing GC pressure.

Changes:
- operators.scala: getCometIterator() and convertBlock()
- CometNativeWriteExec.scala: serializedPlanOpt() and doExecute()
- ParquetFilters.scala: createNativeFilters()
Add caching for serialized protobuf query plans to avoid serializing
the same plan repeatedly for every partition in CometExecIterator.

Changes:
- Add serializeNativePlan() helper method in CometExec
- Add getCometIterator() overload accepting pre-serialized bytes
- Update CometExecUtils.getNativeLimitRDD to serialize once
- Update CometTakeOrderedAndProjectExec to serialize plans once

For a query with N partitions, this eliminates N-1 redundant protobuf
serializations per affected code path, reducing CPU overhead and GC
pressure during query execution.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@andygrove andygrove marked this pull request as draft January 22, 2026 21:33
@codecov-commenter
Copy link

codecov-commenter commented Jan 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.11%. Comparing base (f09f8af) to head (cdf9dfa).
⚠️ Report is 868 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3246      +/-   ##
============================================
+ Coverage     56.12%   60.11%   +3.98%     
- Complexity      976     1428     +452     
============================================
  Files           119      170      +51     
  Lines         11743    15799    +4056     
  Branches       2251     2604     +353     
============================================
+ Hits           6591     9497    +2906     
- Misses         4012     4983     +971     
- Partials       1140     1319     +179     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@andygrove andygrove marked this pull request as ready for review January 22, 2026 22:58
Copy link
Contributor

@hsiang-c hsiang-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving serialization out of mapPartitionsWithIndexInternal LGTM

Copy link
Member

@wForget wForget left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @andygrove

@andygrove andygrove merged commit 5d6ed61 into apache:main Jan 23, 2026
118 checks passed
@andygrove
Copy link
Member Author

Thanks for the reviews @hsiang-c @wForget

@andygrove andygrove deleted the cache-protobuf branch January 23, 2026 13:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants