Skip to content

Conversation

@parthchandra
Copy link
Contributor

Rationale for this change

We see increased GC collection times in jobs with Iceberg scans with a large number (10K-100K) of partitions

What changes are included in this PR?

Partition values are currently serialized to native by constructing a JSON string. This PR changes that to use Protobuf instead.

How are these changes tested?

Added a new unit test for a large number of partitions.

AI note: large parts were generated using Claude Code.

Copy link
Contributor

@hsiang-c hsiang-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace JSON with Protocol Buffer LGTM

@codecov-commenter
Copy link

codecov-commenter commented Jan 22, 2026

Codecov Report

❌ Patch coverage is 39.00000% with 61 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.07%. Comparing base (f09f8af) to head (e8b87e7).
⚠️ Report is 873 commits behind head on main.

Files with missing lines Patch % Lines
.../comet/serde/operator/CometIcebergNativeScan.scala 39.00% 55 Missing and 6 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3247      +/-   ##
============================================
+ Coverage     56.12%   60.07%   +3.94%     
- Complexity      976     1438     +462     
============================================
  Files           119      172      +53     
  Lines         11743    15927    +4184     
  Branches       2251     2631     +380     
============================================
+ Hits           6591     9568    +2977     
- Misses         4012     5031    +1019     
- Partials       1140     1328     +188     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment on lines 2655 to 2656
/// This replaces JSON parsing with direct protobuf deserialization with a more compact
/// representation (e.g., timestamps as integers vs strings).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can remove the part of the comment that explains how this used to work

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

checkIcebergNativeScan(
"SELECT COUNT(*) FROM s3_catalog.db.large_partitioned_test WHERE partition_id IN (0, 50, 99)")

spark.sql("DROP TABLE s3_catalog.db.large_partitioned_test")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) You can try DROP TABLE s3_catalog.db.large_partitioned_test PURGE to remove files on disk.

Comment on lines +116 to +149
oneof value {
int32 int_val = 2;
int64 long_val = 3;
int64 date_val = 4; // days since epoch
int64 timestamp_val = 5; // microseconds since epoch
int64 timestamp_tz_val = 6; // microseconds with timezone
string string_val = 7;
double double_val = 8;
float float_val = 9;
bytes decimal_val = 10; // unscaled BigInteger bytes
bool bool_val = 11;
bytes uuid_val = 12;
bytes fixed_val = 13;
bytes binary_val = 14;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to consider consolidating this with the existing Literal defined in protobuf. This does not need to happen for the current PR.

message Literal {
  oneof value {
    bool bool_val = 1;
    // Protobuf doesn't provide int8 and int16, we put them into int32 and convert
    // to int8 and int16 when deserializing.
    int32 byte_val = 2;
    int32 short_val = 3;
    int32 int_val = 4;
    int64 long_val = 5;
    float float_val = 6;
    double double_val = 7;
    string string_val = 8;
    bytes bytes_val = 9;
    bytes decimal_val = 10;
    ListLiteral list_val = 11;
  }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are Iceberg types though.

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending CI

Comment on lines 203 to 207
/**
* Legacy JSON serialization function - removed in favor of protobuf. Kept as reference for
* conversion logic.
*/
private def partitionValueToJson(fieldTypeStr: String, value: Any): JValue = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just remove this now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found some unused code as a result of removing this. Thanks!

@andygrove
Copy link
Member

@parthchandra there is a clippy failure

@parthchandra parthchandra force-pushed the iceberg_scan_protobuf_partition branch from 8319470 to e8b87e7 Compare January 23, 2026 16:50
@andygrove
Copy link
Member

@parthchandra do you have benchmarks showing the performance improvement?

@parthchandra parthchandra merged commit 077005c into apache:main Jan 23, 2026
222 of 223 checks passed
@parthchandra
Copy link
Contributor Author

Merged. Thanks @andygrove @hsiang-c

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants