feat: Enable native columnar to row by default [WIP] #3228
Draft
andygrove wants to merge 50 commits into apache:main from andygrove:native-c2r-enabled
+97,808
−100,257
Conversation
Adds a dev script that automates regenerating golden files for the CometTPCDSV1_4_PlanStabilitySuite and CometTPCDSV2_7_PlanStabilitySuite tests across all supported Spark versions (3.4, 3.5, 4.0). The script verifies JDK 17+ is configured (required for Spark 4.0) and supports regenerating for a specific Spark version or all versions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This PR adds an experimental native (Rust-based) implementation of ColumnarToRowExec that converts Arrow columnar data to Spark UnsafeRow format.

Benefits over the current Scala implementation:
- Zero-copy for variable-length types: String and Binary data is written directly to the output buffer without intermediate Java object allocation
- Vectorized processing: The native implementation processes data in a columnar fashion, improving CPU cache utilization
- Reduced GC pressure: All conversion happens in native memory, avoiding the creation of temporary Java objects that would need garbage collection
- Buffer reuse: The output buffer is allocated once and reused across batches, minimizing memory allocation overhead

The feature is disabled by default and can be enabled by setting:

    spark.comet.exec.columnarToRow.native.enabled=true

Supported data types:
- Primitive types: Boolean, Byte, Short, Int, Long, Float, Double
- Date and Timestamp (microseconds)
- Decimal (both inline precision<=18 and variable-length precision>18)
- String and Binary
- Complex types: Struct, Array, Map (nested)

This is an experimental feature for evaluation and benchmarking purposes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
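To make the target format concrete, here is a minimal sketch of the UnsafeRow layout that the conversion has to produce: a null bitset rounded up to 8-byte words, one 8-byte slot per field, and a trailing variable-length region whose slot stores `(offset << 32) | length`. The `Value` enum and `write_row` function are hypothetical names for illustration only and are not taken from the PR; the sketch assumes a little-endian platform and handles only 64-bit integers and byte strings.

```rust
/// Bytes used by the null bitset: one bit per field, rounded up to 8-byte words.
fn bitset_width(num_fields: usize) -> usize {
    ((num_fields + 63) / 64) * 8
}

/// Round `n` up to the next multiple of 8.
fn round_up_8(n: usize) -> usize {
    (n + 7) & !7
}

/// Hypothetical column value, used only for this illustration.
enum Value<'a> {
    Null,
    Long(i64),
    Bytes(&'a [u8]),
}

/// Append one row to `out` and return the row's size in bytes.
fn write_row(values: &[Value], out: &mut Vec<u8>) -> usize {
    let start = out.len();
    let bitset = bitset_width(values.len());
    // Fixed region: null bitset followed by one 8-byte slot per field.
    out.resize(start + bitset + 8 * values.len(), 0);
    for (i, v) in values.iter().enumerate() {
        let slot = start + bitset + 8 * i;
        match v {
            // Set bit `i` of the bitset (little-endian byte order assumed).
            Value::Null => out[start + i / 8] |= 1 << (i % 8),
            Value::Long(x) => out[slot..slot + 8].copy_from_slice(&x.to_le_bytes()),
            Value::Bytes(b) => {
                // Variable-length data goes after the fixed region, padded to
                // 8 bytes; the slot records the row-relative offset and length.
                let offset = out.len() - start;
                out.extend_from_slice(b);
                out.resize(start + round_up_8(offset + b.len()), 0);
                let packed = ((offset as u64) << 32) | b.len() as u64;
                out[slot..slot + 8].copy_from_slice(&packed.to_le_bytes());
            }
        }
    }
    out.len() - start
}
```

Under these assumptions, a three-field row `(7, NULL, "abc")` occupies 40 bytes: 8 for the bitset, 24 for the slots, and 8 for the padded string.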
Spark's UnsafeArrayData uses the actual primitive size for elements (e.g., 4 bytes for INT32), not always 8 bytes like UnsafeRow fields.

This fix:
- Added get_element_size() to determine correct sizes for each type
- Added write_array_element() to write values with type-specific widths
- Updated write_list_data() and write_map_data() to use correct sizes
- Added LargeUtf8/LargeBinary support for struct fields
- Added comprehensive test suite (CometNativeColumnarToRowSuite)
- Updated compatibility documentation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
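The size difference described above can be sketched as follows. This is an illustration rather than the PR's code: `ElemType`, `element_size`, `array_header_size`, and `write_int_array` are hypothetical names, the header formula (8 bytes for the element count plus a null bitset in 8-byte words) reflects my understanding of UnsafeArrayData, and little-endian byte order is assumed.

```rust
/// Hypothetical element type tag, for illustration only.
#[derive(Clone, Copy)]
enum ElemType {
    Boolean,
    Byte,
    Short,
    Int,
    Long,
    Float,
    Double,
    Date,
    TimestampMicros,
    /// Strings, binary, and nested types store an 8-byte offset-and-size word.
    VarLen,
}

/// Width in bytes of one element inside an UnsafeArrayData-style buffer.
fn element_size(t: ElemType) -> usize {
    match t {
        ElemType::Boolean | ElemType::Byte => 1,
        ElemType::Short => 2,
        ElemType::Int | ElemType::Float | ElemType::Date => 4,
        ElemType::Long
        | ElemType::Double
        | ElemType::TimestampMicros
        | ElemType::VarLen => 8,
    }
}

/// Header: 8 bytes for the element count plus a null bitset in 8-byte words.
fn array_header_size(num_elements: usize) -> usize {
    8 + ((num_elements + 63) / 64) * 8
}

/// Write a non-null INT32 array using 4-byte elements, padded to 8 bytes.
fn write_int_array(values: &[i32]) -> Vec<u8> {
    let header = array_header_size(values.len());
    let width = element_size(ElemType::Int);
    let data = (values.len() * width + 7) & !7;
    let mut out = vec![0u8; header + data];
    out[..8].copy_from_slice(&(values.len() as u64).to_le_bytes());
    for (i, v) in values.iter().enumerate() {
        let at = header + i * width;
        out[at..at + width].copy_from_slice(&v.to_le_bytes());
    }
    out
}
```

With 4-byte elements, three INT32 values need 16 bytes of element data (12 rounded up to 8-byte alignment) instead of the 24 bytes that 8-byte slots would take.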
Add a fuzz test using FuzzDataGenerator to test the native columnar to row conversion with randomly generated schemas containing arrays, structs, and maps.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add tests verifying that native columnar to row conversion correctly handles complex nested types:
- Array<Array<Int>>
- Map<String, Array<Int>>
- Struct<Array<Map<String, Int>>, String>

These tests confirm the recursive conversion logic works for arbitrary nesting depth.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add a fuzz test using FuzzDataGenerator.generateNestedSchema to test native columnar to row conversion with deeply nested random schemas (depth 1-3, with arrays, structs, and maps). The test uses only primitive types supported by native C2R (excludes TimestampNTZType which is not yet supported).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use actual array type for dispatching instead of schema type to handle type mismatches between serialized schema and FFI arrays
- Add support for LargeList (64-bit offsets) arrays
- Replace .unwrap() with proper error handling to provide clear error messages instead of panics
- Add tests for LargeList handling

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When Parquet data is read, string columns may be dictionary-encoded for efficiency. The schema says Utf8 but the actual Arrow array is Dictionary(Int32, Utf8). This caused a type mismatch error.

- Add support for Dictionary-encoded arrays in get_variable_length_data
- Handle all common key types (Int8, Int16, Int32, Int64, UInt8-64)
- Support Utf8, LargeUtf8, Binary, and LargeBinary value types
- Add tests for dictionary-encoded string arrays

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
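As a rough model of what dictionary encoding means for the converter, the sketch below stores each row as an optional key into a small table of distinct strings and resolves the key when the row is written. It deliberately avoids the Arrow API: `DictColumn` and `append_row` are hypothetical stand-ins, not the PR's `get_variable_length_data`.

```rust
/// Hypothetical stand-in for a dictionary-encoded string column:
/// each row holds an optional key into a table of distinct values.
struct DictColumn<'a> {
    keys: &'a [Option<usize>],
    values: &'a [&'a str],
}

impl<'a> DictColumn<'a> {
    /// Resolve the key for `row` to its string, or `None` for a null row.
    fn value(&self, row: usize) -> Option<&'a str> {
        self.keys[row].map(|k| self.values[k])
    }
}

/// Append the resolved bytes for `row` to `out` and return the number of
/// bytes written, or `None` when the row is null.
fn append_row(col: &DictColumn, row: usize, out: &mut Vec<u8>) -> Option<usize> {
    let s = col.value(row)?;
    out.extend_from_slice(s.as_bytes());
    Some(s.len())
}
```

The point of the model is that the logical type (a string) and the physical representation (keys plus values) differ, so dispatching on the schema type alone picks the wrong reader.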
Add CometColumnarToRowBenchmark to compare performance of:
- Spark's default ColumnarToRowExec
- Comet's JVM-based CometColumnarToRowExec
- Comet's Native CometNativeColumnarToRowExec

Benchmark covers:
- Primitive types (int, long, double, string, boolean, date)
- String-heavy workloads (short, medium, long strings)
- Struct types (simple, nested, deeply nested)
- Array types (primitives and strings)
- Map types (various key/value combinations)
- Complex nested types (arrays of structs, maps with arrays)
- Wide rows (50 columns of mixed types)

Run with:

    SPARK_GENERATE_BENCHMARK_FILES=1 make benchmark-org.apache.spark.sql.benchmark.CometColumnarToRowBenchmark

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The native columnar-to-row conversion was allocating intermediate Vec<u8> for every variable-length field (strings, binary).

This change:
- Adds write_variable_length_to_buffer() that writes directly to the output buffer instead of returning a Vec
- Adds write_dictionary_to_buffer() functions for dictionary-encoded arrays
- Adds #[inline] hints to hot-path functions
- Removes intermediate allocations for Utf8, LargeUtf8, Binary, LargeBinary

Benchmark results for String Types:
- Before: Native was slower than Spark
- After: Native matches Spark (1.0X)

Primitive types and complex nested types (struct, array, map) still have overhead from JNI/FFI and remaining intermediate allocations.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Inspired by Velox UnsafeRowFast, add optimizations for all-fixed-width schemas:
- Add is_fixed_width() and is_all_fixed_width() detection functions
- Add convert_fixed_width() fast path that:
  - Pre-allocates entire buffer at once (row_size * num_rows)
  - Pre-fills offsets/lengths arrays (constant row size)
  - Processes column-by-column for better cache locality
- Add write_column_fixed_width() for type-specific column processing
- Add tests for fixed-width fast path detection

Limitations:
- UnsafeRow format stores 8-byte fields per row (not columnar), so bulk memcpy of entire columns is not possible
- JNI/FFI boundary crossing still has overhead
- The "primitive types" benchmark includes strings, so it doesn't trigger the fixed-width fast path

For schemas with only fixed-width columns (no strings, arrays, maps, structs), this reduces allocations and improves cache locality.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
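The fast path described above can be sketched for the simplest case, a batch of non-null 64-bit integer columns. Because every row has the same size, the buffer, offsets, and lengths can all be computed up front and the data filled in column by column. `convert_i64_columns` is a hypothetical name; null handling and other types are omitted, and little-endian output is assumed.

```rust
/// Convert non-null i64 columns into fixed-size rows.
/// Returns the row buffer plus per-row offsets and lengths.
fn convert_i64_columns(
    columns: &[Vec<i64>],
    num_rows: usize,
) -> (Vec<u8>, Vec<usize>, Vec<usize>) {
    let bitset = ((columns.len() + 63) / 64) * 8;
    let row_size = bitset + 8 * columns.len();

    // One allocation for the whole batch: row_size * num_rows.
    let mut buf = vec![0u8; row_size * num_rows];

    // Every row has the same size, so offsets and lengths are known up front.
    let offsets: Vec<usize> = (0..num_rows).map(|r| r * row_size).collect();
    let lengths = vec![row_size; num_rows];

    // Column-by-column: each source column is read sequentially.
    for (c, col) in columns.iter().enumerate() {
        let field = bitset + 8 * c;
        for (r, v) in col.iter().take(num_rows).enumerate() {
            let at = r * row_size + field;
            buf[at..at + 8].copy_from_slice(&v.to_le_bytes());
        }
    }
    (buf, offsets, lengths)
}
```

Note that writes into `buf` are still strided by `row_size`, which is the limitation the commit message points out: the row format interleaves fields, so a whole column cannot be copied with a single memcpy.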
- Add fixedWidthOnlyBenchmark() with only fixed-width types (no strings) to test the native C2R fast path that pre-allocates buffers
- Refactor all benchmark methods to use addC2RBenchmarkCases() helper, reducing ~110 lines of duplicated code

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…e allocations

- Add direct-write functions (write_struct_to_buffer, write_list_to_buffer, write_map_to_buffer) that write directly to output buffer
- Remove legacy functions that returned intermediate Vec<u8> objects
- Eliminates memory allocation per complex type value

Benchmark improvements:
- Struct: 604ms → 330ms (1.8x faster)
- Array: 580ms → 410ms (1.4x faster)
- Map: 1141ms → 705ms (1.6x faster)
- Complex Nested: 1434ms → 798ms (1.8x faster)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add memcpy-style bulk copying for arrays of primitive types without nulls. When array elements are fixed-width primitives (Int8, Int16, Int32, Int64, Float32, Float64, Date32, Timestamp) and have no null values, copy the entire values buffer at once instead of iterating element by element.

Benchmark improvement for Array Types:
- Before: 410ms (0.5X of Spark)
- After: 301ms (0.7X of Spark)
- 27% faster

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
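A minimal sketch of the two strategies, assuming a null-free slice of `i32` and native byte order. Both function names are hypothetical. The bulk version reinterprets the slice as bytes and appends it in one call, which is the memcpy-style copy the commit describes; the element-wise version is shown only for comparison.

```rust
/// Element-by-element copy: one small append per value.
fn copy_one_by_one(values: &[i32], out: &mut Vec<u8>) {
    for v in values {
        out.extend_from_slice(&v.to_ne_bytes());
    }
}

/// Bulk copy: view the whole values slice as bytes and append it at once.
fn copy_bulk(values: &[i32], out: &mut Vec<u8>) {
    // SAFETY: i32 has no padding bytes and every bit pattern is a valid u8,
    // so viewing the slice's memory as bytes is sound. The byte slice covers
    // exactly `size_of_val(values)` bytes of the same allocation.
    let bytes = unsafe {
        std::slice::from_raw_parts(
            values.as_ptr() as *const u8,
            std::mem::size_of_val(values),
        )
    };
    out.extend_from_slice(bytes);
}
```

Both functions produce identical bytes; the bulk variant simply replaces a per-element loop with a single contiguous copy, which only works when the source and destination element widths match.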
Move type dispatch outside the inner row loop by pre-downcasting all arrays to typed variants before processing. This eliminates the O(rows * columns * type_dispatch_cost) overhead in the general path.

Adds TypedArray enum with variants for all supported types, with methods for null checking, fixed-value extraction, and variable-length writing that operate directly on concrete array types.

Benchmark improvements:
- Primitive Types: 201ms → 126ms (37% faster, 0.5X → 0.7X)
- String Types: 164ms → 120ms (27% faster, 1.0X → 1.4X)
- Wide Rows: 1242ms → 737ms (41% faster, 0.6X → 1.0X)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
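The pre-downcast pattern can be illustrated without Arrow by using `std::any::Any` as a stand-in for a type-erased array reference. In this sketch, `TypedColumn` plays the role that the commit assigns to `TypedArray`, but the enum, its two variants, and `sum_rows` are hypothetical and far simpler than the real thing. The dynamic downcast happens once per column, and the row loop only matches on an enum.

```rust
use std::any::Any;

/// Hypothetical typed view of a column, resolved once per batch.
enum TypedColumn<'a> {
    Int32(&'a Vec<i32>),
    Int64(&'a Vec<i64>),
}

impl<'a> TypedColumn<'a> {
    /// Perform the dynamic downcast a single time, reporting a clear error
    /// instead of panicking when the type is not supported.
    fn from_any(col: &'a dyn Any) -> Result<Self, String> {
        if let Some(v) = col.downcast_ref::<Vec<i32>>() {
            Ok(TypedColumn::Int32(v))
        } else if let Some(v) = col.downcast_ref::<Vec<i64>>() {
            Ok(TypedColumn::Int64(v))
        } else {
            Err("unsupported column type".to_string())
        }
    }

    /// Cheap per-row access: an enum match, with no downcast involved.
    fn value_as_i64(&self, row: usize) -> i64 {
        match self {
            TypedColumn::Int32(v) => v[row] as i64,
            TypedColumn::Int64(v) => v[row],
        }
    }
}

/// Sum the values of each row across all columns.
fn sum_rows(columns: &[Box<dyn Any>], num_rows: usize) -> Result<Vec<i64>, String> {
    // Downcast every column before entering the row loop.
    let typed = columns
        .iter()
        .map(|c| TypedColumn::from_any(&**c))
        .collect::<Result<Vec<_>, _>>()?;
    Ok((0..num_rows)
        .map(|row| typed.iter().map(|col| col.value_as_i64(row)).sum::<i64>())
        .collect())
}
```

The design trade-off is a small up-front allocation for the typed views in exchange for removing a downcast from every cell visited in the inner loop.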
- Use correct Arrow array types for bulk copy (Date32Array instead of Int32Array, TimestampMicrosecondArray instead of Int64Array)
- Add Boolean array support to bulk copy path (element-by-element but still avoiding type dispatch overhead)
- Enable bulk copy for arrays with nulls - copy values buffer then set null bits separately (null slots contain garbage but won't be read)
- Restore fixed-width value writing in slow path for unsupported types (e.g., Decimal128 in arrays)

This fixes the fuzz test failure where Date32 arrays in maps were producing incorrect values due to failed downcast falling through to an incomplete slow path.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements Velox-style optimizations for array and map conversion:

1. **TypedElements enum**: Pre-downcast element arrays once to avoid type dispatch in inner loops
2. **Direct offset access**: Use ListArray/MapArray offsets directly instead of calling value(row_idx) which allocates a sliced ArrayRef
3. **Range-based bulk copy**: Copy element ranges directly from the underlying values buffer using pointer arithmetic

Benchmark improvements:
- Array Types: 274ms → 163ms (40% faster, 0.8X → 1.4X)
- Map Types: 605ms → 292ms (52% faster, 0.6X → 1.4X)
- Complex Nested: 701ms → 410ms (42% faster, 0.6X → 1.2X)

Native C2R now matches or beats Comet JVM for array/map types.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
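Direct offset access can be sketched with plain slices. A list column is modelled here as an `offsets` slice of length `num_rows + 1` plus one flat `values` slice, which mirrors how Arrow list arrays are laid out; the function names are hypothetical. Row `i` is the range `values[offsets[i]..offsets[i + 1]]`, so no per-row array object has to be created.

```rust
/// Borrow the elements of one list row straight from the flat values slice.
fn list_row<'a>(offsets: &[i32], values: &'a [i32], row: usize) -> &'a [i32] {
    let start = offsets[row] as usize;
    let end = offsets[row + 1] as usize;
    &values[start..end]
}

/// Write every row's elements to `out` (little-endian) and return the number
/// of bytes written per row, walking adjacent offset pairs.
fn write_list_rows(offsets: &[i32], values: &[i32], out: &mut Vec<u8>) -> Vec<usize> {
    let mut lengths = Vec::with_capacity(offsets.len().saturating_sub(1));
    for w in offsets.windows(2) {
        let (start, end) = (w[0] as usize, w[1] as usize);
        for v in &values[start..end] {
            out.extend_from_slice(&v.to_le_bytes());
        }
        lengths.push((end - start) * 4);
    }
    lengths
}
```

With offsets `[0, 2, 2, 5]` the second row is empty, which falls out of the range arithmetic without a special case.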
Remove Vec allocation overhead by using inline type dispatch for struct fields instead of pre-collecting into a Vec<TypedElements>. This improves struct type performance from 357ms to 272ms (24% faster).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pre-downcast all struct field columns into TypedElements at batch initialization time (in TypedArray::from_array). This eliminates per-row type dispatch overhead for struct fields.

Performance improvement for struct types:
- Before: 272ms (0.8X of Spark)
- After: 220ms (1.0X of Spark, matching Spark performance)

The pre-downcast pattern is now consistently applied to:
- Top-level columns (TypedArray)
- Array/List elements (TypedElements)
- Map keys/values (TypedElements)
- Struct fields (TypedElements) - NEW

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pre-compute variable-length column indices once per batch instead of calling is_variable_length() for every column in every row. In pass 2, only iterate over variable-length columns using the pre-computed indices.

Also skip writing placeholder values for variable-length columns in pass 1, since they will be overwritten in pass 2.

Performance improvement for primitive types (mixed with strings):
- Before: 131ms (0.8X of Spark)
- After: ~114ms (0.9X of Spark)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
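A small sketch of the index pre-computation, using a hypothetical two-variant `Column` enum in place of real Arrow arrays. The variable-length column positions are collected once per batch, and the second pass visits only those columns for each row.

```rust
/// Hypothetical column model for this illustration.
enum Column {
    Fixed(Vec<i64>),
    Var(Vec<String>),
}

/// Computed once per batch: positions of the variable-length columns.
fn variable_length_indices(columns: &[Column]) -> Vec<usize> {
    columns
        .iter()
        .enumerate()
        .filter(|(_, c)| matches!(c, Column::Var(_)))
        .map(|(i, _)| i)
        .collect()
}

/// Pass 2 of a two-pass conversion: total variable-length bytes per row,
/// touching only the pre-computed columns.
fn var_bytes_per_row(columns: &[Column], num_rows: usize) -> Vec<usize> {
    let var_cols = variable_length_indices(columns);
    (0..num_rows)
        .map(|row| {
            var_cols
                .iter()
                .map(|&i| match &columns[i] {
                    Column::Var(v) => v[row].len(),
                    // Not reachable via `var_cols`; kept so the match is total.
                    Column::Fixed(_) => 0,
                })
                .sum::<usize>()
        })
        .collect()
}
```

For a wide schema with only a few string columns, the inner loop shrinks from all columns to just the variable-length ones.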
- Add #[allow(clippy::too_many_arguments)] to write_elements_slow
- Remove unused functions that were added during development:
  - write_variable_length_to_buffer
  - get_element_size
  - try_bulk_copy_primitive_array_with_nulls
  - write_array_data_to_buffer
  - write_array_data_to_buffer_for_map
- Remove #[inline] from write_struct_to_buffer (too large/complex)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Address review feedback: the #[inline] hint doesn't make sense for a function with macro-generated match arms. Let the compiler decide.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Member
Author
@sqlbenchmark run tpch
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3228      +/-   ##
============================================
+ Coverage     56.12%   56.33%   +0.20%
- Complexity      976     1362     +386
============================================
  Files           119      175      +56
  Lines         11743    16086    +4343
  Branches       2251     2655     +404
============================================
+ Hits           6591     9062    +2471
- Misses         4012     5715    +1703
- Partials       1140     1309     +169
☔ View full report in Codecov by Sentry.
When an array is dictionary-encoded, store the actual array type instead of the schema type in TypedArray::Dictionary. This fixes the error "Expected Dictionary type but got Binary" that occurred when processing BloomFilter columns with native_comet scan implementation.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…pt' into native-c2r-enabled
When native columnar-to-row conversion is enabled (now the default), CometNativeColumnarToRowExec is used instead of CometColumnarToRowExec. However, it was missing the doExecuteBroadcast implementation required for broadcast exchange operations, causing test failures.

Changes:
- Add doExecuteBroadcast implementation to CometNativeColumnarToRowExec that uses the native converter for broadcast data transformation
- Update CometExecSuite test to handle both CometColumnarToRowExec and CometNativeColumnarToRowExec
- Fix parent-child relationship check to account for InputAdapter wrapper nodes used by Spark's codegen
- Remove nodeName override from CometNativeColumnarToRowExec

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add NullVector to getFieldVector in Utils.scala to allow export
- Add DataType::Null handling in columnar_to_row.rs for native C2R
- Update withInfo test for new native C2R plan structure

This fixes the round test failure when scale is null, which produces a NullArray that needs to be handled by the native C2R path.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use local `root_op` variable instead of unwrapping `exec_context.root_op`
- Replace `is_some()` + `unwrap()` pattern with `if let Some(...)`

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
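The second bullet refers to a common Rust cleanup. The sketch below shows both forms side by side on a hypothetical `ExecContext` whose `root_op` is an `Option<String>`; the real field type in the PR is not shown in the commit message, so treat the struct as an assumption.

```rust
/// Hypothetical context type; the real `root_op` type is not shown here.
struct ExecContext {
    root_op: Option<String>,
}

/// Before: check with `is_some()`, then `unwrap()` separately.
fn describe_with_unwrap(ctx: &ExecContext) -> String {
    if ctx.root_op.is_some() {
        let op = ctx.root_op.as_ref().unwrap();
        format!("root: {}", op)
    } else {
        "no root".to_string()
    }
}

/// After: `if let` binds the inner value in the same step, so there is no
/// `unwrap()` that could panic if the check and the access ever drift apart.
fn describe_with_if_let(ctx: &ExecContext) -> String {
    if let Some(op) = &ctx.root_op {
        format!("root: {}", op)
    } else {
        "no root".to_string()
    }
}
```

Both functions behave identically; the `if let` form is what Clippy's `unnecessary_unwrap` lint steers towards.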
…version

- Add automatic unpacking of dictionary-encoded arrays when schema expects non-dictionary type. This fixes failures when Parquet returns dictionary-encoded decimals but the conversion expects Decimal128Array.
- Improve error messages for all downcast failures to include the actual array type, making debugging easier.
- Fix dead_code warning by changing TypedArray::Null variant to unit type since the array reference was never used.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When all values in a column are null, Arrow/Parquet may return a NullArray instead of the expected typed array (e.g., Int8Array). This adds casting of NullArray to the expected schema type, fixing the "Failed to downcast to Int8Array, actual type: Null" error.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Handle FixedSizeBinary data type in native columnar-to-row conversion to fix CI failure when processing FixedSizeBinary(3) arrays.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When unpacking dictionary-encoded arrays, cast to the schema's expected type instead of the dictionary's internal value type. This fixes decimal value corruption (2x multiplication) when reading dictionary-encoded decimals from Parquet.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Which issue does this PR close?
Builds on #3221
Closes #.
Rationale for this change
Enable native columnar to row by default and see if any tests fail.
What changes are included in this PR?
doBroadcastExchange
How are these changes tested?