Skip to content

Conversation

@kosiew
Copy link
Contributor

@kosiew kosiew commented Jan 7, 2026

Which issue does this PR close?

Rationale for this change

DataFusion’s struct casting and some coercion paths were effectively positional: when two structs had the same field types but different field orders, casting could silently swap values. This is surprising to users and can lead to silent data corruption (e.g. {b: 3, a: 4}::STRUCT(a INT, b INT) yielding {a: 3, b: 4}).

The goal of this PR is to make struct casting behavior match user expectations by matching fields by name (case-sensitive) and recursively applying the same logic to nested structs, while keeping a compatible fallback for structs with no shared field names.

What changes are included in this PR?

  • Name-based struct casting implementation in datafusion_common::nested_struct:

    • Match struct fields by name, reorder to match target schema, recursively cast nested structs.
    • Fill missing target fields with null arrays.
    • Ignore extra source fields.
    • Positional mapping fallback when there is no name overlap and field counts match (avoids breaking struct(1, 'x')::STRUCT(a INT, b VARCHAR) style casts).
    • Improved handling for NULL / all-null struct inputs by producing a correctly typed null struct array.
    • Centralized validation via validate_field_compatibility and helper fields_have_name_overlap.
  • Ensure struct casting paths use the name-based logic:

    • ScalarValue::cast_to_with_options: route Struct casts through nested_struct::cast_column.
    • ColumnarValue::cast_to: for Struct targets, cast via nested_struct::cast_column; non-struct casts still use Arrow’s standard casting.
  • Type coercion improvements for structs in binary operators / CASE:

    • When two structs have at least one overlapping name, coerce by name.
    • Otherwise, preserve prior behavior by coercing positionally.
  • Planning-time cast validation for struct-to-struct:

    • physical-expr CAST planning now validates struct compatibility using the same rules as runtime (validate_struct_compatibility) to fail fast.
    • ExprSchemable allows struct-to-struct casts to pass type checking; detailed compatibility is enforced by the runtime / planning-time validator.
  • Optimizer safety:

    • Avoid const-folding struct casts when field counts differ.
    • Avoid const-folding casts of 0-row struct literals due to evaluation batch dimension mismatches.
  • Tests and SQL logic tests:

    • New unit tests covering:

      • name-based reordering
      • missing fields (nullable vs non-nullable)
      • null struct fields and nested nulls
      • positional fallback with no overlap
      • coercion behavior and simplifier behavior
    • Updated/added .slt cases to reflect the new semantics and to add coverage for struct casts and nested struct reordering.

  • Minor docs/maintenance:

    • Adjusted doc comment referencing ParquetWriterOptions so it doesn’t break when the parquet feature is disabled.

Are these changes tested?

Yes.

  • Added/updated Rust unit tests in:

    • datafusion/common/src/nested_struct.rs
    • datafusion/expr-common/src/columnar_value.rs
    • datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs
  • Added/updated SQL logic tests in:

    • datafusion/sqllogictest/test_files/case.slt
    • datafusion/sqllogictest/test_files/struct.slt

These tests cover:

  • correct value mapping when struct field order differs
  • nested struct reordering
  • insertion of nulls for missing nullable fields
  • erroring on missing non-nullable target fields
  • positional mapping fallback when there is no name overlap
  • planning-time validation vs runtime behavior alignment

Are there any user-facing changes?

Yes.

  • Struct casts are now name-based (case-sensitive): fields are matched by name, reordered to the target schema, missing fields are null-filled (if nullable), and extra fields are ignored.
  • Fallback behavior: if there is no name overlap and field counts match, casting proceeds positionally.
  • Potential behavior change in queries relying on the prior positional behavior when structs shared names but were out of order (previously could yield swapped values). This PR changes that to the safer, expected behavior.

No public API changes are introduced, but this is a semantic change in struct casting.

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

Route ColumnarValue::cast_to through a name-based struct casting
path for both array and scalar struct values. Introduce a helper
to reorder struct children by target field names, insert nulls
for missing fields, and recursively cast each child with
Arrow options. Add unit tests to verify struct field reordering
and null-filling for missing fields when casting between
struct schemas.
…::cast_to

Add comprehensive documentation explaining that struct casting uses field name
matching rather than positional matching. This clarifies the behavior change
for struct types while preserving existing documentation for other types.

Addresses PR review recommendation #4 about documenting public API changes.
Replace .clone() with Arc::clone() to address clippy warning about
clone_on_ref_ptr. This makes the ref-counting operation explicit and
follows Rust best practices.

Fixes clippy error from rust_lint.sh.
## Problem
The PR to fix struct casting (issue apache#14396) introduced regressions where struct
casting with field additions/removals was failing, and name-based field matching
wasn't working correctly in all scenarios.

## Root Causes Identified
1. **Field index mismatch**: cast_struct_array_by_name was using field indices from
   the DataType instead of the actual StructArray, causing wrong column access when
   field names didn't match physical layout.

2. **Missing fallback logic**: When source and target had no overlapping field names
   (e.g. {c0, c1} → {a, b}), name-based matching failed silently, returning NULLs.
   Added fallback to positional casting for non-overlapping fields.

3. **Optimizer const-folding issue**: ScalarValue::cast_to_with_options was calling
   Arrow's cast_with_options directly, which doesn't support struct field count
   changes. The optimizer's simplify_expressions rule would fail when trying to
   fold struct casts at compile time.

4. **Validation rejection**: The logical planner's can_cast_types check rejected
   struct-to-struct casts with mismatched field counts before execution. Added
   special handling to allow all struct-to-struct casts (validation at runtime).

## Solution
- Created datafusion/common/src/struct_cast.rs with shared name-based struct
  casting logic for both runtime (ColumnarValue) and optimization-time (ScalarValue)
- Updated ScalarValue::cast_to_with_options to use the name-based struct casting
- Updated ColumnarValue::cast_to to use the shared logic
- Updated Expr::cast_to validation to allow struct-to-struct casts
- Added fallback to positional casting when field names don't overlap
- Fixed struct array field access to use actual StructArray fields, not DataType fields
- Updated tests to reflect new behavior and correct syntax issues

## Behavior Changes
Struct casts now work correctly with:
- Field reordering: {b: 3, a: 4} → STRUCT(a INT, b INT) → {a: 4, b: 3}
- Field additions: {a: 1} → STRUCT(a INT, b INT) → {a: 1, b: NULL}
- Field removals: {a: 1, b: 2, c: 3} → STRUCT(a INT, b INT) → {a: 1, b: 2}
- Fallback to positional casting when no field names overlap

## Files Modified
- datafusion/common/src/lib.rs: Added struct_cast module
- datafusion/common/src/struct_cast.rs: New shared struct casting logic
- datafusion/common/src/scalar/mod.rs: Use name-based struct casting
- datafusion/expr-common/src/columnar_value.rs: Delegate to shared casting logic
- datafusion/expr/src/expr_schema.rs: Allow struct-to-struct casts through validation
- datafusion/sqllogictest/test_files/struct.slt: Fixed tests and added new ones
…ype comparison

This aligns the type coercion logic with the new name-based struct casting
semantics introduced for explicit CAST operations in issue apache#14396.

Changes:
- struct_coercion in binary.rs now attempts name-based field matching first
- Falls back to positional matching when no field names overlap
- Requires matching field counts for successful coercion
- Preserves left-side field names and order when using name-based matching

This fixes cases where CASE expressions with structs having the same fields
in different orders now correctly match fields by name rather than position.

Note: Several CASE expression tests with struct field reordering are disabled
due to const-folding optimizer hang when evaluating struct literals with
different field orders. This requires separate investigation and fix.
…-folding

This fixes the TODO tests in struct.slt that were causing optimizer hangs
when attempting to const-fold struct literal casts with field count mismatches.

Changes:
1. Modified expr_simplifier.rs can_evaluate() to skip const-folding for
   struct CAST/TryCAST expressions where source and target have different
   field counts. This prevents the optimizer from attempting to evaluate
   these at plan time, deferring to execution time instead.

2. Modified cast.rs cast_with_options() to allow all struct-to-struct casts
   at physical planning time, even when Arrow's can_cast_types rejects them.
   These casts are handled by name-based casting at execution time via
   ColumnarValue::cast_to.

3. Uncommented and fixed TODO tests in struct.slt:
   - CAST({a: 1} AS STRUCT(a INT, b INT)) - adds NULL for missing field b
   - CAST({a: 1, b: 2, extra: 3} AS STRUCT(a INT, b INT)) - ignores extra field

The fix ensures that:
- Optimizer doesn't hang trying to const-fold unsupported struct casts
- Physical planner accepts struct-to-struct casts with field count changes
- Execution uses name-based casting to handle field reordering and NULLs
This uncomments and fixes the TODO tests for struct coercion in CASE expressions
that were previously disabled due to concerns about optimizer hangs.

Changes:
1. Uncommented the TODO test section for struct coercion with different field orders.
   Tests now verify that name-based struct coercion works correctly in CASE expressions.

2. Updated test expectations to match actual behavior:
   - When THEN branch executes, result uses THEN branch's field order
   - When ELSE branch executes, result uses ELSE branch's field order
   - Struct coercion requires equal field counts - mismatch causes planning error

3. Added explicit test for field count mismatch case:
   - Verifies that coercing structs with different field counts (2 fields vs 1 field)
     correctly fails during type coercion with appropriate error message

The tests now pass because:
- The optimizer fix from the previous commit prevents const-folding hangs
- Name-based struct coercion in struct_coercion function handles field reordering
- Type coercion correctly rejects field count mismatches during planning
Remove duplicate struct_cast.rs module and use the existing
nested_struct::cast_struct_column implementation instead. This
eliminates code duplication and provides a single source of truth
for struct field-by-name casting logic.

Changes:
- Add public cast_struct_array_by_name wrapper in nested_struct.rs
- Update columnar_value.rs to use nested_struct::cast_struct_array_by_name
- Update scalar/mod.rs to use nested_struct::cast_struct_array_by_name
- Remove struct_cast module from lib.rs
- Delete datafusion/common/src/struct_cast.rs

Benefits:
- Single implementation to maintain and test
- Consistent behavior across all struct casting operations
- Reduced maintenance burden for future bug fixes
- Better code cohesion in nested_struct module
@github-actions github-actions bot added logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) common Related to common crate labels Jan 7, 2026
kosiew added 14 commits January 13, 2026 19:48
Implement name-overlap detection with positional fallback and clearer
ambiguity errors for non-overlapping cases. Enhance unit tests and
SQLLogicTest coverage to include tests for positional struct casting
and scenarios lacking name overlap.
Implement null fast paths for struct casting and compatibility
checks to return null struct arrays for NULL-only inputs.
Allow NULL source fields during validation. Add a unit test
covering the scenario of casting a NULL struct field into a
nested struct target.
Return plan errors directly from optimizer rule failures and update
the related test expectations. Relax the SQLogicTest expectation for
struct cast mismatches to allow for either plan-error prefix
variant.
Consolidate duplicated logic in cast_struct_column by using a single
loop for both name-overlap and positional mapping cases. This change
selects the source child by either name or position and centralizes
the cast-and-error handling, improving code clarity and maintainability.
@kosiew
Copy link
Contributor Author

kosiew commented Jan 17, 2026

@comphead

Thanks for sharing the scala test results.

About overlap this is what Spark does when no overlap and field count is different
....
scala> spark.read.schema("str struct<col3 int, col4 int, col5 int>").parquet("/tmp/test_struct").show(false)
+----+
|str |
+----+
|NULL|
+----+

In other words, when there is no field name overlap AND field counts differ, Spark treats the entire struct as incompatible and returns NULL rather than attempting positional or partial casting.

In this PR, we return an error.


With Overlap (Partial Field Match)
...
Spark Behavior:

// Source in Parquet: {col1: int, col2: int}
// Target Schema: {col3: int, col4: int, col5: int, col1: int}
// Result: {NULL, NULL, NULL, 1}

spark.read.schema("str struct<col3 int, col4 int, col5 int, col1: int>").parquet("/tmp/test_struct").show(false)
// Output: {NULL, NULL, NULL, 1}

When at least one field name matches (in this case, col1), Spark:

  • Locates the matching source field by name
  • Fills non-matching target fields with NULL
  • Preserves the matched field value in the correct position

This PR has the same behaviour.

@kosiew kosiew marked this pull request as draft January 17, 2026 03:39
@kosiew kosiew force-pushed the struct-casting-17285 branch from 4452d56 to 5a4ecb1 Compare January 17, 2026 04:51
/// that apply to all columns and optional column-specific overrides
///
/// Closely tied to [`ParquetWriterOptions`](crate::file_options::parquet_writer::ParquetWriterOptions).
/// Closely tied to `ParquetWriterOptions` (see `crate::file_options::parquet_writer::ParquetWriterOptions` when the "parquet" feature is enabled).
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to this PR.

Fix to enable CI to pass.

@kosiew kosiew marked this pull request as ready for review January 17, 2026 08:03
@adriangb
Copy link
Contributor

  1. Require at least one matching field - Eliminate positional fallback entirely

I agree with this.

The OP still says:

When two structs have the same set of field names (possibly in different order), coerce by name.
Otherwise, preserve prior behavior by coercing positionally.

Which contradicts this. Could you update the PR description to reflect the current state?

@kosiew
Copy link
Contributor Author

kosiew commented Jan 20, 2026

@adriangb

The OP still says:

> When two structs have the same set of field names (possibly in different order), coerce by name.
> Otherwise, preserve prior behavior by coercing positionally.

Which contradicts this. Could you update the PR description to reflect the current state?

This PR still coerce positionally as a fallback.
I will remove positional casting in a follow up PR.

@adriangb
Copy link
Contributor

adriangb commented Jan 20, 2026

@adriangb

The OP still says:

> When two structs have the same set of field names (possibly in different order), coerce by name.
> Otherwise, preserve prior behavior by coercing positionally.

Which contradicts this. Could you update the PR description to reflect the current state?

This PR still coerce positionally as a fallback. I will remove positional casting in a follow up PR.

My point is that we should use names as long as there is 1 matching name, like DuckDB does. If that's different from what Spark does... then we need a cast option and to support both.

@kosiew
Copy link
Contributor Author

kosiew commented Jan 20, 2026

@adriangb
Thanks for the clarification.

Amended it from:
When two structs have the same set of field names (possibly in different order), coerce by name.
to
When two structs have at least one overlapping name, coerce by name.

Copy link
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only issue I see (aside from minor comments) is the removed tests. I'm not seeing 1:1 replacements. Maybe that's fine and the behavior is actually covered by better structured tests, but I would consider leaving the old tests in place to be safe if they aren't hurting anything.

Ok(())
}

fn fields_have_name_overlap(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rename this to has_one_or_more_common_fields or something like that to make it clear the behavior in the name. Alternatively add a docstring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed - 2d5457b

///
/// ## Field Matching Strategy
/// - **By Name**: Source struct fields are matched to target fields by name (case-sensitive)
/// - **By Position**: When there is no name overlap and the field counts match, fields are cast by index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth clarifying we're going to remove this behavior if we indeed plan on doing so. Basically why not deprecate it as best we can in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove positional casting after this PR.
I actually did it on a local branch and reverted it because so that would be a stand-alone PR.

validate_struct_compatibility(source_struct.fields(), target_fields)?;
let source_fields = source_struct.fields();
let has_overlap = fields_have_name_overlap(source_fields, target_fields);
validate_struct_compatibility(source_fields, target_fields)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this function need to be adjusted at all to take mapping of fields by name into account?

Copy link
Contributor Author

@kosiew kosiew Jan 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No change needed.

I added more tests, though - 9f496dd

@@ -492,8 +492,7 @@ Struct("r": Utf8, "c": Float64)
statement ok
drop table t;

query error DataFusion error: Optimizer rule 'simplify_expressions' failed[\s\S]*Arrow error: Cast error: Cannot cast string 'a' to value of Float64 type
create table t as values({r: 'a', c: 1}), ({c: 2.3, r: 'b'});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a new test that checks implicit coercion during create table ... as values. Am I missing something? I think it should be added back if not.


# Create a list of struct with different field types
query ?
select [{r: 'a', c: 1}, {c: 2, r: 'b'}];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't see a new test with array literals (searched for select [{)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added - e5d5d7a


# create array with different struct type should be cast
query T
select arrow_typeof([a, b]) from t;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't actually see a test for the target type / casting of a dynamically created array (searched for [a and arrow_typeof([)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added tests - 8c0fe98

@@ -492,8 +492,7 @@ Struct("r": Utf8, "c": Float64)
statement ok
drop table t;

query error DataFusion error: Optimizer rule 'simplify_expressions' failed[\s\S]*Arrow error: Cast error: Cannot cast string 'a' to value of Float64 type
create table t as values({r: 'a', c: 1}), ({c: 2.3, r: 'b'});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And big picture the old tests check implicit coercion by creating arrays or having mismatched types in a values clause. The new tests always use an explicit cast call.

@kosiew kosiew marked this pull request as draft January 20, 2026 10:43
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jan 20, 2026
Copy link
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kosiew !

Edit: I see you marked as draft again, I guess there's some issue you discovered that needs resolving?

@kosiew kosiew marked this pull request as ready for review January 20, 2026 14:46
@kosiew kosiew requested a review from adriangb January 21, 2026 05:41
@kosiew
Copy link
Contributor Author

kosiew commented Jan 21, 2026

@adriangb
Adding the tests increased the PR size to more than 1000 lines.
If you like I can extract the changes after your approval to a separate PR

Copy link
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @kosiew for all of the extra tests. I think we can merge this as is, a good chunk of the line count is new tests (which is a good thing!) and I don't think warrants splitting of the PR.

Could you please open an issue to track further changes, e.g. deprecation of positional casting?

Comment on lines +38 to +42
## Coercion Paths Using Name-Based Matching

The following query operations use name-based field mapping for struct coercion:

### 1. Array Literal Construction
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like more just general documentation on where casts happen, I don't think it's specific to struct coercion. That said I don't know that this exist anywhere else / there is nowhere better to put it. I suggest we keep it here for now since you've written it and it is helpful and not wrong, and we can always refactor the docs later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate documentation Improvements or additions to documentation logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Wrong cast between structs with different field order

3 participants