perf: extend field-major processing to nested struct fields #3225

@andygrove

Summary

PR #3224 implements field-major processing for struct fields, which reduces the number of type dispatches from O(rows × fields) to O(fields). However, for complex types nested inside a struct (Struct, List, Map), it still falls back to row-major processing via append_field.

This issue tracks extending the field-major optimization to nested Struct fields specifically.
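To make the dispatch-count difference concrete, here is a minimal, self-contained sketch. It does not use the real code from row.rs or the Arrow builders: ToyType, ToyColumn, row_major and field_major are hypothetical stand-ins, and each row is assumed to be a plain Vec<i64>. Both functions produce the same columns, but they count a different number of type dispatches.

```rust
/// Toy stand-in for a field's data type (hypothetical, not Arrow's DataType).
#[derive(Clone, Copy)]
enum ToyType {
    Int,
    Bool,
}

/// Toy stand-in for a typed output column.
#[derive(Debug, PartialEq)]
enum ToyColumn {
    Int(Vec<i64>),
    Bool(Vec<bool>),
}

fn new_column(t: ToyType, capacity: usize) -> ToyColumn {
    match t {
        ToyType::Int => ToyColumn::Int(Vec::with_capacity(capacity)),
        ToyType::Bool => ToyColumn::Bool(Vec::with_capacity(capacity)),
    }
}

/// Row-major: the column type is matched again for every (row, field) pair.
fn row_major(schema: &[ToyType], rows: &[Vec<i64>]) -> (Vec<ToyColumn>, usize) {
    let mut cols: Vec<ToyColumn> = schema
        .iter()
        .map(|t| new_column(*t, rows.len()))
        .collect();
    let mut dispatches = 0;
    for row in rows {
        for (f, col) in cols.iter_mut().enumerate() {
            dispatches += 1; // one dispatch per row per field
            match col {
                ToyColumn::Int(v) => v.push(row[f]),
                ToyColumn::Bool(v) => v.push(row[f] != 0),
            }
        }
    }
    (cols, dispatches)
}

/// Field-major: the type is matched once per field, then all rows are
/// processed in a tight loop for that field.
fn field_major(schema: &[ToyType], rows: &[Vec<i64>]) -> (Vec<ToyColumn>, usize) {
    let mut cols = Vec::with_capacity(schema.len());
    let mut dispatches = 0;
    for (f, t) in schema.iter().enumerate() {
        dispatches += 1; // one dispatch per field
        let col = match t {
            ToyType::Int => ToyColumn::Int(rows.iter().map(|r| r[f]).collect()),
            ToyType::Bool => ToyColumn::Bool(rows.iter().map(|r| r[f] != 0).collect()),
        };
        cols.push(col);
    }
    (cols, dispatches)
}
```

With 2 fields and 3 rows, row_major performs 6 dispatches while field_major performs 2. The fallback described in this issue reintroduces the row-major pattern at every nested level.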

Current Behavior

In append_struct_fields_field_major() (row.rs), complex types fall back to per-row processing:

// For complex types (struct, list, map), fall back to append_field
// since they have their own nested processing logic
dt @ (DataType::Struct(_) | DataType::List(_) | DataType::Map(_, _)) => {
    for (row_idx, i) in (row_start..row_end).enumerate() {
        let nested_row = if struct_is_null[row_idx] {
            SparkUnsafeRow::default()
        } else {
            // ... extract nested row
        };
        append_field(dt, struct_builder, &nested_row, field_idx)?;
    }
}

Because append_field is called once per row, each nested level goes back to per-row type dispatch, so deeply nested structs lose the benefit of field-major processing at every nesting level.

Proposed Optimization

For nested Struct fields:

  1. Get the nested StructBuilder once per field
  2. Build nested struct validity in one pass
  3. Recursively apply field-major processing to nested struct fields

This would require refactoring to separate validity handling from field value processing.
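The three steps above could look roughly like the following sketch. It is a simplified model under stated assumptions, not the proposed implementation: NestedType, NestedValue, NestedColumn and build_fields_field_major are hypothetical names, rows are modeled as slices of values rather than SparkUnsafeRow, and a plain Vec<bool> stands in for the validity that a StructBuilder would track. The point it illustrates is the separation of validity handling (one pass per nested struct field) from field value processing (a recursive field-major call).

```rust
/// Hypothetical schema type: an integer leaf or a nested struct.
enum NestedType {
    Int,
    Struct(Vec<NestedType>),
}

/// Hypothetical row value. `Struct(None)` models a null struct.
enum NestedValue {
    Int(i64),
    Struct(Option<Vec<NestedValue>>),
}

/// Hypothetical columnar output. A struct column keeps its validity
/// separately from its child columns.
#[derive(Debug, PartialEq)]
enum NestedColumn {
    Int(Vec<Option<i64>>),
    Struct {
        validity: Vec<bool>,
        children: Vec<NestedColumn>,
    },
}

/// Builds one column per field. `rows[i]` is `None` when the enclosing
/// struct is null for row `i`, in which case every child gets a null.
fn build_fields_field_major(
    schema: &[NestedType],
    rows: &[Option<&[NestedValue]>],
) -> Vec<NestedColumn> {
    let mut columns = Vec::with_capacity(schema.len());
    for (f, field_type) in schema.iter().enumerate() {
        // Type dispatch happens once per field, not once per row.
        let column = match field_type {
            NestedType::Int => NestedColumn::Int(
                rows.iter()
                    .map(|row| match row {
                        Some(values) => match &values[f] {
                            NestedValue::Int(v) => Some(*v),
                            _ => None, // type mismatch treated as null in this sketch
                        },
                        None => None,
                    })
                    .collect(),
            ),
            NestedType::Struct(child_schema) => {
                // Steps 1 and 2: extract the nested rows for this field and
                // derive the nested struct validity in a single pass.
                let nested_rows: Vec<Option<&[NestedValue]>> = rows
                    .iter()
                    .map(|row| match row {
                        Some(values) => match &values[f] {
                            NestedValue::Struct(Some(inner)) => Some(inner.as_slice()),
                            _ => None,
                        },
                        None => None,
                    })
                    .collect();
                let validity = nested_rows.iter().map(|r| r.is_some()).collect();
                // Step 3: recurse, so the nested fields are also field-major.
                let children = build_fields_field_major(child_schema, &nested_rows);
                NestedColumn::Struct { validity, children }
            }
        };
        columns.push(column);
    }
    columns
}
```

In this model the number of type dispatches depends only on the total number of fields in the schema, regardless of row count or nesting depth. A null at an outer level propagates as nulls to all inner columns, which keeps every child column the same length as its parent.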

Expected Impact

  • 1.2-1.5x speedup for workloads with deeply nested struct types
  • Benefit multiplies with nesting depth
