Skip to content

Comet should gracefully handle OnHeapColumnVector instead of failing #3215

@andygrove

Description

@andygrove

Description

When a native Comet operator receives data from a Spark scan that produces OnHeapColumnVector instead of Arrow arrays, Comet fails with:

org.apache.spark.SparkException: Comet execution only takes Arrow Arrays, but got class org.apache.spark.sql.execution.vectorized.OnHeapColumnVector

This can happen when:

  1. The native scan (e.g., native_comet) doesn't support certain data types (like complex types)
  2. The scan falls back to Spark's Parquet reader
  3. A downstream native operator (like the native Parquet writer) receives the non-Arrow data

Reproduction

// With native Parquet write enabled but without COMET_SCAN_ALLOW_INCOMPATIBLE
withSQLConf(
  "spark.comet.parquet.write.enabled" -> "true",
  "spark.comet.exec.enabled" -> "true") {
  
  // Create data with complex types
  val df = Seq((1, Seq(1, 2, 3))).toDF("id", "values")
  
  // Write to parquet (without Comet)
  df.write.parquet("/tmp/input")
  
  // Read and write - this fails because native_comet scan doesn't support 
  // complex types, falls back to Spark reader, but downstream native writer
  // expects Arrow arrays
  spark.read.parquet("/tmp/input").write.parquet("/tmp/output")
}

Expected Behavior

Comet should either:

  1. Fall back the entire query to Spark when native operators would receive non-Arrow data
  2. Automatically insert conversion from OnHeapColumnVector to Arrow (using the existing spark.comet.convert.parquet.enabled mechanism)
  3. Provide a clearer error message explaining why this happened and how to fix it (e.g., "Enable spark.comet.scan.allowIncompatible to use native_iceberg_compat scan which supports complex types")

Current Workarounds

  1. Enable spark.comet.scan.allowIncompatible=true so that native_iceberg_compat scan is used (which supports complex types)
  2. Enable spark.comet.convert.parquet.enabled=true to convert Spark columnar data to Arrow

Context

This was discovered while adding complex type support to the native Parquet writer (#3214). The fix there uses COMET_SCAN_ALLOW_INCOMPATIBLE, but the underlying issue of ungraceful failure should be addressed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions