[Feature] Support Spark expression: percentile_cont #3190

@andygrove

Description

What is the problem the feature request solves?

Note: This issue was generated with AI assistance. The specification details have been extracted from Spark documentation and may need verification.

Comet does not currently support Spark's percentile_cont function, so queries that use it fall back to Spark's JVM execution instead of running natively on DataFusion.

PercentileCont computes the percentile of a numeric or ANSI interval column at a given percentage, assuming a continuous distribution. It implements the SQL PERCENTILE_CONT function, which uses linear interpolation between adjacent values when the exact percentile position falls between two data points. The expression is a runtime-replaceable aggregate that delegates to the internal Percentile implementation.

Supporting this expression would allow more Spark workloads to benefit from Comet's native acceleration.
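To make the linear-interpolation semantics concrete, here is a minimal Rust sketch of the computation described above. It is illustrative only (the function name `percentile_cont` and its shape are not from Comet, Spark, or DataFusion); it shows the standard continuous-percentile formula of interpolating at rank position `p * (n - 1)` over sorted values:

```rust
// Minimal sketch of PERCENTILE_CONT's linear interpolation; illustrative
// only, not the Comet/DataFusion implementation.
fn percentile_cont(values: &mut Vec<f64>, percentage: f64) -> Option<f64> {
    // Percentages outside [0.0, 1.0] are a validation error in Spark.
    assert!(
        (0.0..=1.0).contains(&percentage),
        "percentage must be in [0.0, 1.0]"
    );
    if values.is_empty() {
        return None; // empty input yields SQL NULL
    }
    values.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Exact rank position in a continuous distribution: p * (n - 1)
    let pos = percentage * (values.len() - 1) as f64;
    let lower = pos.floor() as usize;
    let upper = pos.ceil() as usize;
    let fraction = pos - lower as f64;
    // Linearly interpolate between the two nearest data points
    Some(values[lower] + fraction * (values[upper] - values[lower]))
}
```

For example, the median of `[10.0, 20.0, 30.0, 40.0]` falls at rank position 1.5, so the result interpolates halfway between 20 and 30, giving 25.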

Describe the potential solution

Spark Specification

Syntax:

PERCENTILE_CONT(percentage) WITHIN GROUP (ORDER BY column [ASC|DESC])
-- Used internally by Catalyst; not directly exposed in the DataFrame API
-- The DataFrame API uses percentile_approx for similar functionality

Arguments:

| Argument | Type | Description |
| --- | --- | --- |
| `left` | Expression | The column expression to calculate the percentile over (the ORDER BY clause) |
| `right` | Expression | The percentile value, between 0.0 and 1.0 |
| `reverse` | Boolean | Whether to reverse the ordering (DESC vs. ASC); defaults to false |

Return Type: Returns the same data type as the input column being aggregated (numeric types or ANSI interval types).

Supported Data Types:
Supports numeric data types and ANSI interval types as determined by the underlying Percentile implementation's input type validation.

Edge Cases:

  • Null handling is delegated to the underlying Percentile implementation
  • Empty input datasets return null
  • Percentile values outside the 0.0-1.0 range cause validation errors
  • Exactly one ordering column is required; multiple orderings throw a QueryCompilationError
  • The DISTINCT clause is not supported and is ignored
  • An UnresolvedWithinGroup state indicates an incomplete ordering specification

Examples:

-- Calculate median (50th percentile) of salary
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) as median_salary
FROM employees;

-- Calculate 95th percentile in descending order
SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_time DESC) as p95_response
FROM requests;

// PERCENTILE_CONT is an internal Catalyst expression; the nearest
// DataFrame API equivalent uses approxQuantile or percentile_approx
df.stat.approxQuantile("salary", Array(0.5), 0.0)

Implementation Approach

See the Comet guide on adding new expressions for detailed instructions.

  1. Scala Serde: Add expression handler in spark/src/main/scala/org/apache/comet/serde/
  2. Register: Add to appropriate map in QueryPlanSerde.scala
  3. Protobuf: Add message type in native/proto/src/proto/expr.proto if needed
  4. Rust: Implement in native/spark-expr/src/ (check if DataFusion has built-in support first)
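As a rough shape for step 4, the native side of an exact percentile typically buffers input values and computes the interpolated result at evaluation time. The sketch below is a simplified stand-in for DataFusion's accumulator pattern; the `SimpleAccumulator` trait and its method names are illustrative assumptions, not DataFusion's actual `Accumulator` API:

```rust
// Illustrative accumulator shape for an exact percentile_cont. The trait
// below is a hypothetical stand-in, not DataFusion's real Accumulator API.
trait SimpleAccumulator {
    fn update_batch(&mut self, values: &[f64]);
    fn merge(&mut self, other: &Self);
    fn evaluate(&self) -> Option<f64>;
}

struct PercentileContAccumulator {
    percentage: f64,  // validated to lie in [0.0, 1.0] during planning
    buffer: Vec<f64>, // exact percentiles require buffering all input values
}

impl PercentileContAccumulator {
    fn new(percentage: f64) -> Self {
        Self { percentage, buffer: Vec::new() }
    }
}

impl SimpleAccumulator for PercentileContAccumulator {
    fn update_batch(&mut self, values: &[f64]) {
        // Nulls are assumed to have been filtered out upstream
        self.buffer.extend_from_slice(values);
    }

    fn merge(&mut self, other: &Self) {
        // Partial aggregation: concatenate buffered state from other partitions
        self.buffer.extend_from_slice(&other.buffer);
    }

    fn evaluate(&self) -> Option<f64> {
        if self.buffer.is_empty() {
            return None; // empty group yields SQL NULL
        }
        let mut sorted = self.buffer.clone();
        sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
        let pos = self.percentage * (sorted.len() - 1) as f64;
        let (lo, hi) = (pos.floor() as usize, pos.ceil() as usize);
        Some(sorted[lo] + (pos - lo as f64) * (sorted[hi] - sorted[lo]))
    }
}
```

Note that, unlike approximate percentiles, this design holds every value in memory per group, which is the usual trade-off for exact PERCENTILE_CONT semantics.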

Additional context

Difficulty: Medium
Spark Expression Class: org.apache.spark.sql.catalyst.expressions.PercentileCont

Related:

  • Percentile - The underlying implementation class
  • PercentileDisc - Discrete percentile calculation
  • RuntimeReplaceableAggregate - Base trait for replaceable aggregates
  • SupportsOrderingWithinGroup - Interface for WITHIN GROUP syntax

This issue was auto-generated from Spark reference documentation.
