Description
What is the problem the feature request solves?
Note: This issue was generated with AI assistance. The specification details have been extracted from Spark documentation and may need verification.
Comet does not currently support the Spark percentile_cont function, causing queries using this function to fall back to Spark's JVM execution instead of running natively on DataFusion.
PercentileCont calculates a percentile value based on a continuous distribution of numeric or ANSI interval columns at a given percentage. It implements the SQL PERCENTILE_CONT function which uses linear interpolation between values when the exact percentile position falls between two data points. This expression is a runtime-replaceable aggregate that delegates to the internal Percentile implementation.
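To make the linear-interpolation semantics concrete, here is a minimal sketch of what PERCENTILE_CONT computes (illustrative only; names and structure are not taken from Comet's or DataFusion's actual implementation):

```rust
// Sketch of PERCENTILE_CONT semantics: sort the input, locate the continuous
// rank position p * (n - 1), and linearly interpolate between the two
// neighboring values when that position falls between data points.
fn percentile_cont(values: &mut Vec<f64>, percentage: f64) -> Option<f64> {
    assert!(
        (0.0..=1.0).contains(&percentage),
        "percentage must be between 0.0 and 1.0"
    );
    if values.is_empty() {
        return None; // empty input yields NULL
    }
    values.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let pos = percentage * (values.len() - 1) as f64;
    let lower = pos.floor() as usize;
    let upper = pos.ceil() as usize;
    let fraction = pos - lower as f64;
    // Linear interpolation between the two neighboring sorted values
    Some(values[lower] + (values[upper] - values[lower]) * fraction)
}

fn main() {
    let mut v = vec![10.0, 20.0, 30.0, 40.0];
    // p = 0.5 on four values falls between 20 and 30 -> 25.0
    println!("{:?}", percentile_cont(&mut v, 0.5));
}
```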
Supporting this expression would allow more Spark workloads to benefit from Comet's native acceleration.
Describe the potential solution
Spark Specification
Syntax:
```sql
PERCENTILE_CONT(percentage) WITHIN GROUP (ORDER BY column [ASC|DESC])
```

```scala
// Used internally by Catalyst - not directly exposed in DataFrame API
// The DataFrame API uses percentile_approx for similar functionality
```

Arguments:
| Argument | Type | Description |
|---|---|---|
| left | Expression | The column expression to calculate percentile over (ORDER BY clause) |
| right | Expression | The percentile value between 0.0 and 1.0 |
| reverse | Boolean | Whether to reverse the ordering (DESC vs ASC), defaults to false |
Return Type: Returns the same data type as the input column being aggregated (numeric types or ANSI interval types).
Supported Data Types:
Supports numeric data types and ANSI interval types as determined by the underlying Percentile implementation's input type validation.
Edge Cases:
- Null handling behavior is delegated to the underlying Percentile implementation
- Empty input datasets return null
- Percentile values outside 0.0-1.0 range cause validation errors
- Requires exactly one ordering column - throws QueryCompilationError for multiple orderings
- DISTINCT clause is not supported and will be ignored
- UnresolvedWithinGroup state indicates incomplete ordering specification
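The range-validation edge case can be sketched as follows (a hypothetical helper for illustration, not Comet's actual code; a native implementation would likely surface such failures as a `Result` rather than a panic):

```rust
// Sketch of the percentage-range validation rule: values outside
// [0.0, 1.0] are rejected with an error.
fn validate_percentage(p: f64) -> Result<f64, String> {
    if (0.0..=1.0).contains(&p) {
        Ok(p)
    } else {
        Err(format!("percentage must be between 0.0 and 1.0, got {p}"))
    }
}

fn main() {
    assert!(validate_percentage(0.95).is_ok());
    assert!(validate_percentage(1.5).is_err());
}
```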
Examples:
```sql
-- Calculate median (50th percentile) of salary
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary
FROM employees;

-- Calculate 95th percentile in descending order
SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY response_time DESC) AS p95_response
FROM requests;
```

```scala
// This is an internal Catalyst expression
// DataFrame API equivalent would use approxQuantile or percentile_approx
df.stat.approxQuantile("salary", Array(0.5), 0.0)
```

Implementation Approach
See the Comet guide on adding new expressions for detailed instructions.
- Scala Serde: Add expression handler in `spark/src/main/scala/org/apache/comet/serde/`
- Register: Add to the appropriate map in `QueryPlanSerde.scala`
- Protobuf: Add message type in `native/proto/src/proto/expr.proto` if needed
- Rust: Implement in `native/spark-expr/src/` (check if DataFusion has built-in support first)
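As a rough sketch of the Protobuf step, a message mirroring the arguments table above might look like the following. The message name, field names, and numbering are hypothetical, not taken from Comet's `expr.proto`:

```proto
// Hypothetical message shape for PercentileCont; the actual layout in
// native/proto/src/proto/expr.proto may differ.
message PercentileCont {
  Expr child = 1;       // the ORDER BY column ("left")
  Expr percentage = 2;  // the percentile value in [0.0, 1.0] ("right")
  bool reverse = 3;     // true for DESC ordering, defaults to false
}
```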
Additional context
Difficulty: Medium
Spark Expression Class: org.apache.spark.sql.catalyst.expressions.PercentileCont
Related:
- Percentile - The underlying implementation class
- PercentileDisc - Discrete percentile calculation
- RuntimeReplaceableAggregate - Base trait for replaceable aggregates
- SupportsOrderingWithinGroup - Interface for WITHIN GROUP syntax
This issue was auto-generated from Spark reference documentation.