Description
What is the problem the feature request solves?
Note: This issue was generated with AI assistance. The specification details have been extracted from Spark documentation and may need verification.
Comet does not currently support the Spark approximate_percentile function, causing queries using this function to fall back to Spark's JVM execution instead of running natively on DataFusion.
ApproximatePercentile is a Spark Catalyst aggregate expression that computes approximate percentiles of numeric data using the t-digest algorithm. It provides a memory-efficient way to estimate percentiles for large datasets without requiring exact sorting, trading precision for performance and memory usage.
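As background, the value being estimated is an ordinary percentile. A minimal Rust sketch of the exact computation (with a simplified rank rule, not Spark's precise tie-breaking or interpolation) makes the trade-off concrete: the approximate algorithm replaces the full sort below with a bounded-memory summary.

```rust
// Conceptual sketch only: computes a percentile exactly by sorting.
// The approximate algorithm emulates this with a fixed-size sketch,
// trading exactness for bounded memory.
fn exact_percentile(values: &[f64], p: f64) -> Option<f64> {
    if values.is_empty() {
        return None; // mirrors Spark: no non-null input yields null
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Simplified rank choice; Spark's exact index rules differ slightly.
    let idx = (p * (sorted.len() - 1) as f64).floor() as usize;
    Some(sorted[idx])
}
```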
Supporting this expression would allow more Spark workloads to benefit from Comet's native acceleration.
Describe the potential solution
Spark Specification
Syntax:

```sql
percentile_approx(col, percentage [, accuracy])
percentile_approx(col, array_of_percentages [, accuracy])
```

```scala
// DataFrame API
import org.apache.spark.sql.functions._
df.agg(expr("percentile_approx(column, 0.5, 10000)"))
```

Arguments:
| Argument | Type | Description |
|---|---|---|
| child | Expression | The column or expression to compute percentiles for |
| percentageExpression | Expression | Single percentile (0.0-1.0) or array of percentiles to compute |
| accuracyExpression | Expression | Optional accuracy parameter (default: 10000); higher values yield better accuracy at the cost of memory |
Return Type: Returns the same data type as the input column. If an array of percentiles is provided, returns an array of the input data type with containsNull = false.
Supported Data Types:
- Numeric types: ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType, DecimalType
- Temporal types: DateType, TimestampType, TimestampNTZType
- Interval types: YearMonthIntervalType, DayTimeIntervalType
Edge Cases:
- Null handling: Ignores null input values during computation
- Empty input: Returns null when no non-null values are processed
- Invalid percentages: Validates that percentages are between 0.0 and 1.0 inclusive
- Invalid accuracy: Requires accuracy to be positive and ≤ Int.MaxValue
- Non-foldable expressions: Percentage and accuracy parameters must be compile-time constants
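The validation rules above can be sketched as a small helper. This is a hypothetical function for illustration, not code from Comet or Spark:

```rust
// Hypothetical validation helper mirroring the edge-case rules listed
// above; names and error messages are illustrative.
fn validate_percentile_args(percentages: &[f64], accuracy: i64) -> Result<(), String> {
    // Accuracy must be positive and no larger than Int.MaxValue.
    if accuracy <= 0 || accuracy > i32::MAX as i64 {
        return Err(format!("accuracy must be in (0, {}], got {}", i32::MAX, accuracy));
    }
    // Each percentage must lie in [0.0, 1.0] inclusive.
    for &p in percentages {
        if !(0.0..=1.0).contains(&p) {
            return Err(format!("percentage must be between 0.0 and 1.0, got {}", p));
        }
    }
    Ok(())
}
```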
Examples:

```sql
-- Single percentile (median)
SELECT percentile_approx(salary, 0.5) as median_salary FROM employees;

-- Multiple percentiles with custom accuracy
SELECT percentile_approx(response_time, array(0.25, 0.5, 0.75, 0.95), 50000) as quartiles
FROM web_requests;

-- Using with GROUP BY
SELECT department, percentile_approx(salary, 0.9) as p90_salary
FROM employees
GROUP BY department;
```

```scala
// DataFrame API examples
import org.apache.spark.sql.functions._

// Single percentile
df.agg(expr("percentile_approx(amount, 0.5)").as("median"))

// Multiple percentiles
df.agg(expr("percentile_approx(latency, array(0.5, 0.95, 0.99))").as("percentiles"))

// With custom accuracy
df.agg(expr("percentile_approx(value, 0.95, 100000)").as("p95"))
```

Implementation Approach
See the Comet guide on adding new expressions for detailed instructions.
- Scala Serde: Add expression handler in `spark/src/main/scala/org/apache/comet/serde/`
- Register: Add to appropriate map in `QueryPlanSerde.scala`
- Protobuf: Add message type in `native/proto/src/proto/expr.proto` if needed
- Rust: Implement in `native/spark-expr/src/` (check if DataFusion has built-in support first; DataFusion ships an `approx_percentile_cont` aggregate that is worth evaluating for compatibility)
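To make the Rust step concrete, here is a minimal sketch of the update/merge/evaluate shape a native aggregate follows. Names are illustrative (this is not the actual DataFusion `Accumulator` trait), and a naive unbounded buffer stands in for a real bounded-memory sketch:

```rust
// Illustrative accumulator sketch; a real implementation would keep a
// bounded summary (e.g. a t-digest) instead of buffering every value.
struct PercentileAccumulator {
    values: Vec<f64>,
    percentage: f64,
}

impl PercentileAccumulator {
    fn new(percentage: f64) -> Self {
        Self { values: Vec::new(), percentage }
    }

    // Consume one input row; null (None) inputs are ignored.
    fn update(&mut self, v: Option<f64>) {
        if let Some(v) = v {
            self.values.push(v);
        }
    }

    // Combine partial state from another partition.
    fn merge(&mut self, other: PercentileAccumulator) {
        self.values.extend(other.values);
    }

    // Produce the final result; empty input yields null (None).
    fn evaluate(&mut self) -> Option<f64> {
        if self.values.is_empty() {
            return None;
        }
        self.values.sort_by(|a, b| a.partial_cmp(b).unwrap());
        // Simplified rank rule for illustration only.
        let idx = (self.percentage * (self.values.len() - 1) as f64).floor() as usize;
        Some(self.values[idx])
    }
}
```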
Additional context
Difficulty: Medium
Spark Expression Class: org.apache.spark.sql.catalyst.expressions.ApproximatePercentile
Related:
- Percentile: Exact percentile computation (more expensive but precise)
- ApproxQuantile: Similar approximate quantile functionality in DataFrame API
- Other aggregate functions: Count, Sum, Avg, Min, Max
This issue was auto-generated from Spark reference documentation.