Skip to content

[Feature] Support Spark expression: approximate_percentile #3189

@andygrove

Description

@andygrove

What is the problem the feature request solves?

Note: This issue was generated with AI assistance. The specification details have been extracted from Spark documentation and may need verification.

Comet does not currently support the Spark approximate_percentile function, causing queries using this function to fall back to Spark's JVM execution instead of running natively on DataFusion.

ApproximatePercentile is a Spark Catalyst aggregate expression that computes approximate percentiles of numeric data using the t-digest algorithm. It provides a memory-efficient way to estimate percentiles for large datasets without requiring exact sorting, trading precision for performance and memory usage.

Supporting this expression would allow more Spark workloads to benefit from Comet's native acceleration.

Describe the potential solution

Spark Specification

Syntax:

percentile_approx(col, percentage [, accuracy])
percentile_approx(col, array_of_percentages [, accuracy])
// DataFrame API
import org.apache.spark.sql.functions._
df.agg(expr("percentile_approx(column, 0.5, 10000)"))

Arguments:

Argument Type Description
child Expression The column or expression to compute percentiles for
percentageExpression Expression Single percentile (0.0-1.0) or array of percentiles to compute
accuracyExpression Expression Optional accuracy parameter (default: 10000). Higher values = more accuracy

Return Type: Returns the same data type as the input column. If an array of percentiles is provided, returns an array of the input data type with containsNull = false.

Supported Data Types:

  • Numeric types: ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType, DecimalType
  • Temporal types: DateType, TimestampType, TimestampNTZType
  • Interval types: YearMonthIntervalType, DayTimeIntervalType

Edge Cases:

  • Null handling: Ignores null input values during computation
  • Empty input: Returns null when no non-null values are processed
  • Invalid percentages: Validates that percentages are between 0.0 and 1.0 inclusive
  • Invalid accuracy: Requires accuracy to be positive and ≤ Int.MaxValue
  • Non-foldable expressions: Percentage and accuracy parameters must be compile-time constants

Examples:

-- Single percentile (median)
SELECT percentile_approx(salary, 0.5) as median_salary FROM employees;

-- Multiple percentiles with custom accuracy
SELECT percentile_approx(response_time, array(0.25, 0.5, 0.75, 0.95), 50000) as quartiles 
FROM web_requests;

-- Using with GROUP BY
SELECT department, percentile_approx(salary, 0.9) as p90_salary 
FROM employees 
GROUP BY department;
// DataFrame API examples
import org.apache.spark.sql.functions._

// Single percentile
df.agg(expr("percentile_approx(amount, 0.5)").as("median"))

// Multiple percentiles
df.agg(expr("percentile_approx(latency, array(0.5, 0.95, 0.99))").as("percentiles"))

// With custom accuracy
df.agg(expr("percentile_approx(value, 0.95, 100000)").as("p95"))

Implementation Approach

See the Comet guide on adding new expressions for detailed instructions.

  1. Scala Serde: Add expression handler in spark/src/main/scala/org/apache/comet/serde/
  2. Register: Add to appropriate map in QueryPlanSerde.scala
  3. Protobuf: Add message type in native/proto/src/proto/expr.proto if needed
  4. Rust: Implement in native/spark-expr/src/ (check if DataFusion has built-in support first)

Additional context

Difficulty: Medium
Spark Expression Class: org.apache.spark.sql.catalyst.expressions.ApproximatePercentile

Related:

  • Percentile: Exact percentile computation (more expensive but precise)
  • ApproxQuantile: Similar approximate quantile functionality in DataFrame API
  • Other aggregate functions: Count, Sum, Avg, Min, Max

This issue was auto-generated from Spark reference documentation.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions