[Feature] Support Spark expression: approximate_percentile #3189

## What is the problem the feature request solves?

> **Note:** This issue was generated with AI assistance. The specification details have been extracted from Spark documentation and may need verification.

Comet does not currently support the Spark `percentile_approx` function (alias `approx_percentile`, backed by the `ApproximatePercentile` expression), so queries that use it fall back to Spark's JVM execution instead of running natively on DataFusion.

ApproximatePercentile is a Spark Catalyst aggregate expression that computes approximate percentiles using a variant of the Greenwald-Khanna quantile summary (Spark's `QuantileSummaries`). It estimates percentiles for large datasets without sorting the full input, trading a bounded loss of precision for lower memory use and faster execution.

Supporting this expression would allow more Spark workloads to benefit from Comet's native acceleration.

## Describe the potential solution

### Spark Specification

**Syntax:**
```sql
percentile_approx(col, percentage [, accuracy])
percentile_approx(col, array_of_percentages [, accuracy])
```

```scala
// DataFrame API
import org.apache.spark.sql.functions._
df.agg(expr("percentile_approx(column, 0.5, 10000)"))
```

**Arguments:**
| Argument | Type | Description |
|----------|------|-------------|
| child | Expression | The column or expression to compute percentiles for |
| percentageExpression | Expression | Single percentile (0.0-1.0) or array of percentiles to compute |
| accuracyExpression | Expression | Optional positive integer literal (default: 10000). Higher values give better accuracy at the cost of memory; `1.0/accuracy` is the relative error of the approximation |
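
To make the accuracy parameter concrete, the sketch below computes the rank window an approximate result may fall in, reading `1.0/accuracy` as a rank error relative to the number of rows. The function name `rank_bounds` is hypothetical, and the rank-based reading is an assumption about how the error bound applies rather than something taken from Spark's source.

```rust
/// Hypothetical helper: the range of ranks an approximate answer may occupy,
/// ASSUMING the relative error `1.0 / accuracy` is measured in ranks over `n` rows.
fn rank_bounds(n: u64, percentage: f64, accuracy: u32) -> (f64, f64) {
    let n = n as f64;
    let eps = 1.0 / accuracy as f64; // relative error of the approximation
    let target = percentage * n; // rank an exact computation would aim for
    let slack = eps * n; // allowed deviation in ranks
    // Clamp the window to the valid rank range [0, n].
    ((target - slack).max(0.0), (target + slack).min(n))
}
```

Under this reading, a median over one million rows with the default accuracy of 10000 may be off by roughly 100 ranks in either direction.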

**Return Type:** Returns the same data type as the input column. If an array of percentiles is provided, returns an array of the input data type with `containsNull = false`.

**Supported Data Types:**
- **Numeric types**: ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType, DecimalType
- **Temporal types**: DateType, TimestampType, TimestampNTZType
- **Interval types**: YearMonthIntervalType, DayTimeIntervalType

**Edge Cases:**
- **Null handling**: Ignores null input values during computation
- **Empty input**: Returns null when no non-null values are processed
- **Invalid percentages**: Validates that percentages are between 0.0 and 1.0 inclusive
- **Invalid accuracy**: Requires accuracy to be positive and ≤ Int.MaxValue
- **Non-foldable expressions**: Percentage and accuracy parameters must be compile-time constants
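
The parameter checks listed above could be expressed roughly as follows. This is a minimal sketch, not Comet code: `ParamError` and `validate_params` are hypothetical names, and it assumes the percentage and accuracy literals have already been extracted from the plan as plain numbers.

```rust
/// Hypothetical error type for rejected parameters.
#[derive(Debug, PartialEq)]
enum ParamError {
    PercentageOutOfRange(f64),
    InvalidAccuracy(i64),
}

/// Hypothetical check mirroring the edge cases listed above:
/// accuracy must be positive and fit in a 32-bit signed integer,
/// and every percentage must lie in [0.0, 1.0].
fn validate_params(percentages: &[f64], accuracy: i64) -> Result<(), ParamError> {
    if accuracy <= 0 || accuracy > i32::MAX as i64 {
        return Err(ParamError::InvalidAccuracy(accuracy));
    }
    for &p in percentages {
        // NaN is not contained in any range, so it is rejected here too.
        if !(0.0..=1.0).contains(&p) {
            return Err(ParamError::PercentageOutOfRange(p));
        }
    }
    Ok(())
}
```

A serde-side check along these lines would let Comet fall back to Spark for inputs it cannot handle, rather than failing inside native code.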

**Examples:**
```sql
-- Single percentile (median)
SELECT percentile_approx(salary, 0.5) as median_salary FROM employees;

-- Multiple percentiles with custom accuracy
SELECT percentile_approx(response_time, array(0.25, 0.5, 0.75, 0.95), 50000) as response_time_percentiles
FROM web_requests;

-- Using with GROUP BY
SELECT department, percentile_approx(salary, 0.9) as p90_salary 
FROM employees 
GROUP BY department;
```

```scala
// DataFrame API examples
import org.apache.spark.sql.functions._

// Single percentile
df.agg(expr("percentile_approx(amount, 0.5)").as("median"))

// Multiple percentiles
df.agg(expr("percentile_approx(latency, array(0.5, 0.95, 0.99))").as("percentiles"))

// With custom accuracy
df.agg(expr("percentile_approx(value, 0.95, 100000)").as("p95"))
```

### Implementation Approach

See the [Comet guide on adding new expressions](https://datafusion.apache.org/comet/contributor-guide/adding_a_new_expression.html) for detailed instructions.

1. **Scala Serde**: Add expression handler in `spark/src/main/scala/org/apache/comet/serde/`
2. **Register**: Add to appropriate map in `QueryPlanSerde.scala`
3. **Protobuf**: Add message type in `native/proto/src/proto/expr.proto` if needed
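4. **Rust**: Implement in `native/spark-expr/src/` (check whether DataFusion has built-in support first; its `approx_percentile_cont` aggregate is t-digest based, so its results may not match Spark's exactly)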
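
When testing a native implementation, an exact reference with the same update/merge/evaluate shape can serve as an oracle. The sketch below is such a reference and nothing more: `ExactPercentileAcc` is a hypothetical type, it is not the DataFusion `Accumulator` trait, it keeps every value in memory, and it assumes a nearest-rank definition of the exact percentile that Spark's approximate result is only expected to be close to.

```rust
/// Hypothetical exact reference accumulator for tests (NOT a Comet or DataFusion API).
#[derive(Default)]
struct ExactPercentileAcc {
    values: Vec<f64>,
}

impl ExactPercentileAcc {
    /// Add one input value; nulls (`None`) are ignored.
    fn update(&mut self, v: Option<f64>) {
        if let Some(x) = v {
            self.values.push(x);
        }
    }

    /// Combine partial state from another partition.
    fn merge(&mut self, other: ExactPercentileAcc) {
        self.values.extend(other.values);
    }

    /// Returns `None` when no non-null values were seen, otherwise one input
    /// value per requested percentage (nearest-rank, an assumed definition).
    /// Percentages are assumed to be already validated to lie in [0.0, 1.0].
    fn evaluate(&self, percentages: &[f64]) -> Option<Vec<f64>> {
        if self.values.is_empty() {
            return None;
        }
        let mut sorted = self.values.clone();
        sorted.sort_by(|a, b| a.total_cmp(b));
        let n = sorted.len();
        Some(
            percentages
                .iter()
                .map(|&p| {
                    // 1-based rank, clamped so that p = 0.0 maps to the minimum.
                    let rank = (p * n as f64).ceil() as usize;
                    sorted[rank.clamp(1, n) - 1]
                })
                .collect(),
        )
    }
}
```

Because the result is always an element of the input, this shape matches the return type described in the specification; a real implementation would replace the `Vec<f64>` with a bounded-size summary.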


## Additional context

**Difficulty:** Medium
**Spark Expression Class:** `org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile`

**Related:**
- **Percentile**: Exact percentile computation (more expensive but precise)
- **approxQuantile**: Similar approximate quantile functionality in the DataFrame API (`DataFrameStatFunctions.approxQuantile`)
- Other aggregate functions: Count, Sum, Avg, Min, Max

---
*This issue was auto-generated from Spark reference documentation.*

