SPEC: Add SQL UDF spec #14117

flyrain · 2025-09-19T07:01:12Z

Dev mailing thread: https://lists.apache.org/list?dev@iceberg.apache.org:lte=1M:versioned%20UDF
Design docs:

I have updated the metadata structure per last meeting. Here is the latest structure in a nutshell. Please use the PRs as the source of truth.

format/udf-spec.md

stevenzwu · 2025-10-13T22:06:01Z

format/udf-spec.md

+| *optional*  | `doc`  | `string` | Parameter documentation. |
+
+Notes:
+1. The `name` and `type` of a `parameter` are immutable. To change them, a new overload must be created. Only the optional documentation field (`doc`) can be updated in-place.


Should `name` be immutable? Typically a function signature (as in Java) doesn't include parameter names.

The name itself doesn't have to be immutable for callers, since the order of parameters matters more. Names are mainly used by the versioned representations, so they should be consistent across versions; otherwise rollback would be problematic. For example, we need to keep the name the same when adding or rolling back versions.

{ "name": "x", "type": "int", "doc": "Input integer" } ... "overload-version-id": 1, "deterministic": true, "representations": [ { "dialect": "trino", "body": "x + 2" } ], ... "overload-version-id": 2, "deterministic": true, "representations": [ { "dialect": "trino", "body": "x + 1" } ],

Not sure I understand how parameter renaming causes a problem for rollback.

> To change them, a new overload must be created.

Is it OK to add an overload that changes only a parameter name, while the parameter types and order stay the same? How would a client or engine resolve to the correct overload?

> Not sure I understand how parameter renaming causes a problem for rollback.

Take the example above. If we rename the parameter to `y` at some point, then rolling back to v1 or v2 leaves the representation body (`x + 2` or `x + 1`) inconsistent with the parameter name `y`.

> Is it OK to add an overload that changes only a parameter name, while the parameter types and order stay the same? How would a client or engine resolve to the correct overload?

It shouldn't be allowed, as the signatures are the same.

Got it. Basically, parameter renaming is not allowed; hence, we require that the name and type be immutable.
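A minimal sketch of the rollback problem discussed in this thread, assuming the `x + 2` / `x + 1` bodies from the example above. The `Version` class and `unbound_names` helper are hypothetical names used only for illustration, and the identifier scan is deliberately naive: it treats every bare word in a body as a parameter reference.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """Hypothetical stand-in for one versioned SQL body of an overload."""
    version_id: int
    body: str


def unbound_names(body: str, parameter_names: set[str]) -> set[str]:
    """Return identifiers used in the body that are not declared parameters.

    Naive on purpose: every bare word counts as a parameter reference.
    """
    identifiers = set(re.findall(r"[A-Za-z_][A-Za-z_0-9]*", body))
    return identifiers - parameter_names


versions = [Version(1, "x + 2"), Version(2, "x + 1")]

# With the original parameter name, every version body resolves.
before_rename = {v.version_id: unbound_names(v.body, {"x"}) for v in versions}

# After renaming the parameter to "y", both older bodies still refer to "x".
after_rename = {v.version_id: unbound_names(v.body, {"y"}) for v in versions}
```

Under these assumptions, `after_rename` reports `x` as unbound for both versions, which is the inconsistency a rollback would hit.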

It is unclear to me what "immutable" means. Does it mean that you can't change these without updating the overload-id? That seems incorrect to me because the overload ID is more about tracking than identity. I think a better way to phrase this is:

* Function definitions are identified by the tuple of types, and there can be only one definition for a given tuple
* All parameter names must match the definition in all versions and representations

After talking with Dan about the issue we discussed in the sync, I think that it makes sense to have a list of parameter names in the SQL representation. That way each representation is self-contained and consistent. And there's no need to have restrictions on whether names can change. The names in the definition and docs are shown as the definition, but the names used in SQL are specific to that SQL. It's the same idea as having a param name in a Java interface that can differ in the definition:

interface Definition { int do_something(String foo); } class Impl implements Definition { int do_something(String bar) { return bar.length(); } }

That sounds like a good idea. To avoid duplication, since most representations will not need different names, we might still allow a SQL representation to use the default parameters, so that only renaming triggers copying the parameters into individual representations.

Added an optional parameter list in the representation, and also clarified that the tuple of types identifies a definition.

Per the last discussion (https://lists.apache.org/thread/t30hfxydwd8qkfzk9mtscc2xpg3kf621), we keep parameters only at the definition layer.
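To make "the tuple of types identifies a definition" concrete, here is a small sketch. It assumes the canonical form quoted elsewhere in this PR for `definition-id` (lowercase, no spaces, e.g. `"(int,int,string)"`); the function names and the in-memory `registry` dict are hypothetical and not part of the spec.

```python
def definition_id(parameter_types: list[str]) -> str:
    """Build an identifier from the parameter-type tuple (lowercase, no whitespace)."""
    canonical = ["".join(t.split()).lower() for t in parameter_types]
    return "(" + ",".join(canonical) + ")"


def add_definition(
    registry: dict[str, list[str]],
    parameter_names: list[str],
    parameter_types: list[str],
) -> str:
    """Register a definition, rejecting a second one with the same type tuple.

    Parameter names play no part in identity, so a rename-only "overload"
    collides with the existing definition.
    """
    key = definition_id(parameter_types)
    if key in registry:
        raise ValueError(f"definition {key} already exists")
    registry[key] = list(parameter_names)
    return key


registry: dict[str, list[str]] = {}
first_id = add_definition(registry, ["x"], ["int"])

try:
    add_definition(registry, ["y"], ["int"])  # same types, different name
    rename_only_rejected = False
except ValueError:
    rename_only_rejected = True
```

With this shape, an overload that differs only in parameter names is rejected, which matches the "It shouldn't be allowed, as the signatures are the same" answer earlier in the thread.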

format/udf-spec.md

flyrain · 2025-10-16T00:44:28Z

Thanks @stevenzwu for the review. Resolved all comments. Please take another look!

format/udf-spec.md

rdblue · 2026-01-21T19:00:02Z

format/udf-spec.md

+   and any field or element marked as required MUST NOT be null. Engines MUST reject results that violate these rules.
+
+#### Parameter-Type
+Primitive types are encoded using the [Iceberg Type JSON Representation](https://iceberg.apache.org/spec/#appendix-c-json-serialization),


I think this is a bit unclear because the JSON encoding produces a JSON string, like "float". This, however, is the bare float string. In addition, I don't think that we want to allow some things like decimal(9, 2) and instead want decimal(9,2). I think it's fine to refer to the JSON representation, but we probably want to use it as examples. Something like "Primitive types are encoded as simple strings, using the same representation as in ..." and "type strings must contain no spaces or quote characters".

Rewrote the section per offline discussion.
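For illustration, the rule suggested above ("type strings must contain no spaces or quote characters") could be checked roughly like this. The function name is hypothetical, and treating both single and double quotes as "quote characters" is an assumption. The check covers only that one rule; it does not verify that the string names a real Iceberg type.

```python
def is_valid_type_string(type_string: str) -> bool:
    """Return True if the string is non-empty and has no whitespace or quotes."""
    if not type_string:
        return False
    # Assumption: "quote characters" means both ' and ".
    return not any(ch.isspace() or ch in "\"'" for ch in type_string)


accepted = [t for t in ["float", "decimal(9,2)"] if is_valid_type_string(t)]
rejected = [t for t in ["decimal(9, 2)", '"float"'] if not is_valid_type_string(t)]
```

Here `decimal(9,2)` passes while `decimal(9, 2)` and the JSON-quoted `"float"` are rejected, which is the distinction raised in the comment.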

format/udf-spec.md

RussellSpitzer · 2026-01-21T18:49:37Z

format/udf-spec.md

+**self-contained metadata file**. Metadata captures definitions, parameters, return types, documentation, security,
+properties, and engine-specific representations.
+
+* Any modification (new definition, updated representation, changed properties, etc.) creates a new metadata file, and atomically swaps in the new file as the current metadata.


This is unclear to me. Are we defining the file name as being the same, or are we saying that a catalog must first have a reference to the UDF and that reference is what must be atomically swapped?

Maybe this should just be: UDF metadata files are immutable, and modifications should cause a new file to be created. Catalogs can then use an atomic swap, similar to an Iceberg table, to change the UDF linked with a particular catalog identifier.

Or something?

Fixed per suggestion.
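A rough sketch of the suggested wording: metadata files are never rewritten, and the catalog swaps the pointer for an identifier only if it still points at the file the writer started from. `InMemoryCatalog`, its methods, and the file names are hypothetical; a real catalog would provide its own atomic commit mechanism.

```python
import threading
from typing import Optional


class InMemoryCatalog:
    """Hypothetical catalog mapping a UDF identifier to its current metadata file."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._pointers: dict[str, str] = {}

    def current(self, identifier: str) -> Optional[str]:
        with self._lock:
            return self._pointers.get(identifier)

    def swap(self, identifier: str, expected: Optional[str], new_location: str) -> bool:
        """Point the identifier at new_location only if it still points at expected."""
        with self._lock:
            if self._pointers.get(identifier) != expected:
                return False  # another writer committed first
            self._pointers[identifier] = new_location
            return True


catalog = InMemoryCatalog()
created = catalog.swap("ns1.add_one", None, "00000-a.metadata.json")
updated = catalog.swap("ns1.add_one", "00000-a.metadata.json", "00001-b.metadata.json")
# A writer that still holds the old location loses the race and must retry.
stale = catalog.swap("ns1.add_one", "00000-a.metadata.json", "00001-c.metadata.json")
```

The older metadata file is left untouched by every call; only the catalog pointer changes.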

format/udf-spec.md

RussellSpitzer · 2026-01-21T18:55:21Z

format/udf-spec.md

+| *required*  | `format-version`  | `int`                  | Metadata format version (must be `1`).                                |
+| *required*  | `definitions`     | `list<definition>`     | List of function [definition](#definition) entities.                  |
+| *required*  | `definition-log`  | `list<definition-log>` | History of [definition snapshots](#definition-log).                   |
+| *optional*  | `location`        | `string`               | The function's base location; used to create metadata file locations. |


In what cases is this not where the UDF file is? I.e., if I read this file and it's not where the location string says, is that a problem? Or is this just for future versions?

Similar to `location` or `write.metadata.path` in the table and view specs, this is mainly for the writer to decide where to write the metadata.json file. The reader will always get the metadata.json path from the catalog.

I'm just trying to think about a situation where you would write this file to a different place than where it already is.

Not sure I completely understand the concern. I think this can happen with tables and views as well, when a new metadata.json file is written to a new location controlled by `write.metadata.path`. The existing metadata.json files won't be moved in that case.

RussellSpitzer · 2026-01-21T18:56:52Z

format/udf-spec.md

+| *optional*  | `doc`             | `string`               | Documentation string.                                                 |
+
+Notes:
+1. Engines must prevent leakage of sensitive information when a function is marked as `secure` by setting it to `true`.


Probably needs more of a definition than this. Engines may not expose UDF implementation details to the end users?

Here are the related discussions: #14117 (comment), #14117 (comment). Are you suggesting something like this?

Engines must prevent leakage of sensitive information to end users when a function is marked as `secure` by setting the property to `true`.

If I'm an engine reading this note or property, what should I do? What is "sensitive information"? What is "leakage"?

I have changed it to:

1. When `secure` is set to `true`, engines must prevent leakage of sensitive information to end users. This includes but is not limited to: UDF definitions, error messages, logs, query plans, and intermediate results.

cc @rdblue

format/udf-spec.md

RussellSpitzer · 2026-01-21T19:00:05Z

format/udf-spec.md

+| *required*  | `definition-id`      | `string`                                        | An identifier derived from canonical parameter-type tuple (lowercase, no spaces; e.g., `"(int,int,string)"`). |
+| *required*  | `parameters`         | `list<parameter>`                               | Ordered list of [function parameters](#parameter). Invocation order **must** match this list.                 |
+| *required*  | `return-type`        | `string`                                        | Declared return type (see [Parameter Type](#parameter-type)).                                                 |
+| *optional*  | `nullable-return`    | `boolean`                                       | A hint to indicate whether the return value is nullable or not. Default: `true`.                              |


Why do we specify it here and not in the return type? I guess it's just a hint so it doesn't really matter.

Good question. Yes, this is intentionally modeled as a hint rather than part of the return type itself. The return type captures the type, while nullability is kept separate to give engines flexibility. Different engines already treat nullability with different strictness: a Spark UDF can be defined with a nullable or non-nullable return, and Snowflake allows a nullable return definition in certain use cases.

RussellSpitzer · 2026-01-21T19:01:23Z

format/udf-spec.md

+| *optional*  | `nullable-return`    | `boolean`                                       | A hint to indicate whether the return value is nullable or not. Default: `true`.                              |
+| *required*  | `versions`           | `list<definition-version>`                      | [Versioned implementations](#definition-version) of this definition.                                          |
+| *required*  | `current-version-id` | `int`                                           | Identifier of the current version for this definition.                                                        |
+| *optional*  | `function-type`      | `string` (`"udf"` or `"udtf"`, default `"udf"`) | If `"udtf"`, `return-type` must be an Iceberg type `struct` describing the output schema.                     |


Why is this at the definition level? Are we ok with some signatures being UDF and others being UDTF?

Yes, engines (e.g., Postgres, Snowflake) usually support that.

:visibly upset:

I love a function that is sometimes a scalar and sometimes not :|

format/udf-spec.md

RussellSpitzer · 2026-01-21T19:13:34Z

I'm still strongly of the opinion we should replace "definition" with "signature"

rdblue · 2026-01-21T22:03:14Z

format/udf-spec.md

+Primitive types are encoded using the [Iceberg Type JSON Representation](https://iceberg.apache.org/spec/#appendix-c-json-serialization),
+for example `"int"`, `"string"`.
+
+Three composite types are supported. 


Nested types?

Also, are we supporting variant? It is not considered primitive.

I don't think we need to exclude variant; it doesn't seem to be an extra burden for UDFs. Variant will work in a UDF if the engines involved support it. WDYT?

format/udf-spec.md

rdblue · 2026-01-21T22:37:48Z

format/udf-spec.md

+| *required*  | `definition-versions` | `list<{ definition-id: string, version-id: int }>` | Mapping of each definition to its selected version at this time. |
+
+## Function Call Convention and Resolution in Engines
+Resolution rule is decided by engines, but engines SHOULD:


I think it would be more clear to say this:

Selecting the definition of a function to use is delegated to engines, which may apply their own casting rules. However, engines should:

1. Prefer exact parameter matches over safe (widening) or unsafe casts
2. Safely widen types as needed to avoid failing to find a matching definition
3. Require explicit casts for unsafe or non-obvious conversions
4. Use definitions with the same number of arguments as the input
5. Pass positional arguments in the same position as the input
6. Use definitions with the same set of field names as named input arguments

As for the last point of specifically not mixing positional and named arguments, I think that points 5 and 6 cover it. Don't reorder positional arguments and match the whole set of names. Also, implementers may ignore the "don't mix positional and named matching" but clearly stating how to match positional and named at least gives us some insurance that behavior won't be wacky if people do it anyway.

Fixed per suggestion
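A sketch of how an engine might apply the positional rules above (exact match first, then safe widening, same argument count, no implicit unsafe casts). The widening table is an assumption borrowed from Iceberg's type promotion rules (`int` to `long`, `float` to `double`); engines may use their own casting rules. Named arguments are not covered, and ties go to the first definition listed. All names here are hypothetical.

```python
from typing import Optional

# Assumption: a tiny safe-widening table; real engines define their own.
SAFE_WIDENING: dict[str, set[str]] = {"int": {"long"}, "float": {"double"}}


def match_cost(arg_types: tuple[str, ...], param_types: tuple[str, ...]) -> Optional[int]:
    """Return the number of widened arguments, or None if the definition does not apply."""
    if len(arg_types) != len(param_types):
        return None  # the argument count must match
    cost = 0
    for arg, param in zip(arg_types, param_types):  # positions are never reordered
        if arg == param:
            continue
        if param in SAFE_WIDENING.get(arg, set()):
            cost += 1
        else:
            return None  # unsafe conversions need an explicit cast by the caller
    return cost


def resolve(
    arg_types: tuple[str, ...], definitions: list[tuple[str, ...]]
) -> Optional[tuple[str, ...]]:
    """Pick the applicable definition with the fewest widened arguments."""
    best: Optional[tuple[str, ...]] = None
    best_cost: Optional[int] = None
    for params in definitions:
        cost = match_cost(arg_types, params)
        if cost is None:
            continue
        if best_cost is None or cost < best_cost:
            best, best_cost = params, cost
    return best


definitions = [("long",), ("int",), ("int", "int")]
chosen = resolve(("int",), definitions)
```

With these definitions, an `int` argument resolves to the `(int)` definition rather than the widened `(long)` one, and a `string` argument resolves to nothing.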

format/udf-spec.md

wgtmac · 2026-01-22T02:42:32Z

format/udf-spec.md

+**self-contained metadata file**. Metadata captures definitions, parameters, return types, documentation, security,
+properties, and engine-specific representations.
+
+* Any modification (new definition, updated representation, changed properties, etc.) creates a new metadata file, and atomically swaps in the new file as the current metadata.


How is the UDF metadata file referenced by table or view metadata? Does it need to be updated together with the swap? If only function-uuid is referenced, then this is not an issue.

The UDF name will be the identifier, just like a table name or view name. I think it's fine to go with that convention. For example, users' SQL refers to a table by its identifier (ns1.t1) instead of its UUID; we can apply similar logic to UDFs.

wgtmac · 2026-01-22T03:24:45Z

format/udf-spec.md

+### Parameter
+| Requirement | Field  | Type     | Description                                                  |
+|-------------|--------|----------|--------------------------------------------------------------|
+| *required*  | `type` | `string` | Parameter data type (see [Parameter Type](#parameter-type)). |


Do we allow nullable parameters? I only saw the expected behavior when any input is null. Do we need finer-grained control?

We do allow nullable parameters. `on-null-input` is a hint that lets engines decide whether to optimize when one of the parameters is null. Please check the section "Null Input Handling" in this doc for more details: https://docs.google.com/document/d/1GC896Z4gxYP0Vz-ENqZ3tZZBqXEUQf4qDJO11NRo8F4/edit?tab=t.0

Resolve comments

flyrain · 2026-01-24T20:24:58Z

Thank you all for the review. The PR is ready for another look.

stevenzwu

LGTM

flyrain · 2026-01-26T18:41:33Z

Fixed the parts of the spec related to `secure` and types per today's community sync. Please take another look.

Add SQL UDF spec

d118b64

github-actions bot added the Specification label (issues that may introduce spec changes) on Sep 19, 2025

RussellSpitzer reviewed Sep 19, 2025

stevenzwu reviewed Sep 19, 2025

talatuyarer reviewed Sep 22, 2025

sfc-gh-ygu added 2 commits September 23, 2025 11:41

Resolve commemnts

33ad408

Resolve comments

6990aea

stevenzwu reviewed Sep 23, 2025

Resolve comments

35cda2a

danielcweeks reviewed Oct 1, 2025

danielcweeks reviewed Oct 6, 2025

talatuyarer reviewed Oct 6, 2025

sfc-gh-ygu added 5 commits October 6, 2025 15:26

Resolve comments

8a75909

Resolve comments

7432d52

Add field types

e779099

Resolve comments

786f82b

Resolve comments

96a5880

stevenzwu reviewed Oct 13, 2025

sfc-gh-ygu added 2 commits October 15, 2025 17:32

Resolve comments

071e97f

Resolve comments

78223c1

flyrain requested a review from rdblue October 16, 2025 00:44

rdblue reviewed Oct 20, 2025

rdblue reviewed Jan 21, 2026

RussellSpitzer reviewed Jan 21, 2026

wgtmac reviewed Jan 22, 2026

flyrain added 5 commits January 22, 2026 15:42

Resolve comments.

678621e

Resolve comments

Resolve comments.

e6e3bcb

SQL expression and call conventions

9b02a15

Use Iceberg Type Json representation

eb97175

Secure udf fix

af0e694

stevenzwu approved these changes Jan 25, 2026

Secure and types fixes

c1c0a96
