-
Notifications
You must be signed in to change notification settings - Fork 3k
SPEC: Add SQL UDF spec #14117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
SPEC: Add SQL UDF spec #14117
Conversation
format/udf-spec.md
Outdated
| | *optional* | `doc` | `string` | Parameter documentation. | | ||
|
|
||
| Notes: | ||
| 1. The `name` and `type` of a `parameter` are immutable. To change them, a new overload must be created. Only the optional documentation field (`doc`) can be updated in-place. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should name be immutable? typically function signature (like Java) doesn't include parameter name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name itself doesn't have to be immutable for callers, as the order of parameter matters more. Names are mainly used by the versioned representations, they should be consistent across multiple versions. Otherwise, the rollback would be problematic. For example, we need to keep the name the same when add/rollback versions.
{ "name": "x", "type": "int", "doc": "Input integer" }
...
"overload-version-id": 1,
"deterministic": true,
"representations": [
{ "dialect": "trino", "body": "x + 2" }
],
...
"overload-version-id": 2,
"deterministic": true,
"representations": [
{ "dialect": "trino", "body": "x + 1" }
],
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if I understand how the parameter renaming cause problem for rollback.
To change them, a new overload must be created.
Is it ok to add an overload with only parameter name change, while params type and order are the same? How would client/engine resolve to the correct overload?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if I understand how the parameter renaming cause problem for rollback.
Taking the example I put above. If we rename it to y at some point, then rollback to v1 or v2 will cause inconsistency between representation x+1 and parameter name y.
Is it ok to add an overload with only parameter name change, while params type and order are the same? How would client/engine resolve to the correct overload?
It shouldn't be allowed, as the signatures are the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it. basically, parameter renaming is not allowed. hence, we require the name and type are immutable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is unclear to me what "immutable" means. Does it mean that you can't change these without updating the overload-id? That seems incorrect to me because the overload ID is more about tracking than identity. I think a better way to phrase this is:
- Function definitions are identified by the tuple of types and there can be only one definition for a given tuple
- All parameter names must match the definition in all versions and representations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After talking with Dan about the issue we discussed in the sync, I think that it makes sense to have a list of parameter names in the SQL representation. That way each representation is self-contained and consistent. And there's no need to have restrictions on whether names can change. The names in the definition and docs are shown as the definition, but the names used in SQL are specific to that SQL. It's the same idea as having a param name in a Java interface that can differ in the definition:
interface Definition {
int do_something(String foo);
}
class Impl implements Definition {
int do_something(String bar) {
return bar.length();
}
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds a good idea. To avoid duplication as most of representations may not need different names, we might still allow SQL representation to use the default parameters. So that only renaming triggers the copying of parameters to individual representations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added an optional parameter list in the representation, also clarified that the tuple of types identify a definitioin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per last discussion(https://lists.apache.org/thread/t30hfxydwd8qkfzk9mtscc2xpg3kf621), we keep parameters only at the definition layer.
|
Thanks @stevenzwu for the review. Resolved all comments. Please take another look! |
format/udf-spec.md
Outdated
| and any field or element marked as required MUST NOT be null. Engines MUST reject results that violate these rules. | ||
|
|
||
| #### Parameter-Type | ||
| Primitive types are encoded using the [Iceberg Type JSON Representation](https://iceberg.apache.org/spec/#appendix-c-json-serialization), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a bit unclear because the JSON encoding produces a JSON string, like "float". This, however, is the bare float string. In addition, I don't think that we want to allow some things like decimal(9, 2) and instead want decimal(9,2). I think it's fine to refer to the JSON representation, but we probably want to use it as examples. Something like "Primitive types are encoded as simple strings, using the same representation as in ..." and "type strings must contain no spaces or quote characters".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rewrote the section per offline discussion.
format/udf-spec.md
Outdated
| **self-contained metadata file**. Metadata captures definitions, parameters, return types, documentation, security, | ||
| properties, and engine-specific representations. | ||
|
|
||
| * Any modification (new definition, updated representation, changed properties, etc.) creates a new metadata file, and atomically swaps in the new file as the current metadata. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unclear to me, are we defining the file name as being the same or are we saying that a Catalog must first have a reference to the UDF and that is what must be atomically swapped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should just be, UDF metadata files are immutable and modifications should cause a new file to be created. Catalogs can then use an atomic swap, similar to an Iceberg table, to change the UDF linked with a particular catalog identifier.
Or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed by suggestion.
format/udf-spec.md
Outdated
| | *required* | `format-version` | `int` | Metadata format version (must be `1`). | | ||
| | *required* | `definitions` | `list<definition>` | List of function [definition](#definition) entities. | | ||
| | *required* | `definition-log` | `list<definition-log>` | History of [definition snapshots](#definition-log). | | ||
| | *optional* | `location` | `string` | The function's base location; used to create metadata file locations. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It what cases is this not where the UDF file is? IE, If I read this file and it's not where the location string says, is that a problem? Or is this just for future versions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the location or write.metadata.path in table and view spec, this is mainly for writer to decide where to write the metadata.json file. The reader will always get the metadata.json path from the catalogs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just trying to think about a situation where you would write this file to a different place than where it already is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand the concern completely. I think this can happen to table and view as well, when a new metadata.json file is written to a new location controlled by write.metadata.path. The existing metadata.json files won't be moved in that cases.
format/udf-spec.md
Outdated
| | *optional* | `doc` | `string` | Documentation string. | | ||
|
|
||
| Notes: | ||
| 1. Engines must prevent leakage of sensitive information when a function is marked as `secure` by setting it to `true`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably needs more of a definition than this. Engines may not expose UDF implementation details to the end users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are related discussion, #14117 (comment), #14117 (comment). Are you suggesting something like this?
Engines must prevent leakage of sensitive information to end users when a function is marked as `secure` by setting the property to `true`.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If i'm an engine reading this note or property, what should I do? What is "sensitive information? " what is "leakage"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have changed it to:
1. When `secure` is set to `true`, engines must prevent leakage of sensitive information to end users. This includes
but is not limited to: UDF definitions, error messages, logs, query plans, and intermediate results.
cc @rdblue
format/udf-spec.md
Outdated
| | *required* | `definition-id` | `string` | An identifier derived from canonical parameter-type tuple (lowercase, no spaces; e.g., `"(int,int,string)"`). | | ||
| | *required* | `parameters` | `list<parameter>` | Ordered list of [function parameters](#parameter). Invocation order **must** match this list. | | ||
| | *required* | `return-type` | `string` | Declared return type (see [Parameter Type](#parameter-type)). | | ||
| | *optional* | `nullable-return` | `boolean` | A hint to indicate whether the return value is nullable or not. Default: `true`. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we specify it here and not in the return type? I guess it's just a hint so it doesn't really matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. Yes, this is intentionally modeled as a hint rather than part of the return type itself. The return type captures the type, while nullability is separated to provide flexibilities for engines. Different engines already treat nullability with different strictness. Spark UDF could be defined as nullable return or nonNullable return. Snowflake allow nullable return definition in certain use cases.
format/udf-spec.md
Outdated
| | *optional* | `nullable-return` | `boolean` | A hint to indicate whether the return value is nullable or not. Default: `true`. | | ||
| | *required* | `versions` | `list<definition-version>` | [Versioned implementations](#definition-version) of this definition. | | ||
| | *required* | `current-version-id` | `int` | Identifier of the current version for this definition. | | ||
| | *optional* | `function-type` | `string` (`"udf"` or `"udtf"`, default `"udf"`) | If `"udtf"`, `return-type` must be an Iceberg type `struct` describing the output schema. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this at the definition level? Are we ok with some signatures being UDF and others being UDTF?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, engines(e.g., Postgres, Snowflake) usually support that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:visibly upset:
I love a function that is sometimes a scalar and sometimes not :|
|
I'm still strongly of the opinion we should replace "definition" with "signature" |
format/udf-spec.md
Outdated
| Primitive types are encoded using the [Iceberg Type JSON Representation](https://iceberg.apache.org/spec/#appendix-c-json-serialization), | ||
| for example `"int"`, `"string"`. | ||
|
|
||
| Three composite types are supported. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nested types?
Also, are we supporting variant? It is not considered primitive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to exclude variant, it doesn't seem an extra burden for UDF. Variant will work in a UDF if related engines can support it. WDYT?
format/udf-spec.md
Outdated
| | *required* | `definition-versions` | `list<{ definition-id: string, version-id: int }>` | Mapping of each definition to its selected version at this time. | | ||
|
|
||
| ## Function Call Convention and Resolution in Engines | ||
| Resolution rule is decided by engines, but engines SHOULD: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be more clear to say this:
Selecting the definition of a function to use is delegated to engines, which may apply their own casting rules. However, engines should:
- Prefer exact parameter matches over safe (widening) or unsafe casts
- Safely widen types as needed to avoid failing to find a matching definition
- Require explicit casts for unsafe or non-obvious conversions
- Use definitions with the same number of arguments as the input
- Pass positional arguments in the same position as the input
- Use definitions with the same set of field names as named input arguments
As for the last point of specifically not mixing positional and named arguments, I think that points 5 and 6 cover it. Don't reorder positional arguments and match the whole set of names. Also, implementers may ignore the "don't mix positional and named matching" but clearly stating how to match positional and named at least gives us some insurance that behavior won't be wacky if people do it anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed per suggestion
format/udf-spec.md
Outdated
| **self-contained metadata file**. Metadata captures definitions, parameters, return types, documentation, security, | ||
| properties, and engine-specific representations. | ||
|
|
||
| * Any modification (new definition, updated representation, changed properties, etc.) creates a new metadata file, and atomically swaps in the new file as the current metadata. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How is the UDF metadata file referenced by table or view metadata? Does it need to be updated together with the swap? If only function-uuid is referenced, then this is not an issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The udf name will be the identifier, just like table name, and view name. I think it's fine to go with that convention. For example, if users' sql refers a table by its identifier (ns1.t1), instead of its uuid. We may apply the similar logic there for udf.
format/udf-spec.md
Outdated
| ### Parameter | ||
| | Requirement | Field | Type | Description | | ||
| |-------------|--------|----------|--------------------------------------------------------------| | ||
| | *required* | `type` | `string` | Parameter data type (see [Parameter Type](#parameter-type)). | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we allow nullable parameter? I just saw the expected behavior if any input is null. Do we need finer-grained control?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do allow nullable parameter. The on-null-input is a hint that engines can decide whether to optimize when one of parameters is null. Please check the section "Null Input Handling" in this doc for more details, https://docs.google.com/document/d/1GC896Z4gxYP0Vz-ENqZ3tZZBqXEUQf4qDJO11NRo8F4/edit?tab=t.0
|
Thank you all for the review. The PR is ready for another look. |
stevenzwu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Fixed the spec related to |
Dev mailing thread: https://lists.apache.org/list?dev@iceberg.apache.org:lte=1M:versioned%20UDF
Design docs:
I have updated the metadata structure per last meeting. Here is the latest structure in a nutshell. Please use the PRs as the source of truth.
