Skip to content

Conversation

@flyrain
Copy link
Contributor

@flyrain flyrain commented Sep 19, 2025

@github-actions github-actions bot added the Specification Issues that may introduce spec changes. label Sep 19, 2025
| *optional* | `doc` | `string` | Parameter documentation. |

Notes:
1. The `name` and `type` of a `parameter` are immutable. To change them, a new overload must be created. Only the optional documentation field (`doc`) can be updated in-place.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should name be immutable? typically function signature (like Java) doesn't include parameter name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name itself doesn't have to be immutable for callers, as the order of parameter matters more. Names are mainly used by the versioned representations, they should be consistent across multiple versions. Otherwise, the rollback would be problematic. For example, we need to keep the name the same when add/rollback versions.

 { "name": "x", "type": "int", "doc": "Input integer" }
              ...
               "overload-version-id": 1,
               "deterministic": true,
               "representations": [
                  { "dialect": "trino", "body": "x + 2" }
               ],
...
               "overload-version-id": 2,
               "deterministic": true,
               "representations": [
                  { "dialect": "trino", "body": "x + 1" }
               ],

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if I understand how the parameter renaming cause problem for rollback.

To change them, a new overload must be created.

Is it ok to add an overload with only parameter name change, while params type and order are the same? How would client/engine resolve to the correct overload?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if I understand how the parameter renaming cause problem for rollback.

Taking the example I put above. If we rename it to y at some point, then rollback to v1 or v2 will cause inconsistency between representation x+1 and parameter name y.

Is it ok to add an overload with only parameter name change, while params type and order are the same? How would client/engine resolve to the correct overload?

It shouldn't be allowed, as the signatures are the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it. basically, parameter renaming is not allowed. hence, we require the name and type are immutable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is unclear to me what "immutable" means. Does it mean that you can't change these without updating the overload-id? That seems incorrect to me because the overload ID is more about tracking than identity. I think a better way to phrase this is:

  1. Function definitions are identified by the tuple of types and there can be only one definition for a given tuple
  2. All parameter names must match the definition in all versions and representations

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After talking with Dan about the issue we discussed in the sync, I think that it makes sense to have a list of parameter names in the SQL representation. That way each representation is self-contained and consistent. And there's no need to have restrictions on whether names can change. The names in the definition and docs are shown as the definition, but the names used in SQL are specific to that SQL. It's the same idea as having a param name in a Java interface that can differ in the definition:

interface Definition {
  int do_something(String foo);
}

class Impl implements Definition {
  int do_something(String bar) {
    return bar.length();
  }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds a good idea. To avoid duplication as most of representations may not need different names, we might still allow SQL representation to use the default parameters. So that only renaming triggers the copying of parameters to individual representations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an optional parameter list in the representation, also clarified that the tuple of types identify a definitioin.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per last discussion(https://lists.apache.org/thread/t30hfxydwd8qkfzk9mtscc2xpg3kf621), we keep parameters only at the definition layer.

@flyrain flyrain requested a review from rdblue October 16, 2025 00:44
@flyrain
Copy link
Contributor Author

flyrain commented Oct 16, 2025

Thanks @stevenzwu for the review. Resolved all comments. Please take another look!

and any field or element marked as required MUST NOT be null. Engines MUST reject results that violate these rules.

#### Parameter-Type
Primitive types are encoded using the [Iceberg Type JSON Representation](https://iceberg.apache.org/spec/#appendix-c-json-serialization),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a bit unclear because the JSON encoding produces a JSON string, like "float". This, however, is the bare float string. In addition, I don't think that we want to allow some things like decimal(9, 2) and instead want decimal(9,2). I think it's fine to refer to the JSON representation, but we probably want to use it as examples. Something like "Primitive types are encoded as simple strings, using the same representation as in ..." and "type strings must contain no spaces or quote characters".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewrote the section per offline discussion.

**self-contained metadata file**. Metadata captures definitions, parameters, return types, documentation, security,
properties, and engine-specific representations.

* Any modification (new definition, updated representation, changed properties, etc.) creates a new metadata file, and atomically swaps in the new file as the current metadata.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unclear to me, are we defining the file name as being the same or are we saying that a Catalog must first have a reference to the UDF and that is what must be atomically swapped?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should just be, UDF metadata files are immutable and modifications should cause a new file to be created. Catalogs can then use an atomic swap, similar to an Iceberg table, to change the UDF linked with a particular catalog identifier.

Or something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed by suggestion.

| *required* | `format-version` | `int` | Metadata format version (must be `1`). |
| *required* | `definitions` | `list<definition>` | List of function [definition](#definition) entities. |
| *required* | `definition-log` | `list<definition-log>` | History of [definition snapshots](#definition-log). |
| *optional* | `location` | `string` | The function's base location; used to create metadata file locations. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It what cases is this not where the UDF file is? IE, If I read this file and it's not where the location string says, is that a problem? Or is this just for future versions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the location or write.metadata.path in table and view spec, this is mainly for writer to decide where to write the metadata.json file. The reader will always get the metadata.json path from the catalogs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just trying to think about a situation where you would write this file to a different place than where it already is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand the concern completely. I think this can happen to table and view as well, when a new metadata.json file is written to a new location controlled by write.metadata.path. The existing metadata.json files won't be moved in that cases.

| *optional* | `doc` | `string` | Documentation string. |

Notes:
1. Engines must prevent leakage of sensitive information when a function is marked as `secure` by setting it to `true`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably needs more of a definition than this. Engines may not expose UDF implementation details to the end users?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are related discussion, #14117 (comment), #14117 (comment). Are you suggesting something like this?

Engines must prevent leakage of sensitive information to end users when a function is marked as `secure` by setting the property to `true`.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If i'm an engine reading this note or property, what should I do? What is "sensitive information? " what is "leakage"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have changed it to:

1. When `secure` is set to `true`, engines must prevent leakage of sensitive information to end users. This includes
   but is not limited to: UDF definitions, error messages, logs, query plans, and intermediate results.

cc @rdblue

| *required* | `definition-id` | `string` | An identifier derived from canonical parameter-type tuple (lowercase, no spaces; e.g., `"(int,int,string)"`). |
| *required* | `parameters` | `list<parameter>` | Ordered list of [function parameters](#parameter). Invocation order **must** match this list. |
| *required* | `return-type` | `string` | Declared return type (see [Parameter Type](#parameter-type)). |
| *optional* | `nullable-return` | `boolean` | A hint to indicate whether the return value is nullable or not. Default: `true`. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we specify it here and not in the return type? I guess it's just a hint so it doesn't really matter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. Yes, this is intentionally modeled as a hint rather than part of the return type itself. The return type captures the type, while nullability is separated to provide flexibilities for engines. Different engines already treat nullability with different strictness. Spark UDF could be defined as nullable return or nonNullable return. Snowflake allow nullable return definition in certain use cases.

| *optional* | `nullable-return` | `boolean` | A hint to indicate whether the return value is nullable or not. Default: `true`. |
| *required* | `versions` | `list<definition-version>` | [Versioned implementations](#definition-version) of this definition. |
| *required* | `current-version-id` | `int` | Identifier of the current version for this definition. |
| *optional* | `function-type` | `string` (`"udf"` or `"udtf"`, default `"udf"`) | If `"udtf"`, `return-type` must be an Iceberg type `struct` describing the output schema. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this at the definition level? Are we ok with some signatures being UDF and others being UDTF?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, engines(e.g., Postgres, Snowflake) usually support that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:visibly upset:

I love a function that is sometimes a scalar and sometimes not :|

@RussellSpitzer
Copy link
Member

I'm still strongly of the opinion we should replace "definition" with "signature"

Primitive types are encoded using the [Iceberg Type JSON Representation](https://iceberg.apache.org/spec/#appendix-c-json-serialization),
for example `"int"`, `"string"`.

Three composite types are supported.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nested types?

Also, are we supporting variant? It is not considered primitive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to exclude variant, it doesn't seem an extra burden for UDF. Variant will work in a UDF if related engines can support it. WDYT?

| *required* | `definition-versions` | `list<{ definition-id: string, version-id: int }>` | Mapping of each definition to its selected version at this time. |

## Function Call Convention and Resolution in Engines
Resolution rule is decided by engines, but engines SHOULD:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be more clear to say this:

Selecting the definition of a function to use is delegated to engines, which may apply their own casting rules. However, engines should:

  1. Prefer exact parameter matches over safe (widening) or unsafe casts
  2. Safely widen types as needed to avoid failing to find a matching definition
  3. Require explicit casts for unsafe or non-obvious conversions
  4. Use definitions with the same number of arguments as the input
  5. Pass positional arguments in the same position as the input
  6. Use definitions with the same set of field names as named input arguments

As for the last point of specifically not mixing positional and named arguments, I think that points 5 and 6 cover it. Don't reorder positional arguments and match the whole set of names. Also, implementers may ignore the "don't mix positional and named matching" but clearly stating how to match positional and named at least gives us some insurance that behavior won't be wacky if people do it anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed per suggestion

**self-contained metadata file**. Metadata captures definitions, parameters, return types, documentation, security,
properties, and engine-specific representations.

* Any modification (new definition, updated representation, changed properties, etc.) creates a new metadata file, and atomically swaps in the new file as the current metadata.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the UDF metadata file referenced by table or view metadata? Does it need to be updated together with the swap? If only function-uuid is referenced, then this is not an issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The udf name will be the identifier, just like table name, and view name. I think it's fine to go with that convention. For example, if users' sql refers a table by its identifier (ns1.t1), instead of its uuid. We may apply the similar logic there for udf.

### Parameter
| Requirement | Field | Type | Description |
|-------------|--------|----------|--------------------------------------------------------------|
| *required* | `type` | `string` | Parameter data type (see [Parameter Type](#parameter-type)). |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we allow nullable parameter? I just saw the expected behavior if any input is null. Do we need finer-grained control?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do allow nullable parameter. The on-null-input is a hint that engines can decide whether to optimize when one of parameters is null. Please check the section "Null Input Handling" in this doc for more details, https://docs.google.com/document/d/1GC896Z4gxYP0Vz-ENqZ3tZZBqXEUQf4qDJO11NRo8F4/edit?tab=t.0

@flyrain
Copy link
Contributor Author

flyrain commented Jan 24, 2026

Thank you all for the review. The PR is ready for another look.

Copy link
Contributor

@stevenzwu stevenzwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@flyrain
Copy link
Contributor Author

flyrain commented Jan 26, 2026

Fixed the spec related to secure and Types per today's community sync. Please take another look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Specification Issues that may introduce spec changes.

Projects

None yet

Development

Successfully merging this pull request may close these issues.