Add option to split output according to predicates #14

ptgolden · 2025-12-16T15:06:23Z

This commit adds a --split-with-predicates flag, which changes the behavior of the --split flag. By default, the output is split by subject prefix and object prefix, e.g. MONDO-to-NCIT.sssom.tsv. When --split-with-predicates is passed, the name of each split file also includes the CURIE of the relation, e.g. MONDO-skos_exactMatch-NCIT.sssom.tsv. (Note that the colon in the CURIE is replaced by an underscore, since colons cannot appear in filenames on Windows.)

This (almost) matches the behavior of sssom-py:

https://github.com/mapping-commons/sssom-py/blob/ac0f769386d35e2b1c961cac59221c816143bf07/src/sssom/parsers.py#L1033-L1035

The difference is that sssom-py only includes the local name of the predicate, while this pull request includes the full CURIE.
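
To make the naming scheme concrete, here is a minimal Python sketch. It is not the code from this pull request: the function names `split_filename` and `local_name` and their parameters are hypothetical, and the sketch assumes the predicate is already available as a CURIE.

```python
from typing import Optional


def split_filename(subject_prefix: str, object_prefix: str,
                   predicate_curie: Optional[str] = None) -> str:
    """Build the name of a split file (illustrative only)."""
    if predicate_curie is None:
        # Plain --split naming: subject prefix, "to", object prefix.
        return f"{subject_prefix}-to-{object_prefix}.sssom.tsv"
    # Colons cannot appear in Windows filenames, so swap them for underscores.
    safe_predicate = predicate_curie.replace(":", "_")
    return f"{subject_prefix}-{safe_predicate}-{object_prefix}.sssom.tsv"


def local_name(predicate_curie: str) -> str:
    """Part of the CURIE after the prefix, i.e. what sssom-py is said to keep."""
    return predicate_curie.split(":", 1)[-1]


print(split_filename("MONDO", "NCIT"))
print(split_filename("MONDO", "NCIT", "skos:exactMatch"))
print(local_name("skos:exactMatch"))
```

Keeping the full CURIE rather than only the local name avoids collisions between predicates that share a local name but come from different namespaces.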

ptgolden · 2025-12-16T15:14:11Z

A couple of notes:

- If --split-with-predicates is passed without --split, it does nothing.
- If a predicate does not have a registered prefix, the default {subjectPrefix}-to-{objectPrefix} name is used (rather than one containing the full URI).
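
The two notes can be sketched as follows in Python. This is an assumption-laden illustration, not the actual implementation: `compact`, `split_key`, and the shape of the prefix map are all hypothetical, and the fallback uses the default `--split` name with the subject prefix first, as in `MONDO-to-NCIT`.

```python
from typing import Dict, Optional


def compact(iri: str, prefix_map: Dict[str, str]) -> Optional[str]:
    """Return a CURIE for `iri`, or None if no registered namespace matches."""
    best_prefix, best_namespace = None, ""
    for prefix, namespace in prefix_map.items():
        # Prefer the longest matching namespace.
        if iri.startswith(namespace) and len(namespace) > len(best_namespace):
            best_prefix, best_namespace = prefix, namespace
    if best_prefix is None:
        return None
    return f"{best_prefix}:{iri[len(best_namespace):]}"


def split_key(subject_prefix: str, object_prefix: str, predicate_iri: str,
              prefix_map: Dict[str, str], split: bool = True,
              with_predicates: bool = False) -> Optional[str]:
    """Name (without extension) of the split a mapping would go to."""
    if not split:
        # Without --split there is no splitting at all, so the
        # predicate option has nothing to act on.
        return None
    curie = compact(predicate_iri, prefix_map) if with_predicates else None
    if curie is None:
        # Either predicates were not requested or no prefix is registered:
        # fall back to the default name.
        return f"{subject_prefix}-to-{object_prefix}"
    return f"{subject_prefix}-{curie.replace(':', '_')}-{object_prefix}"
```

Falling back to the default name keeps a raw URI, with its slashes and colons, out of the filename.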

gouttegd

Looks good, thanks!

I wonder if we should have a more generic option to allow splitting a set along arbitrary slots instead of specifically the predicate, but I’m going to abstain from making things more complicated until someone explicitly asks for it. :)

ptgolden · 2025-12-16T19:44:50Z

Interesting! What sort of arbitrary slot were you thinking of?

It would definitely be useful to arbitrarily define the split boundaries, but seems like it might require developing some kind of tiny DSL.

ptgolden · 2025-12-16T19:54:43Z

...adding to that (while it's on my mind).

The default behavior of the split could be --split-by="{subject:prefix}-to-{object:prefix}", and the behavior described in this PR could be --split-by="{subject:prefix}-{predicate:curie}-{object:prefix}".

Not that I have time to implement this; it just sounds like it could be cool for generating reports.

gouttegd · 2025-12-16T20:02:20Z

Unfortunately (?), “developing some kind of tiny DSL” is not something that can dissuade me – if anything it makes me want to do it even more! :D

And yeah, your --split-by idea is more or less what I was thinking about. But I’ll resist (at least for now!) the temptation to implement it until I have another reason to do it than “it would be a cool thing to do”.

ptgolden · 2025-12-16T20:58:40Z

Sounds like it rhymes a bit with ROBOT's template functionality, although without the benefit of multiple distinct fields to parse.

Don't worry @matentzn I'm not following up. Just filing away!

gouttegd · 2025-12-16T22:05:49Z

Thing is, with the existing MappingTransformer (used by SSSOM/T), we already have a parser for the kind of “format strings” we’d need here.

We could therefore very easily support something like this:

--split-by="%{subject_id|prefix}-%{predicate_id|short|replace(':', '_')}-%{object_id|prefix}"

The only things currently missing for that are the prefix and replace format modifier functions, which would be trivial to implement.
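
For readers unfamiliar with this kind of format string, here is a toy Python interpreter for the `%{slot|modifier|modifier(args)}` syntax shown above. It is only a sketch and not the MappingTransformer parser: the `render` function, the `MODIFIERS` table, the prefix map, and the exact semantics given to `short`, `prefix`, and `replace` are all assumptions made for illustration, and it does not handle `|` or `}` inside modifier arguments.

```python
import ast
import re
from typing import Dict

# Hypothetical prefix map used by the toy modifiers below.
PREFIX_MAP = {
    "MONDO": "http://purl.obolibrary.org/obo/MONDO_",
    "NCIT": "http://purl.obolibrary.org/obo/NCIT_",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}


def short(iri: str) -> str:
    """Compact an IRI into a CURIE; return it unchanged if no prefix matches."""
    for name, namespace in PREFIX_MAP.items():
        if iri.startswith(namespace):
            return f"{name}:{iri[len(namespace):]}"
    return iri


def prefix(value: str) -> str:
    """Prefix name of an IRI or CURIE, or an empty string if none matches."""
    for name, namespace in PREFIX_MAP.items():
        if value.startswith(namespace) or value.startswith(name + ":"):
            return name
    return ""


MODIFIERS = {
    "short": short,
    "prefix": prefix,
    "replace": lambda value, old, new: value.replace(old, new),
}

PLACEHOLDER = re.compile(r"%\{([^}]*)\}")
MODIFIER = re.compile(r"(\w+)(?:\((.*)\))?$")


def render(template: str, mapping: Dict[str, str]) -> str:
    """Expand every %{slot|modifier...} placeholder for one mapping."""
    def expand(match: "re.Match[str]") -> str:
        slot, *modifiers = match.group(1).split("|")
        value = mapping[slot.strip()]
        for modifier in modifiers:
            name, raw_args = MODIFIER.match(modifier.strip()).groups()
            # Parse "':', '_'" into a tuple of Python literals.
            args = ast.literal_eval(f"({raw_args},)") if raw_args else ()
            value = MODIFIERS[name](value, *args)
        return value
    return PLACEHOLDER.sub(expand, template)


example = {
    "subject_id": "http://purl.obolibrary.org/obo/MONDO_0000001",
    "predicate_id": "http://www.w3.org/2004/02/skos/core#exactMatch",
    "object_id": "http://purl.obolibrary.org/obo/NCIT_C2991",
}
template = "%{subject_id|prefix}-%{predicate_id|short|replace(':', '_')}-%{object_id|prefix}"
print(render(template, example))
```

Chaining modifiers left to right is what lets one generic option reproduce the `--split-with-predicates` naming.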

gouttegd · 2025-12-16T22:56:25Z

> We could therefore very easily support something like this […]

Done with 6c569f2.

gouttegd · 2025-12-17T13:04:33Z

> This (almost) matches the behavior of sssom-py

The behaviour of sssom-py can now be matched exactly, if needed, with:

--split-by="%{subject_id|prefix|lower}_%{predicate_id|suffix|lower}_%{object_id|prefix|lower}"

Almost any SSSOM slot can be used in the format string, so should you want to split along, say, mapping justifications and IDs of the first author of a mapping, you could do so with:

--split-by="%{mapping_justification|suffix|upper}_%{author_id|list_item(1)|suffix|default('unknown')}"
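
The effect of such split keys on a mapping set can be sketched in Python without the format-string machinery. Everything here is hypothetical: the mappings are plain dictionaries holding CURIEs, `split_by` takes an ordinary key function, and the two key functions only approximate what the `--split-by` strings above are described as producing.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Iterable, List

Mapping = Dict[str, Any]


def curie_prefix(curie: str) -> str:
    return curie.split(":", 1)[0]


def curie_suffix(curie: str) -> str:
    return curie.split(":", 1)[-1]


def split_by(mappings: Iterable[Mapping],
             key: Callable[[Mapping], str]) -> Dict[str, List[Mapping]]:
    """Group mappings into buckets, one per distinct key value."""
    buckets: Dict[str, List[Mapping]] = defaultdict(list)
    for mapping in mappings:
        buckets[key(mapping)].append(mapping)
    return dict(buckets)


def sssom_py_style_key(mapping: Mapping) -> str:
    # Lowercased subject prefix, predicate local name, object prefix.
    return "_".join([
        curie_prefix(mapping["subject_id"]).lower(),
        curie_suffix(mapping["predicate_id"]).lower(),
        curie_prefix(mapping["object_id"]).lower(),
    ])


def justification_author_key(mapping: Mapping) -> str:
    # First author if there is one, otherwise the literal "unknown".
    authors = mapping.get("author_id") or []
    first_author = curie_suffix(authors[0]) if authors else "unknown"
    justification = curie_suffix(mapping["mapping_justification"]).upper()
    return f"{justification}_{first_author}"


mappings = [
    {"subject_id": "MONDO:0000001", "predicate_id": "skos:exactMatch",
     "object_id": "NCIT:C2991",
     "mapping_justification": "semapv:ManualMappingCuration",
     "author_id": ["orcid:0000-0000-0000-0001"]},
    {"subject_id": "MONDO:0000002", "predicate_id": "skos:broadMatch",
     "object_id": "NCIT:C0001",
     "mapping_justification": "semapv:LexicalMatching"},
]
print(sorted(split_by(mappings, sssom_py_style_key)))
print(sorted(split_by(mappings, justification_author_key)))
```

Each bucket would then be written to its own file, named after the key.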

ptgolden · 2025-12-17T15:18:10Z

That's tremendous! Feel free to remove the --split-with-predicates option, as it's now superfluous.

gouttegd · 2025-12-17T15:24:35Z

> Feel free to remove the --split-with-predicates option, as it's now superfluous.

No, that option can stay as a convenience option. It does not hurt anybody and is certainly nicer to use than --split-by="%{subject_id|prefix}-%{predicate_id|short|replace(':', '_')}-%{object_id|prefix}". :)

gouttegd approved these changes Dec 16, 2025

gouttegd merged commit 36c12bd into gouttegd:main Dec 16, 2025
1 check passed
