@ptgolden
Contributor

This commit adds a --split-with-predicates flag that changes the behavior of the --split flag. By default, the output is split along a subject prefix and an object prefix, e.g. MONDO-to-NCIT.sssom.tsv. When --split-with-predicates is passed, the name of each split file also includes the CURIE of the relation, e.g. MONDO-skos_exactMatch-NCIT.sssom.tsv. (Note that the colon in the CURIE is replaced by an underscore, since colons cannot appear in filenames on Windows.)

This (almost) matches the behavior of sssom-py:

https://github.com/mapping-commons/sssom-py/blob/ac0f769386d35e2b1c961cac59221c816143bf07/src/sssom/parsers.py#L1033-L1035

The difference is that sssom-py includes only the local name of the predicate, while this pull request includes the full CURIE.
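To make the difference concrete, here is a minimal Python sketch of the two ways the predicate can contribute to a split file name. The `predicate_part` helper and its `full_curie` parameter are hypothetical names used only for illustration; this is not code from this pull request or from sssom-py.

```python
def predicate_part(predicate_curie: str, full_curie: bool = True) -> str:
    """Return the piece of a split file name derived from the predicate.

    Hypothetical helper, for illustration only.
    """
    if full_curie:
        # Full CURIE, with the colon made filename-safe (as in this PR).
        return predicate_curie.replace(":", "_")
    # Local name only (the part after the first colon), as described for sssom-py.
    return predicate_curie.split(":", 1)[-1]


print(predicate_part("skos:exactMatch"))                    # skos_exactMatch
print(predicate_part("skos:exactMatch", full_curie=False))  # exactMatch
```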

@ptgolden
Contributor Author

A couple of notes:

  • if --split-with-predicates is passed without --split, it does nothing
  • if a predicate does not have a registered prefix, the default {subjectPrefix}-to-{objectPrefix} name is used (rather than one containing a full URI)
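The two notes above can be sketched in Python as follows. This is only an illustration under stated assumptions: `compact`, `split_name`, the `split` and `with_predicates` parameters, and the one-entry prefix map are all hypothetical, and the default name follows the MONDO-to-NCIT pattern from the description.

```python
from typing import Dict, Optional

# Illustrative prefix map; a real one would come from the mapping set.
PREFIX_MAP = {"skos": "http://www.w3.org/2004/02/skos/core#"}


def compact(iri: str, prefix_map: Dict[str, str]) -> Optional[str]:
    """Return a CURIE for the IRI, or None if no registered prefix matches."""
    for prefix, namespace in prefix_map.items():
        if iri.startswith(namespace):
            return f"{prefix}:{iri[len(namespace):]}"
    return None


def split_name(subject_prefix: str, object_prefix: str, predicate_iri: str,
               prefix_map: Dict[str, str], split: bool = True,
               with_predicates: bool = False) -> Optional[str]:
    """Name one split (hypothetical; None means no splitting happens)."""
    if not split:
        # Without --split, --split-with-predicates has no effect.
        return None
    default = f"{subject_prefix}-to-{object_prefix}"
    if not with_predicates:
        return default
    curie = compact(predicate_iri, prefix_map)
    if curie is None:
        # Unregistered prefix: fall back to the default name, not a full URI.
        return default
    return f"{subject_prefix}-{curie.replace(':', '_')}-{object_prefix}"


exact = "http://www.w3.org/2004/02/skos/core#exactMatch"
print(split_name("MONDO", "NCIT", exact, PREFIX_MAP, with_predicates=True))
# MONDO-skos_exactMatch-NCIT
print(split_name("MONDO", "NCIT", "http://example.org/rel", PREFIX_MAP,
                 with_predicates=True))
# MONDO-to-NCIT
```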

Owner

@gouttegd gouttegd left a comment

Looks good, thanks!

I wonder if we should have a more generic option to allow splitting a set along arbitrary slots instead of specifically the predicate, but I’m going to abstain from making things more complicated until someone explicitly asks for it. :)

@gouttegd gouttegd merged commit 36c12bd into gouttegd:main Dec 16, 2025
1 check passed
@ptgolden
Contributor Author

Interesting! What sort of arbitrary slot were you thinking of?

It would definitely be useful to define the split boundaries arbitrarily, but it seems like it might require developing some kind of tiny DSL.

@ptgolden
Contributor Author

...adding to that (while it's on my mind).

The default behavior of the split could be --split-by="{subject:prefix}-to-{object:prefix}", and the behavior described in this PR could be --split-by="{subject:prefix}-{predicate:curie}-{object:prefix}".

Not that I have time to implement this. It just sounds like it could be cool for generating reports.

@gouttegd
Owner

Unfortunately (?), “developing some kind of tiny DSL” is not something that can dissuade me – if anything it makes me want to do it even more! :D

And yeah, your --split-by idea is more or less what I was thinking about. But I’ll resist (at least for now!) the temptation to implement it until I have a reason to do it other than “it would be a cool thing to do”.

@ptgolden
Contributor Author

Sounds like it rhymes a bit with ROBOT's template functionality, although without the benefit of multiple distinct fields to parse.

Don't worry @matentzn, I'm not following up. Just filing away!

@gouttegd
Owner

Thing is, with the existing MappingTransformer (used by SSSOM/T), we already have a parser for the kind of “format strings” we’d need here.

We could therefore very easily support something like this:

--split-by="%{subject_id|prefix}-%{predicate_id|short|replace(':', '_')}-%{object_id|prefix}"

The only things currently missing for that are the prefix and replace format modifier functions, which would be trivial to implement.
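For readers unfamiliar with this style of format string, the following Python sketch shows one way a `%{slot|modifier|modifier(args)}` placeholder could be expanded. It is not the MappingTransformer parser: the `render` function, the modifier table, and the meanings assumed for `short` (compact an IRI into a CURIE) and `prefix` (the part of a CURIE before the colon) are illustrative guesses, and the identifiers in the sample mapping are made up.

```python
import ast
import re

# Hypothetical prefix map, used only by the "short" modifier below.
PREFIX_MAP = {"skos": "http://www.w3.org/2004/02/skos/core#"}


def _short(value):
    # Assumed meaning of "short": compact a full IRI into a CURIE.
    for prefix, namespace in PREFIX_MAP.items():
        if value.startswith(namespace):
            return f"{prefix}:{value[len(namespace):]}"
    return value  # Already a CURIE, or no registered prefix.


# Assumed semantics for the three modifiers used in the example string.
MODIFIERS = {
    "short": _short,
    "prefix": lambda v: v.split(":", 1)[0],
    "replace": lambda v, old, new: v.replace(old, new),
}

PLACEHOLDER = re.compile(r"%\{([^}]*)\}")   # %{...}
CALL = re.compile(r"(\w+)(?:\((.*)\))?$")   # name or name(args)


def render(template, mapping):
    """Expand every %{slot|mod|...} placeholder against one mapping (a dict)."""
    def expand(match):
        # Naive split on "|": fine as long as arguments contain no pipe.
        slot, *steps = match.group(1).split("|")
        value = mapping[slot.strip()]
        for step in steps:
            name, raw_args = CALL.match(step.strip()).groups()
            # Parse "':', '_'" into a tuple of Python literals.
            args = ast.literal_eval(f"({raw_args},)") if raw_args else ()
            value = MODIFIERS[name](value, *args)
        return value
    return PLACEHOLDER.sub(expand, template)


mapping = {  # Made-up identifiers, for illustration only.
    "subject_id": "MONDO:0000001",
    "predicate_id": "http://www.w3.org/2004/02/skos/core#exactMatch",
    "object_id": "NCIT:C00001",
}
template = "%{subject_id|prefix}-%{predicate_id|short|replace(':', '_')}-%{object_id|prefix}"
print(render(template, mapping))  # MONDO-skos_exactMatch-NCIT
```

Treating each modifier as a plain function applied left to right keeps the table easy to extend, which is why adding `prefix` and `replace` would be a small change in a design like this.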

@gouttegd
Owner

> We could therefore very easily support something like this […]

Done with 6c569f2.

@gouttegd
Owner

> This (almost) matches the behavior of sssom-py

The behaviour of SSSOM-Py may now be matched exactly if needed with

--split-by="%{subject_id|prefix|lower}_%{predicate_id|suffix|lower}_%{object_id|prefix|lower}"

Almost any SSSOM slot can be used in the format string, so should you want to split along, say, mapping justifications and IDs of the first author of a mapping, you could do so with:

--split-by="%{mapping_justification|suffix|upper}_%{author_id|list_item(1)|suffix|default('unknown')}"
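As a rough picture of what splitting along such a key amounts to, the Python sketch below groups mappings by a computed string. It does not parse the format strings; the two key functions are hand-written equivalents of the two `--split-by` examples above, under the assumptions that `prefix` and `suffix` are the parts of a CURIE before and after the colon and that `list_item(1)` selects the first item. All function names and the sample data are hypothetical.

```python
from collections import defaultdict


def split_by(mappings, key):
    """Group mappings (plain dicts here) by the string that key() returns."""
    groups = defaultdict(list)
    for mapping in mappings:
        groups[key(mapping)].append(mapping)
    return dict(groups)


def sssom_py_style_key(m):
    # Equivalent of:
    # %{subject_id|prefix|lower}_%{predicate_id|suffix|lower}_%{object_id|prefix|lower}
    subject = m["subject_id"].split(":", 1)[0].lower()
    predicate = m["predicate_id"].split(":", 1)[1].lower()
    obj = m["object_id"].split(":", 1)[0].lower()
    return f"{subject}_{predicate}_{obj}"


def justification_author_key(m):
    # Equivalent of:
    # %{mapping_justification|suffix|upper}_%{author_id|list_item(1)|suffix|default('unknown')}
    justification = m["mapping_justification"].split(":", 1)[1].upper()
    authors = m.get("author_id") or []
    author = authors[0].split(":", 1)[1] if authors else "unknown"
    return f"{justification}_{author}"


mappings = [  # Made-up records, for illustration only.
    {"subject_id": "MONDO:0000001", "predicate_id": "skos:exactMatch",
     "object_id": "NCIT:C00001",
     "mapping_justification": "semapv:ManualMappingCuration",
     "author_id": ["orcid:0000-0000-0000-0000"]},
    {"subject_id": "MONDO:0000002", "predicate_id": "skos:broadMatch",
     "object_id": "NCIT:C00002",
     "mapping_justification": "semapv:LexicalMatching"},
]

print(sorted(split_by(mappings, sssom_py_style_key)))
# ['mondo_broadmatch_ncit', 'mondo_exactmatch_ncit']
print(sorted(split_by(mappings, justification_author_key)))
# ['LEXICALMATCHING_unknown', 'MANUALMAPPINGCURATION_0000-0000-0000-0000']
```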

@ptgolden
Contributor Author

That's tremendous! Feel free to remove the --split-with-predicates option, as it's now superfluous.

@gouttegd
Owner

> Feel free to remove the --split-with-predicates option, as it's now superfluous.

No, that option can stay as a convenience option. It does not hurt anybody and is certainly nicer to use than --split-by="%{subject_id|prefix}-%{predicate_id|short|replace(':', '_')}-%{object_id|prefix}". :)
