Add option to split output according to predicates #14
This commit adds a `--split-with-predicates` flag which changes the behavior of the `--split` flag. Typically, the output is split along a subject prefix and an object prefix, e.g. `MONDO-to-NCIT.sssom.tsv`. When the `--split-with-predicates` flag is passed, the name of each split output file also includes the CURIE of the relation, e.g. `MONDO-skos_exactMatch-NCIT.sssom.tsv`. (Note that the colon in the CURIE is replaced by an underscore, since colons cannot appear in filenames on Windows.)

This (almost) matches the behavior of sssom-py: <https://github.com/mapping-commons/sssom-py/blob/ac0f769386d35e2b1c961cac59221c816143bf07/src/sssom/parsers.py#L1033-L1035>

The difference is that sssom-py only includes the local name of the predicate, while this pull request includes the full CURIE.
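To make the naming scheme concrete, here is a minimal Python sketch of how the two filename shapes described above could be derived. It is only an illustration of the convention, not the code from this pull request; the function name `split_filename` and its parameters are hypothetical.

```python
from typing import Optional


def split_filename(subject_prefix: str, object_prefix: str,
                   predicate_curie: Optional[str] = None) -> str:
    """Build the name of one split output file (hypothetical helper).

    Without a predicate, the name follows the default `--split` shape.
    With a predicate, the full CURIE is embedded, with the colon replaced
    by an underscore so the name is also valid on Windows.
    """
    if predicate_curie is None:
        return f"{subject_prefix}-to-{object_prefix}.sssom.tsv"
    safe_predicate = predicate_curie.replace(":", "_")
    return f"{subject_prefix}-{safe_predicate}-{object_prefix}.sssom.tsv"


print(split_filename("MONDO", "NCIT"))
print(split_filename("MONDO", "NCIT", "skos:exactMatch"))
```

The two calls reproduce the example names given in the description: `MONDO-to-NCIT.sssom.tsv` and `MONDO-skos_exactMatch-NCIT.sssom.tsv`.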
A couple notes:
gouttegd
left a comment
Looks good, thanks!
I wonder if we should have a more generic option to allow splitting a set along arbitrary slots instead of specifically the predicate, but I’m going to abstain from making things more complicated until someone explicitly asks for it. :)
Interesting! What sort of arbitrary slot were you thinking of? It would definitely be useful to arbitrarily define the split boundaries, but seems like it might require developing some kind of tiny DSL.
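As a rough illustration of what "splitting along arbitrary slots" could mean, the sketch below groups mappings by the values of a caller-supplied list of slot names. It is an assumption-laden toy, not the project's implementation: mappings are modelled as plain dictionaries, and `split_by_slots` is a hypothetical name. It also sidesteps the DSL question entirely by taking slot names as a list.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Mapping = Dict[str, str]  # assumption: one mapping = slot name -> value


def split_by_slots(mappings: Sequence[Mapping],
                   slots: Sequence[str]) -> Dict[Tuple[str, ...], List[Mapping]]:
    """Group mappings by the tuple of values they hold for the given slots."""
    groups: Dict[Tuple[str, ...], List[Mapping]] = defaultdict(list)
    for mapping in mappings:
        # A missing slot contributes an empty string to the grouping key.
        key = tuple(mapping.get(slot, "") for slot in slots)
        groups[key].append(mapping)
    return dict(groups)


mappings = [
    {"subject_id": "MONDO:1", "predicate_id": "skos:exactMatch", "object_id": "NCIT:1"},
    {"subject_id": "MONDO:2", "predicate_id": "skos:broadMatch", "object_id": "NCIT:2"},
    {"subject_id": "MONDO:3", "predicate_id": "skos:exactMatch", "object_id": "NCIT:3"},
]
for key, members in split_by_slots(mappings, ["predicate_id"]).items():
    print(key, len(members))
```

Each resulting group would correspond to one output file; how the group key is turned into a filename is a separate design choice.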
...adding to that (while it's on my mind). The default behavior of the split could be
Not that I have time to implement this-- just sounds like it could be cool for generating reports.
Unfortunately (?), “developing some kind of tiny DSL” is not something that can dissuade me – if anything it makes me want to do it even more! :D And yeah, your
Sounds like it rhymes a bit with Robot's template functionality. Although without the benefit of multiple distinct fields to parse. Don't worry @matentzn I'm not following up. Just filing away!
Thing is, with the existing MappingTransformer (used by SSSOM/T), we already have a parser for the kind of “format strings” we’d need here. We could therefore very easily support something like this: The only things we currently miss for that are the
Done with 6c569f2.
The behaviour of SSSOM-Py may now be matched exactly if needed with Almost any SSSOM slot can be used in the format string, so should you want to split along, say, mapping justifications and IDs of the first author of a mapping, you could do so with:
That's tremendous! Feel free to remove the |
No, that option can stay as a convenience option. It does not hurt anybody and is certainly nicer to use than |