Add a function to recursively add IDs to a VRS object and every contained identifiable object #596

@jsstevenson

In AnyVar, we want to receive VRS objects, but they need IDs to be stored. We'd like some kind of function that will ensure IDs get added to received objects if they aren't there already. This needs to recurse down through any contained objects.

I had assumed this existed somewhere in VRS-Python, but I haven't found a working solution yet. Granted, this is a simple problem: I could write something that checks whether an object is an allele or some other type and handles each case manually. Still, I felt like there should be a better option, especially since we already have some functions that get close:

ga4gh_identify() returns the correct ID, but doesn't update contained objects

It'll set the outermost .id property if you pass in_place="always", but it won't set the ID of any contained object:

In [1]: from ga4gh.vrs import models, normalize; from ga4gh.core import ga4gh_identify; from ga4gh.vrs.enderef import vrs_deref, vrs_enref

In [2]: input_data = {"location": {"end": 87894077, "start": 87894076, "sequenceReference": { "refgetAccession": "SQ.ss8r_wB0-b9r44TQTMmVTI92884QvBiB", "type": "SequenceReference"},},"state": {"sequence": "T"}}

In [3]: allele1 = models.Allele(**input_data)

In [4]: ga4gh_identify(allele1, in_place="always")
Out[4]: 'ga4gh:VA.K7akyz9PHB0wg8wBNVlWAAdvMbJUJJfU'

In [5]: allele1.id
Out[5]: 'ga4gh:VA.K7akyz9PHB0wg8wBNVlWAAdvMbJUJJfU'

In [6]: allele1.location.id is None
Out[6]: True

vrs_enref()/vrs_deref() will update IDs in place, but the allele ID will be wrong

This one seems bad. Maybe I don't understand how these functions are supposed to work, but this is troubling. Regardless, it's not a solution to my problem.

In [7]: storage = {}

In [8]: enreffed = vrs_enref(models.Allele(**input_data), storage)

In [9]: dereffed = vrs_deref(enreffed, storage)

In [10]: dereffed.id
Out[10]: 'ga4gh:VA.UBp6cO0u3i286SZhHhfUo1uFft259YyC'

In [11]: dereffed.location.id
Out[11]: 'ga4gh:SL.01EH5o6V6VEyNUq68gpeTwKE7xOo-WAy'

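Note: that location ID is correct, as far as I can tell. I don't understand why the allele ID is wrong (it doesn't match the 'ga4gh:VA.K7akyz9PHB0wg8wBNVlWAAdvMbJUJJfU' that ga4gh_identify() returned above).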

vrs_enref() -> vrs_deref() -> clear allele ID and digest -> ga4gh_identify() works, but this is way more complicated than it should be

This is what I put into AnyVar as a temporary measure:

def recursive_identify(vrs_object: Type_VrsObject) -> Type_VrsObject:
    """Add GA4GH IDs to an object and all GA4GH-identifiable objects contained within.

    :param vrs_object: AnyVar-supported variation object
    :return: same object, with any missing ID fields filled in
    """
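    # Round-trip through enref/deref, which fills in IDs on contained objects
    storage = {}
    enreffed = vrs_enref(vrs_object, storage)
    dereffed = vrs_deref(enreffed, storage)
    # The top-level ID from the round trip is wrong, so clear it and recompute
    dereffed.id = None  # type: ignore[reportAttributeAccessIssue]
    dereffed.digest = None  # type: ignore[reportAttributeAccessIssue]
    ga4gh_identify(dereffed, in_place="always")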
    return dereffed  # type: ignore[reportReturnType]

I cannot imagine this is the best possible solution to this problem, or that I'm the only person who's ever needed something like this before. I think it'd be nice to either update the behavior of the existing functions or add something new that does this efficiently.
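
A minimal sketch of what such a function might look like, written against stand-in dataclasses rather than the real VRS-Python models so that it runs with the standard library alone. Everything here is an assumption for illustration: ToyLocation, ToyAllele, toy_identify(), and the "toy:" ID prefix are hypothetical, and the digest input is a simplified JSON dump, not the actual VRS serialization. The only point is the traversal order: identify contained objects first, then the parent, and leave existing IDs untouched.

```python
import base64
import hashlib
import json
from dataclasses import dataclass, fields, is_dataclass
from typing import Any, Optional


def sha512t24u(blob: bytes) -> str:
    """Truncated digest: SHA-512, first 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")


# Stand-in models (hypothetical; NOT the real ga4gh.vrs.models classes).
@dataclass
class ToyLocation:
    start: int
    end: int
    refget_accession: str
    id: Optional[str] = None
    prefix = "SL"  # unannotated, so a plain class attribute rather than a field


@dataclass
class ToyAllele:
    location: ToyLocation
    state: str
    id: Optional[str] = None
    prefix = "VA"


def _is_model(value: Any) -> bool:
    """True for dataclass instances (not dataclass types)."""
    return is_dataclass(value) and not isinstance(value, type)


def toy_identify(obj: Any) -> str:
    """Toy ID: hash every field except `id`, replacing child models by their IDs."""
    payload = {}
    for f in fields(obj):
        if f.name == "id":
            continue
        value = getattr(obj, f.name)
        payload[f.name] = value.id if _is_model(value) else value
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return f"toy:{obj.prefix}.{sha512t24u(blob)}"


def recursive_identify(obj: Any) -> Any:
    """Fill in missing IDs depth-first, so a parent is hashed after its children."""
    for f in fields(obj):
        child = getattr(obj, f.name)
        if _is_model(child):
            recursive_identify(child)
    if obj.id is None:  # only add an ID if one isn't there already
        obj.id = toy_identify(obj)
    return obj


allele = ToyAllele(
    location=ToyLocation(
        start=87894076,
        end=87894077,
        refget_accession="SQ.ss8r_wB0-b9r44TQTMmVTI92884QvBiB",
    ),
    state="T",
)
recursive_identify(allele)
print(allele.id)
print(allele.location.id)
```

A real implementation would presumably call ga4gh_identify() in place of toy_identify() and would need to handle lists of contained objects as well, which this sketch does not.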

Labels

enhancement: New feature or request