pod needing GPU or SRIOV failing scheduling after node restart #132

@cicyle

Description

After a node restart, some pods that need a GPU or a VF might fail because the device resources are not ready yet.
The failing pods then have to be deleted manually.
For example, a failing pod can report the following error:
Message: Pod was rejected: Allocate failed due to no healthy devices present; cannot allocate unhealthy devices openshift.io/media_a_rx_pool, which is unexpected

One possible solution is a DaemonSet that deletes those pods once all GPU/VF nodes have at least 1 allocatable device. I've done this
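
The selection logic of such a cleanup could look roughly like the sketch below. It is not the implementation mentioned above, only an illustration under a few assumptions: nodes and pods are plain dicts shaped like the Kubernetes API JSON objects, rejected pods end up in phase `Failed` with the "Allocate failed due to" text in `status.message`, and a node counts as a GPU/VF node when the resource appears in its `status.capacity`. The function names `device_nodes_ready` and `rejected_pods` are hypothetical, and the actual delete call against the API server is left out.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Substring taken from the error message quoted in this issue.
REJECT_MARKER = "Allocate failed due to"


def device_nodes_ready(nodes: Iterable[Dict[str, Any]],
                       resources: Iterable[str]) -> bool:
    """Return True when every node that lists one of `resources` in
    status.capacity has at least 1 of it in status.allocatable."""
    wanted = set(resources)
    for node in nodes:
        status = node.get("status", {})
        capacity = status.get("capacity", {})
        allocatable = status.get("allocatable", {})
        for name in wanted & set(capacity):
            # Assumption: device resources are reported as whole-number strings.
            if int(allocatable.get(name, "0")) < 1:
                return False
    return True


def rejected_pods(pods: Iterable[Dict[str, Any]]) -> List[Tuple[str, str]]:
    """Return (namespace, name) for pods that look rejected by device allocation."""
    selected = []
    for pod in pods:
        status = pod.get("status", {})
        message = status.get("message") or ""
        # Assumption: rejected pods are in phase Failed and carry the marker text.
        if status.get("phase") == "Failed" and REJECT_MARKER in message:
            meta = pod.get("metadata", {})
            selected.append((meta.get("namespace", ""), meta.get("name", "")))
    return selected
```

A cleanup loop would call `device_nodes_ready` first and only delete the pods returned by `rejected_pods` once it returns `True`, so that the recreated pods are not rejected again. The input lists could come from the `items` field of `kubectl get nodes -o json` and `kubectl get pods -A -o json`.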
