Pods needing GPU or SR-IOV fail scheduling after node restart #132

After a node restart, some pods that need a GPU or a VF might fail because the device resources are not yet ready.
Recovering requires manually deleting the failing pods.
For example, a failing pod can report an error such as:
_Message:          Pod was rejected: Allocate failed due to no healthy devices present; cannot allocate unhealthy devices openshift.io/media_a_rx_pool, which is unexpected,_

One possible solution is a DaemonSet that deletes those pods once all GPU/VF nodes have at least one allocatable device. I've done this
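
The cleanup logic described above could be sketched roughly as follows. This is not the implementation mentioned in this issue, only a minimal illustration under a few assumptions: `kubectl` is on the path with sufficient permissions, rejected pods are in phase `Failed` with a status message containing `Allocate failed due to` (taken from the error quoted above), and a node counts as a GPU/VF node when it lists the resource under `status.capacity`. The function names and the `resource_names` parameter are hypothetical.

```python
import json
import subprocess

# Substring taken from the error message quoted in this issue.
REJECT_MARKER = "Allocate failed due to"


def devices_ready(nodes, resource_names):
    """Return True when every node that advertises one of the given
    extended resources has at least one of them allocatable."""
    for node in nodes:
        status = node.get("status", {})
        capacity = status.get("capacity") or {}
        allocatable = status.get("allocatable") or {}
        for name in resource_names:
            # Assumption: device counts are plain integer quantities.
            if name in capacity and int(allocatable.get(name) or "0") < 1:
                return False
    return True


def rejected_pods(pods):
    """Return (namespace, name) for pods rejected by device allocation."""
    found = []
    for pod in pods:
        status = pod.get("status", {})
        message = status.get("message") or ""
        if status.get("phase") == "Failed" and REJECT_MARKER in message:
            meta = pod["metadata"]
            found.append((meta["namespace"], meta["name"]))
    return found


def kubectl_items(kind, *args):
    """Fetch a resource list as parsed JSON via kubectl."""
    result = subprocess.run(
        ["kubectl", "get", kind, *args, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)["items"]


def cleanup_once(resource_names):
    """Delete rejected pods, but only once all device nodes are ready."""
    if not devices_ready(kubectl_items("nodes"), resource_names):
        return []
    targets = rejected_pods(kubectl_items("pods", "--all-namespaces"))
    for namespace, name in targets:
        subprocess.run(
            ["kubectl", "delete", "pod", name, "-n", namespace], check=True
        )
    return targets
```

A loop running in a pod could call `cleanup_once(["openshift.io/media_a_rx_pool"])` periodically. Deleted pods are only recreated if a controller such as a Deployment or StatefulSet owns them. A node whose device plugin has not yet reported any capacity is skipped by this check, so a real implementation may need a stricter readiness test.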
