Preflight+Post-mortem machine health checks #386

@JamesKunstle

It'd be nice to verify that a machine passes some heuristic health checks before kicking off a long-running training job. Some common things to check:

  1. Host -> Card throughput
  2. Card -> Card ring throughput
  3. Card power throttling
  4. Card memory row remapping
  5. Card memory page tainting

to name a few broad categories.
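As a rough illustration of item 4 only, here is a minimal sketch that parses the CSV output of `nvidia-smi --query-remapped-rows` and flags cards worth a closer look. The function names and the pass/fail policy (flag any uncorrectable remap, pending remap, or remap failure) are assumptions for illustration, not an agreed design, and the exact value formatting may vary across driver versions, so the parser is deliberately lenient.

```python
import csv
import io
import subprocess

# Fields accepted by `nvidia-smi --query-remapped-rows`.
REMAP_FIELDS = [
    "gpu_bus_id",
    "remapped_rows.correctable",
    "remapped_rows.uncorrectable",
    "remapped_rows.pending",
    "remapped_rows.failure",
]


def query_remapped_rows():
    """Run nvidia-smi and return its raw CSV output (requires an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-remapped-rows=" + ",".join(REMAP_FIELDS),
        "--format=csv,noheader",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def _is_set(value):
    """Treat anything other than zero / no / not-available as a raised flag."""
    return value.strip().lower() not in ("0", "no", "false", "", "n/a", "[n/a]")


def parse_remapped_rows(csv_text):
    """Map each suspicious GPU bus id to a list of reasons (hypothetical policy)."""
    problems = {}
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != len(REMAP_FIELDS):
            continue  # skip blank or malformed lines
        bus_id, _correctable, uncorrectable, pending, failure = row
        reasons = []
        if _is_set(uncorrectable):
            reasons.append("uncorrectable remaps: " + uncorrectable.strip())
        if _is_set(pending):
            reasons.append("remap pending (GPU reset needed)")
        if _is_set(failure):
            reasons.append("remap failure occurred")
        if reasons:
            problems[bus_id.strip()] = reasons
    return problems
```

A preflight script could call `parse_remapped_rows(query_remapped_rows())` and refuse to launch if the result is non-empty. The other items would need their own probes (for example, a bandwidth test for host-to-card throughput).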

Occasionally, our training runs fail mid-run, possibly because collectives time out (in NCCL, RCCL, or HCCL). Watchdog timers currently kill the training job so it can be restarted in these situations, but it'd be great to do better than that.

NVIDIA DCGM diagnostic levels 1 and 2 can provide card-level diagnostics for pre-flight checks, and levels 3 and 4 can log some post-mortem info. However, these currently have to be run manually.
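One way to remove the manual step would be to wrap `dcgmi diag -r <level>` in the job launcher. The sketch below assumes `dcgmi` is on `PATH` and that a nonzero exit code signals a failed diagnostic. That exit-code assumption should be checked against the installed DCGM version. The helper names are hypothetical.

```python
import subprocess


def dcgm_diag_command(level):
    """Build the dcgmi invocation for a diagnostic run level (1 through 4)."""
    if level not in (1, 2, 3, 4):
        raise ValueError("DCGM diag level must be 1, 2, 3, or 4")
    return ["dcgmi", "diag", "-r", str(level)]


def run_dcgm_diag(level, timeout_s=None):
    """Run the diagnostic and return (passed, captured stdout).

    Assumption: a zero exit code means every test in the run level passed.
    """
    proc = subprocess.run(
        dcgm_diag_command(level),
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return proc.returncode == 0, proc.stdout
```

A launcher could call `run_dcgm_diag(1)` before training starts, and run level 3 or 4 from the watchdog's failure path so the output is saved alongside the job logs.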
