Description
The `overall_throughput` value calculated in https://github.com/instructlab/training/blob/main/src/instructlab/training/main_ds.py#L422 uses `args.samples_per_gpu` as the batch size instead of the actual micro batch size.
The batch size differs at each step, but `overall_throughput` is calculated from a constant value.
Here is part of a log, for example, with `batch_size` values of 125, 112, and 121:
Epoch 0: 97%|█████████▋| 76/78 [03:54<00:05, 2.94s/it]{
"epoch": 0,
"step": 76,
"rank": 0,
"overall_throughput": 44.94857943825548,
"lr": 2.0000000000000003e-06,
"cuda_mem_allocated": 1.2444758415222168,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 25623,
"batch_size": 125,
"total_loss": 3.9130468719509817,
"samples_seen": 9661,
"timestamp": "2024-12-20T13:51:34.253834"
}
Epoch: 0, Step: 77, Rank: 3, loss = 0.95703125
Epoch: 0, Step: 77, Rank: 1, loss = 0.71484375
Epoch: 0, Step: 77, Rank: 2, loss = 0.64453125
Epoch: 0, Step: 77, Rank: 5, loss = 2.953125
Epoch: 0, Step: 77, Rank: 7, loss = 12.5
Epoch: 0, Step: 77, Rank: 6, loss = 10.5
Epoch: 0, Step: 77, Rank: 4, loss = 1.765625
Epoch: 0, Step: 77, Rank: 0, loss = 0.921875
Epoch 0: 99%|█████████▊| 77/78 [03:57<00:02, 2.89s/it]{
"epoch": 0,
"step": 77,
"rank": 0,
"overall_throughput": 47.957271498777644,
"lr": 2.0000000000000003e-06,
"cuda_mem_allocated": 1.2483596801757812,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 23046,
"batch_size": 112,
"total_loss": 3.8774624663716044,
"samples_seen": 9773,
"timestamp": "2024-12-20T13:51:37.052739"
}
Epoch: 0, Step: 78, Rank: 0, loss = 0.8671875
Epoch: 0, Step: 78, Rank: 5, loss = 2.15625
Epoch: 0, Step: 78, Rank: 7, loss = 12.75
Epoch: 0, Step: 78, Rank: 3, loss = 0.72265625
Epoch: 0, Step: 78, Rank: 4, loss = 1.1640625
Epoch: 0, Step: 78, Rank: 6, loss = 14.8125
Epoch: 0, Step: 78, Rank: 2, loss = 0.57421875
Epoch: 0, Step: 78, Rank: 1, loss = 0.2314453125
Epoch 0: 100%|██████████| 78/78 [04:00<00:00, 2.91s/it]{
"epoch": 0,
"step": 78,
"rank": 0,
"overall_throughput": 45.40726680806918,
"lr": 2.0000000000000003e-06,
"cuda_mem_allocated": 1.2466816902160645,
"cuda_malloc_retries": 0,
"num_loss_counted_tokens": 27044,
"batch_size": 121,
"total_loss": 4.160331311936104,
"samples_seen": 9894,
"timestamp": "2024-12-20T13:51:39.872213"
}
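A minimal sketch of the intended behavior (the helper name `step_throughput` is hypothetical, not from the linked code): throughput for a step should be derived from the `batch_size` actually logged for that step, not from the constant `args.samples_per_gpu`:

```python
def step_throughput(batch_size: int, elapsed_seconds: float) -> float:
    """Samples per second for one step.

    Uses the step's actual (variable) batch size rather than a constant
    configured value such as args.samples_per_gpu.
    """
    return batch_size / elapsed_seconds

# With multipack-style batching the per-step batch size varies
# (125, 112, 121 in the log above), so each step's throughput
# must be computed from that step's own batch size.
print(step_throughput(125, 2.94))
print(step_throughput(112, 2.89))
```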