
Checkpoint inconsistencies, training vs validation scores do not match #222

@darinhitchings

Description


Describe the bug

We have been trying to verify that this YOLOv9-c model, using the provided weights, actually performs at the level cited in the original paper, and we have been unable to do so. We're using the AP@0.5:0.95 score as our metric of choice. I'm encountering four issues here:

  1. When I train the model with the pretrained weights for 1 epoch with learning rate 0 and weight_decay 0, the model produces an AP@0.5:0.95 score of 0.4803, not 0.53 as the paper states. (We also tested the code base this repository is based on, https://github.com/WongKinYiu/yolov9, hereafter the "WKY version", with this same scheme of training for 1 epoch with zeroed-out learning rates and got a score of 0.528, which is within tolerance.)

  2. When I use a 'validation' run, the score produced is about 0.5158, which matches neither the paper nor the value reported when doing a training run for 1 epoch with learning rate 0.

  3. We started looking at the differences in the source data and the checkpoint files. The source data we're using checks out: it produced the value of 0.528 with the original model. We then compared checkpoints. Starting from the original checkpoint we downloaded, ./weights/v9-c.pt (filesize 102895262, MD5 checksum 38332a6a95eb4c3239e726276cf3a1ed), we trained for 1 epoch and saved a new *.ckpt file. The two files have slightly different structures (the *.ckpt contains other things besides the state_dict values, which I'm ignoring), so we mapped every key of the form x in the original *.pt file to the keys model.model.x and ema.model.x in the state_dict of the *.ckpt saved after 1 round of training (with learning rate 0) so that the keys could be compared across files. We're finding differences in the values of these keys, which should not happen when the learning rate is 0. Aside from the parameters associated with batch-norm statistics (which can be expected to change and can therefore be ignored), I'm finding other keys that differ:

The v1 value here is from the original weight file with extension *.pt; the v2 value is from the *.ckpt file saved after 1 round of trivial training (a minimal sketch of the comparison script follows the listing):

('23.conv.bias', 'model'): max abs diff = 1.2422e-05 v1:-0.0143585205078125 v2:-0.014370942488312721
('23.conv.bias', 'ema'): max abs diff = 5.18374e-06 v1:-0.019012451171875 v2:-0.01900726743042469
('24.conv.bias', 'model'): max abs diff = 1.44839e-05 v1:-0.02899169921875 v2:-0.02900618314743042
('24.conv.bias', 'ema'): max abs diff = 8.63965e-06 v1:0.0036449432373046875 v2:0.0036535828839987516
('25.conv.bias', 'model'): max abs diff = 1.74399e-05 v1:-0.0311737060546875 v2:-0.03115626610815525
('25.conv.bias', 'ema'): max abs diff = 9.68762e-06 v1:-0.0311737060546875 v2:-0.03116401843726635
('38.heads.0.anchor_conv.2.bias', 'model'): max abs diff = 0.0331876 v1:1.849609375 v2:1.8164217472076416
('38.heads.0.anchor_conv.2.bias', 'ema'): max abs diff = 0.0166272 v1:1.546875 v2:1.5635021924972534
('38.heads.0.class_conv.2.bias', 'model'): max abs diff = 0.0738459 v1:-8.7265625 v2:-8.652716636657715
('38.heads.0.class_conv.2.bias', 'ema'): max abs diff = 0.0522156 v1:-8.7265625 v2:-8.674346923828125
('38.heads.1.anchor_conv.2.bias', 'model'): max abs diff = 0.0641969 v1:1.6845703125 v2:1.6203733682632446
('38.heads.1.anchor_conv.2.bias', 'ema'): max abs diff = 0.0266552 v1:1.6845703125 v2:1.6579151153564453
('38.heads.1.class_conv.2.bias', 'model'): max abs diff = 0.0598283 v1:-7.87109375 v2:-7.811265468597412
('38.heads.1.class_conv.2.bias', 'ema'): max abs diff = 0.0288258 v1:-7.87109375 v2:-7.842267990112305
('38.heads.2.anchor_conv.2.bias', 'model'): max abs diff = 0.0517824 v1:1.7216796875 v2:1.6698973178863525
('38.heads.2.anchor_conv.2.bias', 'ema'): max abs diff = 0.0360591 v1:1.7216796875 v2:1.6856205463409424
('38.heads.2.class_conv.2.bias', 'model'): max abs diff = 0.0743265 v1:-8.46875 v2:-8.394423484802246
('38.heads.2.class_conv.2.bias', 'ema'): max abs diff = 0.0221138 v1:-8.46875 v2:-8.446636199951172
('22.heads.0.anchor_conv.2.bias', 'model'): max abs diff = 0.115278 v1:3.025390625 v2:2.9101126194000244
('22.heads.0.anchor_conv.2.bias', 'ema'): max abs diff = 0.0406651 v1:3.025390625 v2:2.9847254753112793
('22.heads.0.class_conv.2.bias', 'model'): max abs diff = 0.083271 v1:-11.0546875 v2:-10.971416473388672
('22.heads.0.class_conv.2.bias', 'ema'): max abs diff = 0.0379944 v1:-10.8359375 v2:-10.797943115234375
('22.heads.1.anchor_conv.2.bias', 'model'): max abs diff = 0.124773 v1:1.8779296875 v2:1.753156304359436
('22.heads.1.anchor_conv.2.bias', 'ema'): max abs diff = 0.0779871 v1:1.8779296875 v2:1.7999426126480103
('22.heads.1.class_conv.2.bias', 'model'): max abs diff = 0.120522 v1:-9.203125 v2:-9.082603454589844
('22.heads.1.class_conv.2.bias', 'ema'): max abs diff = 0.0431719 v1:-9.203125 v2:-9.159953117370605
('22.heads.2.anchor_conv.2.bias', 'model'): max abs diff = 0.100173 v1:1.3056640625 v2:1.405837059020996
('22.heads.2.anchor_conv.2.bias', 'ema'): max abs diff = 0.0557531 v1:2.013671875 v2:1.9579187631607056
('22.heads.2.class_conv.2.bias', 'model'): max abs diff = 0.180182 v1:-7.21875 v2:-7.038567543029785
('22.heads.2.class_conv.2.bias', 'ema'): max abs diff = 0.0730696 v1:-7.21875 v2:-7.1456804275512695
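
For reference, a minimal sketch in the spirit of the attached compare_checkpoints.py. The paths are placeholders, and the assumption that the *.pt file holds a flat state_dict (or wraps one under a 'state_dict' key) is specific to our files and may need adjusting:

import torch

# Original pretrained weights and the checkpoint saved after one zero-LR epoch.
# Both paths below are placeholders for our local files.
orig = torch.load("weights/v9-c.pt", map_location="cpu")
ckpt = torch.load("saved_after_1_epoch.ckpt", map_location="cpu")["state_dict"]

# If the *.pt wraps its weights (e.g. under a 'state_dict' key), unwrap it.
if isinstance(orig, dict) and "state_dict" in orig:
    orig = orig["state_dict"]

# Every key x in the *.pt should appear as model.model.x and ema.model.x in the
# *.ckpt; with learning rate 0 the values should match exactly.
for key, v1 in orig.items():
    if not torch.is_tensor(v1):
        continue
    for prefix, tag in (("model.model.", "model"), ("ema.model.", "ema")):
        v2 = ckpt.get(prefix + key)
        if v2 is None:
            print(f"missing in ckpt: {prefix + key}")
            continue
        diff = (v1.float() - v2.float()).abs().max().item()
        if diff > 0:
            print(f"({key!r}, {tag!r}): max abs diff = {diff:.6g}")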

  4. I have temporarily modified the lazy.py file to execute both a 'validation' run and a 'training' run in the same program session when the task is set to 'train' (sketched below). What I'm finding is that if I first validate and then train (for 1 epoch, with learning rate 0 and no weight decay), I get a score of ~0.50 for the validation and 0.48 for the training; it should be 0.53 in both cases for the YOLOv9-c model. If, however, I train first and validate second in the same program run, then I get 0.48 for both the validation and the training reports. That's very odd.
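For context, the modification to lazy.py boils down to calling the validation and fit entry points back to back on the same Trainer. A minimal sketch of the two orderings, assuming the Lightning-style Trainer/model/datamodule objects that lazy.py already constructs; the import path and names here are illustrative, not the repo's actual code:

from lightning import Trainer  # may be pytorch_lightning depending on the install

def validate_then_train(model, datamodule):
    # Order A: validation first, then one epoch of zero-LR training.
    trainer = Trainer(max_epochs=1)
    trainer.validate(model, datamodule=datamodule)  # we see ~0.50 AP@0.5:0.95 here
    trainer.fit(model, datamodule=datamodule)       # and ~0.48 here

def train_then_validate(model, datamodule):
    # Order B: one epoch of zero-LR training first, then validation.
    trainer = Trainer(max_epochs=1)
    trainer.fit(model, datamodule=datamodule)       # ~0.48
    trainer.validate(model, datamodule=datamodule)  # also ~0.48, not ~0.50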

To Reproduce

Steps to reproduce the behavior:

  1. To run training, we are using this command line:

clear; torchrun --nproc_per_node=8 yolo/lazy.py task=train device=[0,1,2,3,4,5,6,7] task.data.batch_size=16 task.epoch=1 weight=weights/v9-c.pt 2>&1 | tee ./runs/yolov9_training_log.txt

after setting the learning_rate and weight_decay values to 0 in the train.yaml file (a runtime check that the zeroed values actually reach the optimizer is sketched after these steps).

  2. To run validation, we are using:

clear; CUDA_VISIBLE_DEVICES=0 python yolo/lazy.py task=validation 2>&1 | tee ./runs/yolov9_training_log.txt
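
To confirm that the zeroed config values actually reach the optimizer at runtime, we also print the param groups. A small sketch using the standard torch.optim API; where it is hooked in (a callback or the training loop) depends on the repo and is not shown:

def assert_zero_hyperparams(optimizer):
    # Every param group of the torch.optim optimizer should carry lr=0 and weight_decay=0.
    for i, group in enumerate(optimizer.param_groups):
        lr = group.get("lr")
        wd = group.get("weight_decay", 0.0)
        print(f"param_group {i}: lr={lr}, weight_decay={wd}")
        assert lr == 0.0, f"param_group {i} has non-zero lr: {lr}"
        assert wd == 0.0, f"param_group {i} has non-zero weight_decay: {wd}"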

Expected behavior

  1. I'm expecting the validation output to match the value cited in the source paper, 0.53, for the AP@0.5:0.95 statistic.
  2. I'm expecting the value of this AP score when training for 1 epoch with learning rate 0 to match the score when the model is validated.
  3. I am not expecting the value of these scores to depend on whether the validation or the training runs first after I modify lazy.py to do both operations in series.
  4. I am not expecting any difference between the values in the checkpoint file *.ckpt and the original values in the *.pt file when the learning rate is 0.

Screenshots

Validation output:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Validate metric ┃ DataLoader 0 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ PyCOCO/AP @ .5 │ 0.6810915470123291 │
│ PyCOCO/AP @ .5:.95 │ 0.5158287286758423 │
│ map │ 0.5158287286758423 │
│ map_50 │ 0.6810915470123291 │
│ map_75 │ 0.5603926777839661 │
│ map_large │ 0.6733922362327576 │
│ map_medium │ 0.5049033164978027 │
│ map_per_class │ -1.0 │
│ map_small │ 0.26766273379325867 │
│ mar_1 │ 0.391811728477478 │
│ mar_10 │ 0.658163845539093 │
│ mar_100 │ 0.7201112508773804 │
│ mar_100_per_class │ -1.0 │
│ mar_large │ 0.8452091813087463 │
│ mar_medium │ 0.7402596473693848 │
│ mar_small │ 0.5012239813804626 │
└───────────────────────────┴───────────────────────────┘
Valid | mAP : 60.38 | mAP50 : 67.30 | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 157/157 0:02:19 • 0:00:00 0.98it/s
┏━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Epoch ┃ Avg. Precision ┃ % ┃ Avg. Recall ┃ % ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ 0 │ AP @ .5:.95 │ 51.58 │ AR maxDets 1 │ 39.18 │
│ 0 │ AP @ .5 │ 68.11 │ AR maxDets 10 │ 65.82 │
│ 0 │ AP @ .75 │ 56.04 │ AR maxDets 100 │ 72.01 │
│ 0 │ AP (small) │ 26.77 │ AR (small) │ 50.12 │
│ 0 │ AP (medium) │ 50.49 │ AR (medium) │ 74.03 │
│ 0 │ AP (large) │ 67.34 │ AR (large) │ 84.52 │
└───────┴────────────────┴───────┴────────────────┴───────┘

Training Output (for 1 epoch, learning rates set to 0, starting from the provided weight file):

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Validate metric ┃ DataLoader 0 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ PyCOCO/AP @ .5 │ 0.6434592008590698 │
│ PyCOCO/AP @ .5:.95 │ 0.48034167289733887 │
│ map │ 0.48034167289733887 │
│ map_50 │ 0.6434592008590698 │
│ map_75 │ 0.5224654674530029 │
│ map_large │ 0.6373392343521118 │
│ map_medium │ 0.46543779969215393 │
│ map_per_class │ -1.0 │
│ map_small │ 0.23354189097881317 │
│ mar_1 │ 0.3763301968574524 │
│ mar_10 │ 0.6356218457221985 │
│ mar_100 │ 0.7002217769622803 │
│ mar_100_per_class │ -1.0 │
│ mar_large │ 0.8251056671142578 │
│ mar_medium │ 0.7188176512718201 │
│ mar_small │ 0.47954803705215454 │
└───────────────────────────┴───────────────────────────┘
Valid | mAP : 60.21 | mAP50 : 73.19 | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 157/157 0:02:46 • 0:00:00 0.82it/s
┏━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Epoch ┃ Avg. Precision ┃ % ┃ Avg. Recall ┃ % ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ 1 │ AP @ .5:.95 │ 48.03 │ AR maxDets 1 │ 37.63 │
│ 1 │ AP @ .5 │ 64.35 │ AR maxDets 10 │ 63.56 │
│ 1 │ AP @ .75 │ 52.25 │ AR maxDets 100 │ 70.02 │
│ 1 │ AP (small) │ 23.35 │ AR (small) │ 47.95 │
│ 1 │ AP (medium) │ 46.54 │ AR (medium) │ 71.88 │
│ 1 │ AP (large) │ 63.73 │ AR (large) │ 82.51 │
└───────┴────────────────┴───────┴────────────────┴───────┘

System Info (please complete the following information):

  • OS: Linux
  • Python Version: 3.12.11
  • PyTorch Version: 2.8.0+cu128
  • CUDA/cuDNN/MPS Version:
    Built on Wed_Nov_22_10:17:15_PST_2023
    Cuda compilation tools, release 12.3, V12.3.107
    Build cuda_12.3.r12.3/compiler.33567101_0
  • YOLO Model Version: YOLOv9-c

compare_checkpoints.py

Labels: bug (Something isn't working)