
Checkpoint inconsistencies, training vs validation scores do not match #222

@darinhitchings

Description


Describe the bug

We have been trying to verify that this YOLOv9-c model, using the provided weights, actually performs at the level cited in the original paper, and we have been unable to do so. We're using the AP@0.5:0.95 score as our metric of choice. I'm encountering four issues here:

  1. When I train the model with the pretrained weights for 1 epoch with learning rate 0 and weight_decay 0, the model produces an AP@0.5:0.95 score of 0.4803, not 0.53 as the paper states. (We also tested the code base this repository is based on, https://github.com/WongKinYiu/yolov9, hereafter the "WKY version", with this same scheme of training for 1 epoch with zeroed-out learning rates and got a score of 0.528, which is within tolerance.)

  2. When I use a 'validation' run, the score produced is about 0.5158, which matches neither the paper nor the value reported when doing a training run for 1 epoch with learning rate 0.

  3. We started looking at the differences in the source data and the checkpoint files. The source data we're using checks out: it produced the value of 0.528 with the original model. We then compared checkpoints. Starting from the original checkpoint we downloaded, ./weights/v9-c.pt (filesize 102895262, MD5 checksum 38332a6a95eb4c3239e726276cf3a1ed), we trained for 1 epoch and saved a new *.ckpt file. The two files have slightly different structures (the *.ckpt contains other things besides the state_dict values, which I'm ignoring), so we mapped every key of the form x in the original *.pt file to the keys model.model.x and ema.model.x in the state_dict of the *.ckpt saved after 1 round of training (with learning rate 0) so that the keys could be compared across files. We're finding differences in the values of these keys, which should not happen when the learning rate is 0. Aside from the parameters associated with batch-norm statistics (which can be expected to change and can therefore be ignored), I'm finding other keys that differ:

The v1 value here is from the original weight file with extension *.pt; the v2 value is from the *.ckpt file saved after 1 round of trivial training (a minimal sketch of the comparison script follows the listing):

('23.conv.bias', 'model'): max abs diff = 1.2422e-05 v1:-0.0143585205078125 v2:-0.014370942488312721
('23.conv.bias', 'ema'): max abs diff = 5.18374e-06 v1:-0.019012451171875 v2:-0.01900726743042469
('24.conv.bias', 'model'): max abs diff = 1.44839e-05 v1:-0.02899169921875 v2:-0.02900618314743042
('24.conv.bias', 'ema'): max abs diff = 8.63965e-06 v1:0.0036449432373046875 v2:0.0036535828839987516
('25.conv.bias', 'model'): max abs diff = 1.74399e-05 v1:-0.0311737060546875 v2:-0.03115626610815525
('25.conv.bias', 'ema'): max abs diff = 9.68762e-06 v1:-0.0311737060546875 v2:-0.03116401843726635
('38.heads.0.anchor_conv.2.bias', 'model'): max abs diff = 0.0331876 v1:1.849609375 v2:1.8164217472076416
('38.heads.0.anchor_conv.2.bias', 'ema'): max abs diff = 0.0166272 v1:1.546875 v2:1.5635021924972534
('38.heads.0.class_conv.2.bias', 'model'): max abs diff = 0.0738459 v1:-8.7265625 v2:-8.652716636657715
('38.heads.0.class_conv.2.bias', 'ema'): max abs diff = 0.0522156 v1:-8.7265625 v2:-8.674346923828125
('38.heads.1.anchor_conv.2.bias', 'model'): max abs diff = 0.0641969 v1:1.6845703125 v2:1.6203733682632446
('38.heads.1.anchor_conv.2.bias', 'ema'): max abs diff = 0.0266552 v1:1.6845703125 v2:1.6579151153564453
('38.heads.1.class_conv.2.bias', 'model'): max abs diff = 0.0598283 v1:-7.87109375 v2:-7.811265468597412
('38.heads.1.class_conv.2.bias', 'ema'): max abs diff = 0.0288258 v1:-7.87109375 v2:-7.842267990112305
('38.heads.2.anchor_conv.2.bias', 'model'): max abs diff = 0.0517824 v1:1.7216796875 v2:1.6698973178863525
('38.heads.2.anchor_conv.2.bias', 'ema'): max abs diff = 0.0360591 v1:1.7216796875 v2:1.6856205463409424
('38.heads.2.class_conv.2.bias', 'model'): max abs diff = 0.0743265 v1:-8.46875 v2:-8.394423484802246
('38.heads.2.class_conv.2.bias', 'ema'): max abs diff = 0.0221138 v1:-8.46875 v2:-8.446636199951172
('22.heads.0.anchor_conv.2.bias', 'model'): max abs diff = 0.115278 v1:3.025390625 v2:2.9101126194000244
('22.heads.0.anchor_conv.2.bias', 'ema'): max abs diff = 0.0406651 v1:3.025390625 v2:2.9847254753112793
('22.heads.0.class_conv.2.bias', 'model'): max abs diff = 0.083271 v1:-11.0546875 v2:-10.971416473388672
('22.heads.0.class_conv.2.bias', 'ema'): max abs diff = 0.0379944 v1:-10.8359375 v2:-10.797943115234375
('22.heads.1.anchor_conv.2.bias', 'model'): max abs diff = 0.124773 v1:1.8779296875 v2:1.753156304359436
('22.heads.1.anchor_conv.2.bias', 'ema'): max abs diff = 0.0779871 v1:1.8779296875 v2:1.7999426126480103
('22.heads.1.class_conv.2.bias', 'model'): max abs diff = 0.120522 v1:-9.203125 v2:-9.082603454589844
('22.heads.1.class_conv.2.bias', 'ema'): max abs diff = 0.0431719 v1:-9.203125 v2:-9.159953117370605
('22.heads.2.anchor_conv.2.bias', 'model'): max abs diff = 0.100173 v1:1.3056640625 v2:1.405837059020996
('22.heads.2.anchor_conv.2.bias', 'ema'): max abs diff = 0.0557531 v1:2.013671875 v2:1.9579187631607056
('22.heads.2.class_conv.2.bias', 'model'): max abs diff = 0.180182 v1:-7.21875 v2:-7.038567543029785
('22.heads.2.class_conv.2.bias', 'ema'): max abs diff = 0.0730696 v1:-7.21875 v2:-7.1456804275512695
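
For reference, a minimal sketch in the spirit of the attached compare_checkpoints.py. The paths are placeholders, and the assumption that the *.pt file holds a flat state_dict (or wraps one under a 'state_dict' key) is specific to our files and may need adjusting:

import torch

# Original pretrained weights and the checkpoint saved after one zero-LR epoch.
# Both paths below are placeholders for our local files.
orig = torch.load("weights/v9-c.pt", map_location="cpu")
ckpt = torch.load("saved_after_1_epoch.ckpt", map_location="cpu")["state_dict"]

# If the *.pt wraps its weights (e.g. under a 'state_dict' key), unwrap it.
if isinstance(orig, dict) and "state_dict" in orig:
    orig = orig["state_dict"]

# Every key x in the *.pt should appear as model.model.x and ema.model.x in the
# *.ckpt; with learning rate 0 the values should match exactly.
for key, v1 in orig.items():
    if not torch.is_tensor(v1):
        continue
    for prefix, tag in (("model.model.", "model"), ("ema.model.", "ema")):
        v2 = ckpt.get(prefix + key)
        if v2 is None:
            print(f"missing in ckpt: {prefix + key}")
            continue
        diff = (v1.float() - v2.float()).abs().max().item()
        if diff > 0:
            print(f"({key!r}, {tag!r}): max abs diff = {diff:.6g}")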

  4. I have temporarily modified the lazy.py file to execute both a 'validation' run and a 'training' run in the same program session when the task is set to 'train' (sketched below). What I'm finding is that if I first validate and then train (for 1 epoch, with learning rate 0 and no weight decay), I get a score of ~0.50 for the validation and 0.48 for the training; it should be 0.53 in both cases for the YOLOv9-c model. If, however, I train first and validate second in the same program run, then I get 0.48 for both the validation and the training reports. That's very odd.
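For context, the modification to lazy.py boils down to calling the validation and fit entry points back to back on the same Trainer. A minimal sketch of the two orderings, assuming the Lightning-style Trainer/model/datamodule objects that lazy.py already constructs; the import path and names here are illustrative, not the repo's actual code:

from lightning import Trainer  # may be pytorch_lightning depending on the install

def validate_then_train(model, datamodule):
    # Order A: validation first, then one epoch of zero-LR training.
    trainer = Trainer(max_epochs=1)
    trainer.validate(model, datamodule=datamodule)  # we see ~0.50 AP@0.5:0.95 here
    trainer.fit(model, datamodule=datamodule)       # and ~0.48 here

def train_then_validate(model, datamodule):
    # Order B: one epoch of zero-LR training first, then validation.
    trainer = Trainer(max_epochs=1)
    trainer.fit(model, datamodule=datamodule)       # ~0.48
    trainer.validate(model, datamodule=datamodule)  # also ~0.48, not ~0.50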

To Reproduce

Steps to reproduce the behavior:

  1. To run training, we are using this command line:

clear; torchrun --nproc_per_node=8 yolo/lazy.py task=train device=[0,1,2,3,4,5,6,7] task.data.batch_size=16 task.epoch=1 weight=weights/v9-c.pt 2>&1 | tee ./runs/yolov9_training_log.txt

after setting the learning_rate and weight_decay values to 0 in the train.yaml file (a runtime check that the zeroed values actually reach the optimizer is sketched after these steps).

  2. To run validation, we are using:

clear; CUDA_VISIBLE_DEVICES=0 python yolo/lazy.py task=validation 2>&1 | tee ./runs/yolov9_training_log.txt
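
To confirm that the zeroed config values actually reach the optimizer at runtime, we also print the param groups. A small sketch using the standard torch.optim API; where it is hooked in (a callback or the training loop) depends on the repo and is not shown:

def assert_zero_hyperparams(optimizer):
    # Every param group of the torch.optim optimizer should carry lr=0 and weight_decay=0.
    for i, group in enumerate(optimizer.param_groups):
        lr = group.get("lr")
        wd = group.get("weight_decay", 0.0)
        print(f"param_group {i}: lr={lr}, weight_decay={wd}")
        assert lr == 0.0, f"param_group {i} has non-zero lr: {lr}"
        assert wd == 0.0, f"param_group {i} has non-zero weight_decay: {wd}"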

Expected behavior

  1. I'm expecting the validation output to match the value cited in the source paper, 0.53, for the AP@0.5:0.95 statistic.
  2. I'm expecting the value of this AP score when training for 1 epoch with learning rate 0 to match the score when the model is validated.
  3. I am not expecting the value of these scores to depend on whether the validation or the training runs first after I modify lazy.py to do both operations in series.
  4. I am not expecting any difference between the values in the checkpoint file *.ckpt and the original values in the *.pt file when the learning rate is 0.

Screenshots

Validation output:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Validate metric ┃ DataLoader 0 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ PyCOCO/AP @ .5 │ 0.6810915470123291 │
│ PyCOCO/AP @ .5:.95 │ 0.5158287286758423 │
│ map │ 0.5158287286758423 │
│ map_50 │ 0.6810915470123291 │
│ map_75 │ 0.5603926777839661 │
│ map_large │ 0.6733922362327576 │
│ map_medium │ 0.5049033164978027 │
│ map_per_class │ -1.0 │
│ map_small │ 0.26766273379325867 │
│ mar_1 │ 0.391811728477478 │
│ mar_10 │ 0.658163845539093 │
│ mar_100 │ 0.7201112508773804 │
│ mar_100_per_class │ -1.0 │
│ mar_large │ 0.8452091813087463 │
│ mar_medium │ 0.7402596473693848 │
│ mar_small │ 0.5012239813804626 │
└───────────────────────────┴───────────────────────────┘
Valid | mAP : 60.38 | mAP50 : 67.30 | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 157/157 0:02:19 • 0:00:00 0.98it/s
┏━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Epoch ┃ Avg. Precision ┃ % ┃ Avg. Recall ┃ % ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ 0 │ AP @ .5:.95 │ 51.58 │ AR maxDets 1 │ 39.18 │
│ 0 │ AP @ .5 │ 68.11 │ AR maxDets 10 │ 65.82 │
│ 0 │ AP @ .75 │ 56.04 │ AR maxDets 100 │ 72.01 │
│ 0 │ AP (small) │ 26.77 │ AR (small) │ 50.12 │
│ 0 │ AP (medium) │ 50.49 │ AR (medium) │ 74.03 │
│ 0 │ AP (large) │ 67.34 │ AR (large) │ 84.52 │
└───────┴────────────────┴───────┴────────────────┴───────┘

Training Output (for 1 epoch, learning rates set to 0, starting from the provided weight file):

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Validate metric ┃ DataLoader 0 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ PyCOCO/AP @ .5 │ 0.6434592008590698 │
│ PyCOCO/AP @ .5:.95 │ 0.48034167289733887 │
│ map │ 0.48034167289733887 │
│ map_50 │ 0.6434592008590698 │
│ map_75 │ 0.5224654674530029 │
│ map_large │ 0.6373392343521118 │
│ map_medium │ 0.46543779969215393 │
│ map_per_class │ -1.0 │
│ map_small │ 0.23354189097881317 │
│ mar_1 │ 0.3763301968574524 │
│ mar_10 │ 0.6356218457221985 │
│ mar_100 │ 0.7002217769622803 │
│ mar_100_per_class │ -1.0 │
│ mar_large │ 0.8251056671142578 │
│ mar_medium │ 0.7188176512718201 │
│ mar_small │ 0.47954803705215454 │
└───────────────────────────┴───────────────────────────┘
Valid | mAP : 60.21 | mAP50 : 73.19 | ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 157/157 0:02:46 • 0:00:00 0.82it/s
┏━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Epoch ┃ Avg. Precision ┃ % ┃ Avg. Recall ┃ % ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ 1 │ AP @ .5:.95 │ 48.03 │ AR maxDets 1 │ 37.63 │
│ 1 │ AP @ .5 │ 64.35 │ AR maxDets 10 │ 63.56 │
│ 1 │ AP @ .75 │ 52.25 │ AR maxDets 100 │ 70.02 │
│ 1 │ AP (small) │ 23.35 │ AR (small) │ 47.95 │
│ 1 │ AP (medium) │ 46.54 │ AR (medium) │ 71.88 │
│ 1 │ AP (large) │ 63.73 │ AR (large) │ 82.51 │
└───────┴────────────────┴───────┴────────────────┴───────┘

System Info (please complete the following information):

  • OS: Linux
  • Python Version: 3.12.11
  • PyTorch Version: 2.8.0+cu128
  • CUDA/cuDNN/MPS Version:
    Built on Wed_Nov_22_10:17:15_PST_2023
    Cuda compilation tools, release 12.3, V12.3.107
    Build cuda_12.3.r12.3/compiler.33567101_0
  • YOLO Model Version: YOLOv9-c

compare_checkpoints.py

Labels: bug (Something isn't working)