Conversation

@IvanMM27
Contributor

Hello,

When using `fast_dev_run` in `trainer.fit`, an error is raised because no checkpoint is created, yet GraphNeT tries to load the best checkpoint directly after `trainer.fit` completes:

```
Running in `fast_dev_run` mode: will run the requested loop using 1 batch(es). Logging and checkpointing is suppressed.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------


  | Name                 | Type       | Params | Mode  | FLOPs
--------------------------------------------------------------------
0 | _tasks               | ModuleList | 129    | train | 0    
1 | _data_representation | KNNGraph   | 0      | train | 0    
2 | backbone             | DynEdge    | 1.4 M  | train | 0    
--------------------------------------------------------------------
1.4 M     Trainable params
0         Non-trainable params
1.4 M     Total params
5.515     Total estimated model params size (MB)
36        Modules in train mode
0         Modules in eval mode
0         Total Flops
Epoch  0: 100%|██████████████████████████████████████████████████████████| 1/1 [00:06<00:00,  0.14 batch(es)/s, lr=1e-5, val_loss=0.00255, train_loss=0.028]
`Trainer.fit` stopped: `max_steps=1` reached.
Epoch  0: 100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00,  0.14 batch(es)/s, lr=1e-5, val_loss=0.00255, train_loss=0.028]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data_hgx/KM3NeT/mozun/temp/graphnet/examples/04_training/01_train_dynedge.py", line 249, in <module>
[rank0]:     main(
[rank0]:   File "/data_hgx/KM3NeT/mozun/temp/graphnet/examples/04_training/01_train_dynedge.py", line 164, in main
[rank0]:     model.fit(
[rank0]:   File "/data_hgx/KM3NeT/mozun/temp/graphnet/src/graphnet/models/easy_model.py", line 182, in fit
[rank0]:     torch.load(
[rank0]:   File "/data_hgx/KM3NeT/mozun/temp/graphnet_dev/lib/python3.10/site-packages/torch/serialization.py", line 1425, in load
[rank0]:     with _open_file_like(f, "rb") as opened_file:
[rank0]:   File "/data_hgx/KM3NeT/mozun/temp/graphnet_dev/lib/python3.10/site-packages/torch/serialization.py", line 751, in _open_file_like
[rank0]:     return _open_file(name_or_buffer, mode)
[rank0]:   File "/data_hgx/KM3NeT/mozun/temp/graphnet_dev/lib/python3.10/site-packages/torch/serialization.py", line 732, in __init__
[rank0]:     super().__init__(open(name, mode))
[rank0]: FileNotFoundError: [Errno 2] No such file or directory: ''
```

Therefore, I have added a `fast_dev_run` argument in `easy_syntax` that is passed through to `trainer.fit` and, when enabled, skips loading the best checkpoint.
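For illustration, here is a minimal sketch of the idea as a standalone helper (`fit_model` is a hypothetical name, not the actual GraphNeT API; the real change lives in the `fit` method in `easy_model.py`). Lightning suppresses checkpointing in `fast_dev_run` mode, so `best_model_path` stays empty and the checkpoint restore has to be skipped:

```python
from typing import Any, Optional

import torch
from pytorch_lightning import LightningModule, Trainer
from torch.utils.data import DataLoader


def fit_model(
    model: LightningModule,
    train_dataloader: DataLoader,
    val_dataloader: Optional[DataLoader] = None,
    fast_dev_run: bool = False,
    **trainer_kwargs: Any,
) -> None:
    """Fit `model`, restoring the best checkpoint unless `fast_dev_run` is set."""
    trainer = Trainer(fast_dev_run=fast_dev_run, **trainer_kwargs)
    trainer.fit(model, train_dataloader, val_dataloader)

    if fast_dev_run:
        # Lightning suppresses logging and checkpointing in fast_dev_run mode,
        # so `best_model_path` would be an empty string and torch.load('')
        # would raise the FileNotFoundError shown in the traceback above.
        return

    checkpoint_callback = trainer.checkpoint_callback
    if checkpoint_callback is not None and checkpoint_callback.best_model_path:
        checkpoint = torch.load(
            checkpoint_callback.best_model_path, map_location="cpu"
        )
        model.load_state_dict(checkpoint["state_dict"])
```

For example, `fit_model(model, train_dataloader, fast_dev_run=True)` runs a single batch and returns without touching any checkpoint file, while the default `fast_dev_run=False` keeps the current behaviour of restoring the best checkpoint after training.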
