training with more epochs #44

oussaifi-majdi · 2023-09-21T12:23:28Z

I'm facing time limitations in Google Colab and need to train on my data for 150 epochs, but Colab terminates the session at around 50 epochs. How can I resume from the last saved checkpoint after restarting the Colab session?

naseemap47 · 2023-09-22T05:09:03Z

Hi @oussaifi-majdi,
To solve your issue, I added a new option for resuming model training.
I think this will solve your issue.
If you have any issues, please let me know.
Thank you.

oussaifi-majdi · 2023-09-22T13:06:13Z

@naseemap47 Thank you so much for your help with this issue! Your guidance and support were invaluable in resolving the problem.
Now I use the --resume option to train on my data:

python3 train.py --data /dir/dataset/data.yaml --batch 16 --epoch 120 --model yolo_nas_m --size 640 --resume

But how can I get the curves for accuracy, precision, etc. in TensorBoard for the whole training run, from the first hours of training until all epochs have finished? What should I set for the values below?

CHECKPOINT_DIR =?
EXPERIMENT_NAME =?
%load_ext tensorboard
%tensorboard --logdir {CHECKPOINT_DIR}/{EXPERIMENT_NAME} --port 6005
%reload_ext tensorboard
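
For reference, here is one way the two placeholders could be filled in. It assumes the project writes its output under a layout like `runs/train2/ckpt_latest.pth`, the path that appears in the maintainer's example in this thread. The helper name `tensorboard_logdir` is hypothetical, and whether the TensorBoard event files actually sit next to the checkpoint is an assumption to verify on your own run.

```python
from pathlib import PurePosixPath


def tensorboard_logdir(checkpoint_path: str) -> tuple[str, str]:
    """Split a checkpoint path into (checkpoint_dir, experiment_name).

    Assumes the layout <checkpoint_dir>/<experiment_name>/<file>.pth,
    which is a guess based on the path runs/train2/ckpt_latest.pth.
    """
    path = PurePosixPath(checkpoint_path)
    experiment_dir = path.parent            # e.g. runs/train2
    checkpoint_dir = experiment_dir.parent  # e.g. runs
    return str(checkpoint_dir), experiment_dir.name


CHECKPOINT_DIR, EXPERIMENT_NAME = tensorboard_logdir("runs/train2/ckpt_latest.pth")
# Print the notebook magic line with the placeholders filled in.
print(f"%tensorboard --logdir {CHECKPOINT_DIR}/{EXPERIMENT_NAME} --port 6005")
```

If the event files turn out to live elsewhere, only the path passed to the helper needs to change.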

naseemap47 · 2023-09-23T04:40:49Z

Hi @oussaifi-majdi,
Here is an example. I think this will help you.
Example:

python3 train.py --data /dir/dataset/data.yaml --batch 6 --epoch 100 --model yolo_nas_m --size 640 --weight runs/train2/ckpt_latest.pth --resume
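
A small sketch of how the `--weight` path could be picked automatically after a Colab restart. It assumes checkpoints are stored as `runs/<run_name>/ckpt_latest.pth`, as in the example above, and that the most recently modified file is the one to resume from. `latest_checkpoint` and `resume_command` are hypothetical helpers, not part of the project; only flags that appear in this thread are used.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(runs_dir: str = "runs", name: str = "ckpt_latest.pth") -> Optional[Path]:
    """Return the most recently modified <runs_dir>/*/<name>, or None if there is none."""
    candidates = list(Path(runs_dir).glob(f"*/{name}"))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def resume_command(data: str, weight: Path, epochs: int = 100, batch: int = 6,
                   model: str = "yolo_nas_m", size: int = 640) -> list[str]:
    """Build the train.py argument list using only flags shown in this thread."""
    return [
        "python3", "train.py",
        "--data", data,
        "--batch", str(batch),
        "--epoch", str(epochs),
        "--model", model,
        "--size", str(size),
        "--weight", weight.as_posix(),
        "--resume",
    ]


if __name__ == "__main__":
    ckpt = latest_checkpoint()
    if ckpt is None:
        print("No checkpoint found; start training without --resume.")
    else:
        print(" ".join(resume_command("/dir/dataset/data.yaml", ckpt)))
```

On Colab, the `runs` directory would need to live somewhere that survives a session restart (for example a mounted drive) for this to find anything.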

oussaifi-majdi · 2023-09-23T15:57:37Z

Thanks, sir, but if I resume training later using the --resume option, it may be difficult to get the full precision and accuracy curves from the first epoch to the end.
Is there a way to get the complete curves?

naseemap47 · 2023-09-24T06:11:10Z

Hi @oussaifi-majdi,
I fixed the issue; you can check now.
Thank you for finding this issue.
Please let me know whether this fixes your issue.
Thank you.

oussaifi-majdi · 2023-09-24T12:51:52Z

@naseemap47 Thanks, the #46 resume works well. The remaining problem is this: if training stops somewhere between epochs 0 and 70, and we then resume and continue from 70 to 100, TensorBoard at the end only displays the recall, precision, and F1 curves for the last part of training (70 to 100), not for 1 to 100.
I found some possible solutions at https://github.com/Deci-AI/super-gradients/blob/master/documentation/source/experiment_monitoring.md but they do not work with this project. It would be worth integrating one of these methods: it would improve the project, set it apart from others, and solve a very interesting problem.
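
To illustrate the goal described here (one continuous curve from epoch 1 to 100 rather than only 70 to 100), here is a minimal sketch that stitches per-epoch metrics from several training segments into one series. It is independent of super-gradients and TensorBoard: the input format (a list of `(epoch, metrics)` pairs per segment) and the function name `merge_segments` are assumptions, and exporting each segment's metrics in that form is left to whatever logger the project uses.

```python
from typing import Dict, List, Tuple

EpochRecord = Tuple[int, Dict[str, float]]


def merge_segments(segments: List[List[EpochRecord]]) -> List[EpochRecord]:
    """Merge metric records from consecutive training segments.

    Segments are given in the order they were trained. If an epoch appears
    in more than one segment (e.g. it was re-run after a resume), the record
    from the later segment wins.
    """
    merged: Dict[int, Dict[str, float]] = {}
    for segment in segments:
        for epoch, metrics in segment:
            merged[epoch] = dict(metrics)
    # Sort by epoch number so the result is one continuous series.
    return sorted(merged.items(), key=lambda item: item[0])


# Hypothetical values: a first session covering epochs 1-70, then a resumed
# session covering epochs 70-100 (epoch 70 is repeated after the resume).
first = [(e, {"precision": 0.5}) for e in range(1, 71)]
second = [(e, {"precision": 0.6}) for e in range(70, 101)]
full = merge_segments([first, second])
print(len(full), full[0][0], full[-1][0])  # prints: 100 1 100
```

Separately, TensorBoard treats each subdirectory under `--logdir` as its own run, so if each session writes its event files to a different subdirectory, pointing `--logdir` at their common parent may already show the segments side by side. Whether this project lays out its logs that way is an assumption to check.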

naseemap47 · 2023-09-25T07:30:42Z

@oussaifi-majdi Thank you.
I will look into it.
Thank you for your support.

naseemap47 self-assigned this Sep 22, 2023

naseemap47 added the enhancement New feature or request label Sep 22, 2023

naseemap47 linked a pull request Sep 22, 2023 that will close this issue

Resume #45

Merged

naseemap47 mentioned this issue Sep 22, 2023

Resume #45

Merged

naseemap47 closed this as completed in #45 Sep 22, 2023
