How to resume training from checkpoint? #63

Closed
ThrowawayAccount01 opened this issue Mar 22, 2023 · 6 comments · Fixed by #64
Labels
bug Something isn't working

Comments

@ThrowawayAccount01

Right now, if training gets interrupted, I have to start over from scratch. Is there a way to continue training from the latest checkpoint?

@34j
Collaborator

34j commented Mar 22, 2023

This is not the expected behavior. Are you using models from a different version, such as 4.0 v2? Please post the output, as there is not enough information.

@ThrowawayAccount01
Author

For example, let's say I trained this:

[screenshot: list of saved checkpoints]

Then I stopped training at G5600.pth.

How do I continue training from G5600.pth? Do I need to edit the config file, or add extra arguments after svc train?

@34j
Collaborator

34j commented Mar 22, 2023

Can you please post the output of svc train? This is a bug.

@ThrowawayAccount01
Author

I think there is a misunderstanding. I'm not making a bug report; I'm asking whether continuing training is possible.

Here is the stack trace:

(venv) C:\Users\LXC PC\Desktop\sovits\venv\Scripts>svc train
  0%|                                                                                         | 0/9934 [00:00<?, ?it/s]C:\Users\LXC PC\Desktop\sovits\venv\lib\site-packages\torch\functional.py:641: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\SpectralOps.cpp:867.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
C:\Users\LXC PC\Desktop\sovits\venv\lib\site-packages\torch\autograd\__init__.py:200: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed.  This is not an error, but may impair performance.
grad.sizes() = [32, 1, 4], strides() = [4, 1, 1]
bucket_view.sizes() = [32, 1, 4], strides() = [4, 4, 1] (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\torch\csrc\distributed\c10d\reducer.cpp:337.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  1%|▌                                                                            | 69/9934 [08:20<21:52:24,  7.98s/it]

When training again, it overwrites the old checkpoints:

[screenshot: checkpoint files overwritten after restarting training]

Is it possible to continue training from the latest checkpoint G_5600.pth, instead of starting over from scratch?

@34j
Collaborator

34j commented Mar 22, 2023

Resuming training from the last model is the expected behavior. If updating does not resolve the issue, please reopen it. Thank you.
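
For context: trainers in this family typically resume by scanning the model directory for the newest G_*.pth / D_*.pth pair and loading it before training continues. The sketch below illustrates that pattern in Python. It is an illustration only, not the project's actual code; the checkpoint key names ("model", "optimizer", "iteration") and the G_/D_ file naming are assumptions based on the filenames shown in this thread.

```python
# Minimal sketch of automatic checkpoint resumption (illustrative, not project code).
# Assumption: checkpoints are saved as G_<step>.pth / D_<step>.pth in a model directory.
import glob
import os
import re
from typing import Optional

import torch


def latest_checkpoint(model_dir: str, prefix: str) -> Optional[str]:
    """Return the highest-step checkpoint matching e.g. G_*.pth, or None if there is none."""
    paths = glob.glob(os.path.join(model_dir, f"{prefix}_*.pth"))
    if not paths:
        return None
    # Pick the file with the largest step number in its name, e.g. G_5600.pth -> 5600.
    return max(paths, key=lambda p: int(re.search(r"_(\d+)\.pth$", p).group(1)))


def try_resume(model, optimizer, model_dir: str, prefix: str = "G") -> int:
    """Load the newest checkpoint if one exists and return its step; otherwise return 0."""
    ckpt_path = latest_checkpoint(model_dir, prefix)
    if ckpt_path is None:
        return 0  # nothing to resume from, training starts from scratch
    ckpt = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(ckpt["model"])          # assumed key name; real checkpoints may differ
    optimizer.load_state_dict(ckpt["optimizer"])  # assumed key name; real checkpoints may differ
    return int(ckpt.get("iteration", 0))
```

With this pattern, training picks up at the returned step whenever a checkpoint pair exists in the model directory, instead of starting again at step 0.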

34j added the bug label on Mar 22, 2023
@2blackbar

2blackbar commented Mar 29, 2023

It doesn't resolve the issue, so I've lost 6 hours of training. You can't resume; training starts from 0 when you stop and restart it. There should really be extra info on how to resume, or an extra argument for resuming. Also, what is the -t next to the train argument? Where are the docs about it?
Removing previous weights should be taken out of the code entirely. Who would want that? If you don't have disk space, you make disk space. Never remove the models when training restarts unless the user really wants it and explicitly specifies it with an argument, not by default.
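
As a stop-gap until the fix referenced above (#64) is in place, one way to avoid losing weights is to copy the existing checkpoints aside before restarting svc train. A minimal sketch, assuming the checkpoints are saved as G_*.pth / D_*.pth under logs/44k (that path is an assumption; adjust it to your setup):

```python
# Minimal backup sketch (not part of so-vits-svc-fork): copy existing checkpoints
# to a timestamped folder before restarting training, so nothing gets overwritten.
# Assumption: checkpoints named G_*.pth / D_*.pth live under logs/44k; adjust as needed.
import glob
import os
import shutil
import time

model_dir = os.path.join("logs", "44k")
backup_dir = os.path.join("logs", "backup_" + time.strftime("%Y%m%d_%H%M%S"))
os.makedirs(backup_dir, exist_ok=True)

for path in glob.glob(os.path.join(model_dir, "[GD]_*.pth")):
    shutil.copy2(path, backup_dir)  # copy, don't move, so the trainer still sees the files
    print(f"backed up {path} -> {backup_dir}")
```

Restoring is then just a matter of copying the backed-up files back into the model directory before launching svc train again.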
