
chore: small tweaks to the preprocessing #7

Closed · wants to merge 3 commits into main
Conversation

@GarrettConway (Collaborator) commented Mar 18, 2023

Change the ValueError text to match the logic for the preprocess_flist_config.py, add support for reading any audio filetype, and optimize the hubert preprocessing so that it can run on more than a small dataset. Future investigation will be needed to fix the hubert parallelization to allow it to handle larger datasets without exhausting memory.
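A minimal sketch of the "read any audio filetype" part of this change, assuming librosa and soundfile are available (the directory names and the 44100 Hz target below are illustrative, not taken from this diff):

```python
# Let the decoder decide what it can read instead of filtering on a hard-coded
# extension list. librosa falls back to soundfile/audioread, so wav, flac, mp3,
# ogg, etc. all work; anything undecodable is skipped.
from pathlib import Path

import librosa
import soundfile as sf

TARGET_SR = 44100  # illustrative target sample rate

def resample_any(in_path: Path, out_path: Path) -> None:
    audio, _ = librosa.load(in_path, sr=TARGET_SR, mono=True)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    sf.write(out_path, audio, TARGET_SR)

for path in Path("dataset_raw").rglob("*.*"):
    if not path.is_file():
        continue
    try:
        out = Path("dataset/44k") / path.relative_to("dataset_raw").with_suffix(".wav")
        resample_any(path, out)
    except Exception as exc:  # not an audio file, or the decoder could not read it
        print(f"skipping {path}: {exc}")
```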

@codecov-commenter commented Mar 18, 2023

Codecov Report

Merging #7 (5aa7039) into main (b988101) will decrease coverage by 0.12%.
The diff coverage is 21.21%.


@@            Coverage Diff             @@
##             main       #7      +/-   ##
==========================================
- Coverage   17.47%   17.35%   -0.12%     
==========================================
  Files          28       28              
  Lines        3108     3146      +38     
  Branches      328      341      +13     
==========================================
+ Hits          543      546       +3     
- Misses       2555     2589      +34     
- Partials       10       11       +1     
Impacted Files                                     Coverage Δ
src/so_vits_svc_fork/preprocess_flist_config.py    90.41% <ø> (ø)
src/so_vits_svc_fork/preprocess_hubert_f0.py       37.20% <13.63%> (+5.39%) ⬆️
src/so_vits_svc_fork/utils.py                      18.91% <25.00%> (-0.07%) ⬇️
src/so_vits_svc_fork/preprocess_resample.py        64.51% <42.85%> (-5.86%) ⬇️

... and 3 files with indirect coverage changes


@GarrettConway GarrettConway changed the title Small tweaks enhancement: small tweaks to the preprocessing Mar 18, 2023
np.save(f0_path, f0)


def _process_batch(filepaths: Iterable[Path], sampling_rate: int, hop_length: int, pos: int):
@GarrettConway (Collaborator, Author) commented Mar 18, 2023

I am not sure why pulling these functions out to the top level increases performance, but it runs much faster.
EDIT: It is probably related to how joblib serializes the function when passing it to Parallel.
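The speedup is consistent with how joblib serializes the callable it dispatches: a module-level function is pickled by reference (its import path), while a function defined inside another function must be serialized by value with cloudpickle under loky, and the plain multiprocessing backend typically cannot pickle it at all. A small illustration of the pattern with hypothetical names, not the PR's actual functions:

```python
# Hypothetical example: the work itself is trivial; the point is where the
# function is defined relative to the Parallel call.
from joblib import Parallel, delayed

def _square(x: int) -> int:  # module-level: pickled cheaply, by reference
    return x * x

def run_top_level(values):
    return Parallel(n_jobs=2)(delayed(_square)(v) for v in values)

def run_nested(values):
    def _square_nested(x: int) -> int:  # nested: serialized by value (cloudpickle)
        return x * x
    return Parallel(n_jobs=2)(delayed(_square_nested)(v) for v in values)
```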

Collaborator

Like not being able to use LokyBackend?

@GarrettConway (Collaborator, Author) commented Mar 18, 2023

After some quick testing on the memory consumption of the backends with my small/medium dataset:

  • multiprocessing backend is the worst. 2 threads will very quickly max out a 3090 and crash it.
  • loky will struggle along but will complete with 2 threads. More threads will crash it.
  • threading backend is slowest, but best on memory consumption by far.

In all three cases, memory was not released after the run until the Python instance shut down. I tried some things to get it to release memory, but didn't have any luck. I'm going to swap in the threading backend, but we should probably open an issue to track this and fix it so that it does not break larger datasets.

I'm not too familiar with python memory management myself, but these docs may help: https://joblib.readthedocs.io/en/latest/parallel.html#serialization-and-processes
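For reference, the backend swap described above is a one-argument change in joblib. A rough sketch using the _process_batch signature from this diff, with placeholder values for n_jobs, the batching, the sample rate, and the hop length:

```python
from joblib import Parallel, delayed

def _process_batch(filepaths, sampling_rate, hop_length, pos):
    ...  # per-batch preprocessing, as in the function above

batches = [...]  # however the file list is chunked across workers

# "threading" keeps everything in a single process, so there is one CUDA context
# and one copy of the model in memory; that is why it was the gentlest on VRAM in
# the tests above, at the cost of the GIL limiting CPU-bound parallelism.
Parallel(n_jobs=4, backend="threading")(
    delayed(_process_batch)(batch, 44100, 512, pos) for pos, batch in enumerate(batches)
)
```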

@GarrettConway GarrettConway changed the title enhancement: small tweaks to the preprocessing chore: small tweaks to the preprocessing Mar 18, 2023
@34j (Collaborator) left a comment

Please run pre-commit.

src/so_vits_svc_fork/preprocess_flist_config.py (review thread, outdated, resolved)
src/so_vits_svc_fork/preprocess_resample.py (review thread, resolved)
@34j (Collaborator) left a comment

Thank you for your contribution. Almost there!

@GarrettConway GarrettConway requested a review from 34j March 18, 2023 18:14
@34j (Collaborator) commented Mar 19, 2023

I will verify this some more. (While I was taking screenshots to verify this issue, my video card crashed as well and I lost data. 😇)

@34j (Collaborator) commented Mar 19, 2023

[screenshot: benchmark results]
I tested with the jvs corpus (15,000 files) and loky is almost 2x faster because the CPU is the bottleneck (40 min -> 20 min).
What dataset did you use?

@34j (Collaborator) commented Mar 19, 2023

The results of the verification are as follows:

  • Backend
    loky is the fastest.
  • Number of processes
    With the jvs corpus on a GTX 1060 6GB, the maximum was 2 before, but after adding torch.cuda.empty_cache() after process_one (see the sketch below), the maximum is 4. VRAM capacity is the main constraint, and the maximum allowable number of processes depends greatly on the per-file duration of the dataset.
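A sketch of the empty_cache() placement described in the second bullet; process_one stands in for the repo's per-file HuBERT/f0 step and its parameters here are assumptions, the only point being where the cleanup call goes:

```python
import torch

def process_one(filepath, hmodel, sampling_rate, hop_length):
    """Placeholder for the repo's per-file HuBERT/f0 extraction."""
    ...

def process_one_with_cleanup(filepath, hmodel, sampling_rate, hop_length):
    try:
        process_one(filepath, hmodel, sampling_rate, hop_length)
    finally:
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # hand cached CUDA blocks back between files
```

With the cleanup in place, each worker returns its cached allocations after every file instead of holding them for the rest of the run, which is consistent with more loky processes fitting in the same VRAM.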

@34j
Copy link
Collaborator

34j commented Mar 19, 2023

Migrate to #12

@34j 34j closed this Mar 19, 2023