chore: small tweaks to the preprocessing #7
Conversation
Codecov Report
@@ Coverage Diff @@
## main #7 +/- ##
==========================================
- Coverage 17.47% 17.35% -0.12%
==========================================
Files 28 28
Lines 3108 3146 +38
Branches 328 341 +13
==========================================
+ Hits 543 546 +3
- Misses 2555 2589 +34
- Partials 10 11 +1
... and 3 files with indirect coverage changes
np.save(f0_path, f0)

def _process_batch(filepaths: Iterable[Path], sampling_rate: int, hop_length: int, pos: int):
I am not sure why pulling these functions out to the top level improves performance, but it runs much faster.
EDIT: It is probably something to do with how joblib serializes the function to pass it to Parallel.
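For context, a minimal sketch of the pattern, assuming joblib's process-based backends are involved; the driver function and its arguments here are illustrative, not the repository's actual code. Process backends pickle the worker callable to ship it to child processes, and a module-level function can be pickled by reference, whereas a nested or locally defined function has to be serialized by value, which is slower and can fail with the standard pickler.

```python
# Illustrative sketch, not the repository's code.
from pathlib import Path
from typing import Iterable

from joblib import Parallel, delayed


def _process_batch(filepaths: Iterable[Path], sampling_rate: int, hop_length: int, pos: int) -> None:
    for path in filepaths:
        # hypothetical per-file work standing in for the real preprocessing
        _ = (path, sampling_rate, hop_length, pos)


def preprocess_all(batches: list, sampling_rate: int, hop_length: int) -> None:
    # Because _process_batch lives at module level, each task only has to
    # pickle a reference to it rather than the whole function object.
    Parallel(n_jobs=-1)(
        delayed(_process_batch)(batch, sampling_rate, hop_length, pos)
        for pos, batch in enumerate(batches)
    )
```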
Like not being able to use LokyBackend?
After some quick testing of the memory consumption of the backends with my small/medium dataset:
- The multiprocessing backend is the worst: 2 threads will very quickly max out a 3090 and crash it.
- loky will struggle along but complete with 2 threads; more threads will crash it.
- The threading backend is the slowest, but by far the best on memory consumption.
In all three cases, memory was not released after the run until the Python instance shut down. I tried some things to get it to release memory, but didn't have any luck. I'm going to swap in the threading backend (see the sketch below), but we should probably open an Issue to track this and fix it so it does not break larger datasets.
I'm not too familiar with Python memory management myself, but these docs may help: https://joblib.readthedocs.io/en/latest/parallel.html#serialization-and-processes
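A minimal sketch of what swapping in the threading backend could look like; the helper names and batching are assumptions, not this PR's code. The threading backend keeps all workers in one process, which matches the low memory consumption described above, and it can still overlap work when the heavy lifting releases the GIL (e.g. inside CUDA/torch calls).

```python
# Hedged sketch: selecting joblib's threading backend for the preprocessing.
from joblib import Parallel, delayed, parallel_backend


def _extract(path: str) -> None:
    ...  # hypothetical per-file feature extraction


def run(filepaths: list, n_jobs: int = 2) -> None:
    # parallel_backend switches the backend for every Parallel call inside
    # the context; "threading" shares one process and its memory.
    with parallel_backend("threading", n_jobs=n_jobs):
        Parallel()(delayed(_extract)(p) for p in filepaths)
```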
Force-pushed from a931c56 to 0e1d3ef (compare)
Please run pre-commit.
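(For reference, pre-commit run --all-files applies the repository's configured hooks to every file; which hooks run depends on the project's .pre-commit-config.yaml.)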
Thank you for your contribution. Almost there!
I will verify this some more. (While I was taking screenshots to verify this issue, my video card crashed as well and I lost data. 😇)
The results of the verification are as follows:
Migrate to #12
Changes the ValueError text to match the logic in preprocess_flist_config.py, adds support for reading any audio filetype, and optimizes the hubert preprocessing so that it can run on more than a small dataset. Further investigation will be needed to fix the hubert parallelization so it can handle larger datasets without exhausting memory.
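As a hedged illustration of the "any audio filetype" point (the wrapper name and resampling policy are assumptions, not necessarily what this PR does): librosa.load dispatches to soundfile for common formats and falls back to audioread for other containers, so the caller does not have to branch on the file extension.

```python
# Illustrative loader, not the PR's exact code.
from pathlib import Path

import librosa
import numpy as np


def load_audio(path: Path, sampling_rate: int) -> np.ndarray:
    # Handles wav/flac/ogg via soundfile and other containers via audioread,
    # resampling to the project's rate and downmixing to mono.
    audio, _ = librosa.load(path, sr=sampling_rate, mono=True)
    return audio
```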