fix(ml): race condition when loading models #3207
Merged
Description
This change reverts to loading models synchronously, preventing multiple calls from loading the same model several times.
An earlier change moved model loading to a background thread so other requests could continue to be handled. However, this allowed concurrent requests to each load the same model. While the default behavior is to load all models at startup, this race condition allowed memory usage to spike dramatically once the models were unloaded and had to be loaded again.
Additionally, this change defaults to never unloading models. The current unloading implementation isn't robust enough and can result in greater memory usage than simply keeping models in memory.
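For illustration only, here is a minimal sketch of the "load each model at most once" guarantee described above. It is not the code in this PR: the PR gets this guarantee by loading synchronously, whereas the sketch uses a `threading.Lock` around the check-then-load step to show the same property under concurrent callers. `ModelCache`, `_load_model`, and `run_concurrent` are hypothetical names, and the loader is a stand-in that sleeps instead of loading a real model.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ModelCache:
    """Hypothetical cache that loads each model at most once."""

    def __init__(self) -> None:
        self._models: dict[str, dict] = {}
        self._lock = threading.Lock()
        self.loads = 0  # number of times the (fake) loader actually ran

    def _load_model(self, name: str) -> dict:
        # Stand-in for an expensive model load (assumption, not the real loader).
        time.sleep(0.05)
        self.loads += 1  # only called while holding self._lock
        return {"name": name}

    def get(self, name: str) -> dict:
        # The lock is held across both the membership check and the load,
        # so concurrent callers cannot each decide the model is missing.
        with self._lock:
            if name not in self._models:
                self._models[name] = self._load_model(name)
            return self._models[name]


def run_concurrent(cache: ModelCache, name: str, workers: int = 8) -> list[dict]:
    # Simulate several jobs requesting the same model at the same time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda _: cache.get(name), range(workers)))
```

Without the lock (or with the load pushed to a background thread and no coordination), each of the eight callers could see the model as missing and load its own copy, which matches the memory spike described above.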
Fixes #3142
How Has This Been Tested?
Set the job concurrency to a high number (such as 8) for either image tagging or facial recognition (the CLIP model has no logs) and start the job. The logs should show only one instance of the corresponding log line below. I tested this with all ML jobs running simultaneously and confirmed the correct behavior (on main, this swelled memory usage to over 10 GB).
Image classification log:
Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.
Facial recognition log: