
fix(ml): race condition when loading models #3207

Merged 3 commits on Jul 11, 2023
Conversation

@mertalev (Contributor) commented Jul 11, 2023

Description

This change reverts to loading models synchronously, preventing concurrent calls from loading the same model multiple times.

An earlier change moved model loading to a background thread so other requests could continue to be handled. However, this allowed concurrent requests to each trigger a load of the same model, so one model could be loaded several times at once. While the default behavior is to load all models at startup, this race condition caused memory usage to spike dramatically after the models were unloaded.

This change also defaults to never unloading models. The current unloading implementation isn't robust enough and can result in higher memory usage than simply keeping the models in memory.

Fixes #3142
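The serialized approach can be sketched as follows, assuming an asyncio service; `ModelCache` and its API are hypothetical illustrations, not the project's actual classes:

```python
import asyncio

class ModelCache:
    """Hypothetical cache that serializes model loads behind a lock."""

    def __init__(self):
        self.load_count = 0
        self._models = {}
        self._lock = asyncio.Lock()

    async def get(self, name):
        # Check the cache while holding the lock, so concurrent
        # requests for the same model trigger exactly one load.
        async with self._lock:
            if name not in self._models:
                self.load_count += 1
                await asyncio.sleep(0.01)     # stands in for the slow load
                self._models[name] = object() # placeholder for the model
            return self._models[name]

async def main():
    cache = ModelCache()
    await asyncio.gather(*(cache.get("buffalo_l") for _ in range(8)))
    return cache.load_count

print(asyncio.run(main()))  # prints 1: one load despite 8 concurrent requests
```

The first request acquires the lock and loads the model; the other seven wait, then hit the cache, matching the single log line expected in the test plan below.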

How Has This Been Tested?

Set the job concurrency to a high number (such as 8) for either image tagging or facial recognition (the CLIP model produces no logs) and start the job. The logs should show only one instance of the corresponding output below. I tested this with all ML jobs running simultaneously and confirmed correct behavior (on main, memory usage swelled to over 10 GB).

Image classification log:
Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.

Facial recognition log:

Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
model ignore: /cache/facial-recognition/buffalo_l/models/buffalo_l/1k3d68.onnx landmark_3d_68
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
model ignore: /cache/facial-recognition/buffalo_l/models/buffalo_l/2d106det.onnx landmark_2d_106
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: /cache/facial-recognition/buffalo_l/models/buffalo_l/det_10g.onnx detection [1, 3, '?', '?'] 127.5 128.0
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
model ignore: /cache/facial-recognition/buffalo_l/models/buffalo_l/genderage.onnx genderage
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: /cache/facial-recognition/buffalo_l/models/buffalo_l/w600k_r50.onnx recognition ['None', 3, 112, 112] 127.5 127.5
set det-size: (640, 640)

@mertalev mertalev requested a review from alextran1502 July 11, 2023 07:17
@alextran1502 alextran1502 merged commit 848ba68 into main Jul 11, 2023
@alextran1502 alextran1502 deleted the ml/fix-race-condition branch July 11, 2023 17:01
Linked issue this pull request may close: [BUG] Machine learning memory leak