fix: memory and lora cache improvements #379

Open

wants to merge 81 commits into main

Conversation

tazlin
Member

@tazlin tazlin commented Dec 24, 2024

Time has shown me that the readme was confusingly laid out. I am hoping that this condensed and hopefully easier-to-read version will prevent onboarding friction.
This avoids an active bug on the API and also ensures consistency on the off chance the API and the worker model references disagree.
With the recent changes it becomes possible for this function to take longer than 15 seconds, at least in the case of a lora.json which predates those changes. To reduce potential friction and crashes for users who want to "just" update, I'm going to increase this value.

Additionally, I'm reordering the call to purge loras to try to ensure that any migrations/ad-hoc updates are in play before we start deleting things.
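
Roughly, the intent looks like this; the constant and helper names below are placeholders for illustration, not the actual identifiers in the worker:

```python
# Illustrative sketch only: the constant and helpers are hypothetical.

AUX_MODEL_SETUP_TIMEOUT = 60  # raised from 15; migrating an old lora.json can exceed 15s


def run_lora_migrations() -> None:
    """Apply lora.json migrations / ad-hoc updates (stubbed for illustration)."""


def purge_unused_loras() -> None:
    """Delete loras that are no longer referenced (stubbed for illustration)."""


def setup_aux_models() -> None:
    # Migrations run first so the purge step sees the post-migration state
    # and doesn't delete anything a migration would have kept.
    run_lora_migrations()
    purge_unused_loras()
```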
Any errors which make it to this point are probably terrible, so we should just end the process.
- By moving lora/ti downloading to before preloading the main model off disk, we should have more time to download at better times and improve the chances of a job skip when models are actively downloading.
- Piggybacking off of this change, the job skip logic needed to be tweaked to account for the rework of the job steps.
Because of the swap of the `download_aux_models` and `preload_model` steps, it was possible for models to endlessly load and unload on a single process. This prevents that from happening and removes some of the legacy (and now moot) logic that was originally intended to make sure lora preloading happened earlier, which now happens as a property of 5cc9974.
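
A very rough sketch of the reordered steps and the thrash guard; the enum and function here are illustrative stand-ins rather than the worker's actual job-step machinery:

```python
from enum import Enum, auto


class JobStep(Enum):
    DOWNLOAD_AUX_MODELS = auto()  # loras/tis now come first...
    PRELOAD_MODEL = auto()        # ...and the main model is pulled off disk afterwards
    INFERENCE = auto()


def next_step(current: JobStep, aux_downloads_active: bool) -> JobStep | None:
    """Advance a job one step, returning None when it should wait instead."""
    if current is JobStep.DOWNLOAD_AUX_MODELS:
        # With the new ordering, the skip logic waits here while aux models are
        # still downloading instead of letting the process preload and then
        # unload the main model over and over.
        return None if aux_downloads_active else JobStep.PRELOAD_MODEL
    if current is JobStep.PRELOAD_MODEL:
        return JobStep.INFERENCE
    return None
```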
High-VRAM cards do not currently seem to be clearing VRAM as I intend; this is an attempt at triggering that more often.
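
Assuming the cleanup ultimately goes through torch's allocator, the change amounts to calling the cache-clearing path more often; a minimal sketch (the threshold and function name are made up for illustration):

```python
import torch


def maybe_clear_vram(threshold_fraction: float = 0.5) -> None:
    """Release cached VRAM more eagerly than before (illustrative threshold).

    High-VRAM cards rarely hit allocator pressure on their own, so without an
    explicit trigger the cache may never be handed back to the driver.
    """
    if not torch.cuda.is_available():
        return
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    if free_bytes / total_bytes < threshold_fraction:
        torch.cuda.empty_cache()
```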
Processes which have models that are still queued should not be killed
This should prevent corner cases where models that loaded within a few ticks of a SIGINT aren't registered properly; in that situation the process would be killed, but the accounting of the model states would fall out of sync, leaving the worker unable to ever end (with a job stuck in queue and never attempted).
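
The shutdown filter, in spirit, looks something like the following sketch (types and names are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class ProcessInfo:
    """Minimal stand-in for the worker's per-process bookkeeping."""

    process_id: int
    loaded_model: str | None = None


def processes_safe_to_kill(
    processes: list[ProcessInfo],
    queued_models: set[str],
) -> list[ProcessInfo]:
    # A process whose model still has queued jobs is left alone, even if that
    # model only finished loading a tick or two before the SIGINT arrived.
    # Killing it would leave the model-state accounting out of sync and a job
    # stuck in queue that is never attempted.
    return [p for p in processes if p.loaded_model not in queued_models]
```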
In the main loop, `get_next_job_and_process` is called for informational purposes only; if the function returns `None`, that's fine, and it probably just means the gears haven't turned elsewhere yet. The sub-function `handle_process_missing` tries to recover when there *should* be a return value, such as when called from `start_inference_`; in that case it is appropriate to try to fix the model and process maps.

Additionally, this prevents the "cleared RAM" message from triggering twice.
Prior to this change, individual processes could be commanded to unload from RAM repeatedly, many times a second. This prevents the command from being issued multiple times in a row.
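
In effect the change is a simple debounce; a sketch of the idea (class and method names are invented for illustration, not the actual implementation):

```python
import time


class UnloadDebouncer:
    """Only issue the 'unload from RAM' command once until it takes effect."""

    def __init__(self) -> None:
        self._pending: dict[int, float] = {}

    def should_send_unload(self, process_id: int) -> bool:
        # Refuse to re-issue the command while one is already in flight.
        if process_id in self._pending:
            return False
        self._pending[process_id] = time.monotonic()
        return True

    def mark_unloaded(self, process_id: int) -> None:
        # Called when the process reports it actually cleared its RAM, which is
        # also why the "cleared RAM" log line can now only fire once per unload.
        self._pending.pop(process_id, None)
```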
This can just lead to trading one blocking download for another
The LRU cache used in this way is now moot due to checks done elsewhere that have the same practical effect.
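
Concretely this is just dropping the memoisation; a sketch with a hypothetical lookup helper:

```python
# Illustrative before/after; the function name is hypothetical.
#
# Before: results were memoised even though callers already guard against
# redundant lookups elsewhere.
#
#   @functools.lru_cache(maxsize=None)
#   def get_process_for_model(model_name: str) -> int | None: ...
#
# After: a plain call, so a stale cache entry can no longer mask a changed
# process/model mapping.
def get_process_for_model(model_name: str) -> int | None:
    """Look up the process currently holding model_name (stubbed)."""
    return None
```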
This is still an indication that something is wrong, which is why I am leaving it as a warning, but with all of the 1-tick or timing-related issues that can lead to this, I don't see a reason to cast fear into operators' hearts when it will often be able to actually recover.

Additionally, the second log message changed was for debugging purposes only and does not necessarily indicate a critical/error condition.
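
Assuming loguru-style logging, the change is roughly a severity downgrade like this sketch; the function and message text are illustrative:

```python
from loguru import logger


def report_recoverable_mismatch(model_name: str) -> None:
    # Still worth surfacing, but 1-tick / timing races frequently self-correct,
    # so a warning is enough; an error would needlessly alarm operators.
    logger.warning(f"Model state mismatch for {model_name}; attempting to recover.")
```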
Prior to this, the overloaded nature of `get_processes_with_model_for_queued_job` led to some confusion on shutdown, where it could hang indefinitely with a never-killed process, and models wouldn't be proactively preloaded as intended
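
One way to read the fix is splitting the overloaded lookup into two single-purpose helpers, roughly like this sketch (names and types are hypothetical, not the functions actually introduced):

```python
from dataclasses import dataclass


@dataclass
class WorkerProcess:
    """Minimal stand-in for a tracked inference process."""

    process_id: int
    loaded_model: str | None = None


def processes_with_loaded_model(
    processes: list[WorkerProcess], model_name: str
) -> list[WorkerProcess]:
    """Processes that already have the model in memory (dispatch decisions)."""
    return [p for p in processes if p.loaded_model == model_name]


def processes_available_to_preload(
    processes: list[WorkerProcess],
) -> list[WorkerProcess]:
    """Idle processes that are candidates to proactively preload a model."""
    return [p for p in processes if p.loaded_model is None]
```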
Once upon a time, this seemed reasonable, but the current status quo no longer fits with relying on it. The outgoing states (`HordeControlFlag`) are useful for knowing whether or not to resend a `HordeControlFlag`, but they are not accurate (at least, as of recent changes) when it comes to measuring the reported process state.
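
The distinction, as a sketch (the field names and enum members are invented; `HordeControlFlag` is the real name referenced above):

```python
from dataclasses import dataclass
from enum import Enum, auto


class HordeControlFlag(Enum):
    """Outgoing commands (values here are illustrative)."""

    UNLOAD_FROM_RAM = auto()
    END_PROCESS = auto()


@dataclass
class ProcessTracking:
    # What we last asked the process to do: only useful for deciding whether
    # a HordeControlFlag needs to be resent...
    last_flag_sent: HordeControlFlag | None = None
    # ...while the authoritative state is whatever the process last reported,
    # never what we asked of it.
    last_reported_state: str = "unknown"
```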