
[util] Add generic torch device class #6174

Merged · 19 commits into main · Apr 15, 2024

Conversation

@lstein (Collaborator) commented Apr 7, 2024

Summary

This PR cleans up the various calls to choose_torch_device(), torch_dtype(), choose_precision() and so on, removes redundant function calls, and creates a single class named TorchDeviceSelect that supersedes their functionality. In addition to creating a simplified API, this class generalizes the call to clear the VRAM cache so that the same method empties the VRAM cache for both CUDA and MPS devices. It also provides an API for invocation context-dependent retrieval of the GPU device, intended for use in GPU load balancing in the future.

Example usage:

from invokeai.backend.util.devices import TorchDevice

class Foo(BaseInvocation):
    def invoke(self, context: InvocationContext):
        torch_device = TorchDevice.choose_torch_device()
        torch_dtype = TorchDevice.choose_torch_dtype()

        # empty CUDA or MPS cache
        TorchDevice.empty_cache()

The methods that return strings instead of objects, e.g. "float32" rather than torch.float32, have been removed. The legacy choose_torch_device() and choose_precision() functions now issue a deprecation warning.
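As a rough illustration of the consolidated API and the deprecation shim described above, here is a stdlib-only sketch. The torch.cuda / torch.backends.mps probes are stubbed out as class attributes, so none of the bodies below are the real implementation; the real calls are noted in comments.

```python
import warnings


class TorchDevice:
    """Sketch of the unified helper. The real class probes torch.cuda and
    torch.backends.mps; those probes are stubbed out as class attributes here."""

    cuda_available = False  # stand-in for torch.cuda.is_available()
    mps_available = False   # stand-in for torch.backends.mps.is_available()

    @classmethod
    def choose_torch_device(cls) -> str:
        # Prefer CUDA, then MPS, then fall back to CPU.
        if cls.cuda_available:
            return "cuda"
        if cls.mps_available:
            return "mps"
        return "cpu"

    @classmethod
    def empty_cache(cls) -> None:
        # One method covers both backends; the real implementation calls
        # torch.cuda.empty_cache() or torch.mps.empty_cache() as appropriate.
        if cls.cuda_available:
            pass  # torch.cuda.empty_cache()
        if cls.mps_available:
            pass  # torch.mps.empty_cache()


def choose_torch_device() -> str:
    """Legacy function, retained but deprecated in favor of the class method."""
    warnings.warn(
        "choose_torch_device() is deprecated; use TorchDevice.choose_torch_device()",
        DeprecationWarning,
        stacklevel=2,
    )
    return TorchDevice.choose_torch_device()
```

Routing the legacy function through the class keeps old call sites working while pointing users at the new API.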

Related Issues / Discussions

QA Instructions

Merge Plan

Squash merge when approved.

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)

@github-actions bot added labels python, invocations, backend, services, python-tests on Apr 7, 2024
@lstein lstein marked this pull request as draft April 7, 2024 21:01
@lstein lstein force-pushed the lstein/feat/device-abstraction branch from 7898699 to 577bf62 Compare April 7, 2024 21:06
@lstein lstein marked this pull request as ready for review April 7, 2024 21:15
Review threads (outdated, resolved) on: invokeai/backend/util/devices.py (×5), tests/backend/util/test_devices.py, invokeai/app/invocations/latent.py
@lstein lstein requested a review from RyanJDick April 10, 2024 22:53
@psychedelicious (Collaborator)

What's the difference between "auto" and "autocast"?

@lstein (Collaborator, Author) commented Apr 11, 2024

What's the difference between "auto" and "autocast"?

"auto" selects one of the floating point precision types. "autocast" used to activate the torch.autocast() context for certain generation operations, but this code has been removed, so I've gone ahead and removed references to this configuration option.

By the way, I noticed a TODO from Ryan in model_patcher.py saying that torch.autocast might provide a speed benefit in model patching. It could be hard-coded there if needed.

@psychedelicious (Collaborator)

"auto" selects one of the floating point precision types. "autocast" used to activate the torch.autocast() context for certain generation operations, but this code has been removed, so I've gone ahead and removed references to this configuration option.

Gotcha. We need to be careful with changing valid config settings. If somebody has "autocast" in their invokeai.yaml file, they will get a pydantic error on startup. It may be OK for this particular setting, but generally we'll want to write a config migration script, like config_default.py:migrate_v3_config_dict
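The migration this comment asks for could look roughly like the following (a hypothetical stdlib-only sketch; the function name is invented here, and the real migration lives in InvokeAI's config_default.py and handles the full config schema):

```python
from typing import Any, Dict


def migrate_precision_setting(config: Dict[str, Any]) -> Dict[str, Any]:
    """Map the removed "autocast" precision value to "auto" before the
    config dict is handed to pydantic for validation.

    Hypothetical sketch in the spirit of config_default.py:migrate_v3_config_dict.
    """
    migrated = dict(config)  # leave the caller's dict untouched
    if migrated.get("precision") == "autocast":
        migrated["precision"] = "auto"
    return migrated
```

Running this before validation means a user with precision: autocast in invokeai.yaml gets a silent upgrade instead of a startup error.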

@lstein (Collaborator, Author) commented Apr 11, 2024

Right. I'll put in a migration script.

@RyanJDick (Collaborator) left a comment

Good work on this PR. I love to see our test coverage going up! 😄

I've approved, but would still like some clarity on how you're thinking about the following:
1. As @psychedelicious mentioned, this will break configs that contain autocast. Are we thinking that there are probably very few people with that setting, so we'll just provide guidance to anyone who does hit this?
Edit: I just saw your comment about addressing this.
2. How confident are we that context.models.get_execution_device() is the right API? I can imagine how it would enable multi-GPU, but without the full context of how it's going to be used, I'm slightly nervous that we're going to end up making breaking changes to it.

@psychedelicious (Collaborator)

Indeed we should be careful when adding to the public API, because we are promising to support and maintain it.

Some more thoughts:

  • I'm not sure get_execution_device should be in the context.models namespace. Feels more like a utility to me.
  • There's no indication that using TorchDeviceSelect.get_execution_device() directly in a node is going to cause problems - because it doesn't, except in a separate branch that may be merged in the future, where it won't work for some nodes. In fact, some nodes do use TorchDeviceSelect directly. It's confusing and has footgun potential.
  • "TorchDeviceSelect" suggests to me that I can use this class to select a torch device. Maybe something more generic like "TorchHelper" is clearer.
  • The invocation_api module exports some objects that are removed in this PR:
    from invokeai.backend.util.devices import CPU_DEVICE, CUDA_DEVICE, MPS_DEVICE, choose_precision, choose_torch_device
    For better or worse, we have committed these to the public API. Need to handle that.
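One way to honor that commitment is a thin compatibility layer that restores the legacy names as aliases over the new class. This is a hypothetical sketch (device objects are plain strings here rather than torch.device, and the real legacy signatures may differ):

```python
# Compatibility shim sketch: restore the legacy public names removed from
# invokeai.backend.util.devices so imports via invocation_api keep working.
# The constants were originally torch.device objects; strings stand in here.

CPU_DEVICE = "cpu"    # was torch.device("cpu")
CUDA_DEVICE = "cuda"  # was torch.device("cuda")
MPS_DEVICE = "mps"    # was torch.device("mps")


def choose_precision(device: str) -> str:
    """Legacy precision picker: half precision on GPU backends, full on CPU.

    Hypothetical behavior; the real function's rules may be more nuanced
    (e.g. per-GPU capability checks).
    """
    return "float16" if device in (CUDA_DEVICE, MPS_DEVICE) else "float32"
```

Keeping these as aliases means downstream nodes that imported the old names keep working while new code moves to TorchDevice.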

@lstein lstein marked this pull request as draft April 12, 2024 05:36
@lstein (Collaborator, Author) commented Apr 12, 2024

I'm tired of working on this and will get back to it at some point soon. Converting to draft for now.

1. Remove TorchDeviceSelect.get_execution_device(), as well as calls to
   context.models.get_execution_device().
2. Rename TorchDeviceSelect to TorchDevice
3. Added back the legacy public API defined in `invocation_api`, including
   choose_precision().
4. Added a config file migration script to accommodate removal of precision=autocast.
@lstein (Collaborator, Author) commented Apr 14, 2024

Indeed we should be careful when adding to the public API, because we are promising to support and maintain it.

Some more thoughts:

  • I'm not sure get_execution_device should be in the context.models namespace. Feels more like a utility to me.

  • There's no indication that using TorchDeviceSelect.get_execution_device() directly in a node is going to cause problems - because it doesn't, except in a separate branch that may be merged in the future, where it won't work for some nodes. In fact, some nodes do use TorchDeviceSelect directly. It's confusing and has footgun potential.

  • "TorchDeviceSelect" suggests to me that I can use this class to select a torch device. Maybe something more generic like "TorchHelper" is clearer.

  • The invocation_api module exports some objects that are removed in this PR:

    from invokeai.backend.util.devices import CPU_DEVICE, CUDA_DEVICE, MPS_DEVICE, choose_precision, choose_torch_device

    For better or worse, we have committed these to the public API. Need to handle that.

These recommendations have now been implemented.

@lstein lstein marked this pull request as ready for review April 14, 2024 21:27
@lstein lstein changed the title [util] Add object-oriented torch device class [util] Add generic torch device class Apr 14, 2024
@lstein lstein force-pushed the lstein/feat/device-abstraction branch from 48ce7b4 to 7e177c1 Compare April 14, 2024 22:54
@psychedelicious (Collaborator) left a comment

Thanks for addressing the feedback, sorry if it was a hassle. A couple of minor changes and comments.

Review threads (outdated, resolved) on: invokeai/backend/model_manager/merge.py, invokeai/backend/util/devices.py (×2)
@psychedelicious psychedelicious self-requested a review April 15, 2024 02:21
@lstein lstein enabled auto-merge (squash) April 15, 2024 13:03
@lstein lstein merged commit e93f4d6 into main Apr 15, 2024
14 checks passed
@lstein lstein deleted the lstein/feat/device-abstraction branch April 15, 2024 13:12