[WebGPU] Handle device OOM in createBuffer #17005

Merged 1 commit into apache:main on May 17, 2024

Conversation

CharlieFRuan (Contributor)

Prior to this PR, WebGPU errors such as OOM are only logged as a warning without affecting the program. This PR handles WebGPU errors using pushErrorScope() and popErrorScope(), following https://github.com/gpuweb/gpuweb/blob/main/design/ErrorHandling.md.
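For reference, a minimal sketch of that scoped error-capture pattern around a single `createBuffer()` call, using only standard WebGPU API calls (the wrapper name is illustrative, not the PR's exact code):

```typescript
// Sketch: capture all three GPU error types around one createBuffer() call.
// Scopes pop in LIFO order; each popErrorScope() promise resolves to the
// first error captured in that scope, or null if none occurred.
function createBufferScoped(
  device: GPUDevice,
  descriptor: GPUBufferDescriptor
): GPUBuffer {
  device.pushErrorScope("internal");
  device.pushErrorScope("out-of-memory");
  device.pushErrorScope("validation");
  const buffer = device.createBuffer(descriptor);
  for (let i = 0; i < 3; ++i) {
    device.popErrorScope().then((error: GPUError | null) => {
      if (error !== null) {
        console.error("Error in createBuffer():", error.message);
      }
    });
  }
  return buffer;
}
```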

We replace createBuffer() with tryCreateBuffer(), in which we catch all three types of errors. For now, we treat any error that occurs in createBuffer() as fatal and hence call device.destroy(). When a device is initialized, we use device.lost.then() to listen for the device loss triggered by device.destroy(), upon which we log the error and call Instance.dispose(), prompting the user to re-initialize.
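A minimal sketch of that device-lost listener, assuming `device` is the initialized `GPUDevice` and `instance` stands in for the runtime `Instance` (both names are illustrative):

```typescript
// Sketch: on a fatal error, tryCreateBuffer() calls device.destroy(), which
// resolves the device.lost promise with reason "destroyed". The listener
// installed at device initialization then tears down the runtime.
device.lost.then((info: GPUDeviceLostInfo) => {
  console.error(`Device lost (${info.reason}): ${info.message}`);
  instance.dispose(); // free runtime state; the user must re-initialize
});
```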

See mlc-ai/web-llm#356 for motivation.

Tested end-to-end with WebLLM.

tqchen merged commit afb6416 into apache:main on May 17, 2024
15 checks passed
CharlieFRuan added a commit to mlc-ai/web-llm that referenced this pull request May 21, 2024
Prior to this PR, when users call `createEngine()` or `reload()` with a model that is too large for the device, the device would likely keep generating, ignoring the OOM issue and producing incorrect output. See #356 and #209.

This PR catches such errors with `device.lost.then()`, relying on tvmjs to call `device.destroy()` upon detecting an error in `createBuffer()` via apache/tvm#17005.

We have only observed `createBuffer()` errors and hence only handle that kind of error for now. Besides, since most OOM errors occur in `reload()`, we make the error handling effectively synchronous despite using `.then()`: if an error was recorded, we throw it at the end of `reload()`, as sketched below.
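A hedged sketch of that deferral, assuming a module-level `deviceLostError` and a simplified `reload()`; names and structure are illustrative, not web-llm's exact code:

```typescript
// Record the asynchronous device-lost error so reload() can rethrow it.
let deviceLostError: Error | undefined;

device.lost.then((info: GPUDeviceLostInfo) => {
  deviceLostError = new Error(`WebGPU device lost: ${info.message}`);
});

async function reload(modelId: string): Promise<void> {
  // ... load weights and allocate GPU buffers here (may OOM) ...
  // The awaits above yield to the event loop, so if tvmjs destroyed the
  // device, the device.lost callback has had a chance to run by now.
  if (deviceLostError !== undefined) {
    throw deviceLostError; // caller sees reload() reject with the OOM error
  }
}
```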
CharlieFRuan added a commit to mlc-ai/web-llm that referenced this pull request May 21, 2024
### Changes
Main changes include:
- New model `Hermes-2-Pro-Mistral-7B` in `prebuiltAppConfig` via:
  - #390
- Various `index.js` and `index.js.map` post-processing steps to resolve frontend compatibility issues with `require()` and `perf_hooks`
  - #397
  - #406
- Catch WebGPU OOM errors upon `reload()` and `CreateEngine()`:
  - #402
- Service Worker support (in addition to Extension Service Worker):
  - #395
  - #400
  - #401

### WASM Version
v0_2_34, as no change is required.

### TVMjs
TVMjs compiled at
apache/tvm@a5862a5,
with only one change in `tvm/web`:
apache/tvm#17005