[WebGPU] Handle device OOM in createBuffer #17005

Merged 1 commit into apache:main on May 17, 2024

Conversation

CharlieFRuan (Contributor)

Prior to this PR, WebGPU errors such as OOM are only logged as a warning without affecting the program. This PR handles WebGPU errors using pushErrorScope() and popErrorScope(), following https://github.com/gpuweb/gpuweb/blob/main/design/ErrorHandling.md.
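For reference, a minimal sketch of that scoped error-capture pattern around a single `createBuffer()` call, using only standard WebGPU API calls (the wrapper name is illustrative, not the PR's exact code):

```typescript
// Sketch: capture all three GPU error types around one createBuffer() call.
// Scopes pop in LIFO order; each popErrorScope() promise resolves to the
// first error captured in that scope, or null if none occurred.
function createBufferScoped(
  device: GPUDevice,
  descriptor: GPUBufferDescriptor
): GPUBuffer {
  device.pushErrorScope("internal");
  device.pushErrorScope("out-of-memory");
  device.pushErrorScope("validation");
  const buffer = device.createBuffer(descriptor);
  for (let i = 0; i < 3; ++i) {
    device.popErrorScope().then((error: GPUError | null) => {
      if (error !== null) {
        console.error("Error in createBuffer():", error.message);
      }
    });
  }
  return buffer;
}
```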

We replace createBuffer() with tryCreateBuffer(), in which we catch all three types of errors. For now, we treat any error that occurs in createBuffer() as fatal and hence call device.destroy(). When a device is initialized, we use device.lost.then() to listen for the device loss triggered by device.destroy(), upon which we log the error and call Instance.dispose(), prompting the user to re-initialize.
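A minimal sketch of that device-lost listener, assuming `device` is the initialized `GPUDevice` and `instance` stands in for the runtime `Instance` (both names are illustrative):

```typescript
// Sketch: on a fatal error, tryCreateBuffer() calls device.destroy(), which
// resolves the device.lost promise with reason "destroyed". The listener
// installed at device initialization then tears down the runtime.
device.lost.then((info: GPUDeviceLostInfo) => {
  console.error(`Device lost (${info.reason}): ${info.message}`);
  instance.dispose(); // free runtime state; the user must re-initialize
});
```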

See mlc-ai/web-llm#356 for motivation.

Tested end-to-end with WebLLM.

tqchen merged commit afb6416 into apache:main on May 17, 2024
15 checks passed
CharlieFRuan added a commit to mlc-ai/web-llm that referenced this pull request May 21, 2024
Prior to this PR, when users call `createEngine()` or `reload()` with a model that is too large for the device, the device would likely keep generating, ignoring the OOM issue and producing incorrect output. See #356 and #209.

This PR catches such errors with `device.lost.then()`, relying on tvmjs to call `device.destroy()` upon detecting an error in `createBuffer()` via apache/tvm#17005.

We have only observed `createBuffer()` errors and hence only handle that kind of error for now. Besides, since most OOM errors occur in `reload()`, we make the error handling effectively synchronous despite using `.then()`: if an error was recorded, we throw it at the end of `reload()`, as sketched below.
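A hedged sketch of that deferral, assuming a module-level `deviceLostError` and a simplified `reload()`; names and structure are illustrative, not web-llm's exact code:

```typescript
// Record the asynchronous device-lost error so reload() can rethrow it.
let deviceLostError: Error | undefined;

device.lost.then((info: GPUDeviceLostInfo) => {
  deviceLostError = new Error(`WebGPU device lost: ${info.message}`);
});

async function reload(modelId: string): Promise<void> {
  // ... load weights and allocate GPU buffers here (may OOM) ...
  // The awaits above yield to the event loop, so if tvmjs destroyed the
  // device, the device.lost callback has had a chance to run by now.
  if (deviceLostError !== undefined) {
    throw deviceLostError; // caller sees reload() reject with the OOM error
  }
}
```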
CharlieFRuan added a commit to mlc-ai/web-llm that referenced this pull request May 21, 2024
### Changes
Main changes include:
- New model `Hermes-2-Pro-Mistral-7B` in `prebuiltAppConfig` via:
  - #390
- Various `index.js` and `index.js.map` post-processing steps to resolve frontend compatibility issues with `require()` and `perf_hooks`
  - #397
  - #406
- Catch WebGPU OOM errors upon `reload()` and `CreateEngine()`:
  - #402
- Service Worker support (in addition to Extension Service Worker):
  - #395
  - #400
  - #401

### WASM Version
v0_2_34, as no change is required.

### TVMjs
TVMjs compiled at
apache/tvm@a5862a5,
with only one change in `tvm/web`:
apache/tvm#17005