[Proposal] Add cpu alloc/free callbacks to support custom memory allocator APIs #1898
cc @mgouicem
Hi @xuhancn and thanks for the proposal. Some time ago, we decided to rely on pointers pre-allocated by users instead of malloc/free callbacks. There were two main reasons for this:
In general, the memory allocation in oneDNN happens in four places:
Could you clarify if you are already using the mechanisms above and still see allocation overheads?
Hi @mgouicem My proposal is indeed intended to optimize the item you mentioned. The POC PR is here: pytorch/pytorch#126049, which contains:
The performance comparison is as follows: after mimalloc is registered, mkldnn_convolution performance improves by about 0.3 s. Could you please help design a memory allocation callback mechanism? It would help pytorch on Windows get better performance; much appreciated.
Summary
During our pytorch development, we found that the Windows system memory allocator performs poorly and slows down overall pytorch performance. After adding a third-party memory allocator, pytorch improved its tensor allocation performance. For details, please see pytorch/pytorch#102534
As a pytorch submodule, oneDNN still uses the system memory allocator to allocate buffers for reorder/reshape operations.
The related code is here:
oneDNN/src/common/utils.cpp
Lines 146 to 170 in 11f5558
I added some debug logging to confirm this as well.
On Windows, I tested resnet18; it makes more than 360k malloc/free calls via the system malloc/free.
Shown below:
Problem statement
Because memory allocation is slow on Windows, I also wrote a malloc benchmark: https://github.com/xuhancn/bench_malloc
Third-party memory allocation libraries can improve performance.
This also works well in pytorch: pytorch/pytorch#102534 (comment)
So, we need a way to let oneDNN use a third-party memory allocator for better performance.
Option 1: Add a memory allocation library as a submodule.
Actually, this is not a good option:
Option 2: Add cpu alloc/free callbacks to support custom memory allocator APIs.
This is a lightweight way to change the memory allocation implementation.
Preferred solution
For option 2 above:
First, we can define the callback functions:
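The code for the callback definitions did not survive in this copy of the issue, so here is a minimal sketch. All names (`cpu_malloc_cb_t`, `example_malloc`, etc.) are hypothetical, not existing oneDNN API; the example backing is C++17 `std::aligned_alloc`, while a real user would point the callbacks at something like mimalloc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical callback signatures (illustrative names, not oneDNN API).
// The allocator receives the requested size and alignment; the deallocator
// releases a pointer previously obtained from the matching allocator.
typedef void *(*cpu_malloc_cb_t)(std::size_t size, std::size_t alignment);
typedef void (*cpu_free_cb_t)(void *ptr);

// Example callback pair backed by C++17 aligned allocation. A real
// registration would instead use mimalloc's mi_malloc_aligned/mi_free.
inline void *example_malloc(std::size_t size, std::size_t alignment) {
    // std::aligned_alloc requires size to be a multiple of alignment.
    std::size_t padded = (size + alignment - 1) / alignment * alignment;
    return std::aligned_alloc(alignment, padded);
}

inline void example_free(void *ptr) { std::free(ptr); }
```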
The registration API is as below:
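The registration API code is likewise missing here, so this is a sketch of what such an entry point could look like. The names `set_cpu_allocator`, `status_t`, and the global callback slots are assumptions for illustration; a real patch would follow oneDNN's `dnnl_*` naming and status codes.

```cpp
#include <cstddef>

// Hypothetical callback types (same shape as the callback definitions above).
typedef void *(*cpu_malloc_cb_t)(std::size_t size, std::size_t alignment);
typedef void (*cpu_free_cb_t)(void *ptr);

// Simplified stand-in for oneDNN's dnnl_status_t.
enum status_t { status_success = 0, status_invalid_arguments = 1 };

// Global callback slots; nullptr means "use the built-in allocator".
static cpu_malloc_cb_t g_malloc_cb = nullptr;
static cpu_free_cb_t g_free_cb = nullptr;

// Register (or clear) both callbacks together so alloc/free stay paired:
// accepting only one of the two would leak or double-free.
status_t set_cpu_allocator(cpu_malloc_cb_t malloc_cb, cpu_free_cb_t free_cb) {
    if ((malloc_cb == nullptr) != (free_cb == nullptr))
        return status_invalid_arguments;
    g_malloc_cb = malloc_cb;
    g_free_cb = free_cb;
    return status_success;
}
```

Requiring the pair to be set atomically is the key design point: buffers allocated through a callback must never be released through the default path.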
Reference implementation:
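The reference implementation also did not survive the scrape. Below is a self-contained sketch of how the wrapper in src/common/utils.cpp could dispatch: use a registered callback when present, otherwise fall back to the aligned system allocator (`_aligned_malloc` on Windows, `posix_memalign` elsewhere, mirroring oneDNN's existing paths). The `impl_malloc`/`impl_free` names and callback globals are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical callback plumbing; in a real patch these would live next to
// the existing malloc/free wrappers in src/common/utils.cpp.
typedef void *(*cpu_malloc_cb_t)(std::size_t size, std::size_t alignment);
typedef void (*cpu_free_cb_t)(void *ptr);
static cpu_malloc_cb_t g_malloc_cb = nullptr;
static cpu_free_cb_t g_free_cb = nullptr;

// Allocate through the user callback if one is registered, otherwise fall
// back to the aligned system allocator.
void *impl_malloc(std::size_t size, int alignment) {
    if (g_malloc_cb) return g_malloc_cb(size, (std::size_t)alignment);
    void *ptr = nullptr;
#ifdef _WIN32
    ptr = _aligned_malloc(size, alignment);
#else
    if (::posix_memalign(&ptr, alignment, size) != 0) ptr = nullptr;
#endif
    return ptr;
}

// Release through the matching path so callback-allocated buffers are never
// handed to the system deallocator.
void impl_free(void *ptr) {
    if (g_free_cb) {
        g_free_cb(ptr);
        return;
    }
#ifdef _WIN32
    _aligned_free(ptr);
#else
    std::free(ptr);
#endif
}
```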
Additional question:
oneDNN has two pieces of malloc/free implementation:
oneDNN/src/common/utils.cpp
Lines 146 to 170 in 11f5558
oneDNN/src/graph/utils/alloc.cpp
Lines 62 to 80 in 11f5558
Do we need to add callbacks to both of them?
CC: @jgong5, @chunyuan-w, @Guobing-Chen