Reuse OneDNN handler for SGD and SUM for SelectedRows input tensors. #35510
Conversation
* Create stateful OneDNNAXPYHandler object. This makes it possible to call it multiple times without recreating the oneDNN primitives every time.
* For dense param and sparse grad.
Thanks for your contribution!
* Stop using the PP cache mechanism to store memory and primitive objects.
* Handler object stores and reuses the needed descriptors & primitives.
LGTM
@wzzju Please take a look at the changes in this PR.
@wzzju Hi, please take a look at this PR; it uses oneDNN SUM and SGD.
@tsocha Please review this PR.
LGTM for PADDLE_ENFORCE
@chenwhql Could you please approve PR-CI-APPROVAL?
…addlePaddle#35510)
* Create stateful OneDNNAXPYHandler object. This makes it possible to call it multiple times without recreating the oneDNN primitives every time.
* Prepare SGDOpKernel to reuse its implementation from OneDNN kernel.
* OneDNN SGD kernel.
* Update call to use new OneDNNAXPYHandler object API.
* Setup seed in proper place.
* Enable OneDNN kernel only for single case.
* For dense param and sparse grad.
* Small refactor.
* Enable oneDNN by op attr or by cmd line flag.
* Use int64_t type for number of elements.
* Support dense param and grad from OneDNN kernel.
* Enable SGD OneDNN kernel when using MP BF16 optimizer.
* Force non-copyable/movable OneDNNAXPYHandler.
* Reuse OneDNNAXPYHandler for sparse tensors in SUM op.
* Fix SFINAE rules.
* Remove recording event inside AXPY.
* Get rid of internal primitive caching.
* Stop using the PP cache mechanism to store memory and primitive objects.
* Handler object stores and reuses the needed descriptors & primitives.
* Do not derive from MKLDNNHandlerT.
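To illustrate the idea behind the stateful handler, here is a minimal C++ sketch. This is not Paddle's actual code: the class layout, constructor parameters, and the choice of a oneDNN reorder primitive with an output scale plus a sum post-op (oneDNN v2.x API) are assumptions made for the example. It shows the pattern the commits describe: the descriptor and primitive are built once, and only the data handles change between calls.

```cpp
#include <cstdint>
#include "oneapi/dnnl/dnnl.hpp"

// Minimal sketch (assumed names, oneDNN v2.x API): computes y = alpha*x + y
// via a reorder primitive carrying an output scale (alpha) and a sum post-op.
// The primitive is created once in the constructor and reused on every call.
class OneDNNAXPYHandler {
 public:
  OneDNNAXPYHandler(int64_t n, float alpha)
      : engine_(dnnl::engine::kind::cpu, 0), stream_(engine_) {
    dnnl::memory::desc md({n}, dnnl::memory::data_type::f32,
                          dnnl::memory::format_tag::x);
    // Memory objects without buffers; handles are attached per call.
    src_ = dnnl::memory(md, engine_, nullptr);
    dst_ = dnnl::memory(md, engine_, nullptr);

    dnnl::primitive_attr attr;
    attr.set_output_scales(/*mask=*/0, {alpha});  // scale x by alpha
    dnnl::post_ops ops;
    ops.append_sum(1.0f);                         // accumulate into y
    attr.set_post_ops(ops);

    auto pd = dnnl::reorder::primitive_desc(engine_, md, engine_, md, attr);
    axpy_ = dnnl::reorder(pd);  // built once; no per-call recreation
  }

  // Mirrors the PR's "Force non-copyable/movable OneDNNAXPYHandler".
  OneDNNAXPYHandler(const OneDNNAXPYHandler&) = delete;
  OneDNNAXPYHandler& operator=(const OneDNNAXPYHandler&) = delete;

  void operator()(const float* x, float* y) {
    // Only the data handles change between calls; primitives are reused.
    src_.set_data_handle(const_cast<float*>(x));
    dst_.set_data_handle(y);
    axpy_.execute(stream_, src_, dst_);
    stream_.wait();
  }

 private:
  dnnl::engine engine_;
  dnnl::stream stream_;
  dnnl::memory src_, dst_;
  dnnl::reorder axpy_;
};
```

Because the handler owns its descriptor and primitive for its whole lifetime, nothing needs to be stored in or fetched from a global cache between calls, which appears to be what "Stop using the PP cache mechanism" and "Do not derive from MKLDNNHandlerT" refer to.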
PR types
Performance optimization
PR changes
OPs
Describe
This PR optimizes the current implementations of the SGD and SUM operators, mainly for the BF16 data type, when SelectedRows (sparse) tensors are used, by reusing a oneDNN handler.
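As a hedged illustration of how such a handler might be reused in an SGD kernel with a SelectedRows gradient, consider the sketch below. The function, variable names, and data layout are assumptions for the example, not Paddle's kernel code; it builds on the OneDNNAXPYHandler sketch shown earlier.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical driver: for each row listed in the SelectedRows gradient,
// apply param_row += (-lr) * grad_row, reusing a single OneDNNAXPYHandler
// instead of recreating oneDNN primitives for every row.
void SparseSGDUpdate(const std::vector<int64_t>& grad_rows,
                     const float* grad_data, float* param_data,
                     int64_t row_width, float lr) {
  OneDNNAXPYHandler axpy(row_width, -lr);  // one primitive for all rows
  for (size_t i = 0; i < grad_rows.size(); ++i) {
    const float* g = grad_data + i * row_width;
    float* p = param_data + grad_rows[i] * row_width;
    axpy(g, p);  // p[:] += (-lr) * g[:]
  }
}
```

The per-row reuse is where the savings come from: with many small rows, primitive creation would otherwise dominate the actual AXPY work.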
The performance results for the word2vec model, measured on a CPX 6348 machine with a single thread, show a ~15% speedup.
From profiling: [profiler output not reproduced here]