-
Notifications
You must be signed in to change notification settings - Fork 745
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL][CUDA][HIP] Support zero range kernel for cuda and hip backends. #7044
Conversation
mmoadeli
commented
Oct 13, 2022
•
edited
Loading
edited
- Fixes issue 6963 to allow range zero kernel for cuda and hip backends.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for submitting a fix for this, @mmoadeli! This definitely seems like a bug, given the SYCL 2020 specification states that:
When the global size is zero, the kernel function is not executed, the local size is ignored, and any dependencies are satisfied.
I do however wonder if it would be better to handle this at the runtime level rather than in the individual plugins. Based on the new test I assume this is also currently not handled by the L0 plugin. @smaslov-intel - What do you think?
@@ -2608,6 +2608,10 @@ pi_result cuda_piEnqueueKernelLaunch( | |||
assert(work_dim > 0); | |||
assert(work_dim < 4); | |||
|
|||
if (*global_work_size == 0) { | |||
return PI_SUCCESS; | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing an early exit here means that we do not create an event. I fear that could cause unexpected problems. Same goes for HIP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@steffenlarsen I agree with you on this regarding the event.
If I move the condition (if (*global_work_size =! 0 )) to guard the call to PI_CHECK_ERROR(cuLaunchKernel( it will preserve the functionality related to event handling. Do you have any reservation doing that?
I have not tried the L0, but the opencl:cpu and esimd (and host in an earlier version of DP++) backend are tried and work without the need for any modifications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either that or you could copy the event creation to here, something like:
if (event) {
std::unique_ptr<_pi_event> retImplEv{nullptr};
retImplEv = std::unique_ptr<_pi_event>(_pi_event::make_native(
PI_COMMAND_TYPE_NDRANGE_KERNEL, command_queue));
retImplEv->start();
retImplEv->record();
*event = retImplEv.release();
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it will be a correct implementation. A call with NDRange == 0
while should do nothing in terms of running the kernel, the event produced should be as if we run a kernel:
- Completion of such an event should guarantee completion of events passed in
event_wait_list
- The event can be used in a
event_wait_list
of a subsequent enqueue.
So, I think that the more correct implementation would be just calling cuda_piEnqueueEventsWaitWithBarrier
if NDRange == 0
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good point. Now that there are multiple queues, recording an event is not enough to act like a barrier.
queue q; | ||
q.submit( | ||
[&](handler &cgh) { cgh.parallel_for(range<1>(0), [=](id<1> i) {}); }); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Device-dependent testing should be in https://github.com/intel/llvm-test-suite rather than in the in-tree LIT tests. Could you please move this test to there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you agree, the device dependent part of the test to be removed. The same behaviour should be seen on all backends.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if that is the case, there is no guarantee that a device is available, so it could fail to create a queue.
Probably it's not worth optimizing, but the SYCL RT can emulate the required behavior by submitting an RT barrier command only, while plugins can do something better. |
…zero. - Removes range_zero_size.cpp, this to be added in to sycl-test-suit repo.
@romanovvlad @steffenlarsen I'd be happy to address any potential issues on this PR to have it merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay! Changes look good. Where did you move the tests to?
The test can be found in first commit 719c6b8 |
It would be preferable if you can open a PR on the test-suite with the test. We can trigger testing with it on this PR, albeit only on L0 and OCL. |
/verify with intel/llvm-test-suite#1363 |
Co-authored-by: Romanov Vlad <vlad.romanov@intel.com>
Co-authored-by: Romanov Vlad <vlad.romanov@intel.com>
i have got same problem: terminate called after throwing an instance of 'sycl::_V1::runtime_error' code: It does not falling when i removing parallel_for from submit. How can i fix it? Queue is under cpu_selector_v |