[sycl][cuda][hip] Expose cuda/hip const addrspace via device_global<const T> #16001
base: sycl
Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
sycl/doc/extensions/experimental/sycl_ext_oneapi_device_global.asciidoc
….asciidoc Reword spec note to generalize beyond dpc++ compiler Co-authored-by: John Pennycook <john.pennycook@intel.com>
using namespace sycl;
using namespace sycl::ext::oneapi::experimental;

device_global<const int> DeviceGlobalVar;
I'm not sure this is the right thing. Why not declare the `device_global` itself as `const` in this case:
- device_global<const int> DeviceGlobalVar;
+ const device_global<int> DeviceGlobalVar;
It seems like we are moving in a direction where the data type of these "wrapper" classes is expected to be cv-unqualified. For example, the Khronos WG recently decided that `vec` should only allow a cv-unqualified `DataT`.
You could do this, but you would have to make further changes to the specification.
Currently the `memcpy` method of queue/handler does not allow copying a `const device_global` from host to device. This functionality is absolutely key to making a `const device_global` useful.
I did not see any problem with the spec as it is, and the fact that SYCLomatic is already translating code from `__constant__` to `device_global<const T>` persuaded me that the current solution is the best option. I also think that, all things considered, the current solution is the most semantically correct with respect to what the compiler actually does.
If you wish to go down the route of changing the `device_global` spec as described above, that is fine. The only other downside from my perspective is that this will slow down the Blender work a little, and SYCLomatic will have to be changed. But in the grand scheme of things that doesn't matter, I guess.
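To make the difference between the two spellings concrete, here is a minimal standard C++17 sketch. It does not use SYCL: `mock_global`, `mock_memcpy`, `host_copyable`, and `demo_host_update` are hypothetical stand-ins invented for illustration, and the sketch assumes the host-side copy routine takes its destination by non-const reference, as described in the comment above.

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for device_global; NOT the real SYCL class.
template <typename T>
class mock_global {
  std::remove_const_t<T> storage_{};

public:
  // Kernel-style access: constness follows both T and the object itself.
  T& get() { return storage_; }
  const T& get() const { return storage_; }

  // Raw storage, used only by the host-side copy helper below.
  void* raw() { return &storage_; }
};

// Hypothetical host-side copy. The destination is a non-const reference,
// which is the assumption this sketch is built on.
template <typename T>
void mock_memcpy(mock_global<T>& dest, const void* src) {
  std::memcpy(dest.raw(), src, sizeof(T));
}

// Detects whether mock_memcpy(dest, src) compiles for destination type G.
template <typename G, typename = void>
struct host_copyable : std::false_type {};

template <typename G>
struct host_copyable<
    G, std::void_t<decltype(mock_memcpy(std::declval<G&>(),
                                        std::declval<const void*>()))>>
    : std::true_type {};

// mock_global<const int>: read-only through get(), still a valid copy target.
static_assert(host_copyable<mock_global<const int>>::value);
// const mock_global<int>: the object is const, so it cannot bind to dest.
static_assert(!host_copyable<const mock_global<int>>::value);

// Updates the value from the "host", then reads it through the const view.
inline int demo_host_update(int new_value) {
  mock_global<const int> g;
  mock_memcpy(g, &new_value);
  assert(g.get() == new_value);
  return g.get();
}
```

Under these assumptions, the element-const spelling stays writable from the host while being read-only through `get()`, whereas the object-const spelling would need the copy routine's signature to change.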
> Currently the `memcpy` method of queue/handler does not allow copying a `const device_global` from host to device. This functionality is absolutely key to making a `const device_global` useful.
This seems OK to me. In the current specification `const device_global` seems useless. You can't write the value from either the host or device. For such a case, why use `device_global` at all? If you just want a constant, SYCL already allows you to define a constant variable as `const int X = <val>` or `constexpr int X = <val>`.
Sorry, I should have given more details. In SIMT, values shared across multiple threads in a subgroup will sometimes be constant across those threads. In such a case, it is more efficient for the GPU to retrieve the common value a single time for all threads than for every thread to retrieve the data individually. This is my basic understanding of the motivation for constant caches in GPGPU.
> This seems OK to me. In the current specification `const device_global` seems useless. You can't write the value from either the host or device. For such a case, why use `device_global` at all? If you just want a constant, SYCL already allows you to define a constant variable as `const int X = <val>` or `constexpr int X = <val>`.
I think that in this case this would mean that the program cannot update the `const X` variable. In such a case I think it would always be more efficient for a programmer to do something like
#define REG_VAR 10
and then set the value of e.g. a register directly within kernel code to `REG_VAR` without going via any other device memory at all (even .const).
The use case of `__constant__` (or, in the current PR setup, `device_global<const T>`) is to allow the user to continuously update the constant values at runtime by copying from the host (via `queue::memcpy`). This is generally how programs like Blender and NWChemEx currently use `__constant__` variables with the equivalent CUDA API.
A third option that I didn't mention previously is to use an attribute like `__constant__` (it is a macro but maps to an attribute in attr.td) and not the `const` keyword. I considered this and got it working, but I decided against it because I didn't want to complicate the extension by adding a new macro. I then considered getting it to work with the existing "const" attribute that is already exposed to all SYCL backends in attr.td. Once I realized that `device_global<const T>` works fully within the use cases we need (and that `__constant__` implies `const`, and that `const` has the added advantage of already warning users who try to change the value in kernel code, which attributes don't do by default), and given that this "const" attribute in attr.td could be used in different ways in an arbitrary backend that I don't understand, I decided to stick with `device_global<const T>`. We could do one of the other options I described, and if you tell me to do that I can, but be aware that this requires more work, and I do think that the `device_global<const T>` solution that SYCLomatic translates to is a good one.
It seems to me that the cleanest API would be like this:
- Recommend that applications use `const device_global<T>` to represent a global variable that is read from device code but never written from device code.
- Change SYCLomatic to migrate to this code pattern.
- Change the "copy" member functions that copy to a `device_global` to allow copying even if the `device_global` is declared `const`. Clarify the wording in the specification to say that it is allowed for the host to write to such `device_global` variables.
How hard would it be to implement this behavior?
@gmlueck IMO that's potentially limiting valid use cases. It seems completely legitimate that the value inside `device_global` can be const-qualified; I'd argue it better signals the intent of device constant memory as being modifiable on host, but constant on device.
> If we can't reach consensus about this, it might be better to avoid `const` completely, in favor of something like `device_global<T, AccessMode>`, so we can clearly document that the `AccessMode` only applies to device-access, and it can be any of the valid access modes.
OK, this sounds like a better solution. The `device_global` class already supports a property named `host_access` which tells whether the variable can be read / written from the host. Perhaps we could add a property named `device_access` which tells whether the variable can be read / written from the device.
It seems like the only useful access modes are `read_write` (which would be the default) and `read`. I'm not sure it makes sense to support the other access modes. In particular, I don't think we can structure the API in a way that enforces `write` access mode because there's no way to return a write-only pointer or reference.
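As a rough illustration of what a device-side access property could mean for the API surface, here is a minimal standard C++17 sketch. Everything in it is hypothetical: `mock_property_global`, the tag types `device_access_read` and `device_access_read_write`, `device_ref`, `host_write`, and `demo_device_access` are invented names, and a plain template argument stands in for the extension's compile-time properties mechanism, which is not modelled here.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical property tags standing in for a device_access property.
struct device_access_read {};
struct device_access_read_write {};

// Hypothetical stand-in for device_global; NOT the real SYCL class.
template <typename T, typename DeviceAccess = device_access_read_write>
class mock_property_global {
  static_assert(!std::is_const_v<T>, "T stays cv-unqualified in this sketch");
  T storage_{};

  static constexpr bool read_only =
      std::is_same_v<DeviceAccess, device_access_read>;

public:
  // Reference type handed to "device" code: const when access is read-only.
  using device_ref = std::conditional_t<read_only, const T&, T&>;

  // "Device-side" access.
  device_ref get() { return storage_; }

  // "Host-side" write path, independent of the device access mode.
  void host_write(const T& v) { storage_ = v; }
};

static_assert(std::is_same_v<
              mock_property_global<int, device_access_read>::device_ref,
              const int&>);
static_assert(
    std::is_same_v<mock_property_global<int>::device_ref, int&>);

// The host can still update a variable that device code may only read.
inline int demo_device_access() {
  mock_property_global<int, device_access_read> g;
  g.host_write(5);
  assert(g.get() == 5);
  return g.get();
}
```

Only `read` and `read_write` are modelled, which matches the observation above that a write-only mode has no natural reference type to return.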
Are you all happy if I map `device_global<T, AccessMode>` to use the .const address space for cuda/hip?
I'm sure I can make it work, although I don't know exactly how to do it immediately, and it will probably require me to add an attribute to attr.td.
Also, can you confirm that you do not want me to map `device_global<const T>` to the .const address space for cuda/hip? In theory I could make both of the above cases work.
Thanks
Wouldn't the `AccessMode` communicate the same information, just in a less intuitive way? That was part of the accessor simplification for SYCL 2020, to allow `const T` for the data type.
When you say `AccessMode`, I assume you really mean that you will add a new compile-time property, right? I think we should not add a new template parameter `AccessMode`.
I do not think that `const T` is more intuitive.
> FE changes require FE tests (not just E2E). Please add a FE test
Hi, I didn't add an e2e test. I only have a device code test. The current test checks that the .const (aka "4") address space is used for cuda/hip, and global ("1") otherwise. Do you mean that I should move the existing device_code test to the clang/test/CodeGenSYCL test folder so it can be tested with
Thanks
// RUN: %clangxx -fsycl -fsycl-device-only %if cuda %{ -fsycl-targets=nvptx64-nvidia-cuda %} %if hip-amd %{ -fsycl-targets=amdgcn-amd-amdhsa -Xsycl-target-backend --offload-arch=gfx90a %} -S -emit-llvm %s -o - | FileCheck %s %if cuda || hip-amd %{ --check-prefixes=CHECK-CONST %}

// Tests that device_global<const T> uses const address space for cuda/hip and
// global address space otherwise
- // global address space otherwise
+ // global address space otherwise.
queue Q;
Q.single_task([]() {
// CHECK-CONST: addrspace(4) @DeviceGlobalVar
// CHECK: addrspace(1) @DeviceGlobalVar
We fall back to this one if no cuda or hip, right?
Yep
Thanks for pointing that out. You can just move it to CodeGenSYCL and change it to use clang_cc1 instead of the driver.
Expose cuda/hip const addrspace (`__constant__`) via `device_global<const T>`
Nvidia GPUs have a dedicated constant memory cache which can be a lot faster in some cases for constant global device variables ("constant cuda symbols"). CUDA programmers access this cache via global variables marked `__constant__`.
AMD GPUs do not have a dedicated constant memory cache (as far as I am aware). However, the HIP programming model does support `__constant__`. As well as supporting the constant cache in the Nvidia case, when AMD GPUs are the target the macro can be used as a compiler hint for other optimizations, such as using SGPRs (scalar registers) instead of VGPRs (vector registers).
This patch switches on these optimizations for the cuda/hip backends of dpc++.
SYCLomatic already translates `__constant__ T` to `device_global<const T>`. This is a natural translation that allows complete support of `device_global` features under the constraint that programmers cannot update the `device_global<const T>` in kernel code (matching `__constant__` semantics in cuda/hip), whilst still allowing them to update this constant global variable via `queue::memcpy(const device_global<T>)`, which maps naturally to how the CUDA APIs allow programmers to update `__constant__` device symbols via the host.
Key applications that have been identified as benefiting from this:
Kokkos (general)
Blender
NWCHEMEX aka Exachem
Fixes #5827
Fixes #4278