
NVPTX backend metabug #38789

Open
japaric opened this issue Jan 2, 2017 · 30 comments
Labels
C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC metabug Issues about issues themselves ("bugs about bugs") O-NVPTX Target: the NVPTX LLVM backend for running rust on GPUs, https://llvm.org/docs/NVPTXUsage.html S-tracking-needs-summary Status: It's hard to tell what's been done and what hasn't! Someone should do some investigation. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@japaric
Member

japaric commented Jan 2, 2017

The NVPTX backend has been available since: nightly-2017-01-XX

This is a collection of bugs and TODOs related to it.

Documentation

Bugs

Missing features

  • The equivalent to CUDA's __shared__ modifier. Probably needs an RFC to land
    in the compiler as we don't have anything similar to it (AFAIK).

Stabilization

  • Stabilize the nvptx targets. In other words, add them to the compiler. Candidates for merging.
  • All the non-trivial kernels make use of intrinsics like blockIdx.x. These will have to be stabilized. Right now these intrinsics are implemented as "platform-intrinsics", but that feature is unstable.

  • Stabilize the "ptx-kernel" ABI. Tracking issue for the "ptx-kernel" ABI #38788
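For context, a hedged sketch of what such a kernel looks like today on nightly (the intrinsic names are illustrative assumptions, not guaranteed to match the current implementation, and this only compiles for an nvptx-* target):

```rust
#![feature(abi_ptx, platform_intrinsics)]
#![no_std]

// Unstable "platform-intrinsics": rough equivalents of blockIdx.x /
// threadIdx.x referenced above (names are assumptions).
extern "platform-intrinsic" {
    fn nvptx_block_idx_x() -> i32;
    fn nvptx_block_dim_x() -> i32;
    fn nvptx_thread_idx_x() -> i32;
}

// Unstable "ptx-kernel" ABI marks this as a device entry point.
#[no_mangle]
pub unsafe extern "ptx-kernel" fn add(a: *const f32, b: *const f32, out: *mut f32) {
    let i = (nvptx_block_idx_x() * nvptx_block_dim_x() + nvptx_thread_idx_x()) as isize;
    *out.offset(i) = *a.offset(i) + *b.offset(i);
}
```

Both the `"ptx-kernel"` ABI and the intrinsics above are unstable, which is what the stabilization items track.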

cc @rkruppe

@japaric japaric added metabug Issues about issues themselves ("bugs about bugs") O-NVPTX Target: the NVPTX LLVM backend for running rust on GPUs, https://llvm.org/docs/NVPTXUsage.html labels Jan 2, 2017
@Mark-Simulacrum Mark-Simulacrum added the C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC label Jul 26, 2017
@gnzlbg
Contributor

gnzlbg commented Feb 21, 2018

It would be better if, instead of having to create a new module/crate, add the #![feature(abi_ptx)] feature, and then declare the kernels as extern "ptx-kernel" fn foo() {}, we could just handle this using #[target_feature], so that one can add kernels within a non-kernel Rust module. It would be great if this attribute could be applied to closures as well.

That way we could just write our kernels inline with normal Rust code.

#[target_feature(enabled = "ptx_device_fn")] unsafe fn a_device_fn(...) { ... }
#[target_feature(enabled = "ptx_kernel")] unsafe fn a_kernel(...) { ... a_device_fn(...) ... }

unsafe fn bar() {
   cuda::driver::launch(x, y, z, w, a_kernel);
   cuda::driver::launch(x, y, z, w, #[target_feature(enabled = "ptx_kernel")] |...| a_kernel(...) );
}

This way users can use the typical target_feature facilities to write portable code.

@gnzlbg
Contributor

gnzlbg commented Feb 21, 2018

@japaric @rkruppe @alexcrichton I'd like to work on this.

@hanna-kruppe
Contributor

hanna-kruppe commented Feb 21, 2018

You seem to assume a "single source" model? We don't have that currently. You have to compile your kernels as one crate for a nvptx-* target, then embed the resulting PTX asm as string/resource into a crate that's compiled for the host target. The latter crate passes it to the driver for compilation. There's no way to mix host and device code in one compilation unit like nvcc allows.

Supporting that would require novel frontend integration (novel for Rust; clang already has something like this). For example, rustc would have to decide for each translation item whether it should be compiled for the host, for the device, or both -- and then combine the resulting PTX and host object files.

Additionally, even if/once we have "single source", target_feature seems inappropriate, since target_feature is for modifying the available instruction set within one target, while this here requires compiling the code for a different target altogether (and in many cases, for two targets as mentioned before). A custom attribute might be more appropriate but since this is really about "entry point for device code or not" rather than "can run on device or not" an ABI seems fine too.

@gnzlbg
Contributor

gnzlbg commented Feb 21, 2018

Additionally, even if/once we have "single source", target_feature seems inappropriate, since target_feature is for modifying the available instruction set within one target,

Indeed, makes sense.

Supporting that would require novel frontend integration (novel for Rust; clang already has something like this). For example, rustc would have to decide for each translation item whether it should be compiled for the host, for the device, or both -- and then combine the resulting PTX and host object files.

A custom attribute might be more appropriate but since this is really about "entry point for device code or not" rather than "can run on device or not" an ABI seems fine too.

I'd like to work on enabling this via a mixture of the ABI solution to choose, e.g., the ptx or ptx-kernel ABIs, and target_feature to choose, e.g., sm30 vs sm70.

If we could have these multiple ABIs in a single source file, we could have #[target_device] and #[target_device_kernel] procedural macros that just generate copies of a function for different ABIs:

#[inline]
fn baz() { }

#[target_device(host, nvptx(sm = "40"), spirv(version = "1.0"))]
fn bar(...) { ... }

#[target_device_kernel(host, nvptx(sm = "40"), spirv(version = "1.0"))]
fn foo(...) { 
  #[device] bar(...); // device attribute indicates that this fn is a device fn
  baz(); // this function will be used as is
}

that expand to:

fn bar_host(...) { ... }
#[target_feature(enabled = "sm40")]
extern "ptx" fn bar_nvptx(...) { ... }
extern "spirv" fn bar_spirv(...) { ... }

fn foo_host(...) { bar_host(...); baz(); }
#[target_feature(enabled = "sm40")]
extern "ptx-kernel" fn foo_nvptx(...) { bar_nvptx(...); baz(); }
extern "spirv-kernel" fn foo_spirv(...) { bar_spirv(...); baz(); }

And then just launch the kernels using another procedural macro, e.g., kernel_launch!(foo, args...).

One cool feature of clang and nvcc is that they allow whoever builds the library to easily choose the devices to target. Procedural macros could allow this via feature flags: --features target_device_nvptx_sm_35, nvptx_sm_70, ... (it's not nice, but it should at least be doable).

This approach leaves the door open to doing something nicer in the language in the future, while allowing libraries to experiment with better APIs. I wonder whether these two building blocks (extern ABIs in a single source file, and #[target_feature]) are a good way to do this, and whether this is something that a nicer approach is going to need in one form or another, or whether this approach is completely wrong and there is a better alternative.

@hanna-kruppe
Contributor

Frankly, I don't see how any of the tools we have in the language now (target_feature, proc macros, ABIs) can help at all with single source support. Right now, one crate is compiled for one target, period. Subsets of the crate can tweak some parts of the target (e.g., use non-standard ABIs or enable/disable instruction set extensions) but that's a far cry from slicing out a subset of the crate, compiling it for a completely different target, and then stitching the results back together -- and that's precisely what is necessary for single-source offloading (not just CUDA, but also everything else along these lines that I've seen).

In fact the assumption that one crate == one target goes as far as rustc being literally unable to store more than one target per compilation session (Session). Not to mention how you need to generate a whole separate LLVM module for the PTX code.

Even if one attempts to minimize the amount of compiler changes needed during prototyping for faster iteration (generally a good idea) by e.g. splitting the crate into two crates with an external tool and invoking rustc twice, there is ample room for compiler hacking. Even the bare minimum of single source support requires name resolution information, and being able to use generic library code will require type system integration as well.

So in my opinion, this is a rather big feature with at least as much need for experimentation and compiler hacking and design work as SIMD intrinsics. I say this not to discourage you but because your posts so far ignore the technical challenges that are, in my opinion, the biggest obstacle to single source support.

I'm also rather puzzled by the priorities here. Before experimenting with the best way to allow users to compile their single-source applications not just for multiple CUDA devices but also for entirely different targets like SPIR-V, basic features like an equivalent to __shared__ seem like a simpler and more important first step. Again, not trying to discourage, but even the greatest most ergonomic portable offloading solution seems pointless if the kernels can't even use group-shared memory.

@gnzlbg
Contributor

gnzlbg commented Feb 21, 2018

Frankly, I don't see how any of the tools we have in the language now (target_feature, proc macros, ABIs) can help at all with single source support.

Oh no, I think I expressed myself poorly. I meant that once we get single source support using extern ABIs, the combination of tools already available in the language can allow for pretty nice APIs.

basic features like an equivalent to shared seem like a simpler and more important first step.

I think that __shared__ can be implemented as a core::intrinsic and is thus not a big deal: fn shared() -> *mut u8 would do.

@hanna-kruppe
Contributor

I meant that once we get single source support using extern ABIs,

What does this mean? I am not aware of any plans for any kind of single source support. And what does "using extern ABIs" mean? It seems to presuppose some strategy for single source support, but it's not clear to me which one (and it doesn't sound like any of the strategies that I am aware of). Finally, assuming I'm correct that single source support is not on the horizon, I'm puzzled why we're hashing out details of how it could be exposed better to the user if the basic technical prerequisites aren't even on the horizon.

I think that shared can be implemented as a core::intrinsic and is thus not a big deal: fn shared() -> *mut u8 would do.

IIUC such an intrinsic would be basically like an alloca intrinsic, which Rust has rejected in favor of better support for unsized (DST) values. So while this is a possible strategy (although I can think of some technical challenges as well) that a prototype implementation might choose, it is far from clear to me that it's the approach we'd want to adopt.

@gnzlbg
Contributor

gnzlbg commented Feb 21, 2018

IIUC such an intrinsic would be basically like an alloca intrinsic

How so? The kernel does not allocate anything: __shared__ just initializes a pointer with a value.

Finally, assuming I'm correct that single source support is not on the horizon

I'd like to work on enabling single source support and I'd like to enable it in such a way that it is useful.

And what does "using extern ABIs" mean?

In a single source model:

extern "ptx-kernel" unsafe fn foo(...) { ... is compiled to a ptx kernel ... }
fn bar(...) { ... is compiled for the host ... }

@hanna-kruppe
Contributor

How so? The kernel does not allocate anything: shared just initializes a pointer with a value.

It's a storage specifier in C parlance. You declare variables to live in shared memory as opposed to thread-private memory, global memory, or constant memory. In pointer types it's just an optional hint that the pointee lives in shared memory; that aspect isn't even needed. What is absolutely necessary is being able to declare locals like __shared__ float As[BLOCK_SIZE][BLOCK_SIZE]; (from http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory), which is like let As: [[f32; BLOCK_SIZE]; BLOCK_SIZE]; except there's just one array per thread group, accessed by all threads in the group.

@gnzlbg
Contributor

gnzlbg commented Feb 21, 2018

It's a storage specifier in C parlance.

Sure, but what's the point of making it a storage specifier in Rust? You can't have two variables in shared memory; that is, the following is not valid CUDA C:

__global__ void foo(float* out) {
  extern __shared__ float a[];
  extern __shared__ float b[];  // ERROR: you can only have one pointer to shared memory per kernel
  out[0] = a[0] + b[0];
}

What is absolutely necessary is to be able to do declare locals like shared float As[BLOCK_SIZE][BLOCK_SIZE]; (from http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#shared-memory)

In particular, here you are not allocating a float[BLOCK_SIZE][BLOCK_SIZE] region of shared memory, but assigning the pointer to shared memory to a float[BLOCK_SIZE][BLOCK_SIZE] array (C arrays decay to pointers). This might lead to UB if no memory has actually been allocated, or if some other kernel reinterpreted it to have another type, or if less memory than required here was allocated, etc.

So IMO, independently of what the spec says, we should focus on the actual semantics of __shared__. The only thing __shared__ does is initialize some pointer to point to the shared memory region, with the restriction that one cannot initialize two pointers to it. Nothing more, nothing less. It does not allocate anything; it does not guarantee that any memory exists, or was allocated, or that the data in the memory has some "type"... no nothin'.

This

extern __shared__ float a[];

is just:

float* a = __get_ptr_to_shared_memory();

@hanna-kruppe
Contributor

I want to point out that we're pretty badly derailing this metabug. You should probably open a thread on internals.rlo if you want to take this discussion much further.

But first I would invite you to double check your facts. Many things you've said go against everything I've ever heard and seen about CUDA (and other offloading solutions, for that matter).

Sure, but what's the point of making it a storage specifier in Rust?

I'm not saying it should be a storage specifier in Rust. We don't even have such a concept at the moment. I'm just saying, it's fundamentally a variation on variable declaration (like DST locals), not something about pointers.

You can't have two variables in shared memory, that is, the following is not valid CUDA C:

???? Check the code I linked earlier; it has at least two such variables. Or pretty much any random non-trivial CUDA program using shared memory. I don't know why the code you give is rejected, but I've never even seen this syntax, so I'm not even sure what it means.

For all you know, As[0] will fail because no shared memory actually has been allocated.

Are you saying Nvidia's examples, as well as all the other programs using shared memory, are effectively borked? That can't be right. In fact, I'm pretty sure the size of the shared memory is recorded (provided it has a size -- again, I don't know wth extern __shared__ float a[]; is) in the binary, and the scheduler makes sure to only place as many thread groups on one core as the available shared memory permits. I know this because it means the amount of shared memory you use impacts occupancy and thus performance.

That reminds me, another way in which such an intrinsic for __shared__ would be weird is that it should probably require the size to be a compile-time constant (unlike alloca).

@gnzlbg
Contributor

gnzlbg commented Feb 21, 2018

but I've never even seen this syntax so I'm not even sure what it means.

That syntax is the dynamic shared memory allocation syntax.

That reminds me, another way in which such an intrinsic for shared would be weird is that it should probably require the size to be a compile-time constant (unlike alloca).

The size of the shared memory region can be specified at run-time, at least in CUDA.

Reading through the docs of the example you mention: if the sizes of the variables allocated in shared memory are compile-time constants, and no dynamic shared memory allocation occurs, it looks like one does not need to specify the memory to allocate during kernel launch, because the compiler does it for you (but I've always used dynamic shared memory, so I am not sure).

@hanna-kruppe
Contributor

Good to know that there's a dynamic allocation strategy as well. But it seems that you still specify the size, just at kernel invocation time? That would still allow the driver to make sure enough memory is available (i.e., as much memory as the kernel invocation specified; of course the kernel still needs to obtain and use that number correctly). For dynamic allocation an intrinsic might be good, but since static shared memory allocation seems extremely common, we'd probably want to support it as well, and an intrinsic can't really do that (well).

@gnzlbg
Contributor

gnzlbg commented Feb 21, 2018

But it seems that you still specify the size, just at kernel invocation time?

Yes.

That seems like it would still allow the driver to make sure enough memory is available (i.e., as much memory as the kernel invocation specified; of course the kernel still needs to obtain and use that number correctly).

Exactly. The typical way in which this is used is by passing something that correlates with the allocated size as a run-time argument to the kernel.

For dynamic allocation an intrinsic might be good, but since static shared memory allocation seems extremely common, we'd probably want to support it as well and an intrinsic can't really do that (well).

Yes definitely. Since dynamic shared memory is more powerful (it allows doing everything that can be done with static shared memory and more), has no drawbacks over static shared memory beyond ergonomics (shared memory is always allocated at run-time; whether the size is known at compile time or not is pretty much irrelevant), and can probably just be an nvptx intrinsic that returns a pointer to the shared memory region, I think that implementing support for it would be the tiniest incremental step that delivers the most value.

Adding support for __shared__ static memory would be a nice ergonomic addition. If you have device-only functions that need to be called from kernels, using static shared memory in them allows you to bump the size of the allocated memory region transparently (using dynamic shared memory these functions would need to take as argument a pointer into a suitable part of the shared memory region).

AFAIK only fixed-size arrays are allowed in static shared memory, and the memory must be uninitialized. So while something like let floats: #[shared] [f32; N] = mem::uninitialized(); might work, I think it would be better to just provide an nvptx::shared_array<[T; N]> type, implemented using compiler magic to always place it in static shared memory, that provides a minimal API that makes sense for a shared-memory array, since things like bounds checking by default (as provided by arrays) make little sense in device kernels where you might not have a way to panic, abort, print anything, etc.

@lilith

lilith commented Jul 28, 2018

Note that I'm interested in funding work on this: https://internals.rust-lang.org/t/nvptx-funding/7441

I'd like to get this to work out of the box on nightly.

@Jasper-Bekkers

Hi,

I've started using the NVPTX backend for some simple experiments. I'm listing my experiences here so far since I don't know what the proper protocol is; we can turn these into specific issues on the right repos later on.

  • Name mangling is sometimes incorrect, because the LLVM backend outputs names with . in some cases and PTX identifiers only accept [a-zA-Z0-9_$]+
pub struct MyStruct {
    data: u32,
}

impl PartialEq for MyStruct {
    fn eq(&self, other: &Self) -> bool {
        return self.data == other.data;
    }
}

Leads to invalid PTX since symbols are being generated with dots in them:


.visible .func  (.param .b32 func_retval0) _ZN59_$LT$nvkernel..MyStruct$u20$as$u20$core..cmp..PartialEq$GT$2eq17hebe67210677cfb7eE(
	.param .b64 _ZN59_$LT$nvkernel..MyStruct$u20$as$u20$core..cmp..PartialEq$GT$2eq17hebe67210677cfb7eE_param_0,
	.param .b64 _ZN59_$LT$nvkernel..MyStruct$u20$as$u20$core..cmp..PartialEq$GT$2eq17hebe67210677cfb7eE_param_1
)
{
	.reg .pred 	%p<2>;
	.reg .b32 	%r<4>;
	.reg .b64 	%rd<3>;

	ld.param.u64 	%rd1, [_ZN59_$LT$nvkernel..MyStruct$u20$as$u20$core..cmp..PartialEq$GT$2eq17hebe67210677cfb7eE_param_0];
	ld.u32 	%r1, [%rd1];
	ld.param.u64 	%rd2, [_ZN59_$LT$nvkernel..MyStruct$u20$as$u20$core..cmp..PartialEq$GT$2eq17hebe67210677cfb7eE_param_1];
	ld.u32 	%r2, [%rd2];
	setp.eq.s32 	%p1, %r1, %r2;
	selp.u32 	%r3, 1, 0, %p1;
	st.param.b32 	[func_retval0+0], %r3;
	ret;
}

I haven't looked much into it, but I have a feeling that it's due to https://github.com/rust-lang/rust/blob/master/src/librustc_codegen_utils/symbol_names.rs#L424

  • So far I've only been able to get this to compile and run with the nightly-2018-02-12 compiler (due to this PR: "Corrected auto trait syntax as previous became obsolete", japaric-archived/core64#4); anything newer seems to spew quite a few compile errors after I try to manually create my own core64 lib.
  • It would be nice to see if we can add a "kernel" attribute to expose some (or all) of these https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#performance-tuning-directives to tweak occupancy & register counts, similar to CUDA.
  • Find a way to expose vector loads & stores (ld.v4.f32 etc.) since they're significantly faster. One way would be through the platform intrinsics.
  • Expose __shared__ memory (as per the previous discussion), preferably as static kernel-side memory - though ideally both options would be available, since this too factors into kernel occupancy.
  • Expose __constant__ memory, similar to __shared__.
  • Expose cross-lane swizzle ops (shfl.bfly, shfl.up, etc.) as platform intrinsics.
  • Right now the target json determines the sm target machine and fixes it at sm_20, which is extremely dated (Fermi GPUs). It would be nice if this were more flexible - though for now patching the json works too.
  • I'm unsure whether repr(C) would be a good enough layout to share data between GPU and CPU, or if we'd eventually need another one.
  • It's not entirely clear to me whether __syncthreads is also a compiler barrier, but it ought to be, since the compiler shouldn't schedule memory requests across it.

On the bright side: this has been a really pleasant GPU programming experience so far (other than actually getting it set up), because it's extremely valuable to share the same codebase between CPU and GPU.

@bheisler

bheisler commented Aug 4, 2018

I have not been able to compile to PTX with cargo or xargo, I've only been able to do so using accel. Therefore, some of the following may be issues with Accel. That seems unlikely, so I'll report them here.

  • I see an internal compiler error if the PTX code references the core::intrinsics::sinf32 or cosf32 functions
  • INVALID_PTX error if I index into a slice (I saw the same error if I used get_unchecked) in the PTX code
  • INVALID_PTX error if I use the for x in 0..y loop in the PTX code
  • INVALID_PTX error if any struct in the PTX code implements core::ops traits like Add or Mul
  • INVALID_PTX error if any struct in the PTX code derives Debug or Display

Most of these are probably due to references to missing functions in the final PTX, but they'll need to be dealt with somehow.

I am interested in contributing to improve the state of GPGPU in Rust. Not sure where to start.

@termoshtt

Name mangling is sometimes incorrect
Leads to invalid PTX since symbols are being generated with dots in them:

I've met this issue while developing accel, and it prevents me from using libcore for the nvptx target. accel currently cannot link libcore or other std libraries.

I recently started writing a patch for rustc to enable the nvptx target.

I haven't looked much into it, but I have a feeling that it's due to https://github.com/rust-lang/rust/blob/master/src/librustc_codegen_utils/symbol_names.rs#L424

this seems to be good information for me :)

@bheisler

#53099 is relevant here. I haven't been able to compile any kernel recently, because of a segfault while compiling libcore. This happens on all machines where I've tried it (Windows and Linux). However, other people say that they've been able to compile NVPTX kernels without running into this issue. Has anyone else seen this?

@denzp
Contributor

denzp commented Aug 19, 2018

@bheisler I believe it's somehow related to the target definition JSON. It doesn't happen to me with the JSON from ptx-linker, but I saw the problem with another one.

@denzp
Contributor

denzp commented Aug 19, 2018

I'm finally proud to announce my progress on CUDA integration. I've made several tools to ease development, and I'm currently working on a tutorial and a high-level crate (it will probably be a custom rustc driver, because compiler plugins are de facto deprecated).

The first one is ptx-linker, which solves several important problems.

I started work on the linker about a year ago, and today it reached an important milestone: it doesn't depend on any external tools and libs anymore. So end users don't need to care about matching Rust's and the system's LLVM versions (which became a problem when Rust switched to LLVM 7.0).

The second crate is ptx-builder, which improves development convenience dramatically. It's a build.rs helper that ensures all the needed tools (xargo and ptx-linker atm) are present. It also manages the build environment and xargo runs.

Also worth checking is an incomplete tutorial about the CUDA development flow, with a more or less real example.
The tutorial evolved as the aforementioned tools did, and more chapters are yet to come :)
Sometimes I run tests from there to ensure the PTX assembly is correct and the whole thing still works :)

@termoshtt

@denzp Can ptx-linker link with libcore? I am creating a toolchain to link libcore using llvm-link in rust-accel/nvptx. Linking libcore causes the symbol name issue reported by @Jasper-Bekkers, due to the difference between GAS and PTX, and I avoid it by rewriting librustc_codegen_utils/symbol_names.rs.

@denzp
Contributor

denzp commented Aug 20, 2018

@termoshtt The linker is supposed to fix this; it has a special "pass" that does renaming. The problem can happen not only with libcore, though: consts or structs can also produce invalid PTX sometimes:

src_image.pixel(i, j)
call.uni (retval0), 
_ZN32_$LT$example..Image$LT$T$GT$$GT$5pixel17h81db5ad692bcf640E, 
(param0, param1, param2);

I found the linker robust enough at solving the issue. But still, I'd probably prefer this to be fixed in rustc.

@bheisler

@denzp - I don't think it's caused by the target JSON, unfortunately. When I add the obj-is-bitcode flag to my target file, it compiles without segfaulting, but then it no longer writes PTX files; instead it writes LLVM bitcode.

@denzp
Contributor

denzp commented Aug 21, 2018

@bheisler I can confirm that obj-is-bitcode indeed helps avoid the segfault (that's the reason I've never seen the problem before; I always use the flag). But it doesn't affect assembly file creation for me. Let's move further discussion into #53099.

@juchiast
Contributor

#38824 was closed about a month ago. I think we can remove it from the list?

bors added a commit that referenced this issue Feb 1, 2019
NVPTX target specification

This change adds a built-in `nvptx64-nvidia-cuda` GPGPU no-std target specification and basic PTX assembly smoke tests.

The approach taken here and the target spec are based on `ptx-linker`, a project started about 1.5 years ago. Key feature: bitcode object files are linked with LTO into the final module on the linker's side.

Prior to this change, the linker used the `ld` linker flavor, but I think having a special CLI convention is a more reliable way.

Questions about further progress on reliable CUDA workflow with Rust:
1. Is it possible to create a test suite `codegen-asm` to verify end-to-end integration with LLVM backend?
1. How would it be better to organise no-std `compile-fail` tests: add `#![no_std]` where possible and mark others as `ignore-nvptx` directive, or alternatively, introduce `compile-fail-no-std` test suite?
1. Can the `ptx-linker` eventually be integrated the way `rls` or `clippy` are? Hopefully, this should allow it to statically link against the LLVM used in Rust and get rid of the [current hacky solution](https://github.com/denzp/rustc-llvm-proxy).
1. Am I missing some methods from `rustc_codegen_ssa::back::linker::Linker` that can be useful for bitcode-only linking?

Currently, there are no major public CUDA projects written in Rust that I'm aware of, but I expect that having a built-in target will create a solid foundation for further experiments and awesome crates.

Related to #38789
Fixes #38787
Fixes #38786
@steveklabnik
Member

Triage: this is a metabug. Not aware of anything particular going on with this target lately.

@pnkfelix
Member

Visited for T-compiler backlog bonanza. It seems like there are some unresolved questions about scope and design, with respect to the concerns that were raised in this thread between @hanna-kruppe and @gnzlbg. (Basically, it's not clear to me whether the work remaining here is "just" more implementation and bug fixing, or if there's some design work that needs to be revisited.)

@rustbot label: +S-tracking-needs-summary

@rustbot rustbot added the S-tracking-needs-summary Status: It's hard to tell what's been done and what hasn't! Someone should do some investigation. label Mar 18, 2022
@OwenTrokeBillard

@kjetilkjeka Thank you for all your excellent recent work on Rust PTX support.

What, in your opinion, is required to bring Rust CUDA support to parity with C++? Is it even possible?

I'm curious to hear from others too.

@kjetilkjeka
Contributor

I think the most prominent issue with using CUDA from Rust is referenced by this issue: the lack of support for shared memory.

In the bright future, I think that Rust with CUDA could in some ways even surpass using CUDA from C++. To avoid adding a lot of discussion around what CUDA using Rust should and should not be in this issue, I think we should continue discussing it in the newly created zulip thread for gpgpu.

https://rust-lang.zulipchat.com/#narrow/stream/422870-t-compiler.2Fgpgpu-backend

If you start a thread there I'm happy to discuss it further 😄

@fmease fmease added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label May 12, 2024