Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kirchhoff cuda integrated and tested successfully #617

Open
wants to merge 12 commits into
base: dev
Choose a base branch
from

Conversation

Amnah2020
Copy link

No description provided.

@mrava87
Copy link
Collaborator

mrava87 commented Nov 2, 2024

@Amnah2020 thanks for your contribution 😄

I have started looking at your code focusing on the case when dynamic=False and I have some general comments/questions plus some specific one that I make directly on the line of code I am unsure about:

  • the case travsrcrec=False is still available for historical reasons (backward compatibility) but it is not really a good choice given that it creates a big burden on memory (storing a matrix of size nsnrnx*nz instead of 2 matrices of overall size (ns+nr)nxnz. I therefore reccomend that we do not include an option to allow travsrcrec=False in the cuda implementation.
  • more forward thinking comment: I think you have made a design choice that goes a bit against what we have so far in PyLops when it comes to numba-backed operators. You assume that the input arrays entering your _matvec_call/_rmatvec_call are always numpy arrays and they are converted internally into numba arrays (using the cuda.to_device). In all other methods, we instead assume that the input arrays have been already put on the gpu. So, why is your approach not idea? Take this Kirchhoff operator as example, this a chained operator composed of a main method (e.g., _trav_kirch_matvec) and a convolution operator; now, if you force the input and output to be numpy arrays, the convolve operator will not be able to leverage the extra speed one could get when using the GPU (i.e., applying it to a cupy array). I understand that in large applications one may not be able to put the entire x, trav_src, trav_rec, etc. onto the GPU upfront, and after some thinking I believe I have found an elegant solution. I suggest that we create a new operator called ToCupy which simply lifts a numpy array onto a cupy array in forward mode and does the opposite in adjoint mode. This way we could simply do:
Kop = Kirchhoff(
                z, x, t, sources, recs, v0, 
                wav.astype(dtype), 
                wavc, mode='analytic', 
                engine='cuda', dynamic=dynamic, 
                dtype=dtype)
Top = ToCupy(Kop.dims)
Top1 = ToCupy(Kop.dimsd)
Kop = Top1.H @ Kop @ Top

and get an operator that takes a numpy array as input, moves it to cupy, runs the entire Kirchhoff operator on GPU (including the convolution) and then moves the output back to numpy. This way, we could easily split sources into groups and do:

nsrop = 3 # number of source for each single operator
Kops = []
for isrc in range(0, ns, 3):
    Kop = Kirchhoff(z, x, t, sources[:, isrc:min(isrc + nsrop, ns)], recs, v0, 
                    wav.astype(dtype), 
                    wavc, mode='analytic', 
                    engine='cuda', dynamic=dynamic, 
                    dtype=dtype)
    Top = ToCupy(Kop.dims, dtype=dtype)
    Top1 = ToCupy(Kop.dimsd, dtype=dtype)
    Kop = Top1.H @ Kop @ Top
    Kops.append(Kop)
Kops = VStack(Kops)

Of course, this can be further wrapped into a pylops_mpi.MPIVStack to further distribute groups of sources to multiple MPI ranks...

I have documented most of these points here https://github.com/PyLops/pylops_notebooks/blob/master/developement-cupy/Kirchhoff.ipynb

Please reply to my questions before pushing any code, as I have made a significant amount of changes locally (the notebook will unlikely not work for you...), so based on your replies I can finalize them and push them before we continue finalizing the PR.

PS: once we agree on those, I'll look in more details at the case when dynamic=True and then we can merge the PR.

wrap = current_device.WARP_SIZE
multipr_count = current_device.MULTIPROCESSOR_COUNT
self.num_threads_per_blocks = (wrap, wrap)
self.num_blocks = (multipr_count, multipr_count)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unsure about this. For my GPU (RTX 3090), the output of this:

from numba import cuda

current_device = cuda.get_current_device()
current_device.WARP_SIZE, current_device.MULTIPROCESSOR_COUNT

gives (32,82). This means that indipendently on the number of sources and receivers, you always have a 82x82 grid with 32x32 threads, which in many cases will call many blocks that will never pass the if ns > isrc and nr > irec: condition. Am I getting this wrong?

Copy link
Author

@Amnah2020 Amnah2020 Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, each block always has a fixed number of threads, specifically (32, 32). The number of blocks, however, depends on the total number of threads required with a maximum of 82*82 grid. This implementation ensures that no block is allocated unless it will run at least one thread.

However, I need to fix the last else as it does not follow the same approach ..

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand your reply. My point is that you code will ALWAYS allocate the same number of blocks no matter the size of the problem.

Take this code:

 wrap = current_device.WARP_SIZE
multipr_count = current_device.MULTIPROCESSOR_COUNT
self.num_threads_per_blocks = (wrap, wrap)
self.num_blocks = (multipr_count, multipr_count)

on my GPU (RTX 3090), I get self.num_threads_per_blocks = (32, 32) and self.num_blocks = (82, 82). Now later, you do:

self._trav_kirch_matvec_cuda[self.num_blocks, self.num_threads_per_blocks](x_d, y_d, nsnr_d, nt_d, ni_d, itrav_d, travd_d)

this will ALWAYS call 82x82 blocks no matter how many sources or receivers you have in your problem. So say for 64 sources and 64 receivers, you will have the first 2x2 blocks doing something and all the other running but just never passing this condition if ns > isrc and nr > irec:.

In the new code, i changed this to:

self.num_blocks = ((self.ns + self.num_threads_per_blocks[0] - 1
                                ) // self.num_threads_per_blocks[0], 
                                (self.nr + self.num_threads_per_blocks[1] - 1
                                ) // self.num_threads_per_blocks[1])

which is actually similar to what you had done for case 4. Here we don't launch any block that will not contribute to the final solution

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for correcting the kernel launch, as I did not follow the same approach as in the other cases.

I missed this part and ended up copying a lot of code because I didn't fork the repo before starting to make changes.

@staticmethod
@cuda.jit
def _travsrcrec_kirch_rmatvec_cuda(x, y, ns, nr, nt, ni, dt, trav_srcs, trav_recs):
irec, isrc = cuda.grid(2)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this is swapped compared to line 136?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for noticing! The tests have passed for a uniform number of sources and receivers.

However, I anticipate that if we vary these numbers, the kernel launch and setup may not adequately cover both. So, this line should remain consistent across all 2D grid launch calls and setups. I’ll make sure to enhance it accordingly.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, great! A good reccomandation is to never test codes with forced symmetries as many bugs could pass unnoticed (i sometimes make this mistake myself 😆 )

itrav_d, travd_d)

if not self.dynamic:
cuda.synchronize()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed since for not self.dynamic you use only the default stream?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the dynamic case, for each stream, I need to synchronize it with the main stream in a looped sequence. This synchronization happens within self._process_streams().

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but you have if not self.dynamic: so this will never be run in the dynamic case anyways?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct

@mrava87 mrava87 requested a review from cako November 2, 2024 19:46
@mrava87
Copy link
Collaborator

mrava87 commented Nov 2, 2024

@cako (aka our gpu guru): if you have time, it would be good if you can also have a quick look into this 😄

@mrava87
Copy link
Collaborator

mrava87 commented Nov 7, 2024

@Amnah2020 checking you have seen my review. After you have been able to answer my questions, I will push some minor code modifications and we will be able to progress to the second part of the PR.

@Amnah2020
Copy link
Author

Amnah2020 commented Nov 8, 2024

  • I will remove the corresponding section for traverscrec=False.

  • First, I want to clarify the purpose of cuda.to_device. This function allocates memory on the GPU and transfers a NumPy array to it. While I assume the input will always be a NumPy array, cuda.to_device handles device/host communication, primarily by maintaining a pointer to the data location on the GPU.

    • If we decide to transfer all arrays in advance, it would likely follow a similar implementation strategy to the _data_prep_dynamic function.
    • However, this approach may not apply to the results of intermediate forward and adjoint operator functions.
  • I’m not entirely sure what ToCupy does, and I'm also unfamiliar with standard CUDA CuPy functions. Previously, I used custom C++ CuPy functions, which follow a similar strategy to my implementation with Numba CUDA. From what you’ve shared, and based on some research, CuPy enables automatic conversion between NumPy and CuPy arrays, introducing some overhead. In contrast, Numba CUDA requires explicit memory management.

  • Alternatively, depending on what ToCupy does, cuda.to_device() can accept CuPy arrays as well as NumPy arrays. This suggests that no data transfer would occur; only a pointer transfer (I assume). If we adopt this design, I’ll need a deeper understanding of how memory management works between the two libraries, though it could introduce additional dependencies to maintain.

Maybe you could share the ToCupy function, and I can explore different strategies to leverage it. The goal would be to achieve a similar high-level call structure for full processing of Kirchhoff with multiple sources at once in the device.

@mrava87
Copy link
Collaborator

mrava87 commented Nov 8, 2024

Thanks @Amnah2020, let me push the modifications to the code I already made (later today) so:

  • I will remove the corresponding section for traverscrec=False.

This will be already gone

  • First, I want to clarify the purpose of cuda.to_device. This function allocates memory on the GPU and transfers a NumPy array to it. While I assume the input will always be a NumPy array, cuda.to_device handles device/host communication, primarily by maintaining a pointer to the data location on the GPU.

I see your point and I am familiar with the cuda.to_device method of numba, but because of this https://docs.cupy.dev/en/stable/user_guide/interoperability.html#numba, this is not needed if one has already arrays on GPU in cupy type. Numba/cupy handle everything at their interface with a zero-copy operation, so the best of both worlds 😄 It is true that sometimes you may not have already the array on the GPU, but this is where the ToCupy operator comes in, I personally prefer the user to choose and not doing things hidden internally. One issue with your approach is that for chained operators, like Kirchooff, your final operation will move the array back to numpy and the Convolve1d pass will be much slower....

  • If we decide to transfer all arrays in advance, it would likely follow a similar implementation strategy to the _data_prep_dynamic function.

Let's discuss this after i push the updated cod.

  • However, this approach may not apply to the results of intermediate forward and adjoint operator functions.

Not sure what you mean?

  • I’m not entirely sure what ToCupy does, and I'm also unfamiliar with standard CUDA CuPy functions. Previously, I used custom C++ CuPy functions, which follow a similar strategy to my implementation with Numba CUDA. From what you’ve shared, and based on some research, CuPy enables automatic conversion between NumPy and CuPy arrays, introducing some overhead. In contrast, Numba CUDA requires explicit memory management.

No. You can do the same in Cupy or Numba. In either case if the array is on the CPU. In our experience, interoperating numba and cupy does not come with any overhead (and this is how PyLops work). We prefer using cupy for everything GPU related other than custom kernels (which is true, are easier to write in numba-cuda than cupy custom kernels), and I would like to make sure this new code follows the same philosophy to avoid creating too many different coding patterns without a real reason.

  • Alternatively, depending on what ToCupy does, cuda.to_device() can accept CuPy arrays as well as NumPy arrays. This suggests that no data transfer would occur; only a pointer transfer (I assume). If we adopt this design, I’ll need a deeper understanding of how memory management works between the two libraries, though it could introduce additional dependencies to maintain.

Ok, interesting. My understanding, from the link above, was that cuda.to_device() can be completely omitted if the array is already a cupy array... let's see what @cako thinks. And no new dependencies, cupy is already heavily used in pylops.

Maybe you could share the ToCupy function, and I can explore different strategies to leverage it. The goal would be to achieve a similar high-level call structure for full processing of Kirchhoff with multiple sources at once in the device.

Doing soon :)

@mrava87
Copy link
Collaborator

mrava87 commented Nov 8, 2024

@Amnah2020 codes and notebook updated.

I am also going to open a new Issue explaining in more details the philosophy behind ToCupy, as this is just a use case, but I think many other problems (if implemented on GPU) would require something like this to scale properly. Whilst there I talk about this in general terms, I attach here a figure showing how the current implementation falls short when you chain operators and how the new one can solve this issue.

Screen Shot 2024-11-08 at 2 31 22 PM

@mrava87 mrava87 mentioned this pull request Nov 8, 2024
@cako
Copy link
Collaborator

cako commented Nov 13, 2024

Sorry, pretty busy at work, haven't been able to look at this yet :(

@mrava87
Copy link
Collaborator

mrava87 commented Nov 13, 2024

Sorry, pretty busy at work, haven't been able to look at this yet :(

No rush, we can wait for you ;)

Comment on lines +49 to +62
streams = [cuda.stream() for _ in range(num_streams)]
sources_per_stream = num_sources // num_streams
remainder = num_sources % num_streams
source_counter = 0
self.sources_per_streams = {}
for i, stream in enumerate(streams):
num_sources_for_stream = sources_per_stream + (
1 if i < remainder else 0
)
sources_for_stream = list(
range(source_counter, source_counter + num_sources_for_stream)
)
self.sources_per_streams[stream] = sources_for_stream
source_counter += num_sources_for_stream
Copy link
Collaborator

@cako cako Nov 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can substitute all this by:

# Split sources into different processing streams
self.sources_per_streams = {
    cuda.stream(): sources
    for sources in np.array_split(range(self.ns), self.nstreams)
}
# Check streams are unique
assert (
    len(set(s.handle.value for s in self.sources_per_streams.keys()))
    == self.nstreams
)

Comment on lines +116 to +142
aperturetap_d = cuda.to_device(aperturetap)
vel_d = cuda.to_device(vel)
six_d = cuda.to_device(six)
rix_d = cuda.to_device(rix)
self.const_inputs = (
ns_d,
nr_d,
nt_d,
ni_d,
nz_d,
dt_d,
aperturemin_d,
aperturemax_d,
angleaperturemin_d,
angleaperturemax_d,
vel_d,
aperturetap_d,
six_d,
rix_d,
)

self.trav_recs_d_global = cuda.to_device(trav_recs)
self.angles_recs_d_global = cuda.to_device(angle_recs)
self.trav_srcs_d_global = cuda.to_device(trav_srcs)
self.angles_srcs_d_global = cuda.to_device(angle_srcs)
self.amp_srcs_d_global = cuda.to_device(amp_srcs)
self.amp_recs_d_global = cuda.to_device(amp_recs)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how much overhead these to_device are adding, but if you'd like to make this more efficient, you can queue the memory transfers per stream and use pinned arrays to ensure async. Here's a snippet (untested):

streams = list(self.sources_per_streams.keys())
n = len(streams)
with cuda.pinned(
    aperturetap,
    vel,
    six,
    rix,
    trav_recs,
    angle_recs,
    trav_srcs,
    angle_srcs,
    amp_srcs,
    amp_recs,
):
    aperturetap_d = cuda.to_device(aperturetap, stream=streams[0 % n])
    vel_d = cuda.to_device(vel, stream=streams[1 % n])
    six_d = cuda.to_device(six, stream=streams[2 % n])
    rix_d = cuda.to_device(rix, stream=streams[3 % n])
    self.trav_recs_d_global = cuda.to_device(trav_recs, stream=streams[4 % n])
    self.angles_recs_d_global = cuda.to_device(
        angle_recs, stream=streams[5 % n]
    )
    self.trav_srcs_d_global = cuda.to_device(trav_srcs, stream=streams[6 % n])
    self.angles_srcs_d_global = cuda.to_device(
        angle_srcs, stream=streams[7 % n]
    )
    self.amp_srcs_d_global = cuda.to_device(amp_srcs, stream=streams[8 % n])
    self.amp_recs_d_global = cuda.to_device(amp_recs, stream=streams[9 % n])
cuda.synchronize()

Comment on lines +218 to +220
y = np.zeros((self.ns * self.nr, self.nt))
elif opt == "_rmatvec":
y = np.zeros(self.ni)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe these should be zeros(dtype=np.float32)?

Comment on lines +225 to +227
x_d = cuda.to_device(x, stream=stream)
y_d_dict[stream] = cuda.to_device(y, stream=stream)
isrc_list_d_dict[stream] = cuda.to_device(isrc_list, stream=stream)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be pinned to allow async:

with cuda.pinned(x, y, isrc_list):
    x_d = cuda.to_device(x, stream=stream)
    y_d_dict[stream] = cuda.to_device(y, stream=stream)
    isrc_list_d_dict[stream] = cuda.to_device(isrc_list, stream=stream)

Comment on lines +225 to +226
x_d = cuda.to_device(x, stream=stream)
y_d_dict[stream] = cuda.to_device(y, stream=stream)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are x and y being copied multiple times per stream? Not sure they should...

Comment on lines +266 to +267
for stream in self.sources_per_streams.keys():
stream.synchronize()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure this can be substituted with cuda.synchronize, but that would also sync the default stream.

Comment on lines +218 to +220
y = np.zeros((self.ns * self.nr, self.nt))
elif opt == "_rmatvec":
y = np.zeros(self.ni)
Copy link
Collaborator

@cako cako Nov 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably want to create pinned memory y arrays to facilitate copy to host. Example:

y = cuda.pinned_array((self.ns * self.nr, self.nt), dtype=np.float32)

stream.synchronize()
y_streams = []
for stream, y_dev in y_d_dict.items():
y_streams.append(y_dev.copy_to_host(stream=stream))
Copy link
Collaborator

@cako cako Nov 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When copying to host, it's better to provide a target array which is pinned. Example:

for stream, y_dev in y_d_dict.items():
    ary = cuda.pinned_array_like(y_dev)
    y_dev.copy_to_host(ary, stream=stream)
    y_streams.append(ary)

Comment on lines +375 to +377
aptap = (
aptap
* aperturetap[
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aptap *= ...

Several lines have the same pattern

@cako
Copy link
Collaborator

cako commented Nov 17, 2024

Hey guys, great work! I left some comments, mostly around memory transfer assuming that to_device will be used.

I guess the question is, should CuPy or raw Numba arrays be used? To make sure we're all on the same page, for Numba kernels, it makes no difference. Numba and CuPy arrays implement the CUDA Array Interface so you can just call a cuda-jitted kernel directly on them. Same thing goes for PyTorch (not Jax, I think because they only support the v2 of the interface, not the latest v3).

The difference comes from how memory is managed. With CuPy, memory is managed automatically via Memory Pool. The user requests memory and the memory pool decides when and how to allocate it. Versus Numba which leaves memory allocation entirely on the hands of the user. For higher level libraries, managed memory allocators have historically been favored. CuPy has its own, PyTorch, and RAPIDS. They make it simpler to deal with memory, generally offer better performance and allow the user to set global memory limits (useful when sharing resources).

Personally, I feel like PyLops operators should do as little magic as possible internally, and it should be up to the user to manage memory. Of course, this cannot be done to the extreme, as there may be intermediate steps where memory allocation is required.

In the case of the Kirchhoff operator, the behavior that would make more sense to me is the following: when selecting cuda backend, all input array must be consumable by Numba kernels. They can be CuPy, PyTorch, Numba, etc, and possibly a combination of them. The user should be responsible for allocating them. The return arrays IMO should be CuPy, ensuring that they use the currently active memory pool which is already available to most workflows (since we use CuPy extensively).

@mrava87
Copy link
Collaborator

mrava87 commented Dec 5, 2024

Thanks @cako!

I was waiting for @Amnah2020 to reply, but I'll go ahead and provide my feedback on your comments:

  • General comment:

Hey guys, great work! I left some comments, mostly around memory transfer assuming that to_device will be used.

I guess the question is, should CuPy or raw Numba arrays be used? To make sure we're all on the same page, for Numba kernels, it makes no difference. Numba and CuPy arrays implement the CUDA Array Interface so you can just call a cuda-jitted kernel directly on them. Same thing goes for PyTorch (not Jax, I think because they only support the v2 of the interface, not the latest v3).

This is my understanding as well from other projects. If you have a cupy array and pass it to a numba-cuda custom kernel call there will be no-copy 'data transfer' between the two interfaces thanks to the CUDA Array Interface, so the user should not do any explicit call to move a cupy array into a numba one. So when we write a kernel that assumes the input to be already provided as cupy array we should not do any data movement - this is already the case in _call_dynamic for example 😄

The difference comes from how memory is managed. With CuPy, memory is managed automatically via Memory Pool. The user requests memory and the memory pool decides when and how to allocate it. Versus Numba which leaves memory allocation entirely on the hands of the user. For higher level libraries, managed memory allocators have historically been favored. CuPy has its own, PyTorch, and RAPIDS. They make it simpler to deal with memory, generally offer better performance and allow the user to set global memory limits (useful when sharing resources).

Makes sense to me, and indeed I would favour managed memory allocation if/when that does not bring any downside as it simplifies the user experience.

Personally, I feel like PyLops operators should do as little magic as possible internally, and it should be up to the user to manage memory. Of course, this cannot be done to the extreme, as there may be intermediate steps where memory allocation is required.

Agree.

In the case of the Kirchhoff operator, the behavior that would make more sense to me is the following: when selecting cuda backend, all input array must be consumable by Numba kernels. They can be CuPy, PyTorch, Numba, etc, and possibly a combination of them. The user should be responsible for allocating them. The return arrays IMO should be CuPy, ensuring that they use the currently active memory pool which is already available to most workflows (since we use CuPy extensively).

I also agree with this in general, but with a caveat. Let's take the non-dynamic codes first. The way I have adapted them in the PR fully embrace this philosophy. When the problem is small / GPU VRAM is large enough, the user can lift x/y to the GPU and then pass them to the matvec/rmatvec calls. And this is pretty much in line with any other PyLops operator GPU implementation. In general, we also give user the responsability to pass any additional parameter already as CuPy array (eg wavelet). However, in this specific case, the __init__ method may compute the traveltime tables (if not explicitely provided) and there I find it a bit akward to ask the user to do something like this:

Kop = Kirchhoff(...)
Kop.trav_srcs = cp.asarray(trav_srcs)
Kop.trav_recs = cp.asarray(trav_recs)

So line 559 trav_srcs_d = to_cupy(inputs[7]), which I acknowledge does not falls into the category little magic as possible internally... what do you think?

Now to the dynamic case. The way the code is currently set up departs a bit from the way the non-dynamic code is set up as it uses streams... I think the main motivation here is that since you have more tables (traveltimes, amplitudes, angles), it is likely that even in small examples one cannot just lift all of them in one go to the GPU VRAM, and splitting them with stream could/should allow some overlap between data movements and computations... now my question to you: do you see this as a viable solution to the point we accept to do some magic internally or not?

My dream (not sure how to achieve this yet) would however be to lift this one level up, perhaps inside the ToCupy operator, so that one could run multiple GPU operators in a VStack/BlockDiag on different streams instead of embedding that logic in each single operator... but I think we can be pragmatic and finalize this PR as is and then if we achieve the dream solution, we can adapt the dynamic code to use it.

  • Specific comments: they all make sense to me and some seem to suggest that we can simplify some of the codebase a lot.. I'll leave it to Amnah to look at them and incorporate them before we can merge this PR.

PS: I removed all codes that I had added for the ToCupy operator as this has been already merged in a different PR... and didn't really need to be here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants