Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

have cuda_library output RDC #125

Merged
merged 1 commit into from
Jul 13, 2023

Conversation

garymm
Copy link
Contributor

@garymm garymm commented Jul 6, 2023

This allows a cuda_library that was built with rdc=True to be
depended upon by another such library. This is convenient as such a
library can be consumed by either another cuda_library OR a
cc_library.

Note: if we have this, I'm not sure why we need a separate cuda_objects
rule anymore. Perhaps it can be removed in a later PR.

Change-Id: I1014d28a0ab3a9c76b788821211b13c4a9956d2a

@cloudhan
Copy link
Collaborator

This somehow allows cc_library to depend on an incomplete cuda_library, is there a way to avoid it?

@garymm
Copy link
Contributor Author

garymm commented Jul 11, 2023

This somehow allows cc_library to depend on an incomplete cuda_library, is there a way to avoid it?

Can you please elaborate? Or illustrate with an example? If rdc=True, the cuda_library will always perform a device link step, so it should be complete.

@@ -87,12 +91,19 @@ def _cuda_library_impl(ctx):
pic_lib = pic_libs,
objects = objects,
pic_objects = pic_objects,
rdc_objects = rdc_objects,
Copy link
Collaborator

@cloudhan cloudhan Jul 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you missing rdc_pic_objects here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -104,7 +115,7 @@ cuda_library = rule(
"hdrs": attr.label_list(allow_files = ALLOW_CUDA_HDRS),
"deps": attr.label_list(providers = [[CcInfo], [CudaInfo]]),
"alwayslink": attr.bool(default = False),
"rdc": attr.bool(default = False, doc = "whether to perform relocateable device code linking, otherwise, normal device link."),
"rdc": attr.bool(default = False, doc = "Whether to produce and consume relocateable device code. Transitive deps that contain device code must all either be cuda_objects or cuda_library that has rdc = True. If False, all device code must be in the same translation unit. May have performance implications. See https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#using-separate-compilation-in-cuda."),
Copy link
Collaborator

@cloudhan cloudhan Jul 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please break this line a little bit, way too long for it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@cloudhan
Copy link
Collaborator

If rdc=True, the cuda_library will always perform a device link step, so it should be complete.

Indeed.

The only limitation left is this cannot support circular object dependency due to the mandatory device link, which by itself is really stupid so I don't think user will need it generally.

@cloudhan
Copy link
Collaborator

Backlog for memo here:

Originally, it is modeled as

host code device code
cc_library for reusable cpu objects cuda_objects for reusable device objects
cc_binary for linked cpu executable cuda_library for linked device executable

For non-rdc and non-dlto case, the device executable is automatically linked already. rdc will enforce device_link then the compiler will ensure the device executable is linked.

This is something like cc_binary expose all objects for reusing with other cc_binarys. But this is not possible because the main entrypoint symbol conflict. For device case, the entrypoint is determined by a kernel launch, so this is not a problem.

Thus, allow cuda_library to expose the underlying objects for reusing between cuda_library but not cc_library will be harmless then.

This allows a `cuda_library` that was built with `rdc=True` to be
depended upon by another such library. This is convenient as such a
library can be consumed by either another `cuda_library` OR a
`cc_library`.

Note: if we have this, I'm not sure why we need a separate cuda_objects
rule anymore. Perhaps it can be removed in a later PR.

Change-Id: I1014d28a0ab3a9c76b788821211b13c4a9956d2a
Copy link
Collaborator

@cloudhan cloudhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the improvement.

@cloudhan cloudhan merged commit 555f88f into bazel-contrib:main Jul 13, 2023
12 checks passed
@garymm
Copy link
Contributor Author

garymm commented Jul 13, 2023

@cloudhan would you be OK with me deleting cuda_objects? Seems useless now.

@garymm garymm deleted the garymm/transitive-cuda branch July 13, 2023 15:56
@cloudhan
Copy link
Collaborator

No

@garymm
Copy link
Contributor Author

garymm commented Jul 14, 2023

Why?

cloudhan added a commit that referenced this pull request Aug 9, 2023
Compile the newly added example in #125 with `_wrapper_device_link` with rdc causes artifact name conflict.

```
 cuda_library( 
     name = "b", 
     srcs = ["b.cu"], 
     hdrs = ["b.cuh"], 
     rdc = True, 
 ) 
```

The reason is that our compile always compile to <target_output_dir>/_objs:

 cuda_library(name="b") --> artifact base name is `b` during `compile`

with `compile` for following srcs
"b.cu" ------------------> artifact `<target_output_dir>/_objs/b.rdc.o`
`fatbin.cu` ------rdc----> artifact `<target_output_dir>/_objs/b.rdc.o`

After this PR:
"b.cu" ------------------> artifact `<target_output_dir>/_objs/b.rdc.o`
`fatbin.cu` ------rdc----> artifact `<target_output_dir>/_objs/_dlink/b.rdc.o`
@cloudhan
Copy link
Collaborator

cloudhan commented Sep 13, 2023

OK, so when building a shared library, we inevitably -Wl,--whole-archive the static library. Let cuda_library do transitive device link is not harmful for the device code, but is not the case for interfacing with host code.

  1. the cuda_library can contain incomplete device object but can be consumed by the cc_* rules
  2. multiple cuda_library targets that form a chain will contain same symbols when doing --whole-archive linking.

So this will cause misusing easily and is unfixable in some condition. I will revert this PR.

@garymm
Copy link
Contributor Author

garymm commented Sep 13, 2023

Can you add an example that illustrates the problem?
I don't understand exactly how 1 or 2 can happen.

@cloudhan
Copy link
Collaborator

1 is not very important, because those imcomplete objects are not exposed by CcInfo. For 2

if use_rdc:
transitive_objects = depset(transitive = [dep[CudaInfo].rdc_objects for dep in attr.deps if CudaInfo in dep])
transitive_pic_objects = depset(transitive = [dep[CudaInfo].rdc_pic_objects for dep in attr.deps if CudaInfo in dep])
objects = depset(transitive = [objects, transitive_objects])
rdc_objects = objects
pic_objects = depset(transitive = [pic_objects, transitive_pic_objects])
rdc_pic_objects = pic_objects
dlink_object = depset([device_link(ctx, cuda_toolchain, cc_toolchain, objects, common, pic = False, rdc = use_rdc)])
dlink_pic_object = depset([device_link(ctx, cuda_toolchain, cc_toolchain, pic_objects, common, pic = True, rdc = use_rdc)])
objects = depset(transitive = [objects, dlink_object])
pic_objects = depset(transitive = [pic_objects, dlink_pic_object])

Lets consider the issue 162, build the shared library will cause

/usr/bin/ld.gold: error: bazel-out/k8-fastbuild/bin/tmp/_objs/mul/mul.rdc.pic.o: multiple definition of 'mul(double*, double const*, double const*, int)'
/usr/bin/ld.gold: bazel-out/k8-fastbuild/bin/tmp/_objs/mul/mul.rdc.pic.o: previous definition here
/usr/bin/ld.gold: error: bazel-out/k8-fastbuild/bin/tmp/_objs/mul/mul.rdc.pic.o: multiple definition of 'mul_impl(double*, double const*, double const*, int)'
/usr/bin/ld.gold: bazel-out/k8-fastbuild/bin/tmp/_objs/mul/mul.rdc.pic.o: previous definition here
/usr/bin/ld.gold: error: bazel-out/k8-fastbuild/bin/tmp/_objs/mul/mul.rdc.pic.o: multiple definition of '__fatbinwrap_4280babd_6_mul_cc_a09afee1'
/usr/bin/ld.gold: bazel-out/k8-fastbuild/bin/tmp/_objs/mul/mul.rdc.pic.o: previous definition here
/usr/bin/ld.gold: error: bazel-out/k8-fastbuild/bin/tmp/_objs/mul/mul.rdc.pic.o: multiple definition of '__device_stub__Z8mul_implPdPKdS1_i(double*, double const*, double const*, int)'
/usr/bin/ld.gold: bazel-out/k8-fastbuild/bin/tmp/_objs/mul/mul.rdc.pic.o: previous definition here
/usr/bin/ld.gold: error: bazel-out/k8-fastbuild/bin/tmp/_objs/mul/mul_dlink.rdc.pic.o: multiple definition of '__cudaRegisterLinkedBinary_4280babd_6_mul_cc_a09afee1'
/usr/bin/ld.gold: bazel-out/k8-fastbuild/bin/tmp/_objs/muladd/muladd_dlink.rdc.pic.o: previous definition here
| target | transitive_objects | objects (exposed via CcInfo) | rdc_objects (exposed via CudaInfo)
|--------|--------------------|------------------------------|------------------------------------
| mul    |                    | mul.o,mul_dlink.o            | mul.o
| muladd | mul.o              | muladd.o,mul.o,muladd_dlink.o| muladd.o,mul.o

And cc_shared_library(name = "t", deps = [":mul", ":muladd"]) is valid because from user side, they want symbols in target mul to be included in target t, so they whole archive link mul, and this applies to muladd respectively.

@cloudhan
Copy link
Collaborator

@garymm #164 (comment) This should address all the issue we are facing. Let me fix this.

@cloudhan cloudhan mentioned this pull request Sep 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants