Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] JUCX: client id support #7136

Closed
wants to merge 8 commits into from

Conversation

petro-rudenko
Copy link
Member

What

Follow up on #6859 for JUCX Spark tests

@abellina you can build ucx from this PR and use JUCX from SNAPSHOT of version 1.12.0-client-id-SNAPSHOT

…uest.

UCP: Apply review comments.

UCP: support backward compatibility.

UCP/WIREUP_CM: Fix void pointer error.

UCP/WIREUP: use sa_data + 1.

UCP/ADDRESS: unpack address and get it's size.

UCP/WIREUP_EP: initialize client_id.

AZP: Test binary compatibility. Use dpkg

UCP/EP: build ucx-1.10 in AZP.

AZP: enable examples.

AZP: Build ucp_client_server with master bits.

UCP/EP: Test binary compatibility between v1.11.x.

UCP/EP: Set address mode AM_LANE_ONLY when we cut address.

UCP/EP: Minor formatting.

UCP/EP: Update src/ucp/wireup/wireup_cm.c

Co-authored-by: dmitrygx <dmitrygla@nvidia.com>

UCP/EP: Update test/gtest/ucp/test_ucp_sockaddr.cc

Co-authored-by: dmitrygx <dmitrygla@nvidia.com>

UCP/EP: Update src/ucp/wireup/wireup_cm.c

Co-authored-by: dmitrygx <dmitrygla@nvidia.com>

UCP/EP: Update src/ucp/wireup/wireup_cm.c

Co-authored-by: dmitrygx <dmitrygla@nvidia.com>

UCP/EP: Update test/gtest/ucp/test_ucp_sockaddr.cc

Co-authored-by: dmitrygx <dmitrygla@nvidia.com>

UCP/EP: Update src/ucp/wireup/wireup_cm.c

Co-authored-by: dmitrygx <dmitrygla@nvidia.com>

UCP/EP: Update src/ucp/wireup/wireup_cm.c

Co-authored-by: dmitrygx <dmitrygla@nvidia.com>

UCP/EP: Update test/gtest/ucp/test_ucp_sockaddr.cc

Co-authored-by: dmitrygx <dmitrygla@nvidia.com>

UCP/EP: Apply review comments.
@@ -780,6 +774,11 @@ ucp_ep_create_api_conn_request(ucp_worker_h worker,
ucp_ep_h ep;
ucs_status_t status;

if (params->field_mask & UCP_EP_PARAM_FIELD_CLIENT_ID) {
ucs_error("client id is supported only for sockaddr connection establishment");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this mean that client id is not supported for RDMACM?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this ment to set client_id only for ep creation by sockaddr, neither by worker_addr nor by conn_request

@abellina
Copy link

abellina commented Aug 3, 2021

@petro-rudenko api makes sense so far, and I can see the clientId populated in the UcpConnectionRequest.

In order to reject a request, do we need to call something? Or just do nothing in the callback (poc patch: abellina/spark-rapids@f156f7b for reference)

I am not able to fully test this right now since I am getting an exception when creating the endpoint:

2021-08-03 07:10:42,159 INFO  UCX:57 - Got UcpListener request from /192.168.50.80:48544 for executorId 1
2021-08-03 07:10:42,167 ERROR UCX:94 - Exception caught in UCX progress thread. Continuing.
org.openucx.jucx.UcxException: Invalid parameter
  at org.openucx.jucx.ucp.UcpEndpoint.createEndpointNative(Native Method)
  at org.openucx.jucx.ucp.UcpEndpoint.<init>(UcpEndpoint.java:42)
  at org.openucx.jucx.ucp.UcpWorker.newEndpoint(UcpWorker.java:52)
  at com.nvidia.spark.rapids.shuffle.ucx.UCX$UcpEndpointManager.onConnectionRequest(UCX.scala:833)
  at org.openucx.jucx.ucp.UcpWorker.progressWorkerNative(Native Method)
  at org.openucx.jucx.ucp.UcpWorker.progress(UcpWorker.java:135)
  at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$5(UCX.scala:187)
  at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$5$adapted(UCX.scala:178)
  at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
  at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
  at com.nvidia.spark.rapids.shuffle.ucx.UCX.withResource(UCX.scala:69)
  at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$2(UCX.scala:178)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at com.nvidia.spark.rapids.GpuDeviceManager$$anon$1.$anonfun$newThread$1(GpuDeviceManager.scala:340)
  at java.lang.Thread.run(Thread.java:748)

And the UCX log says:

cp_ep.c:778  UCX  ERROR client id is supported only for sockaddr connection establishment

Which seems related to the comment above. I did not specify aUCX_SOCKADDR_TLS_PRIORITY setting, I simply have: UCX_TLS=cuda_copy,cuda_ipc,tcp. Is that expected?

Note I also tried with: UCX_SOCKADDR_TLS_PRIORITY=sockcm,tcp, and still get the same error.

@abellina
Copy link

Just updating here. We discussed offline and the issue I had was that I specified the client id both on new endpoints to a remote listener, and to endpoints created as a response of a connection request. If we don't set the client id on the connection request, this error goes away.

I know the API is a bit in flux, there was another issue with what looked like the client id sometimes not being set (i.e. a random high number instead of what I was setting it to). @petro-rudenko any idea on when the new API may be up?

@petro-rudenko
Copy link
Member Author

Closing in favor of #7523

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
WIP-DNM Work in progress / Do not review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants