Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCP/CORE/GTEST: Fix possible purge of requests after discarding is in-progress after error handling #7672

Merged

Conversation

dmitrygx
Copy link
Member

What

Fix possible purge of requests after discarding is in-progress after error handling.

Why ?

It fixes a new type of "iov data corruption" issue found by MTT/IO-Demo testing:

  • client and server exchange data
  • some network issues happen on some links, but some other links are still health
  • client detects an issue while some RNDV operations are in-progress and RTS for them was sent
  • ucp_ep_set_failed discards all UCT lanes (they operations are in-progress)
  • ucp_ep_set_failed invokes a user's error callback which closes the failed endpoint with mode=FORCE
  • ucp_ep_reqs_purge is called from ucp_ep_discard_lanes, since no lanes to discard
  • send operations are canceled and their invalidate data
  • server doesn't detect an issue and continues reading data using health links, but the data is invalid

How ?

  1. Save a pointer to ucp_ep_discard_lanes_t in the control extension of UCP endpoint.
  2. When starting discarding of UCT lanes, check whether discard_lanes was already allocated: if so - use it, otherwise - allocate a new one. So all discarding operations will use the same discard_lanes argument and UCP EP protected from being purged before all UCT lanes are closed.
  3. Write a test to test detecting UCP EP failure and FORCE_CLOSE during in-progress RNDV operation.

@dmitrygx dmitrygx force-pushed the topic/ucp/purge_reqs_after_fail_and_close branch from 2dc5069 to 7b19d38 Compare November 18, 2021 19:31
@yosefe
Copy link
Contributor

yosefe commented Nov 18, 2021

Maybe just don't call purge during ep_close if the ep is already failed?

@dmitrygx
Copy link
Member Author

Maybe just don't call purge during ep_close if the ep is already failed?

do you mean to not do invoke ucp_ep_discard_lanes() if UCP_EP_FLAG_FAILED set?

@dmitrygx
Copy link
Member Author

Maybe just don't call purge during ep_close if the ep is already failed?

do you mean to not do invoke ucp_ep_discard_lanes() if UCP_EP_FLAG_FAILED set?

if you mean to not invoke ucp_ep_reqs_purge() during ucp_ep_close_nbx()? but we don't do it directly. we invoke ucp_ep_reqs_purge() only as a part of ucp_ep_discard_lanes()'s completion flow.

@yosefe
Copy link
Contributor

yosefe commented Nov 18, 2021

if you mean to not invoke ucp_ep_reqs_purge() during ucp_ep_close_nbx()? but we don't do it directly. we invoke ucp_ep_reqs_purge() only as a part of ucp_ep_discard_lanes()'s completion flow.

Maybe discard with a different callback?

@dmitrygx
Copy link
Member Author

if you mean to not invoke ucp_ep_reqs_purge() during ucp_ep_close_nbx()? but we don't do it directly. we invoke ucp_ep_reqs_purge() only as a part of ucp_ep_discard_lanes()'s completion flow.

Maybe discard with a different callback?

maybe just not do discard in case of UCP_EP_FLAG_FAILED flag set? since, if this flag is set, it means that all lanes are already marked as failed ones - so, nothing to discard.
I've implemented this way - just to not rely on UCP_EP_FLAG_FAILED flag

@dmitrygx
Copy link
Member Author

@yosefe wdyt about the current solution pushed as the second commit?

@yosefe
Copy link
Contributor

yosefe commented Nov 19, 2021

3 line fix is definitely better :) i'll review the test

src/ucp/core/ucp_ep.c Show resolved Hide resolved
test/gtest/ucp/test_ucp_sockaddr.cc Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_sockaddr.cc Show resolved Hide resolved
@dmitrygx
Copy link
Member Author

@yosefe is it ok now?

EXPECT_FALSE(ucs_hlist_is_empty(&ucp_ep_ext_gen(ep)->proto_reqs));
ucp_tag_message_h message;
ucp_tag_recv_info_t info;
message = message_wait(receiver(), 0, 0, &info);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check if message is NULL

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch
done

@dmitrygx dmitrygx force-pushed the topic/ucp/purge_reqs_after_fail_and_close branch from 62a93bc to b2a935b Compare November 19, 2021 16:09
@dmitrygx dmitrygx merged commit ed03edc into openucx:master Nov 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants