prov/efa: Fix bugs in domain queue lists iteration
When hitting EAGAIN, we should continue instead of
break, because the domain queue lists can now contain
opes for other eps, which may not be out of resources.

When hitting any other error, we should still move on
to the next entry after writing the txe/rxe error,
instead of returning.

Signed-off-by: Shi Jin <sjina@amazon.com>
shijin-aws committed Jun 24, 2024
1 parent 3cfc0bb commit 433256a
Showing 1 changed file with 3 additions and 3 deletions.
6 changes: 3 additions & 3 deletions prov/efa/src/efa_domain.c
@@ -468,14 +468,14 @@ void efa_domain_progress_rdm_peers_and_queues(struct efa_domain *domain)

 		ret = efa_rdm_ep_post_handshake(peer->ep, peer);
 		if (ret == -FI_EAGAIN)
-			break;
+			continue;

 		if (OFI_UNLIKELY(ret)) {
 			EFA_WARN(FI_LOG_EP_CTRL,
 				"Failed to post HANDSHAKE to peer %ld: %s\n",
 				peer->efa_fiaddr, fi_strerror(-ret));
 			efa_base_ep_write_eq_error(&peer->ep->base_ep, -ret, FI_EFA_ERR_PEER_HANDSHAKE);
-			return;
+			continue;
 		}

 		dlist_remove(&peer->handshake_queued_entry);
@@ -600,7 +600,7 @@ void efa_domain_progress_rdm_peers_and_queues(struct efa_domain *domain)
 				break;

 			efa_rdm_txe_handle_error(ope, -ret, FI_EFA_ERR_PKT_POST);
-			return;
+			continue;
 		}
 	}
 }
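
The reasoning in the commit message can be illustrated with a small standalone sketch. It is not libfabric code: `demo_ep`, `demo_entry`, `demo_post`, `demo_progress`, and `demo_run` are hypothetical names, a plain array stands in for the dlist, and a per-endpoint credit counter stands in for whatever resource causes EAGAIN. It only shows why stopping at the first EAGAIN starves entries that belong to other endpoints once the queue is shared.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from libfabric. */
struct demo_ep {
	int tx_credits;		/* remaining send resources for this endpoint */
};

struct demo_entry {
	struct demo_ep *ep;	/* endpoint that owns this queued operation */
	bool queued;		/* still waiting on the shared queue? */
};

/* Post one entry; fails with -EAGAIN when its own endpoint has no credits. */
static int demo_post(struct demo_entry *e)
{
	if (e->ep->tx_credits == 0)
		return -EAGAIN;
	e->ep->tx_credits--;
	return 0;
}

/*
 * Walk a queue shared by several endpoints. With stop_on_eagain set, the
 * loop mimics the old `break`; otherwise it mimics the new `continue`.
 * Returns the number of entries that were posted and dequeued.
 */
static int demo_progress(struct demo_entry *q, size_t n, bool stop_on_eagain)
{
	int posted = 0;

	for (size_t i = 0; i < n; i++) {
		if (!q[i].queued)
			continue;

		int ret = demo_post(&q[i]);
		if (ret == -EAGAIN) {
			if (stop_on_eagain)
				break;		/* later entries never get a chance */
			continue;		/* try entries of other endpoints */
		}

		q[i].queued = false;
		posted++;
	}
	return posted;
}

/* One exhausted endpoint at the head of the queue, one healthy endpoint behind it. */
int demo_run(bool stop_on_eagain)
{
	struct demo_ep busy = { .tx_credits = 0 };
	struct demo_ep idle = { .tx_credits = 2 };
	struct demo_entry q[] = {
		{ &busy, true }, { &idle, true }, { &idle, true },
	};

	return demo_progress(q, sizeof(q) / sizeof(q[0]), stop_on_eagain);
}
```

In this setup, `demo_run(true)` posts nothing because the first entry belongs to the exhausted endpoint, while `demo_run(false)` still posts both entries of the other endpoint.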