prov/util: Revert ofi_genlock changes completely #9442

Closed

Conversation

a-szegel
Contributor

Completely reverting the ofi_genlock changes from main saves us 20ns in fi_rdm_tagged_pingpong for the SHM provider.

instance: aws c5n.18xlarge
os: al2
client: fi_rdm_tagged_pingpong --pin-core 1 -S 1 -s 127.0.0.1 -p shm 127.0.0.1
server: fi_rdm_tagged_pingpong --pin-core 0 -S 1 -s 127.0.0.1 -p shm

main @ 8db8594 (has genlocks that are always mutex): 0.48us
git revert 64b31be (my changes to have genlocks with no mutex): 0.45us
git revert 96f35fa 3670360 1c4e8ea 0d39856 (ofi_mutex directly): 0.46us

@a-szegel a-szegel requested review from shefty, aingerson and a team October 14, 2023 00:04
@a-szegel a-szegel changed the title prov/util: Revert ofigenlock changes completely prov/util: Revert ofi_genlock changes completely Oct 14, 2023
Contributor

@aingerson aingerson left a comment

I'm not convinced the genlock implementation is the problem here. The biggest thing I see in this set of changes is that you change the ep lock to a mutex instead of optimizing for the no-op. I could definitely see that causing a regression. I think the ep_list_locks are the only ones that need to be a mutex (until we revisit the connection management issue Sean brought up). Perhaps that change got added by mistake in the ep_list_lock patch.

Not related to the code itself but when you revert a commit, remember to:

  1. Sign all your revert commits so the DCO check passes
  2. Add a commit message to each one explaining why you are reverting

Comment on lines +250 to +252
ret = ofi_genlock_init(&ep->lock,
ep->domain->threading != FI_THREAD_SAFE ?
OFI_LOCK_NOOP : OFI_LOCK_MUTEX);
Contributor

I may be misreading this (and it could have been an issue with the original commit), but the commit was, I think, supposed to target the ep_list_locks; this is the ep->lock itself, which I think is a separate issue. I agree with changing the ep lock to a noop/mutex check, but I don't think it belongs in this patch.

Contributor Author

I think it is clearer if you look at the files changed tab instead of commit by commit.

Comment on lines 555 to 557
ofi_genlock_init(&av->ep_list_lock,
domain->threading == FI_THREAD_DOMAIN ?
OFI_LOCK_NOOP : OFI_LOCK_MUTEX);
Contributor

The original commit says it was fixing the API per Sean's concern about the ep_list_lock needing to be thread safe regardless of the threading level. I don't think reverting that fix in the av and cq is the proper answer here.

Contributor Author

I think it is clearer if you look at the files changed tab instead of commit by commit.

@@ -52,15 +52,15 @@ static const char *rxd_cq_strerror(struct fid_cq *cq_fid, int prov_errno,

cq = container_of(cq_fid, struct rxd_cq, util_cq.cq_fid);

-	ofi_genlock_lock(&cq->util_cq.ep_list_lock);
+	ofi_mutex_lock(&cq->util_cq.ep_list_lock);
Contributor

The genlock vs mutex implementation is exactly the same. Switching back to the mutex lock shouldn't impact performance. I'm not convinced this is the source of the regression you're seeing.

Member

I question whether the test results are reproducible enough to determine a 20ns difference. The mutex calls go directly to pthread calls; the genlock calls reach those indirectly via a function pointer. The point of genlock is that it allows disabling the lock at runtime, which isn't possible when calling the mutex directly.
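For illustration, a minimal sketch of the indirection being described (this is not libfabric's actual definition; the struct layout and names here are assumptions): genlock dispatches lock/unlock through function pointers selected at init time, which is what makes the runtime no-op possible, whereas ofi_mutex maps straight onto the pthread calls.

#include <pthread.h>

/* Hypothetical, simplified genlock; the real ofi_genlock differs in detail. */
enum lock_type { LOCK_NOOP, LOCK_MUTEX };

struct genlock {
	pthread_mutex_t mutex;
	void (*lock)(struct genlock *l);
	void (*unlock)(struct genlock *l);
};

static void noop_lock(struct genlock *l)    { (void) l; }
static void noop_unlock(struct genlock *l)  { (void) l; }
static void mutex_lock(struct genlock *l)   { pthread_mutex_lock(&l->mutex); }
static void mutex_unlock(struct genlock *l) { pthread_mutex_unlock(&l->mutex); }

static int genlock_init(struct genlock *l, enum lock_type type)
{
	if (type == LOCK_NOOP) {
		/* lock/unlock become empty functions: zero work at runtime */
		l->lock = noop_lock;
		l->unlock = noop_unlock;
		return 0;
	}
	l->lock = mutex_lock;
	l->unlock = mutex_unlock;
	return pthread_mutex_init(&l->mutex, NULL);
}

/* Every call site pays an indirect call:
 *     l->lock(l);   ... critical section ...   l->unlock(l);
 * while ofi_mutex_lock() is a direct pthread_mutex_lock(). */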

Contributor Author

@a-szegel a-szegel Oct 17, 2023

Please reproduce the testing on your end. In the Slack channel, I linked my run command for the reproducer (and it was also done on an Intel Skylake instance).

Contributor Author

From my test results:

Bisect 2: 84a6bde5e -> main (for SHM)

84a6bde5e: 0.46
99f87709c: 0.46
<Seth's OFI genlock changes in here>
318b3f49f: 0.45
55f7fd1ad: 0.45
6b62f34ea: 0.45
90baec9d3: 0.45
803af8801: 0.45
64b31bea3: 0.48
967f91458: 0.48
52a4364d9: 0.48
422241636: 0.48
main: 0.48

Running this branch on top of main brings us from 0.48 to 0.46.

@@ -60,7 +60,7 @@ static int rxd_cntr_wait(struct fid_cntr *cntr_fid, uint64_t threshold, int time
return -FI_ETIMEDOUT;

ep_retry = -1;
-	ofi_genlock_lock(&cntr->ep_list_lock);
+	ofi_mutex_lock(&cntr->ep_list_lock);
Contributor

Same thing here: the genlock and the mutex are the same underneath, so I don't think this should affect performance. We should take a closer look at it.

Contributor Author

They are not the same thing, and they do impact performance.

@a-szegel
Contributor Author

a-szegel commented Oct 17, 2023

Before this patch set, main always used ofi_mutex locks (perf 0.46us). The first patch set changed a number of ofi_mutex locks to ofi_genlock and decided whether or not to take the mutex based on the domain lock level (perf 0.45us; less locking makes us faster). The second patch set switched the ofi_genlock to always default to an ofi_mutex lock because @shefty realized we were violating a threading case for msg endpoints in the API (perf 0.48us; a regression).

This PR blows both patch set 1 and patch set 2 away. It makes us default to ofi_mutex and brings us from the current 0.48us back to 0.46us, because an ofi_genlock hard-coded to ofi_mutex adds non-negligible overhead compared to using ofi_mutex directly.
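As an aside, here is a stand-alone microbenchmark sketch of the kind of overhead being claimed. It is not part of this PR or of libfabric, and the file name, iteration count, and measurement approach are assumptions; it simply times a direct pthread_mutex_lock/unlock pair against the same pair reached through volatile function pointers, roughly mirroring the genlock indirection on an uncontended lock.

/* build: cc -O2 lock_bench.c -o lock_bench -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

static void ind_lock(void)   { pthread_mutex_lock(&mtx); }
static void ind_unlock(void) { pthread_mutex_unlock(&mtx); }

static double ns_per_iter(struct timespec a, struct timespec b)
{
	return ((b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec)) / ITERS;
}

int main(void)
{
	/* volatile keeps the compiler from turning the indirect calls
	 * back into direct ones */
	void (*volatile lock_fn)(void) = ind_lock;
	void (*volatile unlock_fn)(void) = ind_unlock;
	struct timespec t0, t1;
	unsigned long i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < ITERS; i++) {
		pthread_mutex_lock(&mtx);	/* direct, as with ofi_mutex */
		pthread_mutex_unlock(&mtx);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("direct:   %.2f ns/iter\n", ns_per_iter(t0, t1));

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < ITERS; i++) {
		lock_fn();			/* indirect, as with genlock */
		unlock_fn();
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("indirect: %.2f ns/iter\n", ns_per_iter(t0, t1));
	return 0;
}

Whether such a difference shows up at the application level, as debated above, depends on how hot these call sites are in the pingpong path.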

@a-szegel
Contributor Author

a-szegel commented Oct 17, 2023

I do not like this patch set, although I think it is necessary. I think ofi_genlocks are better and should be used. I noticed that 1.19.x SHM is faster than main SHM, and this patch set is the fastest/easiest way to fix part of that problem. I would like to be able to use ofi_genlock->ofi_mutex without a performance hit, and I think finding the root cause of the issue would be great.

@aingerson
Contributor

@a-szegel What I'm trying to say is I think that there was an issue in the original commit that is misleading you to what the actual issue is. In the commit where you change the ep_list_locks to always be a mutex, you also forced the ep lock to always be a mutex as well. The ep_list_lock is the only one where it "has" to be a mutex because of the control/bind issue that could be problematic. The util ep can still have the no-op optimization.
I do see the regression you're talking about but for me, it is solved simply by adding the util_ep no-op optimization back in. If I just revert the ep_list_lock/genlock vs mutex code, it does not fix it. So I don't think it is the genlock abstraction that is the problem but rather the incorrect removal of the util_ep lock optimization.
At least for me, the regression I'm seeing goes away with this fix:

diff --git a/prov/util/src/util_ep.c b/prov/util/src/util_ep.c
index 61906a9f2..bd6cf4990 100644
--- a/prov/util/src/util_ep.c
+++ b/prov/util/src/util_ep.c
@@ -247,10 +247,9 @@ int ofi_endpoint_init(struct fid_domain *domain, const struct util_prov *util_pr
        if (util_domain->eq)
                ofi_ep_bind_eq(ep, util_domain->eq);
 
-       /* TODO Figure out how to optimize this lock for rdm and msg endpoints */
-       ret = ofi_genlock_init(&ep->lock, OFI_LOCK_MUTEX);
-       if (ret)
-               return ret;
+       ret = ofi_genlock_init(&ep->lock,
+               ep->domain->threading != FI_THREAD_SAFE ?
+               OFI_LOCK_NOOP : OFI_LOCK_MUTEX);
 
        if (ep->caps & FI_COLLECTIVE) {
                ep->coll_cid_mask = calloc(1, sizeof(*ep->coll_cid_mask));

@a-szegel
Contributor Author

@shefty told me that I needed to revert everything in #9377. Before any of my changes, it was always a mutex lock; after my changes, it is an ofi_genlock that is always a mutex lock, with a regression. Yes, making the mutex lock "not a mutex" fixes the regression, but reverting to a plain ofi_mutex lock also fixes it. This points to an issue in genlock.

@aingerson
Contributor

@a-szegel I think I'm lost with your logic. For me, the regression is not fixed by changing the genlock mutex to a mutex used directly. And before your changes, the util_ep lock was not always a mutex; it was optimized for the threading model, and your changes made it always a mutex. Your revert of the initial set of changes was incorrect and introduced a new change in the util_ep (compare your original optimization with your "revert" of those changes). The util_ep lock was never part of your original changes; it was changed to a genlock in June. Your revert changed that lock to always be a mutex when it shouldn't have.

@a-szegel a-szegel closed this Oct 23, 2023