prov/util: Revert ofi_genlock changes completely #9442

Closed

Conversation

a-szegel
Contributor

Completely reverting the ofi_genlock changes from main saves us 20ns in fi_rdm_tagged_pingpong for the SHM provider.

instance: aws c5n.18xlarge
os: al2
client: fi_rdm_tagged_pingpong --pin-core 1 -S 1 -s 127.0.0.1 -p shm 127.0.0.1
server: fi_rdm_tagged_pingpong --pin-core 0 -S 1 -s 127.0.0.1 -p shm

main @ 8db8594 (has genlocks that are always mutex): 0.48us
git revert 64b31be (my changes to have genlocks with no mutex): 0.45us
git revert 96f35fa 3670360 1c4e8ea 0d39856 (ofi_mutex directly): 0.46us

@a-szegel a-szegel requested review from shefty, aingerson and a team October 14, 2023 00:04
@a-szegel a-szegel changed the title prov/util: Revert ofigenlock changes completely prov/util: Revert ofi_genlock changes completely Oct 14, 2023
Contributor

@aingerson aingerson left a comment

I'm not convinced the genlock implementation is the problem here. The biggest thing I see in this set of changes is that you change the ep lock to a mutex instead of optimizing for the no-op. I could definitely see that causing a regression. I think the ep_list_locks are the only ones that need to be a mutex (until we revisit the connection management issue Sean brought up). Perhaps that change got added by mistake in the ep_list_lock patch.

Not related to the code itself but when you revert a commit, remember to:

  1. Sign all your revert commits so the DCO check passes
  2. Add a commit message to each one explaining why you are reverting

Comment on lines +250 to +252
ret = ofi_genlock_init(&ep->lock,
ep->domain->threading != FI_THREAD_SAFE ?
OFI_LOCK_NOOP : OFI_LOCK_MUTEX);
Contributor

I may be misreading this (and it could have been an issue with the original commit), but the commit was, I think, supposed to target the ep_list_locks; this is the ep->lock itself, which I think is a separate issue. I agree with changing the ep lock to a noop/mutex check, but I don't think it belongs in this patch.

Contributor Author

I think it is clearer if you look at the files changed tab instead of commit by commit.

Comment on lines 555 to 557
ofi_genlock_init(&av->ep_list_lock,
domain->threading == FI_THREAD_DOMAIN ?
OFI_LOCK_NOOP : OFI_LOCK_MUTEX);
Contributor

The original commit says it was fixing the API per Sean's concern about the ep_list_lock needing to be thread safe regardless of the threading level. I don't think reverting that fix in the av and cq is the proper answer here.

Contributor Author

I think it is clearer if you look at the files changed tab instead of commit by commit.

@@ -52,15 +52,15 @@ static const char *rxd_cq_strerror(struct fid_cq *cq_fid, int prov_errno,

cq = container_of(cq_fid, struct rxd_cq, util_cq.cq_fid);

-	ofi_genlock_lock(&cq->util_cq.ep_list_lock);
+	ofi_mutex_lock(&cq->util_cq.ep_list_lock);
Contributor

The genlock vs mutex implementation is exactly the same. Switching back to the mutex lock shouldn't impact performance. I'm not convinced this is the source of the regression you're seeing.

Member

I question whether the test results are reproducible enough to determine a 20ns difference. The mutex calls go directly to pthread calls; the genlock calls reach those indirectly via a function pointer. The point of genlock is that it allows disabling the lock at runtime, which isn't possible when calling the mutex directly.
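For illustration, a minimal sketch of the indirection being described (this is not libfabric's actual definition; the struct layout and names here are assumptions): genlock dispatches lock/unlock through function pointers selected at init time, which is what makes the runtime no-op possible, whereas ofi_mutex maps straight onto the pthread calls.

#include <pthread.h>

/* Hypothetical, simplified genlock; the real ofi_genlock differs in detail. */
enum lock_type { LOCK_NOOP, LOCK_MUTEX };

struct genlock {
	pthread_mutex_t mutex;
	void (*lock)(struct genlock *l);
	void (*unlock)(struct genlock *l);
};

static void noop_lock(struct genlock *l)    { (void) l; }
static void noop_unlock(struct genlock *l)  { (void) l; }
static void mutex_lock(struct genlock *l)   { pthread_mutex_lock(&l->mutex); }
static void mutex_unlock(struct genlock *l) { pthread_mutex_unlock(&l->mutex); }

static int genlock_init(struct genlock *l, enum lock_type type)
{
	if (type == LOCK_NOOP) {
		/* lock/unlock become empty functions: zero work at runtime */
		l->lock = noop_lock;
		l->unlock = noop_unlock;
		return 0;
	}
	l->lock = mutex_lock;
	l->unlock = mutex_unlock;
	return pthread_mutex_init(&l->mutex, NULL);
}

/* Every call site pays an indirect call:
 *     l->lock(l);   ... critical section ...   l->unlock(l);
 * while ofi_mutex_lock() is a direct pthread_mutex_lock(). */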

Contributor Author

@a-szegel a-szegel Oct 17, 2023

Please reproduce the testing on your end. In the Slack channel, I linked my run command for the reproducer (and it was also done on an Intel Skylake instance).

Contributor Author

From my test results:

Bisect 2: 84a6bde5e -> main (for SHM)

84a6bde5e: 0.46
99f87709c: 0.46
<Seth's OFI genlock changes in here>
318b3f49f: 0.45
55f7fd1ad: 0.45
6b62f34ea: 0.45
90baec9d3: 0.45
803af8801: 0.45
64b31bea3: 0.48
967f91458: 0.48
52a4364d9: 0.48
422241636: 0.48
main: 0.48

Running this branch on top of main brings us from 0.48 to 0.46.

@@ -60,7 +60,7 @@ static int rxd_cntr_wait(struct fid_cntr *cntr_fid, uint64_t threshold, int time
return -FI_ETIMEDOUT;

ep_retry = -1;
-	ofi_genlock_lock(&cntr->ep_list_lock);
+	ofi_mutex_lock(&cntr->ep_list_lock);
Contributor

Same thing here: the genlock and the mutex are the same underneath, so I don't think this should affect performance. We should take a closer look at it.

Contributor Author

They are not the same thing, and they do impact performance.

@a-szegel
Contributor Author

a-szegel commented Oct 17, 2023

Before this patch set, main always used ofi_mutex locks (perf 0.46us). The first patch set changed a number of ofi_mutex locks to ofi_genlock and decided whether or not to take the mutex based on the domain lock level (perf 0.45us; less locking makes us faster). The second patch set switched the ofi_genlock to always default to an ofi_mutex lock because @shefty realized we were violating a threading case for msg endpoints in the API (perf 0.48us; a regression).

This PR blows both patch set 1 and patch set 2 away. It makes us default to ofi_mutex and brings us from the current 0.48us back to 0.46us, because an ofi_genlock hard-coded to ofi_mutex adds non-negligible overhead compared to using ofi_mutex directly.
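As an aside, here is a stand-alone microbenchmark sketch of the kind of overhead being claimed. It is not part of this PR or of libfabric, and the file name, iteration count, and measurement approach are assumptions; it simply times a direct pthread_mutex_lock/unlock pair against the same pair reached through volatile function pointers, roughly mirroring the genlock indirection on an uncontended lock.

/* build: cc -O2 lock_bench.c -o lock_bench -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

static void ind_lock(void)   { pthread_mutex_lock(&mtx); }
static void ind_unlock(void) { pthread_mutex_unlock(&mtx); }

static double ns_per_iter(struct timespec a, struct timespec b)
{
	return ((b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec)) / ITERS;
}

int main(void)
{
	/* volatile keeps the compiler from turning the indirect calls
	 * back into direct ones */
	void (*volatile lock_fn)(void) = ind_lock;
	void (*volatile unlock_fn)(void) = ind_unlock;
	struct timespec t0, t1;
	unsigned long i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < ITERS; i++) {
		pthread_mutex_lock(&mtx);	/* direct, as with ofi_mutex */
		pthread_mutex_unlock(&mtx);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("direct:   %.2f ns/iter\n", ns_per_iter(t0, t1));

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < ITERS; i++) {
		lock_fn();			/* indirect, as with genlock */
		unlock_fn();
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("indirect: %.2f ns/iter\n", ns_per_iter(t0, t1));
	return 0;
}

Whether such a difference shows up at the application level, as debated above, depends on how hot these call sites are in the pingpong path.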

@a-szegel
Contributor Author

a-szegel commented Oct 17, 2023

I do not like this patch set, although I think it is necessary. I think ofi_genlocks are better and should be used. I noticed that 1.19.x SHM is faster than main SHM, and this patch set is the fastest/easiest way to fix part of that problem. I would like to be able to use ofi_genlock->ofi_mutex without a performance hit, and I think finding the root cause of the issue would be great.

@aingerson
Contributor

@a-szegel What I'm trying to say is I think that there was an issue in the original commit that is misleading you to what the actual issue is. In the commit where you change the ep_list_locks to always be a mutex, you also forced the ep lock to always be a mutex as well. The ep_list_lock is the only one where it "has" to be a mutex because of the control/bind issue that could be problematic. The util ep can still have the no-op optimization.
I do see the regression you're talking about but for me, it is solved simply by adding the util_ep no-op optimization back in. If I just revert the ep_list_lock/genlock vs mutex code, it does not fix it. So I don't think it is the genlock abstraction that is the problem but rather the incorrect removal of the util_ep lock optimization.
At least for me, the regression I'm seeing goes away with this fix:

diff --git a/prov/util/src/util_ep.c b/prov/util/src/util_ep.c
index 61906a9f2..bd6cf4990 100644
--- a/prov/util/src/util_ep.c
+++ b/prov/util/src/util_ep.c
@@ -247,10 +247,9 @@ int ofi_endpoint_init(struct fid_domain *domain, const struct util_prov *util_pr
        if (util_domain->eq)
                ofi_ep_bind_eq(ep, util_domain->eq);
 
-       /* TODO Figure out how to optimize this lock for rdm and msg endpoints */
-       ret = ofi_genlock_init(&ep->lock, OFI_LOCK_MUTEX);
-       if (ret)
-               return ret;
+       ret = ofi_genlock_init(&ep->lock,
+               ep->domain->threading != FI_THREAD_SAFE ?
+               OFI_LOCK_NOOP : OFI_LOCK_MUTEX);
 
        if (ep->caps & FI_COLLECTIVE) {
                ep->coll_cid_mask = calloc(1, sizeof(*ep->coll_cid_mask));

@a-szegel
Contributor Author

@shefty told me that I needed to revert everything in #9377. Before any of my changes, it was always a mutex lock; after my changes, it is an ofi_genlock that is always a mutex lock, with a regression. Yes, making the mutex lock "not a mutex" fixes the regression, but reverting to a plain ofi_mutex lock also fixes it. This points to an issue in genlock.

@aingerson
Contributor

@a-szegel I think I'm lost with your logic. For me, the regression is not fixed by changing the genlock mutex to a mutex used directly. And before your changes, the util_ep lock was not always a mutex; it was optimized for the threading model, and your changes made it always a mutex. Your revert of the initial set of changes was incorrect and introduced a new change in the util_ep (compare your original optimization with your "revert" of those changes). The util_ep lock was never part of your original changes; it was changed to a genlock in June. Your revert changed that lock to always be a mutex when it shouldn't have.

@a-szegel a-szegel closed this Oct 23, 2023