Eliminate semaphore contention #435

Merged

Conversation

@dgtony (Contributor) commented Sep 12, 2020

Using go-threads on high-degree nodes, we detected a scalability problem with the current approach of protecting thread operations with semaphores. Currently a semaphore remains acquired by an active process for the entire flow, including cyclic fetches of record blocks from the network, app-level validation, etc., which can result in the semaphore being held for a long time. Worse, several processes trying to acquire the same semaphore become naturally queued, further amplifying waiting times.

To give an idea of how bad it can get, here are the quantiles of the per-thread semaphore acquisition time, measured in a sliding window over a day and a week respectively:
[plot: acquire_duration_day]
[plot: acquire_duration_week]

Ultimately this prevents normal system operation and leads to many requests failing due to exceeded deadlines.

Here we propose a solution to the problem by decoupling long-running operations from the synchronized sections. Network operations and record validation are performed without holding the thread semaphore, while concurrent pulls of the same log are still avoided with per-thread-per-log semaphores. Per-thread semaphores now protect only local thread state mutations, record processing, and internal bus broadcasting, preserving the order of the records. A minimal sketch of the restructured flow follows.
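
To illustrate the idea, here is a minimal sketch of the restructured flow (not the actual go-threads code; the semaphore type and the fetchRecords/processRecords helpers are hypothetical stand-ins):

```go
package main

import "fmt"

type semaphore chan struct{}

func newSemaphore() semaphore { return make(semaphore, 1) }

func (s semaphore) acquire() { s <- struct{}{} }
func (s semaphore) release() { <-s }
func (s semaphore) tryAcquire() bool {
	select {
	case s <- struct{}{}:
		return true
	default:
		return false
	}
}

// pullLog fetches new records for a single log and applies them to the thread.
func pullLog(logGuard, threadGuard semaphore) {
	// Per-thread-per-log guard: if another pull of this log is in flight, skip.
	if !logGuard.tryAcquire() {
		return
	}
	defer logGuard.release()

	// Long-running part: network fetches and validation run without the
	// per-thread semaphore, so other thread operations are not blocked.
	recs := fetchRecords()

	// Short critical section: only local state mutation, record processing
	// and bus broadcasting run under the per-thread semaphore, preserving
	// the order of records seen by listeners.
	threadGuard.acquire()
	defer threadGuard.release()
	processRecords(recs)
}

func fetchRecords() []string       { return []string{"rec1", "rec2"} } // stand-in for network I/O
func processRecords(recs []string) { fmt.Println("processed", recs) }  // stand-in for state update + broadcast

func main() {
	pullLog(newSemaphore(), newSemaphore())
}
```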

Signed-off-by: Anton Dort-Golts <dortgolts@gmail.com>
@sanderpick (Member)

@dgtony, this is great! Thanks for diving so deep. I am back at the desk this week. I gave it a skim but need to look deeper tomorrow.

@sanderpick (Member) left a comment

Thanks, @dgtony, this is a big improvement. I left a couple questions, but LGTM!

defer wg.Done()
p, err := addr.ValueForProtocol(ma.P_P2P)
pid, ok, err := s.callablePeer(addr)
Member:

Nice cleanups 🤙

log.Debugf("received %d records in log %s from %s", len(l.Records), logID, pid)

if l.Log != nil && len(l.Log.Addrs) > 0 {
if err = s.net.store.AddAddrs(tid, logID, addrsFromProto(l.Log.Addrs), pstore.PermanentAddrTTL); err != nil {
Member:

Is the idea behind using AddAddrs and AddPubKey (below) instead of AddLog to be more explicit? Looks like a good change, just curious about the motivation.

Contributor Author:

Mostly to avoid moving the head of the log. getRecords now just fetches the records and, apart from adding new addresses and keys, is stateless; the real processing and state update happen in putRecord.

net/client.go Outdated
@@ -315,49 +337,43 @@ func (s *server) pushRecord(ctx context.Context, id thread.ID, lid peer.ID, rec
Body: body,
}

logErr := func(addr ma.Multiaddr, f func(addr ma.Multiaddr) error) {
Member:

Could be a package level func?

Contributor Author:

sure

@@ -826,11 +824,25 @@ func (n *net) Validate(id thread.ID, token thread.Token, readOnly bool) (thread.
return token.Validate(n.getPrivKey())
}

// getConnector returns the connector tied to the thread if it exists
func (n *net) addConnector(id thread.ID, conn *app.Connector) {
Member:

Ooph, good catch on the unprotected map access 🤦

// from the record possibly won't reach reducers/listeners or even get dispatched.
// 2. Rollback log head to the previous record. In this case record handling will be retried until
// success, but reducers must guarantee its idempotence and there is a chance of getting stuck
// with bad event and not making any progress at all.
Member:

Good notes! The original thinking here is in line with your warning in (2)... if it failed the first time, chances are it will fail again.


// Generally broadcasting should not block for too long, i.e. we have to run it
// under the semaphore to ensure consistent order seen by the listeners. Record
// bursts could be overcome by adjusting listener buffers (EventBusCapacity).
Member:

👍
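
For reference, a minimal sketch of the pattern that comment describes, assuming a simple channel-based bus; the bus type and the eventBusCapacity constant here are illustrative stand-ins, not the actual go-threads types:

```go
package main

import "fmt"

const eventBusCapacity = 16 // assumed per-listener buffer size

type bus struct {
	listeners []chan string
}

func (b *bus) subscribe() <-chan string {
	ch := make(chan string, eventBusCapacity)
	b.listeners = append(b.listeners, ch)
	return ch
}

// broadcast is called while holding the thread semaphore, so every listener
// observes records in the same order; the buffers absorb short record bursts
// without blocking unless a listener falls more than eventBusCapacity behind.
func (b *bus) broadcast(rec string) {
	for _, ch := range b.listeners {
		ch <- rec
	}
}

func main() {
	b := &bus{}
	sub := b.subscribe()
	b.broadcast("rec-1")
	fmt.Println(<-sub)
}
```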

lock.Lock()
fetchedRcs = append(fetchedRcs, recs)
lock.Unlock()
}(lg)
}
wg.Wait()

// maybe we should preliminary deduplicate records?
Member:

As in, check for duplicates in fetchedRcs?

Contributor Author:

Yes, in order to avoid redundant processing of already-seen records.
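
A minimal sketch of the kind of deduplication being suggested, keyed on the record's content identifier; the record type here is a hypothetical stand-in for the real record/CID types:

```go
package main

import "fmt"

type record struct {
	cid  string // content identifier of the record
	body string
}

// dedup keeps only the first occurrence of each record CID, so later
// processing does not repeat work for records fetched from several peers.
func dedup(recs []record) []record {
	seen := make(map[string]struct{}, len(recs))
	out := recs[:0]
	for _, r := range recs {
		if _, ok := seen[r.cid]; ok {
			continue
		}
		seen[r.cid] = struct{}{}
		out = append(out, r)
	}
	return out
}

func main() {
	recs := []record{{"a", "x"}, {"b", "y"}, {"a", "x"}}
	fmt.Println(dedup(recs)) // [{a x} {b y}]
}
```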

@sanderpick (Member) commented Sep 18, 2020

Oop, I left you with a conflict... I think just from this tiny change: f4a9ab7#diff-2fef49f8571a7dfa93a40dec78adc852

Signed-off-by: Anton Dort-Golts <dortgolts@gmail.com>
@dgtony (Contributor, Author) commented Sep 21, 2020

Here is a little update - take a look, please.

We've been testing this branch during the last week and discovered that, in the presence of a large number of threads, semaphore protection performs better at the thread level than at the log level. So two types of semaphores are introduced: the first protects against concurrent pulls of the same thread, and the other guards updates of the local thread state. A sketch of the two guards follows.
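
A hypothetical sketch of the two per-thread guards described above; the semaphorePool and key helpers are illustrative, not the actual go-threads identifiers:

```go
package main

import (
	"fmt"
	"sync"
)

type semaphore chan struct{}

// semaphorePool lazily hands out one binary semaphore per key.
type semaphorePool struct {
	mu    sync.Mutex
	semas map[string]semaphore
}

func newSemaphorePool() *semaphorePool {
	return &semaphorePool{semas: make(map[string]semaphore)}
}

func (p *semaphorePool) get(key string) semaphore {
	p.mu.Lock()
	defer p.mu.Unlock()
	s, ok := p.semas[key]
	if !ok {
		s = make(semaphore, 1)
		p.semas[key] = s
	}
	return s
}

// Two distinct guards per thread: one serializes pulls of the whole thread,
// the other serializes updates of the local thread state.
func pullKey(threadID string) string  { return "pull/" + threadID }
func stateKey(threadID string) string { return "state/" + threadID }

func main() {
	pool := newSemaphorePool()
	pull, state := pool.get(pullKey("t1")), pool.get(stateKey("t1"))

	// Skip the pull if one is already running; always serialize state updates.
	select {
	case pull <- struct{}{}:
		defer func() { <-pull }()
		state <- struct{}{}
		fmt.Println("pulling and updating thread t1")
		<-state
	default:
		fmt.Println("pull of t1 already in progress, skipping")
	}
}
```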

Signed-off-by: Anton Dort-Golts <dortgolts@gmail.com>
…ment/eliminate-semaphore-contention

Signed-off-by: Anton Dort-Golts <dortgolts@gmail.com>
@sanderpick (Member)

Cool, looks reasonable. @jsign, since you worked on the initial concurrency model, could you give this a skim?

@jsign (Contributor) commented Sep 21, 2020

Well, a lot has happened since then, so I'm not totally up to speed on all the code that has changed.
But sure, I'll give it a quick look today and flag anything I catch.

@sanderpick (Member)

Cool, makes sense. Even just a 2-minute skim as a sanity check.

@jsign (Contributor) left a comment

Looks good.

Every time I see TryAcquire defaulting to doing nothing if the lock is already taken, making that behavior indistinguishable from a real update, it seems a bit weird (btw, it was always that way).

Maybe something for the future is to return an extra bool indicating whether real work was done or the call was ignored, since after a successful return the caller can't really know whether something actually happened or whether it got unlucky and should try again soon.

Anyway, if this hasn't happened yet, it's probably because callers don't want that guarantee. Or maybe it's something callers aren't clearly aware of, since in general when you call a method you wouldn't expect that it might do nothing and that you should retry later (with no way to tell). This magical "ignore" style looks like it might be error-prone.

Just sharing some thoughts here; not really related to this PR's change. A sketch of the idea follows.
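
A hypothetical sketch of that suggestion, assuming a pullThread-style caller of TryAcquire that reports whether it actually ran or was skipped; the names here are illustrative, not the existing API:

```go
package main

import "fmt"

type semaphore chan struct{}

func (s semaphore) tryAcquire() bool {
	select {
	case s <- struct{}{}:
		return true
	default:
		return false
	}
}

func (s semaphore) release() { <-s }

// pullThread reports whether the pull actually ran; pulled is false when the
// call was skipped because a pull of the same thread was already in progress.
func pullThread(guard semaphore) (pulled bool, err error) {
	if !guard.tryAcquire() {
		return false, nil // skipped: another pull is already running
	}
	defer guard.release()
	// ... fetch and process records here ...
	return true, nil
}

func main() {
	guard := make(semaphore, 1)
	pulled, err := pullThread(guard)
	fmt.Println(pulled, err)
}
```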

@sanderpick (Member)

Great, good stuff to think about 👍

@dgtony, are you still testing or do you consider this mergeable?

@dgtony (Contributor, Author) commented Sep 22, 2020

We're done with testing; the latest version looks stable.

Regarding the silent TryAcquire: in the first draft implementation the methods returned a specific error on failed semaphore acquisition, to let the caller know that processing was skipped. I later removed it because TryAcquire is currently used only to ensure that no more than a single process is pulling a given thread from the other nodes. So if you arrive and see the semaphore is already busy, that's fine: it doesn't matter which process pulls the thread, and the error checks just cluttered the code.

But as you said, a problem may arise in the future if some caller assumes synchronous behavior, e.g. that the thread was definitely updated after pullThread returns. There seem to be no such assumptions in the code currently, but I can bring the error back to make it explicit, if you think it's worth it.

@jsign (Contributor) commented Sep 22, 2020

But as you said, a problem may arise in the future if some caller assumes synchronous behavior, e.g. that the thread was definitely updated after pullThread returns. There seem to be no such assumptions in the code currently, but I can bring the error back to make it explicit, if you think it's worth it.

Yes, until there's a good argument for including a signal that the call was skipped, I think it's OK as it is. 👍🏼

@sanderpick sanderpick merged commit 28845b5 into textileio:master Sep 22, 2020
@dgtony dgtony deleted the improvement/eliminate-semaphore-contention branch September 23, 2020 10:55