
[IMPROVED] Check stream state performance #5963

Merged · 5 commits merged into main from check-interest-state · Oct 8, 2024

Conversation

derekcollison
Member

When checking interest state for interest or workqueue streams, we would check all msgs from the stream's first sequence through the ack floor and up to delivered.

We do this to make sure our ack state is correct. In cases where there were a lot of messages still in the stream due to offline or slow consumers, this could be a heavy load on a server.

This improvement uses LoadNextMsg() to efficiently skip ahead, and we now remember our checked floor and do not repeat checks for messages below that floor on subsequent runs.

This change also highlighted a data race in filestore that is fixed here as well.
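For illustration, a minimal, self-contained sketch of the checked-floor idea: scan only above the remembered floor, jump over sparse ranges the way LoadNextMsg() allows, and only ever advance the floor. The types and names (store, consumer, checkedFloor, loadNextMsg) are hypothetical and are not the actual nats-server API.

```go
// Hypothetical sketch of the checked-floor approach; not the nats-server code.
package main

import "fmt"

// msg stands in for a stored stream message.
type msg struct {
	seq  uint64
	subj string
}

// store stands in for a stream store that may have large gaps of removed msgs.
type store struct {
	msgs map[uint64]*msg
	last uint64
}

// loadNextMsg returns the first message at or above seq, mimicking how
// LoadNextMsg() lets callers jump over gaps instead of probing every sequence.
func (s *store) loadNextMsg(seq uint64) (*msg, bool) {
	for ; seq <= s.last; seq++ {
		if m, ok := s.msgs[seq]; ok {
			return m, true
		}
	}
	return nil, false
}

// consumer tracks ack state plus the newly remembered checked floor.
type consumer struct {
	ackFloor     uint64          // everything at or below this is acked
	checkedFloor uint64          // highest sequence already verified on a prior run
	pending      map[uint64]bool // delivered but unacked sequences
}

// checkInterestState re-verifies ack state only for sequences above the
// checked floor, then advances the floor so the next run skips that work.
func (c *consumer) checkInterestState(s *store) {
	for seq := c.checkedFloor + 1; seq <= c.ackFloor; {
		m, ok := s.loadNextMsg(seq)
		if !ok || m.seq > c.ackFloor {
			break
		}
		if c.pending[m.seq] {
			fmt.Printf("seq %d is below ack floor but still pending, fixing up\n", m.seq)
			delete(c.pending, m.seq)
		}
		seq = m.seq + 1
	}
	// Only ever move the floor forward.
	if c.ackFloor > c.checkedFloor {
		c.checkedFloor = c.ackFloor
	}
}

func main() {
	s := &store{msgs: map[uint64]*msg{5: {5, "orders"}, 900: {900, "orders"}}, last: 1000}
	c := &consumer{ackFloor: 800, pending: map[uint64]bool{5: true}}
	c.checkInterestState(s) // first run walks the sparse range once
	c.checkInterestState(s) // second run skips everything below the floor
	fmt.Println("checked floor:", c.checkedFloor)
}
```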

Signed-off-by: Derek Collison derek@nats.io

We used to check at 5s and then every 30s. However, we already run a check once log replay completes, so now we just check every ~2.5 minutes.
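Roughly, the cadence change amounts to something like the sketch below, assuming a hypothetical constant name and a standalone ticker loop; the real server wires this into its existing monitoring goroutine.

```go
// Hypothetical sketch of the new check cadence; names are not the real ones.
package main

import (
	"fmt"
	"time"
)

// Previously the state check ran 5s after start and then every 30s. Since a
// full check already happens once log replay completes, a much longer
// periodic interval is enough.
const checkInterestStateInterval = 150 * time.Second // ~2.5 minutes

func runInterestStateChecks(check func(), stop <-chan struct{}) {
	t := time.NewTicker(checkInterestStateInterval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			check()
		case <-stop:
			return
		}
	}
}

func main() {
	stop := make(chan struct{})
	go runInterestStateChecks(func() { fmt.Println("checking interest state") }, stop)
	time.Sleep(10 * time.Millisecond) // demo only; a real check fires every ~2.5 minutes
	close(stop)
}
```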

Signed-off-by: Derek Collison <derek@nats.io>
@derekcollison requested a review from a team as a code owner October 6, 2024 01:34
Member

@neilalexander left a comment


LGTM overall but just a question.

psi.fblk = i
// We only require read lock here as that is desirable,
// so we need to do this in a go routine to acquire write lock.
go func() {
Member


Do we need to guard against accidentally creating multiple of these goroutines for the same store?

Member Author


I don't think so, but I was thinking the same thing.

We essentially have three options.

  1. Promote to a write lock at the call sites, but I want LoadNextMsg() to be able to operate in parallel.
  2. Do not do any fixups to stale fblks in this function.
  3. The proposal above (we could modify it to funnel through a single goroutine, but this should not be common).

Member


I'm wondering if a simple CompareAndSwap and a deferred clear would be enough just to ensure only one fixup runs at a time for a given store. I'd be worried that we could end up with multiple of these doing the same work at the same time, which could compound the issue.
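For reference, a minimal sketch of that CompareAndSwap-plus-deferred-clear pattern, using a hypothetical fixupRunning field and method name rather than the actual filestore code:

```go
// Hypothetical sketch of guarding the background fixup with CompareAndSwap.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type fileStore struct {
	fixupRunning atomic.Bool // guards the background fblk fixup
}

// maybeFixupFblks launches at most one fixup goroutine per store at a time;
// calls made while one is already running are no-ops.
func (fs *fileStore) maybeFixupFblks(wg *sync.WaitGroup) {
	// Only proceed if we are the caller that flipped false -> true.
	if !fs.fixupRunning.CompareAndSwap(false, true) {
		return
	}
	wg.Add(1)
	go func() {
		defer wg.Done()
		// Deferred clear allows a later call to start a fresh fixup.
		defer fs.fixupRunning.Store(false)
		fmt.Println("running fblk fixup")
	}()
}

func main() {
	var fs fileStore
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		fs.maybeFixupFblks(&wg) // extra calls return immediately while one runs
	}
	wg.Wait()
}
```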

Member Author


It could be for a different set of PSIM entries though, so a simple boolean state would not suffice IMO.

Member


Good point, we could go with this for now and keep an eye on the goroutines. It may not be an issue.

Member Author


I think the worst case is that we could duplicate work, but not invalidate state. Hence the checks to only move it forward.
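A tiny sketch of that forward-only guard, with hypothetical names modeled on the psi.fblk assignment quoted above: racing fixups may duplicate work, but none of them can move the value backwards.

```go
// Hypothetical sketch of the "only move it forward" check; not the real code.
package main

import (
	"fmt"
	"sync"
)

type psi struct {
	mu   sync.Mutex
	fblk uint64 // first block that may contain this subject
}

// advanceFblk only ever increases fblk, so a duplicate or stale fixup
// computed from older information is simply ignored.
func (p *psi) advanceFblk(i uint64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if i > p.fblk {
		p.fblk = i
	}
}

func main() {
	p := &psi{fblk: 3}
	var wg sync.WaitGroup
	for _, i := range []uint64{7, 5, 9, 2} { // racing fixups with differing results
		wg.Add(1)
		go func(i uint64) {
			defer wg.Done()
			p.advanceFblk(i)
		}(i)
	}
	wg.Wait()
	fmt.Println("fblk:", p.fblk) // always ends at 9, never moves backwards
}
```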

Member

@neilalexander left a comment


LGTM

@derekcollison merged commit c9d0a12 into main Oct 8, 2024
5 checks passed
@derekcollison deleted the check-interest-state branch October 8, 2024 02:18
neilalexander pushed a commit that referenced this pull request Oct 8, 2024
neilalexander added a commit that referenced this pull request Oct 9, 2024
Includes the following:

- #5944
- #5945
- #5939
- #5935
- #5960
- #5970
- #5971
- #5963
- #5973
- #5978

Signed-off-by: Neil Twigg <neil@nats.io>
wallyqs pushed a commit that referenced this pull request Oct 13, 2024