
[IMPROVED] Check stream state performance #5963

Merged · 5 commits merged into main from check-interest-state · Oct 8, 2024

Conversation

derekcollison
Member

When checking interest state for interest or workqueue streams, we would check all msgs from the stream's first sequence through the ack floor and up to delivered.

We do this to make sure our ack state is correct. In cases where there were a lot of messages still in the stream due to offline or slow consumers, this could be a heavy load on a server.

This improvement uses LoadNextMsg() to efficiently skip ahead, and we now remember our checked floor and do not repeat checks for messages below that floor on subsequent runs.

This change also highlighted a data race in filestore that is fixed here as well.
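For illustration, a minimal, self-contained sketch of the checked-floor idea: scan only above the remembered floor, jump over sparse ranges the way LoadNextMsg() allows, and only ever advance the floor. The types and names (store, consumer, checkedFloor, loadNextMsg) are hypothetical and are not the actual nats-server API.

```go
// Hypothetical sketch of the checked-floor approach; not the nats-server code.
package main

import "fmt"

// msg stands in for a stored stream message.
type msg struct {
	seq  uint64
	subj string
}

// store stands in for a stream store that may have large gaps of removed msgs.
type store struct {
	msgs map[uint64]*msg
	last uint64
}

// loadNextMsg returns the first message at or above seq, mimicking how
// LoadNextMsg() lets callers jump over gaps instead of probing every sequence.
func (s *store) loadNextMsg(seq uint64) (*msg, bool) {
	for ; seq <= s.last; seq++ {
		if m, ok := s.msgs[seq]; ok {
			return m, true
		}
	}
	return nil, false
}

// consumer tracks ack state plus the newly remembered checked floor.
type consumer struct {
	ackFloor     uint64          // everything at or below this is acked
	checkedFloor uint64          // highest sequence already verified on a prior run
	pending      map[uint64]bool // delivered but unacked sequences
}

// checkInterestState re-verifies ack state only for sequences above the
// checked floor, then advances the floor so the next run skips that work.
func (c *consumer) checkInterestState(s *store) {
	for seq := c.checkedFloor + 1; seq <= c.ackFloor; {
		m, ok := s.loadNextMsg(seq)
		if !ok || m.seq > c.ackFloor {
			break
		}
		if c.pending[m.seq] {
			fmt.Printf("seq %d is below ack floor but still pending, fixing up\n", m.seq)
			delete(c.pending, m.seq)
		}
		seq = m.seq + 1
	}
	// Only ever move the floor forward.
	if c.ackFloor > c.checkedFloor {
		c.checkedFloor = c.ackFloor
	}
}

func main() {
	s := &store{msgs: map[uint64]*msg{5: {5, "orders"}, 900: {900, "orders"}}, last: 1000}
	c := &consumer{ackFloor: 800, pending: map[uint64]bool{5: true}}
	c.checkInterestState(s) // first run walks the sparse range once
	c.checkInterestState(s) // second run skips everything below the floor
	fmt.Println("checked floor:", c.checkedFloor)
}
```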

Signed-off-by: Derek Collison derek@nats.io

We used to check at 5s and then every 30s. However, we already run a check once log replay completes, so now we just check every ~2.5 minutes.
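Roughly, the cadence change amounts to something like the sketch below, assuming a hypothetical constant name and a standalone ticker loop; the real server wires this into its existing monitoring goroutine.

```go
// Hypothetical sketch of the new check cadence; names are not the real ones.
package main

import (
	"fmt"
	"time"
)

// Previously the state check ran 5s after start and then every 30s. Since a
// full check already happens once log replay completes, a much longer
// periodic interval is enough.
const checkInterestStateInterval = 150 * time.Second // ~2.5 minutes

func runInterestStateChecks(check func(), stop <-chan struct{}) {
	t := time.NewTicker(checkInterestStateInterval)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			check()
		case <-stop:
			return
		}
	}
}

func main() {
	stop := make(chan struct{})
	go runInterestStateChecks(func() { fmt.Println("checking interest state") }, stop)
	time.Sleep(10 * time.Millisecond) // demo only; a real check fires every ~2.5 minutes
	close(stop)
}
```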

Signed-off-by: Derek Collison <derek@nats.io>
@derekcollison requested a review from a team as a code owner October 6, 2024 01:34
Member

@neilalexander left a comment


LGTM overall but just a question.

psi.fblk = i
// We only require read lock here as that is desirable,
// so we need to do this in a go routine to acquire write lock.
go func() {
Member


Do we need to guard against accidentally creating multiple of these goroutines for the same store?

Member Author


I don't think so, but I was thinking the same thing.

We essentially have three options.

  1. Promote to a write lock at the call sites, but I want LoadNextMsg() to be able to operate in parallel.
  2. Do not do any fixups to stale fblks in this function.
  3. The proposal above (we could modify it to funnel through a single goroutine, but this should not be common).

Member


I'm wondering if a simple CompareAndSwap and a deferred clear would be enough just to ensure only one fixup runs at a time for a given store. I'd be worried that we could end up with multiple of these doing the same work at the same time, which could compound the issue.
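For reference, a minimal sketch of that CompareAndSwap-plus-deferred-clear pattern, using a hypothetical fixupRunning field and method name rather than the actual filestore code:

```go
// Hypothetical sketch of guarding the background fixup with CompareAndSwap.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type fileStore struct {
	fixupRunning atomic.Bool // guards the background fblk fixup
}

// maybeFixupFblks launches at most one fixup goroutine per store at a time;
// calls made while one is already running are no-ops.
func (fs *fileStore) maybeFixupFblks(wg *sync.WaitGroup) {
	// Only proceed if we are the caller that flipped false -> true.
	if !fs.fixupRunning.CompareAndSwap(false, true) {
		return
	}
	wg.Add(1)
	go func() {
		defer wg.Done()
		// Deferred clear allows a later call to start a fresh fixup.
		defer fs.fixupRunning.Store(false)
		fmt.Println("running fblk fixup")
	}()
}

func main() {
	var fs fileStore
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		fs.maybeFixupFblks(&wg) // extra calls return immediately while one runs
	}
	wg.Wait()
}
```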

Member Author


It could be for a different set of PSIM entries though, so a simple boolean state would not suffice IMO.

Member


Good point, we could go with this for now and keep an eye on the goroutines. It may not be an issue.

Member Author


I think the worst case is that we could duplicate work, but not invalidate state. Hence the checks to only move it forward.
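A tiny sketch of that forward-only guard, with hypothetical names modeled on the psi.fblk assignment quoted above: racing fixups may duplicate work, but none of them can move the value backwards.

```go
// Hypothetical sketch of the "only move it forward" check; not the real code.
package main

import (
	"fmt"
	"sync"
)

type psi struct {
	mu   sync.Mutex
	fblk uint64 // first block that may contain this subject
}

// advanceFblk only ever increases fblk, so a duplicate or stale fixup
// computed from older information is simply ignored.
func (p *psi) advanceFblk(i uint64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if i > p.fblk {
		p.fblk = i
	}
}

func main() {
	p := &psi{fblk: 3}
	var wg sync.WaitGroup
	for _, i := range []uint64{7, 5, 9, 2} { // racing fixups with differing results
		wg.Add(1)
		go func(i uint64) {
			defer wg.Done()
			p.advanceFblk(i)
		}(i)
	}
	wg.Wait()
	fmt.Println("fblk:", p.fblk) // always ends at 9, never moves backwards
}
```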

Member

@neilalexander left a comment


LGTM

@derekcollison merged commit c9d0a12 into main Oct 8, 2024
5 checks passed
@derekcollison deleted the check-interest-state branch October 8, 2024 02:18
neilalexander pushed a commit that referenced this pull request Oct 8, 2024
neilalexander added a commit that referenced this pull request Oct 9, 2024
Includes the following:

- #5944
- #5945
- #5939
- #5935
- #5960
- #5970
- #5971
- #5963
- #5973
- #5978

Signed-off-by: Neil Twigg <neil@nats.io>
wallyqs pushed a commit that referenced this pull request Oct 13, 2024