Issues regarding to SSE events #336

iandyh · 2018-02-01T08:13:08Z

There are some issues with SSE events that I like to discuss because I don't know how it can be moved on.

Currently go-marathon is using https://github.com/donovanhide/eventsource as underlying client. However, the client itself has some issues. For example, Check if a stream is closed just before writing to the stream channels donovanhide/eventsource#33. It seems like it might take some time to get it fixed. Rewriting the client will require some changes to go-marathon as well. This will bring to the second topic
When go-marathon is disconnected from the server, it should immediately return to the application side and let the application side refresh the states in Marathon. Simply retrying, which is the current strategy, will result the application missing the events and eventually out-sync with states in Marathon. In an extreme case, due to network issue, the application can be disconnected from Marathon for long time and when retry finally succeeds, the local state is already largely behind.
So the client also needs to expose an errors channel to the application side.
Another issue is potentially, the order of the events received in each listener can be different from the events received from Marathon. I mentioned the issued in this closed pr: expose orignal events queue to applications #270 But at the time, I did not spend too much time on it because our app does not require the order to be correct but it's no longer the case.

But before fix issue 3 I really wish issue 1 and 2 can be solved because currently I am using a non-elegant fix in my private fork. What are your guys' opinions on the issues?

The text was updated successfully, but these errors were encountered:

timoreimann · 2018-02-07T00:34:35Z

Hey @iandyh, thanks for your questions. Here are my thougths:

I can see two options here: a) Invest into the eventsource library to fix the issues that we need to get fixed; or b) try to find another library. I can't tell which of the two options might be better; from my experience though, getting feature requests and bug fixes into eventsource has never been a problem from a collaborative perspective. So when I'm doubt, I'm inclined to stick with eventsource and drive any changes we need upstream.
Agreed that we need a way to expose the "event stream was broken" signal to the caller, and returning the error would be the most straight-forward way. I was shortly wondering if we could accept a callback function that'd be called every time the stream fails, so clients get a hook to catch up on missing events on their own while go-marathon continues to do retries under the hood. This will likely be more complex to implement, however, and I'm not sure if the benefit would truly out-weight the costs. Either way, the existing approach of doing retries transparently should continue to be possible as certain clients are fine with missing events.
Definitely another issue we should tackle.

Feel free to work on any of the challenges you mentioned and submit PRs once you think we've reached a reasonable design.

iandyh · 2018-02-08T02:25:49Z

Hi @timoreimann, Thanks for the response.

Speaking 1 and 2, I'll still argue having timeout is the easiest way to solve the problem. The implementation will also be much easier from go-marathon point of view. A likely solution will be like

if time.After(10 * time.Second) {
    stream.Close()
}
// Return error to application side

Regarding to issue 3, somehow I cannot reproduce the issue anymore. But I found a relevant discussion on stackoverflow: https://stackoverflow.com/questions/25860633/order-of-goroutine-unblocking-on-single-channel

The most straightforward way to solve would be replacing goroutings with blocking calls. So if one listener is blocked, the rest of listeners are also blocked:

if event.ID&context.filter != 0 {
        context.completion.Add(1)
        defer context.completion.Done()
	select {
		case ch <- event:
		case <-context.done:
			// Terminates goroutine.
	}
}

I've been thinking about this for a while, this seems to be the only solution I can come up with.

timoreimann · 2018-02-16T22:07:38Z

hey @iandyh

Speaking 1 and 2, I'll still argue having timeout is the easiest way to solve the problem.

Hm, not sure if I can follow you: 1. seems to be about the eventsource library in general, and 2. about exposing stream errors from the library. Are you suggesting to generate an error artifically after some static time interval?

Re: 3: How about using a channel buffer with a configurable size (so that the impact of slow consumers is limited)?
A further extension could be to use a ring buffer with either a LIFO or FIFO strategy so that senders never block. Existing libraries are available.

iandyh · 2018-02-28T03:32:49Z

Are you suggesting to generate an error artifically after some static time interval?

Yes, and during the interval, the event source library should do the retry. Moreover, I also think go-marathon should set a configurable timeout here and simply close the connection after timeout and return errors to the application. The application needs to refresh Marathon state and re-initiate the SSE connection.

Regarding to issue 3. Not sure about it. Probably it's overkilled? Because it would be probably easier to have its own SSE connection for each consumer.

timoreimann · 2018-03-01T22:57:00Z

Are you suggesting to generate an error artifically after some static time interval?

Yes, and during the interval, the event source library should do the retry. Moreover, I also think go-marathon should set a configurable timeout here and simply close the connection after timeout and return errors to the application. The application needs to refresh Marathon state and re-initiate the SSE connection.

In general, I'd prefer to strive for a deterministic solution where errors are generated and exposed exactly when an issue has knowingly been detected. Relying on timeouts feels like a workaround that may exacerbate attaining properties like correctness and performance on the user's end. But maybe I'm still missing a point here?

As far as eventsource is concerned, the maintainer has expressed willingness to continue accepting PRs. As mentioned, I'm also open to consider other libraries if they happen to provide the features we are looking for.

There's still a chance I might be misunderstanding. Let me know in this case, either by further discussion or a small PoC PR.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issues regarding to SSE events #336

Issues regarding to SSE events #336

iandyh commented Feb 1, 2018 •

edited

Loading

timoreimann commented Feb 7, 2018

iandyh commented Feb 8, 2018

timoreimann commented Feb 16, 2018

iandyh commented Feb 28, 2018

timoreimann commented Mar 1, 2018

Issues regarding to SSE events #336

Issues regarding to SSE events #336

Comments

iandyh commented Feb 1, 2018 • edited Loading

timoreimann commented Feb 7, 2018

iandyh commented Feb 8, 2018

timoreimann commented Feb 16, 2018

iandyh commented Feb 28, 2018

timoreimann commented Mar 1, 2018

iandyh commented Feb 1, 2018 •

edited

Loading