-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add retries to apple push notifications #128
Conversation
server/apple_notification_server.go
Outdated
// Set the retry context timeout to a value that allow us to retry the notification | ||
// the MAX RETRIES with exponential backoff. With default values, this timeout | ||
// will be 5 seconds. | ||
retryContextTimeout := me.sendTimeout / (MAX_RETRIES * 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The retry value would be 60 secs? So the timeout for the context here is 10 secs ? Just wondering is that what you intended
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
me.sendTimeout
is by default 30s. MAX_RETRIES
is 3.
30 / (3* 2) = 5 seconds
So... worst case scenario with the defaults...
- 5 seconds trying the first time and fail because the context exceeded the deadline
- Wait 1 second
- 5 seconds trying the second time and fail because the context exceeded the deadline
- Wait 2 seconds
- 5 seconds trying the third time and fail because the context exceeded the deadline
- Return error
A total of 18s, more than enough to fit in the 30s timeout.
I can do something more hardcoded, or tweak the numbers if we prefer. I don't have a strong opinion on these.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we confident that 5 seconds is enough to set the timeout for the apple client call? It seems we are going from 30 secs per call to only 5 secs per call. In the dashboard we saw there were times it would take up to 10 secs.
Just trying to understand why we are trying to fit all the retry logic within the initial configuration of time out per send?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. The sendTimeout is the timeout for a single send. It should not be made to limit the time including retries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My worry is that 30 seconds 3 times is +90 seconds the server waiting for the push proxy response. The server problably has also a context deadline, so not sure if we want to wait that long.
But if you have a strong feeling we should just use the configured timeout for each retry, I am happy to do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's about semantics. The sendTimeout should be the timeout for a single request. If you want to have a separate timeout including all the retries, then that should be a separate config setting.
The server timeout is hardcoded to 30s currently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a new config, and use both to properly handle this. I set it to 8 seconds, since it is the highest integer we can use that will fit comfortably in the 30 seconds (8*3 + 1 + 2 = 27s).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think retries are better implemented with a roundTripper as done here: https://github.com/hashicorp/go-retryablehttp/blob/main/roundtripper.go. Or perhaps, we can simply use that library to do the retry.
server/apple_notification_server.go
Outdated
// Set the retry context timeout to a value that allow us to retry the notification | ||
// the MAX RETRIES with exponential backoff. With default values, this timeout | ||
// will be 5 seconds. | ||
retryContextTimeout := me.sendTimeout / (MAX_RETRIES * 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. The sendTimeout is the timeout for a single send. It should not be made to limit the time including retries.
Also, note that the server might return a 429: https://developer.apple.com/documentation/usernotifications/handling-notification-responses-from-apns#Interpret-header-responses, in which case we should slow down, and retrying further might worsen the condition. This will require usage of a rate limiter. Probably something to do as a future improvement. |
@agnivade A 429 is handled by the library, and doesn't become an error, but a "not sent" response with a |
My problem with this approach is that the library already does plenty of stuff that I don't want to mess with. So I don't need the retry at the http client level, but higher. My idea is mainly retry when there is a network hiccup. Most of the errors we have seen like this are context deadline errors, which I don't think are automatically retried in these circumstances either. |
That's okay. Not a big deal. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @larkox!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small thing to verify in config.
if cfg.RetryTimeoutSec == 0 { | ||
cfg.RetryTimeoutSec = 8 | ||
} | ||
|
||
if cfg.EnableFileLog { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add a check here to verify that the retryTimeout isn't greater than the sendTimeout.
start := time.Now() | ||
|
||
retryContext, cancelRetryContext := context.WithTimeout(generalContext, me.retryTimeout) | ||
defer cancelRetryContext() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a minor note: having defer
s in a for loop has a slight problem that it will keep getting accumulated until the whole function exits. But since this loop has a hardcoded upper bound, this is okay. Otherwise, it would be recommended to cancel as soon as one iteration finishes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For readability sake, I feel it is better to keep it as is. As you said, the loop is bound, so it should not be a big issue.
Summary
Add retries to apple notifications. This may solve some
RequestError
errors we see in the metrics.Ticket Link
https://mattermost.atlassian.net/browse/MM-60572