Gracefully handle termination and unexpected messages #17
Conversation
Hey Matt! Thank you for your PR.
You're absolutely right. I wound up adding this exact change to a PR I've been working on, but I think I'd prefer to merge your stuff and rebase my work on your changes.
Nice! Admittedly I don't use the Statsd parts of Peep anymore since changing jobs a bit over a year ago, but I'm a bit surprised I didn't run into this issue running Peep in containers in k8s in the past.

I'm not sure I agree with the approach of logging the unexpected message. My main concern is that a user of Peep can't really do anything with these log messages, particularly in the case of this bug you're seeing in Erlang/OTP, other than forking Peep and removing the Logger call. I think I would prefer if a Peep process either silently discards all unknown messages, or it handles the specific …
Fair point. I can update the PR to remove the logging, and maybe drop a comment in there instead? It's possible we're hitting this because the …
Sounds good! A short comment explaining why Peep sinkholes unknown messages might be useful to curious readers. Yeah, I read the issue you linked to. That does seem rather low. Nevertheless, this raises misbehaviour(s) in Peep that shouldn't happen, so I still think it's important to merge your changes, even if you manage to fix the underlying issue.
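For anyone skimming this thread later, the "sinkhole with a comment" approach being agreed on here might look something like the following hypothetical sketch (the module name is illustrative, not Peep's actual code):

```elixir
defmodule SinkholeExample do
  use GenServer

  def start_link(state), do: GenServer.start_link(__MODULE__, state)

  @impl true
  def init(state), do: {:ok, state}

  # Deliberately discard unexpected messages instead of logging them.
  # Spurious messages can arrive (e.g. via erlang/otp#8989), and a
  # missing handle_info clause would crash the server and drop its state.
  @impl true
  def handle_info(_unexpected, state), do: {:noreply, state}
end
```

The comment in place of a `Logger` call gives curious readers the context without producing log noise the user can't act on.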
Force-pushed from `1358fa8` to `2252345`.
Alright, I replaced the logs with a comment that hopefully isn't too verbose :) I linked back to this PR and the OTP issue in case someone wants more context in the future. And yeah, I agree that it's best for Peep to be resilient to this kind of thing even once the bug that surfaced it is resolved. Thanks for being so responsive!

Thank you! <3

Quick question: do you think you'll be cutting a new Hex release with this soon (within the day)? No rush or anything, just wondering if I should update our app to point at main or wait for the release.

Nevermind, you just did! Thank you very much!
We've recently tried to adopt Peep as a replacement for TelemetryMetricsStatsd. We just switched to using a UDS to send metrics to Datadog, and in the process found this bug in OTP: erlang/otp#8989. We optimistically hoped Peep's batching would avoid this issue.
It turned out to still hit that issue, which is fine: that's not Peep's bug. But the way it affected Peep was very destructive: when Peep received an unexpected message, it brought down the entire application! 😱
I believe the reason for this is that the error raised from Peep not having a matching `handle_info` clause for the message caused the Peep server to terminate for a reason other than `:shutdown`, which is the only reason its `terminate` callback is able to handle. So another function clause error is raised, this time while handling the trapped exit, and I think this error is what brings the whole application crashing down. I'm not an OTP expert (yet), but my intuition is that an error raised while handling a trapped exit breaks the normal restarting that would be handled by the server's supervisor, and it cascades up from there.

Even if you're not hitting the OTP bug mentioned above, this behavior is very easy to reproduce: look up the PID for the `name` used for Peep with `GenServer.whereis` and then `send` it an arbitrary message it's not expecting.

So I've made two changes here that seem to address this issue for us. First, accept any reason for `terminate` rather than just `:shutdown`. It seems to be very important for `terminate` to always succeed, and I don't see a need to handle any specific exit reason differently.

That change is sufficient to make the Peep server restart correctly when it receives unexpected messages, and presumably if it were to exit for any other unexpected reason. But given that we are still being hit by that OTP bug, we would still rather the server not restart on us. So I also introduced a catch-all `handle_info` clause that logs a warning for unexpected messages, which allows the server to keep running. This avoids it dropping its ETS table when we hit this bug. I can understand if you wouldn't want to accept that commit, though, on the basis that it's better to crash early when unexpected things happen.

Thanks for your work and thank you for your time!