Adding Slack Socket level ping/pong #226
Looks like a solid start. Thanks!
Thanks for the feedback @dblock! I've been thinking of ways to refactor this, as well as chatting with a few other engineers at Slack, and I wanted to get your perspective on a new approach. Taking your feedback into consideration, what do you think about implementing an approach like this?

    def ping(delay, task, ping_id = 0)
      unless @alive
        disconnect! if connected?
        return close
      end
      ping_data = { type: 'ping', id: ping_id }
      send_data(ping_data.to_json)
      @alive = false
      task.sleep delay
      ping(delay, task, ping_id + 1)
    end

This would require us to update the following:

    def connect!
      super do |driver|
        driver.on :message do |event|
          event_data = JSON.parse(event.data)
          @alive = true if event_data['type'] == 'pong'
        end
        yield driver
      end
      run_loop
    end

Lastly we'd ensure that this was run on a different block from the client's run_loop following the example here.

    def start_async(client)
      Thread.new do
        ::Async::Reactor.run do
          ::Async::Reactor.run do
            client.run_loop
          end
          ::Async::Reactor.run do |task|
            @alive = true
            ping(task)
          end
        end
      end
    end

What do you think?
Looks reasonable, and I would merge support for just one of the async implementations for this. I don't see it being difficult to generalize this to all 3 (EventMachine, Celluloid, Async), so I encourage you to implement this all the way for one, then expand from there.
lib/slack/real_time/socket.rb
        sleep delay
      end

      ping(delay, ping_id + 1, &block)
This is going to blow up the stack eventually, won't it?
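The stack-depth concern can be reproduced outside the PR. The sketch below uses hypothetical stand-ins (`recursive_ping`, `looping_ping`, `stack_overflow?`), not the PR's code, and assumes MRI with default settings, where tail calls are not optimized away.

```ruby
# Hypothetical stand-in for a ping worker that re-invokes itself once per ping.
def recursive_ping(remaining, ping_id = 0)
  return ping_id if remaining.zero?

  recursive_ping(remaining - 1, ping_id + 1) # each call adds a stack frame
end

# Same counting logic written as a loop: stack depth stays constant.
def looping_ping(count)
  ping_id = 0
  count.times { ping_id += 1 }
  ping_id
end

# Returns true when the recursive variant exhausts the stack.
def stack_overflow?(count)
  recursive_ping(count)
  false
rescue SystemStackError
  true
end
```

A long-lived connection pings indefinitely, so the recursive shape eventually hits the limit, while the loop can run for as long as the process does.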
@RodneyU215 This is a lot better.
          client.run_loop
        end
        ::Async::Reactor.run do |task|
          ping do |delay|
Maybe model this like run_loop, as a run_ping: something that pings every once in a while and recovers from errors.
Can you expand on the types of errors you'd like to recover from as well as what the expected recover steps would be?
It can really be anything, but from looking at the code, if send_data fails in any way the ping worker dies.
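A minimal sketch of a ping loop that survives a failing send_data is shown here. `FlakySocket`, the `attempts:` parameter, and the choice of `IOError` are assumptions made for illustration; the real driver may raise different errors.

```ruby
require 'json'

# Hypothetical socket stand-in whose send_data fails on selected calls.
class FlakySocket
  attr_reader :sent

  def initialize(fail_on:)
    @fail_on = fail_on
    @calls = 0
    @sent = []
  end

  def send_data(payload)
    @calls += 1
    raise IOError, 'socket write failed' if @fail_on.include?(@calls)

    @sent << payload
  end
end

# Sends one ping per iteration; an error in one ping is recorded and the
# loop carries on instead of killing the worker.
def run_ping(socket, attempts:, errors: [])
  attempts.times do |i|
    begin
      socket.send_data({ type: 'ping', id: i + 1 }.to_json)
    rescue IOError => e
      errors << e.message
    end
    yield if block_given? # caller decides how to wait between pings
  end
  errors
end
```

Yielding between pings leaves the sleeping strategy to the caller, which is the same shape as passing a block that calls task.sleep.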
I think you're getting close @RodneyU215, nice work.
Thanks for all of the feedback you've given!
Upon closer inspection of the configuration, it appears that there's already an option for
I've added a check to see if
The
I'd like to talk a bit more about error handling. If the driver is unable to communicate with the server, I can only really see reestablishing the connection or exiting the application as the best next steps. In the current implementation, on subsequent unanswered pings we disconnect and close the Socket connection. So that when

To keep the application running and to reestablish the connection, what do you think about catching that error, running

    ::Async::Reactor.run do |task|
      begin
        client.run_ping do |delay|
          task.sleep delay
        end
      rescue Slack::RealTime::Socket::SocketNotConnectedError
        connect
        retry
      end
    end
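The rescue/reconnect/retry pattern can be tried in isolation. In this sketch `FakeClient`, `keep_pinging`, and the `max_reconnects` cap are hypothetical, and the error class is a local stand-in for the one named in the comment.

```ruby
# Local stand-in for the socket's "not connected" error.
class SocketNotConnectedError < StandardError; end

# Hypothetical client that drops its connection after a given number of pings.
class FakeClient
  attr_reader :connects, :pings

  def initialize(drop_at:)
    @drop_at = drop_at
    @connected = true
    @connects = 0
    @pings = 0
  end

  def connect
    @connected = true
    @connects += 1
  end

  def run_ping
    raise SocketNotConnectedError, 'socket is not connected' unless @connected

    @pings += 1
    @connected = false if @pings == @drop_at # simulate a dropped connection
  end
end

# Pings `rounds` times, reconnecting and retrying when the socket is down.
# The cap keeps a permanently broken connection from retrying forever.
def keep_pinging(client, rounds:, max_reconnects: 3)
  reconnects = 0
  done = 0
  begin
    while done < rounds
      client.run_ping
      done += 1
    end
  rescue SocketNotConnectedError
    raise if reconnects >= max_reconnects

    reconnects += 1
    client.connect
    retry
  end
  reconnects
end
```

Because `done` lives outside the `begin` block, `retry` resumes the count rather than starting over.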
@RodneyU215 Sounds like a good approach. Since this will be the same for every driver, see how you can sink this implementation into a shared class.
The websocket ping description is good enough. I guess before we were documenting it but it was actually never run :) I would change "should send frames" to a description, somewhere, of what actually happens now with this code.
lib/slack/real_time/socket.rb
        @alive = true

        driver.on :message do |event|
          event_data = JSON.parse(event.data)
I am not sure we do, but do we need to handle garbage from the server here?
It's a good idea, though I'd recommend handling that in a separate PR: one where Slack::Messages::Message is used and updated so that it takes care of any parsing errors that may occur.
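For reference, guarding against malformed frames could look roughly like this. `parse_event` and `pong?` are hypothetical helpers, not part of the library, and treating non-object JSON as garbage is an assumption.

```ruby
require 'json'

# Hypothetical helper: returns the parsed event hash, or nil for anything
# that is not a JSON object (malformed text, nil, arrays, bare values).
def parse_event(raw)
  data = JSON.parse(raw)
  data.is_a?(Hash) ? data : nil
rescue JSON::ParserError, TypeError
  nil
end

# True only for a well-formed event whose type is "pong".
def pong?(raw)
  event = parse_event(raw)
  !event.nil? && event['type'] == 'pong'
end
```

With a guard like this, a bad frame is ignored instead of raising inside the message handler.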
Okay, so I've rebuilt things a bit to implement the ability for the client to automatically reconnect on ping failures. It felt a bit like scope creep but overall I think it's a better implementation. I've tested things manually and it worked, but I'll add unit tests tomorrow. Let me know what you think.
I think that's what we want: taking away from the user any kind of worry that the connection goes away. Try making it work for EventMachine and Celluloid too; the retry pattern probably belongs inside a generic class.
lib/slack/real_time/socket.rb
            ping_data = { type: 'ping', time: Time.now, id: "p#{ping_int += 1}" }
            ping(ping_data)
          end
        end
Where do you call Thread#join? If two threads try to write to the socket at the same time, what happens when the data becomes interleaved?
The current PR version is no longer using a thread @ioquatix.
Sorry my bad.
No worries. Please do make suggestions about this implementation! There's a lot of looping around in that calling code :)
Using a thread to implement this is very risky.
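If a thread were used anyway, one common way to address the interleaving risk raised earlier is to serialize writes behind a mutex. This is a generic sketch, not something the PR does; `SerializedWriter` and its length-prefixed frame format are invented for the example.

```ruby
require 'stringio'

# Hypothetical wrapper that lets only one thread write a frame at a time.
class SerializedWriter
  def initialize(io)
    @io = io
    @lock = Mutex.new
  end

  # A frame is written as several separate IO calls; the mutex keeps the
  # pieces of one frame adjacent even when several threads write at once.
  def write_frame(payload)
    @lock.synchronize do
      @io.write("#{payload.bytesize}:")
      @io.write(payload)
      @io.write("\n")
    end
  end
end
```

This only protects writers that go through the wrapper; any code writing to the underlying IO directly would still be able to interleave.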
@@ -15,7 +15,22 @@ class Socket < Slack::RealTime::Socket
        def start_async(client)
          Thread.new do
            ::Async::Reactor.run do
You can simplify this code with something like

    ::Async::Reactor.run do |task|
      task.async do
        # nested task
      end
    end
Great point! I'll fix this.
Okay, I looked at what you are doing. I think there is a fundamental issue with how this code is structured, but that being said, here is a rough outline of what I'd do:

    class Client
      attr :last_message_at

      def time_since_last_message
        Async::Clock.now - @last_message_at
      end

      def run_loop
        on(:message) do
          @last_message_at = Async::Clock.now
        end
      end
    end

    class Socket
      def start_async
        ::Async::Reactor.run do |task|
          task.async do
            while @running # or true, some way to control this loop for graceful shutdown
              # Probably make this one method?
              client.build_socket
              begin
                client.run_loop
              rescue EOFError, Errno::EPIPE # etc.
                # Log and loop around again
              end
            end
          end

          task.async do |subtask|
            while true
              subtask.sleep delay
              client.ping # Will eventually trigger some kind of "pong".
              if client.time_since_last_message > (delay * 2)
                # Connection failed
                client.close
              end
            end
          end
        end
      end
    end
As a side note, you don't need to send the ping if some other message was received recently... you'd just dynamically adjust the delay, and then when you wake up, don't send a ping if something else was sent in the meantime. That avoids sending unnecessary junk.
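The "skip the ping if traffic was seen recently" idea can be expressed as a small pure function. `next_ping_action` and its return shape are hypothetical; the `delay * 2` dead-connection threshold follows the outline earlier in this thread.

```ruby
# Hypothetical helper: decides what the ping worker should do on wake-up.
# Returns [:wait, seconds] when traffic was seen recently, [:ping, delay]
# when the connection has been idle for a full delay, and [:dead, 0] when
# nothing has arrived for more than twice the delay.
def next_ping_action(now:, last_message_at:, delay:)
  idle = now - last_message_at
  if idle > delay * 2
    [:dead, 0]
  elsif idle >= delay
    [:ping, delay]
  else
    [:wait, delay - idle] # sleep only for the remainder of the window
  end
end
```

Keeping the decision separate from the sleeping and sending makes it testable without a reactor; timestamps here would come from a monotonic clock.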
@ioquatix Care to explain the fundamental issue? For my own education? I like avoiding unnecessary pings and just keeping the timestamp of the last message.
Okay, so the fundamental issue is the callback structure of the code, which makes it very hard to reason about. The main issue I have when reading the code is understanding the call flow and sequence of events that lead up to a particular line being executed. The solution that

The next step is to consider invariants and how they are maintained. For example, are your connections stateful or stateless, and where do you maintain that state? How is it affected by reconnections: is it modelled explicitly or implicitly (e.g. some kind of

You propose reconnecting to the same URL, but is this valid? Is it possible that, after 24 hours, that URL changes? Where in your API do you need to implement the "timeout/reconnect" loop? That loop probably needs to be far up the call tree. I see you try to restart the run loop in the exception handler, but then after that call retry. That code already seems wrong, do you want to

Ideally, you have some way to signal a disconnect from any part of the code and take appropriate action. If you are not careful, tracking state (which naturally includes which event handlers are registered) becomes difficult and buggy. Think about the following flow:

After the first reconnect, the initial events would be lost because the entire web socket is thrown away, no? If things are stateful, it's even more tricky to handle correctly.
FYI, I pushed that last commit before I had the chance to see all of this new feedback. @ioquatix Thanks for the added insight! Dynamically adjusting the delay (before sending out a ping), based on the last message, is a great suggestion. I also agree that using

In the code you suggested I noticed that you've chosen not to check for a

Other than that, it sounds like the only thing left is the logic to rebuild the web socket connection.
In the latest version I've pushed this logic into a method called
WebSocket Message Server URLs are only valid for 30 seconds, but because we call
If I'm understanding this correctly, the events registered by users are already stored in
lib/slack/real_time/socket.rb
      end

      def current_time
        Time.now
Time.now is not monotonic.
https://blog.dnsimple.com/2018/03/elapsed-time-with-ruby-the-right-way/
Running to work to fix all the instances of Time.now that I have all over the place :)
Haha, yeah, it's a common problem :p - common enough that I decided to add it to async. I might end up using hitimes to do it, which is supposedly more accurate, but for now the Process.clock_gettime method is okay.
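For anyone following along, measuring elapsed time against the monotonic clock in plain Ruby looks like this. `monotonic_now` and `elapsed_seconds` are hypothetical helper names; `Process.clock_gettime` and `Process::CLOCK_MONOTONIC` are part of Ruby's standard API.

```ruby
# Monotonic "now" in seconds. Unlike Time.now, it is not affected by
# wall-clock adjustments, so differences between readings are safe to use.
def monotonic_now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Hypothetical helper that times a block using the monotonic clock.
def elapsed_seconds
  started = monotonic_now
  yield
  monotonic_now - started
end
```

The absolute value has no meaning as a date; only differences between two readings are useful, which is all a ping timeout needs.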
Solid feedback! Thanks for the link dblock! I'll change that.
It's really not necessary to confirm it's received. If no message is received within a certain timeframe, then the connection is considered dead. You don't care what kind of message it is, it doesn't change the outcome or behaviour.
It looks like that code path has been removed. I was referring to 88ea7a5#diff-7de932a86b84fcf2a3e218dd5515f726L31
Ultimately, I think you just want to ensure all state is cleaned up correctly and that an entirely new instance is re-created. Because you are supporting different concurrency models, this is tricky. You are using a
@dblock I've addressed the outstanding feedback and added a few unit tests. The only thing I see to expand on is updating the
While I'd be happy to help with that, I believe it may be best for a future PR. This one has been open for a bit, and I believe it should help out a lot of people who may currently be experiencing disconnects with the client's default
Would this work for you?
You might like to add this to your specs: https://github.com/socketry/async-rspec/blob/master/lib/async/rspec/leaks.rb It detects leaks. It's not async specific, it works with any spec that opens/closes IOs.
Code looks great to me. I am ready to merge this with either:
1. Document in README/CHANGELOG that ping was added for Async only, not for EM/Celluloid, and open an issue to implement those
2. Implement EM/Celluloid
I would vote for 2, cause I have faith in you @RodneyU215, but I'll take 1 too ;)
CHANGELOG.md
    * Your contribution here.

    ### 0.13.2 (2018/10/16)
This hasn't shipped, so please put this back, we do this during release.
Whoops! I'll fix it now.
Let's definitely add this in another PR! #230
@ioquatix Appreciate another code review of the state of things here when you get a chance. Thank you!
@dblock I'm not going anywhere! 😊 I'd like to help get EM/Celluloid implemented too. We'll have more time to jump on it after this tour around Europe.
I've merged this, thanks for your time @RodneyU215. Let's try to get EM/Celluloid in soon. I'll spend some time upgrading one of my bots to this and see if this works in production, too.
@RodneyU215 Can you take a look at the broken build in https://travis-ci.org/slack-ruby/slack-ruby-client/jobs/442629093? We run integration tests with a SLACK_API_TOKEN on master only, so this wasn't surfaced in this PR.
I'm looking into this now!
It looks like at least one aspect is that you are missing
I've released:
Let's see how this puppy works and be attentive to fixing new bugs. Appreciate everyone's hard work!
Summary
This PR attempts to add websocket-level ping monitoring (#223):
- heartbeat following Slack's RTM guidelines. See "Ping and Pong".
- heartbeat 30 should likely be configurable.
It's not quite "shippable" yet, but I'd love some feedback on this.
Requirements