TCP Connection Error Always Breaks (Closed or Refused) #263
@bhairavmehta95 Can you provide the exact error message that you are getting?
Unfortunately, I don't believe it is just a port allocation problem, as I am only running the servers on free ports in the 9000+ range (not the 2000 range).
@bhairavmehta95 Could you post the server log when a disconnect occurs?
These are the last few lines; nothing out of the ordinary, it just doesn't move forward from here.
There is no disconnection in that fragment; if it stops printing output there, it usually means that it's running normally. Are you sure there is no other client running on that same port? Maybe you can find somewhere in the log where a message similar to this appears?
I could not; unfortunately, that is the end of the log. The other thing is that on the client side, I get errors that look like this:
which makes me believe that the connection was actually closed.
For now, I have just been restarting the server in an asynchronous way so that my training does not crash the moment a TCP connection is closed / broken / refused, but I would also like to find out if there is a way to keep the connection open more reliably.
Can you see the server window or are you running off-screen? Can it be that the server hangs? It would make sense that it works running a single instance of Carla but it crashes/hangs with multiple instances (running out of GPU memory or some other issue). |
I'm running off-screen. I thought of that, but I'm running on a server with 256 GB of GPU memory, and I've made sure to distribute the load equally across all four GPUs. I don't believe that is the case, as each card has 64 GB and each render process seems to take at most ~1.5 GB. Did your team ever have these issues when running the A3C experiments in the paper?
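For context, the multi-GPU launch pattern from the headless setup docs looks roughly like this (a minimal sketch, assuming CARLA 0.8.x; the map, ports, and GPU indices are illustrative):

```python
import os
import subprocess

# Sketch: one off-screen CARLA 0.8.x server per GPU, following the headless
# setup docs. The map, ports, and GPU indices here are illustrative.
for gpu, port in enumerate([9000, 9004, 9008, 9012]):
    env = dict(os.environ,
               SDL_VIDEODRIVER='offscreen',     # render without a display
               SDL_HINT_CUDA_DEVICE=str(gpu))   # pin this server to one GPU
    subprocess.Popen(
        ['./CarlaUE4.sh', '/Game/Maps/Town01',
         '-carla-server', '-world-port={}'.format(port)],
        env=env)
```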
We saw crashes when several instances of CARLA run on the same GPU (more than 4 or so), even if there is enough memory available, but we haven't observed issues like this. |
I have similar problems. I noticed that before the time-out exception is thrown, all the simulators freeze as if they were waiting for the next action in synchronous mode. @nsubiron
I've seen this issue when there is some error at the GPU level: suddenly all the GPU-dependent apps freeze. If the simulators freeze for longer than the time-out (10 s by default), the connection is closed. It's going to be very difficult to track down this bug; it only seems to happen on some GPUs, usually when running more than one simulator.
The client connects through TCP, so a packet should not be lost without raising an error [1]. Though if there were big latency, the error would be "connection timed out".
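On the client side of the 0.8.x Python API, the timeout can be raised when the client is created (a minimal sketch; the host, port, and timeout value are placeholders, and the default may differ between versions):

```python
from carla.client import make_carla_client

# Sketch: a longer client-side timeout gives a briefly frozen simulator more
# time to respond before the TCP connection is declared dead.
with make_carla_client('localhost', 9000, timeout=60) as client:
    pass  # run episodes here
```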
Hi guys, I had the same problem. I think it happens because the client is trying to read data from the server while the server is not ready yet. I solved this by adding a sleep before the client reads data from a new episode; 2-3 seconds works perfectly for me.
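A minimal sketch of that workaround with the 0.8.x client (the port and episode index are placeholders):

```python
import time
from carla.client import make_carla_client
from carla.settings import CarlaSettings

with make_carla_client('localhost', 9000) as client:
    scene = client.load_settings(CarlaSettings())
    client.start_episode(0)  # request a new episode
    time.sleep(3)            # give the server time to finish loading
    measurements, sensor_data = client.read_data()
```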
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
This should not be an issue in 0.9.0, the protocol should be able now to recover from a broken connection without restarting the whole simulation. |
I am running CARLA across 4 GPUs on a server, following the setup documentation, and using them to generate experience for a reinforcement learning agent.
My main issue is that during training, the server seems to close the connection (not necessarily at the beginning of training, but after approximately 12K timesteps), despite having both the client and server timeouts set to extremely high values. The interesting thing is that if I don't run this across multiple GPUs, it never seems to close.
My code used to look like this:
But I would always get a TCP error on the start_episode line. Using some of the work done by NervanaSystems on their CARLA wrapper, I changed my code to look similar to theirs (i.e., reconnect if you get a TCP error on start_episode), but since the connection is either closed or refused, this also times out and then my environment crashes, which stops training for all of my agents.
I am using 8 workers of PPO, a synchronous RL algorithm. I know A3C as described in the paper would be able to get around this problem by restarting the server and then reconnecting the client without interrupting the training of the other agents, thanks to the asynchrony. Is there anything that can be done about this? I am not sure what else I could be doing to help with this problem, so I wanted to post this and see if anyone could find some incorrect logic in what I am doing. (This code lives in the reset function of my agent environment.)
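For reference, a minimal sketch of the reconnect-on-error pattern described above, using the 0.8.x PythonClient (`TCPConnectionError` comes from `carla.tcp`; the retry count, settings, and start index are placeholders, and `reset` is a hypothetical helper, not code from this issue):

```python
from carla.client import CarlaClient
from carla.settings import CarlaSettings
from carla.tcp import TCPConnectionError

def reset(client, settings, player_start=0, retries=3):
    """Start a new episode, reconnecting if the TCP link has broken."""
    for _ in range(retries):
        try:
            client.load_settings(settings)
            client.start_episode(player_start)
            return client.read_data()  # first observation of the episode
        except TCPConnectionError:
            # Connection closed/refused: reconnect and retry. If the server
            # process itself died, it has to be relaunched externally first.
            client.disconnect()
            client.connect(connection_attempts=10)
    raise TCPConnectionError('could not start an episode after {} retries'.format(retries))

# Usage (placeholders):
client = CarlaClient('localhost', 9000, timeout=60)
client.connect()
measurements, sensor_data = reset(client, CarlaSettings())
```

Note that if the server process has hung or died (as discussed above), reconnecting alone will keep timing out; the retry loop only helps when the server is still alive and the connection itself was dropped.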