TCP Connection Error Always Breaks (Closed or Refused) #263
@bhairavmehta95 Can you provide the exact error message that you are getting?
Unfortunately, I don't believe it is just a port allocation problem, as I am only running the servers on free ports in the 9000+ range (not the 2000 range).
@bhairavmehta95 Could you post the server log when a disconnect occurs?
These are the last few lines; nothing out of the ordinary, it just doesn't move forward from here.
There is no disconnection in that fragment; if it stops printing output there, it usually means that it's running normally. Are you sure there is no other client running on that same port? Maybe you can find somewhere in the log where a message similar to this appears?
I could not; unfortunately, that is the end of the log. The other thing is that on the client side, I get errors that look like this:
which makes me believe that the connection was actually closed.
For now, I have just been restarting the server in an asynchronous way so that my training does not crash the moment a TCP connection is closed / broken / refused, but I would also like to find out if there is a way to keep the connection open more reliably.
Can you see the server window or are you running off-screen? Can it be that the server hangs? It would make sense that it works running a single instance of Carla but it crashes/hangs with multiple instances (running out of GPU memory or some other issue). |
I'm running off-screen. I thought of that, but I'm running on a server with 256 GB of GPU memory, and I've made sure to distribute the load equally across all four GPUs. I don't believe that is the case, as each card has 64 GB and each render process seems to take at most ~1.5 GB. Did your team ever have these issues when running the A3C experiments in the paper?
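For context, the multi-GPU launch pattern from the headless setup docs looks roughly like this (a minimal sketch, assuming CARLA 0.8.x; the map, ports, and GPU indices are illustrative):

```python
import os
import subprocess

# Sketch: one off-screen CARLA 0.8.x server per GPU, following the headless
# setup docs. The map, ports, and GPU indices here are illustrative.
for gpu, port in enumerate([9000, 9004, 9008, 9012]):
    env = dict(os.environ,
               SDL_VIDEODRIVER='offscreen',     # render without a display
               SDL_HINT_CUDA_DEVICE=str(gpu))   # pin this server to one GPU
    subprocess.Popen(
        ['./CarlaUE4.sh', '/Game/Maps/Town01',
         '-carla-server', '-world-port={}'.format(port)],
        env=env)
```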
We saw crashes when several instances of CARLA run on the same GPU (more than 4 or so), even if there is enough memory available, but we haven't observed issues like this. |
I have similar problems. I noticed that before the time-out exception is thrown, all the simulators freeze as if they were waiting for the next action in synchronous mode. @nsubiron
I've seen this issue when there is some error at the GPU level: suddenly all the GPU-dependent apps freeze. If the simulators freeze for longer than the time-out (10 s by default), the connection is closed. It's going to be very difficult to track down this bug; it only seems to happen on some GPUs, usually when running more than one simulator.
The client connects through TCP, so a packet should not be lost without raising an error [1]. Though if there were big latency, the error would be "connection timed out".
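On the client side of the 0.8.x Python API, the timeout can be raised when the client is created (a minimal sketch; the host, port, and timeout value are placeholders, and the default may differ between versions):

```python
from carla.client import make_carla_client

# Sketch: a longer client-side timeout gives a briefly frozen simulator more
# time to respond before the TCP connection is declared dead.
with make_carla_client('localhost', 9000, timeout=60) as client:
    pass  # run episodes here
```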
Hi guys, I had the same problem. I think it happens because the client is trying to read data from the server while the server is not ready yet. I solved this by adding a sleep before the client reads data from a new episode; 2-3 seconds works perfectly for me.
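A minimal sketch of that workaround with the 0.8.x client (the port and episode index are placeholders):

```python
import time
from carla.client import make_carla_client
from carla.settings import CarlaSettings

with make_carla_client('localhost', 9000) as client:
    scene = client.load_settings(CarlaSettings())
    client.start_episode(0)  # request a new episode
    time.sleep(3)            # give the server time to finish loading
    measurements, sensor_data = client.read_data()
```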
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
This should not be an issue in 0.9.0, the protocol should be able now to recover from a broken connection without restarting the whole simulation. |
I am running CARLA across 4 GPUs on a server, following the setup documentation, and using them to generate experience for a reinforcement learning agent.
My main issue is that during training, the server seems to close the connection (not necessarily at the beginning of training, but after approximately 12K timesteps), despite having both the client and server timeouts set to extremely high values. The interesting thing is that if I don't run this across multiple GPUs, it never seems to close.
My code used to look like this:
But I would always get a TCP error on the start_episode line. Using some of the work done by NervanaSystems on their CARLA wrapper, I changed my code to look similar to theirs (i.e., reconnect if you get a TCP error on start_episode), but since the connection is either closed or refused, this also times out and then my environment crashes, which stops training for all of my agents.
I am using 8 workers of PPO, a synchronous RL algorithm. I know A3C as described in the paper would be able to get around this problem by restarting the server and then reconnecting the client without interrupting the training of the other agents, thanks to the asynchrony. Is there anything that can be done about this? I am not sure what else I could be doing to help with this problem, so I wanted to post this and see if anyone could find some incorrect logic in what I am doing. (This code lives in the reset function of my agent environment.)
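For reference, a minimal sketch of the reconnect-on-error pattern described above, using the 0.8.x PythonClient (`TCPConnectionError` comes from `carla.tcp`; the retry count, settings, and start index are placeholders, and `reset` is a hypothetical helper, not code from this issue):

```python
from carla.client import CarlaClient
from carla.settings import CarlaSettings
from carla.tcp import TCPConnectionError

def reset(client, settings, player_start=0, retries=3):
    """Start a new episode, reconnecting if the TCP link has broken."""
    for _ in range(retries):
        try:
            client.load_settings(settings)
            client.start_episode(player_start)
            return client.read_data()  # first observation of the episode
        except TCPConnectionError:
            # Connection closed/refused: reconnect and retry. If the server
            # process itself died, it has to be relaunched externally first.
            client.disconnect()
            client.connect(connection_attempts=10)
    raise TCPConnectionError('could not start an episode after {} retries'.format(retries))

# Usage (placeholders):
client = CarlaClient('localhost', 9000, timeout=60)
client.connect()
measurements, sensor_data = reset(client, CarlaSettings())
```

Note that if the server process has hung or died (as discussed above), reconnecting alone will keep timing out; the retry loop only helps when the server is still alive and the connection itself was dropped.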