
Race condition in connection disconnect #732

Closed
jpulec opened this issue May 6, 2016 · 32 comments
@jpulec

jpulec commented May 6, 2016

It seems like there is a race condition in the code for Connection.disconnect(). I run Redis as a Celery broker, and I seem to be running into a situation where I get an AttributeError: 'NoneType' object has no attribute 'close'. It would appear that one caller of disconnect() is setting self._sock to None while another has already passed the if statement that checks whether self._sock is None.

    def disconnect(self):
        "Disconnects from the Redis server"
        self._parser.on_disconnect()
        if self._sock is None:
            return
        try:
            self._sock.shutdown(socket.SHUT_RDWR)
            self._sock.close()
        except socket.error:
            pass
        self._sock = None
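
For illustration, here is a minimal standalone sketch (not redis-py code) of the same check-then-use race: two threads share one connection object, and the window between the None check and the close() call is where the AttributeError can surface. It may take a few runs to observe.

    import threading

    class FakeSocket(object):
        def close(self):
            pass

    class FakeConnection(object):
        def __init__(self):
            self._sock = FakeSocket()

        def disconnect(self):
            if self._sock is None:
                return
            # another thread can be scheduled right here and set self._sock = None
            self._sock.close()
            self._sock = None

    conn = FakeConnection()
    races = []

    def hammer():
        for _ in range(200000):
            try:
                conn.disconnect()
            except AttributeError as exc:  # 'NoneType' object has no attribute 'close'
                races.append(exc)
            conn._sock = FakeSocket()      # simulate a reconnect

    threads = [threading.Thread(target=hammer) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(len(races), "races observed")
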
@rolette

rolette commented May 27, 2016

Here's part of a backtrace showing basically the same problem, but with a SentinelManagedConnection:

File "/usr/local/lib/python3.5/site-packages/redis/client.py", line 880, in get
    return self.execute_command('GET', name)
  File "/usr/local/lib/python3.5/site-packages/redis/client.py", line 572, in execute_command
    connection.send_command(*args)
  File "/usr/local/lib/python3.5/site-packages/redis/connection.py", line 563, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/usr/local/lib/python3.5/site-packages/redis/connection.py", line 538, in send_packed_command
    self.connect()
  File "/usr/local/lib/python3.5/site-packages/redis/sentinel.py", line 44, in connect
    self.connect_to(self.connection_pool.get_master_address())
  File "/usr/local/lib/python3.5/site-packages/redis/sentinel.py", line 106, in get_master_address
    self.disconnect()
  File "/usr/local/lib/python3.5/site-packages/redis/connection.py", line 921, in disconnect
    connection.disconnect()
  File "/usr/local/lib/python3.5/site-packages/redis/connection.py", line 530, in disconnect
    self._sock.close()
AttributeError: 'NoneType' object has no attribute 'close'

@rolette

rolette commented Jun 1, 2016

I've pretty much convinced myself ConnectionPool doesn't have the locking it needs. The disconnect() method is ripping out connections that are in use and I suspect the call in Connection.disconnect() to self._sock.shutdown() ends up being a system call, so you lose the GIL and any protection you were counting on from that.

I did a quick experiment to catch and ignore the AttributeError on self._sock.close() and ended up with a RuntimeError in ConnectionPool.disconnect(). See the stack trace below:

  File "/mnt/hgfs/w/firefly/inara/python/vendor/redis-py/redis/client.py", line 925, in get
    return self.execute_command('GET', name)
  File "/mnt/hgfs/w/firefly/inara/python/vendor/redis-py/redis/client.py", line 623, in execute_command
    connection.send_command(*args)
  File "/mnt/hgfs/w/firefly/inara/python/vendor/redis-py/redis/connection.py", line 571, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/mnt/hgfs/w/firefly/inara/python/vendor/redis-py/redis/connection.py", line 546, in send_packed_command
    self.connect()
  File "/mnt/hgfs/w/firefly/inara/python/vendor/redis-py/redis/sentinel.py", line 45, in connect
    self.connect_to(self.connection_pool.get_master_address())
  File "/mnt/hgfs/w/firefly/inara/python/vendor/redis-py/redis/sentinel.py", line 107, in get_master_address
    self.disconnect()
  File "/mnt/hgfs/w/firefly/inara/python/vendor/redis-py/redis/connection.py", line 961, in disconnect
    for connection in all_conns:
RuntimeError: Set changed size during iteration
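
The RuntimeError happens because chain() iterates the pool's underlying sets lazily, so a mutation from another thread mid-iteration trips CPython's size check. A minimal illustration (not redis-py code, single-threaded for clarity):

    from itertools import chain

    available_connections = {1, 2, 3}
    in_use_connections = {4, 5, 6}

    for conn in chain(available_connections, in_use_connections):
        if conn == 4:
            # simulate another thread releasing connection 5 mid-iteration
            in_use_connections.remove(5)
            available_connections.add(5)
    # -> RuntimeError: Set changed size during iteration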

@pfreixes
Contributor

pfreixes commented Jun 2, 2016

The Connection class is not thread safe, but the ConnectionPool is, and due to the design of the Redis client and the execute_command function, two different threads should never get access to the same connection.

This error shouldn't happen, but let's see why it is blowing up in our faces. I would need the version of redis-py you are using and a piece of code that reproduces this issue.

Cheers,
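
For context, the connection-per-command pattern being described looks roughly like this (a simplified sketch, not the actual redis-py source):

    def execute_command(self, *args, **options):
        "Simplified sketch of the client's command path"
        pool = self.connection_pool
        command_name = args[0]
        # each thread checks its own connection out of the pool...
        connection = pool.get_connection(command_name)
        try:
            connection.send_command(*args)
            return self.parse_response(connection, command_name, **options)
        finally:
            # ...and returns it when the command is done, so two threads
            # should never hold the same Connection at the same time
            pool.release(connection)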

@rolette

rolette commented Jun 3, 2016

This was running redis-py current with master up to and including this commit: 20fc04e

I did a minor patch to temporarily work around the problem in Connection.disconnect() where it was blowing up on the self._sock.close() call. Here's the change I was experimenting with:

    def disconnect(self):
        "Disconnects from the Redis server"
        self._parser.on_disconnect()
        if self._sock is None:
            return
        try:
            self._sock.shutdown(socket.SHUT_RDWR)
            self._sock.close()
        # iio: Probably not the full fix, but this works around an
        #      "AttributeError: 'NoneType' object has no attribute 'close'" error
        except AttributeError:
            pass
        except socket.error:
            pass
        self._sock = None

Where you get multiple threads accessing the same connection is in SentinelConnectionPool.get_master_address():

    def get_master_address(self):
        master_address = self.sentinel_manager.discover_master(
            self.service_name)
        if self.is_master:
            if self.master_address is None:
                self.master_address = master_address
            elif master_address != self.master_address:
                # Master address changed, disconnect all clients in this pool
                self.disconnect()
        return master_address

In the case where the master address changes, it ends up disconnecting every connection in the pool, with no locking or care taken to avoid issues with the current owner of the connection. The disconnect() method is inherited from ConnectionPool:

    def disconnect(self):
        "Disconnects all connections in the pool"
        all_conns = chain(self._available_connections,
                          self._in_use_connections)
        for connection in all_conns:
            connection.disconnect()

Hopefully that makes it pretty clear how multiple threads are touching the same connection. If not and you really need something to reproduce it, it'll take me a bit to do that. It's happening in a web app under apache2 + mod_wsgi + flask on Python 3.5.1 when Sentinel sees a change in the master's IP address, so complicated to try to pull out for a test case.

@pfreixes
Contributor

pfreixes commented Jun 3, 2016

I'll be out for two days; I will check it then. In any case, thanks for your detailed comment.

@pfreixes
Contributor

pfreixes commented Jun 5, 2016

Hi @rolette, you got the point: there is a threading issue with the ConnectionPool, and even with the BlockingConnectionPool. The disconnect method [1] disconnects active connections, so there is a chance of a race condition; I guess the Sentinel case is just the one most likely to hit it, when the master role changes from one server to another. The bug has to be tackled ASAP.

I guess the call to the shutdown syscall takes the current thread off the CPU and the OS schedules the other thread, which is waiting for data or signals related to the socket. From there it's just a matter of time before the exception is raised.

I'm afraid this is only the tip of the iceberg: the race condition can appear in any access to the _sock attribute from the Connection, Parser or SocketBuffer. So a thread-safe disconnect method alone would give a false sense of safety.

Then what? Either wrap all the classes to make them thread safe, or change the disconnect method. The first will take time and it's not clear right now how to do it without breaking backward compatibility; the second option is tricky but easy to implement. How? /cc @andymccurdy

The idea is to use the shutdown syscall, calling it from the connection pool, outside of the current disconnect method that belongs to the Connection class, and do nothing else. Once the owner of the connection is scheduled again, it will execute the full disconnect method without race conditions.

[1] https://github.com/andymccurdy/redis-py/blob/master/redis/connection.py#L951
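
In code, the proposal would look roughly like this (a hedged sketch of a replacement for ConnectionPool.disconnect, not the actual PR):

    import socket

    def disconnect(self):
        "Sketch: pool-side disconnect that only shuts the sockets down"
        # snapshot the sets so concurrent get_connection()/release() calls
        # can't change them while we iterate
        all_conns = list(self._available_connections) + list(self._in_use_connections)
        for connection in all_conns:
            sock = connection._sock
            if sock is None:
                continue
            try:
                # the owning thread hits a socket.error on its next read or
                # write, raises ConnectionError and runs the full
                # Connection.disconnect() itself, with no race
                sock.shutdown(socket.SHUT_RDWR)
            except OSError:
                pass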

pfreixes added a commit to pfreixes/redis-py that referenced this issue Jun 19, 2016
This PR tries to solve the issues raised by redis#732 regarding
the danger of disconnecting clients from the `ConnectionPool.disconnect`
method when it is executed by a thread different from the ones that are
in charge of the connections.

Instead of calling the `Connection.disconnect` method, it uses the
`shutdown` syscall to leave the socket unusable. Once the connection tries
to use the socket, even when it is already blocked as in the `PubSub`
pattern, it gets a `socket.error` exception that is caught by the
`Connection` class, which then raises a `ConnectionError` and disconnects
the socket in a clean and safe way.

The `Client.execute_command` function catches the `ConnectionError`
exception, reconnects, and retries the command that raised the error.
Worth mentioning that in a `Sentinel` environment, if something changed
in the Redis pool of servers - perhaps the master went down and a slave
was promoted - the next command will be executed against another server.
@pfreixes
Contributor

@rolette do you want to try this fix? It should remove the problem you experienced.

@rolette

rolette commented Jun 20, 2016

@pfreixes definitely an improvement, but still issues. Taking your patch as-is, here's what I see now:

  File "/mnt/hgfs/w/redis-py/redis/client.py", line 925, in get
    return self.execute_command('GET', name)
  File "/mnt/hgfs/w/redis-py/redis/client.py", line 617, in execute_command
    connection.send_command(*args)
  File "/mnt/hgfs/w/redis-py/redis/connection.py", line 579, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/mnt/hgfs/w/redis-py/redis/connection.py", line 554, in send_packed_command
    self.connect()
  File "/mnt/hgfs/w/redis-py/redis/sentinel.py", line 45, in connect
    self.connect_to(self.connection_pool.get_master_address())
  File "/mnt/hgfs/w/redis-py/redis/sentinel.py", line 107, in get_master_address
    self.disconnect()
  File "/mnt/hgfs/w/redis-py/redis/connection.py", line 970, in disconnect
    connection.shutdown_socket()
  File "/mnt/hgfs/w/redis-py/redis/connection.py", line 545, in shutdown_socket
    self._sock.shutdown(socket.SHUT_RDWR)
OSError: [Errno 107] Transport endpoint is not connected

Whether you get the OSError or not varies based on what state the socket is in at the time shutdown() is called. If I tweak your patch to ignore the OSError:

    def shutdown_socket(self):
        """
        Shutdown the socket hold by the current connection, called from
        the connection pool class u other manager to singal it that has to be
        disconnected in a thread safe way. Later the connection instance
        will get an error and will call `disconnect` by it self.
        """
        try:
            self._sock.shutdown(socket.SHUT_RDWR)
        except AttributeError:
            # either _sock attribute does not exist or
            # connection thread removed it.
            pass
        except OSError as e:
            if e.errno == 107:
                # Transport endpoint is not connected
                pass
            else:
                raise

then I'm back to the same RuntimeError: Set changed size during iteration I reported earlier when I tried catching the AttributeError in disconnect() to get around the problem. Here's the traceback:

Traceback (most recent call last):
  File "/mnt/hgfs/w/redis-py/redis/connection.py", line 565, in send_packed_command
    self._sock.sendall(item)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/hgfs/w/redis-py/redis/client.py", line 617, in execute_command
    connection.send_command(*args)
  File "/mnt/hgfs/w/redis-py/redis/connection.py", line 585, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/mnt/hgfs/w/redis-py/redis/connection.py", line 578, in send_packed_command
    (errno, errmsg))
redis.exceptions.ConnectionError: Error 32 while writing to socket. Broken pipe.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  # removed app-specific parts of stack
  File "/mnt/hgfs/w/redis-py/redis/client.py", line 925, in get
    return self.execute_command('GET', name)
  File "/mnt/hgfs/w/redis-py/redis/client.py", line 623, in execute_command
    connection.send_command(*args)
  File "/mnt/hgfs/w/redis-py/redis/connection.py", line 585, in send_command
    self.send_packed_command(self.pack_command(*args))
  File "/mnt/hgfs/w/redis-py/redis/connection.py", line 560, in send_packed_command
    self.connect()
  File "/mnt/hgfs/w/redis-py/redis/sentinel.py", line 45, in connect
    self.connect_to(self.connection_pool.get_master_address())
  File "/mnt/hgfs/w/redis-py/redis/sentinel.py", line 107, in get_master_address
    self.disconnect()
  File "/mnt/hgfs/w/redis-py/redis/connection.py", line 975, in disconnect
    for connection in all_conns:
RuntimeError: Set changed size during iteration

I'm not seeing a way around adding proper synchronization instead of relying on the GIL, but maybe my python-foo is lacking and there's a different way to get there.

@pfreixes
Contributor

@rolette thanks for the fast feedback; I've updated the PR to address the two issues you commented on. I hope this will definitely fix your issue.

@rolette

rolette commented Jun 20, 2016

@pfreixes progress... it gets past the RuntimeError and hits a problem similar to the original AttributeError, but this time in the parser (HiredisParser to be specific):

Traceback (most recent call last):
  File "/mnt/hgfs/w/redis-py/redis/client.py", line 618, in execute_command
    return self.parse_response(connection, command_name, **options)
  File "/mnt/hgfs/w/redis-py/redis/client.py", line 630, in parse_response
    response = connection.read_response()
  File "/mnt/hgfs/w/redis-py/redis/sentinel.py", line 56, in read_response
    return super(SentinelManagedConnection, self).read_response()
  File "/mnt/hgfs/w/redis-py/redis/connection.py", line 587, in read_response
    response = self._parser.read_response()
  File "/mnt/hgfs/w/redis-py/redis/connection.py", line 333, in read_response
    raise ConnectionError(SERVER_CLOSED_CONNECTION_ERROR)
redis.exceptions.ConnectionError: Connection closed by server.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
# removed app-specific parts of stack
  File "/mnt/hgfs/w/redis-py/redis/client.py", line 1902, in hget
    return self.execute_command('HGET', name, key)
  File "/mnt/hgfs/w/redis-py/redis/client.py", line 624, in execute_command
    return self.parse_response(connection, command_name, **options)
  File "/mnt/hgfs/w/redis-py/redis/client.py", line 630, in parse_response
    response = connection.read_response()
  File "/mnt/hgfs/w/redis-py/redis/sentinel.py", line 56, in read_response
    return super(SentinelManagedConnection, self).read_response()
  File "/mnt/hgfs/w/redis-py/redis/connection.py", line 587, in read_response
    response = self._parser.read_response()
  File "/mnt/hgfs/w/redis-py/redis/connection.py", line 361, in read_response
    self._reader.feed(self._buffer, 0, bufflen)
AttributeError: 'NoneType' object has no attribute 'feed'

@pfreixes
Contributor

Hmm, weird. I've been running the following snippet for a while and it didn't raise any issue:

from threading import Thread
from redis.sentinel import Sentinel
from time import sleep

def ping_forever(client):
    while True:
        client.ping()
        sleep(0.1)

s = Sentinel([('localhost', 26379)], socket_timeout=0.1)
m = s.master_for('mymaster', socket_timeout=0.1)
t = Thread(target=ping_forever, args=(m,))
t.start()
sleep(1)
while True:
    print "Killing connection "
    m.connection_pool.disconnect()
    print "Connection finished"
    sleep(0.1)
t.join()

I must be missing something necessary to reproduce the issue; a snippet with the dependencies should be enough.

@rolette

rolette commented Jun 20, 2016

It's happening in a web app under apache2 + mod_wsgi + flask on Python 3.5.1, so nothing easy for me to just pull a snippet to repro unfortunately.

More context on how I'm making this happen:

  • single Redis instance (master) with a single Sentinel instance on the same box (test instance, not production)
  • Redis is configured to bind to the IP address of the node, not 127.0.0.1
  • create the connection pool as follows:
sentinel = Sentinel([('127.0.0.1', 26379)], socket_timeout=0.5)
return SentinelConnectionPool('mymaster', sentinel, socket_timeout=3.0, retry_on_timeout=True)
  • login to my web app, so various connections to Redis are created and used
  • periodic async updates to the UI based on data queried from Redis
  • the bit that makes testing this "fun": run code that changes the IP address of the node that everything is running on (Redis master + sentinel), then it removes and re-adds 'mymaster' from Sentinel so Sentinel is pointed at the right address to monitor
  • once the IP address of the node has been updated, I re-login to the web app and the various in-use connections in the pool end up timing out and triggering the disconnect code that we've been patching up

Looking at your test code, the main question I'd have is whether you have hiredis installed or not. The path that blew up in my most recent traceback was down in the HiredisParser. If hiredis isn't installed, then your test won't hit that code. In case it matters, I'm running hiredis-0.11.0 with:

$ pip3 list | grep hiredis
hiredis (0.1.5)

@pfreixes
Contributor

Yes, I used hiredis. It's going to take me a while; let's see if I'm able to find out what is going on here with your feedback.

@pfreixes
Contributor

@rolette are you sharing the SentinelConnectionPool between threads via a Flask Proxy?

@rolette

rolette commented Jun 21, 2016

@pfreixes Not sure specifically what you mean by "Flask Proxy", but the SentinelConnectionPool is being shared across threads, yes. mod_wsgi is running in embedded mode and Apache is using the "event" MPM:

$ sudo apachectl -V
Server version: Apache/2.4.7 (Ubuntu)
Server built:   Mar 10 2015 13:05:59
Server's Module Magic Number: 20120211:27
Server loaded:  APR 1.5.1-dev, APR-UTIL 1.5.3
Compiled using: APR 1.5.1-dev, APR-UTIL 1.5.3
Architecture:   64-bit
Server MPM:     event
  threaded:     yes (fixed thread count)
    forked:     yes (variable process count)
Server compiled with....
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_SYSVSEM_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=256
 -D HTTPD_ROOT="/etc/apache2"
 -D SUEXEC_BIN="/usr/lib/apache2/suexec"
 -D DEFAULT_PIDLOG="/var/run/apache2.pid"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="mime.types"
 -D SERVER_CONFIG_FILE="apache2.conf"

@pfreixes
Contributor

@rolette Sadly, SentinelConnectionPool [1] is not thread safe; I'm pretty sure that's the source of the issue you are raising.

To avoid introducing too many changes in the core of redis-py, I'd like to keep the following rationale untouched:

1. The Connection class is not thread safe. Each thread should use its own connection.
2. ConnectionPool: the main methods get_connection and release are thread safe. This is the default pool.
3. BlockingConnectionPool is explicitly thread safe, using locking, and it also lets us throttle the use of connections.
4. SentinelConnectionPool derives from ConnectionPool and lets us automatically handle a group of Redis servers running as master or slaves.

My idea for addressing this issue is to move all the necessary code into the Sentinel-specific classes, SentinelManagedConnection and SentinelConnectionPool, to make them thread safe, and, as I said, touch nothing else in the internals for the moment.

[1] https://github.com/andymccurdy/redis-py/blob/master/redis/sentinel.py#L69

@rolette

rolette commented Jun 21, 2016

@pfreixes Do you mean that the implementation of SentinelConnectionPool is not currently thread-safe, or that it wasn't intended to be thread-safe in the first place?

The docs in the code for BlockingConnectionPool state that both it and ConnectionPool are thread-safe:

class BlockingConnectionPool(ConnectionPool):
    """
    Thread-safe blocking connection pool::

        >>> from redis.client import Redis
        >>> client = Redis(connection_pool=BlockingConnectionPool())

    It performs the same function as the default
    ``:py:class: ~redis.connection.ConnectionPool`` implementation, in that,
    it maintains a pool of reusable connections that can be shared by
    multiple redis clients (safely across threads if required).

    The difference is that, in the event that a client tries to get a
    connection from the pool when all of connections are in use, rather than
    raising a ``:py:class: ~redis.exceptions.ConnectionError`` (as the default
    ``:py:class: ~redis.connection.ConnectionPool`` implementation does), it
    makes the client wait ("blocks") for a specified number of seconds until
    a connection becomes available.

I can't find any assertion one way or the other for SentinelConnectionPool, but having a connection pool that isn't thread-safe would have pretty limited use-cases. @andymccurdy, could you provide some clarity here please?

@pfreixes
Contributor

@rolette the issue with SentinelConnectionPool comes from the functions get_master_address [1] and rotate_slaves; both of them can be accessed by different threads, with side effects. For example, the call to disconnect made by get_master_address when the Redis master has changed can raise issues like the one you reported in this issue.

As I said, the rationale is right, but IMHO, because of those two functions, SentinelConnectionPool can run into issues when they are called from multiple threads.

[1] https://github.com/andymccurdy/redis-py/blob/master/redis/sentinel.py#L99
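
For illustration only, a minimal sketch of guarding get_master_address with a lock (the lock name is hypothetical; as rolette notes below, locking this method alone still doesn't protect connections that are already in use):

    import threading

    # hypothetical: in practice this would be one lock per pool instance,
    # created in __init__
    _master_address_lock = threading.Lock()

    def get_master_address(self):
        master_address = self.sentinel_manager.discover_master(self.service_name)
        if self.is_master:
            with _master_address_lock:
                if self.master_address is None:
                    self.master_address = master_address
                elif master_address != self.master_address:
                    # serialized now, but this still tears down in-use connections
                    self.disconnect()
        return master_address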

pfreixes added a commit to pfreixes/redis-py that referenced this issue Jun 21, 2016
@pfreixes
Contributor

@rolette I've made many attempts to trigger the race condition and didn't manage it... but I've prepared a POC patch [1] that might solve your problem.

Right now I'm not feeling very confident about this issue, and a bit frustrated...

[1] pfreixes@6a95994

@andymccurdy
Contributor

Thanks for digging into this. I just had a chance to come up to speed on the conversation. Looking at the code, I agree that there's a race condition in SentinelConnectionPool.get_master_address. I don't like that we're calling SentinelConnectionPool.disconnect at all.

In addition to the race condition, it seems like there's another bug as well. The only time the Sentinels are queried about the state of the cluster is when the SentinelConnection.connect method is called. This happens when a new connection is created or when an existing connection has been disconnected (perhaps from a timeout, a downed server, etc.) and attempts to reconnect. A system where N connections are created and regularly used (to avoid timeouts) and no new connections are ever created will never consult the sentinels for changes. In this scenario, if a new master is promoted while the prior master remains online, clients won't know (at least until they encounter a ReadOnlyError). Similarly, a new slave could come online and the clients would never know.

I think a better solution is to regularly query the state of the sentinel cluster and store the master address and slave addresses on the connection pool instance. SentinelConnectionPool.get_connection could be overridden to quickly check whether the connection instance about to be returned is pointing at the proper server and if not, call the SentinelManagedConnection.disconnect method so that when it's used next it connects to the proper server.

This doesn't account for connection instances that get cached on pipeline or pubsub objects, so we'll need to do a little more to help those out too.

I'm currently traveling but should have some free time around the end of this week. I'm happy to take a stab at this at that time.

If you guys see any glaring holes in this idea, please let me know :)
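
A rough sketch of the get_connection override described above (hypothetical, not the eventual implementation; it assumes self.master_address is refreshed from the sentinels on a timer rather than only on connect):

    class SentinelConnectionPool(ConnectionPool):
        def get_connection(self, command_name, *keys, **options):
            connection = super(SentinelConnectionPool, self).get_connection(
                command_name, *keys, **options)
            # if the connection points at a stale server, drop it here, while
            # the calling thread is the sole owner; it reconnects on next use
            if (connection._sock is not None and
                    (connection.host, connection.port) != self.master_address):
                connection.disconnect()
            return connection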

@pfreixes
Contributor

pfreixes commented Jun 27, 2016

So, can we summarize it as:

  • Improve the current Sentinel code to actively detect changes in the cluster roles. Worth mentioning that we can use the messages the sentinel itself publishes, via Redis's PubSub mechanism [1]:
>>> from redis.sentinel import Sentinel
>>> sentinel = Sentinel([('localhost', 16380)], socket_timeout=0.1, retry_on_timeout=True)
>>> p = sentinel.sentinels[0].pubsub()
>>> p.psubscribe("*")
>>> p.get_message()
{'pattern': None, 'type': 'psubscribe', 'channel': '*', 'data': 2L}
>>> p.get_message()
{'pattern': '*', 'type': 'pmessage', 'channel': '+sdown', 'data': 'master redis-cluster 127.0.0.1 6381'}
  • Modify the Sentinel code to make it thread safe. (I'm still concerned about the override of this method [2].)
  • Change the current pattern used to signal that a connection has to be disconnected for any reason - such as when Sentinel needs it - since the current code is not thread safe. Take a look at the use of shutdown as a signal to achieve the same without worrying about race conditions [3].

/cc @rolette any news about the patch?

[1] http://redis.io/topics/sentinel
[2] https://github.com/andymccurdy/redis-py/blob/master/redis/sentinel.py#L128
[3] pfreixes@6a95994

@rolette

rolette commented Jun 28, 2016

@pfreixes Unfortunately adding a threading lock just for get_master_address() doesn't fix the problem. Taking a look at the most recent stack I posted, we still have the problem of the disconnect ripping things out from under the thread that is currently using the connection.

While we could play whack-a-mole and squelch the AttributeError coming up from the HiredisParser in this case, similar to what we've been trying so far, I think it's a losing battle. I agree with @andymccurdy that it's not a good idea to be calling disconnect on all the connections in get_master_address. You just don't have any control over what state the connection is currently in from other threads.

One way to approach it:

  1. synchronize access to the free list (self._available_connections)
  2. if the master changes, then you can safely disconnect any connections in the free list because of (1). This isn't strictly required, but it is nice to avoid having to take timeouts for callers pulling connections off the free list later.

from there, you can either just let threads with in-use connections deal with the errors from their action (timeouts, read-only errors, etc.), which they have to do anyway, or you can implement a mechanism similar to what you would do in a signal handler... set a flag on each connection and then let the thread that owns the connection deal with the flag when it is safe to do so.

Having the in-use connections deal with the timeouts, etc. isn't nearly as gross as it seems for clients if they are using the retry_on_timeout option. The signal-handler-style approach lets you be a little more proactive and avoid timeouts more often, so that seems preferable - especially for clients that use long timeouts.
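
A hedged sketch of that flag-based variant (all names are hypothetical, shown as additions to the existing classes):

    class ConnectionPool(object):
        # sketch of an addition to the existing pool class
        def mark_for_disconnect(self):
            "Called by whoever notices the master changed (e.g. Sentinel code)"
            with self._free_list_lock:                 # hypothetical lock around the free list
                for connection in list(self._available_connections):
                    connection.disconnect()            # safe: nobody owns these
            for connection in list(self._in_use_connections):
                connection._disconnect_pending = True  # the owning thread handles it

    class Connection(object):
        # sketch of an addition to the existing connection class
        def send_command(self, *args):
            if getattr(self, '_disconnect_pending', False):
                # act on the flag here, where only the owner touches the socket
                self._disconnect_pending = False
                self.disconnect()
            self.send_packed_command(self.pack_command(*args))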

@pfreixes
Contributor

The signal here tries to minimize the refactoring, keeping in mind commands that keep the connection open until there is something to do, such as pubsub.

The use of shutdown seems like the reasonable solution - and it should work with the current code - between doing nothing and refactoring a lot of code, keeping in mind how painful it can be to support connections that stay open for a long time.

rolette added a commit to rolette/redis-py that referenced this issue Sep 10, 2016
Connection.disconnect() were ripping connections out from under
the threads or processes that owned them

redis#732
night pushed a commit to discord/redis-py that referenced this issue Feb 15, 2017
Connection.disconnect() were ripping connections out from under
the threads or processes that owned them

redis#732
capkovic pushed a commit to exponea/redis-py that referenced this issue Apr 4, 2017
Connection.disconnect() were ripping connections out from under
the threads or processes that owned them

redis#732
@gjcarneiro

Can we please merge @rolette's patches? I looked over them and they seem very reasonable, also tested them for a while and found no problems, so +1 from me.

@lalinsky

We had these changes in production for a while, and after a Redis master failover there was a flood of new connections opening, eventually making the process unusable, since it ran out of client TCP ports. That might be specific to our environment, but I think the code is not safe as is.

@rolette

rolette commented Aug 28, 2017

@lalinsky: That sounds more like an application resource problem than a problem with the pool. There's no rate limit on the connection pool for opening connections. When the master changes, the old connections have to be closed and replaced with new ones (currently on an as-needed basis, not all connections concurrently).

When you close a connection, the network stack doesn't make the port available immediately. Even if the connection goes through a FIN, FIN-ACK, ACK sequence, it gets put into the TIME_WAIT state for some amount of time. If you are pushing the limit on the number of concurrent connections, it's easy to start getting errors when you try to open a new connection because no client ports are available. In Python, that shows up as an OSError 99 - "Cannot assign requested address."

You might be able to fix that scenario by using the BlockingConnectionPool and setting appropriate limits. Just make sure you get the parallel fixes to that referenced earlier in the thread.

If you want to have some real fun and "interesting" backtraces, test your app when the master fails over without this changeset.
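
For reference, capping the pool size looks roughly like this (plain BlockingConnectionPool shown; a Sentinel setup would need the blocking variant of the sentinel pool referenced earlier in the thread):

    import redis

    # bound the number of concurrent connections instead of letting the pool
    # grow without limit; callers block up to `timeout` seconds for a free one
    pool = redis.BlockingConnectionPool(
        host='127.0.0.1', port=6379,
        max_connections=50, timeout=20)
    client = redis.Redis(connection_pool=pool)
    client.ping()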

@lalinsky

lalinsky commented Aug 28, 2017

@rolette I don't think the rate of new connections is the core problem; it's that the connections never make it properly back to the pool and are never reused. I have not done any serious testing to confirm that, but that's my intuition. If it were just the rate of new connections, we would have the same problem on app startup, but that's not happening there.

@lalinsky

lalinsky commented Sep 30, 2017

@rolette FYI, I have figured out the problem with your branch. When the master address changes, it's not remembered and therefore every new connection attempt causes the connection pool to close all existing connections. That's easy to fix at https://github.com/andymccurdy/redis-py/blob/master/redis/sentinel.py#L106, but I'm going to open a slightly different PR with the sentinel pool redesigned to not close the connections immediately -- #909

@rolette

rolette commented Sep 30, 2017

@lalinsky Thanks for following up. That's an issue in the master (nothing specific to my branch). We recently merged in PR #847 into our code base to fix the OSError 99 problems.

I keep hoping @andymccurdy will merge the various SentinelConnectionPool fixes in to master. If not, I guess I should at least merge them into my PR so they are available in one place for folks to pick up.

@hobzcalvin

I'm seeing a few more errors, all coming from the same place, wondering if they have the same cause. In addition to AttributeError: 'NoneType' object has no attribute 'feed' and AttributeError: 'NoneType' object has no attribute 'close' there are these:

AttributeError: 'list' object has no attribute 'encode'
-- snip --
  File "celery/result.py", line 194, in get
    on_message=on_message,
  File "celery/backends/async.py", line 189, in wait_for_pending
    for _ in self._wait_for_pending(result, **kwargs):
  File "celery/backends/async.py", line 256, in _wait_for_pending
    on_interval=on_interval):
  File "celery/backends/async.py", line 57, in drain_events_until
    yield self.wait_for(p, wait, timeout=1)
  File "celery/backends/async.py", line 66, in wait_for
    wait(timeout=timeout)
  File "celery/backends/redis.py", line 69, in drain_events
    m = self._pubsub.get_message(timeout=timeout)
  File "redis/client.py", line 2513, in get_message
    response = self.parse_response(block=False, timeout=timeout)
  File "redis/client.py", line 2430, in parse_response
    return self._execute(connection, connection.read_response)
  File "redis/client.py", line 2415, in _execute
    connection.connect()
  File "redis/connection.py", line 493, in connect
    self.on_connect()
  File "redis/connection.py", line 567, in on_connect
    if nativestr(self.read_response()) != 'OK':
  File "redis/_compat.py", line 115, in nativestr
    return x if isinstance(x, str) else x.encode('utf-8', 'replace')

and

InvalidResponse: Protocol error, got "\xb8" as reply type byte
-- snip --
  File "celery/result.py", line 194, in get
    on_message=on_message,
  File "celery/backends/async.py", line 189, in wait_for_pending
    for _ in self._wait_for_pending(result, **kwargs):
  File "celery/backends/async.py", line 256, in _wait_for_pending
    on_interval=on_interval):
  File "celery/backends/async.py", line 57, in drain_events_until
    yield self.wait_for(p, wait, timeout=1)
  File "celery/backends/async.py", line 66, in wait_for
    wait(timeout=timeout)
  File "celery/backends/redis.py", line 69, in drain_events
    m = self._pubsub.get_message(timeout=timeout)
  File "redis/client.py", line 2513, in get_message
    response = self.parse_response(block=False, timeout=timeout)
  File "redis/client.py", line 2430, in parse_response
    return self._execute(connection, connection.read_response)
  File "redis/client.py", line 2408, in _execute
    return command(*args)
  File "redis/connection.py", line 624, in read_response
    response = self._parser.read_response()
  File "redis/connection.py", line 408, in read_response
    response = self._reader.gets()

I'm guessing they're just other things that can go wrong when the connection disappears without proper locking.

@jamespeacock

I have also been running into @hobzcalvin's errors periodically and inconsistently. We utilize redis as a Celery backend and when running multiple tasks asynchronously, a stream of these AttributeErrors and ProtocolErrors pop up after task timeouts or after a task failure.

I understand that this problem might happen due to leftover bytes in the connection after the socket closes, which cause the protocol error when the connection is re-opened. I am hoping there is a graceful way to handle this and flush those bytes.

@andymccurdy
Contributor

3.2.0 has just been released and fixes this issue.

jdost pushed a commit to discord/redis-py that referenced this issue Jun 24, 2019
Connection.disconnect() were ripping connections out from under
the threads or processes that owned them

redis#732
willhug pushed a commit to discord/redis-py that referenced this issue Aug 12, 2021
Connection.disconnect() were ripping connections out from under
the threads or processes that owned them

redis#732