
Improve max_lifetime handling #3051

Closed
mirek26 opened this issue Feb 11, 2024 · 2 comments
Labels
enhancement New feature or request

Comments

@mirek26 (Contributor) commented Feb 11, 2024

Is your feature request related to a problem? Please describe.

We have a latency-sensitive system and we're seeing regular spikes in query latency. After some digging, I believe there are two interconnected issues:

  1. Establishing a new database connection is slow (which I'm trying to improve here and here)
  2. The connection pool does not reap old connections well, leading to a wave of new connections every max_lifetime

A somewhat related issue with similar symptoms is #2854. All connections are replaced after max_lifetime, and the replacement happens during acquire, so we observe spikes in query latency due to (1).

The issue with (2) is that the maintenance task only looks at (size - min_size) connections. This means that if size == min_size, it does nothing. We run with a relatively high min_size to ensure we always have a good number of available connections, but after learning this, I suspect it may actually be counterproductive.
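To make the arithmetic concrete, here is a minimal sketch of the reaping behavior described above (a simplified model, not the actual sqlx source):

    // Simplified model: the background reaper only considers connections in
    // excess of min_connections, so a pool that sits at its minimum size never
    // reaps anything in the background and all lifetime checks land on acquire.
    fn reapable_connections(current_size: usize, min_connections: usize) -> usize {
        current_size.saturating_sub(min_connections)
    }

    fn main() {
        assert_eq!(reapable_connections(20, 5), 15); // 15 candidates for background reaping
        assert_eq!(reapable_connections(20, 20), 0); // size == min_size: nothing is ever reaped
    }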

Describe the solution you'd like

The pool should replace connections in the background; it should be extremely rare for a connection to be opened at query time. Even if opening a connection takes only 20ms, that is too much to pay on the query path - we're writing Rust so my expectations are high ;-)

@abonander had a related comment in #2848:

The logic of acquire() should be changed to race between acquiring a connection from the idle queue and opening a new connection.

💯. This would be amazing, but I don't think it would fully solve my issue: if all connections in the pool were opened around the same time, the next connection we acquire from the idle queue is likely to be past max_lifetime as well.
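For illustration, here is a toy of the racing idea (not sqlx internals; the sleeps are hypothetical stand-ins for waiting on the idle queue and for connection setup, and it needs tokio with the rt, macros, and time features):

    use std::time::Duration;
    use tokio::time::sleep;

    // Stand-in for waiting on the idle queue.
    async fn checkout_idle() -> &'static str {
        sleep(Duration::from_millis(5)).await;
        "idle connection"
    }

    // Stand-in for opening a brand-new connection (slow path, ~20 ms).
    async fn open_new() -> &'static str {
        sleep(Duration::from_millis(20)).await;
        "fresh connection"
    }

    #[tokio::main]
    async fn main() {
        // Race the two futures; whichever finishes first wins and the loser is dropped.
        let winner = tokio::select! {
            c = checkout_idle() => c,
            c = open_new() => c,
        };
        println!("acquired: {winner}");
    }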

My preferred combination of changes would be:

  1. Don't check max_lifetime on acquire. Instead, check it on release in return_to_pool. Users who want the check on acquire are free to do it manually in before_acquire (see the sketch after this list).
  2. Improve the replacement logic: go through the full pool one by one; if a connection is old or idle, close it and run min_connections_maintenance immediately.
  3. The suggestion above: acquire() should race between acquiring a connection from the idle queue and opening a new connection.
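
As mentioned in (1), here is a sketch of what the manual escape hatch could look like, assuming sqlx 0.7's PoolOptions hooks and a PostgreSQL pool; the connection string and the 30-minute cutoff are placeholders:

    use std::time::Duration;
    use sqlx::postgres::PgPoolOptions;

    #[tokio::main]
    async fn main() -> Result<(), sqlx::Error> {
        let pool = PgPoolOptions::new()
            // Disable the built-in lifetime check on acquire...
            .max_lifetime(None)
            // ...and instead reject any connection older than 30 minutes when it is
            // handed out; the pool closes it and tries another (or opens a new one).
            .before_acquire(|_conn, meta| Box::pin(async move {
                Ok(meta.age < Duration::from_secs(30 * 60))
            }))
            .connect("postgres://localhost/example")
            .await?;

        sqlx::query("SELECT 1").execute(&pool).await?;
        Ok(())
    }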

I'm open to other solutions and happy to contribute changes once we agree on a path forward.

@mirek26 mirek26 added the enhancement New feature or request label Feb 11, 2024
@mirek26 (Contributor, Author) commented Feb 12, 2024

After writing this, I figured out that we can simulate proposal 1 with:

     // Disable the built-in lifetime check that runs on acquire...
     .max_lifetime(None)
     // ...and instead drop connections older than 30 minutes when they are returned to the pool.
     .after_release(|_conn, meta| Box::pin(async move { Ok(meta.age.as_secs() < 30 * 60) }))

We don't get the benefit of background max_lifetime checking, but we run enough queries that connection age is checked often enough. This chart shows the p99 latency of different queries after rolling out the change:
[Screenshot: chart of p99 query latency, captured 2024-02-12]

Basically: synchronous reconnections at query time became very rare.

I still think this change should be made in sqlx itself; it seems like a much better default.

@abonander (Collaborator) commented Mar 19, 2024

Closed by #3065
