Pools/connections not being closed in tests #1824

Closed · mlodato517 opened this issue Apr 21, 2022 · 9 comments · Fixed by #1901

@mlodato517

Crazy High Level

After upgrading from 0.5.11 to 0.5.12 (and trying 0.5.13), our test
suite fails because of this error:

remaining connection slots are reserved for non-replication superuser connections


I cloned sqlx, ran git bisect, and landed on #1799.
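
(For anyone debugging something similar, one way to watch the connection count
from inside the test process is to ask the server directly. This is just a
debugging sketch; pg_stat_activity is standard PostgreSQL, and the helper name
is made up.)

    // Debugging sketch: how many sessions are open against the current database?
    async fn count_connections(pool: &sqlx::PgPool) -> i64 {
        let (count,): (i64,) = sqlx::query_as(
            "SELECT count(*) FROM pg_stat_activity WHERE datname = current_database()",
        )
        .fetch_one(pool)
        .await
        .unwrap();
        count
    }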

Details And Debugging

Every test in this particular test suite runs with #[tokio::test], which seems
closely related to the discussion of runtimes in #1396 (comment). Each test
calls a function that creates its own PgPool to run some queries. That is, of
course, not what pools are for, but if I add a pool.close().await to the end of
that function, the tests all pass.

Originally I thought I'd just do that, problem solved. However, we also have
a bunch of tests that, I think, can't share a pool. Every unit test
clones a template database and then connects a pool to its specific database.
This ensures no tests collide, which is nice, but I think it prevents us from
sharing one pool among them.
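
(For illustration, a rough sketch of that kind of per-test setup; the template
name, URLs, and helper function are placeholders, not the real code.)

    use sqlx::{Connection, Executor, PgConnection, PgPool};

    // Hypothetical helper: clone a template database for one test, then
    // connect a dedicated pool to the clone.
    async fn pool_for_test(test_name: &str) -> PgPool {
        let admin_url = "postgres://localhost/postgres"; // assumed admin connection string
        let mut admin = PgConnection::connect(admin_url).await.unwrap();

        // Identifiers can't be bound as parameters, so the database name is
        // formatted in directly; `test_name` is assumed to be a safe identifier.
        let create = format!(r#"CREATE DATABASE "{test_name}" TEMPLATE "test_template""#);
        admin.execute(create.as_str()).await.unwrap();

        // Each test then gets its own pool pointed at its own database.
        PgPool::connect(&format!("postgres://localhost/{test_name}"))
            .await
            .unwrap()
    }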

Minimal Reproduction (sort of)

This reproducer shows the "bug" in both 0.5.11 and 0.5.13, so maybe this is
something we were bound to run into eventually, but our test suite happens to
consistently pass on 0.5.11 and consistently fail otherwise. This also,
obviously, isn't how our tests work: we don't create the pools in a loop, each
test creates its own. But this is fewer lines of code :-).

#[cfg(test)]
mod tests {
    struct DropDetector(usize, sqlx::PgPool);

    impl Drop for DropDetector {
        fn drop(&mut self) {
            println!("Dropping {}", self.0);
        }
    }

    async fn handwavy_test_setup(i: usize) {
        let db_url = "your_connection_string_here";
        let drop_detector = DropDetector(i, sqlx::PgPool::connect(db_url).await.unwrap());
        let DropDetector(_, pool) = &drop_detector;

        // This "fixes" the "problem".
        // pool.close().await;
    }

    #[tokio::test]
    async fn test_stuff() {
        for i in 0..100 {
            println!("Starting test {i}");
            handwavy_test_setup(i).await;

            // Give time for pool to close? Simulate doing ... test stuff.
            tokio::time::sleep(std::time::Duration::from_millis(100)).await;
        }
    }
}

Cargo.toml:

[package]
name = "sqlx-pools"
version = "0.1.0"
edition = "2021"

[dependencies]
sqlx = { version = "=0.5.11", default-features = false, features = ["runtime-actix-rustls", "postgres"] }
tokio = { version = "=1.17.0", default-features = false, features = ["macros", "rt"] }

Versions And Things

Thing               Version
test suite sqlx     works on 0.5.11, did not on 0.5.12, does not on 0.5.13
reproducer sqlx     fails on 0.5.11 and 0.5.13 (at least)
driver              postgres
postgresql          14.1 and 12.6
@abonander (Collaborator)

#1799 was designed specifically to fix this situation so I don't understand how it could have made it worse unless it introduced an Arc-cycle somewhere. The fact that calling .close().await fixes the problem suggests it might be some sort of timing issue, though.

Since you're creating a number of different Pools potentially in parallel, you probably want to use something like a global tokio::sync::Semaphore to ensure you don't exceed Postgres' connection limit, or use --test-threads to limit the number of concurrently running tests.
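
For illustration, a minimal sketch of the semaphore idea (the permit count and
the once_cell dependency are assumptions, not something the reproducer already
has):

    use once_cell::sync::Lazy;
    use tokio::sync::Semaphore;

    // Assumed cap; pick something comfortably below the server's connection limit.
    static DB_PERMITS: Lazy<Semaphore> = Lazy::new(|| Semaphore::new(10));

    async fn handwavy_test_setup() {
        // Hold a permit for as long as this test's pool is alive.
        let _permit = DB_PERMITS.acquire().await.unwrap();

        let pool = sqlx::PgPool::connect("your_connection_string_here")
            .await
            .unwrap();

        // ... test setup ...

        pool.close().await;
    }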

Since the pools are short-lived, you can also turn off the idle reaper by setting max_lifetime and idle_timeout to None, or even try pool_options.after_release(|_| false) to force connections to be eagerly closed when returned to the pool. You might also want to use pool_options.max_connections(1) to make sure each pool is only opening a single connection.
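
Concretely, something like this (a sketch against sqlx 0.5's PgPoolOptions; the
function name is made up):

    use std::time::Duration;
    use sqlx::postgres::PgPoolOptions;

    async fn short_lived_test_pool(db_url: &str) -> sqlx::PgPool {
        PgPoolOptions::new()
            .max_connections(1)              // each test pool opens at most one connection
            .max_lifetime(None::<Duration>)  // disable the lifetime reaper
            .idle_timeout(None::<Duration>)  // disable the idle reaper
            .after_release(|_conn| false)    // close connections instead of parking them idle
            .connect(db_url)
            .await
            .unwrap()
    }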

Though, it might make sense to have a PoolOptions::new_test() config with different defaults specifically for testing. We've also been talking about adding a #[sqlx::test] attribute that does this configuration automatically.

@mlodato517 (Author)

> Since you're creating a number of different Pools potentially in parallel, you probably want to use something like a global tokio::sync::Semaphore to ensure you don't exceed Postgres' connection limit,

That does sound pretty nifty but I was hoping to not have to make a pool for my connection pools 😆 This may be the way though!

> or use --test-threads to limit the number of concurrently running tests.

We actually do have --test-threads throttling in CI, and locally that doesn't help either :-(

Example Local Output
➜ cargo test --quiet some_test_names -- --test-threads=2
running 82 tests
..................................................................................
test result: ok. 82 passed; 0 failed; 0 ignored; 0 measured; 47 filtered out; finished in 33.19s

➜ cargo update -p sqlx
    Updating crates.io index
      Adding event-listener v2.5.2
      Adding hkdf v0.12.3
      Adding hmac v0.12.1
    Updating md-5 v0.9.1 -> v0.10.1
      Adding sha2 v0.10.2
    Updating sqlx v0.5.11 -> v0.5.13
    Updating sqlx-core v0.5.11 -> v0.5.13
    Updating sqlx-macros v0.5.11 -> v0.5.13
    Updating sqlx-rt v0.5.11 -> v0.5.13

➜ cargo test --quiet some_test_names -- --test-threads=2

running 82 tests
...............................................test one_test has been running for over 60 seconds
test another_test has been running for over 60 seconds
..F^C # gave up

The tests start taking over 60s because the code that checks out the pool actually retries in a loop when it fails to connect to the DB, so they just spin for a long time. But if I turn on all the logging, I see the above-mentioned error.

> Since the pools are short-lived, you can also turn off the idle reaper by setting max_lifetime and idle_timeout to None, or even try pool_options.after_release(|_| false) to force connections to be eagerly closed when returned to the pool. You might also want to use pool_options.max_connections(1) to make sure each pool is only opening a single connection.

Ah that's great - I didn't know if None would make it infinite or effectively zero. I'll try that!

> or even try pool_options.after_release(|_| false) to force connections to be eagerly closed when returned to the pool. You might also want to use pool_options.max_connections(1) to make sure each pool is only opening a single connection.

That also sounds really cool and would help our integration tests that connect to the various DB clones. Our integration tests use the same pool-connection function that our actual application uses, so they maybe can't benefit from this (or the other suggestions above; I'll see!).

@mlodato517 (Author)

I'll spend some more time today trying to get an actual minimal reproduction, but I'm having trouble understanding the original reproducer: is it semi-obvious to you that, with 100ms sleeps (I also tried 1s sleeps), the pools wouldn't close? It doesn't look like max_lifetime or idle_timeout help the above reproducer, but after_release(|_| false) gets it to pass! That's cool :-) (though, again, maybe not possible for our integration tests). I'll also try to get some PRs up documenting max_lifetime/idle_timeout = None and after_release, since those are nifty tidbits that weren't obvious to me.

@mlodato517 (Author)

Alrighty, I have a consistent reproducer, on my local machine at least. I made three changes:

Dependencies

I needed to lock in sqlx-core. Otherwise, when I updated sqlx to 0.5.13 and then back to 0.5.11, sqlx-core stayed at 0.5.13:

[dev-dependencies]
sqlx = { version = "=0.5.11", default-features = false, features = ["runtime-actix-rustls", "postgres"] }
sqlx-core = "=0.5.11"
tokio = { version = "=1.17.0", default-features = false, features = ["macros", "rt"] }

Pool Acquisition

I added some code that's closer to how our actual function sets up a pool. I removed some looping and error handling, but one important thing is that we test the connection with a query before returning it. (I'm not sure that's necessary; that issue is still open, but if I stop my DB server it seems to correctly return an error even without the query, maybe because we have a timeout?)

async fn get_pool() -> sqlx::PgPool {
    let db_url = "your_connection_string_here";
    let pool = sqlx::PgPool::connect(db_url).await.unwrap();

    // NOTE: Commenting out this query allows the tests to pass ... freaky

    // sqlx will not fail if it cannot connect to the database at all.
    // Until [this](https://github.com/launchbadge/sqlx/issues/744) is fixed,
    // we need to make a dummy query.
    sqlx::query("SELECT 1").execute(&pool).await.unwrap();

    pool
}

async fn handwavy_test_setup() {
    let pool = get_pool().await;
    let tx = pool.begin().await.unwrap();

    // Do actual test setup stuff ...

    tx.commit().await.unwrap();

    // This "fixes" the "problem"
    // pool.close().await;
}

Test Arrangement

I moved away from the weird loop to "lots of actual tests":

// Generate a test so we can fit this all on one screen.
macro_rules! t {
    ($test_name:tt) => {
        #[tokio::test]
        async fn $test_name() {
            handwavy_test_setup().await;

            // Do test stuff
            tokio::time::sleep(std::time::Duration::from_millis(10)).await;
        }
    };
}

#[rustfmt::skip]
mod actual_tests {
    use super::*;

    t!(a0); t!(a1); t!(a2); t!(a3); t!(a4); t!(a5); t!(a6); t!(a7); t!(a8); t!(a9);
    t!(b0); t!(b1); t!(b2); t!(b3); t!(b4); t!(b5); t!(b6); t!(b7); t!(b8); t!(b9);
    t!(c0); t!(c1); t!(c2); t!(c3); t!(c4); t!(c5); t!(c6); t!(c7); t!(c8); t!(c9);
    t!(d0); t!(d1); t!(d2); t!(d3); t!(d4); t!(d5); t!(d6); t!(d7); t!(d8); t!(d9);
    t!(e0); t!(e1); t!(e2); t!(e3); t!(e4); t!(e5); t!(e6); t!(e7); t!(e8); t!(e9);
    t!(f0); t!(f1); t!(f2); t!(f3); t!(f4); t!(f5); t!(f6); t!(f7); t!(f8); t!(f9);
    t!(g0); t!(g1); t!(g2); t!(g3); t!(g4); t!(g5); t!(g6); t!(g7); t!(g8); t!(g9);
    t!(h0); t!(h1); t!(h2); t!(h3); t!(h4); t!(h5); t!(h6); t!(h7); t!(h8); t!(h9);
    t!(i0); t!(i1); t!(i2); t!(i3); t!(i4); t!(i5); t!(i6); t!(i7); t!(i8); t!(i9);
    t!(j0); t!(j1); t!(j2); t!(j3); t!(j4); t!(j5); t!(j6); t!(j7); t!(j8); t!(j9);
    t!(k0); t!(k1); t!(k2); t!(k3); t!(k4); t!(k5); t!(k6); t!(k7); t!(k8); t!(k9);
    t!(l0); t!(l1); t!(l2); t!(l3); t!(l4); t!(l5); t!(l6); t!(l7); t!(l8); t!(l9);
}

Result

With this setup I get consistent failures with both sqlx and sqlx-core at 0.5.13 and consistent success with both at 0.5.11.

Full Code
#[cfg(test)]
mod tests {
    async fn get_pool() -> sqlx::PgPool {
        let db_url = "your_connection_string_here";
        let pool = sqlx::PgPool::connect(db_url).await.unwrap();

        // NOTE: Commenting out this query allows the tests to pass ... freaky

        // sqlx will not fail if it cannot connect to the database at all.
        // Until [this](https://github.com/launchbadge/sqlx/issues/744) is fixed,
        // we need to make a dummy query.
        sqlx::query("SELECT 1").execute(&pool).await.unwrap();

        pool
    }

    async fn handwavy_test_setup() {
        let pool = get_pool().await;
        let tx = pool.begin().await.unwrap();

        // Do actual test setup stuff ...

        tx.commit().await.unwrap();

        // This "fixes" the "problem"
        // pool.close().await;
    }

    // Generate a test so we can fit this all on one screen.
    macro_rules! t {
        ($test_name:tt) => {
            #[tokio::test]
            async fn $test_name() {
                handwavy_test_setup().await;

                // Do test stuff
                tokio::time::sleep(std::time::Duration::from_millis(10)).await;
            }
        };
    }

    #[rustfmt::skip]
    mod actual_tests {
        use super::*;

        t!(a0); t!(a1); t!(a2); t!(a3); t!(a4); t!(a5); t!(a6); t!(a7); t!(a8); t!(a9);
        t!(b0); t!(b1); t!(b2); t!(b3); t!(b4); t!(b5); t!(b6); t!(b7); t!(b8); t!(b9);
        t!(c0); t!(c1); t!(c2); t!(c3); t!(c4); t!(c5); t!(c6); t!(c7); t!(c8); t!(c9);
        t!(d0); t!(d1); t!(d2); t!(d3); t!(d4); t!(d5); t!(d6); t!(d7); t!(d8); t!(d9);
        t!(e0); t!(e1); t!(e2); t!(e3); t!(e4); t!(e5); t!(e6); t!(e7); t!(e8); t!(e9);
        t!(f0); t!(f1); t!(f2); t!(f3); t!(f4); t!(f5); t!(f6); t!(f7); t!(f8); t!(f9);
        t!(g0); t!(g1); t!(g2); t!(g3); t!(g4); t!(g5); t!(g6); t!(g7); t!(g8); t!(g9);
        t!(h0); t!(h1); t!(h2); t!(h3); t!(h4); t!(h5); t!(h6); t!(h7); t!(h8); t!(h9);
        t!(i0); t!(i1); t!(i2); t!(i3); t!(i4); t!(i5); t!(i6); t!(i7); t!(i8); t!(i9);
        t!(j0); t!(j1); t!(j2); t!(j3); t!(j4); t!(j5); t!(j6); t!(j7); t!(j8); t!(j9);
        t!(k0); t!(k1); t!(k2); t!(k3); t!(k4); t!(k5); t!(k6); t!(k7); t!(k8); t!(k9);
        t!(l0); t!(l1); t!(l2); t!(l3); t!(l4); t!(l5); t!(l6); t!(l7); t!(l8); t!(l9);
    }
}

@abonander (Collaborator)

After trying the repro locally, it does seem to be leaking connections for some reason but I can't fathom why. #1799 shouldn't have introduced it, although the extra cloning of Arc may have changed exactly which task drops things.

@abonander (Collaborator) commented May 3, 2022

Okay, it took me an embarrassingly long time to find the problem, but I think I've figured it out. In hindsight, it should have been painfully obvious from the diff of #1799, but GitHub collapsed the lines by default because they weren't changed: https://github.com/launchbadge/sqlx/pull/1799/files#diff-782a6e0cb2edbd43e45dfef48fae8534c6a8f46cddebca66256e6f88819f7ef0R397-R399

That mem::forget() was perfectly innocuous when the guard held a plain reference to the SharedPool, but now that it holds an Arc, it effectively leaks the SharedPool: that code runs when a connection is released back to the pool, permanently adding one strong reference to the Arc each time.

That means the SharedPool and any idle connections inside it will never be cleaned up, leaking the underlying TCP connections. It's still relatively low impact since the sockets will automatically be closed by the operating system when the process exits, but since you're creating a new PgPool for every test, it quickly exhausts the connection limit on the server.
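
A stripped-down illustration of that mechanism (these are not sqlx's actual
types, just the mem::forget-of-an-Arc-holder pattern):

    use std::mem;
    use std::sync::Arc;

    struct SharedPool; // stand-in for the pool's shared state

    struct Guard {
        pool: Arc<SharedPool>, // the guard now owns an Arc, not a plain reference
    }

    impl Guard {
        fn cancel(self) {
            // Forgetting `self` also forgets the Arc it owns, so the strong
            // count is never decremented.
            mem::forget(self);
        }
    }

    impl Drop for Guard {
        fn drop(&mut self) {
            // In the real pool this would do release bookkeeping;
            // `cancel()` exists to skip it.
        }
    }

    fn main() {
        let pool = Arc::new(SharedPool);
        for _ in 0..3 {
            Guard { pool: Arc::clone(&pool) }.cancel();
        }
        // Three strong references have leaked; the count can never reach zero now.
        assert_eq!(Arc::strong_count(&pool), 4);
    }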

@mlodato517 (Author)

That seems incredibly sneaky and I find this not at all an embarrassingly long time for debugging 😆

And, for me, that link doesn't obviously go anywhere (maybe because the lines are collapsed by default and so GH can't jump to the hidden line or something?)

@abonander (Collaborator)

It's here on master:

pub fn cancel(self) {
    mem::forget(self);
}

@kevinji commented Jun 10, 2022

Would it be sufficient to call drop(self.pool) before that mem::forget call to decrement the Arc?
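
As an aside (and not necessarily what #1901 ended up doing): drop(self.pool)
would not compile as written, assuming the guard implements Drop, which is
presumably why cancel() uses mem::forget in the first place; you can't move a
field out of a type that has a Drop impl. A hypothetical workaround is to hold
the Arc in an Option so cancel() can take and drop it before forgetting the
rest:

    use std::mem;
    use std::sync::Arc;

    struct SharedPool;

    struct Guard {
        // Hypothetical: an Option lets `cancel` move the Arc out and drop it
        // even though `Guard` has a `Drop` impl.
        pool: Option<Arc<SharedPool>>,
    }

    impl Guard {
        fn cancel(mut self) {
            drop(self.pool.take()); // decrement the strong count
            mem::forget(self);      // then skip the rest of the destructor
        }
    }

    impl Drop for Guard {
        fn drop(&mut self) {
            // Normal release bookkeeping would run here.
        }
    }

    fn main() {
        let pool = Arc::new(SharedPool);
        Guard { pool: Some(Arc::clone(&pool)) }.cancel();
        // The clone was dropped inside `cancel`, so only the original remains.
        assert_eq!(Arc::strong_count(&pool), 1);
    }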
