Pools/connections not being closed in tests #1824

Closed · mlodato517 opened this issue Apr 21, 2022 · 9 comments · Fixed by #1901

@mlodato517

Crazy High Level

After upgrading from 0.5.11 to 0.5.12 (and trying 0.5.13), our test
suite fails because of this error:

remaining connection slots are reserved for non-replication superuser connections


I cloned sqlx, ran git bisect, and landed on #1799.
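
(For anyone debugging something similar, one way to watch the connection count
from inside the test process is to ask the server directly. This is just a
debugging sketch; pg_stat_activity is standard PostgreSQL, and the helper name
is made up.)

    // Debugging sketch: how many sessions are open against the current database?
    async fn count_connections(pool: &sqlx::PgPool) -> i64 {
        let (count,): (i64,) = sqlx::query_as(
            "SELECT count(*) FROM pg_stat_activity WHERE datname = current_database()",
        )
        .fetch_one(pool)
        .await
        .unwrap();
        count
    }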

Details And Debugging

Every test in this particular test suite runs with #[tokio::test], which seems
closely related to the discussion of runtimes in #1396 (comment). Each test
calls a function that creates its own PgPool to run some queries. That is, of
course, not what pools are for, but if I add a pool.close().await to the end of
that function, the tests all pass.

Originally I thought I'd just do that, problem solved. However, we also have
a bunch of tests that, I think, can't share a pool. Every unit test
clones a template database and then connects a pool to its specific database.
This ensures no tests collide, which is nice, but I think it prevents us from
sharing one pool among them.
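
(For illustration, a rough sketch of that kind of per-test setup; the template
name, URLs, and helper function are placeholders, not the real code.)

    use sqlx::{Connection, Executor, PgConnection, PgPool};

    // Hypothetical helper: clone a template database for one test, then
    // connect a dedicated pool to the clone.
    async fn pool_for_test(test_name: &str) -> PgPool {
        let admin_url = "postgres://localhost/postgres"; // assumed admin connection string
        let mut admin = PgConnection::connect(admin_url).await.unwrap();

        // Identifiers can't be bound as parameters, so the database name is
        // formatted in directly; `test_name` is assumed to be a safe identifier.
        let create = format!(r#"CREATE DATABASE "{test_name}" TEMPLATE "test_template""#);
        admin.execute(create.as_str()).await.unwrap();

        // Each test then gets its own pool pointed at its own database.
        PgPool::connect(&format!("postgres://localhost/{test_name}"))
            .await
            .unwrap()
    }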

Minimal Reproduction (sort of)

This reproducer shows the "bug" in both 0.5.11 and 0.5.13, so maybe this is
something we were bound to run into eventually, but our test suite happens to
consistently pass on 0.5.11 and consistently fail otherwise. This also,
obviously, isn't how our tests work: we don't create the pools in a loop, each
test creates its own. But this is fewer lines of code :-).

#[cfg(test)]
mod tests {
    struct DropDetector(usize, sqlx::PgPool);

    impl Drop for DropDetector {
        fn drop(&mut self) {
            println!("Dropping {}", self.0);
        }
    }

    async fn handwavy_test_setup(i: usize) {
        let db_url = "your_connection_string_here";
        let drop_detector = DropDetector(i, sqlx::PgPool::connect(db_url).await.unwrap());
        let DropDetector(_, pool) = &drop_detector;

        // This "fixes" the "problem".
        // pool.close().await;
    }

    #[tokio::test]
    async fn test_stuff() {
        for i in 0..100 {
            println!("Starting test {i}");
            handwavy_test_setup(i).await;

            // Give time for pool to close? Simulate doing ... test stuff.
            tokio::time::sleep(std::time::Duration::from_millis(100)).await;
        }
    }
}

Cargo.toml:

[package]
name = "sqlx-pools"
version = "0.1.0"
edition = "2021"

[dependencies]
sqlx = { version = "=0.5.11", default-features = false, features = ["runtime-actix-rustls", "postgres"] }
tokio = { version = "=1.17.0", default-features = false, features = ["macros", "rt"] }

Versions And Things

Thing               Version
test suite sqlx     works on 0.5.11, did not on 0.5.12, does not on 0.5.13
reproducer sqlx     fails on 0.5.11 and 0.5.13 (at least)
driver              postgres
postgresql          14.1 and 12.6
@abonander (Collaborator)

#1799 was designed specifically to fix this situation so I don't understand how it could have made it worse unless it introduced an Arc-cycle somewhere. The fact that calling .close().await fixes the problem suggests it might be some sort of timing issue, though.

Since you're creating a number of different Pools potentially in parallel, you probably want to use something like a global tokio::sync::Semaphore to ensure you don't exceed Postgres' connection limit, or use --test-threads to limit the number of concurrently running tests.
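
For illustration, a minimal sketch of the semaphore idea (the permit count and
the once_cell dependency are assumptions, not something the reproducer already
has):

    use once_cell::sync::Lazy;
    use tokio::sync::Semaphore;

    // Assumed cap; pick something comfortably below the server's connection limit.
    static DB_PERMITS: Lazy<Semaphore> = Lazy::new(|| Semaphore::new(10));

    async fn handwavy_test_setup() {
        // Hold a permit for as long as this test's pool is alive.
        let _permit = DB_PERMITS.acquire().await.unwrap();

        let pool = sqlx::PgPool::connect("your_connection_string_here")
            .await
            .unwrap();

        // ... test setup ...

        pool.close().await;
    }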

Since the pools are short-lived, you can also turn off the idle reaper by setting max_lifetime and idle_timeout to None, or even try pool_options.after_release(|_| false) to force connections to be eagerly closed when returned to the pool. You might also want to use pool_options.max_connections(1) to make sure each pool is only opening a single connection.
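
Concretely, something like this (a sketch against sqlx 0.5's PgPoolOptions; the
function name is made up):

    use std::time::Duration;
    use sqlx::postgres::PgPoolOptions;

    async fn short_lived_test_pool(db_url: &str) -> sqlx::PgPool {
        PgPoolOptions::new()
            .max_connections(1)              // each test pool opens at most one connection
            .max_lifetime(None::<Duration>)  // disable the lifetime reaper
            .idle_timeout(None::<Duration>)  // disable the idle reaper
            .after_release(|_conn| false)    // close connections instead of parking them idle
            .connect(db_url)
            .await
            .unwrap()
    }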

Though, it might make sense to have a PoolOptions::new_test() config with different defaults specifically for testing. We've also been talking about adding a #[sqlx::test] attribute that does this configuration automatically.

@mlodato517 (Author)

> Since you're creating a number of different Pools potentially in parallel, you probably want to use something like a global tokio::sync::Semaphore to ensure you don't exceed Postgres' connection limit,

That does sound pretty nifty but I was hoping to not have to make a pool for my connection pools 😆 This may be the way though!

> or use --test-threads to limit the number of concurrently running tests.

We actually do have --test-threads throttling in CI, and locally that doesn't help either :-(

Example Local Output
➜ cargo test --quiet some_test_names -- --test-threads=2
running 82 tests
..................................................................................
test result: ok. 82 passed; 0 failed; 0 ignored; 0 measured; 47 filtered out; finished in 33.19s

➜ cargo update -p sqlx
    Updating crates.io index
      Adding event-listener v2.5.2
      Adding hkdf v0.12.3
      Adding hmac v0.12.1
    Updating md-5 v0.9.1 -> v0.10.1
      Adding sha2 v0.10.2
    Updating sqlx v0.5.11 -> v0.5.13
    Updating sqlx-core v0.5.11 -> v0.5.13
    Updating sqlx-macros v0.5.11 -> v0.5.13
    Updating sqlx-rt v0.5.11 -> v0.5.13

➜ cargo test --quiet some_test_names -- --test-threads=2

running 82 tests
...............................................test one_test has been running for over 60 seconds
test another_test has been running for over 60 seconds
..F^C # gave up

The tests start taking over 60s because the code that checks out the pool actually retries in a loop when it fails to connect to the DB, so they just spin for a long time. But if I turn on all the logging, I see the above-mentioned error.

> Since the pools are short-lived, you can also turn off the idle reaper by setting max_lifetime and idle_timeout to None, or even try pool_options.after_release(|_| false) to force connections to be eagerly closed when returned to the pool. You might also want to use pool_options.max_connections(1) to make sure each pool is only opening a single connection.

Ah that's great - I didn't know if None would make it infinite or effectively zero. I'll try that!

> or even try pool_options.after_release(|_| false) to force connections to be eagerly closed when returned to the pool. You might also want to use pool_options.max_connections(1) to make sure each pool is only opening a single connection.

That also sounds really cool and would help our integration tests that connect to the various DB clones. Our integration tests use the same pool-connection function that our actual application uses, so they maybe can't benefit from this (or the other suggestions above; I'll see!).

@mlodato517 (Author)

I'll spend some more time today trying to get an actual minimal reproduction, but I'm having trouble understanding the original reproducer: is it semi-obvious to you that, with 100ms sleeps (I also tried 1s sleeps), the pools wouldn't close? It doesn't look like max_lifetime or idle_timeout help the above reproducer, but after_release(|_| false) gets it to pass! That's cool :-) (though, again, maybe not possible for our integration tests). I'll also try to get some PRs up documenting max_lifetime/idle_timeout = None and after_release, since those are nifty tidbits that weren't obvious to me.

@mlodato517 (Author)

Alrighty, I have a consistent reproducer, on my local machine at least. I made three changes:

Dependencies

I needed to lock in sqlx-core. Otherwise, when I updated sqlx to 0.5.13 and then back to 0.5.11, sqlx-core stayed at 0.5.13:

[dev-dependencies]
sqlx = { version = "=0.5.11", default-features = false, features = ["runtime-actix-rustls", "postgres"] }
sqlx-core = "=0.5.11"
tokio = { version = "=1.17.0", default-features = false, features = ["macros", "rt"] }

Pool Acquisition

I added some code that's closer to how our actual function sets up a pool. I removed some looping and error handling, but one important thing is that we test the connection with a query before returning it. (I'm not sure that's necessary; that issue is still open, but if I stop my DB server it seems to correctly return an error even without the query, maybe because we have a timeout?)

async fn get_pool() -> sqlx::PgPool {
    let db_url = "your_connection_string_here";
    let pool = sqlx::PgPool::connect(db_url).await.unwrap();

    // NOTE: Commenting out this query allows the tests to pass ... freaky

    // sqlx will not fail if it cannot connect to the database at all.
    // Until [this](https://github.com/launchbadge/sqlx/issues/744) is fixed,
    // we need to make a dummy query.
    sqlx::query("SELECT 1").execute(&pool).await.unwrap();

    pool
}

async fn handwavy_test_setup() {
    let pool = get_pool().await;
    let tx = pool.begin().await.unwrap();

    // Do actual test setup stuff ...

    tx.commit().await.unwrap();

    // This "fixes" the "problem"
    // pool.close().await;
}

Test Arrangement

I moved away from the weird loop to "lots of actual tests":

// Generate a test so we can fit this all on one screen.
macro_rules! t {
    ($test_name:tt) => {
        #[tokio::test]
        async fn $test_name() {
            handwavy_test_setup().await;

            // Do test stuff
            tokio::time::sleep(std::time::Duration::from_millis(10)).await;
        }
    };
}

#[rustfmt::skip]
mod actual_tests {
    use super::*;

    t!(a0); t!(a1); t!(a2); t!(a3); t!(a4); t!(a5); t!(a6); t!(a7); t!(a8); t!(a9);
    t!(b0); t!(b1); t!(b2); t!(b3); t!(b4); t!(b5); t!(b6); t!(b7); t!(b8); t!(b9);
    t!(c0); t!(c1); t!(c2); t!(c3); t!(c4); t!(c5); t!(c6); t!(c7); t!(c8); t!(c9);
    t!(d0); t!(d1); t!(d2); t!(d3); t!(d4); t!(d5); t!(d6); t!(d7); t!(d8); t!(d9);
    t!(e0); t!(e1); t!(e2); t!(e3); t!(e4); t!(e5); t!(e6); t!(e7); t!(e8); t!(e9);
    t!(f0); t!(f1); t!(f2); t!(f3); t!(f4); t!(f5); t!(f6); t!(f7); t!(f8); t!(f9);
    t!(g0); t!(g1); t!(g2); t!(g3); t!(g4); t!(g5); t!(g6); t!(g7); t!(g8); t!(g9);
    t!(h0); t!(h1); t!(h2); t!(h3); t!(h4); t!(h5); t!(h6); t!(h7); t!(h8); t!(h9);
    t!(i0); t!(i1); t!(i2); t!(i3); t!(i4); t!(i5); t!(i6); t!(i7); t!(i8); t!(i9);
    t!(j0); t!(j1); t!(j2); t!(j3); t!(j4); t!(j5); t!(j6); t!(j7); t!(j8); t!(j9);
    t!(k0); t!(k1); t!(k2); t!(k3); t!(k4); t!(k5); t!(k6); t!(k7); t!(k8); t!(k9);
    t!(l0); t!(l1); t!(l2); t!(l3); t!(l4); t!(l5); t!(l6); t!(l7); t!(l8); t!(l9);
}

Result

With this setup I get consistent failures with both sqlx and sqlx-core at 0.5.13 and consistent success with both at 0.5.11.

Full Code
#[cfg(test)]
mod tests {
    async fn get_pool() -> sqlx::PgPool {
        let db_url = "your_connection_string_here";
        let pool = sqlx::PgPool::connect(db_url).await.unwrap();

        // NOTE: Commenting out this query allows the tests to pass ... freaky

        // sqlx will not fail if it cannot connect to the database at all.
        // Until [this](https://github.com/launchbadge/sqlx/issues/744) is fixed,
        // we need to make a dummy query.
        sqlx::query("SELECT 1").execute(&pool).await.unwrap();

        pool
    }

    async fn handwavy_test_setup() {
        let pool = get_pool().await;
        let tx = pool.begin().await.unwrap();

        // Do actual test setup stuff ...

        tx.commit().await.unwrap();

        // This "fixes" the "problem"
        // pool.close().await;
    }

    // Generate a test so we can fit this all on one screen.
    macro_rules! t {
        ($test_name:tt) => {
            #[tokio::test]
            async fn $test_name() {
                handwavy_test_setup().await;

                // Do test stuff
                tokio::time::sleep(std::time::Duration::from_millis(10)).await;
            }
        };
    }

    #[rustfmt::skip]
    mod actual_tests {
        use super::*;

        t!(a0); t!(a1); t!(a2); t!(a3); t!(a4); t!(a5); t!(a6); t!(a7); t!(a8); t!(a9);
        t!(b0); t!(b1); t!(b2); t!(b3); t!(b4); t!(b5); t!(b6); t!(b7); t!(b8); t!(b9);
        t!(c0); t!(c1); t!(c2); t!(c3); t!(c4); t!(c5); t!(c6); t!(c7); t!(c8); t!(c9);
        t!(d0); t!(d1); t!(d2); t!(d3); t!(d4); t!(d5); t!(d6); t!(d7); t!(d8); t!(d9);
        t!(e0); t!(e1); t!(e2); t!(e3); t!(e4); t!(e5); t!(e6); t!(e7); t!(e8); t!(e9);
        t!(f0); t!(f1); t!(f2); t!(f3); t!(f4); t!(f5); t!(f6); t!(f7); t!(f8); t!(f9);
        t!(g0); t!(g1); t!(g2); t!(g3); t!(g4); t!(g5); t!(g6); t!(g7); t!(g8); t!(g9);
        t!(h0); t!(h1); t!(h2); t!(h3); t!(h4); t!(h5); t!(h6); t!(h7); t!(h8); t!(h9);
        t!(i0); t!(i1); t!(i2); t!(i3); t!(i4); t!(i5); t!(i6); t!(i7); t!(i8); t!(i9);
        t!(j0); t!(j1); t!(j2); t!(j3); t!(j4); t!(j5); t!(j6); t!(j7); t!(j8); t!(j9);
        t!(k0); t!(k1); t!(k2); t!(k3); t!(k4); t!(k5); t!(k6); t!(k7); t!(k8); t!(k9);
        t!(l0); t!(l1); t!(l2); t!(l3); t!(l4); t!(l5); t!(l6); t!(l7); t!(l8); t!(l9);
    }
}

@abonander (Collaborator)

After trying the repro locally, it does seem to be leaking connections for some reason but I can't fathom why. #1799 shouldn't have introduced it, although the extra cloning of Arc may have changed exactly which task drops things.

@abonander (Collaborator) commented May 3, 2022

Okay, it took me an embarrassingly long time to find the problem, but I think I've figured it out. In hindsight, it should have been painfully obvious from the diff of #1799, but GitHub collapsed the lines by default because they weren't changed: https://github.com/launchbadge/sqlx/pull/1799/files#diff-782a6e0cb2edbd43e45dfef48fae8534c6a8f46cddebca66256e6f88819f7ef0R397-R399

That mem::forget() was perfectly innocuous when the guard held a plain reference to the SharedPool, but now that it holds an Arc, it effectively leaks the SharedPool: that code runs when a connection is released back to the pool, permanently adding one strong reference to the Arc each time.

That means the SharedPool and any idle connections inside it will never be cleaned up, leaking the underlying TCP connections. It's still relatively low impact since the sockets will automatically be closed by the operating system when the process exits, but since you're creating a new PgPool for every test, it quickly exhausts the connection limit on the server.
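
A stripped-down illustration of that mechanism (these are not sqlx's actual
types, just the mem::forget-of-an-Arc-holder pattern):

    use std::mem;
    use std::sync::Arc;

    struct SharedPool; // stand-in for the pool's shared state

    struct Guard {
        pool: Arc<SharedPool>, // the guard now owns an Arc, not a plain reference
    }

    impl Guard {
        fn cancel(self) {
            // Forgetting `self` also forgets the Arc it owns, so the strong
            // count is never decremented.
            mem::forget(self);
        }
    }

    impl Drop for Guard {
        fn drop(&mut self) {
            // In the real pool this would do release bookkeeping;
            // `cancel()` exists to skip it.
        }
    }

    fn main() {
        let pool = Arc::new(SharedPool);
        for _ in 0..3 {
            Guard { pool: Arc::clone(&pool) }.cancel();
        }
        // Three strong references have leaked; the count can never reach zero now.
        assert_eq!(Arc::strong_count(&pool), 4);
    }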

@mlodato517 (Author)

That seems incredibly sneaky and I find this not at all an embarrassingly long time for debugging 😆

And, for me, that link doesn't obviously go anywhere (maybe because the lines are collapsed by default and so GH can't jump to the hidden line or something?)

@abonander (Collaborator)

It's here on master:

pub fn cancel(self) {
    mem::forget(self);
}

@kevinji commented Jun 10, 2022

Would it be sufficient to call drop(self.pool) before that mem::forget call to decrement the Arc?
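
As an aside (and not necessarily what #1901 ended up doing): drop(self.pool)
would not compile as written, assuming the guard implements Drop, which is
presumably why cancel() uses mem::forget in the first place; you can't move a
field out of a type that has a Drop impl. A hypothetical workaround is to hold
the Arc in an Option so cancel() can take and drop it before forgetting the
rest:

    use std::mem;
    use std::sync::Arc;

    struct SharedPool;

    struct Guard {
        // Hypothetical: an Option lets `cancel` move the Arc out and drop it
        // even though `Guard` has a `Drop` impl.
        pool: Option<Arc<SharedPool>>,
    }

    impl Guard {
        fn cancel(mut self) {
            drop(self.pool.take()); // decrement the strong count
            mem::forget(self);      // then skip the rest of the destructor
        }
    }

    impl Drop for Guard {
        fn drop(&mut self) {
            // Normal release bookkeeping would run here.
        }
    }

    fn main() {
        let pool = Arc::new(SharedPool);
        Guard { pool: Some(Arc::clone(&pool)) }.cancel();
        // The clone was dropped inside `cancel`, so only the original remains.
        assert_eq!(Arc::strong_count(&pool), 1);
    }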
