
fix query cancellation #2077

Closed

Conversation

@TalkingFoxMid (Contributor) commented Aug 5, 2024

After migrating Doobie to Cats Effect 3 (CE3), query cancellation no longer works as expected. When a query is canceled (via Fiber#cancel), the cancelling fiber blocks and PreparedStatement.close is not called immediately. This happens because in CE3, cancelling a fiber that is blocked on a thread does not run its cancellation handlers right away, while PreparedStatement.close needs to be called from a cancellation handler.

Here is a code example that demonstrates this behavior. In CE2, the cancellation handler runs immediately, but in CE3, it only runs after the thread becomes unblocked:

import cats.effect.{IO, IOApp}

object OnCancelTest extends IOApp.Simple {
  import scala.concurrent.duration._

  def goDB: IO[Unit] =
    IO(println("START goDB"))
      .map(_ => Thread.sleep(10000))
      .guarantee(IO(println("SEND CANCELLATION VIA JDBC")))

  override def run: IO[Unit] =
    for {
      f <- goDB.start
      _ <- IO.sleep(1.second)
      _ <- f.cancel
      _ <- IO(println("CANCELLED goDB"))
    } yield ()
}

I also added a unit test which reproduces this behaviour.

A more appropriate alternative solution: #2079 (comment)

@jatcwang (Collaborator) commented Aug 5, 2024

Thanks for trying this out @TalkingFoxMid. Have you seen the comments in #1922?

One of the things we want to do is to signal to the database that it should cancel the running query on the database side too.
I believe that we might need PreparedStatement#cancel for that. If you want to simulate a long-running query in Postgres, you can try using pg_sleep; that way you can test whether a query was really cancelled on the database side.
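
For illustration, here is roughly what such a pg_sleep check could look like with Doobie (a minimal sketch; slowQuery, demo, and the SQL are illustrative and not part of this PR):

import scala.concurrent.duration._
import cats.effect.IO
import doobie._
import doobie.implicits._

// Sleeps 10 seconds on the server side; if cancellation is propagated to JDBC,
// cancelling this fiber should also cancel the query in Postgres
// (which can be checked in pg_stat_activity).
def slowQuery(xa: Transactor[IO]): IO[Long] =
  sql"select count(*) from (select pg_sleep(10)) s".query[Long].unique.transact(xa)

def demo(xa: Transactor[IO]): IO[Unit] =
  for {
    fiber <- slowQuery(xa).start
    _     <- IO.sleep(1.second)
    _     <- fiber.cancel // should return promptly and cancel the query server-side
  } yield ()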

@TalkingFoxMid force-pushed the fix_query_cancellation branch from 67e6830 to e32557f on August 5, 2024 09:28
@TalkingFoxMid (Contributor, Author)

> Thanks for trying this out @TalkingFoxMid. Have you seen the comments in #1922?
>
> One of the things we want to do is to signal to the database that it should cancel the running query on the database side too. I believe that we might need PreparedStatement#cancel for that. If you want to simulate a long-running query in Postgres, you can try using pg_sleep; that way you can test whether a query was really cancelled on the database side.

Thanks for the reply. Cancellation on the database side is already working in this solution. I verified it with the following test:

  1. Fiber 1 locks the table.
  2. Fiber 2 attempts to insert data.
  3. Cancel Fiber 2.
  4. Fiber 1 unlocks the table.
  5. Select from the table → The table is empty.

This can also be demonstrated in the QueryCancellationSuite by adding a join on the lock fiber before the final table read. I believe that in this case, the insert query (2) was canceled on the DB side.
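
For reference, a rough sketch of that test flow (the table, SQL, and helper names are illustrative, not the actual QueryCancellationSuite code):

import scala.concurrent.duration._
import cats.syntax.all._
import cats.effect.IO
import doobie._
import doobie.implicits._

// 1. Fiber 1 locks the table and holds the lock; 2. Fiber 2 tries to insert
// and is cancelled while blocked; 3. Fiber 1 releases the lock;
// 4. the table should end up empty.
def cancellationTest(xa: Transactor[IO]): IO[Boolean] =
  for {
    locker <- (sql"lock table test_table in access exclusive mode".update.run *>
                 sql"select count(*) from (select pg_sleep(3)) s".query[Long].unique)
                .transact(xa).start
    _      <- IO.sleep(500.millis)
    ins    <- sql"insert into test_table (v) values ('x')".update.run.transact(xa).start
    _      <- IO.sleep(500.millis)
    _      <- ins.cancel   // should cancel the blocked insert on the DB side
    _      <- locker.join  // fiber 1's transaction ends here, releasing the lock
    rows   <- sql"select count(*) from test_table".query[Long].unique.transact(xa)
  } yield rows == 0L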

@TalkingFoxMid (Contributor, Author)

> Thanks for trying this out @TalkingFoxMid. Have you seen the comments in #1922?
>
> One of the things we want to do is to signal to the database that it should cancel the running query on the database side too. I believe that we might need PreparedStatement#cancel for that. If you want to simulate a long-running query in Postgres, you can try using pg_sleep; that way you can test whether a query was really cancelled on the database side.

But I'll also try testing this with pg_sleep to confirm. Thanks!

@TalkingFoxMid force-pushed the fix_query_cancellation branch 2 times, most recently from 2d86feb to a492eb4 on August 5, 2024 10:28
@TalkingFoxMid force-pushed the fix_query_cancellation branch from a492eb4 to cc3c5a0 on August 5, 2024 12:16
Comment on lines 106 to 107
jdbcBlockedFiber <- asyncM.start(asyncM.blocking(f(a)).attempt)
a <- jdbcBlockedFiber.join.flatMap {
@satorg (Contributor) commented Aug 5, 2024

I've got some concerns regarding this code. They may or may not be valid, but let me spell them out:

  • Sync#blocking is non-cancellable by design. This means the execution pipeline can be cancelled either before entering blocking or right after it completes, but not inside it. Therefore it can only cancel the fiber that runs the query, but not the operation itself.
  • There can be a potential fiber leak with this code: if a cancel signal comes when the fiber is created but before it gets joined, then the fiber leaks (by the way, I believe this is one of the reasons why background is almost always preferable over low-level fiber start/join manipulations).
  • Starting a fiber and then joining it immediately afterwards looks quite suspicious. What exactly is it intended to accomplish?
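
For context, a minimal sketch of the start/join shape versus the background alternative mentioned above (illustrative only, not the PR code):

import cats.effect.IO

// start/join: if cancellation arrives between start and join, the started
// fiber can be left running with nothing supervising it.
def viaStartJoin(task: IO[Int]): IO[Int] =
  task.start.flatMap(_.joinWithNever)

// background: the fiber is supervised by a Resource, so releasing the
// Resource (including on cancellation) always cancels and awaits the fiber.
def viaBackground(task: IO[Int]): IO[Int] =
  task.background.use(outcome => outcome.flatMap(_.embedNever))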

@TalkingFoxMid (Contributor, Author) commented Aug 6, 2024

> Therefore it can only cancel the fiber that runs the query, but not the operation itself.

Yes, that's true. But the cancellation handler that sends the cancel query to the DB is now on a fiber that can be cancelled. So the blocked fiber will also be cancelled from the JDBC side (the database unblocks the JDBC thread).

> There can be a potential fiber leak with this code

I think there is no fiber leak: the asyncM.blocking fiber will continue running until the JDBC thread finishes dealing with the query. When the JDBC query completes and the thread is unblocked, that fiber will be cleaned up. Doobie currently works in exactly the same way.

About background:
There is a problem with that approach here. The cancellation of the outer fiber also tries to cancel the jdbcBlockedFiber, but the cancellation of jdbcBlockedFiber blocks and doesn't complete. So using background doesn't solve the problem.

> What exactly is it intended to accomplish?

You cannot cancel jdbcBlockedFiber because it is running on a blocked JDBC thread. However, you can cancel the jdbcBlockedFiber.join.start fiber, because it is suspended asynchronously rather than blocked on a thread. This allows the cancellation handler, which sends a cancel query to the database, to be executed. Please refer to the example code in the description, which illustrates the problem.
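
A condensed, self-contained illustration of that trick (the blocking call and cancel handler are simulated; this is not the actual interpreter code):

import scala.concurrent.duration._
import cats.effect.{IO, IOApp}

object JoinTrickDemo extends IOApp.Simple {

  // Simulates a JDBC call that blocks its thread for 10 seconds.
  def blockingJdbcCall: IO[Unit] = IO.blocking(Thread.sleep(10000))

  // Simulates sending PreparedStatement.cancel/close from a cancellation handler.
  def sendCancelToDb: IO[Unit] = IO(println("SEND CANCELLATION VIA JDBC"))

  def run: IO[Unit] =
    for {
      // The blocking call runs on its own fiber; the outer fiber only joins it,
      // so the outer fiber is suspended asynchronously and can be cancelled,
      // which lets the onCancel handler fire immediately.
      outer <- (for {
                 blocked <- blockingJdbcCall.attempt.start
                 _       <- blocked.join.onCancel(sendCancelToDb)
               } yield ()).start
      _ <- IO.sleep(1.second)
      _ <- outer.cancel // returns quickly; the cancel handler has already run
      _ <- IO(println("outer fiber cancelled"))
    } yield ()
}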

@satorg (Contributor)

> I think there is no fiber leak: the asyncM.blocking fiber will continue running until the JDBC thread finishes dealing with the query. When the JDBC query completes and the thread is unblocked, that fiber will be cleaned up.

Actually, this is what I would call a "fiber leak": the fiber goes out of our control at this point and lives solely at the mercy of JDBC.

> This allows the cancellation handler, which sends a cancel query to the database, to be executed.

Could you elaborate a bit – why do you think a cancel query is sent to the database in this case? Where is it happening exactly?

I see that you are trying to reproduce it with the test. However, I'm not sure it passes because the query actually gets cancelled gracefully. Perhaps it passes just because the connection to the DB gets closed after the cancellation, which in turn cancels the query. That may not always work with connection pooling, for example.

> There is a problem with that approach here. The cancellation of the outer fiber also tries to cancel the jdbcBlockedFiber, but the cancellation of jdbcBlockedFiber blocks and doesn't complete.

Yes, exactly. That's because it guarantees there's no leak in that case. I feel it was designed that way for a good reason.

@TalkingFoxMid (Contributor, Author)

> Actually, this is what I would call a "fiber leak": the fiber goes out of our control at this point and lives solely at the mercy of JDBC.

There is no way in CE3 to cancel a fiber running on a blocked thread, I suppose. So the fiber lives solely at the mercy of JDBC anyway.

@TalkingFoxMid (Contributor, Author)

> Could you elaborate a bit – why do you think a cancel query is sent to the database in this case? Where is it happening exactly?

I added a debug log to PreparedStatementInterpreter.close to check if the raw JDBC method (PreparedStatement.close) was called after the fiber was canceled.

@armanbilge (Member) commented Aug 7, 2024

> If the database is slow to respond, the close operation might not execute instantly. The delay also occurs because the cancellation request has to be sent to the database over the network, which can introduce additional latency. Consequently, the timeout might not cancel the query after the intended 300ms, but instead after 600ms or more, due to this added latency (and this causes user requests to get stuck).

@TalkingFoxMid Thanks for explaining. If I understood correctly, what you are describing is not a limitation of cancelable but actually a general feature of Cats Effect, which is that it backpressures on cancelation, i.e. if you request cancelation of some fiber, it will wait until that cancelation fully completes.

Consider this example:

IO.sleep(2.seconds).onCancel(IO.sleep(1.second)).timeout(1.second)

Even though the timeout is after 1 second, it will take 2 seconds to complete because of the onCancel finalizer. Is this what you are considering to be a limitation?

@TalkingFoxMid (Contributor, Author)

> If the database is slow to respond, the close operation might not execute instantly. The delay also occurs because the cancellation request has to be sent to the database over the network, which can introduce additional latency. Consequently, the timeout might not cancel the query after the intended 300ms, but instead after 600ms or more, due to this added latency (and this causes user requests to get stuck).
>
> @TalkingFoxMid Thanks for explaining. If I understood correctly, what you are describing is not a limitation of cancelable but actually a general feature of Cats Effect, which is that it backpressures on cancelation, i.e. if you request cancelation of some fiber, it will wait until that cancelation fully completes.
>
> Consider this example:
>
> IO.sleep(2.seconds).onCancel(IO.sleep(1.second)).timeout(1.second)
>
> Even though the timeout is after 1 second, it will take 2 seconds to complete because of the onCancel finalizer. Is this what you are considering to be a limitation?

Consider also this example:

import cats.effect.{IO, IOApp}
import scala.concurrent.duration._

object OnCancelTest extends IOApp.Simple {
  @volatile var blocked = true

  def goDB: IO[Unit] =
    IO {
      while (blocked) Thread.sleep(100)
    }.cancelable(
      (IO.sleep(5.seconds) >> IO { blocked = false }).start.void
    )

  override def run: IO[Unit] =
    for {
      fiber <- goDB.start
      _ <- IO.sleep(100.millis)
      _ <- fiber.cancel
      _ <- IO(println("Cancelled"))
    } yield ()
}

In that snippet, fiber.cancel was blocked not only because it waits for the cancelable handler to complete (the handler completes almost instantly, since it only starts another fiber). It was also blocked because the external state wasn't modified instantly, even though the cancellation handler completed successfully. This is similar to what we see in Doobie: even after a cancel query is sent to the database, there is still some latency before the database unblocks the JDBC thread running the query.

I don't think this is a limitation of CE3, but it does seem like a slight disadvantage of this approach in the context of the current problem in Doobie.

This approach can cause the user query to hang for a long time because we are trying to wait for the query to be canceled in the database. However, there are situations when you just need to time out immediately, return a default value, and continue executing the request. This is exactly our case, so this disadvantage is relevant for us.

@TalkingFoxMid (Contributor, Author)

@satorg Could you give an approval here?

@armanbilge (Member)

@TalkingFoxMid thanks for including that second example, that was very helpful.

> However, there are situations when you just need to time out immediately, return a default value, and continue executing the request. This is exactly our case, so this disadvantage is relevant for us.

Cats Effect specifically has a timeoutAndForget API for this.

I agree that sometimes behavior like this is necessary, but that should be the (careful) choice of the user. Once backpressure is removed, it's not possible to restore it. I am dubious about removing backpressure signals in a library, since then users no longer have agency to make the appropriate decision for their specific use case.
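
For comparison, a small sketch of the two behaviours using armanbilge's earlier example (illustrative only):

import scala.concurrent.duration._
import cats.effect.IO

// A task whose cancellation finalizer itself takes 1 second to run.
val task: IO[Int] = IO.sleep(2.seconds).as(42).onCancel(IO.sleep(1.second))

// Backpressured: fails with a TimeoutException only after the finalizer
// has finished, i.e. after roughly 2 seconds in total.
val backpressured: IO[Int] = task.timeout(1.second)

// Fire-and-forget: fails with a TimeoutException after roughly 1 second and
// lets the cancellation (and its finalizer) complete in the background.
val forgetful: IO[Int] = task.timeoutAndForget(1.second)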

@satorg (Contributor) commented Aug 9, 2024

@TalkingFoxMid,

> However, there are situations when you just need to time out immediately, return a default value, and continue executing the request.

The catch here is that most likely there will be no immediate relief whatsoever. As @jatcwang found out recently, even if we give up here immediately, the query will be canceled as part of a bracket's release effect at the end of the transaction. Therefore, in the case of a long-running cancel query, the entire transaction will still have to wait until the cancel query completes anyway.

mergify bot added the postgres label on Aug 7, 2024
@TalkingFoxMid force-pushed the fix_query_cancellation branch from 4c0f87e to 4cd42e8 on August 8, 2024 13:34
config.setUsername("postgres")
config.setPassword("password")
config.setPassword("mysecretpassword")
@jatcwang (Collaborator)

These should be reverted back to jdbc:postgresql:world right?

}

object Primitive {
case class Default[M[_]]()(implicit asyncM: WeakAsync[M]) extends Primitive[M] {
@satorg (Contributor)

In general, it is recommended to avoid case classes in publicly available libraries unless absolutely necessary. I guess the primary reason for using one here was to get a static apply method. That is totally understandable, but case classes generate a lot of other things which make them difficult to deal with in terms of maintaining binary compatibility: for example, a copy method, inheriting Product, and, last but not least, the notorious unapply static method.

A regular class should work just as well here, and if the static apply is desirable, it's better to add it manually.
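
A small sketch of what that could look like (WeakAsync and Primitive are stubbed here so the snippet stands alone; the real implementation body is omitted):

// Minimal stand-ins so the sketch compiles on its own; in the PR these are
// doobie's WeakAsync and Primitive.
trait WeakAsync[M[_]]
trait Primitive[M[_]]

object Primitive {
  // A plain class instead of a case class: no copy/unapply/Product to keep
  // binary-compatible over time, while a manual apply keeps the same
  // construction syntax as before.
  final class Default[M[_]](implicit val asyncM: WeakAsync[M]) extends Primitive[M]

  object Default {
    def apply[M[_]]()(implicit asyncM: WeakAsync[M]): Default[M] = new Default[M]()
  }
}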

@satorg (Contributor)

The same applies to the case class Cancelable below.

Comment on lines +82 to +85
class KleisliInterpreter[M[_]](logHandler: LogHandler[M], customPrimitive: Option[Primitive[M]] = None)(implicit val asyncM: WeakAsync[M]) { outer =>

private val _primitive = customPrimitive.getOrElse(Primitive.Default[M]())

@satorg (Contributor)

Just a suggestion here to consider:

Suggested change:

- class KleisliInterpreter[M[_]](logHandler: LogHandler[M], customPrimitive: Option[Primitive[M]] = None)(implicit val asyncM: WeakAsync[M]) { outer =>
-   private val _primitive = customPrimitive.getOrElse(Primitive.Default[M]())
+ class KleisliInterpreter[M[_]](logHandler: LogHandler[M], customPrimitive: Primitive[M])(implicit val asyncM: WeakAsync[M]) { outer =>
+   def this(logHandler: LogHandler[M])(implicit asyncM: WeakAsync[M]) =
+     this(logHandler, Primitive.Default[M]())

Now you can use customPrimitive directly, i.e. the overloaded constructor helps avoid dealing with Option.

@jatcwang (Collaborator)

Superseded by #2079

@jatcwang closed this on Aug 19, 2024