
WIP feat: implement metrics-gathering support for Pool #1908

Closed

Conversation

abonander
Collaborator

Got tired of debating the design of #1900 and wanted to try writing the API from scratch myself.

Based on #1901

TODO: tests

cc @domodwyer @ThomWright

@abonander abonander changed the base branch from master to ab/fix-pool June 16, 2022 02:53
@domodwyer

Wow don't I feel like my time spent on the PRs, and trying to work with you, was worth it. What a lovely way to work with people.

Sorry @abonander but I too am tired of this - I'll work off a fork instead.

@abonander
Collaborator Author

I'm sorry that you feel that way. The main goal here was to just get everything out that was in my head rather than me continuing to fail at describing it. I was looking forward to your feedback on it.

Base automatically changed from ab/fix-pool to master June 16, 2022 19:56
@ThomWright

ThomWright commented Jun 17, 2022

> I'm sorry that you feel that way. The main goal here was to just get everything out that was in my head rather than me continuing to fail at describing it. I was looking forward to your feedback on it.

Some feedback, if you don't mind: I think this would have been good context to communicate in the PR description.

Having a competing implementation could be helpful here, and sometimes it's easier to just write an implementation than to describe an approach. "Show, don't tell" can be very powerful. I'm in favour of having two designs/implementations to compare and contrast.

That said, if I was in Dom's position I think I would have felt and reacted in the same way to the opening message. When I first read it I too interpreted it as dismissive of the effort put in, which I'm sure wasn't the intention. Communication is hard.

I, for one, am very appreciative of the effort you're both putting in, and I think the results are going to be very helpful for the whole community. sqlx could become a really good example of how to do good observability in this domain.

Can I ask, have either of you looked at how this problem is handled in similar systems either in Rust or in other languages? If we find something, it might help guide our decision. If we don't, it means we might be leading the way, which is exciting! I'm happy to help do some research if you both think this would be valuable.

@abonander
Collaborator Author

Yeah, I probably should have been more careful with my choice of words but it seemed like we were basically at an impasse anyways. It happens.

As for prior art, I don't see any comparable examples in Rust. deadpool, bb8 and r2d2 all have very rudimentary "metrics", if you can call them that:

Deadpool has Metrics, comparable to the PoolConnectionMetadata type I just added for 0.6.0, and Status, which just gives the number of objects idle/available/total in the pool. (Those links are from deadpool_diesel but they're just re-exports from deadpool; I was checking whether the concrete Manager implementation had more on it.)

bb8 has State comparable to deadpool::Status.

r2d2 has an identical State type, which makes sense since bb8 explicitly draws from its API design.
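
All three reduce to roughly the same poll-style snapshot; a paraphrased sketch (field names vary slightly between the crates):

```rust
/// Rough common shape of deadpool's `Status`, bb8's `State`, and r2d2's
/// `State` (paraphrased, not the exact definition from any of them).
#[derive(Debug, Clone, Copy)]
pub struct PoolStatus {
    /// Connections currently sitting idle in the pool.
    pub idle: usize,
    /// Total connections managed by the pool, idle and checked out.
    pub size: usize,
    /// The limit the pool will grow to.
    pub max_size: usize,
}

impl PoolStatus {
    /// About all you can derive from a snapshot like this.
    pub fn checked_out(&self) -> usize {
        self.size - self.idle
    }
}
```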

Looking at other languages, Go's database/sql package has DBStats, which has a bit more detail, but it's also just something you have to poll for; there don't appear to be any user-definable hooks.

Java is, of course, a mess. The one connection pool implementation I'm familiar with is in Vert.x, but neither that interface nor its implementors appears to provide any sort of metrics.

The PostgreSQL JDBC driver provides a DataSource implementation with a connection pool but has no metrics and even says right in the docs that you probably want to be using some other connection pool instead.

Looking around for other connection pools in Java, there's C3P0, which has a number of metrics on it, but again poll-based. Various StackOverflow answers recommend either that or DBCP or BoneCP, but also mention that none of those is really being worked on anymore.

From BoneCP's README, I finally landed on HikariCP which purports to be one of the best connection pools for Java. And what do you know, finally someone else with pluggable metrics.

Having never seen this before today, it looks surprisingly similar to what I ended up with--it appears they also decided to track acquire time separately from timeouts--although of course, since it's Java, there's no documentation on these methods whatsoever, so we've only got the names to go off of unless we want to dig through the source.

recordConnectionUsageMillis is intriguing, although I'm not sure how useful it is just knowing how long a connection was checked out.

It might be interesting for connections to be able to report parts of the code where they were held for a long time without being used.

@ThomWright

ThomWright commented Jun 20, 2022

> From BoneCP's README, I finally landed on HikariCP which purports to be one of the best connection pools for Java. And what do you know, finally someone else with pluggable metrics.

> Having never seen this before today, it looks surprisingly similar to what I ended up with--it appears they also decided to track acquire time separately from timeouts--although of course, since it's Java, there's no documentation on these methods whatsoever, so we've only got the names to go off of unless we want to dig through the source.

Ooh, good find. I had a quick look around and I think (I could be mistaken) that when a timeout occurs, it both:

  • records that a timeout occurred, and
  • records the wait time.

This way, the wait time metric isn't missing important information about long waits which happened to time out.
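
If that reading is right, the same policy might look something like this in Rust (a sketch; the observer trait and its methods are hypothetical, not anything from this PR):

```rust
use std::time::{Duration, Instant};

/// Hypothetical observer, for illustration only.
trait AcquireMetrics {
    fn record_wait_time(&self, waited: Duration);
    fn record_timeout(&self);
}

struct AcquireTimedOut;

/// Record the wait time unconditionally, timeouts included, so that long
/// waits which happen to time out still show up in the wait-time metric.
fn acquire_with_metrics<C>(
    metrics: &dyn AcquireMetrics,
    try_acquire: impl FnOnce() -> Result<C, AcquireTimedOut>,
) -> Result<C, AcquireTimedOut> {
    let start = Instant::now();
    let result = try_acquire();
    metrics.record_wait_time(start.elapsed());
    if result.is_err() {
        metrics.record_timeout();
    }
    result
}
```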

> recordConnectionUsageMillis is intriguing, although I'm not sure how useful it is just knowing how long a connection was checked out.

Hmm, I can see it being useful to track this. It's essentially the U in USE, right? That's what I had down in #1896. It would be quite nice to be able to get a % utilisation metric, based on the sample period and the maximum number of connections. E.g. a pool with one connection which was in use for 4.5 of the previous 10 seconds = 45% utilisation.
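
To pin down the arithmetic (a sketch, assuming the pool can track total checked-out time per sample window):

```rust
use std::time::Duration;

/// Utilisation over a sample window: total connection busy time divided by
/// (window length * pool capacity). With max_size = 1 and 4.5s of use in a
/// 10s window, this gives 0.45, i.e. 45% utilisation.
fn utilisation(busy: Duration, window: Duration, max_size: u32) -> f64 {
    busy.as_secs_f64() / (window.as_secs_f64() * max_size as f64)
}
```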

As I understand it, there will generally be two reasons why a connection pool starts to saturate:

  1. More attempts to acquire connections (e.g. increased RPS to the service)
  2. Connections being kept for longer (e.g. queries taking longer, or services needlessly keeping connections)

Having this metric would make it obvious when 2 was happening. Importantly, it might not show up in other metrics, e.g. query durations. Imagine a scenario where a service:

  • takes a connection
  • executes a quick query
  • does some expensive IO
  • executes another quick query
  • releases the connection

Query durations would be quick, RPS might be low, but the pool is still saturating.
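
Sketched as code (a hypothetical handler; nothing from sqlx itself beyond the pool API):

```rust
// Each query is fast, but the connection stays checked out across the
// expensive IO in the middle, so the pool can saturate even though query
// duration and RPS metrics both look healthy.
async fn handler(pool: &sqlx::PgPool) -> Result<(), sqlx::Error> {
    let mut conn = pool.acquire().await?; // checked out here

    sqlx::query("SELECT 1").execute(&mut *conn).await?; // quick

    expensive_io().await; // connection held but unused

    sqlx::query("SELECT 2").execute(&mut *conn).await?; // quick

    Ok(()) // only now does the connection return to the pool
}

async fn expensive_io() {
    // stand-in for, e.g., a slow downstream HTTP call
    tokio::time::sleep(std::time::Duration::from_secs(2)).await;
}
```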

Maybe I'm missing something here!

Maybe not important for a first iteration, but perhaps something to keep in mind.

> It might be interesting for connections to be able to report parts of the code where they were held for a long time without being used.

Exactly.

> Looking at other languages, Go's database/sql package has DBStats, which has a bit more detail, but it's also just something you have to poll for; there don't appear to be any user-definable hooks.

This looks not bad, actually. Could use some more detail. The S in USE is covered by WaitDuration pretty well. U and E could use some more love. Another Duration for U and a counter for E would make me, personally, pretty happy. Simple and does a reasonable job.

Which isn't to say that a more comprehensive implementation which allows us to create e.g. histograms wouldn't be better.

@abonander
Collaborator Author

abonander commented Jul 13, 2022

I'm afraid I'm not familiar with the USE acronym you've been referencing, and Googling it wasn't much help. I inferred that it stood for "Utilization, Saturation, Errors" and finally found the page describing it: https://www.brendangregg.com/usemethod.html

It's a decent set of first principles, though it's rather reductive. Still, if we're to use this method as a guide to designing metrics for Pool from the bottom up, I'd imagine it'd look something like this:

  • Utilization: (size - idle) / max_size, the ratio of checked-out connections to the pool's capacity
  • Saturation: ratio of acquire() timeouts to all acquire() calls
    • Alternatively, the ratio of the time spent in acquire() to acquire_timeout (peaking above 1.0 might indicate runtime saturation as the task polling itself would be lagging)
  • Errors: the number of times a PoolInner::connect() call returned an error

Then PoolMetricsObserver would have a method for each of these that's called when the value changes. That's significantly simpler than what I came up with in this PR.
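
Roughly this shape (a sketch of the idea, not a final API; method names are illustrative):

```rust
/// Sketch of the simplified observer: one callback per USE metric, invoked
/// by the pool when the value changes.
pub trait PoolMetricsObserver: Send + Sync + 'static {
    /// Utilization: (size - idle) / max_size, reported on check-out/check-in.
    fn utilization(&self, checked_out: usize, max_size: usize) {
        let _ = (checked_out, max_size);
    }

    /// Saturation: acquire() timeouts as a fraction of all acquire() calls,
    /// reported after each acquire attempt.
    fn saturation(&self, timed_out: u64, total_acquires: u64) {
        let _ = (timed_out, total_acquires);
    }

    /// Errors: called each time `PoolInner::connect()` returns an error.
    fn connect_error(&self) {}
}
```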

Tracking the actual timing of things would, I think, be left to the application, e.g. tracking the response time for a particular API call handler. Tracking the timing of individual sub-calls would be better suited to tracing-timing.

The problem with tracking the time a checked-out connection sits idle is that holding it idle may be necessary depending on the application, and a single metric wouldn't tell you where in your code that's happening. There's plenty of legitimate situations where you need to keep a connection checked out even if you're not executing a query on it at that exact moment, usually because you're doing non-idempotent stuff elsewhere while in a transaction.

I could imagine a lint that would warn about a connection being held across an .await point that doesn't involve it, or something similar to ConnectOptions::log_slow_statements() which immediately squawks in the logs with a stack trace if a connection was idle longer than the set threshold. Either way, I'm not sure it should technically be considered a metric.
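
The squawk version might look something like this (entirely hypothetical, not a real sqlx API; just to illustrate the threshold idea):

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper in the spirit of `ConnectOptions::log_slow_statements()`:
/// warn if a checked-out connection went unused for longer than a threshold.
struct IdleWatch {
    last_used: Instant,
    threshold: Duration,
}

impl IdleWatch {
    /// Called whenever the connection is about to do work.
    fn touch(&mut self) {
        let idle = self.last_used.elapsed();
        if idle > self.threshold {
            // A real implementation might capture a backtrace here.
            log::warn!("connection idle for {idle:?} while checked out");
        }
        self.last_used = Instant::now();
    }
}
```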

@ThomWright

ThomWright commented Jul 13, 2022

Yes, it's a Brendan Gregg thing. More info here: #1896. Sorry, perhaps I should have linked to this earlier. Keen to hear your thoughts.

As for sampling, I made a case against it in the above issue. I've used sampling before for this, and while it is useful it's not as good as measuring time spent because sampling loses the information between the samples.

I wonder if we have different ideas of what we mean by saturation here. Saturation happens when we have run out of a resource. In this case, any time you can't immediately get a connection because the pool is fully utilised is a case where it's saturating, and I'd want to know when that's happening. I started writing an implementation here which I didn't take very far. The idea was to increment the saturation metric in the case where we can't get a connection because there are none spare. Knowing when we hit the timeout is also useful, but I think of that as a different problem.
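
Concretely, the increment I had in mind was something like this (a sketch; the counter is illustrative):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Sketch: count every acquire that cannot be served immediately because
/// the pool has no idle connection and is already at capacity, regardless
/// of whether the caller later times out.
struct SaturationCounter(AtomicU64);

impl SaturationCounter {
    fn on_acquire(&self, idle: usize, size: usize, max_size: usize) {
        if idle == 0 && size >= max_size {
            self.0.fetch_add(1, Ordering::Relaxed);
        }
    }
}
```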

> Then PoolMetricsObserver would have a method for each of these that's called when the value changes. That's significantly simpler than what I came up with in this PR.

Yeah, if I understand correctly then I can see this working well for my intended use-case. If it's possible to build something on top of sqlx that does the timings necessary to produce these metrics then that is good enough in my opinion.

> The problem with tracking the time a checked-out connection sits idle is that holding it idle may be necessary depending on the application, and a single metric wouldn't tell you where in your code that's happening.

For the connection pool utilisation I'm more interested in time spent checked out than time spent idle/busy. I am interested in tracking time spent running queries too, though! See the RED metrics part of the above issue. I had a look and it seemed trickier, at least for the pool, to track and report this information. I had an idea where a connection could "report back" how long it had spent running queries when it's returned to the pool, but I wasn't sure about that design.
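
The "report back" idea, very roughly (hypothetical fields and hook, just to pin down the shape):

```rust
use std::time::Duration;

/// Hypothetical: the connection accumulates time spent actually executing
/// queries while it is checked out...
struct TrackedConnection {
    busy: Duration, // bumped around each query execution
}

/// ...and the pool reads it off at check-in, e.g. to feed a histogram of
/// per-checkout busy time.
fn on_checkin(conn: &mut TrackedConnection, record: impl Fn(Duration)) {
    record(conn.busy);
    conn.busy = Duration::ZERO;
}
```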

> There's plenty of legitimate situations where you need to keep a connection checked out even if you're not executing a query on it at that exact moment, usually because you're doing non-idempotent stuff elsewhere while in a transaction.

You might have to run this one past me slowly 😅. It seems to me that there would be better ways of achieving whatever the goal is here, but it might help to have an example use-case. That might be a conversation for elsewhere!

@Dowwie

Dowwie commented Jul 19, 2022

@abonander You may be interested in the telemetry events currently being used by Ecto (Elixir)

More info about these events is shared here

    :idle_time - the time the connection spent waiting before being checked out for the query
    :queue_time - the time spent waiting to check out a database connection
    :query_time - the time spent executing the query
    :decode_time - the time spent decoding the data received from the database
    :total_time - the sum of queue_time, query_time, and decode_time
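
Translated to Rust, that event payload would be roughly (a paraphrase of Ecto's measurements, not anything sqlx exposes today):

```rust
use std::time::Duration;

/// Paraphrase of Ecto's per-query telemetry measurements.
struct QueryTimings {
    idle_time: Duration,   // connection sat idle before this checkout
    queue_time: Duration,  // waiting to check out a connection
    query_time: Duration,  // executing the query
    decode_time: Duration, // decoding the rows received
}

impl QueryTimings {
    /// `:total_time` is the sum of queue, query, and decode time.
    fn total_time(&self) -> Duration {
        self.queue_time + self.query_time + self.decode_time
    }
}
```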

The following GitHub issue includes a discussion that highlights the relevance of tracking idle connections.

@abonander
Collaborator Author

Closing this as I don't currently have the bandwidth to finish it, but supporting metrics is a feature we want eventually.

@abonander abonander closed this Sep 15, 2022