WIP feat: implement metrics-gathering support for Pool
#1908
Conversation
Wow don't I feel like my time spent on the PRs, and trying to work with you, was worth it. What a lovely way to work with people. Sorry @abonander but I too am tired of this - I'll work off a fork instead.
I'm sorry that you feel that way. The main goal here was to just get everything out that was in my head rather than me continuing to fail at describing it. I was looking forward to your feedback on it.
Some feedback, if you don't mind: I think this would have been good context to communicate in the PR description. Having a competing implementation could be helpful here, and sometimes it's easier to just write an implementation than to describe an approach. "Show, don't tell" can be very powerful. I'm in favour of having two designs/implementations to compare and contrast.

That said, if I was in Dom's position I think I would have felt and reacted in the same way to the opening message. When I first read it I too interpreted it as dismissive of the effort put in, which I'm sure wasn't the intention. Communication is hard.

I, for one, am very appreciative of the effort you're both putting in, and I think the results are going to be very helpful for the whole community. sqlx could become a really good example of how to do good observability in this domain.

Can I ask, have either of you looked at how this problem is handled in similar systems, either in Rust or in other languages? If we find something, it might help guide our decision. If we don't, it means we might be leading the way, which is exciting! I'm happy to help do some research if you both think this would be valuable.
Yeah, I probably should have been more careful with my choice of words, but it seemed like we were basically at an impasse anyway. It happens.

As for prior art, I don't see any comparable examples in Rust. Deadpool has
Looking at other languages, Go's

Java is, of course, a mess. The one connection pool implementation I'm familiar with is in Vert.x, but neither that interface nor its implementors look like they provide any sort of metrics. The PostgreSQL JDBC driver provides a

Looking around for other connection pools in Java, there's C3P0, which has a number of metrics on it, but also poll-based. Various StackOverflow answers recommend either that or DBCP or BoneCP, but also mention that none of those is really being worked on anymore. From BoneCP's README, I finally landed on HikariCP, which purports to be one of the best connection pools for Java.

And what do you know, finally someone else with pluggable metrics. Having never seen this before today, it looks surprisingly similar to what I ended up with--it appears they also decided to track acquire time separately from timeouts--although of course since it's Java there's no documentation on these methods whatsoever, so we've only got the names to go off of unless we want to dig through the source.
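For comparison, a minimal sketch of what a pluggable hook along those lines could look like in Rust, assuming a hypothetical `PoolMetricsObserver` trait (every name here is invented for illustration and is not sqlx's or HikariCP's actual API):

```rust
use std::time::Duration;

/// Hypothetical observer trait that a pool could call into at key lifecycle
/// points. All names here are illustrative, not part of sqlx's API.
pub trait PoolMetricsObserver: Send + Sync + 'static {
    /// Called when a connection is successfully handed out, with the time
    /// the caller spent waiting for it.
    fn acquire_succeeded(&self, _wait: Duration) {}

    /// Called when an acquire attempt gives up. Reported separately from
    /// successful waits, mirroring the "acquire time vs. timeouts" split.
    fn acquire_timed_out(&self, _waited: Duration) {}

    /// Called when a connection is returned to the pool, with how long it
    /// was checked out.
    fn connection_released(&self, _held_for: Duration) {}
}
```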
It might be interesting for connections to be able to report parts of the code where they were held for a long time without being used.
Ooh, good find. I had a quick look around and I think (I could be mistaken) that when a timeout occurs, it both:

- records the time spent waiting in the acquire-time metric, and
- records the timeout itself.

This way, the wait time metric isn't missing important information about long waits which happened to time out.
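A rough sketch of how an acquire path could feed both signals, reusing the hypothetical `PoolMetricsObserver` trait sketched above (the deadline handling via `tokio::time::timeout` and all other names are assumptions, not sqlx's actual code):

```rust
use std::future::Future;
use std::time::{Duration, Instant};

/// Illustrative only: wrap an acquire future so that both successful waits
/// and timed-out waits report how long they took. `None` in the error
/// position stands in for a timeout in this sketch.
async fn acquire_timed<F, C, E>(
    acquire: F,
    timeout: Duration,
    observer: &dyn PoolMetricsObserver,
) -> Result<C, Option<E>>
where
    F: Future<Output = Result<C, E>>,
{
    let started = Instant::now();
    match tokio::time::timeout(timeout, acquire).await {
        Ok(Ok(conn)) => {
            observer.acquire_succeeded(started.elapsed());
            Ok(conn)
        }
        Ok(Err(e)) => Err(Some(e)),
        Err(_) => {
            // Record the elapsed wait as well as the timeout itself, so long
            // waits that ultimately failed still show up in the wait metric.
            observer.acquire_timed_out(started.elapsed());
            Err(None)
        }
    }
}
```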
Hmm, I can see it being useful to track this. It's essentially the U in USE, right? That's what I had down in #1896.

It would be quite nice to be able to get a % utilisation metric, based on the sample period and the maximum number of connections. E.g. in the previous 10 seconds, connections were used for 4.5 seconds = 45% utilisation (see the sketch below).

As I understand it, there will generally be two reasons why a connection pool starts to saturate:

1. more queries are being run, or queries are taking longer, or
2. connections are being held checked out for longer without necessarily running queries on them.

Having this metric would make it obvious when 2 was happening. Importantly, it might not show up on other metrics, e.g. query duration metrics. Imagine a scenario where a service:

- acquires a connection,
- runs a quick query,
- then holds the connection for a while doing something else before releasing it.

Query durations would be quick, RPS might be low, but the pool is still saturating. Maybe I'm missing something here! Maybe not important for a first iteration, but perhaps something to keep in mind.
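A small sketch of the arithmetic behind that % utilisation figure, as a hypothetical helper (normalising by `max_connections` is one reading of the example; a pool that can report total checked-out time per window is assumed):

```rust
use std::time::Duration;

/// Hypothetical helper: % utilisation over a sample window, where
/// `checked_out` is the total time connections spent checked out during the
/// window (summed across connections) and `max_connections` is the pool cap.
fn utilisation_pct(checked_out: Duration, window: Duration, max_connections: u32) -> f64 {
    100.0 * checked_out.as_secs_f64() / (window.as_secs_f64() * max_connections as f64)
}

fn main() {
    // The example above: 4.5s of checked-out time in a 10s window, with a
    // single-connection pool, gives 45% utilisation.
    let pct = utilisation_pct(Duration::from_millis(4500), Duration::from_secs(10), 1);
    assert!((pct - 45.0).abs() < 1e-9);
    println!("{pct}% utilisation");
}
```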
Exactly.
This looks not bad, actually. Could use some more detail. The S in USE is covered by

Which isn't to say that a more comprehensive implementation which allows us to create e.g. histograms wouldn't be better.
I'm afraid I'm not familiar with the USE acronym you've been referencing, and Googling it wasn't much help. I inferred that it stood for "Utilization, Saturation, Errors" and finally found the page describing it: https://www.brendangregg.com/usemethod.html

It's a decent set of first principles, though it's rather reductive. Still, if we're to use this method as a guide to designing metrics for
Then

Tracking the actual timing of things I think would be left to the application, e.g. tracking the response time for a particular API call handler. Tracking the timing of individual sub-calls would be better suited to

The problem with tracking the time a checked-out connection is idle is that it may be necessary depending on the application, and a single metric wouldn't exactly tell you where in your code that's happening. There are plenty of legitimate situations where you need to keep a connection checked out even if you're not executing a query on it at that exact moment, usually because you're doing non-idempotent stuff elsewhere while in a transaction. I could imagine a lint that would warn about a connection being held across an
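For reference, a minimal sketch of how the U/S/E split could map onto simple pool counters, under assumed names (this is illustrative only, not sqlx's API, and how the values get maintained is left open):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical mapping of the USE method onto pool counters/gauges.
#[derive(Default)]
pub struct PoolUseMetrics {
    /// Utilisation: connections currently checked out of the pool.
    pub connections_in_use: AtomicU64,
    /// Saturation: acquire attempts that had to wait because no connection
    /// was free at the time.
    pub acquires_waited: AtomicU64,
    /// Errors: acquire attempts that gave up after hitting the timeout.
    pub acquire_timeouts: AtomicU64,
}

impl PoolUseMetrics {
    /// A consumer could read this periodically and export it however it
    /// likes (Prometheus gauges, logs, ...).
    pub fn snapshot(&self) -> (u64, u64, u64) {
        (
            self.connections_in_use.load(Ordering::Relaxed),
            self.acquires_waited.load(Ordering::Relaxed),
            self.acquire_timeouts.load(Ordering::Relaxed),
        )
    }
}
```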
Yes, it's a Brendan Gregg thing. More info here: #1896. Sorry, perhaps I should have linked to this earlier. Keen to hear your thoughts. As for sampling, I made a case against it in the above issue. I've used sampling before for this, and while it is useful it's not as good as measuring time spent because sampling loses the information between the samples. I wonder if we have different ideas of what we mean by saturation here. Saturation happens when we have run out of a resource. In this case, any time you can't immediately get a connection because the pool is fully utilised is a case where it's saturating, and I'd want to know when that's happening. I started writing an implementation here which I didn't take very far. The idea was to increment the saturation metric in the case where we can't get a connection because there are none spare. Knowing when we hit the timeout is also useful, but I think of that as a different problem.
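A tiny sketch of where that increment could live, reusing the hypothetical `PoolUseMetrics` struct above (the checkout condition here is invented for illustration):

```rust
/// Illustrative only: bump the saturation counter at the point where acquire
/// finds no idle connection and the pool is already at its cap.
fn note_if_saturated(metrics: &PoolUseMetrics, idle: usize, size: u32, max_size: u32) {
    if idle == 0 && size >= max_size {
        // Nothing spare and no room to grow: the caller is about to wait.
        metrics
            .acquires_waited
            .fetch_add(1, std::sync::atomic::Ordering::Relaxed);
    }
}
```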
Yeah, if I understand correctly then I can see this working well for my intended use-case. If it's possible to build something on top of sqlx that does the timings necessary to produce these metrics then that is good enough in my opinion.
For the connection pool utilisation I'm more interested in time spent checked out than time spent idle/busy. I am interested in tracking time spent running queries too though! See the RED metrics part of the above issue. I had a look and it seemed trickier, at least for the pool, to track and report this information. I had an idea where a connection could "report back" how long it had spent running queries when it's returned to the pool, but wasn't sure about that design.
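One very rough shape the "report back on return" idea could take, purely as a sketch with invented names (nothing here is sqlx's actual design):

```rust
use std::time::Duration;

/// Hypothetical wrapper: the connection accumulates time spent running
/// queries while it is checked out, and hands the total to the pool when
/// it is released.
struct TimedConnection<C> {
    inner: C,
    query_time: Duration,
}

impl<C> TimedConnection<C> {
    /// The query path (or an RAII guard around it) adds each query's
    /// duration to the running total for this checkout.
    fn record_query(&mut self, took: Duration) {
        self.query_time += took;
    }

    /// Called by the pool when the connection comes back; the pool can feed
    /// `query_time` into a "time spent running queries" metric.
    fn release(self) -> (C, Duration) {
        (self.inner, self.query_time)
    }
}
```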
You might have to run this one past me slowly 😅. It seems to me that there would be better ways of achieving whatever the goal is here, but it might help to have an example use-case. That might be a conversation for elsewhere!
@abonander You may be interested in the telemetry events currently being used by Ecto (Elixir). More info about these events is shared here.

The following GitHub issue includes a discussion that highlights the relevance of tracking idle connections.
Closing this as I don't currently have the bandwidth to finish it, but supporting metrics is a feature we want eventually.
Got tired of debating the design of #1900 and wanted to try writing the API from scratch myself.
Based on #1901
TODO: tests
cc @domodwyer @ThomWright