distsql: add metrics #12143
And perhaps we should split this up into 1. metrics of user interest (what kind of visibility into dist-SQL would a user care about) and 2. internals (stuff that only an advanced user/developer would care about, like the type of JOIN used, perhaps?) |
That makes sense.
|
Let's start with this list of metrics, @arjunravinarayan, and consult with @kuanluo for the admin UI part of this. |
One thing that came up was the security aspect of displaying query information. Our admin UI doesn't have any security right now, so perhaps these metrics should be explicitly enabled with a CLI flag (although we probably shouldn't overload the |
Right now I'm just focused on metrics for dogfooding (and the ones @cuongdo listed above are a good start); I haven't thought about user interest. My plan is to get some internal metrics up and useful for the DistSQL team, and then we can go from there. |
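As a rough illustration of the opt-in idea floated above, here is a minimal sketch of gating registration of DistSQL metrics behind an explicit command-line flag. The flag name, metric name, and wiring are hypothetical and not CockroachDB's actual flag or metric plumbing; it only shows the shape of "don't export the extra detail unless the operator asks for it."

```go
package main

import (
	"flag"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical flag; the real name and plumbing would differ.
var exposeDistSQLMetrics = flag.Bool(
	"expose-distsql-metrics", false,
	"export detailed DistSQL metrics (may reveal query-level information)")

// Hypothetical metric used only to illustrate conditional registration.
var distSQLQueries = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "distsql_queries_total",
	Help: "Number of DistSQL queries executed.",
})

// maybeRegisterDistSQLMetrics only exposes the metric when the operator
// has explicitly opted in via the flag.
func maybeRegisterDistSQLMetrics() {
	if *exposeDistSQLMetrics {
		prometheus.MustRegister(distSQLQueries)
	}
}

func main() {
	flag.Parse()
	maybeRegisterDistSQLMetrics()
}
```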
I think the latency and network bytes metrics should be more general than DistSQL (exposed for all queries), and then DistSQL would just be a separate dimension, even though the way network bytes are reported would probably be DistSQL-specific. |
We already have general latency metrics. My thinking is that with DistSQL being the new kid on the block in terms of our SQL code, it's good to have a separate latency metric for DistSQL queries to help determine whether future performance issues are in regular SQL or in DistSQL.
|
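To make the "separate dimension" idea concrete, here is a minimal sketch of a single query-latency histogram shared by all SQL execution, with a label that distinguishes DistSQL from regular SQL. It uses the Prometheus Go client directly rather than CockroachDB's internal metric package, so the metric name, buckets, and wiring are assumptions for illustration only.

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// One histogram for all query latencies, labeled by execution engine,
// so DistSQL shows up as a dimension rather than a separate metric.
var queryLatency = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "sql_query_latency_seconds",
		Help:    "Query latency, labeled by execution engine.",
		Buckets: prometheus.ExponentialBuckets(0.001, 2, 16), // ~1ms .. ~32s
	},
	[]string{"engine"}, // "sql" or "distsql"
)

// recordLatency observes one completed query's latency under the
// appropriate engine label.
func recordLatency(distSQL bool, d time.Duration) {
	engine := "sql"
	if distSQL {
		engine = "distsql"
	}
	queryLatency.WithLabelValues(engine).Observe(d.Seconds())
}

func main() {
	prometheus.MustRegister(queryLatency)
	recordLatency(true, 120*time.Millisecond)
}
```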
Just to recap our chat, it'd be good to get these metrics in place for production testing:
|
The nodes that actually partake in the query execution can be easily extracted, I feel, and could possibly benefit any future plans for scheduling (or not). By extension, the processing time for a given query on a given node might also be interesting. |
Number of nodes that participate in query execution sounds like an excellent metric, if we can graph the average of that across all DistSQL queries. |
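A minimal sketch of what recording that could look like: a histogram of the number of participating nodes per DistSQL query, from which the admin UI or Grafana could graph the average (or the full distribution) across queries. The metric name, bucket choices, and the point at which it would be recorded are assumptions, not the actual implementation.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Distribution of how many nodes each DistSQL query touches.
var nodesPerQuery = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "distsql_nodes_per_query",
	Help:    "Number of nodes participating in each DistSQL query.",
	Buckets: prometheus.LinearBuckets(1, 1, 16), // 1..16 nodes
})

// recordQueryPlan would be called once per query, after physical planning,
// when the set of participating nodes is known.
func recordQueryPlan(participatingNodes int) {
	nodesPerQuery.Observe(float64(participatingNodes))
}

func main() {
	prometheus.MustRegister(nodesPerQuery)
	recordQueryPlan(3)
}
```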
What is this currently waiting on? |
This was waiting on #12998, which was closed on Friday. |
This is probably not going to make it into 1.0? I'm going to move it out for now, but feel free to move it back in if you disagree. |
I'm picking this up again and starting to look into what would be useful here. Some initial remarks:

The big-picture goal is to add visibility into the processing of DistSQL queries, specifically the processing on one node for a query made to another, as well as intra-cluster network traffic. This information isn't currently reflected in the Admin UI. DistSQL queries tend to be the pretty heavyweight ones, but their impact isn't made clear: the gateway node will see a single tick on the query count chart when the query completes, but the fact that any other node is participating can only be guessed at from an increase in CPU usage. The SQL network traffic chart only shows bytes between a client and the gateway, not between nodes in the cluster. There is currently one dedicated chart for DistSQL service latency, but there are some questions about exactly what is being measured there.

So it sounds like this is generally the order in which to work on new metrics, balancing priority and ease of adding:
|
I would put 1 and 3 first. These would make in-progress DistSQL queries visible. The others are maybe for later. I've tried to add inter-node traffic metrics in the past, and it wasn't possible to get the full bytes in/out including gRPC protocol overhead. I believe that's since been fixed with gRPC interceptors, but I don't know what's involved in creating one. |
There are ready-made interceptors for getting Prometheus metrics from gRPC connections, although I don't know if any of us have looked at them too closely. |
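For reference, the off-the-shelf interceptors mentioned above are most likely the go-grpc-prometheus ones; the sketch below shows their documented wiring on a gRPC server, outside of CockroachDB's actual server setup, so treat the surrounding code as illustrative. Note that these interceptors count RPCs and (optionally) handling time, not raw bytes on the wire, so byte-level traffic would presumably still need something like a gRPC stats handler.

```go
package main

import (
	grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
	"google.golang.org/grpc"
)

// newInstrumentedServer builds a gRPC server whose unary and streaming RPCs
// are counted by the go-grpc-prometheus interceptors.
func newInstrumentedServer() *grpc.Server {
	s := grpc.NewServer(
		grpc.UnaryInterceptor(grpc_prometheus.UnaryServerInterceptor),
		grpc.StreamInterceptor(grpc_prometheus.StreamServerInterceptor),
	)
	// Initializes per-method metrics for the services registered on this
	// server (in real code, call this after registering the services).
	grpc_prometheus.Register(s)
	return s
}

func main() {
	_ = newInstrumentedServer()
}
```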
We've been re-framing the admin UI strategy recently: if these graphs are going to make it into the admin UI, I'd like to tie a clear user story to them, so that users can answer valuable questions rather than the admin UI just becoming more confusing with more graphs. With this in mind, it sounds like the main reason we want these is so that users can gain more insight into how their SQL queries (specifically DistSQL queries) are performing and impacting overall cluster utilization. If that's our goal, shouldn't we be showing the number of DistSQL queries, the number of nodes associated with a given query, and the utilization per node (CPU / RAM / network) associated with DistSQL? Flows (3) sounds like a larger project that we won't be able to scope out by 1.1? |
My main reason for filing this issue is that long-running DistSQL queries are nearly invisible. They don't affect the QPS graph until they finish, and because long-running queries are generally less frequent than OLTP queries, they're not very visible on the QPS graph even then. To put it another way: TPC-H scale factor 1 query 1 takes ~100 seconds. For that whole 100 seconds, the query is invisible in the admin UI. After the query finishes, you'll see a 0.1 qps bump in the QPS graph. This doesn't seem like good UX to me. Also, if a node is executing DistSQL flows on behalf of a query started on another node, that work is also invisible in the admin UI: you'll see CPU use but not even a slight uptick on any SQL graphs. @dianasaur323 the graphs you're suggesting seem valuable. I think the utilization graphs would be the most challenging. |
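A minimal sketch of making that in-progress work visible, under the assumption that we track it with a gauge: each node counts the DistSQL flows it is currently executing, incremented when a flow starts and decremented when it finishes, so long-running queries show up while they run rather than only after completion. The metric name and hooks are assumptions for illustration, not CockroachDB's actual instrumentation.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Number of DistSQL flows currently executing on this node, regardless of
// which node is the gateway for the query.
var activeFlows = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "distsql_flows_active",
	Help: "Number of DistSQL flows currently executing on this node.",
})

// runFlow is a stand-in for a node executing one flow of a distributed
// query; the gauge is held high for the flow's entire lifetime.
func runFlow(work func()) {
	activeFlows.Inc()
	defer activeFlows.Dec()
	work()
}

func main() {
	prometheus.MustRegister(activeFlows)
	runFlow(func() { /* process the flow's processors here */ })
}
```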
@cuongdo That definitely sounds like an important reason to provide more graphs. My point in jumping in was just to make sure that the graphs we put in aren't too low-level: that they're relatively easy for day-to-day users to understand, and that they don't try to enable query analysis / optimization, since that is out of scope and I don't think adding in time series is the best approach for that use case anyway. Otherwise, carry on! |
After merging #17050, which adds a few of these graphs, I'm unassigning myself, and we can pick this up again in the future as and when it seems appropriate. |
Closing this issue, as the higher priority items are done. A new issue should be filed for distsql metrics once we've taken a more comprehensive look at what's needed to monitor distsql. |
We'll need metrics to determine how well DistSQL is working, which can be added to the admin UI and Grafana. Here are some potential starting points:
@RaduBerinde @andreimatei @irfansharif @arjunravinarayan Please chime in with thoughts.