distsql: add metrics #12143

cuongdo · 2016-12-07T18:19:07Z

We'll need metrics to determine how well DistSQL is working, which can be added to the admin UI and Grafana. Here are some potential starting points:

# of queries executed through DistSQL: ensure that DistSQL code is being used
latency of DistSQL queries: helps narrow down whether high latencies are general or specific to DistSQL
flows executed on behalf of DistSQL queries started on other nodes
network bytes in/out for DistSQL
# of currently executing DistSQL queries
counts for each type of JOIN (hash join, merge join, etc.) done by DistSQL

@RaduBerinde @andreimatei @irfansharif @arjunravinarayan Please chime in with thoughts.

vivekmenezes · 2016-12-07T18:41:46Z

please discuss how these will show up on the admin UI with @kuanluo. FYI @maxlang @mrtracy

vivekmenezes · 2016-12-07T18:47:11Z

And perhaps we should split this up into 1. metrics of user interest (what kind of visibility into dist-SQL would a user care about) and 2. internals (stuff that only an advanced user/developer would care about like type of JOIN used perhaps?)

cuongdo · 2016-12-07T18:49:05Z

That makes sense.

…

On Wed, Dec 7, 2016 at 1:47 PM vivekmenezes ***@***.***> wrote: And perhaps we should split this up into 1. metrics of user interest (what kind of visibility into dist-SQL would a user care about) and 2. internals (stuff that only an advanced user/developer would care about like type of JOIN used perhaps?) — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#12143 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABffprySL1zDjq1Hfpw1fzkp4eij_hQUks5rFv80gaJpZM4LG7JJ> .

cuongdo · 2016-12-08T15:26:33Z

Let's start with this list of metrics @arjunravinarayan and consult with @kuanluo for the admin UI part of this.

rjnn · 2016-12-08T15:42:33Z

One thing that came up was the security aspect of displaying query information. Our admin UI doesn't have any security right now, so perhaps these metrics should be explicitly enabled with a CLI flag (although we probably shouldn't overload the --insecure flag since we run our production clusters with certificates).

Right now I'm just focused on metrics for dogfooding (and the ones @cuongdo listed above are a good start), I haven't thought about user interest. My plan is to get some internal metrics up and useful for the DistSQL team, and then we can go from there.

andreimatei · 2016-12-08T17:02:02Z

I think the latency and network bytes metrics should be more general than DistSQL (exposed for all queries), and then DistSQL would be just a separate dimension, even though how the way network bytes is reported would probably be DistSQL-specific.

cuongdo · 2016-12-08T17:35:01Z

We already have general latency metrics. My thinking is that with DistSQL being the new kid on the block in terms of our SQL code, it's good to have a separate latency metric for DistSQL queries to help determine whether future performance issues are regular SQL or DistSQL. On Thu, Dec 8, 2016 at 12:02 PM Andrei Matei <notifications@github.com> wrote: I think the latency and network bytes metrics should be more general than DistSQL (exposed for all queries), and then DistSQL would be just a separate dimension, even though how the way network bytes is reported would probably be DistSQL-specific.

cuongdo · 2016-12-21T20:18:02Z

Just to recap our chat, it'd be good to get these metrics in place for production testing:

Latency of DistSQL queries
Counts for each type of JOIN in DistSQL

irfansharif · 2016-12-22T05:57:52Z

nodes that actually partake in the query execution can be easily extracted i feel, could possibly benefit any future plans for scheduling (or not). by extension the processing time for a given query and a given node might also be interesting.

cuongdo · 2016-12-22T18:45:40Z

Number of nodes that participate in query execution sounds like an excellent metric, if we can graph the average of that across all DistSQL queries.

cuongdo · 2017-02-06T18:45:43Z

What is this currently waiting on?

rjnn · 2017-02-06T18:46:56Z

This was waiting on #12998, which was closed on Friday.

dianasaur323 · 2017-04-06T16:12:57Z

This is probably not going to make it into the 1.0? I'm going to move it out for now, but feel free to move back in if you disagree.

couchand · 2017-07-10T15:25:57Z

I'm picking this up again and starting to look into what would be useful here. Some initial remarks:

The big picture goal is to add visibility into the processing of DistSQL queries, specifically the processing on one node for a query made to another, as well as intra-cluster network traffic. This information isn't currently reflected in the Admin UI. DistSQL queries tend to be the pretty heavyweight ones, but their impact isn't made clear: the gateway node will see a single tick on the query count chart when the query completes, but the fact that any other node is participating can only be guessed based on an increase in CPU usage. The SQL network traffic chart only shows bytes between a client and the gateway, and not between nodes in the cluster. There is currently one dedicated chart for DistSQL service latency, but there are some questions about exactly what is being measured there.

So it sounds like this generally the order to work on new metrics, balancing priority and ease of adding:

# of currently executing DistSQL queries
For each node this will show the number of DistSQL queries currently in process. Increment when processing starts, decrement when it's done. This means that for the cluster as a whole it will be queries x nodes.
network bytes in/out for DistSQL
Provide insight into how much intra-cluster SQL traffic there is.
flows executed on behalf of DistSQL queries started on other nodes
Requires more specification. My first thought is that this would be an additional dimension on 1, but might be a different metric if we really want flows and not just number of queries.
number of nodes that participate in query execution
One outstanding question here is whether non-DistSQL queries count as 1 or aren't included. It seems like not including them would make this more useful for evaluating DistSQL specifically.
counts for each type of JOIN (hash join, merge join, etc.) done by DistSQL
Not a priority until merge joins are implemented.

cuongdo · 2017-07-11T00:38:32Z

I would put 1 and 3 first. These would make in-progress DistSQL queries visible.

The others are maybe for later. I've tried to add inter-node traffic metrics in the past, and it wasn't possible to get the full bytes in/out including gRPC protocol overhead. I believe that's since been fixed with gRPC interceptors, but I don't know what's involved in creating one.

a-robinson · 2017-07-11T00:52:21Z

There are ready-made interceptors for getting prometheus metrics from grpc connections, although I don't know if any of us have looked at them too closely.

dianasaur323 · 2017-07-11T16:13:05Z

We've been re-framing the admin UI strategy recently - if these graphs are going to make it into the admin UI, I'd like to tie a clear user story to the graphs so that users can answer valuable questions vs making the admin UI more confusing with more graphs.

With this in mind, it sounds like the main reason we want these is so that users can gain more insight into how their SQL queries (specifically DistSQL queries) are performing and impacting overall cluster utilization. If that's our goal, shouldn't we be showing # of distSQL queries, # of nodes associated with a given query, and different utilization per node (CPU / RAM / network) associated with distSQL?

Flows (3) sounds like a larger project that we won't be able to scope out by 1.1?

cuongdo · 2017-07-11T16:50:33Z

My main reason for filing this issue is that long-running DistSQL queries are nearly invisible. They don't affect the QPS graph until they finish, and because long-running queries are generally less frequent than OLTP queries, they're not very visible on the QPS graph.

To put it another way, TPC-H scale factor 1 query 1 takes ~100 seconds. For that whole 100 seconds, that query is invisible in admin UI. After that query finishes, you'll see a 0.1 qps bump in the QPS graph. This doesn't seem like good UX to me.

Also, if a node is executing DistSQL flows on behalf of a query started on another node, that's also invisible in admin UI. You'll see CPU use but not even a slight uptick on any SQL graphs.

@dianasaur323 the graphs you're suggesting seem valuable. I think that utilization graphs would be the most challenging.

dianasaur323 · 2017-07-11T17:16:59Z

@cuongdo That definitely sounds like an important reason to provide more graphs. My point of jumping in was just to make sure that we put in graphs that aren't too low-level -> basically, this will be relatively easy for day-to-day users to understand, and doesn't enable query analysis / optimization since that is out of scope, and I don't think adding in time series is the best approach for that use case. Otherwise, carry on!

couchand · 2017-07-31T20:46:58Z

After merging #17050, which adds a few of these graphs, I'm unassigning myself, and we can pick this up again in the future as and when it seems appropriate.

cuongdo · 2017-08-22T20:53:09Z

Closing this issue, as the higher priority items are done. A new issue should be filed for distsql metrics once we've taken a more comprehensive look at what's needed to monitor distsql.

cc @dianasaur323

cuongdo assigned rjnn Dec 7, 2016

rjnn mentioned this issue Dec 12, 2016

distsql, sql, ui: update SQL metrics #12299

Merged

petermattis added this to the 1.0 milestone Feb 23, 2017

dianasaur323 modified the milestones: Q2 2017, 1.0 Apr 6, 2017

dianasaur323 modified the milestones: 1.1, Q2 2017 Apr 20, 2017

rjnn assigned couchand and unassigned rjnn Jun 29, 2017

couchand mentioned this issue Jul 17, 2017

distsql, ui: query count metrics #17050

Merged

couchand removed their assignment Jul 31, 2017

cuongdo closed this as completed Aug 22, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

distsql: add metrics #12143

distsql: add metrics #12143

cuongdo commented Dec 7, 2016 •

edited by couchand

Loading

vivekmenezes commented Dec 7, 2016

vivekmenezes commented Dec 7, 2016

cuongdo commented Dec 7, 2016 via email

cuongdo commented Dec 8, 2016

rjnn commented Dec 8, 2016

andreimatei commented Dec 8, 2016

cuongdo commented Dec 8, 2016 via email •

edited by irfansharif

Loading

cuongdo commented Dec 21, 2016

irfansharif commented Dec 22, 2016

cuongdo commented Dec 22, 2016 •

edited

Loading

cuongdo commented Feb 6, 2017

rjnn commented Feb 6, 2017

dianasaur323 commented Apr 6, 2017

couchand commented Jul 10, 2017

cuongdo commented Jul 11, 2017 •

edited

Loading

a-robinson commented Jul 11, 2017

dianasaur323 commented Jul 11, 2017 •

edited

Loading

cuongdo commented Jul 11, 2017

dianasaur323 commented Jul 11, 2017

couchand commented Jul 31, 2017

cuongdo commented Aug 22, 2017

distsql: add metrics #12143

distsql: add metrics #12143

Comments

cuongdo commented Dec 7, 2016 • edited by couchand Loading

vivekmenezes commented Dec 7, 2016

vivekmenezes commented Dec 7, 2016

cuongdo commented Dec 7, 2016 via email

cuongdo commented Dec 8, 2016

rjnn commented Dec 8, 2016

andreimatei commented Dec 8, 2016

cuongdo commented Dec 8, 2016 via email • edited by irfansharif Loading

cuongdo commented Dec 21, 2016

irfansharif commented Dec 22, 2016

cuongdo commented Dec 22, 2016 • edited Loading

cuongdo commented Feb 6, 2017

rjnn commented Feb 6, 2017

dianasaur323 commented Apr 6, 2017

couchand commented Jul 10, 2017

cuongdo commented Jul 11, 2017 • edited Loading

a-robinson commented Jul 11, 2017

dianasaur323 commented Jul 11, 2017 • edited Loading

cuongdo commented Jul 11, 2017

dianasaur323 commented Jul 11, 2017

couchand commented Jul 31, 2017

cuongdo commented Aug 22, 2017

cuongdo commented Dec 7, 2016 •

edited by couchand

Loading

cuongdo commented Dec 8, 2016 via email •

edited by irfansharif

Loading

cuongdo commented Dec 22, 2016 •

edited

Loading

cuongdo commented Jul 11, 2017 •

edited

Loading

dianasaur323 commented Jul 11, 2017 •

edited

Loading