Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distsql: add metrics #12143

Closed
4 of 6 tasks
cuongdo opened this issue Dec 7, 2016 · 21 comments
Closed
4 of 6 tasks

distsql: add metrics #12143

cuongdo opened this issue Dec 7, 2016 · 21 comments
Milestone

Comments

@cuongdo
Copy link
Contributor

cuongdo commented Dec 7, 2016

We'll need metrics to determine how well DistSQL is working, which can be added to the admin UI and Grafana. Here are some potential starting points:

  • # of queries executed through DistSQL: ensure that DistSQL code is being used
  • latency of DistSQL queries: helps narrow down whether high latencies are general or specific to DistSQL
  • flows executed on behalf of DistSQL queries started on other nodes
  • network bytes in/out for DistSQL
  • # of currently executing DistSQL queries
  • counts for each type of JOIN (hash join, merge join, etc.) done by DistSQL

@RaduBerinde @andreimatei @irfansharif @arjunravinarayan Please chime in with thoughts.

@vivekmenezes
Copy link
Contributor

please discuss how these will show up on the admin UI with @kuanluo. FYI @maxlang @mrtracy

@vivekmenezes
Copy link
Contributor

And perhaps we should split this up into 1. metrics of user interest (what kind of visibility into dist-SQL would a user care about) and 2. internals (stuff that only an advanced user/developer would care about like type of JOIN used perhaps?)

@cuongdo
Copy link
Contributor Author

cuongdo commented Dec 7, 2016 via email

@cuongdo
Copy link
Contributor Author

cuongdo commented Dec 8, 2016

Let's start with this list of metrics @arjunravinarayan and consult with @kuanluo for the admin UI part of this.

@rjnn
Copy link
Contributor

rjnn commented Dec 8, 2016

One thing that came up was the security aspect of displaying query information. Our admin UI doesn't have any security right now, so perhaps these metrics should be explicitly enabled with a CLI flag (although we probably shouldn't overload the --insecure flag since we run our production clusters with certificates).

Right now I'm just focused on metrics for dogfooding (and the ones @cuongdo listed above are a good start), I haven't thought about user interest. My plan is to get some internal metrics up and useful for the DistSQL team, and then we can go from there.

@andreimatei
Copy link
Contributor

I think the latency and network bytes metrics should be more general than DistSQL (exposed for all queries), and then DistSQL would be just a separate dimension, even though how the way network bytes is reported would probably be DistSQL-specific.

@cuongdo
Copy link
Contributor Author

cuongdo commented Dec 8, 2016 via email

@cuongdo
Copy link
Contributor Author

cuongdo commented Dec 21, 2016

Just to recap our chat, it'd be good to get these metrics in place for production testing:

  1. Latency of DistSQL queries
  2. Counts for each type of JOIN in DistSQL

@irfansharif
Copy link
Contributor

nodes that actually partake in the query execution can be easily extracted i feel, could possibly benefit any future plans for scheduling (or not). by extension the processing time for a given query and a given node might also be interesting.

@cuongdo
Copy link
Contributor Author

cuongdo commented Dec 22, 2016

Number of nodes that participate in query execution sounds like an excellent metric, if we can graph the average of that across all DistSQL queries.

@cuongdo
Copy link
Contributor Author

cuongdo commented Feb 6, 2017

What is this currently waiting on?

@rjnn
Copy link
Contributor

rjnn commented Feb 6, 2017

This was waiting on #12998, which was closed on Friday.

@petermattis petermattis added this to the 1.0 milestone Feb 23, 2017
@dianasaur323
Copy link
Contributor

This is probably not going to make it into the 1.0? I'm going to move it out for now, but feel free to move back in if you disagree.

@dianasaur323 dianasaur323 modified the milestones: Q2 2017, 1.0 Apr 6, 2017
@dianasaur323 dianasaur323 modified the milestones: 1.1, Q2 2017 Apr 20, 2017
@rjnn rjnn assigned couchand and unassigned rjnn Jun 29, 2017
@couchand
Copy link
Contributor

I'm picking this up again and starting to look into what would be useful here. Some initial remarks:

The big picture goal is to add visibility into the processing of DistSQL queries, specifically the processing on one node for a query made to another, as well as intra-cluster network traffic. This information isn't currently reflected in the Admin UI. DistSQL queries tend to be the pretty heavyweight ones, but their impact isn't made clear: the gateway node will see a single tick on the query count chart when the query completes, but the fact that any other node is participating can only be guessed based on an increase in CPU usage. The SQL network traffic chart only shows bytes between a client and the gateway, and not between nodes in the cluster. There is currently one dedicated chart for DistSQL service latency, but there are some questions about exactly what is being measured there.

So it sounds like this generally the order to work on new metrics, balancing priority and ease of adding:

  1. # of currently executing DistSQL queries
    For each node this will show the number of DistSQL queries currently in process. Increment when processing starts, decrement when it's done. This means that for the cluster as a whole it will be queries x nodes.
  2. network bytes in/out for DistSQL
    Provide insight into how much intra-cluster SQL traffic there is.
  3. flows executed on behalf of DistSQL queries started on other nodes
    Requires more specification. My first thought is that this would be an additional dimension on 1, but might be a different metric if we really want flows and not just number of queries.
  4. number of nodes that participate in query execution
    One outstanding question here is whether non-DistSQL queries count as 1 or aren't included. It seems like not including them would make this more useful for evaluating DistSQL specifically.
  5. counts for each type of JOIN (hash join, merge join, etc.) done by DistSQL
    Not a priority until merge joins are implemented.

@cuongdo
Copy link
Contributor Author

cuongdo commented Jul 11, 2017

I would put 1 and 3 first. These would make in-progress DistSQL queries visible.

The others are maybe for later. I've tried to add inter-node traffic metrics in the past, and it wasn't possible to get the full bytes in/out including gRPC protocol overhead. I believe that's since been fixed with gRPC interceptors, but I don't know what's involved in creating one.

@a-robinson
Copy link
Contributor

There are ready-made interceptors for getting prometheus metrics from grpc connections, although I don't know if any of us have looked at them too closely.

@dianasaur323
Copy link
Contributor

dianasaur323 commented Jul 11, 2017

We've been re-framing the admin UI strategy recently - if these graphs are going to make it into the admin UI, I'd like to tie a clear user story to the graphs so that users can answer valuable questions vs making the admin UI more confusing with more graphs.

With this in mind, it sounds like the main reason we want these is so that users can gain more insight into how their SQL queries (specifically DistSQL queries) are performing and impacting overall cluster utilization. If that's our goal, shouldn't we be showing # of distSQL queries, # of nodes associated with a given query, and different utilization per node (CPU / RAM / network) associated with distSQL?

Flows (3) sounds like a larger project that we won't be able to scope out by 1.1?

@cuongdo
Copy link
Contributor Author

cuongdo commented Jul 11, 2017

My main reason for filing this issue is that long-running DistSQL queries are nearly invisible. They don't affect the QPS graph until they finish, and because long-running queries are generally less frequent than OLTP queries, they're not very visible on the QPS graph.

To put it another way, TPC-H scale factor 1 query 1 takes ~100 seconds. For that whole 100 seconds, that query is invisible in admin UI. After that query finishes, you'll see a 0.1 qps bump in the QPS graph. This doesn't seem like good UX to me.

Also, if a node is executing DistSQL flows on behalf of a query started on another node, that's also invisible in admin UI. You'll see CPU use but not even a slight uptick on any SQL graphs.

@dianasaur323 the graphs you're suggesting seem valuable. I think that utilization graphs would be the most challenging.

@dianasaur323
Copy link
Contributor

@cuongdo That definitely sounds like an important reason to provide more graphs. My point of jumping in was just to make sure that we put in graphs that aren't too low-level -> basically, this will be relatively easy for day-to-day users to understand, and doesn't enable query analysis / optimization since that is out of scope, and I don't think adding in time series is the best approach for that use case. Otherwise, carry on!

@couchand
Copy link
Contributor

After merging #17050, which adds a few of these graphs, I'm unassigning myself, and we can pick this up again in the future as and when it seems appropriate.

@couchand couchand removed their assignment Jul 31, 2017
@cuongdo
Copy link
Contributor Author

cuongdo commented Aug 22, 2017

Closing this issue, as the higher priority items are done. A new issue should be filed for distsql metrics once we've taken a more comprehensive look at what's needed to monitor distsql.

cc @dianasaur323

@cuongdo cuongdo closed this as completed Aug 22, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants