Add multicluster support for Grafana #3405

Closed
andrew-waters opened this issue Sep 9, 2019 · 27 comments

@andrew-waters

andrew-waters commented Sep 9, 2019

Feature Request

What problem are you trying to solve?

At the moment, linkerd2 supplies some hard coded dashboards for Grafana.

These work very well when viewed on the cluster, but don't account for use cases where a single pane of glass is required, with the ability to drill down into metrics for a particular cluster.

How should the problem be solved?

The standard way of presenting this data is to have a variable within the dashboard that allows cluster: all to be selected, giving you an aggregated view of your clusters whilst also allowing selection by the cluster name.
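
As a rough illustration of the variable described above, here is a minimal sketch of a Grafana `cluster` template variable with an "All" option, plus one panel query that uses it. The datasource name, the panel, and the choice of `request_total` for label discovery are assumptions made for the example; they are not taken from the existing Linkerd dashboards.

```jsonnet
// Sketch of a Grafana `cluster` template variable with an "All" option.
// Assumptions: the datasource name, the panel and the metric used for
// label discovery are illustrative only.
local clusterVariable = {
  name: 'cluster',
  label: 'Cluster',
  type: 'query',
  datasource: 'prometheus',
  // Ask Prometheus for every value of the `cluster` label.
  query: 'label_values(request_total, cluster)',
  refresh: 1,  // re-run the query when the dashboard loads
  includeAll: true,  // adds the "All" entry to the dropdown
  allValue: '.*',  // "All" expands to a regex matching every cluster
  multi: false,
  hide: 0,
};

{
  templating: { list: [clusterVariable] },
  panels: [
    {
      title: 'Success rate',
      type: 'graph',
      datasource: 'prometheus',
      targets: [{
        // `=~` is used so the "All" regex matches; Grafana fills in `$cluster`.
        expr: 'sum(rate(response_total{classification="success", cluster=~"$cluster"}[1m])) / sum(rate(response_total{cluster=~"$cluster"}[1m]))',
      }],
    },
  ],
}
```

Selecting "All" gives the aggregated view, while picking a single value narrows every query that carries the `cluster=~"$cluster"` matcher.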

In order to achieve this, the metrics collected by Prometheus need an external label (the cluster name). For example, using kube-prometheus, one could write:

{
  _config+:: {
    cluster: 'my-cluster',
  },
  prometheus+:: {
    prometheus+: {
      spec+: {
        externalLabels: {
          cluster: $._config.cluster
        },
      },
    },
  },
}

This would then give the prometheus instance being queried enough information about its targets.
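
For a Prometheus that is not managed through kube-prometheus, the equivalent setting is `global.external_labels` in `prometheus.yml`. The sketch below renders such a fragment from Jsonnet; `clusterName` and the output file name are hypothetical. One point worth keeping in mind: Prometheus attaches external labels when it talks to external systems (federation, remote write, Alertmanager), so the label shows up on the aggregating side rather than in queries against the local instance.

```jsonnet
// Sketch: the same external label for a Prometheus configured without
// kube-prometheus. `clusterName` and the output file name are hypothetical.
local clusterName = 'my-cluster';

{
  'prometheus.yml': std.manifestYamlDoc({
    global: {
      // Added to series and alerts sent to external systems
      // (federation, remote write, Alertmanager).
      external_labels: { cluster: clusterName },
    },
  }),
}
```

Running `jsonnet -S -m . <file>` against this would write the string as `prometheus.yml` in the current directory.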

Projects like kubernetes-mixin allow for the dynamic generation of dashboards using jsonnet. This allows them to support multiple clusters with very few changes for the operator.

This would make a solid addition to the operational offering linkerd2 gives.

An image at the bottom of this issue is how this would be presented.

jsonnet is the proposed language to write this in and it could be done in an external repo to reduce noise within the linkerd2 repo. I'd propose github.com/linkerd/linkerd2-mixins. Note that this could also offer the ability to install linkerd2 via jsonnet instead of helm which again increases the offering.

It's also proposed that linkerd2 could keep the hard coded dashboards and introduce the output from the new application as part of CI. This way operators could quickly grab the manifests.

Any alternatives you've considered?

Manually creating these dashboards is unnecessary work although it's viable.

How would users interact with this feature?

Screenshot 2019-09-09 at 11 50 13

@andrew-waters
Author

For the maintainers: I'd also like to express a willingness to contribute to this project. At the moment this issue is to solicit some feedback on the problem and the proposed solution.

@grampelberg
Contributor

Adding an optional cluster dropdown to the dashboards sounds like a solid idea. Would it be possible to hide it if someone's not using thanos?

This subject in particular feels like something that we could do a lot on. I'd love to get a multi-cluster reference architecture together so that we can think a little bit more about which pieces can really be improved.

@andrew-waters
Author

Thanos isn't the factor that makes this an issue - it's just a mechanism to consolidate metrics into cheap storage and solve some issues with federation. My description was more to demonstrate one architecture that this could benefit.

If we did this with jsonnet (which is pretty standard for this domain AFAIK) then something like the following would achieve multi-cluster support (of course, there could be large expansions on this: datasources etc.):

local linkerd = import "linkerd-mixin/mixin.libsonnet";

linkerd {
  _config+:: {
    multiCluster: true,
    clusterLabel: 'cluster',
    dashboardNamePrefix: 'Linkerd2 / ',
    dashboardTags: ['linkerd', 'infrastructure'],
  },
}

$._config.multiCluster would default to false, so this is purely opt in.
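
To make the opt-in behaviour concrete, here is a minimal, self-contained sketch of how a mixin could default `multiCluster` to false and only add a cluster matcher to its queries when a consumer overrides it. Apart from `multiCluster` and `clusterLabel`, every field name here (`clusterSelector`, `requestRateExpr`) is hypothetical, and the inline `mixin` object stands in for the imported library.

```jsonnet
// Sketch of keeping multi-cluster support opt-in inside a mixin.
// `clusterSelector` and `requestRateExpr` are hypothetical names.
local mixin = {
  _config:: {
    multiCluster: false,  // default: single-cluster dashboards
    clusterLabel: 'cluster',
  },

  // Extra label matcher appended to queries; empty unless opted in.
  clusterSelector::
    if $._config.multiCluster
    then ', %s=~"$cluster"' % $._config.clusterLabel
    else '',

  requestRateExpr: 'sum(rate(request_total{direction="inbound"%s}[1m]))' % self.clusterSelector,
};

{
  // Uses the defaults, so no cluster matcher is added.
  standalone: mixin.requestRateExpr,
  // The consumer flips one flag and the query picks up the matcher.
  multiCluster: (mixin { _config+:: { multiCluster: true } }).requestRateExpr,
}
```

Because `$._config` is resolved late, the override in the last field changes the generated query without touching the mixin itself.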

It would probably mean moving this into another repo (purely to reduce noise).

@andrew-waters
Author

andrew-waters commented Sep 9, 2019

As a side note, it would also be possible from the same jsonnet package to perform a full linkerd2 install. This would expand installation options beyond linkerd install and helm which provides more choice to the operator as they could be bundled inside a larger "stack" application. That's a different issue though...

@andrew-waters
Author

@grampelberg I've created a repo to demonstrate this functionality, although, as a disclaimer, I've only migrated one panel for one dashboard. I'd be interested in getting your feedback as to whether or not this is something that I should spend some time on. The readme should hopefully be self explanatory.

https://github.com/andrew-waters/linkerd2-mixin

@andrew-waters
Author

Screenshot 2019-09-10 at 20 55 26

@grampelberg
Contributor

That's pretty cool! Where is the source being fetched (as someone who knows nothing about jsonnet)?

@andrew-waters
Author

The source for the namespace?

@grampelberg
Contributor

top-line.json actually.

@andrew-waters
Author

Not sure I'm following the question?

@grampelberg
Contributor

Oh, this is building the dashboard for grafana - not taking the existing top-line.json and patching it?

@andrew-waters
Author

Yeah precisely. You could patch it, but if we go to the effort of building this with jsonnet we may as well go all in and rebuild them with more customisation

@grampelberg
Contributor

I'm really hesitant to bring jsonnet in as a dependency. We'll not be able to get away from helm as the core installation method as that's what most folks want.

I've gone down the "dashboards in my own grafana" path a couple times and the only sustainable solution was export from the linkerd grafana, import into your own. Documenting jsonnet as a way to patch/build the dashboards I'm 100% behind.

Honestly, I'm pretty happy just updating these dashboards to be cluster aware, especially if we just detect the cluster label and hide the dropdown for folks who don't have it setup that way.
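
One way the "hide the dropdown" part could look, sketched in Jsonnet: the variable is always defined, and its Grafana `hide` field is driven by a switch. Note the assumption: `showClusterDropdown` is a hypothetical, explicitly set flag, so this is configuration rather than automatic detection of the label.

```jsonnet
// Sketch: show the cluster dropdown only when it is wanted.
// `showClusterDropdown` is a hypothetical switch; nothing is auto-detected.
local clusterVariable(showClusterDropdown) = {
  name: 'cluster',
  type: 'query',
  datasource: 'prometheus',
  query: 'label_values(request_total, cluster)',
  includeAll: true,
  allValue: '.*',
  // Grafana: 0 = visible, 1 = label hidden, 2 = variable hidden.
  hide: if showClusterDropdown then 0 else 2,
};

{
  withDropdown: clusterVariable(true),
  withoutDropdown: clusterVariable(false),
}
```

A matcher such as `cluster=~".*"` also matches series that have no `cluster` label at all, so, provided the hidden variable resolves to `.*`, the same queries should keep working for folks who don't have the label set up.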

@andrew-waters
Author

Well, a couple of thoughts:

  1. It doesn’t need to be a dependency - storing the output JSON means it can remain as is (possibly adding multi cluster and standalone directory in this repo) but still give people the choice.

  2. I’d also suggested its own repo (whether that lives in linkerd namespace or not). I’d be happy to build it out (if we’re going to adopt linkerd) - this is pretty basic stuff for what we need and I guess it would be for others looking as well.
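
To illustrate the first point (storing the output JSON so jsonnet is not a dependency for users), here is a minimal sketch of a file that emits both a standalone and a multi-cluster variant in one run. The dashboard body and the file names are placeholders rather than the real top-line dashboard.

```jsonnet
// dashboards.jsonnet: sketch of generating committed JSON for both variants.
// The dashboard body and file names are placeholders.
local dashboard(multiCluster) = {
  title: 'Linkerd2 / Top Line',
  tags: ['linkerd'],
  templating: {
    // Only the multi-cluster variant carries the cluster variable.
    list: if multiCluster then [{ name: 'cluster', type: 'query' }] else [],
  },
};

// With `jsonnet -m <dir>`, each top-level key becomes one output file.
{
  'top-line.standalone.json': dashboard(false),
  'top-line.multicluster.json': dashboard(true),
}
```

Running `jsonnet -m out dashboards.jsonnet` writes one file per key into an existing `out` directory, and those files can be committed so that anyone who only wants the JSON never needs jsonnet installed.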

@grampelberg
Contributor

Both options sound great to me. What are you thinking around storing the JSON output? All the dashboards are already in JSON - https://github.com/linkerd/linkerd2/tree/master/grafana/dashboards

@andrew-waters
Author

andrew-waters commented Sep 10, 2019

Yep, I’ve used those and ended up here 😂

Obviously linkerd needs these as part of the install (you’ve seen my other issue about unbundling). That suggests there may be more options:

  1. Have CI do some pretty advanced work building these on each (pre) release. I’m not sure how this would be done, just floating it as an idea.

  2. Configure the linkerd bin to pull them at runtime. This is less desirable and certainly would involve you guys doing this under your own organisation account.

  3. Bite the bullet and bring them all under the same roof. This would add an (optional) dependency but my side note above (about performing a full linkerd2 install from the same jsonnet package) appears to have been supported as a way this could be built out in the future.

  4. Don’t worry about it. Create a separate repo, point to it in the docs for more advanced usage and write up some good docs on where this fits in from the ops perspective. Leave the existing dashboards intact (maintained hard copies or a manual process). Simply add this as an advanced option.

@grampelberg
Contributor

For this kind of advanced usage, I don't think it makes sense to allow configuration at this level as part of the install for all the install methods. To your point, the unbundling stuff is definitely important and the correct (tm) way to go.

I would love to provide the tools and docs to get the dashboards for a specific version and configure as you see fit. That's kinda where the kustomize install documentation was going.

@andrew-waters
Author

I think this all leads to having it in a separate repo. Does that sound sensible? If so, I’d suggest the following:

I’ll continue with some work on this tomorrow (I’m on GMT). I’ll get a single dashboard complete using what I’ve prepared already for preview.

You can chat internally about where this lives. I’m happy to host on my github account if it’s your preference but I’m also happy to pass it over if you see fit.

We build out the remaining dashboards and once we’re at parity, look at a release that ties in with linkerd’s trunk. If you guys have any resource to throw at it that would be most welcome but I appreciate you may not.

I’m happy to continue this from here and maintain involvement, including any documentation that may be involved in proposing this as a production recommendation.

@grampelberg
Contributor

That sounds like a fantastic plan to me!

@andrew-waters
Author

👋 @grampelberg - just a courtesy note to mention I'm looking at some other things so haven't had the chance to circle back around to this yet in any meaningful way - I'll do so in due course and update you on here when I can carve that time out.

@grampelberg
Contributor

Looking forward to it!

@andrew-waters
Author

Hi @grampelberg, I didn't get the opportunity to dive into this further until this weekend. Anyways, I've done quite a bit of work this weekend getting a lot of the boilerplate set up and the top line dashboard generated via jsonnet: https://github.com/andrew-waters/linkerd2-mixin

If you'd like to test it, grab the repo; there should be some instructions for deps. You can then run make dashboards. I'll add the other dashboards soon (we're just starting our adoption of linkerd2 and this repo is required regardless, so that we can have a single pane of glass).

It's particularly worth pointing out https://github.com/andrew-waters/linkerd2-mixin/blob/master/config.libsonnet - this is where all the config happens and when you use the repo as a library (ie in another project), you can customise all the way down from the parent.

@andrew-waters
Author

Oh and this is the result in Grafana btw. All this effort to make these dashboards much more like IaC and to get the Cluster variable in the top left corner 😂

Screenshot 2019-12-09 at 13 35 52

@grampelberg
Contributor

Cool! I'm gonna need to spend some more time getting this into my brain. @zaharidichev you're doing a lot of thinking about multi-cluster right now, mind taking a look? Also, @Pothulapati, check it out! This might be interesting around the grafana chart work we're doing right now.

@Pothulapati
Contributor

@andrew-waters That's Awesome! Thank you so much for doing this.

As you know right now, we have the dashboard config statically pushed into a config map. This would allow some great extensibility.

Right now, I am working on a way to have add-ons for Linkerd2, and then move out grafana and prometheus as add-ons. Once we do that, we can maybe update the grafana charts with the mixin generated config. WDYT?
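
As a rough idea of how mixin-generated dashboards could end up in a config map, the sketch below wraps dashboard objects in a Kubernetes ConfigMap. The ConfigMap name, the namespace, and the dashboard body are hypothetical and do not reflect the actual Linkerd chart.

```jsonnet
// Sketch: wrap generated dashboards in a ConfigMap for Grafana to mount.
// The ConfigMap name, namespace and dashboard body are hypothetical.
local dashboards = {
  'top-line.json': { title: 'Linkerd2 / Top Line', panels: [] },
};

{
  apiVersion: 'v1',
  kind: 'ConfigMap',
  metadata: { name: 'linkerd-grafana-dashboards', namespace: 'linkerd' },
  // ConfigMap values must be strings, so each dashboard is serialised to JSON.
  data: {
    [name]: std.manifestJsonEx(dashboards[name], '  ')
    for name in std.objectFields(dashboards)
  },
}
```

The rendered manifest could then be templated into a chart or applied directly, which keeps the static config map approach while letting the dashboard content come from the mixin.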

@andrew-waters
Author

@Pothulapati sounds great - I have a related issue over at #3406 which sounds similar to the work you are doing on add-ons and whilst not a prerequisite, certainly helps in decoupling. If you have any issues I can track on that side, please let me know.

In terms of how we apply the mixin, that's definitely a solid sequence. That will also give me the chance to get the other dashboards migrated across to the mixin for my own testing, and we can also consider how they are built as part of CI.

Just to get some idea on commitments, do you have some rough ideas on your own timescale?

@stale

stale bot commented Mar 26, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Mar 26, 2020
@stale stale bot closed this as completed Apr 9, 2020
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 17, 2021