
display component uptime #1223

Closed · 9547 opened this issue Mar 17, 2021 · 10 comments

Labels
type/feature-request Categorizes issue as related to a new feature.

@9547 (Contributor) commented Mar 17, 2021:

Feature Request

Is your feature request related to a problem? Please describe:

Describe the feature you'd like:

I want to display the uptime of each component in `tiup {cluster|dm} display xxx`.

Describe alternatives you've considered:

Some components' Prometheus metrics API returns the process_start_time_seconds metric, which directly represents the process's start timestamp, so we can compute uptime as time.Now() - start_timestamp (see the sketch after this list). The components that expose it:

  • pd
  • tidb
  • tikv
  • ticdc
  • drainer
  • pump
  • alertmanager, grafana, prometheus
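
For illustration, here is a minimal Go sketch of that approach. The function name and status address are mine, not TiUP's actual API; it assumes the component serves Prometheus text-format metrics on its status port at /metrics:

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// uptimeFromMetrics scrapes a component's own metrics endpoint (its status
// port, not a Prometheus server) and derives uptime from
// process_start_time_seconds.
func uptimeFromMetrics(statusAddr string) (time.Duration, error) {
	resp, err := http.Get("http://" + statusAddr + "/metrics")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// Skip "# HELP"/"# TYPE" comment lines; match the sample line only.
		if !strings.HasPrefix(line, "process_start_time_seconds") {
			continue
		}
		fields := strings.Fields(line)
		// The value may be in scientific notation, e.g. 1.61598e+09.
		startTS, err := strconv.ParseFloat(fields[len(fields)-1], 64)
		if err != nil {
			return 0, err
		}
		return time.Since(time.Unix(int64(startTS), 0)).Round(time.Second), nil
	}
	return 0, fmt.Errorf("no process_start_time_seconds at %s", statusAddr)
}

func main() {
	// Example: PD's client/status port; adjust for your deployment.
	if up, err := uptimeFromMetrics("127.0.0.1:2379"); err == nil {
		fmt.Println("uptime:", up)
	}
}
```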

These components do not expose process_start_time_seconds:

  • tiflash
  • tispark-{master,worker}

For the components that do not expose this metric, especially TiFlash, we can wait until the metric is added on the product side. During the transition, we can use ssh and then ps to see how long the process has been running (see the sketch below).
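
A sketch of that interim check, shown here as a local command for brevity; in TiUP the same command would run on the remote host through the SSH executor, and the pgrep pattern is illustrative:

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
	"time"
)

// psUptime reads the process's elapsed running time in seconds via
// `ps -o etimes=` (procps on Linux) and converts it to a Duration.
func psUptime(process string) (time.Duration, error) {
	cmd := fmt.Sprintf("ps -o etimes= -p $(pgrep -xo %s)", process)
	out, err := exec.Command("sh", "-c", cmd).Output()
	if err != nil {
		return 0, err
	}
	secs, err := strconv.Atoi(strings.TrimSpace(string(out)))
	if err != nil {
		return 0, err
	}
	return time.Duration(secs) * time.Second, nil
}

func main() {
	if up, err := psUptime("tiflash"); err == nil {
		fmt.Println("tiflash uptime:", up)
	}
}
```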

Teachability, Documentation, Adoption, Migration Strategy:

9547 added the type/feature-request label on Mar 17, 2021
@9547 (Contributor, Author) commented Mar 17, 2021:

/assign

@9547 (Contributor, Author) commented Mar 17, 2021:

@lucklove @AstroProfundis PTAL

@lucklove (Member) commented:

> Some components' Prometheus metrics API returns the process_start_time_seconds metric, which directly represents the process's start timestamp, so we can compute uptime as time.Now() - start_timestamp

That's a good idea, but what if the user didn't deploy Prometheus? Can we handle that case?

@9547 (Contributor, Author) commented Mar 18, 2021:

> That's a good idea, but what if the user didn't deploy Prometheus? Can we handle that case?

Sorry, I didn't make myself clear: we are not querying Prometheus, but each component's own metrics API. By the way, querying Prometheus for the latest uptime would not be correct anyway; it can serve stale data, i.e. the last sample scraped before the process died.

@lucklove (Member) commented:

I've checked the metrics the components return, but I didn't find any metric that records the start_timestamp...

@9547 (Contributor, Author) commented Mar 18, 2021:

> I've checked the metrics the components return, but I didn't find any metric that records the start_timestamp...

Sorry for the spelling error: it is process_start_time_seconds, not start_timestamp. BTW, we can use `systemctl status xxx | grep -Po ".*; \K(.*)(?= ago)"` to get the process's uptime for the components that don't have the metric.
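
For reference, that grep targets systemd's "Active:" line; a Go equivalent of the extraction could look like this (the sample line is illustrative, and systemd's exact wording can vary by version):

```go
package main

import (
	"fmt"
	"regexp"
)

// agoRE mirrors the grep -Po pattern above: capture the text between the
// last "; " and the trailing " ago" on systemctl's Active: line.
var agoRE = regexp.MustCompile(`.*; (.*) ago`)

func main() {
	line := "Active: active (running) since Thu 2021-03-18 10:04:32 UTC; 2 days ago"
	if m := agoRE.FindStringSubmatch(line); m != nil {
		fmt.Println(m[1]) // prints "2 days"
	}
}
```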

@lucklove (Member) commented:

> Sorry for the spelling error: it is process_start_time_seconds, not start_timestamp. BTW, we can use `systemctl status xxx | grep -Po ".*; \K(.*)(?= ago)"` to get the process's uptime for the components that don't have the metric.

Nice. I deployed a cluster and checked the metrics; it seems PD and TiDB return this metric but TiKV doesn't.

To keep things consistent, we could use the systemctl approach for all components, whether or not they expose process_start_time_seconds. This may make display slow, because we would need to iterate over every instance and open an SSH connection, which isn't friendly for big clusters. So maybe we should make --uptime an option and not show uptime by default.

@9547 (Contributor, Author) commented Mar 22, 2021:

> Nice. I deployed a cluster and checked the metrics; it seems PD and TiDB return this metric but TiKV doesn't.
>
> To keep things consistent, we could use the systemctl approach for all components, whether or not they expose process_start_time_seconds. This may make display slow, because we would need to iterate over every instance and open an SSH connection, which isn't friendly for big clusters. So maybe we should make --uptime an option and not show uptime by default.

I've checked the components under cluster version v4.0.4 and DM nightly; the components below expose the process_start_time_seconds metric:

  • pd
  • tidb
  • tikv
  • ticdc
  • drainer
  • pump
  • dm-{master, worker}
  • alertmanager, grafana, prometheus

These components do not expose process_start_time_seconds:

  • tiflash
  • tispark-{master,worker}

So most of the components already expose the metric, and I think the metrics API is the more convenient route for normal usage. We can implement it through the metrics API first, and fall back to ssh + systemctl when the metric is missing or the service is down.

However, once all the services are down, every instance degrades to the ssh + systemctl path, which will hurt query time, so a --uptime or --no-uptime flag may be needed (see the sketch below) 🤔
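
Putting the two sources together, the lookup order could look like the fragment below. It reuses uptimeFromMetrics from the earlier sketch, and systemctlUptime stands in for a hypothetical helper wrapping the ssh + systemctl path; none of these names are TiUP's real API:

```go
// uptimeOf tries the component's own metrics API first and only falls back
// to ssh + systemctl when the metric is missing or the endpoint is down.
// Gated behind an --uptime flag so the default display stays fast.
func uptimeOf(statusAddr, host, service string, showUptime bool) (time.Duration, error) {
	if !showUptime {
		return 0, nil // --uptime not set: skip collection entirely
	}
	if d, err := uptimeFromMetrics(statusAddr); err == nil {
		return d, nil // fast path: one HTTP request
	}
	return systemctlUptime(host, service) // slow path: one SSH round trip
}
```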

@lucklove (Member) commented:

The version I checked was v4.0.0.

So I think there must be compatibility issues... not every version implements this.

@9547 (Contributor, Author) commented Apr 8, 2021:

It's implemented in #1231

9547 closed this as completed on Apr 8, 2021