
[Enhancement Request] Metrics Collector Push-based Implementation #577

Open
gaocegege opened this issue May 24, 2019 · 22 comments

Comments

@gaocegege
Member

/kind feature

Describe the solution you'd like
[A clear and concise description of what you want to happen.]

Currently the metrics collector design is pull-based: we run a metrics collector cron job per trial, which reads the pod's logs, parses them, and persists the metrics in MySQL.
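
For reference, with this pull-based approach the training code only prints metrics to stdout and the collector parses the pod log afterwards. A minimal sketch, assuming the `<name>=<value>` log format and fake metric values:

```python
import random

def train_one_epoch(epoch: int):
    """Stand-in for a real training step (illustrative values only)."""
    loss = max(1.0 - epoch * 0.08, 0.05) + random.uniform(0.0, 0.02)
    accuracy = min(0.5 + epoch * 0.05, 0.99)
    return loss, accuracy

for epoch in range(1, 11):
    loss, accuracy = train_one_epoch(epoch)
    # The pull-based collector reads the pod log and parses lines like these.
    print(f"epoch={epoch}")
    print(f"loss={loss:.4f}")
    print(f"accuracy={accuracy:.4f}")
```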

This design has some problems (kubeflow/training-operator#722 (comment)). @johnugeorge proposed a push-based model to avoid the problems caused by the current design, and I also have some ideas about it.

In my design, we need a push-based implementation that pushes the metrics to Prometheus. We can then use a custom metrics server to expose the trial- or job-level metrics, so Katib can get all periodic metrics from the Kubernetes API server. The early stopping services can use that API to decide whether we should kill a trial, and the UI can use it to show the metrics.

TFJob and PyTorchJob can also benefit from the metrics collector, because we can use it to collect periodic metrics for them, too, and the metrics will be exposed in a Kubernetes-native way via the K8s metrics API.
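
A minimal sketch of what the push side could look like, assuming a Prometheus Pushgateway is reachable from the trial pod; the gateway address, metric name, and trial name below are assumptions for illustration, not part of an agreed design:

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
trial_accuracy = Gauge(
    "katib_trial_accuracy",                      # illustrative metric name
    "Validation accuracy reported by a trial",
    ["trial"],
    registry=registry,
)

def push_trial_metric(trial_name: str, value: float) -> None:
    """Push one periodic metric for a trial to a Prometheus Pushgateway (sketch)."""
    trial_accuracy.labels(trial=trial_name).set(value)
    # "pushgateway.monitoring.svc:9091" is an assumed in-cluster address.
    push_to_gateway("pushgateway.monitoring.svc:9091", job=trial_name, registry=registry)

push_trial_metric("random-experiment-trial-abc123", 0.93)
```

A custom metrics adapter (for example, prometheus-adapter) could then surface these series through the custom metrics API, so early stopping and the UI read them from the Kubernetes API server.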

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

@gaocegege
Member Author

I am glad to discuss it after v1alpha2 is released.

@johnugeorge
Member

Yes. This is a long-standing need :) Let's take this up during the next API design phase.

@johnugeorge
Member

This is fixed with the new metrics collector design in v1alpha3.

@johnugeorge
Member

Closing the issue

@stale

stale bot commented Jan 3, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Jan 3, 2022
@andreyvelich
Member

/lifecycle frozen

@andreyvelich
Member

/area gsoc

@Electronic-Waste
Member

Electronic-Waste commented Feb 24, 2024

Hi everyone!

I'm Electronic-Waste, a.k.a. Shao Wang, a senior student at Shanghai Jiao Tong University. My main interests lie in cloud infrastructure around Kubernetes, and also AI infrastructure. I have two professional experiences with Kubernetes and Go:

  1. In the first, I wrote a naive Kubernetes implementation integrated with Knative together with my classmates.
  2. In the second, I implemented a heterogeneous network emulation system based on Kubernetes at @sjtu-sail @dtn-dslab, where I will pursue my master's degree.

I have also previously made some contributions to tensorchord/envd, open-source software for AI/ML development environments.

I noticed that Kubeflow has been chosen as one of the orgs participating in GSoC 2024, and I'm interested in this issue. How can I get started on it? Could you please offer me some guidance?

@tenzen-y
Member


@Electronic-Waste Hi, Shao. Thank you for your interest in the Kubeflow GSoC project.
We (mentors) plan to hold a dedicated community meeting for the GSoC candidates.

Please join the Kubeflow Slack workspace to receive more information about GSoC.

@Electronic-Waste
Member

@tenzen-y Okay, thanks for telling me about this.

@Electronic-Waste
Member

cc plz👀 @johnugeorge @gaocegege @andreyvelich @tenzen-y

I have a question about the push-based metrics collector. Could you please take a quick look at it?

The paper introducing Katib mentions that Katib supports two kinds of metrics collection: push-based and pull-based. If Katib has already implemented the push-based way to collect metrics, why do we still need to implement it again in this enhancement request (or in the GSoC project)? Do I misunderstand the content of the paper? Or has Katib implemented it via YAML configuration but now needs a Python SDK version?

[Screenshot: excerpt from the Katib paper describing push-based and pull-based metrics collection]

@tenzen-y
Member

tenzen-y commented Mar 2, 2024

@Electronic-Waste IIUC, the current Katib provides only pull-based metrics collectors, such as the file metrics collector and the TF Event metrics collector. So, we need to implement the push-based metrics collector.

Regarding the paper, I'm not sure whether it is correct, since I'm not one of its authors.

@Electronic-Waste
Member

@tenzen-y Thank you for your clarification! I'll look into the source code to get the details and write my proposal for implementing this enhancement request.

@andreyvelich
Member

@Electronic-Waste I think the main idea was that the user can still use the Katib DB Manager gRPC API to push metrics to the Katib DB. In that case, the user has to disable sidecar injection and make sure that the metrics have been collected.
We want to simplify this for users by (maybe) providing a new API in the Katib SDK to push metrics directly to the Katib DB.
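
For context, a rough sketch of that existing gRPC push path, assuming Python stubs generated from Katib's v1beta1 api.proto; the generated module names, the DB Manager address, and the values below are assumptions, so check the proto for the exact service and message shapes:

```python
import grpc

# Assumes stubs generated with grpcio-tools from Katib's v1beta1 api.proto,
# e.g. producing api_pb2.py and api_pb2_grpc.py; module names are illustrative.
import api_pb2
import api_pb2_grpc

def report_metric(trial_name: str, name: str, value: float, timestamp: str) -> None:
    """Push one observation log entry directly to the Katib DB Manager (sketch)."""
    # "katib-db-manager.kubeflow:6789" is an assumed in-cluster address.
    with grpc.insecure_channel("katib-db-manager.kubeflow:6789") as channel:
        stub = api_pb2_grpc.DBManagerStub(channel)
        stub.ReportObservationLog(api_pb2.ReportObservationLogRequest(
            trial_name=trial_name,
            observation_log=api_pb2.ObservationLog(metric_logs=[
                api_pb2.MetricLog(
                    time_stamp=timestamp,
                    metric=api_pb2.Metric(name=name, value=str(value)),
                ),
            ]),
        ))
```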

@Electronic-Waste
Member

@andreyvelich Okay, thanks for your clarification too! If I understand you correctly, you mean we want to add a new interface (such as katib_client.push_metrics()) for users to push metrics to the Katib DB directly, without changing the current logic?
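
Purely to make the idea concrete, usage might look roughly like the sketch below; push_metrics and its parameters are hypothetical here and would only be pinned down in the design proposal:

```python
import datetime
from kubeflow.katib import KatibClient

katib_client = KatibClient()

# Hypothetical method and signature, shown only to illustrate the idea;
# the real API (if any) would be defined in the design proposal.
katib_client.push_metrics(
    trial_name="random-experiment-trial-abc123",
    metrics={"accuracy": 0.93, "loss": 0.21},
    timestamp=datetime.datetime.now(datetime.timezone.utc).isoformat(),
)
```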

@andreyvelich
Member

As an option, yes, but we need a design proposal to discuss the various options and potential Experiment API changes.

@Electronic-Waste
Member


@andreyvelich Okay, I'll raise my design proposal in the next few days. Btw, how do you want me to send it to you?

@andreyvelich
Member

@Electronic-Waste Could you please join the AutoML and Training WG call this Wednesday at 2pm UTC (https://bit.ly/2PWVCkV) so we can discuss the details?
Also, since this task might be part of GSoC, we need to follow the process. cc @rareddy

@Electronic-Waste
Member

@andreyvelich Yeah, of course!

@YelenaYY

Hi @andreyvelich and @rareddy, out of curiosity for the proposal, what are some use cases that prevent the metrics-collector sidecar from working?

@andreyvelich
Member

/assign @Electronic-Waste

@andreyvelich
Member

Sorry for the late reply @YelenaYY.
@Electronic-Waste explained it very well in the proposal: https://github.com/kubeflow/katib/blob/aff97bf881e41f77b947f8e967263f31bad3103d/docs/proposals/push-based-metrics-collection.md#motivation.

It might have some performance drawbacks, since we have to parse the entire container stdout to get the required metrics.
Also, we redirect the stdout output to a file and parse it afterwards: https://github.com/kubeflow/katib/blob/master/pkg/metricscollector/v1beta1/file-metricscollector/file-metricscollector.go#L72.
This means your Trial Pod needs sufficient memory for it.

To redirect the output to that file, we wrap the container entrypoint, which always uses sh or bash to execute the script: https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/pod/utils.go#L163. That might not work for all use cases.
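
To illustrate the parsing step, here is a rough Python analogue of what the Go file metrics collector does; the file path, regex, and metric names are assumptions for the sketch, not the exact implementation:

```python
import re

# Lines in the "<metric-name>=<value>" style; regex and defaults are assumptions.
METRIC_RE = re.compile(r"^\s*([\w.-]+)\s*=\s*([+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)\s*$")

def parse_metrics(log_path="/var/log/katib/metrics.log", names=("accuracy", "loss")):
    """Scan the whole redirected stdout file and collect the requested metrics."""
    metrics = []
    with open(log_path) as f:
        for line in f:                      # the entire file is read, line by line
            m = METRIC_RE.match(line)
            if m and m.group(1) in names:
                metrics.append((m.group(1), float(m.group(2))))
    return metrics
```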
