
[Enhancement Request] Metrics Collector Push-based Implementation #577

Open
gaocegege opened this issue May 24, 2019 · 22 comments

Comments

@gaocegege
Member

/kind feature

Describe the solution you'd like
[A clear and concise description of what you want to happen.]

Currently the metrics collector design is pull-based: we run a metrics collector cron job per trial, which reads the pod's logs, parses them, and persists the metrics in MySQL.
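
For reference, with this pull-based approach the training code only prints metrics to stdout and the collector parses the pod log afterwards. A minimal sketch, assuming the `<name>=<value>` log format and fake metric values:

```python
import random

def train_one_epoch(epoch: int):
    """Stand-in for a real training step (illustrative values only)."""
    loss = max(1.0 - epoch * 0.08, 0.05) + random.uniform(0.0, 0.02)
    accuracy = min(0.5 + epoch * 0.05, 0.99)
    return loss, accuracy

for epoch in range(1, 11):
    loss, accuracy = train_one_epoch(epoch)
    # The pull-based collector reads the pod log and parses lines like these.
    print(f"epoch={epoch}")
    print(f"loss={loss:.4f}")
    print(f"accuracy={accuracy:.4f}")
```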

This design has some problems (kubeflow/training-operator#722 (comment)). @johnugeorge proposed a push-based model to avoid the problems caused by the current design, and I also have some ideas about it.

In my design, we need a push-based implementation that pushes the metrics to Prometheus. We can then use a custom metrics server to expose the trial- or job-level metrics, so Katib can get all periodic metrics from the Kubernetes API server. The early stopping services can use that API to decide whether we should kill a trial, and the UI can use it to show the metrics.

TFJob and PyTorchJob can also benefit from the metrics collector, because we can use it to collect periodic metrics for them, too, and the metrics will be exposed in a Kubernetes-native way via the K8s metrics API.
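
A minimal sketch of what the push side could look like, assuming a Prometheus Pushgateway is reachable from the trial pod; the gateway address, metric name, and trial name below are assumptions for illustration, not part of an agreed design:

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
trial_accuracy = Gauge(
    "katib_trial_accuracy",                      # illustrative metric name
    "Validation accuracy reported by a trial",
    ["trial"],
    registry=registry,
)

def push_trial_metric(trial_name: str, value: float) -> None:
    """Push one periodic metric for a trial to a Prometheus Pushgateway (sketch)."""
    trial_accuracy.labels(trial=trial_name).set(value)
    # "pushgateway.monitoring.svc:9091" is an assumed in-cluster address.
    push_to_gateway("pushgateway.monitoring.svc:9091", job=trial_name, registry=registry)

push_trial_metric("random-experiment-trial-abc123", 0.93)
```

A custom metrics adapter (for example, prometheus-adapter) could then surface these series through the custom metrics API, so early stopping and the UI read them from the Kubernetes API server.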

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

@gaocegege
Member Author

I am glad to discuss it after v1alpha2 is released.

@johnugeorge
Member

Yes. This is a long-standing need :) Let's take this up during the next API design phase.

@johnugeorge
Member

This is fixed with the new metrics collector design in v1alpha3.

@johnugeorge
Member

Closing the issue

@stale

stale bot commented Jan 3, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Jan 3, 2022
@andreyvelich
Member

/lifecycle frozen

@andreyvelich
Member

/area gsoc

@Electronic-Waste
Member

Electronic-Waste commented Feb 24, 2024

Hi everyone!

I'm Electronic-Waste, a.k.a. Shao Wang, a senior student at Shanghai Jiao Tong University. My main interests lie in cloud infrastructure around Kubernetes, and also AI infrastructure. I have two professional experiences with Kubernetes and Go:

  1. In the first, I wrote a naive Kubernetes implementation integrated with Knative together with my classmates.
  2. In the second, I implemented a heterogeneous network emulation system based on Kubernetes at @sjtu-sail @dtn-dslab, where I will pursue my master's degree.

I have also previously made some contributions to tensorchord/envd, open-source software for AI/ML development environments.

I noticed that Kubeflow has been chosen as one of the orgs participating in GSoC 2024, and I'm interested in this issue. How can I get started on it? Could you please offer me some guidance?

@tenzen-y
Member


@Electronic-Waste Hi, Shao. Thank you for your interest in the Kubeflow GSoC project.
We (mentors) plan to hold a dedicated community meeting for the GSoC candidates.

Please join the Kubeflow Slack workspace to receive more information about GSoC.

@Electronic-Waste
Member

@tenzen-y Okay, thanks for telling me about this.

@Electronic-Waste
Member

cc plz👀 @johnugeorge @gaocegege @andreyvelich @tenzen-y

I have a question about the push-based metrics collector. Could you please take a quick look at it?

The paper introducing Katib mentions that Katib supports two kinds of metrics collection: push-based and pull-based. If Katib has already implemented the push-based way to collect metrics, why do we still need to implement it again in this enhancement request (or in the GSoC project)? Do I misunderstand the content of the paper? Or has Katib implemented it via YAML configuration but now needs a Python SDK version?

[Screenshot: excerpt from the Katib paper describing push-based and pull-based metrics collection]

@tenzen-y
Member

tenzen-y commented Mar 2, 2024

@Electronic-Waste IIUC, the current Katib provides only pull-based metrics collectors, such as the file metrics collector and the TF Event metrics collector. So, we need to implement the push-based metrics collector.

Regarding the paper, I'm not sure whether it is correct, since I'm not one of its authors.

@Electronic-Waste
Member

@tenzen-y Thank you for your clarification! I'll look into the source code to get the details and write my proposal for implementing this enhancement request.

@andreyvelich
Member

@Electronic-Waste I think the main idea was that the user can still use the Katib DB Manager gRPC API to push metrics to the Katib DB. In that case, the user has to disable sidecar injection and make sure that the metrics have been collected.
We want to simplify this for users by (maybe) providing a new API in the Katib SDK to push metrics directly to the Katib DB.
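
For context, a rough sketch of that existing gRPC push path, assuming Python stubs generated from Katib's v1beta1 api.proto; the generated module names, the DB Manager address, and the values below are assumptions, so check the proto for the exact service and message shapes:

```python
import grpc

# Assumes stubs generated with grpcio-tools from Katib's v1beta1 api.proto,
# e.g. producing api_pb2.py and api_pb2_grpc.py; module names are illustrative.
import api_pb2
import api_pb2_grpc

def report_metric(trial_name: str, name: str, value: float, timestamp: str) -> None:
    """Push one observation log entry directly to the Katib DB Manager (sketch)."""
    # "katib-db-manager.kubeflow:6789" is an assumed in-cluster address.
    with grpc.insecure_channel("katib-db-manager.kubeflow:6789") as channel:
        stub = api_pb2_grpc.DBManagerStub(channel)
        stub.ReportObservationLog(api_pb2.ReportObservationLogRequest(
            trial_name=trial_name,
            observation_log=api_pb2.ObservationLog(metric_logs=[
                api_pb2.MetricLog(
                    time_stamp=timestamp,
                    metric=api_pb2.Metric(name=name, value=str(value)),
                ),
            ]),
        ))
```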

@Electronic-Waste
Member

@andreyvelich Okay, thanks for your clarification too! If I understand you correctly, you mean we want to add a new interface (such as katib_client.push_metrics()) for users to push metrics to the Katib DB directly, without changing the current logic?
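
Purely to make the idea concrete, usage might look roughly like the sketch below; push_metrics and its parameters are hypothetical here and would only be pinned down in the design proposal:

```python
import datetime
from kubeflow.katib import KatibClient

katib_client = KatibClient()

# Hypothetical method and signature, shown only to illustrate the idea;
# the real API (if any) would be defined in the design proposal.
katib_client.push_metrics(
    trial_name="random-experiment-trial-abc123",
    metrics={"accuracy": 0.93, "loss": 0.21},
    timestamp=datetime.datetime.now(datetime.timezone.utc).isoformat(),
)
```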

@andreyvelich
Member

As an option, yes, but we need a design proposal to discuss the various options and potential Experiment API changes.

@Electronic-Waste
Member


@andreyvelich Okay, I'll raise my design proposal in the next few days. Btw, how do you want me to send it to you?

@andreyvelich
Member

@Electronic-Waste Could you please join the AutoML and Training WG call this Wednesday at 2pm UTC (https://bit.ly/2PWVCkV) so we can discuss the details?
Also, since this task might be part of GSoC, we need to follow the process. cc @rareddy

@Electronic-Waste
Member

@andreyvelich Yeah, of course!

@YelenaYY

Hi @andreyvelich and @rareddy, out of curiosity for the proposal, what are some use cases that prevent the metrics-collector sidecar from working?

@andreyvelich
Member

/assign @Electronic-Waste

@andreyvelich
Member

Sorry for the late reply @YelenaYY.
@Electronic-Waste explained it very well in the proposal: https://github.com/kubeflow/katib/blob/aff97bf881e41f77b947f8e967263f31bad3103d/docs/proposals/push-based-metrics-collection.md#motivation.

It might have some performance drawbacks, since we have to parse the entire container stdout to get the required metrics.
Also, we redirect the stdout output to a file and parse it afterwards: https://github.com/kubeflow/katib/blob/master/pkg/metricscollector/v1beta1/file-metricscollector/file-metricscollector.go#L72.
This means your Trial Pod needs sufficient memory for it.

To redirect the output to that file, we wrap the container entrypoint, which always uses sh or bash to execute the script: https://github.com/kubeflow/katib/blob/master/pkg/webhook/v1beta1/pod/utils.go#L163. That might not work for all use cases.
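
To illustrate the parsing step, here is a rough Python analogue of what the Go file metrics collector does; the file path, regex, and metric names are assumptions for the sketch, not the exact implementation:

```python
import re

# Lines in the "<metric-name>=<value>" style; regex and defaults are assumptions.
METRIC_RE = re.compile(r"^\s*([\w.-]+)\s*=\s*([+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)\s*$")

def parse_metrics(log_path="/var/log/katib/metrics.log", names=("accuracy", "loss")):
    """Scan the whole redirected stdout file and collect the requested metrics."""
    metrics = []
    with open(log_path) as f:
        for line in f:                      # the entire file is read, line by line
            m = METRIC_RE.match(line)
            if m and m.group(1) in names:
                metrics.append((m.group(1), float(m.group(2))))
    return metrics
```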
