Trial pods are completed but never successful nor reused, metrics are not shown #1949

Closed
joaquingarciaatos opened this issue Sep 7, 2022 · 7 comments

Comments

joaquingarciaatos commented Sep 7, 2022

/kind bug

What steps did you take and what happened:

I have tried to run the Hyperparameter Tuning v1beta1 examples from the official Katib GitHub: https://github.com/kubeflow/katib/tree/master/examples/v1beta1/hp-tuning. The only thing I changed was the repository name (from kubeflow to joaquin-garcia), and I have tried both enabling and disabling the sidecar injection (our cluster uses Istio), as detailed in Step 3 of https://www.kubeflow.org/docs/components/katib/hyperparameter/ .
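
For reference, the sidecar-injection step in the linked docs amounts to adding an annotation to the trial job template. A minimal sketch is below; the experiment name, image, and command are placeholders, not the exact example that was run:

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-example            # placeholder name
spec:
  # objective, algorithm, and parameters omitted for brevity
  trialTemplate:
    primaryContainerName: training-container
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          metadata:
            annotations:
              # Disable Istio sidecar injection for the trial pods
              sidecar.istio.io/inject: "false"
          spec:
            containers:
              - name: training-container
                image: <training-image>           # placeholder
                command: ["<training-command>"]   # placeholder
            restartPolicy: Never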

The problem is that each pod executes one Trial (one combination of parameters), and the trial is marked as completed but never as successful (neither in the terminal nor in the UI), so the goal of the tool is not reached. I have checked that the algorithm runs in each pod, as the different epochs and metrics are shown in the terminal, but nothing is shown in the UI.

What did you expect to happen:
I expected each pod to be rerun with a different combination of values for each of the parameters under study / tuning.

Anything else you would like to add:

  • The katib-ui pod logs show "Trial random-<pod_number> has no pipeline run."
  • The UI always shows the same values:
    [screenshot: UI Trials view]

Environment:

  • Katib version (check the Katib controller image version): 0.13.0
  • Kubernetes version (kubectl version): Client v1.25.0 | Server v1.21.13
  • OS (uname -a): Linux microsoft-standard-WSL2 x86_64 x86_64 x86_64 GNU/Linux

Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍

@johnugeorge
Member

Can you check #1795 (comment)?

@joaquingarciaatos
Author

> Can you check #1795 (comment)?

Dear @johnugeorge, thank you very much for your reply. I have checked the two points of your comment:

  1. The status of the katib-db-manager pod is "Running", and I get the following logs:
I0909 12:47:11.994286       1 db.go:32] Using MySQL                                                                                                              
E0909 12:47:18.008245       1 connection.go:40] Ping to Katib db failed: dial tcp 10.43.110.158:3306: connect: connection refused
E0909 12:47:23.000273       1 connection.go:40] Ping to Katib db failed: dial tcp 10.43.110.158:3306: connect: connection refused
E0909 12:47:28.024249       1 connection.go:40] Ping to Katib db failed: dial tcp 10.43.110.158:3306: connect: connection refused
E0909 12:47:33.016374       1 connection.go:40] Ping to Katib db failed: dial tcp 10.43.110.158:3306: connect: connection refused
E0909 12:47:38.008362       1 connection.go:40] Ping to Katib db failed: dial tcp 10.43.110.158:3306: connect: connection refused
I0909 12:47:42.010038       1 init.go:27] Initializing v1beta1 DB schema                                                                                                                 
I0909 12:47:42.028309       1 main.go:113] Start Katib manager: 0.0.0.0:6789

So even though the first pings failed, I understand everything is fine with katib-db-manager, right?

  2. I am not sure if you are referring to the metrics-server pod. Looking at its logs, it seems it cannot create the TCP connection, so maybe that's the reason why I cannot see the metrics and the trials do not advance with new parameter values. This is the log I get:
I0708 11:32:57.277833       1 serving.go:341] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I0708 11:32:57.809105       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0708 11:32:57.809141       1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0708 11:32:57.809177       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0708 11:32:57.809211       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0708 11:32:57.809341       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0708 11:32:57.809349       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0708 11:32:57.809392       1 dynamic_serving_content.go:130] Starting serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key
I0708 11:32:57.809557       1 secure_serving.go:197] Serving securely on :443
I0708 11:32:57.809703       1 tlsconfig.go:240] Starting DynamicServingCertificateController
I0708 11:32:57.909691       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0708 11:32:57.909778       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0708 11:32:57.909810       1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
E0719 12:49:12.795252       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": dial tcp 62.212.86.98:10250: connect: connection refus
E0719 12:49:27.794191       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": dial tcp 62.212.86.98:10250: connect: connection refus
E0719 12:58:57.797287       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": dial tcp 62.212.86.98:10250: connect: connection refus
E0719 12:59:12.789134       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": dial tcp 62.212.86.98:10250: connect: connection refus
E0719 13:08:12.803248       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": dial tcp 62.212.86.98:10250: connect: connection refus
E0719 13:08:27.799399       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": dial tcp 62.212.86.98:10250: connect: connection refus
E0719 13:08:56.288725       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": context deadline exceeded" node="vtss067"
E0719 13:09:11.288884       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": context deadline exceeded" node="vtss067"
E0719 13:09:26.288453       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": context deadline exceeded" node="vtss067"
E0719 13:09:41.289342       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": context deadline exceeded" node="vtss067"
E0719 13:09:56.287960       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": dial tcp 62.212.86.98:10250: i/o timeout" node="vtss06
E0719 13:10:11.289028       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": context deadline exceeded" node="vtss067"   

Btw, when checking the logs of the katib-controller pod, I got the following error:

2022/09/09 13:16:23 http: TLS handshake error from 10.42.0.0:59926: remote error: tls: bad certificate

I think my error has its origin in the way the certificates are generated, but I am not sure how to solve it either.
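
For anyone debugging a similar setup, a few generic commands that may help narrow down where things break. The namespace and resource names below assume a standalone Katib install in the kubeflow namespace and may differ in other deployments:

# Check that the Katib components (db-manager, mysql, controller, ui) are healthy
kubectl get pods -n kubeflow | grep katib
kubectl logs -n kubeflow deployment/katib-db-manager
kubectl logs -n kubeflow deployment/katib-controller

# Inspect the webhook configurations and the secrets used for the webhook certs
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep katib
kubectl get secrets -n kubeflow | grep katib

# Look at the status and events of a concrete experiment/trial
kubectl get experiments,trials -n <experiment-namespace>
kubectl describe trial <trial-name> -n <experiment-namespace>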

@johnugeorge
Member

Sorry for the late reply. Is it a fresh installation? Could it be stale webhook configurations?

/cc @tenzen-y

@tenzen-y
Member

Maybe it was caused by old WebhookConfigurations or certs embedded in a Secret. As @johnugeorge says, can you re-install Katib after cleaning up the old Katib with the command below?

kubectl delete -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=v0.13.0"
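
Not part of the original suggestion, but since stale webhook configurations or an old webhook cert Secret can survive a delete, it may be worth checking for leftovers before re-installing. The exact resource names vary by version, so list them first and delete only what the list shows:

# List any leftover Katib webhook configurations and secrets after the delete
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep katib
kubectl get secrets -n kubeflow | grep katib

# Remove leftovers, if any (use the names printed above)
kubectl delete validatingwebhookconfiguration <leftover-webhook-config>
kubectl delete mutatingwebhookconfiguration <leftover-webhook-config>
kubectl delete secret <leftover-webhook-cert-secret> -n kubeflow

# Re-install the same manifests referenced in the delete command above
kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=v0.13.0"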

skgreenstar commented Nov 16, 2022


I have the same phenomenon. However, in my case it is not a katib-controller problem: even though everything was set up normally and the job completed normally, the data does not accumulate in the db.

For your information, I am connecting to a MySQL server other than the MySQL provided by Katib. The connection to that db is normal, and the observe_logs table is also created normally.

However, the data is not accumulating in the db.

[screenshot, 2022-11-16 7:48 PM]

What's the problem? Is there a solution?
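
A sketch of how one might check whether observation data is reaching the external MySQL at all. The namespace, database, and table names below are assumptions based on a default install (the default Katib database is usually called katib), so adjust them to your schema:

# Check the db-manager for insert/connection errors (assumed namespace: kubeflow)
kubectl logs -n kubeflow deployment/katib-db-manager

# Check all containers of a trial pod, including the metrics collector sidecar
kubectl logs <trial-pod-name> -n <experiment-namespace> --all-containers

# Query the metrics table directly on the external MySQL server
# (database/table names are assumptions; adjust to your setup)
mysql -h <mysql-host> -u <user> -p -e "SELECT COUNT(*) FROM katib.observation_logs;"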

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
