Trial pods are completed but never successful nor reused, metrics are not shown #1949

Closed
joaquingarciaatos opened this issue Sep 7, 2022 · 7 comments

Comments

joaquingarciaatos commented Sep 7, 2022

/kind bug

What steps did you take and what happened:

I have tried to run the Hyperparameter Tuning v1beta1 examples from the official Katib GitHub: https://github.com/kubeflow/katib/tree/master/examples/v1beta1/hp-tuning. The only thing I changed was the repository name (from kubeflow to joaquin-garcia), and I have tried both enabling and disabling the sidecar injection (our cluster uses Istio), as detailed in Step 3 of https://www.kubeflow.org/docs/components/katib/hyperparameter/ .
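
For reference, the sidecar-injection step in the linked docs amounts to adding an annotation to the trial job template. A minimal sketch is below; the experiment name, image, and command are placeholders, not the exact example that was run:

apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-example            # placeholder name
spec:
  # objective, algorithm, and parameters omitted for brevity
  trialTemplate:
    primaryContainerName: training-container
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          metadata:
            annotations:
              # Disable Istio sidecar injection for the trial pods
              sidecar.istio.io/inject: "false"
          spec:
            containers:
              - name: training-container
                image: <training-image>           # placeholder
                command: ["<training-command>"]   # placeholder
            restartPolicy: Never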

The problem is that each pod executes one Trial (one combination of parameters), and the trial is marked as completed but never as successful (neither in the terminal nor in the UI), so the goal of the tool is not reached. I have checked that the algorithm runs in each pod, as the different epochs and metrics are shown in the terminal, but nothing is shown in the UI.

What did you expect to happen:
I expected each pod to be rerun with a different combination of values for each of the parameters under study / tuning.

Anything else you would like to add:

  • The katib-ui pod logs show "Trial random-<pod_number> has no pipeline run."
  • The UI always shows the same values:
    [screenshot: UI Trials view]

Environment:

  • Katib version (check the Katib controller image version): 0.13.0
  • Kubernetes version (kubectl version): Client v1.25.0 | Server v1.21.13
  • OS (uname -a): Linux microsoft-standard-WSL2 x86_64 x86_64 x86_64 GNU/Linux

Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍

@johnugeorge
Member

Can you check #1795 (comment)?

@joaquingarciaatos
Author

> Can you check #1795 (comment)?

Dear @johnugeorge, thank you very much for your reply. I have checked the two points of your comment:

  1. The status of the katib-db-manager pod is "Running", and I get the following logs:
I0909 12:47:11.994286       1 db.go:32] Using MySQL                                                                                                              
E0909 12:47:18.008245       1 connection.go:40] Ping to Katib db failed: dial tcp 10.43.110.158:3306: connect: connection refused
E0909 12:47:23.000273       1 connection.go:40] Ping to Katib db failed: dial tcp 10.43.110.158:3306: connect: connection refused
E0909 12:47:28.024249       1 connection.go:40] Ping to Katib db failed: dial tcp 10.43.110.158:3306: connect: connection refused
E0909 12:47:33.016374       1 connection.go:40] Ping to Katib db failed: dial tcp 10.43.110.158:3306: connect: connection refused
E0909 12:47:38.008362       1 connection.go:40] Ping to Katib db failed: dial tcp 10.43.110.158:3306: connect: connection refused
I0909 12:47:42.010038       1 init.go:27] Initializing v1beta1 DB schema                                                                                                                 
I0909 12:47:42.028309       1 main.go:113] Start Katib manager: 0.0.0.0:6789

So even though the first pings failed, I understand everything is fine with katib-db-manager, right?

  2. I am not sure if you are referring to the metrics-server pod. Looking at its logs, it seems it cannot create the TCP connection, so maybe that's the reason why I cannot see the metrics and the trials do not advance with new parameter values. This is the log I get:
I0708 11:32:57.277833       1 serving.go:341] Generated self-signed cert (/tmp/apiserver.crt, /tmp/apiserver.key)
I0708 11:32:57.809105       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0708 11:32:57.809141       1 shared_informer.go:240] Waiting for caches to sync for RequestHeaderAuthRequestController
I0708 11:32:57.809177       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0708 11:32:57.809211       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0708 11:32:57.809341       1 configmap_cafile_content.go:202] Starting client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0708 11:32:57.809349       1 shared_informer.go:240] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0708 11:32:57.809392       1 dynamic_serving_content.go:130] Starting serving-cert::/tmp/apiserver.crt::/tmp/apiserver.key
I0708 11:32:57.809557       1 secure_serving.go:197] Serving securely on :443
I0708 11:32:57.809703       1 tlsconfig.go:240] Starting DynamicServingCertificateController
I0708 11:32:57.909691       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::client-ca-file
I0708 11:32:57.909778       1 shared_informer.go:247] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0708 11:32:57.909810       1 shared_informer.go:247] Caches are synced for RequestHeaderAuthRequestController
E0719 12:49:12.795252       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": dial tcp 62.212.86.98:10250: connect: connection refus
E0719 12:49:27.794191       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": dial tcp 62.212.86.98:10250: connect: connection refus
E0719 12:58:57.797287       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": dial tcp 62.212.86.98:10250: connect: connection refus
E0719 12:59:12.789134       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": dial tcp 62.212.86.98:10250: connect: connection refus
E0719 13:08:12.803248       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": dial tcp 62.212.86.98:10250: connect: connection refus
E0719 13:08:27.799399       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": dial tcp 62.212.86.98:10250: connect: connection refus
E0719 13:08:56.288725       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": context deadline exceeded" node="vtss067"
E0719 13:09:11.288884       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": context deadline exceeded" node="vtss067"
E0719 13:09:26.288453       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": context deadline exceeded" node="vtss067"
E0719 13:09:41.289342       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": context deadline exceeded" node="vtss067"
E0719 13:09:56.287960       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": dial tcp 62.212.86.98:10250: i/o timeout" node="vtss06
E0719 13:10:11.289028       1 scraper.go:139] "Failed to scrape node" err="Get \"https://62.212.86.98:10250/stats/summary?only_cpu_and_memory=true\": context deadline exceeded" node="vtss067"   

Btw, when checking the logs of the katib-controller pod, I got the following error:

2022/09/09 13:16:23 http: TLS handshake error from 10.42.0.0:59926: remote error: tls: bad certificate

I think my error has its origin in the way the certificates are generated, but I am not sure how to solve it either.
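
For anyone debugging a similar setup, a few generic commands that may help narrow down where things break. The namespace and resource names below assume a standalone Katib install in the kubeflow namespace and may differ in other deployments:

# Check that the Katib components (db-manager, mysql, controller, ui) are healthy
kubectl get pods -n kubeflow | grep katib
kubectl logs -n kubeflow deployment/katib-db-manager
kubectl logs -n kubeflow deployment/katib-controller

# Inspect the webhook configurations and the secrets used for the webhook certs
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep katib
kubectl get secrets -n kubeflow | grep katib

# Look at the status and events of a concrete experiment/trial
kubectl get experiments,trials -n <experiment-namespace>
kubectl describe trial <trial-name> -n <experiment-namespace>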

@johnugeorge
Member

Sorry for the late reply. Is it a fresh installation? Could it be stale webhook configurations?

/cc @tenzen-y

@tenzen-y
Member

Maybe it was caused by old WebhookConfigurations or certs embedded in a Secret. As @johnugeorge says, can you re-install Katib after cleaning up the old Katib with the command below?

kubectl delete -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=v0.13.0"
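
Not part of the original suggestion, but since stale webhook configurations or an old webhook cert Secret can survive a delete, it may be worth checking for leftovers before re-installing. The exact resource names vary by version, so list them first and delete only what the list shows:

# List any leftover Katib webhook configurations and secrets after the delete
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep katib
kubectl get secrets -n kubeflow | grep katib

# Remove leftovers, if any (use the names printed above)
kubectl delete validatingwebhookconfiguration <leftover-webhook-config>
kubectl delete mutatingwebhookconfiguration <leftover-webhook-config>
kubectl delete secret <leftover-webhook-cert-secret> -n kubeflow

# Re-install the same manifests referenced in the delete command above
kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=v0.13.0"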

skgreenstar commented Nov 16, 2022


I have the same phenomenon. However, in my case it is not a katib-controller problem: even though everything was set up normally and the job completed normally, the data does not accumulate in the db.

For your information, I am connecting to a MySQL server other than the MySQL provided by Katib. The connection to that db is normal, and the observe_logs table is also created normally.

However, the data is not accumulating in the db.

[screenshot, 2022-11-16 7:48 PM]

What's the problem? Is there a solution?
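
A sketch of how one might check whether observation data is reaching the external MySQL at all. The namespace, database, and table names below are assumptions based on a default install (the default Katib database is usually called katib), so adjust them to your schema:

# Check the db-manager for insert/connection errors (assumed namespace: kubeflow)
kubectl logs -n kubeflow deployment/katib-db-manager

# Check all containers of a trial pod, including the metrics collector sidecar
kubectl logs <trial-pod-name> -n <experiment-namespace> --all-containers

# Query the metrics table directly on the external MySQL server
# (database/table names are assumptions; adjust to your setup)
mysql -h <mysql-host> -u <user> -p -e "SELECT COUNT(*) FROM katib.observation_logs;"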

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
