file-metrics-collector rock is not accepting args #49

Closed
misohu opened this issue Sep 2, 2024 · 1 comment · Fixed by #52
Labels
bug Something isn't working

misohu commented Sep 2, 2024

Bug Description

When running the bundle integration tests for this task, the enas-cpu experiment fails. On further inspection, I noticed that the experiment pods are in the Error state:

NAME                             READY   STATUS    RESTARTS        AGE
enas-cpu-enas-67b84bb964-8h2rr   0/1     Running   1 (7m25s ago)   31m
hyperband-mfn2wr7s-jt579         0/2     Error     0               31m
hyperband-xlbv48sj-p952v         0/2     Error     0               31m

Where the metrics collector pod reports:

kubectl logs -f  -n test-kubeflow hyperband-mfn2wr7s-jt579 -c metrics-logger-and-collector
error: unknown flag `t'

If we look at the metrics collector pod YAML, we can see that it is receiving args:

metrics-logger-and-collector:
    Container ID:  containerd://3fe9eab8341d2db7e5da437b3fd54813c4a337fd0f112f8400a04554e9a43f37
    Image:         docker.io/charmedkubeflow/file-metrics-collector:v0.17.0-92cd6d9
    Image ID:      docker.io/charmedkubeflow/file-metrics-collector@sha256:5b8cc36d901858dc3aec68ed99f242d386e19d10549e4c306d4307bcf1ed8171
    Port:          <none>
    Host Port:     <none>
    Args:
      -t
      hyperband-mfn2wr7s
      -m
      loss
      -o-type
      minimize
      -s-db
      katib-db-manager.kubeflow:6789
      -path
      /var/log/katib/metrics.log
      -format
      TEXT
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Mon, 02 Sep 2024 12:35:01 +0200
      Finished:     Mon, 02 Sep 2024 12:35:01 +0200
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:                500m
      ephemeral-storage:  5Gi
      memory:             100Mi
    Requests:
      cpu:                50m
      ephemeral-storage:  500Mi
      memory:             10Mi
    Environment:          <none>
    Mounts:
      /var/log/katib from metrics-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-n48mx (ro)
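
To make the Args list above easier to read, here is a minimal sketch that groups the flat list into flag/value pairs. The argument values are copied from the pod description; the helper name `pair_flags` is hypothetical and is not part of the collector or of Katib. It assumes that every flag takes exactly one value, which holds for this list.

```python
from typing import List, Tuple

# Args copied verbatim from the pod description above.
ARGS = [
    "-t", "hyperband-mfn2wr7s",
    "-m", "loss",
    "-o-type", "minimize",
    "-s-db", "katib-db-manager.kubeflow:6789",
    "-path", "/var/log/katib/metrics.log",
    "-format", "TEXT",
]


def pair_flags(args: List[str]) -> List[Tuple[str, str]]:
    """Group a flat [flag, value, flag, value, ...] list into (flag, value) pairs.

    Hypothetical helper: assumes each flag is followed by exactly one value.
    """
    if len(args) % 2 != 0:
        raise ValueError("expected an even number of arguments")
    pairs = []
    for flag, value in zip(args[0::2], args[1::2]):
        if not flag.startswith("-"):
            raise ValueError(f"expected a flag, got {flag!r}")
        # Strip the leading dash(es) so only the flag name remains.
        pairs.append((flag.lstrip("-"), value))
    return pairs


if __name__ == "__main__":
    for name, value in pair_flags(ARGS):
        print(f"{name:8} {value}")
```

The first pair is the `t` flag, which is the one named in the error message.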

This is the same problem we faced in this issue.

To Reproduce

  1. Run the integration tests from this PR: Integrate v0.17.0 katib rocks (katib-operators#233)

Environment

MicroK8s v1.29.8
juju 3.4.5-genericlinux-amd64

Relevant Log Output

kubectl logs -f  -n test-kubeflow hyperband-mfn2wr7s-jt579 -c metrics-logger-and-collector
error: unknown flag `t'

Additional Context

No response

@misohu misohu added the bug Something isn't working label Sep 2, 2024
Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6196.

This message was autogenerated
