Katib UATs failing in AKS and EKS #1063
Comments
Thank you for reporting your feedback to us! An internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6246.
Looking at this test case in the integration tests, it is possible for the check to pass there (at the time of testing, the status is still Running) and for the test to then fail with the AssertionError shown above.
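One way to avoid asserting on a transient Running state is to poll until the experiment reaches a terminal status before checking the result. The following is a minimal sketch, not the UAT code: `get_status` is a hypothetical callable standing in for whatever the test uses to read the experiment status, and the status names in `TERMINAL_STATUSES` are assumptions.

```python
import time

# Assumed names of terminal statuses; adjust to what the test actually reads.
TERMINAL_STATUSES = ("Succeeded", "Failed")


def wait_for_terminal_status(get_status, timeout_s=600.0, interval_s=5.0,
                             sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns a terminal status or the timeout expires.

    get_status is a hypothetical zero-argument callable returning the current
    status string. sleep and clock are injectable so the helper can be tested
    without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still in status {status!r} after {timeout_s}s")
        sleep(interval_s)
```

A test would then assert on the returned value (for example, that it equals "Succeeded") instead of on whatever status happens to be current at the moment of the check.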
Debugging Katib Experiment in EKS Cluster
I've deployed my own EKS cluster and began debugging the Katib integration. The
How Katib Experiment Works
Let's start by examining the experiment we are creating in the problematic notebook.
metadata = V1ObjectMeta(
    name=EXPERIMENT_NAME,
)
algorithm_spec = V1beta1AlgorithmSpec(
    algorithm_name="cmaes"
)
objective_spec = V1beta1ObjectiveSpec(
    type="minimize",
    goal=0.001,
    objective_metric_name="loss",
    additional_metric_names=["Train-accuracy"]
)
# Experiment search space
# In this example, we tune learning rate, momentum, and optimizer
parameters = [
    V1beta1ParameterSpec(
        name="lr",
        parameter_type="double",
        feasible_space=V1beta1FeasibleSpace(
            min="0.01",
            max="0.06"
        ),
    ),
    V1beta1ParameterSpec(
        name="momentum",
        parameter_type="double",
        feasible_space=V1beta1FeasibleSpace(
            min="0.5",
            max="0.9"
        ),
    ),
]
# JSON template specification for the Trial's Worker Kubernetes Job
trial_spec = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "sidecar.istio.io/inject": "false"
                }
            },
            "spec": {
                "containers": [
                    {
                        "name": "training-container",
                        "image": "docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.14.0",
                        "command": [
                            "python3",
                            "/opt/pytorch-mnist/mnist.py",
                            "--epochs=1",
                            "--batch-size=16384",
                            "--lr=${trialParameters.learningRate}",
                            "--momentum=${trialParameters.momentum}",
                        ]
                    }
                ],
                "restartPolicy": "Never"
            }
        }
    }
}
trial_template = V1beta1TrialTemplate(
    primary_container_name="training-container",
    trial_parameters=[
        V1beta1TrialParameterSpec(
            name="learningRate",
            description="Learning rate for the training model",
            reference="lr"
        ),
        V1beta1TrialParameterSpec(
            name="momentum",
            description="Momentum for the training model",
            reference="momentum"
        ),
    ],
    trial_spec=trial_spec
)
experiment = V1beta1Experiment(
    api_version="kubeflow.org/v1beta1",
    kind="Experiment",
    metadata=metadata,
    spec=V1beta1ExperimentSpec(
        max_trial_count=3,
        parallel_trial_count=2,
        max_failed_trial_count=1,
        algorithm=algorithm_spec,
        objective=objective_spec,
        parameters=parameters,
        trial_template=trial_template,
    )
)
When this
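The link between the search space and the trial command in the definition above can be illustrated with a small standalone sketch: each `${trialParameters.<name>}` placeholder is resolved through the trial parameter's `reference` to a search-space parameter, whose suggested value is then substituted. This is not Katib's implementation; `render_trial_command` is a hypothetical helper, and the values 0.03 and 0.7 are made-up assignments chosen from inside the feasible ranges.

```python
import re

# Matches placeholders such as ${trialParameters.learningRate}
PLACEHOLDER = re.compile(r"\$\{trialParameters\.([A-Za-z0-9_]+)\}")


def render_trial_command(command, references, assignments):
    """Substitute trial-parameter placeholders in a command list.

    references:  placeholder name -> search-space parameter name
    assignments: search-space parameter name -> suggested value
    (Hypothetical helper for illustration only.)
    """
    def substitute(match):
        search_space_name = references[match.group(1)]
        return str(assignments[search_space_name])

    return [PLACEHOLDER.sub(substitute, arg) for arg in command]


command = [
    "python3",
    "/opt/pytorch-mnist/mnist.py",
    "--epochs=1",
    "--batch-size=16384",
    "--lr=${trialParameters.learningRate}",
    "--momentum=${trialParameters.momentum}",
]
# Mirrors the name/reference pairs of the trial parameters in the experiment.
references = {"learningRate": "lr", "momentum": "momentum"}
# Illustrative suggestion only; real values come from the algorithm.
assignments = {"lr": "0.03", "momentum": "0.7"}

rendered = render_trial_command(command, references, assignments)
```

A placeholder whose name has no matching trial parameter, or a reference that points at an undefined search-space parameter, raises a `KeyError` in this sketch, which makes naming mismatches between the template and the parameter list easy to spot.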
Problems Identified
During debugging, I encountered two major issues:
1. Issue with
Bug Description
After merging the PR with the Katib rocks for CKF 1.9, the UATs started to fail on AKS and EKS.
To Reproduce
Environment
CI for AKS or EKS
Relevant Log Output
Additional Context
No response