-
Notifications
You must be signed in to change notification settings - Fork 448
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Namespace and trial pod annotations as CLI argument #2138
Namespace and trial pod annotations as CLI argument #2138
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for updating this @nagar-ajay!
metadata: | ||
annotations: | ||
sidecar.istio.io/inject: "false" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously, we intentionally removed Istio annotation from the examples and explained that user can disable Istio here: https://www.kubeflow.org/docs/components/katib/hyperparameter/#example-using-random-search-algorithm.
@nagar-ajay Maybe we should update run-e2e-experiment.sh
script to append such annotation to our examples if that is required for Conformance.
WDYT @nagar-ajay @johnugeorge @tenzen-y ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @andreyvelich.
It might be better to run the kubectl patch
command in run-e2e-experiment.sh
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, as per conformance test design , we want to run tests inside a pod. I don't think we will have access to kubectl
inside the pod. Correct me if I'm wrong.
We will need to add support to patch annotations in run-e2e-experiment.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think programmatically patching the resources will not help here as the resources are already created with Istio sidecar.
We will need to update the definition somehow before creating the experiment. This is tricky. If we provided an option to pass annotation, this can create confusion as the passed annotation can be applied in more than one place (experiment metadata, pod metadata, etc).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, @tenzen-y meant not use kubectl patch
, but change the run-e2e-experiment script to modify Trial template before creating Experiment. Similar as here for Trial spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@johnugeorge @tenzen-y Other solution might be to add another experiment just for conformance test here: https://github.com/kubeflow/katib/tree/master/test/e2e/v1beta1/testdata with required annotation and other specification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we add another experiment, I think we should run the experiment on CI every PR similar to other examples.
Since we need to verify whether the experiment can be runnable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tenzen-y I see the same exact problem. Conformance tests should be tested regularly. Else, it will diverge from the core examples. I would suggest to go with current random experiment. For annotations, we have to handle 3 cases- non distributed trials, distributed trials and custom trials. We are handling just non distributed trials now. Others can be handled later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @johnugeorge. For now, injecting an annotation in run-e2e-experiment.py
might be better.
For conformance, we don't need to run all tests. Is random example enough? Having distributed training operators in examples might not be a good idea for users who are not using it |
if args.trial_pod_annotations: | ||
trial_spec = experiment.spec.trial_template.trial_spec | ||
trial_spec_metadata = trial_spec['spec']['template'].get('metadata', {}) | ||
trial_spec_metadata['annotations'] = eval(args.trial_pod_annotations) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What format do you plan to pass? Also, Consider a case where annotations already present
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm planning to pass string representation of a python dictionary (annotations's key-valye pairs) e.g. I'll pass '{"sidecar.istio.io/inject": "false"}'
to disable sidecar injection.
Will update the code to consider already present annotations.
@@ -249,6 +255,12 @@ def run_e2e_experiment( | |||
if experiment.metadata.name == "random": | |||
MAX_TRIAL_COUNT += 1 | |||
PARALLEL_TRIAL_COUNT += 1 | |||
if args.trial_pod_annotations: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please can we also throw an error if experiment.spec.trial_template.trial_spec['kind'] != "Job"
.
That will help users of conformance to not define incorrect Trial Spec (e.g. TFJob, PyTorchJob), since we only support Job
for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added NotImplementedError
"--namespace", type=str, required=True, help="Namespace for the Katib E2E test", | ||
) | ||
parser.add_argument( | ||
"--trial-pod-annotations", type=str, help="Annotation for the pod created by trial", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In which formant we are going to pass annotations ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw your comment above: #2138 (comment).
@nagar-ajay Did you test it if you can pass multiple annotation in that format ? E.g.:
'{"sidecar.istio.io/inject": "false", "custom-key": "custom-value"}'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
@@ -44,7 +44,8 @@ fi | |||
for exp_name in "${EXPERIMENT_FILE_ARRAY[@]}"; do | |||
echo "Running Experiment from $exp_name file" | |||
exp_path=$(find ../../../../../examples/v1beta1 -name "${exp_name}.yaml") | |||
python run-e2e-experiment.py --experiment-path "${exp_path}" --verbose || (kubectl get pods -n kubeflow && exit 1) | |||
python run-e2e-experiment.py --experiment-path "${exp_path}" --namespace kubeflow \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to run the test in the kubeflow namespace? I'd like to verify whether the katib controller can operate Experiments deployed out of the namespace in which the katib controller is deployed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a good point, maybe we could run our E2Es on default
namespace.
WDYT @johnugeorge @nagar-ajay ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich @tenzen-y I tried running random experiment test in the default namespace, but it failed. As per the error logs, I think other namespaces require katib.kubeflow.org/metrics-collector-injection: enabled
label.
Error log:
kubernetes.client.exceptions.ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Audit-Id': '689dd995-d83a-4915-9e69-130043993f14', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '66727794-bef2-4ca7-a384-6a1a7b8a571d', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'c6d64b7c-508c-4954-b20b-0b391dcc2e85', 'Date': 'Thu, 06 Apr 2023 12:40:50 GMT', 'Content-Length': '326'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"admission webhook \"validator.experiment.katib.kubeflow.org\" denied the request: Cannot create the Experiment \"random\" in namespace \"default\": the namespace lacks label \"katib.kubeflow.org/metrics-collector-injection: enabled\"","code":400}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we must add the label to the namespace.
Can you add processes to update the namespace label in run-e2e-experiment.py
if the namespace passed by --namespace
doesn't hold the label?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, will add.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Also updated test namespace to default
.
7108dd2
to
ac48f21
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for updating this @nagar-ajay!
/lgtm
/assign @tenzen-y @johnugeorge
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: johnugeorge, nagar-ajay The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What this PR does / why we need it:
For the conformance test, we want to run all e2e tests in
kf-conformance
namespace which is Istio enabled (created by Kubeflow Profile CRD)Currently, the namespace is hardcoded in all example manifests files and we don't have the option to configure the namespace.
Updated
run-e2e-experiment.py
to support namespace as a command line argument. Also, addkatib.kubeflow.org/metrics-collector-injection
label to the test namespace, if missing.Changed test namespace to
default
fromkubeflow
.In Istio enabled namespace, Pods of example manifests get stuck in the
NotReady
state. To fix this, added an option to passtrial-pod-annotations
. Users can pass'{"sidecar.istio.io/inject": "false"}'
annotation to disable istio sidecar injection.Since, we're loading kube config as part of
KatibClient
initialization, we don't need to do it again. Alsoconfig.load_kube_config()
statement breaks if we run the test inside k8s cluster.Updated
random.yaml
to put resource limits on trail spec pod container. This is required as the namespace created for the conformance test, has resource quota specified. https://github.com/kubeflow/kubeflow/blob/master/conformance/1.5/setup.yaml#L24-L28Testing: Tested modified manifests locally. Also, all these manifests run in e2e test as part of Katib CI.