Namespace and trial pod annotations as CLI argument #2138

nagar-ajay · 2023-03-29T12:29:46Z

What this PR does / why we need it:

For the conformance test, we want to run all e2e tests in the kf-conformance namespace, which is Istio-enabled (created by the Kubeflow Profile CRD).
Currently, the namespace is hardcoded in all example manifest files, and there is no option to configure it.
Updated run-e2e-experiment.py to support the namespace as a command-line argument. Also added the katib.kubeflow.org/metrics-collector-injection label to the test namespace if it is missing.
Changed the test namespace from kubeflow to default.
In an Istio-enabled namespace, Pods of the example manifests get stuck in the NotReady state. To fix this, added an option to pass trial-pod-annotations. Users can pass the '{"sidecar.istio.io/inject": "false"}' annotation to disable Istio sidecar injection.
Since we load the kube config as part of KatibClient initialization, we don't need to do it again. Also, the config.load_kube_config() statement breaks if we run the test inside a k8s cluster.
Updated random.yaml to put resource limits on the Trial spec pod container. This is required because the namespace created for the conformance test has a resource quota specified: https://github.com/kubeflow/kubeflow/blob/master/conformance/1.5/setup.yaml#L24-L28
Testing: Tested the modified manifests locally. All these manifests also run in the e2e tests as part of Katib CI.

andreyvelich

Thank you for updating this @nagar-ajay!

andreyvelich · 2023-03-29T13:24:13Z

examples/v1beta1/early-stopping/median-stop-with-json-format.yaml

+          metadata:
+            annotations:
+              sidecar.istio.io/inject: "false"


Previously, we intentionally removed the Istio annotation from the examples and explained that users can disable Istio here: https://www.kubeflow.org/docs/components/katib/hyperparameter/#example-using-random-search-algorithm.
@nagar-ajay Maybe we should update the run-e2e-experiment.sh script to append this annotation to our examples if it is required for Conformance.

WDYT @nagar-ajay @johnugeorge @tenzen-y ?

I agree with @andreyvelich.
It might be better to run the kubectl patch command in run-e2e-experiment.sh.

Actually, as per the conformance test design, we want to run the tests inside a pod. I don't think we will have access to kubectl inside the pod. Correct me if I'm wrong.

We will need to add support to patch annotations in run-e2e-experiment.py

I think programmatically patching the resources will not help here as the resources are already created with Istio sidecar.

We will need to update the definition somehow before creating the experiment. This is tricky. If we provide an option to pass an annotation, it can create confusion, as the passed annotation could be applied in more than one place (experiment metadata, pod metadata, etc.).

I think @tenzen-y meant not to use kubectl patch, but to change the run-e2e-experiment script to modify the Trial template before creating the Experiment, similar to what is done here for the Trial spec.

@johnugeorge @tenzen-y Other solution might be to add another experiment just for conformance test here: https://github.com/kubeflow/katib/tree/master/test/e2e/v1beta1/testdata with required annotation and other specification.

If we add another experiment, I think we should run it on CI for every PR, similar to the other examples,
since we need to verify that the experiment is runnable.

@tenzen-y I see the exact same problem. Conformance tests should be run regularly; otherwise, they will diverge from the core examples. I would suggest going with the current random experiment. For annotations, we have to handle three cases: non-distributed trials, distributed trials, and custom trials. We are handling only non-distributed trials now; the others can be handled later.

I agree with @johnugeorge. For now, injecting an annotation in run-e2e-experiment.py might be better.

johnugeorge · 2023-03-29T15:24:46Z

For conformance, we don't need to run all tests. Is the random example enough? Having distributed training operators in the examples might not be a good idea for users who are not using them.

johnugeorge · 2023-03-31T18:09:59Z

test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.py

+        if args.trial_pod_annotations:
+            trial_spec = experiment.spec.trial_template.trial_spec
+            trial_spec_metadata = trial_spec['spec']['template'].get('metadata', {})
+            trial_spec_metadata['annotations'] = eval(args.trial_pod_annotations)


What format do you plan to pass? Also, consider the case where annotations are already present.

I'm planning to pass the string representation of a Python dictionary (the annotations' key-value pairs), e.g. I'll pass '{"sidecar.istio.io/inject": "false"}' to disable sidecar injection.

Will update the code to consider already present annotations.
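For readers following the "already present annotations" point, here is a minimal sketch of one way the merge could look. It is not the code that was merged in this PR; the function name `merge_pod_annotations` is hypothetical, and it assumes the trial spec is a plain dictionary shaped like a Kubernetes Job and that the annotations have already been parsed into a dictionary.

```python
from typing import Any, Dict, Mapping


def merge_pod_annotations(
    trial_spec: Dict[str, Any], new_annotations: Mapping[str, str]
) -> Dict[str, Any]:
    """Merge annotations into the pod template of a Job-like trial spec, in place.

    Hypothetical helper: existing annotations are kept, and keys in
    new_annotations override existing keys with the same name.
    """
    # setdefault keeps each nested dict attached to trial_spec even when
    # the key was missing from the original manifest.
    template = trial_spec.setdefault("spec", {}).setdefault("template", {})
    metadata = template.setdefault("metadata", {})
    # An empty "annotations:" key in YAML loads as None, so fall back to {}.
    annotations = dict(metadata.get("annotations") or {})
    annotations.update(new_annotations)
    metadata["annotations"] = annotations
    return trial_spec
```

Using `setdefault` rather than `.get('metadata', {})` matters when the manifest has no `metadata` block: a dictionary returned by `.get` with a default is not attached to the spec, so annotations written to it would be lost.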

examples/v1beta1/hp-tuning/random.yaml

andreyvelich · 2023-04-04T21:37:08Z

test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.py

@@ -249,6 +255,12 @@ def run_e2e_experiment(
    if experiment.metadata.name == "random":
        MAX_TRIAL_COUNT += 1
        PARALLEL_TRIAL_COUNT += 1
+        if args.trial_pod_annotations:


Please can we also throw an error if experiment.spec.trial_template.trial_spec['kind'] != "Job".

That will help users of conformance to not define incorrect Trial Spec (e.g. TFJob, PyTorchJob), since we only support Job for now.

Added NotImplementedError
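The requested guard can be sketched as below. This is an illustration only, not the exact code added in the PR; the helper name `ensure_supported_trial_kind` and the constant `SUPPORTED_TRIAL_KINDS` are hypothetical, and the trial spec is assumed to be a plain dictionary with a top-level `kind` field.

```python
from typing import Any, Dict

# Assumption for this sketch: only plain Kubernetes Jobs are supported.
SUPPORTED_TRIAL_KINDS = ("Job",)


def ensure_supported_trial_kind(trial_spec: Dict[str, Any]) -> None:
    """Raise NotImplementedError if the trial spec is not a supported kind."""
    kind = trial_spec.get("kind")
    if kind not in SUPPORTED_TRIAL_KINDS:
        raise NotImplementedError(
            f"Trial pod annotations are only supported for kind 'Job', got {kind!r}"
        )
```

Failing early like this gives conformance users a clear message when they pass, say, a TFJob or PyTorchJob trial spec, instead of silently skipping the annotation.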

andreyvelich · 2023-04-04T21:40:47Z

test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.py

+        "--namespace", type=str, required=True, help="Namespace for the Katib E2E test",
+    )
+    parser.add_argument(
+        "--trial-pod-annotations", type=str, help="Annotation for the pod created by trial",


In which format are we going to pass annotations?

I saw your comment above: #2138 (comment).
@nagar-ajay Did you test whether multiple annotations can be passed in that format? E.g.:

'{"sidecar.istio.io/inject": "false", "custom-key": "custom-value"}'
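To make the format question concrete, here is a hedged sketch of how a multi-annotation string like the one above could be validated at argument-parsing time. It is an alternative to the `eval` call in the diff, not what the PR does; `parse_annotations` and `build_parser` are hypothetical names, and the sketch assumes the value is always passed as a JSON object with string values.

```python
import argparse
import json
from typing import Dict


def parse_annotations(value: str) -> Dict[str, str]:
    """argparse type function: parse a JSON object of string annotations."""
    try:
        parsed = json.loads(value)
    except json.JSONDecodeError as err:
        raise argparse.ArgumentTypeError(f"invalid JSON: {err}")
    if not isinstance(parsed, dict) or not all(
        isinstance(v, str) for v in parsed.values()
    ):
        raise argparse.ArgumentTypeError(
            "annotations must be a JSON object with string values"
        )
    return parsed


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--namespace", type=str, required=True,
        help="Namespace for the Katib E2E test",
    )
    parser.add_argument(
        "--trial-pod-annotations", type=parse_annotations, default=None,
        help="Annotations for the pod created by the trial, as a JSON object",
    )
    return parser
```

With this approach, `args.trial_pod_annotations` is already a dictionary (or `None`), and a malformed value is rejected by argparse before any experiment is created. Unlike `eval`, `json.loads` does not execute arbitrary code from the command line.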

tenzen-y · 2023-04-05T18:15:26Z

test/e2e/v1beta1/scripts/gh-actions/run-e2e-experiment.sh

@@ -44,7 +44,8 @@ fi
 for exp_name in "${EXPERIMENT_FILE_ARRAY[@]}"; do
  echo "Running Experiment from $exp_name file"
  exp_path=$(find ../../../../../examples/v1beta1 -name "${exp_name}.yaml")
-  python run-e2e-experiment.py --experiment-path "${exp_path}" --verbose || (kubectl get pods -n kubeflow && exit 1)
+  python run-e2e-experiment.py --experiment-path "${exp_path}" --namespace kubeflow \


Do we need to run the test in the kubeflow namespace? I'd like to verify whether the katib controller can operate Experiments deployed out of the namespace in which the katib controller is deployed.

That's a good point; maybe we could run our E2Es in the default namespace.
WDYT @johnugeorge @nagar-ajay ?

@andreyvelich @tenzen-y I tried running the random experiment test in the default namespace, but it failed. As per the error logs, I think other namespaces require the katib.kubeflow.org/metrics-collector-injection: enabled label.
Error log:

kubernetes.client.exceptions.ApiException: (400) Reason: Bad Request HTTP response headers: HTTPHeaderDict({'Audit-Id': '689dd995-d83a-4915-9e69-130043993f14', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '66727794-bef2-4ca7-a384-6a1a7b8a571d', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'c6d64b7c-508c-4954-b20b-0b391dcc2e85', 'Date': 'Thu, 06 Apr 2023 12:40:50 GMT', 'Content-Length': '326'}) HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"admission webhook \"validator.experiment.katib.kubeflow.org\" denied the request: Cannot create the Experiment \"random\" in namespace \"default\": the namespace lacks label \"katib.kubeflow.org/metrics-collector-injection: enabled\"","code":400}

Yes, we must add the label to the namespace.

Can you add logic to run-e2e-experiment.py that adds the label to the namespace passed via --namespace if the namespace doesn't have it?

Sure, will add.

Done. Also updated test namespace to default.
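As a rough illustration of the label check discussed here, the sketch below adds the label only when it is missing. It is not the code from the PR; the function name `ensure_metrics_collector_label` is hypothetical, and `core_api` is assumed to be an object with the `read_namespace` and `patch_namespace` methods of `kubernetes.client.CoreV1Api`.

```python
from typing import Any

LABEL_KEY = "katib.kubeflow.org/metrics-collector-injection"
LABEL_VALUE = "enabled"


def ensure_metrics_collector_label(core_api: Any, namespace: str) -> bool:
    """Add the metrics-collector label to the namespace if it is missing.

    core_api is assumed to behave like kubernetes.client.CoreV1Api.
    Returns True if a patch was sent, False if the label was already set.
    """
    ns = core_api.read_namespace(namespace)
    # metadata.labels is None when the namespace has no labels at all.
    labels = ns.metadata.labels or {}
    if labels.get(LABEL_KEY) == LABEL_VALUE:
        return False
    patch = {"metadata": {"labels": {LABEL_KEY: LABEL_VALUE}}}
    core_api.patch_namespace(namespace, patch)
    return True
```

Passing the API object in as a parameter keeps the helper independent of how the kube config was loaded, which fits the PR's point that KatibClient already loads the config.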

andreyvelich

Thank you for updating this @nagar-ajay!
/lgtm
/assign @tenzen-y @johnugeorge

johnugeorge · 2023-04-10T17:40:23Z

/approve

google-oss-prow · 2023-04-10T17:40:33Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge, nagar-ajay

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [johnugeorge]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow bot added the size/M label Mar 29, 2023

google-oss-prow bot requested review from anencore94, gaocegege and johnugeorge March 29, 2023 12:30

andreyvelich reviewed Mar 29, 2023

nagar-ajay changed the title ~~Disable istio sidecar and Namespace as CLI argument~~ [WIP] - Disable istio sidecar and Namespace as CLI argument Mar 31, 2023

google-oss-prow bot added size/S do-not-merge/work-in-progress and removed size/M labels Mar 31, 2023

nagar-ajay changed the title ~~[WIP] - Disable istio sidecar and Namespace as CLI argument~~ Disable istio sidecar and Namespace as CLI argument Mar 31, 2023

google-oss-prow bot removed the do-not-merge/work-in-progress label Mar 31, 2023

nagar-ajay changed the title ~~Disable istio sidecar and Namespace as CLI argument~~ Namespace and trial pod annotations as CLI argument Mar 31, 2023

johnugeorge reviewed Mar 31, 2023

andreyvelich reviewed Apr 4, 2023

tenzen-y reviewed Apr 5, 2023

nagar-ajay added 9 commits April 6, 2023 18:19

disable istio sidecar injection for example manifests

cccf2fd

add namespace as commnad line arg to python test script

d1573d0

revert disable istio sidecar injection

b4549ee

add option to pass trial pod annotations

3bbe4bd

split command over multiple lines

0c96cbe

remove redundant config loading

4372539

add resource limit to containers of random experiment's trial spec pod

6c282ca

update code to support already present annotations

6902e7f

raise NotImplementedError if trailSpec is different from Job

ac48f21

nagar-ajay force-pushed the disable_istio_sidecar branch from 7108dd2 to ac48f21 Compare April 6, 2023 12:49

add metrics-collector-injection to namespace under test if missing

84d68a5

google-oss-prow bot added size/M and removed size/S labels Apr 8, 2023

andreyvelich reviewed Apr 10, 2023

google-oss-prow bot assigned johnugeorge, tenzen-y and andreyvelich Apr 10, 2023

google-oss-prow bot added the lgtm label Apr 10, 2023

google-oss-prow bot added the approved label Apr 10, 2023

google-oss-prow bot merged commit 7a4c118 into kubeflow:master Apr 10, 2023
