Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Namespace and trial pod annotations as CLI argument #2138

Merged
merged 10 commits into from
Apr 10, 2023

Conversation

nagar-ajay
Copy link
Contributor

@nagar-ajay nagar-ajay commented Mar 29, 2023

What this PR does / why we need it:

  • For the conformance test, we want to run all e2e tests in kf-conformance namespace which is Istio enabled (created by Kubeflow Profile CRD)

  • Currently, the namespace is hardcoded in all example manifests files and we don't have the option to configure the namespace.

  • Updated run-e2e-experiment.py to support namespace as a command line argument. Also, add katib.kubeflow.org/metrics-collector-injection label to the test namespace, if missing.

  • Changed test namespace to default from kubeflow.

  • In Istio enabled namespace, Pods of example manifests get stuck in the NotReady state. To fix this, added an option to pass trial-pod-annotations. Users can pass '{"sidecar.istio.io/inject": "false"}' annotation to disable istio sidecar injection.

  • Since, we're loading kube config as part of KatibClient initialization, we don't need to do it again. Also config.load_kube_config() statement breaks if we run the test inside k8s cluster.

  • Updated random.yaml to put resource limits on trail spec pod container. This is required as the namespace created for the conformance test, has resource quota specified. https://github.com/kubeflow/kubeflow/blob/master/conformance/1.5/setup.yaml#L24-L28

  • Testing: Tested modified manifests locally. Also, all these manifests run in e2e test as part of Katib CI.

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for updating this @nagar-ajay!

Comment on lines 62 to 64
metadata:
annotations:
sidecar.istio.io/inject: "false"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, we intentionally removed Istio annotation from the examples and explained that user can disable Istio here: https://www.kubeflow.org/docs/components/katib/hyperparameter/#example-using-random-search-algorithm.
@nagar-ajay Maybe we should update run-e2e-experiment.sh script to append such annotation to our examples if that is required for Conformance.

WDYT @nagar-ajay @johnugeorge @tenzen-y ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @andreyvelich.
It might be better to run the kubectl patch command in run-e2e-experiment.sh.

Copy link
Contributor Author

@nagar-ajay nagar-ajay Mar 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, as per conformance test design , we want to run tests inside a pod. I don't think we will have access to kubectl inside the pod. Correct me if I'm wrong.

We will need to add support to patch annotations in run-e2e-experiment.py

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think programmatically patching the resources will not help here as the resources are already created with Istio sidecar.

We will need to update the definition somehow before creating the experiment. This is tricky. If we provided an option to pass annotation, this can create confusion as the passed annotation can be applied in more than one place (experiment metadata, pod metadata, etc).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, @tenzen-y meant not use kubectl patch, but change the run-e2e-experiment script to modify Trial template before creating Experiment. Similar as here for Trial spec.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnugeorge @tenzen-y Other solution might be to add another experiment just for conformance test here: https://github.com/kubeflow/katib/tree/master/test/e2e/v1beta1/testdata with required annotation and other specification.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add another experiment, I think we should run the experiment on CI every PR similar to other examples.
Since we need to verify whether the experiment can be runnable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tenzen-y I see the same exact problem. Conformance tests should be tested regularly. Else, it will diverge from the core examples. I would suggest to go with current random experiment. For annotations, we have to handle 3 cases- non distributed trials, distributed trials and custom trials. We are handling just non distributed trials now. Others can be handled later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @johnugeorge. For now, injecting an annotation in run-e2e-experiment.py might be better.

@johnugeorge
Copy link
Member

For conformance, we don't need to run all tests. Is random example enough? Having distributed training operators in examples might not be a good idea for users who are not using it

@nagar-ajay nagar-ajay changed the title Disable istio sidecar and Namespace as CLI argument [WIP] - Disable istio sidecar and Namespace as CLI argument Mar 31, 2023
@nagar-ajay nagar-ajay changed the title [WIP] - Disable istio sidecar and Namespace as CLI argument Disable istio sidecar and Namespace as CLI argument Mar 31, 2023
@nagar-ajay nagar-ajay changed the title Disable istio sidecar and Namespace as CLI argument Namespace and trial pod annotations as CLI argument Mar 31, 2023
if args.trial_pod_annotations:
trial_spec = experiment.spec.trial_template.trial_spec
trial_spec_metadata = trial_spec['spec']['template'].get('metadata', {})
trial_spec_metadata['annotations'] = eval(args.trial_pod_annotations)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What format do you plan to pass? Also, Consider a case where annotations already present

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm planning to pass string representation of a python dictionary (annotations's key-valye pairs) e.g. I'll pass '{"sidecar.istio.io/inject": "false"}' to disable sidecar injection.

Will update the code to consider already present annotations.

examples/v1beta1/hp-tuning/random.yaml Show resolved Hide resolved
@@ -249,6 +255,12 @@ def run_e2e_experiment(
if experiment.metadata.name == "random":
MAX_TRIAL_COUNT += 1
PARALLEL_TRIAL_COUNT += 1
if args.trial_pod_annotations:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please can we also throw an error if experiment.spec.trial_template.trial_spec['kind'] != "Job".

That will help users of conformance to not define incorrect Trial Spec (e.g. TFJob, PyTorchJob), since we only support Job for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added NotImplementedError

"--namespace", type=str, required=True, help="Namespace for the Katib E2E test",
)
parser.add_argument(
"--trial-pod-annotations", type=str, help="Annotation for the pod created by trial",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which formant we are going to pass annotations ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw your comment above: #2138 (comment).
@nagar-ajay Did you test it if you can pass multiple annotation in that format ? E.g.:

'{"sidecar.istio.io/inject": "false", "custom-key": "custom-value"}' 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@@ -44,7 +44,8 @@ fi
for exp_name in "${EXPERIMENT_FILE_ARRAY[@]}"; do
echo "Running Experiment from $exp_name file"
exp_path=$(find ../../../../../examples/v1beta1 -name "${exp_name}.yaml")
python run-e2e-experiment.py --experiment-path "${exp_path}" --verbose || (kubectl get pods -n kubeflow && exit 1)
python run-e2e-experiment.py --experiment-path "${exp_path}" --namespace kubeflow \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to run the test in the kubeflow namespace? I'd like to verify whether the katib controller can operate Experiments deployed out of the namespace in which the katib controller is deployed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good point, maybe we could run our E2Es on default namespace.
WDYT @johnugeorge @nagar-ajay ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich @tenzen-y I tried running random experiment test in the default namespace, but it failed. As per the error logs, I think other namespaces require katib.kubeflow.org/metrics-collector-injection: enabled label.
Error log:

kubernetes.client.exceptions.ApiException: (400)
Reason: Bad Request
HTTP response headers: HTTPHeaderDict({'Audit-Id': '689dd995-d83a-4915-9e69-130043993f14', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '66727794-bef2-4ca7-a384-6a1a7b8a571d', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'c6d64b7c-508c-4954-b20b-0b391dcc2e85', 'Date': 'Thu, 06 Apr 2023 12:40:50 GMT', 'Content-Length': '326'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"admission webhook \"validator.experiment.katib.kubeflow.org\" denied the request: Cannot create the Experiment \"random\" in namespace \"default\": the namespace lacks label \"katib.kubeflow.org/metrics-collector-injection: enabled\"","code":400}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we must add the label to the namespace.

Can you add processes to update the namespace label in run-e2e-experiment.py if the namespace passed by --namespace doesn't hold the label?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, will add.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Also updated test namespace to default.

@google-oss-prow google-oss-prow bot added size/M and removed size/S labels Apr 8, 2023
Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for updating this @nagar-ajay!
/lgtm
/assign @tenzen-y @johnugeorge

@johnugeorge
Copy link
Member

/approve

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johnugeorge, nagar-ajay

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 7a4c118 into kubeflow:master Apr 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants