Katib UATs failing in AKS and EKS #1063

Closed · misohu opened this issue Sep 10, 2024 · 3 comments · Fixed by canonical/katib-operators#238
Labels: bug (Something isn't working)

Comments


misohu commented Sep 10, 2024

Bug Description

After merging the PR that introduced the Katib rocks for CKF 1.9, the UATs started to fail on AKS and EKS.

To Reproduce

  1. Run the UATs from the main branch on either EKS or AKS.

Environment

CI for AKS or EKS

Relevant Log Output

=================================== FAILURES ===================================
_______________________ test_notebook[katib-integration] _______________________

test_notebook = '/tests/.worktrees/109a69f2868d156208bf90d5a013f571a244540d/tests/notebooks/katib/katib-integration.ipynb'

    @pytest.mark.ipynb
    @pytest.mark.parametrize(
        # notebook - ipynb file to execute
        "test_notebook",
        NOTEBOOKS.values(),
        ids=NOTEBOOKS.keys(),
    )
    def test_notebook(test_notebook):
        """Test Notebook Generic Wrapper."""
        os.chdir(os.path.dirname(test_notebook))
    
        with open(test_notebook) as nb:
            notebook = nbformat.read(nb, as_version=nbformat.NO_CONVERT)
    
        ep = ExecutePreprocessor(
            timeout=-1, kernel_name="python3", on_notebook_start=install_python_requirements
        )
        ep.skip_cells_with_tag = "pytest-skip"
    
        try:
            log.info(f"Running ***os.path.basename(test_notebook)***...")
            output_notebook, _ = ep.preprocess(notebook, {"metadata": {"path": "./"}})
            # persist the notebook output to the original file for debugging purposes
            save_notebook(output_notebook, test_notebook)
        except CellExecutionError as e:
            # handle underlying error
            pytest.fail(f"Notebook execution failed with ***e.ename***: ***e.evalue***")
    
        for cell in output_notebook.cells:
            metadata = cell.get("metadata", dict)
            if "raises-exception" in metadata.get("tags", []):
                for cell_output in cell.outputs:
                    if cell_output.output_type == "error":
                        # extract the error message from the cell output
                        log.error(format_error_message(cell_output.traceback))
>                       pytest.fail(cell_output.traceback[-1])
E                       Failed: AssertionError: Katib Experiment was not successful.

/tests/.worktrees/109a69f2868d156208bf90d5a013f571a244540d/tests/test_notebooks.py:59: Failed

Additional Context

No response

misohu added the bug label on Sep 10, 2024

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6246.

This message was autogenerated

mvlassis (Contributor) commented

Looking at this test case in the integration test, it is possible that the check passes while the experiment is still Running (as it was at the time of testing), and that the test then fails with the assertion error shown above.


misohu commented Sep 13, 2024

Debugging the Katib Experiment in an EKS Cluster

I deployed my own EKS cluster and began debugging the Katib integration. The katib-integration notebook fails both when triggered manually and when run from the driver. Before we dive into the details, let's first explain how a Katib experiment works behind the scenes.

How Katib Experiment Works

Let's start by examining the experiment we are creating in the problematic notebook.

# Imports (not shown in the notebook excerpt): models from the kubernetes and
# kubeflow-katib SDKs. EXPERIMENT_NAME is defined in an earlier notebook cell.
from kubernetes.client import V1ObjectMeta
from kubeflow.katib import (
    V1beta1AlgorithmSpec,
    V1beta1Experiment,
    V1beta1ExperimentSpec,
    V1beta1FeasibleSpace,
    V1beta1ObjectiveSpec,
    V1beta1ParameterSpec,
    V1beta1TrialParameterSpec,
    V1beta1TrialTemplate,
)

metadata = V1ObjectMeta(
    name=EXPERIMENT_NAME,
)

algorithm_spec = V1beta1AlgorithmSpec(
    algorithm_name="cmaes"
)

objective_spec = V1beta1ObjectiveSpec(
    type="minimize",
    goal=0.001,
    objective_metric_name="loss",
    additional_metric_names=["Train-accuracy"]
)

# Experiment search space
# In this example, we tune learning rate, momentum, and optimizer
parameters = [
    V1beta1ParameterSpec(
        name="lr",
        parameter_type="double",
        feasible_space=V1beta1FeasibleSpace(
            min="0.01",
            max="0.06"
        ),
    ),
    V1beta1ParameterSpec(
        name="momentum",
        parameter_type="double",
        feasible_space=V1beta1FeasibleSpace(
            min="0.5",
            max="0.9"
        ),
    ),
]

# JSON template specification for the Trial's Worker Kubernetes Job
trial_spec = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "sidecar.istio.io/inject": "false"
                }
            },
            "spec": {
                "containers": [
                    {
                        "name": "training-container",
                        "image": "docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.14.0",
                        "command": [
                            "python3",
                            "/opt/pytorch-mnist/mnist.py",
                            "--epochs=1",
                            "--batch-size=16384",
                            "--lr=${trialParameters.learningRate}",
                            "--momentum=${trialParameters.momentum}",
                        ]
                    }
                ],
                "restartPolicy": "Never"
            }
        }
    }
}

trial_template = V1beta1TrialTemplate(
    primary_container_name="training-container",
    trial_parameters=[
        V1beta1TrialParameterSpec(
            name="learningRate",
            description="Learning rate for the training model",
            reference="lr"
        ),
        V1beta1TrialParameterSpec(
            name="momentum",
            description="Momentum for the training model",
            reference="momentum"
        ),
    ],
    trial_spec=trial_spec
)

experiment = V1beta1Experiment(
    api_version="kubeflow.org/v1beta1",
    kind="Experiment",
    metadata=metadata,
    spec=V1beta1ExperimentSpec(
        max_trial_count=3,
        parallel_trial_count=2,
        max_failed_trial_count=1,
        algorithm=algorithm_spec,
        objective=objective_spec,
        parameters=parameters,
        trial_template=trial_template,
    )
)
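
For completeness, the notebook then submits this Experiment through the Katib SDK. A minimal sketch of that step is shown below; the exact client call and namespace handling in the notebook may differ, and the namespace value here is an assumption.

# Hedged sketch: submitting the Experiment object with the Katib SDK client.
# The namespace is an assumption; the notebook resolves the user's namespace itself.
from kubeflow.katib import KatibClient

katib_client = KatibClient()
katib_client.create_experiment(experiment, namespace="kubeflow-user-example-com")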

When this Experiment is created, the Katib controller first spawns a pod with one of the suggestion images. This pod is responsible for organizing the trials. Based on the experiment spec, the Katib controller creates Trial objects in the cluster. Each trial consists of a pod with two containers: the training-container and the metrics-logger-and-collector.

  • The training-container is responsible for running the ML job and computing the metrics defined in the Experiment. In our example, the training-container uses the image docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.14.0.
  • The metrics-logger-and-collector shares a volume with the training-container, where it collects logs and metrics. Once the training-container finishes, the metrics collector sends the collected metrics to katib-db-manager.
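
As an aside for anyone debugging this, the Experiment and its Trial objects can be inspected directly with the Kubernetes Python client. The sketch below is a minimal, hedged example: the namespace value is an assumption, and the label selector follows Katib's usual convention of labelling trials with katib.kubeflow.org/experiment=<experiment-name>.

# Hedged debugging sketch: list the Experiment and its Trials via the
# Kubernetes custom-objects API.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
api = client.CustomObjectsApi()

NAMESPACE = "kubeflow-user-example-com"  # assumption: adjust to your profile
# EXPERIMENT_NAME is the same name used when creating the Experiment above.

experiment_cr = api.get_namespaced_custom_object(
    group="kubeflow.org", version="v1beta1", namespace=NAMESPACE,
    plural="experiments", name=EXPERIMENT_NAME,
)
print(experiment_cr.get("status", {}).get("conditions", []))

trials = api.list_namespaced_custom_object(
    group="kubeflow.org", version="v1beta1", namespace=NAMESPACE,
    plural="trials",
    label_selector=f"katib.kubeflow.org/experiment={EXPERIMENT_NAME}",
)
for trial in trials.get("items", []):
    conditions = trial.get("status", {}).get("conditions", [])
    print(trial["metadata"]["name"],
          conditions[-1]["type"] if conditions else "Unknown")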

Problems Identified

During debugging, I encountered two major issues:

1. Issue with katib-db-manager

The first (and smaller) issue is with katib-db-manager. In the rock-based container we set the working directory in rockcraft.yaml, but this setting is not used by the charm's Pebble service layer: the charm declares the service with override: replace, so the container executes from the wrong directory and the relative path ./katib-db-manager is not found. This can be resolved by declaring the service with override: merge in the charm, as sketched below.
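
For illustration, a minimal sketch of what the charm-side Pebble layer could look like with the merge option (this is not the actual charm code; the service name and the fields shown are assumptions):

# Hedged sketch: a charm-side Pebble layer that merges with the service
# definition baked into the rock, so settings from rockcraft.yaml such as the
# working directory are preserved instead of being replaced.
KATIB_DB_MANAGER_LAYER = {
    "summary": "katib-db-manager layer",
    "services": {
        "katib-db-manager": {
            "override": "merge",  # "replace" would drop the rock's working directory
            "startup": "enabled",
            "environment": {
                "DB_NAME": "mysql",  # assumption: example environment only
            },
        }
    },
}

# Inside the charm's pebble-ready handler, roughly:
#     container.add_layer("katib-db-manager", KATIB_DB_MANAGER_LAYER, combine=True)
#     container.replan()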

2. Issue with file-metrics-collector

The second, more significant problem is related to the way the file-metrics-collector is implemented. According to the upstream code, the metrics collector uses the WaitMainProcesses function, which in turn calls the WaitPIDs function.

The issue arises because WaitPIDs blocks execution until all of the watched processes have finished, which it determines by checking the /proc directory for running processes. This approach works fine in Docker containers, but rock-based containers always have the Pebble process running for the lifetime of the container. Since Pebble never exits, the function never returns, and the Katib experiment hangs.
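
To make the failure mode concrete, here is a rough Python model of the waiting logic described above (the upstream implementation is in Go; this is only an illustration of the behaviour, not the actual code):

# Hedged illustration: a simplified model of the WaitPIDs behaviour. It polls
# /proc until none of the watched PIDs exist any more. In a rock, Pebble runs
# for the whole lifetime of the container, so if its PID is among the watched
# processes the loop never exits and the metrics collector never completes.
import os
import time

def wait_for_pids(pids, poll_interval=1.0):
    """Block until every PID in `pids` has disappeared from /proc."""
    remaining = set(pids)
    while remaining:
        remaining = {pid for pid in remaining if os.path.exists(f"/proc/{pid}")}
        if remaining:
            time.sleep(poll_interval)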

Solutions

Fix for katib-db-manager

This issue has been fixed in a separate PR, where we modify the charm to use override: merge instead of override: replace for the service configuration.

Fix for the file-metrics-collector

Unfortunately, due to the persistent Pebble process in rock-based containers, we cannot use the rock for the file-metrics-collector without modifying the upstream code. Therefore, we must revert to using the upstream Docker image instead. I have tested this solution with the Docker image, and it works as expected.

It is important to note that this issue also affects the tfevent-metrics-collector, as it relies on the same process-waiting logic.
