Katib UATs failing in AKS and EKS #1063

Closed · misohu opened this issue Sep 10, 2024 · 3 comments · Fixed by canonical/katib-operators#238
Labels: bug (Something isn't working)

Comments


misohu commented Sep 10, 2024

Bug Description

After merging the PR that introduced the Katib rocks for CKF 1.9, the UATs started to fail on AKS and EKS.

To Reproduce

  1. Run the UATs from the main branch on either EKS or AKS.

Environment

CI for AKS or EKS

Relevant Log Output

=================================== FAILURES ===================================
_______________________ test_notebook[katib-integration] _______________________

test_notebook = '/tests/.worktrees/109a69f2868d156208bf90d5a013f571a244540d/tests/notebooks/katib/katib-integration.ipynb'

    @pytest.mark.ipynb
    @pytest.mark.parametrize(
        # notebook - ipynb file to execute
        "test_notebook",
        NOTEBOOKS.values(),
        ids=NOTEBOOKS.keys(),
    )
    def test_notebook(test_notebook):
        """Test Notebook Generic Wrapper."""
        os.chdir(os.path.dirname(test_notebook))
    
        with open(test_notebook) as nb:
            notebook = nbformat.read(nb, as_version=nbformat.NO_CONVERT)
    
        ep = ExecutePreprocessor(
            timeout=-1, kernel_name="python3", on_notebook_start=install_python_requirements
        )
        ep.skip_cells_with_tag = "pytest-skip"
    
        try:
            log.info(f"Running ***os.path.basename(test_notebook)***...")
            output_notebook, _ = ep.preprocess(notebook, {"metadata": {"path": "./"}})
            # persist the notebook output to the original file for debugging purposes
            save_notebook(output_notebook, test_notebook)
        except CellExecutionError as e:
            # handle underlying error
            pytest.fail(f"Notebook execution failed with ***e.ename***: ***e.evalue***")
    
        for cell in output_notebook.cells:
            metadata = cell.get("metadata", dict)
            if "raises-exception" in metadata.get("tags", []):
                for cell_output in cell.outputs:
                    if cell_output.output_type == "error":
                        # extract the error message from the cell output
                        log.error(format_error_message(cell_output.traceback))
>                       pytest.fail(cell_output.traceback[-1])
E                       Failed: AssertionError: Katib Experiment was not successful.

/tests/.worktrees/109a69f2868d156208bf90d5a013f571a244540d/tests/test_notebooks.py:59: Failed

Additional Context

No response

misohu added the bug label on Sep 10, 2024

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6246.

This message was autogenerated

mvlassis (Contributor) commented

Looking at this test case in the integration test, it is possible that the check passes while the experiment is still Running (as it was at the time of testing), and that the test then fails with the assertion error shown above.


misohu commented Sep 13, 2024

Debugging the Katib Experiment in an EKS Cluster

I deployed my own EKS cluster and began debugging the Katib integration. The katib-integration notebook fails both when triggered manually and when run from the driver. Before we dive into the details, let's first explain how a Katib experiment works behind the scenes.

How Katib Experiment Works

Let's start by examining the experiment we are creating in the problematic notebook.

# Imports (not shown in the notebook excerpt): models from the kubernetes and
# kubeflow-katib SDKs. EXPERIMENT_NAME is defined in an earlier notebook cell.
from kubernetes.client import V1ObjectMeta
from kubeflow.katib import (
    V1beta1AlgorithmSpec,
    V1beta1Experiment,
    V1beta1ExperimentSpec,
    V1beta1FeasibleSpace,
    V1beta1ObjectiveSpec,
    V1beta1ParameterSpec,
    V1beta1TrialParameterSpec,
    V1beta1TrialTemplate,
)

metadata = V1ObjectMeta(
    name=EXPERIMENT_NAME,
)

algorithm_spec = V1beta1AlgorithmSpec(
    algorithm_name="cmaes"
)

objective_spec = V1beta1ObjectiveSpec(
    type="minimize",
    goal=0.001,
    objective_metric_name="loss",
    additional_metric_names=["Train-accuracy"]
)

# Experiment search space
# In this example, we tune learning rate, momentum, and optimizer
parameters = [
    V1beta1ParameterSpec(
        name="lr",
        parameter_type="double",
        feasible_space=V1beta1FeasibleSpace(
            min="0.01",
            max="0.06"
        ),
    ),
    V1beta1ParameterSpec(
        name="momentum",
        parameter_type="double",
        feasible_space=V1beta1FeasibleSpace(
            min="0.5",
            max="0.9"
        ),
    ),
]

# JSON template specification for the Trial's Worker Kubernetes Job
trial_spec = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "sidecar.istio.io/inject": "false"
                }
            },
            "spec": {
                "containers": [
                    {
                        "name": "training-container",
                        "image": "docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.14.0",
                        "command": [
                            "python3",
                            "/opt/pytorch-mnist/mnist.py",
                            "--epochs=1",
                            "--batch-size=16384",
                            "--lr=${trialParameters.learningRate}",
                            "--momentum=${trialParameters.momentum}",
                        ]
                    }
                ],
                "restartPolicy": "Never"
            }
        }
    }
}

trial_template = V1beta1TrialTemplate(
    primary_container_name="training-container",
    trial_parameters=[
        V1beta1TrialParameterSpec(
            name="learningRate",
            description="Learning rate for the training model",
            reference="lr"
        ),
        V1beta1TrialParameterSpec(
            name="momentum",
            description="Momentum for the training model",
            reference="momentum"
        ),
    ],
    trial_spec=trial_spec
)

experiment = V1beta1Experiment(
    api_version="kubeflow.org/v1beta1",
    kind="Experiment",
    metadata=metadata,
    spec=V1beta1ExperimentSpec(
        max_trial_count=3,
        parallel_trial_count=2,
        max_failed_trial_count=1,
        algorithm=algorithm_spec,
        objective=objective_spec,
        parameters=parameters,
        trial_template=trial_template,
    )
)
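
For completeness, the notebook then submits this Experiment through the Katib SDK. A minimal sketch of that step is shown below; the exact client call and namespace handling in the notebook may differ, and the namespace value here is an assumption.

# Hedged sketch: submitting the Experiment object with the Katib SDK client.
# The namespace is an assumption; the notebook resolves the user's namespace itself.
from kubeflow.katib import KatibClient

katib_client = KatibClient()
katib_client.create_experiment(experiment, namespace="kubeflow-user-example-com")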

When this Experiment is created, the Katib controller first spawns a pod with one of the suggestion images. This pod is responsible for organizing the trials. Based on the experiment spec, the Katib controller creates Trial objects in the cluster. Each trial consists of a pod with two containers: the training-container and the metrics-logger-and-collector.

  • The training-container is responsible for running the ML job and computing the metrics defined in the Experiment. In our example, the training-container uses the image docker.io/kubeflowkatib/pytorch-mnist-cpu:v0.14.0.
  • The metrics-logger-and-collector shares a volume with the training-container, where it collects logs and metrics. Once the training-container finishes, the metrics collector sends the collected metrics to katib-db-manager.
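
As an aside for anyone debugging this, the Experiment and its Trial objects can be inspected directly with the Kubernetes Python client. The sketch below is a minimal, hedged example: the namespace value is an assumption, and the label selector follows Katib's usual convention of labelling trials with katib.kubeflow.org/experiment=<experiment-name>.

# Hedged debugging sketch: list the Experiment and its Trials via the
# Kubernetes custom-objects API.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
api = client.CustomObjectsApi()

NAMESPACE = "kubeflow-user-example-com"  # assumption: adjust to your profile
# EXPERIMENT_NAME is the same name used when creating the Experiment above.

experiment_cr = api.get_namespaced_custom_object(
    group="kubeflow.org", version="v1beta1", namespace=NAMESPACE,
    plural="experiments", name=EXPERIMENT_NAME,
)
print(experiment_cr.get("status", {}).get("conditions", []))

trials = api.list_namespaced_custom_object(
    group="kubeflow.org", version="v1beta1", namespace=NAMESPACE,
    plural="trials",
    label_selector=f"katib.kubeflow.org/experiment={EXPERIMENT_NAME}",
)
for trial in trials.get("items", []):
    conditions = trial.get("status", {}).get("conditions", [])
    print(trial["metadata"]["name"],
          conditions[-1]["type"] if conditions else "Unknown")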

Problems Identified

During debugging, I encountered two major issues:

1. Issue with katib-db-manager

The first (and smaller) issue is with katib-db-manager. In the rock-based container we set the working directory in rockcraft.yaml, but this setting is not used by the charm's Pebble service layer: the charm declares the service with override: replace, so the container executes from the wrong directory and the relative path ./katib-db-manager is not found. This can be resolved by declaring the service with override: merge in the charm, as sketched below.
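
For illustration, a minimal sketch of what the charm-side Pebble layer could look like with the merge option (this is not the actual charm code; the service name and the fields shown are assumptions):

# Hedged sketch: a charm-side Pebble layer that merges with the service
# definition baked into the rock, so settings from rockcraft.yaml such as the
# working directory are preserved instead of being replaced.
KATIB_DB_MANAGER_LAYER = {
    "summary": "katib-db-manager layer",
    "services": {
        "katib-db-manager": {
            "override": "merge",  # "replace" would drop the rock's working directory
            "startup": "enabled",
            "environment": {
                "DB_NAME": "mysql",  # assumption: example environment only
            },
        }
    },
}

# Inside the charm's pebble-ready handler, roughly:
#     container.add_layer("katib-db-manager", KATIB_DB_MANAGER_LAYER, combine=True)
#     container.replan()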

2. Issue with file-metrics-collector

The second, more significant problem is related to the way the file-metrics-collector is implemented. According to the upstream code, the metrics collector uses the WaitMainProcesses function, which in turn calls the WaitPIDs function.

The issue arises because WaitPIDs blocks execution until all of the watched processes have finished, which it determines by checking the /proc directory for running processes. This approach works fine in Docker containers, but rock-based containers always have the Pebble process running for the lifetime of the container. Since Pebble never exits, the function never returns, and the Katib experiment hangs.
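
To make the failure mode concrete, here is a rough Python model of the waiting logic described above (the upstream implementation is in Go; this is only an illustration of the behaviour, not the actual code):

# Hedged illustration: a simplified model of the WaitPIDs behaviour. It polls
# /proc until none of the watched PIDs exist any more. In a rock, Pebble runs
# for the whole lifetime of the container, so if its PID is among the watched
# processes the loop never exits and the metrics collector never completes.
import os
import time

def wait_for_pids(pids, poll_interval=1.0):
    """Block until every PID in `pids` has disappeared from /proc."""
    remaining = set(pids)
    while remaining:
        remaining = {pid for pid in remaining if os.path.exists(f"/proc/{pid}")}
        if remaining:
            time.sleep(poll_interval)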

Solutions

Fix for katib-db-manager

This issue has been fixed in a separate PR, where we modify the charm to use override: merge instead of override: replace for the service configuration.

Fix for the file-metrics-collector

Unfortunately, due to the persistent Pebble process in rock-based containers, we cannot use the rock for the file-metrics-collector without modifying the upstream code. Therefore, we must revert to using the upstream Docker image instead. I have tested this solution with the Docker image, and it works as expected.

It is important to note that this issue also affects the tfevent-metrics-collector, as it relies on the same process-waiting logic.
