We will follow the guide here to create a custom Kubeflow component that "bursts" off a job (a Flux MiniCluster) from a Kubeflow pipeline. This isn't traditional bursting as we've imagined it with Flux (extending a running cluster), but rather creating a scoped job that handles one specific task in a workflow (and then goes away).
Create your environment with the Kubeflow Pipelines SDK:
python -m venv env
source env/bin/activate
pip install kfp --upgrade
Note that this component is still under development.
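To preview where this is heading, here is a rough sketch of how the finished component might be defined with the Kubeflow Pipelines SDK. The in-container entrypoint path and the idea of passing deploy.py's flags straight through are assumptions for illustration; only the image name and flag names come from this guide.

```python
# A sketch only: the image is built later in this guide, and the entrypoint
# path (/src/deploy.py) and pass-through flags are assumptions, not a final API.
from kfp import dsl


@dsl.container_component
def flux_minicluster(command: str, nnodes: str, ntasks: str):
    # Parameters are passed as strings for simplicity in this sketch
    return dsl.ContainerSpec(
        image="ghcr.io/converged-computing/flux-operator-component",
        command=["python", "/src/deploy.py"],
        args=["--command", command, "--nnodes", nnodes, "--ntasks", ntasks],
    )


@dsl.pipeline(name="flux-burst-example")
def pipeline():
    flux_minicluster(command="echo hello world", nnodes="4", ntasks="4")
```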
The basis of this component is to launch a job to a cluster, monitor it until it finishes, and return the output. Since local development typically means we already have a cluster running (e.g., kind), we can test this out with an argument that indicates the cluster already exists (a sketch of this submit-and-monitor loop follows the local example below). In this case you don't need to build the container, but you will need the dependencies installed locally:
pip install -r requirements.txt
Then, create a local kind cluster.
kind create cluster
At this point we can emulate (manually) submitting a workflow, running the job, monitoring it, and then cleaning up. We will tell the script that the cluster already exists and is local (meaning kubectl will hit it!):
python ./src/deploy.py --local --image ghcr.io/flux-framework/flux-restful-api:latest \
--command "echo hello world" \
--nnodes 4 \
--ntasks 4 \
--outfile hello-world.out \
--quiet
You'll see the Flux Operator install, the MiniCluster get created, the pods reach the Running phase, and finally the output!
⭐️ Creating the minicluster flux-sample in flux-operator...
flux-sample-0-fjb9p is in phase Pending
flux-sample-1-g2sjv is in phase Pending
flux-sample-2-tmb45 is in phase Pending
flux-sample-0-fjb9p is in phase Running
flux-sample-1-g2sjv is in phase Running
flux-sample-2-tmb45 is in phase Running
flux-sample-3-zgsv6 is in phase Running
All pods are in states "Running" or "Succeeded" or "Completed"
hello world
hello world
hello world
hello world
All pods are terminated.
The output will also be saved to the file you specified:
$ cat hello-world.out
hello world
hello world
hello world
hello world
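For context, here is a simplified sketch of the submit-and-monitor loop that a script like deploy.py performs, using the Kubernetes Python client. The namespace, label selector, MiniCluster apiVersion, and the lack of failure handling are all simplifications and assumptions for illustration, not the script's actual implementation.

```python
# A simplified sketch (not the real deploy.py): create a MiniCluster custom
# resource, poll the pods until they finish, then collect their logs.
# The apiVersion, namespace, and label selector below are assumptions.
import time

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
crd = client.CustomObjectsApi()

minicluster = {
    "apiVersion": "flux-framework.org/v1alpha2",
    "kind": "MiniCluster",
    "metadata": {"name": "flux-sample", "namespace": "flux-operator"},
    "spec": {
        "size": 4,
        "containers": [
            {
                "image": "ghcr.io/flux-framework/flux-restful-api:latest",
                "command": "echo hello world",
            }
        ],
    },
}

# The Flux Operator reconciles this custom resource into pods
crd.create_namespaced_custom_object(
    group="flux-framework.org",
    version="v1alpha2",
    namespace="flux-operator",
    plural="miniclusters",
    body=minicluster,
)

# Poll pod phases until everything has succeeded (failure handling omitted)
while True:
    pods = core.list_namespaced_pod(
        "flux-operator", label_selector="job-name=flux-sample"
    ).items
    phases = {pod.status.phase for pod in pods}
    print(phases)
    if pods and phases == {"Succeeded"}:
        break
    time.sleep(5)

# Gather the job output from the pod logs
for pod in pods:
    print(core.read_namespaced_pod_log(pod.metadata.name, "flux-operator"))
```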
When you are done, you can delete the cluster:
$ kind delete cluster
Now we can try doing the same thing, but on Google Cloud! This emulates already running from a container (in Kubeflow) and creating the cluster on Google Cloud.
GOOGLE_PROJECT=llnl-flux
python ./src/deploy.py --project ${GOOGLE_PROJECT} --image ghcr.io/flux-framework/flux-restful-api:latest \
--command "echo hello world" \
--nnodes 4 \
--ntasks 4 \
--outfile hello-world.out \
--quiet
You'll see a similar interaction (running the job and getting output), but since we are bursting to Google Cloud, the cluster will be created beforehand and destroyed afterward!
Command is echo hello world
⭐️ Creating the minicluster flux-sample in flux-operator...
flux-sample-0-htkh7 is in phase Pending
flux-sample-0-htkh7 is in phase Pending
flux-sample-1-wwzj7 is in phase Running
flux-sample-2-gkhrq is in phase Running
flux-sample-3-nf2vl is in phase Pending
flux-sample-0-htkh7 is in phase Running
flux-sample-1-wwzj7 is in phase Running
flux-sample-2-gkhrq is in phase Running
flux-sample-3-nf2vl is in phase Running
All pods are in states "Running" or "Succeeded" or "Completed"
hello world
hello world
hello world
hello world
All pods are terminated.
🧿️ Destroying Kubernetes Cluster, we are done!
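For reference, creating and tearing down the cluster programmatically could look something like the sketch below, assuming the google-cloud-container package. The zone, cluster name, and node count are placeholders, and the real script may use entirely different tooling; note that create_cluster returns a long-running operation that should be polled before using the cluster.

```python
# A minimal sketch, not the actual deploy.py: create a small GKE cluster,
# run the job, then delete the cluster. Zone, name, and size are placeholders.
import os

from google.cloud import container_v1

project = os.environ["GOOGLE_PROJECT"]
location = f"projects/{project}/locations/us-central1-a"

manager = container_v1.ClusterManagerClient()

# Returns a long-running operation; waiting for it to complete is omitted here
manager.create_cluster(
    parent=location,
    cluster=container_v1.Cluster(name="flux-burst", initial_node_count=4),
)

# ... submit and monitor the MiniCluster as in the local example ...

# Tear the cluster down when the job is done
manager.delete_cluster(name=f"{location}/clusters/flux-burst")
```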
Now let's test fully in Kubeflow! First, build the container:
$ docker build -t ghcr.io/converged-computing/flux-operator-component .
This part is still under development.
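Once that piece is ready, compiling and submitting a pipeline that uses the component sketched earlier might look roughly like this (the host URL and the module holding the pipeline function are placeholders):

```python
# A rough sketch: compile the pipeline sketched earlier and submit it to a
# Kubeflow Pipelines endpoint. The host URL and module name are placeholders.
import kfp
from kfp import compiler

# Hypothetical module containing the @dsl.pipeline function sketched earlier
from pipeline_def import pipeline

compiler.Compiler().compile(pipeline_func=pipeline, package_path="pipeline.yaml")

client = kfp.Client(host="http://localhost:8080")
client.create_run_from_pipeline_package("pipeline.yaml", arguments={})
```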
HPCIC DevTools is distributed under the terms of the MIT license. All new contributions must be made under this license.
See LICENSE, COPYRIGHT, and NOTICE for details.
SPDX-License-Identifier: (MIT)
LLNL-CODE-842614