KEP-2170: [SDK] Initial implementation of the Kubeflow Training V2 Python SDK #2324

Merged
25 commits
af8ee35
Init commit
andreyvelich Nov 8, 2024
ff90a42
Implement SDK APIs
andreyvelich Nov 9, 2024
0107648
Update __init__ file
andreyvelich Nov 9, 2024
4f68b67
Add model and dataset configs
andreyvelich Nov 9, 2024
1f9e585
Remove print
andreyvelich Nov 11, 2024
c38cb7c
Add status to TrainJob
andreyvelich Nov 12, 2024
b1ef228
Fix sed for Linux
andreyvelich Nov 12, 2024
3525b4d
Fix PHASE_POST_TRAINING const
andreyvelich Nov 12, 2024
e874244
Add device labels for the runtimes
andreyvelich Nov 13, 2024
21fa7c8
Add Components to the TrainJob type
andreyvelich Nov 13, 2024
71f1646
Get container devices util
andreyvelich Nov 15, 2024
4ad2521
Use SDK assets in initializer
andreyvelich Nov 15, 2024
fd4d5aa
Add the runtime_ref to the list_jobs() API
andreyvelich Nov 15, 2024
59e603e
Check if runtime label presents
andreyvelich Nov 21, 2024
42fc257
Add Trainer arg as part of train API
andreyvelich Nov 25, 2024
2621c51
Rename lora to peft config
andreyvelich Nov 26, 2024
4e44209
Using global prop to remove SDK tests
andreyvelich Nov 27, 2024
8fa0d0a
Fix type for lora_dropout
andreyvelich Nov 28, 2024
09473bd
Import the JobSet models
andreyvelich Nov 28, 2024
4d5ee70
Parse runtime as object
andreyvelich Nov 29, 2024
7989e39
Remove old gen_openapi
andreyvelich Nov 29, 2024
b8715bd
Rename GPU_DEVICE_LABEL to NVIDIA_GPU_DEVICE_LABEL
andreyvelich Nov 29, 2024
8e8243c
Run codegen
andreyvelich Dec 10, 2024
7c4cb22
Install git on initializer images
andreyvelich Dec 10, 2024
52f73da
Rename device to accelerator for Runtime class
andreyvelich Dec 11, 2024
2 changes: 1 addition & 1 deletion api.v2/openapi-spec/swagger.json
@@ -199,7 +199,7 @@
"spec": {
"description": "Specification of the desired JobSet which will be created from TrainJob.",
"default": {},
- "$ref": "#/definitions/sigs.k8s.io.jobset.api.jobset.v1alpha2.JobSetSpec"
+ "$ref": "#/definitions/jobset.v1alpha2.JobSetSpec"
}
}
},
10 changes: 9 additions & 1 deletion cmd/initializer_v2/dataset/Dockerfile
@@ -4,10 +4,18 @@ WORKDIR /workspace

# Copy the required Python modules.
COPY cmd/initializer_v2/dataset/requirements.txt .
COPY sdk/python/kubeflow sdk/python/kubeflow
COPY pkg/initializer_v2 pkg/initializer_v2

# Install the needed packages.
RUN pip install -r requirements.txt

# Git is needed for the Kubeflow Training SDK to download JobSet Python models.
RUN apk update && apk add --no-cache git

# Copy and install the Kubeflow Training SDK for the configs.
COPY sdk_v2 sdk_v2
COPY LICENSE LICENSE
COPY README.md README.md
RUN pip install ./sdk_v2

ENTRYPOINT ["python", "-m", "pkg.initializer_v2.dataset"]
10 changes: 9 additions & 1 deletion cmd/initializer_v2/model/Dockerfile
@@ -4,10 +4,18 @@ WORKDIR /workspace

# Copy the required Python modules.
COPY cmd/initializer_v2/model/requirements.txt .
COPY sdk/python/kubeflow sdk/python/kubeflow
COPY pkg/initializer_v2 pkg/initializer_v2

# Install the needed packages.
RUN pip install -r requirements.txt

# Git is needed for the Kubeflow Training SDK to download JobSet Python models.
RUN apk update && apk add --no-cache git

# Copy and install the Kubeflow Training SDK for the configs.
COPY sdk_v2 sdk_v2
COPY LICENSE LICENSE
COPY README.md README.md
RUN pip install ./sdk_v2

ENTRYPOINT ["python", "-m", "pkg.initializer_v2.model"]
12 changes: 12 additions & 0 deletions hack/python-sdk-v2/gen-sdk.sh
@@ -56,3 +56,15 @@ git clean -f ${SDK_OUTPUT_PATH}/tox.ini

# Revert the README since it is manually created.
git checkout ${SDK_OUTPUT_PATH}/README.md
git checkout ${SDK_OUTPUT_PATH}/kubeflow/training/__init__.py

# Manually modify the SDK version in the __init__.py file.
if [[ $(uname) == "Darwin" ]]; then
sed -i '' -e "s/__version__.*/__version__ = \"${SDK_VERSION}\"/" ${SDK_OUTPUT_PATH}/kubeflow/training/__init__.py
else
sed -i -e "s/__version__.*/__version__ = \"${SDK_VERSION}\"/" ${SDK_OUTPUT_PATH}/kubeflow/training/__init__.py
fi

# Kubeflow models must have Kubernetes models to perform serialization.
printf "\n# Import JobSet models for the serialization. It imports the Kubernetes models.\n" >>${SDK_OUTPUT_PATH}/kubeflow/training/models/__init__.py
printf "from jobset.models import *\n" >>${SDK_OUTPUT_PATH}/kubeflow/training/models/__init__.py
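
The Darwin branch in the hunk above is there because BSD sed (the macOS default) expects a separate backup-suffix argument after `-i`, which is why an empty `''` is passed, while GNU sed takes no separate argument. The following is a minimal sketch that runs the same substitution against a throwaway file rather than the real `__init__.py`. The helper name `set_sdk_version`, the version `9.9.9`, and the sample file contents are made up for illustration and are not part of the PR.

```shell
#!/bin/sh
set -eu

# Hypothetical helper (not in the PR): rewrite the __version__ line in place.
# $1 is the new version, $2 is the file to edit.
set_sdk_version() {
  if [ "$(uname)" = "Darwin" ]; then
    # BSD sed: -i takes a separate backup suffix; '' means "no backup file".
    sed -i '' -e "s/__version__.*/__version__ = \"$1\"/" "$2"
  else
    # GNU sed: -i takes no separate argument.
    sed -i -e "s/__version__.*/__version__ = \"$1\"/" "$2"
  fi
}

# Exercise the helper on a temporary file with made-up contents.
tmp_file="$(mktemp)"
printf '__version__ = "0.0.0"\n# other content stays untouched\n' >"${tmp_file}"
set_sdk_version "9.9.9" "${tmp_file}"
cat "${tmp_file}"
```

Only the line containing `__version__` is rewritten; every other line in the file passes through unchanged.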
1,468 changes: 0 additions & 1,468 deletions hack/python-sdk/swagger.json

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion hack/swagger-v2/main.go
@@ -66,9 +66,9 @@ func main() {

func swaggify(name string) string {
name = strings.Replace(name, "github.com/kubeflow/training-operator/pkg/apis/", "", -1)
name = strings.Replace(name, "sigs.k8s.io/jobset/api/", "", -1)
name = strings.Replace(name, "k8s.io/api/core/", "", -1)
name = strings.Replace(name, "k8s.io/apimachinery/pkg/apis/meta/", "", -1)
name = strings.Replace(name, "k8s.io/apimachinery/pkg/api/resource", "", -1)
name = strings.Replace(name, "/", ".", -1)
return name
}
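
The JobSet prefix rule in `swaggify` appears to be what shortens the `$ref` in `swagger.json` from `sigs.k8s.io.jobset.api.jobset.v1alpha2.JobSetSpec` to `jobset.v1alpha2.JobSetSpec`. The sketch below is a shell approximation of that renaming, not the Go generator itself. It applies only the JobSet prefix rule and the slash-to-dot rule, and it assumes the definition name starts out as the Go package path followed by the type name. The function name `swaggify_sketch` is hypothetical.

```shell
#!/bin/sh
set -eu

# Hypothetical shell approximation of two of the swaggify rules:
#   1. drop the "sigs.k8s.io/jobset/api/" prefix,
#   2. turn every remaining "/" into ".".
swaggify_sketch() {
  printf '%s\n' "$1" | sed -e 's#sigs\.k8s\.io/jobset/api/##g' -e 's#/#.#g'
}

# Assumed input form: Go package path + "." + type name.
swaggify_sketch "sigs.k8s.io/jobset/api/jobset/v1alpha2.JobSetSpec"
```

Dropping the first `-e` expression from the `sed` call yields the longer, fully dotted name, which matches the `$ref` value removed from `swagger.json`.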
47 changes: 29 additions & 18 deletions hack/update-codegen.sh
@@ -7,44 +7,55 @@ set -o errexit
set -o nounset
set -o pipefail

GO_CMD=${1:-go}
CURRENT_DIR=$(dirname "${BASH_SOURCE[0]}")
TRAINING_OPERATOR_ROOT=$(realpath "${CURRENT_DIR}/..")
TRAINING_OPERATOR_PKG="github.com/kubeflow/training-operator"
CODEGEN_PKG=$(go list -m -mod=readonly -f "{{.Dir}}" k8s.io/code-generator)

cd "$CURRENT_DIR/.."

# shellcheck source=/dev/null
# Get the code-generator binary.
CODEGEN_PKG=$(go list -m -mod=readonly -f "{{.Dir}}" k8s.io/code-generator)
source "${CODEGEN_PKG}/kube_codegen.sh"
echo ">> Using ${CODEGEN_PKG}"

# Generating conversion and defaults functions
# Generating deepcopy and defaults.
echo "Generating deepcopy and defaults for kubeflow.org/v1 and kubeflow.org/v2alpha1"
kube::codegen::gen_helpers \
--boilerplate "${TRAINING_OPERATOR_ROOT}/hack/boilerplate/boilerplate.go.txt" \
"${TRAINING_OPERATOR_ROOT}/pkg/apis"

# Generating OpenAPI for Kueue API extensions for v1
kube::codegen::gen_openapi \
# Generate clients for Training Operator V1 and V2
echo "Generating clients for kubeflow.org/v1 and kubeflow.org/v2alpha1"
kube::codegen::gen_client \
--boilerplate "${TRAINING_OPERATOR_ROOT}/hack/boilerplate/boilerplate.go.txt" \
--output-dir "${TRAINING_OPERATOR_ROOT}/pkg/client" \
--output-pkg "${TRAINING_OPERATOR_PKG}/pkg/client" \
--with-watch \
--with-applyconfig \
"${TRAINING_OPERATOR_ROOT}/pkg/apis"

# Get the kube-openapi binary.
OPENAPI_PKG=$(go list -m -mod=readonly -f "{{.Dir}}" k8s.io/kube-openapi)
echo ">> Using ${OPENAPI_PKG}"

echo "Generating OpenAPI specification for kubeflow.org/v1"
go run ${OPENAPI_PKG}/cmd/openapi-gen \
--go-header-file "${TRAINING_OPERATOR_ROOT}/hack/boilerplate/boilerplate.go.txt" \
--output-pkg "${TRAINING_OPERATOR_PKG}/pkg/apis/kubeflow.org/v1" \
--output-dir "${TRAINING_OPERATOR_ROOT}/pkg/apis/kubeflow.org/v1" \
--output-file "zz_generated.openapi.go" \
--report-filename "${TRAINING_OPERATOR_ROOT}/hack/violation_exception_v1.list" \
--update-report \
"${TRAINING_OPERATOR_ROOT}/pkg/apis/kubeflow.org/v1"

# Generating OpenAPI for Kueue API extensions for v2alpha1
kube::codegen::gen_openapi \
--boilerplate "${TRAINING_OPERATOR_ROOT}/hack/boilerplate/boilerplate.go.txt" \
echo "Generating OpenAPI specification for kubeflow.org/v2alpha1"
go run ${OPENAPI_PKG}/cmd/openapi-gen \
--go-header-file "${TRAINING_OPERATOR_ROOT}/hack/boilerplate/boilerplate.go.txt" \
--output-pkg "${TRAINING_OPERATOR_PKG}/pkg/apis/kubeflow.org/v2alpha1" \
--output-dir "${TRAINING_OPERATOR_ROOT}/pkg/apis/kubeflow.org/v2alpha1" \
--output-file "zz_generated.openapi.go" \
--report-filename "${TRAINING_OPERATOR_ROOT}/hack/violation_exception_v2alpha1.list" \
--update-report \
"${TRAINING_OPERATOR_ROOT}/pkg/apis/kubeflow.org/v2alpha1"

kube::codegen::gen_client \
--boilerplate "${TRAINING_OPERATOR_ROOT}/hack/boilerplate/boilerplate.go.txt" \
--output-dir "${TRAINING_OPERATOR_ROOT}/pkg/client" \
--output-pkg "${TRAINING_OPERATOR_PKG}/pkg/client" \
--with-watch \
--with-applyconfig \
"${TRAINING_OPERATOR_ROOT}/pkg/apis"
# Generating OpenAPI Swagger for Training Operator V2.
echo "Generate OpenAPI Swagger for kubeflow.org/v2alpha1"
go run hack/swagger-v2/main.go >api.v2/openapi-spec/swagger.json
9 changes: 0 additions & 9 deletions hack/violation_exception_v1.list
@@ -4,12 +4,3 @@ API rule violation: list_type_missing,github.com/kubeflow/training-operator/pkg/
API rule violation: list_type_missing,github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1,PaddleElasticPolicy,Metrics
API rule violation: names_match,github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1,ElasticPolicy,RDZVID
API rule violation: names_match,github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1,PyTorchJobSpec,PyTorchReplicaSpecs
API rule violation: names_match,k8s.io/apimachinery/pkg/apis/meta/v1,APIResourceList,APIResources
API rule violation: names_match,k8s.io/apimachinery/pkg/apis/meta/v1,Duration,Duration
API rule violation: names_match,k8s.io/apimachinery/pkg/apis/meta/v1,InternalEvent,Object
API rule violation: names_match,k8s.io/apimachinery/pkg/apis/meta/v1,InternalEvent,Type
API rule violation: names_match,k8s.io/apimachinery/pkg/apis/meta/v1,MicroTime,Time
API rule violation: names_match,k8s.io/apimachinery/pkg/apis/meta/v1,StatusCause,Type
API rule violation: names_match,k8s.io/apimachinery/pkg/apis/meta/v1,Time,Time
API rule violation: names_match,k8s.io/apimachinery/pkg/runtime,Unknown,ContentEncoding
API rule violation: names_match,k8s.io/apimachinery/pkg/runtime,Unknown,ContentType
9 changes: 0 additions & 9 deletions hack/violation_exception_v2alpha1.list
@@ -1,9 +0,0 @@
API rule violation: names_match,k8s.io/apimachinery/pkg/apis/meta/v1,APIResourceList,APIResources
API rule violation: names_match,k8s.io/apimachinery/pkg/apis/meta/v1,Duration,Duration
API rule violation: names_match,k8s.io/apimachinery/pkg/apis/meta/v1,InternalEvent,Object
API rule violation: names_match,k8s.io/apimachinery/pkg/apis/meta/v1,InternalEvent,Type
API rule violation: names_match,k8s.io/apimachinery/pkg/apis/meta/v1,MicroTime,Time
API rule violation: names_match,k8s.io/apimachinery/pkg/apis/meta/v1,StatusCause,Type
API rule violation: names_match,k8s.io/apimachinery/pkg/apis/meta/v1,Time,Time
API rule violation: names_match,k8s.io/apimachinery/pkg/runtime,Unknown,ContentEncoding
API rule violation: names_match,k8s.io/apimachinery/pkg/runtime,Unknown,ContentType