The `workload-standard` Docker image is optimised for running machine learning workloads on Kubernetes. It comes with the following packages pre-installed:
- Apache Spark 3.2.0
- PySpark 3.2.0
- Hadoop GCS Connector
- Hadoop S3 Connector
See `churn_prediction` for a complete example.
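Because the GCS and S3 connectors are pre-installed, PySpark code running in this image can read and write `gs://` and `s3a://` paths directly. Below is a minimal sketch of what a preprocessing script might look like; the bucket and file names are hypothetical.

```python
from pyspark.sql import SparkSession

# On Bedrock, the Spark master, Kubernetes settings, and resources are
# supplied by the spark-submit conf in bedrock.hcl (see below).
spark = SparkSession.builder.appName("preprocess").getOrCreate()

# The pre-installed Hadoop connectors let Spark resolve cloud storage URIs.
df = spark.read.parquet("gs://my-bucket/raw/events")  # hypothetical path

# A trivial transformation, then write the result back to cloud storage.
df.dropna().write.mode("overwrite").parquet("gs://my-bucket/processed/events")

spark.stop()
```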
To train a model using Spark on Bedrock, you will need to create a `bedrock.hcl` file with the `spark-submit` directive. For example:
```hcl
// Refer to https://docs.basis-ai.com/guides/writing-files/bedrock.hcl for more details.
version = "1.0"

train {
  step train {
    image = "quay.io/basisai/workload-standard:v0.3.4"
    install = [
      "pip install -r requirements.txt",
    ]
    script = [
      {
        spark-submit {
          script = "preprocess.py"
          conf {
            spark.kubernetes.container.image = "quay.io/basisai/workload-standard:v0.3.4"
            spark.kubernetes.pyspark.pythonVersion = "3"
            spark.driver.memory = "4g"
            spark.driver.cores = "2"
            spark.executor.instances = "2"
            spark.executor.memory = "4g"
            spark.executor.cores = "2"
            spark.memory.fraction = "0.5"
            spark.sql.parquet.compression.codec = "gzip"
            spark.hadoop.fs.AbstractFileSystem.gs.impl = "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
            spark.hadoop.google.cloud.auth.service.account.enable = "true"
          }
        }
      },
    ]

    resources {
      cpu = "0.5"
      memory = "1G"
    }
  }
}
```
The `step` stanza specifies a single training step to be run. Multiple steps are allowed, but they must have unique names. Additionally, you may pass in environment variables and secrets to all steps in the `train` stanza, as sketched below. Refer to our documentation for a complete list of supported parameters.
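For instance, values declared at the `train` level are shared by every step. The sketch below assumes `parameters` and `secrets` keys as described in the bedrock.hcl documentation; check the docs for the exact names supported by your version.

```hcl
train {
  step train {
    // ...as above
  }

  // Assumed syntax: plain-text parameters exposed as environment variables.
  parameters {
    EXECUTION_DATE = "2021-01-01"
  }

  // Assumed syntax: secret names whose values are injected at run time.
  secrets = [
    "DB_PASSWORD",
  ]
}
```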
The main Dockerfile downloads a pre-built Spark binary and unzips it to `/opt/spark`. To upgrade to a new version, simply bump the `SPARK_VERSION` and `HADOOP_VERSION` environment variables to match one of the pre-built packages currently distributed by Apache.
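The download step looks roughly like the following; this is an illustrative sketch rather than the repository's exact Dockerfile, and the URL layout follows Apache's archive convention.

```dockerfile
# Illustrative sketch, not the repository's exact Dockerfile.
ENV SPARK_VERSION=3.2.0 \
    HADOOP_VERSION=3.2

# Fetch the matching pre-built package from the Apache archive and
# unpack it to /opt/spark.
RUN curl -fsSL "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" \
      | tar -xz -C /opt \
 && mv "/opt/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}" /opt/spark
```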
Additional dependencies are specified in `pom.xml` so that Maven can help resolve transitive dependencies. These include connectors for distributed filesystems, such as the Hadoop GCS and S3 connectors.
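As a sketch, the relevant entries look something like this; the version numbers are illustrative, not necessarily what the image ships.

```xml
<!-- Illustrative excerpt; versions are examples only. -->
<dependencies>
  <!-- Hadoop GCS connector -->
  <dependency>
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>gcs-connector</artifactId>
    <version>hadoop3-2.2.2</version>
  </dependency>
  <!-- Hadoop S3 connector -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>3.2.0</version>
  </dependency>
</dependencies>
```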
A sanity check can be done by running `(cd test && ./test.sh)`. This verifies that an upgraded image has no obvious issues or compatibility problems with the latest versions of common Python packages.