Build for a Kubernetes-ready Apache Spark Docker image, configured with notable AWS dependencies, including:
- An up-to-date AWS SDK capable of supporting IRSA (IAM Roles for Service Accounts)
- AWS Glue Data Catalog client for Hive Metastore
Builds are managed using Earthly (https://earthly.dev):

    earthly --use-inline-cache +build-spark-image
To use the image in your own Earthfile build:

    my-image:
        FROM github.com/viaduct-ai/docker-spark-k8s-aws+build-spark-image
        # ...
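
As a rough sketch of how the two AWS dependencies listed above are typically switched on at submit time, the snippet below assembles the relevant `--conf` flags and prints the resulting command instead of running it. The class names come from the AWS Glue Data Catalog client and the AWS SDK for Java v1. Whether this image already sets them in its defaults is not stated here, so treat the property list as an assumption. `spark_aws_props` and `<your-application>` are hypothetical placeholders.

```shell
# Hypothetical helper (not part of this project): prints one Spark property per line.
spark_aws_props() {
  cat <<'EOF'
spark.sql.catalogImplementation=hive
spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider
EOF
}

# Turn each property into a "--conf key=value" pair.
# Assumption: property values contain no whitespace, so word splitting is safe.
args=""
for prop in $(spark_aws_props); do
  args="$args --conf $prop"
done

# Print the command rather than executing it; replace the placeholder with a real job.
echo "spark-submit$args <your-application>"
```

With IRSA, the web identity provider reads the `AWS_ROLE_ARN` and `AWS_WEB_IDENTITY_TOKEN_FILE` environment variables that EKS injects into pods running under an annotated service account, so no static credentials need to be passed.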
If you've ever tried building a Spark distribution or image with the AWS Glue Data Catalog client for Hive, you know it's a PITA.
This project aims to open source a working Docker image, built using the amazing Earthly tool, to democratize a more integrated Apache Spark on Kubernetes on AWS experience. It is meant as a stopgap until someone develops a Spark DataSourceV2 API-compliant Glue Data Catalog implementation, which would replace the current absolute hack of patching Hive and building Spark from source.
Many thanks to @bbenzikry for open sourcing their solution to build Spark 3 + Glue compatible docker images. This project builds on their work.