This project demonstrates how to build a data platform that is scalable by design. When the volume of data increases, the number of nodes and partitions/shards can easily be increased.
- Access to a Kubernetes cluster
- Nginx ingress controller deployed
- Docker
- kubectl
- Helm
# Create cluster with 4 worker nodes
kind create cluster --name kind-dataplatform --config=kind.yaml
# Install nginx ingress controller
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/kind/deploy.yaml
Use cluster:
kubectl config use-context kind-kind-dataplatform
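The contents of `kind.yaml` are not shown here. As a rough sketch, a kind config with four workers and a control-plane node prepared for the ingress controller might be generated as follows. The helper name `write_kind_config` and the output file name are hypothetical. The node label and port mappings follow the kind ingress guide and are an assumption, not a copy of this repository's file.

```shell
# Hypothetical helper (not part of this repository): writes a kind cluster
# config with one control-plane node and N worker nodes to the given file.
write_kind_config() {
  workers="$1"
  out="$2"
  {
    # Assumed layout: control-plane labelled ingress-ready, ports 80/443 mapped.
    cat <<'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  kubeadmConfigPatches:
  - |
    kind: InitConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        node-labels: "ingress-ready=true"
  extraPortMappings:
  - containerPort: 80
    hostPort: 80
    protocol: TCP
  - containerPort: 443
    hostPort: 443
    protocol: TCP
EOF
    # Append one worker entry per requested node.
    i=0
    while [ "$i" -lt "$workers" ]; do
      echo "- role: worker"
      i=$((i + 1))
    done
  } > "$out"
}
```

For example, `write_kind_config 4 kind-example.yaml` produces a file that could be passed to `kind create cluster` via `--config=kind-example.yaml`. Raising the worker count is then a one-argument change.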
A Kubernetes Operator deploys workloads based on custom resources, whose schema is defined by a Custom Resource Definition (CRD). Updates to these resources are also handled by the operator.
The operators used here are installed once, cluster-wide, rather than per application.
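To make the operator idea concrete, here is a minimal sketch of a custom resource that the ClickHouse operator could act on. The helper `render_clickhouse_installation`, the resource name, and the cluster name `demo` are hypothetical and not taken from this project's chart. The field names follow the Altinity `ClickHouseInstallation` layout as I understand it, so check them against the operator documentation before relying on them.

```shell
# Hypothetical helper: prints a ClickHouseInstallation manifest to stdout.
# The names used here are illustrative, not taken from this project's chart.
render_clickhouse_installation() {
  name="$1"
  shards="$2"
  cat <<EOF
apiVersion: clickhouse.altinity.com/v1
kind: ClickHouseInstallation
metadata:
  name: ${name}
spec:
  configuration:
    clusters:
      - name: demo
        layout:
          shardsCount: ${shards}
          replicasCount: 1
EOF
}
```

Once the operator is installed, `render_clickhouse_installation demo-chi 2 | kubectl apply -f -` would ask it for a two-shard cluster. Applying the same resource again with a larger shard count is the kind of update the operator is expected to reconcile.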
# Install Altinity ClickHouse Operator
kubectl apply -f https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator/clickhouse-operator-install-bundle.yaml
# Install Strimzi Kafka operator
helm repo add strimzi https://strimzi.io/charts/
helm install strimzi-kafka-operator strimzi/strimzi-kafka-operator
# Install CloudNativePG PostgreSQL Operator
kubectl apply -f https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.20/releases/cnpg-1.20.0.yaml
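Before deploying anything that depends on the operators, it can help to confirm that their CRDs are registered. The sketch below is one possible check. The helper name is hypothetical and the three CRD names are assumptions based on each project's usual naming, so confirm them with `kubectl get crd`. It only checks that the CRDs are established, not that the operator pods are ready.

```shell
# Hypothetical helper: waits until each operator's main CRD is registered.
# CRD names are assumptions; confirm them with `kubectl get crd`.
wait_for_operator_crds() {
  for crd in \
    clickhouseinstallations.clickhouse.altinity.com \
    kafkas.kafka.strimzi.io \
    clusters.postgresql.cnpg.io
  do
    # Stop at the first CRD that does not become established in time.
    kubectl wait --for=condition=Established --timeout=60s "crd/${crd}" || return 1
  done
}
```

Running `wait_for_operator_crds` after the install commands returns non-zero if any of the three CRDs is missing, which makes it usable as a guard in a setup script.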
The JupyterLab image is based on the datascience-notebook image. It comes with default notebooks and installs the required dependencies.
docker build ./jupyterlab -t dataplatform-jupyterlab:latest
kind load docker-image dataplatform-jupyterlab:latest --name kind-dataplatform
docker build ./setup-data -t setup-data:latest
kind load docker-image setup-data:latest --name kind-dataplatform
docker build ./data-generator -t data-generator:latest
kind load docker-image data-generator:latest --name kind-dataplatform
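The three build-and-load pairs above follow the same pattern, so they could be wrapped in a small helper. This is only a sketch: the function names and the `CLUSTER_NAME` variable are hypothetical, while the directories, tags, and cluster name are the ones used in the commands above.

```shell
# Cluster name used by `kind load`; defaults to the name used in this README.
CLUSTER_NAME="${CLUSTER_NAME:-kind-dataplatform}"

# Hypothetical helper: builds one image and loads it into the kind cluster.
build_and_load() {
  dir="$1"
  tag="$2"
  docker build "$dir" -t "$tag" &&
    kind load docker-image "$tag" --name "$CLUSTER_NAME"
}

# Builds and loads all three images, stopping at the first failure.
build_all() {
  build_and_load ./jupyterlab dataplatform-jupyterlab:latest &&
    build_and_load ./setup-data setup-data:latest &&
    build_and_load ./data-generator data-generator:latest
}
```

Calling `build_all` from the repository root is meant to be equivalent to running the six commands by hand, with the difference that it stops as soon as one build or load fails.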
This demo is contained in a Helm chart.
helm dependency build ./dataplatform-chart
helm upgrade --install dataplatform ./dataplatform-chart --set jupyter.image=dataplatform-jupyterlab:latest
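After the release is installed, its state can be inspected with standard Helm and kubectl commands. The wrapper below is a hypothetical convenience, not part of the chart. It assumes the release lives in the current namespace, as in the install command above.

```shell
# Hypothetical helper: prints the Helm release status and the pods in the
# current namespace so the rollout can be inspected.
show_release_state() {
  release="${1:-dataplatform}"
  helm status "$release" && kubectl get pods
}
```

Running `show_release_state` with no argument checks the `dataplatform` release. Pods that stay in `Pending` or `ImagePullBackOff` often point to an image that was not loaded into the kind cluster.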