# Instructions for setting up a Spark Cluster for the TextReuse ETL pipeline on CSC Rahti
- Create a project on CSC Rahti
- Add a `spark-credentials` secret with:
  - `username`: the username for Spark
  - `password`: the password for Spark and the Jupyter Lab login
  - `nbpassword`: the password used internally by Jupyter
- Install OpenShift CLI and Helm on local machine
- Create a `values.yaml` following the `values-template.yaml`
- Log into OpenShift project by getting login command from Rahti
- Run `helm install spark-cluster all-spark`
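
The steps above can be sketched as a command sequence. The secret key names come from the list above; the token, server URL, project name, and credential values are placeholders to fill in from your own Rahti console:

```shell
# Log into the OpenShift cluster (copy the actual login command from the
# Rahti web console; token and server below are placeholders)
oc login --token=<your-token> --server=<rahti-api-server-url>

# Switch to the project created for the pipeline
oc project <your-project>

# Create the spark-credentials secret with the three keys described above
oc create secret generic spark-credentials \
  --from-literal=username=<spark-username> \
  --from-literal=password=<spark-and-jupyter-password> \
  --from-literal=nbpassword=<jupyter-internal-password>

# Create your values.yaml from the template, then edit it as needed
cp values-template.yaml values.yaml

# Install the chart (if your values.yaml lives outside the chart directory,
# you may need to pass it explicitly with -f values.yaml)
helm install spark-cluster all-spark

# Check that the pods come up
oc get pods
```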
Create a key for GitHub SSH in the persistent volume of the spark-notebook service. Then, in the `values-template.yaml`, add the location of this SSH key so that it is added to the SSH ConfigMap defined in `configmap.yaml`. When the notebook pod starts up, run `mkdir -p ~/.ssh && cp /etc/ssh-config/config ~/.ssh/config` to copy the SSH config file from the ConfigMap to the correct location.
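
A minimal sketch of the SSH key setup described above. The key path and the `Host` entry are assumptions; adjust them to where your persistent volume is mounted and to what `configmap.yaml` actually expects:

```shell
# Generate an SSH key pair on the persistent volume of the spark-notebook
# service (the path below is an assumption; use your actual mount point)
ssh-keygen -t ed25519 -f /home/jovyan/work/.keys/id_ed25519 -N ""

# Add the public key (id_ed25519.pub) to your GitHub account or as a deploy
# key, and point the SSH ConfigMap in configmap.yaml at the private key, e.g.:
#
#   Host github.com
#     IdentityFile /home/jovyan/work/.keys/id_ed25519
#
# Then, inside the freshly started notebook pod, copy the config into place:
mkdir -p ~/.ssh && cp /etc/ssh-config/config ~/.ssh/config
```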