Extract Spark configuration, secrets, init_scripts and libraries into the config #318
Conversation
/test
🤖 pr-bot 🤖 🏃 Running tests: https://github.com/UCLH-Foundry/FlowEHR/actions/runs/5788025632 (in response to this comment from @tanya-borisova)
config.sample.yaml
Outdated
```yaml
databricks_libraries: # Optional
  pypi:
    - package: opencensus-ext-azure==1.1.9
    - package: opencensus-ext-logging==0.1.1
      repo: "custom-mirror"
  maven:
    - coordinates: "com.amazon.deequ:deequ:1.0.4"
      repo: "custom-mirror"
      exclusions: ["org.apache.avro:avro"]
  cran:
    - package: "rkeops"
      repo: "my-awesome-repo"
  whl:
    - "dbfs:/FileStore/baz.whl"
  egg:
    - "dbfs:/FileStore/foo.egg"
  jar:
    - "dbfs:/FileStore/app-0.0.1.jar"
```
Wanted to check whether these currently work or whether they are stubs. From what I can tell, only pypi is being tested as working in a deployment.
These should all work, yes: I have tested that the configuration for all of them is propagated to the cluster.
We currently only use pypi (with plans to use Maven as well), but I felt it would be better to add all of the supported library types now, in case they are needed in the future.
Seems reasonable overall. My only concern is a user expecting these to be used out of the box. I'd be tempted to add a comment so it's clear the config values aren't used as of yet, or to keep just pypi and maven, since those are the ones we have specific plans for.
@jjgriff93 @stefpiatek Re-requesting your reviews please, as I've made significant changes (added support for init_scripts and databricks_secrets)
/test
🤖 pr-bot 🤖 🏃 Running tests: https://github.com/UCLH-Foundry/FlowEHR/actions/runs/5799823200 (in response to this comment from @tanya-borisova)
/test
🤖 pr-bot 🤖 🏃 Running tests: https://github.com/UCLH-Foundry/FlowEHR/actions/runs/5799903763 (in response to this comment from @tanya-borisova)
/test
🤖 pr-bot 🤖 🏃 Running tests: https://github.com/UCLH-Foundry/FlowEHR/actions/runs/5800044005 (in response to this comment from @tanya-borisova)
/test
🤖 pr-bot 🤖 🏃 Running tests: https://github.com/UCLH-Foundry/FlowEHR/actions/runs/5800447936 (in response to this comment from @tanya-borisova)
Great work, thanks for adding the secrets and Maven stuff!
config.infra-test.yaml
Outdated
```yaml
- key: spark.databricks.cluster.profile
  value: singleNode
```
nit: could simplify and use a YAML map instead, like so:
```yaml
spark_config:
  spark.databricks.cluster.profile: singleNode
  another_key: another_value
```
config.sample.yaml
Outdated
```yaml
  - key: spark.master
    value: local[*]
databricks_secrets: # Optional
  - key: cog_services_key
```
Same suggestion as the previous comment regarding the key/value pairs.
/test
🤖 pr-bot 🤖 🏃 Running tests: https://github.com/UCLH-Foundry/FlowEHR/actions/runs/5806622753 (in response to this comment from @tanya-borisova)
Looks good, thanks! My opinions aren't strongly held on adding extra config that isn't currently being used, so I'll leave that up to you.
@stefpiatek I am happy to remove these; as you said, they aren't currently used. I've kept the …
/test
🤖 pr-bot 🤖 🏃 Running tests: https://github.com/UCLH-Foundry/FlowEHR/actions/runs/5807523748 (in response to this comment from @tanya-borisova)
/test
🤖 pr-bot 🤖 🏃 Running tests: https://github.com/UCLH-Foundry/FlowEHR/actions/runs/5809501246 (in response to this comment from @tanya-borisova)
/test
🤖 pr-bot 🤖 🏃 Running tests: https://github.com/UCLH-Foundry/FlowEHR/actions/runs/5809625616 (in response to this comment from @tanya-borisova)
What is being addressed
Extract Spark configuration and libraries into a separate section of the config.
How is this addressed
- Add a `spark_config` section to the config and move the single-node configuration there
- Add a `databricks_libraries` section to the config and propagate all types of libraries supported by the Databricks Terraform provider: https://registry.terraform.io/providers/databricks/databricks/0.4.2/docs/resources/cluster#library-configuration-block
- Add a `databricks_cluster` section to be able to configure the size of the cluster
- Add an `init_scripts` section to the config and upload a DBFS file from the local filesystem
- Add a `databricks_secrets` section to the config and create corresponding secrets in the secret scope
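Assembled from the snippets quoted in this PR, the new sections might sit together in a config file roughly like this. This is a hedged sketch, not the actual FlowEHR schema: the `init_scripts` entry format and the exact top-level nesting are assumptions, and only the keys shown in the diffs above are taken from the PR itself.

```yaml
# Sketch of the new config sections (nesting and init_scripts format assumed)
spark_config: # key/value pairs applied to the Databricks cluster
  - key: spark.databricks.cluster.profile
    value: singleNode
  - key: spark.master
    value: local[*]
databricks_secrets: # Optional; created in the Databricks secret scope
  - key: cog_services_key
init_scripts: # Optional; local files uploaded to DBFS (path format assumed)
  - "./init_scripts/setup.sh"
databricks_libraries: # Optional; propagated via the Terraform provider's library blocks
  pypi:
    - package: opencensus-ext-azure==1.1.9
```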