add spark_conf_string to thrift connection type #591
Conversation
Thanks for your pull request, and welcome to our community! We require contributors to sign our Contributor License Agreement, and we don't seem to have your signature on file. Check out this article for more information on why we have a CLA. In order for us to review and merge your code, please submit the Individual Contributor License Agreement form attached above. If you have questions about the CLA, or if you believe you've received this message in error, don't hesitate to ping @drewbanin. CLA has not been signed by users: @vinhnemo
Hi @vinhnemo, thanks for taking the time to contribute! Your approach is to create a new configuration string; I have a preference for the existing `server_side_parameters` dict instead.
It seems that this solution has a disadvantage: it disallows using an environment variable to set the configuration.
Hi @Fleid, just to be clear: my purpose is to be able to change the Spark configuration via profiles.yml when connecting through a Thrift connection. These configs will be adjusted according to the dbt models. To avoid change detection on profiles.yml triggering a full re-parse (docs), the simplest way is to pass these configs in from environment variables through Jinja (ref). With the server_side_parameters approach, the data type is a dict, and it is difficult to bypass the re-parse limitation if we want to use the environment variable mentioned above. These specific requirements led me to create another parameter with the string data type.
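For context, a minimal sketch of what such a profile could look like. The profile name, connection details, and the `SPARK_CONFIG_STRING` variable are illustrative assumptions; `spark_conf_string` is the field proposed in this PR:

```yaml
# Hypothetical profiles.yml sketch; names and values are illustrative.
spark_profile:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      host: localhost
      port: 10000
      schema: analytics
      # Injected from an environment variable, so changing it between runs
      # does not modify profiles.yml and trigger a full re-parse.
      spark_conf_string: "{{ env_var('SPARK_CONFIG_STRING') }}"
```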
```diff
@@ -74,6 +74,7 @@ class SparkCredentials(Credentials):
     connect_timeout: int = 10
     use_ssl: bool = False
     server_side_parameters: Dict[str, Any] = field(default_factory=dict)
+    spark_conf_string: Optional[str] = None
```
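As a sketch of how such a string might be consumed — assuming, for illustration, a semicolon-separated `key=value` format (an assumption on my part; the PR itself defines the actual format) — it could be parsed into a dict before being handed to the Thrift session:

```python
from typing import Dict

def parse_spark_conf_string(conf_string: str) -> Dict[str, str]:
    """Parse 'k1=v1;k2=v2' into {'k1': 'v1', 'k2': 'v2'}."""
    conf: Dict[str, str] = {}
    for pair in conf_string.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate trailing or doubled semicolons
        key, _, value = pair.partition("=")
        conf[key.strip()] = value.strip()
    return conf

# Example:
# parse_spark_conf_string("spark.executor.memory=4g;spark.executor.cores=2")
# -> {'spark.executor.memory': '4g', 'spark.executor.cores': '2'}
```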
Why does this need to be a string? Can this be a dict? Doing so would allow us to escape special characters in the configuration.
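To make the escaping concern concrete (my example, not from the thread): with a flat string format, a value that itself contains the separator is ambiguous, while a dict carries it safely:

```python
# A value containing the pair separator breaks a naive string format:
conf_string = "spark.hadoop.fs.s3a.secret.key=abc;123;spark.executor.cores=2"
# Splitting on ";" yields ["spark.hadoop.fs.s3a.secret.key=abc", "123", ...] --
# the secret is truncated and "123" is not a valid key=value pair.

# A dict has no such ambiguity:
server_side_parameters = {
    "spark.hadoop.fs.s3a.secret.key": "abc;123",
    "spark.executor.cores": "2",
}
```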
Just to be clear, my purpose is to be able to change the Spark configuration via profiles.yml when connecting through a Thrift connection. These configs will be adjusted according to the dbt models.
To avoid change detection on profiles.yml triggering a full re-parse (docs), the simplest way is to pass these configs in from environment variables through Jinja (ref):

```yaml
spark_conf_string: "{{ env_var('SPARK_CONFIG_STRING') }}"
```

With the server_side_parameters approach, the data type is a dict, and it is difficult to bypass the re-parse limitation if we want to use the environment variable mentioned above.
These specific requirements led me to create another parameter with the string data type.
Hi @colin-rogers-dbt,
As I mentioned in the previous comment, the main reason is to prevent a full re-parse when we need to change the Spark configuration via the dbt profile between consecutive runs.
my bad, should have made the connection.
If the goal is to allow for per-model configuration, I wonder if this would be better put in the model or schema configuration.
Ok, so we want the parameter to be a string so it can be injected via an environment variable. Now, to follow up on Colin's ask: do you mean that you're targeting subsets of the DAG while running dbt, changing the value of that string in between calls?
The use case our team ends up with:
Thanks @colin-rogers-dbt and @Fleid for the replies. The configs for Spark that I want to pass in are at the connection level. Each dbt model runs on an independent Spark engine, and we will override this variable to allocate resources and initialize the Spark engine for each dbt-spark model each time it runs. Each Spark engine is spawned to serve one dbt-spark model and shut down when that model is done.
Ok, that's what I was wondering about. This feels like the virtual warehouse overriding for Snowflake: you need a couple of models processed with additional resources, and really the expression of the capacity you need should be a model-level configuration. When a model has a different warehouse configured, dbt-snowflake swaps it in before the model runs and restores the default afterwards:

```python
def pre_model_hook(self, config: Mapping[str, Any]) -> Optional[str]:
    default_warehouse = self.config.credentials.warehouse
    warehouse = config.get("snowflake_warehouse", default_warehouse)
    if warehouse == default_warehouse or warehouse is None:
        return None
    previous = self._get_warehouse()
    self._use_warehouse(warehouse)
    return previous

def post_model_hook(self, config: Mapping[str, Any], context: Optional[str]) -> None:
    if context is not None:
        self._use_warehouse(context)
```

I'm wondering if we couldn't do something similar here? What I'm not sure about is the life cycle of sessions/connections in dbt-spark (and in dbt at all, to be honest).
Closing per @JCZuurmond.
resolves #590
Description
Checklist
- I have run `changie new` to create a changelog entry