Error in passing metadata to DataprocClusterCreateOperator #16911
@pateash - would you be willing to provide a PR fixing it?
Sure.
Found a workaround though:

path = f"gs://goog-dataproc-initialization-actions-{self.cfg.get('region')}/python/pip-install.sh"
cluster_config = ClusterGenerator(
    project_id=self.cfg.get('project_id'),
    ...........
    init_actions_uris=[path],
    metadata={'PIP_PACKAGES': 'pyyaml requests pandas openpyxl'},
    properties=properties,
    dag=dag
).make()
return DataprocClusterCreateOperator(
    task_id='create_dataproc_cluster',
    cluster_name='my-cluster',
    project_id=self.cfg.get('project_id'),
    region=self.cfg.get('region'),
    cluster_config=cluster_config,
    dag=dag
)

@turbaszek @mik-laj @potiuk, I wasn't able to find an example of how to use the new metadata field (changed from Dict to Sequence[Tuple[str, str]]).
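For reference, the new operator-level metadata argument appears to be additional metadata attached to the API request itself rather than cluster/GCE metadata, which is why it expects Sequence[Tuple[str, str]]. A minimal sketch with placeholder values (DataprocCreateClusterOperator is the 2.x name for DataprocClusterCreateOperator):

from airflow.providers.google.cloud.operators.dataproc import DataprocCreateClusterOperator

create_cluster = DataprocCreateClusterOperator(
    task_id='create_dataproc_cluster',
    cluster_name='my-cluster',
    project_id='my-project',        # placeholder
    region='us-central1',           # placeholder
    cluster_config=cluster_config,  # e.g. the ClusterGenerator(...).make() result above
    # Request metadata for the underlying API call must be (key, value) tuples,
    # not a dict; cluster metadata such as PIP_PACKAGES stays in ClusterGenerator.
    metadata=[('example-key', 'example-value')],  # illustrative only
)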
Hi all, I am also facing a similar issue: {ValueError} metadata was invalid: [('PIP_PACKAGES', 'splunk-sdk==1.6.14 google-cloud-storage==1.31.2 pysmb google-cloud-secret-manager gcsfs'), ('x-goog-api-client', 'gl-python/3.7.7 grpc/1.41.0 gax/1.31.3 gccl/airflow_v2.1.0'), ('x-goog-api-client', 'gl-python/3.7.7 grpc/1.41.0 gax/1.31.3 gccl/airflow_v2.1.0')]
@rajaprakash91 Are you also upgrading from Airflow 1.10.x to 2.x?
@potiuk I think it doesn't make sense to add a workaround in the code, as the older metadata was a Dict and the newer argument is a Sequence[Tuple] type.
@pateash Sorry, I just saw your reply. Yes, I was trying to upgrade Airflow from 1.10.x to 2.x. So what is the fix? I saw a merge request.
@rajaprakash91 You can generate the cluster config with the same arguments you have been using, via ClusterGenerator(...).make(), and then pass it to the operator:

path = f"gs://goog-dataproc-initialization-actions-{self.cfg.get('region')}/python/pip-install.sh"
cluster_config = ClusterGenerator(
    project_id=self.cfg.get('project_id'),
    ...........
    init_actions_uris=[path],
    metadata={'PIP_PACKAGES': 'pyyaml requests pandas openpyxl'},
    properties=properties,
    dag=dag
).make()
return DataprocClusterCreateOperator(
    task_id='create_dataproc_cluster',
    cluster_name='my-cluster',
    project_id=self.cfg.get('project_id'),
    region=self.cfg.get('region'),
    cluster_config=cluster_config,
    dag=dag
)
Oh Nice. Thank you so much. This is very useful.
Hi, sorry to write here but I didn't find another place discussing this. I am using Version: 2.1.4+composer and I have a DAG where I defined the DataprocClusterCreateOperator. I passed the metadata as a sequence of tuples, as I read here, since using a dict is not working. Also, the metadata is not being rendered in the cluster_config. @pateash could you please explain your workaround in more detail? In what part of the DAG should I use it? Thanks in advance.
@nicolas-settembrini No problem, you just have to generate the config from all those arguments and then pass it to the DataprocClusterCreateOperator.
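A minimal end-to-end sketch of that approach inside a DAG file, with placeholder project, region, and DAG id (DataprocCreateClusterOperator is the 2.x name for DataprocClusterCreateOperator):

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    ClusterGenerator,
    DataprocCreateClusterOperator,
)

with DAG(
    dag_id='dataproc_cluster_example',   # placeholder
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Build the cluster config first; PIP_PACKAGES metadata goes here as a dict.
    cluster_config = ClusterGenerator(
        project_id='my-project',         # placeholder
        metadata={'PIP_PACKAGES': 'pyyaml requests pandas openpyxl'},
        init_actions_uris=[
            'gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh'
        ],
    ).make()

    # Then pass the generated config to the operator.
    create_cluster = DataprocCreateClusterOperator(
        task_id='create_dataproc_cluster',
        cluster_name='my-cluster',
        project_id='my-project',         # placeholder
        region='us-central1',            # placeholder
        cluster_config=cluster_config,
    )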
Hi,
I am facing some issues installing pip packages on the Dataproc cluster using an initialization script.
I am trying to upgrade to Airflow 2.0 from 1.10.12 (where this code works fine).
[2021-07-09 11:35:37,587] {taskinstance.py:1454} ERROR - metadata was invalid: [('PIP_PACKAGES', 'pyyaml requests pandas openpyxl'), ('x-goog-api-client', 'gl-python/3.7.10 grpc/1.35.0 gax/1.26.0 gccl/airflow_v2.0.0+astro.3')
Apache Airflow version:
airflow_v2.0.0
What happened:
I am trying to migrate our codebase from Airflow v1.10.12. On deeper analysis I found that, as part of the refactoring in #6371, we can no longer pass metadata in DataprocClusterCreateOperator(), as it is not being passed to the ClusterGenerator() method.
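For comparison, a minimal sketch of the 1.10-style call that now triggers the error above, with placeholder project and region (shown with the 2.x class name):

from airflow.providers.google.cloud.operators.dataproc import DataprocCreateClusterOperator

# On 1.10.12 the metadata dict was routed into the generated cluster config.
# On 2.0 it is handed to the Dataproc API client as request metadata instead,
# which fails at execution time with "metadata was invalid".
create_cluster = DataprocCreateClusterOperator(
    task_id='create_dataproc_cluster',
    cluster_name='my-cluster',
    project_id='my-project',     # placeholder
    region='us-central1',        # placeholder
    metadata={'PIP_PACKAGES': 'pyyaml requests pandas openpyxl'},
)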
What you expected to happen:
Operator should work as before.