Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

allow multiple elements in impersonation chain #35694

Merged
merged 2 commits into from
Nov 25, 2023

Conversation

melugoyal
Copy link
Contributor

@melugoyal melugoyal commented Nov 17, 2023

#33715 added the impersonation_chain parameter to Google Cloud connections, but didn't allow any way to set multiple values for the impersonation chain. with this change, users can specify a comma-separated list of service accounts to be used in the impersonation chain


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added area:providers kind:documentation provider:google Google (including GCP) related issues labels Nov 17, 2023
@potiuk
Copy link
Member

potiuk commented Nov 17, 2023

needs static checks fixes

@melugoyal melugoyal force-pushed the connection-support-impersonation-chain branch from 9aa4bc6 to ffae40a Compare November 17, 2023 17:03
@potiuk
Copy link
Member

potiuk commented Nov 17, 2023

cc: @VladaZakharova and team -> do you think it's a safe change (in terms of , being always recognized as separator)

@VladaZakharova
Copy link
Contributor

VladaZakharova commented Nov 17, 2023

Hi!
As i understand, to set multiple SA with current implementation in impersonation_chain parameter you should pass them as a list of strings. What new behaviour will add suggested implementation?

@melugoyal
Copy link
Contributor Author

melugoyal commented Nov 17, 2023

@VladaZakharova essentially there's a bug in the current implementation where there is no actual way to pass in a list of strings via the connection (since impersonation_chain is defined as a StringField). this PR enables that by expecting a comma-separated list of strings, then splitting them into a list when instantiating the hook

@melugoyal melugoyal force-pushed the connection-support-impersonation-chain branch from ffae40a to e75544a Compare November 18, 2023 00:51
@melugoyal melugoyal force-pushed the connection-support-impersonation-chain branch from e75544a to 4dbc17e Compare November 18, 2023 01:32
@VladaZakharova
Copy link
Contributor

Could you please provide your test using this operator where you are using multiple SA in impersonation_chain parameter as long as the results of this run? Thank you!

@melugoyal
Copy link
Contributor Author

melugoyal commented Nov 20, 2023

@VladaZakharova imo the added unit tests should be sufficient, given it's just extending the current functionality already in the provider

@VladaZakharova
Copy link
Contributor

I just want to be sure that previous functionality really doesn't work without your changes, since it is quite strange. That is why i have kindly asked you if you can provide system test that you have used to check this functionality.
@eladkal Can you please also take a look?

@potiuk
Copy link
Member

potiuk commented Nov 21, 2023

Yes. My question is also - what's the problem with keeping the sequence as before? WHY do you think using array is not good enough @melugoyal ? Is there a particular reason you want to introduce coma separated string that having array does not allow? Can you show an example case where this is problematic?

@melugoyal
Copy link
Contributor Author

@VladaZakharova @potiuk please correct me if i'm wrong but i don't see how you can specify a list of strings when configuring a connection through the Airflow UI for a connection field that is a string. this change allows the user to specify a comma-separated list of strings in the Airflow UI for the impersonation_chain argument, and upon instantiating the hook we detect a comma-separated string for that argument and convert it to a list of strings.

@potiuk
Copy link
Member

potiuk commented Nov 22, 2023

@VladaZakharova @potiuk please correct me if i'm wrong but i don't see how you can specify a list of strings when configuring a connection through the Airflow UI for a connection field that is a string. this change allows the user to specify a comma-separated list of strings in the Airflow UI for the impersonation_chain argument, and upon instantiating the hook we detect a comma-separated string for that argument and convert it to a list of strings.

Correct me if I am wrong - but I think you cannot do it with your change either? The impersonation_chain can only be passed by Hook / Operator init parameters, not through connection extra (which is what you are suggesting) ?

And it is defined like this everywhere in those:

        impersonation_chain: str | Sequence[str] | None = None,

But.maybe I am missing something ??

@potiuk
Copy link
Member

potiuk commented Nov 22, 2023

IMHO You'd need to implement quite a lot more changes to allow impersonation chain to be passed by extra (including fields customisations) - and I am not even sure if it is a good idea (I'd revert it to @VladaZakharova and others from Google to have a say on that).

@melugoyal
Copy link
Contributor Author

@potiuk not sure i understand. as mentioned in the PR description, #33715 added the ability to specify impersonation_chain via the extras in the Google Cloud Connection

@potiuk
Copy link
Member

potiuk commented Nov 22, 2023

@potiuk not sure i understand. as mentioned in the PR description, #33715 added the ability to specify impersonation_chain via the extras in the Google Cloud Connection

Ah.. I see - I completely forgotten about the other PR :)

You have to remember that when you look at multiple 10s of PRs a day it's hard to remember that something have been mentioned at description and you might simply not realise that when you look at just the code but have not re-read all the description from the beginning (I understand for you might be obvious because those are the only two PRs you are looking at.

With this context - it is clear what you are after. Now I would refer to @VladaZakharova and team to asy if having impersonation in extra seem like good idea or not - becasue really this is what should be decided (when it is, then indeed this one makes sense).

@VladaZakharova
Copy link
Contributor

Hi!
Thank you for the clarification.
These changes seem to be affecting existing functionality since we didn't properly check the changes from #33715 , so could you please create a simple DAG to check 2 cases:

  1. setup impersonation chain in connection, setup impersonation chain in operator => the impersonation chain from the operator should be used;
  2. setup impersonation chain in connection, don't setup impersonation chain in operator => the impersonation chain from the connection should be used.
    It would be nice to check different types of input impersonation_chain.
    Please, share the results here in PR. Thanks!

@melugoyal
Copy link
Contributor Author

melugoyal commented Nov 22, 2023

test DAG:

from airflow.providers.google.cloud.operators.gcs import GCSCreateBucketOperator
from airflow import DAG
from datetime import datetime

with DAG(
    dag_id="gcp-test", schedule="@once", start_date=datetime(2023, 1, 1), is_paused_upon_creation=False, catchup=False
) as dag:
    GCSCreateBucketOperator(
        task_id="gcscreate",
        bucket_name="test-mehulg",
        gcp_conn_id="gcp_test",
        impersonation_chain="test-sa-in-code@project.iam.gserviceaccount.com",
    )

with connection gcp_test specified without the impersonation_chain set: File "/usr/local/lib/python3.11/site-packages/google/auth/impersonated_credentials.py", line 100, in _make_iam_token_request raise exceptions.RefreshError(_REFRESH_ERROR, response_body) google.auth.exceptions.RefreshError: (\'Unable to acquire impersonated credentials\', \'{\ "error": {\ "code": 404,\ "message": "Not found; Gaia id not found for email test-sa-in-code@project.iam.gserviceaccount.com",\ "status": "NOT_FOUND"\ }\ }\ \')
so it attempted to use test-sa-in-code@project.iam.gserviceaccount.com from the DAG code. note the error is expected, since that is a fake service account and i'm not actually setting up any IAM policies

with impersonation_chain in connection gcp_test set to test-sa-in-connection@project.iam.gserviceaccount.com
image
File "/usr/local/lib/python3.11/site-packages/google/auth/impersonated_credentials.py", line 100, in _make_iam_token_request raise exceptions.RefreshError(_REFRESH_ERROR, response_body) google.auth.exceptions.RefreshError: (\'Unable to acquire impersonated credentials\', \'{\ "error": {\ "code": 404,\ "message": "Not found; Gaia id not found for email test-sa-in-code@project.iam.gserviceaccount.com",\ "status": "NOT_FOUND"\ }\ }\ \'), same error, indicating that as expected the impersonation_chain from the code takes precedence.

now redeployed the DAG removing the impersonation_chain specified in code:
File "/usr/local/lib/python3.11/site-packages/google/auth/impersonated_credentials.py", line 100, in _make_iam_token_request raise exceptions.RefreshError(_REFRESH_ERROR, response_body) google.auth.exceptions.RefreshError: (\'Unable to acquire impersonated credentials\', \'{\ "error": {\ "code": 404,\ "message": "Not found; Gaia id not found for email test-sa-in-connection@project.iam.gserviceaccount.com",\ "status": "NOT_FOUND"\ }\ }\ \')
as expected, it attempted to use the service account specified in the connection.

with impersonation chain in gcp_test updated to test-first-sa-in-connection@project.iam.gserviceaccount.com, test-sa-in-connection@project.iam.gserviceaccount.com
image
File "/usr/local/lib/python3.11/site-packages/google/auth/impersonated_credentials.py", line 100, in _make_iam_token_request raise exceptions.RefreshError(_REFRESH_ERROR, response_body) google.auth.exceptions.RefreshError: (\'Unable to acquire impersonated credentials\', \'{\ "error": {\ "code": 404,\ "message": "Not found; Gaia id not found for email test-first-sa-in-connection@project.iam.gserviceaccount.com",\ "status": "NOT_FOUND"\ }\ }\ \')
as expected, it attempted to use the first service account specified in the connection

with impersonation chain in gcp_test updated to astro-arithmetic-nebula-5764@astro-dev-us-east4-0001.iam.gserviceaccount.com, test-sa-in-connection@project.iam.gserviceaccount.com (and Airflow worker's service account granted access to impersonate astro-arithmetic-nebula-5764@astro-dev-us-east4-0001.iam.gserviceaccount.com)
image
File "/usr/local/lib/python3.11/site-packages/google/auth/impersonated_credentials.py", line 100, in _make_iam_token_request raise exceptions.RefreshError(_REFRESH_ERROR, response_body) google.auth.exceptions.RefreshError: (\'Unable to acquire impersonated credentials\', \'{\ "error": {\ "code": 404,\ "message": "Not found; Gaia id not found for email test-sa-in-connection@project.iam.gserviceaccount.com",\ "status": "NOT_FOUND"\ }\ }\ \')
as expected, it was able to impersonate the first service account in the chain, but then got an error when impersonating the second one.

@VladaZakharova
Copy link
Contributor

Thank you for explanation, i think we can merge the changes
@potiuk

@potiuk potiuk merged commit e2a5dbf into apache:main Nov 25, 2023
46 checks passed
@melugoyal melugoyal deleted the connection-support-impersonation-chain branch November 27, 2023 19:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers kind:documentation provider:google Google (including GCP) related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants