
Add support of a different AWS connection for DynamoDB #29452

Merged
merged 8 commits into from
Mar 9, 2023

Conversation

@dym-ok (Contributor) commented Feb 9, 2023

This change adds a new optional argument dest_aws_conn_id to DynamoDBToS3Operator so that a separate AWS connection can be used to access the S3 bucket. If it is not specified, the connection from source_aws_conn_id, which is also used to scan the DynamoDB table, is used instead.

aws_conn_id is marked as deprecated.

This makes the operator useful for cross-account transfers.

closes: #29422
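The fallback described above can be sketched with a tiny helper. This is illustrative only: `resolve_conn_ids` is not part of the operator, and the connection IDs used below are made up.

```python
# Illustrative sketch of the fallback described in the PR: when
# dest_aws_conn_id is not supplied, the S3 side reuses source_aws_conn_id.
# The helper name and the connection IDs are hypothetical.
def resolve_conn_ids(source_aws_conn_id, dest_aws_conn_id=None):
    """Return (dynamodb_conn_id, s3_conn_id) after applying the fallback."""
    s3_conn = dest_aws_conn_id if dest_aws_conn_id is not None else source_aws_conn_id
    return source_aws_conn_id, s3_conn

# Same-account transfer: both services use the one connection.
assert resolve_conn_ids("aws_default") == ("aws_default", "aws_default")
# Cross-account transfer: S3 uses its own connection.
assert resolve_conn_ids("aws_default", "aws_data_lake") == ("aws_default", "aws_data_lake")
```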

boring-cyborg bot commented Feb 9, 2023

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide, and consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally; it's a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or the final approval from Committers.
  • Please follow the ASF Code of Conduct for all communication, including (but not limited to) comments on Pull Requests, the Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@dstandish (Contributor)

I would like to suggest that you add "s3_conn_id" instead of "dynamodb_conn_id". Otherwise this ends up with quite unintuitive behavior, IMO.

Because when a user sees "dynamodb_conn_id", they will likely assume that's the one they should use. So I expect that a user will commonly and inadvertently end up doing the transfer between the DynamoDB connection ID and the AWS connection ID.

But if, on the other hand, you add an S3 conn ID, then the user can generally leave that as None, and the AWS conn_id will always be the one that's used for DynamoDB.

@eladkal (Contributor) commented Feb 10, 2023

I would like to suggest that you add "s3_conn_id" instead of "dynamodb_conn_id". Otherwise this ends up with quite unintuitive behavior, IMO.

Wouldn't it be more confusing, as we removed the S3 conn type in #25980?

@Taragolis (Contributor)

aws_conn_id_in, aws_conn_id_out? And deprecate aws_conn_id for transfer operators within AWS?

@dstandish (Contributor)

Wouldn't it be more confusing, as we removed the S3 conn type in #25980?

I don't think so, for the same reason that there isn't a "dynamodb" conn type. It's just "use this conn ID for this service" -- all of them are the AWS conn type.

So, the AWS conn ID would by default be used for both DynamoDB and S3, but you could optionally override S3 by supplying it. That's my thinking.

@eladkal (Contributor) commented Feb 10, 2023

aws_conn_id_in, aws_conn_id_out? And deprecate aws_conn_id for transfer operators within AWS?

This is pretty much what I had in mind for a base AWS transfer class.

@ferruzzi (Contributor)

aws_conn_id_in, aws_conn_id_out? And deprecate aws_conn_id for transfer operators within AWS?

I'd maybe leave the existing aws_conn_id and make the new way an option with some checks to assert aws_conn_id XOR (aws_conn_id_in and aws_conn_id_out). I don't think adding the option to specify an _out should require folks to update their existing code.
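The mutual-exclusion check suggested here could look something like the following sketch. This is hypothetical, not operator code; the parameter names follow the proposal in the comment above.

```python
# Hypothetical sketch of the suggested check: accept either the legacy
# aws_conn_id OR the new in/out pair, never both at once.
def validate_conn_args(aws_conn_id=None, aws_conn_id_in=None, aws_conn_id_out=None):
    """Return the resolved (in_conn, out_conn) pair, or raise on misuse."""
    new_style = aws_conn_id_in is not None or aws_conn_id_out is not None
    if aws_conn_id is not None and new_style:
        raise ValueError("Pass either aws_conn_id or aws_conn_id_in/aws_conn_id_out, not both")
    if aws_conn_id is not None:
        return aws_conn_id, aws_conn_id  # legacy: one connection for both sides
    if aws_conn_id_in is None or aws_conn_id_out is None:
        raise ValueError("Pass aws_conn_id, or both aws_conn_id_in and aws_conn_id_out")
    return aws_conn_id_in, aws_conn_id_out  # new style: distinct connections
```

With this shape, existing single-connection workflows keep working unchanged, which matches the backwards-compatibility concern raised above.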

@dym-ok (Contributor, Author) commented Feb 10, 2023

aws_conn_id_in, aws_conn_id_out? And deprecate aws_conn_id for transfer operators within AWS?

I'd maybe leave the existing aws_conn_id and make the new way an option with some checks to assert aws_conn_id XOR (aws_conn_id_in and aws_conn_id_out). I don't think adding the option to specify an _out should require folks to update their existing code.

I share the concern about current usage of the operator, and I agree that we should definitely leave people who are happy using just one AWS connection untouched. However, the idea of introducing two more connection parameters, such that all of them could never be used at the same time, sounds quite weird to me.

What would be the purpose of such generalisation? Out of all transfer operators, only three operate entirely within AWS: DynamoDBToS3Operator, RedshiftToS3Operator and S3ToRedshiftOperator. Another 15 have just one AWS connection, which is specified with aws_conn_id.

We could even go as far as generalising all transfer operators, as there is always an in and an out with them, but that sounds like a major revamp. What does everyone think? How can we make the most practical first step?

@dstandish (Contributor)

I agree. I think it's best to leave aws_conn_id alone and simply add an optional S3 conn ID for different credentials for the bucket.
This is fully backwards compatible and intuitive behavior.

@eladkal (Contributor) commented Feb 11, 2023

What would be the purpose of such generalisation? Out of all transfer operators, only three operate entirely within AWS: DynamoDBToS3Operator, RedshiftToS3Operator and S3ToRedshiftOperator. Another 15 have just one AWS connection, which is specified with aws_conn_id.

We could even go as far as generalising all transfer operators, as there is always an in and an out with them, but that sounds like a major revamp. What does everyone think? How can we make the most practical first step?

But why? My initial thought was to generalize only the AWS-to-AWS transfer operators.
There is no need to handle the non-AWS ones, because there it's very clear which is the input conn and which is the output conn.
HiveToDynamoDBOperator has hiveserver2_conn_id and aws_conn_id; you know what is in/out without adding that to the parameter names. The issue we are discussing is relevant only for AWS-to-AWS transfers, where it's not clear.

We can have source_aws_conn_id and dest_aws_conn_id, where dest defaults to the source value. Only when the user overrides the default will it be a case of two distinct accounts.

@eladkal (Contributor) commented Feb 11, 2023

BTW, this issue is not just AWS-specific.
We can have the same discussion for GCP-to-GCP and Azure-to-Azure transfers (or any other cloud provider).

@dym-ok (Contributor, Author) commented Feb 13, 2023

Alright, I feel like there is a consensus around having two in/out connections.
I have two questions left :)

  1. What about the argument for leaving aws_conn_id for people who only use one connection?
  2. How can we finalise the naming? It feels to me that aws_source_conn_id reads somewhat easier than source_aws_conn_id, but that's just my opinion 😅

@ferruzzi (Contributor)

If I were to pick, I'd say source_aws_conn_id/destination_aws_conn_id or aws_conn_id_in/aws_conn_id_out. I think I like the idea of keeping the `aws_conn_id` part intact, with either a prefix or a suffix.

I also still like the idea of having aws_conn_id be a valid parameter when they are both the same value, which avoids breaking existing workflows and keeps consistency with other operators and sensors for folks who don't use cross-account workflows.

I am curious what percentage of users have cross-account workflows like this. I don't expect you to have an actual answer; I'm just curious whether this is a large portion of the userbase or an outlier.

@dym-ok (Contributor, Author) commented Feb 13, 2023

I am curious to hear what kind of percentage of users have cross-account workflows like this.

I think this would be a rather typical use case. Many companies use multiple accounts to isolate services and simplify cost attribution.

Another typical use case is data lakes built on top of S3, which live separately from all the micro-services that use DynamoDB.

We don't have a single intra-account transfer for this operator.

@eladkal (Contributor) commented Feb 13, 2023

I suggest first checking other transfer operators. I think we already have a precedent for this (we are not obligated to do the same, but let's verify).

@dym-ok (Contributor, Author) commented Feb 13, 2023

Sorry, @eladkal, I didn't quite understand you. You suggest checking connections in other transfer operators for what, exactly?

@eladkal (Contributor) commented Feb 13, 2023

The parameter name. I'm pretty sure I've already seen the dest_ prefix somewhere.

@dym-ok (Contributor, Author) commented Feb 13, 2023

You're right, there is dest_aws_conn_id in GCSToS3Operator, so I'll adhere to that pattern.

Do we deprecate aws_conn_id?

@dym-ok (Contributor, Author) commented Feb 15, 2023

@eladkal @ferruzzi please have a look at the changes at your convenience.

@ferruzzi (Contributor)

@eladkal @ferruzzi please have a look at the changes at your convenience.

I'll take a look today, but it looks like Taragolis did a pretty solid review already. 👍

@ferruzzi (Contributor) left a review comment

Left a few notes. I'm still not convinced that removing aws_conn_id is the right play here, but if that is what the consensus is, then I'll back it.

In cases when the DynamoDBToS3Operator is used with a DynamoDB
table and an S3 bucket in different accounts, a separate AWS
connection is needed (e.g. if you need to assume an IAM role from a
different account).

Use source_aws_conn_id to specify AWS connection for accessing DynamoDB
and optionally dest_aws_conn_id for S3 Bucket access with a fallback to
source_aws_conn_id.

Deprecates aws_conn_id in favour of source_aws_conn_id.
dym-ok and others added 3 commits March 9, 2023 13:11
@Taragolis (Contributor) left a review comment

Tests / Wait for PROD images (pull_request): this refers to a problem in main. Fixed: #29996

@o-nikolas @eladkal @ferruzzi could you also have a look at this PR? It looks good to me.

@ferruzzi (Contributor) left a review comment

Approved with a couple of nitpick suggestions.

Co-authored-by: D. Ferruzzi <ferruzzi@amazon.com>
@eladkal (Contributor) commented Mar 9, 2023

I'm OK with it.

I still think that this problem is not localized to DynamoDB.
I think we should further explore the option of making it generic for other transfer operators.
Something like:

class BaseAwsToAwsTransferOperator(BaseOperator):

    def __init__(
        self,
        *,
        source_aws_conn_id: str | None = AwsBaseHook.default_conn_name,
        dest_aws_conn_id: str | None | ArgNotSet = NOTSET,
        aws_conn_id: str | None | ArgNotSet = NOTSET,
        **kwargs,
    ) -> None:
        super().__init__(
            ...,
            **kwargs,
        )


class DynamoDBToS3Operator(BaseAwsToAwsTransferOperator):
    ...


class S3ToS3Operator(BaseAwsToAwsTransferOperator):
    ...
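A self-contained variant of the sketch above could resolve the three arguments as follows. This is illustrative only, not the merged implementation: a local NOTSET sentinel stands in for Airflow's airflow.utils.types.NOTSET, the class name is made up, and the BaseOperator machinery is omitted.

```python
import warnings

NOTSET = object()  # local stand-in for airflow.utils.types.NOTSET


class AwsToAwsTransferSketch:
    """Illustrative only: resolves the three conn-id arguments as discussed."""

    def __init__(self, *, source_aws_conn_id="aws_default",
                 dest_aws_conn_id=NOTSET, aws_conn_id=NOTSET):
        if aws_conn_id is not NOTSET:
            # Legacy argument still works, but warns and maps to the new name.
            warnings.warn(
                "`aws_conn_id` is deprecated; use `source_aws_conn_id` instead.",
                DeprecationWarning, stacklevel=2)
            source_aws_conn_id = aws_conn_id
        self.source_aws_conn_id = source_aws_conn_id
        # dest falls back to source unless it was explicitly set
        self.dest_aws_conn_id = (
            source_aws_conn_id if dest_aws_conn_id is NOTSET else dest_aws_conn_id)
```

Using a NOTSET sentinel rather than None lets the class distinguish "not passed" from an explicit None, which is why the real proposal types the arguments as `str | None | ArgNotSet`.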

@Taragolis Taragolis merged commit 3780b01 into apache:main Mar 9, 2023
boring-cyborg bot commented Mar 9, 2023

Awesome work, congrats on your first merged pull request!

@dym-ok dym-ok deleted the fead/dynamodb_conn_id branch March 10, 2023 08:18
eladkal added a commit to eladkal/airflow that referenced this pull request Mar 14, 2023
followup on apache#29452 (comment)
This PR preserves all current behavior but adds the needed interface to be used by other transfer operators
eladkal added a commit that referenced this pull request Mar 14, 2023
* Add `AwsToAwsBaseOperator`
followup on #29452 (comment)
This PR preserves all current behavior but adds the needed interface to be used by other transfer operators
ahidalgob pushed commits to GoogleCloudPlatform/composer-airflow that referenced this pull request on Jun 12, 2023 and Nov 7, 2023, and kosteev did the same on Sep 19, 2024 and Nov 8, 2024, each cherry-picking:
* Add `AwsToAwsBaseOperator`
followup on apache/airflow#29452 (comment)
This PR preserves all current behavior but adds the needed interface to be used by other transfer operators

GitOrigin-RevId: 4effd6f48b5b0fabde7e8bc731844a1cd258dc0e
Labels: area:providers, provider:amazon-aws (AWS/Amazon - related issues)
Successfully merging this pull request may close these issues: Multiple AWS connections support in DynamoDBToS3Operator
5 participants