Add support of a different AWS connection for DynamoDB #29452
Conversation
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
I would like to suggest that you add `s3_conn_id` instead of `dynamodb_conn_id`. Otherwise this ends up with quite unintuitive behavior IMO. When a user sees `dynamodb_conn_id` they will likely assume that's the one they should use, so I expect users will commonly and inadvertently split the transfer between the DynamoDB connection ID and the AWS connection ID. If, on the other hand, you add an S3 conn ID, the user can generally leave it as None, and `aws_conn_id` will always be the one used for DynamoDB.
Wouldn't it be more confusing, as we removed the S3 conn type?
I don't think so, for the same reason that there isn't a "dynamodb" conn type. It's just "use this conn id for this service" -- all of them are the AWS conn type. So `aws_conn_id` would by default be used for both DynamoDB and S3, but you can optionally override S3 by supplying it. That's my thinking.
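To make the proposed semantics concrete, here is a minimal sketch of the fallback being discussed. The function and parameter names are purely illustrative, not taken from the PR:

```python
from __future__ import annotations


def resolve_conn_ids(aws_conn_id: str, s3_conn_id: str | None = None) -> tuple[str, str]:
    """Return (dynamodb_conn_id, s3_conn_id) under the proposed fallback."""
    # DynamoDB always uses the general AWS connection.
    dynamodb_conn_id = aws_conn_id
    # S3 uses its own connection only when one is supplied.
    effective_s3_conn_id = s3_conn_id if s3_conn_id is not None else aws_conn_id
    return dynamodb_conn_id, effective_s3_conn_id


# resolve_conn_ids("aws_default") -> ("aws_default", "aws_default")
# resolve_conn_ids("aws_default", "aws_other_account") -> ("aws_default", "aws_other_account")
```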
This is pretty much what I had in mind for a base AWS transfer class.
I'd maybe leave the existing `aws_conn_id` and make the new way an option, with some checks to assert
I share the consideration of the current usage of the operator, and I agree that we should definitely leave the people who are happy using just one AWS connection untouched. However, the idea of introducing two more connection parameters, such that there would never be a case when all connections are used at the same time, sounds quite weird to me. What would be the purpose of such a generalisation? Out of all the transfer operators, only three operate entirely within AWS. We could even go as far as generalising all transfer operators, as there is always an in and an out with them, but that sounds like a major revamp. What does everyone think? What would be the most practical first step?
I agree. I think it's best to leave `aws_conn_id` alone and simply add an optional S3 conn ID for optionally using different credentials for the bucket.
But why? My initial thought was to generalize only the AWS-to-AWS transfer operators. We can have
BTW, this issue is not just AWS-specific.
Alright, I feel like there is a consensus around having two in/out connections.
If I were to pick, I'd say I also still like that idea. I am curious to hear what percentage of users have cross-account workflows like this. I don't expect you to have an actual answer; I'm just curious whether this is a large portion of the userbase or an outlier.
I think this should be a rather typical use case. Many companies use multiple accounts to isolate services and simplify cost attribution. Another typical use case is data lakes built on top of S3, which live separately from all the micro-services that use DynamoDB. We don't have a single intra-account transfer for this operator.
I suggest first checking the other transfer operators. I think we already have a precedent for this (we are not obligated to do the same, but let's verify).
Sorry, @eladkal, I didn't quite understand you. You suggest checking the connections in other transfer operators for what exactly?
The parameter name. I'm pretty sure I already saw the `dest_` prefix somewhere.
You're right, there is such a precedent. Do we deprecate `aws_conn_id` then?
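For reference, deprecating a constructor argument in an operator typically looks something like the sketch below. This is a hedged illustration with a hypothetical class name, not the actual diff from this PR:

```python
from __future__ import annotations

import warnings


class DynamoDBToS3Sketch:
    """Hypothetical operator skeleton showing how `aws_conn_id` could be
    deprecated in favour of `source_aws_conn_id`."""

    def __init__(
        self,
        *,
        source_aws_conn_id: str = "aws_default",
        aws_conn_id: str | None = None,  # deprecated alias, kept for back-compat
    ) -> None:
        if aws_conn_id is not None:
            # Warn but keep working, so existing DAGs are untouched.
            warnings.warn(
                "`aws_conn_id` is deprecated; use `source_aws_conn_id` instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            source_aws_conn_id = aws_conn_id
        self.source_aws_conn_id = source_aws_conn_id
```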
Left a few notes. I'm still not convinced that removing `aws_conn_id` is the right play here, but if that is the consensus, then I'll back it.
In cases where the DynamoDBToS3Operator is used with a DynamoDB table and an S3 bucket in different accounts, a separate AWS connection is needed (e.g. if you need to assume an IAM role from a different account). Use `source_aws_conn_id` to specify the AWS connection for accessing DynamoDB and, optionally, `dest_aws_conn_id` for S3 bucket access, with a fallback to `source_aws_conn_id`. Deprecates `aws_conn_id` in favour of `source_aws_conn_id`.
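A minimal usage sketch under these semantics, assuming a recent Airflow; the connection IDs, table, and bucket names are made up for illustration:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.transfers.dynamodb_to_s3 import DynamoDBToS3Operator

with DAG(
    dag_id="dynamodb_to_s3_cross_account",
    start_date=datetime(2023, 1, 1),
    schedule=None,
):
    DynamoDBToS3Operator(
        task_id="backup_orders_table",
        dynamodb_table_name="orders",               # hypothetical source table
        s3_bucket_name="analytics-data-lake",       # hypothetical bucket in another account
        source_aws_conn_id="aws_dynamodb_account",  # used to scan DynamoDB
        dest_aws_conn_id="aws_s3_account",          # used to write to S3; omit to reuse source
    )
```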
Co-authored-by: Andrey Anshin <Andrey.Anshin@taragol.is>
fixed: #29996
This refers to a problem in main.
@o-nikolas @eladkal @ferruzzi could you also have a look at the PR? It looks good to me.
Approved with a couple of nitpick suggestions.
Co-authored-by: D. Ferruzzi <ferruzzi@amazon.com>
I'm OK with it. I still think that this problem is not localized to DynamoDB.
Awesome work, congrats on your first merged pull request!
* Add `AwsToAwsBaseOperator` as a follow-up to #29452 (comment). This PR preserves all current behavior but adds the interface needed so it can be used by other transfer operators.
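A rough sketch of the shape such a base class might take. This is an illustration of the idea under assumptions, not the merged code; the fallback mirrors the behavior described above:

```python
from __future__ import annotations

from airflow.models import BaseOperator


class AwsToAwsBaseOperator(BaseOperator):
    """Base for AWS-to-AWS transfers: a source connection plus an optional
    destination connection that falls back to the source."""

    def __init__(
        self,
        *,
        source_aws_conn_id: str | None = "aws_default",
        dest_aws_conn_id: str | None = None,
        **kwargs,
    ) -> None:
        super().__init__(**kwargs)
        self.source_aws_conn_id = source_aws_conn_id
        # Single-account users set only the source connection; the destination
        # silently falls back to it, preserving current behavior.
        self.dest_aws_conn_id = (
            dest_aws_conn_id if dest_aws_conn_id is not None else source_aws_conn_id
        )
        # A deprecation shim mapping the old `aws_conn_id` onto
        # `source_aws_conn_id` (as sketched earlier) would also live here.
```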
This change adds a new optional argument `dest_aws_conn_id` to `DynamoDBToS3Operator` so that a separate AWS connection can be used to access the S3 bucket. If not specified, the connection from `source_aws_conn_id` is used, which is also the one used to scan the DynamoDB table. `aws_conn_id` is marked as deprecated. This makes the operator useful for cross-account transfers.
closes: #29422