Fix handling of single quotes in RedshiftToS3Operator #32915
Conversation
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
@@ -144,7 +144,7 @@ def _build_unload_query(
        self, credentials_block: str, select_query: str, s3_key: str, unload_options: str
    ) -> str:
        return f"""
-                    UNLOAD ('{select_query}')
+                    UNLOAD ($${select_query}$$)
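To illustrate the bug this diff addresses: with plain single-quote wrapping, any literal quote inside `select_query` terminates the SQL string early, while Postgres-style dollar quoting leaves it intact. A minimal sketch of the two strategies (the helper name is hypothetical, not the operator's actual API):

```python
def build_unload_fragment(select_query: str, dollar_quote: bool = False) -> str:
    # Sketch of the two quoting strategies discussed in this PR.
    # With dollar quoting ($$...$$), embedded single quotes need no escaping.
    if dollar_quote:
        return f"UNLOAD ($${select_query}$$)"
    return f"UNLOAD ('{select_query}')"

query = "SELECT * FROM t WHERE c = 'x'"
# Single-quoted form embeds unbalanced quotes and breaks in Redshift:
#   UNLOAD ('SELECT * FROM t WHERE c = 'x'')
# Dollar-quoted form stays well-formed:
#   UNLOAD ($$SELECT * FROM t WHERE c = 'x'$$)
```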
could you check this comment?
Thank you for raising this concern. @vincbeck You are right that using dollar quoting might not be sufficient for queries that already use double quotes. Currently, the suggested change in the code does not account for queries using double quotes, and it may not work as expected in those cases.
To address this issue, one possible solution is to detect if the query contains double quotes and, if it does, avoid using dollar quoting. Instead, the operator can dynamically determine the appropriate quoting method based on the query content.
Additionally, to ensure the robustness of the operator, it is essential to include unit tests that cover various use cases, including queries with single quotes, double quotes, and combinations of both. By adding comprehensive unit tests, we can verify that the RedshiftToS3Operator works correctly with different types of queries.
By implementing these changes and enhancing the operator with proper handling of both single and double quotes, we can provide a more reliable and versatile solution for users dealing with complex queries.
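One way to read the "dynamically determine the appropriate quoting method" idea above is a helper that prefers dollar quoting but falls back to escaped single quotes whenever the query itself contains the `$$` delimiter. This is a sketch under that assumption, not the implementation merged in this PR, and the function name is made up:

```python
def quote_for_unload(select_query: str) -> str:
    # Prefer dollar quoting, which tolerates embedded single and double quotes.
    # Fall back to single quotes (with embedded quotes doubled) only when the
    # query itself contains the $$ delimiter, which would end the dollar quote.
    if "$$" in select_query:
        escaped = select_query.replace("'", "''")
        return f"'{escaped}'"
    return f"$${select_query}$$"
```

For example, `quote_for_unload("SELECT 'a'")` keeps the dollar-quoted form, while a query containing a literal `$$` is routed through the single-quote path instead.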
I am not sure this is something worth going too deep on to work around a very limited number of edge cases. The only use of double quotes would be around column names that contain illegal character sequences. Single quotes are for literals, which is really going to cover 99% of usage in my opinion. @vincbeck did you have other edge cases in mind from that comment? One other way to approach this would be to simply regex replace any single quote within the string with double single quotes, as given in the Redshift UNLOAD docs, and continue to wrap the entire query in single quotes. That would cover the double quote usage edge case.
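The escaping approach suggested here follows the Redshift UNLOAD documentation: double every single quote inside the query and keep the outer single-quote wrapper. A minimal sketch (the function name is illustrative only):

```python
def wrap_in_single_quotes(select_query: str) -> str:
    # Per the Redshift UNLOAD docs, a literal single quote inside the
    # quoted query text is written as two consecutive single quotes.
    return "'" + select_query.replace("'", "''") + "'"

print(wrap_in_single_quotes("SELECT * FROM t WHERE c = 'x'"))
# 'SELECT * FROM t WHERE c = ''x'''
```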
Yeah this solution can work. Or all the way around, you replace `""` with `"` and keep using `$$`. This solution MIGHT be better because, as you pointed out, queries with `""` are rare and as such the replace will be here "just in case". But I definitely think this is worth handling this use case, otherwise your change will be breaking for any query already using `""`.
# Test a query with single quotes
select_query = "SELECT * FROM table WHERE column = 'value with single quotes'"
Do we maybe want to parameterize this test to include testing double quotes, and maybe a query missing both types of quotes, just to ensure nothing is breaking?
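A sketch of how that parametrization could look, covering single quotes, double quotes, both, and neither. The `dollar_quote` helper here is a hypothetical stand-in for the operator's query-building step, not the PR's actual code:

```python
import pytest

def dollar_quote(select_query: str) -> str:
    # Hypothetical stand-in for the operator's UNLOAD query construction.
    return f"UNLOAD ($${select_query}$$)"

@pytest.mark.parametrize(
    "select_query",
    [
        "SELECT * FROM t WHERE c = 'single'",               # single quotes
        'SELECT * FROM t WHERE c = "double"',               # double quotes
        "SELECT * FROM t WHERE c = 'both' AND d = \"x\"",   # both kinds
        "SELECT * FROM t",                                  # no quotes at all
    ],
)
def test_query_survives_quoting(select_query):
    unload = dollar_quote(select_query)
    # The original query should appear verbatim between the $$ delimiters.
    assert unload == f"UNLOAD ($${select_query}$$)"
    assert select_query in unload
```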
# Assert that the query is wrapped with double dollar sign quotes
expected_unload_query = f"""
    UNLOAD('SELECT * FROM table WHERE column = $$SELECT * FROM table WHERE column = 'value with single quotes'$$')
Is this a typo? It does not seem to match the select query being passed in above.
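For reference, if the intent was dollar quoting, the expected string would presumably contain the select query exactly once between the `$$` delimiters. A guess at the intended assertion, not the PR's final code:

```python
select_query = "SELECT * FROM table WHERE column = 'value with single quotes'"
expected_unload_query = f"UNLOAD ($${select_query}$$)"
# i.e. UNLOAD ($$SELECT * FROM table WHERE column = 'value with single quotes'$$)
```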
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.
@ashwaq06 Are you still working on this one or have we scared you off?
Hi @ferruzzi, thank you for your feedback on my pull request. I appreciate your input and am committed to improving the code to meet the project's standards. I've made changes up to my current knowledge and understanding, but I'm open to learning and making the correct changes. If you could provide more specific guidance or pointers on what needs improvement, I would greatly appreciate it. Your assistance in helping me align with the project's requirements would be invaluable. Looking forward to your guidance and collaboration.
Hey @ashwaq06, let's go through your first PR :) I know it can be hard. Could you resolve the static check failures first? If you go to "Files changed" in this PR, you'll see them all. Another (better) option is to run the static checks on your side. Reading and following the documentation should help.
This test case ensures that the RedshiftToS3Operator correctly handles a custom select_query containing single quotes. It validates that the operator generates a proper UNLOAD query that handles single quotes, thereby confirming its functionality in such scenarios.
@vincbeck I've addressed the static failures as per your suggestion. You can see the changes in the "Files changed" section of this PR. Could you please take a look at the changes I've made and let me know your thoughts? Your review would be greatly appreciated.
@pytest.mark.parametrize("table_as_file_name, expected_s3_key", [[True, "key/table_"], [False, "key"]])
@mock.patch("airflow.providers.amazon.aws.hooks.s3.S3Hook.get_connection")
@mock.patch("airflow.models.connection.Connection")
Why are you mocking the connection rather than using `airflow.utils.db.merge_conn` for your tests? Given what the changes are, I also think it would be preferable for the tests to use Moto, which supports Redshift, rather than mocking in a way that will not confirm that the Redshift engine handles these queries as expected and does not return any errors.
Thanks for doing it! There are still some static check failures, though. Did you follow/read the documentation? There are some unit test failures as well. Please look at them at the bottom of the PR. Let me know if you need help.
closes: #30287
This PR resolves an issue with the RedshiftToS3Operator in Apache Airflow, ensuring correct handling of single quotes in data. It includes refactoring, unit tests, and follows project guidelines. Collaboration with the community has been instrumental in making this contribution.