
Fix handling of single quotes in RedshiftToS3Operator #32915

Closed
wants to merge 6 commits

Conversation

@ashwaq06 ashwaq06 commented Jul 28, 2023

closes: #30287

This PR resolves an issue with the RedshiftToS3Operator in Apache Airflow, ensuring correct handling of single quotes in data. It includes refactoring, unit tests, and follows project guidelines. Collaboration with the community has been instrumental in making this contribution.

@boring-cyborg boring-cyborg bot added area:providers provider:amazon-aws AWS/Amazon - related issues labels Jul 28, 2023

boring-cyborg bot commented Jul 28, 2023

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide, and consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally. It's a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@@ -144,7 +144,7 @@ def _build_unload_query(
self, credentials_block: str, select_query: str, s3_key: str, unload_options: str
) -> str:
return f"""
- UNLOAD ('{select_query}')
+ UNLOAD ($${select_query}$$)
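The change above swaps single-quote wrapping for PostgreSQL-style dollar quoting, so single quotes inside `select_query` no longer terminate the string. A minimal standalone sketch of the new behavior (the `TO`/`credentials` lines below are assumptions, since the full template is not shown in this hunk):

```python
# Sketch only: approximates the PR's _build_unload_query with dollar quoting.
# The TO / credentials / options lines are placeholders, not copied from the PR.
def build_unload_query(credentials_block: str, select_query: str,
                       s3_key: str, unload_options: str) -> str:
    # $$...$$ delimits the query body, so embedded single quotes
    # need no escaping, unlike the old '...' wrapping.
    return f"""
    UNLOAD ($${select_query}$$)
    TO 's3://{s3_key}'
    credentials '{credentials_block}'
    {unload_options};
    """

q = build_unload_query(
    "aws_iam_role=arn:aws:iam::123456789012:role/example",  # placeholder
    "SELECT * FROM t WHERE c = 'quoted'",
    "bucket/key/table_",
    "ALLOWOVERWRITE",
)
```

Note that dollar quoting breaks down if the query itself contains a literal `$$`, which is part of the edge-case discussion below.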
Member
could you check this comment?

@ashwaq06 (Author) Jul 28, 2023

Thank you for raising this concern. @vincbeck You are right that using dollar quoting might not be sufficient for queries that already use double quotes. Currently, the suggested change in the code does not account for queries using double quotes, and it may not work as expected in those cases.

To address this issue, one possible solution is to detect if the query contains double quotes and, if it does, avoid using dollar quoting. Instead, the operator can dynamically determine the appropriate quoting method based on the query content.

Additionally, to ensure the robustness of the operator, it is essential to include unit tests that cover various use cases, including queries with single quotes, double quotes, and combinations of both. By adding comprehensive unit tests, we can verify that the RedshiftToS3 Operator works correctly with different types of queries.

By implementing these changes and enhancing the operator with proper handling of both single and double quotes, we can provide a more reliable and versatile solution for users dealing with complex queries.

Contributor

I am not sure this is worth going too deep on to work around a very limited set of edge cases. The only use of double quotes would be around column names that contain illegal character sequences; single quotes are for literals, which covers 99% of usage in my opinion. @vincbeck, did you have other edge cases in mind from that comment? Another approach would be to simply regex-replace any single quote within the string with two single quotes, as given in the Redshift UNLOAD docs, and continue to wrap the entire query in single quotes. That would cover the double-quote usage edge case.
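A sketch of the escaping alternative described above (the helper name and surrounding template are illustrative, not the PR's implementation): double every single quote, then keep wrapping the whole query in single quotes, per the Redshift UNLOAD documentation.

```python
# Sketch of the alternative suggested above: escape single quotes by
# doubling them, then keep wrapping the whole query in single quotes.
# Function name and the bare UNLOAD template are illustrative assumptions.
def build_unload_query_escaped(select_query: str) -> str:
    escaped = select_query.replace("'", "''")  # Redshift escape for a literal '
    return f"UNLOAD ('{escaped}')"

query = build_unload_query_escaped(
    "SELECT * FROM t WHERE c = 'value with single quotes'"
)
```

A plain `str.replace` is enough here; no regex is needed, since every single quote is escaped the same way regardless of position.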

Contributor

Yeah, this solution can work. Or, the other way around: you replace "" with " and keep using $$. This solution might be better because, as you pointed out, queries with "" are rare, so the replace would be there "just in case". But I definitely think this use case is worth handling; otherwise your change will break any query that already uses "".

@ashwaq06 ashwaq06 requested a review from hussein-awala July 28, 2023 19:16
@eladkal eladkal requested review from ferruzzi and vincbeck July 29, 2023 13:04
]

# Test a query with single quotes
select_query = "SELECT * FROM table WHERE column = 'value with single quotes'"
Contributor
Do we maybe want to parameterize this test to include double quotes, and maybe a query missing both types of quotes, just to ensure nothing breaks?
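A parameterized version along the lines suggested might look like this (a sketch: the operator wiring is omitted, and only the dollar-quoted wrapping from this PR's diff is exercised):

```python
import pytest

# Sketch of the suggested parameterization; only the quoting logic from
# this PR's diff is exercised, the operator itself is not instantiated.
@pytest.mark.parametrize(
    "select_query",
    [
        "SELECT * FROM table WHERE column = 'single quotes'",
        'SELECT * FROM table WHERE "column" = 1',   # double-quoted identifier
        "SELECT * FROM table",                      # no quotes at all
    ],
)
def test_unload_query_quoting(select_query):
    unload_query = f"UNLOAD ($${select_query}$$)"
    # The original query must survive wrapping unchanged.
    assert f"$${select_query}$$" in unload_query
```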


# Assert that the query is wrapped with double dollar sign quotes
expected_unload_query = f"""
UNLOAD('SELECT * FROM table WHERE column = $$SELECT * FROM table WHERE column = 'value with single quotes'$$')
Contributor

Is this a typo? It does not seem to match the select query being passed in above.
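For reference, the expected string for the select query above would presumably be a single dollar-quoted wrapping rather than the query embedded twice (an assumption based on the change proposed in this PR):

```python
select_query = "SELECT * FROM table WHERE column = 'value with single quotes'"
# One $$...$$ wrapping around the whole query, not the query nested
# inside itself as in the snippet quoted above.
expected_unload_query = f"UNLOAD ($${select_query}$$)"
```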

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Sep 22, 2023
@ferruzzi
Contributor

@ashwaq06 Are you still working on this one or have we scared you off?

@ashwaq06
Author

> @ashwaq06 Are you still working on this one or have we scared you off?

Hi @ferruzzi ,

Thank you for your feedback on my pull request. I appreciate your input and am committed to improving the code to meet the project's standards.

I've made changes up to my current knowledge and understanding, but I'm open to learning and making the correct changes. If you could provide more specific guidance or pointers on what needs improvement, I would greatly appreciate it. Your assistance in helping me align with the project's requirements would be invaluable.

Looking forward to your guidance and collaboration.

@github-actions github-actions bot removed the stale Stale PRs per the .github/workflows/stale.yml policy file label Sep 23, 2023
@vincbeck
Contributor

vincbeck commented Oct 6, 2023

Hey @ashwaq06, let's go through your first PR :) I know it can be hard. Could you resolve the static check failures first? If you go to "Files changed" in this PR, you'll see them all. Another (better) option is to run the static checks on your side; reading and following the documentation should help.

This test case ensures that the RedshiftToS3Operator correctly handles a custom select_query containing single quotes. It validates that the operator generates a proper UNLOAD query that handles single quotes, thereby confirming its functionality in such scenarios.
@ashwaq06
Author

ashwaq06 commented Oct 8, 2023

@vincbeck I've addressed the static failures as per your suggestion. You can see the changes in the "Files changed" section of this PR.

Could you please take a look at the changes I've made and let me know your thoughts? Your review would be greatly appreciated.


@pytest.mark.parametrize("table_as_file_name, expected_s3_key", [[True, "key/table_"], [False, "key"]])
@mock.patch("airflow.providers.amazon.aws.hooks.s3.S3Hook.get_connection")
@mock.patch("airflow.models.connection.Connection")
Contributor
Why are you mocking the connection rather than using airflow.utils.db.merge_conn for your tests? Given what the changes are, I also think it would be preferable for these tests to use Moto, which supports Redshift, rather than mocking in a way that does not confirm the Redshift engine handles these queries as expected and returns no errors.

@vincbeck
Contributor

> @vincbeck I've addressed the static failures as per your suggestion. You can see the changes in the "Files changed" section of this PR.
>
> Could you please take a look at the changes I've made and let me know your thoughts? Your review would be greatly appreciated.

Thanks for doing it! There are still some static check failures, though. Did you follow/read the documentation? There are some unit test failures as well; please look at them at the bottom of the PR. Let me know if you need help.

Labels
area:providers provider:amazon-aws AWS/Amazon - related issues
Development

Successfully merging this pull request may close these issues.

RedshiftToS3 Operator Wrapping Query in Quotes Instead of $$
5 participants