Fix gcs listing - ensure blobs are loaded #34919

Merged
merged 1 commit into from
Nov 27, 2023

Conversation

atrbgithub
Contributor

This fixes #34909

Calling list() on the blobs appears to force them to be loaded eagerly rather than lazily.



@boring-cyborg boring-cyborg bot added the area:providers and provider:google (Google (including GCP) related issues) labels Oct 13, 2023
@atrbgithub
Contributor Author

Please see here for more info.

In our case blobs.prefixes was not being populated.

The list here forces that to happen.
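The behaviour described above can be illustrated with a minimal sketch. It assumes a paged listing that records prefixes only as a side effect of reading each page; `LazyBlobListing` is a hypothetical stand-in written for this illustration and is not the google-cloud-storage implementation.

```python
from typing import Iterable, Iterator, List, Set, Tuple


class LazyBlobListing:
    """Hypothetical stand-in for a lazily paged listing (not the real client class)."""

    def __init__(self, pages: Iterable[Tuple[List[str], List[str]]]) -> None:
        # Each page is (object_names, prefixes), roughly as a server might return them.
        self._pages = list(pages)
        # Stays empty until pages are actually read.
        self.prefixes: Set[str] = set()

    def __iter__(self) -> Iterator[str]:
        for names, prefixes in self._pages:
            # Prefixes are recorded only while a page is being consumed.
            self.prefixes.update(prefixes)
            yield from names


# With a delimiter, "sub-directories" come back as prefixes and not as objects,
# so the pages below carry no object names at all.
listing = LazyBlobListing([([], ["data/2023/"]), ([], ["data/2024/"])])

before = set(listing.prefixes)  # nothing has been read yet
names = list(listing)           # forces every page to be read
after = set(listing.prefixes)   # populated as a side effect of the list() call

print(before, names, sorted(after))
```

In this sketch, checking `prefixes` before anything has iterated over the listing sees an empty set, which matches the symptom reported in this PR.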

Contributor

@shahar1 shahar1 left a comment

LGTM!
I'd like to note that this happens only when prefix and delimiter are specified.
With match_glob, it works as expected without this patch.

@@ -829,10 +829,12 @@ def _list(
versions=versions,
)

all_blobs = list(blobs)
Contributor

@shahar1 shahar1 Nov 6, 2023

Since this issue is specific to using prefix and delimiter, it would be better for future debugging to convert it into blobs = list(blobs) within the else block above (and revert the later reference to blobs).

Contributor Author

@shahar1 thanks for taking a look at this, much appreciated. I've moved the list(blobs) into the else block above, as requested. I tested this locally using prefix with match_glob and then prefix with delimiter, and both code paths appear to work fine.

Contributor Author

I noticed this was also broken for list_by_timespan. I've fixed that as well and squashed the commits.

@shahar1
Contributor

shahar1 commented Nov 10, 2023

LGTM :)
@eladkal Feel free to merge

@eladkal eladkal changed the title from "Airflow-34909 - Fix gcs listing - ensure blobs are loaded" to "Fix gcs listing - ensure blobs are loaded" Nov 26, 2023
@eladkal
Contributor

eladkal commented Nov 26, 2023

@atrbgithub can you rebase? (You disabled allowing maintainers to make changes on your branch, so I can't do it for you)

@atrbgithub
Contributor Author

atrbgithub commented Nov 27, 2023

@eladkal My apologies, that was not intentional. I've rebased, and I will look into getting that setting changed.

Apparently this is a known issue when creating a PR from a repo which is under an organisation - https://github.com/orgs/community/discussions/5634

@atrbgithub
Contributor Author

I've raised #35884 as an alternative, which is outside the org and has the "Allow edits and access to secrets by maintainers" option enabled.

@eladkal
Contributor

eladkal commented Nov 27, 2023

> I've raised #35884 as an alternative, which is outside the org and has the "Allow edits and access to secrets by maintainers" option enabled.

No need :)
Your rebase is enough

@eladkal eladkal merged commit 5d74ffb into apache:main Nov 27, 2023
47 checks passed
@atrbgithub
Contributor Author

Great thanks @eladkal 👍

@pankajastro
Member

Hey @atrbgithub, it looks like GCSObjectsWithPrefixExistenceSensor is broken after this PR's change. I just did a quick run and got the error below.

Sensor

gcs_object_with_prefix_exists = GCSObjectsWithPrefixExistenceSensor(
    bucket=BUCKET_1,
    prefix=PATH_TO_UPLOAD_FILE_PREFIX,
    task_id="gcs_object_with_prefix_exists_task",
    google_cloud_conn_id=GCP_CONN_ID,
)

Error

[2023-12-08, 09:48:24 UTC] {taskinstance.py:1937} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/airflow/providers/google/cloud/sensors/gcs.py", line 338, in execute
    super().execute(context)
  File "/usr/local/lib/python3.9/site-packages/airflow/sensors/base.py", line 257, in execute
    raise e
  File "/usr/local/lib/python3.9/site-packages/airflow/sensors/base.py", line 239, in execute
    poke_return = self.poke(context)
  File "/usr/local/lib/python3.9/site-packages/airflow/providers/google/cloud/sensors/gcs.py", line 331, in poke
    self._matches = hook.list(self.bucket, prefix=self.prefix)
  File "/usr/local/lib/python3.9/site-packages/airflow/providers/google/cloud/hooks/gcs.py", line 763, in list
    self._list(
  File "/usr/local/lib/python3.9/site-packages/airflow/providers/google/cloud/hooks/gcs.py", line 829, in _list
    ids.extend(blob.name for blob in blobs)
  File "/usr/local/lib/python3.9/site-packages/google/api_core/page_iterator.py", line 223, in __iter__
    raise ValueError("Iterator has already started", self)
ValueError: ('Iterator has already started', <google.api_core.page_iterator.HTTPIterator object at 0x7efe067c3880>)
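The traceback above suggests that the listing was iterated a second time after it had already been consumed. The following is a minimal sketch of that failure mode and of one way to avoid it, assuming an iterator that may only be started once. `OneShotListing`, `collect_ids_twice`, and `collect_ids_once` are hypothetical names written for this illustration. They are not the Airflow hook or the google.api_core classes, and the sketch is not necessarily what the follow-up PRs did.

```python
class OneShotListing:
    """Hypothetical stand-in that mimics the 'start once' rule seen in the traceback."""

    def __init__(self, names, prefixes=()):
        self._names = list(names)
        self._pending_prefixes = set(prefixes)
        self.prefixes = set()  # populated only while iterating
        self._started = False

    def __iter__(self):
        # A second call to iter() is rejected, as in the error above.
        if self._started:
            raise ValueError("Iterator has already started", self)
        self._started = True
        return self._generate()

    def _generate(self):
        self.prefixes.update(self._pending_prefixes)
        yield from self._names


def collect_ids_twice(listing):
    """Consumes the listing, discards the result, then tries to iterate again."""
    list(listing)  # first pass: populates prefixes, names are thrown away
    if listing.prefixes:
        return sorted(listing.prefixes)
    return [name for name in listing]  # second pass: raises ValueError


def collect_ids_once(listing):
    """Consumes the listing exactly once and keeps the names from that pass."""
    names = list(listing)  # single pass: keeps names and populates prefixes
    if listing.prefixes:
        return sorted(listing.prefixes)
    return names


try:
    collect_ids_twice(OneShotListing(["a.txt", "b.txt"]))
    second_pass_failed = False
except ValueError:
    second_pass_failed = True

ids = collect_ids_once(OneShotListing(["a.txt", "b.txt"]))
print(second_pass_failed, ids)
```

Note that the failing path is only reached when no prefixes are returned, which would be consistent with the sensor call above passing prefix without delimiter.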

@atrbgithub
Contributor Author

@pankajastro thanks for raising, I have mentioned this here and asked for the change not to be merged in.

@atrbgithub
Contributor Author

atrbgithub commented Dec 8, 2023

@pankajastro I've raised a PR to fix this #36130

Would you be able to retest?

Edit - A new PR has been raised to address this #36202

Labels
area:providers provider:google Google (including GCP) related issues
Development

Successfully merging this pull request may close these issues.

apache-airflow-providers-google 10.9.0 fails to list GCS objects
4 participants