Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove race condition in credentials.py #643

Merged
merged 1 commit into from
Oct 9, 2024

Conversation

seeholza
Copy link
Contributor

@seeholza seeholza commented Oct 7, 2024

@seeholza seeholza changed the title Update credentials.py to test removal of race condition in credentials Removal of possible race condition in credentials.py Oct 7, 2024
@seeholza seeholza changed the title Removal of possible race condition in credentials.py Remove of possible race condition in credentials.py Oct 7, 2024
@seeholza seeholza changed the title Remove of possible race condition in credentials.py Remove possible race condition in credentials.py Oct 7, 2024
@seeholza
Copy link
Contributor Author

seeholza commented Oct 8, 2024

@martindurant here is a branch with your proposed fix.

The tests here pass, but I do see weird behavior when running an actual workload. I double checked and the failure appears by simply bumping the version of gcsfs to this branch.

│   File "/usr/local/lib/python3.11/site-packages/fsspec/asyn.py", line 118, in wrapper                                                                                           │
│     return sync(self.loop, func, *args, **kwargs)                                                                                                                               │
│            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                               │
│   File "/usr/local/lib/python3.11/site-packages/fsspec/asyn.py", line 103, in sync                                                                                              │
│     raise return_result                                                                                                                                                         │
│   File "/usr/local/lib/python3.11/site-packages/fsspec/asyn.py", line 56, in _runner                                                                                            │
│     result[0] = await coro                                                                                                                                                      │
│                 ^^^^^^^^^^                                                                                                                                                      │
│   File "/usr/local/lib/python3.11/site-packages/gcsfs/core.py", line 1030, in _ls                                                                                               │
│     for entry in await self._list_objects(                                                                                                                                      │
│                  ^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                      │
│   File "/usr/local/lib/python3.11/site-packages/gcsfs/core.py", line 563, in _list_objects                                                                                      │
│     clisting = self._ls_from_cache(path)                                                                                                                                        │
│                ^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                        │
│   File "/usr/local/lib/python3.11/site-packages/fsspec/spec.py", line 379, in _ls_from_cache                                                                                    │
│     raise FileNotFoundError(path)

The call comes from GCSFilesystem().ls(path), but the path in question exists and the command works on gcsfs==2024.9.0 gcsfs==2023.12.2post1 (see below). Could be something with the authentication ends up not working?

@reineking
Copy link

I applied the fix in my run (writing hundreds of Parquet files with ~1b rows in total), and it works 🙂 Before the job was hanging consistently after 1h.

@seeholza seeholza changed the title Remove possible race condition in credentials.py Remove race condition in credentials.py Oct 8, 2024
@martindurant
Copy link
Member

I do see weird behavior when running an actual workload.

This causes the workflow to fail? Are you running multiple threads too? Missing credentials should not cause FileNotFound but PermissionError, and the traceback indicates this is coming from the directory listing cache - doing reading?

@seeholza
Copy link
Contributor Author

seeholza commented Oct 8, 2024

I do see weird behavior when running an actual workload.

This causes the workflow to fail? Are you running multiple threads too? Missing credentials should not cause FileNotFound but PermissionError, and the traceback indicates this is coming from the directory listing cache - doing reading?

Yes this causes a failure. We're not running with multiple threads, it's the first (reading) call in the process, which uses GCSFilesystem().ls(path).

Edit: I am disabling caching and double checking, might be an unrelated issue on my side where i don't invalidate caches properly. It's just weird it didn't appear before.

@martindurant
Copy link
Member

Yes this causes a failure. We're not running with multiple threads, it's the first (reading) call in the process, which uses GCSFilesystem().ls(path).

So I understand clearly: ls (alone and with no other operations) works with current main, but not with this patch?

@seeholza
Copy link
Contributor Author

seeholza commented Oct 9, 2024

@martindurant I did notice that pre-patch I accidentally ran gcsfs==2023.12.2.post1, which didn't contain #612. I checked, and updating to 2024.9.0 causes the same ls failure already, so this is unrelated. Sorry for the red herring. I now call ls(path, refresh=True) and this works as intended. Preliminary runs confirm that I do not get the race condition/hanging anymore.

So, as to this bugfix, I think we're good, opening the PR for review.


The following is therefore unrelated to the PR, but just to answer your question @martindurant :

So I understand clearly: ls (alone and with no other operations) works with current main, but not with this patch?

What I do exactly (in pseudocode) is:

  • gcsfs.GCSFileSystem().exists(parent_path) is called earlier
  • extra code creates new paths path/subdir1/subdir2 with content in subdir1 and subdir2
  • gcsfs.GCSFileSystem().ls(path/subdir1/subdir2) is called later, and this is the one that fails, only after the patch.

So, if the cache persists somehow between instantiations of GCSFileSystem, then that could be an explanation for the failure of the ls.

@seeholza seeholza marked this pull request as ready for review October 9, 2024 10:20
@martindurant
Copy link
Member

gcsfs.GCSFileSystem() will get the same instance each time, by design, since creating credentials and connections is expensive. It will also persist the directory cache. You can avoid this by passing skip_instance_cache=True, but it would be worth finding out what's going on.

extra code creates new paths

If this is from a different instance/process/machine, then it is to be expected that the directory cache of the instance above has become stale, so refresh is the right thing to do (or .invalidate_cache()).

@martindurant martindurant merged commit 8a83230 into fsspec:main Oct 9, 2024
5 checks passed
@seeholza
Copy link
Contributor Author

seeholza commented Oct 10, 2024

gcsfs.GCSFileSystem() will get the same instance each time, by design, since creating credentials and connections is expensive. It will also persist the directory cache. You can avoid this by passing skip_instance_cache=True, but it would be worth finding out what's going on.

Thats very useful information @martindurant, thank you very much! I was trying to find something along these lines in the documentation, but failed to. If this is not in there yet, would it make sense adding it?

@martindurant
Copy link
Member

It is an inherited behaviour from fsspec, mentioned here: https://filesystem-spec.readthedocs.io/en/latest/features.html#instance-caching

I totally agree that the whole project and set of repos could do with a docs reorganise, but I think it's beyond my personal effort level right now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants