Avoid auth error when serializing files across processes #276

Merged: 2 commits into nsidc:main from include-store on Sep 7, 2023

Conversation

@jrbourbeau (Collaborator) commented Aug 7, 2023

With #259, when serializing files, we started re-checking the filesystem type (HTTPS vs. S3) and, if we are on AWS but not using S3 files, we now re-open the file using an s3fs filesystem.

This is good for performance, but when running on a remote Dask cluster, users now need all the nodes in their cluster (i.e. client, scheduler, and workers) to be authenticated with earthaccess. If only the client session is authenticated, as in this example:

import earthaccess
from distributed import SubprocessCluster
import xarray as xr

earthaccess.login()

# Create a cluster where the scheduler / workers are in local subprocesses.
# This has the same serialization characteristics as a remote cluster.
cluster = SubprocessCluster()
client = cluster.get_client()

results = earthaccess.search_data(
    short_name="VIIRS_NPP-STAR-L3U-v2.80",
    concept_id="C2147485059-POCLOUD",
    cloud_hosted=True,
    count=1,  # For testing purposes
)

ds = xr.open_mfdataset(earthaccess.open(results))
result = ds.sea_surface_temperature.notnull().mean({"lon", "lat"}).compute()
print(result)

users encounter the following error:

Traceback (most recent call last):
  File "/Users/james/mambaforge/envs/nasa/lib/python3.9/site-packages/distributed/scheduler.py", line 4346, in update_graph
    graph = deserialize(graph_header, graph_frames).data
  File "/Users/james/mambaforge/envs/nasa/lib/python3.9/site-packages/distributed/protocol/serialize.py", line 432, in deserialize
    return loads(header, frames)
  File "/Users/james/mambaforge/envs/nasa/lib/python3.9/site-packages/distributed/protocol/serialize.py", line 98, in pickle_loads
    return pickle.loads(x, buffers=buffers)
  File "/Users/james/mambaforge/envs/nasa/lib/python3.9/site-packages/distributed/protocol/pickle.py", line 96, in loads
    return pickle.loads(x)
  File "/Users/james/projects/nsidc/earthaccess/earthaccess/store.py", line 66, in make_instance
    if earthaccess.__store__.running_in_aws and cls is not s3fs.S3File:
AttributeError: 'NoneType' object has no attribute 'running_in_aws'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/james/projects/nsidc/earthaccess/test.py", line 20, in <module>
    result = ds.sea_surface_temperature.notnull().mean({"lon", "lat"}).compute()
  File "/Users/james/mambaforge/envs/nasa/lib/python3.9/site-packages/xarray/core/dataarray.py", line 1101, in compute
    return new.load(**kwargs)
  File "/Users/james/mambaforge/envs/nasa/lib/python3.9/site-packages/xarray/core/dataarray.py", line 1075, in load
    ds = self._to_temp_dataset().load(**kwargs)
  File "/Users/james/mambaforge/envs/nasa/lib/python3.9/site-packages/xarray/core/dataset.py", line 747, in load
    evaluated_data = da.compute(*lazy_data.values(), **kwargs)
  File "/Users/james/mambaforge/envs/nasa/lib/python3.9/site-packages/distributed/client.py", line 2247, in _gather
    raise exception.with_traceback(traceback)
RuntimeError: Error during deserialization of the task graph. This frequently occurs if the Scheduler and Client have different environments. For more information, see https://docs.dask.org/en/stable/deployment-considerations.html#consistent-software-environments

This happens because other nodes, in this case the scheduler, attempt to check whether they're running on AWS, but they aren't authenticated with earthaccess, so the check errors out.
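
For context, the deserialization helper the traceback points at boils down to a check like the one sketched below (the signature and body are illustrative, not the actual earthaccess source; only the if line is taken from the traceback):

import earthaccess
import s3fs

def make_instance(cls, filename, state):
    # On a node that never called earthaccess.login(), earthaccess.__store__
    # is still None, so this attribute access raises the AttributeError above.
    if earthaccess.__store__.running_in_aws and cls is not s3fs.S3File:
        ...  # re-open the file with an s3fs filesystem for direct in-region access
    return cls(filename, **state)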

While it's possible to authenticate the other nodes in the cluster with earthaccess using some additional client-side code, most users won't want to, or won't know how to, handle this extra authentication step.
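
For reference, that extra client-side code could look roughly like the sketch below, assuming Earthdata credentials (e.g. EARTHDATA_USERNAME / EARTHDATA_PASSWORD environment variables or a .netrc file) are already available on the scheduler and worker machines; the login strategy shown is just one option:

import earthaccess
from distributed import SubprocessCluster

cluster = SubprocessCluster()
client = cluster.get_client()

earthaccess.login()  # authenticate the client session

# Authenticate the worker processes and the scheduler as well
client.run(earthaccess.login, strategy="environment")
client.run_on_scheduler(earthaccess.login, strategy="environment")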

This PR proposes handling this for them by forwarding the existing client-side earthaccess auth object alongside data files, and then re-authenticating (only if needed) when a file has to be re-opened.
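
The shape of the change is roughly the sketch below (names and signatures are illustrative, not the exact diff): the file's pickle hook ships the client-side auth object along with the file, and the deserializing process adopts it only when it isn't already authenticated.

import earthaccess

def make_instance(cls, filename, auth, state):
    # auth is the earthaccess auth object captured on the client when the file
    # was pickled; only adopt it if this process isn't already logged in.
    if not earthaccess.__auth__.authenticated:
        earthaccess.__auth__ = auth
        earthaccess.login()
    # ...then re-open the file over S3 or HTTPS as before...
    return cls(filename, **state)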

This seems sensible to me, but is just one possible approach. I'd welcome any thoughts others may have.

cc @betolink @MattF-NSIDC


@MattF-NSIDC

Hey James, thanks for another valuable contribution! I'm not familiar enough yet to review this, but I think it'll get capable eyes on it soon.

We're happy to see so much engagement from the community, and we don't have any firm plans yet, but there's interest in bringing on community maintainers and leaders to make this project more sustainable as it continues to grow. Would you be interested, or know anyone who would be?

@jrbourbeau (Collaborator, Author)

@MattF-NSIDC yeah, I'd be happy to help out where I can. My main OSS activity is as a maintainer of Dask, though I also interact regularly with other projects that interface with Dask, like Xarray, Zarr, s3fs, etc.

@MattF-NSIDC

Wonderful! ❤️

FWIW both Luis and I had seen you around GitHub before you started contributing here, and we're very excited to have you helping out :)

@MattF-NSIDC

@jrbourbeau we just had a really productive team discussion about bringing on community maintainers. We have team agreement that it would be amazing to bring you on as a maintainer! Welcome to the team ❤️

For now, we're envisioning a simple governance model where earthaccess maintainers act as the governing committee with @betolink as tie-breaker.

This is not something NSIDC has institutional experience with, and @betolink and I have been pushing for developing that experience. We hope to use earthaccess as an opportunity to institutionally recognize the value of open and community-driven development and adopt open project governance practices more broadly. If there's anything you can suggest that will help in that mission, we'd love to hear your thoughts.

@betolink (Member) commented Sep 5, 2023

@jrbourbeau is there anything left to do on this one? I think we can merge; I just had one question about the execution of login().

@jrbourbeau (Collaborator, Author)

I think this one should be good to go (I've been working off this branch for a while without issue)

just had one question about the execution of login()

I think I missed this -- what's your question?

@betolink (Member) commented Sep 7, 2023

@jrbourbeau nice! The question was about whether this code is meant to run only in Dask workers. If we're not using Dask, we won't hit this, right?

@betolink (Member) left a review comment

Looks good; this should avoid auth issues when we use distributed workloads.

# Attempt to re-authenticate
if not earthaccess.__auth__.authenticated:
    earthaccess.__auth__ = auth
    earthaccess.login()
@betolink (Member) commented on this diff hunk:

If we create the store in the scheduler, why does make_instance() have to execute a login() again? Is this code running only on the scheduler or in the workers? The idea is that we create the store only once and, since it's authenticated, send it to the workers, right?

@jrbourbeau (Collaborator, Author)

The question was about whether this code is meant to run only in Dask workers. If we're not using Dask, we won't hit this, right?

Yeah, that's mostly correct. However, this also helps when running earthaccess across multiple processes (e.g. the main parent process was authenticated with earthaccess.login() but the child processes haven't been).
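
A minimal sketch of that multi-process case, reusing the granule parameters from the example above (the helper function is hypothetical): the parent logs in, then ships an earthaccess-opened file to a child process, which has to deserialize it without ever calling earthaccess.login() itself.

import earthaccess
from multiprocessing import Pool

def first_byte(f):
    # Runs in a child process that never called earthaccess.login()
    return f.read(1)

if __name__ == "__main__":
    earthaccess.login()
    results = earthaccess.search_data(concept_id="C2147485059-POCLOUD", count=1)
    files = earthaccess.open(results)
    with Pool(1) as pool:
        # The file object is pickled into the worker process; with this PR the
        # client-side auth travels with it, so the worker can re-open the file.
        print(pool.apply(first_byte, (files[0],)))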

@betolink merged commit ff4fb5f into nsidc:main on Sep 7, 2023
6 checks passed
@jrbourbeau deleted the include-store branch on September 7, 2023
@jrbourbeau mentioned this pull request on Sep 11, 2023