
occasional lockups during dask reads #61

Open
mukhery opened this issue Jun 22, 2021 · 5 comments

@mukhery

mukhery commented Jun 22, 2021

It seems that stackstac will occasionally hang indefinitely while doing a dataset read:
[screenshot]

call stack:

File "/srv/conda/envs/notebook/lib/python3.8/threading.py", line 890, in _bootstrap self._bootstrap_inner()

File "/srv/conda/envs/notebook/lib/python3.8/threading.py", line 932, in _bootstrap_inner self.run()

File "/srv/conda/envs/notebook/lib/python3.8/threading.py", line 870, in run self._target(*self._args, **self._kwargs)

File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/threadpoolexecutor.py", line 55, in _worker task.run()

File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/_concurrent_futures_thread.py", line 66, in run result = self.fn(*self.args, **self.kwargs)

File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 3616, in apply_function result = function(*args, **kwargs)

File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 3509, in execute_task return func(*map(execute_task, args))

File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 3509, in execute_task return func(*map(execute_task, args))

File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 3509, in execute_task return func(*map(execute_task, args))

File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 3509, in execute_task return func(*map(execute_task, args))

File "/srv/conda/envs/notebook/lib/python3.8/site-packages/dask/optimization.py", line 963, in __call__ return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))

File "/srv/conda/envs/notebook/lib/python3.8/site-packages/dask/core.py", line 151, in get result = _execute_task(task, cache)

File "/srv/conda/envs/notebook/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task return func(*(_execute_task(a, cache) for a in args))

File "/srv/conda/envs/notebook/lib/python3.8/site-packages/stackstac/to_dask.py", line 172, in fetch_raster_window data = reader.read(current_window)

File "/srv/conda/envs/notebook/lib/python3.8/site-packages/stackstac/rio_reader.py", line 425, in read result = reader.read(

File "/srv/conda/envs/notebook/lib/python3.8/site-packages/stackstac/rio_reader.py", line 249, in read return self.dataset.read(1, window=window, **kwargs)

Is it possible to pass in a timeout parameter or something like that, or would I be better off just cancelling the job entirely when something like this happens?

@gjoseph92
Owner

Yeah, a timeout on the read would be reasonable. Then we could have that timeout trigger a retry via whatever logic we implement for #18.

@mukhery I'm curious if you have a reproducer for this, or have noticed cases/datasets/patterns that tend to cause it more often?

For now, you might try playing with setting GDAL_HTTP_MAX_RETRY and GDAL_HTTP_RETRY_DELAY via LayeredEnv. See https://gdal.org/user/virtual_file_systems.html#vsicurl-http-https-ftp-files-random-access and https://trac.osgeo.org/gdal/wiki/ConfigOptions#GDAL_HTTP_TIMEOUT.

Maybe something like:

retry_env = stackstac.DEFAULT_GDAL_ENV.updated(dict(
    GDAL_HTTP_TIMEOUT=45,       # give up on any single HTTP request after 45 seconds
    GDAL_HTTP_MAX_RETRY=5,      # retry failed requests up to 5 times
    GDAL_HTTP_RETRY_DELAY=0.5,  # base delay (seconds) between retries
))
stackstac.stack(..., gdal_env=retry_env)
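(For context: GDAL_HTTP_TIMEOUT caps how long a single HTTP request can take before GDAL gives up on it, and GDAL_HTTP_MAX_RETRY / GDAL_HTTP_RETRY_DELAY make GDAL retry transient HTTP failures a few times before erroring out, so a stuck request should turn into a retry or an error within a bounded time rather than hanging forever. The exact values above are just a guess at a starting point.)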

@mukhery
Author

mukhery commented Jun 29, 2021

I tried to come up with something to reproduce this but haven't been able to. We've also been seeing several other network-related/comms issues, so it's possible that our specific workload and how we've implemented the processing are causing some of these issues. To meet our current need, I ended up just adding timeouts to the task futures and then cancelling them and/or restarting the cluster as needed. Feel free to close this issue if you'd like, and I can reopen it later if I'm able to reliably reproduce.

@gjoseph92
Owner

I'll keep it open, since I think it's a reasonable thing to implement.

> I ended up just adding timeouts to the task futures

Curious how you implemented this?

@mukhery
Author

mukhery commented Jun 29, 2021

Sounds good, thanks!

I did something like this:

import dask.distributed

try:
    fut = cluster.client.compute(<task_involving_stackstac_data>)
    dask.distributed.wait(fut, timeout=600)
except dask.distributed.TimeoutError as curr_exception:
    error_text = f'{curr_exception}'[:100]  # sometimes the error messages are crazy long
    print(f'task failed with exception: {error_text}')
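A rough sketch of the "cancelling and/or restarting the cluster" fallback mentioned above (not the exact code used here; fut, cluster, and the placeholder task are the same as in the snippet above) might look like:

import dask.distributed

fut = cluster.client.compute(<task_involving_stackstac_data>)  # placeholder task, as above
try:
    dask.distributed.wait(fut, timeout=600)
except dask.distributed.TimeoutError:
    fut.cancel()              # drop the stuck task
    cluster.client.restart()  # or restart the workers to clear any wedged reads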

@gjoseph92
Owner

Nice! That makes sense.
