-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use dask.delayed instead of the distributed client #222
Conversation
Started work on using dask.delayed instead of the futures interface (client.submit). It's easier and allows building the entire graph lazily before executing. Still need tests and to port the SplineCV code to this. Will deprecate the client argument and remove in 2.0.0.
Having Dask be optional is causing headaches:
Dask is a really lightweight dependency (pure Python with minimal requirements) and I would like to add more support for it in the future (to enable parallel fitting and larger than RAM Jacobians). I'm thinking we should just adopt Dask as a dependency. Scikit-learn is a much heavier dependency and we already have it. This is nothing compared to GDAL and the geospatial stack. @santisoler @jessepisel any thoughts/objections on this? |
Separated so we can revert this easily if needed.
I'm seeing Dask in the future of Verde, specially for solving problems that include larger-than-memory arrays. So I think we could add it as a dependency right away and save the headaches of writing annoying lines on test functions and struggling with building the docs. |
👍 alright, I'll try to get this one finished soon. Thanks, @santisoler |
I agree with @santisoler that Dask should be a dependency. It will be easier to deprecate the futures interface and not have to write new tests for both options. This will be nice to have for big grids that don't fit in memory @leouieda ! |
Thanks for the input @jessepisel and @santisoler 👍
I'm still struggling with ways of doing this. The dask-ml package has some optimization methods for linear problems that we could use to run the least-squares fit. But it's not stable enough to be a dependency. Right now, solving the system with the linear algebra is dask is bad because there are a lot of limitations (need square chunks, for one). That's where we should start looking at pylops. Matteo is working on a distributed version of that using dask. I'm eager to play around with it and see if we can make it work. |
Another option is to break up the fit into windows and run each window separately (using dask). |
OK, I added the delayed argument to the tutorials as well and made Dask a dependency. @santisoler and/or @jessepisel could you please review this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work on the dask conversion @leouieda. I went through and reviewed all the changes. The conf.py
, install.rst
, environment.yml
, requirements-dev.txt
, requirements.txt
, and setup.py
look good to go.
The tutorials look really good. They have a nice logical flow and explain the difference between delayed dask
computation and serial computation. I added some comments here and there for clarity and grammar, but overall I think they work really well.
As for the bulk of the changes in the rest of the package, I looked through them and I think you got all the deprecation warnings covered.
The new tests for test_model_selection.py
should work nicely for the model selection. In the test for test_spline.py
the mindists
changed from 1e-5 to 1e6, is there any reason why you changed the minimum distances to fit by such a large value? Is it ensuring that the 1e-7 is returned for the test? That's all I noticed for the tests.
I think this looks great. The builds are passing and I think there is enough documentation and warnings for users to make the transition to the dask
implementation in verde.
# :class:`~verde.Spline` aren't optimal for this dataset. We could try | ||
# different combinations manually until we get a good score. A better way is to | ||
# do this automatically. In :ref:`model_selection` we'll go over how to do just | ||
# that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the note is appropriate for reminding users that it will be memory intensive. The last bit on improving the score is a great way to transition to model_selection
.
Co-Authored-By: Jesse Pisel <jessepisel@users.noreply.github.com>
Co-Authored-By: Jesse Pisel <jessepisel@users.noreply.github.com>
Thanks for the review @jessepisel!
That is exactly it. I just wanted to make it easier for the CV by giving it obviously bad values and only 1 good combination. This will avoid future headaches due to floating point round-off. |
Started work on using dask.delayed instead of the futures interface
(client.submit). It's easier and allows building the entire graph lazily
before executing. Still need tests and to port the SplineCV code to
this. Will deprecate the client argument and remove in 2.0.0.
Reminders
make format
andmake check
to make sure the code follows the style guide.doc/api/index.rst
andverde/__init__.py
.