
Reimplement .polyfit() with apply_ufunc #5933

Closed
slevang wants to merge 2 commits

Conversation

@slevang (Contributor) commented Nov 3, 2021

Reimplement polyfit using apply_ufunc rather than dask.array.linalg.lstsq. This should solve a number of issues with memory usage and chunking that were reported on the current version of polyfit. The main downside is that variables chunked along the fitting dimension cannot be handled with this approach.

There is a bunch of fiddly code here for handling the differing outputs from np.polyfit depending on the values of the full and cov args. Depending on the performance implications, we could simplify some by keeping these in apply_ufunc and dropping later. Much of this parsing would still be required though, because the only way to get the covariances is to set cov=True, full=False.
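As a rough sketch of the approach (not the PR's actual code; the helper name `polyfit_sketch` and the exact argument handling are illustrative assumptions), the core pattern is to wrap a 1D `np.polyfit` call and let `apply_ufunc` vectorize it over all other dimensions:

```python
import numpy as np
import xarray as xr

def _wrapper(x, y, deg, skipna):
    # Fit a single 1D slice; optionally drop NaNs pointwise first.
    if skipna:
        mask = np.isfinite(y)
        x, y = x[mask], y[mask]
    return np.polyfit(x, y, deg)

def polyfit_sketch(da, dim, deg, skipna=True):
    # Assumes `dim` has a coordinate. apply_ufunc vectorizes the 1D fit
    # over every other dimension; the fitting dimension itself must not
    # be chunked for this to work.
    return xr.apply_ufunc(
        _wrapper,
        da[dim],
        da,
        input_core_dims=[[dim], [dim]],
        output_core_dims=[["degree"]],
        vectorize=True,
        dask="parallelized",
        output_dtypes=[np.float64],
        dask_gufunc_kwargs={"output_sizes": {"degree": deg + 1}},
        kwargs={"deg": deg, "skipna": skipna},
    )
```

With `dask="parallelized"`, each 1D fit runs independently, which is what avoids the memory and chunking problems of the `dask.array.linalg.lstsq` path; the trade-off, as noted above, is that the fitting dimension cannot be chunked. The real implementation additionally handles the `full` and `cov` outputs described above.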

A few minor departures from the previous implementation:

  1. The rank and singular_values diagnostic variables returned by np.polyfit are now returned on a pointwise basis, since these can change depending on which NaNs were skipped. np.polyfit also returns the rcond used for each fit, which I've included here.
  2. As mentioned above, this breaks fitting done along a chunked dimension. To avoid regression, we could set allow_rechunk=True and warn about memory implications.
  3. Changed default skipna=True, since the previous behavior seemed to be a limitation of the computational method.
  4. For consistency with the previous version, I included a transpose operation to put degree as the first dimension. This is arbitrary though, and actually the opposite of the ordering curvefit returns. So we could match curvefit, but that would be a breaking change for polyfit.

No new tests have been added, since the previous suite was fairly comprehensive. It would be great to get some performance reports on real-world data, such as the climate model detrending application in #5629.

@slevang slevang changed the title Polyfit with apply ufunc Remiplement .polyfit() with apply_ufunc Nov 3, 2021
@slevang slevang changed the title Remiplement .polyfit() with apply_ufunc Reimplement .polyfit() with apply_ufunc Nov 3, 2021
github-actions bot commented Nov 3, 2021

Unit Test Results

6 files, 6 suites, 58m 4s ⏱️
16 290 tests: 14 551 passed ✔️, 1 739 skipped 💤, 0 failed
90 936 runs: 82 738 passed ✔️, 8 198 skipped 💤, 0 failed

Results for commit 62b4637.

@mathause (Collaborator) left a comment

Looks good to me!

@@ -3867,11 +3867,10 @@ def polyfit(
             Degree of the fitting polynomial.
         skipna : bool, optional
             If True, removes all invalid values before fitting each 1D slices of the array.
-            Default is True if data is stored in a dask.array or if there is any
-            invalid values, False otherwise.
+            Default is True.

You could do

-        skipna : bool, optional
+        skipna : bool, default: True

instead.

            return tuple(
                np.full(len(var) * [order], np.nan) for var in output_core_dims
            )
        output = np.polyfit(x, y, deg, rcond=rcond, full=full, w=w, cov=cov)

numpy recommends using the Polynomial.fit <numpy.polynomial.polynomial.Polynomial.fit> class. Did you consider switching to this? (No requirement, just a question.)
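For context, a minimal sketch of the suggested API (nothing here is from the PR itself): `Polynomial.fit` performs the least-squares fit in a scaled window for numerical stability, so `convert()` is needed to read off coefficients in the conventional basis, and the coefficients come back in ascending order, unlike `np.polyfit`:

```python
import numpy as np
from numpy.polynomial import Polynomial

x = np.linspace(0.0, 10.0, 50)
y = 0.5 * x**2 + 3.0 * x + 2.0

# Fit in a scaled/shifted window for numerical stability, then convert
# back to the unscaled basis before reading off the coefficients.
p = Polynomial.fit(x, y, deg=2)
coeffs = p.convert().coef  # ascending order, approximately [2.0, 3.0, 0.5]
```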

output_core_dims = [("degree",)]
output_vars = ["{name}polyfit_coefficients"]

def _wrapper(x, y):

It might not be worth it, but you could avoid an if conditional in the inner loop as follows:

def _wrapper_skipna(x, y):
    ...

def _wrapper_noskipna(x, y):
    ...

if skipna:
    _wrapper = _wrapper_skipna
else:
    _wrapper = _wrapper_noskipna
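A hypothetical fleshed-out version of that suggestion (the wrapper bodies are illustrative, not the PR's real wrappers, which also handle the rcond/full/cov outputs), showing the skipna decision being made once rather than inside every per-point call:

```python
import numpy as np

def _make_wrapper(deg, skipna):
    # Choose the skipna branch once up front; the vectorized loop then
    # calls the returned function on every 1D slice without re-checking
    # the flag.
    def _wrapper_skipna(x, y):
        mask = np.isfinite(y)
        return np.polyfit(x[mask], y[mask], deg)

    def _wrapper_noskipna(x, y):
        return np.polyfit(x, y, deg)

    return _wrapper_skipna if skipna else _wrapper_noskipna
```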

@mathause (Collaborator) commented

@slevang are you still interested in continuing this PR? I think it would be a worthwhile addition and there should not be too much left to do. (What would be nice, however, is tests for the issues this fixes.)

@slevang (Contributor, PR author) commented Jan 18, 2022

@slevang are you still interested in continuing this PR? I think it would be a worthwhile addition and there should not be too much left to do. (What would be nice, however, is tests for the issues this fixes.)

Definitely! I got distracted is all, and @dcherian posted a nice solution in #5629 that could allow us to preserve the ability to fit along a chunked dimension using blockwise operations and the dask lstsq implementation used by the existing polyfit code.

I'm happy to pick this back up and finish it off if there is consensus on the right way forward, but the blockwise approach seemed promising so I put this on hold.

@Illviljan (Contributor) commented

@slevang Yeah, the blockwise approach does indeed seem nice. You're very welcome to continue with it in a different PR if you want to.

@slevang (Contributor, PR author) commented Jan 19, 2022

Not sure I understand the blockwise approach well enough to make a PR, but maybe I'll give it a try at some point.

@Illviljan (Contributor) commented

I think you can use a lot of @dcherian's code as a base and, for starters, see if it simply passes all the tests (including the ones you added here). If you make a draft PR, it's easier to help out if you get stuck.

@dcherian (Contributor) commented

The underlying problem should be solved by #9766.

@dcherian closed this Nov 19, 2024