
Default value for correction parameter (aka. ddof)? #695

Closed
vnmabus opened this issue Oct 4, 2023 · 3 comments
Labels
Question General question.

Comments


vnmabus commented Oct 4, 2023

I see that in the standard the ddof parameter of the var and std functions has been standardized under the name correction, as the prior name was not very clear.

However, I saw no discussion of the default value of that parameter. The standard uses 0 as the default, corresponding to the biased estimator. This is at odds with the common behavior in other languages, such as R, MATLAB, or Julia. Even within Python there does not seem to be a consensus: NumPy (and some libraries inspired by it) uses the biased estimator, while Pandas, Polars, and some libraries that evolved separately, such as PyTorch, use the unbiased estimator by default.
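For illustration, here is a minimal sketch of the differing defaults (assuming current NumPy, Pandas, and PyTorch behavior):

```python
import numpy as np
import pandas as pd
import torch

data = [1.0, 2.0, 3.0, 4.0]

# NumPy defaults to ddof=0 (biased estimator: divide by N).
print(np.var(np.asarray(data)))       # 1.25

# Pandas defaults to ddof=1 (unbiased estimator: divide by N - 1).
print(pd.Series(data).var())          # 1.666...

# PyTorch is also unbiased by default.
print(torch.var(torch.tensor(data)))  # tensor(1.6667)
```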

In statistics, normalizing by $N-1$ is usually the preferred option, and nowadays the estimator is sometimes even defined with that denominator.
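Concretely, the two estimators in question are

$$
\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2 \quad (\text{correction} = 0),
\qquad
s^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2 \quad (\text{correction} = 1).
$$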

Can someone explain to me what the advantages of using 0 as the default are? Right now I only see potential for confusion and bugs from people who obtain different results in different languages. Is it too late to change it?

kgryte commented Oct 4, 2023

@vnmabus Thanks for filing this issue. For context, the signatures for var and std were derived by analyzing the APIs across array libraries within the PyData ecosystem. In this case, the dominant default was ddof=0. The lone exception was PyTorch, which previously used a bool. TensorFlow does not provide kwarg support in its reduce_std API.
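(As a side note, PyTorch's older boolean flag looked roughly like this; `unbiased` is the keyword it used before adopting `correction`:)

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0])

# PyTorch's older API toggled the correction with a bool rather than
# accepting an integer ddof/correction value.
torch.var(x, unbiased=True)   # divide by N - 1 (the default)
torch.var(x, unbiased=False)  # divide by N
```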

Accordingly, when standardizing, we went with the default value used by the majority of PyData libraries in order to best preserve backward compatibility.

Compatibility with other languages or non-array libraries is not an explicit goal of the standard. It is something we consider more generally, but not to the degree that cross-language compatibility takes priority over other considerations, such as minimizing ecosystem breakage.

Regardless, in this case I'd argue that it is better for correction to default to 0, as users should be explicit about which correction they want to apply. While a correction of 1 is common, it is by no means the right choice for all applications, so opting for the base case of 0 makes more sense to me.
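To make that concrete, a minimal sketch of being explicit rather than relying on the default (shown with NumPy's ddof keyword, which the standard spells correction):

```python
import numpy as np

x = np.asarray([1.0, 2.0, 3.0, 4.0])

# Be explicit about the correction instead of relying on the default:
np.var(x, ddof=0)  # population variance, divide by N      -> 1.25
np.var(x, ddof=1)  # sample variance, divide by N - 1      -> 1.666...
```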

kgryte commented Oct 4, 2023

One more comment: now that the specification has standardized the default for correction as 0, we're unlikely to change it, as doing so would be a breaking change and would likely negatively impact downstream libraries that already use the Array API and have come to expect the current behavior.

kgryte added the Question label on Oct 4, 2023
kgryte commented Nov 16, 2023

I'll go ahead and close this. Due to backward-compatibility concerns, I don't believe we could change the default correction value even if we wanted to.
