
Default value for correction parameter (aka. ddof)? #695

Closed
vnmabus opened this issue Oct 4, 2023 · 3 comments
Labels
Question General question.

Comments


vnmabus commented Oct 4, 2023

I see that in the standard the ddof parameter of the var and std functions has been standardized under the name correction, as the prior name was not very clear.

However, I saw no discussion of the default value of that parameter. The standard uses 0 as the default, corresponding to the biased estimator. This is at odds with the common behavior in other languages, such as R, MATLAB, or Julia. Even within Python there does not seem to be a consensus: NumPy (and some libraries inspired by it) uses the biased estimator, while Pandas, Polars, and some libraries that evolved separately, such as PyTorch, use the unbiased estimator by default.
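For illustration, here is a minimal sketch of the differing defaults (assuming current NumPy, Pandas, and PyTorch behavior):

```python
import numpy as np
import pandas as pd
import torch

data = [1.0, 2.0, 3.0, 4.0]

# NumPy defaults to ddof=0 (biased estimator: divide by N).
print(np.var(np.asarray(data)))       # 1.25

# Pandas defaults to ddof=1 (unbiased estimator: divide by N - 1).
print(pd.Series(data).var())          # 1.666...

# PyTorch is also unbiased by default.
print(torch.var(torch.tensor(data)))  # tensor(1.6667)
```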

In statistics, normalizing by $N-1$ is usually the preferred option, and nowadays the estimator is sometimes even defined with that denominator.
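Concretely, the two estimators in question are

$$
\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2 \quad (\text{correction} = 0),
\qquad
s^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2 \quad (\text{correction} = 1).
$$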

Can someone explain to me what the advantages of using 0 as the default are? Right now I only see potential for confusion and bugs from people who obtain different results in different languages. Is it too late to change it?

kgryte commented Oct 4, 2023

@vnmabus Thanks for filing this issue. For context, the signatures for var and std were derived by analyzing the APIs across array libraries within the PyData ecosystem. In this case, the dominant default was ddof=0. The lone exception was PyTorch, which previously used a bool. TensorFlow does not provide kwarg support in its reduce_std API.
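(As a side note, PyTorch's older boolean flag looked roughly like this; `unbiased` is the keyword it used before adopting `correction`:)

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0])

# PyTorch's older API toggled the correction with a bool rather than
# accepting an integer ddof/correction value.
torch.var(x, unbiased=True)   # divide by N - 1 (the default)
torch.var(x, unbiased=False)  # divide by N
```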

Accordingly, when standardizing, we went with the default value used by the majority of PyData libraries in order to best preserve backward compatibility.

Compatibility with other languages or non-array libraries is not an explicit goal of the standard. It is something we consider more generally, but not to the degree that cross-language compatibility takes priority over other considerations, such as minimizing ecosystem breakage.

Regardless, in this case I'd argue that it is better for correction to default to 0, as users should be explicit about which correction they want to apply. While a correction of 1 is common, it is by no means the right choice for all applications, so opting for the base case of 0 makes more sense to me.
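To make that concrete, a minimal sketch of being explicit rather than relying on the default (shown with NumPy's ddof keyword, which the standard spells correction):

```python
import numpy as np

x = np.asarray([1.0, 2.0, 3.0, 4.0])

# Be explicit about the correction instead of relying on the default:
np.var(x, ddof=0)  # population variance, divide by N      -> 1.25
np.var(x, ddof=1)  # sample variance, divide by N - 1      -> 1.666...
```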

kgryte commented Oct 4, 2023

One more comment: now that the specification has standardized the default for correction as 0, we're unlikely to change it, as doing so would be a breaking change and would likely negatively impact downstream libraries that already use the Array API and have come to expect the current behavior.

kgryte added the Question label on Oct 4, 2023
kgryte commented Nov 16, 2023

I'll go ahead and close this. Due to backward-compatibility concerns, I don't believe we could change the default correction value even if we wanted to.
