`check_outliers`: Incorrect zscore values when using vector instead of dataframe
#476
Labels
Bug 🐛
TL;DR: `check_outliers` currently provides incorrect zscore values when using a vector instead of a dataframe.
Optional reprex below:
Let’s get the current zscore distance:
Compare this to the underlying calculation in the function, which provides continuous scores:
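A minimal sketch of such a calculation, assuming a robust z-score based on the median and MAD. The data and variable names are illustrative and are not taken from the package or from the original reprex:

```r
# Illustrative data (an assumption, not the original reprex data)
x <- c(2.1, 2.4, 2.2, 2.8, 2.5, 9.0)

# Robust z-score: absolute deviation from the median, scaled by the MAD
z <- abs((x - stats::median(x, na.rm = TRUE)) / stats::mad(x, na.rm = TRUE))

# Scores are continuous, and several of them fall below 1
z
```

With a calculation like this, nothing forces scores below 1 up to exactly 1, so any such floor has to come from a later step.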
Values under 1 are converted to 1 automatically somehow. How is this possible?
It seems to be an artifact of the column aggregation procedure, which computes an overall distance score as the max across all columns. When a single column is provided, it does not produce the expected result (this is the current code):
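One way a floor at 1 can arise in R, offered here as an assumption about the mechanism rather than a confirmed diagnosis of the package code: `max()` treats every argument other than `na.rm` as a value to compare, so a stray named logical such as `TRUE` is coerced to 1 and joins the comparison. The argument name `na.omit` below is purely illustrative:

```r
# Illustrative distances for a single column (an assumption, not package data)
d <- c(0.2, 0.7, 1.5, 3.1)

# `na.omit = TRUE` is not a real argument of max(); it lands in `...`
# and is compared as the value 1, so every element is floored at 1
floored <- sapply(d, max, na.omit = TRUE, na.rm = TRUE)
floored

# Passing only `na.rm` leaves the values untouched
intact <- sapply(d, max, na.rm = TRUE)
intact
```

With several columns, the same stray value would only matter for rows whose largest score is below 1, which would make the problem most visible in the single-column case.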
We can use the following strategy, used elsewhere in `check_outliers`, instead:
Or, if we want to avoid `sapply` and transposition:
This also works with a full data frame:
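A minimal sketch of a row-wise maximum that avoids `sapply` and transposition and behaves the same for a vector and a data frame. The helpers `zscore()` and `row_max()` are hypothetical names introduced for illustration, not functions from the package:

```r
# Hypothetical helper: absolute (non-robust) z-score of a numeric vector
zscore <- function(v) abs((v - mean(v, na.rm = TRUE)) / stats::sd(v, na.rm = TRUE))

# Hypothetical helper: row-wise max over one or more columns
row_max <- function(d) {
  d <- as.data.frame(d) # a bare vector becomes a one-column data frame
  apply(as.matrix(d), 1, max, na.rm = TRUE)
}

# Single vector: the row-wise max is just the z-score itself
vec_d <- zscore(mtcars$mpg)
head(row_max(vec_d))

# Data frame: the row-wise max across all standardized columns
df_d <- as.data.frame(lapply(mtcars[c("mpg", "hp")], zscore))
head(row_max(df_d))
```

Because `apply()` works over rows directly, a single column is returned unchanged instead of being aggregated against anything else.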
For a data frame, then, the old and new methods match, even if we recalculate the distance with the new method I am proposing.
I propose to submit a PR correcting this as soon as #474 is merged.
Created on 2022-09-13 by the reprex package (v2.0.1)