
check_outliers: Incorrect zscore values when using vector instead of dataframe #476

Closed
rempsyc opened this issue Sep 13, 2022 · 0 comments · Fixed by #482
Labels: Bug 🐛 Something isn't working


rempsyc commented Sep 13, 2022

TL;DR: check_outliers currently provides incorrect zscore values when using a vector instead of a dataframe.

Optional reprex below:


library(performance)
packageVersion("performance")
#> [1] '0.9.2.2'

Let’s get the current zscore distance:

x <- as.data.frame(mtcars$mpg)
z <- check_outliers(x, method = "zscore", threshold = 1)
z.att <- attributes(z)
d1 <- z.att$data$Distance_Zscore

Compare this to the underlying calculation in the function, which yields continuous scores:

d2 <- abs(as.data.frame(sapply(x, function(x) (x - mean(x, na.rm = TRUE)) / stats::sd(x, na.rm = TRUE))))

# Comparison
cbind(d1, d2, d1 == d2)
#>          d1 mtcars$mpg mtcars$mpg
#> 1  1.000000 0.15088482      FALSE
#> 2  1.000000 0.15088482      FALSE
#> 3  1.000000 0.44954345      FALSE
#> 4  1.000000 0.21725341      FALSE
#> 5  1.000000 0.23073453      FALSE
#> 6  1.000000 0.33028740      FALSE
#> 7  1.000000 0.96078893      FALSE
#> 8  1.000000 0.71501778      FALSE
#> 9  1.000000 0.44954345      FALSE
#> 10 1.000000 0.14777380      FALSE
#> 11 1.000000 0.38006384      FALSE
#> 12 1.000000 0.61235388      FALSE
#> 13 1.000000 0.46302456      FALSE
#> 14 1.000000 0.81145962      FALSE
#> 15 1.607883 1.60788262       TRUE
#> 16 1.607883 1.60788262       TRUE
#> 17 1.000000 0.89442035      FALSE
#> 18 2.042389 2.04238943       TRUE
#> 19 1.710547 1.71054652       TRUE
#> 20 2.291272 2.29127162       TRUE
#> 21 1.000000 0.23384555      FALSE
#> 22 1.000000 0.76168319      FALSE
#> 23 1.000000 0.81145962      FALSE
#> 24 1.126710 1.12671039       TRUE
#> 25 1.000000 0.14777380      FALSE
#> 26 1.196190 1.19619000       TRUE
#> 27 1.000000 0.98049211      FALSE
#> 28 1.710547 1.71054652       TRUE
#> 29 1.000000 0.71190675      FALSE
#> 30 1.000000 0.06481307      FALSE
#> 31 1.000000 0.84464392      FALSE
#> 32 1.000000 0.21725341      FALSE

Values under 1 are somehow converted to exactly 1. How is this possible?

It seems to be an artifact of the column aggregation procedure, which derives an overall distance score from the maximum across all columns. When a single column is provided, it does not produce the expected result (this is the current code):

Distance_Zscore <- sapply(as.data.frame(t(d2)), max, na.omit = TRUE, na.rm = TRUE)
Distance_Zscore
#>       V1       V2       V3       V4       V5       V6       V7       V8 
#> 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 
#>       V9      V10      V11      V12      V13      V14      V15      V16 
#> 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.607883 1.607883 
#>      V17      V18      V19      V20      V21      V22      V23      V24 
#> 1.000000 2.042389 1.710547 2.291272 1.000000 1.000000 1.000000 1.126710 
#>      V25      V26      V27      V28      V29      V30      V31      V32 
#> 1.000000 1.196190 1.000000 1.710547 1.000000 1.000000 1.000000 1.000000
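The likely culprit is the stray `na.omit = TRUE` argument: `max()` has no such argument (and `na.omit` does not partially match `na.rm`), so it falls into `...`, where the logical `TRUE` is coerced to 1 and included in the maximum. A minimal demonstration of this R behavior:

```r
# TRUE falls into max()'s `...` and is coerced to 1,
# so any value below 1 is clamped to 1
max(0.5, na.omit = TRUE, na.rm = TRUE)
#> [1] 1
max(2.3, na.omit = TRUE, na.rm = TRUE)
#> [1] 2.3
```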

Instead, we can use the following strategy, already used elsewhere in check_outliers:

Distance_Zscore <- sapply(as.data.frame(t(d2)), function(x) {
  ifelse(all(is.na(x)), NA, max(x, na.rm = TRUE))
})
Distance_Zscore
#>         V1         V2         V3         V4         V5         V6         V7 
#> 0.15088482 0.15088482 0.44954345 0.21725341 0.23073453 0.33028740 0.96078893 
#>         V8         V9        V10        V11        V12        V13        V14 
#> 0.71501778 0.44954345 0.14777380 0.38006384 0.61235388 0.46302456 0.81145962 
#>        V15        V16        V17        V18        V19        V20        V21 
#> 1.60788262 1.60788262 0.89442035 2.04238943 1.71054652 2.29127162 0.23384555 
#>        V22        V23        V24        V25        V26        V27        V28 
#> 0.76168319 0.81145962 1.12671039 0.14777380 1.19619000 0.98049211 1.71054652 
#>        V29        V30        V31        V32 
#> 0.71190675 0.06481307 0.84464392 0.21725341

Or, if we want to avoid sapply and transposition:

Distance_Zscore <- apply(d2, 1, function(x) {
  ifelse(all(is.na(x)), NA, max(x, na.rm = TRUE))
})
Distance_Zscore
#>  [1] 0.15088482 0.15088482 0.44954345 0.21725341 0.23073453 0.33028740
#>  [7] 0.96078893 0.71501778 0.44954345 0.14777380 0.38006384 0.61235388
#> [13] 0.46302456 0.81145962 1.60788262 1.60788262 0.89442035 2.04238943
#> [19] 1.71054652 2.29127162 0.23384555 0.76168319 0.81145962 1.12671039
#> [25] 0.14777380 1.19619000 0.98049211 1.71054652 0.71190675 0.06481307
#> [31] 0.84464392 0.21725341

This also works with a full data frame:

x <- as.data.frame(mtcars)
z <- check_outliers(x, method = "zscore", threshold = 1)
z.att <- attributes(z)
d1 <- z.att$data$Distance_Zscore

d <- abs(as.data.frame(sapply(x, function(x) (x - mean(x, na.rm = TRUE)) / stats::sd(x, na.rm = TRUE))))

d2 <- apply(d, 1, function(x) {
  ifelse(all(is.na(x)), NA, max(x, na.rm = TRUE))
})

d3 <- sapply(as.data.frame(t(d)), max, na.omit = TRUE, na.rm = TRUE)

# Comparison
cbind(d1, d2, d3, d1 == d2)
#>           d1       d2       d3  
#> V1  1.189901 1.189901 1.189901 1
#> V2  1.189901 1.189901 1.189901 1
#> V3  1.224858 1.224858 1.224858 1
#> V4  1.122152 1.122152 1.122152 1
#> V5  1.043081 1.043081 1.043081 1
#> V6  1.564608 1.564608 1.564608 1
#> V7  1.433903 1.433903 1.433903 1
#> V8  1.235180 1.235180 1.235180 1
#> V9  2.826755 2.826755 2.826755 1
#> V10 1.116036 1.116036 1.116036 1
#> V11 1.116036 1.116036 1.116036 1
#> V12 1.014882 1.014882 1.014882 1
#> V13 1.014882 1.014882 1.014882 1
#> V14 1.014882 1.014882 1.014882 1
#> V15 2.077505 2.077505 2.077505 1
#> V16 2.255336 2.255336 2.255336 1
#> V17 2.174596 2.174596 2.174596 1
#> V18 2.042389 2.042389 2.042389 1
#> V19 2.493904 2.493904 2.493904 1
#> V20 2.291272 2.291272 2.291272 1
#> V21 1.224858 1.224858 1.224858 1
#> V22 1.564608 1.564608 1.564608 1
#> V23 1.014882 1.014882 1.014882 1
#> V24 1.433903 1.433903 1.433903 1
#> V25 1.365821 1.365821 1.365821 1
#> V26 1.310481 1.310481 1.310481 1
#> V27 1.778928 1.778928 1.778928 1
#> V28 1.778928 1.778928 1.778928 1
#> V29 1.874010 1.874010 1.874010 1
#> V30 1.973440 1.973440 1.973440 1
#> V31 3.211677 3.211677 3.211677 1
#> V32 1.224858 1.224858 1.224858 1

For a full data frame, then, the old and new methods match, even when the distance is recalculated with the new method I am proposing.
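A plausible reason the two agree here: across the 11 mtcars columns, every row's maximum absolute z-score already exceeds 1, so a stray value of 1 slipped into `max()` never changes the result. A quick check (using `scale()`, which computes the same z-scores as the `sapply()` calculation above):

```r
# smallest row-wise maximum absolute z-score across all mtcars columns
d <- abs(as.data.frame(scale(mtcars)))
min(apply(d, 1, max))
#> [1] 1.014882
```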


I propose to submit a PR correcting this as soon as #474 is merged.

Created on 2022-09-13 by the reprex package (v2.0.1)

@rempsyc rempsyc added the Bug 🐛 Something isn't working label Sep 13, 2022
@rempsyc rempsyc self-assigned this Sep 13, 2022
@rempsyc rempsyc mentioned this issue Sep 30, 2022