check_outliers and method = "optics": find automatic way of setting 'xi' #468

Closed
rempsyc opened this issue Aug 16, 2022 · 1 comment
Labels
Low priority 😴 This issue can be easily workaround or happens only in edge cases

Comments

@rempsyc
Member

rempsyc commented Aug 16, 2022

While working on PR #443, I noticed the note below in .check_outliers_optics. I thought it would be better to turn it into a proper issue so we don't forget about it.

 # TODO: find automatic way of setting 'xi'

To be honest, I'm not sure how to tackle this, or who left the note (and whether that person plans to come back to it later or expects someone else to take over eventually).

While working on the PR, I tested each method to make sure it still worked correctly. During that testing, I think I saw why the xi parameter can cause problems for method = "optics" in some situations. Let's take the mtcars dataset as an example.

library(performance)
check_outliers(mtcars, method = "optics")
#> OK: No outliers detected.
# No outlier

# default threshold is:
2 * ncol(mtcars)
#> [1] 22
# That's 22. Let's change it to be more conservative so we can attempt to find outliers!

check_outliers(mtcars, method = "optics", threshold = 25)
#> Warning in dbscan::extractXi(rez, xi = 0.05): No clusters were found with
#> threshold: 0.05
#> OK: No outliers detected.
# Whoops, we're getting a warning. But still no outliers! Not conservative enough?

check_outliers(mtcars, method = "optics", threshold = 32)
#> Error in dbscan::kNN(x, k, sort = TRUE, ...): Not enough neighbors in data set!
# Whoops, we're getting an error now. What if we try just one value smaller so we don't get an error?
# Surely we will find an outlier then?

check_outliers(mtcars, method = "optics", threshold = 31)
#> Warning in dbscan::extractXi(rez, xi = 0.05): No clusters were found with
#> threshold: 0.05
#> OK: No outliers detected.
# Still a warning but still no outliers!!

# Conclusion: this method cannot find outliers on this dataset. Is that expected?
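
The error at `threshold = 32` looks like a size constraint rather than an xi problem: `mtcars` has 32 rows, and the nearest-neighbour search apparently needs fewer neighbours than there are rows (31 runs, 32 does not). Below is a minimal sketch of a guard that would turn this into a clearer message. It assumes `threshold` is passed on unchanged as the neighbour count, and `check_optics_threshold()` is a hypothetical helper, not something that exists in performance or dbscan.

```r
# Hypothetical helper, not part of performance or dbscan.
# Assumption: the neighbour search requires threshold < nrow(data).
check_optics_threshold <- function(data, threshold = 2 * ncol(data)) {
  n <- nrow(data)
  if (threshold >= n) {
    stop(
      "threshold (", threshold, ") must be smaller than the number of rows (", n, ").",
      call. = FALSE
    )
  }
  # Return the validated value invisibly so the call can be used inline
  invisible(threshold)
}

check_optics_threshold(mtcars)                  # default 2 * ncol(mtcars) = 22, passes silently
check_optics_threshold(mtcars, threshold = 31)  # largest value that passes with 32 rows
# check_optics_threshold(mtcars, threshold = 32) would stop with the message above
```

A check like this would only address the error, not the "No clusters were found" warning, which depends on xi.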

Created on 2022-08-16 by the reprex package (v2.0.1)
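
To look at the xi question in isolation, one option is to bypass `check_outliers()` and call dbscan directly with several xi values. The sketch below assumes the dbscan package is installed and that performance runs `dbscan::optics()` with `minPts` set to `threshold` before calling `dbscan::extractXi()` (the warnings above suggest the latter). The xi grid is arbitrary and only for illustration, and any warning from `extractXi()` is treated as "no clusters found".

```r
library(dbscan)

x <- as.matrix(mtcars)

# Assumption: `threshold` in check_outliers() corresponds to `minPts` here
rez <- optics(x, minPts = 2 * ncol(x))

# Arbitrary illustrative grid; 0.05 is the value shown in the warnings above
xi_grid <- c(0.01, 0.05, 0.1, 0.2)

# For each xi, record whether extractXi() ran without raising a warning
clusters_found <- vapply(xi_grid, function(xi) {
  tryCatch(
    {
      extractXi(rez, xi = xi)
      TRUE
    },
    warning = function(w) FALSE
  )
}, logical(1))

xi_scan <- data.frame(xi = xi_grid, clusters_found = clusters_found)
xi_scan
```

If no xi in a reasonable range yields clusters for a given `minPts`, that would suggest the problem is the data or `minPts` rather than the hard-coded xi, which is worth knowing before trying to set xi automatically.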

@rempsyc rempsyc added the Low priority 😴 This issue can be easily workaround or happens only in edge cases label Aug 25, 2022
@strengejacke
Member

See #443 (comment)
