-
-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
check_outlier
improvement (easystats/datawizard#177)
#443
Conversation
check_outlier
improvement (easystats/datawizard#177)
Codecov Report
@@ Coverage Diff @@
## main #443 +/- ##
==========================================
- Coverage 32.67% 31.58% -1.10%
==========================================
Files 80 80
Lines 4682 5047 +365
==========================================
+ Hits 1530 1594 +64
- Misses 3152 3453 +301
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
Thanks a lot, impressive PR! Indeed, it takes some time to read your explanation and look at the changes, but I'll try to do this the next days. |
One point: please check if |
Thanks so much. And no rush. We got time. And yes, I forgot to add the plot demo! Thank you for pointing that out. I have updated my reprex accordingly. 👍 |
Is there a reason you closed and deleted? @rempsyc |
I'm sorry I realized I named my branch |
The correct branch is here: https://github.com/rempsyc/performance/tree/check_outliers However, I don't see how I can merge them back to this PR or whether I should open a new PR with the correct name. I'm afraid opening a new PR will lose the existing discussion here. Maybe it's not a big deal to keep the wrong branch name after all. I'm sorry for this unexpected extra trouble! |
Maybe use orange as the color, rather than red? Or let's pick a better red. @strengejacke where are the insight colors specified? |
colour
Character vector, indicating the colour for printing. May be one of "red", "yellow", "green",
"blue", "violet", "cyan" or "grey". So there is no orange (yet). It would indeed be nice to know how to add more colours. |
I like the default R plotting palette |
Ok colours are defined here: https://github.com/easystats/insight/blob/18b5aaee8735ff97022b045cd4d81ffe7e207ff2/R/colour_tools.R E.g. .colour <- function(colour = "red", x) {
switch(colour,
red = .red(x),
yellow = .yellow(x),
green = .green(x),
[...]
)
} And individual colours each have their own function, e.g.: .red <- function(x) {
if (.supports_color()) {
x[!is.na(x)] <- paste0("\033[31m", x[!is.na(x)], "\033[39m")
}
x
} So definitely seems possible to add more choices. Should I attempt a PR to |
Maybe let's just swap the current color palette (which I think is the default color palette used by See I think the bright colors there look pretty good on both light and dark backgrounds. Another option that might be even better would be to support |
In any event, the color printing discussion can move to Another great enhancement in the future would be for this function to pull loo results for Stan based models |
Should I convert this PR to a draft to avoid an accidental merge? Given that there are still questions/issues to be resolved. Perhaps it would be helpful if I were to rephrase my earlier questions as simplified suggestions instead? Here’s an attempt:
Also note that lists/dataframes can’t be printed with Still, I would like to receive the “green light” before moving forward with the rest of the changes in case this is not the outcome you desire. Thoughts? |
That's good. A whole data frame would be too much in color I think |
Sorry I didn't see the questions at the bottom of the first post.
|
Great! I'll start working on that. For 3, sorry that it wasn't clear. I'm referring to the demo from the example in the help file: library(performance)
# Setup data
data <- datawizard::rownames_as_column(mtcars, var = "car")
# Add ID information
outliers_list <- performance::check_outliers(
data, method = c("mahalanobis", "mcd", "zscore"), ID = "car")
# Using `as.data.frame()`, we can access more details!
outliers_info <- as.data.frame(outliers_list)
head(outliers_info)
#> Distance_Zscore Outlier_Zscore Distance_Mahalanobis Outlier_Mahalanobis
#> 1 1.189901 0 8.946673 0
#> 2 1.189901 0 8.287933 0
#> 3 1.224858 0 8.937150 0
#> 4 1.122152 0 6.096726 0
#> 5 1.043081 0 5.429061 0
#> 6 1.564608 0 8.877558 0
#> Distance_MCD Outlier_MCD Outlier
#> 1 11.508353 0 0
#> 2 8.618865 0 0
#> 3 12.265382 0 0
#> 4 14.351997 0 0
#> 5 8.639128 0 0
#> 6 12.003840 0 0 Created on 2022-07-28 by the reprex package (v2.0.1) So here we see that the data frame resulting from Ultimately, that data frame is similar to my detailed list output, except that it prints the information for all observations, not only outliers. Thus why I was thinking of adding ID/row there as well for consistency.
If we are to print the detailed output for I don't think we need to change the default methods. Sure, most people using multiple methods might never see the changes and detailed output, but I think that's ok since it would mostly be useful to those using single methods anyway. Plus, default method for class numeric is already a single univariate method (zscore_robust). |
I think reduced output with "all" is good. And let's add the id as a column |
Thanks a lot, that's really impressive! I'll try to look at this PR asap. |
This is a lot of detail and work! Wow! If anyone wants me to look at something, can you @ me with a pointer to the spot? |
The message is formatted using |
I think the warnings need to be addressed, in particular the usage of |
btw, I don't understand the meaning of the |
I'm sorry, I was working on yet another version with the select helpers fixed,
The error argument was purely to show both behaviours in the long reprex, I've removed it already. (And thanks for fixing the test!) |
Merge branch 'check_model' of https://github.com/rempsyc/performance into check_model # Conflicts: # DESCRIPTION # R/check_outliers.R # man/check_outliers.Rd
OK! We’re almost there. I've hidden my long comments post, because all of the points have either been resolved or moved to their respective issue. Except for one thing, the We should definitely add more tests for Also, Dom mentioned
Would transforming my long reprex above in a vignette be useful at all? |
R CMD check is failing because I imported Error: Error: <callr_remote_error: Cannot install packages:
* deps::.: Can't install dependency datawizard> But current version is Note: no R CMD check error/warning/note locally. |
We can leave it that for the future, rather than a reprex of possible options I had in mind more like a walkthrough of why/how outliers detection, but there's no hurry, we can come back to that later. It's a lot of work already ;) |
You can use the |
All checks pass, so is here anything that needs to be done, or is it ready for merging? |
Cool, thanks again for the work! |
I was hoping to close all issues I opened with this PR, but I could only realistically close #466 and #469 (which was done automatically when you merged). #467, #468, and #470 are not resolved yet, but I can make a new PR once they are. For easystats/datawizard#216, should commit 2c58497 have closed it? Or is there something more you were trying to implement? Because as far as I'm concerned, it works now. |
For #468 and #470, I'm not planning on working on them because I don't know enough about the topic, so I labelled them as low priority for now. But would it be better to close them? They would still be archived and findable if someone eventually finds the motivation to address them but they wouldn't occupy precious space in the processing tray (open issues)... To be honest they are not a big priority for me but given that I expect #467 to resolve within half a month (or less). |
(Closes: #466, closes #469)
Context
This is a pull request aiming to improve the printing method of
check_outliers
, based on easystats/datawizard#177.Specifically, it aims to accomplish the following in the print output: (a) state the methods used; (b) state the thresholds used; and (c) state the variables tested. Additionally, it also aims to (d) report outliers per variable (for univariate methods), (e) report whether any observation comes up as outlier for several variables (when that is the case), and (f) include an optional ID variable along the row information. The changes were inspired by rempsyc::find_mad.
This is a prototype/proof of concept. (a) to (c) were implemented for all methods, but (d) to (f) were only implemented for method
"zscore"
for now. Before working on this further, I would like to get feedback to know whether it is worth implementing for other methods, and if modifications are needed before proceeding (as I would need to adapt the code to each method individually).Reprex
Reprex demo of the changes below:
check_model(model)
Created on 2022-07-01 by the reprex package (v2.0.1)
Observations
dplyr
, so it was a nice challenge attempting to convert everything to base R anddatawizard
. Feel free to make suggestions to improve the code.robust
in thresholds referring tomahalanobis_robust
, and I have changed it as such to avoid confusion with e.g.,zscore_robust
and to be consistent with method names to be able to refer to it back later. There was also a threshold calledzscore
but not one calledzscore_robust
, as the first one was used in both cases. Again, for clarity and compatibility with later code, I have madezscore_robust
its own.insight::print_color
(seems a limitation ofcat()
; below)? In any case, I think it shouldn’t be red anyway, that would be more for errors, so I picked yellow for now, as I feel it is functionally close and much more readable..check_outliers_zscore
again on individual columns withlapply
.easyverse
that I have missed. Open to suggestions for improvement.What's next
zscore
"zscore_robust
"iqr
"ci
"eti
"hdi
"bci
"cook
"pareto
"mahalanobis
"mahalanobis_robust
"mcd
"ics
"optics
"iforest
"lof
"check_outliers.BFBayesFactor
check_outliers.gls
Questions
- Based on the following methods and thresholds: mahalanobis (21.92), iqr (1.5), zscore (1.96).
.check_outliers_zscore
, etc.) so that this info is also part of the outlier info data frame (outliers_info
in the examples). Only if useful though.rempsyc::find_mad
tocheck_outliers
is that the former only uses one method (zscore_robust
), whereas the latter needs to support multiple methods, which complicates the output formatting, especially for the per-variable section.method = "all"
, because then the output will be very long. Perhaps we could only print (and compute) the second per-variable part with an optional argument,detailed = TRUE
(or the like) passed tocheck_outliers
.Looking forward to your comments and feedback.