Adding Improved Sheather-Jones bandwidth and reflection KDE #125
Comments
Sorry for the late reply, I had to do some reading about this. Also, I am not the original maintainer, but I have commit rights and I am happy to review and merge improvements. I think that the API is ripe for a minor redesign. I would prefer something like this: kde([bw,] [kernel,] sample) where
I am not sure whether we should export The only exported symbols would be Multivariate API would be the same, but using I think that we should extend this package with proposed contributions like yours in mind, so that they can be incorporated easily. Given that it is still the de facto KDE package in Julia, it should get some attention. This is just a thought, please comment. The proposed API reflects my personal preferences and may not be everyone's first choice.
Thanks @tpapp for the thoughtful response. In general, I like the proposed API changes. At least, I prefer them over the current API. I have a slight preference for the argument order kde(sample [,bw] [,kernel,]) with support for providing
This is the one choice I'm not a fan of, since KDE reflection involves no change to the kernel whatsoever (unlike other boundary correction methods). I wonder if it makes sense to have an additional module Of course, it would maybe be confusing where to put boundary correction methods that use specialized location-specific kernels. For bandwidth, we should take care that the internal interface would support e.g. adaptive bandwidth methods.
@sethaxen: Actually, upon reflection, I am not sure that my proposal for 3 positional arguments is good style. I would just keep one (for the data), and have
as keyword arguments exposed in
You convinced me, I agree. Practically, this is how I would sketch it:

    function kde(data; kernel, bandwidth_method, boundary)
        bandwidth = calculate_bandwidth(data, bandwidth_method)
        calculate_kde(data, bandwidth, kernel, boundary)
    end

where we might as well just expose (Sorry to spend so much time bikeshedding the API, but I think it is worth it).
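To make the sketch above concrete, here is a minimal, self-contained version that runs with only the standard library. It is an illustration under assumptions, not this package's implementation: `silverman_bandwidth`, `gaussian_kernel`, the keyword name `bandwidth`, and the naive closure-based `calculate_kde` are all hypothetical, and the `boundary` keyword is left out to keep it short. It shows one way a bandwidth could be passed either as a number or as a function of the data.

```julia
using Statistics

# Hypothetical bandwidth selector (Silverman's rule of thumb); any function
# mapping a sample to a positive real would work in its place.
function silverman_bandwidth(data::AbstractVector{<:Real})
    q25, q75 = quantile(data, [0.25, 0.75])
    width = min(std(data), (q75 - q25) / 1.34)
    return 0.9 * width * length(data)^(-0.2)
end

# Standard normal density, used here as the default kernel.
gaussian_kernel(u) = exp(-u^2 / 2) / sqrt(2π)

# A bandwidth may be given directly as a number or as a function of the data.
calculate_bandwidth(data, bw::Real) = float(bw)
calculate_bandwidth(data, method) = method(data)

# Naive evaluation (O(N) per point); returns the density estimate as a closure.
function calculate_kde(data, bandwidth, kernel)
    n = length(data)
    return x -> sum(xi -> kernel((x - xi) / bandwidth), data) / (n * bandwidth)
end

function kde(data::AbstractVector{<:Real};
             kernel = gaussian_kernel,
             bandwidth = silverman_bandwidth)
    bw = calculate_bandwidth(data, bandwidth)
    return calculate_kde(data, bw, kernel)
end

data = [-1.2, -0.4, 0.0, 0.3, 0.9, 1.5]
f = kde(data)                   # bandwidth chosen by the default selector
g = kde(data; bandwidth = 0.5)  # fixed bandwidth supplied by the user
```

Dispatching on the type of the `bandwidth` argument keeps `kde` itself unchanged when new selectors are added, which is the kind of extensibility discussed above. A real implementation would presumably return a density object evaluated on a grid rather than a closure.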
This sounds promising. To make sure the API is sufficiently flexible, it would be good to have an implementation using an adaptive bandwidth and one with an adaptive kernel. For multivariate KDE, IIRC a bandwidth selection method can provide a bandwidth matrix, not just a tuple of bandwidths for each direction. I'll double check. I'll then also prototype a package from scratch to test the API, and we can then make the necessary refactor here. Another consideration perhaps is the calculation method for the KDE. Might be worth defining a few methods, including
I completely agree!
I propose adding a new function implementing the "Improved Sheather-Jones" bandwidth selector, which works well for multimodal distributions, and a new function `kde_reflected`, which works well for bounded distributions.
Background
Silverman's bandwidth (used here as the default) is a slightly undersmoothed variant of the optimal bandwidth if the true density is normal. It's known to be quite a bad choice for multimodal distributions with well-separated modes. The "improved Sheather-Jones" bandwidth, implemented e.g. in KDEpy and arviz, works well for a wide variety of distributions (see Table 1 of the paper), including those with well-separated modes and that depart from normality. It matches or outperforms the Sheather-Jones bandwidth selector, which is recommended by R's `density` and used as the default in ggdist. PosteriorStats.jl's main branch implements this as `isj_bandwidth`.
Also, the standard KDE as implemented here estimates the true density poorly near the bounds of bounded densities with large amounts of density near the bounds (e.g. a half-normal). While many KDE variants exist for boundary correction, many of them use specialized location-dependent kernels. A much simpler solution, which is used e.g. by KDEpy and arviz, is boundary correction via reflection/mirroring, where the density outside of the bounds is reflected at the bounds and added to the density near the bounds. Given a standard KDE `f` and lower and upper bounds `l` and `u`, the reflected KDE is `f_reflected(x) = f(x) + f(2l - x) + f(2u - x)`. This approach performs quite well and is simple to implement. On PosteriorStats.jl's main branch, this is implemented as `kde_reflected`.
Proposed changes
I propose adding both of these to this package. Currently `default_bandwidth` is not part of the API. Perhaps we could rename `default_bandwidth` to `bandwidth_silverman` and then have `default_bandwidth` call this function? Then `bandwidth_silverman` and `bandwidth_isj` could be added to the API. It would also be convenient to be able to provide just a bandwidth function (with signature `f(::AbstractVector{<:Real})::Real`) to `bandwidth`, which is then called by the KDE function.
Currently this package already includes `kde` and `kde_lscv` instead of trying to have a single `kde` function with many options. Perhaps it makes sense to include a `kde_reflected` method as well, which would be more-or-less as implemented in PosteriorStats.jl. This would default to using the bounds of the data as the bounds of the reflected density but allows the user to specify natural bounds.
Examples
ISJ bandwidth vs. Silverman bandwidth
This example shows that for normal-like distributions, the two bandwidths perform similarly, but for distributions with heavier tails or well-separated modes, the ISJ bandwidth is better.
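As a rough illustration of why a rule-of-thumb bandwidth struggles with well-separated modes, the sketch below computes Silverman's rule on a single narrow cluster and on two such clusters placed far apart. The function name `silverman_bandwidth` and the deterministic, evenly spaced "samples" are assumptions made for the example; the formula is the commonly stated rule of thumb, not necessarily identical to this package's `default_bandwidth`. ISJ itself is not implemented here.

```julia
using Statistics

# Silverman's rule of thumb as commonly stated:
# 0.9 * min(std, IQR / 1.34) * n^(-1/5).
function silverman_bandwidth(data::AbstractVector{<:Real})
    q25, q75 = quantile(data, [0.25, 0.75])
    width = min(std(data), (q75 - q25) / 1.34)
    return 0.9 * width * length(data)^(-0.2)
end

# Deterministic stand-ins for samples: evenly spaced points in two narrow
# clusters centred at -10 and 10, each of total width 1.
cluster = collect(range(-0.5, 0.5; length = 500))
bimodal = vcat(cluster .- 10, cluster .+ 10)

bw_single = silverman_bandwidth(cluster)   # one mode on its own
bw_bimodal = silverman_bandwidth(bimodal)  # both modes together
```

Because both the standard deviation and the IQR of the combined sample are dominated by the distance between the modes, `bw_bimodal` comes out more than ten times larger than `bw_single` and wider than either cluster, so each mode would be heavily oversmoothed.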
Reflected KDE vs. standard KDE
This example demonstrates that near the bounds of bounded densities, the reflected KDE performs better than the standard KDE, while for unbounded densities the two perform similarly. The reflected KDE will tend to overestimate the density near a bound if the density decays quickly to 0 near the bound.
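The reflection formula `f(x) + f(2l - x) + f(2u - x)` can be sketched in a few lines. This is a simplified illustration, not the PosteriorStats.jl code: `gaussian_kde` and `reflected_kde` are hypothetical names, the KDE is a naive fixed-bandwidth Gaussian one, and the data are deterministic quantiles of an exponential distribution (true density 1 at the lower bound 0) rather than the half-normal mentioned in the issue.

```julia
# Naive Gaussian KDE with a fixed bandwidth, returned as a closure.
function gaussian_kde(data::AbstractVector{<:Real}, bw::Real)
    n = length(data)
    return x -> sum(xi -> exp(-((x - xi) / bw)^2 / 2), data) / (n * bw * sqrt(2π))
end

# Reflection about the bounds: f(x) + f(2l - x) + f(2u - x) inside [l, u],
# and zero outside. An infinite bound contributes no reflected term.
function reflected_kde(data::AbstractVector{<:Real}, bw::Real;
                       lower::Real = minimum(data), upper::Real = maximum(data))
    f = gaussian_kde(data, bw)
    return function (x)
        (lower <= x <= upper) || return 0.0
        y = f(x)
        isfinite(lower) && (y += f(2 * lower - x))
        isfinite(upper) && (y += f(2 * upper - x))
        return y
    end
end

# Deterministic "sample": quantiles of an exponential distribution with rate 1.
data = [-log(1 - (i - 0.5) / 200) for i in 1:200]

f = gaussian_kde(data, 0.2)
r = reflected_kde(data, 0.2; lower = 0.0, upper = Inf)
```

At the bound, `r(0.0)` is exactly twice `f(0.0)`, which recovers the mass that the standard estimate leaks below zero. The bounds default to the data extrema, matching the behaviour proposed above, while `lower` and `upper` let the caller supply natural bounds.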
Comparison of bandwidth selector runtimes
This benchmark shows that ISJ is slower than Silverman for smallish samples but no slower for large samples. I'm guessing this is because ISJ is `O(N)`, while the sort required to compute the IQR in `default_bandwidth` is, I think, `O(N * log(N))`.