Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Improved Sheather-Jones bandwidth and reflection KDE #125

Open
sethaxen opened this issue Nov 5, 2024 · 4 comments
Open

Adding Improved Sheather-Jones bandwidth and reflection KDE #125

sethaxen opened this issue Nov 5, 2024 · 4 comments

Comments

@sethaxen
Copy link

sethaxen commented Nov 5, 2024

I propose adding a new function implementing the "Improved Sheather-Jones" bandwidth selector, which works well for multimodal distributions, and a new function kde_reflected, which works well for bounded distributions.

Background

Silverman's bandwidth (used here as the default) is a slightly undersmoothed variant of the optimal bandwidth if the true density is normal. It's known to be quite a bad choice for multimodal distributions with well-separated modes. The "improved Sheather-Jones" bandwidth, implemented e.g. in KDEpy and arviz, works well for a wide variety of distributions (see Table 1 of the paper), including those with well-separated modes and that depart from normality. It matches or outperforms the Sheather-Jones bandwidth selector, which is recommended by R's density and used as the default in ggdist. PosteriorStats.jl's main branch implements this as isj_bandwidth.

Also, the standard KDE as implemented here estimates the true density poorly near the bounds of bounded densities with large amounts of density near the bounds (e.g. a half-normal). While many KDE variants exist for boundary-correction, many of them use specialized location-dependent kernels. A much simpler solution, which is used e.g. by KDEpy, arviz, is boundary correction via reflection/mirroring, where the density outside of the bounds is reflected at the bounds and added to the density near the bounds. Given a standard KDE f and lower and upper bounds l and u, the reflected KDE is f_reflected(x) = f(x) + f(2l - x) + f(2u - x). This approach performs quite well and is simple to implement. On PosteriorStats.jl's main branch, this is implemented as kde_reflected.

Proposed changes

I propose adding both of these to this package. Currently default_bandwidth is not part of the API. Perhaps we could rename default_bandwidth to bandwidth_silverman and then have default_bandwidth call this function? Then bandwidth_silverman and bandwidth_isj could be added to the API. It would also be convenient to be able to provide just a bandwidth function (with signature f(::AbstractVector{<:Real})::Real) to bandwidth, which is then called by the KDE function.

Currently this package already includes kde and kde_lscv instead of trying to have a single kde function with many options. Perhaps it makes sense to include a kde_reflected method as well, which would be more-or-less as implemented in PosteriorStats.jl. This would default to using the bounds of the data as the bounds of the reflected density but allows the user to specify natural bounds.

Examples

ISJ bandwidth vs. Silverman bandwidth

This example shows that for normal-like distributions, the two bandwidths perform similarly, but for distributions with heavier tails or well-separated modes, the ISJ bandwidth is better.

using Random, Distributions, CairoMakie, KernelDensity, PosteriorStats #main
Random.seed!(42)

dists = [
    "Normal" => Normal(),
    "t(df=3)" => TDist(3),
    "mixture(Normal(0, 1), Normal(300, 1))" => MixtureModel([Normal(0, 1), Normal(300, 1)]),
    "logNormal" => LogNormal()
]

fig = Figure(size=(800, 600));
ga_plots = fig[1, 1] = GridLayout()
axs = [Axis(ga_plots[i, j]) for i in 1:2 for j in 1:2]

for (i, (name, dist)) in enumerate(dists)
    ax = axs[i]
    ax.xlabel = "x"
    ax.ylabel = "density"
    ax.title = name
    samples = rand(dist, 1_000)
    bw_silverman = KernelDensity.default_bandwidth(samples)
    bw_isj = PosteriorStats.isj_bandwidth(samples)
    kde_silverman = kde(samples; bandwidth=bw_silverman)
    kde_isj = kde(samples; bandwidth=bw_isj)
    plot!(ax, dist; color=:black, label="True")
    plot!(ax, kde_silverman; color=:blue, label="Silverman", alpha=0.7)
    plot!(ax, kde_isj; color=:orange, label="ISJ", alpha=0.7)
end
Legend(fig[2, 1], axs[1]; tellheight=true, tellwidth=false, orientation=:horizontal)
fig

bw

Reflected KDE vs. standard KDE

This example demonstrates that near the bounds of bounded densities, the reflected KDE performs better than the standard KDE, while for unbounded densities the two perform similarly. It will tend to overestimate the density near a bound if the density decays quickly to 0 near the bound.

using Random, Distributions, CairoMakie, KernelDensity, PosteriorStats #main
Random.seed!(42)

dists = [
    "Normal" => Normal(),
    "HalfNormal" => truncated(Normal(); lower=0),
    "Exponential" => Exponential(),
    "Uniform" => Uniform(0, 1),
    "Beta(2, 5)" => Beta(2, 5),
    "Beta(2, 2)" => Beta(2, 2),
]

fig = Figure(size=(1200, 600));
ga_plots = fig[1, 1] = GridLayout()
axs = [Axis(ga_plots[i, j]) for i in 1:2 for j in 1:3]

for (i, (name, dist)) in enumerate(dists)
    ax = axs[i]
    ax.xlabel = "x"
    ax.ylabel = "density"
    ax.title = name
    samples = rand(dist, 10_000)
    bw_silverman = KernelDensity.default_bandwidth(samples)
    bw_isj = PosteriorStats.isj_bandwidth(samples)
    kde_standard = kde(samples; bandwidth=bw_silverman)
    kde_reflected = PosteriorStats.kde_reflected(samples; bandwidth=bw_silverman)
    plot!(ax, dist; color=:black, label="True")
    plot!(ax, kde_standard; color=:blue, label="Standard", alpha=0.7)
    plot!(ax, kde_reflected; color=:orange, label="Reflected", alpha=0.7)
end
Legend(fig[2, 1], axs[1]; tellheight=true, tellwidth=false, orientation=:horizontal)
fig

reflect

Comparison of bandwidth selector runtimes

This benchmark shows that ISJ is slower than Silverman for smallish samples but is no slower for large samples, I'm guessing because it is O(N), while the quantile sort required to compute IQR computed in default_bandwidth is I think O(N * log(N)).

using Random, BenchmarkTools, KernelDensity, PosteriorStats #main
Random.seed!(42)

# smallish sample
x = randn(1_000)
@btime $(KernelDensity.default_bandwidth)($x)
@btime $(PosteriorStats.isj_bandwidth)($x)

# large sample
x = randn(100_000)
@btime $(KernelDensity.default_bandwidth)($x)
@btime $(PosteriorStats.isj_bandwidth)($x)
  8.251 μs (7 allocations: 8.03 KiB)
  1.436 ms (364 allocations: 280.78 KiB)
  4.079 ms (7 allocations: 781.47 KiB)
  3.551 ms (364 allocations: 280.78 KiB)
@tpapp
Copy link
Collaborator

tpapp commented Nov 18, 2024

Sorry for the late reply, I had to do some reading about this. Also, I am not the original maintaner, but I have commit rights and I am happy to review and merge improvements.

I think that the API is ripe for a minor redesign. I would prefer something like this:

kde([bw,] [kernel,] sample)

where bw can be

  1. an explicit bandwidth like 0.1,
  2. a singleton like IJS() or Silverman() (maybe living in a Bandwidth submodule, so the user would typically use it like Bandwidth.ISJ() to keep the namespace clean,
  3. a wrapper Adjust(bw, factor) which would use bw, then multiply it with factor. eg Adjust(ISJ(), 1.1).

Bandwidth.LSCV() would be simply another option.

kernel would be a kernel, or a wrapper Reflected(kernel), implementing kde_reflected. Perhaps these would live in another submodule Kernel.

I am not sure whether we should export default_bandwidth given Adjust, but we can.

The only exported symbols would be kde, Kernel, Bandwidth.

Multivariate API would be the same, but using Tuple.

I think that we should extend this package with proposed contributions like yours in mind, so that they can be incorporated easily. Given that it is still the de facto KDE package in Julia, it should get some attention.

This is just a thought, please comment. The proposed API reflects my personal preferences and may not be everyone's first choice.

@sethaxen
Copy link
Author

Thanks @tpapp for the thoughtful response. I in general like the proposed API changes. At least, I prefer it over the current API. I have a slight preference for the argument order

kde(sample [,bw] [,kernel,])

with support for providing bw and kernel as keywords. I'd also prefer we still support user-provided weights

kernel would be a kernel, or a wrapper Reflected(kernel), implementing kde_reflected. Perhaps these would live in another submodule Kernel.

This is the one choice I'm not a fan of, since KDE reflection involves no change to the kernel whatsoever (unlike other boundary correction methods). I wonder if it makes sense to have an additional module Boundary with options like Wrapped(low, hi) for circular data, Padded(factor) for padding by a factor of the bandwidth (current default), Trimmed(boundary) for fitting KDE with boundary and then trimming it to data and re-normalizing, or Reflected for kde_reflected. Then the signature above could be modified to take a boundary defaulting to Padded.

Of course, it would maybe be confusing where to put boundary correction methods that use specialized location-specific kernels.

For bandwidth, we should take care that the internal interface would support e.g. adaptive bandwidth methods.

@tpapp
Copy link
Collaborator

tpapp commented Nov 18, 2024

@sethaxen: Actually, upon reflection, I am not sure that my proposal for 3 positional arguments is good style. I would just keep one (for the data), and have

  • kernel,
  • boundary,
  • bandwidth

as keyword arguments exposed in kde, which may just be a thin wrapper that calls the relevant methods, or enumerate them to something like a kde_implementation that enumerates them and allows dispatch (but whether that is a sensible choice will be seen during refactoring).

This is the one choice I'm not a fan of

You convinced me, I agree.

Practically, this is how I would sketch it:

function kde(data; kernel, bandwidth_method, boundary)
    bandwidth = calculate_bandwidth(data, bandwidth_method)
    calculate_kde(data, bandwidth, kernel, boundary)
end

where we might as well just expose calculate_bandwidth.

(Sorry to spend so much time bikeshedding the API, but I think it is worth it).

@sethaxen
Copy link
Author

This sounds promising. To make sure the API is sufficiently flexible, it would be good to have an implementation using an adaptive bandwidth and one with an adaptive kernel. For multivariate KDE, IIRC a bandwidth selection method can provide a bandwidth matrix, not just a tuple of bandwidths for each direction. I'll double check. I'll then also prototype a package from scratch to test the API, and we can then make the necessary refactor here.

Another consideration perhaps is the calculation method for the KDE. Might be worth defining a few methods, including Explicit, Binned(method), FFTConvolution. Adaptive methods generally can't be convolved, so Explicit would be necessary and otherwise can be used to check that other implementations work correctly. Binned indicates that data should be binned (what we currently do) before applying the method. Binned(FFTConvolution()) would be the default. IIRC FastKDE (#85) would be something like an NFFTConvolution method with specific choices of kernel/bandwidth, but because these choices are all interdependent, it might make sense for bandwidth, kernel, etc to all be fields of a <:AbstractKDEMethod objects, and then the signature could be just kde(data; method)

where we might as well just expose calculate_bandwidth.

(Sorry to spend so much time bikeshedding the API, but I think it is worth it).

I completely agree!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants