Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shapiro-Wilk normality test #124

Merged
merged 68 commits into from
Oct 2, 2024
Merged
Show file tree
Hide file tree
Changes from 64 commits
Commits
Show all changes
68 commits
Select commit Hold shift + click to select a range
6e2a5b0
add interpolating polynomials from R94
kalmarek Jan 4, 2018
c576f60
SWCoeffs to store coefficients for S-W test
kalmarek Jan 4, 2018
0e52c97
swstat and pvalue
kalmarek Jan 4, 2018
dc74a53
ShapiroWilkTest and belongings
kalmarek Jan 4, 2018
ee9b8aa
export ShapiroWilkTest and SWCoeffs
kalmarek Jan 4, 2018
23485c9
shapirowilk uses StatsFuns (norminvcdf, normccdf)
kalmarek Jan 4, 2018
8219f5e
add tests for shapirowilk
kalmarek Jan 4, 2018
e843ca5
add note on the swilk fortran subroutine of ALGORITHM AS R94
kalmarek Jan 4, 2018
9f1bb32
add note on the swilk fortran subroutine of ALGORITHM AS R94
kalmarek Jan 4, 2018
52ba998
rewrite SWCoeffs following Royston1992 closely
kalmarek Jan 5, 2018
5669f2c
rewrite pvalue following Royston1992 closely
kalmarek Jan 5, 2018
c2eda86
fix: ln -> log
kalmarek Jan 9, 2018
b956bf0
Array interface for SWCoeffs
kalmarek Jan 19, 2018
4116979
use standard dot(x,y) instead of iteration
kalmarek Jan 19, 2018
875cc80
add flags to compile fortran to 64-bit lib
kalmarek Jan 19, 2018
a8838c6
Fix Errors for Julia 1.0
Ph0non Nov 20, 2018
240187f
updates to make compatible with current 'master'
jarredclloyd Dec 11, 2022
4953630
update tests fixing random values
kalmarek Dec 11, 2022
57e76e6
hide Royston-1992 coefficients/polynomials
kalmarek Dec 11, 2022
48783ff
speed-up indexing to SWCoeffs
kalmarek Dec 11, 2022
2dbc09b
create allocation-free version
kalmarek Dec 11, 2022
5660d69
re-formatting and updating docstrings
kalmarek Dec 11, 2022
771096d
refactor creation of RS92 polynomials
May 5, 2023
a15c906
don't export SWCoeffs
kalmarek May 5, 2023
0133ca3
fix: swstat use only uncensored range for sample
kalmarek May 5, 2023
31ea237
rewrite pvalue(::ShapiroWilkTest) using log1p
kalmarek May 5, 2023
9891972
move non-fatal errors to notes in the doctring
kalmarek May 5, 2023
714f494
throw specific exceptions in ShapiroWilkTest
kalmarek May 5, 2023
060e54e
swilkfort expects sample to be already sorted
kalmarek May 5, 2023
c162b49
rewrite and extend tests
kalmarek May 5, 2023
d013864
rewrite getindex(::SWCoeffs, i)
kalmarek May 5, 2023
c7848f1
adjust dostrings/error messages following review
kalmarek Jun 9, 2023
e07b4aa
describe W-statistic explicitly
kalmarek Jun 9, 2023
103bb20
rename and document kwargs for SW test
kalmarek Jun 9, 2023
820b62c
rename SWCoeffs to ShapiroWilkCoefs + others
kalmarek Jun 9, 2023
8afe009
add tests for printing
kalmarek Jun 9, 2023
1db096b
fix: specify that we want HypothesisTests.pvalue
kalmarek Jun 10, 2023
96d634a
use `censored` kwarg and sync docstring
kalmarek Jun 11, 2023
442561f
formatting according to yas
kalmarek Jun 11, 2023
282ef0b
extend StatsAPI.pvalue
kalmarek Jun 11, 2023
0460aa0
shorten description of W statistic
kalmarek Jun 11, 2023
2e408fb
minor changes as indicated in review
kalmarek Jul 30, 2023
cfad621
document ShapiroWilkCoefs and its use.
kalmarek Jul 30, 2023
080af1b
move checks from swstat to ShapiroWilkTest
kalmarek Jul 30, 2023
edf696f
push coverage up
kalmarek Jul 30, 2023
fcf00ef
move shapirowilk.jl → shapiro_wilk.jl etc
kalmarek Jul 30, 2023
4b5e2d8
increase test coverage even more
kalmarek Jul 30, 2023
e874016
add ShapiroWilkTest to docs
kalmarek Jul 30, 2023
5be7f67
use !!! warning in SW doc
kalmarek Jul 30, 2023
0af19f2
whitespace
Aug 13, 2023
fe0edc9
whitespace
Aug 13, 2023
47bbd04
test pvalue of un-sorted example
Aug 13, 2023
3a4d376
add HypothesisTests in docs for unexported ShapiroWilkCoefs
Aug 13, 2023
55623bd
typo
Aug 13, 2023
7abe439
typo
Aug 13, 2023
d36355a
explain argument `N` in ShapiroWilkCoefs
Aug 13, 2023
23318ce
use 'data in vector X` instead of `sample X`
Aug 13, 2023
cccb03c
remove the dead (?!) branch from pvalue computation
kalmarek Oct 19, 2023
475498d
remove ShapiroWilkCoefs
kalmarek Oct 23, 2023
135e43a
use moment instead of sum abs2
Oct 23, 2023
833a18e
use begin instead of 1-based indexing
Oct 23, 2023
6403297
remove resizes! in favour of view
kalmarek Oct 23, 2023
262bf2c
Revert "use moment instead of sum abs2"
kalmarek Oct 23, 2023
46e4cca
Revert "use begin instead of 1-based indexing"
kalmarek Nov 2, 2023
cf21de0
use clamp instead of min
kalmarek May 23, 2024
1bd9d79
throw ArgumentError instead of string
kalmarek May 23, 2024
6047be0
fix ref link
kalmarek May 23, 2024
6e313c7
fix test_throws
kalmarek May 27, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions docs/src/nonparametric.md
Original file line number Diff line number Diff line change
Expand Up @@ -81,3 +81,8 @@ ApproximatePermutationTest
```@docs
FlignerKilleenTest
```

## Shapiro-Wilk test
```@docs
ShapiroWilkTest
```
1 change: 1 addition & 0 deletions src/HypothesisTests.jl
Original file line number Diff line number Diff line change
Expand Up @@ -148,5 +148,6 @@ include("diebold_mariano.jl")
include("clark_west.jl")
include("white.jl")
include("var_equality.jl")
include("shapiro_wilk.jl")
kalmarek marked this conversation as resolved.
Show resolved Hide resolved

end
252 changes: 252 additions & 0 deletions src/shapiro_wilk.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1,252 @@
export ShapiroWilkTest

#=
From:
PATRICK ROYSTON
Approximating the Shapiro-Wilk W-test for non-normality
*Statistics and Computing* (1992) **2**, 117-119
DOI: [10.1007/BF01891203](https://doi.org/10.1007/BF01891203)
=#

# Coefficients from Royston (1992)
for (s, c) in [(:C1, [0.0, 0.221157, -0.147981, -2.07119, 4.434685, -2.706056]),
(:C2, [0.0, 0.042981, -0.293762, -1.752461, 5.682633, -3.582633]),
(:C3, [0.5440, -0.39978, 0.025054, -0.0006714]),
(:C4, [1.3822, -0.77857, 0.062767, -0.0020322]),
(:C5, [-1.5861, -0.31082, -0.083751, 0.0038915]),
(:C6, [-0.4803, -0.082676, 0.0030302]),
(:C7, [0.164, 0.533]),
(:C8, [0.1736, 0.315]),
(:C9, [0.256, -0.00635]),
(:G, [-2.273, 0.459])]
@eval $(Symbol(:__RS92_, s))(x) = Base.Math.@horner(x, $(c...))
end

"""
HypothesisTests.shapiro_wilk_coefs(N::Integer)

Construct a vector of de-correlated expected order statistics for Shapiro-Wilk
test for a sample of size `N`.

If multiple tests on samples of size `N` are performed, it is beneficial to
construct and pass a single vector of coefficients to `ShapiroWilkTest`(@ref).
kalmarek marked this conversation as resolved.
Show resolved Hide resolved
"""
function shapiro_wilk_coefs(N::Integer)
if N < 3
throw(ArgumentError("N must be greater than or equal to 3, got $N"))
elseif N == 3 # exact
w = sqrt(2.0) / 2.0
return [w,zero(w),-w]
else
n = div(N, 2)
swc = Vector{Float64}(undef, N)
m = @view swc[1:n] # store only positive half of swc:
# it is (anti-)symmetric; hence '2' factor below
for i in eachindex(m)
# Weisberg&Bingham 1975 statistic
m[i] = -quantile(Normal(), (i - 3 / 8) / (N + 1 / 4))
end
mᵀm = 2sum(abs2, m)
x = 1 / sqrt(N)
a₁ = m[1] / sqrt(mᵀm) + __RS92_C1(x) # aₙ = cₙ + (...)
if N ≤ 5
ϕ = (mᵀm - 2m[1]^2) / (1 - 2a₁^2)
m .= m ./ sqrt(ϕ) # A, but reusing m to save allocs
m[1] = a₁
else
a₂ = m[2] / sqrt(mᵀm) + __RS92_C2(x) # aₙ₋₁ = cₙ₋₁ + (...)
ϕ = (mᵀm - 2m[1]^2 - 2m[2]^2) / (1 - 2a₁^2 - 2a₂^2)
m .= m ./ sqrt(ϕ) # A, but reusing m to save allocs
m[1], m[2] = a₁, a₂
end

for i in 1:n
swc[N-i+1] = -swc[i]
end
if isodd(N)
swc[n+1] = zero(eltype(swc))
end
return swc
end
end

function unsafe_swstat(X::AbstractVector{<:Real}, A::AbstractVector{<:Real})
_begin = firstindex(A)
AX = @inbounds dot(view(A, _begin:(_begin + length(X) - 1)), X)
m = mean(X)
S² = sum(x -> abs2(x - m), X)
Comment on lines +76 to +77
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
m = mean(X)
= sum(x -> abs2(x - m), X)
= moment(X, 2)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devmotion isn't moment normalized (i.e. integrated w.r.t. uniform measure on X? Then this should be S² = moment(X,2)*length(X).

But the real question is: did you benchmark this? On my machine this is (2-6) × slower (N ∈ 2.^7:15, increasing with N) than before.
(Which, accidentally, makes it even slower than the non-BLAS dot for old SWCoeffs (N=2^7, increases to ~8 for N=2^15)

If we really care about perf I'd rather revert this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should improve moment if it is slow. But regardless, you're right of course that the moment will be normalized. I'm fine with reverting it.

W = AX^2 / S²
return min(W, one(W)) # to guard against numeric errors
kalmarek marked this conversation as resolved.
Show resolved Hide resolved
end

struct ShapiroWilkTest <: HypothesisTest
coefs::Vector{Float64} # expectation of order statistics for Shapiro-Wilk test
W::Float64 # test statistic
censored::Int # only the smallest N₁ samples were used
end

testname(::ShapiroWilkTest) = "Shapiro-Wilk normality test"
function population_param_of_interest(t::ShapiroWilkTest)
return ("Squared correlation of data and expected order statistics of N(0,1) (W)",
1.0, t.W)
end
default_tail(::ShapiroWilkTest) = :left
censored_ratio(t::ShapiroWilkTest) = t.censored / length(t.coefs)

function show_params(io::IO, t::ShapiroWilkTest, indent)
l = 24
println(io, indent, rpad("number of observations:", l), length(t.coefs))
println(io, indent, rpad("censored ratio:", l), censored_ratio(t))
return println(io, indent, rpad("W-statistic:", l), t.W)
end

function StatsAPI.pvalue(t::ShapiroWilkTest)
n = length(t.coefs)
W = t.W

if iszero(censored_ratio(t))
if n == 3 # exact by Shapiro&Wilk 1965
# equivalent to 6/π * (asin(sqrt(W)) - asin(sqrt(3/4)))
return 1 - 6acos(sqrt(W)) / π
elseif n ≤ 11 # Royston 1992
γ = __RS92_G(n)
w = -log(γ - log1p(-W))
μ = __RS92_C3(n)
σ = exp(__RS92_C4(n))
elseif 12 ≤ n # Royston 1992
w = log1p(-W)
μ = __RS92_C5(log(n))
σ = exp(__RS92_C6(log(n)))
end
return ccdf(Normal(μ, σ), w)
else
throw("censored samples not implemented yet")
kalmarek marked this conversation as resolved.
Show resolved Hide resolved
# to implement censored samples follow Royston 1993 Section 3.3
end
end

"""
ShapiroWilkTest(X::AbstractVector{<:Real},
swc::AbstractVector{<:Real}=shapiro_wilk_coefs(length(X));
sorted::Bool=issorted(X),
censored::Integer=0)

Perform a Shapiro-Wilk test of the null hypothesis that the data in vector `X` come from a
normal distribution.

This implementation is based on the method by Royston (1992).
The calculation of the p-value is exact for sample size `N = 3`, and for ranges
`4 ≤ N ≤ 11` and `12 ≤ N ≤ 5000` (Royston 1992) two separate approximations
for p-values are used.

# Keyword arguments
The following keyword arguments may be passed.
* `sorted::Bool=issorted(X)`: to indicate that sample `X` is already sorted.
* `censored::Integer=0`: to censor the largest samples from `X`
(so called upper-tail censoring)

Implements: [`pvalue`](@ref)

!!! warning
As noted by Royston (1993), (approximated) W-statistic will be accurate
but returned p-values may not be reliable if either of these apply:
* Sample size is large (`N > 2000`) or small (`N < 20`)
* Too much data is censored (`censored / N > 0.8`)

# Implementation notes
* The current implementation DOES NOT implement p-values for censored data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the utility of allowing the user to compute the Shapiro-Wilk test statistic at all if they can't get its p-value nor view the object directly in the REPL (since show calls pvalue)? That's an honest question; I would guess that many users would try it and get frustrated without checking the docstring but that is a naive assumption.

I'd probably recommend commenting out the censoring-related functionality for the time being, then someone can implement it fully in a follow-up PR. Thoughts on that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ararslan feel free to suggest (via gh suggestions) any edits that would suit your proposal

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it was half question, half proposal—do you envision a use case for this with censored data as it is currently implemented in this PR (no pvalue/show)? If not, I can make those suggestions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well the W statistic can be still useful if somebody wants to use section 3.3 of Royston 1993 to calculate significance levels directly.

I didn't implement those methods back then since the significance levels there are only empirical, and until I need to do it at work I don't see myself doing this now either ;)

If those suggestions push this PR over the final line, please make them!

* If multiple Shapiro-Wilk tests are to be performed on samples of same
size, it is faster to construct `swc = shapiro_wilk_coefs(length(X))` once
and pass it to the test via `ShapiroWilkTest(X, swc)` for re-use.
* For maximal performance sorted `X` should be passed and indicated with
`sorted=true` keyword argument.

# References
Shapiro, S. S., & Wilk, M. B. (1965). An Analysis of Variance Test for Normality
(Complete Samples). *Biometrika*, 52, 591–611.
[doi:10.1093/BIOMET/52.3-4.591](https://doi.org/10.1093/BIOMET/52.3-4.591).

Royston, P. (1992). Approximating the Shapiro-Wilk W-test for non-normality.
*Statistics and Computing*, 2(3), 117–119.
[doi:10.1007/BF01891203](https://doi.org/10.1007/BF01891203)

Royston, P. (1993). A Toolkit for Testing for Non-Normality in Complete and
Censored Samples. Journal of the Royal Statistical Society Series D
(The Statistician), 42(1), 37–43.
[doi:10.2307/2348109](https://doi.org/10.2307/2348109)

Royston, P. (1995). Remark AS R94: A Remark on Algorithm AS 181: The W-test for
Normality. *Journal of the Royal Statistical Society Series C
(Applied Statistics)*, 44(4), 547–551.
[doi:10.2307/2986146](https://doi.org/10.2307/2986146).
"""
function ShapiroWilkTest(sample::AbstractVector{<:Real},
swcoefs::AbstractVector{<:Real}=shapiro_wilk_coefs(length(sample));
sorted::Bool=issorted(sample),
censored::Integer=0)
N = length(sample)
if N < 3
throw(ArgumentError("at least 3 samples are required, got $N"))
elseif censored ≥ N
throw(ArgumentError("`censored` must be less than `length(sample)`, " *
"got `censored = $censored > $N = length(sample)`"))
elseif length(swcoefs) ≠ length(sample)
throw(DimensionMismatch("length of sample and Shapiro-Wilk coefficients " *
"differ, got $N and $(length(swcoefs))"))
end

W = if !sorted
X = sort!(sample[1:(end - censored)])
if abs(last(X) - first(X)) < length(X) * eps()
throw(ArgumentError("sample is constant (up to numerical accuracy)"))
end
unsafe_swstat(X, swcoefs)
else
X = @view sample[1:(end - censored)]
if last(X) - first(X) < length(X) * eps()
throw(ArgumentError("sample doesn't seem to be sorted or " *
"is constant (up to numerical accuracy)"))
end
unsafe_swstat(X, swcoefs)
end

return ShapiroWilkTest(swcoefs, W, censored)
end

#=
# To compare with the standard ALGORITHM AS R94 fortran subroutine
# * grab scipys (swilk.f)[https://raw.githubusercontent.com/scipy/scipy/main/scipy/stats/statlib/swilk.f];
# * compile
# ```
# gfortran -shared -fPIC -o swilk.so swilk.f
# gfortran -fdefault-integer-8 -fdefault-real-8 -shared -fPIC swilk.f -o swilk64.so
# ```

for (lib, I, F) in (("./swilk64.so", Int64, Float64),
("./swilk.so" , Int32, Float32))
@eval begin
function swilkfort!(X::AbstractVector{$F}, A::AbstractVector{$F}, computeA=true)
X = issorted(X) ? X : sort(X)
w, pval = Ref{$F}(0.0), Ref{$F}(0.0)
ifault = Ref{$I}(0)

ccall((:swilk_, $lib),
Cvoid,
(
Ref{Bool}, # INIT if false compute SWCoeffs in A, else use A
Ref{$F}, # X sample
Ref{$I}, # N samples length
Ref{$I}, # N1 (upper) uncensored data length
Ref{$I}, # N2 length of A
Ref{$F}, # A A
Ref{$F}, # W W-statistic
Ref{$F}, # PW p-value
Ref{$I}, # IFAULT error code (see swilk.f for meaning)
),
!computeA, X, length(X), length(X), div(N,2), A, w, pval, ifault)
return (w[], pval[], ifault[], A)
end
swilkfort(X::Vector{$F}) = swilkfort!(X, zeros($F, div(length(X),2)))
end
end
=#
1 change: 1 addition & 0 deletions test/runtests.jl
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,7 @@ include("mann_whitney.jl")
include("correlation.jl")
include("permutation.jl")
include("power_divergence.jl")
include("shapiro_wilk.jl")
include("show.jl")
include("t.jl")
include("wald_wolfowitz.jl")
Expand Down
Loading
Loading