Shapiro-Wilk normality test #124

Merged · 68 commits · Oct 2, 2024

Conversation

@kalmarek (Contributor) commented Jan 4, 2018

implements ShapiroWilkTest following

PATRICK ROYSTON
Approximating the Shapiro-Wilk W-test for non-normality
Statistics and Computing (1992) 2, 117-119
DOI: 10.1007/BF01891203
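
For reference, the statistic being implemented is W = (∑ aᵢ x₍ᵢ₎)² / ∑ (xᵢ - x̄)², with a the full (antisymmetric) vector of Shapiro-Wilk coefficients. A minimal sketch (not the exact code in this PR; names are illustrative):

using LinearAlgebra: dot

# W = (∑ aᵢ x₍ᵢ₎)² / ∑ (xᵢ - x̄)²; `x_sorted` is assumed to be sorted and
# `a` is the full coefficient vector -- a sketch only, not the PR's implementation.
function W_statistic(x_sorted::AbstractVector, a::AbstractVector)
    x̄ = sum(x_sorted) / length(x_sorted)
    return abs2(dot(a, x_sorted)) / sum(x -> abs2(x - x̄), x_sorted)
end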

This is work in progress, all comments are welcome!
I tried to follow the paper closely, but copied e.g. some precomputed constants from the original swilk.f.
Please let me know if you are OK (license-wise) with this.

These polynomial interpolations are computed at load time for speed.
Currently:

julia> using BenchmarkTools
julia> srand(1);
julia> k = 5000;
julia> X = sort(randn(k));
julia> @btime ShapiroWilkTest(X);
  112.523 μs (18 allocations: 137.45 KiB)
julia> swc = SWCoeffs(k);
julia> @btime ShapiroWilkTest(X, swc);
  62.476 μs (11 allocations: 78.50 KiB)

whereas calling swilkfort directly

julia> srand(1);
julia> X = sort(randn(5000));
julia> A = zeros(2500);
julia> @btime swilkfort!(X, A);
  205.763 μs (10 allocations: 224 bytes)
julia> @btime swilkfort!(X, A, false);
  75.692 μs (10 allocations: 224 bytes)

Still missing:

  • documentation
  • censored data
  • polynomial interpolation with better precision (?),

and probably more.

I tried to compute exact values of SWCoeffs (via Monte Carlo simulation), but the results I'm getting are off from the ones reported in Table 1 of op. cit. Would be glad if anyone could help.

function approximate_A(N, samps)
    s = [sort(randn(N)) for i in 1:samps]
    m = sum(s[i][1:N] for i in 1:samps)/samps   # MC estimate of the expected order statistics
    ss = vcat(s...)
    Vinv = inv(cov(ss, ss))
    A = (-m'*Vinv/sqrt(m'*Vinv*Vinv*m))'
    return ((A - reverse(A))/2)[1:div(N,2)]     # antisymmetrize and keep the first half
end;

approximate_A(10,1000000) results in

[0.5469493640066541, 0.3559315327467397, 0.23322491989287789, 0.1336531031109471, 0.043609401791162974]

Compare swilk.f's values and SWCoeffs(10).A:

[0.5737146934337275, 0.3289700603561813, 0.21434902886647542, 0.12279063037794058, 0.04008871216291867]
[0.5737147066903874, 0.3289700464878161, 0.21434901803439887, 0.1227906248657577,  0.04008871105102476]

Rahman & Govindarajulu

A different implementation of SWCoeffs following

M. Mahibbur Rahman & Z. Govindarajulu,
A modification of the test of Shapiro and Wilk for normality
Journal of Applied Statistics 24:2, 219-236, 1997
DOI: 10.1080/02664769723828

function HypothesisTests.SWCoeffs(N::Int, ::Type{Val{:Rahman}})
    M = [norminvcdf(i/(N+1)) for i in 1:N];
    normM = normpdf.(M)

    c = (N+1)*(N+2)
    d = 2c.*normM.^2
    dl= [-c*normM[i]*normM[i+1] for i in 1:N-1]
    Vinv = SymTridiagonal(d, dl)
    C = sqrt(M'*(Vinv*Vinv)*M)

    Astar = -(M'*Vinv)'/C
    return SWCoeffs(N, Astar[1:div(N,2)])
end

But they don't provide critical values for their statistic (or I couldn't find them), so that would require further simulation. Also their SWCoeffs(10, Val{:Rahman}) are quite different:

[0.6010309510135708, 0.2975089851946684, 0.19212508233094944, 0.10979215067183574, 0.03583065785524651]

On the other hand, Algorithm AS R94 is a standard.

@andreasnoack (Member) commented Jan 5, 2018

Thanks for contributing this. Your implementation looks good. The main issue is that swilk.f is written for Float32, not Float64, so you'll probably need to compute new polynomial coefficients for the Float64 implementation.

Another issue is the license. http://lib.stat.cmu.edu/apstat/ states that

The Royal Statistical Society holds the copyright to these routines, but has given its permission for their distribution provided that no fee is charged.

which is not compatible with the MIT license, so you should either try to implement these directly from the paper or contact the Royal Statistical Society and ask whether they have changed their mind regarding the clause. The best outcome would be if they could use a real license.

@kalmarek (Contributor, Author) commented Jan 7, 2018

The main issue is that swilk.f is written for Float32, not Float64, so you'll probably need to compute new polynomial coefficients for the Float64 implementation.

Yep, I tried to do this, but failed to obtain the "true" SWCoeffs by semi-exact means.

Even obtaining the "exact" expected values of the order statistics E(r:n) (our vector m) seems to be a problem. Convention: we compute the first half of the negated N(0,1) order statistics.

A simple MC produces highly variable results:

function mMC(N::Int,samps=10^5)
    X = Array{Float64}(N)
    s = zeros(div(N,2))
    for i in 1:samps
        for j in 1:N
            X[j] = randn()
        end
        sort!(X)
        for j in 1:div(N,2)
            s[j] += X[j]
        end
    end
    return -s/samps
end
d = mMC(101, 10^7) - mMC(101, 10^7);
norm(d,1)
0.0019123699000122493

More or less exact order statistics can be obtained by direct integration, e.g. following

Algorithm AS 177,
Royston, J.P.
Expected Normal Order Statistics (Exact and Approximate)

But I don't know how exact normcdf etc. and quadgk are, i.e. how exact is the following method?

function Royston(n, r, logs=[log(i) for i in 1:n], o=7*sqrt(n))
    # C = log(n! / ((r-1)!(n-r)!))
    C = sum(logs) - sum(view(logs, 2:r-1)) - sum(view(logs, 2:n-r))
    # I(x): log-density of the r-th largest N(0,1) order statistic; the integrand is x*exp(I(x))
    I(x) = C + (r-1)*log(normccdf(x)) + (n-r)*log(normcdf(x)) - 1/2*(log(2π) + x^2)
    return quadgk(x->x*exp(I(x)), -9.0-log(n), 9.0+log(n),
            order=floor(Int, o),
            maxevals=floor(Int, log(n)*10^7))
end
function mRoyston(n, o=7*sqrt(n))
    logs=[log(i) for i in 1:n]
    m = [Royston(n,r, logs, o) for r in 1:div(n,2)]
    err = norm([i[2] for i in m],1)
    @show err
    return [i[1] for i in m]
end

Royston's Algorithm AS R94 uses the Blom approximation:

Blom(n,r) = -norminvcdf((r - 3/8)/(n + 1/4))
mBlom(n) = [Blom(n,r) for r in 1:div(n,2)]

and Rahman (cited above) uses the simpler

Rahman(n,r) = -norminvcdf(r/(n + 1))
mRahman(n) = [Rahman(n,r) for r in 1:div(n,2)]

Comparing

I believe the direct integration to be the most accurate, so I'd lean towards regarding it as "true". Comparing other methods against it gives

const N = 5000
m_Ro = mRoyston(N);
m_mc = mMC(N, 10^6);
m_Bl = mBlom(N);
m_Rh = mRahman(N);

p = plot(yaxis=:log10, title="Relative to Royston",size=(800,600))
for (X, l) in [(m_mc, "MonteCarlo"), 
        (m_Bl, "Blom approximation"), 
        (m_Rh, "Rahman approximation"), 
   ]
    plot!(abs.((X-m_Ro)./m_Ro), label=l)
end
p

[plot "Relative to Royston": Monte Carlo, Blom, and Rahman approximations of m relative to direct integration, log scale]

Questions:

  1. Is storing/caching values of m (and later of SWCoeffs) an option (for sizes N in 4:1:5000)?
  2. Should we follow AS R94 for the ShapiroWilk test, or should we fit our own curves for distributions of W?

@andreasnoack (Member) commented:

But I don't know how exact normcdf etc. and quadgk are, i.e. how exact is the following method?

Both should be high quality, and quadgk can provide an estimate of the error. Notice also that there is normlogcdf, which might give you a little extra accuracy over the explicit log.

So all-in-all I think this should be fine for computing the mean order statistics. Have you tried to compute the covariances yet? They are needed for fitting the polynomials, right?

  1. Is storing/caching values of m (and later of SWCoeffs) an option (for sizes N in 4:1:5000)?

I guess we could do that. We could also try to do this as part of a generated function that would compute everything on demand but it might make the first call pretty slow. It would be pretty cool on the other hand.

Should we follow AS R94 for the ShapiroWilk test, or should we fit our own curves for distributions of W?

Given how inaccurately they computed the order statistics back then, I guess it would be good if we could recompute them.

Another thing: do you have experience with the Weisberg-Bingham version? It is so much easier to compute, so do you know why people seem to prefer SW? Path dependency? I tried to look for power comparisons but couldn't find anything that suggested it to be significantly worse.

@kalmarek (Contributor, Author) commented Jan 8, 2018

For the covariances I plan to do the following: [DavisStephens1977] gives an algorithm to compute the covariance matrix of normal order statistics given

  1. $V_{11}$, the variance of the largest order statistic → approximated by [SheaScallon1988]
  2. $EX_1$, $EX_2$, the expected values of the largest and second largest order statistics → ~exact by [Royston1982]
  3. $\Sigma^2$, the sum of squares of order statistics → approximated (with bounds on the error) by [Balakrishnan1984]
  • [DavisStephens1977]: C. S. Davis & M. A. Stephens The covariance matrix of normal order statistics Communications in Statistics - Simulation and Computation Vol. 6, Iss. 1, 1977 DOI:10.1080/03610917708812028
  • [SheaScallon1988]: B. L. Shea and A. J. Scallon Remark AS R72: A Remark on Algorithm AS 128. Approximating the Covariance Matrix of Normal Order Statistics Journal of the Royal Statistical Society. Series C (Applied Statistics) Vol. 37, No. 1 (1988), pp. 151-155 DOI:10.2307/2347514
  • [Royston1982]: J. P. Royston Algorithm AS 177: Expected Normal Order Statistics (Exact and Approximate) Journal of the Royal Statistical Society. Series C (Applied Statistics) Vol. 31, No. 2 (1982), pp. 161-165 DOI:10.2307/2347982
  • [Balakrishnan1984]: N. Balakrishnan Algorithm AS 200: Approximating the Sum of Squares of Normal Scores Journal of the Royal Statistical Society. Series C (Applied Statistics) Vol. 33, No. 2 (1984), pp. 242-245 DOI:10.2307/2347458

But I may not have enough time to do this during this week.

We could also try to do this as part of a generated function that would compute everything on demand but it might make the first call pretty slow. It would be pretty cool on the other hand.

We may have a look at speeds and decide when I finish the plan above; but I guess a cached (or @memoized?) version of SWCoeffs(N::Int) will be the way to go.

Another thing: do you have experience with the Weisberg-Bingham version? It is so much easier to compute, so do you know why people seem to prefer SW? Path dependency? I tried to look for power comparisons but couldn't find anything that suggested it to be significantly worse.

Royston's AS R94 is essentially Weisberg-Bingham + corrections + fitting for the distribution of the W-statistic. AS R94 uses Blom's approximation of the expected normal order statistics m and then corrects just the first two coefficients.
This is an approximation of (the perfect) Shapiro-Wilk, if you like. If we can (on modern hardware) do it the "exact" way (and make it fast), maybe we should?

@kalmarek (Contributor, Author) commented Jan 8, 2018

Another thing, do you have experience with Weisberg-Bingham version?

As a matter of fact, I'm not a statistician, doing pure math, coding in julia "after hours" ;-). But for most of my "empirical science" friends SW was the one to go to, and seems to have the best power for small samples.

@kalmarek (Contributor, Author) commented:

OK, variances/covariances turned out to be harder than I thought:
I implemented the exact formulas from Godwin's Some Low Moments of Order Statistics in normordstats.jl, please have a look. These become numerically unstable for n ~ 25, but for n = 20 they agree to ~10 digits with Table 1 of Parrish. He used 128 bits of precision to compute tables of products of order statistics up to n = 50 (and I guess he didn't progress any further because of loss of precision). So maybe my cheap implementation isn't that bad ;-)

(BTW, do you by any chance have access to a paper copy of Parrish's article (to scan and send me)? The digital copy I have has no text layer and is of such poor quality that OCRing those tables produces errors in every row. Which, for an article that costs $50 to BORROW for 24h, is OUTRAGEOUS.)

With these I can tell that Royston's approximation of SWCoeffs is surprisingly good, whereas Rahman's is (surprisingly) bad. Maybe there is an error in my implementation of the latter:

blom(n,i) = norminvcdf((i - 3/8)/(n + 1/4))
rahman(n,i) = norminvcdf(i/(n + 1))

struct Blom
    n::Int
end

struct Rahman
    n::Int
end

OrderStatistics.expectation(B::Blom) = [blom(B.n,i) for i in 1:B.n]
OrderStatistics.expectation(R::Rahman) = [rahman(R.n,i) for i in 1:R.n]

# M. Mahibbur Rahman & Z. Govindarajulu,  
# A modification of the test of Shapiro and Wilk for normality  
# *Journal of Applied Statistics* 24:2, 219-236, 1997  
# DOI: [10.1080/02664769723828](http://dx.doi.org/10.1080/02664769723828)

function invV(N, M)
    pdfM = normpdf.(M)
    d = pdfM.^2
    dl= [-pdfM[i]*pdfM[i+1] for i in 1:N-1]
    
    return (N+1)*(N+2)*SymTridiagonal(d, dl)
end

function HypothesisTests.SWCoeffs(R::Rahman)    
    M = expectation(R)
    A = invV(R.n, M)
    C = sqrt(M'*(A*A)*M)

    Astar = (M'*A)'/C
    return SWCoeffs(R.n, Astar[1:div(R.n,2)])
end

function HypothesisTests.SWCoeffs(OS::NormOrderStatistic)
    m = expectation(OS)
    iV = inv(cov(OS))
    A = m'*iV./sqrt(m'*iV*iV*m)
    return SWCoeffs(OS.n, A[1,1:div(OS.n,2)])
end

Then:

N = 25
@time ExactA = SWCoeffs(NormOrderStatistic(N));
plot(SWCoeffs(N).A, label="Blom + 0-correlation OrdStats")
plot!(SWCoeffs(Rahman(N)).A, label="Rahman + limit inverse correlation")
plot!(-ExactA.A, label="Exact formulas, BigFloat")

[plot: Blom-, Rahman-, and exact-formula-based SW coefficients for N = 25]
(you can see the instability kicking in)

Right now I cannot think of anything better than the above-mentioned approximation by Davis & Stephens (using a Taylor expansion of the expectation of products of order statistics).

Using BigFloat, ArbFloat, etc. will not help much, as the precision of e.g. erf, and hence normcdf, is limited to Float64, at least for now.

@kalmarek (Contributor, Author) commented Mar 2, 2018

It took more time than I had predicted, but it turned out to be a fun project ;-)
I'm looking for a piece of advice on what to do with the code, based on the results below. At the moment I think that numerically the Shapiro-Wilk test should be left "as is", until someone can point out what Royston fit his polynomials against. We should of course use the exact values of Parrish for N≤20 (or even N≤50 if I am able to fix the loss of precision mentioned below).

Normal order statistics

For the implementation see https://github.com/kalmarek/HypothesisTests.jl/tree/normordstats
Using this code you can confirm the covariances computed by Parrish (the 25dp values) for N≤20, but beyond that point there seems to be precision loss, which I was not able to trace.

David-Johnson formula:

# successive derivatives of G = Φ⁻¹ (the normal quantile) with respect to p,
# evaluated at x = G(p); used in the David-Johnson expansion below
function diffG(x)
    s = normpdf(x)
    s² = s^2
    s⁴ = s²*s²
    x² = x^2
    x⁴ = x²*x²
    return (
        1/s,
        x/s²,
        (1 + 2x²)/(s*s²),
        x*(7 + 6x²)/s⁴,
        (7 + 46x² + 24x⁴)/(s*s⁴),
        x*(127 + 326x² + 120x⁴)/(s²*s⁴)
        )
end
function covDavidJohnson(i::Int, j::Int, n::Int, order=3)
    if j < i
        return covDavidJohnson(j, i, n, order)
    end
        
    pᵢ, pⱼ = i/(n+1), j/(n+1)
    qᵢ, qⱼ = 1 - pᵢ, 1 - pⱼ
    Gᵢ = diffG(norminvcdf(pᵢ))
    Gⱼ = diffG(norminvcdf(pⱼ))
    X = 1/(n+2)
    T = Vector{eltype(Gᵢ)}()
    if order >= 1
        T1 = X  *pᵢ*qⱼ*Gᵢ[1]*Gⱼ[1]
        push!(T, T1)
    end
    if order >= 2
        T2 = X^2*pᵢ*qⱼ*(
                (qᵢ-pᵢ)*Gᵢ[2]*Gⱼ[1] + 
                (qⱼ-pⱼ)*Gᵢ[1]*Gⱼ[2] + 
                0.5(
                    pᵢ*qᵢ*Gᵢ[3]*Gⱼ[1] +
                    pⱼ*qⱼ*Gᵢ[1]*Gⱼ[3] +
                    pᵢ*pⱼ*Gᵢ[2]*Gⱼ[2] 
                    )
                )
        push!(T, T2)
    end
    if order >= 3
        T3 = X^3*pᵢ*qⱼ*(
                -(qᵢ-pᵢ)*Gᵢ[2]*Gⱼ[1] - (qⱼ-pⱼ)*Gᵢ[1]*Gⱼ[2] + 
                ((qᵢ-pᵢ)^2 - pᵢ*qᵢ)*Gᵢ[3]*Gⱼ[1] + 
                ((qⱼ-pⱼ)^2 - pⱼ*qⱼ)*Gᵢ[1]*Gⱼ[3] +
                1/2*(3(qᵢ-pᵢ)*(qⱼ-pⱼ) + pⱼ*qᵢ - 4pᵢ*qⱼ)*Gᵢ[2]*Gⱼ[2] + 
                5/6*pᵢ*qᵢ*(qᵢ-pᵢ)*Gᵢ[4]*Gⱼ[1] + 
                5/6*pⱼ*qⱼ*(qⱼ-pⱼ)*Gᵢ[1]*Gⱼ[4] + 
                (pᵢ*qⱼ*(qᵢ-pᵢ) + 0.5pᵢ*qᵢ*(qⱼ-pⱼ))*Gᵢ[3]*Gⱼ[2] + 
                (pᵢ*qⱼ*(qⱼ-pⱼ) + 0.5pⱼ*qⱼ*(qᵢ-pᵢ))*Gᵢ[2]*Gⱼ[3] + 
                1/8*(pᵢ^2*qᵢ^2*Gᵢ[5]*Gⱼ[1] + pⱼ^2*qⱼ^2*Gᵢ[1]*Gⱼ[5]) + 
                1/4*(pᵢ^2*qᵢ*qⱼ*Gᵢ[4]*Gⱼ[2] + pᵢ*pⱼ*qⱼ^2*Gᵢ[2]*Gⱼ[4]) + 
                1/12*(2pᵢ^2*qⱼ^2 + 3pᵢ*pⱼ*qᵢ*qⱼ)*Gᵢ[3]*Gⱼ[3]
                )
        push!(T, T3)
    end
    return sum(T)
end

Comparing to the high-precision (25d) values obtained by Parrish (N = 20; log10.(abs.(differences))):
[plot: log10 of absolute differences between the David-Johnson approximation and Parrish's 25-digit covariances, N = 20]

Davis-Stephens covariance matrix of normal order statistics

This method uses the David-Johnson formula + relations between correlations, etc:

function normalizerel!(v::AbstractVector, relative::Union{Vector{Int}, UnitRange})
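    # Rescale the entries outside `relative` so that the whole column sums to 1
    # (each row/column of the covariance matrix of standard normal order statistics
    # sums to 1), while keeping the entries listed in `relative` fixed.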
    Sr = sum(v[relative])
    c = (1-Sr)/(sum(v) - Sr)
    for i in eachindex(v)
        if !(i in relative)
            v[i] *= c
        end
    end
    return v
end
function fillcov!(V, j::Int)
    # copy reversed j-th column to (end-j)-th column
    @views V[j:end-j+1, end-j+1] .= reverse(V[j:end-j+1, j])
    # copy j-th column to j-th row
    @views V[j, j+1:end-j] .= V[j+1:end-j, j]
    # copy (end-j)-th column to (end-j)-th row 
    @views V[end-j+1, j+1:end-j] .= V[j+1:end-j, end-j+1] 
    return V
end
function covDavisStephens(OS::NormOrderStatistic{T}, V11::T=var(OS, 1)) where T
    V = zeros(T, OS.n, OS.n)
    
    V[1,1] = V11
    V[2,1] = moment(OS, 1, 2) - expectation(OS,1)*expectation(OS,2) - T(1)
    
    for i in 3:OS.n
        V[i, 1] = covDavidJohnson(i, 1, OS.n)
    end
    normalizerel!(view(V, :, 1), [1,2])
    fillcov!(V, 1)
    j=1

    for j in 2:div(OS.n, 2)
        V[j,j] = var(OS, j)
        for i in j+1:OS.n-j+1
            V[i,j] = covDavidJohnson(i, j, OS.n)
        end
        normalizerel!(view(V, :, j), [collect(1:j); collect(OS.n-j+1:OS.n)])
        fillcov!(V, j)
    end
    
    if isodd(OS.n) # the middle point
        j = div(OS.n, 2)+1 
        V[j, j] = var(OS, j)
        normalizerel!(view(V, :, j), [j])
        fillcov!(V, j)
    end
    return Symmetric(V, :U)
end

covDavisStephens(N::Int) = covDavisStephens(NormOrderStatistic(N))

Again comparing against high-precision values of Parrish (N = 20; log10.(abs.(differences))):
[plot: log10 of absolute differences between the Davis-Stephens covariances and Parrish's values, N = 20]

The formula is pretty quick:

N = 2000
@time M = covDavisStephens(N)
@time invM = inv(M);
heatmap(abs.(invM./maximum(invM)), size=(800,700), yflip=true)
# 2.853262 seconds (12.26 M allocations: 365.602 MiB, 2.97% gc time)
# 0.552725 seconds (22 allocations: 62.043 MiB, 1.02% gc time)

[heatmap: normalized inverse of the Davis-Stephens covariance matrix, N = 2000]

Shapiro-Wilk coefficients

Finally the coefficients:

swcoeff(invcov, m) = ((m'*invcov)'/sqrt(m'*(invcov*invcov)*m))[1:div(length(m),2)]

Relative to the exact (= using the high-precision covariances of Parrish) values of the SW coefficients:

N=20
m = Float64.(expectation(NormOrderStatistic{BigFloat}(N)))
exact = -swcoeff(inv(Float64.(covd[N])), m)  # covd[N]: Parrish's high-precision covariance matrix
p = plot(yaxis=:log10)
# plot!(exact, label="Exact (Parrish)")
plot!(abs.(SWCoeffs(N).A)./exact, label="Royston")
plot!(abs.(swcoeff(eye(N), expectation(Blom(N))))./exact, label="Weisberg-Bingham")
plot!(abs.(swcoeff(inv(covDavisStephens(N)), m))./exact, label="DavisStephens")

[plot: Royston, Weisberg-Bingham, and Davis-Stephens coefficients relative to the exact (Parrish) SW coefficients, N = 20, log scale]

I still have no idea what Royston fitted his coefficients against, but that must have been pretty accurate!

For N = 2000, the third coefficient from Davis-Stephens is still larger than it should be by ~25% (relative to Royston):
[plot: Davis-Stephens SW coefficients relative to Royston's, N = 2000]

@andreasnoack (Member) commented:

@kalmarek Sorry for dropping the ball here. What is the current status?

@kalmarek (Contributor, Author) commented Jun 17, 2018

I also had too much to do, couldn't finish this cleanly;)

Long story short, we either:

  1. use the Royston approximation (already in place), forgetting that he fit it using 32 bits of precision (as everybody does),
  2. use Parrish's exact values of the covariances (N≤25) to compute exact SW coeffs, and Royston otherwise (as everybody should do), or
  3. try to get the order statistics going (and pre-compute the SW coeffs, e.g. for N ≤ 50 or 100), and use the Royston method above.

Happy to finish this, starting next week actually (conferences, etc.).

@Ph0non (Contributor) commented Nov 15, 2018

This dedicated work on the Shapiro-Wilk test is exceptionally fantastic 🥇

Is there any news on completing the work on the SWT?

@kalmarek (Contributor, Author) commented Nov 15, 2018

@Ph0non this effort (it was fun!) is an offspring of my frustration with the teaching I was forced to do.
And yes, that included teaching statistics ;-)
That semester is over, so I don't have enough frustration in store right now to forge it into code ;-)

But if you are willing to provide documentation, I may look into finishing this, though only in December (conferences abound).

@Ph0non (Contributor) commented Nov 16, 2018

Do you need documentation like the other tests in the package, or something more (with more theoretical background or so)?

@kalmarek (Contributor, Author) commented Nov 17, 2018

There are four things to be finished here:

  1. fix problems with accuracy
  2. find out what Royston fit his approximation to
  3. produce test suite
  4. write documentation

I volunteer to finish 1 (and maybe 2).
If I succeed, then 3 should be rather easy, but not a zero-work task. In the documentation (4) we should (among other things) carefully explain:

  • why we differ from standard packages (everybody out there is using Royston),
  • how we obtain our SW coeffs,
  • why everybody should be using exact order statistics in SW (at least for N≤50).

@Ph0non (Contributor) commented Nov 20, 2018

Some small fixes kalmarek#1

@albz commented Nov 28, 2018

Hi, thanks for the work done so far.

Have you checked the Python implementation?
It could be helpful; it would allow cross-checking the results as well as the coefficients.

I cannot remember right now, by heart, how the coefficients are computed in the Python version, but I believe they are for double precision and are not covered by any copyright. It could be helpful to cross-check this info.

I had also implemented a version in a similar fashion in Fortran at some point; it could come in handy.

@kalmarek (Contributor, Author) commented Nov 28, 2018

@albz could you point me to a Python implementation? The one from scipy
https://github.com/scipy/scipy/blob/v1.1.0/scipy/stats/morestats.py#L1304
uses the standard Fortran code; the approximating polynomials were not fit in double precision there.

Right now copyright is not a problem, since I wrote everything from scratch following the paper only.

@kalmarek (Contributor, Author) commented Nov 28, 2018

I see that people are generally interested, so here are some details. In the 1992 paper Royston:

  1. obtained (from 10000 MC samples) what he calls "exact" coefficients (A),
  2. took the Weisberg-Bingham approximation (C), and
  3. fit two quintic polynomials (in n^{-1/2}) to the differences A_n - C_n and A_{n-1} - C_{n-1}; the coefficients of those polynomials are visible in the Fortran code as C1 and C2;
  4. his coefficients for SW consist of the corrected C_n and C_{n-1} and the re-normalized rest of C (a sketch of this scheme is given below).
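
A rough sketch of that correction scheme in Julia (assuming the Blom-score convention; `c1poly` and `c2poly` are placeholders for the two quintic polynomials, whose actual coefficients are not reproduced here, so this is not a drop-in implementation):

using StatsFuns: norminvcdf

blom_scores(n) = [norminvcdf((i - 3/8) / (n + 1/4)) for i in 1:n]

# Sketch of Royston's (1992) correction; c1poly/c2poly stand for the quintics
# in 1/sqrt(n) (C1, C2 in swilk.f). For n ≤ 5 only the last coefficient is
# corrected; that case is ignored here.
function royston_coefs(n, c1poly, c2poly)
    m = blom_scores(n)                  # Blom approximation to E[X₍ᵢ₎]
    mm = sum(abs2, m)
    c = m ./ sqrt(mm)                   # Weisberg-Bingham coefficients
    u = 1 / sqrt(n)
    aₙ = c[n] + c1poly(u)               # corrected last coefficient
    aₙ₋₁ = c[n-1] + c2poly(u)           # corrected second-to-last coefficient
    # rescale the remaining entries so the whole vector stays unit-norm
    ϕ = (mm - 2m[n]^2 - 2m[n-1]^2) / (1 - 2aₙ^2 - 2aₙ₋₁^2)
    a = m ./ sqrt(ϕ)
    a[n], a[n-1] = aₙ, aₙ₋₁
    a[1], a[2] = -aₙ, -aₙ₋₁             # antisymmetry
    return a
end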

I'm happy with both options here:

  • providing pre-computed (but this time exact!) coefficients A for sizes up to e.g. 200, and then
  • comparing the exact coefficients (again) with Weisberg-Bingham and using Royston's method to approximate the rest, as above.

Note that my experiments with MC show that it cannot provide better precision than 1e-5 or 1e-6.

Then additional fitting must be provided for the distribution of the W-statistic.
Royston claims it's close to a lognormal, with two different sets of parameters (for two ranges of n) which need to be refitted (these are C3-C7 in the Fortran code).

@albz commented Nov 29, 2018

You are right, the Python/Fortran code is in single precision. That is of little help.

@andreasnoack (Member) commented:

@kalmarek It's really a shame that this never was completed given all the work you put into this. So hereby my encouragement to resume the work.

@kalmarek (Contributor, Author) commented:

@andreasnoack Thanks for the encouragement ;) However unlikely it is, the SW test is still in the back of my head.
Hopefully we will soon bring Arblib.jl to a usable state, and then I could use its certified integration to re-compute the SW coefficients.
The plan is just as it was before:

  • use exact SW coeffs for small sample sizes, say N≤200 (using Arblib),
  • use the Royston approximation afterwards (this is implemented) (but the SW test becomes pretty useless for large sample sizes :),
  • re-fit polynomials (log-polynomials?) to the distribution(s) of the W-statistic and find critical thresholds (TODO).

@kalmarek (Contributor, Author) commented Jan 9, 2021

@andreasnoack in kalmarek/ShapiroWilk#1 SWCoeffs are implemented using Arblib;
one can do an (embarrassingly parallel) computation e.g. for (n=50, prec=128), which gives ~7 digits of precision in a couple of minutes; one gets more than Float64 precision for prec=256.

To compute critical/p-values I'd need to

  1. approximate a normalizing transform (1-Wₙ)^t ~ N(µ, σ) (µ could be computed analytically, I'm not sure about σ) via MC
  2. fit a (exponential? biexponential?, inverse log?) function to t = t(n) and use it for all sample sizes
  3. use z-scores for the transformed W-statistic (see the sketch below)
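
For step 3, a hypothetical sketch of how a p-value would follow from such a transform (the function name and the fitted values τ, µ, σ are placeholders, not the final parametrization):

using Distributions

# Hypothetical: given fitted τ(n), µ(n), σ(n) such that (1 - W)^τ ≈ N(µ, σ),
# small W (departure from normality) maps to a large transformed value,
# so the p-value is the upper normal tail of the z-score.
function sw_pvalue(W, τ, µ, σ)
    z = ((1 - W)^τ - µ) / σ
    return ccdf(Normal(), z)
end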

I used off-the-shelf Optim.jl methods to do so:

using Statistics
using Random

using Distributions
using LinearAlgebra
using Optim


using Revise
using ShapiroWilk


function sample_Wstatistic(sw::ShapiroWilk.SWCoeffs{T}, sample_size) where T
    Wstats = zeros(T, sample_size)
    tmp = similar(Wstats, length(sw))
    for i in 1:sample_size
        randn!(tmp)
        sort!(tmp)
        Wstats[i] = ShapiroWilk.Wstatistic(tmp, sw)
    end
    Wstats
end;

function normal_power_fit(
    t,
    sample,
    qs = 0.005:0.005:0.995,
    normal_qs=quantile.(Normal(0,1), qs),
)

    sample_t = similar(sample)

    for idx in 1:length(sample_t)
        sample_t[idx] = (one(eltype(sample)) - sample[idx])^t
    end

    µ, σ = mean(sample_t), std(sample_t)
    for idx in 1:length(sample_t)
        sample_t[idx] = (sample_t[idx] - μ)/σ
    end

    #@assert issorted(sample_t)

    sample_t_qs = quantile(sample_t, qs, sorted=true)

    for idx in 1:length(sample_t_qs)
        sample_t_qs[idx] = sample_t_qs[idx] - normal_qs[idx]
    end

    return norm(sample_t_qs, 2)
end

optionally e.g.

@time let n = 30, prec=128
    OS = ShapiroWilk.OrderStatisticsArblib.NormOrderStatistic(n, prec=prec)
    @time ShapiroWilk.OrderStatisticsArblib._precompute_ψ(n, prec=prec, R=18.0)
end

then

exponents = map(5:30) do n
    prec=128
    qs=0.005:0.005:0.995

    OS = ShapiroWilk.OrderStatisticsArblib.NormOrderStatistic(n, prec=prec)
    @info "Computing coefficients..."
    @time swfl = ShapiroWilk.SWCoeffs(n, Float64.(ShapiroWilk.SWCoeffs(OS)))

    @info "Sampling W-statistic..."
    @time W_sample = sort!(sample_Wstatistic(swfl, 10_000_000), rev=true)

    @info "Fitting power-transform for normality..."
    @time res = let qs=qs
        normal_qs=quantile.(Normal(0,1), qs)

        optimize(
            x->normal_power_fit(first(x), W_sample, qs, normal_qs),
            0.01,
            0.5,
            Brent(),
        )
    end
    @info "For n=$n optimization yielded" res

    Optim.minimizer(res)
end

yields the normalizing exponents, i.e. τ such that (1-W)^τ ~ N(µ, σ). Fitting τ = τ(n) is straightforward, e.g.

┌ Info: For n=30 optimization yielded
│   res =
│    Results of Optimization Algorithm
│     * Algorithm: Brent's Method
│     * Search Interval: [0.010000, 0.500000]
│     * Minimizer: 5.386236e-02
│     * Minimum: 6.233191e-02
│     * Iterations: 12
│     * Convergence: max(|x - x_upper|, |x - x_lower|) <= 2*(1.5e-08*|x|+2.2e-16): true
│     * Objective Function Calls: 13

The whole table of these τs looks as follows:

[plot: fitted normalizing exponents τ(n) for n = 5:30]

@BioTurboNick commented:
Was just looking into Shapiro-Wilk, am I reading it right that it's pretty close?

@kalmarek (Contributor, Author) commented Jan 18, 2021

@BioTurboNick you can already use the code from this branch directly (the only thing that doesn't work is censored samples) for an implementation based on the Royston approximation of the S-W test (well, updating to modern HypothesisTests will require a bit of coding, but not much).

What is (relatively) close is a new approximation that I'm doing in my spare time ;) Whether it will make a difference, I don't know, but it will be much closer to "mathematical correctness" (whatever that means:).


EDIT: just to let you know: everybody else (R, scipy) is using the same old Fortran implementation, which will be comparable to what is in this branch.

@kalmarek (Contributor, Author) commented:

@andreasnoack or anyone else in this thread ;) I'd need your advice.

I have the SW-coeffs computed up to eps(Float64) precision for n=3:189 (I could go further, but I wasn't patient enough ;).

What is left are the critical levels for the W-statistic, i.e. wstatcdf. Royston/Parrish use different normalizing transforms (functions that take W to Z ~ N(0, 1)):

  • for n∈4:11, approximate/interpolate the pdf/cdf of W=Wₙ using MC. Royston in 1993 suggests -ln(γ-ln(1-W)) as the normalizing transform, but this seems very crude, given that the pdf seems non-differentiable. I'm not even sure this is taken into account in the existing Shapiro-Wilk implementations.
  • find a normalizing transform for n∈12:2000 and estimate its parameters based on MC for n∈12:189. Here there are two candidates: W → (ln(1-W) - µ)/σ (Royston) or W → ((1-W)^t - µ)/σ (Parrish):

After looking closely at MC samples I arrived at this:

  • n = 3 → analytic solution
  • n = 4:6 → MC + Piecewise fit ??
  • n = 7:28 → Parrish's power transform (1-W)^t ~ N(µₙ,σₙ): we need to store (tₙ, µₙ, σₙ)
  • n >= 29 → Royston's lognormal log(1-W) → N(µ′ₙ, σ′ₙ): fit biexponential/some other models to µ′ and σ′ on n = 29:189 and use them for all n.

(The transition Parrish → Royston is determined based on MSE on ~200 uniformly distributed quantiles)

What do you think?

I asked about the shape of the W-distributions on discourse: https://discourse.julialang.org/t/can-anyone-tell-me-what-is-this-distribution/53726 but got no input ;)

@kalmarek (Contributor, Author) commented Mar 2, 2021

OK, it seems that I was fitting the normal distribution wrong ;) The Box-Cox transform W → (1-W)^λ/λ for n in 7:189 seems superior to the shifted logarithmic one when optimizing the log-likelihood (currently using BoxCoxTrans.jl).

Using 1_000_000 MC samples of the W-statistic under normality (for each n∈7:189) I computed λ (the exponent of the Box-Cox transform), and µ and σ (the mean and std of the transformed sample of W) as functions of n:

  • inverse(t, a₀, a₁, γ) = a₀ + a₁/(t - γ)
  • logarithmic(t, a₀, a₁, γ) = a₀ + a₁*log(t - γ).
    The models (green) were fit using least squares for n = 7:140 (blue) and you can see how well they fit the "test" data (in red). The bottom plot shows normalized residuals with their mean and std. (A possible fitting setup is sketched below the plots.)

[plots (with PDF versions) of the fitted models for λ(n), µ(n), and σ(n) over n = 7:140]
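
A possible setup for such fits, e.g. with LsqFit.jl (a sketch under stated assumptions: `ns` and `λs` stand for the tabulated sample sizes and the MC-estimated Box-Cox exponents; the starting values are illustrative, and this is not necessarily what was actually used):

using LsqFit

# the two model families mentioned above, in LsqFit's (x, params) form
inverse(t, p)     = p[1] .+ p[2] ./ (t .- p[3])        # a₀ + a₁/(t - γ)
logarithmic(t, p) = p[1] .+ p[2] .* log.(t .- p[3])    # a₀ + a₁*log(t - γ)

ns = collect(7.0:140.0)
λs = rand(length(ns))            # placeholder for the MC-estimated λ(n) values

fit = curve_fit(inverse, ns, λs, [0.0, 1.0, 0.0])
fit.param                        # fitted (a₀, a₁, γ); same pattern for `logarithmic`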

@andreasnoack As I really have very little knowledge about statistics/model fitting/parameter estimation, I need someone knowledgeable to have a look at this and provide a stamp of approval ;)

@jarredclloyd (Contributor) commented:

@kalmarek @andreasnoack just curious if this is still being looked into/implemented

@kalmarek (Contributor, Author) commented Dec 9, 2022

@jarredclloyd this kind of stalled because I was not able to produce meaningful fits (except eyeballing plots and saying: this looks seriously gooood...). If you're willing to help (and have more experience with this), be my guest.

@jarredclloyd (Contributor) commented:

Definitely willing to help, but I think the specific problem you are trying to overcome might be beyond my level of knowledge currently.
Correct me if I'm wrong, but the implementation here has a working version consistent with other languages?
I think it would be great to have this implemented in HypothesisTests so that Julia has a functional Shapiro-Wilk test, especially given all the fantastic work you've done on this. From there we could attempt to resolve the fitting issue (I can ask some of my colleagues who might be able to help).
So, what needs to be completed to have a functional version of Shapiro-Wilk that can be merged into the main branch?

@kalmarek (Contributor, Author) commented:

@nalimilan thanks for your review and patience. It seems ready to be merged. Let's celebrate 🎉 (and don't forget to squash ;P)

Comment on lines 34 to 37
struct ShapiroWilkCoefs <: AbstractVector{Float64}
N::Int
A::Vector{Float64}
end
@devmotion (Member) commented Oct 19, 2023

Why is it necessary to construct a new array type? Isn't a function that constructs a Vector{Float64} with the desired coefficients A and a function that performs the desired computation with them simpler?

Generally, it's quite non-trivial to implement the AbstractArray interface correctly and most operations with the resulting array type will fall back to the generic implementations in Base and LinearAlgebra instead of exploiting optimizations for Vector{Float64}.

@kalmarek (Contributor, Author) commented:

It is not necessary; SW coeffs are antisymmetric and we're computing and storing just half of them.
It'd be easy enough to collect them, store in ShapiroWilkCoefs and use internally (we might gain something performance-wise by this, but only when they are reused). Distinguishing them from Vector{Float64} makes the interface and signatures clearer imo.

@devmotion Was this just a question or rather a suggestion for change? If it is the latter I'd appreciate a set of concrete suggestions ;)

@devmotion (Member) commented:

It started as a question but now I'd actually suggest changing the implementation.

Note that I did not mean to change the efficiency of the algorithm, I don't think you necessarily have to store all coefficients. I was suggesting to store and use the A field directly in the test. More concretely, make the A coefficients a field of the test struct, and then reimplement unsafe_swstat in a way that takes the vector of coefficients directly (instead of providing its wrapper).

@kalmarek (Contributor, Author) commented:

@devmotion What is the benefit of dropping this abstraction in favour of a plain Vector{Float64}? On the performance side, since ShapiroWilkCoefs is immutable, its creation will most probably be elided by the compiler; on the programmer side it makes clear what kind of vector must be passed, and that it is not an arbitrary Vector{Float64}. Users may pass the coefficients in for re-use, and I find it very fitting to provide a thin (no-cost?) wrapper to store the coefficients.


After doing a bit of benchmarking, I think we gain very little by storing only half of the coefficients, and we lose dispatch to BLAS.dot (the generic one isn't that much slower though). I'm fine with changing the implementation to store the whole vector. I'd rather keep ShapiroWilkCoefs though.

A reviewer (Member) commented:

If we store the whole vector, I agree with @devmotion that we may as well take a plain Vector{Float64}. Wrapping values in ShapiroWilkCoefs doesn't add anything.

A reviewer (Member) commented:

I'd argue that the wrapper should be removed regardless of whether the full vector is stored or not. As you noticed in your benchmarks, the wrapper array type is often not performant since you miss optimizations/BLAS. I don't think there's a major benefit from a user or correctness perspective either: you can always provide nonsensical coefficients, and as a user you would always construct them with some documented and exported function without caring about their type.

Taken together, I think you should remove the wrapper and possibly store half of the coefficients. I wonder if you also benchmarked the version based on the Vector{Float64} with half of the coefficients (dot rewritten as a sum of two dot operations)?

@kalmarek (Contributor, Author) commented Oct 23, 2023

for simplicity I went with the plain vector storing all of the coefficients.

a user you would always construct them with some documented and exported function

shapiro_wilk_coefs is not exported and neither was ShapiroWilkCoefs (suggestions for other names welcome).

I did not benchmark the two-dot products, as SW coefs are antisymmetric and dot is memory-bound anyway. (For reference, the rewrite in question would look roughly like the sketch below.)
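
A sketch of the "two dots" rewrite being discussed (hypothetical helper, assuming aᵢ = -aₙ₊₁₋ᵢ with only the first half stored):

using LinearAlgebra: dot

# `ahalf` holds the first n ÷ 2 coefficients; by antisymmetry the full dot
# product ∑ aᵢ x₍ᵢ₎ splits into two dots over the ends of the sorted sample
# (the middle coefficient is zero for odd n, so it can be skipped).
function halves_dot(ahalf::AbstractVector, x_sorted::AbstractVector)
    n = length(x_sorted)
    k = length(ahalf)
    return dot(ahalf, view(x_sorted, 1:k)) -
           dot(ahalf, view(x_sorted, n:-1:(n - k + 1)))
end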

Comment on lines +75 to +76
m = mean(X)
S² = sum(x -> abs2(x - m), X)
@devmotion (Member) commented:

Suggested change:
  m = mean(X)
- S² = sum(x -> abs2(x - m), X)
+ S² = moment(X, 2)

@kalmarek (Contributor, Author) commented:

@devmotion isn't moment normalized (i.e. integrated w.r.t. the uniform measure on X)? Then this should be S² = moment(X, 2)*length(X).

But the real question is: did you benchmark this? On my machine this is (2-6)× slower (N ∈ 2.^7:15, increasing with N) than before.
(Which, incidentally, makes it even slower than the non-BLAS dot for the old SWCoeffs; starting at N=2^7, the factor increases to ~8 for N=2^15.)

If we really care about perf I'd rather revert this.
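
For reference, the comparison in question would look roughly like this (hypothetical sizes; `moment` here is StatsBase's normalized central moment):

using BenchmarkTools, Statistics
using StatsBase: moment

X = randn(2^15)
m = mean(X)
@btime sum(x -> abs2(x - $m), $X)     # the explicit sum currently in the PR
@btime moment($X, 2) * length($X)     # the suggested moment-based variant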

@devmotion (Member) commented:

We should improve moment if it is slow. But regardless, you're right of course that the moment will be normalized. I'm fine with reverting it.

m .= m ./ sqrt(ϕ) # A, but reusing m to save allocs
m[1], m[2] = a₁, a₂
end
resize!(m, N)
@devmotion (Member) commented:

This is not needed, it seems?

Suggested change:
- resize!(m, N)

@kalmarek (Contributor, Author) commented:

Why do you think so, @devmotion? To me it seems very much needed.

@devmotion (Member) commented:

Because you create m as a Vector{Float64}(undef, N) a few lines above.

@kalmarek (Contributor, Author) commented:

Yes, but then I resize it to length n = N÷2 and compute only the first half (normalizing w.r.t. n). Only then do I fill the rest of the vector, below line 62; to do so I need to bring back the old size.

@devmotion (Member) commented:

I missed this. Can you remove all resize! calls? I think you should be able to work with a view instead of resizing the array.

@kalmarek (Contributor, Author) commented:

yes, @views are much better than resizes ;)
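
A hypothetical sketch of the view-based pattern (names illustrative, not the PR's actual code):

N = 11
m = Vector{Float64}(undef, N)
n = N ÷ 2
mhalf = view(m, 1:n)            # work on the first half in place, no resize! needed
for i in 1:n
    mhalf[i] = float(i)         # stand-in for the actual computation of the half
end
# afterwards fill the second half using the antisymmetry of the coefficients
for i in 1:n
    m[N - i + 1] = -m[i]
end
isodd(N) && (m[n + 1] = 0.0)    # the middle coefficient vanishes for odd N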

Marek Kaluba and others added 4 commits October 23, 2023 12:10
Co-authored-by: David Widmann <devmotion@users.noreply.github.com>
Co-authored-by: David Widmann <devmotion@users.noreply.github.com>
@kalmarek (Contributor, Author) commented:

now tests on 1.3 fail with

┌ Warning: Deprecated syntax `"begin" inside indexing expression` at /home/runner/work/HypothesisTests.jl/HypothesisTests.jl/src/shapiro_wilk.jl:74.
└ @ ~/work/HypothesisTests.jl/HypothesisTests.jl/src/shapiro_wilk.jl:74
ERROR: LoadError: LoadError: syntax: extra token ")" after end of expression
Stacktrace:
 [1] top-level scope at /home/runner/work/HypothesisTests.jl/HypothesisTests.jl/src/shapiro_wilk.jl:74

due to @view(A[begin:(begin + length(X) - 1)]).
I'm actually surprised that this syntax is not valid on Julia 1.3!
@devmotion since you suggested this change, what would be the proper fix (to keep it indexing-agnostic and allocation-free)?

@kalmarek (Contributor, Author) commented Nov 2, 2023

@devmotion I took the liberty of reverting/modifying the suggestion on begin-based indexing, so that Julia 1.3 parses the expression.

To be honest, I'd be very glad to have this merged as soon as possible so that I can cross it off my list... It's been sitting here for long enough.

@nalimilan (Member) commented:

You can do view(A, firstindex(X) .+ (0:length(X) - 1)) instead I think.

@kalmarek (Contributor, Author) commented Nov 6, 2023

You can do view(A, firstindex(X) .+ (0:length(X) - 1)) instead I think.

that's what I did.

@kalmarek (Contributor, Author) commented:

@devmotion I'd like to ask you again to have a look at the changes you requested;
if there is something more to be done here, let me know; if not, please merge.

@jarredclloyd (Contributor) commented:

Friendly bump, hoping this is still on the radar for review/merging @devmotion @nalimilan @andreasnoack as it looked to be in a ready state, or only minor adjustments needed.

* Too much data is censored (`censored / N > 0.8`)

# Implementation notes
* The current implementation DOES NOT implement p-values for censored data.
@ararslan (Member) commented:

What is the utility of allowing the user to compute the Shapiro-Wilk test statistic at all if they can't get its p-value nor view the object directly in the REPL (since show calls pvalue)? That's an honest question; I would guess that many users would try it and get frustrated without checking the docstring but that is a naive assumption.

I'd probably recommend commenting out the censoring-related functionality for the time being, then someone can implement it fully in a follow-up PR. Thoughts on that?

@kalmarek (Contributor, Author) commented:

@ararslan feel free to suggest (via gh suggestions) any edits that would suit your proposal

@ararslan (Member) commented:

I guess it was half question, half proposal—do you envision a use case for this with censored data as it is currently implemented in this PR (no pvalue/show)? If not, I can make those suggestions.

@kalmarek (Contributor, Author) commented:

Well, the W statistic can still be useful if somebody wants to use Section 3.3 of Royston 1993 to calculate significance levels directly.

I didn't implement those methods back then since the significance levels there are only empirical, and until I need to do it at work I don't see myself doing this now either ;)

If those suggestions push this PR over the final line, please make them!

kalmarek and others added 4 commits May 23, 2024 09:41
Co-authored-by: Alex Arslan <ararslan@comcast.net>
Co-authored-by: Alex Arslan <ararslan@comcast.net>
Co-authored-by: Alex Arslan <ararslan@comcast.net>
@kalmarek (Contributor, Author) commented Oct 2, 2024

Friendly bump, hoping this is still on the radar for review/merging @devmotion @nalimilan @andreasnoack as it looked to be in a ready state, or only minor adjustments needed.

This is already starting to feel silly ;)
Apparently this functionality is not needed, since nobody wants to take responsibility for this merge. If this last bump doesn't do the job, I'll simply close the PR.

@andreasnoack (Member) left a review comment:

Thanks for another reminder, but, first of all, thanks for spending so much time on this over so many years. Unfortunately, we are all busy with other tasks, which has delayed this review beyond any reasonable limits. I think we have all now spent enough time on this, and any subsequent modifications should be delegated to other PRs.

@andreasnoack merged commit eb93eec into JuliaStats:master on Oct 2, 2024
10 checks passed