Shapiro-Wilk normality test #124

Merged · 68 commits · Oct 2, 2024

Conversation

@kalmarek (Contributor) commented Jan 4, 2018

implements ShapiroWilkTest following

PATRICK ROYSTON
Approximating the Shapiro-Wilk W-test for non-normality
Statistics and Computing (1992) 2, 117-119
DOI: 10.1007/BF01891203
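
For reference, the statistic being implemented is W = (∑ aᵢ x₍ᵢ₎)² / ∑ (xᵢ - x̄)², with a the full (antisymmetric) vector of Shapiro-Wilk coefficients. A minimal sketch (not the exact code in this PR; names are illustrative):

using LinearAlgebra: dot

# W = (∑ aᵢ x₍ᵢ₎)² / ∑ (xᵢ - x̄)²; `x_sorted` is assumed to be sorted and
# `a` is the full coefficient vector -- a sketch only, not the PR's implementation.
function W_statistic(x_sorted::AbstractVector, a::AbstractVector)
    x̄ = sum(x_sorted) / length(x_sorted)
    return abs2(dot(a, x_sorted)) / sum(x -> abs2(x - x̄), x_sorted)
end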

This is work in progress, all comments are welcome!
I tried to follow the paper closely, but copied e.g. some precomputed constants from the original swilk.f.
Please let me know if you are OK (license-wise) with this.

These polynomial interpolations are computed at load time for speed.
Currently:

julia> using BenchmarkTools
julia> srand(1);
julia> k = 5000;
julia> X = sort(randn(k));
julia> @btime ShapiroWilkTest(X);
  112.523 μs (18 allocations: 137.45 KiB)
julia> swc = SWCoeffs(k);
julia> @btime ShapiroWilkTest(X, swc);
  62.476 μs (11 allocations: 78.50 KiB)

whereas calling swilkfort directly

julia> srand(1);
julia> X = sort(randn(5000));
julia> A = zeros(2500);
julia> @btime swilkfort!(X, A);
  205.763 μs (10 allocations: 224 bytes)
julia> @btime swilkfort!(X, A, false);
  75.692 μs (10 allocations: 224 bytes)

Still missing:

  • documentation
  • censored data
  • polynomial interpolation with better precision (?),

and probably more.

I tried to compute exact values of SWCoeffs (via Monte Carlo simulation), but the results I'm getting are off from the ones reported in Table 1 of op. cit. Would be glad if anyone could help.

function approximate_A(N, samps)
    s = [sort(randn(N)) for i in 1:samps]
    m = sum(s[i][1:N] for i in 1:samps)/samps   # MC estimate of the expected order statistics
    ss = vcat(s...)
    Vinv = inv(cov(ss, ss))
    A = (-m'*Vinv/sqrt(m'*Vinv*Vinv*m))'
    return ((A - reverse(A))/2)[1:div(N,2)]     # antisymmetrize and keep the first half
end;

approximate_A(10,1000000) results in

[0.5469493640066541, 0.3559315327467397, 0.23322491989287789, 0.1336531031109471, 0.043609401791162974]

Compare swilk.f's values and SWCoeffs(10).A:

[0.5737146934337275, 0.3289700603561813, 0.21434902886647542, 0.12279063037794058, 0.04008871216291867]
[0.5737147066903874, 0.3289700464878161, 0.21434901803439887, 0.1227906248657577,  0.04008871105102476]

Rahman & Govindarajulu

A different implementation of SWCoeffs following

M. Mahibbur Rahman & Z. Govindarajulu,
A modification of the test of Shapiro and Wilk for normality
Journal of Applied Statistics 24:2, 219-236, 1997
DOI: 10.1080/02664769723828

function HypothesisTests.SWCoeffs(N::Int, ::Type{Val{:Rahman}})
    M = [norminvcdf(i/(N+1)) for i in 1:N];
    normM = normpdf.(M)

    c = (N+1)*(N+2)
    d = 2c.*normM.^2
    dl= [-c*normM[i]*normM[i+1] for i in 1:N-1]
    Vinv = SymTridiagonal(d, dl)
    C = sqrt(M'*(Vinv*Vinv)*M)

    Astar = -(M'*Vinv)'/C
    return SWCoeffs(N, Astar[1:div(N,2)])
end

But they don't provide critical values for their statistic (or I couldn't find them), so that would require further simulation. Also their SWCoeffs(10, Val{:Rahman}) are quite different:

[0.6010309510135708, 0.2975089851946684, 0.19212508233094944, 0.10979215067183574, 0.03583065785524651]

On the other hand, Algorithm AS R94 is a standard.

@andreasnoack (Member) commented Jan 5, 2018

Thanks for contributing this. Your implementation looks good. The main issue is that swilk.f is written for Float32, not Float64, so you'll probably need to compute new polynomial coefficients for the Float64 implementation.

Another issue is the license. http://lib.stat.cmu.edu/apstat/ states that

The Royal Statistical Society holds the copyright to these routines, but has given its permission for their distribution provided that no fee is charged.

which is not compatible with the MIT license, so you should either try to implement these directly from the paper or contact the Royal Statistical Society and ask whether they have changed their mind regarding the clause. The best outcome would be if they could use a real license.

@kalmarek (Contributor, Author) commented Jan 7, 2018

The main issue is that swilk.f is written for Float32, not Float64, so you'll probably need to compute new polynomial coefficients for the Float64 implementation.

Yep, I tried to do this, but failed to obtain the "true" SWCoeffs by semi-exact means.

Even obtaining the "exact" expected values of the order statistics E(r:n) (our vector m) seems to be a problem. Convention: we compute the first half of the negated N(0,1) order statistics.

A simple MC produces highly variable results:

function mMC(N::Int,samps=10^5)
    X = Array{Float64}(N)
    s = zeros(div(N,2))
    for i in 1:samps
        for j in 1:N
            X[j] = randn()
        end
        sort!(X)
        for j in 1:div(N,2)
            s[j] += X[j]
        end
    end
    return -s/samps
end
d = mMC(101, 10^7) - mMC(101, 10^7);
norm(d,1)
0.0019123699000122493

More or less exact order statistics can be obtained by direct integration, e.g. following

Algorithm AS 177,
Royston, J.P.
Expected Normal Order Statistics (Exact and Approximate)

But I don't know how exact normcdf etc. and quadgk are, i.e. how exact is the following method?

function Royston(n, r, logs=[log(i) for i in 1:n], o=7*sqrt(n))
    # C = log(n! / ((r-1)!(n-r)!))
    C = sum(logs) - sum(view(logs, 2:r-1)) - sum(view(logs, 2:n-r))
    # I(x): log-density of the r-th largest N(0,1) order statistic; the integrand is x*exp(I(x))
    I(x) = C + (r-1)*log(normccdf(x)) + (n-r)*log(normcdf(x)) - 1/2*(log(2π) + x^2)
    return quadgk(x->x*exp(I(x)), -9.0-log(n), 9.0+log(n),
            order=floor(Int, o),
            maxevals=floor(Int, log(n)*10^7))
end
function mRoyston(n, o=7*sqrt(n))
    logs=[log(i) for i in 1:n]
    m = [Royston(n,r, logs, o) for r in 1:div(n,2)]
    err = norm([i[2] for i in m],1)
    @show err
    return [i[1] for i in m]
end

Royston's Algorithm AS R94 uses the Blom approximation:

Blom(n,r) = -norminvcdf((r - 3/8)/(n + 1/4))
mBlom(n) = [Blom(n,r) for r in 1:div(n,2)]

and Rahman (cited above) uses the simpler

Rahman(n,r) = -norminvcdf(r/(n + 1))
mRahman(n) = [Rahman(n,r) for r in 1:div(n,2)]

Comparing

I believe the direct integration to be the most accurate, so I'd lean towards regarding it as "true". Comparing other methods against it gives

const N = 5000
m_Ro = mRoyston(N);
m_mc = mMC(N, 10^6);
m_Bl = mBlom(N);
m_Rh = mRahman(N);

p = plot(yaxis=:log10, title="Relative to Royston",size=(800,600))
for (X, l) in [(m_mc, "MonteCarlo"), 
        (m_Bl, "Blom approximation"), 
        (m_Rh, "Rahman approximation"), 
   ]
    plot!(abs.((X-m_Ro)./m_Ro), label=l)
end
p

[plot "Relative to Royston": Monte Carlo, Blom, and Rahman approximations of m relative to direct integration, log scale]

Questions:

  1. Is storing/caching values of m (and later of SWCoeffs) an option (for sizes N in 4:1:5000)?
  2. Should we follow AS R94 for the ShapiroWilk test, or should we fit our own curves for distributions of W?

@andreasnoack (Member) commented:

But I don't know how exact normcdf etc. and quadgk are, i.e. how exact is the following method?

Both should be high quality, and quadgk can provide an estimate of the error. Notice also that there is normlogcdf, which might give you a little extra accuracy over the explicit log.

So all-in-all I think this should be fine for computing the mean order statistics. Have you tried to compute the covariances yet? They are needed for fitting the polynomials, right?

  1. Is storing/caching values of m (and later of SWCoeffs) an option (for sizes N in 4:1:5000)?

I guess we could do that. We could also try to do this as part of a generated function that would compute everything on demand but it might make the first call pretty slow. It would be pretty cool on the other hand.

Should we follow AS R94 for the ShapiroWilk test, or should we fit our own curves for distributions of W?

Given how inaccurately they computed the order statistics back then, I guess it would be good if we could recompute them.

Another thing: do you have experience with the Weisberg-Bingham version? It is so much easier to compute, so do you know why people seem to prefer SW? Path dependency? I tried to look for power comparisons but couldn't find anything that suggested it to be significantly worse.

@kalmarek (Contributor, Author) commented Jan 8, 2018

For the covariances I plan to do the following: [DavisStephens1977] gives an algorithm to compute the covariance matrix of normal order statistics given

  1. $V_{11}$, the variance of the largest order statistic → approximated by [SheaScallon1988]
  2. $EX_1$, $EX_2$, the expected values of the largest and second largest order statistics → ~exact by [Royston1982]
  3. $\Sigma^2$, the sum of squares of order statistics → approximated (with bounds on the error) by [Balakrishnan1984]
  • [DavisStephens1977]: C. S. Davis & M. A. Stephens The covariance matrix of normal order statistics Communications in Statistics - Simulation and Computation Vol. 6, Iss. 1, 1977 DOI:10.1080/03610917708812028
  • [SheaScallon1988]: B. L. Shea and A. J. Scallon Remark AS R72: A Remark on Algorithm AS 128. Approximating the Covariance Matrix of Normal Order Statistics Journal of the Royal Statistical Society. Series C (Applied Statistics) Vol. 37, No. 1 (1988), pp. 151-155 DOI:10.2307/2347514
  • [Royston1982]: J. P. Royston Algorithm AS 177: Expected Normal Order Statistics (Exact and Approximate) Journal of the Royal Statistical Society. Series C (Applied Statistics) Vol. 31, No. 2 (1982), pp. 161-165 DOI:10.2307/2347982
  • [Balakrishnan1984]: N. Balakrishnan Algorithm AS 200: Approximating the Sum of Squares of Normal Scores Journal of the Royal Statistical Society. Series C (Applied Statistics) Vol. 33, No. 2 (1984), pp. 242-245 DOI:10.2307/2347458

But I may not have enough time to do this during this week.

We could also try to do this as part of a generated function that would compute everything on demand but it might make the first call pretty slow. It would be pretty cool on the other hand.

We may have a look at speeds and decide when I finish the plan above; but I guess a cached (or @memoized?) version of SWCoeffs(N::Int) will be the way to go.

Another thing: do you have experience with the Weisberg-Bingham version? It is so much easier to compute, so do you know why people seem to prefer SW? Path dependency? I tried to look for power comparisons but couldn't find anything that suggested it to be significantly worse.

Royston's AS R94 is essentially Weisberg-Bingham + corrections + fitting for the distribution of the W-statistic. AS R94 uses Blom's approximation of the expected normal order statistics m and then corrects just the first two coefficients.
This is an approximation of (the perfect) Shapiro-Wilk, if you like. If we can (on modern hardware) do it the "exact" way (and make it fast), maybe we should?

@kalmarek (Contributor, Author) commented Jan 8, 2018

Another thing, do you have experience with Weisberg-Bingham version?

As a matter of fact, I'm not a statistician, doing pure math, coding in julia "after hours" ;-). But for most of my "empirical science" friends SW was the one to go to, and seems to have the best power for small samples.

@kalmarek (Contributor, Author) commented:

OK, variances/covariances turned out to be harder than I thought:
I implemented the exact formulas from Godwin's Some Low Moments of Order Statistics in normordstats.jl, please have a look. These become numerically unstable for n ~ 25, but for n = 20 they agree to ~10 digits with Table 1 of Parrish. He used 128 bits of precision to compute tables of products of order statistics up to n = 50 (and I guess he didn't progress any further because of loss of precision). So maybe my cheap implementation isn't that bad ;-)

(BTW, do you by any chance have access to a paper copy of Parrish's article (to scan and send me)? The digital copy I have has no text layer and is of such poor quality that OCRing those tables produces errors in every row. Which, for an article that costs $50 to BORROW for 24h, is OUTRAGEOUS.)

With these I can tell that Royston's approximation of SWCoeffs is surprisingly good, whereas Rahman's is (surprisingly) bad. Maybe there is an error in my implementation of the latter:

blom(n,i) = norminvcdf((i - 3/8)/(n + 1/4))
rahman(n,i) = norminvcdf(i/(n + 1))

struct Blom
    n::Int
end

struct Rahman
    n::Int
end

OrderStatistics.expectation(B::Blom) = [blom(B.n,i) for i in 1:B.n]
OrderStatistics.expectation(R::Rahman) = [rahman(R.n,i) for i in 1:R.n]

# M. Mahibbur Rahman & Z. Govindarajulu,  
# A modification of the test of Shapiro and Wilk for normality  
# *Journal of Applied Statistics* 24:2, 219-236, 1997  
# DOI: [10.1080/02664769723828](http://dx.doi.org/10.1080/02664769723828)

function invV(N, M)
    pdfM = normpdf.(M)
    d = pdfM.^2
    dl= [-pdfM[i]*pdfM[i+1] for i in 1:N-1]
    
    return (N+1)*(N+2)*SymTridiagonal(d, dl)
end

function HypothesisTests.SWCoeffs(R::Rahman)    
    M = expectation(R)
    A = invV(R.n, M)
    C = sqrt(M'*(A*A)*M)

    Astar = (M'*A)'/C
    return SWCoeffs(R.n, Astar[1:div(R.n,2)])
end

function HypothesisTests.SWCoeffs(OS::NormOrderStatistic)
    m = expectation(OS)
    iV = inv(cov(OS))
    A = m'*iV./sqrt(m'*iV*iV*m)
    return SWCoeffs(OS.n, A[1,1:div(OS.n,2)])
end

Then:

N = 25
@time ExactA = SWCoeffs(NormOrderStatistic(N));
plot(SWCoeffs(N).A, label="Blom + 0-correlation OrdStats")
plot!(SWCoeffs(Rahman(N)).A, label="Rahman + limit inverse correlation")
plot!(-ExactA.A, label="Exact formulas, BigFloat")

[plot: Blom-, Rahman-, and exact-formula-based SW coefficients for N = 25]
(you can see the instability kicking in)

Right now I cannot think of anything better than the above-mentioned approximation by Davis & Stephens (using a Taylor expansion of the expectation of products of order statistics).

Using BigFloat, ArbFloat, etc. will not help much, as the precision of e.g. erf, and hence normcdf, is limited to Float64, at least for now.

@kalmarek (Contributor, Author) commented Mar 2, 2018

It took more time than I had predicted, but it turned out to be a fun project ;-)
I'm looking for a piece of advice on what to do with the code, based on the results below. At the moment I think that numerically the Shapiro-Wilk test should be left "as is", until someone can point out what Royston fit his polynomials against. We should of course use the exact values of Parrish for N≤20 (or even N≤50 if I am able to fix the loss of precision mentioned below).

Normal order statistics

For the implementation see https://github.com/kalmarek/HypothesisTests.jl/tree/normordstats
Using this code you can confirm the covariances computed by Parrish (the 25dp values) for N≤20, but beyond that point there seems to be precision loss, which I was not able to trace.

David-Johnson formula:

# successive derivatives of G = Φ⁻¹ (the normal quantile) with respect to p,
# evaluated at x = G(p); used in the David-Johnson expansion below
function diffG(x)
    s = normpdf(x)
    s² = s^2
    s⁴ = s²*s²
    x² = x^2
    x⁴ = x²*x²
    return (
        1/s,
        x/s²,
        (1 + 2x²)/(s*s²),
        x*(7 + 6x²)/s⁴,
        (7 + 46x² + 24x⁴)/(s*s⁴),
        x*(127 + 326x² + 120x⁴)/(s²*s⁴)
        )
end
function covDavidJohnson(i::Int, j::Int, n::Int, order=3)
    if j < i
        return covDavidJohnson(j, i, n, order)
    end
        
    pᵢ, pⱼ = i/(n+1), j/(n+1)
    qᵢ, qⱼ = 1 - pᵢ, 1 - pⱼ
    Gᵢ = diffG(norminvcdf(pᵢ))
    Gⱼ = diffG(norminvcdf(pⱼ))
    X = 1/(n+2)
    T = Vector{eltype(Gᵢ)}()
    if order >= 1
        T1 = X  *pᵢ*qⱼ*Gᵢ[1]*Gⱼ[1]
        push!(T, T1)
    end
    if order >= 2
        T2 = X^2*pᵢ*qⱼ*(
                (qᵢ-pᵢ)*Gᵢ[2]*Gⱼ[1] + 
                (qⱼ-pⱼ)*Gᵢ[1]*Gⱼ[2] + 
                0.5(
                    pᵢ*qᵢ*Gᵢ[3]*Gⱼ[1] +
                    pⱼ*qⱼ*Gᵢ[1]*Gⱼ[3] +
                    pᵢ*pⱼ*Gᵢ[2]*Gⱼ[2] 
                    )
                )
        push!(T, T2)
    end
    if order >= 3
        T3 = X^3*pᵢ*qⱼ*(
                -(qᵢ-pᵢ)*Gᵢ[2]*Gⱼ[1] - (qⱼ-pⱼ)*Gᵢ[1]*Gⱼ[2] + 
                ((qᵢ-pᵢ)^2 - pᵢ*qᵢ)*Gᵢ[3]*Gⱼ[1] + 
                ((qⱼ-pⱼ)^2 - pⱼ*qⱼ)*Gᵢ[1]*Gⱼ[3] +
                1/2*(3(qᵢ-pᵢ)*(qⱼ-pⱼ) + pⱼ*qᵢ - 4pᵢ*qⱼ)*Gᵢ[2]*Gⱼ[2] + 
                5/6*pᵢ*qᵢ*(qᵢ-pᵢ)*Gᵢ[4]*Gⱼ[1] + 
                5/6*pⱼ*qⱼ*(qⱼ-pⱼ)*Gᵢ[1]*Gⱼ[4] + 
                (pᵢ*qⱼ*(qᵢ-pᵢ) + 0.5pᵢ*qᵢ*(qⱼ-pⱼ))*Gᵢ[3]*Gⱼ[2] + 
                (pᵢ*qⱼ*(qⱼ-pⱼ) + 0.5pⱼ*qⱼ*(qᵢ-pᵢ))*Gᵢ[2]*Gⱼ[3] + 
                1/8*(pᵢ^2*qᵢ^2*Gᵢ[5]*Gⱼ[1] + pⱼ^2*qⱼ^2*Gᵢ[1]*Gⱼ[5]) + 
                1/4*(pᵢ^2*qᵢ*qⱼ*Gᵢ[4]*Gⱼ[2] + pᵢ*pⱼ*qⱼ^2*Gᵢ[2]*Gⱼ[4]) + 
                1/12*(2pᵢ^2*qⱼ^2 + 3pᵢ*pⱼ*qᵢ*qⱼ)*Gᵢ[3]*Gⱼ[3]
                )
        push!(T, T3)
    end
    return sum(T)
end

Comparing to the high-precision (25d) values obtained by Parrish (N = 20; log10.(abs.(differences))):
[plot: log10 of absolute differences between the David-Johnson approximation and Parrish's 25-digit covariances, N = 20]

Davis-Stephens covariance matrix of normal order statistics

This method uses the David-Johnson formula + relations between correlations, etc:

function normalizerel!(v::AbstractVector, relative::Union{Vector{Int}, UnitRange})
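    # Rescale the entries outside `relative` so that the whole column sums to 1
    # (each row/column of the covariance matrix of standard normal order statistics
    # sums to 1), while keeping the entries listed in `relative` fixed.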
    Sr = sum(v[relative])
    c = (1-Sr)/(sum(v) - Sr)
    for i in eachindex(v)
        if !(i in relative)
            v[i] *= c
        end
    end
    return v
end
function fillcov!(V, j::Int)
    # copy reversed j-th column to (end-j)-th column
    @views V[j:end-j+1, end-j+1] .= reverse(V[j:end-j+1, j])
    # copy j-th column to j-th row
    @views V[j, j+1:end-j] .= V[j+1:end-j, j]
    # copy (end-j)-th column to (end-j)-th row 
    @views V[end-j+1, j+1:end-j] .= V[j+1:end-j, end-j+1] 
    return V
end
function covDavisStephens(OS::NormOrderStatistic{T}, V11::T=var(OS, 1)) where T
    V = zeros(T, OS.n, OS.n)
    
    V[1,1] = V11
    V[2,1] = moment(OS, 1, 2) - expectation(OS,1)*expectation(OS,2) - T(1)
    
    for i in 3:OS.n
        V[i, 1] = covDavidJohnson(i, 1, OS.n)
    end
    normalizerel!(view(V, :, 1), [1,2])
    fillcov!(V, 1)
    j=1

    for j in 2:div(OS.n, 2)
        V[j,j] = var(OS, j)
        for i in j+1:OS.n-j+1
            V[i,j] = covDavidJohnson(i, j, OS.n)
        end
        normalizerel!(view(V, :, j), [collect(1:j); collect(OS.n-j+1:OS.n)])
        fillcov!(V, j)
    end
    
    if isodd(OS.n) # the middle point
        j = div(OS.n, 2)+1 
        V[j, j] = var(OS, j)
        normalizerel!(view(V, :, j), [j])
        fillcov!(V, j)
    end
    return Symmetric(V, :U)
end

covDavisStephens(N::Int) = covDavisStephens(NormOrderStatistic(N))

Again comparing against high-precision values of Parrish (N = 20; log10.(abs.(differences))):
[plot: log10 of absolute differences between the Davis-Stephens covariances and Parrish's values, N = 20]

The formula is pretty quick:

N = 2000
@time M = covDavisStephens(N)
@time invM = inv(M);
heatmap(abs.(invM./maximum(invM)), size=(800,700), yflip=true)
# 2.853262 seconds (12.26 M allocations: 365.602 MiB, 2.97% gc time)
# 0.552725 seconds (22 allocations: 62.043 MiB, 1.02% gc time)

[heatmap: normalized inverse of the Davis-Stephens covariance matrix, N = 2000]

Shapiro-Wilk coefficients

Finally the coefficients:

swcoeff(invcov, m) = ((m'*invcov)'/sqrt(m'*(invcov*invcov)*m))[1:div(length(m),2)]

Relative to the exact (= using the high-precision covariances of Parrish) values of the SW coefficients:

N=20
m = Float64.(expectation(NormOrderStatistic{BigFloat}(N)))
exact = -swcoeff(inv(Float64.(covd[N])), m)  # covd[N]: Parrish's high-precision covariance matrix
p = plot(yaxis=:log10)
# plot!(exact, label="Exact (Parrish)")
plot!(abs.(SWCoeffs(N).A)./exact, label="Royston")
plot!(abs.(swcoeff(eye(N), expectation(Blom(N))))./exact, label="Weisberg-Bingham")
plot!(abs.(swcoeff(inv(covDavisStephens(N)), m))./exact, label="DavisStephens")

[plot: Royston, Weisberg-Bingham, and Davis-Stephens coefficients relative to the exact (Parrish) SW coefficients, N = 20, log scale]

I still have no idea what Royston fitted his coefficients against, but that must have been pretty accurate!

For N = 2000, the third coefficient from Davis-Stephens is still larger than it should be by ~25% (relative to Royston):
[plot: Davis-Stephens SW coefficients relative to Royston's, N = 2000]

@andreasnoack (Member) commented:

@kalmarek Sorry for dropping the ball here. What is the current status?

@kalmarek (Contributor, Author) commented Jun 17, 2018

I also had too much to do, couldn't finish this cleanly;)

Long story short, we either:

  1. use the Royston approximation (already in place), forgetting that he fit it using 32 bits of precision (as everybody does),
  2. use Parrish's exact values of the covariances (N≤25) to compute exact SW coeffs, and Royston otherwise (as everybody should do), or
  3. try to get the order statistics going (and pre-compute the SW coeffs, e.g. for N ≤ 50 or 100), and use the Royston method above.

Happy to finish this, starting next week actually (conferences, etc.).

@Ph0non (Contributor) commented Nov 15, 2018

This dedicated work on the Shapiro-Wilk test is exceptionally fantastic 🥇

Is there any news on completing the work on the SWT?

@kalmarek (Contributor, Author) commented Nov 15, 2018

@Ph0non this effort (it was fun!) is an offspring of my frustration with the teaching I was forced to do.
And yes, that included teaching statistics ;-)
That semester is over, so I don't have enough frustration in store right now to forge it into code ;-)

But if you are willing to provide documentation, I may look into finishing this, though only in December (conferences abound).

@Ph0non (Contributor) commented Nov 16, 2018

Do you need documentation like the other tests in the package, or something more (with more theoretical background or so)?

@kalmarek (Contributor, Author) commented Nov 17, 2018

There are four things to be finished here:

  1. fix problems with accuracy
  2. find out what Royston fit his approximation to
  3. produce test suite
  4. write documentation

I volunteer to finish 1 (and maybe 2).
If I succeed, then 3 should be rather easy, but not a zero-work task. In the documentation (4) we should (among other things) carefully explain:

  • why we differ from standard packages (everybody out there is using Royston),
  • how we obtain our SW coeffs,
  • why everybody should be using exact order statistics in SW (at least for N≤50).

@Ph0non (Contributor) commented Nov 20, 2018

Some small fixes kalmarek#1

@albz commented Nov 28, 2018

Hi, thanks for the work done so far.

Have you checked the Python implementation?
It could be helpful; it would allow cross-checking the results as well as the coefficients.

I cannot remember right now, by heart, how the coefficients are computed in the Python version, but I believe they are for double precision and are not covered by any copyright. It could be helpful to cross-check this info.

I had also implemented a version in a similar fashion in Fortran at some point; it could come in handy.

@kalmarek (Contributor, Author) commented Nov 28, 2018

@albz could you point me to a Python implementation? The one from scipy
https://github.com/scipy/scipy/blob/v1.1.0/scipy/stats/morestats.py#L1304
uses the standard Fortran code; the approximating polynomials were not fit in double precision there.

Right now copyright is not a problem, since I wrote everything from scratch following the paper only.

@kalmarek (Contributor, Author) commented Nov 28, 2018

I see that people are generally interested, so here are some details. In the 1992 paper Royston:

  1. obtained (from 10000 MC samples) what he calls "exact" coefficients (A),
  2. took the Weisberg-Bingham approximation (C), and
  3. fit two quintic polynomials (in n^{-1/2}) to the differences A_n - C_n and A_{n-1} - C_{n-1}; the coefficients of those polynomials are visible in the Fortran code as C1 and C2;
  4. his coefficients for SW consist of the corrected C_n and C_{n-1} and the re-normalized rest of C (a sketch of this scheme is given below).
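
A rough sketch of that correction scheme in Julia (assuming the Blom-score convention; `c1poly` and `c2poly` are placeholders for the two quintic polynomials, whose actual coefficients are not reproduced here, so this is not a drop-in implementation):

using StatsFuns: norminvcdf

blom_scores(n) = [norminvcdf((i - 3/8) / (n + 1/4)) for i in 1:n]

# Sketch of Royston's (1992) correction; c1poly/c2poly stand for the quintics
# in 1/sqrt(n) (C1, C2 in swilk.f). For n ≤ 5 only the last coefficient is
# corrected; that case is ignored here.
function royston_coefs(n, c1poly, c2poly)
    m = blom_scores(n)                  # Blom approximation to E[X₍ᵢ₎]
    mm = sum(abs2, m)
    c = m ./ sqrt(mm)                   # Weisberg-Bingham coefficients
    u = 1 / sqrt(n)
    aₙ = c[n] + c1poly(u)               # corrected last coefficient
    aₙ₋₁ = c[n-1] + c2poly(u)           # corrected second-to-last coefficient
    # rescale the remaining entries so the whole vector stays unit-norm
    ϕ = (mm - 2m[n]^2 - 2m[n-1]^2) / (1 - 2aₙ^2 - 2aₙ₋₁^2)
    a = m ./ sqrt(ϕ)
    a[n], a[n-1] = aₙ, aₙ₋₁
    a[1], a[2] = -aₙ, -aₙ₋₁             # antisymmetry
    return a
end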

I'm happy with both options here:

  • providing pre-computed (but this time exact!) coefficients A for sizes up to e.g. 200, and then
  • comparing the exact coefficients (again) with Weisberg-Bingham and using Royston's method to approximate the rest, as above.

Note that my experiments with MC show that it cannot provide better precision than 1e-5 or 1e-6.

Then additional fitting must be provided for the distribution of the W-statistic.
Royston claims it's close to a lognormal, with two different sets of parameters (for two ranges of n) which need to be refitted (these are C3-C7 in the Fortran code).

@albz commented Nov 29, 2018

You are right, the Python/Fortran code is in single precision. That is of little help.

@andreasnoack (Member) commented:

@kalmarek It's really a shame that this never was completed given all the work you put into this. So hereby my encouragement to resume the work.

@kalmarek (Contributor, Author) commented:

@andreasnoack Thanks for the encouragement ;) However unlikely it is, the SW test is still in the back of my head.
Hopefully we will soon bring Arblib.jl to a usable state, and then I could use its certified integration to re-compute the SW coefficients.
The plan is just as it was before:

  • use exact SW coeffs for small sample sizes, say N≤200 (using Arblib),
  • use the Royston approximation afterwards (this is implemented) (but the SW test becomes pretty useless for large sample sizes :),
  • re-fit polynomials (log-polynomials?) to the distribution(s) of the W-statistic and find critical thresholds (TODO).

@kalmarek (Contributor, Author) commented Jan 9, 2021

@andreasnoack in kalmarek/ShapiroWilk#1 SWCoeffs are implemented using Arblib;
one can do an (embarrassingly parallel) computation e.g. for (n=50, prec=128), which gives ~7 digits of precision in a couple of minutes; one gets more than Float64 precision for prec=256.

To compute critical/p-values I'd need to

  1. approximate a normalizing transform (1-Wₙ)^t ~ N(µ, σ) (µ could be computed analytically, I'm not sure about σ) via MC
  2. fit a (exponential? biexponential?, inverse log?) function to t = t(n) and use it for all sample sizes
  3. use z-scores for the transformed W-statistic (see the sketch below)
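
For step 3, a hypothetical sketch of how a p-value would follow from such a transform (the function name and the fitted values τ, µ, σ are placeholders, not the final parametrization):

using Distributions

# Hypothetical: given fitted τ(n), µ(n), σ(n) such that (1 - W)^τ ≈ N(µ, σ),
# small W (departure from normality) maps to a large transformed value,
# so the p-value is the upper normal tail of the z-score.
function sw_pvalue(W, τ, µ, σ)
    z = ((1 - W)^τ - µ) / σ
    return ccdf(Normal(), z)
end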

I used off-the-shelf Optim.jl methods to do so:

using Statistics
using Random

using Distributions
using LinearAlgebra
using Optim


using Revise
using ShapiroWilk


function sample_Wstatistic(sw::ShapiroWilk.SWCoeffs{T}, sample_size) where T
    Wstats = zeros(T, sample_size)
    tmp = similar(Wstats, length(sw))
    for i in 1:sample_size
        randn!(tmp)
        sort!(tmp)
        Wstats[i] = ShapiroWilk.Wstatistic(tmp, sw)
    end
    Wstats
end;

function normal_power_fit(
    t,
    sample,
    qs = 0.005:0.005:0.995,
    normal_qs=quantile.(Normal(0,1), qs),
)

    sample_t = similar(sample)

    for idx in 1:length(sample_t)
        sample_t[idx] = (one(eltype(sample)) - sample[idx])^t
    end

    µ, σ = mean(sample_t), std(sample_t)
    for idx in 1:length(sample_t)
        sample_t[idx] = (sample_t[idx] - μ)/σ
    end

    #@assert issorted(sample_t)

    sample_t_qs = quantile(sample_t, qs, sorted=true)

    for idx in 1:length(sample_t_qs)
        sample_t_qs[idx] = sample_t_qs[idx] - normal_qs[idx]
    end

    return norm(sample_t_qs, 2)
end

optionally e.g.

@time let n = 30, prec=128
    OS = ShapiroWilk.OrderStatisticsArblib.NormOrderStatistic(n, prec=prec)
    @time ShapiroWilk.OrderStatisticsArblib._precompute_ψ(n, prec=prec, R=18.0)
end

then

exponents = map(5:30) do n
    prec=128
    qs=0.005:0.005:0.995

    OS = ShapiroWilk.OrderStatisticsArblib.NormOrderStatistic(n, prec=prec)
    @info "Computing coefficients..."
    @time swfl = ShapiroWilk.SWCoeffs(n, Float64.(ShapiroWilk.SWCoeffs(OS)))

    @info "Sampling W-statistic..."
    @time W_sample = sort!(sample_Wstatistic(swfl, 10_000_000), rev=true)

    @info "Fitting power-transform for normality..."
    @time res = let qs=qs
        normal_qs=quantile.(Normal(0,1), qs)

        optimize(
            x->normal_power_fit(first(x), W_sample, qs, normal_qs),
            0.01,
            0.5,
            Brent(),
        )
    end
    @info "For n=$n optimization yielded" res

    Optim.minimizer(res)
end

yields the normalizing exponents, i.e. τ such that (1-W)^τ ~ N(µ, σ). Fitting τ = τ(n) is straightforward, e.g.

┌ Info: For n=30 optimization yielded
│   res =
│    Results of Optimization Algorithm
│     * Algorithm: Brent's Method
│     * Search Interval: [0.010000, 0.500000]
│     * Minimizer: 5.386236e-02
│     * Minimum: 6.233191e-02
│     * Iterations: 12
│     * Convergence: max(|x - x_upper|, |x - x_lower|) <= 2*(1.5e-08*|x|+2.2e-16): true
│     * Objective Function Calls: 13

The whole table of these τs looks as follows:

[plot: fitted normalizing exponents τ(n) for n = 5:30]

@BioTurboNick commented:
Was just looking into Shapiro-Wilk, am I reading it right that it's pretty close?

@kalmarek (Contributor, Author) commented Jan 18, 2021

@BioTurboNick you can already use the code from this branch directly (the only thing that doesn't work is censored samples) for an implementation based on the Royston approximation of the S-W test (well, updating to modern HypothesisTests will require a bit of coding, but not much).

What is (relatively) close is a new approximation that I'm doing in my spare time ;) Whether it will make a difference, I don't know, but it will be much closer to "mathematical correctness" (whatever that means:).


EDIT: just to let you know: everybody else (R, scipy) is using the same old Fortran implementation, which will be comparable to what is in this branch.

@kalmarek (Contributor, Author) commented:

@andreasnoack or anyone else in this thread ;) I'd need your advice.

I have the SW-coeffs computed up to eps(Float64) precision for n=3:189 (I could go further, but I wasn't patient enough ;).

What is left are the critical levels for the W-statistic, i.e. wstatcdf. Royston/Parrish use different normalizing transforms (functions that take W to Z ~ N(0, 1)):

  • for n∈4:11, approximate/interpolate the pdf/cdf of W=Wₙ using MC. Royston in 1993 suggests -ln(γ-ln(1-W)) as the normalizing transform, but this seems very crude, given that the pdf seems non-differentiable. I'm not even sure this is taken into account in the existing Shapiro-Wilk implementations.
  • find a normalizing transform for n∈12:2000 and estimate its parameters based on MC for n∈12:189. Here there are two candidates: W → (ln(1-W) - µ)/σ (Royston) or W → ((1-W)^t - µ)/σ (Parrish):

After looking closely at MC samples I arrived at this:

  • n = 3 → analytic solution
  • n = 4:6 → MC + Piecewise fit ??
  • n = 7:28 → Parrish's power transform (1-W)^t ~ N(µₙ,σₙ): we need to store (tₙ, µₙ, σₙ)
  • n >= 29 → Royston's lognormal log(1-W) → N(µ′ₙ, σ′ₙ): fit biexponential/some other models to µ′ and σ′ on n = 29:189 and use them for all n.

(The transition Parrish → Royston is determined based on MSE on ~200 uniformly distributed quantiles)

What do you think?

I asked about the shape of the W-distributions on discourse: https://discourse.julialang.org/t/can-anyone-tell-me-what-is-this-distribution/53726 but got no input ;)

@kalmarek (Contributor, Author) commented Mar 2, 2021

OK, it seems that I was fitting the normal distribution wrong ;) The Box-Cox transform W → (1-W)^λ/λ for n in 7:189 seems superior to the shifted logarithmic one when optimizing the log-likelihood (currently using BoxCoxTrans.jl).

Using 1_000_000 MC samples of the W-statistic under normality (for each n∈7:189) I computed λ (the exponent of the Box-Cox transform), and µ and σ (the mean and std of the transformed sample of W) as functions of n:

  • inverse(t, a₀, a₁, γ) = a₀ + a₁/(t - γ)
  • logarithmic(t, a₀, a₁, γ) = a₀ + a₁*log(t - γ).
    The models (green) were fit using least squares for n = 7:140 (blue) and you can see how well they fit the "test" data (in red). The bottom plot shows normalized residuals with their mean and std. (A possible fitting setup is sketched below the plots.)

[plots (with PDF versions) of the fitted models for λ(n), µ(n), and σ(n) over n = 7:140]
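
A possible setup for such fits, e.g. with LsqFit.jl (a sketch under stated assumptions: `ns` and `λs` stand for the tabulated sample sizes and the MC-estimated Box-Cox exponents; the starting values are illustrative, and this is not necessarily what was actually used):

using LsqFit

# the two model families mentioned above, in LsqFit's (x, params) form
inverse(t, p)     = p[1] .+ p[2] ./ (t .- p[3])        # a₀ + a₁/(t - γ)
logarithmic(t, p) = p[1] .+ p[2] .* log.(t .- p[3])    # a₀ + a₁*log(t - γ)

ns = collect(7.0:140.0)
λs = rand(length(ns))            # placeholder for the MC-estimated λ(n) values

fit = curve_fit(inverse, ns, λs, [0.0, 1.0, 0.0])
fit.param                        # fitted (a₀, a₁, γ); same pattern for `logarithmic`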

@andreasnoack As I really have very little knowledge about statistics/model fitting/parameter estimation, I need someone knowledgeable to have a look at this and provide a stamp of approval ;)

@jarredclloyd (Contributor) commented:

@kalmarek @andreasnoack just curious if this is still being looked into/implemented

@kalmarek (Contributor, Author) commented Dec 9, 2022

@jarredclloyd this kind of stalled because I was not able to produce meaningful fits (except eyeballing plots and saying: this looks seriously gooood...). If you're willing to help (and have more experience with this), be my guest.

@jarredclloyd (Contributor) commented:

Definitely willing to help, but I think the specific problem you are trying to overcome might be beyond my level of knowledge currently.
Correct me if I'm wrong, but the implementation here has a working version consistent with other languages?
I think it would be great to have this implemented in HypothesisTests so that Julia has a functional Shapiro-Wilk test, especially given all the fantastic work you've done on this. From there we could attempt to resolve the fitting issue (I can ask some of my colleagues who might be able to help).
So, what needs to be completed to have a functional version of Shapiro-Wilk that can be merged into the main branch?

@kalmarek (Contributor, Author) commented:

@nalimilan thanks for your review and patience. It seems ready to be merged. Let's celebrate 🎉 (and don't forget to squash ;P)

Comment on lines 34 to 37
struct ShapiroWilkCoefs <: AbstractVector{Float64}
N::Int
A::Vector{Float64}
end
@devmotion (Member) commented Oct 19, 2023

Why is it necessary to construct a new array type? Isn't a function that constructs a Vector{Float64} with the desired coefficients A and a function that performs the desired computation with them simpler?

Generally, it's quite non-trivial to implement the AbstractArray interface correctly and most operations with the resulting array type will fall back to the generic implementations in Base and LinearAlgebra instead of exploiting optimizations for Vector{Float64}.

@kalmarek (Contributor, Author) commented:

It is not necessary; SW coeffs are antisymmetric and we're computing and storing just half of them.
It'd be easy enough to collect them, store in ShapiroWilkCoefs and use internally (we might gain something performance-wise by this, but only when they are reused). Distinguishing them from Vector{Float64} makes the interface and signatures clearer imo.

@devmotion Was this just a question or rather a suggestion for change? If it is the latter I'd appreciate a set of concrete suggestions ;)

@devmotion (Member) commented:

It started as a question but now I'd actually suggest changing the implementation.

Note that I did not mean to change the efficiency of the algorithm, I don't think you necessarily have to store all coefficients. I was suggesting to store and use the A field directly in the test. More concretely, make the A coefficients a field of the test struct, and then reimplement unsafe_swstat in a way that takes the vector of coefficients directly (instead of providing its wrapper).

@kalmarek (Contributor, Author) commented:

@devmotion What is the benefit of dropping this abstraction in favour of a plain Vector{Float64}? On the performance side, since ShapiroWilkCoefs is immutable, its creation will most probably be elided by the compiler; on the programmer side it makes clear what kind of vector must be passed, and that it is not an arbitrary Vector{Float64}. Users may pass the coefficients in for re-use, and I find it very fitting to provide a thin (no-cost?) wrapper to store the coefficients.


After doing a bit of benchmarking, I think we gain very little by storing only half of the coefficients, and we lose dispatch to BLAS.dot (the generic one isn't that much slower though). I'm fine with changing the implementation to store the whole vector. I'd rather keep ShapiroWilkCoefs though.

A reviewer (Member) commented:

If we store the whole vector, I agree with @devmotion that we may as well take a plain Vector{Float64}. Wrapping values in ShapiroWilkCoefs doesn't add anything.

A reviewer (Member) commented:

I'd argue that the wrapper should be removed regardless of whether the full vector is stored or not. As you noticed in your benchmarks, the wrapper array type is often not performant since you miss optimizations/BLAS. I don't think there's a major benefit from a user or correctness perspective either: you can always provide nonsensical coefficients, and as a user you would always construct them with some documented and exported function without caring about their type.

Taken together, I think you should remove the wrapper and possibly store half of the coefficients. I wonder if you also benchmarked the version based on the Vector{Float64} with half of the coefficients (dot rewritten as a sum of two dot operations)?

@kalmarek (Contributor, Author) commented Oct 23, 2023

for simplicity I went with the plain vector storing all of the coefficients.

a user you would always construct them with some documented and exported function

shapiro_wilk_coefs is not exported and neither was ShapiroWilkCoefs (suggestions for other names welcome).

I did not benchmark the two-dot products, as SW coefs are antisymmetric and dot is memory-bound anyway. (For reference, the rewrite in question would look roughly like the sketch below.)
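
A sketch of the "two dots" rewrite being discussed (hypothetical helper, assuming aᵢ = -aₙ₊₁₋ᵢ with only the first half stored):

using LinearAlgebra: dot

# `ahalf` holds the first n ÷ 2 coefficients; by antisymmetry the full dot
# product ∑ aᵢ x₍ᵢ₎ splits into two dots over the ends of the sorted sample
# (the middle coefficient is zero for odd n, so it can be skipped).
function halves_dot(ahalf::AbstractVector, x_sorted::AbstractVector)
    n = length(x_sorted)
    k = length(ahalf)
    return dot(ahalf, view(x_sorted, 1:k)) -
           dot(ahalf, view(x_sorted, n:-1:(n - k + 1)))
end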

Comment on lines +75 to +76
m = mean(X)
S² = sum(x -> abs2(x - m), X)
@devmotion (Member) commented:

Suggested change:
  m = mean(X)
- S² = sum(x -> abs2(x - m), X)
+ S² = moment(X, 2)

@kalmarek (Contributor, Author) commented:

@devmotion isn't moment normalized (i.e. integrated w.r.t. the uniform measure on X)? Then this should be S² = moment(X, 2)*length(X).

But the real question is: did you benchmark this? On my machine this is (2-6)× slower (N ∈ 2.^7:15, increasing with N) than before.
(Which, incidentally, makes it even slower than the non-BLAS dot for the old SWCoeffs; starting at N=2^7, the factor increases to ~8 for N=2^15.)

If we really care about perf I'd rather revert this.
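
For reference, the comparison in question would look roughly like this (hypothetical sizes; `moment` here is StatsBase's normalized central moment):

using BenchmarkTools, Statistics
using StatsBase: moment

X = randn(2^15)
m = mean(X)
@btime sum(x -> abs2(x - $m), $X)     # the explicit sum currently in the PR
@btime moment($X, 2) * length($X)     # the suggested moment-based variant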

@devmotion (Member) commented:

We should improve moment if it is slow. But regardless, you're right of course that the moment will be normalized. I'm fine with reverting it.

m .= m ./ sqrt(ϕ) # A, but reusing m to save allocs
m[1], m[2] = a₁, a₂
end
resize!(m, N)
@devmotion (Member) commented:

This is not needed, it seems?

Suggested change:
- resize!(m, N)

@kalmarek (Contributor, Author) commented:

Why do you think so, @devmotion? To me it seems very much needed.

@devmotion (Member) commented:

Because you create m as a Vector{Float64}(undef, N) a few lines above.

@kalmarek (Contributor, Author) commented:

Yes, but then I resize it to length n = N÷2 and compute only the first half (normalizing w.r.t. n). Only then do I fill the rest of the vector, below line 62; to do so I need to bring back the old size.

@devmotion (Member) commented:

I missed this. Can you remove all resize! calls? I think you should be able to work with a view instead of resizing the array.

@kalmarek (Contributor, Author) commented:

yes, @views are much better than resizes ;)
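
A hypothetical sketch of the view-based pattern (names illustrative, not the PR's actual code):

N = 11
m = Vector{Float64}(undef, N)
n = N ÷ 2
mhalf = view(m, 1:n)            # work on the first half in place, no resize! needed
for i in 1:n
    mhalf[i] = float(i)         # stand-in for the actual computation of the half
end
# afterwards fill the second half using the antisymmetry of the coefficients
for i in 1:n
    m[N - i + 1] = -m[i]
end
isodd(N) && (m[n + 1] = 0.0)    # the middle coefficient vanishes for odd N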

Marek Kaluba and others added 4 commits October 23, 2023 12:10
Co-authored-by: David Widmann <devmotion@users.noreply.github.com>
Co-authored-by: David Widmann <devmotion@users.noreply.github.com>
@kalmarek (Contributor, Author) commented:

now tests on 1.3 fail with

┌ Warning: Deprecated syntax `"begin" inside indexing expression` at /home/runner/work/HypothesisTests.jl/HypothesisTests.jl/src/shapiro_wilk.jl:74.
└ @ ~/work/HypothesisTests.jl/HypothesisTests.jl/src/shapiro_wilk.jl:74
ERROR: LoadError: LoadError: syntax: extra token ")" after end of expression
Stacktrace:
 [1] top-level scope at /home/runner/work/HypothesisTests.jl/HypothesisTests.jl/src/shapiro_wilk.jl:74

due to @view(A[begin:(begin + length(X) - 1)]).
I'm actually surprised that this syntax is not valid on Julia 1.3!
@devmotion since you suggested this change, what would be the proper fix (to keep it indexing-agnostic and allocation-free)?

@kalmarek (Contributor, Author) commented Nov 2, 2023

@devmotion I took the liberty of reverting/modifying the suggestion on begin-based indexing, so that Julia 1.3 parses the expression.

To be honest, I'd be very glad to have this merged as soon as possible so that I can cross it off my list... It's been sitting here for long enough.

@nalimilan (Member) commented:

You can do view(A, firstindex(X) .+ (0:length(X) - 1)) instead I think.

@kalmarek (Contributor, Author) commented Nov 6, 2023

You can do view(A, firstindex(X) .+ (0:length(X) - 1)) instead I think.

that's what I did.

@kalmarek (Contributor, Author) commented:

@devmotion I'd like to ask you again to have a look at the changes you requested;
if there is something more to be done here, let me know; if not, please merge.

@jarredclloyd (Contributor) commented:

Friendly bump, hoping this is still on the radar for review/merging @devmotion @nalimilan @andreasnoack as it looked to be in a ready state, or only minor adjustments needed.

* Too much data is censored (`censored / N > 0.8`)

# Implementation notes
* The current implementation DOES NOT implement p-values for censored data.
@ararslan (Member) commented:

What is the utility of allowing the user to compute the Shapiro-Wilk test statistic at all if they can't get its p-value nor view the object directly in the REPL (since show calls pvalue)? That's an honest question; I would guess that many users would try it and get frustrated without checking the docstring but that is a naive assumption.

I'd probably recommend commenting out the censoring-related functionality for the time being, then someone can implement it fully in a follow-up PR. Thoughts on that?

@kalmarek (Contributor, Author) commented:

@ararslan feel free to suggest (via gh suggestions) any edits that would suit your proposal

@ararslan (Member) commented:

I guess it was half question, half proposal—do you envision a use case for this with censored data as it is currently implemented in this PR (no pvalue/show)? If not, I can make those suggestions.

@kalmarek (Contributor, Author) commented:

Well, the W statistic can still be useful if somebody wants to use Section 3.3 of Royston 1993 to calculate significance levels directly.

I didn't implement those methods back then since the significance levels there are only empirical, and until I need to do it at work I don't see myself doing this now either ;)

If those suggestions push this PR over the final line, please make them!

kalmarek and others added 4 commits May 23, 2024 09:41
Co-authored-by: Alex Arslan <ararslan@comcast.net>
Co-authored-by: Alex Arslan <ararslan@comcast.net>
Co-authored-by: Alex Arslan <ararslan@comcast.net>
@kalmarek (Contributor, Author) commented Oct 2, 2024

Friendly bump, hoping this is still on the radar for review/merging @devmotion @nalimilan @andreasnoack as it looked to be in a ready state, or only minor adjustments needed.

This is already starting to feel silly ;)
Apparently this functionality is not needed, since nobody wants to take responsibility for this merge. If this last bump doesn't do the job, I'll simply close the PR.

@andreasnoack (Member) left a review comment:

Thanks for another reminder, but, first of all, thanks for spending so much time on this over so many years. Unfortunately, we are all busy with other tasks, which has delayed this review beyond any reasonable limits. I think we have all now spent enough time on this, and any subsequent modifications should be delegated to other PRs.

@andreasnoack merged commit eb93eec into JuliaStats:master on Oct 2, 2024
10 checks passed