Make covariance and correlation work for any iterators #30

pdeffebach · 2020-03-13T15:44:29Z

This PR is in reference to #35050 in the Julia Repo here. The goal is to make working with iterators more convenient. In particular, this PR will make working with iterators that skip over missing values more convenient.

This PR will improve both cov and cor. However it starts with cov, covm and covzm. I think I went overboard in avoiding allocations because covm and covzm are handeled via iterate only, the way mean(itr) is handled in the same file. However covm for AbstractArrays is fine to use map and allocate a new vector. The decision on how to proceed will be based on how we would like to handle stateful iterators.

Currently there are 4 new methods added.

covzm(itr::Any; corrected::Bool=true)
covzm(x::Any, y::Any, corrected::Bool=true)
covm(itr::Any, itrmean; corrected::Bool=true)
covm(x::Any, xmean, y::Any, ymean; corrected::Bool=true)

As a consequence, something like

cov(x::AbstractMatrix, itr::Any)

will break.

This is my first PR to a stdlib, so let me know if I am missing any conventions. Thanks!

nalimilan · 2020-04-03T15:57:41Z

Thanks. That implementation is reasonable, but I wonder whether using the existing methods based on map (just widening the signatures) wouldn't be better given how simple that would be. That will allocate more, but in terms of performance it might be faster due to the use of BLAS on the collected vectors. OTOH, I generally prefer avoiding allocations where possible so...

Could you run a few benchmarks since you've already written the code? For example, testing on generators and skipmissing[s] would be interesting.

As a consequence, something like
cov(x::AbstractMatrix, itr::Any)
will break.

Why would it break?

pdeffebach · 2020-04-09T13:13:09Z

Why would it break?

It won't "break" but it won't give you the matrix variance-covariance matrix from treating each column in the matrix as a separate vector.

It will fall back on iterate-ing the matrix and thus the itr would have to be an iterator of scalars with the same number of elements as the corresponding matrix. This might be unintuitive behavior.

pdeffebach · 2020-04-09T13:50:29Z

One thing I wasn't expecting was that this fails:

x = [rand(5) for i in 1:10]
y = [rand(5) for i in 1:10]
cov(x, y)
# long error message

Whereas

If x and y are matrices where each row is one of their vectors, cov works.
If x and y are generators, then after this PR it works.

pdeffebach · 2020-04-09T14:59:48Z

Here are the results of my benchmark. I added a method so that the above covariance between vectors of vectors works. This makes benchmarking easier but will also have to be added since iterators and vectors should behave similarly.

Highlights:

Things look good. Slowdown from vectors to generators less than a factor of 10.
covariances between skipmissings is just as fast as collecting them. But this is due to the overhead with iteration.
Not collecting generators is faster.
Multivariate covariances, i.e. iterators of vectors, or two matrices, are quite bad. Slowdown above a factor of 10 compared to matrix case. The two calls actually have the same number of allocations, which I don't understand. Also, my implementation is wrong. I am returning the transpose of what I want. I need to look into this.

Unfortunately given these difficulties this PR is unlikely to be merged before the 1.5 feature freeze.

============
Proportion missing: 0.2
============

Covariance:
============
N = 10000: vectors
    7.8364000000e-05 seconds

N = 10000: skipmissings, no collect
    4.9105100000e-04 seconds

N = 10000: skipmissings, collect
    4.7475300000e-04 seconds

N = 10000: generator, no collect
    6.5641000000e-05 seconds

N = 10000: generator, collect
    1.2176000000e-04 seconds

N = 10000, multivariate vector, 2 args
    1.7621935000e-02 seconds

N = 10000, multivariate matrix, 2 args
    6.8707000000e-04 seconds

N = 10000, multivariate generator, no collect
    1.6903689000e-02 seconds

N = 10000, multivariate generator, collect
    1.7486476000e-02 seconds

N = 10000, multivariate vector, 1 arg
    1.5073024000e-02 seconds

N = 10000, multivariate matrix, 1 arg
    4.1739900000e-04 seconds

N = 10000, multivariate generator, 1 arg, no collect
    1.4867595000e-02 seconds

N = 10000, multivariate generator, 1 arg, collect
    1.4857078000e-02 seconds

nalimilan · 2020-04-03T12:20:58Z

src/Statistics.jl

 # core functions

-unscaled_covzm(x::AbstractVector{<:Number})    = sum(abs2, x)
-unscaled_covzm(x::AbstractVector)              = sum(t -> t*t', x)
+unscaled_covzm(x::AbstractVector{<:Number})    = sum(_abs2, x)


This method isn't needed if you dispatch in _abs2.

Suggested change

unscaled_covzm(x::AbstractVector{<:Number}) = sum(_abs2, x)

nalimilan · 2020-04-03T15:35:08Z

src/Statistics.jl

+        #                                 Base.IteratorEltype(x)) / 0
+        return NaN
+    end
+    f = let xmean = xmean, ymean = ymean


I'd just have f take xmean and ymean, and pass that explicitly in the two call places.

I would have to write (x, y) -> _conj(x - xmean, y - ymean). Wouldn't that cause the closure bug?

nalimilan · 2020-04-03T15:35:39Z

src/Statistics.jl

@@ -504,6 +525,28 @@ function covzm(x::AbstractMatrix, vardim::Int=1; corrected::Bool=true)
    A .= A .* b
    return A
 end
+function covzm(x::Any, y::Any; corrected::Bool=true)


Maybe this could just call covm(x, 0, y, 0)? Check whether the compiler is able to optimize for 0 statically.

Have you tried this?

Just did.

julia> x = randn(10_000); y = randn(10_000); julia> x = x .- Statistics.mean(x); y = y .- Statistics.mean(y); julia> gx = (xi for xi in x); gy = (yi for yi in y); julia> @btime Statistics.covzm($gx, $gy) 11.820 μs (0 allocations: 0 bytes) 0.009480946018135657 julia> @btime Statistics.covm($gx, 0, $gy, 0) 14.768 μs (0 allocations: 0 bytes) 0.009480946018135657

Let me know what you think of that difference. It doesn't seem very important to me. The covzm version takes 80% as long as covm with 0s.

nalimilan · 2020-04-03T15:41:25Z

src/Statistics.jl

+        # TODO: Understand how to improve this error.
+        #return Base.mapreduce_empty_iter(t -> _conj(t[2])*t[1]', Base.add_sum, itr,
+        #                                 Base.IteratorEltype(x)) / 0
+        return NaN


I don't think mapreduce_empty_iter is really useful outside standard reductions. Here you just need to find the best way to compute NaN so that it works for all types. Maybe this?

Suggested change

# TODO: Understand how to improve this error.

#return Base.mapreduce_empty_iter(t -> _conj(t[2])*t[1]', Base.add_sum, itr,

# Base.IteratorEltype(x)) / 0

return NaN

v = conj(zero(eltype(y)))*zero(eltype(x))'

return (v + v) / 0

nalimilan · 2020-04-03T15:41:54Z

src/Statistics.jl

+        count += 1
+        z_itr = iterate(z, state)
+    end
+    return total ./ (count - Int(corrected))


Suggested change

return total ./ (count - Int(corrected))

return total / (count - Int(corrected))

nalimilan · 2020-04-03T15:42:48Z

src/Statistics.jl

+        return NaN
+    end
+    count = 1
+    value, state = z_itr


Something like this would be more readable (if you adapt code elsewhere):

Suggested change

value, state = z_itr

(xi, yi), state = z_itr

nalimilan · 2020-04-03T15:46:12Z

src/Statistics.jl

+function covzm(itr::Any; corrected::Bool=true)
+    y = iterate(itr)
+    if y === nothing
+        return Base.mapreduce_empty_iter(_abs2, Base.add_sum, itr,


+ is used instead of add_sum:

Suggested change

return Base.mapreduce_empty_iter(_abs2, Base.add_sum, itr,

return Base.mapreduce_empty_iter(_abs2, +, itr,

But maybe it would be clearer to use the same approach for all methods (see suggestion for the two-argument case).

nalimilan · 2020-04-03T15:50:44Z

src/Statistics.jl

 covm(x::AbstractVector, xmean, y::AbstractVector, ymean; corrected::Bool=true) =
    covzm(map(t -> t - xmean, x), map(t -> t - ymean, y); corrected=corrected)
 covm(x::AbstractVecOrMat, xmean, y::AbstractVecOrMat, ymean, vardim::Int=1; corrected::Bool=true) =
    covzm(x .- xmean, y .- ymean, vardim; corrected=corrected)

 # cov (API)
+"""
+    cov(x::Any; corrected::Bool=true)


Better adapt the existing docstring (and method) to only mention iterators, since vectors are just a special case. Same for others.

nalimilan · 2020-04-09T16:09:55Z

src/Statistics.jl

@@ -517,17 +560,75 @@ end

 # covm (with provided mean)
 ## Use map(t -> t - xmean, x) instead of x .- xmean to allow for Vector{Vector}
-## which can't be handled by broadcast
+## which can't be handled by broadcastz


Suggested change

## which can't be handled by broadcastz

## which can't be handled by broadcast

(but this comment just be moved below)

pdeffebach · 2020-04-11T17:11:24Z

@nalimilan if you are happy with this I can move on to cor and var and any other functions that need it.

nalimilan · 2020-04-14T16:30:40Z


julia> x = x .- Statistics.mean(x); y = y .- Statistics.mean(y);

julia> gx = (xi for xi in x); gy = (yi for yi in y);

julia> @btime Statistics.covzm($gx, $gy)
 11.820 μs (0 allocations: 0 bytes)
0.009480946018135657

julia> @btime Statistics.covm($gx, 0,  $gy, 0)
 14.768 μs (0 allocations: 0 bytes)
0.009480946018135657

Let me know what you think of that difference. It doesn't seem very important to me. The covzm version takes 80% as long as covm with 0s.

I think it's OK. We could get rid of that small overhead by defining an internal helper with @inline, but I guess that's not worth it (and the compiler may improve in the future).

pdeffebach · 2020-04-14T21:19:14Z

Working on correlation now. Shouldn't be too hard.

pdeffebach · 2020-04-18T16:05:48Z

I am working through this. I want to add corzm(x::Any, y::Any).

There seems to be no corzm(x::AbstractVector, y::AbstractVector), so I won't add the method above, which is analogous. Is this intentional? Do we not care about the missing method because it is not exported?

Let me know if I should cc Andreas or someone else who has worked with this code more.

nalimilan · 2020-04-19T14:57:23Z

I think that method isn't needed since this definition works both for vectors and matrices:

corm(x::AbstractVecOrMat, xmean, y::AbstractVecOrMat, ymean, vardim::Int=1) =
    corzm(x .- xmean, y .- ymean, vardim)

Looking at how tricky the changes need to be to support iterators, I'm more and more inclined to just add methods to cor which call collect on the iterators. Anyway the current code already makes a copy to subtract the mean, so if you subtract the mean in-place you won't be less efficient than the code for vectors. Calling collect has the advantage that you will know the element type, which isn't the case for non-HasEltype iterators: in that case, writing correct code could be really difficult and possibly ugly.

pdeffebach · 2020-04-19T15:20:41Z

I am starting to agree. I was doing well using the iterate methods for everything, but ran into

Using first(itr) to do dispatch, which will have compilation issues
Bugs relating to Any[1, 2], where my dispatch currently relies assuming an vector that isn't full of numbers is full of vectors.
Not working with stateful iterators anyways because of mean, as you describe above. Though corm should still work with stateful iterators.

Tests currently fail due to bug 2, and writing generic code means adding a lot of methods to helper functions. It probably isn't worth it. Take a look at the most recent commit pushed to this branch to make the final call based off of implementation that is pretty much what we want if we avoid collecting, barring a few helper method dispatches.

nalimilan · 2020-04-19T15:25:52Z

I vote for keeping things simple, at least for now. :-)

nalimilan

A few comments that probably apply also to cor, etc.

nalimilan · 2020-04-11T18:19:54Z

src/Statistics.jl

+        return (v + v) / 0
+    end
+    count = 1
+    itri, state = y


Why not use value as elsewhere?

nalimilan · 2020-04-23T14:48:21Z

src/Statistics.jl

@@ -518,20 +527,37 @@ end
 # covm (with provided mean)
 ## Use map(t -> t - xmean, x) instead of x .- xmean to allow for Vector{Vector}
 ## which can't be handled by broadcast
+covm(itr::Any, itrmean; corrected::Bool=true) =
+    covm(map(t -> t - itrmean, x); corrected = corrected)
 covm(x::AbstractVector, xmean; corrected::Bool=true) =


This method is identical to the previous one so it's no longer needed.

nalimilan · 2020-04-23T14:50:11Z

src/Statistics.jl

+"""
+function cov(x::Any, corrected::Bool=true)
+    cx = collect(x)
+    covm(cx, mean(cx); corrected=corrected)


This is going to make another copy of cx. Better call covzm directly.

nalimilan · 2020-04-23T14:51:28Z

src/Statistics.jl

+_abs2(x::Number) = abs2(x)
+_abs2(x)       = x*x'
+
+_conjmul(x::Number, y::Number) = x * conj(y)


Is this still needed?

pdeffebach added 3 commits March 13, 2020 11:13

initial commit, no tests yet

c13786d

Add a few tests

dd9c04f

fix project.toml

38a0706

nalimilan reviewed Apr 9, 2020

View reviewed changes

pdeffebach added 3 commits April 9, 2020 12:32

Fix transpose bug and conjugate bug

174040c

respond to code review and add complex number tests

131ae70

fix docstring

ff55dac

get rid of covzm

55fc30a

More correlations

ddee4e2

x, y generators

056af67

Working on tests, still using iterate

1a15b52

start converting to collect

2aa0187

nalimilan reviewed Apr 23, 2020

View reviewed changes

pdeffebach mentioned this pull request Apr 27, 2020

Make covariance and correlation work for iterators, skipmissing in particular. #34

Open

pdeffebach closed this Apr 27, 2020

pdeffebach mentioned this pull request Dec 1, 2023

Add spreadmissings JuliaData/Missings.jl#122

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make covariance and correlation work for any iterators #30

Make covariance and correlation work for any iterators #30

pdeffebach commented Mar 13, 2020

nalimilan commented Apr 3, 2020

pdeffebach commented Apr 9, 2020

pdeffebach commented Apr 9, 2020

pdeffebach commented Apr 9, 2020

nalimilan Apr 3, 2020

nalimilan Apr 3, 2020

pdeffebach Apr 9, 2020

nalimilan Apr 3, 2020

nalimilan Apr 11, 2020

pdeffebach Apr 14, 2020

nalimilan Apr 3, 2020

nalimilan Apr 3, 2020

nalimilan Apr 3, 2020

nalimilan Apr 3, 2020

nalimilan Apr 3, 2020

nalimilan Apr 9, 2020

pdeffebach commented Apr 11, 2020

nalimilan commented Apr 14, 2020

pdeffebach commented Apr 14, 2020

pdeffebach commented Apr 18, 2020

nalimilan commented Apr 19, 2020

pdeffebach commented Apr 19, 2020

nalimilan commented Apr 19, 2020

nalimilan left a comment

nalimilan Apr 11, 2020

nalimilan Apr 23, 2020

nalimilan Apr 23, 2020

nalimilan Apr 23, 2020

	return total ./ (count - Int(corrected))
	return total / (count - Int(corrected))

	return Base.mapreduce_empty_iter(_abs2, Base.add_sum, itr,
	return Base.mapreduce_empty_iter(_abs2, +, itr,

	## which can't be handled by broadcastz
	## which can't be handled by broadcast

Make covariance and correlation work for any iterators #30

Make covariance and correlation work for any iterators #30

Conversation

pdeffebach commented Mar 13, 2020

nalimilan commented Apr 3, 2020

pdeffebach commented Apr 9, 2020

pdeffebach commented Apr 9, 2020

pdeffebach commented Apr 9, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pdeffebach commented Apr 11, 2020

nalimilan commented Apr 14, 2020

pdeffebach commented Apr 14, 2020

pdeffebach commented Apr 18, 2020

nalimilan commented Apr 19, 2020

pdeffebach commented Apr 19, 2020

nalimilan commented Apr 19, 2020

nalimilan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment