Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make covariance and correlation work for any iterators #30

Closed
wants to merge 11 commits into from

Conversation

pdeffebach
Copy link

This PR is in reference to #35050 in the Julia Repo here. The goal is to make working with iterators more convenient. In particular, this PR will make working with iterators that skip over missing values more convenient.

This PR will improve both cov and cor. However it starts with cov, covm and covzm. I think I went overboard in avoiding allocations because covm and covzm are handeled via iterate only, the way mean(itr) is handled in the same file. However covm for AbstractArrays is fine to use map and allocate a new vector. The decision on how to proceed will be based on how we would like to handle stateful iterators.

Currently there are 4 new methods added.

  • covzm(itr::Any; corrected::Bool=true)
  • covzm(x::Any, y::Any, corrected::Bool=true)
  • covm(itr::Any, itrmean; corrected::Bool=true)
  • covm(x::Any, xmean, y::Any, ymean; corrected::Bool=true)

As a consequence, something like

cov(x::AbstractMatrix, itr::Any)

will break.

This is my first PR to a stdlib, so let me know if I am missing any conventions. Thanks!

@nalimilan
Copy link
Member

Thanks. That implementation is reasonable, but I wonder whether using the existing methods based on map (just widening the signatures) wouldn't be better given how simple that would be. That will allocate more, but in terms of performance it might be faster due to the use of BLAS on the collected vectors. OTOH, I generally prefer avoiding allocations where possible so...

Could you run a few benchmarks since you've already written the code? For example, testing on generators and skipmissing[s] would be interesting.

As a consequence, something like

cov(x::AbstractMatrix, itr::Any)

will break.

Why would it break?

@pdeffebach
Copy link
Author

Why would it break?

It won't "break" but it won't give you the matrix variance-covariance matrix from treating each column in the matrix as a separate vector.

It will fall back on iterate-ing the matrix and thus the itr would have to be an iterator of scalars with the same number of elements as the corresponding matrix. This might be unintuitive behavior.

@pdeffebach
Copy link
Author

One thing I wasn't expecting was that this fails:

x = [rand(5) for i in 1:10]
y = [rand(5) for i in 1:10]
cov(x, y)
# long error message

Whereas

  • If x and y are matrices where each row is one of their vectors, cov works.
  • If x and y are generators, then after this PR it works.

@pdeffebach
Copy link
Author

Here are the results of my benchmark. I added a method so that the above covariance between vectors of vectors works. This makes benchmarking easier but will also have to be added since iterators and vectors should behave similarly.

Highlights:

  • Things look good. Slowdown from vectors to generators less than a factor of 10.
  • covariances between skipmissings is just as fast as collecting them. But this is due to the overhead with iteration.
  • Not collecting generators is faster.
  • Multivariate covariances, i.e. iterators of vectors, or two matrices, are quite bad. Slowdown above a factor of 10 compared to matrix case. The two calls actually have the same number of allocations, which I don't understand. Also, my implementation is wrong. I am returning the transpose of what I want. I need to look into this.

Unfortunately given these difficulties this PR is unlikely to be merged before the 1.5 feature freeze.

============
Proportion missing: 0.2
============

Covariance:
============
N = 10000: vectors
    7.8364000000e-05 seconds

N = 10000: skipmissings, no collect
    4.9105100000e-04 seconds

N = 10000: skipmissings, collect
    4.7475300000e-04 seconds

N = 10000: generator, no collect
    6.5641000000e-05 seconds

N = 10000: generator, collect
    1.2176000000e-04 seconds

N = 10000, multivariate vector, 2 args
    1.7621935000e-02 seconds

N = 10000, multivariate matrix, 2 args
    6.8707000000e-04 seconds

N = 10000, multivariate generator, no collect
    1.6903689000e-02 seconds

N = 10000, multivariate generator, collect
    1.7486476000e-02 seconds

N = 10000, multivariate vector, 1 arg
    1.5073024000e-02 seconds

N = 10000, multivariate matrix, 1 arg
    4.1739900000e-04 seconds

N = 10000, multivariate generator, 1 arg, no collect
    1.4867595000e-02 seconds

N = 10000, multivariate generator, 1 arg, collect
    1.4857078000e-02 seconds

# core functions

unscaled_covzm(x::AbstractVector{<:Number}) = sum(abs2, x)
unscaled_covzm(x::AbstractVector) = sum(t -> t*t', x)
unscaled_covzm(x::AbstractVector{<:Number}) = sum(_abs2, x)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method isn't needed if you dispatch in _abs2.

Suggested change
unscaled_covzm(x::AbstractVector{<:Number}) = sum(_abs2, x)

# Base.IteratorEltype(x)) / 0
return NaN
end
f = let xmean = xmean, ymean = ymean
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just have f take xmean and ymean, and pass that explicitly in the two call places.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have to write (x, y) -> _conj(x - xmean, y - ymean). Wouldn't that cause the closure bug?

@@ -504,6 +525,28 @@ function covzm(x::AbstractMatrix, vardim::Int=1; corrected::Bool=true)
A .= A .* b
return A
end
function covzm(x::Any, y::Any; corrected::Bool=true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this could just call covm(x, 0, y, 0)? Check whether the compiler is able to optimize for 0 statically.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you tried this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just did.

julia> x = randn(10_000); y = randn(10_000);

julia> x = x .- Statistics.mean(x); y = y .- Statistics.mean(y);

julia> gx = (xi for xi in x); gy = (yi for yi in y);

julia> @btime Statistics.covzm($gx, $gy)
  11.820 μs (0 allocations: 0 bytes)
0.009480946018135657

julia> @btime Statistics.covm($gx, 0,  $gy, 0)
  14.768 μs (0 allocations: 0 bytes)
0.009480946018135657

Let me know what you think of that difference. It doesn't seem very important to me. The covzm version takes 80% as long as covm with 0s.

Comment on lines 532 to 535
# TODO: Understand how to improve this error.
#return Base.mapreduce_empty_iter(t -> _conj(t[2])*t[1]', Base.add_sum, itr,
# Base.IteratorEltype(x)) / 0
return NaN
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think mapreduce_empty_iter is really useful outside standard reductions. Here you just need to find the best way to compute NaN so that it works for all types. Maybe this?

Suggested change
# TODO: Understand how to improve this error.
#return Base.mapreduce_empty_iter(t -> _conj(t[2])*t[1]', Base.add_sum, itr,
# Base.IteratorEltype(x)) / 0
return NaN
v = conj(zero(eltype(y)))*zero(eltype(x))'
return (v + v) / 0

count += 1
z_itr = iterate(z, state)
end
return total ./ (count - Int(corrected))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return total ./ (count - Int(corrected))
return total / (count - Int(corrected))

return NaN
end
count = 1
value, state = z_itr
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this would be more readable (if you adapt code elsewhere):

Suggested change
value, state = z_itr
(xi, yi), state = z_itr

function covzm(itr::Any; corrected::Bool=true)
y = iterate(itr)
if y === nothing
return Base.mapreduce_empty_iter(_abs2, Base.add_sum, itr,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+ is used instead of add_sum:

Suggested change
return Base.mapreduce_empty_iter(_abs2, Base.add_sum, itr,
return Base.mapreduce_empty_iter(_abs2, +, itr,

But maybe it would be clearer to use the same approach for all methods (see suggestion for the two-argument case).

covm(x::AbstractVector, xmean, y::AbstractVector, ymean; corrected::Bool=true) =
covzm(map(t -> t - xmean, x), map(t -> t - ymean, y); corrected=corrected)
covm(x::AbstractVecOrMat, xmean, y::AbstractVecOrMat, ymean, vardim::Int=1; corrected::Bool=true) =
covzm(x .- xmean, y .- ymean, vardim; corrected=corrected)

# cov (API)
"""
cov(x::Any; corrected::Bool=true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better adapt the existing docstring (and method) to only mention iterators, since vectors are just a special case. Same for others.

@@ -517,17 +560,75 @@ end

# covm (with provided mean)
## Use map(t -> t - xmean, x) instead of x .- xmean to allow for Vector{Vector}
## which can't be handled by broadcast
## which can't be handled by broadcastz
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
## which can't be handled by broadcastz
## which can't be handled by broadcast

(but this comment just be moved below)

@pdeffebach
Copy link
Author

@nalimilan if you are happy with this I can move on to cor and var and any other functions that need it.

@nalimilan
Copy link
Member


julia> x = x .- Statistics.mean(x); y = y .- Statistics.mean(y);

julia> gx = (xi for xi in x); gy = (yi for yi in y);

julia> @btime Statistics.covzm($gx, $gy)
 11.820 μs (0 allocations: 0 bytes)
0.009480946018135657

julia> @btime Statistics.covm($gx, 0,  $gy, 0)
 14.768 μs (0 allocations: 0 bytes)
0.009480946018135657

Let me know what you think of that difference. It doesn't seem very important to me. The covzm version takes 80% as long as covm with 0s.

I think it's OK. We could get rid of that small overhead by defining an internal helper with @inline, but I guess that's not worth it (and the compiler may improve in the future).

@pdeffebach
Copy link
Author

Working on correlation now. Shouldn't be too hard.

@pdeffebach
Copy link
Author

I am working through this. I want to add corzm(x::Any, y::Any).

There seems to be no corzm(x::AbstractVector, y::AbstractVector), so I won't add the method above, which is analogous. Is this intentional? Do we not care about the missing method because it is not exported?

Let me know if I should cc Andreas or someone else who has worked with this code more.

@nalimilan
Copy link
Member

I think that method isn't needed since this definition works both for vectors and matrices:

corm(x::AbstractVecOrMat, xmean, y::AbstractVecOrMat, ymean, vardim::Int=1) =
    corzm(x .- xmean, y .- ymean, vardim)

Looking at how tricky the changes need to be to support iterators, I'm more and more inclined to just add methods to cor which call collect on the iterators. Anyway the current code already makes a copy to subtract the mean, so if you subtract the mean in-place you won't be less efficient than the code for vectors. Calling collect has the advantage that you will know the element type, which isn't the case for non-HasEltype iterators: in that case, writing correct code could be really difficult and possibly ugly.

@pdeffebach
Copy link
Author

I am starting to agree. I was doing well using the iterate methods for everything, but ran into

  1. Using first(itr) to do dispatch, which will have compilation issues
  2. Bugs relating to Any[1, 2], where my dispatch currently relies assuming an vector that isn't full of numbers is full of vectors.
  3. Not working with stateful iterators anyways because of mean, as you describe above. Though corm should still work with stateful iterators.

Tests currently fail due to bug 2, and writing generic code means adding a lot of methods to helper functions. It probably isn't worth it. Take a look at the most recent commit pushed to this branch to make the final call based off of implementation that is pretty much what we want if we avoid collecting, barring a few helper method dispatches.

@nalimilan
Copy link
Member

I vote for keeping things simple, at least for now. :-)

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments that probably apply also to cor, etc.

return (v + v) / 0
end
count = 1
itri, state = y
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use value as elsewhere?

@@ -518,20 +527,37 @@ end
# covm (with provided mean)
## Use map(t -> t - xmean, x) instead of x .- xmean to allow for Vector{Vector}
## which can't be handled by broadcast
covm(itr::Any, itrmean; corrected::Bool=true) =
covm(map(t -> t - itrmean, x); corrected = corrected)
covm(x::AbstractVector, xmean; corrected::Bool=true) =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is identical to the previous one so it's no longer needed.

"""
function cov(x::Any, corrected::Bool=true)
cx = collect(x)
covm(cx, mean(cx); corrected=corrected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to make another copy of cx. Better call covzm directly.

_abs2(x::Number) = abs2(x)
_abs2(x) = x*x'

_conjmul(x::Number, y::Number) = x * conj(y)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still needed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants