Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Statistic.mean(f, A) calls f length(A)+1 times #49

Open
stefan911 opened this issue Aug 5, 2020 · 19 comments · May be fixed by #151
Open

Statistic.mean(f, A) calls f length(A)+1 times #49

stefan911 opened this issue Aug 5, 2020 · 19 comments · May be fixed by #151

Comments

@stefan911
Copy link

Was expecting an iterator to only call the function once per element.

"""
Testing mean calls F length+1 times
"""
function meantest()
    aa = [1, 3, 2, 7, 11]
    bb = [1, 2, 3, 4, 5]


    function meanf(a)
        b = 1
        d = mean(a) do v
            r = v+b
            println("$b $r $v")
            b += 1
            return r
        end
        return d
    end

    avg = sum(aa .+ bb) / length(aa)
    avg2 = meanf(aa)
    println(" $avg != $avg2 ")
end


julia> versioninfo()
Julia Version 1.5.0-rc2.0
Commit 7f0ee122d7 (2020-07-27 15:24 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin19.5.0)
CPU: Intel(R) Core(TM) i7-3615QM CPU @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, ivybridge)
Environment:
JULIA_NUM_THREADS = 8

@fredrikekre fredrikekre transferred this issue from JuliaLang/julia Aug 5, 2020
@nalimilan
Copy link
Member

Right, f is called twice on the first element because of this line:
https://github.com/JuliaLang/Statistics.jl/blob/b384104d35ff0e7cf311485607b177223ed72b9a/src/Statistics.jl#L176

This behavior is new in Julia 1.5.0 (#25). But I'm not sure it's a problem: AFAIK we don't make any guarantees about how many times f will be called. Also note that this happens only for arrays, not for general iterators.

FWIW, the previous code was this:
https://github.com/JuliaLang/Statistics.jl/blob/a2203d3b67f7413701be5de251622cb85c9cc69d/src/Statistics.jl#L159

I'm not sure we can retain the new promotion behavior and at the same time avoid calling f twice. Cc: @stevengj @kagalenko-m-b

@stevengj
Copy link
Contributor

stevengj commented Aug 9, 2020

We could do something like:

x1 = f(first(A)) / 1
result = x1 + sum(x -> _mean_promote(x1, f(x)), @view A[begin+1:end])

in the common case where A is a 1d array and dims = 1.

@stefan911
Copy link
Author

That looks reasonable, thanks. It's easy to work around but was a bit surprising. Guess I view an iterator with a visiting function to have an invariant of only calling the function once per element.

@marius311
Copy link

Also ran into this as I was surprised why my very expensive function was getting called N+1 times instead of N (which really mattered, since my N was small). Having that special case or some other workaround would be great.

@stevengj
Copy link
Contributor

stevengj commented Jul 5, 2021

Should be a straightforward PR to implement something like the suggestion above, if someone wants to take a crack at it?

@kagalenko-m-b
Copy link
Contributor

x1 = f(first(A)) / 1

Unclear how to define first() when there's dims argument.

@stevengj
Copy link
Contributor

stevengj commented Jul 6, 2021

I wrote:

in the common case where A is a 1d array and dims = 1.

It would be nice to at least fix this simple common case.

@kagalenko-m-b
Copy link
Contributor

I must have overlooked the qualification because it feels unsatisfactory to me.

@stefan911
Copy link
Author

stefan911 commented Jul 6, 2021

Interesting rabbit hole...

  1. current code behaves like mapreduce with no expectation that f() will be called n times see mapreduce documentation.
    This was a surprise as I was expecting interation not a map reduce algorithm. Document or use first case of code below.

  2. Union missing and abstract float arrays with dims selection seems like a case that should work.
    mean([1.f0 3.f1 2; 1.0 3 missing],dims=1) issue Mean of an array with missing values does not work if the dims argument is provided #7

  3. @test mean(x) == sum(x) / length(x) ? shouldn't this be approx floating point compare?

@test (@inferred mean(Int[])) === 0/0
    @test (@inferred mean(Float32[])) === 0.f0/0    
    @test (@inferred mean(Float64[])) === 0/0

seems like

typeof(mean(Float32[]) == typeof(0.f0/0)

is good enough I have no clue why the compiler is returning Any

Julia> @inferred (mean([1.f0 3.f1 2; 1.0 3 missing],dims=1))
ERROR: return type Matrix{Union{Missing, Float64}} does not match inferred return type Any
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] top-level scope
   @ REPL[27]:1

julia> mean([1.f0 3.f1 2; 1.0 3 missing],dims=1)
1×3 Matrix{Union{Missing, Float64}}:
 1.0  16.5  missing

Recommend case two to assist with issue #7. The first case this bug can be handled with documentation change.

Code:
first case handles call f() once per element
second case for : mean([1.f0 3.f1 2; 1.0 3 missing],dims=1)
third case is original and handles type conversion from non-abstract floats but calls N+1 times

function _mean(f, A::AbstractArray, dims::Dims=:) where Dims
    isempty(A) && return sum(f, A, dims=dims)/0

    if dims === (:)
        x1 = f(first(A)) / 1
        n = length(A)
        if n == 1
            return x1
        end
        AA = @view A[2:end]
        result = x1 + sum(x -> _mean_promote(x1, f(x)), AA, dims=dims)
        return result / n
    elseif typeof(first(A)) <: AbstractFloat
        x1 = convert(typeof(first(A)), 0.f0)
        n = mapreduce(i -> size(A, i), *, unique(dims); init=1)
        result = sum(x -> _mean_promote(x1, f(x)), A, dims=dims)
        return result ./= n
    else # note: f() will be called n+1 times
        x1 = f(first(A)) / 1
        n = mapreduce(i -> size(A, i), *, unique(dims); init=1)
        result = sum(x -> _mean_promote(x1, f(x)), A, dims=dims)
        return result ./= n
    end
end

@stevengj
Copy link
Contributor

stevengj commented Jul 6, 2021

(PS. please quote code in the future, @stefan911 … I edited your post to fix this.)

@stefan911
Copy link
Author

Thanks.

Since f() can change the return type it is difficult to remove the third case without unrolling into for loops.

@kagalenko-m-b
Copy link
Contributor

kagalenko-m-b commented Jul 6, 2021

x1 = f(first(A)) / 1
result = x1 + sum(x -> _mean_promote(x1, f(x)), @view A[begin+1:end])

That also fails on vectors with one element, so it should be

 x1 = f(first(A)) / 1
result = sum(x -> _mean_promote(x1, f(x)), @view A[begin+1:end], init=x1)

@kagalenko-m-b
Copy link
Contributor

kagalenko-m-b commented Jul 7, 2021

Created PR

I think Base.first() should be amended to accept dims argument. On the other hand, in multidimensional case performance impact of an extra call to f() is proportionally smaller.

@stefan911
Copy link
Author

Opinion: The general case of an iterator is to apply the function exactly once per element, to do otherwise should be documented in the help. foldr, foldl vs mapreduce. The expectation that f() is a pure function shouldn't be implied.

Did issue #7 get addressed as well?

@kagalenko-m-b
Copy link
Contributor

Did issue #7 get addressed as well?

That probably should be fixed in Base.

@mschauer
Copy link
Member

As said elsewhere, it would be nice if Julia would develop a standard design pattern for this. In accumulation where there is no sensible typestable default value, you end up splitting off the first iteration, duplicate the entire loop body, and make a tiny change to it (eg += becomes =) and continue with a loop over the rest of the iterator.

@nalimilan nalimilan linked a pull request Sep 9, 2023 that will close this issue
@nalimilan
Copy link
Member

nalimilan commented Sep 9, 2023

Actually I think I've found a super simple solution which works in general and luckily doesn't reduce performance. See #151.

EDIT: that's more complex than I thought. I don't have a good solution.

@mschauer
Copy link
Member

mschauer commented Sep 9, 2023

Not sure if it helps you but DNF gave me me a quirky snipped which initialises an accumulator in the loop in a way Julia can handle

function myfsum(f, xs)
   isfirst = true
   local y
   for x in xs
        if isfirst
            isfirst, y = false, f(x)
        else
            y += f(x)
        end
    end
    return y
end
julia> @time sum(sin, 1:10000)
  0.000155 seconds
1.633891021792461

julia> @time myfsum(sin, 1:10000)
  0.000130 seconds
1.6338910217924516

@nalimilan
Copy link
Member

Unfortunately the difficulty comes from the fact that we do not control the loop as we want to use pairwise summation just like sum. So we have to hack the mapreduce machinery (or copy it but that's a lot of code duplication).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants