Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interest for repeat(x, lengths) function? #16443

Open
nalimilan opened this issue May 19, 2016 · 16 comments
Open

Interest for repeat(x, lengths) function? #16443

nalimilan opened this issue May 19, 2016 · 16 comments

Comments

@nalimilan
Copy link
Member

nalimilan commented May 19, 2016

DataArrays.jl currently offers a rep function modelled after R's, and which is mostly redundant with repeat (at least after #14082 is merged). So I'd like to deprecate it.

There's one missing feature, though: rep(x::AbstractVector, lengths::AbstractVector) returns a vector with each element repeated lengths[i] times. This is used once in DataFrames.jl, so it looks like it might be of more general use. Should we include this functionality in Base?

If yes, I'm not sure where it should go. It would be logical to add it to repeat via another keyword argument, but it would completely conflict with the existing ones, which means it would essentially be a separate function. We could add repvec to parallel repmat and support it there (with the same signature as rep). OTOH, it could be nice to be able to deprecate repmat when type inference is fixed for repeat.

For reference, the definition of rep:

function rep{T <: Integer}(x::AbstractVector, lengths::AbstractVector{T})
    if length(x) != length(lengths)
        throw(DimensionMismatch("vector lengths must match"))
    end
    res = similar(x, sum(lengths))
    i = 1
    for idx in 1:length(x)
        tmp = x[idx]
        for kdx in 1:lengths[idx]
            res[i] = tmp
            i += 1
        end
    end
    return res
end
@johnmyleswhite
Copy link
Member

This is a really good question. I've never felt that there's a great solution, but I would say that repvec is the least contentious.

@nalimilan
Copy link
Member Author

Should I interpret the small number of reactions as a sign of lack of support? :-)

@tlnagy
Copy link
Contributor

tlnagy commented Dec 22, 2016

I just ran into this problem and this function would be nice to have :)

@timofeytt
Copy link

+1 for repvec

@Gnimuc
Copy link
Contributor

Gnimuc commented Apr 17, 2017

@mkborregaard
Copy link
Contributor

Bump - I came to the repo to open this issue. I quite often have a use for this - one example is to interpret range-length-encoded Vectors.
In fact, RLEVectors.jl has an efficient implementation of this (but it's often a small atomic data householding operation that this is used for, and rarely worth adding a dependency, especially in package code).

@nalimilan
Copy link
Member Author

Maybe make a pull request to get the ball rolling? Since repmat no longer exists, this should probably be implemented via a keyword argument to repeat.

@mkborregaard
Copy link
Contributor

Yeah, the question is to design how it would interact with inner and outer.

@mkborregaard
Copy link
Contributor

Possibly a clearer suggestion than a new keyword would be to allow the inner keyword to take a Vector of the same length as the input, for the Vector case, and not support higher Array dimensions?

@nalimilan
Copy link
Member Author

Yeah, why not.

@mkborregaard
Copy link
Contributor

OK I'll take a look

@briochemc
Copy link
Contributor

This just came up on Slack. Any reason the PR was never merged? Also what's the current best workaround?

@vtjnash
Copy link
Member

vtjnash commented Apr 20, 2021

Seems like the PR author might have stopped working on it before it was finished. Do you want to take a stab at updating it?

@briochemc
Copy link
Contributor

mmm... I'm not sure how I would proceed, but since the Slack question was about sparse arrays, I just noticed that this could use some improvement (on Julia v1.6.0):

julia> using SparseArrays, BenchmarkTools

julia> n = 1000 ;

julia> S = sprand(n,n,1/n) ; # Let's (inner) repeat this sparse array

julia> function repeat_inner(X, t) # new function to (inner) repeat by repeating the axes first
           idxs = [repeat(axes(X,d), inner=c) for (d,c) in enumerate(t)]
           X[idxs...]
       end
repeat_inner (generic function with 1 method)

julia> repeat_inner(S, (2,3)) == repeat(S, inner=(2,3)) # check
true

julia> @btime repeat($S, inner=(20,30)) ;
  8.254 s (24 allocations: 19.82 MiB)

julia> @btime repeat_inner($S, (20,30)) ;
  1.245 ms (15 allocations: 9.81 MiB)

This poorman's benchmark gives a factor ~2 reduction in memory allocation and a factor ~5'000 improvement in speed! So before I (potentially) dive into these internals, could someone help me understand why repeat does not take repeated indices first and then just getindex on those?

@nalimilan
Copy link
Member Author

Are you sure we're talking about the same function? This issue is about passing a lengths vector with one entry for each entry in x indicating how many times to repeat it. You seem to be looking for an optimized method for the currently supported repeat behavior, which is a completely different issue.

@briochemc
Copy link
Contributor

I showed the benchmark because if I was starting a PR I would use the same approach, except I would use the rep function (from the original post) to repeat the indices inside the body of the function. Specifically, I was thinking of

function rep(x::AbstractVector, lengths::AbstractVector{T}) where {T <: Integer}
    if length(x) != length(lengths)
        throw(DimensionMismatch("vector lengths must match"))
    end
    res = similar(x, sum(lengths))
    i = 1
    for idx in 1:length(x)
        tmp = x[idx]
        for kdx in 1:lengths[idx]
            res[i] = tmp
            i += 1
        end
    end
    return res
end
function repeat_inner(X, t) # new function to (inner) repeat by repeating the axes first
    idxs = [rep(axes(X,d), v) for (d,v) in enumerate(t)]
    X[idxs...]
end

which would give

julia> A = [1 2; 3 4]
2×2 Matrix{Int64}:
 1  2
 3  4

julia> repeat_inner(A, ([1,2],[3, 4]))
3×7 Matrix{Int64}:
 1  1  1  2  2  2  2
 3  3  3  4  4  4  4
 3  3  3  4  4  4  4

Now this works on that example, but I'm guessing there is a reason that approach was not used for repeat in Base, right? This is why I posted the benchmark.

So yes, I am aware that this was a different feature. Since Julia developers/contributors generally know these things better than me, I thought I should post that benchmark first to ask if there's anything wrong with my approach (and hopefully it would help my understanding of the current Base approach).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants