Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rename LinearFast etc. and define an indexing enumerate(::IndexMethod, iter) method #16378

Merged
merged 3 commits into from
Feb 17, 2017

Conversation

timholy
Copy link
Member

@timholy timholy commented May 15, 2016

enumerate lets you count along as you access elements of an iterator/array. This can be viewed as equivalent to key/value iteration, but this perspective basically assumes linear indexing. Sometimes you'd prefer it if the returned "key" is of the most efficient indexing type. Hence, visiteachindexvalue.

I'm not wedded to the name, so let the bikeshedding begin. This was inspired by working on #16260 (which introduces another reason to be careful about the distinction between counting and indexing), but is sufficiently stand-alone that I thought it merited a separate PR.

I also noticed that enumerate doesn't inline next for LinearSlow arrays. On a quick benchmark, adding @inline gave a 3x speed improvement. In that case visit is still 2x faster (it was 6x faster), but I haven't yet tried to figure out the reason for the remaining gap.

@nalimilan
Copy link
Member

+1, but needs docs. You should probably mention visit in the docs about enumerate and vice-versa.

@timholy timholy added the needs docs Documentation for this change is required label May 15, 2016
base/exports.jl Outdated
@@ -981,6 +982,7 @@ export
enumerate,
next,
start,
visit,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to export both uppercase and lowercase any more. Let's stick with just one of them exported. Maybe only one of them defined at all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I basically agree with this. So the question is, do we want for (i, a) in enumerate(A) or for (i, a) in Enumerate(A)? I kind of prefer the lowercase version, and un-export the type.

@timholy
Copy link
Member Author

timholy commented May 15, 2016

As an alternative to a new exported name like visit, one potential model is checkbounds, which can be called as checkbounds(A, inds...) (throws if out-of-bounds) or checkbounds(Bool, A, inds...) (returns true/false). We could make it enumerate(T, A) to have enumerate return indices rather than a count.

However, here this alternative seems to have two problems:

  • The very name enumerate seems to scream "count," which key/value iteration most definitely isn't (or at least, isn't for LinearSlow arrays or future 1d arrays with unconventional indices)
  • The best option I can come up with for what would go in the first slot of enumerate is the function indices, i.e., for (I, a) in enumerate(indices, A). Reasonable, or weird?

Consequently, for the moment I suspect that exporting visit is still probably the best option.

@kmsquire
Copy link
Member

Most iterators seem to have lowercase functions, and if needed, they're backed by CamelCase types which hold the iterator state. To me, this feels like a good convention to stick to, although it has the usual downside of polluting the address space.

(As an aside, I'm liking golang's convention of qualifying most functions with their (short) package name. Numerical computing in Python seems to be going a similar direction (import numpy as np, import pandas as pd, etc.).)

@timholy
Copy link
Member Author

timholy commented May 15, 2016

I'm just wondering why we export the "backing" type at all. Seems like the user-friendly constructor is enough.

@toivoh
Copy link
Contributor

toivoh commented May 16, 2016

The name visit seems to evoke the Visitor pattern to me, which is quite different as far as I understand. But it's not so easy to come up with a better name. label could be one possibility, since you're labeling elements with their indices, but the word might be too strongly associated with goto.
Python uses the name items for (key, value) pairs. Maybe eachitem to show the connection to eachindex?

@Keno
Copy link
Member

Keno commented May 16, 2016

I called it indenumerate in AbstractTrees.

@kmsquire
Copy link
Member

In Scala, this is called zipWithIndex. zipwithindex doesn't really work (for me), but maybe zip_with_index?

@timholy
Copy link
Member Author

timholy commented May 16, 2016

keyvalue? EDIT: or indexvalue? keyvalue seems to imply that keys should work, which it doesn't now (but that could change).

@JeffBezanson
Copy link
Member

Is this faster than zip(eachindex(A), A)? I'm guessing it might be, since the state is effectively reused for both iterators.

@kmsquire
Copy link
Member

Even though there's a correlation between keys and indices, there's enough of a difference that, to me, they should remain separate concepts.

keyvalue? EDIT: or indexvalue? keyvalue seems to imply that keys should work, which it doesn't now (but that could change).

Since we already have eachindex and each* seems to have become a pattern in Julia, eachindexvalue would seem logical, if a little long.

@timholy
Copy link
Member Author

timholy commented May 16, 2016

Yes, considerably faster:

julia> function itervisit(A)
           s = zero(eltype(A))
           n = 0
           for (I, a) in visit(A)
               s += a
               n += I[1]>2
           end
           s, n
       end
itervisit (generic function with 1 method)

julia> function iterzip(A)
           s = zero(eltype(A))
           n = 0
           for (I, a) in zip(eachindex(A), A)
               s += a
               n += I[1]>2
           end
           s, n
       end
iterzip (generic function with 1 method)

julia> A = rand(10000, 1000);

julia> B = sub(A, 1:size(A,1)-1, :);

julia> @benchmark itervisit(A)
================ Benchmark Results ========================
     Time per evaluation: 13.47 ms [9.70 ms, 17.24 ms]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 0.00 bytes
   Number of allocations: 0 allocations
       Number of samples: 100
   Number of evaluations: 100
 Time spent benchmarking: 1.67 s


julia> @benchmark iterzip(A)
================ Benchmark Results ========================
     Time per evaluation: 19.27 ms [19.22 ms, 19.31 ms]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 0.00 bytes
   Number of allocations: 0 allocations
       Number of samples: 100
   Number of evaluations: 100
 Time spent benchmarking: 2.22 s


julia> @benchmark itervisit(B)
================ Benchmark Results ========================
     Time per evaluation: 33.56 ms [32.94 ms, 34.17 ms]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 48.00 bytes
   Number of allocations: 1 allocations
       Number of samples: 100
   Number of evaluations: 100
 Time spent benchmarking: 3.63 s


julia> @benchmark iterzip(B)
================ Benchmark Results ========================
     Time per evaluation: 52.60 ms [51.96 ms, 53.25 ms]
Proportion of time in GC: 0.00% [0.00%, 0.00%]
        Memory allocated: 0.00 bytes
   Number of allocations: 0 allocations
       Number of samples: 100
   Number of evaluations: 100
 Time spent benchmarking: 5.57 s

Interesting, though, that there's an allocation in itervisit for the LinearSlow case. I hadn't noticed that before, worth investigating. Might get even faster.

@timholy timholy removed the needs docs Documentation for this change is required label May 19, 2016
@timholy
Copy link
Member Author

timholy commented May 19, 2016

I went with @kmsquire's suggestion of eachindexvalue and updated the docs. Should be ready to go once CI passes (it does locally).

@timholy
Copy link
Member Author

timholy commented May 19, 2016

The AppVeyor looks like a new and very exciting (but unrelated) failure. Anyone seen that before?

@lstagner
Copy link
Contributor

None of my business but I thought I would suggest traverse instead of eachindexvalue

@kmsquire
Copy link
Member

None of my business but I thought I would suggest traverse instead of eachindexvalue

@lstagner you should definitely feel free to make suggestions like this!

I'm personally not wed to eachindexvalue--I just think it's the most consistent with eachindex. I would actually prefer indices and indexvalues. To me, traverse doesn't evokes the idea of returning an index and a value for each element in an array... but anyway, we'll see where this goes.

@timholy timholy added the needs decision A decision on this change is needed label Jul 19, 2016
@oxinabox
Copy link
Contributor

Is there intent to rebase this for 0.6, or 1.0?

@lstagner
Copy link
Contributor

Since this got brought up again one last bit of bikeshedding.

I still think this function should be called traversal or traverse

In the most general sense an array (or any iterable collection) can be thought of type of graph where each vertex has a label (index, key,...) and a value. This iterator then returns the label/value pairs of this graph in the most efficient way. Since the pairs have a special order, in this case efficient memory order, the collection of pairs is a traversal of the graph.

Here is the definition of traversal

In computer science, graph traversal refers to the process of visiting (checking and/or updating) each vertex in a graph/tree. Such traversals are classified by the order in which the vertices are visited.

In principle this function could be generalized to any iterable collection.

@timholy
Copy link
Member Author

timholy commented Feb 15, 2017

To me the fundamental obstacle is having both enumerate and whatever we decide to call this---maybe I'm being too much of a purist, but I really don't like having two things which do almost the same thing. If you can figure out how to resolve that problem satisfactorily, I'd be excited to revise this and merge it. Until then, less so.

@nalimilan
Copy link
Member

This is a pattern we have in other cases too, like for find (vs. findn), findfirst, findmax, etc., where one could want either a linear or a cartesian index (or any kind of custom index). Cf. the Find & Search Julep.

I suggest to use a general mechanism like enumerate(LinearIndex, x) vs. enumerate(CartesianIndex, x) vs. enumerate(FastestIndex, x). The latter could use a better name, but the idea is that you would call it when you don't care about the kind of index and want either linear or cartesian depending on whether the array is LinearFast or LinearSlow. Of course enumerate(x) would keep it's current meaning.

@timholy
Copy link
Member Author

timholy commented Feb 15, 2017

Hmm, maybe in conjunction with a renaming (#20175 (comment)) this might be viable. Putting this up for consideration for 0.6. I saw that tomorrow is the stated deadline; I can't do it today, but I probably could tackle it by the end of the day tomorrow. I'll mark this now, and folks coordinating the 0.6 release can comment on whether they think it's important enough to squeeze in.

@timholy timholy added this to the 0.6.0 milestone Feb 15, 2017
@timholy
Copy link
Member Author

timholy commented Feb 16, 2017

One big problem, though, if we return instances rather than types:

julia> CartesianIndex()
CartesianIndex{0}(())

But we don't actually want a 0-dimensional cartesian index, we want it to match the dimensionality of the input array. We have this:

julia> CartesianIndex{3}()
CartesianIndex{3}((1, 1, 1))

but now I worry that this is less obviously a "trait" than it is a value.

@timholy
Copy link
Member Author

timholy commented Feb 16, 2017

OK, review comments addressed with the exception of #16378 (comment). If we really want to close the feature/deprecation window today, I think it's too risky to try to fix something like this now, and we should just go with the improvement we have.

It passes locally, but perhaps better to let it get through CI.

@tkelman
Copy link
Contributor

tkelman commented Feb 16, 2017

shouldn't harm anything, but just in case @nanosoldier runbenchmarks(ALL, vs = ":master")

@mbauman
Copy link
Member

mbauman commented Feb 16, 2017

Yes, I think we'd have to move to a type-based system in order for CartesianIndex to make sense as a trait. But even then, there are multiple possibilities for what a cartesian array should return as its index style — does it just return CartesianIndex? Or CartesianIndex{ndims(A)}? Could a three-dimensional array specify that it's most efficient to index it as a CartesianIndex{2}? (This isn't all that esoteric; it's how it'd be best to index a 3-d array that uses a compressed-column storage like SparseCSCMatrix.) We'd also have to decide on if we do const LinearIndex = Int for symmetry or if we introduce a new index type. We'd also lose the abstract trait tree, and could no longer dispatch on the general ::IndexStyle. And, really, I think it'd make us hammer down a definitive answer as to what a linear index means sooner than we're ready to.

I think it's too late in the release cycle to open all those cans of worms. This rename is already a huge win. I think that the difference in names can be attributed to the difference between how you index and what you index with… and I think that is defensible.

@nanosoldier
Copy link
Collaborator

Something went wrong when running your job:

NanosoldierError: failed to run benchmarks against primary commit: failed process: Process(`sudo cset shield -e su nanosoldier -- -c ./benchscript.sh`, ProcessExited(1)) [1]

Logs and partial data can be found here
cc @jrevels

@tkelman
Copy link
Contributor

tkelman commented Feb 16, 2017

deprecation maybe not working as it should?
ERROR: LoadError: indexing not defined for BaseBenchmarks.ArrayBenchmarks.ArrayLF{Int32,2}

@nalimilan
Copy link
Member

But even then, there are multiple possibilities for what a cartesian array should return as its index style — does it just return CartesianIndex? Or CartesianIndex{ndims(A)}? Could a three-dimensional array specify that it's most efficient to index it as a CartesianIndex{2}? (This isn't all that esoteric; it's how it'd be best to index a 3-d array that uses a compressed-column storage like SparseCSCMatrix.)

I would say it's completely orthogonal to the choice of merging IndexCartesian and CartesianIndex. In both cases, we could support returning the dimensionality as a type parameter later, but that would be an extension of the system anyway.

We'd also have to decide on if we do const LinearIndex = Int for symmetry or if we introduce a new index type.

Well, Int is ambiguous as it could mean 1-based indices or any other kind of offset indices; or it could be row or column-major. So that really needs to be a separate LinearIndex type (after defining what it means).

I admit that passing CartesianIndex() to choose the return type sounds like an abuse (even if it probably wouldn't create any practical problems). I guess it's OK to go with the new names for now, but I hope we can find something better for 1.0...

@timholy
Copy link
Member Author

timholy commented Feb 16, 2017

Good catch, @tkelman. Should be fixed now. @nanosoldier runbenchmarks(ALL, vs = ":master").

@mbauman
Copy link
Member

mbauman commented Feb 16, 2017

@nalimilan Yes, sorry, I didn't make it clear my questions were rhetorical. Point is simply that I don't think the use of CartesianIndex would be particularly obvious, even if we go up to the type domain.

@@ -1272,6 +1272,11 @@ for f in (:airyai, :airyaiprime, :airybi, :airybiprime, :airyaix, :airyaiprimex,
end
end

@deprecate_binding LinearIndexing IndexStyle
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This macro will export the old bindings. The @deprecate macro has an optional third argument to disable this behavior — perhaps you could add that to @deprecate_binding, too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Fixed (I hope).

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @jrevels

@tkelman tkelman merged commit cb50dee into master Feb 17, 2017
@tkelman tkelman deleted the teh/visit branch February 17, 2017 08:22
maleadt added a commit to JuliaGPU/CUDAnative.jl that referenced this pull request Feb 17, 2017
@malmaud
Copy link
Contributor

malmaud commented Feb 17, 2017

Should we think about exporting a short form for enumerate(IndexStyle(S), S)? That's probably both a common case and quite verbose to write out.

@timholy
Copy link
Member Author

timholy commented Feb 17, 2017

See discussion above. There seems to be a shortage of good names in the English language.

@Sacha0 Sacha0 added deprecation This change introduces or involves a deprecation needs news A NEWS entry is required for this change labels May 20, 2017
Sacha0 added a commit to Sacha0/julia that referenced this pull request May 20, 2017
tkelman pushed a commit that referenced this pull request May 24, 2017
@Sacha0 Sacha0 removed the needs news A NEWS entry is required for this change label May 25, 2017
tkelman pushed a commit that referenced this pull request Jun 3, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
deprecation This change introduces or involves a deprecation needs decision A decision on this change is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.