Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Julep: broadcast with dictionaries #25904

Open
andyferris opened this issue Feb 6, 2018 · 28 comments
Open

Julep: broadcast with dictionaries #25904

andyferris opened this issue Feb 6, 2018 · 28 comments
Labels
broadcast Applying a function over a collection collections Data structures holding multiple items, e.g. sets julep Julia Enhancement Proposal

Comments

@andyferris
Copy link
Member

andyferris commented Feb 6, 2018

Current behavior

I just learned that apparently broadcast doesn't work as I expected on dictionaries:

julia> d = Dict(:a => "Alice", :b => "Bob", :c => "Charlie")
Dict{Symbol,String} with 3 entries:
  :a => "Alice"
  :b => "Bob"
  :c => "Charlie"

julia> map(kv -> kv[1] => kv[2], d)
Dict{Symbol,String} with 3 entries:
  :a => "Alice"
  :b => "Bob"
  :c => "Charlie"

julia> broadcast(kv -> kv[1] => kv[2], d)
ERROR: KeyError: key 1 not found
Stacktrace:
 [1] getindex at ./dict.jl:474 [inlined]
 [2] (::##17#18)(::Dict{Symbol,String}) at ./REPL[11]:1
 [3] broadcast(::Function, ::Dict{Symbol,String}) at ./broadcast.jl:455

julia> broadcast(length, d)
3

AFAICT it seems broadcast is working on dictionaries like they are scalars or something. Someone correct me if I'm wrong - but maybe we never added any broadcast methods for dictionaries?

Proposal

Make broadcast work with dictionaries. I can see two possibilities, of which I favor option 2:

  1. broadcast for 1D indexable containers generally works like map, so we can extend this to AbstractDict and have broadcast map the key-value pairs
  2. broadcast deals with matching indices and performing some operation on values. So broadcast(f, dict) could be a bit like map(f, values(dict)) except the dictionary structure and it's indices are preserved in the output.

It seems to me that map deals with the iteration protocol and broadcast is more about matching indices and mapping those values. Using option 2 would give:

julia> length.(d)
Dict{Symbol,Int64} with 3 entries:
  :a => 5
  :b => 3
  :c => 7

IMO this kind of syntax could help a lot with manipulating dictionaries (one example is using AbstractDict for output of groupby-style operations, and then acting on the groups) while making no breaking changes to dictionary iteration or map.

To me it also dovetails with the "getindices julep" via this comment #24019 (comment) which basically suggests we can get multi-valued getindex between different indexable container types including dictionaries via the lowering transformation:

a.[inds] --> broadcast(i -> a[i], inds)

EDIT: Addendum (27/07/2019)

I forgot to mention a crucial use case of broadcasting over dictionaries where you have two or more dictionaries with the same keys and you want to match up the values based on the keys. For example dict1 .+ dict2. For our hash-based Dict, the user can't really predict the iteration order and they may actually differ between dict1 and dict2 even if the user can guarantee that they share the same set of keys (for example, dict2 might have been rehashed and dict1 not, but IMO this implementation detail shouldn't affect the semantics of the broadcast operation).

@ararslan ararslan added julep Julia Enhancement Proposal collections Data structures holding multiple items, e.g. sets broadcast labels Feb 6, 2018
@mbauman
Copy link
Member

mbauman commented Feb 6, 2018

Duplicate of #18618. Do you mind if we try to keep all the discussion consolidated there?

@nalimilan
Copy link
Member

The additional complexity compared with #18618 is that for Dicts we need to decide whether the broadcasted function should receive values or pairs. So maybe we can discuss that issue here, and keep the discussion regarding general iterables at #18618.

I suggest throwing an error when a Dict is passed to broadcast for now (just like I think we should throw an error for an iterator which doesn't implement the right methods), so that we can discuss the design and implement it in 1.x.

@andyferris
Copy link
Member Author

I suggest throwing an error when a Dict is passed to broadcast for now

I agree wit that; we probably shouldn't ship with a known broken implementation of broadcast on AbstractDict that doesn't error out.

@stevengj
Copy link
Member

stevengj commented Feb 7, 2018

I don't think it's "broken." It's not a bug, or unintentional. It was a conscious decision that broadcast is different from map: the former treats things as scalars by default, and the latter as iterators; the former "broadcasts" scalars to match array arguments, whereas the latter errors for mismatched dimensions.

Dictionaries, in particular, are not treated as "arrays" because (a) they don't have a shape/dimensionality, (b) they have an undefined iteration order, and (c) we intentionally had only a small whitelist of things that are array-like containers for broadcast.

Of course a "do what I mean" behavior would be better, but the point is that there are some times when you want one behavior and some times the other.

Making it an error just means you get neither behavior. It doesn't seem like an improvement.

@stevengj
Copy link
Member

stevengj commented Feb 7, 2018

For an example of where the current behavior is useful, think of anything where you are broadcasting over the keys:

julia> d = Dict(3=>4, 5=>6, 7=>20)
Dict{Int64,Int64} with 3 entries:
  7 => 20
  3 => 4
  5 => 6

julia> get.(d, 1:10, 0)
10-element Array{Int64,1}:
  0
  0
  4
  0
  6
  0
 20
  0
  0
  0

Erroring on a Dict argument to broadcast would break this, and gain us nothing useful in return.

@mbauman
Copy link
Member

mbauman commented Feb 7, 2018

Personally, I find it quite unexpected that map(f, x) is different from f.(x). And these issues suggest I'm not alone. In fact, I think making these two behave the same is a crucial element to any proposed solution to #18618. That seems to be where that issue is slowly converging.

Note that we've deprecated map(f, dict) mapping over pairs in #5794 with the intention that we can change it to map over values in the future, so have space to design map here, too. That'll change.

The big question is if we have:

map(sqrt, Dict(:a=>4,:b=>9)) # => Dict(:a=>2.0, :b=>3.0)

and

sqrt.(Dict(:a=>4,:b=>9))    # => Dict(:a=>2.0, :b=>3.0)
Dict(:a=>4, :b=>9) .+ [1,2] # => Dict(:a=>5, :b=>11) # EDIT: THIS MAKES NO SENSE, AND SHOULD ERROR

Then what would happen in the case like Dict(:a=>4, :b=>9) .+ [1 2]? We simply don't have a data structure (built in) for that. I suppose it could be an error or a matrix. Edit: Never mind — we shouldn't allow this operation since the keys between the dict and array don't match.

@stevengj
Copy link
Member

stevengj commented Feb 7, 2018

f.(x) should be either broadcast or map. Longstanding usage for things like .* suggest the former. And those two functions are definitely not the same. map(f, array, scalar) will presumably always throw an error, and map(f, rowvector, columnvector) will never produce a matrix, no?

Almost all of these things seem to boil down to "I want map, not broadcast."

@mbauman
Copy link
Member

mbauman commented Feb 7, 2018

Absolutely these two functions are not the same. But I think they should collapse to the same behavior in the one-argument case.

@stevengj
Copy link
Member

stevengj commented Feb 7, 2018

Even in the one-argument case, there are differences. e.g. map(f, string) maps over characters. But broadcast (rightly, I think), treats strings as "scalars", which is extremely useful for processing collections of strings.

I don't think it makes broadcast more intuitive if its behavior suddenly changes when you have only a single argument.

@stevengj
Copy link
Member

stevengj commented Feb 7, 2018

So your proposal is that whether broadcast treats an argument as a collection or an atom should depend on the number of arguments?

@mbauman
Copy link
Member

mbauman commented Feb 7, 2018

No, definitively not. My preference would be to systematically re-work broadcast along the lines of #18618 (comment), but I know you feel differently.

@JeffBezanson
Copy link
Member

I know there are examples where it's useful for broadcast to treat strings or dicts (or anything else for that matter, including arrays) as scalars, but "seems useful" does not generalize. broadcast should have a simple general behavior that naturally collapses to map in the 1-argument case.

If we were only dealing with the word broadcast, I could be quite flexible. But with that dot, broadcast is occupying some valuable real estate and so it needs to play nice.

@nalimilan
Copy link
Member

I don't think strings are the best example to decide how broadcast should behave, since they are really the only collection which is more often considered as a scalar than as an actual collection. Better decide the general behavior based on other containers (arrays, tuples, dicts, sets...), and only then ask whether strings should follow that convention or be an exception IMHO.

@jekbradbury
Copy link
Contributor

jekbradbury commented Feb 7, 2018

My mental model of broadcasting is basically that broadcast(f, x_1, ..., x_n) should mean something semantically equivalent to:

  1. For each x_i in x_1...x_n, what are its indices inds_i (also includes what's currently referred to as keys, and the implicit linear indices of types that satisfy the iteration interface but don't have indices). If an argument doesn't have either kind of indices, it is a scalar, which carries a single implicit index (maybe it's wrapped in a Ref to make this possible, and also to allow users to designate scalars).
  2. Try to create a new set of indices inds such that the sequence of functions g_1...g_n is a natural, structure-preserving way to map each index in inds to a single index in each inds_i. If this isn't possible, throw an error.
  3. Return the unique y whose value at each index ind in inds is f(x_1[g_1(ind)], ..., x_n[g_n(ind)]) and whose type is a predictable, minimally complicated type that can hold all such values.

In practice, such a function g_i would have to return a tuple of indices which is splatted into x_i's getindex in order to support Ref, whose getindex takes no arguments.

@vtjnash
Copy link
Member

vtjnash commented Feb 7, 2018

As @mbauman, such questions of the preferred model are being discussed in #18618. In particular, the question is whether to define that (a) broadcast operates on any iterable (so broadcasting values over keys would mean writing broadcast(f, values(dict)), to express the situation where iteration coincides with indexing), or whether (b) the broadcast operates on indexes (so it only makes sense to use it when the indexes form a cartesian shape, and broadcast(f, dict) is the same as f(dict)), or whether (c) it can also broadcast non-cartesian keys (so broadcasting a dict would require that all other elements be either scalars or have the same indexes, or be an nd-sparse table with more complex rules; and the alternate version are written as either broadcast(f, (dict,)) or broadcast(f, collect(p for p in dict))

Currently, we most nearly implement (b). This proposes we implement (c), which is mostly just a superset of (b). The issue #18618 proposes that we do (a) instead. Summary, in table form, of the possible combinations of (1-D) operations you might want to do and how to do them:

Operation Current / b a c d Generator
as-value broadcast(f, dict) broadcast(f, (dict,)) broadcast(f, (dict,)) broadcast(f, (dict,)) f(dict)
value-map-keys Dict(zip(keys(dict), broadcast(f, collect(values(dict))))) broadcast(f, values(dict)) broadcast(f, dict) broadcast(f, dict) Dict(k => f(v) for (k, v) in dict)
value-map-order broadcast(f, collect(values(dict))) broadcast(f, (v for v in values(dict))) broadcast(f, collect(values(dict)))) broadcast(f, (v for v in values(dict)))) Array(f(v) for (k, v) in dict)
pair-map-order broadcast(f, collect(p for p in dict)) broadcast(f, (p for p in dict)) broadcast(f, collect(p for p in dict)) broadcast(f, (p for p in dict)) Array(f(p) for p in dict)
pair-map-keys Dict(broadcast(f, collect(p for p in dict))) broadcast(f, dict) Dict(broadcast(f, collect(p for p in dict))) Dict(broadcast(f, (p for p in dict))) Dict(f(p) for p in dict)

edit: added column d as suggested below by jekbradbury, which is mostly a superset of d, but where objects without a shape (cartesian axes) and without keys (non-cartesian indices) would instead assume ordered iteration

edit: also added a column of what this broadcast operation would mean in terms of a Generator

@jekbradbury
Copy link
Contributor

I guess I'm arguing that broadcast should operate on indices but fall back to iteration? So column c but without the need for collects?

@vtjnash
Copy link
Member

vtjnash commented Feb 7, 2018

Ah, good call. I added a column (d) and a note to describe it. It's mostly (c) without collect, but where value-map-order is the same as column (a)

@vtjnash
Copy link
Member

vtjnash commented Feb 7, 2018

And now also added one more column "Generator", to give an equivalent result in terms of a different expression construct, for more clarity

@bkamins
Copy link
Member

bkamins commented Jul 26, 2019

Based on the discussion with @mbauman here is a list of possible test cases to consider (if they should error and if not what result should be produced):

  • (a=1,b=1) .+ [1 2]
  • (a=1,b=1) .+ (1,2)
  • (a=1,b=1) .+ [1,2]
  • (a=1,b=1) .+ (a=1,b=2)
  • (a=1,b=1) .+ (c=1,d=2)

EDIT: And I "vote up" for a higher priority for this decision as I get requests for DataFrames.jl functionalities that depend on this decision. Thank you!

@andyferris
Copy link
Member Author

My opinion has become firmly that broadcast should match-up the values from keys (there is some trickery allowed here for CartesianIndex's) rather than worry about iteration order.

Therefore the 4th one makes sense to me, as does:

(a=1,b=1) .+ (b=2,a=1) --> (a=2,b=3)

The others seem odd to me (and could be errors), except perhaps that named tuples have two kinds of indexes, the Symbols and the Integers (but I personally don't see the latter as particularly useful when you can unwrap it into a tuple so trivially, yet their presence serves to complicate a bit the indexing API).

One of the reasons I think keys is really important and fundamental is really the OP - I want this to work for hash-based Dicts and to be able to write things like dict1 .+ dict2. Unfortunately, for all intents and purposes the relative ordering of the elements of two different hash-based Dicts is not deterministic from the perspective of the programmer (it might actually be deterministic underneath it all but there's no real way for users to align the ordering of two Dicts to be the same, and even if they tried that's not necessarily cheap).

@bkamins If you're thinking of DataFrames.jl I suspect you'll want to match column names together, right?

@andyferris
Copy link
Member Author

(I've updated the OP with the two-Dict example)

@bkamins
Copy link
Member

bkamins commented Jul 27, 2019

Currently in DataFrames.jl we have a rule that names must match exactly (order and values), except for push!, when we allow for a mixed order match on names and setindex! of AbstractDict on RHS (where clearly we can only look at values). So we have a slight inconsistency, which I want to get rid of.

If we had a decision in Base (via this PR) what are the rules of combining collections with names I would copy them to DataFrames.jl to be consistent.

@andyferris
Copy link
Member Author

andyferris commented Jul 27, 2019

OK, thanks, I agree.

I have two more examples I think are important. The first is named tuples and dictionaries:

out = (a = 0, b = 1) .+ Dict(:a => 2, :b => 4)

If broadcast were iteration based then the set of values (Set(values(out))) could depend on details like whether we are running Julia on a 32-bit or 64-bit system.

Another example is mixing different types of AbstractDict:

sortdict .+ hashdict

I feel the only semantically useful definition here is that the broadcast operates by matching up the keys and basically ignoring the iteration order. I can't see anything else that is useful or helpful to users.

Basically, I'd like to be able to write generic code over AbstractDict, like function f(a::AbstractDict, b::AbstractDict); ... ; end. It would be nice to be able to use something like broadcast inside this generic function. Going further, I'd love to be able to duck type this function function f(a, b); ... ; end and let me mix-and-match different containers like dictionaries and named tuples, which really have very similar interfaces.

To reach that goal I believe the best way forward is that map obeys iteration such that first(map(f, a, b)) == f(first(a), first(b)) and broadcast obeys indexing such that broadcast(f, a, b)[i] = f(a[i], b[i]) (for single-dimensional containers; we would do something slightly different where i isa CartesianIndex). With these guarantees I can use map and broadcast in more generic code, the two functions operate have well-defined semantics, and they have somewhat orthogonal concerns (that tend to overlap for e.g. AbstractArray).

@tkf
Copy link
Member

tkf commented Jul 27, 2019

a list of possible test cases to consider (if they should error and if not what result should be produced):

  • (a=1,b=1) .+ [1 2]
  • (a=1,b=1) .+ (1,2)
  • (a=1,b=1) .+ [1,2]
  • (a=1,b=1) .+ (a=1,b=2)
  • (a=1,b=1) .+ (c=1,d=2)

How about the test case involving the lhs?

y = Dict(:b=>1)
y .= (a = 1,)

@bkamins
Copy link
Member

bkamins commented Jul 28, 2019

I agree that LHS (broadcasting assignment) test cases are relevant, but they are more complex, so I have left them out intentionally for later.

The reason is that copyto! is usually overloaded by a custom type of LHS anyway, and the first thing we should settle on is what Style should have a Broadcasted object have on RHS (this is needed anyway to implement a proper copyto! method as this Broadcasted object is passed to it anyway).

But I fully agree that we should have a consistent broadcasting + broadcasting assignment system for types defined in Base.

Just for a reference - in DataFrames.jl we have settled that if we have a broadcasting assignment and we have a data frame on LHS then:

  • we use assignment using column and row number;
  • we require column names to match exactly including order if RHS defines column names (currently this is only data frames, but after this Julep I guess we will have more possibilities).

@nalimilan
Copy link
Member

I agree it's clear that broadcasting on dicts should only care about keys and not about order. But the situation is more difficult for named tuples, since they are somewhat a hybrid of a tuple and of a dict. I suggest this:

  • broadcasting a named tuple with a tuple or vector should use the order of values
  • broadcasting a named tuple with a dict should use names. It should also probably return a named tuple, so that the order of the named tuple is preserved
  • broadcasting a named tuple with a named tuple should require both names and order to match

@tkf
Copy link
Member

tkf commented Sep 19, 2019

Before implementing everything in Base in one go, how about implementing this outside Base? More specifically, how about allowing to construct Broadcasted object but throw the not-implemented error in the materialization phase? Note that this is impossible at the moment because the error is thrown in broadcastable which is invoked before constructing Broadcasted. This would let people use something like

joinstyle.((a=1,b=1) .+ (a=1,b=2))

where joinstyle is a "magic" function that sets a custom broadcast style implemented in some package.

Importantly, this code would be forward-compatible even after something similar is implemented in Base. I think this let people try this in more serious code base than some toy experimental packages.

Extending this idea, it would be interesting to have different kind "join style":

innerjoin.((a=1,b=1) .+ (a=1,c=2))  # => (a=2,)
outerjoin.((a=1,b=1) .+ (a=1,c=2))  # => (a=2,b=1,c=2)  #???

(I discussed a similar idea in #32081 (comment))

@andyferris
Copy link
Member Author

I thought it might be relevant to reference Dictionaries.jl which was recently released, and implements a system for dictionary broadcasting.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
broadcast Applying a function over a collection collections Data structures holding multiple items, e.g. sets julep Julia Enhancement Proposal
Projects
None yet
Development

No branches or pull requests

10 participants