Julep: `broadcast` with dictionaries #25904

andyferris · 2018-02-06T00:10:37Z

Current behavior

I just learned that apparently broadcast doesn't work as I expected on dictionaries:

julia> d = Dict(:a => "Alice", :b => "Bob", :c => "Charlie")
Dict{Symbol,String} with 3 entries:
  :a => "Alice"
  :b => "Bob"
  :c => "Charlie"

julia> map(kv -> kv[1] => kv[2], d)
Dict{Symbol,String} with 3 entries:
  :a => "Alice"
  :b => "Bob"
  :c => "Charlie"

julia> broadcast(kv -> kv[1] => kv[2], d)
ERROR: KeyError: key 1 not found
Stacktrace:
 [1] getindex at ./dict.jl:474 [inlined]
 [2] (::##17#18)(::Dict{Symbol,String}) at ./REPL[11]:1
 [3] broadcast(::Function, ::Dict{Symbol,String}) at ./broadcast.jl:455

julia> broadcast(length, d)
3

AFAICT it seems broadcast is working on dictionaries like they are scalars or something. Someone correct me if I'm wrong - but maybe we never added any broadcast methods for dictionaries?

Proposal

Make broadcast work with dictionaries. I can see two possibilities, of which I favor option 2:

broadcast for 1D indexable containers generally works like map, so we can extend this to AbstractDict and have broadcast map the key-value pairs
broadcast deals with matching indices and performing some operation on values. So broadcast(f, dict) could be a bit like map(f, values(dict)) except the dictionary structure and it's indices are preserved in the output.

It seems to me that map deals with the iteration protocol and broadcast is more about matching indices and mapping those values. Using option 2 would give:

julia> length.(d)
Dict{Symbol,Int64} with 3 entries:
  :a => 5
  :b => 3
  :c => 7

IMO this kind of syntax could help a lot with manipulating dictionaries (one example is using AbstractDict for output of groupby-style operations, and then acting on the groups) while making no breaking changes to dictionary iteration or map.

To me it also dovetails with the "getindices julep" via this comment #24019 (comment) which basically suggests we can get multi-valued getindex between different indexable container types including dictionaries via the lowering transformation:

a.[inds] --> broadcast(i -> a[i], inds)

EDIT: Addendum (27/07/2019)

I forgot to mention a crucial use case of broadcasting over dictionaries where you have two or more dictionaries with the same keys and you want to match up the values based on the keys. For example dict1 .+ dict2. For our hash-based Dict, the user can't really predict the iteration order and they may actually differ between dict1 and dict2 even if the user can guarantee that they share the same set of keys (for example, dict2 might have been rehashed and dict1 not, but IMO this implementation detail shouldn't affect the semantics of the broadcast operation).

The text was updated successfully, but these errors were encountered:

mbauman · 2018-02-06T01:13:20Z

Duplicate of #18618. Do you mind if we try to keep all the discussion consolidated there?

nalimilan · 2018-02-06T12:31:10Z

The additional complexity compared with #18618 is that for Dicts we need to decide whether the broadcasted function should receive values or pairs. So maybe we can discuss that issue here, and keep the discussion regarding general iterables at #18618.

I suggest throwing an error when a Dict is passed to broadcast for now (just like I think we should throw an error for an iterator which doesn't implement the right methods), so that we can discuss the design and implement it in 1.x.

andyferris · 2018-02-06T23:47:06Z

I suggest throwing an error when a Dict is passed to broadcast for now

I agree wit that; we probably shouldn't ship with a known broken implementation of broadcast on AbstractDict that doesn't error out.

stevengj · 2018-02-07T15:46:37Z

I don't think it's "broken." It's not a bug, or unintentional. It was a conscious decision that broadcast is different from map: the former treats things as scalars by default, and the latter as iterators; the former "broadcasts" scalars to match array arguments, whereas the latter errors for mismatched dimensions.

Dictionaries, in particular, are not treated as "arrays" because (a) they don't have a shape/dimensionality, (b) they have an undefined iteration order, and (c) we intentionally had only a small whitelist of things that are array-like containers for broadcast.

Of course a "do what I mean" behavior would be better, but the point is that there are some times when you want one behavior and some times the other.

Making it an error just means you get neither behavior. It doesn't seem like an improvement.

stevengj · 2018-02-07T15:50:57Z

For an example of where the current behavior is useful, think of anything where you are broadcasting over the keys:

julia> d = Dict(3=>4, 5=>6, 7=>20)
Dict{Int64,Int64} with 3 entries:
  7 => 20
  3 => 4
  5 => 6

julia> get.(d, 1:10, 0)
10-element Array{Int64,1}:
  0
  0
  4
  0
  6
  0
 20
  0
  0
  0

Erroring on a Dict argument to broadcast would break this, and gain us nothing useful in return.

mbauman · 2018-02-07T16:40:44Z

Personally, I find it quite unexpected that map(f, x) is different from f.(x). And these issues suggest I'm not alone. In fact, I think making these two behave the same is a crucial element to any proposed solution to #18618. That seems to be where that issue is slowly converging.

Note that we've deprecated map(f, dict) mapping over pairs in #5794 with the intention that we can change it to map over values in the future, so have space to design map here, too. That'll change.

The big question is if we have:

map(sqrt, Dict(:a=>4,:b=>9)) # => Dict(:a=>2.0, :b=>3.0)

and

sqrt.(Dict(:a=>4,:b=>9))    # => Dict(:a=>2.0, :b=>3.0)
Dict(:a=>4, :b=>9) .+ [1,2] # => Dict(:a=>5, :b=>11) # EDIT: THIS MAKES NO SENSE, AND SHOULD ERROR

~~Then what would happen in the case like Dict(:a=>4, :b=>9) .+ [1 2]? We simply don't have a data structure (built in) for that. I suppose it could be an error or a matrix.~~ Edit: Never mind — we shouldn't allow this operation since the keys between the dict and array don't match.

stevengj · 2018-02-07T16:43:24Z

f.(x) should be either broadcast or map. Longstanding usage for things like .* suggest the former. And those two functions are definitely not the same. map(f, array, scalar) will presumably always throw an error, and map(f, rowvector, columnvector) will never produce a matrix, no?

Almost all of these things seem to boil down to "I want map, not broadcast."

mbauman · 2018-02-07T16:44:35Z

Absolutely these two functions are not the same. But I think they should collapse to the same behavior in the one-argument case.

stevengj · 2018-02-07T16:46:29Z

Even in the one-argument case, there are differences. e.g. map(f, string) maps over characters. But broadcast (rightly, I think), treats strings as "scalars", which is extremely useful for processing collections of strings.

I don't think it makes broadcast more intuitive if its behavior suddenly changes when you have only a single argument.

stevengj · 2018-02-07T16:48:58Z

So your proposal is that whether broadcast treats an argument as a collection or an atom should depend on the number of arguments?

mbauman · 2018-02-07T16:51:48Z

No, definitively not. My preference would be to systematically re-work broadcast along the lines of #18618 (comment), but I know you feel differently.

JeffBezanson · 2018-02-07T17:03:50Z

I know there are examples where it's useful for broadcast to treat strings or dicts (or anything else for that matter, including arrays) as scalars, but "seems useful" does not generalize. broadcast should have a simple general behavior that naturally collapses to map in the 1-argument case.

If we were only dealing with the word broadcast, I could be quite flexible. But with that dot, broadcast is occupying some valuable real estate and so it needs to play nice.

nalimilan · 2018-02-07T17:19:22Z

I don't think strings are the best example to decide how broadcast should behave, since they are really the only collection which is more often considered as a scalar than as an actual collection. Better decide the general behavior based on other containers (arrays, tuples, dicts, sets...), and only then ask whether strings should follow that convention or be an exception IMHO.

jekbradbury · 2018-02-07T18:34:21Z

My mental model of broadcasting is basically that broadcast(f, x_1, ..., x_n) should mean something semantically equivalent to:

For each x_i in x_1...x_n, what are its indices inds_i (also includes what's currently referred to as keys, and the implicit linear indices of types that satisfy the iteration interface but don't have indices). If an argument doesn't have either kind of indices, it is a scalar, which carries a single implicit index (maybe it's wrapped in a Ref to make this possible, and also to allow users to designate scalars).
Try to create a new set of indices inds such that the sequence of functions g_1...g_n is a natural, structure-preserving way to map each index in inds to a single index in each inds_i. If this isn't possible, throw an error.
Return the unique y whose value at each index ind in inds is f(x_1[g_1(ind)], ..., x_n[g_n(ind)]) and whose type is a predictable, minimally complicated type that can hold all such values.

In practice, such a function g_i would have to return a tuple of indices which is splatted into x_i's getindex in order to support Ref, whose getindex takes no arguments.

vtjnash · 2018-02-07T19:07:08Z

As @mbauman, such questions of the preferred model are being discussed in #18618. In particular, the question is whether to define that (a) broadcast operates on any iterable (so broadcasting values over keys would mean writing broadcast(f, values(dict)), to express the situation where iteration coincides with indexing), or whether (b) the broadcast operates on indexes (so it only makes sense to use it when the indexes form a cartesian shape, and broadcast(f, dict) is the same as f(dict)), or whether (c) it can also broadcast non-cartesian keys (so broadcasting a dict would require that all other elements be either scalars or have the same indexes, or be an nd-sparse table with more complex rules; and the alternate version are written as either broadcast(f, (dict,)) or broadcast(f, collect(p for p in dict))

Currently, we most nearly implement (b). This proposes we implement (c), which is mostly just a superset of (b). The issue #18618 proposes that we do (a) instead. Summary, in table form, of the possible combinations of (1-D) operations you might want to do and how to do them:

Operation	Current / b	a	c	d	Generator
as-value	broadcast(f, dict)	broadcast(f, (dict,))	broadcast(f, (dict,))	broadcast(f, (dict,))	f(dict)
value-map-keys	Dict(zip(keys(dict), broadcast(f, collect(values(dict)))))	broadcast(f, values(dict))	broadcast(f, dict)	broadcast(f, dict)	Dict(k => f(v) for (k, v) in dict)
value-map-order	broadcast(f, collect(values(dict)))	broadcast(f, (v for v in values(dict)))	broadcast(f, collect(values(dict))))	broadcast(f, (v for v in values(dict))))	Array(f(v) for (k, v) in dict)
pair-map-order	broadcast(f, collect(p for p in dict))	broadcast(f, (p for p in dict))	broadcast(f, collect(p for p in dict))	broadcast(f, (p for p in dict))	Array(f(p) for p in dict)
pair-map-keys	Dict(broadcast(f, collect(p for p in dict)))	broadcast(f, dict)	Dict(broadcast(f, collect(p for p in dict)))	Dict(broadcast(f, (p for p in dict)))	Dict(f(p) for p in dict)

edit: added column d as suggested below by jekbradbury, which is mostly a superset of d, but where objects without a shape (cartesian axes) and without keys (non-cartesian indices) would instead assume ordered iteration

edit: also added a column of what this broadcast operation would mean in terms of a Generator

jekbradbury · 2018-02-07T19:11:01Z

I guess I'm arguing that broadcast should operate on indices but fall back to iteration? So column c but without the need for collects?

vtjnash · 2018-02-07T19:21:59Z

Ah, good call. I added a column (d) and a note to describe it. It's mostly (c) without collect, but where value-map-order is the same as column (a)

vtjnash · 2018-02-07T19:27:29Z

And now also added one more column "Generator", to give an equivalent result in terms of a different expression construct, for more clarity

bkamins · 2019-07-26T15:22:22Z

Based on the discussion with @mbauman here is a list of possible test cases to consider (if they should error and if not what result should be produced):

(a=1,b=1) .+ [1 2]
(a=1,b=1) .+ (1,2)
(a=1,b=1) .+ [1,2]
(a=1,b=1) .+ (a=1,b=2)
(a=1,b=1) .+ (c=1,d=2)

EDIT: And I "vote up" for a higher priority for this decision as I get requests for DataFrames.jl functionalities that depend on this decision. Thank you!

andyferris · 2019-07-26T22:18:58Z

My opinion has become firmly that broadcast should match-up the values from keys (there is some trickery allowed here for CartesianIndex's) rather than worry about iteration order.

Therefore the 4th one makes sense to me, as does:

(a=1,b=1) .+ (b=2,a=1) --> (a=2,b=3)

The others seem odd to me (and could be errors), except perhaps that named tuples have two kinds of indexes, the Symbols and the Integers (but I personally don't see the latter as particularly useful when you can unwrap it into a tuple so trivially, yet their presence serves to complicate a bit the indexing API).

One of the reasons I think keys is really important and fundamental is really the OP - I want this to work for hash-based Dicts and to be able to write things like dict1 .+ dict2. Unfortunately, for all intents and purposes the relative ordering of the elements of two different hash-based Dicts is not deterministic from the perspective of the programmer (it might actually be deterministic underneath it all but there's no real way for users to align the ordering of two Dicts to be the same, and even if they tried that's not necessarily cheap).

@bkamins If you're thinking of DataFrames.jl I suspect you'll want to match column names together, right?

andyferris · 2019-07-26T22:28:57Z

(I've updated the OP with the two-Dict example)

bkamins · 2019-07-27T00:02:33Z

Currently in DataFrames.jl we have a rule that names must match exactly (order and values), except for push!, when we allow for a mixed order match on names and setindex! of AbstractDict on RHS (where clearly we can only look at values). So we have a slight inconsistency, which I want to get rid of.

If we had a decision in Base (via this PR) what are the rules of combining collections with names I would copy them to DataFrames.jl to be consistent.

andyferris · 2019-07-27T00:56:57Z

OK, thanks, I agree.

I have two more examples I think are important. The first is named tuples and dictionaries:

out = (a = 0, b = 1) .+ Dict(:a => 2, :b => 4)

If broadcast were iteration based then the set of values (Set(values(out))) could depend on details like whether we are running Julia on a 32-bit or 64-bit system.

Another example is mixing different types of AbstractDict:

sortdict .+ hashdict

I feel the only semantically useful definition here is that the broadcast operates by matching up the keys and basically ignoring the iteration order. I can't see anything else that is useful or helpful to users.

Basically, I'd like to be able to write generic code over AbstractDict, like function f(a::AbstractDict, b::AbstractDict); ... ; end. It would be nice to be able to use something like broadcast inside this generic function. Going further, I'd love to be able to duck type this function function f(a, b); ... ; end and let me mix-and-match different containers like dictionaries and named tuples, which really have very similar interfaces.

To reach that goal I believe the best way forward is that map obeys iteration such that first(map(f, a, b)) == f(first(a), first(b)) and broadcast obeys indexing such that broadcast(f, a, b)[i] = f(a[i], b[i]) (for single-dimensional containers; we would do something slightly different where i isa CartesianIndex). With these guarantees I can use map and broadcast in more generic code, the two functions operate have well-defined semantics, and they have somewhat orthogonal concerns (that tend to overlap for e.g. AbstractArray).

tkf · 2019-07-27T23:24:43Z

a list of possible test cases to consider (if they should error and if not what result should be produced):

(a=1,b=1) .+ [1 2]

(a=1,b=1) .+ (1,2)

(a=1,b=1) .+ [1,2]

(a=1,b=1) .+ (a=1,b=2)

(a=1,b=1) .+ (c=1,d=2)

How about the test case involving the lhs?

y = Dict(:b=>1)
y .= (a = 1,)

bkamins · 2019-07-28T06:38:29Z

I agree that LHS (broadcasting assignment) test cases are relevant, but they are more complex, so I have left them out intentionally for later.

The reason is that copyto! is usually overloaded by a custom type of LHS anyway, and the first thing we should settle on is what Style should have a Broadcasted object have on RHS (this is needed anyway to implement a proper copyto! method as this Broadcasted object is passed to it anyway).

But I fully agree that we should have a consistent broadcasting + broadcasting assignment system for types defined in Base.

Just for a reference - in DataFrames.jl we have settled that if we have a broadcasting assignment and we have a data frame on LHS then:

we use assignment using column and row number;
we require column names to match exactly including order if RHS defines column names (currently this is only data frames, but after this Julep I guess we will have more possibilities).

nalimilan · 2019-07-28T17:57:50Z

I agree it's clear that broadcasting on dicts should only care about keys and not about order. But the situation is more difficult for named tuples, since they are somewhat a hybrid of a tuple and of a dict. I suggest this:

broadcasting a named tuple with a tuple or vector should use the order of values
broadcasting a named tuple with a dict should use names. It should also probably return a named tuple, so that the order of the named tuple is preserved
broadcasting a named tuple with a named tuple should require both names and order to match

tkf · 2019-09-19T21:05:06Z

Before implementing everything in Base in one go, how about implementing this outside Base? More specifically, how about allowing to construct Broadcasted object but throw the not-implemented error in the materialization phase? Note that this is impossible at the moment because the error is thrown in broadcastable which is invoked before constructing Broadcasted. This would let people use something like

joinstyle.((a=1,b=1) .+ (a=1,b=2))

where joinstyle is a "magic" function that sets a custom broadcast style implemented in some package.

Importantly, this code would be forward-compatible even after something similar is implemented in Base. I think this let people try this in more serious code base than some toy experimental packages.

Extending this idea, it would be interesting to have different kind "join style":

innerjoin.((a=1,b=1) .+ (a=1,c=2))  # => (a=2,)
outerjoin.((a=1,b=1) .+ (a=1,c=2))  # => (a=2,b=1,c=2)  #???

(I discussed a similar idea in #32081 (comment))

andyferris · 2019-12-09T22:55:45Z

I thought it might be relevant to reference Dictionaries.jl which was recently released, and implements a system for dictionary broadcasting.

ararslan added julep Julia Enhancement Proposal collections Data structures holding multiple items, e.g. sets broadcast labels Feb 6, 2018

andyferris mentioned this issue Feb 9, 2018

Broadcast had one job (e.g. broadcasting over iterators and generator) #18618

Closed

andyferris mentioned this issue Feb 22, 2018

Implement broadcasting for AbstractDict and NamedTuple #26158

Open

andyferris mentioned this issue Mar 8, 2018

map over a KeySet unexpectedly returns a Set #26359

Closed

mbauman added broadcast Applying a function over a collection and removed broadcast labels Apr 24, 2018

mbauman mentioned this issue Jan 25, 2019

What's up with broadcasting over dictionaries and NamedTuple's is reserved #30836

Open

mbauman mentioned this issue May 20, 2019

Broadcasting as functor map #32081

Open

bkamins mentioned this issue Jul 28, 2019

Sync the behavior of push!, vcat and append! in DataFrames.jl with Base JuliaData/DataFrames.jl#1904

Closed

This was referenced Sep 16, 2019

Define interval arithmetics using broadcasting JuliaMath/IntervalSets.jl#55

Open

Broadcasting with Set returns an arbitrarily-ordered array #33282

Open

This was referenced May 30, 2020

Add Base.Experimental.ReservedStyle: Allow constructing Broadcasted of dictionaries and named tuples #36085

Open

Broadcasting dictionaries with unmatched keys gives confusing result andyferris/Dictionaries.jl#19

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Julep: `broadcast` with dictionaries #25904

Julep: `broadcast` with dictionaries #25904

andyferris commented Feb 6, 2018 •

edited

Loading

mbauman commented Feb 6, 2018

nalimilan commented Feb 6, 2018

andyferris commented Feb 6, 2018

stevengj commented Feb 7, 2018 •

edited

Loading

stevengj commented Feb 7, 2018 •

edited

Loading

mbauman commented Feb 7, 2018 •

edited

Loading

stevengj commented Feb 7, 2018 •

edited

Loading

mbauman commented Feb 7, 2018

stevengj commented Feb 7, 2018

stevengj commented Feb 7, 2018

mbauman commented Feb 7, 2018

JeffBezanson commented Feb 7, 2018

nalimilan commented Feb 7, 2018

jekbradbury commented Feb 7, 2018 •

edited

Loading

vtjnash commented Feb 7, 2018 •

edited

Loading

jekbradbury commented Feb 7, 2018

vtjnash commented Feb 7, 2018

vtjnash commented Feb 7, 2018

bkamins commented Jul 26, 2019 •

edited

Loading

andyferris commented Jul 26, 2019

andyferris commented Jul 26, 2019

bkamins commented Jul 27, 2019

andyferris commented Jul 27, 2019 •

edited

Loading

tkf commented Jul 27, 2019

bkamins commented Jul 28, 2019

nalimilan commented Jul 28, 2019

tkf commented Sep 19, 2019

andyferris commented Dec 9, 2019

Julep: broadcast with dictionaries #25904

Julep: broadcast with dictionaries #25904

Comments

andyferris commented Feb 6, 2018 • edited Loading

Current behavior

Proposal

EDIT: Addendum (27/07/2019)

mbauman commented Feb 6, 2018

nalimilan commented Feb 6, 2018

andyferris commented Feb 6, 2018

stevengj commented Feb 7, 2018 • edited Loading

stevengj commented Feb 7, 2018 • edited Loading

mbauman commented Feb 7, 2018 • edited Loading

stevengj commented Feb 7, 2018 • edited Loading

mbauman commented Feb 7, 2018

stevengj commented Feb 7, 2018

stevengj commented Feb 7, 2018

mbauman commented Feb 7, 2018

JeffBezanson commented Feb 7, 2018

nalimilan commented Feb 7, 2018

jekbradbury commented Feb 7, 2018 • edited Loading

vtjnash commented Feb 7, 2018 • edited Loading

jekbradbury commented Feb 7, 2018

vtjnash commented Feb 7, 2018

vtjnash commented Feb 7, 2018

bkamins commented Jul 26, 2019 • edited Loading

andyferris commented Jul 26, 2019

andyferris commented Jul 26, 2019

bkamins commented Jul 27, 2019

andyferris commented Jul 27, 2019 • edited Loading

tkf commented Jul 27, 2019

bkamins commented Jul 28, 2019

nalimilan commented Jul 28, 2019

tkf commented Sep 19, 2019

andyferris commented Dec 9, 2019

Julep: `broadcast` with dictionaries #25904

Julep: `broadcast` with dictionaries #25904

andyferris commented Feb 6, 2018 •

edited

Loading

stevengj commented Feb 7, 2018 •

edited

Loading

stevengj commented Feb 7, 2018 •

edited

Loading

mbauman commented Feb 7, 2018 •

edited

Loading

stevengj commented Feb 7, 2018 •

edited

Loading

jekbradbury commented Feb 7, 2018 •

edited

Loading

vtjnash commented Feb 7, 2018 •

edited

Loading

bkamins commented Jul 26, 2019 •

edited

Loading

andyferris commented Jul 27, 2019 •

edited

Loading