Julep: broadcast with dictionaries #25904
Comments
Duplicate of #18618. Do you mind if we try to keep all the discussion consolidated there? |
The additional complexity compared with #18618 is that for I suggest throwing an error when a |
I agree with that; we probably shouldn't ship with a known broken implementation of |
I don't think it's "broken." It's not a bug, or unintentional. It was a conscious decision that Dictionaries, in particular, are not treated as "arrays" because (a) they don't have a shape/dimensionality, (b) they have an undefined iteration order, and (c) we intentionally had only a small whitelist of things that are array-like containers for
Of course a "do what I mean" behavior would be better, but the point is that there are some times when you want one behavior and some times the other. Making it an error just means you get neither behavior. It doesn't seem like an improvement. |
For an example of where the current behavior is useful, think of anything where you are broadcasting over the keys:
julia> d = Dict(3=>4, 5=>6, 7=>20)
Dict{Int64,Int64} with 3 entries:
7 => 20
3 => 4
5 => 6
julia> get.(d, 1:10, 0)
10-element Array{Int64,1}:
0
0
4
0
6
0
20
0
0
0
Erroring on a |
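The lookup above depends on the dictionary being treated as a single value while only the candidate keys are broadcast. A minimal sketch of stating that intent explicitly, assuming a Julia version in which wrapping an argument in `Ref` marks it as a broadcast scalar (the variable names are chosen here for illustration):

```julia
d = Dict(3 => 4, 5 => 6, 7 => 20)

# Wrap the dictionary in Ref so that broadcast treats it as one value
# and only iterates over the candidate keys 1:10.
looked_up = get.(Ref(d), 1:10, 0)

# An equivalent comprehension that does not rely on broadcast at all.
looked_up_comprehension = [get(d, k, 0) for k in 1:10]
```

Written this way, the result does not depend on whether dictionaries are classified as scalars or as containers by broadcast.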
Personally, I find it quite unexpected that
Note that we've deprecated
The big question is if we have:
map(sqrt, Dict(:a=>4,:b=>9)) # => Dict(:a=>2.0, :b=>3.0)
and
sqrt.(Dict(:a=>4,:b=>9)) # => Dict(:a=>2.0, :b=>3.0)
Dict(:a=>4, :b=>9) .+ [1,2] # => Dict(:a=>5, :b=>11) # EDIT: THIS MAKES NO SENSE, AND SHOULD ERROR
|
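For reference, the value-wise result shown in the comment above can be written without relying on either `map` or broadcast semantics for dictionaries. This is only a sketch: `mapvalues` is a hypothetical helper name, not a Base function, and returning a plain `Dict` is an assumption.

```julia
# Hypothetical helper (not part of Base): apply f to every value and keep the keys.
mapvalues(f, d::AbstractDict) = Dict(k => f(v) for (k, v) in d)

roots = mapvalues(sqrt, Dict(:a => 4, :b => 9))
```

Here `roots` pairs each original key with the square root of its value, which is the behavior the one-argument case is proposed to have.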
Almost all of these things seem to boil down to "I want |
Absolutely these two functions are not the same. But I think they should collapse to the same behavior in the one-argument case. |
Even in the one-argument case, there are differences. e.g. I don't think it makes |
So your proposal is that whether |
No, definitively not. My preference would be to systematically re-work broadcast along the lines of #18618 (comment), but I know you feel differently. |
I know there are examples where it's useful for broadcast to treat strings or dicts (or anything else for that matter, including arrays) as scalars, but "seems useful" does not generalize. If we were only dealing with the word |
I don't think strings are the best example to decide how |
My mental model of broadcasting is basically that
In practice, such a function |
As @mbauman said, such questions of the preferred model are being discussed in #18618. In particular, the question is whether to define that (a)
Currently, we most nearly implement (b). This proposes we implement (c), which is mostly just a superset of (b). The issue #18618 proposes that we do (a) instead.
Summary, in table form, of the possible combinations of (1-D) operations you might want to do and how to do them:
edit: added column (d) as suggested below by jekbradbury, which is mostly a superset of (c), but where objects without a shape (cartesian axes) and without keys (non-cartesian indices) would instead assume ordered iteration
edit: also added a column of what this broadcast operation would mean in terms of a Generator |
I guess I'm arguing that |
Ah, good call. I added a column (d) and a note to describe it. It's mostly (c) without |
And now also added one more column "Generator", to give an equivalent result in terms of a different expression construct, for more clarity |
Based on the discussion with @mbauman here is a list of possible test cases to consider (if they should error and if not what result should be produced):
EDIT: And I "vote up" for a higher priority for this decision as I get requests for DataFrames.jl functionalities that depend on this decision. Thank you! |
My opinion has become firmly that Therefore the 4th one makes sense to me, as does:
The others seem odd to me (and could be errors), except perhaps that named tuples have two kinds of indexes, the
One of the reasons I think
@bkamins If you're thinking of DataFrames.jl I suspect you'll want to match column names together, right? |
(I've updated the OP with the two- |
Currently in DataFrames.jl we have a rule that names must match exactly (order and values), except for
If we had a decision in Base (via this PR) on the rules for combining collections with names, I would copy them to DataFrames.jl to be consistent. |
OK, thanks, I agree. I have two more examples I think are important. The first is named tuples and dictionaries:
out = (a = 0, b = 1) .+ Dict(:a => 2, :b => 4)
If broadcast were iteration based then the set of values (
Another example is mixing different types of
sortdict .+ hashdict
I feel the only semantically useful definition here is that the
Basically, I'd like to be able to write generic code over
To reach that goal I believe the best way forward is that |
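A minimal sketch of what matching by key rather than by iteration order could look like for the named tuple and dictionary example above. The helper `add_by_key` is hypothetical, and both the choice of a `Dict` as the output container and the error on mismatched keys are assumptions made for illustration, not something the thread settles.

```julia
nt = (a = 0, b = 1)
d = Dict(:a => 2, :b => 4)

# Hypothetical helper: combine a NamedTuple and a dictionary by key,
# ignoring the iteration order of either container.
# Throws if the two key sets differ.
function add_by_key(nt::NamedTuple, d::AbstractDict)
    Set(keys(nt)) == Set(keys(d)) || throw(ArgumentError("keys do not match"))
    return Dict(k => nt[k] + d[k] for k in keys(nt))
end

out = add_by_key(nt, d)
```

Because every value is looked up through its key, the result is the same no matter how the hash-based dictionary happens to order its entries.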
How about the test case involving the lhs?
|
I agree that LHS (broadcasting assignment) test cases are relevant, but they are more complex, so I have intentionally left them out for later. The reason is that
But I fully agree that we should have a consistent broadcasting + broadcasting assignment system for types defined in Base.
Just for reference: in DataFrames.jl we have settled that if we have a broadcasting assignment with a data frame on the LHS, then:
|
I agree it's clear that broadcasting on dicts should only care about keys and not about order. But the situation is more difficult for named tuples, since they are somewhat a hybrid of a tuple and of a dict. I suggest this:
|
Before implementing everything in
joinstyle.((a=1,b=1) .+ (a=1,b=2))
where
Importantly, this code would be forward-compatible even after something similar is implemented in Base. I think this would let people try it in more serious code bases than some toy experimental packages.
Extending this idea, it would be interesting to have different kinds of "join style":
innerjoin.((a=1,b=1) .+ (a=1,c=2)) # => (a=2,)
outerjoin.((a=1,b=1) .+ (a=1,c=2)) # => (a=2,b=1,c=2) #???
(I discussed a similar idea in #32081 (comment)) |
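The two join styles above can be sketched as ordinary functions on named tuples. The names `innerjoin_add` and `outerjoin_add` are hypothetical and are not the `innerjoin.`/`outerjoin.` wrappers proposed in the comment. Passing a one-sided value through unchanged in the outer join is an assumption that reproduces the result the comment marks with `#???`.

```julia
# Hypothetical helpers illustrating two join styles for `+` on NamedTuples.
# Neither is part of Base; the names are made up for this sketch.

# Inner join: keep only the names present in both operands.
function innerjoin_add(x::NamedTuple, y::NamedTuple)
    names = Tuple(k for k in keys(x) if haskey(y, k))
    vals = map(k -> x[k] + y[k], names)
    return NamedTuple{names}(vals)
end

# Outer join: keep every name; a value present on one side only is passed through.
function outerjoin_add(x::NamedTuple, y::NamedTuple)
    names = (keys(x)..., (k for k in keys(y) if !haskey(x, k))...)
    pick(k) = haskey(x, k) && haskey(y, k) ? x[k] + y[k] :
              (haskey(x, k) ? x[k] : y[k])
    return NamedTuple{names}(map(pick, names))
end

inner = innerjoin_add((a = 1, b = 1), (a = 1, c = 2))
outer = outerjoin_add((a = 1, b = 1), (a = 1, c = 2))
```

With these definitions `inner` keeps only `a`, while `outer` keeps `a`, `b` and `c` in first-seen order.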
I thought it might be relevant to reference Dictionaries.jl which was recently released, and implements a system for dictionary broadcasting. |
Current behavior
I just learned that apparently broadcast doesn't work as I expected on dictionaries:
AFAICT it seems broadcast is working on dictionaries like they are scalars or something. Someone correct me if I'm wrong - but maybe we never added any broadcast methods for dictionaries?
Proposal
Make broadcast work with dictionaries. I can see two possibilities, of which I favor option 2:
1. broadcast for 1D indexable containers generally works like map, so we can extend this to AbstractDict and have broadcast map the key-value pairs
2. broadcast deals with matching indices and performing some operation on values. So broadcast(f, dict) could be a bit like map(f, values(dict)) except the dictionary structure and its indices are preserved in the output.
It seems to me that map deals with the iteration protocol and broadcast is more about matching indices and mapping those values. Using option 2 would give:
IMO this kind of syntax could help a lot with manipulating dictionaries (one example is using AbstractDict for output of groupby-style operations, and then acting on the groups) while making no breaking changes to dictionary iteration or map.
To me it also dovetails with the "getindices julep" via this comment #24019 (comment) which basically suggests we can get multi-valued getindex between different indexable container types including dictionaries via the lowering transformation:
EDIT: Addendum (27/07/2019)
I forgot to mention a crucial use case of broadcasting over dictionaries where you have two or more dictionaries with the same keys and you want to match up the values based on the keys. For example, dict1 .+ dict2. For our hash-based Dict, the user can't really predict the iteration order and they may actually differ between dict1 and dict2 even if the user can guarantee that they share the same set of keys (for example, dict2 might have been rehashed and dict1 not, but IMO this implementation detail shouldn't affect the semantics of the broadcast operation).