Removing ufunc-broadcasting across record fields #457

jpivarski · 2020-09-17T22:57:44Z

Currently, all ufuncs are broadcasted across all fields of a record:

>>> ak_array = ak.Array([{"x": 1, "y": 1.1}, {"x": 2, "y": 2.2}, {"x": 3, "y": 3.3}])
>>> ak_array
<Array [{x: 1, y: 1.1}, ... {x: 3, y: 3.3}] type='3 * {"x": int64, "y": float64}'>

>>> ak_array + 1
<Array [{x: 2, y: 2.1}, ... {x: 4, y: 4.3}] type='3 * {"x": int64, "y": float64}'>

This is causing some confusion because the fields of a record have qualitatively different meanings. Some are trigger booleans, some are momenta, some are ML-derived isolation variables, some are strings...

>>> ak.Array(["HAL"]) + 1                      # should this even work?
<Array [[73, 66, 77]] type='1 * var * uint8'>

>>> [chr(x) for x in (ak.Array(["HAL"]) + 1)[0].tolist()]
['I', 'B', 'M']

Furthermore, when @henryiii is writing vector, he has to distinguish between LorentzVector + LorentzVector accidentally working because they're Cartesian (but not preserving their Lorentzness) and getting the wrong answer because they're not Cartesian. Even though the + behavior is defined, due to the fact that they are records, he has to be sure to override every case.

I think there would be fewer surprises for both users and developers if broadcasting a ufunc through a record were an error (withe a nice error message). Custom behaviors for specialized records, like LorentzVectors, would still be possible to define, as they are now, but instead of replacing wrong behavior, they'd be replacing no behavior.

Note that NumPy does not define such an operation on structured arrays:

>>> np_array = np.array([(1, 1.1), (2, 2.2), (3, 3.3)], [("x", int), ("y", float)])
>>> np_array
array([(1, 1.1), (2, 2.2), (3, 3.3)], dtype=[('x', '<i8'), ('y', '<f8')])

>>> np_array + 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: invalid type promotion

Although it does work for Pandas:

>>> df = pd.DataFrame({"x": [1, 2, 3], "y": [1.1, 2.2, 3.3]})
>>> df
   x    y
0  1  1.1
1  2  2.2
2  3  3.3
>>> df + 1
   x    y
0  2  2.1
1  3  3.2
2  4  4.3

it is not our intention to generalize from Pandas, only NumPy.

This would also affect ufuncs that return booleans, like comparison operators. For these, the argument isn't as strong. Maybe we want

>>> ak_array > 1
<Array [{x: False, y: True}, ... y: True}] type='3 * {"x": bool, "y": bool}'>

to work, maybe we don't.

I'm considering removing ufuncs-through-records for all ufuncs, without affecting the custom ufunc behavior that can be assigned to any record with a name. (I'm not considering ufuncs-on-strings right now, though that's something to think about.) Does anyone have a strong argument about that?

(I suppose this needs a deprecation cycle, though it would be a little difficult getting a warning into the middle of the broadcast-and-apply. I'm tempted to remove it all at once, like a band-aid...)

The text was updated successfully, but these errors were encountered:

nsmith- · 2020-09-18T14:40:16Z

For the record I did propose also to only disable ufuncs on arrays where layout.purelist_parameter("__record__") == None but I learned that this name is used for other purposes than just defining the desired behavioral mixin (e.g. if multiple differently-named record is encountered in ArrayBuilder, then the resulting array will be a union of each record type rather than a single record type with optional subfields).

jpivarski · 2020-09-18T15:17:20Z

Also, I want to keep these rules of "what applies when" as simple as possible, to minimize surprising behavior.

jpivarski · 2020-10-30T22:42:01Z

I'm not going to change the policy on this: the behavior will stay the same. (Although NumPy ufuncs won't be allowed on strings, as in the above example: #504.)

nsmith- · 2020-10-30T22:47:27Z

@lgray FYI
Isn't the risk of user error extremely large here?

lgray · 2020-10-30T22:57:10Z

So yeah, my 2 cents here is that the probability of the user doing something they don't intend is incredibly large.

Nick ran into an interesting case where NanoEvents jets (which are four vectors beneath) can be element-wise multiplied by each other. This doesn't have a defined behavior from a physics standpoint and would also be very easy to accidentally do or do intentionally without knowing what it meant. In either case it can result in technically valid but wrong behavior when writing analysis - which is a much less preferred outcome than throwing an error.

jpivarski · 2020-10-30T23:27:18Z

The way it works now is more prone to error for the interpretation of records as objects. "Extremely" is subjective.

Heck, in crafting any kind of response to this, I'm only convincing myself that it needs to be done. It is the case that examples of this antifeature are actually in some of my slides, so "fixing" it will make my slides wrong.

But it is an important point. I think this issue could be implemented as an inhibitor that prevents broadcasting through any records unless a customization is defined. Note that that even means this:

>>> array = ak.Array([{"only": 1}, {"only": 2}, {"only": 3}])
>>> array + 10
<Array [{only: 11}, {only: 12}, {only: 13}] type='3 * {"only": int64}'>

If/when people complain that what used to work no longer works, I'll just have to explain that. (In the above example, they'd have to do array.only + 10.)

jpivarski · 2020-10-30T23:29:43Z

I'm calling it a bug because I've just defined the current behavior as wrong. (Even though I've presented it in talks.)

(A motivator for being short with these things is that the list of issues is a lot longer than I thought, and I have only until December 1 to make this awkward==1.0.0.)

lgray · 2020-10-31T17:01:55Z

Yeah - I agree that a good policy is to not allow broadcasting ufuncs through records unless that ufunc has been defined for that record type. This has the added benefit that people have to agree on the ufuncs/interface for whatever domain-specific set of records they want to use.

jpivarski · 2020-11-03T18:39:28Z

@henryiii I'm working on this now: broadcasting will be prevented from passing through records unless an overload has been implemented. For example, array_of_records * 10 won't multiply all fields of the records by 10 unless those records have explicitly defined a behavior for np.multiply. Thus, a Lorentz vector object in cylindrical won't accidentally get its phi member multiplied by 10, but the cartesian object would have to explicitly define this.

It is a change in interface, and I've even presented examples of this interface in public, so I have to Do The Right Thing and give it a deprecation cycle. It's removing a feature, which is good because changing a feature doesn't have a way to provide both old and new behaviors for an overlapping time period. What will happen is this: if any function attempts to broadcast through a record, it will raise a warning now, saying that this will become an error in version 1.0.0 on Dec 1, 2020, unless that broadcast is an overloaded ufunc.

Then I'm going to push this change into a released version of Awkward Array right away, so that about a month of releases will get the message. I'm sure there are users who update less frequently than that, and unfortunately, they won't get the message. Post-1.0.0, I'll have to start publishing regular release schedules, so that deprecation warnings can refer to both a version number and a date.

jpivarski · 2020-11-03T20:19:54Z

In Awkward Array 0.4.1, which I want to release later today, the warning that you get from

one = awkward1.Array([{"x": 1}, {"x": 2}, {"x": 3}])
two = awkward1.Array([{"x": 1.1}, {"x": 2.2}, {"x": 3.3}])
one + two

is

DeprecationWarning: In version 1.0.0 (target date: 2020-12-01), this will be an error.
(Set ak.deprecations_as_errors = True to get a stack trace now.)

ValueError: cannot broadcast: <class 'awkward1._ext.RecordArray'>, <class 'awkward1._ext.RecordArray'>

(https://github.com/scikit-hep/awkward-1.0/blob/0.4.1/src/awkward1/_util.py#L886)

But

def overload_add(left, right):
    return awkward1.Array({"x": left.x + right.x})

behavior = {}
behavior[numpy.add, "Overload", "Overload"] = overload_add

one = awkward1.Array([{"x": 1}, {"x": 2}, {"x": 3}], with_name="Overload", behavior=behavior)
two = awkward1.Array([{"x": 1.1}, {"x": 2.2}, {"x": 3.3}], with_name="Overload", behavior=behavior)

assert (one + two).tolist() == [{"x": 2.1}, {"x": 4.2}, {"x": 6.3}]

raises no warnings or errors.

lgray · 2020-11-03T20:23:53Z

Looks like the correct behavior to me. :-)

jpivarski added the policy Choice of behavior label Sep 17, 2020

jpivarski closed this as completed Oct 30, 2020

jpivarski reopened this Oct 30, 2020

jpivarski added bug The problem described is something that must be fixed and removed policy Choice of behavior labels Oct 30, 2020

jpivarski mentioned this issue Nov 3, 2020

Remove broadcasting over the fields of records #510

Merged

jpivarski linked a pull request Nov 3, 2020 that will close this issue

Remove broadcasting over the fields of records #510

Merged

jpivarski closed this as completed in #510 Nov 3, 2020

jpivarski mentioned this issue Mar 17, 2022

Prevent reducers like ak.sum on records (v1 and v2) #1375

Closed

jpivarski mentioned this issue Apr 22, 2022

ufuncs on records should not be allowed unless they are explicitly overridden. #1439

Closed

masonproffitt mentioned this issue Apr 6, 2023

Record support for == and != #2367

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Removing ufunc-broadcasting across record fields #457

Removing ufunc-broadcasting across record fields #457

jpivarski commented Sep 17, 2020

nsmith- commented Sep 18, 2020

jpivarski commented Sep 18, 2020

jpivarski commented Oct 30, 2020

nsmith- commented Oct 30, 2020

lgray commented Oct 30, 2020

jpivarski commented Oct 30, 2020

jpivarski commented Oct 30, 2020

lgray commented Oct 31, 2020

jpivarski commented Nov 3, 2020

jpivarski commented Nov 3, 2020

lgray commented Nov 3, 2020

Removing ufunc-broadcasting across record fields #457

Removing ufunc-broadcasting across record fields #457

Comments

jpivarski commented Sep 17, 2020

nsmith- commented Sep 18, 2020

jpivarski commented Sep 18, 2020

jpivarski commented Oct 30, 2020

nsmith- commented Oct 30, 2020

lgray commented Oct 30, 2020

jpivarski commented Oct 30, 2020

jpivarski commented Oct 30, 2020

lgray commented Oct 31, 2020

jpivarski commented Nov 3, 2020

jpivarski commented Nov 3, 2020

lgray commented Nov 3, 2020