Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Solution for ambiguous dim tags #871

Merged
merged 3 commits into from
Feb 11, 2022
Merged

Solution for ambiguous dim tags #871

merged 3 commits into from
Feb 11, 2022

Conversation

albertz
Copy link
Member

@albertz albertz commented Dec 19, 2021

Consider the example (also in the test) to use DotLayer using a square matrix, so its two dimensions are the same.

from returnn.tf.util.data import batch_dim
time_dim = SpatialDim("time")
feat_dim = FeatureDim("feature", dimension=3)
config = Config({
  "extern_data": {
    "data": {"dim_tags": [batch_dim, time_dim, feat_dim]},
    "matrix_ambiguous": {"dim_tags": [feat_dim, feat_dim], "available_for_inference": True},
  },
})

Here, {"class": "dot", "from": ["data:data", "data:matrix_ambiguous"], "reduce": feat_dim} does not work because it is ambiguous.

Introducing the Dim.match_priority solves this problem, by having:

"matrix_non_ambiguous": {
  "dim_tags": [feat_dim.copy(match_priority=1), feat_dim], "available_for_inference": True},

This was suggested here: rwth-i6/returnn_common#17 (comment)

While maybe not the nicest solution, I don't really see any better solution at the moment.

This PR is also a test if anything else breaks.

@albertz albertz requested a review from a team as a code owner December 19, 2021 23:34
@albertz albertz requested a review from Zettelkasten December 19, 2021 23:34
@albertz albertz force-pushed the albert-dot-square-matrix branch from 9120587 to b1f9634 Compare December 20, 2021 08:21
@Zettelkasten
Copy link
Member

Technically this PR looks fine, but I'm not sure if this is really a clean solution in general.
In particular it is not so clean that you have to specify a matching priority when you define your weight variable, which is not really specific to the variable itself but rather specific to the operation you use it as argument to afterwards (in your case, dot layer).
For example, in our MT configs we often use the transposed target embeddings as projection matrix. In this case I need my dot operation to reduce the axis exactly on the other axis then before, i.e. the one with the lower matching priority.
Sure, there are solutions to this (e.g. to reinterpret the variable output to change the matching priorities), but I personally think that this is an indicator that this concept in general is not that complete and perhaps more confusing/error prone.

I think the code you posted for your Linear module implementation is not too bad. I like about it that the variable has unique dim tags, I think that makes it pretty clear what is going on. I think I would prefer it to the somewhat less meaningful matching prios. Note that it would probably also be nice if it exposed the inner out dim tag (so that some other layer could e.g. transpose the variable).

I do agree though that it is not that nice and that it would be very annoying if RETURNN configs would often need to do such extra handling. I don't really know a better solution now either. Maybe just adding utility functions to make the code you already had with the inner dim tag less verbose would help. Maybe you have new thoughts since you opened this, too?

How often do you think does one need these extra checks? Only for layers introducing variables like Linear, probably also for a few others?
Would this also be an issue for RETURNN code that the user usually writes? Or could the user just call what you have in your Linear module (or some related, some what elementary building block)?

@albertz
Copy link
Member Author

albertz commented Dec 21, 2021

In particular it is not so clean that you have to specify a matching priority when you define your weight variable, which is not really specific to the variable itself but rather specific to the operation you use it as argument to afterwards (in your case, dot layer).

It would work for all operations which do some reduce, e.g. also gather, and that will always be the main purpose of this variable in this case.

For example, in our MT configs we often use the transposed target embeddings as projection matrix. In this case I need my dot operation to reduce the axis exactly on the other axis then before, i.e. the one with the lower matching priority.

There is no problem on this with the matching priority because that will be two different dims anyway - the target classes dim and some intermediate hidden dim, which would be almost always something different.

Also, when dim orders do not matter, there is nothing like a transpose.

I think the code you posted for your Linear module implementation is not too bad. I like about it that the variable has unique dim tags, I think that makes it pretty clear what is going on. I think I would prefer it to the somewhat less meaningful matching prios. Note that it would probably also be nice if it exposed the inner out dim tag (so that some other layer could e.g. transpose the variable).

The code has several issues:

  • It is too complicated.
  • It is not straightforward. I only stumbled upon it because I got it wrong first. This is a bad sign. Ideally it should be designed such that the obvious and straightforward solution works. (Although you could argue, the matching priority is not straightforward either.)
  • Now using it for the transposed linear operation as you mentioned before would actually not work anymore, or only with further complicated code.
  • I was also not sure, should I create this dummy dim always or only when in_dim == out_dim? In both cases, it's not so nice. Creating it always is maybe easier to reason about because there will never be two cases, only one case, which always makes it simpler and better.
  • Some other code which creates the variable or any tensor which would be used inplace of the variable would also need to have this extra logic. When assigning linear.weights = ..., it would additionally also need to assign linear.out_dim_inner = ....

Maybe you have new thoughts since you opened this, too?

Not really.

How often do you think does one need these extra checks? Only for layers introducing variables like Linear, probably also for a few others? Would this also be an issue for RETURNN code that the user usually writes? Or could the user just call what you have in your Linear module (or some related, some what elementary building block)?

Depends maybe on your research focus. But when you are writing new models or model parts (deriving new classes from nn.Module), you need to create variables all the time. Well, in many cases, you might reuse existing modules to define the new model, but often enough it could be also more custom. It's hard to tell.

Now when you want to do e.g. param sharing between two different modules (e.g. Linear and LSTM), this is easy and straightforward to do in principle (just linear.weights = lstm.weights), but it would be complicated if you also need to deal with sth like out_dims_inner.

I think this case that we have in_dim == out_dim is not so rare actually.

The potential cases where such matching priorities could go wrong are probably very rare. Although, when this happens, this will be also very annoying to detect even as a bug (because no error will occur, it will just do sth) and to debug.

@albertz albertz force-pushed the albert-dot-square-matrix branch from 2e01ae0 to c1435a5 Compare December 21, 2021 12:22
@albertz
Copy link
Member Author

albertz commented Jan 5, 2022

@Zettelkasten Have you further thought on this? I really don't like the solution with out_dims_inner. It should be possible to write generic code which would copy params around in whatever way. Sth like out_dims_inner would make this impossible because you never know if there is some xy_inner_temp_dummy_dim for some weight which also needs to be copied alongside.

While I totally see your concerns with the proposed solution here, I currently don't see a better solution. Any suggestions are very welcome.

@Zettelkasten
Copy link
Member

Hm no, not really, I thought about this some time, but didn't come up with something else really.

Now when you want to do e.g. param sharing between two different modules (e.g. Linear and LSTM), this is easy and straightforward to do in principle (just linear.weights = lstm.weights), but it would be complicated if you also need to deal with sth like out_dims_inner.

For this concrete example with LSTM and Linear param sharing its not a problem btw, because LSTM has no square params.
But I agree this is a problem in general when you write your own modules and re-use parameters.

@albertz
Copy link
Member Author

albertz commented Feb 2, 2022

This is still an open issue. We need some solution here.

@JackTemaki
Copy link
Collaborator

I do not see any better solution here, in the end it is always the question: what does the user actually want? So for the regular network construction it is fine to throw an error and let the user set the match_priority, and I guess for returnn_common this will be set automatically in many cases. But the docstring and the error message definitely need to be more accurate. I will add some suggestions.

returnn/tf/util/data.py Outdated Show resolved Hide resolved
returnn/tf/util/data.py Outdated Show resolved Hide resolved
@albertz
Copy link
Member Author

albertz commented Feb 11, 2022

Ok, due to the lack of any better solution, I'm merging this now.

@albertz albertz merged commit a10e882 into master Feb 11, 2022
@albertz albertz deleted the albert-dot-square-matrix branch February 11, 2022 11:01
albertz added a commit to rwth-i6/returnn_common that referenced this pull request Feb 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants