Specify dim tags for layers that create new axes #597

Closed
Zettelkasten opened this issue Aug 31, 2021 · 38 comments

Labels: potential-new-behavior (Discussions about RETURNN behaviour)

Zettelkasten commented Aug 31, 2021

E.g. CumConcatLayer from #589 allows (in fact requires) you to specify a dim_tag: DimensionTag for the output axis.

I propose something similar for all other layers that introduce new dims. Basically, this could be useful for all layers introducing a new dynamic seq length (see the list below).

This is already possible using an additional ReinterpretDataLayer (or something similar that lets you specify new dim tags; I don't think this exists yet, but it would be easy to add). Then you would do:

'x_': {'class': 'split_dims',  ...},
'x': {'class': 'reinterpret_data', 'from': 'x_', 'set_dim_tags': {
  'stag:split_dims0': DimensionTag(...), 'stag:split_dims1': DimensionTag(...)}}

Here, split_dims0 and so on are the tag descriptions that SplitDimsLayer internally sets.
Not only is this solution pretty verbose, but it also relies on these internal tag descriptions. As a user, you would need to look them up frequently, and we would need to make sure that they don't change internally.
I would like to avoid this and propose adding a dim_tags: List[DimensionTag] option directly to SplitDimsLayer.
The same idea applies to other layers.
This also promotes the best practice of always naming your axes properly.

Maybe, for some layers, we would even enforce the use of this. Especially for layers that really introduce an entirely new dim (like SplitDimsLayer). But also for others. When enforced, this would need a new behavior version (#508).

(Related to that, I would also allow specifying just a string with the description as the dim_tag, to make this easier to use. Calling the DimensionTag constructor is not that nice in Sisyphus configs.)
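
To sketch how this could look for SplitDimsLayer (the dim_tags option here is the proposal, not existing API; num_heads is assumed to be defined elsewhere in the config):

'x': {'class': 'split_dims', 'from': 'data', 'axis': 'F', 'dims': (num_heads, -1),
  'dim_tags': [
    DimensionTag(kind=DimensionTag.Types.Spatial, description='num-heads', dimension=num_heads),
    DimensionTag(kind=DimensionTag.Types.Feature, description='head-feat')]},

With the string shorthand mentioned above, the last option would simply become 'dim_tags': ['num-heads', 'head-feat'].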


List of layers which need this

Also see: Operations on dimension tags

Layers (potentially) introducing a new dynamic seq length


Layers involving some linear transformation on the features

Layers with fixed output shape

Others


albertz commented Sep 6, 2021

Even for simple layers like LinearLayer, ConvLayer, RecLayer, etc., where some transformation happens, a new dim tag is introduced for the new feature dim.


albertz commented Sep 22, 2021

We should be a bit more systematic here by specifying the list of layers which should be extended/modified by such a feature. This should be edited directly into the main issue description. I will start on this, but please extend it. If you are unsure about some layer, just ask in a comment.

albertz added the potential-new-behavior label Sep 22, 2021

albertz commented Sep 22, 2021

Btw, about Sisyphus, or in general simpler ways to specify new dim tags, and also on the dim tag description:

I'm not sure on any of these.

About Sisyphus: Whatever technical limitation there might be in Sisyphus, this could be fixed. E.g. if the issue is that DimensionTag is not pickle-able, or the __repr__ does not work for serialization, we can surely fix that. So I don't really count that as an argument.

On using DimensionTag.description for identification: Currently we rely on that. We even use it for DimensionTag.is_equal. And we also use it for the stag:... syntax when specifying axes in the config, via Data.get_axes_by_tag_name/Data.get_axis_by_tag_name.

Note my recently added comment on that in DimensionTag.is_equal:

# In principle, we would want to check for identity (self is other).
# We currently use the description because the identity would not be the same
# in case of template construction where a dim tag is once created for a template layer,
# and then later again for the real layer.

I don't like relying on a string as a means of identification. This can lead to very strange bugs in rare cases which would probably be hard to debug.

Also, the automatic description might not always be unique (sometimes it uses the absolute full layer name, sometimes only the base layer name).

Also, the automatic description might even change. It is not a good idea to rely on this. This is also what you described initially ("internal tag descriptions").

By automatic description, I mean the descriptions generated by all layers which create new dim tags. So once the changes proposed in this issue are implemented, it would all be explicit, and this would not really be a problem anymore.

But anyway, I want to get away from relying on the description for identification, especially if there is not really a big need for it.

See also #634 for some related discussion.

On simpler ways to write that in the config: I was also thinking about this. This might be a valid point.

I think with the proposed new way to define the network via returnn-common, this probably already becomes much simpler.

Maybe it is also just about the DimensionTag constructor. Writing DimensionTag(kind=DimensionTag.Types.Spatial, ...) might be just too long? Maybe we could simply have a small alias for that, like Spatial(...) or new_spatial(...) which would do the same thing? But is this really helpful, or does it obfuscate the code more? We should not optimize the config just for easy writing but also for clarity and easy reading (even more so than for easy writing).
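
A minimal sketch of such an alias (purely hypothetical, just wrapping the existing constructor):

def new_spatial(description, dimension=None):
  """Hypothetical shorthand for creating a spatial dim tag."""
  return DimensionTag(kind=DimensionTag.Types.Spatial, description=description, dimension=dimension)

num_heads_dim = new_spatial("num-heads", 8)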


albertz commented Sep 23, 2021

Related is also to enforce dim tags to be unique (#632).


albertz commented Sep 23, 2021

One alternative to this is to use the DimensionTag.derived_from_tag logic more, together with all dim tags being unique (#632).

E.g. when you used ConvLayer or SplitDimsLayer or so on some input [B,T,D] which used dim tag time_dim = DimensionTag(...), we could allow specifying axis=Derived(time_dim) in other layers. When all tags are unique, this should also be unambiguous in most cases (and if not, just throw an error).
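
As a rough sketch (Derived is the proposed matcher, not existing API; the strided conv creates a new time dim tag derived from time_dim):

time_dim = DimensionTag(kind=DimensionTag.Types.Spatial, description="time")
network = {
  'conv': {'class': 'conv', 'from': 'data', 'filter_size': (3,), 'strides': (2,),
    'padding': 'same', 'n_out': 32},
  # select the downsampled time axis of 'conv' via its relation to the original time dim:
  'pooled': {'class': 'reduce', 'mode': 'max', 'axis': Derived(time_dim), 'from': 'conv'},
}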


albertz commented Sep 23, 2021

Note on your initial example using ReinterpretDataLayer with set_dim_tags: the new dim tag would, by intention, be different from the old dim tag. If you then only use layer x afterwards, there is no real problem. However, if you mix x_ and x, this is maybe not what was intended. Also, by intention, the new dim tag does not have derived_from_tag set to the old dim tag, because we never want these to be mixed up (that is the purpose of assigning/creating new dim tags via set_dim_tags).

What you actually want is the same_dim_tags_as logic, which you can do e.g. via CopyLayer, like:

'x': {'class': 'copy', 'from': 'x_', 'out_type': {'same_dim_tags_as': {
  Derived(time_dim): DimensionTag(...)}}}

This would use declare_same_as on the dim tag.


albertz commented Sep 23, 2021

Btw, this is not really either-or. We can have both such Derived(...) logic for matching dim tags and also specify new dim tags explicitly for new layers.

I think we need to discuss each layer (or most layers) individually. E.g. SplitLayer would be kind of specific. And others as well. So we should also open separate issues for each layer. Maybe you start by doing so only for the layer which you care about for now. I don't think we can do all of them at once anyway, so we should do it layer-by-layer.


albertz commented Sep 23, 2021

Some further ideas along the line of Derived(...) for specifying axes (via Data.get_axis_from_description):

  • Derived(tag: DimensionTag): matches derived tags. This basically sets derived_matches=True for DimensionTag.is_equal.
  • SameStaticDim(dim: int): matches dims which have the same static dim
  • NewSpatialFromLayer(layer_name: str): We can keep track (automatically) which layer created new dim tags. This would match a newly created (spatial) dim tag.
  • FeatureFromLayer(layer_name: str): Gets the feature dim from the layer's output.

Note that we would enforce in any case that the matching is unique. Once it is not unique, it would throw an error.
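
A usage sketch for the last two matchers (all of these are proposed helpers, not existing API, and the layer names are made up):

'att_weights': {'class': 'softmax_over_spatial', 'from': 'energy',
  # match the new spatial dim that the 'cum_concat' layer created:
  'axis': NewSpatialFromLayer('cum_concat')},
'context': {'class': 'reduce', 'mode': 'sum', 'from': 'weighted_value',
  # match the feature dim of the 'value' layer output:
  'axes': FeatureFromLayer('value')},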

For the layers we would probably use the get_layer logic in transform_config_dict. Or we might just assume that they must exist at this point (via other deps). This is still a bit unclear.


albertz commented Sep 23, 2021

Another idea I had (also for returnn-common) was that the user would explicitly specify all output dim tags. So basically replacing what the user anyway writes as comment (like # [B, T, NumHeads, Dim] or so) but as code, as an option to any layer (so LayerBase would handle it), like out_dims={batch_dim, time_dim, num_heads_dim, feature_dim} (a set, without order by intention).

This could not just be used for verification, but we can actually infer some of the options of the layer (e.g. n_out for LinearLayer, etc), and it also can be used to specify the dim tags.
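
For example, a sketch assuming out_dims as a new generic layer option (batch_dim and time_dim are assumed to be defined elsewhere in the config):

feat_dim = DimensionTag(kind=DimensionTag.Types.Feature, description="model-dim", dimension=512)
'linear': {'class': 'linear', 'from': 'data', 'activation': None,
  # n_out=512 could be inferred from feat_dim; the remaining dims would only be verified:
  'out_dims': {batch_dim, time_dim, feat_dim}},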

There are some cases where the order is relevant though, such as SplitDimsLayer, where it matters in what order we do the splitting. And there are cases where multiple new spatial dims can be created, e.g. ConvLayer with 2D or 3D conv, where the assignment would be ambiguous. Also, even for ConvLayer with 1D conv, it might be ambiguous what is actually the output feature (channel) dim and what is the output spatial (maybe static) dim. So in those cases, we would additionally need to make it explicit with special options anyway, and out_dims alone would not be enough.

Or we extend out_dims somehow to make it also unambiguous in those cases. But I'm not sure. E.g. for ConvLayer, we could assume that there must be exactly one feature dim in the output, which would make that unambiguous. But still not the spatial dims. We also need to fix ConvLayer filter_size and other things anyway, to not be a list (which depends on the order of spatial axes) but a dict mapping tag -> filter size (dict[DimensionTag, int]). But this would all be layer-specific again. I'm not sure; maybe out_dims should always stay the same concept and type (i.e. just a set[DimensionTag]).

For ConvLayer specifically, and some other layers, also LinearLayer, we could also allow some dim mapping, like in_to_out_dims: {spatial_in: spatial_out, feature_in: feature_out}.

Zettelkasten (Member Author) commented:

One alternative to this is to use the DimensionTag.derived_from_tag logic more, together with all dim tags being unique (#632).

E.g. when you used ConvLayer or SplitDimsLayer or so on some input [B,T,D] which used dim tag time_dim = DimensionTag(...), we could allow specifying axis=Derived(time_dim) in other layers. When all tags are unique, this should also be unambiguous in most cases (and if not, just throw an error).

For SplitDimsLayer (because you explicitly mentioned it here), this derived logic would almost never be useful and should fail because there is more than one derived dim (for a non-trivial split), right?
But for many other layers, this makes a lot of sense.
I'm not sure about the other options you proposed apart from Derived; would those really be helpful?

SameStaticDim(dim: int): matches dims which have the same static dim

E.g. this one seems dangerous, because static dims might accidentally match up when the user chooses different hyperparameters.


albertz commented Sep 23, 2021

Also, yet another related thing: it might be helpful to also explicitly specify the input dims, or part of them. This is standard in PyTorch, e.g. torch.nn.Linear has in_features, out_features.

This would be partly covered by such in_to_out_dims proposed above.


albertz commented Sep 23, 2021

One alternative to this is to use the DimensionTag.derived_from_tag logic more, together with all dim tags being unique (#632).
E.g. when you used ConvLayer or SplitDimsLayer or so on some input [B,T,D] which used dim tag time_dim = DimensionTag(...), we could allow specifying axis=Derived(time_dim) in other layers. When all tags are unique, this should also be unambiguous in most cases (and if not, just throw an error).

For SplitDimsLayer (because you explicitly mentioned it here), this derived logic would almost never be useful and should fail because there is more than one derived dim (for a non-trivial split), right?

No, in those cases there can only be one dynamic dim tag (if at all), and all others must be static. And then it can be derived. So it should always be possible to derive it for SplitDimsLayer. Only not the order.

I'm not sure about the other options you proposed apart from Derived; would those really be helpful?

Currently people often use things like static:0, static:-1, etc., which depend on the order of the static dims. SameStaticDim would instead rely on the dimension, which is already much better.
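
For example (SameStaticDim is the proposed helper; the reduce layer is just for illustration):

# today, depends on the order of the static dims:
'red': {'class': 'reduce', 'mode': 'mean', 'axes': 'static:0', 'from': 'x'},
# proposed, matches whichever axis has static dim 64 (and errors if that is not unique):
'red': {'class': 'reduce', 'mode': 'mean', 'axes': SameStaticDim(64), 'from': 'x'},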

NewSpatialFromLayer would actually make your proposal obsolete: instead of introducing new_dim_tag, you could use NewSpatialFromLayer.

But as said, this doesn't have to be either-or, we can do both, and also discuss them independently.

SameStaticDim(dim: int): matches dims which have the same static dim

E.g. this one seems dangerous, because static dims might accidentally match up when the user chooses different hyperparameters.

But then it would throw an error, so it cannot accidentally match multiple. And also this probably would not happen so often.

Zettelkasten (Member Author) commented:

I think we need to discuss each layer (or most layers) individually. E.g. SplitLayer would be kind of specific. And others as well. So we should also open separate issues for each layer. Maybe you start by doing so only for the layer which you care about for now. I don't think we can do all of them at once anyway, so we should do it layer-by-layer.

Yes, true. For me, for most layers, the Derived logic you proposed would be enough, I think. When I opened this issue, I mostly had in mind layers that don't just map one axis to a single new dim; I think that's basically MergeDimsLayer and SplitDimsLayer.
We can open an issue for those layers separately.

But then there are many layers for which this might be useful, e.g. I would like to use this for RepeatLayer. Or maybe LinearLayer, like you mentioned.

For ConvLayer specifically, and some other layers, also LinearLayer, we could also allow some dim mapping, like in_to_out_dims: {spatial_in: spatial_out, feature_in: feature_out}.

This would work then. So you would e.g. say

{'class': 'linear', 'from': ..., 'axis': in_dim, 'n_out': 64, 'in_to_out_dims': {in_dim: out_dim}}

(I know there is no axis argument yet, but that's just because it is implicitly always F. Also, LinearLayer is kind of an exception here; most layers, e.g. RepeatLayer, require such an argument, so probably we should add it to LinearLayer at some point too.)

But it's not so nice that you have to specify the input axis twice here. And in the case of LinearLayer, you also specify n_out twice.
So from that view, I'd like

{'class': 'linear', 'from': ..., 'axis': in_dim, 'out_axis': DimensionTag(DimensionTag.Types.Spatial, dimension=64)}

better.
So this would be a layer-specific solution, rather than your more general in_to_out_dims logic, which is more like a short notation for adding a ReinterpretDataLayer afterwards. So it would be more work to implement, and less generic. I'm not sure.


albertz commented Sep 23, 2021

In the case of LinearLayer, in_to_out_dims is enough, and you do not need axis, out_axis, or n_out.



albertz commented Sep 23, 2021

Btw, LinearLayer also has one special case: it can operate on sparse inputs (sparse=True). Note that I was thinking about adding some sparse_dim: DimensionTag to Data, which would extend the current Data.dim. axis or in_to_out_dims could also handle that.
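
A sketch of how the sparse case could look (sparse_dim and in_to_out_dims are both proposals here, and the dim tags are purely illustrative):

vocab_dim = DimensionTag(kind=DimensionTag.Types.Feature, description="vocab", dimension=10025)
embed_dim = DimensionTag(kind=DimensionTag.Types.Feature, description="embed", dimension=512)
# 'data' is sparse with sparse_dim=vocab_dim; the linear transform maps vocab_dim -> embed_dim:
'embed': {'class': 'linear', 'from': 'data', 'activation': None,
  'in_to_out_dims': {vocab_dim: embed_dim}},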


Zettelkasten (Member Author) commented:

LinearLayer would do one transformation, so in_to_out_dims can only have one entry.

I don't see the point in this generic solution then.
What I thought you meant is that in_to_out_dims could, for any layer, be used to rename any input axes to any output axes, also axes the layer doesn't operate on, just like you would with ReinterpretDataLayer and set_dim_tags.

If in the case of LinearLayer, this dict must always contain only one entry, why not make it explicit and just use two parameters instead, one input axis, and one output new_dim_tag: DimensionTag argument or so?

Having a parameter in_to_out_dims with a much more general interface (a dict that can have multiple entries) that it wouldn't really accept after all just seems confusing to me.


albertz commented Sep 23, 2021

LinearLayer would do one transformation, so in_to_out_dims can only have one entry.

I don't see the point in this generic solution then.
What I thought you meant is that in_to_out_dims could, for any layer, be used to rename any input axes to any output axes, also axes the layer doesn't operate on, just like you would with ReinterpretDataLayer and set_dim_tags.

No. We should not mix up multiple operations, like the linear transform + assigning new dim tags. ReinterpretDataLayer with set_dim_tags is for assigning new dim tags. LinearLayer should do just a linear transformation. The in_to_out_dims was not intended to be used to assign new dim tags, or rename dim tags, but instead to define the linear transformation and define the input and output dims.

If in the case of LinearLayer, this dict must always contain only one entry, why not make it explicit and just use two parameters instead, one input axis, and one output new_dim_tag: DimensionTag argument or so?

Having a parameter in_to_out_dims with a much more general interface (a dict that can have multiple entries) that it wouldn't really accept after all just seems confusing to me.

I thought it might be nice to have such a general interface for any of the layers which create new axes (which we discuss here in this issue). Maybe it does not fit for all layers, but for many, it would. E.g. for ConvLayer with 2D conv, you would have 3 entries: 2 for the two spatial dims, and then one for the linear feature transformation. Similarly for PoolLayer, ResizeLayer, PadLayer, etc.

It would not fit for e.g. SplitDimsLayer or MergeDimsLayer.
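
A sketch for the 2D conv case (in_to_out_dims is the proposed option; all dim tags are assumed to be defined elsewhere):

'conv': {'class': 'conv', 'from': 'data', 'filter_size': (3, 3), 'padding': 'valid',
  # two spatial mappings plus the feature (channel) transformation, i.e. 3 entries:
  'in_to_out_dims': {height_in: height_out, width_in: width_out, chan_in: chan_out}},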

Zettelkasten (Member Author) commented:

I thought it might be nice to have such a general interface for any of the layers which create new axes (which we discuss here in this issue). Maybe it does not fit for all layers, but for many, it would. E.g. for ConvLayer with 2D conv, you would have 3 entries: 2 for the two spatial dims, and then one for the linear feature transformation. Similarly for PoolLayer, ResizeLayer, PadLayer, etc.

It would not fit for e.g. SplitDimsLayer or MergeDimsLayer.

Ah okay okay, I understand now, sorry.
For all layers where this dict could have more than one axis, this makes a lot of sense. I see your point that it might make sense to always use a Dict[DimensionTag, DimensionTag] to have a consistent interface. But as you said, for e.g. SplitDimsLayer it doesn't work out. So I think I would prefer an interface that makes more sense for each specific layer. I also think that LinearLayer is used much more often than the other layers you mentioned.


Zettelkasten commented Sep 23, 2021

Another idea I had (also for returnn-common) was that the user would explicitly specify all output dim tags. So basically replacing what the user anyway writes as comment (like # [B, T, NumHeads, Dim] or so) but as code, as an option to any layer (so LayerBase would handle it), like out_dims={batch_dim, time_dim, num_heads_dim, feature_dim} (a set, without order by intention).

I really like this idea. Maybe one could also use Python type annotations for this; that would make the syntax nicer (not sure if that's so robust though, or whether you can access them from code easily).

This (out_dims={batch_dim, time_dim, feature_dim}) could not just be used for verification, but we can actually infer some of the options of the layer (e.g. n_out for LinearLayer, etc), and it also can be used to specify the dim tags.

What do you mean by this last point here? How would e.g. LinearLayer then figure out the transformed axis? By asserting that all other axes are equal, and exactly one changed? Or maybe you would need to set derived_from_tag of the new axis to the old one, or so?
How does that generalize to layers which "replace" more than one axis?
This could be cool, but maybe also very magic / does not generalize well to more complicated layers.


albertz commented Sep 23, 2021

So you say that because in_to_out_dims does not fit all cases here, it does not make sense to use it consistently for those cases where it does fit? I think some consistency would be nice in any case. But maybe you just propose to make LinearLayer a special case. Although I don't find it too unnatural to also have in_to_out_dims for that case.

Btw, in_to_out_dims could also be extended to allow dict[tuple[DimensionTag,...]|DimensionTag, tuple[DimensionTag,...]|DimensionTag], so then it would also fit SplitDimsLayer (in_to_out_dims={merged_dim: (dim1, dim2, dim3)}) and MergeDimsLayer (in_to_out_dims={(dim1, dim2, dim3): merged_dim}).

Zettelkasten (Member Author) commented:

So you say that because in_to_out_dims does not fit all cases here, it does not make sense to use it consistently for those cases where it does fit? I think some consistency would be nice in any case. But maybe you just propose to make LinearLayer a special case. Although I don't find it too unnatural to also have in_to_out_dims for that case.

Btw, in_to_out_dims could also be extended to allow dict[tuple[DimensionTag,...]|DimensionTag, tuple[DimensionTag,...]|DimensionTag], so then it would also fit SplitDimsLayer (in_to_out_dims={merged_dim: (dim1, dim2, dim3)}) and MergeDimsLayer (in_to_out_dims={(dim1, dim2, dim3): merged_dim}).

For n -> n mappings, of course a dict is the way to go.
It's just that I find dicts that only contain a single element a bit funny.
For me, if n -> 1 or 1 -> n were special cases, then 1 -> 1 would also have been one.
But with your proposed extension, it's not too bad. Then none of those cases would be special.
In the case of MergeDimsLayer and SplitDimsLayer we could even extend it to really work on splitting/merging multiple sets of axes at once. Okay, I see!


albertz commented Sep 23, 2021

Another idea I had (also for returnn-common) was that the user would explicitly specify all output dim tags. So basically replacing what the user anyway writes as comment (like # [B, T, NumHeads, Dim] or so) but as code, as an option to any layer (so LayerBase would handle it), like out_dims={batch_dim, time_dim, num_heads_dim, feature_dim} (a set, without order by intention).

I really like this idea. Maybe one could also use Python type annotations for this; that would make the syntax nicer (not sure if that's so robust though, or whether you can access them from code easily).

So this would only be a returnn-common feature then? Because RETURNN itself would still use the same net / layer dict as before, right?

And you mean e.g. like deepmind/tensor_annotations or PEP 646?

Btw, related is also tsalib, xarray, TensorFlow Mesh.

This (out_dims={batch_dim, time_dim, feature_dim}) could not just be used for verification, but we can actually infer some of the options of the layer (e.g. n_out for LinearLayer, etc), and it also can be used to specify the dim tags.

What do you mean by this last point here? How would e.g. LinearLayer then figure out the transformed axis? By asserting that all other axes are equal, and exactly one changed? Or maybe you would need to set derived_from_tag of the new axis to the old one, or so?

It would expect that there is exactly one feature dim (of kind feature).

How does that generalize to layers which "replace" more than one axis?

It does not generalize to every layer of course. But does it have to?

This could be cool, but maybe also very magic / does not generalize well to more complicated layers.

I did not propose to use it for any cases where it would be ambiguous or require any sort of magic.

But anyway, this inferring of options (e.g. n_out for LinearLayer) is a separate thing. It is not an argument that we should not introduce out_dims in general.


albertz commented Sep 23, 2021

Btw, in_to_out_dims could also be extended to allow dict[tuple[DimensionTag,...]|DimensionTag, tuple[DimensionTag,...]|DimensionTag], so then it would also fit SplitDimsLayer (in_to_out_dims={merged_dim: (dim1, dim2, dim3)}) and MergeDimsLayer (in_to_out_dims={(dim1, dim2, dim3): merged_dim}).

...
But with your proposed extension, it's not too bad. Then none of those cases would be special.
In the case of MergeDimsLayer and SplitDimsLayer we could even extend it to really work on splitting/merging multiple sets of axes at once. Okay, I see!

I mean, it was just a proposal, to also have a uniform and consistent interface.

What are the remaining cases (layers) where this would not work? And do they invalidate the proposal?

E.g. VariableLayer and ConstantLayer are such cases, but there out_dims can be used to infer the output shape.
Although, should out_dims be a list/tuple for these cases? I'm not sure if a set is a good idea there, i.e. leaving the order arbitrary.
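
For example, a sketch where the order matters (out_dims as a tuple here is just to illustrate the question; the dim tags are assumed to be defined elsewhere):

'pos_emb': {'class': 'variable', 'out_dims': (time_dim, feat_dim)},  # parameter of shape [T, F]; the order defines the shape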

RangeFromLengthLayer is also a bit special. Or CumConcatLayer. Or ReinterpretDataLayer with set_dim_tags.

Those are examples where in_to_out_dims is somewhat unclear, so we would not use it.


albertz commented Dec 1, 2021

Ok, all layers should be covered now. So I'm closing this.
