Design/Handling of dimension tags #17

Closed · 7 tasks done
albertz opened this issue Aug 5, 2021 · 31 comments
@albertz (Member) commented Aug 5, 2021

Like batch dim, spatial dims (with dynamic lengths), or static dims (named, or also unnamed).

Dim (earlier DimensionTag) in RETURNN. Directly use that, or wrap it somehow?

Should this (the batch dim) include beam information, or be separate from this?

Relevant for all layers which define some shape or dimension (e.g. Const, Variable).

Should this be enforced, i.e. no simple int allowed in n_out or so but always a Dim object?
And maybe better use out_dim instead of n_out (consistent with rwth-i6/returnn#597).
Edit: It was decided to make nn.Dim mandatory, and use out_dim instead of n_out.

Very related is this issue on the RETURNN side about explicit dim tags: rwth-i6/returnn#597

Also related: whether we want unique dim tags (#48, rwth-i6/returnn#632).


This issue covers multiple aspects:


@albertz (Member, Author) commented Nov 5, 2021

We should clarify this because it is very fundamental to how all our code will look.

The main question is to decide on these two options:

  • No special handling here in returnn-common for dim tags; basically just as optional as in RETURNN.
  • Make them mandatory. This applies whenever axis, axes or new_dim is involved, but also for out_dim in Linear, LSTM, Conv and some others. Edit: we decided to do this.

I think when we require them, we should also require them to be unique.

I tend to prefer having them mandatory. This might make some parts a bit more verbose at first glance, but it should resolve any possible ambiguity on axes, and make the code very explicit and clear.

We might however introduce some shorter aliases, or maybe restructure things a bit. For example:

  • Dim = DimensionTag. Edit: Done. Actually the other way around: it was renamed to Dim, and DimensionTag is kept as an alias for older code (mostly RETURNN, not relevant for returnn_common).
  • Have one global B or Batch or batch_dim for Dim(kind=DimensionTag.Types.Batch). Edit: We have a global batch_dim now.
  • F(dim, name=None, **kwargs) for Dim(kind=DimensionTag.Types.Feature, dimension=dim, description=name, **kwargs). Edit: We have FeatureDim(...) now.
  • S(name=None, **kwargs) for Dim(kind=DimensionTag.Types.Spatial, description=name, **kwargs). Edit: We have SpatialDim(...) now.

So in your config, it could look like:

input_dim = nn.FeatureDim("mfcc", 40)
classes_dim = nn.FeatureDim("cart", 5000)
time_dim = nn.SpatialDim("time")

extern_data = {
  "data": {"dim_tags": [nn.batch_dim, time_dim, input_dim]},
  "classes": {"dim_tags": [nn.batch_dim, time_dim], "sparse_dim": classes_dim}
}

class Model(nn.Module):
  def __init__(self, in_dim: nn.Dim, out_dim: nn.Dim):
    super().__init__()
    self.linear = nn.Linear(out_dim=out_dim, in_dim=in_dim)
  
  def forward(self, input: nn.LayerRef) -> nn.Layer:
    return self.linear(input)

model = Model(in_dim=input_dim, out_dim=classes_dim)

...

(Just a draft.)

albertz added a commit that referenced this issue on Nov 6, 2021 (#52):

  Although this is still work in progress:
  - The test is not really testing anything.
  - Not using consistent dim tags (#17)
  - Missing is normal attention
  - Missing is auto-regressive self-attention

@albertz (Member, Author) commented Nov 8, 2021

Would this imply that "F" or "T" are also not allowed anymore?

@JackTemaki (comment minimized)

@albertz (Member, Author) commented Nov 8, 2021

> Would this imply that "F" or "T" are also not allowed anymore?

> I do not have enough knowledge to understand the consequences of this.

Sorry, maybe I was confusing here. This is a design decision we should take. We can either still allow "F" and "T" or not. Very related is also a similar question on RETURNN side: rwth-i6/returnn#586

Or maybe we should just change the behavior (on the RETURNN side, or here in returnn-common): whenever "F" or "T" is unambiguous (according to some sensible rules), it is allowed, otherwise not. The rules could be:

  • F == static dimension (only valid if there is exactly one static dimension)
  • T == dynamic dimension (only valid if there is exactly one dynamic dimension; excluding batch dim of course)
  • B == batch dim (only valid if there is exactly one batch dim)
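
A minimal sketch of such a rule (a hypothetical helper, not the actual RETURNN get_axis_from_description logic; it assumes Dim-like objects with is_batch_dim() and is_dynamic() predicates):

from typing import Sequence

def resolve_axis_shortcut(shortcut: str, dim_tags: Sequence) -> int:
  """Resolve "B"/"T"/"F" to a unique axis index, or raise if ambiguous/missing."""
  if shortcut == "B":
    candidates = [i for i, d in enumerate(dim_tags) if d.is_batch_dim()]
  elif shortcut == "T":
    candidates = [i for i, d in enumerate(dim_tags)
                  if d.is_dynamic() and not d.is_batch_dim()]
  elif shortcut == "F":
    candidates = [i for i, d in enumerate(dim_tags)
                  if not d.is_dynamic() and not d.is_batch_dim()]
  else:
    raise ValueError(f"unknown axis shortcut {shortcut!r}")
  if len(candidates) != 1:
    raise ValueError(f"{shortcut!r} is ambiguous or missing here, candidates: {candidates}")
  return candidates[0]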

@Zettelkasten (Member) commented:

> • All dims which are configurable and supposed to be used by the module for some output dimension would be passed as arguments as well (out_dim, out_spatial_dim, whatever).

Also, we would need to think about how out_shape should work.
Usually, you have some set of common dims (like the batch dim, but maybe more) that the module will not operate on.
We would also need to pass a set of these to each layer in some way. Or maybe we could also supply some explicit out_shape to the module, and then infer it from there.

@albertz (Member, Author) commented Dec 17, 2021

> • All dims which are configurable and supposed to be used by the module for some output dimension would be passed as arguments as well (out_dim, out_spatial_dim, whatever).

> Also, we would need to think about how out_shape should work. Usually, you have some set of common dims (like the batch dim, but maybe more) that the module will not operate on. We would also need to pass a set of these to each layer in some way. Or maybe we could also supply some explicit out_shape to the module, and then infer it from there.

Ah yes, good point. Maybe for __call__, there could be another in_shape argument, and then the module can derive it and add such checks on intermediate outputs. Or maybe better: via #47, we have the shape of any inputs already anyway, so there is no need for an extra in_shape. The user could have checked in advance that the inputs are as expected.

There could be an out_shape argument for the module __call__. Although some modules might return multiple things, and then this is a bit ambiguous. Also, we cannot really make sure that all modules have this argument, so this would be more of a convention than a strict rule. But this is ok.

Or, when we do #47, the user could explicitly do it afterwards like:

out = model(..., out_dim=feature_dim)
out.assert_shape({batch_dim, time_dim, feature_dim})

Maybe that is cleaner?

@Zettelkasten (Member) commented:

> Or maybe better: via #47, we have the shape of any inputs already anyway, so there is no need for an extra in_shape. The user could have checked in advance that the inputs are as expected.

That would be even better, yes. Then every module checking shapes (which ideally every module should do, I think) would have something like this on top:

common_dims = inp.get_shape() - {in_dim}  # all dims the module does not operate on
out = ...
out.assert_shape(common_dims | {out_dim})

> Or, when we do #47, the user could explicitly do it afterwards like:
>
> out = model(..., out_dim=feature_dim)
> out.assert_shape({batch_dim, time_dim, feature_dim})
>
> Maybe that is cleaner?

This wouldn't work for intermediate states, right? (Also, when nesting modules, the outer module would also need to know the common dims to assert the output of the inner module.)
I think more complicated modules should check their output type themselves, too. Just like you would specify the return type of a function, this is very important documentation.

@albertz (Member, Author) commented Dec 17, 2021

The out_shape or assert_shape is anyway mostly for the user. A user might not care what happens inside the model, only that he gets some specific output in the end.

However, if you are the author of the model, or you want any user to be able to easily read the code of your model, then sure, you would also have out_shape or assert_shape usages inside the model, for any intermediate outputs, and maybe also for the final output.

We should be careful to not make it too verbose or too complicated. When the code looks complex at first glance, this is also bad. out_shape might be a bit better w.r.t. this aspect because it avoids a separate assert_shape call. But we could support both anyway.

Another aspect: probably many modules will be lazy w.r.t. the input dim, just like it is standard in RETURNN and other frameworks. E.g. the user would specify out_dim but not in_dim. So in_dim is only known on the first call to the module, and then automatically inferred from the input (in.in_dim or so, which refers to the feature dim). I think enforcing in_dim at module creation would be too verbose in many cases, so we would have to support such a lazy in_dim, and also use the feature-dim logic.
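
For example, the intended usage would be something like this (a sketch, matching the Linear draft in the next comment; x stands for some input layer ref):

# out_dim is given explicitly; in_dim is left out and inferred lazily
linear = nn.Linear(out_dim=nn.FeatureDim("hidden", 512))
y = linear(x)  # on this first call, in_dim is taken from x's feature dim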

@albertz (Member, Author) commented Dec 19, 2021

So, I currently have this Linear module implementation:

class Linear(nn.Module):
  """
  Linear transformation.
  """

  def __init__(self, out_dim: nn.Dim, *, in_dim: Optional[nn.Dim] = None):
    super().__init__()
    self.out_dim = out_dim
    self.out_dim_inner = out_dim
    self.in_dim = in_dim
    self.weight = None  # type: Optional[nn.Parameter]
    self.bias = None  # type: Optional[nn.Parameter]
    if in_dim:
      self._lazy_init(in_dim)

  def _lazy_init(self, in_dim: nn.Dim):
    if self.in_dim:
      assert self.in_dim == in_dim
    else:
      self.in_dim = in_dim
      if in_dim == self.out_dim:
        # special case in_dim == out_dim: use a distinct copy as the inner out dim,
        # otherwise the two dim tags of the parameter would be indistinguishable
        self.out_dim_inner = self.out_dim.copy(same_as_self=False, description=f"{self}:out-dim-inner")
      self.weight = nn.Parameter((self.in_dim, self.out_dim_inner))
      self.bias = nn.Parameter((self.out_dim_inner,))

  @nn.scoped
  def __call__(self, source: nn.LayerRef) -> nn.Layer:
    self._lazy_init(source.dim)
    out = nn.dot(source, self.weight, reduce=self.in_dim) + self.bias
    if self.out_dim_inner != self.out_dim:
      # map the internal out dim back to the out_dim requested by the user
      out = nn.reinterpret_data(out, set_dim_tags={self.out_dim_inner: self.out_dim})
    return out

You see that I needed to introduce out_dim_inner here. The problem occurs when we have in_dim == out_dim: the dim tags in the parameter are then not distinguishable, and reduce=in_dim in dot is ambiguous.

I'm not really sure about the best way to solve this. I'm not really happy with this solution now.

@Zettelkasten any suggestions or ideas?

@albertz (Member, Author) commented Dec 19, 2021

While further thinking about this, some thoughts and options:

  • I think no matter what we do, we should not make an exception that the order of axes matters here now in this case. The order should never matter.
  • So, in dim_tags = (in_dim, out_dim) with out_dim = in_dim, we need some way to distinguish those dims. But this cannot be just the order.
  • When nothing is done, I think the current behavior of RETURNN is actually weird or wrong anyway. It looks like it reduces both axes and then fails with some other strange error? There should be some further testing and fixing on the RETURNN side to catch this with a better error message. Already get_axis_from_description should fail due to the ambiguity.
  • One option to distinguish them is in my example code above. I.e. in the case in_dim == out_dim, it does out_dim_ = out_dim.copy(same_as_self=False, ...) and then using shape (in_dim, out_dim_).
    • I don't really like this too much because other code might expect the shape (in_dim, out_dim). Or all other code must also have this special logic for this special case.
    • We probably have this issue also in other instances, e.g. for LSTM or others, when there is some matrix multiplication involved. It should be simple to always make use of the same logic for this case.
    • It's too complicated, too ugly.
  • When reduce=in_dim is searching for matching dims, we need some way that it uniquely finds the right axis.
  • Maybe we can have some matching_priority flag or so, set on dim tags. In get_axis_from_description, when a dim tag matches multiple existing dim tags, but one of the matching tags has a higher matching priority, then it would still be allowed, and it would use that one. So in the example, in_dim == out_dim (but in_dim is not out_dim), and the in_dim used for the parameter would be a copy with higher matching priority. The code would maybe look like:
    weight = nn.Parameter((in_dim.copy(matching_priority=1), out_dim))
    Edit: I implemented this now here: rwth-i6/returnn#871 (Solution for ambiguous dim tags)

albertz added a commit to rwth-i6/returnn that referenced this issue on Dec 19, 2021
albertz added a commit to rwth-i6/returnn that referenced this issue on Dec 20, 2021:

  Via: rwth-i6/returnn_common#17 (comment)
  Also: Data.get_axes_from_description, only unique matching for dim tags

albertz added a commit to rwth-i6/returnn that referenced this issue on Dec 21, 2021
albertz added a commit that referenced this issue on Dec 22, 2021
@albertz (Member, Author) commented Jan 3, 2022

Another thing: I wonder whether passing out_spatial_dim as an argument is maybe unintuitive. In many cases, it is a new dimension. So (as the name also implies) it is an output of the function or module. The current code often looks like this:

out_spatial_dim = nn.SpatialDim("whatever")
out = self.conv(..., out_spatial_dim=out_spatial_dim)

This is for dynamic sizes. For static dims, it is even more complicated, as you need to calculate the dim by hand (although maybe you could argue it is a good thing to have it explicit).

Maybe the module should instead just create such a dim itself, and then return it?

One question is how it should be returned. Just Tuple[nn.Layer, nn.Dim]? Or a namedtuple? Or something else?

For example, it would then look like:

out, out_spatial_dim = self.conv(...)

I think this is nicer.

The case where it is known beforehand (or expected to match some other existing dim) is maybe valid, but the user could easily still do that by calling declare_same_as afterwards.
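
For example, a minimal sketch (declare_same_as exists on RETURNN dim tags; existing_spatial_dim is just a placeholder name here):

out, out_spatial_dim = self.conv(...)
# if the new spatial dim must match an already existing dim:
out_spatial_dim.declare_same_as(existing_spatial_dim)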

out_dim is different though. out_dim would stay an argument, because it is usually a feature dim and can usually be arbitrarily defined by the user, like for Linear or Conv. Although, for the cases where it is not arbitrary, like merge_dims for example, maybe it should be returned as well?

Or consider the ConformerConvSubsample.__call__:

@nn.scoped
def __call__(self, inp: nn.LayerRef, *, in_spatial_dim: nn.Dim, out_spatial_dim: nn.Dim) -> nn.LayerRef:
  """forward"""
  in_spatial_dims = [in_spatial_dim, inp.feature_dim]
  in_dim = nn.FeatureDim("dummy-input-feature-dim", 1)
  x = nn.expand_dim(inp, dim=in_dim)
  for i, conv_layer in enumerate(self.conv_layers):
    out_spatial_dims = [nn.SpatialDim(f"conv-{i}-1"), nn.SpatialDim(f"conv-{i}-2")]
    x = conv_layer(x, in_dim=in_dim, in_spatial_dims=in_spatial_dims, out_spatial_dims=out_spatial_dims)
    in_spatial_dims = out_spatial_dims
    in_dim = conv_layer.out_dim
    x = self.activation(x)
    if self.pool_sizes and i < len(self.pool_sizes):
      x = nn.pool(
        x, in_dim=in_dim, in_spatial_dims=in_spatial_dims,
        pool_size=self.pool_sizes[i], padding='same', mode='max')
    if self.dropout:
      x = nn.dropout(x, axis=in_dim, dropout=self.dropout)
  out = nn.merge_dims(x, axes=in_spatial_dims, out_dim=out_spatial_dim)
  return out

With the suggestion, it would look like:

@nn.scoped
def __call__(self, inp: nn.LayerRef, *, in_spatial_dim: nn.Dim) -> Tuple[nn.LayerRef, nn.Dim]:
  """forward"""
  in_spatial_dims = [in_spatial_dim, inp.feature_dim]
  in_dim = nn.FeatureDim("dummy-input-feature-dim", 1)
  x = nn.expand_dim(inp, dim=in_dim)
  for i, conv_layer in enumerate(self.conv_layers):
    x, in_spatial_dims = conv_layer(x, in_dim=in_dim, in_spatial_dims=in_spatial_dims)
    in_dim = conv_layer.out_dim
    x = self.activation(x)
    if self.pool_sizes and i < len(self.pool_sizes):
      x = nn.pool(
        x, in_dim=in_dim, in_spatial_dims=in_spatial_dims,
        pool_size=self.pool_sizes[i], padding='same', mode='max')
    if self.dropout:
      x = nn.dropout(x, axis=in_dim, dropout=self.dropout)
  out, out_spatial_dim = nn.merge_dims(x, axes=in_spatial_dims)
  return out, out_spatial_dim

@Zettelkasten opinions?

Edit: This is implemented like this now.

albertz added a commit that referenced this issue on Jan 5, 2022:

  This was wrapping reinterpret_data with set_axes.
  There should be no layer in returnn-common which would depend on this, so this should never be needed.
  If there are cases where this is needed, we probably could and should fix them such that it is not needed.
  (#17)
@albertz (Member, Author) commented Jan 7, 2022

The last suggestion is implemented now: functions and modules return a potentially new dim tag, and the argument (e.g. out_spatial_dim) is optional; when it is None, a new dim is auto-created.
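
As a minimal sketch of that convention (a hypothetical function, not the real returnn-common signature; x stands for some input layer ref):

def some_op(source: nn.LayerRef, *, out_spatial_dim: Optional[nn.Dim] = None):
  if out_spatial_dim is None:
    # auto-create a new dim; the caller gets it back in the return value
    out_spatial_dim = nn.SpatialDim("some-op:out-spatial")
  out = ...  # construct the actual layer here, using out_spatial_dim
  return out, out_spatial_dim

out, out_spatial_dim = some_op(x)  # no need to create the dim beforehand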

@albertz (Member, Author) commented Jan 7, 2022

Remaining is the issue with out_dim_inner. There is still the suggested solution here: rwth-i6/returnn#871, which also has some further discussion.
In any case, out_dim_inner should be removed; the parameter itself should be enough.

@albertz (Member, Author) commented Jan 19, 2022

Another external data point is Mesh TensorFlow (MTF), and the gpt-neo implementation. Examples:

  • dim_kv = mtf.Dimension("features_per_head", params["n_embd"] // params["n_head"])
    mtfparams = mtf.transformer.attention.attention_params_simple(
        x.mesh,
        io_dim=dim_embd,
        kv_dim=dim_kv,
        heads_dim=dim_heads,
        variable_dtype=variable_dtype
    )
    (via)
  • k = mtf.replace_dimensions(k, k.shape[1], memory_length_dim)
    v = mtf.replace_dimensions(v, v.shape[1], memory_length_dim)
    
    a = mtf_transformer.attention.attention(
        q, k, v,
        memory_length_dim=memory_length_dim,
        key_dim=dim_kv,
        value_dim=dim_kv,
        bias=broadcasted_bias,
        dropout_rate=attn_dropout_rate
    )
  • def causal_linear_attention(q, k, v, eps = 1e-6):
        batch_dim, seq_dim, head_dim, dim_out = (v.shape[0], v.shape[1], v.shape[2], v.shape[3])
        q = mtf.rename_dimension(q, "features_per_head", "features_per_head_in")
        k = mtf.rename_dimension(k, "features_per_head", "features_per_head_in")
    
        dim_in = k.shape[-1]
    
        q = mtf.softmax(q, dim_in)
        k = mtf.exp(k)
    
        cumulative_k = mtf.cumsum(k, seq_dim) + eps
        D_inv = 1. / mtf.einsum([q, cumulative_k], output_shape=[batch_dim, seq_dim, head_dim])
    
        context = mtf.einsum([k, v], output_shape=[batch_dim, seq_dim, head_dim, dim_in, dim_out])
        cumulative_context = mtf.cumsum(context, seq_dim)
    
        attn = mtf.einsum([q, cumulative_context, D_inv], output_shape=[batch_dim, seq_dim, head_dim, dim_out])
        return attn
    (via)
  • def einsum(xs, output_shape=None, reduced_dims=None, name=None):
      ...
    (here)

Note that an mtf.Dimension is very simple. It's just (def):

Dimension = collections.namedtuple("Dimension", ["name", "size"])

where name: str and size: int.

MTF also requires unique dimensions in a shape (see here).


albertz added a commit to rwth-i6/returnn that referenced this issue on Feb 11, 2022:

  Solution for ambiguous dim tags, e.g. in VariableLayer for a square matrix.
  Via: rwth-i6/returnn_common#17 (comment)
@albertz (Member, Author) commented Feb 11, 2022

Dim.match_priority support (rwth-i6/returnn#871) is merged now, due to the lack of any better solution for the ambiguous dim tags.

Now we need to adapt the code here accordingly (get rid of out_dim_inner etc.).
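
A sketch of how Linear could then look without out_dim_inner (assuming the Dim.copy(match_priority=...) API from rwth-i6/returnn#871):

  # inside Linear, replacing the _lazy_init from above:
  def _lazy_init(self, in_dim: nn.Dim):
    if self.in_dim:
      assert self.in_dim == in_dim
    else:
      self.in_dim = in_dim
      # the in_dim copy gets a higher match priority, so reduce=self.in_dim in dot
      # uniquely picks this axis even when in_dim == out_dim
      self.weight = nn.Parameter((in_dim.copy(match_priority=1), self.out_dim))
      self.bias = nn.Parameter((self.out_dim,))

The reinterpret_data step in __call__ would then not be needed anymore.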

@albertz (Member, Author) commented Feb 11, 2022

This is all implemented now.
