using sum() causes dtype mismatch #135

Closed
JackTemaki opened this issue Apr 27, 2022 · 5 comments

@JackTemaki
Contributor

JackTemaki commented Apr 27, 2022

Writing e.g.

    l2loss = sum(nn.reduce(param ** 2.0, axis=list(param.shape), mode="sum") for param in net.parameters())

causes a dtype mismatch when running the network (surprisingly not during construction with returnn_common):

  File "/work/asr4/rossenbach/sisyphus_work_folders/tts_asr_2021_work/i6_core/tools/git/CloneGitRepositoryJob.UbdRLURNoRpi/output/repository/returnn/tf/util/data.py", line 3614, in Data.copy_compatible_to
    line: assert not check_dtype or self.dtype == data.dtype
    locals:
      check_dtype = <local> True
      self = <local> Data{'reduce_output', []}
      self.dtype = <local> 'float32', len = 7
      data = <local> Data{'add_output', [], dtype='int32'}
      data.dtype = <local> 'int32', len = 5

The error on the Returnn side is justified, because the generated layers are:

    "reduce": {
        "class": "reduce",
        "from": "pow",
        "mode": "sum",
        "axis": [transformer_default_model_dim, _3_transformer_default_model_dim],
        "out_shape": {},
    },
    "constant_0": {"class": "constant", "value": 0},
    "add": {
        "class": "combine",
        "from": ["constant_0", "reduce"],
        "kind": "add",
        "out_shape": {},
    },

As the value is 0 and not 0.0, Returnn assumes int32 instead of float32 (float32 is the correct dtype of the reduce output). The 0 is probably added by the first internal addition inside sum(), which calls:

  def __add__(self, other: Union[RawTensorTypes, Tensor]) -> Tensor:
    from ._generated_layers import _combine
    return _combine([self, nn.convert_to_tensor(other)], kind="add", name="add")

I am not sure how to deal with the correct dtypes here, as when calling sum() we may not know exactly which dtype we want to have.

One thing I could imagine is that we infer the dtype from self if other is a zero, and extend nn.convert_to_tensor to have a dtype parameter.
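
For reference, Python's built-in sum() starts from the int 0 unless an explicit start value is passed, which is where the int constant comes from. A possible user-side workaround (just a sketch, assuming at least one parameter exists) would avoid the implicit 0 entirely:

  import functools
  import operator

  # Python's sum() is roughly: total = 0; for x in xs: total = total + x,
  # so the first addition is int 0 + Tensor. functools.reduce with operator.add
  # never introduces that initial int constant.
  l2loss = functools.reduce(
    operator.add,
    (nn.reduce(param ** 2.0, axis=list(param.shape), mode="sum")
     for param in net.parameters()),
  )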

@albertz
Member

albertz commented Apr 27, 2022

Yea, sum([p1, p2]) is equivalent to 0 + p1 + p2.
In PyTorch, int + float is allowed and automatically upcasted.
In TF, int + float is not allowed (but when one part is not a tensor, e.g. int, it will automatically be casted to the dtype of the other tensor).
In RETURNN, in the CombineLayer, I think we don't allow this (I assume this is your error; you have not posted the full traceback).

So, obviously we cannot change how sum is defined, as this is defined by Python itself.

We could add an optimization on returnn-common side which automatically reduces 0 + x or x + 0 to x, no matter the type of x.

We could extend RETURNN CombineLayer to allow int + float.

Not sure which way is best or which you would prefer. I think the returnn-common optimization would be nice anyway and would already solve this.
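
A minimal sketch of what that optimization could look like inside Tensor.__add__ (not the actual implementation; 0 + x would need the same shortcut in __radd__):

  def __add__(self, other: Union[RawTensorTypes, Tensor]) -> Tensor:
    from ._generated_layers import _combine
    # Sketch: short-circuit the neutral element, so no constant layer and no
    # combine layer get created and the dtype of the scalar never matters.
    if isinstance(other, (int, float)) and other == 0:
      return self
    return _combine([self, nn.convert_to_tensor(other)], kind="add", name="add")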

@albertz
Member

albertz commented Apr 27, 2022

we may not know exactly which dtype we want to have.

Note that such upcasting (promotion) rules are well defined in PyTorch/JAX/Numpy.

https://pytorch.org/docs/stable/tensor_attributes.html#torch.torch.dtype
https://jax.readthedocs.io/en/latest/type_promotion.html
https://jax.readthedocs.io/en/latest/design_notes/type_promotion.html

I think TF in its normal behavior actually does not have such promotion rules.

https://www.tensorflow.org/api_docs/python/tf/math/add
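
A small illustration of the difference (assuming current eager-mode behavior of both frameworks):

  import tensorflow as tf
  import torch

  # PyTorch promotes int32 + float32 to float32:
  (torch.tensor(1, dtype=torch.int32) + torch.tensor(1.0)).dtype  # torch.float32

  # TF does not promote between two tensors of different dtypes:
  # tf.add(tf.constant(1, dtype=tf.int32), tf.constant(1.0))  # raises InvalidArgumentError
  # but a plain Python scalar is cast to the other tensor's dtype:
  tf.constant(1.0) + 2  # -> float32 tensor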

@JackTemaki
Contributor Author

Okay, then let's start with the optimizations like x+0 -> x; x-0 -> x; x*1 -> x and so on... I think those are a good idea anyway.

@albertz
Member

albertz commented Apr 27, 2022

Ok done.

@albertz
Member

albertz commented Apr 27, 2022

Btw, if you want to have an L2 loss, I would not recommend doing it this way. Rather, see #59. And you should also enable decouple_constraints then.
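
For context, decouple_constraints is a RETURNN config option; as far as I understand it, enabling it applies constraint terms such as L2 decoupled in the optimizer update instead of adding them to the total loss. A minimal fragment, showing only this flag:

  # RETURNN config fragment (sketch; only the relevant option shown)
  decouple_constraints = True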
