# Definition of losses (`mark_as_loss`?), extended `make_root_net_dict`? (#56)
I just checked ESPNet as an example. See here and here. It looks like they don't separate this and have the loss logic as part of the model. I wonder though whether this is a good idea or not. Certainly some losses are related to specific models, e.g. transducer or CTC (at least the original literature introduces both together). I also wonder how other frameworks are doing this.

---
Fairseq seems to have it decoupled. See here for the model, which does not have anything about the loss.

So this kind of assumes a single criterion (loss), although you could simply add up multiple criteria into a single one (see here).
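For illustration (my own sketch, not fairseq code): summing multiple criteria into a single one could look like this, where `CombinedCriterion` and the weighting scheme are made-up names:

```python
import torch
from torch import nn


class CombinedCriterion(nn.Module):
    """Hypothetical sketch: wraps several criteria and sums their weighted losses."""

    def __init__(self, criteria: dict, weights: dict = None):
        super().__init__()
        self.criteria = nn.ModuleDict(criteria)
        self.weights = weights or {}

    def forward(self, outputs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # Sum all individual losses into a single scalar, optionally weighted.
        return sum(
            self.weights.get(name, 1.0) * criterion(outputs, targets)
            for name, criterion in self.criteria.items())
```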
---
I would propose to extend our `make_root_net_dict`. I would also propose to remove `mark_as_loss`. The details of the extended API need to be worked out.

---
Currently, `make_root_net_dict` just calls the model on the extern data, and losses can only be defined inside the model itself. In this extended API, I would propose to define a separate function for the loss, which gets the model output and the targets (from extern data) and returns the loss. `make_root_net_dict` would then call both the model and this loss function, maybe like in the sketch below:
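A minimal sketch of how this could look (all of this is proposed API, nothing of it exists yet; the `train_loss` argument and the loss-function signature are just one possibility):

```python
# Proposed, not existing API: the loss is defined separately from the model.
def ce_loss(model_out, extern_data):
    # Decoupled loss definition: gets the model output and the extern data.
    return nn.cross_entropy(
        target=extern_data["classes"], estimated=model_out, estimated_type="logits")


model = MyModel()  # any root module, e.g. a Transformer; defines no losses itself
net_dict = make_root_net_dict(model, train_loss=ce_loss)
```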
---
Note that we have some other things coupled to the model definition as well (but this is usually the same in many other frameworks).

In RETURNN, we are also specifically proud of having a unified way to define the network both for training and decoding with search (via the search flag). Related is #18 on training behavior and a potential train flag on this level, and also on search.

Further, some models might define other auxiliary losses, or other multi-task losses. E.g. we often have some auxiliary CTC loss on top of the encoder. This is maybe going to be extended in the future by many more auxiliary losses, and maybe also more local losses.

Considering this, maybe it makes it unnecessarily complicated if we strictly try to decouple the loss definitions from the model definitions, because then the root module (main model) would need to return every intermediate output as well to be able to define all the auxiliary losses. But keeping them coupled implies that all loss options for all these variations (auxiliary losses, multi-task losses, unsupervised losses, and the main loss, where many variations are also possible) need to be passed to the model somehow. Which is also not nice, as many models usually already have many other options, and this could blow up.

Maybe we can also have both: the model can define one or more outputs, maybe also specifying the output type (logits, log prob, prob or so) via some option. And then losses can be defined on all those outputs, or on some, or only on the final output.

If the models define the losses themselves (e.g. via the existing `mark_as_loss`), I don't like it too much that the losses are then hidden away inside the model. We could also make use of context managers in some way, e.g. if we stick to `mark_as_loss`:
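A minimal sketch of the context-manager idea (all names here are made up, this is not existing returnn-common code): `mark_as_loss` would register with the innermost active collection instead of some global state:

```python
import contextlib

_active_loss_collections = []


class LossCollection:
    """Hypothetical: collects all losses marked inside a `with collect_losses()` block."""

    def __init__(self):
        self.losses = []  # list of (name, tensor) pairs

    def total(self):
        # Combine all collected losses into a single scalar loss.
        return sum(loss for _, loss in self.losses)


@contextlib.contextmanager
def collect_losses():
    collection = LossCollection()
    _active_loss_collections.append(collection)
    try:
        yield collection
    finally:
        _active_loss_collections.pop()


def mark_as_loss(tensor, name=None):
    # Instead of registering globally, attach to the innermost active collection.
    if _active_loss_collections:
        _active_loss_collections[-1].losses.append((name, tensor))
    return tensor
```

Usage would then be something like `with collect_losses() as losses: out = model(data)`, and afterwards `losses.total()`, or any custom combination over `losses.losses`.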
So you would at least have the possibility to make some special use of the losses, if that would ever be needed.

---
To summarize my previous post: I'm unsure. Maybe both ways should be possible.

---
We maybe should think about some of the more unusual settings. E.g. not just standard frame-wise (or label-wise) cross entropy, but e.g. min expected WER training. Or also some meta learning. How would this look? And this probably should be decoupled from the model definition, I assume?

---
Min expected WER needs some decoder which performs search, so that we get the beam and the beam scores. Or, with a second pass, the beam comes from a precomputed N-best list which is then rescored. A sketch of how such a criterion could look is below:
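This is my own illustration in plain PyTorch-style code (not from the original post); in the second-pass variant, `beam_scores` would come from rescoring the precomputed N-best list:

```python
import torch


def min_expected_wer_loss(beam_scores: torch.Tensor,
                          beam_wers: torch.Tensor) -> torch.Tensor:
    """
    Hypothetical sketch of min expected WER training.
    beam_scores: [batch, beam] model log scores per hypothesis (from search,
                 or from rescoring an N-best list in a second pass).
    beam_wers:   [batch, beam] precomputed word error counts of each hypothesis
                 against the reference (constant, no gradient).
    """
    # Renormalize the beam scores to a distribution over the hypotheses.
    probs = torch.softmax(beam_scores, dim=1)
    # Expected WER under this distribution; gradients flow via the scores only.
    return (probs * beam_wers).sum(dim=1).mean()
```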
---
I think this boils down to one fundamental question: whether the loss definition should be decoupled from the model definition or not.
I think it would be cleaner to decouple it more. However, there can still be valid cases where it can be more coupled, or where decoupling would just complicate everything if the loss is naturally defined locally alongside the module. So I think both should be possible. We should prefer it to be decoupled when coupling is not really needed, e.g. for model parts like Transformer etc. But otherwise coupling should be supported as well. Generic regularizing local losses like L2 param norm should be handled in a more generic way (#59).

---
Ok, so to come back to this: on the question whether to separate losses or have them alongside the module, I think we should support both ways, as was argued before. With the decision to make all dim tags and axes arguments explicit (#17), we also need to think about extending `make_root_net_dict` accordingly. I currently think about getting rid of such a single function altogether, and instead making the network construction more explicit.
Note that the root network (the root name context) would then be constructed explicitly. One aspect I'm undecided on here is the way to declare the extern data. Also, currently we only support one root module. But maybe we want to support multiple? E.g. some losses would not be part of the root module but separate. This should be supported, as was discussed here. So it could look like the sketch below:
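Something like this (hypothetical API; the point is only that the loss lives in its own module, next to the root model):

```python
# Hypothetical sketch: two "root" modules, the model and a separate loss part.
model = MyModel()           # the actual model, defines no losses itself
loss_part = MyLossModule()  # separate module, could have its own parameters

out = model(extern_data["data"])
loss = loss_part(out, extern_data["classes"])
loss.mark_as_loss()
# Both model and loss_part would need their own unique name context,
# so that all their parameters get unique names.
```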
In this example, the loss lives in its own module (`MyLossModule`) next to the root model, and each gets its own unique name context. Just as a reminder: the assignment of a module call (or the module itself) to some unique name context is necessary to get unique names for the parameters. This is the only reason. And the unique names for parameters are necessary for the checkpoint file.

---
Note: The handling of extern data was updated, and there is no `make_root_net_dict` anymore.
Related is also having the training loop and stages explicit (#96).

---
Currently, to define some tensor (layer ref) as loss, you call `mark_as_loss` on it. The idea was to be somewhat analogous to when you call `loss.backward()` in PyTorch. Common code in PyTorch looks like this (see here):
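Roughly the usual pattern (a minimal sketch, not the exact linked snippet; `MyModel` and `data_loader` are placeholders):

```python
import torch

model = MyModel()  # some torch.nn.Module, defines no loss itself
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for inputs, targets in data_loader:
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)  # loss defined outside the model
    loss.backward()
    optimizer.step()
```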
So the `loss.backward()` call and also the definition of `loss` itself are somewhat separate from `MyModel`. In `MyModel`, you would not really define the loss. So this is usually decoupled.

This is not how it would work for returnn-common currently, where it cannot be separated.
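For contrast, the current coupled style in returnn-common looks roughly like this (a sketch; the submodules and exact signatures are simplified placeholders):

```python
class MyModel(nn.Module):
    """Sketch: the loss is defined and marked inside the model itself."""

    def __init__(self):
        super().__init__()
        self.encoder = Encoder()        # some submodule
        self.out_proj = nn.Linear(...)  # projection to the output classes

    def __call__(self, data, targets):
        enc = self.encoder(data)
        logits = self.out_proj(enc)
        loss = nn.cross_entropy(
            target=targets, estimated=logits, estimated_type="logits")
        loss.mark_as_loss()  # coupled: the model decides what the loss is
        return logits
```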
When you call `make_root_net_dict` (#44) on `model`, it just calls `model(...)` (using extern data) and that's it. So the current API (`make_root_net_dict`) implies that the loss is defined inside the model, inside `MyModel`, and cannot be decoupled. Or can it?

I think we should be able to decouple it, if we want to. Any module (e.g. `Transformer`, #53) should just define the model and not be specific about losses.
Maybe we can extend
make_root_net_dict
to passtrain_loss
as well or so.(I open a separate issue on this because #38 is just on the aspect of what loss functions or modules we want and their naming and usage conventions.)