How to define the API for parameter initialization, regularization (L2, weight dropout, etc), maybe updater opts per-param #59
Also … The same also applies for parameter initialization options, like … Then this becomes relevant here as well.

We will handle that in a more generic way (#59).
There are some other potential scope-based or global options (all potential … or …).
Param init and regularization (per param) in most frameworks are options to the corresponding module or layer (e.g. Keras layers take kernel_initializer and kernel_regularizer arguments).
In lower level TF, param init and regularization are done via a context manager (tf.variable_scope, which accepts initializer, regularizer and custom_getter). A sketch of both styles follows below.
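To make the two styles concrete, here is a small illustrative sketch using public TF/Keras APIs (per-layer kwargs vs. scope-based defaults). It is only meant to show the shape of the two approaches, not the API proposed in this issue:

```python
import tensorflow as tf

# The get_variable example below assumes TF1-style graph mode.
tf.compat.v1.disable_eager_execution()

# Style 1: per-layer arguments (Keras).
layer = tf.keras.layers.Dense(
    128,
    kernel_initializer=tf.keras.initializers.GlorotUniform(),
    kernel_regularizer=tf.keras.regularizers.l2(1e-4),
)

# Style 2: scope-based defaults (low-level TF1 variables).
# Variables created inside the scope pick up the initializer/regularizer defaults.
with tf.compat.v1.variable_scope(
        "block",
        initializer=tf.compat.v1.glorot_uniform_initializer(),
        regularizer=tf.keras.regularizers.l2(1e-4)):
    w = tf.compat.v1.get_variable("w", shape=(128, 128))
```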
For param init: No matter whether via a direct argument, a context scope or some other way, there needs to be some sensible default, and sometimes also an explicit overwrite. The question is, should there be an easy way to overwrite the defaults globally? This could make sense to easily try out some new kind of param initialization. This could be via a parent context scope (a sketch of that follows below), or maybe like the PyTorch approach of applying an init function recursively (see the apply example further below).

The problem is, such a global overwrite might not make sense for every parameter. Some parameters have other custom default init logic. How to differentiate those from the standard ones? One possibility is maybe some clustering of types of init, e.g. for linear matrices, biases, and maybe other types. Then you could globally overwrite the default per type. Although this is tricky, maybe hard to get right, and will likely never fit all cases.

Also, if there are multiple (maybe nested) settings of defaults, which one takes precedence? The most recent one, or the first (most outer) one? There are reasons to give priority to the most outer one, to be able to globally overwrite this; otherwise such outer settings would never have an effect. But there are maybe also cases where some inner setting should take precedence. Do we need a priority system for this? That would make it way too complicated.
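A minimal, hypothetical sketch of the nested-scope question (the names default_param_init and get_param_init are made up for illustration; they are not part of any existing API). It shows one possible policy, where the outermost scope wins:

```python
import contextlib
from typing import Callable, Optional

# Hypothetical global default for param init; None means "use the built-in default".
_param_init_override: Optional[Callable] = None

@contextlib.contextmanager
def default_param_init(init_func: Callable):
    """Scope-based override of the default param init.

    Policy sketch: the outermost scope takes precedence. If an override is
    already active, an inner scope does not replace it. The opposite choice
    (innermost wins) would simply always assign.
    """
    global _param_init_override
    prev = _param_init_override
    if prev is None:  # only the outermost scope sets the override
        _param_init_override = init_func
    try:
        yield
    finally:
        _param_init_override = prev

def get_param_init(module_default: Callable) -> Callable:
    # Modules would call this when creating their parameters.
    return _param_init_override or module_default

# Usage sketch:
# with default_param_init(my_init):
#     model = build_model()   # all standard params use my_init
```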
PyTorch Module.apply example:

```python
import torch
from torch import nn

@torch.no_grad()
def init_weights(m):
    print(m)
    if type(m) == nn.Linear:
        m.weight.fill_(1.0)
        print(m.weight)

net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
net.apply(init_weights)
```
So maybe something like such an apply function could be used here as well.

One question for e.g. weight norm and other reparameterizations is where and how the new underlying params are being added. In PyTorch weight norm, it adds them directly to the module, with postfixes _g and _v (e.g. weight_g, weight_v). Recently there has been some discussion on PyTorch about reimplementing this via the new parametrization API (torch.nn.utils.parametrize). I'm not sure though that the new PyTorch parametrization API makes it really cleaner and simpler. While it looks nice from a high-level viewpoint, it hides many things.
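For reference, a small runnable example of the existing torch.nn.utils.weight_norm utility, showing the postfix parameters it adds (this is standard PyTorch, not the API proposed in this issue):

```python
import torch
from torch import nn

lin = nn.Linear(4, 3)
lin = nn.utils.weight_norm(lin, name="weight")

# The original "weight" parameter is replaced by "weight_g" (norm) and
# "weight_v" (direction); "weight" itself is recomputed from them in a
# forward pre-hook on every call.
print(sorted(name for name, _ in lin.named_parameters()))
# -> ['bias', 'weight_g', 'weight_v']

out = lin(torch.randn(2, 4))
```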
Note, our parameters are now always explicit Parameter objects. Maybe they should have some attribute for their initial value. Then overwriting the param init would just mean assigning that attribute.
L2 on weights would probably also be done via a similar mechanism.
Weight dropout could just be done like weight norm explained above (see the sketch below).
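A minimal sketch of weight dropout done in the same style as the (pre-parametrization) weight norm utility: keep the real parameter under another name and recompute the used weight in a forward pre-hook. The helper name weight_dropout is made up here for illustration:

```python
import torch
from torch import nn
from torch.nn import functional as F

def weight_dropout(module: nn.Module, name: str = "weight", p: float = 0.1) -> nn.Module:
    # Move the real parameter to "<name>_raw" and recompute the dropped-out
    # weight in a forward pre-hook, analogous to the old weight norm scheme.
    weight = getattr(module, name)
    del module._parameters[name]
    module.register_parameter(name + "_raw", nn.Parameter(weight.detach().clone()))

    def _pre_hook(mod, inputs):
        raw = getattr(mod, name + "_raw")
        setattr(mod, name, F.dropout(raw, p=p, training=mod.training))

    module.register_forward_pre_hook(_pre_hook)
    return module

lin = weight_dropout(nn.Linear(4, 3), p=0.2)
out = lin(torch.randn(2, 4))
```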
I wonder whether we need the param init as an argument to the modules at all. With such an argument, there are two ways to overwrite the default: passing the init to the module constructor, vs. overwriting it afterwards on the created module or parameter (see the sketch below).

But the second variant would be possible in any case. So, do we really need the first way then?
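A hypothetical sketch of the two variants. The names Linear, Param, Glorot, param_init and initial are stand-ins to show the shape of the API, not an existing one:

```python
from dataclasses import dataclass
from typing import Any, Optional

class Glorot:
    """Placeholder for some param init scheme."""

@dataclass
class Param:
    shape: tuple
    initial: Optional[Any] = None  # hypothetical attribute holding the init scheme

class Linear:
    def __init__(self, n_in: int, n_out: int, param_init: Optional[Any] = None):
        # Variant 1: the init comes in as a constructor argument.
        self.weight = Param((n_in, n_out), initial=param_init)

# Variant 1: via the module argument.
layer1 = Linear(4, 3, param_init=Glorot())

# Variant 2: overwrite after construction, directly on the created parameter.
# This stays possible even without the constructor argument.
layer2 = Linear(4, 3)
layer2.weight.initial = Glorot()
```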
Things like param init, weight dropout, weight norm, L2 on weights, etc. can all easily be handled in such a post-processing way, and recursively via something like the apply function shown above (see also the sketch below).

I also don't like it when there are too many arguments, so I would prefer to keep the argument list short. Context managers are an alternative to arguments. Context managers might be added for param init or other regularization at some point (not really needed, though; and I would only do that if the behavior were simple and straightforward). Context managers are probably the way to go for some of these things.

One remaining question are the options for model-based regularization such as dropout (on activations). Dropout is by far the most common thing, but there might be other variants, and a complex model might also have various places where dropout could be added. If this is configured via model arguments, it could easily become bloated when the model becomes complex or when there are other types of regularization in the future. So I also don't like this. Context managers might be somewhat tricky here as well, to get right with straightforward behavior. However, we can maybe again do this via post-processing. Maybe modules could provide functions for that.
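As a sketch of the post-processing idea, here is L2 handled after model construction by walking the parameter tree, instead of threading an option through every module constructor. It uses plain PyTorch names for concreteness; the selection rule (matrices only, skip biases) is just an example:

```python
import torch
from torch import nn

def l2_loss(model: nn.Module, scale: float = 1e-4) -> torch.Tensor:
    # Post-processing: no module needed an "l2" argument; we just walk the
    # parameters after the model was built and collect an explicit loss term.
    total = torch.zeros(())
    for param in model.parameters():
        if param.ndim >= 2:  # example rule: weight matrices only, skip biases
            total = total + param.pow(2).sum()
    return scale * total

net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
loss = net(torch.randn(3, 4)).sum() + l2_loss(net)
loss.backward()
```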
Another model-based regularization is stochastic depth (#99, #106). This can be implemented efficiently using conditional execution (only computing a layer when it is not skipped). But the question here for this issue is again: what would the option look like? An argument to the module? Or some module method? (A simplified sketch of the mechanism follows below.)
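A simplified sketch of stochastic depth as a wrapper around a residual branch, only to illustrate the mechanism, not the proposed API. The efficient variant would additionally avoid computing the skipped branch at all, and the usual inference-time rescaling is omitted here:

```python
import torch
from torch import nn

class StochasticDepth(nn.Module):
    # Simplified: during training, drop the wrapped residual branch with
    # probability p and pass the input through unchanged.
    def __init__(self, branch: nn.Module, p: float = 0.2):
        super().__init__()
        self.branch = branch
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and torch.rand(()) < self.p:
            return x
        return x + self.branch(x)

block = StochasticDepth(nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8)), p=0.3)
y = block(torch.randn(4, 8))
```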
Note that param init also requires some further thought on the technical side: #92
A lot of relevant, similar discussion on the API for defining the param init is here: pytorch/pytorch#18182
Related is also what param init defaults we should use. See #94.
Related is TF arg_scope, which provides a context scope to overwrite the default values of layer kwargs (e.g. the initializer and regularizer arguments of the tf.contrib / slim layer functions).
Related is also #96 on an explicit training loop and stages. When we implement that, the param init would be outside of the training loop, and probably it all becomes clearer.
I think this is settled now. For param init, we can use the mechanism discussed above.
I think we should leave this open until this is really implemented, or until we maybe have a separate issue specifically on the technical aspect.
L2 on params is still another aspect. We could just compute it explicitly as part of the loss.
For L2 on params, we could also use the RETURNN mechanism directly, as an option on the parameter.
I think we can close this here for now.
It is maybe not such a nice idea if every new Module will have this option and then explicitly passes it on to all submodules. Some modules might also not implement this. And maybe there are other options which are not handled yet, e.g. param noise, etc.

How should this be handled? This is mostly about arguments which every layer can potentially accept, mostly arguments about certain behavior on parameters (variables).
In TF, the natural way would be to use a tf.variable_scope context scope and there have some custom getter or other custom logic. So maybe also some context manager here?