This is a small migration guide for converting a raw RETURNN network dictionary to the nn framework.
If you do not find something, or find something confusing, the reason might be that this guide is simply incomplete, so please give feedback, and maybe extend the framework!
For further documentation, see:
- RETURNN common homepage has an introduction and small usage examples
- RETURNN common principles
- RETURNN common conventions.
- Docstrings in the code. In any case, it is highly recommended to use an IDE so that you get auto-completion, and the IDE will also automatically show you the documentation.
- nn.base docstring: nn.base defines many important base classes such as nn.Tensor and nn.Module, and has some high-level explanation of how it works internally.
- nn.naming docstring: nn.naming defines the layer names and parameter names, i.e. how a model (via nn.Module) and all intermediate computations map to RETURNN layers.
- Missing pieces for first release and Intermediate usage before first release. This is an overview, also linking to now completed issues. The issues often come with some discussion where you find the rationale behind certain design decisions.
- Translating TF or PyTorch or other code to returnn_common
For many aspects and design decisions, RETURNN common follows the PyTorch API.
How to define or create the config, how to write Sisyphus setups, etc.: There is no definite, recommended way yet. We are still figuring out what the nicest way is. It's also up to you. It's very flexible and basically allows you to do it in any way you want.
You could have the nn code to define the network directly in the config, instead of the net dict. You can also dump a generated net dict and put that into the config. However, the generated net dict tends to be quite big, closer to the TF computation graph. So, to better understand the model definition and to be able to easily extend or change it for one-off experiments, it is recommended to always keep the nn code around, and maybe not dump the generated net dict at all.
Next to the network (network), you should also directly define the extern data (extern_data).
Also see How to handle Sisyphus hashes and Extensions for Sisyphus serialization of net dict and extern data.
(Speak to Nick, Benedikt, Albert, or others about examples, but note that they are all work in progress. You will probably find some example code in i6_experiments.)
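For illustration, here is a minimal sketch of what such nn code in a config could look like. This is not the one recommended way; in particular, the extern data helpers used here (nn.get_extern_data, nn.Data) are assumptions about the current API, so check the docstrings for your version.

```python
from returnn_common import nn

# Dim tags for the input (see the section on dim tags below).
time_dim = nn.SpatialDim("time")
in_dim = nn.FeatureDim("input", 80)
out_dim = nn.FeatureDim("linear", 512)

# Extern data: assumed API (nn.get_extern_data / nn.Data), check the docstrings.
data = nn.get_extern_data(nn.Data("data", dim_tags=[nn.batch_dim, time_dim, in_dim]))

# The model definition, replacing the net dict.
linear_module = nn.Linear(out_dim=out_dim)
y = linear_module(data)
```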
In the net dict, the outputs of layers used as inputs to other layers are referred to via their names as strings, which are defined by the keys of the net dict.
In nn, the output of a layer/module/function is of type nn.Tensor, and this is also the input type for everything.
- LinearLayer etc. -> nn.Linear etc.: classes, derived from nn.Module.
- Functional, e.g. ReduceLayer -> nn.reduce etc.: pure functions.
- CombineLayer -> a + b etc. works directly.
- CompareLayer -> a >= b etc. works directly.
- Activation functions, via ActivationLayer -> nn.relu etc., work directly.
- Layers with hidden state (RecLayer, SelfAttentionLayer etc.): There is no hidden state in nn; it is all explicit. The nn.LayerState object is used to pass around state. See nn.LSTM for an example. nn.LSTM can operate both on a sequence and on a single frame when you pass axis=nn.single_step_dim.
- RecLayer with subnetwork -> nn.Loop.
- CondLayer -> nn.Cond.
- MaskedComputationLayer -> nn.MaskedComputation.
- SubnetworkLayer -> define your own module (a class derived from nn.Module); see the sketch after this list.
- ChoiceLayer -> nn.choice. However, also see nn.SearchFunc and nn.Transformer as an example.
- SelfAttentionLayer -> nn.SelfAttention or nn.CausalSelfAttention. In general, there is also nn.dot_attention. For more complete networks, there are also nn.Transformer, nn.TransformerEncoder, nn.TransformerDecoder, and nn.Conformer.
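To make the SubnetworkLayer replacement and the explicit handling of hidden state more concrete, here is a small sketch of a custom module which also uses nn.LSTM. The module name MyBlock is made up, and the exact nn.LSTM call signature (the axis argument and the returned (output, state) tuple) is an assumption, so check the docstrings for your version.

```python
from returnn_common import nn


class MyBlock(nn.Module):
    """Custom block, replacing what would have been a SubnetworkLayer in the net dict."""

    def __init__(self, out_dim: nn.Dim):
        super().__init__()
        self.linear = nn.Linear(out_dim=out_dim)
        self.lstm = nn.LSTM(out_dim=out_dim)

    def __call__(self, x: nn.Tensor, *, axis: nn.Dim) -> nn.Tensor:
        x = nn.relu(self.linear(x))
        # Hidden state is explicit: the LSTM call returns (output, state),
        # where state is an nn.LayerState object that you could pass on further.
        y, _ = self.lstm(x, axis=axis)
        return y
```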
It should be straightforward to translate custom TF code directly to nn code, mostly just by replacing the tf. prefix with nn. See our SpecAugment code as an example (specaugment_v1_eval_func vs specaugment_v2).
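A hedged mini example of such a translation (the variable names are made up):

```python
# TF code:
#   y = tf.nn.relu(x) * 2.0
# nn code; activations and elementwise ops work directly on nn.Tensor:
y = nn.relu(x) * 2.0
```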
You can use nn.make_layer to wrap a custom layer dict. This can be used to partially migrate over some network definition. However, it is recommended to avoid this and to rewrite the model definition using the nn framework directly. nn.make_layer is how nn works internally.
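For illustration, a rough sketch of wrapping a layer dict; the exact nn.make_layer signature (e.g. the name argument) and whether tensors can be passed directly in "from" are assumptions here, so check its docstring:

```python
# Hypothetical: wrap a raw RETURNN layer dict around an existing nn.Tensor x.
y = nn.make_layer({"class": "reduce", "mode": "mean", "axis": "T", "from": x}, name="reduce_mean")
```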
After some discussion, it was decided to make consistent use of dimension tags (DimensionTag, or nn.Dim), and not to allow anything else to specify dimensions or axes. The concept of axes and the concept of dimensions and dimension values (e.g. the output feature dimension of LinearLayer) are the same when dim tags are used consistently. The feature dimension is still treated specially in some cases, meaning it is automatically used when not specified, via the feature_dim attrib of a tensor. However, all spatial dims (or reduce dims, etc.) always need to be specified explicitly. All non-specified dimensions are handled as batch dimensions.
Before:
"y": {"class": "linear", "from": "x", "n_out": 512, "activation": None}
After:
linear_module = nn.Linear(out_dim=nn.FeatureDim("linear", 512))
y = linear_module(x)
On getting the length or dim value as a tensor: nn.length(x, axis=axis) or nn.dim_value(x, axis=axis).
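As a small sketch, reducing over a spatial axis and getting its length both take the dim tag explicitly (time_dim and the mode argument of nn.reduce are assumptions for illustration):

```python
time_dim = nn.SpatialDim("time")  # example dim tag; in practice this usually comes from the extern data
# Reduce axes are given explicitly as dim tags:
mean_over_time = nn.reduce(x, mode="mean", axis=time_dim)
# Sequence lengths of that axis as a tensor:
seq_lens = nn.length(x, axis=time_dim)
```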
All parameters (variables) are explicit in nn, meaning that no RETURNN layer will create a variable, but all variables are explicitly created by the nn code via creating nn.Parameter. Parameters must have a unique name from the root module via an attrib chain. Parameter initial values can be assigned via the initial attribute, and the nn.init module provides common helpers such as nn.init.Glorot. Modules (nn.Linear, nn.Conv etc.) should already set a sensible default init, but this can then be easily overwritten. You can iterate through all parameters of a module or network via parameters() or named_parameters().
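A small sketch of explicit parameter creation and init; the exact nn.Parameter constructor form (a list of dim tags) is an assumption, so check the docstring:

```python
class MyLinear(nn.Module):
    """Hypothetical module with an explicitly created parameter."""

    def __init__(self, in_dim: nn.Dim, out_dim: nn.Dim):
        super().__init__()
        # The variable is created explicitly here, not implicitly by some RETURNN layer.
        self.weight = nn.Parameter([in_dim, out_dim])
        self.weight.initial = nn.init.Glorot()
```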
RETURNN differentiates between layer classes (derived from LayerBase) and loss classes (derived from Loss). nn does not. nn.cross_entropy is just a normal function, getting in tensors, returning a tensor. To mark it as a loss, such that it is used for training, you call mark_as_loss on the tensor.
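For example (a sketch; the argument names of nn.cross_entropy such as target, estimated and estimated_type are assumptions, so check the docstring):

```python
loss = nn.cross_entropy(target=targets, estimated=logits, estimated_type="logits")
loss.mark_as_loss("ce")
```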
The normalization behaves differently between the RETURNN CtcLoss and nn.ctc_loss, and is actually probably not as intended for the RETURNN CtcLoss. See here for details. To get the same behavior as before:
ctc_loss = nn.ctc_loss(...)
ctc_loss.mark_as_loss("ctc", custom_inv_norm_factor=nn.length(targets_time_dim))
The L2 parameter in a layer -> the weight_decay attrib of nn.Parameter. By iterating over parameters(), you can easily assign the same weight decay to all parameters, or to a subset of your model.
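For example, to assign the same weight decay to all parameters of a model (model being your root nn.Module, and the value being arbitrary):

```python
for param in model.parameters():
    param.weight_decay = 1e-4
```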