
WIP: Custom encoder/decoder layer sequence. #470

Closed
wants to merge 18 commits

Conversation

Contributor

@tdomhan tdomhan commented Jul 6, 2018

  • Encoder/Decoder consisting of custom layers

Also:

  • Janet RNN
  • QRNN
  • Highway layers

Still WIP!
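For context, the JANET cell from https://arxiv.org/pdf/1804.04849.pdf is essentially an LSTM reduced to its forget gate: a single gate blends the previous state with a candidate state. A minimal sketch of the (simplified, beta-free) recurrence, illustrative only and not this PR's code:

import numpy as np

def janet_step(x_t, h_prev, W_f, U_f, b_f, W_c, U_c, b_c):
    # forget gate decides how much of the old state to keep
    f_t = 1.0 / (1.0 + np.exp(-(x_t @ W_f + h_prev @ U_f + b_f)))
    # candidate state computed from the current input and previous state
    c_tilde = np.tanh(x_t @ W_c + h_prev @ U_c + b_c)
    # the single gate blends old state and candidate; the output equals the state
    return f_t * h_prev + (1.0 - f_t) * c_tilde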

Pull Request Checklist

  • Changes are complete (if posting work-in-progress code, prefix your pull request title with '[WIP]'
    until you can check this box).
  • Unit tests pass (pytest)
  • Were system tests modified? If so, did you run these at least 5 times to account for the variation across runs?
  • System tests pass (pytest test/system)
  • Passed code style checking (./style-check.sh)
  • You have considered writing a test
  • Updated major/minor version in sockeye/__init__.py. Major version bump if this is a backwards incompatible change.
  • Updated CHANGELOG.md

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Contributor

@fhieber fhieber left a comment

just some tiny initial comments. I will give it a closer read soon.

# target: (batch_size, num_hidden) -> (batch_size, 1, num_hidden)
target = mx.sym.expand_dims(target, axis=1)

# Incompatible input shape: expected [80,0], got [80,1,32]
Contributor

leftover comment

Contributor Author

removed

@@ -161,21 +161,26 @@ def sym_gen(source_seq_len: int):
source_words = source.split(num_outputs=self.num_source_factors, axis=2, squeeze_axis=True)[0]
source_length = utils.compute_lengths(source_words)


Contributor

spurious newline

Contributor Author

removed

@@ -174,13 +174,13 @@ def __call__(self,
# self-attention
target_self_att = self.self_attention(inputs=self.pre_self_attention(target, None),
Contributor

I'd prefer unpacking like target_self_att, _ = self.self_attention(...

Contributor Author

sounds good, will do.

target = self.post_self_attention(target_self_att, target)

# encoder attention
target_enc_att = self.enc_attention(queries=self.pre_enc_attention(target, None),
memory=source,
bias=source_bias)
bias=source_bias)[0]
Contributor

same here

Contributor Author

changed.

@tdomhan
Contributor Author

tdomhan commented Jul 10, 2018

thanks for taking a first look. There are still several cleanups necessary though (just as a warning).

Contributor

@fhieber fhieber left a comment

Haven't looked at rnn.py yet.
Bear with me, this is a lot of new code :)

raise NotImplementedError("Pooling only available on the encoder side.")


class QRNNBlock:
Contributor

some docstring might be helpful to describe what this implements.

Contributor Author

sure. I'll add some documentation here.
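For reference, while the docstring is still missing here: a QRNN block applies convolutions over the time axis to produce candidate, forget, and output gates, followed by a cheap element-wise recurrence. A minimal sketch of fo-pooling as in Bradbury et al. (2016), illustrative only and not this PR's code:

import numpy as np

def qrnn_fo_pool(z, f, o):
    # z, f, o: (seq_len, batch, hidden); z already tanh'd, f and o already sigmoided,
    # all produced by (masked) convolutions over the input sequence
    c = np.zeros_like(z[0])
    outputs = []
    for z_t, f_t, o_t in zip(z, f, o):
        c = f_t * c + (1.0 - f_t) * z_t   # gated running state
        outputs.append(o_t * c)           # output gate applied per step
    return np.stack(outputs)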

layer = meta_layer / parallel_layer / repeat_layer / subsample_layer / standard_layer
open = "("
close = ")"
empty_paran = open close
Contributor

paran -> paren

Contributor Author

done

repeat_layer = "repeat" open int comma layer_chain close
subsample_layer = "subsample" open optional_params layer_chain_sep layer_chain close

standard_layer = standard_layer_name optional_paranthesis_params
Contributor

parenthesis

Contributor Author

done

separated_layer_chain = layer_chain_sep layer_chain
more_layer_chains = separated_layer_chain*

optional_paranthesis_params = paranthesis_params?
Contributor

parenthesis

Contributor Author

done

def __init__(self):
super().__init__()

def visit_paranthesis_params(self, node, rest):
Contributor

parenthesis

Contributor Author

done
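(Aside: the visit_* method signature above looks like a parsimonious NodeVisitor. Purely as an illustration of how such a layer-spec string could be parsed, here is a toy grammar in the same spirit; the library choice, rule names, separator, and layer names are all assumptions, not taken from this PR.)

from parsimonious.grammar import Grammar

# Hypothetical toy grammar; the resulting tree would be walked with a parsimonious
# NodeVisitor (visit_* methods) to build the layer objects.
grammar = Grammar(r'''
    layer_chain  = layer ("->" layer)*
    layer        = repeat_layer / name
    repeat_layer = "repeat(" int "," layer_chain ")"
    int          = ~"[0-9]+"
    name         = ~"[a-z_]+"
''')

tree = grammar.parse("repeat(2,ff->self_att)")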


# TODO: make sure the number of hidden units does not change!
class ResidualEncoderLayer(EncoderLayer):
def __init__(self, layers: List[EncoderLayer]):
Contributor

-> None

Contributor Author

fixed

# TODO: potentially add a projection layer (for when the shapes don't match up). Alternative: check that the input num hidden matches the output num_hidden (maybe add a get_input_num_hidden())
class ResidualDecoderLayer(NestedDecoderLayer):

def __init__(self, layers: List[DecoderLayer]):
Contributor

-> None



# TODO: potentially add a projection layer (for when the shapes don't match up). Alternative: check that the input num hidden matches the output num_hidden (maybe add a get_input_num_hidden())
class ResidualDecoderLayer(NestedDecoderLayer):
Contributor

why can't we have a ResidualLayer that inherits from SharedEncoderDecoderLayer and implements all 3 methods (encode_sequence, decode_sequence, decode_step)?

Contributor Author

you mean inheriting from both NestedDecoderLayer and SharedEncoderDecoderLayer? yes, we potentially could. I'll add a TODO.

return ResidualDecoderLayer(layers)


# TODO: make this a block!?
Contributor

+1 :)

Contributor Author

:)

num_embed=num_embed_source)

# TODO: how to set this correctly!?
encoder_num_hidden = None
Contributor

can you elaborate on that TODO?

@tdomhan tdomhan mentioned this pull request Aug 9, 2018
dtype: str = C.DTYPE_FP32,
prefix: str = ''):
"""
Create a single rnn cell.
Contributor

missing newline

"""Janet cell, as described in:
https://arxiv.org/pdf/1804.04849.pdf

Parameters
Contributor

could we keep docstring styles consistent?

return len(self.rnn_cell.state_info)

def state_variables(self, step: int) -> Sequence[mx.sym.Symbol]:
return [mx.sym.Variable("%rnn_state_%d" % (self.prefix, i))
Contributor

do we have a constant for this string?

forget_bias=forget_bias)

def create_encoder_layer(self, input_num_hidden: int, prefix: str) -> layers.EncoderLayer:
return RecurrentEncoderLayer(rnn_config=self.rnn_config, prefix=prefix + "rnn_")
Contributor

constant available for string?

return RecurrentEncoderLayer(rnn_config=self.rnn_config, prefix=prefix + "rnn_")

def create_decoder_layer(self, input_num_hidden: int, prefix: str) -> layers.DecoderLayer:
return RecurrentDecoderLayer(rnn_config=self.rnn_config, prefix=prefix + "rnn_")
Contributor

constant available for string?

@fhieber
Contributor

fhieber commented Mar 20, 2019

What would it take to avoid having an int parameter representing the size of the length dimension in the calling methods of layer implementations? This could be a major step towards converting to hybrid Gluon blocks.
Currently we use seq_len as an argument to all top-level classes (encoder, decoder, attention etc.).
Back when we first wrote it, MXNet didn't have many operators and we were forced to know the sequence length in order to build the symbolic graphs. Since then, however, many things have changed, and there are now operators such as slice_like, broadcast_like etc. that allow performing operations based on the size/axes of the input data.

I took a quick pass over encoder.py and decoder.py to see what blocks us from avoiding the int argument:

  • Transformers can be implemented fully without knowing the sequence length. We currently use it for the custom ops to create variable length biases, but that's easily avoidable if we know the max_seq_len, which we do at construction time of the classes.
  • RNNs: for encoders, we can follow this tutorial on control flow operators to implement RNN unrolling with a control flow operator; this still needs some attention.
  • RNN attention/coverage types: some attention types (LocationAttention) require knowing the sequence length, but I think we can avoid that. GRUCoverage can also be implemented using control flow ops if still necessary.
  • CNN models: this is where I don't know if we can really do it. My impression is that it should be possible knowing max_seq_len and using ops such as slice_like.

This has implications on this PR, as it may change the signature of your basic Layer classes. What do you think?
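For illustration, a minimal sketch of the slice_like idea (variable names and the max_seq_len value are assumptions, not this PR's code):

import mxnet as mx

# Bias precomputed for max_seq_len at construction time, then trimmed to the
# actual time dimension of the input, so seq_len never has to be passed around.
source_words = mx.sym.Variable("source_words")   # (batch_size, seq_len)
full_bias = mx.sym.zeros(shape=(1, 512))         # assumed max_seq_len = 512
bias = mx.sym.slice_like(full_bias, source_words, axes=(1,))  # -> (1, seq_len)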

@tdomhan
Contributor Author

tdomhan commented Mar 25, 2019

Yeah, I think the main blockers were the RNNs. I'll check regarding CNNs. In general it would be really nice if we could get rid of the int parameter though.
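For reference, a minimal sketch of RNN unrolling with a control flow operator (mx.sym.contrib.foreach), as suggested above; illustrative only, with hypothetical names, not this PR's code:

import mxnet as mx

cell = mx.rnn.GRUCell(num_hidden=32, prefix="gru_")

def step(data, states):
    # one time step: data is (batch_size, num_embed), states is the cell's state list
    output, new_states = cell(data, states)
    return output, new_states

data = mx.sym.Variable("data")              # (seq_len, batch_size, num_embed)
begin_state = [mx.sym.Variable("state")]    # (batch_size, 32)
# foreach steps the cell along axis 0 of data; the graph never needs to know seq_len
outputs, final_state = mx.sym.contrib.foreach(step, data, begin_state)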

@fhieber
Contributor

fhieber commented Oct 15, 2020

Closing for now, as this needs more work to be applied to Sockeye 2.

@fhieber fhieber closed this Oct 15, 2020