
new version of seq2seq #1119

Merged (12 commits, Aug 29, 2018)
Conversation

@alexholdenmiller (Member) commented Aug 28, 2018

  • primary change is a major update to further modularize and clean up modules.py

  • successfully trains & evals & ranks on babi:task10k:1

  • successfully trains & evals & ranks on convai2

@alexholdenmiller changed the title from "[wip] new version of seq2seq" to "new version of seq2seq" on Aug 29, 2018
@alexholdenmiller (Member Author):
trains / evals successfully on convai2, including ranking

"""Add command-line arguments specifically for this agent."""
agent = argparser.add_argument_group('Seq2Seq Arguments')
agent.add_argument('--init-model', type=str, default=None,
help='load dict/features/weights/opts from this file')
help='load dict/model/opts from this path')
Contributor:
I really like it when the help text shows the default automatically via %(default)s; we could probably fix that in seq2seq now.

Member Author:
yes, let's do a separate diff for those; I also want to reformat the whitespace on all the args
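For illustration, a minimal argparse sketch of the %(default)s substitution being suggested here; the flag and values are just an example, not the PR's actual arguments:

```python
import argparse

parser = argparse.ArgumentParser()
agent = parser.add_argument_group('Seq2Seq Arguments')
# argparse replaces %(default)s with the argument's default when rendering help
agent.add_argument('--hiddensize', type=int, default=128,
                   help='size of the hidden layers (default: %(default)s)')
parser.print_help()
# ... --hiddensize HIDDENSIZE  size of the hidden layers (default: 128)
```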

@stephenroller (Contributor) left a comment:
No need to block, but here are my opinions

    :param dim: (default 0) dimension to pad

    :returns: padded tensor if the tensor is shorter than length
    """
    if tensor.size(dim) < length:
        return torch.cat(
            [tensor, tensor.new(*tensor.size()[:dim],
Contributor:
Just out of pure curiosity, did you benchmark this against other approaches (namely, initializing the full zero matrix and then assigning the tensor into it)? Not that there's any need to.

Contributor:
Also, should this actually fill_(null_idx)?

Member Author:
didn't benchmark, although 1) this is copied from the pytorch source, with the dim param added by me, and 2) it's only used for candidate score padding, so I'm not super concerned about the perf :) oh, I'll update it to use null_idx
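For reference, a hedged sketch of the two padding approaches discussed above: concatenating a filler block versus preallocating and copying in. The helper names and signature are illustrative, not the PR's exact code, and the fill value defaults to null_idx as suggested:

```python
import torch

def pad(tensor, length, dim=0, null_idx=0):
    """Pad tensor to `length` along `dim` by concatenating a null-filled block."""
    if tensor.size(dim) >= length:
        return tensor
    pad_shape = list(tensor.size())
    pad_shape[dim] = length - tensor.size(dim)
    return torch.cat([tensor, tensor.new_full(pad_shape, null_idx)], dim=dim)

def pad_prealloc(tensor, length, dim=0, null_idx=0):
    """Alternative: allocate the full null-filled tensor, then copy the data in."""
    if tensor.size(dim) >= length:
        return tensor
    out_shape = list(tensor.size())
    out_shape[dim] = length
    out = tensor.new_full(out_shape, null_idx)
    out.narrow(dim, 0, tensor.size(dim)).copy_(tensor)
    return out

x = torch.tensor([1, 2, 3])
print(pad(x, 5))           # tensor([1, 2, 3, 0, 0])
print(pad_prealloc(x, 5))  # tensor([1, 2, 3, 0, 0])
```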

the last forward pass to skip recalculating the same encoder output
rank_during_training -- (default False) if set, ranks any available
cands during training as well
:param xs: (bsz x seqlen) LongTensor input to the encoder
Contributor:
<3

        return self.START.detach().expand(bsz, 1)

    def _decode_forced(self, ys, encoder_states):
        """Decode with teacher forcing."""
Contributor:
I don't understand what this means

Contributor:
_decode_forced feeds tokens from the target as the next input, to estimate the target's score during training or during eval PPL computation

Member Author:
"teacher-forcing" always inputs the true token at each time step, vs using the model's predicted

class Encoder(nn.Module):
    def __init__(self, num_features, padding_idx=0, rnn_class='lstm',
                 emb_size=128, hidden_size=128, num_layers=2, dropout=0.1,

        scores = self._decode(encoder_states, maxlen or self.longest_label)
Contributor:
All much cleaner now, great!

        super().__init__()

        self.dropout = nn.Dropout(p=dropout)
        self.layers = num_layers
Contributor:
Why change all the names?

Member Author:
just for consistency; there was a mix of underscores and non-underscores, and things didn't match the opt args


-        return encoder_output, hidden
+        return hidden, encoder_output, attn_mask
Contributor:
intentional swap?

Contributor:
nit: TBH I think encoder_output, hidden, attn_mask is better.

Member Author:
the swap was intentional because later we have

hidden = encoder_states[0]
attn_params = (encoder_states[1], encoder_states[2])

but I could change that to [1] and ([0], [2])

it does match the rnn output more closely

        if attn_mask is not None:
            # remove activation from NULL symbols
            # TODO: is this the best operation?
            attn_w_premask -= (1 - attn_mask) * 1e20
Contributor:
Could do th.masked_fill_(attn_w_premask, -1e20). That's what I've seen in fairseq.

Maybe make that 1e20 a private constant, like _ALMOST_INFINITY? Don't think it matters.

Member Author:
self._YUGE?
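For reference, a small sketch comparing the two masking styles under discussion; the tensors and the NEAR_INF name are illustrative, not the PR's code:

```python
import torch

attn_w_premask = torch.randn(2, 5)               # raw attention logits
attn_mask = torch.tensor([[1, 1, 1, 0, 0],
                          [1, 1, 1, 1, 0]])      # 1 = real token, 0 = null/pad
NEAR_INF = 1e20

# current approach: subtract a huge constant from the masked positions
masked_sub = attn_w_premask - (1 - attn_mask).float() * NEAR_INF

# suggested alternative: masked_fill_ where the mask marks null positions
masked_fill = attn_w_premask.clone().masked_fill_(attn_mask == 0, -NEAR_INF)

# either way, the masked logits are pushed toward -inf so softmax gives them ~0
print(torch.softmax(masked_sub, dim=1))
print(torch.softmax(masked_fill, dim=1))
```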

        try:
            out = self.model(batch.text_vec, batch.label_vec)

            # generated response
-            preds, scores = out[0], out[1]
+            scores = out[0]
+            preds = scores.max(2)[1]
Contributor:
nit: I prefer _, preds = scores.max(2)

batch.candidates)

-            text = [self._v2t(p) for p in preds]
+            text = [self._v2t(p) for p in scores.max(2)[1].cpu()]
Contributor:
IMHO hiding this max() call in here is a bit nefarious.

_, preds = scores.max(2)
text = [self._v2t(p) for p in preds.cpu()]

Member Author:
this actually introduced a bug, too, because I accidentally used the teacher-forced scores from the if block; I'll update

@@ -775,7 +775,7 @@ def _init_cuda_buffer(self, model, criterion, batchsize, maxlen):
        try:
            print('preinitializing pytorch cuda buffer')
            dummy = torch.ones(batchsize, maxlen).long().cuda()
-            sc = model(dummy, dummy)[1]
+            sc = model(dummy, dummy)[0]
Contributor:
this needs an explanation. It's assuming model has a particular interface and that interface isn't clear to me right now.

Contributor:
(Assuming this is related to the hidden, encoder_out, attn_mask from earlier)

Member Author:
agreed, I need to find a better way to extract the scores here or move it back to s2s
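For context, a rough sketch of what the buffer preinitialization is doing under the assumed interface, i.e. that the model's forward returns scores as its first output; the function and variable names here are illustrative:

```python
import torch

def init_cuda_buffer(model, criterion, batchsize, maxlen):
    """Run one maximum-size forward/backward pass so pytorch allocates its
    CUDA buffers at the largest shape we expect, avoiding later reallocation."""
    dummy = torch.ones(batchsize, maxlen).long().cuda()
    # assumption from this diff: forward returns (scores, ...) with scores first
    scores = model(dummy, dummy)[0]                   # (bsz, maxlen, vocab)
    loss = criterion(scores.view(-1, scores.size(-1)), dummy.view(-1))
    loss.backward()
    model.zero_grad()
```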

@alexholdenmiller (Member Author) commented Aug 29, 2018

Note: I squashed old pretrained seq2seq module weights into the new modules and got exactly the same f1, ppl, hits on convai2.

@uralik (Contributor) left a comment:
Looks great
