
Transformer Attention Probabilities #504

Closed
wants to merge 14 commits

Conversation

tomsbergmanis

This change adds attention probabilities for the transformer decoder and completes a TODO by @fhieber.
Specifically, we compute transformer attention probabilities as the average of the attention probabilities over all attention heads in all layers.
To do so, we create MultiHeadAttentionWithProbs, a subclass of MultiHeadAttention that overrides _attend to return attention probabilities. A minimal sketch of the averaging step follows below.
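A minimal NumPy sketch of the averaging step described above (not Sockeye's actual MXNet code; shapes and names are illustrative assumptions):

```python
import numpy as np

def average_attention(per_layer_probs):
    """Average per-head attention probabilities over all layers and heads.

    per_layer_probs: list (one entry per layer) of arrays with shape
                     (num_heads, target_len, source_len).
    Returns a single (target_len, source_len) attention matrix.
    """
    # Stack to (num_layers, num_heads, target_len, source_len) and
    # average over both the layer and head axes.
    stacked = np.stack(per_layer_probs, axis=0)
    return stacked.mean(axis=(0, 1))
```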

To evaluate the resulting attention probability matrices, we used them as the basis for discrete word alignments (see the sketch after this list). The resulting alignments were then compared against:

  • alignments obtained from LSTM attention matrices,
  • alignments by FastAlign.

In a human evaluation, we found that the resulting word alignments are, on average, judged as acceptable as word alignments from LSTM attention matrices and strictly better than alignments produced by FastAlign.
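For illustration only: one common heuristic for turning an attention matrix into discrete word alignments is to align each target position to its highest-probability source position. The PR does not specify its exact procedure, so the function below is an assumption, not the method used here.

```python
import numpy as np

def attention_to_alignments(attn):
    """attn: (target_len, source_len) attention probabilities.

    Returns a list of (target_index, source_index) alignment pairs,
    aligning each target position to its argmax source position.
    """
    return [(t, int(attn[t].argmax())) for t in range(attn.shape[0])]
```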

Pull Request Checklist

  • Changes are complete (if posting work-in-progress code, prefix your pull request title with '[WIP]'
    until you can check this box).
  • Unit tests pass (pytest)
  • Were system tests modified? If so did you run these at least 5 times to account for the variation across runs?
  • System tests pass (pytest test/system)
  • Passed code style checking (./style-check.sh)
  • You have considered writing a test
  • Updated major/minor version in sockeye/__init__.py. Major version bump if this is a backwards incompatible change.
  • Updated CHANGELOG.md

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@tdomhan
Contributor

tdomhan commented Aug 9, 2018

This introduces a lot of code repetition, as dot_attention_with_probs essentially does the same thing as dot_attention, just additionally returning the probs.

I'm actually changing dot_attention to return probs as part of PR #470. So maybe we can hold off on this change until we've merged the custom encoder/decoder.
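For context, a hedged NumPy sketch of dot-product attention that returns the probabilities alongside the context vectors, in the spirit of the change discussed above. The function and argument names are assumptions, not Sockeye's actual dot_attention signature.

```python
import numpy as np

def dot_attention_with_probs(queries, keys, values):
    """queries: (tgt_len, d); keys, values: (src_len, d).

    Returns (context, probs), where probs has shape (tgt_len, src_len).
    """
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])  # scaled dot product
    scores -= scores.max(axis=-1, keepdims=True)            # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)               # softmax over source positions
    context = probs @ values
    return context, probs
```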

@tomsbergmanis
Author

@tdomhan thanks for your answer. Unless you are happy for me to change dot_attention now and have another look then, this can, of course, wait until you are done with the custom encoder/decoder.
May I ask: do you have a rough estimate of when you might be done with PR #470?

@fhieber
Contributor

fhieber commented Oct 15, 2020

Closing this PR as it would need its target branch changed to sockeye_1.

@fhieber fhieber closed this Oct 15, 2020