How to read alignment graph? #144

Closed
reallynotabot opened this issue Apr 11, 2018 · 10 comments

Comments

@reallynotabot

Trying to understand what the axes and color legend mean.

  • From reading other issues, I understand that a diagonal line means good alignment, but what exactly does each (x, y) value and its color represent?
  • Is the color legend the alignment value, and does higher mean better?

@NTT123

NTT123 commented Apr 12, 2018

There are two important things here.

(1) The encoder (y-axis), which at each step takes an input character and its current state and outputs a real-valued vector representing the state of the network at that moment. The input length here is about 100, so the encoder generates about 100 vectors.

(2) The decoder (x-axis) takes all those vectors (the y-axis) and generates audio frames (a mel-spectrogram). The decoder also works step by step; at each step (there are about 80-90 steps here) it decides which vectors (on the y-axis) are important for creating the audio frames at that particular moment. Bright colors mean the decoder focuses more on that encoder position, and vice versa.

In short, the encoder reads input characters step-by-step and outputs status vectors. The decoder reads all status vectors and generates audio frames step-by-step.

A good alignment simply means: an "A" sound generated by the decoder should be the result of focusing on the vector the encoder generated from reading the character "A". The diagonal line appears when audio frames are created by focusing on the correct input characters in order.
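
Not code from this repo, just a minimal NumPy sketch of how such an alignment matrix is typically read; the array shapes and variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical alignment matrix: one row per decoder step, one column per
# encoder step (the plot shows decoder steps on the x-axis and encoder
# steps on the y-axis). Values are attention weights.
decoder_steps, encoder_steps = 85, 100
alignment = np.random.dirichlet(np.ones(encoder_steps), size=decoder_steps)

# Each row is a distribution over encoder positions, so it sums to 1.0.
assert np.allclose(alignment.sum(axis=1), 1.0)

# For decoder step t, the brightest cell is the encoder position
# (input character) the decoder is focusing on most at that moment.
t = 20
focused_position = alignment[t].argmax()
print(f"At decoder step {t}, the decoder attends mostly to encoder step {focused_position}")

# A "good" alignment has these focused positions increasing roughly
# monotonically with t, which draws the diagonal line in the plot.
```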

@reallynotabot

reallynotabot commented Apr 12, 2018

@NTT123 Thanks for the reply.

The good alignment example makes sense. But this brought up more questions:

  1. If the encoder input length is 100, and the decoder takes all these vectors to generate mel-spectrograms, why is the decoder length 80-90, and not 100?

  2. How does the decoder decide which vectors (on the y-axis) are important to create audio frames at that particular moment?

  3. When you say bright colors mean to focus "more here" - what does the value signify? The yellow color is 0.7+; what does that mean?
    Similarly, the point (20, 60) (x, y) is in the blue region; does that mean the encoder vector at timestep 60 contributes close to zero to the decoder mel-spectrogram at timestep 20?

@NTT123

NTT123 commented Apr 12, 2018

@reallynotabot,

(1) The number of decoding steps is determined by the training audio sample. For example, if each decoding step generates, say, 5 frames and an audio sample is about 400 audio frames long, then there are 400 / 5 = 80 decoding steps.

(2) The decoder has to learn which vectors are important. That is what training does. Technically, this is an attention mechanism; see here for details.

(3) At each decoding step, the values along the whole y-axis are weights that add up to 1.0. The "focused" vector is actually the weighted average of all the encoder's status vectors (a minimal sketch follows below).

Yes, vector 60 adds very little to the average vector (because its weight is very small) at decoding step 20.
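
Here is a minimal NumPy sketch of that weighted average (the context vector the decoder actually consumes); the shapes and names are illustrative assumptions, not code from this repo.

```python
import numpy as np

encoder_steps, hidden_dim = 100, 256

# Encoder outputs: one state vector per input character.
encoder_outputs = np.random.randn(encoder_steps, hidden_dim)

# Attention weights for one decoding step: one weight per encoder vector,
# all non-negative and summing to 1.0. These are the colors plotted in one
# column (one decoder step) of the alignment graph.
weights = np.random.dirichlet(np.ones(encoder_steps))

# The "focused" vector is the weighted average of all encoder state vectors.
context_vector = weights @ encoder_outputs   # shape: (hidden_dim,)

# The number of decoding steps is set by the audio length: e.g. with
# 5 frames generated per step, a 400-frame sample needs 400 / 5 = 80 steps.
```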

@reallynotabot

@NTT123 Thanks. This was really helpful! Will go read more about the attention mechanism.

@reallynotabot

reallynotabot commented Apr 13, 2018

@NTT123 So I went through the doc you linked, as well as a few others, and still had some questions. Would be great if you could help answer them.
https://distill.pub/2016/augmented-rnns/

The part that is unclear is:

The attention distribution is usually generated with content-based attention. The attending RNN generates a query describing what it wants to focus on. Each item is dot-producted with the query to produce a score, describing how well it matches the query.

  1. What does content-based mean? Is it the character vectors generated by the encoder that are considered the content, with the mechanism then focusing on each vector?
  2. What is the query that the attending RNN generates? Is that an audio sample from the decoder?
  3. By "each item is dot-producted with the query", do they mean each vector from the encoder is dot-producted with the decoder state at that time step, which generates the score used as the weight for that vector (which shows up as the color on the attention alignment graph)?

Thanks again for help in clarifying stuff.

@NTT123

NTT123 commented Apr 14, 2018

@reallynotabot
The encoder reads the text input and generates a list of real-valued vectors. The attention mechanism is all about deciding which positions of the list are important for generating sound at a given moment in time. If the list has 120 vectors, then at each step the attention mechanism generates 120 weights, one for each vector in the list. The weights are real numbers and add up to 1.0.

There are many different kinds of attention mechanisms: content-based, location-based, or hybrid. The difference is in how the weights are computed at each step.

For example, Tacotron 1 uses a content-based attention mechanism. Weights are computed from the content (actual values) of the list. Where you look depends on what you are looking at.

Tacotron 2 uses a hybrid attention mechanism (https://arxiv.org/abs/1506.07503). Where you look depends on what you are looking at (content-based) and on where you looked previously (location-based).
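
A rough sketch of how content-based weights can be computed (a simple dot-product score followed by a softmax); the hybrid mechanism in the linked paper additionally feeds features of the previous weights into the score. The shapes and names here are illustrative assumptions, not the actual Tacotron code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

encoder_steps, hidden_dim = 120, 256
encoder_outputs = np.random.randn(encoder_steps, hidden_dim)  # the "list" of vectors
decoder_state = np.random.randn(hidden_dim)                   # decoder state at this step

# Content-based: score each encoder vector by how well its content
# matches the current decoder state (here, a plain dot product).
scores = encoder_outputs @ decoder_state   # shape: (encoder_steps,)
weights = softmax(scores)                  # 120 weights that sum to 1.0

# A location-based term would also score features of the previous step's
# weights (e.g. a convolution over them), so that "where you looked
# previously" influences where you look now.
```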

This answers your first question. However, your second and third questions aren't about Tacotron; you're asking about Neural Turing Machines with augmented memory. My bad for recommending that web page, which isn't closely related to Tacotron. It describes a much more powerful neural network that stores memories and recalls them. You need a query (a real-valued vector), and the network searches for memories similar to this vector (hence the dot product).

@reallynotabot

reallynotabot commented Apr 16, 2018

@NTT123 That makes sense. Going back to the attention mechanism itself.

The attention mechanism is all about deciding which positions of the list are important for generating sound at a given moment in time.

I'm trying to figure out how the attention mechanism decides which position of the list is important for generating sound. If the weights are computed based on the content (actual values) of the list, how does it know or learn that a particular content (actual value) should get a higher weight versus a lower weight?

@NTT123

NTT123 commented Apr 16, 2018

@reallynotabot An attention mechanism is actually a continuous function with tunable parameters,
just like f(x; a, b, c) = ax^2 + bx + c, where a, b, c are tunable parameters.

If the parameters are bad, the results are bad sounds. The bad sounds are compared with the training audio samples, and the error is large. (Basically, the error measures the distance from an audio output generated by the network to the correct training audio sample.)

For example, the error is e(x; a, b, c) = (f(x; a, b, c) - y)^2, where y is the correct audio sample.

We want e() to be small. We can tune the values of a, b, c.

Calculus tells us that the gradient of e() with respect to a, b, c points in the direction in which e() increases fastest (locally).

So we should follow the opposite direction of the gradient (along which e() decreases fastest).

That is how neural networks learn: they tune their parameters by following the opposite direction of the gradient of their error function.
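
A tiny sketch of that idea using the same toy f(x; a, b, c) = ax^2 + bx + c: compute the gradient of the squared error with respect to a, b, c and step in the opposite direction (plain gradient descent, purely for illustration).

```python
# Toy example: fit f(x; a, b, c) = a*x**2 + b*x + c to one target y
# by gradient descent on the squared error e = (f(x) - y)**2.

def f(x, a, b, c):
    return a * x**2 + b * x + c

x, y = 2.0, 10.0          # one "input" and its correct "output"
a, b, c = 0.0, 0.0, 0.0   # bad initial parameters
lr = 0.01                 # learning rate (step size)

for step in range(200):
    err = f(x, a, b, c) - y
    # Gradient of e = err**2 with respect to each parameter (chain rule).
    grad_a = 2 * err * x**2
    grad_b = 2 * err * x
    grad_c = 2 * err
    # Move parameters opposite to the gradient, so e decreases.
    a -= lr * grad_a
    b -= lr * grad_b
    c -= lr * grad_c

print(f(x, a, b, c))  # close to y = 10.0 after training
```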

@haqkiemdaim

Hi @NTT123, I just want to ask: during training, a message on the terminal says:

"Decoder stopped with decoder_max_step"

Why does that happen?

@anubhootisol

I have two questions.
1) I am getting an "unable to connect" error in TensorBoard. I want to see an alignment graph for the Tacotron 2 implementation. Can someone suggest a way to see the graph?
2) We used Tacotron 1 for English TTS with 3.5 hours of paired data, batch size 32, r = 5, and have run 200k iterations. The alignment graph shows a thick line and there are many outliers. How much further do we need to continue the iterations? Can someone suggest an answer? @
