How to read alignment graph? #144

Closed
reallynotabot opened this issue Apr 11, 2018 · 10 comments

Comments

@reallynotabot

Trying to understand what the axes and color legend mean.

  • From reading other issues, I understand that a diagonal line means good alignment, but what exactly does each (x, y) value and its color represent?
  • Is the color legend the alignment value, and does higher mean better?

@NTT123

NTT123 commented Apr 12, 2018

There are two important things here.

(1) The encoder (y-axis), which at each step takes an input character and its current state and outputs a real-valued vector representing the state of the network at that moment. The input length here is about 100, so the encoder generates about 100 vectors.

(2) The decoder (x-axis) takes all those vectors (the y-axis) and generates audio frames (a mel-spectrogram). The decoder also works step by step; at each step (there are about 80-90 steps here) it decides which vectors (on the y-axis) are important for creating the audio frames at that particular moment. Bright colors mean the decoder focuses more on that encoder position, and vice versa.

In short, the encoder reads input characters step-by-step and outputs status vectors. The decoder reads all status vectors and generates audio frames step-by-step.

A good alignment simply means: an "A" sound generated by the decoder should be the result of focusing on the vector the encoder generated from reading the character "A". The diagonal line appears when audio frames are created by focusing on the correct input characters in order.
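
Not code from this repo, just a minimal NumPy sketch of how such an alignment matrix is typically read; the array shapes and variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical alignment matrix: one row per decoder step, one column per
# encoder step (the plot shows decoder steps on the x-axis and encoder
# steps on the y-axis). Values are attention weights.
decoder_steps, encoder_steps = 85, 100
alignment = np.random.dirichlet(np.ones(encoder_steps), size=decoder_steps)

# Each row is a distribution over encoder positions, so it sums to 1.0.
assert np.allclose(alignment.sum(axis=1), 1.0)

# For decoder step t, the brightest cell is the encoder position
# (input character) the decoder is focusing on most at that moment.
t = 20
focused_position = alignment[t].argmax()
print(f"At decoder step {t}, the decoder attends mostly to encoder step {focused_position}")

# A "good" alignment has these focused positions increasing roughly
# monotonically with t, which draws the diagonal line in the plot.
```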

@reallynotabot

reallynotabot commented Apr 12, 2018

@NTT123 Thanks for the reply.

The good alignment example makes sense. But this brought up more questions:

  1. If the encoder input length is 100, and the decoder takes all these vectors to generate mel-spectrograms, why is the decoder length 80-90, and not 100?

  2. How does the decoder decide which vectors (on the y-axis) are important to create audio frames at that particular moment?

  3. When you say bright colors mean to focus "more here" - what does the value signify? The yellow color is 0.7+; what does that mean?
    Similarly, the point (20, 60) (x, y) is in the blue region; does that mean the encoder vector at timestep 60 contributes close to zero to the decoder mel-spectrogram at timestep 20?

@NTT123

NTT123 commented Apr 12, 2018

@reallynotabot,

(1) The number of decoding steps is determined by the training audio sample. For example, if each decoding step generates, say, 5 frames and an audio sample is about 400 audio frames long, then there are 400 / 5 = 80 decoding steps.

(2) The decoder has to learn which vectors are important. That is what training does. Technically, this is an attention mechanism; see here for details.

(3) At each decoding step, the values along the whole y-axis are weights that add up to 1.0. The "focused" vector is actually the weighted average of all the encoder's status vectors (a minimal sketch follows below).

Yes, vector 60 adds very little to the average vector (because its weight is very small) at decoding step 20.
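
Here is a minimal NumPy sketch of that weighted average (the context vector the decoder actually consumes); the shapes and names are illustrative assumptions, not code from this repo.

```python
import numpy as np

encoder_steps, hidden_dim = 100, 256

# Encoder outputs: one state vector per input character.
encoder_outputs = np.random.randn(encoder_steps, hidden_dim)

# Attention weights for one decoding step: one weight per encoder vector,
# all non-negative and summing to 1.0. These are the colors plotted in one
# column (one decoder step) of the alignment graph.
weights = np.random.dirichlet(np.ones(encoder_steps))

# The "focused" vector is the weighted average of all encoder state vectors.
context_vector = weights @ encoder_outputs   # shape: (hidden_dim,)

# The number of decoding steps is set by the audio length: e.g. with
# 5 frames generated per step, a 400-frame sample needs 400 / 5 = 80 steps.
```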

@reallynotabot

@NTT123 Thanks. This was really helpful! Will go read more about the attention mechanism.

@reallynotabot

reallynotabot commented Apr 13, 2018

@NTT123 So I went through the doc you linked, as well as a few others, and still had some questions. Would be great if you could help answer them.
https://distill.pub/2016/augmented-rnns/

The part that is unclear is:

The attention distribution is usually generated with content-based attention. The attending RNN generates a query describing what it wants to focus on. Each item is dot-producted with the query to produce a score, describing how well it matches the query.

  1. What does content-based mean? Is it the character vectors generated by the encoder that are considered the content, with the mechanism then focusing on each vector?
  2. What is the query that the attending RNN generates? Is that an audio sample from the decoder?
  3. By "each item is dot-producted with the query", do they mean each vector from the encoder is dot-producted with the decoder state at that time step, which generates the score used as the weight for that vector (which shows up as the color on the attention alignment graph)?

Thanks again for help in clarifying stuff.

@NTT123

NTT123 commented Apr 14, 2018

@reallynotabot
The encoder reads the text input and generates a list of real-valued vectors. The attention mechanism is all about deciding which positions of the list are important for generating sound at a given moment in time. If the list has 120 vectors, then at each step the attention mechanism generates 120 weights, one for each vector in the list. The weights are real numbers and add up to 1.0.

There are many different kinds of attention mechanisms: content-based, location-based, or hybrid. The difference is in how the weights are computed at each step.

For example, Tacotron 1 uses a content-based attention mechanism. Weights are computed from the content (actual values) of the list. Where you look depends on what you are looking at.

Tacotron 2 uses a hybrid attention mechanism (https://arxiv.org/abs/1506.07503). Where you look depends on what you are looking at (content-based) and on where you looked previously (location-based).
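
A rough sketch of how content-based weights can be computed (a simple dot-product score followed by a softmax); the hybrid mechanism in the linked paper additionally feeds features of the previous weights into the score. The shapes and names here are illustrative assumptions, not the actual Tacotron code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

encoder_steps, hidden_dim = 120, 256
encoder_outputs = np.random.randn(encoder_steps, hidden_dim)  # the "list" of vectors
decoder_state = np.random.randn(hidden_dim)                   # decoder state at this step

# Content-based: score each encoder vector by how well its content
# matches the current decoder state (here, a plain dot product).
scores = encoder_outputs @ decoder_state   # shape: (encoder_steps,)
weights = softmax(scores)                  # 120 weights that sum to 1.0

# A location-based term would also score features of the previous step's
# weights (e.g. a convolution over them), so that "where you looked
# previously" influences where you look now.
```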

This answers your first question. However, your second and third questions aren't about Tacotron; you're asking about Neural Turing Machines with augmented memory. My bad for recommending that web page, which isn't closely related to Tacotron. It describes a much more powerful neural network that stores memories and recalls them. You need a query (a real-valued vector), and the network searches for memories similar to this vector (hence the dot product).

@reallynotabot

reallynotabot commented Apr 16, 2018

@NTT123 That makes sense. Going back to the attention mechanism itself.

The attention mechanism is all about deciding which positions of the list are important for generating sound at a given moment in time.

I'm trying to figure out how the attention mechanism decides which position of the list is important for generating sound. If the weights are computed based on the content (actual values) of the list, how does it know or learn that a particular content (actual value) should get a higher weight versus a lower weight?

@NTT123

NTT123 commented Apr 16, 2018

@reallynotabot An attention mechanism is actually a continuous function with tunable parameters,
just like f(x; a, b, c) = ax^2 + bx + c, where a, b, c are tunable parameters.

If the parameters are bad, the results are bad sounds. The bad sounds are compared with the training audio samples, and the error is large. (Basically, the error measures the distance from an audio output generated by the network to the correct training audio sample.)

For example, the error is e(x; a, b, c) = (f(x; a, b, c) - y)^2, where y is the correct audio sample.

We want e() to be small. We can tune the values of a, b, c.

Calculus tells us that the gradient of e() with respect to a, b, c points in the direction in which e() increases fastest (locally).

So we should follow the opposite direction of the gradient (along which e() decreases fastest).

That is how neural networks learn: they tune their parameters by following the opposite direction of the gradient of their error function.
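
A tiny sketch of that idea using the same toy f(x; a, b, c) = ax^2 + bx + c: compute the gradient of the squared error with respect to a, b, c and step in the opposite direction (plain gradient descent, purely for illustration).

```python
# Toy example: fit f(x; a, b, c) = a*x**2 + b*x + c to one target y
# by gradient descent on the squared error e = (f(x) - y)**2.

def f(x, a, b, c):
    return a * x**2 + b * x + c

x, y = 2.0, 10.0          # one "input" and its correct "output"
a, b, c = 0.0, 0.0, 0.0   # bad initial parameters
lr = 0.01                 # learning rate (step size)

for step in range(200):
    err = f(x, a, b, c) - y
    # Gradient of e = err**2 with respect to each parameter (chain rule).
    grad_a = 2 * err * x**2
    grad_b = 2 * err * x
    grad_c = 2 * err
    # Move parameters opposite to the gradient, so e decreases.
    a -= lr * grad_a
    b -= lr * grad_b
    c -= lr * grad_c

print(f(x, a, b, c))  # close to y = 10.0 after training
```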

@haqkiemdaim

Hi @NTT123, I just want to ask: during training, a message on the terminal says:

"Decoder stopped with decoder_max_step"

Why does that happen?

@anubhootisol

I have two questions.
1) I am getting an "unable to connect" error in TensorBoard. I want to see an alignment graph for the Tacotron 2 implementation. Can someone suggest a way to see the graph?
2) We used Tacotron 1 for English TTS with 3.5 hours of paired data, batch size 32, r = 5, and have run 200k iterations. The alignment graph shows a thick line and there are many outliers. How much further do we need to continue the iterations? Can someone suggest an answer? @
