How to read alignment graph? #144
There are two important things here. (1) The encoder (y-axis), which at each step takes an input character and its current state, and outputs a real-valued vector representing the status of the "brain" at that moment. The input length here is about 100, so the encoder generates about 100 vectors. (2) The decoder (x-axis), which takes all those vectors (the y-axis) and generates audio frames (a mel-spectrogram). The decoder also works step by step; at each step (there are about 80-90 steps here) it decides which vectors (on the y-axis) are important for creating the audio frames at that particular moment. Bright colors mean more focus on that position (on the y-axis) and vice versa.

In short: the encoder reads input characters step by step and outputs status vectors. The decoder reads all the status vectors and generates audio frames step by step. A good alignment simply means: an "A" sound generated by the decoder should be the result of focusing on the vector the encoder generated from reading the "A" character. The diagonal line is the result when audio frames are created by focusing on the correct input characters, in order.
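The explanation above can be sketched with a toy alignment matrix. This is a minimal illustration, not Tacotron's actual attention; the sizes (~100 encoder steps, ~80 decoder steps) are taken from the discussion, and the Gaussian-shaped columns are just a way to fake a clean diagonal:

```python
import numpy as np

# Hypothetical sizes from the discussion: ~100 input characters
# (encoder steps, y-axis) and ~80 decoder steps (x-axis).
n_enc, n_dec = 100, 80

# Build a toy "good" alignment: each decoder step focuses on the
# encoder position that lies proportionally along the diagonal.
alignment = np.zeros((n_enc, n_dec))
for t in range(n_dec):
    center = t * n_enc / n_dec  # which character this decoding step attends to
    weights = np.exp(-0.5 * ((np.arange(n_enc) - center) / 2.0) ** 2)
    alignment[:, t] = weights / weights.sum()  # each column sums to 1.0

# Bright colors in the plot correspond to large weights; in a good
# alignment they run along the diagonal.
print(alignment[:, 0].argmax())   # the first decoding step focuses near character 0
print(alignment[:, -1].argmax())  # the last step focuses near the last character
```

Plotting `alignment` with something like `matplotlib.pyplot.imshow(alignment, origin="lower")` would reproduce the kind of diagonal picture discussed in this thread.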
@NTT123 Thanks for the reply. The good alignment example makes sense. But this brought up more questions:
(1) The number of decoding steps is determined by the training audio sample. For example, if each decoding step generates, say, 5 frames and an audio sample is about 400 audio frames long, then there are 400/5 = 80 decoding steps. (2) The decoder has to learn which vectors are important; this is the training. Technically, this is an attention mechanism, see here for details. (3) At each decoding step, the whole y-axis is a weighted sum: all the colors add up to 1.0. The focused vector is actually the weighted average of all the encoder's status vectors. Yes, vector 60 adds very little to the average vector (because its weight is very small) at the 20th decoding step.
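Point (3) can be sketched in a few lines (sizes and names are hypothetical, for illustration only): the focused vector at one decoding step is the weighted average of all encoder vectors, with the weights summing to 1.0.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder output: 100 status vectors of dimension 8.
encoder_states = rng.normal(size=(100, 8))

# Attention scores for one decoding step (random here, just for
# illustration); softmax turns them into weights that add up to 1.0.
scores = rng.normal(size=100)
weights = np.exp(scores) / np.exp(scores).sum()

# The "focused vector" is the weighted average of all encoder vectors.
context = weights @ encoder_states  # shape (8,)

print(round(weights.sum(), 6))  # all the colors add up to 1.0
```

One column of the alignment plot is exactly one such `weights` vector; a position with a very small weight (like vector 60 in the example above) contributes almost nothing to `context`.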
@NTT123 Thanks. This was really helpful! Will go read more about the attention mechanism.
@NTT123 So I went through the doc you linked, as well as a few others, and still had some questions. Would be great if you could help answer them. The part that is unclear is:
Thanks again for help in clarifying stuff.
@reallynotabot There are many different kinds of attention mechanisms: content-based, location-based, or hybrid. The difference is how the weights are computed at each step. For example, Tacotron 1 uses a content-based attention mechanism: weights are computed from the content of the decoder state and the encoder outputs. Tacotron 2 uses a hybrid attention mechanism (https://arxiv.org/abs/1506.07503): where you look depends on what you look at (content-based) and where you looked previously (location-based). This answers your first question. However, your second and third questions aren't about Tacotron. You're asking about neural Turing machines with augmented memory. My bad for recommending that web page, which isn't closely related to Tacotron. It describes a much more powerful neural network with memories and a way of recalling those memories: you need a query (a real-valued vector), and the network searches for memories similar to that vector (hence the dot product).
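The query/dot-product lookup mentioned at the end can be sketched as follows. This is a minimal content-based addressing example (normalized vectors, cosine similarity), not Tacotron's or the NTM paper's exact mechanism:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical memory: 10 stored vectors of dimension 4, normalized
# to unit length so dot products are cosine similarities.
memory = rng.normal(size=(10, 4))
memory /= np.linalg.norm(memory, axis=1, keepdims=True)

# The query is a real-valued vector; here it matches slot 3 exactly.
query = memory[3]

# Content-based addressing: dot-product similarity, then softmax
# to get a weight per memory slot.
similarity = memory @ query  # slot 3 scores 1.0, others strictly less
weights = np.exp(similarity) / np.exp(similarity).sum()

# The network "recalls" the memory most similar to the query.
print(weights.argmax())  # -> 3
```

The point is only that similarity to the query (the dot product) is what decides which memory gets the most weight.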
@NTT123 That makes sense. Going back to the attention mechanism itself:
I'm trying to figure out how the attention mechanism decides which position in the list is important for generating sound. If the weights are computed based on
@reallynotabot An attention mechanism is actually a continuous function with tunable parameters. If the parameters are bad, the result is bad sounds. The bad sounds are compared with the training audio samples, and the error is large. (Basically, the error measures the distance from an audio output generated by the network to the correct training audio sample.) For example, the error is e(x; a, b, c) = (f(x) - y)^2, where y is the correct audio sample. We want e() to be small, and we can tune the values of a, b, c. Math 101 tells us that the gradient of e() with respect to a, b, c points in the direction in which e() increases most (locally). So we follow the opposite direction of the gradient, along which e() decreases most. That is how neural networks learn: they tune their parameters by following the opposite direction of the gradient of the error function.
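The tuning described above can be shown numerically with a toy model. Here f(x) = a*x + b stands in for the network, and the single data point (x, y) is made up; the only real claim is the update rule, stepping opposite to the gradient of e():

```python
# Toy version of e(x; a, b) = (f(x) - y)^2 with f(x) = a*x + b.
a, b = 0.0, 0.0   # start with "bad" parameters
x, y = 2.0, 7.0   # one hypothetical training example; the correct output is y
lr = 0.05         # step size along the negative gradient

for _ in range(500):
    f = a * x + b
    # Gradients of e with respect to a and b (chain rule):
    grad_a = 2 * (f - y) * x
    grad_b = 2 * (f - y)
    a -= lr * grad_a  # step opposite to the gradient, so e() decreases
    b -= lr * grad_b

print(round(a * x + b, 3))  # -> 7.0, i.e. f(x) now matches y
```

Real training does exactly this, just with millions of parameters and the gradients computed by backpropagation instead of by hand.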
Hi @NTT123, I just want to ask: during training, a message on the terminal says "Decoder stopped with decoder_max_step". Why does that happen?
I have two questions. Trying to understand what the axes and the color legend mean. I can tell this is a good alignment, but what exactly is happening at each x,y value and color of the graph?