-
Notifications
You must be signed in to change notification settings - Fork 88
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Understanding some RNN logic for QMIX SMAX algorithm #119
Comments
To follow up on this, I looked through the SMAX MAPPO code, and it looks like there a given timestep consists of both last done and last obs, and I think this is correct because when you then sample a batch of this data and feed it into the same ScannedRNN, the dones will correctly reset the hidden states because the observations that are then passed in afterwards are the new observations from the new episodes. So is one of these implementations correct and the other one is wrong, or are they done differently for intentional reasons?
|
@mttga mind checking this, pretty sure the mappo way is correct as we had to fix to use |
@amacrutherford Thanks for the follow up, I'm happy to make a PR with the update if you guys determine it is a bug. I also ran a couple experiments using |
Hey all, hope everyone is doing well! What follows may be bit of a dumb question, but I just wanted to clarify how this is working for my own algorithm development based on your guys' excellent code.
The QMIX code uses a ScannedRNN class where you pass in a sequence of observations and dones, and anywhere where a done condition is true, the hidden state is reset, and we pass the corresponding obs at that timestep through:
This makes sense to me. However I noticed that when data is actually collected, any given timestep actually consists of the last obs + new done, instead of last obs + last done.
Therefore, doesn't it mean that when we're using this RNN and it resets the hidden state and then passes in the observation, we're actually using the previous observation (which is associated with the previous episode) as the first step in the RNN's new sequence with the reset hidden state, instead of the current/new observation (from the new episode after the environment was just reset) generated after the episode was ended with the done being True?
The text was updated successfully, but these errors were encountered: