Enabling periodic evaluation #202
Unanswered
elle-miller asked this question in Q&A
Replies: 1 comment
-
Benefits of separating training from evaluation

Here is an example of an agent in the Isaac Lab Cartpole environment. You can see that the evaluation returns communicate the true learning state of the agent, without the stochasticity of the sampled actions. I was always confused by how the performance would degrade/oscillate in the training returns.

The training loop is below. You can reproduce the results with this minimal example:
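(A sketch of that loop, assuming skrl >= 1.0 with the `train()`/`act()`/`eval()` modifications described in the question; the agent setup follows skrl's documented PPO pattern, and gymnasium's Pendulum-v1 stands in for the Isaac Lab Cartpole.)

```python
import gymnasium as gym
import torch
import torch.nn as nn

from skrl.agents.torch.ppo import PPO, PPO_DEFAULT_CONFIG
from skrl.envs.wrappers.torch import wrap_env
from skrl.memories.torch import RandomMemory
from skrl.models.torch import DeterministicMixin, GaussianMixin, Model
from skrl.trainers.torch import SequentialTrainer


# standard skrl Gaussian policy and deterministic value models
class Policy(GaussianMixin, Model):
    def __init__(self, observation_space, action_space, device):
        Model.__init__(self, observation_space, action_space, device)
        GaussianMixin.__init__(self, clip_actions=False)
        self.net = nn.Sequential(nn.Linear(self.num_observations, 64), nn.ELU(),
                                 nn.Linear(64, self.num_actions))
        self.log_std_parameter = nn.Parameter(torch.zeros(self.num_actions))

    def compute(self, inputs, role):
        return self.net(inputs["states"]), self.log_std_parameter, {}


class Value(DeterministicMixin, Model):
    def __init__(self, observation_space, action_space, device):
        Model.__init__(self, observation_space, action_space, device)
        DeterministicMixin.__init__(self)
        self.net = nn.Sequential(nn.Linear(self.num_observations, 64), nn.ELU(),
                                 nn.Linear(64, 1))

    def compute(self, inputs, role):
        return self.net(inputs["states"]), {}


env = wrap_env(gym.make("Pendulum-v1"))
device = env.device

memory = RandomMemory(memory_size=16, num_envs=env.num_envs, device=device)
models = {"policy": Policy(env.observation_space, env.action_space, device),
          "value": Value(env.observation_space, env.action_space, device)}

agent = PPO(models=models, memory=memory, cfg=PPO_DEFAULT_CONFIG.copy(),
            observation_space=env.observation_space,
            action_space=env.action_space, device=device)

# 1000 total training timesteps, evaluated every 100: train -> eval, x10
trainer = SequentialTrainer(cfg={"timesteps": 100, "headless": True},
                            env=env, agents=agent)
for _ in range(10):
    trainer.train()  # 100 training timesteps (memory/rollout counter reset first)
    trainer.eval()   # 100 evaluation timesteps with mean actions, no learning
```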
-
Hi there,
I would like to use the `SequentialTrainer` in a way that enables me to periodically evaluate the agent during training. The provided example only shows evaluation post-training: https://skrl.readthedocs.io/en/latest/api/trainers/sequential.html

In this example, I want to train `num_envs` environments for 1000 timesteps each and evaluate 10 times throughout the process. This means training for 100 timesteps, then evaluating, repeated x10.

Code modifications
I have modified the:

- `train()` function, to reset the memory and rollout counter (see the first sketch below)
- `act()` function in PPO, to only return the mean action under evaluation instead of sampling (see the second sketch below)
- `eval()` method, to accumulate masked per-environment returns (see the sketch after the next paragraph)
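The `train()` reset looks roughly like this (a sketch; `Memory.reset()` and PPO's internal `_rollout` counter are my reading of the skrl source, so the exact names may differ between versions):

```python
# at the top of SequentialTrainer.train(), so repeated calls start clean
# (single-agent case; self.agents is the PPO agent passed to the trainer)
self.agents.memory.reset()   # drop transitions left over from the previous call
self.agents._rollout = 0     # restart PPO's internal rollout counter
```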
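And the `act()` change, sketched; `self.training` is the flag set by the trainer via `set_running_mode()`, and the Gaussian policy exposes its mean through the `mean_actions` output (both taken from the skrl source, so treat them as assumptions):

```python
def act(self, states, timestep, timesteps):
    """Sample stochastic actions when training, return the mean when evaluating"""
    actions, log_prob, outputs = self.policy.act(
        {"states": self._state_preprocessor(states)}, role="policy")
    if not self.training:
        # evaluation: replace the sampled actions with the distribution mean
        return outputs["mean_actions"], log_prob, outputs
    return actions, log_prob, outputs
```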
I have applied a mask to the rewards because if an environment terminates or truncates, I want the rewards for that episode to stop accumulating. However, the mean evaluation returns I am getting with this method are not "correct". For example, with the Isaac Lab Cartpole environment, the returns never go past ~155 with the mask, but if I comment the mask out they reach ~300 (the optimal policy). When I play the learned policy, it is indeed optimal.
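The masked accumulation inside `eval()` looks roughly like this (a sketch with illustrative names; `states`, `env`, and `agent` come from the setup above, and `eval_timesteps` is the number of evaluation steps):

```python
import torch

num_envs = env.num_envs
episode_returns = torch.zeros(num_envs, device=env.device)
alive = torch.ones(num_envs, device=env.device)  # 1.0 until the episode ends

for timestep in range(eval_timesteps):
    with torch.no_grad():
        actions, _, _ = agent.act(states, timestep=timestep, timesteps=eval_timesteps)
    states, rewards, terminated, truncated, infos = env.step(actions)

    # mask: once an env terminates or truncates, stop accumulating its rewards
    episode_returns += alive * rewards.squeeze(-1)
    alive *= (~(terminated | truncated)).squeeze(-1).float()

mean_eval_return = episode_returns.mean().item()
```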
Questions:

1. To use the `SequentialTrainer` in an alternating `train() -> eval() -> train()` fashion, are there any other implementation changes that should be made, e.g. environment resets?
2. Are the rewards returned by `step()` already masked out somewhere?

Thanks in advance!