How to interpret mean episode reward differences on Epoch 0?

I am following this PyTorch tutorial that uses generalized advantage estimation (GAE) / PPO to train different network architectures in an RL environment. I want to compare the performance of different styles of network, so I currently train each network separately and save the mean episode reward from each training epoch; my thinking was that some networks would learn faster over time and exhibit sharp jumps in mean episode reward, while others might have a flatter curve. However, my results look like this:

[Plot omitted; X-axis: training epoch #, Y-axis: episode mean reward]

It seems like one network is already doing nearly 4x better than the other starting from Epoch 0. My understanding of how GAE / PPO works is that all of the frames per batch are collected first, and then gradient descent is used to update the parameters; the updated parameters are only used for the next round of data collection. This leaves me confused about how to interpret the relative success of the blue network on Epoch 0, given that the parameter updates aren't in effect until the next round of data collection. Does this just reflect some sort of initialization difference between the networks, or is it possible for them to have "learned" something by the time the first mean rewards are computed? The training loop code is basically from the linked tutorial and is included below:

```python
episode_reward_mean_list = []
for tensordict_data in collector:
    with torch.no_grad():
        GAE(
            tensordict_data,
            params=loss_module.critic_network_params,
            target_params=loss_module.target_critic_network_params,
        )  # get advantages

    data_view = tensordict_data.reshape(-1)
    replay_buffer.extend(data_view)  # refill the buffer

    for ep in range(num_epochs):
        for _ in range(frames_per_batch // minibatch_size):
            subdata = replay_buffer.sample()
            loss_vals = loss_module(subdata)
            loss_value = (
                loss_vals["loss_objective"]
                + loss_vals["loss_critic"]
                + loss_vals["loss_entropy"]
            )
            loss_value.backward()
            optim.step()
            optim.zero_grad()

    collector.update_policy_weights_()

    done = tensordict_data.get(("next", "done"))
    episode_reward_mean = (
        tensordict_data.get(("next", "episode_reward")).mean().item()
    )
    episode_reward_mean_list.append(episode_reward_mean)
```
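For what it's worth, one way I've been thinking about testing the initialization hypothesis is to evaluate the untrained networks directly, before any gradient step. Below is a minimal, hypothetical sketch (the architectures, sizes, and the `make_mlp` helper are made up for illustration, not taken from the tutorial): it passes the same random "observations" through two freshly initialized policy networks and compares their raw outputs, which differ purely because of initialization.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for two different network styles being compared.
def make_mlp(hidden):
    return nn.Sequential(nn.Linear(4, hidden), nn.Tanh(), nn.Linear(hidden, 2))

obs = torch.randn(256, 4)  # fake batch of observations
net_a = make_mlp(64)       # "small" architecture
net_b = make_mlp(256)      # "wide" architecture

# No training has happened: any difference here is initialization alone.
with torch.no_grad():
    logits_a = net_a(obs)
    logits_b = net_b(obs)

print("std of outputs, net A:", logits_a.std().item())
print("std of outputs, net B:", logits_b.std().item())
```

If the Epoch 0 reward gap persists across several random seeds of this kind of check (or across repeated training runs), that would point at a systematic initialization/architecture effect rather than anything "learned" before the first logged reward.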

submitted by /u/brantacanadensis906