Discrepancy in Stable Baselines training steps vs expected

Hello,

I am new to RL and have created a custom continuous 2D maze env. I want to train an agent to navigate the maze and reach the goal. The env's max_timesteps is set to 500. I am training with Stable Baselines PPO. My parameters are:

# config.yaml
ppo:
  policy: 'MlpPolicy'
  n_steps: 250
  learning_rate: 0.0003
  batch_size: 10
  n_epochs: 10
  gamma: 0.99
  gae_lambda: 0.95
  clip_range: 0.2
  ent_coef: 0.01
  vf_coef: 0.5
  max_grad_norm: 0.5
  use_sde: true
  sde_sample_freq: -1
  tensorboard_log: "./ppo_tensorboard/"
  seed: 123
  device: "cpu"
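For reference, this is roughly how I load that config and build the model. My actual maze env is not shown here; I have substituted a standard continuous-control env (Pendulum-v1) as a placeholder so the snippet stands on its own.

import yaml
import gymnasium as gym
from stable_baselines3 import PPO

# Load the hyperparameters from the "ppo" section of config.yaml
with open("config.yaml") as f:
    ppo_cfg = yaml.safe_load(f)["ppo"]

# Placeholder env: my real env is a custom continuous 2D maze
# with max_timesteps = 500 per episode
env = gym.make("Pendulum-v1")

# All keys in ppo_cfg map directly onto PPO constructor arguments
model = PPO(env=env, **ppo_cfg)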

The problem is that the reported timestep count does not match the number of steps I tell model.learn to train for. For example, I trained with model.learn(total_timesteps=10000). For 10,000 timesteps with a maximum of 500 steps per episode, I would expect 20 iterations, but the logger output below indicates 4 iterations and 12,000 total timesteps. Any idea what is causing this?
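Continuing from the snippet above, this is how I start training, plus what I would check to see how many timesteps each iteration actually collects (attribute names are the ones SB3 exposes on the model object, as far as I can tell):

# Each rollout/iteration collects n_steps transitions per parallel env,
# so the reported total_timesteps grows by n_steps * n_envs per iteration
print("n_steps:", model.n_steps)
print("n_envs:", model.n_envs)
print("timesteps per iteration:", model.n_steps * model.n_envs)

model.learn(total_timesteps=10_000)

# learn() runs whole rollouts, so the final count can overshoot the target
print("actual timesteps run:", model.num_timesteps)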

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 480         |
|    ep_rew_mean          | -2.77e+03   |
| time/                   |             |
|    fps                  | 83          |
|    iterations           | 4           |
|    time_elapsed         | 143         |
|    total_timesteps      | 12000       |
| train/                  |             |
|    approx_kl            | 0.021791738 |
|    clip_range           | 0.2         |
|    entropy_loss         | -9.48       |
|    explained_variance   | 0.0275     |
|    learning_rate        | 0.0003      |
|    loss                 | 4.38e+04    |
|    n_updates            | 30          |
|    policy_gradient_loss | -0.109      |
|    std                  | 0.998       |
|    value_loss           | 3.06e+04    |
-----------------------------------------

submitted by /u/lujan-002
