PPO value loss converges immediately while the policy loss struggles

I'm training PPO on a custom environment using SB3. As seen in TensorBoard, the value loss converges to almost 0 very quickly, while the policy loss seems to deteriorate over time. The entropy loss also appears to be stuck.

https://preview.redd.it/qgcxv85ks2qc1.png?width=1648&format=png&auto=webp&s=0d1609d756f42a0f2372b4006836d3eb4d4e3744

At the same time, the training rewards keep increasing, yet the test results are getting worse!

https://preview.redd.it/4vv051myr2qc1.png?width=556&format=png&auto=webp&s=a1fe9ca87349d14c918bcb8e68766db6b02a3425

Could you please help me understand where the problem is and how to fix it?

Current hyperparameters:

from sb3_contrib import MaskablePPO
from sb3_contrib.common.maskable.policies import MaskableActorCriticPolicy

initial_learning_rate = 0.000005
model = MaskablePPO(MaskableActorCriticPolicy, env, tensorboard_log="./tensorboard", n_steps=1024, learning_rate=initial_learning_rate, ent_coef=0.005)
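In case a decaying learning rate was intended (the name initial_learning_rate suggests a schedule rather than a constant), SB3 also accepts a callable for learning_rate. A minimal sketch below, where linear_schedule is a hypothetical helper and not something from the original post:

def linear_schedule(initial_value):
    # SB3 calls the schedule with progress_remaining going from 1.0 down to 0.0,
    # so the learning rate decays linearly from initial_value to 0.
    return lambda progress_remaining: progress_remaining * initial_value

model = MaskablePPO(MaskableActorCriticPolicy, env, tensorboard_log="./tensorboard", n_steps=1024, learning_rate=linear_schedule(initial_learning_rate), ent_coef=0.005)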

submitted by /u/Acceptable_Egg6552