PPO Value loss converges immediately while the policy loss struggles

I’m training PPO on a custom environment using SB3. As seen in TensorBoard, the value loss converges to almost 0 very quickly, while the policy loss seems to deteriorate over time. Also, the entropy loss seems to be stuck.


Yet at the same time, rewards keep increasing while the test results are getting worse!


Could you please help me understand where the problem is and how to fix it?

Current hyperparameters:

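from sb3_contrib import MaskablePPO
from sb3_contrib.common.maskable.policies import MaskableActorCriticPolicy
initial_learning_rate = 0.000005  # i.e. 5e-6
model = MaskablePPO(MaskableActorCriticPolicy, env, tensorboard_log="./tensorboard", n_steps=1024, learning_rate=initial_learning_rate, ent_coef=0.005)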
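For reference: the call above passes initial_learning_rate as a constant, even though the name hints at a decaying schedule. If a decay is intended (an assumption, the post does not say so), SB3 also accepts a callable for learning_rate that receives the remaining training progress, going from 1.0 at the start to 0.0 at the end. A minimal sketch of a linear decay follows; the names linear_schedule and lr_schedule are hypothetical and only for illustration:

```python
from typing import Callable


def linear_schedule(initial_value: float) -> Callable[[float], float]:
    """Build a schedule that decays linearly from initial_value to 0."""

    def schedule(progress_remaining: float) -> float:
        # progress_remaining runs from 1.0 (start of training) to 0.0 (end).
        return progress_remaining * initial_value

    return schedule


initial_learning_rate = 0.000005
lr_schedule = linear_schedule(initial_learning_rate)  # hypothetical name

# Learning rate at the start, midpoint, and end of training.
print(lr_schedule(1.0), lr_schedule(0.5), lr_schedule(0.0))
```

The schedule would then be passed as learning_rate=lr_schedule in place of the constant. Whether a decay helps here is a separate question, and this sketch only shows the mechanism.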

submitted by /u/Acceptable_Egg6552