Hard time understanding PPO loss

I’m implementing the PPO method and so far it has been successful: I trained it on Gym’s Lunar Lander. But the final loss graph doesn’t make sense to me. To my understanding we’re trying to minimize the loss, so a lower loss should mean a better model. But look at the loss and average reward graphs:

https://preview.redd.it/jbqf5bxczj4d1.png?width=996&format=png&auto=webp&s=892e907eb860242d3a4d38309e9f0ce231056371

Around steps 25-50 there’s a big decrease in loss, which should mean the model became significantly better, yet the average reward also dropped considerably. Around step 100 the loss increased, and so did the average reward. It looks as if a higher loss means a better model, which doesn’t make sense to me.
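For reference, the loss I’m plotting is (roughly) the standard PPO clipped surrogate objective with a value loss and entropy bonus folded into one scalar. Below is a minimal sketch in PyTorch; the names (`clip_eps`, `value_coef`, `entropy_coef`) and coefficients are illustrative, not my exact code:

```python
import torch

def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns,
             entropy=None, clip_eps=0.2, value_coef=0.5, entropy_coef=0.01):
    # Probability ratio between the current and old policy
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Clipped surrogate objective: PPO *maximizes* this, so the loss negates it
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Value-function loss and entropy bonus are combined into the same scalar
    value_loss = (returns - values).pow(2).mean()
    entropy_bonus = entropy.mean() if entropy is not None else torch.tensor(0.0)

    return policy_loss + value_coef * value_loss - entropy_coef * entropy_bonus
```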

submitted by /u/Aydiagam
