Help in understanding PPO

Hello everyone! I'm having some trouble understanding PPO, both from the academic paper and from some code implementations I found online.

In the paper, I understand the approximation is between the output of the old model and the new model. How does that work in practice? How do I update the model and then compute how much it was updated? Do I need to keep the model from iteration i-1 around just to do that calculation?

Now for the implementations. I'm using IsaacGym and run n simulations at a time. All the implementations I found update the model on a sequence of actions until the episode is done. I want to update on a random batch of single transitions drawn from my n environments, and I'm having a hard time understanding what I need to save and change. What quantities do I need to save at each step? I thought of: observation, action, reward, value (V-net output), and log probs. Am I missing something I need to save?

Sorry if it's a bit of a long post; any help would be awesome.
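To make the question concrete, here is a rough sketch of the buffer and update I currently picture. All the sizes, network shapes, and names below are placeholders of mine, and the GAE step is elided, so this is just how I imagine it working, not taken from any particular implementation:

```python
# Rough sketch (PyTorch) of a per-step buffer for n parallel environments and
# a PPO-style clipped update over random minibatches of single transitions.
# Every name and size here is a placeholder assumption.
import torch
import torch.nn as nn

n_envs, n_steps, obs_dim, act_dim = 64, 128, 48, 12          # made-up sizes

# Tiny placeholder actor and critic just so the sketch runs end to end.
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
log_std = nn.Parameter(torch.zeros(act_dim))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(
    [*actor.parameters(), log_std, *critic.parameters()], lr=3e-4)

def policy(x):
    # Gaussian policy; returns a Distribution so log-probs can be queried later.
    return torch.distributions.Normal(actor(x), log_std.exp())

# Per environment step, store: observation, action, reward, done flag,
# value (V-net output), and the log-prob of the action under the acting policy.
obs      = torch.zeros(n_steps, n_envs, obs_dim)
acts     = torch.zeros(n_steps, n_envs, act_dim)
rewards  = torch.zeros(n_steps, n_envs)
dones    = torch.zeros(n_steps, n_envs)
values   = torch.zeros(n_steps, n_envs)
logp_old = torch.zeros(n_steps, n_envs)

# ... step the simulator n_steps times and fill the buffers, then compute
# advantages and returns (e.g. with GAE) from rewards, values, and dones ...
advantages = torch.zeros(n_steps, n_envs)                     # placeholder
returns = advantages + values

# Flatten (step, env) into one pool of single transitions so that minibatches
# can mix transitions from different environments and different time steps.
flat = lambda x: x.reshape(n_steps * n_envs, *x.shape[2:])
b_obs, b_acts = flat(obs), flat(acts)
b_logp_old, b_adv, b_ret = flat(logp_old), flat(advantages), flat(returns)

clip_eps = 0.2
for idx in torch.randperm(n_steps * n_envs).split(1024):      # random minibatches
    # New log-probs come from the *current* policy; the old ones were saved at
    # collection time, so the previous model itself never has to be kept.
    dist = policy(b_obs[idx])
    logp_new = dist.log_prob(b_acts[idx]).sum(-1)

    ratio = torch.exp(logp_new - b_logp_old[idx])             # pi_new / pi_old
    surr1 = ratio * b_adv[idx]
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * b_adv[idx]
    policy_loss = -torch.min(surr1, surr2).mean()
    value_loss = (critic(b_obs[idx]).squeeze(-1) - b_ret[idx]).pow(2).mean()

    loss = policy_loss + 0.5 * value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Is this roughly the right picture, or am I saving the wrong things?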

submitted by /u/razton