Why Not Just Use the Value Target from the Previous Step in PPO?

For the PPO objective function, why can't we compute the advantage by subtracting the target state value from the last time step from the target state value of the current time step, rather than training a network to estimate the state value?
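For reference, the standard approach the question is contrasting against uses a learned value network V(s) to form one-step TD-residual advantages, A_t = r_t + γ·V(s_{t+1}) − V(s_t). A minimal sketch (the rollout numbers and function names here are hypothetical, not from the post):

```python
def td_advantages(rewards, values, gamma=0.99):
    """One-step TD-residual advantages: A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` must have length len(rewards) + 1, where the final entry is the
    bootstrap value estimate for the state after the last reward.
    """
    return [rewards[t] + gamma * values[t + 1] - values[t]
            for t in range(len(rewards))]

# Toy 3-step rollout with (hypothetical) learned value estimates V(s_0)..V(s_3).
rewards = [1.0, 0.0, 1.0]
values = [0.5, 0.4, 0.6, 0.0]  # last entry is the bootstrap for the final state
print(td_advantages(rewards, values))
```

The key point the question is probing: the `values` list here comes from a trained critic network, not from the value *targets* (empirical returns) of an earlier step.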

submitted by /u/MfkinBad
