A2C learns and dies repeatedly

I’m currently working on my implementation of A2C on the inverted pendulum problem.

(Reward = height², with height normalized to [0, 1]; episode length = 1000; discount factor = 0.98)
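For reference, here is a minimal sketch of how I compute the discounted returns the critic is trained against (function name and the backward-pass style are just my own; the only things taken from the setup above are the discount factor 0.98 and the per-step reward):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.98):
    """Backward pass computing G_t = r_t + gamma * G_{t+1} for one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

With episode length 1000 and gamma = 0.98 the effective horizon is about 1/(1-0.98) = 50 steps, which is why the returns can grow well past the per-step reward scale.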

For some reason, though, it keeps dying in the middle of learning and just does nothing for 750+ games. This is strange because I normalize the advantages before feeding them into the actor network, floor π(a|s) at 0.05 (so that A/π(a|s) doesn't blow up), use a learning rate of 0.001 for the actor and 0.005 for the critic, and the networks shouldn't be overfitting (actor: 3-16-16-3, critic: 4-32-16-1). Additionally, for it to be getting a score as low as it is, it would have to be choosing the "do nothing" action (out of its 3 actions) about 90% of the time. The only explanation I can think of is that the critic network gets confused because, after discounting future rewards, the state values can range from -20 to 60.
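To be concrete about the two tricks I mentioned, this is roughly what my advantage normalization and probability floor look like (a sketch with my own helper names, not a full training loop):

```python
import numpy as np

def normalize_advantages(adv, eps=1e-8):
    """Standardize advantages to zero mean / unit std before the actor update."""
    adv = np.asarray(adv, dtype=np.float64)
    return (adv - adv.mean()) / (adv.std() + eps)

def floor_probs(probs, floor=0.05):
    """Clip action probabilities from below, then renormalize to sum to 1,
    so the importance-style ratio A / pi(a|s) stays bounded."""
    probs = np.clip(np.asarray(probs, dtype=np.float64), floor, None)
    return probs / probs.sum()
```

One thing I'm not sure about: renormalizing after the clip can push a probability slightly back below the floor, so the bound on A/π is approximate rather than exact.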

I don’t think the issue is in the environment, the rewards, or the actor network, because my implementation of REINFORCE has a smooth learning curve and reaches reasonably good performance.
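Since REINFORCE works, the moving part is the critic, so here is the one-step advantage computation I believe A2C boils down to (a sketch of the standard TD(0) form, not necessarily line-for-line what my code does):

```python
def td_advantage(r, v_s, v_next, done, gamma=0.98):
    """One-step TD advantage: A = r + gamma * V(s') - V(s).
    Returns (advantage for the actor, target for the critic's regression)."""
    target = r + (0.0 if done else gamma * v_next)
    return target - v_s, target
```

If the critic's value estimates are badly scaled (e.g. the true returns range from -20 to 60 as above), both the advantage and the critic's regression target inherit that scale, which is one reason the actor update can get swamped.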

[Figure: learning curve of A2C. "scores" = cumulative reward; crit_v = cumulative value estimates from the critic; bp_advantage = cumulative magnitude of the gradients backpropagated into the actor.]

submitted by /u/AUser213
