Policy gradient methods: learning the optimal variance of a stochastic policy?

I have a task with a continuous action space where the optimal policy is stochastic (due to imperfect information). I plan to use a typical policy gradient algorithm (say, PPO), with an NN policy that outputs the mean and variance of a distribution over actions.
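For concreteness, a common way to parameterize such a policy (in many PPO implementations) is a network head for the mean plus a learned log standard deviation that is exponentiated, which keeps the variance positive regardless of what the optimizer does. A minimal NumPy sketch, with all names and dimensions purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from any particular task)
obs_dim, act_dim = 4, 2

# "Network": a single linear layer for the mean, standing in for a real NN
W = rng.normal(scale=0.1, size=(act_dim, obs_dim))
b = np.zeros(act_dim)

# Learnable log-std, shared across states (a common PPO choice);
# exponentiating guarantees std > 0 for any real-valued parameter
log_std = np.zeros(act_dim)  # std starts at exp(0) = 1

def policy(obs):
    """Sample an action from a diagonal Gaussian policy."""
    mean = W @ obs + b
    std = np.exp(log_std)
    action = mean + std * rng.standard_normal(act_dim)
    return action, mean, std

obs = rng.standard_normal(obs_dim)
action, mean, std = policy(obs)
```

Whether `log_std` is a free parameter (state-independent, as here) or an extra network output (state-dependent) is a design choice; both appear in common PPO codebases.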

It is usual practice to start with a high variance, and the variance tends to shrink toward zero during training. But what if the variance of the optimal policy is a finite, fairly large value? Does this approach still work, i.e. will the variance go back up if it drops too low? Or is there another way of optimizing the variance?
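One way to see that plain gradient ascent can push the variance back up: for a Gaussian policy with learnable log sigma, the score function is d log N(a | mu, sigma^2) / d(log sigma) = ((a - mu) / sigma)^2 - 1, which is positive whenever a sampled action lands more than one sigma from the mean. So if such actions receive positive advantage, the update increases sigma. A small sketch, where the sampled actions and advantage values are made up purely for illustration:

```python
import numpy as np

def grad_log_std(a, mu, sigma):
    # d/d(log sigma) of log N(a | mu, sigma^2) = ((a - mu)/sigma)^2 - 1
    return ((a - mu) / sigma) ** 2 - 1.0

mu, sigma = 0.0, 0.1                         # variance has collapsed to a small value
actions    = np.array([0.3, -0.25, 0.05])    # hypothetical sampled actions
advantages = np.array([1.0, 0.8, -0.5])      # hypothetical advantage estimates

# REINFORCE-style gradient w.r.t. log sigma, averaged over the batch
g = np.mean(advantages * grad_log_std(actions, mu, sigma))
# g > 0 here: actions far from the mean earned positive advantage,
# so a gradient-ascent step increases sigma and the variance recovers
```

This recovery only happens if exploration still occasionally samples far-from-mean actions; once sigma is numerically tiny, those samples become rare, which is why entropy bonuses or a floor on sigma are often used in practice.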

submitted by /u/redditDRL
