Policy gradients – future-looking returns

In the policy gradient approach, one differentiates the expected reward

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ R(\tau) \right]$$

To obtain:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(\tau)\, R(\tau) \right] \approx \frac{1}{N} \sum_{j=1}^{N} \nabla_\theta \log \pi_\theta(\tau^j)\, R(\tau^j)$$

(with some abuse of notation). This in turn is broken down into a summation over single state-action transitions:

$$\nabla_\theta \log \pi_\theta(\tau^j)\, R(\tau^j) = \sum_{i=0}^{n-1} \nabla_\theta \log \pi_\theta\!\left(a^j_i \mid s^j_i\right) R(\tau^j)$$
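This decomposition is a consequence of how the trajectory probability factorizes. As a sketch (assuming Markovian dynamics $p(s_{i+1} \mid s_i, a_i)$ and an initial-state distribution $p(s_0)$, neither of which depends on $\theta$):

$$\nabla_\theta \log \pi_\theta(\tau^j) = \nabla_\theta\!\left[\log p(s^j_0) + \sum_{i=0}^{n-1}\Big(\log \pi_\theta(a^j_i \mid s^j_i) + \log p(s^j_{i+1} \mid s^j_i, a^j_i)\Big)\right] = \sum_{i=0}^{n-1} \nabla_\theta \log \pi_\theta(a^j_i \mid s^j_i),$$

since the initial-state and dynamics terms carry no $\theta$-dependence; this is also the abuse of notation mentioned above, with $\pi_\theta(\tau)$ standing in for the full trajectory distribution.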

Note that the return of the complete trajectory multiplies every single state-action transition $(s^j_i, a^j_i)$ belonging to the trajectory $\tau^j = (s^j_0, a^j_0, s^j_1, a^j_1, \dots, s^j_n)$. Usually, here comes a hand-waving step that, for each state $s^j_i$, replaces the full return $R(\tau^j)$ with the future return from that state. Rather than

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{j=1}^{N} \sum_{i=0}^{n-1} \nabla_\theta \log \pi_\theta\!\left(a^j_i \mid s^j_i\right) R(\tau^j), \qquad (1)$$

the standard form dictates:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{j=1}^{N} \sum_{i=0}^{n-1} \nabla_\theta \log \pi_\theta\!\left(a^j_i \mid s^j_i\right) \hat{R}^j_i, \qquad (2)$$

with

$$\hat{R}^j_i = \sum_{i'=i}^{n-1} r^j_{i'}$$

being the sum of the rewards $r^j_{i'}$ collected from state-action transition $i$ onward (the reward-to-go).
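As a concrete illustration of the difference between Eq (1) and Eq (2), here is a minimal NumPy sketch (function names are illustrative, not from any particular library) of the per-transition weight each form assigns to $\nabla_\theta \log \pi_\theta(a^j_i \mid s^j_i)$ for a single sampled trajectory:

```python
import numpy as np

def full_return_weights(rewards):
    """Weight on grad log pi(a_i | s_i) under Eq (1): every transition of the
    trajectory is multiplied by the same full return R(tau) = sum of all rewards."""
    rewards = np.asarray(rewards, dtype=float)
    return np.full(len(rewards), rewards.sum())

def reward_to_go_weights(rewards):
    """Weight on grad log pi(a_i | s_i) under Eq (2): transition i is multiplied
    only by the rewards collected from step i onward, R_hat_i = sum_{i' >= i} r_{i'}."""
    rewards = np.asarray(rewards, dtype=float)
    # reverse cumulative sum gives [r_0 + ... + r_{n-1}, r_1 + ... + r_{n-1}, ..., r_{n-1}]
    return np.cumsum(rewards[::-1])[::-1]

# Example: a 4-step trajectory with rewards r_0, ..., r_3
rewards = [1.0, 0.0, 2.0, 3.0]
print(full_return_weights(rewards))   # [6. 6. 6. 6.]
print(reward_to_go_weights(rewards))  # [6. 5. 5. 3.]
```

Both estimators weight the same gradient terms; they differ only in whether rewards collected before step $i$ are allowed to multiply $\nabla_\theta \log \pi_\theta(a^j_i \mid s^j_i)$.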

I’m looking for either a rigorous explanation showing that these two forms are equivalent, or an explanation of the different setups and assumptions that lead to Eq (2) rather than Eq (1).

submitted by /u/Both_Ebb_327