Hi everyone, I have built a custom environment.

Basically, the reward function is a weighted sum of three terms: (1) the score, (2) some soft constraints that penalize design violations, and (3) the change in the number of violations from the previous state to the current state.
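To make the structure of the reward concrete, here is a minimal sketch of a weighted sum of this shape. It is not the actual implementation: the names `RewardWeights` and `compute_reward`, the weight values, and the sign conventions (lower score is better, violations are penalized) are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the signs assume every term should be minimized.
    score: float = -1.0
    soft_constraint: float = -0.5
    violation_change: float = -0.1


def compute_reward(score, soft_penalty, prev_violations, curr_violations,
                   weights=RewardWeights()):
    """Return the weighted sum of the three terms described in the post."""
    # Term 3: positive when the step added violations, negative when it removed some.
    violation_change = curr_violations - prev_violations
    return (weights.score * score
            + weights.soft_constraint * soft_penalty
            + weights.violation_change * violation_change)


if __name__ == "__main__":
    # One step: score 10, soft penalty 2, violations drop from 3 to 1.
    print(compute_reward(10.0, 2.0, prev_violations=3, curr_violations=1))
```

With weights like these, the score term is much larger in magnitude than the other two, so it would dominate the gradient signal. Logging each weighted term separately per step is one way to check whether that is what happens in the real environment.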

However, it looks like the agent only learns to decrease the score (as shown below).

For some reason, ep_rew_mean keeps decreasing, as shown below. If I understand correctly, ep_rew_mean is the mean of the cumulative reward per episode.
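To spell out what I think the metric means, here is a small sketch of an episode-reward mean computed over a sliding window of recent episodes. The class name `EpisodeRewardMean` and the `window_size` parameter are hypothetical, and the default of 100 is an assumption based on how Stable-Baselines3-style loggers usually report this value. I am not certain my training setup computes it exactly this way.

```python
from collections import deque


class EpisodeRewardMean:
    """Tracks the mean undiscounted return over the most recent episodes."""

    def __init__(self, window_size=100):
        # Assumed window size; only the last `window_size` episodes are kept.
        self.returns = deque(maxlen=window_size)

    def add_episode(self, step_rewards):
        # Cumulative (undiscounted) reward of one finished episode.
        self.returns.append(float(sum(step_rewards)))

    def value(self):
        # Mean over the stored episode returns; NaN before any episode ends.
        if not self.returns:
            return float("nan")
        return sum(self.returns) / len(self.returns)


if __name__ == "__main__":
    tracker = EpisodeRewardMean(window_size=2)
    tracker.add_episode([1.0, 2.0, 3.0])
    tracker.add_episode([-1.0, -1.0])
    print(tracker.value())
```

Under this definition, a falling curve means that recent episodes collect less total reward than earlier ones. If episode lengths change during training, the sum per episode can also drift even when the per-step reward stays about the same.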

The other training plots look normal to me. Is that right?

Thank you everyone!

