I’m trying to replace a PID controller with an RL agent using stable_baselines3 (it seems to be the easiest option for beginners). Everything runs properly, but whatever algorithm or policy I use, the action is always very noisy and not usable on my real system. My system is very slow (it takes about 200 steps to get a response).

Does anyone have recommendations about algorithms, policies, or how to configure the hyperparameters for a slow system with a continuous action? I tried shaping the reward to calm the action down, but it always ends up saturating at the min or max value, even when I try to encourage a small difference between the previous action and the current one.
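
To make the reward idea concrete, here is a minimal sketch of the kind of shaping I mean. It is not my exact code: the function name `shaped_reward`, the quadratic terms, and the `rate_weight` value are all placeholders chosen for illustration.

```python
def shaped_reward(setpoint, measurement, action, last_action, rate_weight=0.1):
    """Tracking reward with a penalty on action changes (illustrative only).

    All names and the default weight are hypothetical placeholders.
    """
    tracking_error = setpoint - measurement
    action_delta = action - last_action
    # Penalise the squared tracking error, plus the squared change in the
    # action between two consecutive steps (weighted by rate_weight).
    return -(tracking_error ** 2) - rate_weight * (action_delta ** 2)


# Same tracking error, but the second call changes the action much more
smooth = shaped_reward(1.0, 0.8, action=0.50, last_action=0.48)
jumpy = shaped_reward(1.0, 0.8, action=0.50, last_action=-0.50)
print(smooth > jumpy)
```

With this form, a large jump between consecutive actions should score lower than a small one for the same tracking error, which is the behaviour I was hoping the agent would pick up.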

Many thanks for your help… I’m going crazy over this thing!

