The direction of control in most RL libraries seems inverted

Looking at libraries like Stable Baselines 3, it seems to me that the direction of control is the opposite of what it ought to be.

As far as I can tell, almost all of the RL examples I see assume a well-defined “Environment”-style object that can evaluate actions and produce rewards in a single unified function (e.g. step in SB3’s env specification).
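To make the “library as orchestrator” pattern concrete, here is a minimal sketch of the conventional loop. This is not real SB3 or Gymnasium code; ToyEnv and the random placeholder policy are illustrative stand-ins, but the control direction matches the convention: the training loop owns control and calls env.step().

```python
# Conventional pattern: the training loop owns control and drives env.step().
# ToyEnv and the random "policy" are illustrative stand-ins, not a real library API.
import random

class ToyEnv:
    """Environment as the callee: one unified function evaluates an action
    and produces the reward."""
    def __init__(self):
        self.state = 0.0

    def reset(self):
        self.state = 0.0
        return self.state

    def step(self, action):
        # Apply the action, compute the reward, and signal termination,
        # all in one call.
        self.state += action
        reward = -abs(self.state)          # stay near zero
        done = abs(self.state) > 5.0
        return self.state, reward, done

def train(env, episodes=3):
    """The *library side* is the highest-level orchestrator here."""
    total = 0.0
    for _ in range(episodes):
        env.reset()
        done = False
        while not done:
            action = random.choice([-1.0, 1.0])   # placeholder policy
            _, reward, done = env.step(action)
            total += reward
    return total

train(ToyEnv())
```

The point of the sketch is that everything outside env.step() belongs to the library, which is exactly what makes it hard to interleave other stakeholders into the loop.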

There also seems to be an assumption that you train first and then deploy (in a “win the competition” style of workflow).

Now, this doesn’t fundamentally limit what one can do with something like SB3, but it makes a “real world” use case much harder.

Assume, for example, something like controlling a robot, where the robot’s “actions” might have many stakeholders (remote commands, emergency stops, hardcoded rules that supersede the RL action, hardcoded constraints that modify the action, etc.)

In this kind of environment (any real-world application) it would make sense for the RL component to be a “service”-style entity, as opposed to the highest-level orchestrator: something that exposes methods like recommend_action(inputs) (or predict(inputs)) and reward(reward, [prev_inputs, prev_outputs]) (or step, or train; the name is not that relevant).
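The inverted direction of control described above might look like the following sketch. The names recommend_action and reward come from the post itself; RLService, its toy value-update rule, and control_step are hypothetical illustrations, not an existing library’s API:

```python
# Inverted control: the RL component is a *service* the application calls.
# RLService and control_step are hypothetical sketches; only the method
# names recommend_action/reward are taken from the post.
class RLService:
    def __init__(self):
        self.q = {}        # toy running value estimate per action
        self.last = None   # (inputs, action) awaiting a reward

    def recommend_action(self, inputs):
        # Suggest the action with the highest current estimate.
        action = max((-1.0, 1.0), key=lambda a: self.q.get(a, 0.0))
        self.last = (inputs, action)
        return action

    def reward(self, reward, prev=None):
        # Credit the given (or most recent) recommendation with a
        # one-step running-average update.
        inputs, action = prev if prev is not None else self.last
        old = self.q.get(action, 0.0)
        self.q[action] = old + 0.1 * (reward - old)

def control_step(service, sensor_inputs, emergency_stop=False):
    """The *application* owns the loop: the RL suggestion is just one
    input, and hardcoded rules supersede or constrain it."""
    suggestion = service.recommend_action(sensor_inputs)
    if emergency_stop:
        return 0.0                              # rule supersedes RL entirely
    return max(-0.5, min(0.5, suggestion))      # constraint modifies the action
```

With this shape, emergency stops, remote commands, and hard constraints live in the orchestrator where they belong, and the RL service only ever sees the inputs it is handed and the rewards it is later told about.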

I am 95% sure that I simply “don’t get it” and that, for some reason, the kinds of interfaces that are common now are either better or necessary; but if someone could help alleviate my confusion about these design choices, I’d probably understand a bit more about the RL library ecosystem and the constraints under which it evolved.

submitted by /u/elcric_krej