Self-play vs. double oracle in MARL zero-sum games

From what I understand:

In self-play, an agent improves its strategy by playing directly against versions of itself (either historical copies or its current self). Essentially, it computes a best response to "everything it has been up to now." The double oracle algorithm, on the other hand, maintains explicit strategy sets for both players, which are expanded incrementally. But the expansion step looks similar: each player computes a best response against a mixture over the other player's current and past models, and adds the resulting new model to its own set.
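
To check my mental model, here is a minimal sketch of the double oracle loop on a zero-sum matrix game (my own toy illustration, not taken from any paper; `lp_row_solve` and `double_oracle` are names I made up). The exact argmax/argmin over the full payoff matrix plays the role of the best-response oracle that, in deep MARL, would instead be an approximately trained RL policy:

```python
import numpy as np
from scipy.optimize import linprog

def lp_row_solve(M):
    """Equilibrium mixed strategy for the row player of the zero-sum
    matrix game M (row player maximizes); returns (mix p, game value v)."""
    m, n = M.shape
    c = np.append(np.zeros(m), -1.0)                  # linprog minimizes, so minimize -v
    A_ub = np.hstack([-M.T, np.ones((n, 1))])         # v - (M^T p)_j <= 0 for every column j
    b_ub = np.zeros(n)
    A_eq = np.append(np.ones(m), 0.0).reshape(1, -1)  # probabilities sum to 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])
    return res.x[:m], res.x[-1]

def double_oracle(A, max_iters=100, tol=1e-9):
    """Double oracle on the full payoff matrix A (row player maximizes).
    R and C are the restricted strategy sets; the exact argmax/argmin
    oracles stand in for approximate best responses trained with RL."""
    R, C = [0], [0]
    for _ in range(max_iters):
        sub = A[np.ix_(R, C)]            # restricted game over current strategy sets
        p, v = lp_row_solve(sub)         # row player's mix in the restricted game
        q, _ = lp_row_solve(-sub.T)      # column player's mix (she maximizes -A^T)
        row_payoffs = A[:, C] @ q        # payoff of each pure row strategy vs q
        col_payoffs = p @ A[R, :]        # payoff (to row) of each pure column vs p
        br_row = int(np.argmax(row_payoffs))
        br_col = int(np.argmin(col_payoffs))
        # Stop once neither player's best response beats the restricted-game value.
        if row_payoffs[br_row] <= v + tol and col_payoffs[br_col] >= v - tol:
            break
        if br_row not in R: R.append(br_row)
        if br_col not in C: C.append(br_col)
    return R, C, p, q, v

# Rock-paper-scissors: the loop should grow R and C to all three strategies
# and return the uniform mixture with value 0.
A = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])
print(double_oracle(A))
```

With neural networks, the two argmax/argmin lines become full RL training runs against the fixed mixture (p, q), and the clean termination test above is exactly what becomes fuzzy, which is what my stopping question below is about.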

Am I understanding this right? The papers I've seen on these notions are, quite frankly, too complicated for me, but the underlying principles don't seem that deep. Am I missing something major in my understanding of these algorithms?

In both cases, when do we stop the search for a best response when we use a function approximator such as a neural network?

Is one better than the other? What are the trade-offs? Is one more likely to converge to a Nash equilibrium?
