Struggling with a from-scratch PPO implementation (Gymnasium)

I've been working on a from-scratch PPO implementation for the past 5 months. I'm doing almost everything from scratch, apart from numerical computation libraries such as NumPy. It started with supervised learning networks and grew into this, and I just can't seem to get it working. Every paper I read is either (a) outdated/incorrect or (b) incomplete: no paper gives a full description of what they do and which hyperparameters they use. I tried reading the SB3 code, but it's too different from my implementation, and with so many files I just can't find the little nitty-gritty details. So I'm going to post my backward method; if someone is willing to read it and point out mistakes or make recommendations, that would be great!

Side notes: I wrote the optimizer myself (standard gradient descent), and the critic takes only the state as input. I'm not using GAE, as I'm trying to minimize potential failure points. All hyperparameters are at standard values.
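For context on the no-GAE setup: the discounted return G_t can be computed in a single backward pass over the trajectory (O(T)) rather than with a nested loop (O(T²)). A minimal NumPy sketch, with function and variable names of my own choosing (not from the code below):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Reward-to-go G_t = sum_k gamma^k * r_{t+k}, one backward pass."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        # G_t = r_t + gamma * G_{t+1}
        running = rewards[t] + gamma * running
        G[t] = running
    return G

# Three steps of reward 1.0 with gamma = 0.5:
# G = [1 + 0.5*(1 + 0.5*1), 1 + 0.5*1, 1] = [1.75, 1.5, 1.0]
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))
```

The recursion G_t = r_t + gamma * G_{t+1} gives the same values as the double loop, just without recomputing the tail sum at every step.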

```python
def backward(self):
    T = len(self.trajectory['actions'])
    for i in range(T):
        G = 0
        for j in range(i, T):
            current = self.trajectory['rewards'][j]
            G += current * pow(self.gamma, j - i)
        # G = np.clip(G, 0, 15)

        # CRITIC STUFF
        if np.isnan(G):
            break
        state_t = self.trajectory['states'][i]
        action_t = self.trajectory['actions'][i]

        # Calculate critic value for state_t
        critic_value = self.critic(state_t)
        # print(f"Critic: {critic_value}")
        # print(f"G: {G}")

        # Calculate advantage for state-action pair
        advantages = G - critic_value
        # print(f"""Return: {G}
        # Expected Return: {critic}""")

        # OLD PARAMS STUFF
        new_policy = self.forward(state_t, 1000)

        # PPO STUFF
        ratio = new_policy / action_t
        clipped_ratio = np.clip(ratio, 1.0 - self.clip, 1.0 + self.clip)
        surrogate_loss = -np.minimum(ratio * advantages, clipped_ratio * advantages)
        # entropy_loss = -np.mean(np.sum(action_t * np.log(action_t), axis=1))

        # Param Vector
        weights_w = self.hidden.weights.flatten()
        weights_x = self.hidden.bias.flatten()
        weights_y = self.output.weights.flatten()
        weights_z = self.output.bias.flatten()
        weights_w = np.concatenate((weights_w, weights_x))
        weights_w = np.concatenate((weights_w, weights_y))
        param_vec = np.concatenate((weights_w, weights_z))
        param_vec.flatten()

        loss = np.mean(surrogate_loss)  # + self.l2_regularization(param_vec)
        # print(f"loss: {loss}")

        # BACKPROPAGATION
        next_weights = self.output.weights
        self.hidden.layer_loss(next_weights, loss, tanh_derivative)
        self.hidden.zero_grad()
        self.output.zero_grad()
        self.hidden.backward()
        self.output.backward(loss)
        self.hidden.update_weights()
        self.output.update_weights()
        self.critic_backward(G)
```
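For comparison with the ratio line above: in the PPO paper the ratio is pi_new(a|s) / pi_old(a|s), a ratio of the probabilities the two policies assign to the *taken* action, which in practice is computed from stored log-probs rather than by dividing the policy output by the action itself. A hedged sketch of that standard formulation, assuming old log-probs are saved at rollout time (`ppo_clip_loss` and its argument names are illustrative, not part of the code above):

```python
import numpy as np

def ppo_clip_loss(new_log_prob, old_log_prob, advantage, clip=0.2):
    """Clipped surrogate loss for one (state, action) sample.

    ratio = exp(log pi_new(a|s) - log pi_old(a|s))
    """
    ratio = np.exp(new_log_prob - old_log_prob)
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip)
    # Negated because we minimize the loss but maximize the objective
    return -np.minimum(ratio * advantage, clipped * advantage)

# New policy assigns the taken action twice the old probability; with a
# positive advantage, the clip caps the objective at (1 + clip) * advantage:
print(ppo_clip_loss(np.log(0.4), np.log(0.2), advantage=1.0, clip=0.2))  # -1.2
```

The key point is that `old_log_prob` is frozen at collection time, so repeated epochs over the same batch still measure how far the updated policy has drifted.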

submitted by /u/meh_coder