Question about advantage calculation #60

Open
leeacord opened this issue Jun 17, 2024 · 0 comments
leeacord commented Jun 17, 2024

Hi,

I have a question regarding the implementation of the advantage calculation. The code snippet is as follows:

def actor_loss(self, seq, target):
  # Actions:      0   [a1]  [a2]   a3
  #                  ^  |  ^  |  ^  |
  #                 /   v /   v /   v
  # States:     [z0]->[z1]-> z2 -> z3
  # Targets:     t0   [t1]  [t2]
  # Baselines:  [v0]  [v1]   v2    v3
  # Entropies:        [e1]  [e2]
  # Weights:    [ 1]  [w1]   w2    w3
  # Loss:              l1    l2
  metrics = {}
  # Two states are lost at the end of the trajectory, one for the bootstrap
  # value prediction and one because the corresponding action does not lead
  # anywhere anymore. One target is lost at the start of the trajectory
  # because the initial state comes from the replay buffer.
  policy = self.actor(tf.stop_gradient(seq['feat'][:-2]))
  if self.config.actor_grad == 'dynamics':
    objective = target[1:]
  elif self.config.actor_grad == 'reinforce':
    baseline = self._target_critic(seq['feat'][:-2]).mode()
    advantage = tf.stop_gradient(target[1:] - baseline)
    action = tf.stop_gradient(seq['action'][1:-1])
    objective = policy.log_prob(action) * advantage

That is, the advantage in question is computed as:

  advantage = tf.stop_gradient(target[1:] - self._target_critic(seq['feat'][:-2]).mode())

Based on my understanding:

  • seq['feat'] contains time steps from 0 to horizon.
  • target contains time steps from 0 to horizon-1, since the value at the last step is used as a bootstrap for lambda_return.
  • Therefore, baseline in Line 271 covers time steps 0 to horizon-2, while target[1:] covers time steps 1 to horizon-1 (see the index sketch below).
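
To make the indexing concrete, here is a minimal sketch with a toy horizon of 3, matching the comment diagram above. It assumes a standard λ-return that uses the last value as the bootstrap; I have not checked that common.lambda_return follows exactly this convention, and all numbers and names below are made up for illustration:

import numpy as np

# Toy horizon of 3: states z0..z3, as in the comment diagram.
# `value` stands in for the target critic's outputs v(z0)..v(z3) and
# `reward` for the reward model's outputs r0..r3 (all values invented).
value = np.array([10.0, 11.0, 12.0, 13.0])
reward = np.array([0.1, 0.2, 0.3, 0.4])
disc, lam = 0.99, 0.95
H = len(value) - 1  # 3

# Generic lambda-return with value[-1] as bootstrap, so `target` has one
# step fewer than the state sequence: indices 0 .. H-1 (t0, t1, t2).
target = np.zeros(H)
last = value[-1]
for t in reversed(range(H)):
    last = reward[t] + disc * ((1 - lam) * value[t + 1] + lam * last)
    target[t] = last

baseline = value[:-2]              # v(z0), v(z1)  <- computed from feat[:-2]
advantage = target[1:] - baseline  # t1 - v(z0), t2 - v(z1)
for k, adv in enumerate(advantage):
    print(f"advantage[{k}] = target[{k + 1}] - value[{k}] = {adv:.3f}")

So element k of the advantage pairs the λ-return starting at step k+1 with the value of state k, which is where my question about the one-step shift comes from.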

If I understand correctly, the code uses $V_{t+1}^{\lambda} - v_\xi(\hat{z}_t)$ to calculate the advantage, not $V_t^{\lambda} - v_\xi(\hat{z}_t)$ as stated in the paper?
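
For reference, the recursive form of the λ-return I have in mind is the following (reward-indexing conventions differ between papers and implementations, so this is only how I read it, not a quote of the exact equation):

$$
V_t^{\lambda} = r_t + \gamma_t \left[ (1 - \lambda)\, v_\xi(\hat{z}_{t+1}) + \lambda\, V_{t+1}^{\lambda} \right], \qquad V_H^{\lambda} = v_\xi(\hat{z}_H)
$$

Under this definition, the two candidate advantages above differ by exactly one step of the recursion.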

I am very impressed with the Dreamer series of reinforcement learning algorithms. Thank you for your hard work!
