In HW1 MLP_policy.py
If discrete is true, I think the output should be a one-hot vector, or at least when you take actions you should take the argmax in utils.py (see the sketch below). I'm a newbie in RL and not 100% sure, so please take a look. In HW1 all problems are continuous, which is perhaps why your code still works.
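A minimal sketch of what I mean, not the repository's actual code: treat the network output as logits of a Categorical distribution and either take the argmax or sample an action index, instead of returning the raw logits as the action. The function name and `greedy` flag are hypothetical.

```python
import torch
from torch import distributions

def get_action_discrete(logits_na: torch.Tensor, greedy: bool = True) -> torch.Tensor:
    """logits_na: (batch, num_actions) raw network outputs."""
    if greedy:
        # Deterministic choice: index of the highest-scoring action.
        return torch.argmax(logits_na, dim=-1)
    # Stochastic choice: sample an action index from the categorical distribution.
    dist = distributions.Categorical(logits=logits_na)
    return dist.sample()
```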
Thanks to the author for the great code for the CS285 2020 Fall homework!
There is a small problem in hw1.
In hw1 cs285/policies/MLP_policy.py, the author uses a deterministic policy (actions are output directly through self.mean_net).
This looks incorrect, since self.logstd is also defined in the original cs285/policies/MLP_policy.py, which is part of a stochastic policy.
In addition, I found that after changing the author's code from a deterministic to a stochastic policy, the BC performance on Ant-v2 drops from about 4k to 1.4k.
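For reference, a minimal sketch of the stochastic (Gaussian) policy I mean, assuming the hw1 attribute names `mean_net` and `logstd`: build a Normal distribution from the mean network and the learned log-std instead of returning the mean directly. This is an illustration under those assumptions, not the author's implementation.

```python
import torch
from torch import nn, distributions

def gaussian_policy_forward(mean_net: nn.Module,
                            logstd: torch.Tensor,
                            observation: torch.Tensor) -> distributions.Distribution:
    mean = mean_net(observation)   # (batch, ac_dim)
    std = torch.exp(logstd)        # (ac_dim,), broadcast over the batch
    return distributions.Normal(mean, std)

# Behavior cloning would then maximize the log-probability of expert actions, e.g.:
#   dist = gaussian_policy_forward(mean_net, logstd, obs)
#   loss = -dist.log_prob(expert_actions).sum(dim=-1).mean()
```

Part of the lower return may simply be the extra sampling noise at evaluation time when actions are sampled rather than taken at the mean.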