In HW1 MLP_policy.py
If discrete is true, I think the output should be a one-hot vector, or at least when you take actions you should take the argmax in utils.py (see the sketch below). I'm a newbie in RL and not 100% sure, so please take a look. In HW1 all problems are continuous, which is perhaps why your code still works.
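A minimal sketch of what I mean, not the repository's actual code: treat the network output as logits of a Categorical distribution and either take the argmax or sample an action index, instead of returning the raw logits as the action. The function name and `greedy` flag are hypothetical.

```python
import torch
from torch import distributions

def get_action_discrete(logits_na: torch.Tensor, greedy: bool = True) -> torch.Tensor:
    """logits_na: (batch, num_actions) raw network outputs."""
    if greedy:
        # Deterministic choice: index of the highest-scoring action.
        return torch.argmax(logits_na, dim=-1)
    # Stochastic choice: sample an action index from the categorical distribution.
    dist = distributions.Categorical(logits=logits_na)
    return dist.sample()
```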
Thanks to the author for the great code for the CS285 2020 Fall homework!
There is a small problem in hw1.
In hw1 cs285/policies/MLP_policy.py, the author uses a deterministic policy (actions are output directly through self.mean_net).
This looks incorrect, since self.logstd is also defined in the original cs285/policies/MLP_policy.py, which is part of a stochastic policy.
In addition, I found that after changing the author's code from a deterministic to a stochastic policy, the BC performance on Ant-v2 drops from about 4k to 1.4k.
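For reference, a minimal sketch of the stochastic (Gaussian) policy I mean, assuming the hw1 attribute names `mean_net` and `logstd`: build a Normal distribution from the mean network and the learned log-std instead of returning the mean directly. This is an illustration under those assumptions, not the author's implementation.

```python
import torch
from torch import nn, distributions

def gaussian_policy_forward(mean_net: nn.Module,
                            logstd: torch.Tensor,
                            observation: torch.Tensor) -> distributions.Distribution:
    mean = mean_net(observation)   # (batch, ac_dim)
    std = torch.exp(logstd)        # (ac_dim,), broadcast over the batch
    return distributions.Normal(mean, std)

# Behavior cloning would then maximize the log-probability of expert actions, e.g.:
#   dist = gaussian_policy_forward(mean_net, logstd, obs)
#   loss = -dist.log_prob(expert_actions).sum(dim=-1).mean()
```

Part of the lower return may simply be the extra sampling noise at evaluation time when actions are sampled rather than taken at the mean.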