
About PPO #24

Open
LpLegend opened this issue Jan 20, 2021 · 4 comments

Comments

@LpLegend

I don't think this code can solve the problem (Pendulum), and I have another question: why is the reward computed as 'running_reward * 0.9 + score * 0.1'?
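For what it's worth, that expression looks like an exponential moving average of the episode return, used to smooth the logged score rather than as a training signal. A minimal, self-contained sketch with made-up scores:

```python
def update_running_reward(running_reward, score):
    # Exponential moving average: keep 90% of the old estimate,
    # mix in 10% of the newest episode score.
    return running_reward * 0.9 + score * 0.1

# Made-up episode scores, only to show how the smoothed value evolves.
running_reward = 0.0
for score in [-1500.0, -1200.0, -900.0, -800.0]:
    running_reward = update_running_reward(running_reward, score)
    print(f"score={score:.0f}  running_reward={running_reward:.1f}")
```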

@LpLegend
Author

I have changed the activation function from ReLU to tanh, but there is no improvement.

@heyfavour

heyfavour commented Jul 21, 2021

I don't think this code can solve the problem (Pendulum), and I have another question: why is the reward computed as 'running_reward * 0.9 + score * 0.1'?

I ran into this problem too. I asked the author of elegantrl, and he said that applying tanh first and then sampling the action through torch.distributions affects the entropy, so it cannot converge. But I don't like elegantrl's PPO implementation, so I'm still looking for other people's code.
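For anyone else hitting this: the usual workaround (a rough sketch, not this repo's code) is to sample from the untransformed Gaussian and let torch.distributions apply the tanh squashing, so the log-probability carries the change-of-variables correction instead of being distorted. Assuming a Pendulum-style 1-D action scaled to [-2, 2]:

```python
import torch
from torch.distributions import Normal, TransformedDistribution, TanhTransform

# Rough sketch (not this repo's code): sample from an untransformed Gaussian
# and let TransformedDistribution apply the tanh squashing, so log_prob
# includes the change-of-variables correction.
mean = torch.zeros(1, requires_grad=True)
log_std = torch.zeros(1, requires_grad=True)

base = Normal(mean, log_std.exp())
squashed = TransformedDistribution(base, [TanhTransform(cache_size=1)])

action = squashed.rsample()           # already in (-1, 1)
log_prob = squashed.log_prob(action)  # tanh Jacobian term is handled here
env_action = 2.0 * action             # Pendulum actions live in [-2, 2]
```

Note that the squashed distribution has no closed-form entropy, so a PPO entropy bonus would have to be estimated from -log_prob of sampled actions.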

@CoulsonZhao

Have you found working code yet? Could you share a link? Much appreciated!

@huang-chunyang

I don't think this code can solve the problem (Pendulum), and I have another question: why is the reward computed as 'running_reward * 0.9 + score * 0.1'?

You can change clip_param from 0.2 to 0.1, which constrains the trust region more tightly. This method can work!
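To show where clip_param enters, here is the standard PPO clipped surrogate loss (a generic sketch, not necessarily this repo's exact code); a smaller clip_param keeps each policy update closer to the old policy:

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, clip_param=0.1):
    # Standard PPO clipped surrogate objective; a smaller clip_param
    # (0.1 instead of 0.2) limits how far the probability ratio can move.
    ratio = torch.exp(log_prob_new - log_prob_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param) * advantage
    return -torch.min(unclipped, clipped).mean()
```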
