Why do we need the features of *all* rounds to predict the final reward? #19
Unanswered
wongsingfo asked this question in Q&A
Replies: 2 comments
-
I'm not sure. I was following Suphx's method because it has been shown to work. Maybe you could run an experiment that replaces the GRU part with a 2-layer MLP with the same number of parameters, and see whether the performance is the same.
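To make the "same number of parameters" comparison concrete, here is a minimal sketch of how one might size such an MLP. It assumes the PyTorch `nn.GRU` parameter layout (three gates, separate input and hidden biases). The 7 input features come from the question (4 scores, grand_kyoku, honba, kyotaku); the hidden size of 64, the single GRU layer, and all function names are assumptions for illustration, not values taken from Mortal or Suphx.

```python
def gru_param_count(input_size, hidden_size, num_layers=1):
    """Parameters of a stacked unidirectional GRU (PyTorch layout:
    3 gates, each with input weights, hidden weights, and two biases)."""
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        total += 3 * hidden_size * (layer_input + hidden_size + 2)
        layer_input = hidden_size  # deeper layers read the previous layer's output
    return total


def mlp_param_count(input_size, width, output_size):
    """Parameters of a 2-layer MLP (input -> width -> output) with biases."""
    return (input_size * width + width) + (width * output_size + output_size)


def matching_mlp_width(target_params, input_size, output_size):
    """Largest hidden width whose 2-layer MLP stays within target_params."""
    width = (target_params - output_size) // (input_size + output_size + 1)
    return max(1, width)


N_FEATURES = 7   # 4 scores + grand_kyoku + honba + kyotaku (from the question)
HIDDEN = 64      # assumption: hypothetical GRU hidden size

gru_params = gru_param_count(N_FEATURES, HIDDEN)
# The MLP sees only the current round's features and emits a HIDDEN-sized
# vector, so it can be dropped in where the GRU's final state was used.
width = matching_mlp_width(gru_params, N_FEATURES, HIDDEN)
mlp_params = mlp_param_count(N_FEATURES, width, HIDDEN)

print(f"GRU parameters: {gru_params}")
print(f"MLP width: {width}, MLP parameters: {mlp_params}")
```

The MLP here is fed only the i-th round's features, so any gap in prediction quality between the two models would be attributable to the history rather than to model capacity.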
-
I think the assumption here is that a player (whether human or AI) tends to use the same strategy in all rounds, so you can predict a player's behaviour from their actions in previous rounds.
-
Both Mortal and Suphx [1] use a global reward predictor to predict the final game reward at the start of the i-th round. The predictor takes as input the features (i.e. the scores of the 4 players, grand_kyoku, honba, and kyotaku) of not only the i-th round but also all previous rounds.
Why do we need the features from before the i-th round? I would expect them to be irrelevant to the final reward once the i-th round's features are known. In other words, no matter how well or how poorly the player performed from the first round to the (i-1)-th round, the expected final ranking should be the same as long as the features of the i-th round are the same.
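Stated a bit more formally, and using notation introduced here only for illustration ($x_k$ for the feature vector at the start of round $k$, $R$ for the final reward), the claim is a Markov-style condition:

$$
\mathbb{E}\left[R \mid x_1, x_2, \dots, x_i\right] = \mathbb{E}\left[R \mid x_i\right]
$$

If this equality holds, a predictor that sees only $x_i$ loses nothing. It can fail if the earlier rounds carry information about the players that $x_i$ does not contain, for example their playing strategies.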
[1] Suphx: Mastering Mahjong with Deep Reinforcement Learning. arXiv preprint arXiv:2003.13590, 2020. Section 3.2.