Multi-armed bandit online learning #2567
fabio-ciani asked this question in Q&A
Hi everybody,
I am using this package to implement some (combinatorial) multi-armed bandit algorithms backed by Gaussian processes for my master's thesis.
The idea is to split the training process into two phases:
1. offline, on historical data, to optimize the kernel hyperparameters;
2. online, in a simulated environment, with the hyperparameters learnt in the previous phase kept fixed, incorporating new points at each round.
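For context, here is a minimal sketch of what the two phases look like in my code. The model class, kernel, data, selection rule, and simulator feedback below are simplified placeholders for my actual setup (which uses grid interpolation):

```python
import torch
import gpytorch


class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        # Placeholder kernel; my real model wraps this in grid interpolation.
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel(ard_num_dims=4)
        )

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )


# Placeholder historical data: N points in the normalized 4D feature space.
hist_x = torch.rand(50, 4)
hist_y = torch.bernoulli(torch.full((50,), 0.5))

# Phase 1: offline optimization of the kernel hyperparameters.
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(hist_x, hist_y, likelihood)
model.train()
likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
for _ in range(100):
    optimizer.zero_grad()
    loss = -mll(model(hist_x), hist_y)
    loss.backward()
    optimizer.step()

# Phase 2: online rounds with the hyperparameters kept fixed.
train_x, train_y = hist_x, hist_y
candidate_x = torch.rand(20, 4)  # placeholder arm feature vectors
model.eval()
likelihood.eval()
for t in range(200):
    with torch.no_grad(), gpytorch.settings.fast_pred_var():
        posterior = likelihood(model(candidate_x))
    arm = posterior.mean.argmax()                # placeholder selection rule
    reward = torch.bernoulli(torch.tensor(0.5))  # placeholder simulator feedback
    train_x = torch.cat([train_x, candidate_x[arm].unsqueeze(0)])
    train_y = torch.cat([train_y, reward.unsqueeze(0)])
    # Incorporate the new point without re-fitting the hyperparameters.
    model.set_train_data(train_x, train_y, strict=False)
```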
Unfortunately, my agent does not improve as the number of rounds increases: it quickly settles on a sub-optimal set of arms and never varies it afterwards.
I have already tried to address this with the approaches described in the notes below, inspired by your documentation tutorials. Nonetheless, even combining these suggestions did not resolve the issue.
Note (1): `get_fantasy_model()` crashes because a deep copy cannot be executed when grid interpolation is set up. As a workaround, I resorted to manually instantiating a new GP model every time and loading the learnt hyperparameters with PyTorch's `load_state_dict()` method.
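Concretely, the workaround looks roughly like this (reusing the placeholder `ExactGPModel` class from the sketch above):

```python
def replace_fantasy_model(old_model, likelihood, x_new, y_new):
    """Workaround for get_fantasy_model(): rebuild an exact GP on the augmented
    training set and copy over the hyperparameters learnt offline."""
    new_x = torch.cat([old_model.train_inputs[0], x_new.unsqueeze(0)])
    new_y = torch.cat([old_model.train_targets, y_new.unsqueeze(0)])
    new_model = ExactGPModel(new_x, new_y, likelihood)
    new_model.load_state_dict(old_model.state_dict())  # hyperparameters stay fixed
    return new_model
```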
Note (2): My research problem assumes Bernoulli random variables, so every reward is either a success (1) or a failure (0). However, Bernoulli likelihoods are not supported by exact GPs. Fitting the offline empirical averages of the rewards instead of the raw 0/1 feedback does not help either.
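For completeness, this is roughly what I mean by fitting empirical averages offline (the arm count, sample sizes, and probabilities below are made up):

```python
import torch

num_arms = 10
arm_features = torch.rand(num_arms, 4)   # one normalized 4D feature vector per arm
true_probs = torch.rand(num_arms).tolist()

# Historical 0/1 feedback per arm, collapsed into empirical success rates.
rewards_per_arm = [torch.bernoulli(torch.full((25,), p)) for p in true_probs]
hist_x = arm_features
hist_y = torch.stack([r.mean() for r in rewards_per_arm])  # targets in [0, 1]
```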
Note (3): At the moment, the input feature space is four-dimensional, and the inputs have been normalized.
I hope this description helps you answer my question, and I look forward to any advice. I can add more details if you wish or need them.
Thanks for your time.