Multi-armed bandit online learning #2567
fabio-ciani asked this question in Q&A
Hi everybody,
I am using this package to implement some (combinatorial) multi-armed bandit algorithms backed by Gaussian processes for my master's thesis.
The idea is to split the training process into two phases:
1. offline, on historical data, to optimize the kernel hyperparameters;
2. online, in a simulated environment, with the hyperparameters learnt in the previous phase kept fixed, incorporating new points at each round.
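For context, here is a minimal sketch of what the two phases look like in my code. The model class, kernel, data, selection rule, and simulator feedback below are simplified placeholders for my actual setup (which uses grid interpolation):

```python
import torch
import gpytorch


class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        # Placeholder kernel; my real model wraps this in grid interpolation.
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel(ard_num_dims=4)
        )

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )


# Placeholder historical data: N points in the normalized 4D feature space.
hist_x = torch.rand(50, 4)
hist_y = torch.bernoulli(torch.full((50,), 0.5))

# Phase 1: offline optimization of the kernel hyperparameters.
likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(hist_x, hist_y, likelihood)
model.train()
likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
for _ in range(100):
    optimizer.zero_grad()
    loss = -mll(model(hist_x), hist_y)
    loss.backward()
    optimizer.step()

# Phase 2: online rounds with the hyperparameters kept fixed.
train_x, train_y = hist_x, hist_y
candidate_x = torch.rand(20, 4)  # placeholder arm feature vectors
model.eval()
likelihood.eval()
for t in range(200):
    with torch.no_grad(), gpytorch.settings.fast_pred_var():
        posterior = likelihood(model(candidate_x))
    arm = posterior.mean.argmax()                # placeholder selection rule
    reward = torch.bernoulli(torch.tensor(0.5))  # placeholder simulator feedback
    train_x = torch.cat([train_x, candidate_x[arm].unsqueeze(0)])
    train_y = torch.cat([train_y, reward.unsqueeze(0)])
    # Incorporate the new point without re-fitting the hyperparameters.
    model.set_train_data(train_x, train_y, strict=False)
```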
Unfortunately, my agent does not improve as the number of rounds increases: it quickly settles on a sub-optimal set of arms and never varies it afterwards.
I have already tried to address this with the approaches described in the notes below, inspired by your documentation tutorials. Nonetheless, even combining these suggestions did not resolve the issue.
Note (1): `get_fantasy_model()` crashes because a deep copy cannot be executed when grid interpolation is set up. As a workaround, I resorted to manually instantiating a new GP model every time and loading the learnt hyperparameters with PyTorch's `load_state_dict()` method.
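Concretely, the workaround looks roughly like this (reusing the placeholder `ExactGPModel` class from the sketch above):

```python
def replace_fantasy_model(old_model, likelihood, x_new, y_new):
    """Workaround for get_fantasy_model(): rebuild an exact GP on the augmented
    training set and copy over the hyperparameters learnt offline."""
    new_x = torch.cat([old_model.train_inputs[0], x_new.unsqueeze(0)])
    new_y = torch.cat([old_model.train_targets, y_new.unsqueeze(0)])
    new_model = ExactGPModel(new_x, new_y, likelihood)
    new_model.load_state_dict(old_model.state_dict())  # hyperparameters stay fixed
    return new_model
```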
Note (2): My research problem assumes Bernoulli random variables, so every reward is either a success (1) or a failure (0). However, Bernoulli likelihoods are not supported by exact GPs. Fitting the offline empirical averages of the rewards instead of the raw 0/1 feedback does not help either.
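For completeness, this is roughly what I mean by fitting empirical averages offline (the arm count, sample sizes, and probabilities below are made up):

```python
import torch

num_arms = 10
arm_features = torch.rand(num_arms, 4)   # one normalized 4D feature vector per arm
true_probs = torch.rand(num_arms).tolist()

# Historical 0/1 feedback per arm, collapsed into empirical success rates.
rewards_per_arm = [torch.bernoulli(torch.full((25,), p)) for p in true_probs]
hist_x = arm_features
hist_y = torch.stack([r.mean() for r in rewards_per_arm])  # targets in [0, 1]
```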
Note (3): At the moment, the input feature space is four-dimensional, and the inputs have been normalized.
I hope this description helps you answer my question, and I look forward to any advice. I can add more details if you wish or need them.
Thanks for your time.