RLC 2026

Non-Stationary Latent Auto-Regressive Bandits

Anna L. Trella, Walter H. Dempsey, Asim Gazi, Ziping Xu, Finale Doshi-Velez, Susan Murphy

Theoretical RL, Bandits · Thursday, August 7 · Poster #40 · Accepted at RLC 2025

Abstract

For the non-stationary multi-armed bandit (MAB) problem, many existing methods allow a general mechanism for the non-stationarity but rely on a non-stationarity budget that is sub-linear in the total number of time steps $T$. In many real-world settings, however, the mechanism for the non-stationarity can be modeled, yet no such budget exists. We instead consider the non-stationary bandit problem in which the reward means change due to a latent, auto-regressive (AR) state. We develop Latent AR LinUCB (LARL), an online linear contextual bandit algorithm that does not rely on a non-stationarity budget; instead, it forms predictions of the reward means by implicitly predicting the latent state. The key idea is to reduce the problem to a linear dynamical system, which can be solved as a linear contextual bandit. In fact, LARL approximates a steady-state Kalman filter and efficiently learns the system parameters online. We provide an interpretable regret bound for LARL with respect to the level of non-stationarity in the environment: LARL achieves sub-linear regret in this setting if the noise variance of the latent state process is sufficiently small with respect to $T$. Empirically, LARL outperforms various baseline methods on this non-stationary bandit problem.
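
To make the reduction concrete, here is a minimal Python sketch of the idea described in the abstract, not the paper's LARL algorithm itself: a LinUCB-style learner whose context is augmented with recent rewards, so that a linear model over reward history implicitly tracks a latent AR(1) state, in the spirit of a steady-state Kalman filter. Every quantity in the snippet (the AR coefficient `gamma`, the per-arm means `mu`, the sensitivities `c`, the history length `p`, and the helper names) is a hypothetical choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# ---- Hypothetical environment (all parameters invented for illustration) ----
K, T, p = 3, 2000, 2                 # arms, horizon, reward-history length
gamma, sigma_z = 0.95, 0.05          # AR(1) coefficient and latent noise scale
mu = np.array([0.2, 0.5, 0.8])       # per-arm baseline reward means
c = np.array([1.0, 0.0, -1.0])       # per-arm sensitivity to the latent state
sigma_r = 0.1                        # reward observation noise

def feat(a, hist):
    """Augmented context: one-hot arm indicator plus arm-specific copies of
    the last p observed rewards (a crude linear proxy for the latent state)."""
    phi = np.zeros(K * (1 + p))
    phi[a] = 1.0
    phi[K + a * p : K + (a + 1) * p] = hist
    return phi

def run_sketch(alpha=1.0, lam=1.0):
    d = K * (1 + p)
    A = lam * np.eye(d)              # ridge-regression Gram matrix
    b = np.zeros(d)
    z, hist = 0.0, [0.0] * p         # latent state; recent rewards (zero-padded)
    regret = 0.0
    for _ in range(T):
        z = gamma * z + sigma_z * rng.standard_normal()
        theta_hat = np.linalg.solve(A, b)
        A_inv = np.linalg.inv(A)
        # LinUCB action selection on the augmented features.
        scores = []
        for a in range(K):
            phi = feat(a, hist)
            scores.append(phi @ theta_hat + alpha * np.sqrt(phi @ A_inv @ phi))
        a_t = int(np.argmax(scores))
        r = mu[a_t] + c[a_t] * z + sigma_r * rng.standard_normal()
        phi = feat(a_t, hist)
        A += np.outer(phi, phi)      # standard online ridge update
        b += r * phi
        hist = [r] + hist[:-1]       # slide the reward-history window
        regret += (mu + c * z).max() - (mu[a_t] + c[a_t] * z)
    return regret

print(f"cumulative regret of the sketch over T={T}: {run_sketch():.1f}")
```

Because each arm's reward mean drifts with the shared latent state, the reward-history features let the learner anticipate which arm is currently best; a stationary contextual bandit without those features would chase a moving target.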