
Investigating the Utility of Mirror Descent in Off-policy Actor-Critic

Samuel Neumann, Jiamin He, Adam White, Martha White

Foundations · Friday, August 8 · Poster #41 · Accepted at RLC 2025

Abstract

Many policy gradient methods prevent drastic changes to policies during learning. This is commonly achieved through a Kullback-Leibler (KL) divergence term. Recent work has established a theoretical connection between this heuristic and Mirror Descent (MD), offering insight into the empirical successes of existing policy gradient and actor-critic algorithms. This insight has further motivated the development of novel algorithms that better adhere to the principles of MD, alongside a growing body of theoretical research on policy mirror descent. In this study, we examine the empirical feasibility of MD-based policy updates in off-policy actor-critic. Specifically, we introduce principled MD adaptations of three widely used actor-critic algorithms and systematically evaluate their empirical effectiveness. Our findings indicate that, while MD-style policy updates do not seem to exhibit significant practical advantages over conventional approaches to off-policy actor-critic, they can somewhat mitigate sensitivity to step-size selection with widely used deep-learning optimizers.
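
For context, the MD-based policy update the abstract refers to is usually written as a proximal step: at iteration k, with step size η_k, a standard policy mirror descent update with a KL Bregman divergence takes the form below. This is a common form from the policy mirror descent literature, written in our own notation rather than taken from the paper:

```latex
\pi_{k+1} = \arg\max_{\pi} \;
    \mathbb{E}_{s \sim d,\; a \sim \pi(\cdot \mid s)}\!\left[ Q^{\pi_k}(s, a) \right]
    - \frac{1}{\eta_k} \, \mathbb{E}_{s \sim d}\!\left[
        \mathrm{KL}\big( \pi(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s) \big)
    \right]
```

The KL term plays the role of the Bregman divergence in MD, keeping each new policy close to the previous one; the heuristic KL penalties used in existing policy gradient methods can be read as approximations of this proximal step.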
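A minimal sketch of what such an MD-style actor update might look like inside an off-policy actor-critic, assuming a PyTorch-style setup. All names here (`policy`, `prev_policy`, `q_fn`, `eta`) are hypothetical illustrations, not the paper's actual algorithms or implementation:

```python
import torch

def md_actor_loss(policy, prev_policy, q_fn, states, eta=0.1):
    """MD-style actor loss: maximize the critic's value under the current
    policy while penalizing KL divergence from a frozen copy of the
    previous policy.

    Assumptions (hypothetical, not from the paper): `policy(states)` and
    `prev_policy(states)` return torch.distributions objects of the same
    family, and `q_fn(states, actions)` returns critic value estimates.
    """
    dist = policy(states)                 # current policy pi(. | s)
    with torch.no_grad():
        prev_dist = prev_policy(states)   # frozen previous policy pi_k(. | s)

    actions = dist.rsample()              # reparameterized action sample
    q_values = q_fn(states, actions)      # critic estimate Q(s, a)

    # KL(pi || pi_k), averaged over the sampled states; requires both
    # distributions to be of a type pair with a registered KL formula.
    kl = torch.distributions.kl_divergence(dist, prev_dist).mean()

    # Ascend on Q minus the proximal KL term, so we return the negative
    # as a loss to minimize. Smaller eta means a tighter proximal step.
    return -(q_values.mean() - kl / eta)
```

In this reading, the step size η_k of the MD update becomes the inverse weight on the KL penalty, which is one way such updates could interact with the step-size sensitivity the abstract mentions.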