Research · arXiv cs.LG · May 11

Sample-Mean Anchored Thompson Sampling for Offline-to-Online Learning with Distribution Shift

Researchers propose a Thompson sampling (TS) variant that anchors posterior inference to offline sample means, addressing a fundamental asymmetry in offline-to-online bandit learning. Unlike the indices used by UCB-based approaches, Thompson sampling's randomly sampled indices carry no inherent optimism guarantee when logged and live data are combined, which complicates both comparison with UCB and deployment. This work bridges that gap by enforcing mean-anchored posteriors, making TS competitive with UCB methods under distribution shift. The contribution matters for practitioners moving from batch to streaming settings, where bandit algorithms power recommendation and resource allocation systems.

Modelwire context

Explainer

The paper identifies a specific theoretical liability of Thompson sampling that UCB doesn't have: when posteriors are fit to offline data and then deployed online, TS lacks an inherent optimism bias to guide exploration toward high-value regions under distribution shift. Mean-anchoring is the proposed fix, but the contribution is narrower than 'making TS competitive with UCB' suggests: it closes one gap in one setting.
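The asymmetry can be illustrated with a minimal sketch. It assumes Gaussian rewards with known unit variance and a flat prior, and it uses textbook UCB1-style and Gaussian Thompson sampling indices. The function names, constants, and offline statistics are hypothetical choices for illustration; this is not the paper's mean-anchored algorithm, only a picture of the gap it targets.

```python
import math
import random

def ucb_index(mean, n, t, c=2.0):
    """UCB1-style index: sample mean plus a nonnegative exploration bonus."""
    return mean + math.sqrt(c * math.log(t) / n)

def ts_index(mean, n, rng, sigma=1.0):
    """Gaussian TS index: one draw from the posterior N(mean, sigma^2 / n)."""
    return rng.gauss(mean, sigma / math.sqrt(n))

# Hypothetical offline statistics for a single arm (illustrative values).
rng = random.Random(0)
offline_mean, offline_n, t = 0.5, 20, 100

# The UCB index can never fall below the offline sample mean.
ucb = ucb_index(offline_mean, offline_n, t)

# The TS index is symmetric around the sample mean, so roughly half of
# the draws land below it: there is no built-in optimism.
ts_draws = [ts_index(offline_mean, offline_n, rng) for _ in range(10_000)]
frac_below = sum(d < offline_mean for d in ts_draws) / len(ts_draws)

print(f"UCB index: {ucb:.3f} (sample mean {offline_mean})")
print(f"Share of TS draws below the sample mean: {frac_below:.2f}")
```

Under these assumptions the UCB index always sits at or above the sample mean, while a TS draw is pessimistic about as often as it is optimistic. If the offline mean is biased by distribution shift, a pessimistic draw can starve an arm of the online pulls needed to correct it.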

This connects directly to the adversarial kernelized bandits work from the same day, which also proved robustness guarantees for bandit algorithms when reward structures shift unexpectedly. Both papers treat distribution shift as a central deployment concern rather than an afterthought. The offline-to-online framing also echoes the safe offline RL paper's focus on batch-to-deployment transitions, though that work prioritizes safety constraints while this one prioritizes regret bounds. Together, they reflect a growing recognition that the gap between training data and live feedback is where bandit and RL systems actually break.

If practitioners adopt mean-anchored Thompson sampling in production recommendation systems and report regret comparable to or better than UCB baselines within the next 12 months, that would validate its practical relevance. If the method remains confined to theory papers without empirical deployment, the contribution stays primarily academic.

Coverage we drew on

Nearly-Optimal Algorithm for Adversarial Kernelized Bandits · arXiv cs.LG

This analysis is generated by Modelwire’s editorial layer from our archive and the summary above. It is not a substitute for the original reporting.

Mentions: Thompson sampling · UCB · offline-to-online learning · bandit algorithms

