Contextual Bandit Theory

What Is a Contextual Bandit?

A contextual bandit is a machine learning model used to make sequential decisions where side information (context) is available before choosing an action. It balances exploration (trying new actions) and exploitation (choosing the best-known action), conditioned on observed context.

Formal Definition

At each time step $t = 1, 2, \dots, T$:

  1. Observe context $x_t \in \mathcal{X}$
  2. Choose action $a_t \in \mathcal{A}$
  3. Receive reward $r_t(a_t)$
  4. Update policy to improve future choices

The objective is to maximise cumulative reward or minimise regret over time.
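The four steps above can be sketched as a minimal Python loop. The two-context environment and reward function below are illustrative assumptions (not part of the formal definition), and the policy is a simple per-context epsilon-greedy learner:

```python
import random

def reward(context, action):
    # Hypothetical reward: action 0 pays off in context "A", action 1 in "B".
    return 1.0 if (context == "A") == (action == 0) else 0.0

def run(T=1000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    counts = {}   # pulls per (context, action)
    values = {}   # running mean reward per (context, action)
    total = 0.0
    for t in range(T):
        x = rng.choice(["A", "B"])            # 1. observe context x_t
        if rng.random() < epsilon:            # 2. choose action a_t
            a = rng.choice([0, 1])            #    (explore)
        else:
            a = max([0, 1], key=lambda act: values.get((x, act), 0.0))
        r = reward(x, a)                      # 3. receive reward r_t(a_t)
        total += r
        n = counts.get((x, a), 0) + 1         # 4. update the policy's estimates
        counts[(x, a)] = n
        v = values.get((x, a), 0.0)
        values[(x, a)] = v + (r - v) / n      # incremental mean update
    return total / T
```

As the estimates converge, the average reward approaches the optimum of 1.0 for this toy environment.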


Why Context Matters

Unlike traditional multi-armed bandits, contextual bandits condition decisions on environmental signals or user data, enabling personalised recommendations or targeted actions.

Example (online advertising):

  • Context: User profile, location, device
  • Actions: Ads to display
  • Reward: Click or conversion

Algorithms

1. Epsilon-Greedy

  • Explore uniformly at random with probability $\varepsilon$; otherwise exploit the action with the highest estimated reward.
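The selection rule can be written as a small function (the `values` mapping from action to estimated mean reward is an illustrative interface, not a fixed API):

```python
import random

def epsilon_greedy(values, actions, epsilon, rng):
    """With probability epsilon pick a uniformly random action (explore);
    otherwise pick the action with the highest estimated reward (exploit)."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: values.get(a, 0.0))

rng = random.Random(0)
values = {0: 0.2, 1: 0.7, 2: 0.5}
picks = [epsilon_greedy(values, [0, 1, 2], 0.1, rng) for _ in range(1000)]
# Action 1 dominates; actions 0 and 2 appear only through exploration.
```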

2. Upper Confidence Bound (UCB)

  • Choose:

$$ a_t = \arg\max_a \left( \hat{r}_t(a) + \beta \cdot \sqrt{\frac{\ln t}{n_t(a)}} \right) $$

  • Balances estimated reward and uncertainty: $\hat{r}_t(a)$ is the estimated mean reward of action $a$ and $n_t(a)$ the number of times it has been chosen, so rarely tried actions receive a larger exploration bonus.
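The argmax above can be sketched directly, with `rewards[a]` holding the cumulative reward of action $a$ and `counts[a]` its pull count $n_t(a)$ (an illustrative interface):

```python
import math

def ucb_action(rewards, counts, t, beta=1.0):
    """UCB rule from the formula above: pull every action once, then pick
    the argmax of mean reward plus beta * sqrt(ln t / n_t(a))."""
    best, best_score = None, float("-inf")
    for a, n in counts.items():
        if n == 0:
            return a  # force at least one pull of every action
        score = rewards[a] / n + beta * math.sqrt(math.log(t) / n)
        if score > best_score:
            best, best_score = a, score
    return best
```

With equal pull counts the higher-mean arm wins; with very unequal counts the exploration bonus can make a rarely tried, lower-mean arm win instead.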

3. Thompson Sampling

  • Use Bayesian posterior over reward models.
  • Sample from the posterior to guide exploration.
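For Bernoulli rewards, a common concrete instance is Beta-Bernoulli Thompson sampling (a minimal sketch; the success/failure counters are an assumed bookkeeping interface):

```python
import random

def thompson_action(successes, failures, rng):
    """Sample a mean-reward estimate from each arm's Beta posterior
    (with a uniform Beta(1, 1) prior) and play the argmax."""
    samples = {a: rng.betavariate(successes[a] + 1, failures[a] + 1)
               for a in successes}
    return max(samples, key=samples.get)

rng = random.Random(0)
# Arm 1 has a much better observed record, so its posterior samples
# are almost always the largest.
succ, fail = {0: 10, 1: 90}, {0: 90, 1: 10}
picks = [thompson_action(succ, fail, rng) for _ in range(1000)]
```

The posterior sampling does the exploration automatically: an arm with wide posterior uncertainty occasionally draws a high sample and gets tried.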

4. Policy Gradient (Neural Bandits)

  • Approximate a policy $\pi(a \mid x)$ using a neural network.
  • Update parameters based on observed reward and gradient feedback.
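A REINFORCE-style update for a softmax policy $\pi(a \mid x)$ can be sketched with a linear "network" (one weight per context-action pair) standing in for a neural one; the environment and learning rate are illustrative assumptions:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def train(T=3000, lr=0.5, seed=0):
    rng = random.Random(seed)
    contexts, actions = ["A", "B"], [0, 1]
    w = {(x, a): 0.0 for x in contexts for a in actions}
    for _ in range(T):
        x = rng.choice(contexts)
        probs = softmax([w[(x, a)] for a in actions])
        a = rng.choices(actions, weights=probs)[0]
        # Hypothetical reward: action 0 is right in "A", action 1 in "B".
        r = 1.0 if (x == "A") == (a == 0) else 0.0
        # Gradient of log pi(a|x) w.r.t. w[(x, b)] is 1[b == a] - pi(b|x).
        for b in actions:
            w[(x, b)] += lr * r * ((1.0 if b == a else 0.0) - probs[b])
    return w

w = train()
# The learned policy should strongly prefer the rewarded action per context.
```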

Regret Analysis

Regret quantifies the loss due to not always choosing the optimal action:

$$ \text{Regret}_T = \sum_{t=1}^{T} \left[ \max_{a \in \mathcal{A}} \mathbb{E}[r_t(a) \mid x_t] - \mathbb{E}[r_t(a_t)] \right] $$

Goal: keep regret sublinear in $T$, e.g. $\mathcal{O}(\sqrt{T})$, so that the per-round average regret $\text{Regret}_T / T$ tends to zero.
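The sum above translates directly into code. Here `expected` maps each (context, action) pair to its expected reward $\mathbb{E}[r \mid x, a]$ (an illustrative interface, assuming these expectations are known, as in a simulation):

```python
def cumulative_regret(expected, contexts, chosen):
    """Regret_T: for each round t, add the gap between the best expected
    reward available in context x_t and the expected reward of the action
    actually chosen."""
    total = 0.0
    for x, a in zip(contexts, chosen):
        best = max(v for (c, _), v in expected.items() if c == x)
        total += best - expected[(x, a)]
    return total

expected = {("A", 0): 0.9, ("A", 1): 0.1,
            ("B", 0): 0.2, ("B", 1): 0.8}
# Round 1 is optimal (gap 0), round 2 adds 0.6, round 3 adds 0.8.
r = cumulative_regret(expected, ["A", "B", "A"], [0, 0, 1])  # 1.4
```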


Applications

  • AdTech: personalised ad targeting
  • News & media: content recommendations
  • Healthcare: adaptive clinical trials
  • Robotics: sensory-based action selection
  • E-commerce: price and product selection

Theoretical Foundations

  • Online Convex Optimisation
  • PAC Learning Theory
  • Bayesian Decision Theory
  • Reinforcement Learning (simplified case)

Related Concepts

  • Reinforcement Learning (RL): Contextual bandits are RL problems with a single state and no long-term planning.
  • Causal Inference: Bandits operate under a causal action → reward assumption.
  • Meta-Learning: Improve learning efficiency using past contextual interactions.

