Contextual Bandit Theory
What Is a Contextual Bandit?
A contextual bandit is a sequential decision-making framework in which side information (context) is observed before each action is chosen. The learner balances exploration (trying actions to learn about their rewards) and exploitation (choosing the best-known action), conditioned on the observed context.
Formal Definition
At each time step $t = 1, 2, \dots, T$:
- Observe context $x_t \in \mathcal{X}$
- Choose action $a_t \in \mathcal{A}$
- Receive reward $r_t(a_t)$ for the chosen action only; rewards of unchosen actions remain unobserved (bandit feedback)
- Update policy to improve future choices
The objective is to maximise cumulative reward or minimise regret over time.
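The interaction protocol can be made concrete with a short simulation. The sketch below is minimal and illustrative: the environment (`observe_context`, `pull`) and its linear reward model are assumptions invented for this example, and the uniform-random policy is a placeholder that the algorithms in the next section would replace.

```python
import numpy as np

rng = np.random.default_rng(0)

def observe_context():
    # Hypothetical environment: a 3-dimensional context vector.
    return rng.normal(size=3)

def pull(action, context):
    # Hypothetical reward: linear in the context, plus Gaussian noise.
    true_weights = np.array([[0.5, -0.2, 0.1],
                             [0.1, 0.4, -0.3]])  # one row per action
    return true_weights[action] @ context + rng.normal(scale=0.1)

n_actions, horizon = 2, 1000
total_reward = 0.0
for t in range(horizon):
    x_t = observe_context()             # observe context x_t
    a_t = int(rng.integers(n_actions))  # placeholder policy: uniform random
    r_t = pull(a_t, x_t)                # receive reward r_t(a_t)
    total_reward += r_t                 # a learning policy would update here
print(f"average reward over {horizon} rounds: {total_reward / horizon:.3f}")
```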
Why Context Matters
Unlike traditional multi-armed bandits, contextual bandits condition decisions on environmental signals or user data, enabling personalised recommendations or targeted actions.
Example (online advertising):
- Context: User profile, location, device
- Actions: Ads to display
- Reward: Click or conversion
Algorithms
1. Epsilon-Greedy
- With probability $\varepsilon$, explore by picking a random action; otherwise exploit the action with the highest estimated reward.
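A minimal sketch of the rule, assuming for brevity that reward estimates are plain running means and that context is ignored; a fully contextual variant would instead fit a reward model per action (e.g. a linear regressor over $x_t$):

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon, n_actions = 0.1, 3

means = np.zeros(n_actions)   # running mean reward per action
counts = np.zeros(n_actions)  # number of times each action was tried

def select_action():
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # explore: uniform random action
    return int(np.argmax(means))             # exploit: best estimate so far

def update(action, reward):
    counts[action] += 1
    means[action] += (reward - means[action]) / counts[action]  # incremental mean
```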
2. Upper Confidence Bound (UCB)
- Choose:
$$ a_t = \arg\max_a \left( \hat{r}_t(a) + \beta \cdot \sqrt{\frac{\ln t}{n_t(a)}} \right) $$
- Balances estimated reward and uncertainty.
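The selection rule translates directly into code. The sketch below reuses the `means`/`counts` bookkeeping from the epsilon-greedy example and is context-free, matching the formula as written; a contextual analogue such as LinUCB replaces the count-based bonus with a confidence width around a per-action linear model of $x_t$.

```python
import numpy as np

def ucb_action(means, counts, t, beta=1.0):
    """Pick arg max of estimated reward plus bonus beta * sqrt(ln t / n(a))."""
    untried = np.flatnonzero(counts == 0)
    if untried.size > 0:
        return int(untried[0])  # try each action once to avoid dividing by zero
    bonus = beta * np.sqrt(np.log(t) / counts)
    return int(np.argmax(means + bonus))
```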
3. Thompson Sampling
- Use Bayesian posterior over reward models.
- Sample from the posterior to guide exploration.
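A minimal sketch for Bernoulli rewards (e.g. click / no-click), assuming an independent Beta(1, 1) prior per action; a contextual version would instead keep, say, a Bayesian linear-regression posterior per action and sample its weights each round:

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 3
alpha = np.ones(n_actions)  # Beta posterior: observed successes + 1
beta_ = np.ones(n_actions)  # Beta posterior: observed failures + 1

def ts_action():
    samples = rng.beta(alpha, beta_)  # one draw per action from its posterior
    return int(np.argmax(samples))    # act greedily on the sampled rewards

def ts_update(action, reward):        # reward must be 0 or 1
    alpha[action] += reward
    beta_[action] += 1 - reward
```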
4. Policy Gradient (Neural Bandits)
- Approximate a policy $\pi(a \mid x)$ using a neural network.
- Update parameters based on observed reward and gradient feedback.
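As a sketch of the idea, the class below uses a linear-softmax policy updated with the REINFORCE gradient; a genuine neural bandit would replace the weight matrix `W` with a network and the hand-derived gradient with autodiff. The architecture and learning rate here are illustrative assumptions, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

class SoftmaxPolicy:
    """Linear-softmax policy pi(a | x) trained with the REINFORCE gradient."""

    def __init__(self, n_actions, dim, lr=0.05):
        self.W = np.zeros((n_actions, dim))  # a neural network would replace this
        self.lr = lr

    def probs(self, x):
        logits = self.W @ x
        p = np.exp(logits - logits.max())    # numerically stable softmax
        return p / p.sum()

    def act(self, x):
        p = self.probs(x)
        return int(rng.choice(len(p), p=p)), p

    def update(self, x, a, p, reward):
        # grad of log pi(a|x) wrt W for linear-softmax: (one_hot(a) - p) x^T
        grad = np.outer(np.eye(len(p))[a] - p, x)
        self.W += self.lr * reward * grad    # gradient ascent on expected reward
```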
Regret Analysis
Regret quantifies the loss due to not always choosing the optimal action:
$$ \text{Regret}_T = \sum_{t=1}^{T} \left[ \max_{a \in \mathcal{A}} \mathbb{E}[r_t(a) \mid x_t] - \mathbb{E}[r_t(a_t) \mid x_t] \right] $$
Goal: Keep regret sublinear, e.g. $\mathcal{O}(\sqrt{T})$, so average regret tends to zero over time.
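In a simulation, where the optimal expected reward per round is known, the definition can be computed directly; the helper below is a sketch of that bookkeeping (in production the benchmark is unknown, so regret must be estimated off-policy):

```python
import numpy as np

def cumulative_regret(optimal_rewards, received_rewards):
    """Per-round gap between the best expected reward and the chosen action's,
    summed over time. Sublinear growth means average regret tends to zero."""
    gaps = np.asarray(optimal_rewards) - np.asarray(received_rewards)
    return np.cumsum(gaps)

# Example: a policy earning 0.4 per round against an optimum of 0.5
print(cumulative_regret([0.5] * 5, [0.4] * 5))  # ≈ [0.1 0.2 0.3 0.4 0.5]
```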
Applications
- AdTech: personalised ad targeting
- News & media: content recommendations
- Healthcare: adaptive clinical trials
- Robotics: sensory-based action selection
- E-commerce: price and product selection
Theoretical Foundations
- Online Convex Optimisation
- PAC Learning Theory
- Bayesian Decision Theory
- Reinforcement Learning (simplified case)
Related Concepts
- Reinforcement Learning (RL): contextual bandits are one-step RL problems; the context plays the role of a state, but actions do not influence future contexts, so no long-term planning is needed.
- Causal Inference: Bandits operate under a causal action → reward assumption.
- Meta-Learning: Improve learning efficiency using past contextual interactions.