Contextual Bandit Theory

What Is a Contextual Bandit?

A contextual bandit is a machine learning model used to make sequential decisions where side information (context) is available before choosing an action. It balances exploration (trying new actions) and exploitation (choosing the best-known action), conditioned on observed context.

Formal Definition

At each time step $t = 1, 2, \dots, T$:

  1. Observe context $x_t \in \mathcal{X}$
  2. Choose action $a_t \in \mathcal{A}$
  3. Receive reward $r_t(a_t)$
  4. Update policy to improve future choices

The objective is to maximise cumulative reward or minimise regret over time.
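The four steps above can be sketched as a minimal Python loop. The two-context environment and reward function below are illustrative assumptions (not part of the formal definition), and the policy is a simple per-context epsilon-greedy learner:

```python
import random

def reward(context, action):
    # Hypothetical reward: action 0 pays off in context "A", action 1 in "B".
    return 1.0 if (context == "A") == (action == 0) else 0.0

def run(T=1000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    counts = {}   # pulls per (context, action)
    values = {}   # running mean reward per (context, action)
    total = 0.0
    for t in range(T):
        x = rng.choice(["A", "B"])            # 1. observe context x_t
        if rng.random() < epsilon:            # 2. choose action a_t
            a = rng.choice([0, 1])            #    (explore)
        else:
            a = max([0, 1], key=lambda act: values.get((x, act), 0.0))
        r = reward(x, a)                      # 3. receive reward r_t(a_t)
        total += r
        n = counts.get((x, a), 0) + 1         # 4. update the policy's estimates
        counts[(x, a)] = n
        v = values.get((x, a), 0.0)
        values[(x, a)] = v + (r - v) / n      # incremental mean update
    return total / T
```

As the estimates converge, the average reward approaches the optimum of 1.0 for this toy environment.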


Why Context Matters

Unlike traditional multi-armed bandits, contextual bandits condition decisions on environmental signals or user data, enabling personalised recommendations or targeted actions.

Example (online advertising):

  • Context: User profile, location, device
  • Actions: Ads to display
  • Reward: Click or conversion

Algorithms

1. Epsilon-Greedy

  • Explore uniformly at random with probability $\varepsilon$; otherwise exploit the action with the highest estimated reward.
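The selection rule can be written as a small function (the `values` mapping from action to estimated mean reward is an illustrative interface, not a fixed API):

```python
import random

def epsilon_greedy(values, actions, epsilon, rng):
    """With probability epsilon pick a uniformly random action (explore);
    otherwise pick the action with the highest estimated reward (exploit)."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: values.get(a, 0.0))

rng = random.Random(0)
values = {0: 0.2, 1: 0.7, 2: 0.5}
picks = [epsilon_greedy(values, [0, 1, 2], 0.1, rng) for _ in range(1000)]
# Action 1 dominates; actions 0 and 2 appear only through exploration.
```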

2. Upper Confidence Bound (UCB)

  • Choose:

$$ a_t = \arg\max_a \left( \hat{r}_t(a) + \beta \cdot \sqrt{\frac{\ln t}{n_t(a)}} \right) $$

  • Balances estimated reward and uncertainty: $\hat{r}_t(a)$ is the estimated mean reward of action $a$ and $n_t(a)$ the number of times it has been chosen, so rarely tried actions receive a larger exploration bonus.
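The argmax above can be sketched directly, with `rewards[a]` holding the cumulative reward of action $a$ and `counts[a]` its pull count $n_t(a)$ (an illustrative interface):

```python
import math

def ucb_action(rewards, counts, t, beta=1.0):
    """UCB rule from the formula above: pull every action once, then pick
    the argmax of mean reward plus beta * sqrt(ln t / n_t(a))."""
    best, best_score = None, float("-inf")
    for a, n in counts.items():
        if n == 0:
            return a  # force at least one pull of every action
        score = rewards[a] / n + beta * math.sqrt(math.log(t) / n)
        if score > best_score:
            best, best_score = a, score
    return best
```

With equal pull counts the higher-mean arm wins; with very unequal counts the exploration bonus can make a rarely tried, lower-mean arm win instead.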

3. Thompson Sampling

  • Use Bayesian posterior over reward models.
  • Sample from the posterior to guide exploration.
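For Bernoulli rewards, a common concrete instance is Beta-Bernoulli Thompson sampling (a minimal sketch; the success/failure counters are an assumed bookkeeping interface):

```python
import random

def thompson_action(successes, failures, rng):
    """Sample a mean-reward estimate from each arm's Beta posterior
    (with a uniform Beta(1, 1) prior) and play the argmax."""
    samples = {a: rng.betavariate(successes[a] + 1, failures[a] + 1)
               for a in successes}
    return max(samples, key=samples.get)

rng = random.Random(0)
# Arm 1 has a much better observed record, so its posterior samples
# are almost always the largest.
succ, fail = {0: 10, 1: 90}, {0: 90, 1: 10}
picks = [thompson_action(succ, fail, rng) for _ in range(1000)]
```

The posterior sampling does the exploration automatically: an arm with wide posterior uncertainty occasionally draws a high sample and gets tried.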

4. Policy Gradient (Neural Bandits)

  • Approximate a policy $\pi(a \mid x)$ using a neural network.
  • Update parameters based on observed reward and gradient feedback.
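A REINFORCE-style update for a softmax policy $\pi(a \mid x)$ can be sketched with a linear "network" (one weight per context-action pair) standing in for a neural one; the environment and learning rate are illustrative assumptions:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def train(T=3000, lr=0.5, seed=0):
    rng = random.Random(seed)
    contexts, actions = ["A", "B"], [0, 1]
    w = {(x, a): 0.0 for x in contexts for a in actions}
    for _ in range(T):
        x = rng.choice(contexts)
        probs = softmax([w[(x, a)] for a in actions])
        a = rng.choices(actions, weights=probs)[0]
        # Hypothetical reward: action 0 is right in "A", action 1 in "B".
        r = 1.0 if (x == "A") == (a == 0) else 0.0
        # Gradient of log pi(a|x) w.r.t. w[(x, b)] is 1[b == a] - pi(b|x).
        for b in actions:
            w[(x, b)] += lr * r * ((1.0 if b == a else 0.0) - probs[b])
    return w

w = train()
# The learned policy should strongly prefer the rewarded action per context.
```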

Regret Analysis

Regret quantifies the loss due to not always choosing the optimal action:

$$ \text{Regret}_T = \sum_{t=1}^{T} \left[ \max_{a \in \mathcal{A}} \mathbb{E}[r_t(a) \mid x_t] - \mathbb{E}[r_t(a_t)] \right] $$

Goal: keep regret sublinear in $T$, e.g. $\mathcal{O}(\sqrt{T})$, so that the per-round average regret $\text{Regret}_T / T$ tends to zero.
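The sum above translates directly into code. Here `expected` maps each (context, action) pair to its expected reward $\mathbb{E}[r \mid x, a]$ (an illustrative interface, assuming these expectations are known, as in a simulation):

```python
def cumulative_regret(expected, contexts, chosen):
    """Regret_T: for each round t, add the gap between the best expected
    reward available in context x_t and the expected reward of the action
    actually chosen."""
    total = 0.0
    for x, a in zip(contexts, chosen):
        best = max(v for (c, _), v in expected.items() if c == x)
        total += best - expected[(x, a)]
    return total

expected = {("A", 0): 0.9, ("A", 1): 0.1,
            ("B", 0): 0.2, ("B", 1): 0.8}
# Round 1 is optimal (gap 0), round 2 adds 0.6, round 3 adds 0.8.
r = cumulative_regret(expected, ["A", "B", "A"], [0, 0, 1])  # 1.4
```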


Applications

  • AdTech: personalised ad targeting
  • News & media: content recommendations
  • Healthcare: adaptive clinical trials
  • Robotics: sensory-based action selection
  • E-commerce: price and product selection

Theoretical Foundations

  • Online Convex Optimisation
  • PAC Learning Theory
  • Bayesian Decision Theory
  • Reinforcement Learning (simplified case)

Related Concepts

  • Reinforcement Learning (RL): Contextual bandits are RL problems with a single state and no long-term planning.
  • Causal Inference: Bandits operate under a causal action → reward assumption.
  • Meta-Learning: Improve learning efficiency using past contextual interactions.

