Bayesian One-Step Lookahead Index for a Two-Armed Bandit
You are running an execution algo that chooses between two liquidity venues (arms) each period. Each period, you pick one arm and get exactly one fill.
- Arm A gives price improvement $X_A \sim N(\mu_A, \sigma_A^2)$ per fill, where $\mu_A$ is unknown.
- Arm B gives price improvement $X_B \sim N(\mu_B, \sigma_B^2)$ per fill, where $\mu_B$ is unknown.
You place Gaussian priors on each mean: $\mu_A \sim N(m_A, \tau_A^2)$ and $\mu_B \sim N(m_B, \tau_B^2)$. After each fill, you update your posterior on the chosen arm's mean using the observed price improvement. Future payoffs are discounted at rate $\beta \in (0,1)$.
- Write the Bayesian posterior update for $\mu_A$ after observing a single fill $x$ from Arm A.
- Derive the one-step lookahead index for each arm -- i.e., the expected immediate reward plus discounted expected future value under the myopic policy of always choosing the arm with the higher posterior mean after the next observation.
- Explain how this index balances exploration and exploitation, and how it relates to the Gittins index.
Hints
- Start with the conjugate Gaussian update: what is the posterior distribution of $\mu_A$ after one observation, and what is the predictive distribution of the next fill?
- The posterior mean $m_i'$ is a random variable (it depends on the yet-unseen observation). Compute its distribution and notice that $E[\max(m_i', m_j)]$ has the same form as a call option payoff on a Gaussian.
- Use the identity $E[Z^+] = s\,\phi(\mu/s) + \mu\,\Phi(\mu/s)$ for $Z \sim N(\mu, s^2)$ to get a closed-form expression for the lookahead index.
Worked Solution
How to Think About It: This is a classic explore-vs-exploit problem dressed up in execution/market microstructure language. You have two venues, you do not know which one truly gives better fills, and every fill teaches you something about that venue's quality. The tension: do you keep sending orders to the venue that *looks* best right now (exploit), or do you try the other venue to learn whether it might actually be better (explore)? The one-step lookahead index is the simplest principled way to handle this -- it is not optimal (that requires the full Gittins index), but it captures the key trade-off and is cheap to compute.
Quick Estimate: Suppose $m_A = 0.5$, $\tau_A^2 = 0.1$, $\sigma_A^2 = 1$, and $m_B = 0.3$, $\tau_B^2 = 0.5$, $\sigma_B^2 = 1$, $\beta = 0.95$. Arm A looks better right now ($m_A > m_B$), but Arm B has much more uncertainty ($\tau_B^2 = 0.5$ vs $\tau_A^2 = 0.1$). A pure exploiter always picks A and earns $0.5$ per period. But pulling B once might reveal that $\mu_B$ is actually high -- the option value of that information could justify the short-term cost. The one-step lookahead quantifies exactly this trade-off.
Part 1: Bayesian Posterior Update
With a Gaussian prior $\mu_A \sim N(m_A, \tau_A^2)$ and a Gaussian likelihood $x \mid \mu_A \sim N(\mu_A, \sigma_A^2)$, the posterior is conjugate:
$\mu_A \mid x \sim N(m_A', (\tau_A')^2)$
where the precision (inverse variance) adds:
$\frac{1}{(\tau_A')^2} = \frac{1}{\tau_A^2} + \frac{1}{\sigma_A^2}$
and the posterior mean is the precision-weighted average:
$m_A' = (\tau_A')^2 \left( \frac{m_A}{\tau_A^2} + \frac{x}{\sigma_A^2} \right)$
The observation $x$ is not yet known at decision time, but we know its predictive distribution:
$X_A \sim N\left(m_A, \; \sigma_A^2 + \tau_A^2\right)$
This is the prior predictive -- the marginal distribution of the next fill from Arm A, integrating over the unknown $\mu_A$.
Part 2: One-Step Lookahead Index
Define the *myopic continuation value*: after one more pull, you will forever exploit the arm with the higher posterior mean. The one-step lookahead index for pulling arm $i$ is:
$\nu_i = \underbrace{m_i}_{\text{immediate expected reward}} + \; \beta \cdot \underbrace{E\left[ \max(m_i', m_j) \right]}_{\text{discounted future value under greedy policy}}$
where $j \neq i$, and $m_i'$ is the posterior mean of arm $i$ after one observation. Under the greedy (myopic) continuation policy, you always pick whichever arm has the higher posterior mean, earning that mean per period, so the perpetuity value of always picking the best arm is $\max(m_i', m_j) / (1 - \beta)$. The full index is therefore:
$\nu_i = m_i + \frac{\beta}{1 - \beta} \; E\left[ \max(m_i', m_j) \right]$
Now we need $E[\max(m_i', m_j)]$. Since $m_j$ is fixed (we did not pull arm $j$), and $m_i'$ is a deterministic function of the random observation $X_i$, we can write:
$m_i' = (\tau_i')^2 \left( \frac{m_i}{\tau_i^2} + \frac{X_i}{\sigma_i^2} \right) = \alpha_i \, m_i + (1 - \alpha_i) \, X_i$
where $\alpha_i = \frac{(\tau_i')^2}{\tau_i^2} = \frac{\sigma_i^2}{\sigma_i^2 + \tau_i^2}$ is the shrinkage factor. Since $X_i \sim N(m_i, \sigma_i^2 + \tau_i^2)$, the posterior mean $m_i'$ is itself Gaussian:
$m_i' \sim N\left(m_i, \; (1 - \alpha_i)^2 (\sigma_i^2 + \tau_i^2)\right) = N\left(m_i, \; \frac{\tau_i^4}{\sigma_i^2 + \tau_i^2}\right)$
Let $s_i^2 = \frac{\tau_i^4}{\sigma_i^2 + \tau_i^2}$ be the variance of $m_i'$. Then:
$E[\max(m_i', m_j)] = m_j + E[\max(m_i' - m_j, \; 0)]$
Since $m_i' - m_j \sim N(m_i - m_j, \; s_i^2)$, this is the expectation of the positive part of a Gaussian -- exactly a call-option-type formula:
$E[(m_i' - m_j)^+] = s_i \, \phi\!\left(\frac{m_i - m_j}{s_i}\right) + (m_i - m_j) \, \Phi\!\left(\frac{m_i - m_j}{s_i}\right)$
where $\phi$ and $\Phi$ are the standard normal PDF and CDF. Combining:
$E[\max(m_i', m_j)] = m_j + s_i \, \phi\!\left(\frac{m_i - m_j}{s_i}\right) + (m_i - m_j) \, \Phi\!\left(\frac{m_i - m_j}{s_i}\right)$
The full one-step lookahead index for arm $i$ is:
$\boxed{\nu_i = m_i + \frac{\beta}{1 - \beta} \left[ m_j + s_i \, \phi\!\left(\frac{m_i - m_j}{s_i}\right) + (m_i - m_j) \, \Phi\!\left(\frac{m_i - m_j}{s_i}\right) \right]}$
You pull the arm with the higher $\nu_i$.
Part 3: Exploration vs. Exploitation and the Gittins Connection
The index decomposes into two forces:
- Exploitation: The $m_i$ term rewards picking the arm that looks best right now.
- Exploration: The $s_i \, \phi(\cdot)$ term is the "option value of information." It is large when $s_i$ is large (high posterior uncertainty on arm $i$) and when $m_i \approx m_j$ (the arms are close, so new info could flip the ranking). This is directly analogous to the time value of an option -- uncertainty has value because you can act on what you learn.
When one arm is clearly dominant ($|m_i - m_j| \gg s_i$), the option value vanishes and the index reduces to pure exploitation. When the arms are close and one is highly uncertain, the option value can dominate and justify exploration.
The true Gittins index solves the full infinite-horizon dynamic programming problem, which accounts for multi-step information gains. The one-step lookahead is a lower bound on the Gittins index -- it only values information one step ahead and then reverts to greedy play. In practice, for Gaussian bandits with moderate discount factors, the one-step lookahead is a reasonable approximation and much easier to compute.
Answer: The one-step lookahead index for arm $i$ is $\nu_i = m_i + \frac{\beta}{1-\beta}\left[m_j + s_i\,\phi\!\left(\frac{m_i - m_j}{s_i}\right) + (m_i - m_j)\,\Phi\!\left(\frac{m_i - m_j}{s_i}\right)\right]$ where $s_i^2 = \tau_i^4/(\sigma_i^2 + \tau_i^2)$. Pull whichever arm has the higher index. The $s_i\,\phi(\cdot)$ term is the option value of learning -- it rewards exploring the uncertain arm, and it vanishes when one arm is clearly dominant.
Intuition
This problem is really about the value of information in a trading context. Every time you route an order to a venue, you learn something about that venue's true quality -- and that information has economic value because it can improve all your future decisions. The one-step lookahead index captures this by computing the option value of one additional observation: the $s_i\,\phi(\cdot)$ term is literally the expected improvement in your best-arm estimate from learning, which mirrors how we value options in finance (uncertainty has positive value when you can act on realizations).
The deeper lesson is that exploration is not a cost you pay -- it is an option you buy. The uncertain arm is like a lottery ticket: its expected payoff might be lower than the known arm, but its upside potential (discovering it is actually better) has real value. This explore-exploit framework shows up everywhere in quant work: A/B testing execution algos, trying new signal sources, allocating research effort across strategies. The Gaussian conjugate structure makes the math clean, but the intuition -- that you should explore more when uncertainty is high relative to the gap between alternatives -- applies universally.