Q-Learning vs. Policy Gradient Methods

Machine Learning · Easy · Free problem

Explain the relationship between Q-learning and policy gradient methods in reinforcement learning.

Specifically, address:

1. What does each method learn, and how does it make decisions?
2. What are the key trade-offs in terms of action spaces, sample efficiency, and stability?
3. How do actor-critic methods combine the strengths of both approaches?

Hints

Think about what each method's "output" is: Q-learning outputs a value for every (state, action) pair, while policy gradients output a probability distribution over actions given a state.
The key practical distinction is the action space. Q-learning needs to enumerate all actions to find the argmax. What happens when the action space is continuous?
Actor-critic methods use the critic's value estimate as a baseline to reduce the variance of the policy gradient. The advantage $A(s,a) = Q(s,a) - V(s)$ tells the actor which actions are better than average.

Worked Solution

How to Think About It: Q-learning and policy gradients represent two fundamentally different philosophies for solving RL problems. Q-learning says: "learn how good each action is in each state, then pick the best one." Policy gradients say: "learn a direct mapping from states to actions, and nudge it toward actions that worked well." Neither dominates the other -- each has a sweet spot. The practical question in any application is which trade-offs matter more for your problem.

Key Insight: Q-learning learns a value function and derives the policy implicitly ($\arg\max_a Q(s,a)$). Policy gradients learn the policy directly and never need to evaluate every possible action. This distinction drives all the downstream trade-offs.

The Method:

Q-Learning (value-based):

- Learns: The action-value function $Q(s, a) = E[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a]$, with later actions chosen greedily
- Decision rule: $\pi(s) = \arg\max_a Q(s, a)$
- Update: Minimize the squared temporal difference (TD) error: $L = \left(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right)^2$
- Exploration: $\epsilon$-greedy (random action with probability $\epsilon$, greedy action otherwise)
- Off-policy: Can learn from data collected by a different policy, which is what makes experience replay possible
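The Q-learning update and exploration rule above can be sketched in tabular form. This is a minimal illustration, not a full agent: the table shape, the learning rate `alpha`, and the function names `epsilon_greedy` and `q_learning_update` are assumptions made for the example.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning step in place and return the TD error."""
    # Bootstrapped target: r + gamma * max_a' Q(s', a'); no bootstrap at episode end.
    target = r if done else r + gamma * np.max(Q[s_next])
    td_error = target - Q[s, a]
    # Moving Q(s, a) toward the target is a gradient step on the squared TD error.
    Q[s, a] += alpha * td_error
    return td_error

# Hypothetical toy problem: 3 states, 2 actions, one observed transition.
rng = np.random.default_rng(0)
Q = np.zeros((3, 2))
td = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2, done=False)
greedy_action = epsilon_greedy(Q, state=0, epsilon=0.0, rng=rng)
```

Note that the transition `(s, a, r, s_next)` could have come from any behavior policy, which is the off-policy property: the `max` in the target does not depend on how `a` was chosen.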

Policy Gradient (policy-based):

- Learns: A parameterized policy $\pi_\theta(a \mid s)$ directly
- Objective: Maximize expected return $J(\theta) = E_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$
- Update: Gradient ascent using the policy gradient theorem: $\nabla_\theta J = E_{\pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t\right]$ where $G_t$ is the return from time $t$
- Exploration: Built into the stochastic policy (entropy regularization helps)
- On-policy: Typically requires fresh data from the current policy
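The policy gradient update above can be sketched for a softmax policy with one logit per (state, action) pair. This is a REINFORCE-style single-episode estimate under assumed names (`reinforce_gradient`, `discounted_returns`) and an assumed tabular parameterization; real implementations usually average over many episodes and subtract a baseline.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step t."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def reinforce_gradient(theta, states, actions, rewards, gamma=0.99):
    """Single-episode estimate of grad J: sum_t grad log pi(a_t|s_t) * G_t."""
    grad = np.zeros_like(theta)
    G = discounted_returns(rewards, gamma)
    for s, a, g in zip(states, actions, G):
        pi = softmax(theta[s])
        # For softmax logits, grad log pi(a|s) = onehot(a) - pi(.|s).
        grad_log = -pi
        grad_log[a] += 1.0
        grad[s] += grad_log * g
    return grad

# Hypothetical two-step episode with 2 states and 2 actions.
theta = np.zeros((2, 2))
grad = reinforce_gradient(theta, states=[0, 1], actions=[1, 0],
                          rewards=[0.0, 1.0], gamma=0.5)
theta += 0.1 * grad  # gradient ascent step with an assumed step size of 0.1
```

The update raises the logits of the actions that were taken in proportion to the return that followed them, without ever evaluating the actions that were not taken.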

Trade-offs:

| Dimension | Q-Learning | Policy Gradient |
|-----------|-----------|----------------|
| Action space | Discrete only (need $\arg\max$) | Continuous or discrete |
| Sample efficiency | Higher (off-policy, replay buffer) | Lower (on-policy, high variance) |
| Stability | Can diverge with function approx. | More stable with proper baselines |
| Stochastic policies | No (deterministic $\arg\max$) | Yes (natural output) |
| Convergence | Guaranteed only for tabular | Local convergence with smooth $\pi$ |

Actor-Critic (hybrid):

Actor-critic methods combine both approaches:

- The actor is a policy network $\pi_\theta(a|s)$ (policy gradient component)
- The critic is a value network $V_\phi(s)$ or $Q_\phi(s,a)$ (value-based component, trained with TD updates as in Q-learning)

The critic provides a low-variance baseline for the policy gradient update: $\nabla_\theta J \approx E\left[\nabla_\theta \log \pi_\theta(a|s) \cdot (Q(s,a) - V(s))\right]$

The term $A(s,a) = Q(s,a) - V(s)$ is the advantage function. Using it as the "reward signal" for the actor dramatically reduces variance compared to raw returns, while the policy gradient framework handles continuous actions naturally.
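A minimal tabular sketch of one actor-critic step follows. It assumes a state-value critic $V(s)$ and uses the one-step TD error $r + \gamma V(s') - V(s)$ as the advantage estimate, which is a common choice but only one of several. The function name `actor_critic_step` and both learning rates are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def actor_critic_step(theta, V, s, a, r, s_next, done,
                      actor_lr=0.1, critic_lr=0.1, gamma=0.99):
    """Update the critic V and the actor logits theta in place for one transition."""
    # Critic: one-step TD target; its error doubles as the advantage estimate.
    target = r if done else r + gamma * V[s_next]
    advantage = target - V[s]
    V[s] += critic_lr * advantage

    # Actor: policy gradient step weighted by the advantage instead of the raw return.
    pi = softmax(theta[s])
    grad_log = -pi
    grad_log[a] += 1.0  # grad log pi(a|s) for softmax logits
    theta[s] += actor_lr * advantage * grad_log
    return advantage

# Hypothetical transition in a 2-state, 2-action problem.
theta = np.zeros((2, 2))
V = np.zeros(2)
adv = actor_critic_step(theta, V, s=0, a=1, r=1.0, s_next=1, done=False)
```

A positive advantage raises the probability of the action taken and a negative one lowers it, so the actor is told whether the action was better or worse than the critic expected rather than how large the raw return was.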

Modern algorithms like PPO, SAC, and TD3 are all actor-critic methods that have become the default in practice.

Practical Considerations:

Finance/trading context: Q-learning is often preferred for discrete action spaces (buy/sell/hold, order placement on a discrete grid) because of its sample efficiency. Policy gradients are better for continuous controls (portfolio weights, hedging ratios).
The variance problem: Raw policy gradients have very high variance, making training slow. Baselines, advantage estimation (GAE), and entropy regularization are essential in practice.
The max problem: Q-learning requires computing $\max_a Q(s,a)$, which is intractable for continuous action spaces. This is why DDPG/TD3 use a separate actor network to approximate the argmax.
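For the variance problem noted above, generalized advantage estimation (GAE) can be sketched as a backward pass over one trajectory. This is an illustrative version under assumptions: the function name `gae`, the argument layout, and the default `gamma` and `lam` values are choices made for the example, and episode-termination masking is omitted for brevity.

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Compute GAE advantages for one trajectory segment.

    rewards[t] and values[t] refer to step t; last_value is the critic's
    estimate for the state after the final step (0.0 if the episode ended).
    """
    T = len(rewards)
    adv = np.zeros(T)
    next_value = last_value
    running = 0.0
    for t in reversed(range(T)):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    return adv

# Hypothetical two-step trajectory that ends in a terminal state.
advantages = gae(rewards=[1.0, 1.0], values=[0.5, 0.5],
                 last_value=0.0, gamma=1.0, lam=1.0)
```

With `lam=1.0` the estimate reduces to the Monte Carlo return minus the critic's value (low bias, high variance); with `lam=0.0` it reduces to the one-step TD error (higher bias, low variance). Intermediate values trade between the two.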

Answer: Q-learning learns a value function and derives actions implicitly via argmax; policy gradients learn the policy directly. Q-learning is more sample-efficient (off-policy) but, in its standard form, limited to discrete actions. Policy gradients handle continuous actions and stochastic policies but suffer from high variance. Actor-critic methods combine both: a policy network (actor) updated by policy gradients, with a value network (critic) supplying advantage estimates that reduce the variance of those updates.

Intuition

The Q-learning vs. policy gradient distinction is really about how you decompose the RL problem. Q-learning separates evaluation (how good is this action?) from selection (pick the best one). Policy gradients merge them into a single optimization. Each decomposition has failure modes: Q-learning breaks when you cannot enumerate actions (continuous control), and policy gradients break when the return signal is too noisy (high variance, slow learning).

In quant finance, this trade-off shows up in execution algorithms. If you are deciding between a small set of order types (limit, market, cancel), Q-learning is natural and sample efficient. If you are choosing a continuous hedge ratio or optimal execution trajectory, you need policy gradients or an actor-critic approach. The practical lesson is that actor-critic methods are almost always the right starting point for non-trivial problems -- they give you the best of both worlds at the cost of maintaining two networks.
