C++ Low-Latency Interview Questions at HFT Firms

What quant dev loops actually test: caches, lock-free queues, the C++ memory model, and whether you can do a latency budget out loud.

Most candidates over-prepare probability and under-prepare systems. If you are interviewing for a quant developer or core-infrastructure role at firms like Hudson River Trading, Jump Trading, Citadel Securities, or XTX Markets, the C++ round is not LeetCode with fancier syntax. It tests whether you understand what the hardware does with your code — and whether you can reason quantitatively about nanoseconds.

Why HFT firms ask this

An HFT firm's edge often lives inside a microsecond. Interviewers use low-latency questions to separate people who have read a blog post about "cache-friendly code" from people who can estimate the cost of a design decision before writing it. Three question families dominate:

  • Hardware mechanics: cache hierarchy, TLBs, branch prediction, NUMA, false sharing.
  • Concurrency: the C++ memory model, std::atomic, acquire/release semantics, lock-free single-producer/single-consumer (SPSC) queues.
  • Language mechanics under constraints: avoiding allocation on the hot path, virtual-call cost, templates vs. runtime polymorphism, placement new, cache-line alignment.

A fourth theme — order book design — blends systems with market microstructure: "Design a limit order book supporting add/cancel/execute in $O(1)$" is close to a rite of passage.

The numbers you must know cold

Interviewers expect Jeff Dean–style latency numbers without hesitation. Approximate, order-of-magnitude values are fine; not knowing them at all is disqualifying.

OperationApprox. latencyIn 4 GHz cycles
L1 cache hit~1 ns~4
L2 cache hit~4 ns~16
L3 cache hit~12–30 ns~50–120
Main memory (DRAM)~80–100 ns~320–400
Branch misprediction~15–20 cycles~4–5 ns
Mutex lock/unlock (uncontended)~20–40 ns~100+
Kernel-bypass NIC, wire to user space~1–2 μs
Standard kernel network stacktens of μs

Worked example: a tick-to-trade latency budget

Interviewer: "Your strategy must react to a market-data tick and send an order within 1.5 μs, measured wire-to-wire. Kernel-bypass networking costs 1 μs round-trip in total. How many main-memory accesses can your hot path afford?"

Set up the budget. The software budget is what remains after networking:

$$T_{\text{software}} = 1.5\,\mu s - 1.0\,\mu s = 500\ \text{ns}.$$

At 4 GHz, one cycle is $0.25$ ns, so the budget is $500 / 0.25 = 2{,}000$ cycles. Now decompose the hot path: suppose decode + book update + signal + order encode is roughly 800 cycles of pure compute when everything hits L1. That leaves $2{,}000 - 800 = 1{,}200$ cycles for memory stalls. A dependent DRAM miss costs about 400 cycles, so you can afford at most

$$k = \left\lfloor \frac{1{,}200}{400} \right\rfloor = 3$$

serialized cache misses. Three. That single number explains most low-latency design: flat arrays instead of pointer-chasing maps, order books stored as contiguous price-level arrays, and everything hot pinned into a working set that fits in L1/L2.

Strong candidates then add the branch term. If the hot path has $n = 200$ branches and the predictor is $99\%$ accurate with a 16-cycle penalty, the expected cost is

$$E[\text{penalty}] = n(1-p)\,c = 200 \times 0.01 \times 16 = 32\ \text{cycles} \approx 8\ \text{ns},$$

which is noise on average — but the interviewer wants you to note that tail latency is what kills you, and a cold branch on the rare "actually send an order" path is exactly where mispredictions cluster. Hence tricks like keeping the send path warm with dummy work.

The lock-free SPSC queue question

The most common concurrency question: implement a fixed-size ring buffer where one thread produces (network reader) and one consumes (strategy). Things interviewers probe:

  1. Capacity a power of two, so index wrap is idx & (N-1) instead of an expensive modulo.
  2. Memory ordering: producer publishes with a release store to the write index; consumer reads it with an acquire load. Say why: the release/acquire pair guarantees the payload write happens-before the consumer's read. Defaulting everything to seq_cst is correct but signals you don't know what it costs.
  3. False sharing: the head and tail indices must live on separate cache lines (alignas(64)), otherwise the two cores ping-pong the same line and throughput collapses by an order of magnitude.
  4. Follow-up: "Why is SPSC so much easier than MPMC?" Because with one writer per variable you never need a compare-and-swap loop — no contention, no ABA problem.

Traps that fail candidates

  • Reciting "avoid virtual functions" without numbers. A predicted virtual call is a few cycles; the real costs are the missed inlining and the possible I-cache miss. Nuance beats slogans.
  • Claiming volatile gives thread safety. Instant red flag in C++; it is for memory-mapped I/O, not synchronization.
  • Optimizing the mean, ignoring the tail. Firms care about p99/p99.9. Mention allocator jitter, page faults, and why hot paths pre-allocate and pre-fault everything.
  • Not measuring. When asked "how would you verify this is faster," the answer involves cycle counters, perf counters for cache misses and mispredictions, and realistic replayed market data — not "big-O analysis."

Firms differ in emphasis: Optiver and Headlands lean heavily on C++ mechanics, while pure-research tracks may skip this round entirely — check where your target sits on the interview difficulty tier list and calibrate.

Practice next

Drill the adjacent material before the loop: our coding interview question bank covers the algorithmic layer these rounds start from, the full problem bank has 2,800+ worked problems across systems-adjacent probability and brainteasers, and the trading games let you feel why microseconds and adverse selection matter in the first place.

More topic guides

Frequently asked questions

What C++ topics do HFT firms actually test in low-latency interviews?

The core areas are the hardware side (cache hierarchy, branch prediction, false sharing, NUMA), the C++ memory model and atomics (acquire/release semantics, lock-free SPSC queues), and hot-path language mechanics like avoiding allocation, cache-line alignment, and the true cost of virtual calls. Order book design questions combine all three and are near-universal for quant developer roles.

What is the lock-free queue interview question?

You are asked to implement a fixed-size single-producer/single-consumer ring buffer without locks, typically for passing market data between a network thread and a strategy thread. Interviewers check that you use a power-of-two capacity with bitmask wrapping, correct release/acquire memory ordering on the indices, and cache-line padding to prevent false sharing between head and tail.

Do quant developers need to memorize latency numbers?

Yes, at least to order of magnitude: roughly 1 ns for an L1 hit, ~100 ns for a DRAM access, ~15-20 cycles for a branch misprediction, and low microseconds for kernel-bypass networking. Interviewers use these numbers in back-of-envelope latency budget questions, and hesitating on them suggests you have never profiled real code.

Is the HFT C++ interview harder than LeetCode?

It is different rather than strictly harder: algorithmic questions are usually LeetCode-medium, but the differentiating rounds test systems knowledge that LeetCode never touches, like memory ordering, cache behavior, and tail-latency reasoning. Candidates who only grind algorithm problems routinely fail on questions such as explaining why false sharing collapses throughput.

Practice the real thing

QuantVault has 2,800+ quant interview problems with full solutions, intuition, and hints, firm-by-firm interview funnels, and an auto-graded coding judge. Start free.