Streaming Merge Join with Missing Keys
Two streaming data sources produce tuples in near real time:
- **Stream A**: $(t, \text{loc}, \text{temp})$ -- temperature readings tagged by timestamp and location.
- **Stream B**: $(t, \text{loc}, \text{humid})$ -- humidity readings tagged by timestamp and location.
Both streams are *roughly* sorted by time, but events can arrive late or out of order, and either stream may be missing entries for certain $(t, \text{loc})$ pairs. Your job is to produce a joined output stream keyed by $(t, \text{loc})$ that contains the best-available temperature and humidity values for each key.
Design the full system:
1. What data structures do you use to buffer and index incoming events?
2. How do you handle missing keys -- i.e., a $(t, \text{loc})$ that appears in one stream but not the other?
3. How do you handle late and out-of-order arrivals?
4. Define a watermarking strategy: when is it safe to emit a joined record and evict old state?
5. What is the time complexity per event and the space complexity of your windowed state?
Provide working code (Python) for the core join logic.
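One possible starting point, offered as a minimal sketch rather than a reference solution: buffer partial records in a dictionary keyed by $(t, \text{loc})$, track eviction order with a min-heap on $t$, and define the watermark as the smaller of the two streams' maximum observed timestamps minus a fixed lateness allowance. The names `StreamJoiner`, `Joined`, `allowed_lateness`, and the `"A"`/`"B"` source tags are hypothetical, timestamps are assumed to be integers, and "best-available" is assumed to mean last write wins per key and source, with `None` for a side that never arrived.

```python
import heapq
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Key = Tuple[int, str]  # (t, loc); integer timestamps are an assumption


@dataclass
class Joined:
    t: int
    loc: str
    temp: Optional[float] = None   # None means Stream A never supplied this key
    humid: Optional[float] = None  # None means Stream B never supplied this key


class StreamJoiner:
    """Hypothetical outer join of two streams, closed by a watermark."""

    def __init__(self, allowed_lateness: int):
        self.allowed_lateness = allowed_lateness
        self.pending: Dict[Key, Joined] = {}  # O(1) lookup by key
        self.heap: List[Key] = []             # min-heap, oldest t first
        self.max_t: Dict[str, Optional[int]] = {"A": None, "B": None}
        self.dropped = 0                      # events that arrived too late

    def watermark(self) -> Optional[int]:
        # Undefined until both streams have produced at least one event.
        a, b = self.max_t["A"], self.max_t["B"]
        if a is None or b is None:
            return None
        return min(a, b) - self.allowed_lateness

    def process(self, source: str, t: int, loc: str, value: float) -> List[Joined]:
        if source not in self.max_t:
            raise ValueError(f"unknown source: {source!r}")
        wm = self.watermark()
        if wm is not None and t <= wm:
            # The key's window is already closed; count the event and drop it.
            self.dropped += 1
            return []
        key = (t, loc)
        rec = self.pending.get(key)
        if rec is None:
            rec = Joined(t, loc)
            self.pending[key] = rec
            heapq.heappush(self.heap, key)
        if source == "A":
            rec.temp = value    # last write wins (assumption)
        else:
            rec.humid = value
        prev = self.max_t[source]
        self.max_t[source] = t if prev is None else max(prev, t)
        return self._drain(self.watermark())

    def _drain(self, wm: Optional[int]) -> List[Joined]:
        # Emit and evict every buffered key whose timestamp is at or below wm.
        out: List[Joined] = []
        while wm is not None and self.heap and self.heap[0][0] <= wm:
            out.append(self.pending.pop(heapq.heappop(self.heap)))
        return out

    def flush(self) -> List[Joined]:
        # End of input: emit everything still buffered, oldest first.
        out: List[Joined] = []
        while self.heap:
            out.append(self.pending.pop(heapq.heappop(self.heap)))
        return out
```

Under these assumptions each event costs one dictionary lookup plus at most one heap push and an amortized one heap pop, so $O(\log n)$ per event, where $n$ is the number of buffered keys; state is $O(n)$. Because the watermark is the minimum across both streams, a stalled stream holds the watermark back and lets state grow, so a production design would likely add an idle timeout or a processing-time bound.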