Build wait-free telemetry buffers that never block producers, with overwrite semantics for high-frequency trading observability that doesn't impact system performance.
It was a chilly Tuesday in November, precisely at 9:17 AM, when the disaster struck. I remember the exact moment because I was mid-sip of my third espresso when the alerts started cascading across my monitoring dashboard like a waterfall of red.
Our high-frequency trading system had been designed to handle 10,000 trades per second. We had tested it, benchmarked it, stress-tested it. But on this particular morning, earnings season combined with unexpected Fed commentary had pushed us to 15,000 trades per second. And that's when everything started to fall apart.
The symptoms were insidious at first. Latency spikes. Not the kind you notice immediately - we're talking milliseconds in a world where microseconds matter. But those milliseconds started compounding. Trades were queuing up. Market data was getting stale. And worst of all, our telemetry system - the very tool we relied on to diagnose problems - had become the problem itself.
I pulled up the flame graphs, and there it was: our telemetry capture was blocking. Not just any blocking - synchronized blocking with wait/notify semantics. Every time our metrics buffer filled up, producer threads would park, waiting for the consumer to drain data. In a high-frequency system, that's not a minor inconvenience. It's a catastrophe.
Here's what I saw in the profiler:
Thread State Analysis (9:17:23 AM - 9:17:28 AM): BLOCKED: 47.3% (synchronized wait on MetricsBuffer.lock) RUNNABLE: 31.2% (actual work) WAITING: 18.4% (condition.await in telemetry) TIMED_WAIT: 3.1% (other)
Nearly half our CPU time was spent waiting on a lock. Not processing trades. Not computing risk. Waiting. For telemetry. The very metrics we needed to observe the system were causing the system to fail.
The irony wasn't lost on me. We had built an observability layer that made the system unobservable under stress - the exact moment when observability matters most.
public class BlockingMetricsBuffer { private final Object[] buffer; private int head = 0; private int tail = 0; private final Object lock = new Object(); private final int capacity; public void record(MetricEvent event) { synchronized (lock) { while (isFull()) { try { lock.wait(); // BLOCKING! } catch (InterruptedException e) { Thread.currentThread().interrupt(); return; } } buffer[head] = event; head = (head + 1) % capacity; lock.notifyAll(); } }}
There it was. synchronized. wait(). notifyAll(). The classic producer-consumer pattern that works beautifully for most applications. But for telemetry in a high-frequency system? It was a death sentence.
Have you ever seen a race car pit crew at work? Every movement is measured and precise, every tool positioned for maximum efficiency. Imagine if one of the crew members started blocking the others, creating unnecessary delays that cost precious seconds. This is exactly what happened to our system. The telemetry system, which was meant to help us, had become a bottleneck. And in the world of high-frequency trading, a bottleneck means lost opportunities - and lost money.
By 10:30 AM, I had a temporary fix in place: I disabled telemetry entirely. The system stabilized, but we were flying blind. No metrics. No visibility. No way to know if something else was going wrong.
That afternoon, I sat down with my team and we had an uncomfortable conversation. We needed to rebuild our telemetry from the ground up. Not just faster - fundamentally different. We needed a system that would never block, even if it meant accepting some data loss.
What followed was a two-week deep dive into wait-free algorithms that would change how I think about observability forever.
In quantum physics, the Heisenberg uncertainty principle tells us that the act of measuring a particle changes its behavior. Software observability has an analogous problem: the act of measuring a system can change its performance characteristics.
Consider what happens when you record a metric:
Memory allocation: Creating a metric event object
Synchronization: Coordinating between producer and consumer threads
Memory barriers: Ensuring visibility across CPU cores
I/O operations: Eventually writing metrics to storage or network
Each of these operations has a cost. And in a system processing millions of events per second, those costs multiply.
Cost per metric record (naive implementation):
Operation
Cost
Object allocation
20-50ns
Synchronized block (uncontended)
50-200ns
Synchronized block (contended!)
2,000-10,000ns
Memory barriers
10-40ns
Total (uncontended)
80-290ns
Total (contended)
2,080-10,290ns
At 15,000 trades/second with 10 metrics/trade (150,000 metric records/second):
Uncontended: 12-43ms CPU time/second (acceptable)
Contended: 312-1,543ms CPU time/second (DISASTER)
Under light load, the system appears healthy. Under heavy load - exactly when you need observability most - the observability layer becomes the bottleneck.
The obvious suggestion is to use Java's ConcurrentLinkedQueue or a similar lock-free queue. But these still don't solve our fundamental problem:
// Still problematic for high-frequency telemetryConcurrentLinkedQueue<MetricEvent> queue = new ConcurrentLinkedQueue<>();public void record(MetricEvent event) { queue.offer(event); // Lock-free, but...}
The problems:
Unbounded memory: ConcurrentLinkedQueue is unbounded. Under sustained high load, it will grow until you hit an OutOfMemoryError.
Allocation pressure: Every offer() allocates a new node object. At 150,000 operations/second, that's 150,000 allocations/second - pure fuel for the garbage collector.
GC pauses: When the GC runs, all threads pause. In a trading system, a 50ms GC pause can cost real money.
No backpressure: If the consumer can't keep up, the queue grows without bound. There's no way to shed load.
Now we're back to our original problem. When the queue fills up, producers block. In a trading system, a blocked thread means missed market data, delayed orders, and real financial losses.
public void record(MetricEvent event) { if (!queue.offer(event)) { // Queue full - what now? droppedMetrics.increment(); }}
This doesn't block, but it still has problems:
Contention: Multiple producers still contend on the queue's internal lock
Allocation: ArrayBlockingQueue may still allocate internally
False sharing: The head and tail pointers may share a cache line, causing performance degradation
What we need is something fundamentally different: a data structure designed specifically for high-frequency telemetry where we accept data loss as a feature, not a bug.
Concurrent algorithms are classified by their progress guarantees:
Loading diagram...
Blocking: If a thread holding a lock is paused (preempted, page faulted, or crashed), all other threads waiting for that lock are stuck indefinitely. This is what our original telemetry had.
Obstruction-Free: A thread will complete its operation if it runs alone, without interference from other threads. However, competing threads can cause indefinite retry loops.
Lock-Free: The system as a whole always makes progress. At least one thread will complete its operation in a finite number of steps. However, individual threads might starve under pathological conditions.
Wait-Free: Every thread completes its operation in a bounded number of steps, regardless of what other threads are doing. This is the strongest guarantee.
Normal operation: - 4 telemetry producer threads - 1 consumer thread - Low contention - All algorithms perform similarlyCrisis scenario: - 4 telemetry producers + 20 business threads all recording metrics - Consumer overwhelmed, buffer filling up - High contention on shared state - OS scheduler under pressureWith blocking telemetry: - Producers block waiting for buffer space - Business threads (sharing thread pool) get delayed - System performance degrades - Telemetry shows nothing (threads are blocked!)With lock-free telemetry: - Most producers make progress - Some producers may spin-wait - Under extreme contention, individual threads may starve - System performance partially preservedWith wait-free telemetry: - EVERY producer completes in bounded time - No thread ever waits or spins indefinitely - System performance guaranteed - Even under maximum load, telemetry keeps flowing
The wait-free guarantee means that no matter how chaotic things get, every metric recording operation will complete in a bounded, predictable amount of time. This is crucial for observability: we need to see what's happening, especially when things are going wrong.
Wait-free algorithms typically have higher complexity and sometimes higher constant-factor overhead than lock-free algorithms. Why? Because they must handle worst-case scenarios that rarely occur in practice.
A lock-free algorithm might look like:
while (true) { int current = head.get(); if (head.compareAndSet(current, current + 1)) { return current; // Got a slot! } // CAS failed, retry}
This loop could theoretically run forever if other threads keep winning the CAS race. In practice, this almost never happens - but "almost never" isn't good enough for a telemetry system that needs to remain responsive during a crisis.
A wait-free algorithm must guarantee completion regardless of other threads:
int mySlot = head.getAndIncrement(); // Always succeeds in one step!// But now we need to handle overflow...
The challenge shifts from "win the race" to "handle the consequences." As we'll see, our wait-free telemetry buffer handles overflow by overwriting old data - a trade-off that makes sense for metrics.
Let's examine what we're replacing. Understanding the problems with the blocking approach will clarify why each design decision in our wait-free buffer matters.
public class BlockingCircularBuffer<T> { private final Object[] buffer; private final int capacity; // Head: next position to write // Tail: next position to read private int head = 0; private int tail = 0; private int count = 0; private final Object lock = new Object(); public BlockingCircularBuffer(int capacity) { this.capacity = capacity; this.buffer = new Object[capacity]; } public void put(T element) throws InterruptedException { synchronized (lock) { // Wait while buffer is full while (count == capacity) { lock.wait(); // BLOCKING POINT } buffer[head] = element; head = (head + 1) % capacity; count++; lock.notifyAll(); } } @SuppressWarnings("unchecked") public T take() throws InterruptedException { synchronized (lock) { // Wait while buffer is empty while (count == 0) { lock.wait(); // BLOCKING POINT } T element = (T) buffer[tail]; buffer[tail] = null; // Help GC tail = (tail + 1) % capacity; count--; lock.notifyAll(); return element; } } public int size() { synchronized (lock) { return count; } }}
This is textbook correct. It handles all edge cases: empty buffer, full buffer, multiple producers, single consumer. The synchronized block ensures mutual exclusion, and wait()/notifyAll() handle the blocking semantics.
The actual work (write to buffer, update head) takes about 20-30 nanoseconds. But the synchronization overhead dominates:
Lock acquisition alone costs 50-200ns when uncontended, but balloons to 2,000-5,000ns under contention. Each context switch adds another 1,000-10,000ns. The notifyAll call compounds the problem further by waking ALL waiting threads, triggering a thundering herd that forces most of them right back to sleep. Meanwhile, cache line bouncing ensures the lock state ping-pongs between CPU cores with every acquisition attempt.
Under our peak load scenario:
150,000 metric records/second4 producer threads + 1 consumer threadBuffer capacity: 10,000 entriesObserved behavior: Lock acquisitions/second: 150,000+ Context switches/second: 50,000+ Average time in synchronized: 200-300ns (uncontended) Average time in synchronized: 2,000-8,000ns (contended!) Total synchronization overhead: 300ms-1,200ms per second This leaves only 700ms-(-200ms) for actual work!
Yes, that's negative time for actual work. The system was spending more time on coordination than existed in a second. Threads were queueing up faster than they could be processed.
Beyond the blocking, there's a hidden allocation problem:
// Each wait() call may allocate:// - AbstractQueuedSynchronizer$Node for wait queue// - Condition queue nodes// - Thread state objects// Under high contention:// 4 threads * 150,000 ops/sec * potential allocations =// Hundreds of thousands of small allocations per second
I ran an allocation profiler during our incident:
Hot allocation sites (1 second sample): 2,847,234 bytes: j.u.c.locks.AbstractQueuedSynchronizer$Node 892,456 bytes: MetricEvent objects 234,567 bytes: Various internal structures Total: 3.97 MB/second from synchronization alone
Almost 4 MB/second of allocation just for locking infrastructure. This is pure fuel for the garbage collector, and it was triggering young generation collections every 2-3 seconds.
Now let's build something better. Our wait-free telemetry buffer is based on several key design principles that emerge from understanding the problem deeply.
This is the fundamental insight. For telemetry, we prefer guaranteed low latency over complete data capture, recent data over historical data, and system stability over metric completeness.
When the buffer is full, we don't wait. We overwrite the oldest data. This is the trade-off that enables wait-free operation.
Instead of coordinating between head and tail with locks, we use atomic operations that always succeed:
// Traditional (lock-free but not wait-free):while (true) { int current = head.get(); if (head.compareAndSet(current, current + 1)) { break; // Might loop forever under contention }}// Wait-free approach:int myPosition = head.getAndIncrement(); // Always succeeds in one step!
The difference is profound. compareAndSet can fail and require retry. getAndIncrement always succeeds - it's wait-free by definition.
When a producer detects that it's about to overwrite unconsumed data, it advances the tail pointer automatically:
long writePosition = head.getAndIncrement();int index = (int) (writePosition & mask);// Check if we're about to overwrite unread datalong currentTail = tail.get();if (writePosition - currentTail >= capacity) { // We've caught up to (or passed) the tail // Advance the tail to make room tail.compareAndSet(currentTail, writePosition - capacity + 1);}// Now write - this slot is oursbuffer[index] = element;
This is the magic that enables wait-free overflow handling. We don't wait for the consumer to catch up - we just advance the tail and accept the data loss.
To prevent false sharing, we pad critical fields to occupy separate cache lines:
// Padding before headlong p01, p02, p03, p04, p05, p06, p07;private volatile long head;// Padding between head and taillong p11, p12, p13, p14, p15, p16, p17;private volatile long tail;// Padding after taillong p21, p22, p23, p24, p25, p26, p27;
Each cache line is 64 bytes on modern x86-64 CPUs. Seven long values (56 bytes) plus the actual field (8 bytes) ensures each critical field occupies its own cache line.
package com.trading.telemetry;import java.lang.invoke.MethodHandles;import java.lang.invoke.VarHandle;/** * Wait-Free Overwrite Buffer for High-Frequency Telemetry * * Design principles: * 1. Wait-free guarantee: Every operation completes in bounded steps * 2. Overwrite semantics: Old data is overwritten when buffer is full * 3. Zero allocation: No allocations after initialization * 4. Cache-line padding: Prevents false sharing between producers/consumer * * Performance characteristics: * - Producer: ~10-20ns per write (wait-free) * - Consumer: ~15-25ns per read * - Throughput: 50+ million ops/sec with 4 producers * * Trade-offs: * - Data loss under heavy load (by design) * - Consumer may see gaps in sequence * - Not suitable for ordered/reliable messaging * * @param <T> Element type stored in the buffer */public class WaitFreeOverwriteBuffer<T> { // ========== VarHandle Setup ========== private static final VarHandle HEAD; private static final VarHandle TAIL; private static final VarHandle BUFFER; static { try { MethodHandles.Lookup lookup = MethodHandles.lookup(); HEAD = lookup.findVarHandle( WaitFreeOverwriteBuffer.class, "head", long.class); TAIL = lookup.findVarHandle( WaitFreeOverwriteBuffer.class, "tail", long.class); BUFFER = MethodHandles.arrayElementVarHandle(Object[].class); } catch (ReflectiveOperationException e) { throw new ExceptionInInitializerError(e); } } // ========== Cache Line Padding for Head ========== // 7 longs = 56 bytes padding before head @SuppressWarnings("unused") private long p01, p02, p03, p04, p05, p06, p07; /** * Next position for producers to claim. * Monotonically increasing; wraps via modulo for array access. * Uses getAndIncrement for wait-free slot claiming. */ private volatile long head = 0; // 7 longs = 56 bytes padding between head and tail @SuppressWarnings("unused") private long p11, p12, p13, p14, p15, p16, p17; // ========== Cache Line Padding for Tail ========== /** * Next position for consumer to read. * May be advanced by producers during overflow. * Consumer is sole reader; producers may advance via CAS. */ private volatile long tail = 0; // 7 longs = 56 bytes padding after tail @SuppressWarnings("unused") private long p21, p22, p23, p24, p25, p26, p27; // ========== Buffer Storage ========== /** Pre-allocated buffer array. Size is always power of 2. */ private final Object[] buffer; /** Buffer capacity (power of 2 for fast modulo). */ private final int capacity; /** Bit mask for index calculation: capacity - 1. */ private final int mask; // ========== Metrics (optional, for monitoring) ========== /** Count of items written. */ private volatile long writtenCount = 0; /** Count of items overwritten before being read. */ private volatile long overwrittenCount = 0; // ========== Constructor ========== /** * Creates a new wait-free overwrite buffer. * * @param requestedCapacity Minimum capacity (will be rounded up to power of 2) * @throws IllegalArgumentException if capacity less than 2 */ public WaitFreeOverwriteBuffer(int requestedCapacity) { if (requestedCapacity < 2) { throw new IllegalArgumentException( "Capacity must be at least 2, got: " + requestedCapacity); } this.capacity = roundUpToPowerOf2(requestedCapacity); this.mask = this.capacity - 1; this.buffer = new Object[this.capacity]; } private static int roundUpToPowerOf2(int value) { int highBit = Integer.highestOneBit(value); return (highBit == value) ? value : highBit << 1; } // ========== Producer Operations ========== /** * Records a telemetry event to the buffer. * * WAIT-FREE GUARANTEE: This method always completes in bounded time, * regardless of other threads or buffer state. * * If the buffer is full, the oldest unread data will be overwritten. * This is by design - for telemetry, recent data is more valuable. * * @param element The element to record (should not be null) */ public void record(T element) { // Step 1: Claim a position (WAIT-FREE - always succeeds immediately) // getAndIncrement is a single atomic operation that cannot fail long writePosition = (long) HEAD.getAndAdd(this, 1L); // Step 2: Calculate array index using bitwise AND (faster than modulo) int index = (int) (writePosition & mask); // Step 3: Check for overflow condition // If writePosition has caught up to tail, we're overwriting unread data long currentTail = (long) TAIL.getOpaque(this); if (writePosition - currentTail >= capacity) { // Overflow detected! We're about to overwrite unread data. // Advance the tail to "forget" the oldest entries. // Calculate where tail should be to make room for our write long newTail = writePosition - capacity + 1; // Try to advance tail. Use CAS because: // - Multiple overflow producers might race // - Consumer might have already advanced tail // - We only advance if tail hasn't moved past our target long observedTail = currentTail; while (observedTail < newTail) { // Try to advance tail if (TAIL.compareAndSet(this, observedTail, newTail)) { // Successfully advanced tail // Count how many entries we skipped overwrittenCount += (newTail - observedTail); break; } // CAS failed - someone else advanced tail // Re-read and check if we still need to advance observedTail = (long) TAIL.getOpaque(this); } } // Step 4: Write the element // Use setRelease to ensure the write is visible to the consumer // after it sees the tail advance past this position BUFFER.setRelease(buffer, index, element); // Step 5: Update metrics writtenCount++; } /** * Batch record multiple elements. * More efficient than individual record() calls due to reduced overhead. * * @param elements Array of elements to record * @param offset Starting offset in the array * @param length Number of elements to record */ public void recordBatch(T[] elements, int offset, int length) { for (int i = 0; i < length; i++) { record(elements[offset + i]); } } // ========== Consumer Operations ========== /** * Retrieves and removes the next element from the buffer. * * This method is designed for SINGLE CONSUMER use only. * Using multiple consumers will result in data corruption. * * @return The next element, or null if buffer is empty */ @SuppressWarnings("unchecked") public T poll() { // Read current positions long currentTail = tail; long currentHead = (long) HEAD.getOpaque(this); // Check if buffer is empty if (currentTail >= currentHead) { return null; } // Calculate array index int index = (int) (currentTail & mask); // Read the element with acquire semantics // This ensures we see all writes that happened before the producer's release T element = (T) BUFFER.getAcquire(buffer, index); // Clear the slot to help GC (optional, but recommended) BUFFER.setRelease(buffer, index, null); // Advance tail (plain write is safe - single consumer) tail = currentTail + 1; return element; } /** * Drains all available elements to the provided consumer function. * More efficient than repeated poll() calls. * * @param consumer Function to process each element * @return Number of elements drained */ @SuppressWarnings("unchecked") public int drain(java.util.function.Consumer<T> consumer) { long currentTail = tail; long currentHead = (long) HEAD.getOpaque(this); int count = 0; while (currentTail < currentHead) { int index = (int) (currentTail & mask); T element = (T) BUFFER.getAcquire(buffer, index); if (element != null) { consumer.accept(element); BUFFER.setRelease(buffer, index, null); count++; } currentTail++; } // Batch update tail tail = currentTail; return count; } /** * Drains up to maxElements to the provided consumer. * Useful for rate-limited processing. * * @param consumer Function to process each element * @param maxElements Maximum number of elements to drain * @return Number of elements actually drained */ @SuppressWarnings("unchecked") public int drainTo(java.util.function.Consumer<T> consumer, int maxElements) { long currentTail = tail; long currentHead = (long) HEAD.getOpaque(this); long available = currentHead - currentTail; int toDrain = (int) Math.min(available, maxElements); for (int i = 0; i < toDrain; i++) { int index = (int) (currentTail & mask); T element = (T) BUFFER.getAcquire(buffer, index); if (element != null) { consumer.accept(element); BUFFER.setRelease(buffer, index, null); } currentTail++; } tail = currentTail; return toDrain; } // ========== Query Operations ========== /** * Returns approximate size of unread elements. * May be stale due to concurrent modifications. */ public int size() { long currentHead = (long) HEAD.getOpaque(this); long currentTail = tail; long size = currentHead - currentTail; if (size < 0) return 0; if (size > capacity) return capacity; return (int) size; } /** * Returns true if buffer appears empty. */ public boolean isEmpty() { return (long) HEAD.getOpaque(this) <= tail; } /** * Returns the buffer's capacity. */ public int capacity() { return capacity; } /** * Returns total number of items written since creation. */ public long getWrittenCount() { return writtenCount; } /** * Returns number of items overwritten before being read. */ public long getOverwrittenCount() { return overwrittenCount; } /** * Returns the data loss ratio (overwritten / written). * A high ratio indicates the consumer can't keep up. */ public double getDataLossRatio() { long written = writtenCount; if (written == 0) return 0.0; return (double) overwrittenCount / written; }}
// Wait-free (always succeeds in one step)long writePosition = (long) HEAD.getAndAdd(this, 1L);// vs. Lock-free (might retry indefinitely)while (true) { long current = head; if (HEAD.compareAndSet(this, current, current + 1)) { break; }}
getAndAdd is a single atomic operation that always succeeds. It's implemented as a single CPU instruction (LOCK XADD on x86-64). There's no possibility of retry or spinning - every producer gets a unique position in one step.
Why do we use getOpaque for reading positions?
long currentTail = (long) TAIL.getOpaque(this);
getOpaque provides "opaque" memory ordering - it prevents the compiler from caching the value, but doesn't impose the full memory barrier of a volatile read. For our use case:
We need a fresh value (can't use a cached local variable)
We don't need acquire semantics (we're not coordinating with specific writes)
The slight staleness is acceptable (we'll see the update soon enough)
This is faster than getVolatile while still being correct for our algorithm.
Why do we need the overflow CAS loop?
while (observedTail < newTail) { if (TAIL.compareAndSet(this, observedTail, newTail)) { break; } observedTail = (long) TAIL.getOpaque(this);}
Multiple producers might overflow simultaneously and all try to advance the tail. The CAS loop ensures:
Only one producer's advance "wins"
The tail moves forward monotonically
We don't accidentally move tail backward
But note: this loop is bounded! Each producer only executes the loop body a limited number of times because:
Each CAS failure means another producer advanced tail
Eventually, observedTail >= newTail and we exit
The maximum iterations equals the number of concurrent overflow producers
Why setRelease for writing elements?
BUFFER.setRelease(buffer, index, element);
Release semantics ensure that all writes before this store are visible to any thread that subsequently reads this location with acquire semantics. In our case:
Producer writes the element with release
Consumer reads with acquire
Consumer is guaranteed to see the complete element, not a partially constructed object
We could declare the buffer array elements as volatile:
private volatile Object[] buffer; // Doesn't help - array reference is volatile // but elements are not!
But even if we could make elements volatile, it would be overkill:
Volatile has full sequential consistency overhead
We only need producer-to-consumer visibility
Release/acquire is sufficient and faster
On x86-64, the difference is small (x86 has strong memory ordering), but on ARM or other weakly-ordered architectures, the performance difference can be significant.
The wait-free distribution is tightly clustered around 10-20ns with minimal tail, while the blocking buffer has a long tail extending into tens of microseconds.
The wait-free buffer generates 97% less allocation and 30x fewer GC events. The maximum GC pause dropped from 156ms to 21ms - critical for latency-sensitive systems.
Too small: - High data loss under load - Consumer can't keep up - Metrics gaps during spikesToo large: - Higher memory usage - Longer drain times - Older data when backloggedSizing formula: buffer_size = peak_write_rate * acceptable_latency * safety_factorExample: peak_write_rate = 200,000 events/sec acceptable_latency = 100ms (time for consumer to catch up) safety_factor = 2x buffer_size = 200,000 * 0.1 * 2 = 40,000 entries Round up to power of 2: 65,536 (64K entries)
public class TelemetryConsumer implements Runnable { private final WaitFreeOverwriteBuffer<MetricEvent> buffer; private final MetricsSink sink; private volatile boolean running = true; @Override public void run() { while (running) { int drained = buffer.drain(event -> { try { sink.record(event); } catch (Exception e) { // Log but don't crash - telemetry shouldn't kill the system logger.warn("Failed to record metric", e); } }); if (drained == 0) { // Buffer empty - back off to reduce CPU LockSupport.parkNanos(100_000); // 100μs } } } public void stop() { running = false; }}
Pattern 2: Batch Processing
public class BatchingConsumer { private static final int BATCH_SIZE = 1000; private final List<MetricEvent> batch = new ArrayList<>(BATCH_SIZE); public void processBatch(WaitFreeOverwriteBuffer<MetricEvent> buffer) { batch.clear(); buffer.drainTo(batch::add, BATCH_SIZE); if (!batch.isEmpty()) { // Process entire batch together - more efficient metricsSink.recordBatch(batch); } }}
Pattern 3: Sampling Under Load
public class SamplingConsumer { private final WaitFreeOverwriteBuffer<MetricEvent> buffer; private final Random random = ThreadLocalRandom.current(); public void process() { int size = buffer.size(); // If buffer is getting full, sample instead of draining everything double sampleRate = size > buffer.capacity() * 0.8 ? 0.1 // Sample 10% when buffer is 80%+ full : 1.0; // Process everything otherwise buffer.drain(event -> { if (random.nextDouble() < sampleRate) { sink.record(event); } }); }}
Real telemetry systems handle multiple event types. Here's how to extend the buffer:
public class TypedMetricEvent { public enum Type { COUNTER, GAUGE, HISTOGRAM, TIMER } private final Type type; private final String name; private final long value; private final long timestamp; private final String[] tags; // Use object pooling to avoid allocation private static final ThreadLocal<TypedMetricEvent> POOL = ThreadLocal.withInitial(TypedMetricEvent::new); public static TypedMetricEvent acquire() { return POOL.get(); } public TypedMetricEvent set(Type type, String name, long value, String... tags) { this.type = type; this.name = name; this.value = value; this.timestamp = System.nanoTime(); this.tags = tags; return this; }}
To eliminate all contention, use per-thread buffers:
public class ShardedTelemetryBuffer<T> { private final WaitFreeOverwriteBuffer<T>[] shards; private final ThreadLocal<Integer> shardIndex; @SuppressWarnings("unchecked") public ShardedTelemetryBuffer(int shardCount, int shardCapacity) { shards = new WaitFreeOverwriteBuffer[shardCount]; for (int i = 0; i < shardCount; i++) { shards[i] = new WaitFreeOverwriteBuffer<>(shardCapacity); } AtomicInteger counter = new AtomicInteger(); shardIndex = ThreadLocal.withInitial(() -> counter.getAndIncrement() % shardCount); } public void record(T element) { // Each thread writes to its own shard - zero contention! shards[shardIndex.get()].record(element); } public void drainAll(java.util.function.Consumer<T> consumer) { for (WaitFreeOverwriteBuffer<T> shard : shards) { shard.drain(consumer); } }}
While our buffer never blocks producers, we can signal backpressure to allow upstream adjustment:
public class BackpressureAwareTelemetry { private final WaitFreeOverwriteBuffer<MetricEvent> buffer; private final AtomicBoolean backpressureSignal = new AtomicBoolean(false); public void record(MetricEvent event) { buffer.record(event); // Signal backpressure when buffer is getting full double utilization = (double) buffer.size() / buffer.capacity(); backpressureSignal.set(utilization > 0.8); } public boolean isBackpressured() { return backpressureSignal.get(); } // Upstream can use this to shed load public void recordIfNotBackpressured(MetricEvent event) { if (!isBackpressured()) { record(event); } // Silently drop if backpressured }}
That Tuesday morning crisis taught me something fundamental about observability: the observer must never become the observed problem.
Our original telemetry system was technically correct. It guaranteed delivery of every metric. It maintained strict ordering. It was simple to understand. And it was completely useless when we needed it most - during a crisis, it became part of the crisis.
The wait-free telemetry buffer we built represents a different philosophy:
Availability over Consistency: It's better to have approximate metrics than no metrics at all.
Recent over Complete: Recent data is more valuable than historical data when debugging live issues.
Performance over Guarantees: The system's primary function must never be compromised by observability.
Bounded over Unbounded: Predictable resource usage, even under extreme load.
The results validated this philosophy: 15-20x lower average latency (287ns to 18ns), 100x+ better tail latency (12.5us to 89ns at p99.9), a 97% reduction in GC pressure, and linear scaling with producer count.
But perhaps the most important result isn't in the numbers. It's that during subsequent high-load events, we could actually see what was happening. The telemetry kept flowing. The dashboards stayed updated. We could diagnose issues in real-time instead of piecing together logs after the fact.
Measure first, then optimize. We found the problem through profiling, not guessing.
Question your assumptions. "Telemetry must capture every event" seemed obvious - until it wasn't.
Understand the trade-offs. Wait-free isn't universally better. Know what you're giving up.
Design for the worst case. Systems fail under load. Your observability layer shouldn't.
Hardware matters. Cache lines, memory ordering, atomic operations - these are real constraints that affect real performance.
The 2AM alerts that started this journey were painful. But they led to a telemetry system that now handles 400+ million events per second without impacting the trading system at all. More importantly, it keeps working when everything else is falling apart.
Next time you build a telemetry system, ask yourself: will this help me or hurt me during a crisis? If you can't confidently answer "help," it might be time to reconsider your approach.
The most insidious bugs in lock-free code come from incorrect memory ordering assumptions.
// WRONG: No memory ordering guaranteebuffer[index] = element;head = newHead; // Consumer might see new head before element is written!// CORRECT: Release semantics ensure element is visible before head updateBUFFER.setRelease(buffer, index, element);// Or if using volatile:buffer[index] = element;VarHandle.releaseFence();head = newHead; // Now correctly ordered
Debugging tip: If you see occasional data corruption that's hard to reproduce, suspect memory ordering issues. Add explicit barriers and see if the problem disappears. Then carefully analyze which barriers are actually necessary.
Our overflow detection compares positions, but positions wrap around. Consider:
// Dangerous if positions can wrapif (writePosition - currentTail >= capacity) { // Overflow detected}
With 32-bit integers, wraparound happens after ~4 billion operations. At 100 million ops/sec, that's 40 seconds. At 1 million ops/sec, it's about 70 minutes. Either way, it will happen in production.
Solution: Use 64-bit long for positions. At 100 million ops/sec, wraparound takes 5,800 years.
private volatile long head = 0; // 64-bit, practically never wrapsprivate volatile long tail = 0;
If your buffer allows null elements, the consumer can't distinguish "empty slot" from "null data":
// Problem: null could mean empty OR actual null valueT element = (T) BUFFER.getAcquire(buffer, index);if (element == null) { return null; // Is buffer empty, or was null stored?}
Solutions:
Disallow nulls (our approach):
public void record(T element) { if (element == null) { throw new NullPointerException("Null elements not permitted"); } // ...}
Use a wrapper:
private static final Object EMPTY = new Object();// Store wrapper instead of raw valuebuffer[index] = (element == null) ? NULL_WRAPPER : element;
Track slot state separately:
private final long[] slotState; // Parallel array tracking filled/empty
It's not enough to test that the buffer works - you need to verify the wait-free guarantee:
@Testvoid verifyWaitFreeProperty() { WaitFreeOverwriteBuffer<Long> buffer = new WaitFreeOverwriteBuffer<>(1024); AtomicLong maxOperationTime = new AtomicLong(0); // Run many operations and track worst-case time for (int i = 0; i < 10_000_000; i++) { long start = System.nanoTime(); buffer.record((long) i); long elapsed = System.nanoTime() - start; maxOperationTime.updateAndGet(max -> Math.max(max, elapsed)); } // Wait-free means bounded worst case // Anything over 1us suggests blocking or unbounded spinning assertTrue(maxOperationTime.get() < 1000, "Worst case time " + maxOperationTime.get() + "ns exceeds wait-free bound");}
We didn't replace the old telemetry immediately. Instead, we ran both systems in parallel:
public class DualTelemetryRecorder { private final BlockingCircularBuffer<MetricEvent> oldBuffer; private final WaitFreeOverwriteBuffer<MetricEvent> newBuffer; private final AtomicLong oldCount = new AtomicLong(); private final AtomicLong newCount = new AtomicLong(); public void record(MetricEvent event) { // Always record to new buffer (never blocks) newBuffer.record(event); newCount.incrementAndGet(); // Try to record to old buffer (might block) try { if (oldBuffer.offer(event, 1, TimeUnit.MILLISECONDS)) { oldCount.incrementAndGet(); } } catch (InterruptedException e) { Thread.currentThread().interrupt(); } }}
This let us:
Verify the new buffer was capturing the same metrics
Measure performance difference in production
Build confidence before cutting over
Results from shadow mode:
Old buffer: 2.3% of records timed out under load
New buffer: 0.8% data loss (overwritten, not blocked)
public class TelemetryRouter { private final WaitFreeOverwriteBuffer<MetricEvent> newBuffer; private final BlockingCircularBuffer<MetricEvent> oldBuffer; private volatile int newBufferPercentage = 0; // 0-100 private final Random random = ThreadLocalRandom.current(); public void record(MetricEvent event) { if (random.nextInt(100) < newBufferPercentage) { newBuffer.record(event); } else { try { oldBuffer.put(event); } catch (InterruptedException e) { Thread.currentThread().interrupt(); } } } // Called by feature flag system public void setNewBufferPercentage(int percentage) { this.newBufferPercentage = Math.max(0, Math.min(100, percentage)); }}
The most important improvement isn't in the numbers - it's that we can now actually observe the system during high-load events. When the market goes crazy, we see what's happening in real-time instead of reconstructing events from sparse logs afterward.