In Part 4, we traced a request from the browser through Kong and Apollo Router into four subgraphs. That trace was a conceptual walkthrough. In production, you need the same visibility rendered in real-time — automatically, across every request, in every language.
A federated architecture amplifies the need for observability. A single GraphQL query can fan out to four services written in three languages, each with its own database. When latency spikes, you need to know whether the bottleneck is in the Java Product Catalog's Meilisearch query, the Go Order Service's Stripe call, or the TypeScript User Service's database pool. When error budgets erode, you need to know which subgraph is responsible. When a CPU spike correlates with a slow trace, you need to see the flame graph for that exact span.
This platform uses OpenTelemetry as the instrumentation standard across all services, exporting to a collector that routes signals to purpose-built backends. But we go further than basic telemetry collection. Spanmetrics connectors derive RED metrics (Rate, Errors, Duration) directly from traces, eliminating manual metric instrumentation, while service graph connectors build topology maps from trace data showing how subgraphs call each other. Tail sampling keeps all errors and slow requests while sampling healthy traffic at 25%, ensuring you never lose the traces that matter. On the monitoring side, SLO tracking with Prometheus recording rules computes availability and latency SLIs with 30-day error budget burn, and alerting rules fire on SLO breaches, high error rates, and infrastructure failures. Tying it all together, cross-signal correlations let you click a trace and jump directly to its logs, metrics, and CPU profiles.
This is the Grafana LGTM+ stack — Loki, Grafana, Tempo, Mimir (with Prometheus filling the metrics role here), plus Pyroscope for continuous profiling — and it transforms a polyglot federation from a black box into a glass house.
The Observability Architecture
Before diving into configuration, let's see the full signal flow. Every telemetry signal — traces, metrics, logs, profiles — follows a path from application code to a purpose-built backend, with Grafana providing the unified view.
Several things stand out about this architecture:
- Services only speak OTLP. Every application service exports telemetry in a single protocol (OpenTelemetry Protocol) to a single endpoint (the OTel Collector). Services don't know about Tempo, Prometheus, or Loki.
- The Collector is a processing hub, not a passthrough. It applies tail sampling, generates RED metrics from traces via connectors, builds service topology graphs, and routes signals to the right backends.
- Logs flow through a separate path. Application logs go to stdout (container best practice), where Grafana Alloy discovers them via the Docker socket, extracts trace IDs for correlation, and pushes to Loki.
- Tempo generates additional metrics. Beyond the Collector's spanmetrics, Tempo's own metrics generator produces service graph and span metrics with exemplars, remote-writing them to Prometheus.
- Grafana correlates everything. A single trace links to its logs in Loki, its metrics in Prometheus, and its CPU profile in Pyroscope. You can navigate between signals without copying trace IDs.
Let's build each layer from the bottom up.
Three Pillars, Three Languages
Traces: Following a Request Across Services
Distributed tracing is the most valuable observability signal in a federated architecture. A trace represents the full lifecycle of a single client request as it moves through the gateway and subgraphs.
Trace Propagation
The Apollo Router creates a root span for every incoming GraphQL operation. When it fans out to subgraphs, it propagates the trace context via W3C traceparent headers. Each subgraph creates child spans that automatically become part of the same trace.
In Tempo's trace view, this renders as a waterfall diagram where you can see exactly where time is spent. The parallel subgraph fetches overlap visually, confirming that the Router executed them concurrently. With TraceQL, you can search for traces matching specific patterns:
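For example (TraceQL; a sketch using this platform's service names):

```traceql
{ resource.service.name = "product-catalog" && duration > 500ms }
```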
This finds every trace where the product-catalog subgraph took over 500ms — something that would be impossible to find from logs alone.
Java (Micronaut) Instrumentation
The Java services use Micronaut's OpenTelemetry integration, which auto-instruments HTTP handlers, database queries, and gRPC calls:
With these dependencies, every incoming HTTP request, every outgoing gRPC call, and every JDBC query generates spans automatically. The Micronaut filter chain creates a parent span for each GraphQL request, and the gRPC client interceptor creates child spans for inventory lookups.
Custom spans for business logic can be added manually via the tracer API:
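A sketch of that manual pattern, assuming an injected OpenTelemetry `Tracer`; the method and client names (`searchProducts`, `meilisearchClient`) are illustrative, not the service's actual code:

```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.context.Scope;

public List<Product> searchProducts(String query) {
    Span span = tracer.spanBuilder("product.search")
        .setAttribute("search.query.length", query.length())
        .startSpan();
    // try-with-resources makes this span current for child operations
    // (like the HTTP call to Meilisearch) and restores context on exit
    try (Scope scope = span.makeCurrent()) {
        return meilisearchClient.search(query);
    } catch (Exception e) {
        span.recordException(e);
        span.setStatus(StatusCode.ERROR);
        throw e;
    } finally {
        span.end();
    }
}
```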
The tracer.spanBuilder() pattern is idiomatic in Java. The try-with-resources on Scope ensures the span is the current active span for any child operations (like HTTP calls to Meilisearch) and that it closes properly even on exceptions.
For the gRPC server in the Inventory service, Micronaut's gRPC integration auto-instruments both the server and client sides:
The trace flows seamlessly: Router span → Product Catalog HTTP span → gRPC client span → Inventory gRPC server span → Inventory JDBC span. Five spans, three services, two protocols, one trace.
Go Instrumentation
The Go Order Service uses the standard go.opentelemetry.io/otel SDK with HTTP middleware:
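In outline, wiring the middleware might look like this (a sketch using the `otelhttp` contrib package; the mux and port are assumptions):

```go
// otelhttp.NewHandler wraps the whole mux so every request gets a server span
// with the route, status code, and timing recorded automatically.
mux := http.NewServeMux()
mux.HandleFunc("/graphql", graphqlHandler)

handler := otelhttp.NewHandler(mux, "order-service")
log.Fatal(http.ListenAndServe(":8080", handler))
```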
For the Stripe integration, custom spans capture external API calls with business-relevant attributes:
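A sketch of such a span; the span name, attribute keys, and the `createPaymentIntent` helper are illustrative:

```go
ctx, span := otel.Tracer("order-service").Start(ctx, "stripe.payment_intent.create")
defer span.End()

span.SetAttributes(
	attribute.String("payment.currency", currency),
	attribute.Int64("payment.amount_cents", amountCents),
)

intent, err := createPaymentIntent(ctx, params) // wraps the Stripe SDK call
if err != nil {
	span.RecordError(err) // marks the trace so tail sampling always keeps it
	span.SetStatus(codes.Error, "stripe payment intent failed")
	return nil, err
}
```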
The span.RecordError(err) call is important — it attaches the error to the span, which the tail sampling policy in the OTel Collector uses to decide that this trace should always be kept (never sampled out).
Go's context-based tracing has a distinct advantage: the ctx parameter is required by convention, so trace context propagation is explicit and hard to forget. In contrast, Java and TypeScript rely on thread-local or async-local storage, which can silently lose context in edge cases.
TypeScript Instrumentation
The User Service uses Node.js auto-instrumentation, which patches HTTP, Express, and database drivers at module load time:
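A minimal bootstrap file might look like this (a sketch; the exporter URL assumes the Collector's Compose hostname):

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  serviceName: "user-service",
  traceExporter: new OTLPTraceExporter({
    url: "http://otel-collector:4318/v1/traces",
  }),
  // Patches http, express, pg, and other supported modules at load time
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```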
Auto-instrumentation captures HTTP request/response, Express route handling, and PostgreSQL query execution without any code changes in the application logic. Every database query in Drizzle ORM generates a span with the SQL statement (parameterized, not with values) and execution time.
The key for TypeScript is that this file must be loaded before any application code. The --require flag in the Dockerfile ensures this:
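Something along these lines (the file paths are assumptions about the build output):

```dockerfile
# Load the OpenTelemetry bootstrap before any application module
CMD ["node", "--require", "./dist/instrumentation.js", "./dist/index.js"]
```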
If instrumentation loads after Express or pg are imported, the monkey-patching won't catch those modules, and you'll get empty traces.
Structured Logging with Trace Context
For logs to participate in cross-signal correlation, every log line must include the trace ID. Each language handles this differently.
Java (Logback + MDC):
Micronaut's OpenTelemetry integration automatically sets traceId and spanId in the Mapped Diagnostic Context (MDC). The Logback JSON encoder includes MDC fields in every log line. No application code changes needed.
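The Logback side might be as small as this (a sketch using the common `logstash-logback-encoder`, which emits MDC fields in its JSON output by default):

```xml
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <!-- JSON lines to stdout, including MDC fields like traceId/spanId -->
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
  </appender>
  <root level="INFO">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>
```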
Go (slog + trace context):
Go requires explicit trace context extraction because slog doesn't integrate with OpenTelemetry automatically. The helper function keeps the boilerplate manageable. In production, you'd wrap this in a middleware or use a library like go.opentelemetry.io/contrib/bridges/otelslog.
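A hand-rolled version of that helper might look like this (a sketch; `logWithTrace` is an assumed name, not from the service's code):

```go
// logWithTrace returns a logger enriched with the current trace context,
// so every line it emits carries traceId/spanId for Loki correlation.
func logWithTrace(ctx context.Context, logger *slog.Logger) *slog.Logger {
	sc := trace.SpanContextFromContext(ctx)
	if !sc.IsValid() {
		return logger // no active span: log without trace fields
	}
	return logger.With(
		slog.String("traceId", sc.TraceID().String()),
		slog.String("spanId", sc.SpanID().String()),
	)
}
```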
TypeScript (pino + AsyncLocalStorage):
Pino's mixin function runs for every log line, extracting the active span from Node.js AsyncLocalStorage. This works because the OpenTelemetry auto-instrumentation sets up the async context correctly for Express request handlers.
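A minimal sketch of that mixin, using `@opentelemetry/api`'s `trace.getActiveSpan()`:

```typescript
import pino from "pino";
import { trace } from "@opentelemetry/api";

export const logger = pino({
  mixin() {
    // Runs for every log line: pull the active span from async context
    const span = trace.getActiveSpan();
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    return { traceId, spanId };
  },
});
```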
The key requirement across all three languages: structured JSON output to stdout. Docker captures stdout. Alloy parses the JSON. The traceId field becomes a Loki label. Grafana links logs to traces. The chain breaks if any service emits unstructured text logs.
How Traces Differ Across Languages
The three language SDKs produce identical OTLP trace data, but the ergonomics differ:
| Aspect | Java | Go | TypeScript |
|---|---|---|---|
| Auto-instrumentation | Framework-integrated (Micronaut) | Middleware wrapping | Module monkey-patching |
| Context propagation | Thread-local (Scope) | Explicit ctx parameter | AsyncLocalStorage |
| Custom spans | tracer.spanBuilder() | otel.Tracer().Start(ctx) | tracer.startSpan() |
| Risk of losing context | Virtual threads may break thread-local | Forgetting to pass ctx | Async gaps in callbacks |
| gRPC support | Micronaut interceptor (auto) | Middleware (manual) | N/A (no gRPC in User svc) |
Apollo Router: The Root Span
The Apollo Router is the entry point for all federated queries, and it creates the root span that every subgraph trace attaches to. The Router has built-in OTLP export:
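The relevant telemetry block might look like this (a sketch; the exact key names have shifted across Router releases, so treat it as indicative):

```yaml
telemetry:
  exporters:
    tracing:
      common:
        service_name: router
      otlp:
        enabled: true
        endpoint: http://otel-collector:4317
        protocol: grpc
```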
The Router produces several span types that are critical for federation debugging. The router span is the top-level span for the entire GraphQL operation, containing the supergraph span that covers the query planning phase. Beneath these sit the subgraph spans — one per subgraph fetch, with the subgraph name as an attribute — and their child subgraph_request spans representing the actual HTTP calls.
When the Router fans out to multiple subgraphs in parallel, the subgraph spans overlap in the trace waterfall — visual confirmation that the query plan is executing concurrently. If you see sequential subgraph spans for a query that should parallelize, check the query plan for unnecessary dependencies.
The subgraph.graphql.operation.name attribute flows into the spanmetrics connector, so you get per-operation metrics at the federation level. You can answer questions like: "What's the P95 latency for the GetProductDetails operation across all subgraphs?"
Metrics: From Spans to RED, Automatically
In a traditional setup, you'd instrument each service to expose Prometheus metrics — counters for requests, histograms for latency, error counters. That means three separate instrumentation efforts in three languages, hoping they use consistent metric names and labels.
Our stack takes a different approach: generate metrics from traces. The OTel Collector's spanmetrics connector watches every span that passes through and automatically produces rate, error, and duration metrics. Zero application-side metric code required.
But we still scrape native metrics from services that expose them. Prometheus pulls from multiple sources:
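A sketch of the scrape configuration implied by those sources (hostnames assume Docker Compose service names):

```yaml
scrape_configs:
  - job_name: otel-collector-exports   # spanmetrics + servicegraph output
    static_configs:
      - targets: ["otel-collector:8889"]
  - job_name: otel-collector-internal  # the Collector's own health metrics
    static_configs:
      - targets: ["otel-collector:8888"]
  - job_name: tempo
    static_configs:
      - targets: ["tempo:3200"]
  - job_name: pyroscope
    static_configs:
      - targets: ["pyroscope:4040"]
```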
Notice the two Collector scrape targets. Port 8889 exports the spanmetrics and servicegraph data that the Collector generates from traces. Port 8888 exports the Collector's own health metrics (queue sizes, dropped spans, processing latency). We also scrape Tempo and Pyroscope themselves — observing the observers.
Key metrics to monitor in a federated architecture:
| Metric | Source | Meaning |
|---|---|---|
| `traces_spanmetrics_duration_milliseconds_*` | Spanmetrics connector | Request rate, latency histograms, error rate — derived from traces |
| `traces_service_graph_request_total` | Servicegraph connector | Cross-service call rate (who calls whom) |
| `http_server_request_duration_seconds` | Each subgraph | Native HTTP latency histogram |
| `apollo_router_http_requests_total` | Router | Total queries by operation name |
| `db_client_connections_usage` | Each subgraph | Database connection pool saturation |
| `grpc_client_duration_seconds` | Product Catalog | gRPC call latency to Inventory |
Logs: Grafana Alloy Replaces Promtail
In the original stack, Promtail collected container logs and shipped them to Loki. Promtail works, but it's a single-purpose tool with a YAML-based pipeline that becomes unwieldy for complex parsing.
Grafana Alloy replaces Promtail with a unified telemetry collector built around the River configuration language — a declarative, component-based syntax that reads like a flow diagram. Alloy can collect logs, metrics, traces, and profiles, but here we use it specifically for Docker log collection with trace ID extraction.
Let's walk through what happens when a container emits a log line:
1. Discovery (`discovery.docker`): Alloy connects to the Docker socket and discovers all running containers. Every 5 seconds, it checks for new or removed containers.
2. Relabeling (`discovery.relabel`): The Docker Compose service name (the `com.docker.compose.service` label) becomes the `service` label in Loki. The container name gets cleaned up (removing the leading `/`). The compose project name is preserved.
3. Collection (`loki.source.docker`): Alloy tails the log output from each discovered container and forwards raw log lines to the processing pipeline.
4. JSON parsing (`stage.json`): Since all our services emit structured JSON logs, Alloy parses each line and extracts fields: `level`, `msg`, `timestamp`, `traceId`, `spanId`, `service`.
5. Label extraction (`stage.labels`): The `level` and `traceId` fields become Loki labels. This is critical — having `traceId` as a label means Grafana can link any log line to its corresponding trace in Tempo.
6. Timestamp (`stage.timestamp`): If the log contains its own timestamp, Alloy uses it instead of the collection time. This prevents clock drift between when the log was emitted and when it was collected.
7. Push to Loki (`loki.write`): Processed log entries are pushed to Loki's API.
The River syntax has a clear advantage over Promtail's YAML: the data flow is visible. You can see that discovery.docker feeds into discovery.relabel, which feeds into loki.source.docker, which feeds into loki.process, which feeds into loki.write. Each component is a node in a pipeline graph.
With logs in Loki and trace IDs extracted as labels, you can query:
Or find all logs for a specific trace:
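Both are simple LogQL label-selector queries (sketches; the label names follow the Alloy pipeline above, and the trace ID is a placeholder):

```logql
{service="order-service", level="error"}

{traceId="<trace-id>"}
```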
The OTel Collector: Processing Hub
The OpenTelemetry Collector is the nervous system of this stack. In a naive setup, it's just a proxy — receive OTLP, forward to backends. Our configuration turns it into a processing hub that generates new telemetry signals, applies intelligent sampling, and enriches data before routing.
The Full Configuration
This is a lot of YAML. Let's break it into the five key concerns.
Concern 1: Tail Sampling
Head-based sampling (deciding at the start of a trace whether to keep it) is simple but wasteful. It either keeps too much healthy traffic or drops error traces. Tail-based sampling waits until the trace is complete, then decides.
The policies are evaluated in order, and a trace is kept if any policy matches:
- Errors: Any trace with an ERROR status code is always kept. If the Go Order Service's Stripe call fails, you'll see it.
- Slow requests: Any trace longer than 500ms is kept. If the Java Product Catalog's Meilisearch query is slow, you'll have the trace.
- Probabilistic: Of the remaining healthy, fast traces, 25% are sampled randomly. This gives you baseline visibility without storing every trace.
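The three policies above, as a `tail_sampling` processor sketch (values taken from this section):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 25
```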
The decision_wait: 10s means the Collector buffers spans for 10 seconds before making the sampling decision. This is necessary because spans from different services arrive at different times — the Router's span might arrive before the database span that makes the trace "slow." The tradeoff is a 10-second delay before traces appear in Tempo.
The num_traces: 100000 limits memory usage. If more than 100,000 concurrent traces are being buffered, the oldest ones are force-sampled.
The result: you keep 100% of interesting traces and a representative sample of everything else. Storage costs drop by ~73% without losing visibility into failures.
Concern 2: Spanmetrics Connector
The spanmetrics connector is one of the most powerful features in the OTel Collector. It watches every span in the traces pipeline and generates Prometheus-compatible metrics:
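A sketch of the connector configuration consistent with the dimensions and exemplar behavior described here (the attribute names and histogram buckets are assumptions):

```yaml
connectors:
  spanmetrics:
    namespace: traces_spanmetrics
    histogram:
      explicit:
        buckets: [5ms, 25ms, 100ms, 250ms, 500ms, 1s, 2s, 5s]
    dimensions:
      - name: http.method
      - name: http.status_code
      - name: http.route
      - name: rpc.method
      - name: graphql.operation.name
    exemplars:
      enabled: true
```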
For every span, it produces:
- `traces_spanmetrics_duration_milliseconds_count` — request count (Rate)
- `traces_spanmetrics_duration_milliseconds_sum` — total duration
- `traces_spanmetrics_duration_milliseconds_bucket` — latency histogram (Duration)
These metrics are broken down by the configured dimensions: HTTP method, status code, route, RPC method, GraphQL operation name, and service name. This means you can query request rate per GraphQL operation per service — without writing a single line of metrics code in any of the three languages.
The exemplars: enabled: true setting links metrics back to traces. When you see a latency spike in a Prometheus graph, the exemplar gives you the trace ID of the request that caused it. Click the exemplar dot in Grafana, and you're in the Tempo trace view.
The magic is in the pipeline wiring:
The spanmetrics connector appears as an exporter in the traces pipeline and a receiver in the metrics pipeline. Spans flow in, metrics flow out. The same pattern applies to servicegraph.
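In `service.pipelines` terms, that wiring looks roughly like:

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, transform, batch]
      exporters: [otlp/tempo, spanmetrics, servicegraph]
    metrics:
      # The connectors re-enter here as receivers: spans in, metrics out
      receivers: [otlp, spanmetrics, servicegraph]
      processors: [batch]
      exporters: [prometheus]
```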
Concern 3: Service Graph Connector
The servicegraph connector builds a topology of service-to-service communication from trace data:
It produces metrics like:
- `traces_service_graph_request_total{client="router", server="product-catalog"}` — how many requests the Router sends to Product Catalog
- `traces_service_graph_request_duration_seconds_bucket{client="product-catalog", server="inventory"}` — latency histogram for gRPC calls from Product Catalog to Inventory
- `traces_service_graph_request_failed_total{client="order-service", server="user-service"}` — failed cross-service calls
Grafana's node graph visualization uses these metrics to render a live service map. You can see request rates on edges, error rates as red highlights, and click any node to drill into its metrics.
Concern 4: Transform Processor
The transform processor adds computed attributes to spans using the OpenTelemetry Transformation Language (OTTL):
This adds a span.duration_ms attribute to every span, computed from the span's start and end times. This is useful for filtering in TraceQL and for adding duration-based columns in Grafana's trace table view.
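An OTTL statement of that shape might be (a sketch):

```yaml
processors:
  transform:
    trace_statements:
      - context: span
        statements:
          # Nanoseconds to milliseconds, stored as a span attribute
          - set(attributes["span.duration_ms"], (end_time_unix_nano - start_time_unix_nano) / 1000000)
```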
Concern 5: Backend Routing
The final piece is routing signals to the right backends. Traces flow to Tempo via OTLP gRPC (and to the spanmetrics/servicegraph connectors), metrics go to Prometheus via the Prometheus exporter on port 8889, and logs reach Loki via the Loki exporter, supplementary to Alloy's Docker log collection.
The resource_to_telemetry_conversion: enabled: true in the Prometheus exporter converts OTel resource attributes (like service.name) into Prometheus labels. Without this, resource attributes would be dropped and you couldn't filter metrics by service.
Grafana Tempo: Traces with Power
Tempo replaces Jaeger as the trace backend. The reasons are practical:
- No indexing required. Tempo stores traces by trace ID only, using object storage (or local filesystem). This makes it dramatically cheaper to operate at scale.
- TraceQL. A purpose-built query language for traces that's far more expressive than Jaeger's tag-based search.
- Metrics generator. Tempo can produce service graph and span metrics with exemplars, remote-writing them to Prometheus.
- Native Grafana integration. Trace-to-logs, trace-to-metrics, trace-to-profiles correlations are built into the Grafana Tempo datasource.
Tempo Configuration
Several design decisions here:
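A trimmed sketch of the Tempo settings those decisions correspond to (key names are approximate and should be checked against the Tempo version in use):

```yaml
metrics_generator:
  registry:
    external_labels:
      source: tempo
  processor:
    service_graphs:
      peer_attributes: [service.name, db.system, messaging.system]
    span_metrics:
      filter_policies:
        - include:
            match_type: strict
            attributes:
              - key: kind
                value: SPAN_KIND_SERVER
  storage:
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true

compactor:
  compaction:
    block_retention: 168h   # 7 days
```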
Metrics generator with remote_write: Tempo generates its own span metrics and service graph metrics, remote-writing them to Prometheus with send_exemplars: true. This is complementary to the OTel Collector's spanmetrics — Tempo's generator has access to the full trace (after assembly), while the Collector processes individual spans as they arrive. The external labels source: tempo distinguish these metrics from the Collector's.
Service graph peer attributes: The peer_attributes list tells Tempo what to look for when building the service graph. Beyond service.name, it also tracks db.system (PostgreSQL, Redis) and messaging.system — so the service map shows database nodes, not just application services.
Filter policies for span metrics: The filter_policies section ensures that only server-side spans generate metrics. Without this filter, client spans would double-count every request (the client and server both see the same call). Filtering to SPAN_KIND_SERVER gives accurate request rates.
Block retention: 7 days of trace storage (168h). Traces older than 7 days are compacted away. For a development environment, this is more than sufficient; production deployments would typically use object storage (S3, GCS) with longer retention.
TraceQL Queries
With Tempo, you can search traces using TraceQL — a query language purpose-built for distributed traces:
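Two sketches in the spirit of this section — the first joins two spansets with `&&`, the second uses the descendant operator `>>`:

```traceql
{ resource.service.name = "router" } && { resource.service.name = "product-catalog" && status = error }

{ resource.service.name = "product-catalog" } >> { duration > 200ms }
```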
TraceQL's structural queries (the && between spansets) are particularly powerful for federation debugging. You can find traces where the Router called Product Catalog but not Inventory, suggesting a query plan that skipped entity resolution.
Grafana Pyroscope: Continuous Profiling
Traces tell you where time is spent. Profiles tell you why. Pyroscope provides continuous profiling — CPU flame graphs, memory allocations, and goroutine/thread analysis — for every service, every second.
Pyroscope Configuration
The configuration is intentionally minimal. Pyroscope is a write-heavy system — services push profile data continuously — so the limits section prevents any single service from overwhelming storage. In production, you'd use object storage instead of filesystem.
Trace-to-Profile Correlation
The real power of Pyroscope in this stack is the trace-to-profile link. When you find a slow trace in Tempo, Grafana can show you the CPU flame graph for that exact time window. Was the slowness caused by garbage collection? Regex parsing? A lock contention? The profile answers questions that traces can't.
This correlation is configured in the Grafana datasource (covered in the cross-signal section below).
Prometheus: Recording Rules and SLO Tracking
Raw metrics are useful, but recording rules transform them into higher-level signals: service-level indicators (SLIs) and error budget tracking. This is where observability becomes actionable.
Recording Rules
These recording rules pre-compute RED metrics from the spanmetrics data every 30 seconds. The resulting time series (service:request_rate:5m, service:error_rate:5m, service:latency_p95:5m) are cheap to query in dashboards and alerts because the expensive rate() and histogram_quantile() computations happen once, not on every dashboard refresh.
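A sketch of rules producing those series (label names follow the spanmetrics output; the exact expressions in the repo may differ):

```yaml
groups:
  - name: red_metrics
    interval: 30s
    rules:
      - record: service:request_rate:5m
        expr: sum by (service_name) (rate(traces_spanmetrics_duration_milliseconds_count[5m]))
      - record: service:error_rate:5m
        expr: |
          sum by (service_name) (rate(traces_spanmetrics_duration_milliseconds_count{status_code="STATUS_CODE_ERROR"}[5m]))
          /
          sum by (service_name) (rate(traces_spanmetrics_duration_milliseconds_count[5m]))
      - record: service:latency_p95:5m
        expr: histogram_quantile(0.95, sum by (service_name, le) (rate(traces_spanmetrics_duration_milliseconds_bucket[5m])))
```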
SLO Tracking
The slo_tracking group computes service-level indicators:
The SLO framework defines two SLIs:
- Availability SLI — the ratio of non-5xx responses. Target: 99.9%. This means an error budget of 0.1%, which translates to ~43 minutes of downtime per 30-day window.
- Latency SLI — the percentage of requests completing under 500ms. Target: 95%. If more than 5% of requests exceed half a second, the SLO is breached.
The error budget remaining rule (slo:error_budget_remaining:30d) shows how much of the 30-day error budget has been consumed. At 1.0, the budget is fully intact. At 0.0, it's exhausted. Below 0.0, the SLO is violated.
Federation-Specific Metrics
The third rule group tracks GraphQL Federation-specific signals:
The federation:subgraph_calls_rate:5m metric is uniquely valuable. It shows the actual traffic pattern between services — how many requests the Router sends to each subgraph, and how many internal calls happen (like Product Catalog calling Inventory via gRPC). This is derived entirely from trace data via the servicegraph connector.
Alerting Rules
Recording rules compute what to observe. Alerting rules decide when to act.
The tiered error rate alerts (warning at 5%, critical at 10%) give operators time to investigate before paging. The for clause prevents flapping — a momentary spike doesn't fire the alert.
SLO Alerts
The SLOErrorBudgetFastBurn alert fires when more than 50% of the 30-day error budget has been consumed. This is a multi-burn-rate alert pattern — if you're burning budget 14x faster than allowed, you'll hit 50% consumption in roughly 26 hours, giving you a full day to respond.
Federation Alerts
This alert uses the service graph metrics. If the Router's calls to any subgraph have a >5% failure rate, it fires. The $labels.server tells you which subgraph is struggling.
Infrastructure Alerts
Observing the observers. If the OTel Collector starts dropping spans (because Tempo is down, network issues, or buffer overflow), you need to know. If Prometheus storage is growing unchecked, you need to know before the disk fills.
Grafana: Cross-Signal Correlations
The most powerful aspect of the LGTM+ stack is cross-signal correlation. Grafana's datasource configuration wires the signals together so you can jump between traces, logs, metrics, and profiles without manual context-switching.
Datasource Configuration
This is where the magic happens. Let's trace each correlation path.
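The Tempo datasource provisioning might look like this (a sketch; the `jsonData` keys follow recent Grafana versions, and the datasource UIDs are assumptions):

```yaml
datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        spanStartTimeShift: "-1h"
        spanEndTimeShift: "1h"
        filterByTraceID: true
      tracesToMetrics:
        datasourceUid: prometheus
      tracesToProfiles:
        datasourceUid: pyroscope
      serviceMap:
        datasourceUid: prometheus
```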
Correlation 1: Trace → Logs
When you view a trace in Tempo, Grafana adds a "Logs" button. Clicking it opens a Loki query filtered by the trace ID and the service name of the selected span. The spanStartTimeShift and spanEndTimeShift values widen the time window by an hour in each direction, ensuring logs emitted slightly before or after the span are included.
Correlation 2: Trace → Metrics
When viewing a trace, Grafana shows "Request rate", "Error rate", and "P95 latency" links. These execute pre-configured Prometheus queries filtered to the service that produced the span. The `$__tags` placeholder is replaced with `service_name="product-catalog"` (or whichever service the span belongs to).
This answers the question: "Is this slow trace an anomaly, or is the service generally slow right now?" If the P95 latency is elevated across the board, it's a systemic issue. If the trace is an outlier, it's likely request-specific.
Correlation 3: Trace → Profile
When viewing a trace, Grafana shows a "Profiles" link that opens Pyroscope filtered to the same service and time window. If a span is slow and the CPU profile shows 80% of time in java.util.regex.Pattern.match, you've found your bottleneck — a regex-heavy validation that needs optimization.
Correlation 4: Logs → Traces
The reverse direction is equally important. When browsing logs in Loki, you want to jump to the trace that produced a log line.
Loki's derived fields use regex to extract trace IDs from log lines. Two patterns are configured because different language logging libraries format the trace ID differently — JSON format ("traceId":"abc123") and key-value format (trace_id=abc123). When a match is found, Grafana renders a "View Trace" link next to the log line.
Correlation 5: Metrics → Traces (Exemplars)
When Prometheus metrics include exemplars (enabled in our spanmetrics connector), Grafana renders small dots on metric graphs. Each dot represents a specific request with its trace ID. Clicking the dot opens the trace in Tempo.
This is the most powerful debugging path: you see a latency spike in a dashboard, click the exemplar dot at the peak, and you're looking at the exact trace that caused the spike.
The Grafana Service Map
The Tempo datasource configuration includes:
This enables Grafana's service map visualization, powered by the service graph metrics from both the OTel Collector's servicegraph connector and Tempo's metrics generator. The map renders nodes for each service (product-catalog, inventory, order-service, user-service, router) connected by edges showing request flow. Each edge displays request rates, with error rates highlighted in red and latency visible on hover.
For a federated GraphQL platform, this service map is invaluable. You can see at a glance which subgraphs are heavily loaded, which inter-service connections have errors, and how the Router distributes traffic.
The Docker Compose Observability Stack
The full observability stack runs as a Docker Compose overlay. Here's the complete configuration:
Key points about the Compose configuration:
Dependency ordering: The OTel Collector depends on Tempo and Loki (it needs somewhere to send data). Alloy depends on Loki. Grafana depends on Prometheus, Loki, and Tempo (all its datasources). Health checks ensure services are ready before dependents start.
Prometheus flags: --web.enable-remote-write-receiver allows Tempo to push metrics via remote write. --enable-feature=exemplar-storage enables the exemplar storage that links metrics to traces. Both are required for the full correlation stack.
Grafana feature toggles: The GF_FEATURE_TOGGLES_ENABLE environment variable enables TraceQL editor, streaming, cross-signal correlations, and the Tempo service graph visualization.
Alloy Docker socket: Alloy needs read-only access to the Docker socket for container discovery. This is the same pattern Promtail used, but Alloy's discovery is more flexible.
Start the full stack with:
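Assuming the overlay file is named `docker-compose.observability.yml` (an assumption about this repo's layout):

```shell
docker compose -f docker-compose.yml -f docker-compose.observability.yml up -d
```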
Port Reference
| Service | Port | Purpose |
|---|---|---|
| OTel Collector | 4317 | OTLP gRPC receiver |
| OTel Collector | 4318 | OTLP HTTP receiver |
| OTel Collector | 8889 | Prometheus exporter (spanmetrics) |
| Tempo | 3200 | HTTP API, TraceQL |
| Prometheus | 9090 | Metrics queries, remote write |
| Loki | 3100 | Log queries, push API |
| Pyroscope | 4040 | Profile queries, push API |
| Grafana | 3001 | Unified dashboards |
Grafana Dashboards
The platform ships with pre-provisioned dashboards that use all of the above signals. Let's walk through the two most important ones.
Service RED Metrics Dashboard
This dashboard shows Rate, Errors, and Duration for every service in the federation. It uses a $service template variable that queries Prometheus for all services with HTTP metrics.
Rate row: Two panels — total request rate (line graph) and request rate by endpoint (stacked bar chart). The total rate panel uses:
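Likely something close to this (a sketch over the spanmetrics series):

```promql
sum(rate(traces_spanmetrics_duration_milliseconds_count{service_name="$service"}[5m]))
```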
Errors row: Three panels — error rate percentage (line), errors by status code (stacked bars), and an error log stream from Loki. The log stream panel queries:
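Plausibly a label-selector query like:

```logql
{service="$service", level="error"}
```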
This provides real-time error logs alongside the error metrics. When you see the error rate spike, the corresponding error messages appear in the same dashboard.
Duration row: Three panels — latency percentiles (P50/P95/P99 line chart), a latency heatmap, and a table of slow traces from Tempo. The slow traces panel uses TraceQL:
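Presumably something like:

```traceql
{ resource.service.name = "$service" && duration > 500ms }
```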
This is the key insight of the RED dashboard: every dimension (rate, errors, duration) is covered by multiple signal types (metrics, logs, traces) in a single view. You don't need to switch between tools to investigate.
SLO Dashboard
The SLO dashboard tracks availability and latency against defined targets:
Availability SLO (target 99.9%): A gauge showing current availability, a gauge showing error budget remaining, and a burn rate chart with 1h and 6h windows. The burn rate threshold line at 14.4x (the "fast burn" rate) shows when you're consuming budget dangerously fast.
Latency SLO (target P99 < 500ms): A gauge showing current compliance, a gauge showing latency budget remaining, and a compliance-over-time chart with the 99.9% target line.
30-day Rolling History: Full-width charts showing availability and latency SLO compliance over a 30-day window, plus an error budget consumption timeline. This answers the question: "Are we trending toward an SLO breach?"
Debugging a Federated Query: The Full Workflow
With the full stack running, debugging a slow query follows a workflow that crosses all signal types. Let's walk through a realistic scenario.
Scenario: Elevated P99 Latency
A user reports that product pages are loading slowly. You open Grafana.
Step 1: Dashboard Overview
The Service RED Metrics dashboard shows elevated P99 latency on the product-catalog service. The gauge is yellow (above the 500ms threshold) instead of its usual green.
Step 2: Identify the Pattern
The latency heatmap shows that most requests are still fast (clustered around 10-50ms), but there's a secondary cluster appearing around 400-600ms. This started about 20 minutes ago.
Step 3: Find a Slow Trace
The "Slow Traces" table in the same dashboard shows several traces with durations over 500ms. You click one.
Step 4: Trace Waterfall
Tempo shows the full trace:
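Schematically, the waterfall might read like this (only the 490ms Meilisearch figure comes from this scenario; the other spans and timings are illustrative):

```
router: query GetProductDetails              520ms
├─ supergraph: query planning                  6ms
└─ subgraph: product-catalog                 505ms
   └─ POST /graphql                          500ms
      └─ meilisearch.search                  490ms   ← bottleneck
```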
The meilisearch.search span is 490ms — that's the bottleneck.
Step 5: Check Logs
Click the "Logs" button on the meilisearch.search span. Grafana opens Loki filtered by trace ID and service:
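The kind of log line you would expect to find there. The exact fields and values are hypothetical, but the trace_id field is what makes the jump from Tempo possible:

```text
{"level":"warn","service":"product-catalog",
 "trace_id":"4bf92f3577b34da6a3ce929d0e0e4736",
 "msg":"Meilisearch query slow: index rebuild in progress",
 "task":"indexUpdate","progress":"42%"}
```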
The Meilisearch instance is reindexing. During reindexing, search queries are slower because the index is being rebuilt.
Step 6: Confirm Systemic Impact
Click the "Request rate" link from the trace. Prometheus shows that the product-catalog's request rate is normal, but the P95 latency spiked from 50ms to 450ms starting 20 minutes ago — exactly when the reindexing started.
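The panel behind that link runs a query along these lines. The metric and label names are assumptions (they depend on how the spanmetrics connector exports its histograms), but the quantile computation is the standard pattern:

```promql
histogram_quantile(
  0.95,
  sum by (le) (
    rate(traces_spanmetrics_latency_bucket{service="product-catalog"}[5m])
  )
)
```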
Step 7: Check the Profile
Click the "Profiles" link from the trace. Pyroscope shows the CPU profile for product-catalog during this window. The flame graph confirms: 70% of CPU time is in HTTP client wait (blocked waiting for Meilisearch to respond), not in the Java application itself.
Step 8: Resolution
The root cause is clear: Meilisearch is reindexing, and search queries are slow during the rebuild. Options:
- Wait for reindexing to complete (~15 more minutes based on 42% progress)
- If reindexing is recurring, schedule it during low-traffic hours
- Consider running a Meilisearch replica that serves queries while the primary reindexes
From symptom to root cause in eight steps, across three languages, four signal types, without logging into any service directly.
Architecture Decisions: Why This Stack
Why Tempo Over Jaeger
Jaeger served us well in earlier iterations, but Tempo offers three advantages:
- No indexing infrastructure. Jaeger requires Elasticsearch or Cassandra for trace storage. Tempo uses object storage (S3/GCS) or local disk — no additional database to manage.
- TraceQL. Jaeger's search is tag-based: find traces where `service=product-catalog` and `http.status_code=500`. TraceQL adds structural queries: find traces where the product-catalog span has a child span with an error. This is essential for federation debugging.
- Metrics generator. Tempo generates service graph and span metrics natively, remote-writing them to Prometheus with exemplars. This creates the metric-to-trace correlation path.
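That structural query looks like this in TraceQL, where `>` is the direct-child operator:

```traceql
{ resource.service.name = "product-catalog" } > { status = error }
```

No tag-based search can express this, because it matches on the relationship between two spans, not on attributes of any single span.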
Why Alloy Over Promtail
Promtail is a log shipper. Alloy is a telemetry collector:
- River configuration. Component-based syntax that shows the data flow. Easier to extend with custom processing stages.
- Multi-signal support. Alloy can collect metrics, traces, and profiles in addition to logs. While we only use it for logs today, it can replace the OTel Collector for some use cases.
- Dynamic discovery. Alloy's Docker discovery automatically picks up new containers and drops removed ones. No manual target configuration.
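A minimal sketch of what that looks like in Alloy's River syntax. The Loki endpoint URL is an assumption about this platform's Compose network; everything else is the standard Docker-discovery pipeline:

```river
// Discover running containers via the Docker socket.
discovery.docker "containers" {
  host = "unix:///var/run/docker.sock"
}

// Tail logs from every discovered container and forward them to Loki.
loki.source.docker "logs" {
  host       = "unix:///var/run/docker.sock"
  targets    = discovery.docker.containers.targets
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}
```

Each component names its inputs explicitly (`targets`, `forward_to`), which is what makes the data flow readable from the configuration alone.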
Why Pyroscope
Traces show where time is spent. Profiles show why. In a polyglot platform, each runtime has distinct failure modes that only profiling can expose. Java profiles reveal GC pauses, lock contention, and JIT compilation hotspots. Go profiles show goroutine leaks, mutex contention, and memory allocation patterns. TypeScript profiles reveal event loop blocking, promise chain overhead, and V8 optimization bailouts.
Without profiling, some performance issues are invisible. A trace might show a 200ms database query, but the profile reveals that 150ms of that was spent serializing the result into a JavaScript object — an application-level issue, not a database issue.
Why the OTel Collector Over Direct Export
An alternative architecture has each service export directly to Tempo, Prometheus, and Loki — no Collector in the middle. This works for small deployments, but breaks down in a federated platform:
- Configuration consistency. With direct export, each service in each language needs Tempo's endpoint, Prometheus's push gateway URL, and Loki's push API. Change any backend, and you touch every service's configuration. With the Collector, services point to one endpoint.
- Sampling decisions. Tail sampling requires seeing all spans of a trace before deciding. Individual services can't tail-sample because they only see their own spans. The Collector sees every span and makes trace-level decisions.
- Derived metrics. The spanmetrics connector requires seeing spans from all services to generate consistent metrics. If each service exported directly, you'd need to instrument metrics separately in each language.
- Buffering and retry. If Tempo is temporarily down, the Collector buffers spans and retries. Without the Collector, spans are lost during backend outages.
Why Connectors Over Manual Metrics
The spanmetrics and servicegraph connectors generate metrics from traces. The alternative is instrumenting each service with Prometheus client libraries in three languages and hoping the metric names, labels, and buckets are consistent.
Connectors give you:
- Consistency. Every service gets the same metrics with the same dimensions, regardless of language.
- Zero application code. No `prometheus.NewHistogramVec()` in Go, no `MeterRegistry` in Java, no `prom-client` in TypeScript.
- Exemplars. Connectors automatically link metrics to the traces they were derived from.
- Federation visibility. The servicegraph connector shows cross-service call patterns that no individual service can see.
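In Collector configuration, connectors sit between pipelines: each is listed as an exporter of the traces pipeline and a receiver of the metrics pipeline. A sketch (the processor and exporter names are assumptions about this platform's setup):

```yaml
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [spanmetrics, servicegraph, otlp/tempo]
    metrics:
      # The connectors feed their derived metrics into this pipeline.
      receivers: [spanmetrics, servicegraph]
      exporters: [prometheusremotewrite]
```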
Practical Tips: Lessons from Building This Stack
Tip 1: Start with Traces, Derive Everything Else
If you can only instrument one signal type, choose traces. The spanmetrics connector generates your metrics. The trace ID in logs enables correlation. Traces are the foundational signal from which everything else can be derived.
Many teams start with metrics (Prometheus counters/histograms in application code) and add tracing later. This creates a maintenance burden: three languages, three metrics libraries, constant drift in metric names and label cardinality. Starting with traces and using connectors avoids this entirely.
Tip 2: Watch Cardinality
The spanmetrics connector dimensions determine the cardinality of generated metrics. Every unique combination of service.name x http.method x http.status_code x http.route x graphql.operation.name creates a separate time series in Prometheus.
If your GraphQL schema has 50 operations, 4 services, 3 HTTP methods, and 5 status codes, that's 50 * 4 * 3 * 5 = 3,000 time series — manageable. But if you add user.id as a dimension, you'd have 3,000 * N_users — a cardinality explosion that will crash Prometheus.
Rule of thumb: only add dimensions with bounded, known cardinality (methods, status codes, routes, operation names). Never add user IDs, request IDs, or other unbounded values as metric dimensions.
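A spanmetrics connector sketch that follows this rule, using only the bounded dimensions named above:

```yaml
connectors:
  spanmetrics:
    dimensions:
      - name: http.method
      - name: http.status_code
      - name: http.route
      - name: graphql.operation.name
      # Never: user.id, request.id, or any other unbounded attribute.
```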
Tip 3: Tail Sampling Tradeoffs
The 10-second `decision_wait` in tail sampling means traces appear in Tempo with a ~10s delay. For real-time debugging, this is usually acceptable. For systems that need sub-second trace visibility (trading platforms, real-time bidding), use head-based sampling instead and accept the cost of storing more traces.
The `num_traces: 100000` buffer also consumes memory: at ~1KB per trace, a 100K-trace buffer uses ~100MB. Under heavy load, increase this setting or risk traces being force-sampled before all their spans arrive.
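For reference, the tradeoffs above map to these tail-sampling knobs. The policy names are illustrative, and the 500ms latency threshold and 25% probabilistic rate mirror the sampling strategy described earlier in the series:

```yaml
processors:
  tail_sampling:
    decision_wait: 10s    # traces surface in Tempo with roughly this delay
    num_traces: 100000    # in-memory buffer; ~100MB at ~1KB per trace
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 500}
      - name: sample-baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 25}
```

Policies are OR-ed: a trace is kept if any policy matches, so errors and slow traces always survive while healthy traffic is thinned to 25%.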
Tip 4: Exemplar Budget
Prometheus exemplars are stored in a fixed-size circular buffer per time series. With --enable-feature=exemplar-storage, Prometheus stores the most recent exemplars for each series. If you have many series and high traffic, exemplars from low-traffic series may be evicted before you investigate.
For critical services, consider separate recording rules that pre-compute with exemplar-preserving aggregations, or increase Prometheus's exemplar storage configuration.
Tip 5: Test the Full Correlation Path
After deploying the stack, verify each correlation works end-to-end:
- Generate a request: `curl http://localhost:8000/graphql -d '{"query":"{ products { name } }"}'`
- Find the trace in Tempo via TraceQL: `{resource.service.name = "product-catalog"}`
- Click "Logs" — verify Loki shows logs for that trace
- Click "Request rate" — verify Prometheus shows metrics for that service
- Find a log in Loki — verify the "View Trace" link opens the correct trace in Tempo
- Open a Prometheus graph — verify exemplar dots appear and link to Tempo traces
If any link is broken, the most common causes are mismatched label names (`service` vs `service_name`), a missing trace ID in logs (check structured logging), or missing Grafana feature toggles.
What We Didn't Cover
This article focused on the observability infrastructure — how signals are collected, processed, and correlated. We deliberately omitted several operational topics. Custom dashboard creation goes beyond the provisioned dashboards to cover domain-specific panels. Alertmanager integration is needed to route Prometheus alerts to notification channels like Slack, PagerDuty, or email. Multi-tenancy requires tenant isolation in Loki, Tempo, and Pyroscope for multi-team deployments. Object storage backends — Tempo and Loki should use S3/GCS in production instead of local filesystem. And horizontal scaling of the Collector, Tempo, and Loki enables higher throughput as traffic grows.
These are operational concerns that depend on your deployment environment. The signal collection and correlation architecture described here works identically whether the backends are running on a laptop or in a Kubernetes cluster.
Series Conclusion
Over five articles, we've built a complete GraphQL Federation platform from the ground up:
- Part 1 established why monolithic GraphQL fails at scale and how federation distributes ownership
- Part 2 implemented subgraphs in Java, Go, and TypeScript with entity resolution
- Part 3 added gRPC for internal communication and REST for Stripe payments
- Part 4 composed Kong and Apollo Router into a secure, intelligent gateway layer
- Part 5 wired OpenTelemetry across all languages with the Grafana LGTM+ stack — Tempo for traces, Prometheus for metrics and SLO tracking, Loki for logs, Pyroscope for profiles, and Alloy for log collection — with spanmetrics connectors, tail sampling, alerting rules, and cross-signal correlations
The platform runs entirely in Docker Compose — make up-full starts 20+ containers covering four application services, two gateways, a frontend, three databases, a search engine, an object store, and eight observability components. Every query is traced. Every metric is derived from traces. Every log is correlated to a trace. Every profile is linked to a span. Error budgets are tracked. Alerts fire when SLOs breach.
Federation isn't simple. Polyglot federation is harder still. But when each service is independently deployable, each team owns its domain, and every request is observable end-to-end across all signal types, the complexity pays for itself.
The gap between "we have monitoring" and "we have observability" is the difference between dashboards you stare at and signals you navigate. The LGTM+ stack, with its cross-signal correlations, makes every investigation a directed graph traversal instead of a guessing game. Start with any signal — a metric spike, a log error, a slow trace, a hot flame graph — and follow the links to the root cause.
That's the promise of modern observability. And in a polyglot federation, where a single query traverses four services in three languages, it's not optional. It's the foundation everything else rests on.
This concludes the Polyglot GraphQL Federation series. The full source code, including all observability configurations referenced in this article, is available in the project repository.
