Polyglot GraphQL Federation: Part 5 - Observability Across the Stack

May 6, 2026

Implement end-to-end observability for polyglot GraphQL federation with OpenTelemetry, Tempo, Prometheus, Loki, Alloy, and Pyroscope, including tail sampling, SLOs, and cross-signal correlation.


In Part 4, we traced a request from the browser through Kong and Apollo Router into four subgraphs. That trace was a conceptual walkthrough. In production, you need the same visibility automatically and in real time — across every request, in every language.

A federated architecture amplifies the need for observability. A single GraphQL query can fan out to four services written in three languages, each with its own database. When latency spikes, you need to know whether the bottleneck is in the Java Product Catalog's Meilisearch query, the Go Order Service's Stripe call, or the TypeScript User Service's database pool. When error budgets erode, you need to know which subgraph is responsible. When a CPU spike correlates with a slow trace, you need to see the flame graph for that exact span.

This platform uses OpenTelemetry as the instrumentation standard across all services, exporting to a collector that routes signals to purpose-built backends. But we go further than basic telemetry collection. Spanmetrics connectors derive RED metrics (Rate, Errors, Duration) directly from traces, eliminating manual metric instrumentation, while service graph connectors build topology maps from trace data showing how subgraphs call each other. Tail sampling keeps all errors and slow requests while sampling healthy traffic at 25%, ensuring you never lose the traces that matter. On the monitoring side, SLO tracking with Prometheus recording rules computes availability and latency SLIs with 30-day error budget burn, and alerting rules fire on SLO breaches, high error rates, and infrastructure failures. Tying it all together, cross-signal correlations let you click a trace and jump directly to its logs, metrics, and CPU profiles.

This is the Grafana LGTM+ stack — Loki, Grafana, Tempo, and Mimir (with Prometheus standing in for Mimir here), plus Pyroscope for continuous profiling — and it transforms a polyglot federation from a black box into a glass house.


The Observability Architecture

Before diving into configuration, let's see the full signal flow. Every telemetry signal — traces, metrics, logs, profiles — follows a path from application code to a purpose-built backend, with Grafana providing the unified view.


Several things stand out about this architecture:

  1. Services only speak OTLP. Every application service exports telemetry in a single protocol (OpenTelemetry Protocol) to a single endpoint (the OTel Collector). Services don't know about Tempo, Prometheus, or Loki.

  2. The Collector is a processing hub, not a passthrough. It applies tail sampling, generates RED metrics from traces via connectors, builds service topology graphs, and routes signals to the right backends.

  3. Logs flow through a separate path. Application logs go to stdout (container best practice), where Grafana Alloy discovers them via the Docker socket, extracts trace IDs for correlation, and pushes to Loki.

  4. Tempo generates additional metrics. Beyond the Collector's spanmetrics, Tempo's own metrics generator produces service graph and span metrics with exemplars, remote-writing them to Prometheus.

  5. Grafana correlates everything. A single trace links to its logs in Loki, its metrics in Prometheus, and its CPU profile in Pyroscope. You can navigate between signals without copying trace IDs.

Let's build each layer from the bottom up.


Three Pillars, Three Languages

Traces: Following a Request Across Services

Distributed tracing is the most valuable observability signal in a federated architecture. A trace represents the full lifecycle of a single client request as it moves through the gateway and subgraphs.

Trace Propagation

The Apollo Router creates a root span for every incoming GraphQL operation. When it fans out to subgraphs, it propagates the trace context via W3C traceparent headers. Each subgraph creates child spans that automatically become part of the same trace.

Trace: 4bf92f3577b34da6a3ce929d0e0e4736
├── Router: POST /graphql (45ms)
│   ├── QueryPlanning (3ms)
│   ├── Fetch: products (18ms)
│   │   ├── ProductDataFetcher.getById (2ms)
│   │   ├── PostgreSQL: SELECT * FROM products (8ms)
│   │   └── Meilisearch: search (6ms)
│   ├── Fetch: inventory (10ms)  [parallel]
│   │   ├── InventoryDataFetcher.__resolveReference (1ms)
│   │   └── PostgreSQL: SELECT * FROM inventory (7ms)
│   └── Fetch: users (14ms)  [parallel]
│       ├── Review.resolveReference (1ms)
│       └── PostgreSQL: SELECT * FROM reviews (11ms)
└── ResponseMerge (2ms)

In Tempo's trace view, this renders as a waterfall diagram where you can see exactly where time is spent. The parallel subgraph fetches overlap visually, confirming that the Router executed them concurrently. With TraceQL, you can search for traces matching specific patterns:

{resource.service.name = "product-catalog" && duration > 500ms}

This finds every trace where the product-catalog subgraph took over 500ms — something that would be impossible to find from logs alone.
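TraceQL can also match on span status, which pairs well with the error-retention sampling policy used in this stack. An illustrative query (not from the repo) that surfaces failed order traces:

```traceql
{ resource.service.name = "order-service" && status = error }
```

Because tail sampling keeps every error trace, this query is guaranteed to find all failures, not just the sampled fraction.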

Java (Micronaut) Instrumentation

The Java services use Micronaut's OpenTelemetry integration, which auto-instruments HTTP handlers, database queries, and gRPC calls:

// build.gradle.kts
dependencies {
    implementation("io.micronaut.tracing:micronaut-tracing-opentelemetry-http")
    runtimeOnly("io.opentelemetry:opentelemetry-exporter-otlp")
    runtimeOnly("io.opentelemetry:opentelemetry-sdk-extension-autoconfigure")
}
# application.yml
otel:
  exporter:
    otlp:
      endpoint: http://otel-collector:4317
  resource:
    attributes:
      service.name: product-catalog
  traces:
    exporter: otlp

With these dependencies, every incoming HTTP request, every outgoing gRPC call, and every JDBC query generates spans automatically. The Micronaut filter chain creates a parent span for each GraphQL request, and the gRPC client interceptor creates child spans for inventory lookups.

Custom spans for business logic can be added with annotations:

@Singleton
public class SearchService {
    private final Tracer tracer;
    private final Client meilisearchClient; // com.meilisearch.sdk.Client

    public SearchService(Tracer tracer, Client meilisearchClient) {
        this.tracer = tracer;
        this.meilisearchClient = meilisearchClient;
    }

    public SearchResult search(String query, int limit) {
        Span span = tracer.spanBuilder("meilisearch.search")
            .setAttribute("search.query", query)
            .setAttribute("search.limit", limit)
            .startSpan();
 
        try (Scope scope = span.makeCurrent()) {
            var results = meilisearchClient.index("products").search(query);
            span.setAttribute("search.totalHits", results.getHits().size());
            return mapResults(results);
        } finally {
            span.end();
        }
    }
}

The tracer.spanBuilder() pattern is idiomatic in Java. The try-with-resources on Scope ensures the span is the current active span for any child operations (like HTTP calls to Meilisearch) and that it closes properly even on exceptions.

For the gRPC server in the Inventory service, Micronaut's gRPC integration auto-instruments both the server and client sides:

@Singleton
public class InventoryGrpcService extends InventoryServiceGrpc.InventoryServiceImplBase {
 
    @Override
    public void getInventory(GetInventoryRequest request,
                             StreamObserver<GetInventoryResponse> observer) {
        // Span created automatically by gRPC interceptor
        // Parent context propagated from Product Catalog's gRPC client
        var inventory = inventoryRepository.findByProductId(request.getProductId());
        // JDBC span created automatically for the database query
        observer.onNext(toProto(inventory));
        observer.onCompleted();
    }
}

The trace flows seamlessly: Router span → Product Catalog HTTP span → gRPC client span → Inventory gRPC server span → Inventory JDBC span. Five spans, two protocols, one trace.

Go Instrumentation

The Go Order Service uses the standard go.opentelemetry.io/otel SDK with HTTP middleware:

// cmd/server/main.go
import (
    "context"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)
 
func initTracer(ctx context.Context) (*trace.TracerProvider, error) {
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("otel-collector:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }
 
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceName("order-service"),
        )),
    )
    otel.SetTracerProvider(tp)
    return tp, nil
}
 
// Wrap the HTTP handler with OTel middleware
handler := otelhttp.NewHandler(
    graphqlHandler,
    "graphql",
)

For the Stripe integration, custom spans capture external API calls with business-relevant attributes:

func (c *Client) CreatePaymentIntent(ctx context.Context, amount int64,
    currency string) (*stripe.PaymentIntent, error) {
 
    ctx, span := otel.Tracer("stripe").Start(ctx, "stripe.CreatePaymentIntent")
    defer span.End()
 
    span.SetAttributes(
        attribute.Int64("payment.amount", amount),
        attribute.String("payment.currency", currency),
    )
 
    pi, err := paymentintent.New(&stripe.PaymentIntentParams{
        Amount:   stripe.Int64(amount),
        Currency: stripe.String(currency),
    })
 
    if err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return nil, err
    }
 
    span.SetAttributes(attribute.String("payment.intent_id", pi.ID))
    return pi, nil
}

The span.RecordError(err) call is important — it attaches the error to the span, which the tail sampling policy in the OTel Collector uses to decide that this trace should always be kept (never sampled out).

Go's context-based tracing has a distinct advantage: the ctx parameter is required by convention, so trace context propagation is explicit and hard to forget. In contrast, Java and TypeScript rely on thread-local or async-local storage, which can silently lose context in edge cases.

TypeScript Instrumentation

The User Service uses Node.js auto-instrumentation, which patches HTTP, Express, and database drivers at module load time:

// services/user-ts/src/instrumentation.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
 
const sdk = new NodeSDK({
  serviceName: 'user-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
    }),
  ],
});
 
sdk.start();

Auto-instrumentation captures HTTP request/response, Express route handling, and PostgreSQL query execution without any code changes in the application logic. Every database query in Drizzle ORM generates a span with the SQL statement (parameterized, not with values) and execution time.

The key for TypeScript is that this file must be loaded before any application code. The --require flag in the Dockerfile ensures this:

CMD ["node", "--require", "./dist/instrumentation.js", "./dist/index.js"]

If instrumentation loads after Express or pg are imported, the monkey-patching won't catch those modules, and you'll get empty traces.

Structured Logging with Trace Context

For logs to participate in cross-signal correlation, every log line must include the trace ID. Each language handles this differently.

Java (Logback + MDC):

// Micronaut auto-populates MDC with trace context
// logback.xml pattern includes traceId and spanId
@Singleton
public class ProductService {
    private static final Logger LOG = LoggerFactory.getLogger(ProductService.class);
 
    public Product getById(String id) {
        LOG.info("Fetching product: {}", id);
        // Output: {"timestamp":"2026-04-04T14:32:01Z","level":"INFO",
        //   "msg":"Fetching product: abc123","traceId":"4bf92f...",
        //   "spanId":"1a2b3c..","service":"product-catalog"}
        return repository.findById(id);
    }
}

Micronaut's OpenTelemetry integration automatically sets traceId and spanId in the Mapped Diagnostic Context (MDC). The Logback JSON encoder includes MDC fields in every log line. No application code changes needed.
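For reference, a minimal Logback configuration that produces this JSON output might look like the following — a sketch assuming the logstash-logback-encoder dependency, not the project's actual logback.xml:

```xml
<!-- Sketch: JSON console logging with MDC fields (traceId, spanId) included.
     Assumes net.logstash.logback:logstash-logback-encoder on the classpath. -->
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <!-- LogstashEncoder emits one JSON object per event and includes
         all MDC entries as top-level fields by default -->
    <encoder class="net.logstash.logback.encoder.LogstashEncoder">
      <customFields>{"service":"product-catalog"}</customFields>
    </encoder>
  </appender>
  <root level="INFO">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>
```

Because MDC fields ride along automatically, the trace context appears in every log line without touching application code.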

Go (slog + trace context):

// Extract trace context from span and add to structured log
func logWithTrace(ctx context.Context, msg string, args ...any) {
    span := trace.SpanFromContext(ctx)
    if span.SpanContext().IsValid() {
        args = append(args,
            slog.String("traceId", span.SpanContext().TraceID().String()),
            slog.String("spanId", span.SpanContext().SpanID().String()),
        )
    }
    slog.InfoContext(ctx, msg, args...)
}
 
// Usage in resolvers
func (r *queryResolver) Order(ctx context.Context, id string) (*model.Order, error) {
    logWithTrace(ctx, "Fetching order",
        slog.String("orderId", id),
    )
    return r.orderService.GetByID(ctx, id)
}

Go requires explicit trace context extraction because slog doesn't integrate with OpenTelemetry automatically. The helper function keeps the boilerplate manageable. In production, you'd wrap this in a middleware or use a library like go.opentelemetry.io/contrib/bridges/otelslog.

TypeScript (pino + AsyncLocalStorage):

// Pino logger with trace context hook
import pino from 'pino';
import { trace, context } from '@opentelemetry/api';
 
const logger = pino({
  mixin() {
    const span = trace.getSpan(context.active());
    if (span) {
      const ctx = span.spanContext();
      return {
        traceId: ctx.traceId,
        spanId: ctx.spanId,
      };
    }
    return {};
  },
  formatters: {
    level(label) { return { level: label }; },
  },
  timestamp: pino.stdTimeFunctions.isoTime,
});
 
// Usage in resolvers
logger.info({ userId: user.id }, 'User login successful');
// Output: {"level":"info","time":"2026-04-04T14:32:01.123Z",
//   "traceId":"4bf92f...","spanId":"7d8e9f...",
//   "userId":"user-123","msg":"User login successful"}

Pino's mixin function runs for every log line, extracting the active span from Node.js AsyncLocalStorage. This works because the OpenTelemetry auto-instrumentation sets up the async context correctly for Express request handlers.

The key requirement across all three languages: structured JSON output to stdout. Docker captures stdout. Alloy parses the JSON. The traceId field becomes a Loki label. Grafana links logs to traces. The chain breaks if any service emits unstructured text logs.
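A quick way to verify that requirement is to look for JSON parse failures in Loki itself. When a parser stage fails, Loki attaches a __error__ label to the result, so an illustrative query like this finds any service emitting unstructured lines:

```logql
{service=~".+"} | json | __error__ != ""
```

An empty result means every service in the pipeline is emitting parseable JSON.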

How Traces Differ Across Languages

The three language SDKs produce identical OTLP trace data, but the ergonomics differ:

| Aspect | Java | Go | TypeScript |
| --- | --- | --- | --- |
| Auto-instrumentation | Framework-integrated (Micronaut) | Middleware wrapping | Module monkey-patching |
| Context propagation | Thread-local (Scope) | Explicit ctx parameter | AsyncLocalStorage |
| Custom spans | tracer.spanBuilder() | otel.Tracer().Start(ctx) | tracer.startSpan() |
| Risk of losing context | Virtual threads may break thread-local | Forgetting to pass ctx | Async gaps in callbacks |
| gRPC support | Micronaut interceptor (auto) | Middleware (manual) | N/A (no gRPC in User svc) |

Apollo Router: The Root Span

The Apollo Router is the entry point for all federated queries, and it creates the root span that every subgraph trace attaches to. The Router has built-in OTLP export:

# gateway/router/router.yaml (telemetry section)
telemetry:
  exporters:
    tracing:
      otlp:
        enabled: true
        endpoint: http://otel-collector:4317
        protocol: grpc
  instrumentation:
    spans:
      mode: spec_compliant
      router:
        attributes:
          http.request.method: true
          url.path: true
      subgraph:
        attributes:
          subgraph.name: true
          subgraph.graphql.operation.name: true

The Router produces several span types that are critical for federation debugging. The router span is the top-level span for the entire GraphQL operation, containing the supergraph span that covers the query planning phase. Beneath these sit the subgraph spans — one per subgraph fetch, with the subgraph name as an attribute — and their child subgraph_request spans representing the actual HTTP calls.

When the Router fans out to multiple subgraphs in parallel, the subgraph spans overlap in the trace waterfall — visual confirmation that the query plan is executing concurrently. If you see sequential subgraph spans for a query that should parallelize, check the query plan for unnecessary dependencies.

The subgraph.graphql.operation.name attribute flows into the spanmetrics connector, so you get per-operation metrics at the federation level. You can answer questions like: "What's the P95 latency for the GetProductDetails operation across all subgraphs?"
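With those dimensions in place, that question becomes a single PromQL query. A sketch — the metric name follows the table below, labels assume the exporter's dot-to-underscore sanitization, and both may vary with the configured namespace and Collector version:

```promql
histogram_quantile(0.95,
  sum by (service_name, le) (
    rate(traces_spanmetrics_duration_milliseconds_bucket{
      graphql_operation_name="GetProductDetails"
    }[5m])
  )
)
```

Dropping the `graphql_operation_name` matcher gives the same P95 across all operations per service.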


Metrics: From Spans to RED, Automatically

In a traditional setup, you'd instrument each service to expose Prometheus metrics — counters for requests, histograms for latency, error counters. That means three separate instrumentation efforts in three languages, hoping they use consistent metric names and labels.

Our stack takes a different approach: generate metrics from traces. The OTel Collector's spanmetrics connector watches every span that passes through and automatically produces rate, error, and duration metrics. Zero application-side metric code required.

But we still scrape native metrics from services that expose them. Prometheus pulls from multiple sources:

# observability/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
 
rule_files:
  - recording-rules.yml
  - alerting-rules.yml
 
scrape_configs:
  # OTel Collector exports spanmetrics + servicegraph + app metrics
  - job_name: otel-collector
    static_configs:
      - targets: ["otel-collector:8889"]
 
  # OTel Collector self-monitoring
  - job_name: otel-collector-internal
    static_configs:
      - targets: ["otel-collector:8888"]
 
  - job_name: apollo-router
    metrics_path: /metrics
    static_configs:
      - targets: ["router:9090"]
 
  - job_name: product-catalog
    metrics_path: /prometheus
    static_configs:
      - targets: ["product-catalog:4001"]
 
  - job_name: inventory
    metrics_path: /prometheus
    static_configs:
      - targets: ["inventory:4004"]
 
  - job_name: order-service
    metrics_path: /metrics
    static_configs:
      - targets: ["order:4002"]
 
  - job_name: user-service
    metrics_path: /metrics
    static_configs:
      - targets: ["user:4003"]
 
  - job_name: kong
    static_configs:
      - targets: ["kong:8001"]
 
  - job_name: tempo
    static_configs:
      - targets: ["tempo:3200"]
 
  - job_name: pyroscope
    static_configs:
      - targets: ["pyroscope:4040"]

Notice the two Collector scrape targets. Port 8889 exports the spanmetrics and servicegraph data that the Collector generates from traces. Port 8888 exports the Collector's own health metrics (queue sizes, dropped spans, processing latency). We also scrape Tempo and Pyroscope themselves — observing the observers.
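The rule_files referenced above hint at where this is headed: SLIs computed from the spanmetrics data. As a flavor of what recording-rules.yml contains, here is an illustrative sketch of a per-service availability SLI — the status_code label name and its values differ across Collector versions, so treat this as a shape, not the project's actual rules:

```yaml
# Illustrative: per-service availability SLI derived from
# spanmetrics-generated request counts (label names are assumptions).
groups:
  - name: sli-availability
    rules:
      - record: service:availability_sli:rate5m
        expr: |
          1 - (
            sum by (service_name) (
              rate(traces_spanmetrics_duration_milliseconds_count{status_code="STATUS_CODE_ERROR"}[5m])
            )
            /
            sum by (service_name) (
              rate(traces_spanmetrics_duration_milliseconds_count[5m])
            )
          )
```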

Key metrics to monitor in a federated architecture:

| Metric | Source | Meaning |
| --- | --- | --- |
| traces_spanmetrics_duration_milliseconds_* | Spanmetrics connector | Request rate, latency histograms, error rate — derived from traces |
| traces_service_graph_request_total | Servicegraph connector | Cross-service call rate (who calls whom) |
| http_server_request_duration_seconds | Each subgraph | Native HTTP latency histogram |
| apollo_router_http_requests_total | Router | Total queries by operation name |
| db_client_connections_usage | Each subgraph | Database connection pool saturation |
| grpc_client_duration_seconds | Product Catalog | gRPC call latency to Inventory |

Logs: Grafana Alloy Replaces Promtail

In the original stack, Promtail collected container logs and shipped them to Loki. Promtail works, but it's a single-purpose tool with a YAML-based pipeline that becomes unwieldy for complex parsing.

Grafana Alloy replaces Promtail with a unified telemetry collector configured in a declarative, component-based syntax (originally called River) that reads like a flow diagram. Alloy can collect logs, metrics, traces, and profiles, but here we use it specifically for Docker log collection with trace ID extraction.

// observability/alloy/config.alloy
 
// Docker log discovery
discovery.docker "containers" {
  host = "unix:///var/run/docker.sock"
  refresh_interval = "5s"
}
 
// Relabel to extract service name and metadata
discovery.relabel "docker_logs" {
  targets = discovery.docker.containers.targets
 
  rule {
    source_labels = ["__meta_docker_container_label_com_docker_compose_service"]
    target_label  = "service"
  }
 
  rule {
    source_labels = ["__meta_docker_container_name"]
    regex         = "/(.*)"
    target_label  = "container"
  }
 
  rule {
    source_labels = ["__meta_docker_container_label_com_docker_compose_project"]
    target_label  = "project"
  }
}
 
// Collect logs from discovered Docker containers
loki.source.docker "containers" {
  host       = "unix:///var/run/docker.sock"
  targets    = discovery.relabel.docker_logs.output
  forward_to = [loki.process.pipeline.receiver]
}
 
// Log processing pipeline
loki.process "pipeline" {
  // Parse JSON structured logs
  stage.json {
    expressions = {
      level     = "level",
      msg       = "msg",
      timestamp = "timestamp",
      traceId   = "traceId",
      spanId    = "spanId",
      service   = "service",
    }
  }
 
  // Extract trace context for log-to-trace correlation
  stage.labels {
    values = {
      level   = "",
      traceId = "",
    }
  }
 
  // Use embedded timestamp if present
  stage.timestamp {
    source = "timestamp"
    format = "RFC3339Nano"
  }
 
  // Drop noisy labels
  stage.label_drop {
    values = ["filename"]
  }
 
  forward_to = [loki.write.default.receiver]
}
 
// Write logs to Loki
loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}

Let's walk through what happens when a container emits a log line:

  1. Discovery (discovery.docker): Alloy connects to the Docker socket and discovers all running containers. Every 5 seconds, it checks for new or removed containers.

  2. Relabeling (discovery.relabel): The Docker Compose service name (com.docker.compose.service label) becomes the service label in Loki. The container name gets cleaned up (removing the leading /). The compose project name is preserved.

  3. Collection (loki.source.docker): Alloy tails the log output from each discovered container and forwards raw log lines to the processing pipeline.

  4. JSON Parsing (stage.json): Since all our services emit structured JSON logs, Alloy parses each line and extracts fields: level, msg, timestamp, traceId, spanId, service.

  5. Label Extraction (stage.labels): The level and traceId fields become Loki labels. This is critical — having traceId as a label means Grafana can link any log line to its corresponding trace in Tempo.

  6. Timestamp (stage.timestamp): If the log contains its own timestamp, Alloy uses it instead of the collection time. This prevents clock drift between when the log was emitted and when it was collected.

  7. Push to Loki (loki.write): Processed log entries are pushed to Loki's API.

The River syntax has a clear advantage over Promtail's YAML: the data flow is visible. You can see that discovery.docker feeds into discovery.relabel, which feeds into loki.source.docker, which feeds into loki.process, which feeds into loki.write. Each component is a node in a pipeline graph.

With logs in Loki and trace IDs extracted as labels, you can query:

{service="product-catalog"} |= "error" | json | duration > 100ms

Or find all logs for a specific trace:

{traceId="4bf92f3577b34da6a3ce929d0e0e4736"}

The OTel Collector: Processing Hub

The OpenTelemetry Collector is the nervous system of this stack. In a naive setup, it's just a proxy — receive OTLP, forward to backends. Our configuration turns it into a processing hub that generates new telemetry signals, applies intelligent sampling, and enriches data before routing.

The Full Configuration

# observability/otel-collector/otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
 
processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
 
  memory_limiter:
    check_interval: 5s
    limit_mib: 512
    spike_limit_mib: 128
 
  resource:
    attributes:
      - key: deployment.environment
        value: development
        action: upsert
 
  # Tail sampling: keep all errors, slow requests, and sample healthy traffic
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 100
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes:
            - ERROR
      - name: slow-requests
        type: latency
        latency:
          threshold_ms: 500
      - name: probabilistic-sample
        type: probabilistic
        probabilistic:
          sampling_percentage: 25
 
  # Add span duration as attribute for filtering
  transform:
    trace_statements:
      - context: span
        statements:
          - set(attributes["span.duration_ms"],
              (end_time_unix_nano - start_time_unix_nano) / 1000000)
 
connectors:
  # Generate RED metrics from traces (Rate, Errors, Duration)
  spanmetrics:
    histogram:
      explicit:
        buckets: [2ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms,
                  1s, 2.5s, 5s, 10s]
    dimensions:
      - name: http.method
      - name: http.status_code
      - name: http.route
      - name: rpc.method
      - name: graphql.operation.name
      - name: service.name
    exemplars:
      enabled: true
    namespace: traces.spanmetrics
 
  # Generate service topology from traces
  servicegraph:
    latency_histogram_buckets: [2ms, 5ms, 10ms, 25ms, 50ms, 100ms,
                                 250ms, 500ms, 1s, 2.5s, 5s]
    dimensions:
      - http.method
      - http.status_code
    store:
      ttl: 2s
      max_items: 1000
 
exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
 
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: ecommerce
    enable_open_metrics: true
    resource_to_telemetry_conversion:
      enabled: true
 
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
 
  debug:
    verbosity: basic
 
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
 
service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, transform,
                   tail_sampling, batch]
      exporters: [otlp/tempo, spanmetrics, servicegraph, debug]
 
    metrics:
      receivers: [otlp, spanmetrics, servicegraph]
      processors: [memory_limiter, batch]
      exporters: [prometheus, debug]
 
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki, debug]

This is a lot of YAML. Let's break it into the five key concerns.

Concern 1: Tail Sampling

Head-based sampling (deciding at the start of a trace whether to keep it) is simple but wasteful. It either keeps too much healthy traffic or drops error traces. Tail-based sampling waits until the trace is complete, then decides.

tail_sampling:
  decision_wait: 10s
  num_traces: 100000
  expected_new_traces_per_sec: 100
  policies:
    - name: errors
      type: status_code
      status_code:
        status_codes:
          - ERROR
    - name: slow-requests
      type: latency
      latency:
        threshold_ms: 500
    - name: probabilistic-sample
      type: probabilistic
      probabilistic:
        sampling_percentage: 25

The policies are evaluated in order, and a trace is kept if any policy matches:

  1. Errors: Any trace with an ERROR status code is always kept. If the Go Order Service's Stripe call fails, you'll see it.
  2. Slow requests: Any trace longer than 500ms is kept. If the Java Product Catalog's Meilisearch query is slow, you'll have the trace.
  3. Probabilistic: Of the remaining healthy, fast traces, 25% are sampled randomly. This gives you baseline visibility without storing every trace.

The decision_wait: 10s means the Collector buffers spans for 10 seconds before making the sampling decision. This is necessary because spans from different services arrive at different times — the Router's span might arrive before the database span that makes the trace "slow." The tradeoff is a 10-second delay before traces appear in Tempo.

The num_traces: 100000 limits memory usage. At the expected 100 new traces per second with a 10-second decision window, only about 1,000 traces are typically in flight, so 100,000 leaves generous headroom for bursts. If more than 100,000 concurrent traces are being buffered, the oldest ones are force-sampled.


The result: you keep 100% of interesting traces and a representative sample of everything else. Storage costs drop by ~73% without losing visibility into failures.

Concern 2: Spanmetrics Connector

The spanmetrics connector is one of the most powerful features in the OTel Collector. It watches every span in the traces pipeline and generates Prometheus-compatible metrics:

connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [2ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms,
                  1s, 2.5s, 5s, 10s]
    dimensions:
      - name: http.method
      - name: http.status_code
      - name: http.route
      - name: rpc.method
      - name: graphql.operation.name
      - name: service.name
    exemplars:
      enabled: true
    namespace: traces.spanmetrics

For every span, it produces:

  • traces_spanmetrics_duration_milliseconds_count — request count (Rate)
  • traces_spanmetrics_duration_milliseconds_sum — total duration
  • traces_spanmetrics_duration_milliseconds_bucket — latency histogram (Duration)

These metrics are broken down by the configured dimensions: HTTP method, status code, route, RPC method, GraphQL operation name, and service name. This means you can query request rate per GraphQL operation per service — without writing a single line of metrics code in any of the three languages.
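For example, request rate per operation per service — an illustrative query assuming the exporter's sanitized label names:

```promql
sum by (service_name, graphql_operation_name) (
  rate(traces_spanmetrics_duration_milliseconds_count[5m])
)
```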

The exemplars: enabled: true setting links metrics back to traces. When you see a latency spike in a Prometheus graph, the exemplar gives you the trace ID of the request that caused it. Click the exemplar dot in Grafana, and you're in the Tempo trace view.

The magic is in the pipeline wiring:

service:
  pipelines:
    traces:
      exporters: [otlp/tempo, spanmetrics, servicegraph, debug]
    metrics:
      receivers: [otlp, spanmetrics, servicegraph]

The spanmetrics connector appears as an exporter in the traces pipeline and a receiver in the metrics pipeline. Spans flow in, metrics flow out. The same pattern applies to servicegraph.

Concern 3: Service Graph Connector

The servicegraph connector builds a topology of service-to-service communication from trace data:

servicegraph:
  latency_histogram_buckets: [2ms, 5ms, 10ms, 25ms, 50ms, 100ms,
                               250ms, 500ms, 1s, 2.5s, 5s]
  dimensions:
    - http.method
    - http.status_code
  store:
    ttl: 2s
    max_items: 1000

It produces metrics like:

  • traces_service_graph_request_total{client="router", server="product-catalog"} — how many requests the Router sends to Product Catalog
  • traces_service_graph_request_duration_seconds_bucket{client="product-catalog", server="inventory"} — latency histogram for gRPC calls from Product Catalog to Inventory
  • traces_service_graph_request_failed_total{client="order-service", server="user-service"} — failed cross-service calls

Grafana's node graph visualization uses these metrics to render a live service map. You can see request rates on edges, error rates as red highlights, and click any node to drill into its metrics.

Concern 4: Transform Processor

The transform processor adds computed attributes to spans using the OpenTelemetry Transformation Language (OTTL):

transform:
  trace_statements:
    - context: span
      statements:
        - set(attributes["span.duration_ms"],
            Duration(end_time, start_time) / 1000000)

This adds a span.duration_ms attribute to every span, computed from the span's start and end times. This is useful for filtering in TraceQL and for adding duration-based columns in Grafana's trace table view.
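The arithmetic is simple: OTel timestamps are nanoseconds, so dividing by 1,000,000 yields milliseconds. A minimal Python sketch of what the OTTL statement computes:

```python
def span_duration_ms(start_time_ns: int, end_time_ns: int) -> float:
    """What the OTTL statement computes: Duration() returns nanoseconds,
    and dividing by 1_000_000 converts to milliseconds."""
    return (end_time_ns - start_time_ns) / 1_000_000

# A span that ran for 250 ms:
print(span_duration_ms(1_000_000_000, 1_250_000_000))  # → 250.0
```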

Concern 5: Backend Routing

The final piece is routing signals to the right backends. Traces flow to Tempo via OTLP gRPC (and to the spanmetrics/servicegraph connectors), metrics go to Prometheus via the Prometheus exporter on port 8889, and logs reach Loki via the Loki exporter, complementing Alloy's Docker log collection.

The resource_to_telemetry_conversion: enabled: true setting in the Prometheus exporter converts OTel resource attributes (like service.name) into Prometheus labels. Without this, resource attributes would be dropped and you couldn't filter metrics by service.


Grafana Tempo: Traces with Power

Tempo replaces Jaeger as the trace backend. The reasons are practical:

  1. No indexing required. Tempo stores traces by trace ID only, using object storage (or local filesystem). This makes it dramatically cheaper to operate at scale.
  2. TraceQL. A purpose-built query language for traces that's far more expressive than Jaeger's tag-based search.
  3. Metrics generator. Tempo can produce service graph and span metrics with exemplars, remote-writing them to Prometheus.
  4. Native Grafana integration. Trace-to-logs, trace-to-metrics, trace-to-profiles correlations are built into the Grafana Tempo datasource.

Tempo Configuration

# observability/tempo/tempo-config.yaml
stream_over_http_enabled: true
 
server:
  http_listen_port: 3200
  grpc_listen_port: 9095
 
distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
 
ingester:
  max_block_duration: 5m
 
compactor:
  compaction:
    block_retention: 168h  # 7 days
 
metrics_generator:
  registry:
    external_labels:
      source: tempo
      cluster: local
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write
        send_exemplars: true
  traces_storage:
    path: /var/tempo/generator/traces
  processor:
    service_graphs:
      dimensions:
        - service.namespace
      enable_client_server_prefix: true
      peer_attributes:
        - service.name
        - db.system
        - messaging.system
      max_items: 10000
    span_metrics:
      dimensions:
        - http.method
        - http.status_code
        - http.route
        - rpc.method
        - graphql.operation.name
      enable_target_info: true
      filter_policies:
        - include:
            match_type: strict
            attributes:
              - key: span.kind
                value: SPAN_KIND_SERVER
 
storage:
  trace:
    backend: local
    wal:
      path: /var/tempo/wal
    local:
      path: /var/tempo/blocks
 
overrides:
  defaults:
    metrics_generator:
      processors:
        - service-graphs
        - span-metrics

Several design decisions here:

Metrics generator with remote_write: Tempo generates its own span metrics and service graph metrics, remote-writing them to Prometheus with send_exemplars: true. This complements the OTel Collector's spanmetrics: Tempo's generator sees the full trace after assembly, while the Collector processes individual spans as they arrive. The source: tempo external label distinguishes Tempo's metrics from the Collector's.

Service graph peer attributes: The peer_attributes list tells Tempo what to look for when building the service graph. Beyond service.name, it also tracks db.system (PostgreSQL, Redis) and messaging.system — so the service map shows database nodes, not just application services.

Filter policies for span metrics: The filter_policies section ensures that only server-side spans generate metrics. Without this filter, client spans would double-count every request (the client and server both see the same call). Filtering to SPAN_KIND_SERVER gives accurate request rates.

Block retention: 7 days of trace storage (168h). Traces older than 7 days are compacted away. For a development environment, this is more than sufficient; production deployments would typically use object storage (S3, GCS) with longer retention.

TraceQL Queries

With Tempo, you can search traces using TraceQL — a query language purpose-built for distributed traces:

// Find slow product catalog operations
{resource.service.name = "product-catalog" && duration > 500ms}

// Find failed Stripe payment attempts
{resource.service.name = "order-service" && span.http.status_code >= 500
  && name = "stripe.CreatePaymentIntent"}

// Find traces that touched both inventory and user services
{resource.service.name = "inventory"} && {resource.service.name = "user-service"}

// Find traces where any span has an error
{status = error}

// Aggregate: P99 latency by service
{} | quantile_over_time(duration, .99) by (resource.service.name)

TraceQL's structural queries (the && between spansets) are particularly powerful for federation debugging. You can find traces where the Router called Product Catalog but not Inventory, suggesting a query plan that skipped entity resolution.


Grafana Pyroscope: Continuous Profiling

Traces tell you where time is spent. Profiles tell you why. Pyroscope provides continuous profiling — CPU flame graphs, memory allocations, and goroutine/thread analysis — for every service, every second.

Pyroscope Configuration

# observability/pyroscope/pyroscope-config.yaml
storage:
  backend: filesystem
  filesystem:
    data_path: /data
 
server:
  http_listen_port: 4040
  grpc_listen_port: 4041
 
self_profiling:
  disable_push: true
 
analytics:
  reporting_enabled: false
 
limits:
  max_label_names_per_series: 30
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20

The configuration is intentionally minimal. Pyroscope is a write-heavy system — services push profile data continuously — so the limits section prevents any single service from overwhelming storage. In production, you'd use object storage instead of filesystem.

Trace-to-Profile Correlation

The real power of Pyroscope in this stack is the trace-to-profile link. When you find a slow trace in Tempo, Grafana can show you the CPU flame graph for that exact time window. Was the slowness caused by garbage collection? Regex parsing? Lock contention? The profile answers questions that traces can't.

This correlation is configured in the Grafana datasource (covered in the cross-signal section below).


Prometheus: Recording Rules and SLO Tracking

Raw metrics are useful, but recording rules transform them into higher-level signals: service-level indicators (SLIs) and error budget tracking. This is where observability becomes actionable.

Recording Rules

# observability/prometheus/recording-rules.yml
groups:
  # RED metrics aggregation from OTel spanmetrics
  - name: service_red_metrics
    interval: 30s
    rules:
      # Request rate per service (from spanmetrics)
      - record: service:request_rate:5m
        expr: >
          sum by (service_name)
            (rate(traces_spanmetrics_duration_milliseconds_count[5m]))
 
      # Error rate per service
      - record: service:error_rate:5m
        expr: |
          sum by (service_name)
            (rate(traces_spanmetrics_duration_milliseconds_count
              {http_status_code=~"5.."}[5m]))
          / on(service_name)
          sum by (service_name)
            (rate(traces_spanmetrics_duration_milliseconds_count[5m]))
 
      # P50 latency per service
      - record: service:latency_p50:5m
        expr: |
          histogram_quantile(0.50,
            sum by (service_name, le)
              (rate(traces_spanmetrics_duration_milliseconds_bucket[5m]))
          )
 
      # P95 latency per service
      - record: service:latency_p95:5m
        expr: |
          histogram_quantile(0.95,
            sum by (service_name, le)
              (rate(traces_spanmetrics_duration_milliseconds_bucket[5m]))
          )
 
      # P99 latency per service
      - record: service:latency_p99:5m
        expr: |
          histogram_quantile(0.99,
            sum by (service_name, le)
              (rate(traces_spanmetrics_duration_milliseconds_bucket[5m]))
          )

These recording rules pre-compute RED metrics from the spanmetrics data every 30 seconds. The resulting time series (service:request_rate:5m, service:error_rate:5m, service:latency_p95:5m) are cheap to query in dashboards and alerts because the expensive rate() and histogram_quantile() computations happen once, not on every dashboard refresh.
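To see why pre-computation matters, it helps to know what histogram_quantile() actually does: it finds the cumulative bucket containing the requested rank and linearly interpolates within it. A simplified Python sketch (ignoring Prometheus's +Inf bucket and edge-case handling):

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """buckets: sorted (upper_bound, cumulative_count) pairs, as in a
    Prometheus histogram. Returns the interpolated q-th quantile."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation inside the bucket that holds the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 50 requests under 100ms, 40 more under 250ms, 10 more under 500ms:
print(histogram_quantile(0.95, [(100, 50), (250, 90), (500, 100)]))  # → 375.0
```

Running this over every histogram series on every dashboard refresh is the cost the recording rules avoid: the interpolation happens once every 30 seconds, and dashboards read the cheap pre-computed series.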

SLO Tracking

The slo_tracking group computes service-level indicators:

  - name: slo_tracking
    interval: 30s
    rules:
      # Availability SLI: 1 - error rate (target: 99.9%)
      - record: slo:availability:5m
        expr: |
          1 - (
            sum by (service_name)
              (rate(traces_spanmetrics_duration_milliseconds_count
                {http_status_code=~"5.."}[5m]))
            / on(service_name)
            sum by (service_name)
              (rate(traces_spanmetrics_duration_milliseconds_count[5m]))
          )
 
      # Latency SLI: % of requests under 500ms (target: 95%)
      - record: slo:latency_good:5m
        expr: |
          sum by (service_name)
            (rate(traces_spanmetrics_duration_milliseconds_bucket
              {le="500"}[5m]))
          / on(service_name)
          sum by (service_name)
            (rate(traces_spanmetrics_duration_milliseconds_count[5m]))
 
      # Error budget remaining (30-day window, 99.9% target)
      - record: slo:error_budget_remaining:30d
        expr: |
          1 - (
            (1 - slo:availability:5m) / (1 - 0.999)
          )

The SLO framework defines two SLIs:

  1. Availability SLI — the ratio of non-5xx responses. Target: 99.9%. This means an error budget of 0.1%, which translates to ~43 minutes of downtime per 30-day window.

  2. Latency SLI — the percentage of requests completing under 500ms. Target: 95%. If more than 5% of requests exceed half a second, the SLO is breached.

The error budget remaining rule (slo:error_budget_remaining:30d) shows how much of the 30-day error budget has been consumed. At 1.0, the budget is fully intact. At 0.0, it's exhausted. Below 0.0, the SLO is violated.
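The arithmetic behind these numbers, as a small Python sketch (pure arithmetic, mirroring the recording rule and the ~43-minute figure above):

```python
TARGET = 0.999                  # 99.9% availability SLO
WINDOW_MINUTES = 30 * 24 * 60   # 30-day window

# Allowed downtime: the 0.1% error budget over 30 days
budget_minutes = (1 - TARGET) * WINDOW_MINUTES
print(f"{budget_minutes:.1f} min")  # → 43.2 min

def error_budget_remaining(availability: float, target: float = TARGET) -> float:
    """Mirrors the slo:error_budget_remaining:30d recording rule."""
    return 1 - (1 - availability) / (1 - target)

print(error_budget_remaining(0.9995))  # half the budget consumed → ~0.5
print(error_budget_remaining(0.998))   # budget exhausted and violated → < 0
```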

Federation-Specific Metrics

The third rule group tracks GraphQL Federation-specific signals:

  - name: federation_metrics
    interval: 30s
    rules:
      # Query plan execution rate
      - record: federation:query_plan_rate:5m
        expr: >
          sum(rate(traces_spanmetrics_duration_milliseconds_count
            {graphql_operation_name!=""}[5m]))
 
      # Cross-subgraph request rate (from service graph)
      - record: federation:subgraph_calls_rate:5m
        expr: >
          sum by (client, server)
            (rate(traces_service_graph_request_total[5m]))
 
      # Subgraph error rate
      - record: federation:subgraph_error_rate:5m
        expr: |
          sum by (server)
            (rate(traces_service_graph_request_failed_total[5m]))
          / on(server)
          sum by (server)
            (rate(traces_service_graph_request_total[5m]))

The federation:subgraph_calls_rate:5m metric is uniquely valuable. It shows the actual traffic pattern between services — how many requests the Router sends to each subgraph, and how many internal calls happen (like Product Catalog calling Inventory via gRPC). This is derived entirely from trace data via the servicegraph connector.


Alerting Rules

Recording rules compute what to observe. Alerting rules decide when to act.

# observability/prometheus/alerting-rules.yml
groups:
  - name: service_alerts
    rules:
      # High error rate (>5% for 5 minutes)
      - alert: HighErrorRate
        expr: service:error_rate:5m > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.service_name }}"
          description: >
            {{ $labels.service_name }} error rate is
            {{ $value | humanizePercentage }} (threshold: 5%)
 
      # Critical error rate (>10% for 2 minutes)
      - alert: CriticalErrorRate
        expr: service:error_rate:5m > 0.10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Critical error rate on {{ $labels.service_name }}"
          description: >
            {{ $labels.service_name }} error rate is
            {{ $value | humanizePercentage }} (threshold: 10%)
 
      # High P95 latency (>2s for 5 minutes)
      - alert: HighLatency
        expr: service:latency_p95:5m > 2000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High P95 latency on {{ $labels.service_name }}"
          description: >
            {{ $labels.service_name }} P95 latency is
            {{ $value }}ms (threshold: 2000ms)
 
      # Service down (no requests for 5 minutes)
      - alert: ServiceDown
        expr: service:request_rate:5m == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.service_name }} is not receiving requests"

The tiered error rate alerts (warning at 5%, critical at 10%) give operators time to investigate before paging. The for clause prevents flapping — a momentary spike doesn't fire the alert.

SLO Alerts

  - name: slo_alerts
    rules:
      # SLO error budget fast burn
      - alert: SLOErrorBudgetFastBurn
        expr: slo:error_budget_remaining:30d < 0.5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: >
            SLO error budget critically low for
            {{ $labels.service_name }}
          description: >
            Only {{ $value | humanizePercentage }}
            of error budget remaining
 
      # Latency SLO breach
      - alert: LatencySLOBreach
        expr: slo:latency_good:5m < 0.95
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Latency SLO breach for {{ $labels.service_name }}"
          description: >
            Only {{ $value | humanizePercentage }}
            of requests under 500ms (target: 95%)

The SLOErrorBudgetFastBurn alert fires when more than 50% of the 30-day error budget has been consumed. This follows the burn-rate alerting pattern: if you're burning budget 14x faster than allowed, you'll hit 50% consumption in roughly 26 hours, giving you a full day to respond.
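The timing claim checks out with a little arithmetic (a sketch; the exact hours depend on the burn-rate multiple you choose):

```python
WINDOW_HOURS = 30 * 24  # 30-day SLO window

def hours_to_consume(fraction: float, burn_rate: float) -> float:
    """Hours until `fraction` of the error budget is gone when burning
    it `burn_rate` times faster than the sustainable pace."""
    return fraction * WINDOW_HOURS / burn_rate

print(round(hours_to_consume(0.5, 14)))    # → 26  (the "full day to respond")
print(round(hours_to_consume(1.0, 14.4)))  # → 50  (entire budget at fast burn)
```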

Federation Alerts

  - name: federation_alerts
    rules:
      - alert: SubgraphHighErrorRate
        expr: federation:subgraph_error_rate:5m > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on subgraph {{ $labels.server }}"
          description: >
            Subgraph {{ $labels.server }} error rate is
            {{ $value | humanizePercentage }}

This alert uses the service graph metrics. If the Router's calls to any subgraph have a >5% failure rate, it fires. The $labels.server tells you which subgraph is struggling.

Infrastructure Alerts

  - name: infrastructure_alerts
    rules:
      - alert: OTelCollectorDroppedSpans
        expr: rate(otelcol_exporter_send_failed_spans_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "OTel Collector is dropping spans"
          description: "{{ $value }} spans/sec being dropped"
 
      - alert: PrometheusStorageHigh
        expr: >
          prometheus_tsdb_storage_blocks_bytes / (1024*1024*1024) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus storage exceeding 5GB"

Observing the observers. If the OTel Collector starts dropping spans (because Tempo is down, network issues, or buffer overflow), you need to know. If Prometheus storage is growing unchecked, you need to know before the disk fills.


Grafana: Cross-Signal Correlations

The most powerful aspect of the LGTM+ stack is cross-signal correlation. Grafana's datasource configuration wires the signals together so you can jump between traces, logs, metrics, and profiles without manual context-switching.

Datasource Configuration

# observability/grafana/provisioning/datasources/datasources.yaml
apiVersion: 1
 
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      exemplarTraceIdDestinations:
        - name: traceID
          datasourceUid: tempo
      httpMethod: POST
 
  - name: Tempo
    type: tempo
    access: proxy
    uid: tempo
    url: http://tempo:3200
    editable: true
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki
        spanStartTimeShift: "-1h"
        spanEndTimeShift: "1h"
        tags:
          - key: service.name
            value: service
        filterByTraceID: true
        filterBySpanID: false
      tracesToMetrics:
        datasourceUid: prometheus
        spanStartTimeShift: "-1h"
        spanEndTimeShift: "1h"
        tags:
          - key: service.name
            value: service_name
        queries:
          - name: Request rate
            query: >
              sum(rate(traces_spanmetrics_duration_milliseconds_count
                {$$__tags}[5m]))
          - name: Error rate
            query: >
              sum(rate(traces_spanmetrics_duration_milliseconds_count
                {$$__tags,http_status_code=~"5.."}[5m]))
          - name: P95 latency
            query: >
              histogram_quantile(0.95, sum by (le)
                (rate(traces_spanmetrics_duration_milliseconds_bucket
                  {$$__tags}[5m])))
      tracesToProfiles:
        datasourceUid: pyroscope
        tags:
          - key: service.name
            value: service_name
        profileTypeId: "process_cpu:cpu:nanoseconds:cpu:nanoseconds"
      serviceMap:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true
      search:
        hide: false
      lokiSearch:
        datasourceUid: loki
      traceQuery:
        timeShiftEnabled: true
        spanStartTimeShift: "1h"
        spanEndTimeShift: "-1h"
 
  - name: Loki
    type: loki
    access: proxy
    uid: loki
    url: http://loki:3100
    editable: true
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: '"traceId":"(\w+)"'
          name: TraceID
          url: "$${__value.raw}"
          urlDisplayLabel: "View Trace"
        - datasourceUid: tempo
          matcherRegex: 'trace_id=(\w+)'
          name: TraceID
          url: "$${__value.raw}"
          urlDisplayLabel: "View Trace"
 
  - name: Pyroscope
    type: grafana-pyroscope-datasource
    access: proxy
    uid: pyroscope
    url: http://pyroscope:4040
    editable: true

This is where the magic happens. Let's trace each correlation path.

Correlation 1: Trace → Logs

tracesToLogsV2:
  datasourceUid: loki
  tags:
    - key: service.name
      value: service
  filterByTraceID: true

When you view a trace in Tempo, Grafana adds a "Logs" button. Clicking it opens a Loki query filtered by the trace ID and the service name of the selected span. The spanStartTimeShift and spanEndTimeShift values widen the time window by an hour in each direction, ensuring logs emitted slightly before or after the span are included.
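The generated query looks roughly like this (illustrative only; the exact LogQL Grafana emits varies by version):

```logql
# Stream selector from the span's service.name tag mapping,
# plus a line filter on the trace ID (filterByTraceID: true)
{service="product-catalog"} |= "7a2b3c4d5e6f"
```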

Correlation 2: Trace → Metrics

tracesToMetrics:
  datasourceUid: prometheus
  queries:
    - name: Request rate
      query: >
        sum(rate(traces_spanmetrics_duration_milliseconds_count
          {$$__tags}[5m]))
    - name: Error rate
      query: >
        sum(rate(traces_spanmetrics_duration_milliseconds_count
          {$$__tags,http_status_code=~"5.."}[5m]))
    - name: P95 latency
      query: >
        histogram_quantile(0.95, sum by (le)
          (rate(traces_spanmetrics_duration_milliseconds_bucket
            {$$__tags}[5m])))

When viewing a trace, Grafana shows "Request rate", "Error rate", and "P95 latency" links. These execute pre-configured Prometheus queries filtered to the service that produced the span. The $$__tags placeholder is replaced with service_name="product-catalog" (or whichever service the span belongs to).

This answers the question: "Is this slow trace an anomaly, or is the service generally slow right now?" If the P95 latency is elevated across the board, it's a systemic issue. If the trace is an outlier, it's likely request-specific.

Correlation 3: Trace → Profile

tracesToProfiles:
  datasourceUid: pyroscope
  tags:
    - key: service.name
      value: service_name
  profileTypeId: "process_cpu:cpu:nanoseconds:cpu:nanoseconds"

When viewing a trace, Grafana shows a "Profiles" link that opens Pyroscope filtered to the same service and time window. If a span is slow and the CPU profile shows 80% of time in java.util.regex.Pattern.match, you've found your bottleneck — a regex-heavy validation that needs optimization.

Correlation 4: Logs → Traces

The reverse direction is equally important. When browsing logs in Loki, you want to jump to the trace that produced a log line.

# Loki datasource
derivedFields:
  - datasourceUid: tempo
    matcherRegex: '"traceId":"(\w+)"'
    name: TraceID
    url: "$${__value.raw}"
    urlDisplayLabel: "View Trace"
  - datasourceUid: tempo
    matcherRegex: 'trace_id=(\w+)'
    name: TraceID
    url: "$${__value.raw}"
    urlDisplayLabel: "View Trace"

Loki's derived fields use regex to extract trace IDs from log lines. Two patterns are configured because different language logging libraries format the trace ID differently — JSON format ("traceId":"abc123") and key-value format (trace_id=abc123). When a match is found, Grafana renders a "View Trace" link next to the log line.

Correlation 5: Metrics → Traces (Exemplars)

# Prometheus datasource
exemplarTraceIdDestinations:
  - name: traceID
    datasourceUid: tempo

When Prometheus metrics include exemplars (enabled in our spanmetrics connector), Grafana renders small dots on metric graphs. Each dot represents a specific request with its trace ID. Clicking the dot opens the trace in Tempo.

This is the most powerful debugging path: you see a latency spike in a dashboard, click the exemplar dot at the peak, and you're looking at the exact trace that caused the spike.

The Grafana Service Map

The Tempo datasource configuration includes:

serviceMap:
  datasourceUid: prometheus
nodeGraph:
  enabled: true

This enables Grafana's service map visualization, powered by the service graph metrics from both the OTel Collector's servicegraph connector and Tempo's metrics generator. The map renders nodes for each service (product-catalog, inventory, order-service, user-service, router) connected by edges showing request flow. Each edge displays request rates, with error rates highlighted in red and latency visible on hover.

For a federated GraphQL platform, this service map is invaluable. You can see at a glance which subgraphs are heavily loaded, which inter-service connections have errors, and how the Router distributes traffic.


The Docker Compose Observability Stack

The full observability stack runs as a Docker Compose overlay. Here's the complete configuration:

# docker-compose.observability.yml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.115.1
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "8888:8888"   # Self-monitoring metrics
      - "8889:8889"   # Spanmetrics + app metrics export
    volumes:
      - ./observability/otel-collector/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml
    depends_on:
      tempo:
        condition: service_healthy
      loki:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q",
             "http://localhost:13133/"]
      interval: 5s
      timeout: 5s
      retries: 5
 
  tempo:
    image: grafana/tempo:2.7.1
    ports:
      - "3200:3200"   # HTTP API + TraceQL
      - "9095:9095"   # gRPC
    volumes:
      - ./observability/tempo/tempo-config.yaml:/etc/tempo/config.yaml
      - tempo_data:/var/tempo
    command: ["-config.file=/etc/tempo/config.yaml"]
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q",
             "http://localhost:3200/ready"]
      interval: 5s
      timeout: 5s
      retries: 5
 
  prometheus:
    image: prom/prometheus:v3.2.1
    ports:
      - "9090:9090"
    volumes:
      - ./observability/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./observability/prometheus/recording-rules.yml:/etc/prometheus/recording-rules.yml
      - ./observability/prometheus/alerting-rules.yml:/etc/prometheus/alerting-rules.yml
      - prometheus_data:/prometheus
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--web.enable-remote-write-receiver"
      - "--enable-feature=exemplar-storage"
      - "--storage.tsdb.retention.time=7d"
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q",
             "http://localhost:9090/-/healthy"]
      interval: 5s
      timeout: 5s
      retries: 5
 
  loki:
    image: grafana/loki:3.4.2
    ports:
      - "3100:3100"
    volumes:
      - ./observability/loki/loki-config.yaml:/etc/loki/local-config.yaml
      - loki_data:/loki
    command: -config.file=/etc/loki/local-config.yaml
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q",
             "http://localhost:3100/ready"]
      interval: 5s
      timeout: 5s
      retries: 5
 
  alloy:
    image: grafana/alloy:v1.5.1
    volumes:
      - ./observability/alloy/config.alloy:/etc/alloy/config.alloy
      - /var/run/docker.sock:/var/run/docker.sock:ro
    command: ["run", "/etc/alloy/config.alloy",
              "--storage.path=/var/lib/alloy/data"]
    depends_on:
      loki:
        condition: service_healthy
 
  pyroscope:
    image: grafana/pyroscope:1.10.0
    ports:
      - "4040:4040"
    volumes:
      - ./observability/pyroscope/pyroscope-config.yaml:/etc/pyroscope/config.yaml
      - pyroscope_data:/data
    command: ["-config.file=/etc/pyroscope/config.yaml"]
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q",
             "http://localhost:4040/ready"]
      interval: 5s
      timeout: 5s
      retries: 5
 
  grafana:
    image: grafana/grafana:11.5.2
    ports:
      - "3001:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: admin
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_ROLE: Viewer
      GF_FEATURE_TOGGLES_ENABLE: traceqlEditor,traceQLStreaming,correlations,tempoServiceGraph
    volumes:
      - ./observability/grafana/provisioning:/etc/grafana/provisioning
      - grafana_data:/var/lib/grafana
    depends_on:
      prometheus:
        condition: service_healthy
      loki:
        condition: service_healthy
      tempo:
        condition: service_healthy
 
volumes:
  prometheus_data:
  loki_data:
  grafana_data:
  tempo_data:
  pyroscope_data:

Key points about the Compose configuration:

Dependency ordering: The OTel Collector depends on Tempo and Loki (it needs somewhere to send data). Alloy depends on Loki. Grafana depends on Prometheus, Loki, and Tempo (all its datasources). Health checks ensure services are ready before dependents start.

Prometheus flags: --web.enable-remote-write-receiver allows Tempo to push metrics via remote write. --enable-feature=exemplar-storage enables the exemplar storage that links metrics to traces. Both are required for the full correlation stack.

Grafana feature toggles: The GF_FEATURE_TOGGLES_ENABLE environment variable enables TraceQL editor, streaming, cross-signal correlations, and the Tempo service graph visualization.

Alloy Docker socket: Alloy needs read-only access to the Docker socket for container discovery. This is the same pattern Promtail used, but Alloy's discovery is more flexible.

Start the full stack with:

make up-full
# or equivalently:
docker compose -f docker-compose.yml \
  -f docker-compose.observability.yml up -d

Port Reference

Service          Port   Purpose
OTel Collector   4317   OTLP gRPC receiver
OTel Collector   4318   OTLP HTTP receiver
OTel Collector   8889   Prometheus exporter (spanmetrics)
Tempo            3200   HTTP API, TraceQL
Prometheus       9090   Metrics queries, remote write
Loki             3100   Log queries, push API
Pyroscope        4040   Profile queries, push API
Grafana          3001   Unified dashboards

Grafana Dashboards

The platform ships with pre-provisioned dashboards that use all of the above signals. Let's walk through the two most important ones.

Service RED Metrics Dashboard

This dashboard shows Rate, Errors, and Duration for every service in the federation. It uses a $service template variable that queries Prometheus for all services with HTTP metrics.
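One way to populate that variable is a label_values query against a series that carries the service name (an assumption about how this particular dashboard defines it; any metric with a service_name label works):

```promql
label_values(http_server_request_duration_seconds_count, service_name)
```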

Rate row: Two panels — total request rate (line graph) and request rate by endpoint (stacked bar chart). The total rate panel uses:

sum(rate(http_server_request_duration_seconds_count
  {service_name=~"$service"}[$__rate_interval]))

Errors row: Three panels — error rate percentage (line), errors by status code (stacked bars), and an error log stream from Loki. The log stream panel queries:

{service_name=~"$service"} |~ "(?i)(error|exception|fatal|panic)" | json

This provides real-time error logs alongside the error metrics. When you see the error rate spike, the corresponding error messages appear in the same dashboard.

Duration row: Three panels — latency percentiles (P50/P95/P99 line chart), a latency heatmap, and a table of slow traces from Tempo. The slow traces panel uses TraceQL:

{resource.service.name =~ "$service" && duration > 500ms}

This is the key insight of the RED dashboard: every dimension (rate, errors, duration) is covered by multiple signal types (metrics, logs, traces) in a single view. You don't need to switch between tools to investigate.

SLO Dashboard

The SLO dashboard tracks availability and latency against defined targets:

Availability SLO (target 99.9%): A gauge showing current availability, a gauge showing error budget remaining, and a burn rate chart with 1h and 6h windows. The burn rate threshold line at 14.4x (the "fast burn" rate) shows when you're consuming budget dangerously fast.

Latency SLO (target: 95% of requests under 500ms): A gauge showing current compliance, a gauge showing latency budget remaining, and a compliance-over-time chart with the 95% target line.

30-day Rolling History: Full-width charts showing availability and latency SLO compliance over a 30-day window, plus an error budget consumption timeline. This answers the question: "Are we trending toward an SLO breach?"


Debugging a Federated Query: The Full Workflow

With the full stack running, debugging a slow query follows a workflow that crosses all signal types. Let's walk through a realistic scenario.

Scenario: Elevated P99 Latency

A user reports that product pages are loading slowly. You open Grafana.

Step 1: Dashboard Overview

The Service RED Metrics dashboard shows elevated P99 latency on the product-catalog service. The gauge is yellow (above the 500ms threshold) instead of its usual green.

Step 2: Identify the Pattern

The latency heatmap shows that most requests are still fast (clustered around 10-50ms), but there's a secondary cluster appearing around 400-600ms. This started about 20 minutes ago.

Step 3: Find a Slow Trace

The "Slow Traces" table in the same dashboard shows several traces with durations over 500ms. You click one.

Step 4: Trace Waterfall

Tempo shows the full trace:

Trace: 7a2b3c4d5e6f...
├── Router: POST /graphql (520ms)
│   ├── QueryPlanning (2ms)
│   ├── Fetch: product-catalog (505ms) ← bottleneck
│   │   ├── ProductDataFetcher.search (498ms) ← here
│   │   │   ├── meilisearch.search (490ms) ← root cause
│   │   │   └── PostgreSQL: SELECT (3ms)
│   │   └── Response serialization (2ms)
│   ├── Fetch: inventory (8ms) [parallel]
│   └── Fetch: user-service (12ms) [parallel]
└── ResponseMerge (1ms)

The meilisearch.search span is 490ms — that's the bottleneck.

Step 5: Check Logs

Click the "Logs" button on the meilisearch.search span. Grafana opens Loki filtered by trace ID and service:

14:32:01 WARN  product-catalog  Meilisearch search took 490ms
                                 for query "wireless headphones"
                                 index=products totalHits=1247
14:32:00 INFO  product-catalog  Meilisearch health check: indexing
                                 in progress (42% complete)

The Meilisearch instance is reindexing. During reindexing, search queries are slower because the index is being rebuilt.

Step 6: Confirm Systemic Impact

Click the "Request rate" link from the trace. Prometheus shows that the product-catalog's request rate is normal, but the P95 latency spiked from 50ms to 450ms starting 20 minutes ago — exactly when the reindexing started.

Step 7: Check the Profile

Click the "Profiles" link from the trace. Pyroscope shows the CPU profile for product-catalog during this window. The flame graph confirms: 70% of CPU time is in HTTP client wait (blocked waiting for Meilisearch to respond), not in the Java application itself.

Step 8: Resolution

The root cause is clear: Meilisearch is reindexing, and search queries are slow during the rebuild. Options:

  1. Wait for reindexing to complete (~15 more minutes based on 42% progress)
  2. If reindexing is recurring, schedule it during low-traffic hours
  3. Consider running a Meilisearch replica that serves queries while the primary reindexes

From symptom to root cause in eight steps, across three languages and four signal types, without logging into a single service directly.


Architecture Decisions: Why This Stack

Why Tempo Over Jaeger

Jaeger served us well in earlier iterations, but Tempo offers three advantages:

  1. No indexing infrastructure. Jaeger requires Elasticsearch or Cassandra for trace storage. Tempo uses object storage (S3/GCS) or local disk — no additional database to manage.

  2. TraceQL. Jaeger's search is tag-based: find traces where service=product-catalog and http.status_code=500. TraceQL adds structural queries: find traces where the product-catalog span has a child span with an error. This is essential for federation debugging.

  3. Metrics generator. Tempo generates service graph and span metrics natively, remote-writing them to Prometheus with exemplars. This creates the metric-to-trace correlation path.
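The structural query from point 2 can be written in one line of TraceQL. The `>` operator matches direct parent-child relationships between span sets (`>>` matches any descendant); `status` is a TraceQL intrinsic:

```traceql
{ resource.service.name = "product-catalog" } > { status = error }
```

Jaeger's tag-based search has no equivalent: it can find traces containing both conditions, but not traces where one span is the child of the other.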

Why Alloy Over Promtail

Promtail is a log shipper. Alloy is a telemetry collector:

  1. River configuration. Component-based syntax that shows the data flow. Easier to extend with custom processing stages.

  2. Multi-signal support. Alloy can collect metrics, traces, and profiles in addition to logs. While we only use it for logs today, it can replace the OTel Collector for some use cases.

  3. Dynamic discovery. Alloy's Docker discovery automatically picks up new containers and drops removed ones. No manual target configuration.
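As a sketch, the log-collection pipeline built from these three points might look like this in River (the socket path, Loki URL, and component labels are assumptions, not this platform's exact config):

```river
// Discover running containers via the Docker socket.
discovery.docker "containers" {
  host = "unix:///var/run/docker.sock"
}

// Tail logs from every discovered container and forward them on.
loki.source.docker "logs" {
  host       = "unix:///var/run/docker.sock"
  targets    = discovery.docker.containers.targets
  forward_to = [loki.write.default.receiver]
}

// Push to Loki.
loki.write "default" {
  endpoint {
    url = "http://loki:3100/loki/api/v1/push"
  }
}
```

The component graph is the data flow: discovery feeds the source, the source feeds the writer. Adding a processing stage means inserting one more component between them.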

Why Pyroscope

Traces show where time is spent. Profiles show why. In a polyglot platform, each runtime has distinct failure modes that only profiling can expose. Java profiles reveal GC pauses, lock contention, and JIT compilation hotspots. Go profiles show goroutine leaks, mutex contention, and memory allocation patterns. TypeScript profiles reveal event loop blocking, promise chain overhead, and V8 optimization bailouts.

Without profiling, some performance issues are invisible. A trace might show a 200ms database query, but the profile reveals that 150ms of that was spent serializing the result into a JavaScript object — an application-level issue, not a database issue.

Why the OTel Collector Over Direct Export

An alternative architecture has each service export directly to Tempo, Prometheus, and Loki — no Collector in the middle. This works for small deployments, but breaks down in a federated platform:

  1. Configuration consistency. With direct export, each service in each language needs Tempo's endpoint, Prometheus's push gateway URL, and Loki's push API. Change any backend, and you touch every service's configuration. With the Collector, services point to one endpoint.

  2. Sampling decisions. Tail sampling requires seeing all spans of a trace before deciding. Individual services can't tail-sample because they only see their own spans. The Collector sees every span and makes trace-level decisions.

  3. Derived metrics. The spanmetrics connector requires seeing spans from all services to generate consistent metrics. If each service exported directly, you'd need to instrument metrics separately in each language.

  4. Buffering and retry. If Tempo is temporarily down, the Collector buffers spans and retries. Without the Collector, spans are lost during backend outages.
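A minimal sketch of this topology: one OTLP endpoint in, three backends out. Endpoint addresses and exporter choices are assumptions (for example, recent Loki versions accept logs natively over OTLP, which is what `otlphttp/loki` relies on here):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  otlphttp/loki:
    endpoint: http://loki:3100/otlp

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      exporters: [otlphttp/loki]
```

Every service, in every language, needs exactly one setting: the Collector's OTLP endpoint. Swapping a backend touches this file and nothing else.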

Why Connectors Over Manual Metrics

The spanmetrics and servicegraph connectors generate metrics from traces. The alternative is instrumenting each service with Prometheus client libraries in three languages and hoping the metric names, labels, and buckets are consistent.

Connectors give you:

  1. Consistency. Every service gets the same metrics with the same dimensions, regardless of language.
  2. Zero application code. No prometheus.NewHistogramVec() in Go, no MeterRegistry in Java, no prom-client in TypeScript.
  3. Exemplars. Connectors automatically link metrics to the traces they were derived from.
  4. Federation visibility. The servicegraph connector shows cross-service call patterns that no individual service can see.
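In Collector configuration, a connector is wired as an exporter of one pipeline and a receiver of another. A sketch, with backend exporter names assumed:

```yaml
connectors:
  spanmetrics: {}
  servicegraph: {}

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Connectors act as exporters here: every span that goes to
      # Tempo also feeds the metric generators...
      exporters: [otlp/tempo, spanmetrics, servicegraph]
    metrics:
      # ...and as receivers here, emitting derived RED and
      # service-graph metrics into the metrics pipeline.
      receivers: [spanmetrics, servicegraph]
      exporters: [prometheusremotewrite]
```

One configuration block replaces metric instrumentation in three languages.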

Practical Tips: Lessons from Building This Stack

Tip 1: Start with Traces, Derive Everything Else

If you can only instrument one signal type, choose traces. The spanmetrics connector generates your metrics. The trace ID in logs enables correlation. Traces are the foundational signal from which everything else can be derived.

Many teams start with metrics (Prometheus counters/histograms in application code) and add tracing later. This creates a maintenance burden: three languages, three metrics libraries, constant drift in metric names and label cardinality. Starting with traces and using connectors avoids this entirely.

Tip 2: Watch Cardinality

The spanmetrics connector dimensions determine the cardinality of generated metrics. Every unique combination of service.name x http.method x http.status_code x http.route x graphql.operation.name creates a separate time series in Prometheus.

If your GraphQL schema has 50 operations, 4 services, 3 HTTP methods, and 5 status codes, that's 50 * 4 * 3 * 5 = 3,000 time series — manageable. But if you add user.id as a dimension, you'd have 3,000 * N_users — a cardinality explosion that will crash Prometheus.

Rule of thumb: only add dimensions with bounded, known cardinality (methods, status codes, routes, operation names). Never add user IDs, request IDs, or other unbounded values as metric dimensions.
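In the spanmetrics connector, cardinality is controlled by the `dimensions` list. A sketch applying the rule of thumb above (attribute names follow OTel semantic conventions; they may differ from this platform's exact instrumentation):

```yaml
connectors:
  spanmetrics:
    dimensions:
      # Bounded, known cardinality: safe as metric dimensions.
      - name: http.method
      - name: http.status_code
      - name: http.route
      - name: graphql.operation.name
      # Never add unbounded values such as user.id or request.id:
      # each unique value becomes a new Prometheus time series.
```

Anything not listed stays in the trace, where high cardinality is cheap, instead of in Prometheus, where it is not.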

Tip 3: Tail Sampling Tradeoffs

The 10-second decision_wait in tail sampling means traces appear in Tempo with a ~10s delay. For real-time debugging, this is usually acceptable. For systems that need sub-second trace visibility (trading platforms, real-time bidding), use head-based sampling instead and accept the cost of storing more traces.

The num_traces: 100000 buffer also uses memory. At ~1KB per trace, a 100K-trace buffer consumes ~100MB. Under heavy load, increase this or risk traces being force-sampled before all their spans arrive.
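The policy described in this series, keep all errors and slow traces and sample the rest at 25%, looks roughly like this in the tail_sampling processor (the threshold values mirror the text; treat it as a sketch rather than the platform's exact config):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s    # hold spans until the trace is complete
    num_traces: 100000    # in-memory buffer (~100MB at ~1KB/trace)
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500
      - name: sample-baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 25
```

Policies are OR-ed: a trace is kept if any policy matches, so errors and slow traces survive even when the probabilistic policy would have dropped them.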

Tip 4: Exemplar Budget

Prometheus exemplars are stored in a fixed-size circular buffer per time series. With --enable-feature=exemplar-storage, Prometheus stores the most recent exemplars for each series. If you have many series and high traffic, exemplars from low-traffic series may be evicted before you investigate.

For critical services, query the raw series directly when chasing an exemplar (exemplars are attached at scrape time and are not carried through recording-rule aggregations), or increase Prometheus's exemplar storage size so exemplars survive long enough to investigate.
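If you need more headroom, Prometheus exposes the in-memory exemplar buffer size in its configuration file; the value below is illustrative, and the feature flag must already be enabled:

```yaml
# prometheus.yml -- requires --enable-feature=exemplar-storage
storage:
  exemplars:
    max_exemplars: 500000
```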

Tip 5: Test the Full Correlation Path

After deploying the stack, verify each correlation works end-to-end:

  1. Generate a request: curl http://localhost:8000/graphql -H 'Content-Type: application/json' -d '{"query":"{ products { name } }"}'
  2. Find the trace in Tempo via TraceQL: {resource.service.name = "product-catalog"}
  3. Click "Logs" — verify Loki shows logs for that trace
  4. Click "Request rate" — verify Prometheus shows metrics for that service
  5. Find a log in Loki — verify the "View Trace" link opens the correct trace in Tempo
  6. Open a Prometheus graph — verify exemplar dots appear and link to Tempo traces

If any link is broken, the most common causes are: mismatched label names (service vs service_name), missing trace ID in logs (check structured logging), or missing Grafana feature toggles.


What We Didn't Cover

This article focused on the observability infrastructure — how signals are collected, processed, and correlated. We deliberately omitted several operational topics:

  • Custom dashboard creation: domain-specific panels beyond the provisioned dashboards
  • Alertmanager integration: routing Prometheus alerts to notification channels like Slack, PagerDuty, or email
  • Multi-tenancy: tenant isolation in Loki, Tempo, and Pyroscope for multi-team deployments
  • Object storage backends: in production, Tempo and Loki should use S3/GCS instead of the local filesystem
  • Horizontal scaling: running the Collector, Tempo, and Loki at higher throughput as traffic grows

These are operational concerns that depend on your deployment environment. The signal collection and correlation architecture described here works identically whether the backends are running on a laptop or in a Kubernetes cluster.


Series Conclusion

Over five articles, we've built a complete GraphQL Federation platform from the ground up:

  • Part 1 established why monolithic GraphQL fails at scale and how federation distributes ownership
  • Part 2 implemented subgraphs in Java, Go, and TypeScript with entity resolution
  • Part 3 added gRPC for internal communication and REST for Stripe payments
  • Part 4 composed Kong and Apollo Router into a secure, intelligent gateway layer
  • Part 5 wired OpenTelemetry across all languages with the Grafana LGTM+ stack — Tempo for traces, Prometheus for metrics and SLO tracking, Loki for logs, Pyroscope for profiles, and Alloy for log collection — with spanmetrics connectors, tail sampling, alerting rules, and cross-signal correlations

The platform runs entirely in Docker Compose — make up-full starts 20+ containers covering four application services, two gateways, a frontend, three databases, a search engine, an object store, and eight observability components. Every query is traced. Every metric is derived from traces. Every log is correlated to a trace. Every profile is linked to a span. Error budgets are tracked. Alerts fire when SLOs breach.

Federation isn't simple. Polyglot federation is harder still. But when each service is independently deployable, each team owns its domain, and every request is observable end-to-end across all signal types, the complexity pays for itself.

The gap between "we have monitoring" and "we have observability" is the difference between dashboards you stare at and signals you navigate. The LGTM+ stack, with its cross-signal correlations, makes every investigation a directed graph traversal instead of a guessing game. Start with any signal — a metric spike, a log error, a slow trace, a hot flame graph — and follow the links to the root cause.

That's the promise of modern observability. And in a polyglot federation, where a single query traverses four services in three languages, it's not optional. It's the foundation everything else rests on.


This concludes the Polyglot GraphQL Federation series. The full source code, including all observability configurations referenced in this article, is available in the project repository.


Arthur Costa

Senior Full-Stack Engineer & Tech Lead

Senior Full-Stack Engineer with 8+ years in React, TypeScript, and Node.js. Expert in performance optimization and leading engineering teams.
