Overview #
In platform engineering, semantics (shared meaning about services, requests, resources, and failure modes) matter more than raw signal volume. OpenTelemetry gives you spans, traces, metrics, and logs, but it’s the Semantic Conventions and correlation that turn those signals into answers. This post lays out a semantics-first approach you can standardize across your platform so every team inherits powerful, low-friction observability by default.
Why platform observability is hard #
Platform teams operate polyglot, multi-tenant systems: Kubernetes, serverless, data pipelines, message buses, edge gateways, and CI/CD runners, each speaking its own dialect. The resulting “observability debt” shows up as:
- Dashboards that don’t line up (fields named differently per service)
- High-cardinality costs from free-text labels (e.g., full URLs, user IDs)
- Logs that can’t be tied back to a request
- Traces with missing hops due to broken header propagation
- Metrics that trend but can’t be explained
A semantics-first strategy fixes this by enforcing consistent names, attributes, and correlation IDs across all four signals.
OTel signals—built for correlation #
1) Traces & spans (the backbone)
- Trace: a single request’s journey across services.
- Span: one timed operation within that journey (e.g., “HTTP GET /orders/{id}”).
- Correlation keys: trace_id, span_id, parent_span_id.
- Span events: time-stamped notes (e.g., “retry attempt=2”), great for recording meaningful, structured “logs inside the span”.
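To make the correlation keys concrete, here is a minimal, SDK-free sketch (plain Python, deliberately not the OTel API) of how trace_id, span_id, and parent_span_id relate across a request:

```python
import secrets

def new_trace_id() -> str:
    # 16 random bytes, hex-encoded: the W3C trace-id format
    return secrets.token_hex(16)

def new_span_id() -> str:
    # 8 random bytes, hex-encoded: the W3C span-id format
    return secrets.token_hex(8)

# A root span has no parent; every child copies the trace_id
# and points its parent_span_id at the span that created it.
root = {"trace_id": new_trace_id(), "span_id": new_span_id(),
        "parent_span_id": None, "name": "HTTP GET /orders/{id}"}
child = {"trace_id": root["trace_id"], "span_id": new_span_id(),
         "parent_span_id": root["span_id"], "name": "SELECT orders"}

assert child["trace_id"] == root["trace_id"]        # same journey
assert child["parent_span_id"] == root["span_id"]   # explicit hop link
```

Real SDKs generate and propagate these IDs for you; the point is that every signal can carry the same two keys.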
2) Metrics (the pulse)
- Counters, gauges, histograms summarize service health (RPS, latency, error rate).
- Exemplars link metric data points to example trace_ids, so you can jump from a latency spike directly into a representative trace.
- Cardinality control is crucial: use templated labels and avoid user-specific labels.
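As an illustration of the exemplar idea (this is a toy data structure, not the OTel metrics SDK), a histogram can keep one representative trace_id per bucket so a latency spike is one click away from a trace:

```python
import bisect

class ExemplarHistogram:
    """Latency histogram that stores one exemplar trace_id per bucket,
    mimicking how OTel metric data points link back to traces."""
    def __init__(self, bounds):
        self.bounds = list(bounds)               # upper bucket boundaries (ms)
        self.counts = [0] * (len(bounds) + 1)    # +1 for the overflow bucket
        self.exemplars = [None] * (len(bounds) + 1)

    def record(self, latency_ms, trace_id):
        i = bisect.bisect_left(self.bounds, latency_ms)
        self.counts[i] += 1
        self.exemplars[i] = trace_id             # keep the latest example trace

h = ExemplarHistogram([50, 100, 500])
h.record(30, "trace-aaa")
h.record(420, "trace-bbb")   # a slow request lands in the 100-500ms bucket
```

When the 100-500ms bucket spikes, `h.exemplars[2]` hands you a concrete trace to open.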
3) Logs (the narrative)
- Rich context for decisions and errors.
- Must be structured (JSON) with standardized keys and injected correlation IDs (trace_id, span_id) so a single query pivots from a log line to the originating trace and spans.
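A minimal sketch of that injection using only the Python stdlib (the ContextVar stands in for the active OTel span context, which a real bridge would read instead):

```python
import json
import logging
from contextvars import ContextVar

# Hypothetical ambient context; a real app would read the active OTel span.
current_trace = ContextVar("current_trace",
                           default={"trace_id": "", "span_id": ""})

class TraceContextFilter(logging.Filter):
    """Copies trace_id/span_id onto every record - the same job Logback MDC
    or structlog processors do in other stacks."""
    def filter(self, record):
        ctx = current_trace.get()
        record.trace_id = ctx["trace_id"]
        record.span_id = ctx["span_id"]
        return True

def to_json(record):
    # Standardized keys so one query pivots from a log line to its trace.
    return json.dumps({
        "severity": record.levelname,
        "message": record.getMessage(),
        "trace_id": record.trace_id,
        "span_id": record.span_id,
    })

current_trace.set({"trace_id": "abc123", "span_id": "def456"})
rec = logging.LogRecord("payment-service", logging.ERROR, __file__, 0,
                        "Failed to connect", None, None)
TraceContextFilter().filter(rec)
```

`to_json(rec)` now emits a structured line carrying trace_id and span_id alongside the message.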
4) Resources (the identity)
- Describe where telemetry came from. Examples: service.name, service.version, service.namespace, deployment.environment, k8s.cluster.name, k8s.namespace.name, k8s.pod.name, cloud.region, host.*, container.*.
- Resource attributes are the join keys for cross-cutting views and tenancy boundaries.
Semantic Conventions: your platform contract #
OTel’s Semantic Conventions define standard attribute names and when to set them. Treat them as a platform contract. Some examples:
- Base resource identity (applies to every process):
  - service.name, service.version, service.namespace
  - deployment.environment (e.g., dev|staging|prod)
  - telemetry.sdk.* (auto-populated by OTel SDKs)
- Workload context (for K8s/containers):
  - k8s.cluster.name, k8s.namespace.name, k8s.pod.name, k8s.node.name
  - container.id, container.image.name, container.image.tag
- Protocol families:
  - HTTP: http.method, http.route, http.status_code, server.address, client.address
  - DB: db.system, db.name, db.operation
  - Messaging: messaging.system, messaging.destination, messaging.operation
Good platform practices when capturing observability data:
- Never invent ad-hoc keys when a SemConv exists.
- Normalize values (e.g., lowercase environment names).
- Template high-cardinality fields (routes, topics).
- Redact/avoid sensitive payloads and PII in attributes.
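The last three practices can be sketched as small attribute-processing helpers (hypothetical names; in production this logic usually lives in the Collector, not in app code):

```python
import re

def normalize_env(value: str) -> str:
    # Normalize values: "Prod " and "prod" must be the same dimension.
    return value.strip().lower()

def template_route(path: str) -> str:
    # Template high-cardinality fields: collapse numeric and long hex
    # segments so http.route stays low-cardinality.
    path = re.sub(r"/[0-9]+(?=/|$)", "/{id}", path)
    return re.sub(r"/[0-9a-f]{8,}(?=/|$)", "/{id}", path)

# Redact sensitive attributes/PII before export (illustrative denylist).
SENSITIVE = {"user.email", "card.number", "user.id"}

def scrub(attrs: dict) -> dict:
    return {k: v for k, v in attrs.items() if k not in SENSITIVE}
```

For example, `template_route("/users/123/orders/456")` yields `/users/{id}/orders/{id}`, turning millions of label values into one.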
Correlation strategy #
- Propagate context everywhere:
  - Use W3C Trace Context across HTTP/gRPC/messaging. Ship language-specific middleware with your service templates so teams don’t forget.
- Inject trace IDs into logs:
  - Bridge your logging framework with OTel context: Java (SLF4J/Logback MDC), .NET (ILogger scopes), Python (structlog/loguru), JS (winston/pino).
  - Emit trace_id and span_id fields in every log line.
  - Standardize log keys, for example: severity, message, service.name, deployment.environment, trace_id, span_id.
- Enable metrics exemplars:
  - Histograms for latency/error rate should attach exemplars with trace_id.
- Unify resources at the edge:
  - Run the OpenTelemetry Collector as a DaemonSet or sidecar to enrich all signals with k8s and service resource attributes using the k8sattributes and resource processors. This guarantees consistent identity even when apps are misconfigured.
- Normalize & route:
  - Use Collector processors to:
    - Set defaults (e.g., if service.namespace is missing, derive it from k8s.namespace.name).
    - Drop noisy attributes (user IDs, full URLs).
    - Route by deployment.environment: send dev to a low-cost store, prod to your primary store.
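What the propagation middleware actually passes between hops is the W3C `traceparent` header. A hedged sketch of its shape (real SDK propagators handle this, including validation and tracestate, which this toy version skips):

```python
import secrets

def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    # W3C Trace Context header: version-traceid-spanid-flags
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def continue_trace(traceparent: str) -> str:
    """What each hop's middleware does on an outgoing call:
    keep the trace_id, mint a fresh span_id, preserve the flags."""
    version, trace_id, _parent_span_id, flags = traceparent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

incoming = make_traceparent(secrets.token_hex(16), secrets.token_hex(8))
outgoing = continue_trace(incoming)   # same trace, new span
```

If any hop drops or regenerates this header instead of continuing it, the trace breaks, which is exactly the partial-propagation gotcha below.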
Platform pattern (what you standardize) #
- Golden scaffolds (your IDP templates):
  - OTel SDK + auto-instrumentation wired in.
  - Middleware for context propagation (HTTP/gRPC/messaging).
  - Logging bridge preconfigured for trace_id/span_id injection.
  - App health metrics (RPS/latency/errors) instruments included.
- Cluster-level collectors:
  - receivers: otlp, k8s_events, filelog (if scraping app logs), prometheus (for legacy exporters).
  - processors: k8sattributes, resource, attributes (drop/rename), batch, transform.
  - exporters: the backends you use (multiple allowed).
- Dashboards & SLOs (generated from semantics):
  - Service Overview (RPS, p50/p90/p99 latency, error rate) keyed by service.name, service.namespace, deployment.environment.
  - “Follow the trace” links on every panel via exemplars.
  - Error analytics by exception.type, http.status_code.
Gotchas to avoid #
- Cardinality explosions: Don’t label by raw URL, request IDs, or user IDs. Use templated routes (/users/{id} instead of /users/1abbcsdee2) and coarse-grained dimensions (http.method, http.route, http.status_code) over fine-grained ones (user.id, request.id).
- Partial propagation: One missing hop breaks correlation. Ship middleware by default and test with synthetic traces through gateways, queues, and batch jobs. Flow: [User → API Gateway → Order Service → Kafka → Invoice Processor → DB]. With this, you’re not testing functionality, you’re testing trace continuity.
- Invented fields: app, env, svc sound harmless but break discoverability. Use service.* and deployment.environment.
- Logs as strings: Unstructured logs destroy correlation. Emit JSON and keep keys consistent.
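One way to automate the trace-continuity check described above (a hypothetical helper for a synthetic-trace CI job, not a standard tool):

```python
def check_continuity(spans):
    """Given the spans collected from one synthetic request, verify that
    every hop shares a single trace_id and every non-root span points at
    a span that actually exists - i.e., no broken hops."""
    if len({s["trace_id"] for s in spans}) != 1:
        return False
    ids = {s["span_id"] for s in spans}
    return all(s["parent_span_id"] in ids
               for s in spans if s["parent_span_id"])

hops = [
    {"trace_id": "t1", "span_id": "a", "parent_span_id": None},  # API Gateway
    {"trace_id": "t1", "span_id": "b", "parent_span_id": "a"},   # Order Service
    {"trace_id": "t1", "span_id": "c", "parent_span_id": "b"},   # Invoice Processor
]
```

A hop that regenerated its trace_id, or a span whose parent never arrived, makes `check_continuity` return False and should fail the synthetic test.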
Observability Contract (Platform Standard) #
Every service onboarded to the platform must adhere to the following semantic and correlation requirements. This ensures consistent observability, cross-signal correlation, and zero-config dashboards across the platform. Add it to your internal service template (language-agnostic, enforced by linters/CI checks):
- Resource Attributes (Identity Layer): Defines who emitted the telemetry and where it runs. Must be present on every span, log, and metric. Examples: service.name, service.namespace, service.version, k8s.cluster.name, container.image.name. Inject via the OTel Collector (k8sattributes processor) so identity is consistent across all signals.
- Trace & Span Standards: Defines how to represent requests and operations. Enforced via OTel SDKs and auto-instrumentation.
- Metrics Standards: Defines what to measure and how to correlate. Examples: http.server.request.duration, http.server.requests, errors.total.
- Logging Standards: Defines what context each log line carries.
  - Structured JSON logs only.
  - Auto-inject OTel context → logger via middleware or adapter.
{
"timestamp": "2025-09-30T10:12:45Z",
"severity": "ERROR",
"message": "Failed to connect to merchant bank account",
"trace_id": "abc123...",
"span_id": "def456...",
"service.name": "payment-service",
"deployment.environment": "prod",
"k8s.pod.name": "payment-service-abc"
}
- Sampling & Retention Policy

| Environment | Sampling | Retention | Export Destination |
|---|---|---|---|
| dev | 10% | 7 days | Low-cost backend |
| staging | 25% | 14 days | Shared backend |
| prod | 100% (errors), tail sampling for slow requests | 30 days | Primary observability backend |
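The sampling policy in the table can be expressed as a small decision function (a hypothetical app-side sketch; in practice the prod rules map to the Collector's tail-sampling capabilities rather than code like this):

```python
import random

# Head-sampling rates per environment, mirroring the table above.
HEAD_RATES = {"dev": 0.10, "staging": 0.25}

def keep_trace(env: str, is_error: bool, duration_ms: float,
               slow_threshold_ms: float = 1000.0) -> bool:
    """Decide whether a finished trace is exported."""
    if env == "prod":
        # Keep 100% of errors; tail-sample slow requests.
        return is_error or duration_ms >= slow_threshold_ms
    # Non-prod: probabilistic head sampling.
    return random.random() < HEAD_RATES.get(env, 1.0)
```

The `slow_threshold_ms` value is an assumption for illustration; tune it per service SLO.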
- Validation & Enforcement
  - Linting: Check for required attributes before deploy.
  - CI Gate: Reject merge if service.name or deployment.environment is missing.
  - Collector Audit Logs: Flag dropped signals due to missing semantics.
  - Dashboards as Code: Provision generic dashboards for services via code (grafana-sdk, perses-dev).
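The CI gate can be as simple as a lint over the declared resource attributes (hypothetical helper; wire it into whatever checks your pipeline already runs):

```python
# Attributes the platform contract requires on every service.
REQUIRED = {"service.name", "deployment.environment"}

def lint_resource(attrs: dict) -> list:
    """Return the sorted list of missing required attributes.
    An empty list means the service passes; a CI gate fails otherwise."""
    return sorted(REQUIRED - attrs.keys())

assert lint_resource({"service.name": "payments",
                      "deployment.environment": "prod"}) == []
assert lint_resource({"service.name": "payments"}) == ["deployment.environment"]
```

Because the check keys off SemConv names, it needs no per-team configuration.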
Takeaway #
You don’t win observability by collecting more; you win by agreeing on meaning. OpenTelemetry gives you the plumbing, but Semantic Conventions and correlation are the strategy:
- Conform to well-known attribute names (don’t improvise).
- Enforce correlation (trace_id, span_id) across spans, logs, and metrics.
- Centralize enrichment in the Collector so teams inherit consistency.
- Ship semantics via golden templates so every new service is observable on day one.