Overview #
In platform engineering, semantics (shared meaning about services, requests, resources, and failure modes) matter more than raw signal volume. OpenTelemetry gives you spans, traces, metrics, and logs, but it’s the Semantic Conventions and correlation that turn those signals into answers. This post lays out a semantics-first approach you can standardize across your platform so every team inherits powerful, low-friction observability by default.
Why platform observability is hard #
Platform teams operate polyglot, multi-tenant systems: Kubernetes, serverless, data pipelines, message buses, edge gateways, and CI/CD runners, each speaking its own dialect. The resulting “observability debt” shows up as:
- Dashboards that don’t line up (fields named differently per service)
- High-cardinality costs from free-text labels (e.g., full URLs, user IDs)
- Logs that can’t be tied back to a request
- Traces with missing hops due to broken header propagation
- Metrics that trend but can’t be explained
A semantics-first strategy fixes this by enforcing consistent names, attributes, and correlation IDs across all four signals.
OTel signals—built for correlation #
1) Traces & spans (the backbone)
- Trace: a single request’s journey across services.
- Span: one timed operation within that journey (e.g., “HTTP GET /orders/{id}”).
- Correlation keys: trace_id, span_id, parent_span_id.
- Span events: time-stamped notes (e.g., “retry attempt=2”), great for recording meaningful, structured “logs inside the span”.
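To make the correlation keys concrete, here is a minimal, SDK-free sketch (plain Python, deliberately not the OTel API) of how trace_id, span_id, and parent_span_id relate across a request:

```python
import secrets

def new_trace_id() -> str:
    # 16 random bytes, hex-encoded: the W3C trace-id format
    return secrets.token_hex(16)

def new_span_id() -> str:
    # 8 random bytes, hex-encoded: the W3C span-id format
    return secrets.token_hex(8)

# A root span has no parent; every child copies the trace_id
# and points its parent_span_id at the span that created it.
root = {"trace_id": new_trace_id(), "span_id": new_span_id(),
        "parent_span_id": None, "name": "HTTP GET /orders/{id}"}
child = {"trace_id": root["trace_id"], "span_id": new_span_id(),
         "parent_span_id": root["span_id"], "name": "SELECT orders"}

assert child["trace_id"] == root["trace_id"]        # same journey
assert child["parent_span_id"] == root["span_id"]   # explicit hop link
```

Real SDKs generate and propagate these IDs for you; the point is that every signal can carry the same two keys.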
2) Metrics (the pulse)
- Counters, gauges, histograms summarize service health (RPS, latency, error rate).
- Exemplars link metric data points to example trace_ids, so you can jump from a latency spike directly into a representative trace.
- Cardinality control is crucial: use templated labels and avoid user-specific labels.
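As an illustration of the exemplar idea (this is a toy data structure, not the OTel metrics SDK), a histogram can keep one representative trace_id per bucket so a latency spike is one click away from a trace:

```python
import bisect

class ExemplarHistogram:
    """Latency histogram that stores one exemplar trace_id per bucket,
    mimicking how OTel metric data points link back to traces."""
    def __init__(self, bounds):
        self.bounds = list(bounds)               # upper bucket boundaries (ms)
        self.counts = [0] * (len(bounds) + 1)    # +1 for the overflow bucket
        self.exemplars = [None] * (len(bounds) + 1)

    def record(self, latency_ms, trace_id):
        i = bisect.bisect_left(self.bounds, latency_ms)
        self.counts[i] += 1
        self.exemplars[i] = trace_id             # keep the latest example trace

h = ExemplarHistogram([50, 100, 500])
h.record(30, "trace-aaa")
h.record(420, "trace-bbb")   # a slow request lands in the 100-500ms bucket
```

When the 100-500ms bucket spikes, `h.exemplars[2]` hands you a concrete trace to open.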
3) Logs (the narrative)
- Rich context for decisions and errors.
- Must be structured (JSON) with standardized keys and injected correlation IDs (trace_id, span_id) so a single query pivots from a log line to the originating trace and spans.
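A minimal sketch of that injection using only the Python stdlib (the ContextVar stands in for the active OTel span context, which a real bridge would read instead):

```python
import json
import logging
from contextvars import ContextVar

# Hypothetical ambient context; a real app would read the active OTel span.
current_trace = ContextVar("current_trace",
                           default={"trace_id": "", "span_id": ""})

class TraceContextFilter(logging.Filter):
    """Copies trace_id/span_id onto every record - the same job Logback MDC
    or structlog processors do in other stacks."""
    def filter(self, record):
        ctx = current_trace.get()
        record.trace_id = ctx["trace_id"]
        record.span_id = ctx["span_id"]
        return True

def to_json(record):
    # Standardized keys so one query pivots from a log line to its trace.
    return json.dumps({
        "severity": record.levelname,
        "message": record.getMessage(),
        "trace_id": record.trace_id,
        "span_id": record.span_id,
    })

current_trace.set({"trace_id": "abc123", "span_id": "def456"})
rec = logging.LogRecord("payment-service", logging.ERROR, __file__, 0,
                        "Failed to connect", None, None)
TraceContextFilter().filter(rec)
```

`to_json(rec)` now emits a structured line carrying trace_id and span_id alongside the message.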
4) Resources (the identity)
- Describe where telemetry came from. Examples: service.name, service.version, service.namespace, deployment.environment, k8s.cluster.name, k8s.namespace.name, k8s.pod.name, cloud.region, host.*, container.*.
- Resource attributes are the join keys for cross-cutting views and tenancy boundaries.
Semantic Conventions: your platform contract #
OTel’s Semantic Conventions define standard attribute names and when to set them. Treat them as a platform contract. Some examples:
- Base resource identity (applies to every process):
  - service.name, service.version, service.namespace
  - deployment.environment (e.g., dev|staging|prod)
  - telemetry.sdk.* (auto-populated by OTel SDKs)
- Workload context (for K8s/containers):
  - k8s.cluster.name, k8s.namespace.name, k8s.pod.name, k8s.node.name
  - container.id, container.image.name, container.image.tag
- Protocol families:
  - HTTP: http.method, http.route, http.status_code, server.address, client.address
  - DB: db.system, db.name, db.operation
  - Messaging: messaging.system, messaging.destination, messaging.operation
Good platform practices when capturing observability data:
- Never invent ad-hoc keys when a SemConv exists.
- Normalize values (e.g., lowercase environment names).
- Template high-cardinality fields (routes, topics).
- Redact/avoid sensitive payloads and PII in attributes.
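The last three practices can be sketched as small attribute-processing helpers (hypothetical names; in production this logic usually lives in the Collector, not in app code):

```python
import re

def normalize_env(value: str) -> str:
    # Normalize values: "Prod " and "prod" must be the same dimension.
    return value.strip().lower()

def template_route(path: str) -> str:
    # Template high-cardinality fields: collapse numeric and long hex
    # segments so http.route stays low-cardinality.
    path = re.sub(r"/[0-9]+(?=/|$)", "/{id}", path)
    return re.sub(r"/[0-9a-f]{8,}(?=/|$)", "/{id}", path)

# Redact sensitive attributes/PII before export (illustrative denylist).
SENSITIVE = {"user.email", "card.number", "user.id"}

def scrub(attrs: dict) -> dict:
    return {k: v for k, v in attrs.items() if k not in SENSITIVE}
```

For example, `template_route("/users/123/orders/456")` yields `/users/{id}/orders/{id}`, turning millions of label values into one.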
Correlation strategy #
- Propagate context everywhere:
  - Use W3C Trace Context across HTTP/gRPC/messaging. Ship language-specific middleware with your service templates so teams don’t forget.
- Inject trace IDs into logs:
  - Bridge your logging framework with OTel context: Java (SLF4J/Logback MDC), .NET (ILogger scopes), Python (structlog/loguru), JS (winston/pino).
  - Emit trace_id and span_id fields in every log line.
  - Standardize log keys, for example: severity, message, service.name, deployment.environment, trace_id, span_id.
- Enable metrics exemplars:
  - Histograms for latency/error rate should attach exemplars with trace_id.
- Unify resources at the edge:
  - Run the OpenTelemetry Collector as a DaemonSet or sidecar to enrich all signals with k8s and service resource attributes using the k8sattributes and resource processors. This guarantees consistent identity even when apps are misconfigured.
- Normalize & route:
  - Use Collector processors to:
    - Set defaults (e.g., if service.namespace is missing, derive it from k8s.namespace.name).
    - Drop noisy attributes (user IDs, full URLs).
    - Route by deployment.environment: send dev to a low-cost store, prod to your primary store.
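What the propagation middleware actually passes between hops is the W3C `traceparent` header. A hedged sketch of its shape (real SDK propagators handle this, including validation and tracestate, which this toy version skips):

```python
import secrets

def make_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    # W3C Trace Context header: version-traceid-spanid-flags
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def continue_trace(traceparent: str) -> str:
    """What each hop's middleware does on an outgoing call:
    keep the trace_id, mint a fresh span_id, preserve the flags."""
    version, trace_id, _parent_span_id, flags = traceparent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

incoming = make_traceparent(secrets.token_hex(16), secrets.token_hex(8))
outgoing = continue_trace(incoming)   # same trace, new span
```

If any hop drops or regenerates this header instead of continuing it, the trace breaks, which is exactly the partial-propagation gotcha below.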
Platform pattern (what you standardize) #
- Golden scaffolds (your IDP templates):
  - OTel SDK + auto-instrumentation wired in.
  - Middleware for context propagation (HTTP/gRPC/messaging).
  - Logging bridge preconfigured for trace_id/span_id injection.
  - App health metrics (RPS/latency/errors) instruments included.
- Cluster-level collectors:
  - receivers: otlp, k8s_events, filelog (if scraping app logs), prometheus (for legacy exporters).
  - processors: k8sattributes, resource, attributes (drop/rename), batch, transform.
  - exporters: the backends you use (multiple allowed).
- Dashboards & SLOs (generated from semantics):
  - Service Overview (RPS, p50/p90/p99 latency, error rate) keyed by service.name, service.namespace, deployment.environment.
  - “Follow the trace” links on every panel via exemplars.
  - Error analytics by exception.type, http.status_code.
Gotchas to avoid #
- Cardinality explosions: Don’t label by raw URL, request IDs, or user IDs. Use templated routes (/users/{id} instead of /users/1abbcsdee2) and coarse-grained dimensions (http.method, http.route, http.status_code) over fine-grained ones (user.id, request.id).
- Partial propagation: One missing hop breaks correlation. Ship middleware by default and test with synthetic traces through gateways, queues, and batch jobs. Flow: [User → API Gateway → Order Service → Kafka → Invoice Processor → DB]. With this, you’re not testing functionality, you’re testing trace continuity.
- Invented fields: app, env, svc sound harmless but break discoverability. Use service.* and deployment.environment.
- Logs as strings: Unstructured logs destroy correlation. Emit JSON and keep keys consistent.
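One way to automate the trace-continuity check described above (a hypothetical helper for a synthetic-trace CI job, not a standard tool):

```python
def check_continuity(spans):
    """Given the spans collected from one synthetic request, verify that
    every hop shares a single trace_id and every non-root span points at
    a span that actually exists - i.e., no broken hops."""
    if len({s["trace_id"] for s in spans}) != 1:
        return False
    ids = {s["span_id"] for s in spans}
    return all(s["parent_span_id"] in ids
               for s in spans if s["parent_span_id"])

hops = [
    {"trace_id": "t1", "span_id": "a", "parent_span_id": None},  # API Gateway
    {"trace_id": "t1", "span_id": "b", "parent_span_id": "a"},   # Order Service
    {"trace_id": "t1", "span_id": "c", "parent_span_id": "b"},   # Invoice Processor
]
```

A hop that regenerated its trace_id, or a span whose parent never arrived, makes `check_continuity` return False and should fail the synthetic test.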
Observability Contract (Platform Standard) #
Every service onboarded to the platform must adhere to the following semantic and correlation requirements. This ensures consistent observability, cross-signal correlation, and zero-config dashboards across the platform. Add it to your internal service template (language-agnostic, enforced by linters/CI checks):
- Resource Attributes (Identity Layer): Defines who emitted the telemetry and where it runs. Must be present on every span, log, and metric. Examples: service.name, service.namespace, service.version, k8s.cluster.name, container.image.name. Inject via the OTel Collector (k8sattributes processor) so identity is consistent across all signals.
- Trace & Span Standards: Defines how to represent requests and operations. Enforced via OTel SDKs and auto-instrumentation.
- Metrics Standards: Defines what to measure and how to correlate. Examples: http.server.request.duration, http.server.requests, errors.total.
- Logging Standards: Defines what context each log line carries.
  - Structured JSON logs only.
  - Auto-inject OTel context → logger via middleware or adapter.
{
"timestamp": "2025-09-30T10:12:45Z",
"severity": "ERROR",
"message": "Failed to connect to merchant bank account",
"trace_id": "abc123...",
"span_id": "def456...",
"service.name": "payment-service",
"deployment.environment": "prod",
"k8s.pod.name": "payment-service-abc"
}
- Sampling & Retention Policy

| Environment | Sampling | Retention | Export Destination |
|---|---|---|---|
| dev | 10% | 7 days | Low-cost backend |
| staging | 25% | 14 days | Shared backend |
| prod | 100% (errors), tail sampling for slow requests | 30 days | Primary observability backend |
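The sampling policy in the table can be expressed as a small decision function (a hypothetical app-side sketch; in practice the prod rules map to the Collector's tail-sampling capabilities rather than code like this):

```python
import random

# Head-sampling rates per environment, mirroring the table above.
HEAD_RATES = {"dev": 0.10, "staging": 0.25}

def keep_trace(env: str, is_error: bool, duration_ms: float,
               slow_threshold_ms: float = 1000.0) -> bool:
    """Decide whether a finished trace is exported."""
    if env == "prod":
        # Keep 100% of errors; tail-sample slow requests.
        return is_error or duration_ms >= slow_threshold_ms
    # Non-prod: probabilistic head sampling.
    return random.random() < HEAD_RATES.get(env, 1.0)
```

The `slow_threshold_ms` value is an assumption for illustration; tune it per service SLO.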
- Validation & Enforcement
  - Linting: Check for required attributes before deploy.
  - CI Gate: Reject merge if service.name or deployment.environment is missing.
  - Collector Audit Logs: Flag dropped signals due to missing semantics.
  - Dashboards as Code: Provision generic dashboards for services via code (grafana-sdk, perses-dev).
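The CI gate can be as simple as a lint over the declared resource attributes (hypothetical helper; wire it into whatever checks your pipeline already runs):

```python
# Attributes the platform contract requires on every service.
REQUIRED = {"service.name", "deployment.environment"}

def lint_resource(attrs: dict) -> list:
    """Return the sorted list of missing required attributes.
    An empty list means the service passes; a CI gate fails otherwise."""
    return sorted(REQUIRED - attrs.keys())

assert lint_resource({"service.name": "payments",
                      "deployment.environment": "prod"}) == []
assert lint_resource({"service.name": "payments"}) == ["deployment.environment"]
```

Because the check keys off SemConv names, it needs no per-team configuration.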
Takeaway #
You don’t win observability by collecting more; you win by agreeing on meaning. OpenTelemetry gives you the plumbing, but Semantic Conventions and correlation are the strategy:
- Conform to well-known attribute names (don’t improvise).
- Enforce correlation (trace_id, span_id) across spans, logs, and metrics.
- Centralize enrichment in the Collector so teams inherit consistency.
- Ship semantics via golden templates so every new service is observable on day one.