OpenTelemetry for AI Agent Observability and Tracing

OpenTelemetry provides a vendor-neutral standard for tracing LLM calls, tool invocations, and sub-agent handoffs, which makes agent execution trees visible in any compatible observability backend.

Why OTel for agents

OpenTelemetry instruments agent systems by attaching spans to LLM calls, tool invocations, and sub-agent handoffs. This produces a trace tree that any compatible backend, such as Datadog, Grafana, or Jaeger, can ingest and visualize. Ad-hoc logging, by contrast, is fragile, hard to compose, and usually tied to the log format of a single backend.

The mechanism is semantic conventions. The OpenTelemetry GenAI SIG defines standard attribute names, span types, metrics, and events for AI systems. Because every instrumented framework writes to the same attribute schema, backends can correlate spans across agent boundaries, frameworks, and vendors without bespoke per-backend parsers. A span’s gen_ai.operation.name, gen_ai.usage.input_tokens, and parent/child relationships encode the execution tree directly, and shared trace context links the spans of a multi-agent run into a single trace.

GenAI semantic conventions

The GenAI semantic conventions define standard span attributes for LLM interactions. Some early attributes are deprecated as the conventions mature:

Attribute                    Purpose
gen_ai.system                Provider identifier (deprecated; replaced by gen_ai.provider.name)
gen_ai.provider.name         Provider name (replaces gen_ai.system)
gen_ai.request.model         Model invoked
gen_ai.usage.input_tokens    Tokens consumed in the request
gen_ai.usage.output_tokens   Tokens generated in the response
gen_ai.operation.name        Operation type (for example chat, execute_tool, create_agent, invoke_agent)

Technology-specific conventions cover Anthropic, OpenAI, AWS Bedrock, and Azure AI Inference as providers, plus the Model Context Protocol (MCP).

Agent span types

The agent span conventions define two primary span types:

Create Agent (gen_ai.operation.name = create_agent) covers agent initialization. It carries attributes for agent ID, name, description, version, and requested model.

Invoke Agent (gen_ai.operation.name = invoke_agent) covers agent execution. It carries conversation ID, input and output types, token usage, temperature, and finish reasons.

Agent-specific attributes include gen_ai.agent.id, gen_ai.agent.name, gen_ai.agent.description, and gen_ai.agent.version.
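The two span types can be sketched as a pair of helpers that build the attribute sets described above. This is a minimal sketch, not an official API: the function names are hypothetical, and gen_ai.conversation.id, gen_ai.request.temperature, and gen_ai.response.finish_reasons are attribute names assumed from the GenAI conventions rather than listed in this article, so check them against the spec version you target.

```python
from typing import Any, Optional, Sequence


def _drop_unset(attrs: dict[str, Any]) -> dict[str, Any]:
    # OTel attribute values may not be None, so unset fields are removed.
    return {key: value for key, value in attrs.items() if value is not None}


def create_agent_attributes(
    agent_id: str,
    name: str,
    model: str,
    description: Optional[str] = None,
    version: Optional[str] = None,
) -> dict[str, Any]:
    """Attributes for a create_agent span (agent initialization)."""
    return _drop_unset({
        "gen_ai.operation.name": "create_agent",
        "gen_ai.agent.id": agent_id,
        "gen_ai.agent.name": name,
        "gen_ai.agent.description": description,
        "gen_ai.agent.version": version,
        "gen_ai.request.model": model,
    })


def invoke_agent_attributes(
    agent_id: str,
    name: str,
    conversation_id: Optional[str] = None,
    temperature: Optional[float] = None,
    input_tokens: Optional[int] = None,
    output_tokens: Optional[int] = None,
    finish_reasons: Optional[Sequence[str]] = None,
) -> dict[str, Any]:
    """Attributes for an invoke_agent span (agent execution)."""
    return _drop_unset({
        "gen_ai.operation.name": "invoke_agent",
        "gen_ai.agent.id": agent_id,
        "gen_ai.agent.name": name,
        "gen_ai.conversation.id": conversation_id,        # assumed name
        "gen_ai.request.temperature": temperature,        # assumed name
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": finish_reasons,  # assumed name
    })
```

The resulting dictionary can be passed to a span in one call, for example with span.set_attributes(...), which keeps attribute naming in one place when the conventions change.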

Trace structure for multi-agent runs

A well-instrumented agent system produces a trace tree:

Root span: user request
  ├── invoke_agent: orchestrator
  │   ├── chat: LLM call (model selection, token count)
  │   ├── execute_tool: file_read (latency, output size)
  │   ├── chat: LLM call (reasoning step)
  │   └── invoke_agent: sub-agent handoff
  │       ├── chat: LLM call
  │       └── execute_tool: api_call
  └── final response

Each span carries timing, token counts, and error state.

Instrumentation approaches

Frameworks adopt OTel instrumentation in two ways:

Baked-in instrumentation means the framework emits OTel traces natively, for example CrewAI. Adoption is simpler, but it couples the framework to OTel versions.

External instrumentation libraries are separate packages that add OTel spans around framework calls, for example Traceloop and Langtrace. Maintenance stays decoupled, but you risk fragmentation.

Both approaches produce interoperable traces through shared semantic conventions.

What to capture

Signal                     Value
Token usage per call       Cost tracking and budget enforcement
Latency per span           Bottleneck identification
Tool call inputs/outputs   Debugging incorrect tool usage
Error types and rates      Reliability measurement
Model and temperature      Reproducibility
Conversation/session ID    Multi-turn correlation

Token usage and latency are the minimum viable signals. Tool inputs and outputs and model parameters add debugging depth, at the cost of larger traces.
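To illustrate how per-call token usage turns into cost tracking and budget enforcement, here is a small sketch that works on the token counts recorded in gen_ai.usage.input_tokens and gen_ai.usage.output_tokens. The model name and prices are hypothetical placeholders, and TraceBudget is an illustrative helper rather than an OpenTelemetry feature.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million (input, output) tokens.
PRICES_PER_MILLION_USD = {"example-model": (3.00, 15.00)}


def call_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Convert the token counts of one LLM call into an estimated cost."""
    input_price, output_price = PRICES_PER_MILLION_USD[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


class BudgetExceeded(RuntimeError):
    """Raised when a trace spends more than its configured limit."""


@dataclass
class TraceBudget:
    limit_usd: float
    spent_usd: float = 0.0

    def record(self, model: str, input_tokens: int, output_tokens: int) -> float:
        """Add one call to the running total and enforce the limit."""
        self.spent_usd += call_cost_usd(model, input_tokens, output_tokens)
        if self.spent_usd > self.limit_usd:
            raise BudgetExceeded(
                f"spent {self.spent_usd:.4f} USD, limit {self.limit_usd:.4f} USD"
            )
        return self.spent_usd
```

Calling budget.record(...) right after setting the usage attributes on a chat span keeps the enforcement decision next to the data it is based on.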

Detecting problems from traces

Structured traces let you detect agent problems automatically:

  • Loop patterns: repeated identical tool calls or LLM requests within a trace point to stuck agents
  • Cost anomalies: per-trace token usage that spikes above historical baselines
  • Latency drift: rising span durations within a session can signal a growing prompt or slower model throughput
  • Error cascades: tool failures that propagate through sub-agent chains
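Two of these checks, loop patterns and cost anomalies, can be sketched over exported span records. The record shape used here (plain dictionaries with name, operation, and input keys) is a hypothetical flattened export, not a format defined by OpenTelemetry, and the thresholds are arbitrary starting points.

```python
import json
from collections import Counter
from statistics import mean, stdev


def find_repeated_calls(spans, threshold=3):
    """Return calls that occur at least `threshold` times with identical input."""
    counts = Counter()
    for span in spans:
        if span.get("operation") not in ("execute_tool", "chat"):
            continue
        # Serialize the input with sorted keys so equal inputs compare equal.
        key = (span["name"], json.dumps(span.get("input"), sort_keys=True))
        counts[key] += 1
    return {key: n for key, n in counts.items() if n >= threshold}


def is_cost_anomaly(trace_tokens, history, z_threshold=3.0):
    """Flag a trace whose token total sits far above the historical mean."""
    if len(history) < 2:
        return False  # not enough data for a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return trace_tokens > mu
    return (trace_tokens - mu) / sigma > z_threshold


read_span = {
    "name": "execute_tool file_read",
    "operation": "execute_tool",
    "input": {"path": "a.txt"},
}
chat_span = {"name": "chat", "operation": "chat", "input": {"prompt": "plan"}}
sample_trace = [read_span] * 3 + [chat_span]

repeated = find_repeated_calls(sample_trace)
history = [1000, 1100, 900, 1050, 950]
```

In this sample, the file_read call repeats three times with the same input and is flagged, while the single chat call is not. A z-score against a rolling history is only one possible baseline; percentile thresholds are a common alternative.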

When this backfires

OTel instrumentation is not cost-free:

  • Telemetry volume at scale: AI workloads can generate 10–50× more telemetry than traditional services, because every LLM call produces token-level metrics, prompt and response events, and nested tool spans. Storage costs scale with trace depth, and capturing full prompt and response bodies adds more.
  • PII exposure: prompts often contain user data. Forwarding raw tool inputs and LLM prompt content to observability backends without sanitization creates compliance risk under GDPR, HIPAA, and similar regulations.
  • Setup overhead for prototypes: OTel SDK configuration, exporter setup, and collector deployment add days to weeks of effort. For experimental or short-lived agents, a lightweight structured log to stdout is faster to iterate on.
  • Spec instability: GenAI semantic conventions are still stabilizing, and some attribute names have already been deprecated, for example gen_ai.system in favor of gen_ai.provider.name. Baked-in instrumentation in frameworks can lag upstream spec changes.
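The PII and volume risks above can both be reduced by sanitizing content before it is attached to a span. The following is a deliberately naive sketch: the regular expressions, placeholder tokens, and length limit are assumptions chosen for illustration, and pattern matching of this kind will miss many forms of personal data, so it is not a substitute for a reviewed redaction policy.

```python
import re

# Naive patterns for illustration only; real PII detection needs more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
MAX_ATTRIBUTE_CHARS = 256  # arbitrary cap to limit trace size


def sanitize(text: str, max_chars: int = MAX_ATTRIBUTE_CHARS) -> str:
    """Redact obvious identifiers, then truncate long content."""
    redacted = EMAIL_RE.sub("[EMAIL]", text)
    redacted = PHONE_RE.sub("[PHONE]", redacted)
    if len(redacted) > max_chars:
        redacted = redacted[:max_chars] + "...[truncated]"
    return redacted


example = sanitize("Contact jane.doe@example.com or +1 (555) 123-4567 today")
```

Applying sanitize to prompts and tool inputs at the point where attributes are set means raw content never leaves the process. The same step can also be done centrally in a collector pipeline.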

Example

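Minimal Python example using the OpenTelemetry API to instrument an LLM call with GenAI semantic conventions. It assumes the anthropic package is installed and that a tracer provider and exporter are configured elsewhere; the prompt is a placeholder:

import anthropic
from opentelemetry import trace

# Attribute names are written as string literals so the example does not
# depend on a particular semantic-conventions package version.
tracer = trace.get_tracer("my-agent")
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
model = "claude-3-5-sonnet-20241022"
prompt = "Summarize the latest deployment log."  # placeholder prompt

with tracer.start_as_current_span("chat") as span:
    span.set_attribute("gen_ai.operation.name", "chat")
    # gen_ai.provider.name replaces the deprecated gen_ai.system.
    span.set_attribute("gen_ai.provider.name", "anthropic")
    span.set_attribute("gen_ai.request.model", model)

    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )

    # Record token usage once the response is available.
    span.set_attribute("gen_ai.usage.input_tokens", response.usage.input_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", response.usage.output_tokens)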

For sub-agent handoffs, wrap the child agent call in an invoke_agent span and propagate the trace context so the parent trace links to the child execution tree.

Key Takeaways

  • OpenTelemetry GenAI semantic conventions provide standard attribute names (gen_ai.*) for LLM calls, tool invocations, and agent spans.
  • Trace trees make multi-agent execution visible: which agent decided what, where time was spent, and where failures occurred.
  • Token usage and latency per span are the minimum viable signals for agent observability.
  • Frameworks either bake in OTel instrumentation or support it through external libraries; both produce interoperable traces.
  • Structured traces enable automated detection of loops, cost anomalies, and error cascades.