Quality #02

Logging & Audit Trails

When something breaks, can you trace what happened, why, and who approved it?


The Failure Scenario

A customer reports that your AI agent sent them a contract with the wrong pricing tier: $4,800/year instead of $48,000. The sales team wants to know what happened. You check the logs and find a single line: "agent completed task: send_contract." No record of which pricing data the agent retrieved, which template it selected, what the user's original request said, or whether any approval step was triggered.

The engineering team spends two days trying to reproduce the issue in staging. They cannot, because the agent's behavior depended on the retrieval context at the time: which documents were returned from the vector store, what the conversation history looked like, and what the model's confidence score was. None of this was logged. The incident becomes a he-said-she-said between the customer and the team.

This is what happens when you treat agent logging the same way you treat application logging. Traditional apps log request/response pairs. Agents make dozens of internal decisions: tool selections, parameter choices, retrieval rankings, and confidence assessments. Any one of them can be the root cause of a failure. If you are not capturing the decision chain, you cannot debug, you cannot audit, and you cannot improve.

Why This Matters

Agent systems are inherently non-deterministic. The same input can produce different outputs depending on model state, retrieved context, and tool availability. This makes post-hoc debugging impossible without comprehensive logging. You cannot rely on "just running it again" because the conditions that produced the failure may never recur in exactly the same way.

Audit trails are also a regulatory requirement in most industries that handle customer data or financial transactions. SOC 2 Type II, HIPAA, and GDPR all require the ability to reconstruct who accessed what data, when, and for what purpose. An AI agent that queries a customer database without logging the query, the results returned, and the downstream action taken is a compliance gap that auditors will flag.

Beyond compliance, structured logs are the foundation of agent improvement. Every logged decision is a training signal. Which tool calls failed and why? Where did the agent hesitate and request clarification? Which retrieval results were irrelevant? Without this data, you are flying blind, shipping agent updates based on vibes instead of evidence.

How to Implement

Assign a trace ID to every agent invocation: a unique identifier that follows the request from the initial user message through every tool call, retrieval query, and model inference. This trace ID must propagate across service boundaries so you can reconstruct the full decision chain from a single identifier. Use OpenTelemetry-compatible trace IDs if you want to integrate with existing observability tooling.
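In-process, the propagation itself can be as simple as a `contextvars` variable bound at the entry point, so helpers deep in the call stack attach the trace ID without threading it through every function signature. A minimal sketch (the names `current_trace_id` and `log_event` are illustrative, not part of any library):

```python
import contextvars
import uuid

# Context variable holding the trace ID for the current invocation.
current_trace_id: contextvars.ContextVar[str] = contextvars.ContextVar(
    "current_trace_id", default=""
)

def start_invocation() -> str:
    """Generate a trace ID at the entry point and bind it to the context."""
    trace_id = str(uuid.uuid4())
    current_trace_id.set(trace_id)
    return trace_id

def log_event(event: str, **fields) -> dict:
    """Downstream code attaches the trace ID without it being passed in."""
    return {"event": event, "trace_id": current_trace_id.get(), **fields}

trace_id = start_invocation()
record = log_event("agent.tool.call", tool_name="search_docs")
```

Crossing service boundaries still requires putting the ID on the wire (e.g. as a request header), which is exactly what the W3C `traceparent` convention used by OpenTelemetry standardizes.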

Log at five critical points in the agent lifecycle: (1) the incoming request with full context, (2) every retrieval query and its results, (3) every tool call with parameters and response, (4) every decision point where the agent chose between alternatives, and (5) the final output delivered to the user. Each log entry should include the trace ID, timestamp, agent version, and a structured payload, not a formatted string.

Set retention policies that match your compliance requirements and debugging needs. Raw logs with full context should be retained for at least 90 days. Anonymized aggregate data, such as tool call success rates, latency percentiles, and error categories, should be retained indefinitely for trend analysis. Implement automatic PII redaction at the logging layer so sensitive data never hits your log storage unmasked.
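Redaction at the logging layer can be a small function that masks string values before the payload is serialized. A minimal sketch, assuming regex-only detection (the patterns are illustrative; production setups typically add NER-based detection on top):

```python
import re

# Illustrative patterns only; a real deployment needs a broader set.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def redact_pii(payload: dict) -> dict:
    """Return a copy of the payload with matching substrings masked."""
    redacted = {}
    for key, value in payload.items():
        if isinstance(value, str):
            for label, pattern in PII_PATTERNS.items():
                value = pattern.sub(f"[REDACTED:{label}]", value)
        elif isinstance(value, dict):
            value = redact_pii(value)  # recurse into nested payloads
        redacted[key] = value
    return redacted
```

Running this as a structlog processor (or the equivalent hook in your logging library) guarantees every log path goes through it, rather than relying on each call site to remember.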

agent_logger.py
import hashlib
import uuid
from datetime import datetime, timezone

import structlog

logger = structlog.get_logger()

def redact_pii(params: dict) -> dict:
    # Minimal stub: mask well-known sensitive keys before they are logged.
    # Production redaction should combine regex and NER-based detection.
    sensitive_keys = {"email", "phone", "name", "address", "iban"}
    return {k: ("[REDACTED]" if k in sensitive_keys else v)
            for k, v in params.items()}

class AgentAuditLogger:
    def __init__(self, agent_id: str, agent_version: str):
        self.agent_id = agent_id
        self.agent_version = agent_version

    def start_trace(self, user_input: str, session_id: str) -> str:
        trace_id = str(uuid.uuid4())
        logger.info("agent.invocation.start",
            trace_id=trace_id,
            agent_id=self.agent_id,
            agent_version=self.agent_version,
            session_id=session_id,
            input_hash=hashlib.sha256(user_input.encode()).hexdigest(),
            input_length=len(user_input),
            timestamp=datetime.now(timezone.utc).isoformat(),
        )
        return trace_id

    def log_tool_call(self, trace_id: str, tool: str,
                      params: dict, result: dict, latency_ms: float):
        logger.info("agent.tool.call",
            trace_id=trace_id,
            tool_name=tool,
            params=redact_pii(params),
            result_status=result.get("status"),
            result_size=len(str(result)),
            latency_ms=latency_ms,
        )

    def log_decision(self, trace_id: str, decision_type: str,
                     options: list[str], chosen: str, confidence: float):
        logger.info("agent.decision",
            trace_id=trace_id,
            decision_type=decision_type,
            options_count=len(options),
            chosen=chosen,
            confidence=round(confidence, 4),
        )

Production Checklist

  • ✓ Every agent invocation generates a unique trace ID that propagates through all downstream calls.
  • ✓ Tool calls are logged with full parameters, response status, and latency (before and after execution).
  • ✓ Retrieval queries log the query text, number of results, relevance scores, and which results were used.
  • ✓ Decision points log the available options, the chosen action, and the confidence score.
  • ✓ PII redaction runs at the logging layer. Sensitive fields are masked before reaching log storage.
  • ✓ Log retention is configured: 90+ days for raw logs, indefinite for aggregated metrics.
  • ✓ A dashboard exists showing tool call success rates, agent latency p50/p95/p99, and error categories.
  • ✓ Logs are queryable by trace ID within 60 seconds, not buried in a data lake with a 4-hour query time.
  • ✓ Alert rules fire on anomalies: spike in tool failures, unusual tool call patterns, or confidence drops.
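The last checklist item can start small: a sliding-window failure-rate rule over tool-call statuses catches a spike long before a dashboard review would. A sketch with illustrative window and threshold values:

```python
from collections import deque

class ToolFailureAlert:
    """Fire when the tool-call failure rate in a sliding window is too high."""

    def __init__(self, window: int = 100, threshold: float = 0.2):
        self.results = deque(maxlen=window)  # most recent call statuses
        self.threshold = threshold

    def record(self, status: str) -> bool:
        """Record one tool-call status; return True if the alert should fire."""
        self.results.append(status)
        failures = sum(1 for s in self.results if s != "ok")
        # Require a minimum sample size before alerting to avoid cold-start noise.
        return (len(self.results) >= 10
                and failures / len(self.results) > self.threshold)

alert = ToolFailureAlert()
# Simulate a stream where every third tool call errors (~33% failure rate).
fired = [alert.record("error" if i % 3 == 0 else "ok") for i in range(30)]
```

In production you would feed this from the log stream and wire the boolean to your paging system; the point is that the rule itself is a few lines, not a platform project.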

Common Pitfalls

The biggest mistake is logging the final output but not the intermediate steps. When an agent sends an incorrect email, knowing that it sent an email is useless. You need to know which data it retrieved, how it interpreted the user's request, and what template it selected. Log the decision chain, not just the result.

Another pitfall is logging too much unstructured data. A 50KB JSON blob dumped into a log line is technically a record, but it is not queryable, not alertable, and not useful during an incident. Use structured fields with consistent naming conventions. If you cannot write a log query for a specific failure mode in under 30 seconds, your logging schema needs work.
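As a concrete contrast: with consistent structured fields, the query for a specific failure mode is a one-liner over log records, which a JSON blob buried inside a log string can never give you. A toy illustration over in-memory records (field names follow the `agent_logger.py` example above):

```python
# Structured log records: each field is individually queryable.
records = [
    {"event": "agent.tool.call", "tool_name": "send_contract",
     "result_status": "error", "latency_ms": 812},
    {"event": "agent.tool.call", "tool_name": "fetch_pricing",
     "result_status": "ok", "latency_ms": 64},
]

# "Which contract sends failed?" is answerable in one expression.
failed_contract_sends = [
    r for r in records
    if r["event"] == "agent.tool.call"
    and r["tool_name"] == "send_contract"
    and r["result_status"] == "error"
]
```

The same query against a free-text blob means regex archaeology mid-incident; against structured fields it is the 30-second query the text above asks for.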

Teams also frequently forget to log failed actions, like tool calls that errored, retrieval queries that returned zero results, and approval requests that timed out. These negative signals are often more valuable than success logs because they reveal where the agent is struggling and where it might be silently degrading.
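Capturing those negative signals can be as simple as a thin wrapper around the retrieval call. A sketch using stdlib `logging`; `run_query` and the event names are assumptions for illustration:

```python
import logging

audit_log = logging.getLogger("agent.audit")

def retrieve_with_audit(run_query, query: str, trace_id: str) -> list:
    """Run a retrieval and log the negative signals: errors and zero hits."""
    try:
        results = run_query(query)
    except Exception as exc:
        # A failed retrieval is a first-class event, not a swallowed exception.
        audit_log.error("retrieval.failed trace_id=%s error=%s", trace_id, exc)
        raise
    if not results:
        # Zero results is a signal worth tracking: the agent had nothing
        # to ground its answer on, even though no error was raised.
        audit_log.warning("retrieval.empty trace_id=%s query_len=%d",
                          trace_id, len(query))
    return results

hits = retrieve_with_audit(lambda q: [], "enterprise pricing tier", "t-123")
```

A sustained rise in `retrieval.empty` events is often the earliest sign of an index drifting out of date, well before users start reporting wrong answers.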

Terminal Output

terminal
$ clawproof --check 02

  CHECK 02 — Logging & Audit Trails
  ──────────────────────────────────

  [PASS] Trace ID generation configured (UUIDv4)
  [PASS] Tool call logging: before + after execution
  [PASS] Retrieval query logging with relevance scores
  [WARN] Decision point logging found in 2/5 agent flows โ€” incomplete
  [PASS] PII redaction layer active (regex + NER-based)
  [PASS] Log retention: 90 days raw, indefinite aggregated
  [FAIL] No anomaly alert rules configured for agent logs
  [PASS] Trace ID query latency: ~8 seconds (target: <60s)
  [PASS] Failed action logging enabled

  Result: 7 passed, 1 warning, 1 failed
  Status: NEEDS ATTENTION
$ clawproof --assess

Need help implementing this?

We help teams build agent governance frameworks and implement production-grade controls, from quick assessments to full implementation. Built by practitioners who run agents in production every day.

✓ Big 4 + DAX background  ✓ Daily agent operations  ✓ DACH compliance expertise

Stay clawproof

New checks, playbooks, and postmortems. Twice a month.

No spam. Unsubscribe anytime.