$ clawproof --check 10 --verbose
Quality #10

Multi-Agent Coordination

Two agents, one resource, zero coordination. Race conditions aren't just for code.

The Failure Scenario

An e-commerce platform deploys two agents: one handles inventory updates from supplier feeds, the other processes customer orders. Both agents have write access to the same inventory database. A supplier feed arrives at 14:02 showing 3 units of a popular item restocked, and the inventory agent begins updating the count from 0 to 3. Before that write commits, the order agent reads the inventory as 0 and tells a customer the item is out of stock. Two seconds later, the inventory update commits. The customer is already gone.

The next day, the inverse happens. Both agents read the inventory at the same moment: 5 units available. The order agent sells 3 units and writes 2. The inventory agent processes a supplier adjustment of -1 and writes 4 (it read 5, subtracted 1). The inventory agent's write lands last, overwriting the order agent's update. The database now shows 4 units, but only 2 actually exist. Three orders will ship with no inventory to fulfill.

These are classic concurrency bugs: read-write races and lost updates. Every database engineer knows how to prevent them with locks and transactions. Agent developers often miss this because each agent feels like an independent service. In reality, they're concurrent writers to shared state, and they need coordination primitives.
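The lost-update case can also be prevented without locks, using optimistic concurrency control: every write carries the version number it read, and stale writes are rejected so the caller must re-read and retry. A minimal in-process sketch; `VersionedInventory` is a toy class for illustration, not part of any real platform's API:

```python
import threading

class VersionedInventory:
    """Toy inventory store with optimistic concurrency: writes carry the version they read."""
    def __init__(self, count: int):
        self._count = count
        self._version = 0
        self._mu = threading.Lock()

    def read(self) -> tuple[int, int]:
        with self._mu:
            return self._count, self._version

    def compare_and_set(self, new_count: int, expected_version: int) -> bool:
        # The write succeeds only if nobody else wrote since the caller read.
        with self._mu:
            if self._version != expected_version:
                return False  # stale read: caller must re-read and retry
            self._count = new_count
            self._version += 1
            return True

inv = VersionedInventory(5)
count, ver = inv.read()            # order agent reads 5 units at version 0
count2, ver2 = inv.read()          # inventory agent reads 5 concurrently
assert inv.compare_and_set(count - 3, ver)        # order agent sells 3: accepted
assert not inv.compare_and_set(count2 - 1, ver2)  # stale supplier adjustment: rejected
```

The rejected writer re-reads the committed state (2 units, version 1) and re-applies its adjustment, instead of silently overwriting the other agent's update.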

Why This Matters

Multi-agent architectures are becoming the default for complex workflows. Instead of one monolithic agent that handles everything, teams decompose tasks: a research agent gathers data, an analysis agent processes it, a writing agent drafts the output, and a review agent checks quality. This is good software design, with separation of concerns, single responsibility, and independent scaling. But it introduces coordination problems that single-agent systems don't have.

The fundamental challenge is shared state. When two agents can read from and write to the same resource, whether that's a database, a file, an API endpoint, or a shared memory store, every concurrent access becomes a potential data corruption event. LLM-based agents make this worse because their execution time is non-deterministic. You can't predict how long an agent will take to process a step, which means you can't predict when concurrent writes will collide.

Deadlocks are the other side of the coordination coin. If Agent A holds a lock on the inventory table and waits for a lock on the pricing table, while Agent B holds a lock on the pricing table and waits for a lock on the inventory table, both agents halt indefinitely. In traditional systems, database deadlock detection resolves this automatically. In agent systems where locks are application-level, deadlocks can freeze entire workflows with no automatic resolution.
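The standard cure is to make every agent acquire its locks in one global order, so a circular wait can never form. A minimal sketch with Python's `threading` locks, using hypothetical resource names from the scenario above:

```python
import threading

# Hypothetical shared resources, each guarded by a lock
locks = {"inventory": threading.Lock(), "pricing": threading.Lock()}

def acquire_all(needed: list[str]) -> list[str]:
    """Acquire locks in a single global (alphabetical) order so A->B / B->A cycles cannot form."""
    ordered = sorted(needed)
    for name in ordered:
        locks[name].acquire()
    return ordered

def release_all(held: list[str]) -> None:
    # Release in reverse acquisition order
    for name in reversed(held):
        locks[name].release()

# Agent A wants inventory then pricing; Agent B wants pricing then inventory.
# Both sort their requests, so both try "inventory" first: one waits, neither deadlocks.
held = acquire_all(["pricing", "inventory"])
release_all(held)
```

The ordering convention costs nothing at runtime and removes an entire failure class, which is why the coordinator code below enforces it as well.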

How to Implement

The orchestrator pattern is the most reliable approach for multi-agent coordination. Instead of agents communicating peer-to-peer or independently accessing shared resources, a central orchestrator manages task assignment, resource locking, and result collection. The orchestrator ensures that only one agent writes to a given resource at a time, resolves conflicts when they occur, and enforces execution ordering when tasks have dependencies.

For shared-resource access, implement distributed locks with TTL (time-to-live) expiration. When an agent needs to modify a shared resource, it acquires a lock through the orchestrator, performs its work, and releases the lock. The TTL ensures that if an agent crashes mid-task, the lock expires rather than blocking all other agents indefinitely. Use advisory locks for read-heavy workloads where occasional stale reads are acceptable, and exclusive locks for write operations where consistency is required.

State synchronization between agents should use an event-driven model rather than shared mutable state. When Agent A updates inventory, it emits an InventoryUpdated event. Agent B subscribes to that event stream and updates its local view. This eliminates read-write races because agents react to committed state changes rather than reading in-flight state. The event log also provides a complete audit trail of what each agent did and when.
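The event-driven model can be sketched as a small in-process pub/sub bus. `EventBus`, the `InventoryUpdated` payload shape, and the SKU value are illustrative assumptions, not a prescribed API; in production this role is played by a message broker or event store:

```python
from collections import defaultdict
from typing import Callable

class EventBus:
    """Minimal in-process pub/sub: agents react to committed events, never to in-flight state."""
    def __init__(self):
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)
        self.log: list[dict] = []  # append-only audit trail of every emitted event

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[event_type].append(handler)

    def emit(self, event_type: str, payload: dict) -> None:
        self.log.append({"type": event_type, **payload})
        for handler in self._subscribers[event_type]:
            handler(payload)

bus = EventBus()
local_view: dict[str, int] = {}

# Agent B maintains a local read model that only ever reflects committed updates
bus.subscribe("InventoryUpdated", lambda p: local_view.update({p["sku"]: p["count"]}))

# Agent A commits its write first, then emits the event
bus.emit("InventoryUpdated", {"sku": "SKU-123", "count": 3})
```

Because Agent B's view is built only from emitted events, it can never observe a half-finished write, and `bus.log` doubles as the audit trail.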

orchestrator/coordinator.py
import time
import uuid
from dataclasses import dataclass, field
from enum import Enum

class LockStatus(Enum):
    ACQUIRED = "acquired"
    DENIED = "denied"
    EXPIRED = "expired"

@dataclass
class ResourceLock:
    resource_id: str
    agent_id: str
    acquired_at: float
    ttl_seconds: float = 30.0
    lock_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    @property
    def is_expired(self) -> bool:
        return time.time() - self.acquired_at > self.ttl_seconds

class AgentCoordinator:
    def __init__(self):
        self.locks: dict[str, ResourceLock] = {}
        self.event_log: list[dict] = []
        self.agent_lock_order: dict[str, list[str]] = {}  # deadlock prevention

    def acquire_lock(self, resource_id: str, agent_id: str, ttl: float = 30.0) -> ResourceLock | None:
        existing = self.locks.get(resource_id)
        if existing and not existing.is_expired:
            if existing.agent_id == agent_id:
                return existing  # re-entrant
            self._emit_event("lock_denied", agent_id=agent_id, resource=resource_id,
                             held_by=existing.agent_id)
            return None

        # Deadlock prevention: enforce global lock ordering
        if not self._check_lock_order(agent_id, resource_id):
            self._emit_event("deadlock_prevented", agent_id=agent_id, resource=resource_id)
            return None

        lock = ResourceLock(resource_id, agent_id, time.time(), ttl)
        self.locks[resource_id] = lock
        # Record the acquisition so _check_lock_order knows what this agent holds
        self.agent_lock_order.setdefault(agent_id, []).append(resource_id)
        self._emit_event("lock_acquired", agent_id=agent_id, resource=resource_id,
                         lock_id=lock.lock_id)
        return lock

    def release_lock(self, lock_id: str, agent_id: str) -> bool:
        for resource_id, lock in self.locks.items():
            if lock.lock_id == lock_id and lock.agent_id == agent_id:
                del self.locks[resource_id]
                held = self.agent_lock_order.get(agent_id, [])
                if resource_id in held:
                    held.remove(resource_id)
                self._emit_event("lock_released", agent_id=agent_id, resource=resource_id)
                return True
        return False

    def _check_lock_order(self, agent_id: str, resource_id: str) -> bool:
        """Prevent deadlocks by enforcing consistent lock acquisition order."""
        held = self.agent_lock_order.get(agent_id, [])
        if held and resource_id < held[-1]:  # must acquire in alphabetical order
            return False
        return True

    def _emit_event(self, event_type: str, **kwargs):
        self.event_log.append({"type": event_type, "ts": time.time(), **kwargs})

Production Checklist

  • ✓ Map every shared resource (databases, APIs, files, caches) and identify which agents have read and write access to each
  • ✓ Implement an orchestrator or coordinator service that manages task assignment and resource locking across agents
  • ✓ Use distributed locks with TTL expiration for all shared-resource write operations. Never rely on agents to release locks voluntarily
  • ✓ Enforce a global lock ordering convention to prevent deadlocks (e.g., always acquire locks in alphabetical resource-ID order)
  • ✓ Adopt event-driven state synchronization: agents emit events on state changes, other agents subscribe rather than polling shared state
  • ✓ Set up deadlock detection that alerts when an agent has been waiting for a lock longer than 2x the expected task duration
  • ✓ Test concurrent access patterns in staging: run 10+ agents simultaneously against shared resources and verify data consistency
  • ✓ Implement idempotent operations for all agent writes. If a task retries after a lock timeout, it should not duplicate side effects
  • ✓ Monitor lock contention metrics: high contention on a single resource indicates an architectural bottleneck that needs redesign
  • ✓ Add a kill switch per agent that the orchestrator can trigger if an agent becomes unresponsive while holding locks

Common Pitfalls

The most common breakdown is treating agent coordination as an application-layer concern when it should be an infrastructure concern. Teams build locking logic inside the agent's prompt or decision loop, for example: "Before writing to the database, check if another agent is writing." This doesn't work. An LLM cannot reliably implement mutual exclusion through natural language reasoning. Coordination must happen in deterministic code outside the LLM, in the orchestrator or tool-execution layer.

Another frequent failure is building multi-agent systems without idempotency. When an agent times out, the orchestrator retries the task. If the agent already performed half the work (wrote 3 of 5 database rows), the retry will duplicate those writes. Every agent operation that modifies state must be idempotent: use upserts instead of inserts, check for existing records before creating new ones, and include idempotency keys in API calls.
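An idempotency key can be as simple as a primary-key constraint plus an upsert. A minimal SQLite sketch, where the `reservations` table and the `order_id` key are hypothetical names chosen for illustration:

```python
import sqlite3

# Hypothetical schema: the idempotency key (order_id) is the primary key,
# so a retry of the same operation cannot create a duplicate row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reservations (order_id TEXT PRIMARY KEY, sku TEXT, qty INTEGER)")

def reserve(order_id: str, sku: str, qty: int) -> None:
    # UPSERT: a retry with the same order_id overwrites instead of duplicating
    conn.execute(
        "INSERT INTO reservations (order_id, sku, qty) VALUES (?, ?, ?) "
        "ON CONFLICT(order_id) DO UPDATE SET sku = excluded.sku, qty = excluded.qty",
        (order_id, sku, qty),
    )

reserve("ord-42", "SKU-123", 3)
reserve("ord-42", "SKU-123", 3)  # retry after a lock timeout: still one row
```

The same idea applies to HTTP APIs: send the key in an `Idempotency-Key` header and have the server dedupe on it before executing the side effect.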

Teams also over-complicate coordination by defaulting to peer-to-peer communication between agents. Agent A sends a message to Agent B, which sends a response to Agent C, which updates Agent A. This creates a distributed system with all the failure modes of distributed systems: message loss, ordering violations, and split-brain scenarios. Start with a centralized orchestrator. Move to peer-to-peer only when you've outgrown the orchestrator's throughput; even then, use a message broker rather than direct agent-to-agent calls.

Terminal Output

terminal
$ clawproof --check 10

  CHECK 10 — Multi-Agent Coordination
  ─────────────────────────────────────────────
  ✓ Shared resources mapped: 4 resources, 6 agents with write access
  ✓ Orchestrator pattern detected: centralized task coordinator
  ✓ Distributed locks with TTL: 30s default, per-resource configurable
  ✓ Global lock ordering enforced (alphabetical resource-ID)
  ✗ FAIL: Agent "data_enricher" writes are not idempotent — uses INSERT, not UPSERT
  ✓ Event-driven sync: 3 event channels, all agents subscribed
  ✗ FAIL: No deadlock detection alerting configured — locks can expire but no alert fires
  ✓ Kill switch available per agent via orchestrator admin API

  Result: 2 issues found — fix idempotency and add deadlock alerting
  Severity: MEDIUM — data corruption risk under concurrent load
$ clawproof --assess
