Computer Architecture Today

Informing the broad computing community about current activities, advances and future directions in computer architecture.

Large language model (LLM) agents are quickly moving from single agents to *multi-agent systems*: tool-using agents, planner-orchestrator pipelines, debate teams, and specialized sub-agents that collaborate to solve tasks. At the same time, the *context* these agents must operate within is becoming more complex: longer histories, multiple modalities, structured traces, and customized environments. This combination creates a bottleneck that looks surprisingly familiar to computer architects: memory.

In computer systems, performance and scalability are often limited not by compute, but by the memory hierarchy, bandwidth, and consistency. Multi-agent systems are heading toward the same wall, except their “memory” is not raw bytes, but semantic context used for reasoning. After spending the past two years building various LLM multi-agent frameworks (e.g., OrcaLoca for software issue localization, MAGE for RTL design, Pro-V for RTL verification, and PettingLLMs for RL training of multiple LLM agents), we would like to share the insights we have gained through the lens of a computer architect. This blog frames multi-agent memory as a computer architecture problem, proposes a simple architecture-inspired model, and highlights the key challenges and protocol gaps that define the road ahead.

While our perspectives are still preliminary and evolving, we hope they serve as a starting point to ignite a broader conversation.


Multi-Agent Memory Systems in Increasingly Complex Contexts

Why memory matters: Context is changing

  • Longer context windows: Long-context evaluation suites like RULER and LongBench show that “real” long-context ability involves more than simple retrieval — it includes multi-hop tracing, aggregation, and sustained reasoning as length scales.
  • Multi-modal inputs: Benchmarks such as MMMU (static images: charts, diagrams, tables) and Video-MME (videos with audio and subtitles) demonstrate that models must handle diverse visual modalities alongside text, extending beyond single-modality processing.
  • Structured data & traces: Text-to-SQL benchmarks (e.g., Spider, BIRD) highlight that agents increasingly operate over structured, executable data — database schemas and generated SQL queries — rather than only raw chat history.
  • Customized environments: In SWE-bench and Multi-SWE-bench, models are evaluated by applying patches to real repositories and running tests in containerized (Docker) environments, making “environment state + execution” part of the memory problem. Similarly, WebArena and OSWorld provide realistic, reproducible interactive environments that stress long-horizon state tracking and grounded actions.

Bottom line: Context is no longer a static prompt — it’s a dynamic, multi-format, partially persistent memory system.


Basic Prototypes: Shared vs. Distributed Agent Memory

Before we talk about “hierarchies,” it helps to name the two simplest prototypes, which mirror classical memory systems.

1) Shared Memory

All agents access a shared memory pool (e.g., a shared vector store, shared document database).

  • Pros: Easy to share knowledge; fast reuse.
  • Cons: Requires coherence support. Without coordination, agents overwrite each other, read stale info, or rely on inconsistent versions of shared facts.

2) Distributed Memory

Each agent owns local memory (local scratchpad, local cache, local long-term store) and shares via synchronization.

  • Pros: Isolation by default; more scalable; fewer contention issues.
  • Cons: Needs explicit synchronization; state divergence becomes common unless carefully managed.

Most real systems sit somewhere in between: local working memory plus selectively shared artifacts.
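To make that hybrid pattern concrete, here is a minimal Python sketch; the class and method names (SharedStore, AgentMemory, publish, share) are hypothetical rather than drawn from any particular framework.

```python
from dataclasses import dataclass, field


@dataclass
class SharedStore:
    """Shared memory pool visible to every agent (e.g., a shared document store)."""
    artifacts: dict = field(default_factory=dict)

    def publish(self, key: str, value: str) -> None:
        self.artifacts[key] = value

    def read(self, key: str):
        return self.artifacts.get(key)


@dataclass
class AgentMemory:
    """Distributed (per-agent) memory plus a handle to the shared store."""
    agent_id: str
    shared: SharedStore
    scratchpad: list = field(default_factory=list)   # local working memory

    def remember(self, note: str) -> None:
        self.scratchpad.append(note)                  # private by default

    def share(self, key: str, note: str) -> None:
        self.shared.publish(f"{self.agent_id}/{key}", note)  # selectively shared artifact


# Usage: a planner shares its plan; a worker reads it without seeing the planner's scratchpad.
store = SharedStore()
planner, worker = AgentMemory("planner", store), AgentMemory("worker", store)
planner.remember("considered three decompositions; two were dead ends")   # stays local
planner.share("plan", "1) localize bug  2) write patch  3) run tests")    # shared
print(worker.shared.read("planner/plan"))
```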


An Agent Memory Architecture Inspired by Modern Computer Architecture Design

Computer architecture teaches a practical lesson: you don’t build “one memory.” You build a memory hierarchy with different layers optimized for latency, bandwidth, capacity, and persistence.

A useful mapping for agents is:

Agent I/O Layer

What it is: Interfaces that ingest and emit information.

  • Audio/speech
  • Text documents
  • Images
  • Network calls/web data

Analogy: Devices and I/O subsystems feeding the CPU.

Agent Cache Layer

What it is: Fast, limited-capacity memory optimized for immediate reasoning.

  • Compressed context
  • Recent trajectories and tool calls
  • Short-term latent storage (e.g., KV cache, embeddings of recent steps)

Analogy: CPU caches (L1/L2/L3), which are small, fast, and constantly refreshed.

Agent Memory Layer

What it is: Large-capacity, slower memory optimized for retrieval and persistence.

  • Full dialogue history
  • External knowledge databases (vector DBs, graph DBs, document stores)
  • Long-term latent storage

Analogy: Main memory + storage hierarchy.

This framing emphasizes a key principle: Agent performance is an end-to-end data movement problem. Even if the model is powerful, if relevant information is stuck in the wrong layer (or never loaded), reasoning accuracy and efficiency degrade.

And just like in hardware, caching is not optional: as in computer memory hierarchies, agent memory benefits from dedicated I/O and caching layers to improve efficiency and scalability.
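As a toy illustration of this hierarchy (not any framework's actual API; all names below are hypothetical), a small cache layer with LRU eviction can sit in front of a larger long-term store, with the I/O layer writing into both:

```python
from collections import OrderedDict


class AgentCache:
    """Cache layer: small, fast, constantly refreshed (LRU eviction)."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.entries = OrderedDict()

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)      # evict the least recently used entry

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)
        return self.entries[key]


class AgentMemoryHierarchy:
    """I/O layer feeds the cache; misses fall back to the large, slower memory layer."""

    def __init__(self):
        self.cache = AgentCache()
        self.long_term = {}                       # stands in for a vector/graph/document store

    def ingest(self, key, payload):
        # I/O layer: new observations land in both the cache and long-term memory.
        self.cache.put(key, payload)
        self.long_term[key] = payload

    def recall(self, key):
        hit = self.cache.get(key)
        if hit is not None:
            return hit                            # fast path
        value = self.long_term.get(key)           # slow path: fetch from the memory layer
        if value is not None:
            self.cache.put(key, value)            # promote for future reuse
        return value
```

The point is not the data structures but the data movement: recall decides which layer pays the latency cost, just as a hardware cache hierarchy does.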


Protocol Extensions for Multi-Agent Scenarios

Architecture layers need protocols. In multi-agent settings, protocols determine what can be shared, how fast, and under what rules.

Today, many agent frameworks rely on MCP (Model Context Protocol) as a connectivity layer. Agents registered via MCP can connect and communicate, but inter-agent bandwidth remains limited by message-passing. MCP largely uses JSON-RPC, so it’s best viewed as a protocol for agent context I/O: request/response, tool invocation, and structured messages.
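For concreteness, an MCP tool invocation is essentially a JSON-RPC 2.0 request/response pair; the sketch below shows the rough message shape as Python dictionaries. The tool name and arguments are hypothetical, and exact fields vary by MCP revision.

```python
# Rough shape of an MCP-style tool call over JSON-RPC 2.0.
# The tool name and arguments are illustrative, not from a real server.
request = {
    "jsonrpc": "2.0",
    "id": 42,
    "method": "tools/call",
    "params": {
        "name": "search_repo",                                    # hypothetical tool
        "arguments": {"query": "null pointer in scheduler"},
    },
}

response = {
    "jsonrpc": "2.0",
    "id": 42,
    "result": {"content": [{"type": "text", "text": "3 matching files: ..."}]},
}
```

Everything an agent learns this way has to be serialized into messages like these, which is exactly why richer cache- and memory-level protocols are worth asking for.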

That’s necessary — but not sufficient.

Missing Piece 1: Agent Cache Sharing Protocol

Many recent studies, such as DroidSpeak and Cache-to-Cache, have explored KV-cache sharing between LLMs. However, we still lack a principled and unified protocol for sharing cached artifacts across agents.

Goal: Enable one agent’s cached results to be transformed and reused by other agents.

In architecture terms, this is like enabling cache transfers or shared cache behavior — except the payload is semantic and may require transformation before reuse.
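A minimal sketch of what such a cache-sharing primitive could look like is below; all names are hypothetical, and whereas KV-cache sharing (as in DroidSpeak) operates on model internals, this sketch shares semantic artifacts that get transformed before reuse.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CachedArtifact:
    """A cached result plus enough metadata for another agent to decide whether to reuse it."""
    producer: str        # which agent produced it
    task: str            # what it was computed for
    payload: str         # e.g., a compressed context or retrieval summary
    version: int         # bumped whenever the producer refreshes it


def share_cached(artifact: CachedArtifact,
                 consumer_task: str,
                 transform: Callable[[str, str], str]) -> str:
    """Hypothetical cache-sharing primitive: the payload is transformed
    (re-summarized, re-projected) into the consumer's context before reuse."""
    return transform(artifact.payload, consumer_task)


# Usage: a localization agent's cached evidence is adapted for a patch-writing agent.
evidence = CachedArtifact(producer="localizer", task="find the faulty function",
                          payload="scheduler.c: queue_pop drops the lock on the error path",
                          version=3)
adapted = share_cached(evidence, "write a patch",
                       transform=lambda payload, task: f"[for: {task}] evidence: {payload}")
print(adapted)
```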

Missing Piece 2: Agent Memory Access Protocol

Although frameworks like Letta and Mem0 support shared state within agent memory, a protocol that defines how agents read and write each other’s memory is still missing.

Goal: Define memory access semantics: permissions, scope, and granularity.

Key questions:

  • Can Agent B read Agent A’s long-term memory, or only shared memory?
  • Is access read-only, append-only, or read-write?
  • What is the unit of access: a document, a chunk, a key-value record, a “thought,” a trace segment?
  • Can we support “agent RDMA”-like patterns: low-latency direct access to remote memory without expensive message-level serialization?

Without a memory access protocol, inter-agent collaboration is forced into slow, high-level message passing, which wastes bandwidth and loses structure.
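As a strawman for what such access semantics could look like (the class names and policy below are illustrative assumptions, not a concrete proposal), one can imagine each agent's memory server enforcing grants that encode permissions, scope, and granularity:

```python
from dataclasses import dataclass, field
from enum import Enum


class AccessMode(Enum):
    READ_ONLY = "read-only"
    APPEND_ONLY = "append-only"
    READ_WRITE = "read-write"


@dataclass
class MemoryGrant:
    """Who may touch which region of another agent's memory, how, and at what granularity."""
    grantee: str         # agent receiving access
    scope: str           # e.g., "shared/plans" vs. "private/long_term"
    mode: AccessMode
    unit: str            # "document", "chunk", "record", "trace-segment"


@dataclass
class AgentMemoryServer:
    owner: str
    records: dict = field(default_factory=dict)    # scope -> list of records
    grants: list = field(default_factory=list)

    def _allowed(self, agent: str, scope: str, need: AccessMode) -> bool:
        for g in self.grants:
            if g.grantee != agent or not scope.startswith(g.scope):
                continue
            if g.mode == AccessMode.READ_WRITE or g.mode == need:
                return True
        return False

    def read(self, agent: str, scope: str):
        if not self._allowed(agent, scope, AccessMode.READ_ONLY):
            raise PermissionError(f"{agent} may not read {scope} owned by {self.owner}")
        return list(self.records.get(scope, []))

    def append(self, agent: str, scope: str, record: str) -> None:
        if not self._allowed(agent, scope, AccessMode.APPEND_ONLY):
            raise PermissionError(f"{agent} may not append to {scope} owned by {self.owner}")
        self.records.setdefault(scope, []).append(record)


# Usage: Agent B may append to A's shared plan log but cannot read A's private long-term memory.
a_mem = AgentMemoryServer(owner="A",
                          grants=[MemoryGrant("B", "shared/plans", AccessMode.APPEND_ONLY, "record")])
a_mem.append("B", "shared/plans", "step 3: rerun the failing test")
```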


The Next Frontier: Multi-Agent Memory Consistency

The largest conceptual gap is consistency. In computer architecture and systems design, memory consistency defines constraints on the apparent ordering of reads and writes to memory. Consistency models (e.g., sequential consistency, TSO, and release consistency) clarify which behaviors programmers can rely on.

For agent memory, the goal shifts: It’s not about bytes at an address, but about maintaining a coherent semantic context that supports correct reasoning and coordination.

Why Agent Consistency Is Harder

  • The “state” is not a scalar value; it’s a plan, a summary, a retrieval result, a tool trace.
  • Writes are not deterministic; they may be speculative or wrong.
  • Conflicts aren’t simple write-write conflicts — they’re semantic contradictions.
  • Freshness depends on the environment state (repo version, API results, and permissions).

What a Multi-Agent Memory Consistency Layer Might Need

A practical direction is to define consistency around the artifacts agents actually share — cached evidence, tool traces, plans, and long-term records — across both shared and distributed memory setups (often a hybrid: local caches plus a shared store). Such a layer should:

  • Expose a consistency model (e.g., session, causal, or eventual semantic consistency, with stronger guarantees for “committed” outputs).
  • Provide richer communication primitives than plain message passing.
  • Include conflict-resolution policies (source ranking, timestamps, consensus, and optional human intervention for high-stakes conflicts).
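To make the idea tangible, here is a toy Python sketch of such a layer; the class names and the specific resolution policy (committed beats source rank beats recency) are illustrative assumptions, not a proposal for the actual semantics.

```python
from dataclasses import dataclass
import time


@dataclass
class SemanticWrite:
    """One agent's claim about a shared fact, plus metadata the consistency layer can reason over."""
    key: str                 # e.g., "repo/HEAD" or "plan/step-2/status"
    value: str
    author: str
    source_rank: int         # higher = more trusted (e.g., tool output > model guess)
    timestamp: float
    committed: bool = False  # "committed" outputs get stronger guarantees


class ConsistencyLayer:
    """Toy conflict-resolution policy: committed > source rank > recency."""

    def __init__(self):
        self.log = {}                      # key -> list of SemanticWrite

    def write(self, w: SemanticWrite) -> None:
        self.log.setdefault(w.key, []).append(w)

    def read(self, key: str):
        candidates = self.log.get(key, [])
        if not candidates:
            return None
        return max(candidates, key=lambda w: (w.committed, w.source_rank, w.timestamp))

    def conflicts(self, key: str):
        """Surface semantic contradictions for escalation (e.g., human review)."""
        writes = self.log.get(key, [])
        return writes if len({w.value for w in writes}) > 1 else []


# Usage: the tester's committed tool result outranks the planner's optimistic guess.
layer = ConsistencyLayer()
layer.write(SemanticWrite("tests/passing", "false", "tester", source_rank=3,
                          timestamp=time.time(), committed=True))
layer.write(SemanticWrite("tests/passing", "true", "planner", source_rank=1,
                          timestamp=time.time()))
print(layer.read("tests/passing").value)      # -> "false"
print(len(layer.conflicts("tests/passing")))  # -> 2 (contradiction detected)
```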

Research on this is still rare, but it is likely to become foundational — much like coherence and consistency were for multiprocessors.

Conclusion

Many agent memory systems today resemble human memory — informal, redundant, and hard to control — leaving a large opportunity for computer architecture researchers to rethink what “memory” should mean for agents at scale. To move from ad-hoc prompting to reliable multi-agent systems, we need better memory hierarchies, explicit protocols for cache sharing and memory access, and principled consistency models that keep shared context coherent.

Acknowledgement

We sincerely thank Wentao Ni, Hejia Zhang, Mingrui Yin, Jiaying Yang, and Yujie Zhao for their invaluable contributions through brainstorming, discussions, data collection, and survey work over the past few months. This article would not have been possible without their dedicated efforts.

About the authors:

Zhongming Yu is a PhD student in the Computer Science and Engineering Department at the University of California, San Diego. His research interests are in combining machine learning and computer systems, with a special focus on LLM agent systems for machine learning systems, evolving ML and systems, and autonomous software engineering.

Jishen Zhao is a Professor in the Computer Science and Engineering Department at the University of California, San Diego. Her research spans and stretches the boundary across computer architecture, system software, and machine learning, with an emphasis on memory systems, machine learning and systems codesign, and system support for smart applications.

Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.