
A Visual Guide to LLM Agents

Exploring the main components of Single- and Multi-Agents


LLM Agents are becoming widespread, seemingly taking over from the “regular” conversational LLMs we are familiar with. These incredible capabilities are not easily created and require many components working in tandem.


With over 60 custom visuals in this post, you will explore the field of LLM Agents, their main components, and Multi-Agent frameworks.


To understand what LLM Agents are, let us first explore the basic capabilities of an LLM. Traditionally, an LLM does nothing more than next-token prediction.


By sampling many tokens in a row, we can mimic conversations and use the LLM to give more extensive answers to our queries.


However, when we continue the “conversation”, any given LLM will showcase one of its main disadvantages: it does not remember conversations!


There are many other tasks that LLMs often fail at, including basic math like multiplication and division:


Does this mean LLMs are horrible? Definitely not! There is no need for LLMs to be capable of everything, as we can compensate for their disadvantages through external tools, memory, and retrieval systems.

Through external systems, the capabilities of the LLM can be enhanced. Anthropic calls this “The Augmented LLM”.


For instance, when faced with a math question, the LLM may decide to use the appropriate tool (a calculator).


So is this “Augmented LLM” then an Agent? No, and maybe a bit yes…

Let’s start with a definition of Agents:1

An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators.

— Russell & Norvig, AI: A Modern Approach (2016)

Agents interact with their environment and typically consist of several important components:

  • Environments — The world the agent interacts with
  • Sensors — Used to observe the environment
  • Actuators — Tools used to act upon the environment
  • Agent program — The “brain” or rules deciding how to go from observations to actions


This framework is used for all kinds of agents that interact with all kinds of environments, like robots interacting with their physical environment or AI agents interacting with software.

We can generalize this framework a bit to make it suitable for the “Augmented LLM” instead.


Using the “Augmented” LLM, the Agent can observe the environment through textual input (as LLMs are generally textual models) and perform certain actions through its use of tools (like searching the web).

To select which actions to take, the LLM Agent has a vital component: its ability to plan. For this, LLMs need to be able to “reason” and “think” through methods like chain-of-thought.


For more information about reasoning, check out The Visual Guide to Reasoning LLMs

Using this reasoning behavior, LLM Agents will plan out the necessary actions to take.


This planning behavior allows the Agent to understand the situation (LLM), plan next steps (planning), take actions (tools), and keep track of the taken actions (memory).


Depending on the system, you can design LLM Agents with varying degrees of autonomy.


Depending on who you ask, a system is more “agentic” the more the LLM decides how the system can behave.

In the next sections, we will go through various methods of autonomous behavior through the LLM Agent’s three main components: Memory, Tools, and Planning.

LLMs are forgetful systems; more accurately, they do not perform any memorization at all when you interact with them.

For instance, when you ask an LLM a question and then follow it up with another question, it will not remember the former.


We typically refer to this as short-term memory, also called working memory, which functions as a buffer for the (near-)immediate context. This includes recent actions the LLM Agent has taken.

However, the LLM Agent also needs to keep track of potentially dozens of steps, not only the most recent actions.


This is referred to as long-term memory as the LLM Agent could theoretically take dozens or even hundreds of steps that need to be memorized.


Let’s explore several tricks for giving these models memory.

The most straightforward method for enabling short-term memory is to use the model’s context window, which is essentially the number of tokens an LLM can process.


The context window tends to be at least 8192 tokens and sometimes can scale up to hundreds of thousands of tokens!

A large context window can be used to track the full conversation history as part of the input prompt.


This works as long as the conversation history fits within the LLM’s context window and is a nice way of mimicking memory. However, instead of actually memorizing a conversation, we essentially “tell” the LLM what that conversation was.

For models with a smaller context window, or when the conversation history is large, we can instead use another LLM to summarize the conversations that happened thus far.


By continuously summarizing conversations, we can keep the size of the conversation history small, reducing the number of tokens while retaining only the most vital information.
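
A rough sketch of how these two tricks combine: keep a rolling buffer of recent turns and, once it outgrows a budget, fold the oldest turns into a running summary with another LLM call. The `llm` stub and the character budget below are illustrative stand-ins:

```python
# Sketch: short-term memory as a rolling buffer plus summarization.
# `llm` is a stand-in for a real model call (any API client works).

def llm(prompt: str) -> str:
    return "<model response>"  # replace with an actual model call

class ConversationMemory:
    def __init__(self, max_chars: int = 4000):
        self.summary = ""           # compressed older history
        self.turns: list[str] = []  # recent raw turns
        self.max_chars = max_chars

    def add(self, role: str, text: str) -> None:
        self.turns.append(f"{role}: {text}")
        # When the raw history outgrows the budget, fold the oldest
        # half of the turns into the running summary.
        while sum(len(t) for t in self.turns) > self.max_chars:
            old = self.turns[: max(1, len(self.turns) // 2)]
            self.turns = self.turns[len(old):]
            self.summary = llm(
                "Summarize, keeping the most vital information:\n"
                + self.summary + "\n" + "\n".join(old)
            )

    def prompt(self, user_msg: str) -> str:
        # The model only "remembers" what we paste back into the prompt.
        return (f"Summary so far: {self.summary}\n"
                + "\n".join(self.turns) + f"\nuser: {user_msg}")
```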

Long-term memory in LLM Agents includes the agent’s past action space that needs to be retained over an extended period.

A common technique to enable long-term memory is to store all previous interactions, actions, and conversations in an external vector database.

To build such a database, conversations are first embedded into numerical representations that capture their meaning.


After building the database, we can embed any given prompt and find the most relevant information in the vector database by comparing the prompt embedding with the database embeddings.


This method is often referred to as Retrieval-Augmented Generation (RAG).
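
A minimal sketch of this embed-and-retrieve loop, with a toy bag-of-words embedding standing in for a real embedding model and a plain list standing in for a vector database:

```python
# Sketch: long-term memory as embed-and-retrieve (the core of RAG).
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: word counts. Use a real embedding model in practice.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

memory: list[tuple[Counter, str]] = []  # a vector database in practice

def store(text: str) -> None:
    memory.append((embed(text), text))

def retrieve(query: str, k: int = 3) -> list[str]:
    # Compare the query embedding against every stored embedding.
    q = embed(query)
    ranked = sorted(memory, key=lambda m: cosine(q, m[0]), reverse=True)
    return [text for _, text in ranked[:k]]

store("The user prefers vegetarian recipes.")
store("The user's favorite language is Python.")
print(retrieve("What food does the user like?", k=1))
```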

Long-term memory can also involve retaining information from different sessions. For instance, you might want an LLM Agent to remember any research it has done in previous sessions.

Different types of information can also map to different types of memory in which they are stored. In psychology, there are numerous types of memory to differentiate, but the Cognitive Architectures for Language Agents paper coupled four of them to LLM Agents.2


This differentiation helps in building agentic frameworks. Semantic memory (facts about the world) might be stored in a different database than working memory (current and recent circumstances).
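
In code, this separation can be as simple as backing each memory type with its own store. The mapping below loosely follows the paper’s four types; the store choices are assumptions for illustration:

```python
# Sketch: one store per memory type, loosely following the four types the
# Cognitive Architectures for Language Agents paper couples to LLM Agents.
agent_memory: dict[str, list[str]] = {
    "working":    [],  # current context: recent messages and actions
    "episodic":   [],  # past experiences: earlier sessions and trajectories
    "semantic":   [],  # facts about the world, e.g., a RAG-style database
    "procedural": [],  # how-to knowledge: prompts, rules, tool code
}

agent_memory["semantic"].append("Paris is the capital of France.")
agent_memory["working"].append("User just asked about European capitals.")
```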

Tools allow a given LLM to either interact with an external environment (such as databases) or use external applications (such as custom code to run).


Tools generally have two use cases: fetching data to retrieve up-to-date information and taking action like setting a meeting or ordering food.

To actually use a tool, the LLM has to generate text that fits with the API of the given tool. We typically expect strings that can be formatted as JSON so that the call can easily be fed to a code interpreter.


Note that this is not limited to JSON; we can also generate the tool call directly in code!

You can also generate custom functions that the LLM can use, like a basic multiplication function. This is often referred to as function calling.
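
To make this concrete, here is a hedged sketch of the general function-calling pattern: describe the tool with a JSON schema, let the LLM emit a JSON call, then parse and dispatch it. The schema layout and the model output are illustrative, not any particular vendor’s format:

```python
# Sketch: generic function calling with a JSON schema and dispatch.
import json

def multiply(a: float, b: float) -> float:
    return a * b

TOOLS = {"multiply": multiply}

tool_schema = {  # sent to the LLM so it knows how to call the tool
    "name": "multiply",
    "description": "Multiply two numbers.",
    "parameters": {
        "type": "object",
        "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
        "required": ["a", "b"],
    },
}

# Pretend the LLM, given the schema and "What is 5 times 3?",
# generated this JSON string:
llm_output = '{"name": "multiply", "arguments": {"a": 5, "b": 3}}'

call = json.loads(llm_output)
result = TOOLS[call["name"]](**call["arguments"])
print(result)  # 15; fed back to the LLM to phrase the final answer
```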


Some LLMs can use any tool if they are prompted correctly and extensively. Tool use is something that most current LLMs are capable of.


A more stable method for accessing tools is by fine-tuning the LLM (more on that later!).

Tools can either be used in a given order if the agentic framework is fixed…


…or the LLM can autonomously choose which tool to use and when. LLM Agents are essentially sequences of LLM calls, but with autonomous selection of actions, tools, and so on.


In other words, the output of intermediate steps is fed back into the LLM to continue processing.


Tool use is a powerful technique for strengthening LLMs’ capabilities and compensating for their disadvantages. As such, research efforts on tool use and learning have seen a rapid surge in the last few years.

Annotated and cropped figure from the “Tool Learning with Large Language Models: A Survey” paper. With an increasing focus on tool use, (Agentic) LLMs are expected to become more powerful.

Much of this research involves not only prompting LLMs for tool use but also training them specifically for it.

One of the first techniques to do so is called Toolformer, a model trained to decide which APIs to call and how.3

It does so by using the “[” and “]” tokens to indicate the start and end of a tool call. When given a prompt, for example “What is 5 times 3?”, it starts generating tokens until it reaches the “[” token.


After that, it generates tokens until it reaches the “→” token, which indicates that the LLM should stop generating tokens.


Then, the tool will be called, and the output will be added to the tokens generated thus far.


The ] symbol indicates that the LLM can now continue generating if necessary.
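
A rough sketch of what executing such an inline call could look like: detect a generated `[Tool(args)` span, run the tool, and splice `→ result]` back in so generation can continue. The regex and toy calculator are illustrative, not Toolformer’s actual implementation:

```python
# Sketch: executing a Toolformer-style inline call such as "[Calculator(5 * 3)".
import re

TOOLS = {"Calculator": lambda expr: eval(expr)}  # toy only; never eval untrusted input

def execute_inline_call(generated: str) -> str:
    # Match e.g. "[Calculator(5 * 3)" at the end of the generation.
    m = re.search(r"\[(\w+)\((.*?)\)$", generated)
    if not m:
        return generated
    tool, args = m.group(1), m.group(2)
    result = TOOLS[tool](args)
    # Append "→ result]" so the LLM can continue generating from here.
    return f"{generated}→ {result}]"

print(execute_inline_call("5 times 3 is [Calculator(5 * 3)"))
# 5 times 3 is [Calculator(5 * 3)→ 15]
```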

Toolformer creates this behavior by carefully generating a dataset with many tool uses the model can train on. For each tool, a few-shot prompt is manually created and used to sample outputs that use these tools.


The outputs are filtered based on the correctness of the tool use, its output, and whether it decreases the loss. The resulting dataset is used to train an LLM to adhere to this format of tool use.

Since the release of Toolformer, there have been many exciting techniques such as LLMs that can use thousands of tools (ToolLLM 4) or LLMs that can easily retrieve the most relevant tools (Gorilla 5).

Either way, most current LLMs (beginning of 2025) have been trained to call tools easily through JSON generation (as we saw before).

Tools are an important component of Agentic frameworks, allowing LLMs to interact with the world and extend their capabilities. However, enabling tool use when you have many different APIs becomes troublesome as any tool needs to be:

  • Manually tracked and fed to the LLM
  • Manually described (including its expected JSON schema)
  • Manually updated whenever its API changes


To make tools easier to implement for any given Agentic framework, Anthropic developed the Model Context Protocol (MCP).6 MCP standardizes API access for services like weather apps and GitHub.

It consists of three components:

  • MCP Host — LLM application (such as Cursor) that manages connections
  • MCP Client — Maintains 1:1 connections with MCP servers
  • MCP Server — Provides context, tools, and capabilities to the LLMs


For example, let’s assume you want a given LLM application to summarize the 5 latest commits from your repository.

The MCP Host (together with the client) would first call the MCP Server to ask which tools are available.


The LLM receives the information and may choose to use a tool. It sends a request to the MCP Server via the Host, then receives the results, including the tool used.


Finally, the LLM receives the results and can formulate an answer for the user.


This framework makes tools easier to create and share through MCP Servers that any LLM application can use. So when you create an MCP Server to interact with GitHub, any LLM application that supports MCP can use it.
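
As an illustration, the official MCP Python SDK provides a `FastMCP` helper that turns a decorated function into an MCP tool. The sketch below assumes that SDK’s API as of early 2025 (verify the exact names against its documentation), and the commit-fetching logic is a hypothetical stub:

```python
# Sketch: a tiny MCP server exposing one tool, assuming the official
# Python SDK's FastMCP helper (check the SDK docs for the current API).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("commit-summarizer")

@mcp.tool()
def latest_commits(repo: str, n: int = 5) -> list[str]:
    """Return the n latest commit messages for a repository."""
    # Hypothetical stub; a real server would call the GitHub API here.
    return [f"{repo}: commit message {i}" for i in range(1, n + 1)]

if __name__ == "__main__":
    mcp.run()  # any MCP-capable host can now discover and call this tool
```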

Tool use allows an LLM to increase its capabilities; tools are typically called using JSON-like requests.

But how does the LLM, in an agentic system, decide which tool to use and when?

This is where planning comes in. Planning in LLM Agents involves breaking a given task up into actionable steps.


This plan allows the model to iteratively reflect on past behavior and update the current plan if necessary.


I love it when a plan comes together!

To enable planning in LLM Agents, let’s first look at the foundation of this technique, namely reasoning.

Planning actionable steps requires complex reasoning behavior. As such, the LLM must be able to showcase this behavior before taking the next step in planning out the task.

“Reasoning” LLMs are those that tend to “think” before answering a question.


This reasoning behavior can be enabled in roughly two ways: fine-tuning the LLM or specific prompt engineering.

With prompt engineering, we can create examples of the reasoning process that the LLM should follow. Providing examples (also called few-shot prompting 7) is a great method for steering the LLM’s behavior.


This methodology of providing examples of thought processes is called Chain-of-Thought and enables more complex reasoning behavior.8

Chain-of-thought can also be enabled without any examples (zero-shot prompting) by simply stating “Let’s think step-by-step”.9
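
In its simplest form, zero-shot chain-of-thought is a one-line addition to the prompt:

```python
# Sketch: zero-shot chain-of-thought is just a prompt suffix.
question = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 "
            "more than the ball. How much does the ball cost?")
prompt = question + "\nLet's think step by step."
# Send `prompt` to any LLM; the suffix elicits intermediate reasoning steps.
```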


When training an LLM, we can either give it sufficient data that includes thought-like examples, or the LLM can discover its own thinking process.

A great example is DeepSeek-R1, where rewards are used to guide the use of thinking processes.


For more information about Reasoning LLMs, see my visual guide.

Enabling reasoning behavior in LLMs is great but does not necessarily make them capable of planning actionable steps.

The techniques we focused on thus far either showcase reasoning behavior or interact with the environment through tools.


Chain-of-Thought, for instance, is focused purely on reasoning.

One of the first techniques to combine both processes is called ReAct (Reason and Act).10


ReAct does so through careful prompt engineering. The ReAct prompt describes three steps:

  • Thought - A reasoning step about the current situation
  • Action - A set of actions to execute (e.g., tools)
  • Observation - The result of the executed action (e.g., a tool’s output), fed back to the model

The prompt itself is then quite straightforward.


The LLM uses this prompt (which can be used as a system prompt) to steer its behaviors to work in cycles of thoughts, actions, and observations.


It continues this behavior until an action specifies to return the result. By iterating over thoughts and observations, the LLM can plan out actions, observe its output, and adjust accordingly.
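
A minimal sketch of such a loop, with a stubbed model call and a single toy tool; the prompt wording only approximates the ReAct style rather than reproducing the paper’s prompt:

```python
# Sketch: a ReAct-style thought/action/observation loop.
def llm(transcript: str) -> str:
    return "Action: finish[Paris]"  # stand-in for a real model call

TOOLS = {"search": lambda q: f"Top result for {q!r}"}

def react(question: str, max_steps: int = 5) -> str:
    transcript = (
        "Answer the question using this cycle:\n"
        "Thought: reason about the current situation\n"
        "Action: search[query] or finish[answer]\n"
        "Observation: result of the action\n\n"
        f"Question: {question}\n"
    )
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        if "finish[" in step:  # the agent decided to return a result
            return step.split("finish[")[1].rstrip("]")
        if "search[" in step:  # run the tool, feed the observation back
            query = step.split("search[")[1].rstrip("]")
            transcript += f"Observation: {TOOLS['search'](query)}\n"
    return "no answer found"

print(react("What is the capital of France?"))  # Paris
```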

As such, this framework enables LLMs to demonstrate more autonomous agentic behavior compared to Agents with predefined and fixed steps.

Nobody, not even LLMs with ReAct, will perform every task perfectly. Failing is part of the process as long as you can reflect on that process.

This process is missing from ReAct and is where Reflexion comes in. Reflexion is a technique that uses verbal reinforcement to help agents learn from prior failures.11

The method assumes three LLM roles:

  • Actor — Chooses and executes actions based on state observations. We can use methods like Chain-of-Thought or ReAct.
  • Evaluator — Scores the outputs produced by the Actor.
  • Self-reflection — Reflects on the actions taken by the Actor and the scores generated by the Evaluator.


Memory modules are added to track actions (short-term) and self-reflections (long-term), helping the Agent learn from its mistakes and identify improved actions.
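
A compact sketch of how the three roles and the reflection memory could fit together; all prompts and the `llm` stub are illustrative, not Reflexion’s actual implementation:

```python
# Sketch: Reflexion's Actor / Evaluator / Self-reflection loop, with
# the trajectory as short-term memory and reflections as long-term memory.
def llm(prompt: str) -> str:
    return "<model response>"  # stand-in for a real model call

def reflexion(task: str, max_trials: int = 3) -> str:
    reflections: list[str] = []  # long-term memory of lessons learned
    trajectory = ""
    for _ in range(max_trials):
        # Actor: attempt the task, conditioned on earlier reflections.
        trajectory = llm(f"Task: {task}\nLessons: {reflections}\nAttempt:")
        # Evaluator: score the attempt (here, a model-based judgment).
        score = llm(f"Score this attempt from 0-10:\n{trajectory}")
        if score.strip().startswith(("8", "9", "10")):
            return trajectory
        # Self-reflection: turn the failure and score into a stored lesson.
        reflections.append(
            llm(f"Why did this fail, and what should change?\n{trajectory}")
        )
    return trajectory
```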

A similar and elegant technique is called SELF-REFINE, where actions of refining output and generating feedback are repeated.12


The same LLM is in charge of generating the initial output, the refined output, and feedback.
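
In pseudocode the loop is tiny; the prompts and the `llm` stub are again illustrative:

```python
# Sketch: the SELF-REFINE loop; one LLM generates the output, the
# feedback, and the refinement.
def llm(prompt: str) -> str:
    return "<model response>"  # stand-in for a real model call

def self_refine(task: str, steps: int = 3) -> str:
    output = llm(f"Task: {task}\nDraft an answer:")
    for _ in range(steps):
        feedback = llm(f"Give actionable feedback on this answer:\n{output}")
        output = llm(f"Task: {task}\nAnswer: {output}\n"
                     f"Feedback: {feedback}\nRevised answer:")
    return output
```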

Annotated figure from the “SELF-REFINE: Iterative Refinement with Self-Feedback” paper.

Interestingly, this self-reflective behavior, both Reflexion and SELF-REFINE, closely resembles that of reinforcement learning where a reward is given based on the quality of the output.

The single Agent we explored has several potential issues: too many tools may complicate selection, the context may become too complex, and the task may require specialization.

Instead, we can look towards Multi-Agents: frameworks where multiple agents (each with access to tools, memory, and planning) interact with each other and their environments:


These Multi-Agent systems usually consist of specialized Agents, each equipped with their own toolset and overseen by a supervisor. The supervisor manages communication between Agents and can assign specific tasks to the specialized Agents.


Each Agent might have different types of tools available, and they might also use different memory systems.
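
A minimal sketch of this supervisor pattern, with hypothetical agent names and a stubbed routing call:

```python
# Sketch: a supervisor routing a task to specialized agents, each with
# its own toolset. Agent names, tools, and prompts are illustrative.
def llm(prompt: str) -> str:
    return "researcher"  # stand-in: the supervisor LLM picks an agent

AGENTS = {
    "researcher": {"tools": ["web_search"], "system": "You find information."},
    "coder":      {"tools": ["python"],     "system": "You write code."},
}

def supervisor(task: str) -> str:
    # The supervisor decides which specialist should handle the task.
    choice = llm(f"Agents: {list(AGENTS)}. Pick one for: {task}").strip()
    agent = AGENTS.get(choice, AGENTS["researcher"])
    # The chosen agent would now run its own LLM loop with its tools.
    return f"Dispatched {task!r} to {choice} (tools: {agent['tools']})"

print(supervisor("Find the latest LLM agent papers"))
```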

In practice, there are dozens of Multi-Agent architectures with two components at their core:

  • Agent Initialization — How are individual (specialized) Agents created?
  • Agent Orchestration — How are all Agents coordinated?


Let’s explore various interesting Multi-Agent frameworks and highlight how these components are implemented.

Arguably one of the most influential, and frankly incredibly cool, Multi-Agent papers is called “Generative Agents: Interactive Simulacra of Human Behavior”.13

In this paper, they created computational software agents that simulate believable human behavior, which they call Generative Agents.


The profile each Generative Agent is given makes them behave in unique ways and helps create more interesting and dynamic behavior.

Each Agent is initialized with three modules (memory, planning, and reflection) very much like the core components that we have seen previously with ReAct and Reflexion.


The Memory module is one of the most vital components in this framework. It stores the planning and reflection behaviors, as well as all events thus far.

For any given next step or question, memories are retrieved and scored on their recency, importance, and relevance. The highest scoring memories are shared with the Agent.
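
The retrieval score can be sketched as a weighted combination of those three signals; the decay rate and the equal weights below are assumptions for illustration, not the paper’s exact values:

```python
# Sketch: scoring memories on recency, importance, and relevance,
# in the spirit of the Generative Agents retrieval function.
import math

def memory_score(age_seconds: float, importance_1_to_10: float,
                 relevance_0_to_1: float) -> float:
    recency = math.exp(-0.0001 * age_seconds)       # newer memories score higher
    importance = importance_1_to_10 / 10            # rated by the LLM itself
    return recency + importance + relevance_0_to_1  # equal weights assumed

# Retrieve the best memory for a query: (age, importance, relevance) triples.
memories = [(3600, 8, 0.9), (86400, 3, 0.2)]
print(max(memories, key=lambda m: memory_score(*m)))  # the recent, relevant one
```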

Annotated figure from the “Generative Agents: Interactive Simulacra of Human Behavior” paper.

Together, these modules allow the Agents to behave freely and interact with one another. As such, there is very little Agent orchestration, as they do not have specific goals to work towards.

Annotated image from the interactive demo.

There are too many amazing snippets of information in this paper, but I want to highlight their evaluation metric.14

Their evaluation involved the believability of the Agent’s behaviors as the main metric, with human evaluators scoring them.

Annotated figure from the “Generative Agents: Interactive Simulacra of Human Behavior” paper.

It showcases how important observation, planning, and reflection together are to the performance of these Generative Agents. As explored before, planning is not complete without reflective behavior.

Whatever framework you choose for creating Multi-Agent systems, they are generally composed of several ingredients, including their profile, perception of the environment, memory, planning, and available actions.15 16


Popular frameworks for implementing these components are AutoGen 17, MetaGPT 18, and CAMEL 19. However, each framework approaches communication between Agents a bit differently.

With CAMEL, for instance, the user first poses their question and defines the AI User and AI Assistant roles. The AI User role represents the human user and will guide the process.


After that, the AI User and AI Assistant will collaborate on resolving the query by interacting with each other.


This role-playing methodology enables collaborative communication between agents.
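
A bare-bones sketch of this turn-taking, with illustrative role prompts and a stubbed model call:

```python
# Sketch: CAMEL-style role play. Two LLM roles alternate turns on a
# shared task; the real framework adds role "inception" prompts and
# termination conditions.
def llm(prompt: str) -> str:
    return "<next message>"  # stand-in for a real model call

def role_play(task: str, turns: int = 4) -> list[str]:
    history: list[str] = []
    roles = [
        ("AI User", "You represent the human and give instructions."),
        ("AI Assistant", "You carry out the AI User's instructions."),
    ]
    for i in range(turns):
        name, system = roles[i % 2]  # alternate between the two roles
        message = llm(f"{system}\nTask: {task}\n"
                      f"Conversation so far: {history}\n{name}:")
        history.append(f"{name}: {message}")
    return history

for line in role_play("Design a simple trading bot"):
    print(line)
```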

AutoGen and MetaGPT have different methods of communication, but it all boils down to this collaborative nature of communication. Agents have opportunities to engage and talk with one another to update their current status, goals, and next steps.

In the last year, and especially the last few weeks, the growth of these frameworks has been explosive.


2025 is going to be a truly exciting year as these frameworks keep maturing and developing!

This concludes our journey through LLM Agents! Hopefully, this post gives you a better understanding of how LLM Agents are built.

To see more visualizations related to LLMs and to support this newsletter, check out the book I wrote on Large Language Models!


Official website of the book. You can order the book on Amazon. All code is uploaded to GitHub.

  1. Russell, S. J., & Norvig, P. (2016). Artificial intelligence: A modern approach. Pearson.

  2. Sumers, Theodore, et al. “Cognitive architectures for language agents.” Transactions on Machine Learning Research (2023).

  3. Schick, Timo, et al. “Toolformer: Language models can teach themselves to use tools.” Advances in Neural Information Processing Systems 36 (2023): 68539-68551.

  4. Qin, Yujia, et al. “ToolLLM: Facilitating large language models to master 16000+ real-world APIs.” arXiv preprint arXiv:2307.16789 (2023).

  5. Patil, Shishir G., et al. “Gorilla: Large language model connected with massive APIs.” Advances in Neural Information Processing Systems 37 (2024): 126544-126565.

  6. “Introducing the Model Context Protocol.” Anthropic, www.anthropic.com/news/model-context-protocol. Accessed 13 Mar. 2025.

  7. Brown, Tom, et al. “Language models are few-shot learners.” arXiv preprint arXiv:2005.14165 (2020).

  8. Wei, Jason, et al. “Chain-of-thought prompting elicits reasoning in large language models.” Advances in neural information processing systems 35 (2022): 24824-24837.

  9. Kojima, Takeshi, et al. “Large language models are zero-shot reasoners.” Advances in neural information processing systems 35 (2022): 22199-22213.

  10. Yao, Shunyu, et al. “ReAct: Synergizing Reasoning and Acting in Language Models.” International Conference on Learning Representations (ICLR), 2023. https://par.nsf.gov/biblio/10451467

  11. Shinn, Noah, et al. “Reflexion: Language agents with verbal reinforcement learning.” Advances in Neural Information Processing Systems 36 (2023): 8634-8652.

  12. Madaan, Aman, et al. “Self-refine: Iterative refinement with self-feedback.” Advances in Neural Information Processing Systems 36 (2023): 46534-46594.

  13. Park, Joon Sung, et al. “Generative agents: Interactive simulacra of human behavior.” Proceedings of the 36th annual acm symposium on user interface software and technology. 2023.

  14. To see a cool interactive playground of the Generative Agents, follow this link: https://reverie.herokuapp.com/arXiv_Demo/

  15. Wang, Lei, et al. “A survey on large language model based autonomous agents.” Frontiers of Computer Science 18.6 (2024): 186345.

  16. Xi, Zhiheng, et al. “The rise and potential of large language model based agents: A survey.” Science China Information Sciences 68.2 (2025): 121101.

  17. Wu, Qingyun, et al. “Autogen: Enabling next-gen llm applications via multi-agent conversation.” arXiv preprint arXiv:2308.08155 (2023).

  18. Hong, Sirui, et al. “MetaGPT: Meta programming for a multi-agent collaborative framework.” arXiv preprint arXiv:2308.00352 (2023).

  19. Li, Guohao, et al. “CAMEL: Communicative agents for “mind” exploration of large language model society.” Advances in Neural Information Processing Systems 36 (2023): 51991-52008.