Context Engineering Explained In 10 Minutes

By the 10xdev team, August 03, 2025

You might have heard a lot about the term 'context engineering' recently. This article explains what it is, why it has emerged, and demonstrates how to implement several different context engineering approaches from scratch using LangGraph.

Why Context Engineering Matters

As AI applications handle more information, they can fail in several ways when their context grows. These failure modes include:

  • Context Poisoning: A hallucination makes its way into the context and is then referenced repeatedly.
  • Context Distraction: The context grows so long that the model over-focuses on the accumulated context rather than drawing on what it learned during training.
  • Context Confusion: Superfluous or irrelevant information in the context degrades the model's responses.
  • Context Clash: Contradictory information accumulates in the context window, leading to incorrect outputs.

To manage these problems, various techniques have been developed. This article will cover the implementation of six distinct methods, in the order they appear below: RAG, context pruning, summarization, context offloading, tool loadout, and context quarantine.

1. Retrieval-Augmented Generation (RAG)

RAG is the act of selectively adding relevant information to help an LLM generate a better response. This technique ensures only information relevant to the task at hand enters the context window of the LLM, which is a cornerstone of many production AI systems.

A Simple RAG Agent Implementation

Let's consider building an agent that can selectively retrieve information from a few different blog posts.

  1. Load and Chunk Data: First, we load the pages and split them into smaller chunks. In RAG systems, we typically chunk our context into blocks and retrieve these blocks based on semantic similarity to load into the LLM's context window.
  2. Create a Vector Store: We then create an in-memory vector store. By default, this can be set up to retrieve a specific number of semantically relevant documents for any given question. For instance, asking "Types of reward hacking" would return a list of relevant text chunks.
  3. Turn Retriever into a Tool: This retriever can be turned into a tool that an agent can use. We give the tool a name and a description. This tool will take a query, get the relevant documents, and concatenate them into a single string.
  4. Build the Agent: To build the agent, we bind the tool to an LLM. An agent is essentially an LLM calling tools in a loop until a termination condition is met. In LangGraph, this is implemented as a simple graph with two nodes: one for the LLM call and one for tool execution, as sketched below.
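
A minimal sketch of these four steps, assuming OpenAI models, placeholder blog URLs, and illustrative chunk sizes and retrieval settings (none of these specifics come from the original walkthrough):

```python
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.tools import tool
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langgraph.graph import START, MessagesState, StateGraph
from langgraph.prebuilt import ToolNode, tools_condition

# 1. Load the blog posts and split them into chunks.
urls = ["https://example.com/blog-post-1", "https://example.com/blog-post-2"]  # placeholders
docs = WebBaseLoader(urls).load()
chunks = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200).split_documents(docs)

# 2. Index the chunks in an in-memory vector store.
vector_store = InMemoryVectorStore.from_documents(chunks, OpenAIEmbeddings())
retriever = vector_store.as_retriever(search_kwargs={"k": 3})

# 3. Wrap the retriever as a tool the agent can call.
@tool
def retrieve_blog_posts(query: str) -> str:
    """Search the loaded blog posts and return the most relevant passages."""
    return "\n\n".join(doc.page_content for doc in retriever.invoke(query))

# 4. Build the agent: a two-node graph (LLM call + tool execution) that loops
#    until the LLM answers without requesting another tool call.
llm = ChatOpenAI(model="gpt-4o")
llm_with_tools = llm.bind_tools([retrieve_blog_posts])

def llm_node(state: MessagesState):
    return {"messages": [llm_with_tools.invoke(state["messages"])]}

builder = StateGraph(MessagesState)
builder.add_node("llm", llm_node)
builder.add_node("tools", ToolNode([retrieve_blog_posts]))
builder.add_edge(START, "llm")
builder.add_conditional_edges("llm", tools_condition)  # route to "tools" or finish
builder.add_edge("tools", "llm")
agent = builder.compile()
```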

The process flows like this:

  • A user message initiates the research topic.
  • The LLM makes a tool call.
  • The tool node executes the tool and appends the result (a tool message) to the message list.
  • This repeats until the LLM decides no more tool calls are needed and responds directly.

For example, when asked, "What are the types of reward hacking discussed in the blogs?", the agent might make one tool call to retrieve blog posts, receive the concatenated text, decide it needs more information, make a second tool call, and then finally synthesize the collected information into a summary.
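
Invoking the compiled agent on that question might look like this; the printing loop is just one way to inspect the message list as it accumulates tool calls and observations:

```python
# Run the agent; result["messages"] holds the full conversation, including
# every tool call and tool observation, ending with the final answer.
result = agent.invoke(
    {"messages": [{"role": "user", "content": "What are the types of reward hacking discussed in the blogs?"}]}
)
for message in result["messages"]:
    message.pretty_print()
```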

The Problem of Token Accumulation

A crucial observation from running such agents is the rapid growth of the token count. In a LangSmith trace, you might see the token count jump to over 25,000. Each tool call to retrieve documents can be token-heavy, with observations getting appended to the message list at every turn. This accumulation is a primary motivation for context engineering.

2. Context Pruning

Context pruning is the act of removing irrelevant or unneeded information from the context. This directly helps with the problem of context distraction, where a large context causes the model to overfocus and fail to perform novel actions. Recent reports have shown that LLM performance can degrade in surprising ways as the overall context grows, a particular problem for agents.

To implement pruning, the setup is identical to the RAG agent, but with one key difference: we apply pruning within the tool node. We use another LLM prompted to remove irrelevant information relative to the initial request.

The process is:

  1. The tool call produces an observation.
  2. The tool node fetches the initial user request.
  3. A smaller, efficient model (such as GPT-4o-mini) is run on the raw tool output with a prompt to prune it.
  4. The pruned, more condensed context is written back to the message list.
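
A sketch of such a pruning tool node, reusing the retrieve_blog_posts tool from the RAG sketch above; the prompt wording and the GPT-4o-mini choice are illustrative assumptions:

```python
# A pruning tool node: execute the tool call, then ask a small model to strip
# everything irrelevant to the user's original request before writing the
# observation back to the message list. In the graph, this function replaces
# the plain ToolNode from the RAG sketch.
from langchain_core.messages import ToolMessage
from langchain_openai import ChatOpenAI
from langgraph.graph import MessagesState

pruning_llm = ChatOpenAI(model="gpt-4o-mini")

PRUNE_PROMPT = (
    "Here is a user request:\n{request}\n\n"
    "Here is raw context returned by a tool:\n{observation}\n\n"
    "Remove everything that is not relevant to the request and return only "
    "the relevant text, unchanged."
)

def pruning_tool_node(state: MessagesState):
    """Execute pending tool calls, pruning each observation before appending it."""
    initial_request = state["messages"][0].content   # the first user message in this setup
    last_message = state["messages"][-1]              # the AI message containing tool calls
    results = []
    for tool_call in last_message.tool_calls:
        raw_observation = retrieve_blog_posts.invoke(tool_call["args"])
        pruned = pruning_llm.invoke(
            PRUNE_PROMPT.format(request=initial_request, observation=raw_observation)
        ).content
        results.append(ToolMessage(content=pruned, tool_call_id=tool_call["id"]))
    return {"messages": results}
```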

This simple method can significantly reduce context bloat, making tool observations far more compact.

3. Summarization

Similar to pruning, summarization condenses tool observations. The key difference is that summarization boils the entire context down into a compressed summary, whereas pruning strips away irrelevant parts.

  • Summarization is useful when the context is broadly relevant but may be redundant. You want to compress it while retaining all the key information.
  • Pruning is better when some parts of the context are relevant and others are explicitly irrelevant.

The implementation is nearly identical to pruning. In the tool node, we use a summarization prompt with a small model to condense the raw observation from a tool call. The resulting summary is then passed back to the main LLM.
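
A sketch of the summarizing variant, reusing the small model and retriever tool from the pruning sketch above; only the prompt changes, and its wording is again an assumption:

```python
# Same tool-node structure as pruning; only the compression prompt differs.
SUMMARIZE_PROMPT = (
    "Here is raw context returned by a tool:\n{observation}\n\n"
    "Summarize it into concise notes, keeping every fact, number, and named "
    "entity that could be needed to answer the user's request:\n{request}"
)

def summarizing_tool_node(state: MessagesState):
    """Execute pending tool calls, summarizing each observation before appending it."""
    initial_request = state["messages"][0].content
    last_message = state["messages"][-1]
    results = []
    for tool_call in last_message.tool_calls:
        raw_observation = retrieve_blog_posts.invoke(tool_call["args"])
        summary = pruning_llm.invoke(
            SUMMARIZE_PROMPT.format(request=initial_request, observation=raw_observation)
        ).content
        results.append(ToolMessage(content=summary, tool_call_id=tool_call["id"]))
    return {"messages": results}
```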

A word of caution: Both summarization and pruning risk information loss. It's critical to implement them carefully. Some developers use fine-tuned models for this step to ensure key events or information are retained.

4. Context Offloading

Context offloading is the act of storing information outside the LLM's context, often managed via tool calls. This is an intuitive and widely used technique.

For example, a research agent might create a research plan and save it to a file. This is done because the full research process might exceed the context window, but the plan needs to be recoverable for the final writing phase.

Another approach involves creating a plan.md file that is continually rewritten during the agent's execution. This iterative rewriting acts as a form of recitation, encouraging the agent to rethink its to-dos and re-contextualize its progress, which helps keep it on track.
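
A minimal sketch of that pattern as two file-backed tools an agent could call; the tool names are illustrative, and the plan.md filename follows the description above:

```python
# Two simple tools that offload the plan to plan.md on disk, so the agent can
# rewrite and reread it instead of carrying it in its context window.
from pathlib import Path
from langchain_core.tools import tool

PLAN_FILE = Path("plan.md")

@tool
def write_plan(plan: str) -> str:
    """Overwrite plan.md with the current research plan and to-dos."""
    PLAN_FILE.write_text(plan)
    return "Plan saved."

@tool
def read_plan() -> str:
    """Read back the current contents of plan.md."""
    return PLAN_FILE.read_text() if PLAN_FILE.exists() else "No plan yet."
```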

Offloading with LangGraph State

LangGraph provides a convenient way to do this using its state object, which persists throughout the agent's lifetime. It's a perfect place to offload information and keep it out of the LLM's context window.

We can define a custom state object with a scratchpad key. Then, we create tools to write to and read from this scratchpad. An agent can then kick off a research process, read from the scratchpad, create a plan, write the plan to the scratchpad, perform searches, and iteratively update the scratchpad as it gathers information.
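
A sketch of the scratchpad state and tools, assuming LangGraph's InjectedState annotation and Command-based state updates; the tool names and the scratchpad key are illustrative, and retrieve_blog_posts refers to the RAG sketch above:

```python
# Custom graph state with a scratchpad key, plus tools that write to and read
# from it, keeping the notes out of the message list.
from typing import Annotated

from langchain_core.messages import ToolMessage
from langchain_core.tools import InjectedToolCallId, tool
from langgraph.graph import MessagesState
from langgraph.prebuilt import InjectedState
from langgraph.types import Command

class AgentState(MessagesState):
    scratchpad: str  # offloaded notes live here, outside the LLM's context

@tool
def write_scratchpad(notes: str, tool_call_id: Annotated[str, InjectedToolCallId]) -> Command:
    """Overwrite the scratchpad with the current plan and research notes."""
    return Command(update={
        "scratchpad": notes,
        "messages": [ToolMessage("Notes saved to scratchpad.", tool_call_id=tool_call_id)],
    })

@tool
def read_scratchpad(state: Annotated[dict, InjectedState]) -> str:
    """Read back whatever is currently on the scratchpad."""
    return state.get("scratchpad") or "Scratchpad is empty."

# The agent graph is wired exactly as in the RAG sketch, but built over
# AgentState and with the scratchpad tools added alongside retrieve_blog_posts.
```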

This process mirrors how humans work: we take notes during research and then collate those notes into a final report.

Offloading to Long-Term Memory

Sometimes, you want to persist information across different agent runs. In LangGraph, each run is a "thread." LangGraph includes a long-term memory store that you can write to in one thread and read from in another, similar to ChatGPT's memory feature.

This store can be an in-memory key-value store for local testing or backed by Redis or Postgres in production. The implementation involves updating the tool node to read and write from this store instead of the state object.

For example:

  1. In a first session (thread one), you ask the agent to research Company A. It saves its findings to the long-term store.
  2. In a new session (thread two), you ask how Company B relates to Company A. The agent reads the notes about Company A from the store, gaining instant context before proceeding with the new research.
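
A minimal sketch of the cross-thread write and read using the in-memory store; the namespace, key, and note contents are illustrative, and in production the same calls would be backed by Redis or Postgres:

```python
# LangGraph's long-term memory store: written to in one thread, read in another.
from langgraph.store.memory import InMemoryStore

store = InMemoryStore()

# Thread one: the agent saves its findings about Company A.
store.put(("research_notes",), "company_a", {"notes": "Findings about Company A ..."})

# Thread two: a later run reads those notes back before starting new research.
saved = store.get(("research_notes",), "company_a")
print(saved.value["notes"])

# When compiling the agent, the same store is passed in (e.g.
# builder.compile(store=store)) so the tool node can read and write it.
```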

This powerful idea allows an agent to offload context to either a temporary state or a permanent store and reuse it as needed.

5. Tool Loadout

Tool loadout is the practice of actively selecting relevant tools based on your task. This helps with context confusion, which occurs when agents perform poorly because they have too many tools with overlapping definitions, making it hard to choose the right one.

A simple and effective solution is to perform semantic retrieval on tool descriptions based on the user's task. Only the relevant tools are retrieved and bound to the LLM on the fly.

For instance, we can create a tool registry from all functions in Python's math module, embed their descriptions, and save this to the LangGraph store. In the LLM node, before making a call, we search the store for tools with descriptions that are semantically similar to the user's query. Only these relevant tools are bound to the LLM for that specific task.

When you ask the agent to "calculate the arccos of 0.5," it will dynamically bind only a handful of relevant math tools (like acos, cos, sin) instead of the entire math library, making its decision-making far more efficient.
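
A sketch of that pattern using the store's semantic index over tool descriptions; to keep it short, the registry wraps only a handful of math functions by hand, and the embedding model and index dimensions are assumptions:

```python
# Tool loadout: embed tool descriptions in a store with a semantic index, then
# bind only the tools whose descriptions match the current query.
import math

from langchain_core.tools import StructuredTool
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langgraph.store.memory import InMemoryStore

def make_math_tool(fn):
    """Wrap a one-argument math function as a typed tool."""
    def wrapper(x: float) -> float:
        return fn(x)
    return StructuredTool.from_function(func=wrapper, name=fn.__name__,
                                        description=fn.__doc__ or fn.__name__)

tool_registry = {fn.__name__: make_math_tool(fn)
                 for fn in (math.acos, math.cos, math.sin, math.sqrt, math.log)}

# Index each tool's description for semantic search.
store = InMemoryStore(index={"embed": OpenAIEmbeddings(), "dims": 1536})
for name, math_tool in tool_registry.items():
    store.put(("tools",), name, {"description": math_tool.description})

def select_tools(query: str, k: int = 3):
    """Return the k tools whose descriptions best match the query."""
    return [tool_registry[hit.key] for hit in store.search(("tools",), query=query, limit=k)]

# In the LLM node, bind only the selected tools before making the call.
query = "calculate the arccos of 0.5"
llm_with_loadout = ChatOpenAI(model="gpt-4o").bind_tools(select_tools(query))
```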

6. Context Quarantine

Context quarantine involves isolating context in different LLMs, often structured as sub-agents. This can help with context clash or distraction by quarantining different topics into their own context windows, rather than making a single agent grapple with conflicting subtopics.

This approach is common in advanced research systems. By using isolated contexts, the overall system can utilize more tokens, as each sub-agent might have its own large context window.

Multi-Agent Supervisor Architecture

In LangGraph, this can be implemented with a supervisor-worker architecture:

  • A supervisor agent offloads research tasks to specialized sub-agents.
  • Sub-agents (e.g., a math agent and a search agent) perform their tasks in isolation.
  • Results are circulated back to the supervisor, which decides the next step.

For example, to find the combined headcount of FAANG companies, the supervisor would first delegate the task to a research expert to find the numbers. The researcher returns its findings. The supervisor then passes these numbers to a math expert for calculation. The final result is returned to the supervisor, who provides the answer.
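
A sketch of the supervisor-worker setup using the langgraph-supervisor helper package; the package choice, prompts, and the placeholder web_search and add tools are assumptions, and the same pattern can also be wired by hand with handoff tools:

```python
# Supervisor architecture: two isolated worker agents, each with its own
# context window, coordinated by a supervisor.
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
from langgraph_supervisor import create_supervisor  # pip install langgraph-supervisor

@tool
def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b

@tool
def web_search(query: str) -> str:
    """Placeholder search tool; swap in a real search API."""
    return f"Search results for: {query}"

model = ChatOpenAI(model="gpt-4o")

research_agent = create_react_agent(model, tools=[web_search], name="research_expert",
                                    prompt="You are a researcher. Find facts; do not do math.")
math_agent = create_react_agent(model, tools=[add], name="math_expert",
                                prompt="You are a math expert. Only perform calculations.")

supervisor = create_supervisor(
    agents=[research_agent, math_agent],
    model=model,
    prompt="Delegate research to research_expert and calculations to math_expert.",
).compile()

result = supervisor.invoke(
    {"messages": [{"role": "user", "content": "What is the combined headcount of the FAANG companies?"}]}
)
```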

A Note on Multi-Agent Systems

However, multi-agent systems can be risky. When sub-agents perform tasks in parallel, they make decisions independently, which can lead to contradictions or conflicts that affect the overall system. A classic example is having multiple agents write different sections of a report independently, resulting in a disjointed final document.

A good mitigation strategy is to constrain information gathering to parallel agents, but centralize decision-making or synthesis. For tasks like research, which is primarily information gathering, the risk of conflict is lower. A final "writer" agent can then de-conflict any inconsistencies to ensure the final output is coherent.

These six techniques provide a powerful toolkit for managing LLM context. They can be implemented easily in LangGraph and adapted for your specific use case to build more robust and efficient AI agents.
