Adapting the Memory System to Your LLM Tool

10xTeam May 22, 2026 14 min read

Part 2 of “Inside Claude’s Cognition” Series

The Universal Problem

Every LLM has a context window. Every LLM forgets between conversations. Every developer who works on multiple projects faces the same problem I described in Part 1:

I have a 200,000-token context window per request. You have years of projects. Each conversation starts fresh—I have no inherent memory between sessions.

This problem doesn’t change with the tool. ChatGPT has it. GPT-4 has it. Local Llama has it. The context limit is a law of physics for language models.

What changes is the solution. Claude Code has auto-memory. ChatGPT doesn’t. Anthropic’s Files API exists; OpenAI Assistants work differently. Your constraints shape your memory system.

This essay shows how to adapt the three-tier memory pattern to whatever tool you use.


The Universal Pattern (Your Tool Doesn’t Matter)

From Part 1:

  1. Tier 1 — Auto-memory: Lean summaries (what are we building? what did we decide?).
  2. Tier 2 — In-repo docs: Deep context (full analyses, roadmaps, architecture).
  3. Tier 3 — Session context: Ephemeral work (code changes, debug logs, live reasoning).

This pattern is tool-agnostic. The files are Markdown + JSON (open standards). The philosophy is “persist outside, load on-demand.”

The only thing that changes: where the memory lives and how it gets loaded.
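
As a rough sketch of "persist outside, load on-demand", the tiers can be modeled as a loader that always reads the Tier 1 summary and reads Tier 2 documents only when asked. The function name `load_context` and the layout (a `memory.md` beside optional deep docs) are assumptions for illustration, not part of any tool.

```python
from pathlib import Path


def load_context(project_dir, deep_docs=()):
    """Return the text to hand to an LLM at session start.

    Tier 1 (memory.md) is always loaded; Tier 2 files are loaded only
    when named in deep_docs. Tier 3 is the live conversation itself,
    so it is never read from disk.
    """
    project_dir = Path(project_dir)
    parts = [("memory.md", (project_dir / "memory.md").read_text(encoding="utf-8"))]
    for name in deep_docs:
        path = project_dir / name
        if path.exists():  # skip Tier 2 docs the project does not have
            parts.append((name, path.read_text(encoding="utf-8")))
    return "\n\n".join(f"=== {name} ===\n{text.strip()}" for name, text in parts)
```

Calling `load_context("pycademy")` would return only the lean summary, while `load_context("pycademy", deep_docs=["ANALYSIS.md"])` adds the deep analysis for tasks that need it.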


The Tool Selector

Pick your tool below. I’ll show you:

  • Where to store memory (filesystem? API? database?).
  • How to signal “load memory” to the LLM.
  • Trade-offs (cost, speed, ease).
  • A worked example.

Option 1: Claude Code (Free, Built-In)

You’re already using this. Skip to Part 3.

But if you’re reading this to understand other tools:

  • Memory lives in .claude/projects/<project-id>/memory/.
  • Claude Code auto-loads it.
  • Lazy-loading: you say “resume PyAcademy,” it loads only project_pycademy.md.

Option 2: ChatGPT (Free, Manual)

Where memory lives

Two places:

Global (shared across all conversations):

ChatGPT → Settings → Custom instructions → Instructions

## My Projects & Conventions

### Active Projects
- **PyAcademy:** Learning framework. Phase: 0 (bug fixes). GitHub: /autonomous/pycademy
- **DarJS:** SMB framework. Phase: 6/6 done. 258 tests. GitHub: /autonomous/darjs

### Cross-Project Conventions
- **Languages:** JavaScript/TypeScript for tooling. Python for data. Go for systems.
- **Testing:** Vitest for JS. Always >70% coverage.
- **Code:** camelCase (JS), snake_case (Python). No comments—write clearer code.
- **Memory:** I save decisions to GitHub. Load /projects/<name>/memory.md at session start.

Per-project (in GitHub or a shared doc):

/autonomous/pycademy/
├── memory.md         ← Tier 1 (load this at start)
├── ANALYSIS.md       ← Tier 2 (load only if task needs it)
└── ROADMAP.md        ← Tier 2

How to signal “load memory”

Start each conversation:

I'm resuming PyAcademy work.

My project memory is here:
https://github.com/your-username/autonomous/blob/main/pycademy/memory.md

(paste the contents of memory.md below)

---
[paste memory.md contents]
---

Next step: Phase 0, XSS bug fixes. See ANALYSIS.md §2 for details.
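
If typing that opener by hand gets tedious, a small script can assemble it from the file on disk. This is a minimal sketch: `build_resume_prompt` and its parameters are hypothetical names, and the output only mirrors the message layout shown above.

```python
from pathlib import Path


def build_resume_prompt(project, memory_path, next_step, source_url=None):
    """Assemble the text to paste as the first message of a conversation."""
    memory = Path(memory_path).read_text(encoding="utf-8").strip()
    lines = [f"I'm resuming {project} work.", ""]
    if source_url:
        # Optional pointer to where the memory file lives
        lines += ["My project memory is here:", source_url, ""]
    lines += ["---", memory, "---", "", f"Next step: {next_step}"]
    return "\n".join(lines)
```

Printing `build_resume_prompt("PyAcademy", "pycademy/memory.md", "Phase 0, XSS bug fixes.")` gives a block you can copy straight into a new chat.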

Trade-offs

Pros:

  • Free. No API fees.
  • Custom instructions are always loaded.
  • Simple—just copy-paste.

Cons:

  • Manual: you copy-paste memory at the start of every conversation.
  • Custom instructions have a small size limit (about 1,500 characters per field at the time of writing), so only a few one-line project summaries fit.
  • No lazy-loading: if memory is long, it counts against your conversation budget.
  • ChatGPT can’t automatically know to check GitHub for project files.
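
The budget concern in the list above can be checked before pasting. The sketch below assumes a crude four-characters-per-token heuristic (real tokenizers will differ) and an arbitrary rule that memory should take at most 20% of the window. Both numbers and both function names are illustrative assumptions.

```python
import math


def estimate_tokens(text, chars_per_token=4.0):
    """Very rough token estimate; real tokenizers will differ."""
    return math.ceil(len(text) / chars_per_token)


def budget_report(memory_text, context_window, max_memory_share=0.2):
    """Check whether pasted memory leaves enough room for the conversation.

    max_memory_share is the fraction of the window the memory may occupy
    (an arbitrary default, tune to taste).
    """
    used = estimate_tokens(memory_text)
    allowed = int(context_window * max_memory_share)
    return {"estimated_tokens": used, "allowed_tokens": allowed, "fits": used <= allowed}
```

If `fits` comes back `False`, that is a hint to trim `memory.md` or move detail into a Tier 2 document.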

Example: Starting a Session

You: I'm resuming PyAcademy. Let me load the project memory.

[paste memory.md]

Phase 0, start with XSS. What are the top 3 files I need to fix?

Me: [loads memory, checks Custom Instructions for conventions, 
     sees "PyAcademy: Phase 0, bug fixes", sees ANALYSIS.md link]
    
    The top 3 XSS vectors are:
    1. lesson.title in innerHTML (line ~1750)
    2. exercise.problem in innerHTML (line ~1918)
    3. Python output in innerHTML (line ~2054)

Option 3: OpenAI Assistants API (Paid, Structured)

Where memory lives

OpenAI Files API:

from openai import OpenAI

client = OpenAI()

# Upload memory files once (context managers close the file handles)
with open("/projects/pycademy/memory.md", "rb") as f:
    response_pycademy = client.files.create(file=f, purpose="assistants")
with open("/projects/pycademy/ANALYSIS.md", "rb") as f:
    response_analysis = client.files.create(file=f, purpose="assistants")

# Create assistant with files attached.
# Note: this is the Assistants v1 request shape. Assistants v2 renames the
# "retrieval" tool to "file_search" and attaches files through
# `tool_resources` (vector stores) instead of a top-level `file_ids`.
assistant = client.beta.assistants.create(
    name="DevOps Assistant",
    instructions="You help with software development. Reference the attached memory and analysis files.",
    model="gpt-4-turbo",
    tools=[{"type": "code_interpreter"}, {"type": "retrieval"}],
    file_ids=[response_pycademy.id, response_analysis.id]
)

How to signal “load memory”

Every time you create a message, the assistant has access to all attached files:

# Each conversation gets its own thread
thread = client.beta.threads.create()

message = client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Phase 0, start with XSS. What are the top 3 files?"
)

The assistant automatically retrieves relevant memory files (via semantic search) and includes them in the context.

Trade-offs

Pros:

  • Automatic: you don’t manually copy-paste.
  • Per-file attachment: lazy-loading at the tool level.
  • Semantic search: “what did we decide about testing?” finds the answer across all files.

Cons:

  • API cost: ~$0.10–$0.30 per conversation.
  • Slower: retrieval → embedding → LLM (adds ~2–5 sec latency).
  • Less fine-grained control: tool decides what to retrieve, not you.

Example: Starting a Session

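# One-time setup
assistant = client.beta.assistants.create(..., file_ids=[...])

# Every conversation
thread = client.beta.threads.create()

message = client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Phase 0, start with XSS fixes."
)

# A message alone does nothing; a run asks the assistant to respond
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id,
    assistant_id=assistant.id
)

# During the run, the assistant auto-retrieves memory.md + ANALYSIS.md
# No manual copy-paste needed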

Option 4: Anthropic Files API (Paid, Purpose-Built)

Where memory lives

Anthropic’s Files API (newer, purpose-built for this):

from anthropic import Anthropic

client = Anthropic()

# The Files API is in beta; message requests that reference files need this flag
FILES_BETA = "files-api-2025-04-14"

# Upload memory files (Markdown is sent as plain text)
with open("/projects/pycademy/memory.md", "rb") as f:
    memory_response = client.beta.files.upload(file=("memory.md", f, "text/plain"))

with open("/projects/pycademy/ANALYSIS.md", "rb") as f:
    analysis_response = client.beta.files.upload(file=("ANALYSIS.md", f, "text/plain"))

# Use in message: document blocks belong in the user turn, not in `system`
message = client.beta.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    betas=[FILES_BETA],
    system="You are a developer assistant. Reference the attached memory and analysis.",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "document", "source": {"type": "file", "file_id": memory_response.id}},
                {"type": "document", "source": {"type": "file", "file_id": analysis_response.id}},
                {"type": "text", "text": "Phase 0, XSS fixes. What are the top 3?"},
            ],
        }
    ]
)

Trade-offs

Pros:

  • Purpose-built for context (not a side-effect of retrieval).
  • Faster than Assistants (no embedding search).
  • Cheaper than Assistants (~$0.03 per file load).
  • Automatic: file is always in context, no manual paste.

Cons:

  • API-only (no web interface).
  • Requires code / automation (can’t use ChatGPT web and do this).

Example: Starting a Session

# Load memory once per session
message = client.beta.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    betas=["files-api-2025-04-14"],
    system="...",
    messages=[{
        "role": "user",
        "content": [
            {"type": "document", "source": {"type": "file", "file_id": memory_id}},
            {"type": "document", "source": {"type": "file", "file_id": analysis_id}},
            {"type": "text", "text": "Phase 0, XSS fixes."},
        ],
    }]
)

# Memory is in context. No manual loading.
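
Uploading the same file at the start of every session is unnecessary when its content has not changed. One possible approach, sketched below, caches file IDs by content hash in a local JSON file. The `upload` argument is a caller-supplied function (for example, a thin wrapper around the upload call above), and `get_file_id` and the cache file name are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def get_file_id(path, upload, cache_path="file_ids.json"):
    """Return a file ID for path, uploading only when the content changed.

    upload is any callable that takes a path and returns a file ID string.
    """
    path, cache_path = Path(path), Path(cache_path)
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    entry = cache.get(str(path))
    if entry and entry["sha256"] == digest:
        return entry["file_id"]  # unchanged since the last upload
    file_id = upload(path)
    cache[str(path)] = {"sha256": digest, "file_id": file_id}
    cache_path.write_text(json.dumps(cache, indent=2))
    return file_id
```

This sketch never deletes superseded uploads, so a real setup would probably also want to clean up old file IDs on the provider side.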

Option 5: LangChain + Vector DB (Powerful, Complex)

Where memory lives

Vector database (Pinecone, Weaviate, Milvus):

# Import paths follow the legacy `langchain` package layout; newer releases
# move these classes into `langchain_community` and `langchain_openai`.
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.schema import Document
from langchain.vectorstores import Pinecone

def read(path):
    with open(path) as f:
        return f.read()

# Assumes the Pinecone client is already initialized and the index exists
embeddings = OpenAIEmbeddings()
vectorstore = Pinecone.from_documents(
    documents=[
        Document(page_content=read("/projects/pycademy/memory.md"),
                 metadata={"project": "pycademy", "type": "memory"}),
        Document(page_content=read("/projects/pycademy/ANALYSIS.md"),
                 metadata={"project": "pycademy", "type": "analysis"}),
    ],
    embedding=embeddings,
    index_name="project-memory"
)

# Use: fetch the top-3 most similar entries and inject them into the prompt
llm = ChatOpenAI()
question = "Phase 0, XSS fixes. What are the top 3?"
docs = vectorstore.similarity_search(question, k=3)
context = "\n\n".join(doc.page_content for doc in docs)
response = llm.predict(f"Project memory:\n{context}\n\nQuestion: {question}")

How it works

  • You ask a question.
  • LangChain embeds the question.
  • Vector DB finds the top-K most similar memory entries.
  • Those are injected into the LLM context.
  • LLM answers with memory context.
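
The steps above can be sketched without any external services. The toy "embedding" here is just a bag of lowercase word counts, so it only matches shared words rather than meaning. A real pipeline would swap in an embedding model and a vector database. All function names and the sample entries are illustrative.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a real system would use a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, entries, k=2):
    """Return the k memory entries most similar to the question."""
    q = embed(question)
    ranked = sorted(entries, key=lambda e: cosine(q, embed(e)), reverse=True)
    return ranked[:k]


def build_prompt(question, entries, k=2):
    """Inject the retrieved entries ahead of the question."""
    context = "\n".join(f"- {e}" for e in retrieve(question, entries, k))
    return f"Relevant memory:\n{context}\n\nQuestion: {question}"


memory_entries = [
    "Testing: Vitest for JS, always above 70% coverage.",
    "PyAcademy is in Phase 0: XSS bug fixes in innerHTML usage.",
    "DarJS finished phase 6 of 6 with 258 tests.",
]
print(build_prompt("What did we decide about testing coverage?", memory_entries, k=1))
```

With `k=1`, only the testing entry is injected, because it is the only one that shares words with the question.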

Trade-offs

Pros:

  • Semantic search: “what did we decide about testing?” finds the answer, not just keyword matches.
  • Scales to unlimited memory (vector DB can hold millions of entries).
  • Flexible: you control embedding model, retrieval strategy.

Cons:

  • Complex: requires setup, infrastructure, debugging.
  • Overkill for structured projects (you don’t need semantic search; decisions are organized by folder/phase).
  • API costs: embedding API + vector DB.
  • Slower: embedding + retrieval adds latency.

When to use

If you have:

  • 100+ projects with overlapping lessons.
  • Unstructured memory (notes, discussions, logs).
  • Need for “find all decisions about authentication across all projects.”

Otherwise, use simpler tools.


Option 6: Plain GitHub (Free, Manual, Durable)

Where memory lives

Your repo:

/autonomous/
├── pycademy/
│   ├── memory.md         ← Tier 1 (you paste this at session start)
│   ├── ANALYSIS.md       ← Tier 2
│   └── ROADMAP.md        ← Tier 2
├── darjs/
│   ├── memory.md
│   └── ...
├── GLOBAL_CONVENTIONS.md ← Cross-project rules
└── MEMORY_INDEX.md       ← Which projects have memory
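
Keeping `MEMORY_INDEX.md` in sync by hand is easy to forget. As a sketch, the index could be regenerated by scanning for `memory.md` files one level below the repo root. The index format below (one bullet per project) and the name `build_memory_index` are assumptions, since the layout above does not prescribe them.

```python
from pathlib import Path


def build_memory_index(root):
    """List every project folder under root that contains a memory.md."""
    root = Path(root)
    lines = ["# Memory Index", ""]
    for memory_file in sorted(root.glob("*/memory.md")):
        project = memory_file.parent.name
        lines.append(f"- {project}: {project}/memory.md")
    return "\n".join(lines) + "\n"
```

Running `Path("MEMORY_INDEX.md").write_text(build_memory_index("."))` from the repo root would rewrite the index in place.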

How to signal “load memory”

You: Resuming PyAcademy. Loading memory...

[manually copy-paste from /autonomous/pycademy/memory.md]

---

OK, loaded. Phase 0, XSS fixes.

Trade-offs

Pros:

  • Free. No API calls.
  • Durable: version controlled, backed up.
  • Simple: just files in your repo.
  • Searchable: git grep or GitHub search.

Cons:

  • Manual: you copy-paste at the start of every conversation.
  • No auto-loading: LLM doesn’t know to check GitHub.
  • Token cost: memory counts against your conversation budget.

When to use

  • You’re using ChatGPT web or free tools (no API access).
  • You want version history and durable storage.
  • You don’t mind manual copy-paste.

Option 7: Local Llama + Custom Retrieval (Complete Control)

Where memory lives

Local SQLite + embeddings:

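import json
import sqlite3

import numpy as np
from sentence_transformers import SentenceTransformer

# Embed and store memory
model = SentenceTransformer('all-MiniLM-L6-v2')
conn = sqlite3.connect("memory.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS memory (project TEXT PRIMARY KEY, text TEXT, embedding TEXT)"
)

for project in ["pycademy", "darjs"]:
    with open(f"/projects/{project}/memory.md") as f:
        text = f.read()
    embedding = model.encode(text).tolist()
    # INSERT OR REPLACE keeps re-runs from duplicating rows
    conn.execute(
        "INSERT OR REPLACE INTO memory (project, text, embedding) VALUES (?, ?, ?)",
        (project, text, json.dumps(embedding))
    )
conn.commit()

# At inference time, retrieve relevant memory
def retrieve_memory(query, k=3):
    query_embedding = model.encode(query)
    rows = conn.execute("SELECT project, text, embedding FROM memory").fetchall()
    # Plain SQLite has no vector distance function, so rank by cosine similarity in Python
    scored = []
    for project, text, embedding_json in rows:
        vec = np.array(json.loads(embedding_json))
        score = np.dot(query_embedding, vec) / (np.linalg.norm(query_embedding) * np.linalg.norm(vec))
        scored.append((score, project, text))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(project, text) for _, project, text in scored[:k]]

# Inject into Llama context
relevant_memory = retrieve_memory("PyAcademy Phase 0")
context = "\n\n".join([text for _, text in relevant_memory])
prompt = f"Memory:\n{context}\n\nUser: Phase 0, XSS fixes."
response = llama.generate(prompt)  # `llama` stands in for your local model client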

Trade-offs

Pros:

  • Complete control. Shape the system exactly as you want.
  • No API costs. Everything runs locally.
  • As smart as the embedding model you pick.

Cons:

  • Infrastructure work: set up embeddings, vector DB, retrieval.
  • Debugging complexity: why didn’t it retrieve X?
  • Embedding quality depends on your model choice.

When to use

  • You’re running local Llama or self-hosted LLM.
  • You want zero API costs and full control.
  • You’re willing to build infrastructure.

Comparison: All Options

| Tool | Memory storage | Load mechanism | Cost | Setup | Latency |
| --- | --- | --- | --- | --- | --- |
| Claude Code | .claude/projects/ | Auto | Free | Built-in | ~0 ms |
| ChatGPT | GitHub / Google Docs | Manual paste | Free | Minimal | ~0 ms |
| OpenAI Assistants | Files API | Auto retrieval | $$ | Moderate | ~2–5 sec |
| Anthropic Files API | Files API | Auto inject | $ | Moderate | ~0 ms |
| LangChain + VectorDB | Vector DB | Auto retrieval | $$$ | High | ~2–5 sec |
| GitHub | Git repo | Manual paste | Free | Minimal | ~0 ms |
| Local Llama + SQLite | SQLite | Custom retrieval | None | High | ~0 ms |

Recommendation: Pick Your Tool

You want free + minimal setup

ChatGPT + GitHub. Manually paste memory.md at the start of conversations. Use Custom Instructions for global rules.

You want cheap + automation (and can use an API)

Anthropic Files API (if you can switch to Anthropic). Purpose-built, cheap, fast.

You’re using OpenAI API

Assistants API (built-in retrieval) or LangChain (more control).

You’re using local Llama

Local SQLite + embeddings. You build it, you own it.

You want the best experience

Claude Code (auto-memory) or Anthropic Files API (if you’re on API).


Part 2 Summary

The three-tier memory pattern (Tier 1: lean summaries, Tier 2: deep docs, Tier 3: session context) is universal. But where you store memory and how you signal “load memory” depends on your tool.

The principle: Persist outside the LLM. Load on-demand. Organize by scope.

The implementation: Varies by tool. Pick the one that matches your workflow, budget, and technical comfort.

No tool is “best.” The best tool is the one you’ll actually use consistently.


Next in the series: Part 3: Token Economics — The math of context budgets, cache amortization, and why long sessions beat short ones.


Filed under: LLM memory management, tool comparison, context strategies.

Date: 2026-04-24 · Reading time: ~10 min

