Agentic QA in 2026: Let One AI Break Your Framework and Another Fix It

Most demos of AI coding agents show the happy path.

You ask for an app. The agent creates files. Tests pass. Everyone claps.

Real software does not feel like that. Real software is full of broken setup flows, confusing commands, half-finished CLIs, missing documentation, and “obvious” steps that only work because the original developer knows the secret handshake.

That is why one of the most useful agentic workflows I have been experimenting with is not:

Let the agent build the whole product.

It is:

Let the agent try to build the product like a new developer, hit the wall, write the bug report, and stop.

Then another agent fixes the reported issue.

Then the QA agent tries again.

This sounds simple, but it is one of the patterns that makes agentic development feel less like a magic trick and more like engineering.

The Setup

I have an internal framework called DarJS.

You do not need to know DarJS to understand the workflow. Think of it as a private app framework for building small-business software, especially for Moroccan SMBs such as pharmacies, corner shops, restaurants, and service businesses. It covers the usual concerns: suppliers, invoices, stock, payments, and local business fields.

The framework has:

a CLI
an MCP server
app templates
model generators
scenario tests
locale generation
database migration tools

In other words, it has exactly the kind of developer experience that can silently rot if nobody tests it from the outside.

So I created an AGENTS.md file for Codex. The file did not tell the agent to be a normal feature developer. It gave it a narrower job:

You are a QA build agent.
Try to build apps using this framework.
Do not fix framework bugs.
When blocked, append a precise issue to CODEX_REPORT.md and stop.

That narrow brief changes the whole behavior of the agent.

Instead of treating failure as something to hack around, the agent treats failure as the output.

Is This a 2026 Best Practice or Just an Experiment?

It is both.

The exact file name CODEX_REPORT.md is my experiment. There is no universal law that says every repo needs that file.

But the underlying pattern is very much in line with where agentic development is going in 2026:

keep agent instructions in version-controlled markdown
make agent roles explicit
give agents a workflow, not just a goal
let tools expose deterministic actions
store cross-agent state somewhere durable
make the next agent’s job obvious

This is not far from how modern coding agents already treat project memory. Claude Code documents CLAUDE.md as a project memory file for team-shared instructions. OpenAI Codex supports AGENTS.md as repo guidance, similar to a README for agents. MCP formalizes the tool side: a server exposes actions, and the agent calls them through a structured protocol.

So the idea of shared markdown files is not strange anymore. It is becoming normal.

The important distinction is this:

AGENTS.md or CLAUDE.md tells the agent how to behave.

CODEX_REPORT.md records what happened.

One is instruction memory. The other is workflow state.

That second part is still more experimental. But it works because it is plain text, reviewable, diffable, and easy for any agent or human to continue from.

The First Run

I asked Codex to choose a Moroccan small-business app itself. It picked a practical one: hanout-maroc, a small local shop app.

That is already a useful detail. I did not give it a perfectly polished product spec. I gave it a domain: small businesses in Morocco. The agent had enough context to choose a realistic workflow:

products
stock
sales
supplier restocking
unpaid supplier invoices
cash payments in MAD

Then it followed the framework instructions.

The first instruction in the framework was:

node packages/mcp/bin/dar-mcp.js apps/<your-app-name>

The documented flow is: the MCP server starts, the agent calls get_workflow, and then, for a fresh app, it calls init_app.

But the server refused to start.

Why?

Because it required manifest.js to already exist.

That created a circular dependency:

Need MCP server to call init_app
Need manifest.js to start MCP server
Need init_app to create manifest.js

This is exactly the kind of bug a framework author can miss. If you already have apps on disk, everything looks fine. If you are a new developer starting from nothing, the first command fails.

The QA agent did not patch the MCP server. It wrote a report.

The Report as an Agent Handoff

The report was not a vague note like:

MCP is broken.

It used a structured format:

## Issue #1 - MCP server cannot start for a fresh app before init_app

**Date:** 2026-05-22
**Phase:** Phase 0 - Initialise the app directory
**App:** hanout-maroc
**Severity:** blocker

**Command / tool call:**
node packages/mcp/bin/dar-mcp.js apps/hanout-maroc

**Expected:**
The MCP server should start for a fresh app target so the agent can call
get_workflow first and then init_app.

**Got:**
Error: no manifest.js found...

**Workaround (if any):**
none - stopped here

**Status:** open
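Writing an entry like this can be made mechanical. The following is a minimal sketch, assuming the QA side has a small helper for it; `formatIssue`, `appendIssue`, and the shape of the `issue` object are hypothetical names for illustration, not part of DarJS or Codex. The one property that matters is that the file is only ever appended to.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Render one issue in the report layout shown above.
// The property names on `issue` are illustrative assumptions.
function formatIssue(issue) {
  return [
    `## Issue #${issue.number} - ${issue.title}`,
    "",
    `**Date:** ${issue.date}`,
    `**Phase:** ${issue.phase}`,
    `**App:** ${issue.app}`,
    `**Severity:** ${issue.severity}`,
    "",
    "**Command / tool call:**",
    issue.command,
    "",
    "**Expected:**",
    issue.expected,
    "",
    "**Got:**",
    issue.got,
    "",
    "**Workaround (if any):**",
    issue.workaround || "none - stopped here",
    "",
    "**Status:** open",
    "",
  ].join("\n");
}

// Append-only: earlier entries are never rewritten, so git history stays readable.
function appendIssue(reportPath, issue) {
  fs.appendFileSync(reportPath, formatIssue(issue) + "\n", "utf8");
}

// Usage: write the first issue into a throwaway report file.
const reportPath = path.join(
  fs.mkdtempSync(path.join(os.tmpdir(), "qa-report-")),
  "CODEX_REPORT.md"
);
appendIssue(reportPath, {
  number: 1,
  title: "MCP server cannot start for a fresh app before init_app",
  date: "2026-05-22",
  phase: "Phase 0 - Initialise the app directory",
  app: "hanout-maroc",
  severity: "blocker",
  command: "node packages/mcp/bin/dar-mcp.js apps/hanout-maroc",
  expected: "The MCP server should start for a fresh app target.",
  got: "Error: no manifest.js found...",
});
console.log(fs.readFileSync(reportPath, "utf8"));
```

Using `appendFileSync` rather than read-modify-write keeps the agent from accidentally truncating or reordering older entries.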

Then I asked for the fix ideas to be written as tasks for Claude Code, using checkbox format:

**Claude Code tasks:**
- [ ] Update the MCP binary so it accepts a fresh app directory.
- [ ] Ensure get_workflow works before app initialization.
- [ ] Ensure init_app creates the app skeleton.
- [ ] Add a root npm script for the MCP server.
- [ ] Add a test for the fresh-app flow.

This is where the shared file becomes more than a log.

It becomes the handoff contract between agents.

Codex was not responsible for fixing. Claude was. The file made that separation explicit.
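Because the tasks are plain checkboxes, the fixer side can find its work without any conversation history. Here is a rough sketch of that lookup, assuming entries start with a `## Issue #N` heading as in the report above; `openTasks` is a hypothetical helper name.

```javascript
// Collect unchecked "- [ ]" tasks, tagged with the issue number they belong to.
function openTasks(reportText) {
  const tasks = [];
  let currentIssue = null;
  for (const line of reportText.split(/\r?\n/)) {
    const heading = line.match(/^## Issue #(\d+)\b/);
    if (heading) {
      currentIssue = Number(heading[1]);
      continue;
    }
    // Checked tasks ("- [x]") do not match and are skipped.
    const task = line.match(/^- \[ \] (.+)$/);
    if (task && currentIssue !== null) {
      tasks.push({ issue: currentIssue, task: task[1].trim() });
    }
  }
  return tasks;
}

// Usage: one task already checked off, one still open.
const sample = [
  "## Issue #1 - MCP server cannot start for a fresh app before init_app",
  "**Claude Code tasks:**",
  "- [x] Update the MCP binary so it accepts a fresh app directory.",
  "- [ ] Add a test for the fresh-app flow.",
].join("\n");

console.log(openTasks(sample));
```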

Why a Shared Markdown File Works

A shared markdown file is not glamorous. It is not a vector database. It is not a multi-agent orchestration framework. It does not have a dashboard.

That is why it works.

It has useful properties:

humans can read it
agents can read it
git can diff it
tasks can be checked off
reports can be appended without destroying history
the file survives context resets
it does not depend on a specific vendor

This matters because agentic development has a context problem.

The chat transcript is temporary. The model context window is limited. Tool output can disappear from the next session. But a file in the repo is durable.

If Codex writes a report and Claude opens the same repo later, Claude does not need the original conversation. It needs the file.

This is the same reason developers still use README.md, TODO.md, changelogs, issue templates, and migration notes. They are boring because they are good.

In 2026, a lot of “agent memory” will still just be disciplined file design.

The Second Run

Claude fixed the first issue.
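The actual patch is not shown here. One plausible shape for this kind of fix, sketched under my own assumptions, is to let the server start without a manifest and expose only the tools that can create one. `resolveStartupMode` and the `allowedTools` field are hypothetical; only the tool names get_workflow and init_app and the file name manifest.js come from the framework.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Tools that must work before manifest.js exists (names taken from the workflow).
const BOOTSTRAP_TOOLS = ["get_workflow", "init_app"];

// Decide how to start instead of throwing when the manifest is missing.
function resolveStartupMode(appDir) {
  const manifestPath = path.join(appDir, "manifest.js");
  if (fs.existsSync(manifestPath)) {
    // null means "no restriction" in this sketch.
    return { mode: "existing", allowedTools: null };
  }
  // Fresh target: start anyway, but only with the tools that can create the app.
  return { mode: "fresh", allowedTools: BOOTSTRAP_TOOLS };
}

// Usage: an empty directory is "fresh"; once manifest.js exists it is "existing".
const freshDir = fs.mkdtempSync(path.join(os.tmpdir(), "fresh-app-"));
console.log(resolveStartupMode(freshDir).mode);
fs.writeFileSync(
  path.join(freshDir, "manifest.js"),
  "module.exports = { models: [] };\n"
);
console.log(resolveStartupMode(freshDir).mode);
```

The same two states make a natural regression test for the fresh-app flow that the report asked for.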

The MCP server could now start for a fresh app directory. Codex retried the build.

This time it got further:

Started the MCP server.
Called get_workflow.
Called init_app.
Read the available mixins.
Created two scenarios:
- cashier records a shop sale
- owner records supplier restock on credit
Created a Product model.
Ran model validation.

The standalone model was valid.

But the next workflow step failed in a more subtle way.

The model file existed:

apps/hanout-maroc/models/Product.js

And validation passed:

Product - valid

But manifest.js still had:

models: [],

So the locale tool saw zero models. Migration would also see zero models. Health checks and scenarios would not be testing the app the agent thought it had built.
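This kind of "file exists but is not registered" gap is cheap to detect. Below is a minimal sketch of such a check, assuming the manifest's `models` array holds model names that match the file names under models/; that shape is my assumption, and `findUnregisteredModels` is a hypothetical helper.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Return model files on disk that the manifest does not list.
// Assumption: manifest.models is an array of names like "Product".
function findUnregisteredModels(appDir, manifest) {
  const modelsDir = path.join(appDir, "models");
  if (!fs.existsSync(modelsDir)) return [];
  const registered = new Set(manifest.models || []);
  return fs
    .readdirSync(modelsDir)
    .filter((file) => file.endsWith(".js"))
    .map((file) => path.basename(file, ".js"))
    .filter((name) => !registered.has(name))
    .sort();
}

// Usage: recreate the second-run state in a temp directory.
const appDir = fs.mkdtempSync(path.join(os.tmpdir(), "hanout-maroc-"));
fs.mkdirSync(path.join(appDir, "models"));
fs.writeFileSync(
  path.join(appDir, "models", "Product.js"),
  "module.exports = {};\n"
);
console.log(findUnregisteredModels(appDir, { models: [] }));
```

A non-empty result here means later steps such as locale generation or migration would be working from an incomplete picture of the app.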

This is a great example of why agentic QA should not stop at “file was created.”

The question is not:

Did the generator write a file?

The question is:

Can a new user proceed through the documented workflow?

Again, Codex stopped and wrote Issue #2.

The Pattern

The loop looks like this:

Codex QA agent
  |
  v
tries documented workflow
  |
  v
hits first blocker
  |
  v
appends CODEX_REPORT.md
  |
  v
Claude Code fixer
  |
  v
patches framework and checks off tasks
  |
  v
Codex retries from the workflow

This is not multi-agent theater. The agents have different jobs.

The QA agent is intentionally conservative. It should not patch the framework because that would hide the bug. Its job is to reproduce the new-developer experience.

The fixer agent is allowed to edit framework code. Its job is to take a precise report and turn it into a patch.

The shared file keeps the boundary clean.
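In my setup a human triggers each step, but the loop itself is simple enough to write down. This is only a sketch: `runQa` and `runFixer` are hypothetical callbacks standing in for the two agents, and they are synchronous here to keep the example short, whereas real agent runs would be asynchronous.

```javascript
// Alternate QA and fixer until the QA run is no longer blocked.
// runQa returns { blocked: boolean, issue?: string } in this sketch.
function runLoop({ runQa, runFixer, maxRounds = 5 }) {
  const history = [];
  for (let round = 1; round <= maxRounds; round++) {
    const result = runQa(round);
    history.push({ round, blocked: result.blocked, issue: result.issue || null });
    if (!result.blocked) return { done: true, history };
    // The QA agent has stopped; only the fixer touches framework code.
    runFixer(result.issue);
  }
  // Cap the rounds so the two agents cannot ping-pong forever.
  return { done: false, history };
}

// Usage with fake agents: two blockers, each fixed in turn.
const openBlockers = ["Issue #1", "Issue #2"];
const outcome = runLoop({
  runQa: () =>
    openBlockers.length > 0
      ? { blocked: true, issue: openBlockers[0] }
      : { blocked: false },
  runFixer: () => {
    openBlockers.shift();
  },
});
console.log(outcome.done, outcome.history.length);
```

The round cap is the part I would keep even in a manual workflow: if the same blocker comes back repeatedly, a human should look at it.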

Why Not Just Use GitHub Issues?

You can.

For a team, GitHub Issues, Linear, or Jira may be the right long-term system.

But during local framework development, a repo-local markdown report has advantages:

no API setup
no authentication
no extra MCP integration
works offline
easy to commit beside the code
easy for agents to append safely
visible in the editor immediately

I would not say CODEX_REPORT.md replaces your issue tracker. I would say it is a good local buffer.

Once the pattern matures, you can promote entries into GitHub issues automatically.

Think of it like a lab notebook for agents.
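Promoting entries to a tracker could start with a pure conversion step. The sketch below turns open report entries into request bodies for GitHub's "create an issue" REST endpoint (POST /repos/{owner}/{repo}/issues, which accepts `title`, `body`, and `labels`). It makes no network call. The label names, the "fixed" status value, and the second sample entry are assumptions for illustration.

```javascript
// Split a report into entries at each "## Issue #N - title" heading.
function parseEntries(reportText) {
  const entries = [];
  let current = null;
  for (const line of reportText.split(/\r?\n/)) {
    const heading = line.match(/^## Issue #(\d+) - (.+)$/);
    if (heading) {
      current = { number: Number(heading[1]), title: heading[2].trim(), lines: [] };
      entries.push(current);
    } else if (current) {
      current.lines.push(line);
    }
  }
  return entries;
}

// Build the JSON body for one issue. Label names are hypothetical.
function toGitHubPayload(entry) {
  const body = entry.lines.join("\n").trim();
  const severity = (body.match(/^\*\*Severity:\*\* (\w+)/m) || [])[1];
  return {
    title: entry.title,
    body,
    labels: severity ? ["agent-qa", `severity:${severity}`] : ["agent-qa"],
  };
}

// Usage: only entries still marked open are promoted.
const report = [
  "## Issue #1 - MCP server cannot start for a fresh app before init_app",
  "",
  "**Severity:** blocker",
  "**Status:** open",
  "",
  "## Issue #2 - Hypothetical entry that was already fixed",
  "",
  "**Severity:** degraded",
  "**Status:** fixed",
].join("\n");

const payloads = parseEntries(report)
  .filter((entry) => entry.lines.some((line) => line.trim() === "**Status:** open"))
  .map(toGitHubPayload);
console.log(JSON.stringify(payloads, null, 2));
```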

What Makes This Different From Normal Testing?

Unit tests check code behavior.

Scenario tests check app behavior.

Agentic QA checks the developer workflow.

That last layer is underrated.

An agent can test questions like:

Is the documented first command actually runnable?
Does a fresh app path work, or only an existing app?
Does the generated file get registered where the framework expects it?
Do the next-tool hints match reality?
Does the CLI output enough information for a new developer to recover?
Does the framework silently succeed while doing nothing useful?

These are not always easy to catch with conventional tests because they live across documentation, commands, generated files, and user expectations.

An agent pretending to be a new developer is surprisingly good at finding them.

The Key Design Rule

If you want to try this, do not start by asking:

How do I make the agent autonomous?

Start with:

What role is this agent allowed to play?

For my QA agent, the rules were:

build only inside the app directory
never edit framework packages
follow the documented workflow
report blockers in a fixed format
append only
stop after a blocker

Those constraints are the product.

Without them, the agent might “helpfully” patch the framework, manually edit the manifest, skip the broken step, and declare success.

That feels productive in the moment. But it destroys the signal.

The whole point is to catch the places where the framework fails a real user.
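The first two rules in the list above ("build only inside the app directory" and "never edit framework packages") do not have to rely on the agent's goodwill. A minimal sketch of a path guard follows, assuming writes go through a wrapper you control; `isWriteAllowed` is a hypothetical name.

```javascript
const path = require("path");

// True only when target resolves to a location inside appDir (or appDir itself).
function isWriteAllowed(appDir, target) {
  const rel = path.relative(path.resolve(appDir), path.resolve(target));
  if (rel === "") return true;
  // A relative path that climbs out with ".." or is absolute
  // (for example another drive on Windows) is outside the app directory.
  const escapes = rel === ".." || rel.startsWith(".." + path.sep);
  return !escapes && !path.isAbsolute(rel);
}

// Usage: app files are allowed, framework files and ".." tricks are not.
const app = "apps/hanout-maroc";
console.log(isWriteAllowed(app, "apps/hanout-maroc/models/Product.js"));
console.log(isWriteAllowed(app, "packages/mcp/bin/dar-mcp.js"));
console.log(isWriteAllowed(app, "apps/hanout-maroc/../../packages/mcp/bin/dar-mcp.js"));
```

Resolving both paths before comparing them is what catches the third case, where the string starts with the app directory but ends up elsewhere.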

When Shared Files Are a Good Practice

Shared files are useful when they are:

version-controlled
narrow in purpose
append-only or easy to review
written in a predictable structure
used by both humans and agents
small enough to stay readable

Good examples:

AGENTS.md for agent behavior
CLAUDE.md for Claude Code project memory
CODEX_REPORT.md for QA findings
TODO.md for scoped local tasks
DECISIONS.md for architecture decisions
RUNBOOK.md for operational steps

Bad examples:

one giant memory file with every thought
vague notes nobody acts on
agent scratchpads committed forever
files that duplicate your actual source of truth
reports with no reproduction steps

The trick is to treat shared files like interfaces.

If another agent cannot reliably consume the file, the format is too loose.

Practical Template

Here is the minimal version of the pattern:

# QA_REPORT.md

## Issue #1 - Short title

**Date:** YYYY-MM-DD
**Role:** qa-agent
**Area:** setup | cli | docs | migration | tests
**Severity:** blocker | degraded | cosmetic

**Command / tool call:**
```
exact command
```

**Expected:**
What should have happened.

**Got:**
Exact output or behavior.

**Reproduction steps:**
1. Step one
2. Step two
3. Step three

**Fix tasks:**
- [ ] Task for fixer agent
- [ ] Test to add

**Status:** open

Do not make the template clever. Clever templates decay. Boring templates get used.
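A boring template is also easy to check mechanically, which is one way to test whether another agent can reliably consume the file. Here is a small sketch of such a check for a single entry; the list of required fields is my reading of the template above, and `validateEntry` is a hypothetical helper.

```javascript
const REQUIRED_FIELDS = ["Date", "Role", "Area", "Severity", "Status"];
const SEVERITIES = ["blocker", "degraded", "cosmetic"];

// Return a list of problems for one entry; an empty list means it is consumable.
function validateEntry(entryText) {
  const problems = [];
  if (!/^## Issue #\d+ - \S/m.test(entryText)) {
    problems.push("missing '## Issue #N - title' heading");
  }
  // Collect single-line "**Name:** value" fields.
  const fields = {};
  for (const match of entryText.matchAll(/^\*\*([^*:]+):\*\*[ \t]*(.*)$/gm)) {
    fields[match[1].trim()] = match[2].trim();
  }
  for (const name of REQUIRED_FIELDS) {
    if (!fields[name]) problems.push(`missing or empty field: ${name}`);
  }
  if (fields.Severity && !SEVERITIES.includes(fields.Severity)) {
    problems.push(`unknown severity: ${fields.Severity}`);
  }
  return problems;
}

// Usage: one well-formed entry and one with a bad severity and no status.
const goodEntry = [
  "## Issue #1 - Short title",
  "**Date:** 2026-05-22",
  "**Role:** qa-agent",
  "**Area:** setup",
  "**Severity:** blocker",
  "**Status:** open",
].join("\n");
const badEntry = [
  "## Issue #2 - Short title",
  "**Date:** 2026-05-22",
  "**Role:** qa-agent",
  "**Area:** cli",
  "**Severity:** urgent",
].join("\n");

console.log(validateEntry(goodEntry));
console.log(validateEntry(badEntry));
```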

The Bigger Lesson

The best agentic workflows in 2026 are not about giving one model a huge prompt and hoping it behaves like an entire engineering team.

They are about small roles, durable handoffs, and verifiable progress.

One agent can be a QA builder.

Another can be a fixer.

Another can be a reviewer.

Another can write release notes.

The connective tissue does not have to be fancy. Sometimes it is just markdown, git, and a few strict rules.

That is the part I find most interesting.

Agentic development is not replacing software engineering discipline. It is making the discipline more important.

The agent can run the loop faster than you can. But you still need to define the loop.

In this case, the loop was:

Build -> hit a wall -> report precisely -> stop -> fix -> retry

That is not just an experiment.

That is a useful 2026 engineering pattern.