I Built a Framework as a Deliberate Stress Test for Agentic Development. Here's What the Data Says. |

I didn’t build DarJS because I needed another framework.

I built it because I wanted to know what actually happens when you push agentic development past the demo — past the single-file autocomplete, past the “generate this component” sessions — into a sustained, multi-phase, multi-agent build where the AI has to remember decisions made six weeks ago and the whole thing has to hold together under real QA pressure.

Eleven phases. 1708 tests. Three complete QA runs with separate Builder and Fixer agents. A post-mortem that identified the same root cause appearing five different times in five different forms.

This is what I found.

The Experiment Setup

The hypothesis was specific: if you design a system correctly for agentic consumption — layered architecture, machine-readable contracts, structured health checks — can an AI agent build a real production app against it using nothing but tool calls?

Not “can AI help you code.” That question is settled. The harder question: can AI be the primary executor of a defined build protocol, calling tools, following phases, stopping at blockers, reporting issues — with a human as orchestrator, not typist?

graph LR
    subgraph Framework["DarJS Framework (packages/)"]
        MCP["MCP Server\n23 tools"]
        CLI["CLI Layer"]
        Core["Core Layer\nModels, Mixins, Adapters"]
    end
    subgraph Protocol["Build Protocol"]
        Phases["11 Phases\nSpec-driven exit criteria"]
        Agents["2 Agent Roles\nBuilder + Fixer"]
        Health["Health Check\nSession start protocol"]
    end
    subgraph App["QA Build (apps/)"]
        Builder["Builder Agent\nCalls tools only\nStops at blockers"]
        Fixer["Fixer Agent\nReads reports\nFixes framework"]
    end
    Framework --> App
    Protocol --> App
    Builder -->|"issue report"| Fixer
    Fixer -->|"fix commit"| Framework

The framework was purpose-built to be agent-driven. Every CLI command has --json output. Every model has a machine-readable contract. There’s a health command that returns structured status per check. The MCP server wraps all of it as callable tools.

This is not a prototype. It’s an instrument for a specific measurement.

What Karpathy Got Right — The Part Most People Skip

Karpathy’s “vibe coding” framing landed everywhere in 2025. What got less attention was the practice underneath it.

Look at what he actually builds: nanoGPT, micrograd, llm.c. Every one of these is a from-scratch implementation of something that already exists in a better form somewhere else. He could have used PyTorch’s full stack. He chose to build the minimal version himself, every line understood, nothing borrowed.

That’s not vibe coding. That’s controlled construction for the purpose of deep understanding.

The DarJS build was exactly this. I could have used an existing framework. The point was to build from the ground up — model layer, mixin system, adapter pattern, CLI, MCP server — so that every design decision was deliberate and every failure mode was legible.

Karpathy’s insight that transfers directly: you cannot debug what you did not build. When the Builder agent reported that getMixinVocab() was returning stale data, I could trace it to a single wrong assumption in the prototype chain walk — because I had written that walk. A team using a third-party framework would have filed a bug report with the upstream maintainer.

The vibe coding framing was a deliberate provocation. The underlying practice is meticulous.

Where the Current Methodology Cracks — Three Documented Failure Modes

After three QA runs, the same failure pattern appeared repeatedly. Not different bugs — the same structural failure in three different parts of the system.

flowchart TD
    Root["Root Cause:\nTwo representations of the same truth"]
    
    F1["Failure 1\nMixin vocab hardcoded\ninstead of derived\n→ stale data returned to agent"]
    F2["Failure 2\nMemoryAdapter behavior\ndivergent from PrismaAdapter\n→ unit tests passed, prod failed"]
    F3["Failure 3\nContract index read from\ntwo sources simultaneously\n→ inconsistent search results"]
    F4["Failure 4\nPageDef generator included\nfields it should have excluded\n→ PIN exposed in list view"]
    F5["Failure 5\nCLI tool returned success\nbefore side effect completed\n→ agent moved on, state unchanged"]

    Root --> F1
    Root --> F2
    Root --> F3
    Root --> F4
    Root --> F5

    style Root fill:#4f1e1e,color:#fff
    style F1 fill:#3a1e1e,color:#ddd
    style F2 fill:#3a1e1e,color:#ddd
    style F3 fill:#3a1e1e,color:#ddd
    style F4 fill:#3a1e1e,color:#ddd
    style F5 fill:#3a1e1e,color:#ddd

Five failures. One root cause: a second representation of data that was allowed to drift from the source.

What makes this interesting is that none of these were caught by the unit test suite. All 1708 tests were passing when the Builder agent hit these blockers. Unit tests verify that each piece works in isolation. They say nothing about whether two pieces that work separately work together — which is exactly what an agent doing a multi-phase build finds out.

This is the specific failure mode that current agentic methodology does not address well. The agent doesn’t know which source to trust. It trusts the most recent one. If the most recent one is stale, the agent builds on stale ground — confidently, for several phases, before anything visibly breaks.

The Karpathy Debugging Insight, Applied

Karpathy has said that in the AI era, debugging becomes more important, not less. I’d extend that: the ability to design systems that are debuggable by agents becomes a first-class architecture concern.

This is what led to the two-agent pattern — not from reading a methodology document, but from experiencing what happens when one agent tries to be both builder and debugger.

sequenceDiagram
    participant B as Builder Agent
    participant R as Issue Report
    participant F as Fixer Agent

    Note over B: Experiences the bug as a user would.\nCannot rationalize around it.\nIs simply blocked.
    B->>R: Exact call, expected, actual, blast radius
    Note over R: The report is precise because\nthe reporter had no internals access.\nImprecision = can't fix.
    F->>R: Reads the report
    Note over F: Has full internals access.\nCan trace to root cause.\nFixes without touching the build.
    F->>B: Fix committed, issue resolved
    B->>B: Resumes from the blocked phase

The separation isn’t about workflow cleanliness. It’s about epistemics.

A developer who built the system will rationalize around its bugs. “Oh, that’s probably the adapter, I’ll just work around it.” An agent with no internals access can’t do that — it is blocked, and it reports the blockage accurately. That precision is valuable. Contaminating it by giving the same agent the ability to edit the framework source removes the one thing that made the report useful.

Karpathy’s from-scratch builds give him the same kind of epistemic access. He knows exactly where to look because he built every layer. The two-agent pattern operationalizes this for teams where one agent plays the role of “user who doesn’t know the internals.”

Where DarJS Aligns with 2026 Practices — And Where It Got There First

After completing the build, I ran a gap analysis against the 2026 field consensus — research from teams at Anthropic, LinearB, the AI Engineering World’s Fair, and GitHub’s developer experience data.

graph TB
    subgraph Converged["Independent Convergence — DarJS arrived here before the field named it"]
        C1["Health check as session start\n→ 2026 calls this 'queryable state'"]
        C2["Builder/Fixer role separation\n→ 2026 calls this 'bounded autonomy'"]
        C3["Contract system + MCP tools\n→ 2026 calls this 'context pipeline'"]
        C4["Phase specs with exit criteria\n→ 2026 calls this 'Spec-Driven Development'"]
    end
    subgraph Gaps["Gaps — what 2026 has that DarJS didn't"]
        G1["Explicit autonomy tier selection\n(Tier 1/2/3 per task)"]
        G2["Per-action verification loops\n(not just phase-exit checks)"]
        G3["Scope boundary in every spec\n(what is explicitly out of scope)"]
    end
    style Converged fill:#1e3a5f,color:#fff
    style Gaps fill:#3a3a1e,color:#fff

The convergences are more interesting than the gaps. The health-check-as-session-start pattern emerged from operational need: after three QA runs, reading the issue log to determine current state was too slow and too unreliable. The structured { status: ok | warn | error } health response emerged because agents need machine-readable answers, not prose summaries. The 2026 literature calls this “queryable state.” We arrived at it by eliminating what didn’t work.

The gaps are real and additive. None of them require rethinking the core approach — they’re refinements. Per-action verification loops would have caught several of the five failures earlier in the phase, before multiple downstream steps had built on stale state. Explicit scope boundaries in phase specs would have prevented two cases where the Builder agent drifted into work that belonged to a later phase.

The Finding That Changes How I Think About Frameworks

The most unexpected result from the experiment: the MCP server is not an interface to the framework — it is the framework, from the agent’s perspective.

When a Builder agent runs a build session, it never reads a source file. It never calls a function directly. Everything it knows about the system comes through the tool responses. The quality of those responses — their precision, their structure, their completeness — determines the quality of the build.

A framework designed for human developers optimizes for readability, good defaults, and helpful error messages. A framework designed for agent builders optimizes for something different: machine-readable output, explicit state, structured errors with status fields, commands that verify their own side effects.

graph LR
    subgraph Human["Framework for Human Developers"]
        H1["Readable error messages"]
        H2["Sensible defaults"]
        H3["Good docs"]
        H4["Helpful warnings"]
    end
    subgraph Agent["Framework for Agent Builders"]
        A1["Structured JSON output\nfor every read operation"]
        A2["Explicit state\n(health check)"]
        A3["Machine-readable errors\n(status field, not prose)"]
        A4["Tools that verify\ntheir own side effects"]
    end
    Human -.->|"optimizes for\nreadability"| HH[Human]
    Agent -.->|"optimizes for\nparseable certainty"| AA[Agent]
    style Human fill:#1e3a5f,color:#fff
    style Agent fill:#1e4f3a,color:#fff

These are not the same optimization target. Most frameworks are built for the first column. The interesting design space in 2026 is frameworks that satisfy both — or that consciously choose the second column because the agent is the primary consumer.

What I’d Do Differently — One Thing

If I ran the experiment again with everything I know now: integration tests at every phase handoff, written at the moment the handoff is wired.

Not at the end. Not “we’ll add integration coverage later.” At the moment Layer A’s output becomes Layer B’s input, you write the test that runs both in sequence with real filesystem state. The five root-cause failures I documented were all invisible to unit tests. All five would have been caught by integration tests at the handoff they crossed.

This is the one practice the 2026 methodology adds that would have materially changed the QA run results. Everything else was already there.

The Bottom Line

The thesis that motivated the experiment — that framework-first is the right approach for agent-driven development in 2026 — holds. The data supports it. Teams that give agents a stable, queryable, tool-wrapped surface to build against outperform teams running agents against raw codebases.

But the more interesting finding is narrower: the failure modes of agentic development are not random. They are structural. They appear at the same places in every build — layer boundaries, adapter seams, generator outputs, tool return values. They are invisible to unit tests and guaranteed to appear under real agent load.

Karpathy’s instinct to build things from scratch and understand every layer is not nostalgia for pre-AI craftsmanship. It’s the prerequisite for building systems where, when the agent reports a blocker, you can actually fix it.

The vibe is fine for exploration. The structure is what makes exploration repeatable.

DarJS is open source at github.com/techiediaries/darjs. The full methodology — phases, specs, health check design, two-agent protocol — is documented in the repo. The 14-part tutorial series covering the build from scratch launches on 10xdev.blog next month.