10 Rules for Building Software AI Agents Can Actually Drive

Everyone is talking about “AI-assisted development.” Far fewer people are talking about what happens when your AI agent hits a wall at phase 7 of 11 because two representations of the same data silently diverged three weeks ago.

This is a list built from that experience — 11 phases, 1708 tests, three complete QA runs, and a post-mortem that identified the same root cause appearing five different times in five different forms.

Every rule on this list is sourced from a documented failure. Not a hypothetical. A real broken build, a real blocked agent session, a real set of bugs that were invisible until they weren’t.

1. Separate Your Layers Before Writing Line One

The single most important architectural decision you will make is where the boundary between “framework logic” and “app code” sits — and you must make it before writing any code.

graph LR
    subgraph packages["📦 Core Layer (framework)"]
        M[Model / Logic]
        V[Validators]
        A[Adapters]
    end
    subgraph apps["🏗️ Integration Layer (app)"]
        CLI[CLI Commands]
        HTTP[HTTP Handlers]
        DB[Database Wiring]
    end
    packages --> apps
    style packages fill:#1e3a5f,color:#fff,stroke:#2d6a9f
    style apps fill:#1e4f3a,color:#fff,stroke:#2d9a6a

The test for a clean boundary: Can you run your entire core layer test suite with zero network calls, zero database connections, and zero filesystem reads?

If no — the boundary is broken. Fix it before phase 2.

Why this matters for agents: An agent working in the app layer cannot break the framework. An agent working in the framework layer cannot accidentally ship app-specific assumptions into core. The separation is not aesthetics — it is the mechanism that makes agent role boundaries enforceable.
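One way to make the boundary check mechanical is to scan core-layer source for imports that imply I/O. The sketch below is only an illustration: the `FORBIDDEN` list and the `findForbiddenImports` helper are hypothetical names, and a regex scan like this will miss dynamic imports.

```javascript
// Hypothetical deny-list: Node built-ins the core layer should never import.
const FORBIDDEN = ["fs", "net", "http", "https", "child_process"];

// Returns every forbidden module specifier found in a source string.
// Handles require("x"), import ... from "x", and the "node:" prefix.
function findForbiddenImports(source) {
  const pattern = /(?:require\(\s*|from\s+)["'](?:node:)?([\w/]+)["']/g;
  const hits = [];
  for (const match of source.matchAll(pattern)) {
    const root = match[1].split("/")[0]; // "fs/promises" -> "fs"
    if (FORBIDDEN.includes(root)) hits.push(match[1]);
  }
  return hits;
}

const coreSource = 'import { validate } from "./validators";';
const leakySource =
  'const fs = require("fs");\nimport { get } from "node:https";';

console.log(findForbiddenImports(coreSource));  // no hits: relative import only
console.log(findForbiddenImports(leakySource)); // flags "fs" and "https"
```

Run as a pre-commit or CI step over the core layer's files, a check of this shape turns "the boundary is broken" from a review comment into a failing build.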

2. Write All Phase Specs Before Implementing Phase 1

Write your Phase 2 spec before you write a single line of Phase 1 code. Phase 2’s spec tells you what Phase 1 must produce. Design problems in Phase 1 are only visible from Phase 2’s perspective.

An exit criterion that only counts passing unit tests misses this. The corrected format adds a cross-layer test:

Phase 3 exit:
  ✓ 87/87 unit tests passing
  ✓ Cross-layer test: Phase 3 output is valid input for Phase 4

The cross-layer test is the one that matters. A test that asserts expect(result).toBeDefined() passes just as readily as a test that asserts the output satisfies its consumer, but it proves far less. It will let through every bug that lives at the layer boundary.
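The difference between the two kinds of check can be sketched in a few lines. Everything here is hypothetical: `buildIndex` stands in for a Phase 3 producer and `isValidPhase4Input` for whatever Phase 4 actually requires of its input.

```javascript
// Hypothetical Phase 3 producer: turns model definitions into index entries.
function buildIndex(models) {
  return models.map((m) => ({ name: m.name, fields: Object.keys(m.fields) }));
}

// Hypothetical Phase 4 contract: what the consumer requires of its input.
function isValidPhase4Input(index) {
  return (
    Array.isArray(index) &&
    index.every(
      (entry) =>
        typeof entry.name === "string" &&
        Array.isArray(entry.fields) &&
        entry.fields.length > 0
    )
  );
}

const result = buildIndex([
  { name: "User", fields: { id: "int", email: "string" } },
  { name: "Draft", fields: {} }, // a model with no fields slips through Phase 3
]);

const weakCheck = result !== undefined;             // the toBeDefined() style of check
const crossLayerCheck = isValidPhase4Input(result); // the consumer's own requirement

console.log({ weakCheck, crossLayerCheck }); // { weakCheck: true, crossLayerCheck: false }
```

The weak check is green while the output is unusable downstream. Only the check written from the consumer's side notices.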

2026 addition: Each phase spec now needs two fields that most teams skip:

Scope boundary — what is explicitly out of scope for this phase
Prior decisions — decisions already locked that this phase cannot reopen

Without these, agents re-derive scope on every session start, sometimes arriving at different answers.

3. One Source of Truth — No Exceptions, No Compromises

Before creating any derived representation of existing data, ask:

Can this be derived from the live implementation at call time?

If yes — derive it. Never maintain a separate copy.

The failure mode is quiet. Both representations look correct in isolation. They diverge over time. You don’t notice until an agent uses the stale one and produces output that is confidently, invisibly wrong.

flowchart TD
    Q{Can this be derived\nat call time?}
    Q -->|Yes| D[Derive it live\nNever duplicate]
    Q -->|No| N{Unavoidable\nduplicate}
    N --> A[Option A: Add @depends-on annotation\n+ pre-commit hook check]
    N --> B[Option B: Write divergence test\nparametrized over the registry]
    D --> OK[✓ Safe]
    A --> OK
    B --> OK
    style OK fill:#1e4f3a,color:#fff
    style D fill:#1e3a5f,color:#fff

When two representations are unavoidable: annotate the dependency and enforce it at commit time with a pre-commit hook that verifies the dependency still holds. Without enforcement — not a manual convention, actual enforcement — they will diverge. Always. Within weeks.
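Both branches of that decision can be sketched together. The registry, the model classes, and the `searchIndex` copy below are hypothetical stand-ins. The point is that the divergence test iterates over the live registry rather than a hand-written list of models.

```javascript
// Hypothetical model registry: the live implementation is the source of truth.
class User {
  static fields = { id: "int", email: "string" };
}
class Post {
  static fields = { id: "int", title: "string", body: "text" };
}
const registry = { User, Post };

// Preferred: derive the field list at call time instead of storing a copy.
function fieldNames(modelName) {
  return Object.keys(registry[modelName].fields).sort();
}

// Unavoidable duplicate (for example, a prebuilt search index).
// Post is deliberately stale here: "body" is missing.
const searchIndex = { User: ["email", "id"], Post: ["id", "title"] };

// Divergence test, parametrized over the registry rather than a fixed list.
function findDivergence(index) {
  const problems = [];
  for (const name of Object.keys(registry)) {
    const live = fieldNames(name);
    const copy = [...(index[name] ?? [])].sort();
    if (live.join(",") !== copy.join(",")) {
      problems.push(
        `${name}: ${copy.length} fields in index, ${live.length} in class`
      );
    }
  }
  return problems;
}

console.log(findDivergence(searchIndex));
// [ 'Post: 2 fields in index, 3 in class' ]
```

Because the loop walks the registry, a model added next month is covered automatically. A test that names each model by hand would itself become a second representation that can go stale.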

4. Write the Fake Adapter First, Run the Same Tests Against Both

For every external dependency (database, filesystem, network), define an interface. Write a fake in-memory implementation before writing the real one.

The fake must be a full implementation — not a stub returning hardcoded values, but code that actually applies the same logic the real adapter would.

The four behavioral categories where fake and real adapters silently diverge:

| Category | Fake adapter | Real adapter |
| --- | --- | --- |
| Type coercion | No coercion | DB coerces `"42"` → `42` |
| JSON columns | Returns object directly | Requires explicit parse/stringify |
| Null vs undefined | Often returns `undefined` | Returns `null` for missing fields |
| Constraint timing | Checks immediately | Enforces at commit time |

None of these are caught until you run against the real system — after significant work has been built on top of the wrong assumption.

The fix: write a shared contract suite that both the fake and the real adapter must pass. If the fake passes and the real adapter fails — the adapters have diverged. Fix before shipping.
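A minimal sketch of such a contract suite follows. The three checks and the adapter shape (`set`/`get`) are assumptions chosen to mirror the categories above, and two in-memory fakes stand in for the fake/real pair so the example stays self-contained. In a real project the same `runContract` call would also receive a factory for the real adapter.

```javascript
// Shared contract: each check takes a fresh adapter and returns true on pass.
const contract = {
  "missing key reads as null": (db) => db.get("nope") === null,
  "numeric strings are coerced": (db) => {
    db.set("n", "42");
    return db.get("n") === 42;
  },
  "objects round-trip by value": (db) => {
    const original = { tags: ["a"] };
    db.set("doc", original);
    original.tags.push("b"); // mutating the caller's object must not leak in
    return db.get("doc").tags.length === 1;
  },
};

// Runs every check against a fresh adapter; returns the names that failed.
function runContract(makeAdapter) {
  return Object.entries(contract)
    .filter(([, check]) => !check(makeAdapter()))
    .map(([name]) => name);
}

// Naive fake: stores whatever it is given.
function makeNaiveFake() {
  const rows = new Map();
  return { set: (k, v) => rows.set(k, v), get: (k) => rows.get(k) };
}

// Faithful fake: mimics the assumed behavior of the real store.
function makeFaithfulFake() {
  const rows = new Map();
  const coerce = (v) =>
    typeof v === "string" && /^-?\d+$/.test(v) ? Number(v) : v;
  return {
    set: (k, v) => rows.set(k, JSON.stringify(coerce(v))),
    get: (k) => (rows.has(k) ? JSON.parse(rows.get(k)) : null),
  };
}

console.log(runContract(makeNaiveFake));    // all three checks fail
console.log(runContract(makeFaithfulFake)); // []
```

The naive fake is the kind that makes every unit test pass and every assumption built on it wrong. The contract suite is what exposes it before the real adapter does.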

5. Integration Tests at Every Layer Handoff — Not at the End

sequenceDiagram
    participant L1 as Layer A
    participant FS as Filesystem
    participant L2 as Layer B
    participant T as Integration Test

    T->>L1: Write with real state
    L1->>FS: Actual filesystem write
    T->>L2: Read from same filesystem
    L2->>FS: Reads what Layer A actually wrote
    T->>T: Assert: every output item\nis a real input item
    Note over T: Written when the handoff is wired,\nnot when the bug is discovered

Unit tests cannot find bugs at layer boundaries. Every time Layer A produces output consumed by Layer B, you need a test that runs both in sequence with real state — not mocks.

Three bug categories that are invisible to unit tests and guaranteed to appear at handoffs:

Prototype chain behavior — unit tests inject fake class objects; real code walks prototype chains that are only constructed by actually requiring files
Cross-phase state — unit tests for Phase A and Phase B never run in sequence; the seam has no test; the bug lives in the seam
Runtime environment differences — behavior that is correct in one runner context is incorrect in another

These bugs are not rare edge cases. They appear at every significant layer boundary in every project.
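The sequence in the diagram can be sketched as a single test that runs both layers against a real temporary directory. `layerAWrite`, `layerBRead`, and the `manifest.json` file name are hypothetical. Only the Node built-ins `fs`, `os`, and `path` are real APIs.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical Layer A: writes a manifest of generated items to disk.
function layerAWrite(dir, items) {
  fs.writeFileSync(path.join(dir, "manifest.json"), JSON.stringify({ items }));
}

// Hypothetical Layer B: reads whatever Layer A actually wrote.
function layerBRead(dir) {
  const raw = fs.readFileSync(path.join(dir, "manifest.json"), "utf8");
  return JSON.parse(raw).items;
}

// Handoff test: run both layers in sequence against a real temp directory.
function handoffTest() {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "handoff-"));
  try {
    const input = [
      { id: 1, name: "User" },
      { id: 2, name: "Post" },
    ];
    layerAWrite(dir, input);
    const output = layerBRead(dir);
    const inputIds = new Set(input.map((item) => item.id));
    // Every output item must correspond to a real input item.
    return (
      output.length === input.length &&
      output.every((item) => inputIds.has(item.id))
    );
  } finally {
    fs.rmSync(dir, { recursive: true, force: true }); // clean up real state
  }
}

console.log(handoffTest());
```

No mocks appear anywhere in the test. If Layer A changes its file name or serialization, this is the test that fails, at the moment the handoff breaks rather than phases later.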

6. Design Your CLI for Machines on Day 1

Every read command gets a --json flag before you ship it. Not as an afterthought. Not “we’ll add it later.” Day one.

tool inspect --json      # structured output an agent can parse
tool health --json       # machine-readable status per check
tool find "query" --json # search with scores

A tool that produces only human-readable output cannot be reliably driven by an agent. Retrofitting --json onto an existing command means changing its output path without breaking the consumers that already depend on it. Building it in from the start costs almost nothing.
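A minimal sketch of what "day one" can look like: the command computes one data structure and both renderers read from it. The `render` function and the check names are hypothetical.

```javascript
// Hypothetical result of a read command such as `tool health`.
const healthRows = [
  { name: "adapter_parity", status: "ok" },
  { name: "open_issues", status: "warn", count: 2 },
];

// One data structure, two renderers: the JSON path is not an afterthought.
function render(rows, { json = false } = {}) {
  if (json) return JSON.stringify({ checks: rows });
  return rows
    .map((row) => `${row.status.toUpperCase().padEnd(5)} ${row.name}`)
    .join("\n");
}

console.log(render(healthRows)); // human-readable lines

const parsed = JSON.parse(render(healthRows, { json: true }));
console.log(parsed.checks[1].count); // 2
```

The early return here is safe only because a read command has no side effects to skip.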

The ordering trap:

// WRONG — early return runs before the side effect
async function sync({ json, fix }) {
  if (json) return buildOutput(data)  // fix never runs when json=true
  if (fix) await writeChanges(data)
}

// CORRECT — side effects always run first
async function sync({ json, fix }) {
  if (fix) await writeChanges(data)
  if (json) return buildOutput(data)
}

The early-return pattern produces tools that appear to succeed while doing nothing. The output looks correct. The filesystem is unchanged. This is the hardest failure mode to debug.

7. The Two-Agent Pattern — Roles That Cannot Bleed

When AI agents are building and testing against your system, enforce role separation structurally:

graph TB
    subgraph Builder["🔨 Builder / QA Agent"]
        B1[Works in app layer only]
        B2[Follows phase protocol]
        B3[Stops at first blocker]
        B4[Files issue report]
    end
    subgraph Report["📋 Issue Report"]
        R1[Exact call that failed]
        R2[Expected vs actual]
        R3[What downstream is blocked]
        R4[status: open/resolved]
    end
    subgraph Fixer["🔧 Fixer Agent"]
        F1[Works in framework only]
        F2[Reads issue reports]
        F3[Fixes core code]
        F4[Never touches app layer]
    end
    Builder -->|writes| Report
    Report -->|feeds| Fixer
    style Builder fill:#1e3a5f,color:#fff
    style Fixer fill:#4f1e1e,color:#fff
    style Report fill:#3a3a1e,color:#fff

Why the separation: The builder agent experiences bugs as a user would. It cannot rationalize them away — it is blocked. This produces precise reports. The fixer has full internals context and can fix without breaking other consumers.

A vague issue report cannot be actioned. An exact call + expected vs actual + status field can be triaged in seconds.
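That rule can be enforced rather than requested. The sketch below assumes a report shape with the fields from the diagram. The field names, the `validateIssueReport` helper, and the example command line are illustrative, not a documented format.

```javascript
// Hypothetical required fields, mirroring the issue report described above.
const REQUIRED = ["call", "expected", "actual", "blocks", "status"];
const STATUSES = ["open", "resolved"];

// Returns a list of problems; an empty list means the report is actionable.
function validateIssueReport(report) {
  const problems = REQUIRED.filter(
    (field) => report[field] === undefined || report[field] === ""
  ).map((field) => `missing field: ${field}`);
  if (report.status !== undefined && !STATUSES.includes(report.status)) {
    problems.push(`unknown status: ${report.status}`);
  }
  return problems;
}

const vague = { actual: "sync seems broken", status: "open" };
const precise = {
  call: "tool sync --fix --json", // hypothetical invocation
  expected: "changes written to disk, JSON summary returned",
  actual: "JSON summary returned, no files changed",
  blocks: "Phase 5 cannot read the generated index",
  status: "open",
};

console.log(validateIssueReport(vague));   // three missing fields
console.log(validateIssueReport(precise)); // []
```

Rejecting incomplete reports at write time keeps the Fixer from spending a session reconstructing what the Builder already knew.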

8. Three Tool Failure Modes — Test for All Three

Every tool has three ways to report one outcome while producing another:

| Failure mode | Symptom | How teams miss it | How to catch it |
| --- | --- | --- | --- |
| Returns nothing | Tool reports success, state unchanged | Only check return value | Check state after the call |
| Returns wrong data | Tool returns stale or incorrect data | Trust the output | Check output against live source |
| Reports error while succeeding | Tool throws, but side effects already ran | Assume error = no effect | Check state even when the tool throws |

Every operation that has a side effect needs a test that checks the state — not just the return value.

The first failure mode is the one that silently blocks agent sessions. The agent calls the tool. The tool “succeeds.” The agent moves to the next phase. Phase 5 fails because Phase 3 never actually wrote the file it appeared to write.
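All three modes are caught by the same habit: compare what the tool reported with the state after the call. In this sketch a `Map` stands in for real state, and the three tools are hypothetical, each built to exhibit one failure mode.

```javascript
// In-memory stand-in for real state (a filesystem, a database table).
const state = new Map();

// Three hypothetical tools, one per failure mode.
const tools = {
  // Reports success but never writes.
  silentNoop: (key, value) => ({ ok: true }),
  // Writes, but returns data that does not match what was written.
  staleReturn: (key, value) => {
    state.set(key, value);
    return { ok: true, value: "old" };
  },
  // Writes, then throws.
  throwsAfterWrite: (key, value) => {
    state.set(key, value);
    throw new Error("timeout");
  },
};

// Calls a tool and compares its report against the state afterwards.
function callAndVerify(tool, key, value) {
  state.delete(key);
  let reported;
  try {
    reported = tool(key, value);
  } catch (err) {
    reported = { ok: false, error: err.message };
  }
  const written = state.get(key) === value;
  const returnedMatches =
    !("value" in reported) || reported.value === state.get(key);
  return { reportedOk: reported.ok, written, returnedMatches };
}

for (const [name, tool] of Object.entries(tools)) {
  console.log(name, callAndVerify(tool, "phase3.json", "contents"));
}
```

Any row where `reportedOk` and `written` disagree, or where `returnedMatches` is false, is a tool whose report cannot be trusted on its own.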

9. Health as the Session Start Protocol

Your system must answer “what is the current state?” without the caller reading source files, documentation, or historical logs.

flowchart LR
    Start([New agent session]) --> H[Run health check]
    H --> Green{All checks\npassing?}
    Green -->|Yes| Build[Proceed with build]
    Green -->|No| Stop[Stop — known regression exists\nFix before new build]
    Stop --> Report[Read issue log\nFix root cause]
    Report --> H
    style Green fill:#1e4f3a,color:#fff
    style Stop fill:#4f1e1e,color:#fff

The health check IS the session start protocol. Every new agent session calls it first. If health is green, proceed. If health is red, don’t start a new build — there is a known regression.

Required output format:

{
  "adapter_parity":  { "status": "ok" },
  "vocab_alignment": { "status": "error", "detail": "Model X: 3 fields in index, 15 in class" },
  "contract_index":  { "status": "ok" },
  "open_issues":     { "status": "warn", "count": 2 }
}

An issue log is not a health check. Reading history to determine current state is expensive, error-prone, and gets worse as the project ages. A health command that runs in under 2 seconds and returns structured JSON is the only mechanism that scales.
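A health command of this shape can be quite small. In the sketch below the check functions are hypothetical placeholders that return fixed results, and treating only `"error"` as blocking (with `"warn"` allowed through) is an assumption, not something the format above prescribes.

```javascript
// Hypothetical checks: each returns { status, ...details }.
const healthChecks = {
  adapter_parity: () => ({ status: "ok" }),
  vocab_alignment: () => ({
    status: "error",
    detail: "Model X: 3 fields in index, 15 in class",
  }),
  open_issues: () => ({ status: "warn", count: 2 }),
};

// Runs every check and collects the results under the check's name.
function runHealth(checkMap) {
  const report = {};
  for (const [name, check] of Object.entries(checkMap)) {
    try {
      report[name] = check();
    } catch (err) {
      // A crashing check is itself a red result, not a reason to abort.
      report[name] = { status: "error", detail: String(err.message) };
    }
  }
  return report;
}

// Assumption: only "error" blocks a new build; "warn" is reported but allowed.
function canProceed(report) {
  return Object.values(report).every((result) => result.status !== "error");
}

const report = runHealth(healthChecks);
console.log(JSON.stringify(report, null, 2));
console.log(canProceed(report)); // false: vocab_alignment is red
```

An agent session would call this first and stop on `false`, which is the loop the flowchart describes.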

10. Spec-Driven Development — Context Is the Product

In 2026, the recurring insight among teams doing serious agentic work is this: prompt engineering is the wrong lever. Phrasing the request better gets you 5–10% improvement. Giving the agent the right context gets you 3–10× improvement.

graph TB
    subgraph Always["Always Injected (session start)"]
        A1[Layer boundaries]
        A2[Phase spec + exit criteria]
        A3[Prior decisions locked]
        A4[Health status]
    end
    subgraph OnDemand["On Demand (per task)"]
        B1[Relevant contracts]
        B2[Adjacent code symbols]
        B3[Issue reports for this area]
    end
    subgraph Never["Never Injected"]
        C1[Full codebase]
        C2[Resolved issues]
        C3[Completed phase specs]
    end
    Always --> Agent([AI Agent])
    OnDemand --> Agent
    style Always fill:#1e3a5f,color:#fff
    style OnDemand fill:#1e4f3a,color:#fff
    style Never fill:#3a1e1e,color:#fff
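The three tiers in the diagram can be read as a filter over project knowledge. The sketch below assumes a hypothetical `project` object and a task tagged with an `area`. The field names and sample values are illustrative only.

```javascript
// Hypothetical project knowledge, tagged by area.
const project = {
  layerBoundaries: "core layer vs integration layer",
  phaseSpec: { phase: 7, exit: "cross-layer test passes" },
  lockedDecisions: ["fake adapter ships before real adapter"],
  health: { adapter_parity: { status: "ok" } },
  contracts: [
    { area: "adapters", name: "StoreContract" },
    { area: "cli", name: "JsonOutput" },
  ],
  issues: [
    { area: "adapters", status: "open", title: "null vs undefined" },
    { area: "adapters", status: "resolved", title: "old coercion bug" },
    { area: "cli", status: "open", title: "--json skips --fix" },
  ],
};

function buildContext(project, task) {
  return {
    // Always injected at session start.
    layerBoundaries: project.layerBoundaries,
    phaseSpec: project.phaseSpec,
    lockedDecisions: project.lockedDecisions,
    health: project.health,
    // On demand: only what touches the task's area.
    contracts: project.contracts.filter((c) => c.area === task.area),
    // Resolved issues are never injected.
    issues: project.issues.filter(
      (issue) => issue.area === task.area && issue.status === "open"
    ),
  };
}

const context = buildContext(project, { area: "adapters" });
console.log(context.contracts.length, context.issues.length); // 1 1
```

The "never injected" tier shows up as absence: nothing in `buildContext` can reach resolved issues or unrelated contracts, so they cannot leak into a session by accident.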

The spec format that works:

What this phase builds (one sentence)
Exit criterion — specific passing test counts + cross-layer tests
Scope boundary — what is explicitly out of scope
Prior decisions already locked
Task breakdown — sub-tasks, each with a single verifier
Verification criteria — per-action, not just phase exit

Teams using this format report 20–45% faster cycle times and 3–10× higher first-pass agent success rates. The improvement isn’t from better prompting. It’s from better context engineering.
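As a rough illustration of the format, a phase spec can be held as data with one verifier per sub-task, so "verification criteria, per-action" becomes something a script can run. The spec contents and the `verifyTasks` helper are hypothetical.

```javascript
// Hypothetical phase spec following the six-part format above.
const spec = {
  builds: "A health command with structured JSON output",
  exitCriterion: {
    unitTests: "all passing",
    crossLayer: "health output parses in the session-start script",
  },
  outOfScope: ["automatic repair of failing checks"],
  lockedDecisions: ["output is a flat map of check name to result"],
  tasks: [
    {
      name: "emit JSON",
      verify: (out) => typeof JSON.parse(out) === "object",
    },
    {
      name: "include a status per check",
      verify: (out) =>
        Object.values(JSON.parse(out)).every((check) => "status" in check),
    },
  ],
};

// Per-action verification: each sub-task is checked by its single verifier.
function verifyTasks(spec, output) {
  return spec.tasks.map((task) => ({
    name: task.name,
    passed: task.verify(output),
  }));
}

const output = JSON.stringify({ adapter_parity: { status: "ok" } });
console.log(verifyTasks(spec, output)); // both tasks pass for this output
```

Keeping scope and locked decisions in the same object as the verifiers means a new session reads one artifact instead of re-deriving any of them.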

The Meta-Rule

Every rule on this list is a correction to a failure that looked fine until it wasn’t. The failures had one thing in common: the thing that broke was invisible to the test suite that was passing.

Unit tests pass. Integration fails. The fake adapter passes. The real adapter fails. The tool reports success. The state is unchanged.

The pattern is always the same: a gap between what we checked and what we assumed was checked. Every rule on this list closes one of those gaps.

What’s Coming on 10xdev.blog

These 10 rules are the foundation of a full tutorial series launching on 10xdev.blog next month.

“Framework-First: The Better Way to Build in the Age of AI” is a 14-part series for developers who want to stop building apps and start building leverage.

graph LR
    P0["Part 0\nWhy Framework-First\nWins in 2026"]
    P1["Parts 1–3\nCore Layer:\nModels, Fields,\nMixins"]
    P2["Parts 4–6\nAdapters,\nCLI, Generators"]
    P3["Part 7\nMCP Server:\nThe AI Interface"]
    P4["Parts 8–11\nAuth, UI, Scenarios,\nIntegration"]
    P5["Parts 12–14\nObservability,\nSecurity, Ship It"]
    P0 --> P1 --> P2 --> P3 --> P4 --> P5
    style P3 fill:#4f2d1e,color:#fff,stroke:#9a6d2d

The series covers:

Building the framework layer that AI agents can call directly via MCP tool calls
Every bug story from this post — in depth, with the fix and the test that prevents it
How to design CLI tools that work for both humans and agents
The two-agent pattern: how to set up Builder and Fixer agents that never step on each other
How to ship a real app using nothing but tool calls from an AI agent — zero manual file editing

Each part comes with runnable code, a test suite, and the exact decision log from when it was built.

Subscribe at 10xdev.blog — Part 0 drops next week.

These rules come from building DarJS — a model-driven framework for the agentic era. 11 phases, 1708 tests, and every mistake documented.