Everyone is talking about “AI-assisted development.” Far fewer people are talking about what happens when your AI agent hits a wall at phase 7 of 11 because two representations of the same data silently diverged three weeks ago.
This is a list built from that experience — 11 phases, 1708 tests, three complete QA runs, and a post-mortem that identified the same root cause appearing five different times in five different forms.
Every rule on this list is sourced from a documented failure. Not a hypothetical. A real broken build, a real blocked agent session, a real set of bugs that were invisible until they weren’t.
1. Separate Your Layers Before Writing Line One
The single most important architectural decision you will make is where the boundary between “framework logic” and “app code” sits — and you must make it before writing any code.
graph LR
subgraph packages["📦 Core Layer (framework)"]
M[Model / Logic]
V[Validators]
A[Adapters]
end
subgraph apps["🏗️ Integration Layer (app)"]
CLI[CLI Commands]
HTTP[HTTP Handlers]
DB[Database Wiring]
end
packages --> apps
style packages fill:#1e3a5f,color:#fff,stroke:#2d6a9f
style apps fill:#1e4f3a,color:#fff,stroke:#2d9a6a
The test for a clean boundary: Can you run your entire core layer test suite with zero network calls, zero database connections, and zero filesystem reads?
If no — the boundary is broken. Fix it before phase 2.
Why this matters for agents: An agent working in the app layer cannot break the framework. An agent working in the framework layer cannot accidentally ship app-specific assumptions into core. The separation is not aesthetics — it is the mechanism that makes agent role boundaries enforceable.
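One cheap way to approximate the clean-boundary test mechanically is a static scan of core-layer source for I/O imports. The sketch below is illustrative only: the function name, the list of forbidden modules, and the regular expression are all assumptions, and a regex scan like this will miss dynamic imports.

```javascript
// Hypothetical boundary check: flag core-layer source that imports I/O modules.
// The module list is an assumption; adjust it to whatever counts as I/O for you.
const FORBIDDEN = ["fs", "net", "http", "https", "child_process"];

function findForbiddenImports(source) {
  const hits = [];
  // Matches require("x") and `from "x"`, with an optional "node:" prefix.
  const pattern = /(?:require\(\s*|from\s+)["'](?:node:)?([\w/.-]+)["']/g;
  for (const match of source.matchAll(pattern)) {
    const root = match[1].split("/")[0]; // "fs/promises" -> "fs"
    if (FORBIDDEN.includes(root)) hits.push(match[1]);
  }
  return hits;
}

const coreFile = `
const { validate } = require("./validators");
const fs = require("node:fs/promises");
`;
console.log(findForbiddenImports(coreFile)); // lists the one offending import
```

Run over every file in the core layer, a non-empty result is an early signal that the boundary has started to leak. It complements the real test (running the core suite with no I/O available) rather than replacing it.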
2. Write All Phase Specs Before Implementing Phase 1
Write your Phase 2 spec before you write a single line of Phase 1 code. Phase 2’s spec tells you what Phase 1 must produce. Design problems in Phase 1 are only visible from Phase 2’s perspective.
The corrected exit criterion format:
Phase 3 exit:
✓ 87/87 unit tests passing
✓ Cross-layer test: Phase 3 output is valid input for Phase 4
The cross-layer test is the one that matters. A test that asserts expect(result).toBeDefined() counts toward 87/87 exactly as much as a test that asserts the output satisfies its consumer, but only the second kind can catch a bug that lives at the layer boundary.
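To make the difference concrete, here is a minimal sketch of a weak check next to a cross-layer check. The producer (buildIndex), the consumer contract (assertValidIndexEntry), and the data shapes are hypothetical stand-ins for whatever Phase 3 and Phase 4 are in your project.

```javascript
// Hypothetical Phase 3 producer: turns model definitions into index entries.
function buildIndex(models) {
  return models.map((m) => ({ name: m.name, fields: Object.keys(m.fields) }));
}

// Hypothetical Phase 4 input contract: what the consumer actually requires.
function assertValidIndexEntry(entry) {
  if (typeof entry.name !== "string" || entry.name === "") {
    throw new Error("entry.name must be a non-empty string");
  }
  if (!Array.isArray(entry.fields) || entry.fields.length === 0) {
    throw new Error(`entry "${entry.name}" must list at least one field`);
  }
}

const index = buildIndex([{ name: "User", fields: { id: "int", email: "string" } }]);

// Weak check: still true even if every entry were malformed.
console.log(index !== undefined);

// Cross-layer check: throws unless each entry is usable by the consumer.
index.forEach(assertValidIndexEntry);
```

The useful property is that the contract lives with the consumer, so when Phase 4 tightens its requirements, the Phase 3 exit criterion tightens with it.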
2026 addition: Each phase spec now needs two fields that most teams skip:
- Scope boundary — what is explicitly out of scope for this phase
- Prior decisions — decisions already locked that this phase cannot reopen
Without these, agents re-derive scope on every session start, sometimes arriving at different answers.
3. One Source of Truth — No Exceptions, No Compromises
Before creating any derived representation of existing data, ask:
Can this be derived from the live implementation at call time?
If yes — derive it. Never maintain a separate copy.
The failure mode is quiet. Both representations look correct in isolation. They diverge over time. You don’t notice until an agent uses the stale one and produces output that is confidently, invisibly wrong.
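A small sketch of the difference, assuming a hypothetical convention where each model class exposes a static fields map. The class and helper names are illustrative.

```javascript
// Hypothetical model class; the static `fields` map is an assumed convention.
class User {
  static fields = { id: "int", email: "string", createdAt: "date" };
}

// Derived on every call, so it cannot go stale.
function fieldNames(Model) {
  return Object.keys(Model.fields);
}

// The pattern to avoid: a hand-maintained copy that nothing forces to stay in sync.
const USER_FIELDS_COPY = ["id", "email"];

console.log(fieldNames(User));  // reflects the class as it is right now
console.log(USER_FIELDS_COPY);  // already missing createdAt
```

Both values look reasonable in isolation, which is exactly why the stale one survives review.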
flowchart TD
Q{Can this be derived\nat call time?}
Q -->|Yes| D[Derive it live\nNever duplicate]
Q -->|No| N{Unavoidable\nduplicate}
N --> A[Option A: Add @depends-on annotation\n+ pre-commit hook check]
N --> B[Option B: Write divergence test\nparametrized over the registry]
D --> OK[✓ Safe]
A --> OK
B --> OK
style OK fill:#1e4f3a,color:#fff
style D fill:#1e3a5f,color:#fff
When two representations are unavoidable: annotate the dependency and enforce it at commit time with a pre-commit hook that verifies the dependency still holds. Without enforcement — not a manual convention, actual enforcement — they will diverge. Always. Within weeks.
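Option B from the diagram can be sketched as a single function that walks the registry and compares each live definition with its copy. The registry shape, the index shape, and the function name are assumptions made for illustration.

```javascript
// Hypothetical registry of live model definitions.
const registry = {
  User: { fields: { id: "int", email: "string" } },
  Order: { fields: { id: "int", total: "decimal", userId: "int" } },
};

// The unavoidable second representation (for example, a generated search index).
const fieldIndex = {
  User: ["id", "email"],
  Order: ["id", "total"], // stale: userId was added to the model later
};

// Returns one message per model whose copy no longer matches the live definition.
function findDivergence(registry, index) {
  const problems = [];
  for (const [name, model] of Object.entries(registry)) {
    const live = Object.keys(model.fields).sort();
    const copied = [...(index[name] ?? [])].sort();
    if (JSON.stringify(live) !== JSON.stringify(copied)) {
      problems.push(`${name}: ${copied.length} fields in index, ${live.length} in model`);
    }
  }
  return problems;
}

const problems = findDivergence(registry, fieldIndex);
console.log(problems);
// In a pre-commit hook, a non-empty list would exit non-zero to block the commit.
```

Because the check iterates the registry instead of naming models one by one, a newly added model is covered without anyone remembering to write a test for it.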
4. Write the Fake Adapter First, Run the Same Tests Against Both
For every external dependency (database, filesystem, network), define an interface. Write a fake in-memory implementation before writing the real one.
The fake must be a full implementation — not a stub returning hardcoded values, but code that actually applies the same logic the real adapter would.
The four behavioral categories where fake and real adapters silently diverge:
| Category | Fake adapter | Real adapter |
|---|---|---|
| Type coercion | No coercion | DB coerces "42" → 42 |
| JSON columns | Returns object directly | Requires explicit parse/stringify |
| Null vs undefined | Often returns undefined | Returns null for missing fields |
| Constraint timing | Checks immediately | Enforces at commit time |
None of these are caught until you run against the real system — after significant work has been built on top of the wrong assumption.
The fix: write a shared contract suite that both the fake and the real adapter must pass. If the fake passes and the real adapter fails — the adapters have diverged. Fix before shipping.
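A minimal sketch of such a contract suite follows. The adapter interface (put and get) is an assumption, and the "JSON-backed" adapter is only a stand-in that mimics how a JSON column serializes values; it is not a real database driver.

```javascript
// A deliberately sloppy fake: stores references and returns undefined for misses.
function createSloppyFake() {
  const rows = {};
  return {
    put: (id, record) => { rows[id] = record; },
    get: (id) => rows[id],
  };
}

// Stand-in for a real adapter: stores rows as JSON text, like a JSON column would.
function createJsonBackedAdapter() {
  const rows = new Map();
  return {
    put: (id, record) => { rows.set(id, JSON.stringify(record)); },
    get: (id) => (rows.has(id) ? JSON.parse(rows.get(id)) : null),
  };
}

// The shared contract: every adapter must pass every case.
const contract = [
  ["missing record is null, not undefined", (a) => a.get("nope") === null],
  ["stored record round-trips", (a) => {
    a.put("u1", { id: 42, tags: ["x"] });
    const got = a.get("u1");
    return got.id === 42 && Array.isArray(got.tags) && got.tags[0] === "x";
  }],
  ["caller mutation does not leak into storage", (a) => {
    const record = { id: 1 };
    a.put("u2", record);
    record.id = 2;
    return a.get("u2").id === 1;
  }],
];

// Runs each case against a fresh adapter and returns the names of failed cases.
function runContract(makeAdapter) {
  return contract
    .filter(([, check]) => !check(makeAdapter()))
    .map(([name]) => name);
}

console.log(runContract(createJsonBackedAdapter)); // failures for the JSON-backed adapter
console.log(runContract(createSloppyFake));        // failures for the sloppy fake
```

The contract is written once and takes a factory, so adding a third adapter later means one more runContract call rather than a third copy of the tests.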
5. Integration Tests at Every Layer Handoff — Not at the End
sequenceDiagram
participant L1 as Layer A
participant FS as Filesystem
participant L2 as Layer B
participant T as Integration Test
T->>L1: Write with real state
L1->>FS: Actual filesystem write
T->>L2: Read from same filesystem
L2->>FS: Reads what Layer A actually wrote
T->>T: Assert: every output item\nis a real input item
Note over T: Written when the handoff is wired,\nnot when the bug is discovered
Unit tests cannot find bugs at layer boundaries. Every time Layer A produces output consumed by Layer B, you need a test that runs both in sequence with real state — not mocks.
Three bug categories that are invisible to unit tests and guaranteed to appear at handoffs:
- Prototype chain behavior — unit tests inject fake class objects; real code walks prototype chains that are only constructed by actually requiring files
- Cross-phase state — unit tests for Phase A and Phase B never run in sequence; the seam has no test; the bug lives in the seam
- Runtime environment differences — behavior that is correct in one runner context is incorrect in another
These bugs are not rare edge cases. They appear at every significant layer boundary in every project.
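The handoff in the sequence diagram can be sketched with Node's built-in fs module and a temporary directory. Layer A and Layer B here (writeModels and readModelNames) are hypothetical; only the fs, os, and path calls are real APIs.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical Layer A: writes one JSON file per model into a directory.
function writeModels(dir, models) {
  for (const model of models) {
    fs.writeFileSync(path.join(dir, `${model.name}.json`), JSON.stringify(model));
  }
}

// Hypothetical Layer B: reads whatever is actually on disk.
function readModelNames(dir) {
  return fs
    .readdirSync(dir)
    .filter((file) => file.endsWith(".json"))
    .map((file) => JSON.parse(fs.readFileSync(path.join(dir, file), "utf8")).name)
    .sort();
}

// The handoff test: both layers in sequence, against a real temporary directory.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "handoff-"));
const input = [{ name: "User" }, { name: "Order" }];
writeModels(dir, input);
const output = readModelNames(dir);

// Every item Layer B reports must be an item Layer A was actually given.
const inputNames = input.map((m) => m.name);
const everyOutputIsRealInput = output.every((name) => inputNames.includes(name));
console.log({ output, everyOutputIsRealInput });

fs.rmSync(dir, { recursive: true, force: true });
```

Nothing is mocked: if Layer A changes its file naming or encoding, this test fails even while both layers' unit tests stay green.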
6. Design Your CLI for Machines on Day 1
Every read command gets a --json flag before you ship it. Not as an afterthought. Not “we’ll add it later.” Day one.
tool inspect --json # structured output an agent can parse
tool health --json # machine-readable status per check
tool find "query" --json # search with scores
A tool that produces only human-readable output cannot be reliably driven by an agent. Adding --json to an existing command later means changing its output format while trying not to break existing consumers. Adding it from the start costs nothing.
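One way to keep the flag cheap is to gather the data once and treat human and JSON output as two renderers over the same result. This is a sketch under that assumption; the inspect data and the render function are made up for illustration.

```javascript
// Hypothetical read command: one data-gathering step, two renderers.
function inspect() {
  return { models: 2, adapters: ["memory", "sqlite"], warnings: [] };
}

function render(result, { json = false } = {}) {
  // Machine output: the full result, nothing added, nothing dropped.
  if (json) return JSON.stringify(result);
  // Human output: a formatted view of the same object.
  return [
    `Models:   ${result.models}`,
    `Adapters: ${result.adapters.join(", ")}`,
    `Warnings: ${result.warnings.length}`,
  ].join("\n");
}

const json = process.argv.includes("--json");
console.log(render(inspect(), { json }));
```

Because both renderers read the same object, the JSON output cannot drift from what the human view shows.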
The ordering trap:
// WRONG: the early return skips the side effect
async function sync(data, { json, fix }) {
  if (json) return buildOutput(data) // writeChanges never runs when json=true
  if (fix) await writeChanges(data)
}
// CORRECT: side effects always run before any output is returned
async function sync(data, { json, fix }) {
  if (fix) await writeChanges(data)
  if (json) return buildOutput(data)
}
The early-return pattern produces tools that appear to succeed while doing nothing. The output looks correct. The filesystem is unchanged. This is the hardest failure mode to debug.
7. The Two-Agent Pattern — Roles That Cannot Bleed
When AI agents are building and testing against your system, enforce role separation structurally:
graph TB
subgraph Builder["🔨 Builder / QA Agent"]
B1[Works in app layer only]
B2[Follows phase protocol]
B3[Stops at first blocker]
B4[Files issue report]
end
subgraph Report["📋 Issue Report"]
R1[Exact call that failed]
R2[Expected vs actual]
R3[What downstream is blocked]
R4[status: open/resolved]
end
subgraph Fixer["🔧 Fixer Agent"]
F1[Works in framework only]
F2[Reads issue reports]
F3[Fixes core code]
F4[Never touches app layer]
end
Builder -->|writes| Report
Report -->|feeds| Fixer
style Builder fill:#1e3a5f,color:#fff
style Fixer fill:#4f1e1e,color:#fff
style Report fill:#3a3a1e,color:#fff
Why the separation: The builder agent experiences bugs as a user would. It cannot rationalize them away — it is blocked. This produces precise reports. The fixer has full internals context and can fix without breaking other consumers.
A vague issue report cannot be actioned. An exact call + expected vs actual + status field can be triaged in seconds.
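If reports are stored as structured data, the "can this be actioned?" question can be checked before the fixer ever reads it. The field names below are assumptions that follow the four boxes in the diagram; the example report reuses the sync ordering bug from rule 6.

```javascript
// Hypothetical report schema: exact call, expected vs actual, what is blocked, status.
const REQUIRED = ["call", "expected", "actual", "blocks", "status"];
const STATUSES = ["open", "resolved"];

function reportProblems(report) {
  const problems = REQUIRED
    .filter((key) => report[key] === undefined || report[key] === "")
    .map((key) => `missing: ${key}`);
  if (report.status !== undefined && !STATUSES.includes(report.status)) {
    problems.push(`status must be one of: ${STATUSES.join(", ")}`);
  }
  return problems;
}

const vague = { actual: "sync seems broken", status: "open" };
const exact = {
  call: "sync({ json: true, fix: true })",
  expected: "changes written to disk, then JSON returned",
  actual: "JSON returned, no files changed",
  blocks: "Phase 5, which reads the files this call should have written",
  status: "open",
};

console.log(reportProblems(vague)); // what the vague report is missing
console.log(reportProblems(exact)); // empty when the report is actionable
```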
8. Three Tool Failure Modes — Test for All Three
Every tool has three ways to report one thing while doing another:
| Failure mode | Symptom | How teams miss it | How to catch it |
|---|---|---|---|
| Does nothing | Tool reports success, state unchanged | Only check return value | Check state after the call |
| Returns wrong data | Tool returns stale or incorrect data | Trust the output | Check output against live source |
| Reports error while succeeding | Tool throws, but side effects already ran | Assume error = no effect | Check state even when the tool throws |
Every operation that has a side effect needs a test that checks the state — not just the return value.
The first mode is the one that blocks agent sessions silently. The agent calls the tool. The tool “succeeds.” The agent moves to the next phase. Phase 5 fails because Phase 3 never actually wrote the file it appeared to write.
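A state-checking test can be sketched with two deliberately misbehaving tools and a helper that inspects the store regardless of what the tool said. All three functions are hypothetical, and a Map stands in for the filesystem.

```javascript
// Hypothetical tool that reports success without writing anything.
function brokenWrite(store, key, value) {
  return { ok: true };
}

// Hypothetical tool that throws after its side effect already ran.
function throwingWrite(store, key, value) {
  store.set(key, value);
  throw new Error("post-write hook failed");
}

// Verifies the state itself, whether the tool returned or threw.
function callAndVerify(tool, key, value) {
  const store = new Map();
  let outcome;
  try {
    outcome = tool(store, key, value).ok ? "reported success" : "reported failure";
  } catch {
    outcome = "threw";
  }
  return { outcome, stateChanged: store.get(key) === value };
}

console.log(callAndVerify(brokenWrite, "phase3.json", "{}"));
console.log(callAndVerify(throwingWrite, "phase3.json", "{}"));
```

In both cases the reported outcome and the observed state disagree, which is the signal a return-value-only test can never see.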
9. Health as the Session Start Protocol
Your system must answer “what is the current state?” without the caller reading source files, documentation, or historical logs.
flowchart LR
Start([New agent session]) --> H[Run health check]
H --> Green{All checks\npassing?}
Green -->|Yes| Build[Proceed with build]
Green -->|No| Stop[Stop — known regression exists\nFix before new build]
Stop --> Report[Read issue log\nFix root cause]
Report --> H
style Green fill:#1e4f3a,color:#fff
style Stop fill:#4f1e1e,color:#fff
The health check IS the session start protocol. Every new agent session calls it first. If health is green, proceed. If health is red, don’t start a new build — there is a known regression.
Required output format:
{
"adapter_parity": { "status": "ok" },
"vocab_alignment": { "status": "error", "detail": "Model X: 3 fields in index, 15 in class" },
"contract_index": { "status": "ok" },
"open_issues": { "status": "warn", "count": 2 }
}
An issue log is not a health check. Reading history to determine current state is expensive, error-prone, and gets worse as the project ages. A health command that runs in under 2 seconds and returns structured JSON is the only mechanism that scales.
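A health command along these lines can be a thin aggregator over independent check functions. The sketch below assumes each check returns an object with a status field, as in the output format above; the check bodies are placeholders, and treating "warn" as non-blocking is an assumed policy, not something the format dictates.

```javascript
// Hypothetical checks; real ones would inspect adapters, indexes, and the issue log.
const checks = {
  adapter_parity: () => ({ status: "ok" }),
  contract_index: () => ({ status: "ok" }),
  open_issues: () => ({ status: "warn", count: 2 }),
};

function runHealth(checks) {
  const report = {};
  for (const [name, check] of Object.entries(checks)) {
    try {
      report[name] = check();
    } catch (err) {
      // A crashing check is itself a red result, not a reason to abort the report.
      report[name] = { status: "error", detail: String(err.message) };
    }
  }
  return report;
}

// Assumed policy: any "error" blocks the session; "warn" does not.
function canProceed(report) {
  return Object.values(report).every((result) => result.status !== "error");
}

const report = runHealth(checks);
console.log(JSON.stringify(report, null, 2));
console.log(canProceed(report) ? "proceed" : "stop");
```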
10. Spec-Driven Development — Context Is the Product
In 2026, the dominant insight across every team doing serious agentic work is this: prompt engineering is the wrong lever. Phrasing the request better gets you 5–10% improvement. Giving the agent the right context gets you 3–10× improvement.
graph TB
subgraph Always["Always Injected (session start)"]
A1[Layer boundaries]
A2[Phase spec + exit criteria]
A3[Prior decisions locked]
A4[Health status]
end
subgraph OnDemand["On Demand (per task)"]
B1[Relevant contracts]
B2[Adjacent code symbols]
B3[Issue reports for this area]
end
subgraph Never["Never Injected"]
C1[Full codebase]
C2[Resolved issues]
C3[Completed phase specs]
end
Always --> Agent([AI Agent])
OnDemand --> Agent
style Always fill:#1e3a5f,color:#fff
style OnDemand fill:#1e4f3a,color:#fff
style Never fill:#3a1e1e,color:#fff
The spec format that works:
- What this phase builds (one sentence)
- Exit criterion — specific passing test counts + cross-layer tests
- Scope boundary — what is explicitly out of scope
- Prior decisions already locked
- Task breakdown — sub-tasks, each with a single verifier
- Verification criteria — per-action, not just phase exit
Teams using this format report 20–45% faster cycle times and 3–10× higher first-pass agent success rates. The improvement isn’t from better prompting. It’s from better context engineering.
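If specs are kept as data rather than free text, the format itself can be linted before an agent session starts. This is a sketch only: the field names are assumptions that mirror the list above, and counting an empty list as missing is a choice made here so that scope and prior decisions are always stated explicitly.

```javascript
// Hypothetical phase spec schema; not a published format.
const REQUIRED_FIELDS = ["builds", "exitCriteria", "outOfScope", "priorDecisions", "tasks"];

function specProblems(spec) {
  // Strings and arrays both have .length, so empty values count as missing.
  const problems = REQUIRED_FIELDS
    .filter((field) => spec[field] === undefined || spec[field].length === 0)
    .map((field) => `missing: ${field}`);
  // Every sub-task needs a verifier.
  for (const task of spec.tasks ?? []) {
    if (!task.verifiedBy) problems.push(`task "${task.name}" has no verifier`);
  }
  return problems;
}

const phase3 = {
  builds: "The contract index for all registered models.",
  exitCriteria: ["87/87 unit tests passing", "Phase 3 output is valid input for Phase 4"],
  outOfScope: ["HTTP handlers"],
  priorDecisions: [],
  tasks: [
    { name: "build index", verifiedBy: "index.test.js" },
    { name: "wire CLI" },
  ],
};

console.log(specProblems(phase3)); // one missing field, one task without a verifier
```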
The Meta-Rule
Every rule on this list is a correction to a failure that looked fine until it wasn’t. The failures had one thing in common: the thing that broke was invisible to the test suite that was passing.
Unit tests pass. Integration fails. The fake adapter passes. The real adapter fails. The tool reports success. The state is unchanged.
The pattern is always the same: a gap between what we checked and what we assumed was checked. Every rule on this list closes one of those gaps.
What’s Coming on 10xdev.blog
These 10 rules are the foundation of a full tutorial series launching on 10xdev.blog next month.
“Framework-First: The Better Way to Build in the Age of AI” is a 14-part series for developers who want to stop building apps and start building leverage.
graph LR
P0["Part 0\nWhy Framework-First\nWins in 2026"]
P1["Parts 1–3\nCore Layer:\nModels, Fields,\nMixins"]
P2["Parts 4–6\nAdapters,\nCLI, Generators"]
P3["Part 7\nMCP Server:\nThe AI Interface"]
P4["Parts 8–11\nAuth, UI, Scenarios,\nIntegration"]
P5["Parts 12–14\nObservability,\nSecurity, Ship It"]
P0 --> P1 --> P2 --> P3 --> P4 --> P5
style P3 fill:#4f2d1e,color:#fff,stroke:#9a6d2d
The series covers:
- Building the framework layer that AI agents can call directly via MCP tool calls
- Every bug story from this post — in depth, with the fix and the test that prevents it
- How to design CLI tools that work for both humans and agents
- The two-agent pattern: how to set up Builder and Fixer agents that never step on each other
- How to ship a real app using nothing but tool calls from an AI agent — zero manual file editing
Each part comes with runnable code, a test suite, and the exact decision log from when it was built.
Subscribe at 10xdev.blog — Part 0 drops next week.
These rules come from building DarJS — a model-driven framework for the agentic era. 11 phases, 1708 tests, and every mistake documented.