The Three Failure Modes of a Contract Index

The DarJS contract system had 100% coverage. Every public function had a @reuse-when field. Every module had @tags. The TF-IDF index was built and the dar find command was running. By every metric the system was complete.

Then Ahmed asked: is this actually enough for a small NLP model to help juniors?

The answer, after running real queries, was no. And the reasons why reveal three independent ways a retrieval system can fail that coverage metrics never catch.

Failure One: The Index Points at the Wrong Data

dar find loaded its contracts from _publish/contracts.json. That file was built by tools/build-contracts.js, which read .d.ts type declaration files from five packages. 110 contracts.

Meanwhile, a different tool — tools/extract-contracts-full.js, built in the previous session — read JS source files across all eleven packages and produced _publish/contracts-graph.json. 238 contracts. The wizard. The CBVs. The SchemaAdapter. The MCP tools. The inspector. Everything added in the last three sessions of work.

dar find never saw any of it. The pointer was never updated.

The query dar find "multi step form wizard" returned MultiStorable at 27%. dar find "protect route managers only" returned requireRole at 9% — routing decision: “generate new.” Both of those answers are wrong. The first is noise. The second is correct but at such low confidence that the system routes away from it.

The system wasn’t broken in the sense that the code had a bug. It was broken in the sense that it was answering questions from a stale snapshot while a newer, more complete snapshot existed and was simply not wired in.

This is the first failure mode: the index points at the wrong data source. It’s silent. Metrics look fine. The system runs. The answers are wrong.

The fix was a one-line path change in packages/nlp/index.js. Ten minutes to diagnose, ten seconds to fix. But the damage — every question about wizard forms, CBVs, and schema adapters routing to “generate new” for however long the system was in use — was proportional to how long it went undetected.

Failure Two: Coverage Is Not Vocabulary

After fixing the index, the query dar find "timestamps on my model" returned Approvable at 14%. The Trackable mixin — which is exactly what you need for timestamps — wasn’t in the index at all. No @contract block. No @reuse-when. Not findable.

But even among contracts that were in the index, something subtler was wrong. The query dar find "protect route managers only" had requireRole in the index with a full @reuse-when field. The score was still 9%. The routing decision was “generate new.” The contract existed; the retrieval failed.

Why? The @reuse-when field said: “You need to restrict a route to specific named roles”. The query said: “protect route managers only.” The words don’t overlap. “Restrict” and “protect” are synonyms in context but share no tokens. “Named roles” and “managers” share nothing. TF-IDF is a token overlap model. Synonyms don’t help it.

This is the vocabulary gap that article 012 named. But running real queries against a real index makes it concrete in a way the principle doesn’t: you can have 100% @reuse-when coverage and still have a retrieval system that fails on the most obvious queries, because the fields were written from the implementor’s perspective rather than the caller’s.

The implementor writes: “You need to restrict a route to specific named roles.”
The caller types: “protect route managers only.”

Both are correct descriptions of the same thing. TF-IDF treats them as unrelated documents.

The fix is not better algorithms. The fix is writing @reuse-when the way the caller would ask, not the way the implementor would describe. “I need to protect a route so only certain roles can access it” matches “protect route managers only” because both use the word “protect.” That’s the entire mechanism. The vocabulary in the field has to be the vocabulary of the person who doesn’t know the function exists yet.

Failure Three: The Missing Second Layer

Even after fixing the index pointer and improving the vocabulary, some queries still returned wrong answers:

dar find "how do I add a field to a model"  →  getFields() adapter (wrong)
dar find "how do I write a test scenario"   →  runScenario() MCP tool (wrong)
dar find "multi step form wizard"           →  PageDefRenderer.formContext (partially right, but unhelpful)

These queries are structurally different from the others. “Add timestamps to a model” is a feature query — there’s a function that does that. “How do I add a field to a model” is a procedure query — there’s a workflow that does that: edit the model file, run dar schema prisma, run npx prisma db push.

No function in the codebase is called “addFieldToModel.” The action is a sequence of developer steps, not a single callable. The contract corpus had 245 function contracts describing things that exist in the code. It had zero entries describing how to accomplish tasks using those things.

This is the second-layer gap: a contract corpus needs function contracts and procedure contracts. Function contracts answer “what does this do?” Procedure contracts answer “how do I accomplish this?”

Juniors and NLP models hit procedure questions constantly, because most of the time they’re not looking for a function — they’re trying to figure out what to do. “How do I add a field” is the first question anyone asks when they start building on a framework. Without procedure contracts, every such query routes to “generate new,” which means LLM tokens, which means latency and cost, for questions that have a five-line answer.

The fix was a packages/cli/procedures.js file with thirteen @contract blocks written as coordinator roles. Each block describes one common task, has an @example showing the CLI command or code pattern, and has a @reuse-when written in the language a confused junior would use: “I need to add a field to a model, or I want to extend a model with more data, or I changed my model schema and need to update the database.”

After the fix:

dar find "add a field to a model"         → addFieldToModel    53% ✓ reuse as-is
dar find "run migration database"         → runMigration       64% ✓ reuse as-is
dar find "multi step form wizard"         → configureWizard    90% ✓ reuse as-is
dar find "define transitions states"      → defineTransitions  68% ✓ reuse as-is

What Coverage Metrics Don’t Measure

The three failure modes have a property in common: none of them show up in coverage reports.

“100% @reuse-when coverage” doesn’t tell you whether the index is current. It doesn’t tell you whether the vocabulary matches the queries. It doesn’t tell you whether procedure-level knowledge exists at all.

The only measurement that tells you whether a retrieval system works is running it against real queries — the actual questions a junior developer or an AI model would ask — and reading the answers.

That test reveals things the metrics can’t:

Is the right data loaded? (structural correctness)
Does the vocabulary match the queries? (semantic alignment)
Are the concepts present? (knowledge completeness)

The score threshold tells you one more thing: a correct match at 9% confidence routes to “generate new” just like no match at all. Correctness at low confidence is indistinguishable from absence to any routing layer. The match has to be strong enough to trigger reuse, which means the vocabulary has to be close enough to produce a score above the threshold, which means writing @reuse-when from the caller’s vocabulary is not optional — it’s the mechanism.

The Pattern

Build a contract index. Build it from the right source. Write the vocabulary from the caller’s perspective. Include procedure contracts alongside function contracts.

Then run real queries. Not coverage checks. Not field-presence metrics. Actual questions a junior or an AI model would ask when they don’t know the answer. If the first result is wrong, or the score is below 30%, or the routing decision says “generate new” for something you know exists — that’s a failure, regardless of what the coverage report says.

The gap between “the system exists” and “the system works” is always a test you have to run, not a metric you have to measure.

Next: TBD