The NLP-First Codebase: Replacing LLM Calls with Retrieval

Most of what developers ask an LLM isn’t a generation problem. It’s a retrieval problem wearing a generation costume.

“What already exists that adds timestamps to a model?” That’s not a question that requires GPT-4. It requires an index and a query. Yet it gets routed to an LLM, burns tokens, and returns an answer that may or may not reflect what’s actually in your codebase.

I built a system in DarJS that intercepts those calls before they reach the LLM. It extracts 110 structured contracts from the framework packages, builds a TF-IDF index over them, and routes each query to either a retrieval result or an LLM — depending on whether the answer already exists.

The hit rate for simple retrieval queries: above 80%.

Retrieval vs. generation

The distinction matters because the two paths have different costs, failure modes, and latencies.

Generation synthesizes something that doesn’t exist yet. It’s appropriate when the codebase has no answer. The LLM has to write new code, invent a solution, reason about tradeoffs.

Retrieval finds something that already exists. The answer is already in the codebase, correctly implemented, tested, and named. The only task is finding it.

When you route a retrieval query to a generative model, you get hallucinated function names, invented APIs, and suggestions that duplicate what’s already there. Not because the LLM is failing — because you sent the wrong query to the wrong tool.

The NLP-first approach reorders the pipeline: retrieval first, generation only on miss.

What @reuse-when does that normal docs don’t

Documentation describes what a function does. @reuse-when describes when you should reach for it — written as a plain-English condition someone would say before writing new code.

Every public function in DarJS packages now carries a @contract JSDoc block. The retrieval trigger is @reuse-when:

/**
 * @contract
 * @role        hook
 * @domain      core
 * @does        Injects four audit timestamp/user fields and populates them automatically via beforeCreate and beforeUpdate hooks.
 * @tags        audit, created_at, updated_at, created_by, updated_by, timestamp, mixin, track
 * @reuse-when  You need automatic created/updated timestamps and user tracking on any model without writing hooks manually
 * @complexity  simple
 * @example     class Invoice extends Model.with(Trackable) {}
 * @module      @darjs/mixins
 */

A JSDoc description reads: “Injects four audit fields via lifecycle hooks.” A @reuse-when reads: “You need automatic created/updated timestamps and user tracking on any model without writing hooks manually.”

The first is written from the implementor’s perspective. The second is written from the caller’s perspective, using the vocabulary they’d use when searching. That gap — implementor language vs. caller language — is exactly what makes code hard to find without an LLM.

@reuse-when bridges it. It’s the field that gets queried.

The three-layer output

The first layer is the index itself. tools/build-contracts.js walks all .d.ts files across the five DarJS packages and extracts the contract blocks into a flat JSON index. One entry looks like this:

{
  "role": "hook",
  "domain": "core",
  "does": "Injects four audit timestamp/user fields and populates them automatically via beforeCreate and beforeUpdate hooks.",
  "tags": "audit, created_at, updated_at, created_by, updated_by, timestamp, mixin, track",
  "reuse-when": "You need automatic created/updated timestamps and user tracking on any model without writing hooks manually",
  "complexity": "simple",
  "example": "class Invoice extends Model.with(Trackable) {}",
  "module": "@darjs/mixins"
}

110 entries, one per public function or class. The build step takes under a second.
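The extraction step can be sketched in a few lines. This is a minimal illustration, not the code in tools/build-contracts.js: parseContractBlock is a hypothetical helper, it assumes one "@field value" pair per comment line, and it leaves out the part that walks the .d.ts files.

```javascript
// Hypothetical helper, not the code in tools/build-contracts.js.
// Assumes one "@field value" pair per comment line.
function parseContractBlock(comment) {
  if (!/@contract\b/.test(comment)) return null; // not a contract block

  const contract = {};
  for (const rawLine of comment.split('\n')) {
    // Strip comment decoration such as "/**", " * " and " */".
    const line = rawLine.replace(/^\s*\/?\*+\/?\s?/, '').trim();
    const match = line.match(/^@([\w-]+)\s+(.+)$/);
    if (match) contract[match[1]] = match[2];
  }
  return contract;
}

// A shortened copy of the Trackable contract from earlier in the post.
const sample = [
  '/**',
  ' * @contract',
  ' * @role        hook',
  ' * @reuse-when  You need automatic created/updated timestamps and user tracking on any model without writing hooks manually',
  ' * @example     class Invoice extends Model.with(Trackable) {}',
  ' * @module      @darjs/mixins',
  ' */',
].join('\n');

const parsed = parseContractBlock(sample);
console.log(JSON.stringify(parsed, null, 2));
```

The bare @contract marker carries no value, so it only gates parsing and never becomes a key in the entry.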

The second layer is retrieval. packages/nlp/ runs TF-IDF over the index with zero dependencies: no sentence transformers, no embeddings API. The query is tokenized and stemmed (8 suffix rules, for example timestamps→timestamp, tracking→track, queries→query). Field weights are unequal: @reuse-when ×3, @does ×2, @tags ×2. The plain-English trigger field should dominate because it was written for exactly this query shape.
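To make the weighting concrete, here is a rough sketch of field-weighted TF-IDF with cosine scoring. It is not the packages/nlp implementation. The three suffix rules are an illustrative subset rather than the real eight, the smoothed IDF formula is an assumption, and the second contract is made up for contrast.

```javascript
// Illustrative suffix rules only; the real stemmer has 8 rules.
const SUFFIX_RULES = [
  [/ies$/, 'y'], // queries -> query
  [/ing$/, ''],  // tracking -> track
  [/s$/, ''],    // timestamps -> timestamp
];

function stem(token) {
  for (const [pattern, replacement] of SUFFIX_RULES) {
    const stemmed = token.replace(pattern, replacement);
    // Accept the first rule that fires and leaves at least 3 characters.
    if (stemmed !== token && stemmed.length >= 3) return stemmed;
  }
  return token;
}

function tokenize(text) {
  return (text.toLowerCase().match(/[a-z0-9_]+/g) || []).map(stem);
}

// Field weights from the post: @reuse-when x3, @does x2, @tags x2.
const FIELD_WEIGHTS = { 'reuse-when': 3, does: 2, tags: 2 };

// Weighted term counts for one contract.
function termFrequencies(contract) {
  const tf = new Map();
  for (const [field, weight] of Object.entries(FIELD_WEIGHTS)) {
    for (const term of tokenize(contract[field] || '')) {
      tf.set(term, (tf.get(term) || 0) + weight);
    }
  }
  return tf;
}

function rank(query, contracts) {
  const docs = contracts.map(termFrequencies);
  // Smoothed inverse document frequency (an assumed formula).
  const idf = (term) => {
    const df = docs.filter((d) => d.has(term)).length;
    return Math.log((1 + docs.length) / (1 + df)) + 1;
  };
  const toVector = (tf) => {
    const vec = new Map();
    for (const [term, count] of tf) vec.set(term, count * idf(term));
    return vec;
  };
  const norm = (vec) =>
    Math.sqrt([...vec.values()].reduce((sum, v) => sum + v * v, 0));

  const queryTf = new Map();
  for (const term of tokenize(query)) {
    queryTf.set(term, (queryTf.get(term) || 0) + 1);
  }
  const q = toVector(queryTf);

  return contracts
    .map((contract, i) => {
      const d = toVector(docs[i]);
      let dot = 0;
      for (const [term, w] of q) dot += w * (d.get(term) || 0);
      const denom = norm(q) * norm(d);
      return { contract, score: denom === 0 ? 0 : dot / denom };
    })
    .sort((a, b) => b.score - a.score);
}

const contracts = [
  {
    does: 'Injects four audit timestamp/user fields and populates them automatically via beforeCreate and beforeUpdate hooks.',
    tags: 'audit, created_at, updated_at, created_by, updated_by, timestamp, mixin, track',
    'reuse-when': 'You need automatic created/updated timestamps and user tracking on any model without writing hooks manually',
    module: '@darjs/mixins',
  },
  {
    // Made-up contract, present only to give the ranking something to reject.
    does: 'Paginates query results with limit and offset.',
    tags: 'pagination, limit, offset, page',
    'reuse-when': 'You need to split long result lists into pages',
    module: '@darjs/example',
  },
];

const ranked = rank('track who created a record', contracts);
console.log(ranked.map((r) => `${r.contract.module} ${r.score.toFixed(2)}`));
```

Because "track" appears in both @reuse-when and @tags, it counts five times in the Trackable entry, which is the effect the unequal weights are meant to produce.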

The third layer is the routing decision. Each result carries a score and a disposition:

dar find — "track who created a record"
Searched 110 contracts across 5 packages

1. Injects four audit timestamp/user fields and populates them automatically...
   [hook] core · @darjs/mixins
   ✓ reuse as-is  score 52%
   example: class Invoice extends Model.with(Trackable) {}
   reuse when: You need automatic created/updated timestamps and user tracking...
   import: require('@darjs/mixins')

Routing decision: ✓ reuse as-is
Copy the @example line and wire @returns → @takes across your pipeline.

Three outcomes: nlp-reuse (copy the example), nlp-verify (found a candidate, check field types), llm-generate (nothing exists, generate new code). The routing is per-result, not per-query.

The confidence-lead insight

Absolute TF-IDF scores break down on short queries.

“Add timestamps” is two tokens. With so few terms to intersect with, no contract reaches a high absolute cosine score, so a match at 20% might be exactly correct. A threshold-based router that requires 45% for direct reuse would still hand that match to a model.

The fix is to route on the relative gap between the top two results, not the absolute score of the first:

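// Maps one search result to a routing outcome.
// matchScore is a fraction (0.45 = 45%); confidentLead is true when the
// top result scores at least 1.8x the runner-up.
function deriveRouting(contract, matchScore, confidentLead = false) {
  const complexity = contract.complexity || 'simple';
  // Complex contracts always go to generation.
  if (complexity === 'complex')                       return 'llm-generate';
  // Moderate contracts are never reused blindly: verify or generate.
  if (complexity === 'moderate' && matchScore < 0.4)  return 'llm-generate';
  if (complexity === 'moderate')                      return 'nlp-verify';
  // Simple contracts: a clear lead lowers the reuse threshold from 0.45 to 0.15.
  if (confidentLead && matchScore >= 0.15)            return 'nlp-reuse';
  if (matchScore >= 0.45)                             return 'nlp-reuse';
  if (matchScore >= 0.15)                             return 'nlp-verify';
  return 'llm-generate';
}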

confidentLead is true when the top result scores ≥1.8× the second-place result. If one contract clearly outranks the rest, that is a confident match — just at a different scale than a long-query match. The threshold drops from 0.45 to 0.15.

Short-query escalation to the LLM drops significantly. “Add timestamps” now routes to nlp-reuse at a 22% absolute score because second place sits at 11%: a 2× lead, which clears the 1.8× bar.
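The lead check itself is small. The sketch below repeats deriveRouting so it runs on its own and adds a hypothetical hasConfidentLead helper. The 1.8 ratio comes from the text above; the helper's name and its handling of empty or single-result lists are assumptions.

```javascript
// Repeated from the post so this sketch is self-contained.
function deriveRouting(contract, matchScore, confidentLead = false) {
  const complexity = contract.complexity || 'simple';
  if (complexity === 'complex')                       return 'llm-generate';
  if (complexity === 'moderate' && matchScore < 0.4)  return 'llm-generate';
  if (complexity === 'moderate')                      return 'nlp-verify';
  if (confidentLead && matchScore >= 0.15)            return 'nlp-reuse';
  if (matchScore >= 0.45)                             return 'nlp-reuse';
  if (matchScore >= 0.15)                             return 'nlp-verify';
  return 'llm-generate';
}

const LEAD_RATIO = 1.8; // ratio stated in the post

// Hypothetical helper. Expects scores sorted from highest to lowest.
function hasConfidentLead(scores) {
  if (scores.length === 0) return false;
  const first = scores[0];
  const second = scores.length > 1 ? scores[1] : 0;
  // A zero top score is never a lead, even if nothing else matched.
  return first > 0 && first >= LEAD_RATIO * second;
}

// The "add timestamps" case: 22% for the top result, 11% for the runner-up.
const scores = [0.22, 0.11];
const lead = hasConfidentLead(scores);
const simple = { complexity: 'simple' };

console.log(deriveRouting(simple, scores[0], lead));  // uses the relative gap
console.log(deriveRouting(simple, scores[0], false)); // absolute threshold only
```

With the lead, the 0.22 match clears the lowered 0.15 bar and is reused. Without it, the same score stops at nlp-verify.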

What this means for agent architectures

The standard agentic pattern sends a natural-language description of a need to an LLM and waits for it to generate or find an answer. That’s appropriate when the answer doesn’t exist. It’s wasteful — and often wrong — when the answer is already in the codebase.

An agent that knows its own codebase should query the contract index before calling the LLM. The flow:

1. Receive a sub-task: “add audit tracking to the Invoice model”.
2. Query the contract index with the natural-language need.
3. If nlp-reuse: extract the @example and wire it in, with no LLM call.
4. If nlp-verify: show the candidate to a small model for type-checking only.
5. If llm-generate: send to the LLM, write the code, then write a new contract for it.
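The steps above can be sketched as a small planning function that decides what the agent does next without calling any model itself. planAction and the shape of its return value are hypothetical. The sketch assumes the search results arrive sorted, each with a routing field produced by the router.

```javascript
// Hypothetical planner: turns search results into the agent's next action.
// It only decides; executing the action (and any model call) happens elsewhere.
function planAction(need, results) {
  const top = results[0];
  // No result at all is treated the same as a miss.
  const routing = top ? top.routing : 'llm-generate';

  switch (routing) {
    case 'nlp-reuse':
      // Copy the example straight from the contract, no model involved.
      return { action: 'insert-example', code: top.contract.example, model: null };
    case 'nlp-verify':
      // Hand only the candidate to a small model for type-checking.
      return { action: 'verify-candidate', candidate: top.contract, model: 'small' };
    default:
      // Nothing reusable: generate, then record a contract for the new code.
      return { action: 'generate', prompt: need, writeContract: true, model: 'llm' };
  }
}

const need = 'add audit tracking to the Invoice model';
const hit = {
  routing: 'nlp-reuse',
  contract: { example: 'class Invoice extends Model.with(Trackable) {}' },
};

const reusePlan = planAction(need, [hit]);
const missPlan = planAction(need, []);
console.log(reusePlan.action, missPlan.action);
```

Keeping the decision separate from execution makes it easy to count how many sub-tasks never reach a model.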

The LLM’s role narrows to genuine generation: the cases where nothing exists. Every retrieval hit lowers the session’s token spend and removes one source of hallucinated API usage.

The broader principle: an LLM operating in a structured codebase shouldn’t treat the codebase as opaque. It should have a queryable model of what exists. Contracts are that model — structured enough for TF-IDF retrieval, rich enough to reconstruct usage from @example + @returns + @takes alone.

The codebase stops being a black box you ask the LLM to describe. It becomes a typed index you can search directly. The LLM handles everything the index can’t — which, after 110 contracts across five packages, turns out to be less than you’d expect.