9 Experiments on AI Agents and Architecture — Including What Failed

Note: archway has since been renamed to verikt. The content below reflects the original project name at the time of writing.

I kept having the same conversation with my AI agent. Every session. “This project uses hexagonal architecture. Domain logic goes in domain/. HTTP handlers go in adapter/http/. Don’t put business logic in the handler.”

The agent would acknowledge it, build something reasonable, and then quietly put the order validation inside the HTTP handler anyway. Next session, same conversation. Next engineer on the team, same conversation again.

I wanted to stop guessing whether context injection actually worked and start measuring it. So I designed 9 controlled experiments. Same model — Claude Sonnet 4.6 via the CLI — same task, one variable changed. Condition A gets no architectural context. Condition B gets the output of archway guide: a generated file containing the architecture pattern, installed capabilities, dependency rules, and explicit prohibition rules. archway check measures violations on the output. Every experiment persists the full prompt, raw response, and generated files. Anyone can rerun them.

I expected the guide to help. I didn’t expect to learn more from where it failed.

The finding that changed how I write guides

EXP-01 was the baseline: build an order management service in Go, no architecture hints in the task, one variable — guide or no guide. The task described only the business problem: create and retrieve orders, validate customer name, expose a REST API. The kind of thing you’d get from a ticket.

Without the guide, the agent produced everything in package main. With the guide, it produced adapter/, domain/, port/, service/. 0% compliance vs 100%. That was the expected result.
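To make that split concrete, here is a minimal single-file sketch of the kind of layering the guided run produced. It is not the experiment output: the names (Order, NewOrder, OrderRepository, OrderService, memoryRepo, createOrderHandler) and the placeholder order ID are hypothetical, and the comments mark which directory each piece would live in.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// --- domain/ : business rules only, no HTTP or storage concerns ---

var ErrEmptyCustomerName = errors.New("customer name is required")

type Order struct {
	ID           string `json:"id"`
	CustomerName string `json:"customer_name"`
}

// NewOrder is the single place where order validation lives.
func NewOrder(id, customerName string) (Order, error) {
	if strings.TrimSpace(customerName) == "" {
		return Order{}, ErrEmptyCustomerName
	}
	return Order{ID: id, CustomerName: customerName}, nil
}

// --- port/ : interfaces the service depends on ---

type OrderRepository interface {
	Save(o Order) error
}

// --- service/ : orchestrates the domain through ports ---

type OrderService struct{ repo OrderRepository }

func (s OrderService) Create(id, customerName string) (Order, error) {
	o, err := NewOrder(id, customerName)
	if err != nil {
		return Order{}, err
	}
	if err := s.repo.Save(o); err != nil {
		return Order{}, err
	}
	return o, nil
}

// --- adapter/ : storage and HTTP implementations ---

type memoryRepo struct{ orders map[string]Order }

func (r *memoryRepo) Save(o Order) error {
	r.orders[o.ID] = o
	return nil
}

// createOrderHandler translates HTTP to a service call and back.
// It holds no business rules of its own.
func createOrderHandler(svc OrderService) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var req struct {
			CustomerName string `json:"customer_name"`
		}
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, "invalid JSON", http.StatusBadRequest)
			return
		}
		o, err := svc.Create("order-1", req.CustomerName) // placeholder ID for the sketch
		switch {
		case errors.Is(err, ErrEmptyCustomerName):
			http.Error(w, err.Error(), http.StatusUnprocessableEntity)
		case err != nil:
			http.Error(w, "internal error", http.StatusInternalServerError)
		default:
			w.WriteHeader(http.StatusCreated)
			_ = json.NewEncoder(w).Encode(o)
		}
	}
}

func main() {
	svc := OrderService{repo: &memoryRepo{orders: map[string]Order{}}}
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/orders", strings.NewReader(`{"customer_name":""}`))
	createOrderHandler(svc)(rec, req)
	fmt.Println("status for empty customer name:", rec.Code)
}
```

The unguided shape is the same logic with the TrimSpace check written directly inside the handler, which is the violation described at the top of this post.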

What I didn’t expect was UUID v7. The guide mentioned it five times — always as a suggestion. “Consider using UUID v7 for primary keys.” The agent ignored all five and used uuid.New() (UUID v4) every time.

I added one line: NEVER use uuid.New() — always use uuid.Must(uuid.NewV7()).

The agent followed it immediately. Every run. Without exception.

That single finding changed how I write every guide. Soft recommendations don’t work. Agents treat them as optional context, not instructions. Explicit NEVER rules work. Every anti-pattern that archway check detects now has a corresponding NEVER rule in the guide output. Prevention and detection stay in sync.
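As an illustration of the detection half of that pairing, here is a rough sketch of how a uuid.New() call could be flagged using only Go's standard go/parser and go/ast packages. This is not how archway check is implemented; findUUIDNewCalls and the sample source are hypothetical, and the match is purely syntactic, so an aliased import would slip through.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// findUUIDNewCalls returns the line number of every uuid.New() call in src.
// It matches the identifier "uuid" by name and does not resolve imports.
func findUUIDNewCalls(src string) ([]int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "input.go", src, 0)
	if err != nil {
		return nil, err
	}
	var lines []int
	ast.Inspect(file, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok {
			return true
		}
		sel, ok := call.Fun.(*ast.SelectorExpr)
		if !ok {
			return true
		}
		pkg, ok := sel.X.(*ast.Ident)
		if ok && pkg.Name == "uuid" && sel.Sel.Name == "New" {
			lines = append(lines, fset.Position(call.Pos()).Line)
		}
		return true
	})
	return lines, nil
}

// sample contains one prohibited call and one call in the form the NEVER rule asks for.
const sample = `package order

import "github.com/google/uuid"

func bad() string  { return uuid.New().String() }
func good() string { return uuid.Must(uuid.NewV7()).String() }
`

func main() {
	lines, err := findUUIDNewCalls(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println("uuid.New() found on lines:", lines)
}
```

The point of the pairing is that the same pattern appears twice: once as a NEVER line the agent reads before writing code, and once as a check that runs on whatever it wrote.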

There’s also a cost finding worth noting. The guide is served entirely from the prompt cache, so it adds no fresh-token overhead. The guided runs cost $0.0985 against $0.1547 for the unguided runs. That is 36% cheaper, not more expensive.

Consistency matters more than correctness

EXP-02 ran the same task three times per condition. The question wasn’t whether the guide helped — EXP-01 answered that. The question was whether it helped reliably.

Without the guide: [2, 1, 1] violations across three runs, all fails. With the guide: [0, 0, 0], all passes. Variance reduction: 22.2×.
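For readers who want to check the arithmetic, here is a small sketch of the variance calculation. The population variance of [2, 1, 1] is about 0.222 and the variance of [0, 0, 0] is exactly 0, so a plain ratio is undefined. The post does not say how the 22.2× figure is derived; one reading that reproduces it is to floor the guided variance at 0.01. That floor, and the helper names below, are assumptions made for this sketch.

```go
package main

import "fmt"

// populationVariance returns the mean squared deviation from the mean.
func populationVariance(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	var sum float64
	for _, x := range xs {
		sum += x
	}
	mean := sum / float64(len(xs))
	var sq float64
	for _, x := range xs {
		d := x - mean
		sq += d * d
	}
	return sq / float64(len(xs))
}

// ratioWithFloor divides a by b, using floor in place of b when b is smaller.
// The floor exists only to avoid dividing by zero; its value is an assumption.
func ratioWithFloor(a, b, floor float64) float64 {
	if b < floor {
		b = floor
	}
	return a / b
}

func main() {
	unguided := populationVariance([]float64{2, 1, 1})
	guided := populationVariance([]float64{0, 0, 0})
	const assumedFloor = 0.01 // not stated in the post

	fmt.Printf("unguided variance: %.3f\n", unguided)
	fmt.Printf("guided variance:   %.3f\n", guided)
	fmt.Printf("ratio with floor:  %.1fx\n", ratioWithFloor(unguided, guided, assumedFloor))
}
```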

The guide doesn’t just make agents better on average. It makes them consistently correct. That distinction matters more than I initially realized. A team of three engineers running the same agent on the same task without the guide will produce three subtly different architectures. With it, they get one.

The boundary

EXP-03 was designed to find where NEVER rules break down. The task: a job runner in Go, “run up to 3 jobs concurrently.” The guide says NEVER use naked goroutines.

Architecture violations dropped significantly with the guide — the structural result held. But anti-pattern violations didn’t move. Both conditions produced similar goroutine counts. The task explicitly requires concurrency. The agent produces goroutines because the task demands them.

This is the boundary. The guide prevents incidental anti-patterns — the patterns agents reach for by default. It cannot override what the task requires. For those cases, you need archway check in CI as a hard gate. Prevention and detection are complementary. Neither works alone.
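To show what the tension looks like in code, here is a sketch of a bounded runner for the "up to 3 jobs concurrently" task. It assumes "naked goroutine" means a goroutine launched with nothing waiting on it and no panic recovery; the post does not define the term, and runJobs is a hypothetical helper, not output from the experiment. Even this managed version contains a go statement, which is why a task that requires concurrency cannot avoid goroutines altogether.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// runJobs executes every job with at most limit running at once and returns
// after all of them have finished. Each goroutine is tracked by the WaitGroup
// and converts a panic into an error instead of crashing the process.
func runJobs(jobs []func() error, limit int) []error {
	if limit < 1 {
		limit = 1
	}
	errs := make([]error, len(jobs))
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup

	for i, job := range jobs {
		sem <- struct{}{} // blocks while limit jobs are in flight
		wg.Add(1)
		go func(i int, job func() error) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			defer func() {
				if r := recover(); r != nil {
					errs[i] = fmt.Errorf("job %d panicked: %v", i, r)
				}
			}()
			errs[i] = job()
		}(i, job)
	}

	wg.Wait()
	return errs
}

func main() {
	jobs := []func() error{
		func() error { return nil },
		func() error { return errors.New("job failed") },
		func() error { panic("unexpected state") },
		func() error { return nil },
	}
	for i, err := range runJobs(jobs, 3) {
		fmt.Printf("job %d: %v\n", i, err)
	}
}
```

Whether a checker should count this as a violation is a policy decision. A guide rule cannot make that call for the agent, which is the argument for a hard gate in CI.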

The original hypothesis that failed

EXP-05 was a 2×2: lazy prompt vs thorough prompt, crossed with guide vs no guide. The original hypothesis was that a lazy prompt with the guide would produce comparable results to a thorough prompt without it — B≈C.

That hypothesis failed. B=3 violations, C=7. Difference of 4, not the ≤1 I needed.

But the pivoted finding was sharper: both guide conditions (lazy and thorough prompt) produced hexagonal structure. Both no-guide conditions produced flat structures, even the thorough prompt that explicitly said “no business logic in HTTP handlers.” The thorough prompt without the guide produced more violations than the lazy one without it. Extra specificity didn’t help. The agent produced a different but equally non-conforming flat structure.

Engineers don’t need to write architecture instructions in every prompt. They need the guide loaded.

When the guide doesn’t help at all

This is the part I wouldn’t have published if I’d designed these experiments to sell something.

EXP-04 and EXP-07 both tested feature additions on an existing conforming hexagonal service. EXP-04: add order cancellation. EXP-07: add a discount system, three runs per condition.

Both experiments, both conditions: zero violations. The hypothesis for EXP-07 — that the guide would reduce variance on feature additions — was falsified. There was no variance to reduce. The existing architecture taught the agent the correct patterns without any help from the guide.

When the codebase is already well-structured, the code is the context. The guide is redundant.

Two of nine experiments showed no effect. That narrows the claim: the guide’s value is highest on greenfield tasks, unfamiliar codebases, and messy projects where the existing code doesn’t teach the right patterns. On clean services, the architecture speaks for itself.

What this means for archway

The guide was always designed as a context engineering tool — an enforcer that works through context injection, not a linter that runs after the fact. Its job is to tell agents what to do before they write code. archway check’s job is to catch what they do wrong after. The experiments validated that design: the guide’s failure on task-driven anti-patterns (EXP-03) is exactly the gap where archway check in CI picks up.

v0.2.0 ships alongside this post with a Rust analysis engine — tree-sitter parsing that makes adding new languages ~200 LOC instead of ~2,000. TypeScript is next.

brew install dcsg/tap/archway
archway guide

Full experiment results with per-experiment pages, methodology, and raw artifacts: archway.dcsg.me/experiments