My AI-Augmented Coding Workflow

I've been trialling a specific setup on my personal side projects: Claude Code as the AI engine, Conductor to orchestrate multiple agents in parallel, and a structured workflow on top to keep everything from going sideways. Think of it a bit like an orchestra — the musicians are talented, but without a conductor, a shared score, and a rehearsal structure, you get noise instead of music. That's roughly what unstructured AI coding feels like, too.

Nothing revolutionary on its own — but when these pieces clicked together, it changed how I write software with AI tools in a pretty fundamental way.

Let me explain what I mean.


First, the problem

Here's a pattern I kept hitting: open an AI coding session, paste a ticket, get 200 lines of code, notice it looks mostly right, fix one thing, break another, go back and forth until the AI's context is a mess of contradictory corrections, give up and write it myself. Sound familiar?

That's what I'd call "vibe coding" — and it works fine for prototyping or tiny tasks. But the moment the codebase has any real history to it — existing patterns, architectural decisions made months ago, reasons why that component does it that way — vibe coding starts falling apart. The AI doesn't know what it doesn't know, and it guesses confidently.

The root problem isn't the AI tools. It's how we're using them. We're handing the world's fastest typist a vague request with no design brief, no style guide, and no definition of done — and then acting surprised when the output doesn't fit.


The shift I made: from developer to maestro

Here's the framing that actually changed things for me.

Before AI tools, engineers spent a lot of time on mechanical work: translating a clear requirement into code, writing boilerplate, hunting through docs, and debugging syntax. AI mostly eliminates that mechanical layer. What it can't do is decide what to build, understand why a decision was made two years ago, or know when "good enough" is genuinely good enough for this context.

So engineering judgment doesn't go away — it just moves. Instead of spending your energy on implementation, you spend it on direction. Writing precise specs. Validating AI output against intent. Decomposing complex problems into things the AI can actually execute cleanly.

That's the role shift. From developer to product engineer — someone who owns outcomes, not just code.


The three-layer framework I'm trialling

The workflow has three layers, each solving a different problem.

Layer 1 (Agent OS) teaches the AI how we build — persistent, version-controlled standards it can load at the start of any session.

Layer 2 (Spec-Driven Development via RPI) is where the spec gets produced: a three-phase Research → Plan → Implement process that ends with a human-approved spec before a single line of implementation code gets written.

Layer 3 (quality gates) makes sure what was built is actually correct — not just "it compiles."

Miss any one of them, and you get a specific, predictable kind of failure. Let me walk through each.


Layer 1: Agent OS — giving the AI a persistent memory

Every time you open a new AI coding session, the AI starts completely fresh. It has no memory of the architectural decision you made last sprint, the error handling pattern your team settled on, or why you're not using class components. Without something to address this, every session re-teaches that context from scratch — or more commonly, never teaches it at all, and the AI just guesses.

Agent OS is a lightweight open-source framework that solves this. The idea is simple: encode your team's standards in plain markdown files that live in your repo, and load them into every AI session. Because they're in git, they're versioned, reviewable, and shared across your whole team.

The structure has three parts:

  • Standards (techstack.md, codestyle.md, bestpractices.md) — how we build software in general
  • Product (mission.md, roadmap.md) — what we're building and why
  • Specs — what we're building next, per feature

It works with any AI coding tool. For Claude Code (which I use for personal projects), these files integrate through slash commands. For GitHub Copilot, you reference them directly in your chat prompts. The important thing is that every session starts from a shared, consistent foundation — instead of the AI starting from scratch and guessing.

Think of it the way you think about your linting config: it encodes style decisions so you don't relitigate them on every PR. Agent OS encodes architectural decisions so you don't re-explain them to every AI session.
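
To make "load them into every AI session" concrete, here is a minimal Python sketch of the underlying idea: read the markdown standards and join them into one preamble. The agent-os/standards/ location is the directory this post refers to later; the function names and the section layout are my own assumptions. Agent OS does this through its slash commands, not through code like this.

```python
from pathlib import Path


def read_standards(standards_dir: Path) -> dict[str, str]:
    """Read every markdown file in the standards directory, keyed by file name."""
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(standards_dir.glob("*.md"))
    }


def build_preamble(files: dict[str, str]) -> str:
    """Join the standards into one block, with one labelled section per file."""
    sections = [f"## {name}\n{body.strip()}" for name, body in sorted(files.items())]
    return "\n\n".join(sections)


# Hypothetical usage, run from the repo root:
# preamble = build_preamble(read_standards(Path("agent-os/standards")))
```

Because the input is just files in git, the preamble changes only when someone changes a standard in a reviewed commit.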


Layer 2: Spec-Driven Development — how the spec gets made

This is the highest-leverage part of the whole workflow, and the most misunderstood. Spec-Driven Development isn't a document format — it's a principle: nothing gets implemented until a human has reviewed and approved a technical spec. The spec covers requirements, component decomposition, state management approach, API contracts, error handling, and explicit acceptance criteria.

Here's the thing that makes this click: catching a misunderstanding in a spec takes about five minutes. Catching it after implementation, untangling it, and redirecting takes hours.

But the spec doesn't write itself. This is where RPI (Research, Plan, Implement) comes in. RPI is the three-phase process I use to produce the spec. It's not the implementation workflow; it's the spec-writing workflow. The output of a full RPI cycle is a human-approved spec that implementation can then run cleanly against.

Phase 1: Research

Open a fresh Claude Code session. Give it one instruction: read the relevant files and produce a factual report. No code. No opinions. No "here's how I'd approach it." Only what exists.

The output is a research.md document: what files are involved, what patterns are in use, what open questions need answering. Before moving on, do a quick FAR check — is this research Factual, Actionable, and Relevant?

This is where scope misunderstandings are cheapest to catch.

Phase 2: Plan

Fresh Claude Code session. The agent loads the research document and produces a plan.md: phased tasks, checkbox-tracked, with explicit success criteria and an out-of-scope list.

Before moving on, do a FACTS check — is the plan Feasible, Atomic, Complete, Testable, and Scoped?

Why a fresh session? The research session is done. Carrying its context forward pollutes planning with half-formed thoughts from the exploration phase. Planning needs a clean slate seeded only by the finished research.
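
The FAR and FACTS checks are human judgment calls, but it can help to treat them as hard gates: anything the reviewer has not explicitly confirmed blocks the next phase. A minimal sketch of that idea follows. The function name and the dictionary-of-answers shape are hypothetical and not part of any tool mentioned here.

```python
FAR = ("Factual", "Actionable", "Relevant")
FACTS = ("Feasible", "Atomic", "Complete", "Testable", "Scoped")


def failed_criteria(checklist: tuple[str, ...], answers: dict[str, bool]) -> list[str]:
    """Return every criterion the reviewer did not explicitly confirm.

    A criterion that is missing from `answers` counts as a failure, so
    skipping a question can never let a phase through by accident.
    """
    return [name for name in checklist if not answers.get(name, False)]


# Example: the plan is fine except that one task is too large to be atomic.
review = {"Feasible": True, "Atomic": False, "Complete": True, "Testable": True, "Scoped": True}
blockers = failed_criteria(FACTS, review)  # ["Atomic"]
```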

Phase 3: The spec

The plan feeds into the final spec document — the deliverable the human reviews and signs off on before implementation starts. During that review, you do one extra thing: for each acceptance criterion, you decide whether it's [auto] (verifiable by a test) or [manual] (requires human inspection — visual review, screen reader, device testing). This decision, made before implementation, determines what gets tested. Without it, test coverage is an afterthought decided by how much time is left at the end of the sprint.

An example:

| Criterion | Label | Why |
| --- | --- | --- |
| Scheduled articles excluded from feed before publish date | [auto] | Unit test against the query logic |
| Date picker is keyboard-accessible | [manual] | Requires a screen reader or a11y tooling |
| UI looks correct on 320px viewport | [manual] | Visual: requires browser inspection |
| No TypeScript errors | [auto] | tsc --noEmit in the build |
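
One way to make the [auto]/[manual] decision actionable is to split the criteria mechanically into a test backlog and a manual checklist. The sketch below assumes one criterion per line, ending in its label. That line format, the Criterion class, and both function names are my own illustration, not a format Agent OS prescribes.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    text: str
    label: str  # "auto" or "manual"


# Assumed format: "<criterion text> [auto]" or "<criterion text> [manual]".
LABELLED = re.compile(r"^(?P<text>.+?)\s*\[(?P<label>auto|manual)\]\s*$")


def parse_criteria(lines: list[str]) -> list[Criterion]:
    """Parse labelled criteria, refusing any line that has no label."""
    criteria = []
    for line in lines:
        match = LABELLED.match(line.strip())
        if match is None:
            raise ValueError(f"criterion has no [auto]/[manual] label: {line!r}")
        criteria.append(Criterion(match["text"], match["label"]))
    return criteria


def split_by_label(criteria: list[Criterion]) -> tuple[list[str], list[str]]:
    """Return (criteria that need a test, criteria that need a human)."""
    auto = [c.text for c in criteria if c.label == "auto"]
    manual = [c.text for c in criteria if c.label == "manual"]
    return auto, manual
```

Rejecting unlabelled lines is the point of the design: a criterion nobody classified is exactly the one that ends up untested.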

Once the spec is approved, implementation begins, and the spec is what every subsequent phase is measured against. Implementation runs in fresh Claude Code sessions per phase, each loading the spec and the standards and nothing else, checking off plan tasks and writing tests for every [auto] criterion alongside the code. Context hygiene matters here: in my experience, Claude Code stays coherent while its context window is below roughly 40% full. Let it fill with the full history from Research and Plan and output starts degrading mid-task. If that happens, the plan's checkboxes are the recovery mechanism: reload, see what's unchecked, resume.
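
The checkbox recovery step is simple enough to script. Here is a minimal sketch, assuming plan.md uses GitHub-style task list syntax ("- [ ]" for open, "- [x]" for done). The function name and the sample tasks are hypothetical.

```python
import re

# Matches "- [ ] task" and "- [x] task", including indented sub-tasks.
TASK = re.compile(r"^\s*[-*]\s+\[(?P<mark>[ xX])\]\s+(?P<title>.+)$")


def unchecked_tasks(plan_markdown: str) -> list[str]:
    """Return the titles of every task that is still unchecked, in file order."""
    remaining = []
    for line in plan_markdown.splitlines():
        match = TASK.match(line)
        if match and match["mark"] == " ":
            remaining.append(match["title"].strip())
    return remaining


# Hypothetical plan.md content for illustration.
plan = """\
# Phase 1
- [x] Add publish date to the article model
- [ ] Filter scheduled articles out of the feed query
  - [ ] Unit test the query logic
- [X] Update types
"""
todo = unchecked_tasks(plan)
```

A fresh session can be handed that list plus the spec and pick up where the degraded one left off.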

This is also where Conductor earns its place. Rather than running Research, Plan, and the implementation phases sequentially in a single terminal window, Conductor lets you spin up multiple Claude Code agents in parallel — each in its own isolated git worktree. Research for feature A runs at the same time as the implementation of feature B. It's like having a small team of focused agents, each with a clean context and a specific job, none of them stepping on each other.
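
For readers who haven't used git worktrees: each one is a separate checkout of the same repository on its own branch, which is what keeps parallel agents from overwriting each other's files. Conductor manages this for you; the sketch below only shows the equivalent manual setup with the standard git worktree add command. The sibling-directory naming and the agent/ branch prefix are my assumptions, not Conductor's conventions.

```python
import subprocess
from pathlib import Path


def worktree_command(repo: Path, task: str) -> list[str]:
    """Build the git command that gives one agent its own checkout and branch.

    `repo` should be an absolute path so the sibling directory is unambiguous.
    """
    checkout = repo.parent / f"{repo.name}-{task}"
    return [
        "git", "-C", str(repo),
        "worktree", "add",
        "-b", f"agent/{task}",  # new branch for this agent only
        str(checkout),          # new working directory next to the repo
    ]


def create_worktrees(repo: Path, tasks: list[str]) -> None:
    """Create one isolated worktree per task; raises if git reports an error."""
    for task in tasks:
        subprocess.run(worktree_command(repo, task), check=True)


# Hypothetical usage:
# create_worktrees(Path("/path/to/repo"), ["feature-a-research", "feature-b-implement"])
```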


Layer 3: Quality gates — explicit, not hoped-for

AI-generated code can pass the build and match the plan's tasks while silently missing what wasn't in the instructions: the edge case nobody mentioned, the keyboard handler that wasn't specified, the error state the spec didn't call out.

The spec review gate lives in Layer 2 — by the time implementation starts, [auto] and [manual] criteria are already decided. From there, three more gates run during and after implementation. Tests get generated alongside code in the implement phase, not in a cleanup pass afterwards. Then a pre-PR standards audit: a read-only Claude Code session reviews the diff against your standards files and lists violations — no fixes, just a list you act on. Finally, PR review, where the reviewer checks against the spec rather than instinct.

That standards audit gate consistently catches things that slip through: inconsistent error handling, missing TypeScript return types, interactive elements without keyboard support, hardcoded strings that should be constants.
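
The shape of that gate (read the diff, list violations, change nothing) can be illustrated with a toy version. To be clear, the real audit is a Claude Code session reading your standards files. The two regex rules below are crude, hypothetical stand-ins for two of the violations mentioned above, and they will miss or misflag plenty of real code.

```python
import re

# Hypothetical rules for illustration only.
RULES = [
    # A function body opens straight after the parameter list: no return type.
    ("missing return type", re.compile(r"\bfunction\s+\w+\s*\([^)]*\)\s*\{")),
    # A <div> with onClick but no onKeyDown in the same tag.
    (
        "click handler without keyboard support",
        re.compile(r"<div\b(?=[^>]*\bonClick=)(?![^>]*\bonKeyDown=)"),
    ),
]


def added_lines(diff: str):
    """Yield (line number within the diff, text) for each line the diff adds."""
    for number, line in enumerate(diff.splitlines(), start=1):
        if line.startswith("+") and not line.startswith("+++"):
            yield number, line[1:]


def audit(diff: str) -> list[str]:
    """List violations in added lines. Read-only: it never proposes a fix."""
    findings = []
    for number, text in added_lines(diff):
        for name, pattern in RULES:
            if pattern.search(text):
                findings.append(f"diff line {number}: {name}")
    return findings
```

Keeping the audit read-only matters: the output is a list a human triages, not a second round of unreviewed changes.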


What a day actually looks like

This is the comparison I find most useful. Not before/after AI — before/after a structured AI workflow.

Without structure: Open Copilot, paste ticket, get code, something's wrong, fix it, the fix broke something else, context fills up with contradictions, start over, give up, write it manually. End of day: variable output, standards drift accumulating.

With the workflow: Read ticket, run /shape-spec in Claude Code, answer six questions, review and approve the spec (twenty minutes). New Claude Code session, run research (eight minutes), FAR check. New session, run plan, FACTS check. Three implement phases via Conductor — each agent in its own worktree, each with tests passing before moving on. Standards compliance audit catches two things: a missing return type and a missing aria-label. Fix both in ten minutes. Raise PR.

The PR reviewer has a clear checklist from the spec. The review takes fifteen minutes. First pass.

That morning: one feature implemented, tested, PR'd, reviewed. Afternoon: same again.


The tooling stack

The two tools doing the heavy lifting here are Claude Code and Conductor.

Claude Code is Anthropic's terminal-based agentic coding tool. It's what actually runs each RPI phase. What makes it particularly well-suited to this workflow is how cleanly it integrates with Agent OS: it reads CLAUDE.md at the repo root automatically on startup, and loads slash commands from .claude/commands/ — which is exactly where Agent OS installs /inject-standards, /shape-spec, and /discover-standards. Every session starts context-aware with no manual setup.

Conductor (conductor.build) is a macOS app that runs multiple Claude Code agents in parallel, each in an isolated git worktree. The name is apt — and not by accident. An orchestral conductor doesn't play an instrument; they make sure every musician reads from the same score and plays their part at the right moment. Conductor works the same way: it directs the agents — which one is researching, which one is planning, which one is implementing — each focused, each with a clean context, none of them stepping on each other. Research for feature A runs while the implementation of feature B is already underway. It's orchestration on top of Claude Code, and it's where the workflow starts feeling less like a process and more like actual leverage.

And in that analogy, you're the composer. You don't play the notes — you decide what gets played, review the output at each rehearsal gate, and make sure the final performance matches what you had in mind.

For context: AGENTS.md is the equivalent instructions file used by GitHub Copilot and other tools. Best practice is to keep CLAUDE.md and AGENTS.md in sync and have both point to your agent-os/standards/ directory, so the workflow is portable if you ever switch tools or work with teammates on a different setup.
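
Keeping the two files in sync is easy to forget, so a small check in CI can help. Here is a minimal sketch, assuming both files live at the repo root and that "pointing to" the standards simply means mentioning the agent-os/standards/ path somewhere in the file. The function names are hypothetical.

```python
from pathlib import Path

STANDARDS_DIR = "agent-os/standards/"
INSTRUCTION_FILES = ("CLAUDE.md", "AGENTS.md")


def read_instruction_files(repo_root: Path) -> dict[str, str]:
    """Read whichever instruction files exist at the repo root."""
    return {
        name: (repo_root / name).read_text(encoding="utf-8")
        for name in INSTRUCTION_FILES
        if (repo_root / name).exists()
    }


def missing_pointers(contents: dict[str, str]) -> list[str]:
    """Return the instruction files that are absent or never mention the standards."""
    return [
        name
        for name in INSTRUCTION_FILES
        if STANDARDS_DIR not in contents.get(name, "")
    ]


# Hypothetical usage in a CI step:
# problems = missing_pointers(read_instruction_files(Path(".")))
# if problems:
#     raise SystemExit(f"No pointer to {STANDARDS_DIR} in: {', '.join(problems)}")
```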


The bit that surprised me most

I expected the main benefit to be speed. More code, faster.

What I actually got was consistency. The code coming out of this workflow fits the codebase. It follows the patterns. It handles errors the right way. The PRs pass review on the first try because there's a spec to review against, not just instinct.

The bottleneck shifted from "writing code" to "thinking clearly about what code should do." That's a better bottleneck. It's where engineering judgment actually creates value.


Want to try it?

The easiest starting point is Agent OS on one repo. The install takes about ten minutes:

curl -fsSL https://raw.githubusercontent.com/buildermethods/agent-os/main/scripts/project-install.sh \
  -o /tmp/project-install.sh && bash /tmp/project-install.sh

Run /discover-standards to have the AI extract patterns from your existing codebase. Then try /shape-spec on your next ticket before you start implementing.

You don't have to adopt all three layers at once. Start with Agent OS (the standards infrastructure). Add [auto]/[manual] splits to your next spec. Run one feature through the full RPI loop. The investment in the first cycle pays back in the third.

The engineers who will do best in the next few years aren't the ones who know the most AI shortcuts. They're the ones who can write a precise, unambiguous spec, decompose a complex change into well-bounded pieces, and validate AI output against clear intent. That's a senior engineering skill. Turns out it's also the skill that makes AI coding actually work.