My AI-Augmented Coding Workflow


I've been trialling a specific setup on my personal side projects: Claude Code as the AI engine, Conductor to orchestrate multiple agents in parallel, and a structured workflow on top to keep everything from going sideways. Think of it a bit like an orchestra — the musicians are talented, but without a conductor, a shared score, and a rehearsal structure, you get noise instead of music. That's roughly what unstructured AI coding feels like too.

Nothing revolutionary on its own — but when these pieces clicked together, it changed how I write software with AI tools in a pretty fundamental way.

Let me explain what I mean.


First, the problem

Here's a pattern I kept hitting: open an AI coding session, paste a ticket, get 200 lines of code, notice it looks mostly right, fix one thing, break another, go back and forth until the AI's context is a mess of contradictory corrections, give up and write it myself. Sound familiar?

That's what I'd call "vibe coding"—and it works fine for tiny tasks. But the moment the codebase has any real history to it — existing patterns, architectural decisions made months ago, reasons why that component does it that way — vibe coding starts falling apart. The AI doesn't know what it doesn't know, and it guesses confidently.

The root problem isn't the AI tools. It's how we're using them. We're handing the world's fastest typist a vague request with no design brief, no style guide, and no definition of done — and then acting surprised when the output doesn't fit.


The shift I made: from developer to director

Here's the framing that actually changed things for me.

Before AI tools, engineers spent a lot of time on mechanical work: translating a clear requirement into code, writing boilerplate, hunting through docs, and debugging syntax. AI mostly eliminates that mechanical layer. What it can't do is decide what to build, understand why a decision was made two years ago, or know when "good enough" is genuinely good enough for this context.

So engineering judgment doesn't go away — it just moves. Instead of spending your energy on implementation, you spend it on direction. Writing precise specs. Validating AI output against intent. Decomposing complex problems into things the AI can actually execute cleanly.

That's the role shift. From developer to product engineer — someone who owns outcomes, not just code.


The four-layer framework I'm using

The workflow I've landed on combines four complementary ideas. Each one solves a specific, different problem:

Layer 1 — Agent OS: Teaches the AI how we build. Persistent, version-controlled standards files the agent can read at the start of any session.

Layer 2 — Spec-Driven Development: Agrees on what we're building before writing a single line of code.

Layer 3 — RPI (Research → Plan → Implement): Executes safely in a large, real codebase without the AI losing coherence mid-task.

Layer 4 — Quality gates: Makes sure what was built is actually correct — not just "it compiles."

Miss any one of them and you get a specific, predictable kind of failure. Let me walk through each.


Layer 1: Agent OS — giving the AI a persistent memory

Every time you open a new AI coding session, the AI starts completely fresh. It has no memory of the architectural decision you made last sprint, the error handling pattern your team settled on, or why you're not using class components. Without something to address this, every session re-teaches that context from scratch — or more commonly, never teaches it at all, and the AI just guesses.

Agent OS is a lightweight open-source framework that solves this. The idea is simple: encode your team's standards in plain markdown files that live in your repo, and load them into every AI session. Because they're in git, they're versioned, reviewable, and shared across your whole team.

The structure has three layers:

  • Standards (techstack.md, codestyle.md, bestpractices.md) — how we build software in general
  • Product (mission.md, roadmap.md) — what we're building and why
  • Specs — what we're building next, per feature

It works with any AI coding tool. For Claude Code (which I use for personal projects), these files integrate through slash commands. For GitHub Copilot, you reference them directly in your chat prompts. The important thing is that every session starts from a shared, consistent foundation — instead of the AI starting from scratch and guessing.

Think of it the way you think about your linting config: it encodes style decisions so you don't relitigate them on every PR. Agent OS encodes architectural decisions so you don't re-explain them to every AI session.
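As a flavour of what these files contain, here is a sketch of a codestyle.md excerpt. The contents are illustrative, not from Agent OS itself — yours should be extracted from your own codebase:

```markdown
# codestyle.md

- TypeScript strict mode; no implicit any
- Explicit return types on all exported functions
- Functional React components only; no class components
- Interactive elements must be keyboard-operable and carry aria-labels
- Errors surface through the shared error-handling helper, never silent catches
```

Short, declarative rules like these are what the agent can actually follow; long prose essays about philosophy tend to get ignored.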


Layer 2: Spec-Driven Development — the spec is the source of truth

This one is the highest-leverage change in the whole workflow, and it's also the simplest.

Before any implementation starts, the AI helps you write a technical spec: requirements, component decomposition, state management approach, API contracts, error handling, and crucially, explicit acceptance criteria. Then a human reviews and approves it before a single line of code is written.

Here's the thing that makes this click: catching a misunderstanding in a spec takes about five minutes. Catching it after implementation, untangling it, and redirecting takes hours.

During spec review, you do one extra thing: for each acceptance criterion, you decide whether it's [auto] (verifiable by a test) or [manual] (requires human inspection — visual review, screen reader, device testing). This decision, made before implementation, determines what gets tested. Without it, test coverage is an afterthought decided by how much time is left at the end of the sprint.

An example:

| Criterion | Label | Why |
| --- | --- | --- |
| Contact me form adds a project deadline date | [auto] | Unit test against the query logic |
| Date picker is keyboard-accessible | [manual] | Requires a screen reader or a11y tooling |
| UI looks correct on a 320px viewport | [manual] | Visual — requires browser inspection |
| No TypeScript errors | [auto] | tsc --noEmit in the build |

Layer 3: RPI — Research, Plan, Implement

This is the operational heart of the workflow. It's a three-phase approach for executing a feature in a real codebase without the AI losing the thread.

Phase 1: Research

Open a fresh Claude Code session. Give it one instruction: read the relevant files and produce a factual report. No code. No opinions. No "here's how I'd approach it." Only what exists.

The output is a research.md document: what files are involved, what patterns are in use, what open questions need answering. Before moving on, you do a quick FAR check — is this research Factual, Actionable, and Relevant?

This is where scope misunderstandings are cheapest to catch.
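A research.md skeleton along these lines works well. The headings are my own convention rather than anything mandated by a tool, and the file names are illustrative:

```markdown
# Research: add deadline field to contact form

## Files involved
- src/components/ContactForm.tsx: form state and submit handler
- src/api/contact.ts: POST payload shape

## Patterns in use
- Forms use controlled components with a shared form hook
- API errors surface through the shared toast helper

## Open questions
- Should past dates be rejected client-side or server-side?
```

If the "Open questions" section is empty, be suspicious — a real codebase almost always produces at least one.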

Phase 2: Plan

Fresh Claude Code session. The agent loads the research document and produces a plan.md: phased tasks, checkbox-tracked, with explicit success criteria and an out-of-scope list.

Before starting implementation, you do a FACTS check — is the plan Feasible, Atomic, Complete, Testable, and Scoped?

Why a fresh session for planning? The research session is done. Carrying its context forward pollutes planning with half-formed thoughts from the exploration phase. Planning needs a clean slate seeded only by the finished research.

Phase 3: Implement

Fresh Claude Code session per phase. The agent loads the plan, the spec, and the standards — nothing else. It works through the phases, checking off tasks, and writes tests for every [auto] acceptance criterion alongside the implementation code.

The reason fresh sessions matter here is context hygiene. Keep Claude Code's context window lean, ideally below 40% full, and it stays coherent. Drag the entire history from research and planning into the session, and things start degrading mid-task.

If context fills mid-phase, the plan's checkboxes are your recovery mechanism — reload the plan, see which tasks are unchecked, resume.
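A plan.md skeleton with checkbox tracking, which is exactly what makes that recovery possible. Structure and task names are illustrative:

```markdown
# Plan: add deadline field to contact form

## Phase 1: data layer
- [x] Extend the contact payload type with an optional deadline
- [x] Unit test: deadline validation ([auto] criterion 1)

## Phase 2: UI
- [ ] Add date picker to ContactForm
- [ ] Wire validation errors to the shared toast helper

## Out of scope
- Server-side date validation (tracked separately)
```

A fresh session loading this file knows immediately that Phase 1 is done and Phase 2, task one, is next.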

This is also where Conductor earns its place. Rather than running Research, Plan, and Implement sequentially in a single terminal window, Conductor lets you spin up multiple Claude Code agents in parallel — each in its own isolated git worktree. So Research for feature A can be running at the same time as Implementation of feature B. It's like having a small team of focused agents, each with a clean context and a specific job to do.
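Conductor handles the isolation for you, but the mechanism underneath is plain git worktree. A self-contained sketch of what one worktree per agent looks like, using a throwaway repo and hypothetical branch names:

```shell
set -e
# Throwaway demo repo (Conductor creates the real worktrees automatically).
repo=$(mktemp -d); cd "$repo"
git init -q
git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m init

# One isolated checkout per agent, each on its own branch:
git worktree add -q "$repo-research" -b feature-a-research
git worktree add -q "$repo-implement" -b feature-b-implement

git worktree list   # main checkout plus the two agent worktrees
```

Because each worktree is a separate directory with its own branch, one agent's half-finished edits never appear in another agent's view of the repo.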


Layer 4: Quality gates — explicit, not hoped-for

AI-generated code passes the build and matches the plan's tasks. But it silently misses what wasn't in the instructions: the edge case nobody mentioned, the keyboard handler that wasn't specified, the error state the spec didn't call out.

The workflow has four explicit quality gates:

  1. Spec review — split [auto] from [manual] criteria before writing any code
  2. Implement phase — tests generated alongside code, not as an afterthought
  3. Pre-PR standards audit — a read-only AI session reviews the diff against your standards files and lists violations. No fixes — just a list you then act on
  4. PR review — the reviewer checks the diff against the spec, not just against instinct

That third gate consistently catches things that slip through: inconsistent error handling, missing TypeScript return types, interactive elements without keyboard support, hardcoded strings that should be constants.
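The output of that gate is deliberately just a list. Something like the following, where the items are hypothetical but typical of what it flags:

```markdown
## Standards audit (read-only)

1. src/api/projects.ts: exported function missing an explicit return type
2. src/components/Header.tsx: icon-only button has no aria-label
3. src/utils/format.ts: hardcoded user-facing string; move to constants

No files modified. Address each item, then re-run before raising the PR.
```

Keeping this session read-only matters: an auditor that also fixes things drifts back into implementation mode and stops being a check.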


What a day actually looks like

This is the comparison I find most useful. Not before/after AI — before/after a structured AI workflow.

Without structure: Open Copilot, paste ticket, get code, something's wrong, fix it, the fix broke something else, context fills up with contradictions, start over, give up, write it manually. End of day: variable output, standards drift accumulating.

With the workflow: Read ticket, run /shape-spec in Claude Code, answer six questions, review and approve the spec (twenty minutes). New Claude Code session, run research (eight minutes), FAR check. New session, run plan, FACTS check. Three implement phases via Conductor — each agent in its own worktree, each with tests passing before moving on. Standards compliance audit catches two things: a missing return type and a missing aria-label. Fix both in ten minutes. Raise PR.

The PR reviewer has a clear checklist from the spec. The review takes fifteen minutes. First pass.

That morning: one feature implemented, tested, PR'd, reviewed. Afternoon: same again.


The tooling stack

The two tools doing the heavy lifting here are Claude Code and Conductor.

Claude Code is Anthropic's terminal-based agentic coding tool. It's what actually runs each RPI phase. What makes it particularly well-suited to this workflow is how cleanly it integrates with Agent OS: it reads CLAUDE.md at the repo root automatically on startup, and loads slash commands from .claude/commands/ — which is exactly where Agent OS installs /inject-standards, /shape-spec, and /discover-standards. Every session starts context-aware with no manual setup.
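A minimal CLAUDE.md in that spirit might look like this. The contents are illustrative — Agent OS generates its own version on install:

```markdown
# CLAUDE.md

Before writing any code, read:
- agent-os/standards/techstack.md
- agent-os/standards/codestyle.md
- agent-os/standards/bestpractices.md

For the current feature, load the active spec under agent-os/specs/.
```

The file itself stays tiny; it is a pointer to the standards, not a copy of them, so there is one source of truth to keep updated.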

Conductor (conductor.build) is a macOS app that runs multiple Claude Code agents in parallel, each in an isolated git worktree. The name is apt — and not by accident. Just like an orchestral conductor doesn't play an instrument but ensures every musician is reading from the same score, playing their part at the right moment, Conductor doesn't write code itself. It directs the agents: one researches, one plans, and one implements — each focused on their phase, each working from a clean context, none stepping on each other's work. Instead of waiting for Research to finish before starting Plan, or queuing up Implementation behind both, you have separate agents working simultaneously on different phases or different features. It's lightweight orchestration on top of Claude Code, purpose-built for exactly this kind of multi-phase, multi-agent workflow.

And in that analogy, you're the composer. You don't play the notes — you decide what gets played, review the output at each rehearsal gate, and make sure the final performance matches what you had in mind.

For context: Claude Code also integrates with AGENTS.md, which is the equivalent file used by GitHub Copilot and other tools. Best practice is to keep both in sync and have them point to your agent-os/standards/ directory — so the workflow is portable if you ever switch tools or work with teammates on a different setup.
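One low-tech way to keep the two files in sync is a symlink — this is a common convention, not something either tool requires, and a copy step in CI works just as well if your tooling dislikes symlinks:

```shell
set -e
cd "$(mktemp -d)"   # demo directory; in practice this is your repo root

# Single source of truth: both files point the agent at the standards directory.
printf 'Read agent-os/standards/*.md before making changes.\n' > CLAUDE.md
ln -s CLAUDE.md AGENTS.md

diff CLAUDE.md AGENTS.md && echo "CLAUDE.md and AGENTS.md are in sync"
```

After this, editing CLAUDE.md updates what Copilot-style tools read through AGENTS.md automatically.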


The bit that surprised me most

I expected the main benefit to be speed. More code, faster.

What I actually got was consistency. The code coming out of this workflow fits the codebase. It follows the patterns. It handles errors the right way. The PRs pass review on the first try because there's a spec to review against, not just instinct.

The bottleneck shifted from "writing code" to "thinking clearly about what code should do." That's a better bottleneck. It's where engineering judgment actually creates value.


Want to try it?

The easiest starting point is Agent OS on one repo. The install takes about ten minutes:

curl -fsSL https://raw.githubusercontent.com/buildermethods/agent-os/main/scripts/project-install.sh \
  -o /tmp/project-install.sh && bash /tmp/project-install.sh

Run /discover-standards to have the AI extract patterns from your existing codebase. Then try /shape-spec on your next ticket before you start implementing.

You don't have to adopt all four layers at once. Start with Agent OS (the standards infrastructure). Add [auto]/[manual] splits to your next spec. Run one feature through the full RPI loop. The investment in the first cycle pays back in the third.

The engineers who will do best in the next few years aren't the ones who know the most AI shortcuts. They're the ones who can write a precise, unambiguous spec, decompose a complex change into well-bounded pieces, and validate AI output against clear intent. That's a senior engineering skill. Turns out it's also the skill that makes AI coding actually work.