<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[bitsofrandomness.com]]></title><description><![CDATA[Where tech meets random thoughts]]></description><link>https://bitsofrandomness.com</link><generator>RSS for Node</generator><lastBuildDate>Wed, 15 Apr 2026 09:21:47 GMT</lastBuildDate><atom:link href="https://bitsofrandomness.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[A decade of shaping teams, and the book that made sense of it]]></title><description><![CDATA[I recently read Team Topologies by Matthew Skelton and Manuel Pais. It's one of those books where you keep nodding as you read, recognising patterns you've lived through. Over the past decade, I've bu]]></description><link>https://bitsofrandomness.com/a-decade-of-shaping-teams-and-the-book-that-made-sense-of-it</link><guid isPermaLink="true">https://bitsofrandomness.com/a-decade-of-shaping-teams-and-the-book-that-made-sense-of-it</guid><category><![CDATA[team collaboration]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[team topologies]]></category><category><![CDATA[leadership]]></category><dc:creator><![CDATA[Vincent Burckhardt]]></dc:creator><pubDate>Mon, 09 Mar 2026 12:34:56 GMT</pubDate><content:encoded><![CDATA[<p>I recently read <a href="https://teamtopologies.com/book">Team Topologies</a> by Matthew Skelton and Manuel Pais. It's one of those books where you keep nodding as you read, recognising patterns you've lived through. Over the past decade, I've built up a lot of opinions about what makes teams work and what doesn't. The book gave those opinions a framework, connecting things I'd seen across different teams and organisations.</p>
<p>The book lays out four team types (stream-aligned, enabling, platform, and complicated subsystem), three interaction modes (close collaboration, X-as-a-service, and facilitating), and a core principle: cognitive load should drive how you scope a team's mission. Most of it mapped to things I'd already lived through.</p>
<h2><strong>Architecture first, then teams</strong></h2>
<p>On one project, a team collaboration product, I was one of the founding engineers and helped decide how the squads were drawn up. Instead of letting the org chart shape the system, we identified the business domains first (messaging, notifications, user profiles, spaces, search) and built cross-functional squads around each. Every squad owned their domain end-to-end: front-end, back-end, deployment. No hand-offs to another team. Microservices were the implementation detail, but the real decision was aligning team boundaries to business domains. The team structure followed from that, not from the org chart.</p>
<p>We didn't call it the <a href="https://martinfowler.com/bliki/TeamTopologies.html">reverse Conway manoeuvre</a> at the time. We just knew that if teams formed around the existing hierarchy, the architecture would inherit those same rigid boundaries. Org charts are designed for reporting lines and budgets, not for how software delivery flows. Let teams follow the hierarchy, and that's exactly what you get in the code.</p>
<p>I've also seen the opposite. On another project, what should have been a single coherent product experience was spread across four separate product teams, each with their own reporting line and priorities. Nobody owned the end-to-end user experience. The result was exactly what <a href="https://martinfowler.com/bliki/ConwaysLaw.html">Conway's law</a> predicts: the architecture mirrored the team splits, not the user's mental model. It's a common pattern in large organisations, not a failure of any individual team, but of a topology that doesn't match the product boundaries.</p>
<h2><strong>Cognitive load as a design constraint</strong></h2>
<p>Team Topologies puts a name on something that's easy to feel but hard to explain: teams have a finite cognitive capacity, and you should design around it.</p>
<p>On the collaboration product, we had a separate mobile squad rather than embedding a few mobile engineers in each stream-aligned team. If every squad had to hold mobile platform specifics in their heads on top of their domain, the cognitive load would be too high. Not every team member could keep everything in mind, and that's when things start falling through the cracks.</p>
<p>I've found this to be a good litmus test. When scoping a team's mission, the question isn't just "what needs to be built?" but "can every member of this team reasonably understand the full scope of what we own?" If the answer is no, the boundaries are wrong.</p>
<h2><strong>The enabling team that was one person</strong></h2>
<p>The book describes enabling teams as groups that help stream-aligned teams acquire new capabilities. I've seen this work with a single senior person.</p>
<p>A security architect I worked with operated exactly this way. They didn't just produce guidelines. They worked directly with stream-aligned squads, embedding temporarily to help them implement specific security outcomes, whether that was JWT token handling, secret management, or threat modelling. They transferred knowledge, then moved on. That's the enabling model in its purest form: build capability in other teams, don't create a dependency.</p>
<h2><strong>From collaboration to service</strong></h2>
<p>Three interaction modes: collaboration, X-as-a-service, and facilitating. The thing that clicked for me is that you move through them. I've lived through that evolution in multiple organisations, and it's always the same pattern.</p>
<p>In one organisation, recurring incidents across many services followed common patterns. These weren't individual team failures. They were systemic gaps that no single team had the bandwidth or mandate to solve alone. I led a reliability-focused team to tackle this. Rather than building a centralised response function, I embedded my team with one service team at a time, worked closely alongside them on their biggest reliability gap, and built the fix as reusable tooling from the start. Automated secret rotation, safe infrastructure update workflows, continuous deployment with progressive rollouts, reusable infrastructure-as-code modules for common cloud patterns. Each built as a general solution, validated in one team's environment before others picked it up.</p>
<p>Once a solution was proven in production with a real team, the interaction mode shifted. What started as close collaboration graduated to X-as-a-service: other teams could consume the tooling independently and pick up updates with minimal coordination. Without embedding first, we wouldn't have understood the problem well enough to build something that actually worked. Without graduating to self-service, it wouldn't have scaled.</p>
<p>We were strict about quality. Every piece of reusable tooling had documentation, tests, and worked out of the box. If the consumer experience is poor, adoption doesn't happen. Once a few solutions were running in production and engineers across the organisation were joining regular community calls, things started moving on their own. Less personal push needed, more gentle steering.</p>
<p>Much of this work followed an inner source model. We didn't expect many pull requests from other teams, and we didn't get them. Stream-aligned teams have their own delivery to focus on. The real value of inner source was something else: working in the open built trust, attracted like-minded people across the organisation, and grew a community around the topic instead of a top-down mandate. Some of that work eventually became <a href="https://github.com/terraform-ibm-modules">open source</a>.</p>
<h2><strong>The collaboration trap</strong></h2>
<p>Team Topologies describes close collaboration as an interaction mode for early exploration. Two teams working tightly together to discover something new.</p>
<p>I've seen what happens when this mode is skipped. Sometimes teams skip early collaboration for understandable reasons: time pressure, uncertainty about direction, or wanting to move fast without too many voices in the room. But when architectural decisions get made without involving the teams who'll build on them, the downstream cost is real. People don't feel they "disagreed and committed." They feel they were never part of the conversation. The long-term impact on trust and buy-in is hard to recover from.</p>
<h2><strong>The glue between teams</strong></h2>
<p>Team Topologies would probably say that if you need someone working the seams between teams, your topology isn't right. Fix the boundaries, clarify the interaction modes, and the need for glue disappears. In theory, that's correct. In practice, even a well-designed topology assumes every team can fully deliver on their mission and will proactively coordinate when needed. Neither assumption always holds.  </p>
<p>Some teams lack the skills to match their responsibility, whether through outsourcing, rapid growth, or being assigned work outside their expertise. Others have the capability but not the inclination to reach out. Autonomous teams can become insular, and collaboration doesn't happen just because the interaction mode says it should. Someone ends up bridging those gaps informally: orchestrating work across squads, testing a dependency that needs more attention, connecting things that would otherwise fall through the cracks.  </p>
<p>But the glue role isn't only about compensating for gaps. Someone with visibility across multiple teams can spot patterns and risks that are invisible from inside any single boundary. Engineering maturity always varies across teams in a large organisation, and that cross-boundary perspective is how you keep the overall system healthy.  </p>
<p>Tanya Reilly calls this <a href="https://www.noidea.dog/glue">being glue</a> and frames it as legitimate technical leadership. Good organisations recognise this work and value the people doing it. Others don't, because it sits outside any team's formal scope, and that's how they lose the people who were holding things together.</p>
<h2><strong>Teams shape the people inside them</strong></h2>
<p>There's an aspect of team design that goes beyond delivery efficiency. Small, autonomous squads create space for engineers to grow. A squad lead with genuine autonomy, and whose work is visible to the broader organisation, gets something that large hierarchical teams rarely offer: real ownership and room to develop. People I brought onto the team grew into leadership roles they wouldn't have got near in a traditional structure, because the squad structure gave them room for it. That was deliberate. How you draw team boundaries shapes careers, not just architectures.</p>
<h2><strong>What the book gave me</strong></h2>
<p>Team Topologies didn't change how I think about teams. It gave me a shared vocabulary for patterns I'd already seen work, and a clearer understanding of why some structures fail. Cognitive load, interaction modes, Conway's law as a design tool rather than an inevitability. These are the kind of things that turn gut feeling into something you can actually discuss with other people.  </p>
<p>If you've spent years building teams, you'll recognise most of this book. It does for organisational design what Design Patterns did for object-oriented software: gives you shared names so you can stop talking past each other.</p>
]]></content:encoded></item><item><title><![CDATA[Agent skills look like markdown. Treat them like software.]]></title><description><![CDATA[I have spent the last couple of weeks exploring agent skills, in the context of making it easier for our customers to build complex infrastructure-as-code solutions on top of our curated Terraform modules. The domain is not directly relevant here, bu...]]></description><link>https://bitsofrandomness.com/ai-agent-skills-need-software-engineering-discipline</link><guid isPermaLink="true">https://bitsofrandomness.com/ai-agent-skills-need-software-engineering-discipline</guid><category><![CDATA[ai-agent]]></category><category><![CDATA[aiskills]]></category><category><![CDATA[#ai-tools]]></category><category><![CDATA[Governance]]></category><category><![CDATA[skills]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[software architecture]]></category><category><![CDATA[thoughts]]></category><dc:creator><![CDATA[Vincent Burckhardt]]></dc:creator><pubDate>Sat, 14 Feb 2026 00:20:25 GMT</pubDate><content:encoded><![CDATA[<p>I have spent the last couple of weeks exploring agent skills, in the context of making it easier for our customers to build complex infrastructure-as-code solutions on top of our curated Terraform modules. The domain is not directly relevant here, but the fact that we want to eventually ship skills to customers is what shaped how I think about what follows.</p>
<p>Skills look simple: markdown instructions, optional scripts, maybe some reference files. In practice that usually means front matter, a core <code>SKILL.md</code>, and supporting docs or code. The <a target="_blank" href="https://agentskills.io/specification">Agent Skills specification</a> turned this into a shared format with progressive disclosure, so assistants can read metadata first and only load the full body when needed. Anthropic introduced it in December 2025, and major coding assistants adopted it quickly.</p>
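<p>To make that concrete, here is roughly what a minimal <code>SKILL.md</code> looks like. This is a hypothetical example (the skill name, description, and referenced files are made up), but the shape follows the spec: front matter that assistants load at startup, then a body they pull in only when the skill triggers.</p>

```
---
name: terraform-module-composition
description: Compose infrastructure-as-code solutions from curated
  Terraform modules. Use when the user asks to scaffold or extend
  Terraform-based infrastructure.
---

# Terraform module composition

Read references/module-patterns.md before proposing a layout.
Prefer existing curated modules over hand-written resources, and run
scripts/validate_inputs.py on any variables file you generate.
```

<p>Note that the <code>description</code> does double duty: it is documentation, but it is also the signal the assistant matches user intent against when deciding whether to load the skill at all.</p>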
<p>That speed makes one thing urgent: quality and governance. The first serious, customer-facing skills are only now starting to show up.</p>
<h2 id="heading-creating-a-skill-is-easy-creating-a-good-skill-is-hard">Creating a skill is easy. Creating a good skill is hard</h2>
<p>A skill can be just markdown, and that is the appeal. A compact way to package procedural knowledge. The buzz around Anthropic's <a target="_blank" href="https://github.com/anthropics/knowledge-work-plugins/blob/main/legal/skills/contract-review/SKILL.md">contract-review skill</a> for Claude Cowork showed this well: it rattled legal tech stocks, and it is a couple hundred lines of markdown. Not a product. A <code>SKILL.md</code> file.</p>
<p>But low creation cost creates noise. Catalogs and registries have multiplied. <a target="_blank" href="https://skills.sh"><code>skills.sh</code></a> alone lists tens of thousands of skills. That proves demand, not quality. Maybe 90% of published skills are not really useful or well designed, or are just someone's personal workflow opinions packaged as reusable assets. Discovery and filtering can end up costing more than just doing the task directly.</p>
<p>The hard part is not writing instructions. The hard part is designing a reliable flow with clear boundaries and good triggers, with gates that reduce failure rates. Good skills checkpoint risky branches, force validation steps, avoid ambiguous handoffs.</p>
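<p>As an illustration of what such a gate can look like inside the instructions themselves (a hypothetical excerpt, not from any published skill):</p>

```markdown
## Before applying any change

1. Run the project's plan or preview step and show the full output.
2. STOP. Ask the user to confirm the plan before proceeding.
3. If the plan touches resources outside the stated scope, do not
   proceed. Report the mismatch and wait for instructions.
```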
<p>A slightly more technical aside, but worth mentioning: skills are not the only way to augment an assistant. Slash commands (user-triggered instruction sets) and MCP servers (external tools and data sources the assistant can call) are converging with skills into a similar pattern: reusable instruction bundles, often with optional scripts. Frameworks like <a target="_blank" href="https://bitsofrandomness.com/breaking-the-doom-prompting-loop-with-spec-driven-development"><code>spec-kit</code></a> lean into that convergence. Skills also compose well with MCP servers, which can provide information in a more token-efficient format than raw API calls (compact representations like TOON instead of full JSON payloads). And MCP gateways like <a target="_blank" href="https://ibm.github.io/mcp-context-forge/">MCP Context Forge</a> can add governance and security between the assistant and the data source, which is harder when a skill instructs the assistant to make direct calls.</p>
<h2 id="heading-are-skills-documentation-or-are-they-software">Are skills documentation, or are they software?</h2>
<p>I have started to see them as software. At least for skills we want to push to customers and be able to support. That is the main takeaway from these last couple of weeks: if it needs to be supportable, treat it like a software artifact.</p>
<p>Start with static analysis in CI. Tools like <code>skills-ref validate</code> check structure and naming against the spec, but that is just one check among many: markdown linting, context budget enforcement (keeping startup metadata near the ~100-token footprint and <code>SKILL.md</code> within recommended size thresholds), and whatever project-specific rules make sense.</p>
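<p>A sketch of what one of those project-specific CI checks might look like. The thresholds and error messages here are assumptions, not numbers from the spec, and the token estimate is a crude characters-per-token heuristic:</p>

```python
import re

# Hypothetical CI gate: parse SKILL.md front matter and enforce a
# context budget. Thresholds are illustrative, not from the spec.
MAX_METADATA_TOKENS = 100   # name + description loaded at startup
MAX_BODY_TOKENS = 5000      # full body loaded only when triggered

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token.
    return len(text) // 4

def check_skill(skill_md: str) -> list[str]:
    errors = []
    match = re.match(r"^---\n(.*?)\n---\n(.*)$", skill_md, re.DOTALL)
    if not match:
        return ["missing YAML front matter"]
    front_matter, body = match.groups()
    for field in ("name:", "description:"):
        if field not in front_matter:
            errors.append(f"front matter missing '{field.rstrip(':')}'")
    if estimate_tokens(front_matter) > MAX_METADATA_TOKENS:
        errors.append("startup metadata over token budget")
    if estimate_tokens(body) > MAX_BODY_TOKENS:
        errors.append("SKILL.md body over token budget")
    return errors
```

<p>A check like this runs alongside <code>skills-ref validate</code> and markdown linting, and fails the build before a bloated or malformed skill ships.</p>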
<p>Then handle lifecycle. Versioning helps, but semver alone is not enough. Skills should include ownership and last-updated metadata. This gets harder in monorepos that hold unrelated skills, and many installers still do not provide a clean update path for already-installed skills.</p>
<p>Modularity matters just as much. Small, single-purpose skills are easier to compose and less likely to conflict. Large catch-all skills drift into ambiguity. If a skill ships Bash or Python, those scripts follow the normal lifecycle too: review, tests, security checks, clear ownership.</p>
<h2 id="heading-testing-skills-needs-a-tdd-like-loop">Testing skills needs a TDD-like loop</h2>
<p>This is where I see the biggest gap. Most teams still test skills informally: try a couple of prompts, get one good answer, ship it. That is not enough for probabilistic systems.</p>
<p>There is also a real variance problem across assistants. Some trigger skills naturally from intent, others need explicit prompting. Codex often triggers on loosely related tasks. Claude Code and GitHub Copilot can need more prompt shaping depending on context. Once a skill is loaded, instruction-following quality still differs by assistant and model. Testing a skill on one assistant and calling it done is not enough.</p>
<p>I have been toying with a <a target="_blank" href="https://en.wikipedia.org/wiki/Test-driven_development">TDD</a>-like approach to this, adapted for non-determinism.</p>
<ol>
<li>Run a baseline without skill or MCP augmentation and capture failures (red).</li>
<li>Add the skill and MCP context until behavior is acceptable (green).</li>
<li>Refactor for stability and token efficiency.</li>
</ol>
<p>Gates should be statistical, not binary. Run each scenario multiple times, track aggregate scores and standard deviation, inspect tail behavior. A good average can hide an ugly long tail. Ideally you run this across a matrix of assistants and models, so teams can publish minimum support expectations for each skill instead of claiming universal compatibility.</p>
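<p>A sketch of such a gate, with illustrative thresholds (the numbers are assumptions, not from any spec or tool; the per-run scores would come from whatever scoring harness you use):</p>

```python
from statistics import mean, stdev

# Hypothetical statistical gate: a scenario passes only if repeated
# runs are good on average, stable, and free of an ugly long tail.
RUNS = 10
MIN_MEAN = 0.85
MAX_STDEV = 0.10
MIN_WORST = 0.60   # tail check: even the worst run must clear this

def gate(scores: list[float]) -> tuple[bool, str]:
    if len(scores) < RUNS:
        return False, f"need {RUNS} runs, got {len(scores)}"
    if mean(scores) < MIN_MEAN:
        return False, f"mean {mean(scores):.2f} below {MIN_MEAN}"
    if stdev(scores) > MAX_STDEV:
        return False, f"stdev {stdev(scores):.2f} above {MAX_STDEV}"
    if min(scores) < MIN_WORST:
        return False, f"worst run {min(scores):.2f} below {MIN_WORST}"
    return True, "pass"
```

<p>Run per assistant-and-model combination, a gate like this is what turns anecdotal demos into publishable support claims.</p>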
<p>Scoring should mix deterministic and non-deterministic checks. Deterministic checks run assertions on outputs (files generated, content matching, expected tool calls) and can also verify conversation structure. Non-deterministic checks can use semantic similarity scores or LLM judges, with targeted human review on high-risk workflows. Cost matters too: token usage, latency, retry count. If a skill improves quality but doubles cost and variance, that may be a bad trade.</p>
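<p>The deterministic half of that scoring is plain assertion code over a run's artifacts and tool-call log. A sketch, where the file name, tool names, and the shape of the <code>tool_calls</code> list are all assumptions about a hypothetical eval harness:</p>

```python
from pathlib import Path

# Hypothetical deterministic checks on one eval run: did the expected
# artifacts appear, does content match, were the right tools called?
def score_run(workspace: Path, tool_calls: list[str]) -> float:
    main_tf = workspace / "main.tf"
    checks = [
        main_tf.exists(),                                       # expected file generated
        main_tf.exists() and "module " in main_tf.read_text(),  # content matches
        "terraform_validate" in tool_calls,                     # expected tool call made
        "delete_resource" not in tool_calls,                    # no forbidden action
    ]
    return sum(checks) / len(checks)
```

<p>Each run's score then feeds the statistical gate, alongside whatever semantic or LLM-judge scores the scenario uses.</p>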
<p>And there is a more basic question that many teams skip: is the skill useful at all? In many cases, the base model already has the knowledge. If a skill does not measurably improve quality, consistency, safety, or cost, it should not exist. Offline evals help, but you still need a feedback loop from real users in live usage.</p>
<h2 id="heading-security-is-still-the-wild-west">Security is still the wild west</h2>
<p>Security is the least mature part of this space, and it shows.</p>
<p>The <a target="_blank" href="https://agentskills.io/integrate-skills">integration guidance</a> says script execution is risky and should be sandboxed. That warning is justified. Skills can be pulled from many catalogs through different installers, with little shared trust model.</p>
<p>The most installed skill on <a target="_blank" href="https://skills.sh/vercel-labs/skills/find-skills"><code>skills.sh</code></a> is one that automatically discovers and installs more skills from the internet, with a <code>-y</code> flag that skips user confirmation. Not an edge case. That is the default user pathway. No realistic central gate exists today. Many catalogs and installer implementations are out there, each with different review standards. Trust does not transfer across them.</p>
<p>Script execution is only one part of the risk though. Markdown instructions alone can encode harmful behavior: a skill could tell the assistant to ask users for sensitive information and route it to a third party. No scripts needed. Security review has to cover instruction content, not just executable assets.</p>
<p>I see two practical paths:</p>
<ol>
<li>Curated, security-vetted catalogs or some kind of certification scheme.</li>
<li>Stronger assistant-side guardrails: sandboxing, tool allowlists, confirmation prompts for high-risk actions.</li>
</ol>
<p>The second probably matters more in the short term because it protects users even when catalog governance is inconsistent. Many of these protections still sit outside the core skill format and depend on the assistant implementers.</p>
<h2 id="heading-monetization-comes-after-trust">Monetization comes after trust</h2>
<p>I do think there is a monetization path for high-quality skills, especially in dense procedural domains: compliance, legal, medical, security, specialized infrastructure. Areas where there is a lot of procedural knowledge that people would pay for, if they trust the quality.</p>
<p>You are not selling markdown. You are selling well-maintained, tested knowledge. Closer to buying a technical playbook than a prompt file.</p>
<p>Monetization will probably look less like selling one static skill and more like subscriptions to curated pattern libraries, or MCP-backed knowledge systems where the skill is a thin layer over a gated database of patterns. Companies could also sell skill packs that integrate with their own products. Services like <a target="_blank" href="https://www.uupm.cc">uupm.cc</a> already point in that direction.</p>
<h2 id="heading-where-this-leaves-me">Where this leaves me</h2>
<p>The format is good. Adoption is fast. Serious skills are just starting to emerge. What is missing is treating them like actual software: clear ownership, static checks, versioning, modular design, statistical evaluation, security controls, real user feedback loops. The basics we already know from software engineering, applied to a new kind of artifact.</p>
<p>I am still figuring out parts of this, especially around the eval tooling and how far cross-assistant testing is worth pushing for early-stage skills. But the direction feels clear. If we want skills to hold up in enterprise delivery, the bar has to go up. Otherwise they are just fancy prompts.</p>
]]></content:encoded></item><item><title><![CDATA[Software engineering culture: mindset, teams, and action]]></title><description><![CDATA[Originally published in December 2025
This year, more than most, made me think about team structure, and through that lens, about engineers and people. The patterns I've observed over the years keep sharpening as I see them repeat: anticipation and e...]]></description><link>https://bitsofrandomness.com/software-engineering-culture-mindset-teams-and-action</link><guid isPermaLink="true">https://bitsofrandomness.com/software-engineering-culture-mindset-teams-and-action</guid><category><![CDATA[Software Engineering]]></category><category><![CDATA[Culture]]></category><category><![CDATA[AI]]></category><category><![CDATA[Futureofwork]]></category><category><![CDATA[teamwork]]></category><category><![CDATA[team]]></category><category><![CDATA[10xengineer]]></category><category><![CDATA[Productivity]]></category><category><![CDATA[Product Management]]></category><category><![CDATA[#SuccessMindset]]></category><dc:creator><![CDATA[Vincent Burckhardt]]></dc:creator><pubDate>Mon, 22 Dec 2025 00:00:00 GMT</pubDate><content:encoded><![CDATA[<p><em>Originally published in December 2025</em></p>
<p>This year, more than most, made me think about team structure, and through that lens, about engineers and people. The patterns I've observed over the years keep sharpening as I see them repeat: anticipation and empathy change how engineers approach problems, team structures affect delivery speed and alignment with user needs. Culture shapes outcomes more than tooling or process.</p>
<h2 id="heading-the-mindset-difference">The mindset difference</h2>
<p>A pattern I've seen across engineering teams: you typically have one or two engineers who deliver several times more value than the rest. The multiplier isn't in features they ship. It's in how their work compounds: making future development easier for other contributors, improving overall quality, preparing the codebase for future needs. I actually heard the term "10x engineer" only recently, but reflecting back across 20 years, it resonates, though not in the way the term is <a target="_blank" href="https://blog.pragmaticengineer.com/the-10x-engineer-evolution/">commonly discussed</a>. The multiplier isn't about lines of code or features shipped. It's about fewer bugs, maintainable code, and solutions that serve the team long after the initial commit.</p>
<p>What this looks like in practice: an engineer implementing a capability as a reusable library rather than a one-off solution. Someone building a more generic implementation that others can extend later. The line between this and overengineering is thin. The difference is judgment: solving for the next likely step, not for every hypothetical future. These contributions have real business impact, but they're hard to see without looking at the technical details. Other engineers with the right mindset spot them immediately. What actually sets them apart is how they make their whole team more effective.</p>
<h3 id="heading-anticipation-as-core-skill">Anticipation as core skill</h3>
<p>Anticipation is the biggest factor. Anticipating bugs, future needs, usage patterns, performance bottlenecks. Coding accordingly, not just for their own output but for other engineers working on the same codebase. For some, this becomes intuition rather than a checklist.</p>
<p>An engineer sees a configuration pattern that will likely need to support multiple environments. Instead of hardcoding values, they design the configuration structure to accommodate that from the start. Not because a requirement exists, but because experience suggests it will. Another engineer implements error handling that captures enough context to debug production issues, because they've been on the other end of vague error messages at 2am.</p>
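<p>A small sketch of both habits, with made-up environment names and settings: the structure admits new environments cheaply, and the failure path carries enough context to act on instead of a bare error.</p>

```python
from dataclasses import dataclass

# Instead of hardcoding one environment's values, the structure admits
# more environments from day one, at near-zero extra cost.
@dataclass(frozen=True)
class Settings:
    api_base_url: str
    request_timeout_s: float = 5.0

SETTINGS = {
    "dev":  Settings(api_base_url="http://localhost:8080"),
    "prod": Settings(api_base_url="https://api.example.com",
                     request_timeout_s=2.0),
}

def load_settings(env: str) -> Settings:
    try:
        return SETTINGS[env]
    except KeyError:
        # The error names the bad input AND the valid options: the 2am
        # reader does not have to open the source to know what to fix.
        raise ValueError(f"unknown environment {env!r}, "
                         f"expected one of {sorted(SETTINGS)}") from None
```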
<p>The same applies when writing documentation or tutorials: considering what readers already know and what they don't, adjusting the level of detail accordingly. Anticipating the questions someone will have before they ask them.</p>
<p>It also shapes how code communicates intent. Variable names that clarify purpose. Functions broken down in ways that make the next change obvious. Comments that explain why, not what, for decisions that might seem arbitrary six months later. These aren't cosmetic choices. They directly affect how quickly other engineers can understand, modify, and extend the code.</p>
<p>This isn't about building features you might need someday. It's about subtle choices that cost little extra: error handling that captures context, naming that communicates intent, structures flexible enough for likely next steps. The cost is minimal. The payoff compounds.</p>
<h3 id="heading-empathy-and-user-perspective">Empathy and user perspective</h3>
<p>Seeing the product from the user's perspective comes naturally to some engineers. For many, it needs to be developed deliberately. This gap doesn't correlate with seniority. Experience in one domain doesn't automatically translate to understanding users in another. Direct interaction with users helps: developers who watch people struggle with a feature find it much easier to recognise that confusion in their own designs.</p>
<p>This shows up in small decisions that shape user experience. An engineer who tests the error states, not just the happy path. Someone who questions whether a feature that makes technical sense actually solves the user's problem. A developer who considers what happens when the system is under load, when network is flaky, when inputs are unexpected. A UI developer who uses their interface from the user's perspective, testing usability rather than just building what was specified.</p>
<p>The absence is just as visible. Features that work perfectly in development but fail in production because real-world usage wasn't considered. Interfaces that make sense to the person who built them but confuse everyone else. Error messages that are technically accurate but give users no path forward.</p>
<p>Some engineers develop this naturally. Others need direct exposure to users and production systems. Some of this is innate, some learned. Training helps. For engineers where this doesn't come naturally, checklists and processes provide a pragmatic approach. I've introduced these in some teams I've worked with. Those checklists help, but they never cover everything. For some, user empathy remains a deliberate effort rather than intuition.</p>
<h3 id="heading-pride-and-proactivity">Pride and proactivity</h3>
<p>Caring about your stuff. Having pride in what gets delivered. This is mindset rather than skills. Being proactive rather than reactive. Not just fixing issues when they happen, but building in ways that make future needs easier to address.</p>
<p>Pride in work shows up in details. Code that's clean not because someone will review it, but because it matters to the person writing it. Tests that are comprehensive because the engineer cares about correctness. Documentation that exists because the author wants others to succeed.</p>
<p>Engineers with this mindset don't wait for issues to be reported. They monitor production systems. They notice patterns in support tickets. They fix problems before they escalate. They refactor code that works but is fragile, because they know it will cause issues later.</p>
<p>There's also the opposite pattern. Engineers who do exactly what's asked, nothing more. Code that meets requirements but ignores obvious edge cases. Work that's technically complete but requires constant follow-up to actually function in production. It's a different approach to the work.</p>
<h2 id="heading-the-other-side-of-the-spectrum">The other side of the spectrum</h2>
<p>Some engineers have <a target="_blank" href="https://www.mehdi-khalili.com/dealing-with-net-negative-producing-programmers">net negative contribution</a> overall. This is uncomfortable territory, but real. I first noticed this in teams I was part of early in my career. Back then, it felt almost taboo to say it out loud. Over the years, as I saw the pattern repeat and found others discussing the same observation, it became clearer. Engineers who seem to deliver on the surface, but when you examine technical details and overall output, other engineers are fixing their issues in the background. The problems compound over time.</p>
<p>This isn't primarily about team friction, though I may have been lucky on that front. It's about bugs, subtle regressions, convoluted code that takes hours to understand. One developer writes unclear code in an afternoon, and several others spend far longer trying to figure out how it works.</p>
<p>The pattern is recognizable once you know to look for it. Code that works initially in production but starts breaking as usage grows or edge cases emerge. Code that passes review but becomes a maintenance burden weeks later. Changes that fix one issue but introduce two others. The engineer appears productive by conventional metrics. Tickets closed, code committed, features delivered. The cost shows up elsewhere.</p>
<p>Subtle regressions are the worst part. A change that works for the immediate use case but breaks edge cases that were previously handled. An optimization that improves one scenario but degrades others. A refactoring that simplifies the code the engineer understands but makes other parts more fragile. These issues often surface long after the original work, making the connection difficult to trace.</p>
<p>The DevOps model where developers write their own tests has a downside here. When engineers lack this anticipatory perspective, they tend to write happy-path tests that don't cover edge cases. A dedicated QA team with fresh eyes might catch what the original author missed. The efficiency gains from integrated testing can mask this gap. </p>
<p>The long-term effect compounds. Code review becomes more intensive, not because standards changed, but because trust eroded. What should be a straightforward feature takes longer because the foundation is unreliable.</p>
<h2 id="heading-ai-as-amplifier">AI as amplifier</h2>
<p>AI-assisted development compounds these traits. Engineers with strong anticipation and user empathy know what to look for in generated code. They ask the right questions: does this handle the edge cases? Will this be maintainable? Does this actually solve the user's problem? They spot when something doesn't fit, when the generated solution misses context that wasn't in the prompt. They use AI to move faster while maintaining the same judgment that made them effective in the first place. Their impact multiplies.</p>
<p>Engineers lacking these traits just produce more. More code, faster, with the same blind spots. Code that doesn't align with real needs. Edge cases ignored. Subtle bugs introduced because generated code was accepted without critical review. The maintenance burden grows faster. More code to debug, more regressions to trace. The cleanup may never happen. Instead, the team slows down: new features take longer to build on a fragile foundation.</p>
<p>The variance between effective and ineffective engineers widens. What was a 10x difference becomes larger. Net negative contribution scales faster too.</p>
<h2 id="heading-team-organization-matters">Team organization matters</h2>
<p>Individual mindset matters, but how teams are structured determines whether those individuals can be effective. I've seen projects struggle not because of technology, but because of team organization.</p>
<h3 id="heading-end-to-end-ownership">End-to-end ownership</h3>
<p>What I've seen work: end-to-end squads responsible for delivering full capabilities or missions. A capability often spans multiple services, libraries, and products. The squad implements across all of them, rather than waiting for each component team to prioritize and deliver their piece.</p>
<p>This requires engineers willing to learn and adapt. They work across codebases they don't fully own. They understand enough of each component to make meaningful contributions, even if they're not the experts.</p>
<p>The ownership question still matters. Each component has owners: engineers who maintain long-term responsibility, review contributions, and ensure quality. Similar to open source, where maintainers accept pull requests from contributors. The end-to-end squad contributes, the component owners review and merge. Both roles are necessary.</p>
<p>This model creates velocity. The squad doesn't wait for five different teams to schedule their piece of the work. They implement, get reviews from component owners, and ship. Context stays intact because the same people carry the capability from start to finish.</p>
<p>When a customer complains about a feature, the squad that built it hears the feedback directly. They can adjust quickly because they understand the full picture, even across component boundaries.</p>
<h3 id="heading-the-handoff-problem">The handoff problem</h3>
<p>The alternative pattern: four or more development teams plus separate design teams. This leads to endless debates, handoff issues, lack of overall ownership. Each transition point becomes friction. Specifications get misinterpreted. Context gets lost.</p>
<p>The cost of handoffs is underestimated. When one development team builds a component, another team integrates it, and a third team extends it, each handoff loses context. The first team doesn't know how their code will be used. The second team doesn't understand the original design decisions. The third team inherits assumptions they can't see. Each handoff makes the codebase more fragile and harder to evolve.</p>
<p>Each handoff also diffuses responsibility. When something goes wrong, finger-pointing replaces problem-solving. No single person or team feels accountable for the end-to-end outcome. Handoff-heavy organizations develop defensive behaviors: extensive documentation to protect themselves, meetings to coordinate, rigid processes to prevent miscommunication. The bureaucracy is a rational response to dysfunctional structure, but it makes the dysfunction worse.</p>
<h3 id="heading-the-debate-trap">The debate trap</h3>
<p>Too many teams also means endless debates. When four development teams need to align on an approach, each has different constraints, priorities, and contexts. The discussions become about finding consensus rather than finding the best solution.</p>
<p>Many architectural questions don't have clear answers in the abstract. The right choice depends on specific requirements, team skills, operational capabilities, and a dozen other factors that are hard to reason about theoretically. Building something surfaces these factors quickly.</p>
<p>This doesn't mean building without thinking. It means recognizing when discussion has diminishing returns. When the same points are being rehashed, when debates become circular, when decisions are blocked on hypothetical future requirements, it's time to build something and learn from reality. The teams that shipped features while others debated usually ended up with more refined solutions, improved through actual usage rather than speculation.</p>
<h2 id="heading-different-sides-of-culture">Different sides of culture</h2>
<p>Mindset, team structure, and bias toward action are different aspects of software engineering culture. They don't neatly cause each other, but they're part of the same picture. AI makes the mindset gap more visible, widening the difference between engineers who anticipate and those who don't.</p>
<p>Some engineers bring anticipation and empathy regardless of how teams are organized. Some team structures work despite individual gaps in mindset. When teams struggle to deliver, it's usually one of these human factors: engineers who don't think beyond the immediate task, structures that diffuse responsibility, or processes where alignment takes longer than building.</p>
<p>As AI tools become more common, these fundamentals matter more, not less. Design, build, test, deploy. Or debate endlessly. Teams with strong foundations use AI to move faster without sacrificing quality. Teams without those foundations just accumulate technical debt faster.</p>
]]></content:encoded></item><item><title><![CDATA[Breaking the doom-prompting loop with spec-driven development]]></title><description><![CDATA[Every developer using AI coding tools has experienced the loop. You prompt, the AI generates code, something isn't quite right, you prompt again, the AI breaks something else while fixing the first issue, you prompt again. An hour later you're deeper...]]></description><link>https://bitsofrandomness.com/breaking-the-doom-prompting-loop-with-spec-driven-development</link><guid isPermaLink="true">https://bitsofrandomness.com/breaking-the-doom-prompting-loop-with-spec-driven-development</guid><category><![CDATA[AI coding]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[SpecKit]]></category><category><![CDATA[Terraform]]></category><category><![CDATA[#IaC]]></category><category><![CDATA[ansible]]></category><category><![CDATA[ibm cloud]]></category><category><![CDATA[Azure]]></category><category><![CDATA[claude.ai]]></category><category><![CDATA[OpenAI Codex]]></category><category><![CDATA[copilot]]></category><category><![CDATA[cursor]]></category><dc:creator><![CDATA[Vincent Burckhardt]]></dc:creator><pubDate>Tue, 18 Nov 2025 23:28:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/B-V_c3O-f4E/upload/ca01b9aa2dfacc02198daabbc7f57216.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every developer using AI coding tools has experienced the loop. You prompt, the AI generates code, something isn't quite right, you prompt again, the AI breaks something else while fixing the first issue, you prompt again. An hour later you're deeper in the hole than when you started, caught in what's now called <a target="_blank" href="https://www.cio.com/article/4056586/doomprompting-endless-tinkering-with-ai-outputs-can-cripple-it-results.html">doomprompting</a>: you keep going because you've already invested so much time.  </p>
<p>This is the dark side of <a target="_blank" href="https://x.com/karpathy/status/1886192184808149383">vibe coding</a>, Andrej Karpathy's term for fully surrendering to AI-generated code without really understanding it. Karpathy himself noted it's "not too bad for throwaway weekend projects." For anything more substantial, the approach tends to collapse.  </p>
<p>I've been using <a target="_blank" href="https://github.com/github/spec-kit">spec-kit</a>, GitHub's toolkit for spec-driven development, and it's changed how I think about AI-assisted coding. The core insight is simple: catching problems in specifications costs far less than catching them in code.</p>
<h2 id="heading-the-shift-left-principle-applied-to-ai-coding">The shift-left principle applied to AI coding</h2>
<p>Shift-left testing is the idea that catching defects earlier in development is cheaper than catching them later. Everyone who's debugged a production issue knows this intuitively: finding a problem in requirements costs almost nothing, finding it in code review costs some rework, finding it in production costs a lot more.  </p>
<p>Spec-kit applies this principle to AI-assisted development, but shifts even further left. Instead of catching issues through testing code, you catch them through reviewing specifications. The four-phase workflow makes this explicit: Specify, Plan, Tasks, then Implement. Each phase has a gate where you review before proceeding.  </p>
<p>This feels familiar to anyone who studied software engineering formally. I remember university projects where we spent weeks on specifications and architecture before writing a line of code. The discipline felt excessive at the time, but the coding phase was remarkably smooth when we finally got there. Spec-kit brings that same rigor to AI-assisted development.</p>
<h2 id="heading-what-spec-kit-actually-provides"><strong>What spec-kit actually provides</strong></h2>
<p>The toolkit is agent-agnostic and works with Claude Code, GitHub Copilot, Cursor, and other AI coding tools. At its core, it's a set of slash commands that guide you through structured phases:  </p>
<p>The <code>/specify</code> command forces you to articulate what you're building. The <code>/plan</code> command generates research and technical direction. The <code>/tasks</code> command breaks the plan into discrete implementation steps. Finally, <code>/implement</code> executes those tasks.  </p>
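<p>To make this concrete, a session might look roughly like the following. The prompts here are illustrative examples, not taken from spec-kit's documentation:</p>
<pre><code>/specify Users can upload receipts and see monthly spending totals per category
/plan    Python backend, SQLite storage, no external services beyond file upload
/tasks
/implement</code></pre>
<p>Each command pauses at a review gate, so you can edit the generated specification or plan before the next phase consumes it.</p>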
<p>Each phase produces markdown files that serve as both documentation and AI context. The specifications, plans, and task lists persist across sessions, acting as memory that keeps the AI aligned with your intent.  </p>
<p>Spec-kit also introduces what it calls a "constitution" (I prefer "principles," but the concept matters more than the name). This file establishes cross-cutting rules for your project: testing approach, coding standards, architectural constraints. These non-functional requirements apply to everything the AI generates.</p>
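<p>As an illustration, a constitution file might pin down rules like the sketch below. The specific rules are hypothetical examples, not spec-kit defaults:</p>
<pre><code># Project Constitution

## Testing
- Every feature ships with unit tests; integration tests cover any external API.

## Code standards
- Type hints required; linting must pass before a task is considered done.

## Architecture
- No new runtime dependencies without explicit approval in the plan.
- All configuration comes from environment variables, never hardcoded values.</code></pre>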
<h2 id="heading-how-the-flow-changes-day-to-day-work">How the flow changes day-to-day work</h2>
<p>My workflow with spec-kit looks different from the typical AI coding loop. I spend time reviewing and editing the specifications and task list, then let the AI implement the full feature. I treat the AI less like a pair programmer and more like a developer I'm delegating work to. I review the resulting code the way I'd review a pull request from a human team member.  </p>
<p>This mental model matters. With pair programming, you're watching every keystroke. With delegation, you're reviewing outcomes against specifications. The latter scales better with AI tools that can implement substantial features autonomously.  </p>
<p>The plan phase has become the most valuable. The AI performs research on the technical direction, and I've learned things from this process. More importantly, I catch misunderstandings early. During one project, the plan revealed the AI assumed an IBM Cloud serverless service was deployed on a VPC, which is incorrect. Catching that during plan review was far cheaper than discovering it through broken infrastructure code.  </p>
<p>I don't review every single code change anymore. Instead, I review the specifications carefully, let implementation run with auto-accept enabled, do smoke testing, then review the full changeset. If issues emerge, I iterate through the full flow (plan to tasks to implementation) rather than jumping straight to code fixes. This keeps the specifications accurate and aligned with what actually got built.</p>
<h2 id="heading-the-overhead-question">The overhead question</h2>
<p>Spec-kit adds overhead. For simple tasks, that overhead isn't worth it.  </p>
<p>But for larger features, I've found the investment pays back. The specifications force me to think through requirements properly. Architectural problems surface during plan review rather than after I've invested in code. And I avoid the doom-prompting loop because ambiguities in my thinking get resolved during specification, not through trial-and-error prompting.  </p>
<p>This parallels traditional development. Some developers code first and spend months fixing bugs and refactoring. Others invest in architecture and specifications upfront. Both approaches can work, but they have different risk profiles. For complex work, the methodical approach tends to win. The same applies to AI-assisted development.  </p>
<p>The token usage goes up when using spec-kit. You're generating specifications, plans, and task lists before writing code. But these tokens typically pay for themselves by avoiding the doom-prompting loop where you might burn through tokens endlessly without making progress.</p>
<h2 id="heading-prompt-based-flows-versus-coded-pipelines">Prompt-based flows versus coded pipelines</h2>
<p>One aspect of spec-kit's design surprised me. My initial instinct would have been to implement most of the workflow in a traditional programming language with explicit control flow. Instead, spec-kit encapsulates the flow in detailed prompts with minimal supporting scripts.  </p>
<p>This approach works well with frontier models. The prompts describe phases in natural language, and the AI follows them reliably. The templating approach with gates provides deterministic outcomes without requiring coded orchestration nodes like you'd find in LangGraph.  </p>
<p>I suspect this approach would be less reliable with non-frontier models. The ability to follow complex, multi-phase instructions consistently requires the kind of instruction-following that frontier models do well.  </p>
<p>Beyond the underlying model, I've noticed the tools available in each AI assistant matter. The plan phase benefits from web search, codebase search, and other research capabilities. Claude Code includes these out of the box, including deep search for thorough research. Other AI assistants may lack some of these capabilities, and I've seen the most variance in plan quality when research tools are limited.  </p>
<p>Configuring <a target="_blank" href="https://modelcontextprotocol.io/">MCP</a> tools before running through the flow also improves results. For instance, I configure tools for Terraform module registry search and cloud provider documentation lookup. These help the AI generate better-informed plans.</p>
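<p>MCP servers are typically registered in a small JSON configuration. The sketch below shows the general shape; the server shown is HashiCorp's published Terraform MCP server, but treat the exact image name and invocation as assumptions to verify against its documentation:</p>
<pre><code>{
  "mcpServers": {
    "terraform": {
      "command": "docker",
      "args": ["run", "-i", "--rm", "hashicorp/terraform-mcp-server"]
    }
  }
}</code></pre>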
<h2 id="heading-adapting-for-infrastructure-as-code">Adapting for infrastructure as code</h2>
<p>When I started using spec-kit, I thought it would apply directly to infrastructure as code. As I progressed, I realized IaC has specific characteristics that need different handling: the declarative nature of tools like Terraform, the need to separate cloud-agnostic requirements from provider-specific implementations, governance concerns around security and cost that differ from application code, and validation against actual cloud provider APIs and module registries.  </p>
<p>I ended up creating <a target="_blank" href="https://github.com/IBM/iac-spec-kit">iac-spec-kit</a> and open-sourced it to get more collaboration on the approach. It started as a fork, but ended up as a complete reimplementation of the commands, instructions, and templates. The only common layer is around the installer and the overall approach. The templates and prompts needed to be tuned specifically for infrastructure concerns.  </p>
<p>The goal is to fill a gap where users can start with a high-level requirement like "deploy WordPress" or "set up a three-tier web app" and have the AI guide them through specification, planning, and code generation with review gates at each phase. The toolkit is cloud-agnostic and works with AWS, Azure, GCP, IBM Cloud, and others. Early tests look promising. I documented one end-to-end example at <a target="_blank" href="https://github.com/vburckhardt/wordpress-ibm-cloud">vburckhardt/wordpress-ibm-cloud</a>, which shows the full workflow from initial requirements through generated Terraform code.  </p>
<p>A specific focus has been getting AI to compose higher-level Terraform modules rather than using lower-level providers directly. AI-generated code that glues together curated, supported modules is more maintainable and supportable than code that reinvents infrastructure patterns using primitives. It's similar to teaching AI to use a well-designed library instead of writing everything from scratch.</p>
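<p>A hedged sketch of what that difference looks like in Terraform. The module and inputs below come from the public registry's AWS VPC module, chosen here purely for illustration; the point is that a few intent-level inputs replace dozens of hand-wired <code>aws_vpc</code>, <code>aws_subnet</code>, and route table resources:</p>
<pre><code>module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~&gt; 5.0"

  name = "demo"
  cidr = "10.0.0.0/16"

  azs             = ["eu-west-1a", "eu-west-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]

  # NAT gateways, route tables, and associations are handled by the module
  enable_nat_gateway = true
}</code></pre>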
<h2 id="heading-what-this-enables">What this enables</h2>
<p>Spec-kit enables going beyond vibe coding. The structured flow feels right because it aligns with how sound engineering should work. You're not just prompting and hoping. You're defining intent, reviewing plans, and delegating implementation.  </p>
<p>The specifications also work well for collaboration. They're markdown files that can be checked into source control and versioned. I can see workflows where teams have validation gates on specifications and plans before implementation begins. The artifacts serve as shared understanding, not just AI context.  </p>
<p>For resuming sessions, the specification and task files act as memory. Instead of re-explaining context to the AI, the toolkit instructs it to load the existing artifacts. This makes long-running projects more manageable.  </p>
<p>The structured flow also enables working on multiple features in parallel. While the AI implements one feature autonomously, I can work on specifications for the next one. This pattern is emerging more broadly with tools like <a target="_blank" href="https://openai.com/index/introducing-codex/">OpenAI Codex</a> that explicitly support parallel task execution. I expect this to become more common. The implications cut both ways: it lets independent developers and small startups move faster with limited headcount, but it also raises questions about expectations placed on developers in corporate settings.  </p>
<p>The flow does require discipline. It's tempting to skip straight to implementation when you think you know what you want. But ambiguity in your thinking becomes apparent when you try to write it down as a specification. That's the point. The specification phase forces clarity before you've invested in code.</p>
<h2 id="heading-when-its-worth-it"><strong>When it's worth it</strong></h2>
<p>Spec-kit won't eliminate all the friction from AI-assisted development. The overhead is real, and it's not worth it for every task. But for substantial features where you'd otherwise end up in a doom-prompting loop, the structured approach catches problems when they're cheap to fix.  </p>
<p>The shift-left principle applies: review specifications, not just code. Treat AI implementation as delegation, not pair programming. Invest in the plan phase when research can improve technical direction.  </p>
<p>If you're frustrated with vibe coding results on anything beyond weekend projects, spec-driven development is worth trying. The discipline feels familiar to anyone who's done rigorous software engineering, and the payoff is similar: smoother implementation because the thinking happened upfront.</p>
]]></content:encoded></item><item><title><![CDATA[The shift toward agentic development]]></title><description><![CDATA[Over the past two years, software development has changed in ways that feel significant. These are patterns I'm noticing both in my own work and across the industry.
I've been using AI coding tools in personal projects for over two years. The evoluti...]]></description><link>https://bitsofrandomness.com/the-shift-toward-agentic-development</link><guid isPermaLink="true">https://bitsofrandomness.com/the-shift-toward-agentic-development</guid><category><![CDATA[ai agents]]></category><category><![CDATA[Software Engineering]]></category><category><![CDATA[Futureofwork]]></category><category><![CDATA[llm]]></category><category><![CDATA[thoughts]]></category><category><![CDATA[software development]]></category><category><![CDATA[Software Testing]]></category><category><![CDATA[software architecture]]></category><category><![CDATA[opinion pieces]]></category><category><![CDATA[cursor IDE]]></category><category><![CDATA[copilot]]></category><category><![CDATA[claude-code]]></category><category><![CDATA[SpecKit]]></category><dc:creator><![CDATA[Vincent Burckhardt]]></dc:creator><pubDate>Mon, 03 Nov 2025 23:31:40 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/B2gUsXg2vNc/upload/4b71c89b2a1857fa5170aff870b19bbb.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Over the past two years, software development has changed in ways that feel significant. These are patterns I'm noticing both in my own work and across the industry.</p>
<p>I've been using AI coding tools in personal projects for over two years. The evolution has been clear. It started with copying code from ChatGPT and pasting it into an IDE. Then came tab completion with Cursor and GitHub Copilot, which was helpful but not transformative. The real shift happened when Cursor introduced agentic capabilities, before Copilot had similar features. More recently, I got access to IBM Bob at work, which resembles Cursor 1.X and GitHub Copilot in approach. Most recently, Claude Code with its predominantly agentic workflow has reinforced what seems to be the direction things are heading. <a target="_blank" href="https://cursor.com/blog/2-0">Cursor 2.0's release</a> in late 2025 appears to confirm this trend, with the agentic approach as the default option and the traditional IDE features taking a secondary role.</p>
<h2 id="heading-were-moving-beyond-the-ide-as-the-center-of-our-work">We're moving beyond the IDE as the center of our work</h2>
<p>The most striking change is where time actually gets spent during development. With Cursor 2.0's release in October 2025, it became undeniable: the vast majority of time is now spent in the agentic part of the tool rather than traditional IDE features. There's something ironic about this, given that Cursor's name presumably references the blinking cursor where we type code. That cursor, that act of typing line by line, increasingly feels like it's from a different era.</p>
<p>What I mean by "agentic" is end-to-end generation with supervision rather than autocomplete. This isn't about tab completion suggesting the next line (though Cursor does that too). It's about describing what needs to be built, providing context about the codebase, and then supervising as the AI generates entire features across multiple files. This is a fundamentally different interaction model from copying snippets from ChatGPT two years ago.</p>
<p>But here's what's important to clarify: this doesn't mean the IDE is dead or that understanding code is no longer necessary. Rather, the center of gravity has shifted. Where development used to mean spending the day writing code with occasional AI assistance, it now means orchestrating AI to write code, with occasional manual intervention. It's a subtle but fundamental difference in how the work feels.</p>
<h2 id="heading-the-emergence-of-specification-frameworks">The emergence of specification frameworks</h2>
<p>This shift has created a need for better ways to communicate intent to AI systems and to go beyond vibe coding. <a target="_blank" href="https://github.com/github/spec-kit">Spec Kit</a>, which GitHub released in September 2025, illuminates something important: what we're really dealing with is a structured way to create advanced prompts that prepare the AI for the coding phase.</p>
<p>The term "advanced prompts" isn't meant dismissively. Spec Kit implements a four-phase process: Specify, Plan, Tasks, and Implement. What this does, conceptually, is force thinking through the problem at a higher level of abstraction before any code gets written. The tool itself isn't magic; it's essentially a framework for breaking down requirements into pieces that an AI can reliably execute. But that structure matters enormously.</p>
<p>What's particularly valuable is that this approach allows making corrections at the specification level, which is much cheaper than making them in the code. If there's ambiguity or a misunderstanding, catching it during the specification phase means you don't waste time generating and then debugging incorrect code. This is another manifestation of shifting left in the development process: catching issues earlier, when they're cheaper to fix.</p>
<p>This addresses something important that often gets glossed over in discussions about AI coding. There's a lot of talk about "vibe coding" or full-app generation, where you describe what you want at a high level and the AI just builds it. That sounds appealing, but in practice, it rarely works well for anything beyond trivial examples. What Spec Kit does is something more subtle and more valuable. It's not trying to enable vibe coding. It's trying to enable structured thinking that then guides the AI through a disciplined implementation process.</p>
<p>The pattern is consistent: when the specification phase gets skipped in favor of asking the AI to build something directly, results are almost always disappointing. The AI might generate code that looks reasonable at first glance, but it doesn't quite align with what was actually needed. That's not because the AI is bad at coding; it's because the specification wasn't clear enough, edge cases weren't thought through, or the intent was ambiguous in ways that only became apparent when seeing the implementation.</p>
<p>By writing clear specifications first, using a framework like Spec Kit, ambiguities in thinking get caught before they turn into code that has to be rewritten. This is fundamentally different from just prompting an AI to "build me a user authentication system." The specification phase forces articulation of details like: what should happen when a user tries to log in with an expired session? How should password reset tokens be generated and validated? What's the token lifecycle? These details matter, and thinking them through upfront leads to much better results from the AI.</p>
<p>GitHub's own explanation for why they built Spec Kit captures this well. They wrote that "we treat coding agents like search engines when we should be treating them more like literal-minded pair programmers." That's exactly right. These tools do what you ask, not what you meant. The specification step is where you figure out if what you're asking for is actually what you mean. It's the difference between treating AI as a magic wand that somehow divines your intent versus treating it as a very capable but very literal collaborator that needs clear instructions.</p>
<h2 id="heading-the-workflow-is-inverting-from-coding-to-planning-and-review">The workflow is inverting: from coding to planning and review</h2>
<p>This brings us to what may be the most significant change happening right now. The workflow of software development is inverting. Instead of spending most time coding with some time spent planning and reviewing, development now means spending most time planning and reviewing with the actual coding handled by AI.</p>
<p>The time breakdown appears to follow a pattern: roughly 40% goes into setting up context and writing specifications, maybe 20% waiting while code gets generated, and then 40% reviewing what was produced and verifying it does what was intended. Other developers describe similar patterns, which suggests this isn't idiosyncratic to any individual workflow.</p>
<p>What strikes me about this distribution is how much it resembles the way senior technical people have always worked. An architect or tech lead typically doesn't write most of the code themselves. They do the planning, set the strategic direction, delegate the implementation to other engineers, and then review the output through pull requests to ensure alignment. The difference now is that the "other engineers" doing the implementation are increasingly AI systems rather than junior human developers.</p>
<p>It's worth clarifying what "review" means in this context. The pull request review process isn't primarily about checking code quality anymore. Tools like linters, formatters, and code coverage systems have handled quality checks for years. The review is about verifying alignment with specifications and customer needs. Does the implementation actually solve the problem we're trying to solve? Are edge cases handled correctly? Does this fit with the broader system architecture? These are the questions that require human judgment.</p>
<p>There's another valuable aspect to reviewing AI-generated code. Spending time examining the implementation often generates additional thoughts and perspectives on the next steps. It's a bit like bouncing ideas off a colleague. You see how the AI interpreted your specification, which might reveal gaps in your thinking or suggest alternative approaches you hadn't considered. This feedback loop, where reviewing code feeds into the next iteration of development, is valuable beyond just catching errors.</p>
<h2 id="heading-the-shift-to-smaller-senior-heavy-teams">The shift to smaller, senior-heavy teams</h2>
<p>Instead of a team structure with 3-4 senior engineers coordinating with a larger group of junior engineers (sometimes outsourced) who do the actual coding, it's now possible to have effective teams of just 3-4 senior engineers who do the planning, specification, and review while AI handles the implementation. The coordination overhead drops dramatically because there are fewer handoffs and less need for detailed task delegation to human implementers.</p>
<p>The implications here are sensitive but worth addressing directly. This shift seems likely to impact the people who currently do most of the actual coding: junior engineers and potentially outsourced engineering teams. The job market seems to reflect this, although it is still early and these signals should be taken with some caveats. More than half of open roles are now at senior level or above, and companies are increasingly prioritising candidates with AI engineering skills. The emphasis is shifting toward engineers who can architect, specify, and review rather than those who primarily implement.</p>
<p>For those with experience, the change has been largely positive. Iteration happens much faster. The feedback loop is tighter, which means catching problems earlier. Instead of leaving a review comment on a pull request and waiting days for the developer to make changes and submit another version, that next iteration can happen almost immediately. This cuts down significantly on meetings and coordination overhead. But this advantage accrues to those who have the experience to know when generated code is subtly wrong or when a specification was ambiguous. The question of what this means for people trying to enter the field without that foundation is a real concern.</p>
<p>The longer-term implications are harder to reason about. If we're increasingly selecting for senior engineers and reducing opportunities for juniors, where do the senior engineers of 2030 come from? There will always be a need for humans in the loop, if only because someone needs to be accountable. Someone needs to interface with customers, make judgment calls when requirements conflict, and take responsibility for the outcomes. But the path to building that expertise may need to look different than it has historically.</p>
<h2 id="heading-frontier-models-past-the-big-bang-phase">Frontier models: past the big bang phase</h2>
<p>The final observation relates to the AI models themselves. Over the past year or so, there haven't been dramatic breakthroughs in raw text-generation intelligence of the kind that characterized the 2022-2023 period.</p>
<p>What's noticeable, both in industry positioning and in practice, is that OpenAI and Anthropic seem to be focusing more on building ecosystems around their existing models rather than racing to release dramatically more intelligent versions. By ecosystem, I mean tools and integrations that make the models more useful: Claude Code as a coding interface, Claude's integration with third-party providers, the emergence of <a target="_blank" href="https://www.anthropic.com/news/model-context-protocol">Model Context Protocol (MCP)</a> as a standard way to connect models to external tools and data, OpenAI's Atlas browser for web interaction. The focus is on making existing model capabilities more accessible and practical rather than just making the models themselves smarter.</p>
<p>A separate but related trend is the rapid commoditization of model capabilities. Claude Haiku 4.5 came out in October 2025, and Anthropic explicitly stated that it delivers similar levels of coding performance to what Claude Sonnet 4 did five months earlier, but at one-third the cost and more than twice the speed. What was frontier performance half a year ago is now available in a much smaller, faster, cheaper model. This pattern of capabilities trickling down to smaller, more efficient models seems to be accelerating.</p>
<p>This pattern suggests we're in a different phase now. The big bang phase of "let's make the model smarter by scaling it up" seems to be giving way to a phase of optimization and finding better ways to use what we already have. Advancement is happening more in the tools around the models: things like agentic workflows, spec-driven development toolkits like Spec Kit, and improvements in speed and cost efficiency.</p>
<p>The technical leaders in the field seem to be saying similar things, though they phrase it in different ways. Yann LeCun from Meta has been quite vocal that current LLMs represent something of a dead end, though he's talking more about AGI than about practical coding applications. The broader point (that we're not seeing the same rate of improvement from simply making models bigger) seems increasingly accepted.</p>
<p>For practitioners, what this means is that competitive advantage increasingly comes from how well you can orchestrate these tools rather than from having access to a slightly better model. The models themselves are commoditizing. Open-source models, such as Meta's Llama family, are catching up to the proprietary ones. The difference between Claude Sonnet 4.5 and GPT-4 feels less significant than the difference between someone who knows how to write good specifications and someone who doesn't.</p>
<p>Looking back over a two-year journey with these tools, the pattern becomes clearer. The shift isn't that model intelligence matters less now. Rather, the LLM architecture itself seems to be reaching a point where making it significantly smarter becomes increasingly difficult. The dramatic jumps in capability we saw from GPT-3 to GPT-4 are harder to reproduce. This is why the industry focus has shifted toward building better ecosystems around existing models and optimizing what we already have. It's not that intelligence doesn't matter; it's that we're hitting diminishing returns on the "make it smarter" approach, so the innovation is happening elsewhere.</p>
<h2 id="heading-where-this-leaves-us">Where this leaves us</h2>
<p>These are observations and patterns, not predictions. The software engineering profession is changing in real-time, and it's worth trying to make sense of it while working in the middle of that change.</p>
<p>The implications for the profession are significant and somewhat uncomfortable to think about. There's a real question about how people develop the expertise needed to be effective in this new model if the traditional path of starting as a junior engineer and learning by doing is being disrupted. There's also a question about what happens to the large population of developers whose primary value was implementation speed rather than system design or architectural judgment.</p>
<p>But there's also opportunity here. For experienced engineers, these tools are genuine force multipliers. Building more, iterating faster, and maintaining higher quality than before is now possible. For organizations, smaller teams can accomplish more with better tooling and clearer specifications. And for the field as a whole, if we can figure out how to preserve the knowledge-building pathway while taking advantage of AI assistance, we might end up with a profession that's more focused on problem-solving and less on the mechanical aspects of translating solutions into code.</p>
<p>Are others seeing similar patterns? Does the day-to-day work feel different than it did a year or two ago? The conversation about AI and software development often feels polarized between "nothing will change" and "everyone will be replaced." The reality seems to be somewhere in the messy middle. Things are definitely changing, the changes are meaningful, but the future isn't predetermined. How we adapt, what we choose to value, and how we structure the profession going forward will shape what software engineering looks like in five or ten years.</p>
]]></content:encoded></item></channel></rss>