AI in Software Delivery: What’s Working, What’s Hard, and What We’re Still Learning

AI-assisted development is no longer a side experiment. Across Stack Builders teams, senior technical leads are using AI in day-to-day delivery: generating components, supporting specification-driven workflows, reviewing larger pull requests, analyzing team metrics, and even coordinating parallel agents to resolve batches of issues.

But the most interesting insight from a recent Senior Tech Leads discussion was not simply “AI makes us faster.” The more useful takeaway was subtler: AI changes where the hard parts of software delivery live.

Writing code may be faster. Understanding the right request, measuring real progress, validating correctness, and keeping workflows aligned are becoming the new pressure points.

In this post, we recap that conversation with our senior tech leads and share discussion questions your team can use to explore AI for software delivery with more confidence.

1. Specification-driven development is promising, but workflow alignment is still tricky

Several teams are experimenting with specification-driven development. One team compared a lighter-weight open-spec approach against a more structured framework that prescribes more of the workflow. The early signal was positive: specs help guide AI output and make implementation feel more controlled.

The catch? Synchronization.

When tasks live in a project tracker, tests live in a spec framework, and AI operates across both, teams start asking new questions:

Is the tracker still the source of truth?
Should tasks be declared closer to the repository?
How do we prevent specs, tickets, prompts, and PRs from drifting apart?

This is a familiar software quality problem wearing a new hat. AI does not remove the need for shared context. It makes stale context more expensive because the system can confidently accelerate in the wrong direction.
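One way to make drift visible is to compare the task identifiers each system knows about. The sketch below is a minimal illustration, assuming tickets, spec entries, and PRs all reference the same task IDs; the `DriftReport` type, the `find_drift` function, and the `TASK-n` identifiers are hypothetical names, not part of any tracker or spec framework discussed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriftReport:
    missing_from_spec: set[str]     # tracked tasks with no spec entry
    missing_from_tracker: set[str]  # spec entries with no tracker ticket
    prs_without_task: set[str]      # PR task references that match no ticket


def find_drift(tracker_ids: set[str], spec_ids: set[str], pr_task_ids: set[str]) -> DriftReport:
    """Compare the task IDs known to each system using plain set differences."""
    return DriftReport(
        missing_from_spec=tracker_ids - spec_ids,
        missing_from_tracker=spec_ids - tracker_ids,
        prs_without_task=pr_task_ids - tracker_ids,
    )


# Made-up identifiers, for illustration only.
report = find_drift(
    tracker_ids={"TASK-1", "TASK-2", "TASK-3"},
    spec_ids={"TASK-2", "TASK-3", "TASK-4"},
    pr_task_ids={"TASK-3", "TASK-9"},
)
print(report)
```

A check like this does not decide where the source of truth should live, but running it regularly shows how far the candidates have drifted from each other.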

Conversation starter: Where should the source of truth live for AI-assisted work: the tracker, the repo, the spec, or some carefully stitched combination?

2. AI can reduce coding time while increasing cognitive load

One recurring theme was mental load. AI can generate more code, larger PRs, and broader solutions, but humans still need to understand what changed, why it changed, and whether it fits the domain.

One lead described using custom AI skills to explain large requests more clearly. Instead of only asking AI to produce code, the team used AI to research learning techniques and package them into a reusable skill that helps break down concepts, goals, non-goals, and tradeoffs.

That is a useful pattern: AI as a comprehension tool, not just a production tool.

The old bottleneck was often “Can we implement this?” The new bottleneck may be “Can we understand and validate this fast enough?”

Conversation starter: What workflows could help reviewers reduce cognitive load when reviewing AI-generated or AI-assisted code?

3. The best AI workflows may be project-specific, not generic

A front-end example made this clear. AI was helpful for generating design system components and CMS-backed sections from screenshots, but it was not perfect at interpreting Figma-style visuals. The team improved results by refining prompts and creating project-specific instructions.

Another lead described this as moving from “fix the AI’s output” to “fix the process.” Instead of repeatedly correcting the same mistakes manually, teams can update the harness: prompts, rules, memories, examples, restrictions, and validation loops.

That mindset is important. If the same AI mistake happens twice, it may not be a code problem. It may be a workflow design problem.
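As a rough sketch of what updating the harness could look like, a recurring mistake can be encoded once as an automated check instead of being corrected by hand in every review. The two rules below (hard-coded hex colors and leftover debug logging) are hypothetical examples chosen for a front-end project, not rules any team in the discussion reported using.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    pattern: str  # regular expression matched against each line
    advice: str


# Hypothetical rules; a real project would derive these from its own recurring mistakes.
RULES = [
    Rule("no-hardcoded-colors", r"#[0-9a-fA-F]{6}\b", "Use a design token instead of a hex color."),
    Rule("no-debug-logging", r"\bconsole\.log\(", "Remove debug logging before review."),
]


def check(source: str) -> list[str]:
    """Return one finding per rule violation, in line order."""
    findings = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for rule in RULES:
            if re.search(rule.pattern, line):
                findings.append(f"line {line_number}: {rule.name}: {rule.advice}")
    return findings


generated = 'const color = "#ff0000";\nconsole.log(color);\nexport default color;\n'
findings = check(generated)
for finding in findings:
    print(finding)
```

The same findings can be fed back to the model as part of a validation loop, so the rule does the correcting instead of a reviewer.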

This aligns with Stack Builders’ broader AI positioning: AI should be applied where it drives value while preserving quality, security, and long-term maintainability.

Conversation starter: What recurring AI mistakes do you see on our team that should become team-level rules, tests, or reusable skills?

4. Measuring AI impact is harder than counting PRs

One team started measuring the impact of AI by comparing a period before adoption with a period after it. They reported a 36% increase in opened PRs during an initial measurement window, but were careful not to treat that as the whole story.

That caution matters. More PRs can mean more throughput, but they can also mean more review load, more incomplete work, or more downstream coordination.

The group discussed alternative metrics, including:

Time from ticket opened to ticket completed
PRs opened, closed, and merged
Review comments addressed
Cycle time through QA and staging
Team-level productivity frameworks such as DORA and SPACE

The group also questioned whether any single metric can tell a reliable story.

A useful framing emerged: AI metrics should distinguish developer activity from delivery outcomes. A PR is activity. A validated, deployed, maintainable change is an outcome.
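To make the activity-versus-outcome distinction concrete, here is a minimal sketch that computes one activity metric (change in PRs opened) next to one outcome-oriented metric (median ticket cycle time). All numbers and dates are made up for illustration and are not the team's data; the function names are our own.

```python
from datetime import date
from statistics import median


def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) * 100 / before


def median_cycle_days(tickets: list[tuple[date, date]]) -> float:
    """Median number of days from ticket opened to ticket completed."""
    return median((done - opened).days for opened, done in tickets)


# Illustrative data only: (opened, completed) pairs per ticket.
before_tickets = [
    (date(2025, 1, 2), date(2025, 1, 9)),
    (date(2025, 1, 3), date(2025, 1, 8)),
    (date(2025, 1, 6), date(2025, 1, 15)),
]
after_tickets = [
    (date(2025, 3, 3), date(2025, 3, 9)),
    (date(2025, 3, 4), date(2025, 3, 12)),
    (date(2025, 3, 5), date(2025, 3, 12)),
]

print(percent_change(before=40, after=52))  # activity: PRs opened
print(median_cycle_days(before_tickets))    # outcome: cycle time before
print(median_cycle_days(after_tickets))     # outcome: cycle time after
```

In this invented example, opened PRs rise by 30% while median cycle time stays at 7 days, which is exactly the kind of gap a single activity metric would hide.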

Conversation starter: What combination of metrics best captures AI-assisted delivery without rewarding code volume for its own sake?

5. Story points may need a new meaning

AI challenges traditional estimation. A task that used to take several days might now be completed in an afternoon with the right model, context, and review path.

The discussion surfaced two possible shifts.

One approach is to estimate complexity instead of time. Easy tasks may be handled well by AI with lighter review, while harder tasks require more careful human validation.

Another approach is to estimate the level of judgment required. A task may not be “large” because it takes a long time to code. It may be large because only someone with deep domain knowledge can verify that the approach is correct.

That is a sharp insight. In AI-assisted delivery, the scarce resource may not be typing. It may be judgment.
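One way to explore this is to record an estimate along more than one dimension and let the judgment score, not the coding effort, decide the review path. The sketch below is only an illustration: the three 1-to-5 scales, the thresholds, and the review path names are arbitrary assumptions, not a scheme the leads agreed on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    implementation_effort: int  # 1 (trivial) to 5 (large)
    review_risk: int            # 1 (easy to verify) to 5 (hard to verify)
    domain_judgment: int        # 1 (anyone can verify) to 5 (needs deep domain knowledge)


def review_path(estimate: Estimate) -> str:
    """Pick a review path; thresholds are arbitrary placeholders."""
    if estimate.domain_judgment >= 4:
        return "domain expert review"
    if estimate.review_risk >= 3:
        return "standard review"
    return "light review"


quick_to_code_but_subtle = Estimate(implementation_effort=1, review_risk=2, domain_judgment=5)
routine_change = Estimate(implementation_effort=2, review_risk=1, domain_judgment=1)

print(review_path(quick_to_code_but_subtle))
print(review_path(routine_change))
```

Note that implementation effort does not influence the review path at all here: a task that is quick to code can still be routed to a domain expert.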

Conversation starter: Should estimation account for implementation effort, review risk, domain judgment, or all three?

6. Parallel agents can unlock bursts of progress, but they need human triage

One experiment involved using AI workflows to delegate around 15 issues to multiple agents in parallel. The agents investigated issues, triaged them, resolved many of the clear ones, and surfaced the cases that needed human input. The result: many PRs were created quickly, while most human attention went to the few issues that actually required judgment.

That pattern feels important: AI can fan out across known work, but humans still need to define success conditions, review results, and handle ambiguity.
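The fan-out-then-triage shape can be sketched with a thread pool. This is a simplified illustration, not the workflow the team ran: `run_agent` is a hypothetical placeholder for a real agent call, and the `has_clear_fix` flag stands in for whatever the agent's investigation would conclude.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    key: str
    has_clear_fix: bool  # stand-in for the result of the agent's investigation


@dataclass(frozen=True)
class Outcome:
    issue_key: str
    status: str  # "pr_opened" or "needs_human"


def run_agent(issue: Issue) -> Outcome:
    # Hypothetical placeholder: a real agent would investigate and attempt a fix here.
    status = "pr_opened" if issue.has_clear_fix else "needs_human"
    return Outcome(issue.key, status)


def fan_out(issues: list[Issue], max_workers: int = 4) -> tuple[list[str], list[str]]:
    """Run one agent per issue in parallel, then split results for human triage."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outcomes = list(pool.map(run_agent, issues))  # map preserves input order
    opened = [o.issue_key for o in outcomes if o.status == "pr_opened"]
    needs_human = [o.issue_key for o in outcomes if o.status == "needs_human"]
    return opened, needs_human


# Fifteen made-up issues; every fifth one is marked as needing human judgment.
issues = [Issue(f"ISSUE-{n}", has_clear_fix=(n % 5 != 0)) for n in range(1, 16)]
opened, needs_human = fan_out(issues)
print(len(opened), "PRs opened; needs human input:", needs_human)
```

The useful output is the second list: it is the short queue where human attention should go first.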

Other tools and workflows were mentioned for monitoring PRs, fixing build failures, or looping until a success condition is met. These are promising, but they also raise a governance question: how much autonomy should we give agents before the review process becomes the true delivery bottleneck?
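One simple guardrail for "loop until a success condition is met" is an explicit attempt budget, so the agent hands control back to a human instead of retrying indefinitely. The sketch below assumes a hypothetical `attempt` callback that returns True when the success condition holds, such as a passing build; it is not tied to any specific tool mentioned in the discussion.

```python
from typing import Callable


def loop_until(attempt: Callable[[int], bool], max_attempts: int) -> tuple[bool, int]:
    """Call attempt(n) until it succeeds or the budget runs out.

    Returns (succeeded, attempts_used). On failure the caller is expected
    to escalate to a human rather than keep retrying.
    """
    for n in range(1, max_attempts + 1):
        if attempt(n):
            return True, n
    return False, max_attempts


def build_passes(attempt_number: int) -> bool:
    # Stand-in for "apply a fix and run the build"; here it passes on the third try.
    return attempt_number >= 3


print(loop_until(build_passes, max_attempts=5))
print(loop_until(build_passes, max_attempts=2))
```

The attempt budget is where the autonomy question becomes a concrete, reviewable number.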

Conversation starter: Which tasks are safe for agent swarms, and which should remain deliberately human-led?

7. Model choice and access still shape the developer experience

Teams also reported uneven experiences across tools and models. Some found strong results with Copilot for TypeScript and smaller Haskell changes. Others reported friction from losing access to preferred tools, hallucinations, slower models, or token limits.

This is a reminder that AI adoption is not just a methodology question. It is also an infrastructure question. The same workflow can feel smooth or painful depending on model quality, token availability, integration, latency, and organizational constraints.

Conversation starter: Should teams standardize on one AI toolchain, or preserve flexibility so each project can use the best-fit model and workflow?

What this conversation tells us

The strongest theme from the discussion is that AI adoption is becoming less about novelty and more about engineering discipline.

The teams are not asking, “Can AI write code?” They are asking better questions:

How do we keep AI aligned with project-specific standards?
How do we reduce mental load during review?
How do we measure actual delivery impact?
How should estimation evolve?
What requires senior judgment?
Which workflows should be reusable across projects?
Where does AI introduce new risks or bottlenecks?

That is the real work now. AI can accelerate output, but durable software still depends on clear process, careful review, shared context, and experienced judgment.

A practical next step

For teams trying to move from experimentation to reliable AI-assisted delivery, start by choosing one workflow pressure point and turning it into a small experiment:

Create a reusable skill for understanding large requests.
Add project-specific AI rules for recurring mistakes.
Compare ticket cycle time before and after AI adoption.
Track review effort, not just PR volume.
Reframe story points around risk and judgment.
Try AI-assisted refinement with a PM and technical lead present.
Run an A/B test between two spec-driven workflows.

The goal is not to make AI usage bigger. The goal is to make it more legible, measurable, and trustworthy.

AI is changing software delivery, but the north star remains familiar: build reliable systems, keep quality visible, and use better tools without surrendering engineering judgment.

If your team is asking how to make AI part of a reliable delivery process, we’d be glad to help think through what that could look like. Book a call with us to start the conversation.