Your AI agent wrote 200 lines of code in 3 minutes. Now what? If you merge without reviewing, you're gambling. If you review line-by-line, you've lost the speed advantage. Code review with agents needs a different strategy — not more relaxed, but smarter.
The review paradox
AI is fast at generating code. So fast it creates a new bottleneck: review. Before, the bottleneck was writing code. Now it's validating it.
You can't skip review — generated code has subtle errors, inconsistent patterns, and decisions no agent can make for you. But you also can't review everything at the same level of detail as before. You need a system that filters the obvious stuff and lets you focus on what matters.
My review checklist
After months of reviewing agent-generated code, these are the five things I always check. It's not an exhaustive review — it's a quick filter that catches 80% of the problems.
Code Review Checklist
```typescript
async function deleteUsers(data: any) {
  const res = await fetch("/api/users", {
    method: "DELETE",
    body: JSON.stringify(data),
  });
  console.log(res);
  return res.json();
}
```
Let's go through each one:
Does it match intent? The most common error isn't that the code breaks — it's that it does something different from what you asked. The agent interpreted your prompt one way, you meant another. This is the most important check and the one only a human can do.
Type safety ok? Agents love any. They also skip null checks, ignore API errors, and assume data always arrives in the expected format. A quick type review catches problems that become production bugs later.
Follows codebase patterns? The agent can generate functional code that looks nothing like the rest of your project. It uses console.log instead of your logger, creates duplicate utils, ignores naming conventions. Code that works, but that nobody on the team would recognize as their own.
Inputs validated? Any data coming from outside — user input, APIs, query params — needs validation. Agents tend to trust that data is always correct. In production, it never is.
Edge cases covered? Empty data, network errors, timeouts, arrays with a single element, strings with special characters. Agents cover the happy path. Edge cases are your responsibility.
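Applying the checklist to the snippet above might produce something like this. It's a sketch, not the one true fix — names like DeleteUsersPayload and the response shape are illustrative assumptions, not from any real codebase:

```typescript
// Illustrative payload type — the real shape depends on your API.
interface DeleteUsersPayload {
  userIds: string[];
}

// Checklist item: inputs validated. A type guard at the boundary
// replaces blind trust in external data.
function isDeleteUsersPayload(data: unknown): data is DeleteUsersPayload {
  return (
    typeof data === "object" &&
    data !== null &&
    Array.isArray((data as DeleteUsersPayload).userIds) &&
    (data as DeleteUsersPayload).userIds.every((id) => typeof id === "string")
  );
}

async function deleteUsers(data: unknown): Promise<unknown> {
  // Checklist item: type safety. `unknown` + a guard instead of `any`.
  if (!isDeleteUsersPayload(data)) {
    throw new Error("deleteUsers: payload must be { userIds: string[] }");
  }
  // Checklist item: edge cases. An empty list shouldn't hit the network.
  if (data.userIds.length === 0) {
    return { deleted: 0 };
  }
  const res = await fetch("/api/users", {
    method: "DELETE",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(data),
  });
  // Checklist item: edge cases again. fetch does not throw on HTTP errors.
  if (!res.ok) {
    throw new Error(`DELETE /api/users failed: ${res.status}`);
  }
  return res.json();
}
```

Whether it matches intent — is bulk delete even the right endpoint? — is still the question only you can answer.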
Agents reviewing agents
The idea sounds recursive, but it works: using a code review agent as a first filter before you look at the code. It doesn't replace your review — it makes it shorter.
What I do: after the agent generates code, I ask it to review the output against a set of rules. The project's CLAUDE.md already has the conventions. The agent compares the output against those conventions and flags inconsistencies.
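For the comparison to work, the conventions have to be written down concretely. A CLAUDE.md might hold rules like these (an illustrative sketch, not the actual file):

```markdown
## Code conventions
- Use the shared logger; never console.log.
- No `any` — use `unknown` plus a type guard at the boundary.
- Check existing utils before creating a new one.
- Functions that fetch data are named fetchX, not getX.
```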
The human reviews the review. Sounds redundant, but the total time is less than reviewing all the code directly. The agent catches mechanical problems. You focus on design problems.
When automated review gets it right
Agents are good at catching things humans overlook out of tedium:
- Broken patterns across many files — If you renamed a convention and three files didn't get updated, the agent finds it.
- Unused imports — You ignore them, the agent lists them all.
- Inconsistent naming — getUserData in one file, fetchUserInfo in another. The agent spots the divergence.
- Duplicate code — Two utils doing the same thing with different names. The agent compares and flags them.
This kind of review is tedious for humans and trivial for agents. Delegating it makes sense.
When automated review fails
Agents can't validate:
- Business logic — Does this price calculation include taxes? Does the discount apply before or after shipping? Only someone who understands the domain can verify this.
- UX decisions — Does this loading state communicate what's happening? Does the error message help the user solve the problem? Human judgment.
- Is this the right approach? — The code can be perfect and still be the wrong solution. The agent doesn't know if you should have used a webhook instead of polling, or if that feature should exist at all.
- Team context — Undocumented agreements, historical decisions, reasons something was done a certain way. If it's not in the codebase, the agent doesn't know it.
A workflow that works
My current flow:
- Write with agent — Clear prompt, defined spec, codebase context available
- Auto-review — The agent reviews its own output against project conventions
- Auto-fix — Mechanical problems (imports, formatting, types) get fixed before I see them
- Human review — I review intent, business logic, edge cases, and architecture decisions
- Ship — With confidence, because both review layers already passed
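The auto-review step doesn't have to be sophisticated. Even a crude mechanical scan catches some of the noise before a human looks; this is a minimal sketch (the patterns are illustrative — a real setup would lean on ESLint and tsc):

```typescript
// Minimal mechanical pre-review gate. Flags patterns a human reviewer
// shouldn't have to spend attention on.
const RED_FLAGS: Array<{ name: string; pattern: RegExp }> = [
  { name: "console.log left in", pattern: /console\.log\(/ },
  { name: "explicit any", pattern: /:\s*any\b/ },
  { name: "TODO left by the agent", pattern: /\/\/\s*TODO/ },
];

// Returns one finding per flagged line, e.g. "line 3: explicit any".
function reviewFile(source: string): string[] {
  const findings: string[] = [];
  source.split("\n").forEach((line, i) => {
    for (const { name, pattern } of RED_FLAGS) {
      if (pattern.test(line)) findings.push(`line ${i + 1}: ${name}`);
    }
  });
  return findings;
}
```

Anything this layer catches never reaches the human review step at all.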
The human review is shorter because the obvious stuff is already resolved. I don't spend time pointing out console.log or any types — that's already caught. My attention goes to the questions only I can answer.
Conclusion
The goal isn't zero human review. It's making human review time count. Every minute you spend checking an unused import is a minute you don't spend evaluating whether the architecture scales or whether the feature solves the right problem.
Agents are good at catching mechanical errors. Humans are good at evaluating intent, context, and consequences. A good review workflow uses each for what they're best at.
This is the ninth article in the series. The first was about the tools I use. The second about why they fail and how to fix it. The previous one about planning before coding with SDD. This one closes the loop — the code is generated, now you need to make sure it deserves to reach production.