I lead code reviews on a 4-engineer team. For most of 2024-2025, our cycle time was eating into shipping velocity. The pattern was always the same: open a PR, wait 4 hours for someone to glance at it, get back five comments — three of which were "you forgot a null check" type nits that any senior would catch in 30 seconds.
In Q1 2026 I rebuilt our review pipeline around Claude. Cycle time dropped 50%. Here's the exact setup, and the parts that surprised me.
What "AI code review" actually does (and doesn't)
Before describing the system, some framing matters: an AI reviewer is useful for catching mechanical issues. Missing null checks. Race conditions in async code. Forgotten error handling. Dead imports. Type-safety gaps where someone reached for `any` because they were tired.
It is not useful for architectural judgment, naming choices, or product trade-offs. Those still need humans. But mechanical issues account for ~70% of review comments on our team, and they're the ones that bounce a PR back twice before merge.
Claude doesn't replace senior review. It replaces the part of review that wasn't senior to begin with.
The architecture
Three pieces (GitHub webhooks, n8n, and Claude), wired in sequence:
- GitHub webhook fires on `pull_request.opened`, `synchronize`, and `reopened`
- n8n workflow catches the webhook, fetches the diff + file context via the GitHub API
- Claude Sonnet 4.6 reviews the diff with a tuned system prompt, returns inline comments + a risk score (0-10)
- n8n posts the review back to GitHub via the `POST /pulls/:number/reviews` endpoint
Total time from PR open to first review comment: ~30 seconds.
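If you wanted the same flow without n8n, it collapses to roughly the handler below. This is a sketch under my own naming (the env var, `fetchDiff`, `handlePullRequest`, and the declared helpers are all mine); `reviewWithClaude` and `postReview` are sketched in later sections.

```typescript
// Sketch: webhook payload -> fetch diff -> Claude review -> GitHub review.
// Declared here, defined in the sketches further down.
declare function reviewWithClaude(diff: string): Promise<{ body: string; risk: number }>;
declare function postReview(
  repo: string, pull: number, body: string, risk: number, token: string,
): Promise<void>;

// Fetch the raw unified diff using GitHub's diff media type.
async function fetchDiff(repo: string, pull: number, token: string): Promise<string> {
  const res = await fetch(`https://api.github.com/repos/${repo}/pulls/${pull}`, {
    headers: {
      Authorization: `Bearer ${token}`,
      Accept: "application/vnd.github.v3.diff", // returns the diff itself, not JSON
    },
  });
  if (!res.ok) throw new Error(`diff fetch failed: ${res.status}`);
  return res.text();
}

// Entry point for pull_request.opened / synchronize / reopened events.
async function handlePullRequest(payload: {
  action: string;
  number: number;
  repository: { full_name: string };
}): Promise<void> {
  if (!["opened", "synchronize", "reopened"].includes(payload.action)) return;
  const token = process.env.GITHUB_TOKEN!;
  const diff = await fetchDiff(payload.repository.full_name, payload.number, token);
  const review = await reviewWithClaude(diff);
  await postReview(payload.repository.full_name, payload.number, review.body, review.risk, token);
}
```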
The reviewer prompt (the part that matters)
Most of the work was in the system prompt. Two principles shaped it:
Principle 1: Be specific or shut up
I explicitly tell Claude not to make general comments. Every issue must cite a line number and a concrete fix. If it can't name where, it doesn't comment.
```
# Output format
For each issue:

### Line N (path/to/file.ts)
[One sentence describing the issue and suggested fix.]

If the PR is solid, say so. Don't manufacture issues.
```

Principle 2: Skip nits
I tell Claude to ignore formatting, naming, and stylistic preferences; we have linters for that. The signal-to-noise ratio of reviews is directly tied to not commenting on stuff that doesn't matter.
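Putting the two principles together, a condensed version of the reviewer prompt looks something like this. The wording here is illustrative, not our production prompt:

```
You review pull request diffs. You receive a unified diff plus file context.

Rules:
- Only raise an issue if you can cite the line and a concrete fix.
- Focus on mechanical problems: missing null checks, race conditions in
  async code, unhandled errors, dead imports, unjustified `any`.
- Ignore formatting, naming, and style. Linters handle those.
- If the PR is solid, say so. Don't manufacture issues.

For each issue:

### Line N (path/to/file.ts)
[One sentence describing the issue and suggested fix.]

End with a risk score on its own line: "Risk: N" (0-10).
```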
Risk scoring
Each review ends with a 0-10 risk score:
- 0-3: ship it. Comments are nice-to-haves.
- 4-6: minor concerns, address before merge.
- 7-10: blocking. Auto-requests changes via the GitHub review API.
The 7-10 case is the most powerful one. If Claude detects a likely production-breaking issue (security, race condition, data loss), the PR is blocked even if the human reviewer rubber-stamps it. We've had this catch two real incidents before merge.
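The mapping from score to GitHub review event is a few lines. A sketch against the REST API, using the same hypothetical helper names as the handler above:

```typescript
// Map Claude's 0-10 risk score to a GitHub review event.
type ReviewEvent = "APPROVE" | "COMMENT" | "REQUEST_CHANGES";

function eventForRisk(risk: number): ReviewEvent {
  if (risk <= 3) return "APPROVE";    // ship it
  if (risk <= 6) return "COMMENT";    // minor concerns
  return "REQUEST_CHANGES";           // blocking
}

// POST /repos/{owner}/{repo}/pulls/{pull_number}/reviews
async function postReview(
  repo: string, pull: number, body: string, risk: number, token: string,
): Promise<void> {
  const res = await fetch(`https://api.github.com/repos/${repo}/pulls/${pull}/reviews`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      Accept: "application/vnd.github+json",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ body, event: eventForRisk(risk) }),
  });
  if (!res.ok) throw new Error(`posting review failed: ${res.status}`);
}
```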
Cost
Sonnet 4.6 with prompt caching costs us about $0.04 per PR review. At ~20 PRs/day team-wide, that's ~$24/month. The previous version (GPT-4 via direct webhooks, no caching) was ~$0.20/PR. The 5x reduction came from two things:
- Prompt caching on the reviewer system prompt (4KB), which cuts the input cost of that prompt on every call after the first (see the sketch after this list)
- Claude Sonnet handles diff review at lower cost than GPT-4 with comparable quality on our eval set
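With the Anthropic SDK, caching the static prompt is a `cache_control` annotation on the system block. A minimal sketch; the model id is illustrative, and the risk-score parsing assumes the "Risk: N" line from the prompt above:

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const REVIEWER_PROMPT = "..."; // the ~4KB system prompt described above

async function reviewWithClaude(diff: string): Promise<{ body: string; risk: number }> {
  const msg = await client.messages.create({
    model: "claude-sonnet-4-6", // illustrative; use whichever Sonnet id you run
    max_tokens: 2048,
    system: [
      {
        type: "text",
        text: REVIEWER_PROMPT,
        cache_control: { type: "ephemeral" }, // cache the static prompt prefix
      },
    ],
    messages: [{ role: "user", content: diff }],
  });
  const text = msg.content[0].type === "text" ? msg.content[0].text : "";
  const risk = Number(/Risk:\s*(\d+)/.exec(text)?.[1] ?? 5); // fall back to mid-risk
  return { body: text, risk };
}
```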
What broke that I didn't expect
Webhook signature verification. GitHub signs webhooks with HMAC-SHA256, but n8n's built-in webhook node didn't verify signatures by default; we were getting test reviews from internet scanners hitting our endpoint. I added an HMAC verification step before n8n forwards anything to Claude.
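GitHub puts the signature in the `X-Hub-Signature-256` header as `sha256=<hex>` computed over the raw request body. The check is small; the secret's env-var name here is mine:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verify GitHub's X-Hub-Signature-256 header against the raw request body.
function verifySignature(rawBody: string, signatureHeader: string | undefined): boolean {
  if (!signatureHeader) return false;
  const expected =
    "sha256=" +
    createHmac("sha256", process.env.WEBHOOK_SECRET!).update(rawBody).digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signatureHeader);
  // timingSafeEqual throws on length mismatch, so guard first.
  return a.length === b.length && timingSafeEqual(a, b);
}
```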
Diff size. Claude has plenty of context, but a 5,000-line refactor PR returned a useless review ("LGTM, mostly mechanical"). I now chunk PRs over 1,000 lines into per-file reviews and aggregate the summary.
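A sketch of that chunking, assuming a standard unified diff where each file's section starts with a `diff --git` line (the threshold constant and function names are mine):

```typescript
// reviewWithClaude is the function from the caching sketch above.
declare function reviewWithClaude(diff: string): Promise<{ body: string; risk: number }>;

const CHUNK_THRESHOLD_LINES = 1000;

// Each per-file section of a unified diff begins with "diff --git".
function splitDiffByFile(diff: string): string[] {
  return diff.split(/^(?=diff --git )/m).filter((chunk) => chunk.trim() !== "");
}

async function reviewPossiblyLargePr(diff: string): Promise<string[]> {
  if (diff.split("\n").length <= CHUNK_THRESHOLD_LINES) {
    return [(await reviewWithClaude(diff)).body]; // small PR: single pass
  }
  const reviews: string[] = [];
  for (const fileDiff of splitDiffByFile(diff)) {
    reviews.push((await reviewWithClaude(fileDiff)).body); // one pass per file
  }
  return reviews; // aggregated into one summary comment downstream
}
```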
Branch protection. If the auto-review's request-changes is the first review on a PR, branch protection can stop the author from merging until the review is dismissed. Add a setting that allows the author to dismiss bot reviews.
The human team change
The actual win wasn't the AI catching bugs. It was that human reviewers stopped being the bottleneck for mechanical feedback. By the time they opened a PR, Claude had already drafted 3-5 comments, and the human only had to add the architectural questions.
Time-to-first-human-review didn't improve much. Time-to-merge dropped from ~7 hours to ~3.5 hours, because a single human pass was usually enough — Claude had already filtered the easy stuff.
Code is on GitHub
The repo with the n8n workflow, prompts, and webhook handler is at github.com/hii24/ai-pr-reviewer. MIT-licensed, takes about 15 minutes to set up against your own GitHub org.
If you try it and your team's review cycle doesn't budge in the first week, the issue is almost always the prompt — tune it for your codebase's norms before assuming the approach is wrong.