We ran the multi-agent miner + verifier against 12 months of tldraw PRs and produced 8 verified findings a senior engineer would actually want.
Each finding ships with a file path, an evidence quote from the actual code at the PR's parent commit, a one-line fix, and a citation chain to a specific past PR or detected convention. No LLM commentary. No noise. The verifier reads every cited file before any comment leaves the pipeline.
Six agents, one verifier, citation-bound output
Every comment that ships is a hypothesis that survived inspection. The verifier reads the actual file involved, as it stood at the PR's parent commit, and dismisses hypotheses that pattern-match training data but don't apply to this code.
Four pipelines, same 12 months of PRs
Same train/test temporal split, same ground truth (clean / followup-needed / reverted), same metrics. The deterministic baseline is the floor. Adding LLM mining + a verifier is the moat.
| Pipeline | AUTO precision | AUTO % | Followup recall | What it adds |
|---|---|---|---|---|
| A · structural baseline | 58.6% | 11% | 88% | File-graph and churn signals only. The floor any rule-based tool reaches. |
| B · + intent | 71.4% | 12% | 93% | Adds semantic intent and historical context awareness. |
| C · full multi-agent | 69.2% | 33% | 82% | Adds convention and recurring-concern mining for richer context. |
| D · with verifier | 75.0% | 13% | 88% | Reads each cited file before any comment is posted. Only verified findings ship. |
Methodology: PRs sorted by date. The earliest 80% are used to learn the team's patterns; the most recent 20% are held out for evaluation. Ground truth comes from each PR's actual outcome after merge (clean ship vs. needed follow-up vs. reverted). The same measurement can be reproduced on your own repo.
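The split described above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the `PullRequest` record, its field names, and the choice of the PR's open date as the sort key are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class PullRequest:
    # Hypothetical record; field names are illustrative only.
    number: int
    opened: date   # assumed sort key: the date the PR was opened
    outcome: str   # "clean", "followup-needed", or "reverted"


def temporal_split(
    prs: Sequence[PullRequest], train_fraction: float = 0.8
) -> Tuple[List[PullRequest], List[PullRequest]]:
    """Sort by date, then cut once: earliest PRs train, latest PRs test."""
    ordered = sorted(prs, key=lambda pr: pr.opened)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Ten synthetic PRs, deliberately supplied out of order.
sample = [PullRequest(n, date(2024, 1, n), "clean") for n in range(10, 0, -1)]
train, test = temporal_split(sample)
```

Because the cut is made on a date-sorted list rather than a random shuffle, every held-out PR is at least as recent as every training PR, which is what keeps later outcomes from leaking into the learned patterns.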
Eight verified comments, every one auditable
These are the actual comments produced by the pipeline. Each carries the file path the verifier inspected, the evidence quote from the code at the PR's parent commit, the suggested fix, and the chain of past PRs/conventions the concern came from.
Three guarantees that hold on any repo
The replay above uses a public open-source repo, so you can verify it yourself. The same harness runs on your repo with the same guarantees.
- Time-correct. Every test PR is evaluated using only the data the team had before that PR opened. No leakage, no peeking ahead.
- Code-grounded. The verifier opens every cited file at the PR's parent commit and confirms the concern is real before any comment ships. Pattern-matching alone is never enough.
- Citation-bound. Every finding names a specific past PR or detected convention. You can click through and check the source. No findings without provenance.
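The code-grounded and citation-bound guarantees above amount to a gate that a finding must pass before it ships. The sketch below is one hypothetical way to express that gate, not the verifier's real implementation: the `Finding` fields, the `should_ship` name, and the whitespace-insensitive quote match are assumptions. The snapshot is passed in as a plain mapping of path to file text; in practice that text could come from something like `git show <parent-sha>:<path>`.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Finding:
    # Hypothetical shape of a finding; field names are illustrative only.
    path: str
    evidence_quote: str
    citation: Optional[str]  # a past PR reference or a detected convention


def _normalize(text: str) -> str:
    """Collapse whitespace so indentation changes do not defeat the match."""
    return " ".join(text.split())


def should_ship(finding: Finding, parent_snapshot: Mapping[str, str]) -> bool:
    """Ship only if the finding has provenance and its quote is in the file."""
    if not finding.citation:
        return False  # citation-bound: no provenance, no comment
    content = parent_snapshot.get(finding.path)
    if content is None:
        return False  # cited file does not exist at the parent commit
    quote = _normalize(finding.evidence_quote)
    if not quote:
        return False  # an empty quote would trivially match anything
    return quote in _normalize(content)  # code-grounded check


# Tiny synthetic snapshot of the repo at a PR's parent commit.
snapshot = {"src/editor.ts": "export function undo() {\n    history.pop()\n}\n"}
grounded = Finding("src/editor.ts", "history.pop()", "convention: undo-stack")
uncited = Finding("src/editor.ts", "history.pop()", None)
stale = Finding("src/editor.ts", "history.shift()", "convention: undo-stack")
```

A substring match is the simplest possible grounding check; it confirms the quoted code exists at the parent commit but says nothing about whether the concern itself is valid, which is the harder part of the verifier's job.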
Drop our GitHub bot on one repo. 30-day pilot. Day-1 audit on your last 12 months of PRs in the first week.
We deploy the multi-agent miner + verifier inside your VPC. It reads from GitHub, your incident system, and your team's Slack #incidents channel if connected. Code never leaves your network. Free pilot: if our verified comments don't change a single PR cycle in 30 days, you walk and we take the learnings.