Research Explainer · Li (2025)
The first large-scale dataset of autonomous coding agent activity on GitHub reveals that speed and scale are real, but acceptance rates, review dynamics, and code complexity tell a more sobering story about the gap between benchmarks and production.
Published July 2025
456K pull requests authored by five autonomous coding agents across 61,000 repositories, involving 47,000 developers
35–65% agent PR acceptance rates, compared to 77% for human developers, a 12–42 percentage point gap
18 min median time for OpenAI Codex PRs to be reviewed and merged, 10x faster than human-authored PRs
9.1% of agent PRs changed cyclomatic complexity vs. 23.3% for human PRs, suggesting structurally simpler contributions
Li et al. (2025), derived from Table 5 and Figure 3. Acceptance rates measured on AIDev-pop (repos with >500 GitHub stars). Human baseline: 76.8%.
What the paper actually does
Li, Zhang, and Hassan at Queen's University introduce AIDev, the first large-scale dataset capturing how autonomous coding agents operate in real open-source projects. The dataset spans 456,535 pull requests created by five agents (OpenAI Codex, Devin, GitHub Copilot, Cursor, and Claude Code) across 61,453 GitHub repositories, involving 47,303 developers. Collection ran from late 2024 through June 22, 2025.
The authors frame this as empirical ground truth for "SE 3.0," the idea that software engineering is shifting from AI-assisted coding (suggesting lines of code) to agentic coding (agents autonomously reading codebases, planning changes, running tests, and submitting pull requests). Prior work had theorised about this shift. AIDev provides the receipts.
Three case studies demonstrate the dataset's value: productivity analysis (acceptance rates and speed), code review dynamics (turnaround times and reviewer composition), and code quality (cyclomatic complexity and authorship attribution). A filtered subset, AIDev-pop, restricts the analysis to repositories with more than 500 GitHub stars, yielding 7,122 agent PRs compared against 6,628 human PRs from the same repositories.
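For readers who want to work with the dataset themselves, the AIDev-pop filter is easy to reproduce once the PR metadata is in a dataframe. Below is a minimal sketch assuming a local export of the PR table; the file name and column names are illustrative placeholders, not the dataset's published schema.

```python
# Minimal sketch of the AIDev-pop filter described above. The file name and
# column names ("agent", "repo_stars") are assumed placeholders.
import pandas as pd

prs = pd.read_parquet("aidev_pull_requests.parquet")  # hypothetical export

# AIDev-pop: keep only PRs from repositories with more than 500 GitHub stars
pop = prs[prs["repo_stars"] > 500]

agent_prs = pop[pop["agent"].notna()].copy()  # authored by one of the five agents
human_prs = pop[pop["agent"].isna()].copy()   # human-authored comparison set

print(f"{len(agent_prs)} agent PRs vs. {len(human_prs)} human PRs")
```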
Li et al. (2025), Table 5. Turnaround measured from PR creation to closure (acceptance). Human median: 3.9 hours.
Li et al. (2025), Table 4. Task categories from the Conventional Commits Specification, classified with GPT-4.1-mini. Percentages shown for the five most common categories.
Speed is real, quality is not
The headline tension in the data is stark. Agents are fast. GitHub Copilot finishes half its PR jobs within 13 minutes. OpenAI Codex PRs get reviewed and merged in a median of 18 minutes, roughly 10x faster than the 3.9-hour median for human PRs. One developer used Codex to submit 164 PRs in three days, matching three and a half years of their own unaided output.
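The turnaround numbers can be derived directly from PR timestamps. A sketch under the same assumed schema as above (created_at and merged_at as timestamp columns), not the paper's actual pipeline:

```python
# Median time from PR creation to merge, per agent.
# Column names ("agent", "created_at", "merged_at") are assumptions.
import pandas as pd

prs = pd.read_parquet("aidev_pull_requests.parquet")  # hypothetical export
merged = prs[prs["agent"].notna() & prs["merged_at"].notna()].copy()

merged["turnaround"] = (
    pd.to_datetime(merged["merged_at"]) - pd.to_datetime(merged["created_at"])
)

# OpenAI Codex should land near the 18-minute median quoted above
print(merged.groupby("agent")["turnaround"].median())
```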
But fast does not mean accepted. Human PRs in popular repositories are merged 76.8% of the time. The best-performing agent, OpenAI Codex, lands at 65.3%. Devin reaches 48.9%. GitHub Copilot manages only 38.2%. The gaps are widest on the tasks that matter most: feature development and bug fixing, the categories that require contextual reasoning beyond pattern matching.
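Acceptance rate is simply the merged fraction of closed PRs, optionally broken down by task category. Another sketch under the same assumed column names; task_category stands in for the Conventional-Commits labels the paper assigns with GPT-4.1-mini.

```python
# Acceptance rate = merged PRs / closed PRs, per agent and per task category.
# "state", "merged_at", and "task_category" are assumed column names.
import pandas as pd

prs = pd.read_parquet("aidev_pull_requests.parquet")  # hypothetical export
closed = prs[prs["agent"].notna() & (prs["state"] == "closed")].copy()
closed["accepted"] = closed["merged_at"].notna()

by_agent = (closed.groupby("agent")["accepted"].mean() * 100).round(1)
by_task = (closed.groupby(["agent", "task_category"])["accepted"].mean() * 100).round(1)

print(by_agent)  # compare against the 76.8% human baseline
print(by_task)   # feature and bug-fix categories show the widest gaps
```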
This directly contradicts benchmark results. Top agents score above 70% on SWE-bench Verified, a curated benchmark of GitHub issues. In production, the same agents face maintainability expectations, project-specific style conventions, and reviewer judgment that synthetic tests never capture. The authors argue this gap calls the ecological validity of current benchmarks into serious question.
The review bottleneck and the rise of bot reviewers
Agents flood repositories with PRs, but the review process has not scaled to match. Most PRs receive no explicit review at all (75.3% of human PRs and 58.2% of agent PRs). When reviews do happen, humans still dominate, but bot reviewers are gaining ground. In GitHub Copilot's case, 37.4% of its PRs are reviewed by a combined human-plus-bot team, a hybrid model emerging almost organically.
A troubling finding: agents and their review bots frequently come from the same provider. Copilot's PRs are overwhelmingly reviewed by copilot-pull-request-reviewer[bot]. Cursor's PRs are reviewed by cursor[bot]. These closed loops are convenient, but they risk reinforcing provider-specific blind spots. The one cross-cutting reviewer, coderabbitai[bot], operates across multiple agent ecosystems and may offer more diverse scrutiny.
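Separating these review teams programmatically is straightforward, because GitHub bot accounts carry a [bot] suffix in their login. A small sketch of the classification, with illustrative reviewer logins:

```python
# Classify a PR's review team using GitHub's "[bot]" login suffix.
def review_team(reviewers: list[str]) -> str:
    if not reviewers:
        return "unreviewed"
    bots = {r for r in reviewers if r.endswith("[bot]")}
    humans = set(reviewers) - bots
    if bots and humans:
        return "human+bot"
    return "bot-only" if bots else "human-only"

print(review_team(["octocat", "copilot-pull-request-reviewer[bot]"]))  # human+bot
print(review_team(["cursor[bot]"]))                                    # bot-only
print(review_team([]))                                                 # unreviewed
```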
The speed of rejection also raises questions about review depth. Rejected agent PRs are triaged significantly faster than rejected human PRs across nearly every agent. That could mean reviewers spot obvious flaws quickly. It could also mean they are not looking very hard. Without measuring comment depth and actionable change requests, the data cannot distinguish genuine confidence from rubber-stamping.
Throughput over complexity, and the attribution gap
In a case study of the OpenHFT/Chronicle-Wire project, one developer used Codex to produce 164 PRs in three days. Over the previous three and a half years, the same developer had submitted 176 PRs by hand. The volume is extraordinary. The substance is thinner. Only 9.1% of agent PRs introduced changes in cyclomatic complexity, compared to 23.3% for human PRs. Agent contributions lean toward boilerplate-style updates rather than structurally complex implementations.
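The complexity comparison rests on measuring cyclomatic complexity before and after each PR. The paper's exact tooling is not described here, so the sketch below illustrates the idea for Python files with the radon library; treat it as one possible measurement, not the authors' pipeline.

```python
# Did a PR change cyclomatic complexity? Illustrated with radon on Python
# source; the paper's actual tooling and language coverage may differ.
from radon.complexity import cc_visit

def total_complexity(source: str) -> int:
    """Sum cyclomatic complexity over all functions and classes in a file."""
    return sum(block.complexity for block in cc_visit(source))

def pr_changes_complexity(files_before: dict[str, str],
                          files_after: dict[str, str]) -> bool:
    """True if any touched file's total complexity differs pre- vs. post-PR."""
    return any(
        total_complexity(files_before.get(path, "")) != total_complexity(code)
        for path, code in files_after.items()
    )
```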
There is one area where agents clearly shine: documentation. OpenAI Codex hits an 88.6% acceptance rate on docs PRs, and Claude Code reaches 85.7%, both above the 76.5% human baseline. Documentation tasks align naturally with LLM strengths (natural language generation, low functional risk) and reviewer tolerance for AI output is higher when nothing can break.
Attribution is another open wound. Devin, GitHub Copilot, and Cursor label their authorship in commit metadata. OpenAI Codex provides no attribution at all. Claude Code includes a default co-author tag that users can disable. Without clear provenance, post hoc debugging becomes harder, bug triage slows down, and accountability for regressions dissolves. The authors argue standardised authorship labelling should be a baseline requirement for all AI-assisted development.
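Commit trailers are one place this provenance can live today. The sketch below checks commit messages for agent co-author trailers; the specific trailer strings are illustrative examples of the standardised labelling the authors call for, not an official registry.

```python
# Detect AI attribution from Git commit message trailers. The patterns
# below are illustrative examples, not an agreed-upon standard.
AGENT_TRAILER_HINTS = (
    "co-authored-by: claude",   # e.g. Claude Code's default co-author tag
    "co-authored-by: devin",
    "co-authored-by: copilot",
)

def is_agent_attributed(commit_message: str) -> bool:
    """True if the commit message declares an AI co-author in its trailers."""
    return any(
        line.strip().lower().startswith(hint)
        for line in commit_message.splitlines()
        for hint in AGENT_TRAILER_HINTS
    )
```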
What this means going forward
The paper proposes nine research directions, but three stand out. First, the field needs integration-oriented benchmarks grounded in real workflows, not synthetic puzzles. AIDev makes this possible by providing the raw material: PR metadata, review timelines, code diffs, and merge outcomes from actual projects. Second, rejected PRs are a goldmine for failure-mode analysis. Review comments, timeline events, and patch-level feedback in the dataset can be mapped to specific failure categories (logic bugs, style violations, inadequate tests), enabling the development of agents with better self-diagnostic capabilities. Third, the authors propose treating software repositories as reinforcement learning environments where agents improve through interaction, using CI builds, merged PRs, and reviewer feedback as natural reward signals.
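For that third direction, "natural reward signals" means turning repository events into a scalar an agent can optimise against. A deliberately toy sketch of what such a signal could look like; the inputs and weights are invented for illustration and do not come from the paper.

```python
# Toy reward signal for an agent-submitted PR, combining the feedback
# channels the authors point to: CI results, reviewer feedback, and the
# merge decision. The weights are illustrative, not from the paper.
from dataclasses import dataclass

@dataclass
class PROutcome:
    ci_passed: bool
    requested_changes: int  # number of reviewer change requests
    merged: bool

def pr_reward(outcome: PROutcome) -> float:
    reward = 0.3 if outcome.ci_passed else -0.3
    reward -= 0.1 * outcome.requested_changes
    reward += 1.0 if outcome.merged else -0.5
    return reward

print(round(pr_reward(PROutcome(ci_passed=True, requested_changes=2, merged=True)), 2))  # 1.1
```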
The broader framing is that we have entered SE 3.0 whether we planned for it or not. OpenAI Codex alone produced over 400,000 PRs in less than two months. The question is no longer whether agents will participate in software development. It is whether our review processes, governance structures, and quality standards can adapt fast enough to keep up. And the data was captured barely two months into the public release of most of these agents. The acceptance-rate gaps are real, but so is the trajectory. The authors are betting that these early shortfalls are a launchpad, not a ceiling.
BOTTOM LINE
AIDev is the first large-scale empirical dataset showing what autonomous coding agents actually do in the wild. The picture is revealing: agents are astonishingly fast and prolific, but their pull requests are accepted far less often than human work, their code changes are structurally simpler, and the review infrastructure around them is still catching up. Benchmark scores above 70% coexist with real-world acceptance rates below 50% for most agents. The gap between test performance and production trust is the defining challenge of the agentic coding era.
Reference
Li, H., Zhang, H., & Hassan, A. E. (2025). The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering. arXiv preprint arXiv:2507.15003. https://arxiv.org/abs/2507.15003