Research Explainer · Liu (2026)
Across 304,362 AI-authored commits from 6,275 GitHub repositories, AI tools are a net positive for surface-level code quality but a net negative for bugs and security vulnerabilities, with 24.2% of all introduced issues persisting indefinitely.
Published March 2026
484,606 distinct quality issues identified across 304,362 AI-authored commits from 6,275 GitHub repositories
89.1% of all AI-introduced issues are code smells, the dominant but least dangerous form of technical debt
24.2% of tracked AI-introduced issues still survive at the latest repository revision, months after introduction
1.8× more security issues introduced by AI coding assistants than fixed, the worst net impact of any issue category
304,000 commits, five tools, one question
The researchers built a dataset of 304,362 verified AI-authored commits from 6,275 public GitHub repositories, spanning five major coding assistants: GitHub Copilot (117,851 commits), Claude (139,300), Cursor (19,791), Gemini (12,770), and Devin (14,650). Instead of relying on classifiers or self-reported labels, they used explicit Git metadata to identify AI-authored commits: bot-style logins, tool-specific email addresses (like noreply@anthropic.com), author names (like Cursor Agent), and Co-authored-by trailers in commit messages. In total, 29 AI coding tools left identifiable traces, though the study focuses on the five with more than 10,000 attributed commits.
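To make the attribution concrete, here is a minimal sketch of that metadata scan, assuming a local clone and Git on the PATH. Only noreply@anthropic.com and Cursor Agent are signals named in the text; the other patterns below are illustrative assumptions, not the study's actual matching rules.

```python
import subprocess

# Metadata signals of the kind the study uses for attribution. Only the
# first entry in each tuple is named in the text; the rest are guesses.
AI_EMAILS = ("noreply@anthropic.com",)
AI_NAMES = ("Cursor Agent", "devin-ai-integration")          # second: assumption
AI_TRAILERS = ("co-authored-by: claude", "co-authored-by: copilot")  # assumptions

def ai_authored_commits(repo: str) -> list[str]:
    """Return SHAs of commits whose Git metadata matches an AI-tool signal."""
    # %H = SHA, %an = author name, %ae = author email, %B = full message;
    # \x1f / \x1e are field and record separators that commit text
    # essentially never contains.
    log = subprocess.run(
        ["git", "-C", repo, "log", "--format=%H%x1f%an%x1f%ae%x1f%B%x1e"],
        capture_output=True, text=True, check=True,
    ).stdout
    shas = []
    for record in log.split("\x1e"):
        if not record.strip():
            continue
        sha, name, email, message = record.strip().split("\x1f", 3)
        if (email in AI_EMAILS
                or any(n in name for n in AI_NAMES)
                or any(t in message.lower() for t in AI_TRAILERS)):
            shas.append(sha)
    return shas
```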
For each commit, the team ran static analysis tools (Pylint and Bandit for Python, ESLint and njsscan for JavaScript and TypeScript) on the code both before and after the change. This differential approach lets them attribute specific quality issues directly to individual AI-authored commits, distinguishing between problems the AI introduced and problems it fixed. They then tracked each introduced issue from its first appearance all the way to the latest repository revision (HEAD), revealing whether the debt persists, gets resolved, or quietly accumulates. Previous studies examined AI code quality at a single point in time. This one follows the debt as it ages.
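A simplified sketch of that differential step for a single Python file, assuming a local clone and Pylint installed: materialise the file as it existed at the commit and at its parent, lint both versions, and diff the issue sets. Matching issues by (rule, line) pair is a deliberate simplification; line numbers shift across edits, and the paper's issue tracking across revisions is more careful than this.

```python
import json
import subprocess

def pylint_issues(path: str) -> set[tuple[str, int]]:
    """Run Pylint on one file; return issues as (rule symbol, line) pairs."""
    # Pylint exits non-zero when it finds issues, so we don't check the code.
    out = subprocess.run(
        ["pylint", "--output-format=json", path],
        capture_output=True, text=True,
    ).stdout
    return {(m["symbol"], m["line"]) for m in json.loads(out or "[]")}

def issue_delta(repo: str, sha: str, path: str) -> tuple[set, set]:
    """Issues a commit introduced and fixed, for one Python file it touched."""
    snapshots = {}
    for label, rev in (("before", f"{sha}^"), ("after", sha)):
        # Materialise the file contents at each revision.
        blob = subprocess.run(
            ["git", "-C", repo, "show", f"{rev}:{path}"],
            capture_output=True, text=True, check=True,
        ).stdout
        tmp = f"{label}.py"
        with open(tmp, "w") as fh:
            fh.write(blob)
        snapshots[label] = pylint_issues(tmp)
    introduced = snapshots["after"] - snapshots["before"]
    fixed = snapshots["before"] - snapshots["after"]
    return introduced, fixed
```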
Source: Liu et al. (2026), Figure 7. Negative values indicate AI fixed more issues than it introduced. Net values: code smells -18,134; runtime bugs +4,823; security issues +11,120.
Source: Liu et al. (2026), Figure 6. More than 15% of commits from every AI coding assistant introduce at least one detectable issue.
Source: Liu et al. (2026), Section 5.3. Security issues are the most likely to persist in the codebase. Overall survival rate across all types is 24.2%.
The cleanup paradox
The study identified 484,606 distinct quality issues introduced by AI coding assistants across 3,841 repositories (61.2% of the 6,275 studied). That is not a small contamination radius. Some 8.7% of all AI-authored commits (26,564 out of 304,362) introduced at least one issue. The issues fall into three categories, each with a different risk profile: code smells (the dominant but least dangerous form of technical debt), runtime bugs, and security vulnerabilities.
The net impact is where the story gets interesting (Figure 7). AI-authored commits actually fix slightly more code smells than they introduce: 449,984 fixed versus 431,850 introduced, a net reduction of 18,134. AI is genuinely good at tidying up formatting, naming, and simple refactoring. But for runtime bugs, AI introduces 4,823 more than it fixes. For security issues, the gap is wider still: 24,607 introduced versus 13,487 fixed, a net increase of 11,120. AI introduces 1.82 times as many security vulnerabilities as it resolves. The tools that speed you up on the easy stuff slow you down on the hard stuff.
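The arithmetic behind those net figures is simple enough to sanity-check from the two categories whose raw counts appear above (a reader's tally, not the study's code; runtime-bug counts are not broken out here, only the net of +4,823):

```python
# Raw tallies quoted above from Figure 7; net = introduced - fixed.
tallies = {
    "code smells":     {"introduced": 431_850, "fixed": 449_984},
    "security issues": {"introduced": 24_607,  "fixed": 13_487},
}
for category, t in tallies.items():
    net = t["introduced"] - t["fixed"]
    ratio = t["introduced"] / t["fixed"]
    print(f"{category:15s} net {net:+8,d}  ({ratio:.2f}x introduced per fix)")
# code smells     net  -18,134  (0.96x introduced per fix)
# security issues net  +11,120  (1.82x introduced per fix)
```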
Every tool has the same problem
More than 15% of commits from every AI coding assistant in the study introduce at least one detectable issue. The rates range from 17.3% for GitHub Copilot to 28.7% for Gemini, with Claude at 24.5%, Cursor at 25.9%, and Devin at 23.7%. Claude averages the highest issue count per commit (1.96 issues, driven by 1.73 code smells per commit), while Devin has the lowest (0.87). These differences likely reflect usage patterns and development context rather than raw tool capability alone.
The cross-tool consistency is the real finding. All five tools share the same profile: a high rate of code smells, with a meaningful but smaller proportion of bugs and security issues. Code smells dominate every tool's output. You cannot escape the pattern by switching products. This is a systemic feature of AI-assisted development as currently practiced, not an artefact of any single vendor. The pattern also holds across programming languages: Python's top issues are dominated by exception handling and dynamic typing problems, while JavaScript and TypeScript skew towards scoping and variable declaration patterns. But in both cases, code smells account for the vast majority.
A quarter of all debt survives indefinitely
Introducing debt is not necessarily a crisis if it gets cleaned up promptly. It does not. Overall, 24.2% of all tracked AI-introduced issues still survive at the repository's latest revision. That translates to 37.25 surviving issues per 100 AI-authored commits. The most dangerous issues are also the stickiest: security vulnerabilities have a 41.1% survival rate, followed by runtime bugs at 30.3% and code smells at 22.7%. The issues you most want resolved are the ones most likely to remain.
The cumulative picture is stark. The total volume of surviving AI-introduced issues grew from a few hundred in early 2025 to over 110,000 by February 2026 (Figure 8 in the paper). Even issues older than nine months still show a 19.2% survival rate. Some debt gets resolved quickly: a TypeScript lint issue in Stirling-PDF (75k+ stars) introduced by a Claude-authored commit was fixed by the maintainer the next day. Others linger. An undefined variable bug in Firecrawl took 42 days to fix. A Devin-authored commit added a requests.get() call without a timeout in December 2024, a known security risk since the request can block indefinitely if the remote service never responds. As of the latest revision, that issue remains in the codebase.
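For readers who have not hit this failure mode, the pattern and its conventional fix look like this (the URL is a placeholder, not the actual code from that commit; Bandit reports the pattern as B113, request_without_timeout):

```python
import requests

URL = "https://example.com/api/status"  # placeholder endpoint, not from the study

# The risky pattern: with no timeout, the call can block indefinitely if the
# remote service accepts the connection but never responds.
resp = requests.get(URL)

# The conventional fix: bound the connect and read phases separately.
resp = requests.get(URL, timeout=(3.05, 10))
resp.raise_for_status()
```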
The implication is that merging AI-generated code is not the end of the story. The debt it introduces can persist and accumulate, making continuous monitoring and targeted debt repayment necessary for any team relying on AI coding tools at scale.
BOTTOM LINE
AI coding assistants are a net positive for surface-level code quality, fixing slightly more code smells than they introduce. But for the issues that matter most, runtime bugs and security vulnerabilities, they consistently introduce more than they resolve. Worse, 24.2% of all AI-introduced issues persist indefinitely, with security issues surviving at a 41.1% rate. Switching tools does not fix this: all five assistants studied show the same pattern. AI-generated code requires the same review rigour as human-written code, with particular scrutiny on security-sensitive changes.
Reference
Liu, Y., Widyasari, R., Zhao, Y., Irsan, I. C., & Lo, D. (2026). Debt Behind the AI Boom: A Large-Scale Empirical Study of AI-Generated Code in the Wild. arXiv preprint arXiv:2603.28592. https://arxiv.org/abs/2603.28592