Research Explainer · Treude & Storey (2025)

AI is rewriting software engineering research — and the old methods may no longer hold

A vision paper from Singapore Management University and the University of Victoria argues that LLMs don't just change how software is built — they undermine the constructs, methods, and validity standards that empirical SE researchers have relied on for decades.

4 · Research pillars disrupted: phenomena, methods, data, and validity
8 · SE artifact types now documented as AI co-generated
3,000+ · Prior SE studies mining artifacts, all affected by the AI provenance shift

What the paper argues

Software engineering (SE) research has always evolved alongside its tools, but the arrival of large language models (LLMs) represents something qualitatively different. Treude and Storey argue this is not simply a new tool to study — it is a shift that destabilises the foundational concepts researchers use to do the studying. When an AI system can write commit messages, generate bug reports, and produce pull-request comments, the very artifacts that empirical SE has spent thirty years learning to mine are no longer purely human-produced.

The paper applies McLuhan's four laws of media to frame the disruption. AI enhances low-level coding, test generation, and documentation. It makes obsolete community platforms like Stack Overflow as the first port of call for developer questions. It retrieves conversational interfaces for technical assistance. And when pushed to extremes, it reverses into a force that erodes foundational programming skills — a dynamic already visible in studies showing over-reliance on AI suggestions reducing developer confidence and introducing subtle errors.

1. Phenomena & Constructs
Identity disruption
The categories of "developer", "artifact", and "source code" are becoming unstable. When a user specifies intent in natural language and an AI produces the code — what researchers call vibe coding — the boundary between user and developer dissolves. Existing constructs may no longer map onto practice.
2. Research Methods & Theories
Methodological disruption
Non-deterministic AI outputs break reproducibility: identical prompts yield different results across runs or model versions. Evaluation drift means findings can expire before publication. Qualitative work must now treat AI as an active participant and capture its reasoning traces alongside human ones.
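One pragmatic response to non-determinism, in the spirit of the extended methods the paper calls for, is to treat each prompt as producing a distribution rather than a single answer: run it k times, keep the sampling parameters with every response, and report agreement across runs. The sketch below is illustrative only; `query_model` and its signature are hypothetical stand-ins for whatever API a given study actually uses.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class PromptSample:
    """One run of a prompt, with the metadata needed to interpret it later."""
    model_version: str
    temperature: float
    output: str

def sample_distribution(query_model, prompt: str, k: int = 10,
                        model_version: str = "model-v1",
                        temperature: float = 0.7) -> Counter:
    """Query the (possibly non-deterministic) model k times and
    return the distribution of outputs, not a single answer."""
    samples = [
        PromptSample(model_version, temperature,
                     query_model(prompt, temperature=temperature))
        for _ in range(k)
    ]
    return Counter(s.output for s in samples)

# Demo with a deterministic stub standing in for a real model call.
stub = lambda prompt, temperature: "fix: add null check"
dist = sample_distribution(stub, "Write a commit message for this diff", k=5)
print(dist.most_common(1))  # → [('fix: add null check', 5)]
```

Reporting the full distribution, rather than one cherry-picked run, makes the variability that breaks reproducibility at least visible and quantifiable.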
3. Data Sources
Provenance disruption
Empirical SE now works with three interconnected data types: training corpora, prompts and interactions, and generative outputs. The meta-analysis cataloguing over 3,000 studies mining bug reports and commit messages predates AI co-generation of those very artifacts.
4. Threats to Validity
Rigour disruption
All three classical validity types are weakened. Construct validity suffers from shifting definitions. Internal validity is undermined by output variability and evaluation drift. External validity is constrained by proprietary model lock-in and the ecological validity gap between lab settings and real-world AI use.

SE artifact types now documented as AI co-generated in the cited literature

Source code · Test cases · Commit messages · Bug reports · Code review comments · UML diagrams · Forum answers · Emails

Why this matters beyond academia

The studies that inform how tools like GitHub Copilot are designed, how AI adoption is measured in organisations, and how developer productivity is assessed are all built on the methodological assumptions this paper identifies as breaking down. If researchers continue applying pre-AI constructs to post-AI phenomena, the resulting findings may not just be wrong — they may actively mislead the industry.

The authors are not alarmist. Traditional empirical methods remain valuable; the point is they need extending, not replacing. Mixed methods that triangulate behavioural data, conversational logs, and post-hoc interviews are better positioned to capture human-AI dynamics than either approach alone. Longitudinal studies must become "living studies" that document how model behaviour shifts over time, rather than treating a single snapshot as stable ground truth.

Crucially, generative AI is not only a research subject — it is becoming a research instrument. LLMs are already being used to code qualitative data, simulate participants, and summarise large corpora. This introduces a recursive challenge: the same validity concerns that apply to AI-as-subject apply equally when AI is doing the analysis.

McLuhan's four laws, applied to AI in SE

McLuhan law 1 · Enhances: low-level code writing, test suite generation, documentation, pull requests.
McLuhan law 2 · Makes obsolete: Stack Overflow as first resort; community-driven knowledge-sharing ecosystems.
McLuhan law 3 · Retrieves: conversational and chat-based interfaces for technical assistance, once secondary in developer workflows.
McLuhan law 4 · Reverses into: erosion of foundational skills; over-trust introducing subtle errors and reducing developer confidence.

Key implication for practitioners

The next time you read a study claiming AI tools improve developer productivity by X%, ask when it was conducted, which model version it used, and whether the constructs it measured still map onto how your team actually works. Treude and Storey's core message is that validity in AI-era research requires a new kind of transparency — one that treats model version, prompt design, and temporal context as first-class methodological variables, not footnotes.
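Treating those variables as first-class could be as simple as attaching a provenance record to every observation a study collects. The sketch below is a minimal illustration, not a scheme from the paper; the field names and the example model identifier are assumptions chosen for clarity.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass(frozen=True)
class Provenance:
    """Methodological context that should travel with every AI-era finding."""
    model_version: str   # the exact model identifier used, not just the vendor
    prompt: str          # the full prompt, not a paraphrase
    temperature: float   # sampling parameters that affect output variability
    collected_at: str    # temporal context: model behaviour drifts over time

record = Provenance(
    model_version="example-model-2025-01",   # hypothetical identifier
    prompt="Summarise this bug report: ...",
    temperature=0.2,
    collected_at=datetime.now(timezone.utc).isoformat(),
)

# Serialise alongside the measured outcome so the finding can be dated
# and re-interpreted when the underlying model inevitably changes.
print(json.dumps(asdict(record), indent=2))
```

A record like this costs almost nothing to collect at study time, but without it a productivity claim cannot be replicated, dated, or compared against a later model version.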

Glossary

LLM — Large Language Model. An AI system trained on vast text corpora to generate text and code (e.g. GPT-4, Claude, Gemini).
SE — Software Engineering. The discipline of designing, building, and maintaining software systems.
Empirical SE — The sub-field that studies SE practices using systematic observation, experiments, and data analysis.
IDE — Integrated Development Environment. A software application combining a code editor, debugger, and build tools (e.g. VS Code, Cursor).
Vibe coding — Specifying high-level intent in natural language and letting an AI produce the resulting code, blurring the user/developer boundary.
Evaluation drift — When empirical findings become outdated before publication because the underlying AI model has been updated.
Construct validity — Whether a study accurately measures the concept it intends to investigate.
Internal validity — Whether causal conclusions drawn from a study hold under scrutiny.
External validity — Whether findings generalise beyond the specific study context.
Actor-network theory — Latour's framework treating both human and non-human entities as active participants in shaping systems and knowledge.

Reference

Treude, C., & Storey, M.-A. (2025). Generative AI and empirical software engineering: A paradigm shift. arXiv preprint arXiv:2502.08108v2. https://arxiv.org/abs/2502.08108