Research Explainer · Treude & Storey (2025)
A vision paper from Singapore Management University and the University of Victoria argues that LLMs don't just change how software is built — they undermine the constructs, methods, and validity standards that empirical SE researchers have relied on for decades.
What the paper argues
Software engineering (SE) research has always evolved alongside its tools, but the arrival of large language models (LLMs) represents something qualitatively different. Treude and Storey argue this is not simply a new tool to study — it is a shift that destabilises the foundational concepts researchers use to do the studying. When an AI system can write commit messages, generate bug reports, and produce pull-request comments, the very artifacts that empirical SE has spent thirty years learning to mine are no longer purely human-produced.
The paper applies McLuhan's four laws of media to frame the disruption. AI enhances low-level coding, test generation, and documentation. It makes obsolete community platforms such as Stack Overflow as the first port of call for developer questions. It retrieves the conversational interface as the primary mode of technical assistance. And when pushed to extremes, it reverses into a force that erodes foundational programming skills, a dynamic already visible in studies showing that over-reliance on AI suggestions reduces developer confidence and introduces subtle errors.
[Figure: SE artifact types now documented as AI co-generated in the cited literature]
Why this matters beyond academia
The studies that inform how tools like GitHub Copilot are designed, how AI adoption is measured in organisations, and how developer productivity is assessed are all built on the methodological assumptions this paper identifies as breaking down. If researchers continue applying pre-AI constructs to post-AI phenomena, the resulting findings may not just be wrong — they may actively mislead the industry.
The authors are not alarmist. Traditional empirical methods remain valuable; the point is that they need extending, not replacing. Mixed methods that triangulate behavioural data, conversational logs, and post-hoc interviews are better positioned to capture human-AI dynamics than any single method alone. Longitudinal studies must become "living studies" that document how model behaviour shifts over time, rather than treating a single snapshot as stable ground truth.
Crucially, generative AI is not only a research subject — it is becoming a research instrument. LLMs are already being used to code qualitative data, simulate participants, and summarise large corpora. This introduces a recursive challenge: the same validity concerns that apply to AI-as-subject apply equally when AI is doing the analysis.
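One practical response to that recursive challenge is to audit an AI "instrument" the same way one would audit a human coder: check its labels against a human-coded gold set before trusting AI-coded data. A minimal sketch of that idea, with all names hypothetical and a keyword heuristic standing in for a real model call (which would also need its version and prompt logged):

```python
# Hypothetical sketch of validating an AI qualitative coder against human
# labels. The keyword-based labeller is an illustrative stand-in; a real
# study would call an actual model API and log its version and prompt.

def ai_code_comment(comment: str) -> str:
    """Stand-in for an LLM assigning a qualitative code to a comment."""
    text = comment.lower()
    if "error" in text or "crash" in text:
        return "defect"
    if "how do i" in text or "?" in text:
        return "question"
    return "other"

def agreement(ai_labels, human_labels):
    """Raw percent agreement between AI and human coding passes."""
    matches = sum(a == h for a, h in zip(ai_labels, human_labels))
    return matches / len(human_labels)

comments = [
    "App crashes on startup after the update",
    "How do I configure the linter?",
    "Thanks, merging this now",
]
human = ["defect", "question", "other"]  # gold labels from a human coder
ai = [ai_code_comment(c) for c in comments]

# Report agreement as a validity check before using AI-coded data.
print(f"AI/human agreement: {agreement(ai, human):.0%}")
```

In practice a chance-corrected statistic such as Cohen's kappa would replace raw agreement, but the principle is the same: the instrument's reliability is measured, not assumed.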
Key implication for practitioners
The next time you read a study claiming AI tools improve developer productivity by X%, ask when it was conducted, which model version it used, and whether the constructs it measured still map onto how your team actually works. Treude and Storey's core message is that validity in AI-era research requires a new kind of transparency — one that treats model version, prompt design, and temporal context as first-class methodological variables, not footnotes.
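One way to make that transparency concrete is to record model and temporal metadata as structured, first-class fields attached to every result, rather than as prose footnotes. A minimal sketch under that reading; the field names and schema are illustrative, not prescribed by the paper:

```python
# Hypothetical sketch: treating model version, prompt design, and temporal
# context as first-class variables in a study record. Field names are
# illustrative; the paper argues for the principle, not this schema.
from dataclasses import dataclass, asdict
import json

@dataclass(frozen=True)
class StudyContext:
    model_name: str        # the assistant under study
    model_version: str     # exact version or checkpoint identifier
    prompt_template: str   # the prompt actually sent, verbatim
    collected_on: str      # ISO date; model behaviour drifts over time

ctx = StudyContext(
    model_name="example-code-assistant",
    model_version="2025-01-checkpoint",
    prompt_template="Write a commit message for this diff: {diff}",
    collected_on="2025-03-01",
)

# Serialised alongside the findings, this lets a reported effect size be
# re-interpreted when the underlying model changes.
print(json.dumps(asdict(ctx), indent=2))
```

A reader of the resulting study can then ask the questions above (when, which model, which prompt) against recorded data instead of guesswork.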
Reference
Treude, C., & Storey, M.-A. (2025). Generative AI and empirical software engineering: A paradigm shift. arXiv preprint arXiv:2502.08108v2. https://arxiv.org/abs/2502.08108