A multivocal literature review of 46 studies finds GenAI is already embedded in early architectural tasks, yet 93% of the work skips formal validation of what the models produce.
Source: Esposito et al. (2025), Table 14. Percentages reflect share of 46 reviewed studies mentioning each SALC phase (papers can cover more than one).
Eight researchers across Finland, the United States, and India combed through 1,054 candidate papers (621 peer-reviewed, 433 grey literature) and whittled them down to 46 that genuinely address how generative AI is used for software architecture. The method was a multivocal literature review (MLR), which means the team treated blog posts, industry white papers, and YouTube tutorials with the same systematic rigour normally reserved for journal articles. That breadth matters: the grey literature turned out to carry early signals about tool adoption that academic venues hadn't yet captured.
Data extraction followed an open-coding protocol with two independent coders per paper and a third author to break ties. Cohen's kappa between coders ranged from 0.64 to 0.88, which sits comfortably in the "substantial to almost perfect" range. The team then mapped each study against the Software Architecture Life Cycle (SALC), cataloguing which models were used, how prompts were engineered, what architectural styles were targeted, and whether anybody bothered to validate the outputs.
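For readers unfamiliar with the statistic, Cohen's kappa measures agreement between two coders corrected for chance. A minimal sketch of the computation, using hypothetical paper labels rather than the review's actual coding data:

```python
# Minimal sketch of Cohen's kappa for two coders.
# The labels below are illustrative, not the review's real data.
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(coder_a)
    # Observed agreement: fraction of items both coders labelled the same
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected chance agreement from each coder's label frequencies
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Six hypothetical papers coded by two raters
a = ["design", "code", "design", "refactor", "code", "design"]
b = ["design", "code", "design", "code", "code", "design"]
print(round(cohens_kappa(a, b), 2))  # → 0.71, "substantial" agreement
```

With five of six labels matching, raw agreement is 0.83, but kappa discounts the matches the coders would get by chance, landing at 0.71.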
OpenAI's GPT family dominates. Across every study, 62% of all model references pointed to a GPT variant, with GPT-4 alone appearing in 21 papers. Google's models trailed at 9%, and LLaMA variants accounted for 8%. DeepSeek and CodeQwen are beginning to surface, but with only a handful of mentions each. The practical implication: nearly all evidence about "GenAI for architecture" is actually evidence about "GPT for architecture."
The primary use case is architectural decision support (38% of studies), followed by reverse engineering for architectural reconstruction (19%) and architecture generation (19%). Most activity clusters in the early SALC phases: turning requirements into architecture (40%) and turning architecture into code (32%). The later, harder task of transforming one architecture into another (say, refactoring a monolith into microservices at the architectural level) appeared in only 3% of studies.
Few-shot prompting was the most common technique (31%), while retrieval-augmented generation (RAG) showed up in 20% of studies, typically to inject domain-specific knowledge that general-purpose models lack. Fine-tuning was rarer, at 12%. A striking 26% of papers ran models completely off-the-shelf with no modifications, and a quarter simply failed to report what prompt strategy they used.
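The few-shot technique the reviewed studies lean on amounts to embedding a handful of worked examples in the prompt before the actual query. A sketch of what such a prompt might look like for an architecture task; the classification task, example systems, and labels here are all hypothetical, not drawn from any reviewed study:

```python
# Hypothetical few-shot prompt for classifying architectural styles;
# the examples and task are illustrative, not from the review.
FEW_SHOT_EXAMPLES = [
    ("Independently deployable services communicating over HTTP APIs",
     "microservices"),
    ("Single deployable unit with layered internal modules",
     "layered monolith"),
]

def build_prompt(description):
    """Assemble a few-shot prompt: labelled examples, then the query."""
    lines = ["Classify the architectural style of each system."]
    for desc, style in FEW_SHOT_EXAMPLES:
        lines.append(f"System: {desc}\nStyle: {style}")
    # The unanswered query goes last; the model completes the final "Style:"
    lines.append(f"System: {description}\nStyle:")
    return "\n\n".join(lines)

print(build_prompt(
    "Event producers and consumers connected by a message broker"))
```

RAG differs in that the injected context is retrieved from an external knowledge base at query time rather than hand-picked in advance, but the mechanics of assembling the prompt are similar.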
The most consequential finding is negative. Of the 46 studies, 93% provided no information about how (or whether) they validated the architectural outputs LLMs produced. Only three papers reported a formal method: one used ATAM (Architecture Tradeoff Analysis Method), one used SAAM, and one relied on static analysis. This means the field is generating an expanding body of claims about GenAI's usefulness for architecture without any consistent way of confirming those claims are true.
Modelling languages tell a similar story of under-specification. UML appeared in 17% of papers, but 74% of studies did not identify any modelling language at all. If you cannot name what notation your AI produced, it becomes very difficult to assess whether the output is correct, complete, or consistent. Architecture-specific datasets and benchmarks remain almost nonexistent, which makes reproducibility and comparison across studies practically impossible.
The review surfaced a clear hierarchy of concerns. LLM accuracy topped the list (15% of studies flagged it), followed by hallucinations (8%), ethical considerations (7%), and privacy (7%). These are not independent problems. Inaccurate outputs feed hallucination risk; hallucination risk undermines trust; low trust makes ethical deployment harder, particularly in safety-critical domains like healthcare or automotive systems where one bad architectural decision can have real consequences.
The authors also note a socio-technical blind spot. Most GenAI tools for architecture operate as if a single architect is making decisions in isolation. Real systems involve cross-team collaboration, co-changes across codebases, and organisational constraints that models currently ignore. One grey-literature source put it bluntly: GenAI might propose a technically elegant solution that nobody on the team has the skills to maintain. The tool does not know (and is never asked) about the team.
There is also an uncomfortable question about generated code ownership that echoes old debates from model-driven development. When the model writes the code, who maintains it? If you modify it, the model cannot update its internal representation. If you regenerate it, your modifications vanish. The review frames this not as a future concern but as a present one that the community has largely avoided discussing.
GenAI has established itself in the early phases of software architecture, particularly for mapping requirements to designs and generating code from those designs. But adoption has outpaced evaluation. The field urgently needs architecture-specific benchmarks, standardised validation methods, and honest reckoning with the fact that 93% of current research cannot confirm the quality of what these models produce. Until that gap closes, GenAI remains a promising assistant operating without a performance review.
Esposito, M., Li, X., Moreschini, S., Ahmad, N., Cerny, T., Vaidhyanathan, K., Lenarduzzi, V., & Taibi, D. (2025). Generative AI for software architecture: Applications, challenges, and future directions. Preprint submitted to Journal of Systems and Software. arXiv:2503.13310v2.