Research Explainer · Okpala (2025)
Researchers at Discover Financial Services built two collaborating multi-agent crews, one for modeling and one for model risk management, that autonomously handle the full ML pipeline on credit risk, fraud detection, and card approval datasets, matching or beating top Kaggle solutions while the second crew stress-tests the first crew's outputs.
Published February 2025
95.37% accuracy achieved by the modeling crew's XGBoost classifier on the portfolio credit risk dataset
~9 percentage-point accuracy drop (95.37% to 86.24%) when the MRM crew tested the credit risk model on shifted input distributions
3 end-to-end use cases (fraud detection, card approval, credit risk) completed autonomously by the agent crews
What the paper sets out to do
Financial institutions live and die by their models, yet building and validating those models is labour-intensive work that typically involves multiple specialist teams handing off artifacts to one another. Okpala, Golgoon, and Ravi Kannan ask whether multi-agent AI systems, built on top of large language models, can replicate that end-to-end workflow autonomously.
They construct two agent "crews" using the CrewAI framework, each powered by GPT-3.5 Turbo. The first crew handles the modeling pipeline: exploratory data analysis, feature engineering, model selection via GridSearchCV, hyperparameter tuning, training, evaluation, and documentation. The second crew acts as the model risk management (MRM) team, performing documentation compliance checks against an organisational guide, independent model replication, conceptual soundness analysis, and outcome testing under adversarial and shifted inputs.
The two crews are tested on three public Kaggle datasets: a credit card fraud detection set with 284,807 transactions (99.83% non-fraudulent), a credit card approval prediction set with 36,457 records, and a portfolio credit risk set with 32,581 loan records. Each dataset presents real-world messiness, including severe class imbalance, missing values, and categorical features that need encoding.
Source: Okpala, Golgoon & Ravi Kannan (2025), Figures 7, 8, 9. Accuracy, F1-score, precision, recall, and AUC reported by the modeling crew's evaluation agent on each dataset.
Source: Okpala, Golgoon & Ravi Kannan (2025), Figures 7, 8, 9. Accuracy under original test data, shifted input distributions, and adversarial/outlier inputs for each dataset.
How the agent crews are structured
Each crew uses a hierarchical collaboration structure. A manager agent, given the persona of a "Data Science Manager" or "Model Risk Manager," delegates tasks sequentially to specialist agents and retrieves results from a shared memory stream. The memory stream stores action inputs, tool outputs, and contextual information so that downstream agents can pick up where earlier ones left off without re-doing work.
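The delegation-plus-shared-memory pattern can be sketched in plain Python. This is an illustrative skeleton, not the paper's actual CrewAI code; every class and role name below is invented for the example.

```python
# Sketch of the hierarchical pattern described above: a manager delegates
# tasks in sequence, and each specialist reads from and writes to a shared
# memory stream so downstream agents can reuse earlier results.

class MemoryStream:
    """Shared store of action inputs, tool outputs, and context."""
    def __init__(self):
        self.entries = []

    def write(self, agent, key, value):
        self.entries.append({"agent": agent, "key": key, "value": value})

    def read(self, key):
        # Latest entry wins, so agents see the freshest artifact.
        for entry in reversed(self.entries):
            if entry["key"] == key:
                return entry["value"]
        return None

class SpecialistAgent:
    """One pipeline stage; its result lands in shared memory."""
    def __init__(self, role, task_fn):
        self.role = role
        self.task_fn = task_fn

    def run(self, memory):
        result = self.task_fn(memory)
        memory.write(self.role, self.role, result)
        return result

class ManagerAgent:
    """Delegates tasks sequentially and collects results via memory."""
    def __init__(self, specialists):
        self.specialists = specialists

    def run_pipeline(self):
        memory = MemoryStream()
        for agent in self.specialists:
            agent.run(memory)
        return memory

# Toy two-stage pipeline: the EDA result feeds feature engineering
# without the second agent re-doing the first agent's work.
eda = SpecialistAgent("eda", lambda m: {"n_missing": 3})
fe = SpecialistAgent("features",
                     lambda m: {"imputed": m.read("eda")["n_missing"]})
crew = ManagerAgent([eda, fe])
memory = crew.run_pipeline()
```

The point of the memory stream is visible in the last specialist: it never recomputes the missing-value count, it reads the EDA agent's stored result.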
The agents are not simply chatting with each other. They generate and execute real Python code through a "Code Execution Tool" that runs in a local environment. The EDA agent, for example, uses a specialised "EDA Tool" to surface missing values, class distributions, skewness, and correlations. The feature engineering agent writes and runs code to perform label encoding, KNN imputation, and SMOTE oversampling. The model selection agent runs GridSearchCV to compare candidates. Each agent is constrained by role-playing prompts, a moderate temperature of 0.3, and guardrails that prevent it from straying outside its assigned task.
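The kind of code these agents generate and execute can be approximated with scikit-learn. A minimal sketch on synthetic data, assuming scikit-learn is available; SMOTE lives in the separate imbalanced-learn package, so class weighting stands in for it here to keep the example self-contained.

```python
# Condensed stand-in for the feature-engineering and model-selection
# agents' generated code: label encoding, KNN imputation, and a
# GridSearchCV comparison of hyperparameter candidates.
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(0)

# Synthetic stand-in for a loan table: one categorical column, two
# numeric columns (one with missing values), an imbalanced binary label.
home = rng.choice(["RENT", "OWN", "MORTGAGE"], size=200)
X_num = rng.normal(size=(200, 2))
X_num[rng.random(200) < 0.15, 0] = np.nan     # inject missing values
y = (rng.random(200) < 0.2).astype(int)       # ~20% positive class

# Label encoding for the categorical feature.
home_enc = LabelEncoder().fit_transform(home)

# KNN imputation fills the injected gaps from nearest neighbours.
X_imp = KNNImputer(n_neighbors=5).fit_transform(X_num)
X = np.column_stack([home_enc, X_imp])

# GridSearchCV compares candidates by cross-validation;
# class_weight="balanced" plays the role SMOTE played in the paper.
search = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=3,
)
search.fit(X, y)
```

The labels here are random noise, so the fitted model is meaningless; the sketch only shows the mechanics of the three steps the agents chain together.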
The modeling crew executes the pipeline stages in sequence, each delegated by the manager: exploratory data analysis, feature engineering, model selection, hyperparameter tuning, training, evaluation, and documentation.
What the results show
Across all three datasets, the agent-built models matched or slightly outperformed the most-upvoted human solutions on Kaggle. On the credit card fraud dataset, the crew selected logistic regression with random downsampling and achieved 94.39% accuracy and a 94.24% F1-score, compared to Kaggle's 94% accuracy. On the credit card approval dataset, the crew chose Random Forest and hit 95.48% accuracy versus Kaggle's 93.79% with XGBoost. On portfolio credit risk, the crew's XGBoost classifier reached 95.37% accuracy against Kaggle's CatBoost at 93.72%.
The MRM crew's stress testing revealed important differences in model resilience. The fraud detection model, whose features had been pre-scaled, lost only about 3 percentage points of accuracy on shifted inputs and none on adversarial inputs. The credit card approval model, dominated by categorical features, was barely affected by either perturbation. But the credit risk model, which had fewer categorical variables, dropped nearly 9 percentage points to 86.24% accuracy under shifted distributions, though it held firm at 95.36% on adversarial outlier inputs. This pattern makes intuitive sense: models with more continuous numeric features have larger attack surfaces for distribution shift.
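The shifted-input outcome test needs no retraining: score the model once on clean held-out data, then again after perturbing the continuous features. A minimal sketch on synthetic data (the dataset and the shift magnitude are illustrative, not the paper's):

```python
# Outcome testing under distribution shift: compare accuracy on the
# original test set against accuracy after a constant offset is added
# to every continuous feature, mimicking input drift.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + 0.5 * X[:, 1]
     + rng.normal(scale=0.5, size=2000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Baseline accuracy on the untouched test set.
base_acc = model.score(X_te, y_te)

# Shifted inputs: the same test set, moved off the training distribution.
shifted_acc = model.score(X_te + 1.0, y_te)
```

Because the toy labels depend on the continuous features, the shift degrades accuracy sharply, the same fragility the MRM crew surfaced in the credit risk model.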
The MRM crew also used a RAG tool to verify the modeling crew's documentation against an organisational modeling guide. The compliance checker confirmed that each prescribed step (EDA through evaluation) was present and in order. The model replication agent independently reproduced the exact same metrics as the modeling crew, validating reproducibility.
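The paper's compliance checker retrieves the guide via RAG; stripped of retrieval, the "present and in order" check it performs reduces to something like this ordered scan. The step names follow the pipeline described earlier, and the function itself is illustrative, not the paper's tool.

```python
# Simplified compliance check: every prescribed pipeline step must
# appear in the documentation, in the prescribed order.
REQUIRED_STEPS = [
    "exploratory data analysis",
    "feature engineering",
    "model selection",
    "hyperparameter tuning",
    "training",
    "evaluation",
]

def check_compliance(documentation: str) -> tuple[bool, list[str]]:
    """Return (compliant, steps that were missing or out of order)."""
    doc = documentation.lower()
    cursor, problems = 0, []
    for step in REQUIRED_STEPS:
        idx = doc.find(step, cursor)   # only look past the previous step
        if idx == -1:
            problems.append(step)
        else:
            cursor = idx + len(step)
    return (not problems, problems)

doc = ("We began with exploratory data analysis, then feature "
       "engineering with KNN imputation, model selection via "
       "GridSearchCV, hyperparameter tuning, training, and "
       "evaluation on held-out data.")
ok, missing = check_compliance(doc)
```

Searching only forward from the previous match is what enforces ordering: a documentation file that mentions evaluation before training fails even though both words are present.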
What this means and what it leaves open
The paper is a proof-of-concept, not a production deployment. All three datasets are publicly available, and the organisational modeling guide used for compliance checking was AI-generated for illustration. The authors are transparent about this. But the demonstration is still striking: a team of LLM agents running on GPT-3.5 Turbo (not even GPT-4) can autonomously traverse a complete modeling and validation lifecycle and produce documentation that checks out against a style guide.
The most valuable insight may not be the modeling crew's accuracy numbers, which are competitive but not revolutionary. It is the MRM crew's ability to catch vulnerabilities. The shifted-input test on the credit risk model revealed a meaningful fragility that would not have surfaced in a standard train/test evaluation. In a real bank, that finding would trigger remediation before the model went live. Having an agent crew that automatically runs these stress tests is, in principle, cheaper and faster than scheduling a separate human validation team.
The authors flag two important caveats. First, human verification remains essential. They ran every agent output side by side in a controlled Python environment to confirm consistency. Second, safety and bias risks in agentic systems are amplified by the agents' interactions, and guardrails (temperature settings, role constraints, human-in-the-loop checkpoints) are non-negotiable in a regulated industry like financial services. Future work, they suggest, should explore self-improving agents that adapt their prompts from past interactions, and meta-crews that can generate their own agent architectures on the fly.
KEY TAKEAWAY
Multi-agent LLM crews can autonomously build, evaluate, and stress-test financial models at a level competitive with skilled human practitioners. But the biggest win is not the accuracy; it is that a second agent crew can independently validate the first crew's work and surface vulnerabilities like distribution-shift fragility before the model reaches production. The catch: in a regulated industry, human oversight is not optional; it is the guardrail that makes the whole system trustworthy.
Reference
Okpala, I., Golgoon, A., & Ravi Kannan, A. (2025). Agentic AI Systems Applied to Tasks in Financial Services: Modeling and Model Risk Management Crews. arXiv preprint arXiv:2502.05439. https://arxiv.org/abs/2502.05439