Research Explainer · Okpala (2025)

AI agent crews can build and validate financial models, but they still need human oversight to stay safe

Researchers at Discover Financial Services built two collaborating multi-agent crews, one for modeling and one for model risk management, that autonomously handle the full ML pipeline on credit risk, fraud detection, and card approval datasets, matching or beating top Kaggle solutions while stress-testing their own outputs.

Published February 2025

95.37% accuracy achieved by the modeling crew's XGBoost classifier on the portfolio credit risk dataset

~9 percentage-point accuracy drop (95.37% to 86.24%) when the MRM crew tested the credit risk model on shifted input distributions

3 end-to-end use cases (fraud detection, card approval, credit risk) completed autonomously by the agent crews

Financial institutions live and die by their models, yet building and validating those models is labour-intensive work that typically involves multiple specialist teams handing off artifacts to one another. Okpala, Golgoon, and Ravi Kannan ask whether multi-agent AI systems, built on top of large language models, can replicate that end-to-end workflow autonomously.

They construct two agent "crews" using the CrewAI framework, each powered by GPT-3.5 Turbo. The first crew handles the modeling pipeline: exploratory data analysis, feature engineering, model selection via GridSearchCV, hyperparameter tuning, training, evaluation, and documentation. The second crew acts as the model risk management (MRM) team, performing documentation compliance checks against an organisational guide, independent model replication, conceptual soundness analysis, and outcome testing under adversarial and shifted inputs.

The two crews are tested on three public Kaggle datasets: a credit card fraud detection set with 284,807 transactions (99.83% non-fraudulent), a credit card approval prediction set with 36,457 records, and a portfolio credit risk set with 32,581 loan records. Each dataset presents real-world messiness, including severe class imbalance, missing values, and categorical features that need encoding.

Modeling Crew Performance Across Three Financial Datasets

Source: Okpala, Golgoon & Ravi Kannan (2025), Figures 7, 8, 9. Accuracy, F1-score, precision, recall, and AUC reported by the modeling crew's evaluation agent on each dataset.

MRM Stress Testing: Normal vs Shifted vs Adversarial Inputs

Source: Okpala, Golgoon & Ravi Kannan (2025), Figures 7, 8, 9. Accuracy under original test data, shifted input distributions, and adversarial/outlier inputs for each dataset.

Each crew uses a hierarchical collaboration structure. A manager agent, given the persona of a "Data Science Manager" or "Model Risk Manager," delegates tasks sequentially to specialist agents and retrieves results from a shared memory stream. The memory stream stores action inputs, tool outputs, and contextual information so that downstream agents can pick up where earlier ones left off without re-doing work.
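The delegation-and-memory pattern can be illustrated with a minimal pure-Python sketch. This is not the CrewAI API; MemoryStream, manager_run, and the toy agents are hypothetical names invented for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStream:
    """Shared store of task outputs, readable by downstream agents."""
    entries: dict = field(default_factory=dict)

    def write(self, task, output):
        self.entries[task] = output

    def read(self, task):
        return self.entries.get(task)


def manager_run(tasks, memory):
    """Manager delegates tasks in sequence; each specialist sees prior results."""
    for name, agent in tasks:
        memory.write(name, agent(memory))


# Toy specialists: the second reads upstream context instead of redoing work.
def eda_agent(mem):
    return {"imbalance": 0.83}


def fe_agent(mem):
    # Decide whether to resample based on the EDA agent's finding.
    return {"resampled": mem.read("eda")["imbalance"] > 0.5}


memory = MemoryStream()
manager_run([("eda", eda_agent), ("feature_engineering", fe_agent)], memory)
print(memory.entries["feature_engineering"])  # {'resampled': True}
```

The point of the shared stream is exactly what the paper describes: downstream agents consume upstream artifacts rather than recomputing them.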

The agents are not simply chatting with each other. They generate and execute real Python code through a "Code Execution Tool" that runs in a local environment. The EDA agent, for example, uses a specialised "EDA Tool" to surface missing values, class distributions, skewness, and correlations. The feature engineering agent writes and runs code to perform label encoding, KNN imputation, and SMOTE oversampling. The model selection agent runs GridSearchCV to compare candidates. Each agent is constrained by role-playing prompts, a moderate temperature of 0.3, and guardrails that prevent it from straying outside its assigned task.
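A sketch of the kind of preprocessing code the feature engineering agent emits, using scikit-learn's LabelEncoder and KNNImputer on a toy frame. SMOTE, which comes from the separate imbalanced-learn package, is omitted here; the data and column names are invented:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder

# Toy frame with a categorical column and missing numerics,
# mimicking the messiness of the Kaggle datasets.
df = pd.DataFrame({
    "income": [45000.0, np.nan, 62000.0, 58000.0],
    "age": [25.0, 31.0, np.nan, 40.0],
    "housing": ["rent", "own", "own", "rent"],
})

# Label-encode categoricals so distance-based imputation can run.
df["housing"] = LabelEncoder().fit_transform(df["housing"])

# KNN imputation fills each gap from the nearest complete rows.
imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns
)
print(imputed.isna().sum().sum())  # 0 — no missing values remain
```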

The modeling crew executes these pipeline stages in sequence, each delegated by the manager:

  1. Exploratory Data Analysis: A Senior Data Scientist agent identifies dataset shape, missing values, class imbalance, skewness, correlations, and outliers.
  2. Feature Engineering: A second Senior Data Scientist handles encoding, imputation, and resampling (downsampling or SMOTE) based on the manager's instructions.
  3. Model Selection: A Machine Learning Engineer runs GridSearchCV across candidate algorithms and returns the best performer with rationale.
  4. Hyperparameter Tuning: A Senior ML Engineer tunes the selected model's parameters on a 5% sample for speed, saving results to disk.
  5. Model Training: Another Senior ML Engineer trains the final model with optimal hyperparameters and serialises it.
  6. Model Evaluation: A third Senior ML Engineer evaluates the trained model on held-out test data across accuracy, F1-score, precision, recall, and AUC.
  7. Documentation: A technical writer agent compiles the full output into a documentation artifact.
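The selection and evaluation steps can be sketched together with scikit-learn: compare candidates via GridSearchCV, then score the winner on held-out data. The candidate list, parameter grids, and synthetic data below are illustrative assumptions, not the paper's configuration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data standing in for a credit dataset.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Compare candidates the way the model-selection agent does.
candidates = {
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0]}),
    "rf": (RandomForestClassifier(random_state=0), {"n_estimators": [50, 100]}),
}
best_name, best_model, best_score = None, None, -1.0
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=3, scoring="f1").fit(X_tr, y_tr)
    if search.best_score_ > best_score:
        best_name = name
        best_model = search.best_estimator_
        best_score = search.best_score_

# Evaluation step: score the winner on held-out test data.
pred = best_model.predict(X_te)
print(best_name, round(accuracy_score(y_te, pred), 3),
      round(f1_score(y_te, pred), 3))
```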

Across all three datasets, the agent-built models matched or slightly outperformed the most-upvoted human solutions on Kaggle. On the credit card fraud dataset, the crew selected logistic regression with random downsampling and achieved 94.39% accuracy and a 94.24% F1-score, compared to Kaggle's 94% accuracy. On the credit card approval dataset, the crew chose Random Forest and hit 95.48% accuracy versus Kaggle's 93.79% with XGBoost. On portfolio credit risk, the crew's XGBoost classifier reached 95.37% accuracy against Kaggle's CatBoost at 93.72%.

The MRM crew's stress testing revealed important differences in model resilience. The fraud detection model, whose features had been pre-scaled, lost only about 3 percentage points of accuracy on shifted inputs and none on adversarial inputs. The credit card approval model, dominated by categorical features, was barely affected by either perturbation. But the credit risk model, which had fewer categorical variables, dropped nearly 9 percentage points to 86.24% accuracy under shifted distributions, though it held firm at 95.36% on adversarial outlier inputs. This pattern makes intuitive sense: models with more continuous numeric features present a larger attack surface for distribution shift.
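A minimal sketch of a shifted-input stress test, assuming "shift" means adding one training standard deviation to every feature; the paper's exact perturbation recipe may differ:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic continuous-feature data, the case most exposed to shift.
X, y = make_classification(n_samples=1000, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

baseline = accuracy_score(y_te, model.predict(X_te))

# Shift every feature by one training standard deviation, then re-score:
# a crude stand-in for the MRM crew's distribution-shift test.
X_shifted = X_te + X_tr.std(axis=0)
shifted = accuracy_score(y_te, model.predict(X_shifted))

print(f"baseline={baseline:.3f} shifted={shifted:.3f} "
      f"drop={baseline - shifted:.3f}")
```

A drop like the credit risk model's 9-point fall would show up here as a large gap between the two scores, invisible to a standard train/test split.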

The MRM crew also used a RAG tool to verify the modeling crew's documentation against an organisational modeling guide. The compliance checker confirmed that each prescribed step (EDA through evaluation) was present and in order. The model replication agent independently reproduced the exact same metrics as the modeling crew, validating reproducibility.
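The ordering check itself can be sketched without any RAG machinery as a simple scan for required stages, assuming the guide prescribes the six pipeline stages listed earlier; check_compliance and the stage strings are illustrative, not the paper's tool:

```python
# Required pipeline stages, in the order the modeling guide prescribes
# (stage names here are illustrative, not the paper's exact wording).
REQUIRED = ["exploratory data analysis", "feature engineering",
            "model selection", "hyperparameter tuning",
            "model training", "model evaluation"]


def check_compliance(doc):
    """Return stages that are missing or out of order in the documentation."""
    doc = doc.lower()
    issues, last_pos = [], -1
    for stage in REQUIRED:
        pos = doc.find(stage)
        if pos == -1:
            issues.append(f"missing: {stage}")
        elif pos < last_pos:
            issues.append(f"out of order: {stage}")
        else:
            last_pos = pos
    return issues


doc = ("1. Exploratory Data Analysis ... 2. Feature Engineering ... "
       "3. Model Selection ... 4. Hyperparameter Tuning ... "
       "5. Model Training ... 6. Model Evaluation ...")
print(check_compliance(doc))  # [] — all stages present, in order
```

The paper's checker grounds this lookup in the organisational guide via retrieval rather than a hard-coded list, but the pass/fail logic is the same: every prescribed step must be present and in sequence.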

The paper is a proof-of-concept, not a production deployment. All three datasets are publicly available, and the organisational modeling guide used for compliance checking was AI-generated for illustration. The authors are transparent about this. But the demonstration is still striking: a team of LLM agents running on GPT-3.5 Turbo (not even GPT-4) can autonomously traverse a complete modeling and validation lifecycle and produce documentation that checks out against a style guide.

The most valuable insight may not be the modeling crew's accuracy numbers, which are competitive but not revolutionary. It is the MRM crew's ability to catch vulnerabilities. The shifted-input test on the credit risk model revealed a meaningful fragility that would not have surfaced in a standard train/test evaluation. In a real bank, that finding would trigger remediation before the model went live. Having an agent crew that automatically runs these stress tests is, in principle, cheaper and faster than scheduling a separate human validation team.

The authors flag two important caveats. First, human verification remains essential. They ran every agent output side by side in a controlled Python environment to confirm consistency. Second, safety and bias risks in agentic systems are amplified by the agents' interactions, and guardrails (temperature settings, role constraints, human-in-the-loop checkpoints) are non-negotiable in a regulated industry like financial services. Future work, they suggest, should explore self-improving agents that adapt their prompts from past interactions, and meta-crews that can generate their own agent architectures on the fly.

KEY TAKEAWAY

Multi-agent LLM crews can autonomously build, evaluate, and stress-test financial models at a level competitive with skilled human practitioners. But the biggest win is not the accuracy; it is that a second agent crew can independently validate the first crew's work and surface vulnerabilities like distribution-shift fragility before the model reaches production. The catch: in a regulated industry, human oversight is not optional; it is the guardrail that makes the whole system trustworthy.

Reference

Okpala, I., Golgoon, A., & Ravi Kannan, A. (2025). Agentic AI Systems Applied to Tasks in Financial Services: Modeling and Model Risk Management Crews. arXiv preprint arXiv:2502.05439. https://arxiv.org/abs/2502.05439