Research · Academic & arXiv
Research sweep · deep · 2015 – 2026
AI in Weather and Climate Prediction
This sweep covers AI in weather and climate prediction across the machine-learning era from 2015 to June 2026, with historical context from mid-twentieth-century numerical weather prediction (NWP) and Lorenz's chaos theory. It traces the shift from physics-based NWP and statistical post-processing (MOS) to data-driven models (GraphCast, GenCast, Pangu-Weather, FourCastNet, Aurora, NeuralGCM, ECMWF AIFS); how forecasters at ECMWF, NOAA, and the Met Office have operationalised them; how their measured accuracy compares with the IFS; and the predictability limits imposed by chaos, the Lorenz attractor, and the butterfly effect.
- Claude Opus 4.8
Synthesised 2026-06-26
Narrative
The academic literature on machine learning for weather prediction spans roughly three phases. The foundational period, anchored by Rasp et al.'s WeatherBench (JAMES, 2020), an ERA5-based benchmark, established reanalysis data as the common training substrate and pushed the field to define reproducible metrics. ERA5's 40-year, 0.25-degree hourly archive, described by Hersbach et al. (2020) in the Quarterly Journal of the Royal Meteorological Society, proved decisive: without it, data-hungry deep models would have had no adequate training set. Rasp and Thuerey (JAMES, 2021) then showed that a ResNet pretrained on climate simulations could match simple operational baselines, signalling that the deep-learning approach was credible but not yet competitive.
The second phase, 2022 to 2023, saw rapid architectural diversification and the first decisive outperformance of ECMWF's HRES. Pathak et al.'s FourCastNet (arXiv:2202.11214, 2022) applied Adaptive Fourier Neural Operators to the global atmosphere at 0.25-degree resolution, trading some accuracy for speed. Bi et al.'s Pangu-Weather (Nature, 2023) used 3D Earth-Specific Transformers and a multi-timescale model combination to outperform HRES on several variables. Lam et al.'s GraphCast (Science, 2023) employed a multi-scale graph neural network trained on 227 ERA5 variables and was the first model to beat HRES decisively on standard medium-range metrics, outperforming it on 90% of 1,380 verification targets. Rasp et al.'s WeatherBench 2 (arXiv:2308.15560, later JAMES 2024) upgraded the evaluation framework to 0.25-degree ground truth and live scoreboards, giving the community an independent yardstick.
The frontier phase, from late 2023 to mid-2026, has extended ML forecasting into probabilistic, hybrid, and foundation-model territory. Price et al.'s GenCast (arXiv:2312.15796, Nature 2024) introduced a diffusion-based ensemble model that outperforms ECMWF's ENS on 15-day probabilistic metrics. Kochkov et al.'s NeuralGCM (Nature, 2024) fused a differentiable dynamical core with machine-learned physics, achieving climate-scale stability while matching GraphCast on medium-range skill. Bodnar et al.'s Aurora (arXiv:2405.13063, Nature 2025) trained a 3D Perceiver-Swin foundation model on over one million hours of heterogeneous Earth-system data, outperforming operational forecasts on air quality, ocean waves, tropical cyclone tracks, and high-resolution weather. ECMWF operationalised its own GNN-based system, AIFS (arXiv:2406.01465, Lang et al. 2024), which went live in February 2025. That launch confirmed that data-driven forecasting had crossed from research demonstration to production.
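Probabilistic comparisons between ensembles are commonly scored with the continuous ranked probability score (CRPS), which rewards ensembles that are both close to the observation and appropriately spread. The sketch below is a minimal illustration of the standard empirical estimator for a single scalar observation. The function name `ensemble_crps` and the toy values are assumptions made for illustration and are not taken from any of the cited papers or their evaluation code.

```python
from itertools import combinations


def ensemble_crps(members, observation):
    """Empirical CRPS of an ensemble against one scalar observation.

    Uses the estimator E|X - y| - 0.5 * E|X - X'|, where X and X' are
    independent draws from the ensemble. Lower is better; for a single
    member the score reduces to the absolute error.
    """
    m = len(members)
    if m == 0:
        raise ValueError("ensemble must have at least one member")
    # Mean absolute error of the members against the observation.
    accuracy = sum(abs(x - observation) for x in members) / m
    # Mean absolute difference over all ordered member pairs (i, j),
    # computed from the unordered pairs to avoid double work.
    pair_sum = sum(abs(a - b) for a, b in combinations(members, 2))
    spread = 2.0 * pair_sum / (m * m)
    return accuracy - 0.5 * spread


# Hypothetical toy case: a two-member ensemble that brackets the
# observation versus a single forecast that misses it by the same margin.
bracketing = ensemble_crps([1.0, 3.0], 2.0)
single = ensemble_crps([3.0], 2.0)
print(bracketing, single)
```

In this toy case the ensemble that brackets the observation scores better (lower) than the single forecast with the same absolute error, which is the behaviour that makes CRPS suitable for comparing ensemble systems.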
Critical limits are now being mapped in detail. Selz and Craig (Geophysical Research Letters, 2023) showed that current deterministic ML models fail to reproduce the butterfly effect: infinitesimal initial perturbations do not grow at the correct rate, implying that these models sidestep rather than solve Lorenz's predictability problem. Zhang et al. (arXiv:2504.20238, 2025) tested whether ML models can push deterministic predictability beyond Lorenz's two-week estimate, finding modest extensions consistent with improved effective initial conditions. Empirically, Zhang, Fischer et al. (arXiv:2508.15724, 2025) demonstrated that for record-breaking heat, cold, and wind extremes, ECMWF HRES still consistently outperforms GraphCast, Pangu-Weather, and FuXi, with AI models systematically underpredicting the intensity and frequency of tail events. A study of robustness under climate change (arXiv:2409.18529) found that AI models trained on present-day ERA5 produce skillful forecasts in pre-industrial and +2.9 K future climates, but in warmer scenarios they show cold biases that drift back towards the training distribution. This is a direct expression of out-of-distribution fragility, which the NeuralGCM hybrid architecture partially mitigates by design.
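What it means for a small perturbation to "grow at the correct rate" can be illustrated with a twin experiment on the three-variable Lorenz (1963) system. This is a toy stand-in, not the atmospheric setup used by Selz and Craig. The sketch below integrates two trajectories that differ by a tiny initial offset and records their separation over time. The classic parameters (sigma = 10, rho = 28, beta = 8/3) are standard; the step size, perturbation size, initial state, and function names are illustrative assumptions.

```python
import math

DT = 0.01  # integration step (assumed for illustration)


def lorenz63(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Time derivative of the Lorenz (1963) system at the given state."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def rk4_step(state, dt):
    """Advance the state by one classical fourth-order Runge-Kutta step."""
    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))

    k1 = lorenz63(state)
    k2 = lorenz63(shifted(state, k1, dt / 2))
    k3 = lorenz63(shifted(state, k2, dt / 2))
    k4 = lorenz63(shifted(state, k3, dt))
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def twin_divergence(initial, perturbation=1e-8, dt=DT, steps=2500):
    """Separation between a control run and a slightly perturbed twin."""
    control = initial
    perturbed = (initial[0] + perturbation, initial[1], initial[2])
    distances = []
    for _ in range(steps):
        control = rk4_step(control, dt)
        perturbed = rk4_step(perturbed, dt)
        distances.append(math.dist(control, perturbed))
    return distances


distances = twin_divergence((1.0, 1.0, 1.0))
for step in (0, 499, 999, 1499, 1999, 2499):
    print(f"t = {(step + 1) * DT:5.2f}  separation = {distances[step]:.3e}")
```

In a chaotic system the separation grows roughly exponentially until it saturates at the size of the attractor. A forecast model whose perturbations grow more slowly than this would appear more predictable than the system it emulates, which is the concern raised about deterministic ML models.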
Sources
| ID | Title | Outlet | Date | Significance |
|---|---|---|---|---|
| a1 | Learning skillful medium-range global weather forecasting | Science | 2023-12 | GraphCast (Lam et al., DeepMind) outperforms ECMWF HRES on 90% of 1,380 verification targets using a multi-scale GNN trained on 227 ERA5 variables, marking the first definitive ML victory over operational NWP. |
| a2 | Accurate medium-range global weather forecasting with 3D neural networks | Nature | 2023-07 | Pangu-Weather (Bi et al., Huawei) introduces 3D Earth-Specific Transformers and a multi-timescale inference strategy that surpasses HRES on several variables; released with open weights. |
| a3 | GenCast: Diffusion-based ensemble forecasting for medium-range weather | arXiv / Nature (2024) | 2023-12 | Price et al. introduce a graph-based diffusion ensemble model that outperforms ECMWF ENS on 15-day probabilistic metrics, establishing generative ML as a credible route to ensemble forecasting. |
| a4 | Neural general circulation models for weather and climate | Nature | 2024-07 | Kochkov et al. (Google) fuse a differentiable dynamical core with learned ML physics, producing a hybrid model that matches GraphCast on medium-range skill and reproduces realistic climate statistics over decades. |
| a5 | A Foundation Model for the Earth System | arXiv / Nature (2025) | 2024-05 | Bodnar et al. (Microsoft) train Aurora on over one million hours of diverse geophysical data; it outperforms operational forecasts for air quality, ocean waves, tropical cyclone tracks, and high-resolution weather at much lower compute. |
| a6 | AIFS -- ECMWF's data-driven forecasting system | arXiv | 2024-06 | Lang et al. describe ECMWF's own GNN-transformer system, which went operational in February 2025, making ECMWF the first major centre to operationalise a purely data-driven global weather forecast. |
| a7 | WeatherBench 2: A benchmark for the next generation of data-driven global weather models | arXiv / JAMES (2024) | 2023-08 | Rasp et al. provide the community's standard independent evaluation framework for ML weather models, with ERA5 at 0.25-degree resolution as ground truth and a continuously updated live leaderboard. |
| a8 | WeatherBench: A Benchmark Data Set for Data-Driven Weather Forecasting | Journal of Advances in Modeling Earth Systems | 2020-11 | Foundational benchmark paper by Rasp et al. that established ERA5-based evaluation of data-driven global weather models and catalysed rapid progress by providing common baselines. |
| a9 | FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators | arXiv | 2022-02 | Pathak et al. (NVIDIA/Berkeley Lab) apply Adaptive Fourier Neural Operators to 0.25-degree global weather prediction, producing the first model to achieve HRES-competitive skill at that resolution. |
| a10 | Can Artificial Intelligence-Based Weather Prediction Models Simulate the Butterfly Effect? | Geophysical Research Letters | 2023-10 | Selz and Craig demonstrate that current deterministic ML models fail to reproduce rapid upscale error growth from small perturbations, meaning they sidestep rather than respect Lorenz's intrinsic predictability limit. |
| a11 | Atmospheric Predictability Beyond 30 Days with Machine Learning | arXiv | 2025-04 | Tests whether ML models can push deterministic predictability beyond Lorenz's two-week estimate, revisiting the theoretical limit in the context of modern data-driven approaches. |
| a12 | Numerical models outperform AI weather forecasts of record-breaking extremes | arXiv | 2025-08 | Zhang, Fischer et al. show empirically that ECMWF HRES consistently outperforms GraphCast, Pangu-Weather, and FuXi on record-breaking heat, cold, and wind events, quantifying the tail-event failure mode of current ML models. |
| a13 | Robustness of AI-based weather forecasts in a changing climate | arXiv | 2024-09 | Tests AIFS, GraphCast, and Pangu-Weather on pre-industrial, present-day, and +2.9 K future climates, finding skillful short-range forecasts but systematic cold biases in warmer states that expose out-of-distribution fragility. |
| a14 | ClimaX: A foundation model for weather and climate | arXiv / ICML (2023) | 2023-01 | Nguyen et al. present the first weather-climate foundation model using transformer pretraining on CMIP6 datasets, demonstrating generalisation to tasks unseen during pretraining including climate projections. |
| a15 | Evaluation of five global AI models for predicting weather in Eastern Asia and Western Pacific | npj Climate and Atmospheric Science | 2024-09 | Independent regional evaluation of Pangu-Weather, FourCastNet v2, GraphCast, FuXi, and FengWu against ERA5 for typhoon track and intensity, providing geographically focused accuracy assessment beyond global averages. |
| a16 | Comparing and Contrasting Deep Learning Weather Prediction Backbones on Navier-Stokes and Atmospheric Dynamics | arXiv | 2024-07 | Systematic controlled comparison of GNN, transformer, and Fourier-operator architectures on both synthetic Navier-Stokes and real ERA5 data, isolating the effect of architecture from training choices. |
| a17 | An update to ECMWF's machine-learned weather forecast model AIFS | arXiv | 2025-09 | Describes AIFS 1.1.0, the operational version deployed August 2025, documenting incremental skill improvements, new variables, and the correction of a precipitation forecast issue in the initial release. |
| a18 | Neural general circulation models for modeling precipitation | Science Advances | 2026-01 | Yuval, Kochkov et al. extend the NeuralGCM framework by training directly on satellite-based precipitation observations, demonstrating improved simulation of extremes and the diurnal cycle over existing GCMs. |
| a19 | Data-driven ensemble forecasting with the AIFS | ECMWF Newsletter | 2024-10 | Describes ECMWF's two parallel approaches to ML ensemble forecasting (diffusion-based and CRPS-trained AIFS), both using the same encoder-GNN-transformer architecture as the deterministic system. |
| a20 | FourCastNet 3: A geometric approach to probabilistic machine-learning weather forecasting at scale | arXiv | 2025-07 | Latest generation of the NVIDIA FourCastNet series applies geometric deep learning to probabilistic global forecasting at scale, extending the Fourier-operator lineage into ensemble territory. |
| a21 | WeatherBench 2: A Benchmark for the Next Generation of Data-Driven Global Weather Models (JAMES published version) | Journal of Advances in Modeling Earth Systems | 2024-06 | Published final version of the WeatherBench 2 framework, providing metrics, baselines, and methodology used by all major ML weather model evaluations as of 2024-2026. |
| a22 | Butterfly Effects and Finite Predictability in AI-Based Weather Prediction | ESS Open Archive | 2025-07 | Revisits Selz and Craig's butterfly-effect findings using a taxonomy of three types of butterfly effect, sharpening the understanding of which predictability limits current AI models do and do not respect. |
| a23 | A Practical Probabilistic Benchmark for AI Weather Models | arXiv | 2024-01 | Brenowitz et al. propose CRPS-based probabilistic evaluation designed specifically for ML weather models, filling a gap in WeatherBench 2, which focused on deterministic metrics. |
| a24 | SAFE: A Novel Approach to AI Weather Evaluation through Stratified Assessments of Forecasts over Earth | arXiv | 2025-10 | Proposes geographically stratified evaluation of AI weather models to surface regional performance differences masked by global-average RMSE, exposing heterogeneous skill across latitudes and terrain. |
| a25 | Do machine learning climate models work in changing climate dynamics? | arXiv | 2025-09 | Systematic OOD evaluation of state-of-the-art ML climate models under distribution-shifted scenarios, documenting where models fail to generalise and informing credible near-term directions for climate-scale ML. |