# Medium-Range Weather Forecasting with Time- and Space-aware Deep Learning

# The Birth of Numerical Weather Prediction

## An Early Benchmark

## NWP and AI: Room for Improvement

# Spatial Biases in Weather Forecasting

## The Data

# Machine Learning Approaches to NWP

## Graph Neural Network Approaches

### Forecasting Global Weather with Graph Neural Networks

### GraphCast

## Transformer-based Approaches

### FourCastNet

### Pangu-Weather

### FuXi

# Forecast Forecast

## Probabilistic and Physically-Constrained Forecasting

## Multi-model Mixtures

## Finer Scales

Machine learning models now outperform the best numerical weather prediction systems in both speed and accuracy. But the theory underlying their impressive performance is as old as numerical weather prediction itself.

Published on Apr 04, 2024

In 1922, the English mathematician Lewis Fry Richardson published a book detailing the boundaries of the field of numerical weather prediction (NWP) nearly half a century before it would actually take shape, cementing himself as one of the most prescient scientific minds in history. Richardson’s first and only book-length work, *Weather Prediction by Numerical Process (WPNP)* (Richardson, 1922), is an ingenious, Turing-like blend of mathematical science and technological fiction which spelled out—with near-perfect foresight—how global weather patterns would someday be predictable if the computational and observational infrastructure were built to support it. While this infrastructure has evolved over the ensuing century from fantasy to public utility, and now from public utility to downloadable AI model, Richardson’s insights about the spatial and temporal constraints which govern NWP have remained at the theoretical core of each iteration.

Inspired by earlier meteorologists like Abbe and Bjerknes, who posited that weather events may be explainable based on hydrodynamic and thermodynamic properties of the atmosphere, the majority of *WPNP* is traversed by a labyrinthine collection of partial differential equations which are hypothesized to govern the behavior of a wide variety of atmospheric variables such as pressure, temperature, water content, and wind direction. While most of these governing equations were well-known to the meteorological community at the time, the genius of *WPNP* derives from the book’s integration of these atomic physical equations into a *system* for weather forecasting which, by evolving the solutions of these equations forward from initial observations of the weather’s state, could forecast the weather within a geographic area from physical first principles.

Richardson assumed that the state of the weather could be approximated by seven partial differential equations governing atmospheric pressure, temperature, density, water content, and the three directional velocity components of wind. He imagined these variables to be measurable along a “chequering” of points patterned across the Earth’s surface which divided the atmosphere into columns of 3° east-west and 200 km north-south, with 12,000 columns circling the globe. Each of these columns was divided horizontally by four surfaces into vertical layers, creating a three-dimensional grid. From this chequering, the solution to each governing equation could be approximated in finite difference form and then propagated forward in time, producing a forecast of future weather based on the current conditions reported within each chequer.
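Richardson’s core numerical idea, replacing derivatives with finite differences on a grid and stepping the state forward in time, can be sketched in miniature. The toy below is not Richardson’s actual system of equations; it advances a single advected quantity around a one-dimensional ring of grid points, with all parameter values chosen purely for illustration.

```python
import numpy as np

def step(u, c=1.0, dx=1.0, dt=0.1):
    """Advance the gridded state one time step, using an upwind
    finite-difference approximation of du/dt = -c * du/dx."""
    dudx = (u - np.roll(u, 1)) / dx        # backward difference, periodic ring
    return u - c * dt * dudx

# Initial "observations": a smooth bump of a pressure-like quantity
# reported along a ring of 100 chequers.
x = np.arange(100)
u = np.exp(-0.5 * ((x - 50) / 5.0) ** 2)
for _ in range(100):                       # propagate the state forward in time
    u = step(u)
# The bump has been advected c * dt * 100 = 10 grid points downstream.
```

The same pattern, approximate the spatial derivatives on a grid, then integrate in time, underlies both Richardson's hand calculation and modern operational solvers; only the equations and grids have grown vastly more elaborate.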

The majority of *WPNP* functions as a rote review of the differential equations governing the temporal relationships of the atmospheric variables in Richardson’s model. This technical review appears at first to meander, jumping between ideas in mathematically-condensed language. It isn’t until the ninth chapter, 180 pages into the book, that we are able to reinterpret the earlier chapters not as aimless mathematical statements, but as the requisite elements for a tightly-wound computational forecasting system. With the background theory finally in place, nearly three-quarters of the way through the book the tone shifts to that of a field journal, documenting the manual calculations of the first modern numerical weather prediction system. In this calculation, Richardson produces a 6-hour forecast of initial changes of the atmospheric mass and wind variables from a collection of initial state estimations over a small chequering of Europe centered near Munich. He writes,

The process described in Ch. 8 has been followed so as to obtain $\partial / \partial t$ of each one of the initially tabulated quantities. The arithmetical accuracy is as follows. All computations were worked twice and compared and corrected. The last digit is often unreliable, but is retained to prevent the accumulation of arithmetical errors. Multiplications were mostly worked by a 25 centim slide rule.

The rate of rise of surface pressure, $\partial p_G / \partial t$, is found on Form PXIII as 145 millibars in 6 hours, whereas observations show that the barometer was nearly steady. This glaring error is examined in detail below in Ch. 9/3…

Despite expending significant manual effort in compiling the observational estimates, computing the forecast, and double-checking the calculations, Richardson’s forecast predicted an unrealistic 145 hPa change in pressure over Munich within 6 hours, corresponding to a root mean-squared error (RMSE) of roughly 145 hPa for this variable. While his other variable calculations were less extreme, this forecasting failure was cited by many meteorologists at the time as reason to dismiss Richardson’s proposed numerical approach to weather prediction (Lynch, 2022).

The dismissal of his work by contemporaries based on the miserable results of this first numerical approach to weather forecasting proved to be immensely short-sighted. In *WPNP*’s penultimate chapter, Richardson, no longer burdened by the empirical failure of his tractable model, offers his projection of how this model for weather forecasting might be scaled in the future, and ends up producing one of the most clairvoyant works of science fiction in the process. In this chapter, titled “Some Remaining Problems”, Richardson homes in on the spatio-temporal computational constraints which continue to dictate the structure of numerical weather computing to this day:

It took me the best part of six weeks to draw up the computing forms and to work out the new distribution in two vertical columns for the first time. My office was a heap of hay in a cold rest billet. With practice the work of an average computer might go perhaps ten times faster. If the time-step were 3 hours, then 32 individuals could just compute two points so as to keep pace with the weather, if we allow nothing for the very great gain in speed which is invariably noticed when a complicated operation is divided up into simpler parts, upon which individuals specialize. If the co-ordinate chequer were 200 km square in plan, there would be 3200 columns on the complete map of the globe. In the tropics the weather is often foreknown, so that we may say 2000 active columns. So that 32 x 2000 = 64,000 computers would be needed to race the weather for the whole globe. That is a staggering figure. Perhaps in some years’ time it may be possible to report a simplification of the process. But in any case, the organization indicated is a central forecast-factory for the whole globe, or for portions extending to boundaries where the weather is steady, with individual computers specializing on the separate equations. Let us hope for their sakes that they are moved on from time to time to new operations.

From his tedious experience producing a numerical weather prediction by hand, Richardson understood intimately that numerical weather prediction amounts to a race against time: forecasting the weather in 6 hours is useless if the forecast takes longer than that to produce. He was also keen to point out that global weather forecasting would become a computationally expensive process whose scale would be determined by the fidelity of the computational grid that anchors the observations and forecasts across the Earth’s surface. Richardson’s imaginative prophecy continues:

After so much hard reasoning, may one play with a fantasy? Imagine a large hall like a theatre, except that the circles and galleries go right round through the space usually occupied by the stage. The walls of this chamber are painted to form a map of the globe. The ceiling represents the north polar regions, England is in the gallery, the tropics in the upper circle, Australia on the dress circle and the antarctic in the pit. A myriad computers are at work upon the weather of the part of the map where each sits, but each computer attends only to one equation or part of an equation. The work of each region is coordinated by an official of higher rank. Numerous little night signs display the instantaneous values so that neighbouring computers can read them. Each number is thus displayed in three adjacent zones so as to maintain communication to the North and South on the map. From the floor of the pit a tall pillar rises to half the height of the hall. It carries a large pulpit on its top. In this sits the man in charge of the whole theatre ; he is surrounded by several assistants and messengers. One of his duties is to maintain a uniform speed of progress in all parts of the globe. In this respect he is like the conductor of an orchestra in which the instruments are slide-rules and calculating machines. But instead of waving a baton he turns a beam of rosy light upon any region that is running ahead of the rest, and a beam of blue light upon those who are behindhand.

Note the spatial and message-passing structure implicit in Richardson’s weather theatre fantasy. Each computer (a person, in his time) was tasked with the computations related to a particular chequer, but these computations would both influence and be influenced by the computational results derived from its neighboring computers. Richardson imagines these spatially distributed calculations to be carried out in time, with short-range forecasts feeding into those at longer ranges.

As will become clear in the following sections, the spatio-temporal computing patterns imagined by Richardson to drive the weather forecasting in his fantasy theatre are exactly those which have come to dominate the structure of NWP over a century later. Moreover, these same spatio-temporal constraints are being used as the foundation of a new approach to NWP using spatially-biased deep learning architectures from the field of machine learning. Indeed, Richardson’s fantasy is as relevant today as it has ever been.

While Richardson himself never saw the construction of his forecast theatre, in time it was eventually built, and its complexity grew with each passing decade. What Richardson couldn’t have predicted, however, is that the physical size of his theatre would eventually become uncoupled from this complexity. With the introduction of AI models in NWP, Richardson’s theatre now fits on a single GPU.

Richardson died in 1953, still decades before the emergence of a global, real-time numerical weather prediction system at the scale imagined in his fantasy. However, he did live long enough to observe the first major steps towards this system: the results of the first numerical weather predictions implemented on a digital computer. In 1950, Charney, Fjörtoft, and von Neumann calculated a forecast of atmospheric flow on the ENIAC, the first programmable, general-purpose computer (Charney, Fjörtoft, & Neumann, 1950). While their forecast was concerned with integrating the barotropic vorticity equation alone, their approach drew clear inspiration from Richardson’s work, including the use of a grid of points to anchor the differential equation. Shortly after publication, Charney shared his results, published in *Tellus*, with Richardson. In response, Richardson congratulated Charney on the “remarkable progress which has been made at Princeton” (Platzman, 1968). Included in his response, however, was an intriguing new analysis:

I have today made a tiny psychological experiment on the diagrams in your Tellus paper of November 1950. The diagram *c* was hidden by a card, which also hid the legend at the foot of the diagrams. The distinctions between *a*, *b* and *c* were concealed from the observer, who was asked to say which of *a* [initial map] and *d* [computed map 24 hours later] more nearly resembled *b* [observed map 24 hours later]. My wife’s opinions were as follows:

Thus d has it on the average, but only slightly. This, although not a great success of a popular sort is anyways an enormous scientific advance on the single, and quite wrong, result in which Richardson (1922) ended.

Richardson provided his wife, Dorothy Garnett, with a map of the initial observations of the variable being predicted (a), the 24-hour forecast of that variable from the model mapped to the same grid (d), and the actual observation recorded 24 hours in the future (b). From this, Dorothy determined that the model forecast slightly outperformed a baseline which simply predicts no change over the following 24 hours, a benchmark called *persistence* in modern parlance. This analysis likely makes Dorothy Garnett the first person ever to benchmark an NWP model. And she did a good job. A 2008 re-creation of the *Tellus* forecast showed that Dorothy’s eyeballed evaluations were approximately in line with modern quantitative metrics of forecast skill (Lynch, 2008), metrics we will make significant use of in the course of this review.

Charney was greatly encouraged by the Richardsons’ response to the *Tellus* work, and devoted time over the following year to honing his group’s algorithmic approximation of barotropic vorticity. Convinced his new results would sweep the Richardsons’ scorecard, he sent an updated version of the paper figures to Richardson in late 1953. Richardson died five days before the reprint arrived (Platzman, 1968).

Charney, Fjörtoft, and von Neumann’s results were encouraging enough to generate a critical mass of research interest which would drive NWP to operational reality within the following decade. A paradigmatic application of modern computing, NWP’s development has since been propelled by the exponential growth in processing power, developing into one of the most transformative and powerful technological innovations in human history. NWP has completely reshaped humanity’s interaction with the bewildering complexity of weather, transforming public policy from reactive to proactive in the process. This achievement permeates every corner of the modern economy. Weather forecasts impact the stock market, agriculture, global transportation, and astronomy, among many other domains. As the relentless march of processing power continues into the 21st century, NWP’s societal relevance is predicted to become even more widespread as its predictive horizons grow (Benjamin et al., 2019).

In tandem with NWP’s meteoric growth into the 21st century, the fields of machine learning and artificial intelligence experienced similar expansion propelled, like NWP, by Moore’s law and the growing availability of data. Indeed, the histories of AI and NWP share striking similarities. The theoretical foundations of both fields were developed in the early 20th century, with their foundational theories both predating by decades the computational substrate necessary to realize these theoretical possibilities. John von Neumann had an early influence in the development of both AI and NWP, and both fields are direct beneficiaries of Moore’s law, the growing availability of data, and the emergence of the internet. Both fields are concerned primarily with making predictions, and these predictions have massive influence on modern society.

These historical similarities at the field level provoke a question: to what extent can techniques and insights from the fields of machine learning and artificial intelligence inform the practice of NWP?

This question does not imply that NWP “needs” AI to succeed. Quite the contrary, in fact. NWP stands as an exceptional scientific achievement on its own, an effective integration of computational power and theoretical reasoning derived from physical first-principles. This approach has resulted in measurable and sustained progress for over a century. Over the last 40 years, medium-range (~10 days in advance) weather forecasting skill, as measured by the correlation between the forecasted and actual deviation from historical average, has increased by about a day per decade (Bauer, Thorpe, & Brunet, 2015). In other words, a 5-day weather forecast in 2015 was about as accurate as a 3-day weather forecast in 1995, and a 6-day forecast next year will be about as predictive as a 5-day from 2015. Furthermore, NWP has achieved this success primarily by improving the accuracy and variety of the physics-based models which are known to describe the underlying weather dynamics—modeling forecasts directly and intuitively as the solution to a system of spatio-temporal differential equations whose accuracy scales as these systems are resolved over finer grid resolutions upon the earth’s surface.
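The skill measure referenced above, the correlation between forecasted and observed deviations from the historical average, is commonly called the anomaly correlation coefficient (ACC). A minimal, unweighted sketch follows; operational scores typically also weight grid points by latitude, and the climatology values here are toy numbers.

```python
import numpy as np

def acc(forecast, observed, climatology):
    """Anomaly correlation coefficient: correlation between forecast
    and observed deviations from the historical (climatological) mean."""
    fa = forecast - climatology            # forecast anomaly
    oa = observed - climatology            # observed anomaly
    return np.sum(fa * oa) / np.sqrt(np.sum(fa ** 2) * np.sum(oa ** 2))

clim = np.array([15.0, 10.0, 5.0, 0.0])    # historical average temperatures
obs = clim + np.array([1.0, -2.0, 0.5, 3.0])
perfect = acc(obs, obs, clim)              # a perfect forecast scores 1.0
inverted = acc(2 * clim - obs, obs, clim)  # anti-correlated anomalies score -1.0
```

A forecast that simply reproduces the climatology scores 0, which is why ACC measures genuine skill beyond the historical average rather than raw agreement with observations.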

Despite this success, NWP does face a number of structural shortcomings that machine learning-derived models (hereby referred to as AI-NWP) do not. While NWP methods can improve accuracy by increasing grid resolution, resolving the spatio-temporal equations approximating the weather over a finer grid necessitates a polynomial increase in computation, and the most accurate NWP models in production today already push the limit of their computational resources. The Integrated Forecasting System (IFS) at the European Centre for Medium-Range Weather Forecasts (ECMWF), the most accurate NWP system on the planet, ingests gigabytes of observational data and outputs terabytes of forecast data per day.

NWP also struggles with the ramifications of measurement error. Terrestrial weather is the paradigmatic example of a chaotic system, so weather forecasting, in turn, is highly dependent on the specification of the system’s initial conditions. The severity of this dependence led mathematician and meteorologist Edward Norton Lorenz to once quip that “one flap of a sea gull’s wings would be enough to alter the course of the weather forever” (Lorenz, 1963). This remark would eventually mature into the famous butterfly effect, a metaphor highlighting how drastically the behavior of a complex system may change with respect to even small deviations in initial conditions. As a numerical approximation of a complex system, modern NWP is highly sensitive to such mis-specifications in initial conditions. To deal with this fragility, NWP forecasts are often constructed from an ensemble of individual forecast runs, each initialized with slightly different initial conditions to account for uncertainty in the system state. As the weather-governing differential equations are rolled out through time, these discrepancies in initial conditions become amplified, resulting in potentially large divergences in forecasted weather patterns as the prediction window expands from hours to days. This forecast ensemble is then compiled into a probability distribution over future weather states, reflecting these inherent uncertainties. Because most NWP forecasting is governed by physical equations which evolve the weather state through time, medium-range forecasting errors can only be suppressed by improving short-range skill. While this adverse influence of measurement error on initial conditions would affect even a perfect forecasting model, the flexibility and efficiency of AI weather models provide a route for mitigating these forecast errors by finetuning and ensembling forecast models across time horizons.
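The sensitivity described above is easy to reproduce in Lorenz’s own three-variable convection model (Lorenz, 1963). In this sketch, two trajectories whose initial conditions differ by one part in a million are indistinguishable at first but drift apart as the rollout lengthens; the forward-Euler integrator and step size are illustrative choices, not Lorenz’s.

```python
import numpy as np

def lorenz_step(s, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One forward-Euler step of the Lorenz-63 system."""
    x, y, z = s
    return s + dt * np.array([sigma * (y - x),
                              x * (rho - z) - y,
                              x * y - beta * z])

def rollout(s, n_steps):
    for _ in range(n_steps):
        s = lorenz_step(s)
    return s

a = np.array([1.0, 1.0, 1.0])
b = a + np.array([1e-6, 0.0, 0.0])          # a sea gull's wing flap
d_short = np.linalg.norm(rollout(a, 100) - rollout(b, 100))
d_long = np.linalg.norm(rollout(a, 1000) - rollout(b, 1000))
# d_short remains microscopic; d_long is far larger, as the
# perturbation grows roughly exponentially with lead time.
```

Ensemble forecasting exploits exactly this behavior in reverse: by launching many rollouts from perturbed initial states, the spread of the ensemble estimates how trustworthy the forecast remains at each lead time.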

Finally, the reliance of NWP on explicit physical models of atmospheric variables fundamentally limits the speed at which new types of observational data may be integrated into NWP methods. Each NWP model relies on the tedious mathematical exposition of each variable’s relationship to a complex web of weather-defining equations. As our scientific understanding of weather-driving variables grows, and as the capacity for routinely measuring such variables via autonomous or distributed means increases, so too will the complexity and integration costs of including these variables within NWP models. By contrast, AI methods tend to scale well with increasing variable dimensionality, meaning the introduction of new observational data is relatively inexpensive in terms of forecasting runtime and model complexity. This ability to rely on unresolved variables becomes more valuable at longer, climatological time horizons.

While the specific architectures underlying AI-NWP are new, the structure of these approaches are merely a modern interpretation of Richardson’s original fantasy. To extend Richardson’s metaphor to accommodate these new AI approaches, we must only minutely alter the behavior of each computer of his forecasting theatre. In place of each equation-solving computer which uses their neighbors’ approximated weather states to solve for the weather patterns in their own chequer-shaped jurisdiction, we substitute an experienced gambler who observes their neighbors’ predictions on the weather’s next state and makes a prediction, conditioned on their neighbors’ information, regarding the weather change to occur within their own chequer.

Translated into modern machine learning parlance, this approach to making global predictions based on a distributed collection of locally-conditioned predictions is known as a *spatial inductive bias*. Broadly construed, an inductive bias is a structural form placed on a machine learning model which restricts the possible collection of functions which may be learned during training by constraining the ways in which the model may interact with or process input data. A *spatial* inductive bias, then, is one in which these constraints arise from rules about how the model is allowed to interact with data according to an underlying spatial domain (the Earth, in this case).

While enforcing these types of learning constraints may seem counterintuitive to the ethos of modern machine learning, which views hand-crafted model features as a relic of a bygone low-parameter and small-data era, their inclusion in deep learning models is often crucial for reliably learning, from data, model parameterizations which generalize well in complex, noisy environments. In short, if one can align a model’s inductive biases to reflect the structural constraints of the data-generating system under study, then the model will be more likely to find (via gradient descent) and represent the class of functions which align with the actual system structure.

The necessity of spatial inductive biases for weather prediction has been understood since the birth of NWP. The atmosphere is a fluid which flows and swirls across the planet, causing weather conditions to vary continuously with respect to three-dimensional space measured along the surface of the globe. Traditional NWP methods attempt to exploit this fact directly by solving for atmospheric conditions from a collection of fluid dynamics equations. Unfortunately, we do not have access to weather measurements at every location in the atmosphere, and even if we did, we could not feasibly simulate the future behavior of such a vast collection of landmarks. Instead we follow Richardson’s simplifications, measuring the weather where it is feasible to do so and using a coarse grid as the basis for interpolating the weather to unobserved regions or to sub-grid fidelity, relying on the assumption that weather varies continuously across space. Most global NWP models use a horizontal grid spacing of less than 25km, with the most accurate global model, ECMWF’s IFS, achieving resolution of approximately 9km or 0.1° in its high-resolution (HRES) and ensemble (ENS) forecasts (Rasp et al., 2023).
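The continuity assumption above is what licenses interpolating gridded fields to unobserved locations. Below is a minimal sketch of bilinear interpolation on a small latitude/longitude patch; the grid spacing and temperature values are made up for illustration, and operational systems use far more sophisticated assimilation schemes.

```python
import numpy as np

def bilinear(field, lat, lon, dlat=0.25, dlon=0.25):
    """Bilinear interpolation of field[i, j] (rows = latitude,
    cols = longitude) at coordinates (lat, lon), measured in degrees
    from the grid origin."""
    i, j = lat / dlat, lon / dlon
    i0, j0 = int(i), int(j)
    di, dj = i - i0, j - j0
    return ((1 - di) * (1 - dj) * field[i0, j0]
            + (1 - di) * dj * field[i0, j0 + 1]
            + di * (1 - dj) * field[i0 + 1, j0]
            + di * dj * field[i0 + 1, j0 + 1])

temps = np.array([[10.0, 12.0],
                  [14.0, 16.0]])            # a 2x2 patch of gridded temperatures
mid = bilinear(temps, 0.125, 0.125)          # value at the center of the cell
```

Because weather varies continuously in space, the interpolated midpoint (here the average of the four corners, 13.0) is a reasonable estimate at sub-grid fidelity; the same assumption would fail badly for a spatially discontinuous field.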

Given these continuity properties of weather with respect to space, and given the success of spatially-discretized NWP models to date, it is reasonable to expect that effective AI-NWP models should also be structured so as to preserve this spatial dependence during forecast prediction. As we will see, the best-performing AI-NWP models to date do just this, relying on some specification of a spatial inductive bias over the discretized grid of observed weather states to generate forecasts.

Unlike traditional NWP, which typically requires only real-time weather state observations to produce forecasts, AI-based methods require access to volumes of historical weather data to train the underlying (deep learning) model to predict future changes to the weather given its past states.

Unfortunately, creating a global, grid-level history of the Earth’s weather is not as easy as simply recording the observed weather state at each point of the grid through time. Actual weather measurement occurs non-uniformly across both space and time. Remote locations on the globe may never support actual recordings of relevant atmospheric variables over their gridded region, while urban centers may support multiple suites of real-time measurements within a single grid square. To bridge this gap between the non-uniformity of real weather observations and an idealized globe-spanning grid of historical weather variables, meteorologists must perform a complex weather inference calculation known as *reanalysis*. This reanalysis process is computationally expensive, requiring the interpolation of variable states over the grid according to the solutions of a similar collection of differential equations which support forecast-driven NWP models. In short, reanalysis constitutes a theoretically-informed “best guess” at the state of the weather across time at a collection of locations given the available historical weather observations.

One of the most accurate and encompassing reanalysis datasets is the ERA5 dataset (Hersbach et al., 2020). The dataset (produced by the ECMWF) provides an assimilated estimation of the weather for every hour since 1940 at each point across a 0.25° (~30 km) grid spanning the globe. The immense historical and spatial scale of this dataset makes it a perfect training data candidate for AI-NWP models, and indeed most AI-NWP approaches use some version of ERA5 as a source of training data. Note that the reanalysis provided by ERA5 is not real-time and includes observational information from both the past and future to assimilate the weather at each point. This means that operational models cannot assume the use of ERA5-quality data when making real-time forecasts, and accurately evaluating models using ERA5 requires care.

The growing interest in the application of machine learning to NWP, the wide adoption of ERA5 within the research community, and the nuances of the ERA5 dataset itself have led to a new machine learning-focused weather prediction benchmark called WeatherBench, and its successor WeatherBench 2 (Rasp et al., 2020, 2023). The WeatherBench datasets provide the reanalysis history for a curated set of weather variables extracted from ERA5 to aid in the training and comparison of various AI-NWP approaches. We will use this benchmark as a reference point for comparing the universe of AI-NWP models.

While machine learning methods have long been used for climatological and weather forecasting applications (Schultz et al., 2021), competitive AI-NWP approaches for global, medium-range weather prediction have only begun to emerge in the past five years. The earliest approaches to medium-range AI-NWP like those in Dueben & Bauer (2018), Weyn, Durran, & Caruana (2020), or Rasp & Thuerey (2021) were hampered by low resolution and data quality, resulting in models which were only marginally more skillful than forecasts based on climatological averages. To reach performance competitive with the ECMWF’s IFS ENS forecasts, AI-NWP approaches required the release of the 0.25°-resolution ERA5 dataset to provide the data fidelity to compete with IFS HRES. As we will see, these new methods also require the integration of the spatial, temporal, and physical inductive biases which more closely align with the fluid and thermodynamic laws which underlie the behavior of medium-range weather. At the moment, the implementation of these inductive biases in AI-NWP models comes in two architectural flavors: Graph Neural Networks and Transformers.

Graph Neural Networks (GNNs) are a convenient architecture for encoding spatial inductive biases given their dependence on the connectivity of an underlying network structure upon which their functional form is defined. These deep learning algorithms learn to represent complex interactions on an underlying network by performing parameterized message passing between adjacent nodes, updating parameters so as to minimize the discrepancy between the representations computed at each node and the true values observed at those nodes. This message passing can take many structural forms, but nearly all of these forms may be interpreted as the composition of a messaging operation between adjacent nodes, an aggregation of inbound messages into each node, and an update to the node representations based on these aggregated local messages (Veličković, 2022).
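The message, aggregate, update decomposition can be sketched in a few lines. The weights below are random stand-ins for learned parameters, and the shapes and ring topology are illustrative rather than those of any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

def mp_layer(h, edges, W_msg, W_upd):
    """One message-passing layer.
    h: (n_nodes, d) node features; edges: list of (src, dst) pairs."""
    messages = h @ W_msg                      # 1) message: transform sender state
    agg = np.zeros_like(h)
    for src, dst in edges:                    # 2) aggregate: sum inbound messages
        agg[dst] += messages[src]
    combined = np.concatenate([h, agg], axis=1)
    return np.tanh(combined @ W_upd)          # 3) update node representations

n, d = 5, 8
h = rng.normal(size=(n, d))                   # initial node (grid-point) states
edges = [(i, (i + 1) % n) for i in range(n)]  # a ring of grid points
W_msg = 0.1 * rng.normal(size=(d, d))         # stand-in for learned weights
W_upd = 0.1 * rng.normal(size=(2 * d, d))
h = mp_layer(h, edges, W_msg, W_upd)          # one round of neighbor exchange
```

In a trained model, `W_msg` and `W_upd` would be fit by gradient descent so that the updated node states predict the target quantities; the structural skeleton above stays the same.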

By stacking multiple message passing layers in sequence, the locally-defined GNN architecture begins to integrate more global information, as data from more distant nodes diffuses into the representations of each node’s immediate neighbors with each message, aggregate, and update iteration. Generally speaking, a GNN will require a number of message passing layers on the order of the graph’s *diameter* before information from any one node can reach every other node. The functional definition of GNNs imbues them with an implicit *spatial* inductive bias, assigning similar representations to nodes which are nearby within the graph (as measured by the number of hops along the underlying graph).
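The relationship between depth and reach can be checked directly: on a ring of 8 nodes (diameter 4), a signal placed at a single node becomes visible everywhere only after 4 rounds of neighbor aggregation. This toy sketch counts those rounds; the graph and signal are illustrative.

```python
import numpy as np

n = 8
A = np.zeros((n, n))
for i in range(n):                    # ring adjacency with self-loops
    A[i, i] = A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1.0

x = np.zeros(n)
x[0] = 1.0                            # a signal at a single grid point
hops = 0
while np.any(x == 0):                 # one aggregation round per GNN layer
    x = A @ x
    hops += 1
# hops == 4: the diameter of an 8-node ring
```

This is why deep stacks of layers (or coarser auxiliary meshes, as in the models below) are needed before a grid-based GNN can capture planetary-scale interactions.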

Given the overwhelming success of grid-based NWP since Richardson’s first pioneering efforts, it is reasonable to expect that a spatial inductive bias, a predisposition to process each grid square based on the information contained within the square’s neighbors, would lead to a family of predictive functions which are aligned with the spatially-determined dynamics which underlie weather patterns. Instead of solving for each grid point’s forecasted weather given its neighbors as in traditional NWP, GNN-based methods seek to predict the weather in the next time step given the node representations of the weather encoded within the neighborhood of each grid point.

The first GNN-based AI-NWP method to achieve performance nearing that of production-quality NWP was presented in Keisler (2022). A herculean single-author effort, Keisler’s model differentiates itself from prior work primarily through the substantially higher fidelity of the underlying grid data used during training and forecasting. The model is based on a grid size of approximately 110 km (1°) and incorporates interpolated ERA5 weather observation data for six variables across 13 vertical atmospheric pressure levels at each of the grid’s 65,160 nodes, resulting in orders of magnitude more fidelity than prior AI-NWP approaches up to that point.

Keisler aggregates hourly reanalysis data from 1950-2021 covering temperature, geopotential height, specific humidity, and three directional wind components across 13 pressure levels. After mapping all of this data to a 1° globe-spanning grid, one could at this point apply a GNN to this raw data to produce weather change forecasts by performing message passing between neighboring grid points, augmenting across each layer the (13 x 6 = 78)-dimensional node representations (plus pre-computed data like solar radiation, orography, land-sea mask, the day-of-year, and sine and cosine of latitude and longitude) derived from weather observations in recent time steps.

While this naive approach would likely yield some amount of predictive skill, it faces a major structural inefficiency. Because the Earth is approximately a sphere, a 1° (110km) grid measured at the equator would result in substantially smaller coverage areas as one moves north or south towards the poles. This grid irregularity biases the model’s predictive capacity towards the poles, as there would be many more weather observations measured near the poles, resulting in an outsized influence of polar regions on an RMSE-like loss function for global weather prediction. Because human population density is biased towards the equator and weather forecasts are primarily for human consumption, such a poleward bias would result in particularly sub-optimal performance in a production setting.
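The shrinkage is easy to quantify: the east-west width of an equal-angle grid cell scales with the cosine of latitude, so a 1° cell that spans roughly 111 km at the equator narrows to a few kilometers near the poles. A quick back-of-the-envelope check:

```python
import math

def cell_width_km(lat_deg, resolution_deg=1.0):
    """East-west width of a lat/lon grid cell at a given latitude."""
    earth_radius_km = 6371.0
    return (math.pi / 180.0) * resolution_deg * earth_radius_km \
        * math.cos(math.radians(lat_deg))

# A 1-degree cell spans ~111 km at the equator but shrinks toward the
# poles, so equal-angle grids oversample polar regions.
print(round(cell_width_km(0)))    # equator
print(round(cell_width_km(60)))   # mid-high latitude: half the width
print(round(cell_width_km(89)))   # near-pole: a few km
```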

Keisler resolves this structural inefficiency with a clever architectural decision. Instead of performing message passing directly on the original 1° latitude/longitude grid, Keisler first subsamples the grid down to a ~3° (~330km) *icosahedral mesh* resulting in a ~6,000-node grid which is distributed uniformly across the globe. To map raw weather observations on the original grid to the mesh, Keisler introduces an encoder GNN which connects each grid point to its closest spatial icosahedral mesh node in a bipartite manner. This encoder essentially provides a parameterized downsampling operation from the pole-biased latitude/longitude grid to a uniform icosahedral mesh. The architecture then performs a number of message passing operations using the latent node representations on the mesh before passing these representations through a decoder, a parameterized upsampling function which undoes the original encoder operations, mapping mesh information back onto the 1° grid resolution.

The architecture makes predictions about changes in weather from the input observation data 6 hours into the future. However, 6-hour forecasts are relatively easy to make, especially for traditional NWP systems. Ideally, one would like the architecture to also be able to produce competitive forecasts for days in advance. To achieve this, Keisler formulates a loss function which penalizes model predictions at each 6-hour time step of a rollout extending to 3 days, gaining longer-horizon forecasting skill to the slight detriment of near-term skill.
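A toy sketch of such a rollout objective follows; the linear "model" and the array shapes are illustrative stand-ins, not Keisler's actual implementation.

```python
import numpy as np

def rollout_loss(model, x0, targets):
    """Mean of per-step MSEs along an autoregressive rollout.

    model:   maps a weather state to the predicted state 6h later
    x0:      initial state, shape (d,)
    targets: ground-truth states at 6h, 12h, ..., shape (T, d)
    """
    x, loss = x0, 0.0
    for target in targets:          # e.g. T = 12 steps for a 3-day horizon
        x = model(x)                # feed the previous prediction back in
        loss += np.mean((x - target) ** 2)
    return loss / len(targets)

# Toy linear "model" standing in for the GNN; targets match it exactly,
# so the rollout loss is (numerically) zero.
A = np.eye(3) * 0.9
model = lambda x: A @ x
x0 = np.ones(3)
targets = np.stack([np.ones(3) * 0.9 ** (t + 1) for t in range(12)])
print(rollout_loss(model, x0, targets))
```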

This combination of keen architectural choices and extensive training data results in a model which was able to achieve forecasting performance noticeably more skilled and comprehensive than its predecessors. Although Keisler’s GNN approach took five and a half days to train, this training can be done on a single NVIDIA A100 GPU and, once trained, can produce a 5-day forecast in less than a second. Despite this success, this first attempt at a GNN-based AI-NWP model still lags behind the skill and resolution of the ECMWF’s IFS across multiple variables, pressure levels, and lead times.

In a section discussing data preprocessing, Keisler (2022) makes the following observation:

One useful feature of using message-passing GNNs is that we can encode the relative positions between nodes into the messages, so that a single model can learn from data at different resolutions. We took advantage of this by first training on 2-degree data for the first round of training and then switching to training on 1-degree data for the last two rounds. For reasons we do not understand, this produced better results than training on 1-degree data throughout.

The *GraphCast* model (Lam et al., 2023) capitalizes on this observation that multiple data resolutions benefit AI-NWP performance by scaling up the resolution and performance of Keisler’s GNN approach while adding additional layers to the icosahedral mesh. GraphCast is structured similarly to its predecessor, an encoder-GNN-decoder architecture, but builds on this model in a few crucial ways. These improvements have resulted in GraphCast being one of the most accurate medium-range AI-NWP models proposed to date, outperforming ECMWF IFS across a number of surface-level and atmospheric forecasting tasks, especially for forecasts within a 5-day time horizon.

The first major improvement introduced by the GraphCast model is its operational data scale. The DeepMind and Google Research-affiliated team were able to scale the observational grid resolution from Keisler’s 1° (110km) resolution to ERA5’s minimum 0.25° (28km) resolution, resulting in a base grid with over a million nodes and approaching the resolution of the highest-fidelity NWP model ECMWF HRES (0.1°). In addition to upsampling the data resolution, GraphCast also trains on 5 surface-level variables (2m temperature, 10m wind components, mean sea-level pressure, total precipitation) in addition to the atmospheric variables used in Keisler (2022). GraphCast also incorporates the atmospheric variables at 37 pressure levels, resulting in (5 + 6 * 37 = 227) weather observation variables per grid point. Note that while the model is trained on this collection of 227 variables, its performance is evaluated on a 69-variable subset corresponding to the variables covered in the WeatherBench and ECMWF Scorecard benchmarks.

Like in previous approaches, the output of GraphCast is a forecast of the change to each weather variable six hours in advance. GraphCast differs by not only taking the current weather observation as input, but also the weather from the previous forecast time, six hours prior. In other words, GraphCast uses information about the current weather and the weather state six hours prior to forecast the weather six hours into the future. These forecasts can then be rolled out to generate arbitrarily long weather state trajectories, two input states at a time.
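A minimal sketch of this two-state rollout loop is below; the `step_fn` signature and the toy linear-extrapolation rule are illustrative assumptions, not GraphCast's actual interface.

```python
def rollout(step_fn, x_prev, x_curr, num_steps):
    """Autoregressive rollout with a two-state input window.

    step_fn(x_prev, x_curr) -> predicted state 6h after x_curr
    (hypothetical signature, sliding the window forward each step)
    """
    states = []
    for _ in range(num_steps):
        x_next = step_fn(x_prev, x_curr)
        states.append(x_next)
        x_prev, x_curr = x_curr, x_next  # slide the two-state window
    return states

# Toy step function: the next state extrapolates the last two linearly.
step_fn = lambda a, b: 2 * b - a
traj = rollout(step_fn, 0.0, 1.0, num_steps=4)
print(traj)  # [2.0, 3.0, 4.0, 5.0]
```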

Perhaps the most innovative architectural feature introduced within GraphCast is the expansion of the icosahedral mesh which uniformly spans the globe into an icosahedral *multi-mesh*. That is, instead of performing the GNN’s message-passing operations on a single mesh whose data points uniformly span the globe at a single spacing scale, the GraphCast model performs simultaneous message-passing operations on a collection of seven icosahedral meshes of increasing granularity. The edges of all seven meshes, from the coarsest to the finest, are merged into a single graph, allowing information to flow *between* each mesh level in addition to flowing amongst edges *within* each level. This means nodes at the coarser mesh levels can serve as hubs for gathering and transmitting longer-range information within finer levels. This multi-mesh architecture’s success provides further empirical evidence towards Keisler’s original observation that weather prediction is better facilitated by including variable information at multiple spatial scales.
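A one-dimensional analogy illustrates why superimposing coarser edge sets helps: on a ring graph, adding longer-stride edges collapses the number of message-passing hops needed to connect distant nodes. The ring size and strides below are illustrative choices, not GraphCast's actual mesh parameters.

```python
from collections import deque

def hops(n, edge_strides, src, dst):
    """BFS hop count on a ring of n nodes whose edge set is the union of
    rings at several strides -- a 1-D stand-in for a multi-mesh."""
    adj = {i: set() for i in range(n)}
    for s in edge_strides:
        for i in range(n):
            adj[i].add((i + s) % n)
            adj[i].add((i - s) % n)
    dist, queue = {src: 0}, deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist[dst]

# Fine mesh only vs. fine + coarser meshes: long-range hops collapse.
print(hops(64, [1], 0, 32))          # 32 message-passing steps needed
print(hops(64, [1, 4, 16], 0, 32))   # 2 steps: two stride-16 edges
```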

The necessity of multi-scale message passing for accurate AI-NWP weather forecasting makes sense when one considers the peculiar influence of global weather patterns on local forecasts. Large-scale weather patterns like Rossby waves, ENSO, or atmospheric rivers all play a conditioning role when making local weather forecasts, even if these anomalies are measured thousands of kilometers away from the forecasted location of interest. While such teleconnections are generally resolvable locally, their global influence would be much more difficult to capture at a fine mesh resolution without an extensive number of message-passing operations. By introducing a multi-mesh which provides a shortcut route for integrating global weather information, the extra computational requirements, parameter costs, and oversmoothing risks introduced by excessive message passing may be avoided. Lam et al. (2023) provide some evidence towards this hypothesis by evaluating the performance of the GraphCast model without a multi-mesh, ablating all mesh levels but the finest.

The empirical success of Lam et al. (2023), especially in comparison to Keisler (2022), implies that performance of AI-NWP models may be drastically improved by incorporating more global information into the forecast at each grid point by message passing at multiple spatial resolutions. But what is stopping us from pushing this insight to its logical conclusion and defining a model which makes weather forecasts at each point on the Earth conditional on the observed state of the weather at every other location across the globe? Why not let gradient descent deduce from the training data which spatial scales are most relevant to making accurate forecasts for a given location?

The answer to these questions, in theory, is nothing. Transformers, the class of architectures which have achieved dominant performance across a variety of language and vision tasks, capture exactly this inductive bias through their attention mechanism. Applied to a grid of weather observations, a generic Transformer architecture may be interpreted as a GNN which learns from message-passing operations over an augmented grid which includes edges between all pairs of nodes. On this new fully-connected structure, the Transformer learns how to augment and update messages as they are passed between all nodes and, through attention, which nodes’ messages are most relevant for forecasting the weather at a particular location. With enough data, such a generic Transformer architecture would be able to infer, for example, whether the weather in New York City depends on the current weather in Tokyo, or whether just attending to observations made in Boston and Albany is sufficient to produce performant forecasts.

In practice, however, such a generic Transformer architecture is infeasible. Due to the attention mechanism, the computational and memory complexity of a generic Transformer architecture scales quadratically with the number of grid nodes, making it infeasible for use in high-resolution forecasts. For a grid with *n* nodes, each attention layer must compute and store on the order of *n*² pairwise interaction scores, and a 0.25° grid contains over a million nodes. Compounding this scaling problem, the fully-connected attention structure means a generic Transformer carries few native spatial inductive biases of its own.
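A back-of-the-envelope calculation shows why: treating each 0.25° grid point as a token, even a single attention score matrix would not fit in any GPU's memory.

```python
# Attention over n tokens materializes an n x n score matrix.
# One token per 0.25-degree grid point:
n = 721 * 1440                  # 1,038,240 grid points
scores = n * n                  # pairwise attention scores (single head)
bytes_fp16 = scores * 2         # two bytes per fp16 entry
print(f"{n:,} tokens -> {bytes_fp16 / 1e12:.1f} TB per attention matrix")
```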

This lack of native inductive biases in Transformers results in a tendency for these models to underperform non-Transformer models with more task-appropriate inductive biases in low-data regimes. Such underperformance, in conjunction with their inherent scaling issues, has led to the development of a large number of “efficient” transformer architectures (Tay, Dehghani, Bahri, & Metzler, 2022) which impose constraints on the global attention operation to improve computational and memory complexity while also biasing the model towards desirable solutions in low-data environments. These constraints, which primarily alter the manner in which tokens and their features are *mixed* by the architecture, have a massive effect on model performance.

Despite these fundamental complexities of the Transformer architecture, the prospect of learning directly from data which spatial biases are relevant for forecasting the weather at each point in space is indeed an enticing one. There have been a number of Transformer-based AI-NWP methods proposed recently which seek to capitalize on this architectural potential. We will detail a selection of these approaches in the next sections. As we shall see, these AI-NWP models distinguish themselves primarily by how they mix tokens and circumvent the quadratic complexity inherent to the Transformer architecture.

Kurth et al. (2023) proposed one of the first high-resolution AI-NWP approaches based on the Transformer architecture. The model, which they call *FourCastNet*, imagines weather on an underlying 0.25° observation grid as an image, substituting red, green, and blue color values for five surface and five atmospheric weather variable observations across four pressure levels at each pixel. FourCastNet then employs a vision Transformer (ViT) architecture to predict 6-hour changes in the weather based on ERA5 reanalysis data.

Instead of performing the quadratic attention operation over the large number of pixels in the raw image, ViT architectures traditionally act on images by first breaking the image into patches, interpreted as tokens, and performing this attention operation between these patches (Dosovitskiy et al., 2020). ViT approaches have achieved state-of-the-art performance in large-scale computer vision tasks in recent years due to their capacity to model global interactions between features across an entire image at each layer. This is in contrast to the more conventional convolutional approach which composes locally-defined functions within each layer to learn global features at deeper layers.

FourCastNet employs an efficient but relatively un-constrained mixing strategy by first transforming the collection of weather state patches into the frequency domain through the Discrete Fourier Transform (DFT). The DFT operation effectively mixes the tokens spatially, and multiplication of the resulting frequencies by a complex-valued weight matrix provides a weighted mixing of channels which may be interpreted as a global convolution operation. The specific architecture employed by FourCastNet (Guibas et al., 2021) also incorporates some regularization on the structure of this convolution to promote generalization. The Fourier signals are then transformed back to the spatial domain via the inverse DFT, resulting in a weather change prediction.
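A toy version of this Fourier mixing pipeline is sketched below, using identity frequency-domain weights so the round trip is checkable. The real AFNO-style operator in FourCastNet uses learned block-diagonal complex weights, a nonlinearity, and sparsity-promoting regularization; none of that is reproduced here.

```python
import numpy as np

def fourier_mix(x, W):
    """Global token mixing via the DFT (toy sketch).

    x: (tokens, channels) real-valued patch features
    W: (channels, channels) complex weights applied in frequency space
    """
    freq = np.fft.rfft(x, axis=0)    # mix across the token dimension
    freq = freq @ W                  # channel mixing == global convolution
    return np.fft.irfft(freq, n=x.shape[0], axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))
W = np.eye(16, dtype=complex)        # identity weights: output == input
out = fourier_mix(x, W)
print(np.allclose(out, x))           # DFT then inverse DFT round-trips
```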

The model undergoes two training phases: a pre-training phase which optimizes 6-hour forecast accuracy, followed by a fine-tuning phase which minimizes the error of two-step, 12-hour forecasts. Although FourCastNet mostly underperforms IFS on the few weather variables it predicts, the model was the first to prove that the Transformer architecture can be scaled to perform AI-NWP at high resolutions with a suitably-biased mixing strategy.

While FourCastNet proved that Transformers can be scaled to perform high-resolution weather forecasting, the global convolution operation itself is relatively void of inductive biases which could be used to better align the model’s behavior with the physical spatio-temporal constraints known to govern weather patterns. In the absence of a massive influx of additional historical weather data (unlikely to materialize), improved performance from Transformer-based AI-NWP models will likely be derived from the inclusion of more pertinent, physically-motivated inductive biases. Towards this end, Bi et al. (2023) introduce an Earth-specific, 3-dimensional Transformer architecture (3DEST) which underlies their *Pangu-Weather* AI-NWP model. Similar to FourCastNet, Pangu-Weather motivates the architecture by interpreting historical weather data as a time-stamped sequence of images. Motivated by the intuition that atmospheric height is just as informative to forecasts as spatial location, Pangu-Weather adds a third dimension to these images corresponding to the pressure level (atmospheric height) at each location in space and employs a 3-dimensional ViT to make these forecasts.

This approach addresses the quadratic complexity problem inherent to the Transformer architecture by first projecting the original 0.25° grid data to a collection of down-sampled patches: 3-dimensional cubes whose height spans atmospheric pressure levels and length and width span latitude and longitude along the Earth’s surface. For surface variables, which have no atmospheric height, these patches simplify to square-shaped coarsenings of the observation grid. These patches down-sample the grid by a factor of 4, and the atmospheric cubes also downsample the atmospheric pressure levels (13) by a factor of 2. Overall, this down-sampling takes the original 3-dimensional grid of size 13 (pressure) x 1,440 (latitude) x 721 (longitude) with 5 variables at each observation point to a down-sampled cube of size of 8 x 360 x 181 with 192 latent variable dimensions. The parameters of this downsampling procedure are learnable, taking the form of a fully-connected layer.

Once these down-sampled atmospheric and surface-level patches have been created, they are fed into a 16-layer, encoder-decoder Swin Transformer architecture (Liu et al., 2021). The Swin Transformer was originally proposed as a computer vision model, differing from a traditional ViT in its hierarchical merging of feature maps between layers and its application of attention within local windows of patches, as opposed to globally between all patches as in earlier ViTs. For ViTs, the complexity of the attention operation is heavily influenced by the number of tokens participating in the expensive self-attention operation. In a traditional ViT, this operation is performed between all patches of the input, identically across layers. The Swin Transformer architecture, by contrast, keeps this complexity growth sub-quadratic by introducing a hierarchical merging operation of patches between layers. The Swin architecture begins with a small window size, resulting in a highly localized computation in the first layer. As the data is passed between layers, the patches are merged into increasingly larger patches, which results in a hierarchical inclusion of patches from earlier layers into those of later layers, transferring data across patch boundaries in the process. In addition to reducing model complexity, this hierarchical processing has been shown to provide performance benefits in many vision domains where objects of interest in the input may emerge at numerous spatial scales.
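The complexity savings of windowed attention are easy to estimate by counting attention-score entries per layer. The window size below is an illustrative choice, not Pangu-Weather's actual configuration.

```python
# Attention cost: global ViT vs. Swin-style local windows,
# counted as attention-score matrix entries per layer.
h, w = 360, 181              # Pangu-Weather's down-sampled horizontal grid
window = 12                  # illustrative window side length (in patches)

global_cost = (h * w) ** 2                      # all token pairs
n_windows = (h // window) * (w // window)
window_cost = n_windows * (window ** 2) ** 2    # pairs *within* each window

print(f"windowed attention is ~{global_cost // window_cost}x cheaper")
```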

This local-to-global processing strategy aligns well with prior intuitions regarding the spatial emergence of weather patterns. By interpreting the down-sampled observation grid as a 3-dimensional image, the Swin Transformer backbone of Pangu-Weather’s 3DEST architecture produces weather predictions by first attending to the interactions of small, nearby regions across the globe before merging these effects into the calculation of variable interactions which incorporate a much larger and expansive collection of locales. Following these interaction-heavy Swin blocks, the model outputs a weather change forecast by upsampling the forecast back to its original grid fidelity using a fully-connected layer structured inversely to the first down-sampling layer. Bi et al. train this model to make 1-, 3-, 6-, or 24-hour forecasts after noting that days-long forecasts are easier to predict when using fewer autoregressive 24-hour rollout steps, as opposed to rolling out for more steps using the 1- or 3-hour models.

While the performance of Pangu-Weather lags behind other state-of-the-art models like GraphCast and FuXi, especially in the prediction of surface variables, the careful consideration of inductive biases required by the Transformer architecture, such as the local-to-global Swin hierarchical processing and the earth-specific positional encodings, makes Pangu-Weather an excellent example of how physical weather modeling constraints may be integrated into the Transformer architecture. These NWP-specific inductive biases make Pangu-Weather an appealing architectural foundation for future iterations of Transformer-derived AI-NWP approaches.

Chen et al. (2023) build on Pangu-Weather’s Swin architecture with the goal of extending the temporal horizon at which AI-NWP models achieve significant forecasting skill. Their FuXi model is structured as an integrated collection of three Transformer-based AI-NWP models, each trained to predict the weather at increasingly long time horizons. The hope is that, by integrating models trained explicitly for particular forecast lead times, FuXi might avoid the error accumulation and over-smoothing behavior encountered by models which are either trained on a particular lead time and rolled out for longer-horizon predictions or trained to minimize forecasting error across all horizons simultaneously.

All of the AI-NWP models discussed thus far have approached the temporal dimension of weather forecasting in a manner very similar to traditional NWP approaches. That is, each model is trained primarily to predict weather changes within some pre-specified time step. Longer-horizon forecasts are generated by autoregressively rolling out these step-wise predictions further into the future, feeding the previously-predicted weather states as input into each subsequent forecast step. While this approach can produce competitive medium-range forecasts, it is fundamentally prone to error accumulation as erroneous weather state predictions from previous time steps are used as the input state for future predictions. Because unconstrained AI-NWP models are prone to producing unrealistically smooth weather predictions, accumulated forecast errors at longer horizons can result in extremely unrealistic global weather forecasts.
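A toy illustration of this accumulation: even a model whose per-step dynamics are nearly correct drifts steadily from the true trajectory when its own outputs are fed back in. The dynamics and bias here are arbitrary illustrative choices.

```python
# A model with nearly-correct per-step dynamics (a small constant bias)
# drifts further from the true trajectory with every autoregressive step.
true_step = lambda x: 0.95 * x
model_step = lambda x: 0.95 * x + 0.01   # slightly biased learned dynamics

x_true, x_pred, errors = 1.0, 1.0, []
for _ in range(28):                      # 28 six-hour steps == 7 days
    x_true, x_pred = true_step(x_true), model_step(x_pred)
    errors.append(abs(x_pred - x_true))

print(f"error at 6h: {errors[0]:.3f}, at day 7: {errors[-1]:.3f}")
```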

Earlier AI-NWP approaches have addressed this problem of multi-temporal prediction fidelity in a variety of ways. For example, the GraphCast model was trained to explicitly minimize forecasting error across 12 forecast steps (3 days) through its primary training loss.
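Reconstructing from the description in Lam et al. (2023), with hedged notation (d₀ indexes forecast initialization times in the batch, τ the autoregressive step, j the variable and pressure-level pairs, and i the grid points), the objective takes approximately the form:

$$\mathcal{L} = \frac{1}{|D_{\text{batch}}|} \sum_{d_0 \in D_{\text{batch}}} \frac{1}{T_{\text{train}}} \sum_{\tau=1}^{T_{\text{train}}} \sum_{j} s_j \, w_j \, \frac{1}{|G|} \sum_{i \in G} a_i \left( \hat{x}^{\,d_0+\tau}_{i,j} - x^{\,d_0+\tau}_{i,j} \right)^2$$

where $s_j$ is a per-variable inverse-variance normalization, $w_j$ a per-variable loss weight, and $a_i$ a latitude-dependent area weight that counteracts the pole bias of the equal-angle grid.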

They follow a curriculum training schedule, first prioritizing 300,000 single-step gradient updates with a decaying learning rate before gradually incorporating longer autoregressive steps in the loss. However, this incorporation of longer lead times into the objective of a single model leads to a trade-off between near-term and long-term forecast accuracy. The GraphCast authors note,

We found that models trained with fewer [autoregressive] steps tended to trade longer for shorter lead time accuracy. These results suggest potential for combining multiple models with varying numbers of [autoregressive] steps, e.g., for short, medium and long lead times, to capitalize on their respective advantages across the entire forecast horizon (Lam et al., 2023).

This tradeoff between short-term and multi-day forecasting skill was also observed in Bi et al. (2023) which showed that, when producing a 7-day forecast, rolling out the Pangu-Weather model trained with a 24-hour time step 7 times is much more accurate than rolling out a 1-hour model 168 times.

This observation that models trained to make forecasts of noisy or chaotic systems at particular lead times can outperform their autoregressively-forecasted counterparts (Scher & Messori, 2019) provides the primary motivation for the FuXi model’s cascading architecture. The FuXi model itself is an updated and expanded version of Pangu-Weather’s Swin Transformer architecture. The base model is trained to make 6-hour weather predictions across time horizons of 0 to 5 days using a curriculum training regimen similar to that employed by the GraphCast model. The authors call this base model the “FuXi-Small” component and use it to make forecasts within a 0-5 day time horizon. This trained base model is then copied twice. The first copy is then fine-tuned to predict weather observations 5-10 days in advance, and the second copy 10-15 days in advance.
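A sketch of this cascaded rollout pattern follows. The stage hand-off logic is inferred from the description above; the toy models and the `steps_per_stage` value (20 six-hour steps per 5-day stage) are illustrative.

```python
def cascade_forecast(models, x, steps_per_stage=20):
    """Cascaded rollout: each model handles the horizon it was fine-tuned
    for, then hands its final state to the next stage (hypothetical sketch;
    20 six-hour steps == 5 days per stage)."""
    trajectory = []
    for model in models:                 # short-, medium-, long-range
        for _ in range(steps_per_stage):
            x = model(x)
            trajectory.append(x)
    return trajectory

# Toy stand-ins for the three fine-tuned copies
short = lambda x: x + 1
medium = lambda x: x + 2
long_range = lambda x: x + 3
traj = cascade_forecast([short, medium, long_range], x=0)
print(len(traj), traj[-1])  # 60 steps (15 days), final state 120
```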

This cascaded model architecture is effective. According to WeatherBench 2 benchmarks, FuXi outperforms IFS HRES at forecasting most variables and pressure levels across all lead times. FuXi also performs competitively with GraphCast on lead times up to 5 days and tends to outperform GraphCast at longer forecast horizons. The ability of FuXi to produce skillful forecasts up to 15 days in advance represents a significant achievement for AI-NWP, especially since 15 days is considered to be the current intrinsic predictive limit for dynamical models of the weather (Zhang et al., 2019).

Machine learning-based weather prediction has made significant progress in the past five years towards matching—and in many cases exceeding—the performance of the best traditional NWP models in existence. This rapid growth in data-driven forecasting has been well-received by the NWP community, leading to experimental operationalization of FourCastNet, GraphCast, and Pangu-Weather by the ECMWF. Despite this recent success, the relative nascency of AI-NWP means there are still a number of potential advances on the horizon which, if implemented, would greatly improve the performance and reliability of AI-NWP forecasting.

One of the major deficiencies in AI-NWP models is their lack of calibrated uncertainty estimates. The AI-NWP models discussed in this review are trained to make point predictions of the weather based on a mean-squared-error loss to the observed weather at each historical space-time point. This training regimen forces AI-NWP models to produce deterministic weather predictions which average over any uncertainty present in the underlying predictions. This behavior is in contrast to traditional NWP models which are designed to output forecast *probabilities* by representing uncertainty in variable observations as part of the model input and propagating this uncertainty through the predicted weather dynamics.

By averaging over forecast uncertainty, AI-NWP models tend to produce unrealistic forecasts at longer time horizons that are blurry with respect to space and pressure level. This behavior is especially problematic if one wishes to extend AI-NWP methods for sub-seasonal or climatological forecasting, as these model predictions will tend towards climatological means, masking potential changes in severe weather event probabilities in the process. While the fast inference speed of AI-NWP models allows one to synthesize probabilistic forecasts by generating a distribution of weather predictions based on repeated perturbation of the observational model inputs, these models’ lack of physical constraints means we typically cannot interpret the resulting forecast probabilities as equivalent to those produced by traditional means. This is also a situation in which the black-box nature of these AI-NWP models matters. While traditional NWP is still prone to making serious forecast errors, these errors are at least bound by a physically-meaningful set of equations which can be inspected to locate the source of errors, which are typically a result of adverse initial conditions. For AI-NWP methods, however, deriving an explanation for erroneous forecasts becomes much more difficult, as the error in each forecast is a function of uncertainty in the initial conditions in conjunction with, technically, the entirety of the training dataset.
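A minimal sketch of this perturbed-initial-conditions approach is below; all names and the noise scale are illustrative, and, as noted above, the resulting spread is not a calibrated probability.

```python
import numpy as np

def perturbed_ensemble(model, x0, n_members=8, sigma=0.01, seed=0):
    """Crude probabilistic forecast: perturb the initial state and run the
    deterministic model once per member. The spread is *not* calibrated
    the way a physically-constrained ensemble's would be."""
    rng = np.random.default_rng(seed)
    members = [model(x0 + sigma * rng.normal(size=x0.shape))
               for _ in range(n_members)]
    members = np.stack(members)
    return members.mean(axis=0), members.std(axis=0)  # mean + spread

model = lambda x: 0.5 * x           # toy deterministic "forecast"
mean, spread = perturbed_ensemble(model, np.zeros(4))
print(mean.shape, spread.shape)     # (4,) (4,)
```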

Moving from deterministic to probabilistic forecasting represents a major next step for AI-NWP methods. While there has been some early work in applying deep learning models to predict unresolved variables within traditional NWP solvers (Kochkov et al., 2023), pure AI-NWP approaches that are able to robustly quantify forecast uncertainty have yet to be proposed. Uncertainty quantification is a growing area of research within the deep learning literature (Abdar et al., 2021), so expect this methodological gap to be bridged relatively quickly.

One of the major advantages AI-NWP enjoys over traditional NWP stems from its forecast efficiency. Once trained, AI-NWP models can produce forecasts in a matter of seconds, often using only a single GPU in the process. While the training phase for these models is much more resource-intensive, this processing can be done periodically in the background, representing something closer to an R&D cost than an actual inference cost. AI-NWP should be able to further exploit this massive discrepancy in forecasting complexity to increase overall forecast accuracy by integrating forecasts from a diversity of models trained for more specific forecasting tasks.

As we observed in our discussion of the FuXi architecture, AI-NWP models are forced to trade near-term forecasting accuracy for long-term forecasting accuracy if the same model is used to predict both time horizons. The FuXi model addresses this tradeoff by training three models which span near-term (0-5 days), medium-term (5-10 days), and long-term (10-15 days) forecast horizons while minimally impacting the inference complexity in comparison to IFS HRES.

Extrapolating from this performance, it is reasonable to expect additional model temporal discretization to produce more accurate forecasts. This discretization could be applied analogously to the spatial domain to generate a family of models which more accurately predict the weather in different regions of the globe. Or we could apply this discretization variable-wise, treating the prediction of each weather variable as its own predictive task, resulting in a family of models which each specialize in predicting a particular subset of weather-related variables (Chen et al., 2023).

Given NWP’s relationship between observation grid scale and forecast accuracy, increasing the grid fidelity for AI-NWP models represents a direct route for furthering data-driven forecast accuracy. Because hourly weather data has a historical lower bound around the 1940s and an upper bound of the present, the most direct route for AI-NWP models to access deep learning’s return on data scale is to scale spatially by further discretizing the observation grid. The models discussed in this review all used a 0.25° grid resolution or coarser, and further coarsened this grid through a down-sampling step at the beginning of the architecture. IFS HRES, by contrast, forecasts on a 0.1° grid, providing over six times as many observations in space (2.5 times as many along each horizontal dimension). This discrepancy in grid resolution is primarily due to the ERA5 reanalysis dataset’s inherent 0.25° resolution. While it’s likely that future reanalysis datasets will increase resolution (Munoz-Sabater et al., 2021), the real test for AI-NWP will be to make the most out of this resolution while at the same time avoiding overfitting to the increased resolution and interpolation inherent to reanalysis.