Generative AI to quantify uncertainty in weather forecasting - Related to learning, importance, ai, data, quantify
Exphormer: Scaling transformers for graph-structured data

Graphs, in which objects and their relations are represented as nodes (or vertices) and edges (or links) between pairs of nodes, are ubiquitous in computing and machine learning (ML). For example, social networks, road networks, and molecular structure and interactions are all domains in which underlying datasets have a natural graph structure. ML can be used to learn the properties of nodes, edges, or entire graphs.
A common approach to learning on graphs are graph neural networks (GNNs), which operate on graph data by applying an optimizable transformation on node, edge, and global attributes. The most typical class of GNNs operates via a message-passing framework, whereby each layer aggregates the representation of a node with those of its immediate neighbors.
not long ago, graph transformer models have emerged as a popular alternative to message-passing GNNs. These models build on the success of Transformer architectures in natural language processing (NLP), adapting them to graph-structured data. The attention mechanism in graph transformers can be modeled by an interaction graph, in which edges represent pairs of nodes that attend to each other. Unlike message passing architectures, graph transformers have an interaction graph that is separate from the input graph. The typical interaction graph is a complete graph, which signifies a full attention mechanism that models direct interactions between all pairs of nodes. However, this creates quadratic computational and memory bottlenecks that limit the applicability of graph transformers to datasets on small graphs with at most a few thousand nodes. Making graph transformers scalable has been considered one of the most significant research directions in the field (see the first open problem here).
A natural remedy is to use a sparse interaction graph with fewer edges. Many sparse and efficient transformers have been proposed to eliminate the quadratic bottleneck for sequences, however, they do not generally extend to graphs in a principled manner.
In “Exphormer: Sparse Transformers for Graphs”, presented at ICML 2023, we address the scalability challenge by introducing a sparse attention framework for transformers that is designed specifically for graph data. The Exphormer framework makes use of expander graphs, a powerful tool from spectral graph theory, and is able to achieve strong empirical results on a wide variety of datasets. Our implementation of Exphormer is now available on GitHub.
A key idea at the heart of Exphormer is the use of expander graphs, which are sparse yet well-connected graphs that have some useful properties — 1) the matrix representation of the graphs have similar linear-algebraic properties as a complete graph, and 2) they exhibit rapid mixing of random walks, [website], a small number of steps in a random walk from any starting node is enough to ensure convergence to a “stable” distribution on the nodes of the graph. Expanders have found applications to diverse areas, such as algorithms, pseudorandomness, complexity theory, and error-correcting codes.
A common class of expander graphs are d-regular expanders, in which there are d edges from every node ([website], every node has degree d). The quality of an expander graph is measured by its spectral gap, an algebraic property of its adjacency matrix (a matrix representation of the graph in which rows and columns are indexed by nodes and entries indicate whether pairs of nodes are connected by an edge). Those that maximize the spectral gap are known as Ramanujan graphs — they achieve a gap of d - 2*√(d-1), which is essentially the best possible among d-regular graphs. A number of deterministic and randomized constructions of Ramanujan graphs have been proposed over the years for various values of d. We use a randomized expander construction of Friedman, which produces near-Ramanujan graphs.
Expander graphs are at the heart of Exphormer. A good expander is sparse yet exhibits rapid mixing of random walks, making its global connectivity suitable for an interaction graph in a graph transformer model.
Exphormer replaces the dense, fully-connected interaction graph of a standard Transformer with edges of a sparse d-regular expander graph. Intuitively, the spectral approximation and mixing properties of an expander graph allow distant nodes to communicate with each other after one stacks multiple attention layers in a graph transformer architecture, even though the nodes may not attend to each other directly. Furthermore, by ensuring that d is constant (independent of the size of the number of nodes), we obtain a linear number of edges in the resulting interaction graph.
Exphormer: Constructing a sparse interaction graph.
Exphormer combines expander edges with the input graph and virtual nodes. More specifically, the sparse attention mechanism of Exphormer builds an interaction graph consisting of three types of edges:
Edges from the input graph (local attention).
Edges from a constant-degree expander graph (expander attention).
Edges from every node to a small set of virtual nodes (global attention).
Exphormer builds an interaction graph by combining three types of edges. The resulting graph has good connectivity properties and retains the inductive bias of the input dataset graph while still remaining sparse.
Each component serves a specific purpose: the edges from the input graph retain the inductive bias from the input graph structure (which typically gets lost in a fully-connected attention module). Meanwhile, expander edges allow good global connectivity and random walk mixing properties (which spectrally approximate the complete graph with far fewer edges). Finally, virtual nodes serve as global “memory sinks” that can directly communicate with every node. While this results in additional edges from each virtual node equal to the number of nodes in the input graph, the resulting graph is still sparse. The degree of the expander graph and the number of virtual nodes are hyperparameters to tune for improving the quality metrics.
Furthermore, since we use an expander graph of constant degree and a small constant number of virtual nodes for the global attention, the resulting sparse attention mechanism is linear in the size of the original input graph, [website], it models a number of direct interactions on the order of the total number of nodes and edges.
We additionally show that Exphormer is as expressive as the dense transformer and obeys universal approximation properties. In particular, when the sparse attention graph of Exphormer is augmented with self loops (edges connecting a node to itself), it can universally approximate continuous functions [1, 2].
Relation to sparse Transformers for sequences.
It is interesting to compare Exphormer to sparse attention methods for sequences. Perhaps the architecture most conceptually similar to our approach is BigBird, which builds an interaction graph by combining different components. BigBird also uses virtual nodes, but, unlike Exphormer, it uses window attention and random attention from an Erdős-Rényi random graph model for the remaining components.
Window attention in BigBird looks at the tokens surrounding a token in a sequence — the local neighborhood attention in Exphormer can be viewed as a generalization of window attention to graphs.
The Erdős-Rényi graph on n nodes, G(n, p), which connects every pair of nodes independently with probability p, also functions as an expander graph for suitably high p. However, a superlinear number of edges (Ω(n log n)) is needed to ensure that an Erdős-Rényi graph is connected, let alone a good expander. On the other hand, the expanders used in Exphormer have only a linear number of edges.
Earlier works have shown the use of full graph Transformer-based models on datasets with graphs of size up to 5,000 nodes. To evaluate the performance of Exphormer, we build upon the celebrated GraphGPS framework [3], which combines both message passing and graph transformers and achieves state-of-the-art performance on a number of datasets. We show that replacing dense attention with Exphormer for the graph attention component in the GraphGPS framework allows one to achieve models with comparable or advanced performance, often with fewer trainable parameters.
Furthermore, Exphormer notably allows graph transformer architectures to scale well beyond the usual graph size limits mentioned above. Exphormer can scale up to datasets of 10,000+ node graphs, such as the Coauthor dataset, and even beyond to larger graphs such as the well-known ogbn-arxiv dataset, a citation network, which consists of 170K nodes and [website] million edges.
Results comparing Exphormer to standard GraphGPS on the five Long Range Graph Benchmark datasets. We note that Exphormer achieved state-of-the-art results on four of the five datasets (PascalVOC-SP, COCO-SP, Peptides-Struct, PCQM-Contact) at the time of the paper’s publication.
Finally, we observe that Exphormer, which creates an overlay graph of small diameter via expanders, exhibits the ability to effectively learn long-range dependencies. The Long Range Graph Benchmark is a suite of five graph learning datasets designed to measure the ability of models to capture long-range interactions. Results show that Exphormer-based models outperform standard GraphGPS models (which were previously state-of-the-art on four out of five datasets at the time of publication).
Graph transformers have emerged as an essential architecture for ML that adapts the highly successful sequence-based transformers used in NLP to graph-structured data. Scalability has, however, proven to be a major challenge in enabling the use of graph transformers on datasets with large graphs. In this post, we have presented Exphormer, a sparse attention framework that uses expander graphs to improve scalability of graph transformers. Exphormer is shown to have essential theoretical properties and exhibit strong empirical performance, particularly on datasets where it is crucial to learn long range dependencies. For more information, we point the reader to a short presentation video from ICML 2023.
We thank our research collaborators Hamed Shirzad and Danica J. Sutherland from The University of British Columbia as well as Ali Kemal Sinop from Google Research. Special thanks to Tom Small for creating the animation used in this post.
In “Domain-specific optimization and diverse evaluation of self-supervised models for histopathology”, we showed that self-supervised...
Floods are the most common natural disaster, and are responsible for roughly $50 billion in annual financial damages worldwide. The rate of flood-rela...
Here is a common scenario : An A/B test was conducted, where a random sample of units ([website] end-customers) were selected for a cam...
Generative AI to quantify uncertainty in weather forecasting

In December 1972, at the American Association for the Advancement of Science meeting in Washington, [website], MIT meteorology professor Ed Lorenz gave a talk entitled, “Does the Flap of a Butterfly's Wings in Brazil Set Off a Tornado in Texas?”, which contributed to the term “butterfly effect”. He was building on his earlier, landmark 1963 paper where he examined the feasibility of “very-long-range weather prediction” and described how errors in initial conditions grow exponentially when integrated in time with numerical weather prediction models. This exponential error growth, known as chaos, results in a deterministic predictability limit that restricts the use of individual forecasts in decision making, because they do not quantify the inherent uncertainty of weather conditions. This is particularly problematic when forecasting extreme weather events, such as hurricanes, heatwaves, or floods.
Recognizing the limitations of deterministic forecasts, weather agencies around the world issue probabilistic forecasts. Such forecasts are based on ensembles of deterministic forecasts, each of which is generated by including synthetic noise in the initial conditions and stochasticity in the physical processes. Leveraging the fast error growth rate in weather models, the forecasts in an ensemble are purposefully different: the initial uncertainties are tuned to generate runs that are as different as possible and the stochastic processes in the weather model introduce additional differences during the model run. The error growth is mitigated by averaging all the forecasts in the ensemble and the variability in the ensemble of forecasts quantifies the uncertainty of the weather conditions.
While effective, generating these probabilistic forecasts is computationally costly. They require running highly complex numerical weather models on massive supercomputers multiple times. Consequently, many operational weather forecasts can only afford to generate ~10–50 ensemble members for each forecast cycle. This is a problem for individuals concerned with the likelihood of rare but high-impact weather events, which typically require much larger ensembles to assess beyond a few days. For instance, one would need a 10,000-member ensemble to forecast the likelihood of events with 1% probability of occurrence with a relative error less than 10%. Quantifying the probability of such extreme events could be useful, for example, for emergency management preparation or for energy traders.
Audio Processing is one of the most crucial application domains of digital signal processing (DSP) and machine learning. Modeling acoustic environme...
Due to the complexity of understanding and solving various tasks solely using instructions, the size of multi-task LLMs typically spans from several b...
AutoBNN is based on a line of research that over the past decade has yielded improved predictive accuracy by modeling time series using GPs with learn...
Learning the importance of training data under concept drift

The constantly changing nature of the world around us poses a significant challenge for the development of AI models. Often, models are trained on longitudinal data with the hope that the training data used will accurately represent inputs the model may receive in the future. More generally, the default assumption that all training data are equally relevant often breaks in practice. For example, the figure below presents images from the CLEAR nonstationary learning benchmark, and it illustrates how visual attributes of objects evolve significantly over a 10 year span (a phenomenon we refer to as slow concept drift), posing a challenge for object categorization models.
Sample images from the CLEAR benchmark. (Adapted from Lin et al.).
Alternative approaches, such as online and continual learning, repeatedly enhancement a model with small amounts of recent data in order to keep it current. This implicitly prioritizes recent data, as the learnings from past data are gradually erased by subsequent updates. However in the real world, different kinds of information lose relevance at different rates, so there are two key issues: 1) By design they focus exclusively on the most recent data and lose any signal from older data that is erased. 2) Contributions from data instances decay uniformly over time irrespective of the contents of the data.
In our recent work, “Instance-Conditional Timescales of Decay for Non-Stationary Learning”, we propose to assign each instance an importance score during training in order to maximize model performance on future data. To accomplish this, we employ an auxiliary model that produces these scores using the training instance as well as its age. This model is jointly learned with the primary model. We address both the above challenges and achieve significant gains over other robust learning methods on a range of benchmark datasets for nonstationary learning. For instance, on a recent large-scale benchmark for nonstationary learning (~39M photos over a 10 year period), we show up to 15% relative accuracy gains through learned reweighting of training data.
The challenge of concept drift for supervised learning.
To gain quantitative insight into slow concept drift, we built classifiers on a recent photo categorization task, comprising roughly 39M photographs sourced from social media websites over a 10 year period. We compared offline training, which iterated over all the training data multiple times in random order, and continual training, which iterated multiple times over each month of data in sequential (temporal) order. We measured model accuracy both during the training period and during a subsequent period where both models were frozen, [website], not updated further on new data (shown below). At the end of the training period (left panel, x-axis = 0), both approaches have seen the same amount of data, but show a large performance gap. This is due to catastrophic forgetting, a problem in continual learning where a model’s knowledge of data from early on in the training sequence is diminished in an uncontrolled manner. On the other hand, forgetting has its advantages — over the test period (shown on the right), the continual trained model degrades much less rapidly than the offline model because it is less dependent on older data. The decay of both models’ accuracy in the test period is confirmation that the data is indeed evolving over time, and both models become increasingly less relevant.
Comparing offline and continually trained models on the photo classification task.
Time-sensitive reweighting of training data.
We design a method combining the benefits of offline learning (the flexibility of effectively reusing all available data) and continual learning (the ability to downplay older data) to address slow concept drift. We build upon offline learning, then add careful control over the influence of past data and an optimization objective, both designed to reduce model decay in the future.
Suppose we wish to train a model, M, given some training data collected over time. We propose to also train a helper model that assigns a weight to each point based on its contents and age. This weight scales the contribution from that data point in the training objective for M. The objective of the weights is to improve the performance of M on future data.
In our work, we describe how the helper model can be meta-learned, [website], learned alongside M in a manner that helps the learning of the model M itself. A key design choice of the helper model is that we separated out instance- and age-related contributions in a factored manner. Specifically, we set the weight by combining contributions from multiple different fixed timescales of decay, and learn an approximate “assignment” of a given instance to its most suited timescales. We find in our experiments that this form of the helper model outperforms many other alternatives we considered, ranging from unconstrained joint functions to a single timescale of decay (exponential or linear), due to its combination of simplicity and expressivity. Full details may be found in the paper.
The top figure below reveals that our learned helper model indeed up-weights more modern-looking objects in the CLEAR object recognition challenge; older-looking objects are correspondingly down-weighted. On closer examination (bottom figure below, gradient-based feature importance assessment), we see that the helper model focuses on the primary object within the image, as opposed to, [website], background capabilities that may spuriously be correlated with instance age.
Sample images from the CLEAR benchmark (camera & computer categories) assigned the highest and lowest weights respectively by our helper model.
Feature importance analysis of our helper model on sample images from the CLEAR benchmark.
We first study the large-scale photo categorization task (PCAT) on the YFCC100M dataset discussed earlier, using the first five years of data for training and the next five years as test data. Our method (shown in red below) improves substantially over the no-reweighting baseline (black) as well as many other robust learning techniques. Interestingly, our method deliberately trades off accuracy on the distant past (training data unlikely to reoccur in the future) in exchange for marked improvements in the test period. Also, as desired, our method degrades less than other baselines in the test period.
Comparison of our method and relevant baselines on the PCAT dataset.
We validated our findings on a wide range of nonstationary learning challenge datasets sourced from the academic literature (see 1, 2, 3, 4 for details) that spans data findings and modalities (photos, satellite images, social media text, medical records, sensor readings, tabular data) and sizes (ranging from 10k to 39M instances). We study significant gains in the test period when compared to the nearest (shown below). Note that the previous best-known method may be different for each dataset. These results showcase the broad applicability of our approach.
Performance gain of our method on a variety of tasks studying natural concept drift. Our reported gains are over the previous best-known method for each dataset.
Finally, we consider an interesting extension of our work. The work above described how offline learning can be extended to handle concept drift using ideas inspired by continual learning. However, sometimes offline learning is infeasible — for example, if the amount of training data available is too large to maintain or process. We adapted our approach to continual learning in a straightforward manner by applying temporal reweighting within the context of each bucket of data being used to sequentially improvement the model. This proposal still retains some limitations of continual learning, [website], model updates are performed only on most-recent data, and all optimization decisions (including our reweighting) are only made over that data. Nevertheless, our approach consistently beats regular continual learning as well as a wide range of other continual learning algorithms on the photo categorization benchmark (see below). Since our approach is complementary to the ideas in many baselines compared here, we anticipate even larger gains when combined with them.
Results of our method adapted to continual learning, compared to the latest baselines.
We addressed the challenge of data drift in learning by combining the strengths of previous approaches — offline learning with its effective reuse of data, and continual learning with its emphasis on more recent data. We hope that our work helps improve model robustness to concept drift in practice, and generates increased interest and new ideas in addressing the ubiquitous problem of slow concept drift.
We thank Mike Mozer for many interesting discussions in the early phase of this work, as well as very helpful advice and feedback during its development.
Floods are the most common natural disaster, and are responsible for roughly $50 billion in annual financial damages worldwide. The rate of flood-rela...
As part of his global tour, OpenAI CEO Sam Altman is in Delhi today for the business’s DevDay, joined by India’s IT minister Ashwini Vaishnaw and OpenA...
Après l’incident qu’a fait objet DeepSeek la semaine dernière, les chercheurs affirment avoir trouvé un moyen de contourner les mesures de protection ...
Market Impact Analysis
Market Growth Trend
2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 |
---|---|---|---|---|---|---|
23.1% | 27.8% | 29.2% | 32.4% | 34.2% | 35.2% | 35.6% |
Quarterly Growth Rate
Q1 2024 | Q2 2024 | Q3 2024 | Q4 2024 |
---|---|---|---|
32.5% | 34.8% | 36.2% | 35.6% |
Market Segments and Growth Drivers
Segment | Market Share | Growth Rate |
---|---|---|
Machine Learning | 29% | 38.4% |
Computer Vision | 18% | 35.7% |
Natural Language Processing | 24% | 41.5% |
Robotics | 15% | 22.3% |
Other AI Technologies | 14% | 31.8% |
Technology Maturity Curve
Different technologies within the ecosystem are at varying stages of maturity:
Competitive Landscape Analysis
Company | Market Share |
---|---|
Google AI | 18.3% |
Microsoft AI | 15.7% |
IBM Watson | 11.2% |
Amazon AI | 9.8% |
OpenAI | 8.4% |
Future Outlook and Predictions
The Data Exphormer Scaling landscape is evolving rapidly, driven by technological advancements, changing threat vectors, and shifting business requirements. Based on current trends and expert analyses, we can anticipate several significant developments across different time horizons:
Year-by-Year Technology Evolution
Based on current trajectory and expert analyses, we can project the following development timeline:
Technology Maturity Curve
Different technologies within the ecosystem are at varying stages of maturity, influencing adoption timelines and investment priorities:
Innovation Trigger
- Generative AI for specialized domains
- Blockchain for supply chain verification
Peak of Inflated Expectations
- Digital twins for business processes
- Quantum-resistant cryptography
Trough of Disillusionment
- Consumer AR/VR applications
- General-purpose blockchain
Slope of Enlightenment
- AI-driven analytics
- Edge computing
Plateau of Productivity
- Cloud infrastructure
- Mobile applications
Technology Evolution Timeline
- Improved generative models
- specialized AI applications
- AI-human collaboration systems
- multimodal AI platforms
- General AI capabilities
- AI-driven scientific breakthroughs
Expert Perspectives
Leading experts in the ai tech sector provide diverse perspectives on how the landscape will evolve over the coming years:
"The next frontier is AI systems that can reason across modalities and domains with minimal human guidance."
— AI Researcher
"Organizations that develop effective AI governance frameworks will gain competitive advantage."
— Industry Analyst
"The AI talent gap remains a critical barrier to implementation for most enterprises."
— Chief AI Officer
Areas of Expert Consensus
- Acceleration of Innovation: The pace of technological evolution will continue to increase
- Practical Integration: Focus will shift from proof-of-concept to operational deployment
- Human-Technology Partnership: Most effective implementations will optimize human-machine collaboration
- Regulatory Influence: Regulatory frameworks will increasingly shape technology development
Short-Term Outlook (1-2 Years)
In the immediate future, organizations will focus on implementing and optimizing currently available technologies to address pressing ai tech challenges:
- Improved generative models
- specialized AI applications
- enhanced AI ethics frameworks
These developments will be characterized by incremental improvements to existing frameworks rather than revolutionary changes, with emphasis on practical deployment and measurable outcomes.
Mid-Term Outlook (3-5 Years)
As technologies mature and organizations adapt, more substantial transformations will emerge in how security is approached and implemented:
- AI-human collaboration systems
- multimodal AI platforms
- democratized AI development
This period will see significant changes in security architecture and operational models, with increasing automation and integration between previously siloed security functions. Organizations will shift from reactive to proactive security postures.
Long-Term Outlook (5+ Years)
Looking further ahead, more fundamental shifts will reshape how cybersecurity is conceptualized and implemented across digital ecosystems:
- General AI capabilities
- AI-driven scientific breakthroughs
- new computing paradigms
These long-term developments will likely require significant technical breakthroughs, new regulatory frameworks, and evolution in how organizations approach security as a fundamental business function rather than a technical discipline.
Key Risk Factors and Uncertainties
Several critical factors could significantly impact the trajectory of ai tech evolution:
Organizations should monitor these factors closely and develop contingency strategies to mitigate potential negative impacts on technology implementation timelines.
Alternative Future Scenarios
The evolution of technology can follow different paths depending on various factors including regulatory developments, investment trends, technological breakthroughs, and market adoption. We analyze three potential scenarios:
Optimistic Scenario
Responsible AI driving innovation while minimizing societal disruption
Key Drivers: Supportive regulatory environment, significant research breakthroughs, strong market incentives, and rapid user adoption.
Probability: 25-30%
Base Case Scenario
Incremental adoption with mixed societal impacts and ongoing ethical challenges
Key Drivers: Balanced regulatory approach, steady technological progress, and selective implementation based on clear ROI.
Probability: 50-60%
Conservative Scenario
Technical and ethical barriers creating significant implementation challenges
Key Drivers: Restrictive regulations, technical limitations, implementation challenges, and risk-averse organizational cultures.
Probability: 15-20%
Scenario Comparison Matrix
Factor | Optimistic | Base Case | Conservative |
---|---|---|---|
Implementation Timeline | Accelerated | Steady | Delayed |
Market Adoption | Widespread | Selective | Limited |
Technology Evolution | Rapid | Progressive | Incremental |
Regulatory Environment | Supportive | Balanced | Restrictive |
Business Impact | Transformative | Significant | Modest |
Transformational Impact
Redefinition of knowledge work, automation of creative processes. This evolution will necessitate significant changes in organizational structures, talent development, and strategic planning processes.
The convergence of multiple technological trends—including artificial intelligence, quantum computing, and ubiquitous connectivity—will create both unprecedented security challenges and innovative defensive capabilities.
Implementation Challenges
Ethical concerns, computing resource limitations, talent shortages. Organizations will need to develop comprehensive change management strategies to successfully navigate these transitions.
Regulatory uncertainty, particularly around emerging technologies like AI in security applications, will require flexible security architectures that can adapt to evolving compliance requirements.
Key Innovations to Watch
Multimodal learning, resource-efficient AI, transparent decision systems. Organizations should monitor these developments closely to maintain competitive advantages and effective security postures.
Strategic investments in research partnerships, technology pilots, and talent development will position forward-thinking organizations to leverage these innovations early in their development cycle.
Technical Glossary
Key technical terms and definitions to help understand the technologies discussed in this article.
Understanding the following technical concepts is essential for grasping the full implications of the security threats and defensive measures discussed in this article. These definitions provide context for both technical and non-technical readers.