How LLMs Work: Reinforcement Learning, RLHF, DeepSeek R1, OpenAI o1, AlphaGo
4 ways to get your business ready for the agentic AI revolution

Experts suggest agentic AI will redefine business workflows during the rest of this decade. Consultancy Accenture suggests that agents -- not people -- will be the primary customers of enterprise digital systems by 2030.
As many as 93% of IT leaders intend to introduce AI agents within the next two years. However, despite all the hype around agentic AI, business leaders told ZDNET the path to an agentic transformation is far from straightforward, and many challenges must be overcome.
Also: 25% of enterprises using AI will deploy AI agents by 2025.
Here are four ways your business can get ready for AI agents.
James Fleming, CIO at the Francis Crick Institute, noted his organization is testing agents and experimenting with Meta's Llama 3 model.
"We're looking at various data sets to see if there's any utility we can derive from the technology," he noted, suggesting agents can help with literature synthesis.
"The amount of research . Staying ahead of your field is a full-time job. Synthesizing the last 25 papers in a discipline into something readable and informative means agents could often be quite good at that activity."
Also: AI agents will significantly improve employee productivity.
Fleming told ZDNET that initial explorations into agents have shown that automation can be challenging, especially in a world-leading research organization.
"That issue goes back to scientific rigor. You've got to be damn sure it's doing something useful," he mentioned.
"Otherwise, it's just another mechanism for generating false leads. So, to get over that hill of 'this is a genuinely useful tool' is a high one."
Fleming said the use case is everything. There is a big difference between using agents for diary planning and life-saving research, and humans must be kept in the loop.
"That's not to say there's no gain with the technology," he noted. "It's just targeting agents specifically within the research life cycle -- and not at the stage where it can supplant human thought and creativity."
Carrie Jordan, Microsoft's global director of program execution, noted agents and bots are exciting innovations.
"The ability to connect them in the background, to talk to each other. And to achieve multiple tasks versus just a single thread, is potentially powerful," she revealed.
Jordan explained to ZDNET how the sales proposals team at Microsoft is exploring Copilot Studio technology as part of a Center of Excellence (CoE).
"They're getting curious," she mentioned. "Agents are still relatively new, so we're figuring out the best way to leverage that technology. But there's a lot of potential there. I think it will be a game-changer."
Also: AI agents will match 'good mid-level' engineers this year, says Mark Zuckerberg.
However, while Microsoft is pursuing a range of AI-led projects internally and for its consumers, Jordan said it's important for business leaders like herself to avoid being swept up in the hype.
She suggested the CoE's explorations will help the proposals team sort the agentic wheat from the chaff.
"I can see the potential of AI in general, and agents in particular because they can help solve a gap with single-threaded AI, which is only being able to do one thing at a time," she stated.
"When you can set up complex systems of agents, that approach has lots of potential."
Raymond Boyle, vice president of data and analytics at Hyatt Hotels, noted his organization takes a tried-and-trusted approach to emerging technologies like agents: let line-of-business departments decide how innovations are exploited.
"We look at those transformations with our business partners and through a lens that looks at their work," he noted.
"We won't introduce change to the business. We'll introduce change with the business and work through the challenges and things they believe are most essential in finance, digital, loyalty, and sales."
Also: AI agents may soon surpass people as primary application consumers.
Boyle told ZDNET that Hyatt gets its business leaders to think through how tech-enabled change might work in their parts of the organization.
Agents could be used eventually, but only once a partnership approach identifies the right opportunities.
"Agents are becoming a big part of how generative AI and machine learning are used in business today. The way agents will be used in travel will be fascinating to watch. I think this technology will certainly be a part of the mix," he mentioned.
"The process for Hyatt will be to find the right technologies -- and we'll do that in close partnership with our business leaders and. The technology teams that run the applications. We'll then provide the AI services to drive those transitions for the business."
Keith Woolley, chief digital and information officer at the University of Bristol, is another digital leader who sees the potential benefits of agents. However, he stated these advantages will become manifest over the longer term.
"We are looking at agentic AI, but. We're not implementing it yet," he mentioned. "We sit as a management team and ask questions like, 'Should we do our admissions process using agentic AI? What would be the advantage?'"
Woolley told ZDNET he could envision a situation in which AI and automation help assess and inform candidates worldwide about the status of their applications.
"The benefits to us. In terms of cost and operational efficiencies, could be substantial. Also, the tools could be effective for our student population, especially incoming undergraduates and postgraduates," he noted.
"That would be great because we could ensure the bots keep students updated on what's happening and. How they're progressing through the admissions process. That approach would make them feel more engaged from the start."
Also: AI agents might be the new workforce, but they still need a manager.
While the multilingual capabilities could be a boon in helping the university deal with international applications, Woolley said there are many challenges to manage before agents become part of the administrative process.
"Any agent would need to have the right learning models attached to it because. Otherwise, there's the risk of bias. I ask senior managers, 'How much failure are you prepared to accept?' Because if we make the wrong decision, you'll get bad coverage," he unveiled.
"Part of our challenge now is going back to our university executive board and our board of trustees to say, 'If you believe there is a place for AI in our organization. Where would you like to put that technology?' Because some things will differentiate you and some things won't."
How LLMs Work: Reinforcement Learning, RLHF, DeepSeek R1, OpenAI o1, AlphaGo

Welcome to part 2 of my LLM deep dive. If you’ve not read Part 1, I highly encourage you to check it out first.
Previously, we covered the first two major stages of training an LLM:
Pre-training — learning from massive datasets to form a base model.
Supervised fine-tuning (SFT) — refining the model with curated examples to make it useful.
Now, we're diving into the next major stage: Reinforcement Learning (RL). While pre-training and SFT are well-established, RL is still evolving but has become a critical part of the training pipeline.
I've drawn heavily on Andrej Karpathy's widely popular YouTube deep dive. Andrej is a founding member of OpenAI, and his insights are gold — you get the idea.
What’s the purpose of reinforcement learning (RL)?
Humans and LLMs process information differently. What's intuitive for us — like basic arithmetic — may not be for an LLM, which only sees text as sequences of tokens. Conversely, an LLM can generate expert-level responses on complex topics simply because it has seen enough examples during training.
This difference in cognition makes it challenging for human annotators to provide the “perfect” set of labels that consistently guide an LLM toward the right answer.
RL bridges this gap by allowing the model to learn from its own experience.
Instead of relying solely on explicit labels, the model explores different token sequences and receives feedback — reward signals — on which outputs are most useful. Over time, it learns to align better with human intent.
LLMs are stochastic — meaning their responses aren’t fixed. Even with the same prompt, the output varies because it’s sampled from a probability distribution.
We can harness this randomness by generating thousands or even millions of possible responses in parallel. Think of it as the model exploring different paths — some good, some bad. Our goal is to encourage it to take the better paths more often.
To do this, we train the model on the sequences of tokens that lead to better outcomes. Unlike supervised fine-tuning, where human experts provide labeled data, reinforcement learning allows the model to learn from itself.
The model discovers which responses work best, and after each training step, we update its parameters. Over time, this makes the model more likely to produce high-quality answers when given similar prompts in the future.
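To make the idea concrete, here is a minimal, self-contained sketch of this sample-score-reinforce loop. It is not the actual LLM training code: the "model" is just a learnable distribution over a tiny vocabulary, and the reward function and variable names are my own placeholders. The structure, however, is the same: sample many candidates, score them, and nudge the policy toward the above-average ones.

import torch
import torch.nn.functional as F

# Toy "policy": a single learnable logit vector over a tiny vocabulary.
# (A real LLM would condition on the prompt and previously generated tokens.)
vocab_size, seq_len, num_samples = 10, 5, 64
logits = torch.zeros(vocab_size, requires_grad=True)
optimizer = torch.optim.Adam([logits], lr=0.1)

def reward(sequence):
    # Made-up reward for illustration: prefer sequences containing many 7s.
    return (sequence == 7).float().mean()

for step in range(200):
    probs = F.softmax(logits, dim=-1)
    # Explore: sample many candidate "responses" in parallel.
    samples = torch.multinomial(probs, num_samples * seq_len, replacement=True)
    samples = samples.view(num_samples, seq_len)
    rewards = torch.stack([reward(s) for s in samples])
    # Centre the rewards so above-average responses are reinforced
    # and below-average ones are discouraged (a simple baseline).
    advantages = rewards - rewards.mean()
    log_probs = F.log_softmax(logits, dim=-1)[samples].sum(dim=1)
    loss = -(advantages * log_probs).mean()   # REINFORCE-style objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()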
But how do we determine which responses are best? And how much RL should we do? The details are tricky, and getting them right is not trivial.
RL is not "new" — it can surpass human expertise (AlphaGo, 2016).
A great example of RL’s power is DeepMind’s AlphaGo, the first AI to defeat a professional Go player and later surpass human-level play.
In the 2016 Nature paper (graph below), when a model was trained purely by SFT (giving the model tons of good examples to imitate from), the model was able to reach human-level performance, but never surpass it.
The dotted line represents Lee Sedol’s performance — the best Go player in the world.
This is because SFT is about replication, not innovation — it doesn't allow the model to discover new strategies beyond human knowledge.
However, RL enabled AlphaGo to play against itself, refine its strategies, and ultimately exceed human expertise (blue line).
RL represents an exciting frontier in AI — where models can explore strategies beyond human imagination when we train them on a diverse and challenging pool of problems to refine their thinking strategies.
Let’s quickly recap the key components of a typical RL setup:
Agent — the learner or decision maker. It observes the current situation (state), chooses an action, and then updates its behaviour based on the outcome (reward).
Environment — the external system in which the agent operates.
State — a snapshot of the environment at a given step t.
At each time step, the agent performs an action in the environment that will change the environment's state to a new one. The agent will also receive feedback indicating how good or bad the action was.
This feedback is called a reward, and is represented in a numerical form. A positive reward encourages that behaviour, and a negative reward discourages it.
By using feedback from different states and actions, the agent gradually learns the optimal strategy to maximise the total reward over time.
Policy — the agent's strategy. If the agent follows a good policy, it will consistently make good decisions, leading to higher rewards over many steps. In mathematical terms, it is a function that determines the probability of different outputs for a given state — πθ(a|s).
Value function — an estimate of how good it is to be in a certain state, considering the long-term expected reward. For an LLM, the reward might come from human feedback or a reward model.
Actor-Critic is a popular RL setup that combines two components:
Actor — learns and updates the policy (πθ), deciding which action to take in each state.
Critic — evaluates the value function (V(s)) to give feedback to the actor on whether its chosen actions are leading to good outcomes.
The actor picks an action based on its current policy. The critic evaluates the outcome (reward + next state) and updates its value estimate. The critic's feedback helps the actor refine its policy so that future actions lead to higher rewards.
For an LLM, the state can be the current text (prompt or conversation), and the action can be the next token to generate. A reward model (e.g. human feedback) tells the model how good or bad its generated text is.
The policy is the model's strategy for picking the next token, while the value function estimates how beneficial the current text context is in terms of eventually producing high-quality responses.
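Below is a minimal sketch of a single actor-critic update. It uses generic vector states and random placeholder data rather than real token contexts, and the function and variable names are my own; for an LLM, the state would be the token sequence so far and the action the next token.

import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, num_actions = 16, 8

actor = nn.Linear(state_dim, num_actions)   # policy head: pi_theta(a|s)
critic = nn.Linear(state_dim, 1)            # value head: V(s)
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def update(state, next_state, reward, done, gamma=0.99):
    # Actor picks an action from its current policy.
    probs = F.softmax(actor(state), dim=-1)
    dist = torch.distributions.Categorical(probs)
    action = dist.sample()

    # Critic evaluates the outcome: TD target = r + gamma * V(s').
    # (In a real loop, reward and next_state would come from the environment
    # after the chosen action is taken.)
    value = critic(state).squeeze(-1)
    with torch.no_grad():
        next_value = critic(next_state).squeeze(-1)
        target = reward + gamma * (1.0 - done) * next_value
    advantage = target - value

    actor_loss = -(dist.log_prob(action) * advantage.detach()).mean()
    critic_loss = F.mse_loss(value, target)

    opt.zero_grad()
    (actor_loss + critic_loss).backward()
    opt.step()

# One update step on random placeholder data:
update(torch.randn(4, state_dim), torch.randn(4, state_dim),
       reward=torch.rand(4), done=torch.zeros(4))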
To highlight RL's importance, let's explore DeepSeek-R1, a reasoning model achieving top-tier performance while remaining open-source. The paper introduced two models: DeepSeek-R1-Zero and DeepSeek-R1.
DeepSeek-R1-Zero was trained solely via large-scale RL, skipping supervised fine-tuning (SFT).
DeepSeek-R1 builds on it, addressing the challenges encountered along the way.
Deepseek R1 is one of the most amazing and impressive breakthroughs I've ever seen — and as open source, a profound gift to the world. 🤖🫡 — Marc Andreessen 🇺🇸 (@pmarca) January 24, 2025.
Let’s dive into some of these key points.
1. RL algo: Group Relative Policy Optimisation (GRPO).
One key game-changing RL algorithm is Group Relative Policy Optimisation (GRPO), a variant of the widely popular Proximal Policy Optimisation (PPO). GRPO was introduced in the DeepSeekMath paper in Feb 2024.
PPO struggles with reasoning tasks due to:
– Dependency on a separate critic model, which effectively doubles memory and compute. Training the critic can also be complex for nuanced or subjective tasks.
– High computational cost, as RL pipelines demand substantial resources to evaluate and optimise responses.
– Absolute reward evaluations: when you rely on an absolute reward — meaning there's a single standard or metric to judge whether an answer is "good" or "bad" — it can be hard to capture the nuances of open-ended, diverse tasks across different reasoning domains.
GRPO eliminates the critic model by using relative evaluation — responses are compared within a group rather than judged by a fixed standard.
Imagine students solving a problem. Instead of a teacher grading them individually, they compare answers, learning from each other. Over time, performance converges toward higher quality.
How does GRPO fit into the whole training process?
GRPO modifies how loss is calculated while keeping other training steps unchanged:
– Generate responses: the old policy (an older snapshot of the model) generates several candidate answers for each query.
– Assign rewards: each response in the group is scored (the "reward").
– Compute the GRPO loss. Traditionally, you'd compute a loss showing the deviation between the model prediction and the true label. In GRPO, you instead measure:
a) How likely is the new policy to produce past responses?
b) Are those responses relatively better or worse?
c) Apply clipping to prevent extreme updates.
This yields a scalar loss.
– Backpropagation + gradient descent: backpropagation calculates how each parameter contributed to the loss, and gradient descent updates those parameters to reduce it. Over many iterations, this gradually shifts the new policy to prefer higher-reward responses.
– Update the old policy occasionally to match the new policy.
This refreshes the baseline for the next round of comparisons.
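The following is a simplified sketch of the group-relative idea; it omits the KL penalty and per-token bookkeeping of the full GRPO objective, and the function names are my own. Advantages come from comparing each response to its own group (replacing PPO's learned critic), and a PPO-style clipped ratio keeps updates conservative.

import torch

def grpo_advantages(rewards):
    # Compare each response to its own group: normalise rewards within the group.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def grpo_loss(new_logprobs, old_logprobs, rewards, clip_eps=0.2):
    # new_logprobs / old_logprobs: summed log-probabilities that the new / old
    # policy assigns to each candidate response in the group.
    adv = grpo_advantages(rewards)
    ratio = torch.exp(new_logprobs - old_logprobs)      # how likely under new vs old policy
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    # Take the more pessimistic of the two terms, as in PPO-style clipping.
    return -torch.min(ratio * adv, clipped * adv).mean()

# Example: a group of 4 candidate answers for one query.
# (In training, new_logprobs would come from the current model and carry gradients.)
old_lp = torch.tensor([-12.0, -15.0, -11.5, -14.0])
new_lp = old_lp + 0.1 * torch.randn(4)
rewards = torch.tensor([1.0, 0.0, 1.0, 0.5])
loss = grpo_loss(new_lp, old_lp, rewards)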
Traditional LLM training follows pre-training → SFT → RL. However, DeepSeek-R1-Zero skipped SFT, allowing the model to directly explore CoT reasoning.
Like humans thinking through a tough question, CoT enables models to break problems into intermediate steps, boosting complex reasoning capabilities. OpenAI's o1 model also leverages this, as noted in its September 2024 release: o1's performance improves with more RL (train-time compute) and more reasoning time (test-time compute).
DeepSeek-R1-Zero exhibited reflective tendencies, autonomously refining its reasoning.
A key graph (below) in the paper showed increased thinking during training, leading to longer (more tokens), more detailed and advanced responses.
Without explicit programming, it began revisiting past reasoning steps, improving accuracy. This highlights chain-of-thought reasoning as an emergent property of RL training.
The model also had an "aha moment" (below) — a fascinating example of how RL can lead to unexpected and sophisticated outcomes.
Note: Unlike DeepSeek-R1, OpenAI does not show the full exact reasoning chains of thought in o1, as it is concerned about a distillation risk — where someone tries to imitate those reasoning traces and recover much of the reasoning performance through imitation alone. Instead, o1 shows just summaries of these chains of thought.
Reinforcement Learning with Human Feedback (RLHF).
For tasks with verifiable outputs (e.g. math problems, factual Q&A), AI responses can be easily evaluated. But what about areas like summarisation or creative writing, where there's no single "correct" answer?
This is where human feedback comes in — but naïve RL approaches are unscalable.
Let’s look at the naive approach with some arbitrary numbers.
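For illustration, suppose (these specific figures are placeholders of my own, chosen only to show the scale) the RL run involves 1,000 training iterations, each iteration uses 1,000 different prompts, and each prompt generates 1,000 candidate responses that a human must score:

1,000 iterations × 1,000 prompts × 1,000 responses per prompt = 1,000,000,000 human evaluations.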
That’s one billion human evaluations needed! This is too costly, slow and unscalable. Hence, a smarter solution is to train an AI “reward model” to learn human preferences, dramatically reducing human effort.
Ranking responses is also easier and more intuitive than absolute scoring.
A reward model can be applied to any domain, including creative writing, poetry, summarisation, and other open-ended tasks.
Ranking outputs is much easier for human labellers than generating creative outputs themselves.
There are downsides, though. The reward model is an approximation — it may not perfectly reflect human preferences.
RL is good at gaming the reward model — if run for too long, the model might exploit loopholes, generating nonsensical outputs that still get high scores.
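As a rough sketch of how a reward model is typically trained from rankings, here is a pairwise (Bradley-Terry-style) ranking loss. The tiny feed-forward network and random embeddings below are stand-ins for a real transformer reading prompt-plus-response text, and the names are my own.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in reward model: maps a response embedding to a scalar score.
embed_dim = 32
reward_model = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def ranking_loss(chosen_emb, rejected_emb):
    # The labeller only says which response is better; the model is trained so
    # that the preferred response receives the higher score.
    chosen_score = reward_model(chosen_emb).squeeze(-1)
    rejected_score = reward_model(rejected_emb).squeeze(-1)
    return -F.logsigmoid(chosen_score - rejected_score).mean()

# One training step on a batch of placeholder preference pairs.
chosen, rejected = torch.randn(16, embed_dim), torch.randn(16, embed_dim)
loss = ranking_loss(chosen, rejected)
opt.zero_grad()
loss.backward()
opt.step()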
Do note that RLHF is not the same as traditional RL.
For empirical, verifiable domains (e.g. math, coding), RL can run indefinitely and discover novel strategies. RLHF, on the other hand, is more like a fine-tuning step to align models with human preferences.
And that’s a wrap! I hope you enjoyed Part 2 🙂 If you haven’t already read Part 1 — do check it out here.
Got questions or ideas for what I should cover next? Drop them in the comments — I’d love to hear your thoughts. See you in the next article!
Debugging the Dreaded NaN

You are training your latest AI model, anxiously watching as the loss steadily decreases when suddenly — boom! Your logs are flooded with NaNs (Not a Number) — your model is irreparably corrupted and you’re left staring at your screen in despair. To make matters worse, the NaNs don’t appear consistently. Sometimes your model trains just fine; other times, it fails inexplicably. Sometimes it will crash immediately, sometimes after many days of training.
NaNs in Deep Learning workloads are amongst the most frustrating issues to encounter. And because they often appear sporadically — triggered by a specific combination of model state, input data, and stochastic factors — they can be incredibly difficult to reproduce and debug.
Given the considerable cost of training AI models and the potential waste caused by NaN failures, it is recommended to have dedicated tools for capturing and analyzing NaN occurrences. In a previous post, we discussed the challenge of debugging NaNs in a TensorFlow training workload. We proposed an efficient scheme for capturing and reproducing NaNs and shared a sample TensorFlow implementation. In this post, we adopt and demonstrate a similar mechanism for debugging NaNs in PyTorch workloads. The general scheme is as follows:
1. Save a copy of the training input batch.
2. Check the gradients for NaN values. If any appear, save a checkpoint with the current model weights before the model is corrupted. Also, save the input batch and, if necessary, the stochastic state. Discontinue the training job.
3. Reproduce and debug the NaN occurrence by loading the saved experiment state.
Although this scheme can be easily implemented in native PyTorch, we will take the opportunity to demonstrate some of the conveniences of PyTorch Lightning — a powerful open-source framework designed to streamline the development of machine learning (ML) models. Built on PyTorch, Lightning abstracts away many of the boiler-plate components of an ML experiment, such as training loops, data distribution, logging, and more, enabling developers to focus on the core logic of their models.
To implement our NaN capturing scheme, we will use Lightning’s callback interface — a dedicated structure that enables inserting custom logic at specific points during the flow of execution.
Importantly, please do not view our choice of Lightning or any other tool or technique that we mention as an endorsement of its use. The code that we will share is intended for demonstrative purposes — please do not rely on its correctness or optimality.
Many thanks to Rom Maltser for his contributions to this post.
To implement our NaN capturing solution, we create a NaNCapture Lightning callback. The constructor receives a directory path for storing/loading checkpoints and sets up the NaNCapture state. We also define utilities for checking for NaNs, storing checkpoints, and halting the training job.
import os
import torch
from copy import deepcopy
import lightning.pytorch as pl


class NaNCapture(pl.Callback):

    def __init__(self, dirpath: str):
        # path to checkpoint
        self.dirpath = dirpath

        # set to True when a NaN is identified
        self.nan_captured = False

        # stores a copy of the last batch
        self.last_batch = None
        self.batch_idx = None

    @staticmethod
    def contains_nan(tensor):
        return torch.isnan(tensor).any().item()
        # alternatively check for finite
        # return not torch.isfinite(tensor).all().item()

    @staticmethod
    def halt_training(trainer):
        trainer.should_stop = True
        # communicate stop command to all other ranks
        trainer.strategy.reduce_boolean_decision(trainer.should_stop, all=False)

    def save_ckpt(self, trainer):
        os.makedirs(self.dirpath, exist_ok=True)
        # include trainer.global_rank to avoid conflict
        filename = f"nan_checkpoint_rank_{trainer.global_rank}.ckpt"
        full_path = os.path.join(self.dirpath, filename)
        print(f"saving ckpt to {full_path}")
        trainer.save_checkpoint(full_path, False)
We begin by implementing the on_train_batch_start hook to store a copy of each input batch. In case of a NaN event, this batch will be stored in the checkpoint.
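A minimal version of this hook might look as follows (it matches the behaviour of the full callback listing shown later in this post):

def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
    if not self.nan_captured:
        # keep a copy of the current input batch so that it can be stored
        # in the checkpoint if a NaN is detected later in this step
        self.last_batch = deepcopy(batch)
        self.batch_idx = batch_idx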
Callback Function: on_before_optimizer_step.
Next we implement the on_before_optimizer_step hook. Here, we check for NaN entries in all of the gradient tensors. If found, we store a checkpoint with the uncorrupted model weights and halt the training.
Python"> def on_before_optimizer_step(self, trainer, pl_module, optimizer): if not self. nan_captured: # Check if gradients contain NaN grads = [ for p in pl_module. parameters() if is not None] all_grads = if self. contains_nan(all_grads): print("nan found") self. save_ckpt(trainer) self. halt_training(trainer).
To enable reproducibility, we include the NaNCapture state in the checkpoint by appending it to the training state dictionary. Lightning provides dedicated utilities for saving and loading a callback state:
def state_dict(self):
    d = {"nan_captured": self.nan_captured}
    if self.nan_captured:
        d["last_batch"] = self.last_batch
    return d

def load_state_dict(self, state_dict):
    self.nan_captured = state_dict.get("nan_captured", False)
    if self.nan_captured:
        self.last_batch = state_dict["last_batch"]
We have described how our NaNCapture callback can be used to store the training state that resulted in a NaN, but how do we reload this state in order to reproduce the issue and debug it? To accomplish this, we leverage Lightning's dedicated data loading class, LightningDataModule.
DataModule Function: on_before_batch_transfer.
In the code block below, we extend the LightningDataModule class to allow injecting a fixed training input batch. This is achieved by overriding the on_before_batch_transfer hook, as shown below:
from lightning.pytorch import LightningDataModule


class InjectableDataModule(LightningDataModule):

    def __init__(self):
        super().__init__()
        self.cached_batch = None

    def set_custom_batch(self, batch):
        self.cached_batch = batch

    def on_before_batch_transfer(self, batch, dataloader_idx):
        # if a cached batch was injected, always return it instead of the new batch
        if self.cached_batch:
            return self.cached_batch
        return batch
The final step is modifying the on_train_start hook of our NaNCapture callback to inject the stored training batch into the LightningDataModule.
def on_train_start(self, trainer, pl_module):
    if self.nan_captured:
        datamodule = trainer.datamodule
        datamodule.set_custom_batch(self.last_batch)
In the next section we will demonstrate the end-to-end solution using a toy example.
To test our new callback, we create a resnet50-based image classification model with a loss function deliberately designed to trigger NaN occurrences.
Instead of using the standard CrossEntropy loss, we compute binary_cross_entropy_with_logits for each class independently and divide the result by the number of samples belonging to that class. Inevitably, we will encounter a batch in which one or more classes are missing, leading to a divide-by-zero operation, resulting in NaN values and corrupting the model.
The implementation below follows Lightning’s introductory tutorial.
import lightning.pytorch as pl
import torch
import torchvision
import torch.nn.functional as F

num_classes = 20


# define a lightning module
class ResnetModel(pl.LightningModule):

    def __init__(self):
        """Initializes a new instance of the ResnetModel class."""
        super().__init__()
        self.model = torchvision.models.resnet50(num_classes=num_classes)

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_nb):
        x, y = batch
        outputs = self(x)
        # uncomment for default loss
        # return F.cross_entropy(outputs, y)

        # calculate binary_cross_entropy for each class individually
        losses = []
        for c in range(num_classes):
            count = torch.count_nonzero(y == c)
            masked = torch.where(y == c, 1., 0.)
            loss = F.binary_cross_entropy_with_logits(
                outputs[..., c],
                masked,
                reduction='sum'
            )
            mean_loss = loss / count  # could result in NaN
            losses.append(mean_loss)
        total_loss = torch.stack(losses).mean()
        return total_loss

    def configure_optimizers(self):
        # any standard optimizer works for this demonstration
        return torch.optim.Adam(self.parameters(), lr=0.02)
We define a synthetic dataset and encapsulate it in our InjectableDataModule class:
import os
import random
from torch.utils.data import Dataset, DataLoader

batch_size = 128
num_steps = 800


# A dataset with random images and labels
class FakeDataset(Dataset):

    def __len__(self):
        return batch_size * num_steps

    def __getitem__(self, index):
        rand_image = torch.randn([3, 224, 224], dtype=torch.float32)
        label = torch.tensor(random.randint(0, num_classes - 1))
        return rand_image, label


# define a lightning datamodule
class FakeDataModule(InjectableDataModule):

    def train_dataloader(self):
        dataset = FakeDataset()
        return DataLoader(
            dataset,
            batch_size=batch_size,
            num_workers=os.cpu_count(),
            pin_memory=True
        )
Finally, we initialize a Lightning Trainer with our NaNCapture callback and call trainer.fit with our Lightning module and Lightning DataModule.
import time

if __name__ == "__main__":

    # Initialize a lightning module
    lit_module = ResnetModel()

    # Initialize a DataModule
    mnist_data = FakeDataModule()

    # Train the model
    ckpt_dir = "./ckpt_dir"
    trainer = pl.Trainer(
        max_epochs=1,
        callbacks=[NaNCapture(ckpt_dir)]
    )

    ckpt_path = None

    # check if a nan checkpoint exists from a previous run
    if os.path.isdir(ckpt_dir):
        dir_contents = [os.path.join(ckpt_dir, f) for f in os.listdir(ckpt_dir)]
        ckpts = [f for f in dir_contents if os.path.isfile(f) and f.endswith('.ckpt')]
        if ckpts:
            ckpt_path = ckpts[0]

    t0 = time.perf_counter()
    trainer.fit(lit_module, mnist_data, ckpt_path=ckpt_path)
    print(f"total runtime: {time.perf_counter() - t0}")
After a number of training steps, a NaN event will occur. At this point a checkpoint is saved with the full training state and the training is halted.
When the script is run again, the exact state that caused the NaN will be reloaded, allowing us to easily reproduce the issue and debug its root cause.
To assess the impact of our NaNCapture callback on runtime performance, we modified our experiment to use CrossEntropyLoss (to avoid NaNs) and measured the average throughput when running with and without the NaNCapture callback. The experiments were conducted on an NVIDIA L40S GPU, with a PyTorch Docker image.
Overhead of NaNCapture Callback (by Author).
For our toy model, the NaNCapture callback adds a minimal overhead to the runtime performance — a small price to pay for the valuable debugging capabilities it provides.
Naturally, the actual overhead will depend on the specifics of the model and runtime environment.
The solution we have described so far will succeed in reproducing the training state provided that the model does not include any randomness. However, introducing stochasticity into the model definition is often critical for convergence. A common example of a stochastic layer is dropout (torch.nn.Dropout).
You may find that your NaN event depends on the precise state of randomness when the failure occurred. Consequently, we would like to enhance our NaNCapture callback to capture and restore the random state at the point of failure. The random state is determined by a number of libraries. In the code block below, we attempt to capture the full state of randomness:
import os
import torch
import random
import numpy as np
from copy import deepcopy
import lightning.pytorch as pl


class NaNCapture(pl.Callback):

    def __init__(self, dirpath: str):
        # path to checkpoint
        self.dirpath = dirpath

        # set to True when a NaN is identified
        self.nan_captured = False

        # stores a copy of the last batch
        self.last_batch = None
        self.batch_idx = None

        # rng state
        self.rng_state = {
            "torch": None,
            "torch_cuda": None,
            "numpy": None,
            "random": None
        }

    @staticmethod
    def contains_nan(tensor):
        return torch.isnan(tensor).any().item()
        # alternatively check for finite
        # return not torch.isfinite(tensor).all().item()

    @staticmethod
    def halt_training(trainer):
        trainer.should_stop = True
        trainer.strategy.reduce_boolean_decision(trainer.should_stop, all=False)

    def save_ckpt(self, trainer):
        os.makedirs(self.dirpath, exist_ok=True)
        # include trainer.global_rank to avoid conflict
        filename = f"nan_checkpoint_rank_{trainer.global_rank}.ckpt"
        full_path = os.path.join(self.dirpath, filename)
        print(f"saving ckpt to {full_path}")
        trainer.save_checkpoint(full_path, False)

    def on_train_start(self, trainer, pl_module):
        if self.nan_captured:
            # inject batch
            datamodule = trainer.datamodule
            datamodule.set_custom_batch(self.last_batch)

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        if self.nan_captured:
            # restore random state
            torch.random.set_rng_state(self.rng_state["torch"])
            torch.cuda.set_rng_state_all(self.rng_state["torch_cuda"])
            np.random.set_state(self.rng_state["numpy"])
            random.setstate(self.rng_state["random"])
        else:
            # capture current batch
            self.last_batch = deepcopy(batch)
            self.batch_idx = batch_idx

            # capture current random state
            self.rng_state["torch"] = torch.random.get_rng_state()
            self.rng_state["torch_cuda"] = torch.cuda.get_rng_state_all()
            self.rng_state["numpy"] = np.random.get_state()
            self.rng_state["random"] = random.getstate()

    def on_before_optimizer_step(self, trainer, pl_module, optimizer):
        if not self.nan_captured:
            # Check if gradients contain NaN
            grads = [p.grad.view(-1) for p in pl_module.parameters() if p.grad is not None]
            all_grads = torch.cat(grads)
            if self.contains_nan(all_grads):
                print("nan found")
                self.save_ckpt(trainer)
                self.halt_training(trainer)

    def state_dict(self):
        d = {"nan_captured": self.nan_captured}
        if self.nan_captured:
            d["last_batch"] = self.last_batch
            d["rng_state"] = self.rng_state
        return d

    def load_state_dict(self, state_dict):
        self.nan_captured = state_dict.get("nan_captured", False)
        if self.nan_captured:
            self.last_batch = state_dict["last_batch"]
            self.rng_state = state_dict["rng_state"]
Importantly, setting the random state may not guarantee full reproducibility. The GPU owes its power to its massive parallelism. In some GPU operations, multiple threads may read or write concurrently to the same memory locations, resulting in nondeterminism. PyTorch allows for some control over this via torch.use_deterministic_algorithms, but this may impact the runtime performance. Additionally, there is a possibility that the NaN event will not be reproduced once this configuration setting is changed. Please see the PyTorch documentation on reproducibility.
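For reference, opting in to deterministic algorithms looks roughly like this; note that the CuBLAS workspace setting below is required by some CUDA operations when determinism is enabled, and both settings may slow training down:

import os
import torch

# must be set before the relevant CUDA operations run
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
# raise an error if an op has no deterministic implementation
torch.use_deterministic_algorithms(True)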
Encountering NaN failures is one of the most discouraging events that can happen in machine learning development. These errors not only waste valuable computation and development resources, but often indicate fundamental issues in the model architecture or experiment design. Due to their sporadic, sometimes elusive nature, debugging NaN failures can be a nightmare.
This post introduced a proactive approach for capturing and reproducing NaN errors using a dedicated Lightning callback. The solution we shared is a proposal which can be modified and extended for your specific use case.
While this solution may not address every possible NaN scenario, it significantly reduces debugging time when applicable, potentially saving developers countless hours of frustration and wasted effort.
Market Impact Analysis
Market Growth Trend
2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 |
---|---|---|---|---|---|---|
23.1% | 27.8% | 29.2% | 32.4% | 34.2% | 35.2% | 35.6% |
Quarterly Growth Rate
Q1 2024 | Q2 2024 | Q3 2024 | Q4 2024 |
---|---|---|---|
32.5% | 34.8% | 36.2% | 35.6% |
Market Segments and Growth Drivers
Segment | Market Share | Growth Rate |
---|---|---|
Machine Learning | 29% | 38.4% |
Computer Vision | 18% | 35.7% |
Natural Language Processing | 24% | 41.5% |
Robotics | 15% | 22.3% |
Other AI Technologies | 14% | 31.8% |
Competitive Landscape Analysis
Company | Market Share |
---|---|
Google AI | 18.3% |
Microsoft AI | 15.7% |
IBM Watson | 11.2% |
Amazon AI | 9.8% |
OpenAI | 8.4% |
Future Outlook and Predictions
The AI technology landscape is evolving rapidly, driven by technological advancements, changing threat vectors, and shifting business requirements. Based on current trends and expert analyses, we can anticipate several significant developments across different time horizons:
Year-by-Year Technology Evolution
Based on current trajectory and expert analyses, we can project the following development timeline:
Technology Maturity Curve
Different technologies within the ecosystem are at varying stages of maturity, influencing adoption timelines and investment priorities:
Innovation Trigger
- Generative AI for specialized domains
- Blockchain for supply chain verification
Peak of Inflated Expectations
- Digital twins for business processes
- Quantum-resistant cryptography
Trough of Disillusionment
- Consumer AR/VR applications
- General-purpose blockchain
Slope of Enlightenment
- AI-driven analytics
- Edge computing
Plateau of Productivity
- Cloud infrastructure
- Mobile applications
Technology Evolution Timeline
- Improved generative models
- Specialized AI applications
- AI-human collaboration systems
- Multimodal AI platforms
- General AI capabilities
- AI-driven scientific breakthroughs
Expert Perspectives
Leading experts in the AI tech sector provide diverse perspectives on how the landscape will evolve over the coming years:
"The next frontier is AI systems that can reason across modalities and domains with minimal human guidance."
— AI Researcher
"Organizations that develop effective AI governance frameworks will gain competitive advantage."
— Industry Analyst
"The AI talent gap remains a critical barrier to implementation for most enterprises."
— Chief AI Officer
Areas of Expert Consensus
- Acceleration of Innovation: The pace of technological evolution will continue to increase
- Practical Integration: Focus will shift from proof-of-concept to operational deployment
- Human-Technology Partnership: Most effective implementations will optimize human-machine collaboration
- Regulatory Influence: Regulatory frameworks will increasingly shape technology development
Short-Term Outlook (1-2 Years)
In the immediate future, organizations will focus on implementing and optimizing currently available technologies to address pressing AI tech challenges:
- Improved generative models
- Specialized AI applications
- Enhanced AI ethics frameworks
These developments will be characterized by incremental improvements to existing frameworks rather than revolutionary changes, with emphasis on practical deployment and measurable outcomes.
Mid-Term Outlook (3-5 Years)
As technologies mature and organizations adapt, more substantial transformations will emerge in how security is approached and implemented:
- AI-human collaboration systems
- Multimodal AI platforms
- Democratized AI development
This period will see significant changes in security architecture and operational models, with increasing automation and integration between previously siloed security functions. Organizations will shift from reactive to proactive security postures.
Long-Term Outlook (5+ Years)
Looking further ahead, more fundamental shifts will reshape how cybersecurity is conceptualized and implemented across digital ecosystems:
- General AI capabilities
- AI-driven scientific breakthroughs
- New computing paradigms
These long-term developments will likely require significant technical breakthroughs, new regulatory frameworks, and evolution in how organizations approach security as a fundamental business function rather than a technical discipline.
Key Risk Factors and Uncertainties
Several critical factors could significantly impact the trajectory of AI tech evolution:
Organizations should monitor these factors closely and develop contingency strategies to mitigate potential negative impacts on technology implementation timelines.
Alternative Future Scenarios
The evolution of technology can follow different paths depending on various factors including regulatory developments, investment trends, technological breakthroughs, and market adoption. We analyze three potential scenarios:
Optimistic Scenario
Responsible AI driving innovation while minimizing societal disruption
Key Drivers: Supportive regulatory environment, significant research breakthroughs, strong market incentives, and rapid user adoption.
Probability: 25-30%
Base Case Scenario
Incremental adoption with mixed societal impacts and ongoing ethical challenges
Key Drivers: Balanced regulatory approach, steady technological progress, and selective implementation based on clear ROI.
Probability: 50-60%
Conservative Scenario
Technical and ethical barriers creating significant implementation challenges
Key Drivers: Restrictive regulations, technical limitations, implementation challenges, and risk-averse organizational cultures.
Probability: 15-20%
Scenario Comparison Matrix
Factor | Optimistic | Base Case | Conservative |
---|---|---|---|
Implementation Timeline | Accelerated | Steady | Delayed |
Market Adoption | Widespread | Selective | Limited |
Technology Evolution | Rapid | Progressive | Incremental |
Regulatory Environment | Supportive | Balanced | Restrictive |
Business Impact | Transformative | Significant | Modest |
Transformational Impact
Redefinition of knowledge work, automation of creative processes. This evolution will necessitate significant changes in organizational structures, talent development, and strategic planning processes.
The convergence of multiple technological trends—including artificial intelligence, quantum computing, and ubiquitous connectivity—will create both unprecedented security challenges and innovative defensive capabilities.
Implementation Challenges
Ethical concerns, computing resource limitations, talent shortages. Organizations will need to develop comprehensive change management strategies to successfully navigate these transitions.
Regulatory uncertainty, particularly around emerging technologies like AI in security applications, will require flexible security architectures that can adapt to evolving compliance requirements.
Key Innovations to Watch
Multimodal learning, resource-efficient AI, transparent decision systems. Organizations should monitor these developments closely to maintain competitive advantages and effective security postures.
Strategic investments in research partnerships, technology pilots, and talent development will position forward-thinking organizations to leverage these innovations early in their development cycle.