Discovering when an agent is present in a system

New, formal definition of agency gives clear principles for causal modelling of AI agents and the incentives they face.

We want to build safe, aligned artificial general intelligence (AGI) systems that pursue the intended goals of their designers. Causal influence diagrams (CIDs) are a way to model decision-making situations that allow us to reason about agent incentives. For example, here is a CID for a 1-step Markov decision process – a typical framework for decision-making problems.

S1 represents the initial state, A1 represents the agent’s decision (square), S2 the next state. R2 is the agent’s reward/utility (diamond). Solid links specify causal influence. Dashed edges specify information links – what the agent knows when making its decision.

By relating training setups to the incentives that shape agent behaviour, CIDs help illuminate potential risks before training an agent and can inspire better agent designs. But how do we know when a CID is an accurate model of a training setup? Our new paper, Discovering Agents, introduces new ways of tackling these issues, including:

The first formal causal definition of agents: Agents are systems that would adapt their policy if their actions influenced the world in a different way.

An algorithm for discovering agents from empirical data.

A translation between causal models and CIDs.

Resolving earlier confusions from incorrect causal modelling of agents.

Combined, these results provide an extra layer of assurance that a modelling mistake hasn’t been made, which means that CIDs can be used to analyse an agent’s incentives and safety properties with greater confidence.

Example: modelling a mouse as an agent

To help illustrate our method, consider the following example consisting of a world containing three squares, with a mouse starting in the middle square choosing to go left or right, getting to its next position and then potentially getting some cheese. The floor is icy, so the mouse might slip. Sometimes the cheese is on the right, but sometimes on the left.

This can be represented by the following CID:

CID for the mouse. D represents the decision of left/right. X is the mouse’s new position after taking the action left/right (it might slip, ending up on the other side by accident). U represents whether the mouse gets cheese or not.

The intuition that the mouse would choose a different behaviour for different environment settings (iciness, cheese distribution) can be captured by a mechanised causal graph, which for each (object-level) variable, also includes a mechanism variable that governs how the variable depends on its parents. Crucially, we allow for links between mechanism variables. This graph contains additional mechanism nodes in black, representing the mouse's policy and the iciness and cheese distribution.

Mechanised causal graph for the mouse and cheese environment.

Edges between mechanisms represent direct causal influence. The blue edges are special terminal edges – roughly, mechanism edges A~ → B~ that would still be there, even if the object-level variable A was altered so that it had no outgoing edges. In the example above, since U has no children, its mechanism edge must be terminal. But the mechanism edge X~ → D~ is not terminal, because if we cut X off from its child U, then the mouse will no longer adapt its decision (because its position won’t affect whether it gets the cheese).

Causal discovery of agents

Causal discovery infers a causal graph from experiments involving interventions. In particular, one can discover an arrow from a variable A to a variable B by experimentally intervening on A and checking if B responds, even if all other variables are held fixed. Our first algorithm uses this technique to discover the mechanised causal graph:

Algorithm 1 takes as input interventional data from the system (mouse and cheese environment) and uses causal discovery to output a mechanised causal graph. See paper for details.

Our second algorithm transforms this mechanised causal graph to a game graph:

Algorithm 2 takes as input a mechanised causal graph and maps it to a game graph. An ingoing terminal edge indicates a decision, an outgoing one indicates a utility.

Taken together, Algorithm 1 followed by Algorithm 2 allows us to discover agents from causal experiments, representing them using CIDs. Our third algorithm transforms the game graph into a mechanised causal graph, allowing us to translate between the game and mechanised causal graph representations under some additional assumptions:

Algorithm 3 takes as input a game graph and maps it to a mechanised causal graph. A decision indicates an ingoing terminal edge, a utility indicates an outgoing terminal edge.


Training Large Language Models: From TRPO to GRPO

DeepSeek has recently made quite a buzz in the AI community, thanks to its impressive performance at relatively low cost. I think this is a perfect opportunity to dive deeper into how Large Language Models (LLMs) are trained. In this article, we will focus on the Reinforcement Learning (RL) side of things: we will cover TRPO, PPO, and, more recently, GRPO (don’t worry, I will explain all these terms soon!).

I have aimed to keep this article relatively easy to read and accessible, by minimizing the math, so you won’t need a deep Reinforcement Learning background to follow along. However, I will assume that you have some familiarity with Machine Learning, Deep Learning, and a basic understanding of how LLMs work.

Before diving into RL specifics, let’s briefly recap the three main stages of training a Large Language Model:

Pre-training: the model is trained on a massive dataset to predict the next token in a sequence based on preceding tokens.

Supervised Fine-Tuning (SFT): the model is then fine-tuned on more targeted data and aligned with specific instructions.

Reinforcement Learning (often called RLHF for Reinforcement Learning with Human Feedback): this is the focus of this article. The main goal is to further refine the responses’ alignment with human preferences, by allowing the model to learn directly from feedback.

Before diving deeper, let’s briefly revisit the core ideas behind Reinforcement Learning.

RL is quite straightforward to understand at a high level: an agent interacts with an environment. The agent resides in a specific state within the environment and can take actions to transition to other states. Each action yields a reward from the environment: this is how the environment provides feedback that guides the agent’s future actions.

Consider the following example: a robot (the agent) navigates (and tries to exit) a maze (the environment).

The state is the current situation of the environment (the robot’s position in the maze).

The robot can take different actions: for example, it can move forward, turn left, or turn right.

Successfully navigating towards the exit yields a positive reward, while hitting a wall or getting stuck in the maze results in negative rewards.

Easy! Now, let’s now make an analogy to how RL is used in the context of LLMs.

When used during LLM training, RL is defined by the following components:

Environment: everything external to the LLM, including user prompts, feedback systems, and other contextual information. This is basically the framework the LLM is interacting with during training.

Actions: these are responses to a query from the model. More specifically, these are the tokens that the LLM decides to generate in response to a query.

State: the current query being answered along with the tokens the LLM has generated so far (i.e., the partial responses).

Rewards: this is a bit more tricky here: unlike the maze example above, there is usually no binary reward. In the context of LLMs, rewards usually come from a separate reward model, which outputs a score for each (query, response) pair. This model is trained from human-annotated data (hence “RLHF”) where annotators rank different responses. The goal is for higher-quality responses to receive higher rewards.

Note: in some cases, rewards can actually get simpler. For example, in DeepSeekMath, rule-based approaches can be used because math responses tend to be more deterministic (correct or wrong answer).

Policy is the final concept we need for now. In RL terms, a policy is simply the strategy for deciding which action to take. In the case of an LLM, the policy outputs a probability distribution over possible tokens at each step: in short, this is what the model uses to sample the next token to generate. Concretely, the policy is determined by the model’s parameters (weights). During RL training, we adjust these parameters so the LLM becomes more likely to produce “better” tokens – that is, tokens that produce higher reward scores.

The policy is written π_θ(a|s), where a is the action (a token to generate), s the state (the query and the tokens generated so far), and θ the model’s parameters (weights).

This idea of finding the best policy is the whole point of RL! Since we don’t have labeled data (like we do in supervised learning) we use rewards to adjust our policy to take better actions. (In LLM terms: we adjust the parameters of our LLM to generate better tokens.)

Let’s take a quick step back to how supervised learning typically works: you have labeled data and use a loss function (like cross-entropy) to measure how close your model’s predictions are to the true labels.

We can then use algorithms like backpropagation and gradient descent to minimize our loss function and improve the weights θ of our model.

Recall that our policy also outputs probabilities! In that sense, it is analogous to the model’s predictions in supervised learning… We are tempted to write something like maximising A(s, a) · log(π_θ(a|s)),

where s is the current state and a is a possible action.

A(s, a) is called the advantage function and measures how good the chosen action is in the current state, compared to a baseline. This is very much like the notion of labels in supervised learning, but derived from rewards instead of explicit labeling. To simplify, we can write the advantage as: A(s, a) ≈ reward - baseline.

In practice, the baseline is calculated using a value function. This is a common term in RL that I will explain later. What you need to know for now is that it measures the expected reward we would receive if we continue following the current policy from the state s.

TRPO (Trust Region Policy Optimization) builds on this idea of using the advantage function but adds a critical ingredient for stability: it constrains how far the new policy can deviate from the old policy at each update step (similar to what we do with batch gradient descent, for example).

It introduces a KL divergence term (see it as a measure of similarity) between the current and the old policy: KL(π_old(·|s) ‖ π_θ(·|s)).

It also divides the current policy by the old one. This ratio, π_θ(a|s) / π_old(a|s), multiplied by the advantage function, gives us a sense of how beneficial each update is relative to the old policy.

Putting it all together, TRPO tries to maximize a surrogate objective (which involves the advantage and the policy ratio) subject to a KL divergence constraint.
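In compact form (following the standard TRPO formulation), this reads:

maximise E[ (π_θ(a|s) / π_old(a|s)) · A(s, a) ]   subject to   E[ KL(π_old(·|s) ‖ π_θ(·|s)) ] ≤ δ,

where δ is a small threshold that caps how far the new policy can drift from the old one at each update.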

While TRPO was a significant advancement, it’s no longer used widely in practice, especially for training LLMs, due to its computationally intensive gradient calculations.

Instead, PPO is now the preferred approach in most LLM architectures, including ChatGPT, Gemini, and more.

It is actually quite similar to TRPO, but instead of enforcing a hard constraint on the KL divergence, PPO introduces a “clipped surrogate objective” that implicitly restricts policy updates, and greatly simplifies the optimization process.

Here is a breakdown of the PPO objective function we maximize to tweak our model’s parameters.
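For reference, the standard clipped surrogate objective from the PPO paper is:

L_CLIP(θ) = E[ min( r(θ)·A, clip(r(θ), 1-ε, 1+ε)·A ) ],

where r(θ) = π_θ(a|s) / π_old(a|s) is the policy ratio and ε is a small constant (typically around 0.1–0.2) that bounds how much a single update can move the policy.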

GRPO (Group Relative Policy Optimization).

How is the value function usually obtained?

Let’s first talk more about the advantage and the value functions I introduced earlier.

In typical setups (like PPO), a value model is trained alongside the policy. Its goal is to predict the value of each action we take (each token generated by the model), using the rewards we obtain (remember that the value should represent the expected cumulative reward).

Here is how it works in practice. Take the query “What is 2+2?” as an example. Our model outputs “2+2 is 4” and receives a reward of 0.8 for that response. We then go backward and attribute discounted rewards to each prefix:

“2+2 is” (1 token backward) gets a value of 0.8γ.

“2+2” (2 tokens backward) gets a value of 0.8γ².

where γ is the discount factor (a value between 0 and 1). We then use these prefixes and associated values to train the value model.
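As a small illustration, here is a sketch of this backward attribution (the tokenisation and the γ value are assumptions for the example, not taken from the article):

gamma = 0.9          # assumed discount factor
final_reward = 0.8   # reward returned by the reward model for the full response

tokens = ["2+2", "is", "4"]  # toy tokenisation of the response
prefix_values = {}
for steps_back in range(len(tokens)):
    prefix = " ".join(tokens[: len(tokens) - steps_back])
    prefix_values[prefix] = round(final_reward * gamma ** steps_back, 3)

print(prefix_values)
# {'2+2 is 4': 0.8, '2+2 is': 0.72, '2+2': 0.648}
# these (prefix, value) pairs become training targets for the value model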

Important note: the value model and the reward model are two different things. The reward model is trained before the RL process and uses pairs of (query, response) and human rankings. The value model is trained concurrently with the policy, and aims at predicting the future expected reward at each step of the generation process.

Even though in practice the reward model is often derived from the policy (training only the “head”), we still end up maintaining several models and handling multiple training procedures (policy, reward and value models). GRPO streamlines this by introducing a more efficient method.

In PPO, we decided to use our value function as the baseline. GRPO chooses something else: for each query, GRPO generates a group of responses (a group of size G) and uses their rewards to calculate each response’s advantage as a z-score: Aᵢ = (rᵢ - μ) / σ,

where rᵢ is the reward of the i-th response and μ and σ are the mean and standard deviation of rewards in that group.
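As a quick numerical illustration (the group size and rewards here are made up):

import numpy as np

rewards = np.array([0.2, 0.5, 0.9, 0.4])  # rewards of a group of G=4 responses to one query
advantages = (rewards - rewards.mean()) / rewards.std()
print(advantages.round(2))  # [-1.18  0.    1.57 -0.39] – above-average responses get positive advantages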

This naturally eliminates the need for a separate value model. This idea makes a lot of sense when you think about it! It aligns with the value function we introduced before and also measures, in a sense, an “expected” reward we can obtain. Also, this new method is well adapted to our problem because LLMs can easily generate multiple non-deterministic outputs by using a non-zero temperature (which controls the randomness of token generation).

This is the main idea behind GRPO: getting rid of the value model.

Finally, GRPO adds a KL divergence term (to be exact, GRPO uses a simple approximation of the KL divergence to improve the algorithm further) directly into its objective, comparing the current policy to a reference policy (often the post-SFT model).

And… that’s mostly it for GRPO! I hope this gives you a clear overview of the process: it still relies on the same foundational ideas as TRPO and PPO but introduces additional improvements to make training more efficient, faster, and cheaper — key factors behind DeepSeek’s success.

Reinforcement Learning has become a cornerstone for training today’s Large Language Models, particularly through PPO, and more recently GRPO. Each method rests on the same RL fundamentals – states, actions, rewards, and policies – but adds its own twist to balance stability, efficiency, and human alignment:

• TRPO introduced strict policy constraints via KL divergence.

• PPO eased those constraints with a clipped objective.

• GRPO took an extra step by removing the value model requirement and using group-based reward normalization. Of course, DeepSeek also benefits from other innovations, like high-quality data and other training strategies, but that is for another time!

I hope this article gave you a clearer picture of how these methods connect and evolve. I believe that Reinforcement Learning will become the main focus in training LLMs to improve their performance, surpassing pre-training and SFT in driving future innovations.

If you’re interested in diving deeper, feel free to check out the references below or explore my previous posts.

Thanks for reading, and feel free to leave a clap and a comment!

Want to learn more about Transformers or dive into the math behind the Curse of Dimensionality? Check out my previous articles:

Transformers: How Do They Transform Your Data?

Diving into the Transformers architecture and what makes them unbeatable at language tasks.

The Math Behind “The Curse of Dimensionality”.

Dive into the “Curse of Dimensionality” concept and understand the math behind all the surprising phenomena that arise…


🤷 Quantifying Uncertainty – A Data Scientist’s Intro To Information Theory – Part 2/4: Entropy

Life is like a box of chocolate. Generated using DALL-E.

My momma always said "Life was like a box of chocolates. You never know what you’re gonna get." F. Gump (fictional philosopher and entrepreneur).

This is the second article in a series on information quantification – an essential framework for data scientists. Learning to measure information unlocks powerful tools for improving statistical analyses and refining decision criteria in machine learning.

In this article we focus on entropy – a fundamental concept that quantifies "on average, how surprising is an outcome?" As a measure of uncertainty, it bridges probability theory and real-world applications, offering insights into applications from data diversity to decision-making.

We’ll start with intuitive examples, like coin tosses and rolls of dice 🎲, to build a solid foundation. From there, we’ll explore entropy’s diverse applications, such as evaluating decision tree splits and quantifying DNA diversity 🧬. Finally, we’ll dive into fun puzzles like the Monty Hall problem 🚪🚪🐐 and I’ll refer to a tutorial for optimisation of the addictive WORDLE game 🟨🟩🟩⬛🟩.

No prior knowledge is required – just a basic understanding of probabilities. If you’d like to revisit the foundations, I’ll briefly recap key concepts from the first article and encourage you to explore its nuances further. Whether you’re here to expand your machine learning or data analysis toolkit, or to deepen your understanding of Information Theory, this article will guide you through the concept of entropy and its practical applications.

Throughout I provide python code 🐍, and try to keep formulas as intuitive as possible. If you have access to an integrated development environment (IDE) 🖥 you might want to plug 🔌 and play 🕹 around with the numbers to gain a better intuition.

Note: This section is mostly copied from the previous article, feel free to skip to the next section.

This series is divided into four articles, each exploring a key aspect of information theory:

😲 Quantifying Surprise: In the opening article, you learnt how to quantify the "surprise" of an event using self-information and understand its units of measurement, such as bits and nats. Mastering self-information is essential for building intuition about the subsequent concepts, as all later heuristics are derived from it.

🤷 Quantifying Uncertainty: 👈 👈 👈 YOU ARE HERE. Building on self-information, in this article we shift focus to the uncertainty – or "average surprise" – associated with a variable, known as entropy. We’ll dive into entropy’s wide-ranging applications, from Machine Learning and data analysis to solving fun puzzles, showcasing its adaptability.

📏 Quantifying Misalignment: In the third article we’ll explore how to measure the distance between two probability distributions using entropy-based metrics like cross-entropy and KL-divergence. These measures are particularly valuable for tasks like comparing predicted versus true distributions, as in classification loss functions and other alignment-critical scenarios.

💸 Quantifying Gain: Expanding from single-variable measures, this final article investigates the relationships between two variables. You’ll discover how to quantify the information gained about one variable (e.g., target Y) by knowing another (e.g., predictor X). Applications include assessing variable associations, feature selection, and evaluating clustering performance.

Each article is crafted to stand alone while offering cross-references for deeper exploration. Together, they provide a practical, data-driven introduction to information theory, tailored for data scientists, analysts and machine learning practitioners.

Disclaimer: Unless otherwise mentioned the formulas analysed are for categorical variables with c≥2 classes (2 meaning binary). Continuous variables will be addressed in a separate article.

🚧 Articles (3) and (4) are currently under construction. I will share links once available. Follow me to be notified 🚧.

Note: This section is a brief recap of the first article.

Self-information is considered the building block of quantification of information. It is a way of quantifying the amount of "surprise" of a specific outcome.

Formally self-information, denoted here as _h_ₓ, quantifies the surprise of an event x occurring based on its probability, p(x):

Self-information _h_ₓ is the information of event x that occurs with probability p(x).

The units of measure are called bits. One bit (binary digit) is the amount of information for an event x that has probability p(x)=½. Let’s plug in to verify: hₓ=-log₂(½)= log₂(2)=1.

The choice of -log₂(p) was made as it satisfies several key axioms of information quantification:

1. An event with probability 100% is not surprising and hence does not yield any information. This becomes clear when we see that if p(x)=1, then hₓ=0. A useful analogy is a trick coin (where both sides show HEAD).

2. Less probable events are more surprising and provide more information. This is apparent in the other extreme: p(x) → 0 causes hₓ → ∞.

3. The property of Additivity – where the total self-information of two independent events equals the sum of individual contributions – will be explored further in the upcoming Mutual Information article.

In this series I choose to use the units of measure bits due to the notion of the 50% chance of an event to happen. In section Visual Intuition of Bits with a Box of Chocolate 🍫 below we illustrate its usefulness in the context of entropy.

An alternative commonly used in machine learning is the natural logarithm, which introduces a different unit of measure called nats (short for natural units of information). One nat corresponds to the information gained from an event occurring with a probability of 1/e, where e is Euler’s number (≈2.718). In other words, 1 nat = -ln(1/e).

For further interesting details, examples and python code about self-information and bits please refer to the first article.

Entropy: Quantifying Average Information.

Entropy – Quantifying how much "I don’t know"

So far we’ve discussed the amount of information in bits per event. This raises the question – what is the average amount of information we may learn from the distribution of a variable?

This is called entropy – and may be considered as the uncertainty or average "element of surprise" ¯\_(ツ)_/¯. This means how much information may be learnt when the variable value is determined (i.e., the average self-information).

Formally: given a categorical random variable X, with c possible outcomes xᵢ, i∈{1…c}, each with probability pₓ(xᵢ), the entropy Hₓ is:

Hₓ = Σᵢ pₓ(xᵢ)·hₓᵢ = -Σᵢ pₓ(xᵢ)·log₂(pₓ(xᵢ))

Intuition: Entropy Hₓ is the average of the self-information hₓ of all possible outcomes xᵢ of variable X.

Note that here we use capital Hₓ (i.e., of variable X) for entropy and lower case hₓᵢ for the self-information of event xᵢ. (From here on I will drop the ᵢ, both for convenience and because Medium does not handle LaTeX.)

hₓ = -log₂(pₓ(x)): the self-information for each event x (as discussed in the previous section).

pₓ(x): the weight reflecting the expectancy of its occurrence (i.e., think of the pₓ(x) that is not under the log as a weight wₓ of event x).

A naïve pythonic calculation would look something like this.

import numpy as np

pxs = [0.5, 0.5]  # fair coin distribution: 50% tails, 50% heads
np.sum([-px * np.log2(px) for px in pxs])  # yields 1 bit

However, it is more pragmatic to use the scipy module:

from scipy.stats import entropy

entropy(pxs, base=2)  # note the base keyword! yields 1 bit

This function is preferable because it addresses practical issues that the naïve script above doesn’t, e.g.:

Handling of zero values. This is crucial – try plugging in pxs=[1., 0.] in the numpy version and you will obtain nan due to a RuntimeWarning. This is because 0 is not a valid input to log functions. If you plug it into the scipy version you will obtain the correct 0 bits.

Normalised counts. You can feed counts instead of frequencies, e.g., try plugging in pxs=[14, 14] and you will still get 1 bit.

To gain an intuition it’s instructive to examine a few examples.

Plugging in pxs=[0.75, 0.25] yields ≈0.81 bits, i.e., less than the 1 bit we get when using pxs=[0.5, 0.5].

Can you guess what we should expect if we reverse the values: pxs=[0.25, 0.75]?

We noted that the outcome of pxs=[1., 0.] is zero bits. How about pxs=[0., 1.] ?

To address these it’s imperative to examine a spectrum of outcomes. Using a simple script:
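A minimal sketch of such a script (my reconstruction, using the scipy entropy function introduced above):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import entropy

# entropy of a Bernoulli(p) variable for p ranging from 0 to 1
ps = np.linspace(0, 1, 101)
Hs = [entropy([p, 1 - p], base=2) for p in ps]

plt.plot(ps, Hs, color='purple')
plt.xlabel('$p$')
plt.ylabel('bits')
plt.title('Entropy of a Bernoulli trial')
plt.show()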

We obtain an insightful figure that all data scientists should have ingrained:

There are many learnings from this graph:

Max Uncertainty: The max entropy (uncertainty) happens in the case of a fair coin p=½ → H=1 bit. The intuition is: if all potential outcomes are equally likely, the uncertainty of the average is maximal. This may be extrapolated to any categorical variable with c dimensions.

In the binary case this max uncertainty equals 1 bit. If we examine a categorical variable with c≥3 classes (e.g., with a roll of a die) we will get a larger number of bits, since we are even less certain than in a coin flip. E.g., in the case of c=3, the fair distribution yields -log₂(⅓)≈1.58 bits. In the sections in which we discuss applications we will address the option of standardising to bound entropy between 0 and 1. It involves setting base=c but realising that the units of measure are no longer bits (although related by log₂(c)).

Zero valued uncertainty points: We see that in the cases of p=0 and 1 there is no uncertainty. In other words H=0 bits means full certainty of the outcome of the variable. For the c dimensions case this is when all classes have pₓ(x)=0 except for one category, which we call x*, which has 100% probability pₓ(x*)=1. (And yes, you are correct if you are thinking about classification; we’ll address this below.) The intuition is: the variable outcome is certain and there is nothing learnt in a random draw. (This is the first axiom.)

Symmetry: By definition H(x) is symmetric around p=1/c. The graph above demonstrates this in the binary case around p=½.

To make the concept more tangible let’s pick a point on the graph and assume that we are analysing simplistic weather reports of sun 🌞 and rain 🌧, e.g., p(🌞)=95%, p(🌧)=5% (pxs=[0.95, 0.05]).

Using entropy we calculate that these weather reports contain on average ≈0.29 bits of information.

Rain would be very surprising (quantified as h(🌧)=-log₂(0.05)≈4.32 bits),

but this only happens p(🌧)=5% of the time,

and p(🌞)=95% of the time it is sunny, which provides only h(🌞)=-log₂(0.95)≈0.07 bits.

Hence on average we don’t expect to be as surprised as we would be if p(🌞)=p(🌧)=½.

H = p(🌞)·h(🌞) + p(🌧)·h(🌧) ≈ 0.29 bits.
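We can double-check this with the scipy function used earlier (a quick sanity check, not part of the original walkthrough):

entropy([0.95, 0.05], base=2)  # yields ≈0.286 bits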

I realised that it would be neat to create an entropy graph for the c=3 scenario. Visualising 3D is always tricky, but a key insight to do this is that, defining x, y and z to be the outcomes of a three sided die I take advantage of the relationship p(x)+p(y)+p(z)=1.

Entropy as a function of probability p(x,y,z) of ternary events x,y,z.

Here we see the maximum entropy is, as expected, at p(x)=p(y)=p(z)=⅓, and as we get closer to p(x)=1, p(y)=1 or p(z)=1 (i.e., where the other two probabilities are 0), H(p) drops to 0. The symmetry around the maximum entropy point also holds, but that’s harder to gauge by eye.

The script to generate this is the following (full disclosure: since I don’t enjoy handling mesh grids so it’s based on iterations between a generative model and my fine tuning):
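Here is a simplified sketch of one way to produce such a figure (my reconstruction: it draws a 2D contour over p(x) and p(y), with p(z)=1-p(x)-p(y), whereas the original figure used a 3D view and a mesh grid):

import numpy as np
import matplotlib.pyplot as plt

# grid over p(x) and p(y); p(z) is implied by p(x) + p(y) + p(z) = 1
px, py = np.meshgrid(np.linspace(0, 1, 201), np.linspace(0, 1, 201))
pz = 1 - px - py

def plogp(p):
    # p*log2(p), with the convention 0*log2(0) = 0
    return np.where(p > 0, p * np.log2(np.where(p > 0, p, 1)), 0)

H = -(plogp(px) + plogp(py) + plogp(pz))
H[pz < 0] = np.nan  # mask points outside the probability simplex

plt.contourf(px, py, H, levels=20)
plt.colorbar(label='bits')
plt.xlabel('p(x)')
plt.ylabel('p(y)')
plt.title('Entropy as a function of p(x), p(y), p(z)=1-p(x)-p(y)')
plt.show()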

To summarise this intro to entropy, the main takeaway is:

Entropy reaches its peak when all outcome probabilities are equal – [website], max uncertainty. As certainty grows, entropy reduces until 100% predictability where it is reduced to zero.

For this reason it is also used as a measure of purity of a system. We will discuss this further in applications of decision trees and DNA pooling.

Hopefully, you have gained an intuition about entropy and its relationship with self-information.

While we’ve touched on bits as the unit of measure for information, there’s another way to deepen our understanding. Inspired by 3blue1brown, a YouTube channel renowned for its mathematical visualisations, we will explore a fresh perspective on bits and their significance in quantifying information.

Visual Intuition of Bits with a Box of Chocolate 🍫.

You never know what you’re gonna get. Credit: Wikipedia.

Since a bit is logarithmic in nature, it can be counterintuitive for our linear-thinking brains to grasp. Earlier, we touched on bits within the context of self-information.

Here, we explore bits from a different perspective – through the lens of entropy, which represents the average self-information.

While researching for this series, I was inspired by visual explanations crafted by Grant Sanderson, the creator of the outstanding 3blue1brown mathematics tutorials. Building on his insights, I offer an interpretation that sheds new light on understanding bits.

[The number of bits expresses] "how many times have you cut down the possibilities by half?" Grant Sanderson (3Blue1Brown).

In the Resources section you can watch his amazing video¹ explaining how to use entropy to optimise solving for a popular word game called WORDLE 🟨🟩🟩⬛🟩.

Here I demonstrate a similar interpretation using a different type of set of observations which fictional philosopher Forrest Gump can relate to: 256 pieces of emoji shaped chocolate.

For simplicity we’ll assume that all are equally distributed. Our first question of interest is:

"Which one chocolate emoji did Forrest get?"

256 chocolate emojis. Considering all things even means entropy of 8 bits.

Each has a probability p=1/256 to be chosen.

Meaning each has self-information of h=-log₂(1/256)= 8 bits.

The entropy of the system is the average over all the self-information which is also 8 bits. (Remember that by all emojis being equal we are at peak uncertainty.).

Let’s assume that an observation has been made: "the chosen chocolate piece has an emoji shape with an odd ascii value" (e.g., 🙂 has the hex representation 1F642).

This changes the possibility space in the following way:

Left: Same full possibility space of 256 objects→ Entropy=8 bits. Right: by reducing the possibility spaced by ½ means 1 bit of information has been gained.

The possibility set was reduced by 2 (p=½).

Compared to the 8 bits of the full possibility space we have gained 1 bit of information.

What would the picture look like if the observation was "the ascii value of the chosen emoji has a modulo 4 of zero"?

Left: Same as before. Right: reducing the possibility spaced by ¼ mean 2 bits of information have been gained.

Let’s continue cutting down possibilities until we are left with only 8 emojis (2³ → 3 bits). We can see.

Top Left: Same as before. Top Right: Reducing the possibility spaced by 1/32 means 5 bits of information have been gained. Bottom: For comparison demonstrating all examples of information gain from 1–4 bits.

These examples clearly illustrate that, assuming all emojis are equally likely, the bit count represents the number of times the possibility space must be halved to identify the selected emoji. Each step can also be understood through a c-sided die analogy (c = 256, 128, …, 8).

Both perspectives emphasise the logarithmic nature of bits, as expressed in the self-information formula hₓ=-log₂(p), which is averaged to compute entropy.
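A tiny numerical check of this halving interpretation (my own illustration, assuming the equally-likely emoji setup above):

import numpy as np

full_space = 256  # 8 bits of initial uncertainty
for remaining in [128, 64, 32, 8]:
    print(f"{remaining:>3} emojis left -> {np.log2(full_space / remaining):.0f} bits gained")
# 128 -> 1, 64 -> 2, 32 -> 3, 8 -> 5 bits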

Here’s another way to think about bits (last one, I promise!). Since we’ve seen that entropy decreases with information gain, it can be useful to consider bits as a form of currency ($, £, etc.; no bitcoin ₿ pun intended…).

Imagine having an "uncertainty 🤷‍♂ account" .

When a specific question in probability is posed, an 🤷‍♂️ account is opened, holding a balance of "uncertainty capital" 💰 . As new information is received, this balance decreases, much like making a withdrawal from the account 💸 . You can think of each piece of information gained as reducing uncertainty (or increasing certainty), akin to a negative deposit.

Unlike traditional bank accounts, this one cannot go into debt – there is no overdraft. The lowest possible balance is 0 bits, representing complete certainty. This corresponds to the case where you have full knowledge about a situation, i.e., entropy H=0 → 100% certainty. Recall the first axiom: an event with a 100% probability is perfectly unsurprising and provides no new information.

This is the same idea we saw above with the 256 piece chocolate box. We effectively opened an 🤷‍♂ account with a capital of 8 bits 💰. Each time the possibility space was reduced we made an uncertainty withdrawal 💸.

While not a perfect analogy (transactions are one way only and cannot be negative), it offers an intuitive way to grasp the exchange of bits from entropy to information gain.

Note that in these last two sections, I’ve simplified the examples by using powers of 2 and assuming equal probabilities for each event. In real-world scenarios, however, distributions are rarely so neat or balanced, and many applications don’t neatly align with factors of 2. We’ll dive into more complex, non-uniform distributions in the upcoming sections.

Now that we have a solid grasp of entropy and bits as its unit of measure, let’s explore some real-world applications – ranging from machine learning and applied sciences to math puzzles – that leverage these concepts in practical and often unexpected ways.

ML Application: Purity of Decision Tree Splits.

Decision trees are a fundamental component of popular supervised machine learning algorithms, such as Random Forests and Gradient Boosting Trees, often used for tabular data.

A decision tree follows a top-down, greedy search strategy. It uses recursive partitioning, meaning it repeatedly splits the data into smaller subsets. The objective is to create clusters that are increasingly homogeneous, or "pure."

To achieve this, the algorithm asks a series of questions about the data at each step to decide how to divide the dataset. In classification tasks, the goal is to maximise the increase in purity from parent nodes to child nodes with each split. (For those who don’t mind double negatives: this corresponds to a decrease in impurity.).

The impurity after each split may be quantified as the weighted average of the children, which may very loosely be written as:

Impurity = (nᴸ·Hᴸ + nᴿ·Hᴿ) / (nᴸ + nᴿ)

L and R represent the left and right sides of a splitting, respectively.

nⁱ are the children node sizes for i=L,R.

Hⁱ are the entropies for each child i=L,R.

Let’s learn by example where the parent node has 1,000 entries with a 9:1 split between target positive and negative entries, respectively. The proposed parameter split creates two children with the following splits:

The left child has a 7:1 split (Hᴸ≈0.54 bits) and the right child is purely positive (Hᴿ=0).

The result is an average children impurity of:

Children Impurity = 800/1000 × 0.54 + 200/1000 × 0 ≈ 0.43 bits.

One significant feature of using entropy is that the children’s average impurity is lower than their parent’s.

Children Average Entropy < Parent’s Entropy.

In our example we obtained a children’s average of ≈0.43 bits compared to their parent’s ≈0.47 bits.

The reason for this assertion is the concave shape of the entropy distribution.

To understand this let’s revisit the c=2 entropy graph and add points of interest. We’ll use a slightly different numerical example that nicely visualises the point of purity increase (impurity decrease).

First let’s script up the impurity calculation:
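The helper itself is not shown here, so below is a minimal sketch of a split_impurity function consistent with how it is used in the snippet that follows (the weighted average of the children entropies defined above):

from scipy.stats import entropy

def split_impurity(counts_left, counts_right):
    # weighted average entropy (impurity) of the two children nodes,
    # given their class counts
    n_left, n_right = sum(counts_left), sum(counts_right)
    h_left = entropy(counts_left, base=2)
    h_right = entropy(counts_right, base=2)
    return (n_left * h_left + n_right * h_right) / (n_left + n_right)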

Our working example will include a parent with [700, 300] that splits into a near even node [300, 290] and its complementary near pure one [400, 10]:

# node class frequencies
ps_parent = [700, 300]
ps_childL = [300, 290]
ps_childR = [ps_parent[0] - ps_childL[0], ps_parent[1] - ps_childL[1]]

# node entropies
H_parent = entropy(ps_parent, base=2)
H_childL = entropy(ps_childL, base=2)
H_childR = entropy(ps_childR, base=2)
H_childrenA = split_impurity(ps_childL, ps_childR)

print(f"parent entropy: {H_parent:.3f}")
print(f"childL entropy: {H_childL:.3f}")
print(f"childR entropy: {H_childR:.3f}")
print("-" * 20)
print(f"average child impurity: {H_childrenA:.3f}")

# Yields
# parent entropy: 0.881
# childL entropy: 1.000  # nearly even
# childR entropy: 0.165  # nearly deterministic
# --------------------
# average child impurity: 0.658

The purple solid line is the continuous entropy, as before. The parent node entropy is the black dot over the orange dashed line. The children node entropies are the orange dots at the extremes of the dashed line. We see that even though one child has an undesired higher entropy (less purity; higher impurity) than the parent, this is compensated by the much higher purity of the other child. The x on the orange dashed line is the average child entropy, where the arrow indicates how much purer their average is than that of their parent.

The reduction in impurity from parent to child node average is a consequence of the concave shape of the entropy curve. In the Resources section, I’ve included an article² that highlights this feature, which is shared with another heuristic called the Gini index. This characteristic is often cited as a key reason for choosing entropy over other metrics that lack this property.

For the visual above I used this script:

# calculating entropies for all values ranging 0-1 in intervals of 0.01
entropies = {p_: entropy([p_, 1 - p_], base=2) for p_ in np.arange(0, 1.01, 0.01)}

# plotting
plt.plot(list(entropies.keys()), list(entropies.values()), color='purple')
plt.title('Entropy of a Bernoulli trial')
plt.xlabel('$p$')
plt.ylabel('bits')

# node frequencies
p_parent = ps_parent[0] / sum(ps_parent)
p_childL = ps_childL[0] / sum(ps_childL)
p_childR = ps_childR[0] / sum(ps_childR)

plt.scatter([p_parent], [H_parent], color='black', label='parent')
plt.scatter([p_childL, p_childR], [H_childL, H_childR], color='orange', label='children')
plt.plot([p_childL, p_childR], [H_childL, H_childR], color='orange', linestyle='--')
plt.scatter([p_parent], [H_childrenA], color='green', label='children average', marker="x", linewidth=2)

# draw a narrow arrow between the parent entropy and the children average
# (the 0.01 offsets keep the arrow clear of the markers; approximate values)
plt.annotate('', xy=(p_parent, H_childrenA + 0.01), xytext=(p_parent, H_parent - 0.01),
             arrowprops=dict(facecolor='black', linewidth=1, arrowstyle="-|>"))
plt.legend(title="Nodes")

In this section, I’ve demonstrated how entropy can be used to evaluate the purity of the leaf (child) nodes in a Decision Tree. Those paying close attention will notice that I focused solely on the target variable and ignored the predictors and their splitting values, assuming this choice as a given.

In practice, each split is determined by optimising based on both the predictors and target variables. As we’ve seen, entropy only addresses one variable. To account for both a predictor and a target variable simultaneously, we need a heuristic that captures the relationship between them. We’ll revisit this Decision Tree application when we discuss the Mutual Information article.

Next we’ll continue exploring the concept of population purity, as it plays a key role in scientific applications.

Diversity Application: DNA Library Verification 🧬.

Biology inspired dice that Kelly Boukra 3D printed during our time in a biotech startup LabGenius. They display symbols of the building blocks of DNA (nucleotide bases A-T-C-G) and proteins (20 Amino Acids). From left to right eight sided A-T-C-G (X 2), 20 sided Amino Acids and four sided A-T-C-G.

In the decision tree example, we saw how entropy serves as a powerful tool for quantifying im/purity. Similarly, in the sciences, entropy is often used as a diversity index to measure the variety of different types within a dataset. In certain fields this application is also referred to as Shannon’s diversity index or the Shannon-Wiener index.

An interesting implementation of diversity assessment arises in DNA sequencing. When testing candidate molecules for therapeutics, biologists quantify the diversity of DNA segments in a collection known as a DNA library – an essential step in the process.

These libraries consist of DNA strands that represent slightly different versions of a gene, with variations in their building blocks, called nucleotide bases (or for short nucleotides or bases), at specific positions. These bases are symbolised by the letters A, C, T, and G.

Protein engineers have various demands for possible diversities of nucleotide bases at a given position, e.g., full degeneracy (i.e., even distribution) or non-degenerate (i.e., pure). (There are other demands but that is beyond the scope of this article. Also out of scope: for those interested in how base nucleotides are measured in practice with a device called a DNA sequencer, I briefly discuss this in a Supplementary section below.)

My former colleague Staffan Piledahl at LabGenius (now Briefly-Bio), sought to quantify the diversity of a DNA library, and we realised that entropy is an excellent tool for the task.

He aimed to classify the diversity at each position, i.e., either full degeneracy or non-degeneracy. (For completeness I mention that he also worked on partial degeneracy, but I will ignore it for simplicity.)

Let’s examine an example position that requires full degeneracy, which has an ideal distribution p(🅐)=p(🅒)=p(🅣)=p(🅖)=¼. From what we learnt so far this would mean that the self-information is -log₂(¼)=2 bits. Since all four are equally distributed the entropy is the average of all these, yielding entropy H=2 bits. This is, of course, the max entropy since all possibilities are equal.

Since we have a system of four it may be beneficial to work in base 4, i.e., use log₄ instead of base 2. The advantage of this is standardising the entropy between 0 (no degeneracy) and 1 (full degeneracy). One last note before continuing: by using a different base we are no longer using bits as the unit of measure but rather four-digits, which for lack of creativity I will call fits for short.

In other words in the ideal full-degeneracy case the entropy is maximum at H=2 bits=1 fit.
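To make the unit concrete, here is a quick check with the same scipy entropy function as before (my own illustration; the 95%/5% case anticipates the non-degenerate example below):

from scipy.stats import entropy

entropy([0.25, 0.25, 0.25, 0.25], base=4)  # 1.0 fit: full degeneracy
entropy([0.95, 0.05, 0.00, 0.00], base=4)  # ≈ 0.143 fits: nearly pure
entropy([1.00, 0.00, 0.00, 0.00], base=4)  # 0.0 fits: non-degenerate (pure)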

In reality biology data can be a bit messy and one should create tolerance bounds. We can imagine accepting distributions that deviate slightly from the ideal ¼ for each base (yielding slightly less than 1 fit), or permutations thereof. Setting the boundaries of entropy H that define full degeneracy is beyond the scope of this article and is case specific.

Staffan also pointed out that it is not enough to have a reasonable H range, but also have the "right diversity". This is apparent in the non-degenerate (or pure) case.

Let’s say that at a given position we want ideally to have p(🅐)=1, p(🅒)=p(🅣)=p(🅖)=0 or [1, 0, 0, 0] (→ H=0 fits). This means that reasonable upper boundaries may be [0.95, 0.05, 0, 0] (target base at 95% and another at 5% → H=0.143 fits) or slightly higher at [0.95, 0.0167, 0.0167, 0.0167] (target base at 95% and the rest equally distributed → H≈0.18 fits), depending on the use case.

However, even though [0, 1, 0, 0] (i.e., p(🅒)=1, p(🅐)=p(🅣)=p(🅖)=0) also yields the desired H=0 fits, it is the wrong target diversity (our target is 🅐 not 🅒).

The takeaway is that entropy is a great tool to quantify the diversity, but context is king and should be incorporated in the decision making process, especially when this is automated.

For those interested in the details I will share Staffan Piledahl‘s article when it’s ready. Working title: Easy verification of DNA libraries using Sanger Sequencing and Shannon Diversity index.

In our final example we’ll apply entropy to better understand information in a popular puzzle.

Math Application: 🚪🚪🐐 The Monty Hall Problem 🎩.

The Monty Hall problem is one of the most well known stats brain twisters, and has captured a generation’s imagination due to the simplicity of the setup and … how easy it is to be fooled by it. It even eluded leading statisticians.

The premise is a game show in which a contestant has to choose one of three closed doors to win a prize car. Behind the other two are goats. After choosing a door the host 🎩 opens one of the remaining two, revealing one of the goats. The contestant then needs to decide whether to stay with the original choice or switch to the other unopened door.

The contestant chooses door ☝️ and the host 🎩 reveals door C showing a goat.

🚨🚨 Spoiler alert: Below I’m going to reveal the answer. 🚨🚨.

If you are interested in guessing for yourself and learning more about why this is confusing you may want to check out my deep dive article. Alternatively you could skip to the next section and after reading the other article return to this section which adds information theory context which the deep dive does not.

The correct answer is that (🚨🚨 last chance to look away! 🚨🚨) it is better to switch. The reason is that when switching the probability of winning the prize is ⅔, whereas when staying it remains ⅓, as before the host’s intervention.

At first glance, most people, including myself, incorrectly think that it doesn’t matter if one switches or stays arguing that the choice is a 50:50 split.

Why does the initial chosen door still have ⅓ after the intervention and the remaining one has ⅔?

To understand this we have to examine the probability distributions prior to and after the host’s intervention. Even though prior to the intervention each door has a probability of concealing the prize of [⅓,⅓,⅓], it is better to quantify a "macro probability distribution" before the intervention as [chosen; not chosen]=[⅓;⅔]. This visual illustrates this:

This is useful because this macro distribution does not change after the intervention:

It is crucial to realise that the micro distribution does change to [⅓, ⅔,0]. Armed with self-information and entropy we learn that:

The self-information of the doors shifted from [1.58, 1.58, 1.58] (all -log₂(⅓)≈1.58 bits) to [1.58, 0.58, 0].

The entropy is lowered from the maximum point of ≈1.58 bits (even distribution) to a lower value of ≈0.92 bits (skewed distribution).

This puzzle is more interesting when posing this question with many doors.

Imagine 100 doors where after the contender chooses one the host reveals 98 goats leaving two doors to choose from.

Examine this scenario and decide if the contestant should remain with the original choice 👇 or switch.

The 100 Door Monty Hall problem after the host intervention. Should you stick with your door 👇 or switch?

Why is it so obvious that one should switch when the number of doors is large but not when there are only three?

This is because by opening a door (or multiple doors if c>3), the host provides a lot of information.

Let’s use entropy to quantify how much. We’ll explore with some Python code but first, the following visual exhibits the "macro probability distribution" similar to above but for 100 doors:

The chosen door remains with p=1/100 but the remaining door gets all the probabilities from the revealed ones p=99/100. So we moved from the maximum entropy (even distribution) to a much lower one (highly skewed distribution).

We’ll now use Python scripts to quantify and visualise as well as describe analytically for any number of doors c (as an analogy to a c-sided die).

We’ll start with the standard c=3 problem:

p_before = [1./3, 1./3, 1./3]  # probability of the car being behind door A, B, C before revealing
p_after = [1./3, 2./3, 0.]     # probability of the car being behind door A, B, C after revealing

By now this sort of setup should look familiar.

We can calculate the entropy before and after from these distributions and infer the information gain:

entropy_before = entropy(p_before, base=2)  # for 3 doors yields ≈1.58 bits
entropy_after = entropy(p_after, base=2)    # for 3 doors yields ≈0.92 bits
information_gain = entropy_before - entropy_after  # yields 2/3 bits

If we do a similar calculation for the 4-door problem, p_before=[1/4,1/4,1/4,1/4] and p_after=[1/4,3/4,0,0], the host reveals ≈1.19 bits, i.e., more than in the three-door case.

In the following graph we calculate for c=3–60, where the tail of the arrow is the entropy before the host reveals a door and the arrow head after. The length of the arrow is the information gain. I also added the horizontal entropy=1 line which indicates the incorrect assumption that the choice is a 50:50 split between the two remaining doors.

Arrow tails are the Monty Hall problem entropy before the host intervention. The arrow heads are the entropy after the interventions where the host opens all but two doors revealing only goats. The arrow lengths are the information gain per scenario.

The entropy before the host’s intervention (arrow tails) grows with c. In a supplementary section I show this is log₂(c).

The entropy after the host’s intervention decreases with c. In a supplementary section I show this function is log₂(c) – (c-1)/c * log₂(c-1).

As such the difference between them (the arrow length), i.e., the information gain, is (c-1)/c * log₂(c-1), which grows with c.

The dotted line visualises that in the c=3 case the information gain is quite small (compared to the 50:50 split) but it gradually increases with the number of doors c, which makes switching the more obvious choice.

Once I started thinking about writing on Information Theory, entropy became my personal Roy Kent from Ted Lasso (for the uninitiated: it’s a show about British footballers ⚽️ and their quirky American coach):

It’s here, it’s there it’s every ****-ing-where.

For example, when attending my toddler’s gym class, entropy seemed to emerge from distributions of colourful balls in hoops:

Very low entropy (highly skewed towards blue balls).

Green and yellow are pure (i.e., H=0), and red is nearly pure.

Clearly, most toddlers’ internal colour-classification neural networks are functioning reasonably well.

Another example is conversations with my partner. She once told me that whenever I start with, "Do you know something interesting?" she has no idea what to expect: science, entertainment, sports, philosophy, work related, travel, languages, politics. I’m thinking – high entropy conversation topics.

But when we discuss income that is in the low entropy zone: I tend to stick to the same one liner "it’s going straight to the mortgage".

Finally, some might be familiar with the term entropy from physics. Is there a connection? Shannon’s choice for the term entropy is borrowed from statistical mechanics because it closely resembles a formula used in fields such as thermodynamics – both involve summing terms weighted by probability in the form _p_log(p). A thorough discussion is beyond the scope of this article, but while the connection is mathematically elegant in practical terms they don’t appear to be relatable³.

In this article, we explored entropy as a tool to quantify uncertainty – the "average surprise" expected from a variable. From estimating fairness in coin tosses and dice rolls to assessing the diversity of DNA libraries and evaluating the purity of decision tree splits, entropy emerged as a versatile concept bridging probability theory and real-world applications.

However, as insightful as entropy is, it focuses on problems where the true distribution is known. What happens when we need to compare a predicted distribution to a ground truth?

In the next article, we’ll tackle this challenge by exploring cross-entropy and KL-divergence. These heuristics quantify the misalignment between predicted and true distributions, forming the foundation for critical machine learning tasks, such as classification loss functions. Stay tuned as we continue our journey deeper into the fascinating world of information theory.

💌 Follow me here, join me on LinkedIn or 🍫 buy me some chocolate!

Unless otherwise noted, all images were created by the author.

Many thanks to Staffan Piledahl and Will Reynolds for their useful comments.

Even though I have twenty years of experience in data analysis and predictive modelling I always felt quite uneasy about using concepts in information theory without truly understanding them. For example last year I wanted to calculate the information gain in the Monty Hall problem 🚪🚪🐐 and wasn’t sure how to do this in practice.

The purpose of this series was to put me more at ease with concepts of information theory and hopefully provide for others the explanations I needed.

Check out my other articles, which I wrote to better understand Causality and Bayesian Statistics:

¹ 3blue1brown tutorial showing how to solve WORDLE using entropy 🟨🟩🟩⬛🟩.⁴.

² Decision Tree Splitting: Entropy vs. Misclassification Error by Pooja Tambe.

³ Entropy (information theory)/Relationship to thermodynamic entropy (Wikipedia).

⁴I’m quite proud of my WORDLE stats – no entropy optimisation applied!

As of 2025–01–30. Hoping to make the half year mark 🙂.

Supplementary: 🚪🚪🐐 Monty Hall Information Gain Calculations 🎩.

Here I briefly demonstrate the derivation for the Information Gain in the Monty Hall problem.

As mentioned above the underlying assumptions are that for c doors:

The probability distribution before the host intervention: p(before)=[1/c, …, 1/c], which is of length c. E.g., for c=3 this would be [⅓, ⅓, ⅓].

The probability distribution after the host intervention of revealing c-2 doors that have goats: p(after) = [1/c, (c-1)/c, 0 × {c-2 times}]. E.g., for c=3 this would be [⅓, ⅔, 0] and for c=8 [⅛, ⅞, 0, 0, 0, 0, 0, 0].

Let’s apply p(before) and p(after) to the standard entropy equation:

H(X) = -Σᵢ p(xᵢ)·log₂ p(xᵢ): the entropy of the probability distribution of variable X with c outcomes xᵢ, i=1…c.

In the case of p(before) all the outcomes have the same probability:

H(before) = -Σᵢ (1/c)·log₂(1/c) = log₂(c): the Monty Hall problem entropy prior to opening any of the c doors.

In the case of p(after) only the first two outcomes have non zero probabilities:

H(after) = -[(1/c)·log₂(1/c) + ((c-1)/c)·log₂((c-1)/c)] = … = log₂(c) - ((c-1)/c)·log₂(c-1),

where in the square brackets we have only the first two non-zero terms and the "…" represents many cancellations of terms to obtain the final equation. This is the Monty Hall problem entropy after the host revealed c-2 doors with goats 🐐.

The Information Gain is the difference H(before)-H(after). We see that the log₂(c) term cancels out and we remain with:

IG = H(before) - H(after) = ((c-1)/c)·log₂(c-1): the amount of information conveyed by the Monty Hall host by revealing c-2 doors with goats 🐐.

This information gain is shown as the length of arrows in the graph which I copied here:

(Copied from above.) Arrow tails are the Monty Hall problem entropy before the host intervention. The arrow heads are the entropy after the interventions where the host opens all but two doors revealing only goats. The arrow lengths are the information gain per scenario.

For completeness below I provide the scripts used to generate the image.

import pandas as pd
from scipy.stats import entropy  # assuming SciPy's entropy, consistent with the base=2 usage

n_stats = {}
for n_ in range(3, 61):
    p_before = [1./n_] * n_
    p_after = [1./n_, (n_ - 1)/n_] + [0.] * (n_ - 2)
    system_entropy_before = entropy(p_before, base=2)
    system_entropy_after = entropy(p_after, base=2)
    information_gain = system_entropy_before - system_entropy_after
    # print(f"before: {system_entropy_before}, after {system_entropy_after} bit ({information_gain})")
    n_stats[n_] = {"bits_before": system_entropy_before,
                   "bits_after": system_entropy_after,
                   "information_gain": information_gain}

df_n_stats = pd.DataFrame(n_stats).T
df_n_stats.index.name = "doors"
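For reference, running the loop above for 3 doors should give bits_before = log₂(3) ≈ 1.58, bits_after ≈ 0.92 and information_gain ≈ 0.67 bits, matching the analytical result derived earlier.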

import numpy as np
import matplotlib.pyplot as plt

alpha_ = 0.1  # arrow transparency; the original value was lost in extraction

# Plot arrows from before to after
for i in df_n_stats.index:
    label = None
    if i == 3:
        label = "Information gain"
    information_gain = df_n_stats.loc[i, "bits_after"] - df_n_stats.loc[i, "bits_before"]
    # head_width/head_length values are placeholders; the originals were garbled
    plt.arrow(i, df_n_stats.loc[i, "bits_before"], 0, information_gain,
              head_width=0.3, head_length=0.05,
              fc='gray', ec='gray', alpha=alpha_ * 3, label=label)

plt.xlabel('Number of Doors')
plt.ylabel('Bits')
plt.title('Entropy Before and After Intervention')
plt.plot([3, df_n_stats.index.max()], [1, 1], ':', color="gray", label="50:50 chance")  # x-range endpoint reconstructed
plt.legend()

# Alternative scripts to plot the analytical equations:
# Aux
# ln2_n = np.log2(df_n_stats.index)
# n_minus_1_div_n_logn_minus_1 = (df_n_stats.index - 1)/df_n_stats.index * np.log2(df_n_stats.index - 1)
# H(before)
# h_before = ln2_n
# plt.plot(df_n_stats.index, h_before, ':')
# h_after = ln2_n - n_minus_1_div_n_logn_minus_1
# plt.plot(df_n_stats.index, h_after, ':')

In the main section I’ve discussed distributions of DNA building blocks called nucleotide bases. Here I briefly describe how these are measured in practice. (Full disclaimer: I don’t have a formal background in biology, but learnt a bit on the job in a biotech startup. This explanation is based on communications with a colleague and some clarification using a generative model.)

Devices known as DNA sequencers measure proxies for each building block nucleotide base and tally their occurrences. For example, a Sanger sequencer detects fluorescence intensities for each base at specific positions.

The sequencer output is typically visualised as in this diagram:

At each position, the sequencer provides an intensity measurement, colour-coded by base: A (green), T (red), C (blue), and G (black). These intensities represent the relative abundance of each nucleotide at that position. In most cases, a position is monochromatic, indicating the presence of a single dominant base: what we called above non-degenerate, or pure.

However, some positions show multiple colours, which reflects genetic variation within the sample, indicating diversity. (For simplicity, we’ll assume perfect sequencing and disregard any potential artefacts.)

As an example of a partial-degenerate base combination, in the following visual we can see that at the middle position there is a 50:50 split between C and T:

This reads DNA strands with either C🅒TT or C🅣TT.
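To tie this back to the entropy theme, here is a minimal illustrative sketch (the intensity values are invented for this example) that converts per-position base intensities into probabilities and computes the entropy at each position – a pure position scores 0 bits, while the 50:50 C/T position above scores 1 bit:

import numpy as np

def position_entropy(intensities):
    # Entropy (in bits) of one sequencing position, given intensities for A, T, C, G
    p = np.asarray(intensities, dtype=float)
    p = p / p.sum()          # normalise intensities into probabilities
    p = p[p > 0]             # treat 0·log(0) as 0
    return float(-(p * np.log2(p)).sum())

pure_position = [0.0, 0.0, 1.0, 0.0]    # only C detected
mixed_position = [0.0, 0.5, 0.5, 0.0]   # 50:50 split between T and C

print(position_entropy(pure_position))   # 0.0 bits
print(position_entropy(mixed_position))  # 1.0 bit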


Market Impact Analysis

Market Growth Trend

Year: 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024
Growth: 23.1% | 27.8% | 29.2% | 32.4% | 34.2% | 35.2% | 35.6%

Quarterly Growth Rate

Q1 2024: 32.5% | Q2 2024: 34.8% | Q3 2024: 36.2% | Q4 2024: 35.6%

Market Segments and Growth Drivers

Segment | Market Share | Growth Rate
Machine Learning | 29% | 38.4%
Computer Vision | 18% | 35.7%
Natural Language Processing | 24% | 41.5%
Robotics | 15% | 22.3%
Other AI Technologies | 14% | 31.8%


Competitive Landscape Analysis

Company | Market Share
Google AI | 18.3%
Microsoft AI | 15.7%
IBM Watson | 11.2%
Amazon AI | 9.8%
OpenAI | 8.4%

Future Outlook and Predictions

The AI technology landscape is evolving rapidly, driven by technological advancements, changing threat vectors, and shifting business requirements. Based on current trends and expert analyses, we can anticipate several significant developments across different time horizons:

Year-by-Year Technology Evolution

Based on current trajectory and expert analyses, we can project the following development timeline:

2024Early adopters begin implementing specialized solutions with measurable results
2025Industry standards emerging to facilitate broader adoption and integration
2026Mainstream adoption begins as technical barriers are addressed
2027Integration with adjacent technologies creates new capabilities
2028Business models transform as capabilities mature
2029Technology becomes embedded in core infrastructure and processes
2030New paradigms emerge as the technology reaches full maturity

Technology Maturity Curve

Different technologies within the ecosystem are at varying stages of maturity, influencing adoption timelines and investment priorities:

(Interactive diagram available in full report. Axes: Time / Development Stage versus Adoption / Maturity; stages: Innovation, Early Adoption, Growth, Maturity, Decline/Legacy.)

Innovation Trigger

  • Generative AI for specialized domains
  • Blockchain for supply chain verification

Peak of Inflated Expectations

  • Digital twins for business processes
  • Quantum-resistant cryptography

Trough of Disillusionment

  • Consumer AR/VR applications
  • General-purpose blockchain

Slope of Enlightenment

  • AI-driven analytics
  • Edge computing

Plateau of Productivity

  • Cloud infrastructure
  • Mobile applications

Technology Evolution Timeline

1-2 Years
  • Improved generative models
  • Specialized AI applications
3-5 Years
  • AI-human collaboration systems
  • Multimodal AI platforms
5+ Years
  • General AI capabilities
  • AI-driven scientific breakthroughs

Expert Perspectives

Leading experts in the AI tech sector provide diverse perspectives on how the landscape will evolve over the coming years:

"The next frontier is AI systems that can reason across modalities and domains with minimal human guidance."

— AI Researcher

"Organizations that develop effective AI governance frameworks will gain competitive advantage."

— Industry Analyst

"The AI talent gap remains a critical barrier to implementation for most enterprises."

— Chief AI Officer

Areas of Expert Consensus

  • Acceleration of Innovation: The pace of technological evolution will continue to increase
  • Practical Integration: Focus will shift from proof-of-concept to operational deployment
  • Human-Technology Partnership: Most effective implementations will optimize human-machine collaboration
  • Regulatory Influence: Regulatory frameworks will increasingly shape technology development

Short-Term Outlook (1-2 Years)

In the immediate future, organizations will focus on implementing and optimizing currently available technologies to address pressing AI tech challenges:

  • Improved generative models
  • Specialized AI applications
  • Enhanced AI ethics frameworks

These developments will be characterized by incremental improvements to existing frameworks rather than revolutionary changes, with emphasis on practical deployment and measurable outcomes.

Mid-Term Outlook (3-5 Years)

As technologies mature and organizations adapt, more substantial transformations will emerge in how AI is approached and implemented:

  • AI-human collaboration systems
  • Multimodal AI platforms
  • Democratized AI development

This period will see significant changes in system architecture and operational models, with increasing automation and integration between previously siloed functions. Organizations will shift from reactive to proactive postures.

Long-Term Outlook (5+ Years)

Looking further ahead, more fundamental shifts will reshape how AI is conceptualized and implemented across digital ecosystems:

  • General AI capabilities
  • AI-driven scientific breakthroughs
  • New computing paradigms

These long-term developments will likely require significant technical breakthroughs, new regulatory frameworks, and evolution in how organizations approach AI as a fundamental business function rather than a technical discipline.

Key Risk Factors and Uncertainties

Several critical factors could significantly impact the trajectory of AI tech evolution:

  • Ethical concerns about AI decision-making
  • Data privacy regulations
  • Algorithm bias

Organizations should monitor these factors closely and develop contingency strategies to mitigate potential negative impacts on technology implementation timelines.

Alternative Future Scenarios

The evolution of technology can follow different paths depending on various factors including regulatory developments, investment trends, technological breakthroughs, and market adoption. We analyze three potential scenarios:

Optimistic Scenario

Responsible AI driving innovation while minimizing societal disruption

Key Drivers: Supportive regulatory environment, significant research breakthroughs, strong market incentives, and rapid user adoption.

Probability: 25-30%

Base Case Scenario

Incremental adoption with mixed societal impacts and ongoing ethical challenges

Key Drivers: Balanced regulatory approach, steady technological progress, and selective implementation based on clear ROI.

Probability: 50-60%

Conservative Scenario

Technical and ethical barriers creating significant implementation challenges

Key Drivers: Restrictive regulations, technical limitations, implementation challenges, and risk-averse organizational cultures.

Probability: 15-20%

Scenario Comparison Matrix

Factor | Optimistic | Base Case | Conservative
Implementation Timeline | Accelerated | Steady | Delayed
Market Adoption | Widespread | Selective | Limited
Technology Evolution | Rapid | Progressive | Incremental
Regulatory Environment | Supportive | Balanced | Restrictive
Business Impact | Transformative | Significant | Modest

Transformational Impact

Redefinition of knowledge work, automation of creative processes. This evolution will necessitate significant changes in organizational structures, talent development, and strategic planning processes.

The convergence of multiple technological trends—including artificial intelligence, quantum computing, and ubiquitous connectivity—will create both unprecedented security challenges and innovative defensive capabilities.

Implementation Challenges

Ethical concerns, computing resource limitations, talent shortages. Organizations will need to develop comprehensive change management strategies to successfully navigate these transitions.

Regulatory uncertainty, particularly around emerging technologies like AI in security applications, will require flexible security architectures that can adapt to evolving compliance requirements.

Key Innovations to Watch

Multimodal learning, resource-efficient AI, transparent decision systems. Organizations should monitor these developments closely to maintain competitive advantages and effective security postures.

Strategic investments in research partnerships, technology pilots, and talent development will position forward-thinking organizations to leverage these innovations early in their development cycle.

Technical Glossary

Key technical terms and definitions to help understand the technologies discussed in this article.

Understanding the following technical concepts is essential for grasping the full implications of the technologies discussed in this article. These definitions provide context for both technical and non-technical readers.

platform (intermediate): Platforms provide standardized environments that reduce development complexity and enable ecosystem growth through shared functionality and integration capabilities.

API (beginner): APIs serve as the connective tissue in modern software architectures, enabling different applications and services to communicate and share data according to defined protocols and data formats. (Diagram: how APIs enable communication between different software systems.) Example: Cloud service providers like AWS, Google Cloud, and Azure offer extensive APIs that allow organizations to programmatically provision and manage infrastructure and services.

Other terms used in this article (intermediate): reinforcement learning, machine learning, neural network, deep learning, algorithm, large language model, generative AI.