
Avoidable and Unavoidable Randomness in GPT-4o

Of course there is randomness in GPT-4o’s outputs. After all, the model samples from a probability distribution when choosing each token. But what I didn’t understand was that those very probabilities themselves are not deterministic. Even with consistent prompts, fixed seeds, and temperature set to zero, GPT-4o still introduces subtle, frustrating randomness.

There’s no fix for this, and it might not even be something OpenAI could fix if they wanted to, just so we’re clear up front about where this article is headed. Along the way, we’ll examine all the sources of randomness in GPT-4o’s output, which will require us to break down the sampling process to a low level. We’ll point at the issue (the probabilities vary) and critically examine OpenAI’s official guidance on determinism.

First, though, let’s talk about why determinism matters. Determinism means that the same input always produces the same output, like a mathematical function. While LLM creativity is often desirable, determinism serves crucial purposes: researchers need it for reproducible experiments, developers need it for verifying reported results, and prompt engineers need it for debugging their changes. Without it, you’re left wondering whether different outputs stem from your tweaks or just the random number generator’s mood swings.

We’re going to keep things extremely simple here and prompt the most recent version of GPT-4o (gpt-4o-2024-08-06 in the API) with this:

Flip a coin. Return Heads or Tails only.

Flipping a coin with LLMs is a fascinating topic in itself (see for example Van Koevering & Kleinberg, 2024 in the references), but here we’ll use it as a simple binary question with which to explore determinism, or the lack thereof.

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

prompt = 'Flip a coin. Return Heads or Tails only.'

response = client.chat.completions.create(
    model='gpt-4o-2024-08-06',
    messages=[{'role': 'user', 'content': prompt}],
)
print(response.choices[0].message.content)

Running the code gave me Heads. Maybe you’ll get Tails, or if you’re really lucky, something far more interesting.

The code first initializes an OpenAI client with an API key set in the environment variable OPENAI_API_KEY (to avoid sharing billing credentials here). The main action happens with client.chat.completions.create(), where we specify the model to use and send the prompt (as part of a very simple conversation named messages) to the server. We get an object called response back from the server. This object contains a lot of information, as shown below, so we need to dig into it to extract GPT-4o’s actual response to the message, which is response.choices[0].message.content.

>>> response
ChatCompletion(id='chatcmpl-B48EqZBLfUWtp9H7cwnchGTJbBDwr', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Heads', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1740324680, model='gpt-4o-2024-08-06', object='chat.completion', service_tier='default', system_fingerprint='fp_eb9dce56a8', usage=CompletionUsage(completion_tokens=2, prompt_tokens=18, total_tokens=20, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0)))

Now let’s flip the coin ten times. If this were a real, fair coin, of course, we would expect roughly equal heads and tails over time thanks to the law of large numbers. But GPT-4o’s coin doesn’t work quite like that.

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

prompt = 'Flip a coin. Return Heads or Tails only.'

for _ in range(10):
    response = client.chat.completions.create(
        model='gpt-4o-2024-08-06',
        messages=[{'role': 'user', 'content': prompt}],
    )
    print(response.choices[0].message.content)

Running this code gave me the following output, although you might get different output, of course.

Heads
Heads
Heads
Heads
Heads
Heads
Tails
Heads
Heads
Heads

GPT-4o’s coin is clearly biased. But so are humans. Bar-Hillel, Peer, and Acquisti (2014) found that people flipping imaginary coins choose “heads” 80% of the time. Maybe GPT-4o learned that from us. But whatever the reason, we’re just using this simple example to explore determinism.

Let’s say we wanted to know precisely what percentage of GPT-4o coin flips land Heads.

Rather than the obvious (but expensive) approach of flipping it a million times, there’s a smarter way. For classification tasks with a small set of possible answers, we can extract token probabilities instead of generating full responses. With the right prompt, the first token carries all the necessary information, making these API calls incredibly cheap: around 30,000 calls per dollar, since each requires just 18 (cached) input tokens and 1 output token.

OpenAI gives us (natural) log probabilities. These are called logprobs in the code, and we convert them to regular probabilities by exponentiation. (We’ll discuss temperature soon, but note that exponentiating logprobs directly like this corresponds to a temperature setting of 1.0, and is how we calculate probabilities throughout this article.) OpenAI lets us request logprobs for the top 20 most likely tokens, so we do that.

import os
import math
from openai import OpenAI
from tabulate import tabulate

client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

prompt = 'Flip a coin. Return Heads or Tails only.'

response = client.chat.completions.create(
    model='gpt-4o-2024-08-06',
    max_tokens=1,
    logprobs=True,
    top_logprobs=20,
    messages=[{'role': 'user', 'content': prompt}],
)

logprobs_list = response.choices[0].logprobs.content[0].top_logprobs

data = []
total_pct = 0.0
for logprob_entry in logprobs_list:
    token = logprob_entry.token
    logprob = logprob_entry.logprob
    pct = math.exp(logprob) * 100  # convert the logprob to a percentage
    total_pct += pct
    data.append([token, logprob, pct])

print(
    tabulate(
        data,
        headers=["Token", "Log Probability", "Percentage (%)"],
        tablefmt="github",
        floatfmt=("s", ".10f", ".10f"),
    )
)
print(f"Total probabilities: {total_pct:.10f}%")

If you run this, you’ll get something like the following output, but actual numbers will vary.

| Token     | Log Probability | Percentage (%) |
|-----------|-----------------|----------------|
| Heads     | -0.0380541235   | 96.2660836887  |
| T         | -3.2880542278   | 3.7326407467   |
| Sure      | -12.5380544662  | 0.0003587502   |
| Head      | -12.7880544662  | 0.0002793949   |
| Tail      | -13.2880544662  | 0.0001694616   |
| Certainly | -13.5380544662  | 0.0001319768   |
| "T        | -14.2880544662  | 0.0000623414   |
| I'm       | -14.5380544662  | 0.0000485516   |
| heads     | -14.5380544662  | 0.0000485516   |
| Heads     | -14.9130544662  | 0.0000333690   |
| "         | -15.1630544662  | 0.0000259878   |
| _heads    | -15.1630544662  | 0.0000259878   |
| tails     | -15.5380544662  | 0.0000178611   |
| HEAD      | -15.7880544662  | 0.0000139103   |
| TAIL      | -16.2880535126  | 0.0000084370   |
| T         | -16.7880535126  | 0.0000051173   |
| ```       | -16.7880535126  | 0.0000051173   |
| Here's    | -16.9130535126  | 0.0000045160   |
| I         | -17.2880535126  | 0.0000031038   |
| As        | -17.2880535126  | 0.0000031038   |

Total probabilities:

Looking at these probabilities, we see Heads at ≈96% and T at ≈4%. Our prompt is doing pretty well at constraining the model’s responses. Why T and not Tails? This is the tokenizer splitting Tails into T + ails, while keeping Heads as one piece, as we can see in this Python session:

>>> import tiktoken
>>> encoding = tiktoken.encoding_for_model("gpt-4o-2024-08-06")
>>> encoding.encode('Tails')
[51, 2196]
>>> encoding.decode([51])
'T'
>>> encoding.encode('Heads')
[181043]

But here’s the thing: these probabilities are not deterministic.

Run the code to display the probabilities for the top 20 tokens again, and you’ll likely get different numbers. Here’s what I got on a second run.

| Token     | Log Probability | Percentage (%) |
|-----------|-----------------|----------------|
| Heads     | -0.0110520627   | 98.9008786933  |
| T         | -4.5110521317   | 1.0986894433   |
| Certainly | -14.0110521317  | 0.0000822389   |
| Head      | -14.2610521317  | 0.0000640477   |
| Sure      | -14.2610521317  | 0.0000640477   |
| Tail      | -14.3860521317  | 0.0000565219   |
| heads     | -15.3860521317  | 0.0000207933   |
| Heads     | -15.5110521317  | 0.0000183500   |
| ```       | -15.5110521317  | 0.0000183500   |
| _heads    | -15.6360521317  | 0.0000161938   |
| tails     | -15.6360521317  | 0.0000161938   |
| I'm       | -15.8860521317  | 0.0000126117   |
| "T        | -15.8860521317  | 0.0000126117   |
| As        | -16.3860511780  | 0.0000076494   |
| "         | -16.5110511780  | 0.0000067506   |
| HEAD      | -16.6360511780  | 0.0000059574   |
| TAIL      | -16.7610511780  | 0.0000052574   |
| Here's    | -16.7610511780  | 0.0000052574   |
| ``        | -17.1360511780  | 0.0000036133   |
| T         | -17.6360511780  | 0.0000021916   |

Total probabilities:

In their cookbook, OpenAI offers the following advice on receiving “mostly identical” outputs:

If the seed, request parameters, and system_fingerprint all match across your requests, then model outputs will mostly be identical. There is a small chance that responses differ even when request parameters and system_fingerprint match, due to the inherent non-determinism of our models.

Furthermore, they also give “mostly identical” advice in the reproducible outputs section of their documentation.

The request parameters that could affect randomness are temperature and seed. OpenAI also implies we should track system_fingerprint, because differences there might cause differences in output. We’ll examine each of these below, but spoiler: none of them will fix or even explain this non-determinism.

Temperature controls how random the model’s responses are. Low temperatures make it robotic and predictable, medium temperatures allow some creativity, and high temperatures produce gibberish. Temperature is often called the “creativity parameter”, but this is an oversimplification. In their analysis, Peeperkorn, Kouwenhoven, Brown, and Jordanous (2024) evaluated LLM outputs across four dimensions of creativity: novelty (originality), coherence (logical consistency), cohesion (how well the text flows), and typicality (how well it fits expected patterns). They observed that:

temperature is weakly correlated with novelty, and unsurprisingly, moderately correlated with incoherence, but there is no relationship with either cohesion or typicality.

But this is beside the point for coin flipping. Under the hood, the log probabilities are divided by the temperature before they’re renormalized and exponentiated to be converted to probabilities. This creates a non-linear effect: a temperature of 0.5 squares the probabilities, making likely tokens dominate, while a temperature of 2.0 applies a square root, flattening the distribution.
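To make that arithmetic concrete, here is a minimal sketch of the rescaling, using made-up logprob values rather than real API output:

import math

# Hypothetical logprobs for two tokens (made-up values for illustration)
logprobs = {'Heads': -0.04, 'T': -3.29}

def probabilities_at_temperature(logprobs, temperature):
    # Divide each logprob by the temperature, then renormalize (a softmax)
    scaled = {tok: lp / temperature for tok, lp in logprobs.items()}
    total = sum(math.exp(v) for v in scaled.values())
    return {tok: math.exp(v) / total for tok, v in scaled.items()}

print(probabilities_at_temperature(logprobs, 1.0))  # roughly the raw probabilities
print(probabilities_at_temperature(logprobs, 0.5))  # the likely token dominates even more
print(probabilities_at_temperature(logprobs, 2.0))  # the distribution is flattened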

What about a temperature of 0? Instead of breaking the math by dividing by zero, the model simply picks the highest-probability token. Sounds deterministic, right? Not quite. Here’s the catch: temperature only comes into play after the log probabilities are computed, when we convert them to probabilities.

In summary: if the logprobs aren’t deterministic, setting temperature to 0 won’t make the model deterministic.

In fact, since we’re just asking the model for the raw logprobs directly rather than generating full responses, the temperature setting doesn’t come into play in our code at all.

After temperature is used to compute probabilities, the model samples from these probabilities to pick the next token. OpenAI gives us a little control over the sampling process by letting us set the seed parameter for the random number generator. In an ideal world, setting a seed would give us determinism at any temperature. But seeds only affect sampling, not the log probabilities before sampling.
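As a toy illustration (a purely local simulation, not an API call), here is what a seed does and does not control: it fixes which token gets sampled from a given distribution, but it has no influence on the distribution itself:

import random

# Hypothetical, fixed token probabilities (assumed for illustration)
tokens = ['Heads', 'Tails']
probs = [0.96, 0.04]

rng = random.Random(42)  # a fixed seed makes the sampling repeatable
print([rng.choices(tokens, weights=probs)[0] for _ in range(5)])

rng = random.Random(42)  # same seed, same probabilities: identical draws
print([rng.choices(tokens, weights=probs)[0] for _ in range(5)])

# But if the probabilities themselves drift between calls (as the API's
# logprobs do), re-using the seed cannot restore determinism.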

In summary: if the logprobs aren’t deterministic, setting a seed won’t make the model deterministic.

In fact, seed only matters with non-zero temperatures. With temperature=0, the model always chooses the highest-probability token regardless of the seed. Again, since we’re just asking the model for the raw logprobs directly rather than sampling, neither of these settings can help us achieve determinism.

The system_fingerprint identifies the current combination of model weights, infrastructure, and configuration options in OpenAI’s backend. At least, that’s what OpenAI tells us. Variations in system fingerprints might indeed explain variations in logprobs. Except that they don’t, as we will verify below.

Let’s confirm what we’ve been building toward. We’ll run the same request 10 times with every safeguard in place. Even though neither of these parameters should matter for what we’re doing, you can never be too safe, so we’ll set temperature=0 and seed=42. And to see if infrastructure differences explain our varying logprobs, we’ll print system_fingerprint. Here’s the code:

import os
import math
from openai import OpenAI
from tabulate import tabulate
from tqdm import tqdm

client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

prompt = 'Flip a coin. Return Heads or Tails only.'

data = []
for _ in tqdm(range(10), desc='Generating responses'):
    response = client.chat.completions.create(
        model='gpt-4o-2024-08-06',
        temperature=0,
        seed=42,
        max_tokens=1,
        logprobs=True,
        top_logprobs=20,
        messages=[{'role': 'user', 'content': prompt}],
    )
    fingerprint = response.system_fingerprint
    logprobs_list = response.choices[0].logprobs.content[0].top_logprobs
    heads_logprob = next(
        entry.logprob for entry in logprobs_list if entry.token == 'Heads'
    )
    pct = math.exp(heads_logprob) * 100
    data.append([fingerprint, heads_logprob, f"{pct:.10f}%"])

headers = ["Fingerprint", "Logprob", "Probability"]
print(tabulate(data, headers=headers, tablefmt="pipe"))

Running this 10 times, here are the logprobs and probabilities for the token Heads:

| Fingerprint   | Logprob    | Probability    |
|---------------|------------|----------------|
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf |            | 85.1854886858% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0110521 | 98.9008786933% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |

Mixture-of-experts makes determinism impossible.

OpenAI is decidedly not open about the architecture behind GPT-4o. However, it’s widely believed that GPT-4o uses a mixture-of-experts (MoE) architecture with either 8 or 16 experts.

According to Puigcerver, Riquelme, Mustafa, and Houlsby (2024) (thanks to user elmstedt on the OpenAI forum for the pointer), mixture-of-experts architectures may add an unavoidable level of non-determinism:

Under capacity constraints, all Sparse MoE approaches route tokens in groups of a fixed size and enforce (or encourage) balance within the group. When groups contain tokens from different sequences or inputs, these tokens compete for available spots in expert buffers. Therefore, the model is no longer deterministic at the sequence-level, but only at the batch-level.

In other words, when your prompt (a “sequence” of tokens, in the quote above) reaches OpenAI’s servers, it gets batched with a group of other prompts (OpenAI isn’t open about how many other prompts). Each prompt in the batch is then routed to an “expert” within the model. However, since only so many prompts can be routed to the same expert, the expert your prompt gets routed to will depend on all the other prompts in the batch.

This “competition” for experts introduces a real-world randomness completely beyond our control.

Non-determinism beyond mixture-of-experts.

While non-determinism may be inherent to real-world mixture-of-experts models, that does not seem to be the only source of non-determinism in OpenAI’s models.

Making a few changes to our code above (switching to an older model, looking for the token He since that model’s tokenizer splits “Heads” differently, and ignoring system_fingerprint because the model doesn’t have it) reveals that it, too, exhibits non-deterministic logprobs:

| Logprob     | Probability    |
|-------------|----------------|
| -0.00278289 | 99.7220983436% |
| -0.00415331 | 99.5855302068% |
| -0.00258838 | 99.7414961980% |
| -0.00204034 | 99.7961735289% |
| -0.00240277 | 99.7600117933% |
| -0.00204034 | 99.7961735289% |
| -0.00204034 | 99.7961735289% |
| -0.00258838 | 99.7414961980% |
| -0.00351419 | 99.6491976144% |
| -0.00201214 | 99.7989878007% |

No one is claiming that this older model uses a mixture-of-experts architecture. Thus, there must be additional factors beyond mixture-of-experts contributing to this non-determinism.

What 10,000 GPT-4o coin flip probabilities tell us.

To better understand the patterns and magnitude of this non-determinism, I conducted a more extensive experiment with GPT-4o, performing 10,000 “coin flips” while recording the probability assigned to “Heads” in each case.

The results reveal something fascinating. Across 10,000 API calls with identical parameters, GPT-4o produced not just a few different probability values, but 42 distinct probabilities. If the mixture-of-experts hypothesis were the complete explanation for non-determinism in GPT-4o, we might expect to see one distinct probability for each expert. But GPT-4o is believed to have either 8 or 16 experts, not 42.

In the output below, I clustered these probabilities, ensuring that each cluster was separated from the others by a minimum gap (as a raw percentage). This groups the output into 12 clusters.
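For readers who want to reproduce the grouping, a minimal sketch of this kind of gap-based clustering might look like the following (the threshold here is a placeholder, since the exact value isn’t reproduced above):

def cluster_by_gap(probabilities, threshold):
    # Sort the observed probabilities and start a new cluster whenever the gap
    # to the previous value exceeds the threshold (in raw percentage points)
    clusters = []
    for p in sorted(probabilities):
        if clusters and p - clusters[-1][-1] <= threshold:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters

# A few of the observed values from the output below, with a placeholder threshold
observed = [96.2660836887, 96.2661285161, 98.9008786933, 85.1854886858, 85.1854455275]
for cluster in cluster_by_gap(observed, threshold=0.01):
    print(cluster)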

Probability           Count  Fingerprints
------------------------------------------------------------------
85.1854379113%            5  fp_eb9dce56a8, fp_f9f4fb6dbf
85.1854455275%           74  fp_eb9dce56a8, fp_f9f4fb6dbf
85.1854886858%          180  fp_eb9dce56a8, fp_f9f4fb6dbf
------------------------------------------------------------------
88.0662448207%           31  fp_eb9dce56a8, fp_f9f4fb6dbf
88.0678628883%            2  fp_f9f4fb6dbf
------------------------------------------------------------------
92.3997629747%            1  fp_eb9dce56a8
92.3997733012%            4  fp_eb9dce56a8
92.3997836277%            3  fp_eb9dce56a8
------------------------------------------------------------------
92.4128943690%            1  fp_f9f4fb6dbf
92.4129143363%           21  fp_eb9dce56a8, fp_f9f4fb6dbf
92.4129246643%            8  fp_eb9dce56a8, fp_f9f4fb6dbf
------------------------------------------------------------------
93.9906837191%            4  fp_eb9dce56a8
------------------------------------------------------------------
95.2569999350%           36  fp_eb9dce56a8
------------------------------------------------------------------
96.2660836887%         3391  fp_eb9dce56a8, fp_f9f4fb6dbf
96.2661285161%         2636  fp_eb9dce56a8, fp_f9f4fb6dbf
------------------------------------------------------------------
97.0674551052%            1  fp_eb9dce56a8
97.0674778863%            3  fp_eb9dce56a8
97.0675003058%            4  fp_eb9dce56a8
97.0675116963%            1  fp_eb9dce56a8
97.0680739932%           19  fp_eb9dce56a8, fp_f9f4fb6dbf
97.0681293191%            6  fp_eb9dce56a8, fp_f9f4fb6dbf
97.0681521003%           74  fp_eb9dce56a8, fp_f9f4fb6dbf
97.0682421405%            4  fp_eb9dce56a8
------------------------------------------------------------------
97.7008960695%            1  fp_f9f4fb6dbf
97.7011122645%            3  fp_eb9dce56a8
97.7011462953%            3  fp_eb9dce56a8
97.7018178132%            1  fp_eb9dce56a8
------------------------------------------------------------------
98.2006069902%          426  fp_eb9dce56a8, fp_f9f4fb6dbf
98.2006876548%            6  fp_f9f4fb6dbf
98.2007107019%            1  fp_eb9dce56a8
98.2009525133%            5  fp_eb9dce56a8
98.2009751945%            1  fp_eb9dce56a8
98.2009867181%            1  fp_eb9dce56a8
------------------------------------------------------------------
98.5930987656%            3  fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931104270%          235  fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931222721%            4  fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931340253%            9  fp_eb9dce56a8
98.5931571644%          159  fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931805790%          384  fp_eb9dce56a8
------------------------------------------------------------------
98.9008436920%           95  fp_eb9dce56a8, fp_f9f4fb6dbf
98.9008550214%          362  fp_eb9dce56a8, fp_f9f4fb6dbf
98.9008786933%         1792  fp_eb9dce56a8, fp_f9f4fb6dbf

(With other threshold choices, the same data groups into 13 or 17 clusters.)

As the output above demonstrates, this multitude of results cannot be explained by system_fingerprint values. Across all 10,000 calls, I received only two different system fingerprints: 4488 results with fp_f9f4fb6dbf and 5512 with fp_eb9dce56a8, and for the most part the two system fingerprints returned the same sets of probabilities, rather than each fingerprint producing its own distinct set.

It could be that these 12 clusters of probabilities represent 12 different experts. Even assuming that, the variations within the clusters remain puzzling. These don’t seem likely to be simple rounding errors, because they are too systematic and consistent. Take the giant cluster at around 96.266%, where two distinct probabilities represent over half of our coin flips. The difference between these two probabilities, 0.0000448274%, is tiny but persistent.

There is an underlying randomness in the log probabilities returned by all currently available non-thinking OpenAI models, including GPT-4o and GPT-4o-mini. Because this non-determinism is baked into the log probabilities, there’s no way for a user to get around it. Temperature and seed values have no effect, and system fingerprints don’t explain it.

While mixture-of-experts architectures inherently introduce some randomness in the competition for experts, the non-determinism in GPT-4o seems to go far beyond this, and the non-determinism in the older model tested above can’t be explained by this at all, because it isn’t a mixture-of-experts model.

While we can’t verify this claim any more because the model is no longer being served, this behaviour reportedly wasn’t seen with GPT-3:

It is a symptom that was not seen on prior GPT-3 AI models where across hundreds of trials to investigate sampling, you never had to doubt that logprobs would be the same. Even if you found a top-2 answer that returned exactly the same logprob value via the API, you would never see them switch position or return different values.

This hints that whatever is causing this randomness first emerged in one of the models that followed GPT-3.

But regardless of when it emerged, this non-determinism is a serious obstacle to understanding these models. If you want to study a model (how it generalizes, how it biases responses, how it assigns probabilities to different tokens) you need consistency. But as we’ve seen, even when we lock down every knob OpenAI lets us touch, we still can’t get an answer to the simplest possible question: “what is the probability that GPT-4o says a coin lands heads?”

Worse, while mixture-of-experts explains some of this non-determinism, there are clearly other, hidden sources of randomness that we can’t see, control, or understand. In an ideal world, the API would provide more transparency by telling us which expert processed our request or by offering additional parameters to control this routing process. Without such visibility, we’re left guessing at the true nature of the variability.

Bar-Hillel, M., Peer, E., & Acquisti, A. (2014). “Heads or tails?” – A reachability bias in binary choice. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40(6), 1656–1663.

Peeperkorn, M., Kouwenhoven, T., Brown, D., & Jordanous, A. (2024). Is temperature the creativity parameter of Large Language Models? In The 15th International Conference on Computational Creativity (ICCC’24).

Puigcerver, J., Riquelme, C., Mustafa, B., & Houlsby, N. (2024). From sparse to soft mixtures of experts. In The Twelfth International Conference on Learning Representations (ICLR 2024).

Van Koevering, K., & Kleinberg, J. (2024). How random is random? Evaluating the randomness and humanness of LLMs’ coin flips.


How to Train LLMs to “Think” (o1 & DeepSeek-R1)

In September 2024, OpenAI released its o1 model, trained with large-scale reinforcement learning, giving it “advanced reasoning” capabilities. Unfortunately, the details of how they pulled this off were never shared publicly. Today, however, DeepSeek (an AI research lab) has replicated this reasoning behavior and published the technical details of its approach. In this article, I will discuss the key ideas behind this innovation and describe how they work under the hood.

OpenAI’s o1 model marked a new paradigm for training large language models (LLMs). It introduced so-called “thinking” tokens, which enable a sort of scratch pad that the model can use to think through problems and user queries.

The major insight from o1 was that performance improved with increased test-time compute. This is just a fancy way of saying that the more tokens a model generates, the better its response. The figure below, reproduced from OpenAI’s blog, captures this point nicely.

AIME accuracy scaling with train-time and test-time compute, respectively. Plots reillustrated from [1].

In the plots above, the y-axes are model performance on AIME (math problems), while the x-axes are various compute times. The left plot depicts the well-known neural scaling laws that kicked off the LLM rush of 2023. In other words, the longer a model is trained (train-time compute), the better its performance.

On the right, however, we see a new type of scaling law. Here, the more tokens a model generates (test-time compute), the better its performance.

A key feature of o1 is its so-called “thinking” tokens. These are special tokens introduced during post-training, which delimit the model’s chain of thought (CoT) reasoning (i.e., thinking through the problem). These special tokens are critical for two reasons.

One, they clearly demarcate where the model’s “thinking” starts and stops so it can be easily parsed when spinning up a UI. And two, they produce a human-interpretable readout of how the model “thinks” through the problem.
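For example, a UI could pull the reasoning out of a response with a simple regular expression. The sketch below assumes <think>…</think>-style delimiters (the format DeepSeek-R1 uses, shown later in this article; o1’s actual special tokens are not public):

import re

# Hypothetical model output using <think>...</think> delimiters
output = "<think>2 + 2 = 4, so the answer is 4.</think>The answer is 4."

reasoning_match = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
reasoning = reasoning_match.group(1) if reasoning_match else ""
answer = re.sub(r"<think>.*?</think>", "", output, flags=re.DOTALL).strip()

print("Reasoning:", reasoning)
print("Answer:", answer)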

Although OpenAI disclosed that they used reinforcement learning to produce this ability, the exact details of how they did it were not shared. Today, however, we have a pretty good idea thanks to a recent publication from DeepSeek.

In January 2025, DeepSeek published “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning” [2]. While this paper caused its fair share of pandemonium, its central contribution was unveiling the secrets behind o1.

It introduces two models: DeepSeek-R1-Zero and DeepSeek-R1. The former was trained exclusively on reinforcement learning (RL), and the latter was a mixture of Supervised Fine-tuning (SFT) and RL.

Although the headlines (and title of the paper) were about DeepSeek-R1, the former model is significant because, one, it generated training data for R1, and two, it demonstrates striking emergent reasoning abilities that were not taught to the model.

In other words, R1-Zero discovers CoT and test-time compute scaling through RL alone! Let’s discuss how it works.

Reinforcement learning (RL) is a machine learning approach in which, rather than training models on explicit examples, models learn through trial and error [3]. It works by passing a reward signal to the model, a signal that has no explicit functional relationship with the model’s parameters.

This is similar to how we often learn in the real world. For example, if I apply for a job and don’t get a response, I have to figure out what I did wrong and how to improve. This is in contrast to supervised learning, which, in this analogy, would be like the recruiter giving me specific feedback on what I did wrong and how to improve.

While using RL to train R1-Zero consists of many technical details, I want to highlight 3 key ones: the prompt template, reward signal, and GRPO (Group Relative Policy Optimization).

1) Prompt template.

The template used for training is given below, where {prompt} is replaced with a question from a dataset of (presumably) complex math, coding, and logic problems. Notice the inclusion of <think> and <answer> tags via simple prompting.

A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think> <answer> answer here </answer>. User: {prompt}. Assistant:

Something that stands out here is the minimal and relaxed prompting strategy. This was an intentional choice by DeepSeek to avoid biasing model responses and to observe its natural evolution during RL.

2) Reward signal.

The RL reward has two components: accuracy and format rewards. Since the training dataset consists of questions with clear right answers, a simple rule-based strategy is used to evaluate response accuracy. Similarly, a rule-based formatting reward is used to ensure reasoning tokens are generated in between the thinking tags.
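A rule-based reward of this kind might look something like the sketch below; the exact rules DeepSeek used aren’t spelled out at this level of detail in the paper, so the specific checks here are illustrative assumptions:

import re

def format_reward(response: str) -> float:
    # Reward responses that follow the <think>...</think><answer>...</answer> template
    pattern = r"^<think>.+?</think>\s*<answer>.+?</answer>\s*$"
    return 1.0 if re.match(pattern, response, re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    # Extract the final answer and compare it to the known correct answer
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

response = "<think>12 squared is 144.</think><answer>144</answer>"
print(format_reward(response) + accuracy_reward(response, "144"))  # 2.0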

The authors note that a neural reward model isn’t used (i.e., rewards are not computed by a neural net) because these may be prone to reward hacking. In other words, the LLM learns how to trick the reward model into maximizing rewards while decreasing downstream performance.

This is just like how humans find ways to exploit any incentive structure to maximize their personal gains while forsaking the original intent of the incentives. This highlights the difficulty of producing good rewards (whether for humans or computers).

3) GRPO (Group Relative Policy Optimization).

The final detail is how rewards are translated into model parameter updates. This section is quite technical, so feel free to skip ahead if you prefer.

GRPO is an RL approach that combines a collection of responses to update model parameters. To encourage stable training, the authors also incorporate clipping and KL-divergence regularization terms into the loss function. Clipping ensures optimization steps are not too big, and regularization ensures the model predictions do not change too abruptly.

Here is the complete loss function with some (hopefully) helpful annotations.

Annotated GRPO loss function. Image by author.
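To give a sense of its shape, here is the objective in rough form for a group of G responses o_1, …, o_G sampled for a question q (my own transcription rather than a copy of the paper’s figure, so the notation may differ slightly):

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\Big(\min\big(r_{i,t}(\theta)\,\hat{A}_{i},\ \operatorname{clip}(r_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}_{i}\big) - \beta\,\mathbb{D}_{\mathrm{KL}}\big(\pi_\theta\,\|\,\pi_{\mathrm{ref}}\big)\Big)\right]$$

where $r_{i,t}(\theta) = \pi_\theta(o_{i,t}\mid q, o_{i,<t}) / \pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})$ is the token-level probability ratio and the group-relative advantage $\hat{A}_i = \big(r_i - \operatorname{mean}(r_1,\dots,r_G)\big) / \operatorname{std}(r_1,\dots,r_G)$ normalizes each response’s reward by the group statistics. The clip term keeps each update step small, and the KL penalty keeps the updated policy close to the reference model, matching the stability points mentioned above.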

The most striking result of R1-Zero is that, despite its minimal guidance, it develops effective reasoning strategies that we might recognize.

For example, it learns implicitly through RL to improve responses through test-time compute (recall the earlier insight from o1). This is depicted via the plot below from the R1 paper [2].

Average response length increases with additional training (and improved performance). Screenshot from DeepSeek R1 paper [2].

Another notable ability is reflecting on past responses with an uncanny human-like internal monologue. Here is an example from R1-Zero. Notice the long CoT and additional verification step before generating the final answer (I omitted some parts because the response was very long).

User prompt: If it takes me 1.5 hours to write a codebase and my intern 8 hours, how long will it take us both to write 3 codebases?

To solve for how long it will take both you and your intern to write three codebases together, we need to first find out how many codebases each of you can write per hour individually. (long CoT) ... So, it will take approximately 3.79 hours for you and your intern to write three codebases together. To make sure everything makes sense, let us round the answer to a reasonable number of decimal places (e.g., round to two decimal places): t ≈ 3.79 hours. However, if you need a more "human-friendly" answer, you could convert the decimal part into minutes. Since 0.79 of an hour is: 0.79 × 60 minutes ≈ 47 minutes. So, 3.79 hours is approximately 3 hours and 47 minutes (or roughly 3 hours and 47 minutes if rounding to the nearest minute). ## Verification (optional step for verification) To verify, let us check how many codebases you can individually write in t = 72/19 hours: Your work in 72/19 hours = (2/3 codebases per hour) × (72/19 hours) = (2 × 72) / (3 × 19) = 144 / 57 = 48/19 codebases. (verification process) ... However, a more "exact" answer should be kept in fractional form unless specified otherwise. Thus, the answer is: 72/19 hours, which is approximately 3.79 hours when rounded to two decimal places. 72/19 hours (or approximately 3.79 hours).
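The arithmetic in this example checks out, as a quick computation confirms (assuming the rates of 2/3 and 1/8 codebases per hour used in the response):

from fractions import Fraction

your_rate = Fraction(2, 3)    # codebases per hour (one codebase per 1.5 hours)
intern_rate = Fraction(1, 8)  # codebases per hour (one codebase per 8 hours)

combined_rate = your_rate + intern_rate       # 19/24 codebases per hour
time_for_three = Fraction(3) / combined_rate  # hours needed for 3 codebases

print(time_for_three, "hours ≈", round(float(time_for_three), 2), "hours")  # 72/19 ≈ 3.79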

Although the thinking tokens from R1-Zero give a human-readable window into the model’s “thought process,” the authors note some issues. Namely, the learned CoT sometimes suffers from readability problems and language mixing, suggesting (perhaps) that its reasoning begins to veer away from something easily interpretable by humans.

To mitigate R1-Zero’s interpretability issues, the authors explore a multi-step training strategy that utilizes both supervised fine-tuning (SFT) and RL. This strategy results in DeepSeek-R1, a better-performing model that is getting more attention today. The entire training process can be broken down into 4 steps.

To help get the model on the right track when it comes to learning how to reason, the authors start with SFT. This leverages 1000s of long CoT examples from various sources, including few-shot prompting (i.e., showing examples of how to think through problems), directly prompting the model to use reflection and verification, and refining synthetic data from R1-Zero [2].

The two key advantages of this are, one, the desired response format can be explicitly shown to the model, and two, seeing curated reasoning examples unlocks better performance for the final model.

Step 2: R1-Zero style RL (+ language consistency reward).

Next, an RL training step is applied to the model after SFT. This is done in an identical way as R1-Zero, with an added component to the reward signal that incentivizes language consistency. This was added to the reward because R1-Zero tended to mix languages, making it difficult to read its generations.

At this point, the model likely has on-par (or better) performance than R1-Zero on reasoning tasks. However, this intermediate model wouldn’t be very practical because it wants to reason about any input it receives (e.g., “hi there”), which is unnecessary for factual Q&A, translation, and creative writing. That’s why another SFT round is performed with both reasoning (600k examples) and non-reasoning (200k examples) data.

The reasoning data here is generated from the resulting model from Step 2. Additionally, examples are included which use an LLM judge to compare model predictions to ground truth answers.

The non-reasoning data comes from two places. First, the SFT dataset used to train DeepSeek-V3 (the base model). Second, synthetic data generated by DeepSeek-V3. Note that examples are included that do not use CoT so that the model doesn’t use thinking tokens for every response.

Finally, another RL round is done, which includes (again) R1-Zero-style reasoning training and RL on human feedback. This latter component helps improve the model’s helpfulness and harmlessness.

The result of this entire pipeline is DeepSeek-R1, which excels at reasoning tasks and is an AI assistant you can chat with normally.

Another key contribution from DeepSeek is that the weights of the two models described above (and many other distilled versions of R1) were made publicly available. This means there are many ways to access these models, whether using an inference provider or running them locally.

Here are a few places where I’ve seen these models hosted:

Together (DeepSeek-V3, DeepSeek-R1, and distillations).

Hyperbolic (DeepSeek-V3, DeepSeek-R1-Zero, and DeepSeek-R1).

Ollama (local) (DeepSeek-V3, DeepSeek-R1, and distillations).

The release of o1 introduced a new dimension by which LLMs can be improved: test-time compute. Although OpenAI did not release its secret sauce for doing this, 5 months later, DeepSeek was able to replicate this reasoning behavior and publish the technical details of its approach.

While current reasoning models have limitations, this is a promising research direction because it has demonstrated that reinforcement learning (without humans) can produce models that learn independently. This (potentially) breaks the implicit limitations of current models, which can only recall and remix information previously seen on the internet (i.e., existing human knowledge).

The promise of this new RL approach is that models can surpass human understanding (on their own), leading to new scientific and technological breakthroughs that might take us decades to discover (on our own).



Flora Launches AI Platform with More Creative Controls, Encourages ‘AI Haters’ to Try it

There are various platforms to help people create videos, art, posters, and all kinds of creative assets. You will also find AI artists selling prompts for the same, instead of the art.

Flora is a new startup, founded by Weber Wong (out of an art & technology graduate program called NYU ITP), which aims to bring together the best of such platforms under one roof, geared for creative professionals.

The company’s manifesto states, “We’re a team of creatives who founded Flora to solve our own problem: the lack of creative control in AI.”

The startup is backed by investors like A16Z Games, Menlo Ventures, Long Journey Ventures, and angel investors from Pika, Midjourney, and more.

The founder states that the current platforms are built by “non-creatives for other non-creative people to feel creative.”

Wong explained, “Flora has the best AI models and three types of blocks: text, image, and video. Connect blocks together to have creative control over the generative process, more so than just typing into a prompt box.”

The platform supports sharing the project and collaborating with people in real-time.

While Wong mentioned that it is easy to build creative workflows, you can also find pre-made workflows from the community that you can use. The company worked with design agencies like Pentagram during its alpha stage to populate the workflows.

In an interaction with TechCrunch, Wong expressed a strong opinion that the models themselves are not the creative tools. Rather, Flora presents itself as an “infinite canvas”: a visual interface that helps generate various blocks and connect them together using those models. He emphasised that what matters is not the model or the technology, but the interface.

He further added that he hopes to encourage AI haters to also try Flora.

Flora is available for free with restrictions on the number of projects and the history of generations. The pricing for the platform starts at $16 per month.


Market Impact Analysis

Market Growth Trend

Year    2018   2019   2020   2021   2022   2023   2024
Growth  23.1%  27.8%  29.2%  32.4%  34.2%  35.2%  35.6%

Quarterly Growth Rate

Quarter  Q1 2024  Q2 2024  Q3 2024  Q4 2024
Growth   32.5%    34.8%    36.2%    35.6%

Market Segments and Growth Drivers

Segment                       Market Share  Growth Rate
Machine Learning              29%           38.4%
Computer Vision               18%           35.7%
Natural Language Processing   24%           41.5%
Robotics                      15%           22.3%
Other AI Technologies         14%           31.8%


Competitive Landscape Analysis

Company       Market Share
Google AI     18.3%
Microsoft AI  15.7%
IBM Watson    11.2%
Amazon AI     9.8%
OpenAI        8.4%

Future Outlook and Predictions

The AI technology landscape is evolving rapidly, driven by technological advancements, changing threat vectors, and shifting business requirements. Based on current trends and expert analyses, we can anticipate several significant developments across different time horizons:

Year-by-Year Technology Evolution

Based on current trajectory and expert analyses, we can project the following development timeline:

2024: Early adopters begin implementing specialized solutions with measurable results
2025: Industry standards emerging to facilitate broader adoption and integration
2026: Mainstream adoption begins as technical barriers are addressed
2027: Integration with adjacent technologies creates new capabilities
2028: Business models transform as capabilities mature
2029: Technology becomes embedded in core infrastructure and processes
2030: New paradigms emerge as the technology reaches full maturity

Technology Maturity Curve

Different technologies within the ecosystem are at varying stages of maturity, influencing adoption timelines and investment priorities:


Innovation Trigger

  • Generative AI for specialized domains
  • Blockchain for supply chain verification

Peak of Inflated Expectations

  • Digital twins for business processes
  • Quantum-resistant cryptography

Trough of Disillusionment

  • Consumer AR/VR applications
  • General-purpose blockchain

Slope of Enlightenment

  • AI-driven analytics
  • Edge computing

Plateau of Productivity

  • Cloud infrastructure
  • Mobile applications

Technology Evolution Timeline

1-2 Years
  • Improved generative models
  • specialized AI applications
3-5 Years
  • AI-human collaboration systems
  • multimodal AI platforms
5+ Years
  • General AI capabilities
  • AI-driven scientific breakthroughs

Expert Perspectives

Leading experts in the AI tech sector provide diverse perspectives on how the landscape will evolve over the coming years:

"The next frontier is AI systems that can reason across modalities and domains with minimal human guidance."

— AI Researcher

"Organizations that develop effective AI governance frameworks will gain competitive advantage."

— Industry Analyst

"The AI talent gap remains a critical barrier to implementation for most enterprises."

— Chief AI Officer

Areas of Expert Consensus

  • Acceleration of Innovation: The pace of technological evolution will continue to increase
  • Practical Integration: Focus will shift from proof-of-concept to operational deployment
  • Human-Technology Partnership: Most effective implementations will optimize human-machine collaboration
  • Regulatory Influence: Regulatory frameworks will increasingly shape technology development

Short-Term Outlook (1-2 Years)

In the immediate future, organizations will focus on implementing and optimizing currently available technologies to address pressing AI tech challenges:

  • Improved generative models
  • specialized AI applications
  • enhanced AI ethics frameworks

These developments will be characterized by incremental improvements to existing frameworks rather than revolutionary changes, with emphasis on practical deployment and measurable outcomes.

Mid-Term Outlook (3-5 Years)

As technologies mature and organizations adapt, more substantial transformations will emerge in how security is approached and implemented:

  • AI-human collaboration systems
  • multimodal AI platforms
  • democratized AI development

This period will see significant changes in security architecture and operational models, with increasing automation and integration between previously siloed security functions. Organizations will shift from reactive to proactive security postures.

Long-Term Outlook (5+ Years)

Looking further ahead, more fundamental shifts will reshape how cybersecurity is conceptualized and implemented across digital ecosystems:

  • General AI capabilities
  • AI-driven scientific breakthroughs
  • new computing paradigms

These long-term developments will likely require significant technical breakthroughs, new regulatory frameworks, and evolution in how organizations approach security as a fundamental business function rather than a technical discipline.

Key Risk Factors and Uncertainties

Several critical factors could significantly impact the trajectory of ai tech evolution:

Ethical concerns about AI decision-making
Data privacy regulations
Algorithm bias

Organizations should monitor these factors closely and develop contingency strategies to mitigate potential negative impacts on technology implementation timelines.

Alternative Future Scenarios

The evolution of technology can follow different paths depending on various factors including regulatory developments, investment trends, technological breakthroughs, and market adoption. We analyze three potential scenarios:

Optimistic Scenario

Responsible AI driving innovation while minimizing societal disruption

Key Drivers: Supportive regulatory environment, significant research breakthroughs, strong market incentives, and rapid user adoption.

Probability: 25-30%

Base Case Scenario

Incremental adoption with mixed societal impacts and ongoing ethical challenges

Key Drivers: Balanced regulatory approach, steady technological progress, and selective implementation based on clear ROI.

Probability: 50-60%

Conservative Scenario

Technical and ethical barriers creating significant implementation challenges

Key Drivers: Restrictive regulations, technical limitations, implementation challenges, and risk-averse organizational cultures.

Probability: 15-20%

Scenario Comparison Matrix

Factor                   Optimistic      Base Case    Conservative
Implementation Timeline  Accelerated     Steady       Delayed
Market Adoption          Widespread      Selective    Limited
Technology Evolution     Rapid           Progressive  Incremental
Regulatory Environment   Supportive      Balanced     Restrictive
Business Impact          Transformative  Significant  Modest

Transformational Impact

The redefinition of knowledge work and the automation of creative processes will necessitate significant changes in organizational structures, talent development, and strategic planning processes.

The convergence of multiple technological trends—including artificial intelligence, quantum computing, and ubiquitous connectivity—will create both unprecedented security challenges and innovative defensive capabilities.

Implementation Challenges

Ethical concerns, computing resource limitations, and talent shortages mean organizations will need to develop comprehensive change management strategies to successfully navigate these transitions.

Regulatory uncertainty, particularly around emerging technologies like AI in security applications, will require flexible security architectures that can adapt to evolving compliance requirements.

Key Innovations to Watch

Key innovations to watch include multimodal learning, resource-efficient AI, and transparent decision systems. Organizations should monitor these developments closely to maintain competitive advantages and effective security postures.

Strategic investments in research partnerships, technology pilots, and talent development will position forward-thinking organizations to leverage these innovations early in their development cycle.

Technical Glossary

Key technical terms and definitions to help understand the technologies discussed in this article.

Understanding the following technical concepts is essential for grasping the full implications of the security threats and defensive measures discussed in this article. These definitions provide context for both technical and non-technical readers.


interface (intermediate): Well-designed interfaces abstract underlying complexity while providing clearly defined methods for interaction between different system components.


API (beginner): APIs serve as the connective tissue in modern software architectures, enabling different applications and services to communicate and share data according to defined protocols and data formats. Example: Cloud service providers like AWS, Google Cloud, and Azure offer extensive APIs that allow organizations to programmatically provision and manage infrastructure and services.


platform (intermediate): Platforms provide standardized environments that reduce development complexity and enable ecosystem growth through shared functionality and integration capabilities.
