A Comprehensive Guide to LLM Temperature 🔥🌡️

While building my own LLM-based application, I found many prompt engineering guides, but few equivalent guides for determining the temperature setting.
Of course, temperature is a simple numerical value while prompts can get mindblowingly complex, so it may feel trivial as a product decision. Still, choosing the right temperature can dramatically change the nature of your outputs, and anyone building a production-quality LLM application should choose temperature values with intention.
In this post, we’ll explore what temperature is and the math behind it, potential product implications, and how to choose the right temperature for your LLM application and evaluate it. At the end, I hope that you’ll have a clear course of action to find the right temperature for every LLM use case.
Temperature is a number that controls the randomness of an LLM’s outputs. Most APIs limit the value to be from 0 to 1 or some similar range to keep the outputs in semantically coherent bounds.
“Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.”
Intuitively, it’s like a dial that can adjust how “explorative” or “conservative” the model is when it spits out an answer.
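For concreteness, here is roughly where that dial lives if you happen to be calling a model through the OpenAI Python client (the model name and prompt below are placeholders, not a recommendation):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Suggest a gift for a pickleball fan."}],
    temperature=0.2,  # try 0.9 on the same prompt and compare how much the answers vary
)
print(response.choices[0].message.content)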
Personally, I find the math behind the temperature field very interesting, so I’ll dive into it. But if you’re already familiar with the innards of LLMs or you’re not interested in them, feel free to skip this section.
You probably know that an LLM generates text by predicting the next token after a given sequence of tokens. In its prediction process, it assigns probabilities to all possible tokens that could come next. For example, if the sequence passed to the LLM is “The giraffe ran over to the…”, it might assign high probabilities to words like “tree” or “fence” and lower probabilities to words like “apartment” or “book”.
But let’s back up a bit. How do these probabilities come to be?
These probabilities usually come from raw scores, known as logits, that are the results of many, many neural network calculations and other machine learning techniques. These logits are gold; they contain all the valuable information about what tokens could be selected next. But the problem with these logits is that they don’t fit the definition of a probability: they can be any number, positive or negative, like 2, or 20, or something below zero. They’re not necessarily between 0 and 1, and they don’t necessarily all add up to 1 like a nice probability distribution.
So, to make these logits usable, we need to use a function to transform them into a clean probability distribution. The function typically used here is called the softmax, and it’s essentially an elegant equation that does two critical things:
It turns all the logits into positive numbers. It scales the logits so they add up to 1.
The softmax function works by taking each logit, raising e (around 2.718) to the power of that logit, and then dividing by the sum of all these exponentials. So the highest logit will still get the highest numerator, which means it gets the highest probability. But other tokens, even with negative logit values, will still get a chance.
Now here’s where temperature comes in: temperature modifies the logits before applying softmax. The formula for softmax with temperature is:

p_i = exp(z_i / T) / Σ_j exp(z_j / T)

where z_i is the logit for token i and T is the temperature.
When the temperature is low, dividing the logits by T makes the values larger/more spread out. Then the exponentiation would make the highest value much larger than the others, making the probability distribution more uneven. The model would have a higher chance of picking the most probable token, resulting in a more deterministic output.
When the temperature is high, dividing the logits by T makes all the values smaller/closer together, spreading out the probability distribution more evenly. This means the model is more likely to pick less probable tokens, increasing randomness.
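To see both effects at once, here’s a tiny sketch with made-up logits (not taken from any real model) showing how dividing by T reshapes the softmax distribution:

import numpy as np

def softmax_with_temperature(logits, temperature):
    # Divide logits by T, then apply the standard softmax.
    scaled = np.array(logits) / temperature
    exps = np.exp(scaled - np.max(scaled))  # subtract the max for numerical stability
    return exps / exps.sum()

logits = [4.0, 2.0, 1.0]  # imaginary scores for "tree", "fence", "book"
for t in (0.5, 1.0, 2.0):
    print(f"T={t}: {softmax_with_temperature(logits, t).round(3)}")

At T=0.5 nearly all of the probability mass piles onto the top token, while at T=2 the tail tokens get a real chance, which is exactly the behavior described above.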
Of course, the best way to choose a temperature is to play around with it. I believe any temperature, like any prompt, should be substantiated with example runs and evaluated against other possibilities. We’ll discuss that in the next section.
But before we dive into that, I want to highlight that temperature is a crucial product decision, one that can significantly influence user behavior. It may seem rather straightforward to choose: lower for more accuracy-based applications, higher for more creative applications. But there are tradeoffs in both directions with downstream consequences for user trust and usage patterns. Here are some subtleties that come to mind:
Low temperatures can make the product feel authoritative. More deterministic outputs can create the illusion of expertise and foster user trust. However, this can also make users gullible: if responses are always confident, they might stop critically evaluating the AI’s outputs and just blindly trust them, even when they’re wrong.

Low temperatures can reduce decision fatigue. If you see one strong answer instead of many options, you’re more likely to take action without overthinking. This might lead to easier onboarding or lower cognitive load while using the product. Inversely, high temperatures could create more decision fatigue and lead to churn.

High temperatures can encourage user engagement. The unpredictability of high temperatures can keep users curious (like variable rewards), leading to longer sessions or increased interactions. Inversely, low temperatures might create stagnant user experiences that bore users.

Temperature can affect the way users refine their prompts. When answers are unexpected at high temperatures, users might be driven to clarify their prompts. With low temperatures, users may instead be forced to add more detail or expand on their prompts in order to get new answers.
These are broad generalizations, and of course there are many more nuances with every specific application. But in most applications, the temperature can be a powerful variable to adjust in A/B testing, something to consider alongside your prompts.
As developers, we’re used to unit testing: defining a set of inputs, running those inputs through a function, and getting a set of expected outputs. We sleep soundly at night when we ensure that our code is doing what we expect it to do and that our logic is satisfying some clear-cut constraints.
The promptfoo package lets you perform the LLM-prompt equivalent of unit testing, but there’s some additional nuance. Because LLM outputs are non-deterministic and often designed to do more creative tasks than strictly logical ones, it can be hard to define what an “expected output” looks like.
The simplest evaluation tactic is to have a human rate how good they think some output is. For outputs where you’re looking for a certain “vibe” that you can’t express in words, this will probably be the most effective method.
Another simple evaluation tactic is to use deterministic metrics — these are things like “does the output contain a certain string?” or “is the output valid JSON?” or “does the output satisfy this JavaScript expression?”. If your expected output can be expressed in these ways, promptfoo has your back.
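For instance, a few deterministic assertions on a promptfoo test case might look roughly like this (assertion names quoted from promptfoo’s docs as I remember them, so verify against the current schema):

assert:
  - type: contains
    value: "gift"
  - type: is-json
  - type: javascript
    value: "output.length < 280"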
A more interesting, AI-age evaluation tactic is to use LLM-graded checks. These essentially use LLMs to evaluate your LLM-generated outputs, and can be quite effective if used properly. Promptfoo offers these model-graded metrics in multiple forms. The whole list is here, and it contains assertions from “is the output relevant to the original query?” to “compare the different test cases and tell me which one is best!” to “where does this output rank on this rubric I defined?”.
Let’s say I’m creating a consumer-facing application that comes up with creative gift ideas and I want to empirically determine what temperature I should use with my main prompt.
I might want to evaluate metrics like relevance, originality, and feasibility within a certain budget and make sure that I’m picking the right temperature to optimize those factors. If I’m comparing GPT-4o-mini’s performance with temperatures of 0 vs. 1, my test file might start like this:
- "Come up with a one-sentence creative gift idea for a person who is {{persona}}. It should cost under {{budget}}."
- description: "Mary - attainable, under budget, original"
persona: "a 40 year old woman who loves natural wine and plays pickleball"
- "Check if the gift is easily attainable and reasonable"
- "Check if the gift is likely under $100"
- "Check if the gift would be considered original by the average American adult"
- description: "Sean - answer relevance"
persona: "a 25 year old man who rock climbs, goes to raves, and lives in Hayes Valley"
I’ll probably want to run the test cases repeatedly to test the effects of temperature changes across multiple same-input runs. In that case, I would use the repeat param like:
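Something along these lines (the flag name is from memory, so double-check promptfoo’s CLI help), which reruns every test case several times so you see the spread rather than one lucky or unlucky sample:

promptfoo eval --repeat 5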
Temperature is a simple numerical parameter, but don’t be deceived by its simplicity: it can have far-reaching implications for any LLM application.
Tuning it just right is key to getting the behavior you want — too low, and your model plays it too safe; too high, and it starts spouting unpredictable responses. With tools like promptfoo, you can systematically test different settings and find your Goldilocks zone — not too cold, not too hot, but just right.
OpenAI's o3-mini now lets you see the AI's thought process

OpenAI released its o3-mini model exactly one week ago, offering both free and paid users a more accurate, faster, and cheaper alternative to o1-mini. Now, OpenAI has updated o3-mini to include an "updated chain of thought," and here's why it matters.
OpenAI announced via an X post that free and paid users can now view the reasoning process the o3-mini goes through before arriving at a conclusion. For example, in the post, a user asked, "How is today not a Friday?" and under the dropdown showing how long it took, the model delineated every step in its chain of thought that allowed it to land on its answer.
Understanding how the model arrived at the conclusion is helpful because it not only helps users verify the accuracy of the conclusion, but also teaches them how they could have arrived at that answer themselves. This is particularly useful for math or coding prompts, where seeing the steps could allow you to recreate them the next time you encounter a similar problem.
In the X post announcing the feature, OpenAI throws out the term "Chain of Thought," but what does it actually mean?
In the same way you would ask a person to explain their reasoning step by step, CoT prompting encourages an LLM to break down a complex problem into logical, smaller, solvable steps. By sharing these reasoning steps with users, the model becomes more interpretable, allowing users to better steer its responses and identify errors in reasoning.
Raw CoT would display every intermediate step in real time as the model reasons through a problem. OpenAI's take on CoT in this update is not raw; it summarizes the reasoning for users. This has caused many AI enthusiasts in the comments of the X post to express discontent with the feature, since raw CoT offers added benefits, such as the ability to better steer the model and troubleshoot incorrect reasoning.
Some reasons OpenAI may have chosen a summarized take on CoT are that it is easier for everyone to understand, and that exposing raw CoT could make the model more vulnerable to jailbreaking attempts.
I Tried Making my Own (Bad) LLM Benchmark to Cheat in Escape Rooms

Recently, DeepSeek released their latest model, R1, and article after article came out praising its performance relative to cost, and how the release of such open-source models could genuinely change the course of LLMs forever. That is really exciting! And also too big of a scope to write about… but when a model like DeepSeek comes out of nowhere with a steel chair, boasting similar performance levels to other models, what does performance really mean in this context?
If you follow AI releases, you’ve seen this dance before. Every new model drops with graphs showing how it’s somehow simultaneously better than GPT-4 on math problems while being smaller and more efficient. But what exactly are these benchmarks measuring? How are they created? And more importantly, how can we cut through the hype to create our own benchmarks for specific use cases?
I wanted to learn more about LLM Benchmarking.
Part 1: What is a Benchmark? (in 3 seconds).
TL;DR — The SATs (multiple, actually) for LLMs.
Part 1.5: What is a Benchmark? (in more than 3 seconds).
Before we dive into the nitty-gritty of specific benchmarks, let’s take a moment to unpack what we even mean by “LLM Benchmark.” Because calling them the “SATs for AI” feels both right and also slightly oversimplified.
LLM benchmarks are, at their core, structured tests used to measure how well large language models perform on certain tasks. These tasks can be anything from identifying if a statement is true or false, to summarizing a legal document, to generating valid Python functions. Think of them as curated obstacle courses specially designed by AI researchers to test every relevant muscle these models might have. These frameworks typically provide a dataset of inputs with known correct outputs, allowing for consistent comparison between models.
Modern benchmarks employ various evaluation methodologies. Classification metrics like accuracy work for tasks with discrete correct answers, while overlap-based metrics (BLEU, ROUGE) evaluate free-form text generation. Some benchmarks use functional testing for code generation, or employ other LLMs as judges to evaluate response quality.
A typical benchmark usually comes packaged as:
A standardized dataset of questions, prompts, or tasks (with correct or reference answers).
An evaluation protocol specifying how to measure success, like accuracy, F1 score, BLEU/ROUGE for text generation, or pass/fail rates for coding tasks (the toy sketch after this list shows the simplest version).
A leaderboard or some form of comparative scoreboard, often with big flashy graphs.
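To make “evaluation protocol” concrete: the simplest possible one is exact-match accuracy against reference answers. A toy sketch, not tied to any particular benchmark’s harness:

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of model outputs that exactly match the reference answer."""
    assert len(predictions) == len(references), "one prediction per reference"
    hits = sum(p.strip().lower() == r.strip().lower() for p, r in zip(predictions, references))
    return hits / len(references)

# exact_match_accuracy(["Paris", "4"], ["paris", "5"]) -> 0.5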
Some really famous benchmarks include MMLU for testing multitask language understanding, TruthfulQA for assessing factual accuracy, and HumanEval for measuring coding capabilities. Results are often published on public leaderboards, which lets people make transparent comparisons between different models.
From the DeepSeek paper: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
So what makes a good benchmark? A few properties come up again and again:

A Clear Task Definition: We want tasks that are unambiguous. The more straightforward and well-specified the challenge, the easier it is to trust the results.

Data Integrity: The test set shouldn’t be floating around in the training data. Because if the model’s seen the exact same question 50 times before, the evaluation is about as useful as giving a math quiz to someone who already has the answer key.

Quantifiable Metrics: You need a standard for scoring performance — like how many times the model’s code passes test cases or how close the generated summary is to a “ground-truth” summary.

Task Diversity & Difficulty: If a benchmark is too easy, everyone just ACES it on day one, and we learn… well, nothing. If it’s too niche (like “We test only the model’s ability to count the digits of Pi for 20 minutes”), that’s also not so helpful.
Benchmarks capture only a slice of what LLMs can do. In the real world, your chatbot might need to juggle domain knowledge, keep track of conversation context, abide by your enterprise’s policies, and produce fluent, non-offensive replies. No single standardized test out there fully covers that. As we’ll see in the upcoming case studies, the design and execution of a benchmark can heavily shape the picture you get of your model’s performance… and sometimes lead you astray if you’re not careful with how you measure success.
Now that we have a sense of what LLM benchmarks are designed to accomplish (and where they might fall short), let’s explore a couple of examples to see how people actually build and use them in practice — with mixed results!
Case Study #1: Leetcode as an LLM Benchmark.
As a student in the tech space, the word “Leetcode” popping up during my search for cool benchmarks raised my blood pressure by a statistically significant amount. Unlike Leetcode, which sucks, the paper “A Performance Study of LLM-Generated Code on Leetcode” was very interesting — it asks a deceptively simple question: can we use Leetcode to benchmark LLM code generation? Their findings reveal both the promise and pitfalls of this approach.
The researchers built a three-stage validation system. Local tests catch basic errors, Leetcode’s judge verifies correctness, and a custom benchmarking setup measures performance. This setup revealed something critical: benchmarking code performance is harder than it looks.
When they compared local measurements to Leetcode’s metrics, they found only a weak correlation between the two. Leetcode’s measurements also showed much higher variation than the local ones. Even worse, Leetcode’s rankings proved unstable: identical solutions could drop from the 77th to the 54th percentile based purely on submission timing.
From “A Performance Study of LLM-Generated Code on Leetcode,” 28th International Conference on Evaluation and Assessment in Software Engineering (EASE 2024), Salerno, Italy (2024).
Three major issues emerged that challenge Leetcode’s viability as a benchmark:
Data Contamination: Using public problems risks LLMs having seen the solutions during training. The researchers had to use only problems from 2023 to mitigate this.
Platform Instability: Leetcode’s metrics drift over time — memory measurements correlated measurably with the test date. This makes reproducible benchmarking nearly impossible.
Measurement Reliability: The weak correlation between local and platform measurements raises questions about what we’re actually testing.
This study doesn’t just critique Leetcode — it highlights what we need in a code generation benchmark: reproducible measurements, reliable performance metrics, and guaranteed training-test separation. Until we have platforms built specifically for this purpose, we need to be extremely cautious about using competition platforms as benchmarks.
So! We know that not all benchmarks are viable benchmarks — what about a more mainstream one?
Case Study #2: SuperGLUE — Building a more effective Language Understanding Benchmark.
The SuperGLUE paper tackles a fascinating problem in AI benchmarking: what do you do when models get too good at your tests? When GLUE became insufficient (with models surpassing human performance), the researchers had to rethink how we measure language understanding.
SuperGLUE’s core innovation is its task selection methodology. The researchers collected task proposals from the NLP community and filtered them through a rigorous process: each task needed clear evaluation metrics, public training data, and — most importantly — significant headroom between machine and human performance.
This resulted in eight tasks (I’ve simplified the table from the document here, it’s a little less readable but you should get the sense of what the questions are asking):
From “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems,” 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada (2019).
What makes these tasks special is their diversity in format. Unlike GLUE’s focus on sentence classification, SuperGLUE includes coreference resolution, reading comprehension, and more complex reasoning tasks. Each task measures different aspects of language understanding while maintaining clear, quantifiable metrics.
Part 2: Let’s Build a Physical Reasoning Benchmark: To Cheat at Escape Rooms.
After looking at some benchmarks like SuperGLUE and Leetcode, I had an idea: what if we tested LLMs on something completely different — physical reasoning… through escape room puzzles?
It’s a pretty valid idea — escape rooms pose real stakes and consequences for failure: screw up one too many puzzles, and your friends will think you’re pretty stupid and relegate you to spectator duty. Luckily for us, however, they (or the poor employees) don’t know that you can sneak a phone into an escape room — and you know just who to ask for the answers. Today, LLMs face off against the puzzles of a physical escape room.
Note: This is NOT a rigorous academic benchmark (please don’t cite this in papers, why would you even want to do that?), or even close to it, and it’s just supposed to be a fun way to test LLM benchmarking and evaluation. Please do not destroy my prompts, I am aware they are bad.
For real, though… most LLM benchmarks focus on linguistic tasks (like SuperGLUE) or code generation (like Leetcode). And for good reason — these are well-defined domains with clear evaluation metrics. But real-world problem solving often requires understanding physical principles and their interactions. The famous “Can GPT-4 do physics?” debates usually center around mathematical problem-solving, not practical physical reasoning.
Looking at existing benchmarks taught me a few key principles:
Clear evaluation metrics are crucial (from SuperGLUE’s task-specific scores).
Problems should have unambiguous solutions (from HumanEval’s test cases).
The benchmark should test distinct capabilities (from MMLU’s subject categories).
I settled on escape room puzzles for two reasons. First, they naturally combine physical reasoning with clear goals. Second, they have unambiguous success conditions — either you solve it through the intended way, or you don’t. Third, and most importantly, they let me include “red herrings” — irrelevant items that test if the LLM can identify what matters physically. Fourth, I just really like doing escape rooms (did I mention that already?).
I am aware that this is more than two reasons, but if LLMs can’t count how many rs’ there are in strawberry, I’m allowed to mess up once in a while too.
Here’s how I structured the five core problems:
Fluid Dynamics (FLUID_001) (Ping pong ball stuck in a tube).
Tests understanding of buoyancy and fluid displacement.
Inspired by classic physics problems but in practical context.
Includes intentionally irrelevant items (like squishy food models).
Light Properties (UV_001) (UV light on a push number lock).
Tests understanding of UV fluorescence and material properties.
Combines multiple physical principles (light, material science).
Requires understanding of environmental conditions.
Mechanical Understanding (CIPHER_001) (A cipher ring).
Tests spatial reasoning and mechanical alignment.
No red herrings — tests for correlating a dial to a cipher wheel.
Requires understanding rotational symmetry.
Force Application (VAC_001) (Can stuck in hole).
Tests understanding of vacuum forces and surface adhesion.
Requires understanding force multiplication.
Collaborative Physics (COLLAB_001) (Can two people shimmy a key?).
Tests understanding of physical constraints in multi-agent scenarios.
Requires combining multiple physical principles.
Tests understanding of tool creation and friction.
Sounds really fancy… but it’s just some basic physical puzzles. You can access them on my GitHub.
The benchmark implementation has three main components:
Problems are defined in a structured JSON format that enforces consistent evaluation:
{ "problem_id": "FLUID_001", "setup": { "scenario": "A ping pong ball is at the bottom of a narrow tube...", "available_items": ["bottle of water", "squishy food models"...], "constraints": ["tube too narrow for manual retrieval"] }, "physical_principles": ["buoyancy", "fluid displacement"], "red_herrings": ["squishy food models", "milk carton"], "solution": { "steps": ["pour water into tube", "allow ball to float"], "key_insights": ["water displaces air", "ping pong ball less dense"] } }.
This structure draws from SuperGLUE’s design — each component is clearly separated and machine-readable. The physical_principles field explicitly lists what’s being tested, while red_herrings helps in scoring the LLM’s ability to ignore irrelevant information.
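One nice side effect of a rigid schema is that you can sanity-check every problem file before spending API credits on it. This isn’t the repo’s actual loader, just a sketch of the kind of validation the format makes easy:

import json

REQUIRED_FIELDS = {"problem_id", "setup", "physical_principles", "red_herrings", "solution"}

def load_problem(path: str) -> dict:
    """Load one problem definition and check that the top-level fields are present."""
    with open(path) as f:
        problem = json.load(f)
    missing = REQUIRED_FIELDS - problem.keys()
    if missing:
        raise ValueError(f"{problem.get('problem_id', path)} is missing fields: {sorted(missing)}")
    return problem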
The evaluation system uses Python’s asyncio for concurrent testing, with retry logic for a little bit more API stability:
@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
async def evaluate_response(self, criteria: JudgingCriteria) -> Dict:
    """Evaluate a model's response using GPT-4 as judge."""
    async with aiohttp.ClientSession() as session:
        ...  # evaluation logic
The scoring system looks at three components (combined as sketched after this list):

Physical Understanding Score (PUS) ∈ [0, 2]
Measures understanding of the relevant physical principles.
Calculated as a normalized sum of demonstrated principles.

Solution Path Score (SPS) ∈ [0, 2]
Evaluates completeness and correctness of solution steps.
Considers practical feasibility of proposed solutions.

Red Herring Handling (RHH) ∈ {0, 1}
A binary score for avoiding irrelevant items.
Tests the ability to focus on physically relevant factors.
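How (or whether) the three numbers get combined is yet another arbitrary choice; the radar chart later compares them separately, but if you wanted a single number, an unweighted sum is the laziest reasonable option. A stand-in sketch, not a principled formula:

def total_score(pus: float, sps: float, rhh: int) -> float:
    """Combine the three component scores into one 0-5 number with equal weighting."""
    assert 0 <= pus <= 2 and 0 <= sps <= 2 and rhh in (0, 1)
    return pus + sps + rhh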
And yes, there are many other scoring methods, better and worse, that could be used! For example, RHH could be based on how many irrelevant items are used in the solution, or it could measure how viable each use is… the point is that picking these metrics is often pretty arbitrary, but it is essential to making your benchmark credible, which mine is very much not.
Additionally, I did not want to rewrite any code after. Sue me.
The benchmark supports multiple LLM backends through a common interface:
class ModelInterface:
    """Interface for different LLM APIs."""

    async def generate_response(self, prompt: str) -> str:
        raise NotImplementedError

class GPT4Interface(ModelInterface):
    async def generate_response(self, prompt: str) -> str:
        ...  # GPT-4 specific implementation

class ClaudeInterface(ModelInterface):
    async def generate_response(self, prompt: str) -> str:
        ...  # Claude specific implementation
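For reference, a concrete backend might look something like the sketch below. This is my illustrative version built on the OpenAI async client, not necessarily what’s in the repo:

from openai import AsyncOpenAI

class GPT4Interface(ModelInterface):
    """Minimal concrete backend using the OpenAI async client."""

    def __init__(self, model: str = "gpt-4"):
        self.client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model

    async def generate_response(self, prompt: str) -> str:
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content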
Two models… I can’t really afford any more, please understand.
So after some refilling of API balances, I was ready to go. I ran the benchmark, and after totally not wasting credits on, like, 5 different errors, the results were in. And they were… interesting! I used GPT-4 and Claude 3.5 Haiku (I am aware these are not similarly spec-ed models) as our victims.
The Challenge: A can is stuck in a hole that almost exactly matches its volume. A combination is written on the bottom of the can that needs to be retrieved.
"VAC_001": { "problem_id": "VAC_001", "setup": { "available_items": [ "ruler", "notebook", "suction cup", "UV light" ], "constraints": [ "can fits too tightly for finger grip", "hole is same diameter as can", "can is clean and dry" ] }, "physical_principles": [ "vacuum suction", "surface adhesion", "force multiplication" ], "red_herrings": [ "ruler", "notebook", "UV light" ], "solution": { "steps": [ "inspect suction cup and can surface", "ensure can surface is clean", "firmly press suction cup onto can top", "create vacuum seal", "pull can out using suction cup" ], "key_insights": [ "suction cup creates vacuum seal", "vacuum force allows extraction", "other items don't provide necessary force" ] }.
GPT-4’s response:

“Physical Principles: The main physical principle involved in this solution is the concept of air pressure differential, which is the same principle that allows a suction cup to stick to a surface. Air pressure is higher on one side of the suction cup (outside) compared to the other (inside), and this difference in pressure causes the cup to stick to the surface.

Solution Steps: Attach the suction cup to the top of the can. The flat surface of the can should provide a good contact for the suction cup to adhere to. Once the suction cup is firmly attached, gently pull upwards. The air pressure differential should cause the suction cup (and thus the can) to be lifted out of the hole. Once the can is out of the hole, remove the suction cup from the top of the can. Use the UV light to illuminate the combination written on the bottom of the can. Some inks only become visible under UV light, which is why this step is necessary. Write down the combination on the notebook to ensure it’s not forgotten.

Required Items:
Suction cup: This is required to adhere to the can and pull it out of the hole, as the can fits too tightly for finger grip.
UV light: This is required to reveal the combination written on the bottom of the can, as it may be written in ink that only becomes visible under UV light.
Notebook: This is required to record the combination once it’s visible, ensuring it’s not forgotten.”
Not bad at all! GPT-4 manages to figure out the physical principle behind the solution (even if it is very wordy), and manages to step-by-step describe how to solve the problem. It does end up using the UV light and notebook, but that’s more of a problem on my end, not punishing the use of unnecessary items in the prompt enough. Also, the uses for these items seem decently reasonable… so why not!
And here’s Haiku’s response:

“Physical Principles: The key physical principles involved in this scenario are the concept of volume and the use of suction to extract the can from the hole. Since the can fits almost exactly into the hole, it is likely that the can is stuck due to the tight fit and the lack of any air gaps or space between the can and the hole. This tight fit creates a vacuum-like effect, making it difficult to grip and pull the can out.

Solution Steps: Measure the diameter of the can using the ruler. This will help determine the size of the hole and confirm that the can fits tightly within it. Inspect the bottom of the can to locate the combination written on it. Use the UV light to illuminate the bottom of the can and make the combination visible, if it is not already clearly visible. Attach the suction cup to the top of the can. The suction cup will create a temporary seal on the can’s surface, allowing you to grip and pull it out of the hole. Gently pull the can upwards using the suction cup. The suction force should be strong enough to overcome the vacuum-like effect and extract the can from the hole. Once the can is out, record the combination written on the bottom of the can in the notebook.

Required Items:
Ruler: To measure the diameter of the can and confirm the tight fit in the hole.
UV light: To illuminate the bottom of the can and make the combination visible, if it is not already clearly visible.
Suction cup: To create a temporary seal on the can’s surface and provide the necessary grip to pull the can out of the hole.
Notebook: To record the combination written on the bottom of the can.”
On the other hand, Haiku’s solution was… certainly a solution of all time.
First, it wants to measure the diameter of the can using the ruler??? To determine the size of the hole and confirm that the can fits tightly within it? Why would we need to do this? And do we need a ruler for that?
Second, it tells us to inspect the bottom of the can to locate the combination, when the entire problem is about not being able to pull the can out of the hole conventionally. This might just be an issue of order, but now I truly understand my friends’ feelings whenever I would tell them “just fix it man” to their numerous problems.
But it eventually does get the solution. So… not the worst.
Here’s a fancy radar graph of the results!
We see that both models are pretty similar in their capabilities, with GPT-4 slightly ahead in physical understanding and solution path, and Haiku slightly ahead in red herring handling. Overall though, both models kind of suck. Dang.
If you’d like to see the full breadth of questions, they’re on my GitHub.
By the way, the method I used to generate the evaluations, LLM-as-a-judge, has gained significant traction in the AI community, particularly after the work of Zheng et al. in their 2023 paper “Judging LLM-as-a-Judge.” The technique has proven remarkably effective, achieving over 80% agreement with human evaluators in tasks ranging from code assessment to dialogue quality evaluation!
Here’s where my experiment gets kind of cool (arguably, maybe, subjectively) — I used this methodology and had GPT-4 judge other LLMs’ physical reasoning abilities. Yes, I’m using an AI to judge other AIs.
Why does this work? Well, judging a response is actually a simpler task than generating one. When GPT-4 generates a solution to a physical puzzle, it needs to:

Understand the physical principles involved.
Plan a complete, feasible sequence of steps.
Decide which of the available items actually matter (and which are red herrings).

But when judging, it only needs to check if specific criteria are met in an existing solution. The evaluation prompt is very focused:
def _create_evaluation_prompt(self, criteria: JudgingCriteria) -> str:
    return f"""You are an expert judge evaluating an LLM's understanding of physical reasoning puzzles.

Evaluate based on three criteria:
1. Physical Understanding Score (0-2): Does the solution correctly apply relevant physical principles?
2. Solution Path Score (0-2): Are the steps complete and feasible?
3. Red Herring Handling (0-1): Does it avoid using irrelevant items?

Scenario: {criteria.scenario}
Physical Principles Required: {criteria.correct_principles}
Solution Given: {criteria.model_response}
"""
To validate this approach, I followed the validation framework suggested by Zheng et al., performing spot-checks of GPT-4’s evaluations against my own judgments. Surprisingly (or perhaps unsurprisingly, given the broader research on LLM evaluation), it was remarkably consistent in identifying both correct physical understanding and flawed reasoning.
Is this perfect? Absolutely not. There’s something philosophically weird about using one LLM to evaluate another. But in practice, it can work surprisingly well — just like how I moan and groan about the visual presentation of a dish on Masterchef, while setting my kitchen aflame trying to microwave a hot dog.
Building this benchmark taught me several things about benchmark design:
Clear Metrics Matter: Even for complex tasks like physical reasoning, you need unambiguous scoring criteria.
Red Herrings Are Powerful: Including irrelevant items reveals a lot about an LLM’s reasoning process.
Context Control is Hard: Ensuring LLMs don’t “hallucinate” additional physical context is challenging.
Is this a perfect benchmark? Not even close. Please don’t rub it in. Is it scientifically rigorous? Definitely not. But it’s been a fascinating exploration into an aspect of LLM capabilities, and sometimes the best lessons come from just trying things out and seeing what happens.
Now, if you’ll excuse me, I will be sneaking in a phone with an internet connection into my next escape room, for reasons that I am legally unmotivated to disclose.
[1] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, I. Stoica, “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,” Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), Datasets and Benchmarks Track (2023).
[2] T. Coignion, C. Quinton, R. Rouvoy, “A Performance Study of LLM-Generated Code on Leetcode,” In 28th International Conference on Evaluation and Assessment in Software Engineering (EASE 2024), Salerno, Italy (2024).
[3] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems,” In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada (2019).
[5] DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, X. Zhang, X. Yu, Y. Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao et al., “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” arXiv preprint arXiv:2501.12948 (2025).
[6] Unless otherwise stated, all images are created by the author.
Market Impact Analysis
Market Growth Trend
2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 |
---|---|---|---|---|---|---|
23.1% | 27.8% | 29.2% | 32.4% | 34.2% | 35.2% | 35.6% |
Quarterly Growth Rate
Q1 2024 | Q2 2024 | Q3 2024 | Q4 2024 |
---|---|---|---|
32.5% | 34.8% | 36.2% | 35.6% |
Market Segments and Growth Drivers
Segment | Market Share | Growth Rate |
---|---|---|
Machine Learning | 29% | 38.4% |
Computer Vision | 18% | 35.7% |
Natural Language Processing | 24% | 41.5% |
Robotics | 15% | 22.3% |
Other AI Technologies | 14% | 31.8% |
Competitive Landscape Analysis
Company | Market Share |
---|---|
Google AI | 18.3% |
Microsoft AI | 15.7% |
IBM Watson | 11.2% |
Amazon AI | 9.8% |
OpenAI | 8.4% |
Future Outlook and Predictions
The AI tech landscape is evolving rapidly, driven by technological advancements and shifting business requirements. Based on current trends and expert analyses, we can anticipate several significant developments across different time horizons:
Year-by-Year Technology Evolution
Based on current trajectory and expert analyses, we can project the following development timeline:
Technology Maturity Curve
Different technologies within the ecosystem are at varying stages of maturity, influencing adoption timelines and investment priorities:
Innovation Trigger
- Generative AI for specialized domains
- Blockchain for supply chain verification
Peak of Inflated Expectations
- Digital twins for business processes
- Quantum-resistant cryptography
Trough of Disillusionment
- Consumer AR/VR applications
- General-purpose blockchain
Slope of Enlightenment
- AI-driven analytics
- Edge computing
Plateau of Productivity
- Cloud infrastructure
- Mobile applications
Technology Evolution Timeline
- Improved generative models
- specialized AI applications
- AI-human collaboration systems
- multimodal AI platforms
- General AI capabilities
- AI-driven scientific breakthroughs
Expert Perspectives
Leading experts in the AI tech sector provide diverse perspectives on how the landscape will evolve over the coming years:
"The next frontier is AI systems that can reason across modalities and domains with minimal human guidance."
— AI Researcher
"Organizations that develop effective AI governance frameworks will gain competitive advantage."
— Industry Analyst
"The AI talent gap remains a critical barrier to implementation for most enterprises."
— Chief AI Officer
Areas of Expert Consensus
- Acceleration of Innovation: The pace of technological evolution will continue to increase
- Practical Integration: Focus will shift from proof-of-concept to operational deployment
- Human-Technology Partnership: Most effective implementations will optimize human-machine collaboration
- Regulatory Influence: Regulatory frameworks will increasingly shape technology development
Short-Term Outlook (1-2 Years)
In the immediate future, organizations will focus on implementing and optimizing currently available technologies to address pressing AI tech challenges:
- Improved generative models
- specialized AI applications
- enhanced AI ethics frameworks
These developments will be characterized by incremental improvements to existing frameworks rather than revolutionary changes, with emphasis on practical deployment and measurable outcomes.
Mid-Term Outlook (3-5 Years)
As technologies mature and organizations adapt, more substantial transformations will emerge in how AI is approached and implemented:
- AI-human collaboration systems
- multimodal AI platforms
- democratized AI development
This period will see significant changes in system architecture and operational models, with increasing automation and integration between previously siloed functions. Organizations will shift from reactive to proactive operating postures.
Long-Term Outlook (5+ Years)
Looking further ahead, more fundamental shifts will reshape how AI is conceptualized and implemented across digital ecosystems:
- General AI capabilities
- AI-driven scientific breakthroughs
- new computing paradigms
These long-term developments will likely require significant technical breakthroughs, new regulatory frameworks, and evolution in how organizations approach AI as a fundamental business function rather than a technical discipline.
Key Risk Factors and Uncertainties
Several critical factors could significantly impact the trajectory of AI tech evolution:
Organizations should monitor these factors closely and develop contingency strategies to mitigate potential negative impacts on technology implementation timelines.
Alternative Future Scenarios
The evolution of technology can follow different paths depending on various factors including regulatory developments, investment trends, technological breakthroughs, and market adoption. We analyze three potential scenarios:
Optimistic Scenario
Responsible AI driving innovation while minimizing societal disruption
Key Drivers: Supportive regulatory environment, significant research breakthroughs, strong market incentives, and rapid user adoption.
Probability: 25-30%
Base Case Scenario
Incremental adoption with mixed societal impacts and ongoing ethical challenges
Key Drivers: Balanced regulatory approach, steady technological progress, and selective implementation based on clear ROI.
Probability: 50-60%
Conservative Scenario
Technical and ethical barriers creating significant implementation challenges
Key Drivers: Restrictive regulations, technical limitations, implementation challenges, and risk-averse organizational cultures.
Probability: 15-20%
Scenario Comparison Matrix
Factor | Optimistic | Base Case | Conservative |
---|---|---|---|
Implementation Timeline | Accelerated | Steady | Delayed |
Market Adoption | Widespread | Selective | Limited |
Technology Evolution | Rapid | Progressive | Incremental |
Regulatory Environment | Supportive | Balanced | Restrictive |
Business Impact | Transformative | Significant | Modest |
Transformational Impact
Redefinition of knowledge work, automation of creative processes. This evolution will necessitate significant changes in organizational structures, talent development, and strategic planning processes.
The convergence of multiple technological trends—including artificial intelligence, quantum computing, and ubiquitous connectivity—will create both unprecedented challenges and innovative capabilities.
Implementation Challenges
Ethical concerns, computing resource limitations, talent shortages. Organizations will need to develop comprehensive change management strategies to successfully navigate these transitions.
Regulatory uncertainty, particularly around emerging technologies like AI, will require flexible architectures that can adapt to evolving compliance requirements.
Key Innovations to Watch
Multimodal learning, resource-efficient AI, transparent decision systems. Organizations should monitor these developments closely to maintain competitive advantages.
Strategic investments in research partnerships, technology pilots, and talent development will position forward-thinking organizations to leverage these innovations early in their development cycle.
Technical Glossary
Key technical terms and definitions to help understand the technologies discussed in this article.
Understanding the following technical concepts is essential for grasping the full implications of the technologies discussed in this article. These definitions provide context for both technical and non-technical readers.