What Apple’s controversial research paper really tells us about LLMs

By PineapplesUpdate | June 18, 2025 | 7 min read
Christoph Burgstedt/Science Photo Library/Getty

Generative AI models quickly proved they were capable of performing technical tasks well. Adding reasoning capabilities to the models unlocked unforeseen abilities, enabling the models to think through more complex questions and produce better-quality, more accurate responses – or so we thought.

Last week, Apple released a research paper, “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.” As the title suggests, the 30-page paper dives into whether large reasoning models (LRMs), such as OpenAI’s o1 models, Anthropic’s Claude 3.7 Sonnet Thinking (the reasoning version of the base model, Claude 3.7 Sonnet), and DeepSeek R1, are capable of the advanced “thinking” they advertise.

(Disclosure: ZDNET’s parent company Ziff Davis filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.)

Also: OpenAI’s o1 lies more than any major AI model. Why that matters

Apple tested the models beyond the scope of traditional math and coding benchmarks, framing a series of experiments as various puzzles. The results showed that even the smartest models hit a point of diminishing returns, where throwing more reasoning at a problem of increasing complexity only helps up to a point.

If you are at all interested in the subject, I encourage you to read the paper. If you don’t have time and just want the major themes, however, I unpack them for you below.

What are large reasoning models (LRMs)?

In the research paper, Apple uses the term “large reasoning models” for what we typically just call reasoning models. This type of large language model (LLM) was first popularized by the release of OpenAI’s o1 model, which was later followed by o3.

The concept behind LRMs is simple. Humans are encouraged to think before they speak so that their comments carry more value; similarly, when a model is encouraged to spend more time processing a prompt, the quality of its answer should be higher, and that process should let the model handle more complex prompts well.

Also: Apple’s ‘The Illusion of Thinking’ is shocking – but here’s what it missed

Methods like “chain-of-thought” (CoT) enable this extra thinking. CoT encourages an LLM to break a complex problem into logical, smaller, solvable steps. The model sometimes shares these reasoning steps with users, which makes the model more interpretable and lets users better steer its responses and identify errors in its logic. The raw CoT is often kept private to prevent bad actors from spotting weaknesses that could tell them how to jailbreak a model.
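The mechanics are easy to make concrete. Below is a minimal, hypothetical illustration of chain-of-thought prompting: the prompt wrapper and the step parser are made-up examples for this article, not any real model API.

```python
# A toy sketch of chain-of-thought prompting (no real model is called).

def make_cot_prompt(question: str) -> str:
    """Wrap a question so the model is nudged to reason step by step."""
    return (
        f"Question: {question}\n"
        "Think through the problem in numbered steps, then give the "
        "final answer on a line starting with 'Answer:'."
    )

def split_cot_response(response: str) -> tuple[list[str], str]:
    """Separate the shared reasoning steps from the final answer."""
    steps, answer = [], ""
    for line in response.splitlines():
        line = line.strip()
        if line.lower().startswith("answer:"):
            answer = line.split(":", 1)[1].strip()
        elif line:
            steps.append(line)
    return steps, answer

# A toy response in the format the prompt requests:
reply = "1. Halve 10 to get 5.\n2. Add 2 to get 7.\nAnswer: 7"
steps, answer = split_cot_response(reply)
print(len(steps), answer)  # prints: 2 7
```

Exposing the parsed steps is what makes the reasoning inspectable; hiding them (as many providers do with the raw CoT) trades that interpretability for safety.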

This extra processing means these models require more computing power, and are therefore more expensive or token-heavy, and take longer to return an answer. For this reason, they are not meant for broad, everyday tasks, but are instead reserved for more complex or STEM-related tasks.
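A back-of-the-envelope sketch shows why the hidden reasoning tokens drive up cost; the price and token counts below are placeholder assumptions, not real model rates.

```python
# Hypothetical pricing sketch: reasoning tokens are billed as output,
# so a "thinking" answer can cost many times a plain one.

PRICE_PER_1K_OUTPUT = 0.01   # made-up $ per 1,000 output tokens

def query_cost(answer_tokens: int, reasoning_tokens: int = 0) -> float:
    """Output-side cost of one query, counting hidden reasoning tokens."""
    return (answer_tokens + reasoning_tokens) * PRICE_PER_1K_OUTPUT / 1000

plain = query_cost(answer_tokens=300)
thinking = query_cost(answer_tokens=300, reasoning_tokens=4_000)
print(round(thinking / plain, 1))  # prints: 14.3
```

The ratio depends entirely on how many reasoning tokens the model emits, which, as the paper's results below show, varies with problem complexity.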

This also means the benchmarks used to test these LRMs are typically related to math or coding, which leads to one of Apple’s first qualms in the paper. The company argued that these benchmarks emphasize the final answer, focus less on the reasoning process, and are subject to data contamination. As a result, Apple set up a new experimental paradigm.

The experiments

Apple set up four controllable puzzles: Tower of Hanoi, which involves transferring disks across pegs; checkers jumping, which involves positioning and swapping checker pieces; river crossing, which involves getting entities across a river; and Blocks World, in which users swap colored blocks.

Depiction of the experiments. (Image: Apple)
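The difficulty scaling in one of these puzzles is easy to see concretely. Here is a minimal Tower of Hanoi solver (a generic textbook recursion, not Apple's test harness): the optimal solution for n disks takes 2**n - 1 moves, so each added disk roughly doubles the length of the required move sequence.

```python
# Tower of Hanoi: problem "size" (number of disks) controls difficulty.

def hanoi_moves(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> list[tuple[str, str]]:
    """Return the optimal move sequence for n disks as (from, to) pairs."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, src, dst, aux)    # park n-1 disks on the spare peg
        + [(src, dst)]                       # move the largest disk to the goal
        + hanoi_moves(n - 1, aux, src, dst)  # re-stack the n-1 disks on top
    )

for n in (3, 5, 10):
    print(n, len(hanoi_moves(n)))  # prints: 3 7 / 5 31 / 10 1023
```

This is what makes the puzzles a "controlled" environment: the solution procedure never changes, only the length of the sequence the model must produce.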

Understanding why these experiments were chosen is important to understanding the paper’s results. Apple picked puzzles to get a better look at factors that existing benchmarks fail to account for. In particular, the puzzles allow for a more “controlled” environment, in which the difficulty level can be adjusted while the underlying logic remains the same.

“This environment allows precise manipulation of problem complexity while maintaining consistent logical processes, enabling a more rigorous analysis of reasoning patterns and limitations,” the authors explained in the paper.

The puzzles were used to compare the “thinking” and “non-thinking” versions of popular reasoning models, including Claude 3.7 Sonnet and DeepSeek’s R1 and V3. The authors manipulated difficulty by increasing the size of the problem.

The last important element of the setup is that all models were given the same maximum token budget (64k). Twenty-five samples were then generated with each model, and each model’s average performance across them was recorded.
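As a rough sketch of that scoring setup (the toy "model" and grader below are illustrative assumptions, not Apple's actual evaluation harness):

```python
# Sketch of the evaluation loop: fixed token budget, repeated samples,
# mean accuracy recorded per model per problem size.

TOKEN_BUDGET = 64_000   # shared output cap described in the paper
NUM_SAMPLES = 25        # samples generated per model per problem size

def mean_accuracy(solve, problems) -> float:
    """Fraction of (puzzle, answer) pairs that `solve` gets right."""
    correct = sum(1 for puzzle, answer in problems if solve(puzzle) == answer)
    return correct / len(problems)

# Toy stand-in "model" that fails once the problem size exceeds 3:
def toy_solve(puzzle):
    size, expected = puzzle
    return expected if size <= 3 else None

problems = [((n, "ok"), "ok") for n in range(1, 6)]
print(mean_accuracy(toy_solve, problems))  # prints: 0.6
```

Averaging over many samples per size is what lets the paper plot smooth accuracy curves rather than single pass/fail outcomes.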

The results

The findings showed there are different advantages to using thinking versus non-thinking models at different levels. In the first regime, where problem complexity is low, non-thinking models can perform at the same level as thinking models, if not better, while being more time-efficient.

Figure 5 from the paper. (Image: Apple)

The biggest advantage of thinking models lies in the second, medium-complexity regime, where the performance gap between thinking and non-thinking models widens considerably (illustrated in the figure above). Then, in the third regime, where problem complexity is highest, the performance of both model types fell to zero.

Also: With AI models clobbering every benchmark, it’s time for human evaluation

“Results show that while thinking models delay this collapse, they also ultimately encounter the same fundamental limitations as their non-thinking counterparts,” the authors said.

When testing five state-of-the-art thinking models – o3-mini (medium and high configurations), DeepSeek R1, DeepSeek R1 Qwen 32B, and Claude 3.7 Sonnet Thinking – on the same puzzles used in the first experiment, the authors observed a uniform collapse: as complexity increased, accuracy fell, eventually plateauing at zero.

Figure 6 from the paper: accuracy and thinking tokens versus problem complexity for reasoning models across the puzzle environments. As complexity increases, reasoning models initially spend more tokens while accuracy gradually declines, until a critical point at which reasoning collapses: performance drops sharply and reasoning effort decreases. (Image: Apple)

Even more interesting is the shift in the number of thinking tokens. Initially, as the puzzles grow in complexity, the models accurately allocate the tokens needed to solve the problem. However, as models approach their accuracy drop-off point, they also begin to reduce their reasoning effort, even though the problem is harder and you would expect them to use more.

The paper identifies other shortcomings: for example, even when prompted with the algorithm spelling out the steps needed to solve the problem, the thinking models were still unable to execute it accurately, even though doing so is technically less difficult.

    What does this mean?

Public perception of the paper is divided on what it really means for users. While some have found comfort in its results, saying we are further from AGI than tech CEOs would have us believe, many experts have identified methodological issues.

The critiques identified that high-complexity problems would require a higher token allowance than the one Apple allotted the models, which was capped at 64k. Others noted that some models that would likely have performed well, such as o3-mini and o4-mini, were not included in the experiment. One user even fed the paper to o3 and asked it to identify the methodological issues. ChatGPT had some critiques of its own, such as the token ceiling and statistical noise, as seen below.

I asked o3 to analyze and critique Apple’s new “LLMs can’t reason” paper. Despite its inability to reason, I think it does a pretty good job, no? pic.twitter.com/jvwwqt3nvrt

    – Rohit (@krishnanrohit) June 9, 2025

My interpretation: If you take the paper’s results at face value, the authors do not outright say that LRMs are incapable of reasoning or that they are not worth using. Rather, the paper suggests that these models have limitations that can still be researched and iterated on in the future – a conclusion that holds true for most advancements in the AI space.

The paper still serves as another good reminder that none of these models is infallible, regardless of how advanced they are or how well they perform on benchmarks. Evaluating an LLM based on benchmarks is problematic in itself, as benchmarks often test only for highly specific tasks that do not accurately translate to the everyday applications of these models.

