Apple published a research paper on Saturday in which researchers examine the strengths and weaknesses of recently released reasoning models. Also known as large reasoning models (LRMs), these are models that “think” by spending additional compute to solve complex problems. However, the paper found that even the most powerful models struggle once a problem becomes sufficiently complex. The researchers said that when a problem is highly complex, models experience a total accuracy collapse and give up on the problem rather than using more compute, as they are trained to do.
Apple says reasoning models do not really reason beyond a certain complexity level
In the paper published on Apple’s website, titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity,” the researchers evaluate LRMs alongside their standard large language model (LLM) counterparts, claiming that both break down when faced with sufficiently complex problems.
The paper describes three regimes of complexity: low-complexity tasks, medium-complexity tasks, and high-complexity tasks. To test how LLMs and LRMs perform across this range of complexity, the researchers used several puzzles whose difficulty could be scaled up. One such puzzle was the Tower of Hanoi.
The Tower of Hanoi is a mathematical puzzle with three pegs and several discs. The discs are arranged in decreasing order of size, forming a pyramid-like shape. The goal of the puzzle is to move the discs from the leftmost peg to the rightmost peg, moving one disc at a time. There is a catch: a larger disc can never be placed on top of a smaller one. It is not a very difficult puzzle, and it is often aimed at children between the ages of six and 15.
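To illustrate the puzzle’s rules (this is a standard recursive solution in Python, not code from Apple’s paper):

```python
def hanoi(n, source, target, auxiliary, moves):
    """Move n discs from source to target, never placing a larger disc on a smaller one."""
    if n == 0:
        return
    hanoi(n - 1, source, auxiliary, target, moves)   # clear the way for the largest disc
    moves.append((source, target))                   # move the largest remaining disc
    hanoi(n - 1, auxiliary, target, source, moves)   # restack the smaller discs on top

moves = []
hanoi(3, "left", "right", "middle", moves)
print(len(moves), "moves:", moves)  # 7 moves for 3 discs
```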
[Image: Tower of Hanoi mathematical puzzle. Photo Credit: Apple]
Apple’s researchers chose two reasoning models and their non-reasoning counterparts for this experiment. The selected LLMs were Claude 3.7 Sonnet and DeepSeek-V3, while the LRMs were Claude 3.7 Sonnet with Thinking and DeepSeek-R1. The thinking budget was capped at 64,000 tokens for each. The objective of the experiment was not only to check final-answer accuracy, but also the accuracy of the intermediate reasoning steps taken to solve the puzzle.
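For illustration only, a thinking budget like the one described is typically set through an API parameter. The sketch below is a hypothetical example using the Anthropic Python SDK’s extended-thinking option; the paper does not specify how the budget was configured, and budgets this large may require a model or access tier that supports long outputs.

```python
# Hypothetical sketch: capping a reasoning model's thinking budget at 64,000 tokens.
# Assumes the Anthropic Python SDK and an ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=80000,  # must exceed the thinking budget; may need extended-output access
    thinking={"type": "enabled", "budget_tokens": 64000},  # cap on "thinking" tokens
    messages=[
        {"role": "user", "content": "Solve the Tower of Hanoi with 10 discs, listing every move."}
    ],
)
print(response.content)
```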
For the low-complexity tasks, up to three discs were used; for the medium-complexity tasks, the number of discs ranged from four to 10; and for the high-complexity tasks, between 11 and 20 discs were used.
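For a sense of scale, the minimum number of moves needed to solve an n-disc Tower of Hanoi is 2^n − 1 (a standard result, not a figure from the paper), so the high-complexity regime demands very long, exact solutions:

```python
# Minimum moves for an n-disc Tower of Hanoi: 2**n - 1 (standard result).
for n in (3, 10, 20):
    print(f"{n} discs -> {2**n - 1:,} moves")
# 3 discs -> 7 moves
# 10 discs -> 1,023 moves
# 20 discs -> 1,048,575 moves
```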
The researchers stated that both LLMs and LRMs displayed similar ability in solving the low-complexity tasks. As the difficulty increased, the reasoning models, given their additional thinking budget, were able to solve the puzzles more accurately. However, when the tasks reached the high-complexity region, both kinds of models showed a complete collapse of reasoning.
The researchers said the same experiment was also repeated with more models and more puzzles, such as Checkers Jumping, River Crossing, and Blocks World.
Apple’s research paper highlights a concern that has already been voiced by many others in the artificial intelligence (AI) space. While reasoning models can generalise within the distribution of their training data, whenever a problem falls outside it, the models struggle to “think” and either try to take shortcuts to find a solution or give up entirely and collapse.
“Current evaluations primarily focus on established mathematical and coding benchmarks, emphasising final answer accuracy. However, this evaluation paradigm often suffers from data contamination and does not provide insights into the structure and quality of the reasoning traces,” the company said in a post.