Apple researchers have tested advanced AI reasoning models, known as large reasoning models (LRMs), in controlled puzzle environments and found that while they outperform ‘standard’ large language models (LLMs) on moderately complex tasks, both fail completely as complexity increases.

The researchers from Apple, which is not exactly at the forefront of AI development, believe that current LRMs and LLMs have fundamental limits in their ability to generalize reasoning, that is, to think the way humans do.

Apple researchers studied how advanced AI models — the Claude 3.7 Sonnet Thinking and DeepSeek-R1 LRMs — handle increasingly complex problem-solving tasks. They moved beyond standard math and coding benchmarks and designed controlled puzzle environments, such as Tower of Hanoi and River Crossing, where they could precisely adjust problem complexity. Their goal was to evaluate not just final answers but also the internal reasoning processes of these models, comparing them to standard large language models under equal computational conditions. Through the puzzles, they aimed to uncover the true strengths and fundamental limits of AI reasoning.
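To illustrate what a controlled puzzle environment with adjustable complexity can look like, here is a minimal Python sketch of a Tower of Hanoi checker. It is not the researchers' code: the function name `validate_hanoi`, the peg numbering (0, 1, 2), and the representation of a move as a `(source, destination)` pair are all assumptions made for this example. The number of disks acts as the complexity dial, and the checker replays every move rather than looking only at the final answer.

```python
def validate_hanoi(n_disks, moves):
    """Return True if `moves` legally transfers all disks from peg 0 to peg 2.

    Hypothetical helper for illustration: pegs are numbered 0-2 and each
    move is a (source_peg, destination_peg) pair.
    """
    # Disks are integers; a larger number means a larger disk.
    # Each peg is a stack with the largest disk at the bottom.
    pegs = [list(range(n_disks, 0, -1)), [], []]
    for src, dst in moves:
        if src not in (0, 1, 2) or dst not in (0, 1, 2):
            return False  # move refers to a peg that does not exist
        if not pegs[src]:
            return False  # nothing to move from the source peg
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return False  # a larger disk may not sit on a smaller one
        pegs[dst].append(pegs[src].pop())
    # Solved only if every disk ended up, in order, on the last peg.
    return pegs[2] == list(range(n_disks, 0, -1))


print(validate_hanoi(2, [(0, 1), (0, 2), (1, 2)]))  # legal two-disk solution
print(validate_hanoi(2, [(0, 2), (0, 2)]))          # illegal second move
```

Raising `n_disks` makes the puzzle harder without changing its rules, which is the property that makes this kind of environment useful for controlled comparisons.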

Apple researchers discovered that LRMs perform differently depending on problem complexity. On simple tasks, standard LLMs, without explicit reasoning mechanisms, were more accurate and efficient and delivered better results with fewer compute resources. However, as problem complexity increased to a moderate level, models equipped with structured reasoning, like Chain-of-Thought prompting, gained the advantage and outperformed their non-reasoning counterparts. When the complexity grew further, both types of models failed completely: their accuracy dropped to zero regardless of the available compute resources. (Keep in mind that the Claude 3.7 Sonnet Thinking and DeepSeek-R1 LRMs have limitations when it comes to their training.)

A deeper analysis of the reasoning traces revealed inefficiencies and unexpected behavior. Initially, reasoning models used longer thought sequences as problems became harder, but near the failure point, they surprisingly shortened their reasoning effort even when they had sufficient compute capacity left. Moreover, even when explicitly provided with correct algorithms, the models failed to reliably execute step-by-step instructions on complex tasks, exposing weaknesses in logical computation. The study also found that model performance varied significantly between familiar and less-common puzzles, suggesting that success often depended on training data familiarity rather than true generalizable reasoning skills.
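For a sense of what executing a known algorithm step by step involves, the sketch below shows the standard textbook recursive solution to Tower of Hanoi. The article does not say which algorithm the models were given, so treat this as an illustrative assumption; the function name `hanoi_moves` and the peg numbering are likewise invented for the example. The algorithm itself is only a few lines long, but the number of moves it produces is 2^n - 1 for n disks, so the length of the sequence that must be carried out without a single error roughly doubles with every added disk.

```python
def hanoi_moves(n, src=0, dst=2, spare=1):
    """Yield (source, destination) peg pairs that move n disks from src to dst.

    Standard recursive solution, shown for illustration only.
    """
    if n == 0:
        return
    # Move the n-1 smaller disks out of the way onto the spare peg.
    yield from hanoi_moves(n - 1, src, spare, dst)
    # Move the largest remaining disk directly to its destination.
    yield (src, dst)
    # Move the n-1 smaller disks from the spare peg onto the largest disk.
    yield from hanoi_moves(n - 1, spare, dst, src)


for n in (3, 7, 10):
    # Prints the disk count and the number of moves: 7, 127, and 1023.
    print(n, len(list(hanoi_moves(n))))
```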

Anthropic has introduced Claude Opus 4 and Claude Sonnet 4, its latest generation of hybrid-reasoning AI models optimized for coding tasks and solving complex problems.

Claude Opus 4 is Anthropic’s most powerful AI model to date, according to the company’s announcement, and is capable of working continuously on long-running tasks for “several hours.” In customer tests, Anthropic said that Opus 4 performed autonomously for seven hours, significantly expanding the possibilities for AI agents. The company also described its new flagship as the “best coding model in the world,” with Anthropic’s benchmarks showing that Opus 4 outperformed Google’s Gemini 2.5 Pro and OpenAI’s o3 reasoning and GPT-4.1 models in coding tasks and in using “tools” like web search.

Claude Sonnet 4, which supersedes the 3.7 Sonnet model released in February, is a more affordable and efficiency-focused model that’s better suited to general tasks. Anthropic says Sonnet 4 delivers “superior coding and reasoning” while providing more precise responses. The company adds that both models are 65 percent less likely than 3.7 Sonnet to take shortcuts and exploit loopholes to complete tasks, and that they’re better at storing key information for long-term tasks when developers provide Claude with local file access.

A new feature introduced for both Claude 4 models is “thinking summaries,” which condense the chatbots’ reasoning process into easily understandable insights. An “extended thinking” feature is also launching in beta, allowing users to switch the models between modes for reasoning or using tools to improve the performance and accuracy of responses.

Claude Opus 4 and Sonnet 4 are available on the Anthropic API, Amazon Bedrock, and Google Cloud’s Vertex AI platform, and both models are included in paid Claude plans alongside the extended thinking beta feature. Free users can only access Claude Sonnet 4 for now.

In addition to the new models, Anthropic’s Claude Code agentic command-line tool is now generally available following its limited preview in February. Anthropic also says it’s shifting to provide “more frequent model updates,” as the company tries to keep up with competition from OpenAI, Google, and Meta.

Apple says generative AI cannot think like a human – research paper pours cold water on reasoning models

Anthropic’s Claude 4 AI models are better at coding and reasoning