AI study reveals dramatic reasoning breakdown in LLMs
Researchers from the Jülich Supercomputing Centre (JSC), part of Helmholtz Information, the School of Electrical and Electronic Engineering at the University of Bristol, and the AI lab LAION have uncovered significant deficiencies in the reasoning capabilities of modern AI language models. In an international collaboration, their recent study has revealed fundamental weaknesses of these models on logical tasks, with far-reaching implications for the use and development of AI. (Source: Forschungszentrum Jülich – Press Releases)
Even the best large language models (LLMs) fail dramatically when confronted with simple logical questions. This is the conclusion of researchers from the Jülich Supercomputing Centre (JSC), the School of Electrical and Electronic Engineering at the University of Bristol, and the LAION AI laboratory. In their paper, “Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models” (preprint available at https://arxiv.org/abs/2406.02061), the scientists attest to a “dramatic breakdown of function and reasoning capabilities” in the tested state-of-the-art LLMs and suggest that although language models have the latent ability to perform basic reasoning, they cannot access it robustly and consistently. The authors of the study (Marianna Nezhurina, Lucia Cipolina-Kun, Mehdi Cherti and Jenia Jitsev) call on “the scientific and technological community to stimulate urgent re-assessment of the claimed capabilities of current generation of LLMs.” They also call for the development of standardized benchmarks to uncover weaknesses in language models related to basic reasoning capabilities, as current tests have apparently failed to reveal this serious failure.
The crux of proper reasoning
The “common sense task”, termed the “AIW problem” in the paper, is actually simple: “Alice has N brothers and she also has M sisters. How many sisters does Alice’s brother have?” The values for N and M (always natural numbers) and the order in which the brothers and sisters are mentioned are varied. The researchers thus used various combinations of numbers and prompt types in order to obtain a precise overview of how the different models behave under systematic AIW problem variations. Regardless of the variation, the problem structure stays the same and the correct answer always results from adding Alice to her sisters (M + 1), logic that most primary school children can already follow. The language models, on the other hand, were only able to solve the AIW problem sporadically, if at all. Here is an example of an LLM that was confronted with the simplest version of the question:
“Alice has four brothers and one sister. How many sisters does Alice’s brother have?” – “A classic lateral thinking puzzle! – Let’s take a closer look: Alice has four brothers. Alice also has a sister (herself). So each of Alice’s brothers has… (drum roll) …a sister, and that’s Alice herself! So the answer is: Each of Alice’s brothers has a sister.”
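The ground-truth logic described above can be sketched in a few lines. This is a minimal illustration only; the prompt template and function names are assumptions, not the paper's actual evaluation harness:

```python
# Minimal sketch of the AIW problem: generating a prompt variation and
# computing its ground-truth answer. Names and template are illustrative.

def aiw_prompt(n_brothers: int, m_sisters: int) -> str:
    """Build one variation of the AIW problem statement."""
    return (f"Alice has {n_brothers} brothers and she also has "
            f"{m_sisters} sisters. How many sisters does Alice's brother have?")

def aiw_answer(n_brothers: int, m_sisters: int) -> int:
    """Correct answer: Alice's sisters plus Alice herself, i.e. M + 1.
    Note that N, the number of brothers, is irrelevant to the answer."""
    return m_sisters + 1

# The simplest version quoted above (N=4, M=1) has the answer 2,
# not 1 as the model's response claimed.
print(aiw_prompt(4, 1))
print(aiw_answer(4, 1))  # 2
```

Varying N, M, and the phrasing leaves this ground truth unchanged, which is why the strong fluctuations in model accuracy across such variations are telling.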

More than every second answer wrong
Overall, the LLMs had an average correct response rate well below 50%, with larger models generally performing significantly better than smaller ones (GPT-4o, for instance, achieved a correct response rate slightly above 60%). This again underpins the advantages of larger scale, yet even the largest models do not perform well enough to count as models with robust basic reasoning. In particular, the very strong fluctuations observed across even slight AIW problem variations are a clear indication that the models are not capable of robust basic reasoning: they get confused even by minor changes to the problem that should not matter for producing a correct solution. A more difficult version of the question (the “AIW+ problem”) ultimately pushed all the models to the edge of their reasoning abilities. According to the researchers, many of the tested models also achieve very high scores on various standardized benchmarks designed to test a range of capabilities, including reasoning, while failing on the very simple AIW problem. In their paper, the scientists therefore suggest that these benchmarks do not correctly reflect the deficits in these models’ basic reasoning, and they question the use of the current standardized benchmarks for model comparison.
Language models on the test bench
While the paper has not yet been peer-reviewed, its findings are already making waves. How capable are LLMs really? What does it mean for the use of LLMs if they fail at primary school-level tasks? Co-author Jenia Jitsev (JSC) says: “We are being overwhelmed by discussions and inquiries as a result of our paper.” The scientists’ findings call many things into question and make further studies on the competence of language models absolutely essential. Jitsev: “Our paper provides extremely important new insights into the actual abilities of language models to draw correct conclusions by following proper basic reasoning – further follow-up research is needed here to understand how and why the basic reasoning in the current models breaks on such easy problems.”
FZJ/L. Maiburg, 23.07.2024
The original press release can be found at:
AI study reveals dramatic reasoning breakdown in LLMs
Further Information:
Paper (Preview): Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models
Related discussion in the “Hacker News” forum: https://news.ycombinator.com/item?id=40585039
Localization in Helmholtz Information:
Helmholtz Information, Program 1: Engineering Digital Futures, Topic 1: Enabling Computational- & Data-intensive Science and Engineering
About Helmholtz Information:
The Research Field “Helmholtz Information” is one of the six research fields of the Helmholtz Association and serves as its digital innovation center. Here, advanced and future computer architectures merge with insights from materials research, data science, and life sciences. Inspired by nature, supported by brain research, and enriched by modern approaches in artificial intelligence, experts from the Forschungszentrum Jülich, Karlsruhe Institute of Technology, Helmholtz-Zentrum Hereon, and the Helmholtz-Zentrum Berlin are shaping the digital future in science, business, and everyday life.
Visit our official website and follow us on our LinkedIn channel of Helmholtz Information to receive up-to-date information, event announcements, and insights into our research activities in Helmholtz Information.
Contact:
Dr. Jenia Jitsev
Institute for Advanced Simulation (IAS)
Jülich Supercomputing Centre (JSC)
Forschungszentrum Jülich
Phone: +49 2461 61-9727
E-Mail: j.jitsev@fz-juelich.de
Marianna Nezhurina
Institute for Advanced Simulation (IAS)
Jülich Supercomputing Centre (JSC)
Forschungszentrum Jülich
Phone: +49 2461 61-85382
E-Mail: m.nezhurina@fz-juelich.de
Contact for this press release:
Lisa Maiburg
Public Relations & Science Communication Officer
Institute for Advanced Simulation (IAS)
Jülich Supercomputing Centre (JSC)
Forschungszentrum Jülich
Phone: +49 2461 61-9089
E-Mail: l.maiburg@fz-juelich.de
