Pattern, Proof, Parrot

nlp
philosophy
llm
eth
Author

Peter Nutter

Published

Saturday, July 5, 2025

Abstract

Do contemporary large language models actually reason, or do they merely stitch together statistical patterns? The current NLP debate answers both ways yet rarely clarifies what “reasoning” means. Drawing on Foucault’s genealogy of statements and Derrida’s deconstruction of hierarchies, I track three historical concepts of reasoning (formal-logical, cognitive-heuristic, and statistical-inferential), show how recent LLM papers slide silently between them, and argue that the opposition Reasoning > Pattern-Matching is unstable. What passes as “LLM reasoning” today collapses back into iterated pattern once inspected.

Introduction

Over the past year, the focus of natural language processing has pivoted from the generation of coherent text to the solving of competitive mathematical and programming tasks. This reorientation followed an era where models mastered text generation, prompting concerns about the limits of data availability and future AI scaling. This uncertainty led to the resurgence of Reinforcement Learning (RL). The guiding principle was to train models not merely on next-token prediction, but on verifiable outcomes, using RL to steer the model through a problem space.

This process yields what are termed “reasoning steps”, often demarcated by special tokens, which ostensibly represent the model’s thought process in a manner similar to Chain-of-Thought (CoT) reasoning. The first major public demonstration of this approach was DeepSeek-R1 [1], a model that achieved excellent results and, notably, exhibited a capacity for self-correction by revising its own erroneous steps.

This was presented as a significant advance. The systems were branded “reasoning models” or “thinking models”, a label justified by their ability to produce a step-by-step derivation that aligns with a common-sense definition of reasoning. But this simple definition becomes complicated when we consider how other domains, such as philosophy, define reasoning, and it was complicated further when the inevitable wave of skepticism arrived to challenge these claims.

Genealogy of “Reasoning”

Genealogy, in Foucault’s sense, uncovers how apparently timeless ideas are the contingent products of particular historical formations. The concept of “reasoning” itself has a history that can be genealogically traced. As Foucault demonstrated with the emergence of “man” as a subject of knowledge:

Before the end of the eighteenth century, man did not exist—any more than the potency of life, the fecundity of labour, or the historical density of language. He is a quite recent creature, which the demiurge of knowledge fabricated with its own hands less than two hundred years ago, but he has grown old so quickly that it has been only too easy to imagine that he had been waiting for thousands of years in the darkness for that moment of illumination in which he would finally be known.

From Descartes onward, the modern subject is conceived as a rational subject, capable of perfect self-reflection and, ideally, immune to external constraint. Critical theorists later observed that this ideal of an autonomous, error-free reason arose partly as a defence against the possibility of error and heteronomy [2].

Rather than trace every waypoint from Enlightenment rationalism through cybernetics to present-day AI, which I am not qualified to comment on (like many things in this paper), I focus on three conceptions that still organise today’s debates.

Normative-logical view

In this conception, reasoning is rule-governed, truth-preserving inference. An agent reasons when it applies formally specified rules to derive conclusions from explicit premises. The paradigmatic machine embodiment is Newell and Simon’s Logic Theorist (1956) [3]:

The task set for LT will be to prove that certain expressions are theorems, that is, that they can be derived by application of specified rules of inference from a set of primitive sentences or axioms.

This view is analytic and a priori. The system manipulates uninterpreted symbols yet can still synthesise new knowledge, for example shorter proofs than those in Principia Mathematica. Modern proof assistants such as Lean extend the tradition, though we usually call them solvers rather than reasoners.

Cognitive-psychological view

Long before logic was formalised, humans were already making everyday inferences. Experimental psychology therefore treats reasoning as the bounded manipulation of information: organisms use quick-and-dirty procedures that work well enough under tight limits on memory, time and attention. Herbert Simon called this principle bounded rationality [4]:

The task is to replace the global rationality of economic man with the kind of rational behaviour that is compatible with the access to information and the computational capacities actually possessed by organisms, including man, in the kinds of environments in which such organisms exist.

Simon’s programme was later extended by Daniel Kahneman and Amos Tversky, whose “heuristics and biases” research mapped the specific shortcuts people rely on—availability, representativeness, anchoring—and the systematic errors that follow. What matters here is that reasoning is studied descriptively and empirically: the goal is to discover how humans in fact reach conclusions, not to prescribe how perfectly rational agents ought to reason.

Statistical-inference view

A third strand, common in Bayesian statistics and empirical science, identifies reasoning with belief updating from data. In the first chapter of The Book of Why, Judea Pearl arranges causal cognition on a ladder whose rungs are association, intervention, and counterfactuals [5]. True intelligence, on this account, moves beyond association to purposeful intervention and counterfactual imagination. The stance is a posteriori: it tests hypotheses, designs experiments, and asks “what if?”, much as a physicist or software engineer debugs a system.

These three strands coexist, often implicitly, in current NLP discourse. Researchers slide between them without warning, invoking whichever criterion serves the moment. Consequently, when an LLM satisfies one sense of “reasoning,” critics can still appeal to another sense to deny that real reasoning has taken place.

Case-Study Snapshot

The Illusion of Thinking [6]

This recent Apple paper, whose title alone spurred an immediate wave of tweets and rebuttals, asks a simple question:

Are today’s “Large Reasoning Models” capable of generalisable reasoning, or are they merely exploiting more elaborate forms of pattern matching?

To probe the issue the authors abandon standard (math / coding) benchmarks and build a small, fully-controllable suite of planning puzzles: Tower of Hanoi, Checker-Jumping, River Crossing, and Blocks World.
Each puzzle is governed by fixed rules but comes with a size parameter \(N\) that determines the minimal number of moves; by increasing \(N\) they can scale the task while keeping the structure constant.
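The scaling mechanism is easiest to see for Tower of Hanoi, where the minimal solution for \(N\) disks is exactly \(2^N - 1\) moves, so each increment of \(N\) doubles the required output. A minimal sketch (my illustration, not the paper’s evaluation harness):

```python
def hanoi_moves(n, src="A", dst="C", aux="B"):
    """Enumerate the optimal move sequence for n disks (standard recursion)."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, aux, dst)     # park n-1 disks on the spare peg
            + [(src, dst)]                        # move the largest disk
            + hanoi_moves(n - 1, aux, dst, src))  # re-stack the n-1 disks on top

# The minimal solution length grows exponentially in N: 2**N - 1 moves.
for n in range(1, 8):
    assert len(hanoi_moves(n)) == 2**n - 1
```

This is what makes “increase \(N\), keep the rules fixed” such a clean knob: the rule set never changes while the optimal trace length explodes.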

They compare two kinds of models that share the same backbone:

  • a “reasoning” variant obtained through RL + chain-of-thought supervision,
  • the corresponding base LLM without any thinking head.

The empirical pattern is tri-modal:

  1. Low complexity.
    Base models outperform the RL-tuned versions and do so with far fewer tokens, because the latter overthink (long traces that sometimes overwrite an early correct plan).

  2. Medium complexity.
    RL models pull ahead but only by spending thousands of extra “thinking” tokens.

  3. High complexity.
    Accuracy for both variants drops to (near) \(0\); the authors call this collapse.

Many headlines ran with “Apple proves LLMs don’t reason”. This is definitely an overreading: the paper reports collapse within its evaluation protocol; it does not claim to refute all forms of reasoning, though the title does incentivize that reading.

Implicitly, the paper equates reasoning with what might be called algorithmic generalisation: a system reasons only if it can keep executing an exact procedure as \(N\) grows; once performance stops scaling monotonically the behaviour is labelled “pattern matching”. This standard doesn’t align with any of the three conceptions of reasoning described above.

Questions about the methodology have been raised in a satirical response paper authored under pseudonyms [7], which points out potential artifacts:

  • Tower of Hanoi outputs exceed the 64k-100k token budgets at precisely the \(N\) where collapse is observed; models sometimes state that they are truncating.
  • “Compositional depth” (number of moves) is treated as a proxy for difficulty, even though Hanoi requires exponential enumeration with trivial \(O(1)\) decision logic, while River Crossing needs far fewer moves but NP-style search.
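That asymmetry can be made concrete: the optimal Hanoi sequence has a well-known closed form in which the \(m\)-th move is computed in \(O(1)\) from the index alone, so the only “hard” part is the exponential length of the transcript. A sketch with pegs labelled 0–2 (my illustration, not code from either paper):

```python
def hanoi_move(m: int):
    """Closed-form m-th move (1-indexed) of the optimal solution on
    pegs 0, 1, 2: O(1) decision logic per move, no search required."""
    return ((m & (m - 1)) % 3, ((m | (m - 1)) + 1) % 3)

def solve(n: int):
    """The full optimal solution: 2**n - 1 moves, each computed in O(1),
    so only the *output length* is exponential, not the per-move logic."""
    return [hanoi_move(m) for m in range(1, 2**n)]

def is_legal(n: int) -> bool:
    """Simulate the moves and check no larger disk lands on a smaller one."""
    pegs = [list(range(n, 0, -1)), [], []]
    for src, dst in solve(n):
        disk = pegs[src].pop()
        if pegs[dst] and pegs[dst][-1] < disk:
            return False
        pegs[dst].append(disk)
    # The whole tower ends on a single peg (which peg depends on parity of n).
    return sorted(len(p) for p in pegs) == [0, 0, n]
```

River Crossing inverts the profile: short solutions, but each move requires genuine constraint search, which is why move count is a poor universal proxy for difficulty.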

I include this paper chiefly to show how the labels we choose (“reasoning” / “thinking” / “illusion”) steer the questions we ask. Names can channel useful intuitions yet they can just as easily lure us into anthropomorphising, or into moving the goal-posts whenever a model meets the last definition we set.

LLMs Can Easily Learn to Reason from Demonstrations: Structure, Not Content, Is What Matters! [8]

The Berkeley-Sky paper performs a fascinating experiment: copy long chain-of-thought behaviour from a large RL-tuned teacher (DeepSeek-R1 or QwQ-32B) into a smaller open model. With only 17k teacher traces and either full supervised fine-tuning or a 5 %-parameter LoRA head, Qwen-2.5-32B gains 10–40 pp on five math-and-code benchmarks and still keeps its earlier MMLU and IFEval scores. Long reasoning, it seems, is cheap to distil.

The twist comes when the authors ask what in those traces actually matters. They create two families of perturbations. In “local-content” hacks they flip digits, strip reflection keywords, or even keep only traces whose final answer is wrong. Up to 70 % digit corruption or entirely wrong answers shave at most four points from benchmark accuracy; only the pathological 100 %-noise case collapses. In “global-structure” hacks they delete, insert, or shuffle complete reasoning steps (33 %–100 %). Here accuracy plummets by 10–22 points even though every individual step is still grammatical and, taken in isolation, sensible. The lesson the authors draw is stark: the model needs a coherent scaffold (opening a thought, back-tracking, closing it in the right order) far more than it needs the step-by-step truth of the interior arithmetic.
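The two perturbation families can be sketched on a toy trace (hypothetical example and function names, not the authors’ pipeline): “local-content” hacks corrupt what the steps say, while “global-structure” hacks permute which steps occur where:

```python
import random

# Toy reasoning trace: an ordered list of "steps", as in CoT distillation.
trace = [
    "Let x = 12 and y = 35.",
    "Compute x * y = 420.",
    "Wait, check the factorisation: 420 = 2^2 * 3 * 5 * 7.",
    "So the answer is 420.",
]

def corrupt_digits(steps, p, rng):
    """'Local-content' perturbation: flip each digit with probability p,
    leaving the step scaffold (order, wording, markers) intact."""
    flip = lambda ch: rng.choice("0123456789") if ch.isdigit() and rng.random() < p else ch
    return ["".join(flip(c) for c in s) for s in steps]

def shuffle_steps(steps, rng):
    """'Global-structure' perturbation: permute whole steps, so every line
    stays locally sensible but the scaffold is destroyed."""
    out = steps[:]
    rng.shuffle(out)
    return out
```

In the paper’s terms, benchmark accuracy survives heavy doses of the first kind of noise but not the second, which is what motivates the structure-over-content reading.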

This finding replays, in miniature, Derrida’s deconstruction of the content/form hierarchy. “Reasoning” is supposed to live in the truthful propositions (content) while the textual markings that package them (structure) are merely secondary. Yet the experiment shows the reverse: scramble the marks and the whole performance implodes; scramble the meanings and the system marches on. For LLM builders the practical moral is clear: training data must preserve the iterable architecture of thought even if the intermediate facts are shaky. For philosophy of language the moral is deeper: in these machines the repeatable form is not an inert wrapper but the very condition of possibility for what we still dare to call “reasoning”.

Deconstructing the Binary

Now that we have seen three different views of reasoning and two different ways reasoning is studied in the context of LLMs, we can look further into what we are actually trying to prove—not from the perspective of any of the three views, but from the perspective of language and the hidden assumptions that come with concepts that are not well defined and carry a lot of historical baggage. When we look at a milestone like “reasoning,” we have to ask what is on the other side of it. What is the model doing if it is not reasoning? Is it just producing text, just pattern matching, just a parrot? This creates an implicit binary, one that is intrinsically hierarchical, but also unstable. These two concepts are not truly independent; rather, they give rise to each other.

One such binary that Derrida deconstructs is the binary of speech and writing, where speech is considered primary and writing secondary; this hierarchy is what he calls logocentrism [9]. Speech and oral language have been privileged since the time of Plato through the structuralist project exemplified by Saussure. What Derrida ultimately shows is that this hierarchy is not stable. Unlike in the dialectical approach, this binary is not to be resolved; there is no sublation, no final goal or teleology to be achieved by this project [10].

A similar hierarchy appeared in the first wave of LLM criticism, framed as a binary between meaning and form. Early critics claimed that a model’s output was “just coherent text,” lacking reference or intent and stitched together from training statistics alone [11], [12]. Critical approaches to language, developed long before LLMs, have already dismantled such arguments. I will try to look at reasoning through the same lens: to find the implicit binary, identify the hierarchy within it, and then try, at least temporarily, to privilege the other side.

Identifying the Binary

We have seen that papers in the LLM community often slide between different definitions of reasoning, and that the term is used in a very loose way. What is on the other side is usually not well defined either, but we see terms like “not full generalization,” “reasoning collapse,” “mimicry,” “statistical correlation,” “heuristic,” “pattern matching,” “parrot,” and so on. These terms give us a sense of what is considered the opposite of reasoning.

It is also hard to identify what is “not reasoning” when we see it. Maybe we say it is not reasoning when the chain of thought is wrong and the solution is wrong; maybe it is not reasoning when the steps are wrong but the answer is right. Some are skeptical of reasoning even when both the solution and the steps are correct. Nor is it a one-time switch: whether a model reasons cannot be settled by a single demonstration. We can always grant that it reasons on one task and then find another case where it fails to meet the criteria. It is a game of cat and mouse in which the goalposts are always moving, and we are not acknowledging that.

Hierarchical Nature of the Binary

This binary is hierarchical in nature. Reasoning, cognition, and thinking are seen as pure, a priori, creative, and not fooled by our senses. Pattern matching is something designated for machines: fallible, computational, derivative, and opaque. Humans are the animals that reason, as is often attributed to Aristotle. We have identified ourselves with our reasoning abilities; it is how we differentiate ourselves, and it is considered a virtue. And even though reason enters into other binaries, such as reason and emotion or reason and intuition, reason is always the privileged term.

Collapsing the Hierarchy

How do we put our reasoning on such a high pedestal? Our own introspection is hugely limited; we often have no knowledge of how we reach a conclusion, how we add numbers, or how we plan. We can, of course, come up with a reasonable step-by-step process, but we do not know how our neural circuits actually implement these operations. Do we actually use the algorithms for multiplying big numbers in our brains when we perform multiplication, or is it just a conditioned response? There is no way of knowing this.

Often, when solving mathematical problems, it is about having seen many previous problems, searching our toolbox, and trying to restructure or rewrite the problem in a different way, applying some theorems or equalities and hoping it leads us somewhere. We reuse these snippets of logical deduction like building blocks. Mathematics is the study of patterns, and we have to be able to recognize them in order to abstract out the similarity and reuse it in a different context.

There are machines that can do this. One of the first such machines was designed almost 70 years ago: the Logic Theorist [3]. It was able to prove theorems in propositional logic and even found shorter proofs than humans had previously discovered. This was a major breakthrough, not just because of its capabilities, but because it changed the conditions of possibility: a whole new field of research, funding, and questions was opened up. The physical symbol systems hypothesis and the computational theory of mind were born. People started taking seriously the idea that thinking is just symbol manipulation, and that we can design systems that do the same. The symbolic AI approach was later overshadowed by connectionist models. Researchers stopped hand-coding rules and let networks learn patterns from data. The wish to build a machine that reasons, however, never disappeared.

From a plain functionalist view, if a system behaves as if it is reasoning, that is enough to call it reasoning. Treating reasoning as a hidden spark we can never inspect does not help. When we look inside, the idea of a pure, separate reasoning faculty falls apart. We find heuristics, search routines, and pattern reuse, the very things critics label “mere pattern matching.” These operations are not the opposite of reasoning; they are what reasoning is made of. The headline question “Do LLMs really reason?” therefore loses its force. A better question is how different blends of pattern, search, and heuristic let any system, biological or silicon, reach the conclusions we care about.

Conclusion

We have seen that the term “reasoning” is thrown around with little consistency. Systems from the 1950s met a strict, formal-logical definition of reasoning, yet we withheld the title. Now, in the current connectionist paradigm (episteme), we look for reasoning in a different place: in the textual traces produced by models trained to perform a step-by-step thought process.

This implicit redefinition brings us back to the familiar binary of reasoning versus pattern matching, leaving key questions unanswered. Do the performance gains of today’s “reasoning LLMs” stem from genuine learning, or from sophisticated mimicry? And what are the consequences of calling these systems “thinking” models in the first place?

This paper’s primary aim was not to settle the engineering debate, but to analyze the discourse that frames it. The crucial questions are not only about what these models can do, but about how the language we use to describe them directs our focus and defines our goals. The technical work is inseparable from the conceptual framework we build around it.

References

[1]
DeepSeek-AI et al., “DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.” 2025. Available: https://arxiv.org/abs/2501.12948
[2]
J. M. Bernstein, “Adorno, Theodor Wiesengrund (1903–69),” 1998, doi: 10.4324/9780415249126-DD001-1.
[3]
A. Newell and H. Simon, “The logic theory machine–a complex information processing system,” IRE Transactions on Information Theory, vol. 2, no. 3, pp. 61–79, 1956, doi: 10.1109/TIT.1956.1056797.
[4]
G. Wheeler, “Bounded rationality,” in The Stanford Encyclopedia of Philosophy, Winter 2024, E. N. Zalta and U. Nodelman, Eds., Metaphysics Research Lab, Stanford University, 2024. Available: https://plato.stanford.edu/archives/win2024/entries/bounded-rationality/
[5]
J. Pearl and D. Mackenzie, The book of why: The new science of cause and effect, 1st ed. USA: Basic Books, Inc., 2018.
[6]
P. Shojaee, I. Mirzadeh, K. Alizadeh, M. Horton, S. Bengio, and M. Farajtabar, “The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity.” 2025. Available: https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf
[7]
C. Opus, A. A. Lawsen, and O. Philanthropy, “The illusion of the illusion of thinking: A comment on Shojaee et al. (2025),” arXiv preprint arXiv:2506.09250, 2025. Available: https://arxiv.org/abs/2506.09250
[8]
D. Li et al., “LLMs can easily learn to reason from demonstrations: Structure, not content, is what matters!” 2025. Available: https://arxiv.org/abs/2502.07374
[9]
J. Derrida, “Signature event context,” Limited Inc, pp. 1–23, 1988.
[10]
O. Podcast, “Derrida on deconstruction and différance.” 2022. Available: https://www.youtube.com/watch?v=4Y7zKpCHHIA
[11]
E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell, “On the dangers of stochastic parrots: Can language models be too big? 🦜,” in Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, in FAccT ’21. New York, NY, USA: Association for Computing Machinery, 2021, pp. 610–623. doi: 10.1145/3442188.3445922.
[12]
E. M. Bender and A. Koller, “Climbing towards NLU: On meaning, form, and understanding in the age of data,” in Proceedings of the 58th annual meeting of the association for computational linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault, Eds., Online: Association for Computational Linguistics, Jul. 2020, pp. 5185–5198. doi: 10.18653/v1/2020.acl-main.463.