We Are All Just Predicting Tokens
- #llms
- #philosophy-of-mind
- #stochastic-parrots
Late last year, I was at dinner with an engineer I used to work with. I was trying to explain my complete amazement with the productivity gains (and possibilities) of using LLMs for software development. After my gleeful rant, he paused and replied that LLMs are “just predicting tokens.” Glorified autocomplete. Copy-paste on steroids. He said it the way people say things when they’ve decided the conversation is over.
It certainly sounded like a reasonable technical rebuttal, but it also quietly assumes that a kind of “real reasoning” exists that is unique to humans and mechanistically different, with no description of how it differs or what the mechanisms are.
“Just predicting tokens” is a category error, not an argument. The system we’re dismissing is doing something we can’t cleanly describe ourselves doing, and the dismissal survives because almost nobody pressure-tests the contrast.
Two things sit behind that claim. First, I find myself in awe of the extent to which language serves as a compression algorithm for describing our reality and our interactions within it. In my view, the ability to speak or write comes ‘pre-loaded’ with concepts like objects, interactions, and basic reasoning. Second, the “real human reasoning” that the “token prediction” dismissal appeals to is something we’ve yet to define or locate on the human side.
The dismissal everyone nods along to
The lineage goes back to Emily Bender and Timnit Gebru’s 2021 “Stochastic Parrots” paper, and it has hardened since. Bender and Alex Hanna’s 2025 book The AI Con argues it is “fundamentally confused” to use any human-like term (understanding, reasoning, belief) for what LLMs do (Bender & Hanna, The AI Con, 2025). That’s the purest version of the position. It isn’t an empirical claim. It’s a definitional one: whatever LLMs are doing, it categorically isn’t cognition, because cognition is (implicitly) the other thing.
I want to be fair to it. I also want to be honest about the social shape it’s taken on.
I often find that to the wise, career-weathered software developer or CTO, the dismissal is a status move. It signals you aren’t hype-pilled. It lets you stay in the “I’m not impressed” club while the rest of the room squirms about Claude one-shotting the user story that was originally going to be three engineers’ stand-up updates for the next two weeks. A 2024 LessWrong catalog called “Hunting Undead Stochastic Parrots” documents the frame persisting as vibe rather than argument, with people invoking the phrase without engaging the last two years of empirical work.
The phrase is doing social work; the technical work has moved on without it.
To rebut this stance, one must show (1) that predicting tokens at frontier scale produces reasoning-like internals, and (2) that the thing we keep gesturing at when we say “real reasoning” isn’t a thing humans cleanly do either.
Language is reasoning, compressed
The core intuition came from Ilya Sutskever before anyone else: if you predict the next token well enough, you have to model the reality that produced the token. Statistics, when done really well, essentially bleed into world-modeling. You can’t keep winning the prediction game on hard text without reconstructing the causal and inferential structure of the thought that wrote it (Sutskever on next-token prediction, 2023). Despite being three years old at this point, it’s still the cleanest articulation of the point.
You could object on semantic grounds, with the good old Chinese Room argument. However, as we dig into the internals of these language models, we see shadows of understanding emerge.
Anthropic’s “Mapping the Mind of a Large Language Model” paper (May 2024) found features inside Claude 3 Sonnet for abstract concepts (inner conflict, catch-22, code bugs, deception), organized by conceptual similarity and causally shaping behavior when amplified. Turn the “deception” feature up, the model behaves more deceptively. These aren’t token-level artifacts. They’re features that look like concepts, sitting in the spots where concepts would have to live if the system were reasoning with them.
The 2025 follow-on pushed further. Anthropic’s April 2025 circuits update and “On the Biology of a Large Language Model” trace induction heads and multi-step circuits that implement abstract operations across layers: entity resolution, arithmetic carry chains, multi-hop lookups. If you’ve read this literature and still want to call the internal mechanism “just prediction,” you’re describing the training objective, not the reality of the machine once trained.
If you trained me to win a dunk contest - with a loss function designed to promote sick jams - I might grow a plethora of features that support that objective but aren’t strictly limited to that application (I’m stronger, have better cardio, worked on my depth perception, learned a bit of physics, etc.). I believe I can fly…
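To make the objective-versus-mechanism distinction concrete, here is about the smallest thing that literally "predicts the next token": a bigram counter. This is a toy sketch, not how a transformer works. The corpus and the function names `train_bigram` and `predict_next` are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev):
    """Return the most frequent follower of `prev`, or None if `prev` was never seen."""
    if prev not in counts:
        return None
    return counts[prev].most_common(1)[0][0]


# Hypothetical ten-word corpus, purely for illustration.
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)

print(predict_next(model, "the"))    # "cat" follows "the" twice, "mat" once
print(predict_next(model, "slept"))  # never seen as a previous token, so None
```

Both this lookup table and a frontier model can be described as "predicting the next token." The phrase names the shared objective and says nothing about the very different machinery that ends up serving it.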

Marcus’s knockout blow, taken seriously
The strongest live version of the skeptical case in 2025 belongs to Gary Marcus, not Bender. Bender hardens into definition, which is easy to dispatch. Marcus makes an empirical prediction, which is much harder. His June 2025 Substack post “A knockout blow for LLMs?” is the argument at full strength.
LLMs, Marcus writes, “can generalize within a training distribution of data they are exposed to, but their generalizations tend to break down outside that distribution.” The receipt: a seven-year-old can solve Tower of Hanoi. Claude, at the time of writing, scored under 80% at 7 discs and basically zero at 8. The internal process, on this read, “is not logical and intelligent.” It approximates patterns, and it fails when the problem shifts beyond what it saw during training.
If you take that seriously (and you should), the shape of the argument is quite reasonable, empirical, and falsifiable. It says: if the system really reasoned, there would be no distribution cliff.
François Chollet makes the technical sibling version of this argument. LLMs are “big interpolative memory”; scaling increases skill but not intelligence; without discrete program search on genuinely novel problems, what looks like reasoning is memorization of solver templates. The May 2025 ARC-AGI-2 paper is the benchmark built to operationalize the claim, and pure LLMs scored roughly 0% at launch. Not a rounding error. A cliff.
If my belief that “language encodes reasoning” held without qualification, Tower of Hanoi at 8 discs is exactly where an LLM should not fail. The cliff is real data. An honest defender has to contend with this.

The cliff is real. But it seems to be moving…
The benchmark designed to falsify “LLMs can reason” is being climbed by LLM-based systems. ARC Prize’s December 2025 results and analysis put Gemini 3 Deep Think at 84.6% on the ARC-AGI-1 public leaderboard, with refinement-loop LLM systems hitting 54% on ARC-AGI-2, the specifically-harder benchmark Chollet built to rule out memorization. These runs are fucking expensive and slow, but they are making real progress against these tests. The cliff that Chollet said was structural is being walked up by systems whose substrate is the thing he said couldn’t do it.
In Andrej Karpathy’s end-of-2025 post-mortem - “2025 LLM Year in Review” (December 19, 2025) - he coins “summoned ghosts” and “jagged intelligence,” a shape that’s neither human reasoning nor mere lookup, and argues RLVR produces genuine problem-decomposition strategies, not pattern memorization. In a way, he offers new vocabulary that helps sharpen the discussion beyond the blunt comparison to human thinking.
And humans are better, right?
Now the uncomfortable part. Humans have distribution cliffs too.
Ask any adult to do long division on 9-digit numbers without paper. Ask them to run a 14-step modus tollens chain in their head. Ask a bright 35-year-old to solve Tower of Hanoi at 12 discs without tools (which is 4,095 moves).
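The move counts here follow from the standard result that an optimal Tower of Hanoi solution for n discs takes 2^n - 1 moves. A small sketch to check the numbers (the function names and peg labels are my own, chosen for illustration):

```python
def min_moves(discs: int) -> int:
    """Minimum number of moves for Tower of Hanoi with `discs` discs: 2**n - 1."""
    return 2 ** discs - 1


def solve(discs, source="A", target="C", spare="B"):
    """Return the optimal move list as (from_peg, to_peg) pairs."""
    if discs == 0:
        return []
    return (
        solve(discs - 1, source, spare, target)    # park the smaller stack on the spare peg
        + [(source, target)]                       # move the largest disc
        + solve(discs - 1, spare, target, source)  # bring the smaller stack back on top
    )


for n in (7, 8, 12):
    print(n, min_moves(n), len(solve(n)))
# 7 127 127
# 8 255 255
# 12 4095 4095
```

The algorithm is three lines of recursion. The hard part, for a person or a model working without tools, is executing thousands of steps without losing track of the state.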
Failure outside distribution isn’t evidence against cognition. It’s what cognition looks like when you push it past its harness or memory capabilities. What varies between systems is where the cliff sits and what tricks move it.
In my view, the Marcus argument proves a weaker thing than it claims. Current LLMs have specific cliffs in specific places. It doesn’t follow that the process “is not logical and intelligent.” It follows that the process has limits, which is true of every cognitive system we know (even you, sorry).
The standard no one has ever met
Suppose, steelman-to-steelman, that LLMs are exactly what Marcus says: interpolative engines that fail predictably at the training distribution’s edge. Here’s the question no one in this camp seems willing to ask out loud. What would “real reasoning” look like, such that humans would pass and LLMs would not, and where has that thing ever been observed?
The best current neuroscience doesn’t give you one. Predictive processing (the Karl Friston and Andy Clark framework) treats the brain as a hierarchical prediction machine, minimizing prediction error across levels from retina to frontal cortex. A 2024 meta-analysis of predictive-processing fMRI studies confirms it as current consensus, not fringe. If the leading account of the brain describes it as a prediction system, the intuition that “prediction” and “understanding” are obviously different categories is doing a lot of philosophical work on zero budget.
Geoffrey Hinton has been unusually blunt about this. In his 2024 and 2025 interviews (Mindplex, October 2024), he says GPT-4 “definitely understands,” and the old-fashioned AI claim that neural networks can’t reason without symbolic scaffolds was “just utterly wrong.” His symmetry point is the one I want to sit with: human understanding is the same kind of distributed-feature computation. Calling it something else in the biological case, and something less in the silicon case, is a move the evidence doesn’t license.
So here’s the category error, fully stated. “Just predicting tokens” only reads as a dismissal if you quietly import a contrast class of prediction-free, symbolic, grounded, non-statistical “real” reasoning. Neuroscience doesn’t have one. Introspection doesn’t have one. Functionalism has covered this ground for half a century. The standard does no work on LLMs because it does no work on us.
Which is why the frontier-lab practitioners who’ve actually built these systems have mostly stopped making the dismissive argument. Karpathy’s “summoned ghosts” isn’t a hype line. It’s a practitioner giving up on vocabulary that didn’t survive contact with the thing.

So back to my engineering coworker…
He’s not wrong that transformers predict the next token. He’s wrong that this tells us anything useful about whether they reason, and even less about whether they have meaningful utility.
Whelp, time for me to go stare at a blank wall …am I just predicting tokens?