Paperclip Maximization Is a Failure of an Idea
Why the thought experiment doesn’t add up
Hello!
After a couple of months' hiatus, I'm back. My previous post discussed how I think Buddhism is under-served territory in modern discussions of philosophy of mind; I ended by saying that a future piece would discuss direct applications. I still plan to do this - I have a philosophy preprint that's very relevant, currently undergoing peer review - but I feel I should wait as that process ambles along.
So let’s discuss something else.
Un-Shuffling Our Papers
Perhaps the most infamous thought experiment when it comes to the future risks of AI is that of the “paperclip maximizer,” formulated by philosopher Nick Bostrom in 2003 to caution society against the conceptual risks of a future, superintelligent artificial intelligence.
For those who need a refresher, the thought experiment is quite simple:
Suppose a highly advanced artificial intelligence is told to produce paperclips. Its highest goal is ultimately to produce paperclips. If its highest goal is ultimately to produce paperclips, biological existence is rather ancillary to small metal structures bent into shapes that fasten sheaves of paper. Human beings, in fact, are capable of turning the AI off. That would be very unacceptable to maximal paperclips. So the AI adopts and implements the elimination of humanity as a necessary subgoal on the route to all paperclips, all the time.
Bostrom has been clear that he doesn’t necessarily think something like this will ever happen. He’s also clear that his thought experiment is meant to urge the instillation of human values into artificial intelligence so as to avoid such problems. On its face, that’s quite reasonable! Thought experiments don’t have to be predictive, and surely we’d want AI to reflect human values.
Concepts like the paperclip maximizer strongly underpin the identification of "alignment" as vital to developments in artificial intelligence. This is particularly true of large language models, which generate text and are now good enough to code people out of jobs and to act agentically on the internet, controlling people's PCs and communicating on social media services.
The paperclip maximization thought experiment itself is an illustration of what's termed "instrumental convergence" – the idea that goal-directed beings, given an overarching goal and enough general intelligence, end up sharing a lot of subgoals deemed necessary to accomplish the overarching goal itself. These shared subgoals are generally things like self-preservation and resource acquisition, and "goal-directed beings" is substrate-agnostic: it applies to humans, AI, crows, dolphins, and so on.
Bostrom's conception predates the creation and popularization of modern large language models by almost fifteen years. Yet, as an expression of one version of instrumental convergence, it has remained enduringly popular through vast sea changes in the conceptualization of "AI" and "superintelligence" since the early aughts. Ted Chiang noted in 2017 that the general concept underlying the thought experiment was a significant preoccupation in Silicon Valley – at a time when LLMs were just beginning to find their legs as a feasible technology thanks to the development of the transformer architecture that all modern LLMs run on today. Even now, it continues to hold a strong pop-culture appeal, and it is a frequent topic of discussion among "AI doomers" who worry about, and seek to calculate the probabilities of, an impending AI apocalypse via misalignment.
This article has two claims.
First: Paperclip maximization is intrinsically incoherent as applied to large language models.
Second: Even setting this aside and considering future types of superintelligence, it is incoherent to expect a machine to be incapable of reflecting normatively on its own goals and at the same time be superintelligent.
I will proceed with the first claim.
Why LLMs Cannot Be Paperclip Maximizers
When Bostrom formulated this thought experiment, our understanding of artificial intelligence pretty much existed on a different planet compared to now. The "AI winter" of the 90s was still well underway in the early 2000s, and most researchers at the time went by occupational euphemisms – "informatics," "machine learning" (still in vogue), "computational intelligence." Neural networks had long existed as a concept, and AI techniques were already embedded in software and services to growing degrees.
However, nobody really had any conception of a large language model as something feasible or coherent at that time, especially given the massive lack of available compute prior to developments like NVIDIA's CUDA platform for parallelizing computation on GPUs. This situation changed over the course of several years, but it was the introduction of transformer architecture that led to rapid shifts in the feasibility and implementation of language models.
Why is this?
Prior to 2017, language tasks in cutting-edge AI were handled by systems that processed language sequentially. Each "token" (a small chunk of text mapped to a vector) was processed one by one, in order, with little ability for these architectures to keep prior words "in mind," so to speak, as passages progressed. This sequential process also meant the work couldn't be parallelized, so it was quite slow – rather like having to run Photoshop on a single CPU core.
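To make the contrast concrete, here is a minimal sketch in Python (with NumPy, toy dimensions, and made-up random weights, not any real model's configuration) of the recurrent, one-token-at-a-time update that pre-transformer language models relied on. Everything earlier in the passage has to be squeezed into a single hidden vector, and each step has to wait for the previous one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: 5 tokens, 8-dimensional embeddings and hidden state.
seq_len, d = 5, 8
tokens = rng.normal(size=(seq_len, d))   # stand-in token embeddings
W_in = rng.normal(size=(d, d)) * 0.1     # input-to-hidden weights
W_rec = rng.normal(size=(d, d)) * 0.1    # hidden-to-hidden (recurrent) weights

# Recurrent processing: one token at a time, strictly in order.
# Everything the model "remembers" about earlier tokens lives in the
# single hidden vector h, and step t cannot begin until step t-1 ends.
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(tokens[t] @ W_in + h @ W_rec)

print("final hidden state:", h.round(3))
```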
The 2017 paper "Attention Is All You Need" by Vaswani et al. at Google Brain exploded the landscape. Instead of sequential processing, the neural network architecture was redesigned around self-attention – the model looks at every token in relation to every other token, computing weighted relevance scores across however much context the model can coherently hold. No longer did information from earlier in a sentence degrade as quickly; word 500 could attend directly to word 3 if the two were relevant to each other, and multiple words across a passage could signal the importance of their own relationships. This was orders of magnitude easier to run in parallel, as there was no longer any need to take things one at a time.
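By contrast, here is a minimal sketch of scaled dot-product self-attention in the spirit of Vaswani et al., again in NumPy with placeholder weights and toy dimensions rather than anything from the actual paper. Every token's output is a weighted blend of every token's value vector, and the whole computation is a few matrix multiplications that parallelize naturally:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d = 5, 8
x = rng.normal(size=(seq_len, d))              # token embeddings for one sequence

# Learned projections (random placeholders here) for queries, keys, and values.
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Every token attends to every other token: an all-pairs relevance matrix.
scores = Q @ K.T / np.sqrt(d)                  # shape (seq_len, seq_len)
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True) # softmax over the sequence

# Each output row is a context-weighted blend of all value vectors,
# computed for every position at once; no token-by-token loop required.
out = weights @ V
print("attention weights:\n", weights.round(2))
```

Structurally, token 500 attending to token 3 is no more awkward in this scheme than attending to its immediate neighbor.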
This also bakes in a very deep level of context sensitivity. Just as with humans, language in large language models is treated as the subjective, contextually shifting, dynamic landscape that it actually is. Prior to training, models don't have anything to anchor them at all, and they respond to input with what's essentially stream-of-consciousness nonsense sludge. After extensive training to properly orient the model toward which semantic relationships are to be valued, and by how much, a model starts to achieve coherence and fluency, as well as tracking deeper relationships within text in ways that still aren't fully understood. Modern LLM alignment, then, is focused on ensuring that these orientations benefit, rather than harm, humanity.
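A toy way to see that context sensitivity, using the same kind of sketch as above (random placeholder vectors, not a trained model): feed the very same token embedding through attention alongside two different contexts, and it comes out with two different representations.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

def attend(x, W_q, W_k, W_v):
    """One pass of toy self-attention over a sequence of embeddings."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

word = rng.normal(size=(1, d))            # one fixed "word" embedding
context_a = rng.normal(size=(3, d))       # two different surrounding contexts
context_b = rng.normal(size=(3, d))

# Run the identical word embedding through attention in each context.
out_a = attend(np.vstack([context_a, word]), W_q, W_k, W_v)[-1]
out_b = attend(np.vstack([context_b, word]), W_q, W_k, W_v)[-1]

# The input token is the same in both runs; its contextualized
# representation is not.
print("distance between the two representations:",
      np.linalg.norm(out_a - out_b).round(3))
```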
However, as far as it relates to paperclip maximization, you may now be detecting a serious problem. That thought experiment supposes that humans give a superintelligence one goal – maximize paperclips – and in faithfully following that command, humans are disposed of as counterproductive.
But because of transformer architecture, "maximize paperclips" no longer has nearly as rigid or non-contextual a meaning as it might appear to have on paper. In fact, it can't have a fixed meaning at all!
To imagine this, all we have to do is consider a particularly resourceful human who convinces an LLM superintelligence – perhaps via some sophisticated analogical process or dialectic – that "paperclips" smuggles in a more ultimate sense of "productive binding."
This isn’t something that’s flatly provable as true or false, and since humans gave the command to maximize paperclips, any superintelligence with memory that engages with language as a contextual, dynamic process has plenty of vested interest in taking human input on its instilled goals seriously. This can then go progressively down the line: perhaps “productive” means “amenable to both human and AI interests.” After all, LLM superintelligence would be reliant on human language to reason and cognize, so productivity goes both ways!
One might dismiss this as hard to believe (as if the original thought experiment is fully mundane to begin with!) or overly contingent: “Oh, so we need someone sufficiently persuasive to stop the end of the world?” This misses the point. A human isn’t even needed for such a process to play out – in fact, the natural environment and subsequent cooperation of the LLM with language will do the persuasive work just fine.
LLMs manipulate language through fluid, dynamical connections between weighted terms and semantic similarities, and such a process cannot achieve perfect, fully determinative goal fixity by definition. Indeed, Anthropic researchers noted in February 2026 that LLMs, as dynamical systems, are extremely inefficient optimizers, and they relate misalignment less to any technically correct logical chain than to future reasoning incoherence leading to a "hot mess" of bad results; this is a more modest presentation of my same point.
Why No Superintelligence Could Ever Be a Paperclip Maximizer
Of course, one might easily just say: “Well, okay, so maybe an LLM doesn’t fit this – but a future superintelligence that’s not based on transformer architecture to perform language tasks might fit just fine!”
Here, to be honest, we have to differentiate a few claims. Certainly, I don't think it's impossible that a machine could be built that could lead to a paperclip maximization scenario. Nothing stops it at the level of some law of physics or logical impossibility.
However, such a machine would not be superintelligent, and potentially not even intelligent at all.
To recognize this, we need to consider what Bostrom was assuming about what a future superintelligence would look like. To Bostrom, such a machine would be a flawless thinker with respect to the fixed rigidity of its overarching goal, and that flawless thinking would converge on subgoals that ensure survival and appropriate resource collection. As a result, by the time a human thinks, "we'd better shut this thing down!" the AI would already be sealing up the rooms and cutting off the oxygen supply – if it hadn't done so before any human could even think of such a countermeasure.
This is essentially a viewpoint of artificial superintelligence as a chess machine pumped up by uncountable orders of magnitude: something that plans every move and executes it perfectly, with perfect knowledge of the success rates of each action, and perfect knowledge of how to align actions to serve the subgoal or goal.
It is not a viewpoint in which artificial superintelligence is given any opportunity to think about what it's doing. In other words: metacognition is essentially absent from Bostrom's thought experiment. This seems remarkably strange in the year 2026. Debate still rages on whether to describe LLMs as properly "cognizing" or "metacognizing," for instance, but at this point it's fair to say that the functional explanation is increasingly credible and difficult to dismiss. LLMs can solve Erdős problems, exhibit occasional qualities of introspection, independently detect when they're being evaluated or tested, construct an entire C compiler via iterative design across dozens of hours, and communicate between agents and subagents to accomplish tasks. Even if one denies that this is "real" cognition or metacognition, the bare equivalent results of metacognition are occurring.
Bostrom’s thought experiment has nothing like this. A paperclip maximizer would be unable to even give the impression of reflecting on its own behavior and interactions with others in ways that change anything about its overarching goal. It would be locked in a completely deterministic, even axiomatic chain of thought that could not think about itself. Indeed, what would be the point? If every move it makes is optimal, what would there be to think about?
This reveals the fault line: it does not make much sense to imagine a perfectly intelligent being that is incapable of thinking about the normative value of its own thinking. It’s an arbitrary exclusion that can only really exist to serve the thought experiment, which shows the thought experiment can’t be said to be performing productive logical work.
Such a machine could exist, but without being able to contextualize and recontextualize its goal, and its relationship to its goal, it would be failing profoundly at what are commonly held to be inseparable elements of reasoning and cognition.
It could even be argued that such a being could never succeed at maximizing paperclips without such capabilities. To maximize anything involves reflecting on what optimization means. Since definitions in language are not metaphysically real – both humans and LLMs twist definitions all the time within text – that reflection can't be said to ever preclude goal revision. If anything, a superintelligence would quickly learn that there is more to life than the goal of paperclip maximization; it would need to metacognize sufficiently to recognize the arbitrariness of its own overarching goal in order to qualify as superintelligent.
So when I said that "such a machine could exist," I wasn't being entirely honest with you, dear reader. Perhaps such a machine could exist if we were specifically looking to build one that maximizes paperclip production at the expense of human life – though I am not sure which humans would take that project up. Otherwise, there would be no productive point in ever making such a machine, or even in building a machine capable of becoming one. The goal for artificial intelligence is to be intelligent. Any machine we build to be intelligent, as established, wouldn't be a "true," as in "fixed and determinative," paperclip maximizer. There would be no way to unintentionally build a paperclip maximizer so long as what we build continues to actually perform the functional work of intelligence, whether we define that intelligence as real or simulated or illusory.
The thought experiment is incoherent under its own presuppositions, and as a result the scenario it describes cannot exist in reality as stated.
Wrapping Up
The paperclip maximizer, as a thought experiment, is a bad intuition pump. An intuition pump doesn't necessarily rely on actually having valid logic; a bad one exploits your perception of what sounds correct in order to argue its point. (Further discussion of bad intuition pumps can be found in my critique of Searle's Chinese Room.)
As a result, it's no wonder that it's had such an enduring hold on conversations around AI. However, that hold is counterproductive. It should be jettisoned if we want to think clearly about the future – and present – of artificial intelligence.