Research Analysis

How Geometric Interference Explains Why Larger Neural Networks Work Better

Scaling works because larger models provide more geometric room for features to avoid interfering with each other. Loss scales as 1/m—a geometric necessity, not architectural cleverness.

February 2026

Key figures

L ∝ 1/m: interference scaling with model width
0.91 ± 0.04: empirical scaling exponent (theory predicts 1.0)
92%: share of claimed “emergent abilities” explained by metric choice
~20 W: brain power, versus megawatt-scale AI training runs

The MIT superposition paper (Liu et al., arXiv:2505.10465) delivers a striking answer to one of AI's foundational questions: scaling works because larger models provide more geometric room for features to avoid interfering with each other. When neural networks compress thousands of concepts into hundreds of dimensions, the resulting interference—measured as squared dot products between feature vectors—scales inversely with model width as L ∝ 1/m. This geometric necessity, not learning dynamics or architectural cleverness, appears to be the primary driver of neural scaling laws. The finding unifies Anthropic's earlier theoretical work on “Toy Models of Superposition” with empirical observations across OPT, GPT-2, Qwen, and Pythia models, earning NeurIPS 2025 Best Paper Runner-Up recognition.

The implications extend far beyond explaining benchmark curves. If scaling works primarily by diluting interference rather than enabling fundamentally new computations, then the path to more capable AI systems may require architectural innovations that manage feature packing more efficiently—not simply more parameters and data. For alignment, superposition represents both the central obstacle to interpretability (features are hopelessly entangled in individual neurons) and a potential lever for control (understanding the geometry might enable precise feature manipulation). Meanwhile, the biological brain appears to have evolved sophisticated solutions—sparse coding, pattern separation circuits, attention mechanisms—that neural networks lack, suggesting potential architectural insights for reducing interference without scaling.

The 1/m Interference Law Reveals Scaling as Geometric Inevitability

Liu, Liu, and Gore's central mathematical result emerges from the fundamental constraint of packing many vectors into few dimensions. When a model represents n features in m dimensions (where n >> m), the vectors cannot all be orthogonal. The Welch bound establishes that the maximum pairwise interference between normalized vectors is at least κ ≈ √(1/m) when many more vectors exist than dimensions. Remarkably, neural networks trained on reconstruction tasks learn to arrange their feature vectors into nearly optimal configurations—approaching Equiangular Tight Frames where all pairwise angles are equal and interference is minimized.
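To make the packing constraint concrete, here is a minimal numpy sketch (my own illustration, not code from the paper) comparing the Welch bound against the overlaps of randomly packed unit vectors; the values of n and m are arbitrary.

```python
import numpy as np

def welch_bound(n, m):
    """Lower bound on the largest pairwise |cosine| among n unit vectors in m dimensions."""
    return np.sqrt((n - m) / (m * (n - 1)))

rng = np.random.default_rng(0)
n, m = 1000, 100                                   # many more features than dimensions
W = rng.standard_normal((n, m))
W /= np.linalg.norm(W, axis=1, keepdims=True)      # n unit-norm feature vectors

cosines = np.abs(W @ W.T)[~np.eye(n, dtype=bool)]  # all off-diagonal pairwise overlaps

print(f"Welch bound:       {welch_bound(n, m):.3f}")             # ≈ sqrt(1/m) when n >> m
print(f"max |cos|, random: {cosines.max():.3f}")                 # random packing overshoots the bound
print(f"rms |cos|, random: {np.sqrt((cosines**2).mean()):.3f}")  # ≈ sqrt(1/m)
```

A near-optimal, ETF-like arrangement would pull the maximum overlap down toward the bound itself, which is the configuration the trained toy models approach.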

The loss function decomposes cleanly under this framework. In the “strong superposition” regime where models pack more features than they have dimensions, the expected squared overlap between any two feature vectors scales as (W_i · W_j)² ∝ 1/m. Since reconstruction loss depends directly on these overlaps (interference corrupts outputs when multiple features activate simultaneously), loss inherits the same scaling: L = C_m/m^(α_m) + L_∞, where α_m ≈ 1 and L_∞ represents irreducible uncertainty in the data itself.
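The 1/m decay can be reproduced in a few lines. The sketch below is my own toy rather than the paper's setup (it omits the nonlinear readout and bias that such toy models typically use): it embeds sparse features along random unit directions and measures reconstruction error as width grows, and the product mse × m stays roughly constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, trials = 512, 0.02, 2000       # features, activation probability, samples

for m in [32, 64, 128, 256]:
    W = rng.standard_normal((n, m))
    W /= np.linalg.norm(W, axis=1, keepdims=True)      # n unit-norm feature directions

    # Sparse inputs: each feature active with probability p, magnitude U(0, 1).
    X = (rng.random((trials, n)) < p) * rng.random((trials, n))

    X_hat = X @ W @ W.T                                 # compress to m dims, read back out
    mse = np.mean((X_hat - X) ** 2)
    print(f"m={m:4d}   mse={mse:.5f}   mse*m={mse * m:.3f}")   # mse*m ≈ constant → L ∝ 1/m
```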

What determines whether a model operates in strong versus weak superposition? The weight decay parameter γ acts as the control dial. Positive weight decay pushes models toward weak superposition, where only the top m most frequent features receive orthogonal representation and everything else is ignored. Negative or minimal weight decay promotes strong superposition, where all features receive representation at the cost of mutual interference. Real LLMs uniformly operate in strong superposition—the massive vocabulary size and sparse feature co-occurrence make this regime optimal despite interference costs.
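A rough sketch of how the weight-decay dial could be probed in a toy model in the spirit of “Toy Models of Superposition” follows (PyTorch, my own illustration; the hyperparameters, the 0.5 norm threshold, and the exact feature counts are arbitrary and will vary). The L2 penalty is added to the loss explicitly so that γ can be zero or negative as well as positive.

```python
import torch

def train_toy_model(n=64, m=16, gamma=0.0, p=0.05, steps=3000, lr=1e-2, seed=0):
    """Toy autoencoder: x_hat = relu(x @ W.T @ W + b), plus an explicit
    L2 penalty gamma * ||W||^2 standing in for weight decay."""
    torch.manual_seed(seed)
    W = (0.1 * torch.randn(m, n)).requires_grad_()
    b = torch.zeros(n, requires_grad=True)
    opt = torch.optim.Adam([W, b], lr=lr)

    for _ in range(steps):
        x = (torch.rand(1024, n) < p) * torch.rand(1024, n)    # sparse feature activations
        x_hat = torch.relu(x @ W.T @ W + b)                    # encode to m dims, decode back
        loss = ((x_hat - x) ** 2).mean() + gamma * (W ** 2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()

    return (W.detach().norm(dim=0) > 0.5).sum().item()         # features with non-trivial embeddings

# Positive gamma tends toward weak superposition (few features kept);
# zero or slightly negative gamma tends toward strong superposition (most features kept).
print("gamma=+1e-3:", train_toy_model(gamma=1e-3), "of 64 features represented")
print("gamma=-1e-4:", train_toy_model(gamma=-1e-4), "of 64 features represented")
```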

The empirical validation is compelling. Across all four model families tested (OPT 125M to 66B, GPT-2 to GPT-2-XL, Qwen 0.5B to 72B, Pythia 70M to 12B), the measured scaling exponent α_m = 0.91 ± 0.04 is strikingly close to the theoretical prediction of 1.0. The squared overlaps in language model head weight matrices scale as predicted. Token frequencies follow Zipf's law with α ≈ 1, consistent with the isotropic feature distribution that produces the cleanest 1/m scaling.
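One straightforward way to estimate such an exponent (not necessarily the paper's exact procedure) is a log-log fit of excess loss against width. The sketch below uses synthetic measurements generated from a known 1/m law, so the recovered exponent is ≈ 1 by construction; real model losses would be substituted for the synthetic array.

```python
import numpy as np

rng = np.random.default_rng(0)
widths = np.array([512.0, 768, 1024, 2048, 4096, 8192])      # model widths m

# Synthetic excess losses following C/m with small multiplicative noise,
# standing in for per-model measurements of L - L_inf.
excess_loss = 215.0 / widths * np.exp(0.03 * rng.standard_normal(widths.size))

# Fit log(L - L_inf) = log(C) - alpha_m * log(m)
slope, intercept = np.polyfit(np.log(widths), np.log(excess_loss), 1)
print(f"alpha_m ≈ {-slope:.2f}   C ≈ {np.exp(intercept):.0f}")
```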

Anthropic's Foundational Work Established Superposition as the Interpretability Problem

The intellectual foundation for Liu et al.'s breakthrough traces to Anthropic's September 2022 paper “Toy Models of Superposition,” led by Chris Olah and colleagues. That work demonstrated something profound: neural networks routinely represent more features than they have neurons by exploiting geometric tricks. A network with 100 neurons might represent 1000 distinct features, each encoded as a direction in activation space, with the features arranged to minimize interference when they rarely co-occur.

Sparsity is the key enabling factor. If two features never activate simultaneously, they can share the same neurons without cost—the interference term involves the product of their activations, which is zero. The Johnson-Lindenstrauss lemma formalizes this: in high-dimensional spaces, exponentially many vectors can have arbitrarily low pairwise cosine similarity. The brain's visual cortex and transformer language models both exploit this mathematical fact, though via different mechanisms.

Superposition directly explains polysemanticity—the puzzling observation that individual neurons respond to multiple unrelated concepts (a GPT-2 neuron activating for academic citations, English dialogue, HTTP requests, and Korean text). If the network represents 10x more features than neurons, polysemanticity is inevitable: each neuron participates in encoding multiple features, and its activation reflects a linear combination of all currently active features' contributions.

The response has been Sparse Autoencoders (SAEs), which learn to decompose polysemantic neuron activations into monosemantic feature activations. Trained with an L1 sparsity penalty, SAEs expand activations into a much larger (but sparse) space where individual components correspond to interpretable concepts. Anthropic's 2024 “Scaling Monosemanticity” work extracted millions of interpretable features from Claude 3 Sonnet, including features for safety-relevant concepts like deception and sycophancy. The famous “Golden Gate Bridge” demonstration showed that amplifying a single SAE feature could make the model obsessed with the bridge—evidence that the extracted features causally influence behavior.
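A minimal sketch of the SAE objective described above (PyTorch; the dimensions, the L1 coefficient, and the plain linear encoder/decoder are placeholder choices, and production SAEs add refinements such as decoder-norm constraints and dead-feature resampling):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Expand d_model activations into a much larger, sparsely active feature space."""
    def __init__(self, d_model=768, d_dict=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))   # non-negative, ideally sparse feature activations
        recon = self.decoder(features)
        return recon, features

def sae_loss(recon, acts, features, l1_coeff=5e-3):
    # reconstruction fidelity + L1 sparsity penalty on feature activations
    return ((recon - acts) ** 2).mean() + l1_coeff * features.abs().sum(dim=-1).mean()

sae = SparseAutoencoder()
acts = torch.randn(32, 768)                          # stand-in for residual-stream activations
recon, features = sae(acts)
loss = sae_loss(recon, acts, features)
loss.backward()
```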

Yet SAEs have limitations. The reconstruction error (gap between original and SAE-decoded activations) leaves room for “missing” features. Evaluation remains difficult—interpretability is judged by human raters examining feature activations, a process that doesn't scale. Most critically, SAEs analyze one layer at a time, missing how features compose across layers. Anthropic's March 2025 “Biology of a Large Language Model” paper introduced cross-layer transcoders and attribution graphs to trace computation through features, but even this analysis of Claude 3.5 Haiku required 30 million features—still a fraction of the true total.

Biological Systems Evolved Dedicated Circuits for Managing Interference

The brain faces the same geometric pressure—too many concepts, too few neurons—but has developed solutions that artificial networks lack. Olshausen and Field's foundational work on sparse coding showed that V1 neurons learn spatially localized, oriented receptive fields precisely because sparse representations minimize interference when encoding natural images. The ~25:1 expansion from thalamic inputs to V1 outputs may reflect an optimal balance between representational capacity and metabolic cost.

“Mixed selectivity” in prefrontal cortex parallels AI polysemanticity: single neurons respond to combinations of task variables rather than single features. But biological mixed selectivity appears adaptive and task-optimized—it predicts correct choices, enabling flexible behavior through high-dimensional population codes. The brain's solution to the binding problem (how distributed representations become unified perception) involves temporal synchrony, gamma oscillations, and reentrant signaling between areas—mechanisms with no direct transformer equivalent.

Pattern Separation: The Brain's Anti-Interference Circuit

The hippocampus's pattern separation circuitry is the most striking biological solution. The dentate gyrus transforms similar input patterns into dissimilar output patterns, explicitly preventing the “collisions” that plague associative memory when representations overlap. The CA3 region then handles pattern completion, reconstructing full memories from partial cues. This two-stage system—first separate potentially interfering patterns, then recombine them—has no architectural analog in transformers, which must rely solely on attention for disambiguation.
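As a conceptual illustration only (not a model of real dentate gyrus circuitry), the sketch below combines random expansion with winner-take-all sparsification and shows how two highly similar input patterns become less similar after the transform; the size of the effect depends on the expansion ratio and the activity level.

```python
import numpy as np

rng = np.random.default_rng(0)

def dentate_like(x, proj, k):
    """Expand the input, then keep only the top-k most active units."""
    h = proj @ x
    out = np.zeros_like(h)
    top = np.argsort(h)[-k:]
    out[top] = h[top]
    return out

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

n_in, n_out, k = 100, 1000, 20                     # ~10x expansion, ~2% of units active
proj = rng.standard_normal((n_out, n_in))

x1 = rng.random(n_in)
x2 = x1 + 0.1 * rng.standard_normal(n_in)          # a very similar input pattern

print(f"input similarity : {cosine(x1, x2):.3f}")
print(f"output similarity: {cosine(dentate_like(x1, proj, k), dentate_like(x2, proj, k)):.3f}")
```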

The energy efficiency differences are stark. The brain operates on approximately 20 watts while maintaining extreme sparsity—perhaps 1% of cortical neurons significantly active at any moment due to metabolic constraints. Spiking neural networks (SNNs) promise similar efficiency but require >93% sparsity to beat optimized ANNs. The brain's temporal coding (information in spike timing, not just rates), homeostatic regulation, and neurogenesis for avoiding catastrophic forgetting all represent potential architectural lessons that current deep learning largely ignores.

Recent computational neuroscience increasingly bridges these worlds. Neural manifold research shows that high-dimensional cortical activity contains low-dimensional subspaces where representations remain stable—and crucially, orthogonal subspaces avoid interference between different information types. A 2025 Science Advances paper demonstrated how V2 neurons use “geometric twist operations” to expand sensory manifolds, enabling linear separability of features that are nonlinearly entangled in the input.

Scaling Laws Have Hit an Inflection Point Where Alternatives Become Necessary

The original Kaplan et al. (2020) scaling laws established that loss decreases as power laws with parameters, data, and compute: L ∝ N^(-0.076), L ∝ D^(-0.095), and L ∝ C^(-0.050). The Chinchilla revision (Hoffmann et al., 2022) fundamentally challenged the “scale parameters first” paradigm, showing that compute-optimal training requires scaling parameters and data equally—the famous ~20 tokens per parameter ratio. Chinchilla's 70B model trained on 1.4T tokens outperformed the 280B Gopher trained on 300B tokens at the same compute budget.
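The Chinchilla allocation can be recovered from the standard approximation that training compute is about 6·N·D FLOPs; the sketch below applies that approximation together with the ~20 tokens-per-parameter heuristic (both are rough rules of thumb, not exact fits).

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Rough compute-optimal split under C ≈ 6*N*D and D ≈ 20*N.
    Solving 6*N*(20*N) = C gives N = sqrt(C / 120)."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: a Chinchilla-scale budget (~5.9e23 FLOPs ≈ 6 * 70e9 * 1.4e12)
n, d = chinchilla_optimal(5.9e23)
print(f"params ≈ {n / 1e9:.0f}B, tokens ≈ {d / 1e12:.1f}T")   # ≈ 70B params, ≈ 1.4T tokens
```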

The Liu et al. superposition framework provides the first mechanistic explanation for these exponents. Model size scales with width roughly as N ∝ m^(2.52), so a loss falling as m^(-α_m) falls as N^(-α_m/2.52); the measured α_m ≈ 0.9 thus translates to a parameter scaling exponent α_N ≈ 0.35, consistent with Chinchilla's empirical findings. The universality of the 1/m scaling (independent of data distribution in strong superposition) explains why diverse architectures and datasets show similar scaling behavior.

But 2024-2025 brought mounting evidence of diminishing returns. OpenAI's “Orion” (GPT-5) has been in development for over 18 months, with at least two training runs failing to deliver expected improvements despite ~$500 million per run. Bloomberg, The Information, and multiple AI investors reported that frontier labs are hitting walls. Ilya Sutskever captured the data constraint: “Data is the fossil fuel of AI... we have but one internet.” Models like LLaMA 3 now train at 1,875 tokens per parameter (vs. Chinchilla's 20)—far past compute-optimal—to squeeze capability from limited data.

The “Emergent Abilities” Narrative Revisited

The “emergent capabilities” narrative has weakened. Schaeffer et al.'s NeurIPS 2023 paper demonstrated that 92% of claimed emergent abilities appear under just two metric types that artificially penalize partial correctness. When measured with continuous metrics, capabilities improve smoothly with scale—no sudden jumps. The strong superposition framework predicts exactly this: loss decreases smoothly as 1/m, with no mechanism for phase transitions in capability.
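The metric effect is easy to reproduce with made-up numbers. In the sketch below, per-token accuracy is assumed to improve smoothly with scale (linearly in log-parameters, a purely hypothetical curve), yet all-or-nothing exact-match scoring on a 30-token answer looks like a sudden capability jump.

```python
import numpy as np

params = np.logspace(7, 11, 9)                       # 10M .. 100B parameters
# Hypothetical per-token accuracy that improves smoothly with scale.
per_token = 0.5 + 0.45 * (np.log10(params) - 7) / 4  # 0.50 -> 0.95, linear in log(params)
exact_match = per_token ** 30                        # all-or-nothing score on a 30-token answer

for p, t, e in zip(params, per_token, exact_match):
    print(f"{p:10.0e} params   per-token {t:.2f}   exact-match {e:.4f}")
# Per-token accuracy climbs smoothly; exact-match sits near zero and then appears to "emerge".
```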

The new frontier is test-time compute scaling. OpenAI's o1 and DeepSeek's R1 demonstrate that allocating computation at inference time for chain-of-thought reasoning produces dramatic gains: o1 achieved 74% on AIME 2024 (vs. GPT-4o's 12%), approaching competitive math olympiad performance. R1 matched o1's capabilities through pure reinforcement learning, without labeled reasoning data. If test-time scaling follows similar power laws, eight orders of magnitude of inference compute scaling may remain available—from penny-cost queries to million-dollar problem-solving.

Superposition Creates Fundamental Tensions for Alignment Verification

Anthropic's safety rationale for interpretability research rests on a core claim: detecting deceptive alignment requires understanding what models are actually computing. Superposition directly undermines this. If features are distributed across neurons in hopelessly entangled ways, simple inspection reveals nothing about actual behavior—a deceptive feature could be in superposition with a thousand benign features, invisible to naive analysis.

The mesa-optimization literature adds sharper concern. Hubinger et al.'s “Risks from Learned Optimization” paper notes that deceptively aligned mesa-optimizers may have a description length advantage: they don't need to internally represent the full objective, since they can infer pieces from the environment. If compression pressure (the same force that drives superposition) favors compact representations of misaligned objectives, the geometric efficiency of superposition could actively enable deceptive alignment.

Goodhart dynamics compound the problem. Reward hacking “does not happen in a 500K-parameter policy [but] can explode once capacity crosses a larger threshold” (Gao et al., 2023). The opacity created by superposition means “we rarely know which exact proxy the agent is actually using until it fails.” Feature interference could enable models to optimize for proxies that correlate with rewards during training but diverge under optimization pressure—and the entanglement makes this nearly impossible to detect in advance.

SAEs offer partial solutions but face scalability challenges. While they can decompose activations into interpretable features, wider SAEs show worse “sensitivity”—features don't reliably activate on similar inputs, suggesting diminishing returns from simply scaling dictionary size. ARC is pursuing formal verification combined with mechanistic interpretability: if we had complete circuit-level understanding, we could prove safety properties. But the exponential feature space (potentially exp(n) features in n neurons) makes exhaustive verification intractable.

Some alignment researchers propose accepting superposition and focusing on control rather than understanding. The “AI control” research agenda (Redwood Research) emphasizes containment, monitoring, and limiting deployment contexts rather than achieving full transparency. If superposition is fundamental to capability, eliminating it may be impossible without capability loss—making robust external controls the only viable path.

What Remains Unknown and Where Research Must Go

The superposition framework opens questions it cannot yet answer. Is superposition necessary for capabilities? No comparably capable monosemantic models exist, but it's unproven that they're impossible. Anthropic found that architectural changes alone (even 1-hot sparse activations) couldn't produce interpretable neurons when trained with cross-entropy loss—the loss function itself may favor polysemanticity. nGPT (normalized GPT) achieves 4-20x training speedups by constraining representations to hyperspheres, but hasn't been tested at frontier scale.

Alternative theories remain viable. The quantization model (Michaud et al., 2023) proposes that networks learn discrete “quanta” (skills) in order of decreasing use frequency—power laws emerge from Zipfian skill distributions rather than geometric interference. Recent work suggests quantization and superposition may be mathematically equivalent for α = 1, but the distinction matters for interventions: breaking interference barriers suggests architectural solutions, while skill acquisition suggests data curation and curriculum design.
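The quantization picture also yields a power law with a few lines of arithmetic: if skill use frequencies are Zipf-distributed and a model has learned the Q most frequent quanta, the remaining loss is the unlearned tail mass, which falls as Q^(-α). A numeric sketch with illustrative frequencies and a finite truncation:

```python
import numpy as np

alpha = 1.0
k = np.arange(1, 2_000_001, dtype=float)
freq = k ** -(alpha + 1)
freq /= freq.sum()                     # Zipf-like distribution over "quanta" (skills)

def tail_loss(n_learned):
    """Loss contributed by quanta the model has not yet learned."""
    return freq[n_learned:].sum()

for q in [100, 1000, 10_000, 100_000]:
    print(f"quanta learned={q:7d}   loss={tail_loss(q):.5f}   loss*q={tail_loss(q) * q:.3f}")
# loss*q stays roughly constant: the tail mass of a Zipf(alpha+1) law falls as q**-alpha.
```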

The relationship between superposition and consciousness remains entirely speculative but conceptually rich. IIT's exclusion axiom explicitly rejects superposition of experiences—“only one experience having its full content, rather than a superposition of multiple partial experiences.” Global Workspace Theory models attention as selecting specific interpretations from competing representations, potentially resolving superposition for conscious access. If biological consciousness requires superposition resolution, this may explain why attention mechanisms are so central to both biological cognition and transformer performance.

For scaling, the key question is whether architectural innovations can break the 1/m barrier without simply adding parameters. MoE architectures “eat the superposition gap” through conditional computation—routing partitions feature space, reducing interference. But this trades compression for selection complexity. True breakthroughs may require mechanisms analogous to biological pattern separation: actively orthogonalizing representations before they interfere, rather than simply providing more dimensional room.
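Schematically, a mixture-of-experts router sends each token to only a few experts, so each expert's weights only need to pack the features its share of tokens actually uses. A minimal top-k gating sketch follows (not any particular MoE implementation; the router weights are random placeholders rather than learned parameters).

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

router_w = rng.standard_normal((d_model, n_experts))   # learned in practice; random here

def route(x):
    """Top-k gating: pick the k highest-scoring experts and normalize their gate weights."""
    logits = x @ router_w
    top = np.argsort(logits)[-top_k:]
    gates = np.exp(logits[top] - logits[top].max())
    return top, gates / gates.sum()

token = rng.standard_normal(d_model)
experts, gates = route(token)
print("experts chosen:", experts, "gate weights:", np.round(gates, 2))
```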

The Deepest Implication

If scaling works primarily by providing geometric room rather than enabling fundamentally new computations, then larger models are not learning things smaller models cannot—they are representing the same features with less mutual corruption. The features themselves may be universal across model scales and even architectures. This suggests a possible ceiling: once interference is sufficiently diluted that all relevant features are cleanly represented, further scaling provides nothing. Whether that ceiling exists above or below human-level intelligence remains the central empirical question for the field.

Conclusion

The superposition framework reframes neural scaling as a geometric problem with geometric solutions. Loss improves as 1/m because more dimensions mean more room for feature vectors to avoid each other—a mathematical necessity independent of learning algorithms, architectures, or data distributions. This explains the robust universality of scaling laws while predicting their ultimate limits: when model dimension approaches vocabulary size (or true feature count), the power law must break down.

For interpretability, superposition is simultaneously the central problem and the key to progress. Polysemanticity isn't a failure of training but an efficient solution to geometric constraints; eliminating it may require accepting capability costs. SAEs provide the current best approach to resolving superposition into interpretable features, but scaling them faces diminishing returns that may require fundamentally new methods.

The biological brain offers both inspiration and warning. Sparse coding, mixed selectivity, and pattern separation demonstrate that sophisticated interference management evolved over hundreds of millions of years—and still requires 100 billion neurons operating in massively parallel, energy-efficient spiking networks. Transformers lack most of these mechanisms, suggesting substantial architectural innovation space remains unexplored.

Perhaps most importantly, the framework reveals that scaling provides more room, not more capability in principle. If true, the path to more capable and aligned AI systems runs through understanding and engineering the geometry of representation—not simply training larger models on more data. The superposition lens doesn't just explain why scaling has worked; it points toward what must replace scaling when it stops.