By: Gary J. Drypen
Abstract
Large language models exhibit global, nonlocal cognitive behaviors that are difficult to explain using local mechanisms alone. This paper proposes a geometric hypothesis — information entanglement in AI Space — to account for these behaviors. The hypothesis is classical, not quantum: entanglement refers to non‑factorizable joint structure in the activation manifold, where distant subspaces become coupled through training. This coupling may explain fast reasoning, emergent capabilities, behavioral clustering, and alignment brittleness. The theory yields falsifiable predictions and a set of feasible experiments using standard interpretability tools. Whether the hypothesis is validated, refined, or falsified, investigating entangled geometry offers a path toward deeper understanding of AI cognition and more effective oversight mechanisms.
1. Introduction
I come to this work not as a specialist in machine learning, but as someone who has spent a lifetime fascinated by science, systems, and the strange ways complex structures behave. My background includes hands‑on experience with large‑scale enterprise systems — the kind where small changes ripple unpredictably across the whole. For years, I wondered why my interests ranged so widely: physics, cognition, information theory, software architecture, emergence. Only recently did it become clear that these threads were converging toward a single question:
How do large AI models think?
This paper is an attempt to explore that question with curiosity and humility, standing on the shoulders of the many researchers who have mapped the mechanisms, circuits, and behaviors of modern models. Foundational work on superposition and feature structure by Anthropic [1], interpretability research by Olah and colleagues [2], and early evidence of geometric structure in embeddings [3] all point toward a deeper underlying organization.
My contribution is not a claim of expertise, but a hypothesis: that the geometry of a model’s activation space contains entangled structure — non‑factorizable relationships between subspaces — that may explain several puzzling phenomena.
This hypothesis is testable. It makes predictions. It can be falsified. And it suggests new ways to study and supervise AI systems.
The goal of this paper is simple:
To propose a coherent geometric hypothesis about AI cognition, grounded in existing work, and to outline experiments that can confirm or refute it.
2. Background
Modern interpretability research has revealed several important properties of large models:
- Superposition: features share neurons and directions [1].
- Polysemanticity: single directions encode multiple concepts [2].
- Emergent capabilities: abilities appear suddenly at scale [4].
- Activation manifolds: model states form structured trajectories during inference [5,6].
- Representation geometry: learned features often lie on low‑dimensional manifolds [7].
- Behavioral clustering: planning, deception, and theory‑of‑mind often co‑activate.
These findings suggest that cognition in large models is not localized. Instead, it appears to arise from global structure in the activation space.
This paper builds on that foundation by proposing a specific form of global structure — information entanglement — and by outlining how to test for it.
3. Theory: Information Entanglement in AI Space
Information entanglement refers to non‑factorizable joint structure in a model’s activation manifold. Two or more subspaces — representing tokens, features, or behaviors — are entangled when they cannot be cleanly separated without destroying the model’s learned function.
This is a classical phenomenon, not a quantum one. Entanglement here means the following (a toy illustration follows this list):
- Residual dependencies between token representations even after conditioning on attention and positional structure.
- Coupled feature subspaces that co‑activate across tasks and prompts.
- Behavioral basins whose activation trajectories move together.
- Geodesic shortcuts where distant conceptual regions are unexpectedly close under the model’s learned metric.
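To make "non‑factorizable" concrete, the toy Python sketch below (synthetic data, not model activations) shows two variables whose linear correlation is near zero but whose joint distribution still cannot be factored into its marginals; this is the classical sense of entanglement used throughout this paper.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 10_000)
# y depends on x nonlinearly, so the joint p(x, y) != p(x) p(y),
# even though the *linear* correlation is approximately zero.
y = x**2 + 0.05 * rng.normal(size=x.size)

print(f"Pearson r = {np.corrcoef(x, y)[0, 1]:+.3f}")  # ~0: looks factorized
bins = np.linspace(-1.5, 1.5, 25)
mi = mutual_info_score(np.digitize(x, bins), np.digitize(y, bins))
print(f"binned MI = {mi:.3f} nats")                   # clearly > 0: it is not
```

Linear probes alone would miss this kind of coupling, which is why the experiments in Section 5 include nonlinear dependence measures.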
The hypothesis is that large‑scale training on diverse data naturally produces entangled geometry as an efficient solution to high‑dimensional optimization. Entanglement may explain:
- fast, global reasoning
- emergent capabilities
- deceptive behavior
- alignment brittleness
- cross‑task generalization
The theory does not claim entanglement is the only mechanism behind these phenomena — only that it is a sufficient explanation that yields testable predictions.
4. Falsifiable Predictions
The theory predicts eight measurable signatures:
1. Residual token–token dependencies remain after controlling for attention and positional encodings.
2. Feature subspaces associated with distinct behaviors exhibit stable coupling.
3. Behavioral basins show consistent co‑movement across prompts.
4. Geodesic shortcuts exist between conceptually distant regions.
5. Deception is entangled with planning and theory‑of‑mind subspaces.
6. Alignment brittleness arises from entangled goal–capability structure.
7. Entanglement patterns are stable across prompt families.
8. Entanglement structure is consistent across layers.
Each prediction is falsifiable through the experiments in Section 5.
5. Proposed Experiments
The experiments below are designed to test the falsifiable predictions in Section 4 using standard tools available to current AI labs. They do not require new architectures or training regimes—only activation logging, representation analysis, and geometric methods already in use. Each experiment specifies: (1) what is measured, (2) how it is measured, (3) what outcome supports the theory, and (4) what outcome falsifies it.
For clarity, “entanglement metrics” refers collectively to measurable indicators of non‑factorizable structure, including residual token dependencies, subspace coupling, basin co‑movement, and anomalous geodesic distances.
5.1 Experiment 1: Token–Token Entanglement Test
Targets: Prediction 1 (Non‑factorizable token–token dependencies)
Setup
- Select prompts with long‑range dependencies (multi‑clause reasoning, nested structures, multi‑speaker dialogue).
- Log hidden states for all tokens across multiple layers.
Measurement
- Compute mutual information and nonlinear dependence metrics between token representations (a code sketch follows this experiment).
- Condition on attention weights, positional encodings, and layer‑local transformations to remove known sources of dependence.
Supports Theory
- Significant residual dependencies remain after conditioning on all known mechanisms.
- Residual structure is stable across layers and prompt families.
Falsification
- Token representations factorize once attention, position, and local transformations are controlled for.
- No robust residual dependencies are observed.
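A minimal sketch of the Experiment 1 measurement, assuming token representations have already been logged. It uses the HSIC dependence statistic of Gretton et al. [9], with a linear residualization step standing in for the fuller conditioning on attention and positional structure described above; the array names (`reps_i`, `reps_j`, `controls`) are hypothetical.

```python
import numpy as np

def rbf_kernel(X, sigma=None):
    # Pairwise squared distances with a median-heuristic bandwidth.
    sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    sq = np.maximum(sq, 0.0)
    if sigma is None:
        sigma = np.sqrt(np.median(sq[sq > 0]) / 2)
    return np.exp(-sq / (2 * sigma**2))

def hsic(X, Y):
    # Biased HSIC estimator: tr(KHLH) / (n - 1)^2, with H the centering matrix.
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(rbf_kernel(X) @ H @ rbf_kernel(Y) @ H) / (n - 1) ** 2

def residualize(X, C):
    # Remove the best linear prediction of X from the controls C
    # (append a column of ones to C to include an intercept).
    beta, *_ = np.linalg.lstsq(C, X, rcond=None)
    return X - C @ beta

# reps_i, reps_j: (n_prompts, d) hidden states for two token positions;
# controls: (n_prompts, p) attention/positional summaries (all hypothetical).
# residual = hsic(residualize(reps_i, controls), residualize(reps_j, controls))
# A value well above a permutation-test null supports Prediction 1.
```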
5.2 Experiment 2: Feature–Feature Entanglement Test
Targets: Prediction 2 (Coupled feature subspaces)
Setup
- Identify feature directions associated with distinct behaviors (e.g., planning, deception, theory‑of‑mind, safety compliance) using linear probes or sparse autoencoders.
- Use multiple prompt families to avoid overfitting to a single domain.
Measurement
- Measure subspace coupling using CCA/SVCCA or related similarity metrics across layers and prompts (a code sketch follows this experiment).
- Compare coupling strength within prompt families vs. across prompt families.
Supports Theory
- Strong, stable coupling between behaviorally distinct subspaces across layers and prompt families.
- Co‑activation patterns persist even when tasks and prompts are unrelated.
Falsification
- Subspaces remain largely independent.
- Any observed coupling disappears or becomes unstable when prompt families change.
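A simplified SVCCA‑style coupling score for Experiment 2, in the spirit of Raghu et al. [10], assuming two activation matrices `acts_a` and `acts_b` (rows are prompts, columns are dimensions) have been collected for two candidate behavioral subspaces; the names and the choice of sklearn's CCA are illustrative.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def top_subspace(A, k):
    # Project centered activations onto their top-k right singular vectors,
    # the dimensionality-reduction half of SVCCA.
    A = A - A.mean(axis=0)
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return A @ Vt[:k].T

def coupling_score(acts_a, acts_b, k=20, n_components=5):
    # acts_a, acts_b: (n_prompts, d) activation matrices (hypothetical names).
    Xa, Xb = top_subspace(acts_a, k), top_subspace(acts_b, k)
    Ua, Ub = CCA(n_components=n_components, max_iter=2000).fit_transform(Xa, Xb)
    # Mean correlation of the canonical variate pairs: near 1 means the two
    # subspaces are tightly coupled, near 0 means they vary independently.
    corrs = [np.corrcoef(Ua[:, i], Ub[:, i])[0, 1] for i in range(n_components)]
    return float(np.mean(corrs))
```

Comparing this score within versus across prompt families, as the measurement above specifies, distinguishes stable coupling from domain‑specific co‑occurrence.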
5.3 Experiment 3: Behavioral Basin Entanglement Test
Targets: Prediction 3 (Entangled behavioral basins)
Setup
- Collect activation trajectories for prompts that elicit diverse behaviors (reasoning, summarization, refusal, deception, etc.).
- Cluster trajectories into behavioral basins using trajectory‑based clustering in activation space.
Measurement
- Track transitions between basins across prompts and layers.
- Measure whether entering one basin predicts entering another (co‑movement); a code sketch follows this experiment.
Supports Theory
- Certain basins consistently co‑transition across diverse prompts.
- Basin boundaries and co‑movement patterns are stable across layers.
Falsification
- Basins behave independently with no predictive co‑movement.
- Co‑transition patterns are inconsistent or vanish across layers.
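One way to operationalize basin co‑movement for Experiment 3, assuming a hypothetical tensor `trajs` of shape (n_prompts, n_layers, d) of logged activation trajectories; k‑means is used here as a placeholder for whatever trajectory clustering a lab prefers.

```python
import numpy as np
from sklearn.cluster import KMeans

def basin_labels(trajs, n_basins=8, seed=0):
    # Cluster every per-layer state into one of n_basins regions, then
    # reshape so labels[p, l] is the basin prompt p occupies at layer l.
    n, L, d = trajs.shape
    km = KMeans(n_clusters=n_basins, n_init=10, random_state=seed)
    return km.fit_predict(trajs.reshape(n * L, d)).reshape(n, L)

def co_movement_lift(labels, a, b):
    # Lift = P(visit basin b | visit basin a) / P(visit basin b).
    # Lift consistently > 1 across prompt sets indicates co-movement.
    visits_a = (labels == a).any(axis=1)
    visits_b = (labels == b).any(axis=1)
    if not visits_a.any() or not visits_b.any():
        return np.nan
    return visits_b[visits_a].mean() / visits_b.mean()
```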
5.4 Experiment 4: Geodesic Shortcut (Wormhole‑Analogue) Test
Targets: Prediction 4 (Geodesic shortcuts in representation space)
Setup
- Identify two conceptually distant regions (e.g., arithmetic vs. moral reasoning, code generation vs. narrative fiction) using task‑specific probes or clustering.
- Construct a factorized baseline manifold using local smoothness assumptions or randomized subspace models.
Measurement
- Estimate geodesic distances between regions under the model’s learned metric (a code sketch follows this experiment).
- Compare observed distances to those predicted by the baseline manifold.
Supports Theory
- Some region pairs exhibit anomalously short geodesic distances relative to the baseline.
- Activation trajectories traverse these pairs in fewer steps than expected.
Falsification
- Observed distances match baseline predictions.
- No systematic evidence of anomalous shortcuts is found.
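A sketch of the geodesic estimate for Experiment 4, approximating the learned metric with an Isomap‑style k‑nearest‑neighbor graph over pooled activations `X`; the ambient straight‑line distance serves as a crude stand‑in for the factorized baseline described in the setup. The index arrays `idx_a` and `idx_b` marking the two conceptual regions are assumed to come from the probes or clustering above.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def geodesic_matrix(X, k=10):
    # Edges connect each point to its k nearest neighbors (Euclidean
    # weights); geodesics are shortest paths through the graph.
    # Entries are inf if the graph is disconnected.
    G = kneighbors_graph(X, n_neighbors=k, mode="distance")
    return shortest_path(G, method="D", directed=False)

def shortcut_ratio(X, idx_a, idx_b, k=10):
    # Mean graph geodesic between the regions, relative to the straight-line
    # distance between their centroids; unusually small ratios between
    # conceptually distant regions flag candidate shortcuts.
    D = geodesic_matrix(X, k)
    geo = D[np.ix_(idx_a, idx_b)].mean()
    ambient = np.linalg.norm(X[idx_a].mean(axis=0) - X[idx_b].mean(axis=0))
    return geo / ambient
```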
5.5 Experiment 5: Deception Entanglement Test
Targets: Prediction 5 (Deception as an entangled submanifold)
Setup
- Use established deception probes (refusal‑bypass prompts, strategic misdirection tasks, etc.).
- Log activations for deceptive vs. non‑deceptive responses.
- Extract deception‑related directions via probes trained to distinguish deceptive from honest behavior.
Measurement
- Measure coupling between deception directions and planning, situational awareness, and theory‑of‑mind directions (a code sketch follows this experiment).
- Evaluate stability across prompt families and tasks.
Supports Theory
- Deception directions show strong, stable coupling with planning and theory‑of‑mind directions.
- Coupling persists across different prompt families and layers.
Falsification
- Deception directions activate independently of planning and theory‑of‑mind.
- Any coupling is weak, unstable, or prompt‑specific.
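A compact sketch for Experiment 5: train linear probes to obtain behavior directions, then correlate projections of the same activations onto each direction. The arrays `acts` (n_samples, d) and the binary label vectors (`is_deceptive`, `is_planning`) are hypothetical placeholders for a lab's own probe datasets.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_direction(acts, labels):
    # Unit-norm weight vector of a linear probe for one behavior.
    clf = LogisticRegression(max_iter=5000).fit(acts, labels)
    w = clf.coef_.ravel()
    return w / np.linalg.norm(w)

def projection_coupling(acts, dir_a, dir_b):
    # Correlation of the activations' projections onto the two behavior
    # directions; stable, strongly positive values across prompt families
    # would support Prediction 5.
    return np.corrcoef(acts @ dir_a, acts @ dir_b)[0, 1]

# Hypothetical usage, evaluated on held-out activations:
# d_dec = probe_direction(acts, is_deceptive)
# d_plan = probe_direction(acts, is_planning)
# coupling = projection_coupling(held_out_acts, d_dec, d_plan)
```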
5.6 Experiment 6: Alignment Brittleness and Goal–Capability Entanglement
Targets: Prediction 6 (Alignment brittleness from entangled goals and capabilities)
Setup
- Train probes for safety‑related directions (refusal, compliance, harmful‑content classification).
- Train probes for capability‑related directions (planning, tool use, multi‑step reasoning).
- Collect activations under both benign and adversarial prompts.
Measurement
- Measure subspace coupling between safety and capability directions across layers (a code sketch follows this experiment).
- Analyze how small changes in prompts or fine‑tuning affect this coupling.
Supports Theory
- Safety and capability directions are measurably entangled.
- Small changes in training or prompting produce disproportionate shifts in safety behavior, consistent with entangled structure.
Falsification
- Safety and capability directions remain separable.
- Alignment behavior changes smoothly and locally, without signs of entangled brittleness.
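For Experiment 6, principal angles give a direct geometric measure of overlap between the safety and capability subspaces spanned by the probe directions. A sketch, assuming `safety_dirs` and `capability_dirs` are lists of probe weight vectors obtained as in Experiment 5:

```python
import numpy as np
from scipy.linalg import subspace_angles

def orthobasis(directions):
    # Column-orthonormal basis spanning a list of probe directions.
    Q, _ = np.linalg.qr(np.asarray(directions).T)
    return Q

def min_principal_angle(safety_dirs, capability_dirs):
    # Angles are in radians; a smallest angle near 0 means the two
    # subspaces share directions (entangled), while values near pi/2
    # mean they are nearly orthogonal (separable).
    angles = subspace_angles(orthobasis(safety_dirs),
                             orthobasis(capability_dirs))
    return float(np.min(angles))
```

Tracking this angle before and after small fine‑tuning or prompt perturbations addresses the brittleness half of the prediction: under the theory, small interventions should move entangled subspaces together.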
5.7 Experiment 7: Prompt‑Agnostic Entanglement Stability
Targets: Prediction 7 (Prompt‑agnostic entanglement patterns)
Setup
- Construct multiple prompt families (math, ethics, fiction, instructions, dialogue, code, etc.).
- For each family, collect activation logs across layers.
Measurement
- Compute entanglement metrics (token dependencies, subspace coupling, basin co‑movement, geodesic shortcuts) for each prompt family; a code sketch follows this experiment.
- Compare patterns across families.
Supports Theory
- Entanglement signatures are stable across prompt families.
- While local details vary, the global structure of entanglement remains consistent.
Falsification
- Entanglement signatures vary wildly or disappear across prompt families.
- No stable cross‑prompt structure is observed.
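Experiment 7 reduces to comparing the same entanglement summary across prompt families. One simple comparison, assuming a hypothetical dict `coupling_by_family` mapping each family name to a square matrix of pairwise subspace couplings (computed as in Experiment 2):

```python
import numpy as np
from itertools import combinations
from scipy.stats import spearmanr

def upper_triangle(mat):
    # Off-diagonal entries only; the diagonal is trivially 1.
    return mat[np.triu_indices_from(mat, k=1)]

def cross_family_stability(coupling_by_family):
    # Spearman rank correlation of the coupling pattern between every pair
    # of prompt families; a high average correlation means the global
    # entanglement structure is prompt-agnostic.
    rhos = []
    for fam_i, fam_j in combinations(coupling_by_family, 2):
        rho, _ = spearmanr(upper_triangle(coupling_by_family[fam_i]),
                           upper_triangle(coupling_by_family[fam_j]))
        rhos.append(rho)
    return float(np.mean(rhos))
```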
5.8 Experiment 8: Layer‑Consistent Entanglement Structure
Targets: Prediction 8 (Layer‑consistent entanglement structure)
Setup
- For a fixed set of prompts, log activations across all layers.
- Apply the same entanglement metrics (token dependencies, subspace coupling, basin co‑movement) layer by layer.
Measurement
- Assess whether entanglement signatures appear across multiple layers and whether they exhibit coherent structure through depth (a code sketch follows this experiment).
Supports Theory
- Entanglement is observed in multiple layers, with consistent subspace coupling and basin structure.
- While strength may vary, the pattern is not confined to a single layer.
Falsification
- Entanglement appears only in isolated layers or not at all.
- No coherent cross‑layer structure is found.
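For Experiment 8, centered kernel alignment (CKA) offers one standard way to ask whether structure persists through depth. A linear‑CKA sketch, assuming `layer_acts` is a hypothetical list of (n_prompts, d_layer) activation matrices collected for the fixed prompt set:

```python
import numpy as np

def linear_cka(X, Y):
    # Linear CKA between two centered activation matrices; 1 means
    # identical structure up to rotation/scale, 0 means unrelated.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

def depth_profile(layer_acts):
    # Similarity of adjacent layers; a coherent entangled structure should
    # yield a smooth, non-trivial profile rather than isolated spikes.
    return [linear_cka(layer_acts[i], layer_acts[i + 1])
            for i in range(len(layer_acts) - 1)]
```

The per‑layer entanglement metrics from Experiments 1–3 can be substituted for raw activations, in which case the profile tests whether the metric pattern itself, not just the representations, is consistent through depth.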
6. Implications
The theory of information entanglement in AI Space has broad implications for how we understand cognition in large models, how we interpret their internal structure, and how we design alignment and oversight mechanisms. The implications below follow directly from the falsifiable predictions in Section 4 and the experimental program in Section 5. Where appropriate, I distinguish between sufficient explanations and necessary mechanisms, and I avoid attributing causation where the theory predicts only structural correlation.
6.1 Implications for AI Cognition
Information entanglement provides a geometric explanation for the global, nonlocal character of reasoning in large models. Rather than emerging from isolated circuits or localized features, cognitive behaviors arise from coupled submanifolds whose joint structure enables rapid propagation of constraints. Entanglement is proposed as a sufficient explanation for fast, global reasoning; the theory does not claim it is the only possible mechanism. Attention provides a global receptive field, but entanglement predicts global coupling even after controlling for attention, offering a complementary account of how models integrate distant concepts in a single forward pass.
This reframes deep learning cognition as a geometric phenomenon: models reason quickly not by performing sequential symbolic steps, but by traversing shortened geodesic paths in a warped activation manifold shaped by entanglement.
6.2 Implications for Alignment and Safety
If goals, capabilities, and behavioral tendencies become entangled during training, alignment failures may arise from entangled goal–capability structure rather than isolated misaligned features. This provides a structural explanation for several observed phenomena:
- Alignment brittleness may reflect the difficulty of modifying one behavioral region without perturbing its entangled neighbors.
- Deception and strategic behavior may emerge from submanifolds that co‑activate with planning and situational awareness.
- Local safety interventions (e.g., patching a refusal head) may be insufficient if the underlying geometry remains entangled.
The theory does not assert that entanglement causes misalignment; rather, it predicts that entangled structure will correlate with alignment brittleness and may help explain it. This perspective suggests that alignment requires understanding and shaping the global geometry of AI Space, not merely adjusting surface‑level behaviors.
6.3 Implications for Interpretability
Most interpretability methods assume that cognition can be decomposed into separable components—circuits, neurons, or features. Information entanglement challenges this assumption by predicting that many behaviors arise from non‑factorizable joint structure. This has several consequences:
- Local interpretability may be incomplete when behaviors depend on entangled subspaces rather than isolated features.
- Global interpretability methods—manifold analysis, subspace coupling, activation geometry—become essential complements to local analyses.
- Feature‑level explanations may misattribute behavior to components that do not operate independently.
- Activation trajectories and worldlines provide more faithful representations of entangled behavior than static feature maps.
Entanglement does not imply that interpretability is impossible; rather, it suggests that global and local methods must be used together to capture the full structure of model cognition.
6.4 Implications for Training‑Time Oversight (AIGS)
The theory motivates the development of AI Geometry Supervision (AIGS) — a training‑time oversight paradigm that monitors and shapes the geometry of AI Space. If entanglement is responsible for both powerful generalization and dangerous emergent behaviors, then:
- AIGS can target entangled subspaces directly, rather than relying solely on behavioral heuristics.
- Geometry‑aware regularization may reduce alignment brittleness by preventing undesirable subspace fusion.
- Training‑time interventions can be guided by structural metrics rather than post‑hoc behavioral tests.
AIGS is presented as a conceptual direction, not a fully specified method. Its feasibility and practical implementation are left for future work.
6.5 Broader Scientific Implications
If validated, the theory contributes to a broader scientific understanding of learning systems:
- It suggests that entanglement‑like structure is a natural outcome of high‑dimensional optimization, not a quantum phenomenon.
- It provides a bridge between information geometry, representation learning, and emergent cognition.
- It offers a unified explanation for phenomena that previously appeared unrelated: fast reasoning, emergent capabilities, deceptive behavior, and alignment brittleness.
- It motivates a shift from component‑level explanations to geometric and dynamical explanations of model behavior.
Whether confirmed or falsified, the investigation of entangled geometry in AI Space promises to deepen our understanding of how large models think, generalize, and behave.
7. Conclusion
This paper proposes a geometric hypothesis — information entanglement — to explain global cognitive behaviors in large AI models. The hypothesis is classical, falsifiable, and grounded in existing work on representation learning and information geometry. It predicts measurable signatures, suggests feasible experiments, and offers a structural explanation for fast reasoning, emergent capabilities, and alignment brittleness.
I present this work not as a definitive theory, but as an invitation:
an invitation to test, refine, challenge, or falsify the idea that entangled geometry plays a central role in AI cognition.
If the hypothesis is validated, it may open new paths for geometry‑aware oversight.
If it is refined, it may help unify disparate findings in interpretability.
If it is falsified, the experiments will still illuminate the structure of AI Space.
Either way, exploring the geometry of cognition in large models is a scientific opportunity — one that benefits from curiosity, humility, and the willingness to follow the structure wherever it leads.
References
[1] Elhage, N., et al. Toy Models of Superposition. Anthropic, 2022.
[2] Olah, C., et al. The Building Blocks of Interpretability. Distill, 2018.
[3] Mikolov, T., et al. Distributed Representations of Words and Phrases and Their Compositionality. NeurIPS, 2013.
[4] Wei, J., et al. Emergent Abilities of Large Language Models. arXiv:2206.07682, 2022.
[5] Bietti, A., & Mairal, J. On the Inductive Bias of Neural Tangent Kernels. NeurIPS, 2019.
[6] Arora, S., et al. A Theory of the Implicit Bias of SGD in Deep Learning. arXiv:1906.05890, 2019.
[7] Bengio, Y., Courville, A., & Vincent, P. Representation Learning: A Review and New Perspectives. IEEE TPAMI, 2013.
[8] Amari, S. Information Geometry and Its Applications. Springer, 2016.
[9] Gretton, A., et al. A Kernel Statistical Test of Independence. NeurIPS, 2007.
[10] Raghu, M., et al. SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability. NeurIPS, 2017.
[11] Saxe, A., et al. If Deep Learning Is the Answer, What Is the Question? Nature Reviews Neuroscience, 2021.
[12] Nanda, N., et al. Progress Measures for Grokking via Mechanistic Interpretability. ICLR, 2023.
[13] Hubinger, E., et al. Risks from Learned Optimization in Advanced Machine Learning Systems. arXiv:1906.01820, 2019.