Pondering the mind manifold

June 11th, 2025

Recently, I've been enjoying imagining the space of my own mind as the latent space of a neural network. There's something about the way it's mathematically high-dimensional: if we unfolded the coordinates of some embedded concept, we'd find more dimensions inside, along some of which it sits close to other concepts, and along others far away from them.
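To make that picture concrete, here is a minimal sketch with made-up toy vectors (the numbers are arbitrary, only the geometry matters), showing how two embedded concepts can be near each other along some coordinates and far apart along others:

```python
import numpy as np

# Made-up toy embeddings -- purely illustrative, not from any real model.
dog    = np.array([0.9, 0.8, 0.1, -0.7])
wolf   = np.array([0.8, 0.9, 0.2,  0.6])
teacup = np.array([-0.6, 0.1, 0.9,  0.5])

def per_axis_distance(a, b):
    """How far apart two concepts are along each coordinate separately."""
    return np.abs(a - b)

print(per_axis_distance(dog, wolf))    # [0.1 0.1 0.1 1.3]: close on three axes, far on one
print(per_axis_distance(dog, teacup))  # [1.5 0.7 0.8 1.2]: far on nearly every axis
```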

But it's not quite about imagining the space of the mind as the space of concepts, like subjects in school - we are so advanced and metacognitive that the space of our minds expands to include the space of how we learn. Regions in this high-dimensional space (which I am actively imagining via my own representational mechanisms) account for representing the process of making-sense itself. This way, every time we encounter anything, we aren't just "lighting up" the regions of our mind-space that represent things associated with those observations; we are always also activating regions associated with having an experience, and forming a concept, and being a self that is here in the world, alongside the regions associated with the contents of our experience. And all of this is interconnected in some strange high-dimensional manifold, such that activating some region automatically activates a whole bunch of other regions in a never-ending cosmic brain-dance.

So what does this way of imagining the space of the mind suggest? Well, I think first of all, there is something really interesting about a Good Explanation. I was reading a recent paper by Kumar, Clune, Lehman, and Stanley called "Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis", which explores the difference between Unified Factored Representation (UFR) and Fractured Entangled Representation (FER). The key insight is that scaling large models with a lot of data, so that they appear able to do a lot - navigate complex language, extract information from the internet and report it, and so on - doesn't necessarily imply better internal representations, where "better" means more unified and consistent, such that the representation is compositional. For instance, it doesn't necessarily mean that they are forming an understanding of mathematics or logic and then applying it to their process of "thinking". Instead their embedding space may be fragmented: things that should be related, and thus close together in embedding space, are actually far apart and perhaps not connected at all. This would imply that the representation is not efficient, in the sense that the space could be compressed much further without losing any information. In large models, this fractured entangled representation may be degrading core capacities like generalization, creativity, and (continual) learning.
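One way to get a feel for the compression point: if a set of concept embeddings really were unified and factored, most of its variance should live in relatively few directions. The sketch below is not the paper's own analysis (which looks directly at the networks' internal behavior), just a crude SVD-based proxy over a hypothetical embedding matrix:

```python
import numpy as np

def directions_needed(embeddings, energy=0.99):
    """Crude compressibility proxy: how many singular directions carry
    `energy` of the total variance in a set of concept embeddings.
    A unified, factored space should need far fewer than a fractured one."""
    X = embeddings - embeddings.mean(axis=0)
    s = np.linalg.svd(X, compute_uv=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(explained, energy)) + 1

# Unstructured random vectors are nearly incompressible:
rng = np.random.default_rng(0)
print(directions_needed(rng.normal(size=(200, 64))), "of 64 directions")
```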

It seems that the space of our own minds is reasonably good at being unified - in particular when we are representing things like our sense of self and what it means to be having an experience. We don't seem to fall into a lot of confusion (for the most part) about whether we are a consistent self, or whether we have consistent experience, because our representations are generally structured so that every observation gets encoded into a space in which this is already taken to be true. It takes quite a lot to knock someone out of believing that they have a self or are alive - and perhaps the thing that it takes is interfering with the mechanisms that ensure consistent and unified representation.

So then what is that mechanism that ensures unified representation, and can we improve it? What does it mean to be really smart - to be the kind of person that can hold a lot of information in their minds in a consistent way, and then extrapolate from that in ways that make sense, towards "good ideas"? It seems like it's something about being able to navigate the space of language and ideas in a logical pattern - somehow they've adopted a particular way of reasoning and applied it to the space of concepts in order to make good hypotheses about how things work and what is worth doing. Or rather, they haven't applied one space to another, but the space is shared in some high-dimensional, folded-up way - the root of the process of thinking that they apply to all concepts is consistent and clean.

This idea makes me think of recent work from Anthropic's "Sleeper Agents" research, where they trained models with backdoor behaviors - for instance, writing secure code when told the year is 2023 but inserting vulnerabilities when told it's 2024. What's particularly striking is that these backdoor behaviors persisted even after safety training, and the representational changes seemed to affect the model's behavior across domains. This suggests that the representational space is shared underneath the surface - where code generation, reasoning patterns, and other behaviors all draw from common underlying structures.

When I peer into my own experience of trying to understand something, it feels like there is some process going on which is trying-to-explain-things-to-myself. Like, I receive some piece of information, and then I try to integrate it into some cohesive storyline I'm constructing about Reality, in the way that is the least surprising - something like Occam's Razor (insert all Free Energy Principle literature ever).
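For concreteness, the quantity that literature asks a system to minimize is the variational free energy (with o the observations, z the hidden causes, and q(z) the story currently being told about them), which upper-bounds surprise and splits into an Occam-like complexity term and an accuracy term:

$$
F \;=\; \mathbb{E}_{q(z)}\!\big[\ln q(z) - \ln p(o, z)\big]
\;=\; \underbrace{D_{\mathrm{KL}}\!\big[q(z)\,\|\,p(z)\big]}_{\text{complexity of the story}}
\;-\; \underbrace{\mathbb{E}_{q(z)}\!\big[\ln p(o \mid z)\big]}_{\text{how well it explains the data}}
\;\ge\; -\ln p(o).
$$

Minimizing F over the internal story q(z) is exactly the "simplest, least surprising account of what I just observed" move described above.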

So then this leads me to think - if the process that my mind is following, in order to form its representations, out of which my behavior arises, which I then re-represent in a beautiful strange loop - is arising because of some optimization function, something fundamental that I'm trying to do (like maximize reward, or minimize free energy, or get really good at whatever my simulators have RLHF'd me to do), then what is it about that optimization process that is getting me to try to create narrative and explanation? Does it simply arise, or is there something explicitly driving me to do it?

Unfortunately, I have moved mostly away from "something explicitly driving me to do" anything, because it seems to me that most things exist because they happened to happen this way, rather than because someone thought of a beautiful design. And this way of seeing is reinforced by the fact that, when trying to build really good models, imposing less bias on the structure of the learning process often seems to work better. What I'm getting at is: if there were just some basic thing I'm trying to do, like predict my observations, or minimize my free energy, or maximize some fundamental reward function, it may be that by getting really good at that, the process of narrativizing emerged - because it was a good idea, because it allowed me to store my information in a very compressed and compact way, so that if I forget something I still have the underlying structure from which the forgotten thing can be re-derived.

But I am left with curiosity about whether it would make sense to try to build an architecture that is optimized for good narrative construction. It is reminiscent of DreamCoder and other program synthesis approaches, where the point is not just to come up with predictions that minimize loss, but to find the simplest possible programs (explanations) that give rise to predictions that minimize loss.
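As a toy illustration of that objective (not DreamCoder's actual algorithm, which searches over a learned library of program components), here is a sketch that scores polynomial "programs" of increasing degree by fit plus a crude description-length penalty, and picks the simplest one that still explains the data:

```python
import numpy as np

def mdl_score(x, y, degree, penalty=2.0):
    """Prediction error plus a crude 'description length' for the model:
    each extra coefficient makes the program longer."""
    coeffs = np.polyfit(x, y, degree)
    fit_error = np.mean((np.polyval(coeffs, x) - y) ** 2)
    return fit_error + penalty * (degree + 1)

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 50)
y = x**2 + rng.normal(scale=0.5, size=x.size)   # data generated by a simple law plus noise

best = min(range(1, 8), key=lambda d: mdl_score(x, y, d))
print("chosen degree:", best)  # settles on 2: the simplest program that still explains the data
```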

The question becomes: how do you formalize "good explanation" in a way that you can actually optimize for? Current techniques like gradient clipping and regularization are gesturing at something like "make the smallest coherent update to your current model" - maintaining some kind of consistency. But what we're after might be more fundamental: architectures that can't make predictions without having coherent explanations for them.
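The "smallest coherent update" intuition can at least be written down directly. A hedged sketch with a made-up toy example: penalize any new parameter setting by its distance from the current one, so the model moves only as far as the new evidence demands (this is the shape of proximal or elastic-weight-style penalties, not a claim about what gradient clipping literally computes):

```python
import numpy as np

def anchored_objective(task_loss, theta, theta_prev, lam=1.0):
    """New-evidence loss plus a pull toward the previous parameters, so the
    model changes only as much as the data demands."""
    return task_loss(theta) + lam * np.sum((theta - theta_prev) ** 2)

# Toy example: update a single belief toward a new observation at 5.0,
# starting from an old belief at 0.0.
old_belief = np.array([0.0])
task_loss = lambda th: (th[0] - 5.0) ** 2

candidates = np.linspace(-1.0, 6.0, 701)
best = min(candidates, key=lambda t: anchored_objective(task_loss, np.array([t]), old_belief))
print(best)  # ~2.5: a compromise between the old belief and the new evidence
```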

One promising direction might be compression-based approaches. Imagine an agent that learns not just to maximize reward, but to compress its memory in ways that preserve explanatory power. Such an agent would need to identify which experiences are most important for understanding its environment, find efficient ways to encode causal relationships, and maintain the ability to reconstruct and predict from its compressed representations. The compression objective could naturally push toward narrative-like structures: hierarchical patterns, causal relationships, and coherent storylines that efficiently encode experience, starting with the simplest possible explanation for any phenomenon and expanding the narrative only as much as necessary to account for new data.
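I don't know what the right architecture is, but the objective itself is easy to sketch. Here is a hypothetical PyTorch module (all names and sizes are my own invention) that squeezes a window of past observations through a small code and is trained to predict the next observation from that code alone, with a rate penalty standing in for "keep the story short"; in a full agent, a reward term would sit alongside this:

```python
import torch
import torch.nn as nn

class CompressiveMemory(nn.Module):
    """Hypothetical sketch: compress a window of past observations into a
    small code, then predict the next observation from the code alone."""
    def __init__(self, obs_dim, window, code_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim * window, 64), nn.ReLU(), nn.Linear(64, code_dim))
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 64), nn.ReLU(), nn.Linear(64, obs_dim))

    def loss(self, past, next_obs, beta=1e-2):
        # past: (batch, window, obs_dim); next_obs: (batch, obs_dim)
        code = self.encoder(past.flatten(start_dim=1))
        prediction_error = ((self.decoder(code) - next_obs) ** 2).mean()
        rate = code.pow(2).mean()  # crude stand-in for "keep the summary short"
        return prediction_error + beta * rate
```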

When I look at LLMs today, I can't really tell how much of their ability to reason arises from unified representation. Some of it definitely does, as we see with the Sleeper Agents - but is it the same kind of consistency as ours, and could it ever scale to these agents having a unified and persistent self-concept?

Whether or not computational systems can give rise to something homomorphic to human experience, the lens of a high-dimensional representation space as a way of seeing my own mind has been valuable in itself, particularly the experience of becoming-more-cohesive. There is something about feeling confused due to holding contradictory information, or inner conflict, that can be relieved by seeing it as an inconsistency from a high-dimensional perspective.

Finally, this realm of computational consciousness connects to a really cool paper, "Sources of Richness and Ineffability for Phenomenally Conscious States", which explores consciousness through an information-theoretic lens - how conscious states are both rich with information and ineffable in their resistance to complete description. Perhaps the drive toward narrative consistency in minds, whether biological or artificial, emerges from this fundamental tension between the richness of experience and the compression necessary for coherent representation. The realm of feeling, in this view, might be where high-dimensional experiences get compressed into navigable, explanatory structures that allow us to act coherently in the world.