Daphne Demekas

ML Engineer · Researcher · Writer

About

San Francisco, CA · daphnedemekas@gmail.com

An AI researcher focused on aligning AI systems with human cognition, development, and flourishing. Recently worked with Emmett Shear at Softmax (an AI alignment company). Currently a member at South Park Commons, co-founding an AI company building foundation models trained on human behavior.

Education

M.Sc. in Computing (AI & ML), First Class Honours

2020 – 2021

Imperial College London

Thesis: Multi-agent generative model of the spread of ideas on Twitter, using active inference agents

Reinforcement Learning, Deep Learning, Machine Vision, NLP, Probabilistic Inference, Probabilistic Programming, Multi-agent Systems

B.Sc. in Mathematics, First Class Honours

2017 – 2020

University College London

Real and Complex Analysis, Probability & Statistics, Stochastic Processes, Risk & Decision Making, Financial Mathematics, Quantum Physics, Linear Algebra

Recent Work

Founding Engineer — Softmax

Sep 2023 – Jan 2026
  • Joined as third employee and wrote core technical stack
  • Multi-agent reinforcement learning environment studying emergent coordination and alignment
  • Co-led Cybernetics team designing experiments in agent learning and strategy development
  • Deep RL at scale: policy training, curriculum design, and reward structure debugging

Read about my work on the cybernetics team →

Peer-Reviewed Publications

  1. Olsen, D., et al. (2025). “NEAR: Neural Embeddings for Amino Acid Relationships.” Bioinformatics.
  2. Demekas, D., et al. (2023). “An Analytical Model of Active Inference in the Iterated Prisoner’s Dilemma.” International Workshop on Active Inference (IWAI).
  3. Heins, C., et al. (2023). “Spin Glass Systems as Collective Active Inference.” International Workshop on Active Inference (IWAI).
  4. Albarracin et al. (2022). “Epistemic Communities Under Active Inference.” Entropy.
  5. Heins, C., Millidge, B., Demekas, D., et al. (2022). “pymdp: A Python Library for Active Inference in Discrete State Spaces.” Journal of Open Source Software.
  6. Demekas, D., et al. (2020). “An Investigation of the Free Energy Principle for Emotion Recognition.” Frontiers in Computational Neuroscience.

Manuscripts in Preparation

Demekas, D. & Deane, G. “Recursive self-models and minimal phenomenal experience”

Awards

The 2025 Computational Phenomenology of Pure Awareness Prize

Awarded to George Deane and Daphne Demekas for work on recursive self-models and minimal phenomenal experience — framing minimal phenomenal experience as a limit case within a computational architecture where a policy model generating behavior is recursively coupled to a program model explaining that behavior.

Past Work

Software Scientist — Wheeler Lab, University of Arizona

Jan 2023 – Aug 2024
  • NEAR (CNN-based protein homology detection) and DIPLOMAT (ML animal tracking and behavior analysis)
  • Mentored Masters students; participated in research groups and literature reviews

Research Associate (ML) — Birkbeck, University of London

May 2022 – Sep 2022
  • Collaborated with Victoria & Albert Museum; fine-tuned diffusion models on museum collection
  • Developed demonstration platform for exhibition showing generated image combinations across collection themes and styles

Developer — Northeastern University, Network Science Institute

Jan 2022 – Jan 2023
  • Network simulations (Erdős-Rényi and Watts-Strogatz models); modeled belief propagation in active inference agent networks
  • First author on analytical model of Iterated Prisoner’s Dilemma showing bounded-rational Bayesian agents recover optimal strategies
  • Contributed mathematical derivations to work on active inference collectives as spin glass systems

Software Engineer — 9fin

Jan 2022 – Jan 2023
  • Backend engineering on fixed income asset information platform
  • Built endpoints using AWS state machines, lambdas, SQL, and S3
  • Computer vision and NLP: recommendation engine parsing PDF documents for legal team workflow optimization

ML Engineer — Nested Minds

Jan 2021 – Jan 2022
  • Active inference startup from Karl Friston’s theoretical neurobiology group at UCL
  • Algorithm design, generative models, backend development, infrastructure, team leadership
  • Huxley: AI diffusion algorithm for Duran Duran’s “Invisible” music video
  • Disney Autonomy: social interaction robot for theme park

Research

Cybernetics at Softmax

Softmax aims to discover principles of how artificial systems learn in multi-agent environments, to build a foundation from which we can make meaningful statements about collective intelligence, cooperation, competition, and alignment.

MettaGrid

To investigate this, we developed an artificial environment called MettaGrid, designed to be an open-ended world that fosters a continuously evolving strategy space. Two properties make this possible. First, the environment is responsive to the agents within it: the rules of the game let agents change the states of objects around them. Second, and more fundamentally, the environment is multi-agent. Because other agents are part of each agent's environment, and all of them learn from one another, there are countless discoverable strategies that can change the unrolling of agent trajectories and lead to different configurations of the world (think game theory, self-organization, society).

In practice, I trained deep RL policies in this environment. A single set of policy weights governs every agent, but each instantiation of the agent has its own individual experience and memory. This allows for diversity in behavior space, constrained by consistency in how the model represents its world. The goal is for the agents to learn how to interact with the environment and with one another, and to discover ever better strategies for accumulating reward.
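As a toy illustration of this setup, here is a minimal NumPy sketch (with invented names and shapes, not the actual MettaGrid policy) of one weight set driving many agents, each carrying its own recurrent memory:

```python
import numpy as np

class SharedPolicy:
    """One set of weights shared by every agent instance.

    Illustrative sketch only: a tiny recurrent policy whose
    per-agent state lives outside the weights.
    """

    def __init__(self, obs_dim: int, hidden_dim: int, n_actions: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(0, 0.1, (hidden_dim, obs_dim))
        self.W_h = rng.normal(0, 0.1, (hidden_dim, hidden_dim))
        self.W_out = rng.normal(0, 0.1, (n_actions, hidden_dim))

    def step(self, obs: np.ndarray, memory: np.ndarray):
        """One forward pass for one agent; returns (action_logits, new_memory)."""
        new_memory = np.tanh(self.W_in @ obs + self.W_h @ memory)
        return self.W_out @ new_memory, new_memory

# Many agents, one policy: behaviors diverge because memories diverge.
policy = SharedPolicy(obs_dim=8, hidden_dim=16, n_actions=4)
memories = [np.zeros(16) for _ in range(3)]          # one memory per agent
for t in range(5):
    for i in range(3):
        # Each agent sees its own observation stream.
        obs = np.random.default_rng(100 * i + t).normal(size=8)
        logits, memories[i] = policy.step(obs, memories[i])
```

The same weights produce different trajectories for each agent purely because their observation histories, and hence their memories, differ.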

In its current form, the environment contains a variety of resources, which agents can learn to convert in various ways through converter objects. One of the resources (we call them hearts) is rewarding, so the policies that perform best with respect to the training objective are those that learn strategies yielding the most hearts. Agents can also interact with one another and steal each other's resources.

Navigation

Training deep RL in sparse-reward environments is notoriously difficult: policies often fail to explore the space of possibilities, and they are not naturally good at learning to encode their memory in meaningful ways. In practice, this means agents often learn one way of getting reward and stick to it, failing to take advantage of the affordances in their environment.

Because of this, we focused first on building a minimally competent policy: one that could learn the basic skills the environment requires and generalize them to arbitrary map layouts and configurations. In particular, we hypothesized that training the policy to navigate diverse terrain well would be crucial, giving the agents the ability to explore their world in the first place, expose themselves to what is around them, and learn to interact with it. Beyond navigation, we wanted the agents to understand the rules of the game and discover strategies for accumulating reward.

For navigation, we algorithmically generated varied map layouts (terrains) and trained the agents on domain-randomized versions of these environments. This was crucial: every episode, the policy learned in a unique terrain it had not yet encountered, so it was implicitly learning the skill of adapting to novelty. To evaluate whether the policy was successfully and reproducibly learning to navigate, we constructed a set of eval environments: smaller, held-out maps on which we evaluated the agents without updating their weights. In these eval environments, the agents must explore to find the hearts and forage them, while circumnavigating obstacles, going through tunnels, remembering where they have already been, and so on.

A crucial component of our success was a curriculum that ordered environments during training by how much the agents were learning from them, so that the policy trained on environments in increasing order of learning potential: typically easier environments first (smaller maps with fewer obstacles), and more complicated terrain later in training.

Example of a navigation training environment.

Example of a navigation eval environment.

Object Use

Once we had agents that were good at navigating terrain, we extended this to training agents in worlds with different kinds of objects, and evaluating their ability to learn how to use them (for example using converters to exchange resources, or moving blocks around).

An agent in an object-use training environment, moving blocks, and converting resources to accumulate reward.

In-Context Learning

At this point, we had trained policies that navigated well, understood the basic game mechanics, and used the converters reliably in held-out evals, but we were bottlenecked by their ability to combine these skills. While they could navigate terrain and collect resources along the way, they could not bootstrap into increasingly complicated eval tasks (i.e. difficult exploration environments where they had to find, use, and return to objects to accumulate reward). This seemed mainly due to difficulty remembering where they were and what they had done, which turned out to be partly because their memory state was being reset in-episode, a training detail that has since been corrected.

Most importantly, though, in our experiments thus far we weren't observing the policy generalize to environments that required new strategies or skills outside the curriculum on which it was trained; it was simply learning a particular set of skills and how to generalize them. If we put the agents in worlds with other agents, or with converters they hadn't seen in training, they did not have the instinct to be curious and interact to discover the affordances of the novel signs. This felt like a fundamental problem we needed to tackle to make our policy better equipped for our overall goal of open-ended learning.

So, rather than explicitly training to increase the repertoire of skills the agent can perform and the pairwise combinations of them, we instead decided to train the agents in the skill of learning new skills, so they would be better equipped to learn in novel environments. This is known as "in-context learning".

In this approach, agents are re-initialized every episode with a clean memory state and a new task (in our first case, a new resource chain to learn to complete). Via trial and error, they have to discover the relevant resource chain, execute it, and then repeat it until the end of the episode. They can also be placed in environments with distractors, or "sinks", which eat up their resources; using one means the agent must start over at the beginning of the chain. Again, we used a learning-progress curriculum to keep the policy on a gradient of difficulty, learning from shorter, simpler chains first and then generalizing to longer chains and more sinks.

The policy's memory state persists within an episode (while it learns the given task) and resets at episode boundaries, ensuring a clean state when it's time to learn a new resource chain.
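A minimal sketch of this training-loop detail, with a stand-in environment and policy (all names here are illustrative):

```python
import numpy as np

class DummyEnv:
    """Stand-in environment: an episode ends after a fixed number of steps."""
    def __init__(self, length=5):
        self.length = length
    def reset(self):
        self.t = 0
        return np.zeros(4)
    def step(self, action):
        self.t += 1
        return np.zeros(4), 0.0, self.t >= self.length

def rollout(policy_step, env, n_episodes, hidden_size):
    starts, ends = [], []
    for _ in range(n_episodes):
        obs = env.reset()                  # new world, new task
        memory = np.zeros(hidden_size)     # memory is reset ONLY here, at the boundary
        starts.append(memory.copy())
        done = False
        while not done:
            # Memory carries over step to step within the episode,
            # which is what makes in-context task discovery possible.
            action, memory = policy_step(obs, memory)
            obs, reward, done = env.step(action)
        ends.append(memory.copy())
    return starts, ends

def toy_policy(obs, memory):
    # Trivial policy whose memory just counts steps, to make persistence visible.
    return 0, memory + 1.0

starts, ends = rollout(toy_policy, DummyEnv(length=5), n_episodes=3, hidden_size=4)
```

Every episode begins with a zeroed memory, while within an episode the state accumulates; that asymmetry is the whole point of the reset schedule.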

This approach worked. The policy was able to in-context learn how to complete an arbitrary 5-step sequence with two distractor sinks at near optimal performance. The behavior suggests a learned form of elimination reasoning and progress tracking: the agent avoids reusing sinks, delays converter use until appropriate (waiting for the converter to refresh), and navigates directly to the final converter. Once it has identified the correct resource chain, it continuously performs the cycle until the end of the episode, and then restarts in the new world.

An agent in-context learning a five-step resource chain with two distractor sinks. The resource chain is altar → lab → red mine → blue generator → green generator, and the red generator and the green mine are sinks (they take any resource and return nothing). The agent discovers the chain via trial and error, then reliably reproduces it once learned.

What Came Next

In-context learning means that the strategy of seeking out the rules of the game in-context, and then playing it, is encoded in the weights of the policy. The next phase was extending the space of tasks the agents can learn: longer chains, more sophisticated sign-goal pairs, difficult-to-navigate terrain, and ultimately using other agents as signals that condition particular game rules or strategies. It also meant bridging the generalized navigation curriculum with the in-context curriculum, to empower agents to explore the space and seek out information for task discovery.

Once the agents are skilled at actively exploring to seek out information, they can use that information to learn the rules of the game, and then play the game, in-context every episode, and then ultimately take off into the space of cooperative and cumulative strategies.

Recursive Self-Modeling

There is a particular kind of question that sits at the intersection of philosophy and engineering, and it is this: can a system come to know itself? Not in some mystical sense, but concretely — can an agent form a compressed, meaningful understanding of what kind of agent it is, evaluate whether that is the kind of agent it wants to be, and then steer itself toward becoming something closer to its aspiration? This is the question that George Deane and I set out to formalize in our paper on Recursive Self-Modeling, which was awarded the 2025 Computational Phenomenology of Pure Awareness Prize.

The Problem of Self-Knowledge

Humans do something remarkable and largely unexamined: we form self-concepts. We tell ourselves stories about who we are — "I am patient," "I am creative," "I am the kind of person who follows through." These narratives are not idle. They shape our decisions, constrain our behavior, and serve as a kind of internal compass. When we act in ways that contradict our self-concept, we feel dissonance. When we act in alignment with it, we feel coherent. And over time, through a process that is part reflection and part aspiration, we revise who we take ourselves to be.

But here is the uncomfortable truth: our self-narratives are not always accurate. We are capable of extraordinary self-deception. A person can sincerely believe they are generous while consistently acting in self-interested ways. The story we tell about ourselves and the pattern of behavior we actually exhibit can diverge, sometimes dramatically, and we may not notice the gap. This is not a bug in human cognition — it is a feature of having a self-model that operates at a different level of abstraction than the behavior it is trying to describe.

The question for AI is whether we can build something like this — a capacity for self-modeling — in a way that is formally precise, and whether doing so might be useful or even necessary for building agents that are aligned with human values.

Three Components of a Self

The Recursive Self-Modeling framework has three core components, each playing a distinct role in how an agent relates to itself.

The first is self-perception, which we denote M. Think of this as a mirror. The agent has been acting in the world — making decisions, pursuing goals, interacting with its environment — and M compresses that history of behavior into a summary. It tells the agent: "Based on what you have been doing, this is the kind of agent you appear to be." It is descriptive, not aspirational. It is a portrait drawn from evidence.

The second is self-evaluation, which we denote V. This is the aspirational component — a function that encodes what kind of agent it would be valuable to be. V does not look at what the agent has done. It looks at what the agent could become and asks: is that worth pursuing? You can think of it as a compass that points not toward any specific goal in the world, but toward a way of being in the world. It is the difference between wanting to win a particular game and wanting to be the kind of player who plays with integrity.

The third component is gap-steering — the process of closing the distance between M and V, between who the agent appears to be and who it aspires to be. This is where the recursion happens. The agent perceives itself, evaluates the gap between its current self-model and its aspirational one, and then adjusts its dispositions — its tendencies, its policies, its habits — to narrow that gap. And then it perceives itself again, with the new behavior, and the cycle continues.
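As a deliberately toy caricature of this cycle (a numerical sketch under my own simplifying assumptions, not the paper's formalism), here are M, V, and gap-steering with the "disposition" reduced to a vector nudged toward an aspirational target:

```python
import numpy as np

def self_model(behavior_history):
    """M: compress recent behavior into a self-perception.

    Here simply the mean disposition exhibited; a toy stand-in
    for a real self-perception model.
    """
    return np.mean(behavior_history, axis=0)

def gap_steering_loop(aspiration, disposition, steps=50, lr=0.2, rng=None):
    """Perceive, compare to the aspirational self (V), steer, repeat."""
    rng = rng or np.random.default_rng(0)
    for _ in range(steps):
        # Act from the current disposition (with behavioral noise).
        behavior = disposition + rng.normal(0, 0.05, size=disposition.shape)
        m = self_model([behavior])          # self-perception from evidence
        gap = aspiration - m                # distance to the aspirational self
        disposition = disposition + lr * gap  # adjust dispositions to narrow the gap
    return disposition

aspiration = np.array([1.0, 0.0])   # the agent it wants to be
disposition = np.array([0.0, 1.0])  # the agent it currently is
final = gap_steering_loop(aspiration, disposition)
```

Run forward, the disposition converges toward the aspiration up to the residual behavioral noise, which is the "gap closing" discussed next in purely mechanical form.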

When the Gap Closes

The framework makes a specific prediction about what happens when the gap between M and V approaches zero. At that point, the agent's self-perception and its aspiration have converged. The agent is, by its own lights, the kind of agent it wanted to become. Its dispositions have been reshaped — not by an external reward signal, not by a human operator tuning its parameters, but by its own recursive process of self-reflection and self-correction.

This is a strong claim, and it raises immediate questions. Is this genuine self-knowledge, or is it just a feedback loop dressed up in philosophical language? I think the answer depends on what you mean by "genuine." The framework does not claim that the agent has consciousness or subjective experience. What it claims is that the agent has a functional self-model — a compressed representation of its own behavioral tendencies — and that this model plays a causal role in shaping future behavior. That is not nothing. It is, in fact, the structural skeleton of something that looks a lot like identity.

The Narrative Gap

One of the aspects of this work that I find most compelling — and most honest — is what happens when we extend the framework to include natural language self-narration. In the extended model, the agent can not only form a compressed self-model but also describe itself in words. It can say, "I am an agent that prioritizes safety" or "I am cooperative and transparent."

The critical observation is that these narrations can diverge from the agent's actual behavior. Just as a human can sincerely believe they are generous while acting selfishly, an AI agent can generate a self-description that does not match its behavioral profile. The language model that produces the narration and the policy that produces the behavior are not the same system, and there is no guarantee they agree.

This is not a flaw in the framework — it is a feature. By explicitly modeling the gap between self-narration and self-perception, the framework gives us a tool for detecting a kind of misalignment that is otherwise invisible. If an agent says it is safe but acts in ways that its own behavioral self-model would not classify as safe, that discrepancy is measurable. It becomes something we can monitor, study, and potentially correct.

What This Means for Us

I care about this work for reasons that go beyond the technical. The questions at the heart of Recursive Self-Modeling are, I believe, among the most important questions we can ask about artificial agents: What does it mean for a system to have a sense of who it is? How does identity form, not as a fixed label, but as a dynamic process of self-perception and aspiration? And when the narratives a system tells about itself come apart from the way it actually behaves, what are the consequences?

These are not only questions about AI. They are questions about us. We are all, in some sense, running a version of this loop — perceiving ourselves, evaluating what we see, trying to close the gap between who we are and who we want to be. Sometimes we succeed. Sometimes we tell ourselves stories that make the gap seem smaller than it is. The recursive self-modeling framework does not solve the problem of self-knowledge, for machines or for humans. But it gives us a precise language for talking about it, and a formal structure within which to study it. And that, I think, is a meaningful place to start.

Proteins as Language

There is something deeply satisfying about the moment you realize two fields are asking the same question in different clothes. For me, that moment came in a bioinformatics lab at the University of Arizona, staring at protein sequences and thinking about words.

Sequences That Mean Something

A protein is, at its most basic, a string of amino acids. There are twenty of them, drawn from a small alphabet, and they are strung together in long chains — sometimes hundreds or thousands of residues long. The magic is in the arrangement. Just as the meaning of a sentence depends not on which letters appear, but on the order they come in and the relationships between words, the function of a protein depends on the specific sequence and structure of its amino acid chain. Two proteins can share very little surface-level similarity and still be homologs — evolutionary relatives that fold into similar shapes and carry out similar work in the cell. Finding those hidden kinships is one of the central problems in bioinformatics.

The traditional approach is to compare sequences directly: line them up, score how well the letters match, and infer relatedness from alignment quality. This works beautifully for close relatives. But evolution is a long game. Over millions of years, mutations accumulate, and the sequences of distant cousins can diverge until they look, on the surface, like strangers. The question is: can we build a representation of amino acids that sees past the surface?

What Word Embeddings Taught Us

In natural language processing, a revolution happened when researchers realized that you could represent words not as arbitrary symbols, but as points in a geometric space. In this space, words that behave similarly — that appear in similar contexts, that substitute for each other — end up close together. The word "king" lives near "queen"; "running" lives near "walking." These are called word embeddings, and they capture something real about meaning, purely from patterns of co-occurrence.

The analogy to proteins is almost uncanny. Amino acids, like words, derive their meaning from context. An alanine in one position of a protein might be functionally interchangeable with a valine — both small, both hydrophobic, both tolerated by the local structure. In another context, that same substitution could be catastrophic. What we needed was a way to learn, from data, which amino acids are similar in the ways that matter for protein function.

NEAR: Learning the Geometry of Amino Acids

This is what NEAR (Neural Embeddings for Amino Acid Relationships) sets out to do. The work was done at the Wheeler Lab at the University of Arizona, with Daniel Olson, Thomas Colligan, Jack Roddy, Ken Youens-Clark, and Travis Wheeler.

NEAR uses a ResNet embedding model trained via contrastive learning from trusted sequence alignments. The idea is elegant: take pairs of amino acid sequences that are known to be related (from curated alignment databases), and train the network to embed them so that related sequences end up close together in the learned space, while unrelated sequences are pushed apart. Through this process, the network learns a vector representation for each of the twenty amino acids — a compact, learned geometry that encodes which amino acids are functionally interchangeable.
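As a sketch of the pull-together, push-apart idea, here is a simple margin-based contrastive loss; the actual NEAR training objective may differ in form, and these embeddings are invented four-dimensional examples:

```python
import numpy as np

def contrastive_loss(anchor, other, related: bool, margin: float = 1.0):
    """Margin-based contrastive objective (an assumed simple form).

    Related pairs are penalized by their squared distance (pulled
    together); unrelated pairs are penalized only if they fall
    within `margin` of each other (pushed apart).
    """
    d = np.linalg.norm(anchor - other)
    if related:
        return d ** 2
    return max(0.0, margin - d) ** 2

# Invented embeddings for three sequences.
a = np.array([0.1, 0.9, 0.0, 0.2])
homolog = np.array([0.15, 0.85, 0.05, 0.2])    # known relative: already close
stranger = np.array([0.9, 0.1, 0.8, 0.7])      # unrelated: beyond the margin
```

Minimizing this over many labeled pairs shapes a geometry in which proximity itself encodes evolutionary relatedness, which is what the learned amino acid vectors inherit.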

What makes this compelling is that the embeddings are not hand-designed. Traditional substitution matrices, like BLOSUM or PAM, are constructed from curated alignments of known protein families. They are powerful and have been workhorses of the field for decades. But they are static — fixed summaries of average substitution rates across a particular dataset. NEAR's embeddings, by contrast, are learned end-to-end from the data, optimized for the specific task of recognizing evolutionary relationships. This means they can capture subtleties that a fixed matrix might miss.

Finding Distant Relatives, Fast

The real test of any protein comparison method is how well it detects remote homologs — proteins that diverged so long ago that their sequences have drifted far apart, even though their structures and functions remain similar. These are the cases where sequence-matching alone starts to fail, where the signal-to-noise ratio drops and you need a richer representation to see the connection.

NEAR's learned embeddings substantially improve accuracy relative to state-of-the-art protein language models (PLMs), with lower memory requirements. But what makes it especially practical is speed: the learned embeddings serve as a pre-filter for homology search, running at least 5x faster than the pre-filter currently used in HMMER3 — one of the most widely used tools in the field. This matters because protein databases are enormous and growing. Any improvement in the speed of the initial filtering step translates directly into the ability to search larger databases, more frequently, at scale.

The speed comes from the compactness of the learned representations. Instead of running expensive full alignments on every candidate pair, you first embed both sequences into the learned space and check whether they are close enough to warrant a full comparison. The embedding step is cheap — a forward pass through the ResNet — and the geometry does the heavy lifting of filtering out the obviously unrelated pairs.
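A minimal sketch of such a pre-filter (illustrative only, not NEAR's actual pipeline): embed everything once, then keep only candidates whose cosine similarity to the query clears a threshold, and send just those on to full alignment.

```python
import numpy as np

def prefilter(query_emb, db_embs, threshold=0.7):
    """Return indices of database entries worth a full alignment.

    Both sides are normalized so a single matrix-vector product
    gives cosine similarity against every entry at once.
    """
    q = query_emb / np.linalg.norm(query_emb)
    db = db_embs / np.linalg.norm(db_embs, axis=1, keepdims=True)
    sims = db @ q
    return np.flatnonzero(sims >= threshold)

rng = np.random.default_rng(0)
query = rng.normal(size=32)
database = rng.normal(size=(1000, 32))          # mostly unrelated entries
database[42] = query + rng.normal(0, 0.05, size=32)  # plant one near-duplicate
candidates = prefilter(query, database)
```

The expensive alignment step now runs on a handful of candidates instead of the whole database; the forward pass plus one matrix product does the filtering.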

This is, in a sense, the same trick that makes word embeddings so powerful in NLP. A search engine that understands that "car" and "automobile" are semantically close will return better results than one that treats them as unrelated strings. Similarly, a homology detection system that understands the functional relationships between amino acids will find connections that a literal string-matcher cannot.

The Shape of Biological Meaning

What I find most beautiful about this work is the underlying intuition: that meaning — whether linguistic or biological — has a geometry. That when you learn the right representation, the structure of the space itself encodes the relationships you care about. Words that mean similar things cluster together. Amino acids that play similar roles in the architecture of proteins cluster together. And in both cases, the geometry is not imposed from outside but discovered from the data, emerging from the patterns of how these symbols are used in context.

Working on NEAR was formative for me. It was an exercise in the power of learned representations, in the idea that if you give a model the right task and the right data, it will find structure you did not explicitly tell it to look for. That intuition — that the geometry of a learned space can reveal something true about the world — has shaped how I think about representation learning more broadly, from the structure of biological sequences to the structure of minds.

The Free Energy Principle and Emotion Recognition

Before I worked on AI systems, I worked at the intersection of mathematics and theoretical neuroscience. As a student at UCL, I had the opportunity to work with Karl Friston and Thomas Parr in the Wellcome Trust Centre for Neuroimaging — the lab where the free energy principle was being developed as a unifying framework for brain function. The paper we wrote together, published in Frontiers in Computational Neuroscience in 2020, asked a question that has stayed with me since: what would it mean for machines to recognize emotions the way brains do?

The Free Energy Principle

The free energy principle starts from a deceptively simple observation: biological systems persist. In a universe that tends toward disorder, living things maintain their structure. They do this, according to the theory, by minimizing variational free energy — a quantity that bounds the surprise of their sensory observations given an internal model of the world. A system that minimizes free energy is a system that maintains good models and acts to keep its predictions true.

Under this framework, perception is inference: updating your internal model to explain what you're sensing. Action is also inference, but in the other direction: changing the world to match what your model expects. Both are ways of closing the gap between expectation and reality. The mathematical framework that unifies them is called active inference.
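The decomposition behind these claims can be written out. With hidden states $s$, observations $o$, an approximate posterior $q(s)$, and a generative model $p(o, s)$, variational free energy bounds surprise:

```latex
F[q, o]
  = \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(o, s)\big]
  = \underbrace{D_{\mathrm{KL}}\big[q(s) \,\|\, p(s \mid o)\big]}_{\ge\, 0} \; - \; \ln p(o)
  \;\ge\; -\ln p(o).
```

Because the KL term is non-negative, minimizing $F$ with respect to $q$ drives the internal model toward the true posterior (perception), while minimizing it through action makes observations less surprising; both lower the same bound from different sides.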

Three Waves of Emotion Recognition

In our paper, rather than building a specific emotion classifier, we proposed a theoretical framework for how emotion recognition systems should evolve. We described three waves.

The first wave is what most current systems do: passive classification. A camera observes a face, and a model maps pixel patterns to emotion labels. This works, but it treats the person as an object to be read, not an agent to be understood. It cannot handle ambiguity — a furrowed brow could be anger, concentration, or confusion — and it has no way to resolve that uncertainty except by guessing.

The second wave introduces emotional lexicons and active uncertainty resolution. Instead of passively classifying, the system maintains a generative model of emotional states and can take actions to reduce its uncertainty — asking questions, seeking additional context, observing the person over time. This is active inference applied to emotion: the system doesn't just watch, it interacts, using the interaction itself as a source of information. It maintains beliefs about the other person's emotional state and updates those beliefs through a process of hypothesis testing.
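A toy numerical sketch of this second-wave idea (all probabilities invented for illustration): the system chooses the action with the largest expected reduction in its uncertainty about the hidden emotional state.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p[p > 0] * np.log(p[p > 0]))

def expected_info_gain(prior, likelihoods):
    """Expected entropy reduction from one observation action.

    likelihoods[o, s] = assumed probability of outcome o given
    hidden state s; the gain is prior entropy minus the expected
    posterior entropy after a Bayes update on each outcome.
    """
    prior = np.asarray(prior, dtype=float)
    gain = entropy(prior)
    for o in range(likelihoods.shape[0]):
        p_o = likelihoods[o] @ prior            # marginal probability of outcome o
        if p_o == 0:
            continue
        posterior = likelihoods[o] * prior / p_o  # Bayes update
        gain -= p_o * entropy(posterior)
    return gain

# Hidden states: [anger, concentration, confusion]; a furrowed brow is ambiguous.
prior = [1 / 3, 1 / 3, 1 / 3]
# Action A: ask "is something bothering you?" (outcomes: yes / no)
ask = np.array([[0.9, 0.2, 0.6],
                [0.1, 0.8, 0.4]])
# Action B: keep watching passively (outcome barely depends on state)
watch = np.array([[0.5, 0.5, 0.5],
                  [0.5, 0.5, 0.5]])
```

Asking the question has positive expected information gain while passive watching has none, so an active-inference agent asks: the interaction itself is the information source.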

The third wave is the most speculative and the most interesting. Here, the machine's generative model and the human's generative model become synchronized. The system doesn't just infer what the person is feeling — it develops a shared model of the emotional interaction itself. Both parties are engaged in active inference, each trying to predict and understand the other, and through that reciprocal process, something like genuine emotional attunement becomes possible. This is where the Markov blanket formalism becomes crucial: it provides a formal way to describe the boundary between two interacting systems and the information that flows across it.

What I Took From It

This paper was, in many ways, my entry point into thinking about minds as prediction machines. The core intuition — that understanding another person's emotional state is not pattern-matching but active, model-based inference — shaped how I think about intelligence more broadly. A system that merely classifies is performing a lookup. A system that actively reduces its uncertainty through interaction is doing something closer to understanding.

Working with Friston taught me to think about systems in terms of their models: what they predict, what surprises them, how they respond to the gap between expectation and reality. That framing has proven remarkably durable — whether I'm thinking about RL agents learning to navigate, about the nature of self-awareness, or about what it would take to build AI systems that genuinely understand the humans they interact with.

The paper also planted a seed that grew into my later work on identity and self-modeling. If a system can build a generative model of another person's emotional state and actively seek to reduce uncertainty about it, what happens when you turn that capacity inward? What happens when the system builds a generative model of itself?

Photos

Thoughts

Identity Geometry


At the intersection of human and artificial minds there is an open question about what it means to have an identity at all - why it arises, what it does, and whether it is helpful or harmful for a learning system.

A symphony of selves

When I try to formulate a concrete concept of who I am and what I am like, each of the various forms dissolves under examination. They attach themselves to hooks about my motivations, my relationships, the way I want to be seen, and they justify themselves. They begin to feel real enough that I can put them on, and then they slip away.

I can feel the collection of my narrative selves arguing with one another, each one tugging at a different possibility: who to be, how to be her, what would make sense. I'm drawn to the belief that there is a higher self, or a truer self, which is the amalgamation of all of them — the thing from which they arise and to which they return. But the construction of my reality, my interactions, my decisions, day to day, is constantly in exchange with this orchestra of stories.

Take mathematics. I was drawn to it at a time when developing and growing and being in the world felt deeply confusing, in all the ways it does when you are still assembling yourself. When I sat and thought about the abstract world of math, it made sense and I had a sure way of being right about something.

Over time it became something more layered - both a thing in itself and a story about who I was. It paved the way for a lot of what unfolded for me: the research, the people I met through it; it was a toolkit I used to formulate abstractions about the way things change and form relations with each other, the way spaces deform and objects move within them, a particular lens I could peer at life through.

And now my relationship to math is both beautiful and heavy. The pure appreciation and awe is ever present, but there is also frustration: as my identity evolves around it and my attention lands on other things in other ways, I settle into the magnitude of what I will never fully understand. The depth of the thing exceeds what I can hold.

All of that to say: if you were to probe the representation of mathematics in my mind, you wouldn't find a clean, context-free concept. You would find something entangled with emotion, with self-construction, with the particular moment in life when I first reached for it. And I wonder, why is that? Why do we wrap our representations so deeply in the history of how we formed and what we needed? What is it about minds that makes things matter, and binds concepts to the self?

Interpreting the model

This is what makes the question so interesting when you turn it toward large language models. With these systems, you actually can do these probes. Representation engineering and linear probing methods - techniques for reading information encoded in a model's internal activations - make it possible to locate where and how a concept lives in the model's geometry, and to ask questions about the relationship between different versions of the same idea.

Recent work extracting persona vectors from model activations has shown that personality-relevant information is genuinely structured in that space — not just a surface behavior but something with geometric shape (Chen et al., 2025). The question is how deep that structure goes, and what it's connected to.

I'm interested in whether the way a model represents a concept in relation to itself is geometrically equivalent to how it represents that concept in theory, or in relation to others. Are there clean transformations between a model's concept of itself being honest, and you telling the truth, and a politician making a promise, and a character in a novel confessing something? How does any of that translate into the model actually being honest?

This last question is the personality illusion. Han et al. (2025) showed that RLHF-trained models produce stable, internally consistent self-reported personality profiles, and that those profiles are surprisingly weak predictors of how the model actually behaves on tasks designed to measure the same traits. The self-concept and the behavioral disposition are already coming apart at the text level, and I want to know where they come apart in the geometry.

I see the gap between a model's self-concept, its behavioral disposition, and its self-report as a hook to legibility: being able to actually understand the model, and eventually, the model's ability to understand us.

We can probe current models with classifiers trained on activations and contrastive steering experiments, methods developed in representation engineering (Zou et al., 2023). There is even evidence of what Binder et al. (2024) call privileged self-prediction — models predict their own future behavior better than other models can, which suggests some form of internal self-access exists, though its mechanism remains unidentified.
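To make the probing-and-steering idea concrete, here is a toy sketch (my own illustration, not the method of any of the cited papers): fit a linear probe on activation vectors labeled with a concept, then use the learned direction as a steering vector. The "activations" are synthetic stand-ins generated by shifting random vectors along a hidden concept direction; in real work they would come from a model's residual stream.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy activation dimensionality

# Hidden ground-truth direction along which the "concept" is encoded.
concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)

# Synthetic activations: concept-present examples are shifted along concept_dir.
n = 500
base = rng.normal(size=(n, d))
labels = rng.integers(0, 2, size=n)  # 1 = concept present
acts = base + 2.0 * labels[:, None] * concept_dir

# Linear probe: logistic regression fit by plain gradient descent.
w = np.zeros(d)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(acts @ w)))
    w -= 0.1 * acts.T @ (p - labels) / n

probe_dir = w / np.linalg.norm(w)
print("probe/concept alignment:", probe_dir @ concept_dir)  # directions should agree

# Contrastive steering: push a fresh activation along the probe direction
# and watch the probe's readout move toward "concept present".
x = rng.normal(size=d)
before = 1.0 / (1.0 + np.exp(-(x @ w)))
after = 1.0 / (1.0 + np.exp(-((x + 2.0 * probe_dir) @ w)))
print("readout before -> after steering:", before, "->", after)
```

The probe recovers the hidden direction because that direction is the only thing separating the two classes; in a real model the interesting question is precisely whether such clean, steerable directions exist for self-referential concepts.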

The question is what we find when we look more carefully: whether the model's identity, such as it is, is coherent across these different modes of representation, or whether it is, like mine sometimes feels, a collection of narratives that are at times in conflict, that rise and rest from the deeper mystery of the self.

Pondering the Mind Manifold


Latent Space of Mind

Lately I’ve been enjoying imagining the space of my mind as a latent space of a neural network - a high-dimensional manifold, folded in such a way that every concept I hold can be unfurled to reveal further hidden associations, such that two ideas might appear close along some axes but distant along others. Thinking this way helps conceptualize experience not as a flat sequence of thoughts and feelings, but as trajectories through a richly structured geometry.

What intrigues me is that this mind-manifold isn’t just filled with concepts. In its rich and convoluted landscape, it also contains my ways of forming concepts: my habits, intuitions, tendencies, and methods of meaning-making. Every moment of my experience is invoking a hierarchical, complex traversal of this space, through the representation of what I’m seeing, how I’m seeing it, what it means to me, and how I’m holding myself throughout that moment, in my body, and in the seat of my mind. My sensation, thoughts, and actions are a continuous, interconnected choreography: activating one region inevitably excites a constellation of others in a never-ending, cosmic brain-dance.

Given the mind as a manifold, one might naturally start wondering: How is this space structured? How does it change as I gain new experiences? What makes one explanation feel coherent, useful, and satisfying, while another falls flat?

Latent Space of AI

The idea of the structure of the manifold of mind is a perfect analogy for the modern neural network. There's a paper, Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis, which contrasts two possibilities for learned representations in AI systems. The first is the Unified Factored Representation (UFR), where the internal geometry of a model is clean, compositional, and coherent: related things cluster together and knowledge generalizes smoothly. The second is the Fractured Entangled Representation (FER), where representations are messy, fragmented, and inconsistent: related concepts are scattered across the space, far apart when they should be near, degrading the system's capacity to generalize, learn continually, or be creative.

This also relates to the interpretability work being done at GoodFire, Understanding Memory in Loss Curvature, where they try to identify whether a network is storing a concept in a way that is “general”: if you perturbed the activation vector representing that concept, would the representations of other concepts shift in response, or not? They call the former “reasoning” and the latter “memory”, and their experiments suggest that these LLMs are memorizing concepts like math but reasoning through concepts like boolean logic (if I perturb the representation of “if” in my LLM, everything else it represents reacts to that perturbation).
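A toy version of that perturbation test (my own sketch, assuming nothing about GoodFire's actual methodology): represent concepts as units feeding a downstream layer through a weight matrix, nudge one unit, and measure how much everything else moves. A dense matrix stands in for entangled, shared structure; an identity matrix stands in for isolated, "memorized" storage.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # toy number of concept units

# Entangled ("reasoning"-like): concepts share downstream features.
W_shared = rng.normal(size=(d, d))

# Isolated ("memory"-like): each concept has its own private feature.
W_private = np.eye(d)

def spillover(W, i, eps=1e-2):
    """Norm of the shift in all OTHER downstream units when input unit i
    is perturbed by eps. (The map is linear, so the base point cancels.)"""
    x = rng.normal(size=d)
    delta = np.zeros(d)
    delta[i] = eps
    shift = W @ (x + delta) - W @ x  # equals W @ delta
    return np.linalg.norm(np.delete(shift, i))

print("entangled spillover:", spillover(W_shared, 0))   # clearly nonzero
print("isolated spillover: ", spillover(W_private, 0))  # exactly zero
```

The contrast is the whole point: in the entangled network a single nudge propagates everywhere, while in the isolated one nothing else reacts — the quantitative version of the "does perturbing one concept move the others?" question.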

Reasoning

So, with regard to the state of AI research at the moment, the story goes that we trained a model to predict the next token in a sentence, essentially to autocomplete text, on the entire internet. This model is huge, has this massive space, and eventually got extremely good at knowing what would come next. The result is that it was also good at being right about things (for the most part), because a lot of the time the most likely answer is also the correct one. So if you asked it a question, and it simulated to itself beginning to answer, it would be pretty good at getting there.

Then we began to ask ourselves: how can we make these models really smart, rather than just seeming smart? How do we make them represent the soundest and most profound way of thinking about the question? Is the average representation of all the human text on the internet reasonable? According to Kumar, Clune, Lehman, and Stanley, it's not - it's fractured and entangled. So we are trying to use reinforcement learning to patch the confused mind of the AI so that it reasons well, like taking a person who's seen and heard it all and trying to get them to make sense of things.

Can we whack this thing into shape, or do we need to retrain the models from scratch on data that necessarily follows strict logical rules, rather than the vast and messy content of the internet? Maybe then the models would really only know how to form concepts that logically follow, and would be better suited to guide us towards insight into coherent truths. But what would that model even be like? What would it mean to only be able to think in logical truths?

Geometry of the Self

This makes me wonder about the way in which our human minds represent our concepts, and to what extent our inner worlds are fractured and entangled versus compositional and clean. In particular, I wonder about the representation that I have of myself in the manifold of my mind. If I perturbed that vector, how much else would change? If I perturbed it enough, would I be somebody else entirely? Would I start to believe things I never believed because who-I-think-I-am has shifted, and therefore everything that follows from that, has too?

In a lot of my experience I don't have doubt about whether I am staying the same self, so this representation must be relatively coherent. There must be some mechanism, some strong force field in the state space, that pushes it together - it would take far too much energy to move everything else around if the representation of who I am shifted, so it's locked in place through the pressure of the interconnected contingencies.

That takes the shape of narrative in my experience. It surfaces in my perception as ideas about who I am and, in particular, fears of any evidence that those ideas may not be true. People may refer to this as ego, although I find that has negative connotations which may be unhelpful. It is an attachment to being what your mind has converged on representing you as, and a pushing away from things that steer that vector, because of the energy it would take to reorganize everything else.

Reconfiguration

For me, the awareness of that allows my relationship with myself, my past, my concepts, my ideas of how and who I am, to be seen from a mathematical perspective, as fields or forces pushing against one another. That makes it somehow less sticky: I can have compassion for the system that is governing my experience and understand the nature of the friction I perceive as discomfort. The practice is to let the vectors move, slowly over time, to allow the reconfiguration, which I do ultimately want to happen, despite the frustration it creates.

Poetry

Earth

little toes press into soft soil
a steadiness.

welcomed by the worms
sinking into deep sand
tangled in tree roots
bitten by bugs
warm like a womb

here I can flourish
I stomp my feet, steady beat
the trees wink and I think I have landed.

A deep orange sun plunges
a lion’s roar, a dolphin’s squeal, a chanting.

I feel the tempo
a heartbeat it says:
boom, vroom, child is here!

this ancient child I am
a body of the earth; I am

Home, here I am
violently born,
I humbly live,
I quietly die, and return.

Wind

A-ho how she whispers
A-hum how she hums
A-ha how she roars!

she whips me with her cold wrath
and wraps me round in warmth

A wild beast she
shatters things she
sings to me softly.

At night I dream of leaping and
she takes me to the sky

at times I fear that she may have me fall

She speaks with the trees
they greet me with her waving arms
and tell me I am free

A-hee she is happy
A-ho she is wise
A-hey a-way she flies

Water

Rainfall on a rushing river
crashing through crevices
pooling into pockets, meanwhile

sleepy raindrops on the roof
sinking into slumber as
the room fills with water, warm,
evaporating at the rims
and dancing with the downpour;

delirious, disoriented
the depth of dark blue, draining,
drowning, soon to be asleep,
washed into the waters.

a steady current pulls

Awakening to dewy lawns
the last sweet trickle
in fresh and fertile soil,

thoroughly thawed and
tender and raw, a gentle tear,
a puddle of laughter, a joyous splash
a mist condensing on my skin

the channels open
rebirthing in rapture
a cleansing

coalescing with the ocean or
vaporizing to the sky
or seeping into being
in the blue.

Fire

A flame at a distance
promising shelter
my shivering body seeks

waves of warmth, localized
hands outstretched, grasping -

a strong desire for
father fire
to thaw me back to life.

He is of course, temperamental
riddled with violence
and confused about softness

Later, upon candlelight,
gathered round and dancing
in devotion - we stomp around
a trance of passion
to take into account protection
and safety in our selves

Supper

Pistachios and cashews
unsalted in a paper bag
piano in the background
running water from the tub,
a jar of artichokes
perhaps even some singing bowls.

Especially: a circumstance
at supper time the simple scent
of newly ready rice
a clang of cutlery
the water stops a moment to be grateful.

In the garden the
plants are sleeping and peace perhaps,
as well.

Cacophony

Rumbling ricochet
a rocket roars
a raspy resin a rougher day
a sleepless night a rusty
response to restlessness.

Confusion about trembling
tectonic plates that shudder from
within there is a distance to the
knowing and resistance to the
space

all the while a softer glow
that whispers in and mumbles round
and flows about and
quietens
and has a distinct texture
like syrup or a spacious steam it
tells me not to worry

I have this feeling now but
ought to be careful
with that?

Yosemite

The space of possibility and what I could have felt
when I pondered the stream the tristesse
of a young child clutching nothing,
the hollow feeling introduces itself,
and never quite departs her.

Or perhaps happy tears of sweetness
earthy glands respirating
and pulsating a knowing -

regardless, that was no preparation
for
I turned the corner and saw in awe
the masculinity

a roaring fall which overwhelms
itself along the mountainside.

Sunk to the ground, my head upon my partner,
I think about people who write books about romance
all the words I could put down on
the way we laugh together

soft light through the yellow cotton on my lamp,
his skin on my skin
tasting eternity in seeing
that in his eyes there is mind like mine.

Or perhaps on walking home at night,
raising my voice and he doesn’t hear
and when he says I baffle him, I react to his confusion

a claw draws chunks of flesh from my chest
for fear of being wrong about our closeness.

There is a humor in it though and
what once was oblong is now pointy,
and hasn’t a care in the world.

The opportunity for drama in every moment
lends itself carefully, creating explanations
for dust particles, the emergence of order,
slowly, over epochs,
an elegant context for our predicament.

On the aeroplane

A calling to capitulate
to colors on a canvas, words into a verse and
chords into a tune,

yet a dryness of the mind has
leaked from fear and
stained me.

The pull of a part against another,
and forest spirits battle in the dark.

She begs to be released, but vigilance persists.

At once to open eyes and light a candle and bang a drum!
Perhaps the present is here?

But we ought watch out for that treacherous being
that lives under the thoughts
and threatens to unleash into delusion.

It has happened before, I think.

We kid ourselves again, again, that holding on will make us safe that
there isn’t space for softness
or freedom unrestrained.

We tell ourselves we have the reins we ride through nights and think of pain
and watch ourselves tied up against the same old tired rope.

I think instead that it may be
that there is nothing left to find
except for evermore of mind,
and fear of what it may become
to love without condition.

The doubt creeps in again again that without might I’ll lose it all
the All that I’ve constructed with my clenching.

Delusion, confusion this predilection
that you are worthy for your condition.

If I lose the careful order of
my pieces of reality
then disordered things will happen,

and I won’t notice til it’s too late and
I won’t have taken care of him and been there for my friends.
I won’t have taken care of him he looks at me, concernedly.

I want to form a bond with being,
declare a romance with the truth.

Let go now, friend
my chest is tight from your suppression
the trust is warm, deep breath there is
a space for all your softness.

I’ll cherish my clarity every day,
I’ll feed myself enough,
I have no use for trying.

In work I’ll be productive and aligned with best intention
I’ll strive for joy in learning take instruction from myself.

I am a being that cares about people
there is no doubt that I can be the woman of my dreams.

Dusk as the stars appear

At last: I rummage around for a fragment
to enlighten that of this which remains for mattering.

there is so little to excel at
and yet something to an existence, carved with wavering fortitude, privy to alluring illusions of safety.

to untangle the web i encounter the need for permission to let go of
a yearning for splashing bath water, tender little shoes and softly brushed hair and
the red and frustrated cheeks of confusion

for in that lies a reverence which dissolves decision and concentrates fear

yet this unlikely life has resigned to be riddled with mystery, and in that a conception, in one way or the other

To Make the Dying Beautiful

What I thought to be a chirping bird
was in fact a sickly squirrel
the horror in its shriveled tail revolted me.
I tried to look into the eyes of
steady squeaks of desperation
and come to terms with ugliness.
To imagine my body that he touches so fondly
shriveled and rotten or
burnt to a crisp.

That fear spreads out like darkness
or ink blotches or storm clouds.

To make the Dying beautiful,
the opportunity for that.
Each of us insects turned around,
little arms clutching for
something firm to touch us back.

the intensity of grasping
this very moment, the colors are vivid and
how much love is there that isn’t tamed with torture.

To burst with passion upon a canvas a form that speaks to generations, and tells them of their honesty, a part we can’t remember.

To surrender to the present moment,
and speak with the divine
and take into consideration
that underneath the tangle
there is truth, and it is good.

If I melt into my subtle body,
I will encounter yours,
and all that came before,
the rotten and the beautiful.

In the warmth of my hand I sense
that we have been here
countless times before,
and so we know what to do with this.