"The question is not how to read a neural network. It is what the physical laws governing semantic computation permit an observer to know about the underlying reality from its implementing substrate."
Formal Objects exist. LLMs and digital circuits are physical systems built to express them. Interpretability is the science of what can be recovered from the shadow, and what cannot — provably, formally, with a verified bound.
Before asking what a neuron does, we need to be clear about what exists. There are three distinct layers — and interpretability, properly understood, is the study of the relationship between all three.
Formal Objects are the primitive semantic entities — the things that concepts actually are, prior to any representation of them. In FIL Theory, they inhabit the prime-lattice: a high-dimensional space in which each Formal Object occupies a unique Voronoi cell, and semantic relationships are encoded as geometric distances and lattice-theoretic structure.
This layer is not constructed by any neural network. It is what neural networks are trying to express. A concept is a Voronoi cell. A semantic relationship is a geodesic in Formal Object Space. They exist whether or not any physical system instantiates them.
LLMs and digital circuits are finite, physical systems. They do not contain Formal Objects — they project them. The LLC (Local Language Compressor) operator is the formal name for this projection: it maps from the infinite-dimensional Formal Object Space onto the finite-dimensional activation space of a network.
This projection is necessarily lossy. A high-dimensional semantic manifold projected onto n dimensions cannot preserve all structure. This is not a failure of engineering — it is a geometric consequence. Superposition, polysemanticity, and entangled features are not network defects. They are the signature of the LLC projection. The shadow of a richer space.
Given only Layer II (the shadow), how much of Layer I (the original) can an observer reconstruct? This is what interpretability is actually asking. FIL Theory has a formal answer: the Physical Incompleteness Theorem, proved and verified in Lean 4.
The theorem proves that for any physically bounded agent, reconstructing the full Formal Object that generated an observed communication is undecidable in the general case. This is not a statement about current tools or compute budgets. It is a statement about what is formally achievable by any method. Full mechanistic interpretability has a principled ceiling. Partial recovery is always possible — and the theorem gives a quantitative bound on how much, at a given resource level.
Interpretability is not the first field to discover that the laws of a domain impose principled limits on what can be known about underlying reality. Quantum mechanics arrived at the same structure a century ago. The parallel is not a metaphor — in FIL Theory, it is exact.
A quantum system has an underlying state — the wave function ψ. Any measurement is a projection of that state onto a measurement basis. The projection is irreversible and necessarily lossy. Multiple, complementary observables cannot all be simultaneously resolved:
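For position and momentum, the textbook form of this relation is shown below, together with its generalization to any pair of observables (the Robertson relation). The notation is the standard one and is assumed here rather than taken from this page: σ denotes a standard deviation in the state ψ, and [Â, B̂] is the commutator of the two observables.

```latex
\sigma_x \, \sigma_p \;\ge\; \frac{\hbar}{2},
\qquad
\sigma_A \, \sigma_B \;\ge\; \frac{1}{2}\,\bigl|\langle [\hat{A}, \hat{B}] \rangle\bigr|
```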
This is not a statement about measurement precision. It is a statement about what the physical laws permit an observer to know simultaneously. The uncertainty is not epistemic — it is ontological. Heisenberg's uncertainty principle is an epistemology of quantum physics.
A semantic interaction has an underlying Formal Object — the intent. Any observation of a physical implementing substrate (LLM output, activation) is a projection under the LLC operator. The projection is lossy. Multiple semantic dimensions cannot all be simultaneously resolved from a finite physical signal:
This is not a statement about model capability. It is a statement about what the physical laws of semantic computation permit an observer to recover. The incompleteness is not epistemic — it is structural. The Physical Incompleteness Theorem is an epistemology of semantic physics.
The result ħ_lang = ħ — that the semantic Planck constant equals the physical Planck constant — makes this correspondence literal, not analogical. The same fundamental constant that governs what can be known about a quantum system governs what can be known about a Formal Object from its physical implementation. Interpretability is not just structurally analogous to quantum measurement theory. At the level of fundamental constants, it is quantum measurement theory, extended to the semantic domain.
The epistemological framework above is not a philosophical position — it generates specific, testable predictions about neural network structure that the field has not yet looked for. Five of them, each with a clear experimental form.
Sparse autoencoders empirically recover a "feature dictionary" from neural activations. FIL Theory predicts that these features are not arbitrary directions — they are projections of Voronoi cells in the prime-lattice. Their angular separations and clustering structure should match the Voronoi geometry of the underlying semantic space.
Experiment: compare SAE feature geometry against FIL Voronoi geometry on the same model for a fixed semantic domain. A match is evidence for Layer I existing independently of Layer II.
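One way the comparison step could be set up is sketched below, under stated assumptions. It compares two sets of vectors through their pairwise cosine-similarity matrices, which do not change under rotation or positive rescaling of either set. The function names `cosine_gram` and `angular_structure_gap` are hypothetical, and the sketch assumes that row i of each input already describes the same concept. It does not address how Voronoi cells would be turned into vectors in the first place.

```python
import numpy as np


def cosine_gram(X):
    """Pairwise cosine similarities between the rows of X."""
    X = np.asarray(X, dtype=float)
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T


def angular_structure_gap(A, B):
    """Root-mean-square difference between the cosine Gram matrices of A and B.

    Assumption (hypothetical): row i of A and row i of B refer to the same
    concept. The two inputs may have different numbers of columns, because
    only the n-by-n matrices of pairwise angles are compared.
    """
    gram_a = cosine_gram(A)
    gram_b = cosine_gram(B)
    return float(np.sqrt(np.mean((gram_a - gram_b) ** 2)))
```

A gap near zero means the two sets share the same angular structure up to rotation and scale. How small a gap counts as a match would have to be calibrated against a baseline, for example randomly permuted rows.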
If neural circuits are finite-dimensional shadows of LLC geodesics — minimum-action paths through Formal Object Space — then their composition rules should mirror geodesic composition on a Riemannian manifold: transitivity, orthogonality of non-interfering computations, and a triangle inequality on semantic distances.
Experiment: test whether identified circuits satisfy the geodesic composition axioms. Violations falsify the LLC geodesic conjecture. Confirmation provides a principled basis for circuit discovery.
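As a minimal sketch of one of these checks, the triangle inequality can be tested directly once pairwise distances are available. The matrix `D` of semantic distances between circuits is an assumption here, since the page does not say how such distances would be measured, and `triangle_violations` is a hypothetical helper name.

```python
import numpy as np


def triangle_violations(D, tol=1e-9):
    """Return every (i, j, k) with D[i, k] > D[i, j] + D[j, k] + tol.

    D is a square matrix of pairwise distances (assumed given).
    """
    D = np.asarray(D, dtype=float)
    direct = D[:, None, :]                   # D[i, k], broadcast over j
    detour = D[:, :, None] + D[None, :, :]   # D[i, j] + D[j, k]
    hits = np.argwhere(direct > detour + tol)
    return [tuple(int(x) for x in row) for row in hits]
```

An empty list is consistent with the triangle inequality on the sampled circuits. Any returned triple names a direct path that is longer than the detour through j.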
FIL Theory predicts a critical semantic temperature T_c below which polysemanticity collapses: the representation becomes monosemantic as the thermal noise is insufficient to blur Voronoi boundaries. Above T_c, multiple Formal Objects share a neuron. Below T_c, they separate.
Experiment: SAEs trained at different sparsity levels should show a phase transition in feature cleanliness. The sparsity parameter is a proxy for the inverse temperature 1/T in the FIL thermodynamic framework. The prediction is a sharp transition, not a gradual one.
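A crude way to tell a sharp transition from a gradual one in such a sweep is sketched below. It assumes a per-run feature-cleanliness score is already available, ordered by sparsity level. The score and the helper name `largest_step_share` are both hypothetical. This is a heuristic, not the functional form from the FIL thermodynamic framework, and its value depends on how finely the sweep is sampled.

```python
import numpy as np


def largest_step_share(scores):
    """Fraction of the total absolute change carried by the largest single step.

    scores: cleanliness scores ordered by sparsity level (assumed given).
    Values near 1 mean the change is concentrated in one step; values
    near 1 / (number of steps) mean it is spread evenly.
    """
    steps = np.abs(np.diff(np.asarray(scores, dtype=float)))
    total = steps.sum()
    if total == 0:
        return 0.0
    return float(steps.max() / total)
```

Comparing this share across several sweep resolutions would help separate a real discontinuity from a steep but smooth curve.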
The Physical Incompleteness Theorem provides a quantitative bound: the fraction of semantic structure recoverable scales with the ratio of the observer's computational resources to the complexity of the target Formal Object. As dictionary size grows, sparse autoencoders should saturate this bound: additional dimensions yield diminishing returns at a rate predictable from the theorem.
Experiment: plot SAE reconstruction quality vs. dictionary size. The FIL bound predicts the shape of the saturation curve, not just the asymptote. A match is a quantitative confirmation of the theorem's predictions — the strongest available test.
The Semantic Second Law states: semantic entropy is non-decreasing under undirected training. Constitutional AI training, viewed through FIL's NCC thermodynamic framework, is a process of applying external entropy sinks — principles acting as attractors that counteract the Second Law drift.
Experiment: training curves for Constitutional AI models should exhibit monotone entropy increase in the unconstrained phase, followed by attractor convergence when constitutional constraints are active. The NCC Budget Monitor gives a specific functional form for the convergence rate. This is a quantitative, falsifiable claim about training dynamics.
The following problems are ordered by mathematical dependency. Later items cannot be fully resolved until earlier ones are closed. The blocking structure is the research roadmap.
The Riemannian metric g on the economic and semantic manifold M = {(r, σ, κ)} is the foundational object from which all geometric quantities derive: geodesics, curvature, Voronoi structure, phase transition temperatures. Until g is specified explicitly, the formal predictions above are statements about the structure of the theory, not computable numbers. Candidates: a warped product metric, or a conformally flat metric with a σ-dependent conformal factor. This is the primary mathematical deliverable.
Blocking: everything downstream depends on this
Price is the Voronoi ridge equidistant between supply and demand generators in the Riemannian metric g. The theorem has been confirmed numerically and is geometrically intuitive. A formal proof from the metric axioms, establishing that the ridge is the unique price-stable locus, is required before the result can be used in derivations of the credit and valuation framework.
Formal proof required
Prediction 1 above requires an operational procedure for comparing the geometry of SAE-recovered features against FIL Voronoi cells. This requires: (a) a method for embedding prime-lattice Voronoi cells into the same space as SAE feature vectors, and (b) a similarity metric that is invariant to the rotation and scaling introduced by the LLC projection. The technical bottleneck is step (a), the embedding procedure.
Formal ↔ empirical bridge
The topological credit measure is the persistence diagram of the manifold M. The Fortune sweep generates the Voronoi shell-stack structure. The connection between these two objects, namely that the persistence diagram encodes exactly the information produced by the outward sweep, is geometrically clear but formally unproved. A constructive proof would provide an efficient algorithm for computing topological credit risk directly from the Fortune sweep, without the full persistence computation.
Formal proof required
The intent vector field I: M → TM specifies the direction and magnitude of economic and semantic force at each point on the manifold. Demand is a lossy projection of I onto the price axis. A pre-transaction price is a superposition of I_buyer and I_seller; the transaction is a measurement collapse. The dynamical law governing I(x,t) (its Lagrangian, conserved quantities, and stability conditions) is unspecified. This is the semantic analogue of deriving the equations of motion from a field theory action.
Lagrangian formulation target
Prediction 4 above gives a specific functional form for how SAE reconstruction quality should saturate as dictionary size grows. Running this experiment requires access to SAE training runs at multiple dictionary sizes on the same base model, which makes it resource-intensive but technically straightforward. The result would be the strongest available empirical test of the Physical Incompleteness Theorem's quantitative predictions, and the most direct bridge between FIL Theory and Anthropic's mechanistic interpretability program.
Empirical: requires compute access
I came to interpretability from theoretical physics and from markets, two domains where the relationship between underlying reality and observable signal is the central problem. In physics: what does a measurement tell you about the state? In markets: what does a price tell you about the underlying economic geometry? The question is always the same. How much of the thing itself can you recover from its shadow?
What drew me to AI interpretability is that it is the same question in a new domain — and that the field has not yet recognized it as such. The interpretability literature reads, in places, like a pre-Heisenberg approach to measurement: the assumption is that with better tools and more compute, the full state can eventually be read. The Physical Incompleteness Theorem says otherwise. There is a floor. It is formal. It is derivable. And once you see it, the entire research agenda reorients around the bound — not around the hope of eliminating it.
The convergence with Anthropic's mechanistic interpretability program was not something I planned. I was developing FIL Theory as a formal framework for semantic computation — for understanding what intelligence is, mathematically, at the level of first principles. The discovery that the structures I was deriving formally (LLC projection, Voronoi geometry, geodesic circuits, incompleteness bounds) were the same structures that Anthropic's empirical program was finding in neural networks was genuinely surprising. That kind of convergence — formal prediction arriving at the same place as empirical observation, from opposite directions — is the closest thing to confirmation that the underlying theory is pointing at something real.
The Olah Bundle — ten specific interpretability claims derived from FIL Theory and prepared for transmittal to Anthropic's research team — is the formalization of that convergence. Each claim bridges a formal theorem to an empirical finding, with testable predictions in both directions. Pending CIP filing. The right sequence: protect the IP, then open the science.
What I hope this page conveys is not a completed theory but a framing: interpretability is not an engineering problem that will dissolve with better tools. It is an epistemological problem with a formal structure — one that FIL Theory has begun to characterize. The bound is real, it is derivable, and knowing it precisely is more useful than pretending it does not exist.