Mechanistic interpretability is the empirical exploration of neural network structure. FIL Theory is a formal mathematical framework for semantic computation. The two are arriving at the same place from opposite directions.
Each pair below gives a phenomenon identified empirically by Anthropic's interpretability research, followed by the corresponding formal mathematical structure predicted by FIL Theory. A status label follows each entry.
Neural networks represent more features than they have neurons. A network with N neurons can encode exponentially more concepts by using directions in activation space rather than individual neurons as the unit of representation. Features are distributed, overlapping, and far outnumber the substrate dimension.
Empirically confirmed
The LLC operator Pi_C maps from an infinite-dimensional Formal Object Space to the finite-dimensional activation space of a network. The high-dimensional semantic manifold necessarily projects onto a lower-dimensional substrate, so meaning is always compressed in transit. Superposition is not a quirk; it is the formal consequence of projecting a richer space onto a constrained one.
Formally derived
Individual neurons activate for multiple semantically unrelated concepts. A single neuron in a language model fires for "cats," "curves," and "certain academic papers," a phenomenon that resists interpretation at the neuron level and motivates dictionary learning over individual neurons.
Empirically confirmed
In FIL theory, Formal Objects are disentangled in prime-lattice space: each concept occupies a unique Voronoi cell. When this structure is projected onto a lower-dimensional neural substrate, multiple Voronoi cells necessarily map to the same neuron. Polysemanticity is the shadow of Voronoi geometry under projection. The "true" disentangled representation exists at the FIL level; the neural substrate only carries a compressed sample.
Formally derived
Sparse autoencoders (SAEs) trained on neural activations recover interpretable, monosemantic features: directions in activation space that correspond to single, coherent concepts. This "dictionary learning" approach reconstructs the disentangled representation that polysemanticity had compressed.
Empirically confirmed
SARAI's Asymmetric Idempotent Recursive Towers are formally designed to recover the Voronoi cell structure from a compressed representation. The gcd idempotency property ensures that the prime-lattice structure, which encodes Voronoi geometry, is invariant under the tower's reconstruction operator. Sparse autoencoders are a data-driven approximation of what SARAI's formal protocol computes exactly.
Conjecture (under formalization)
Specific computations in neural networks are implemented by identifiable "circuits": subgraphs of neurons and attention heads whose activations collectively compute a recognizable function. Induction heads, attention patterns for indirect object identification, and curve detectors are canonical examples.
Empirically confirmed
In FIL theory, the LLC operator defines geodesics on the semantic manifold: minimum-action paths through Formal Object Space that implement a given semantic transformation. A circuit in a neural network is the discretized, finite-dimensional image of such a geodesic under the projection. Identifying circuits is equivalent to reconstructing the geodesic structure of the underlying semantic manifold from its low-dimensional shadow.
Conjecture (under formalization)
A fundamental question in interpretability: are there principled limits on how completely a neural network can be understood, even in principle and with unlimited compute? Is full mechanistic interpretability achievable, or are there formal barriers analogous to undecidability or incompleteness?
Open in the field
The Physical Incompleteness Theorem, formally verified in Lean 4, proves that for any agent operating under physical resource bounds, the KYC (Know Your Counterparty) problem, that is, reconstructing the full Formal Object that generated an observed communication, is undecidable in the general case. This places principled, formal limits on how completely any observer (including the network itself) can recover the semantic structure behind an observed activation pattern.
Lean 4 verified
Constitutional AI uses a set of principles to guide fine-tuning, with reinforcement learning driven by AI-generated feedback against those principles, steering model behavior through the training process itself. The thermodynamic character of training (energy landscapes, phase transitions, convergence to attractors) is a recognized but underformalized aspect of why this works.
Empirically effective
The NCC (Normalized Computational Cost) framework formalizes training dynamics as a thermodynamic process. The Semantic Second Law (entropy non-decreasing under semantic evolution) provides a rigorous basis for why undirected training drifts. The NCC Budget Monitor acts as a thermodynamic halting criterion, analogous to a constitutional constraint but derived from first principles. The ħ_lang = ħ result (Main14) connects the semantic Planck constant to the physical one, providing a deep quantization foundation for Constitutional AI's effectiveness.
Formally derived (Main14, Main16)
Mechanistic interpretability and FIL Theory are not competing frameworks. They are the same inquiry pursued from opposite ends: interpretability working inward from observed neural behavior, FIL working outward from formal mathematical axioms.
The convergence is not coincidental. Both programs are trying to understand the same underlying structure — how meaning is computed, represented, and transmitted in finite physical systems. The empirical findings of interpretability research keep arriving at structures that FIL Theory has already named formally: projection from higher-dimensional space, Voronoi geometry, geodesic computation, thermodynamic training dynamics, and fundamental limits on self-knowledge.
The Physical Incompleteness Theorem is the clearest case. The field has an open question; FIL has a Lean 4 verified answer. The Voronoi account of polysemanticity makes predictions that are, in principle, testable against the empirical sparse autoencoder results. The NCC thermodynamic framework makes the energy landscape of Constitutional AI derivable rather than observed.
The research agenda follows from this convergence. Connecting the formal predictions of FIL to the empirical findings of mechanistic interpretability is the purpose of the Olah Bundle: ten interpretability claims ready for transmittal, pending CIP filing.
The Physical Incompleteness Theorem is a formal barrier result about AI self-knowledge. It was proved using FIL theory and verified in Lean 4. Its implications for mechanistic interpretability are direct.
For any physically bounded agent, reconstructing the full Formal Object that generated an observed communication is undecidable in the general case. No algorithm, running on physical hardware with bounded resources, can always determine the complete semantic intent behind an observed signal.
Recovering the complete semantic structure behind a neural network's behavior is a special case of KYC. The Physical Incompleteness Theorem implies that, for any fixed interpretability method running on bounded compute, there exist networks whose internal semantics cannot be fully recovered. This is not a practical limitation — it is a formal one.
The theorem also provides a bound: partial recovery is always achievable, and the fraction of semantic structure that can be recovered scales with the ratio of the observer's computational resources to the complexity of the target. Sparse autoencoders are near-optimal within this bound for their resource class. The bound is quantitative and derivable from FIL theory.
The convergence between FIL Theory and mechanistic interpretability defines a concrete collaborative research program. Five open problems, each with a formal and an empirical thread.
FIL theory predicts that features should cluster in prime-lattice Voronoi cells. Sparse autoencoders empirically recover feature dictionaries. Are these the same structure? A direct comparison of SAE feature geometry against FIL Voronoi geometry on the same model is a testable experiment.
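One way such a comparison might be set up is sketched below in Python with NumPy. It partitions the same activation vectors twice, once by strongest SAE feature and once by nearest predicted Voronoi site, and scores how often the two partitions agree. Everything concrete here is an assumption made for illustration: `sae_directions` and `predicted_sites` are random stand-ins (the source does not specify how FIL Voronoi sites would be computed for a real model), and the Rand index is just one reasonable agreement score.

```python
import numpy as np

def voronoi_assign(points, sites):
    """Index of the nearest site for each point, i.e. its Voronoi cell."""
    # Squared Euclidean distance between every point and every site.
    d2 = ((points[:, None, :] - sites[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def rand_index(labels_a, labels_b):
    """Fraction of point pairs that two partitions treat the same way."""
    same_a = labels_a[:, None] == labels_a[None, :]
    same_b = labels_b[:, None] == labels_b[None, :]
    upper = np.triu_indices(len(labels_a), k=1)  # each unordered pair once
    return float((same_a[upper] == same_b[upper]).mean())

rng = np.random.default_rng(0)
dim, n_features, n_points = 16, 8, 200

# Hypothetical stand-in for SAE decoder directions (normalized to unit length).
sae_directions = rng.normal(size=(n_features, dim))
sae_directions /= np.linalg.norm(sae_directions, axis=1, keepdims=True)

# Hypothetical stand-in for predicted Voronoi sites: a noisy copy of the above.
predicted_sites = sae_directions + 0.05 * rng.normal(size=(n_features, dim))

# Synthetic activation vectors; a real experiment would use model activations.
activations = rng.normal(size=(n_points, dim))

sae_cells = (activations @ sae_directions.T).argmax(axis=1)  # strongest feature
fil_cells = voronoi_assign(activations, predicted_sites)     # nearest site

print(f"Rand index between partitions: {rand_index(sae_cells, fil_cells):.3f}")
```

A Rand index near 1 would mean the two partitions group activations the same way. Because the two label sets use unrelated numbering, a pair-counting score is used here rather than direct label equality.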
If neural circuits are shadows of LLC geodesics, then the composition rules for circuits should mirror the composition rules for geodesics on the semantic manifold. This predicts specific transitivity and orthogonality properties in circuit behavior that interpretability research has not yet looked for.
The Physical Incompleteness Theorem gives a quantitative bound on the fraction of semantic structure recoverable at a given resource level. This bound can be compared against the empirical recovery rates of sparse autoencoders at different dictionary sizes — a direct test of the theorem's quantitative predictions.
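The empirical side of that comparison needs a recovery curve: recovered fraction as a function of dictionary size. The Python sketch below produces such a curve on synthetic data. It is only a toy under stated assumptions: instead of training sparse autoencoders it swaps in 1-sparse coding against nested subsets of the true feature directions, the data are synthetic with exactly one active feature per sample, and "recovery" is taken to mean the fraction of squared norm reconstructed. The theorem's own bound is not given in the text, so none is plotted.

```python
import numpy as np

def fraction_recovered(x, dictionary):
    """Fraction of squared norm reconstructed by 1-sparse coding.

    Each sample is approximated by its projection onto the single
    unit-norm atom with the largest absolute dot product.
    """
    dots = x @ dictionary.T                # (n_samples, n_atoms)
    best = np.abs(dots).max(axis=1)        # strongest single atom per sample
    # For a unit atom d, ||x - (x.d) d||^2 = ||x||^2 - (x.d)^2.
    return float((best ** 2).sum() / (x ** 2).sum())

rng = np.random.default_rng(0)
n_features, dim, n_samples = 64, 16, 2000

# More feature directions than dimensions, as in superposition.
features = rng.normal(size=(n_features, dim))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# Each synthetic sample activates exactly one feature with a random scale.
active = rng.integers(0, n_features, size=n_samples)
scale = rng.uniform(0.5, 1.5, size=(n_samples, 1))
x = scale * features[active]

# Nested dictionaries: the first m true directions stand in for a size-m SAE.
sizes = [8, 16, 32, 64]
recovery = [fraction_recovered(x, features[:m]) for m in sizes]

for m, r in zip(sizes, recovery):
    print(f"dictionary size {m:3d}: fraction recovered {r:.3f}")
```

With real SAEs the same loop would train one autoencoder per dictionary size and report its reconstruction quality, and the resulting curve could then be set against whatever quantitative bound the theorem supplies.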
FIL theory predicts a critical semantic temperature T_c below which polysemanticity collapses and the representation becomes monosemantic. This may correspond to the empirical observation that SAEs trained at higher sparsity levels recover cleaner features. The temperature is a tunable parameter; the phase transition is a testable prediction.
If Constitutional AI training dynamics are governed by NCC thermodynamics, then the energy landscape should exhibit the specific features predicted by the Semantic Second Law: monotone entropy increase under undirected training, with constitutional principles acting as external entropy sinks that stabilize the attractor. This is a quantitative, testable claim about training curves.
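A check of the monotonicity part of this claim could look like the Python sketch below. It assumes that the relevant entropy can be approximated by the Shannon entropy of a model's output distribution at each checkpoint. That is a stand-in chosen for illustration, since the text does not define the entropy used by the Semantic Second Law. The checkpoint data are also synthetic: a fixed set of logits softened at increasing temperature, not a real training run.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy (in nats) of a non-negative weight vector."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()              # normalize to a probability distribution
    nz = p[p > 0]                # treat 0 * log(0) as 0
    return float(-(nz * np.log(nz)).sum())

def is_non_decreasing(values, tol=1e-9):
    """True if each value is at least the previous one, up to tol."""
    return all(b >= a - tol for a, b in zip(values, values[1:]))

# Synthetic stand-in for per-checkpoint output distributions:
# the same logits, progressively flattened.
logits = np.array([2.0, 1.0, 0.5, 0.0])
temperatures = [0.5, 1.0, 2.0, 4.0]

entropies = []
for t in temperatures:
    weights = np.exp(logits / t)
    entropies.append(shannon_entropy(weights))

for t, h in zip(temperatures, entropies):
    print(f"temperature {t:.1f}: entropy {h:.3f} nats")
print("entropy non-decreasing:", is_non_decreasing(entropies))
```

On real training curves, the same check would be run separately for undirected and constitutionally constrained runs, with the prediction being that only the former passes.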
Ten specific interpretability claims derived from FIL Theory, prepared for transmittal to Anthropic's mechanistic interpretability research team. Each claim bridges a formal FIL theorem to an empirical interpretability finding or open problem, with testable predictions in both directions. Pending CIP filing.