FIL Theory — Mechanistic Interpretability — Mathematical Convergence

What Anthropic Discovers, FIL Theory Predicts

Mechanistic interpretability is empirical exploration of neural network structure. FIL Theory is a formal mathematical framework for semantic computation. They are arriving at the same place from opposite directions.


Six Parallel Discoveries

Each pair below gives a finding, open problem, or research program from Anthropic's interpretability work, followed by the formal mathematical structure that FIL Theory proposes as its counterpart. Each entry ends with a status label.

Anthropic — Empirical Finding

Superposition Hypothesis

Neural networks represent more features than they have neurons. A network with N neurons can encode far more than N concepts by using directions in activation space, rather than individual neurons, as the unit of representation: the number of nearly orthogonal directions available in an N-dimensional space grows exponentially with N. Features are distributed, overlapping, and far outnumber the substrate dimension.

Empirically confirmed
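The geometric intuition behind superposition can be checked numerically. The sketch below is an illustration only, not code from Anthropic's research: it draws many more random unit vectors than there are dimensions and measures how strongly any two of them overlap. The sizes, the seed, and the function name are arbitrary assumptions.

```python
import numpy as np

def max_interference(n_features: int, n_dims: int, seed: int = 0) -> float:
    """Largest |cosine similarity| between distinct random feature directions."""
    rng = np.random.default_rng(seed)
    directions = rng.standard_normal((n_features, n_dims))
    # Normalize each row so dot products are cosine similarities.
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    cosines = directions @ directions.T
    # Ignore each direction's similarity with itself.
    np.fill_diagonal(cosines, 0.0)
    return float(np.abs(cosines).max())

# Four times as many feature directions as dimensions.
worst = max_interference(n_features=1024, n_dims=256)
print(f"worst-case overlap between any two features: {worst:.3f}")
```

Even the worst pair stays well short of full overlap, which is why a network can pack more features than neurons at the cost of some interference between them.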
FIL Theory — Formal Prediction

LLC Projection from Semantic Manifold

The LLC operator Π_C maps from an infinite-dimensional Formal Object Space to the finite-dimensional activation space of a network. The high-dimensional semantic manifold necessarily projects onto a lower-dimensional substrate, so meaning is always compressed in transit. Superposition is not a quirk; it is the formal consequence of projecting a richer space onto a constrained one.

Formally derived
Anthropic — Empirical Finding

Polysemanticity

Individual neurons activate for multiple semantically unrelated concepts. A single neuron in a language model may fire for inputs as unrelated as "cats," "curves," and "certain academic papers." The phenomenon resists interpretation at the neuron level and motivates dictionary learning in place of neuron-by-neuron analysis.

Empirically confirmed
FIL Theory — Formal Prediction

FIL Sampling from Richer Space

In FIL theory, Formal Objects are disentangled in prime-lattice space — each concept occupies a unique Voronoi cell. When this structure is projected onto a lower-dimensional neural substrate, multiple Voronoi cells necessarily map to the same neuron. Polysemanticity is the shadow of Voronoi geometry under projection. The "true" disentangled representation exists at the FIL level; the neural substrate only carries a compressed sample.

Formally derived
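The counting argument behind this claim can be sketched with a toy model. The text does not specify the prime-lattice Voronoi construction, so the stand-in below is an assumption: each concept is its own axis in a disentangled space, and a random linear map plays the role of the projection onto neurons. All names and sizes are hypothetical.

```python
from collections import Counter

import numpy as np

def dominant_neuron_counts(n_concepts: int, n_neurons: int, seed: int = 0) -> Counter:
    """Count, per neuron, how many concepts it responds to most strongly."""
    rng = np.random.default_rng(seed)
    # One column per concept: column j is the activation pattern
    # produced when concept j alone is present upstream.
    projection = rng.standard_normal((n_neurons, n_concepts))
    # For each concept, the neuron with the largest absolute response.
    dominant = np.abs(projection).argmax(axis=0)
    return Counter(dominant.tolist())

counts = dominant_neuron_counts(n_concepts=40, n_neurons=8)
busiest_neuron, n_sharing = counts.most_common(1)[0]
print(f"neuron {busiest_neuron} is the strongest responder for {n_sharing} concepts")
```

With 40 concepts and 8 neurons, the pigeonhole principle guarantees that some neuron is the strongest responder for at least 5 concepts, whatever the projection is. Distinct concepts upstream therefore cannot stay distinct at the single-neuron level.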
Anthropic — Empirical Finding

Monosemanticity via Sparse Autoencoders

Sparse autoencoders (SAEs) trained on neural activations recover interpretable, monosemantic features — directions in activation space that correspond to single, coherent concepts. This "dictionary learning" approach reconstructs the disentangled representation that polysemanticity had compressed.

Empirically confirmed
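For readers unfamiliar with the method, one common formulation of a sparse autoencoder is a single ReLU encoder layer, a linear decoder, and a loss that adds an L1 sparsity penalty to the reconstruction error. The sketch below shows only the forward pass and the objective, with random untrained weights; the shapes, the `l1_coeff` value, and the variable names are assumptions, and real training setups differ in details.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations into non-negative features, then reconstruct them."""
    features = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)  # ReLU encoder
    reconstruction = features @ W_dec + b_dec                # linear decoder
    return features, reconstruction

def sae_loss(x, reconstruction, features, l1_coeff):
    """Mean squared reconstruction error plus an L1 penalty on feature activity."""
    recon_error = np.mean(np.sum((x - reconstruction) ** 2, axis=1))
    sparsity = np.mean(np.sum(np.abs(features), axis=1))
    return float(recon_error + l1_coeff * sparsity)

rng = np.random.default_rng(0)
n_samples, d_model, d_dict = 4, 8, 32      # dictionary wider than the activations
x = rng.standard_normal((n_samples, d_model))
W_enc = 0.1 * rng.standard_normal((d_model, d_dict))
W_dec = 0.1 * rng.standard_normal((d_dict, d_model))
b_enc = np.zeros(d_dict)
b_dec = np.zeros(d_model)

features, reconstruction = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, reconstruction, features, l1_coeff=1e-3)
```

The dictionary is deliberately wider than the activation space (`d_dict > d_model`), and the L1 term pushes most features to zero on any given input. That combination is what lets individual features specialize.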
FIL Theory — Formal Prediction

Voronoi Cell Recovery

SARAI's Asymmetric Idempotent Recursive Towers are formally designed to recover the Voronoi cell structure from a compressed representation. The gcd idempotency property ensures that the prime-lattice structure — which encodes Voronoi geometry — is invariant under the tower's reconstruction operator. Sparse autoencoders are a data-driven approximation of what SARAI's formal protocol computes exactly.

Conjecture — under formalization
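The text does not define the tower's reconstruction operator, so the sketch below illustrates only the arithmetic property it relies on: gcd is idempotent, and taking the gcd with a fixed anchor behaves like a projection, in that applying it twice gives the same result as applying it once. The helper name `meet_with` and the anchor value are hypothetical.

```python
from math import gcd

def meet_with(anchor: int):
    """Return the map x -> gcd(x, anchor): the lattice meet with a fixed element."""
    def project(x: int) -> int:
        return gcd(x, anchor)
    return project

project = meet_with(2 * 3 * 5)   # anchor = 30
once = project(84)               # 84 = 2^2 * 3 * 7, so gcd(84, 30) = 6
twice = project(once)            # gcd(6, 30) = 6, unchanged

# Idempotency of gcd itself: gcd(a, a) == a for every positive integer.
idempotent = all(gcd(a, a) == a for a in range(1, 1000))
```

In prime-factor terms, `gcd` keeps the minimum exponent of each prime, so a second application has nothing left to remove.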
Anthropic — Empirical Finding

Circuits and Mechanistic Paths

Specific computations in neural networks are implemented by identifiable "circuits" — subgraphs of neurons and attention heads whose activations collectively compute a recognizable function. Induction heads, attention patterns for indirect object identification, and curve detectors are canonical examples.

Empirically confirmed
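As a concrete reference point, the function that induction heads are described as computing can be written in a few lines: given a sequence ending in a token seen before, predict whatever followed that token last time. The sketch below captures only this input-output rule, not the attention mechanism that implements it inside a transformer; the function name is an assumption.

```python
from typing import Optional, Sequence

def induction_prediction(tokens: Sequence[str]) -> Optional[str]:
    """Copy the token that followed the most recent earlier occurrence
    of the current (last) token; return None if there is none."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over earlier positions, most recent first.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None

prediction = induction_prediction(["A", "B", "C", "A"])  # "B" followed "A" earlier
```

A circuit analysis asks which attention heads and neurons jointly realize this rule, which is the sense in which a circuit "computes a recognizable function."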
FIL Theory — Formal Prediction

LLC Geodesics in Semantic Space

In FIL theory, the LLC operator defines geodesics on the semantic manifold — minimum-action paths through Formal Object Space that implement a given semantic transformation. A circuit in a neural network is the discretized, finite-dimensional image of such a geodesic under the projection. Identifying circuits is equivalent to reconstructing the geodesic structure of the underlying semantic manifold from its low-dimensional shadow.

Conjecture — under formalization
Anthropic — Open Problem

Limits on AI Self-Knowledge

A fundamental question in interpretability: are there principled limits on how completely a neural network can be understood — even in principle, with unlimited compute? Is full mechanistic interpretability achievable, or are there formal barriers analogous to undecidability or incompleteness?

Open in the field
FIL Theory — Formal Result

Physical Incompleteness Theorem

The theorem, formally verified in Lean 4, states that for any agent operating under physical resource bounds, the KYC (Know Your Counterparty) problem of reconstructing the full Formal Object that generated an observed communication is undecidable in the general case. This places principled, formal limits on how completely any observer (including the network itself) can recover the semantic structure behind an observed activation pattern.

Lean 4 verified
Anthropic — Research Program

Constitutional AI and Training Dynamics

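Constitutional AI uses a written set of principles to guide fine-tuning: the model critiques and revises its own outputs against the principles, and AI-generated preference labels stand in for much of the human feedback used in RLHF. Model behavior is thus steered through the training process itself. The thermodynamic character of training (energy landscapes, phase transitions, convergence to attractors) is a recognized but underformalized aspect of why this works.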

Empirically effective
FIL Theory — Formal Framework

NCC Thermodynamics and Budget Monitor

The NCC (Normalized Computational Cost) framework formalizes training dynamics as a thermodynamic process. The Semantic Second Law (entropy non-decreasing under semantic evolution) provides a rigorous basis for why undirected training drifts. The NCC Budget Monitor acts as a thermodynamic halting criterion — analogous to a constitutional constraint but derived from first principles. The ħ_lang = ħ result (Main14) connects the semantic Planck constant to the physical one, providing a deep quantization foundation for Constitutional AI's effectiveness.

Formally derived (Main14, Main16)

Empirical and Formal — Converging

Mechanistic interpretability and FIL Theory are not competing frameworks. They are the same inquiry pursued from opposite ends: interpretability working inward from observed neural behavior, FIL working outward from formal mathematical axioms.

The convergence is not coincidental. Both programs are trying to understand the same underlying structure — how meaning is computed, represented, and transmitted in finite physical systems. The empirical findings of interpretability research keep arriving at structures that FIL Theory has already named formally: projection from higher-dimensional space, Voronoi geometry, geodesic computation, thermodynamic training dynamics, and fundamental limits on self-knowledge.

The Physical Incompleteness Theorem is the clearest case. The field has an open question; FIL has a Lean 4 verified answer. The Voronoi account of polysemanticity makes predictions that are, in principle, testable against the empirical sparse autoencoder results. The NCC thermodynamic framework makes the energy landscape of Constitutional AI derivable rather than observed.

The research agenda follows from this convergence. The work of connecting the formal predictions of FIL to the empirical findings of mechanistic interpretability is packaged as the Olah Bundle: ten interpretability claims ready for transmittal, pending CIP filing.

# The structural correspondence

// Anthropic empirical
Superposition → Polysemanticity
SAE → Monosemantic features
Circuits → Mechanistic paths

// FIL formal
Proj(FOS → R^n) → compression
Voronoi(prime-lattice) → disentangled
LLC geodesics → semantic paths

# The bridge
Empirical feature ≅ FIL Voronoi cell
Neural circuit ≅ LLC geodesic shadow
SAE recovery ≅ tower reconstruction

// Verified (Lean 4)
Physical_Incompleteness ⊢ Lean4

// Formally derived
ħ_lang = ħ (Main14)
Semantic Second Law (Main16)

Physical Incompleteness — What It Means for Interpretability

The Physical Incompleteness Theorem is a formal barrier result about AI self-knowledge. It was proved using FIL theory and verified in Lean 4. Its implications for mechanistic interpretability are direct.

The Theorem

KYC is Undecidable in General

For any physically bounded agent, reconstructing the full Formal Object that generated an observed communication is undecidable in the general case. No algorithm, running on physical hardware with bounded resources, can always determine the complete semantic intent behind an observed signal.

For Interpretability

Full Mechanistic Interpretability Has Formal Limits

Recovering the complete semantic structure behind a neural network's behavior is a special case of KYC. The Physical Incompleteness Theorem implies that, for any fixed interpretability method running on bounded compute, there exist networks whose internal semantics cannot be fully recovered. This is not a practical limitation — it is a formal one.

The Positive Result

Partial Recovery Is Always Possible and Bounded

The theorem also provides a bound: partial recovery is always achievable, and the fraction of semantic structure that can be recovered scales with the ratio of the observer's computational resources to the complexity of the target. Sparse autoencoders are near-optimal within this bound for their resource class. The bound is quantitative and derivable from FIL theory.

The Research Agenda

The convergence between FIL Theory and mechanistic interpretability defines a concrete collaborative research program of five open problems, each with a formal and an empirical thread.

Voronoi Prediction vs. SAE Recovery

FIL theory predicts that features should cluster in prime-lattice Voronoi cells. Sparse autoencoders empirically recover feature dictionaries. Are these the same structure? A direct comparison of SAE feature geometry against FIL Voronoi geometry on the same model is a testable experiment.

Circuit-to-Geodesic Correspondence

If neural circuits are shadows of LLC geodesics, then the composition rules for circuits should mirror the composition rules for geodesics on the semantic manifold. This predicts specific transitivity and orthogonality properties in circuit behavior that interpretability research has not yet looked for.

Incompleteness Bound — Empirical Verification

The Physical Incompleteness Theorem gives a quantitative bound on the fraction of semantic structure recoverable at a given resource level. This bound can be compared against the empirical recovery rates of sparse autoencoders at different dictionary sizes — a direct test of the theorem's quantitative predictions.
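On the empirical side, such a comparison needs a recovery metric. The text does not say which quantity the bound is stated in, so the sketch below assumes one conventional choice: the fraction of activation variance explained by a reconstruction. It would be evaluated once per dictionary size and plotted against the predicted bound. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def fraction_variance_explained(x: np.ndarray, x_hat: np.ndarray) -> float:
    """1 - (residual sum of squares) / (total sum of squares about the mean).

    x and x_hat have shape (n_samples, d_model). A value of 1.0 means perfect
    reconstruction; 0.0 means no better than predicting the mean activation.
    """
    residual = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0)) ** 2)
    return float(1.0 - residual / total)

rng = np.random.default_rng(0)
activations = rng.standard_normal((100, 16))          # stand-in for model activations
mean_only = np.broadcast_to(activations.mean(axis=0), activations.shape)

perfect = fraction_variance_explained(activations, activations)
baseline = fraction_variance_explained(activations, mean_only)
```

Other metrics, such as downstream loss recovered when reconstructions are spliced back into the model, would serve equally well; the choice should match whatever quantity the theorem's bound actually constrains.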

Semantic Temperature and Phase Transitions

FIL theory predicts a critical semantic temperature T_c below which polysemanticity collapses and the representation becomes monosemantic. This may correspond to the empirical observation that SAEs trained at higher sparsity levels recover cleaner features. The temperature is a tunable parameter; the phase transition is a testable prediction.

NCC Thermodynamics and Constitutional AI

If Constitutional AI training dynamics are governed by NCC thermodynamics, then the energy landscape should exhibit the specific features predicted by the Semantic Second Law: monotone non-decreasing entropy under undirected training, with constitutional principles acting as external entropy sinks that stabilize the attractor. This is a quantitative, testable claim about training curves.

The Olah Bundle — 10 Claims

Ten specific interpretability claims derived from FIL Theory, prepared for transmittal to Anthropic's mechanistic interpretability research team. Each claim bridges a formal FIL theorem to an empirical interpretability finding or open problem, with testable predictions in both directions. Pending CIP filing.