
The Spectral Inverse Problem: From Group Theory to Foundation Models

March 8, 2026

Vibrational spectroscopy — IR and Raman — is one of the most widely deployed analytical techniques in chemistry. You shine light on a molecule, measure what comes back, and try to figure out what the molecule looks like. The forward direction of this problem is solved: given a structure, compute its spectrum. The inverse direction — given a spectrum, recover the structure — is fundamentally harder, and the reason is group theory.

The Forward Map

The starting point is the Wilson GF secular equation:

$\det(\mathbf{GF} - \lambda \mathbf{I}) = 0$

The matrix G encodes atomic masses and molecular geometry. The matrix F is the force constant matrix — essentially the Hessian of the potential energy surface. The eigenvalues give the squared vibrational frequencies, and the eigenvectors determine which modes are observable by IR and Raman spectroscopy.
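As a minimal sketch of the secular equation in action, the snippet below solves it for the two C=O stretches of linear CO₂ in internal coordinates. The force constants are illustrative round numbers chosen for this example, not fitted values, and the helper name `gf_wavenumbers` is hypothetical.

```python
import numpy as np

AMU = 1.66053906660e-27   # kg per atomic mass unit
C_CM = 2.99792458e10      # speed of light in cm/s

def gf_wavenumbers(G, F):
    """Solve det(GF - lambda I) = 0 and convert the eigenvalues to cm^-1."""
    lam = np.linalg.eigvals(G @ F)   # GF is not symmetric in general
    lam = np.sort(lam.real)
    # lambda = (2 * pi * c * wavenumber)^2
    return np.sqrt(lam) / (2.0 * np.pi * C_CM)

# Inverse atomic masses in 1/kg.
mu_O = 1.0 / (15.995 * AMU)
mu_C = 1.0 / (12.000 * AMU)

# Wilson G for two bond stretches sharing the central atom at 180 degrees:
# the off-diagonal element is mu_C * cos(180 deg) = -mu_C.
G = np.array([[mu_O + mu_C, -mu_C],
              [-mu_C, mu_O + mu_C]])

# Assumed force constants in N/m: bond stretch and stretch-stretch coupling.
k, k12 = 1600.0, 130.0
F = np.array([[k, k12],
              [k12, k]])

freqs = gf_wavenumbers(G, F)
print(freqs)  # symmetric stretch first, antisymmetric stretch second
```

With these assumed inputs, the lower eigenvalue belongs to the symmetric stretch (only the oxygens move, so only the oxygen mass enters) and the higher one to the antisymmetric stretch. A full treatment would add the bending coordinates and the intensities that come from the eigenvectors.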

What makes the forward map well-behaved is that it’s smooth (away from eigenvalue degeneracies), computable, and well-conditioned. Given any reasonable molecular geometry, you can compute the full harmonic IR and Raman spectrum to within the accuracy of the underlying electronic-structure method. DFT codes do this routinely at the B3LYP/def2-TZVP level.

The inverse map has none of these properties.

[Listing: forward_map.py]

Why the Inverse Fails: Symmetry

The fundamental obstruction to inversion is molecular symmetry. A molecule’s point group G (the symmetry group, not the Wilson G matrix above) determines which vibrational modes are visible to each technique. The selection rules are strict:

A mode is IR-active only if it transforms as a translation (changes the dipole moment)
A mode is Raman-active only if it transforms as a quadratic form (changes the polarizability)
Modes that do neither are silent — permanently invisible to both techniques

The Information Completeness Ratio measures the damage:

$R(G, N) = \frac{N_{\text{IR}} + N_{\text{Raman}}}{3N - 6}$

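Here the numerator should be read as the number of modes active in at least one technique, so a mode that is both IR- and Raman-active counts once; for linear molecules the denominator is $3N - 5$. When $R = 1$, every vibrational degree of freedom is observable by at least one technique. When $R < 1$, information is permanently lost.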
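To make the ratio concrete, here is a small sketch that computes it from a character-table style mode list. It assumes each observable mode is counted once even if it is active in both techniques (the reading under which $R = 1$ means "everything is visible"), and it handles nonlinear molecules only. The `Irrep` container and `completeness_ratio` function are names invented for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Irrep:
    label: str
    count: int       # times the irrep appears in the vibrational representation
    degeneracy: int  # 1 for A/B irreps, 2 for E irreps
    ir: bool         # transforms like a translation (x, y, z)
    raman: bool      # transforms like a quadratic form (x^2, xy, ...)

def completeness_ratio(irreps, n_atoms):
    """R(G, N) for a nonlinear molecule, counting each observable mode once."""
    total = 3 * n_atoms - 6
    if sum(i.count * i.degeneracy for i in irreps) != total:
        raise ValueError("irreps do not add up to 3N - 6 modes")
    observable = sum(i.count * i.degeneracy for i in irreps if i.ir or i.raman)
    return observable / total

# Water, C2v: every mode is both IR- and Raman-active.
# (The B1/B2 label depends on the axis convention.)
water = [Irrep("A1", 2, 1, True, True), Irrep("B2", 1, 1, True, True)]

# Benzene, D6h: standard decomposition of its 30 vibrations.
benzene = [
    Irrep("A1g", 2, 1, False, True), Irrep("A2g", 1, 1, False, False),
    Irrep("B2g", 2, 1, False, False), Irrep("E1g", 1, 2, False, True),
    Irrep("E2g", 4, 2, False, True), Irrep("A2u", 1, 1, True, False),
    Irrep("B1u", 2, 1, False, False), Irrep("B2u", 2, 1, False, False),
    Irrep("E1u", 3, 2, True, False), Irrep("E2u", 2, 2, False, False),
]

r_water = completeness_ratio(water, n_atoms=3)
r_benzene = completeness_ratio(benzene, n_atoms=12)
print(r_water, r_benzene)
```

Under these assumptions water is fully observable, while benzene comes out at 19/30 because 11 of its 30 modes are silent.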

Theorem 1 — Symmetry Quotient

The vibrational forward map is G-invariant: it factors through the quotient space M/G. The inverse map recovers structure only up to symmetry equivalence. When R(G, N) = 1, the quotient map is potentially injective. When R < 1, the silent modes create a degenerate fiber — multiple distinct force constant matrices produce identical spectra.

How bad does it get? For 99.9% of organic molecules, R = 1 and everything is observable. But the exceptions matter:

[Listing: information_completeness.py]

There is a structural result that makes combined IR + Raman strictly better than either alone. For molecules with a center of inversion (the centrosymmetric ones — CO₂, benzene, cubane), the mutual exclusion principle applies:

Theorem 2 — Modal Complementarity

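For centrosymmetric molecules, IR-active and Raman-active modes are completely disjoint. Raman-active modes are always gerade (symmetric under inversion) and IR-active modes are always ungerade (antisymmetric under inversion); a mode of either parity can still be silent. Whenever both sets are non-empty, combined measurement strictly increases the observable degrees of freedom over either technique alone.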

This is not an approximation — it follows directly from the character table. The practical consequence: any ML model that fuses IR + Raman should see its largest accuracy gains on centrosymmetric molecules. This is a testable, quantitative prediction from the theory.
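The mutual exclusion rule is easy to check by hand for small molecules. The sketch below does the bookkeeping for CO₂ and ethylene, using their standard mode assignments. The tuple layout and the `mode_counts` helper are invented for this example, and the three B-type irreps of each parity in ethylene are lumped together to sidestep axis-convention differences.

```python
# (label, parity, number of modes incl. degeneracy, IR-active, Raman-active)
CO2 = [
    ("Sigma_g+", "g", 1, False, True),   # symmetric stretch
    ("Sigma_u+", "u", 1, True, False),   # antisymmetric stretch
    ("Pi_u",     "u", 2, True, False),   # doubly degenerate bend
]
ETHYLENE = [
    ("Ag",  "g", 3, False, True),
    ("Big", "g", 3, False, True),    # B1g, B2g, B3g lumped together
    ("Au",  "u", 1, False, False),   # silent torsion
    ("Biu", "u", 5, True, False),    # B1u, B2u, B3u lumped together
]

def mode_counts(modes):
    """Count observable modes and verify mutual exclusion along the way."""
    for label, parity, _, ir, raman in modes:
        if ir and raman:
            raise ValueError(f"{label} is active in both techniques")
        if ir and parity != "u":
            raise ValueError(f"{label} is IR-active but not ungerade")
        if raman and parity != "g":
            raise ValueError(f"{label} is Raman-active but not gerade")
    n_ir = sum(n for _, _, n, ir, _ in modes if ir)
    n_raman = sum(n for _, _, n, _, raman in modes if raman)
    n_total = sum(n for _, _, n, _, _ in modes)
    # The sets are disjoint, so the combined count is a plain sum.
    return {"ir": n_ir, "raman": n_raman,
            "combined": n_ir + n_raman, "total": n_total}

co2_counts = mode_counts(CO2)
ethylene_counts = mode_counts(ETHYLENE)
print(co2_counts)
print(ethylene_counts)
```

For CO₂ the combined measurement sees all four modes where Raman alone sees one. Ethylene shows the other half of the story: the two techniques together cover 11 of 12 modes, and the remaining torsion is silent to both.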

Generic Identifiability

The central open question is whether combined IR + Raman can uniquely determine molecular structure (up to symmetry equivalence) at generic points:

Conjecture 3 — Generic Identifiability

For almost all molecular geometries (outside a measure-zero set), the combined IR + Raman forward map is injective on the quotient space: distinct force constant equivalence classes produce distinct combined spectra.

This is a conjecture, not a theorem. The obstruction to proving it is that the forward map’s smoothness breaks at eigenvalue degeneracies, so Sard’s theorem does not directly apply. But the numerical evidence is strong:

[Listing: jacobian_rank_analysis.py]

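A 4:1 overdetermination ratio means the combined spectra supply roughly four times as many equations as unknowns. This suggests that, at generic points, the inverse problem is not just solvable but reasonably well-conditioned, although overdetermination alone does not guarantee good conditioning.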
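The flavor of a Jacobian rank check can be shown on a toy system. The sketch below is not the analysis behind the numbers above. It uses a hypothetical one-dimensional chain of three masses joined by two springs, treats the two spring constants as the unknowns and the two nonzero squared frequencies as the observables, and estimates the Jacobian by central differences. All function names and parameter values are assumptions made for illustration.

```python
import numpy as np

def spectrum(k, masses):
    """Nonzero squared frequencies of a 3-mass, 2-spring linear chain."""
    k1, k2 = k
    K = np.array([[k1, -k1, 0.0],
                  [-k1, k1 + k2, -k2],
                  [0.0, -k2, k2]])
    inv_sqrt_m = 1.0 / np.sqrt(np.asarray(masses, dtype=float))
    H = K * np.outer(inv_sqrt_m, inv_sqrt_m)   # mass-weighted Hessian
    return np.linalg.eigvalsh(H)[1:]           # drop the zero translation mode

def jacobian_rank(k, masses, h=1e-6, tol=1e-6):
    """Numerical rank of d(spectrum)/d(force constants) at the point k."""
    k = np.asarray(k, dtype=float)
    cols = []
    for i in range(k.size):
        step = np.zeros_like(k)
        step[i] = h
        cols.append((spectrum(k + step, masses)
                     - spectrum(k - step, masses)) / (2.0 * h))
    J = np.column_stack(cols)
    return int(np.linalg.matrix_rank(J, tol=tol))

rank_generic = jacobian_rank([1.0, 2.0], masses=[1.0, 1.0, 1.0])
rank_symmetric = jacobian_rank([1.5, 1.5], masses=[1.0, 1.0, 1.0])
rank_broken = jacobian_rank([1.5, 1.5], masses=[1.0, 1.0, 2.0])
print(rank_generic, rank_symmetric, rank_broken)
```

In this toy model the Jacobian has full rank at a generic point, drops rank exactly where the two springs are equivalent under the chain's mirror symmetry, and recovers full rank once unequal end masses break that symmetry. Even at generic points, swapping the two springs of the symmetric chain leaves the spectrum unchanged, which is the toy analogue of recovering structure only up to symmetry equivalence.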

The Architecture: Spektron

The theory says what is achievable. The model is designed to get there. Spektron is a CNN-Transformer encoder with a Variational Information Bottleneck (VIB) that splits the latent space into a chemistry component and an instrument component:

$\mathcal{L}_{\text{VIB}} = \mathbb{E}_{q(z|x)}\!\left[-\log p(y|z)\right] + \beta \, D_{\text{KL}}\!\left(q(z|x) \| p(z)\right)$

The latent vector splits into z_chem (128 dimensions, transferable chemistry) and z_inst (64 dimensions, instrument artifacts). At transfer time, z_inst is discarded — only the chemistry survives.
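A minimal NumPy sketch of the VIB objective and the latent split follows. It assumes a diagonal-Gaussian posterior $q(z|x)$ and a standard-normal prior $p(z)$, for which the KL term has a closed form. The value of `beta`, the ordering of the two latent blocks, and the function names are assumptions for illustration; the post does not specify them.

```python
import numpy as np

D_CHEM, D_INST = 128, 64   # latent split sizes quoted in the text

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), one value per sample."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)

def vib_loss(nll, mu, logvar, beta=1e-3):
    """Batch mean of: negative log-likelihood + beta * KL."""
    return float(np.mean(nll + beta * kl_to_standard_normal(mu, logvar)))

def split_latent(z):
    """Assumed layout: chemistry block first, instrument block second."""
    return z[..., :D_CHEM], z[..., D_CHEM:D_CHEM + D_INST]

rng = np.random.default_rng(0)
mu = rng.normal(size=(4, D_CHEM + D_INST))      # stand-in encoder means
logvar = np.zeros_like(mu)                      # stand-in encoder log-variances
z = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)  # reparameterization
z_chem, z_inst = split_latent(z)

nll = np.ones(4)   # stand-in for -log p(y|z) from a decoder
loss = vib_loss(nll, mu, logvar)
print(z_chem.shape, z_inst.shape, loss)
```

At transfer time only `z_chem` would be passed downstream. Raising `beta` tightens the bottleneck at the cost of predictive fit, which is the usual trade-off in this family of objectives.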

Key Design Choice

A 1D CNN tokenizer before the Transformer gives 8–10% accuracy gains over raw patch tokenization on spectral data. Vibrational peaks are sharp, narrow features — convolutional kernels capture this local structure before attention handles global context. This is the single largest architectural improvement in ablation studies.
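The difference between the two tokenization schemes can be sketched in a few lines of NumPy. This is an illustration of the idea only: the kernel size, stride, model width, and random weights are placeholders invented for this example, not the Spektron configuration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def patch_tokenize(spectrum, patch_size):
    """Raw patch tokens: non-overlapping chunks of the spectrum."""
    n = spectrum.size // patch_size
    return spectrum[:n * patch_size].reshape(n, patch_size)

def conv_tokenize(spectrum, kernels, stride):
    """Strided 1D convolution: (L,) spectrum -> (n_tokens, d_model) tokens."""
    kernel_size = kernels.shape[0]
    # Overlapping windows, one every `stride` channels.
    windows = sliding_window_view(spectrum, kernel_size)[::stride]
    return np.maximum(windows @ kernels, 0.0)   # linear filters + ReLU

rng = np.random.default_rng(0)
spectrum = rng.random(700)                 # e.g. one 700-channel spectrum
kernels = rng.normal(size=(16, 32))        # hypothetical: 32 filters of width 16

patch_tokens = patch_tokenize(spectrum, patch_size=16)
conv_tokens = conv_tokenize(spectrum, kernels, stride=8)
print(patch_tokens.shape, conv_tokens.shape)
```

Because the convolutional windows overlap, a narrow peak that straddles a patch boundary still lands whole inside at least one window, which is one plausible reason the learned filters help on sharp spectral features.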

[Listing: architecture.py]

Calibration Transfer

Calibration transfer is the practical test case. A model trained on spectra from instrument A typically fails on instrument B, because different detectors, optical paths, and lamp aging all shift the spectral shape. Current approaches such as piecewise direct standardization (PDS) and slope/bias correction (SBC) require 25+ paired transfer samples. The VIB architecture targets ≤10 by learning instrument-invariant representations during pretraining.

The transfer objective aligns latent distributions across instruments using Sinkhorn-based optimal transport:

$\mathcal{L}_{\text{OT}} = W_\epsilon\!\left( q(z_{\text{chem}} | \mathcal{D}_A), \, q(z_{\text{chem}} | \mathcal{D}_B) \right)$

Combined with test-time training — running a few self-supervised gradient steps at inference on the new instrument — this enables adaptation without labeled target data.
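The Sinkhorn-based alignment term can be sketched with plain NumPy. The version below treats two batches of latent vectors as uniformly weighted point clouds, uses a squared Euclidean ground cost, and returns the transport cost under the entropic plan, which is one common way to evaluate $W_\epsilon$. The function name, the choice to scale `eps` by the largest cost, and the iteration count are assumptions for this example.

```python
import numpy as np

def sinkhorn_plan(X, Y, eps=0.1, n_iter=200):
    """Entropic optimal transport between two uniformly weighted point clouds."""
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared Euclidean cost
    K = np.exp(-C / (eps * C.max()))   # eps is relative to the largest cost
    a = np.full(X.shape[0], 1.0 / X.shape[0])
    b = np.full(Y.shape[0], 1.0 / Y.shape[0])
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)        # match row marginals
        v = b / (K.T @ u)      # match column marginals
    P = u[:, None] * K * v[None, :]
    return P, float((P * C).sum())

rng = np.random.default_rng(0)
z_A = rng.normal(size=(32, 4))      # stand-in for z_chem on instrument A
z_B_same = z_A.copy()               # instrument B with no shift
z_B_shift = z_A + 3.0               # instrument B with a constant offset

plan_same, cost_same = sinkhorn_plan(z_A, z_B_same)
plan_shift, cost_shift = sinkhorn_plan(z_A, z_B_shift)
print(cost_same, cost_shift)
```

The cost grows as the two latent distributions drift apart, so minimizing it pulls the instrument B embeddings toward those of instrument A. In a training loop the same iterations would be written in an autodiff framework so that gradients flow back to the encoder.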

[Figure: calibration transfer, traditional (PDS) vs. Spektron (VIB + OT)]

Benchmark Target

The target is R² > 0.952 on corn moisture prediction (beating LoRA-CT) with ≤10 transfer samples across three NIR instruments (m5, mp5, mp6). The corn dataset has 80 samples × 3 instruments × 700 channels, making it a small but well-characterized benchmark where calibration transfer methods are directly comparable.

Current Status

[Listing: project_status.sh]

The theoretical framework is complete. The model is pretraining on QM9S (130K molecules, computed IR + Raman + UV at B3LYP/def2-TZVP) and ChEMBL (220K experimental spectra). Next: symmetry-stratified evaluation to test whether empirical accuracy tracks R(G, N) as the theory predicts. Details on the theory are in the companion post on spectral identifiability.

Companion post: Spectral Identifiability Theory — formal treatment of the group-theoretic constraints
SSM deep dive: State Space Models for Spectroscopy — why the CNN + D-LinOSS hybrid works
ML challenges: Why Spectra Are Harder Than Images — the broader constraints shaping this architecture
Research paper: Hybrid SSA Spectroscopy — the Spektron architecture paper
Research paper: Spectral Identifiability — information-theoretic limits of spectroscopic identification
Project: Spektron — the foundation model implementation
Digital twins: Neural ODEs for Reactor Modeling — parallel problem: learning dynamics from data with physics constraints