State Space Models for Spectroscopy: Why Sequence Models Beat CNNs
A vibrational spectrum is conventionally treated as a fixed-length vector — 3501 intensity values spanning 500 to 4000 cm⁻¹. You feed it into a 1D CNN, extract features, and classify. This works. It also misses something fundamental.
A spectrum is a sequence. The O-H stretch at 3400 cm⁻¹ is physically correlated with the O-H bend at 1640 cm⁻¹ — they’re different vibrations of the same bond. The C=O stretch at 1720 cm⁻¹ shifts when a neighboring C-H appears at 2950 cm⁻¹, because the bonds share electron density. These correlations span thousands of wavenumbers — far beyond the receptive field of any practical CNN.
State space models (SSMs) process sequences with linear-time complexity while maintaining a compressed memory of the entire history. Applied to spectra, this means the model at wavenumber 3400 cm⁻¹ already “remembers” what it saw at 500 cm⁻¹. No skip connections, no attention, no quadratic cost.
The Receptive Field Problem
A 1D CNN with kernel size k and L layers has an effective receptive field of L(k − 1) + 1 points. For a typical architecture — say k = 7 and L = 6 — that’s 37 points, or about 37 cm⁻¹ at 1 cm⁻¹ sampling. The O-H stretch and O-H bend are separated by 1760 cm⁻¹.
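A quick sanity check of the arithmetic (a minimal sketch; `receptive_field` is an illustrative helper, not part of any library):

```python
def receptive_field(kernel_size: int, n_layers: int) -> int:
    """Effective receptive field of a stack of 1D convolutions: L*(k-1) + 1."""
    return n_layers * (kernel_size - 1) + 1

# Six layers of kernel-size-7 convolutions cover 37 points (~37 cm^-1 at
# 1 cm^-1 sampling) -- nowhere near the 1760 cm^-1 O-H stretch/bend gap.
print(receptive_field(7, 6))
```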
You can increase the CNN receptive field with dilated convolutions or deeper networks, but both have costs. Dilated convolutions create “gridding artifacts” — they sample the input at regular intervals and miss features between the dilation gaps. Deeper networks require more parameters and are harder to train.
Transformers solve the receptive field problem completely — full attention connects every point to every other point. But attention is O(N²) in sequence length. For a 3501-point spectrum, that’s about 12 million attention scores per layer. It works, but it’s expensive, and the cost grows quadratically if you increase spectral resolution.
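To put numbers on the scaling (illustrative arithmetic only):

```python
N = 3501                        # points in the spectrum
attention_scores = N * N        # one score per pair of points, per layer
ssm_updates = N                 # one O(1) state update per point

print(attention_scores)                  # the ~12 million scores in the text
print(attention_scores // ssm_updates)   # attention does ~3500x more pairwise work
```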
State Space Models: The Third Option
An SSM processes a sequence x₁, x₂, …, xₜ by maintaining a hidden state hₜ that evolves according to a linear dynamical system:

hₜ = A hₜ₋₁ + B xₜ
yₜ = C hₜ
The matrix A is the state transition — it determines how memory decays and which frequencies are preserved. B controls how new input enters the state. C reads from the state to produce output. The hidden state hₜ is a compressed representation of the entire history x₁, …, xₜ.
The critical innovation in S4 (“Efficiently Modeling Long Sequences with Structured State Spaces,” Gu et al. 2022) is the HiPPO initialization of the A matrix. Instead of random initialization, A is set to optimally compress the history under a specific measure — retaining long-range dependencies that a randomly initialized system would forget.
The linearity of the state transition is not a limitation — it’s a feature. The recurrence can be unrolled into a convolution, enabling parallel computation during training. At inference, it reverts to a recurrence, processing each new point in O(1) time. This dual interpretation — convolution for training, recurrence for inference — gives SSMs the best of both worlds.
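The equivalence is easy to verify numerically. A minimal numpy sketch with a random stable diagonal system (illustrative, not a trained model): the recurrent output matches a convolution with kernel Kⱼ = C Aʲ B.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 4, 64                                 # state size, sequence length
A = np.diag(rng.uniform(0.5, 0.9, n))        # stable diagonal transition
B = rng.normal(size=(n, 1))
C = rng.normal(size=(1, n))
x = rng.normal(size=T)

# Recurrent form: h_t = A h_{t-1} + B x_t,  y_t = C h_t
h = np.zeros((n, 1))
y_rec = np.empty(T)
for t in range(T):
    h = A @ h + B * x[t]
    y_rec[t] = (C @ h).item()

# Convolutional form: y = K * x with kernel K_j = C A^j B
K = np.array([(C @ np.linalg.matrix_power(A, j) @ B).item() for j in range(T)])
y_conv = np.array([sum(K[j] * x[t - j] for j in range(t + 1)) for t in range(T)])

print(np.allclose(y_rec, y_conv))   # the two forms agree
```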
From S4 to D-LinOSS
The SSM landscape has evolved rapidly. Three generations matter for spectral applications:
S4 (2022) — The original structured state space model. Uses a structured parameterization of A (normal plus low-rank, diagonalizable for efficiency). Showed that SSMs could match Transformers on long-range benchmarks (Path-X, ListOps) while being much faster.
Mamba (2023) — Made B, C, and the step size Δ input-dependent (selective state spaces). The effective transition now depends on the input, allowing the model to selectively remember or forget information. This broke the convolution interpretation but enabled much better performance on language tasks.
D-LinOSS (2024) — Damped Linear Oscillatory State-Space model. Returns to a diagonal state transition but with a learnable discretization step that adapts to the input. Combines S4’s parallelism with Mamba’s input-dependent behavior.
In our ablations, the pure SSM outperforms the pure CNN by 9 points — the global receptive field matters. But the hybrid architecture (CNN tokenizer + SSM backbone) beats both, because peak shapes are inherently local features that CNNs capture better than SSMs, while cross-peak correlations are global features that SSMs capture better than CNNs.
Why Spectra Are Ideal for SSMs
SSMs excel on sequences with specific properties — and vibrational spectra have all of them:
1. Long-range dependencies are physically meaningful. The correlation between the O-H stretch and the O-H bend is not a statistical artifact — it’s a consequence of shared atomic displacement vectors. SSMs that model this correlation produce better molecular embeddings.
2. The sequence has a natural ordering. Wavenumber is a physical axis with units. Unlike token sequences in language (where position is arbitrary), the wavenumber axis has a metric structure. Adjacent points are more correlated than distant points, but distant correlations also exist.
3. Resolution can vary. Some spectral regions are information-dense (the fingerprint region, 500-1500 cm⁻¹) and others are sparse (2000-2500 cm⁻¹ for most organic molecules). An input-dependent SSM can allocate more state capacity to information-dense regions — something fixed architectures cannot do.
When Mamba or D-LinOSS processes an IR spectrum, the input-dependent gating learns to “pay attention” at peaks and “skip” over baselines. This is analogous to how a spectroscopist reads a spectrum: scan quickly over featureless regions, slow down at peaks, and relate distant peaks to each other. The SSM learns this reading strategy from data.
4. Sequence length is moderate. At 3501 points, a spectrum is long enough that Transformers become expensive but short enough that SSMs are extremely efficient. The sweet spot for SSMs is sequences of length 1K-100K — exactly where spectral data lives.
The Hybrid Architecture in Spektron
Spektron uses a CNN tokenizer → D-LinOSS backbone architecture. The CNN converts the raw 3501-point spectrum into 128 tokens, each representing a ~27 cm⁻¹ window. The D-LinOSS layers then process these tokens as a sequence, building global representations.
The CNN tokenizer provides two things the SSM needs: local feature extraction (peak shapes, shoulders, multiplets) and dimensionality reduction (3501 → 128 tokens). The D-LinOSS backbone then relates these local features across the full spectral range, producing representations where the O-H token “knows about” the C=O token 1000 cm⁻¹ away.
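The data flow can be sketched in numpy (a structural sketch only — the shapes follow the text, but the real tokenizer is a learned multi-layer CNN and the real backbone is D-LinOSS, not the plain linear scan used here):

```python
import numpy as np

rng = np.random.default_rng(0)
SPEC_LEN, N_TOKENS, WINDOW, D_MODEL = 3501, 128, 54, 64

def tokenize(spectrum, W):
    """CNN-tokenizer stand-in: overlapping windows projected by a filter bank."""
    starts = np.linspace(0, SPEC_LEN - WINDOW, N_TOKENS).astype(int)
    windows = np.stack([spectrum[s:s + WINDOW] for s in starts])  # (128, 54)
    return windows @ W                                            # (128, 64)

def ssm_backbone(tokens, a):
    """Diagonal linear SSM stand-in: each output token sees all earlier tokens."""
    out, h = np.empty_like(tokens), np.zeros(tokens.shape[1])
    for t in range(tokens.shape[0]):
        h = a * h + tokens[t]          # per-channel state update, O(1) per token
        out[t] = h
    return out

spectrum = rng.normal(size=SPEC_LEN)
W = rng.normal(size=(WINDOW, D_MODEL)) / WINDOW
a = rng.uniform(0.8, 0.99, size=D_MODEL)    # per-channel memory decay

reps = ssm_backbone(tokenize(spectrum, W), a)
print(reps.shape)   # 128 tokens, each informed by everything before it
```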
Ablation: CNN Tokenizer Matters
The CNN tokenizer is not optional. Replacing it with simple patch tokenization (chop the spectrum into 128 non-overlapping windows) drops accuracy by 8-10%.
The reason: vibrational peaks are sharp, asymmetric features that don’t align with fixed patch boundaries. A peak at the edge of a patch gets split between two tokens, destroying its shape information. The CNN’s overlapping receptive fields and learned filters capture peak shapes regardless of alignment.
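The failure mode is easy to demonstrate with a synthetic peak (illustrative numbers; patch width 27 on a 3456-point axis for clean arithmetic):

```python
import numpy as np

PATCH, N_PATCHES = 27, 128
x = np.arange(PATCH * N_PATCHES, dtype=float)      # 3456-point axis
center = 50 * PATCH - 0.5                          # peak centred on a boundary
peak = np.exp(-0.5 * ((x - center) / 4.0) ** 2)    # sharp synthetic peak

mass = peak.reshape(N_PATCHES, PATCH).sum(axis=1) / peak.sum()
# Non-overlapping patches split the peak roughly 50/50 between two tokens,
# so neither token sees the full peak shape:
print(round(float(mass[49]), 2), round(float(mass[50]), 2))
```

An overlapping (strided) window at the same location would contain essentially the whole peak, which is why the CNN tokenizer is robust to alignment.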
For 1D signal data with sharp local features and long-range correlations, the optimal architecture is a hybrid: CNN for local feature extraction → SSM for global context. This pattern applies beyond spectroscopy to any signal where local structure and global dependencies both matter — ECG, seismology, audio, time series.
Practical Considerations
Training SSMs on spectral data has a few gotchas:
Numerical stability. D-LinOSS uses complex-valued state matrices that can produce extreme values (±200K) before the GLU gate. Under mixed-precision training (AMP), these overflow float16. The fix: force the SSM layers to run in float32 while allowing the rest of the model to use float16.
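The failure mode in numbers (numpy stand-in for the dtype behavior; in PyTorch the usual fix is to run the SSM block with autocast disabled and float32 parameters — details depend on your training setup):

```python
import numpy as np

pre_gate = np.array([2.0e5, -2.0e5])     # extreme pre-GLU activations

# float16's largest finite value is ~65504, so +/-200K overflows to +/-inf;
# float32 represents the same values without trouble.
print(pre_gate.astype(np.float16))
print(pre_gate.astype(np.float32))
```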
Initialization. The HiPPO initialization of A assumes the input is a continuous signal sampled uniformly. Spectra satisfy this — the wavenumber axis is uniformly sampled. But if you resample to non-uniform spacing (e.g., to compress baseline regions), you need to adjust the discretization step accordingly.
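One way to make that adjustment (an illustrative sketch — the symbol names are mine, and D-LinOSS's exact discretization differs): scale the per-step Δ by the local grid spacing, so a mode decays by the same amount per cm⁻¹ regardless of how the axis was resampled.

```python
import numpy as np

# Hypothetical resampled axis: the 2000-2500 cm^-1 baseline region kept at
# 5 cm^-1 spacing, everything else at 1 cm^-1.
wn = np.concatenate([np.arange(500.0, 2000.0, 1.0),
                     np.arange(2000.0, 2500.0, 5.0),
                     np.arange(2500.0, 4001.0, 1.0)])

base_delta = 1e-2                 # discretization step per cm^-1 (illustrative)
dt = base_delta * np.diff(wn)     # per-step Delta follows the local spacing
lam = -0.5                        # continuous-time decay rate of one SSM mode
decay = np.exp(lam * dt)          # discrete per-step decay factor

# Coarse steps decay more per step, so decay per cm^-1 stays uniform:
print(decay.min(), decay.max())
```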
State dimension. The hidden state dimension n controls how much history the SSM can remember. For 128-token spectral sequences, n = 128 is sufficient — the state has as many dimensions as there are tokens. Increasing n beyond this shows diminishing returns.
The Bigger Picture
SSMs represent a shift in how we think about spectral data. The traditional view — a spectrum is a vector of features — leads to architectures that treat each wavenumber independently. The sequence view — a spectrum is a signal unfolding along the wavenumber axis — leads to architectures that model dependencies between wavenumbers.
This distinction matters because the physics is sequential. The wavenumber axis is not arbitrary — it corresponds to energy, and physical correlations between modes follow from shared molecular structure. A model that respects this sequential structure learns more from less data.
The practical upshot: on QM9S with 130K spectra, a CNN + D-LinOSS hybrid achieves 84.2% identification accuracy with 12.4M parameters, matching a CNN + Transformer at 83.7% while running at linear cost. On larger datasets or higher-resolution spectra, the linear scaling advantage will compound.
Related
- Project: Spektron — the spectral foundation model using CNN + D-LinOSS
- Research: Hybrid SSM Spectroscopy — the paper describing Spektron’s architecture
- Theory: Spectral Identifiability — why combined IR + Raman provides enough information for identification
- Comparison: Why Spectra Are Harder Than Images — the broader ML challenges of spectral data
- Preprocessing: SpectraKit — preprocessing pipeline that feeds the SSM