spectroscopy machine-learning deep-learning research

Why Vibrational Spectra Are Harder Than Images

· 12 min read

A vibrational spectrum is a 1D signal — intensity as a function of wavenumber. An image is a 2D signal — pixel intensity as a function of spatial coordinates. Both are arrays of floats. Both feed into neural networks. The resemblance ends there.

Every technique that makes deep learning work on images — transfer learning from ImageNet, data augmentation by flipping and cropping, batch normalization, large-scale pretraining — either fails outright or requires non-obvious modifications when applied to spectral data. This post catalogs the differences and explains why spectral ML is a distinct problem domain.

The Shape of the Signal

An image is spatially smooth. Adjacent pixels are highly correlated. Edges are rare events — most of an image consists of gradual gradients. This smoothness is why convolutional filters work: a 3×3 kernel captures most local structure.

A vibrational spectrum is the opposite. Peaks are sharp, narrow, and information-dense. A single C-H stretching peak at 2900 cm⁻¹ might span 20 wavenumbers out of a 3500-wavenumber range. The peak position, width, and intensity each encode different physical information. Between peaks, the signal is nearly zero — featureless baseline.
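
To make this concrete, here is a toy sketch (synthetic Lorentzian bands and an illustrative "near-zero" threshold, not real data) showing how little of the axis carries signal:

```python
import numpy as np

# Toy spectrum: four narrow Lorentzian bands on a 3500-point wavenumber grid
wn = np.arange(400, 3900)  # cm^-1-like axis

def lorentzian(x, center, hwhm):
    return 1.0 / (1.0 + ((x - center) / hwhm) ** 2)

spectrum = sum(lorentzian(wn, c, 10) for c in [700, 1450, 1700, 2900])

# Information is concentrated: most of the axis is near-zero baseline
baseline_fraction = np.mean(spectrum < 0.05)
print(f"points near baseline: {baseline_fraction:.0%}")
```

Roughly nine out of ten points are featureless baseline; nearly all the chemistry lives in a handful of narrow windows.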

This matters for architecture choice. In vision, a 3×3 conv kernel captures a meaningful spatial neighborhood. In spectroscopy, a kernel needs to span the full width of a peak — typically 15-40 points — to capture its shape. Too narrow and the kernel sees only the slope of a peak; too wide and it blurs adjacent peaks that encode different functional groups.

No Pretrained Backbones

ImageNet pretraining is the foundation of modern computer vision. A ResNet trained on 1.2M labeled images learns low-level features (edges, textures) in early layers and high-level features (objects, scenes) in later layers. These features transfer to medical imaging, satellite imagery, and manufacturing inspection with minimal fine-tuning.

There is no spectral equivalent of ImageNet.

The reason is data scarcity. The largest public spectral database — SDBS from AIST — contains about 35,000 IR spectra. QM9S has 130K computed spectra but only for molecules with ≤9 heavy atoms. Compare this to ImageNet’s 14 million images or Common Crawl’s trillions of tokens. There simply isn’t enough diverse spectral data to learn general-purpose features.
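
A back-of-the-envelope comparison using the counts quoted above:

```python
# Rough scale of public pretraining corpora (counts as quoted in the text)
datasets = {
    "SDBS (IR spectra)": 35_000,
    "QM9S (computed spectra)": 130_000,
    "ImageNet (labeled images)": 14_000_000,
}
for name, n in datasets.items():
    ratio = n / datasets["SDBS (IR spectra)"]
    print(f"{name:26s} {n:>12,d}  ({ratio:,.0f}x SDBS)")
```

ImageNet alone is 400x larger than the largest public IR database — before even counting web-scale image and text corpora.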

This means every spectral ML project starts cold. No fine-tuning, no transfer learning, no “just use a ResNet backbone.” The features must be learned from the task-specific dataset, which is rarely larger than 10K-100K samples.

Why This Motivates Foundation Models

This is exactly why Spektron exists. By pretraining on QM9S (130K computed spectra) + ChEMBL (220K experimental spectra), the goal is to build the first general-purpose spectral backbone — a model that learns transferable features like peak shapes, functional group signatures, and spectral fingerprints that can be fine-tuned for downstream tasks.

Augmentation Is Severely Constrained

In computer vision, data augmentation is effectively free. Horizontal flips, random crops, color jitter, cutout — these transformations preserve the semantic content of an image while expanding the training set by 10-100x.

Spectral augmentation is physically constrained. Most transformations that are harmless for images are destructive for spectra:

Augmentation constraints:

  Vision (safe)        Spectroscopy (dangerous)
  horizontal flip      reverses the wavenumber axis
  random crop          removes peaks, changing chemical identity
  resize / rescale     shifts peak positions and assignments

Flipping a spectrum reverses the wavenumber axis — the C-H stretch at 2900 cm⁻¹ moves to 600 cm⁻¹, which is a completely different physical regime. Cropping removes peaks, changing the chemical identity. Scaling the x-axis shifts peak positions, which changes functional group assignments.

The only safe augmentations are additive noise (simulates detector noise) and small wavenumber shifts (simulates calibration variation). This gives maybe a 2-3x effective dataset expansion — not the 10-100x that vision gets.
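
A minimal sketch of these two safe augmentations, assuming the spectrum is a NumPy array (the circular wrap in np.roll at the edges is a simplification):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_spectrum(y, noise_sigma=0.01, max_shift=3):
    """The two physically safe augmentations: a small wavenumber shift
    (simulates calibration variation) plus additive detector noise."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    y_aug = np.roll(y, shift)  # circular wrap at the edges is a simplification
    return y_aug + rng.normal(0.0, noise_sigma, size=y.shape)

spectrum = np.exp(-((np.arange(500) - 250) / 8.0) ** 2)  # toy Gaussian band
augmented = augment_spectrum(spectrum)
print(augmented.shape)
```

Note how constrained this is: the peak may move by at most a few points and gain a little noise — nothing like the flips, crops, and color jitter available in vision.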

Instrument Variance

Two cameras photographing the same object produce nearly identical images. Two spectrometers measuring the same sample produce systematically different spectra.

The differences are not random noise. They are structured biases caused by:

  • Detector response curves — different detector materials (MCT vs DTGS for IR) have different sensitivity profiles
  • Optical path geometry — beam splitter efficiency, mirror alignment, and sample cell geometry vary between instruments
  • Source aging — lamp intensity degrades over time, shifting the baseline
  • Resolution and sampling — different instruments digitize at different wavenumber intervals
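
These structured effects can be simulated with a toy instrument model (the response curves, baseline slopes, and noise level below are illustrative parameters, not a calibrated physical model):

```python
import numpy as np

rng = np.random.default_rng(0)
wn = np.linspace(400, 4000, 1800)  # shared wavenumber grid

# "True" sample spectrum: two Gaussian bands
true = np.exp(-((wn - 1700) / 15) ** 2) + 0.6 * np.exp(-((wn - 2900) / 20) ** 2)

def measure(spectrum, response_center, baseline_slope):
    """Toy instrument: detector response curve times the spectrum, plus a
    sloped baseline and detector noise."""
    response = np.exp(-((wn - response_center) / 2500) ** 2)
    baseline = baseline_slope * (wn - wn.min()) / (wn.max() - wn.min())
    return response * spectrum + baseline + rng.normal(0, 0.005, wn.shape)

inst_a = measure(true, response_center=1500, baseline_slope=0.02)
inst_b = measure(true, response_center=3000, baseline_slope=0.10)

# Same chemistry, systematically different measurements
print(f"mean |A - B| = {np.mean(np.abs(inst_a - inst_b)):.3f}")
```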

This is the calibration transfer problem. A model trained on spectra from instrument A degrades dramatically on instrument B — not because the chemistry changed, but because the instrument’s signature shifted the spectral shape. In vision terms, it would be like a model trained on Canon photos failing on Nikon photos of the same scene.

Traditional solutions (Piecewise Direct Standardization, Shenk-Westerhaus) require 25+ paired samples measured on both instruments. Getting these samples is expensive and logistically painful. This is one of the central problems that Spektron’s VIB architecture is designed to solve — by learning instrument-invariant representations during pretraining.

Physics Constrains the Loss Function

In vision, the loss function is straightforward: cross-entropy for classification, MSE for regression. The model learns whatever features minimize the loss. There are no physical laws constraining what a cat looks like.

Spectral data obeys conservation laws. Total spectral intensity is related to the number of oscillators. Peak positions are determined by bond force constants. Relative intensities follow selection rules from group theory. A model that violates these constraints is producing physically impossible outputs — even if the loss is low.

$$\sum_i A_i = \text{const} \quad \text{(oscillator strength sum rule)}$$

$$\nu_i = \frac{1}{2\pi}\sqrt{\frac{k_i}{\mu_i}} \quad \text{(harmonic frequency-force constant relation)}$$

This means spectral ML benefits from physics-informed losses: penalty terms that enforce conservation laws, symmetry constraints, and thermodynamic bounds. These terms don’t just regularize the model — they encode domain knowledge that the model would otherwise need thousands of examples to learn.
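
As an illustrative example (not Spektron's actual loss), a sum-rule penalty term can be as simple as:

```python
import numpy as np

def sum_rule_penalty(predicted, reference_intensity):
    """Penalize deviation of total integrated intensity from a conserved
    reference value (oscillator strength sum rule). Illustrative only."""
    return (predicted.sum() - reference_intensity) ** 2

conserving = np.full(1000, 0.1)   # total intensity ~ 100: near-zero penalty
violating = np.full(1000, 0.12)   # total intensity ~ 120: penalized
print(sum_rule_penalty(conserving, 100.0), sum_rule_penalty(violating, 100.0))
```

Added to the reconstruction loss with a weighting coefficient, a term like this steers the model away from outputs that gain or lose total intensity for free.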

Physics-Informed Training

In Spektron’s training pipeline, the total loss combines reconstruction quality with physics constraints:

$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \alpha \mathcal{L}_{\text{physics}} + \beta \mathcal{L}_{\text{VIB}}$$

The physics loss penalizes violations of the oscillator strength sum rule and enforces smooth baseline behavior. Without it, the model learns to reconstruct spectra accurately but produces physically inconsistent latent representations.

The Dimensionality Mismatch

ImageNet classification has 1,000 classes with 1.2 million images — roughly 1,200 images per class. This is a well-conditioned learning problem.

Molecular identification from spectra has, in principle, millions of classes (one per molecule) with perhaps 1-10 spectra each. Most molecules have been measured exactly once. Some have never been measured at all.
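
A toy long-tailed distribution (the Zipf exponent is chosen for illustration, not fit to any real database) makes the contrast with ImageNet's ~1,200 images per class vivid:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy long tail: spectra-per-molecule counts drawn from a heavy-tailed
# Zipf distribution (exponent is illustrative, not fit to data)
counts = rng.zipf(2.5, size=100_000)

print(f"molecules with exactly one spectrum: {np.mean(counts == 1):.0%}")
```

In a distribution like this, the overwhelming majority of classes are singletons — exactly the regime where per-class softmax classification breaks down.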

This flips the standard ML paradigm. In vision, you have too many images and not enough compute. In spectroscopy, you have too few spectra and need to extract maximum information from each one. Techniques like metric learning, contrastive pretraining, and retrieval-based decoding become essential — not because they’re trendy, but because classification simply doesn’t work with one sample per class.
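
A minimal sketch of retrieval-based identification, assuming embeddings already exist (random unit vectors here stand in for the output of a trained spectral encoder):

```python
import numpy as np

rng = np.random.default_rng(0)

# Library of known molecules: one unit-norm embedding each (random vectors
# stand in for a trained encoder's outputs)
library = rng.normal(size=(1000, 64))
library /= np.linalg.norm(library, axis=1, keepdims=True)

# A noisy re-measurement of molecule 123
query = library[123] + 0.05 * rng.normal(size=64)
query /= np.linalg.norm(query)

# Identification = nearest neighbor by cosine similarity, not a softmax
similarities = library @ query
print(int(similarities.argmax()))
```

Adding a new molecule means appending one row to the library — no retraining, no fixed output dimension.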

What Actually Works

Given these constraints, the recipe that works for spectral ML looks very different from the standard vision pipeline:

  1. 1D CNN tokenizers with wide kernels (15-41 points) to capture peak shapes — not 3×3 convolutions
  2. Attention mechanisms that relate peaks across the full spectral range — not local receptive fields
  3. Metric learning with retrieval decoding — not softmax classification
  4. Physics-informed losses that encode conservation laws — not pure reconstruction
  5. Domain-specific augmentation limited to noise and small shifts — not aggressive transforms
  6. Instrument disentanglement in the latent space — not domain adaptation as an afterthought
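
Item 1 can be sketched with a toy wide-kernel tokenizer (random filters stand in for learned weights; sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_tokenizer(spectrum, n_filters=8, kernel_size=31, stride=16):
    """Toy wide-kernel tokenizer: each token summarizes a kernel_size window,
    wide enough to span a full peak. Random filters stand in for learned ones."""
    kernels = rng.normal(size=(n_filters, kernel_size)) / np.sqrt(kernel_size)
    starts = range(0, len(spectrum) - kernel_size + 1, stride)
    return np.array([[k @ spectrum[s:s + kernel_size] for k in kernels]
                     for s in starts])  # shape: (n_tokens, n_filters)

spec = np.exp(-((np.arange(1024) - 500) / 12.0) ** 2)  # toy single-band spectrum
tokens = conv1d_tokenizer(spec)
print(tokens.shape)
```

Each 31-point window covers a full peak width, so a token can encode a peak's position, width, and intensity together — something a 3-point window structurally cannot do.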

The Takeaway

Spectral ML is not a special case of computer vision. It’s a different problem with different data characteristics, different constraints, and different solutions. Importing architectures and training recipes from vision without modification will produce models that underperform physics-aware, spectroscopy-specific approaches. The field needs its own foundation models, its own pretraining datasets, and its own evaluation protocols.

This is the perspective that guides the design of Spektron and SpectraKit: build tools specifically for spectral data, not adapted from other domains.

  • Foundation model: Spektron — the spectral foundation model designed around these constraints
  • Preprocessing: SpectraKit — functional preprocessing library for spectral data
  • Theory: Spectral Identifiability — group-theoretic limits on what spectra can reveal
  • Architecture: The Spectral Inverse Problem — how Spektron’s design addresses these challenges