
Masked Pretraining for Scientific Spectra: Lessons from Breaking BERT


The fundamental challenge of spectral machine learning is the data asymmetry. ImageNet has 1,200 labeled images per class. Spectroscopy has approximately one labeled spectrum per molecule — sometimes zero if the compound has never been synthesized. You cannot train a foundation model on a dataset where every class has a single example.

Self-supervised pretraining sidesteps the label bottleneck entirely. Instead of “given this spectrum, predict the molecule,” the model learns from a different signal: “given part of this spectrum, predict the rest.” No labels. No classification. Just structure — the statistical regularities that make spectra more than random noise. Masked pretraining is the simplest and most effective way to extract this structure, and adapting it from discrete text to continuous spectra turned out to be harder than expected.

From Tokens to Patches

BERT masks discrete tokens (words) and predicts them from context. Spectra are continuous 1D signals — there are no natural tokens. The solution is patching: divide the spectrum into contiguous wavenumber regions and treat each region as a token.

A 3,501-point IR spectrum split into 128 patches gives approximately 27 wavenumber points per patch. Each patch is embedded into a $d$-dimensional vector via a learned linear projection:

$$\mathbf{p}_i = \text{Embed}(s[i \cdot P : (i+1) \cdot P]) \in \mathbb{R}^d$$

where $P$ is the patch size and $s \in \mathbb{R}^{3501}$ is the raw spectrum. The patches play the role of BERT’s word tokens. Masking a patch means replacing its embedding with a learned mask vector before feeding it into the encoder.

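The patching step can be sketched in a few lines of NumPy. This is a minimal sketch, not the Spektron implementation: a random matrix stands in for the learned projection, and names like `patchify` are illustrative.

```python
import numpy as np

# Dimensions from the text: 3,501-point spectrum, 128 patches, d = 256.
N_POINTS, N_PATCHES, D_MODEL = 3501, 128, 256
PATCH_SIZE = N_POINTS // N_PATCHES  # 27 points per patch (128 * 27 = 3456)

def patchify(spectrum: np.ndarray) -> np.ndarray:
    """Split a (3501,) spectrum into (128, 27) contiguous patches.

    The 45 leftover points beyond 128 * 27 are dropped here; how the
    real pipeline handles the remainder is an assumption.
    """
    usable = spectrum[: N_PATCHES * PATCH_SIZE]
    return usable.reshape(N_PATCHES, PATCH_SIZE)

# Random matrix standing in for the learned projection Embed: R^27 -> R^d
rng = np.random.default_rng(0)
W = rng.standard_normal((PATCH_SIZE, D_MODEL)) / np.sqrt(PATCH_SIZE)

spectrum = rng.standard_normal(N_POINTS)  # stand-in for a normalized spectrum
patches = patchify(spectrum)              # shape (128, 27)
embeddings = patches @ W                  # shape (128, 256)
```
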
The connection to Masked Autoencoders (MAE) is direct: He et al. applied the same idea to image patches in 2022. Spectra have different properties than images — we will return to this — but the core mechanism is identical: mask some patches, predict them from context, and hope the representations learned in the process are useful for downstream tasks.

Patches vs. Points. Masking individual wavenumber points is too fine-grained. Spectra are locally smooth — the value at wavenumber $w_i$ is highly correlated with $w_{i-1}$ and $w_{i+1}$. The model can trivially interpolate single masked points from neighbors without learning any higher-level structure. Masking contiguous patches of ~27 points forces the model to reconstruct entire peak shapes from distant context — overtone correlations, combination band patterns, functional group fingerprints. This is the representation-building signal.

The Masking Strategy

Select a random subset of patches to mask. Three design choices matter:

Masking ratio. What fraction of patches to replace with the mask token. BERT uses 15% (conservative, designed for fine-tuning stability). MAE uses 75% (aggressive, works because images have high 2D spatial redundancy). For spectra, 30–40% works best. Higher than BERT because spectra have substantial local redundancy along the wavenumber axis. Lower than MAE because spectra are sparser than images — fewer peaks, more baseline — so masking too aggressively leaves insufficient context for reconstruction.

Mask token. A single learnable parameter $\mathbf{m} \in \mathbb{R}^d$ shared across all masked positions. This is the model’s way of saying “I don’t know what goes here.” The mask token participates in self-attention (or SSM processing), allowing information from visible patches to flow into masked positions through the backbone.

Where to apply the mask. This is the critical decision. The mask replaces the patch embedding before the encoder sees it:

$$\tilde{\mathbf{p}}_i = \begin{cases} \mathbf{m} & \text{if } i \in \mathcal{M} \\ \mathbf{p}_i & \text{otherwise} \end{cases}$$

where $\mathcal{M}$ is the random set of masked patch indices. This operation corrupts the encoder’s input — the model cannot see the ground truth at masked positions.

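
Putting the three choices together, a minimal sketch of the masking routine (assuming the 35% ratio used in the experiments below; `mask_token` stands in for the learned $\mathbf{m}$, and all names are illustrative):

```python
import numpy as np

N_PATCHES, D_MODEL, MASK_RATIO = 128, 256, 0.35

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((N_PATCHES, D_MODEL))  # patch embeddings p_i
mask_token = rng.standard_normal(D_MODEL)  # a learned parameter in practice

# Select a fresh random subset of patches to mask (resampled every epoch)
n_masked = int(round(MASK_RATIO * N_PATCHES))  # 45 of 128 patches
masked_idx = rng.choice(N_PATCHES, size=n_masked, replace=False)

# Critically: the replacement happens BEFORE the encoder sees the input
corrupted = embeddings.copy()
corrupted[masked_idx] = mask_token
```
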

The Architecture

The full pretraining pipeline:

  1. Raw spectrum — $s \in \mathbb{R}^{3501}$, an area-normalized IR or Raman spectrum
  2. Patch embedding — linear projection to $\{\mathbf{p}_i\}_{i=1}^{128}$, each $\mathbf{p}_i \in \mathbb{R}^{d}$
  3. Mask injection — replace $\mathbf{p}_i$ with $\mathbf{m}$ for $i \in \mathcal{M}$
  4. Positional encoding — add learnable position embeddings
  5. D-LinOSS backbone — 4 layers of Diagonal Linear Operator State Space blocks
  6. Reconstruction head — linear projection back to patch dimension $\mathbb{R}^{27}$
  7. Loss — MSE computed only on masked patches:

$$\mathcal{L}_{\text{MPM}} = \frac{1}{|\mathcal{M}|}\sum_{i \in \mathcal{M}} \left\| \hat{s}_i - s_i \right\|_2^2$$

The loss is computed exclusively on masked patches. Visible patches are not penalized — the model is free to represent them however it wants. This forces the backbone to build contextual representations at every position: the output at a masked position must encode the prediction, and this prediction can only come from attending to visible neighbors.
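
A concrete sketch of the masked-only loss, with NumPy arrays standing in for model outputs and a hypothetical `masked_mse` helper:

```python
import numpy as np

def masked_mse(pred: np.ndarray, target: np.ndarray, masked_idx) -> float:
    """Mean over masked patches of the squared L2 reconstruction error.

    pred, target: (n_patches, patch_size); masked_idx: the index set M.
    Visible patches never contribute to the loss.
    """
    diff = pred[masked_idx] - target[masked_idx]
    return float(np.mean(np.sum(diff ** 2, axis=-1)))

rng = np.random.default_rng(0)
target = rng.standard_normal((128, 27))
pred = target.copy()
pred[:10] += 1.0  # corrupt reconstructions of the first ten patches

masked_mse(pred, target, np.arange(10))      # ~27.0: each of 27 points off by 1
masked_mse(pred, target, np.arange(10, 20))  # 0.0: errors elsewhere are ignored
```
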


The Near-Identity Collapse

This is the most important section of this post. It describes the single most dangerous pitfall in adapting masked pretraining from text to continuous signals.

The temptation is to implement masking as a loss mask rather than an input mask. Instead of replacing masked patch embeddings with $\mathbf{m}$ before the encoder, you feed the full, unmasked spectrum through the encoder and simply compute the loss only on the masked positions:

```python
# The wrong way (loss-only masking)
embeddings = embed(full_spectrum)          # no masking!
outputs = encoder(embeddings)              # sees everything
reconstruction = decode(outputs)
loss = mse(reconstruction[mask], full_spectrum[mask])  # loss on masked only
```

This compiles. It runs. The loss drops beautifully — from 0.42 to 0.003 within 700 training steps. The training curve looks perfect. The model is completely useless.

What happened: without input masking, the encoder sees the ground truth at every position including the masked ones. The shortest path to zero reconstruction loss is the identity function — pass the input through unchanged. The latent dimension ($d = 256$) is large enough that the spectrum’s intrinsic dimensionality fits comfortably. The model learns to copy, not to understand.

The training metrics are deceptive. MSRP of 0.003 looks like remarkable reconstruction quality. But the model has learned nothing about molecular structure, peak correlations, or spectral physics. It has learned $f(x) \approx x$.

Input masking replaces masked patches with a learned [MASK] token before the encoder. The model must infer masked content from surrounding peaks.

With input masking, the encoder at masked positions sees $\mathbf{m}$ — a fixed, learned vector with no information about the local spectrum. The only way to reconstruct the masked patch is to infer it from surrounding context. This forces the model to learn:

  • Peak correlations: the O–H stretch at 3300 cm⁻¹ implies an O–H bend near 1400 cm⁻¹
  • Functional group patterns: C=O at 1720 cm⁻¹ with specific C–H neighbors constrains the carbonyl environment
  • Overtone relationships: fundamentals predict their overtones and combination bands at fixed frequency ratios
  • Baseline structure: smooth, globally constrained — trivially interpolated, freeing the model to focus on peaks

The Masking Principle. For masked pretraining to learn non-trivial representations, the mask must corrupt the encoder’s input, not just the loss computation. This is obvious in hindsight — BERT replaces masked tokens with [MASK] before feeding to the Transformer. But when adapting to continuous signals, it is tempting to mask only the loss, since “the model should figure out what to predict.” The model does figure it out: it predicts the identity.
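
The collapse is easy to reproduce in a toy setting: with loss-only masking the identity map already achieves zero loss, while input masking rules it out. A sketch, with zeroing standing in for the learned [MASK] token:

```python
import numpy as np

rng = np.random.default_rng(0)
spectrum = rng.standard_normal((128, 27))             # patched "spectrum"
masked_idx = rng.choice(128, size=45, replace=False)  # ~35% of patches

identity = lambda x: x  # the degenerate "encoder" the model collapses to

# Loss-only masking: the encoder sees the intact input, so copying wins
loss_only = np.mean(
    (identity(spectrum)[masked_idx] - spectrum[masked_idx]) ** 2
)

# Input masking: masked patches are corrupted before the encoder,
# so the identity map can no longer reach zero loss
corrupted = spectrum.copy()
corrupted[masked_idx] = 0.0
input_masked = np.mean(
    (identity(corrupted)[masked_idx] - spectrum[masked_idx]) ** 2
)

print(loss_only)     # 0.0: copying is a perfect "solution"
print(input_masked)  # nonzero: context is now required
```
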

Why This Works for Spectra

Masked pretraining works when the signal has structure that allows masked regions to be inferred from unmasked context. Spectra have this structure in abundance, through several distinct mechanisms:

Overtones and combination bands. The fundamental C–H stretch at 2900 cm⁻¹ has a first overtone near 5800 cm⁻¹ and combination bands at predictable positions determined by anharmonicity constants. Mask the overtone region; the fundamental constrains the reconstruction. This is a hard physical constraint — not a statistical correlation.

Functional group fingerprints. The carbonyl C=O stretch at 1720 cm⁻¹ almost always co-occurs with specific C–H stretching and bending patterns. The 1000–1300 cm⁻¹ “fingerprint region” contains C–O, C–N, and C–C stretches that are structurally coupled to peaks elsewhere. Mask the carbonyl; the surrounding environment predicts it.

Baseline physics. The spectral baseline is smooth and globally constrained by the instrument response function and sample scattering properties. Masked baseline regions are trivially interpolated from neighbors. This means the model quickly learns to separate baseline from peaks — exactly the right inductive bias for downstream tasks like peak detection and quantification.

Physical constraints. Spectral intensities are non-negative. Integrated band areas are proportional to transition dipole moments (IR) or polarizability derivatives (Raman). Peak positions cluster at frequencies corresponding to molecular vibrations, not uniformly across the axis. These soft constraints narrow the reconstruction space and help the model converge to physically plausible predictions.

The comparison to images is instructive. Images have 2D spatial redundancy — a masked patch can be inferred from surrounding patches in all directions. Spectra have 1D spectral redundancy plus long-range physical correlations that span the entire wavenumber range. The effective redundancy per masked position is lower for spectra, which is why 30–40% masking works best (not 75% as in MAE for images).

Masking as Feature Selection. After pretraining, the encoder’s output at a visible (unmasked) position encodes not just the local peak shape, but its relationship to all other peaks in the spectrum. The representation at 2900 cm⁻¹ (C–H stretch) carries information about what the model expects at 1450 cm⁻¹ (C–H bend), 5800 cm⁻¹ (overtone), and 1720 cm⁻¹ (whether a carbonyl is present). These contextual representations are exactly what downstream tasks — identification, quantification, anomaly detection — need.

Pretraining Results

Training setup: 222K QM9S computed spectra (IR + Raman), masked patch modeling with 35% masking ratio, D-LinOSS backbone (4 layers, $d = 256$, state dimension 128), trained on 2× RTX 5060 Ti 16GB.


Pretraining matters most when labels are scarce. The gap between pretrained and from-scratch models grows as labels decrease. At 100% labels (the full 222K dataset), the gap is ~2 points — both approaches have enough data to learn. At 10% labels, the gap is 12 points. At 1% labels (2,200 spectra), pretrained reaches 71% versus from-scratch at 43% — a 28-point gap. This is the practical value: pretraining makes spectral ML viable in the realistic regime where labeled experimental data is expensive to produce.

Practical Pitfalls

Hard-won lessons from implementation:

Patch size matters. Too small (5 points) and the model interpolates from immediate neighbors — no long-range learning. Too large (100 points) and each masked region contains multiple overlapping peaks that are too complex to reconstruct from context. 27 points — matching the CNN tokenizer’s receptive field and approximately one peak width — is the sweet spot for our architecture and spectral resolution.

Learning rate for the mask token. The mask embedding $\mathbf{m}$ is a single parameter being pulled in different directions by every masked position in every training sample. Without a learning rate boost (10× the backbone LR), it gets stuck near initialization and all masked positions produce similar, uninformative outputs. A dedicated learning rate group for $\mathbf{m}$ fixes this.
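
In PyTorch, this is a per-parameter-group learning rate. A sketch with a toy module; the 10x factor follows the text, everything else (`TinyEncoder`, the base LR value) is illustrative:

```python
import torch

class TinyEncoder(torch.nn.Module):
    def __init__(self, d: int = 256, patch_size: int = 27):
        super().__init__()
        self.mask_token = torch.nn.Parameter(torch.zeros(d))  # the vector m
        self.proj = torch.nn.Linear(patch_size, d)  # stand-in for the backbone

model = TinyEncoder()
base_lr = 3e-4  # illustrative backbone learning rate

# Dedicated group: the mask token trains at 10x the backbone rate
optimizer = torch.optim.AdamW([
    {"params": [model.mask_token], "lr": 10 * base_lr},
    {"params": model.proj.parameters(), "lr": base_lr},
])
```
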

Random masking per sample per epoch. If the masking pattern is deterministic (same patches masked every time a spectrum is seen), the model memorizes the reconstruction for each training sample rather than learning general spectral relationships. The mask must be resampled independently for every sample in every epoch.

Combine with OT loss. Pure MSE reconstruction loss misses shifted peaks, as described in the optimal transport post. Using the hybrid MSE + Sinkhorn loss from that work improves downstream accuracy by ~1.5 percentage points — the model learns to produce sharper, better-positioned peaks.

*Figure: reconstruction loss comparison — MSE only vs. MSE + Sinkhorn OT (α = 0.3).*

This post is part of a series on the design of Spektron, a spectral foundation model. The optimal transport post explains the Sinkhorn loss used here for reconstruction. The state-space models post covers the D-LinOSS backbone architecture. The spectral identifiability post provides the information-theoretic motivation for why pretraining needs to capture both IR and Raman modalities. The spectral inverse problem post frames the broader challenge that pretraining helps solve. For the preprocessing pipeline that prepares raw spectra before patching, see SpectraKit.