Spektron
A self-supervised foundation model for vibrational spectroscopy using damped linear oscillatory state-space models, variational information bottleneck disentanglement, and physics-informed multi-task pretraining.
Spektron is a self-supervised foundation model for vibrational spectroscopy that achieves few-shot calibration transfer across instruments and modalities. Built on a physics-aligned backbone that mirrors the dynamics of molecular vibrations, it disentangles transferable chemical information from discardable instrument signatures.
The Problem
Vibrational spectroscopy (IR, Raman, NIR) is one of the most deployed analytical techniques in chemistry — but every spectrometer introduces its own instrumental fingerprint. A model trained on one instrument performs poorly on another. Classical calibration transfer methods like Direct Standardization require paired measurements on both instruments, which is expensive and impractical at scale.
Spektron addresses this by learning a latent representation where chemical information is invariant to the instrument that produced the measurement.
Architecture
The model uses a multi-stage encoder with seven loss objectives for self-supervised pretraining:
```
Raw Spectrum (B, 2048)
  → Raw Spectral Embedding (Conv1d, k=15, stride=1)
  → [CLS] + [DOMAIN] token prepend
  → D-LinOSS Backbone (4 layers, d_model=256, d_state=128)
  → Mixture of Experts (4 experts, top-k=2)
  → Transformer Encoder (2 blocks, 8 heads)
  → VIB Head → z_chem (128-dim) + z_inst (64-dim)
  → Task Heads: reconstruction | regression | transfer
```
D-LinOSS Backbone
The backbone uses Damped Linear Oscillatory State-Space (D-LinOSS) layers instead of standard SSMs like Mamba. Each layer models a set of 128 damped harmonic oscillators — a physics-aligned inductive bias for spectroscopy, since vibrational spectra literally arise from molecular vibrations (harmonic and anharmonic oscillators).
Each oscillator is parameterized by:
- Natural frequency — learned diagonal entries of the stiffness matrix
- Damping coefficient — controls energy dissipation
- IMEX symplectic discretization — preserves oscillatory structure during the 2048-step recurrence
The recurrence matrix for each oscillator takes the form

$$
M_k = \begin{pmatrix} s_k & -\Delta t\, s_k\, a_k \\ \Delta t\, s_k & 1 - \Delta t^2\, s_k\, a_k \end{pmatrix},
\qquad s_k = \frac{1}{1 + \Delta t\, g_k},
$$

where $a_k$ is the learned stiffness (the squared natural frequency), $g_k$ the damping coefficient, and $\Delta t$ the step size. A critical stability requirement is the CFL condition: the ratio $\Delta t \sqrt{a_k}$ must remain below 2.0 to keep the eigenvalues of $M_k$ inside the unit circle. During training, $a_k$ values can grow, causing $\Delta t \sqrt{a_k}$ to exceed this threshold and producing exponential divergence in the 2048-step scan. We clamp $\Delta t \sqrt{a_k} \le 1.99$ to guarantee stability.
The entire LinOSSBlock is forced to run in float32 even under mixed precision, because the SSM scan accumulates values that can grow well beyond float16 range.
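A single channel of this recurrence, with the stability clamp applied, can be sketched in a few lines. This is a NumPy illustration: the function name, the scalar per-oscillator parameterization, and the exact clamp placement are assumptions, not Spektron's implementation.

```python
import numpy as np

def dlinoss_scan(a: float, g: float, dt: float, u: np.ndarray) -> np.ndarray:
    """Sketch of the damped-oscillator IMEX recurrence for one D-LinOSS channel.

    a  : learned stiffness (squared natural frequency)
    g  : damping coefficient (>= 0)
    dt : discretization step size
    u  : input sequence, shape (T,)
    Returns the position trace y, shape (T,).
    """
    # CFL-style clamp: keep dt*sqrt(a) <= 1.99 so the recurrence
    # eigenvalues stay inside the unit circle.
    a = min(a, (1.99 / dt) ** 2)
    s = 1.0 / (1.0 + dt * g)   # implicit damping factor
    z, y = 0.0, 0.0            # velocity-like and position-like states
    out = np.empty_like(u)
    for t, u_t in enumerate(u):
        z = s * (z - dt * a * y + dt * u_t)  # implicit-explicit velocity update
        y = y + dt * z                       # explicit position update
        out[t] = y
    return out
```

Without the clamp, an unclamped stiffness like `a=1e6` at `dt=0.5` would diverge within a few steps of the 2048-step scan; with it, the trace stays bounded.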
Raw Spectral Embedding
Unlike the wavelet-based embedding used with the Mamba backbone, D-LinOSS operates on full-resolution spectra (2048 tokens, one per spectral point). A local Conv1d with kernel size 15 and stride 1 maps each spectral point to the model dimension, preserving all spectral detail. Wavenumber-aware positional encoding injects the physical frequency axis into the representation.
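A minimal sketch of this embedding stage, assuming a hypothetical `RawSpectralEmbedding` module and a sinusoidal encoding driven by the physical wavenumber rather than the token index (the released implementation may differ):

```python
import torch
import torch.nn as nn

class RawSpectralEmbedding(nn.Module):
    """Sketch: per-point spectral embedding with wavenumber-aware positions.
    Module name, d_model, and the encoding scheme are assumptions."""

    def __init__(self, d_model: int = 256, kernel_size: int = 15):
        super().__init__()
        # stride=1 with symmetric padding keeps all 2048 spectral points
        self.conv = nn.Conv1d(1, d_model, kernel_size, stride=1,
                              padding=kernel_size // 2)
        self.d_model = d_model

    def forward(self, x: torch.Tensor, wavenumbers: torch.Tensor) -> torch.Tensor:
        # x: (B, 2048) absorbance values; wavenumbers: (2048,) physical axis
        h = self.conv(x.unsqueeze(1)).transpose(1, 2)      # (B, 2048, d_model)
        # sinusoidal encoding driven by the physical wavenumber, not the index
        i = torch.arange(self.d_model // 2, dtype=torch.float32)
        freqs = 1.0 / (10000.0 ** (2 * i / self.d_model))  # (d_model/2,)
        angles = wavenumbers[:, None] * freqs[None, :]     # (2048, d_model/2)
        pe = torch.cat([angles.sin(), angles.cos()], dim=-1)
        return h + pe
```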
Variational Information Bottleneck (VIB)
The VIB head splits the CLS token representation into two latent variables:
- z_chem (128-dim): instrument-invariant chemical representation, regularized via KL divergence
- z_inst (64-dim): instrument-specific information, trained to be discardable
A gradient reversal layer on z_chem ensures it cannot encode instrument identity — the adversarial instrument classifier receives reversed gradients, forcing z_chem to be instrument-agnostic. This is the key mechanism enabling zero-shot transfer: at inference time, z_inst is discarded and only z_chem is used for downstream tasks.
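The gradient reversal layer itself is the standard construction from domain-adversarial training; a minimal PyTorch sketch (not Spektron's exact code):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated (scaled) gradient in the
    backward pass — the standard gradient-reversal trick."""

    @staticmethod
    def forward(ctx, x: torch.Tensor, lambd: float) -> torch.Tensor:
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        # Reverse the gradient flowing back from the adversarial instrument
        # classifier, so the encoder learns to *remove* instrument identity.
        return -ctx.lambd * grad_output, None

def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lambd)
```

The classifier trains normally on its own parameters; only the gradient reaching the encoder through this layer is flipped.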
Mixture of Experts (MoE)
Four expert networks (one per modality: NIR, IR, Raman, Cross-domain) with top-k=2 sparse gating. Optional KAN (Kolmogorov-Arnold Network) activations in the expert FFNs provide interpretability — the learned activation shapes can reveal which spectral features each expert specializes in.
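Top-k=2 routing can be sketched as follows (assumed shapes and a hypothetical `top2_gate` helper; the KAN activations and the load-balancing loss are omitted):

```python
import torch
import torch.nn.functional as F

def top2_gate(h: torch.Tensor, w_gate: torch.Tensor):
    """Sketch of top-k=2 sparse routing over 4 experts.

    h      : (B, d_model) token representations
    w_gate : (d_model, 4) gating weights, one column per expert
    Returns (weights, indices): mixture weights over the 2 selected
    experts per token, and which experts were chosen.
    """
    logits = h @ w_gate                    # (B, 4) router scores
    top_vals, top_idx = logits.topk(k=2, dim=-1)
    weights = F.softmax(top_vals, dim=-1)  # renormalize over the chosen 2
    return weights, top_idx
```

Each token's output is then the weighted sum of its two selected expert FFNs; the other two experts receive no computation for that token.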
Pretraining Objectives
Spektron uses seven concurrent loss functions during self-supervised pretraining:
| Loss | Weight | Purpose |
|---|---|---|
| MSRP (Masked Spectrum Reconstruction) | 1.0 | Mask 20% of spectral points in contiguous 3-point blocks, reconstruct from context |
| Contrastive (BYOL-style) | 0.3 | Same sample across instruments → similar |
| Denoising | 0.2 | Reconstruct clean spectrum from augmented (noise, baseline drift, wavelength shift) input |
| Physics-informed | 0.1 | Beer-Lambert linearity, smoothness, non-negativity, peak symmetry |
| Optimal Transport (Sinkhorn) | 0.1 | Align latent distributions across instruments via Wasserstein distance |
| VIB | 0.15 | KL regularization + adversarial instrument classification with gradient reversal |
| MoE balance | 0.01 | Prevent expert collapse via load balancing |
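Two of the physics-informed terms, smoothness and non-negativity, are straightforward to sketch; the Beer-Lambert linearity and peak-symmetry terms are omitted, and the exact penalty forms here are assumptions:

```python
import torch

def physics_penalties(recon: torch.Tensor) -> torch.Tensor:
    """Sketch of two physics-informed penalty terms.
    recon: (B, 2048) reconstructed spectra."""
    # smoothness: penalize curvature via the second finite difference
    second_diff = recon[:, 2:] - 2 * recon[:, 1:-1] + recon[:, :-2]
    smooth = second_diff.pow(2).mean()
    # non-negativity: absorbance/intensity should not dip below zero
    nonneg = torch.relu(-recon).pow(2).mean()
    return smooth + nonneg
```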
A learnable mask_token parameter replaces masked positions in embedding space before the backbone — this is critical. Without masking the input, the model degenerates to a near-identity mapping (MSRP loss drops to 0.003 within 700 steps, learning nothing useful).
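The masking step can be sketched as follows (hypothetical helper; the block-sampling scheme beyond "20% in contiguous 3-point blocks" is an assumption):

```python
import torch

def mask_spectrum(emb: torch.Tensor, mask_token: torch.Tensor,
                  mask_ratio: float = 0.20, block: int = 3):
    """Sketch of MSRP masking in embedding space, before the backbone.

    emb        : (B, T, d_model) per-point spectral embeddings
    mask_token : (d_model,) learnable parameter substituted at masked positions
    Returns the masked embeddings and the boolean mask used.
    """
    B, T, _ = emb.shape
    n_blocks = int(T * mask_ratio / block)
    mask = torch.zeros(B, T, dtype=torch.bool)
    for b in range(B):
        starts = torch.randint(0, T - block, (n_blocks,))
        for s in starts:
            mask[b, int(s):int(s) + block] = True  # contiguous 3-point span
    out = emb.clone()
    out[mask] = mask_token  # replace in embedding space, not on raw input
    return out, mask
```

The reconstruction loss is then computed only at masked positions; since the mask token hides the input there, the model cannot fall back to an identity mapping.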
Training Infrastructure
- Hardware: 2x NVIDIA RTX 5060 Ti (16GB each) via Vast.ai
- Batch size: 16 (8 per GPU), gradient accumulation 4 steps → effective batch 64
- Optimizer: AdamW (lr=3e-4, weight_decay=0.01) with linear warmup (1K steps) → cosine annealing
- Precision: bfloat16 AMP with LinOSSBlock forced to float32
- Data: 222K QM9S training samples (IR + Raman spectra, 2048 points each)
- Throughput: ~39 samples/sec, ~23 hours for 50K pretraining steps
- Memory: ~7.5GB per GPU
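Pinning one submodule to float32 under AMP can be done with a small wrapper; this sketch (hypothetical `ForceFloat32` name, not Spektron's code) shows the idea:

```python
import torch

class ForceFloat32(torch.nn.Module):
    """Sketch: run a submodule in float32 even inside an autocast region,
    as described for the LinOSSBlock."""

    def __init__(self, inner: torch.nn.Module):
        super().__init__()
        self.inner = inner

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Disable autocast locally and upcast the input, so the wrapped
        # scan accumulates in float32 regardless of the outer precision.
        with torch.autocast(device_type=x.device.type, enabled=False):
            return self.inner(x.float())
```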
Key Design Decisions
- D-LinOSS over Mamba: The oscillatory dynamics of D-LinOSS naturally align with vibrational spectroscopy — each latent oscillator can learn to track a specific molecular vibration mode. Mamba’s selective gating is more general but lacks this physics prior.
- Full-resolution embedding: With D-LinOSS’s O(n) complexity, we can process all 2048 spectral points as individual tokens (vs. 127 patches with wavelet embedding). No information is lost to patching.
- Gradient reversal for VIB: Rather than training a separate adversarial loop, a gradient reversal layer in the forward pass cleanly separates instrument information from chemistry during backpropagation.
- CFL clamping: Learned frequency parameters can grow unboundedly during training, causing the discretized recurrence to become unstable. Clamping the CFL ratio at 1.99 prevents eigenvalue escape without overly constraining the learned dynamics.
- Sinkhorn regularization at 1.0: For 128-dimensional embeddings, the standard entropic regularization (ε = 0.05) causes the transport kernel exp(−C/ε) to underflow in float16. A regularization of 1.0 keeps the computation numerically stable.
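A minimal entropic-OT sketch with uniform marginals, showing where the regularization ε enters the computation (toy code, not Spektron's transport loss):

```python
import numpy as np

def sinkhorn_plan(cost: np.ndarray, eps: float = 1.0, n_iters: int = 100) -> np.ndarray:
    """Sketch of Sinkhorn iterations for entropic optimal transport.

    cost : (n, m) pairwise cost matrix between two batches of embeddings
    eps  : entropic regularization; too small and exp(-cost/eps) underflows
    Returns the (n, m) transport plan.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # uniform source marginal
    b = np.full(m, 1.0 / m)   # uniform target marginal
    K = np.exp(-cost / eps)   # Gibbs kernel: the underflow-prone step
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)       # alternately rescale rows...
        v = b / (K.T @ u)     # ...and columns to match the marginals
    return u[:, None] * K * v[None, :]
```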
Downstream Tasks
- Calibration transfer: Predict corn moisture/oil/protein/starch across 3 instruments with ≤10 labeled transfer samples (target: R² > 0.95)
- Compound identification: Few-shot classification from learned embeddings
- Property prediction: Regression from CLS token to molecular properties
- Test-Time Training (TTT): K gradient steps on unlabeled target spectra using MSRP loss, adapting layer norms or LoRA parameters for zero-shot instrument adaptation
Related
- Paper: Hybrid SSA Spectroscopy — the research paper describing Spektron’s architecture and evaluation
- Theory: Spectral Identifiability — information-theoretic framework motivating the VIB design
- Preprocessing: SpectraKit — the spectral preprocessing library powering Spektron’s data pipeline
- Blog: The Spectral Inverse Problem — accessible overview of the theory behind spectral inversion
- Blog: Masked Pretraining for Scientific Spectra — lessons learned from the masking strategy
- Blog: State-Space Models for Spectroscopy — why SSMs are a natural fit for spectral sequences