
Optimal Transport for Spectral Matching: Why Wasserstein Beats MSE

14 min read

When you compare two spectra, the standard loss function has a blind spot. Consider a single Gaussian peak at 1720 cm⁻¹ — the C=O stretch. Shift it by 1 cm⁻¹ (calibration drift), or shift it by 100 cm⁻¹ (a completely different functional group). Under mean squared error, both shifts produce large pointwise loss, and once a shift exceeds the peak width the loss stops growing entirely. MSE doesn’t know that 1 cm⁻¹ and 100 cm⁻¹ are different amounts of “wrong.” It sees mismatched bins and penalizes them equally.

This matters because spectra are not just vectors of numbers — they are distributions on a metric space. Each bin sits at a specific wavenumber, and the distance between wavenumbers has physical meaning. The right loss function should respect this geometry. Optimal transport does exactly that, and Sinkhorn’s algorithm makes it fast enough to use inside a training loop.

The Problem with MSE

Mean squared error computes pointwise differences:

$$\text{MSE}(\mu, \nu) = \frac{1}{N}\sum_{i=1}^{N}(\mu_i - \nu_i)^2$$

Every bin is treated identically. A mismatch at bin $i$ is penalized the same regardless of whether the nearest matching peak is 1 bin away or 500 bins away. For spectra, this is pathological: a peak shifted by a single wavenumber creates loss at both the source position (where intensity dropped) and the destination (where it appeared), with no credit given for the shift being small.
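A quick numerical check of this saturation (a hypothetical sketch, not the post’s original mse_vs_shift.py): compare a Gaussian peak against shifted copies and watch MSE stop growing once the peaks no longer overlap.

```python
# Hypothetical illustration: MSE between a Gaussian peak and shifted
# copies saturates once the shift exceeds the peak width, so a 100-bin
# and a 500-bin shift are penalized almost identically.
import numpy as np

x = np.arange(1000, dtype=float)

def peak(center, width=5.0):
    g = np.exp(-0.5 * ((x - center) / width) ** 2)
    return g / g.sum()          # area-normalize, as for a spectrum

ref = peak(300.0)
for shift in (1, 20, 100, 500):
    err = np.mean((ref - peak(300.0 + shift)) ** 2)
    print(f"shift = {shift:3d} bins   MSE = {err:.3e}")
```

The printed values flatten out: beyond a few peak widths, MSE carries no information about how far the peak moved.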


The gradient problem is the real issue. MSE’s gradient at the original peak position says “increase intensity here” and at the shifted position says “decrease intensity there.” It never says “slide the peak left by 5 cm⁻¹.” The optimizer sees two independent problems instead of one geometric correction. In practice, this means MSE-trained models produce blurry, averaged-out reconstructions — they learn to hedge by spreading intensity across nearby bins rather than committing to a precise peak position.

Spectra as Probability Distributions

Vibrational spectra have a natural interpretation as distributions. Intensity values are non-negative (absorbance or scattering intensity cannot be negative) and can be area-normalized to integrate to 1. The wavenumber axis is a metric space with the standard Euclidean distance: the “cost” of moving a unit of intensity from wavenumber $w_i$ to wavenumber $w_j$ is $|w_i - w_j|$.

This puts spectra in the Wasserstein space: the space of probability measures on a metric ground space. In this framing, comparing two spectra is equivalent to asking: what is the cheapest way to rearrange the intensity of one spectrum to match the other, where “cost” is proportional to how far each unit of intensity has to move?

This is the Monge-Kantorovich formulation of optimal transport — one of the deepest ideas in probability theory, and exactly the right mathematical framework for spectral comparison.

Wasserstein Distance: The Right Loss

The Wasserstein-$p$ distance between two distributions $\mu$ and $\nu$ on a metric space is:

$$W_p(\mu, \nu) = \left(\inf_{\pi \in \Pi(\mu,\nu)} \int c(x,y)^p \, d\pi(x,y)\right)^{1/p}$$

For discrete spectra over $N$ wavenumber bins, this becomes a linear program. Using the ground cost $C_{ij} = |w_i - w_j|$, the 1-Wasserstein distance is:

$$W_1(\mu, \nu) = \min_{\pi \in \Pi(\mu,\nu)} \langle C, \pi \rangle$$

where $C$ is the cost matrix and $\Pi(\mu,\nu) = \{\pi \geq 0 : \pi \mathbf{1} = \mu,\; \pi^T \mathbf{1} = \nu\}$ is the set of all valid transport plans — joint distributions with marginals $\mu$ and $\nu$.

The transport plan π\pi tells you exactly how much intensity to move from each source bin to each target bin. For a spectrum with a single shifted peak, the optimal plan is trivially simple: move all the mass from the old position to the new position, paying cost proportional to the shift distance.

Why 1D is special. For probability distributions on the real line, optimal transport has a closed-form solution via the cumulative distribution function:

$$W_1(\mu, \nu) = \int_0^1 |F_\mu^{-1}(t) - F_\nu^{-1}(t)| \, dt$$

where $F^{-1}$ is the quantile function. In discrete form, this is the $\ell_1$ distance between the two CDFs — no linear programming needed. The optimal plan never crosses: mass always moves in one direction. This makes 1D Wasserstein distance computationally cheap ($O(N \log N)$ for sorting, or $O(N)$ when the bins already lie on a sorted wavenumber grid) and geometrically intuitive.
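A minimal sketch of that closed form, assuming a uniform wavenumber grid and area-normalized spectra (not the post’s original wasserstein_1d.py):

```python
# 1D closed form: W1 is the L1 gap between the two CDFs, integrated
# over the wavenumber axis (uniform 1 cm^-1 grid assumed).
import numpy as np

def wasserstein_1d(mu, nu, w):
    cdf_gap = np.abs(np.cumsum(mu) - np.cumsum(nu))
    return float(np.sum(cdf_gap[:-1] * np.diff(w)))   # Riemann sum

w = np.linspace(400.0, 4000.0, 3601)        # 1 cm^-1 bins
peak = lambda c: np.exp(-0.5 * ((w - c) / 8.0) ** 2)
mu = peak(1720.0); mu /= mu.sum()
nu1 = peak(1721.0); nu1 /= nu1.sum()        # 1 cm^-1 calibration drift
nu100 = peak(1820.0); nu100 /= nu100.sum()  # 100 cm^-1: wrong group
print(wasserstein_1d(mu, nu1, w))           # close to 1
print(wasserstein_1d(mu, nu100, w))         # close to 100
```

Unlike MSE, the distance scales linearly with the shift: 1 cm⁻¹ of drift costs 1, 100 cm⁻¹ costs 100.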


The Sinkhorn Algorithm

The 1D closed form is elegant but limited. In practice we need Wasserstein distance as a differentiable loss function inside a training loop, often in a latent space where the “spectrum” is a $d$-dimensional embedding rather than a literal 1D signal. This means we need the general $N \times N$ optimal transport computation, which is an $O(N^3)$ linear program — far too slow for backpropagation.

The solution is entropic regularization. Instead of the exact OT problem, we solve:

$$W_\varepsilon(\mu,\nu) = \min_{\pi \in \Pi(\mu,\nu)} \langle C, \pi \rangle - \varepsilon H(\pi)$$

where $H(\pi) = -\sum_{ij} \pi_{ij} \log \pi_{ij}$ is the entropy of the transport plan and $\varepsilon > 0$ controls the regularization strength. The entropy term makes the problem strictly convex, and the optimal plan takes the form:

$$\pi^* = \mathrm{diag}(\mathbf{u}) \, K \, \mathrm{diag}(\mathbf{v})$$

where $K_{ij} = e^{-C_{ij}/\varepsilon}$ is the Gibbs kernel. The scaling vectors $\mathbf{u}$ and $\mathbf{v}$ are found by alternating normalization — the Sinkhorn iterations:

$$\mathbf{u}^{(k+1)} = \frac{\mu}{K \mathbf{v}^{(k)}}, \qquad \mathbf{v}^{(k+1)} = \frac{\nu}{K^T \mathbf{u}^{(k+1)}}$$

Each iteration is a matrix-vector product — fully parallelizable on GPU. Convergence typically takes 50–100 iterations for well-chosen $\varepsilon$.
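A minimal dense-kernel sketch of these iterations. The absolute-distance cost, $\varepsilon = 0.5$, and the fixed iteration count are illustrative assumptions, not the post’s training settings:

```python
# Minimal Sinkhorn sketch: alternating marginal scaling against the
# dense Gibbs kernel. eps and n_iter are illustrative assumptions.
import numpy as np

def sinkhorn(mu, nu, C, eps=0.5, n_iter=300):
    K = np.exp(-C / eps)                 # Gibbs kernel K_ij = exp(-C_ij/eps)
    u = np.ones_like(mu)
    v = np.ones_like(nu)
    for _ in range(n_iter):              # alternating marginal scaling
        u = mu / (K @ v)
        v = nu / (K.T @ u)
    plan = u[:, None] * K * v[None, :]   # pi* = diag(u) K diag(v)
    return float(np.sum(plan * C)), plan

# toy spectra on 64 bins: identical peak shapes, shifted by 5 bins
x = np.arange(64, dtype=float)
g = lambda c: np.exp(-0.5 * ((x - c) / 2.0) ** 2)
mu = g(20.0); mu /= mu.sum()
nu = g(25.0); nu /= nu.sum()
C = np.abs(x[:, None] - x[None, :])      # absolute-distance cost
cost, plan = sinkhorn(mu, nu, C)
print(cost)                              # close to the 5-bin shift
```

The recovered plan concentrates near the diagonal offset by 5 bins, and the transport cost tracks the shift distance.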

[Interactive widget: sinkhorn_explorer.py shows the source $\mu$, target $\nu$, and transport plan $\pi$ over the wavenumber axis. Wasserstein cost scales smoothly with shift distance.]

Computational complexity. Each Sinkhorn iteration is $O(N^2)$: one matrix-vector product with the $N \times N$ kernel $K$. With $L$ iterations, total cost is $O(LN^2)$. For a 64-bin coarse spectrum and $L = 100$, this is ~400K FLOPs — trivial. For the full 3501-bin spectrum, it is ~1.2 billion FLOPs — comparable to a single Transformer self-attention layer. The computation is fully batched and differentiable through the iterations via implicit differentiation.

The Epsilon Problem

The regularization parameter $\varepsilon$ controls the sharpness of the transport plan. This is where theory meets numerical reality:

  • $\varepsilon \to 0$: the solution approaches exact OT. But the Gibbs kernel entries $K_{ij} = e^{-C_{ij}/\varepsilon}$ become exponentially small for distant bins. With $C_{ij} = (i - j)^2$ and $\varepsilon = 0.05$, a distance of just $|i - j| = 4$ gives $K_{ij} = e^{-16/0.05} = e^{-320} \approx 0$. The entire kernel matrix becomes numerically zero except near the diagonal.

  • $\varepsilon \to \infty$: the entropy term dominates and the optimal plan approaches the product measure $\mu \otimes \nu$ — a uniform smear with no geometric information. The gradients become uninformative.

In Spektron’s training, we started with $\varepsilon = 0.05$ following the literature. In float64, this worked. When we enabled mixed-precision training (AMP) for speed, the Gibbs kernel entries fell below float16’s minimum representable value ($\sim 6 \times 10^{-8}$), producing NaN within 100 training steps.

The fix: increase $\varepsilon$ to 1.0 for our 128-dimensional embeddings, and use log-domain Sinkhorn for extra numerical safety:

$$\log u_i^{(k+1)} = \log \mu_i - \operatorname{logsumexp}_j\!\left(-\frac{C_{ij}}{\varepsilon} + \log v_j^{(k)}\right)$$

This replaces $e^{-320}$ with $-320$ in log space — no underflow possible.
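A sketch of the log-domain update with a hand-rolled logsumexp. The squared cost and $\varepsilon = 1.0$ mirror the discussion; the toy spectra and sizes are assumptions:

```python
# Log-domain Sinkhorn: every quantity stays a log, so exp(-320)
# becomes -320 and can never underflow. Toy sizes are assumptions.
import numpy as np

def logsumexp(A, axis):
    m = np.max(A, axis=axis, keepdims=True)         # stable shift
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(A - m), axis=axis))

def sinkhorn_log(mu, nu, C, eps=1.0, n_iter=300):
    log_mu, log_nu = np.log(mu), np.log(nu)
    M = -C / eps                                    # log Gibbs kernel
    f = np.zeros_like(mu)                           # log u
    g = np.zeros_like(nu)                           # log v
    for _ in range(n_iter):
        f = log_mu - logsumexp(M + g[None, :], axis=1)
        g = log_nu - logsumexp(M + f[:, None], axis=0)
    plan = np.exp(f[:, None] + M + g[None, :])      # pi* in linear space
    return float(np.sum(plan * C)), plan

x = np.arange(64, dtype=float)
peak = lambda c: np.exp(-0.5 * ((x - c) / 2.0) ** 2)
mu = peak(20.0); mu /= mu.sum()
nu = peak(25.0); nu /= nu.sum()
C = (x[:, None] - x[None, :]) ** 2                  # squared cost
cost, plan = sinkhorn_log(mu, nu, C)
print(cost)                                         # near 25 = (5-bin shift)^2
```

The same updates run unchanged at much smaller $\varepsilon$, where the linear-space kernel would underflow in reduced precision.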


OT as a Reconstruction Loss

In Spektron’s pretraining objective, the model reconstructs masked spectral patches. The reconstruction loss needs three properties: differentiability (for backpropagation), geometric awareness (shifted peaks cost less than wrong peaks), and numerical stability under AMP.

Sinkhorn OT satisfies all three. The OT loss for a single patch:

$$\mathcal{L}_{\text{OT}} = W_\varepsilon\!\left(\hat{s}_{\text{patch}},\, s_{\text{patch}}\right)$$

In practice, pure OT converges slower than MSE because its gradients are smoother (a feature for geometry, a bug for speed). The hybrid loss combines both:

$$\mathcal{L} = (1 - \alpha)\,\text{MSE} + \alpha\,W_\varepsilon$$

Ablation over $\alpha$ shows $\alpha = 0.3$ balances convergence speed with geometric sensitivity.


The peak position error reduction is the key result. MSE-trained models reconstruct approximate peak shapes but systematically blur the positions. Adding the Sinkhorn OT term sharpens position accuracy by nearly 3×, because the loss function now explicitly rewards moving predicted peaks toward their true positions rather than averaging over possible locations.

Beyond Loss Functions: Spectral Comparison

Optimal transport is not limited to training objectives. As a spectral similarity metric, it has immediate applications:

Library search. Given an unknown spectrum, find the closest match in a reference database. Wasserstein distance between the query and each reference spectrum produces more chemically meaningful rankings than cosine similarity, because it respects the fact that a near-miss in peak position is more informative than a near-miss in peak intensity.

Quality control. In process analytical technology (PAT), spectra are continuously monitored for deviations from a reference. OT-based control charts detect peak shifts at lower signal-to-noise ratios than MSE-based charts, because the shift signal is not diluted across all bins.

Calibration transfer. Different spectrometers produce systematically shifted spectra. The optimal transport plan between spectra from instrument A and instrument B directly encodes the calibration function — no parametric model required.

OT vs. Cosine Similarity. Cosine similarity treats the spectrum as a vector in ambient space. It measures the angle between two spectra: if all peaks are present but shifted, cosine similarity can still be high (the angle is small). Wasserstein distance treats the spectrum as a distribution on the wavenumber axis. It measures the work required to reshape one spectrum into the other. For the dominant instrument artifact — small, systematic peak shifts — Wasserstein is strictly more informative.
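A toy check of this contrast, with assumed peak width and shifts (a broad peak, so the shifted spectra still overlap): cosine similarity barely moves under a small shift, while W1 tracks the displacement exactly.

```python
# Assumed toy setup: one broad Gaussian peak, shifted by 5 vs 10 bins.
# Cosine similarity stays high and nearly constant; W1 doubles.
import numpy as np

x = np.arange(2048, dtype=float)
g = lambda c: np.exp(-0.5 * ((x - c) / 40.0) ** 2)
cosine = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
w1 = lambda a, b: float(np.abs(np.cumsum(a / a.sum()) - np.cumsum(b / b.sum())).sum())

ref = g(1000.0)
for shift in (5.0, 10.0):
    q = g(1000.0 + shift)
    print(f"shift {shift:4.0f}: cosine = {cosine(ref, q):.4f}, W1 = {w1(ref, q):.1f}")
```

Doubling the shift roughly doubles the Wasserstein distance but changes the cosine score by about a percent, which is why OT-based rankings separate near misses that cosine similarity lumps together.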

This post is part of a series on the mathematical foundations behind Spektron. The spectral inverse problem post covers the group-theoretic constraints on spectral inversion. The state-space models post explains why sequence architectures outperform CNNs for spectral processing. The spectral identifiability post derives the information-theoretic limits of structure determination from spectra. Next: masked pretraining for scientific spectra — how BERT-style self-supervised learning creates general-purpose spectral representations, using OT as part of the reconstruction loss.