Resources

Curated datasets, tools, and papers for spectral machine learning. Updated as I find useful things.

Datasets

Computed Spectra

  • QM9S — 130,831 small organic molecules (≤9 heavy atoms) with computed IR, Raman, and UV-Vis spectra at B3LYP/def2-TZVP. The dataset behind Spekron pretraining.
  • QM9 — The parent dataset. 134K molecules with DFT-computed properties (HOMO, LUMO, dipole, etc.). No spectra, but widely used for molecular ML benchmarks.
  • QM7-X — 4.2M conformational geometries for 6,950 molecules with PBE0+MBD energy, forces, and multipole properties. Useful for learning conformational effects on spectra.

Experimental Spectra

  • SDBS (AIST) — ~35,000 experimental IR, Raman, mass, and NMR spectra from AIST Japan. One of the largest free experimental spectral databases. Requires manual download.
  • RRUFF — Raman spectra of ~5,000 mineral species. High-quality reference spectra with crystal structure data. Excellent for testing Raman-specific models.
  • NIST Chemistry WebBook — Gas-phase IR spectra for ~16,000 compounds. Low resolution but well-characterized and freely available.
  • Corn Dataset (Eigenvector) — 80 corn samples × 3 NIR instruments × 700 channels. The standard benchmark for calibration transfer methods.

Tools & Libraries

Spectral Preprocessing

  • SpectraKit — My library. Functional API over NumPy arrays for baseline correction, smoothing, normalization, derivative computation, scatter correction, and file I/O. Two dependencies: numpy and scipy. See also: Design philosophy.
  • rampy — Python library focused on Raman spectroscopy. Good for Raman-specific baseline algorithms and peak fitting.
  • SpectroChemPy — Full-featured spectral analysis framework. Object-oriented (unlike SpectraKit's functional approach). Supports 2D correlation spectroscopy.
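
To make the preprocessing vocabulary above concrete, here is a minimal sketch of two of the listed steps, baseline correction and scatter correction, written in the functional style over plain NumPy arrays. This is not SpectraKit's API: the function names `polynomial_baseline` and `snv` are hypothetical, and the baseline here is a crude whole-spectrum polynomial fit rather than any particular library's algorithm.

```python
import numpy as np


def polynomial_baseline(x, y, degree=2):
    """Estimate a baseline by fitting a low-order polynomial to the whole spectrum.

    Assumption: peaks are sparse enough that the fit mostly follows the
    background. Real baseline algorithms iterate or mask the peaks.
    """
    poly = np.polynomial.Polynomial.fit(x, y, degree)
    return poly(x)


def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row) on its own."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=-1, keepdims=True)
    std = spectra.std(axis=-1, keepdims=True)
    return (spectra - mean) / std


# Synthetic example: a sloped background plus one Gaussian band.
x = np.linspace(400.0, 4000.0, 500)
background = 0.5 + 1e-4 * x
band = np.exp(-((x - 1700.0) / 30.0) ** 2)
raw = background + band

corrected = raw - polynomial_baseline(x, raw, degree=1)
batch = snv(np.vstack([raw, 3.0 * raw + 2.0]))
```

Because SNV removes a per-spectrum offset and scale, the two rows of `batch` come out identical even though the second input was a scaled and shifted copy of the first, which is the multiplicative scatter effect it is meant to cancel.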

Molecular Machine Learning

  • PyTorch Geometric — Graph neural networks for molecular property prediction. Includes SchNet, DimeNet, and DimeNet++ implementations.
  • DeepChem — High-level molecular ML library. Built-in featurizers, splitters, and models for molecular property prediction.
  • RDKit — The standard cheminformatics toolkit. Molecular descriptors, fingerprints, 3D geometry generation. Essential for any molecular ML pipeline.

State Space Models

  • Mamba — Official Mamba implementation. Selective state space model with hardware-aware CUDA kernels. See also: Why SSMs work for spectra.
  • S4 — Structured State Spaces for Sequence Modeling. The structured SSM that kicked off the current wave of deep state space models. HiPPO initialization for long-range dependencies.

Key Papers

Spectral Machine Learning

State Space Models

  • Gu et al. (2022) — Efficiently Modeling Long Sequences with Structured State Spaces (S4). HiPPO initialization and the diagonal-plus-low-rank parameterization that makes the convolution kernel cheap to compute.
  • Gu & Dao (2024) — Mamba: Linear-Time Sequence Modeling with Selective State Spaces. Input-dependent (selective) parameters that break the convolution interpretation, so the model is computed with a parallel scan instead.
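
The convolution view of S4, and the way Mamba's input dependence breaks it, can be illustrated with a toy NumPy sketch. A time-invariant SSM run as a recurrence gives the same output as a causal convolution with kernel K_i = C A^i B. Once the step size depends on the input, no fixed kernel exists and the map is no longer linear. `ssm_selective_toy` is a hypothetical simplification for illustration only; it is not Mamba's actual parameterization, which uses learned projections and a fused scan.

```python
import numpy as np


def ssm_recurrent(A, B, C, u):
    """LTI SSM as a recurrence: x_k = A x_{k-1} + B u_k, y_k = C x_k."""
    x = np.zeros(A.shape[0])
    y = np.empty(len(u))
    for k, u_k in enumerate(u):
        x = A @ x + B * u_k
        y[k] = C @ x
    return y


def ssm_convolutional(A, B, C, u):
    """The same LTI SSM as a causal convolution with kernel K_i = C A^i B."""
    kernel = np.empty(len(u))
    v = B.astype(float)
    for i in range(len(u)):
        kernel[i] = C @ v
        v = A @ v
    return np.convolve(u, kernel)[: len(u)]


def ssm_selective_toy(a, B, C, u):
    """Toy selective SSM: the step size is a function of the current input.

    a holds the negative diagonal of a continuous-time state matrix.
    """
    x = np.zeros(a.shape[0])
    y = np.empty(len(u))
    for k, u_k in enumerate(u):
        delta = np.log1p(np.exp(u_k))  # softplus keeps the step size positive
        x = np.exp(delta * a) * x + delta * B * u_k
        y[k] = C @ x
    return y


A = np.diag([0.9, 0.5, 0.1])
a = np.array([-0.5, -1.0, -2.0])
B = np.ones(3)
C = np.array([1.0, 0.5, 0.25])
u = np.array([1.0, 0.5, -0.3, 0.8, 0.0, -1.2])

# Recurrence and convolution agree for the time-invariant model.
lti_match = np.allclose(ssm_recurrent(A, B, C, u), ssm_convolutional(A, B, C, u))

# Doubling the input does not double the output of the selective model.
selective_is_linear = np.allclose(
    ssm_selective_toy(a, B, C, 2 * u), 2 * ssm_selective_toy(a, B, C, u)
)
```

Here `lti_match` holds and `selective_is_linear` does not, which is the trade Mamba makes: it gives up the precomputable kernel in exchange for state updates that can react to the content of the sequence.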

Neural ODEs & Physics-Informed ML

  • Chen et al. (2018) — Neural Ordinary Differential Equations. The adjoint method for backpropagating through ODE solvers. See also: My take on reactor applications.
  • Raissi et al. (2019) — Physics-informed neural networks (PINNs). Encoding PDEs as loss terms. Foundation for ReactorTwin.

Calibration Transfer

  • Wang et al. (1991) — Piecewise Direct Standardization (PDS). The classical baseline for cross-instrument transfer.
  • Mishra et al. (2021) — Deep calibration transfer. Among the first applications of deep learning to the calibration transfer problem.
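
For readers new to PDS, the core idea fits in a few lines: each channel on the primary (master) instrument is regressed on a small window of neighboring channels from the secondary (slave) instrument, and the local coefficients are assembled into a banded transfer matrix. The sketch below is a simplification under stated assumptions: it uses ordinary least squares for the local regressions (the classical method uses PCR or PLS), omits mean-centering and the offset term, and `pds_transform` and `half_width` are hypothetical names.

```python
import numpy as np


def pds_transform(master, slave, half_width=2):
    """Estimate a banded matrix F so that slave @ F approximates master.

    master, slave: (n_samples, n_channels) spectra of the same transfer
    samples measured on the two instruments.
    """
    n_channels = master.shape[1]
    F = np.zeros((n_channels, n_channels))
    for j in range(n_channels):
        lo = max(0, j - half_width)
        hi = min(n_channels, j + half_width + 1)
        # Local regression of master channel j on the slave window around j.
        # Assumption: plain least squares stands in for PCR/PLS here.
        coef, *_ = np.linalg.lstsq(slave[:, lo:hi], master[:, j], rcond=None)
        F[lo:hi, j] = coef
    return F


# Synthetic transfer set: the slave instrument has a one-channel shift
# and a 0.8x gain relative to the master.
rng = np.random.default_rng(0)
master = rng.normal(size=(30, 50))
slave = 0.8 * np.roll(master, 1, axis=1)

F = pds_transform(master, slave, half_width=2)
corrected = slave @ F
```

In this synthetic case the shift falls inside the window, so every channel except the last (whose shifted counterpart wraps around out of reach) is recovered essentially exactly. Real instruments differ in less tidy ways, which is why the window width and the rank of the local regression become tuning choices.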

Missing something? Let me know and I'll add it.