Resources
Curated datasets, tools, and papers for spectral machine learning. Updated as I find useful things.
Datasets
Computed Spectra
- QM9S — 130,831 small organic molecules (≤9 heavy atoms) with computed IR, Raman, and UV-Vis spectra at B3LYP/def2-TZVP. The dataset behind Spekron pretraining.
- QM9 — The parent dataset. 134K molecules with DFT-computed properties (HOMO, LUMO, dipole, etc.). No spectra, but widely used for molecular ML benchmarks.
- QM7-X — 4.2M conformational geometries for 6,950 molecules with PBE0+MBD energy, forces, and multipole properties. Useful for learning conformational effects on spectra.
Experimental Spectra
- SDBS (AIST) — ~35,000 experimental IR, Raman, mass, and NMR spectra from AIST Japan. The largest free experimental spectral database. Requires manual download.
- RRUFF — Raman spectra of ~5,000 mineral species. High-quality reference spectra with crystal structure data. Excellent for testing Raman-specific models.
- NIST Chemistry WebBook — Gas-phase IR spectra for ~16,000 compounds. Low resolution but well-characterized and freely available.
- Corn Dataset (Eigenvector) — 80 corn samples × 3 NIR instruments × 700 channels. The standard benchmark for calibration transfer methods.
Tools & Libraries
Spectral Preprocessing
- SpectraKit — My library. A functional API over NumPy arrays for baseline correction, smoothing, normalization, derivative computation, scatter correction, and file I/O. Only two dependencies (numpy, scipy). See also: design philosophy.
- rampy — Python library focused on Raman spectroscopy. Good for Raman-specific baseline algorithms and peak fitting.
- SpectroChemPy — Full-featured spectral analysis framework. Object-oriented (unlike SpectraKit's functional approach). Supports 2D correlation spectroscopy.
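For orientation, here is a minimal sketch of two of the preprocessing steps these libraries cover: standard normal variate (SNV) scatter correction and a Savitzky-Golay smoothed derivative. It uses only NumPy and SciPy and is not the API of SpectraKit, rampy, or SpectroChemPy. The function names `snv` and `smooth_derivative` are made up for illustration, and the spectra are synthetic.

```python
import numpy as np
from scipy.signal import savgol_filter


def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row) on its own."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std


def smooth_derivative(spectra, window_length=11, polyorder=2, deriv=1):
    """Savitzky-Golay smoothed derivative along the channel axis."""
    return savgol_filter(spectra, window_length=window_length,
                         polyorder=polyorder, deriv=deriv, axis=1)


# Synthetic data (an assumption for this sketch): two Gaussian bands on a
# sloping baseline, recorded with different scale and offset, plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 500)
bands = np.exp(-((x - 0.3) / 0.03) ** 2) + 0.5 * np.exp(-((x - 0.7) / 0.05) ** 2)
spectra = np.stack([
    scale * bands + offset + 0.5 * x + 0.01 * rng.normal(size=x.size)
    for scale, offset in [(1.0, 0.0), (2.0, 0.3)]
])

normalized = snv(spectra)                       # each row: zero mean, unit variance
first_derivative = smooth_derivative(normalized)  # same shape as the input
```

The window length and polynomial order are arbitrary starting values. In practice they depend on band width and noise level.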
Molecular Machine Learning
- PyTorch Geometric — Graph neural networks for molecular property prediction. Includes SchNet, DimeNet, and DimeNet++ implementations; SphereNet is available through the PyG-based DIG library.
- DeepChem — High-level molecular ML library. Built-in featurizers, splitters, and models for molecular property prediction.
- RDKit — The standard cheminformatics toolkit. Molecular descriptors, fingerprints, 3D geometry generation. Essential for any molecular ML pipeline.
State Space Models
- Mamba — Official Mamba implementation. Selective state space model with hardware-aware CUDA kernels. See also: why SSMs work for spectra.
- S4 — Structured State Spaces for Sequence Modeling. The structured SSM that started the current wave of deep SSM work. Uses HiPPO initialization to capture long-range dependencies.
Key Papers
Spectral Machine Learning
- Karthikeyan et al. (2026) — Information-Theoretic Limits of Spectroscopic Molecular Identification. When combined IR + Raman uniquely determines molecular structure.
- Karthikeyan et al. (2026) — Hybrid State Space Architecture for Vibrational Spectroscopy. CNN + D-LinOSS encoder with VIB for calibration transfer.
- Gastegger et al. (2017) — Machine learning molecular dynamics for the simulation of infrared spectra. Neural network potentials for computing spectra.
- Schütt et al. (2018) — SchNet: A deep learning architecture for molecules and materials. Continuous-filter convolutions on molecular graphs.
State Space Models
- Gu et al. (2022) — Efficiently Modeling Long Sequences with Structured State Spaces (S4). HiPPO initialization and a normal-plus-low-rank (NPLR) parameterization of the state matrix; purely diagonal variants (DSS, S4D) came later.
- Gu & Dao (2024) — Mamba: Linear-Time Sequence Modeling with Selective State Spaces. Input-dependent (selective) SSM parameters make the model time-varying, so it can no longer be computed as a fixed convolution and relies on a parallel scan instead.
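To make the convolution point concrete, here is a toy sketch with a diagonal, real-valued, already-discretized SSM. With fixed parameters the recurrence and a convolution with the kernel `K[j] = sum(c * a**j * b)` give the same output. Once the decay depends on the input, no single kernel exists. The function names and the `gate` function are hypothetical, and the gating here is far simpler than Mamba's actual parameterization.

```python
import numpy as np


def ssm_recurrence(a, b, c, u):
    """x[k] = a * x[k-1] + b * u[k],  y[k] = sum(c * x[k]), with diagonal a."""
    x = np.zeros_like(a)
    y = np.empty(len(u))
    for k, u_k in enumerate(u):
        x = a * x + b * u_k
        y[k] = np.dot(c, x)
    return y


def ssm_convolution(a, b, c, u):
    """Same output through the fixed kernel K[j] = sum(c * a**j * b)."""
    length = len(u)
    powers = a[None, :] ** np.arange(length)[:, None]  # shape (length, state)
    kernel = powers @ (c * b)                           # shape (length,)
    return np.convolve(u, kernel)[:length]


def selective_recurrence(a, b, c, u, gate):
    """Toy selective variant: the decay at step k is a ** gate(u[k]).

    Because the decay changes with the input, the output is no longer
    a convolution of u with one fixed kernel.
    """
    x = np.zeros_like(a)
    y = np.empty(len(u))
    for k, u_k in enumerate(u):
        a_k = a ** gate(u_k)  # input-dependent decay (illustrative only)
        x = a_k * x + b * u_k
        y[k] = np.dot(c, x)
    return y


rng = np.random.default_rng(0)
a = np.array([0.9, 0.5, 0.1])   # positive decays, so fractional powers are safe
b = np.array([1.0, 0.5, -0.3])
c = np.array([0.2, -1.0, 0.7])
u = rng.normal(size=64)

y_rec = ssm_recurrence(a, b, c, u)
y_conv = ssm_convolution(a, b, c, u)
y_sel = selective_recurrence(a, b, c, u, gate=lambda v: 1.0 + abs(v))
```

Here `y_rec` and `y_conv` agree to floating-point precision, while `y_sel` can only be produced by stepping (or scanning) through the sequence.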
Neural ODEs & Physics-Informed ML
- Chen et al. (2018) — Neural Ordinary Differential Equations. The adjoint method for backpropagating through ODE solvers. See also: my take on reactor applications.
- Raissi et al. (2019) — Physics-informed neural networks (PINNs). Encoding PDEs as loss terms. Foundation for ReactorTwin.
Calibration Transfer
- Wang et al. (1991) — Piecewise Direct Standardization (PDS). The classical baseline for cross-instrument transfer.
- Mishra et al. (2021) — Deep calibration transfer. One of the first applications of deep learning to the calibration transfer problem.
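As a rough illustration of the PDS idea: each channel on the master instrument is regressed on a small window of neighboring channels from the second ("slave") instrument, and the local coefficients are assembled into a banded transfer matrix. The sketch below is a simplification. It uses ordinary least squares for the local models where the classical method uses PCR or PLS, the names `fit_pds` and `apply_pds` are hypothetical, and the data are random numbers in which the slave is a shifted, rescaled copy of the master.

```python
import numpy as np


def fit_pds(master, slave, half_width=2):
    """Fit a banded transfer matrix F and offset so that slave @ F + offset ~ master.

    master, slave: (n_samples, n_channels) spectra of the same transfer samples.
    """
    master = np.asarray(master, dtype=float)
    slave = np.asarray(slave, dtype=float)
    n_channels = master.shape[1]
    master_mean = master.mean(axis=0)
    slave_mean = slave.mean(axis=0)
    master_c = master - master_mean
    slave_c = slave - slave_mean

    F = np.zeros((n_channels, n_channels))
    for j in range(n_channels):
        lo = max(0, j - half_width)
        hi = min(n_channels, j + half_width + 1)
        # Local model: master channel j from the slave window [lo, hi).
        # Plain least squares here; PCR/PLS would be the classical choice.
        coef, *_ = np.linalg.lstsq(slave_c[:, lo:hi], master_c[:, j], rcond=None)
        F[lo:hi, j] = coef

    offset = master_mean - slave_mean @ F
    return F, offset


def apply_pds(spectra, F, offset):
    """Map slave-instrument spectra into the master instrument's space."""
    return np.asarray(spectra, dtype=float) @ F + offset


def to_slave(master_spectra):
    """Synthetic instrument difference: one-channel shift, gain 1.1, offset 0.05."""
    return 1.1 * np.roll(master_spectra, 1, axis=1) + 0.05


rng = np.random.default_rng(0)
master_train = rng.normal(size=(30, 50))   # transfer samples on the master
master_test = rng.normal(size=(10, 50))    # held-out samples

F, offset = fit_pds(master_train, to_slave(master_train), half_width=2)
corrected = apply_pds(to_slave(master_test), F, offset)

# Compare on interior channels; the last channel has no right-hand neighbor
# in its window, so the one-channel shift cannot be undone there.
error_before = np.abs(to_slave(master_test)[:, :-1] - master_test[:, :-1]).max()
error_after = np.abs(corrected[:, :-1] - master_test[:, :-1]).max()
```

On real spectra, neighboring channels are strongly collinear, which is why the local models are usually regularized (PCR or PLS with a few components) and not fit by plain least squares.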
Missing something? Let me know and I'll add it.