python spectroscopy open-source software-design

SpectraKit: A Functional API for Spectral Preprocessing


Spectral preprocessing is the unglamorous part of spectroscopy. Before you can identify a compound, quantify a concentration, or train a model, you need to remove baselines, smooth noise, normalize intensities, and correct for scatter. Every spectroscopist does this. Most write their own scripts. The scripts are never reusable.

SpectraKit exists because I got tired of rewriting the same preprocessing code for every project. It’s a Python library — pip install pyspectrakit — that provides a functional API over NumPy arrays. No classes, no state, no framework lock-in. Every function takes arrays in and returns arrays out.

[Listing: install + verify]

Why Functional

Most preprocessing libraries for spectroscopy are object-oriented. You create a Spectrum object, call methods on it, and the object mutates internal state. This design has two problems.

First, it forces a data model. Your spectra live in whatever container the library invented — Spectrum, SpectralCollection, Dataset. You can’t use plain NumPy arrays. You can’t use pandas DataFrames without wrapping them. Integration with any other tool requires conversion.

Second, it makes composition opaque. When you chain spectrum.baseline().smooth().normalize(), you can’t easily inspect intermediate results, swap one step for another, or build a pipeline that sklearn can use. The method chain is convenient but rigid.

SpectraKit takes the opposite approach. Every function signature follows the same pattern: ndarray in, ndarray out.

[Listing: api_pattern.py]

You can inspect the baseline-corrected array before passing it to smooth_savgol. You can swap baseline_als for baseline_snip without changing anything else. You can use these functions inside a for loop, a multiprocessing pool, or a PyTorch data loader.
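
As a rough illustration of the ndarray-in, ndarray-out pattern, here is a minimal sketch built only on NumPy and SciPy. The three functions are local stand-ins: smooth_savgol and normalize_snv borrow names mentioned in this post, baseline_poly is a hypothetical placeholder for a baseline step, and every body, parameter, and default is an assumption rather than SpectraKit's actual signature.

```python
import numpy as np
from scipy.signal import savgol_filter


def baseline_poly(intensities: np.ndarray, degree: int = 2) -> np.ndarray:
    """Hypothetical baseline step: subtract a least-squares polynomial fit."""
    x = np.linspace(-1.0, 1.0, intensities.shape[-1])
    coeffs = np.polynomial.polynomial.polyfit(x, intensities.T, degree)
    baseline = np.polynomial.polynomial.polyval(x, coeffs)
    return intensities - baseline


def smooth_savgol(intensities: np.ndarray, window_length: int = 11,
                  polyorder: int = 3) -> np.ndarray:
    """Stand-in smoother: Savitzky-Golay filter along the last axis."""
    return savgol_filter(intensities, window_length, polyorder, axis=-1)


def normalize_snv(intensities: np.ndarray) -> np.ndarray:
    """Stand-in SNV: zero mean and unit standard deviation per spectrum."""
    mean = intensities.mean(axis=-1, keepdims=True)
    std = intensities.std(axis=-1, keepdims=True)
    return (intensities - mean) / std


# Synthetic spectrum: one Gaussian peak on a sloped baseline.
x = np.linspace(400.0, 4000.0, 500)
raw = np.exp(-((x - 1700.0) / 30.0) ** 2) + 1e-4 * x + 0.2

corrected = baseline_poly(raw)        # a plain array, free to plot or log
smoothed = smooth_savgol(corrected)   # swap in any other array -> array step
normalized = normalize_snv(smoothed)
```

Because each step is just a function of an array, replacing one step means changing one line, and every intermediate (corrected, smoothed) is an ordinary ndarray.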

[Figure: API design comparison, object-oriented (typical) vs. functional (SpectraKit)]

What It Covers

The library handles the full preprocessing pipeline that every spectroscopist needs:

[Listing: spectrakit module overview]

Every baseline correction method returns convergence diagnostics — not just the corrected spectrum, but also the number of iterations and the final residual. You don’t have to trust that ALS converged. You can check.
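
To make the idea of convergence diagnostics concrete, here is a minimal sketch of asymmetric least squares (ALS) baseline estimation that reports how the iteration ended. It is not SpectraKit's implementation: the name als_baseline_sketch, the BaselineResult fields, the default parameters, and the choice of "residual" (relative change in the weight vector between iterations) are all assumptions made for illustration.

```python
from typing import NamedTuple

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve


class BaselineResult(NamedTuple):
    corrected: np.ndarray
    baseline: np.ndarray
    iterations: int
    residual: float   # relative change of the weights in the last iteration
    converged: bool


def als_baseline_sketch(y: np.ndarray, lam: float = 1e5, p: float = 0.01,
                        max_iter: int = 50, tol: float = 1e-6) -> BaselineResult:
    """Asymmetric least squares baseline for a 1D spectrum (illustrative only)."""
    n = y.shape[0]
    # Second-order difference operator; lam * D^T D penalizes curvature.
    D = sparse.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(n - 2, n), format="csc")
    penalty = lam * (D.T @ D)
    w = np.ones(n)
    residual = np.inf
    for iteration in range(1, max_iter + 1):
        W = sparse.diags(w, 0, format="csc")
        z = spsolve((W + penalty).tocsc(), w * y)
        # Points above the baseline (peaks) get the small weight p.
        w_new = np.where(y > z, p, 1.0 - p)
        residual = float(np.linalg.norm(w_new - w) / np.linalg.norm(w))
        w = w_new
        if residual < tol:
            return BaselineResult(y - z, z, iteration, residual, True)
    return BaselineResult(y - z, z, max_iter, residual, False)


x = np.linspace(0.0, 1.0, 500)
y = np.exp(-((x - 0.5) / 0.02) ** 2) + 0.5 + 0.3 * x
result = als_baseline_sketch(y)
print(result.iterations, result.residual, result.converged)
```

Returning the diagnostics alongside the corrected spectrum lets the caller decide what to do when converged is False, for example raising max_iter or flagging the spectrum, instead of silently using a baseline from an unfinished iteration.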

The Dependency Decision

SpectraKit has two core dependencies: numpy and scipy. That’s it. Everything else — matplotlib for plotting, h5py for HDF5 I/O, scikit-learn for pipeline integration — is optional. You install what you need.

This was a deliberate constraint. Spectroscopy code runs in environments ranging from Jupyter notebooks to embedded systems to production pipelines. A library that drags in tensorflow or torch as a dependency is unusable in half these contexts. NumPy and SciPy are the common denominator.

[Listing: dependency_tree.sh]
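
One common way to keep dependencies like these optional is to import them lazily, only inside the features that need them. The sketch below shows that general pattern; require_optional is a hypothetical helper, and nothing here is taken from SpectraKit's source.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, pip_name: str) -> ModuleType:
    """Import an optional dependency, or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is needed for this feature but is not installed. "
            f"Install it with: pip install {pip_name}"
        ) from exc


def plot_spectrum(wavenumbers, intensities):
    """Hypothetical plotting helper: matplotlib is imported only when called."""
    plt = require_optional("matplotlib.pyplot", "matplotlib")
    fig, ax = plt.subplots()
    ax.plot(wavenumbers, intensities)
    return fig
```

With this layout, importing the package costs only the core dependencies, and a missing optional package surfaces as a clear error at the point of use rather than at import time.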

The I/O Problem

Spectral file formats are a mess. JCAMP-DX has six variants. SPC files encode data differently depending on whether the vendor is Thermo, PerkinElmer, or Shimadzu. Bruker OPUS is a binary format with no official spec — you need to reverse-engineer the byte layout.

SpectraKit’s I/O module handles all of these with a single consistent interface. read_jcamp, read_spc, read_opus — each returns a named tuple with wavenumbers, intensities, and metadata. The format detection is automatic: pass a file path and the library figures out the rest.

[Listing: io_demo.py]
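
The shape of that interface, readers that all return the same named tuple plus a dispatcher that picks a reader from the path, can be sketched as follows. Everything in the block is hypothetical: SpectrumData, read_xy_text, READERS, and read_spectrum are illustrative names, the two-column text reader is a toy stand-in for real format parsers, and dispatching on the file extension is an assumption (a real detector may well inspect the file contents instead).

```python
from pathlib import Path
from typing import Any, Callable, Dict, NamedTuple, Union

import numpy as np


class SpectrumData(NamedTuple):
    wavenumbers: np.ndarray
    intensities: np.ndarray
    metadata: Dict[str, Any]


def read_xy_text(path: Path) -> SpectrumData:
    """Toy reader for whitespace-separated two-column text (not a vendor format)."""
    data = np.loadtxt(path, ndmin=2)
    return SpectrumData(data[:, 0], data[:, 1],
                        {"source": path.name, "format": "xy"})


# Registry mapping lowercase file extensions to reader functions.
READERS: Dict[str, Callable[[Path], SpectrumData]] = {
    ".txt": read_xy_text,
    ".xy": read_xy_text,
}


def read_spectrum(path: Union[str, Path]) -> SpectrumData:
    """Pick a reader from the file extension and return a SpectrumData tuple."""
    path = Path(path)
    try:
        reader = READERS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {path.suffix!r}") from None
    return reader(path)
```

Because every reader returns the same tuple, downstream code can unpack wavenumbers, intensities, and metadata without caring which format the file came from.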

The Bruker OPUS parser deserves special mention. Most Python libraries that claim OPUS support wrap the Bruker SDK or shell out to a command-line converter. SpectraKit reads the binary format directly — no external dependencies, no SDK license, no subprocess calls. It handles single-channel, interferogram, and ratioed spectra from any Bruker instrument manufactured after 2000.

Pipelines and sklearn

Functional composition is natural — you chain function calls. But for production use, you often want a reusable pipeline object that can be serialized, logged, and dropped into a sklearn workflow.

SpectraKit’s Pipeline class wraps the functional API into a declarative chain:

[Listing: pipeline_demo.py]

SpectralTransformer wraps any SpectraKit pipeline into a sklearn-compatible transformer. It implements fit, transform, and fit_transform. This means you can use SpectraKit preprocessing inside GridSearchCV, cross_val_score, or any sklearn meta-estimator without writing adapter code.
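
For readers curious what such an adapter involves, here is a minimal NumPy-only sketch of the estimator protocol: a chain of array functions exposed through fit, transform, and fit_transform. FunctionChainTransformer and the two step functions are hypothetical names, and this is not SpectraKit's SpectralTransformer. In real code you would typically inherit from sklearn.base.BaseEstimator and TransformerMixin, which supply get_params, set_params, and fit_transform for you.

```python
from typing import Callable, Sequence

import numpy as np

Step = Callable[[np.ndarray], np.ndarray]


class FunctionChainTransformer:
    """Hypothetical adapter: a chain of array functions as a stateless transformer."""

    def __init__(self, steps: Sequence[Step] = ()):
        self.steps = steps

    def get_params(self, deep: bool = True) -> dict:
        # sklearn clones estimators by calling get_params and re-constructing.
        return {"steps": self.steps}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

    def fit(self, X, y=None):
        # Stateless preprocessing: nothing is learned from the training data.
        return self

    def transform(self, X):
        out = np.asarray(X, dtype=float)
        for step in self.steps:
            out = step(out)
        return out

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


def subtract_minimum(X: np.ndarray) -> np.ndarray:
    return X - X.min(axis=-1, keepdims=True)


def scale_to_unit_max(X: np.ndarray) -> np.ndarray:
    return X / X.max(axis=-1, keepdims=True)


spectra = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 8.0]])
transformer = FunctionChainTransformer([subtract_minimum, scale_to_unit_max])
processed = transformer.fit_transform(spectra)
```

Since fit learns nothing, the preprocessing cannot leak information between cross-validation folds, which is part of why stateless steps slot so easily into meta-estimators.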

Testing

699 tests. Zero mypy strict-mode errors. Zero ruff violations. Every public function has tests for:

  • Correctness — Output matches reference implementations (SciPy, MATLAB, published papers)
  • Shape preservation — 1D input produces 1D output, 2D batch input produces 2D output
  • Edge cases — Empty arrays, single-point spectra, constant signals, NaN handling
  • Numerical stability — Large dynamic ranges, near-zero denominators, ill-conditioned matrices
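
The four categories above translate into small, focused test functions. The sketch below applies three of them to a stand-in SNV normalization; the normalize_snv body, its eps guard, and the test names are assumptions for illustration, not code from SpectraKit's test suite.

```python
import numpy as np


def normalize_snv(spectra: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Stand-in SNV with a guarded denominator (not SpectraKit's code)."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=-1, keepdims=True)
    std = spectra.std(axis=-1, keepdims=True)
    return (spectra - mean) / np.maximum(std, eps)


def test_correctness_against_definition():
    x = np.array([1.0, 2.0, 3.0, 4.0])
    expected = (x - 2.5) / np.sqrt(1.25)   # population std of [1, 2, 3, 4]
    np.testing.assert_allclose(normalize_snv(x), expected)


def test_shape_preservation():
    rng = np.random.default_rng(0)
    assert normalize_snv(rng.normal(size=128)).shape == (128,)
    assert normalize_snv(rng.normal(size=(5, 128))).shape == (5, 128)


def test_constant_signal_stays_finite():
    # Zero variance would divide by zero without the eps guard.
    out = normalize_snv(np.full(64, 3.0))
    assert np.all(np.isfinite(out))
    np.testing.assert_allclose(out, 0.0)
```

Functions written this way are collected by pytest automatically when the file name starts with test_.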

[Listing: test_results.sh]

Testing Philosophy

The baseline tests are the most important. ALS and ArPLS are iterative algorithms — they can silently fail to converge, producing baselines that look reasonable but introduce systematic error downstream. Every baseline function in SpectraKit returns convergence metadata (iterations, residual norm), and the tests verify convergence on real-world spectral shapes, not just synthetic Gaussians.

What’s Next

SpectraKit is stable, tested, and published. The next step is using it as the preprocessing foundation for Spekron, the spectral foundation model. Every spectrum that enters the Spekron training pipeline goes through SpectraKit preprocessing first. The functional API makes this trivial: the data loader calls baseline_als, normalize_snv, and a resampling step in sequence, each operating on raw NumPy arrays that PyTorch can consume directly.

  • Project page: SpectraKit — overview, features, and installation
  • Foundation model: Spekron — uses SpectraKit as its preprocessing backbone
  • Research: Hybrid SSA Spectroscopy — the paper behind the Spekron training pipeline

[Listing: spectrakit summary]