Methodology March 25, 2026

Our Benchmark Protocol: What We Measure and Why

The reasoning behind our validation approach and what it tells you about prediction reliability.

We publish benchmarks for every property FluxMateria predicts. This article explains the reasoning behind our validation protocol: why we chose these validation sets, why we report these metrics, and what the results actually tell you.

Principle 1: Test against everything, not a curated subset

It is tempting to benchmark against a hand-picked set of "representative" compounds where your tool performs well. This is misleading. We benchmark against the broadest available experimental datasets for each property:

ADMET: 175,000+ compounds across multiple public datasets (TDC PPBR, Clearance, DILI), plus a curated set of 246 FDA-approved drugs with verified experimental values
Bond lengths: 450+ bonds across 60+ elements, covering metals, semiconductors, and ionic crystals
Band gaps: 1,000+ materials spanning metals, semiconductors, perovskites, TMDs, and oxides
Bulk modulus: 195 materials across 20 structural categories (FCC, BCC, HCP, diamond, wurtzite, rocksalt, perovskite, and more)
Mechanism classification: 336 experimental cases covering SN1, SN2, E1, and E2 pathways

Broad validation sets expose systematic errors that narrow sets hide. If our engine struggles with BCC metals or IV-VI semiconductors, we want to know — and we want you to know.

Principle 2: Report multiple metrics, not one headline number

For every benchmark, we report at minimum:

MAE / MAPE

Mean absolute error (absolute units) and mean absolute percentage error (scale-normalized). Together these tell you both the typical magnitude of error and how it relates to the property values.

Pass rate

Percentage of predictions within 10% of experiment, and separately within 20%. This tells you how many results are usefully accurate, not just the average.

Bias

Mean signed error. Tells you whether predictions systematically over- or under-estimate. The magnitude of the bias can never exceed the MAE, so a tool whose bias is nearly as large as its MAE is consistently wrong in one direction.

Category breakdowns

Results split by material type, crystal structure, or chemical class. This reveals systematic strengths and weaknesses that aggregate metrics hide.

A single MAPE number is not enough. You need to know the error distribution, the bias, and where the worst predictions occur.
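
The four metrics above can be sketched in a few lines of Python. This is a minimal illustration, not FluxMateria's actual benchmark code: the function names summarize and by_category are hypothetical, the sample values are made up, and percentage error is assumed to be taken relative to a nonzero experimental value.

```python
from collections import defaultdict


def summarize(pairs):
    """Compute MAE, MAPE, bias, and pass rates for (predicted, experimental) pairs.

    Assumes every experimental value is nonzero.
    """
    n = len(pairs)
    signed = [pred - exp for pred, exp in pairs]
    rel = [abs(pred - exp) / abs(exp) for pred, exp in pairs]
    return {
        "n": n,
        "mae": sum(abs(s) for s in signed) / n,           # absolute units
        "mape_pct": 100 * sum(rel) / n,                   # scale-normalized
        "bias": sum(signed) / n,                          # mean signed error
        "pass_10_pct": 100 * sum(1 for r in rel if r <= 0.10) / n,
        "pass_20_pct": 100 * sum(1 for r in rel if r <= 0.20) / n,
    }


def by_category(records):
    """Group (category, predicted, experimental) records and summarize each group."""
    groups = defaultdict(list)
    for category, pred, exp in records:
        groups[category].append((pred, exp))
    return {cat: summarize(pairs) for cat, pairs in groups.items()}


# Made-up illustrative values, not published benchmark data.
records = [
    ("FCC", 105.0, 100.0),
    ("FCC", 190.0, 200.0),
    ("BCC", 130.0, 100.0),
    ("BCC", 175.0, 150.0),
]

overall = summarize([(pred, exp) for _, pred, exp in records])
per_cat = by_category(records)

print(overall)
for cat, stats in per_cat.items():
    print(cat, stats)
```

In this toy data the aggregate MAE is 17.5, but the breakdown shows that the BCC group carries most of it, and that its bias equals its MAE, meaning every BCC prediction errs in the same direction.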

Principle 3: Show the worst cases

Every tool has outliers. We publish ours. For each benchmark, we identify the materials or compounds with the largest prediction errors and report them explicitly.

Why? Because outliers tell you where the physics is incomplete. A tool that is within 3% of experiment on average but 50% off on a specific material class has a physics gap in that class. If your work involves that class, the aggregate accuracy is irrelevant. The outlier information is what matters.

We also explain why outliers occur when we can identify the physics gap. This is part of our commitment to honest reporting: not hiding limitations but understanding and documenting them.
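
The arithmetic behind this point is easy to reproduce. The sketch below uses invented records (the names, classes, and values are hypothetical, chosen only so the average comes out at 3%) and a hypothetical helper, worst_cases, that ranks predictions by absolute percentage error.

```python
def ape(pred, exp):
    """Absolute percentage error relative to a nonzero experimental value."""
    return 100 * abs(pred - exp) / abs(exp)


def worst_cases(records, k=3):
    """Return the k (error_pct, name, category) rows with the largest error."""
    scored = [(ape(pred, exp), name, cat) for name, cat, pred, exp in records]
    return sorted(scored, key=lambda row: row[0], reverse=True)[:k]


# Invented data: 47 compounds that are 2% off, plus one in another class that is 50% off.
records = [(f"x{i}", "class_x", 102.0, 100.0) for i in range(47)]
records.append(("y0", "class_y", 150.0, 100.0))

mean_ape = sum(ape(pred, exp) for _, _, pred, exp in records) / len(records)

print(f"mean absolute percentage error: {mean_ape:.1f}%")
print("worst case:", worst_cases(records, k=1))
```

The mean here is 3%, yet the single class_y entry is 50% off. Only the ranked list makes that visible.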

Principle 4: Zero fitted parameters

FluxMateria's physics kernel has no training data and no fitted parameters. This means our benchmark results cannot be inflated by tuning internal parameters to match the validation set. The numbers you see are pure physics predictions compared to experiment.

This is a double-edged sword. Without calibration, we cannot optimize our performance on a specific benchmark, so our reported errors for some properties are higher than they would be with tuning. But they are honest, and they represent what the tool will actually deliver on your data, not just on our test set.

On balance, we consider this a fundamental advantage. We would rather report 5% MAPE from pure physics than 2% MAPE from hidden calibration that may not generalize to your compounds.

Principle 5: Make it reproducible

Our engine is deterministic. Same input, same version, same output. This means:

You can reproduce any benchmark result we publish by running the same inputs through the same engine version
There is no stochastic variation between runs — no random seeds, no model sampling
Benchmark comparisons between engine versions are meaningful because both sides are deterministic

We publish the engine version, the input sets, and the experimental reference data for each benchmark. If you want to verify our numbers, you can.
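
One way to check this kind of determinism yourself is to fingerprint a run and compare fingerprints across repeats. The sketch below is an assumption-laden illustration: toy_predict is a stand-in for a deterministic engine call, not FluxMateria's API, and the version strings and inputs are placeholders.

```python
import hashlib
import json


def fingerprint(engine_version, inputs, outputs):
    """Hash the version, inputs, and outputs of a run into one hex digest."""
    payload = json.dumps(
        {"engine_version": engine_version, "inputs": inputs, "outputs": outputs},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def toy_predict(x):
    # Hypothetical stand-in for a deterministic prediction; not a real engine call.
    return round(2.0 * x + 1.0, 6)


def run_benchmark(engine_version, inputs, predict):
    """Run predict over all inputs and return the fingerprint of the run."""
    outputs = [predict(x) for x in inputs]
    return fingerprint(engine_version, inputs, outputs)


inputs = [0.5, 1.25, 3.0]  # placeholder inputs
first = run_benchmark("1.0.0", inputs, toy_predict)
second = run_benchmark("1.0.0", inputs, toy_predict)
other_version = run_benchmark("1.1.0", inputs, toy_predict)

print("repeat runs match:", first == second)
print("versions are distinguished:", first != other_version)
```

Because the version string is part of the hashed payload, two runs only match when the inputs, the outputs, and the engine version all agree.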

What our benchmarks do not tell you

Benchmarks measure accuracy on specific test sets. They do not guarantee that the tool will perform equally well on your specific use case. The best way to evaluate any computational tool — including ours — is to test it on your own data.

Our benchmarks tell you: "here is how the engine performs on broad, diverse, publicly available datasets." They do not tell you: "here is exactly how it will perform on your proprietary compound series." That requires a pilot evaluation.

We encourage this. If our tool does not work well for your specific chemistry, we would rather you discover that in a pilot than after a purchase decision.

Full benchmark results with methodology, validation sets, and category breakdowns: See all benchmarks.

Check the numbers yourself

All benchmarks are published with validation sets, error breakdowns, and worst-case analysis.
