Why Benchmark Methodology Matters More Than Benchmark Numbers
A guide to evaluating computational screening tools without being misled by headline metrics.
Every computational screening vendor publishes impressive numbers. High accuracy, low error, broad coverage. But two tools can report the same metric and mean entirely different things by it. The difference is methodology — and methodology is where most evaluation processes go wrong.
This article is a guide for R&D teams evaluating screening tools. It is written from the perspective of a vendor that publishes its own methodology, but the framework applies to evaluating any tool — including ours.
When a vendor reports a benchmark number, ask these questions before taking it at face value:
The single most important question is what the tool was validated against. A model that reports 95% accuracy on 50 hand-picked molecules tells you very little. A model that reports 80% accuracy across 175,000 structurally diverse compounds tells you a lot.
Look for:
For ML/AI tools, keeping training and validation data separate is critical. If the validation set overlaps with the training set, even partially, the reported accuracy is inflated. This is not always intentional; it can happen through dataset curation shortcuts, overly permissive fingerprint similarity thresholds, or scaffold-level overlap.
Physics-based tools (DFT and physics kernels) do not have training data, so data leakage does not apply. But they have a different risk: parameter fitting on the validation set itself. If a tool's internal parameters were tuned to improve performance on the published benchmark, the benchmark is not an independent test.
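For the ML case, a rough overlap check can be sketched in a few lines. This is a minimal illustration, not any vendor's actual procedure: it assumes each compound is already reduced to a set of "on" fingerprint bits, and the names `tanimoto`, `find_leaks`, the 0.9 threshold, and the toy fingerprints are all hypothetical.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_leaks(train: dict, valid: dict, threshold: float = 0.9) -> dict:
    """Map each validation ID to its nearest training neighbour
    when that neighbour's similarity is at or above the threshold."""
    leaks = {}
    for vid, vfp in valid.items():
        best_id, best_sim = None, 0.0
        for tid, tfp in train.items():
            sim = tanimoto(vfp, tfp)
            if sim > best_sim:
                best_id, best_sim = tid, sim
        if best_sim >= threshold:
            leaks[vid] = (best_id, best_sim)
    return leaks


# Toy fingerprints (made up for illustration only).
train = {"t1": frozenset({1, 2, 3, 4}), "t2": frozenset({10, 11, 12})}
valid = {
    "v1": frozenset({1, 2, 3, 4}),   # identical to t1: a clear leak
    "v2": frozenset({1, 2, 3, 5}),   # similar to t1, but below the threshold
    "v3": frozenset({20, 21}),       # unrelated to anything in training
}

leaks = find_leaks(train, valid, threshold=0.9)
print(leaks)
```

The threshold itself is a judgment call. A vendor that states which threshold it used, and how many validation compounds were excluded because of it, is giving you something you can check.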
Not all accuracy metrics are equivalent:
| Metric | What it tells you | What it hides |
|---|---|---|
| MAE (Mean Absolute Error) | Average deviation from experiment | Catastrophic outliers |
| MAPE (Mean Absolute % Error) | Scale-normalized deviation | Performance on small-value targets |
| R² / Pearson r | Correlation strength | Systematic bias (over/under-prediction) |
| AUC-ROC | Binary classification ranking | Class imbalance, calibration |
| Pass rate (% within threshold) | Fraction of predictions meeting a tolerance | Severity of failures outside threshold |
A tool with 4% MAPE and 90% pass rate at ≤10% error is saying something specific and verifiable. A tool that reports only “high accuracy” is saying nothing.
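To make the table concrete, the sketch below computes several of these metrics on the same set of predictions. The function name `summarize` and the four data points are invented for illustration; the formulas are the standard definitions, and the relative-error terms assume no reference value is zero.

```python
import math


def summarize(pred, ref, tolerances=(0.10, 0.20)):
    """Compute MAE, MAPE, bias, Pearson r, and pass rates for paired values."""
    n = len(ref)
    errors = [p - r for p, r in zip(pred, ref)]
    rel = [abs(e) / abs(r) for e, r in zip(errors, ref)]  # needs nonzero ref

    mean_p, mean_r = sum(pred) / n, sum(ref) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    sd_r = math.sqrt(sum((r - mean_r) ** 2 for r in ref))

    return {
        "mae": sum(abs(e) for e in errors) / n,
        "mape": 100 * sum(rel) / n,
        "bias": sum(errors) / n,          # positive means over-prediction
        "pearson_r": cov / (sd_p * sd_r),
        "pass_rate": {t: sum(x <= t for x in rel) / n for t in tolerances},
        "worst_rel_error": max(rel),
    }


# Made-up values: three good predictions and one 50% miss.
ref = [100, 200, 50, 80]
pred = [105, 190, 75, 80]

stats = summarize(pred, ref)
for name, value in stats.items():
    print(name, value)
```

On this toy data the MAPE is 15% and the pass rate at ≤10% is 75%, yet one prediction is 50% off. No single number reveals that; the combination does.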
Aggregate metrics can hide systematic failures. A tool might report 5% overall MAPE for bulk modulus prediction while being 40% off on BCC metals and 2% on FCC metals. If your work involves BCC metals, the aggregate number is misleading.
Good benchmark reporting breaks results down by material category, chemical class, property range, or structure type. This lets you assess relevance to your specific use case, not just overall performance.
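The BCC/FCC scenario above is easy to reproduce with synthetic numbers. In this sketch the record layout `(category, predicted, reference)` and the function `mape_by_category` are assumptions, and the data is constructed purely so that the aggregate comes out at 5%.

```python
from collections import defaultdict


def mape_by_category(records):
    """Return MAPE (in percent) per category for (category, pred, ref) records."""
    groups = defaultdict(list)
    for category, pred, ref in records:
        groups[category].append(abs(pred - ref) / abs(ref))
    return {c: 100 * sum(v) / len(v) for c, v in groups.items()}


# Synthetic data: 35 FCC predictions that are 2% off, 3 BCC predictions 40% off.
records = [("FCC", 102.0, 100.0)] * 35 + [("BCC", 140.0, 100.0)] * 3

overall = 100 * sum(abs(p - r) / abs(r) for _, p, r in records) / len(records)
per_category = mape_by_category(records)

print(f"overall MAPE: {overall:.1f}%")
for category, value in per_category.items():
    print(f"{category}: {value:.1f}%")
```

The aggregate is dominated by whichever category has the most entries, so a breakdown is only informative if the vendor also reports how many samples sit in each category.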
Every tool has outliers. The question is whether the vendor shows them to you.
If a benchmark page shows only averages and pass rates, ask: what are the 10 worst predictions? Are they off by 20% or by 300%? Are they clustered in a specific chemical class (which would indicate a systematic physics gap) or randomly distributed (which would indicate noise)?
A tool that publishes its worst cases is a tool that trusts its own methodology. A tool that hides them is a tool that does not.
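If the vendor publishes per-compound results, the outlier questions above can be answered directly. The following is a minimal sketch under assumed inputs: the record layout `(name, chemical_class, predicted, reference)`, the helper names, and the five sample rows are all hypothetical.

```python
from collections import Counter


def worst_cases(records, k=10):
    """Return the k records with the largest percent error, worst first."""
    scored = [
        (100 * abs(pred - ref) / abs(ref), name, cls)
        for name, cls, pred, ref in records
    ]
    scored.sort(key=lambda row: row[0], reverse=True)
    return scored[:k]


def class_counts(worst):
    """Count how many of the worst cases fall in each chemical class."""
    return Counter(cls for _, _, cls in worst)


# Made-up results for illustration.
records = [
    ("A", "oxide", 150.0, 100.0),
    ("B", "oxide", 130.0, 100.0),
    ("C", "metal", 101.0, 100.0),
    ("D", "metal", 99.0, 100.0),
    ("E", "oxide", 170.0, 100.0),
]

worst = worst_cases(records, k=3)
counts = class_counts(worst)

for pct, name, cls in worst:
    print(f"{name} ({cls}): {pct:.0f}% off")
print(counts)
```

Here all three worst cases land in one class, which is the pattern that would suggest a systematic gap rather than noise. On a real benchmark, compare these counts against each class's share of the full validation set before drawing that conclusion.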
This is the most important practical test. Take the published validation set. Run it yourself. Compare your results to the published numbers.
For DFT, reproducibility depends on matching the functional, basis set, and convergence criteria. For ML, it depends on model version and inference settings. For physics kernels, it should be exact: same input, same version, same output.
If a vendor will not let you reproduce their benchmark, treat the benchmark as marketing rather than evidence.
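A reproduction check can be as simple as comparing two tables of numbers. The sketch below assumes you have the published values and your own rerun as dictionaries keyed by material ID; `compare_runs`, the IDs, and the values are invented for illustration. A relative tolerance of zero corresponds to the "exact" expectation for a deterministic engine, while a small nonzero tolerance may be more appropriate for DFT or ML reruns.

```python
import math


def compare_runs(published, reproduced, rel_tol=0.0):
    """Return entries that are missing from the rerun or differ beyond rel_tol.

    Each mismatch maps an ID to (published_value, reproduced_value_or_None).
    """
    mismatches = {}
    for key, pub in published.items():
        mine = reproduced.get(key)
        if mine is None:
            mismatches[key] = (pub, None)
        elif not math.isclose(pub, mine, rel_tol=rel_tol, abs_tol=0.0):
            mismatches[key] = (pub, mine)
    return mismatches


# Hypothetical values: one exact match, one small drift, one missing entry.
published = {"mat_a": 140.0, "mat_b": 170.0, "mat_c": 310.0}
reproduced = {"mat_a": 140.0, "mat_b": 170.5}

exact = compare_runs(published, reproduced)                 # deterministic case
loose = compare_runs(published, reproduced, rel_tol=0.01)   # allow 1% drift

print("exact:", exact)
print("within 1%:", loose)
```

Whatever tolerance you choose, record it alongside the tool version and settings so the comparison itself can be repeated later.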
A prediction without a confidence indicator is an assertion. It may be accurate or it may be wildly wrong, and you have no way to distinguish the two without experimental verification.
Good confidence signals are:
We apply these same principles to our own work. Here is what we publish for every property we benchmark:
Full compound/material lists with experimental reference values. Publicly inspectable.
MAE, MAPE, pass rates at ≤10% and ≤20%, bias, correlation. Not just one headline number.
Results split by material type, crystal structure, chemical class. You see where we are strong and where we are weak.
We show outliers. If a prediction is 50% off, we publish it and explain the physics gap.
Zero fitted parameters. Our results come from physics, not from tuning to the validation set.
Deterministic engine. Same input, same version, same output. Run it yourself.
We do this not because it makes our numbers look better — sometimes it makes them look worse — but because we believe the only honest stance in computational chemistry is: here is what we got, check it yourself.
When evaluating computational screening tools:
A tool that is honest about where it fails is a tool you can trust where it succeeds.
FluxMateria publishes full benchmark methodology, test conditions, and per-category error rates for every property. See our benchmarks.