Our Benchmark Protocol: What We Measure and Why
The reasoning behind our validation approach and what it tells you about prediction reliability.
We publish benchmarks for every property FluxMateria predicts. This article explains the reasoning behind our validation protocol: why we chose these validation sets, why we report these metrics, and what the results actually tell you.
It is tempting to benchmark against a hand-picked set of "representative" compounds where your tool performs well. This is misleading. We benchmark against the broadest available experimental datasets for each property.
Broad validation sets expose systematic errors that narrow sets hide. If our engine struggles with BCC metals or IV-VI semiconductors, we want to know — and we want you to know.
For every benchmark, we report at minimum:
Mean absolute error (absolute units) and mean absolute percentage error (scale-normalized). Together these tell you both the typical magnitude of error and how it relates to the property values.
Percentage of predictions within 10% and within 20% of experiment. This tells you how many results are usefully accurate, not just the average.
Mean signed error. This tells you whether predictions systematically over- or under-estimate. A tool whose mean signed error is close in magnitude to its MAE is consistently wrong in one direction.
Results split by material type, crystal structure, chemical class. This reveals systematic strengths and weaknesses that aggregate metrics hide.
A single MAPE number is not enough. You need to know the error distribution, the bias, and where the worst predictions occur.
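As a rough illustration of how the metrics above fit together, here is a minimal sketch in plain Python. The function names, the record layout, and the sample values are assumptions made up for this example; they are not FluxMateria's code or benchmark data.

```python
from collections import defaultdict


def summarize(records):
    """Compute summary metrics for (category, predicted, experimental) tuples.

    Assumes every experimental value is non-zero, since percentage
    errors are undefined otherwise.
    """
    n = len(records)
    signed = [pred - exp for _, pred, exp in records]
    relative = [abs(pred - exp) / abs(exp) for _, pred, exp in records]
    return {
        "mae": sum(abs(s) for s in signed) / n,   # absolute units
        "mape": 100.0 * sum(relative) / n,        # percent
        "bias": sum(signed) / n,                  # mean signed error
        "within_10": 100.0 * sum(r <= 0.10 for r in relative) / n,
        "within_20": 100.0 * sum(r <= 0.20 for r in relative) / n,
    }


def summarize_by_category(records):
    """Apply the same metrics separately to each category."""
    groups = defaultdict(list)
    for record in records:
        groups[record[0]].append(record)
    return {category: summarize(group) for category, group in groups.items()}


# Made-up illustrative values, not real benchmark results.
records = [
    ("FCC metal", 102.0, 100.0),
    ("FCC metal", 196.0, 200.0),
    ("BCC metal", 130.0, 100.0),
    ("BCC metal", 250.0, 200.0),
]

overall = summarize(records)
by_category = summarize_by_category(records)
```

In this toy data the overall MAPE is 14.75%, but the split shows 2% for one category and 27.5% for the other, with the second category's bias equal to its MAE. That is the kind of pattern an aggregate number alone would hide.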
Every tool has outliers. We publish ours. For each benchmark, we identify the materials or compounds with the largest prediction errors and report them explicitly.
Why? Because outliers tell you where the physics is incomplete. A tool with 3% average error that is 50% off on a specific material class has a physics gap in that class. If your work involves that class, the aggregate accuracy is irrelevant. The outlier information is what matters.
We also explain why outliers occur when we can identify the physics gap. This is part of our commitment to honest reporting: not hiding limitations but understanding and documenting them.
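One simple way to surface worst cases is to rank predictions by absolute percentage error and keep the top few. The sketch below assumes a hypothetical record layout of (name, material class, predicted, experimental) and uses invented values; it is not how FluxMateria selects the outliers it reports.

```python
def worst_predictions(records, k=3):
    """Return the k records with the largest absolute percentage error.

    Each record is (name, material_class, predicted, experimental),
    with a non-zero experimental value.
    """
    scored = [
        (100.0 * abs(pred - exp) / abs(exp), name, material_class)
        for name, material_class, pred, exp in records
    ]
    # Largest error first.
    return sorted(scored, key=lambda item: item[0], reverse=True)[:k]


# Made-up illustrative values, not real benchmark results.
records = [
    ("compound-1", "class A", 101.0, 100.0),
    ("compound-2", "class A", 98.0, 100.0),
    ("compound-3", "class A", 103.0, 100.0),
    ("compound-4", "class B", 150.0, 100.0),
    ("compound-5", "class B", 60.0, 100.0),
]

worst = worst_predictions(records, k=2)
worst_classes = {material_class for _, _, material_class in worst}
```

When the worst cases all fall in one class, as they do in this toy data, that points to a gap specific to the class and not to random scatter.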
FluxMateria's physics kernel has no training data and no fitted parameters. This means our benchmark results cannot be inflated by tuning internal parameters to match the validation set. The numbers you see are pure physics predictions compared to experiment.
This is a double-edged sword. Without calibration, we cannot optimize our performance on a specific benchmark. Our error figures for some properties are higher than they could be with tuning. But they are honest, and they represent what the tool will actually deliver on your data, not just on our test set.
We consider this a fundamental advantage. Better to have 5% MAPE with pure physics than 2% MAPE with hidden calibration that may not generalize to your compounds.
Our engine is deterministic: the same input on the same version produces the same output.
We publish the engine version, the input sets, and the experimental reference data for each benchmark. If you want to verify our numbers, you can.
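A reproducibility check for a deterministic tool can be as simple as fingerprinting the full set of outputs and comparing fingerprints across runs. The sketch below is one way to do that. Here `toy_predict` is a stand-in function invented for this example, since the engine's actual interface is not described in this article, and the input names are hypothetical.

```python
import hashlib
import json


def fingerprint(results):
    """Return a SHA-256 digest of an {input_id: predicted_value} mapping.

    Keys are sorted so the digest does not depend on insertion order.
    """
    canonical = json.dumps(results, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_benchmark(predict, inputs):
    """Apply a prediction function to every input and collect the outputs."""
    return {input_id: predict(value) for input_id, value in inputs.items()}


def toy_predict(x):
    # Stand-in for a deterministic engine call (assumption, not a real API).
    return round(2.5 * x + 1.0, 6)


inputs = {"sample-a": 1.0, "sample-b": 2.0}

first = fingerprint(run_benchmark(toy_predict, inputs))
second = fingerprint(run_benchmark(toy_predict, inputs))
reproducible = first == second
```

For a deterministic function, the two digests match on every run. A mismatch would indicate that the input set, the version, or the environment differs.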
Benchmarks measure accuracy on specific test sets. They do not guarantee that the tool will perform equally well on your specific use case. The best way to evaluate any computational tool — including ours — is to test it on your own data.
Our benchmarks tell you: "here is how the engine performs on broad, diverse, publicly available datasets." They do not tell you: "here is exactly how it will perform on your proprietary compound series." That requires a pilot evaluation.
We encourage this. If our tool does not work well for your specific chemistry, we would rather you discover that in a pilot than after a purchase decision.
Full benchmark results with methodology, validation sets, and category breakdowns: See all benchmarks.