Validation Scope: What We've Learned About Prediction Reliability
An honest assessment of where FluxMateria works well, where it is still improving, and what that means for your evaluation.
Every computational tool has a scope within which it performs well and a frontier beyond which its predictions become less reliable. The question is whether the vendor is transparent about where that frontier is. This article is our attempt at that transparency.
We describe where FluxMateria's predictions are strongest, where they are still improving, and how to interpret these assessments in the context of your own evaluation.
Bond-length prediction is our most mature capability. Across 450+ bonds and 60+ elements, the engine achieves under 0.1% mean error. FCC and BCC metals are consistently below 1% error; III-V semiconductors are below 2%. This accuracy is competitive with DFT, and the engine runs significantly faster.
What this means for you: if your workflow involves structural property prediction for common crystal types, the engine's predictions are reliable for triage and shortlisting without further validation.
For organic reaction mechanism prediction (SN1, SN2, E1, E2), the engine achieves 100% classification accuracy across 336 experimental cases. Activation barrier predictions have a MAE of 7.4 kJ/mol (6.76% MAPE). This is competitive with mid-tier DFT methods at a fraction of the computational cost.
What this means for you: mechanism classification is reliable for teaching, route planning, and first-pass reaction analysis. Barrier predictions are useful for ranking but should be verified with DFT for quantitative kinetics work.
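The two barrier metrics quoted above, MAE and MAPE, can be reproduced on your own reference data with a few lines of code. The sketch below is a minimal illustration in plain Python; the function names and the barrier values are made up for the example and are not part of the FluxMateria interface.

```python
def mae(predicted, reference):
    """Mean absolute error, in the units of the inputs (here kJ/mol)."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


def mape(predicted, reference):
    """Mean absolute percentage error; reference values must be non-zero."""
    total = sum(abs(p - r) / abs(r) for p, r in zip(predicted, reference))
    return 100.0 * total / len(reference)


# Made-up activation barriers in kJ/mol, for illustration only.
reference_barriers = [100.0, 80.0, 120.0]
predicted_barriers = [105.0, 76.0, 126.0]

barrier_mae = mae(predicted_barriers, reference_barriers)
barrier_mape = mape(predicted_barriers, reference_barriers)
print(f"MAE = {barrier_mae:.2f} kJ/mol, MAPE = {barrier_mape:.2f}%")
```

MAE tells you the typical absolute miss in kJ/mol, which matters for kinetics; MAPE scales each miss by the size of the barrier, which is why the two numbers can tell slightly different stories on the same data.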
The ADMET module runs full panels (solubility, permeability, CYP inhibition, hERG, hepatotoxicity) at approximately 350 molecules per second, with a confidence indicator on every prediction. It has been validated across 175,000+ compounds. In a retrospective analysis of 34 withdrawn drugs, it detected 88.2% of safety failures with 0% false positives on controls.
What this means for you: the ADMET module is suitable for first-pass triage on large libraries. High-confidence predictions are reliable for shortlisting. Low-confidence predictions should be experimentally verified.
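When you repeat a retrospective analysis like the one above on your own compounds, the two headline numbers are a detection rate (sensitivity) and a false positive rate. The sketch below shows the arithmetic. The counts are assumptions: 30 of 34 is chosen because it reproduces 88.2% if the rate is computed per drug, and the number of controls is invented because the article does not state it.

```python
def detection_rate(true_positives, false_negatives):
    """Fraction of known failures that were flagged (sensitivity)."""
    return true_positives / (true_positives + false_negatives)


def false_positive_rate(false_positives, true_negatives):
    """Fraction of safe controls that were wrongly flagged."""
    return false_positives / (false_positives + true_negatives)


# Assumed counts: 30 flagged out of 34 matches the quoted 88.2%.
# The control count (50) is made up for illustration.
sensitivity = detection_rate(true_positives=30, false_negatives=4)
fpr = false_positive_rate(false_positives=0, true_negatives=50)
print(f"detection rate = {sensitivity:.1%}, false positive rate = {fpr:.1%}")
```

With a sample of a few dozen compounds, a single extra miss shifts the detection rate by about three percentage points, so it is worth reporting the raw counts alongside the percentages.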
Bulk modulus prediction across 195 materials achieves approximately 8% MAPE overall, with excellent performance on FCC metals (under 4%) and diamond semiconductors (under 3%). Performance is weaker on some complex structure types: fluorites, CsCl-type intermetallics, and rutile-structure oxides show higher errors (20–70% MAPE depending on category).
What this means for you: elastic property predictions are reliable for common metals and semiconductors. For complex oxides and intermetallics, treat predictions as indicative rather than quantitative, and verify with DFT or experiment for the top candidates.
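The gap between the overall figure and the per-category figures is easy to check on your own benchmark set. The sketch below groups predictions by structure type and computes MAPE for each group and for the whole set. The category labels and bulk modulus values are made up, and the record layout is a hypothetical one, not an export format of the engine.

```python
from collections import defaultdict


def mape(pairs):
    """Mean absolute percentage error over (predicted, reference) pairs."""
    return 100.0 * sum(abs(p - r) / abs(r) for p, r in pairs) / len(pairs)


# (category, predicted, reference) bulk moduli in GPa; values are made up.
records = [
    ("fcc_metal", 102.0, 100.0),
    ("fcc_metal", 147.0, 150.0),
    ("fcc_metal", 196.0, 200.0),
    ("fluorite", 60.0, 80.0),
]

by_category = defaultdict(list)
for category, predicted, reference in records:
    by_category[category].append((predicted, reference))

overall_mape = mape([(p, r) for _, p, r in records])
per_category_mape = {c: mape(pairs) for c, pairs in by_category.items()}

print(f"overall: {overall_mape:.2f}%")
for category, value in per_category_mape.items():
    print(f"{category}: {value:.2f}%")
```

In this toy set the overall MAPE looks acceptable because the well-behaved category dominates the count, while the single fluorite entry carries a much larger error. The same effect is why a category breakdown is more informative than one aggregate number.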
Debye temperature and sound velocity predictions have improved significantly but remain more challenging than structural or mechanical properties. Metals and simple semiconductors are well-captured; complex ionic compounds and layered materials show larger errors.
What this means for you: thermal predictions are useful for relative ranking within a structural family. Absolute values for complex structures should be validated experimentally.
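One way to test whether predictions are good enough for relative ranking, even when absolute values are off, is to compare a rank correlation with a percentage error on the same data. The sketch below uses Spearman's rank correlation in plain Python, with made-up Debye temperatures and a deliberately biased set of predictions. The simple formula used here assumes no tied values.

```python
def ranks(values):
    """Rank of each value (1 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation via the no-ties formula."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Made-up Debye temperatures in K; predictions are uniformly 15% too high.
reference = [300.0, 450.0, 380.0, 520.0]
predicted = [r * 1.15 for r in reference]

rho = spearman_rho(predicted, reference)
mean_pct_error = 100.0 * sum(
    abs(p - r) / r for p, r in zip(predicted, reference)
) / len(reference)
print(f"Spearman rho = {rho:.2f}, mean error = {mean_pct_error:.1f}%")
```

Here the ordering is reproduced perfectly although every absolute value is wrong by 15%, which is the situation in which predictions are useful for ranking within a family but not as absolute numbers.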
For the core set of III-V, II-VI, and elemental semiconductors (the 26 materials in our primary benchmark), band-gap predictions achieve under 1% error. For the broader set of 1,000+ materials including oxides, perovskites, and transition-metal compounds, the overall MAE is under 0.7 eV. Mott insulators and strongly correlated oxides remain challenging.
What this means for you: band-gap predictions for standard semiconductors are highly reliable. For complex oxides and correlated materials, use predictions for triage and ranking, but confirm with GW or hybrid-DFT calculations for quantitative work.
Materials with strong electron correlation (Mott insulators, heavy-fermion compounds, some transition-metal oxides) remain the most challenging class for the physics kernel. This is an active area of development. The confidence system correctly flags most of these predictions as low-confidence, which is the right behavior, but improving the predictions themselves is a priority.
Transition-metal dichalcogenides (TMDs), graphene derivatives, and other layered materials present challenges for interlayer interaction modeling. Predictions for in-plane properties are generally reliable; predictions for cross-plane properties (interlayer spacing, c-axis modulus) carry higher uncertainty.
As compositions become more complex (perovskites, spinels, garnets), the number of structural degrees of freedom increases and prediction accuracy tends to decrease. The engine handles common perovskites well but struggles with highly distorted or disordered variants.
This article is not a disclaimer. It is a guide to resource allocation: where predictions can be used directly for triage and shortlisting, and where verification effort is best spent.
In all cases, the confidence indicators on each prediction give you real-time guidance on which category a specific result falls into. A prediction in a "strongest" category with a low-confidence flag is telling you something important: even within a well-characterized property, this specific input is unusual.
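The combination of property category and per-prediction confidence can be turned into a simple routing rule. The sketch below is one possible reading of the guidance in this article, not a feature of the engine: the tier names, confidence labels, and action names are all hypothetical.

```python
def next_step(category_tier, confidence):
    """Suggest a follow-up action for a single prediction.

    category_tier: "strongest", "improving", or "challenging" (hypothetical labels)
    confidence:    "high" or "low" (hypothetical labels)
    """
    if confidence == "low":
        # A low-confidence flag overrides the category: the input is unusual.
        return "verify"
    if category_tier == "strongest":
        return "shortlist"
    # Remaining tiers: use for ranking, then verify the top candidates.
    return "rank_then_verify_top"


# Made-up predictions, for illustration only.
predictions = [
    {"id": "A", "tier": "strongest", "confidence": "high"},
    {"id": "B", "tier": "strongest", "confidence": "low"},
    {"id": "C", "tier": "improving", "confidence": "high"},
]

actions = {p["id"]: next_step(p["tier"], p["confidence"]) for p in predictions}
print(actions)
```

Checking confidence first is a deliberate choice: it makes a low-confidence result in a strong category go to verification rather than straight onto the shortlist.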
We will continue to publish honest assessments of where the engine works well and where it does not. When physics improvements expand the reliable scope, we will update our benchmarks and document the changes. When we identify persistent limitations, we will say so rather than hiding them behind aggregate statistics.
A tool that is honest about its limitations is a tool you can build a workflow around. A tool that only shows its best results is a tool that will surprise you at the worst possible time.
For detailed per-property benchmarks with category breakdowns, see all benchmarks. To test on your own data, request pilot access.
The best way to evaluate any tool is to run it on your chemistry. Request a pilot to see how FluxMateria performs on your specific use case.