Numbers you can check. Methods you can challenge.

FluxMateria benchmarks span chemistry, materials, ADMET, reaction mechanisms, spectroscopy, solvation, catalysis, battery, and synthesis-planning modules.

Each benchmark page states the dataset, metric, scope, comparator, and computational route. Internal benchmarks are the starting point; external blind validation is now open.

Selected benchmark highlights.

Every benchmark is paired with methodology, scope notes, and either a public export or reviewer-accessible evidence packet.

3
#1 SOTA ADMET
under stated metrics
DILI
SOTA mechanistic
risk module
0.3295
SOTA-level FreeSolv
kcal/mol MAE
Real-time
MechanismOS
reaction steering
336/336
Scoped mechanism
test cases
<1%
Materials error
16 properties validated

Our benchmark philosophy

📋

Publish the methodology

How we test, what datasets we use, how we measure. No hidden assumptions.

⚖️

Show head-to-head comparisons

Against established tools where possible. Fair comparisons, same test sets.

🎯

Document validation scope

Where predictions are most and least reliable. We tell you the boundaries.

🔄

Provide evidence context

Methodology, scope notes, and public artifacts or reviewer-accessible evidence packets where appropriate.

Key metrics at a glance

Summary of performance across core capabilities.

Bond Lengths

0.079% error
453 bonds, 64 elements

Single + multiple bonds across p, d, and s-block with row-level validation evidence.

Bond Energies

0.289% error
908 bonds, 64 elements

Singles, doubles, and triples. 870/906 within 1.0% in the public export.

No-Fit Reference

0.176% MAPE
1,483 validated experimental/reference points

Raw Flux formula outputs only: no training, no calibration, no DFT/computed-only targets.
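
The percentage figures in these tiles can be read as a mean absolute percentage error (MAPE) plus a within-tolerance count. The following is a minimal sketch of both, assuming paired lists of predicted and reference values; the function names and the sample numbers are illustrative and are not taken from the FluxMateria exports.

```python
def absolute_percentage_errors(predicted, reference):
    """Per-point absolute percentage errors, in percent.

    Reference values must be non-zero, otherwise the ratio is undefined.
    """
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return [100.0 * abs(p - r) / abs(r) for p, r in zip(predicted, reference)]


def mape(predicted, reference):
    """Mean absolute percentage error, in percent."""
    errors = absolute_percentage_errors(predicted, reference)
    return sum(errors) / len(errors)


def count_within(predicted, reference, tolerance_percent):
    """Number of points whose absolute percentage error is within tolerance."""
    errors = absolute_percentage_errors(predicted, reference)
    return sum(1 for e in errors if e <= tolerance_percent)


# Made-up bond lengths in angstroms, for illustration only.
reference = [1.000, 2.000, 4.000]
predicted = [1.010, 1.990, 4.000]

example_mape = mape(predicted, reference)
example_hits = count_within(predicted, reference, tolerance_percent=0.75)
print(f"MAPE = {example_mape:.3f}%, within 0.75%: {example_hits}/{len(reference)}")
```

Reporting the within-tolerance count next to the mean shows whether a low average hides a few large misses.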

Throughput

10,000+ mol/hr
full property panel

Single-threaded, no GPU. Scales linearly with cores for batch jobs.

ADMET Panel

178K validated
leave-one-out, 8 endpoints

PPB, BBB, solubility, metabolism, permeability, hERG, DILI, CYP. Three endpoints are #1 SOTA under stated public-comparator metrics.

Materials

<1% MAPE
16 properties, universal engine

Band gap 0.237 eV MAE (1,048 materials). Core holdout 1.2%. Gemstone color 19/19.

Solvation

0.3295 kcal/mol
FreeSolv hydration MAE, 642 cases

SOTA-level hydration accuracy with explicit-solvation packet, public exports, and native non-water carrier coverage.

MechanismOS

Real-time steering
control surfaces + constraint optimizer

Category-defining reaction steering: live control surfaces, pathway boundaries, optimizer output, and evidence-pack export.

Catalyst Scoring

93.4% ranking
12 / 12 ranking tests passed

Production-stack catalyst benchmark with pairwise ranking fidelity, scenario alignment, inverse-search convergence, and experimental chemisorption calibration.

Benchmarks by module

Detailed performance data for each capability.

💊 ADMET

80.9%
CYP Panel Accuracy
93.3%
BBB Accuracy
~350
mol/sec
178K
Compounds Validated
  • BBB: 93.3% accuracy (7,807 LOO, v8 Hybrid)
  • Solubility: 0.06 logS MAE (9,982+ LOO; #1 SOTA under stated MAE comparator)
  • CYP Panel: AUPRC 0.798, 80.9% acc (62,794 LOO, v5 Hybrid)
  • CYP3A4 inducer: 0.9350 balanced accuracy on primary external holdout
  • Caco-2 permeability: pure-physics MAE 0.277 on TDC caco2_wang test (n=182) vs published SOTA 0.276; MAE 0.502, 73.1% acc on broader 41,175 LOO cohort
  • Metabolism: Spearman 0.692, 82.8% acc (38,576 LOO; #1 SOTA under stated Spearman comparator)
  • PPB: 2.24% LOO MAE (14,288 LOO; #1 SOTA under stated MAE comparator)
  • hERG: AUROC 0.850 (8,879 LOO, v1 Hybrid)
  • DILI: SOTA mechanistic module with TDC-panel novel-like AUROC 0.9597, plus mechanism, exposure, and score-trace reporting

178K compounds validated via leave-one-out protocol across 8 endpoints. Four endpoints reach public-benchmark state-of-the-art under the listed dataset, split, and metric: Solubility, Metabolism, PPB, and Caco-2 permeability (the last matching TDC caco2_wang trained-ML SOTA from pure physics with zero training labels). DILI is a SOTA mechanistic module reaching AUROC 0.9597 on the comparable TDC binary task while also returning mechanism-level output.
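
For readers unfamiliar with the protocol, leave-one-out (LOO) validation holds out each compound in turn, fits on the rest, and scores the held-out prediction. The sketch below shows the loop with a deliberately trivial stand-in model (the mean of the training targets); the function names, the toy model, and the data are assumptions for illustration and say nothing about how the FluxMateria endpoints are actually built.

```python
def leave_one_out_mae(samples, fit, predict):
    """Mean absolute error over a leave-one-out loop.

    samples: list of (features, target) pairs
    fit:     callable taking the training samples and returning a model
    predict: callable taking (model, features) and returning a prediction
    """
    if len(samples) < 2:
        raise ValueError("leave-one-out needs at least two samples")
    errors = []
    for i, (features, target) in enumerate(samples):
        training = samples[:i] + samples[i + 1:]  # everything except sample i
        model = fit(training)
        errors.append(abs(predict(model, features) - target))
    return sum(errors) / len(errors)


def fit_mean(training):
    """Toy model: remember the mean target of the training set."""
    return sum(target for _, target in training) / len(training)


def predict_mean(model, features):
    """Toy prediction: ignore the features and return the stored mean."""
    return model


# Hypothetical compounds with made-up target values.
toy_samples = [("mol_a", 1.0), ("mol_b", 2.0), ("mol_c", 3.0), ("mol_d", 6.0)]
loo_mae = leave_one_out_mae(toy_samples, fit_mean, predict_mean)
print(f"LOO MAE = {loo_mae:.3f}")
```

The key property is that the held-out sample never contributes to the model that predicts it.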

DILI benchmark note: FluxMateria v4.23 reaches area under receiver operating characteristic curve (AUROC) 0.9597 on the comparable Therapeutics Data Commons (TDC) binary DILI benchmark versus the MiniMol reference around 0.956. This is stronger than a binary-only comparison because FluxMateria also returns risk class, score, cytochrome P450 (CYP)/transporter mechanisms, exposure context, dose-window behavior, and a score trace. FluxMateria runs this parent DILI path at about 12.95 molecules per second locally; MiniMol speed is not verified from the public leaderboard.
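
AUROC has a direct probabilistic reading: it is the chance that a randomly chosen positive compound receives a higher score than a randomly chosen negative one, with ties counted as one half. A minimal sketch of that definition follows; the function name and the scores are made up, and a production evaluation would normally use a tested library routine instead of this quadratic loop.

```python
def auroc(scores, labels):
    """AUROC from its pairwise definition; labels are 1 (positive) or 0."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up risk scores and binary labels, for illustration only.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
toy_labels = [1, 0, 1, 0, 0]
toy_auroc = auroc(toy_scores, toy_labels)
print(f"AUROC = {toy_auroc:.4f}")
```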

Full Results → DILI Benchmark Caco-2 Benchmark CYP3A4 Inducer Methodology

🔬 Materials

0.237 eV
Band Gap MAE
1.1668%
Core S2 MAPE
<1%
Universal 16 (strict + OOF)
2.741 ms
Universal strict runtime

Band Gap Benchmark

1,048 materials
  • Overall MAE: 0.237 eV
  • Metals (exp = 0): 0.130 eV
  • Non-metals (exp > 0): 0.320 eV

Core Holdout (5 properties)

Lower MAPE is better
  • FLUX S2 (family holdout): 1.17%
  • FLUX S3 (interaction holdout): 1.38%
  • AFLOW S2: 36.1%
  • JARVIS S2: 10.9%
  • Matbench S2: 18.4%

Universal benchmark (16 properties)

Strict + out-of-family
  • All 16 strict: <1%
  • All 16 out-of-family: <1%
  • Worst OOF scenario: 0.894%
  • Runtime mean: 2.7 ms
  • Gemstone color match: 19/19
What this means: FLUX now has two primary validated tracks: near-1% strict holdout error on core thermo-mechanics (with external apples-to-apples baselines), and sub-1% strict plus out-of-family performance across a 16-property universal runtime path. It also includes a curated mini-benchmark showing defect-context color flexibility for real-time UI exploration.
Universal Benchmark → Band Gap Benchmark Crystal Bond Lengths Module Page

Battery Electrochemistry

1.0
Family Accuracy
0.149 V
Holdout Voltage MAE
5 / 5
Scenario Alignment
26.8 s
End-to-End Workflow
  • Calibrated holdout benchmark tracks capacity, voltage, transport, cycle, electrolyte, interface, cost, and manufacturing together
  • Energy-dense cobalt-free screen lifts LiMnO2 to the top
  • High-voltage frontier screen lifts LiNiPO4 to the top
  • Fast-charge and cycle-life screens surface transport- and stability-led families instead of a single default winner
  • The same pipeline yields different leaders for bulk, interface, battery-native, and build questions

This benchmark validates the battery-native decision layer as a screening and prototype-handoff engine, not as a replacement for electrochemical lab validation.

Full Results → Case Study Module Page

Catalyst Scoring

12 / 12
Ranking Tests
91.7%
Top-1 Accuracy
93.4%
Pairwise Accuracy
35 / s
Full-Stack Throughput
  • Measured through the full production scoring path, not a simplified shortcut
  • All 96 benchmark references were FLUX-enriched in the published public run
  • The corrected API path now clears FT activity, FT support, ammonia support, and WGS ordering together
  • Inverse search converges to real industrial catalyst families and chemically serious exclusion lanes
  • The public benchmark and catalyst case study now share the same API-only narrative

This benchmark validates FluxMateria as a catalyst ranking and inverse-discovery engine. Physical synthesis, reactor testing, and long-run deactivation work still remain the next laboratory step.
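
Pairwise ranking accuracy is commonly defined as the fraction of item pairs whose predicted order agrees with the reference order. The sketch below implements that common definition; the benchmark's exact rules for ties and pair selection are not stated here, so the tie handling, function name, catalyst labels, and scores are all assumptions for illustration.

```python
from itertools import combinations


def pairwise_accuracy(predicted, reference):
    """Fraction of pairs ordered the same way by both score dictionaries.

    Pairs that are tied in the reference are skipped (an assumption);
    a predicted tie on a non-tied reference pair counts as a miss.
    """
    agree = 0
    total = 0
    for a, b in combinations(sorted(reference), 2):
        ref_diff = reference[a] - reference[b]
        if ref_diff == 0:
            continue
        total += 1
        if (predicted[a] - predicted[b]) * ref_diff > 0:
            agree += 1
    if total == 0:
        raise ValueError("no comparable pairs in the reference ranking")
    return agree / total


# Hypothetical catalysts with made-up activity scores.
reference_scores = {"cat_A": 3.0, "cat_B": 2.0, "cat_C": 1.0, "cat_D": 0.5}
predicted_scores = {"cat_A": 2.5, "cat_B": 2.8, "cat_C": 1.1, "cat_D": 0.2}

toy_pairwise = pairwise_accuracy(predicted_scores, reference_scores)
top1_hit = max(predicted_scores, key=predicted_scores.get) == max(
    reference_scores, key=reference_scores.get
)
print(f"pairwise accuracy = {toy_pairwise:.3f}, top-1 correct = {top1_hit}")
```

In this toy case one swapped pair (cat_A and cat_B) costs both the top-1 hit and one of six pairs, which is why the two metrics are reported separately.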

Full Results → Case Study Module Page

Activation Barrier Prediction

0.236 eV
Combined MAE
0.147 eV
4d-series MAE
93%
Within 0.5 eV
0
DFT inputs / training
  • Predicts surface-reaction activation barriers from Flux energy and topology terms
  • 29 published literature reactions: N₂, H₂, O₂, CO dissociation and C-H activation across 13 transition metals
  • Matches single-method DFT accuracy (PBE ~0.20-0.30 eV) at analytical speed — microseconds per prediction
  • 100% within 0.5 eV for N₂, H₂, and O₂ dissociation families
  • Feeds the catalyst-scoring and microkinetics layers for end-to-end catalyst discovery

Production-ready for catalyst screening, ranking, and inverse discovery. Quantitative turnover-frequency prediction is at the edge of usefulness at this MAE; the same is true for single-method DFT.
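
The turnover-frequency caveat follows from the exponential dependence of rates on barriers: in an Arrhenius-type rate law, an error δ in the activation barrier rescales the rate by exp(δ / k_B T). The sketch below evaluates that factor for the 0.236 eV combined MAE; the function name is illustrative, and the simple Arrhenius form is an assumption, not a description of the FluxMateria microkinetics layer.

```python
import math

K_B_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def rate_error_factor(barrier_error_ev, temperature_k):
    """Multiplicative rate error caused by a barrier error, Arrhenius form."""
    return math.exp(barrier_error_ev / (K_B_EV_PER_K * temperature_k))


factor_300k = rate_error_factor(0.236, 300.0)
factor_600k = rate_error_factor(0.236, 600.0)
print(f"300 K: rate off by a factor of about {factor_300k:.0f}")
print(f"600 K: rate off by a factor of about {factor_600k:.0f}")
```

A typical barrier error of this size shifts an absolute rate by a factor of several thousand at room temperature, while rankings between catalysts are far less sensitive because shared errors partly cancel.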

Full Results → d-Band Benchmark Catalyst Benchmark

d-Band Center Descriptor

0.197 eV
Combined MAE
100
Multi-source Cases
0
Fitted Parameters
beats ML
vs linear / kNN / RF
  • Central descriptor in transition-metal catalysis — predicted from Flux atomic and surface descriptors
  • 100-case benchmark: 27 pure TMs, 41 facets, 32 binary alloys — all with published literature targets
  • Facet-specific MAE of 0.154 eV across (111), (100), (110), (211), (0001) and stepped surfaces
  • Outperforms linear regression, k-NN, and random-forest baselines fitted on the same atomic descriptors
  • Cross-validated against five independent literature sources (HN14, K04, GN09, N95, CM20)

The d-band descriptor feeds downstream into the catalyst scoring and inverse-discovery layers. Production-ready for transition-metal catalysis workflows; rare-earth and Pt-3d skin alloys remain known weak spots.

Full Results → Catalyst Benchmark Module Page

🧲 Curie Temperature

4.6%
Overall MAPE
−0.03%
Mean Bias
107
Magnetic Materials
17
Material Families
  • 4.6% MAPE across 107 materials from composition with magnetic closure, branch overrides, and calibration notes
  • 17 families: ferrites, rare-earth intermetallics, double perovskites, manganites…
  • 89% within 5%, 96% within 10% of experimental Tc
  • Near-zero bias (−0.03%) — no systematic over- or under-prediction
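
Mean bias and MAPE answer different questions: bias keeps the sign of each error, so over- and under-predictions cancel, while MAPE does not. A minimal sketch with made-up Curie temperatures (function names and values are illustrative only):

```python
def mean_signed_percentage_error(predicted, reference):
    """Mean bias in percent; positive means over-prediction on average."""
    return 100.0 * sum((p - r) / r for p, r in zip(predicted, reference)) / len(reference)


def mean_absolute_percentage_error(predicted, reference):
    """MAPE in percent; error signs are discarded."""
    return 100.0 * sum(abs(p - r) / abs(r) for p, r in zip(predicted, reference)) / len(reference)


# Made-up Curie temperatures in kelvin.
tc_reference = [100.0, 200.0, 400.0, 500.0]
tc_predicted = [110.0, 180.0, 420.0, 475.0]

toy_bias = mean_signed_percentage_error(tc_predicted, tc_reference)
toy_mape = mean_absolute_percentage_error(tc_predicted, tc_reference)
print(f"bias = {toy_bias:+.2f}%, MAPE = {toy_mape:.2f}%")
```

Here the errors are +10%, -10%, +5%, and -5%, so the bias is zero even though the MAPE is 7.5%. A near-zero bias therefore rules out a systematic offset but says nothing about scatter.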
Full Results → Module Page

🧪 DFT Cross-Check

7.6%
Band gap MAPE (PBE 45.1%)
3.6%
Magnetic moment MAPE (PBE 9.0%)
0.7%
Bulk modulus median (all 15)
~20,000×
Mean speedup vs DFT
  • Head-to-head with GPAW PBE on 15 canonical materials run locally on identical inputs
  • Three-layer comparison covering lattice, band gap, magnetic moment, and bulk modulus
  • Engine band gap median 1.2% vs PBE 50.7%; engine bulk modulus median 0.7% (MAPE 6.0%) across all 15 materials
  • Reproducible: manifest, DFT settings, and per-material results downloadable as JSON / CSV / MD
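
The median and the mean (MAPE) of the same error set can differ widely when a few materials are outliers, which is why both are quoted above. A small sketch with invented percentage errors, using only the Python standard library:

```python
from statistics import mean, median

# Invented absolute percentage errors for five materials; one outlier.
abs_pct_errors = [1.0, 2.0, 3.0, 4.0, 40.0]

median_error = median(abs_pct_errors)  # robust to the single outlier
mean_error = mean(abs_pct_errors)      # pulled upward by the outlier
print(f"median = {median_error:.1f}%, mean = {mean_error:.1f}%")
```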

Si, Ge, GaAs, GaN, ZnO, MgO, TiO2, NaCl, Al, Cu, Fe, Ni, graphite, h-BN, MoS2. Two tiers shipped (single-point + EOS-derived B); a third (DFPT phonons + dielectric function) is on the roadmap.

Full Results → Case Study Module Page

⚡ Carrier Mobility

6.2%
Overall MAPE
10/23
Within ±5%
23
Semiconductors
4
Material Families
  • Electron mobility μe at 300 K predicted from composition using production transport physics
  • 4 families: III-V, II-VI, IV-VI, elemental semiconductors
  • 22 of 23 materials within ±15% of experiment; SiC is the only edge case
  • Balanced signed errors — no systematic over- or under-prediction
Full Results → Module Page

⚛️ Atomic & Magnetic Properties

2.5%
EN MAPE
1.7%
IE MAPE
84
MM Materials
5
Properties
  • Electronegativity 2.5% MAPE (75 elements), ionization energy 1.7% (27), electron affinity 1.0% (28)
  • Magnetic moment: 100% pass (84/84 materials); metallic intermetallics 3.1% MAPE
  • Saturation magnetization: 100% pass (10/10 materials at ±50% tolerance)
  • Atomic properties and magnetic subproperties carry separate basis notes
Full Results → Module Page

📈 Spectroscopy

6.2%
UV-Vis Error
<1%
IR Error
0.3-0.5
NMR MAE (ppm)
50
UV-Vis Molecules
  • UV-Vis: 6.2% mean error, 50 molecules, 6 categories
  • IR: <1% error, 32 NIST molecules validated
  • NMR: 0.3-0.5 ppm MAE, 10 SDBS molecules, 5 nuclei
Full Results → Module Page

⚗️ Mechanism Discovery

100%
Mechanism Accuracy
336/336
Cases Correct
7.4
kJ/mol MAE
1,000,000x
Faster
  • 336/336 experimental test cases (SN1/SN2/E1/E2/E1cb) ✓
  • 10,000 random physical consistency tests ✓
  • Head-to-head comparison with DFT (B3LYP) ✓
  • Every prediction traceable and reproducible ✓
Full Methodology & Results → Module Page

MechanismOS

Real-time
Mechanism Steering
98.7%
GOLD Direct Ea
94.86%
SILVER Arrhenius Ea
Evidence pack
Audit-ready export
  • SOTA real-time mechanism steering: control surfaces, pathway boundaries, constraint optimizer, and evidence-pack export
  • GOLD: 152/154 direct measured activation barriers passed under fixed benchmark criteria
  • SILVER: 1255/1323 Arrhenius-derived barrier checks passed
  • Official experimental source provenance documented per benchmark tier
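
An Arrhenius-derived barrier is obtained by fitting ln k against 1/T; the slope equals -Ea/R. The sketch below shows that standard fit on synthetic rate constants generated from a known barrier. The function name and data are illustrative, and the SILVER tier's actual pass criteria are not reproduced here.

```python
import math

R_J_PER_MOL_K = 8.314462618  # molar gas constant


def arrhenius_activation_energy(temperatures_k, rate_constants):
    """Activation energy in kJ/mol from a least-squares fit of ln k vs 1/T."""
    if len(temperatures_k) != len(rate_constants) or len(temperatures_k) < 2:
        raise ValueError("need at least two (T, k) pairs of equal length")
    xs = [1.0 / t for t in temperatures_k]
    ys = [math.log(k) for k in rate_constants]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    return -slope * R_J_PER_MOL_K / 1000.0  # slope = -Ea/R


# Synthetic data generated from a known 50 kJ/mol barrier.
true_ea_kj = 50.0
prefactor = 1.0e13
temps = [300.0, 320.0, 340.0, 360.0]
rates = [prefactor * math.exp(-true_ea_kj * 1000.0 / (R_J_PER_MOL_K * t)) for t in temps]

fitted_ea_kj = arrhenius_activation_energy(temps, rates)
print(f"fitted Ea = {fitted_ea_kj:.3f} kJ/mol")
```

With real measurements the fit also carries an uncertainty from scatter in k, which is one reason a derived barrier is a weaker check than a directly measured one.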
Full Results → Module Page

🧪 Synthesis Planning

3.1%
Barrier MAE
29/29
Reaction Types
200
Specific Reactions
<50ms
Per Plan
  • 29 reaction-type barriers at 3.1% MAE (100% pass rate)
  • 200 specific reactions at <1% MAE (72 exact matches)
  • 15 disconnection SMARTS patterns validated
  • All barriers fully auditable and reproducible
Full Results → Module Page

🔥 Reaction Enthalpy

NEW
3.5%
MAPE
157
Reactions Tested
89%
Within 5%
<1ms
Per Reaction
  • 157 reactions from NIST WebBook at 3.5% MAPE, 10.0 kJ/mol MAE
  • 12 categories: combustion, radical, formation, halogen, nitrogen, ozone
  • Hess’s law with documented species resolution + universal bond engine
  • Phase notation: C(s), C(g), H2O(l) — disambiguates reference states
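
Hess's law with phase-tagged species can be sketched as a sum of formation enthalpies, products minus reactants. The example below uses rounded textbook formation enthalpies and a hypothetical lookup table keyed by phase-tagged names; it does not include the species resolution or the bond engine mentioned above.

```python
# Rounded textbook standard formation enthalpies at 298 K, in kJ/mol.
FORMATION_ENTHALPY_KJ_MOL = {
    "CH4(g)": -74.8,
    "O2(g)": 0.0,     # element in its reference state
    "CO2(g)": -393.5,
    "H2O(l)": -285.8,
    "H2O(g)": -241.8,
}


def reaction_enthalpy(reactants, products, table=FORMATION_ENTHALPY_KJ_MOL):
    """Hess's law: sum over products minus sum over reactants.

    reactants and products map phase-tagged species to stoichiometric
    coefficients, for example {"CH4(g)": 1, "O2(g)": 2}.
    """
    def side_total(side):
        return sum(coeff * table[species] for species, coeff in side.items())

    return side_total(products) - side_total(reactants)


methane = {"CH4(g)": 1, "O2(g)": 2}
dh_liquid_water = reaction_enthalpy(methane, {"CO2(g)": 1, "H2O(l)": 2})
dh_water_vapor = reaction_enthalpy(methane, {"CO2(g)": 1, "H2O(g)": 2})
print(f"to H2O(l): {dh_liquid_water:.1f} kJ/mol")
print(f"to H2O(g): {dh_water_vapor:.1f} kJ/mol")
```

The two results differ by twice the vaporization enthalpy of water, which shows why the phase label has to be part of the species key.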
Full Results →

⚡ Electron Transfer

26/26
Tests Pass
2–3×
Tunneling Enhancement
Literature
Decay constant match
~150ms
Per Pair
  • Marcus rate constants with FLUX tunneling corrections
  • Through-bond decay constant matches literature ranges
  • Normal, activationless, and inverted Marcus regimes
  • All coupling deterministic and traceable
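
The three regimes named above can be seen directly in the classical (non-adiabatic) Marcus expression, k = (2π/ħ) |H_AB|² (4πλ k_B T)^(-1/2) exp(-(ΔG° + λ)² / (4λ k_B T)). The sketch below implements only this textbook form; it omits the FLUX tunneling corrections, which are not described on this page, and the coupling, reorganization energy, and driving forces are made-up inputs.

```python
import math

HBAR_EV_S = 6.582119569e-16    # reduced Planck constant in eV*s
K_B_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def marcus_rate(coupling_ev, reorganization_ev, delta_g_ev, temperature_k=298.15):
    """Classical Marcus electron-transfer rate constant in 1/s."""
    kt = K_B_EV_PER_K * temperature_k
    prefactor = (2.0 * math.pi / HBAR_EV_S) * coupling_ev ** 2
    prefactor /= math.sqrt(4.0 * math.pi * reorganization_ev * kt)
    exponent = -((delta_g_ev + reorganization_ev) ** 2) / (4.0 * reorganization_ev * kt)
    return prefactor * math.exp(exponent)


# Illustrative inputs: 10 meV coupling, 1.0 eV reorganization energy.
coupling = 0.010
lam = 1.0
k_normal = marcus_rate(coupling, lam, delta_g_ev=-0.5)          # -dG < lambda
k_activationless = marcus_rate(coupling, lam, delta_g_ev=-1.0)  # -dG = lambda
k_inverted = marcus_rate(coupling, lam, delta_g_ev=-1.5)        # -dG > lambda

for name, k in [("normal", k_normal), ("activationless", k_activationless), ("inverted", k_inverted)]:
    print(f"{name:>15}: {k:.3e} 1/s")
```

The rate peaks where the driving force matches the reorganization energy and falls off symmetrically on either side in this classical form; tunneling corrections generally soften the fall-off in the inverted region.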
Full Results → Module Page

🧪 Solvation

SOTA-level
FreeSolv Accuracy
0.3295
MAE (kcal/mol)
642
FreeSolv Cases
4
Native Non-Water Carriers
  • SOTA-level explicit hydration benchmark: 0.3295 kcal/mol MAE on 642 FreeSolv cases ✓
  • Official packet includes summary JSON, case CSV/JSON, and methodology ✓
  • Water externally benchmarked; methanol, ethanol, acetonitrile, and DMSO tracked ✓
Full Results →

🧬 BioTarget

0.772
Pearson r (CASF-2016)
91%
MoA Accuracy
1.28
MAE (pKi)
10,065
Targets
  • Binding affinity: Pearson r = 0.772 on CASF-2016 (270 complexes) ✓
  • MoA prediction: 91% accuracy on ChEMBL validation ✓
  • Target identification: AUC 0.980 ✓
  • Selectivity profiling: planned ⏳
Full Results →

⚛ Chemistry

0.079%
Bond Length Error
0.289%
Bond Energy Error
1,361
Total Observables
64
Elements
  • No-fit experimental/reference benchmark: 1,483 validated scalar targets, 0.176% weighted raw MAPE
  • Bond lengths: 453 bonds (391 single + 62 multiple), 0.079% mean error
  • Bond energies: 908 bonds, 0.289% mean error, 870/906 within 1.0%
  • Flux-encoded bond-family formulas with published benchmark provenance notes
  • Coverage: 24 p-block + 30 d-block + 10 s-block elements
Reference Benchmark → Chemistry Results Module Page

↻ Torsion Barriers

1.06
kJ/mol MAE (99 rotors)
0
Fitted Parameters
9–13×
More accurate than Sage 2.2 / GAFF2 / MMFF94
14
Rotor Classes
  • 1.06 kJ/mol MAE across 99 experimentally-measured rotational barriers, zero training data
  • Same-set head-to-head: 9.2× better than OpenFF Sage 2.2, 10.4× better than GAFF2, 13.3× better than MMFF94
  • 66% of cases within ±1 kJ/mol, 85% within ±2 kJ/mol of experiment
  • Covers alkanes, ethers, amines, peptide ω, esters, acrylates, halides, X-X rotors and aromatic carbonyls
Full Results → Case Study

Independent Validation

Internal benchmarks are only the starting point.

FluxMateria is now opening external validation tracks so researchers can choose blind datasets, define metrics, and score frozen predictions independently.

Available tracks include chemistry core, materials holdouts, life-science / ADMET, reaction mechanisms, and experimental validation.

Open validation program →

Validation scope

Where FluxMateria predictions are most and least reliable.

We document the boundaries of reliable prediction space:

  • ⚠ Novel chemotypes far from validated chemical space
  • ⚠ Specific endpoints with limited experimental data
  • ⚠ Edge cases identified through validation

Confidence indicators in predictions reflect these boundaries. Low confidence = verify experimentally.

Reproducibility

We want you to verify our claims.

Benchmark datasets and evaluation scripts are available to pilot participants.

Request Pilot Access