
Materials Benchmark CORE + UNIVERSAL

This page reports two primary benchmark tracks: a core track of 5-property strict holdouts (S2/S3) with external apples-to-apples comparisons, and a universal track of 16-property validation with strict and out-of-family stress tests. It also includes a curated mini-benchmark for defect-context gemstone color behavior.

1.1668%

Family Holdout Error (S2)

Hard split by material family

1.3750%

Interaction Holdout Error (S3)

Hard split by interaction type

0.32-0.33 ms

Runtime Mean

Per prediction call

797 / 821

Scored Property Points

Total points in S2 / S3

Core Holdout Results (5 Properties)

Deterministic production runtime. No ML inference in final calculations.

At a glance: both hard tests stay below 1.4% overall MAPE (1.1668% on S2, 1.3750% on S3) with a mean runtime of about 0.33 ms per prediction call.

Test	Overall MAPE	Bulk Modulus (B)	Debye Temp (theta_D)	Sound Velocity	Density	Thermal Cond. (kappa)	Runtime mean
Test A: Family holdout (S2)	1.1668%	0.9056%	1.3283%	1.4405%	1.2231%	0.9185%	0.3294 ms
Test B: Interaction holdout (S3)	1.3750%	1.0149%	1.5636%	1.6081%	1.3398%	1.3473%	0.3205 ms

Units: bulk modulus (GPa), Debye temperature (K), sound velocity (m/s), density (g/cm^3), thermal conductivity (W/(m*K)).
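
The page does not state how the Overall MAPE column is aggregated from the five property columns. As a quick check, the sketch below compares the unweighted mean of the per-property values in the table with the reported overall figure. The small gap would be consistent with pooling over scored points rather than averaging over properties, but that reading is an assumption. The helper names are illustrative and are not part of the benchmark code.

```python
# Per-property MAPE values (%) copied from the core holdout table above.
S2 = {"B": 0.9056, "theta_D": 1.3283, "v_sound": 1.4405, "density": 1.2231, "kappa": 0.9185}
S3 = {"B": 1.0149, "theta_D": 1.5636, "v_sound": 1.6081, "density": 1.3398, "kappa": 1.3473}
REPORTED_OVERALL = {"S2": 1.1668, "S3": 1.3750}


def unweighted_mean(values):
    """Plain average over properties, ignoring how many points each has."""
    values = list(values)
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Point-weighted average. The weights would be per-property point
    counts, which this page does not publish (assumption)."""
    values, weights = list(values), list(weights)
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


for name, row in (("S2", S2), ("S3", S3)):
    simple = unweighted_mean(row.values())
    print(f"{name}: unweighted mean {simple:.4f}% vs reported overall {REPORTED_OVERALL[name]:.4f}%")
```

With per-property point counts from the frozen JSON, `weighted_mean` could be used to test whether point-weighting reproduces the reported overall values.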

Universal Property Validation (16 Properties)

Strict reference validation plus out-of-family stress tests across 15 split scenarios.

<1%

Strict MAPE (all 16)

Worst strict property: 0.947%

<1%

Out-of-family MAPE (all 16)

Worst scenario aggregate: 0.894%

2.741 ms

Strict Runtime Mean

Median 2.417 ms, p95 4.056 ms

16 x 15

Properties x Scenarios

Out-of-family sweep runtime: 4.321 s

At a glance: the universal runtime engine keeps all 16 property aggregates below 1% in both strict and out-of-family validation.

Property	Strict MAPE % (N)	Out-of-family weighted MAPE % (rows)
Crystal structure	0.000 (59)	0.000 (59)
Lattice constant (A)	0.085 (59)	0.086 (59)
Lattice constant (pm)	0.085 (59)	0.086 (59)
Atomic volume (A^3)	0.497 (71)	0.488 (71)
Band gap (eV)	0.690 (274)	0.985 (674)
Optical band gap (eV)	0.527 (39)	0.530 (39)
Dielectric constant	0.910 (32)	0.925 (32)
Refractive index	0.658 (49)	0.662 (49)
Reflectivity	0.339 (28)	0.338 (28)
Hardness (GPa)	0.740 (44)	0.724 (44)
Cv (J/(mol*K))	0.857 (39)	0.858 (39)
Cv (J/(kg*K))	0.857 (39)	0.858 (39)
Cp (J/(mol*K))	0.797 (81)	0.801 (81)
Thermal expansion (1/K)	0.903 (39)	0.908 (39)
Melting point (K)	0.947 (82)	0.973 (82)
Density (g/cm^3)	0.929 (162)	0.965 (165)

For crystal structure, values are mismatch-rate percentages (0% means all evaluated rows matched).
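
A mismatch rate of this kind can be sketched as below. This is a minimal illustration that assumes exact label comparison; the function name and the example labels are hypothetical and are not taken from the benchmark data.

```python
def mismatch_rate_percent(expected, predicted):
    """Percentage of rows whose predicted label differs from the expected one."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        raise ValueError("at least one row is required")
    mismatches = sum(1 for e, p in zip(expected, predicted) if e != p)
    return 100.0 * mismatches / len(expected)


# Hypothetical labels, for illustration only.
expected = ["fcc", "bcc", "hcp", "fcc"]
predicted = ["fcc", "bcc", "hcp", "bcc"]
print(mismatch_rate_percent(expected, predicted))  # 25.0 (one of four rows differs)
```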

Comparison on the Same Tests

All models below are scored on the same strict splits. Lower MAPE is better.

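Main takeaway: on S2 overall error, FLUX scores 1.17% versus 36.07% for AFLOW, 10.92% for JARVIS, and 18.42% for Matbench. The JARVIS and Matbench overall figures cover only the properties those adapters report (see the NA entries below).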

Test A: Family Holdout (S2)

Model	Overall	Bulk Modulus	Debye Temp	Sound Velocity	Density	Thermal Cond.
FLUX	1.1668%	0.9056%	1.3283%	1.4405%	1.2231%	0.9185%
AFLOW adapter	36.0693%	14.3580%	15.6790%	29.5091%	5.5695%	122.9277%
JARVIS adapter	10.9230%	15.4191%	NA	NA	6.3425%	NA
Matbench adapter	18.4214%	18.4214%	NA	NA	NA	NA

Test B: Interaction Holdout (S3)

Model	Overall	Bulk Modulus	Debye Temp	Sound Velocity	Density	Thermal Cond.
FLUX	1.3750%	1.0149%	1.5636%	1.6081%	1.3398%	1.3473%
AFLOW adapter	35.3566%	14.5881%	14.1047%	30.0588%	5.5493%	119.6271%
JARVIS adapter	10.9356%	15.4816%	NA	NA	6.2780%	NA
Matbench adapter	18.4247%	18.4247%	NA	NA	NA	NA

Important: NA means that model did not provide that property in this comparison. Coverage differs by model and property.
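
Because coverage differs, a like-for-like reading compares each adapter with FLUX only on the properties that adapter reports. The sketch below does this for the S2 table, using `None` for NA. The function name is illustrative, and the ratio is only one possible way to summarize the gap.

```python
# S2 per-property MAPE (%) copied from the Test A table above; None marks NA.
PROPS = ["Bulk Modulus", "Debye Temp", "Sound Velocity", "Density", "Thermal Cond."]
S2 = {
    "FLUX": [0.9056, 1.3283, 1.4405, 1.2231, 0.9185],
    "AFLOW adapter": [14.3580, 15.6790, 29.5091, 5.5695, 122.9277],
    "JARVIS adapter": [15.4191, None, None, 6.3425, None],
    "Matbench adapter": [18.4214, None, None, None, None],
}


def shared_property_ratios(table, baseline, other):
    """Return {property: other_mape / baseline_mape} for properties both models report."""
    ratios = {}
    for prop, base_val, other_val in zip(PROPS, table[baseline], table[other]):
        if base_val is not None and other_val is not None:
            ratios[prop] = other_val / base_val
    return ratios


for model in S2:
    if model == "FLUX":
        continue
    ratios = shared_property_ratios(S2, "FLUX", model)
    summary = ", ".join(f"{prop}: {ratio:.1f}x" for prop, ratio in ratios.items())
    print(f"{model}: {summary}")
```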

Mini-Benchmark: Defect-Context Gemstone Color

A curated showcase of defect-center and illumination behavior using the same deterministic universal runtime path.

How to interpret this: this is a supplementary flexibility benchmark for defect-driven visible color. It is a curated scenario set, not a replacement for the strict universal property validation above.

19 / 19

Gemstone Color Match

Exact or accepted family label

2.800 ms

Median Runtime

Per prediction call

Green to Red

Alexandrite Shift

Daylight vs incandescent

Knob Sweep Cases

5 output color classes observed

Case	Context	Expected	Predicted
Amethyst	Quartz + Fe center	Purple	Purple
Ruby	Corundum + Cr center	Red	Red
Emerald	Beryl + Cr/V center	Green	Green
Blue Topaz	Topaz + F-like center	Blue	Blue
Alexandrite (daylight)	Cr center, daylight profile	Green	Green
Alexandrite (incandescent)	Same material, warm-light profile	Red	Red

Runtime remains deterministic and constant-time. No ML inference is used in final engine calculations.

How to read this page

What is measured

Core track: 5 thermo-mechanical properties. Universal track: 16 properties spanning structural, electronic, optical, thermal, and mechanical outputs.

How accuracy is scored

Mean absolute percentage error (MAPE), reported in %. Lower is better.
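
The page does not spell out the formula, so the sketch below assumes the standard definition: the mean of |predicted - reference| / |reference| over all scored points, times 100. The function name and the sample values are hypothetical.

```python
def mape_percent(reference, predicted):
    """Mean absolute percentage error, in percent (standard definition assumed)."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for ref, pred in zip(reference, predicted):
        if ref == 0:
            raise ValueError("MAPE is undefined for a zero reference value")
        total += abs(pred - ref) / abs(ref)
    return 100.0 * total / len(reference)


# Hypothetical values: each prediction is off by 1% of its reference.
reference = [100.0, 200.0, 400.0]
predicted = [101.0, 198.0, 404.0]
print(f"{mape_percent(reference, predicted):.3f}%")
```

MAPE is undefined when a reference value is zero, which is why the sketch raises instead of silently skipping such rows.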

How speed is scored

Mean runtime in milliseconds per prediction call.
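
A per-call timing harness that yields mean, median, and p95 figures like those quoted on this page could look like the sketch below. It is not the benchmark's own harness: `placeholder_predict` stands in for the real prediction call, and the warm-up count and the nearest-rank percentile method are assumptions.

```python
import math
import statistics
import time


def time_calls(fn, inputs, warmup=10):
    """Time fn(x) once per input and summarize the samples in milliseconds."""
    if not inputs:
        raise ValueError("at least one input is required")
    for x in inputs[:warmup]:
        fn(x)  # warm-up calls are executed but not recorded
    samples_ms = []
    for x in inputs:
        start = time.perf_counter()
        fn(x)
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    ordered = sorted(samples_ms)
    p95_index = math.ceil(0.95 * len(ordered)) - 1  # nearest-rank percentile
    return {
        "calls": len(ordered),
        "mean_ms": statistics.mean(ordered),
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
    }


def placeholder_predict(x):
    # Hypothetical stand-in for a single prediction call.
    return math.sqrt(x) * 2.0


stats = time_calls(placeholder_predict, [float(i) for i in range(1, 1001)])
print(stats)
```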

Mini-benchmark scope

The gemstone color section is a curated feature benchmark that demonstrates defect and illumination controls. It is separate from core strict property claims.

Two Hard Generalization Tests

Both tests are designed so the engine must generalize beyond familiar patterns.

Test A: Family Holdout (S2)

All materials from one family are held out together. This checks whether the model extrapolates to unseen families.

19 folds
171 held-out formulas total
797 total evaluated property points

Test B: Interaction Holdout (S3)

Bonding interaction types are held out by group. This checks robustness across unseen interaction patterns.

15 folds
175 held-out formulas total
821 total evaluated property points
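
Both tests are grouped holdouts: every record that shares a group label is removed from the fitting data together. The sketch below shows the general leave-one-group-out pattern. The records, the family and interaction labels, and the function name are hypothetical; the actual fold definitions are in the frozen snapshot files.

```python
from collections import defaultdict


def leave_one_group_out(records, group_key):
    """Yield (held_out_group, train_records, test_records), one fold per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)
    for held_out in sorted(groups):
        test = groups[held_out]
        train = [r for name, rows in groups.items() if name != held_out for r in rows]
        yield held_out, train, test


# Hypothetical records, for illustration only.
records = [
    {"formula": "MgO", "family": "oxide", "interaction": "ionic"},
    {"formula": "Al2O3", "family": "oxide", "interaction": "ionic"},
    {"formula": "SiC", "family": "carbide", "interaction": "covalent"},
    {"formula": "GaN", "family": "nitride", "interaction": "covalent"},
    {"formula": "Cu", "family": "metal", "interaction": "metallic"},
]

for key in ("family", "interaction"):
    for held_out, train, test in leave_one_group_out(records, key):
        print(f"{key}={held_out}: {len(train)} train, {len(test)} held out")
```

Splitting by group rather than by row prevents near-duplicate materials from appearing on both sides of a fold, which is what makes these tests harder than a random split.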

Reproducibility and Audit Trail

Both benchmark tracks are frozen to public snapshots so every number can be rerun and verified.

Snapshot ID: materials_physics_benchmark_snapshot_2026-02-24

Generated: 2026-02-24T20:50:56Z

Commit: f4fb848fd7fa55be1b68d4e7592f1330553f1112

Branch / Dirty: main / true

Dirty: true indicates public report-formatting/export changes after the prediction rows were frozen. The benchmark prediction rows, snapshot ID, commit, commands, and attached hashes define the reproducible artifact.
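
Checking downloaded artifacts against the attached hashes could be sketched as follows. The manifest layout used here (a top-level `files` object mapping relative paths to hex digests) and the choice of SHA-256 are assumptions for illustration; the real field names and hash algorithm should be read from the frozen benchmark JSON.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path):
    """Return {relative_path: True/False} for each file listed in the manifest.

    Assumed layout (hypothetical): {"files": {"relative/path": "<sha256 hex>"}}
    """
    manifest_path = Path(manifest_path)
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    results = {}
    for rel_path, expected in manifest["files"].items():
        target = manifest_path.parent / rel_path
        results[rel_path] = target.is_file() and sha256_of(target) == expected.lower()
    return results
```

A call such as `verify_manifest("manifest.json")` would then report, file by file, whether the local copy matches the recorded digest; the file name here is a placeholder.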

Frozen benchmark JSON

Machine-readable manifest, hashes, commands, and metrics.

Download JSON

Frozen benchmark report (Markdown)

Human-readable report with strict and external tables.

Download MD

Universal benchmark snapshot JSON (16 properties)

Strict + out-of-family aggregates with sanitized public fields only.

Download JSON

Universal benchmark snapshot report (Markdown)

Headline metrics and per-property strict vs out-of-family summary table.

Download MD

Strict scoring outputs (S2/S3)

Fold-level strict results for FLUX only.

S2 JSON S3 JSON

External apples-to-apples outputs (S2/S3)

FLUX + AFLOW/JARVIS/Matbench adapter scoring on the same strict splits.

S2 JSON S3 JSON

Dedicated materials benchmarks

Specific property accuracy and methodology details are reported on dedicated pages with their own datasets and metric definitions.

Band Gap Benchmark | Crystal Bond Lengths | Curie Temperature | Materials Module

Benchmark basis

This page aggregates many materials properties. The aggregate basis is mixed; property rows and source notes identify the relevant prediction route for each result family.
