
Materials Benchmark (Core + Universal)

This page covers two primary benchmark tracks: a core track of 5 properties scored on strict holdouts (S2/S3) with external apples-to-apples comparisons, and a universal track of 16 properties validated with strict and out-of-family stress tests. It also includes a curated mini-benchmark for defect-context gemstone color behavior.

  • Family Holdout Error (S2): 1.1668% (hard split by material family)
  • Interaction Holdout Error (S3): 1.3750% (hard split by interaction type)
  • Runtime Mean: 0.32-0.33 ms (per prediction call)
  • Scored Property Points: 797 / 821 (total points in S2 / S3)

Core Holdout Results (5 Properties)

Deterministic production runtime. No ML inference in final calculations.

At a glance: both hard tests stay below 1.4% overall MAPE at a mean runtime of roughly 0.32-0.33 ms per prediction call.
| Test | Overall MAPE | Bulk Modulus (B) | Debye Temp (theta_D) | Sound Velocity | Density | Thermal Cond. (kappa) | Runtime mean |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Test A: Family holdout (S2) | 1.1668% | 0.9056% | 1.3283% | 1.4405% | 1.2231% | 0.9185% | 0.3294 ms |
| Test B: Interaction holdout (S3) | 1.3750% | 1.0149% | 1.5636% | 1.6081% | 1.3398% | 1.3473% | 0.3205 ms |

Units: bulk modulus (GPa), Debye temperature (K), sound velocity (m/s), density (g/cm^3), thermal conductivity (W/(m*K)).

Universal Property Validation (16 Properties)

Strict reference validation plus out-of-family stress tests across 15 split scenarios.

  • Strict MAPE (all 16): <1% (worst strict property: 0.947%)
  • Out-of-family MAPE (all 16): <1% (worst scenario aggregate: 0.894%)
  • Strict Runtime Mean: 2.741 ms (median 2.417 ms, p95 4.056 ms)
  • Properties x Scenarios: 16 x 15 (out-of-family sweep runtime: 4.321 s)

At a glance: the universal runtime engine keeps all 16 property aggregates below 1% in both strict and out-of-family validation.
| Property | Strict MAPE % (N) | Out-of-family weighted MAPE % (rows) |
| --- | --- | --- |
| Crystal structure | 0.000 (59) | 0.000 (59) |
| Lattice constant (A) | 0.085 (59) | 0.086 (59) |
| Lattice constant (pm) | 0.085 (59) | 0.086 (59) |
| Atomic volume (A^3) | 0.497 (71) | 0.488 (71) |
| Band gap (eV) | 0.690 (274) | 0.985 (674) |
| Optical band gap (eV) | 0.527 (39) | 0.530 (39) |
| Dielectric constant | 0.910 (32) | 0.925 (32) |
| Refractive index | 0.658 (49) | 0.662 (49) |
| Reflectivity | 0.339 (28) | 0.338 (28) |
| Hardness (GPa) | 0.740 (44) | 0.724 (44) |
| Cv (J/(mol*K)) | 0.857 (39) | 0.858 (39) |
| Cv (J/(kg*K)) | 0.857 (39) | 0.858 (39) |
| Cp (J/(mol*K)) | 0.797 (81) | 0.801 (81) |
| Thermal expansion (1/K) | 0.903 (39) | 0.908 (39) |
| Melting point (K) | 0.947 (82) | 0.973 (82) |
| Density (g/cm^3) | 0.929 (162) | 0.965 (165) |

For crystal structure, values are mismatch-rate percentages (0% means all evaluated rows matched).
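To make the two aggregate types in the table concrete, here is a minimal Python sketch. It assumes that the out-of-family column is a row-count-weighted average of per-scenario MAPE values (suggested by the "(rows)" header, but not confirmed on this page) and that the crystal structure metric is a plain label mismatch rate. The function names and sample values are hypothetical and are not benchmark data.

```python
from typing import List, Sequence, Tuple


def row_weighted_mape(scenarios: Sequence[Tuple[float, int]]) -> float:
    """Combine per-scenario MAPE values (%) into one aggregate.

    Each entry is (scenario_mape_percent, row_count). Weighting by row
    count is an assumption about how the aggregate is formed.
    """
    total_rows = sum(rows for _, rows in scenarios)
    if total_rows == 0:
        raise ValueError("at least one scored row is required")
    return sum(mape * rows for mape, rows in scenarios) / total_rows


def mismatch_rate_percent(expected: List[str], predicted: List[str]) -> float:
    """Share of rows (in %) whose predicted label differs from the expected one."""
    if len(expected) != len(predicted) or not expected:
        raise ValueError("label lists must be non-empty and of equal length")
    mismatches = sum(1 for e, p in zip(expected, predicted) if e != p)
    return 100.0 * mismatches / len(expected)


# Illustrative values only.
weighted = row_weighted_mape([(0.5, 10), (1.0, 30)])
mismatch = mismatch_rate_percent(["fcc", "bcc", "hcp", "fcc"],
                                 ["fcc", "bcc", "hcp", "bcc"])
```

With row weighting, a scenario with three times as many rows contributes three times as much to the aggregate, which is why a weighted value can differ from a simple mean over scenarios.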

Comparison on the Same Tests

All models below are scored on the same strict splits. Lower MAPE is better.

Main takeaway: on S2 overall error, FLUX is 1.17% versus AFLOW 36.07%, JARVIS 10.92%, and Matbench 18.42%.

Test A: Family Holdout (S2)

| Model | Overall | Bulk Modulus | Debye Temp | Sound Velocity | Density | Thermal Cond. |
| --- | --- | --- | --- | --- | --- | --- |
| FLUX | 1.1668% | 0.9056% | 1.3283% | 1.4405% | 1.2231% | 0.9185% |
| AFLOW adapter | 36.0693% | 14.3580% | 15.6790% | 29.5091% | 5.5695% | 122.9277% |
| JARVIS adapter | 10.9230% | 15.4191% | NA | NA | 6.3425% | NA |
| Matbench adapter | 18.4214% | 18.4214% | NA | NA | NA | NA |

Test B: Interaction Holdout (S3)

| Model | Overall | Bulk Modulus | Debye Temp | Sound Velocity | Density | Thermal Cond. |
| --- | --- | --- | --- | --- | --- | --- |
| FLUX | 1.3750% | 1.0149% | 1.5636% | 1.6081% | 1.3398% | 1.3473% |
| AFLOW adapter | 35.3566% | 14.5881% | 14.1047% | 30.0588% | 5.5493% | 119.6271% |
| JARVIS adapter | 10.9356% | 15.4816% | NA | NA | 6.2780% | NA |
| Matbench adapter | 18.4247% | 18.4247% | NA | NA | NA | NA |

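Important: NA means the model did not provide that property in this comparison. Coverage differs by model and property, so each Overall value reflects only the properties that model covers. For the Matbench adapter, which reports bulk modulus only, Overall equals its bulk modulus error.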
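Because coverage differs, a like-for-like reading compares two models only on the properties both report. The sketch below uses the per-property S2 values from the Test A table, with `None` standing in for NA. The dictionary layout and the `shared_properties` helper are illustrative assumptions, not part of the published artifacts.

```python
from typing import Dict, List, Optional

# Per-property MAPE (%) copied from the Test A (S2) table; None stands for NA.
S2_RESULTS: Dict[str, Dict[str, Optional[float]]] = {
    "FLUX": {
        "Bulk Modulus": 0.9056, "Debye Temp": 1.3283, "Sound Velocity": 1.4405,
        "Density": 1.2231, "Thermal Cond.": 0.9185,
    },
    "AFLOW adapter": {
        "Bulk Modulus": 14.3580, "Debye Temp": 15.6790, "Sound Velocity": 29.5091,
        "Density": 5.5695, "Thermal Cond.": 122.9277,
    },
    "JARVIS adapter": {
        "Bulk Modulus": 15.4191, "Debye Temp": None, "Sound Velocity": None,
        "Density": 6.3425, "Thermal Cond.": None,
    },
    "Matbench adapter": {
        "Bulk Modulus": 18.4214, "Debye Temp": None, "Sound Velocity": None,
        "Density": None, "Thermal Cond.": None,
    },
}


def shared_properties(
    results: Dict[str, Dict[str, Optional[float]]], model_a: str, model_b: str
) -> List[str]:
    """Return the properties that both models report (neither value is NA)."""
    return [
        prop
        for prop, value in results[model_a].items()
        if value is not None and results[model_b].get(prop) is not None
    ]


flux_vs_jarvis = shared_properties(S2_RESULTS, "FLUX", "JARVIS adapter")
flux_vs_matbench = shared_properties(S2_RESULTS, "FLUX", "Matbench adapter")
```

Here `flux_vs_jarvis` contains bulk modulus and density, and `flux_vs_matbench` contains bulk modulus only, so per-property columns are the safer basis for head-to-head comparison.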

Mini-Benchmark: Defect-Context Gemstone Color

A curated showcase of defect-center and illumination behavior using the same deterministic universal runtime path.

How to interpret this: this is a nice-to-have flexibility benchmark for defect-driven visible color. It is a curated scenario set, not a replacement for the strict universal property validation above.

  • Gemstone Color Match: 19 / 19 (exact or accepted family label)
  • Median Runtime: 2.800 ms (per prediction call)
  • Alexandrite Shift: green to red (daylight vs incandescent)
  • Knob Sweep Cases: 21 (5 output color classes observed)

| Case | Context | Expected | Predicted |
| --- | --- | --- | --- |
| Amethyst | Quartz + Fe center | Purple | Purple |
| Ruby | Corundum + Cr center | Red | Red |
| Emerald | Beryl + Cr/V center | Green | Green |
| Blue Topaz | Topaz + F-like center | Blue | Blue |
| Alexandrite (daylight) | Cr center, daylight profile | Green | Green |
| Alexandrite (incandescent) | Same material, warm-light profile | Red | Red |

Runtime remains deterministic and constant-time. No ML inference is used in final engine calculations.

How to read this page

What is measured

Core track: 5 thermo-mechanical properties. Universal track: 16 properties spanning structural, electronic, optical, thermal, and mechanical outputs.

How accuracy is scored

Mean absolute percentage error (MAPE, %). Lower is better.
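For readers unfamiliar with the metric, the following is a minimal sketch of the standard MAPE definition: the mean of |predicted - reference| / |reference|, expressed in percent. The page does not state how zero-valued references or pooling across properties are handled, so the zero check below is an assumption, and the sample values are illustrative rather than benchmark data.

```python
from typing import Sequence


def mape_percent(reference: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    if not reference:
        raise ValueError("at least one point is required")
    if any(r == 0 for r in reference):
        # Assumption: zero references are rejected, since the ratio is undefined.
        raise ValueError("MAPE is undefined when a reference value is zero")
    total = sum(abs((p - r) / r) for r, p in zip(reference, predicted))
    return 100.0 * total / len(reference)


# Illustrative values only: each prediction is off by exactly 1%.
reference_values = [100.0, 200.0, 400.0]
predicted_values = [101.0, 198.0, 404.0]
example_mape = mape_percent(reference_values, predicted_values)
```

Because each error is divided by its own reference value, properties with very different magnitudes (for example GPa versus m/s) contribute on the same relative scale.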

How speed is scored

Mean runtime in milliseconds per prediction call.
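A per-call timing measurement of this kind can be sketched as follows. This is not the benchmark's own harness: the warm-up count, the number of calls, the nearest-rank p95 rule, and the stand-in `_stand_in_predict` function are all assumptions chosen for illustration.

```python
import math
import statistics
import time
from typing import Callable, Dict


def time_calls_ms(
    predict: Callable[[], object], n_calls: int = 1000, warmup: int = 50
) -> Dict[str, float]:
    """Time repeated calls and return mean, median, and p95 in milliseconds."""
    for _ in range(warmup):
        predict()  # warm-up calls are not recorded
    samples = []
    for _ in range(n_calls):
        start = time.perf_counter()
        predict()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank percentile: smallest sample with at least 95% at or below it.
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)
    return {
        "mean_ms": statistics.fmean(samples),
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
    }


def _stand_in_predict() -> int:
    """Hypothetical placeholder for a real prediction call."""
    return sum(i * i for i in range(100))


stats = time_calls_ms(_stand_in_predict, n_calls=200, warmup=10)
```

Reporting the median and p95 alongside the mean helps show whether a few slow calls are pulling the mean upward.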

Mini-benchmark scope

The gemstone color section is a curated feature benchmark that demonstrates defect and illumination controls. It is separate from core strict property claims.

Two Hard Generalization Tests

Both tests are designed so the engine must generalize beyond familiar patterns.

Test A: Family Holdout (S2)

All materials from one family are held out together. This checks whether the model extrapolates to unseen families.

  • 19 folds
  • 171 held-out formulas total
  • 797 total evaluated property points
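A group-wise holdout of this kind can be sketched as a leave-one-family-out split: each fold holds out every formula from one family and keeps the rest as the reference set. The family labels and formulas below are hypothetical, and the benchmark's actual fold definitions live in the frozen artifacts rather than in this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def family_holdout_folds(
    family_of: Dict[str, str],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_family, reference_formulas, held_out_formulas) per fold."""
    by_family: Dict[str, List[str]] = defaultdict(list)
    for formula, family in family_of.items():
        by_family[family].append(formula)
    for family in sorted(by_family):
        held_out = sorted(by_family[family])
        # Every formula from any other family stays in the reference set.
        reference = sorted(f for f, fam in family_of.items() if fam != family)
        yield family, reference, held_out


# Hypothetical family assignments, for illustration only.
example_families = {
    "NaCl": "halide",
    "KCl": "halide",
    "MgO": "oxide",
    "Al2O3": "oxide",
    "GaAs": "III-V",
}
folds = list(family_holdout_folds(example_families))
```

No formula from the held-out family appears in the reference set of its own fold, which is what makes the split a test of extrapolation rather than interpolation. The same grouping works for any label, such as an interaction type.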

Test B: Interaction Holdout (S3)

Bonding interaction types are held out by group. This checks robustness across unseen interaction patterns.

  • 15 folds
  • 175 held-out formulas total
  • 821 total evaluated property points

Reproducibility and Audit Trail

Both benchmark tracks are frozen to public snapshots so every number can be rerun and verified.

Snapshot ID: materials_physics_benchmark_snapshot_2026-02-24
Generated: 2026-02-24T20:50:56Z
Commit: f4fb848fd7fa55be1b68d4e7592f1330553f1112
Branch / Dirty: main / true

Dirty: true indicates public report-formatting/export changes after the prediction rows were frozen. The benchmark prediction rows, snapshot ID, commit, commands, and attached hashes define the reproducible artifact.
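Checking a downloaded artifact against a published hash might look like the sketch below. The page does not state the hash algorithm or the manifest layout, so SHA-256 and the `verify_artifact` helper are assumptions; substitute whatever the frozen manifest actually specifies.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with an expected hex string (assumed SHA-256)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A call such as `verify_artifact(Path("snapshot.json"), expected_hash)` would return `True` only if the local file is byte-identical to the one that was hashed; the file name here is a placeholder.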

  • Frozen benchmark JSON: machine-readable manifest, hashes, commands, and metrics.
  • Frozen benchmark report (Markdown): human-readable report with strict and external tables.
  • Universal benchmark snapshot JSON (16 properties): strict + out-of-family aggregates with sanitized public fields only.
  • Universal benchmark snapshot report (Markdown): headline metrics and per-property strict vs out-of-family summary table.
  • Strict scoring outputs (S2/S3, one JSON file per split): fold-level strict results for FLUX only.
  • External apples-to-apples outputs (S2/S3, one JSON file per split): FLUX + AFLOW/JARVIS/Matbench adapter scoring on the same strict splits.

Dedicated materials benchmarks

Specific property accuracy and methodology details are reported on dedicated pages with their own datasets and metric definitions.

Band Gap Benchmark | Crystal Bond Lengths | Curie Temperature | Materials Module

Benchmark basis

This page aggregates many materials properties. The aggregate basis is mixed; property rows and source notes identify the relevant prediction route for each result family.