
Materials Benchmark CORE + UNIVERSAL

This page reports two primary benchmark tracks: a core track of 5-property strict holdouts (S2/S3) with external apples-to-apples comparisons, and a universal Layer 7 validation across 16 properties with strict and out-of-family stress tests. It also includes a curated mini-benchmark for defect-context gemstone color behavior.

How to read this page

What is measured

Core track: 5 thermo-mechanical properties. Universal track: 16 properties spanning structural, electronic, optical, thermal, and mechanical outputs.

How accuracy is scored

MAPE (%). Lower is better.
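
For readers who want the metric spelled out, here is a minimal sketch of MAPE, assuming the standard definition (mean of |predicted - reference| / |reference|, times 100). The page does not state the exact formula the benchmark uses, so the function name `mape_percent` and the zero-reference handling are illustrative assumptions.

```python
from typing import Sequence


def mape_percent(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent (standard definition, assumed)."""
    if len(predicted) != len(reference) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, r in zip(predicted, reference):
        if r == 0:
            # A zero reference makes the relative error undefined.
            raise ValueError("reference value of zero makes MAPE undefined")
        total += abs(p - r) / abs(r)
    return 100.0 * total / len(reference)


# Illustrative values only, not benchmark data:
# errors are 1/100 and 2/200, so both points are off by 1%.
example = mape_percent([101.0, 198.0], [100.0, 200.0])
print(f"{example:.4f}%")
```

A value of 1.0 here means predictions are off by 1% of the reference value on average.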

How speed is scored

Mean runtime in milliseconds per prediction call.
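
A per-call runtime like this could be measured roughly as follows. This is a sketch under stated assumptions: `predict` stands in for whatever prediction call is being timed (a hypothetical placeholder, not the engine's API), and the p95 uses the nearest-rank method, which may differ from the method behind the published figures. Median and p95 are included because the universal track reports them alongside the mean.

```python
import math
import statistics
import time
from typing import Callable, Dict


def time_calls(predict: Callable[[], object], n_calls: int = 1000) -> Dict[str, float]:
    """Time n_calls invocations and summarize per-call latency in milliseconds."""
    if n_calls < 1:
        raise ValueError("n_calls must be at least 1")
    samples_ms = []
    for _ in range(n_calls):
        start = time.perf_counter()
        predict()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    samples_ms.sort()
    # Nearest-rank 95th percentile on the sorted samples.
    p95_index = math.ceil(0.95 * len(samples_ms)) - 1
    return {
        "mean_ms": statistics.fmean(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
    }


# Hypothetical stand-in workload; replace with the real prediction call.
stats = time_calls(lambda: sum(range(100)), n_calls=200)
print(stats)
```

Timings from a sketch like this depend on hardware and warm-up, so they are only comparable when measured under the same conditions.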

Mini-benchmark scope

The gemstone color section is a curated feature benchmark that demonstrates defect and illumination controls. It is separate from core strict property claims.

  • Family Holdout Error (S2): 1.1668% (hard split by material family)
  • Interaction Holdout Error (S3): 1.3750% (hard split by interaction type)
  • Runtime Mean: 0.32-0.33 ms per prediction call
  • Scored Property Points: 797 / 821 (total points in S2 / S3)

Two Hard Generalization Tests

Both tests are designed so the engine must generalize beyond familiar patterns.

Test A: Family Holdout (S2)

All materials from one family are held out together. This checks whether the model extrapolates to unseen families.

  • 19 folds
  • 171 held-out formulas total
  • 797 total evaluated property points

Test B: Interaction Holdout (S3)

Bonding interaction types are held out by group. This checks robustness across unseen interaction patterns.

  • 15 folds
  • 175 held-out formulas total
  • 821 total evaluated property points
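
Both tests hold out whole groups rather than random rows. A minimal sketch of that kind of split, assuming one fold per group label, looks like this. The function name, the toy formulas, and the group labels are hypothetical; the page does not publish the actual fold assignment logic.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def leave_one_group_out(
    group_of: Dict[str, str],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_group, train_formulas, test_formulas), one fold per group."""
    groups = defaultdict(list)
    for formula, group in group_of.items():
        groups[group].append(formula)
    for held_out in sorted(groups):
        test = sorted(groups[held_out])
        # Training data never contains a formula from the held-out group.
        train = sorted(
            formula
            for group, members in groups.items()
            if group != held_out
            for formula in members
        )
        yield held_out, train, test


# Toy labels for illustration only (not the benchmark's families).
toy_families = {
    "MgO": "oxide",
    "Al2O3": "oxide",
    "NaCl": "halide",
    "KBr": "halide",
    "Si": "elemental",
}
folds = list(leave_one_group_out(toy_families))
for held_out, train, test in folds:
    print(held_out, "train:", train, "test:", test)
```

The same function covers the S3 idea if the dictionary maps each formula to an interaction-type label instead of a family label.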

Core Holdout Results (5 Properties)

Deterministic production runtime. No ML inference in final calculations.

At a glance: both hard tests stay below 1.4% overall MAPE (1.1668% on S2, 1.3750% on S3) with a mean runtime of about 0.32 to 0.33 ms per call.
| Test | Overall MAPE | Bulk Modulus (B) | Debye Temp (theta_D) | Sound Velocity | Density | Thermal Cond. (kappa) | Runtime mean |
|---|---|---|---|---|---|---|---|
| Test A: Family holdout (S2) | 1.1668% | 0.9056% | 1.3283% | 1.4405% | 1.2231% | 0.9185% | 0.3294 ms |
| Test B: Interaction holdout (S3) | 1.3750% | 1.0149% | 1.5636% | 1.6081% | 1.3398% | 1.3473% | 0.3205 ms |

Units: bulk modulus (GPa), Debye temperature (K), sound velocity (m/s), density (g/cm^3), thermal conductivity (W/(m*K)).

Universal Layer 7 Validation (16 Properties)

Strict reference validation plus out-of-family stress tests across 15 split scenarios.
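
The out-of-family column in the table below is labeled as a weighted MAPE. One plausible reading, sketched here, is a row-weighted average of per-scenario MAPE values. The weighting scheme is an assumption (the page does not define it), and `weighted_mape` is a hypothetical helper name.

```python
from typing import Iterable, Tuple


def weighted_mape(scenarios: Iterable[Tuple[float, int]]) -> float:
    """Row-weighted mean of per-scenario MAPE values.

    Each item is (scenario_mape_percent, evaluated_rows).
    """
    weighted_sum = 0.0
    total_rows = 0
    for mape, rows in scenarios:
        if rows < 0:
            raise ValueError("row counts must be non-negative")
        weighted_sum += mape * rows
        total_rows += rows
    if total_rows == 0:
        raise ValueError("no evaluated rows")
    return weighted_sum / total_rows


# Illustrative numbers only: (0.5 * 10 + 1.0 * 30) / 40 = 0.875
example = weighted_mape([(0.5, 10), (1.0, 30)])
print(example)
```

Under this reading, scenarios with more evaluated rows pull the aggregate harder than small ones, which is why a weighted figure can differ from a plain average across scenarios.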

  • Strict MAPE (all 16): <1% (worst strict property: 0.947%)
  • Out-of-family MAPE (all 16): <1% (worst scenario aggregate: 0.894%)
  • Strict Runtime Mean: 2.741 ms (median 2.417 ms, p95 4.056 ms)
  • Properties x Scenarios: 16 x 15 (out-of-family sweep runtime: 4.321 s)
At a glance: the universal runtime engine keeps all 16 property aggregates below 1% in both strict and out-of-family validation.
| Property | Strict MAPE % (N) | Out-of-family weighted MAPE % (rows) |
|---|---|---|
| Crystal structure | 0.000 (59) | 0.000 (59) |
| Lattice constant (Å) | 0.085 (59) | 0.086 (59) |
| Lattice constant (pm) | 0.085 (59) | 0.086 (59) |
| Atomic volume (Å^3) | 0.497 (71) | 0.488 (71) |
| Band gap (eV) | 0.690 (274) | 0.985 (674) |
| Optical band gap (eV) | 0.527 (39) | 0.530 (39) |
| Dielectric constant | 0.910 (32) | 0.925 (32) |
| Refractive index | 0.658 (49) | 0.662 (49) |
| Reflectivity | 0.339 (28) | 0.338 (28) |
| Hardness (GPa) | 0.740 (44) | 0.724 (44) |
| Cv (J/(mol*K)) | 0.857 (39) | 0.858 (39) |
| Cv (J/(kg*K)) | 0.857 (39) | 0.858 (39) |
| Cp (J/(mol*K)) | 0.797 (81) | 0.801 (81) |
| Thermal expansion (1/K) | 0.903 (39) | 0.908 (39) |
| Melting point (K) | 0.947 (82) | 0.973 (82) |
| Density (g/cm^3) | 0.929 (162) | 0.965 (165) |

For crystal structure, values are mismatch-rate percentages (0% means all evaluated rows matched).
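
Because crystal structure is a categorical label, a percentage error is not meaningful and a mismatch rate is used instead. A minimal sketch of that calculation, with hypothetical structure labels and a hypothetical function name:

```python
from typing import Sequence


def mismatch_rate_percent(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Share of rows whose predicted label differs from the reference, in percent."""
    if len(predicted) != len(reference) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mismatches = sum(1 for p, r in zip(predicted, reference) if p != r)
    return 100.0 * mismatches / len(reference)


# Illustrative labels only, not benchmark rows.
all_match = mismatch_rate_percent(["fcc", "bcc", "hcp"], ["fcc", "bcc", "hcp"])
one_miss = mismatch_rate_percent(
    ["fcc", "bcc", "hcp", "fcc"], ["fcc", "bcc", "hcp", "bcc"]
)
print(all_match, one_miss)
```

The first call corresponds to the 0% case described above (every evaluated row matched); the second shows one wrong label out of four.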

Comparison on the Same Tests

All models below are scored on the same strict splits. Lower MAPE is better.

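Main takeaway: on S2 overall error, FLUX scores 1.17% versus 36.07% for AFLOW, 10.92% for JARVIS, and 18.42% for Matbench. The JARVIS and Matbench overall figures cover only the properties those adapters report (two and one of the five, respectively).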

Test A: Family Holdout (S2)

| Model | Overall | Bulk Modulus | Debye Temp | Sound Velocity | Density | Thermal Cond. |
|---|---|---|---|---|---|---|
| FLUX | 1.1668% | 0.9056% | 1.3283% | 1.4405% | 1.2231% | 0.9185% |
| AFLOW adapter | 36.0693% | 14.3580% | 15.6790% | 29.5091% | 5.5695% | 122.9277% |
| JARVIS adapter | 10.9230% | 15.4191% | NA | NA | 6.3425% | NA |
| Matbench adapter | 18.4214% | 18.4214% | NA | NA | NA | NA |

Test B: Interaction Holdout (S3)

| Model | Overall | Bulk Modulus | Debye Temp | Sound Velocity | Density | Thermal Cond. |
|---|---|---|---|---|---|---|
| FLUX | 1.3750% | 1.0149% | 1.5636% | 1.6081% | 1.3398% | 1.3473% |
| AFLOW adapter | 35.3566% | 14.5881% | 14.1047% | 30.0588% | 5.5493% | 119.6271% |
| JARVIS adapter | 10.9356% | 15.4816% | NA | NA | 6.2780% | NA |
| Matbench adapter | 18.4247% | 18.4247% | NA | NA | NA | NA |

Important: NA means that model did not provide that property in this comparison. Coverage differs by model and property.
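
Since coverage differs, a like-for-like reading compares two models only on properties both report. The sketch below does that for FLUX and the JARVIS adapter using the S2 values from the table above, with NA represented as `None`. The dictionary keys and the helper name `shared_properties` are illustrative choices, not part of the published JSON schema.

```python
from typing import Dict, List, Optional

# S2 per-property MAPE (%) copied from the comparison table; None marks NA.
S2_MAPE: Dict[str, Dict[str, Optional[float]]] = {
    "FLUX": {
        "bulk_modulus": 0.9056,
        "debye_temp": 1.3283,
        "sound_velocity": 1.4405,
        "density": 1.2231,
        "thermal_cond": 0.9185,
    },
    "JARVIS adapter": {
        "bulk_modulus": 15.4191,
        "debye_temp": None,
        "sound_velocity": None,
        "density": 6.3425,
        "thermal_cond": None,
    },
}


def shared_properties(
    a: Dict[str, Optional[float]], b: Dict[str, Optional[float]]
) -> List[str]:
    """Properties for which both models report a value."""
    return [name for name in a if a[name] is not None and b.get(name) is not None]


shared = shared_properties(S2_MAPE["FLUX"], S2_MAPE["JARVIS adapter"])
for name in shared:
    flux = S2_MAPE["FLUX"][name]
    jarvis = S2_MAPE["JARVIS adapter"][name]
    print(f"{name}: FLUX {flux:.4f}% vs JARVIS adapter {jarvis:.4f}%")
```

Restricting to shared properties keeps an NA entry from being silently treated as either zero error or a failure.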

Reproducibility and Audit Trail

Both benchmark tracks are frozen to public snapshots so every number can be rerun and verified.

Snapshot ID: materials_physics_benchmark_snapshot_2026-02-24
Generated: 2026-02-24T20:50:56Z
Commit: f4fb848fd7fa55be1b68d4e7592f1330553f1112
Branch / Dirty: main / true

Public export label uses reader-facing naming conventions only.

  • Frozen benchmark JSON: machine-readable manifest, hashes, commands, and metrics.
  • Frozen benchmark report (Markdown): human-readable report with strict and external tables.
  • Universal Layer 7 snapshot JSON (16 properties): strict + out-of-family aggregates with sanitized public fields only.
  • Universal Layer 7 snapshot report (Markdown): headline metrics and per-property strict vs out-of-family summary table.
  • Strict scoring outputs (S2/S3): fold-level strict results for FLUX only, as one JSON file per split.
  • External apples-to-apples outputs (S2/S3): FLUX + AFLOW/JARVIS/Matbench adapter scoring on the same strict splits, as one JSON file per split.
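
Since the frozen JSON is described as carrying hashes, downloaded files can in principle be checked against it. The sketch below assumes a hypothetical manifest layout of `{"files": {"<file name>": "<sha256 hex digest>"}}` and SHA-256 as the hash; the real manifest schema and hash algorithm are not documented on this page, so adapt the key names before use.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path: Path) -> Dict[str, bool]:
    """Compare each listed file's digest with the manifest entry.

    Assumes the hypothetical layout {"files": {name: sha256_hex}} and that
    the files sit next to the manifest.
    """
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    base = manifest_path.parent
    return {
        name: sha256_of(base / name) == expected
        for name, expected in manifest["files"].items()
    }
```

Calling `verify_manifest(Path("manifest.json"))` (a placeholder file name) would return a mapping from file name to `True` or `False`, so a single `all(...)` over the values gives a pass/fail check.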

Need the Band Gap benchmark?

Band gap accuracy is reported on a dedicated page with its own dataset and metric definitions.


Mini-Benchmark: Defect-Context Gemstone Color

A curated showcase of defect-center and illumination behavior using the same deterministic universal runtime path.

How to interpret this: this is a supplementary benchmark that demonstrates flexibility on defect-driven visible color. It is a curated scenario set, not a replacement for the strict universal property validation above.
  • Gemstone Color Match: 19 / 19 (exact or accepted family label)
  • Median Runtime: 2.800 ms per prediction call
  • Alexandrite Shift: green to red (daylight vs incandescent)
  • Knob Sweep Cases: 21 (5 output color classes observed)
| Case | Context | Expected | Predicted |
|---|---|---|---|
| Amethyst | Quartz + Fe center | Purple | Purple |
| Ruby | Corundum + Cr center | Red | Red |
| Emerald | Beryl + Cr/V center | Green | Green |
| Blue Topaz | Topaz + F-like center | Blue | Blue |
| Alexandrite (daylight) | Cr center, daylight profile | Green | Green |
| Alexandrite (incandescent) | Same material, warm-light profile | Red | Red |
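
The match criterion ("exact or accepted family label") can be sketched as a small scoring function over the six cases listed above. The `ACCEPTED_FAMILIES` mapping is entirely hypothetical: the page does not say which neighboring labels count as accepted, so the entries below are placeholders that only show the mechanism.

```python
from typing import Dict, List, Set, Tuple

# (case, expected, predicted) taken from the table above.
CASES: List[Tuple[str, str, str]] = [
    ("Amethyst", "Purple", "Purple"),
    ("Ruby", "Red", "Red"),
    ("Emerald", "Green", "Green"),
    ("Blue Topaz", "Blue", "Blue"),
    ("Alexandrite (daylight)", "Green", "Green"),
    ("Alexandrite (incandescent)", "Red", "Red"),
]

# Hypothetical accepted-family labels, for illustration only.
ACCEPTED_FAMILIES: Dict[str, Set[str]] = {
    "Purple": {"Violet"},
    "Red": {"Pink"},
}


def is_match(expected: str, predicted: str) -> bool:
    """True for an exact label match or a label in the accepted family."""
    return predicted == expected or predicted in ACCEPTED_FAMILIES.get(expected, set())


matched = sum(1 for _, expected, predicted in CASES if is_match(expected, predicted))
print(f"{matched} / {len(CASES)}")
```

All six listed cases are exact matches, so the family mapping is not exercised by this data; it matters only when a prediction lands on a neighboring label.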

Runtime remains deterministic and constant-time. No ML inference is used in final engine calculations.