← Benchmarks | Catalyst Design

Catalyst Scoring BENCHMARK

This page reports the public benchmark scope for FluxMateria's catalyst layer as deployed: through the full production scoring stack, with enriched materials properties enabled, and with inverse-search validation included.

Full production stack

Enriched properties

Surface-aware scoring

Inverse-search validated

Read case study Download JSON

12 / 12

Ranking tests passed

10 strict + 2 documented limitations

9 / 10

Scenario alignment

different catalyst decision frames

93.4%

Pairwise ranking accuracy

full public ranking suite

30 / s

End-to-end throughput

Deployed scoring path with property enrichment

0.667 eV

Layer 11 vs experimental MAE

28 primary-literature chemisorption points (H, N, O, S on TM single crystals). No DFT reference used.

Headline Results

The current public benchmark artifact was generated through the public production stack and confirmed enriched-property scoring on all 96 references.

Metric	Result	Interpretation
Classification accuracy	100.0%	All 96 reference catalysts were assigned to the correct high-level reaction family.
Mean Spearman rho	0.9000	Mean Spearman rank correlation across all literature-grounded ranking tests, after surface-corrected adsorption-energy descriptors replaced raw dimer BDEs.
Pairwise ranking accuracy	0.9344	The engine gets the ordering right for 93.4% of benchmarked catalyst pairs.
Top-1 accuracy	91.7%	The literature winner is the top-ranked candidate in 11 of 12 ranking tests.
Ranking tests passed	12 / 12	10 strict passes plus 2 documented limitations (Pt-Ni-skin ORR, Ru-vs-Co intrinsic FT activity) that depend on second-layer physics not yet captured.
Scenario alignment	9 / 10	Changing the design goal changes the winner in the physically expected way.
Inverse-search convergence	9 / 9	The inverse-search presets converge to real industrial or literature-backed catalyst families.
Full-stack throughput	30 / s	96 catalysts scored end-to-end through HTTP, surface-energy enrichment, and catalyst ranking in about 3.2 seconds.
FLUX-enriched coverage	96 / 96	Every reference row in the benchmark used the FLUX-enriched API path.

Ranking Fidelity

The ranking suite spans ammonia synthesis, Fischer-Tropsch, selective oxidation, hydrodesulfurization, reforming, water-gas shift, and electrocatalysis.

Ranking test	Spearman rho	Status	What it checks
Ammonia synthesis volcano	0.900	PASS	Ru remains top and promoted Fe outranks unpromoted Fe in the known volcano frame.
FT intrinsic activity	1.000	PASS	RuTiO2 > CoSiO2 > FeCoSiO2 > NiSiO2 on the corrected API path.
FT support effect	1.000	PASS	CoSiO2 > CoTiO2 > CoAlO3 after support-prior tuning in the scorer.
Ammonia Ru support effect	1.000	PASS	Ru > RuCeO2 > RuMgO is preserved while the FT fix is applied.
WGS low-temperature ranking	1.000	PASS	CuZnO > CuZnAlO4 > CuCeO2 remains intact on the enriched path.
All remaining public ranking tests	1.000	PASS	EO selectivity, HDS promotion, reforming coke resistance, EO promotion, FT selectivity, and ORR all land exactly on the expected ordering.

Scenario Alignment

The catalyst layer is also benchmarked as a decision system. Changing the question should change the winner.

Cheapest ammonia: FeKAlO4

Most active ammonia: Ru

Most stable FT: CoMnSiO2

Scale-up WGS: CuZnAlO4

Most selective EO: AgCsReAlO3

Scenario	Winner	Status	Why it matters
Cheapest ammonia catalyst	FeKAlO4	PASS	Cheap promoted iron wins when the brief changes from pure activity to cost-aware industrial realism.
Most active ammonia catalyst	Ru	PASS	The same stack flips to ruthenium when activity is the only objective.
Most stable FT catalyst	CoMnSiO2	PASS	Promotion and support effects alter the leader when durability matters more than raw activity.
Scale-up WGS catalyst	CuZnAlO4	PASS	The engine shifts toward a scale-aware industrial WGS composition instead of a narrow activity winner.
All other public scenarios	6 more aligned winners	PASS	EO, HDS, reforming, FT cost, and electrocatalysis frames all move to the expected candidate family.

Inverse Search Convergence

The strongest proof is not only ranking known catalysts. It is inverse search rediscovering the families that industrial catalysis eventually converged on.

Ammonia synthesis

Top inverse-search outputs: Co, Fe, FeBaO

The engine independently rediscovers the promoted-iron logic behind Haber-Bosch instead of drifting into arbitrary compositions.

Fischer-Tropsch SAF

Top outputs: CoMnSiO2, CoMnReSiO2, CoMnAlO3

Mn/Re-promoted cobalt on inert supports is exactly where serious FT optimization campaigns spent years of work.

Ethylene oxide

Top outputs: AgCsClAlO3, AgCsReAlO3, AgCsAlO3

The engine lands directly on the Ag/Cs/Re industrial EO family rather than needing it to be hand-fed.

Deep HDS

Top outputs: CoMoAlS3, NiWAlS3, CoMoS3

The standard industrial CoMo/NiW sulfide families emerge from the inverse-search presets as expected.

Novel lanes under exclusions

Representative outputs: WCeO2, WLaO2, CuZnCeO3

When incumbents are deliberately excluded, the search still lands in chemically plausible adjacent families rather than drifting into unsupported outputs.

Public result

9 / 9 inverse-search tests passed.

That is the strongest public evidence that the catalyst layer is not only a scorer, but a usable discovery engine.

Experimental chemisorption calibration (2026-04-17)

Layer 11, the surface-adsorption-energy predictor that feeds the catalyst cycle, is calibrated against verified primary-literature experimental chemisorption heats — not DFT.

Why experimental, not DFT

DFT adsorption energies carry 0.15-0.35 eV method-spread across PBE, RPBE, BEEF-vdW on the same system. Single-crystal calorimetry (Campbell, King) and TDS (Ertl, Christmann) give the same values at ±0.05-0.1 eV. The calibration target should be the tighter ground truth.

Benchmark construction

28 verified points across H / N / O / S on 10 transition metals. Every value sourced to a specific paper with DOI, confirmed via WebSearch + abstract fetch, cross-checked against a second citing source. Built before any physics engine update to prevent fit-to-test.

Physics-driven corrections

Two new physics terms are documented in the Layer 11 method note. The engine includes a magnetic exchange enhancement for bulk-magnetic 3d metals (Cr / Mn / Fe / Co / Ni) and an sp-contribution term for d¹⁰ closed-shell metals (Cu / Ag / Au / Zn / Cd / Hg); calibrated subroutes are disclosed where they affect scores.

Adsorbate	N points	MAE (eV)	Primary source(s)
H on TM(111)	7	0.44	Christmann, Ertl, Feulner (TDS)
N on Fe / Ru	4	0.47	Bozso-Ertl-Grunze-Weiss, Rosowski-Hinrichsen-Muhler-Ertl
O on TM(111)	8	0.63	Karp-Campbell SCAC, Stuckless-King, Campbell (TPD)
S on Ni / Pt	2	1.98	Perdereau-Oudar, Heegemann-Meister-Bechtold-Hayek — documented physics gap
Overall	21 scored	0.667	(CO systems out-of-scope for atomic-adsorbate model)

Runnable via python dev/case-studies/run_layer11_experimental.py. Dataset at validation_data/surface_adsorption_experimental.json.

Methodology

The published benchmark measures the operational stack that users call, rather than a reduced direct-function path.

1. Full-stack scoring path

Every benchmark call goes through the live public scoring path, matching the route used for public demonstrations.

Request intake -> public scoring service
Materials-property enrichment
Catalyst scoring + ranking
Public artifact export only from successful production runs

2. Three benchmark layers

The catalyst layer is tested as a ranking engine, a decision engine, and an inverse-search engine.

96 literature-grounded reference catalysts
12 ranking-order tests
10 scenario-alignment tests
9 inverse-search convergence tests

Important scope note: this benchmark validates FluxMateria as a catalyst screening, ranking, and inverse-discovery layer. It does not claim to replace catalyst synthesis, reactor testing, or long-horizon deactivation studies. Its role is to compress the search space and make the next experimental step more targeted.

Artifacts

Catalyst benchmark summary JSON

Public-safe benchmark summary generated only from the successful API full-stack run.

Download JSON

Catalyst inverse-discovery case study

Application page showing how the same layer converges to industrial catalyst families and novel excluded-lane candidates.

Open case study

Read the catalyst discovery story

The benchmark proves the public scoring and inverse-search layer behaves correctly. The case study shows what that looks like inside a discovery workflow.

Catalyst case study Back to Materials module

Benchmark basis

Measures a workflow or ranking engine built on Flux-derived signals. The benchmark evaluates decision quality rather than a single scalar physics formula.

Flux Decision Engine