
MechanismOS Real-Time Reaction Steering Benchmark

MechanismOS goes beyond mechanism labels. It provides live control surfaces, condition-sensitive pathway boundaries, constraint optimization, and evidence-pack export, backed by barrier accuracy validation and pilot reliability testing.

Real-time reaction steering
154/154: GOLD direct Ea pass
1256/1323: SILVER Arrhenius Ea pass
98.2%: Textbook reactions (55/56)
5 mechanisms: SN1, SN2, E1, E2, E1CB

Pass/fail is evaluated against fixed benchmark criteria. GOLD and SILVER are reported separately and should not be conflated.

Results at a glance

Three proof tracks support the MechanismOS benchmark claim.

Real-time steering

18/18 pilot cases and 94/94 checks passed. Control surfaces, pathway boundaries, constraint optimization, and evidence export are validated as product behavior.

Textbook mechanisms

55/56 textbook reactions correct. SN1, SN2, E1, E2, and E1CB behavior is validated across unambiguous, ambiguous, and condition-dependent cases.

Barrier accuracy

154/154 GOLD and 1256/1323 SILVER checks passed. Direct and Arrhenius-derived activation barriers validate the Flux Physics scoring used by the steering engine.

What makes this a steering benchmark

MechanismOS is a real-time steering system, not a single-label classifier. The benchmark covers live control-surface recomputation, condition-sensitive pathway boundaries, hard-constraint optimization, and audit-ready evidence-pack export.

The GOLD/SILVER barrier tiers validate the kinetic backbone. The pilot suite validates the product behavior chemists need at the boundary: conservative confidence, infeasibility reporting, repeatability, and auditable decision support.

Pilot Validation Benchmark

System behavior, reliability, and safety validation across 18 pilot cases covering consistency, conservative uncertainty handling, constraint safety, auditability, and chemically ambiguous regimes.

Cases passed: 18 / 18
Individual checks passed: 94 / 94
Pass rate: 100%
Status: Pilot ready

This benchmark validates system-level behavior and reliability, not barrier accuracy. It is complementary to the GOLD/SILVER barrier validation reported below.

Case-level results

All 18 cases are grouped into two categories: pilot core cases (A–K) and extended behavioral benchmarks (L–R).

Case | Test name | Category | Checks | Key observations
A | Boundary Flip Sweep | Pilot | 6/6 | boundary conf=0.671, interior conf=0.960, near_boundary flagged
B | Support Gate | Pilot | 4/4 | weaker-support region correctly downgraded with explicit caution signaling
C | Optimizer Hard Constraints | Pilot | 1/1 | infeasible correctly detected, no constraint violations
D | Infeasible + Relaxations | Pilot | 2/2 | SN2=90% on tert-butyl infeasible, 2 relaxation suggestions
E | Audit & Evidence Pack | Pilot | 16/16 | 900 tiles, 2 pins, export ready, full bundle verified
F | Weak-Support Solvent Probe | Pilot | 2/2 | weak-support solvent regime is downranked and avoided by default
G | Extreme Temperature Stress Test | Pilot | 4/4 | 180K: conf=0.600 (low), 600K: conf=0.600 (low)
H | Robust vs Brittle Optimizer | Pilot | 3/3 | robust margin=1.0, zero near-boundary results
I | Determinism (30 runs) | Pilot | 4/4 | Δsel=0.000, Δconf=0.000, path=SN2×30
J | Rate Sensitivity | Pilot | 1/1 | rate(280K) < rate(320K) < rate(360K)
K | Policy Enforcement | Pilot | 11/11 | DMSO blacklisted, 0/10 violations
L | Fast vs Refined Agreement | Extended | 8/8 | both SN2, max fraction delta=0.000
M | Substrate Class Sweep | Extended | 5/5 | 1°→SN2, 2°→SN2, 3°→SN1(0.774), benzylic→SN1(0.817)
N | Solvent Polarity Gradient | Extended | 5/5 | SN2 dominant across hexane→DMSO, all >50%
O | Surface Tile Consistency | Extended | 6/6 | 400 tiles, max jump 19.7%, {SN1,SN2,E1}, 212/188 boundary split
P | Multi-Pathway Competition | Extended | 5/5 | 4-way: SN1=21.2%, SN2=32.8%, E1=26.7%, E2=19.4%, conf=0.591
Q | Feasible Optimizer | Extended | 5/5 | Feasible: 5 recs, 0 violations, top SN2=86.2%, solvents=[DMSO,ACN,DMF]
R | Diagnostic Robust vs Brittle | Extended | 6/6 | Robust: 8 recs / 0 boundary; Brittle: 10 recs / 2 boundary; modes differ

Machine-readable results are available in the download package below (JSON and CSV).
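As a quick cross-check of the headline figures, the per-case check counts in the table can be tallied directly. The sketch below transcribes the counts by hand; it does not read the download package, whose schema is not described here, and the names CASE_CHECKS and summarize are illustrative.

```python
# Checks passed / total per case, transcribed from the case-level table.
CASE_CHECKS = {
    "A": (6, 6), "B": (4, 4), "C": (1, 1), "D": (2, 2), "E": (16, 16),
    "F": (2, 2), "G": (4, 4), "H": (3, 3), "I": (4, 4), "J": (1, 1),
    "K": (11, 11), "L": (8, 8), "M": (5, 5), "N": (5, 5), "O": (6, 6),
    "P": (5, 5), "Q": (5, 5), "R": (6, 6),
}


def summarize(case_checks):
    """Return (cases_passed, total_cases, checks_passed, total_checks)."""
    cases_passed = sum(1 for passed, total in case_checks.values() if passed == total)
    checks_passed = sum(passed for passed, _ in case_checks.values())
    total_checks = sum(total for _, total in case_checks.values())
    return cases_passed, len(case_checks), checks_passed, total_checks


summary = summarize(CASE_CHECKS)
print(summary)  # (18, 18, 94, 94)
```

The tally reproduces the 18/18 cases and 94/94 checks reported in the summary.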

Notable results

Selected observations from the pilot validation run.

Multi-pathway competition (Case P)

In genuinely mixed regimes, multiple pathways compete without a clear winner. The system lowers confidence rather than forcing a crisp answer when the chemistry is ambiguous.
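One simple way to see why Case P counts as mixed is to look at the gap between the two leading pathways. The sketch below uses the Case P fractions from the table; the majority threshold and the function names are assumptions for illustration and are not the confidence model MechanismOS uses.

```python
# Case P branching fractions in percent, from the case-level table.
case_p = {"SN1": 21.2, "SN2": 32.8, "E1": 26.7, "E2": 19.4}


def top_two_margin(fractions):
    """Gap in percentage points between the two leading pathways."""
    ranked = sorted(fractions.values(), reverse=True)
    return ranked[0] - ranked[1]


def is_mixed(fractions, majority=50.0):
    """Hypothetical rule: a regime is mixed if no pathway holds a majority."""
    return max(fractions.values()) < majority


leader = max(case_p, key=case_p.get)
margin = top_two_margin(case_p)
print(f"leader={leader}, margin={margin:.1f} pts, mixed={is_mixed(case_p)}")
```

Here SN2 leads E1 by roughly six percentage points and no pathway reaches a majority, which is consistent with the reduced confidence (0.591) reported for this case.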

Substrate class predictions (Case M)

With appropriate nucleophile and solvent conditions, the engine correctly separates the four substrate classes: primary and secondary substrates favor SN2 (74.8%), while tertiary (SN1=77.4%) and benzylic (SN1=81.7%) substrates with weak nucleophiles in protic solvents favor SN1 — matching textbook organic chemistry.

Surface coherence (Case O)

Control-surface analysis shows a smooth and physically coherent response landscape rather than brittle or anomalous behavior.
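A smoothness check of this kind can be sketched as the largest change between neighbouring tiles. The example assumes the surface is a rectangular grid of pathway fractions (for instance 20 x 20 for 400 tiles); the grid layout, the synthetic data, and the function name are assumptions, not the actual Case O implementation.

```python
def max_adjacent_jump(grid):
    """Largest absolute difference between horizontally or vertically adjacent tiles."""
    rows, cols = len(grid), len(grid[0])
    worst = 0.0
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                worst = max(worst, abs(grid[r][c] - grid[r][c + 1]))
            if r + 1 < rows:
                worst = max(worst, abs(grid[r][c] - grid[r + 1][c]))
    return worst


# Synthetic 20 x 20 surface for illustration only: a fraction that falls
# linearly from 0.9 to 0.1 across the columns and is constant down each column.
demo = [[0.9 - 0.8 * c / 19 for c in range(20)] for _ in range(20)]
jump = max_adjacent_jump(demo)
print(f"tiles={len(demo) * len(demo[0])}, max jump={jump:.3f}")
```

A bounded maximum jump, such as the 19.7% reported for Case O, is one indicator that the surface has no isolated spikes between neighbouring conditions.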

Operating window stories

Three examples of how chemists use MechanismOS to steer reactions and justify decisions under real constraints.

Story 1 — Steering away from the boundary

A chemist is getting inconsistent selectivity because the process is operating near a mechanistic boundary. MechanismOS highlights the ambiguity and suggests nearby conditions with more operating margin.

  • Outcome: stable selectivity region (“operating window”) rather than a single-point guess.
  • Evidence: pinned A/B comparison + exported surface slice.
  • Benchmark anchors: A, O, R.

Story 2 — When chemistry is genuinely ambiguous

In some regimes, multiple pathways compete. Instead of returning a crisp label, MechanismOS shows a mixed branching distribution and lowers confidence. The chemist can explore nearby conditions to see which lever (solvent polarity, temperature, nucleophile strength) resolves the competition.

  • Outcome: faster diagnosis of why results vary — and which lever will stabilize outcomes.
  • Evidence: multi-pathway breakdown + “what changed” deltas.
  • Benchmark anchors: P, A.

Story 3 — Enterprise constraints + solvent policy compliance

Process teams must satisfy temperature caps, impurity constraints, and solvent policy requirements. MechanismOS returns feasible condition sets when they exist and produces an audit-ready evidence pack for sign-off.

  • Outcome: feasible operating windows that comply with policy — or a clear infeasible result with relaxations.
  • Evidence: optimization results + constraint report + exported bundle.
  • Benchmark anchors: Q, C, D, K, E.

Textbook Reaction Validation

56 textbook organic chemistry reactions covering SN1, SN2, E1, E2, and E1CB mechanisms, ranging from unambiguous classics to genuinely ambiguous and condition-dependent cases. Overall accuracy: 98.2% (55/56).

SN2: 100% (24/24)
SN1: 100% (13/13)
E2: 92% (12/13)
E1: 100% (3/3)
E1CB: 100% (3/3)

Difficulty | Reactions | Correct | Accuracy
Unambiguous (classic textbook cases) | 18 | 18 | 100%
Moderate (multi-factor decisions) | 19 | 18 | 95%
Ambiguous (genuinely debatable) | 11 | 11 | 100%
Condition-dependent (mechanism flips with conditions) | 8 | 8 | 100%
Total | 56 | 55 | 98.2%

Single miss: a tertiary substrate with KOtBu in THF, where the predicted SN1 share (50%) narrowly beats E2. This is a genuinely borderline case. The dataset is curated from the Clayden, Bruice, Wade, and March textbooks.

Barrier Accuracy Results

GOLD/SILVER barrier checks validate the kinetic backbone behind the steering system.

Tier | Ground truth basis | Cases | Passed | Pass rate | Purpose
GOLD | Directly reported experimental Ea / barrier (kJ/mol) | 154 | 154 | 100.00% | Primary release gate for barrier-accuracy claims
SILVER | Arrhenius-derived Ea from measured k(T) | 1323 | 1256 | 94.9358% | Generalization and failure-pattern discovery
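The pass rates follow directly from the passed and total counts. A minimal sketch, keeping the two tiers separate as the benchmark requires (names are illustrative):

```python
# (passed, total) per tier, from the barrier accuracy table.
TIERS = {"GOLD": (154, 154), "SILVER": (1256, 1323)}


def pass_rate_pct(passed, total, digits=4):
    """Pass rate as a percentage, rounded to the given number of decimals."""
    return round(100.0 * passed / total, digits)


for tier, (passed, total) in TIERS.items():
    rate = pass_rate_pct(passed, total)
    print(f"{tier}: {rate:.4f}% ({total - passed} failed)")
```

This reproduces 100% for GOLD and 94.9358% for SILVER, with 67 SILVER cases outside their acceptance band.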

Pass criteria (benchmark gate)

Benchmark criteria are fixed before evaluation so the reported pass rates are not tuned after the fact. GOLD is the primary release gate; SILVER provides supporting robustness evidence.

  • GOLD: Direct experimental Ea/barrier (kJ/mol) evaluated against a predefined acceptance band.
  • SILVER: Arrhenius-derived Ea from measured k(T) evaluated against its own predefined acceptance band.

Each case also records source family, observation type, and provenance level.
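The gate logic can be pictured as a per-tier tolerance check on each case record. The sketch below assumes a symmetric absolute band in kJ/mol; the band widths, field names, and example values are placeholders chosen for illustration and are not the benchmark's actual criteria or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BarrierCase:
    tier: str              # "GOLD" or "SILVER"
    reference_ea: float    # experimental Ea, kJ/mol
    predicted_ea: float    # model Ea, kJ/mol
    source_family: str
    observation_type: str
    provenance_level: str


# Placeholder band widths in kJ/mol. These are NOT the published criteria.
BANDS_KJ_MOL = {"GOLD": 10.0, "SILVER": 15.0}


def passes(case, bands=BANDS_KJ_MOL):
    """True if the prediction lies inside the tier's predefined band."""
    return abs(case.predicted_ea - case.reference_ea) <= bands[case.tier]


# Made-up example records, for illustration only.
gold = BarrierCase("GOLD", 85.0, 88.5, "literature", "direct Ea", "primary")
silver = BarrierCase("SILVER", 85.0, 102.0, "repository", "Arrhenius fit", "derived")
print(passes(gold), passes(silver))  # True False
```

Fixing the bands in a constant before any case is evaluated mirrors the requirement that criteria are not tuned after the fact.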

Reliability Principles

These are product-level trust properties validated by the pilot benchmark. MechanismOS is designed to be trustworthy at the edges, not just accurate in the middle.

1) No false certainty in ambiguous regions

When competing pathways are close, MechanismOS explicitly flags the region and reduces confidence instead of overstating certainty.

  • User experience: ambiguous regions are clearly marked with caution signals.
  • Optimization behavior: defaults favor more robust operating windows.
  • Validated by: Cases A, B, O, R.

2) Honest uncertainty outside well-supported operating regions

For weak-support operating regions, MechanismOS downgrades confidence and surfaces clear caution signals. The system does not present these regions as straightforward recommendations.

  • User experience: caution states are explicit and easy to distinguish from strong-support regions.
  • Optimization behavior: weaker-support regions are avoided by default.
  • Validated by: Cases F, G, B.

3) Hard constraints are never violated

If a search is infeasible, MechanismOS reports it as infeasible rather than silently weakening constraints to force an answer.

  • Constraints: temperature caps, selectivity minima, impurity caps, solvent policies, confidence minima.
  • Outputs: feasibility status, explicit constraint handling, and guidance for next steps.
  • Validated by: Cases C, D, K, Q.
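The "report infeasible, never relax silently" behavior can be sketched as a filter that either returns candidates meeting every hard constraint or returns an infeasible status with relaxation hints. The candidate values, constraint names, and return shape below are hypothetical and do not describe the MechanismOS optimizer.

```python
def find_feasible(candidates, constraints):
    """Return candidates that satisfy every hard constraint, or an infeasible report."""
    feasible = [c for c in candidates if all(chk(c) for chk in constraints.values())]
    if feasible:
        return {"status": "feasible", "recommendations": feasible, "relaxations": []}
    # Suggest constraints whose removal alone would admit at least one candidate.
    relaxations = []
    for name in constraints:
        others = [chk for other, chk in constraints.items() if other != name]
        if any(all(chk(c) for chk in others) for c in candidates):
            relaxations.append(name)
    return {"status": "infeasible", "recommendations": [], "relaxations": relaxations}


# Made-up candidate conditions, for illustration only.
candidates = [
    {"solvent": "DMSO", "temperature_k": 300, "sn2_fraction": 0.93},
    {"solvent": "ACN", "temperature_k": 310, "sn2_fraction": 0.86},
    {"solvent": "DMF", "temperature_k": 340, "sn2_fraction": 0.91},
]
constraints = {
    "temperature_cap": lambda c: c["temperature_k"] <= 320,
    "solvent_policy": lambda c: c["solvent"] != "DMSO",
    "selectivity_min": lambda c: c["sn2_fraction"] >= 0.90,
}

strict = find_feasible(candidates, constraints)
relaxed = find_feasible(candidates, {**constraints, "selectivity_min": lambda c: c["sn2_fraction"] >= 0.85})
print(strict["status"], strict["relaxations"])
print(relaxed["status"], [c["solvent"] for c in relaxed["recommendations"]])
```

The strict search returns no recommendations at all; any relaxation has to be chosen explicitly by the caller, as in the second call.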

4) Reproducibility and audit-grade traceability

Every evaluation, surface, pin, and optimization run is versioned and exportable as an evidence bundle. Identical inputs produce identical outputs under a fixed release version.

  • Determinism: repeated evaluations match within tolerance (no drift).
  • Evidence pack: configuration, outputs, and versioned provenance needed for review.
  • Validated by: Cases I, E.
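A repeatability check in the spirit of Case I can be sketched by fingerprinting each run's output and requiring a single distinct fingerprint. This version demands exact equality, which is stricter than the "within tolerance" criterion stated above; toy_evaluate is a hypothetical stand-in for an engine call.

```python
import hashlib
import json


def fingerprint(result):
    """SHA-256 of a canonical JSON encoding, so key order does not matter."""
    canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_deterministic(evaluate, inputs, runs=30):
    """True if every run yields an identical canonical result."""
    return len({fingerprint(evaluate(inputs)) for _ in range(runs)}) == 1


def toy_evaluate(inputs):
    # Hypothetical stand-in for the engine: a pure function of its inputs.
    sn2 = round(1.0 / (1.0 + inputs["steric_bulk"]), 3)
    return {"path": "SN2" if sn2 >= 0.5 else "SN1", "selectivity": sn2}


stable = is_deterministic(toy_evaluate, {"steric_bulk": 0.25})
print(stable)  # True
```

Storing the fingerprint alongside the release version in an evidence bundle would let a reviewer confirm that a rerun produced the same output.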

What this benchmark validates

Six dimensions of system quality, each tested by one or more pilot cases.

Determinism

Repeated evaluations are stable and reproducible under fixed inputs. (Case I)

Honest Uncertainty

Ambiguous and weak-support conditions are downranked rather than overstated. (Cases A, B, F, G)

Constraint Safety

Hard constraints are respected, and infeasible searches surface as infeasible instead of being silently compromised. (Cases C, D, K)

Surface Integrity

Response surfaces remain smooth, coherent, and chemically plausible under pilot evaluation. (Case O)

Physical Consistency

Correct substrate class trends (primary→SN2, tertiary→SN1), monotonic rate–temperature response, and robust polarity gradients. (Cases J, M, N)

Complete Audit Trail

End-to-end workflow produces a reproducible evidence package suitable for review and handoff. (Case E)

Methodology: pilot validation

How these tests work.

  • Test subjects: 2-bromobutane (2°), tert-butyl bromide (3°), 1-bromopropane (1°), benzyl bromide (benzylic)
  • Solvents: ethanol, water, DMSO, acetonitrile, DMF, THF, hexane (dielectric constant ε range: 1.9–80)
  • Conditions: supported and stress-test regimes spanning broad temperature and nucleophile conditions
  • Workflows: core evaluation, surface exploration, optimization, and export flows
  • Execution: live-system execution rather than mocked or simulated runs
  • Runtime: ~166 seconds for all 18 cases and 94 checks

Data policy and interpretation

How to read the numbers correctly.

GOLD is the quality gate

Production claims on barrier accuracy are tied to directly measured barriers only. These are hard-value checks against reported experimental kJ/mol values.

SILVER is supporting evidence

Arrhenius-derived Ea values are experimentally grounded but include transformation uncertainty. They are used for robustness and failure pattern analysis, not as the sole release gate.
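The transformation behind a SILVER value is the standard Arrhenius fit: since ln k = ln A - Ea/(R T), the slope of ln k against 1/T equals -Ea/R. The sketch below recovers Ea from synthetic rate constants generated with a known barrier; the pre-exponential factor, the 85 kJ/mol barrier, and the function name are illustrative assumptions, not benchmark data.

```python
import math

R = 8.314462618  # gas constant, J/(mol*K)


def arrhenius_ea_kj_mol(temps_k, rate_constants):
    """Least-squares slope of ln k versus 1/T; Ea = -slope * R, in kJ/mol."""
    xs = [1.0 / t for t in temps_k]
    ys = [math.log(k) for k in rate_constants]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return -(sxy / sxx) * R / 1000.0


# Synthetic k(T) at three temperatures, generated from an assumed Ea of 85 kJ/mol.
temps = [280.0, 320.0, 360.0]
true_ea = 85.0
ks = [1e13 * math.exp(-true_ea * 1000.0 / (R * t)) for t in temps]

recovered = arrhenius_ea_kj_mol(temps, ks)
print(f"recovered Ea = {recovered:.3f} kJ/mol")
```

With noise-free inputs the fit returns the assumed barrier. With real measurements, scatter in k(T) and a narrow temperature range propagate into the fitted slope, which is the transformation uncertainty referred to above.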

Run provenance

GOLD/SILVER rebaseline: May 8, 2026, using the Flux Physics comparator path with no exact-reaction benchmark lookup. Textbook validation: March 20, 2026. Benchmark rows are validation references only, not runtime lookup entries. Reported from the MechanismOS tiered validation pipeline with reproducible suite definitions and artifacts.

Failure analysis (what we learn from SILVER)

Enterprise-grade benchmarks should explain where and why a model struggles — and how the product behaves in those regions.

Common “hard zones”

SILVER cases help identify ambiguous and weak-support regions. Typical hard zones include less-represented chemistry, less-supported condition regimes, and cases with competing pathways of similar effective barrier.

  • Ambiguous regions: leading pathways are close enough that selectivity is sensitive to modest condition changes.
  • Low-support regimes: less-represented condition combinations should trigger more conservative confidence.
  • Conflicting drivers: conditions push competing mechanisms simultaneously, producing mixed branching.

Product mitigation

MechanismOS surfaces these regions explicitly with conservative confidence signaling and defaults to more robust operating windows when the chemistry looks ambiguous.

Dataset provenance

All benchmark rows are experimental and source-traceable.

  • Source family: peer-reviewed mechanism and kinetics literature. Status: Included
  • Source family: curated experimental repositories (including NIST where available). Status: Included
  • Evidence metadata tracked per case (citation/source type/observation type/provenance level). Status: Required
  • Synthetic generated sweeps for release claims. Status: Excluded

Download benchmark package

Machine-readable benchmark values and methodology for independent review.

Barrier Accuracy (GOLD / SILVER)

Summary JSON: Tier metrics, provenance summary, and policy flags.
Tier results CSV: GOLD and SILVER pass/fail counts and rates.
Dataset provenance CSV: Source families, observation types, and evidence policy by benchmark tier.
Methodology note (Markdown): Benchmark scope, pass criteria, and interpretation policy.

Pilot Validation (System Behavior)

Pilot validation JSON: 18-case results with per-case observations, check counts, and reproducibility metadata.
Pilot validation CSV: All 18 cases with status, check counts, and measured values.

Need the full benchmark package?

Pilot programs include benchmark manifests, case-level outputs, and methodology docs for independent review.


Benchmark basis

This benchmark measures a workflow engine built on Flux Physics scoring. It separates steering behavior, mechanism selection, and barrier accuracy rather than presenting one scalar property claim.

Flux Decision Engine