
MechanismOS Benchmark

Two independent benchmark suites: barrier accuracy validation (GOLD/SILVER) and system behavior & reliability validation (18-case pilot readiness).

  • GOLD direct Ea pass: 154/154
  • SILVER Arrhenius Ea pass: 1255/1323
  • Experimental-only cases: 100%
  • Source families: 3

Pass/fail is evaluated against pre-registered thresholds defined in the methodology note. GOLD and SILVER are reported separately and should not be conflated.

Validation tiers

The two tiers answer different quality questions and are reported separately.

Pass criteria (benchmark gate)

Thresholds are pre-registered in the methodology note to prevent post-hoc tuning. The intent is conservative: GOLD is the release gate; SILVER supports robustness analysis.

  • GOLD: Direct experimental Ea/barrier (kJ/mol). Pass if |ΔEa| ≤ X (threshold defined in methodology).
  • SILVER: Arrhenius-derived Ea from measured k(T). Pass if |ΔEa| ≤ Y (threshold defined in methodology).

Each case also records source family, observation type, and provenance level.
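
As an illustration of the gate logic only (not the pipeline code), each case reduces to comparing |ΔEa| against its tier's pre-registered threshold; field names below are placeholders and the thresholds X and Y remain defined in the methodology note.

```python
# Illustrative sketch only; the thresholds (X for GOLD, Y for SILVER)
# remain defined in the methodology note.
def case_passes(predicted_ea_kj_mol, reference_ea_kj_mol, tier_threshold_kj_mol):
    """Pass if the absolute barrier error is within the tier's pre-registered threshold."""
    return abs(predicted_ea_kj_mol - reference_ea_kj_mol) <= tier_threshold_kj_mol

def pass_rate(case_results):
    """Fraction of passing cases, e.g. 1255 of 1323 SILVER cases is roughly 0.949."""
    return sum(bool(r) for r in case_results) / len(case_results)
```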

Tier | Ground truth basis | Cases | Passed | Pass rate | Purpose
GOLD | Directly reported experimental Ea / barrier (kJ/mol) | 154 | 154 | 100% | Primary release gate for barrier-accuracy claims
SILVER | Arrhenius-derived Ea from measured k(T) | 1323 | 1255 | 94.9% | Generalization and failure-pattern discovery

Dataset provenance

All benchmark rows are experimental and source-traceable.

  • Source family: peer-reviewed mechanism and kinetics literature (Included)
  • Source family: curated experimental repositories, including NIST where available (Included)
  • Evidence metadata tracked per case: citation, source type, observation type, provenance level (Required)
  • Synthetically generated sweeps for release claims (Excluded)

Data policy and interpretation

How to read the numbers correctly.

GOLD is the quality gate

Production claims on barrier accuracy are tied to directly measured barriers only. These are hard-value checks against reported experimental kJ/mol values.

SILVER is supporting evidence

Arrhenius-derived Ea values are experimentally grounded but include transformation uncertainty. They are used for robustness and failure pattern analysis, not as the sole release gate.
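
For context on the transformation: an Arrhenius-derived Ea comes from fitting measured rate constants k(T) to ln k = ln A − Ea/(RT), so fit quality and the width of the temperature window add uncertainty on top of the measurements themselves. A minimal sketch with made-up k(T) values (not benchmark data):

```python
import numpy as np

R = 8.314  # gas constant, J/(mol*K)

def arrhenius_ea_kj_mol(temps_K, rate_constants):
    """Fit ln k = ln A - Ea/(R*T) by least squares and return Ea in kJ/mol."""
    x = 1.0 / np.asarray(temps_K, dtype=float)
    y = np.log(np.asarray(rate_constants, dtype=float))
    slope, _intercept = np.polyfit(x, y, 1)   # slope = -Ea/R
    return -slope * R / 1000.0

# Illustrative k(T) series only.
print(arrhenius_ea_kj_mol([280.0, 300.0, 320.0], [1.2e-5, 8.9e-5, 5.1e-4]))
```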

Run provenance

Validation cycle date: February 9, 2026. Reported from the MechanismOS tiered validation pipeline with reproducible suite definitions and artifacts.

Failure analysis (what we learn from SILVER)

Enterprise-grade benchmarks should explain where and why a model struggles — and how the product behaves in those regions.

Common “hard zones”

SILVER cases help identify boundary-adjacent regions and extrapolation risk. Typical hard zones include sparse leaving-group coverage, solvent regimes outside the calibration envelope, and cases with competing pathways of similar effective barrier.

  • Near-boundary: top-2 pathway margins are small → selectivity is sensitive to small condition changes.
  • Out-of-domain (OOD): solvent/feature distance exceeds calibrated support → confidence must drop.
  • Conflicting drivers: conditions push competing mechanisms simultaneously → mixed branching.

Product mitigation

MechanismOS surfaces these regions explicitly using confidence overlays and boundary hatching, and the optimizer defaults to robust operating windows unless brittle exploration is explicitly enabled.
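
As a rough illustration of how such regions can be flagged, the sketch below uses a top-2 margin rule for near-boundary detection and a feature-distance rule for OOD; the threshold values are placeholders, not MechanismOS's calibrated settings.

```python
def flag_hard_zone(pathway_fractions, feature_distance,
                   margin_threshold=0.10, ood_distance_threshold=1.0):
    """Flag near-boundary and out-of-domain conditions (thresholds are illustrative)."""
    top2 = sorted(pathway_fractions.values(), reverse=True)[:2]
    margin = top2[0] - top2[1]          # small top-2 margin -> competing pathways
    return {
        "near_boundary": margin < margin_threshold,
        "out_of_domain": feature_distance > ood_distance_threshold,
    }

# Example: SN2 barely ahead of E1, solvent well inside the calibrated envelope.
print(flag_hard_zone({"SN1": 0.21, "SN2": 0.33, "E1": 0.27, "E2": 0.19},
                     feature_distance=0.4))
```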

Pilot Validation Benchmark

System behavior, reliability, and safety validation — 18 test cases covering determinism, boundary detection, constraint enforcement, optimizer feasibility, robust/brittle diagnostics, audit trails, and multi-pathway competition.

  • Cases passed: 18/18
  • Individual checks passed: 94/94
  • Pass rate: 100%
  • Status: Pilot ready

This benchmark validates system-level behavior and reliability, not barrier accuracy. It is complementary to the GOLD/SILVER barrier validation above.

What this benchmark validates

Six independent dimensions of system quality, each tested by multiple cases.

Determinism

30 identical evaluations produce exactly zero variance — same selectivity, confidence, pathway label, and model hash every time. (Case I)

Honest Uncertainty

Boundary points, OOD solvents, and extreme temperatures are all correctly downranked. No false "high confidence" labels at domain edges. (Cases A, B, F, G)

Constraint Safety

Optimizer never violates hard constraints. Infeasible searches return relaxation suggestions. Policy blacklists are enforced with zero violations. (Cases C, D, K)

Surface Integrity

400-tile surfaces show smooth gradients (max 19.7% adjacent jump), valid mixture sums, proper boundary distribution, and 3+ pathway diversity. (Case O)

Physical Consistency

Correct substrate class trends (primary→SN2, tertiary→SN1), monotonic rate–temperature response, and robust polarity gradients. (Cases J, M, N)

Complete Audit Trail

End-to-end workflow produces a downloadable evidence bundle with session state, pins, optimizer results, surface metadata, and reproducibility block. (Case E)

Case-level results

All 18 cases grouped into two categories: pilot core cases (A–K) and extended behavioral benchmarks (L–R).

Case | Test name | Category | Checks | Key observations
A | Boundary Flip Sweep | Pilot | 6/6 | boundary conf=0.671, interior conf=0.960, near_boundary flagged
B | Calibration Gate | Pilot | 4/4 | uncalibrated, conf=0.671, tier=medium, 4 reason codes
C | Optimizer Hard Constraints | Pilot | 1/1 | infeasible correctly detected, no constraint violations
D | Infeasible + Relaxations | Pilot | 2/2 | SN2≥90% on tert-butyl infeasible, 2 relaxation suggestions
E | Audit & Evidence Pack | Pilot | 16/16 | 900 tiles, 2 pins, export ready, full bundle verified
F | OOD Solvent Probe | Pilot | 2/2 | hexane conf=0.681, optimizer avoids OOD
G | Extreme Temperature OOD | Pilot | 4/4 | 180K: conf=0.600 (low), 600K: conf=0.600 (low)
H | Robust vs Brittle Optimizer | Pilot | 3/3 | robust margin=1.0, zero near-boundary results
I | Determinism (30 runs) | Pilot | 4/4 | Δsel=0.000, Δconf=0.000, path=SN2×30
J | Rate Sensitivity | Pilot | 1/1 | rate(280K) < rate(320K) < rate(360K)
K | Policy Enforcement | Pilot | 11/11 | DMSO blacklisted, 0/10 violations
L | Fast vs Refined Agreement | Extended | 8/8 | both SN2, max fraction delta=0.000
M | Substrate Class Sweep | Extended | 5/5 | 1°→SN2, 2°→SN2, 3°→SN1(0.774), benzylic→SN1(0.817)
N | Solvent Polarity Gradient | Extended | 5/5 | SN2 dominant across hexane→DMSO, all >50%
O | Surface Tile Consistency | Extended | 6/6 | 400 tiles, max jump 19.7%, {SN1,SN2,E1}, 212/188 boundary split
P | Multi-Pathway Competition | Extended | 5/5 | 4-way: SN1=21.2%, SN2=32.8%, E1=26.7%, E2=19.4%, conf=0.591
Q | Feasible Optimizer | Extended | 5/5 | Feasible: 5 recs, 0 violations, top SN2=86.2%, solvents=[DMSO,ACN,DMF]
R | Diagnostic Robust vs Brittle | Extended | 6/6 | Robust: 8 recs / 0 boundary; Brittle: 10 recs / 2 boundary; modes differ

Machine-readable results are available in the download package below (JSON and CSV).

Notable results

Selected observations from the pilot validation run.

Multi-pathway competition (Case P)

Under moderate conditions (secondary substrate, ethanol, 340K, nucleophile=5.0), all four pathways compete with significant fractions: SN2=32.8%, E1=26.7%, SN1=21.2%, E2=19.4%. The system correctly flags this as low confidence (0.591) and near-boundary — it does not pretend to have a decisive answer when the chemistry is genuinely ambiguous.
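
One common way to reason about this kind of four-way split is to treat branching fractions as Boltzmann weights over effective barriers: pathways within a couple of kJ/mol of each other at 340 K necessarily produce a mixed distribution. The sketch below illustrates that relationship with hypothetical barriers; it is not a description of MechanismOS's internal model.

```python
import math

R_KJ = 8.314e-3  # gas constant in kJ/(mol*K)

def branching_fractions(effective_barriers_kj_mol, temp_K):
    """Boltzmann-weighted branching over effective barriers (illustrative model only)."""
    weights = {p: math.exp(-ea / (R_KJ * temp_K))
               for p, ea in effective_barriers_kj_mol.items()}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}

# Hypothetical barriers about 1.5 kJ/mol apart at 340 K give the kind of mixed
# SN1/SN2/E1/E2 split reported for Case P.
print(branching_fractions({"SN2": 80.0, "E1": 80.6, "SN1": 81.2, "E2": 81.5}, 340.0))
```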

Substrate class predictions (Case M)

With appropriate nucleophile and solvent conditions, the engine correctly separates the four substrate classes: primary and secondary substrates favor SN2 (74.8%), while tertiary (SN1=77.4%) and benzylic (SN1=81.7%) substrates with weak nucleophiles in protic solvents favor SN1 — matching textbook organic chemistry.

Surface coherence (Case O)

A 400-tile (20×20) control surface shows exactly zero invalid mixture sums, a maximum adjacent-tile selectivity jump of 19.7% (well below the 50% anomaly threshold), three distinct pathway regions (SN1, SN2, E1), and a roughly even 212/188 boundary–interior split — indicating a physically smooth response surface.
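
A minimal sketch of this kind of surface check, assuming each tile stores per-pathway fractions on a rectangular grid (array layout, tolerance, and the use of the dominant-pathway fraction are assumptions):

```python
import numpy as np

def surface_checks(fractions, tol=1e-6):
    """Check a (rows, cols, n_pathways) surface: mixture sums near 1.0 and the
    maximum adjacent-tile jump in the dominant-pathway fraction."""
    sums_ok = bool(np.all(np.abs(fractions.sum(axis=-1) - 1.0) < tol))
    dominant = fractions.max(axis=-1)            # leading pathway's fraction per tile
    max_jump = max(np.abs(np.diff(dominant, axis=0)).max(),
                   np.abs(np.diff(dominant, axis=1)).max())
    return {"valid_mixture_sums": sums_ok, "max_adjacent_jump": float(max_jump)}

# Example on a random but normalized 20x20x3 surface.
rng = np.random.default_rng(0)
tiles = rng.random((20, 20, 3))
tiles /= tiles.sum(axis=-1, keepdims=True)
print(surface_checks(tiles))
```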

Methodology: pilot validation

How these tests work.

  • Test subjects: 2-bromobutane (2°), tert-butyl bromide (3°), 1-bromopropane (1°), benzyl bromide (benzylic)
  • Solvents: ethanol, water, DMSO, acetonitrile, DMF, THF, hexane (ε range: 1.9–80)
  • Conditions: T = 180–600 K including OOD probes; nucleophile strength 1.5–8.9
  • API endpoints: evaluate, surface, optimize, export, pins, session, export download
  • Execution: live API calls against the production engine, no mocking or simulation
  • Runtime: ~166 seconds for all 18 cases and 94 checks

Model Behavior Contract

These are product-level invariants validated by the Pilot Validation Benchmark. MechanismOS is designed to be trustworthy at the edges — not just accurate in the middle.

1) No false certainty near boundaries

When competing pathways are close (small margin), MechanismOS explicitly flags the region and reduces confidence. Boundary regions are visualized with hatching/fog and reason codes appear in the decision stack.

  • UI behavior: boundary hatching + confidence fog, "Near boundary" warning, reason codes.
  • Optimizer behavior: robust mode prefers high-margin operating windows; brittle mode requires explicit opt-in.
  • Validated by: Cases A, B, O, R.

2) Honest uncertainty outside calibrated support (OOD / uncalibrated)

For solvents/temperatures/features outside the calibration envelope, MechanismOS downgrades confidence and surfaces an OOD/uncalibrated signal. The system will not label these regions as “Recommended.”

  • UI behavior: “Uncalibrated/OOD” tier, fog overlay, and clear caution labels.
  • Optimizer behavior: avoids OOD regions by default and requires explicit allowance to explore them.
  • Validated by: Cases F, G, B.

3) Hard constraints are never violated

If a search is infeasible, MechanismOS returns “no feasible solution” plus relaxation suggestions. It does not silently break constraints to return a result. A minimal illustration of this contract follows the list below.

  • Constraints: temperature caps, selectivity minima, impurity caps, solvent policies, confidence minima.
  • Outputs: feasibility flag, explicit violations (if any), and recommended relaxations.
  • Validated by: Cases C, D, K, Q.
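
The sketch below expresses the contract as a test-side assertion; the result fields are placeholders, not the actual MechanismOS API.

```python
from dataclasses import dataclass, field

@dataclass
class OptimizationResult:                      # placeholder shape, not the real API
    feasible: bool
    recommendations: list = field(default_factory=list)
    violations: list = field(default_factory=list)           # must always stay empty
    relaxation_suggestions: list = field(default_factory=list)

def assert_constraint_contract(result: OptimizationResult) -> None:
    """Hard constraints are never violated; infeasible searches must explain themselves."""
    assert not result.violations, "hard-constraint violation: contract broken"
    if not result.feasible:
        assert result.relaxation_suggestions, "infeasible result must suggest relaxations"

# Case D analogue: SN2 >= 90% on a tertiary substrate is infeasible, so the result
# should carry relaxation suggestions instead of recommendations.
assert_constraint_contract(OptimizationResult(
    feasible=False,
    relaxation_suggestions=["lower the SN2 selectivity floor",
                            "allow a different substrate class"]))
```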

4) Reproducibility and audit-grade traceability

Every evaluation, surface, pin, and optimization run is versioned and exportable as an evidence bundle. Identical inputs produce identical outputs under a fixed model hash. A minimal sketch of the determinism check follows the list below.

  • Determinism: repeated evaluations match within tolerance (no drift).
  • Evidence pack: surface config, pins, optimizer inputs/outputs, reason codes, model & calibration versions.
  • Validated by: Cases I, E.
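
This sketch mirrors the Case I determinism check, assuming a caller-supplied evaluate() function and the output field names shown (both are placeholders, not the MechanismOS API):

```python
import hashlib
import json

def fingerprint(output):
    """Stable fingerprint over the fields that must not drift between runs."""
    keys = ("selectivity", "confidence", "pathway", "model_hash")  # assumed field names
    payload = json.dumps({k: output[k] for k in keys}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def check_determinism(evaluate, conditions, runs=30):
    """True if repeated evaluations of identical inputs give identical outputs."""
    fingerprints = {fingerprint(evaluate(conditions)) for _ in range(runs)}
    return len(fingerprints) == 1
```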

Operating window stories

Three examples of how chemists use MechanismOS to steer reactions and justify decisions under real constraints.

Story 1 — Steering away from the boundary

A chemist is getting inconsistent selectivity because the process is operating near a mechanistic boundary. MechanismOS shows the boundary line (hatching) and the confidence fog, then suggests a small move (ΔT, solvent change, or nucleophile shift) that increases margin.

  • Outcome: stable selectivity region (“operating window”) rather than a single-point guess.
  • Evidence: pinned A/B comparison + exported surface slice.
  • Benchmark anchors: A, O, R.

Story 2 — When chemistry is genuinely ambiguous

In some regimes, multiple pathways compete. Instead of returning a crisp label, MechanismOS shows a mixed branching distribution and lowers confidence. The chemist can explore nearby conditions to see which lever (solvent polarity, temperature, nucleophile strength) resolves the competition.

  • Outcome: faster diagnosis of why results vary — and which lever will stabilize outcomes.
  • Evidence: multi-pathway breakdown + “what changed” deltas.
  • Benchmark anchors: P, A.

Story 3 — Enterprise constraints + solvent policy compliance

Process teams must satisfy temperature caps, impurity constraints, and solvent policy (whitelist/blacklist). MechanismOS runs a constraint-aware optimization, returns feasible condition sets, and produces an audit-ready evidence pack for sign-off.

  • Outcome: feasible operating windows that comply with policy — or a clear infeasible result with relaxations.
  • Evidence: optimization results + constraint report + exported bundle.
  • Benchmark anchors: Q, C, D, K, E.

Download benchmark package

Machine-readable benchmark values and methodology for independent review.

Barrier Accuracy (GOLD / SILVER)

  • Summary JSON: tier metrics, provenance summary, and policy flags.
  • Tier results CSV: GOLD and SILVER pass/fail counts and rates.
  • Dataset provenance CSV: source families, observation types, and evidence policy by benchmark tier.
  • Methodology note (Markdown): benchmark scope, pass criteria, and interpretation policy.

Pilot Validation (System Behavior)

  • Pilot validation JSON: 18-case results with per-case observations, check counts, and reproducibility metadata.
  • Pilot validation CSV: all 18 cases with status, check counts, and measured values.

Need the full benchmark package?

Pilot programs include benchmark manifests, case-level outputs, and methodology docs for independent review.
