MechanismOS Benchmark
Two independent benchmark suites: barrier accuracy validation (GOLD/SILVER) and system behavior & reliability validation (18-case pilot readiness).
Pass/fail is evaluated against thresholds pre-registered in the methodology note, which prevents post-hoc tuning. The two tiers answer different quality questions, are reported separately, and should not be conflated. The intent is conservative: GOLD is the release gate, while SILVER supports robustness analysis.
Each case also records source family, observation type, and provenance level.
| Tier | Ground truth basis | Cases | Passed | Pass rate | Purpose |
|---|---|---|---|---|---|
| GOLD | Directly reported experimental Ea / barrier (kJ/mol) | 154 | 154 | 100.00% | Primary release gate for barrier-accuracy claims |
| SILVER | Arrhenius-derived Ea from measured k(T) | 1323 | 1255 | 94.86% | Generalization and failure-pattern discovery |
All benchmark rows are experimental and source-traceable.
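The tier results above can be checked mechanically. The sketch below is not the MechanismOS pipeline itself; it is a minimal illustration of computing per-tier pass rates and gating a release on GOLD only, with an illustrative (not the registered) threshold value:

```python
# Illustrative sketch: per-tier pass rates and a GOLD-only release gate.
# Threshold values here are assumptions, not the pre-registered ones.
from dataclasses import dataclass


@dataclass
class TierResult:
    name: str
    cases: int
    passed: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.cases


# Assumed gate: only GOLD blocks release; SILVER informs robustness analysis.
GATE_THRESHOLDS = {"GOLD": 1.00}


def release_gate_ok(tiers: list[TierResult]) -> bool:
    """True when every gated tier meets its pre-registered threshold."""
    return all(
        t.pass_rate >= GATE_THRESHOLDS[t.name]
        for t in tiers
        if t.name in GATE_THRESHOLDS
    )


tiers = [TierResult("GOLD", 154, 154), TierResult("SILVER", 1323, 1255)]
print(release_gate_ok(tiers))        # True: GOLD at 100%
print(round(tiers[1].pass_rate, 4))  # 0.9486
```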
How to read the numbers correctly.
Production claims on barrier accuracy are tied to directly measured barriers only. These are hard-value checks against reported experimental kJ/mol values.
Arrhenius-derived Ea values are experimentally grounded but include transformation uncertainty. They are used for robustness and failure pattern analysis, not as the sole release gate.
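The Arrhenius transformation behind SILVER ground truth is standard: fit ln k = ln A − Ea/(RT) and read Ea from the slope. A minimal sketch with synthetic data (not benchmark data) shows the derivation:

```python
# Sketch: deriving Ea (kJ/mol) from measured k(T) via a linear fit of
# ln k against 1/T. The data below are synthetic, for illustration only.
import math

R = 8.314462618e-3  # gas constant, kJ/(mol*K)


def arrhenius_ea(temps_K, rate_constants):
    """Least-squares slope of ln k vs 1/T; Ea = -slope * R (kJ/mol)."""
    xs = [1.0 / T for T in temps_K]
    ys = [math.log(k) for k in rate_constants]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sum(
        (x - xbar) ** 2 for x in xs
    )
    return -slope * R


# Synthetic rate constants generated from Ea = 75 kJ/mol, A = 1e12 s^-1
Ea_true, A = 75.0, 1e12
temps = [280.0, 300.0, 320.0, 340.0]
ks = [A * math.exp(-Ea_true / (R * T)) for T in temps]
print(round(arrhenius_ea(temps, ks), 3))  # 75.0 (exact data recovers Ea)
```

Real measured k(T) carries noise, so the fitted Ea inherits that transformation uncertainty; this is why SILVER is not the sole release gate.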
Validation cycle date: February 9, 2026. Reported from the MechanismOS tiered validation pipeline with reproducible suite definitions and artifacts.
Enterprise-grade benchmarks should explain where and why a model struggles — and how the product behaves in those regions.
SILVER cases help identify boundary-adjacent regions and extrapolation risk. Typical hard zones include sparse leaving-group coverage, solvent regimes outside the calibration envelope, and cases with competing pathways of similar effective barrier.
MechanismOS surfaces these regions explicitly using confidence overlays and boundary hatching, and the optimizer defaults to robust operating windows unless brittle exploration is explicitly enabled.
System behavior, reliability, and safety validation — 18 test cases covering determinism, boundary detection, constraint enforcement, optimizer feasibility, robust/brittle diagnostics, audit trails, and multi-pathway competition.
This benchmark validates system-level behavior and reliability, not barrier accuracy. It is complementary to the GOLD/SILVER barrier validation above.
Six independent dimensions of system quality, each tested by multiple cases.
30 identical evaluations produce exactly zero variance — same selectivity, confidence, pathway label, and model hash every time. (Case I)
Boundary points, OOD solvents, and extreme temperatures are all correctly downranked. No false "high confidence" labels at domain edges. (Cases A, B, F, G)
Optimizer never violates hard constraints. Infeasible searches return relaxation suggestions. Policy blacklists are enforced with zero violations. (Cases C, D, K)
400-tile surfaces show smooth gradients (max 19.7% adjacent jump), valid mixture sums, proper boundary distribution, and 3+ pathway diversity. (Case O)
Correct substrate class trends (primary→SN2, tertiary→SN1), monotonic rate–temperature response, and robust polarity gradients. (Cases J, M, N)
End-to-end workflow produces a downloadable evidence bundle with session state, pins, optimizer results, surface metadata, and reproducibility block. (Case E)
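A determinism check in the spirit of Case I can be sketched as follows. The `evaluate` stub is a hypothetical stand-in for the real engine, not its actual API; the point is the invariant: 30 identical runs must hash identically.

```python
# Sketch of a Case I-style determinism check: run the same evaluation 30
# times and require identical outputs. `evaluate` is a hypothetical stub.
import hashlib
import json


def evaluate(substrate: str, solvent: str, temp_K: float) -> dict:
    # Stub: a real engine would compute selectivity, confidence, pathway,
    # and report a model hash. Deterministic by construction here.
    return {"pathway": "SN2", "selectivity": 0.862, "confidence": 0.948,
            "model_hash": "abc123"}


def run_hash(result: dict) -> str:
    """Canonical JSON -> SHA-256, so identical results hash identically."""
    return hashlib.sha256(
        json.dumps(result, sort_keys=True).encode()
    ).hexdigest()


hashes = {run_hash(evaluate("1-bromobutane", "DMSO", 320.0))
          for _ in range(30)}
print(len(hashes) == 1)  # True: zero variance across 30 identical runs
```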
All 18 cases grouped into two categories: pilot core cases (A–K) and extended behavioral benchmarks (L–R).
| Pass | Case | Test name | Category | Checks | Key observations |
|---|---|---|---|---|---|
| ✓ | A | Boundary Flip Sweep | Pilot | 6/6 | boundary conf=0.671, interior conf=0.960, near_boundary flagged |
| ✓ | B | Calibration Gate | Pilot | 4/4 | uncalibrated, conf=0.671, tier=medium, 4 reason codes |
| ✓ | C | Optimizer Hard Constraints | Pilot | 1/1 | infeasible correctly detected, no constraint violations |
| ✓ | D | Infeasible + Relaxations | Pilot | 2/2 | SN2≥90% on tert-butyl infeasible, 2 relaxation suggestions |
| ✓ | E | Audit & Evidence Pack | Pilot | 16/16 | 900 tiles, 2 pins, export ready, full bundle verified |
| ✓ | F | OOD Solvent Probe | Pilot | 2/2 | hexane conf=0.681, optimizer avoids OOD |
| ✓ | G | Extreme Temperature OOD | Pilot | 4/4 | 180K: conf=0.600 (low), 600K: conf=0.600 (low) |
| ✓ | H | Robust vs Brittle Optimizer | Pilot | 3/3 | robust margin=1.0, zero near-boundary results |
| ✓ | I | Determinism (30 runs) | Pilot | 4/4 | Δsel=0.000, Δconf=0.000, path=SN2×30 |
| ✓ | J | Rate Sensitivity | Pilot | 1/1 | rate(280K) < rate(320K) < rate(360K) |
| ✓ | K | Policy Enforcement | Pilot | 11/11 | DMSO blacklisted, 0/10 violations |
| ✓ | L | Fast vs Refined Agreement | Extended | 8/8 | both SN2, max fraction delta=0.000 |
| ✓ | M | Substrate Class Sweep | Extended | 5/5 | 1°→SN2, 2°→SN2, 3°→SN1(0.774), benzylic→SN1(0.817) |
| ✓ | N | Solvent Polarity Gradient | Extended | 5/5 | SN2 dominant across hexane→DMSO, all >50% |
| ✓ | O | Surface Tile Consistency | Extended | 6/6 | 400 tiles, max jump 19.7%, {SN1,SN2,E1}, 212/188 boundary split |
| ✓ | P | Multi-Pathway Competition | Extended | 5/5 | 4-way: SN1=21.2%, SN2=32.8%, E1=26.7%, E2=19.4%, conf=0.591 |
| ✓ | Q | Feasible Optimizer | Extended | 5/5 | Feasible: 5 recs, 0 violations, top SN2=86.2%, solvents=[DMSO,ACN,DMF] |
| ✓ | R | Diagnostic Robust vs Brittle | Extended | 6/6 | Robust: 8 recs / 0 boundary; Brittle: 10 recs / 2 boundary; modes differ |
Machine-readable results are available in the download package below (JSON and CSV).
Selected observations from the pilot validation run.
Under moderate conditions (secondary substrate, ethanol, 340K, nucleophile=5.0), all four pathways compete with significant fractions: SN2=32.8%, E1=26.7%, SN1=21.2%, E2=19.4%. The system correctly flags this as low confidence (0.591) and near-boundary — it does not pretend to have a decisive answer when the chemistry is genuinely ambiguous.
With appropriate nucleophile and solvent conditions, the engine correctly separates the four substrate classes: primary and secondary substrates favor SN2 (74.8%), while tertiary (SN1=77.4%) and benzylic (SN1=81.7%) substrates with weak nucleophiles in protic solvents favor SN1 — matching textbook organic chemistry.
A 400-tile (20×20) control surface shows exactly zero invalid mixture sums, a maximum adjacent-tile selectivity jump of 19.7% (well below the 50% anomaly threshold), three distinct pathway regions (SN1, SN2, E1), and an even 212/188 boundary–interior split — indicating a physically smooth response surface.
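The two Case O checks above are simple grid invariants. A minimal sketch on a synthetic 20×20 surface (the 50% jump threshold is taken from the text; everything else here is illustrative):

```python
# Sketch of Case O-style surface checks on a 20x20 tile grid: no adjacent
# tiles may differ by more than the 50% anomaly threshold, and each tile's
# pathway fractions must sum to 1. The grid below is synthetic.

def max_adjacent_jump(grid):
    """Largest absolute selectivity difference between 4-neighbour tiles."""
    n, m = len(grid), len(grid[0])
    jump = 0.0
    for i in range(n):
        for j in range(m):
            if i + 1 < n:
                jump = max(jump, abs(grid[i][j] - grid[i + 1][j]))
            if j + 1 < m:
                jump = max(jump, abs(grid[i][j] - grid[i][j + 1]))
    return jump


def mixture_sums_valid(fractions, tol=1e-6):
    """Each tile's pathway fractions must sum to 1 within tolerance."""
    return all(abs(sum(tile) - 1.0) <= tol for tile in fractions)


# Synthetic smooth 20x20 selectivity surface and per-tile pathway mixtures
grid = [[0.5 + 0.002 * i + 0.001 * j for j in range(20)] for i in range(20)]
fractions = [(0.3, 0.5, 0.2)] * 400

print(max_adjacent_jump(grid) <= 0.50)  # True: below anomaly threshold
print(mixture_sums_valid(fractions))    # True: all sums valid
```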
How these tests work.
These are product-level invariants validated by the Pilot Validation Benchmark. MechanismOS is designed to be trustworthy at the edges — not just accurate in the middle.
When competing pathways are close (small margin), MechanismOS explicitly flags the region and reduces confidence. Boundary regions are visualized with hatching/fog and reason codes appear in the decision stack.
For solvents/temperatures/features outside the calibration envelope, MechanismOS downgrades confidence and surfaces an OOD/uncalibrated signal. The system will not label these regions as “Recommended.”
If a search is infeasible, MechanismOS returns “no feasible solution” plus relaxation suggestions. It does not silently break constraints to return a result.
Every evaluation, surface, pin, and optimization run is versioned and exportable as an evidence bundle. Identical inputs produce identical outputs under a fixed model hash.
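The evidence-bundle invariant reduces to a canonical-serialization checksum. The schema below is hypothetical, not the real export format; it illustrates how a reviewer could re-derive a bundle checksum independently:

```python
# Sketch of an evidence-bundle reproducibility checksum (hypothetical
# schema): canonical JSON serialization means identical bundles always
# produce the same SHA-256 digest.
import hashlib
import json


def bundle_checksum(bundle: dict) -> str:
    """Canonical JSON (sorted keys, fixed separators) -> SHA-256 digest."""
    canonical = json.dumps(bundle, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


bundle = {
    "model_hash": "abc123",  # fixed model version (illustrative value)
    "inputs": {"substrate": "2-bromopropane", "solvent": "EtOH", "T_K": 340},
    "outputs": {"pathway": "SN2", "confidence": 0.948},
    "pins": [],
    "optimizer_results": [],
}

# Round-tripping the bundle through JSON leaves the checksum unchanged.
print(bundle_checksum(bundle) == bundle_checksum(json.loads(json.dumps(bundle))))
# True
```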
Three examples of how chemists use MechanismOS to steer reactions and justify decisions under real constraints.
A chemist is getting inconsistent selectivity because the process is operating near a mechanistic boundary. MechanismOS shows the boundary line (hatching) and the confidence fog, then suggests a small move (ΔT, solvent change, or nucleophile shift) that increases margin.
In some regimes, multiple pathways compete. Instead of returning a crisp label, MechanismOS shows a mixed branching distribution and lowers confidence. The chemist can explore nearby conditions to see which lever (solvent polarity, temperature, nucleophile strength) resolves the competition.
Process teams must satisfy temperature caps, impurity constraints, and solvent policy (whitelist/blacklist). MechanismOS runs a constraint-aware optimization, returns feasible condition sets, and produces an audit-ready evidence pack for sign-off.
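The hard-constraint behavior described in that use case can be sketched as a filter that reports infeasibility rather than relaxing constraints. The candidate shape and policy names below are hypothetical:

```python
# Sketch of constraint-aware candidate filtering (hypothetical policy
# shape): enforce a temperature cap and a solvent blacklist, and report
# infeasibility instead of silently breaking constraints.

def feasible(candidates, t_cap_K, blacklist):
    """Return candidates passing all hard constraints, or None if empty."""
    ok = [c for c in candidates
          if c["T_K"] <= t_cap_K and c["solvent"] not in blacklist]
    return ok if ok else None  # None signals "no feasible solution"


candidates = [
    {"solvent": "DMSO", "T_K": 320},  # rejected: blacklisted solvent
    {"solvent": "ACN", "T_K": 400},   # rejected: exceeds temperature cap
    {"solvent": "DMF", "T_K": 330},   # passes both constraints
]

print(feasible(candidates, t_cap_K=350, blacklist={"DMSO"}))
# [{'solvent': 'DMF', 'T_K': 330}]
```

A real optimizer would also generate relaxation suggestions when the result is `None`, as in Case D.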
Machine-readable benchmark values and methodology are available for independent review. Pilot programs include benchmark manifests, case-level outputs, and methodology docs.