MechanismOS Benchmark
Two independent benchmark suites: barrier accuracy validation (GOLD/SILVER) and system behavior & reliability validation (18-case pilot readiness).
Pass/fail is evaluated against benchmark criteria fixed before evaluation, so the reported pass rates are not tuned after the fact.
The two tiers answer different quality questions and should not be conflated: GOLD is the primary release gate; SILVER provides supporting robustness evidence. They are reported separately.
Each case also records source family, observation type, and provenance level.
| Tier | Ground truth basis | Cases | Passed | Pass rate | Purpose |
|---|---|---|---|---|---|
| GOLD | Directly reported experimental Ea / barrier (kJ/mol) | 154 | 152 | 98.70% | Primary release gate for barrier-accuracy claims |
| SILVER | Arrhenius-derived Ea from measured k(T) | 1323 | 1255 | 94.86% | Generalization and failure-pattern discovery |
All benchmark rows are experimental and source-traceable.
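As an illustration, tier pass rates like those in the table can be recomputed from case-level records. This is a minimal sketch with hypothetical field names, not the actual schema of the machine-readable download package.

```python
from collections import defaultdict

# Hypothetical case-level records; the real package may use
# different field names and many more cases.
cases = [
    {"tier": "GOLD", "passed": True},
    {"tier": "GOLD", "passed": True},
    {"tier": "GOLD", "passed": False},
    {"tier": "SILVER", "passed": True},
    {"tier": "SILVER", "passed": False},
]

totals = defaultdict(lambda: [0, 0])  # tier -> [passed, total]
for case in cases:
    totals[case["tier"]][1] += 1
    totals[case["tier"]][0] += int(case["passed"])

for tier in sorted(totals):
    passed, total = totals[tier]
    print(f"{tier}: {passed}/{total} = {100 * passed / total:.2f}%")
```

Running the same aggregation over the published case files should reproduce the 152/154 and 1255/1323 figures above.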
How to read the numbers correctly.
Production claims on barrier accuracy are tied to directly measured barriers only. These are hard-value checks against reported experimental kJ/mol values.
Arrhenius-derived Ea values are experimentally grounded but include transformation uncertainty. They are used for robustness and failure pattern analysis, not as the sole release gate.
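For context on what "Arrhenius-derived" means, here is the standard two-point transformation from generic chemistry (not MechanismOS internals): Ea = R · ln(k2/k1) / (1/T1 − 1/T2).

```python
import math

R = 8.314462618e-3  # molar gas constant in kJ/(mol*K), matching the Ea units above

def arrhenius_ea(t1, k1, t2, k2):
    """Two-point Arrhenius Ea (kJ/mol) from measured rate constants k(T)."""
    return R * math.log(k2 / k1) / (1.0 / t1 - 1.0 / t2)

# Classic rule of thumb: a rate that doubles between 300 K and 310 K
# corresponds to an Ea of roughly 53-54 kJ/mol.
ea = arrhenius_ea(300.0, 1.0, 310.0, 2.0)

# Transformation uncertainty: a 5% error in one measured k shifts the
# derived Ea by several kJ/mol here, which is why Arrhenius-derived
# values carry extra uncertainty relative to directly reported barriers.
ea_perturbed = arrhenius_ea(300.0, 1.0, 310.0, 2.0 * 1.05)
```

The perturbation line makes the point behind the tiering: a small measurement error in k(T) propagates into a visible Ea shift, so these values support robustness analysis rather than hard-value barrier checks.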
GOLD/SILVER validation cycle: February 9, 2026. Textbook validation: March 20, 2026. GOLD re-verified March 20, 2026 (152/154 after CASF-era binding recalibration). Reported from the MechanismOS tiered validation pipeline with reproducible suite definitions and artifacts.
Enterprise-grade benchmarks should explain where and why a model struggles — and how the product behaves in those regions.
SILVER cases help identify ambiguous and weak-support regions. Typical hard zones include less-represented chemistry, less-supported condition regimes, and cases with competing pathways of similar effective barrier.
MechanismOS surfaces these regions explicitly with conservative confidence signaling and defaults to more robust operating windows when the chemistry looks ambiguous.
56 textbook organic chemistry reactions covering SN1, SN2, E1, E2, and E1CB mechanisms — from unambiguous classics to genuinely ambiguous condition-dependent cases. 98.2% accuracy. Full case study →
| Difficulty | Reactions | Correct | Accuracy |
|---|---|---|---|
| Unambiguous — classic textbook cases | 18 | 18 | 100% |
| Moderate — multi-factor decisions | 19 | 18 | 95% |
| Ambiguous — genuinely debatable | 11 | 11 | 100% |
| Condition-dependent — mechanism flips with conditions | 8 | 8 | 100% |
| Total | 56 | 55 | 98.2% |
Single miss: tertiary substrate + KOtBu in THF where SN1 (50%) narrowly beats E2 — a genuinely borderline case. Dataset curated from Clayden, Bruice, Wade, and March textbooks.
System behavior, reliability, and safety validation across 18 pilot cases covering consistency, conservative uncertainty handling, constraint safety, auditability, and chemically ambiguous regimes.
This benchmark validates system-level behavior and reliability, not barrier accuracy. It is complementary to the GOLD/SILVER barrier validation above.
Six independent dimensions of system quality, each tested by multiple cases.
Repeated evaluations are stable and reproducible under fixed inputs. (Case I)
Ambiguous and weak-support conditions are downranked rather than overstated. (Cases A, B, F, G)
Hard constraints are respected, and infeasible searches surface as infeasible instead of being silently compromised. (Cases C, D, K)
Response surfaces remain smooth, coherent, and chemically plausible under pilot evaluation. (Case O)
Correct substrate class trends (primary→SN2, tertiary→SN1), monotonic rate–temperature response, and robust polarity gradients. (Cases J, M, N)
End-to-end workflow produces a reproducible evidence package suitable for review and handoff. (Case E)
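The chemical-sense dimension can be sketched as a trend check over predicted branching fractions. The record shape and most values here are illustrative assumptions, not the product's actual API; the tertiary and benzylic SN1 fractions echo the Case M observations in the case table.

```python
# Expected dominant pathway per substrate class, per textbook trends.
EXPECTED_TOP = {
    "primary": "SN2",
    "secondary": "SN2",
    "tertiary": "SN1",
    "benzylic": "SN1",
}

def top_pathway(fractions):
    """Return the pathway with the largest predicted branching fraction."""
    return max(fractions, key=fractions.get)

# Illustrative branching fractions (primary/secondary rows are made up;
# the tertiary/benzylic SN1 values match the Case M observations).
predictions = {
    "primary":   {"SN2": 0.90, "SN1": 0.02, "E2": 0.08},
    "secondary": {"SN2": 0.70, "SN1": 0.15, "E2": 0.15},
    "tertiary":  {"SN1": 0.774, "SN2": 0.05, "E1": 0.176},
    "benzylic":  {"SN1": 0.817, "SN2": 0.08, "E1": 0.103},
}

all_correct = all(
    top_pathway(f) == EXPECTED_TOP[s] for s, f in predictions.items()
)
```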
All 18 cases grouped into two categories: pilot core cases (A–K) and extended behavioral benchmarks (L–R).
| Pass | Case | Test name | Category | Checks | Key observations |
|---|---|---|---|---|---|
| ✓ | A | Boundary Flip Sweep | Pilot | 6/6 | boundary conf=0.671, interior conf=0.960, near_boundary flagged |
| ✓ | B | Calibration Gate | Pilot | 4/4 | weaker-support region correctly downgraded with explicit caution signaling |
| ✓ | C | Optimizer Hard Constraints | Pilot | 1/1 | infeasible correctly detected, no constraint violations |
| ✓ | D | Infeasible + Relaxations | Pilot | 2/2 | SN2≥90% on tert-butyl infeasible, 2 relaxation suggestions |
| ✓ | E | Audit & Evidence Pack | Pilot | 16/16 | 900 tiles, 2 pins, export ready, full bundle verified |
| ✓ | F | Weak-Support Solvent Probe | Pilot | 2/2 | weak-support solvent regime is downranked and avoided by default |
| ✓ | G | Extreme Temperature Stress Test | Pilot | 4/4 | 180K: conf=0.600 (low), 600K: conf=0.600 (low) |
| ✓ | H | Robust vs Brittle Optimizer | Pilot | 3/3 | robust margin=1.0, zero near-boundary results |
| ✓ | I | Determinism (30 runs) | Pilot | 4/4 | Δsel=0.000, Δconf=0.000, path=SN2×30 |
| ✓ | J | Rate Sensitivity | Pilot | 1/1 | rate(280K) < rate(320K) < rate(360K) |
| ✓ | K | Policy Enforcement | Pilot | 11/11 | DMSO blacklisted, 0/10 violations |
| ✓ | L | Fast vs Refined Agreement | Extended | 8/8 | both SN2, max fraction delta=0.000 |
| ✓ | M | Substrate Class Sweep | Extended | 5/5 | 1°→SN2, 2°→SN2, 3°→SN1(0.774), benzylic→SN1(0.817) |
| ✓ | N | Solvent Polarity Gradient | Extended | 5/5 | SN2 dominant across hexane→DMSO, all >50% |
| ✓ | O | Surface Tile Consistency | Extended | 6/6 | 400 tiles, max jump 19.7%, {SN1,SN2,E1}, 212/188 boundary split |
| ✓ | P | Multi-Pathway Competition | Extended | 5/5 | 4-way: SN1=21.2%, SN2=32.8%, E1=26.7%, E2=19.4%, conf=0.591 |
| ✓ | Q | Feasible Optimizer | Extended | 5/5 | Feasible: 5 recs, 0 violations, top SN2=86.2%, solvents=[DMSO,ACN,DMF] |
| ✓ | R | Diagnostic Robust vs Brittle | Extended | 6/6 | Robust: 8 recs / 0 boundary; Brittle: 10 recs / 2 boundary; modes differ |
Machine-readable results are available in the download package below (JSON and CSV).
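Several of the case checks above can be re-verified directly from those files. This sketch assumes simple hypothetical record shapes, not the actual JSON/CSV schema, and mirrors a Case J monotonicity check and a Case I determinism check.

```python
# Case J style: rate must increase monotonically with temperature.
rates_by_temp = {280.0: 1.2e-4, 320.0: 3.9e-3, 360.0: 6.1e-2}  # illustrative values

def monotonic_in_temperature(rates):
    """True if rate strictly increases with temperature."""
    temps = sorted(rates)
    return all(rates[a] < rates[b] for a, b in zip(temps, temps[1:]))

# Case I style: repeated runs under fixed inputs must agree exactly.
runs = [{"path": "SN2", "sel": 0.862, "conf": 0.96}] * 30

deterministic = all(run == runs[0] for run in runs)
```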
Selected observations from the pilot validation run.
In genuinely mixed regimes, multiple pathways compete without a clear winner. The system lowers confidence rather than forcing a crisp answer when the chemistry is ambiguous.
With appropriate nucleophile and solvent conditions, the engine correctly separates the four substrate classes: primary and secondary substrates favor SN2 (74.8%), while tertiary (SN1=77.4%) and benzylic (SN1=81.7%) substrates with weak nucleophiles in protic solvents favor SN1 — matching textbook organic chemistry.
Control-surface analysis shows a smooth and physically coherent response landscape rather than brittle or anomalous behavior.
How these tests work.
These are product-level trust properties validated by the pilot benchmark. MechanismOS is designed to be trustworthy at the edges, not just accurate in the middle.
When competing pathways are close, MechanismOS explicitly flags the region and reduces confidence instead of overstating certainty.
For weak-support operating regions, MechanismOS downgrades confidence and surfaces clear caution signals. The system does not present these regions as straightforward recommendations.
If a search is infeasible, MechanismOS reports it as infeasible rather than silently weakening constraints to force an answer.
Every evaluation, surface, pin, and optimization run is versioned and exportable as an evidence bundle. Identical inputs produce identical outputs under a fixed release version.
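One common way to operationalize "identical inputs produce identical outputs" is a canonical content digest over exported results. This is a generic sketch of the idea, not the product's actual bundle format.

```python
import hashlib
import json

def bundle_digest(results):
    """SHA-256 digest of results serialized as canonical JSON."""
    # Sorted keys and fixed separators make the digest depend only on
    # content, not on key order or whitespace.
    blob = json.dumps(results, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

run_a = {"pathway": "SN2", "selectivity": 0.862, "confidence": 0.96}
run_b = {"confidence": 0.96, "pathway": "SN2", "selectivity": 0.862}

# Same content in a different key order yields the same digest,
# so two runs can be compared with a single string equality check.
match = bundle_digest(run_a) == bundle_digest(run_b)
```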
Three examples of how chemists use MechanismOS to steer reactions and justify decisions under real constraints.
A chemist is getting inconsistent selectivity because the process is operating near a mechanistic boundary. MechanismOS highlights the ambiguity and suggests nearby conditions with more operating margin.
In some regimes, multiple pathways compete. Instead of returning a crisp label, MechanismOS shows a mixed branching distribution and lowers confidence. The chemist can explore nearby conditions to see which lever (solvent polarity, temperature, nucleophile strength) resolves the competition.
Process teams must satisfy temperature caps, impurity constraints, and solvent policy requirements. MechanismOS returns feasible condition sets when they exist and produces an audit-ready evidence pack for sign-off.
Machine-readable benchmark values and methodology for independent review.
Pilot programs include benchmark manifests, case-level outputs, and methodology docs for independent review.