
DILI Benchmark: Mechanistic Engine

Detailed validation evidence for the FluxMateria drug-induced liver injury engine: binary benchmark position, cross-panel transfer, throughput, and mechanism-output coverage.

State-of-the-art benchmark position

FluxMateria is state-of-the-art for mechanistic DILI prediction.

On the comparable Therapeutics Data Commons (TDC) binary drug-induced liver injury (DILI) task, FluxMateria reaches an area under the receiver operating characteristic curve (AUROC) of 0.9597, versus a MiniMol public reference of approximately 0.956, a margin of about 0.004. FluxMateria also returns outputs that binary benchmark entries do not: risk score, class, confidence, mechanism attribution, hepatic exposure context, optional dose-window behavior, and a calculation trace.

  • 0.9597: AUROC on the comparable TDC binary DILI benchmark
  • ~0.956: MiniMol public AUROC reference on the binary benchmark
  • 0.9275: Hepatotox validated novel-like AUROC
  • 12.95 molecules/second: FluxMateria parent DILI path, measured locally

Figure: FluxMateria DILI mechanism coverage infographic showing molecular structure input, hepatic exposure, efflux and retention, cytochrome P450 enzyme context, injury chemistry, and parent DILI risk output. Mechanism coverage supporting the benchmark: exposure in, clearance out, enzyme context, injury chemistry, parent risk, and review-ready trace.

Benchmark evidence

The headline claim is anchored to the comparable public binary task, then stress-tested with broader clinical-risk panels.

| Evidence layer | Metric | FluxMateria result | Comparator / context | Interpretation |
|---|---|---|---|---|
| Comparable public binary benchmark | AUROC | 0.9597 | MiniMol reference around 0.956 on the TDC binary DILI task. | FluxMateria clears the public binary reference while providing richer outputs. |
| DILIRank clinical-transfer panel | Novel-like AUROC | 0.9063 | Clinical-risk oriented drug list, separate from the binary benchmark framing. | Supports transfer from benchmark labels to clinically interpretable risk ordering. |
| Hepatotox validated panel | Novel-like AUROC | 0.9275 | Independent hepatotoxicity validation context. | Confirms that the engine is not only matching one benchmark split. |
| Clinical risk stratification context | 907-compound leave-one-out risk stratification | 3-class accuracy 0.7696; 3-class balanced accuracy 0.7677; high-vs-rest balanced accuracy 0.8397 | Three-class risk view rather than a binary yes/no benchmark. | Useful for review workflow; not the primary binary state-of-the-art claim. |
| Throughput | Molecules per second | 12.95 locally for the parent DILI path | MiniMol speed is not verified from the public leaderboard. | Fast enough for portfolio-scale triage and interactive safety review. |

Benchmark interpretation: This is intentionally not a binary-only apples-to-apples claim. The binary AUROC comparison establishes public benchmark position; the added mechanism, exposure, dose, confidence, and trace outputs establish why FluxMateria is a mechanistic safety engine rather than just another classifier.
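
For readers who want to check what an AUROC figure measures, the following is a minimal sketch of the rank-based definition: the probability that a randomly chosen positive compound scores higher than a randomly chosen negative one, with ties counted as half. The function name and the toy labels and scores are illustrative assumptions, not FluxMateria outputs or benchmark data.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) AUROC; ties between a positive and a negative count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 3 DILI-positive and 3 DILI-negative compounds.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
toy_auroc = auroc(labels, scores)
print(f"toy AUROC = {toy_auroc:.4f}")  # 8 of 9 positive/negative pairs are ordered correctly
```

The pairwise form is quadratic in panel size, which is acceptable for panels of a few hundred rows such as those reported here.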

Full benchmark matrix

The table below separates known-compound behavior from novel-like leave-one-out behavior. The public state-of-the-art claim uses the novel-like Therapeutics Data Commons (TDC) rows, where exact clinical self-matches are masked.

| Panel | Mode | Rows | Speed (molecules/second) | 3-class accuracy | 3-class balanced accuracy | High-vs-rest balanced accuracy | High-vs-rest AUROC | High-vs-rest AUPRC (area under precision-recall curve) | DILI-concern AUROC | Exact anchors |
|---|---|---|---|---|---|---|---|---|---|---|
| TDC DILI public | Novel-like leave-one-out | 475 | 12.8806 | n/a | n/a | 0.8223 | 0.9597 | 0.9455 | 0.9597 | 0 |
| TDC DILI raw | Novel-like leave-one-out | 475 | 12.9478 | n/a | n/a | 0.8223 | 0.9597 | 0.9455 | 0.9597 | 0 |
| DILIRank | Novel-like leave-one-out | 907 | 11.3889 | 0.7696 | 0.7677 | 0.8397 | 0.9063 | 0.7355 | 0.9157 | 0 |
| Hepatotox validated | Novel-like leave-one-out | 614 | 13.6910 | 0.6857 | 0.6648 | 0.8077 | 0.9275 | 0.8594 | 0.9394 | 0 |
| TDC DILI public | Known-compound production | 475 | 12.8170 | n/a | n/a | 0.9237 | 0.9932 | 0.9948 | 0.9932 | 454 |
| DILIRank | Known-compound production | 907 | 8.5056 | 0.9713 | 0.9740 | 0.9897 | 0.9976 | 0.9965 | 0.9987 | 488 |
| Hepatotox validated | Known-compound production | 614 | 12.6597 | 0.8648 | 0.8519 | 0.9278 | 0.9850 | 0.9795 | 0.9880 | 575 |
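
The matrix reports both 3-class and high-vs-rest balanced accuracy. As a hedged illustration of how those two views relate, the sketch below computes balanced accuracy as the mean of per-class recall, then collapses the three classes into high versus rest. The class names follow the low/moderate/high wording used on this page; the labels and predictions are made-up toy data, and the helper names are assumptions.

```python
from collections import defaultdict
from typing import List, Sequence


def balanced_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Mean of per-class recall, so each class counts equally regardless of size."""
    total = defaultdict(int)
    hit = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            hit[t] += 1
    return sum(hit[c] / total[c] for c in total) / len(total)


def high_vs_rest(labels: Sequence[str]) -> List[str]:
    """Collapse low/moderate/high into a binary high-vs-rest view."""
    return ["high" if y == "high" else "rest" for y in labels]


# Toy data (hypothetical): 4 low, 2 moderate, 2 high compounds.
y_true = ["low"] * 4 + ["moderate"] * 2 + ["high"] * 2
y_pred = ["low", "low", "low", "low", "moderate", "high", "high", "high"]

three_class = balanced_accuracy(y_true, y_pred)
binary_view = balanced_accuracy(high_vs_rest(y_true), high_vs_rest(y_pred))
print(f"3-class balanced accuracy:      {three_class:.4f}")
print(f"high-vs-rest balanced accuracy: {binary_view:.4f}")
```

In this toy case the single moderate compound predicted as high lowers both figures, but the binary view penalizes it less because moderate and low share one class.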

Protocol note: Novel-like leave-one-out mode is the correct proxy for new drug candidates because exact clinical self-matches are removed. Known-compound production mode is useful for reference-drug reproducibility and user-facing known-drug behavior, but it is not the public novel-drug state-of-the-art claim.
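
The difference between the two modes can be sketched as follows. This is an illustration of the masking idea only, under stated assumptions: the reference store is modeled as a dictionary keyed by a structure identifier, and `toy_scorer` is a hypothetical stand-in that has nothing to do with the FluxMateria engine. In novel-like mode the query's own reference entry is hidden before scoring, so the exact-anchor count is zero by construction.

```python
from typing import Callable, Dict, List, Tuple

Reference = Dict[str, float]  # hypothetical: structure key -> reference risk label


def evaluate(
    panel: List[str],
    reference: Reference,
    scorer: Callable[[str, Reference], float],
    mask_self_matches: bool,
) -> Tuple[List[float], int]:
    """Score each panel compound; optionally hide its own reference entry first."""
    scores: List[float] = []
    exact_anchors = 0
    for key in panel:
        if mask_self_matches:
            visible = {k: v for k, v in reference.items() if k != key}
        else:
            visible = reference
        if key in visible:
            exact_anchors += 1  # the scorer can see an exact self-match
        scores.append(scorer(key, visible))
    return scores, exact_anchors


def toy_scorer(key: str, visible: Reference) -> float:
    """Stand-in scorer: exact label if visible, else the mean of visible labels."""
    if key in visible:
        return visible[key]
    if not visible:
        return 0.5
    return sum(visible.values()) / len(visible)


reference = {"A": 1.0, "B": 0.0, "C": 1.0}
panel = ["A", "B", "D"]

known_scores, known_anchors = evaluate(panel, reference, toy_scorer, mask_self_matches=False)
novel_scores, novel_anchors = evaluate(panel, reference, toy_scorer, mask_self_matches=True)
print("known-compound:", known_scores, "anchors:", known_anchors)
print("novel-like:    ", novel_scores, "anchors:", novel_anchors)
```

The toy run shows why known-compound metrics are higher: compounds A and B are answered from their own labels unless masking is switched on.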

Mechanism signal audit

For the 475-row TDC public novel-like run, the engine records which mechanism layers contributed or constrained the parent score. These counts are included so reviewers can see that the benchmark is not a single opaque binary score.

| Mechanism family | Observed count in TDC public novel-like run | Scientific role |
|---|---|---|
| CYP3A4 (cytochrome P450 family 3 subfamily A member 4) induction | 7 | Enzyme induction context. |
| CYP3A4 time-dependent inhibition | 19 | Mechanism-based enzyme inhibition context. |
| CYP2E1 (cytochrome P450 family 2 subfamily E member 1) contribution | 9 | Small-molecule bioactivation and acetaminophen-like pathway context. |
| CYP2B6 (cytochrome P450 family 2 subfamily B member 6) contribution | 5 | Metabolic activation and enzyme-specific liver-risk contribution. |
| CYP2C8 (cytochrome P450 family 2 subfamily C member 8) contribution | 5 | Primary-metabolizer and acyl-glucuronide-adjacent context. |
| OATP1B1 (organic anion transporting polypeptide 1B1) uptake | 144 | Liver-entry context. |
| Liver exposure circuit | 144 | Combined hepatic uptake, exposure, retention, and clearance context. |
| UGT2B7 (UDP glucuronosyltransferase family 2 member B7) acyl context | 12 | Detox and acyl-glucuronide context. |
| OATP1B3 (organic anion transporting polypeptide 1B3) uptake | 53 | Liver-entry context. |
| BCRP (breast cancer resistance protein) efflux | 172 | Efflux context. |
| BCRP inhibitor context | 135 | Transporter inhibition context. |
| CYP1A2 (cytochrome P450 family 1 subfamily A member 2) bioactivation | 4 | Aromatic bioactivation context. |
| CYP1A2 induction | 51 | Aryl hydrocarbon receptor-linked induction context. |
| Targeted chemistry deltas | 19 | High-specificity injury-chemistry support from residual failure-cluster review. |
| Targeted chemistry floors | 6 | Minimum-risk support for validated high-specificity injury families. |
| Aromatic cluster deltas | 18 | Near-threshold aromatic liver-risk support. |
| Aromatic cluster floors | 11 | Minimum-risk support for validated aromatic failure clusters. |
| Mechanism-coherence floors | 45 | Support when independent evidence layers agree on the same liver-risk direction. |
| Confidence-combiner caps | 7 | False-positive control when broad backbone risk lacks sufficient mechanism support. |
| Small inert caps | 2 | False-positive control for low-process small-molecule chemistry. |
| Low-backbone mechanism rescues | 4 | Allows high-specificity mechanism evidence to surface even when the initial clinical-neighbor backbone is low. |
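
To put the larger counts in proportion, the short sketch below expresses a few of them as a share of the 475-row run. It assumes each count is at most one per compound, which this page does not state explicitly, so the percentages are indicative only.

```python
TOTAL_ROWS = 475  # rows in the TDC public novel-like run

# Selected counts copied from the mechanism signal audit table.
counts = {
    "BCRP efflux": 172,
    "OATP1B1 uptake": 144,
    "Liver exposure circuit": 144,
    "BCRP inhibitor context": 135,
    "OATP1B3 uptake": 53,
    "CYP1A2 induction": 51,
    "Mechanism-coherence floors": 45,
}

# Share of rows, assuming one observation per compound (an assumption).
share = {name: n / TOTAL_ROWS for name, n in counts.items()}

for name, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:28s} {n:4d}  {share[name]:6.1%}")
```

Under that assumption, transporter and exposure layers fire on roughly a third of the panel, while individual enzyme-specific layers are much rarer.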

Descriptor validation

Reusable mechanism descriptors were validated as separable mechanism signals before they were used inside the parent DILI engine. This is a signal-quality check, not a final binary classifier.

| Descriptor family | Status | Positive mean | Control mean | Separation | Minimum positive | Maximum control |
|---|---|---|---|---|---|---|
| Hepatobiliary residence pressure | Pass | 0.4623 | 0.1486 | 0.3137 | 0.4049 | 0.1636 |
| Selective retention specificity | Pass | 0.3488 | 0.1221 | 0.2267 | 0.3061 | 0.1669 |
| Polar antimetabolite stress | Pass | 0.7391 | 0.1239 | 0.6152 | 0.6268 | 0.2543 |
| Net reactive-metabolite pressure | Pass | 0.2811 | 0.0814 | 0.1997 | 0.2140 | 0.1236 |
| Mitochondrial and beta-oxidation pressure | Pass | 0.5165 | 0.0764 | 0.4401 | 0.4716 | 0.1877 |

Descriptor audit: The descriptor validation covered 38 probe cases at 1.07 cases per second. Seventeen additional parent process probes passed as high-specificity chemistry checks, with controls kept below the high-risk band.
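
The descriptor table can be re-derived from its own columns. The sketch below recomputes separation as positive mean minus control mean and checks that the lowest positive probe sits above the highest control. The second check is an assumed reading of what "Pass" means; the page does not define the pass criterion.

```python
# (descriptor, positive mean, control mean, reported separation, min positive, max control)
rows = [
    ("Hepatobiliary residence pressure", 0.4623, 0.1486, 0.3137, 0.4049, 0.1636),
    ("Selective retention specificity", 0.3488, 0.1221, 0.2267, 0.3061, 0.1669),
    ("Polar antimetabolite stress", 0.7391, 0.1239, 0.6152, 0.6268, 0.2543),
    ("Net reactive-metabolite pressure", 0.2811, 0.0814, 0.1997, 0.2140, 0.1236),
    ("Mitochondrial and beta-oxidation pressure", 0.5165, 0.0764, 0.4401, 0.4716, 0.1877),
]

checks = []
for name, pos_mean, ctl_mean, reported_sep, min_pos, max_ctl in rows:
    recomputed = pos_mean - ctl_mean
    # Tolerance of half a unit in the fourth decimal place covers float rounding.
    matches = abs(recomputed - reported_sep) < 5e-5
    # Assumed pass criterion: no overlap between positive and control ranges.
    no_overlap = min_pos > max_ctl
    checks.append((name, matches, no_overlap))
    print(f"{name:44s} sep={recomputed:.4f} matches={matches} no_overlap={no_overlap}")
```

All five reported separations equal the difference of the two means, and in every row the minimum positive exceeds the maximum control.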

Output coverage beyond binary classification

FluxMateria returns the evidence safety teams need to decide what to do next.

| Output layer | What the reviewer sees | Why it matters |
|---|---|---|
| Risk score and class | Numeric score, low/moderate/high class, and confidence. | Supports portfolio ranking and safety-governance thresholds. |
| Hepatic exposure | Organic anion transporting polypeptide (OATP) uptake context and exposure pressure. | Separates structural hazard from likely liver access. |
| Retention and efflux | Bile salt export pump (BSEP), breast cancer resistance protein (BCRP), and multidrug resistance-associated protein 2 (MRP2) evidence. | Highlights cholestatic and hepatobiliary-retention concerns. |
| Enzyme context | Cytochrome P450 (CYP) metabolism, inhibition, induction, and bioactivation context where available. | Connects DILI risk to drug-drug interaction and metabolic-liability review. |
| Injury chemistry | Reactive-metabolite, mitochondrial-stress, chronic-duration, and phenotype-specific evidence. | Gives toxicology teams a plausible follow-up assay direction. |
| Score trace | Structured calculation trace from baseline evidence to final class. | Makes the call reviewable, challengeable, and reproducible. |
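
One way a reviewer might replay a calculation trace is sketched below. The step kinds borrow the delta, floor, and cap vocabulary from the mechanism signal audit, but the record layout, field names, step semantics, and every number here are hypothetical assumptions; the actual trace schema is not documented on this page.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TraceStep:
    layer: str    # evidence layer that acted (hypothetical field)
    kind: str     # "delta", "floor", or "cap" (assumed semantics)
    value: float  # amount added, or the floor/cap level


def replay(baseline: float, steps: List[TraceStep]) -> float:
    """Re-apply trace steps in order and return the final score, clamped to [0, 1]."""
    score = baseline
    for step in steps:
        if step.kind == "delta":
            score += step.value            # shift the score
        elif step.kind == "floor":
            score = max(score, step.value)  # enforce a minimum
        elif step.kind == "cap":
            score = min(score, step.value)  # enforce a maximum
        else:
            raise ValueError(f"unknown step kind: {step.kind}")
        score = min(1.0, max(0.0, score))
    return score


# Made-up trace for illustration only.
steps = [
    TraceStep("targeted chemistry", "delta", 0.15),
    TraceStep("mechanism coherence", "floor", 0.50),
    TraceStep("confidence combiner", "cap", 0.70),
]
final_score = replay(0.30, steps)
print(f"final score = {final_score:.2f}")
```

Replaying steps in order is what makes a call challengeable: a reviewer can drop or alter one step and see how the final score would move.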

Download benchmark package

Sanitized machine-readable results and methodology for independent scientific review.

DILI benchmark evidence package

  • Summary JSON: Headline claim, primary public comparison, cross-panel validation, output coverage, and interpretation policy.
  • Benchmark matrix CSV: Panel-by-panel metrics: rows, mode, speed, balanced accuracy, AUROC, AUPRC, and exact-anchor policy.
  • Mechanism signal counts CSV: Counts for cytochrome P450, transporter, exposure, chemistry, confidence, and rescue layers in the public novel-like run.
  • Descriptor validation CSV: Mechanism-signal separation checks for residence, retention, antimetabolite, reactive-metabolite, and mitochondrial pressure descriptors.
  • Methodology note (Markdown): Benchmark scope, evaluation modes, reported metrics, interpretation policy, and use boundary.

How to read the claim

What we claim

  • State-of-the-art mechanistic DILI prediction engine.
  • Comparable binary benchmark AUROC above the MiniMol public reference.
  • Richer output than binary classifiers: mechanisms, exposure, dose-window, confidence, and trace.
  • Fast enough for screening and review workflows: about 12.95 molecules per second locally.
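
As a rough guide to what the reported throughput means in practice, the sketch below converts a molecules-per-second rate into wall-clock time. It assumes the locally measured rate stays constant at scale, which has not been verified here; the one-million-compound library is a hypothetical example.

```python
def screening_time_seconds(n_molecules: int, molecules_per_second: float) -> float:
    """Wall-clock estimate assuming a constant processing rate."""
    if molecules_per_second <= 0:
        raise ValueError("rate must be positive")
    return n_molecules / molecules_per_second


# The 475-row TDC public panel at its reported novel-like rate.
tdc_panel_s = screening_time_seconds(475, 12.8806)

# A hypothetical one-million-compound library at the headline rate.
library_h = screening_time_seconds(1_000_000, 12.95) / 3600

print(f"TDC public panel: {tdc_panel_s:.1f} s")
print(f"1M-compound library: {library_h:.2f} h")
```

At these rates the benchmark panel completes in well under a minute, and a million-compound library would take a little under a day on a single local path.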

Boundary conditions

  • Prediction supports screening, prioritization, and scientific review; it does not replace regulated toxicology studies.
  • MiniMol speed is not verified from public leaderboard material.
  • Three-class clinical risk stratification is a separate workflow from the binary benchmark comparison.
  • Novel chemistry should be interpreted with the reported confidence and follow-up evidence needs.

Review the DILI engine in context

The detailed DILI benchmark should be read alongside the full ADMET benchmark and the product DILI page.


Benchmark basis

The Flux Hybrid route combines Flux physics signals with endpoint-specific reference evidence. Reported metrics describe the performance of that combined prediction route.