Technical March 25, 2026

When ML Predictions Fail: The Extrapolation Problem

Understanding why machine learning models struggle with novel chemistry and what it means for screening workflows.

Machine learning models for molecular property prediction have made remarkable progress. Models like Chemprop, SchNet, and MACE achieve impressive accuracy on established benchmarks. For many well-characterized chemical spaces, they are fast, practical, and useful.

But they have a structural limitation that matters for drug discovery and materials screening: they are interpolation machines. They perform well on chemistry that resembles their training data and degrade — sometimes silently — on chemistry that does not.

This article explains why, what it means in practice, and how to build workflows that account for it.

Why ML models are interpolators

A machine learning model learns a mapping from molecular features (fingerprints, graphs, descriptors) to target properties (solubility, toxicity, band gap). It learns this mapping from examples: thousands or millions of molecules with known properties. During training, the model adjusts its parameters to minimize prediction error on these examples.

The result is a model that is highly accurate within the region of chemical space covered by the training data. This is interpolation: making predictions in the spaces between known data points.

The problem arises when you ask the model to predict properties for molecules that are structurally different from anything in the training set. This is extrapolation, and ML models are not designed for it. There is no physical law encoded in the model that constrains what the output should be for a truly novel input. The model can only pattern-match against what it has seen before.
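The gap between interpolation and extrapolation can be seen in a toy setting. The sketch below is not a molecular model: it assumes a one-dimensional "descriptor" x, a made-up ground truth sin(x), and a degree-5 polynomial standing in for a flexible learned model. All of these are illustrative choices, not anything specific to Chemprop, SchNet, or MACE.

```python
import numpy as np

# Toy stand-in for a property model. Assumptions: the "true" property is
# sin(x), and the training data only covers x in [0, 3].
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 3.0, 200)
y_train = np.sin(x_train) + rng.normal(0.0, 0.001, size=x_train.size)

# A flexible model with no built-in knowledge of the underlying law.
coeffs = np.polyfit(x_train, y_train, deg=5)


def predict(x):
    """Evaluate the fitted polynomial at x."""
    return np.polyval(coeffs, x)


x_inside = np.linspace(0.5, 2.5, 50)   # between training points
x_outside = np.linspace(6.0, 8.0, 50)  # far outside the training range

err_inside = float(np.max(np.abs(predict(x_inside) - np.sin(x_inside))))
err_outside = float(np.max(np.abs(predict(x_outside) - np.sin(x_outside))))

print(f"max error inside training range:  {err_inside:.4f}")
print(f"max error outside training range: {err_outside:.4f}")
```

Nothing in the fitted coefficients tells the polynomial that the true function stays bounded, so outside the training range its output is set by the fit rather than by the underlying law.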

How failure manifests

The insidious aspect of ML extrapolation failure is that it is often silent. The model does not throw an error. It does not refuse to make a prediction. It returns a number with the same apparent precision as any other prediction. The only sign that something is wrong is that the number is inaccurate — and you will not discover that until you run the experiment.

Common failure modes include:

Confident but wrong. The model produces a prediction with high apparent confidence (if it reports confidence at all) that is far from the experimental value. This happens when the novel input superficially resembles training examples but differs in a structurally important way that the model's features do not capture.

Flat predictions. For inputs far from the training distribution, some models revert to predicting the mean of the training set. Every novel molecule gets roughly the same prediction. Predicting the mean keeps average error low, so it is a safe fallback for the model, but it is useless for screening, where you need to distinguish between candidates.

Unstable predictions. Small changes in molecular structure (adding a methyl group, changing a stereocenter) produce disproportionately large changes in predicted properties. This happens when the model is navigating a region of feature space where its learned function is poorly constrained.
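The flat-prediction mode described above is easy to reproduce with a kernel method. The following is a minimal sketch, assuming a one-dimensional toy descriptor, a linear made-up property, and a hand-written RBF kernel ridge regression on a mean-centred target. The length scale and ridge strength are arbitrary illustrative values.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.5):
    """Gaussian (RBF) kernel between two 1-D arrays of descriptor values."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale**2)


# Toy data (assumption): the property varies linearly across the training range.
x_train = np.linspace(0.0, 5.0, 40)
y_train = 2.0 * x_train + 1.0
y_mean = float(y_train.mean())

# Kernel ridge regression on the mean-centred target.
K = rbf_kernel(x_train, x_train)
alpha = np.linalg.solve(K + 1e-3 * np.eye(x_train.size), y_train - y_mean)


def predict(x):
    """Predict the property for an array of descriptor values."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return rbf_kernel(x, x_train) @ alpha + y_mean


near = predict([2.0])              # inside the training range
far = predict([20.0, 30.0, 40.0])  # far outside it

print("in-range prediction:", near)
print("far-from-data predictions:", far)
print("training mean:", y_mean)
```

Far from every training point the kernel weights vanish, so each query collapses onto the training mean and the three "novel" inputs become indistinguishable.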

The applicability domain problem

The ML community's response to the extrapolation problem is the concept of an "applicability domain" — a defined region of chemical space within which the model's predictions are considered reliable. If a new input falls outside this domain, the prediction is flagged as unreliable.

In principle, this is the right approach. In practice, applicability domain estimation is an unsolved problem:

  • Feature-space distance metrics (such as Tanimoto similarity to the nearest training example) are crude proxies for prediction reliability. Two molecules can be "similar" in fingerprint space but different in the structural feature that drives the target property.
  • Ensemble disagreement (training multiple models and checking if they agree) detects some extrapolation failures but misses cases where all models share the same blind spot.
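  • Conformal prediction provides prediction intervals with a coverage guarantee, but that guarantee assumes new inputs are exchangeable with the calibration data, which is exactly the assumption that extrapolation violates. It also cannot distinguish between "the model is uncertain because the problem is inherently noisy" and "the model is uncertain because it has never seen this type of molecule."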

None of these methods can reliably detect the case that matters most: a novel scaffold where the model is confidently wrong.
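To make the conformal prediction point concrete, here is a minimal split-conformal sketch on toy data. It assumes a one-dimensional descriptor, a made-up quadratic ground truth, a deliberately simple linear model, and absolute residuals as the conformity score. The `true_property` function and the sampling ranges are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)


def true_property(x):
    """Hypothetical ground truth used only for this illustration."""
    return x**2


# Training and calibration data both come from x in [0, 1].
x = rng.uniform(0.0, 1.0, size=400)
y = true_property(x) + rng.normal(0.0, 0.02, size=x.size)
x_fit, y_fit = x[:200], y[:200]
x_cal, y_cal = x[200:], y[200:]

coeffs = np.polyfit(x_fit, y_fit, deg=1)  # deliberately simple model


def predict(x_new):
    return np.polyval(coeffs, x_new)


# Split conformal: the interval half-width is a quantile of the
# absolute residuals on the held-out calibration set.
miscoverage = 0.1
n_cal = x_cal.size
scores = np.sort(np.abs(y_cal - predict(x_cal)))
rank = int(np.ceil((n_cal + 1) * (1 - miscoverage)))
half_width = float(scores[rank - 1])


def coverage(x_test):
    """Fraction of test points whose true value falls inside the interval."""
    y_test = true_property(x_test) + rng.normal(0.0, 0.02, size=x_test.size)
    return float(np.mean(np.abs(y_test - predict(x_test)) <= half_width))


coverage_in = coverage(rng.uniform(0.0, 1.0, size=2000))   # same range as calibration
coverage_out = coverage(rng.uniform(3.0, 4.0, size=2000))  # shifted range

print(f"interval half-width: {half_width:.3f}")
print(f"coverage, in-distribution:     {coverage_in:.2f}")
print(f"coverage, out-of-distribution: {coverage_out:.2f}")
```

The interval has the same width for every input, so it looks equally trustworthy in both regions even though it only holds its nominal coverage in the first.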

Why this matters for screening

If you are screening a library of analogs within a well-characterized chemical series, the extrapolation problem may not affect you. Your candidates are structurally similar to the training data, and ML predictions will likely be accurate.

But if you are doing any of the following, you are extrapolating:

  • Screening a diverse virtual library for hit-finding (structurally dissimilar from training data)
  • Evaluating novel scaffolds from generative chemistry or scaffold-hopping campaigns
  • Predicting properties for materials compositions not represented in training databases
  • Working in an emerging chemical space (PROTACs, molecular glues, covalent inhibitors) where training data is sparse

In these scenarios, ML predictions carry hidden risk. They look like reliable numbers. They may not be.

What to do about it

The extrapolation problem is not a reason to abandon ML. It is a reason to use ML for what it is good at (interpolation within known chemical space) and to complement it with methods that handle novel chemistry differently.

Know your domain

Before trusting ML predictions, ask: how similar are my candidates to the model's training data? If the answer is "not very," treat predictions with caution.
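One rough way to put a number on that question is nearest-neighbour Tanimoto similarity. The sketch below assumes fingerprints are already available as sets of on-bit indices (in practice they would come from a cheminformatics toolkit). The example bit sets and the 0.4 cutoff are made up for illustration and are not recommended values.

```python
from typing import Iterable

Fingerprint = frozenset  # set of on-bit indices; a hypothetical representation


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two bit sets."""
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here: two empty fingerprints are dissimilar
    return len(a & b) / union


def nearest_training_similarity(query: Fingerprint, training: Iterable[Fingerprint]) -> float:
    """Similarity of the query to its most similar training molecule."""
    return max(tanimoto(query, fp) for fp in training)


# Made-up fingerprints for illustration only.
training_set = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 5, 8})]
close_analog = frozenset({1, 2, 3, 9})
novel_scaffold = frozenset({3, 20, 21, 22})

SIMILARITY_CUTOFF = 0.4  # illustrative assumption; tune per model and dataset

for name, fp in [("close analog", close_analog), ("novel scaffold", novel_scaffold)]:
    sim = nearest_training_similarity(fp, training_set)
    status = "in domain" if sim >= SIMILARITY_CUTOFF else "treat with caution"
    print(f"{name}: nearest similarity {sim:.2f} -> {status}")
```

This is the same crude proxy criticized earlier: a high score does not prove the prediction is reliable, but a low score is a cheap signal that the model is being asked to extrapolate.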

Use physics for triage

Physics-based methods (DFT, physics kernels) do not have training distributions. They generalize to novel chemistry by construction. Use them for the first-pass screen on diverse libraries.

Use ML for refinement

Within a narrowed candidate set that is structurally similar to training data, ML models excel at relative ranking and property optimization.

Demand confidence signals

If a tool does not tell you when it is uncertain, you cannot distinguish reliable predictions from guesses. Require per-prediction confidence on every screening output.
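As a sketch of what a per-prediction confidence signal might look like, the snippet below attaches ensemble spread and nearest-training similarity to each output. The `ScreenedPrediction` record, the `annotate` helper, and both thresholds are hypothetical names and values chosen for illustration. They do not describe any particular tool.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import Sequence


@dataclass(frozen=True)
class ScreenedPrediction:
    value: float               # ensemble mean
    ensemble_std: float        # disagreement between ensemble members
    nearest_similarity: float  # similarity to the closest training molecule
    trusted: bool              # True only if both checks pass


def annotate(
    member_predictions: Sequence[float],
    nearest_similarity: float,
    max_std: float = 0.5,         # illustrative threshold (assumption)
    min_similarity: float = 0.4,  # illustrative threshold (assumption)
) -> ScreenedPrediction:
    """Bundle a prediction with the signals needed to judge it.

    Requires at least two ensemble members so a spread can be computed.
    """
    value = mean(member_predictions)
    spread = stdev(member_predictions)
    trusted = spread <= max_std and nearest_similarity >= min_similarity
    return ScreenedPrediction(value, spread, nearest_similarity, trusted)


agreeing = annotate([1.0, 1.1, 0.9], nearest_similarity=0.7)
disagreeing = annotate([1.0, 3.0, -1.0], nearest_similarity=0.7)
dissimilar = annotate([1.0, 1.1, 0.9], nearest_similarity=0.1)

for record in (agreeing, disagreeing, dissimilar):
    print(record)
```

Passing both checks does not prove a prediction is right, since an ensemble can share a blind spot. Failing either one is a reason to route the candidate to another method before acting on the number.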

The structural issue

The extrapolation problem is not a bug in ML. It is a structural property of any method that learns from data rather than deriving from physical law. It cannot be fixed by more data, better architectures, or larger models — because the set of possible molecules is vastly larger than any training set, and the molecules that matter most for discovery are precisely the ones that have not been characterized yet.

Acknowledging this is not anti-ML. It is pro-rigour. The best screening workflows use each tool where its strengths apply and its limitations do not.

FluxMateria's physics kernel has no training data and generalizes to novel chemistry on day one. Read the comparison framework or try the demo.
