When ML Predictions Fail: The Extrapolation Problem
Understanding why machine learning models struggle with novel chemistry and what it means for screening workflows.
Machine learning models for molecular property prediction have made remarkable progress. Models like Chemprop, SchNet, and MACE achieve impressive accuracy on established benchmarks. For many well-characterized chemical spaces, they are fast, practical, and useful.
But they have a structural limitation that matters for drug discovery and materials screening: they are interpolation machines. They perform well on chemistry that resembles their training data and degrade — sometimes silently — on chemistry that does not.
This article explains why, what it means in practice, and how to build workflows that account for it.
A machine learning model learns a mapping from molecular features (fingerprints, graphs, descriptors) to target properties (solubility, toxicity, band gap). It learns this mapping from examples: thousands or millions of molecules with known properties. During training, the model adjusts its parameters to minimize prediction error on these examples.
The result is a model that is highly accurate within the region of chemical space covered by the training data. This is interpolation: making predictions in the spaces between known data points.
The problem arises when you ask the model to predict properties for molecules that are structurally different from anything in the training set. This is extrapolation, and ML models are not designed for it. There is no physical law encoded in the model that constrains what the output should be for a truly novel input. The model can only pattern-match against what it has seen before.
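The gap between interpolation and extrapolation can be seen in a toy setting. The sketch below is illustrative only: it assumes a single made-up descriptor and uses `math.sin` as a stand-in for a real property, with a simple k-nearest-neighbor regressor in place of a production model. None of these names come from any particular library.

```python
import math

def knn_predict(train_x, train_y, query, k=3):
    """Average the targets of the k training points closest to `query`."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))[:k]
    return sum(train_y[i] for i in nearest) / k

def true_property(x):
    """Hypothetical ground-truth property as a function of one descriptor."""
    return math.sin(x)

# Training data covers descriptor values 0.0 to 3.0 only.
train_x = [i * 0.1 for i in range(31)]
train_y = [true_property(x) for x in train_x]

inside = 1.55   # lies between training points: interpolation
outside = 6.0   # lies far beyond the training range: extrapolation

err_inside = abs(knn_predict(train_x, train_y, inside) - true_property(inside))
err_outside = abs(knn_predict(train_x, train_y, outside) - true_property(outside))

print(f"interpolation error: {err_inside:.3f}")
print(f"extrapolation error: {err_outside:.3f}")
```

The model returns a number for both queries without complaint. Only the comparison against the known function, which is unavailable for a real novel molecule, reveals that the second prediction is far off.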
The insidious aspect of ML extrapolation failure is that it is often silent. The model does not throw an error. It does not refuse to make a prediction. It returns a number with the same apparent precision as any other prediction. The only sign that something is wrong is that the number is inaccurate — and you will not discover that until you run the experiment.
Common failure modes include:
Confident but wrong. The model produces a prediction with high apparent confidence (if it reports confidence at all) that is far from the experimental value. This happens when the novel input superficially resembles training examples but differs in a structurally important way that the model's features do not capture.
Flat predictions. For inputs far from the training distribution, some models revert to predicting the mean of the training set. Every novel molecule gets roughly the same prediction. This is a statistically safe default (the mean minimizes average squared error when the model has nothing else to go on) but useless for screening, where you need to distinguish between candidates.
Unstable predictions. Small changes in molecular structure (adding a methyl group, changing a stereocenter) produce disproportionately large changes in predicted properties. This happens when the model is navigating a region of feature space where its learned function is poorly constrained.
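The "flat predictions" failure mode can be reproduced with a small kernel model. The following is a minimal sketch under stated assumptions: a one-dimensional toy descriptor, a made-up target (`x ** 2`), and a hand-rolled RBF kernel ridge regressor fitted on mean-centered targets. The length scale and ridge value are arbitrary illustrative choices.

```python
import math

def rbf(a, b, length_scale=0.5):
    """Gaussian (RBF) kernel between two scalar descriptor values."""
    return math.exp(-((a - b) ** 2) / (2 * length_scale ** 2))

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def fit_kernel_ridge(train_x, train_y, ridge=1e-6):
    """Fit kernel ridge regression on mean-centered targets."""
    mean_y = sum(train_y) / len(train_y)
    gram = [[rbf(a, b) + (ridge if i == j else 0.0)
             for j, b in enumerate(train_x)] for i, a in enumerate(train_x)]
    weights = solve(gram, [y - mean_y for y in train_y])

    def predict(x):
        # Far from all training points every kernel value is ~0,
        # so the prediction collapses to the training mean.
        return mean_y + sum(w * rbf(x, xi) for w, xi in zip(weights, train_x))

    return predict, mean_y

train_x = [i * 0.25 for i in range(13)]   # descriptor values 0.0 to 3.0
train_y = [x ** 2 for x in train_x]       # hypothetical target property
predict, mean_y = fit_kernel_ridge(train_x, train_y)

for x in (1.1, 20.0, 30.0):
    print(f"x={x:5.1f}  predicted={predict(x):.3f}  training mean={mean_y:.3f}")
```

Inside the training range the model tracks the target closely. For the two distant inputs it returns the training mean both times, so it cannot rank one against the other.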
The ML community's response to the extrapolation problem is the concept of an "applicability domain" — a defined region of chemical space within which the model's predictions are considered reliable. If a new input falls outside this domain, the prediction is flagged as unreliable.
In principle, this is the right approach. In practice, applicability domain estimation is an unsolved problem:
None of these methods can reliably detect the case that matters most: a novel scaffold where the model is confidently wrong.
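As one illustration of what an applicability-domain check can look like, the sketch below flags a query molecule when its nearest-neighbor Tanimoto similarity to the training set falls below a cutoff. This is a common distance-based heuristic, not necessarily one the methods above refer to. The fingerprints are hypothetical sets of on-bit indices (in a real workflow they would come from a cheminformatics toolkit), and the 0.4 threshold is an arbitrary assumption that would need tuning per model and dataset.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def nearest_similarity(query_fp, training_fps):
    """Similarity of the query to its most similar training molecule."""
    return max(tanimoto(query_fp, fp) for fp in training_fps)

def in_domain(query_fp, training_fps, threshold=0.4):
    """Heuristic check: True if the query is close to at least one training molecule.

    The threshold is an illustrative assumption, not a recommended value.
    """
    return nearest_similarity(query_fp, training_fps) >= threshold

# Hypothetical fingerprints for three training molecules.
training_fps = [{1, 4, 7, 9}, {1, 4, 8, 9}, {2, 4, 7, 9}]

analog = {1, 4, 7, 10}              # shares most bits with a training molecule
novel_scaffold = {20, 21, 22, 23}   # shares no bits with any training molecule

for name, fp in (("analog", analog), ("novel scaffold", novel_scaffold)):
    sim = nearest_similarity(fp, training_fps)
    print(f"{name}: nearest similarity={sim:.2f}, in domain={in_domain(fp, training_fps)}")
```

A check like this catches inputs that are obviously far from the training set. It does not catch a molecule whose fingerprint looks familiar but whose behavior differs for reasons the fingerprint does not encode, which is the confidently wrong case described above.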
If you are screening a library of analogs within a well-characterized chemical series, the extrapolation problem may not affect you. Your candidates are structurally similar to the training data, and ML predictions will likely be accurate.
But if you are doing any of the following, you are extrapolating:
In these scenarios, ML predictions carry hidden risk. They look like reliable numbers. They may not be.
The extrapolation problem is not a reason to abandon ML. It is a reason to use ML for what it is good at (interpolation within known chemical space) and to complement it with methods that handle novel chemistry differently.
Before trusting ML predictions, ask: how similar are my candidates to the model's training data? If the answer is "not very," treat predictions with caution.
Physics-based methods (DFT, physics kernels) do not have training distributions. They generalize to novel chemistry by construction. Use them for the first-pass screen on diverse libraries.
Within a narrowed candidate set that is structurally similar to training data, ML models excel at relative ranking and property optimization.
If a tool does not tell you when it is uncertain, you cannot distinguish reliable predictions from guesses. Require per-prediction confidence on every screening output.
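One simple way to attach a per-prediction uncertainty to a data-driven model is the spread of a bootstrap ensemble. The sketch below is a toy under stated assumptions: a single descriptor, synthetic noisy data, and straight-line fits standing in for real models. All function names are hypothetical. The spread is a heuristic signal, not a calibrated confidence interval, and an ensemble can still agree on a wrong answer for novel inputs.

```python
import random
import statistics

def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx > 0 else 0.0   # guard against a degenerate resample
    return slope, mean_y - slope * mean_x

def bootstrap_ensemble(points, n_models=50, seed=0):
    """Fit one model per bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    return [fit_line([rng.choice(points) for _ in points]) for _ in range(n_models)]

def predict_with_spread(ensemble, x):
    """Return the ensemble mean prediction and its standard deviation at x."""
    preds = [slope * x + intercept for slope, intercept in ensemble]
    return statistics.mean(preds), statistics.stdev(preds)

# Synthetic training data: descriptor in [0, 1], noisy linear property.
noise = random.Random(1)
data = [(i / 19, 2.0 * (i / 19) + 1.0 + noise.gauss(0.0, 0.1)) for i in range(20)]
ensemble = bootstrap_ensemble(data)

pred_near, spread_near = predict_with_spread(ensemble, 0.5)    # inside training range
pred_far, spread_far = predict_with_spread(ensemble, 10.0)     # far outside it

print(f"x=0.5:  prediction={pred_near:.2f}, spread={spread_near:.3f}")
print(f"x=10.0: prediction={pred_far:.2f}, spread={spread_far:.3f}")
```

In a screening pipeline, a value like `spread_far` would be reported next to each prediction so that candidates with a large spread can be routed to a slower, more trustworthy method.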
The extrapolation problem is not a bug in ML. It is a structural property of any method that learns from data rather than deriving from physical law. It cannot be fixed by more data, better architectures, or larger models — because the set of possible molecules is vastly larger than any training set, and the molecules that matter most for discovery are precisely the ones that have not been characterized yet.
Acknowledging this is not anti-ML. It is pro-rigour. The best screening workflows use each tool where its strengths apply and its limitations do not.
FluxMateria's physics kernel has no training data and generalizes to novel chemistry on day one. Read the comparison framework or try the demo.