24 Uncertainty Quantification
The question is not whether our predictions are uncertain; they always are. The question is whether we have the honesty to say so.
Prerequisites: This chapter builds on concepts from earlier chapters including model calibration basics (Chapter 12), pretraining objectives (Chapter 8), and variant effect prediction (Chapter 18). Familiarity with basic probability and statistics (means, variances, distributions) is assumed.
Learning Objectives: After completing this chapter, you will be able to:
- Distinguish between epistemic and aleatoric uncertainty, and explain why this distinction matters for clinical decision-making
- Assess model calibration using reliability diagrams and expected calibration error (ECE)
- Apply post-hoc calibration methods (temperature scaling, Platt scaling, isotonic regression) to foundation model outputs
- Evaluate uncertainty quantification methods including deep ensembles, MC dropout, and heteroscedastic models
- Implement conformal prediction to obtain distribution-free coverage guarantees
- Design out-of-distribution detection pipelines for genomic sequences
- Communicate uncertainty appropriately to different stakeholders (clinicians, researchers, patients)
Estimated Reading Time: 45-60 minutes
Consider a foundation model that assigns a pathogenicity score of 0.73 to a variant identified in a patient’s exome. What does that number mean? This is not a philosophical riddle. If the model is well-calibrated, approximately 73% of variants receiving this score are truly pathogenic, and a clinician can weigh this probability against the costs of further testing. If the model is miscalibrated, the true pathogenicity rate among variants scored at 0.73 could be 40% or 95%, and the nominal probability provides no reliable basis for decision-making. The distinction is not between accurate and inaccurate models but between models that know what they know and models that do not. A miscalibrated model with high average accuracy can be more dangerous than a calibrated model with lower accuracy, because the miscalibrated model provides false confidence that leads to systematically wrong decisions.
Foundation models produce continuous scores, but clinical decisions require categorical actions: test or do not test, treat or do not treat, report to the family or continue monitoring. This translation from probability to action only works when probabilities are trustworthy. A model that systematically overstates confidence will trigger unnecessary interventions. A model that understates confidence will miss actionable findings. A model that reports high confidence on inputs it has never seen before fails at a fundamental level regardless of its average performance on familiar data. Uncertainty quantification provides the tools to assess when model predictions deserve trust.
24.1 Types of Uncertainty in Genomic Prediction
Uncertainty in genomic prediction arises from two fundamentally different sources that demand different responses. One source reflects limitations in what the model has learned from available data; this uncertainty can, in principle, be reduced by gathering more examples or improving model architecture. The other source reflects genuine randomness in the biological system itself, where identical genotypes produce variable phenotypes through stochastic developmental processes, environmental interactions, or incomplete penetrance. Distinguishing between these sources determines whether additional data collection would help or whether we must accept irreducible limits on predictive confidence.
For biology readers: Two fundamentally different types of uncertainty require different responses:
Epistemic uncertainty (“knowledge uncertainty”): what the model does not know:
- Arises from limited training data or model limitations
- Can be reduced by collecting more data or improving the model
- Example: A variant in an understudied gene has high epistemic uncertainty because few similar variants were in training data
- Detected by: ensemble disagreement, out-of-distribution detection
Aleatoric uncertainty (“data uncertainty”): inherent randomness:
- Arises from genuine biological variability or measurement noise
- Cannot be reduced even with infinite data
- Example: Incomplete penetrance in BRCA1: the same variant causes cancer in some carriers but not others due to modifier genes and environment (Kuchenbaecker et al. 2017)
- Detected by: heteroscedastic models, known biology
Why the distinction matters clinically:
| Uncertainty Type | Response | Action |
|---|---|---|
| High epistemic | Defer, investigate | Order additional testing, expert consult |
| High aleatoric | Accept, communicate | Explain inherent limits to patient |
| High both | Maximum caution | Conservative management |
A model that conflates these types cannot guide appropriate action: it cannot tell you whether more data would help or whether you must accept the uncertainty as irreducible.
24.1.1 Why Uncertainty Matters
Clinical genetics operates under fundamental uncertainty. When a laboratory reports a variant of uncertain significance (VUS), they acknowledge that current evidence cannot confidently classify the variant as pathogenic or benign. ClinVar contains approximately two million VUS compared to roughly 250,000 variants classified as pathogenic (Landrum et al. 2018), reflecting the reality that most genetic variation remains incompletely understood. Foundation models inherit and sometimes amplify this uncertainty: they may produce confident-seeming scores for variants where the underlying biology remains genuinely unknown. The challenges of VUS classification and current interpretation frameworks are examined in detail in Chapter 29.
The consequences of ignoring uncertainty extend beyond statistical abstraction. An overconfident pathogenic prediction may trigger unnecessary interventions, from prophylactic surgeries to reproductive decisions that alter family planning. An overconfident benign prediction may provide false reassurance, delaying diagnosis while a treatable condition progresses. In both cases, the harm stems not from prediction error per se but from the mismatch between stated confidence and actual reliability. A model that accurately conveys its uncertainty enables appropriate clinical reasoning even when the prediction itself is imperfect.
A model can be highly accurate on average yet dangerously miscalibrated. If a model achieves 90% accuracy but assigns 95% confidence to all predictions, clinicians will trust predictions that deserve skepticism. Conversely, a model with 80% accuracy that honestly reports its uncertainty enables better decisions than the overconfident 90% model. The goal is not just to be right, but to know when you are right.
Decision theory formalizes this intuition. The expected value of a clinical action depends on the probability of each outcome weighted by its utility. When a model reports 0.73 probability of pathogenicity, downstream decision-making implicitly assumes this probability is accurate. If the true probability is 0.50, actions optimized for 0.73 will systematically err. Uncertainty quantification ensures that the probabilities entering clinical decisions reflect genuine knowledge rather than artifacts of model architecture or training procedure.
24.1.2 Epistemic Uncertainty
A model trained exclusively on European-ancestry data encounters its first genome from an individual of African ancestry. The model’s predictions may be statistically valid within the distribution it has seen, yet unreliable for this new input due to limited exposure to ancestry-specific patterns of variation, linkage disequilibrium, and regulatory architecture. This uncertainty about what the model has learned, as distinct from noise inherent in the prediction task itself, constitutes epistemic uncertainty.
Epistemic uncertainty arises from limitations in training data that could, in principle, be reduced by gathering more examples. In genomic foundation models, epistemic uncertainty concentrates in predictable regions of biological space. Proteins from poorly characterized families, where training data contained few homologs, exhibit high epistemic uncertainty because the model has limited basis for inference. This manifests concretely in protein benchmarks: ProteinGym performance varies substantially across protein families (Section 11.1.3). Genes with few characterized variants in ClinVar or gnomAD provide sparse supervision, leaving the model uncertain about which sequence features distinguish pathogenic from benign variation (see Section 2.8.1 and Section 2.2.3 for data resource details). Rare variant classes, such as in-frame deletions in specific protein domains, appear infrequently in training data and consequently generate uncertain predictions. Populations under-represented in biobanks contribute fewer training examples, creating systematic epistemic uncertainty for individuals from these backgrounds, a challenge examined in Section 3.7 and with confounding implications discussed in Section 13.2.1.
Mathematically, epistemic uncertainty reflects uncertainty over model parameters or learned representations. A Bayesian perspective treats the trained model as one sample from a posterior distribution over possible models consistent with the training data. Different plausible models may disagree on predictions for inputs far from training examples while agreeing on well-represented inputs. This disagreement manifests as high variance in predictions across model variants, sensitivity to random initialization, or instability under small perturbations to training data.
Foundation models exhibit epistemic uncertainty through several observable signatures. Embeddings for unfamiliar sequences cluster in sparse regions of representation space, distant from the dense clusters formed by well-represented sequence families. Ensemble members trained with different random seeds produce divergent predictions for novel inputs while converging for familiar ones. Fine-tuning on the same downstream task with different random seeds yields inconsistent results for edge cases. These signatures provide practical diagnostics for identifying when epistemic uncertainty is high.
Before reading the next section, consider: If you have a variant in a gene with only 5 known pathogenic variants in ClinVar (versus thousands for a well-studied gene like BRCA1), what type of uncertainty dominates your prediction? Would collecting more data help? What kind of data would be most valuable?
24.1.3 Aleatoric Uncertainty
Some variants are genuinely ambiguous regardless of how much data we collect. The same pathogenic variant in BRCA1 causes breast cancer in one carrier but not another due to modifier genes, hormonal exposures, or stochastic developmental processes. Incomplete penetrance, the phenomenon where disease-associated variants do not always produce disease, creates irreducible uncertainty that no amount of training data can eliminate. This inherent randomness in the mapping from genotype to phenotype constitutes aleatoric uncertainty.
Aleatoric uncertainty reflects noise or stochasticity intrinsic to the prediction problem rather than limitations of the model. In statistical learning theory, this concept connects directly to the Bayes error rate—the minimum achievable error for any classifier given the inherent overlap between class distributions. Even an optimal classifier with access to the true data-generating distribution cannot achieve zero error when biological variability causes the same genotype to produce different phenotypes. This represents the irreducible error component in the bias-variance decomposition: total error = bias + variance + irreducible error. The irreducible term persists regardless of model complexity or data quantity.
Variable expressivity means that even when a variant causes disease, the severity and specific manifestations vary across individuals. Measurement noise in functional assays introduces uncertainty into the labels used for training: deep mutational scanning experiments typically exhibit 10 to 20 percent technical variation between replicates (Fowler and Fields 2014; Rubin et al. 2017), creating a floor below which prediction error cannot decrease regardless of model sophistication (see Section 2.4.4 for a discussion of DMS data characteristics). Replicate convergence analysis provides a practical estimate of this floor: when averaging across increasing numbers of experimental replicates yields diminishing returns, the residual variance approximates aleatoric uncertainty. Stochastic gene expression means that two genetically identical cells may express a gene at different levels due to random fluctuations in transcription and translation. These sources of randomness set fundamental limits on predictive accuracy.
Biology is inherently stochastic at the molecular level. Gene expression involves probabilistic events: transcription factor binding follows mass-action kinetics with random association and dissociation; RNA polymerase initiation occurs stochastically; mRNA molecules degrade with exponential waiting times; translation initiation is probabilistic. These molecular fluctuations propagate to cellular phenotypes.
Key stochastic phenomena relevant to genomic prediction include:
- Transcriptional bursting: Genes transcribe in discrete pulses rather than continuously, creating cell-to-cell variability even among genetically identical cells (Raj and van Oudenaarden 2008)
- Allelic imbalance: Random monoallelic expression and X-chromosome inactivation introduce stochastic differences between cells
- Developmental noise: Random fluctuations during embryogenesis can amplify small initial differences, contributing to incomplete penetrance
- Epigenetic bistability: Some regulatory circuits exhibit bistable dynamics where stochastic fluctuations can flip cells between states
These processes explain why identical genotypes can produce variable phenotypes, and why this variability represents a fundamental limit on predictive accuracy rather than a failure of modeling.
The term “irreducible” in aleatoric uncertainty requires careful interpretation. From the perspective of scaling training data alone, this uncertainty cannot be reduced. However, process improvements can shrink what appeared irreducible:
- Better phenotyping: Finer-grained outcome measurements may resolve apparent stochasticity (distinguishing subtypes of disease previously grouped together)
- Additional covariates: Measuring modifier genes, environmental exposures, or epigenetic state can explain previously random variation
- Improved assays: Reducing technical noise in functional experiments lowers the measurement component of aleatoric uncertainty
What counts as “irreducible” depends on the information available. Today’s aleatoric uncertainty may become tomorrow’s explained variance through richer data collection, though some genuine biological stochasticity will always remain.
Aleatoric uncertainty often varies with the input, a property termed heteroscedasticity (from Greek: “different scatter”). This statistical term describes situations where the variance of residuals differs across the range of predictor values, in contrast to homoscedasticity where variance remains constant. In genomic prediction, heteroscedasticity is the norm rather than the exception. Coding variants in essential genes may have relatively low aleatoric uncertainty because strong selection pressure produces consistent phenotypic effects. Regulatory variants exhibit higher aleatoric uncertainty because their effects depend on cellular context, developmental timing, and interactions with other genetic and environmental factors. A model that captures this heteroscedasticity can provide more informative uncertainty estimates by conveying that some predictions are inherently more reliable than others.
The following table summarizes the key distinctions between epistemic and aleatoric uncertainty:
| Property | Epistemic Uncertainty | Aleatoric Uncertainty |
|---|---|---|
| Source | Limited training data, model knowledge gaps | Inherent noise in biological system |
| Reducibility | Can be reduced with more data | Irreducible, fundamental limit |
| Example causes | Under-represented populations, rare genes, novel folds | Incomplete penetrance, measurement noise, stochastic expression |
| Detection method | Ensemble disagreement, embedding distance | Heteroscedastic model predictions |
| Response | Collect more data, seek expert review | Accept limits, communicate uncertainty |
| Clinical implication | Defer decision, request additional evidence | Proceed with acknowledged uncertainty |
24.1.4 Decomposing Total Uncertainty
Total predictive uncertainty combines epistemic and aleatoric components, and distinguishing between them has practical implications for decision-making. High epistemic uncertainty suggests that gathering more data, either through additional training examples or further investigation of the specific case, could reduce uncertainty and improve the prediction. High aleatoric uncertainty indicates that the prediction is as good as it can get given inherent noise in the problem; additional data will not help because the underlying biology is stochastic.
The following mathematical framework for uncertainty decomposition requires comfort with variance and conditional probability. If these concepts are unfamiliar, focus on the intuition: ensemble disagreement measures epistemic uncertainty, while within-model variance measures aleatoric uncertainty. The key insight is that total uncertainty can be partitioned into “uncertainty we can reduce” and “uncertainty we cannot.”
The law of total variance provides a mathematical framework for this decomposition. For a prediction \(Y\) given input \(x\) and model parameters \(\theta\):
\[ \underbrace{\text{Var}[Y \mid x]}_{\text{Total uncertainty}} = \underbrace{\mathbb{E}_\theta[\text{Var}[Y \mid x, \theta]]}_{\text{Aleatoric (expected variance)}} + \underbrace{\text{Var}_\theta[\mathbb{E}[Y \mid x, \theta]]}_{\text{Epistemic (variance of means)}} \tag{24.1}\]
In words:
- Total uncertainty = uncertainty in predictions for input \(x\)
- Aleatoric component = average variance across different models (inherent noise that persists regardless of which model parameters we have)
- Epistemic component = variance of the mean predictions across models (disagreement about what the prediction should be)
For an ensemble of \(M\) models, this decomposition can be estimated empirically:
\[ \hat{\sigma}^2_{\text{epistemic}}(x) = \frac{1}{M} \sum_{m=1}^{M} \left( \hat{y}_m(x) - \bar{y}(x) \right)^2 \tag{24.2}\]
\[ \hat{\sigma}^2_{\text{aleatoric}}(x) = \frac{1}{M} \sum_{m=1}^{M} \hat{\sigma}^2_m(x) \tag{24.3}\]
where \(\hat{y}_m(x)\) is model \(m\)’s prediction, \(\bar{y}(x) = \frac{1}{M}\sum_m \hat{y}_m(x)\) is the ensemble mean, and \(\hat{\sigma}^2_m(x)\) is model \(m\)’s predicted variance (for heteroscedastic models).
In practice, ensemble methods approximate epistemic uncertainty through disagreement between members: if five independently trained models produce predictions of 0.65, 0.68, 0.70, 0.72, and 0.75, the spread reflects epistemic uncertainty, while the residual variance within each model’s predictions reflects aleatoric uncertainty. Heteroscedastic neural networks, which output both a predicted mean and a predicted variance, can estimate aleatoric uncertainty by learning input-dependent noise levels.
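Equation 24.2 and Equation 24.3 translate directly into a few lines of array arithmetic. The sketch below assumes a hypothetical ensemble of heteroscedastic models whose per-member means and variances are already available as NumPy arrays; the function name, array shapes, and toy values are illustrative rather than drawn from any particular library.

```python
import numpy as np

def decompose_uncertainty(member_means, member_variances):
    """Split total predictive uncertainty into epistemic and aleatoric components.

    member_means:     shape (M, N) array; each model's predicted mean for N inputs.
    member_variances: shape (M, N) array; each model's predicted variance from a
                      heteroscedastic head (pass zeros if unavailable).
    """
    member_means = np.asarray(member_means, dtype=float)
    member_variances = np.asarray(member_variances, dtype=float)

    ensemble_mean = member_means.mean(axis=0)                        # \bar{y}(x)
    epistemic = ((member_means - ensemble_mean) ** 2).mean(axis=0)   # Eq. 24.2
    aleatoric = member_variances.mean(axis=0)                        # Eq. 24.3
    total = epistemic + aleatoric                                    # law of total variance
    return ensemble_mean, epistemic, aleatoric, total

# Toy example: five ensemble members scoring a single variant (values from the text).
means = [[0.65], [0.68], [0.70], [0.72], [0.75]]
variances = [[0.01]] * 5          # each member's own noise estimate
mean, epi, ale, tot = decompose_uncertainty(means, variances)
print(mean, epi, ale, tot)        # spread across members drives the epistemic term
```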
These decompositions depend on modeling assumptions and provide approximations rather than exact separations. Ensemble disagreement may underestimate epistemic uncertainty if all members share similar biases from common training data. Heteroscedastic models may confound aleatoric and epistemic uncertainty if the training data is too sparse to reliably estimate noise levels. Despite these limitations, approximate decomposition provides actionable information: variants flagged for high epistemic uncertainty warrant additional data collection or expert review, while variants with high aleatoric uncertainty may require acceptance of irreducible limits on predictive confidence.
24.2 Bayesian Uncertainty Quantification
Bayesian methods provide a principled framework for uncertainty that naturally distinguishes epistemic from aleatoric sources. Rather than learning point estimates of model parameters, Bayesian approaches maintain distributions over parameters, propagating uncertainty through predictions.
24.2.1 Bayesian Framework
Core Idea. Instead of point estimates \(\hat{\theta}\), maintain a distribution over parameters \(p(\theta | D)\):
\[p(\theta | D) = \frac{p(D | \theta) p(\theta)}{p(D)} \tag{24.4}\]
where the prior \(p(\theta)\) encodes beliefs before seeing data, the likelihood \(p(D|\theta)\) measures how well parameters explain data, and the posterior \(p(\theta|D)\) represents updated beliefs after observing data.
Predictive Distribution. For new input \(x^*\):
\[p(y^* | x^*, D) = \int p(y^* | x^*, \theta) p(\theta | D) d\theta \tag{24.5}\]
This integral marginalizes over parameter uncertainty, so predictions account for not knowing the “true” parameters.
24.2.2 Approximate Bayesian Inference for Neural Networks
Full Bayesian inference is computationally prohibitive for large neural networks. Practical approximations include:
MC Dropout. Dropout at test time approximates variational inference with Bernoulli posterior over weights (Gal and Ghahramani 2016). Multiple forward passes with dropout enabled yield a distribution of predictions; disagreement indicates epistemic uncertainty.
Deep Ensembles as Posterior Samples. Training \(M\) networks from different initializations can be viewed as sampling from an implicit posterior. Ensemble disagreement estimates epistemic uncertainty without explicit Bayesian computation.
Variational Inference. Approximate the posterior with a tractable family \(q_\phi(\theta)\) by maximizing the evidence lower bound (ELBO). Mean-field VI assumes independent posteriors per parameter, which scales to large networks but may underestimate uncertainty.
A pragmatic workflow for genomic applications (a code sketch follows the list):
- Start with a deep ensemble (simple, parallelizable)
- If the ensemble disagrees significantly, flag the prediction for review
- For high-stakes decisions, consider heavier-weight Bayesian inference beyond the ensemble (for example, MCMC over a subset of parameters)
- Report credible intervals, not just point predictions
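The following minimal sketch illustrates the last two recommendations: treat per-member predictions (from an ensemble or from repeated MC-dropout passes) as approximate posterior samples, report an empirical credible interval, and flag wide intervals for expert review. The predictions and the disagreement threshold are hypothetical placeholders that would need to be set for a specific application.

```python
import numpy as np

def summarize_posterior_samples(predictions, ci=0.95, disagreement_threshold=0.15):
    """Summarize approximate posterior samples for a single input.

    predictions: shape (M,) array with one probability per ensemble member
                 (or per stochastic forward pass).
    Returns the mean prediction, an empirical credible interval, and a flag
    indicating whether the spread is wide enough to warrant expert review.
    """
    p = np.asarray(predictions, dtype=float)
    lower, upper = np.quantile(p, [(1 - ci) / 2, 1 - (1 - ci) / 2])
    return {
        "mean": p.mean(),
        "credible_interval": (float(lower), float(upper)),
        "flag_for_review": bool((upper - lower) > disagreement_threshold),
    }

# Example: five fine-tuned heads disagree moderately on one variant.
print(summarize_posterior_samples([0.55, 0.62, 0.70, 0.78, 0.81]))
```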
24.3 Calibration and Confidence Interpretation
Calibration determines whether model outputs can be interpreted as probabilities. A score of 0.85 from a pathogenicity predictor should mean that 85% of variants receiving similar scores are truly pathogenic; only then can clinicians rationally weight computational evidence against other diagnostic criteria. The following sections provide a comprehensive treatment of calibration theory and methods, serving as the formal foundation for calibration discussions in applied contexts such as variant effect prediction (Section 18.5) and evaluation methodology (Chapter 12).
24.3.1 The Calibration Problem
AlphaMissense outputs a continuous score between 0 and 1 for each possible missense variant in the human proteome. When it reports 0.85 for a particular variant, what does this number mean? If the model is calibrated, collecting all variants scored near 0.85 and checking their true clinical status should reveal that approximately 85% are pathogenic. Perfect calibration means that predicted probabilities match observed frequencies across the entire range of model outputs: among variants scored at 0.30, roughly 30% should be pathogenic; among variants scored at 0.95, roughly 95% should be pathogenic. This alignment between stated confidence and empirical accuracy is calibration, and most foundation models fail to achieve it.
A familiar analogy clarifies the concept: weather forecasts. When a weather app reports “70% chance of rain,” that prediction is calibrated if, across many days receiving 70% forecasts, it actually rains about 70% of the time. If rain occurs on only 40% of those days, the app is overconfident, and its numerical predictions cannot reliably inform planning. Similarly, a pathogenicity predictor claiming 85% confidence should be correct 85% of the time at that confidence level; otherwise, clinicians cannot rationally weigh its predictions against other evidence.
Formally, a model \(f\) mapping inputs \(X\) to probability estimates \(p = f(X)\) is calibrated if \(P(Y = 1 \mid f(X) = p) = p\) for all \(p\) in the interval from \(0\) to \(1\). The calibration condition requires that the model’s stated confidence equals the true probability of the positive class conditional on that stated confidence. Miscalibration occurs when this equality fails: overconfident models produce predicted probabilities that exceed true frequencies (a variant scored at \(0.85\) is pathogenic only \(60\%\) of the time), while underconfident models produce predicted probabilities below true frequencies.
Modern deep neural networks are systematically miscalibrated despite achieving high accuracy. Guo and colleagues demonstrated that contemporary architectures exhibit worse calibration than older, less accurate models (Guo et al. 2017). The phenomenon arises because standard training objectives like cross-entropy loss optimize for discrimination (separating positive from negative examples) rather than calibration (matching predicted probabilities to frequencies). Over-parameterized models with capacity exceeding what the data requires can achieve near-perfect training loss while producing overconfident predictions on held-out data. The softmax temperature in transformer architectures affects the sharpness of probability distributions, and default settings often produce excessively peaked outputs.
Calibration and discrimination are distinct properties. A model can achieve perfect area under the receiver operating characteristic curve (auROC), correctly ranking all positive examples above all negative examples, while being arbitrarily miscalibrated. If a classifier assigns probability 0.99 to all positive examples and 0.98 to all negative examples, it ranks perfectly but provides useless probability estimates. Conversely, a calibrated model that assigns probability 0.5 to every variant would be perfectly calibrated if half of all variants are truly pathogenic, but such a model would be useless for discrimination since it provides no information about which variants are which. Clinical applications typically require both: accurate ranking to identify high-risk variants and accurate probabilities to inform decision-making.
A variant effect predictor achieves auROC of 0.95 on a held-out test set. A colleague concludes that “the model’s probability estimates are reliable.” What is wrong with this reasoning? What additional assessment would you recommend?
auROC measures discrimination (how well the model ranks pathogenic variants above benign ones) but says nothing about calibration (whether predicted probabilities match observed frequencies). A model can achieve perfect auROC while being arbitrarily miscalibrated (for example, by assigning 0.99 to all pathogenic variants and 0.98 to all benign variants). Proper assessment requires reliability diagrams and ECE, stratified by relevant subgroups like ancestry and variant type.
Score, confidence, probability, and likelihood are often used loosely in machine learning, but they have distinct meanings that matter for clinical interpretation:
Score: A generic term for any numerical output from a model. Scores may be unbounded (like logits), bounded (like sigmoid outputs), or transformed in various ways. A score carries no inherent probabilistic interpretation; it is simply a number that the model uses to rank or classify inputs.
Confidence: In machine learning, typically refers to the model’s internal certainty about a prediction, often operationalized as the maximum softmax probability. A model reporting “95% confidence” means its highest class probability is 0.95. However, confidence can be miscalibrated: a model may express high confidence while being frequently wrong.
Probability: A calibrated estimate that can be interpreted as a true frequency. When a model reports “probability 0.7 of pathogenicity,” this should mean that 70% of variants receiving this score are actually pathogenic. Probability requires calibration; raw model outputs rarely achieve this interpretation.
Likelihood: In statistics, likelihood refers specifically to \(P(\text{data} \mid \text{parameters})\)—the probability of observing the data given specific model parameters. This is distinct from probability of the hypothesis given the data. Language models compute likelihoods (probability of a sequence given the model), which can be converted to variant effect scores but are not pathogenicity probabilities without calibration.
Clinical implication: When communicating with clinicians, clarify whether a number is a raw score (useful only for ranking), a confidence (the model’s internal certainty, which may be miscalibrated), or a calibrated probability (a true frequency estimate suitable for decision-making).
24.3.2 Measuring Calibration
Reliability diagrams provide visual assessment of calibration by plotting predicted probabilities against observed frequencies. Predictions are binned into intervals (commonly using decile bins: 0-0.1, 0.1-0.2, …, 0.9-1.0), and for each bin, the mean predicted probability is plotted against the observed fraction of positive examples. A perfectly calibrated model produces points along the diagonal where predicted probability equals observed frequency. Systematic deviations reveal calibration patterns: points below the diagonal indicate overconfidence (predictions exceed reality), points above indicate underconfidence, and S-shaped curves suggest nonlinear miscalibration requiring more flexible correction.
Expected calibration error (ECE) provides a scalar summary of calibration quality. ECE computes the weighted average absolute difference between predicted probabilities and observed frequencies across bins:
\[ \text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \text{acc}(B_m) - \text{conf}(B_m) \right| \tag{24.6}\]
where:
- \(M\) is the number of bins (typically 10-15)
- \(B_m\) denotes the set of examples in bin \(m\) and \(|B_m|\) is the number of examples in that bin
- \(n\) is the total number of examples
- \(\text{acc}(B_m)\) is the accuracy (fraction of positives) in bin \(m\)
- \(\text{conf}(B_m)\) is the mean predicted probability in bin \(m\)

Lower ECE indicates better calibration, with zero representing perfect calibration. ECE depends on binning strategy; equal-width bins may place most examples in a few bins for models with concentrated predictions, while equal-mass bins ensure each bin contains the same number of examples but may span wide probability ranges.
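A minimal implementation of Equation 24.6 with equal-width bins is sketched below; the binning scheme, number of bins, and toy data are illustrative assumptions, and the binning choices should be reported alongside any ECE value.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width bins (Eq. 24.6); also returns reliability-diagram data."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Assign each prediction to an equal-width bin in [0, 1].
    bin_ids = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece, diagram = 0.0, []
    for b in range(n_bins):
        in_bin = bin_ids == b
        if not in_bin.any():
            continue
        conf = probs[in_bin].mean()     # mean predicted probability in the bin
        acc = labels[in_bin].mean()     # observed fraction of positives
        ece += in_bin.mean() * abs(acc - conf)   # weight |B_m|/n times the gap
        diagram.append({"bin": b, "confidence": conf, "accuracy": acc,
                        "count": int(in_bin.sum())})
    return ece, diagram

# Toy usage: a small set of overconfident scores versus noisier labels.
scores = np.array([0.95, 0.90, 0.85, 0.80, 0.30, 0.20, 0.15, 0.10])
truth  = np.array([1,    1,    0,    1,    0,    0,    1,    0])
ece, diagram = expected_calibration_error(scores, truth, n_bins=5)
print(f"ECE = {ece:.3f}")
```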
Maximum calibration error (MCE) captures worst-case miscalibration by reporting the largest absolute gap between predicted and observed frequencies across all bins. MCE is appropriate when any severe miscalibration is unacceptable, as in high-stakes clinical applications where even rare catastrophic errors carry significant consequences.
Brier score decomposes into components measuring calibration and discrimination (refinement), providing a single proper scoring rule that rewards both properties. The Brier score equals the mean squared difference between predicted probabilities and binary outcomes, and its decomposition reveals whether poor scores stem from miscalibration, poor discrimination, or both.
The following table summarizes these calibration metrics and their appropriate use cases:
| Metric | Formula/Description | Range | Best Value | Use Case |
|---|---|---|---|---|
| Reliability diagram | Visual: predicted vs. observed frequency | N/A | Points on diagonal | Initial assessment, pattern identification |
| ECE | Weighted average absolute gap between predicted and observed frequencies across bins | [0, 1] | 0 | Standard scalar summary for comparing models |
| MCE | Maximum absolute gap between predicted and observed frequencies across bins | [0, 1] | 0 | High-stakes settings where worst-case miscalibration matters |
| Brier score | Mean squared error of probabilities | [0, 1] | 0 | Combined calibration + discrimination |
24.3.3 Why Foundation Models Are Often Miscalibrated
Foundation models face calibration challenges beyond those affecting standard neural networks. Pretraining objectives like masked language modeling optimize for predicting held-out tokens, not for producing calibrated probability distributions over downstream tasks (see Chapter 8 for a detailed discussion of pretraining objectives). The representations learned during pretraining may encode useful information about sequence biology while providing no guarantee that fine-tuned classifiers will be well-calibrated.
Distribution shift between pretraining and evaluation compounds miscalibration. A protein language model pretrained on UniRef sequences encounters a fine-tuning task using ClinVar variants. The pretraining distribution emphasizes common proteins with many homologs, while clinical variants concentrate in disease-associated genes with different sequence characteristics. Models may be well-calibrated on held-out pretraining data while miscalibrated on clinically relevant evaluation sets. The broader challenges of distribution shift are examined in Section 10.5.
Label noise in training data propagates to calibration errors. ClinVar annotations reflect the state of knowledge at submission time and may contain errors, particularly for older entries or variants from less-studied genes. Deep mutational scanning experiments provide functional labels but with measurement noise that varies across assays. Models trained on noisy labels may learn the noise distribution, producing predictions that match training labels but not underlying truth.
Zero-shot approaches present particular calibration challenges. ESM-1v log-likelihood ratios measure how surprising a mutation is to the language model, but these ratios are not probabilities and have no inherent calibration. Converting log-likelihood ratios to pathogenicity probabilities requires explicit calibration against external labels, and the resulting calibration depends on the reference dataset used for this conversion. The protein language model family and its variant effect scoring capabilities are discussed in Section 16.6.
24.3.4 Calibration Across Subgroups
Aggregate calibration metrics can mask severe miscalibration in clinically important subgroups. A model might achieve low ECE overall while being systematically miscalibrated for specific populations, with opposite errors canceling in aggregate statistics. For example, overconfidence in one ancestry group and underconfidence in another can produce near-zero average calibration error despite poor performance for both. Subgroup-stratified calibration assessment is essential for any model intended for diverse populations.
While European versus African ancestry comparisons appear frequently in genomic literature (reflecting the populations most represented and most underrepresented in training data, respectively), calibration disparities extend across the full spectrum of human genetic diversity. Consider also:
- Admixed populations: Individuals with mixed ancestry may experience calibration errors that average between their component populations or show unique patterns not captured by any single-ancestry model
- Rare and isolated populations: Indigenous populations, genetic isolates (e.g., Ashkenazi Jewish, Finnish, Amish), and island populations often harbor unique variants absent from reference databases
- South Asian and Middle Eastern populations: These groups are frequently underrepresented in both training data and evaluation benchmarks
- Within-continent diversity: “African ancestry” encompasses tremendous genetic diversity; models calibrated on East African populations may miscalibrate for West or Southern African individuals
Calibration assessment should examine as many ancestry groups as evaluation data permit, not just the most common comparison.
Aggregate calibration metrics (ECE, Brier score) can appear excellent while masking severe disparities across subgroups. A model that is overconfident for one population and underconfident for another may achieve near-zero aggregate ECE if errors cancel. Always compute calibration metrics stratified by ancestry, gene family, and variant class. This is not just a technical concern; it directly impacts health equity.
Ancestry-stratified calibration reveals systematic patterns in current foundation models. Training data for protein language models and variant effect predictors derive predominantly from European-ancestry cohorts, creating differential epistemic uncertainty across populations. Calibration curves stratified by ancestry often show that models are better calibrated for populations well-represented in training data and overconfident or underconfident for underrepresented populations. This differential calibration has direct fairness implications: clinical decisions based on miscalibrated predictions will be systematically worse for patients from underrepresented backgrounds. The broader challenges of fairness and health equity are addressed in Section 3.7.2 and Chapter 27.
Calibration may also vary by variant class, gene constraint level, protein family, or disease category. Missense variants in highly constrained genes may show different calibration patterns than those in tolerant genes. Variants in well-studied protein families with abundant training examples may be better calibrated than variants in orphan proteins. Stratified reliability diagrams across these categories reveal whether a single calibration correction suffices or whether subgroup-specific approaches are necessary.
24.4 Post-Hoc Calibration Methods
Before learning about calibration methods, consider: If a model consistently assigns probabilities that are too high (overconfident), what mathematical operation might correct this? What if the overconfidence varies depending on the predicted probability level?
24.4.1 Temperature Scaling
The simplest calibration fix is often the most effective. Temperature scaling applies a single learned parameter to adjust model confidence, dramatically improving calibration with minimal computational overhead and no change to model predictions’ ranking.
The method modifies the softmax function by dividing logits by a temperature parameter \(T\) before applying softmax. The intuition is straightforward: overconfident models produce logits that are too large in magnitude, causing softmax probabilities to concentrate near 0 or 1 rather than reflecting true uncertainty. Dividing by \(T > 1\) shrinks these logits toward zero, which “softens” the probability distribution and reduces spurious confidence:
\[ \hat{p}_i = \frac{\exp(z_i / T)}{\sum_{j=1}^{K} \exp(z_j / T)} \tag{24.7}\]
where:
- \(z_i\) are the logits (pre-softmax outputs) for class \(i\)
- \(K\) is the number of classes
- \(T > 0\) is the temperature parameter:
- \(T = 1\): original softmax (no calibration)
- \(T > 1\): softer distribution (reduces overconfidence)
- \(T < 1\): sharper distribution (increases confidence)
- \(\hat{p}_i\) are the calibrated probabilities
A variant effect prediction model produces logits for a missense variant:
Before calibration (T = 1.0):

- Logits: \(z = [2.5, -1.2]\) for [pathogenic, benign]
- Softmax: \(p = \left[\frac{e^{2.5}}{e^{2.5} + e^{-1.2}}, \frac{e^{-1.2}}{e^{2.5} + e^{-1.2}}\right] = [0.976, 0.024]\)
- The model claims 97.6% confidence in pathogenicity

After calibration (T = 2.0, learned from calibration set):

- Scaled logits: \(z/T = [1.25, -0.6]\)
- Softmax: \(p = \left[\frac{e^{1.25}}{e^{1.25} + e^{-0.6}}, \frac{e^{-0.6}}{e^{1.25} + e^{-0.6}}\right] = [0.86, 0.14]\)
- Now the model reports 86% confidence, still favoring pathogenic, but acknowledging greater uncertainty
| Temperature | P(pathogenic) | P(benign) | Interpretation |
|---|---|---|---|
| T = 1.0 (uncalibrated) | 0.976 | 0.024 | Overconfident |
| T = 2.0 (calibrated) | 0.86 | 0.14 | Realistic uncertainty |
| T = 4.0 (over-softened) | 0.71 | 0.29 | Underconfident |
Note that the ranking is preserved: pathogenic remains more likely than benign. Only the magnitude of confidence changes.
When \(T > 1\), the distribution becomes softer (more uniform), reducing overconfidence. When \(T < 1\), the distribution becomes sharper, increasing confidence. The optimal temperature is learned by minimizing negative log-likelihood on a held-out calibration set, typically yielding \(T\) between 1.5 and 3 for overconfident deep networks.
Temperature scaling preserves the model’s ranking because dividing all logits by the same constant does not change their relative ordering. A variant ranked as more likely pathogenic than another remains more likely after temperature scaling; only the magnitudes of probability estimates change. This preservation of discrimination while improving calibration makes temperature scaling particularly attractive: calibration improves without sacrificing the model’s hard-won ability to distinguish pathogenic from benign variants.
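In code, fitting the temperature reduces to a one-dimensional search that minimizes negative log-likelihood on a held-out calibration set. The sketch below assumes binary classification where each logit is the difference between the pathogenic and benign logits (so two-class softmax reduces to a sigmoid) and uses a simple grid search for clarity; the calibration data and grid range are hypothetical.

```python
import numpy as np

def nll_binary(logits, labels, T):
    """Negative log-likelihood of binary labels under temperature-scaled logits."""
    p = 1.0 / (1.0 + np.exp(-logits / T))          # sigmoid of the scaled logit
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the temperature that minimizes NLL on a held-out calibration set."""
    return min(grid, key=lambda T: nll_binary(logits, labels, T))

# Hypothetical calibration set: logit = log(p_pathogenic / p_benign) per variant.
cal_logits = np.array([3.7, 2.5, 1.8, -0.4, -2.1, -3.0])
cal_labels = np.array([1,   1,   0,    1,    0,    0])
T = fit_temperature(cal_logits, cal_labels)
calibrated = 1.0 / (1.0 + np.exp(-cal_logits / T))
print(f"learned T = {T:.2f}")
```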
The method’s simplicity (one parameter) is both strength and limitation. A single global temperature cannot fix heterogeneous miscalibration where the model is overconfident in some regions of input space and underconfident in others. When reliability diagrams show complex nonlinear patterns, more flexible calibration methods are necessary.
24.4.2 Platt Scaling
Platt scaling fits a logistic regression model on the original model’s outputs, learning both a slope and intercept to transform scores into calibrated probabilities. For binary classification:
\[ \hat{p}(x) = \sigma(a \cdot f(x) + b) = \frac{1}{1 + \exp(-(a \cdot f(x) + b))} \tag{24.8}\]
where:
- \(f(x) \in [0, 1]\) is the original model’s predicted probability for input \(x\)
- \(\sigma(\cdot)\) is the sigmoid function
- \(a, b \in \mathbb{R}\) are learned parameters: \(a\) controls sharpness, \(b\) controls the decision threshold
- Parameters are fit by maximizing log-likelihood on a held-out calibration set

The two parameters provide more flexibility than temperature scaling’s single parameter, allowing correction of both the sharpness and the location of the probability distribution.
Platt scaling is appropriate when miscalibration involves systematic bias (predictions consistently too high or too low) in addition to over- or underconfidence. The method assumes that a monotonic logistic transformation suffices to correct miscalibration, which may not hold for models with complex, non-monotonic calibration curves.
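Because Platt scaling is simply logistic regression on the raw score, it can be fit with standard tooling. The sketch below uses scikit-learn’s LogisticRegression on a hypothetical calibration set; the large regularization constant approximates an unregularized fit of the two parameters \(a\) and \(b\).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical held-out calibration set: raw model scores f(x) and labels.
raw_scores = np.array([0.95, 0.90, 0.80, 0.60, 0.40, 0.20, 0.10]).reshape(-1, 1)
labels     = np.array([1,    1,    0,    1,    0,    0,    0])

# Platt scaling: p_calibrated = sigmoid(a * f(x) + b), fit by logistic regression
# with the raw score as the single feature.
platt = LogisticRegression(C=1e6)     # large C ~ effectively unregularized (a, b)
platt.fit(raw_scores, labels)

new_scores = np.array([[0.85], [0.30]])
print(platt.predict_proba(new_scores)[:, 1])   # calibrated probabilities
```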
24.4.3 Isotonic Regression
Isotonic regression provides a non-parametric approach that fits a monotonically increasing function mapping raw scores to calibrated probabilities. Unlike temperature or Platt scaling, isotonic regression makes no assumptions about the functional form of miscalibration, allowing it to correct arbitrary monotonic patterns.
The method works by pooling adjacent bins whose empirical frequencies violate monotonicity, then assigning each bin its pooled frequency. The resulting calibration function is a step function that increases with the original score. This flexibility comes at a cost: with limited calibration data, isotonic regression may overfit to noise in the calibration set, and the step-function output can appear discontinuous. Additionally, isotonic regression provides no uncertainty estimate on the calibration itself; we learn a point estimate of the calibration function without knowing how reliable that estimate is.
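Scikit-learn provides an isotonic regression implementation that can serve directly as a calibrator; the sketch below fits the monotone step function on a hypothetical calibration set and applies it to new scores.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical calibration set: raw scores and binary outcomes.
raw_scores = np.array([0.05, 0.15, 0.30, 0.45, 0.60, 0.70, 0.85, 0.95])
labels     = np.array([0,    0,    1,    0,    1,    1,    1,    1])

# Fit a monotonically increasing step function mapping scores to frequencies.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_scores, labels)

print(iso.predict(np.array([0.25, 0.65, 0.90])))   # calibrated step-function values
```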
The following table provides guidance for selecting among post-hoc calibration methods:
| Method | Parameters | Best For | Limitations | Calibration Data Needed |
|---|---|---|---|---|
| Temperature scaling | 1 | Uniform overconfidence, when ranking must be preserved | Cannot fix heterogeneous or nonlinear miscalibration | ~1,000 examples |
| Platt scaling | 2 | Bias + overconfidence, binary classification | Assumes logistic correction is sufficient | ~1,000 examples |
| Isotonic regression | Many | Complex, nonlinear calibration curves | Overfits with limited data, discontinuous output | ~5,000+ examples |
| Subgroup-specific | Varies | Known calibration disparities across groups | Requires labeled data per subgroup | ~1,000 per group |
24.4.4 Calibrating Foundation Model Outputs
Genomic foundation models present specific calibration considerations beyond standard classification settings. The choice of calibration approach depends on whether the model produces logits, log-likelihood ratios, or continuous regression outputs, and on whether calibration targets are available for the deployment distribution.
For zero-shot variant effect scores like ESM-1v log-likelihood ratios, raw outputs have no inherent probabilistic interpretation. Calibration requires mapping these continuous scores to pathogenicity probabilities using external labels, typically from ClinVar or population frequency data. This mapping should occur on held-out genes or variants not used for any model development, and the resulting calibration reflects the specific label set used; calibration against ClinVar pathogenic/benign labels may not transfer to other clinical contexts. The principles of proper held-out evaluation are discussed in Chapter 12.
Multi-output models that predict across many tasks (multiple cell types, tissues, or assays) may require separate calibration for each output. A regulatory model predicting expression across 200 cell types is unlikely to be uniformly calibrated across all outputs; cell types with more training data may show better calibration than rare cell types.
Temporal stability of calibration deserves consideration. As ClinVar annotations evolve with new evidence, the ground truth against which models were calibrated changes. A model calibrated against 2020 ClinVar labels may become miscalibrated relative to 2025 labels as variant classifications are updated. Periodic recalibration against current labels helps maintain clinical relevance.
When deploying a foundation model for clinical variant interpretation:
- Assess baseline calibration using reliability diagrams and ECE on held-out data representative of deployment
- Stratify by subgroup (ancestry, gene family, variant class) to identify hidden disparities
- Start simple: Apply temperature scaling; check if the calibration curve approaches the diagonal
- Escalate if needed: If temperature scaling leaves nonlinear patterns, try Platt scaling or isotonic regression
- Validate on deployment distribution: Calibration learned on one distribution may not transfer
- Monitor and recalibrate: Track calibration over time as ground truth labels evolve
24.5 Uncertainty Quantification Methods for Foundation Models
Calibration ensures that stated probabilities match observed frequencies, but even well-calibrated models provide only point estimates. When a model reports 0.70 pathogenicity probability, is that uncertainty reducible with more data, or does it reflect genuine ambiguity in the biological signal? Distinguishing these sources of uncertainty enables more appropriate clinical responses: epistemic uncertainty (arising from limited data) suggests the prediction might change with additional evidence, while aleatoric uncertainty (inherent to the problem) indicates that even perfect models would remain uncertain.
24.5.1 Deep Ensembles
If one model expresses uncertainty about a prediction, querying multiple models reveals whether that uncertainty reflects genuine ambiguity in the data or an artifact of a particular training run. When five independently trained models agree on a prediction, confidence is warranted; when they disagree, the disagreement itself signals uncertainty. Ensemble disagreement provides one of the most reliable uncertainty estimates available in deep learning, at the cost of training and maintaining multiple models.
Deep ensembles train M models (typically 5 to 10) with different random initializations, data orderings, or minor architectural variations. At inference time, all members produce predictions, and uncertainty is estimated from the variance or entropy of the ensemble distribution. For classification, epistemic uncertainty appears as disagreement in predicted class probabilities across members. For regression, epistemic uncertainty appears as variance in predicted values.
Why do ensembles work for uncertainty estimation? The key insight is that ensemble disagreement reveals where the data underdetermines the model. The theoretical basis for ensemble uncertainty estimation rests on the observation that disagreement between models reflects regions of input space where the training data provides insufficient constraint. Where training examples are dense, gradient descent from different initializations converges to similar solutions, producing agreement. Where training examples are sparse or conflicting, different initializations find different local optima, producing disagreement. This interpretation connects ensembles to Bayesian model averaging, where predictions are averaged over the posterior distribution of model parameters.
For foundation models with billions of parameters, training full ensembles becomes prohibitively expensive. Training five copies of a large protein language model requires approximately five times the compute of a single model, potentially millions of dollars in cloud computing costs. However, some foundation models are released as ensembles by design. ESM-1v (Meier et al. 2021) exemplifies this approach: the original release includes five models trained with different random seeds, explicitly enabling ensemble-based uncertainty quantification for variant effect prediction. Ensemble disagreement among the five ESM-1v models provides a principled measure of epistemic uncertainty, and variants where all five models agree warrant higher confidence than variants where predictions diverge.
For models not released as ensembles, several practical alternatives reduce the computational burden. Last-layer ensembles freeze the pretrained backbone and train only an ensemble of prediction heads, reducing cost by orders of magnitude while still capturing uncertainty from the fine-tuning process. Snapshot ensembles save model checkpoints at various points during optimization and use these snapshots as ensemble members, requiring only single-model training time. Multi-seed fine-tuning trains the same architecture from multiple random seeds on the fine-tuning task, which is far cheaper than multi-seed pretraining. The broader considerations of fine-tuning and adaptation strategies are discussed in Chapter 9.
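A last-layer ensemble requires very little code once frozen embeddings have been computed. The sketch below trains several lightweight logistic heads on bootstrap resamples of a hypothetical fine-tuning set; the embeddings, labels, and ensemble size are placeholders, and bootstrap resampling is one of several reasonable ways to induce diversity among heads.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def last_layer_ensemble(embeddings, labels, n_members=5, seed=0):
    """Train an ensemble of lightweight heads on frozen foundation-model embeddings.

    Each member sees a bootstrap resample of the fine-tuning data, so members
    differ most where the data leave the decision boundary underdetermined.
    """
    rng = np.random.default_rng(seed)
    members, n = [], len(labels)
    for _ in range(n_members):
        idx = rng.integers(0, n, size=n)            # bootstrap resample
        head = LogisticRegression(max_iter=1000)
        head.fit(embeddings[idx], labels[idx])
        members.append(head)
    return members

def ensemble_predict(members, embeddings):
    """Return per-member probabilities; spread across members signals epistemic uncertainty."""
    return np.stack([m.predict_proba(embeddings)[:, 1] for m in members])

# Hypothetical frozen embeddings (e.g., mean-pooled protein LM features) and labels.
X = np.random.default_rng(1).normal(size=(200, 32))
y = (X[:, 0] + 0.5 * np.random.default_rng(2).normal(size=200) > 0).astype(int)
heads = last_layer_ensemble(X, y)
probs = ensemble_predict(heads, X[:5])
print(probs.mean(axis=0), probs.std(axis=0))        # ensemble mean, disagreement
```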
24.5.2 Monte Carlo Dropout
Monte Carlo (MC) dropout provides uncertainty estimates from a single trained model by treating dropout regularization as approximate Bayesian inference. During standard training with dropout, random subsets of neurons are zeroed at each forward pass. MC dropout keeps dropout active at test time and performs multiple stochastic forward passes, treating the variation across passes as a measure of model uncertainty.
Gal and Ghahramani showed that this procedure approximates variational inference over the model’s weights (Gal and Ghahramani 2016). Each forward pass with dropout samples a different subnetwork, and the distribution of predictions across samples approximates the predictive distribution under a particular prior over weights. High variance across MC samples indicates epistemic uncertainty about the model’s parameters for that input.
Why does dropout approximate Bayesian inference? During training, dropout randomly masks neurons to prevent co-adaptation, forcing the network to learn redundant representations. Treating this masking as sampling from a distribution over subnetworks connects to Bayesian inference: each sampled subnetwork is like a draw from a posterior over model architectures. When many subnetworks agree on a prediction, the input lies in a region well-constrained by training data; when subnetworks disagree, the input lies in a region where different architectural variants have learned different solutions, exactly the signature of epistemic uncertainty.
MC dropout offers the significant advantage of requiring only a single trained model, avoiding the computational overhead of ensembles. Implementation is straightforward: enable dropout during inference and average predictions over 10 to 50 stochastic forward passes. The variance or entropy of these predictions serves as the uncertainty estimate.
Limitations temper the method’s appeal. Modern transformer architectures often do not use dropout in their standard configurations, or use dropout only in specific locations (attention dropout, residual dropout) where the approximation may be less accurate. The quality of uncertainty estimates depends on the dropout rate and architecture, with higher dropout rates providing better uncertainty estimates but potentially degrading mean predictions. Empirical comparisons often find that MC dropout underestimates uncertainty relative to deep ensembles, particularly in low-data regimes where epistemic uncertainty should be high.
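Implementing MC dropout amounts to re-enabling dropout modules at inference time and averaging over repeated forward passes. The PyTorch sketch below assumes a hypothetical fine-tuned classification head with a dropout layer on top of frozen embeddings; the layer sizes, dropout rate, and number of samples are illustrative choices.

```python
import torch

def enable_mc_dropout(model):
    """Keep the model in eval mode but switch its dropout layers back to train mode."""
    model.eval()
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()

@torch.no_grad()
def mc_dropout_predict(model, x, n_samples=30):
    """Repeated stochastic forward passes; spread approximates epistemic uncertainty."""
    enable_mc_dropout(model)
    samples = torch.stack([torch.sigmoid(model(x)) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.std(dim=0)

# Hypothetical classification head with dropout on top of frozen embeddings.
head = torch.nn.Sequential(
    torch.nn.Linear(1280, 256), torch.nn.ReLU(), torch.nn.Dropout(p=0.2),
    torch.nn.Linear(256, 1),
)
x = torch.randn(4, 1280)             # e.g., four pooled protein-LM embeddings
mean, std = mc_dropout_predict(head, x)
print(mean.squeeze(), std.squeeze())
```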
You need to deploy a variant effect predictor with uncertainty quantification, but you have a limited compute budget. Full ensembles of the foundation model are too expensive. What alternatives could you consider, and what trade-offs would each involve? Think about this before reading the heteroscedastic models section.
24.5.3 Heteroscedastic Models
Standard regression models predict a single output value, implicitly assuming constant noise variance across all inputs. Heteroscedastic models instead predict both a mean and a variance for each input, capturing the intuition that prediction uncertainty varies depending on the input. For genomic applications, this approach naturally handles the observation that some prediction tasks are inherently noisier than others: coding variant effects may be more predictable than regulatory variant effects, constrained genes more predictable than tolerant genes.
Architecture modifications are minimal. Instead of outputting a single value, the model outputs two values interpreted as the mean \(\mu(x)\) and variance \(\sigma^2(x)\) of a Gaussian distribution over outputs. Training uses negative log-likelihood loss under this Gaussian, which penalizes both prediction errors and miscalibrated variance estimates:
\[ \mathcal{L}(x, y) = \frac{(y - \mu(x))^2}{2\sigma^2(x)} + \frac{1}{2}\log \sigma^2(x) \tag{24.9}\]
where:
- \(y\) is the ground truth value for input \(x\)
- \(\mu(x)\) is the model’s predicted mean (one network output)
- \(\sigma^2(x) = \exp(s(x))\) is the predicted variance (second network output, exponentiated to ensure positivity)
- The first term penalizes prediction errors, weighted by inverse variance
- The second term prevents trivially predicting infinite variance
Why does this loss function take this particular form? It derives from the negative log-likelihood of a Gaussian distribution: if the model predicts that the output follows a Gaussian with mean \(\mu(x)\) and variance \(\sigma^2(x)\), then the probability of observing the true value \(y\) is higher when \(y\) is close to \(\mu\) and when variance is well-matched to actual prediction error. The first term penalizes prediction errors, weighted by inverse variance so that high-variance predictions are penalized less for the same absolute error; this is mathematically necessary because a Gaussian with larger variance assigns non-negligible probability to a wider range of values. The second term prevents the model from simply predicting infinite variance to avoid all penalties; without it, the optimal strategy would be to always predict \(\sigma^2 \to \infty\), making any prediction equally likely. The balance between these terms forces the model to predict variance that actually matches the empirical noise level. The result is a model that learns to predict larger variance for inputs where training labels are noisy or inconsistent, capturing aleatoric uncertainty in an input-dependent manner.
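The loss in Equation 24.9 is short to implement when the network predicts the log-variance directly. The PyTorch sketch below defines a hypothetical two-headed regression model and the corresponding negative log-likelihood (up to an additive constant); the input dimensions and data are placeholders. PyTorch also provides a built-in Gaussian negative log-likelihood loss (torch.nn.GaussianNLLLoss) that serves the same purpose.

```python
import torch

def heteroscedastic_nll(mu, log_var, y):
    """Gaussian NLL of Eq. 24.9 (up to an additive constant).

    The network outputs the log-variance s(x); exponentiating guarantees positivity.
    """
    var = torch.exp(log_var)
    return (((y - mu) ** 2) / (2 * var) + 0.5 * log_var).mean()

class TwoHeaded(torch.nn.Module):
    """Hypothetical regression model with separate mean and log-variance heads."""
    def __init__(self, d_in=1280, d_hidden=256):
        super().__init__()
        self.backbone = torch.nn.Sequential(torch.nn.Linear(d_in, d_hidden), torch.nn.ReLU())
        self.mu_head = torch.nn.Linear(d_hidden, 1)
        self.log_var_head = torch.nn.Linear(d_hidden, 1)

    def forward(self, x):
        h = self.backbone(x)
        return self.mu_head(h), self.log_var_head(h)

model = TwoHeaded()
x = torch.randn(8, 1280)             # e.g., pooled embeddings for eight variants
y = torch.randn(8, 1)                # e.g., functional assay scores
mu, log_var = model(x)
loss = heteroscedastic_nll(mu, log_var, y)
loss.backward()                      # trains mean and variance heads jointly
print(float(loss))
```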
Heteroscedastic models capture aleatoric uncertainty but not epistemic uncertainty. The predicted variance reflects noise inherent in the labels, not uncertainty about model parameters. Combining heteroscedastic outputs with ensemble methods provides estimates of both uncertainty types: ensemble disagreement captures epistemic uncertainty while the predicted variance captures aleatoric uncertainty.
24.5.4 Evidential Deep Learning
Evidential deep learning places a prior distribution over the class probabilities themselves rather than directly predicting probabilities. For classification, the model outputs parameters of a Dirichlet distribution, which serves as a prior over the simplex of class probabilities. The concentration parameters of this Dirichlet encode both the predicted class probabilities (via their relative magnitudes) and the model’s uncertainty (via their absolute magnitudes).
The Dirichlet distribution is a probability distribution whose samples are themselves probability distributions. For readers unfamiliar with this concept, an intuitive explanation helps.
The simplex constraint: For a \(K\)-class classification problem, any valid probability distribution must have \(K\) values that are non-negative and sum to 1. Geometrically, this defines a simplex: for \(K=3\), valid probability vectors \((p_1, p_2, p_3)\) form a triangle where each corner represents certainty in one class.
What the Dirichlet does: The Dirichlet distribution places a probability density over this simplex. Its parameters \(\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \ldots, \alpha_K)\), called concentration parameters, control where probability mass concentrates:
- High \(\alpha_i\) values (e.g., \(\alpha = (10, 10, 10)\)): Samples cluster near the center of the simplex, representing uncertainty about which class is correct but confidence that probabilities should be roughly equal
- Low \(\alpha_i\) values (e.g., \(\alpha = (0.1, 0.1, 0.1)\)): Samples cluster near the corners, representing uncertainty about which class is correct but an expectation that any single draw is dominated by one class
- Asymmetric values (e.g., \(\alpha = (100, 2, 2)\)): Samples cluster near the corner for class 1, representing confident prediction of class 1
Total evidence: The sum \(\sum_i \alpha_i\) represents total “evidence” or confidence. Higher totals mean tighter clustering (more certainty), regardless of which class is favored. This is how evidential deep learning encodes uncertainty: low total concentration corresponds to high epistemic uncertainty.
Example: A model predicting pathogenicity might output \(\alpha = (15, 3)\) for [pathogenic, benign]. The expected probability is \((15/18, 3/18) = (0.83, 0.17)\), favoring pathogenic. The total concentration of 18 indicates moderate confidence. Compare to \(\alpha = (150, 30)\): same expected probability but much higher confidence, or \(\alpha = (1.5, 0.3)\): same expected probability but very low confidence.
Low total concentration indicates high uncertainty: the model is unsure which class is correct. High total concentration with one dominant class indicates confident prediction. This framework provides a principled way to separate epistemic uncertainty (low concentration) from confident predictions (high concentration), all from a single forward pass without ensembling or MC sampling.
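As a concrete illustration of these quantities, the following sketch (plain NumPy, with an illustrative `dirichlet_summary` helper) reproduces the pathogenicity example above: all three concentration vectors yield the same expected probabilities, but their total evidence, and hence one common uncertainty proxy (number of classes divided by total evidence), differs.

```python
import numpy as np

def dirichlet_summary(alpha):
    """Expected class probabilities, total evidence, and one common
    uncertainty proxy (K / total evidence) from concentration parameters."""
    alpha = np.asarray(alpha, dtype=float)
    total = alpha.sum()                 # total "evidence"
    expected_p = alpha / total          # mean of the Dirichlet
    uncertainty = len(alpha) / total    # larger when evidence is scarce
    return expected_p, total, uncertainty

for a in [(15, 3), (150, 30), (1.5, 0.3)]:
    p, total, u = dirichlet_summary(a)
    print(a, np.round(p, 2), f"evidence={total:.1f}", f"uncertainty={u:.2f}")
# All three share expected probabilities (0.83, 0.17); only the total
# concentration, and hence the uncertainty proxy, differs.
```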
Critics have noted that evidential deep learning can produce unreliable uncertainty estimates when the distributional assumptions are violated or when training data is limited (Bengs, Hüllermeier, and Waegeman 2022). Practical experience suggests that ensembles and MC dropout often provide more robust uncertainty estimates, though evidential methods continue to be refined.
24.5.5 Selecting Uncertainty Quantification Methods
The choice among uncertainty quantification methods depends on computational constraints, the types of uncertainty relevant to the application, and the foundation model architecture.
For applications where distinguishing epistemic from aleatoric uncertainty matters, combining ensemble methods with heteroscedastic predictions provides both. Ensemble disagreement identifies variants where more training data might reduce uncertainty, while high predicted variance identifies variants where uncertainty is inherent to the prediction task.
For foundation model applications where full ensembles are impractical, last-layer ensembles typically offer the best trade-off between computational cost and uncertainty quality. The pretrained representations capture most of the model’s knowledge, and ensembling only the prediction heads captures uncertainty arising from the fine-tuning task.
For real-time applications requiring single forward passes, evidential deep learning or heteroscedastic models provide uncertainty estimates without inference-time overhead. These methods capture aleatoric uncertainty effectively but may underestimate epistemic uncertainty for out-of-distribution inputs.
The following table summarizes uncertainty quantification methods and their trade-offs:
| Method | Epistemic | Aleatoric | Training Cost | Inference Cost | Foundation Model Compatibility |
|---|---|---|---|---|---|
| Deep ensembles | Excellent | Via heteroscedastic variant | M× base | M× base | Expensive for large models |
| Last-layer ensembles | Good | Via heteroscedastic heads | 1× backbone + M× heads | M× heads (cheap) | Practical for any foundation model |
| MC dropout | Moderate | Via heteroscedastic variant | 1× | N× forward passes | Requires dropout in architecture |
| Heteroscedastic | None | Excellent | 1× | 1× | Easy to add to any model |
| Evidential | Moderate | Moderate | 1× | 1× | Requires architectural changes |
24.6 Conformal Prediction: Distribution-Free Guarantees
Most uncertainty quantification methods make assumptions about model behavior or data distributions that may not hold in practice. Temperature scaling assumes miscalibration follows a particular functional form. Ensembles assume that disagreement reflects epistemic uncertainty rather than artifacts of training. Bayesian methods assume specific priors over model parameters. When these assumptions fail, uncertainty estimates may be unreliable precisely when reliability matters most.
Consider how a cautious doctor might communicate diagnostic uncertainty. Rather than saying “I am 73% confident this is condition A,” they might say “Based on your symptoms and test results, I can confidently rule out conditions C, D, and E, but I cannot yet distinguish between A and B; we need more information.” This approach sidesteps the difficulty of assigning precise probabilities by instead specifying which possibilities remain plausible. The size of this “plausible set” communicates uncertainty: a single remaining possibility indicates high confidence, while many remaining possibilities indicate low confidence.
Conformal prediction formalizes this intuition and offers something stronger: finite-sample coverage guarantees that hold under minimal assumptions. Instead of outputting a point prediction, conformal methods produce a prediction set guaranteed to contain the true label with probability at least \(1 - \alpha\), where \(\alpha\) is a user-specified error rate. If we request \(90\%\) coverage (\(\alpha = 0.10\)), the prediction set will contain the true label at least \(90\%\) of the time, regardless of the model’s accuracy or calibration. This guarantee requires only that calibration and test examples are exchangeable (a condition weaker than independent and identically distributed), making conformal prediction robust to model misspecification.
Conformal prediction sidesteps the need for well-calibrated probabilities. Instead of asking “what is the probability this variant is pathogenic?”, it answers “which classifications can we confidently rule out?” A prediction set of {Pathogenic} means high confidence. A set of {Pathogenic, VUS, Benign} means low confidence. The size of the set is the uncertainty estimate, and the coverage guarantee holds regardless of model calibration.
24.6.1 Split Conformal Prediction
This section introduces the mathematical framework for conformal prediction, including quantile computations and non-conformity scores. The key intuition is simpler than the formalism: use a held-out calibration set to learn what scores are “normal,” then flag test predictions as uncertain if they look unusual relative to calibration. Focus on the algorithm steps if the probability theory is unfamiliar.
The most practical conformal method, split conformal prediction, begins by partitioning labeled data into training and calibration subsets. After training the model exclusively on the training portion, non-conformity scores are computed for each calibration example, where higher scores indicate poorer agreement between prediction and true label. The threshold \(q\) is then set at the \((1-\alpha)(1+1/n)\) quantile of these calibration scores, where \(n\) is the number of calibration examples. At test time, the prediction set includes every label whose non-conformity score does not exceed this threshold.
Non-conformity scores measure how “strange” a candidate label is given the model’s output. For classification, a common choice is \(1 - \hat{p}_y\), where \(\hat{p}_y\) is the predicted probability of the true class. High predicted probability means low non-conformity (the label conforms to the model’s expectations); low predicted probability means high non-conformity. For regression, absolute residuals \(|y - \hat{y}|\) serve as non-conformity scores.
The construction ensures coverage because calibration scores are exchangeable with test scores under the exchangeability assumption. The quantile threshold is set so that a random calibration score exceeds the threshold with probability at most \(\alpha\); by exchangeability, the same holds for test scores. This elegant argument yields exact coverage guarantees without requiring the model to be accurate or well-calibrated.
The coverage guarantee is finite-sample: it holds exactly for any sample size, not just asymptotically. For clinical genomics applications where individual predictions carry significant consequences, this finite-sample property provides assurance that cannot be obtained from asymptotic calibration arguments.
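A minimal NumPy sketch of split conformal classification, assuming the non-conformity score \(1 - \hat{p}_y\) described above; the function names, the three-class setup, and the randomly generated probabilities are illustrative only.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.10):
    """Split conformal threshold from a held-out calibration set.

    cal_probs: (n, K) predicted class probabilities on calibration data
    cal_labels: (n,) true class indices
    """
    n = len(cal_labels)
    # Non-conformity score: 1 - predicted probability of the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # (1 - alpha)(1 + 1/n) empirical quantile of the calibration scores.
    q_level = min(1.0, (1.0 - alpha) * (1.0 + 1.0 / n))
    return np.quantile(scores, q_level, method="higher")

def prediction_sets(test_probs, q):
    """All labels whose non-conformity score does not exceed the threshold."""
    return [np.where(1.0 - p <= q)[0] for p in test_probs]

# Illustrative usage for a three-class problem (e.g., pathogenic / VUS / benign)
# with randomly generated probabilities standing in for model outputs.
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(3), size=500)
cal_labels = rng.integers(0, 3, size=500)
test_probs = rng.dirichlet(np.ones(3), size=5)
q = conformal_threshold(cal_probs, cal_labels, alpha=0.10)
sets = prediction_sets(test_probs, q)     # set size conveys uncertainty
```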
24.6.2 Conformal Prediction for Variant Classification
Variant effect prediction, examined in detail in Chapter 18, is a natural setting for conformal methods. Instead of reporting a single pathogenicity score, a conformalized variant classifier outputs a prediction set from the possibilities: {pathogenic}, {benign}, {pathogenic, benign}, or the empty set. The set is guaranteed to contain the true label at the specified coverage rate.
Set size conveys uncertainty without requiring probability interpretation. A singleton prediction set indicates high confidence: the model has enough information to narrow to a single class. A set containing multiple classes indicates uncertainty: the model cannot confidently distinguish between possibilities. The empty set indicates extreme uncertainty where even the most permissive threshold cannot be satisfied.
The trade-off between coverage and informativeness shapes practical deployment. At 99% coverage, prediction sets will frequently include multiple classes, providing reliable but uninformative predictions. At 80% coverage, prediction sets will more often be singletons, providing informative but less reliable predictions. Stakeholders must choose coverage levels that match their tolerance for error versus the cost of uninformative predictions.
24.6.3 Limitations and Practical Considerations
Conformal prediction provides marginal coverage guarantees: averaged over all inputs, 90% of prediction sets will contain the true label. This does not guarantee conditional coverage for any particular subgroup. A model might achieve 90% coverage overall while providing only 70% coverage for rare variant classes or underrepresented populations. Subgroup-stratified coverage assessment reveals these disparities, though achieving conditional coverage guarantees requires stronger assumptions or larger calibration datasets.
The exchangeability assumption can fail in practice. If the calibration set derives from one population and the test set from another, coverage guarantees may not hold. Temporal shifts (calibration on historical data, testing on future data) similarly violate exchangeability. Methods for conformal prediction under distribution shift exist but require additional assumptions about the nature of the shift.
Prediction set size trades off against informativeness. Larger sets provide more reliable coverage but less useful predictions. A model that produces {pathogenic, benign} for every variant achieves perfect coverage but provides no discrimination. Careful model development to improve underlying accuracy reduces average set size while maintaining coverage guarantees.
24.6.4 Integration with Clinical Workflows
Conformal prediction sets integrate naturally with existing variant classification frameworks. The ACMG-AMP guidelines already accommodate uncertainty through categories like “variant of uncertain significance.” Conformal sets provide a principled basis for this categorization: variants receiving singleton sets ({pathogenic} or {benign}) have strong computational evidence, while variants receiving larger sets have uncertain computational evidence. The ACMG-AMP framework and its integration with computational evidence are discussed in Chapter 29.
The coverage guarantee provides a quantitative basis for laboratory policies. A laboratory might decide that computational evidence should achieve 95% coverage before contributing to variant classification, using conformal methods to verify this threshold is met. The guarantee holds regardless of which specific variants are encountered, providing assurance that the policy will perform as intended across the laboratory’s case mix.
Conformal methods also enable selective prediction, where the model abstains rather than producing uncertain predictions. By setting coverage requirements appropriately, laboratories can identify variants where computational methods provide reliable evidence and variants where human review is essential. This selective approach focuses expert attention where it is most needed while allowing automated processing of straightforward cases.
24.7 Out-of-Distribution Detection
24.7.1 Out-of-Distribution Problem
A DNA language model trained on mammalian genomes encounters a novel archaeal sequence. The model’s embedding places this sequence in an unfamiliar region of representation space, far from the clusters formed by training examples. Yet the model still produces a prediction, potentially with high confidence, because standard neural networks are not designed to recognize when inputs lie outside their training distribution. Detecting out-of-distribution (OOD) inputs is essential for safe deployment of foundation models in settings where novel sequences are inevitable.
OOD detection identifies inputs that differ meaningfully from training data, allowing systems to flag uncertain predictions before they cause harm. Novel pathogens may share little sequence similarity with characterized viruses in training data. Synthetic proteins designed for therapeutic purposes may occupy regions of sequence space unsampled by evolution. Variants in poorly characterized genes may lack the contextual information that models rely on for accurate prediction. In each case, recognizing that the input is unusual enables appropriate caution.
The confidence problem compounds OOD challenges. Neural networks often produce high-confidence predictions on OOD inputs because nothing in standard training penalizes confidence on unfamiliar examples. A classifier trained to distinguish pathogenic from benign variants may confidently predict “pathogenic” for a completely random sequence, not because it has evidence for pathogenicity but because it lacks the capacity to say “I do not know.” This failure mode makes OOD detection essential rather than optional.
A protein language model is deployed to score variants in a newly discovered gene with no homologs in UniRef. The model returns high-confidence pathogenicity predictions. What concerns should you have? What additional information would help you assess whether to trust these predictions?
This gene is out-of-distribution relative to the model’s training data; high epistemic uncertainty is expected. The confident predictions are a red flag: the model may be extrapolating unreliably beyond its training experience. You should check embedding distance to training examples, ensemble disagreement if available, and whether the model’s confidence is calibrated for truly novel protein families. Experimental validation through functional assays would be essential before acting on these predictions.
24.7.2 Likelihood-Based Detection and Its Failures
The intuitive approach to OOD detection uses model likelihood: inputs the model finds improbable should be flagged as OOD. Language models assign likelihoods to sequences; surely OOD sequences should receive low likelihood?
This intuition fails for deep generative models. Complex models can assign high likelihood to OOD data for reasons unrelated to semantic similarity to training examples. In high-dimensional spaces, typical sets (regions where most probability mass concentrates) do not coincide with high-density regions. A sequence might land in a high-density region of the model’s distribution while being semantically distant from any training example.
Empirically, language models assign high likelihood to repetitive sequences, sequences with unusual but consistent patterns, and sequences from different domains that happen to share statistical properties with training data. For genomic models, this means likelihood alone cannot reliably distinguish novel biological sequences from sequences within the training distribution.
24.7.3 Embedding-Based Detection
Learned representations provide more reliable OOD detection than raw likelihood. The key insight is that embeddings encode semantic structure: similar sequences cluster together in embedding space, and OOD sequences land in sparse regions distant from training clusters.
Mahalanobis distance measures how far a test embedding lies from training data, accounting for the covariance structure of the embedding space. For each class, compute the mean embedding and covariance matrix from training examples. For a test input, compute its distance to each class centroid in units of standard deviations, accounting for correlations between embedding dimensions. Large Mahalanobis distance indicates OOD inputs.
Why Mahalanobis rather than simple Euclidean distance? Euclidean distance treats all embedding dimensions equally, but learned representations often have highly correlated dimensions and vary substantially in scale. A test point that appears close in Euclidean terms might actually be unusual because it deviates along a low-variance direction where training examples are tightly clustered. Mahalanobis distance normalizes by the covariance structure, detecting such deviations by asking “how many standard deviations away is this point along each principal axis of variation?”
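The following sketch implements Mahalanobis-distance scoring in NumPy. For brevity it fits a single pooled mean and covariance over all training embeddings rather than the per-class statistics described above, and the embeddings are random stand-ins for real model outputs.

```python
import numpy as np

def fit_mahalanobis(train_emb, eps=1e-6):
    """Mean and inverse (regularized) covariance of training embeddings."""
    mu = train_emb.mean(axis=0)
    cov = np.cov(train_emb, rowvar=False) + eps * np.eye(train_emb.shape[1])
    return mu, np.linalg.inv(cov)

def mahalanobis_distance(x, mu, cov_inv):
    """Distance of each test embedding in x (m, d) from the training distribution."""
    diff = x - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

# Random stand-ins for foundation-model embeddings.
rng = np.random.default_rng(0)
train_emb = rng.normal(size=(5000, 64))
test_emb = rng.normal(size=(10, 64))
mu, cov_inv = fit_mahalanobis(train_emb)
dists = mahalanobis_distance(test_emb, mu, cov_inv)   # large distance => likely OOD
```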
Nearest-neighbor methods provide a non-parametric alternative. For a test embedding, find the \(k\) nearest neighbors among training embeddings and compute the average distance. Large average distance to neighbors indicates the test input lies in a sparse region of embedding space, suggesting it is OOD. This approach makes no distributional assumptions and scales well with modern approximate nearest-neighbor algorithms.
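A corresponding k-nearest-neighbor sketch, using brute-force Euclidean distances for clarity; in practice an approximate nearest-neighbor index would replace the pairwise computation for large training corpora.

```python
import numpy as np

def knn_ood_score(test_emb, train_emb, k=10):
    """Average Euclidean distance to the k nearest training embeddings;
    larger values indicate sparser regions of embedding space."""
    # Brute-force pairwise distances for clarity only.
    d2 = ((test_emb[:, None, :] - train_emb[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.sort(np.sqrt(d2), axis=1)[:, :k]
    return nearest.mean(axis=1)
```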
For genomic foundation models, embedding-based OOD detection enables practical deployment safeguards. ESM embeddings place novel protein folds in regions distant from characterized folds, allowing detection of sequences outside the model’s training experience. DNABERT embeddings reveal unusual sequence composition or repeat structures that may confound predictions. Flagging these cases for expert review prevents confident but unreliable predictions from reaching clinical decisions. The properties of DNA and protein language model embeddings are discussed in Chapter 15 and Chapter 16.
24.7.4 Practical OOD Detection for Genomic Applications
Defining what counts as OOD requires domain knowledge. Novel species or clades may share evolutionary history with training examples yet differ enough to warrant caution. Extreme GC content can indicate contamination, unusual biology, or simply under-represented genomic regions. Engineered sequences (designed proteins, synthetic regulatory elements) intentionally explore regions of sequence space not represented in natural sequences.
Combining multiple OOD signals improves reliability. Embedding distance, likelihood, and prediction confidence each capture different aspects of distributional difference. An input flagged by multiple methods is more reliably OOD than one flagged by a single method. Threshold selection involves trade-offs between false positives (flagging in-distribution examples unnecessarily) and false negatives (missing true OOD examples).
The operational response to OOD detection depends on the application. For variant interpretation, OOD inputs might trigger automatic flagging for expert review rather than automated classification. For high-throughput screening, OOD inputs might receive tentative predictions with explicit uncertainty warnings. For safety-critical applications, OOD inputs might trigger rejection with a request for additional information.
For deploying foundation models with OOD safeguards:
- Store training embeddings or a compressed representation (centroids, covariance) during model development
- Compute embedding distance for each test input using Mahalanobis or k-NN distance
- Set threshold based on desired false positive rate on a held-out validation set (see the sketch after this list)
- Flag OOD inputs for human review rather than automated processing
- Log and monitor OOD rates over time; increasing rates may signal distribution shift
- Combine signals (embedding distance, ensemble disagreement, confidence) for robust detection
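The threshold-selection and flagging steps in the checklist might look like the following sketch, applicable to any scalar OOD score (Mahalanobis or k-NN distance from the sketches above); the target false positive rate is an illustrative choice.

```python
import numpy as np

def ood_threshold(val_scores, target_fpr=0.05):
    """Threshold such that roughly target_fpr of in-distribution validation
    inputs are flagged (their scores exceed the threshold)."""
    return np.quantile(val_scores, 1.0 - target_fpr)

def flag_for_review(test_scores, threshold):
    """Boolean mask of test inputs routed to human review instead of
    automated processing."""
    return np.asarray(test_scores) > threshold
```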
24.8 Selective Prediction and Abstention
24.8.1 When to Abstain
A variant effect predictor achieving 95% accuracy overall provides more clinical value if it can identify which predictions are reliable. Selective prediction allows models to abstain on difficult cases, concentrating predictions on inputs where confidence is warranted. The trade-off between coverage (fraction of inputs receiving predictions) and accuracy (correctness among predictions made) defines the selective prediction problem.
The coverage-accuracy trade-off reflects a fundamental tension. At 100% coverage, the model predicts on all inputs and achieves its baseline accuracy. As coverage decreases (more abstention), accuracy among predictions made typically increases because the model abstains on its most uncertain cases. The shape of this trade-off curve characterizes the model’s ability to identify reliable predictions.
Abstention is appropriate when the cost of errors exceeds the cost of deferral. In clinical variant interpretation, a confident but incorrect pathogenic prediction may trigger unnecessary medical intervention, while abstention merely defers the decision to expert review. If expert review is available and affordable relative to error costs, abstaining on uncertain cases improves overall decision quality. Conversely, in high-throughput screening where expert review is infeasible, abstention may provide little benefit because all predictions eventually require automated handling.
24.8.2 Selective Prediction Methods
Confidence-based selection abstains when the model’s maximum predicted probability falls below a threshold. For a classifier producing probabilities over classes, if \(\max_c \hat{p}_c < \tau\), the model abstains. This simple approach works well when model confidence correlates with correctness, but fails when models are confidently wrong.
Ensemble-based selection abstains when ensemble members disagree beyond a threshold. High disagreement indicates epistemic uncertainty about the correct prediction, warranting abstention even if individual members express confidence. This approach captures uncertainty that confidence-based selection misses when models are overconfident.
Conformal selection abstains when prediction sets exceed a size threshold. If the conformal prediction set contains more than one class, the model lacks confidence to make a unique prediction. This approach connects selective prediction to the coverage guarantees of conformal methods: the model makes predictions with guaranteed coverage on the non-abstained cases.
Learned selection trains a separate model to predict whether the primary model will be correct on each input. This “rejection model” learns to identify failure modes that simple confidence thresholds miss, potentially achieving better coverage-accuracy trade-offs than heuristic methods.
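The confidence-based and ensemble-based rules reduce to a few lines; this sketch assumes a binary classifier and illustrative thresholds (\(\tau\) and `max_std` are placeholder values, not recommendations).

```python
import numpy as np

def abstain_by_confidence(probs, tau=0.90):
    """Abstain when the maximum predicted class probability falls below tau."""
    return probs.max(axis=1) < tau              # probs: (N, K) class probabilities

def abstain_by_disagreement(member_probs, max_std=0.15):
    """Abstain when ensemble members disagree: the standard deviation of the
    positive-class probability across members exceeds max_std."""
    return member_probs.std(axis=0) > max_std   # member_probs: (M members, N inputs)
```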
24.8.3 Evaluating Selective Prediction
Risk-coverage curves plot accuracy (or its complement, risk) as a function of coverage, revealing how performance improves as the model becomes more selective. The area under the risk-coverage curve summarizes overall selective prediction quality. Models with better uncertainty estimates produce steeper curves, achieving high accuracy at lower coverage.
Selective accuracy at fixed coverage specifies a coverage level (e.g., 80%) and reports accuracy among predictions made at that coverage. This metric directly answers practical questions: “If we let the model predict on its 80% most confident cases, how accurate will it be?”
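A small NumPy sketch of both evaluations, the full risk-coverage curve and accuracy at a fixed coverage level; array names and the 80% target are illustrative.

```python
import numpy as np

def risk_coverage_curve(confidence, correct):
    """Risk (error rate) among retained predictions as the selection
    threshold is relaxed from most to least confident."""
    order = np.argsort(-np.asarray(confidence))       # most confident first
    correct = np.asarray(correct, dtype=float)[order]
    kept = np.arange(1, len(correct) + 1)
    coverage = kept / len(correct)
    risk = 1.0 - np.cumsum(correct) / kept
    return coverage, risk

def accuracy_at_coverage(confidence, correct, target=0.80):
    """Accuracy among the model's most confident predictions at a
    fixed coverage level (e.g., the 80% most confident cases)."""
    coverage, risk = risk_coverage_curve(confidence, correct)
    idx = np.searchsorted(coverage, target)
    return 1.0 - risk[min(idx, len(risk) - 1)]
```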
Comparison across methods requires matched coverage levels. A method that achieves 99% accuracy at 50% coverage and 95% accuracy at 90% coverage may be preferable to a method achieving 97% accuracy at both levels, depending on operational requirements. Reporting full risk-coverage curves enables stakeholders to select operating points appropriate to their cost structures.
24.9 Uncertainty for Specific Genomic Tasks
The general principles of uncertainty quantification apply differently across genomic prediction tasks. Variant effect prediction, regulatory variant interpretation, and cross-population generalization each present distinct challenges for calibration, coverage, and out-of-distribution detection. The sources of uncertainty vary: coding variants benefit from stronger evolutionary constraint and clearer functional readouts, while regulatory variants operate through context-dependent mechanisms that introduce irreducible noise. Population-specific uncertainty reflects training data composition and has direct implications for equitable clinical deployment.
24.9.1 Variant Effect Prediction Uncertainty
Variant effect prediction concentrates the challenges of uncertainty quantification. Epistemic uncertainty arises from poorly characterized genes, novel protein folds, and under-represented populations in training data. Aleatoric uncertainty stems from incomplete penetrance, variable expressivity, and noise in functional assay labels. Both types of uncertainty must be estimated and communicated for variant predictions to inform clinical decisions appropriately. The technical details of variant effect prediction models are covered in Chapter 18.
Calibration challenges for VEP include the evolving nature of ground truth labels. ClinVar annotations change as new evidence emerges; variants classified as VUS may be reclassified as pathogenic or benign, and even confident classifications occasionally reverse. A model calibrated against a historical version of ClinVar may appear miscalibrated against current annotations, not because the model changed but because the labels did. Periodic recalibration against current databases maintains alignment between model outputs and contemporary clinical understanding.
Population-specific calibration addresses the reality that training data predominantly derive from European-ancestry cohorts. For patients from other ancestral backgrounds, both epistemic uncertainty (fewer training examples) and calibration (different baseline pathogenicity rates, different patterns of variation) may differ from the aggregate. Stratified reliability diagrams by ancestry reveal these differences; ancestry-conditional calibration may be necessary for equitable performance across populations. The governance and policy dimensions of ensuring equitable uncertainty communication are addressed in Chapter 27.
24.9.2 Regulatory Variant Uncertainty
Regulatory variants present distinct uncertainty challenges. Unlike coding variants where effects can be localized to specific amino acid changes, regulatory variants act through complex mechanisms involving transcription factor binding, chromatin accessibility, and three-dimensional genome organization. This mechanistic complexity translates to higher aleatoric uncertainty: even perfectly characterized regulatory variants may have context-dependent effects that vary across cell types, developmental stages, and genetic backgrounds. A variant that disrupts a transcription factor binding site may have dramatic effects in tissues where that factor is active and negligible effects elsewhere, yet the model must predict across all contexts simultaneously. The architecture and capabilities of regulatory prediction models are discussed in Chapter 17.
The context-dependence of regulatory effects creates a calibration challenge distinct from coding variants. A model may be well-calibrated for predicting expression changes in cell types abundant in training data (lymphoblastoid cell lines, common cancer lines) while poorly calibrated for clinically relevant primary tissues rarely profiled at scale. Stratified calibration assessment across tissue types reveals these disparities, but the sparsity of ground truth labels for many tissues limits the precision of tissue-specific calibration estimates.
Expression prediction models like Enformer and Borzoi provide uncertainty estimates for predicted expression changes through several approaches. Ensemble methods quantify disagreement across model variants trained with different random seeds. Heteroscedastic architectures predict tissue-specific confidence alongside tissue-specific expression, learning that predictions for well-characterized tissues deserve higher confidence than those for rarely profiled contexts. These uncertainties propagate to downstream interpretations: a variant predicted to alter expression with high uncertainty warrants different treatment than one with narrow confidence bounds, and the tissue-specificity of uncertainty may itself be informative about which experimental follow-up would most reduce ambiguity.
24.9.3 Uncertainty Across Populations
Differential uncertainty across populations has direct implications for health equity. Models trained predominantly on European-ancestry data exhibit higher epistemic uncertainty for other populations, manifesting in several observable ways: larger prediction sets from conformal methods, higher abstention rates from selective prediction, greater ensemble disagreement, and less reliable confidence estimates from calibration. These differences arise from multiple sources. Linkage disequilibrium patterns differ across populations, meaning that variant correlations learned from European data may not transfer. Population-specific variants absent from training data generate pure epistemic uncertainty. Even shared variants may have different effect sizes across populations due to gene-environment interactions or epistatic backgrounds that vary by ancestry.
Quantifying population-specific uncertainty requires appropriate calibration and evaluation datasets. A model calibrated exclusively on European-ancestry ClinVar submissions may appear well-calibrated on aggregate metrics while being systematically miscalibrated for other populations. The scarcity of diverse calibration data creates a challenging circularity: we cannot assess population-specific calibration without diverse labeled datasets, yet diverse labeled datasets are precisely what current genomic databases lack. Initiatives like the All of Us Research Program and population-specific biobanks (Uganda Genome Resource, Taiwan Biobank, BioBank Japan) are beginning to address this gap, enabling population-stratified uncertainty assessment that was previously impossible. The broader context of biobank resources and their composition is discussed in Section 2.3.
Transparent reporting of population-stratified uncertainty metrics enables informed decisions about model deployment. If a model abstains on 30% of variants in one population but only 10% in another, users can make informed choices about supplementary analyses for the higher-abstention population. Clinical laboratories might establish ancestry-specific thresholds for automated reporting versus expert review. Research applications might weight predictions by ancestry-specific confidence when aggregating across diverse cohorts. Ignoring these differences risks providing lower-quality predictions to already under-served populations while presenting a false appearance of uniform reliability, compounding existing disparities in genomic medicine.
24.10 Communicating Uncertainty to End Users
Statistical uncertainty estimates serve clinical decisions only when they reach end users in interpretable form. The translation from model outputs to actionable information involves choices about categorical versus continuous reporting, numerical versus visual presentation, and whether to frame results as probabilities or as expected outcomes under alternative decisions. Different stakeholders require different presentations: clinicians need actionable categories, researchers need distributional information, and patients need accessible risk communication.
24.10.1 Communication Challenge
A pathogenicity score of 0.73 ± 0.15 may be statistically accurate but nearly useless to a clinician deciding whether to order confirmatory testing. The gap between statistical uncertainty and decision-relevant communication presents a persistent challenge for genomic AI deployment. Different users reason differently about probability and risk; effective communication requires understanding these differences.
Cognitive biases complicate probability interpretation. Humans tend toward overconfidence in point estimates, treating 0.73 as more certain than warranted. Prediction intervals are frequently misunderstood: a 90% confidence interval does not mean the true value has a 90% chance of being in that specific interval (a Bayesian interpretation) but rather that 90% of such intervals would contain the true value (a frequentist interpretation). Base rate neglect leads users to interpret variant-level pathogenicity predictions without accounting for prior probability based on clinical presentation, family history, and phenotypic specificity.
Different stakeholders have different needs. Clinicians require actionable categories that map to clinical decision points, not continuous scores requiring interpretation. Researchers may prefer full probability distributions enabling flexible downstream analysis. Patients and families need understandable risk communication that supports informed decision-making without inducing inappropriate anxiety or false reassurance.
24.10.2 Categorical Reporting
Clinical genetics has established categorical frameworks for variant interpretation. The ACMG-AMP guidelines define five categories: pathogenic, likely pathogenic, variant of uncertain significance, likely benign, and benign. The complete ACMG-AMP framework, including how computational evidence integrates with other evidence types, is examined in Section 29.2. Mapping continuous model outputs to these categories requires threshold selection that balances sensitivity and specificity at clinically meaningful operating points, with guidance on calibrating model outputs to ACMG evidence strength provided in Section 18.5.3.
Uncertainty within categories can be conveyed through confidence qualifiers or numerical confidence scores attached to categorical calls. A “likely pathogenic” call with 95% confidence differs meaningfully from one with 60% confidence, even though both receive the same categorical label. Two-dimensional reporting combining category and confidence enables more nuanced interpretation without abandoning the categorical framework that clinicians expect.
Threshold selection involves value judgments beyond pure statistics. The consequences of false positive and false negative pathogenic calls differ by clinical context. For a severe, treatable condition, false negatives carry higher cost, warranting lower thresholds for pathogenic classification. For untreatable conditions where pathogenic classification affects reproductive decisions, the calculus differs. Uncertainty quantification enables informed threshold selection by revealing the trade-offs at different operating points.
24.10.3 Visual Communication
Probability bars and confidence intervals provide visual representation of uncertainty, though their interpretation depends on user familiarity with statistical graphics. Icon arrays, which represent probabilities as proportions of colored icons in a grid (e.g., 73 red icons and 27 blue icons out of 100), improve comprehension for users without statistical training. The visual representation of proportion is more intuitive than numerical probability for many audiences.
Risk ladders place the prediction in context by showing where it falls relative to other risks of varying magnitude. A variant with 0.73 probability of pathogenicity can be placed alongside risks from other genetic conditions, environmental exposures, or common medical procedures, enabling intuitive comparison.
Interactive visualizations allow users to explore uncertainty in detail, examining how predictions change under different assumptions or how uncertainty varies across related variants. These approaches suit sophisticated users engaged in research or detailed clinical analysis but may overwhelm users seeking simple answers.
24.10.4 Decision-Theoretic Framing
Rather than communicating probability alone, decision-theoretic framing presents expected outcomes under different actions. Instead of “this variant has 73% probability of being pathogenic,” the report might state “if we assume this variant is pathogenic and proceed with surveillance, the expected outcomes are X; if we assume it is benign and decline surveillance, the expected outcomes are Y.”
This framing integrates uncertainty with action, helping users understand how uncertainty affects what they should do rather than treating probability as an end in itself. The approach requires modeling clinical outcomes, which introduces additional assumptions, but makes explicit the decision-relevant implications of uncertainty rather than leaving users to integrate probability with consequences on their own.
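A toy sketch of this framing for the 0.73 pathogenicity example; the utility values are entirely hypothetical placeholders that a real deployment would replace with clinically grounded outcome models.

```python
import numpy as np

# Hypothetical utilities on an arbitrary scale: columns are true states
# [pathogenic, benign]; real reports would use clinically derived values.
utilities = {
    "surveillance": np.array([+10.0, -2.0]),    # early detection vs. burden if benign
    "no_surveillance": np.array([-50.0, 0.0]),  # missed pathogenic variant is costly
}

def expected_utilities(p_pathogenic, utilities):
    """Expected utility of each action under the predicted pathogenicity probability."""
    state_probs = np.array([p_pathogenic, 1.0 - p_pathogenic])
    return {action: float(u @ state_probs) for action, u in utilities.items()}

print(expected_utilities(0.73, utilities))
# approximately {'surveillance': 6.8, 'no_surveillance': -36.5}
```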
24.11 Necessary but Insufficient
Uncertainty quantification transforms foundation model outputs from opaque scores into components of rational decision processes. A well-calibrated pathogenicity prediction that honestly communicates its limitations enables appropriate clinical reasoning: high confidence warrants action, low confidence warrants additional testing or expert review. An overconfident score that claims false precision causes harm through both false positives (unnecessary interventions) and false negatives (missed diagnoses). Temperature scaling, conformal prediction, and out-of-distribution detection together provide the technical foundation for trustworthy genomic AI.
The path from uncertainty quantification to clinical impact requires integrating these methods into operational workflows. Selective prediction enables triage between automated handling and expert review based on model confidence. Conformal prediction sets provide coverage guarantees that support risk-aware decision-making. Out-of-distribution detection prevents confident predictions on inputs that fall outside the training distribution, a particularly important capability given the confounding issues examined in Chapter 13. Calibration ensures that numerical probabilities mean what they claim to mean. Together, these tools enable foundation models to participate in clinical decisions without overstating their reliability. Clinical risk prediction frameworks (Chapter 28) develop these tools further for deployment contexts, while rare disease workflows (Chapter 29) apply them to diagnostic interpretation.
Yet uncertainty quantification alone is insufficient. A perfectly calibrated black box remains a black box. The clinician who receives an uncertain prediction wants to understand why the model is uncertain: Is it because the variant falls in a poorly characterized gene? Because the model has never encountered this protein fold? Because the underlying biology is genuinely ambiguous? Interpretability, examined in Chapter 25, complements uncertainty by revealing the basis for predictions and their associated confidence. Attribution methods (Section 25.1) identify which input features drive predictions; probing classifiers (Section 25.4) reveal what information representations encode. The conjunction of calibrated uncertainty and mechanistic understanding approaches what trustworthy clinical AI requires. Neither alone suffices; together they provide the foundation for models that clinicians can reason with rather than merely defer to.
Before reviewing the summary, test your recall:
- What is the difference between epistemic and aleatoric uncertainty, and why does this distinction matter for clinical decision-making?
- What is model calibration, and why can a highly accurate model still be dangerously miscalibrated?
- Explain how temperature scaling improves calibration. What does it preserve and what does it change?
- How do deep ensembles quantify epistemic uncertainty, and what makes them the “gold standard” among uncertainty quantification methods?
- What coverage guarantee does conformal prediction provide, and why does marginal coverage differ from conditional coverage?
Key Concepts:
- Epistemic vs. aleatoric uncertainty: Epistemic uncertainty arises from limited training data and can be reduced with more examples; aleatoric uncertainty reflects inherent biological noise and is irreducible
- Calibration: A model is calibrated when its predicted probabilities match observed frequencies; most foundation models are miscalibrated and require post-hoc correction
- Post-hoc calibration methods: Temperature scaling (simple, preserves ranking), Platt scaling (handles bias), and isotonic regression (flexible, requires more data)
- Uncertainty quantification methods: Deep ensembles (gold standard), MC dropout (single model), heteroscedastic models (aleatoric), last-layer ensembles (practical for foundation models)
- Conformal prediction: Distribution-free coverage guarantees; prediction set size conveys uncertainty; marginal not conditional coverage
- OOD detection: Embedding-based methods more reliable than likelihood; flag unusual inputs for expert review
- Selective prediction: Abstain on uncertain cases to improve accuracy on retained predictions
Clinical Implications:
- Calibration must be assessed within subgroups (ancestry, gene family) to avoid hidden disparities
- Population-specific uncertainty has direct health equity implications
- Uncertainty communication should match stakeholder needs (categorical for clinicians, distributional for researchers)
- Uncertainty quantification is necessary but insufficient; interpretability complements it
Connections to Other Chapters:
- Builds on evaluation principles (Chapter 12) and pretraining objectives (Chapter 8)
- Precedes interpretability (Chapter 25) which explains why models are uncertain
- Applies to clinical risk prediction (Chapter 28) and rare disease diagnosis (Chapter 29)
- Intersects with fairness concerns (Section 3.7.2, Chapter 27)