18  Variant Effect Prediction

The variants that matter most are the variants we have never seen before.

Chapter Overview

Estimated reading time: 50-60 minutes

Prerequisites: Before reading this chapter, you should be familiar with:

  • Classical variant effect prediction approaches (Chapter 4)
  • Protein language model architectures and training objectives (Chapter 16)
  • DNA language models and their representations (Chapter 15)
  • Regulatory genomics models like Enformer (Chapter 17)
  • Basic understanding of the ACMG variant classification framework

Learning Objectives: After completing this chapter, you will be able to:

  1. Explain how foundation models enable zero-shot variant effect prediction without pathogenicity labels
  2. Compare protein-based approaches (ESM-1v, EVE, AlphaMissense) with DNA-based approaches (SpliceAI, Enformer, GPN-MSA)
  3. Compute variant effect scores from language model likelihoods using masked marginal and pseudo-likelihood methods
  4. Assess model calibration using reliability diagrams and apply appropriate thresholds for ACMG classification
  5. Design multi-model VEP workflows that combine evidence across variant types while avoiding double-counting
  6. Identify when foundation model predictions warrant high confidence versus when additional evidence is essential

Chapter roadmap: We begin with the paradigm shift from feature engineering to representation learning, then explore protein-based and DNA-based approaches in detail. The second half addresses practical deployment: combining evidence, calibration, uncertainty quantification, and the persistent limitations that require clinical judgment.

Classical variant effect prediction required labels: pathogenic variants to define one class, benign variants to define another, and enough examples of each to train a classifier. This requirement created a fundamental bottleneck. The variants most important to classify (those that are rare, never before observed, and located in poorly characterized genes) were precisely those for which labels did not exist. Foundation models offer a different paradigm: score variants using patterns learned from unlabeled sequences, without ever seeing a pathogenic/benign label during pretraining. A protein language model trained only to predict masked amino acids can distinguish damaging substitutions from benign polymorphisms because evolution has already encoded this distinction in the sequences that survived. A DNA language model can identify regulatory disruptions because it learned the grammar of functional elements from billions of nucleotides. The variants that violate learned patterns are the variants that disrupt function.

This zero-shot capability does not eliminate the need for labeled data but changes its role. Rather than training classifiers from scratch, practitioners fine-tune foundation models on modest variant datasets, leveraging pretrained knowledge to achieve performance impossible for models that start from random initialization. The combination of self-supervised pretraining and supervised fine-tuning produces variant effect predictors that outperform classical methods across most benchmarks while requiring far less task-specific data. AlphaMissense, ESM-1v, and similar systems demonstrate that foundation model representations capture variant effects across protein families, including families with no labeled variants in training data.

Yet significant challenges remain. Foundation models predict that variants are damaging without explaining why. Calibration varies across variant types, protein families, and populations, creating uncertainty about when predictions can be trusted. The distinction between “evolutionarily unusual” and “clinically pathogenic” is real: not every rare substitution causes disease, and not every disease-causing variant appears evolutionarily constrained.

18.1 Foundation Model Paradigm for Variant Interpretation

Classical variant effect predictors operate by aggregating hand-crafted features: conservation scores computed from multiple sequence alignments, amino acid property changes, protein domain annotations, and regulatory marks at genomic loci (Chapter 4). Methods like CADD train machine learning models to distinguish pathogenic from benign variants using these features, achieving useful discrimination but ultimately limited by what features the developers chose to include. When a variant falls in a region poorly covered by existing annotations, classical methods have little to offer.

Foundation models invert this relationship. Rather than engineering features, they learn representations from raw sequence data during pretraining, then apply those representations to variant interpretation. A protein language model trained to predict masked amino acids implicitly learns which substitutions violate evolutionary constraints. A DNA language model trained to predict nucleotides in genomic context learns which changes disrupt sequence grammar. The representations encode information about structure, function, and constraint that was never explicitly labeled during training.

Key Insight: Evolution as a Massive Experiment

Foundation models exploit a profound insight: evolution has already conducted billions of “experiments” testing which sequence variants are compatible with life. Protein sequences that survived natural selection represent the “passing” experiments; variants never observed in homologous proteins failed these tests. By learning to predict which amino acids or nucleotides are probable at each position, foundation models implicitly learn what evolution has permitted and, by contrast, what it has rejected. This is why a model trained only on masked token prediction can score variant pathogenicity without ever seeing disease labels.

This paradigm shift has practical consequences. Coverage extends to any variant in any gene, not just those with extensive prior annotation. Representations capture subtle patterns (co-evolution between distant residues, context-dependent motif strength) that resist manual feature engineering. Transfer learning enables rapid adaptation to new tasks and variant classes, with the specific strategies detailed in Chapter 9. The cost is interpretability: understanding why a foundation model assigns a particular score requires specialized analysis techniques rather than simple inspection of feature weights (Chapter 25).

Three architectural families dominate current VEP applications. Protein language models (Chapter 16) encode amino acid sequences and score missense variants by measuring likelihood changes. DNA language models (Chapter 15) operate on nucleotide sequences and can score variants of any type. Regulatory models (Chapter 17) predict molecular phenotypes (chromatin accessibility, gene expression, splicing) and score variants by their predicted impact on these phenotypes. The strongest-performing systems combine elements from multiple families.

Table 18.1: Comparison of foundation model approaches for variant effect prediction. Each approach offers distinct advantages depending on the variant type and desired output.
| Approach | Input | Variant Types | Key Methods | Strengths | Limitations |
| --- | --- | --- | --- | --- | --- |
| Protein LM | Amino acid sequence | Missense | ESM-1v, EVE, AlphaMissense | Captures evolutionary constraint, structural context | Coding only, requires translation |
| DNA LM | Nucleotide sequence | All (SNV, indel, noncoding) | GPN-MSA, Evo 2 | Genome-wide coverage | Less protein-specific information |
| Regulatory | DNA + epigenomic context | Noncoding, regulatory | Enformer, Sei, AlphaGenome | Mechanistic predictions | Cell-type specificity, calibration |
| Structure-aware | Sequence + 3D structure | Missense | AlphaMissense | Explains why constraint exists | Requires structure prediction |

Figure 18.1: Foundation model paradigms for variant effect prediction. (A) Zero-shot scoring uses pretrained likelihoods directly, requiring no task-specific data but limited to what pretraining captured. (B) Linear probing adds a simple classifier on frozen embeddings, requiring minimal labeled data. (C) Full fine-tuning updates all parameters, achieving best performance but requiring substantial labeled data. (D) Multi-modal integration combines sequence (evolutionary) and structure information for comprehensive variant assessment. The appropriate paradigm depends on available labeled data and whether structure contributes to the variant’s effect mechanism.

18.1.1 Zero-Shot and Supervised Approaches

Foundation model VEP methods divide into two paradigms. Zero-shot approaches apply pretrained models directly without task-specific training: ESM-1v scores variants by comparing amino acid likelihoods, requiring no pathogenicity labels. Think of it like a spell-checker that flags words it has never seen in well-edited text. It does not need a dictionary of “misspellings” because any word that looks unusual compared to normal usage stands out as potentially wrong. Similarly, the model’s pretraining objective (masked token prediction) implicitly teaches which substitutions violate evolutionary constraints by learning what “well-edited” (evolutionarily tested) sequences look like. Supervised approaches like AlphaMissense add task-specific training layers and optimize explicitly for pathogenicity prediction using labeled examples.

Stop and Think: Zero-Shot vs. Supervised Trade-offs

Before reading further, consider: If zero-shot methods avoid the biases present in labeled training data, why would anyone use supervised approaches at all? What advantages might supervised fine-tuning provide, and when might those advantages outweigh the risk of inheriting label biases?

The choice involves tradeoffs. Zero-shot methods avoid label bias entirely; they cannot learn to recapitulate existing predictor scores because they never see those scores during training. Supervised methods achieve stronger discrimination when high-quality labels exist but risk inheriting biases from training data. Zero-shot approaches generalize more reliably to novel proteins outside training distributions; supervised methods may overfit to well-studied gene families. In practice, the strongest current systems (AlphaMissense, popEVE) combine foundation model representations with some supervised adaptation, attempting to capture benefits of both paradigms.

18.2 Protein-Based Variant Effect Prediction

Missense variants (single amino acid substitutions) account for approximately half of known pathogenic variants in ClinVar, making protein-level prediction a central challenge (Landrum et al. 2018). Foundation model approaches exploit a simple insight: evolution has already tested billions of amino acid substitutions across millions of years; variants that repeatedly survive natural selection are likely tolerable, while those never observed in homologous proteins likely disrupt function.

18.2.1 Zero-Shot Scoring with Protein Language Models

The simplest foundation model approach to missense VEP requires no task-specific training. A protein language model trained on masked token prediction assigns probabilities to each amino acid at each position given surrounding context. Variant effect scores emerge from comparing the probability of the reference amino acid to the probability of the variant amino acid.

ESM-1v operationalizes this approach with an ensemble of protein language models built on the ESM-1b architecture and trained on UniRef90, applied zero-shot to single-sequence variant effect prediction (Meier et al. 2021). For a variant substituting the reference amino acid \(\text{aa}_{\text{ref}}\) with the alternate \(\text{aa}_{\text{alt}}\) at position \(i\), the score is computed as:

\[ \Delta \text{LLR} = \log P(\text{aa}_{\text{alt}} \mid \mathbf{x}_{-i}) - \log P(\text{aa}_{\text{ref}} \mid \mathbf{x}_{-i}) \tag{18.1}\]

where:

  • \(\text{aa}_{\text{ref}}\) is the reference (wild-type) amino acid at position \(i\)
  • \(\text{aa}_{\text{alt}}\) is the variant (mutant) amino acid
  • \(\mathbf{x}_{-i}\) is the protein sequence with position \(i\) masked
  • Negative \(\Delta\text{LLR}\) indicates the variant is evolutionarily disfavored (predicted deleterious)
Worked Example: Computing LLR Variant Scores

Consider a missense variant in BRCA1 that substitutes leucine (L) with proline (P) at position 1780.

Step 1: The protein language model examines the sequence context around position 1780 and predicts probabilities for each amino acid at that position:

| Amino Acid | Probability |
| --- | --- |
| Leucine (L, reference) | 0.35 |
| Proline (P, variant) | 0.02 |
| Isoleucine (I) | 0.28 |
| Other residues | 0.35 |

Step 2: Compute the log-likelihood ratio (using natural logarithms): \[\Delta \text{LLR} = \log(0.02) - \log(0.35) = -3.91 - (-1.05) = -2.86\]

Step 3: Interpret the score: A score of -2.86 indicates the variant amino acid (proline) is substantially less probable than the reference (leucine) given the evolutionary context. The model has learned that this position strongly favors aliphatic residues (L, I, V), and the rigid proline is unexpected, consistent with a functionally important position where proline would disrupt structure.

Negative scores indicate that the variant amino acid is less probable than reference in learned evolutionary context, suggesting potential deleteriousness. The model sees only the single query sequence, not multiple sequence alignments, yet achieves discrimination competitive with alignment-based methods on deep mutational scanning benchmarks. The emergence of this capability from masked token prediction, without explicit training on variant effects, exemplifies the emergent biological knowledge discussed in Section 16.1.2.

Computing Variant Likelihoods from Language Models

Language models assign probabilities to sequences, but extracting variant effect scores requires specific computational strategies. Different approaches trade off between computational cost, theoretical justification, and empirical performance.

Masked marginal likelihood (used by ESM-1v): Mask the position of interest and compute the probability of each amino acid given the unmasked context. The variant score is the log probability ratio of variant versus reference amino acid:

\[ \text{Score} = \log P(\text{aa}_{\text{alt}} \mid \mathbf{x}_{-i}) - \log P(\text{aa}_{\text{ref}} \mid \mathbf{x}_{-i}) \tag{18.2}\]

where \(\mathbf{x}_{-i}\) denotes the sequence with position \(i\) masked. This approach requires a single forward pass per variant position. It directly measures how surprising each amino acid is given the local and global context the model learned during pretraining.
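To make the computation concrete, the following sketch scores a single substitution by masked marginal likelihood. It assumes the HuggingFace transformers library and uses a small public ESM-2 checkpoint as a stand-in for ESM-1v; the sequence, position, and model choice are illustrative only.

```python
# Minimal sketch: masked marginal variant scoring with a protein masked LM.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_NAME = "facebook/esm2_t6_8M_UR50D"  # small public checkpoint, illustrative only
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME).eval()

def masked_marginal_score(sequence: str, pos: int, ref_aa: str, alt_aa: str) -> float:
    """Return log P(alt | context) - log P(ref | context) at 1-based position pos."""
    assert sequence[pos - 1] == ref_aa, "reference amino acid does not match sequence"
    inputs = tokenizer(sequence, return_tensors="pt")
    # ESM tokenizers place a BOS token at index 0, so residue i maps to token index i.
    inputs["input_ids"][0, pos] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(**inputs).logits
    log_probs = torch.log_softmax(logits[0, pos], dim=-1)
    ref_id = tokenizer.convert_tokens_to_ids(ref_aa)
    alt_id = tokenizer.convert_tokens_to_ids(alt_aa)
    return (log_probs[alt_id] - log_probs[ref_id]).item()

# Toy example: an R -> P substitution at position 10; negative scores suggest
# the variant is disfavored in the learned evolutionary context.
seq = "MKTAYIAKQRLISFVKSHFSRQLEERLGLIEVQ"
print(masked_marginal_score(seq, pos=10, ref_aa="R", alt_aa="P"))
```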

Pseudo-likelihood: Sum the masked marginal log-probabilities across all positions in the sequence:

\[ \text{PLL}(\mathbf{x}) = \sum_{i=1}^{L} \log P(x_i \mid \mathbf{x}_{-i}) \tag{18.3}\]

where \(L\) is the sequence length and \(x_i\) is the amino acid at position \(i\). The variant effect is the difference in pseudo-likelihood between mutant and wild-type sequences. This captures how the mutation affects sequence probability globally, not just at the mutated position, and may detect compensatory effects, but it requires \(L\) forward passes (one per position).
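A minimal pseudo-likelihood sketch follows, reusing the tokenizer and model objects from the masked marginal example above; it loops over positions for clarity, though in practice the masked copies would be batched.

```python
# Sketch: pseudo-log-likelihood of a full sequence with one masked pass per residue.
import torch

def pseudo_log_likelihood(sequence: str) -> float:
    """Sum of masked-marginal log-probabilities over every residue in the sequence."""
    inputs = tokenizer(sequence, return_tensors="pt")
    ids = inputs["input_ids"]
    total = 0.0
    for pos in range(1, len(sequence) + 1):        # token 0 is the BOS token
        masked = ids.clone()
        masked[0, pos] = tokenizer.mask_token_id   # mask one residue at a time
        with torch.no_grad():
            logits = model(input_ids=masked,
                           attention_mask=inputs["attention_mask"]).logits
        log_probs = torch.log_softmax(logits[0, pos], dim=-1)
        total += log_probs[ids[0, pos]].item()     # log P(true residue | rest)
    return total

# Global variant effect: PLL(mutant) - PLL(wild type), negative when the
# substitution makes the whole sequence less probable.
wt = "MKTAYIAKQRLISFVKSHFSRQLEERLGLIEVQ"
mut = wt[:9] + "P" + wt[10:]                       # the same R10P substitution
print(pseudo_log_likelihood(mut) - pseudo_log_likelihood(wt))
```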

Autoregressive likelihood (for GPT-style models): Compute the probability of generating the sequence left-to-right:

\[ \text{LL}(\mathbf{x}) = \sum_{i=1}^{L} \log P(x_i \mid x_1, \ldots, x_{i-1}) \tag{18.4}\]

This provides a proper generative probability but creates asymmetry: mutations early in the sequence affect more downstream predictions than mutations late in the sequence. Bidirectional models avoid this issue.

Masked language model pseudo-perplexity: Average negative log-probability across positions, measuring how “surprising” the sequence appears to the model. Lower perplexity indicates more natural sequences.

Practical considerations:

  • Masked marginal is computationally cheapest (one forward pass) and works well for local effects
  • Pseudo-likelihood is more thorough but expensive; often approximated by sampling positions
  • Autoregressive likelihood provides proper probabilities but with positional asymmetry
  • Multiple scoring strategies often correlate; choice depends on computational budget and task

Most protein language model variant scoring uses masked marginal likelihood due to its efficiency and strong empirical performance. DNA language models use analogous strategies adapted for nucleotide sequences and longer contexts.

This zero-shot capability reflects what protein language models learn during pretraining: structural constraints (buried positions are hydrophobic), functional constraints (active sites are conserved), and co-evolutionary patterns (compensating mutations at contacting residues). The model has never seen pathogenicity labels, yet its predictions correlate with disease association because evolution and disease share underlying biology.

18.2.2 Alignment-Based Models: EVE and popEVE

An alternative approach explicitly models multiple sequence alignments rather than relying on implicit evolutionary information in single-sequence representations. EVE (Evolutionary Model of Variant Effect) fits a variational autoencoder to the MSA for each protein, learning a generative model that captures position-specific and pairwise constraints (Frazer et al. 2021). Variant scores derive from the change in sequence probability under this model.

The EVE architecture consists of an encoder that maps sequences to a latent space and a decoder that reconstructs sequences from latent representations. Training maximizes a lower bound on sequence likelihood across the MSA. For variant scoring, EVE computes the log-likelihood ratio between mutant and wild-type sequences, capturing how surprising the substitution appears given the evolutionary record for that specific protein.

popEVE extends this framework with improved training procedures and explicit modeling of population allele frequencies (Orenbuch et al. 2025). By incorporating frequency information, popEVE better separates rare deleterious variants from common benign polymorphisms. The model achieves strong performance on ClinVar classification while providing uncertainty estimates through ensemble disagreement.

The tradeoff between single-sequence and MSA-based approaches involves coverage versus depth. ESM-1v scores any protein sequence without requiring alignment construction. EVE provides stronger performance when high-quality MSAs are available but cannot score proteins lacking sufficient homologs. For well-studied protein families with deep evolutionary sampling, MSA-based methods remain competitive; for orphan proteins or rapidly evolving sequences, single-sequence models offer the only foundation model option.

Tranception (Notin et al. 2022) offers a middle ground through retrieval augmentation: rather than training a separate model per protein family, it retrieves evolutionarily related sequences at inference time and uses them to provide context-specific information about which positions tolerate variation. This approach bridges classical MSA-based methods and modern language models, potentially offering benefits of both paradigms for proteins where sufficient homologs can be retrieved. Performance on DMS benchmarks suggests retrieval augmentation can improve variant effect prediction, though benefits vary across protein families.

Knowledge Check: Single-Sequence vs. MSA-Based Models

A patient carries a missense variant in OBSCN, an extremely large gene encoding a protein with few characterized homologs. The gene has only 12 sequences in its MSA, most from closely related mammals.

  1. Would ESM-1v or EVE be more appropriate for scoring this variant? Why?
  2. What fundamental limitation does this example illustrate about alignment-based approaches?
  3. If the variant were in BRCA1 instead (which has hundreds of homologs), would your answer change?

ESM-1v would be more appropriate because it works on single sequences and does not require a high-quality MSA. This illustrates that alignment-based methods fail for orphan proteins or those with sparse evolutionary sampling, where MSA-based approaches lack sufficient data. For BRCA1 with hundreds of homologs, EVE might perform better because the deep evolutionary sampling provides rich coevolution signals that EVE can exploit, though ESM-1v would still provide reliable predictions.

18.2.3 AlphaMissense: Structure-Informed Pathogenicity Prediction

AlphaMissense represents the current state of the art for proteome-wide missense pathogenicity prediction, combining protein language model representations with structural information from AlphaFold2 (Cheng et al. 2023). The system provides precomputed scores for 71 million possible missense variants across the human proteome, enabling instant lookup for any variant in any protein-coding gene.

The architecture integrates multiple information sources. Sequence representations come from a protein language model encoding the wild-type sequence and mutation position. Structural representations derive from AlphaFold2 predictions, capturing local geometry (secondary structure, solvent accessibility, packing density) and longer-range contacts. A neural network combines these representations to produce a pathogenicity probability between 0 and 1.

Why does combining sequence and structure information improve predictions beyond what either provides alone? Sequence-based evolutionary constraint tells you that a position is important, but structure explains why. Consider two equally conserved positions. One is buried in the hydrophobic core where any polar substitution destabilizes the fold, while the other forms a catalytic residue where even conservative substitutions abolish enzyme activity. A charge-preserving substitution (Asp to Glu) might be tolerable at a structural position but devastating at the catalytic one. Structure reveals these mechanistic differences that sequence conservation alone cannot distinguish. Similarly, two surface-exposed positions might show identical evolutionary constraint, but one forms a protein-protein interaction interface while the other faces solvent. Structural context disambiguates cases where sequence statistics are similar but mechanisms (and therefore tolerance for variation) differ substantially.

Training uses a carefully constructed dataset that avoids the circularity plaguing earlier predictors. Rather than training on ClinVar labels (which themselves derive from computational predictions), AlphaMissense uses population frequency as a proxy for pathogenicity: variants common in gnomAD are likely benign, while variants absent from large population samples and observed in disease contexts are likely pathogenic. This approach reduces the risk of learning features that simply recapitulate existing predictor scores.

Key Insight: Why Structure Matters for Variant Interpretation

Protein language models capture that a position is evolutionarily constrained, but structural information explains why. A buried hydrophobic residue is constrained because substituting a charged amino acid would destabilize the fold. An active site residue is constrained because even subtle changes disrupt catalysis. An interface residue is constrained because substitutions abolish protein-protein interactions. By incorporating AlphaFold2 structural features, AlphaMissense can distinguish variants at constrained positions where the constraint arises from different mechanisms, improving prediction for cases where sequence constraint alone is ambiguous.

Calibration receives explicit attention. Raw model outputs undergo isotonic regression calibration against held-out ClinVar variants, ensuring that predicted probabilities correspond to observed pathogenic proportions (Section 24.3). A score of 0.8 should mean that 80% of variants with similar scores are pathogenic, enabling meaningful clinical interpretation. AlphaMissense reports calibrated scores along with discrete classifications (likely pathogenic, likely benign, uncertain) at thresholds chosen to achieve specific precision targets.

Performance on independent benchmarks substantially exceeds classical predictors. On deep mutational scanning datasets (where experimental fitness measurements provide ground truth independent of clinical labels), AlphaMissense achieves correlations of 0.5 to 0.7 depending on the assay, compared to 0.3 to 0.5 for CADD or PolyPhen-2 (Cheng et al. 2023). On ClinVar expert-reviewed variants held out from training, AlphaMissense achieves auROC of 0.94, representing a meaningful improvement over classical methods (Cheng et al. 2023).

The structural component proves essential for this performance. Ablation experiments removing AlphaFold2 features degrade performance substantially, particularly for variants at protein-protein interfaces and buried core positions where local geometry determines functional impact. The protein language model captures evolutionary constraint; structural information explains why that constraint exists.

Figure 18.2: AlphaMissense: integrating evolution and structure for variant pathogenicity. (A) Architecture combines ESM evolutionary embeddings with AlphaFold2 structural features. (B) Training avoids ClinVar circularity by using population frequency as a pathogenicity proxy: common gnomAD variants are presumed benign, rare variants in disease-associated contexts are presumed pathogenic. (C) Pathogenicity scores vary appropriately by structural context: core and active site variants score highest, surface variants score lowest. (D) Proteome-wide application classifies 71 million possible human missense variants, creating the largest pathogenicity prediction resource.

18.3 DNA-Based Variant Effect Prediction

Approximately 98% of the human genome lies outside protein-coding regions, yet noncoding variants contribute substantially to disease risk through effects on gene regulation, splicing, and genome stability (Kundaje et al. 2015). Predicting the impact of these variants requires models that operate directly on DNA sequence rather than translated protein.

18.3.1 Splice Variant Prediction with SpliceAI

Splicing variants illustrate both the promise and current limitations of deep learning for noncoding VEP. A substantial fraction of pathogenic variants in ClinVar act through splicing mechanisms, disrupting the precise excision of introns from pre-mRNA (Jaganathan et al. 2019). Classical approaches relied on position weight matrices matching consensus splice site sequences, achieving limited sensitivity for variants outside the core GT-AG dinucleotides.

SpliceAI applies the dilated convolutional architecture introduced in Chapter 6 to predict splice site usage from raw DNA sequence (Jaganathan et al. 2019). The architecture processes 10,000 nucleotides of context through 32 residual blocks with dilated convolutions (dilation rates increasing from 1 to 128), enabling the receptive field to span several kilobases while maintaining nucleotide resolution. Output heads predict splice donor probability, splice acceptor probability, and junction usage at each position.

For variant effect prediction, SpliceAI compares predictions between reference and alternate sequences. The delta score quantifies the largest change in predicted splice site probability near the variant, capturing gained and lost donor and acceptor sites. Scores exceeding 0.2 correlate with experimentally validated splicing changes; scores above 0.5 have high specificity for pathogenic splicing variants (Jaganathan et al. 2019).
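The sketch below shows how such delta scores can be assembled from per-position splice probabilities. The probability arrays are placeholders standing in for the donor and acceptor tracks a SpliceAI-like model would produce for the reference and alternate sequences; the ±50-nucleotide window mirrors a commonly used default but is simply a parameter here.

```python
# Sketch: SpliceAI-style delta scores from per-position splice probabilities.
import numpy as np

def delta_scores(ref_acceptor, alt_acceptor, ref_donor, alt_donor,
                 variant_idx, window=50):
    """Largest probability change for each splicing event within +/- window nt."""
    lo = max(0, variant_idx - window)
    hi = min(len(ref_acceptor), variant_idx + window + 1)
    sl = slice(lo, hi)
    return {
        "acceptor_gain": float(np.max(alt_acceptor[sl] - ref_acceptor[sl])),
        "acceptor_loss": float(np.max(ref_acceptor[sl] - alt_acceptor[sl])),
        "donor_gain":    float(np.max(alt_donor[sl] - ref_donor[sl])),
        "donor_loss":    float(np.max(ref_donor[sl] - alt_donor[sl])),
    }

# Placeholder tracks; real values would come from model predictions on the
# reference and alternate haplotypes aligned to the same coordinates.
rng = np.random.default_rng(0)
ref_a, alt_a, ref_d, alt_d = (rng.uniform(0, 0.05, 10_001) for _ in range(4))
scores = delta_scores(ref_a, alt_a, ref_d, alt_d, variant_idx=5_000)
print(scores, "summary delta:", max(scores.values()))
```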

Clinical deployment has validated SpliceAI’s utility. Illumina integrated the model into their clinical interpretation pipeline, and multiple diagnostic laboratories use SpliceAI scores as supporting evidence for ACMG classification. The architectural innovations that enable this performance, including the dilated convolution strategy for expanding receptive fields, are detailed in Section 6.5. The model identifies pathogenic splicing variants missed by classical methods, particularly deep intronic variants that create novel splice sites through cryptic activation.

Limitations reflect the model’s training data. SpliceAI learned from annotated transcripts representing major isoforms in common tissues. Tissue-specific alternative splicing, rare isoforms, and developmental stage-specific patterns fall outside the training distribution. The model also does not capture downstream consequences: whether a predicted splicing change produces a functional protein, triggers nonsense-mediated decay, or has no phenotypic effect requires additional analysis.

18.3.2 Regulatory Variant Prediction with Enformer

The challenge of non-coding variant interpretation is quantified by landmark studies of GWAS variant localization. Farh et al. (2015) established that approximately 90% of disease-associated variants fall in non-coding regions, with roughly 60% in enhancer-like chromatin states (regions showing histone modifications characteristic of active enhancers). Of these, only 10-20% directly disrupt recognizable transcription factor binding motifs.

Stop and Think

If only 10-20% of regulatory variants disrupt recognizable TF motifs, what does this imply about the complexity of regulatory grammar that foundation models must learn?

While SpliceAI addresses one specific noncoding mechanism, regulatory variants that alter enhancer activity, promoter function, or chromatin organization require different approaches. Enformer (Chapter 17) predicts multiple molecular phenotypes (histone modifications, transcription factor binding, chromatin accessibility, gene expression) from 196,608 base pairs of DNA sequence, providing a substrate for regulatory VEP (Ž. Avsec et al. 2021).

Variant effect prediction with Enformer compares predicted tracks between reference and alternate sequences. For a variant in an enhancer, the model might predict reduced H3K27ac signal and decreased CAGE expression at the target promoter. These molecular predictions can be aggregated into variant effect scores, with larger predicted changes indicating greater functional impact.
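As an illustration of this comparison, the sketch below aggregates ref-versus-alt track predictions into per-track effect scores. The arrays are random placeholders with an Enformer-like output shape; in practice they would be the model's predictions for the two haplotypes, and the summary might be restricted to tracks from the tissue of interest.

```python
# Sketch: per-track variant effect scores from ref/alt regulatory predictions.
import numpy as np

def track_effect_scores(ref_tracks, alt_tracks, center_bin, flank=4, eps=1e-6):
    """Log2 fold change in predicted signal, summed over bins around the variant."""
    window = slice(center_bin - flank, center_bin + flank + 1)
    ref_signal = ref_tracks[window].sum(axis=0) + eps
    alt_signal = alt_tracks[window].sum(axis=0) + eps
    return np.log2(alt_signal / ref_signal)          # one score per output track

rng = np.random.default_rng(1)
ref = rng.gamma(2.0, 1.0, size=(896, 5313))          # bins x tracks, Enformer-like shape
alt = ref * rng.normal(1.0, 0.05, size=ref.shape)    # perturbed placeholder predictions
scores = track_effect_scores(ref, alt, center_bin=448)
print("largest absolute per-track change:", float(np.max(np.abs(scores))))
```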

Several challenges complicate Enformer-based VEP. The model predicts relative effects (fold changes in predicted signal) rather than absolute deleteriousness. Calibrating these predictions against pathogenicity labels requires additional supervised training. Cell-type specificity adds complexity: a variant might strongly affect predictions in cardiac tissue while showing no effect in liver, requiring prior knowledge of relevant tissues for clinical interpretation.

Sei extends this approach by learning a regulatory vocabulary: clusters of predicted effects that correspond to interpretable categories like “active promoter,” “strong enhancer,” or “CTCF binding site” (Chen et al. 2022). Variant scores reflect shifts between these categories, providing more interpretable outputs than raw track changes. A variant that converts an enhancer prediction to a quiescent state has clearer implications than one that reduces H3K27ac by 0.3 log-fold.

Stop and Think: Mechanism vs. Pathogenicity

Enformer can predict that a variant reduces enhancer activity, and SpliceAI can predict that a variant creates a cryptic splice site. But neither model directly predicts whether these molecular changes cause disease.

Consider: A variant in an enhancer might reduce expression of the target gene by 30%. Under what circumstances would this reduction be:

  • Clearly pathogenic?
  • Clearly benign?
  • Uncertain in its clinical significance?

This exercise illustrates a fundamental gap between mechanistic prediction (what the variant does molecularly) and clinical prediction (whether it causes disease).

18.3.3 DNA Language Models: GPN-MSA and Evo 2

DNA language models provide an alternative to phenotype prediction: scoring variants by how unexpected they appear in learned sequence context, analogous to protein language model approaches for missense variants.

GPN-MSA combines DNA language modeling with multi-species sequence alignments (Benegas et al. 2024). Building on the GPN approach introduced in Section 15.4, the model processes aligned sequences from dozens of vertebrate species, learning which positions are conserved and which tolerate variation. Variant scores derive from likelihood ratios: how much less probable is the variant allele compared to reference given the alignment context? This approach captures deep evolutionary constraint missed by simple conservation scores while providing genome-wide coverage including noncoding regions.

Primate-specific constraint data provides complementary evidence. Gao et al. (2023) characterized variation across great ape genomes, providing constraint estimates from species evolutionarily close to humans. Because primates diverged relatively recently, variants tolerated in primate genomes may provide more relevant evidence of benignity than variants tolerated only in distant species. This primate-specific constraint signal can calibrate PLM-based VEP predictions, particularly for variants where deep phylogenetic conservation is ambiguous.

Evo 2 pushes context length to approximately one megabase, enabling single models to capture local motifs and long-range dependencies simultaneously (Brixi et al. 2025). The StripedHyena architecture provides computational efficiency at this scale through state-space-based sequence modeling rather than quadratic attention, as detailed in Section 15.5.3 and Chapter 7. Training on diverse genomes across the tree of life teaches general principles of sequence organization that transfer to human variant interpretation.

Zero-shot variant scoring with Evo 2 follows the standard likelihood ratio approach. Initial benchmarks show performance competitive with conservation-based scores for coding variants and potentially superior performance for noncoding variants where local sequence context matters more than position-specific conservation. The extremely long context enables modeling of effects mediated by distal elements, though whether this theoretical capability translates to improved VEP remains under investigation.

18.3.4 AlphaGenome: Unified Multi-Omic Variant Effect Prediction

AlphaGenome (Section 17.5) represents the most ambitious current attempt at comprehensive VEP, predicting multiple molecular phenotypes from megabase-scale DNA sequence and using those predictions to assess variant effects across modalities (Z. Avsec, Latysheva, and Cheng 2025).

Variant effect prediction with AlphaGenome provides mechanistically interpretable outputs. A promoter variant might show reduced accessibility and decreased expression prediction. An enhancer variant might show weakened contact with its target promoter in addition to reduced local histone acetylation. A splicing variant might trigger SpliceAI-like splice site changes while also affecting regulatory track predictions near the affected exon.

The multi-omic approach enables variant prioritization that considers multiple mechanisms simultaneously. A variant in a regulatory element that affects accessibility, expression, and chromatin contacts represents stronger evidence than one affecting only a single predicted phenotype. Conversely, variants with no predicted effect across modalities can be deprioritized despite proximity to disease genes.

Practical deployment involves tradeoffs. Evaluating a single variant requires forward passes through the full model, incurring substantial computational cost compared to lookup-based approaches like AlphaMissense. The model may exhibit overconfidence when extrapolating beyond training cell types. Calibrating multi-dimensional predictions into single pathogenicity scores remains an open problem. These constraints position AlphaGenome as a tool for detailed mechanistic investigation of prioritized variants rather than genome-wide screening.

18.4 Combining Evidence Across Modalities

No single model addresses all variant types and mechanisms. Missense variants in protein-coding regions call for protein-level predictors; splicing variants require splice-specific models; regulatory variants benefit from long-context DNA models. Practical VEP workflows combine multiple predictors to achieve comprehensive coverage.

18.4.1 Integration Strategies

The simplest integration approach applies different models to different variant classes. Missense variants receive AlphaMissense scores; synonymous and intronic variants near splice sites receive SpliceAI scores; promoter and enhancer variants receive Enformer or AlphaGenome predictions. This modular strategy ensures that each variant type receives predictions from an appropriate model.

Table 18.2: Model selection guidance by variant type. The choice depends on variant location and suspected mechanism.
| Variant Type | Primary Model(s) | Secondary/Supporting | Key Considerations |
| --- | --- | --- | --- |
| Missense | AlphaMissense | ESM-1v, EVE | Check for splicing effects if near exon boundary |
| Synonymous | SpliceAI | Enformer (regulatory) | Often dismissed but can affect splicing |
| Splice site (canonical) | SpliceAI | — | Very high pathogenicity for loss-of-function genes |
| Deep intronic | SpliceAI | Enformer, GPN-MSA | Look for cryptic splice site activation |
| Promoter/5’UTR | Enformer, AlphaGenome | GPN-MSA | Consider tissue specificity |
| Enhancer | Enformer, Sei | AlphaGenome | Need prior knowledge of target genes |
| Structural variants | Limited coverage | — | Gap in current foundation model capabilities |

More sophisticated integration aggregates scores across models for the same variant. A missense variant might receive both AlphaMissense (protein impact) and Enformer (regulatory impact, relevant if the codon overlaps a regulatory element) predictions. Combining these requires decisions about weighting and potential double-counting of shared information.

Bayesian approaches offer principled integration. Priors encode beliefs about variant mechanism proportions; likelihoods incorporate model predictions given mechanism; posteriors combine evidence across models while respecting uncertainty. REVEL (Rare Exome Variant Ensemble Learner) demonstrated this approach for classical predictors (Ioannidis et al. 2016); extending it to foundation model outputs requires careful calibration of each component score.
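The sketch below illustrates the ensemble idea with a simple logistic-regression meta-classifier over component scores; it is not REVEL's actual implementation, and the features and labels are synthetic placeholders. Real training requires a labeled variant set curated to avoid circularity with the component predictors.

```python
# Sketch: a meta-classifier combining several predictor scores into one probability.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
# Columns might hold, e.g., an AlphaMissense score, an ESM-1v LLR, and a SpliceAI
# delta score for each variant (synthetic values here).
X = rng.normal(size=(n, 3))
y = (X @ np.array([1.5, 1.0, 0.5]) + rng.normal(scale=0.5, size=n) > 0).astype(int)

meta = LogisticRegression().fit(X, y)
print("learned weights per component:", meta.coef_.round(2))
print("combined probabilities:", meta.predict_proba(X[:3])[:, 1].round(2))
```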

18.4.2 Avoiding Double-Counting

Foundation models trained on overlapping data risk capturing correlated rather than independent information. AlphaMissense and ESM-1v both encode evolutionary constraint; combining their scores as independent evidence overweights evolutionary signal. Similarly, conservation-based DNA models like GPN-MSA share information with phyloP scores already incorporated in classical predictors.

Correlation analysis helps quantify redundancy. If two model scores correlate above 0.8 across a benchmark dataset, they likely provide similar information and should not be counted as independent evidence. Residual analysis can identify what unique signal each model contributes beyond shared components.
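A minimal redundancy check is sketched below using rank correlations across a shared benchmark; the scores are random placeholders in which two columns deliberately share an underlying signal.

```python
# Sketch: quantifying redundancy between predictors with rank correlations.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
shared = rng.normal(size=1_000)                       # common evolutionary signal
scores = pd.DataFrame({
    "alphamissense": shared + rng.normal(scale=0.3, size=1_000),
    "esm1v_llr":     shared + rng.normal(scale=0.3, size=1_000),
    "spliceai":      rng.normal(size=1_000),          # unrelated predictor
})
corr = scores.corr(method="spearman")
print(corr.round(2))
# Pairs with |rho| above roughly 0.8 should be treated as one evidence source.
```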

Practical Guidance: Evidence Independence in ACMG Classification

For ACMG classification, guidelines specifically address computational evidence weighting. The PP3 (computational evidence supporting pathogenicity) and BP4 (computational evidence supporting benignity) criteria apply when multiple tools agree. However, using five correlated predictors that all derive from evolutionary conservation should not count as five independent pieces of evidence.

Recommended practice:

  1. Group predictors by primary information source (evolutionary constraint, structural features, regulatory marks)
  2. Select one representative tool from each group for ACMG evidence
  3. Document tool correlations in your laboratory’s validation records
  4. Update tool selection as new methods emerge and correlations are characterized

Clinical laboratories should develop local policies for which tools to consult and how to weight their outputs, ideally based on validation against known variants in their patient population.

18.4.3 Practical Workflow Design

An effective VEP workflow balances comprehensiveness against efficiency. Genome-wide screening might use fast, zero-shot models (DNA language model likelihood scores) to identify variants deviating from expected sequence patterns. Prioritized variants then receive detailed evaluation with computationally expensive models (AlphaGenome multi-omic predictions). Final interpretation combines computational scores with population frequency, gene-level constraint metrics, segregation data, and clinical phenotype.

The ordering matters for efficiency. Filtering the majority of variants with fast models before applying expensive models reduces computational cost by orders of magnitude. The choice of filtering threshold trades sensitivity against specificity: strict thresholds miss true pathogenic variants; lenient thresholds burden downstream analysis with false positives. Threshold selection should match intended use: diagnostic applications prioritize sensitivity while research screening may prioritize specificity.

Why does the tiered approach work biologically as well as computationally? Most variants in a genome are benign; purifying selection has eliminated severely deleterious variants from the population. A fast initial filter that correctly classifies the majority of benign variants reduces the candidate set to a manageable number for detailed analysis. The variants that pass the filter are enriched for those where predictions are uncertain or where multiple mechanisms may be at play, justifying the computational investment in multi-model evaluation. This mirrors clinical reasoning: common polymorphisms need not be scrutinized in depth, while rare variants in disease-relevant genes warrant comprehensive assessment.

Figure 18.3: Multi-model integration for comprehensive variant assessment. (A) Coding variants are routed to appropriate models based on consequence type: missense to AlphaMissense/ESM, splice-proximal to SpliceAI, synonymous requiring both splicing and regulatory assessment. (B) Non-coding variants use specialized models for each region: promoters and enhancers use Enformer, splice regions use SpliceAI, deep intronic variants combine conservation with regulatory predictions. (C) Ensemble methods combine individual model scores through learned weights, typically outperforming any single model. (D) Uncertainty aggregation identifies variants where models disagree, flagging them for review. When models agree, predictions are confident; when models disagree, elevated uncertainty triggers manual review.

18.5 Calibration and Clinical Categories

Difficulty Warning: Statistical Foundations Required

This section introduces calibration concepts that require familiarity with probability, Bayesian reasoning, and classification metrics. If terms like “reliability diagram,” “isotonic regression,” or “odds ratio” are unfamiliar, consider first reviewing the technical foundations in Section 24.3. The practical implications are accessible without deep statistical background, but understanding why these methods work requires the full treatment in Chapter 23.

A pathogenicity score of 0.73 means nothing in isolation. If that score reflects a well-calibrated model, approximately 73% of variants receiving similar scores are truly pathogenic, and clinical decisions can proceed accordingly. If the model is miscalibrated, the true pathogenic rate could be 40% or 95%, rendering the score unreliable for clinical interpretation. Model scores become clinically useful only when they map to actionable categories through calibration, the process of ensuring that predicted probabilities match observed frequencies.

18.5.1 Assessing Calibration

Calibration plots (reliability diagrams) visualize the relationship between predicted probabilities and observed frequencies. Variants are binned by predicted score, and the proportion of pathogenic variants in each bin is plotted against the bin’s mean predicted probability. Perfect calibration falls on the diagonal: a predicted 0.8 pathogenicity corresponds to an 80% observed pathogenic rate. Points below the diagonal indicate overconfidence (predictions exceed reality), while points above indicate underconfidence.
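The sketch below computes a binned reliability summary and the expected calibration error (ECE); the predicted probabilities and labels are synthetic, with the labels constructed so the model looks overconfident.

```python
# Sketch: reliability bins and expected calibration error for a pathogenicity predictor.
import numpy as np

def reliability_bins(probs, labels, n_bins=10):
    """Per-bin mean confidence vs. observed pathogenic fraction, plus ECE."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows, ece = [], 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & (probs < hi)
        if not mask.any():
            continue
        confidence = probs[mask].mean()               # mean predicted probability in bin
        observed = labels[mask].mean()                # observed pathogenic fraction in bin
        ece += mask.mean() * abs(confidence - observed)
        rows.append((lo, hi, confidence, observed, int(mask.sum())))
    return rows, ece

rng = np.random.default_rng(0)
probs = rng.uniform(size=2_000)
labels = (rng.uniform(size=2_000) < probs ** 1.5).astype(int)  # overconfident toy model
bins, ece = reliability_bins(probs, labels)
print(f"ECE = {ece:.3f}")
```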

Most raw model outputs are poorly calibrated. Neural networks trained with cross-entropy loss tend toward overconfidence, predicting probabilities near 0 or 1 more often than warranted. Protein language model likelihood ratios produce unbounded scores requiring transformation before probability interpretation. The theoretical foundations of why deep networks and foundation models exhibit systematic miscalibration, along with formal definitions of calibration metrics including expected calibration error (ECE), are developed in Section 24.3. The specific challenges posed by foundation model miscalibration in clinical settings are examined in Section 24.3.3.

18.5.2 Calibration Methods for Variant Effect Prediction

Post-hoc calibration transforms raw model outputs into probabilities that match observed pathogenicity frequencies. The technical details of these transformations, including temperature scaling, Platt scaling, and isotonic regression, are developed in Section 24.4. Here we focus on their application to variant effect prediction.

Calibration should use data representative of deployment conditions. Calibrating on ClinVar expert-reviewed variants produces reliable performance on similar variants but may not transfer to novel genes, rare populations, or variant classes underrepresented in ClinVar. A model calibrated on well-studied cancer genes may be systematically overconfident when applied to genes with fewer characterized variants. Stratified calibration by gene function, variant class, or population improves reliability at the cost of increased data requirements.
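A minimal post-hoc calibration sketch follows, fitting isotonic regression on a held-out labeled split and applying it to new scores; the raw scores and labels are synthetic placeholders (see Section 24.4 for the methods themselves).

```python
# Sketch: post-hoc calibration of raw scores with isotonic regression.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
raw = rng.normal(size=3_000)                               # e.g. model logits or LLRs
truth = (rng.uniform(size=3_000) < 1 / (1 + np.exp(-2 * raw))).astype(int)

calib_idx = slice(0, 2_000)                                # held-out calibration split
iso = IsotonicRegression(out_of_bounds="clip").fit(raw[calib_idx], truth[calib_idx])

calibrated = iso.predict(raw[2_000:])                      # probabilities in [0, 1]
print(calibrated[:5].round(2))
```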

The systematic biases that arise from distribution shift between calibration and deployment present particular challenges for clinical genomics. Foundation models trained predominantly on European-ancestry data may exhibit differential calibration across populations, producing well-calibrated predictions for some patient groups and miscalibrated predictions for others (Chapter 13). These disparities have direct implications for equitable care, as clinical decisions based on miscalibrated predictions will be systematically worse for patients from underrepresented backgrounds. The sources and consequences of such differential calibration are examined in Section 24.3.

Figure 18.4: Calibration for clinical variant interpretation. (A) Raw model scores are not calibrated probabilities; neural networks exhibit systematic overconfidence, clustering predictions near 0 and 1. (B) Calibration curves reveal and correct probability bias; post-calibration predictions track the diagonal, providing accurate probability estimates. (C) Threshold selection depends on clinical context: screening applications favor high sensitivity, diagnostic applications balance sensitivity and specificity, confirmation applications favor high specificity. (D) Clinical prevalence dramatically affects interpretation: the same score means different things in healthy populations versus rare disease clinics. Proper clinical deployment requires calibration and prevalence-adjusted interpretation.

18.5.3 Mapping to ACMG Categories

The ACMG-AMP variant classification framework defines five categories: pathogenic, likely pathogenic, uncertain significance, likely benign, and benign (Richards et al. 2015). Computational evidence contributes to classification through specific criteria: PP3 (computational evidence supporting pathogenicity) and BP4 (computational evidence supporting benignity).

ACMG-AMP Variant Classification Framework

The American College of Medical Genetics and Genomics (ACMG) and Association for Molecular Pathology (AMP) established a standardized framework for classifying sequence variants in Mendelian disease genes (Richards et al. 2015). Understanding this framework is essential for applying foundation model predictions in clinical settings.

Classification categories:

| Category | Abbreviation | Clinical interpretation |
| --- | --- | --- |
| Pathogenic | P | Disease-causing; actionable |
| Likely pathogenic | LP | >90% probability pathogenic; actionable |
| Uncertain significance | VUS | Insufficient evidence; not actionable |
| Likely benign | LB | >90% probability benign |
| Benign | B | Not disease-causing |

Evidence types and strengths:

Evidence for pathogenicity includes:

  • PVS1 (Very Strong): Null variant in gene where loss-of-function causes disease
  • PS1-PS4 (Strong): Same amino acid change known pathogenic, functional studies, segregation in families, prevalence in affected vs. controls
  • PM1-PM6 (Moderate): Functional domain, absent from controls, missense in gene with low benign variation, etc.
  • PP1-PP5 (Supporting): Cosegregation, missense in gene with mostly missense pathogenic, computational evidence (PP3), patient phenotype match, reputable source

Evidence for benignity includes:

  • BA1 (Stand-alone): Allele frequency >5% in population databases
  • BS1-BS4 (Strong): Frequency higher than expected, functional studies, segregation against, observed in trans with pathogenic
  • BP1-BP7 (Supporting): Missense in gene with truncating mechanism, silent with no splice impact, computational evidence (BP4), etc.

Combining evidence:

| Classification | Evidence required |
| --- | --- |
| Pathogenic | PVS1 + ≥1 PS; OR 2 PS; OR 1 PS + 3 PM; OR 1 PS + 2 PM + 2 PP; … |
| Likely pathogenic | 1 PVS1 + 1 PM; OR 1 PS + 1-2 PM; OR 1 PS + ≥2 PP; … |
| Likely benign | 1 BS + 1 BP; OR ≥2 BP |
| Benign | BA1; OR ≥2 BS |
| VUS | Criteria not met for other categories |

Computational evidence (PP3/BP4):

Foundation model scores contribute through PP3 (supporting pathogenicity) or BP4 (supporting benignity). These criteria provide supporting-level evidence. To achieve stronger evidence levels, tools must be calibrated to demonstrate odds ratios of pathogenicity meeting specific thresholds: >2.08 for supporting, >4.33 for moderate, >18.7 for strong (Tavtigian et al. 2018). Recent calibration studies have established that several foundation model-based predictors can achieve moderate or strong evidence levels at appropriate thresholds (Pejaver et al. 2022; Bergquist et al. 2025).

Mapping continuous foundation model scores to these discrete criteria requires threshold selection. Conservative thresholds ensure high precision at the cost of low recall: only variants with very high (or very low) scores receive computational evidence designation. Lenient thresholds increase recall but admit more false positives, potentially inflating pathogenicity classifications. The choice reflects a fundamental trade-off between missing actionable variants and overclassifying benign variants as potentially harmful.

ClinGen sequence variant interpretation working groups have developed model-specific recommendations for computational predictors, specifying score thresholds that correspond to different evidence strengths. Tavtigian and colleagues proposed a Bayesian framework for calibrating computational evidence strength based on odds ratios of pathogenicity at different score thresholds (Tavtigian et al. 2018). Under this framework, thresholds must achieve specific odds ratios (greater than 2.08 for supporting evidence, greater than 4.33 for moderate evidence) to qualify for particular ACMG evidence levels. Pejaver et al. applied this framework to calibrate 13 classical missense predictors, establishing that four tools (BayesDel, MutPred2, REVEL, VEST4) could provide up to Strong evidence for pathogenicity (Pejaver et al. 2022). In 2025, ClinGen extended these calibrations to foundation model-based predictors, demonstrating that AlphaMissense, ESM1b, and VARITY all reach Strong evidence for pathogenicity and Moderate for benignity at appropriate score thresholds (Bergquist et al. 2025). Laboratories should select tools with established calibrations and document threshold choices in variant reports.
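The mapping from a calibrated odds ratio of pathogenicity to an evidence strength can be expressed directly, as in the sketch below, which uses the supporting, moderate, and strong thresholds cited above; only the pathogenic direction (PP3) is shown, and real deployment would use a tool's published, validated thresholds.

```python
# Sketch: mapping a calibrated odds ratio of pathogenicity to an ACMG evidence level.
def pp3_evidence_strength(odds_of_pathogenicity: float) -> str:
    """Map a calibrated odds ratio to a computational-evidence strength (PP3)."""
    if odds_of_pathogenicity > 18.7:
        return "PP3_Strong"
    if odds_of_pathogenicity > 4.33:
        return "PP3_Moderate"
    if odds_of_pathogenicity > 2.08:
        return "PP3_Supporting"
    return "no computational evidence for pathogenicity"

for odds in (1.5, 3.0, 10.0, 25.0):
    print(f"odds {odds:>5} -> {pp3_evidence_strength(odds)}")
```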

18.5.4 The Challenge of Uncertain Significance

The variant of uncertain significance (VUS) category deserves particular attention. Variants with intermediate foundation model scores genuinely reflect uncertainty: the models cannot confidently distinguish pathogenic from benign. This uncertainty may arise from limited training data for the gene or variant class, conflicting signals in the sequence context, or genuine biological ambiguity where the variant’s effect depends on factors the model cannot observe.

Forcing these variants into discrete categories by applying arbitrary cutoffs misrepresents the actual evidence. A variant scored at 0.55 is not “slightly pathogenic”; it is a variant for which the model has insufficient evidence to discriminate. Reporting calibrated probabilities alongside discrete classifications preserves information for downstream decision-making. Clinicians can then integrate computational evidence with functional studies, segregation data, and clinical presentation, appropriately weighting the computational contribution based on its expressed uncertainty.

The broader framework for understanding and quantifying uncertainty in foundation model predictions, including methods for distinguishing uncertainty arising from limited data (epistemic uncertainty) from uncertainty inherent in the prediction task (aleatoric uncertainty), is developed in Chapter 24. Conformal prediction methods that provide finite-sample coverage guarantees for variant classification are examined in Section 24.6.

18.6 Uncertainty Quantification

Calibration addresses systematic bias in probability estimates; uncertainty quantification addresses the confidence of individual predictions. A well-calibrated model might correctly estimate that 70% of variants in some category are pathogenic, but for any individual variant, we want to know whether the model’s prediction is reliable or whether the variant falls outside the model’s competence.

18.6.1 Sources of Uncertainty

Epistemic uncertainty reflects gaps in the model’s knowledge: regions of input space with sparse training data, variant types rarely observed during training, or proteins from understudied families. This uncertainty is reducible in principle by collecting more data and can be estimated by measuring model disagreement across training variations.

Aleatoric uncertainty reflects inherent noise in the prediction target: variants whose pathogenicity genuinely varies across individuals or contexts, or cases where the same score corresponds to both pathogenic and benign variants for biological rather than modeling reasons. This uncertainty is irreducible by additional training and represents fundamental limits on predictability.

Distinguishing these uncertainty types matters for interpretation. High epistemic uncertainty suggests caution: the model has not seen similar variants and may be extrapolating unreliably. High aleatoric uncertainty suggests that the variant’s effect genuinely depends on factors not captured by sequence alone.

Knowledge Check: Epistemic vs. Aleatoric Uncertainty

Consider two variants with identical AlphaMissense scores of 0.65:

Variant A: A missense change in BRCA1, a gene with thousands of characterized variants in ClinVar and extensive deep mutational scanning data.

Variant B: A missense change in a recently discovered orphan gene with no characterized variants and only 8 homologous sequences in databases.

  1. Which variant likely has higher epistemic uncertainty? Why?
  2. For which variant might the intermediate score more likely reflect aleatoric uncertainty (genuine biological ambiguity)?
  3. How would you communicate the different confidence levels to a clinician?

Variant B has higher epistemic uncertainty because the model has limited training data for orphan genes with sparse evolutionary sampling. For Variant A in BRCA1, the intermediate score more likely reflects aleatoric uncertainty (genuine biological ambiguity about pathogenicity) given extensive characterization data. Communication to clinicians should distinguish these: “For BRCA1, the intermediate score reflects uncertain biological effect; for the orphan gene, the score itself is uncertain due to limited data and should be interpreted cautiously.”

18.6.2 Uncertainty Estimation Methods

Ensemble methods train multiple models on different data subsets or with different random initializations. The logic resembles asking several doctors for independent opinions on a diagnosis: if they all reach the same conclusion, you can be more confident in the answer; if they disagree substantially, that disagreement is itself informative, indicating the case is genuinely ambiguous. In the same way, prediction variance across ensemble members estimates epistemic uncertainty: large disagreement indicates that the prediction depends strongly on training specifics rather than on robust learned patterns. Deep ensembles provide well-calibrated uncertainty estimates but multiply computational cost linearly with ensemble size.

Why does ensemble disagreement estimate uncertainty? The intuition is that if multiple models trained on slightly different data or with different random seeds all converge to similar predictions, the prediction is robust to training details and likely reflects genuine patterns in the data. Conversely, if predictions diverge substantially, each model has learned something idiosyncratic that does not generalize. For variant effect prediction, a variant where five ensemble members predict pathogenicity between 0.75 and 0.82 warrants more confidence than one where predictions range from 0.35 to 0.90. The ensemble is effectively asking: “Would I reach the same conclusion if I had seen slightly different training examples?” When the answer is no, that uncertainty should propagate to clinical interpretation.
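As a concrete illustration, the minimal sketch below summarizes agreement across a hypothetical five-member ensemble using the two score patterns described above; the specific values and the mean/standard-deviation summary are illustrative, not a prescribed scoring protocol.

```python
import numpy as np

def ensemble_summary(scores):
    """Summarize agreement across ensemble members for one variant.

    scores: per-member pathogenicity predictions in [0, 1].
    Returns the ensemble mean (the point prediction) and the sample
    standard deviation across members, an empirical estimate of
    epistemic uncertainty: a large spread means the prediction depends
    on training specifics rather than on robust learned patterns.
    """
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std(ddof=1)

# Hypothetical five-member ensembles for the two variants described in
# the text: tight agreement (0.75-0.82) versus wide disagreement.
confident = [0.75, 0.78, 0.80, 0.81, 0.82]
uncertain = [0.35, 0.52, 0.66, 0.78, 0.90]

for label, member_scores in [("tight agreement", confident),
                             ("wide disagreement", uncertain)]:
    mean, std = ensemble_summary(member_scores)
    print(f"{label}: mean={mean:.2f}, std={std:.2f}")
```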

Monte Carlo dropout approximates Bayesian inference by applying dropout at test time and averaging predictions across multiple stochastic forward passes. Variance across passes estimates uncertainty without training multiple models. This approach adds modest computational overhead and can be applied to any dropout-containing architecture.
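A minimal PyTorch sketch of the same idea follows, using a hypothetical toy scoring head rather than an actual foundation model: dropout is left active at inference time, and the spread across stochastic forward passes serves as the uncertainty estimate.

```python
import torch
import torch.nn as nn

# Hypothetical toy scorer with dropout; in practice this would be a
# fine-tuned prediction head on top of a foundation model embedding.
model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 1), nn.Sigmoid(),
)

def mc_dropout_predict(model, x, n_passes=50):
    """Monte Carlo dropout: keep dropout stochastic at test time and
    average over repeated forward passes; the standard deviation across
    passes is a cheap estimate of predictive uncertainty."""
    model.train()          # keeps dropout layers active during inference
    with torch.no_grad():  # no gradients needed for prediction
        preds = torch.stack([model(x) for _ in range(n_passes)])
    return preds.mean(dim=0), preds.std(dim=0)

x = torch.randn(1, 128)  # placeholder variant embedding
mean, std = mc_dropout_predict(model, x)
print(f"score = {mean.item():.2f} +/- {std.item():.2f}")
```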

Conformal prediction provides distribution-free uncertainty quantification with coverage guarantees (Angelopoulos and Bates 2023). Given a calibration set (held-out labeled examples used to determine score thresholds), conformal methods construct prediction sets guaranteed to contain the true label with specified probability (e.g., 90%). For variant classification, this might produce sets like {pathogenic, uncertain} or {benign} depending on the variant and desired coverage. Larger prediction sets indicate greater uncertainty; single-element sets indicate confident predictions. Section 24.6 examines conformal methods for genomic applications in detail.
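The sketch below illustrates split conformal prediction for a binary pathogenic/benign classifier under simplified assumptions. The calibration data are synthetic, the nonconformity score is one minus the probability assigned to the true label, and the fallback to the most likely label (when the raw set would be empty) is a common practical adjustment rather than part of the core guarantee.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal calibration for a binary pathogenic/benign score.

    cal_probs: predicted probability of the pathogenic class on held-out
    calibration variants; cal_labels: 1 = pathogenic, 0 = benign.
    Returns the nonconformity threshold q_hat targeting ~(1 - alpha) coverage.
    """
    p_true = np.where(cal_labels == 1, cal_probs, 1.0 - cal_probs)
    scores = 1.0 - p_true                      # nonconformity scores
    n = len(scores)
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, q_level, method="higher")

def prediction_set(prob_pathogenic, q_hat):
    """Include every label whose predicted probability is at least 1 - q_hat."""
    labels = []
    if prob_pathogenic >= 1.0 - q_hat:
        labels.append("pathogenic")
    if (1.0 - prob_pathogenic) >= 1.0 - q_hat:
        labels.append("benign")
    # Fall back to the most likely label so the set is never empty.
    return labels or (["pathogenic"] if prob_pathogenic >= 0.5 else ["benign"])

# Synthetic calibration set and a new, ambiguous variant score.
rng = np.random.default_rng(0)
cal_labels = rng.integers(0, 2, size=500)
cal_probs = np.clip(cal_labels * 0.6 + rng.normal(0.2, 0.25, size=500), 0, 1)
q_hat = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
print(prediction_set(0.55, q_hat))
```

Larger prediction sets signal greater uncertainty; a single-element set indicates a confident call at the chosen coverage level.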

18.6.3 Out-of-Distribution Detection

Beyond quantifying uncertainty for in-distribution predictions, responsible deployment requires detecting when inputs fall outside the model’s training distribution. A protein language model trained on natural proteins may produce confident but unreliable predictions for synthetic sequences or fragments. A regulatory model trained on common cell types may fail on rare developmental stages.

Likelihood-based detection uses the model’s own representations to identify unfamiliar inputs. Sequences with low embedding density (few similar sequences nearby in the model’s learned representation space, indicating the input is unusual compared to training data) or anomalous attention patterns may fall outside the training distribution regardless of predicted scores. Flagging these inputs for manual review prevents automated classification of cases the model cannot reliably assess.

Distance-based methods compare new inputs to training examples in representation space. Variants far from any training example in embedding space warrant skepticism even if the model produces confident predictions. Maintaining summary statistics of training representations enables efficient distance computation at deployment.
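A minimal sketch of this idea under simplifying assumptions: summarize training embeddings with a mean and a ridge-regularized covariance, score new embeddings by Mahalanobis distance, and flag anything beyond a high percentile of the training distances. The Gaussian summary, the placeholder embeddings, and the 99th-percentile cutoff are illustrative choices, not a validated detector.

```python
import numpy as np

class EmbeddingOODDetector:
    """Flag variants whose embeddings lie far from the training distribution."""

    def fit(self, train_embeddings, percentile=99.0):
        self.mean = train_embeddings.mean(axis=0)
        cov = np.cov(train_embeddings, rowvar=False)
        # A small ridge keeps the covariance invertible for high-dimensional embeddings.
        self.prec = np.linalg.inv(cov + 1e-3 * np.eye(cov.shape[0]))
        train_dists = np.array([self._distance(e) for e in train_embeddings])
        self.threshold = np.percentile(train_dists, percentile)
        return self

    def _distance(self, embedding):
        diff = embedding - self.mean
        return float(np.sqrt(diff @ self.prec @ diff))

    def is_out_of_distribution(self, embedding):
        """True if the variant should be routed to manual review."""
        return self._distance(embedding) > self.threshold

# Hypothetical usage with random placeholder embeddings standing in for
# summary statistics of training-set representations.
rng = np.random.default_rng(1)
train = rng.normal(size=(1000, 32))
detector = EmbeddingOODDetector().fit(train)
print(detector.is_out_of_distribution(rng.normal(size=32)))  # typical input
print(detector.is_out_of_distribution(np.full(32, 8.0)))     # far from training data
```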

Figure 18.5: Uncertainty quantification for variant effect prediction. (A) Aleatoric uncertainty (inherent data ambiguity) reflects context-dependent effects that cannot be reduced. Epistemic uncertainty (model limitations) can be reduced with more training data. (B) Ensemble disagreement provides empirical uncertainty estimates; variance across model runs indicates prediction reliability. (C) Distance from training distribution flags out-of-distribution variants requiring caution. (D) Selective prediction uses uncertainty to abstain on difficult cases, achieving higher accuracy by routing uncertain variants to expert review rather than providing unreliable predictions.

Chapter 24 develops uncertainty quantification methods in detail, including practical implementation guidance and evaluation metrics. For VEP applications, the key insight is that uncertainty estimates complement point predictions: high-confidence predictions can inform clinical decisions; low-confidence predictions should prompt additional evidence gathering rather than blind acceptance of model outputs.
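A schematic of this triage logic is sketched below; the thresholds and the PP3/BP4 phrasing are placeholders to show the abstention pattern, not validated clinical cutoffs.

```python
def triage(score, uncertainty, score_threshold=0.9, max_uncertainty=0.05):
    """Selective prediction: report a computational call only when both the
    score and its uncertainty estimate support it; otherwise abstain and
    route the variant to expert review and additional evidence gathering.
    All thresholds here are illustrative."""
    if uncertainty > max_uncertainty:
        return "abstain: prediction unreliable, gather additional evidence"
    if score >= score_threshold:
        return "computational evidence supports pathogenicity (e.g., PP3)"
    if score <= 1.0 - score_threshold:
        return "computational evidence supports benign impact (e.g., BP4)"
    return "abstain: score in indeterminate range"

print(triage(score=0.95, uncertainty=0.02))  # confident, extreme score
print(triage(score=0.62, uncertainty=0.18))  # uncertain -> abstain
```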

18.7 What Foundation Models Add

Having surveyed current foundation model approaches, we can now directly address what they contribute beyond classical methods (Chapter 4). The answer is nuanced: substantial improvements in some domains, modest gains in others, and persistent blind spots that new architectures have not yet resolved.

18.7.1 Improved Discrimination

On standard benchmarks, foundation model VEP methods consistently outperform classical predictors. AlphaMissense achieves auROC of 0.94 on held-out ClinVar missense variants (Cheng et al. 2023). SpliceAI achieves AUC of 0.97-0.99 for splice site prediction, substantially exceeding classical methods like MaxEntScan (Jaganathan et al. 2019). GPN-MSA scores correlate more strongly with deep mutational scanning measurements than conservation-based methods (Benegas et al. 2024).

These improvements reflect richer representations. Classical methods aggregate independent features (conservation, amino acid properties, domain annotations); foundation models learn nonlinear interactions among positions and capture patterns too subtle for manual feature engineering. The gap is largest for variants where context matters: buried core missense variants where structural environment determines impact, splice variants where cryptic site activation depends on flanking sequence, regulatory variants where motif disruption interacts with chromatin context.

18.7.2 Extended Coverage

Classical methods often fail silently on understudied genes, rare variant classes, or poorly annotated regions. SIFT and PolyPhen require protein alignments; variants in singleton genes without homologs receive no prediction. CADD depends on annotation features; variants in regions lacking regulatory marks receive uninformative scores.

Foundation models degrade more gracefully. Protein language models score any amino acid sequence regardless of available homologs. DNA language models score any genomic position regardless of existing annotation. This extended coverage matters for clinical sequencing of rare diseases, where pathogenic variants often reside in less-studied genes precisely because their severe effects are incompatible with appreciable population frequency. Cross-species transfer learning extends this coverage further: Jagota et al. (2023) demonstrate that models trained on deep mutational scanning data from model organisms can generalize to human variants, which is particularly valuable for rare disease variants in understudied genes where human-only data remain scarce.

18.7.3 Mechanistic Interpretability

AlphaGenome and similar multi-output models provide predictions about mechanism rather than bare pathogenicity scores. A variant flagged as deleterious might also show predicted effects on chromatin accessibility, contact frequency, and downstream gene expression. These mechanistic predictions enable hypothesis generation and targeted experimental validation (Chapter 25).

Classical methods offer limited mechanistic insight. CADD provides a single score without indicating whether it derives from conservation, protein impact, regulatory disruption, or other features. Decomposing the score into component contributions requires separate analysis. Foundation models that predict molecular phenotypes naturally provide this decomposition.

18.7.4 Persistent Limitations

Foundation models have not solved several fundamental challenges. Ancestry bias persists because training data remain skewed toward European populations; performance degrades for variants common in African or Asian populations but rare in training sets. The systematic analysis of ancestry-related confounding appears in Section 13.2.1, with broader confounding detection methods in Section 13.8. Calibration requires substantial labeled data that inherit existing biases. Rare variant classes (structural variants, complex indels, repeat expansions) lack sufficient training examples for reliable prediction.

The comparison to classical methods reveals diminishing returns on certain axes. For well-conserved active site variants in thoroughly studied proteins, PolyPhen-2 already achieves near-optimal performance; AlphaMissense improves marginally. The largest foundation model gains appear for difficult cases where classical features are uninformative or misleading.

Key Insight: Where Foundation Models Help Most

Foundation models provide the largest improvements over classical methods in precisely the situations where clinical need is greatest:

  1. Novel genes: Orphan proteins lacking homologs where alignment-based methods fail
  2. Rare variants: Never-before-seen substitutions where population frequency provides no information
  3. Noncoding regions: Regulatory variants where classical annotation is sparse
  4. Context-dependent effects: Buried core variants, cryptic splice sites, motif-chromatin interactions

For common variants in well-studied genes with extensive clinical annotation, the marginal improvement may be modest. But these are not the variants causing diagnostic uncertainty. The value of foundation models lies in extending reliable prediction to the long tail of rare variants in less-characterized genes, exactly where clinical interpretation struggles most.

Figure 18.6: Foundation model VEP: gains and remaining gaps. (A) Performance varies by protein family; well-characterized families achieve high accuracy while poorly-studied families show degraded performance. (B) Rare variants are harder to predict, particularly for recently evolved or population-specific variants. (C) Clinical impact assessment: FM predictions could reclassify significant fractions of VUS, but require appropriate validation. (D) Remaining gaps include complex variants (multi-allelic, structural), modifier effects, and tissue-specific pathogenicity. Foundation models have advanced VEP substantially but do not eliminate the need for functional validation.

18.7.5 Limitations of PLM-Based VEP

Despite impressive benchmark performance, protein language model approaches to variant effect prediction face limitations that practitioners must understand. Karollus, Mauermeier, and Gagneur (2023) systematically evaluated PLM-VEP methods and identified several challenges:

  • Limited sensitivity to rare functional variation: Because PLMs learn from evolutionary data, they may underestimate pathogenicity of de novo variants, recent adaptive mutations, or population-specific functional variants not well-represented in training databases
  • Limited epistatic modeling: While attention mechanisms capture some residue dependencies, PLMs cannot directly score compound heterozygous or digenic interactions (cases where two variants together cause disease) that may be clinically relevant
  • Calibration challenges: Scores reflect evolutionary constraint, not pathogenicity probability. Translating constraint scores to clinical interpretation requires calibration on labeled pathogenic variants

Swanson, Chang, and Zou (2022) provide a complementary analysis showing that PLMs have reduced predictive accuracy on proteins with limited evolutionary coverage or non-canonical functions, where training data are sparse.

18.8 Clinical Integration Considerations

Foundation model VEP tools require thoughtful integration into clinical workflows. Their benchmark performance does not automatically translate without attention to deployment context, validation requirements, and human factors.

18.8.1 Laboratory Validation

Before clinical use, laboratories should validate foundation model tools against local truth sets representing their patient population. Published benchmark performance on ClinVar may not generalize to a laboratory’s specific case mix. Validation should assess discrimination (can the tool distinguish pathogenic from benign?), calibration (do probability estimates match observed frequencies?), and utility (does incorporating the tool improve variant classification compared to existing workflows?).

Validation requires variants with known pathogenicity independent of the computational predictions being tested. Using ClinVar variants whose classifications already incorporated CADD scores to validate CADD creates circular reasoning, a form of label circularity examined in Section 13.5. Dey et al. (2020) systematically characterize three evaluation pitfalls for VEP methods: (1) label circularity from overlapping annotation sources, (2) class imbalance with pathogenic variants vastly outnumbered by benign, and (3) benchmark contamination through shared training data ancestry. Awareness of these biases is essential for proper method comparison. Gold-standard variants from functional studies, segregation data, or expert review provide cleaner validation targets, with detailed evaluation methodology in Chapter 12.
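A minimal sketch of such an assessment, assuming a locally curated truth set with labels established independently of the tool under evaluation: it reports auROC for discrimination and bin-wise gaps between predicted and observed pathogenic fractions as a crude calibration summary (a reliability diagram plots the same quantities).

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import roc_auc_score

def validate_against_truth_set(scores, labels, n_bins=10):
    """Local-validation summary for a candidate VEP tool.

    scores: tool outputs on the laboratory's truth-set variants.
    labels: 1 = pathogenic, 0 = benign, curated independently of the
    computational predictions being tested (to avoid circularity).
    """
    auroc = roc_auc_score(labels, scores)
    observed, predicted = calibration_curve(labels, scores, n_bins=n_bins)
    gaps = np.abs(observed - predicted)  # per-bin calibration error
    return {
        "auROC": round(float(auroc), 3),
        "max_calibration_gap": round(float(gaps.max()), 3),
        "mean_calibration_gap": round(float(gaps.mean()), 3),
    }

# Synthetic stand-in for a locally curated truth set.
rng = np.random.default_rng(7)
labels = rng.integers(0, 2, size=400)
scores = np.clip(labels * 0.6 + rng.normal(0.2, 0.2, size=400), 0, 1)
print(validate_against_truth_set(scores, labels))
```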

Practical Guidance: Laboratory Validation Checklist

Before deploying a foundation model VEP tool clinically, ensure you have:

  • A local truth set representative of your patient population, with classifications established independently of the computational predictions being tested
  • An assessment of discrimination: can the tool distinguish pathogenic from benign variants in your case mix?
  • An assessment of calibration: do the tool's probability estimates match observed frequencies in your truth set?
  • Evidence of utility: does incorporating the tool improve variant classification compared to your existing workflow?
  • Documented score ranges and thresholds within which the tool has been validated, along with guidance for scores that fall outside them
  • Where feasible, ancestry-stratified performance estimates for the populations your laboratory serves

This validation should be repeated when model versions change or when significant shifts in your patient population occur.

18.8.2 Workflow Integration

Foundation model predictions represent one evidence type among many. ACMG guidelines specify how computational evidence combines with population frequency, functional data, segregation, and clinical phenotype. Computational evidence alone rarely suffices for pathogenic or benign classification; it supports or weakens classifications established by other evidence types.

Laboratory information systems require modification to display and store foundation model outputs alongside existing annotations. Analyst training ensures appropriate interpretation: understanding that high scores indicate deleteriousness without establishing causation, recognizing when scores fall outside validated ranges, and knowing when to request additional evidence for uncertain cases.

18.8.3 Communication to Clinicians

Variant reports communicated to ordering clinicians should present foundation model evidence appropriately. Reporting raw scores without context confuses non-specialist clinicians. Reporting discrete classifications without uncertainty may convey false confidence. Effective reporting might state: “Computational tools (AlphaMissense, SpliceAI) concordantly predict this variant is likely to affect protein function, supporting the PP3 criterion for pathogenicity classification.”

When foundation model predictions conflict with other evidence, reports should acknowledge the discrepancy rather than suppressing inconvenient results. A variant segregating with disease in a family but receiving a benign computational prediction warrants explicit discussion, not quiet exclusion of the computational evidence.

18.9 Open Challenges

Current foundation model approaches leave substantial problems unsolved. These open challenges define directions for future research and areas where clinical caution remains warranted.

18.9.1 Complex Variant Types

Most current models address single nucleotide variants and small indels. Structural variants (deletions, duplications, inversions spanning kilobases to megabases) remain largely outside foundation model capabilities. Copy number variation, repeat expansions, and complex rearrangements alter genome architecture in ways current sequence models cannot represent. Extending foundation model paradigms to these variant classes requires architectural innovations beyond current approaches.

18.9.2 Long-Read Sequencing and Variant Effect Prediction

The emergence of long-read sequencing technologies fundamentally changes the landscape of variant detection and interpretation. Pacific Biosciences (PacBio) high-fidelity (HiFi) reads and Oxford Nanopore Technologies (ONT) ultra-long reads routinely span tens of kilobases, far exceeding the 150-300 base pair fragments of short-read platforms (Logsdon, Vollger, and Eichler 2020). This extended read length enables detection of variant classes invisible to short-read analysis while creating both opportunities and challenges for foundation model-based interpretation.

Structural variant detection benefits most dramatically from long-read sequencing. Short reads struggle to resolve breakpoints of large deletions, duplications, and inversions, often missing structural variants entirely or mischaracterizing their boundaries. Long reads spanning structural variant breakpoints enable precise localization and accurate genotyping. Tools like pbsv, Sniffles2, and SVIM detect structural variants with sensitivity and specificity far exceeding short-read methods (Smolka et al. 2024). The resulting catalogs reveal that each human genome harbors thousands of structural variants affecting millions of base pairs, a substantial source of genetic variation previously obscured by technological limitations.

Interpreting these structural variants presents challenges that current foundation models do not address. A deletion removing an entire exon has qualitatively different consequences than a single nucleotide substitution, yet protein language models score point mutations without mechanisms for evaluating larger-scale changes. Regulatory models like Enformer operate on fixed-length sequence windows and cannot naturally represent the genomic rearrangements that structural variants introduce. Extending foundation model approaches to structural variant interpretation requires new architectures that explicitly model genome organization rather than treating sequence as a linear string.

Phasing and haplotype-aware analysis represent a second area where long-read data transforms variant interpretation. Human genomes are diploid, with variants distributed across two homologous chromosomes. The functional consequences of multiple variants depend critically on whether they reside on the same haplotype (cis) or opposite haplotypes (trans). Two loss-of-function variants in cis leave one functional copy, while the same variants in trans may completely abolish gene function (Section 1.4; Section 29.3.2).

Short-read sequencing typically produces unphased genotypes: variants are detected without determining their chromosomal assignment. Statistical phasing using population reference panels infers likely haplotypes but introduces errors, particularly for rare variants where population frequencies provide limited information. Long reads spanning multiple variant sites provide direct physical phasing, resolving haplotype structure from the sequencing data itself. With HiFi reads averaging 15-20 kilobases, most nearby variants can be directly phased, while ultra-long ONT reads exceeding 100 kilobases can phase variants separated by substantial genomic distances.

Foundation models have not yet incorporated haplotype information systematically. Protein language models score individual variants without considering whether they occur together on the same protein copy. DNA language models process single reference sequences rather than diploid genotypes with phased variants. Developing haplotype-aware variant effect prediction remains an open challenge: models must learn how variant combinations interact, distinguishing compensatory mutations that restore function from compound effects that amplify disruption.

Figure 18.7: Long-read sequencing expands the scope of variant effect prediction. Short-read sequencing (top) detects primarily SNVs and small indels for which current FM-VEP approaches are well-suited. Long-read sequencing (bottom) reveals structural variants, repeat expansions, and complex haplotypes that current models cannot process. Realizing the clinical potential of long-read sequencing requires developing FM-VEP approaches for these new variant classes, a major open challenge.

Long-read-specific variant calling increasingly incorporates deep learning approaches. DeepVariant and Clair3 use convolutional and recurrent architectures to call variants from pileup images, with versions trained specifically on long-read data achieving accuracy that approaches or exceeds short-read calling for SNVs and small indels (Zheng et al. 2022). PEPPER-Margin-DeepVariant pipelines combine multiple neural network components to handle the higher error rates and distinct error profiles of nanopore sequencing. These tools demonstrate that deep learning can extract accurate variant calls from long-read data despite its different characteristics.

The error profiles of long-read platforms create both challenges and opportunities for foundation model training. ONT sequencing exhibits systematic errors in homopolymer regions and certain sequence contexts that differ from short-read error patterns. PacBio HiFi reads achieve per-read accuracy exceeding 99.9% through circular consensus sequencing, approaching short-read quality while retaining long-read advantages. Foundation models trained on long-read data must learn these error profiles to distinguish true variants from sequencing artifacts.

Training foundation models on long-read data remains largely unexplored. Current DNA language models train on reference genomes and short-read assemblies, rarely incorporating the raw signal or base-called sequences from long-read platforms. Models trained directly on long-read data might learn different patterns: the extended context could enable modeling of longer-range dependencies, while exposure to structural variants during training could improve representation of genome architecture.

Several technical challenges complicate long-read foundation model development. Training data volumes are smaller: long-read datasets remain orders of magnitude smaller than short-read repositories. The computational cost of processing longer sequences scales unfavorably for attention-based architectures, though state-space models like those underlying Evo 2 (Section 7.7.2) partially address this limitation. Representing structural variants requires architectural innovations beyond sequence modeling, potentially incorporating graph representations or hierarchical encodings of genome structure.

The integration of long-read variant detection with foundation model interpretation represents a frontier for the field. As long-read sequencing costs decline and data volumes grow, training foundation models that leverage the unique advantages of long reads (extended context, structural variant representation, direct phasing) becomes increasingly feasible. Models that combine the pattern-recognition capabilities of foundation models with the variant classes revealed by long-read sequencing could substantially expand the scope of computational variant interpretation.

18.9.3 Combinatorial Effects

Genomes contain multiple variants that may interact. Compound heterozygosity (two variants affecting both copies of a gene) creates pathogenic states from individually tolerable variants, a clinical scenario examined in Section 1.4.1 and Section 29.3.2. Modifier variants in other genes modulate penetrance. Haplotype effects mean variants on the same chromosome have different consequences than variants on opposite chromosomes, with phasing methods to distinguish these scenarios detailed in Section 1.4. Current models score variants independently, ignoring these interactions that determine clinical presentation.

18.9.4 Phenotype Specificity

A variant pathogenic for one phenotype may be benign for another. SCN5A variants, for example, cause distinct cardiac arrhythmia syndromes (Long QT type 3, Brugada syndrome, conduction disease) depending on whether mutations produce gain-of-function or loss-of-function effects on sodium channel activity. Foundation models trained on pathogenic/benign labels average across phenotypes, potentially obscuring clinically relevant specificity. Phenotype-specific training requires much larger datasets than currently available.

18.9.5 Temporal and Environmental Context

Variant effects often depend on age, environmental exposures, or physiological state. A variant pathogenic under metabolic stress may be tolerable at baseline. Foundation models capture sequence context but not the dynamic biological context determining phenotypic expression. Integrating longitudinal clinical data with sequence-level predictions remains an unsolved challenge.

18.9.6 Equity and Access

State-of-the-art foundation models require substantial computational resources for training and sometimes for inference. Laboratories in resource-limited settings may lack access to cutting-edge tools, creating a two-tiered system where well-funded institutions deploy sophisticated variant interpretation while others rely on simpler methods. Precomputed scores (like AlphaMissense’s proteome-wide release) partially address computational barriers, but equity concerns extend far beyond compute access.

Training data composition determines which patients foundation models serve well. ClinVar contains many more pathogenic variant classifications for European-ancestry individuals than for other populations (Landrum et al. 2018). Protein language models trained predominantly on sequences from well-studied organisms may capture evolutionary constraints less accurately for proteins divergent from training distributions. The consequence is systematic: variant interpretation performs best for patients who already benefit most from biomedical research, and worst for those historically excluded. A diagnostic laboratory serving a cosmopolitan population will encounter variants where foundation model predictions are less reliable precisely because those variants come from underrepresented ancestries.

Validation cohorts exhibit similar biases. When foundation models are evaluated on ClinVar or gnomAD-derived benchmarks, performance metrics reflect accuracy for the populations overrepresented in those resources. A model achieving 0.95 auROC on standard benchmarks may achieve substantially lower discrimination for African-ancestry variants simply because the benchmark itself undersamples that population. Equitable deployment requires ancestry-stratified evaluation that explicitly reports performance gaps, not aggregate metrics that obscure disparities (Chapter 12). The broader implications of these biases, and governance frameworks for addressing them, receive comprehensive treatment in Chapter 27.
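A small sketch of ancestry-stratified evaluation, assuming a variant table with score, label, and ancestry columns (the column names and data are hypothetical); the point is to report per-group discrimination rather than a single aggregate number.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def stratified_auroc(df, score_col="score", label_col="label", group_col="ancestry"):
    """Report discrimination separately for each ancestry group so that
    performance gaps are visible rather than averaged away."""
    rows = []
    for group, sub in df.groupby(group_col):
        if sub[label_col].nunique() < 2:
            continue  # auROC is undefined without both classes present
        rows.append({
            group_col: group,
            "n_variants": len(sub),
            "auROC": round(roc_auc_score(sub[label_col], sub[score_col]), 3),
        })
    return pd.DataFrame(rows)

# Hypothetical benchmark table with an ancestry annotation per variant.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "ancestry": rng.choice(["AFR", "EAS", "EUR", "SAS"], size=800),
    "label": rng.integers(0, 2, size=800),
})
df["score"] = np.clip(df["label"] * 0.55 + rng.normal(0.25, 0.25, size=800), 0, 1)
print(stratified_auroc(df))
```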

Stop and Think: Equity in Model Development

Consider a clinical genetics laboratory that serves a diverse patient population where 40% of patients have non-European ancestry. The laboratory is evaluating AlphaMissense for clinical use.

  1. What specific validation analyses should the laboratory conduct before deployment?
  2. If the model performs well overall (auROC = 0.92) but substantially worse for African-ancestry variants (auROC = 0.81), what are the ethical implications of deploying it?
  3. How might the laboratory communicate differential performance to ordering clinicians?

There are no simple answers to these questions, but responsible deployment requires explicitly grappling with them.

18.10 Tools for Interpretation, Not Oracles

Foundation models have transformed variant effect prediction from feature engineering to representation learning. Protein language models capture evolutionary constraint at resolution that multiple sequence alignments cannot match. DNA language models and regulatory models extend coverage to noncoding variants across the genome. Multi-omic architectures provide mechanistic predictions enabling hypothesis generation beyond bare deleteriousness scores. The best current methods substantially outperform classical approaches on established benchmarks, particularly for rare variants and novel genes where training data are sparse.

Yet benchmark performance does not automatically translate to clinical utility. Calibration requires careful attention: a model may discriminate pathogenic from benign variants while systematically overestimating or underestimating probabilities. Uncertainty quantification remains immature; models often produce confident predictions for inputs that fall outside their training distribution. Population bias persists despite foundation model advances; improvements over classical methods are smallest for ancestry groups underrepresented in training data. Complex variant types, combinatorial effects, and tissue-specific consequences remain beyond current capabilities.

Clinical deployment demands humility alongside enthusiasm. Foundation model VEP tools are aids to human interpretation, not autonomous classifiers. Their predictions inform rather than determine variant classification, complementing population frequency data, functional assay evidence, segregation analysis, and clinical judgment. Used appropriately, they accelerate diagnosis and reduce missed findings. Used as oracles, they create false confidence and may perpetuate existing inequities in genomic medicine. Clinical workflows (Chapter 29, Chapter 28) integrate these predictions alongside uncertainty quantification (Chapter 24) and interpretability methods that probe what foundation models have learned (Chapter 25). Variant effect prediction sits at the center of genomic medicine; foundation models have raised its ceiling while the work of achieving its potential continues.

Chapter Summary
Test Yourself

Before reviewing the summary, test your recall:

  1. Explain the paradigm shift from classical feature-based VEP to foundation model-based VEP. How do foundation models perform zero-shot variant scoring without pathogenicity labels?

  2. Compare protein-based VEP approaches (ESM-1v, EVE, AlphaMissense) with DNA-based approaches (SpliceAI, Enformer, GPN-MSA). For each approach, identify what variant types it handles best and what biological information it leverages.

  3. What is model calibration and why does it matter for clinical variant interpretation? How do reliability diagrams help assess calibration quality?

  4. Describe three strategies for combining evidence across multiple foundation models. What is the double-counting problem and how can it be avoided when applying ACMG criteria?

  5. Identify three persistent challenges or limitations in foundation model-based VEP that require human judgment. For each, explain why current models struggle and what types of variants are most affected.

  1. Paradigm shift from classical to foundation model VEP: Classical methods manually engineer features (conservation scores, amino acid properties, domain annotations) and train classifiers on labeled pathogenic/benign variants. Foundation models invert this: they learn representations from unlabeled sequences during pretraining (masked token prediction), discovering evolutionary constraints implicitly. Zero-shot scoring compares the likelihood of reference versus variant alleles given learned sequence context; variants that are evolutionarily unexpected (low likelihood) tend to be deleterious. This approach requires no pathogenicity labels because evolution has already conducted billions of experiments testing which variants are compatible with life.

  2. Protein vs. DNA approaches comparison: Protein approaches (ESM-1v, EVE, AlphaMissense) operate on amino acid sequences and excel at missense variants. ESM-1v uses zero-shot likelihood ratios from protein language models. EVE fits variational autoencoders to MSAs, capturing co-evolution. AlphaMissense combines sequence with AlphaFold2 structural features for state-of-the-art missense prediction. DNA approaches (SpliceAI, Enformer, GPN-MSA, Evo 2) work on nucleotide sequences and cover all variant types. SpliceAI specializes in splice variants using dilated convolutions. Enformer predicts regulatory impacts. GPN-MSA and Evo 2 provide DNA language model scores genome-wide, including noncoding regions where protein models cannot operate.

  3. Calibration importance and reliability diagrams: Calibration ensures that predicted probabilities match observed frequencies: a score of 0.8 should mean 80% of variants with that score are truly pathogenic. Without calibration, scores are arbitrary and cannot guide clinical decisions. Reliability diagrams plot predicted probabilities (x-axis) against observed pathogenic proportions (y-axis). Perfect calibration falls on the diagonal. Points below indicate overconfidence (model predicts higher probabilities than reality); points above indicate underconfidence. Neural networks typically overpredict extreme probabilities and require post-hoc calibration (temperature scaling, isotonic regression) before clinical use.

  4. Combining evidence strategies and double-counting: Three strategies:

    1. Modular routing: apply different models to different variant types (AlphaMissense for missense, SpliceAI for splice sites).

    2. Score aggregation: combine predictions from multiple models using ensemble methods or Bayesian integration with careful weighting.

    3. Tiered filtering: use fast models for initial screening, expensive models for prioritized variants.

    The double-counting problem arises when multiple predictors share information sources (e.g., AlphaMissense and ESM-1v both encode evolutionary constraint). Treating correlated predictions as independent evidence overweights shared signals. Solution: assess correlations, group predictors by information source, select one representative from each group for ACMG evidence, and document tool correlations in laboratory validation records.

  5. Three persistent challenges:

    1. Structural variants (deletions, duplications, inversions): current sequence models cannot represent architectural rearrangements; they are trained on SNVs and small indels.

    2. Combinatorial/compound effects: models score variants independently, missing compound heterozygosity where two individually tolerable variants together cause disease, and haplotype effects where phasing determines pathogenicity.

    3. Population bias: training data overrepresent European ancestry; models perform worse for African and Asian populations where variants are undersampled in ClinVar/gnomAD, creating systematic disparities in prediction accuracy for underrepresented groups.

    These limitations require human judgment to integrate additional evidence (segregation, functional assays, ancestry-appropriate interpretation).

Core concepts covered:

  • Paradigm shift: Foundation models learn representations from unlabeled sequences, enabling variant scoring without pathogenicity labels. Evolution has already encoded functional constraints; models learn to detect when variants violate these patterns.

  • Protein-based methods: Zero-shot scoring with protein language models (ESM-1v) compares amino acid likelihoods. Alignment-based models (EVE, popEVE) explicitly capture evolutionary constraints from MSAs. AlphaMissense combines sequence and structural information for state-of-the-art missense prediction.

  • DNA-based methods: SpliceAI predicts splicing changes using dilated convolutions over 10kb context. Enformer and Sei predict regulatory effects. GPN-MSA and Evo 2 provide DNA language model scoring for all variant types.

  • Integration strategies: Different models address different variant types. Combining evidence requires avoiding double-counting from correlated predictors. Practical workflows tier models by computational cost.

  • Calibration: Raw model outputs require calibration before clinical use. Reliability diagrams assess calibration quality. ACMG criteria (PP3/BP4) map continuous scores to evidence levels through calibrated thresholds.

  • Uncertainty: Epistemic uncertainty (model knowledge gaps) differs from aleatoric uncertainty (inherent unpredictability). Ensemble methods, conformal prediction, and out-of-distribution detection provide complementary approaches.

  • Persistent challenges: Structural variants, combinatorial effects, phenotype specificity, and population bias remain unsolved. Foundation models are tools for interpretation, not oracles.

Key connections:

  • Chapter 4 provides the classical VEP baseline that foundation models improve upon
  • Chapter 16 develops the protein language model architectures underlying ESM-1v and AlphaMissense
  • Chapter 17 covers the regulatory models (Enformer, Sei) used for noncoding VEP
  • Chapter 24 presents the full technical treatment of calibration and uncertainty quantification
  • Chapters 28-29 show how VEP tools integrate into clinical workflows

Looking ahead: Having established how foundation models predict variant effects, Chapter 19 extends these concepts to RNA-level predictions, while Chapter 24 provides the complete framework for responsible uncertainty quantification in clinical deployment.