19  Pathogenic Variant Discovery

Clinical genetics ultimately cares about specific variants and genes: which changes in a patient’s genome plausibly explain their phenotype, and which loci are compelling targets for follow-up in the lab. The previous chapters focused on foundation models for variant effect prediction (Chapter 13), multi-omics integration (Chapter 14), and clinical risk prediction (Chapter 18). This chapter shifts the emphasis from prediction to discovery workflows.

The central question is:

Given a huge space of possible variants and genes, how can genomic foundation models (GFMs) help us efficiently home in on those most likely to be causal?

We will treat “pathogenic” broadly—covering both Mendelian variants with large effects and complex trait variants that modulate risk more subtly. GFMs appear at multiple stages of these pipelines: we will walk through their roles from locus-level variant ranking, to Mendelian disease diagnostics, to graph-based gene prioritization, and finally to closed-loop “hypothesis factory” workflows that blend GFMs with systematic perturbation experiments.


19.1 From Variant Effect Prediction to Prioritization

Chapter 13 surveyed state-of-the-art variant effect prediction (VEP) systems. Models such as AlphaMissense, GPN-MSA, Evo 2, and AlphaGenome assign each variant a score reflecting predicted impact on protein function, regulatory activity, or multi-omic phenotypes (Z. Avsec, Latysheva, and Cheng 2025). In isolation, these scores are powerful, but they do not yet constitute a full prioritization pipeline.

In practice, discovery workflows require several additional steps:

  1. Contextualizing the score
    A raw VEP score has different implications depending on:

    • Variant class (missense, splice, promoter, enhancer, UTR, intronic).
    • Gene context (constraint, tissue-specific expression, pathway membership).
    • Clinical or experimental question (dominant Mendelian disease, recessive disease, modifier of complex trait).

    For example, a moderately damaging missense variant in a highly constrained gene expressed in the relevant tissue may be more compelling than a strongly damaging variant in a gene with no supporting biology.

  2. Aggregation from variants to loci and genes
    Discovery problems often operate at the locus or gene level, requiring some aggregation of variant scores. Common strategies include (see the code sketch at the end of this section):

    • Max or top-k pooling – Focus on the worst predicted variant per gene or locus.
    • Burden-style aggregation – Sum or average the predicted impact of all rare variants in a gene, possibly weighted by allele frequency and predicted effect.
    • Mechanism-aware aggregation – Separate coding vs regulatory, or promoter vs distal enhancer contributions, using tissue-specific scores from models like Enformer or AlphaGenome (Z. Avsec, Latysheva, and Cheng 2025).
  3. Combining VEP with orthogonal evidence
    VEP is rarely used alone. Modern pipelines combine:

    • Population data – Allele frequency and constraint (pLI, LOEUF, missense and LoF intolerance).
    • Clinical databases – ClinVar classifications, disease-gene catalogs (OMIM, HGMD).
    • Functional annotations – Chromatin state, conservation (PhyloP, PhastCons), known regulatory elements (Siepel et al. 2005).
    • Pathway and network context – Membership in pathways enriched for the trait, or centrality in relevant biological networks.

    GFMs enter as feature providers in this stack, often replacing or augmenting hand-crafted features.

  4. Calibration and interpretability
    For prioritization, ranking may matter more than perfectly calibrated probabilities, but interpretable risk categories are crucial in clinical and experimental settings. This pushes towards:

    • Score thresholds with empirical positive predictive value (PPV) estimates.
    • Qualitative explanations (e.g., “strong disruption of a conserved splice donor in a haploinsufficient gene”).
    • Visualizations of attention maps, saliency, or motif-level contributions (Chapter 17).
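
For instance, here is a minimal sketch of estimating empirical PPV at candidate score thresholds. The scores and labels are simulated stand-ins for a labeled evaluation set (e.g., ClinVar pathogenic vs. benign classifications):

```python
import numpy as np

# Simulated evaluation set: binary pathogenicity labels and GFM scores.
rng = np.random.default_rng(2)
labels = rng.integers(0, 2, 1000)                        # 1 = pathogenic
scores = np.clip(0.6 * labels + rng.normal(0.2, 0.2, 1000), 0.0, 1.0)

for threshold in (0.5, 0.7, 0.9):
    called = scores >= threshold                         # variants flagged pathogenic
    ppv = labels[called].mean() if called.any() else float("nan")
    print(f"threshold {threshold:.1f}: PPV = {ppv:.2f} over {called.sum()} calls")
```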

In other words, GFMs provide high-resolution local perturbation scores, but the art of discovery is in wiring those scores into larger decision frameworks.
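
To make the aggregation strategies from step 2 concrete, here is a minimal sketch using a toy variant table; the column names and values are hypothetical, and the Beta(1, 25) allele-frequency weights follow the convention popularized by SKAT-style burden tests.

```python
import pandas as pd
from scipy.stats import beta

# Toy variant table; columns and values are illustrative, not a real schema.
variants = pd.DataFrame({
    "gene":  ["GENE_A"] * 3 + ["GENE_B"] * 2,
    "score": [0.91, 0.42, 0.15, 0.67, 0.88],  # GFM-derived VEP score in [0, 1]
    "af":    [1e-5, 3e-4, 2e-2, 5e-6, 1e-4],  # population allele frequency
})

def aggregate(group: pd.DataFrame) -> pd.Series:
    rare = group[group["af"] < 1e-3]
    # Beta(1, 25) weights upweight rarer variants (the SKAT default choice).
    weights = beta.pdf(rare["af"], 1, 25)
    return pd.Series({
        "max_pool":  group["score"].max(),                # worst single variant
        "top2_mean": group["score"].nlargest(2).mean(),   # top-k pooling, k = 2
        "burden":    (weights * rare["score"]).sum(),     # frequency-weighted burden
    })

gene_scores = variants.groupby("gene")[["score", "af"]].apply(aggregate)
print(gene_scores)
```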


19.2 Integrating VEP with GWAS, Fine-Mapping, and Burden Tests

Genome-wide association studies (GWAS) identify statistical associations between variants and traits. However, GWAS hits are often:

  • Noncoding – Located in enhancers or other regulatory elements.
  • In linkage disequilibrium (LD) – Dozens of variants in a region share similar association statistics.
  • Mechanistically opaque – Even the top GWAS SNP may not be truly causal.

19.2.1 VEP as a prior for fine-mapping

Fine-mapping methods aim to assign each variant in a locus a posterior probability of causality, usually by combining LD patterns, effect-size estimates, and sometimes functional annotations (Wu et al. 2024). GFMs naturally provide functional priors:

  • Regulatory sequence models such as Enformer and AlphaGenome predict how a variant perturbs gene expression or chromatin landscapes (Z. Avsec, Latysheva, and Cheng 2025).
  • Genome-scale LMs like GPN-MSA and Evo 2 estimate the likelihood or impact of nucleotide substitutions in their genomic context (Brixi et al. 2025).
  • Specialized models like TREDNet and MIFM directly target causal variant prediction at GWAS loci (Rakowski and Lippert 2025).

From a Bayesian perspective, these models provide a functional prior $\pi_j$ for each variant $j$ in the locus (see the toy example after this list). Fine-mapping frameworks can then:

  • Upweight variants predicted to have large regulatory or coding effects.
  • Downweight variants with benign or neutral predictions.
  • Support multi-variant configurations, where multiple causal variants exist at the same locus.
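
As a toy illustration of the prior at work, the sketch below uses the single-causal-variant approximation, in which the posterior inclusion probability (PIP) of variant $j$ is proportional to its Bayes factor times $\pi_j$. All numbers are invented for illustration.

```python
import numpy as np

# Per-variant Bayes factors from association statistics: variants in strong
# LD share similar evidence, so a flat prior cannot separate them.
bf = np.array([12.0, 10.5, 11.2, 1.3])

# GFM-derived functional priors pi_j, normalized within the locus;
# the model predicts that variant 2 disrupts a regulatory element.
prior = np.array([0.05, 0.60, 0.05, 0.30])

flat_pip = bf / bf.sum()              # flat prior: LD leaves a near-tie
posterior = bf * prior
pip = posterior / posterior.sum()     # functional prior breaks the tie

print("flat-prior PIPs:      ", np.round(flat_pip, 3))
print("functional-prior PIPs:", np.round(pip, 3))
```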

Recent benchmarks like TraitGym systematically evaluate how well various genomic LMs and VEP models serve as fine-mapping priors across traits and tissues (Benegas, Eraslan, and Song 2025).

19.2.2 Rare variant association and DeepRVAT-style models

For rare variants, single-variant tests have limited power. Instead, gene- or region-based burden tests aggregate rare variants across individuals to detect association. Here, VEP plays two key roles:

  1. Variant weighting and filtering
    Classical burden tests often restrict to “damaging” variants using simple filters (e.g., predicted LoF, CADD > threshold). GFMs provide richer filters and weights, enabling:

    • Fine-grained distinctions among missense variants (e.g., using AlphaMissense scores (Cheng et al. 2023)).
    • Inclusion of regulatory variants predicted to modulate gene expression.
    • Continuous weights reflecting predicted effect size, rather than binary include/exclude decisions.
  2. End-to-end deep set models
    DeepRVAT exemplifies a newer paradigm: instead of hand-engineered burden summaries, a deep set network ingests per-variant features (including GFM-derived VEP scores) and learns to aggregate them into a gene-level risk signal (Clarke et al. 2024). This approach:

    • Supports heterogeneous variant classes within a gene.
    • Learns flexible aggregation functions (e.g., non-additive interactions) while preserving permutation invariance.
    • Accommodates multiple phenotypes and covariates within a single model.
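
The sketch below illustrates the deep set idea in PyTorch: per-variant features are encoded by a shared network, pooled by a permutation-invariant sum, and mapped to a gene-level score. It is a minimal illustration of the architecture family, not the published DeepRVAT implementation; all names and dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class DeepSetBurden(nn.Module):
    """Toy deep set aggregator: per-variant features -> one gene-level score."""

    def __init__(self, n_features: int, hidden: int = 32):
        super().__init__()
        self.phi = nn.Sequential(                  # shared per-variant encoder
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden),
        )
        self.rho = nn.Sequential(                  # post-pooling gene-level head
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, max_variants, n_features); mask: (batch, max_variants)
        h = self.phi(x) * mask.unsqueeze(-1)       # zero out padding variants
        pooled = h.sum(dim=1)                      # permutation-invariant pooling
        return self.rho(pooled).squeeze(-1)        # one score per gene/individual

model = DeepSetBurden(n_features=8)
x = torch.randn(4, 20, 8)                          # 4 individuals, <= 20 rare variants
mask = (torch.rand(4, 20) > 0.5).float()           # variable numbers of variants
print(model(x, mask).shape)                        # torch.Size([4])
```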

As more cohorts with whole-exome or whole-genome sequencing become available, these GFM-enhanced burden frameworks blur the line between GWAS and rare variant analysis, providing a continuum of variant discovery tools.


19.3 Mendelian Disease Gene and Variant Discovery

In Mendelian disease genetics, the questions tend to be more concrete: Which variant explains this patient’s phenotype? Which gene is implicated? Whole-exome or whole-genome sequencing (WES/WGS) of trios and families produces thousands of candidate variants per individual. The standard pipeline includes:

  1. Quality control and filtering
    • Remove low-quality calls and technical artifacts.
    • Filter by allele frequency (e.g., <0.1% in population databases), inheritance mode (de novo, recessive, X-linked), and variant type (LoF, missense, splice, structural).
  2. Gene-centric ranking
    • Aggregate candidate variants per gene, using constraint metrics and known disease-gene catalogs.
    • Integrate phenotype similarity (e.g., HPO-based matching between patient and known gene syndromes).
  3. Manual curation
    • Expert review of gene function, expression patterns, animal models, and literature.
    • Assessment of segregation in the family, de novo status, and evidence of pathogenic mechanism.
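
A minimal pandas sketch of the filtering step, assuming a hypothetical annotated trio callset; the column names (gnomad_af, consequence, de_novo) are placeholders, and the example applies a dominant de novo inheritance model.

```python
import pandas as pd

# Hypothetical annotated trio callset; columns and values are placeholders.
calls = pd.DataFrame({
    "variant":     ["v1", "v2", "v3", "v4"],
    "gnomad_af":   [0.0, 5e-4, 2e-2, 2e-5],
    "filter":      ["PASS", "PASS", "PASS", "LowQual"],
    "consequence": ["stop_gained", "missense", "synonymous", "splice_donor"],
    "de_novo":     [True, False, False, True],
})

DAMAGING = {"stop_gained", "frameshift", "missense", "splice_donor"}

candidates = calls[
    (calls["filter"] == "PASS")              # drop low-quality calls and artifacts
    & (calls["gnomad_af"] < 1e-3)            # rare in population databases (<0.1%)
    & calls["consequence"].isin(DAMAGING)    # plausibly damaging variant classes
    & calls["de_novo"]                       # dominant de novo inheritance model
]
print(candidates["variant"].tolist())        # ['v1']
```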

19.3.1 GFMs in Mendelian variant prioritization

GFMs reshape several stages of this process:

  • Richer coding impact scores
    AlphaMissense provides proteome-wide missense pathogenicity estimates with continuous scores that often outperform traditional tools (Cheng et al. 2023). Coding-aware foundation models (cdsFM and related systems) further capture codon-level context and co-evolutionary patterns (Naghipourfar et al. 2024).

  • Regulatory and splice prediction
    Genome-wide models like GPN-MSA, Evo 2, and AlphaGenome estimate the effect of noncoding and splice-proximal variants, filling a gap for Mendelian variants outside exons (Z. Avsec, Latysheva, and Cheng 2025).

  • Combined variant–gene scoring
    For each gene, we can aggregate:

    • Max or weighted VEP score across all candidate variants.
    • Separate tallies for LoF, missense, regulatory, and splice variants.
    • Gene-level features (constraint, expression, pathways) and phenotype similarity.

    A simple model might compute a composite gene score as a learned function of these features, trained on cohorts with labeled diagnoses.
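
As a sketch of such a composite score, the toy example below fits a logistic regression on a handful of gene-level features; the feature set, values, and labels are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy gene-level feature matrix: one row per candidate gene in a solved case.
# Hypothetical features: max VEP score, LoF variant count, constraint (LOEUF),
# relevant-tissue expression, and HPO phenotype similarity to the gene's syndromes.
X = np.array([
    [0.95, 1, 0.10, 0.8, 0.9],
    [0.40, 0, 0.90, 0.2, 0.1],
    [0.88, 2, 0.15, 0.7, 0.8],
    [0.55, 0, 0.70, 0.3, 0.2],
])
y = np.array([1, 0, 1, 0])                 # retrospective labels: causal or not

clf = LogisticRegression().fit(X, y)       # learned composite gene score
new_gene = np.array([[0.90, 1, 0.20, 0.6, 0.7]])
print(clf.predict_proba(new_gene)[:, 1])   # probability used for ranking
```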

19.3.2 Rare disease association at scale

Beyond single-family diagnostics, large consortia collect rare disease cohorts where the goal is to discover new gene–disease associations. DeepRVAT-style models provide one blueprint:

  • Represent each individual as a set of rare variants with multi-dimensional VEP features (from GFMs and traditional tools).
  • Use deep set networks to map from per-variant features to individual-level phenotype predictions or gene-level association signals (Clarke et al. 2024).
  • Incorporate multi-omics context (e.g., tissue-specific expression, chromatin accessibility from GLUE-like models) as additional features (Cao and Gao 2022).

This pushes Mendelian discovery closer to the foundation model paradigm: instead of hand-designed burden statistics, we train flexible architectures that learn how to combine variant-level representations into gene- and phenotype-level insights.


19.4 Graph-Based Prioritization of Disease Genes

Many discovery problems are inherently network-structured. Genes interact through pathways, protein–protein interaction (PPI) networks, co-expression modules, regulatory networks, and knowledge graphs. Graph neural networks (GNNs) offer a natural way to fuse:

  • Node features from GFMs (e.g., aggregated VEP scores, expression profiles).
  • Graph structure capturing biological relationships.
  • Labels such as disease associations, essentiality, or cancer driver status.
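
To make this fusion concrete, here is a dependency-light sketch of a single GCN-style propagation step (in the symmetric normalized form of Kipf and Welling) over a toy gene graph with GFM-derived node features. The graph, features, and untrained weights are purely illustrative.

```python
import numpy as np

# Toy gene graph: adjacency over 4 genes (e.g., PPI edges) plus self-loops.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float) + np.eye(4)

# Node features from upstream models, e.g., an aggregated VEP score and an
# expression summary per gene (values are illustrative).
X = np.array([
    [0.9, 0.2],
    [0.1, 0.8],
    [0.7, 0.5],
    [0.2, 0.1],
])

# One propagation step: relu(D^{-1/2} A D^{-1/2} X W).
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
W = np.random.default_rng(0).normal(size=(2, 2))     # untrained toy weights
H = np.maximum(D_inv_sqrt @ A @ D_inv_sqrt @ X @ W, 0.0)
print(H)  # smoothed embeddings mixing each gene with its network neighbors
```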

19.4.1 Multi-omics and cancer gene modules

GLUE (and SCGLUE) frame multi-omics integration as a graph-linked embedding problem, connecting cells and features across modalities (Cao and Gao 2022). Inspired by this, GNN frameworks like MoGCN and CGMega build:

  • Gene-level graphs combining expression, methylation, copy number, and other omics layers (H. Li et al. 2024).
  • Attention mechanisms to highlight important neighbors and pathways in cancer gene modules.
  • Predictive models for cancer subtypes, driver genes, and prognostic signatures.

GFMs can enhance these systems by supplying:

  • Variant-aware gene features (e.g., aggregated predicted impact of observed somatic mutations).
  • Regulatory context via sequence-based predictions of expression and chromatin (Enformer, Borzoi, AlphaGenome; Z. Avsec, Latysheva, and Cheng 2025).

19.4.2 Knowledge graphs and essential gene prediction

Knowledge graphs like PrimeKG aggregate heterogeneous biomedical entities—genes, diseases, drugs, pathways, and phenotypes—into a unified relational structure (Chandak, Huang, and Zitnik 2023). GNNs on such graphs can be trained to:

  • Prioritize disease genes based on graph proximity to known genes.
  • Suggest drug repurposing candidates by connecting genetic evidence to drug targets.
  • Discover modules linked to therapeutic response or adverse effects.
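
Graph proximity to known genes is often operationalized as network propagation. The sketch below runs personalized PageRank (via networkx) over a toy gene graph, with the restart distribution concentrated on a single seed disease gene; the graph, gene names, and parameters are illustrative.

```python
import networkx as nx

# Toy gene interaction graph; edges are illustrative placeholders.
G = nx.Graph([
    ("GENE_A", "GENE_B"), ("GENE_B", "GENE_C"),
    ("GENE_C", "GENE_D"), ("GENE_A", "GENE_E"),
])

# Restart distribution concentrated on known disease gene(s).
seeds = {"GENE_A": 1.0}

# Personalized PageRank diffuses evidence from the seeds over the network;
# high-scoring non-seed genes become prioritization candidates.
scores = nx.pagerank(G, alpha=0.85, personalization=seeds)
for gene, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{gene}: {score:.3f}")   # seeds rank first, close neighbors next
```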

Bingo provides a related example, combining a large language model (LLM) with GNNs to predict essential genes from protein-level data (Ma et al. 2023). In principle, the node features in such systems could incorporate:

  • Gene-level embeddings derived from protein LMs (Chapter 9).
  • Aggregated variant effect embeddings from genomic LMs (Chapter 10 and Chapter 13).
  • Multi-omic signatures from GLUE-like integrative models (Cao and Gao 2022).

Together, these approaches illustrate a broader trend: GFMs rarely act alone. Instead, they supply dense, information-rich features to graph-based models that reason over the network context where disease mechanisms actually play out.


19.5 Experimental Follow-Up and Closed-Loop Refinement

Computational prioritization is only half of discovery. Ultimately, we need experimental validation: does perturbing a candidate variant or gene alter the relevant molecular or cellular phenotype?

19.5.1 Designing CRISPR and MPRA experiments with GFMs

Several classes of high-throughput perturbation assay are central here:

  • Massively parallel reporter assays (MPRAs) targeting many regulatory variants.
  • CRISPR tiling and base editing screens across enhancers, promoters, and coding regions.
  • Perturb-seq linking genetic perturbations to single-cell transcriptomes.

All of these are expensive and capacity-limited. GFMs help prioritize and design such experiments:

  • Sequence-to-expression models like Enformer and Borzoi can identify regions and variants with large predicted regulatory effects, guiding where to tile and which alleles to test (Linder et al. 2025).
  • Genome-scale generative models like Evo 2 can propose counterfactual edits that maximize predicted effect, enabling focused exploration of regulatory landscapes (Brixi et al. 2025).
  • Variant effect models can suggest multiplexed libraries that systematically probe key motifs, splice sites, or codon usage patterns.

Instead of brute-force tiling every base pair, we can use GFMs to bias the library toward informative perturbations, effectively turning them into experiment design engines.
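
As a toy sketch of model-guided library design, the code below greedily fills a fixed library budget with the edits of largest predicted effect while capping how many edits any one region contributes; the predicted effects, budget, and capacity heuristic are all invented for illustration.

```python
import numpy as np

# Hypothetical candidate edits with GFM-predicted effect sizes and the
# regulatory region each edit falls in (both simulated).
rng = np.random.default_rng(0)
effects = rng.exponential(scale=1.0, size=200)    # |predicted effect| per edit
regions = rng.integers(0, 10, size=200)           # region of origin per edit

budget, per_region_cap = 40, 6                    # library capacity limits
order = np.argsort(-effects)                      # strongest predictions first

library, used = [], {}
for idx in order:
    region = int(regions[idx])
    if used.get(region, 0) < per_region_cap:      # preserve coverage across regions
        library.append(int(idx))
        used[region] = used.get(region, 0) + 1
    if len(library) == budget:
        break

print(f"selected {len(library)} edits spanning {len(used)} regions")
```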

19.5.2 Using functional data to retrain and recalibrate models

The feedback loop goes in the other direction as well. Functional genomics screens produce rich labeled datasets:

  • MPRA readouts of allele-specific regulatory activity.
  • CRISPR screen scores for gene essentiality or drug sensitivity.
  • Single-cell perturbation responses across cell states.

These can be used to:

  • Refine model heads for specific tasks (e.g., fine-tune a GFM to predict MPRA outcomes in a particular cell type).
  • Calibrate scores so that predicted effect magnitudes align with measured changes.
  • Discover failure modes, such as motifs or chromatin contexts where current models systematically mispredict.
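
A minimal calibration sketch, assuming paired model scores and MPRA measurements (simulated here): isotonic regression fits a monotone map from predicted to measured effects, recalibrating magnitudes while preserving the ranking of variants.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Simulated pairs: GFM-predicted score vs. measured MPRA log fold change,
# related by a monotone but nonlinear function plus noise.
rng = np.random.default_rng(1)
predicted = rng.uniform(0, 1, 300)
measured = 2.0 * predicted**2 + rng.normal(0, 0.1, 300)

# Fit a monotone mapping from scores to measured effect magnitudes.
calibrator = IsotonicRegression(out_of_bounds="clip").fit(predicted, measured)
print(calibrator.predict([0.2, 0.5, 0.9]))   # calibrated effect estimates
```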

Some recent systems explicitly design closed-loop pipelines, where model predictions drive experiments, which then feed back to improve the model and inform the next round of design (Rakowski and Lippert 2025). In the limit, we approach a semi-automated “hypothesis factory”:

  1. Start from GWAS, rare variant, or tumor sequencing data.
  2. Use GFMs plus graphs to prioritize candidate variants and genes.
  3. Design perturbation experiments guided by model predictions.
  4. Update the models with new functional data.
  5. Iterate, progressively sharpening our understanding of the underlying mechanisms.

19.6 Case Studies and Practical Considerations

To ground these ideas, consider two representative application areas.

19.6.1 Rare disease diagnosis pipelines leveraging VEP scores

Modern rare disease centers increasingly adopt GFM-enhanced diagnostic workflows:

  1. Variant filtering and annotation
    • Standard QC and frequency filters.
    • Annotation with GFM-based VEP scores (coding, regulatory, splice), constraint, and ClinVar evidence.
  2. Gene-ranking model
    • Per-gene aggregation of variant scores and features.
    • A trained model that predicts the likelihood of each gene being causal, based on retrospective cohorts with known diagnoses.
  3. Phenotype integration
    • HPO-based similarity to known gene syndromes.
    • Network-based propagation of phenotype associations using knowledge graphs like PrimeKG (Chandak, Huang, and Zitnik 2023).
  4. Expert review
    • Geneticists and clinicians inspect the top-ranked genes and variants, cross-checking against patient phenotypes, family segregation, and literature.

Compared to traditional pipelines, the GFM-enhanced version tends to:

  • Surface non-obvious candidates, such as noncoding or splice variants with strong predicted functional effects.
  • Provide more nuanced prioritization among multiple missense variants in the same gene.
  • Offer richer mechanistic hypotheses to guide follow-up experiments.

19.6.2 Cancer driver mutation discovery (coding and noncoding)

In cancer genomics, the goal is to distinguish driver mutations from a large background of passenger mutations. GFMs and graph-based models contribute at multiple levels:

  • Variant-level scoring
    • Use coding VEP (e.g., AlphaMissense, cdsFM-like models) for missense drivers (Naghipourfar et al. 2024).
    • Use regulatory sequence models (Enformer, AlphaGenome, TREDNet) to evaluate noncoding mutations in promoters and enhancers (Hudaiberdiev et al. 2023).
  • Gene- and module-level aggregation
    • Aggregate somatic variants per gene, weighted by predicted functional impact.
    • Apply GNNs such as MoGCN and CGMega to identify driver gene modules that are recurrently perturbed across patients (H. Li et al. 2024).
    • Use set-based models (akin to DeepRVAT) to relate patient-specific variant sets to tumor subtypes or outcomes (Clarke et al. 2024).
  • Functional follow-up
    • Design focused CRISPR tiling screens around candidate regulatory elements, prioritized by GFMs.
    • Validate predicted driver genes in cell line or organoid models, integrating transcriptional responses with multi-omic readouts (Chapter 14).

These pipelines exemplify multi-scale integration: GFMs for variant-level effects, GNNs for network-level reasoning, and high-throughput perturbations for experimental validation.


19.7 Outlook: Towards End-to-End Discovery Systems

Biomedical discovery of pathogenic variants is moving from manual, hypothesis-driven workflows toward data- and model-driven pipelines where GFMs act as a central substrate:

  • They turn raw sequence variation into rich, context-aware variant embeddings.
  • They provide priors and features for fine-mapping, rare variant association, and gene prioritization.
  • They guide the design of targeted perturbation experiments, which in turn provide new data to refine the models.

At the same time, several challenges remain:

  • Robustness and generalization across ancestries, tissues, and disease cohorts.
  • Calibration and interpretability suitable for clinical and experimental decision-making.
  • Evaluation frameworks (like TraitGym) that fairly compare models and reveal domain gaps (Benegas, Eraslan, and Song 2025).
  • Ethical and regulatory considerations around automated variant classification and gene discovery in sensitive contexts.

In the next chapter, we zoom out to the broader drug discovery and biotech landscape (Chapter 20), where many of these discovery building blocks are embedded in industrial-scale pipelines that span from genetic association to target validation, biomarker discovery, and eventually clinical translation.