20 Drug Discovery & Biotech
Genomic foundation models (GFMs) are built to turn raw sequence and multi-omic data into reusable biological representations and fine-grained predictions (Chapter 12). In previous chapters you saw how these models improve variant effect prediction (Chapters 10, 11, 13), long-range regulatory modeling (Chapters 8, 11, 12), and disease genetics workflows (Chapters 14–16).
This chapter zooms out to ask a more translational question:
How do genomic foundation models actually plug into drug discovery and biotech workflows?
Rather than walking step-by-step through a single therapeutic program, this chapter offers a compact, high-level map of where GFMs are already useful—or plausibly soon will be. The focus is on three broad roles:
- Target discovery and genetic validation: Using human genetics, variant-level scores, and gene-level evidence to prioritize safer, more effective targets.
- Functional genomics and perturbation screens: Designing, interpreting, and iteratively improving large-scale CRISPR/perturb-seq/MPRA screens with help from GFMs.
- Biomarkers, patient stratification, and biotech infrastructure: Turning model outputs into biomarkers for trial design and integrating GFMs into the industrial MLOps stack.
Throughout, the aim is not to promise “end-to-end AI drug discovery,” but to show pragmatic ways that genomic foundation models can reduce risk, prioritize hypotheses, and make experiments more informative, especially when coupled to high-quality human data.
20.1 Where Genomics Touches the Drug Discovery Pipeline
The canonical small-molecule or biologics pipeline is often summarized as:
- Target identification and validation
- Hit finding and lead optimization
- Preclinical characterization (safety, PK/PD, tox)
- Clinical trials (Phase I–III) and post-marketing
Genomics most directly enters at three points:
- Early-stage target discovery and validation
- Human genetic associations (GWAS, rare-variant burden, somatic mutation landscapes) point to potential targets.
- Variant-level effect predictions and gene-level constraint metrics help de-prioritize potentially unsafe or non-causal signals.
- Biomarker discovery and patient stratification
- Genetic risk scores, regulatory embeddings, and multi-omic signatures define patient subgroups and endpoints for trials.
- Embeddings from GFMs make it easier to find molecularly coherent patient strata beyond traditional clinical labels.
- Mechanism-of-action (MoA) and resistance
- Functional genomics screens and perturbation assays help dissect how a compound perturbs cellular networks.
- GFMs can predict which perturbations matter and suggest follow-up experiments.
Other AI-for-drug-discovery efforts focus on molecular design, docking, or protein structure; those are largely beyond the scope of this book. Here we stay close to the DNA- and RNA-centric capabilities you’ve seen earlier: variant effect prediction, regulatory modeling, and multi-omics integration.
20.2 Target Discovery and Genetic Validation
Human genetics provides some of the strongest evidence that modulating a particular target can safely change disease risk. GFMs don’t replace classical statistical genetics, but they provide richer priors and more mechanistic features for identifying and validating targets.
20.2.1 From variant-level scores to gene-level targets
Variant effect prediction (VEP) models provide a natural starting point. Earlier chapters introduced:
- Genome-wide deleteriousness scores such as CADD, which integrate diverse annotations and—more recently—deep and foundation-model features (Rentzsch et al. 2019; Schubach et al. 2024).
- Protein-centric VEP GFMs, including AlphaMissense, GPN-MSA, and AlphaGenome, which combine protein language models, structure, and long-range context to score coding variants (Cheng et al. 2023; Benegas, Albors, et al. 2024; Z. Avsec, Latysheva, and Cheng 2025; Brandes et al. 2023).
- Sequence-to-function models such as Enformer and long-context DNA LMs (e.g., Nucleic Transformer, HyenaDNA), which predict regulatory outputs from large genomic windows (Ž. Avsec et al. 2021; He et al. 2023; Nguyen et al. 2023; Trop et al. 2024).
Drug target teams rarely care about individual variants per se; they care about genes and pathways. The key move is therefore to aggregate variant-level information into gene-level evidence:
- Coding variant aggregation
- Summarize missense and predicted loss-of-function (pLoF) variants in each gene using VEP scores.
- Partition variants by predicted functional category (e.g. likely loss-of-function vs. benign missense) and by allele frequency.
- Derive gene-level metrics such as “burden of predicted damaging variants in cases vs controls.”
- Noncoding and regulatory evidence
- Aggregate variant effect predictions on enhancers, promoters, and splice sites that link (via chromatin interaction maps or models like Enformer) to a candidate gene (Ž. Avsec et al. 2021; He et al. 2023).
- Use long-range GFMs to connect distal regulatory elements to target loci across 100 kb–1 Mb.
- Constraint and intolerance
- Combine VEP-informed burden with gene constraint measures (as used implicitly in CADD and downstream tools) to identify genes that are highly intolerant to damaging variation (Rentzsch et al. 2019; Schubach et al. 2024).
- Extremely constrained genes may be risky targets (essentiality/toxicity), while “dose-sensitive” but not lethal genes may present more attractive opportunities.
From a GFM perspective, the core idea is to treat gene-level evidence as an aggregation problem over high-dimensional variant embeddings. Instead of manually defining a handful of summary statistics, teams can feed variant embeddings or predicted functional profiles into downstream models that learn which patterns matter most for disease.
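As a concrete, minimal sketch of this aggregation step, the code below assumes a table of precomputed variant-level scores and collapses it into simple gene-level burden features with pandas. The file name, column names, and thresholds are illustrative placeholders, not recommendations.

```python
import pandas as pd

# Hypothetical input: one row per variant with precomputed GFM/VEP scores.
# The file name and columns (gene, consequence, allele_freq, vep_score) are
# placeholders for whatever a team's annotation pipeline produces.
variants = pd.read_csv("variant_scores.csv")

# Flag rare, predicted-damaging variants; both cutoffs are illustrative only.
variants["is_damaging"] = (variants["vep_score"] > 0.8) & (variants["allele_freq"] < 0.001)

# Collapse variant-level rows into simple gene-level burden features.
gene_evidence = (
    variants.groupby("gene")
    .agg(
        n_rare_damaging=("is_damaging", "sum"),
        mean_vep_score=("vep_score", "mean"),
        max_vep_score=("vep_score", "max"),
        n_plof=("consequence", lambda c: (c == "pLoF").sum()),
    )
    .reset_index()
    .sort_values("n_rare_damaging", ascending=False)
)

print(gene_evidence.head())
```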
20.2.2 Linking genetic evidence to target safety and efficacy
Classical human genetics has established several now-standard heuristics for target selection:
- “Human knockout” individuals (carrying biallelic LoF variants) provide a natural experiment on what happens when a gene is effectively inactivated.
- Protective variants that reduce disease risk suggest directionality of effect (e.g. partial inhibition of a protein is beneficial rather than harmful).
- Pleiotropy—associations with many unrelated traits—may signal safety liabilities.
GFMs reinforce and extend these ideas by:
- Improving causal variant identification
- Fine-mapping methods and multiple-instance models like MIFM can distinguish truly causal regulatory variants from correlated passengers (Wu et al. 2024; Rakowski and Lippert 2025).
- Combining these with regulatory GFMs tightens the map from GWAS locus → variant → target gene.
- Refining effect direction and magnitude
- VEP scores from protein and regulatory GFMs can approximate effect sizes (e.g. how “severe” a missense change is, or how strongly a regulatory variant alters expression) (Cheng et al. 2023; Benegas, Albors, et al. 2024; Z. Avsec, Latysheva, and Cheng 2025).
- This can help differentiate subtle modulators from catastrophic LoF.
- Highlighting mechanism-enriched loci
- GFMs provide multi-task predictions (chromatin marks, TF binding, expression, splicing) that make it easier to interpret how a risk locus affects biology (Ž. Avsec et al. 2021; Benegas, Ye, et al. 2024).
In practice, a target discovery workflow might:
- Start from GWAS summary statistics or rare variant analyses.
- Apply fine-mapping (e.g. MIFM) to identify candidate causal variants (Wu et al. 2024; Rakowski and Lippert 2025).
- Score candidate variants with VEP GFMs (both protein and regulatory).
- Map variants to genes using long-range regulatory models (Enformer, Nucleic Transformer, HyenaDNA) (Ž. Avsec et al. 2021; He et al. 2023; Nguyen et al. 2023).
- Aggregate signals into gene-level “genetic support” scores, incorporating constraint and pleiotropy information.
The result is a ranked list of candidate targets with structured evidence that can be compared across diseases and programs.
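A team's actual implementation will wrap whatever fine-mapping, VEP, and variant-to-gene tools it already runs; the sketch below only illustrates the final aggregation into a comparable gene-level score. The evidence fields, example genes, and the weighted sum are assumptions for illustration, not an established scoring scheme.

```python
from dataclasses import dataclass

@dataclass
class GeneEvidence:
    gene: str
    fine_mapping_pip: float    # posterior inclusion probability of the best credible variant
    vep_burden: float          # aggregated variant-effect score (coding + regulatory)
    constraint: float          # gene constraint / intolerance metric
    pleiotropy_penalty: float  # crude proxy for potential safety liabilities

def genetic_support_score(ev: GeneEvidence) -> float:
    """Toy gene-level ranking score; the weights are illustrative placeholders."""
    return (
        0.4 * ev.fine_mapping_pip
        + 0.4 * ev.vep_burden
        + 0.2 * ev.constraint
        - 0.3 * ev.pleiotropy_penalty
    )

# Hypothetical evidence for three candidate genes at a GWAS locus.
candidates = [
    GeneEvidence("GENE_A", fine_mapping_pip=0.92, vep_burden=0.70, constraint=0.40, pleiotropy_penalty=0.10),
    GeneEvidence("GENE_B", fine_mapping_pip=0.35, vep_burden=0.55, constraint=0.90, pleiotropy_penalty=0.60),
    GeneEvidence("GENE_C", fine_mapping_pip=0.88, vep_burden=0.20, constraint=0.15, pleiotropy_penalty=0.05),
]

for ev in sorted(candidates, key=genetic_support_score, reverse=True):
    print(f"{ev.gene}\t{genetic_support_score(ev):.2f}")
```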
20.2.3 Evolving from hand-curated to model-centric target triage
Historically, target triage relied heavily on manual curation:
- Experts would review GWAS hits, literature, and pathway diagrams.
- Limited quantitative information was available for most genes, especially in non-classical pathways.
GFMs shift this towards a model-centric, continuously updated view:
- New data (e.g. biobank sequencing, single-cell atlases) can be fed through trained GFMs to update variant and gene evidence.
- The same underlying model suite can support many disease programs, enabling consistent cross-portfolio comparisons.
- Benchmark frameworks like TraitGym emphasize standardized evaluation of genotype-phenotype modeling, helping teams choose appropriate model stacks for a given trait (Benegas, Eraslan, and Song 2025).
The limiting factor becomes less “do we have an annotation?” and more “can we interpret the model’s representation and connect it to biological plausibility and druggability?”—a theme echoed in Chapters 13 and 15.
20.3 Functional Genomics Screens in Drug Discovery
While human genetics offers observational evidence, drug discovery also relies heavily on perturbation experiments:
- CRISPR knockout/knockdown/activation screens.
- Base-editing or saturation mutagenesis around key domains.
- MPRA and massively parallel promoter/enhancer assays.
- Perturb-seq and other high-throughput transcriptomic readouts.
Genomic foundation models are well positioned to both design and interpret such screens.
20.3.1 Designing smarter perturbation libraries
Traditional pooled screens often rely on simple design rules (e.g. one sgRNA per exon, or tiling a region at fixed spacing). GFMs enable more information-dense designs:
- Sequence-to-function priors
- Models like DeepSEA, Enformer, and related CNN/transformer architectures predict which bases are most functionally critical for regulatory outputs (Zhou and Troyanskaya 2015; Ž. Avsec et al. 2021; Benegas, Ye, et al. 2024).
- Library design can focus perturbations on high-sensitivity sites—predicted TF motifs, splice junctions, or enhancer “hotspots.”
- Variant prioritization for saturation mutagenesis
- Protein and DNA GFMs can prioritize substitutions expected to span a wide range of predicted fitness, enabling better estimation of quantitative genotype–phenotype maps (Cheng et al. 2023; Marquet et al. 2024).
- This is especially useful for deep mutational scanning near active sites or in regulatory domains.
- Off-target and safety considerations
- Sequence models can help filter sgRNA designs with high predicted off-target binding, or prioritize guide positions that minimize unintended regulatory disruption.
The overarching idea is to maximize the information gained per experimental budget by letting GFMs suggest where to perturb in sequence space.
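One common way to let a sequence model nominate perturbation sites is in-silico saturation mutagenesis: score every single-base substitution in a window and perturb where the prediction moves most. The sketch below assumes access to a trained sequence-to-function model behind a predict_activity function; the toy motif-counting scorer used here is a stand-in so the example runs.

```python
import numpy as np

BASES = "ACGT"

def predict_activity(seq: str) -> float:
    """Placeholder for a trained sequence-to-function model.
    Here: a toy score counting a fictitious motif, just so the example runs."""
    return float(seq.count("GATA"))

def saturation_mutagenesis_scores(seq: str) -> np.ndarray:
    """Return |delta prediction| for every position x substitution."""
    ref = predict_activity(seq)
    deltas = np.zeros((len(seq), len(BASES)))
    for i, ref_base in enumerate(seq):
        for j, alt in enumerate(BASES):
            if alt == ref_base:
                continue
            mutated = seq[:i] + alt + seq[i + 1:]
            deltas[i, j] = abs(predict_activity(mutated) - ref)
    return deltas

enhancer = "TTGATAAGCCGATAACGTTAGGATAGCA"  # toy sequence
deltas = saturation_mutagenesis_scores(enhancer)

# Per-position sensitivity: the largest predicted effect of any substitution there.
sensitivity = deltas.max(axis=1)
top_sites = np.argsort(sensitivity)[::-1][:5]
print("Suggested positions to perturb:", sorted(top_sites.tolist()))
```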
20.3.2 Interpreting screen readouts with GFMs
Once a screen has been run, GFMs can assist in several ways:
- Embedding perturbations and outcomes
- Encode each perturbed sequence (e.g. enhancer variant, gene knockout) using a DNA or protein GFM, and represent each experimental condition as the combination of its embedding and observed phenotype (e.g. expression profile).
- This enables manifold learning over perturbations, in which clusters correspond to shared mechanisms of action.
- Mapping hits back to pathways
- Combine GFMs with graph-based models over protein–protein interaction networks and regulatory networks to identify enriched pathways (Gao et al. 2023; Yuan and Duren 2025).
- Learned embeddings help propagate signal to weakly observed genes or variants.
- Closing the loop with model retraining
- Use screen outcomes as labeled examples to fine-tune sequence-to-function models in the relevant cell type or context.
- This “lab-in-the-loop” refinement turns generic GFMs into highly tuned models for the cell system of interest.
For example, an MPRA that assays thousands of enhancer variants can yield sequence–activity pairs that dramatically improve expression-prediction GFMs in that locus or tissue. Conversely, model predictions can suggest follow-up experiments (additional variants, cell types, or perturbation strengths) that would be maximally informative given previous data.
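As a sketch of the embed-and-cluster step, the code below treats each perturbation as a fixed-length feature vector (random vectors stand in for a GFM sequence embedding concatenated with a phenotype summary) and groups them with k-means; cluster membership then serves as a hypothesis about shared mechanism, to be checked against pathway annotations and follow-up assays.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-ins for per-perturbation features: a GFM embedding of the perturbed
# sequence concatenated with a summary of the observed expression response.
n_perturbations, seq_dim, pheno_dim = 200, 64, 16
seq_emb = rng.normal(size=(n_perturbations, seq_dim))
pheno_emb = rng.normal(size=(n_perturbations, pheno_dim))
features = np.concatenate([seq_emb, pheno_emb], axis=1)

# Cluster perturbations; clusters are mechanism hypotheses, not conclusions.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(features)
labels = kmeans.labels_

for k in range(8):
    print(f"cluster {k}: {np.sum(labels == k)} perturbations")
```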
20.4 Biomarker Discovery, Patient Stratification, and Trial Design
Even when a target is well validated, many programs fail in late-stage trials because the right patients, endpoints, or biomarkers were not selected. GFMs, combined with large cohorts, offer new tools for defining and validating biomarkers.
20.4.1 From polygenic scores to GFM-informed biomarkers
Classical polygenic scores (PGS) summarize the additive effect of many common variants on disease risk. Deep learning methods such as Delphi extend this idea by learning non-linear genotype–phenotype mappings directly from genome-wide data (Georgantas, Kutalik, and Richiardi 2024).
GFMs can enhance these approaches by:
- Providing richer genetic features
- Instead of raw genotypes, models can use VEP-derived scores, variant embeddings, or gene-level features produced by GFMs.
- This can capture non-additive effects, regulatory architecture, and variant-level biology in a more compact representation.
- Transferring knowledge across traits and ancestries
- Foundation models trained across diverse genomes (e.g. Nucleotide Transformer, GENA-LM, HyenaDNA) provide features that may generalize more robustly across populations than trait-specific models (Dalla-Torre et al. 2023; Fishman et al. 2025; Nguyen et al. 2023).
- Fine-mapping–aware approaches like MIFM further reduce dependence on linkage disequilibrium patterns (Wu et al. 2024; Rakowski and Lippert 2025).
- Distinguishing risk and progression
- By integrating regulatory and expression predictions, risk models can differentiate genetic influences on disease onset vs progression, enabling more targeted enrichment strategies.
In trial design, such models can be used to:
- Enrich for high-risk individuals (in prevention trials); a minimal sketch of this enrichment step follows the list.
- Define genetic subtypes that may respond differently to the same mechanism.
- Construct composite biomarkers that mix genetics with conventional clinical features.
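The enrichment sketch below fits a risk model on GFM-derived features (simulated here), selects the top fraction of predicted risk for a prevention trial, and compares event rates. All data and the 20% cutoff are synthetic placeholders, not recommendations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic stand-in for GFM-derived features (e.g. gene-level burden scores,
# regulatory embeddings) and a binary disease outcome.
n, d = 5000, 50
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
p = 1 / (1 + np.exp(-(X @ true_w) * 0.3))
y = rng.binomial(1, p)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

risk_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
risk = risk_model.predict_proba(X_test)[:, 1]

# Enrich: take the top 20% by predicted risk and compare event rates.
cutoff = np.quantile(risk, 0.8)
enriched_rate = y_test[risk >= cutoff].mean()
overall_rate = y_test.mean()
print(f"event rate overall: {overall_rate:.3f}, in enriched cohort: {enriched_rate:.3f}")
```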
20.4.2 Multi-omic and single-cell biomarker discovery
Beyond DNA variation, drug development increasingly leverages multi-omic and single-cell readouts:
- Whole-genome/exome tumor sequencing combined with expression, methylation, and copy-number profiling.
- Single-cell multiome datasets (RNA + ATAC) that characterize cell-state landscapes in disease (Jurenaite et al. 2024; Yuan and Duren 2025).
- Microbiome sequencing for host–microbe interplay and response to therapy (Yan et al. 2025).
GFMs and related architectures can help here in several ways:
- Set-based and graph-based encoders
- Models like SetQuence/SetOmic treat heterogeneous genomic features for each tumor as a set, using deep set transformers to extract predictive representations (Jurenaite et al. 2024).
- GRN inference models such as LINGER leverage atlas-scale multiome data to infer regulatory networks that can serve as biomarkers of pathway activity (Yuan and Duren 2025).
- Multi-scale integration
- DNA and RNA GFMs can be combined with graph neural networks over gene and protein networks to build end-to-end predictors that map from genotype + cell state to clinical endpoints (Gao et al. 2023; Benegas, Ye, et al. 2024).
- Embeddings from protein LMs (e.g. ESM-2-based variant models) provide additional structure for coding variants (Brandes et al. 2023; Marquet et al. 2024).
- Biomarker discovery workflows
- Use GFMs to generate rich embeddings for patients (e.g. from tumor genomes, germline variation, or multi-omic profiles).
- Cluster or perform supervised learning to identify molecular subgroups with differential prognosis or treatment response.
- Validate candidate biomarkers on held-out cohorts or external datasets before deploying them in a trial.
The key shift is that biomarkers are no longer limited to a handful of hand-picked variants or expression markers: they become functions over high-dimensional genomic and multi-omic embeddings, learned in a data-driven way yet grounded in biological priors from GFMs.
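The workflow above can be prototyped in a few lines once patient-level embeddings are available. In the sketch below the embeddings and response labels are simulated; what matters is the shape of the analysis (cluster a discovery cohort, assign a held-out cohort to the same clusters, then check whether response rates differ), not the particular numbers.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Simulated patient embeddings (stand-ins for GFM-derived multi-omic profiles)
# and treatment-response labels with a subgroup-dependent response rate.
n, d = 600, 32
Z = rng.normal(size=(n, d))
Z[: n // 3, :4] += 2.0  # a molecularly distinct subgroup
response = rng.binomial(1, np.where(np.arange(n) < n // 3, 0.6, 0.25))

# Split into discovery and validation cohorts.
idx = rng.permutation(n)
disc, val = idx[: n // 2], idx[n // 2 :]

clusters = KMeans(n_clusters=3, n_init=10, random_state=2).fit(Z[disc])
val_labels = clusters.predict(Z[val])

# Candidate biomarker: cluster membership; check response rates on held-out patients.
for k in range(3):
    mask = val_labels == k
    print(f"cluster {k}: n={mask.sum()}, response rate={response[val][mask].mean():.2f}")
```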
20.5 Biotech Workflows and Infrastructure for GFMs
For pharma and biotech organizations, the primary challenge is not “can we train a big model?” so much as “how do we integrate GFMs into existing data platforms, governance, and decision-making?”
20.5.2 Build vs buy vs fine-tune
Organizations face three strategic options:
- Use external GFMs “as-is”
- Pros: Low up-front cost; benefits from community benchmarking (e.g. TraitGym for genotype–phenotype modeling (Benegas, Eraslan, and Song 2025)).
- Cons: May not capture organization-specific populations, assays, or traits.
- Fine-tune open-source GFMs on internal data
- Pros: Retains powerful general representations while adapting to local distribution.
- Cons: Requires careful privacy controls and computational investment.
- Train bespoke internal GFMs
- Pros: Maximum control; can align pretraining exactly with available data and target use cases.
- Cons: Expensive, complex MLOps; risk of overfitting to narrow datasets if not complemented by broader pretraining.
In practice, many groups adopt a hybrid strategy:
- Start with public GFMs for early exploration and non-sensitive tasks.
- Gradually fine-tune on internal biobank or trial data when added value is clear (a fine-tuning skeleton follows this list).
- Maintain lightweight model-serving infrastructure for latency-sensitive applications (e.g. clinical decision support) and heavier offline systems for large-scale research workloads.
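For the fine-tuning option, the code usually looks like standard transfer learning with the Hugging Face transformers API. The checkpoint name below is a hypothetical placeholder, and whether a particular DNA LM requires trust_remote_code or a custom head depends on the release; treat this as the general shape of the code rather than a recipe for any specific model.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical checkpoint name; substitute the open-source DNA LM actually used.
CHECKPOINT = "org-name/dna-language-model"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT, num_labels=2, trust_remote_code=True
)

# Freeze the backbone and train only the task head to limit compute and
# overfitting; head parameter names vary by model ("classifier" is common).
for name, param in model.named_parameters():
    if "classifier" not in name:
        param.requires_grad = False

# One illustrative training step on internal labeled sequences (placeholders).
sequences = ["ACGTGGGATAACGTTAGGCT", "TTGACGTGACCATGCAAGTA"]
labels = torch.tensor([1, 0])

batch = tokenizer(sequences, return_tensors="pt", padding=True, truncation=True)
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

model.train()
out = model(**batch, labels=labels)
out.loss.backward()
optimizer.step()
print("one fine-tuning step complete, loss =", float(out.loss))
```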
20.5.3 IP, collaboration, and regulatory considerations
GFMs also raise new questions around:
- Intellectual property
- Models trained on proprietary data can be valuable IP assets but are hard to patent directly.
- Downstream discoveries (targets, biomarkers) derived from GFMs must be carefully documented for freedom-to-operate.
- Data sharing and federated approaches
- Joint training or evaluation across institutions may require federated learning or model-to-data paradigms, especially for patient-level data.
- Regulatory expectations
- For biomarkers used in pivotal trials, regulators will expect transparent documentation of model training, validation, and performance across subgroups (a minimal subgroup-reporting sketch follows this list).
- Chapters 14 and 15 highlight confounding and interpretability challenges that become even more acute when models inform trial inclusion or primary endpoints.
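One concrete piece of that documentation is subgroup-level performance reporting for any model-derived biomarker. A minimal sketch, assuming binary outcomes and predicted scores are already in hand (the data below are simulated):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)

# Simulated biomarker scores, outcomes, and subgroup labels (e.g. ancestry,
# sex, or recruiting site); in practice these come from a held-out cohort.
n = 2000
subgroup = rng.choice(["A", "B", "C"], size=n, p=[0.5, 0.3, 0.2])
y = rng.binomial(1, 0.3, size=n)
score = y * rng.normal(0.8, 1.0, size=n) + (1 - y) * rng.normal(0.0, 1.0, size=n)

print("overall AUROC:", round(roc_auc_score(y, score), 3))
for g in ["A", "B", "C"]:
    mask = subgroup == g
    print(f"subgroup {g}: n={mask.sum()}, AUROC={roc_auc_score(y[mask], score[mask]):.3f}")
```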
Overall, leveraging GFMs in biotech is as much an organizational and regulatory engineering problem as a technical one.
20.6 Forward Look: Toward Lab-in-the-Loop GFMs
A recurring theme across this book is moving from static models to closed loops that integrate:
- Foundational representation learning on large unlabeled datasets (genomes, multi-omics).
- Task-specific supervision (disease status, expression, variant effects).
- Experimental feedback from perturbation assays, functional screens, and clinical trials.
In the drug discovery context, this suggests an evolution toward lab-in-the-loop GFMs:
- Hypothesis generation
- GFMs identify promising targets, variants, and regulatory regions.
- Graph and set-based models suggest network-level interventions (Jurenaite et al. 2024; Gao et al. 2023; Yuan and Duren 2025).
- Experiment design
- Models propose perturbation libraries (CRISPR, MPRA) that maximize expected information gain; an uncertainty-guided selection sketch follows this list.
- Safety and off-target predictions help filter risky designs.
- Evidence integration and model refinement
- Screen results feed back into GFMs, improving their local accuracy in disease-relevant regions of sequence space.
- Clinical trial outcomes update biomarker models and risk predictors for future trials.
- Portfolio-level decision support
- Genetic and functional evidence from GFMs is combined with classical pharmacology to prioritize or deprioritize programs.
- Uncertainty estimates and model critique (Chapter 17) help avoid over-confidence in purely model-driven recommendations.
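A simple way to make the experiment-design step concrete is uncertainty-guided selection: at each round the model nominates the candidates it is least certain about, the lab measures them, and the model is refit. The sketch below uses the spread of a random-forest ensemble's per-tree predictions as a stand-in for GFM uncertainty; all data, features, and the assay function are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)

def assay(x):
    """Stand-in for the wet-lab measurement of a perturbation's effect."""
    return np.sin(3 * x[:, 0]) + 0.5 * x[:, 1] + 0.1 * rng.normal(size=len(x))

# Candidate perturbations described by (placeholder) sequence-derived features.
pool = rng.uniform(-1, 1, size=(500, 2))
measured_idx = list(rng.choice(len(pool), size=20, replace=False))

for round_ in range(3):
    X, y = pool[measured_idx], assay(pool[measured_idx])
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

    # Uncertainty proxy: spread of per-tree predictions over unmeasured candidates.
    remaining = [i for i in range(len(pool)) if i not in measured_idx]
    per_tree = np.stack([t.predict(pool[remaining]) for t in model.estimators_])
    uncertainty = per_tree.std(axis=0)

    # Propose the most uncertain candidates for the next screen.
    chosen = [remaining[i] for i in np.argsort(uncertainty)[::-1][:20]]
    measured_idx.extend(chosen)
    print(f"round {round_}: selected {len(chosen)} new perturbations, total measured={len(measured_idx)}")
```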
Realizing this vision will require:
- Better calibration and uncertainty quantification in GFMs.
- Stronger causal reasoning to distinguish correlation from intervention-worthiness.
- Careful ethical and equity considerations, especially when models influence who gets access to trials or targeted therapies (Chapter 16).
Yet even in the near term, GFMs already offer tangible value in de-risking targets, enriching cohorts, and interpreting complex functional data. When combined with rigorous experimental design and domain expertise, they can act not as oracle decision-makers, but as force multipliers for human scientists and clinicians.
In summary, this chapter has sketched how genomic foundation models extend beyond academic benchmarks into practical levers for drug discovery and biotech:
- Turning variant and regulatory predictions into target discovery and validation pipelines.
- Designing and interpreting functional genomics screens that probe mechanism and vulnerability.
- Building richer biomarkers and patient stratification schemes for trials.
- Embedding GFMs into industrial data platforms and MLOps.
The companion chapters in Part V, on clinical risk prediction (Chapter 18) and pathogenic variant discovery (Chapter 19), zoom into specific application domains using the conceptual toolkit laid out here.