Genomic Foundation Models
Introduction
This book serves three audiences with different entry points:
| Your Background | Start Here | Then Focus On |
|---|---|---|
| Genomics / Statistical Genetics | Skim Part I (familiar territory), then Part II (architectures) | Parts III-IV for learning and foundation models; Part VI for responsible deployment |
| Machine Learning / Deep Learning | Part I (genomic data and classical methods) | Parts II-IV (architectures to foundation models); Part III for genomics-specific pitfalls |
| Clinical / Translational Research | Part I (2 Data Landscape, 4 Classical Variant Prediction) | Part VI (responsible deployment); Part VII (clinical applications) |
Reading modes:
- Cover-to-cover: The progression is cumulative; later parts build on earlier ones
- Reference: Use the chapter overview callouts to find specific topics
- Deep dive: Each chapter stands alone with explicit prerequisites listed
Supporting materials:
- Appendix A: Deep learning primer for readers needing ML background
- Appendix F: Glossary of terms spanning genomics, ML, and clinical domains
A single fertilized egg divides into trillions of cells sharing essentially the same genome, yet these cells differentiate into over two hundred distinct types, each with characteristic patterns of gene expression, chromatin accessibility, and regulatory state. The instructions for this differentiation are written in the genome itself: in enhancers and silencers distributed across hundreds of megabases, in splice sites that determine which exons join to form mature transcripts, in three-dimensional chromatin contacts that bring distant regulatory elements together. Reading these instructions requires understanding a regulatory grammar that evolution wrote over billions of years but never documented.
Classical computational approaches attacked this problem piecemeal. One model predicted splice sites from local sequence context. Another identified transcription factor binding motifs. A third scored variant pathogenicity using evolutionary conservation. Each required hand-crafted features, curated training sets, and careful validation within a narrow domain. Insights rarely transferred: a model trained to recognize promoters knew nothing about enhancers, and neither could predict how a single nucleotide change might alter splicing. The result was a fragmented landscape where each biological question demanded its own specialized tool.
Foundation models represent a fundamentally different approach. By training on vast corpora of genomic sequence with self-supervised objectives, these models learn representations that capture regulatory logic without explicit supervision on any particular task. The same model that predicts masked nucleotides can, after minimal adaptation, predict chromatin accessibility, identify splice sites, score variant effects, and distinguish pathogenic mutations from benign polymorphisms. This capacity for transfer learning suggests that foundation models have learned something general about how genomes encode function. Understanding what they have learned, how to deploy them effectively, and where they still fail defines the central challenge for practitioners in this field.
Why Foundation Models for Genomics?
Traditional genomic modeling has been overwhelmingly task-specific. A variant caller is tuned to distinguish sequencing errors from true variants in a particular sequencing platform and sample type. A supervised convolutional network predicts a fixed set of chromatin marks for a specific cell line. A polygenic risk score is fit for one trait, in one ancestry group, using data from one biobank. These models can achieve excellent performance in the settings they were designed for, but they often transfer poorly to new assays, tissues, ancestries, or institutions. When the input distribution shifts, whether because of a new sequencing chemistry, a different population, or a novel cell type, performance degrades in ways that are difficult to anticipate.
Foundation models address this fragility through three interrelated strategies. First, they leverage scale: training on massive, heterogeneous datasets spanning multiple assays, tissues, species, and cohorts forces the model to learn representations that capture shared biological structure rather than dataset-specific artifacts. Second, they employ self-supervised objectives that do not require manual labels, allowing them to exploit the vast quantities of unlabeled sequence data, perturbation screens, and population variation that genomics generates. Third, they are designed for reusability: rather than training a new model for each task, practitioners probe, adapt, or fine-tune a shared backbone, amortizing the cost of representation learning across many downstream applications.
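To make the reusability strategy concrete, the sketch below freezes a small stand-in encoder and trains only a logistic-regression probe on mean-pooled embeddings. It assumes PyTorch and scikit-learn are available; the TinyEncoder class, the toy TATA-motif task, and every name in it are illustrative placeholders rather than any published model or dataset.

```python
# Linear-probing sketch: a frozen "backbone" produces embeddings, and only a
# small supervised head is trained on top. TinyEncoder stands in for a real
# pretrained DNA language model; in practice you would load published weights.
import torch
import torch.nn as nn
import numpy as np
from sklearn.linear_model import LogisticRegression

VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3}

class TinyEncoder(nn.Module):
    """Stand-in backbone: token embedding followed by transformer encoder layers."""
    def __init__(self, d_model: int = 64, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB), d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                           dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, length) -> embeddings: (batch, length, d_model)
        return self.encoder(self.embed(tokens))

def embed_sequences(model: nn.Module, seqs: list[str]) -> np.ndarray:
    """Mean-pool per-position embeddings into one vector per sequence."""
    model.eval()
    with torch.no_grad():
        tokens = torch.tensor([[VOCAB[b] for b in s] for s in seqs])
        return model(tokens).mean(dim=1).numpy()

# Toy labeled task: does the 50-bp random sequence contain the motif "TATA"?
rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list("ACGT"), size=50)) for _ in range(200)]
labels = np.array([int("TATA" in s) for s in seqs])

backbone = TinyEncoder()              # frozen: never updated during probing
X = embed_sequences(backbone, seqs)   # embeddings become fixed features
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("probe training accuracy:", probe.score(X, labels))
```

Swapping the placeholder backbone for published pretrained weights, and the toy labels for an assay of interest, yields linear probing in its usual form; full fine-tuning instead unfreezes some or all backbone parameters, trading compute for task-specific capacity.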
The extent to which this paradigm delivers on its promises in genomics remains an active research question. Some tasks benefit dramatically from pretrained representations; others show marginal improvement over strong classical baselines. Transfer across species, cell types, and assays works better in some settings than others. The computational costs of training and deploying large models create practical constraints that vary across research and clinical environments. Foundation models are not the answer to every genomic problem. Effective practice requires frameworks to evaluate when these approaches help, when simpler methods suffice, and how to design analyses that exploit the strengths of modern architectures while remaining alert to their limitations.
Recurring Themes
Throughout the book, we return to these fundamental questions:
- What did the model actually learn? Distinguishing genuine biological insight from spurious correlations and dataset artifacts
- When does complexity help? Identifying where foundation models add value over simpler approaches
- How do we know it works? Designing evaluations that predict real-world performance rather than benchmark success
- What can go wrong? Anticipating failure modes in deployment, especially in clinical settings
- How do we use it responsibly? Navigating the gap between technical capability and appropriate application
Several threads run through the book, and individual chapters can be read as different perspectives on the same underlying questions.
The co-evolution of data and architecture is one such thread. Early variant effect predictors relied on hand-engineered features and shallow models trained on modest curated datasets. Convolutional networks enabled direct learning of regulatory motifs and local grammar from raw sequence, but their fixed receptive fields limited their reach. Transformers and other long-context architectures opened the door to capturing broader regulatory neighborhoods and chromatin structure. Foundation models push toward representations that span multiple assays, tissues, and organisms. At each stage, the question is not simply whether the model is more sophisticated, but how the available data constrain what the model can sensibly learn.
Scaling laws and emergent capabilities represent a related concern. As models grow larger and train on more data, certain capabilities appear discontinuously rather than gradually. The relationship between parameters, training data, and compute follows predictable patterns that inform practical decisions about model development. Understanding these scaling dynamics helps practitioners decide when to train larger models, when existing models suffice, and what capabilities to expect at different scales.
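As a concrete reference point, empirical scaling studies of language models often fit pretraining loss with a simple additive power law in parameters and data. The form below follows that literature and is offered only as orientation; whether the same functional form and exponents hold for genomic corpora remains an open empirical question.

```latex
% Illustrative scaling form: N is the parameter count, D the number of training
% tokens; E, A, B, \alpha, and \beta are constants fitted empirically for a
% given model family and training corpus.
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Read this way, a fixed compute budget implies a trade-off between making the model larger and training it on more data, which is the practical decision such scaling analyses are meant to inform.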
Context length and genomic geometry present persistent challenges. Many genomic phenomena are intrinsically non-local: enhancers regulate genes across hundreds of kilobases, chromatin loops bring distal elements into contact, and polygenic effects distribute risk across thousands of variants genome-wide. How models represent these long-range dependencies, what architectural choices enable or constrain their reach, and what is gained or lost as context windows scale remain central questions for genomic deep learning.
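One reason context length is architecturally expensive is easy to state: full self-attention compares every pair of positions, while convolutional and FFT-based long-convolution layers grow roughly linearly or near-linearly, trading exhaustive pairwise interaction for locality or structured mixing. The asymptotic costs below are the standard figures, given here only as orientation for the architecture chapters.

```latex
% Approximate per-layer cost of processing a context of L tokens
% (k is a fixed convolutional kernel width).
\text{full self-attention: } O(L^{2}) \qquad
\text{convolution: } O(kL) \qquad
\text{FFT-based long convolution: } O(L \log L)
```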
The distinction between prediction and design cuts across multiple chapters. Most current models are used as predictors: given a sequence and context, what molecular or phenotypic outcome is expected? The same models can also be embedded in design workflows, from variant prioritization and library construction to therapeutic sequence optimization. Foundation models change where the boundary lies between analysis and experimental planning, and they introduce new failure modes when generative or optimization objectives are misspecified.
Evaluation connects benchmark performance to real-world decisions. Benchmark scores are seductive and easy to compare, but biological and clinical decisions are messy, multi-objective, and constrained by data drift, confounding, and poorly specified endpoints. A recurring theme is the gap between state-of-the-art metrics on held-out test sets and actual impact in research or clinical deployment. Careful evaluation, confounder analysis, uncertainty quantification, and calibration can narrow that gap, but only when practitioners understand what their metrics actually measure.
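The short sketch below separates two properties that benchmark tables often conflate: discrimination (how well a model ranks positives above negatives, summarized by auROC) and calibration (whether predicted probabilities match observed frequencies). It uses simulated scores and assumes NumPy and scikit-learn; no real model or dataset is involved.

```python
# Discrimination vs. calibration: a model can rank well (high auROC) while its
# probabilities are badly miscalibrated. The scores below are simulated.
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=2000)

# Reasonably well-separated scores: negatives near 0.3, positives near 0.7.
raw = np.clip(rng.normal(loc=0.3 + 0.4 * y_true, scale=0.15), 0.01, 0.99)
# A monotone transform preserves the ranking (auROC) but pushes scores toward 1,
# producing overconfident probabilities.
overconfident = raw ** 0.3

for name, scores in [("raw", raw), ("overconfident", overconfident)]:
    auroc = roc_auc_score(y_true, scores)
    brier = brier_score_loss(y_true, scores)
    frac_pos, mean_pred = calibration_curve(y_true, scores, n_bins=10)
    gap = np.abs(frac_pos - mean_pred).max()   # worst per-bin reliability gap
    print(f"{name:>13}: auROC={auroc:.3f}  Brier={brier:.3f}  "
          f"max calibration gap={gap:.2f}")
```

A monotone transform of the scores leaves the auROC untouched while making the probabilities substantially overconfident, which is precisely the kind of failure that reliability analysis catches and a leaderboard metric does not.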
Interpretability and mechanism warrant sustained attention. Interpretability is not optional decoration but a design constraint that shapes how models should be built and evaluated. Saliency maps, motif extraction, and mechanistic analyses can deepen understanding of what a model has learned, but they can also provide false comfort when applied to confounded or brittle representations. Distinguishing genuine biological insight from pattern-matching artifacts requires both technical tools and careful experimental design.
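As a concrete anchor for what a saliency map is, the sketch below computes a gradient-times-input attribution for a toy, untrained convolutional scorer; it assumes PyTorch, and the model, sequence, and resulting scores are stand-ins with no biological meaning. The caveats above apply with full force once a real model is substituted.

```python
# Gradient-times-input saliency on a toy convolutional scorer. The model is an
# untrained stand-in; with a real trained model the same three steps apply:
# one-hot encode, backpropagate the scalar score, multiply gradient by input.
import torch
import torch.nn as nn

BASES = "ACGT"

def one_hot(seq: str) -> torch.Tensor:
    x = torch.zeros(4, len(seq))
    for i, base in enumerate(seq):
        x[BASES.index(base), i] = 1.0
    return x

class ToyScorer(nn.Module):
    """Stand-in sequence model: convolution over one-hot bases, pooled to a score."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(4, 8, kernel_size=7, padding=3)
        self.head = nn.Linear(8, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.conv(x)).mean(dim=-1)  # (batch, channels)
        return self.head(h).squeeze(-1)            # (batch,)

model = ToyScorer()
x = one_hot("ACGTTTAGGCTATAAAGGCCTTAACG").unsqueeze(0)  # shape (1, 4, L)
x.requires_grad_(True)

score = model(x)[0]   # scalar score for the single input sequence
score.backward()      # gradient of the score with respect to every base
saliency = (x.grad * x.detach()).sum(dim=1)[0]  # gradient x input per position
print(saliency)
```

High attribution at a position indicates that the score is locally sensitive to that base, not that the base is biologically causal; that gap is exactly where false comfort can creep in.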
Typography and Formatting
Computational biology, machine learning, and clinical genomics each have distinct conventions for technical terminology. ML researchers recognize VCF as a file format; genomicists know BRCA1 as a tumor suppressor gene; clinicians understand gnomAD as a variant database. Typographic conventions distinguish these categories, helping specialists navigate unfamiliar domains while respecting established standards.
The typography system identifies canonical terms that appear in the glossary, distinguishes biological entities from computational infrastructure, and maintains clean prose that does not overwhelm with formatting. Each format choice must earn its place by genuinely aiding comprehension rather than adding visual noise. Databases like gnomAD appear constantly throughout genomic analyses; italicizing every mention would create clutter without improving clarity. In contrast, model names like Enformer and gene names like BRCA1 function as subjects in the narrative: proper nouns that benefit from visual distinction.
The hierarchy is simple: Bold marks glossary terms on first mention only. Italics mark proper nouns that function as subjects or actors in the narrative (models, genes, mathematical variables, Latin terms). Regular text with careful capitalization handles databases, consortia, and resources. Monospace signals computational infrastructure (file formats, code, command-line tools). Most prose remains unformatted, with typography providing navigation aids rather than constant emphasis.
Glossary terms appear in bold on first mention only: “The transformer architecture revolutionized sequence modeling.” Subsequent mentions use regular text. This applies to machine learning concepts (attention mechanism, fine-tuning, embeddings), genomic concepts (single nucleotide polymorphism (SNP), enhancer, phasing), clinical terms (variant of uncertain significance (VUS), penetrance), and statistical concepts (area under ROC curve (auROC), calibration).
Model names use italics throughout: Enformer, DNABERT, AlphaFold, DeepVariant, SpliceAI. Gene and protein names follow biological convention with italics: BRCA1, TP53, CFTR, CYP2D6. Mathematical variables in prose also use italics: “where n represents sequence length” or “attention score between positions i and j”. Latin and foreign terms are italicized: in silico, de novo, ab initio, in trans.
Monospace formatting signals computational elements. File formats use monospace: VCF, BAM, FASTA, FASTQ, BED, GTF. Code elements, including function and parameter names (forward(), batch_size), packages (transformers, torch), and command-line tools (bedtools, samtools, GATK), also use monospace.
Databases, consortia, and resources use regular text with careful capitalization: gnomAD, ClinVar, ENCODE, GTEx, UniProt. Sequencing technologies (Illumina, PacBio HiFi, Oxford Nanopore) and biochemical assays (ATAC-seq, ChIP-seq, RNA-seq) similarly use regular text. This reduces visual clutter in passages that reference multiple data sources while preserving clarity through distinctive capitalization patterns.
Structure and Organization
Seven parts span thirty-two chapters, with six appendices providing supplementary material. Each part can be read independently, but the progression is cumulative.
Part I: Genomic Foundations lays the genomic and statistical groundwork that later models rest on. 1 From Reads to Variants introduces next-generation sequencing, alignment, and variant calling, highlighting sources of error and the evolution from hand-crafted pipelines to learned variant callers. 2 Data Landscape surveys the core data resources that underlie most modern work: reference genomes, population variation catalogs, clinical variant databases, and functional genomics consortia such as ENCODE and GTEx. 3 GWAS and Polygenic Scores reviews genome-wide association studies, linkage disequilibrium, fine-mapping, and polygenic scores, emphasizing what these variant-to-trait associations do and do not tell us about mechanism. 4 Classical Variant Prediction covers conservation-based and machine-learning-based variant effect predictors such as CADD, including their feature sets, label construction, and issues of circularity and dataset bias. Together, these chapters answer a foundational question: what data and pre-deep-learning tools form the backdrop that any genomic foundation model must respect, integrate with, or improve upon?
Part II: Architectures introduces the conceptual and technical foundations of modern sequence modeling. 5 Tokens and Embeddings examines how genomic and protein sequences are converted into model-compatible representations, covering one-hot encodings, k-mers, byte-pair encodings, learned embeddings, and position encodings, and shows how these choices shape downstream model behavior. 6 Convolutional Networks surveys the convolutional approaches that established the field of genomic deep learning, including DeepSEA, Basset, and SpliceAI, analyzing what they learn about motifs and regulatory grammar and where their fixed receptive fields impose limitations that motivate attention-based architectures. 7 Transformers and Attention provides a detailed treatment of attention mechanisms, position encodings, and transformer architectures, with emphasis on how these ideas translate from language to biological sequence.
Part III: Learning & Evaluation addresses how models learn from data and how we assess what they have learned. 8 Pretraining Strategies covers pretraining objectives, from masked language modeling and next-token prediction to contrastive and generative approaches, examining how self-supervision extracts structure from unlabeled biological data. 9 Transfer Learning Foundations addresses transfer learning, domain adaptation, and few-shot learning, asking when and how pretrained representations generalize to new tasks, species, and data modalities. 10 Adaptation Strategies examines parameter-efficient fine-tuning methods that adapt foundation models to specific tasks. 11 Benchmark Landscape surveys evaluation benchmarks and methodology; 12 Evaluation Methods develops rigorous evaluation practices, including splitting strategies, metric selection, and baseline comparisons; and 13 Confounding and Data Leakage addresses the confounding and leakage issues that pervade genomic evaluation.
Part IV: Foundation Model Families surveys the major foundation model families, organized by modality, and establishes variant effect prediction as the integrating application. 14 Foundation Model Paradigm develops a working definition and taxonomy of foundation models in genomics, distinguishing them from earlier supervised approaches and examining scaling laws that characterize how model capabilities change with size and data. 15 DNA Language Models covers DNA language models such as DNABERT, Nucleotide Transformer, HyenaDNA, and Evo, tracing their training corpora, objectives, evaluation suites, and current capabilities. 16 Protein Language Models describes large protein language models trained on evolutionary sequence databases, their emergent structure and function representations, and applications to structure prediction and design. 17 Regulatory Models covers hybrid CNN-transformer and related architectures designed for long genomic contexts, such as Enformer and Borzoi, which predict regulatory readouts over tens to hundreds of kilobases. 18 Variant Effect Prediction serves as a capstone that integrates these model families, examining how protein-based approaches such as AlphaMissense and DNA-based approaches such as splicing and regulatory models combine to address variant effect prediction, the central interpretive challenge that motivates the field.
Part V: Cellular Context examines how foundation model principles extend beyond one-dimensional sequence to embrace cellular and systems-level biology. 19 RNA Structure and Function extends beyond splicing to RNA structure prediction and RNA foundation models, examining how secondary structure and functional context inform representation learning. 20 Single-Cell Models covers foundation models for single-cell transcriptomics and epigenomics, showing how transformer architectures adapt to the unique characteristics of these data types. 21 3D Genome Organization addresses the three-dimensional organization of the genome, from chromatin loops and TAD boundaries to emerging spatial transcriptomics foundation models, examining how 3D structure provides the missing link between sequence and regulatory function. 22 Graph and Network Models turns to graph neural networks and network-based approaches, framing these not as alternatives to sequence models but as higher-level reasoning systems that consume foundation model embeddings as node features. 23 Multi-Omics Integration broadens the view to multi-omics integration, exploring how models can jointly represent genomic, transcriptomic, proteomic, and clinical information to connect sequence variation to phenotype across multiple layers of biological organization.
Part VI: Responsible Deployment develops frameworks for assessing what models actually learn and how reliably they perform. 24 Uncertainty Quantification addresses uncertainty quantification, examining calibration, epistemic versus aleatoric uncertainty, and practical methods such as ensembles and conformal prediction that help models express when they do not know. 25 Interpretability explores interpretability tools from classical motif discovery and attribution methods to emerging mechanistic approaches, asking when these tools reveal genuine biological mechanisms and when they provide false comfort. 26 Causality examines causal inference approaches and their intersection with foundation models. 27 Regulatory and Governance concludes with regulatory and ethical considerations.
Part VII: Applications & Frontiers moves from methods to end-to-end workflows in research and clinical practice. 28 Clinical Risk Prediction discusses clinical risk prediction that combines genomic features with electronic health records and environmental data, focusing on discrimination, calibration, fairness, and deployment in health systems. 29 Rare Disease Diagnosis examines how foundation models fit into rare disease and cancer workflows, including variant prioritization pipelines, integration with family and tumor-normal data, and laboratory validation. 30 Drug Discovery looks at how genomic foundation models intersect with target discovery, functional genomics screens, and biomarker development in pharmaceutical and biotechnology settings. 31 Sequence Design covers generative applications, from protein design and therapeutic sequence optimization to synthetic biology and bioengineering workflows. 32 Frontiers and Synthesis examines emerging directions and open problems.
Six appendices provide supporting material. Appendix A — Deep Learning Primer offers a compact introduction to neural networks, CNNs, transformers, training, and evaluation for readers who want enough machine learning background to engage with the main chapters without consulting external references. Appendix B — Deployment and Compute covers practical considerations for deploying genomic foundation models, including computational requirements, hardware selection, and infrastructure concerns. Appendix C — Data Curation provides guidance on constructing training datasets, covering data sources, quality filtering, deduplication, and contamination detection. Appendix D — Model Reference provides a comprehensive reference table of models discussed in the main chapters, with architecture summaries, training data, and key citations. Appendix E — Resources offers a curated collection of datasets, software tools, courses, and papers for deeper exploration. Appendix F — Glossary defines key terms spanning genomics, machine learning, and clinical applications.
A Framework, Not a Snapshot
Genomic foundation models represent a moving target: architectures evolve, datasets expand, and evaluation standards shift. A snapshot of the state of the art in 2024 would be obsolete before publication. The goal instead is to provide a framework for reasoning about new models as they appear, grounding readers in principles stable enough to outlast any particular architecture or benchmark.
After working through this book, you should be able to:
- Evaluate new models by understanding their data, architecture, objectives, and evaluation methodology
- Design analyses that use foundation models appropriately, knowing when they add value and when simpler methods suffice
- Recognize pitfalls in training, evaluation, and deployment, especially the genomics-specific confounds that invalidate standard ML practices
- Communicate effectively across disciplines, bridging genomics, machine learning, and clinical translation
- Decide where foundation models genuinely advance your work and where they introduce unnecessary complexity or risk
In practice, that means placing each new model within the landscape of data, architecture, objective, and application; using foundation models as components (whether as feature extractors, priors, or simulators) without overclaiming what they can do; and exercising particular care in clinical settings, where errors have real consequences and simpler methods often remain sufficient.
The journey begins with foundations: how raw reads become variants, how variants become the datasets on which all subsequent models depend, and where errors in this upstream process create systematic challenges that propagate through everything built upon them.