14  Foundation Model Paradigm

Four questions about one variant. Four separate models. Months of waiting.

Chapter Overview

Estimated reading time: 40-50 minutes

Prerequisites: This chapter builds on concepts from Chapter 6 (convolutional architectures), Chapter 7 (attention mechanisms), and Chapter 8 (self-supervised learning). Familiarity with basic neural network training and the distinction between supervised and unsupervised learning will help you follow the discussion.

Learning Objectives: After completing this chapter, you should be able to:

  1. Define what distinguishes a genomic foundation model from a task-specific deep learning model
  2. Explain the scaling law framework and its implications for model development decisions
  3. Compare the four major families of genomic foundation models and their respective strengths
  4. Apply the build-versus-use decision framework to choose appropriate models for specific applications
  5. Evaluate foundation models across multiple tasks rather than single benchmarks

Key Insight: Foundation models represent a paradigm shift from training separate models for each task to pretraining a single general-purpose model that can be adapted to many downstream applications. This changes not just how models are built, but how researchers interact with them, shifting from training practitioners to adaptation specialists.

In 2019, a family arrived at a genetics clinic after their infant was diagnosed with a novel cardiac arrhythmia syndrome. Whole-genome sequencing identified a candidate variant, but determining whether it was pathogenic required answering three regulatory questions: Would this variant disrupt a transcription factor binding site? Would it alter chromatin accessibility in cardiomyocytes? Could it create a cryptic splice site? The research team trained three separate models, each requiring its own data curation, hyperparameter search, and validation pipeline. When the fourth question arose (would the variant affect 3D chromatin looping?), they had to start from scratch again. Months passed. The family waited. Every new biological question about the same variant demanded a new model. Knowledge learned for one task provided no benefit for another, even though all four questions probed different consequences of the same few nucleotides.

This scenario captures the fragmentation that defined genomics’ early deep learning era. One convolutional network predicted transcription factor binding; another predicted chromatin accessibility; a third classified splice sites. Each model required its own training data, its own hyperparameter tuning, its own validation strategy. The field accumulated tools without accumulating shared knowledge.

Foundation models promise a different approach: train once on vast genomic data, then adapt to many downstream tasks. A single model pretrained on billions of nucleotides might provide representations useful for regulatory prediction, variant interpretation, sequence design, and cross-species analysis simultaneously. Rather than curating labeled datasets for each new question, researchers could fine-tune existing models on modest task-specific data, leveraging knowledge the model acquired during pretraining. The efficiency gains would be substantial; the conceptual shift would be larger still. Where specialized models treat each genomic question as independent, foundation models assume that shared patterns underlie diverse biological phenomena and that representations capturing those patterns should transfer.

Why should patterns transfer across such different tasks? The key insight is that many genomic prediction tasks share underlying structure. Consider how learning to read opens access to novels, newspapers, scientific papers, and poetry: a single skill unlocks diverse applications because all written text shares common structures (letters, words, grammar). Similarly, diverse genomic tasks share common structures: motifs, regulatory grammar, and evolutionary signatures. Predicting whether a variant disrupts a transcription factor binding site requires understanding sequence motifs; predicting whether that same variant affects chromatin accessibility also requires understanding sequence motifs. A model that learns robust motif representations during pretraining can apply those representations to both tasks. More fundamentally, evolution has shaped genomic sequence through shared mechanisms: natural selection acts on function, and functional sequences share statistical properties regardless of which specific function is measured. A model that learns these shared properties during pretraining has learned something relevant to many downstream questions.

This paradigm, which transformed natural language processing and protein structure prediction, carries both promise and peril for genomics. Pretraining at scale requires computational resources beyond most academic budgets. Adaptation to specific tasks demands expertise in transfer learning techniques that remain poorly understood (Chapter 9). Predictions from general-purpose models may lack the precision of specialized alternatives trained directly on task-specific data. The decision to use, adapt, or build foundation models involves tradeoffs that depend on available resources, target applications, and acceptable uncertainty.

14.1 From Task-Specific Models to Foundation Models

The history of computational genomics reveals a consistent pattern: models become more general while maintaining or improving task-specific performance. Hand-crafted scores such as CADD and SIFT established that integrating diverse genomic annotations could improve variant pathogenicity prediction (Rentzsch et al. 2019; Schubach et al. 2024) (Chapter 4). These approaches relied on expert feature engineering, combining conservation scores, functional annotations, and population frequency data through ensemble methods or logistic regression, and they faced a ceiling imposed by the features available for engineering, a limitation examined in Section 4.6.4.

Task-specific deep learning models demonstrated that neural networks could learn relevant features directly from sequence. DeepSEA predicted chromatin accessibility and transcription factor binding from 1 kb sequences using convolutional architectures (J. Zhou and Troyanskaya 2015). ExPecto extended this approach to gene expression prediction by modeling regulatory elements across multiple cell types (J. Zhou et al. 2018). Sei organized regulatory predictions into interpretable sequence classes through unsupervised clustering (Chen et al. 2022). SpliceAI achieved near-perfect splice site prediction through dilated convolutions over 10 kb contexts, though its architecture was purpose-built for this specific task and could not generalize to other regulatory prediction problems (Chapter 6). Enformer scaled sequence-to-function modeling to 200 kb windows and thousands of chromatin tracks through transformer architectures (Avsec et al. 2021).

These models succeeded within their specific domains but remained difficult to repurpose. Training a DeepSEA model required chromatin accessibility data. Using SpliceAI for regulatory prediction would require complete retraining on different labels. Each application domain needed its own model, trained from scratch on task-specific data. The fundamental limitation was not model capacity but training paradigm: supervised learning on narrow tasks produced narrow capabilities.

Predict Before You Look

Before viewing the figure below, make a prediction: In the task-specific paradigm, if a researcher needs to solve five different genomic prediction problems (e.g., splice site prediction, enhancer identification, transcription factor binding, chromatin accessibility, and gene expression), how many separate models would they need to train? In the foundation model paradigm, how many models would be trained from scratch? What is the key difference in how knowledge is reused between these two approaches?

[Figure: (A) Task-specific paradigm: isolated models for isolated tasks. (B) Foundation model paradigm: shared representations, efficient adaptation.]

Figure 14.1: The paradigm shift from task-specific to foundation models. (A) The task-specific paradigm trains separate models from scratch for each application. Knowledge about sequence patterns cannot transfer between tasks, requiring substantial labeled data for each new application. (B) The foundation model paradigm pretrains a single large model on diverse unlabeled sequences, capturing general biological knowledge in reusable representations. Small task-specific adapters enable efficient transfer to diverse downstream tasks. This paradigm shift mirrors developments in natural language processing, where pretrained language models revolutionized the efficiency and capability of text-based AI systems.

Sequence language models introduced the self-supervised pretraining paradigm (Chapter 8) to genomics. DNABERT applied masked language modeling to DNA sequences, demonstrating that general representations could be learned without task-specific labels (Ji et al. 2021) (Section 15.2). ESM and ESM-2 showed that protein language models pretrained on sequence alone could transfer effectively to structure prediction, variant effect prediction, and protein design (Rives et al. 2021; Lin et al. 2022) (Section 16.1). The Nucleotide Transformer family scaled DNA language modeling to cross-species training corpora (Dalla-Torre et al. 2023) (Section 15.3). HyenaDNA used implicit convolutions to reach million-token contexts at single-nucleotide resolution (Nguyen et al. 2023) (Section 15.5.1).

The transition from task-specific to foundation models changes the relationship between model developers and users. Task-specific models deliver predictions as their primary product. Foundation models deliver representations that users adapt to their own tasks through the transfer learning techniques examined in Chapter 9. This distinction affects everything from model architecture design to evaluation strategies to deployment infrastructure.

Stop and Think

Before reading about the formal definition of foundation models, consider: what properties would you require for a model to qualify as a “foundation” for multiple downstream tasks? Think about training data, output types, and how users would interact with the model.

14.2 Defining Genomic Foundation Models

The term “foundation model” appears frequently in genomics literature, sometimes applied to any large neural network trained on biological sequences. The Stanford HAI report formally defined foundation models as “models trained on broad data at scale such that they can be adapted to a wide range of downstream tasks” (Bommasani et al. 2022). This definition, while originating from general AI discourse, captures essential properties that distinguish true genomic foundation models from task-specific deep learning approaches. Establishing clear criteria helps separate these categories.

14.2.1 Essential Properties

The defining characteristic of genomic foundation models is their capacity to serve purposes far beyond their original training objectives. This generality emerges from several interconnected properties.

Foundation models train on entire genomes, pan-genomic sequence collections, or large assay compendia with minimal supervision. Their pretraining objectives include masked language modeling, next-token prediction, denoising, or multi-task sequence-to-function prediction. Critically, these objectives do not require dense task-specific labels for every training example. A model that requires annotated enhancers or curated pathogenic variants for every training instance does not qualify as a foundation model under this criterion.

The representations these models produce must prove useful across many downstream tasks. Embeddings can be extracted through forward passes and reused with simple linear probes or lightweight adapter modules rather than requiring full model retraining. The representations should encode biological information at multiple scales, from local sequence motifs to long-range regulatory grammar.

Transfer capability extends across multiple dimensions: different assays (from chromatin accessibility to gene expression), different tissues and cell types, different species, and different variant types (from SNVs to structural variants). Evidence of broad transfer requires evaluation across multiple benchmarks rather than demonstration of performance on a single task (Chapter 11).

Foundation models operate at a scale that would be impractical for task-specific training. Some scale context length, as HyenaDNA scales to million-token windows at single-nucleotide resolution. Others scale parameter count, as the ESM and Nucleotide Transformer families reach billions of parameters. Still others scale data diversity through pan-genomic pretraining across hundreds of species or integration of many assays and cell types. The scaling dimension chosen reflects the model’s intended applications and architectural constraints.

Finally, foundation models typically expose consistent APIs for common operations. These include embedding extraction for sequences or variants, sequence probability scoring, and mask-based in silico mutagenesis for variant effect prediction. Models distributed through repositories such as Hugging Face often include documented recipes for downstream fine-tuning and example notebooks demonstrating common use cases.
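These operations usually reduce to a handful of library calls. The sketch below illustrates the embedding-extraction operation using the Hugging Face transformers API; the checkpoint name `example/dna-lm` is a placeholder rather than a real model ID, and it assumes a tokenizer that accepts raw nucleotide strings, which is true of most published DNA language models but should be checked per model.

```python
# Minimal sketch of embedding extraction from a genomic foundation model.
# The checkpoint name is a placeholder; substitute a real model ID.
import torch
from transformers import AutoTokenizer, AutoModel

checkpoint = "example/dna-lm"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
model.eval()

sequence = "ACGTAGCTAGCTTACGGATCCGATCGATCGTAGCTAGC"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Per-token embeddings: shape (1, n_tokens, hidden_dim)
token_embeddings = outputs.last_hidden_state
# Pooled sequence embedding: mean over tokens is a simple, common choice
sequence_embedding = token_embeddings.mean(dim=1)
print(sequence_embedding.shape)
```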

The following table summarizes the essential properties that distinguish foundation models from task-specific alternatives.

Table 14.1: Distinguishing properties of foundation models versus task-specific models.
| Property | Foundation Model | Task-Specific Model |
|---|---|---|
| Training data | Minimal/no labels; self-supervised | Dense task-specific labels required |
| Transfer capability | Many downstream tasks | Single task |
| Scale dimension | Parameters, context, or data diversity | Optimized for specific task |
| User interaction | Embeddings + adaptation | End-to-end predictions |
| Evaluation | Multi-task benchmarks | Single-task metrics |

Why does self-supervised pretraining enable transfer while supervised training does not? The difference lies in what the training objective encourages the model to learn. A model trained to predict splice sites learns features specifically useful for splice site prediction: the GT-AG consensus, branch point motifs, exonic splicing enhancers. These features may be irrelevant or even misleading for other tasks. In contrast, a model trained to predict masked nucleotides across the entire genome must learn features useful for reconstructing any genomic context. To succeed at this broad objective, the model must discover general principles: how motifs combine, how sequence composition varies across regions, what patterns distinguish functional from non-functional sequence. These general features transfer because they capture the underlying organization of the genome rather than task-specific shortcuts.

Field Overview

For a comprehensive survey of genomic foundation models as of 2024, including taxonomy, benchmarks, and applications, see Trop et al. (2024).

Reading Extension: Compare Trop et al.’s taxonomy to the four families introduced in this chapter. Where do their categories align with ours? Where do they diverge, and what might account for different organizational choices?

14.2.2 What Does Not Count

Many excellent genomic models fail one or more of these criteria and should not be classified as foundation models. Early versions of DeepSEA trained specifically on chromatin accessibility data from a limited set of cell types lack the generality and standardized interface of foundation models, though later iterations that integrate many assays begin to approach foundation model territory (J. Zhou and Troyanskaya 2015). SpliceAI predicts splicing outcomes exceptionally well but was designed for that specific task and provides neither general-purpose embeddings nor easy transfer to other genomic prediction problems (Jaganathan et al. 2019). Even a very large Enformer-like model trained solely on human chromatin tracks remains bound to its specific prediction interface despite its scale and sophistication (Avsec et al. 2021).

The distinction matters for several reasons. It affects evaluation strategy, since foundation models must be assessed across families of tasks rather than single benchmarks (Chapter 12). It affects integration into existing pipelines, since foundation models serve as feature extractors while task-specific models typically provide end-to-end predictions. It affects how we think about model development, since foundation model training requires different infrastructure and data curation than task-specific supervised learning.

14.2.3 Limitations of the Foundation Model Concept

The term “foundation” carries implications worth examining. Architectural foundations are static, load-bearing, and invisible once construction proceeds. Genomic foundation models share only the load-bearing property: they support downstream applications that would otherwise require independent construction. Yet unlike architectural foundations, these models remain visible and modifiable throughout their use. Fine-tuning adjusts the foundation itself rather than building atop an immutable base. The metaphor also implies that foundations precede and enable all subsequent work, but genomic foundation models often coexist with task-specific alternatives that outperform them on narrow benchmarks.

A more accurate metaphor might be “foundation” in the educational sense: a broad base of knowledge that enables specialized learning but continues to develop alongside it. The pretraining phase establishes general competence; adaptation refines that competence for specific purposes without abandoning the original learning. This framing better captures the dynamic relationship between pretrained representations and downstream tasks, though the architectural metaphor has become standard terminology.

Key Insight: The Foundation Model Criterion

A model qualifies as a foundation model not by its size or training cost, but by its demonstrated ability to transfer to diverse downstream tasks without full retraining. The critical test is: can users extract embeddings or apply lightweight adaptation to solve problems the original developers never anticipated? If the answer is yes across multiple task families, you have a foundation model. If the model only produces predictions for its original task, it remains task-specific regardless of scale.

14.3 Scaling Laws and Compute-Optimal Training

The success of foundation models in natural language processing rests partly on empirical scaling laws: predictable relationships between model size, training data, computational budget, and performance. Understanding these relationships guides resource allocation and model development decisions.

14.3.1 Chinchilla Framework and Genomic Constraints

Mathematical Content Ahead

The following subsection presents the mathematical formulation of scaling laws. The key intuition is that performance improves predictably with more parameters and more data, but the rate of improvement follows specific power laws. If you find the equations challenging, focus on the practical implications summarized after the mathematical derivation.

Hoffmann et al. formalized the relationship between model performance and scaling factors through a power law decomposition (Hoffmann et al. 2022):

\[ L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta} \tag{14.1}\]

where:

  • \(L\) is the cross-entropy loss on held-out data (in nats or bits)
  • \(N\) is the number of model parameters
  • \(D\) is the number of training tokens
  • \(E\) is the irreducible loss (entropy remaining even with infinite resources)
  • \(A\), \(B\) are empirically fitted scale constants
  • \(\alpha\), \(\beta\) are the scaling exponents (fitted to data)

For language models, the exponents \(\alpha\) and \(\beta\) both approximate 0.3, meaning that doubling parameters or data reduces the corresponding loss term by roughly 20%. The constant \(E\) represents irreducible loss (the entropy remaining even with infinite resources), while the parameter term \(A/N^\alpha\) quantifies gains from greater model capacity and the data term \(B/D^\beta\) captures gains from additional training examples.

Why does this decomposition take this particular form? The power law structure reflects the diminishing returns inherent in learning: the first million parameters capture the most common patterns, while each subsequent million captures progressively rarer regularities. The exponents near 0.3 indicate that performance improvements slow considerably as scale increases; you need roughly eight times the resources to halve the gap to optimal performance. The additive structure separates distinct bottlenecks: insufficient model capacity to represent learned patterns (the \(N\) term) versus insufficient data diversity to learn all relevant patterns (the \(D\) term). This decomposition explains why over-parameterized models trained on limited data, or under-parameterized models trained on vast data, both underperform balanced allocations.

Training cost constrains both parameters and data simultaneously through the compute budget \(C\) (measured in FLOPs):

\[ C \approx 6ND \tag{14.2}\]

This approximation holds because each training token requires roughly 6 floating-point operations per parameter: about 2 for the forward pass and 4 for the backward pass. Minimizing the loss in Equation 14.1 subject to this budget constraint yields:

\[ N_{\text{opt}} \propto C^{0.49}, \quad D_{\text{opt}} \propto C^{0.51} \tag{14.3}\]

These exponents, both near 0.5, encode the Chinchilla insight: model size and training data should scale approximately equally. Practical implementations target roughly 20 tokens per parameter for compute-optimal training.

Worked Example: Applying the Chinchilla Framework

Scenario: Your lab has a compute budget of \(10^{20}\) FLOPs for training a DNA language model. How should you allocate between model size and training data?

Step 1: Apply the 20-tokens-per-parameter heuristic. Using \(C \approx 6ND\) and \(D \approx 20N\):

\[10^{20} \approx 6 \times N \times 20N = 120N^2\]

\[N \approx \sqrt{10^{20}/120} \approx 9.1 \times 10^8 \text{ parameters}\]

Step 2: Calculate corresponding data requirement:

\[D \approx 20 \times 9.1 \times 10^8 \approx 1.8 \times 10^{10} \text{ tokens}\]

Interpretation: With this budget, you should target roughly 900 million parameters trained on roughly 18 billion tokens. Training a 3 billion parameter model on the same budget would leave it severely undertrained (about 5.6 billion tokens, fewer than 2 tokens per parameter), likely underperforming the smaller compute-optimal model.

Caveat for genomics: These ratios were derived for natural language. Genomic sequence may require different ratios due to its simpler alphabet and different statistical structure. When in doubt, empirically validate on held-out data.
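As a sanity check on this arithmetic, the allocation can be computed directly. The following is a minimal sketch assuming the Chinchilla approximation \(C \approx 6ND\) and a fixed 20-tokens-per-parameter target; the constants are borrowed from language modeling and are illustrative rather than fitted to genomic data.

```python
import math

def chinchilla_allocation(flops_budget: float, tokens_per_param: float = 20.0):
    """Split a compute budget into an approximately compute-optimal (N, D) pair.

    Uses C ~= 6 * N * D with D = tokens_per_param * N, so
    C ~= 6 * tokens_per_param * N**2.
    """
    n_params = math.sqrt(flops_budget / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = chinchilla_allocation(1e20)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")   # ~9.1e8 parameters, ~1.8e10 tokens

# Tokens available for an oversized 3B-parameter model on the same budget:
oversized = 3e9
print(f"tokens for a 3B model: {1e20 / (6 * oversized):.2e}")  # ~5.6e9, <2 per parameter
```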

The constants fitted for language models should not be assumed to hold for genomic tasks. DNA lacks the hierarchical compositional structure of natural language; regulatory grammar does not build meaning through recursive phrase structure the way sentences do. More critically, the 20-tokens-per-parameter guidance reflects optimization for next-token prediction loss on held-out text. Genomic foundation models often aim to learn representations that transfer to diverse downstream tasks (variant effect prediction, chromatin state inference, evolutionary constraint estimation), none of which were objectives during language model pretraining. The optimal balance between parameters and data may shift when the goal is representation learning rather than task-specific loss minimization.

14.3.2 Empirical Scaling in Genomic Models

Several genomic foundation model families have reported scaling experiments, though systematic scaling laws comparable to NLP remain elusive. The Nucleotide Transformer family provides perhaps the clearest genomic scaling data (Dalla-Torre et al. 2023). Performance on downstream benchmarks improves consistently with parameter count across models from 50 million to 2.5 billion parameters. The largest models (trained on multi-species data) outperform smaller models trained on human sequences alone, suggesting that cross-species data provides effective scaling even when human-specific performance is the target. Training compute scaled from approximately \(10^{19}\) to \(10^{21}\) FLOPs across the model family.

[Figure: (A) Loss vs. model size. (B) Downstream performance vs. model size. (C) Optimal allocation of compute.]

Figure 14.2: Scaling laws for genomic foundation models. (A) Pretraining loss decreases predictably with model parameters following a power law, enabling informed decisions about resource allocation. (B) Downstream task performance scales consistently across diverse tasks including contact prediction, secondary structure, and variant effects, demonstrating that larger models capture more transferable biological knowledge. (C) Compute-optimal scaling reveals the trade-off between model size and training data: for fixed compute budget, optimal performance requires balancing parameter count with training tokens. These scaling relationships, first established in natural language processing, extend to biological sequence models and guide foundation model development.

ESM-2 demonstrated similar scaling for protein language models, with performance on structure prediction and variant effect tasks improving smoothly from 8 million to 15 billion parameters (Lin et al. 2022). The largest ESM-2 models approach the structure prediction accuracy of AlphaFold2 using only single-sequence input, a capability entirely absent in smaller models. HyenaDNA focused on context length scaling rather than parameter scaling, demonstrating that million-token contexts at single-nucleotide resolution could be achieved through sub-quadratic architectures (Nguyen et al. 2023).

The scaling law framework has direct implications for model development decisions in genomics, though the constraints differ fundamentally from natural language processing. Unlike natural language, where text data is effectively unlimited, genomic sequence data faces hard constraints. Reference genomes for well-studied species total perhaps \(10^{11}\) to \(10^{12}\) nucleotides. Population-level variant data can expand this somewhat, but the effective diversity may be lower than raw counts suggest. In such data-constrained regimes, smaller models trained to convergence may outperform larger models that are undertrained.

Academic groups typically face stricter compute constraints than industry labs. Given fixed compute budgets, the Chinchilla framework suggests allocating resources toward longer training of smaller models rather than abbreviated training of larger models. A 500 million parameter model trained for 10 epochs on diverse genomic data may outperform a 5 billion parameter model trained for 1 epoch on the same data. Cross-species data offers a potential path around genomic data limitations. The Nucleotide Transformer and Evo families exploit this strategy, learning evolutionary patterns from diverse genomes that improve human-specific predictions.

Stop and Think

Consider the genomic data constraint: reference genomes contain roughly \(10^{11}\) to \(10^{12}\) nucleotides total. How does this compare to the trillions of tokens used to train large language models? What strategies might help overcome this data limitation for genomic foundation models?

14.3.3 Downstream Scaling: A Modified Framework

The Chinchilla scaling laws (Equation 14.1) govern pretraining loss, but foundation models are not deployed to minimize pretraining loss. They are adapted to downstream tasks: variant effect prediction, regulatory element classification, clinical risk stratification. A model that achieves excellent masked language modeling performance may fail when fine-tuned on limited task-specific labels. The critical question becomes: how do performance, required labeled data, and model capacity relate for downstream classification tasks built on foundation model embeddings?

Traditional ML scaling laws do not directly apply. The Chinchilla framework balances model parameters against pretraining tokens, but downstream tasks substitute a different set of constraints: embedding quality (determined by pretraining), labeled examples available for adaptation, and the alignment between pretraining distribution and task distribution. Recent work has begun to characterize these relationships empirically.

Mathematical Detail: Downstream Task Scaling

The downstream setting requires a modified framework because the bottleneck shifts from compute to labeled data quality. Based on a synthesis of the transfer learning scaling literature (e.g., Hernandez et al. 2021, along with more recent downstream-scaling analyses), downstream classifier performance can be decomposed:

\[ L_{downstream}(D_{ft}, E) = L_{irreducible} + \frac{A'}{E^{\alpha'}} + \frac{B'}{D_{ft}^{\beta'}} \]

where:

  • \(L_{downstream}\) = downstream task loss (classification or regression)
  • \(D_{ft}\) = number of labeled fine-tuning examples
  • \(E\) = embedding quality (proxy: pretraining loss, probe performance)
  • \(\alpha'\) = embedding quality exponent (how much better embeddings help)
  • \(\beta'\) = fine-tuning data exponent (how much more data helps)
  • \(L_{irreducible}\) = task-specific noise floor

Key differences from pretraining scaling:

  1. \(E\) replaces model parameters \(N\): embedding quality matters more than raw model size once pretrained
  2. \(\beta'\) can exceed pretraining exponents: better embeddings amplify data efficiency gains
  3. \(L_{irreducible}\) depends on task-embedding alignment: mismatched tasks may never reach good performance
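The exponents in this decomposition are empirical quantities. Under the simplifying assumption that embedding quality is held fixed (so only the \(D_{ft}\) term varies), \(\beta'\) can be estimated by a log-log regression of excess loss against the number of labeled examples. The sketch below uses synthetic numbers purely for illustration; real values would come from fine-tuning runs at several label budgets.

```python
import numpy as np

# Synthetic illustration: downstream loss at several label budgets, with an
# assumed irreducible loss. Real numbers would come from experiments.
n_labels = np.array([100, 300, 1_000, 3_000, 10_000], dtype=float)
loss = np.array([0.62, 0.48, 0.39, 0.33, 0.29])
irreducible = 0.25  # assumed task noise floor

# Excess loss ~ B' / D_ft**beta'  =>  log(excess) = log(B') - beta' * log(D_ft)
excess = loss - irreducible
slope, intercept = np.polyfit(np.log(n_labels), np.log(excess), deg=1)
beta_prime, B_prime = -slope, np.exp(intercept)
print(f"beta' ~= {beta_prime:.2f}, B' ~= {B_prime:.2f}")

# Extrapolate: labels needed to reach a target downstream loss
target = 0.30
needed = (B_prime / (target - irreducible)) ** (1.0 / beta_prime)
print(f"~{needed:.0f} labels to reach loss {target}")
```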

Empirical evidence from protein and DNA language models supports sample efficiency claims. The Nucleotide Transformer study demonstrated that fine-tuning with 1,000 labeled examples matched performance of models trained from scratch on full datasets across 18 genomic prediction tasks (Dalla-Torre et al. 2023). ULMFiT established the canonical result: 100 labeled examples with transfer learning matched 10,000 examples without transfer on text classification (Howard and Ruder 2018). These represent 10-100x reductions in labeled data requirements.

Stop and Think: Sample Efficiency Sources

Before reading on, consider: what properties of pretrained embeddings would enable a classifier to learn from fewer labeled examples? What types of patterns must the embeddings already capture?

Hint: Compare training a linear classifier on one-hot encoded sequence versus training it on embeddings from a pretrained DNA language model.

What determines sample efficiency? The literature identifies several factors. Pretraining-task alignment matters: a model pretrained on human regulatory sequences transfers more efficiently to human enhancer prediction than a model pretrained on bacterial genomes. Embedding dimensionality creates a tradeoff: higher-dimensional embeddings capture richer representations but require more labeled examples for linear probes to avoid overfitting. Layer selection proves task-dependent (Section 10.2): optimal representations for variant effect prediction may reside in different layers than optimal representations for cell type classification. Fine-tuning strategy affects efficiency: gradual unfreezing and discriminative layer-wise learning rates preserve pretrained knowledge while adapting to new tasks.

The zero-shot approach bypasses labeled data requirements entirely. Rather than training a downstream classifier, zero-shot methods use the pretrained model’s likelihood as a fitness proxy. For variant effect prediction, the log-likelihood ratio between mutant and wildtype sequences directly scores deleteriousness without any labeled pathogenicity examples (Section 18.2.1). ESM-1v demonstrated that zero-shot protein variant scoring outperformed supervised baselines on 17 of 41 deep mutational scanning benchmarks, establishing that pretraining alone can capture sufficient functional constraints for some prediction tasks (Meier et al. 2021).
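The likelihood-ratio computation is mechanically simple. The sketch below scores a single-nucleotide variant with a masked DNA language model by masking the variant position and comparing the model's probabilities for the reference and alternate bases. The checkpoint name is a placeholder, and the code assumes single-nucleotide tokenization; k-mer models require extra bookkeeping to map the variant position onto tokens.

```python
# Hedged sketch of zero-shot variant scoring with a masked language model.
# Assumes a single-nucleotide tokenizer; the checkpoint name is a placeholder.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

checkpoint = "example/dna-lm"  # placeholder model ID
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)
model.eval()

def llr_score(context: str, pos: int, ref: str, alt: str) -> float:
    """log P(alt | context) - log P(ref | context) at a masked position.

    Negative scores mean the model finds the alternate allele less likely
    than the reference, a common proxy for deleteriousness.
    """
    masked = context[:pos] + tokenizer.mask_token + context[pos + 1:]
    inputs = tokenizer(masked, return_tensors="pt")
    mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_idx]
    log_probs = torch.log_softmax(logits, dim=-1)
    ref_id = tokenizer.convert_tokens_to_ids(ref)
    alt_id = tokenizer.convert_tokens_to_ids(alt)
    return (log_probs[alt_id] - log_probs[ref_id]).item()

# Example: score a C>T change at position 25 of a 51-bp window
window = "ACGTAGCTAGCTTACGGATCCGATCCATCGTAGCTAGCAATGCGCTAGCTA"
print(llr_score(window, pos=25, ref="C", alt="T"))
```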

The workflow follows directly: start with zero-shot evaluation if task-appropriate, use linear probes to assess embedding quality with minimal labeled data (100-1,000 examples), and scale to full fine-tuning only when frozen embeddings prove insufficient. This inverts the traditional ML workflow, where supervised training is the default and unsupervised pretraining is an optional enhancement.

Open questions remain. What are the fitted values of \(\alpha'\) and \(\beta'\) for genomic classification tasks? How does class imbalance modify downstream scaling laws? Is there an emergence threshold below which foundation model embeddings provide no benefit over raw sequence features? These questions define the frontier of understanding how foundation models transform genomic prediction.

14.3.4 Emergent Capabilities

Perhaps the most intriguing aspect of foundation model scaling is the emergence of qualitatively new capabilities at sufficient scale. Emergence refers to abilities that are absent or negligible in smaller models but appear discontinuously as models grow. Think of it like learning to ride a bicycle: you cannot ride “a little bit”—you wobble and fall until suddenly, with enough practice, you can balance. The capability emerges all at once rather than improving gradually. Similarly, certain model capabilities remain effectively zero until the model crosses a threshold, then appear seemingly from nowhere.

In large language models, emergent capabilities include multi-step reasoning, code generation, and in-context learning. These capabilities appear at model scales of roughly \(10^{10}\) parameters and above, with no clear precursor in smaller models (Wei et al. 2022).

Genomic foundation models exhibit analogous emergence, though the capability thresholds are less well characterized. The most striking example involves structural understanding from sequence: ESM-2 at sufficient scale produces contact maps and secondary structure predictions from single sequences with accuracy approaching multiple sequence alignment methods like trRosetta (Lin et al. 2022). Smaller ESM models show no meaningful structural understanding. This capability emerges at approximately 650 million parameters and continues improving with scale.

Larger Nucleotide Transformer models transfer more effectively to novel species not seen during training (Dalla-Torre et al. 2023). The ability to generalize beyond training species appears to require sufficient model capacity to learn abstract regulatory principles rather than memorizing species-specific patterns. Similarly, foundation models at sufficient scale can predict variant effects without task-specific fine-tuning, using only the difference in likelihood between reference and alternative sequences. For example, ESM-1v computes log-likelihood ratios to predict protein variant pathogenicity in a zero-shot manner. This zero-shot capability requires models large enough to capture subtle sequence dependencies.

Few-shot approaches include task examples in the input context, allowing in-context learning without parameter updates. HyenaDNA demonstrated this capability for genomic tasks, suggesting that sufficiently large models with long context can adapt through prompts rather than training (Nguyen et al. 2023).

The practical implication is that capability thresholds exist: models below certain scales may be fundamentally incapable of certain tasks regardless of fine-tuning. Identifying these thresholds helps guide model selection and prevents wasted effort fine-tuning models that lack necessary capacity.

Key Insight: Emergence Creates Capability Thresholds

Certain capabilities only appear above specific scale thresholds. A 100 million parameter protein language model cannot perform zero-shot structure prediction regardless of how it is fine-tuned; the capability simply does not exist at that scale. Before attempting to adapt a foundation model for a challenging task, verify that models of similar scale have demonstrated the required capability. Attempting to fine-tune a model below the capability threshold is wasted effort.

14.4 Theoretical Foundations: Why Foundation Models Generalize

Foundation models challenge classical statistical learning theory: they have far more parameters than training examples yet generalize remarkably well. Understanding this puzzle requires revisiting fundamental concepts.

14.4.1 Classical Generalization Theory

VC Dimension and Capacity. The Vapnik-Chervonenkis (VC) dimension measures a model class's capacity: the size of the largest set of points the class can classify correctly under every possible labeling (that is, "shatter"). For a linear classifier in \(d\) dimensions: \(\text{VC}(H) = d + 1\).

Classical bounds state that generalization error scales as:

\[\epsilon_{gen} \leq \epsilon_{train} + O\left(\sqrt{\frac{\text{VC}(H)}{n}}\right)\]

The Puzzle: A transformer with 1B parameters has enormous capacity, yet generalizes from “only” billions of tokens. The VC bound would predict catastrophic overfitting.
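To see how loose this bound becomes at foundation-model scale, plug in illustrative numbers (rough assumptions, not measured quantities): if a network's capacity is taken to be on the order of its \(10^{9}\) trainable parameters and training uses \(n \approx 10^{10}\) tokens, the complexity term is roughly

\[\sqrt{\frac{\text{VC}(H)}{n}} \approx \sqrt{\frac{10^{9}}{10^{10}}} \approx 0.3,\]

far larger than the generalization gaps actually observed, and with fewer tokens than parameters the bound exceeds 1 and becomes vacuous. Exact VC dimensions for transformers are not known, but the mismatch motivates the alternative explanations below.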

14.4.2 Why Classical Theory Fails for Deep Learning

Several factors explain foundation model generalization:

1. Implicit Regularization. Gradient descent on overparameterized models converges to solutions with special properties: minimum norm (among solutions that fit training data), flat minima (robust to parameter perturbations), and simple functions (low-frequency components learned first). These implicit biases are not captured by VC dimension.

2. Benign Overfitting. In high dimensions, models can interpolate training data (zero training error) yet still generalize, if the model’s inductive bias aligns with data structure and noise is absorbed in directions orthogonal to the signal. This “double descent” phenomenon shows test error can improve past the interpolation threshold.

3. Effective Dimension. The effective number of parameters is much smaller than nominal parameter count: many parameters are redundant or nearly zero, the loss landscape is low-rank near minima, and self-supervised objectives constrain the solution space.

4. Data Scaling Laws. Foundation model generalization follows empirical power laws:

\[\epsilon_{test} \approx \left(\frac{D_c}{D}\right)^{\alpha_D} + \left(\frac{N_c}{N}\right)^{\alpha_N} + \epsilon_\infty\]

where \(D\) is the dataset size and \(N\) is the parameter count (matching the notation of Equation 14.1), \(D_c\) and \(N_c\) are fitted constants, and the exponents are small (roughly 0.07 to 0.1 for language models). This implies that increasing data is always beneficial (no classical "variance" regime).

14.4.3 Implications for Genomic Foundation Models

These theoretical insights have practical consequences:

  1. More parameters generally help (until compute-limited, not data-limited)
  2. Pretraining provides implicit regularization for downstream tasks
  3. Transfer works because pretrained representations lie on low-dimensional manifolds aligned with biological structure
  4. Overfitting to genomic benchmarks is a real risk; effective sample size for rare variants may be much smaller than nominal test set size

14.5 A Taxonomy of Genomic Foundation Models

The landscape of genomic foundation models can be organized into four broad families. Each family exhibits distinct characteristics, strengths, limitations, and typical application domains.

Figure 14.3: Taxonomy of genomic foundation models organized by modality and approach. DNA language models (blue) process nucleotide sequences with emphasis on long context and single-nucleotide resolution. Protein language models (green) encode evolutionary knowledge from protein sequences with increasing integration of structural information. Regulatory sequence models (orange) combine sequence processing with multi-task prediction of chromatin and expression tracks. Multi-modal and emerging models (purple) integrate across modalities, combining sequence with structure (AlphaFold2) or leveraging multiple information sources simultaneously. Arrows indicate connections between families where models build on each other’s capabilities.

The following table provides a quick reference for comparing the four foundation model families across key dimensions.

Table 14.2: Comparison of genomic foundation model families.
| Family | Input | Output | Pretraining | Strength | Limitation |
|---|---|---|---|---|---|
| DNA LMs | Nucleotide sequence | Embeddings, probabilities | MLM, autoregressive | General, scalable | No functional grounding |
| Seq-to-Function | Sequence windows | Assay predictions | Supervised multi-task | Mechanistic | Tied to training assays |
| VEP Models | Variant + context | Effect scores | Mixed supervision | Clinical relevance | Narrow task focus |
| Multi-Omic | Multiple modalities | Cross-modal embeddings | Contrastive, joint | Holistic | Data engineering complexity |

14.5.1 DNA Language Models

DNA language models treat genomic sequence as a language to be modeled, learning representations from raw nucleotide strings through self-supervised objectives. Without explicit functional labels, these models discover patterns through statistical regularities in genomic sequence.

The pretraining objectives typically involve masked language modeling or autoregressive next-token prediction. Training draws from reference genomes or pan-genomic sequence collections spanning multiple species. The resulting models produce per-position or pooled sequence embeddings that can be extracted and used for downstream tasks. Critically, these embeddings are not tied to specific assays or cell types, making them applicable to any task that benefits from general sequence context.

DNABERT and DNABERT-2 apply BERT-style masked language modeling to DNA sequences, using overlapping k-mers as tokens (Ji et al. 2021; Z. Zhou et al. 2024) (Section 15.2). The Nucleotide Transformer family scales this approach to larger parameter counts and cross-species training (Dalla-Torre et al. 2023) (Section 15.3). HyenaDNA achieves subquadratic complexity through implicit convolutions, enabling context lengths up to one million nucleotides (Nguyen et al. 2023) (Section 15.5.1). Caduceus incorporates bidirectional processing and reverse-complement equivariance as architectural inductive biases (Section 15.5.2). Evo 2 combines long-range attention with biological tokenization strategies (Section 15.5.3). GROVER integrates learned BPE-style tokenization with training on regulatory tracks in addition to raw sequence (Sanabria et al. 2024). These models and their architectural innovations are examined in detail in Chapter 15.

The primary strength of DNA language models lies in their generality: representations not bound to specific assays, cell types, or experimental conditions, capable of processing novel sequences absent from reference genomes. Their self-supervised training requires only genome sequences, making them scalable to massive corpora. The corresponding limitation is that without explicit functional grounding, they may not capture subtle regulatory patterns that manifest only under specific cellular conditions. Performance on tasks requiring fine-grained functional discrimination may lag models trained with functional supervision.

Applications span sequence classification (promoters, enhancers, transposons), motif discovery, variant effect prediction through embedding perturbation, sequence generation for synthetic biology, and transfer learning to new species with limited labeled data.

14.5.2 Sequence-to-Function Foundation Models

Sequence-to-function models predict molecular readouts directly from sequence through supervised or semi-supervised training on assay compendia. These models blur into foundation model territory when their output space is sufficiently broad and their internal representations prove useful for tasks beyond the original assay set.

These models map DNA sequences to high-dimensional vectors of molecular measurements, including chromatin accessibility, histone modifications, transcription factor binding, and gene expression levels. Training uses large collections of functional genomics assays spanning many cell types, enabling the models to learn regulatory grammar through supervised prediction of molecular phenotypes.

Enformer predicts thousands of chromatin and expression tracks from 200 kb sequence windows through transformer attention (Avsec et al. 2021) (Section 17.2). Borzoi extends this with refined architectures and expanded RNA-seq coverage (Section 17.3). Sei organizes predictions into interpretable sequence classes through unsupervised clustering (Chen et al. 2022) (Section 17.4). Earlier models including DeepSEA and Basset established the paradigm at smaller scales (Chapter 6).

The explicit functional supervision in these models provides mechanistic grounding that pure language models lack. Predictions can be interpreted through comparison to experiments. The models naturally support variant effect prediction by computing reference-alternative differences. The tradeoff is that models remain tied to training assays and cell types; extension to new contexts typically requires retraining or new data collection.

Applications center on regulatory variant interpretation in well-studied cell types, eQTL fine-mapping, enhancer identification, transcription factor binding prediction, and regulatory mechanism discovery.

14.5.3 Variant Effect Prediction Models

The clinical need to interpret genetic variants has driven development of models optimized specifically for predicting functional or clinical consequences. These take a variant and predict its effect on molecular phenotypes, organismal fitness, or disease risk.

Variant effect prediction models integrate sequence context with evolutionary information, population genetics signals, and sometimes structural or functional annotations. They output pathogenicity scores, effect size estimates, or functional consequence predictions. Training combines multiple data sources: clinical labels from ClinVar, population frequency from gnomAD, functional assays such as deep mutational scanning, and evolutionary constraint metrics.

AlphaMissense applies protein language models to predict pathogenicity of missense variants (Cheng et al. 2023) (Section 18.2.3). ESM-1v uses evolutionary context for protein variant effect prediction (Section 18.2.1). EVE combines evolutionary and structural information (Section 18.2.2). Genomic foundation models like DNABERT and Enformer provide variant effect predictions through in silico mutagenesis (Section 18.3). The architecture, training, evaluation, and clinical deployment of variant effect predictors are covered comprehensively in Chapter 18, with integration into clinical workflows detailed in Chapter 29.

Knowledge Check

At this point, you should be able to distinguish between the three model families covered so far. Without looking back, try to answer: What is the key difference between DNA language models and sequence-to-function models in terms of their training objectives? Which family would you choose if you needed to predict enhancer activity in a novel cell type not represented in existing training data?

DNA language models use self-supervised objectives (masked language modeling or next-token prediction) on raw sequence without functional labels, while sequence-to-function models train with supervised multi-task prediction on thousands of chromatin and expression assays. For a novel cell type, DNA language models would be preferable because they learn general sequence patterns that transfer across contexts, whereas sequence-to-function models are tied to the specific cell types and assays in their training data.

14.5.4 Multi-Omic Foundation Models

The most ambitious foundation models natively integrate multiple molecular modalities, jointly processing DNA sequence, chromatin state, gene expression, protein abundance, 3D genome structure, or phenotypic descriptions.

Multi-omic models employ architectures designed for heterogeneous input types: transformer variants with cross-attention, graph neural networks, or modality-specific encoders with fusion layers (Chapter 7, Chapter 22). Training objectives encourage cross-modal alignment through contrastive learning, joint prediction, or generative modeling of multiple data types.

Omni-DNA uses transformer-based autoregressive models with vocabulary expansion and multi-task finetuning, unifying diverse genomic tasks under an instruction-response paradigm (Li et al. 2025). Models integrating Hi-C data capture 3D genome organization (Chapter 21). Cross-modal architectures align DNA embeddings with chromatin or expression predictions (Chapter 23).

The unified representations these models produce enable cross-modal queries, and joint training can improve performance through multi-task effects. Data engineering becomes substantially more complex, however, with different modalities requiring different measurement technologies and quality control. The field is early, with few models reaching production maturity.

14.6 Design Dimensions

Within and across families, individual models differ along orthogonal design dimensions that affect suitability for specific tasks.

14.6.1 Data Composition

The choice of training data shapes what patterns a model can learn. Training on human sequences alone focuses on clinically relevant patterns but limits exposure to evolutionary diversity. Cross-species training encourages learning of conserved elements and evolutionary constraints, potentially improving generalization but risking dilution of human-specific signals.

Sequence diversity presents a similar tradeoff. Training on reference genomes alone provides clean sequences but limited exposure to population variation. Incorporating variant data improves robustness but requires careful design to avoid learning spurious associations. Models may also train on raw sequence alone or incorporate functional annotations, trading generality against functional grounding. The implications of training data choices for model bias are examined in Chapter 13.

14.6.2 Architecture Choices

Architectural decisions determine both computational characteristics and inductive biases. Among transformer variants, encoder-only models (DNABERT, Nucleotide Transformer) excel at classification and embedding tasks, while decoder-only models (GROVER) support generative applications (Chapter 7). Full and sparse attention patterns, linear approximations, and Flash attention implementations affect computational efficiency.

Hyena-based models and state space models achieve subquadratic scaling, enabling longer contexts than standard transformers with comparable parameters. Hybrid approaches combine local convolutions with global attention, as in Enformer, processing sequences at multiple resolutions.

14.6.3 Context Length

The context window determines what genomic relationships a model can capture. Short context (under 1 kb) captures local patterns: motifs, splice sites, promoter elements. Medium context (1 to 10 kb) spans complete genes with proximal regulatory regions. Long context (10 to 200 kb) represents enhancer-promoter interactions and TAD-scale organization. Ultra-long context (over 200 kb) enables chromosomal domain modeling and complex structural variant interpretation. The effective use of long context requires appropriate tokenization and positional encoding strategies discussed in Chapter 5, with specific implementations examined in Section 15.2 (k-mer tokenization), Section 15.3 (BPE variants), and Section 7.2 (position embeddings).

The following table summarizes the relationship between context length and the biological phenomena that can be captured.

Table 14.3: Context length determines what genomic relationships a model can capture.
| Context | Length Range | Biological Scope | Example Applications |
|---|---|---|---|
| Short | < 1 kb | Motifs, splice sites | TF binding, splice prediction |
| Medium | 1-10 kb | Genes, proximal regulation | Promoter analysis, UTR effects |
| Long | 10-200 kb | Enhancer-promoter, TADs | Regulatory variants, eQTL |
| Ultra-long | > 200 kb | Chromosomal domains | Structural variants, 3D genome |

14.6.4 Tokenization

The representation of nucleotides as model inputs affects both computational efficiency and biological resolution. Character-level tokenization maintains single-base resolution but produces the longest token sequences. K-mer tokenization shortens sequences by a factor approaching \(k\), at the cost of a vocabulary that grows as \(4^k\) (4,096 entries for 6-mers). Learned tokenization (BPE-style) discovers schemes from data, potentially allocating vocabulary more efficiently (Medvedev et al. 2025). The choice should align with both computational constraints and biological resolution requirements; a concrete illustration follows below, and detailed discussion of tokenization strategies appears in Chapter 5.
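The tradeoff is easy to see concretely. The sketch below implements a simple k-mer tokenizer (DNABERT-style overlapping k-mers differ only in stride), showing how sequence length shrinks by roughly a factor of \(k\) while the vocabulary grows as \(4^k\); the sequence and helper names are illustrative.

```python
from itertools import product

def kmer_vocabulary(k: int) -> dict:
    """Map every possible k-mer over {A, C, G, T} to an integer id."""
    return {"".join(kmer): i for i, kmer in enumerate(product("ACGT", repeat=k))}

def tokenize_kmers(sequence: str, k: int, stride: int | None = None) -> list[str]:
    """Split a DNA sequence into k-mers.

    stride == k gives non-overlapping tokens (length shrinks ~k-fold);
    stride == 1 gives overlapping tokens as used by DNABERT.
    """
    stride = stride or k
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

seq = "ACGTAGCTAGCTTACGGATCCGATCGATCGTAGCTAGCAA"
vocab = kmer_vocabulary(6)
tokens = tokenize_kmers(seq, k=6)          # non-overlapping 6-mers
print(len(seq), len(tokens), len(vocab))   # 40 bases -> 6 tokens, 4096-entry vocab
ids = [vocab[t] for t in tokens]
```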

Figure 14.4: Design dimensions for genomic foundation models. Radar chart positions representative models across six key dimensions: context length (how much sequence the model processes), parameter count (model capacity), training compute (resources required), architecture type (encoder vs. decoder), tokenization strategy (k-mer vs. single-nucleotide), and pretraining objective (masked vs. autoregressive). Different models make different trade-offs: ESM-2 emphasizes parameter scale within protein-length contexts; Enformer balances long context with multi-task supervision; HyenaDNA pushes context length to megabases using sub-quadratic architectures; Evo combines massive scale with autoregressive generation. These trade-offs determine which applications each model best serves.

14.7 Build Versus Use Decisions

The availability of pretrained foundation models creates strategic choices about when to use existing models, when to adapt them, and when to train from scratch.

Stop and Think

Before reading the detailed guidance, consider your own research context: Do you have unique proprietary data? What computational resources are available? How specific is your target task? Based on these factors, would you expect to use, adapt, or build a foundation model?

14.7.1 When to Use Existing Models

Existing foundation models provide immediate utility when the target application aligns with model capabilities, labeled data is limited, and computational resources are constrained.

For tasks where general sequence representations suffice, frozen foundation model embeddings with simple downstream classifiers often perform competitively with fine-tuned alternatives. This approach requires minimal compute (single forward passes), no gradient computation through large models, and modest labeled data (hundreds to thousands of examples). Applications include sequence classification, clustering, and similarity search.
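A minimal version of this workflow fits a linear probe on precomputed embeddings. The sketch below assumes embeddings have already been extracted with a frozen foundation model (for example, using the extraction snippet in Section 14.2.1) and stored as a NumPy array alongside binary labels; the array shapes and random placeholder data are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative placeholders: embeddings from a frozen foundation model
# (n_sequences x hidden_dim) and binary labels (e.g., enhancer vs. background).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 768))
labels = rng.integers(0, 2, size=500)

# A linear probe: no gradients flow through the foundation model, only a
# regularized logistic regression trained on top of frozen representations.
probe = LogisticRegression(max_iter=1000, C=1.0)
scores = cross_val_score(probe, embeddings, labels, cv=5, scoring="roc_auc")
print(f"cross-validated AUROC: {scores.mean():.3f} +/- {scores.std():.3f}")
```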

Some foundation models support zero-shot variant effect prediction through likelihood ratio scoring. This requires no task-specific training and produces calibrated scores for novel variants immediately. Zero-shot approaches work well when the pretraining objective aligns with the target task and when fine-tuning data is unavailable or unreliable.

Foundation model APIs also enable rapid prototyping, allowing quick assessment of whether a modeling approach is viable before committing resources to custom development. Testing variant effect prediction with ESM-1v takes hours rather than the weeks required to train a custom model.

14.7.2 When to Adapt Existing Models

Adaptation through fine-tuning or lightweight methods (LoRA, adapters, prefix tuning) makes sense when downstream tasks require specialized behavior beyond what frozen embeddings provide, sufficient labeled data exists (typically thousands to tens of thousands of examples), and the target domain falls within the pretraining distribution.

Parameter-efficient methods like LoRA update a small fraction of model parameters (often under 1%) while keeping the foundation model frozen (Hu et al. 2021). This preserves general knowledge while allowing task-specific adaptation. Compute requirements are modest: a few GPU-hours for most genomic tasks. The approach works well when the foundation model’s representations are largely appropriate but need refinement for specific applications. Details on parameter-efficient adaptation appear in Chapter 9.
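In practice, parameter-efficient adaptation is a few lines with a library such as Hugging Face peft. The sketch below wraps a hypothetical pretrained DNA language model for sequence classification with LoRA adapters; the checkpoint ID and the target_modules names are placeholders that depend on the specific architecture.

```python
# Hedged sketch: LoRA adapters on a pretrained sequence model via peft.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

checkpoint = "example/dna-lm"  # placeholder model ID
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                      # low-rank update dimension
    lora_alpha=16,            # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["query", "value"],  # attention projections; names vary by model
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
# Training then proceeds with a standard loop or the transformers Trainer.
```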

Updating all parameters typically achieves the best single-task performance but requires more data (tens of thousands of examples), more compute (GPU-days to weeks), and careful regularization to prevent overfitting. Full fine-tuning makes sense for high-stakes applications where maximum accuracy justifies the investment.

14.7.3 When to Train from Scratch

Building custom foundation models requires substantial justification given the resources involved.

Novel domains present the clearest case for custom pretraining. When target sequences differ fundamentally from existing model pretraining data (novel species, synthetic sequences, non-standard nucleotides), existing models may provide poor transfer. Applications requiring architectural features absent from existing models (specific attention patterns, custom tokenization, multi-modal inputs) similarly demand building from scratch.

Organizations with unique large-scale datasets (clinical biobanks, pharmaceutical screening data) may achieve better performance through custom pretraining than public models allow, though the data advantage must be substantial to justify training costs. Applications requiring larger models or longer contexts than existing options provide face a similar calculus.

14.7.4 Cost-Benefit Analysis

The decision framework involves comparing expected performance against resource requirements.

Training a foundation model from scratch requires \(10^{20}\) to \(10^{22}\) FLOPs, translating to thousands of GPU-hours and tens of thousands of dollars at current cloud prices. Fine-tuning requires \(10^{16}\) to \(10^{18}\) FLOPs, often achievable in hours on single GPUs. Inference with frozen embeddings requires only forward passes.
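A rough back-of-envelope calculation makes these figures concrete. The throughput, utilization, and price numbers below are illustrative assumptions, not benchmarks; substitute values for your own hardware and cloud provider.

```python
# Back-of-envelope translation of training FLOPs into GPU-hours and cost.
# Peak throughput, utilization, and price per GPU-hour are assumed values.
def training_cost(total_flops: float,
                  peak_flops_per_sec: float = 3.1e14,  # ~312 TFLOP/s (A100 bf16), assumed
                  utilization: float = 0.4,            # assumed model FLOPs utilization
                  usd_per_gpu_hour: float = 2.0):      # assumed cloud price
    effective = peak_flops_per_sec * utilization
    gpu_hours = total_flops / effective / 3600
    return gpu_hours, gpu_hours * usd_per_gpu_hour

for flops in (1e18, 1e20, 1e21, 1e22):
    hours, dollars = training_cost(flops)
    print(f"{flops:.0e} FLOPs -> {hours:,.0f} GPU-hours, ~${dollars:,.0f}")
# Under these assumptions, 1e18 FLOPs (fine-tuning) is a few GPU-hours, while
# 1e21-1e22 FLOPs (from-scratch pretraining) is thousands to tens of thousands
# of GPU-hours, consistent with the ranges quoted above.
```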

Foundation model pretraining requires billions of tokens. Fine-tuning requires thousands to tens of thousands of labeled examples. Zero-shot and embedding approaches require only evaluation data.

For well-studied tasks with abundant labeled data, fine-tuned models typically outperform frozen embeddings by 5 to 15% on standard metrics. Zero-shot approaches often achieve 70 to 90% of fine-tuned performance. Custom foundation models rarely outperform existing options by large margins unless the application involves genuinely novel domains.

The following table summarizes the resource requirements and expected performance for each approach.

Table 14.4: Resource requirements and expected performance for different foundation model approaches.
| Approach | Compute | Data Required | Time | Expected Performance |
|---|---|---|---|---|
| Frozen embeddings | \(10^{14}\) FLOPs | 100s-1,000s labels | Hours | 70-90% of fine-tuned |
| LoRA/adapters | \(10^{16}\) FLOPs | 1,000s labels | Hours-days | ~95% of full fine-tuning |
| Full fine-tuning | \(10^{18}\) FLOPs | 10,000s labels | Days-weeks | Best single-task |
| Train from scratch | \(10^{20}\)+ FLOPs | Billions of tokens | Weeks-months | Best if novel domain |

Figure 14.5: Decision framework for using vs. building foundation models. Entry point: a new genomic prediction task. First decision: does a suitable pretrained model exist? If yes, assess task alignment. For high alignment, USE frozen embeddings (hours of work, ~$10 compute, achieving 70-90% of fine-tuned performance); this serves most applications. For moderate alignment, ADAPT using LoRA or light fine-tuning (days of work, $100-1000 compute, ~95% of full fine-tuning). Only when existing models fundamentally lack required capabilities should practitioners BUILD custom foundation models (months of work, $100K+ compute). The vast majority of applications are best served by using or adapting existing models rather than building from scratch.

Time costs often dominate: using existing models takes hours to days, fine-tuning takes days to weeks, and training from scratch takes weeks to months. For time-sensitive applications, using an existing model is therefore usually the better choice, even if custom training would eventually yield better results.

Practical Guidance: The Build-vs-Use Decision

For most genomic applications, follow this decision sequence:

  1. Start with frozen embeddings from the most appropriate existing foundation model. Evaluate on held-out data before investing more resources.

  2. Try parameter-efficient fine-tuning (LoRA or adapters) if frozen embeddings underperform by more than 10% versus published baselines.

  3. Consider full fine-tuning only for high-stakes applications where the 5% improvement over LoRA justifies GPU-days of compute.

  4. Train from scratch only when all of the following hold: (a) target domain differs fundamentally from existing pretraining data, (b) you have access to unique large-scale data, (c) timeline permits months of development, and (d) budget permits $100K+ in compute.

Most researchers will never need to train a foundation model from scratch. The efficiency of using pretrained models is precisely their value proposition.
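For readers who prefer an explicit procedure, the toy function below encodes the decision sequence above. The thresholds mirror the numbered guidance and should be treated as rules of thumb, not hard cutoffs.

```python
# Toy encoding of the build-vs-use decision sequence above. Each check is in
# practice a judgment call, not a hard rule.
def recommend_approach(frozen_gap_vs_baseline: float,
                       high_stakes: bool,
                       novel_domain: bool,
                       unique_large_scale_data: bool,
                       months_available: bool,
                       budget_over_100k: bool) -> str:
    # Train from scratch only when all four conditions in step 4 hold.
    if novel_domain and unique_large_scale_data and months_available and budget_over_100k:
        return "train from scratch"
    # Step 1: frozen embeddings, unless they trail baselines by more than 10%.
    if frozen_gap_vs_baseline <= 0.10:
        return "frozen embeddings"
    # Step 3: full fine-tuning only for high-stakes applications.
    if high_stakes:
        return "full fine-tuning"
    # Step 2: otherwise, parameter-efficient adaptation.
    return "parameter-efficient fine-tuning (LoRA/adapters)"

print(recommend_approach(frozen_gap_vs_baseline=0.15, high_stakes=False,
                         novel_domain=False, unique_large_scale_data=False,
                         months_available=False, budget_over_100k=False))
# -> parameter-efficient fine-tuning (LoRA/adapters)
```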

14.8 Evaluation Principles

Foundation models resist evaluation on single tasks. Their value lies in transfer across many applications, making comprehensive evaluation substantially more complex than benchmarking task-specific models.

14.8.1 Multi-Task Assessment

A genomic foundation model should be evaluated across families of related tasks rather than isolated benchmarks. For DNA language models, this includes sequence classification tasks, variant effect prediction across multiple variant types, motif discovery, and cross-species transfer. For sequence-to-function models, evaluation should span prediction of held-out assays, transfer to novel cell types, and consistency with experimental measurements.
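A minimal sketch of such a multi-task evaluation follows. The task names and datasets are placeholders, and the embed helper is assumed to be the mean-pooling function from Section 14.7.1.

```python
# Minimal sketch: evaluating one frozen-embedding model across a family of
# tasks rather than a single benchmark. Task names, datasets, and the `embed`
# helper are placeholders (see the embedding sketch in Section 14.7.1).
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

tasks = {
    "promoter_classification": (promoter_seqs, promoter_labels),
    "enhancer_classification": (enhancer_seqs, enhancer_labels),
    "splice_site_detection": (splice_seqs, splice_labels),
}

results = {}
for name, (seqs, labels) in tasks.items():
    X = embed(seqs)  # same frozen model for every task
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, labels,
                             cv=5, scoring="roc_auc")
    results[name] = scores.mean()

# Report the full profile, not a single number: a model that wins on one task
# family may lose on another.
for name, auc in sorted(results.items()):
    print(f"{name}: AUROC = {auc:.3f}")
```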

The diversity of evaluation tasks complicates comparison across models. A model excelling at promoter classification may underperform on eQTL fine-mapping. Direct comparisons require controlling for differences in training data, model scale, and evaluation protocols. Standardized benchmark suites are examined in Chapter 11.

14.8.2 Transfer Versus Pretraining Performance

Foundation models are intended for transfer, making pretraining loss only moderately predictive of downstream utility. A model with slightly worse masked language modeling loss may produce better embeddings if its training objective better aligns with useful representations. Evaluation should explicitly test transfer through zero-shot performance, few-shot learning, cross-domain transfer, and robustness to distribution shift.

Detailed discussion of benchmark suites, evaluation protocols, and methodological best practices appears in Chapter 11 and Chapter 12.

14.9 Foundation Model Ecosystem

Genomic foundation models exist within a broader ecosystem of infrastructure, community resources, and shared practices.

14.9.1 Model Distribution

Most models are distributed through centralized repositories. Hugging Face hosts many DNA and protein language models with documented APIs. GitHub repositories accompany publications with weights, code, and examples. Standardized formats reduce friction in adoption, enabling rapid benchmarking and experimentation.

14.9.2 Documentation Requirements

Responsible distribution requires comprehensive documentation: training data provenance, preprocessing procedures, architecture details, hyperparameters, evaluation protocols, and known limitations. Data provenance is particularly important given population-specific biases and use restrictions in genomic datasets (Chapter 13).

14.9.3 Industry and Academic Contributions

Both academic and industry groups develop genomic foundation models. Academic models emphasize reproducibility and open access. Industry models may offer superior performance through proprietary data or compute but with limited transparency. Notable industry contributions include NVIDIA’s BioNeMo platform and Microsoft’s Azure genomics integration. Users should review license terms before clinical or commercial deployment.

14.10 Open Questions

Despite rapid progress, fundamental challenges remain unsolved, and the field’s trajectory remains uncertain.

Whether genomic foundation models converge toward unified architectures or maintain specialized families is unclear. The diversity of genomic scales, resolution requirements, and functional contexts may preclude the convergence seen in NLP, where transformers now dominate across most tasks.

Existing models learn correlations without distinguishing causal from spurious relationships. Integrating causal structure could improve robustness and enable counterfactual reasoning, but current architectures provide no principled mechanism for causal inference (Chapter 13).

Models trained on reference genomes and common variants may not calibrate well for ultra-rare or de novo variants, precisely the variants most likely to be clinically actionable (Chapter 29). Improved integration of structural and evolutionary constraints could strengthen rare variant interpretation.

Translation to clinical use requires robust cross-population performance, calibrated uncertainty (Chapter 24), interpretability for clinicians (Chapter 25), prospective validation, and regulatory approval. These requirements extend well beyond benchmark performance, and the path from research model to clinical deployment remains poorly charted.

14.11 Convergence Without Consolidation

Foundation models for genomics divide into families serving different needs. DNA language models learn general sequence representations from self-supervised pretraining, capturing evolutionary constraints and regulatory patterns without explicit functional labels (Chapter 15). Sequence-to-function models predict molecular phenotypes from sequence, providing quantitative outputs (expression levels, chromatin states, splice probabilities) that DNA language models alone cannot produce (Chapter 17). Variant effect models integrate sequence representations with evolutionary information to score the functional impact of genetic variants (Chapter 18). Multi-omic models combine sequence with additional data modalities to capture regulatory relationships that sequence alone cannot resolve (Chapter 23). No single family dominates; effective genomic AI requires matching model capabilities to application requirements.

Scale introduces both opportunities and constraints. Scaling laws describe predictable relationships between parameters, data, compute, and performance, enabling principled resource allocation. Some capabilities appear only at sufficient scale, creating thresholds that cannot be crossed through fine-tuning alone. The practical implication is that certain applications require institutional-scale investment, while others can leverage existing pretrained models with modest adaptation. The build-versus-use framework guides this decision: use existing models when they suffice, adapt through fine-tuning or feature extraction when needed, train from scratch only when unique data or requirements justify the investment.

This framework instantiates across specific domains. DNA language models (Chapter 15) and protein language models (Chapter 16) exemplify self-supervised pretraining on biological sequence. Regulatory models (Chapter 17) demonstrate sequence-to-function prediction at long-range scales. Variant effect prediction (Chapter 18) integrates multiple model families for clinical interpretation. Throughout, these principles guide model selection: what does this application require, which model family provides it, and what scale is necessary to achieve it?

Chapter Summary
Test Yourself

Before reviewing the summary, test your recall:

  1. What distinguishes a foundation model from a task-specific deep learning model? Why does self-supervised pretraining enable transfer to multiple downstream tasks while supervised training does not?

  2. According to the Chinchilla scaling laws, what is the relationship between model parameters, training data, and compute budget? If you have a fixed compute budget, should you train a larger model on less data or a smaller model on more data?

  3. What are the four major families of genomic foundation models and what is the key strength and limitation of each family?

  4. Explain the concept of emergent capabilities in foundation models. Why does this matter when selecting a model for adaptation to a new task?

  5. When should you build a foundation model from scratch versus adapting an existing one? What are the key decision factors in the build-versus-use hierarchy?

  1. Foundation vs. task-specific models: Foundation models are distinguished by their ability to transfer to diverse downstream tasks through embeddings or lightweight adaptation, not by their size alone. Self-supervised pretraining enables transfer because the model must learn general features useful for reconstructing any genomic context (e.g., how motifs combine, sequence composition patterns, functional vs. non-functional sequence). In contrast, supervised training on narrow tasks produces features specifically optimized for that task (e.g., splice site prediction learns GT-AG consensus and branch points), which may be irrelevant or misleading for other applications.

  2. Chinchilla scaling laws: The framework shows that loss \(L(N, D) = E + A/N^\alpha + B/D^\beta\) where \(N\) is parameters, \(D\) is training tokens, and compute \(C \approx 6ND\). The key insight is that model parameters and training data should scale approximately equally; the optimal ratio is roughly 20 tokens per parameter. For a fixed compute budget, you should train a smaller model on more data rather than a larger model on less data, as undertrained large models typically underperform smaller compute-optimal models.

  3. Four foundation model families:

    1. DNA language models use self-supervised pretraining for general sequence representations but lack functional grounding for subtle regulatory patterns.

    2. Sequence-to-function models predict molecular phenotypes with mechanistic grounding but are tied to training assays and cell types.

    3. Variant effect prediction models integrate multiple information sources for clinical relevance but focus narrowly on variant interpretation.

    4. Multi-omic models integrate across modalities for holistic understanding but face data engineering complexity and are still early in development.

  4. Emergent capabilities: These are abilities that appear discontinuously at certain scale thresholds and are absent in smaller models (e.g., ESM-2’s structural understanding emerges at ~650M parameters, zero-shot variant effect prediction requires sufficient scale). This matters because attempting to fine-tune a model below the capability threshold is wasted effort: if a 100M parameter model fundamentally cannot perform a task, no amount of fine-tuning will enable it. You must verify that models of similar scale have demonstrated the required capability before attempting adaptation.

  5. Build vs. use hierarchy: Start with frozen embeddings from existing models (hours of work, ~$10 compute, 70-90% of fine-tuned performance). Escalate to parameter-efficient adaptation like LoRA if embeddings underperform (days of work, $100-1000 compute, ~95% of full fine-tuning). Consider full fine-tuning only for high-stakes applications (weeks, $1000+). Train from scratch only when: (a) target domain differs fundamentally from existing pretraining data, (b) you have unique large-scale data, (c) timeline permits months, and (d) budget permits $100K+ compute. Most researchers never need to train from scratch.

Core Concepts:

  • Foundation models are distinguished from task-specific models by their ability to transfer to diverse downstream tasks through embeddings or lightweight adaptation, not by size alone.

  • Scaling laws (Chinchilla framework) describe predictable relationships between parameters, data, compute, and performance. For genomics, data constraints often matter more than compute constraints.

  • Emergent capabilities appear at scale thresholds; models below these thresholds cannot achieve certain capabilities regardless of fine-tuning.

  • Four model families serve different needs: DNA language models (general embeddings), sequence-to-function models (assay predictions), variant effect models (clinical interpretation), and multi-omic models (cross-modal integration).

  • Build-vs-use decisions follow a clear hierarchy: start with frozen embeddings, escalate to adaptation if needed, train from scratch only for genuinely novel domains.

Key Takeaways:

  1. The paradigm shift from task-specific to foundation models changes how researchers interact with models, shifting from training practitioners to adaptation specialists.

  2. For most applications, using or adapting existing foundation models is more efficient than training from scratch.

  3. Evaluation must span multiple tasks; single-benchmark performance does not capture foundation model value.

  4. The path from research to clinical deployment requires addressing uncertainty, interpretability, and regulatory requirements beyond benchmark performance.

Looking Ahead: The next chapters examine each foundation model family in depth: DNA language models (Chapter 15), protein language models (Chapter 16), regulatory sequence-to-function models (Chapter 17), and variant effect prediction (Chapter 18).