10  DNA Foundation Models

Genomic language models extend the ideas of protein language models (Chapter 9) to the DNA level: they treat genomes themselves as a “corpus,” learn statistical regularities through self-supervision, and reuse those representations for many downstream tasks.

Where Chapters 5–7 focused on supervised sequence-to-function CNNs and specialized architectures, and Chapter 8 focused on representation and tokenization, this chapter turns to DNA foundation models—large, often transformer-based or hybrid architectures trained on unlabeled genomic sequence at scale.

These models aim to provide a single, reusable backbone for tasks ranging from regulatory annotation and variant effect prediction to cross-species transfer and clinical prioritization. They mark the transition from “a model per dataset” to general-purpose genomic backbones analogous to BERT, GPT, and ESM in natural and protein language modeling.


10.1 From Supervised CNNs to Self-Supervised Genomic LMs

The CNN-era models (DeepSEA, ExPecto, SpliceAI; Chapters 5–7) shared a common pattern:

  • Inputs: One-hot encoded DNA sequence around a locus
  • Targets: Task-specific labels (chromatin marks, expression, splice junctions)
  • Objective: Predict those labels using supervised loss functions

This approach achieved remarkable performance but suffers from three fundamental constraints:

  1. Label dependence – Every new assay, cell type, or phenotype requires new labeled data.
  2. Task coupling – Model design is tightly coupled to the task (e.g., splice-aware architectures, expression-distance kernels).
  3. Limited reuse – Features learned for one problem do not automatically transfer to others.

Protein language models (Chapter 9) showed a different route: self-supervised learning on unlabeled sequences, with downstream tasks solved by probing or fine-tuning. Genomic language models import this recipe to DNA:

  • Data: Large collections of genomic sequences across species, individuals, or functional regions.
  • Objectives:
    • Masked language modeling (MLM): predict masked bases or tokens.
    • Next-token or sequence modeling: predict the next token in a sequence.
    • Hybrid tasks: combine MLM with auxiliary objectives (e.g., predicting annotations).
  • Usage modes:
    • Freeze the model and train lightweight probes for specific tasks.
    • Fine-tune the entire model (or adapters) for specialized downstream tasks.
    • Use zero-shot or few-shot scoring by comparing log-likelihoods of alternative sequences or alleles.

The promise is that once a sufficiently powerful backbone is trained, it becomes the default starting point for nearly any DNA-level prediction problem.
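
To make the masked-language-modeling objective concrete, the following minimal PyTorch sketch masks a fraction of single-nucleotide tokens in random DNA windows and trains a tiny encoder to reconstruct them. The vocabulary, model size, and sequences are toy placeholders rather than the setup of any published genomic LM (positional encodings are omitted for brevity).

```python
# Minimal masked-language-modeling sketch on DNA (toy setup, not a published model).
import torch
import torch.nn as nn

VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "[MASK]": 4}
MASK_ID = VOCAB["[MASK]"]

def mask_tokens(tokens, mask_prob=0.15):
    """Replace a random subset of tokens with [MASK]; keep originals as labels."""
    labels = tokens.clone()
    mask = torch.rand(tokens.shape) < mask_prob
    labels[~mask] = -100                       # unmasked positions ignored by the loss
    corrupted = tokens.clone()
    corrupted[mask] = MASK_ID
    return corrupted, labels

class TinyDnaEncoder(nn.Module):
    """A small transformer encoder; positional encodings omitted for brevity."""
    def __init__(self, vocab_size=5, d_model=64, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=128,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)   # predicts the identity of masked bases

    def forward(self, x):
        return self.head(self.encoder(self.embed(x)))

# One toy training step on random sequences standing in for genomic windows.
tokens = torch.randint(0, 4, (8, 200))             # batch of 200-bp windows
corrupted, labels = mask_tokens(tokens)
model = TinyDnaEncoder()
logits = model(corrupted)
loss = nn.functional.cross_entropy(logits.reshape(-1, 5), labels.reshape(-1),
                                   ignore_index=-100)
loss.backward()
print(f"MLM loss: {loss.item():.3f}")
```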


10.2 Early Genomic Transformers: DNABERT and DNABERT-2

10.2.1 DNABERT — BERT for k-merized DNA

DNABERT applied the BERT masked language modeling framework to genomic sequences, using overlapping k-mers (e.g., 6-mers) as tokens and training on human reference sequences (Ji et al. 2021). As discussed in Chapter 8, this design had several defining characteristics:

  • Tokenization: Overlapping k-mers created a discrete vocabulary of size \(4^k\).
  • Objective: Masked token prediction, exactly as in BERT.
  • Input scale: Context windows of roughly 500 bp (a 512-token limit).
  • Downstream evaluation: Fine-tuned on tasks such as promoter classification, splice site prediction, and transcription factor binding.
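
The overlapping k-mer scheme is easy to reproduce. A minimal sketch (plain Python, illustrative helper names) builds the \(4^k\) vocabulary and tokenizes a short sequence into overlapping 6-mers, which also shows why this tokenization yields essentially one token per base rather than a compressed sequence:

```python
# Overlapping k-mer tokenization in the style of DNABERT (illustrative sketch).
from itertools import product

def build_kmer_vocab(k=6):
    """Map every possible k-mer over {A, C, G, T} to an integer id (4^k entries)."""
    return {"".join(kmer): i for i, kmer in enumerate(product("ACGT", repeat=k))}

def to_overlapping_kmers(seq, k=6):
    """Slide a width-k window with stride 1: one token per position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

vocab = build_kmer_vocab(k=6)
print(len(vocab))                          # 4096 == 4**6

seq = "ACGTACGTAGCTAGCTA"
tokens = to_overlapping_kmers(seq)
ids = [vocab[t] for t in tokens]
print(len(seq), len(tokens), tokens[:3])   # 17 bp -> 12 tokens: no compression
```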

DNABERT provided proof of concept that:

  • Self-supervised pretraining on raw DNA can improve performance over training from scratch.
  • Learned embeddings capture biologically meaningful regularities, even when trained only on the reference genome.
  • BERT-style architectures can be re-used across multiple downstream tasks.

However, the k-mer design also introduced the limitations detailed in Chapter 8:

  • No true sequence compression—overlapping k-mers do not reduce sequence length.
  • Ambiguity in positional interpretation—each nucleotide participates in multiple tokens.
  • Limited context and scaling, due to quadratic attention and redundant overlapping tokens.

10.2.2 DNABERT-2 — Toward Better Tokenization and Efficiency

DNABERT-2 revisited both tokenization and architecture, highlighting how much representation matters for genomic LMs (Zhou et al. 2024).

Key differences relative to DNABERT:

  • Tokenization: Improved schemes (e.g., BPE-style merges) that better compress redundancies and reduce sequence length.
  • Efficiency: Models that scale to larger contexts and corpora without prohibitive memory costs.
  • Performance: Consistent gains on a range of sequence-to-label genomics benchmarks over both DNABERT and non-pretrained baselines.
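
As a rough illustration of the idea (not DNABERT-2’s released tokenizer), one can train a byte-pair-encoding tokenizer directly on DNA strings with the Hugging Face `tokenizers` package, assuming it is installed; frequent subsequences are merged into longer tokens, shortening the tokenized sequence:

```python
# Training a BPE tokenizer on raw DNA, in the spirit of DNABERT-2's tokenization.
# Sketch only: uses the Hugging Face `tokenizers` package and a random toy corpus,
# not the tokenizer or data released with DNABERT-2.
import random

from tokenizers import Tokenizer, models, trainers

random.seed(0)
corpus = ["".join(random.choice("ACGT") for _ in range(1000)) for _ in range(200)]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
trainer = trainers.BpeTrainer(vocab_size=512, special_tokens=["[UNK]", "[MASK]"])
tokenizer.train_from_iterator(corpus, trainer)

seq = corpus[0][:60]
encoding = tokenizer.encode(seq)
print(len(seq), len(encoding.tokens))      # merges compress the 60-bp input
print(encoding.tokens[:8])
```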

DNABERT and DNABERT-2 collectively established that:

  1. Self-supervision on DNA works and is competitive with hand-engineered pipelines.
  2. Tokenization choices (Chapter 8) have large practical consequences.
  3. Masked LM training can produce reusable representations for diverse sequence tasks.

10.3 Scaling Context and Diversity: Nucleotide Transformer

DNABERT showed feasibility, but its context windows and training data were modest relative to the scale of genomes. Nucleotide Transformer pushed much further, emphasizing scale and diversity (Dalla-Torre et al. 2023):

  • Corpus: Genomic data spanning multiple species and populations.
  • Models: Transformer encoders of various sizes, from moderate to very large parameter counts.
  • Context length: Up to ~6 kb per input sequence—an order-of-magnitude jump over DNABERT while still using dense attention (Chapter 8).
  • Training objective: Masked language modeling on subsequences sampled from genomes.

The Nucleotide Transformer work contributed several important ideas:

  1. Cross-species pretraining
    Training on many genomes (rather than a single reference) exposes the model to:

    • Diverse sequence patterns and GC content.
    • Different regulatory architectures.
    • Evolutionary constraints that recur across lineages.

    This mirrors the way protein LMs draw strength from sequence diversity across many species and families (Chapter 9), but operates at the raw DNA level.

  2. Benchmark suite
    To quantify representation quality, Nucleotide Transformer introduced a benchmark panel of genomic tasks, commonly referred to in later work as the Nucleotide Transformer benchmarks (Dalla-Torre et al. 2023). Typical tasks include:

    • Promoter and enhancer classification.
    • Histone mark and chromatin accessibility prediction.
    • Variant/pathogenicity proxies.
    • Regulatory element type classification.

    Models are evaluated via linear probes, shallow classifiers, or light fine-tuning, providing a standard yardstick for later DNA LMs.

  3. Scaling trends
    As with protein and natural-language models, performance improves predictably with:

    • Larger models.
    • More pretraining data.
    • Longer context windows.

    These scaling curves help forecast the returns from investing in even larger genomic LMs.
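
Returning to the first of these contributions, the cross-species corpus can be pictured as a sampler that draws fixed-length windows from many genomes, so every pretraining batch mixes lineages and sequence contexts. The sketch below stubs genome loading with random strings; the helper names and species list are purely illustrative.

```python
# Sampling fixed-length windows across several genomes for MLM pretraining (sketch).
# Genome loading is stubbed with random strings; in practice you would read FASTA
# files, e.g., with pyfaidx or Biopython.
import random

random.seed(0)

def load_genome(species):
    """Placeholder: return one long chromosome-like sequence per species."""
    return "".join(random.choice("ACGT") for _ in range(100_000))

GENOMES = {sp: load_genome(sp) for sp in ["human", "mouse", "zebrafish"]}

def sample_window(window_bp=6_000):
    """Pick a species uniformly at random, then a random window of `window_bp` bases."""
    species = random.choice(list(GENOMES))
    genome = GENOMES[species]
    start = random.randrange(0, len(genome) - window_bp)
    return species, genome[start:start + window_bp]

batch = [sample_window() for _ in range(4)]
for species, seq in batch:
    print(species, len(seq), seq[:20])
```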


10.4 Beyond Dense Attention: HyenaDNA and 1 Mb Context

Quadratic attention limits transformer context length to tens of kilobases at best, even with aggressive engineering. HyenaDNA replaces attention with a Hyena long-convolution / state-space architecture that scales sub-quadratically and can process sequences up to 1 Mb (Nguyen et al. 2023).

As summarized in Chapter 8:

| Model                  | Architecture | Max context | Complexity        |
|------------------------|--------------|-------------|-------------------|
| DNABERT                | Transformer  | 512 bp      | \(O(L^2)\)        |
| Nucleotide Transformer | Transformer  | 6 kb        | \(O(L^2)\)        |
| HyenaDNA               | Hyena        | 1 Mb        | \(O(L \log L)\)   |
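
The complexity column can be made concrete with a toy comparison: dense self-attention materializes an \(L \times L\) score matrix, whereas a long convolution with a sequence-length filter can be applied with FFTs in \(O(L \log L)\). The sketch below is not the Hyena operator itself, only an FFT-based circular convolution that illustrates the scaling argument.

```python
# O(L^2) attention scores vs. an O(L log L) FFT-based long convolution (illustration).
import torch

L, d = 4096, 64
x = torch.randn(L, d)

# Dense self-attention must form an L x L score matrix: compute and memory ~ L^2.
q, k = torch.randn(L, d), torch.randn(L, d)
scores = q @ k.T                               # shape (L, L)

# A length-L (here circular) convolution applied per channel via FFT: ~ L log L.
h = torch.randn(L, d)                          # one long filter per channel
x_f = torch.fft.rfft(x, dim=0)
h_f = torch.fft.rfft(h, dim=0)
y = torch.fft.irfft(x_f * h_f, n=L, dim=0)     # output keeps the same length L

print(scores.shape, y.shape)                   # (4096, 4096) vs (4096, 64)
```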

HyenaDNA introduced several qualitative advances:

  1. Megabase-scale context
    Processing 1 Mb windows allows the model to “see”:

    • Entire gene bodies plus flanking regulatory regions.
    • Long-range enhancer–promoter interactions.
    • Topologically associating domain (TAD)-scale structure.

    This aligns better with biological reality, where regulatory interactions often span tens to hundreds of kilobases.

  2. Single-nucleotide resolution
    Despite its long context, HyenaDNA maintains base-level resolution, so single-nucleotide variants can be evaluated in the context of megabases of surrounding sequence.

  3. In-context learning signals
    On Nucleotide Transformer benchmarks and additional tasks, HyenaDNA shows in-context learning behaviors—performance improves when examples are included in the input context without updating model weights (Nguyen et al. 2023). This suggests that at sufficient scale, DNA models can adapt to new tasks or distributions via prompts rather than fine-tuning, mirroring phenomena seen in large language models.

  4. State-of-the-art performance
    On GenomicBenchmarks and related evaluations, HyenaDNA achieves state-of-the-art results on the majority of tasks, often by large margins, illustrating that architectural innovations can yield both longer context and better predictive accuracy (Nguyen et al. 2023).


10.5 Generative Regulatory Foundation Models: GROVER

Most models above focus on sequence-level objectives (masked or next-token). GROVER shifts attention from sequence to regulatory tracks (Sanabria et al. 2024):

  • Inputs/outputs: GROVER is trained on multi-track functional genomics signals (e.g., ATAC-seq, histone marks) across many cell types and tissues instead of raw sequence alone.
  • Objective: Predict masked or held-out regulatory profiles conditioned on neighboring tracks, cell-type embeddings, or limited sequence context.
  • Architecture: A transformer-style backbone tailored to spatiotemporal grids of genomic positions × assays × cell types.

GROVER plays a role analogous to self-supervised vision models for images:

  1. It treats regulatory profiles as a high-dimensional “image” over the genome.
  2. It learns rich representations of regulatory states at each position.
  3. It supports tasks like imputation of missing assays, denoising, and cell-type-specific activity prediction.

While not a pure DNA language model, GROVER-style systems complement sequence-based LMs:

  • DNA LMs capture what the genome can do (the encoded potential).
  • Regulatory LMs like GROVER capture what the genome is actually doing in specific contexts (cell types, conditions).

Later chapters (Part IV) explore how sequence-based and regulatory foundation models can be combined—e.g., using DNA LMs to parameterize sequence priors and regulatory LMs for context-specific readouts (Sanabria et al. 2024).


10.6 Central-Dogma-Aware and Annotation-Enriched Models

Chapter 8 discussed how tokenization can encode biological structure. Some recent models push this further by integrating central dogma and genomic annotations directly into the modeling framework.

10.6.1 Life-Code: Central Dogma as an Inductive Bias

Life-Code proposes codon-aware, central-dogma-informed tokenization to bridge DNA, RNA, and protein within a single language-modeling framework (Liu et al. 2025):

  • Coding regions: Tokenized as codons (3-mers in frame), reflecting the genetic code.
  • Noncoding regions: Tokenized via learned subword units.
  • Integration: Unified representations span DNA → RNA → protein, enabling knowledge sharing across modalities.
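
A toy version of this frame-aware tokenization, assuming coding intervals and their reading frames are already annotated (the interval format and helper name are illustrative, not Life-Code’s actual scheme), looks like this:

```python
# Frame-aware tokenization: codon tokens inside annotated CDS intervals,
# single-nucleotide tokens elsewhere (toy sketch, not Life-Code's actual scheme).

def tokenize_with_cds(seq, cds_intervals):
    """cds_intervals: sorted, non-overlapping (start, end) half-open, in-frame regions."""
    tokens, pos = [], 0
    for start, end in sorted(cds_intervals):
        tokens.extend(seq[pos:start])                              # noncoding: 1-mers
        tokens.extend(seq[i:i + 3] for i in range(start, end, 3))  # coding: codons
        pos = end
    tokens.extend(seq[pos:])                                       # trailing noncoding
    return tokens

seq = "TTACGATGGCTAAATAGCCT"
#  CDS at positions 5..17: ATG GCT AAA TAG
print(tokenize_with_cds(seq, [(5, 17)]))
# ['T', 'T', 'A', 'C', 'G', 'ATG', 'GCT', 'AAA', 'TAG', 'C', 'C', 'T']
```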

Life-Code uses distillation from protein LMs (Chapter 9) to:

  • Import protein-level structural knowledge into DNA/RNA sequence representations.
  • Improve performance on tasks involving coding sequence, such as predicting missense effects or expression changes.
  • Achieve competitive or state-of-the-art results on tasks across the full central dogma (DNA, RNA, protein) (Liu et al. 2025).

10.6.2 BioToken: Encoding Variants and Structure as Tokens

BioToken extends tokenization beyond nucleotide content to include explicit genomic annotations (Medvedev et al. 2025):

  • Variant-aware tokens: Encode SNPs, insertions, and deletions as distinct tokens rather than implicit changes in sequence.
  • Structural annotations: Incorporate information about exons, introns, UTRs, promoters, enhancers, and other regulatory elements.
  • Functional context: Include signals such as chromatin state, conservation scores, or known regulatory motifs.

This design moves toward fully structured genomic LMs, where:

  • The input “sentence” is not only DNA bases but also position-specific metadata.
  • Representations can directly integrate sequence, structure, and functional annotations.
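
One way to picture such an annotation-enriched input is as an interleaved token stream in which each position carries its base plus optional variant and element tags. The data structure and token names below are made up for illustration and are not taken from the BioToken paper.

```python
# Sketch of an annotation-enriched token stream (token names are invented for
# illustration; they are not BioToken's actual vocabulary).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class PositionToken:
    base: str                                   # "A", "C", "G", or "T"
    variant: Optional[str] = None               # e.g., "SNV:C>T" or "DEL:3bp"
    elements: List[str] = field(default_factory=list)  # e.g., ["exon", "enhancer"]

    def to_tokens(self):
        """Flatten this position into discrete tokens for the model."""
        out = [self.base]
        if self.variant:
            out.append(f"[{self.variant}]")
        out.extend(f"[{e.upper()}]" for e in self.elements)
        return out

positions = [
    PositionToken("A", elements=["promoter"]),
    PositionToken("C", variant="SNV:C>T", elements=["promoter"]),
    PositionToken("G"),
]
stream = [tok for p in positions for tok in p.to_tokens()]
print(stream)   # ['A', '[PROMOTER]', 'C', '[SNV:C>T]', '[PROMOTER]', 'G']
```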

Life-Code and BioToken foreshadow the multi-modal, multi-omic foundation models discussed in Part IV, where sequence is only one of many integrated information streams.


10.7 Using Genomic LMs in Practice

Just as protein LMs can be used in different modes (frozen embeddings, fine-tuning, zero-shot scoring; Chapter 9), genomic LMs have multiple usage patterns.

10.7.1 Embeddings as Universal Features

The simplest way to use a genomic LM is to:

  1. Extract embeddings for windows around loci of interest (e.g., 1–6 kb segments).
  2. Pool or select positions relevant to the task (e.g., promoters, candidate enhancers, variant sites).
  3. Train a lightweight downstream model (linear layer, small MLP, or logistic regression).

Applications include:

  • Regulatory element classification: Distinguishing promoters, enhancers, silencers, and insulators.
  • Chromatin state prediction: Predicting ATAC-seq or histone mark presence from sequence alone, as an alternative to models like DeepSEA (Chapter 5).
  • Variant effect scoring: Replacing or augmenting hand-crafted features in frameworks like CADD with LM-derived features (analogous to CADD v1.7’s use of PLM features; Chapter 9; Schubach et al. (2024)).
  • Splicing and transcript modeling: Combining LM embeddings with splice-aware architectures like SpliceAI (Chapter 7).

Because the LM is frozen, this approach is computationally efficient and avoids catastrophic forgetting when new tasks are added.
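
A minimal sketch of this frozen-embedding workflow is shown below. It assumes a genomic LM published as a Hugging Face checkpoint; the model name is a placeholder to be replaced with a real released checkpoint, and some genomic LMs additionally require `trust_remote_code=True`.

```python
# Frozen genomic-LM embeddings plus a lightweight probe (sketch).
# "some-org/genomic-lm" is a placeholder model id, not a real checkpoint.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("some-org/genomic-lm")
model = AutoModel.from_pretrained("some-org/genomic-lm").eval()

def embed(seqs):
    """Mean-pool the final hidden states over each sequence window."""
    with torch.no_grad():
        batch = tokenizer(seqs, return_tensors="pt", padding=True)
        hidden = model(**batch).last_hidden_state           # (batch, tokens, d_model)
        mask = batch["attention_mask"].unsqueeze(-1)         # exclude padding tokens
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled.numpy()

# Toy task: promoter (1) vs. background (0) windows; labels come from elsewhere.
train_seqs = ["ACGT" * 250, "GGTA" * 250, "TTAG" * 250, "CCGA" * 250]
train_y = np.array([1, 0, 1, 0])

probe = LogisticRegression(max_iter=1000).fit(embed(train_seqs), train_y)
print(probe.predict(embed(["ACGT" * 250])))
```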

10.7.2 Fine-Tuning and Task-Specific Heads

When more labels are available, fine-tuning can significantly improve performance:

  • Full fine-tuning: Update all LM parameters for a specific task.
  • Adapter-based tuning: Insert small bottleneck modules into each layer and update only those, keeping the backbone mostly frozen.
  • Prompt-based tuning: Learn task-specific prompts or prefix embeddings that steer the model’s behavior without changing its core weights.
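
Of these options, adapter-based tuning is compact enough to sketch: a small bottleneck MLP with a residual connection is placed after a frozen sub-layer, and only the adapter’s parameters receive gradients. This is a generic adapter sketch, not the adapter design of any particular genomic LM.

```python
# Generic bottleneck adapter: the pretrained layer is frozen and only the
# adapter's parameters are trained (sketch, not a specific genomic LM's design).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model=512, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        nn.init.zeros_(self.up.weight)           # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden):
        return hidden + self.up(torch.relu(self.down(hidden)))   # residual connection

class AdaptedLayer(nn.Module):
    def __init__(self, frozen_layer, d_model=512):
        super().__init__()
        self.layer = frozen_layer
        for p in self.layer.parameters():
            p.requires_grad = False              # backbone stays fixed
        self.adapter = Adapter(d_model)

    def forward(self, x):
        return self.adapter(self.layer(x))

layer = AdaptedLayer(nn.TransformerEncoderLayer(512, 8, batch_first=True))
out = layer(torch.randn(2, 128, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)                      # only the adapter's weights train
```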

Fine-tuning is especially valuable for:

  • High-stakes clinical tasks where every percentage point matters.
  • Tasks that probe very specific sequence-function relationships (e.g., particular TF binding specificities).
  • Scenarios where domain shift is large (e.g., applying a cross-species LM to a previously unseen clade).

10.7.3 Zero-Shot and Few-Shot Variant Scoring

Analogous to protein models like ESM-1v and AlphaMissense (Chapter 9; Brandes et al. (2023); Cheng et al. (2023)), genomic LMs can be used to compute zero-shot variant scores by:

  • Comparing the log-likelihood (or pseudo-likelihood) of sequences with reference vs alternate alleles.
  • Measuring changes in masked-token prediction probabilities at variant positions.
  • Evaluating the impact of a variant on internal representations (e.g., vector distances between reference and variant embeddings).

These approaches can:

  • Provide rapid prioritization of novel variants without task-specific training.
  • Complement supervised classifiers trained on clinical or functional labels (e.g., ClinVar, curated datasets).
  • Offer a starting point for more specialized models (e.g., exon-specific splice models, enhancer-specific expression predictors).
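
As a concrete example of the second strategy listed earlier (changes in masked-token prediction probabilities at the variant position), the sketch below scores a single-nucleotide variant with a masked-LM checkpoint. The model name is a placeholder, and the direct position-to-token mapping assumes single-nucleotide tokens; with k-mer or BPE tokenizers this mapping needs model-specific care.

```python
# Zero-shot variant scoring via masked-token log-probabilities (sketch).
# "some-org/genomic-mlm" is a placeholder model id; the position-to-token mapping
# below assumes single-nucleotide tokens.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("some-org/genomic-mlm")
model = AutoModelForMaskedLM.from_pretrained("some-org/genomic-mlm").eval()

def masked_allele_score(window, pos, ref, alt):
    """Return log P(alt) - log P(ref) at `pos`, with that position masked out."""
    assert window[pos] == ref, "reference allele does not match the window"
    masked = window[:pos] + tokenizer.mask_token + window[pos + 1:]
    batch = tokenizer(masked, return_tensors="pt")
    mask_idx = (batch["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
    with torch.no_grad():
        logits = model(**batch).logits[0, mask_idx]
    log_probs = torch.log_softmax(logits, dim=-1)
    ref_id = tokenizer.convert_tokens_to_ids(ref)
    alt_id = tokenizer.convert_tokens_to_ids(alt)
    return (log_probs[alt_id] - log_probs[ref_id]).item()

window = "ACGT" * 100                            # 400-bp window around the variant
print(masked_allele_score(window, pos=200, ref="A", alt="G"))
```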

10.8 Evaluation, Benchmarks, and Pitfalls

As genomic LMs proliferate, evaluation practices become crucial.

10.8.1 Benchmark Suites

Nucleotide Transformer introduced a widely used benchmark panel (Dalla-Torre et al. 2023), and later work, including HyenaDNA and Life-Code, also reports results on GenomicBenchmarks and related collections (Nguyen et al. 2023). Common traits of these suites include:

  • Multiple task families:
    • Promoter/enhancer classification.
    • TF binding prediction.
    • Chromatin accessibility and histone marks.
    • Splicing, TSS/TES prediction, or other sequence-label tasks.
  • Standardized splits:
    • Train/validation/test partitions.
    • Consistent evaluation metrics (AUROC, AUPRC, accuracy).
  • Baseline comparisons:
    • Non-pretrained CNNs and transformers.
    • Earlier models like DeepSEA, ExPecto, and SpliceAI.
    • Previously published genomic LMs.

These benchmarks help separate true representational gains from gains due to dataset choice or training tricks.

10.8.2 Distribution Shift, Data Leakage, and Overfitting

Genomic LMs are especially vulnerable to distribution shift: they may be pretrained on one mix of species, assays, or cohorts and then applied to a very different context (new genome builds, experimental protocols, or patient populations). The general principles and evaluation strategies for robustness to these shifts are covered in detail in Chapter 15; here we mainly note that genomic LM benchmarks should explicitly include “out-of-domain” settings (e.g., new tissues or cohorts) rather than only i.i.d. held-out sequences.

Because many genomic resources are reused across pretraining, fine-tuning, and evaluation, data leakage and overfitting can easily inflate retrospective performance, especially for genomic LMs trained on massive unlabeled corpora and then evaluated on derived benchmarks. Systematic treatment of leakage paths, circularity, and confounding—along with practical mitigation strategies—is given in Chapter 16.

Finally, when genomic LM-derived features feed into clinical prediction pipelines (e.g., risk scores or variant prioritization tools), the relevant notion of “performance” becomes clinical: discrimination, calibration, and net benefit for specific decisions. These clinical evaluation criteria, and how to connect model scores to real-world decisions, are discussed in Chapter 18.


10.9 Lessons and Outlook

DNA language models bring the “foundation model” paradigm to the genome. Several themes emerge:

  1. Representation is central
    Tokenization and context length (Chapter 8) are not superficial implementation details—they determine what patterns a model can see and how efficiently it can learn. Life-Code and BioToken show that biologically informed tokenization and annotations can serve as powerful inductive biases (Liu et al. 2025; Medvedev et al. 2025).

  2. Scale and diversity matter
    Nucleotide Transformer and HyenaDNA demonstrate that performance improves with both model size and training data diversity (Dalla-Torre et al. 2023; Nguyen et al. 2023). Including multiple species, populations, and genomic contexts yields more robust representations.

  3. Long-range context is biologically necessary
    Many regulatory phenomena operate at tens to hundreds of kilobases. Megabase-scale models like HyenaDNA show that we can finally begin to model these interactions at single-base resolution in a single forward pass.

  4. Self-supervision and supervision are complementary
    Self-supervised LMs excel at learning broad, reusable features, but they do not automatically solve every downstream problem. Specialized architectures and supervised objectives (e.g., Enformer and related models in Chapter 11) remain crucial for accurate quantitative prediction of complex genomic readouts.

  5. Integration with other modalities is the next frontier
    Models like GROVER, Life-Code, and BioToken hint at a future where DNA LMs are one part of larger multi-modal genomic foundation models that integrate:

    • Sequence (DNA, RNA, protein).
    • Regulatory profiles (chromatin, expression).
    • 3D genome organization.
    • Population genetics, phenotypes, and clinical data.

This chapter has focused on sequence-centric DNA LMs and their immediate extensions. In Chapter 11, we turn to Enformer and related long-range sequence-to-function models that explicitly predict molecular readouts from sequence, closing the loop between self-supervised sequence understanding and supervised functional prediction.