Appendix D — Model Reference

This appendix provides a reference catalog of genomic foundation models and related computational tools discussed throughout the book. Models are organized by category with key specifications to help practitioners select appropriate tools for their applications.

D.1 DNA Language Models

Model Parameters Context Tokenization Key Capability Citation
DNABERT 110M 512 bp 6-mer Promoters, splice sites, TF binding Ji et al. (2021)
DNABERT-2 117M 512 bp BPE Improved efficiency, multi-species Z. Zhou et al. (2024)
Nucleotide Transformer 50M to 2.5B 6 kb 6-mer Embeddings, regulatory prediction Dalla-Torre et al. (2023)
HyenaDNA 1.4M to 6.6M 1 Mb Single nucleotide Long-range dependencies Nguyen et al. (2023)
Caduceus 1.8M to 7.4M 131 kb Single nucleotide Bidirectional, reverse complement Schiff et al. (2024)
GROVER 80M to 520M 2 kb BPE Human genome sequence context Sanabria et al. (2024)
Evo 7B 131 kb Single nucleotide Generation, whole-genome Nguyen et al. (2024)
Evo 2 7B to 40B 1 Mb Single nucleotide Multi-scale prediction Brixi et al. (2025)

D.1.1 Model Access

Model Repository Weights License
DNABERT github.com/jerryji1993/DNABERT HuggingFace MIT
DNABERT-2 github.com/MAGICS-LAB/DNABERT_2 HuggingFace MIT
Nucleotide Transformer github.com/instadeepai/nucleotide-transformer HuggingFace CC BY-NC-SA 4.0
HyenaDNA github.com/HazyResearch/hyena-dna HuggingFace Apache 2.0
Caduceus github.com/kuleshov-group/caduceus HuggingFace Apache 2.0
Evo github.com/evo-design/evo HuggingFace Apache 2.0
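
All of the DNA language models above publish weights on the HuggingFace Hub, so embedding extraction follows the standard transformers pattern. The snippet below is a minimal sketch using the DNABERT-2 checkpoint as an example; verify the repository ID and the trust_remote_code requirement on the model's Hub page before use.

```python
# Minimal sketch: extract a fixed-length embedding from a DNA language model
# on the HuggingFace Hub. Checkpoint ID follows the DNABERT-2 release; verify
# it (and the trust_remote_code requirement) on the Hub page before relying on it.
import torch
from transformers import AutoTokenizer, AutoModel

checkpoint = "zhihan1996/DNABERT-2-117M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).eval()

sequence = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
input_ids = tokenizer(sequence, return_tensors="pt")["input_ids"]

with torch.no_grad():
    hidden_states = model(input_ids)[0]   # (1, n_tokens, hidden_dim)

# Mean-pool over tokens to obtain one embedding per sequence.
embedding = hidden_states.mean(dim=1)     # (1, hidden_dim)
print(embedding.shape)
```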

D.2 Protein Language Models

Model Parameters Context Architecture Key Capability Citation
ESM-2 8M to 15B 1,024 AA Transformer encoder Structure, function, variants Lin et al. (2022)
ESM-1v 650M 1,024 AA Transformer encoder Zero-shot variant effects Meier et al. (2021)
ESMFold ~3B (ESM-2 backbone) 1,024 AA Encoder + structure head Single-sequence folding Lin et al. (2022)
ProtTrans 420M to 3B 1,024 AA Transformer General-purpose protein embeddings Elnaggar et al. (2021)
ProGen2 151M to 6.4B 1,024 AA Autoregressive Protein generation Nijkamp et al. (2023)

D.2.1 Model Access

Model Repository Weights License
ESM-2 github.com/facebookresearch/esm HuggingFace MIT
ESMFold github.com/facebookresearch/esm HuggingFace MIT
ProtTrans github.com/agemagician/ProtTrans HuggingFace Academic
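
Because the ESM checkpoints are distributed as standard masked language models, a zero-shot variant effect score in the spirit of ESM-1v can be computed in a few lines. The sketch below applies the masked-marginal heuristic (log-likelihood of the alternate residue minus the reference residue at a masked position) with an ESM-2 checkpoint; treat it as an illustration of the idea, not the published scoring pipeline.

```python
# Minimal sketch: masked-marginal variant scoring with ESM-2, approximating the
# ESM-1v zero-shot protocol. Not the published pipeline; for illustration only.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

checkpoint = "facebook/esm2_t33_650M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint).eval()

def masked_marginal_score(sequence: str, pos: int, ref: str, alt: str) -> float:
    """log p(alt) - log p(ref) at a masked position (0-based); more negative = alt disfavored."""
    assert sequence[pos] == ref, "reference residue mismatch"
    tokens = tokenizer(sequence, return_tensors="pt")
    # Offset by 1 for the BOS token the ESM tokenizer prepends.
    tokens["input_ids"][0, pos + 1] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(**tokens).logits
    log_probs = torch.log_softmax(logits[0, pos + 1], dim=-1)
    return (log_probs[tokenizer.convert_tokens_to_ids(alt)]
            - log_probs[tokenizer.convert_tokens_to_ids(ref)]).item()

print(masked_marginal_score("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", pos=3, ref="A", alt="V"))
```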

D.3 Sequence-to-Function Models

Model Input Output Architecture Key Capability Citation
DeepSEA 1 kb 919 chromatin features CNN Regulatory variant effects J. Zhou and Troyanskaya (2015)
Beluga 2 kb 2,002 features CNN Extended DeepSEA J. Zhou et al. (2018)
Sei 4 kb 21,907 targets CNN Sequence classes Chen et al. (2022)
Basenji 131 kb 4,229 tracks Dilated CNN Expression prediction Kelley et al. (2018)
Basenji2 131 kb 5,313 tracks Dilated CNN Cross-species, human + mouse Kelley (2020)
Enformer 196 kb 5,313 tracks Transformer Long-range regulation Avsec et al. (2021)
Borzoi 524 kb RNA-seq Transformer RNA expression Linder et al. (2025)

D.3.1 Model Access

Model Repository Weights License
DeepSEA/Beluga kipoi.org Kipoi Academic
Sei github.com/FunctionLab/sei-framework Zenodo MIT
Basenji/Basenji2 github.com/calico/basenji Direct Apache 2.0
Enformer github.com/deepmind/deepmind-research/tree/master/enformer TF Hub Apache 2.0
Borzoi github.com/calico/borzoi Direct Apache 2.0
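
For Enformer, the published checkpoint on TF Hub can be queried directly from Python once a genomic window has been one-hot encoded. The sketch below follows the pattern in the public Enformer usage notebook; the hub URL, the 393,216 bp padded input length, and the (896, 5313) human output shape are taken from that documentation and should be confirmed against the current release (TF Hub assets have been migrating to Kaggle Models).

```python
# Minimal sketch: query the published Enformer checkpoint from TF Hub.
# URL, input length, and output shape follow the public usage notebook;
# confirm them against the current release before relying on this.
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

def one_hot_encode(sequence: str) -> np.ndarray:
    """A/C/G/T -> one-hot rows; any other character (e.g. N) becomes all zeros."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    onehot = np.zeros((len(sequence), 4), dtype=np.float32)
    for i, base in enumerate(sequence.upper()):
        if base in index:
            onehot[i, index[base]] = 1.0
    return onehot

enformer = hub.load("https://tfhub.dev/deepmind/enformer/1").model

sequence = "N" * 393_216                       # replace with a real genomic window
batch = one_hot_encode(sequence)[np.newaxis]    # (1, 393216, 4)
predictions = enformer.predict_on_batch(tf.constant(batch))
human_tracks = predictions["human"]             # (1, 896, 5313): 128 bp bins x tracks
print(human_tracks.shape)
```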

D.4 Splice Prediction Models

Model Input Output Architecture Key Capability Citation
SpliceAI 10 kb context Splice probability ResNet Cryptic splice sites Jaganathan et al. (2019)
MaxEntScan 9 nt (5′) / 23 nt (3′) Splice score Maximum entropy model Consensus splice site scoring Yeo and Burge (2004)
Pangolin 5 kb Tissue-specific splicing Dilated CNN Tissue context Zeng and Li (2022)

D.4.1 Model Access

Model Repository Web Interface License
SpliceAI github.com/Illumina/SpliceAI spliceailookup.broadinstitute.org GPLv3
Pangolin github.com/tkzeng/Pangolin N/A MIT
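
SpliceAI is normally run as a command-line tool that writes its four delta scores into the VCF INFO field, so downstream filtering reduces to parsing that annotation. The sketch below parses the field layout described in the SpliceAI documentation (ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL) and flags records above a delta-score cutoff; the INFO string is illustrative, and 0.5 is the commonly cited default cutoff rather than a universal rule.

```python
# Minimal sketch: parse SpliceAI delta scores from an annotated VCF INFO field.
# Field layout per the SpliceAI documentation:
#   SpliceAI=ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL
def parse_spliceai(info: str) -> list[dict]:
    """Return one record per annotated allele/gene pair in an INFO string."""
    records = []
    for field in info.split(";"):
        if not field.startswith("SpliceAI="):
            continue
        for entry in field[len("SpliceAI="):].split(","):
            allele, symbol, ds_ag, ds_al, ds_dg, ds_dl, *_positions = entry.split("|")
            deltas = {
                "acceptor_gain": float(ds_ag),
                "acceptor_loss": float(ds_al),
                "donor_gain": float(ds_dg),
                "donor_loss": float(ds_dl),
            }
            records.append({"allele": allele, "gene": symbol, **deltas,
                            "max_delta": max(deltas.values())})
    return records

info = "AC=1;SpliceAI=T|BRCA1|0.07|0.91|0.00|0.02|-12|3|-38|5"   # illustrative values
for rec in parse_spliceai(info):
    flagged = rec["max_delta"] >= 0.5   # commonly used cutoff; tune for your use case
    print(rec["gene"], rec["max_delta"], "flagged" if flagged else "pass")
```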

D.5 Variant Effect Predictors

D.5.1 Integrative Scores

Model Input Method Key Features Citation
CADD Any variant Ensemble ML 100+ annotations, universal Rentzsch et al. (2019)
REVEL Missense Ensemble 13 tool integration Ioannidis et al. (2016)
PrimateAI-3D Missense Deep learning + structure Primate conservation Gao et al. (2023)

D.5.2 Protein Language Model-Based

Model Input Method Key Features Citation
AlphaMissense Missense AlphaFold-derived + protein LM Structure-aware PLM Cheng et al. (2023)
ESM-1v Missense Zero-shot PLM No training required Meier et al. (2021)
EVE Missense VAE on MSA Evolutionary model Frazer et al. (2021)
GPN-MSA Any variant Alignment LM Conservation + context Benegas et al. (2024)

D.5.3 Conservation-Based

Model Input Method Key Features Citation
SIFT Missense Sequence conservation Fast, interpretable Ng and Henikoff (2003)
PolyPhen-2 Missense Conservation + structure HumDiv/HumVar models Adzhubei et al. (2010)
GERP++ Any position Rejected substitutions Base-level conservation Davydov et al. (2010)
phyloP Any position Phylogenetic model Acceleration/conservation Pollard et al. (2009)

D.5.4 Model Access

Model Access Available Resources
CADD cadd.gs.washington.edu Score lookup + download
AlphaMissense github.com/google-deepmind/alphamissense Precomputed scores
REVEL sites.google.com/site/revelgenomics Precomputed scores
gnomAD gnomad.broadinstitute.org Integrated VEP scores
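
CADD and AlphaMissense both distribute genome-wide precomputed scores as bgzip-compressed, tabix-indexed TSV files, so local lookup does not require running either model. The sketch below uses pysam to fetch the rows overlapping a position; the file name, coordinates, and column layout are placeholders, so check the header of the release you download (chromosome naming, for example, differs between releases).

```python
# Minimal sketch: look up precomputed variant scores in a bgzipped,
# tabix-indexed TSV (the distribution format used by CADD and AlphaMissense).
# File name, coordinates, and column layout are placeholders; check the
# header line of the release you actually downloaded.
import pysam

def fetch_rows(tsv_path: str, chrom: str, pos: int) -> list[list[str]]:
    """Return all precomputed rows overlapping a 1-based position."""
    with pysam.TabixFile(tsv_path) as tbx:
        return [row.split("\t") for row in tbx.fetch(chrom, pos - 1, pos)]

for row in fetch_rows("precomputed_scores.tsv.gz", "17", 43_094_692):
    print(row)
```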

D.6 Structure Prediction

Model Input Output Key Capability Citation
AlphaFold2 Protein sequence + MSA 3D structure High-accuracy folding Jumper et al. (2021)
AlphaFold3 Protein/DNA/RNA/ligand Complex structure Multi-molecule complexes Abramson et al. (2024)
ESMFold Protein sequence 3D structure Single-sequence, fast Lin et al. (2022)
RoseTTAFold Protein sequence + MSA 3D structure Three-track architecture Baek et al. (2021)

D.6.1 Model Access

Model Repository Server License
AlphaFold2 github.com/google-deepmind/alphafold alphafold.ebi.ac.uk Apache 2.0
AlphaFold3 github.com/google-deepmind/alphafold3 alphafoldserver.com Research only
ESMFold github.com/facebookresearch/esm esmatlas.com MIT
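
ESMFold's single-sequence design makes local structure prediction straightforward once the fair-esm package and weights are installed (pip install "fair-esm[esmfold]"). The sketch below follows the usage pattern documented in the ESM repository; verify the call signature against the version you install, and note that long sequences need substantial GPU memory.

```python
# Minimal sketch: single-sequence structure prediction with ESMFold via the
# fair-esm package. Pattern follows the ESM repository README; verify against
# the installed version. The example sequence is an arbitrary illustration.
import torch
import esm

model = esm.pretrained.esmfold_v1().eval()
if torch.cuda.is_available():
    model = model.cuda()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR"
with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)   # PDB-format coordinates as text

with open("prediction.pdb", "w") as handle:
    handle.write(pdb_string)
```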

D.7 Single-Cell and Multi-Omics Models

Model Input Output Key Capability Citation
scGPT scRNA-seq Cell embeddings Cell type, perturbation Cui et al. (2024)
Geneformer scRNA-seq Gene embeddings Transfer learning Theodoris et al. (2023)
scBERT scRNA-seq Cell embeddings Cell annotation Yang et al. (2022)
GLUE Multi-omics Integrated embeddings Cross-modality integration Cao and Gao (2022)

D.8 Polygenic and Clinical Models

Model Input Output Key Capability Citation
Delphi Genotypes Disease risk Deep PGS Georgantas, Kutalik, and Richiardi (2024)
DeepRVAT Rare variants Gene burden Rare variant aggregation Clarke et al. (2024)
G2PT Genotypes + phenotypes Risk prediction Genotype-to-phenotype Lee et al. (2025)

D.9 Category Definitions

DNA LM
DNA language models using self-supervised pretraining (masked language modeling or autoregressive) on genomic sequences. Produce embeddings useful for diverse downstream tasks.
PLM
Protein language models trained on protein sequences using similar self-supervised objectives. Capture evolutionary and structural information.
Seq→Func
Supervised sequence-to-function models predicting molecular phenotypes (chromatin accessibility, histone modifications, gene expression) directly from DNA sequence.
Splice
Specialized models for splice site recognition and splicing outcome prediction.
VEP
Variant effect predictors spanning multiple paradigms: conservation-based, integrative ensemble, and foundation model-based approaches.
Structure
Protein (and nucleic acid) structure prediction models.
GFM
Genomic foundation model: a broad term for models with reusable representations applicable across multiple downstream tasks.

D.10 Practical Considerations

D.10.1 Selecting a Model

When choosing a model for a specific application:

  1. Task alignment: Does the model’s pretraining objective match your task? MLM-pretrained models excel at classification; autoregressive models enable generation.

  2. Context requirements: Long-range regulatory effects require models with large context windows (Enformer, HyenaDNA, Evo). Local motif tasks work with shorter contexts.

  3. Computational resources: Parameter counts range from millions to billions. Smaller models (DNABERT, 110M) run on consumer GPUs; larger models (Evo 2, 40B) require substantial infrastructure. A rough memory estimate is sketched after this list.

  4. License restrictions: Some models restrict commercial use (CC BY-NC) or require academic affiliation. Verify license compatibility before deployment.

  5. Benchmark performance: Consult Chapter 11 for standardized comparisons on tasks relevant to your application.
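
As a back-of-the-envelope check on point 3, parameter count multiplied by bytes per parameter gives a floor on inference memory: 40B parameters in 16-bit precision is roughly 80 GB of weights before activations are counted. The sketch below encodes that arithmetic; the 1.2x overhead factor for activations and buffers is an illustrative assumption, not a measured value.

```python
# Minimal sketch: rough GPU memory needed just to hold model weights at
# inference time. The 1.2x activation/buffer overhead is an assumption.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(n_params: float, precision: str = "fp16",
                     overhead: float = 1.2) -> float:
    """Approximate memory (GB) to load a model's weights for inference."""
    return n_params * BYTES_PER_PARAM[precision] * overhead / 1e9

for name, n in [("DNABERT (110M)", 110e6), ("ESM-2 (650M)", 650e6),
                ("Evo (7B)", 7e9), ("Evo 2 (40B)", 40e9)]:
    print(f"{name:15s} ~{weight_memory_gb(n):6.1f} GB in fp16")
```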

D.10.2 Model Versioning

Foundation models are actively developed, with new versions often substantially outperforming predecessors. When citing or deploying models:

  • Specify exact version and checkpoint (e.g., “ESM-2 650M, checkpoint esm2_t33_650M_UR50D”)
  • Record a hash of the model weights for reproducibility (see the hashing sketch after this list)
  • Note training data version (UniRef versions change over time)
  • Document inference parameters (temperature, sampling strategy for generative models)
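
A SHA-256 digest of the downloaded checkpoint file is one simple way to implement the weights-hash recommendation above. The sketch below computes the digest with the standard library; the file name is a placeholder for whichever checkpoint you record.

```python
# Minimal sketch: record a SHA-256 digest of a downloaded checkpoint so results
# can later be tied to the exact weights used. File name is a placeholder.
import hashlib
from pathlib import Path

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(sha256_of("esm2_t33_650M_UR50D.pt"))
```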