Appendix D — Model Reference

This appendix provides a reference catalog of genomic foundation models and related computational tools discussed throughout the book. Models are organized by category with key specifications to help practitioners select appropriate tools for their applications.

D.1 DNA Language Models

Model Parameters Context Tokenization Key Capability Citation
DNABERT 110M 512 bp 6-mer Promoters, splice sites, TF binding Ji et al. (2021)
DNABERT-2 117M 512 bp BPE Improved efficiency, multi-species Z. Zhou et al. (2024)
Nucleotide Transformer 50M to 2.5B 6 kb 6-mer Embeddings, regulatory prediction Dalla-Torre et al. (2023)
HyenaDNA 1.4M to 6.6M 1 Mb Single nucleotide Long-range dependencies Nguyen et al. (2023)
Caduceus 1.8M to 7.4M 131 kb Single nucleotide Bidirectional, reverse complement Schiff et al. (2024)
GROVER 80M to 520M 2 kb BPE Human genome sequence context Sanabria et al. (2024)
Evo 7B 131 kb Single nucleotide Generation, whole-genome Nguyen et al. (2024)
Evo 2 7B to 40B 1 Mb Single nucleotide Multi-scale prediction Brixi et al. (2025)

D.1.1 Model Access

Model Repository Weights License
DNABERT github.com/jerryji1993/DNABERT HuggingFace MIT
DNABERT-2 github.com/MAGICS-LAB/DNABERT_2 HuggingFace MIT
Nucleotide Transformer github.com/instadeepai/nucleotide-transformer HuggingFace CC BY-NC-SA 4.0
HyenaDNA github.com/HazyResearch/hyena-dna HuggingFace Apache 2.0
Caduceus github.com/kuleshov-group/caduceus HuggingFace Apache 2.0
Evo github.com/evo-design/evo HuggingFace Apache 2.0
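
All of the DNA language models above publish weights on the HuggingFace Hub, so embedding extraction follows the standard transformers pattern. The snippet below is a minimal sketch using the DNABERT-2 checkpoint as an example; verify the repository ID and the trust_remote_code requirement on the model's Hub page before use.

```python
# Minimal sketch: extract a fixed-length embedding from a DNA language model
# on the HuggingFace Hub. Checkpoint ID follows the DNABERT-2 release; verify
# it (and the trust_remote_code requirement) on the Hub page before relying on it.
import torch
from transformers import AutoTokenizer, AutoModel

checkpoint = "zhihan1996/DNABERT-2-117M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).eval()

sequence = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
input_ids = tokenizer(sequence, return_tensors="pt")["input_ids"]

with torch.no_grad():
    hidden_states = model(input_ids)[0]   # (1, n_tokens, hidden_dim)

# Mean-pool over tokens to obtain one embedding per sequence.
embedding = hidden_states.mean(dim=1)     # (1, hidden_dim)
print(embedding.shape)
```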

D.2 Protein Language Models

Model Parameters Context Architecture Key Capability Citation
ESM-2 8M to 15B 1,024 AA Transformer encoder Structure, function, variants Lin et al. (2022)
ESM-1v 650M 1,024 AA Transformer encoder Zero-shot variant effects Meier et al. (2021)
ESMFold ~3B (ESM-2 backbone) 1,024 AA Encoder + structure head Single-sequence folding Lin et al. (2022)
ProtTrans 420M to 3B 1,024 AA Transformer General-purpose protein embeddings Elnaggar et al. (2021)
ProGen2 151M to 6.4B 1,024 AA Autoregressive Protein generation Nijkamp et al. (2023)

D.2.1 Model Access

Model Repository Weights License
ESM-2 github.com/facebookresearch/esm HuggingFace MIT
ESMFold github.com/facebookresearch/esm HuggingFace MIT
ProtTrans github.com/agemagician/ProtTrans HuggingFace Academic
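
Because the ESM checkpoints are distributed as standard masked language models, a zero-shot variant effect score in the spirit of ESM-1v can be computed in a few lines. The sketch below applies the masked-marginal heuristic (log-likelihood of the alternate residue minus the reference residue at a masked position) with an ESM-2 checkpoint; treat it as an illustration of the idea, not the published scoring pipeline.

```python
# Minimal sketch: masked-marginal variant scoring with ESM-2, approximating the
# ESM-1v zero-shot protocol. Not the published pipeline; for illustration only.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

checkpoint = "facebook/esm2_t33_650M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint).eval()

def masked_marginal_score(sequence: str, pos: int, ref: str, alt: str) -> float:
    """log p(alt) - log p(ref) at a masked position (0-based); more negative = alt disfavored."""
    assert sequence[pos] == ref, "reference residue mismatch"
    tokens = tokenizer(sequence, return_tensors="pt")
    # Offset by 1 for the BOS token the ESM tokenizer prepends.
    tokens["input_ids"][0, pos + 1] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(**tokens).logits
    log_probs = torch.log_softmax(logits[0, pos + 1], dim=-1)
    return (log_probs[tokenizer.convert_tokens_to_ids(alt)]
            - log_probs[tokenizer.convert_tokens_to_ids(ref)]).item()

print(masked_marginal_score("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", pos=3, ref="A", alt="V"))
```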

D.3 Sequence-to-Function Models

Model Input Output Architecture Key Capability Citation
DeepSEA 1 kb 919 chromatin features CNN Regulatory variant effects J. Zhou and Troyanskaya (2015)
Beluga 2 kb 2,002 features CNN Extended DeepSEA J. Zhou et al. (2018)
Sei 4 kb 21,907 targets CNN Sequence classes Chen et al. (2022)
Basenji 131 kb 4,229 tracks Dilated CNN Expression prediction Kelley et al. (2018)
Basenji2 131 kb 5,313 tracks Dilated CNN Cross-species, human + mouse Kelley (2020)
Enformer 196 kb 5,313 tracks Transformer Long-range regulation Avsec et al. (2021)
Borzoi 524 kb RNA-seq Transformer RNA expression Linder et al. (2025)

D.3.1 Model Access

Model Repository Weights License
DeepSEA/Beluga kipoi.org Kipoi Academic
Sei github.com/FunctionLab/sei-framework Zenodo MIT
Basenji/Basenji2 github.com/calico/basenji Direct Apache 2.0
Enformer github.com/deepmind/deepmind-research/tree/master/enformer TF Hub Apache 2.0
Borzoi github.com/calico/borzoi Direct Apache 2.0
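
For Enformer, the published checkpoint on TF Hub can be queried directly from Python once a genomic window has been one-hot encoded. The sketch below follows the pattern in the public Enformer usage notebook; the hub URL, the 393,216 bp padded input length, and the (896, 5313) human output shape are taken from that documentation and should be confirmed against the current release (TF Hub assets have been migrating to Kaggle Models).

```python
# Minimal sketch: query the published Enformer checkpoint from TF Hub.
# URL, input length, and output shape follow the public usage notebook;
# confirm them against the current release before relying on this.
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

def one_hot_encode(sequence: str) -> np.ndarray:
    """A/C/G/T -> one-hot rows; any other character (e.g. N) becomes all zeros."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    onehot = np.zeros((len(sequence), 4), dtype=np.float32)
    for i, base in enumerate(sequence.upper()):
        if base in index:
            onehot[i, index[base]] = 1.0
    return onehot

enformer = hub.load("https://tfhub.dev/deepmind/enformer/1").model

sequence = "N" * 393_216                       # replace with a real genomic window
batch = one_hot_encode(sequence)[np.newaxis]    # (1, 393216, 4)
predictions = enformer.predict_on_batch(tf.constant(batch))
human_tracks = predictions["human"]             # (1, 896, 5313): 128 bp bins x tracks
print(human_tracks.shape)
```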

D.4 Splice Prediction Models

Model Input Output Architecture Key Capability Citation
SpliceAI 10 kb context Splice probability ResNet Cryptic splice sites Jaganathan et al. (2019)
MaxEntScan 9 nt (5′) / 23 nt (3′) Splice score Maximum entropy model Consensus splice site scoring Yeo and Burge (2004)
Pangolin 5 kb Tissue-specific splicing Dilated CNN Tissue context Zeng and Li (2022)

D.4.1 Model Access

Model Repository Web Interface License
SpliceAI github.com/Illumina/SpliceAI spliceailookup.broadinstitute.org GPLv3
Pangolin github.com/tkzeng/Pangolin N/A MIT
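
SpliceAI is normally run as a command-line tool that writes its four delta scores into the VCF INFO field, so downstream filtering reduces to parsing that annotation. The sketch below parses the field layout described in the SpliceAI documentation (ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL) and flags records above a delta-score cutoff; the INFO string is illustrative, and 0.5 is the commonly cited default cutoff rather than a universal rule.

```python
# Minimal sketch: parse SpliceAI delta scores from an annotated VCF INFO field.
# Field layout per the SpliceAI documentation:
#   SpliceAI=ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL
def parse_spliceai(info: str) -> list[dict]:
    """Return one record per annotated allele/gene pair in an INFO string."""
    records = []
    for field in info.split(";"):
        if not field.startswith("SpliceAI="):
            continue
        for entry in field[len("SpliceAI="):].split(","):
            allele, symbol, ds_ag, ds_al, ds_dg, ds_dl, *_positions = entry.split("|")
            deltas = {
                "acceptor_gain": float(ds_ag),
                "acceptor_loss": float(ds_al),
                "donor_gain": float(ds_dg),
                "donor_loss": float(ds_dl),
            }
            records.append({"allele": allele, "gene": symbol, **deltas,
                            "max_delta": max(deltas.values())})
    return records

info = "AC=1;SpliceAI=T|BRCA1|0.07|0.91|0.00|0.02|-12|3|-38|5"   # illustrative values
for rec in parse_spliceai(info):
    flagged = rec["max_delta"] >= 0.5   # commonly used cutoff; tune for your use case
    print(rec["gene"], rec["max_delta"], "flagged" if flagged else "pass")
```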

D.5 Variant Effect Predictors

D.5.1 Integrative Scores

Model Input Method Key Features Citation
CADD Any variant Ensemble ML 100+ annotations, universal Rentzsch et al. (2019)
REVEL Missense Ensemble 13 tool integration Ioannidis et al. (2016)
PrimateAI-3D Missense Deep learning + structure Primate conservation Gao et al. (2023)

D.5.2 Protein Language Model-Based

Model Input Method Key Features Citation
AlphaMissense Missense AlphaFold-derived + protein LM Structure-aware PLM Cheng et al. (2023)
ESM-1v Missense Zero-shot PLM No training required Meier et al. (2021)
EVE Missense VAE on MSA Evolutionary model Frazer et al. (2021)
GPN-MSA Any variant Alignment LM Conservation + context Benegas et al. (2024)

D.5.3 Conservation-Based

Model Input Method Key Features Citation
SIFT Missense Sequence conservation Fast, interpretable Ng and Henikoff (2003)
PolyPhen-2 Missense Conservation + structure HumDiv/HumVar models Adzhubei et al. (2010)
GERP++ Any position Rejected substitutions Base-level conservation Davydov et al. (2010)
phyloP Any position Phylogenetic model Acceleration/conservation Pollard et al. (2009)

D.5.4 Model Access

Model Access Available Resources
CADD cadd.gs.washington.edu Score lookup + download
AlphaMissense github.com/google-deepmind/alphamissense Precomputed scores
REVEL sites.google.com/site/revelgenomics Precomputed scores
gnomAD gnomad.broadinstitute.org Integrated VEP scores
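
CADD and AlphaMissense both distribute genome-wide precomputed scores as bgzip-compressed, tabix-indexed TSV files, so local lookup does not require running either model. The sketch below uses pysam to fetch the rows overlapping a position; the file name, coordinates, and column layout are placeholders, so check the header of the release you download (chromosome naming, for example, differs between releases).

```python
# Minimal sketch: look up precomputed variant scores in a bgzipped,
# tabix-indexed TSV (the distribution format used by CADD and AlphaMissense).
# File name, coordinates, and column layout are placeholders; check the
# header line of the release you actually downloaded.
import pysam

def fetch_rows(tsv_path: str, chrom: str, pos: int) -> list[list[str]]:
    """Return all precomputed rows overlapping a 1-based position."""
    with pysam.TabixFile(tsv_path) as tbx:
        return [row.split("\t") for row in tbx.fetch(chrom, pos - 1, pos)]

for row in fetch_rows("precomputed_scores.tsv.gz", "17", 43_094_692):
    print(row)
```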

D.6 Structure Prediction

Model Input Output Key Capability Citation
AlphaFold2 Protein sequence + MSA 3D structure High-accuracy folding Jumper et al. (2021)
AlphaFold3 Protein/DNA/RNA/ligand Complex structure Multi-molecule complexes Abramson et al. (2024)
ESMFold Protein sequence 3D structure Single-sequence, fast Lin et al. (2022)
RoseTTAFold Protein sequence + MSA 3D structure Three-track architecture Baek et al. (2021)

D.6.1 Model Access

Model Repository Server License
AlphaFold2 github.com/google-deepmind/alphafold alphafold.ebi.ac.uk Apache 2.0
AlphaFold3 github.com/google-deepmind/alphafold3 alphafoldserver.com Research only
ESMFold github.com/facebookresearch/esm esmatlas.com MIT
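
ESMFold's single-sequence design makes local structure prediction straightforward once the fair-esm package and weights are installed (pip install "fair-esm[esmfold]"). The sketch below follows the usage pattern documented in the ESM repository; verify the call signature against the version you install, and note that long sequences need substantial GPU memory.

```python
# Minimal sketch: single-sequence structure prediction with ESMFold via the
# fair-esm package. Pattern follows the ESM repository README; verify against
# the installed version. The example sequence is an arbitrary illustration.
import torch
import esm

model = esm.pretrained.esmfold_v1().eval()
if torch.cuda.is_available():
    model = model.cuda()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVHSLAKWKR"
with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)   # PDB-format coordinates as text

with open("prediction.pdb", "w") as handle:
    handle.write(pdb_string)
```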

D.7 Single-Cell and Multi-Omics Models

Model Input Output Key Capability Citation
scGPT scRNA-seq Cell embeddings Cell type, perturbation Cui et al. (2024)
Geneformer scRNA-seq Gene embeddings Transfer learning Theodoris et al. (2023)
scBERT scRNA-seq Cell embeddings Cell annotation Yang et al. (2022)
GLUE Multi-omics Integrated embeddings Cross-modality integration Cao and Gao (2022)

D.8 Polygenic and Clinical Models

Model Input Output Key Capability Citation
Delphi Genotypes Disease risk Deep PGS Georgantas, Kutalik, and Richiardi (2024)
DeepRVAT Rare variants Gene burden Rare variant aggregation Clarke et al. (2024)
G2PT Genotypes + phenotypes Risk prediction Genotype-to-phenotype Lee et al. (2025)

D.9 Category Definitions

DNA LM
DNA language models using self-supervised pretraining (masked language modeling or autoregressive) on genomic sequences. Produce embeddings useful for diverse downstream tasks.
PLM
Protein language models trained on protein sequences using similar self-supervised objectives. Capture evolutionary and structural information.
Seq→Func
Supervised sequence-to-function models predicting molecular phenotypes (chromatin accessibility, histone modifications, gene expression) directly from DNA sequence.
Splice
Specialized models for splice site recognition and splicing outcome prediction.
VEP
Variant effect predictors spanning multiple paradigms: conservation-based, integrative ensemble, and foundation model-based approaches.
Structure
Protein (and nucleic acid) structure prediction models.
GFM
Genomic foundation model: a broad term for models with reusable representations applicable across multiple downstream tasks.

D.10 Practical Considerations

D.10.1 Selecting a Model

When choosing a model for a specific application:

  1. Task alignment: Does the model’s pretraining objective match your task? MLM-pretrained models excel at classification; autoregressive models enable generation.

  2. Context requirements: Long-range regulatory effects require models with large context windows (Enformer, HyenaDNA, Evo). Local motif tasks work with shorter contexts.

  3. Computational resources: Parameter counts range from millions to billions. Smaller models (DNABERT, 110M) run on consumer GPUs; larger models (Evo 2, 40B) require substantial infrastructure. A rough memory estimate is sketched after this list.

  4. License restrictions: Some models restrict commercial use (CC BY-NC) or require academic affiliation. Verify license compatibility before deployment.

  5. Benchmark performance: Consult Chapter 11 for standardized comparisons on tasks relevant to your application.
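
As a back-of-the-envelope check on point 3, parameter count multiplied by bytes per parameter gives a floor on inference memory: 40B parameters in 16-bit precision is roughly 80 GB of weights before activations are counted. The sketch below encodes that arithmetic; the 1.2x overhead factor for activations and buffers is an illustrative assumption, not a measured value.

```python
# Minimal sketch: rough GPU memory needed just to hold model weights at
# inference time. The 1.2x activation/buffer overhead is an assumption.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(n_params: float, precision: str = "fp16",
                     overhead: float = 1.2) -> float:
    """Approximate memory (GB) to load a model's weights for inference."""
    return n_params * BYTES_PER_PARAM[precision] * overhead / 1e9

for name, n in [("DNABERT (110M)", 110e6), ("ESM-2 (650M)", 650e6),
                ("Evo (7B)", 7e9), ("Evo 2 (40B)", 40e9)]:
    print(f"{name:15s} ~{weight_memory_gb(n):6.1f} GB in fp16")
```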

D.10.2 Model Versioning

Foundation models are actively developed, with new versions often substantially outperforming predecessors. When citing or deploying models:

  • Specify exact version and checkpoint (e.g., “ESM-2 650M, checkpoint esm2_t33_650M_UR50D”)
  • Record a hash of the model weights for reproducibility (see the hashing sketch after this list)
  • Note training data version (UniRef versions change over time)
  • Document inference parameters (temperature, sampling strategy for generative models)
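
A SHA-256 digest of the downloaded checkpoint file is one simple way to implement the weights-hash recommendation above. The sketch below computes the digest with the standard library; the file name is a placeholder for whichever checkpoint you record.

```python
# Minimal sketch: record a SHA-256 digest of a downloaded checkpoint so results
# can later be tied to the exact weights used. File name is a placeholder.
import hashlib
from pathlib import Path

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(sha256_of("esm2_t33_650M_UR50D.pt"))
```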