Appendix E — Resources

This appendix collects educational resources, databases, and software tools for readers seeking to deepen their understanding of genomics, machine learning, and their intersection. Resources are organized by topic and include both foundational references and practical tools.

E.1 Textbooks

E.1.1 Genomics and Human Genetics

Thompson & Thompson Genetics and Genomics in Medicine (9th ed.)
Ronald Cohn, Stephen Scherer, Ada Hamosh. Clinically focused overview of human genetics and genomics for medicine. Excellent grounding in clinical genomics, variant interpretation, and genetic disease mechanisms.
Human Molecular Genetics (5th ed.)
Tom Strachan, Andrew Read. Higher-level molecular genetics text with strong coverage of mechanisms, technologies, and disease applications. More technical depth than Thompson & Thompson.
Molecular Biology of the Cell (7th ed.)
Bruce Alberts et al. Comprehensive cell biology text covering the molecular machinery underlying genomic processes. Essential background for understanding what genomic models are predicting.
Genomes 4
T.A. Brown. Focused specifically on genome organization, evolution, and analysis. Strong coverage of comparative genomics relevant to conservation-based methods.

E.1.2 Immunology

Janeway’s Immunobiology (10th ed.)
Kenneth M. Murphy, Casey Weaver, Leslie J. Berg. Standard comprehensive immunology textbook. Relevant for understanding immune-related genomic variation and applications like HLA typing.

E.1.3 Machine Learning and Deep Learning

Deep Learning
Ian Goodfellow, Yoshua Bengio, Aaron Courville. The comprehensive deep learning reference. Free online: https://www.deeplearningbook.org/
Dive into Deep Learning (D2L)
Aston Zhang et al. Interactive deep learning book with executable Jupyter notebooks and multi-framework code (PyTorch, TensorFlow, JAX). Free online: https://d2l.ai/
An Introduction to Statistical Learning (ISLR, 2nd ed.)
Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. Gentle introduction to statistical learning methods. R and Python editions available free online: https://www.statlearning.com/
The Elements of Statistical Learning (ESL)
Trevor Hastie, Robert Tibshirani, Jerome Friedman. More advanced, theory-heavy companion to ISLR. Free PDF: https://hastie.su.domains/ElemStatLearn/
Pattern Recognition and Machine Learning
Christopher Bishop. Classic ML text with strong probabilistic foundations. Relevant for understanding uncertainty quantification approaches.

E.1.4 Bioinformatics and Computational Biology

Bioinformatics: Sequence and Genome Analysis (2nd ed.)
David Mount. Foundational algorithms for sequence analysis including alignment, HMMs, and phylogenetics.
Biological Sequence Analysis
Richard Durbin, Sean Eddy, Anders Krogh, Graeme Mitchison. Essential reading for probabilistic approaches to biological sequences. The HMM chapters are particularly relevant.
Computational Genomics with R
Altuna Akalin. Practical computational genomics using R/Bioconductor. Free online: https://compgenomr.github.io/book/

E.1.5 Foundation Model Reference Library

The following Springer textbooks provide systematic coverage of topics essential for genomic foundation model research. Organized by domain, these represent authoritative references for deep learning architectures, clinical prediction methodology, statistical genetics, and interpretability.

E.1.5.1 Deep Learning and Foundation Models

Foundation Models for Natural Language Processing: Pre-trained Language Models Integrating Media
Gerhard Paass, Sven Giesselbach. Springer, 2023. Open Access (CC-BY 4.0). Comprehensive coverage of transformer architectures, pretraining objectives, and transfer learning. Essential for understanding the architectural foundations underlying genomic language models.
Multivariate Statistical Machine Learning Methods for Genomic Prediction
Osval Antonio Montesinos López, Abelardo Montesinos López, José Crossa. Springer, 2022. Open Access (CC-BY 4.0). Covers statistical and deep learning methods for genomic prediction including G-BLUP, Bayesian methods, kernel methods, and neural network implementations with code examples.
Machine Learning and Systems Biology in Genomics and Health
Shailza Singh (ed.). Springer, 2022. Applied machine learning for disease prediction, gene regulatory networks, and cardiovascular genomics.

E.1.5.2 Clinical Prediction and Validation

Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating (2nd ed.)
Ewout W. Steyerberg. Springer, 2019. The gold standard reference for clinical prediction model development. Covers discrimination, calibration, validation strategies, net benefit analysis, and TRIPOD reporting guidelines. Essential reading for any clinical deployment work.

E.1.5.3 Interpretability and Explainable AI

Interpretability in Deep Learning
Ayush Somani, Alexander Horsch, Dilip K. Prasad. Springer, 2023. Comprehensive taxonomy of interpretability methods including the 5W1H framework, saliency methods, attention visualization, and domain-specific applications to CNNs, autoencoders, and graph neural networks.
Explainable AI: Interpreting, Explaining and Visualizing Deep Learning (LNAI 11700)
Wojciech Samek, Grégoire Montavon, Andrea Vedaldi, Lars Kai Hansen, Klaus-Robert Müller (eds.). Springer, 2019. Multi-author volume covering feature visualization, layer-wise relevance propagation, and methods for evaluating explanation quality.
xxAI - Beyond Explainable AI (LNAI 13200)
Andreas Holzinger, Randy Goebel, Ruth Fong, Taesup Moon, Klaus-Robert Müller, Wojciech Samek (eds.). Springer, 2022. Advances beyond basic XAI including concept-based explanations, counterfactual analysis, and causal approaches to interpretation.

E.1.5.4 Statistical Genetics

The Fundamentals of Modern Statistical Genetics
Nan M. Laird, Christoph Lange. Springer, 2011. Foundational text covering Mendelian genetics, linkage and association analysis, population structure, and gene-environment interactions. Essential background for understanding confounding in genomic prediction.
Heterogeneity in Statistical Genetics: How to Assess, Address, and Account for Mixtures in Association Studies
Derek Gordon, Stephen J. Finch, Wonkuk Kim. Springer, 2020. Critical reference for understanding population stratification, locus heterogeneity, and statistical methods to address ancestry-related confounding.
Applied Statistical Genetics with R: For Population-based Association Studies
Andrea S. Foulkes. Springer, 2009. Practical R implementations for GWAS analysis, multiple testing correction, haplotype analysis, and tree-based methods for genetic data.
Statistical Genetics of Quantitative Traits: Linkage, Maps, and QTL
Rongling Wu, Chang-Xing Ma, George Casella. Springer, 2007. Classical foundations of QTL mapping and statistical models for quantitative traits.

E.1.5.5 Systems Biology and Networks

Networks in Systems Biology: Applications for Disease Modeling (Computational Biology 32)
Fabricio Alves Barbosa da Silva, Nicolas Carels, Marcelo Trindade dos Santos, Francisco Jose Pereira Lopes (eds.). Springer, 2020. Covers protein-protein interaction networks, gene regulatory networks, network propagation algorithms, and disease module identification methods.
Handbook of Statistical Bioinformatics (2nd ed.)
Henry Horng-Shing Lu, Bernhard Schölkopf, Martin T. Wells, Hongyu Zhao (eds.). Springer, 2022. Comprehensive handbook covering single-cell analysis methods, network inference, causal discovery, and deep learning for omics data.
Methodologies of Multi-Omics Data Integration and Data Mining (Translational Bioinformatics 19)
Kang Ning (ed.). Springer, 2023. Methods for integrating multiple data modalities including feature-level and decision-level fusion approaches.

E.1.5.6 Causal Inference

Statistical Causal Discovery: LiNGAM Approach (SpringerBriefs in Statistics)
Shohei Shimizu. Springer, 2022. Specialized treatment of non-Gaussian causal discovery methods with identifiability conditions and applications to observational data.

E.2 Online Courses

E.2.1 Machine Learning and Deep Learning

Stanford CS229: Machine Learning
Andrew Ng’s foundational ML course. Lecture videos and materials freely available. https://cs229.stanford.edu/
Stanford CS231n: CNNs for Visual Recognition
Deep dive into convolutional networks with strong foundations applicable to sequence models. http://cs231n.stanford.edu/
Stanford CS224n: NLP with Deep Learning
Essential for understanding transformer architectures, attention mechanisms, and language model pretraining. http://web.stanford.edu/class/cs224n/
fast.ai Practical Deep Learning
Top-down practical approach to deep learning. Free course with notebooks: https://course.fast.ai/
DeepMind x UCL Deep Learning Lecture Series
Excellent coverage of modern deep learning topics including transformers and self-supervised learning. YouTube playlist freely available.

E.2.2 Genomics and Bioinformatics

MIT 7.91J: Foundations of Computational and Systems Biology
Comprehensive computational biology course covering sequence analysis, structure, and networks. MIT OpenCourseWare: https://ocw.mit.edu/courses/7-91j-foundations-of-computational-and-systems-biology-spring-2014/
Coursera: Genomic Data Science Specialization
Johns Hopkins series covering genomic technologies, Python/R for genomics, and statistical analysis. https://www.coursera.org/specializations/genomic-data-science
EMBL-EBI Training
Free online courses on genomics databases, tools, and analysis methods. https://www.ebi.ac.uk/training/
Rosalind
Problem-based bioinformatics learning platform. Excellent for building algorithmic intuition. https://rosalind.info/

E.2.3 Applied Genomic ML

Coursera: AI for Medicine Specialization
DeepLearning.AI course covering ML applications in medical imaging and clinical data. https://www.coursera.org/specializations/ai-for-medicine
ML4Bio Summer School
Annual workshop on machine learning for biology. Materials often available online.

E.3 Genomic Databases

E.3.1 Variant and Population Databases

| Database | Description | URL |
| --- | --- | --- |
| ClinVar | Clinical variant interpretations | https://www.ncbi.nlm.nih.gov/clinvar/ |
| gnomAD | Population allele frequencies (730K+ exomes/genomes) | https://gnomad.broadinstitute.org/ |
| dbSNP | Catalog of genetic variation | https://www.ncbi.nlm.nih.gov/snp/ |
| ClinGen | Clinical genome resource, gene-disease validity | https://clinicalgenome.org/ |
| OMIM | Mendelian inheritance in man | https://omim.org/ |
| HGMD | Human gene mutation database (subscription) | http://www.hgmd.cf.ac.uk/ |
| LOVD | Locus-specific variant databases | https://www.lovd.nl/ |

E.3.2 Functional Annotation Databases

| Database | Description | URL |
| --- | --- | --- |
| ENCODE | Encyclopedia of DNA elements | https://www.encodeproject.org/ |
| GTEx | Tissue-specific gene expression | https://gtexportal.org/ |
| Roadmap Epigenomics | Epigenome maps across cell types | https://egg2.wustl.edu/roadmap/web_portal/ |
| FANTOM5 | Functional annotation of mammalian genomes | https://fantom.gsc.riken.jp/5/ |
| 4D Nucleome | 3D genome organization | https://www.4dnucleome.org/ |

E.3.3 Protein Databases

| Database | Description | URL |
| --- | --- | --- |
| UniProt | Protein sequences and annotations | https://www.uniprot.org/ |
| AlphaFold DB | Predicted protein structures | https://alphafold.ebi.ac.uk/ |
| PDB | Experimental protein structures | https://www.rcsb.org/ |
| InterPro | Protein families and domains | https://www.ebi.ac.uk/interpro/ |
| Pfam | Protein family database | https://www.ebi.ac.uk/interpro/entry/pfam/ |

E.3.4 Gene and Pathway Databases

| Database | Description | URL |
| --- | --- | --- |
| Ensembl | Genome browser and annotation | https://www.ensembl.org/ |
| UCSC Genome Browser | Genome visualization and tracks | https://genome.ucsc.edu/ |
| KEGG | Pathway and molecular interaction maps | https://www.kegg.jp/ |
| Reactome | Curated pathway database | https://reactome.org/ |
| Gene Ontology | Functional annotation ontology | http://geneontology.org/ |
| STRING | Protein-protein interactions | https://string-db.org/ |

E.3.5 Single-Cell Databases

| Database | Description | URL |
| --- | --- | --- |
| Human Cell Atlas | Reference maps of human cells | https://www.humancellatlas.org/ |
| CellxGene | Single-cell data exploration | https://cellxgene.cziscience.com/ |
| Single Cell Portal | Broad Institute scRNA-seq repository | https://singlecell.broadinstitute.org/ |
| Tabula Sapiens | Multi-organ human cell atlas | https://tabula-sapiens-portal.ds.czbiohub.org/ |

E.4 Software Tools

E.4.1 Sequence Analysis

| Tool | Description | URL |
| --- | --- | --- |
| BWA | Burrows-Wheeler aligner for short reads | https://github.com/lh3/bwa |
| Minimap2 | Long-read alignment | https://github.com/lh3/minimap2 |
| STAR | RNA-seq aligner | https://github.com/alexdobin/STAR |
| SAMtools | SAM/BAM manipulation | http://www.htslib.org/ |
| BCFtools | Variant calling and manipulation | http://www.htslib.org/ |
| GATK | Genome analysis toolkit | https://gatk.broadinstitute.org/ |
| DeepVariant | Deep learning variant caller | https://github.com/google/deepvariant |

E.4.2 Variant Annotation

| Tool | Description | URL |
| --- | --- | --- |
| VEP | Ensembl variant effect predictor | https://www.ensembl.org/vep |
| SnpEff | Variant annotation and effect prediction | https://pcingola.github.io/SnpEff/ |
| ANNOVAR | Functional annotation | https://annovar.openbioinformatics.org/ |
| InterVar | ACMG/AMP interpretation | https://wintervar.wglab.org/ |

E.4.3 Deep Learning Frameworks

| Framework | Description | URL |
| --- | --- | --- |
| PyTorch | Primary framework for genomic DL | https://pytorch.org/ |
| HuggingFace Transformers | Pretrained model hub and tools | https://huggingface.co/ |
| JAX/Flax | High-performance ML (used by DeepMind) | https://github.com/google/jax |
| PyTorch Lightning | Training boilerplate reduction | https://lightning.ai/ |

E.4.4 Genomic ML Libraries

| Library | Description | URL |
| --- | --- | --- |
| Kipoi | Model zoo for genomics | https://kipoi.org/ |
| Selene | Deep learning for sequences | https://github.com/FunctionLab/selene |
| Enformer (TensorFlow) | Official Enformer implementation | https://github.com/deepmind/deepmind-research/tree/master/enformer |
| Pysam | Python interface for SAM/BAM | https://github.com/pysam-developers/pysam |
| Biopython | Biological computation in Python | https://biopython.org/ |
| Scanpy | Single-cell analysis | https://scanpy.readthedocs.io/ |

E.4.5 Workflow Management

| Tool | Description | URL |
| --- | --- | --- |
| Snakemake | Python-based workflow manager | https://snakemake.readthedocs.io/ |
| Nextflow | Data-driven pipelines | https://www.nextflow.io/ |
| WDL/Cromwell | Workflow description language | https://cromwell.readthedocs.io/ |

E.5 Benchmarks and Datasets

E.5.1 Genomic Benchmarks

| Benchmark | Domain | URL |
| --- | --- | --- |
| Nucleotide Transformer Benchmarks | DNA LM evaluation | https://github.com/instadeepai/nucleotide-transformer |
| TAPE | Protein tasks | https://github.com/songlab-cal/tape |
| ProteinGym | Protein fitness prediction | https://proteingym.org/ |
| Genomic Benchmarks | DNA classification tasks | https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks |

E.5.2 Variant Datasets

| Dataset | Description | URL |
| --- | --- | --- |
| ClinVar | Clinical variant annotations | https://www.ncbi.nlm.nih.gov/clinvar/ |
| DMS datasets | Deep mutational scanning | Various; see ProteinGym |
| CADD training data | Simulated and observed variants | https://cadd.gs.washington.edu/ |

E.6 Community and Forums

E.6.1 Discussion Forums

E.6.2 Preprint Servers

E.6.3 Conferences

| Conference | Focus | Typical Timing |
| --- | --- | --- |
| ISMB | Computational biology | July |
| RECOMB | Computational molecular biology | April-May |
| ASHG | Human genetics | October-November |
| NeurIPS | Machine learning | December |
| ICML | Machine learning | July |
| MLCB | ML in computational biology | December (NeurIPS workshop) |

E.6.4 Key Research Groups

Selected groups active at the intersection of genomics and machine learning:

  • Kundaje Lab (Stanford): Regulatory genomics, interpretability
  • Kelley Lab (Calico): Sequence-to-function models
  • Marks Lab (Harvard): Evolutionary models, protein fitness
  • Troyanskaya Lab (Princeton): Functional genomics, Sei
  • Regev Lab (Genentech/Broad): Single-cell genomics
  • ESM Team (Meta AI): Protein language models
  • DeepMind: AlphaFold, Enformer, AlphaMissense

E.7 Keeping Current

The field moves rapidly. Strategies for staying current:

  1. Preprint alerts: Set bioRxiv/arXiv alerts for keywords like “genomic foundation model,” “variant effect prediction,” “DNA language model”

  2. Twitter/X: Follow active researchers and labs; the ML4Bio community is particularly active

  3. Conference proceedings: ISMB, RECOMB, and NeurIPS MLCB workshops publish cutting-edge work

  4. Model hubs: Monitor HuggingFace for new genomic model releases

  5. Database updates: ClinVar and gnomAD release notes track data growth and methodology changes

  6. Review articles: Annual reviews in Nature Reviews Genetics, Genome Biology, and Nature Methods provide consolidated perspectives
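
For readers who prefer a script to a saved search, the keyword alerts in item 1 can be approximated by polling the public arXiv API. The sketch below only builds the query URL and makes no network request. The helper name `build_arxiv_query_url`, the `KEYWORDS` list, and the default of 20 results are illustrative assumptions, not part of any official client. bioRxiv is not covered here.

```python
from urllib.parse import urlencode

# Public arXiv API endpoint (returns an Atom feed).
ARXIV_API = "http://export.arxiv.org/api/query"

# Example phrases taken from item 1 above; adjust to taste.
KEYWORDS = [
    "genomic foundation model",
    "variant effect prediction",
    "DNA language model",
]


def build_arxiv_query_url(keywords, max_results=20):
    """Return an arXiv API URL matching any of the given phrases.

    Hypothetical helper for illustration: results are requested
    newest-first so that repeated polling surfaces recent submissions.
    """
    if not keywords:
        raise ValueError("at least one keyword is required")
    # Quote each phrase so it is matched as a phrase in all fields,
    # then combine the phrases with OR.
    search_query = " OR ".join(f'all:"{kw}"' for kw in keywords)
    params = {
        "search_query": search_query,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_arxiv_query_url(KEYWORDS))
```

The resulting URL can be fetched on a schedule and the Atom feed parsed with any feed reader or XML library. Polling infrequently (for example, once a day) is a reasonable courtesy to the service.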