Preface

Working on genomic foundation models means context-switching constantly: debugging data artifacts one week, reproducing a transformer-based variant effect predictor the next, and arguing about clinical cohorts the week after. The knowledge required is scattered across textbooks, methods papers, and tribal folklore: genomics on one shelf, deep learning on another, clinical deployment in someone else’s head entirely.

This book is my attempt to put those pieces in one place: to connect the mature, statistically grounded tradition of human genetics with the rapidly changing ecosystem of deep learning and foundation models, and to make that transition legible for people who live in one corner of the triangle and are trying to get oriented to the others.

I wrote it first for myself and my collaborators, as a way to organize wiki pages, markdown files, and half-finished slide decks into something coherent. Over time it became clear that turning those notes into a book might be useful to others navigating the same landscape.

Why I Wrote This Book

What I wanted, but could not find, was a conceptual throughline:

How do we get from reads to variants in a way that a deep model can trust?
How should we think about polygenic scores, fine-mapping, and functional assays in the era of foundation models?
When we say a model “understands” regulatory grammar or protein function, what does that actually mean?
And what does it take to move from a promising preprint to a tool that can support decisions about real patients?

This book is my best attempt at answering those questions in a way that is historically grounded, technically honest, and practically oriented.

How This Book Came Together

The structure of the book reflects the way these ideas evolved in my own work.

Early sections grew out of teaching and mentoring conversations: explaining next-generation sequencing, variant calling, and pre-deep-learning interpretation methods to new team members who were strong in statistics or ML but new to genomics (and vice versa).

The middle sections emerged from a series of “journal club + experiments” cycles, where we:

read papers on sequence-to-function CNNs, protein language models, and genomic transformers,
tried to reproduce key results or adapt them to new datasets,
and documented the pain points (data formats, training instabilities, evaluation pitfalls) that never quite fit into a methods section.

The later parts were shaped by collaborations around clinical prediction, variant interpretation pipelines, and larger multi-omic models. Many of the examples and caveats come directly from these projects: places where a model that looked excellent on paper behaved in surprising ways when exposed to real-world data, or where simple baselines outperformed much fancier architectures once confounding and distribution shift were handled correctly.

Because of that origin, the book has a particular bias: it is written from the perspective of someone who spends much of their time trying to get models to work in messy, high-stakes settings. You will see this in the emphasis on data quality, evaluation, and clinical translation.

How to Read This Book

Reader Pathways

This is not a genomics textbook, a complete review of every DNA or protein model, or a deep-learning-from-scratch course. Instead, it is meant to be:

a roadmap to the main kinds of data, models, and objectives that matter for genomic foundation models today
a bridge between classical statistical genetics and modern representation learning
a practical guide to the kinds of failure modes and design choices that matter in real applications

If you are…	Suggested path
New to genomics	Start with Part I, use Appendix A for deep learning background, then proceed sequentially
New to deep learning	Use Appendix A first, then Part II for sequence architectures, Part I as needed for genomic context
Experienced in both	Use the book as reference; Chapter Overviews help you find specific topics
Focused on clinical applications	Part I → Part VI → Part VII, with model chapters as needed

You do not need to read the book cover-to-cover in order.

If your background is in genomics or statistical genetics, you may want to skim the early deep-learning motivations and focus more on the sections that introduce convolutional models, transformers, and self-supervision, then move on to evaluation and applications.
If you come from machine learning, it may be more helpful to start with the genomic data and pre-deep-learning methods, then dive into the sequence-to-function and transformer-based chapters with an eye toward how the data and objectives differ from text or images.
If you are a clinician or translational researcher, you might care most about the reliability, confounding, and clinical deployment discussions, dipping back into the modeling parts as needed to interpret results or communicate with technical collaborators.

The book is organized into seven parts:

Part I introduces genomic data and pre-deep-learning interpretation methods, from sequencing and variant calling to early pathogenicity scores and polygenic models.
Part II focuses on sequence architectures, with emphasis on representations, convolutional networks, and attention mechanisms.
Part III covers learning and evaluation: pretraining objectives, transfer learning, adaptation strategies, benchmarks, and the confounding issues that pervade genomic evaluation.
Part IV turns to foundation model families, covering DNA language models, protein language models, regulatory models, and variant effect prediction.
Part V extends to systems-level modeling: RNA, single-cell, 3D genome, networks, and multi-omics integration.
Part VI examines responsible deployment: uncertainty, interpretability, causality, and ethical considerations.
Part VII looks at applications and frontiers: clinical risk prediction, rare disease, drug discovery, biological design, and emerging directions.

Within each part, the goal is not to catalogue every paper, but to highlight representative examples and the design principles they illustrate. References are there to give you starting points, not to serve as a comprehensive literature review.

What This Book Assumes (and What It Does Not)

The book assumes:

basic familiarity with probability and statistics (regression, hypothesis testing, effect sizes),
core genomics concepts (genes, variants, linkage disequilibrium, GWAS at a high level),
and some exposure to machine learning ideas (training versus test data, overfitting, loss functions).

It does not assume that you have implemented deep learning models yourself, or that you are fluent in every area. When a chapter leans heavily on a particular background (for example, causal inference or modern self-supervised learning), it will either provide a brief refresher or point you to an appendix or external resource.

If you are missing some of this background, that is fine. The intent is for you to be able to read actively: to pause, look up side topics, and then return to the main arc without feeling lost.

A Note on Scope and Opinions

Genomic foundation models are evolving quickly. Any snapshot is, by definition, incomplete and slightly out of date.

Rather than chasing every new architecture or benchmark, the book focuses on durable ideas:

how different data types fit together,
what kinds of objectives encourage useful representations,
how evaluation can fail in genomics-specific ways,
and where deep models complement (rather than replace) classical approaches.

Inevitably, there are judgment calls about which papers, methods, and perspectives to emphasize. Those choices reflect my own experiences and biases. They are not an official position of any institution I work with, and they will certainly differ from other reasonable views in the field.

You should treat the book as one opinionated map of the landscape, not the landscape itself.

Acknowledgements

This book exists because of many generous people who shared their time, ideas, and encouragement.

First, I owe a deep debt of gratitude to my colleagues in the Mayo Clinic GenAI and broader data science community. The day-to-day conversations, whiteboard sessions, and “what went wrong here?” post-mortems with this group shaped much of the perspective and many of the examples in the chapters.

I am especially grateful to the principal investigators and clinicians whose questions kept the focus on real patients and real decisions: Dr. Shant Ayanian, Dr. Elena Myasoedova, and Dr. Alexander Ryu.

To leadership at Mayo Clinic who supported the time, computing resources, and institutional patience needed for both the models and this book: Dr. Matthew Callstrom, Dr. Panos Korfiatis, and Matt Redlon.

To my data science and machine learning engineering colleagues, whose work and feedback directly shaped many of the workflows and case studies: Bridget Toomey, Carl Molnar, Zach Jensen, and Marc Blasi.

I am also grateful for the architectural creativity, hardware insight, and willingness to experiment from our collaborators at Cerebras: Natalia Vassilieva, Jason Wolfe, Omid Shams Solari, Vinay Pondenkandath, Bhargav Kanakiya, and Faisal Al-khateeb.

And to our collaborators at GoodFire, whose partnership helped push these ideas toward interpretable and deployable systems: Daniel Balsam, Nicholas Wang, Michael Pearce, and Mark Bissell.

I would also like to thank my former colleagues at LGC for foundational work and conversations around protein language models and large-scale representation learning: Prasad Siddavatam and Robin Butler.

Beyond these named groups, I owe a broader debt to the geneticists, molecular biologists, statisticians, clinicians, and engineers whose work this book draws on. The field moves forward because people share code, publish honest benchmarks, and insist that models be connected back to biologically meaningful questions. Thank you for setting that standard.

Finally, I am grateful to my wife, Alyssa, and our two kids for their patience with the evenings and weekends this book consumed. You gave me the space to finish it and the reasons to step away from it.

If this book helps you connect a new model to a real biological question, design a more robust evaluation, or communicate more clearly across disciplinary boundaries, then it will have done its job.

– Josh Meehl