2 The Genomic Data Landscape
2.1 Why Genomic Data Resources Matter
Once we can sequence genomes and call variants, we immediately face a new problem: interpretation. No single dataset is sufficient to decide whether a variant is benign, pathogenic, or relevant to a trait. Instead, we rely on a mosaic of complementary resources: reference genomes and gene annotations that define coordinates and consequences, population variation catalogs that reveal what survives in healthy individuals, cohort and biobank datasets that link variation to phenotypes, functional genomics atlases that map biochemical activity, and clinical databases that aggregate expert interpretations.
This chapter surveys these foundational resources. Later chapters draw from them repeatedly—either directly as model inputs or indirectly as labels, benchmarks, and priors. We begin with general genomic infrastructure (references, variation catalogs, cohorts) and then turn to functional and expression resources (ENCODE, GTEx-like datasets) that provide the training labels for sequence-to-function models.
2.2 Reference Genomes and Gene Annotations
Every genomic analysis begins with a coordinate system. Reference genomes define the scaffold onto which sequencing reads are mapped, while gene annotations overlay that scaffold with biological meaning, specifying where transcripts begin and end, which regions encode protein, and how exons are spliced together. These resources are so foundational that their assumptions often become invisible: a variant’s consequence, a gene’s constraint score, and a model’s training labels all depend on choices embedded in the reference assembly and annotation release. Understanding these dependencies is essential for interpreting results, recognizing systematic biases, and anticipating how analyses will generalize across datasets built on different genomic foundations.
2.2.1 Reference Assemblies
Most modern pipelines align reads to a small number of reference assemblies, predominantly GRCh38 or the newer T2T-CHM13 (Nurk et al. 2022). A reference genome is not simply a consensus sequence; it encodes a series of consequential decisions about how to represent duplications, alternate haplotypes, and unresolved gaps, all annotated with coordinates that downstream tools assume are stable.
The choice of reference shapes everything that follows. It determines which regions are “mappable” by short reads, how structural variants are represented, and how comparable results will be across cohorts and over time. Graph-based and pangenome references relax the assumption of a single linear reference, but the majority of datasets used in this book, and the models trained on them, are still built on GRCh37 or GRCh38 (Liao et al. 2023).
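As a concrete illustration, the sketch below shows why a genomic coordinate only has meaning relative to a specific assembly. The use of pyfaidx, the file name, and the coordinates are illustrative assumptions, not requirements of any pipeline discussed here.

```python
# A minimal sketch, assuming a locally indexed FASTA of the assembly; pyfaidx and
# the file name are illustrative choices, not requirements of any pipeline above.
from pyfaidx import Fasta

grch38 = Fasta("GRCh38.primary_assembly.fa")

# The triple (chromosome, start, end) names a sequence only relative to one assembly:
# the same coordinates refer to different bases in GRCh37, GRCh38, or T2T-CHM13.
window = grch38["chr17"][43_044_000:43_044_050]  # 0-based, end-exclusive slicing
print(window.seq)
```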
2.2.2 Gene Models
Gene annotation databases such as GENCODE and RefSeq define the exon–intron structures, canonical and alternative transcripts, start and stop codons, and untranslated regions that allow us to interpret variants in biological context (Frankish et al. 2019; O’Leary et al. 2016). These annotations are critical for distinguishing coding from non-coding variants, identifying splice-disrupting mutations, and mapping functional genomics signals to genes.
The MANE (Matched Annotation from NCBI and EMBL-EBI) collaboration designates a single MANE Select transcript per protein-coding gene that is identical between GENCODE and RefSeq, simplifying clinical interpretation but further privileging a single isoform over biological complexity (Morales et al. 2022).
Many downstream resources, from variant effect predictors to polygenic score pipelines, implicitly assume that gene models are correct and complete. In practice, new isoforms continue to be discovered, alternative splicing remains incompletely cataloged, and cell-type-specific transcripts may be missing from bulk-derived annotations. These gaps propagate through every tool built on them.
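To make this dependence concrete, the sketch below checks whether a variant position falls inside any annotated coding (CDS) interval of a GENCODE-style GTF; the file name and the simplifications are illustrative, and changing the annotation release can change the answer.

```python
# A minimal sketch (GTF path and release are hypothetical) of how a variant's
# "coding vs non-coding" status is a function of the annotation release used.
import gzip
from collections import defaultdict

def load_cds_intervals(gtf_path):
    """Collect CDS intervals per chromosome from a GENCODE-style GTF (1-based, inclusive)."""
    cds = defaultdict(list)
    with gzip.open(gtf_path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            chrom, _source, feature, start, end, *_rest = line.rstrip("\n").split("\t")
            if feature == "CDS":
                cds[chrom].append((int(start), int(end)))
    return cds

def overlaps_cds(cds, chrom, pos):
    """True if a 1-based variant position falls inside any annotated CDS interval."""
    return any(start <= pos <= end for start, end in cds.get(chrom, []))

# cds = load_cds_intervals("gencode.v44.annotation.gtf.gz")   # hypothetical release
# overlaps_cds(cds, "chr7", 117_559_590)                      # answer depends on the release
```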
2.3 Population Variant Catalogs and Allele Frequencies
Population variant catalogs provide the empirical foundation for distinguishing pathogenic mutations from benign polymorphisms. Allele frequency, the proportion of chromosomes in a reference population carrying a given variant, serves as a powerful prior: variants observed at appreciable frequency in healthy individuals are unlikely to cause severe early-onset disease, while ultra-rare variants demand closer scrutiny. Beyond simple filtering, allele frequencies inform statistical frameworks for case-control association, provide training signal for deleteriousness predictors, and enable imputation of ungenotyped variants through haplotype reference panels. The catalogs described below have progressively expanded in sample size, ancestral diversity, and annotation depth, transforming variant interpretation from an ad hoc exercise into a quantitative discipline.
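A toy version of the frequency prior is sketched below; the threshold and the variant keys are illustrative only, not clinical guidance.

```python
# A toy allele-frequency filter for a hypothetical severe, early-onset dominant
# disorder. The 1e-4 threshold and the variant keys are illustrative only.
def passes_frequency_filter(allele_frequency, max_credible_af=1e-4):
    """Keep variants absent from the catalog (None) or rarer than the threshold."""
    return allele_frequency is None or allele_frequency < max_credible_af

# Highest observed population AF per candidate variant (None = not in the catalog).
candidate_afs = {"var_1": 3.2e-6, "var_2": 1.2e-2, "var_3": None}
kept = {v: af for v, af in candidate_afs.items() if passes_frequency_filter(af)}
# kept -> {"var_1": 3.2e-06, "var_3": None}
```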
2.3.1 dbSNP and the Variant Universe
Historically, dbSNP aggregated known single nucleotide polymorphisms and short indels into a single catalog, providing stable identifiers (rsIDs) that serve as common currency across tools and publications, basic frequency information where available, and a convenient handle for linking to other resources (Sherry et al. 2001). Modern whole-exome and whole-genome sequencing cohorts routinely discover millions of previously unseen variants, but dbSNP identifiers remain the standard way to refer to known polymorphisms.
2.3.2 1000 Genomes and Early Reference Panels
The 1000 Genomes Project provided one of the first widely used multi-population reference panels, enabling imputation and linkage-disequilibrium-based analyses on genotyping arrays (Auton et al. 2015). Its samples continue to serve as benchmarks for variant calling performance, and its haplotype structure underlies many imputation servers and downstream analyses (Yun et al. 2021).
2.3.3 The Genome Aggregation Database (gnomAD)
The Genome Aggregation Database aggregates exome and genome data from a wide array of cohorts into harmonized allele frequency resources (Karczewski et al. 2020). gnomAD provides high-resolution allele frequencies for SNVs and indels across diverse ancestries, constraint metrics such as pLI and LOEUF that summarize a gene’s intolerance to loss-of-function variation, and per-variant annotations flagging poor-quality regions, low-complexity sequence, and other caveats.
These resources are indispensable for filtering common variants in Mendelian disease diagnostics, distinguishing extremely rare variants from recurrent ones, and providing population genetics priors used by variant effect predictors and deleteriousness scores like CADD (Rentzsch et al. 2019; Schubach et al. 2024). The constraint metrics, in particular, have become standard features in machine learning models that prioritize disease-relevant genes and variants.
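As a sketch of how constraint metrics enter machine learning pipelines, the snippet below loads a gnomAD-style gene constraint table and derives a simple binary feature. The file name and column names follow the gnomAD v2 release and may differ elsewhere, and the LOEUF < 0.35 cutoff is a common heuristic rather than a fixed standard.

```python
# A sketch of turning gene-level constraint into model features; column names
# (gene, pLI, oe_lof_upper, i.e. LOEUF) follow the gnomAD v2 constraint table
# and may differ in other releases.
import pandas as pd

constraint = pd.read_table("gnomad.v2.1.1.lof_metrics.by_gene.txt.gz")  # path is illustrative

features = constraint[["gene", "pLI", "oe_lof_upper"]].rename(
    columns={"oe_lof_upper": "LOEUF"}
)
# LOEUF < 0.35 is a commonly used heuristic cutoff for strong LoF intolerance.
features["lof_constrained"] = features["LOEUF"] < 0.35
```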
2.4 Cohorts, Biobanks, and GWAS Summary Data
Large-scale biobanks and population cohorts have transformed human genetics from a discipline reliant on family studies and candidate gene approaches into one powered by population-level statistical inference. These resources link genomic data to electronic health records, lifestyle questionnaires, imaging, and longitudinal outcomes, enabling discovery of genetic associations across thousands of traits simultaneously. However, the composition of these cohorts carries consequences: the overrepresentation of European-ancestry individuals in most major biobanks creates systematic gaps in variant discovery, effect size estimation, and polygenic score portability that propagate through downstream analyses. These ancestry biases, and strategies for addressing them, are discussed in detail in Chapter 16.
2.4.1 Large Population Cohorts
Modern human genetics relies on large cohorts with genome-wide variation and rich phenotyping. UK Biobank, with approximately 500,000 participants and deep phenotyping, has become the dominant resource for methods development and benchmarking (Bycroft et al. 2018). FinnGen leverages Finland’s population history and unified healthcare records (Kurki et al. 2023). The All of Us Research Program prioritizes diversity, aiming to enroll one million participants with deliberate oversampling of historically underrepresented groups (All of Us 2019). Additional resources include the Million Veteran Program, Mexican Biobank, BioBank Japan, China Kadoorie Biobank, and emerging African genomics initiatives such as H3Africa (Sirugo, Williams, and Tishkoff 2019). Together, these efforts enable genome-wide association studies for thousands of traits, development and evaluation of polygenic scores, and fine-mapping of causal variants and genes (Marees et al. 2018; Mountjoy et al. 2021).
While this book focuses on models rather than specific cohorts, it is important to recognize that most GWAS and polygenic score methods in Chapter 3 assume data from either array genotyping with imputation or whole-exome/whole-genome sequencing with joint calling, as in DeepVariant/GLnexus-style pipelines (Yun et al. 2021). The ascertainment, quality control, and population composition of these cohorts shape what signals can be detected and how well models generalize.
2.4.2 GWAS Summary Statistics
Beyond individual-level data, many resources distribute GWAS summary statistics: per-variant effect sizes and p-values aggregated across cohorts. The GWAS Catalog compiles published results across traits (Sollis et al. 2023), while the PGS Catalog provides curated polygenic score weights and metadata for reproducibility (Lambert et al. 2021). Frameworks like Open Targets Genetics integrate fine-mapped signals and candidate causal genes across loci (Mountjoy et al. 2021).
These summary data are the raw material for many polygenic score methods (Chapter 3) and statistical fine-mapping algorithms. They enable meta-analysis across cohorts, transfer of genetic findings to new populations, and integration with functional annotations to prioritize causal variants.
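In its simplest form, applying a set of PGS Catalog-style weights reduces to a dosage-weighted sum, as in the illustrative sketch below; real pipelines also harmonize effect alleles, strands, and genome builds.

```python
# A minimal polygenic score: sum of effect-allele dosage times effect weight over
# the variants shared between the score file and the genotypes. Values are illustrative.
weights = {"rs123": 0.031, "rs456": -0.012, "rs789": 0.054}   # per-variant effect sizes
dosages = {"rs123": 2.0, "rs456": 1.0, "rs789": 0.0}          # effect-allele dosages in [0, 2]

def polygenic_score(weights, dosages):
    shared = weights.keys() & dosages.keys()
    return sum(dosages[rsid] * weights[rsid] for rsid in shared)

score = polygenic_score(weights, dosages)   # 2*0.031 + 1*(-0.012) + 0*0.054 = 0.050
```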
2.5 Functional Genomics and Regulatory Landscapes
The vast majority of the human genome lies outside protein-coding exons, yet this non-coding space harbors the regulatory logic that governs when, where, and how much each gene is expressed. Functional genomics assays provide the experimental means to map this regulatory landscape: identifying transcription factor binding sites, nucleosome positioning, chromatin accessibility, histone modifications, and three-dimensional genome organization across cell types and conditions. For the purposes of this book, these datasets serve a dual role. First, they supply the biological vocabulary for interpreting non-coding variants, linking sequence changes to potential regulatory consequences. Second, and more directly, they provide the training labels for sequence-to-function deep learning models. When a model learns to predict chromatin accessibility or histone marks from DNA sequence alone, it is learning a compressed representation of the regulatory code implicit in thousands of functional genomics experiments.
2.5.2 The Cistrome Data Browser
While ENCODE and Roadmap produced authoritative datasets for their chosen cell types and factors, they represent only a fraction of publicly available functional genomics experiments. The Cistrome Data Browser addresses this gap by aggregating thousands of human and mouse ChIP-seq and chromatin accessibility datasets from ENCODE, Roadmap, GEO, and individual publications into a reprocessed, searchable repository (Zheng et al. 2019). All datasets pass through a uniform quality control and processing pipeline, enabling comparisons across experiments that were originally generated by different labs with different protocols.
Cistrome provides uniform peak calls and signal tracks, metadata for cell type, factor, and experimental conditions, and tools for motif analysis and regulatory element annotation. The tradeoff is heterogeneity: while the reprocessing harmonizes computational steps, the underlying experiments vary in sample preparation, sequencing depth, and experimental design. Cistrome thus expands coverage at the cost of the tight experimental control found in the primary consortia.
2.5.3 From Assays to Training Labels
Sequence-to-function models transform these functional genomics resources into supervised learning problems. Models like DeepSEA (see Chapter 5) draw training labels from ENCODE, Roadmap, and Cistrome-style datasets collectively: each genomic window is associated with binary or quantitative signals indicating transcription factor binding, histone modifications, or chromatin accessibility across many assays and cell types (Zhou and Troyanskaya 2015; Zhou et al. 2018).
The quality, coverage, and biases of these labels directly constrain what models can learn. Cell types absent from the training compendium cannot be predicted reliably. Factors with few high-quality ChIP-seq experiments will have noisier labels. And systematic differences between assay types (peak-based binary labels versus quantitative signal tracks) shape whether models learn to predict occupancy, accessibility, or something in between. These considerations become central when we examine model architectures and training strategies in Chapter 5.
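The sketch below illustrates the general construction, not DeepSEA’s actual preprocessing: each fixed-length window receives a one-hot sequence encoding and a binary label per assay indicating whether any peak overlaps the window.

```python
# A schematic of peak-to-label conversion (not any specific model's preprocessing):
# one-hot encode the window's DNA sequence and assign one binary label per assay track.
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    """(L, 4) one-hot encoding; non-ACGT characters become all-zero rows."""
    x = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        if base in BASES:
            x[i, BASES[base]] = 1.0
    return x

def window_labels(chrom, start, end, peak_sets):
    """Binary label per assay: 1 if any peak in that assay overlaps the window."""
    return np.array(
        [any(c == chrom and s < end and e > start for c, s, e in peaks)
         for peaks in peak_sets],
        dtype=np.float32,
    )

# peak_sets: one list of (chrom, start, end) peaks per assay/cell-type track (illustrative).
peak_sets = [[("chr1", 1000, 1400)], [("chr1", 5000, 5200)]]
y = window_labels("chr1", 900, 1100, peak_sets)   # -> [1., 0.]
x = one_hot("ACGTN" * 40)                         # a 200-bp toy window
```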
2.6 Expression and eQTL Resources
Expression datasets link sequence variation to transcriptional consequences, providing a bridge between regulatory elements and gene-level effects. While functional genomics assays reveal where transcription factors bind and which chromatin regions are accessible, expression data answer the downstream question: does this regulatory activity actually change how much RNA a gene produces? Expression quantitative trait loci (eQTLs) formalize this relationship statistically, identifying genetic variants associated with changes in transcript abundance. For variant interpretation and genomic prediction, eQTLs offer mechanistic hypotheses connecting non-coding variants to specific genes and tissues. For model training, expression data provide quantitative labels that integrate across the many regulatory inputs converging on a single promoter. The resources below range from population-scale bulk tissue atlases to emerging single-cell datasets that resolve expression variation at cellular resolution.
2.6.1 Bulk Expression Atlases
Projects like the Genotype-Tissue Expression (GTEx) consortium provide RNA-seq expression profiles across dozens of tissues, eQTL maps linking variants to gene expression changes in cis, as well as splicing QTLs and other molecular QTLs (GTEx Consortium 2020). With matched genotypes and expression data from nearly 1,000 post-mortem donors across 54 tissues, GTEx established foundational insights: most genes harbor tissue-specific eQTLs, regulatory variants typically act in cis over distances of hundreds of kilobases, and expression variation explains a meaningful fraction of complex trait heritability.
Even when not explicitly cited, GTEx-like resources underpin expression prediction models such as PrediXcan and TWAS frameworks, colocalization analyses that ask whether a GWAS signal and an eQTL share a causal variant, and expression-based prioritization of candidate genes at trait-associated loci (Gamazon et al. 2015). The GTEx design has limitations: post-mortem collection introduces agonal stress artifacts, sample sizes per tissue vary considerably, and some disease-relevant tissues (such as pancreatic islets or specific brain regions) remain undersampled. Complementary resources like the eQTLGen Consortium aggregate eQTL results from blood across larger sample sizes, trading tissue diversity for statistical power (Võsa et al. 2021).
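At its core, a cis-eQTL test is a regression of expression on genotype dosage; the toy example below uses simulated data and omits the covariates (ancestry principal components, expression factors) that real pipelines include.

```python
# A toy cis-eQTL test on simulated data: regress a gene's expression on genotype
# dosage at a nearby variant and test the slope. Real analyses add covariates.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
dosage = rng.integers(0, 3, size=500).astype(float)        # 0/1/2 copies of the alt allele
expression = 0.4 * dosage + rng.normal(size=500)           # simulated additive effect

fit = stats.linregress(dosage, expression)
print(f"beta={fit.slope:.3f}, p={fit.pvalue:.2e}")          # slope is the per-allele eQTL effect
```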
2.6.2 Single-Cell and Context-Specific Expression
Bulk RNA-seq averages expression across all cells in a tissue sample, obscuring the cell-type-specific programs that often mediate disease biology. Single-cell RNA-seq resolves this heterogeneity, identifying expression signatures for individual cell types, rare populations, and transitional states. Large-scale efforts like the Human Cell Atlas, Tabula Sapiens, and disease-focused single-cell consortia are building reference atlases that catalog cell types across organs and developmental stages (Regev et al. 2017; Tabula Sapiens Consortium 2022).
For variant interpretation, single-cell data enable cell-type-specific eQTL mapping, revealing that a variant may influence expression in one cell type but not others within the same tissue. Spatial transcriptomics adds anatomical context, preserving tissue architecture while measuring gene expression. These technologies introduce computational challenges: sparsity from dropout, batch effects across samples and technologies, and the sheer scale of datasets with millions of cells. In this book, single-cell and spatial resources appear primarily in later chapters on multi-omics integration and systems-level models, but they represent the direction in which expression genetics is moving, promising to connect genetic variation to cellular phenotypes at unprecedented resolution.
2.7 Variant Interpretation Databases and Clinical Labels
2.7.2 CADD and Annotation-Centric Resources
The CADD framework (Chapter 4) integrates many of the resources surveyed in this chapter—gnomAD frequencies, conservation scores, regulatory tracks, and gene annotations—into a genome-wide deleteriousness score (Rentzsch et al. 2019; Schubach et al. 2024). It illustrates how population frequencies, functional genomics signals, and clinical variant databases can be combined into a single per-variant summary that serves as both a practical tool and a methodological template for later deep learning approaches.
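The sketch below captures this integration idea in schematic form only; it uses simulated features and labels and is not CADD’s actual feature set, training data, or model.

```python
# A schematic of the annotation-integration idea behind scores like CADD (this is
# not CADD's feature set, training data, or model): fit a linear classifier on
# per-variant annotations to separate proxy-deleterious from proxy-neutral
# variants, then use the fitted model to score new variants.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000
# Simulated columns: log10 allele frequency, conservation score, regulatory-overlap flag.
X = np.column_stack([
    rng.uniform(-6, -1, n),
    rng.normal(0, 1, n),
    rng.integers(0, 2, n),
])
y = rng.integers(0, 2, n)   # proxy labels (simulated vs observed variants in CADD's framing)

model = LogisticRegression(max_iter=1000).fit(X, y)
scores = model.predict_proba(X)[:, 1]   # per-variant deleteriousness-like score
```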
2.8 How Later Chapters Use These Resources
The genomic deep learning models that follow are only as good as the data they are trained and evaluated on. Chapter 3 uses GWAS and biobank-scale cohorts to define polygenic scores. Chapter 4 explores how annotation-based scores like CADD compress many of these resources into a single number. Chapters 5-7 use ENCODE/Roadmap/Cistrome-style functional data as training labels for sequence-to-function models. Chapters 12-14 revisit these resources as inputs, labels, and priors for genomic foundation models.
By surveying the data landscape here, we establish a common reference that later chapters can build on without re-introducing each resource from scratch. The recurring theme is that data quality, completeness, and biases flow through every model trained on them. Understanding these foundations is essential for interpreting what models learn and where they fail.