2  Data Landscape

Your model is only as good as your data. And your data has gaps that are often difficult to detect.

Chapter Overview

Estimated reading time: 45-55 minutes

Prerequisites: Basic understanding of DNA/RNA/protein structure; familiarity with sequencing concepts from Chapter 1; awareness of variant calling pipelines.

Learning Objectives: After completing this chapter, you should be able to:

  1. Identify the major genomic data resources and explain what type of evidence each provides
  2. Recognize systematic biases in training data and predict how they affect downstream models
  3. Distinguish between population-level variant catalogs, functional genomics datasets, and clinical interpretation databases
  4. Evaluate phenotype quality and anticipate its impact on genetic association studies
  5. Apply critical literacy when selecting training data for genomic machine learning models

Key Insight: Every machine learning model inherits both the signal and the systematic gaps of its training data. Understanding what datasets contain, and what they miss, is prerequisite to interpreting what models learn.

A clinical laboratory sequences a patient of West African ancestry presenting with symptoms suggestive of hypertrophic cardiomyopathy. The automated variant classifier flags a missense variant in MYH7 as pathogenic, the same classification it received in ClinVar, where the variant was deemed pathogenic based on its absence from a large European-ancestry cohort. The problem: this variant is ten times more common in African populations, where it segregates as benign. The classifier, trained on databases dominated by European samples, cannot distinguish a population-enriched benign variant from a genuinely pathogenic one. The patient receives a diagnosis of genetic cardiomyopathy; their family members are offered cascade testing and surveillance.

This is not a hypothetical scenario. Such misclassifications have been documented repeatedly, with consequences ranging from unnecessary cardiac interventions to psychological harm that persists even after reclassification.

Genomic models learn from labels, and those labels come from somewhere. A variant effect predictor trained on ClinVar classifications learns whatever biases clinical laboratories embedded in those classifications. A chromatin accessibility model trained on Encyclopedia of DNA Elements (ENCODE) cell lines may fail on primary tissues absent from the training compendium. A constraint metric derived from European-ancestry cohorts will be poorly calibrated for variants private to other populations. Every machine learning model in genomics inherits both the signal and the systematic gaps of its training data. Understanding what genomic resources contain, and what they systematically miss, is prerequisite to interpreting what models learn.

No single dataset captures the complexity of genomic function. The field depends on a mosaic of complementary resources: reference genomes and gene annotations that define the coordinate system, population variant catalogs that reveal what survives in healthy individuals, biobank datasets that link genetic variation to phenotypes at scale, functional genomics atlases that map biochemical activity across cell types and conditions, and clinical databases that aggregate expert variant interpretations. Each resource contributes a different type of evidence. Reference genomes provide the scaffold against which all variants are defined. Population databases like the Genome Aggregation Database (gnomAD) establish baseline expectations for variant frequency. Functional assays from ENCODE and Roadmap Epigenomics indicate where the genome shows evidence of regulatory activity. Clinical databases like ClinVar provide ground-truth labels for pathogenic and benign variants, at least for the subset of variants that have been expertly reviewed.

Figure 2.1: The genomic data ecosystem. Major resources are organized into four functional layers: reference assemblies and gene annotations provide the coordinate foundation; population catalogs and biobanks supply variant frequencies and phenotype associations; functional genomics consortia map biochemical activity across cell types; clinical databases aggregate pathogenicity interpretations. Arrows indicate data dependencies, with each layer building on those below. Every machine learning model in genomics inherits both the signal and the systematic gaps of resources in this ecosystem.

The goal of this chapter is not encyclopedic completeness but critical literacy: the ability to recognize when training data may not represent the population, condition, or variant class at hand. This literacy becomes essential when deploying models in clinical contexts where failures have consequences. The biases introduced by population structure, label ascertainment, and technical artifacts create systematic confounding that propagates through every downstream model. These challenges receive detailed treatment in Chapter 13: Section 13.2.1 examines ancestry-related confounding, Section 13.2.2 covers technical batch effects, and Section 13.2.4 addresses ascertainment patterns in clinical labels. The present chapter establishes the data landscape from which those confounders arise.

Predict Before Viewing

Before examining the table below, test your understanding: For each major resource type (reference genomes, population catalogs, biobanks, functional genomics, clinical databases), what would be the primary limitation or blind spot that a model trained on that data might inherit?

Table 2.1: Major genomic data resource types and their roles in machine learning workflows. Each layer provides different evidence types, and models trained on any single layer inherit that layer’s biases and blind spots.
| Resource Type | Example Databases | What It Provides | Primary Use in ML |
| --- | --- | --- | --- |
| Reference Genomes | GRCh38, T2T-CHM13 | Coordinate system, sequence scaffold | Input sequences |
| Gene Annotations | GENCODE, RefSeq, MANE | Exon/intron structure, transcript definitions | Variant annotation |
| Population Catalogs | gnomAD, 1000 Genomes | Allele frequencies, constraint metrics | Filtering, features |
| Biobanks | UK Biobank, All of Us | Genotype-phenotype associations | GWAS labels |
| Functional Genomics | ENCODE, Roadmap, GTEx | Chromatin states, TF binding, expression | Training labels |
| Clinical Databases | ClinVar, ClinGen | Pathogenicity classifications | Benchmark labels |

2.1 Reference Genomes and Gene Annotations

A family arrives at a genetics clinic after their newborn’s screening reveals a potential metabolic disorder. The clinical team orders whole-genome sequencing and receives a report identifying a novel variant in a gene associated with the condition. The variant’s coordinates place it at the boundary between an exon and an intron, potentially disrupting splicing. Yet whether this interpretation is correct depends on decisions made years before the child was born: which positions constitute exon boundaries, which transcript model defines the canonical gene structure, and which sequence serves as the reference against which “variant” is defined. Reference genomes and gene annotations are so foundational that their assumptions often become invisible, yet every downstream analysis inherits the choices embedded in these resources. A model cannot learn about a regulatory element for a transcript that does not exist in the annotation.

2.1.1 Reference Assemblies

A patient’s clinical sequencing reveals a potentially pathogenic variant in a duplicated region of chromosome 17. The variant calling pipeline reports a confident genotype, the annotation tool predicts a frameshift, and the clinical team prepares to discuss the finding with the family. Yet the “variant” may be an artifact of misalignment: reads from a paralogous sequence elsewhere in the genome mapped incorrectly because the reference assembly collapsed two distinct loci into one. Whether this error occurs, whether it can be detected, and whether the clinical interpretation has any foundation in biological reality all depend on the choice of reference genome.

Most modern pipelines align reads to a small number of reference assemblies, predominantly Genome Reference Consortium Human Build 38 (GRCh38) or the newer telomere-to-telomere CHM13 assembly (T2T-CHM13) (Nurk et al. 2022). A reference genome is not simply a consensus sequence; it encodes a series of consequential decisions about how to represent duplications, alternate haplotypes, and unresolved gaps. These decisions determine which regions are mappable by short reads, how structural variants are represented, and how comparable results will be across cohorts built on different assemblies. The variant calling pipelines described in Chapter 1 depend fundamentally on these reference choices. Variant callers that rely on these reference assemblies face characteristic failures in difficult regions, as detailed in Section 1.6.

Key Insight: Reference Choice Shapes What Is Detectable

The reference genome is not neutral infrastructure; it is an active filter. Regions absent from the reference cannot be mapped to. Collapsed duplications create spurious variants. Unresolved gaps hide entire genes. When a model fails to detect a variant or makes a false call, ask: could this be a reference assembly limitation rather than a model limitation?

Graph-based and pangenome references relax the assumption of a single linear reference, representing multiple haplotypes and ancestries within a unified coordinate system (Liao et al. 2023). Comparative multi-species references, such as those used in mammalian constraint maps from the Zoonomia consortium (Sullivan et al. 2023), extend this idea across species, providing evolutionary conservation scores that feed directly into deleteriousness predictors and gene-level constraint metrics discussed in Section 4.1.1 for evolutionary approaches and Section 4.3 for integrated scoring.

For most genomic deep learning applications, the practical reality is still GRCh37 or GRCh38 coordinates, often with incremental patches. Models trained on these resources therefore inherit their blind spots: incomplete or collapsed segmental duplications, underrepresented ancestries in pangenome construction, and uneven quality across chromosomes and regions. These limitations concentrate in precisely the regions where variant interpretation matters most (such as the HLA locus, pharmacogenes with structural variation, and segmental duplications harboring disease genes), creating a systematic mismatch between clinical importance and reference quality.

2.1.2 Gene Models

A child presents with developmental delay and muscle weakness. Whole-genome sequencing identifies a novel variant near the DMD gene, which encodes dystrophin and causes Duchenne muscular dystrophy when disrupted. The annotation pipeline reports the variant as intronic and unlikely to affect protein function. Yet DMD spans 2.2 megabases and includes 79 exons with complex alternative splicing; whether this variant disrupts a tissue-specific isoform depends entirely on which transcript model the annotation tool uses. The clinical implications are entirely different, yet the underlying sequence is identical: only the annotation changes.

Gene annotation databases such as GENCODE and RefSeq define the biological vocabulary overlaid on reference coordinates: exon-intron structures, canonical and alternative transcripts, start and stop codons, and untranslated regions (Frankish et al. 2019; O’Leary et al. 2016). These annotations distinguish coding from non-coding variants, identify splice-disrupting mutations, and map functional genomics signals to genes. They also establish the units (genes, transcripts, exons) that downstream models implicitly operate on. Splicing prediction models in Section 6.5 learn splice site grammar from annotated exon-intron boundaries, then apply those patterns to detect both canonical and cryptic sites.
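
To make this dependency concrete, the sketch below classifies the same position against two transcript models. The exon coordinates and variant position are hypothetical, and production annotators handle far more than exon overlap (UTRs, splice regions, strand, codon-level effects), but the flip from intronic to exonic is exactly the DMD situation above.

```python
# A sketch of why the transcript model determines the annotated consequence.
# Exon coordinates and the variant position are hypothetical.

def classify(position, exons):
    """Label a position exonic or intronic given half-open (start, end) exons."""
    for start, end in exons:
        if start <= position < end:
            return "exonic"
    return "intronic"

canonical = [(100, 200), (500, 650)]                     # canonical transcript
tissue_specific = [(100, 200), (340, 420), (500, 650)]   # adds one exon

variant_pos = 380
print(classify(variant_pos, canonical))          # -> intronic
print(classify(variant_pos, tissue_specific))    # -> exonic
```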

The MANE Select project provides a single matched transcript per protein-coding gene that is identical between GENCODE and RefSeq, simplifying clinical interpretation and variant reporting (Morales et al. 2022). This standardization makes variant descriptions consistent across laboratories, yet it privileges a single isoform over biological complexity. In contexts where tissue-specific or developmentally regulated isoforms drive disease (alternative splicing in muscular dystrophies, isoform-specific expression in neuropsychiatric conditions), the canonical transcript may miss the relevant biology.

Deep Dive: Alternative Splicing

For ML readers: A single gene can produce multiple different proteins through alternative splicing:

The basics: After DNA is transcribed into pre-mRNA, introns (non-coding segments) are removed and exons (coding segments) are joined together. Alternative splicing means different combinations of exons can be included in the final mRNA, producing different protein variants (isoforms) from the same gene.

Splicing patterns:

  • Exon skipping: An exon is excluded from the final transcript
  • Alternative 5’ or 3’ splice sites: Different endpoints for exon boundaries
  • Intron retention: An intron is kept in the mature mRNA
  • Mutually exclusive exons: One of two exons is included, never both

Scale: Over 95% of human multi-exon genes undergo alternative splicing. The ~20,000 human genes produce an estimated 100,000+ distinct protein isoforms.

Why it matters for genomic ML: A variant might be benign in the canonical isoform but pathogenic in a tissue-specific isoform. Models trained only on canonical transcripts miss this complexity. Splicing prediction models like SpliceAI (Section 6.5) predict how variants affect splice site usage.

New isoforms continue to be discovered, alternative splicing remains incompletely cataloged, and cell-type-specific transcripts may be absent from bulk-derived annotations. Non-coding RNA genes and pseudogenes are even more unevenly annotated. These gaps propagate through every tool built on them: variant effect predictors cannot score consequences for transcripts that do not exist in their reference annotation, and expression models cannot predict isoforms they were never trained on.

Stop and Think: Annotation Dependencies

Consider a variant effect predictor trained exclusively on MANE Select transcripts. What types of pathogenic variants might it systematically miss? Think about tissue-specific isoforms, retained introns, and alternative first exons before reading on.

The predictor would miss variants affecting: (1) tissue-specific isoforms not represented in MANE, (2) alternative exons used in specific developmental stages or cell types, (3) non-coding transcripts and regulatory RNAs, and (4) novel splice isoforms not yet annotated. This illustrates how annotation choices propagate into model blind spots.

2.2 Population Variant Catalogs and Allele Frequencies

A clinical geneticist evaluates a child with an undiagnosed syndrome and identifies a novel missense variant in a candidate gene. The question that determines what happens next is deceptively simple: has anyone else carried this variant? If the variant appears in thousands of healthy adults, it is almost certainly benign. If it has never been observed across hundreds of thousands of sequenced genomes, that absence becomes evidence of selective pressure against the variant, strongly suggesting functional consequence. Without population-scale variant catalogs, this inference is impossible, and every rare variant would demand the same level of scrutiny regardless of its actual likelihood of causing disease.

Figure 2.2: Ancestry representation across major genomic resources. Stacked bars show the proportion of samples from different ancestral backgrounds in key databases. European-ancestry individuals (blue) comprise approximately 78% of GWAS participants despite representing roughly 16% of the global population. This overrepresentation propagates through variant interpretation databases, functional genomics atlases, and polygenic score development, creating systematic gaps in genomic medicine for underrepresented populations. Inset map highlights continental regions with the largest representation deficits relative to population size.

Allele frequency, the proportion of chromosomes in a reference population carrying a given variant, serves as one of the most powerful priors in variant interpretation. Beyond simple filtering, allele frequencies inform statistical frameworks for case-control association, provide training signal for deleteriousness predictors, and enable imputation of ungenotyped variants through linkage disequilibrium (see Chapter 3). The catalogs described below have progressively expanded in sample size, ancestral diversity, and annotation depth, transforming variant interpretation from an ad hoc exercise into a quantitative discipline.

Deep Dive: Allele Frequency

For ML readers: Allele frequency is the proportion of chromosomes carrying a particular variant in a population:

Minor Allele Frequency (MAF): The frequency of the less common allele at a biallelic site. If 99% of chromosomes carry A and 1% carry G at a position, MAF = 0.01 (1%).

Frequency categories in practice:

| Category | MAF | Typical interpretation |
| --- | --- | --- |
| Common | >5% | Unlikely to cause severe disease (would be selected against) |
| Low-frequency | 1-5% | May contribute to complex traits |
| Rare | 0.1-1% | Candidates for Mendelian disease |
| Ultra-rare | <0.1% | Strongest candidates for pathogenicity |

Why frequency matters:

  1. Filtering: Variants common in healthy populations are unlikely to cause severe disease. A variant at 1% frequency in gnomAD cannot plausibly cause a dominant condition affecting 1 in 100,000 people.

  2. Evolutionary signal: Rare variants have been exposed to less natural selection. Absence from large population databases suggests the variant may be deleterious.

  3. Population specificity: A variant at 5% in one population but absent from another is not “rare”; it is population-specific. Frequency filtering must be ancestry-aware, as sketched in the code after this list.
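
A minimal sketch of this ancestry-aware logic, with illustrative frequencies; real pipelines read per-population allele frequencies from gnomAD and use prevalence-based thresholds that also account for penetrance and inheritance mode.

```python
# Illustrative ancestry-aware frequency filter. The per-population allele
# frequencies below are invented; real pipelines query gnomAD.

def max_population_af(per_population_af):
    """The statistic that matters is the maximum frequency across ancestries,
    not the global average: a variant common in any one population is
    unlikely to cause a rare, fully penetrant dominant disease."""
    return max(per_population_af.values())

variant_af = {"AFR": 0.010, "EUR": 0.0, "EAS": 0.0002}

DISEASE_PREVALENCE = 1e-5  # ~1 in 100,000 (simplified threshold)
if max_population_af(variant_af) > DISEASE_PREVALENCE:
    print("filter out: too common in at least one population to be causal")
```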

An important nuance shapes model interpretation: these catalogs record variants that are compatible with being sampled in the first place. Gene-lethal variants that cause embryonic death or severe childhood disease rarely appear, even when they are biologically informative. Variants causing late-onset conditions (Alzheimer’s risk alleles, adult-onset cancer predisposition) can persist at appreciable frequencies because selection has not had time to remove them. Models trained on population data can only learn from variants present in these catalogs, which means they systematically underrepresent the most severe loss-of-function mutations.

2.2.1 dbSNP and Variant Identifiers

Two laboratories sequence the same patient and report their findings to a tumor board. Laboratory A describes a variant using genomic coordinates on GRCh38; Laboratory B uses HGVS nomenclature relative to a specific transcript. Are they discussing the same variant? Without standardized identifiers, this simple question can consume hours of manual reconciliation. The database of Single Nucleotide Polymorphisms (dbSNP) provides the common currency that cuts through this ambiguity: stable identifiers (rsIDs) that enable integration across tools and publications (Sherry et al. 2001).

When a laboratory reports a variant, when a researcher publishes a GWAS finding, and when a clinician queries a pathogenicity database, they need a common language to ensure they are discussing the same genomic position. Modern whole-exome and whole-genome sequencing routinely discovers millions of previously unseen variants per large cohort, but dbSNP identifiers remain the standard way to reference known single nucleotide polymorphisms (SNPs) and link disparate resources. When a GWAS publication reports an association at rs12345, that identifier traces back to dbSNP and enables integration with functional annotations, clinical databases, and population variant catalogs.
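
A sketch of the reconciliation step behind this common language, assuming Laboratory B's HGVS description has already been mapped to genomic coordinates on the same assembly; the coordinates below are hypothetical.

```python
# Reduce differently formatted variant reports to one comparable key.

def variant_key(chrom, pos, ref, alt):
    """Normalize to a (chrom, pos, ref, alt) tuple, the minimal unambiguous
    representation. Real pipelines also left-align and trim indel alleles."""
    return (chrom.removeprefix("chr"), int(pos), ref.upper(), alt.upper())

lab_a = variant_key("chr17", 43094464, "g", "a")  # genomic-coordinate report
lab_b = variant_key("17", "43094464", "G", "A")   # after HGVS -> genomic mapping
assert lab_a == lab_b   # same variant despite different reporting styles
```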

2.2.2 1000 Genomes and Early Reference Panels

Genotyping arrays measure only a sparse subset of genomic positions, yet disease-associated variants may lie anywhere in the genome. How can researchers infer variants at unmeasured positions? The answer lies in patterns of co-inheritance: variants that travel together on ancestral chromosome segments can be inferred from neighboring measured positions. This process of imputation depends entirely on having reference panels that capture the haplotype structure of the population being studied.

The 1000 Genomes Project provided one of the first widely used multi-population panels for imputation, sampling individuals from African, European, East Asian, South Asian, and admixed American populations (Auton et al. 2015). The resulting haplotype structure underlies many imputation servers and downstream analyses, enabling genotyping arrays with millions of markers to impute tens of millions of untyped variants through linkage disequilibrium (Yun et al. 2021). Although its sample size (approximately 2,500 individuals) is modest by current standards, 1000 Genomes established the template for how to build and distribute multi-population reference panels, and its samples continue to serve as benchmarks for variant calling performance. The role of imputation in GWAS is discussed further in Chapter 3.

More recent initiatives have substantially increased both sample size and ancestral diversity. TOPMed (Trans-Omics for Precision Medicine) provides deep whole-genome sequencing of over 180,000 individuals with deliberate oversampling of underrepresented populations (Taliun et al. 2021). The resulting reference panel enables more accurate imputation particularly for African-ancestry individuals and captures rare variants missed by earlier, smaller panels.

2.2.3 Genome Aggregation Database (gnomAD)

Distinguishing genuinely rare variants from sampling gaps requires population-scale catalogs with two properties: sufficient sample size to detect low-frequency variants reliably, and sufficient ancestral diversity to avoid misclassifying variants common in underrepresented populations. A variant at 1% frequency in African populations but absent from European cohorts would be incorrectly flagged as novel by any database sampling only European individuals. The Genome Aggregation Database (gnomAD) addresses both requirements by aggregating exome and genome sequencing data from research and clinical cohorts worldwide into harmonized allele frequency resources spanning hundreds of thousands of individuals (Karczewski et al. 2020).

gnomAD provides high-resolution allele frequencies stratified by genetic ancestry, enabling population-matched filtering that accounts for variants common in one ancestry but rare in others. Without this stratification, models trained predominantly on European data mislabel exactly such population-enriched variants as ultra-rare, repeating the failure in the chapter-opening scenario.

gnomAD also introduced constraint metrics that have become standard features in variant prioritization. The probability of loss-of-function intolerance (pLI) and loss-of-function observed/expected upper bound fraction (LOEUF) summarize how depleted a gene is for protein-truncating variants relative to expectation. These metrics work by computing how many loss-of-function variants we would expect to see in a gene if such variants were neutral (based on the gene’s length, sequence composition, and trinucleotide mutation rates), then comparing this expectation to the number actually observed across hundreds of thousands of individuals. When observed counts are far below expectation, the missing variants must have been removed by natural selection before carriers could reproduce, indicating that the gene cannot tolerate loss of function. Genes essential for viability show far fewer loss-of-function variants than neutral mutation rates would predict; this depletion provides evidence of selective constraint that transfers to variant interpretation. A novel truncating variant in a highly constrained gene warrants more concern than the same variant class in an unconstrained gene. These constraint metrics serve as important features in many variant effect predictors discussed in Section 4.3 and Chapter 18.

Population frequencies from gnomAD provide critical filtering steps in clinical variant interpretation pipelines, as detailed in Section 29.1.2. The constraint metrics derived from gnomAD form a foundation for variant effect prediction discussed in Section 4.1.1. Allele frequency distributions also inform fine-mapping approaches that assign causal probability to GWAS-associated variants (Section 3.3).

Interpreting Constraint Metrics

pLI (probability of loss-of-function intolerance) estimates the probability that a gene falls into the class of haploinsufficient genes where loss of one copy causes disease. Scores range from 0 to 1; genes with pLI > 0.9 are considered highly constrained. pLI’s categorical nature (genes are classified as tolerant, intermediate, or intolerant) limits its resolution for genes with intermediate constraint.

LOEUF (loss-of-function observed/expected upper bound fraction) provides a continuous measure by computing the ratio of observed to expected loss-of-function variants, with a 90% confidence upper bound. Lower LOEUF values indicate stronger constraint. A gene with LOEUF of 0.2 has observed only 20% as many truncating variants as expected under neutral evolution. LOEUF has largely superseded pLI in contemporary analyses due to its continuous scale and more intuitive interpretation.
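
The sketch below computes an observed/expected ratio and a Poisson upper confidence bound in the spirit of LOEUF. It is illustrative only: gnomAD derives the expected counts from context-specific mutation rate models rather than taking them as given.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))

def oe_upper_bound(observed, expected, alpha=0.05):
    """Largest obs/exp ratio still consistent with the data: the ratio r at
    which seeing `observed` or fewer LoF variants has probability alpha
    under Poisson(r * expected). Lower values indicate stronger constraint."""
    lo, hi = 0.0, 10.0
    for _ in range(60):                      # bisection on the ratio
        mid = (lo + hi) / 2
        if poisson_cdf(observed, mid * expected) > alpha:
            lo = mid
        else:
            hi = mid
    return hi

# A gene with 5 observed truncating variants where 50 were expected:
obs, exp = 5, 50.0
print(f"obs/exp = {obs / exp:.2f}, "
      f"LOEUF-like upper bound = {oe_upper_bound(obs, exp):.2f}")
```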

Predict Before Viewing

Before examining the constraint metrics table, consider: If a gene has very few loss-of-function variants observed relative to expectation, what does that tell you about the gene’s importance? Would you expect disease-causing genes to be more or less constrained than average?

Table 2.2: Comparison of gnomAD constraint metrics and their appropriate applications. LOEUF has largely superseded pLI for contemporary analyses due to its continuous scale.
| Metric | Range | Interpretation | When to Use |
| --- | --- | --- | --- |
| pLI | 0-1 | Probability gene is haploinsufficient | Categorical assessment; pLI > 0.9 indicates high constraint |
| LOEUF | 0-2+ | Ratio of observed/expected LoF variants | Continuous ranking; lower = more constrained |
| Missense Z | Any | Standard deviations from expected missense count | Missense constraint; higher = more constrained |
| pRec | 0-1 | Probability gene causes recessive disease | Recessive disease candidate genes |

Knowledge Check: Which Database?

For each scenario, which database or resource would be most appropriate to query, and why?

  1. A clinical geneticist needs to determine whether a novel missense variant in BRCA2 has been previously classified as pathogenic or benign by expert review.

  2. A researcher building a variant effect predictor needs allele frequencies stratified by genetic ancestry to filter training data for rare variants.

  3. A bioinformatician is annotating variants and needs stable identifiers that can be referenced across publications and tools.

  4. A computational biologist training a chromatin accessibility model needs experimental data on DNase-seq and transcription factor binding across multiple cell types.

  1. ClinVar: Aggregates expert variant classifications from clinical laboratories and expert panels, providing pathogenicity assertions with supporting evidence summaries.

  2. gnomAD: Provides population-stratified allele frequencies aggregated from hundreds of thousands of exomes and genomes, enabling ancestry-aware filtering that avoids misclassifying population-specific variants as ultra-rare.

  3. dbSNP: Assigns stable rsID identifiers to known variants, serving as the common vocabulary for referencing variants across databases, publications, and bioinformatics tools.

  4. ENCODE/Roadmap Epigenomics: Provides experimental functional genomics data including DNase-seq, ChIP-seq for transcription factors and histone marks, and chromatin state annotations across hundreds of cell types and tissues.

These resources are indispensable for filtering common variants in Mendelian disease diagnostics, distinguishing ultra-rare variants from recurrent ones, and providing population genetics priors for deleteriousness scores like CADD (Rentzsch et al. 2019; Schubach et al. 2024). At the same time, they reflect the composition of the cohorts they aggregate: ancestry representation remains uneven despite ongoing efforts, structural variants and repeat expansions are less completely cataloged than SNVs and short indels, and individuals with severe early-onset disease are underrepresented by design. These biases propagate into every model that uses gnomAD frequencies or constraint scores as features.

2.3 Biobanks and GWAS Data

A pharmaceutical company developing a new cardiac drug needs to understand which genetic variants influence drug response. A health system implementing pharmacogenomic testing needs to know which patients are at risk for adverse reactions. A researcher studying the genetics of depression needs cases and controls with standardized phenotyping. None of these questions can be answered by sequencing alone; they require linking genetic variation to phenotypes at scale, across thousands or hundreds of thousands of individuals. Yet assembling such cohorts introduces its own biases: participants must consent, provide samples, and have phenotypes recorded in standardized ways. The populations enrolled in major biobanks reflect patterns of healthcare access, research infrastructure, and historical priorities that do not represent global genetic diversity.

The overrepresentation of European-ancestry individuals in most major biobanks creates systematic gaps in variant discovery, effect-size estimation, and polygenic score portability that propagate through downstream analyses (Sirugo, Williams, and Tishkoff 2019). A variant common in West African populations may be absent or rare in European-dominated catalogs, rendering it invisible to association studies and underrepresented in predictive models. This tension between scientific utility and representational equity shapes every biobank-derived resource and is discussed in detail in Chapter 13.

2.3.1 Large Population Cohorts

Statistical Note: Sample Size Requirements

Detecting genetic associations with small effect sizes requires very large sample sizes. A variant increasing disease risk by OR = 1.05 requires approximately 50,000 cases and 50,000 controls to detect at genome-wide significance (\(\alpha = 5 \times 10^{-8}\)). Required sample size scales with the inverse square of effect size: halving the effect size quadruples the required sample. This relationship explains why modern GWAS require hundreds of thousands of participants.

The same constraint holds for continuous traits: a variant shifting blood pressure by 0.05 standard deviations demands samples often exceeding 100,000 individuals. Detecting small effect sizes against a backdrop of millions of tested variants requires both stringent significance thresholds and massive sample sizes to achieve adequate power, which explains why genetic discovery accelerated dramatically once biobanks reached hundreds of thousands of participants. Chapter 28 provides a detailed treatment of these statistical foundations and their implications for clinical risk prediction.
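
A minimal sketch of the underlying power approximation, using the normal approximation to a logistic-regression score test; the printed figures shift by a factor of a few with allele frequency, case fraction, and the power target, so treat them as illustrative rather than definitive.

```python
import math
from statistics import NormalDist

def gwas_total_n(or_per_allele, maf, alpha=5e-8, power=0.80):
    """Approximate total sample size (balanced cases/controls) to detect a
    per-allele odds ratio at genome-wide significance, via
    Var(beta_hat) ~ 1 / (N * 2p(1-p) * phi(1-phi)) with phi = 0.5."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided significance threshold
    z_power = z.inv_cdf(power)           # quantile for the desired power
    beta = math.log(or_per_allele)       # effect on the log-odds scale
    genotype_var = 2 * maf * (1 - maf)   # variance of allele count under HWE
    phi = 0.5                            # fraction of cases in the sample
    return math.ceil((z_alpha + z_power) ** 2
                     / (genotype_var * phi * (1 - phi) * beta ** 2))

# Halving the log-odds effect roughly quadruples the required sample:
for odds_ratio in (1.10, 1.05):
    print(odds_ratio, gwas_total_n(odds_ratio, maf=0.30))
```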

UK Biobank, with approximately 500,000 participants and deep phenotyping across thousands of traits, has become a dominant resource for methods development and benchmarking (Bycroft et al. 2018). FinnGen uses Finland’s population history and unified healthcare records for large-scale disease association discovery (Kurki et al. 2023). The All of Us Research Program prioritizes diversity, aiming to enroll one million participants with deliberate oversampling of historically underrepresented groups (The All of Us Research Program Investigators 2019). deCODE genetics has genotyped a substantial fraction of Iceland’s population, enabling unique studies of rare variants and founder effects in a population with detailed genealogical records (Gudbjartsson et al. 2015). Additional resources include the Million Veteran Program, Mexican Biobank, BioBank Japan, China Kadoorie Biobank, and emerging African genomics initiatives such as H3Africa (Sirugo, Williams, and Tishkoff 2019).

Predict Before Viewing

Before looking at the biobank comparison table, predict: Which characteristic likely determines the statistical power to detect rare variant associations versus common variant associations? Which characteristic most affects whether findings will transfer across global populations?

Table 2.3: Major biobanks and their characteristics. Note that most resources remain dominated by European-ancestry participants, creating systematic gaps in variant discovery for other populations.
| Biobank | Sample Size | Key Strength | Primary Ancestry |
| --- | --- | --- | --- |
| UK Biobank | ~500,000 | Deep phenotyping, imaging | European (94%) |
| FinnGen | ~500,000 | Founder population, linked EHR | Finnish |
| All of Us | 1,000,000 (target) | Diversity prioritized | Multi-ancestry |
| deCODE | ~200,000 | Genealogical records | Icelandic |
| BioBank Japan | ~200,000 | East Asian representation | Japanese |
| Million Veteran Program | ~900,000 | Diverse U.S. veterans | Multi-ancestry |

Together, these efforts enable genome-wide association studies (GWAS) for thousands of traits, development and evaluation of polygenic scores, and fine-mapping of causal variants and genes (Marees et al. 2018; Mountjoy et al. 2021). From a modeling perspective, they provide the large-scale genotype-phenotype matrices that power architectures ranging from classical linear mixed models to foundation models trained on biobank-scale data. The practical reality for most GWAS and polygenic score methods in Chapter 3 is data from either array genotyping with imputation or whole-exome/whole-genome sequencing with joint calling, as in DeepVariant/GLnexus-style pipelines (Yun et al. 2021).

PLINK (Purcell et al. 2007) remains the workhorse tool for GWAS data manipulation, supporting quality control, population stratification analysis, association testing, and file format conversions that underlie most biobank analyses. Despite its age, PLINK’s efficiency with large-scale genotype data makes it indispensable for preprocessing pipelines that feed into downstream deep learning workflows.

2.3.2 GWAS Summary Statistics

Individual-level genotype and phenotype data are powerful but sensitive. Sharing such data across institutions requires complex data use agreements, institutional review board approvals, and secure computing infrastructure. These barriers would slow scientific progress if every analysis required access to raw data. Summary statistics offer an alternative: per-variant effect sizes, standard errors, and p-values that capture the essential association signal without revealing individual genotypes.

The GWAS Catalog compiles published results across thousands of traits, while the PGS Catalog provides curated polygenic score weights and metadata for reproducibility (Sollis et al. 2023; Lambert et al. 2021). Frameworks like Open Targets Genetics integrate fine-mapped signals with functional annotations to prioritize candidate causal genes at associated loci (Mountjoy et al. 2021). The Open Targets Platform (platform.opentargets.org) extends this by aggregating target-disease associations from multiple evidence sources—genetic associations, expression data, literature mining, pathway annotations, and known drugs—into systematic scores for drug target prioritization (Ochoa et al. 2023). Unlike resources focused on individual variants (ClinVar) or individual genes (OMIM), Open Targets provides gene-level disease association scores that integrate across evidence types, bridging genetic discovery and therapeutic translation (Section 30.1.3).

Summary statistics enable meta-analysis across cohorts without sharing individual-level data, transfer of genetic findings to new populations through methods like PRS-CSx, and integration with functional annotations to distinguish causal variants from linked bystanders (Ruan et al. 2022). For deep learning, summary statistics provide a sparse, trait-level view of the genome. This sparsity creates different challenges than the dense labels available in functional genomics, but also different opportunities: the genetic architecture revealed through GWAS informs polygenic score construction (Section 3.5) and indicates which variants and pathways merit follow-up with regulatory models (Chapter 17) and variant effect predictors (Chapter 18).

2.4 Functional Genomics and Regulatory Landscapes

Protein-coding exons constitute roughly 1.5% of the human genome, yet most disease-associated variants from GWAS fall outside coding regions. A massive study identifies 100 loci associated with schizophrenia, but 90 of them lie in non-coding regions with no obvious connection to any gene. This mismatch creates a fundamental interpretability problem: we can identify non-coding loci that harbor disease risk, but we cannot easily determine which base pairs matter, which genes they regulate, or in which cell types they act. Understanding these non-coding variants requires mapping the regulatory logic that governs when, where, and how much each gene is expressed. Functional genomics assays provide this map, identifying transcription factor binding sites, nucleosome positioning, chromatin accessibility, histone modifications, and three-dimensional genome organization across cell types and conditions.

Key Insight: The Label Source Determines the Model

Functional genomics datasets serve a dual role in genomic deep learning. First, they supply the biological vocabulary for interpreting non-coding variants. Second, they provide the training labels for sequence-to-function models. When DeepSEA learns to predict chromatin accessibility from DNA sequence, it compresses into its parameters the regulatory patterns implicit in thousands of ENCODE experiments. The model learns whatever biology ENCODE measured, and inherits whatever ENCODE missed.

2.4.1 ENCODE, Roadmap, and Related Consortia

Predict Before Viewing

Before reading about ENCODE and Roadmap, consider: If you were designing a reference dataset to train models predicting regulatory activity from DNA sequence, what are the key design decisions you’d face? Think about:

  • Which cell types or tissues to profile?
  • Which assay types to prioritize?
  • How to balance depth (many experiments in few cell types) versus breadth (few experiments across many cell types)?

Keep these tradeoffs in mind as you read.

A single ChIP-seq experiment for one transcription factor in one cell line provides useful signal, but models that learn general regulatory grammar require thousands of such experiments spanning many factors, marks, and cell types. A researcher training a regulatory model on her own laboratory’s data will produce a model that works well in her specific experimental context but fails to generalize. The key insight behind ENCODE and Roadmap was that coordinated experimental campaigns, with standardized methods and quality control, could create reference datasets serving the entire field.

The Encyclopedia of DNA Elements (ENCODE) and Roadmap Epigenomics consortia designed coordinated experimental campaigns that profiled transcription factor binding (ChIP-seq), histone modifications, chromatin accessibility (DNase-seq, ATAC-seq), and chromatin conformation (Hi-C) across cell lines and primary tissues (Kagda et al. 2025; Kundaje et al. 2015). Gene Expression Omnibus (GEO) archives these and many other functional genomics datasets with standardized metadata (Edgar, Domrachev, and Lash 2002).

The significance of these consortia lies less in any individual experiment than in the scale and standardization they provide. By generating hundreds of assays across dozens of cell types with consistent protocols, ENCODE and Roadmap created canonical reference datasets that define the regulatory landscape for the cell types they profiled. These resources enabled multiple generations of regulatory models. DeepSEA (Section 6.2) pioneered multi-task learning on ENCODE chromatin accessibility and transcription factor binding, where each prediction task corresponds to a ChIP-seq or accessibility experiment. Enformer (Section 17.2) extended this paradigm with transformer attention mechanisms and longer context windows. The progression from convolutional to attention-based architectures reflects both the richness of ENCODE data and its limitations: models trained on these resources inherit ENCODE’s choices about which cell types, factors, and experimental conditions merit inclusion.

Figure 2.3: Coverage of the ENCODE and Roadmap Epigenomics data compendium. Each row represents a cell type or tissue; each column represents an assay type (chromatin accessibility, histone modifications, transcription factor binding, gene expression). Color intensity indicates data availability, from absent (white) to comprehensively profiled (dark blue). Tier 1 cell lines (K562, GM12878, HepG2) show near-complete coverage across assay types, while many disease-relevant primary tissues remain sparsely profiled. Models trained on this compendium inherit its coverage biases, performing best for well-characterized cell types and potentially failing for undersampled contexts.

Knowledge Check: Functional Genomics Coverage

A model trained to predict chromatin accessibility performs excellently on K562 cells (a leukemia cell line) but poorly on primary pancreatic beta cells. Without looking back, can you explain why this might occur?

K562 is one of the most heavily profiled cell types in ENCODE with extensive training data. Pancreatic beta cells, despite their importance in diabetes, have sparse representation in training compendia. The model learned patterns specific to well-represented cell types and cannot generalize to cell types absent from training.

2.4.2 Cistrome Data Browser

You want to train a model predicting binding sites for a transcription factor implicated in your disease of interest. ENCODE has extensive data, but not for your factor. A literature search reveals that fifteen laboratories have published ChIP-seq experiments for this factor over the past decade, but every dataset uses different peak callers, different quality thresholds, and different normalization schemes. Comparing them directly would be comparing apples to oranges. How do you build a unified training set from this scattered evidence?

The Cistrome Data Browser solves exactly this problem by aggregating thousands of human and mouse ChIP-seq and chromatin accessibility datasets from ENCODE, Roadmap, GEO, and individual publications into a uniformly reprocessed repository (Zheng et al. 2019). All datasets pass through standardized quality control and peak calling, enabling comparisons across experiments originally generated with different protocols.

Cistrome provides uniform peak calls, signal tracks, and metadata for cell type, factor, and experimental conditions. The uniform reprocessing is critical because raw ChIP-seq data from different laboratories cannot be directly compared: different peak callers use different statistical thresholds, different normalization schemes produce different signal intensities, and different quality filters exclude different artifacts. By applying identical computational pipelines to all datasets, Cistrome makes experiments comparable even when they were not designed for cross-study analysis. The tradeoff is heterogeneity: while reprocessing harmonizes computational steps, the underlying experiments vary in sample preparation, antibody quality, sequencing depth, and experimental design. Cistrome expands coverage at the cost of the tight experimental control found in the primary consortia, a tradeoff that matters when models learn from noisy or inconsistent labels.

Predict Before Viewing

Consider the tradeoff between data quality and data coverage. If you needed transcription factor binding data for a rare cell type not in ENCODE, where would you look? What quality concerns might you need to address?

Table 2.4: Comparison of functional genomics data sources. Resources vary in the tradeoff between standardization and coverage.
| Resource | Scope | Quality Control | Key Tradeoff |
| --- | --- | --- | --- |
| ENCODE | Consortium-generated | Stringent, standardized | Limited cell type coverage |
| Roadmap Epigenomics | Primary tissues | Standardized protocols | Fewer factors profiled |
| Cistrome | Aggregated public data | Uniform reprocessing | Variable original quality |
| GEO | All submitted experiments | Minimal curation | Heterogeneous methods |

2.4.3 From Assays to Training Labels

Here is the central challenge for regulatory genomics: a ChIP-seq experiment tells you where a transcription factor binds in one cell type under one condition, but a machine learning model needs millions of labeled examples to learn DNA sequence patterns that predict binding in general. How do you transform scattered experimental measurements into the systematic training labels that deep learning requires? This conversion from assays to labels is where the biology meets the algorithm, and where subtle choices about data processing determine what models can and cannot learn.

Sequence-to-function models like DeepSEA (see Chapter 17) draw training labels from ENCODE, Roadmap, and Cistrome-style datasets: each genomic window is associated with binary or quantitative signals indicating transcription factor binding, histone modifications, or chromatin accessibility across hundreds of assays and cell types (Zhou and Troyanskaya 2015; Zhou et al. 2018).
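
A minimal sketch of that conversion, assuming peaks arrive as intervals on a single toy chromosome; production pipelines work from BED files genome-wide and may use quantitative signal rather than binary overlap.

```python
import numpy as np

WINDOW = 200            # label resolution used by DeepSEA-style models
CHROM_LEN = 1_000_000   # toy chromosome length

def binary_labels(peaks_by_assay, chrom_len=CHROM_LEN, window=WINDOW):
    """Build a (windows x assays) 0/1 matrix: window i is positive for
    assay j if any of assay j's peaks overlaps it."""
    labels = np.zeros((chrom_len // window, len(peaks_by_assay)), dtype=np.int8)
    for j, peaks in enumerate(peaks_by_assay):
        for start, end in peaks:
            labels[start // window : (end - 1) // window + 1, j] = 1
    return labels

peaks = [
    [(1_000, 1_450), (30_200, 30_900)],   # toy TF ChIP-seq peaks
    [(900, 2_100)],                        # toy accessibility peaks
]
y = binary_labels(peaks)
print(y.shape, y.sum(axis=0))   # (5000, 2) windows x assays, positives per assay
```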

The quality, coverage, and biases of these labels directly constrain what models can learn. Cell types absent from the training compendium cannot be predicted reliably. Factors with few high-quality ChIP-seq experiments will have noisier labels. Systematic differences between assay types (binary peak calls versus quantitative signal tracks) shape whether models learn to predict occupancy, accessibility, or something in between. These considerations become central when examining model architectures and training strategies in Chapter 17.

2.4.4 Deep Mutational Scanning and Multiplexed Variant Assays

Population variant catalogs tell us which variants survive in healthy individuals, but they cannot tell us what happens when a specific amino acid is changed to every possible alternative. Functional genomics experiments reveal where the genome is active, but they do not directly measure the consequence of each possible mutation. Deep mutational scanning (DMS) fills this gap by measuring the fitness or functional impact of thousands of protein or regulatory variants in a single experiment.

These assays systematically introduce mutations (often approaching saturation mutagenesis for a protein domain or regulatory element), subject the resulting library to selection or screening, and use sequencing to quantify the representation of each variant before and after selection. The result is dense, quantitative measurements of variant effects under controlled conditions. Benchmarks such as ProteinGym compile large DMS datasets across proteins to evaluate variant effect predictors. TraitGym curates multiplexed reporter assays and other high-throughput readouts of regulatory variant effects (Notin et al. 2023; Benegas, Eraslan, and Song 2025).
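
A sketch of the scoring step shared by many DMS analyses: the log-ratio of each variant's frequency after versus before selection, normalized to wild type. The counts below are invented, and real pipelines additionally model replicates and sequencing depth.

```python
import math

def dms_scores(pre, post, wt_pre, wt_post, pseudo=0.5):
    """Per-variant log2 enrichment relative to wild type; negative scores
    indicate variants depleted by selection (loss of function)."""
    wt_ratio = (wt_post + pseudo) / (wt_pre + pseudo)
    return {v: math.log2(((post.get(v, 0) + pseudo) / (pre[v] + pseudo)) / wt_ratio)
            for v in pre}

pre = {"M1V": 1_000, "A2S": 800}    # read counts before selection
post = {"M1V": 50, "A2S": 780}      # read counts after selection
print(dms_scores(pre, post, wt_pre=10_000, wt_post=10_500))
# M1V is strongly depleted (deleterious); A2S is roughly neutral.
```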

These resources sit at the interface between genomic and protein-level modeling. Where gnomAD and biobanks catalog sparse, naturally occurring variation, DMS datasets offer dense, quantitative functional measurements across systematic variant libraries that test most or all possible substitutions. DMS data differ fundamentally from population catalogs: they measure functional impact directly under controlled conditions rather than inferring it from population survival. Protein sequence models (Chapter 16) and regulatory variant predictors (Chapter 18) use these DMS-style datasets as key benchmarks and training sources.

Predict Before Viewing

Three different data sources can tell you about variant effects, but each measures something different. Before viewing the table, predict: Which source would best answer “Has this variant been seen in healthy people?” versus “Does this variant disrupt protein function in a lab assay?” versus “Has a clinical geneticist classified this variant as disease-causing?”

Table 2.5: Comparison of variant-level data sources. Each provides different evidence types with complementary strengths and limitations.
| Data Type | Measurement | Coverage | Key Limitation |
| --- | --- | --- | --- |
| Population catalogs (gnomAD) | Presence in healthy individuals | Natural variants only | Severe variants underrepresented |
| Deep mutational scanning | Direct functional impact | Saturation mutagenesis | One protein/element at a time |
| Clinical databases (ClinVar) | Expert pathogenicity assessment | Clinically observed variants | Ascertainment toward disease genes |

2.5 Expression and eQTL Resources

Functional genomics assays reveal where transcription factors bind and which chromatin regions are accessible, but they do not directly answer the downstream question: does regulatory activity actually change how much RNA a gene produces? A transcription factor may bind a genomic region without altering expression of nearby genes; an accessible chromatin region may not contain active regulatory elements. Regulatory binding and gene expression exist in a many-to-many relationship that cannot be resolved by either measurement alone. Expression datasets complete this link, measuring transcript abundance across tissues, cell types, and genetic backgrounds.

Connecting non-coding GWAS variants to their effector genes requires mechanistic hypotheses: some indication of which gene a regulatory variant actually regulates. Expression quantitative trait loci (eQTLs) provide exactly this connection, identifying genetic variants statistically associated with transcript-level changes. When a GWAS signal colocalizes with an eQTL for a nearby gene in disease-relevant tissue, that gene becomes a candidate effector. For model training, expression data provide quantitative labels that integrate across many regulatory inputs converging on a single promoter.

2.5.1 Bulk Expression Atlases

A GWAS identifies a locus associated with coronary artery disease in a non-coding region. Dozens of genes lie within the associated interval. Which one mediates the disease risk? If the lead variant also associates with expression of a nearby gene specifically in arterial endothelial cells, that gene becomes the prime candidate. Without tissue-specific expression data linked to genotypes, this inference is impossible.

The Genotype-Tissue Expression (GTEx) consortium provides the most comprehensive resource linking genetic variation to gene expression across human tissues, with RNA-seq profiles from 948 post-mortem donors across 54 tissues (The GTEx Consortium 2020). GTEx established foundational insights that inform regulatory genomics models: most genes harbor tissue-specific eQTLs, regulatory variants typically act in cis over distances of hundreds of kilobases, and expression variation explains a meaningful fraction of complex trait heritability.

GTEx underlies expression prediction models such as PrediXcan, which trains tissue-specific models to impute gene expression from genotypes alone (Gamazon et al. 2015). Transcriptome-wide association studies (TWAS) extend this idea to associate imputed expression with phenotypes (Gusev et al. 2016). Colocalization methods ask whether a GWAS signal and an eQTL share the same causal variant, providing evidence that the associated gene mediates the trait effect.
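
A sketch of the PrediXcan idea: a tissue-specific linear model maps cis-variant dosages to predicted expression, which can then be tested against a phenotype. The weights here are random placeholders; PrediXcan fits them with elastic net on GTEx genotype-expression pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_individuals, n_cis_variants = 500, 20

# Genotype dosages (0/1/2 copies of the alternate allele) at cis-variants.
dosages = rng.integers(0, 3, size=(n_individuals, n_cis_variants)).astype(float)
weights = rng.normal(0.0, 0.1, size=n_cis_variants)   # placeholder effect sizes

predicted_expression = dosages @ weights               # one value per individual

# TWAS step: associate imputed expression with a (simulated) phenotype.
phenotype = 0.3 * predicted_expression + rng.normal(size=n_individuals)
r = np.corrcoef(predicted_expression, phenotype)[0, 1]
print(f"imputed expression vs. phenotype: r = {r:.2f}")
```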

The GTEx design has limitations worth acknowledging. Post-mortem collection introduces agonal stress artifacts that may not reflect living tissue biology. Sample sizes vary considerably across tissues (hundreds for some, dozens for others), affecting statistical power. Some disease-relevant tissues, such as pancreatic islets or specific brain subregions, remain undersampled. Complementary resources like the eQTLGen Consortium aggregate eQTL results from blood across much larger sample sizes, trading tissue diversity for statistical power (Võsa et al. 2021).

2.5.2 Single-Cell and Context-Specific Expression

Bulk RNA-seq averages expression across all cells in a tissue sample, obscuring the cell-type-specific programs that often mediate disease biology. A bulk eQTL in brain tissue might reflect astrocytes, neurons, microglia, or oligodendrocytes; the causal cell type matters for understanding mechanism. This averaging creates a fundamental resolution problem: variants may have strong effects in rare cell populations that are diluted to undetectability when mixed with other cell types.

Single-cell RNA-seq resolves this heterogeneity, identifying expression signatures for individual cell types, rare populations, and transitional states. Large-scale efforts including the Human Cell Atlas and Tabula Sapiens are building reference atlases that catalog cell types across organs and developmental stages (Regev et al. 2017; The Tabula Sapiens Consortium 2022). For variant interpretation, single-cell data enable cell-type-specific eQTL mapping, revealing that a variant may influence expression in one cell type but not others within the same tissue. Spatial transcriptomics adds anatomical context, preserving tissue architecture while measuring gene expression.

These technologies introduce computational challenges: sparsity from dropout effects, batch variation across samples and technologies, and massive scale with millions of cells per study. They also offer an increasingly fine-grained view of the link between genotype, regulatory state, and cellular phenotype. Multi-omics integration (Chapter 23) and systems-level modeling draw heavily on single-cell and spatial resources.

Predict Before Viewing

If you wanted to know which cell type in the brain expresses a candidate gene for a neurological disorder, which technology would be most appropriate? What would you gain and lose compared to bulk RNA-seq?

Table 2.6: Expression profiling technologies and their tradeoffs. Higher resolution technologies introduce new computational challenges.
| Technology | Resolution | Scale | Key Challenge |
| --- | --- | --- | --- |
| Bulk RNA-seq | Tissue average | 100s of samples | Cell type confounding |
| Single-cell RNA-seq | Individual cells | Millions of cells | Sparsity, dropout |
| Spatial transcriptomics | Cells in tissue context | 1000s of spots | Lower gene coverage |

2.6 Protein Databases

A researcher developing a variant effect predictor needs to understand whether a missense mutation disrupts protein function. Sequence conservation across species provides one signal, but structural context adds another dimension: a mutation at an active site or protein-protein interface may be more disruptive than one in a flexible loop. Training models that use this structural intuition requires databases cataloging both protein sequences and their three-dimensional structures. These resources have grown from modest beginnings (a few hundred structures in the early PDB) to comprehensive atlases covering essentially all known protein sequences with either experimental or predicted structures.

Protein databases serve multiple roles in genomic deep learning. Sequence databases provide the training corpora for protein language models (Chapter 16). Structure databases supply the labels for structure prediction and the geometric constraints that inform structure-aware variant effect predictors. The intersection of sequence and structure enables models that learn evolutionary patterns from millions of sequences while grounding predictions in physical reality.

2.6.1 Sequence Databases

Protein language models learn the grammar of evolution by reading billions of protein sequences, but where do those sequences come from? The answer matters because the training corpus defines what the model can learn. A protein family absent from the database is invisible to the model. A protein family represented by thousands of diverse homologs will be understood deeply. The difference between a model that predicts variant effects accurately and one that fails silently often traces back to whether the relevant protein families were well-represented in training.

A protein language model learns patterns from the evolutionary record encoded in sequence databases. The depth and diversity of these databases directly constrain what patterns can be learned: a model trained on bacterial sequences alone will miss eukaryotic-specific motifs, while one trained only on well-characterized model organisms will underrepresent the functional diversity of environmental samples.

UniProt provides the foundational sequence resource, integrating manually curated entries (Swiss-Prot) with computationally annotated sequences (TrEMBL) into a comprehensive protein knowledgebase (The UniProt Consortium 2023). The UniRef clusters organize these sequences at different identity thresholds (UniRef100, UniRef90, UniRef50), enabling efficient sampling strategies for model training that balance coverage against redundancy (Suzek et al. 2007). UniRef50, which clusters sequences at 50% identity, reduces the database size substantially while preserving sequence diversity, making it practical for training large models.
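One common redundancy-control strategy is to sample training sequences per cluster rather than per sequence. Below is a minimal sketch, assuming a simplified two-column TSV of (cluster_id, sequence_id) membership pairs rather than UniRef's actual release format:

```python
import random
from collections import defaultdict

def sample_per_cluster(cluster_tsv, seed=0):
    """Pick one member sequence ID per UniRef50 cluster.

    Expects a two-column TSV of (cluster_id, sequence_id), a
    simplification of UniRef's real cluster membership files.
    """
    clusters = defaultdict(list)
    with open(cluster_tsv) as fh:
        for line in fh:
            if not line.strip():
                continue
            cluster_id, seq_id = line.rstrip("\n").split("\t")
            clusters[cluster_id].append(seq_id)

    rng = random.Random(seed)
    # Uniform sampling over clusters (not sequences) downweights huge,
    # redundant families and upweights rare ones
    return [rng.choice(members) for members in clusters.values()]

train_ids = sample_per_cluster("uniref50_members.tsv")  # placeholder file
```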

The Big Fantastic Database (BFD) extends beyond curated sequences to include metagenomic data, capturing protein diversity from environmental samples that have never been cultured in laboratories (Steinegger, Mirdita, and Söding 2019). BFD contains over 2.5 billion protein sequences, an order of magnitude larger than UniProt, representing the vast majority of known protein sequence space. This scale proved critical for training models like ESM-2, where exposure to diverse evolutionary patterns during pretraining improved downstream performance on variant effect prediction and other tasks. However, metagenomic sequences carry higher annotation uncertainty and may include fragments, chimeras, and other artifacts that curated databases exclude.

Stop and Think: Database Tradeoffs

Consider training a protein language model. You have access to UniProt (curated, ~250 million sequences) and BFD (metagenomic, ~2.5 billion sequences). What are the arguments for using each? What might you lose by using only one?

UniProt advantages: higher quality annotations, fewer artifacts, established functional knowledge. BFD advantages: 10x more sequences, broader evolutionary coverage, rare protein families. Using only UniProt might miss patterns in understudied protein families; using only BFD might introduce noise from low-quality sequences. Modern approaches often combine both, using BFD for pretraining diversity and UniProt for downstream task fine-tuning.

2.6.2 Structure Databases

When you need to understand why a missense variant disrupts protein function, sequence alone often cannot answer the question. Is the affected residue buried in the hydrophobic core where any substitution would destabilize folding? Does it sit at a protein-protein interface where the mutation would disrupt binding? Is it part of the catalytic site where even conservative changes eliminate activity? Answering these questions requires knowing the three-dimensional structure, and structure databases provide this critical context for variant interpretation.

Experimental protein structures anchor computational predictions in physical reality. The Protein Data Bank (PDB) archives structures determined by X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy, providing the definitive reference for protein three-dimensional organization (Berman et al. 2000). As of 2024, the PDB contains over 220,000 structures, though coverage remains uneven: well-studied proteins in model organisms are heavily represented, while the structures of most human proteins remain experimentally undetermined.

The AlphaFold Protein Structure Database transformed this landscape by providing predicted structures for essentially all proteins in UniProt (Varadi et al. 2022). These predictions, generated by AlphaFold2 (Jumper et al. 2021), achieve accuracy competitive with experimental determination for well-folded domains. The database democratized structural biology, enabling researchers to access structural hypotheses for any protein of interest without experimental effort. For variant interpretation, AlphaFold structures provide context that was previously available only for the small fraction of proteins with experimental structures.

However, predicted structures carry important caveats. AlphaFold provides confidence scores (pLDDT) that indicate prediction reliability, with disordered regions and novel folds receiving lower scores. Models trained on AlphaFold structures inherit both the power of comprehensive coverage and the uncertainty of computational prediction. The distinction between experimental and predicted structures matters when using structural features for clinical variant interpretation, where the evidentiary standards are higher than for research applications. Structure-aware variant effect prediction is discussed in Section 18.2, while the role of protein language models in genomic foundation models appears in Chapter 16.
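Because AlphaFold stores pLDDT in the B-factor column of its deposited coordinate files, per-residue confidence can be read with standard structure parsers. A minimal sketch using Biopython follows; the file name is a placeholder, and the cutoff of 70 is AlphaFold's conventional "confident" threshold:

```python
from Bio.PDB import PDBParser

def residue_plddt(pdb_path, chain_id="A"):
    """Per-residue pLDDT from an AlphaFold model.

    AlphaFold deposits pLDDT in the B-factor column, so the CA
    atom's B-factor serves as the residue confidence score.
    """
    structure = PDBParser(QUIET=True).get_structure("model", pdb_path)
    chain = structure[0][chain_id]
    return {res.get_id()[1]: res["CA"].get_bfactor()
            for res in chain if "CA" in res}

# Keep only confidently predicted residues when deriving structural features
scores = residue_plddt("AF-P12345-F1-model_v4.pdb")  # placeholder file
confident = {pos for pos, plddt in scores.items() if plddt >= 70}
```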

2.7 Phenotype Definition and Data Quality

Every model in genomics learns from labels, but phenotype labels carry their own biases distinct from variant annotations or functional genomics measurements. A GWAS for type 2 diabetes depends entirely on how diabetes is defined: by self-report, ICD-10 codes, hemoglobin A1c thresholds, medication records, or clinical adjudication. Each definition captures a different slice of the underlying biology. Self-report misses undiagnosed cases. ICD codes reflect billing practices as much as clinical reality. Laboratory thresholds impose sharp boundaries on continuous metabolic dysregulation. The “same” phenotype defined differently yields different genetic architectures, different effect sizes, and different polygenic score performance.

This sensitivity to phenotype definition compounds as biobanks scale. UK Biobank’s 500,000 participants enable discovery at unprecedented statistical power, but that power is limited by the precision of the phenotypes being tested. A GWAS with millions of participants but noisy case definitions may have less effective power than a smaller study with carefully adjudicated outcomes. The trade-off between sample size and phenotype quality pervades modern statistical genetics, and understanding its contours is essential for interpreting what models trained on biobank data actually learn.

Phenotype quality issues create systematic confounding in GWAS (Section 3.8.3) and clinical risk prediction models (Section 28.4). Deep phenotyping approaches that extract richer representations from EHR data are examined in Section 3.8.4 for GWAS contexts and Section 28.3 for clinical deployment.

2.7.1 The Problem of Binary Disease Definitions

Most GWAS treat disease as binary: case or control, affected or unaffected. This simplification enables standard statistical machinery but discards information about disease severity, age of onset, trajectory, and subtype. Two patients both labeled “coronary artery disease” may differ in clinically meaningful ways: one experienced an acute myocardial infarction at age 45, the other underwent elective stenting for stable angina at 72. Collapsing this heterogeneity into a single binary label forces genetic analyses to identify variants associated with an artificial composite rather than biologically coherent disease entities.

The consequences extend beyond reduced statistical power. Phenotype heterogeneity can induce genetic heterogeneity, where different genetic variants predispose to different subtypes that have been artificially combined. A GWAS for “depression” that includes melancholic depression, atypical depression, and adjustment disorders will identify variants associated with the mixture rather than any specific syndrome. The resulting polygenic scores predict the mixture, potentially missing stronger associations with homogeneous subtypes and providing weaker stratification than would be achievable with cleaner phenotype definitions.

Clinical endpoints also differ in their proximity to genetic effects. Biomarkers such as LDL cholesterol or blood pressure lie closer to gene function than clinical outcomes such as myocardial infarction or stroke, which require the biomarker dysregulation to persist, interact with environmental factors, and culminate in tissue damage. Genetic effects are typically larger and more readily detected for intermediate phenotypes than for distal clinical outcomes. This motivates strategies that analyze biomarkers as outcomes in their own right, then connect genetic effects on biomarkers to disease risk through Mendelian randomization or mediation analysis.
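The simplest instrument for connecting the two is the Wald ratio, which divides the variant's effect on the outcome by its effect on the biomarker. A worked sketch with illustrative (not real) effect estimates:

```python
import numpy as np

# Illustrative per-allele effect estimates (placeholders, not real data):
beta_gx, se_gx = 0.30, 0.02   # variant -> LDL cholesterol (exposure)
beta_gy, se_gy = 0.045, 0.01  # variant -> log-odds of CAD (outcome)

# Wald ratio: causal effect of exposure on outcome implied by the variant
wald = beta_gy / beta_gx

# First-order delta-method standard error, assuming independent estimates
se_wald = np.sqrt(se_gy**2 / beta_gx**2 + beta_gy**2 * se_gx**2 / beta_gx**4)

print(f"Wald ratio = {wald:.3f} (SE {se_wald:.3f})")
```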

2.7.2 Electronic Health Record Quality and Completeness

Electronic health records promise comprehensive phenotyping at scale: every diagnosis, procedure, medication, and laboratory result captured in structured or semi-structured form. In practice, EHR data are messy, incomplete, and shaped by processes far removed from biology. A diagnosis code reflects not just what the patient has but what the clinician chose to document, what the billing system required, and what the coding specialist interpreted. The same clinical presentation may receive different codes depending on the setting, the clinician’s documentation habits, and institutional coding policies.

Missing data pervades EHR phenotyping. Laboratory values are measured when clinically indicated, not at random, creating informative missingness where the absence of a measurement conveys information about the patient’s health status. Patients who transfer between health systems appear to have incomplete histories. Conditions managed by specialists outside the health system may be entirely absent from the record. These gaps are not random but systematically related to patient characteristics, healthcare access, and disease severity in ways that can bias genetic analyses.

Temporal dynamics add further complexity. Disease onset rarely corresponds to diagnosis date; patients carry pathology for years before clinical recognition. Medication records indicate prescriptions but not adherence. Procedure dates capture interventions but not the progression of disease that motivated them. Time-to-event analyses must grapple with left truncation (patients entering observation after disease onset), interval censoring (disease status observed only at discrete timepoints), and the distinction between incident and prevalent cases that confounds cross-sectional analyses.

Practical Guidance: Working with EHR Phenotypes

When using EHR-derived phenotypes for genetic analysis:

  1. Specify inclusion/exclusion criteria precisely. Which ICD codes define cases? What time windows apply?
  2. Document missingness patterns. Is absence of a code evidence of absence, or missing data?
  3. Consider ascertainment bias. Who gets tested, screened, or referred? This shapes who appears as cases.
  4. Check for temporal artifacts. Did coding practices change during your study period?
  5. Validate against chart review. Even small validation studies (n=100) reveal systematic misclassification.
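A minimal sketch of the first two points above, using pandas on a hypothetical long-format diagnosis table; the column names and code lists are illustrative:

```python
import pandas as pd

# Hypothetical long-format diagnosis table: one row per coded event
dx = pd.DataFrame({
    "patient_id": [1, 1, 2, 3],
    "icd10":      ["E11.9", "I10", "E11.65", "E10.9"],
    "date":       pd.to_datetime(["2019-03-01", "2020-06-15",
                                  "2021-01-10", "2018-11-30"]),
})

# Inclusion: type 2 diabetes codes (E11.*); exclusion: type 1 (E10.*),
# within a fixed observation window. Every choice here is a
# phenotype-definition decision that should be documented.
window = (dx["date"] >= "2019-01-01") & (dx["date"] <= "2021-12-31")
t2d = dx.loc[window & dx["icd10"].str.startswith("E11"), "patient_id"]
t1d_excl = dx.loc[dx["icd10"].str.startswith("E10"), "patient_id"]

cases = set(t2d) - set(t1d_excl)
print(sorted(cases))  # -> [1, 2]
```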

2.7.3 Coding Inconsistencies and Label Noise

The International Classification of Diseases provides a standardized vocabulary, but standardized vocabulary does not guarantee standardized application. ICD-10 contains over 70,000 codes, and clinical coders must choose among them based on physician documentation that may be ambiguous, incomplete, or inconsistent. Studies comparing chart review to coded diagnoses find substantial discordance: some patients with clear clinical disease lack corresponding codes, while others have codes without supporting clinical evidence (Birman-Deych et al. 2005; Quan et al. 2005).

Code usage also evolves over time. The transition from ICD-9 to ICD-10 in the United States (October 2015) created discontinuities in phenotype definitions built on specific codes. Clinical practice changes alter what conditions are tested for, diagnosed, and coded. COVID-19’s emergence created entirely new codes and altered coding patterns for respiratory illness more broadly. These temporal discontinuities matter for genetic studies because they create apparent phenotype changes that have nothing to do with biology: a patient may appear to develop a new condition simply because the coding system changed, or disease prevalence may appear to increase because a screening program was introduced. Longitudinal analyses spanning coding transitions or practice changes must account for these artifacts or risk confusing temporal trends in coding with temporal trends in disease.

Label noise from coding errors propagates into every downstream analysis. A phenotype definition with 10% misclassification (5% false positives, 5% false negatives) substantially attenuates genetic effect sizes and reduces GWAS power. For rare diseases where cases are precious, false positives among controls matter less than false negatives among cases, which dilute the genetic signal. For common diseases where controls are presumed healthy, false negatives among controls (undiagnosed cases) similarly attenuate associations. The magnitude of this attenuation depends on disease prevalence, misclassification rates, and their correlation with genetic risk.
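A small simulation makes the attenuation tangible. The sketch below flips each case/control label with 5% probability (non-differential misclassification) and compares the odds ratio before and after; all parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, maf, true_or = 200_000, 0.3, 1.3

# Simulate allele dosages (0/1/2) and disease under a logistic model
g = rng.binomial(2, maf, n)
logit = np.log(0.05 / 0.95) + np.log(true_or) * g
y = rng.random(n) < 1 / (1 + np.exp(-logit))

def odds_ratio(g, y):
    # Per-allele OR from logistic regression would be cleaner; a
    # carrier-vs-noncarrier 2x2 table keeps the sketch dependency-free
    carrier = g > 0
    a, b = (y & carrier).sum(), (y & ~carrier).sum()
    c, d = (~y & carrier).sum(), (~y & ~carrier).sum()
    return (a * d) / (b * c)

# Flip each label with 5% probability in either direction
noisy = y ^ (rng.random(n) < 0.05)

print(f"clean OR: {odds_ratio(g, y):.3f}, noisy OR: {odds_ratio(g, noisy):.3f}")
```

Running this shows the observed odds ratio shrinking toward 1 under label noise, which is exactly the power loss the text describes.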

2.7.4 Deep Phenotyping Approaches

Recognition of these limitations has motivated deep phenotyping strategies that move beyond binary disease definitions. Quantitative phenotypes, when available, preserve information that binary thresholds discard. Rather than dichotomizing blood pressure into hypertensive versus normotensive, analyzing systolic and diastolic pressure as continuous traits captures the full distribution of genetic effects. Similarly, imaging-derived phenotypes (cardiac MRI measurements, brain volume, bone density) provide precise quantitative endpoints with higher heritability than clinical disease outcomes.

Phenotype refinement uses clinical features to identify more homogeneous subgroups. Clustering patients by age of onset, comorbidity patterns, or biomarker profiles can reveal subtypes with distinct genetic architectures. Type 2 diabetes, for instance, has been decomposed into clusters defined by age, BMI, insulin resistance, and beta-cell function, with different clusters showing different genetic associations and different disease trajectories (Ahlqvist et al. 2018). Such stratification requires sufficient clinical data to define subgroups, limiting its application to well-phenotyped cohorts.

A more radical approach abandons expert-specified phenotype criteria entirely. Instead of encoding clinical knowledge through hierarchical ontologies, embedding methods learn vector representations of clinical concepts from co-occurrence patterns in EHR data. Word2Vec models trained on ICD-10 code sequences position clinically related codes near each other in this learned space; codes that co-occur in patient records cluster together regardless of their position in the ICD ontology. Large language models can generate similar phenotype embeddings from textual descriptions, capturing semantic relationships encoded in clinical language.
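Here is a minimal sketch of this idea using gensim's Word2Vec on toy patient code sequences; real applications train on millions of patient histories and tune the hyperparameters shown:

```python
from gensim.models import Word2Vec

# Hypothetical per-patient sequences of ICD-10 codes, ordered by date
patient_codes = [
    ["E11.9", "I10", "E78.5", "I25.10"],   # metabolic/cardiovascular
    ["J45.909", "J30.9", "L20.9"],         # atopic cluster
    ["E11.9", "E78.5", "I10"],
]

# Skip-gram embeddings: codes that co-occur in patient histories
# end up near each other, regardless of ICD chapter structure
model = Word2Vec(sentences=patient_codes, vector_size=64,
                 window=5, min_count=1, sg=1, epochs=50)

# Nearest neighbors in the learned space (toy corpus, toy output)
print(model.wv.most_similar("E11.9", topn=2))
```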

These embeddings can serve as phenotypes themselves. Ruan and colleagues demonstrated that GWAS conducted on EHR-embedding dimensions identified heritable components of clinical phenotype structure, with genetic correlations revealing coherent trait clusters such as cardiovascular disease risk factors (Ruan et al. 2022). The embeddings capture phenotypic relationships that binary disease definitions obscure, potentially improving the power to detect genetic associations and the transferability of polygenic scores across related traits.

2.7.5 Impact on Downstream Modeling

Phenotype quality constraints propagate through every analysis built on biobank data, creating systematic confounding that affects both GWAS (Section 3.8.3) and clinical risk prediction models (Section 28.4). Polygenic scores trained on noisy phenotypes learn to predict the noise alongside the signal, potentially inheriting coding artifacts, temporal discontinuities, and population-specific documentation practices. Transfer learning from one biobank to another may fail not because the underlying genetic architecture differs but because the phenotype definitions differ in ways that alter what the model learned.

Foundation models face analogous challenges. A model that learns associations between genetic variants and EHR-derived phenotypes absorbs whatever systematic distortions those phenotypes contain. If a diagnosis is more likely to be coded in patients who receive specialist care, the model learns a genetic signature for healthcare access as much as for disease biology. If a biomarker is measured only in symptomatic patients, the model learns from a biased sample that may not represent the population distribution. Deep phenotyping approaches that extract richer representations from EHR data offer partial solutions, examined in Section 3.8.4 for GWAS contexts and Section 28.3 for clinical deployment.

These considerations motivate careful phenotype documentation in model development. Specifying exactly how a phenotype was defined, which codes or criteria were applied, what exclusions were made, and how temporal boundaries were established enables assessment of whether findings will generalize to settings with different definitions. The goal is not perfect phenotyping, which remains unattainable, but transparent phenotyping that allows downstream users to understand what the model actually learned and where its assumptions may break down.

2.8 Variant Interpretation Databases and Clinical Labels

A family receives whole-exome sequencing results for their child with developmental delay. The laboratory report lists 50 rare variants in genes associated with neurodevelopmental disorders. For each variant, the clinical team must answer: is this the cause? Allele frequencies tell us what variants survive in healthy populations, and functional genomics data reveal where the genome is biochemically active, but neither directly answers this question. That determination requires integrating multiple lines of evidence (family segregation, functional assays, computational predictions, phenotypic observations) into a structured framework that can be applied consistently.

Clinical variant interpretation databases aggregate these assessments from laboratories, expert panels, and research groups. These databases have become critical infrastructure for both clinical genomics and computational method development, providing labels that inform diagnostic decisions and serve as training data for machine learning models. Their labels carry biases and circularity that propagate through any analysis built on them, yet no viable alternative exists for large-scale model training and evaluation.

2.8.1 ClinVar and Clinical Assertions

A clinical laboratory sequences a patient with suspected hereditary cancer syndrome and identifies a missense variant in BRCA2. Before returning results, the laboratory searches ClinVar and finds that three other laboratories have evaluated this variant: two classified it as likely pathogenic, one as a variant of uncertain significance. How should this conflicting evidence inform the final report? ClinVar aggregates assertions of variant pathogenicity from clinical laboratories and researchers worldwide, making it the central clearinghouse for clinical variant interpretations (Landrum et al. 2018).

ClinVar provides standardized classifications following ACMG/AMP guidelines (pathogenic, likely pathogenic, benign, likely benign, variant of uncertain significance) that are central to diagnostic pipelines and to benchmarking variant effect predictors. It has become the de facto reference for variant pathogenicity labels, but its contents reflect systematic biases that affect any downstream use. These biases operate at multiple levels and warrant careful consideration.

Key Insight: The Circularity Problem

ClinVar serves simultaneously as (1) a source of training labels for variant effect predictors and (2) a benchmark for evaluating those same predictors. This dual role creates circularity: if ClinVar classifications increasingly incorporate computational scores, and those scores were trained on earlier ClinVar classifications, the benchmark becomes contaminated. A model may appear to perform well simply by reproducing the computational component of its training labels rather than learning independent biological signal.

Submission heterogeneity poses a fundamental challenge. Annotations come from diverse submitters, including diagnostic laboratories, research groups, expert panels, and database exports. Submitters apply varying evidentiary standards; some provide detailed supporting evidence while others offer only assertions. Conflicting interpretations are common, particularly for variants of uncertain significance.

Classifications evolve as evidence accumulates. A variant classified as VUS in 2018 may be reclassified as likely pathogenic by 2023 based on new functional studies or additional patient observations. ClinVar releases monthly snapshots rather than maintaining formal version control, so models trained on older releases may learn outdated classifications that have since been revised. Specifying the exact ClinVar release date is essential for reproducibility.

Ancestry and gene coverage biases create uneven representation. Variants in well-studied populations (particularly European ancestry) and well-characterized disease genes are heavily overrepresented. Variants from underrepresented populations are more likely to remain classified as VUS due to insufficient evidence. This creates feedback loops: predictive models perform better on European-ancestry variants because training data is richer, reinforcing the disparity (Landrum et al. 2018).

Clinical assertions in ClinVar become training labels for variant effect predictors like CADD (Section 4.3) and evaluation benchmarks for foundation model approaches (Chapter 18). The role of ClinVar in ACMG/AMP variant classification workflows is detailed in Section 29.2. Calibration of computational scores to ClinVar pathogenicity assertions is examined in Section 18.5.3, while systematic evaluation of ClinVar as a benchmark resource appears in Section 11.3.1.

Circularity with computational predictors represents a subtle but important concern. Clinical submissions increasingly incorporate computational scores like CADD, REVEL, and AlphaMissense as supporting evidence for pathogenicity classification. When these same ClinVar labels are then used to train or evaluate computational predictors, circularity emerges (Schubach et al. 2024). If a laboratory used a high CADD score as supporting evidence for classifying a variant as likely pathogenic, and that variant later appears as a positive label in ClinVar, models trained on ClinVar may partly learn to reproduce CADD itself rather than discovering independent signal. This circularity operates at two levels: evaluation circularity (when models are assessed on benchmarks influenced by the model’s own predictions) and training circularity (when features used in training derive from the same underlying information as the labels). Both forms inflate apparent performance without demonstrating genuine predictive power.
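One partial mitigation for evaluation circularity is a temporal split keyed to ClinVar releases: train on older classifications and benchmark only on variants first classified afterward. A sketch under the assumption of a preprocessed table with an illustrative first_release column:

```python
import pandas as pd

# Hypothetical variant table recording the ClinVar release in which
# each classification first appeared (file and columns are illustrative)
variants = pd.read_csv("clinvar_labels.csv", parse_dates=["first_release"])

# Train only on classifications that existed before the cutoff; evaluate
# on variants first classified afterward. This cannot remove training
# circularity (old labels may already embed computational scores), but it
# keeps the benchmark free of labels the model was trained on.
cutoff = pd.Timestamp("2021-01-01")
train = variants[variants["first_release"] < cutoff]
test = variants[variants["first_release"] >= cutoff]
```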

Variants of uncertain significance constitute the majority of rare variant classifications, reflecting genuinely limited evidence. These variants are both targets for predictive modeling (can computational methods resolve uncertainty?) and potential pitfalls (models trained only on confidently classified variants may not generalize to VUS with different characteristics).

Despite these limitations, ClinVar remains invaluable. The key is using it appropriately: recognizing biases when training models, accounting for version differences when comparing studies, stratifying performance by ancestry and gene coverage, and treating computational predictions as one line of evidence rather than definitive classifications.

Figure 2.4: The ClinVar variant interpretation landscape. (A) Distribution of clinical significance classifications, highlighting that variants of uncertain significance dominate the database. (B) Classification density across genes, showing that well-studied genes have orders of magnitude more classified variants than the long tail of rarely-studied genes. (C) Temporal evolution of classifications, illustrating how variant interpretations change as evidence accumulates. Together, these panels reveal why computational variant interpretation remains challenging: most variants lack confident classifications, coverage is highly uneven across genes, and ground truth labels are themselves unstable over time.

2.8.2 Complementary Clinical Databases

You search ClinVar for a variant in a rare disease gene and find nothing. Does this mean the variant is novel? Not necessarily. A paper published last month may have reported this exact variant in a patient with your phenotype, but the finding has not yet been submitted to ClinVar. Alternatively, a locus-specific database maintained by experts in this disease may have curated the variant years ago with detailed functional evidence that never made it to ClinVar. Knowing where else to look, and what each resource captures, can be the difference between a missed diagnosis and a solved case.

ClinVar’s open-access model and broad submission base make it the most widely used resource, but it is not the only source of clinical variant interpretations. The Human Gene Mutation Database (HGMD) maintains a curated collection of disease-causing mutations compiled from the published literature, with particular depth in rare Mendelian disorders (Stenson et al. 2017). HGMD’s professional version includes variants not yet publicly released, and its curation emphasizes literature-reported pathogenic variants rather than the full spectrum of classifications in ClinVar. The Leiden Open Variation Database (LOVD) takes a gene-centric approach, with individual databases maintained by gene experts who curate variants according to locus-specific knowledge (Fokkema et al. 2011). LOVD instances often capture variants and functional evidence specific to particular disease communities that may not appear in broader databases.

These resources complement ClinVar in important ways: HGMD provides literature-derived pathogenic variants that may precede ClinVar submissions, while LOVD captures expert knowledge from disease-specific research communities. For model development and benchmarking, awareness of these alternative sources matters because training exclusively on ClinVar may miss variants documented elsewhere, and apparent novel predictions may simply reflect incomplete training data rather than genuine generalization.

Predict Before Viewing

Different clinical databases serve different purposes. Before viewing the table, consider: If you needed the most up-to-date classification from expert panels for gene-disease validity, which resource would you consult? If you needed literature-reported pathogenic variants for a rare disease, where would you look?

Table 2.7: Clinical variant databases and their complementary roles. Training on any single database may miss variants documented elsewhere.

| Database | Content Focus | Access Model | Strengths |
|---|---|---|---|
| ClinVar | All classifications | Open access | Broad coverage, standardized format |
| HGMD | Literature-derived pathogenic | Subscription | Early capture of published variants |
| LOVD | Gene-specific curation | Open access | Deep expert knowledge per gene |
| ClinGen | Expert panel consensus | Open access | High-confidence curations |

2.8.3 ClinGen and Expert Curation

Two laboratories classify the same variant in SCN5A, a cardiac arrhythmia gene. One calls it pathogenic based on computational predictions and population frequency; the other calls it a variant of uncertain significance because those same features, applied by a general pipeline, miss nuances specific to ion channel variants. Who is right? When classifications conflict and clinical decisions hang in the balance, the field needs authoritative adjudication from experts who deeply understand both the gene biology and the accumulated evidence. This is the role that ClinGen fills.

Clinical laboratories submitting to ClinVar vary enormously in expertise and evidentiary standards. A submission from a general diagnostic laboratory applying ACMG guidelines to an unfamiliar gene may differ substantially from an assessment by researchers who have studied that gene for decades. The Clinical Genome Resource (ClinGen) addresses this heterogeneity by providing expert-curated assessments at multiple levels (Rehm et al. 2015).

ClinGen expert panels evaluate gene-disease validity (whether variation in a gene can cause a specific disease) and dosage sensitivity (whether haploinsufficiency or triplosensitivity leads to clinical phenotypes). These evaluations build on the catalog of Mendelian phenotypes maintained by OMIM, which provides curated gene-disease associations and clinical synopses (Amberger et al. 2015).

ClinGen also develops calibrated thresholds for computational predictors, specifying score intervals that justify different strengths of evidence (supporting, moderate, strong) for pathogenicity or benignity (Pejaver et al. 2022). The FDA has recognized these curations as valid scientific evidence for clinical validity. These calibrations directly inform how computational scores should be incorporated into variant classification workflows and are discussed further in Section 18.5.3 for score calibration to ACMG evidence levels and Section 18.4 for integration of multiple computational predictors.
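Operationally, such calibrations reduce to mapping a score into evidence intervals. Below is a minimal sketch with placeholder thresholds; the actual ClinGen intervals are predictor-specific and should be taken from the published calibrations (Pejaver et al. 2022):

```python
def evidence_strength(score, thresholds):
    """Map a computational score to an ACMG evidence strength.

    `thresholds` lists (lower_bound, strength) pairs in increasing
    order; scores below the first bound yield no evidence. The numbers
    used below are illustrative placeholders, not ClinGen's calibrated
    intervals, which differ by predictor.
    """
    label = "none"
    for bound, strength in thresholds:
        if score >= bound:
            label = strength
    return label

pathogenic_thresholds = [(0.60, "supporting"), (0.75, "moderate"),
                         (0.90, "strong")]
print(evidence_strength(0.82, pathogenic_thresholds))  # -> "moderate"
```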

2.8.4 Pharmacogenomics Resources

Most variant interpretation focuses on rare mutations that cause or predispose to disease. Pharmacogenomics presents a different paradigm: common polymorphisms that individually may have no disease consequences but profoundly influence how individuals respond to medications. These variants matter not because they cause disease but because they determine whether a drug will work, fail, or cause harm.

Implementing pharmacogenomics in clinical practice requires three capabilities: curating variant-drug associations from published literature, translating that evidence into actionable dosing guidelines, and automating the path from a patient’s VCF file to a clinical report. PharmGKB addresses the first need, cataloging over 800 genes, 700 drugs, and thousands of variant-drug-phenotype relationships with evidence levels (Whirl-Carrillo et al. 2012). CPIC translates this knowledge into standardized guidelines specifying how to adjust drug selection or dosing based on metabolizer phenotype (Relling et al. 2019). PharmCAT automates annotation, taking VCF files as input and producing CPIC-compliant reports (Sangkuhl et al. 2019). ClinPGx integrates all three into a unified framework spanning variant detection through clinical recommendation (Gong et al. 2025).

Star-Allele Nomenclature

Pharmacogenes use a specialized nomenclature where haplotypes (combinations of variants on the same chromosome) are designated by star alleles. The reference haplotype is *1, with variant haplotypes numbered sequentially (*2, *3, etc.) as they were discovered. Each star allele represents a specific combination of SNVs, indels, or structural variants that travel together.

For CYP2D6, over 150 star alleles have been defined. Some reduce enzyme function (*4, *5), others increase it through gene duplication (*1xN), and many have unknown functional consequences. A patient’s diplotype (the combination of maternal and paternal star alleles) determines their metabolizer phenotype: poor, intermediate, normal, or ultrarapid.

Star-allele calling requires phasing to determine which variants co-occur on the same chromosome, plus structural variant detection to identify gene deletions and duplications. Standard SNV-focused pipelines miss critical information, which is why specialized tools like PharmCAT exist.

The CYP2D6 gene exemplifies the complexity. This cytochrome P450 enzyme metabolizes approximately 25% of clinically used drugs, including codeine, tamoxifen, and many antidepressants (Nofziger et al. 2019). Patients with loss-of-function CYP2D6 variants cannot activate codeine to morphine, rendering the drug ineffective for pain relief; patients with gene duplications may convert codeine too efficiently, experiencing dangerous opioid toxicity from standard doses. The difference between these scenarios depends entirely on accurate star-allele diplotyping.
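A diplotype-to-phenotype lookup of this kind can be sketched with the CPIC-style activity score; the allele activity values and phenotype bins below are illustrative simplifications of the published CPIC tables, not authoritative assignments:

```python
# Illustrative star-allele activity values (a small subset; real
# assignments come from CPIC allele-function tables)
ACTIVITY = {"*1": 1.0, "*2": 1.0, "*4": 0.0, "*5": 0.0,
            "*10": 0.25, "*41": 0.5}

def metabolizer_phenotype(allele1, allele2, copies=(1, 1)):
    """CYP2D6 phenotype from a diplotype via an activity score.

    `copies` handles duplications such as *1x2. The bins follow the
    CPIC-style scheme (0 poor; <1.25 intermediate; <=2.25 normal;
    above that ultrarapid), but treat them as illustrative here.
    """
    score = ACTIVITY[allele1] * copies[0] + ACTIVITY[allele2] * copies[1]
    if score == 0:
        return "poor"
    if score < 1.25:
        return "intermediate"
    if score <= 2.25:
        return "normal"
    return "ultrarapid"

print(metabolizer_phenotype("*4", "*5"))          # no function -> "poor"
print(metabolizer_phenotype("*1", "*1", (2, 1)))  # *1x2/*1 -> "ultrarapid"
```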

From a modeling perspective, pharmacogenomic resources offer a complementary type of label linking variants to molecular and clinical outcomes through different mechanisms than Mendelian disease pathogenicity. Where ClinVar labels indicate whether a variant causes disease, PharmGKB labels indicate how a variant affects drug response in individuals who may be otherwise healthy.

2.9 Inherited Constraints

Every genomic model inherits both the power and the biases of its training data. A variant effect predictor trained on ClinVar labels absorbs the ascertainment patterns of clinical sequencing: European ancestry overrepresented, rare diseases enriched, incidental findings undersampled. A chromatin model trained on ENCODE immortalized cell lines learns regulatory patterns that may not generalize to primary tissues with different epigenetic landscapes. Models that estimate genetic constraint quantify how strongly purifying selection acts against damaging variants in each gene, comparing observed variant counts to expectations. But when trained on human population databases, these models systematically miss the most severe cases: gene-lethal variants never appear because carriers do not survive to be sequenced.

These biases compound as data flows through analysis pipelines. GWAS summary statistics carry ancestry composition forward into polygenic scores. Conservation scores calculated from biased multiple sequence alignments propagate into variant effect predictions. Foundation model pretraining on reference genomes from limited populations shapes the representations available for all downstream applications. Each transformation amplifies some biases while masking others, making the provenance of model behavior increasingly difficult to trace.

The critical question is not whether models trained on these data contain biases; they do. The question is whether those biases can be characterized, bounded, and ultimately corrected. These foundational datasets appear throughout genomic AI as training labels, evaluation benchmarks, and sometimes both simultaneously. Recognizing when the same data sources serve multiple roles is essential for interpreting model performance honestly and anticipating where generalization will fail. Part III examines these challenges in depth: data partitioning strategies that account for shared ancestry and homology (Section 12.2), population structure effects that confound genetic associations (Section 13.2.1), and ascertainment patterns that create circularity in clinical labels (Section 13.2.4).

Chapter Summary
Test Yourself

Before reviewing the summary, test your recall:

  1. What are the four main layers of the genomic data ecosystem, and what type of evidence does each provide?
  2. Why does alternative splicing create challenges for variant annotation, and what percentage of human multi-exon genes undergo alternative splicing?
  3. How do systematic ancestry biases in population databases like gnomAD affect downstream models trained on these data?
  4. What is the circularity problem between computational variant predictors and clinical databases like ClinVar?
  5. Why do GWAS summary statistics represent a compromise between scientific value and data sharing restrictions?
Answers:

  1. Four layers of the genomic data ecosystem:

    • Reference assemblies and gene annotations: coordinate foundation and biological vocabulary
    • Population catalogs and biobanks: variant frequencies and phenotype associations
    • Functional genomics consortia: biochemical activity across cell types
    • Clinical databases: pathogenicity interpretations
  2. Alternative splicing challenges: A variant may be benign in the canonical transcript but pathogenic in a tissue-specific isoform. Models trained only on canonical transcripts (like MANE Select) miss this complexity. Over 95% of human multi-exon genes undergo alternative splicing, producing an estimated 100,000+ distinct protein isoforms from ~20,000 genes.

  3. Ancestry bias effects: Models inherit whatever populations were represented in training data. A variant common in West African populations but absent from European-dominated catalogs would be incorrectly flagged as ultra-rare. Constraint metrics (pLI, LOEUF) are poorly calibrated for variants private to underrepresented populations. This propagates through variant prioritization, deleteriousness prediction, and clinical interpretation.

  4. Circularity problem: ClinVar serves simultaneously as training data for computational predictors and as the benchmark for evaluating those same predictors. Clinical laboratories increasingly incorporate computational scores (CADD, REVEL, AlphaMissense) as supporting evidence for pathogenicity classification. When these ClinVar labels then train or evaluate those same predictors, models may learn to reproduce their own predictions rather than discovering independent biological signal.

  5. GWAS summary statistics compromise: Individual-level genotype and phenotype data are powerful but sensitive, requiring complex data use agreements, IRB approvals, and secure computing infrastructure. Summary statistics (per-variant effect sizes, standard errors, p-values) capture the essential association signal without revealing individual genotypes, enabling meta-analysis and data sharing while protecting participant privacy.

Key Concepts Covered:

  • Reference genomes and gene annotations provide the coordinate system for all genomic analysis; choices embedded in these resources propagate through downstream models
  • Population variant catalogs (dbSNP, 1000 Genomes, gnomAD) establish frequency baselines and constraint metrics; ancestry representation biases affect all models using these data
  • Biobanks link genotypes to phenotypes at scale, enabling GWAS and polygenic score development; European ancestry overrepresentation limits global applicability
  • Functional genomics datasets (ENCODE, Roadmap, GTEx) provide training labels for regulatory models; cell type coverage determines what models can learn
  • Clinical databases (ClinVar, ClinGen) aggregate pathogenicity classifications; circularity between computational predictors and clinical labels complicates benchmarking
  • Phenotype quality varies dramatically based on definition, EHR coding practices, and ascertainment; noisy labels attenuate genetic signals

Critical Literacy Framework:

When evaluating any genomic ML model, ask:

  1. What data was it trained on, and what populations/cell types are represented?
  2. What are the known biases and blind spots of those training resources?
  3. Does the evaluation benchmark share data sources with training (circularity risk)?
  4. Will the model’s training population match the deployment population?

Looking Ahead:

The data landscape described here flows directly into Chapter 3, where we examine how GWAS transforms biobank data into genetic associations. The biases introduced by ancestry representation, phenotype quality, and population structure become concrete statistical challenges addressed throughout Part 2 (Learning & Evaluation) and Part 5 (Responsible Deployment).