13 Confounding and Data Leakage
Models learn shortcuts. Shortcuts work until they do not.
Estimated reading time: 40-50 minutes
Prerequisites: This chapter assumes familiarity with basic machine learning evaluation concepts (training/test splits, performance metrics) from Chapter 12, population structure concepts from Section 3.1.4, and variant annotation databases from Section 2.8. Readers should also understand the distinction between correlation and causation.
Learning Objectives: After completing this chapter, you should be able to:
- Distinguish between confounding, bias, and data leakage using precise definitions
- Identify the major sources of confounding in genomic datasets (ancestry, batch effects, label bias, temporal drift)
- Explain why population structure creates shortcuts that foundation models readily exploit
- Design appropriate data splitting strategies for different evaluation goals
- Apply diagnostic methods to detect confounding in model predictions
- Select and implement mitigation strategies appropriate to specific confounding sources
Key Insight: A model can achieve excellent benchmark performance while learning nothing about biology. The same expressiveness that allows foundation models to capture complex biological patterns also allows them to discover subtle confounders invisible to simpler diagnostics. Rigorous evaluation design is not optional; it is the only way to distinguish genuine learning from sophisticated shortcut exploitation.
A variant effect predictor trained on ClinVar achieves 0.92 auROC on held-out variants from the same database, yet performance drops to 0.71 when evaluated on a prospectively collected clinical cohort. A polygenic risk score for coronary artery disease stratifies European-ancestry individuals with impressive discrimination, then fails almost completely when applied to individuals of African ancestry. A gene expression model trained on GTEx data predicts tissue-specific patterns with apparent precision, until deployment reveals it learned to distinguish sequencing centers rather than biological states. Each model worked brilliantly in evaluation and failed quietly in practice.
These failures share a common cause: the models learned shortcuts rather than biology. Genomic datasets encode hidden structure from ancestry and family relatedness to sequencing center, capture kit, and label curation protocol. These factors correlate with both features and labels. When such confounders remain uncontrolled, models exploit them. The central challenge is that confounded models can appear to work, sometimes spectacularly well, until they encounter data where the shortcuts no longer apply.
This problem is not unique to deep learning. Linear regression and logistic models suffer from the same biases when fit on confounded data. What makes confounding particularly dangerous in the foundation model era is scale: larger datasets and more expressive architectures make it easier to discover subtle shortcuts that remain invisible in standard diagnostics but cause dramatic failures when distributions shift at deployment. A shallow model might miss the correlation between sequencing center and disease status; a transformer with hundreds of millions of parameters will find it if that correlation helps optimize the training objective.
13.1 Confounding, Bias, and Leakage
The terminology of confounding, bias, and leakage describes distinct phenomena that often co-occur and reinforce each other. Precision in language helps clarify what has gone wrong when a model fails.
A confounder is a variable that influences both the input features and the label. Ancestry provides a canonical example: it affects allele frequencies across the genome (the features) and disease risk through environmental, socioeconomic, and healthcare pathways (the labels). If ancestry is not explicitly modeled or controlled, a model trained to predict disease may learn to identify ancestry rather than disease biology. The prediction appears accurate because ancestry correlates with outcome, but the model has captured correlation rather than mechanism.
Bias refers to systematic deviation from the quantity we intend to estimate or predict. Bias can result from confounding, but also arises from measurement error, label definitions, sampling procedures, or deployment differences. A case-control study with 50% disease prevalence will train models that systematically over-predict risk when deployed in populations where true prevalence is 5%. The model may be perfectly calibrated for the training distribution yet dangerously miscalibrated for clinical use.
Data leakage occurs when information about the test set inadvertently influences model training or selection. Leakage pathways include overlapping individuals or variants between training and evaluation, shared family members across splits, duplicated samples under different identifiers, and indirect channels such as pretraining on resources that later serve as benchmarks. The circularity between computational predictors and ClinVar annotations discussed in Section 4.5 exemplifies this last category: CADD-like scores influence which variants receive pathogenic annotations, and those annotations then become training labels for the next generation of predictors.
Distribution shift describes mismatch between training and deployment data distributions. Three primary types arise in genomic applications:
- Covariate shift: the input distribution \(P(X)\) changes but the relationship \(P(Y|X)\) remains stable (e.g., deploying a model trained on European ancestry genotypes to African ancestry individuals)
- Prior shift (label shift): the outcome prevalence \(P(Y)\) changes (e.g., training on case-control studies with 50% cases but deploying in populations with 5% prevalence)
- Concept drift: the relationship \(P(Y|X)\) itself changes over time (e.g., diagnostic criteria evolving so the same clinical features map to different labels)
A model that learns hospital-specific coding patterns will fail when deployed at a different institution, not because the biology differs but because the label generation process does.
Before examining the table below, test your understanding: For each scenario, identify whether it represents confounding, bias, data leakage, or distribution shift:
- A model trained on variants from 2018-2020 ClinVar is tested on variants added to ClinVar in 2023
- A polygenic score calibrated on 50% case prevalence is deployed in a population with 5% prevalence
- Ancestry affects both allele frequencies and disease risk through healthcare access pathways
- The same individual’s genome appears in both training and test sets under different identifiers
Check your predictions against the table definitions below.
The following table clarifies the distinctions between these related but distinct concepts:
| Term | Definition | Example | Detection | Primary Solution |
|---|---|---|---|---|
| Confounding | Variable affects both features and labels | Ancestry affects genotype frequencies and disease risk | Confounder-only baselines match model performance | Matching, adjustment, or invariance learning |
| Bias | Systematic deviation from target | Training at 50% prevalence, deploying at 5% | Calibration analysis across settings | Design matching, recalibration |
| Data leakage | Test information influences training | Same variant in train and test sets | Performance collapse under strict splits | Rigorous deduplication, temporal splits |
| Distribution shift | Train/deploy distributions differ | Model trained on one hospital, deployed at another | Performance degradation on new cohorts | Domain adaptation, multi-site training |
The terminology and conceptual framework for understanding heterogeneity in genetic associations derives from classical statistical genetics, including Laird and Lange’s Fundamentals of Modern Statistical Genetics (Laird and Lange 2011) and Gordon’s Heterogeneity in Statistical Genetics (Gordon, Finch, and Kim 2020). Three forms of heterogeneity documented in genetics can complicate foundation model evaluation:
- Locus heterogeneity: Different genes cause the same phenotype in different families
- Allelic heterogeneity: Different variants in the same gene cause the same phenotype
- Population heterogeneity: Effect sizes vary across ancestry groups
Each form creates opportunities for shortcut learning. A model that achieves high population-level performance while learning which gene families associate with disease (rather than extracting variant-level regulatory signals) may be exploiting locus heterogeneity as a shortcut. Whether foundation models actually exploit these heterogeneity patterns requires empirical investigation specific to genomic architectures.
The statistical methods for detecting and addressing heterogeneity in GWAS offer foundations that extend naturally to foundation model evaluation, especially for ancestry-stratified performance analysis.
The critical question for any genomic model is: Does the association between features and labels flow through the biological mechanism I care about, or through a confounding pathway? A model predicting variant pathogenicity from sequence might learn that certain haplotype backgrounds correlate with pathogenic labels, but if that correlation exists because of ancestry-biased ascertainment rather than biological causation, the model has learned a shortcut that will fail when applied to differently ascertained populations.
Causal diagrams (DAGs) provide rigorous notation for reasoning about confounding. Key structures:
Fork (Common Cause):
X ← Z → Y
\(Z\) confounds the \(X \to Y\) relationship. Example: Ancestry (\(Z\)) affects both variant frequency (\(X\)) and disease prevalence (\(Y\)), creating spurious variant-disease association.
Chain (Mediation):
X → M → Y
\(M\) mediates the effect of \(X\) on \(Y\). Controlling for \(M\) blocks the causal path. Example: Variant (\(X\)) affects protein function (\(M\)) which affects phenotype (\(Y\)).
Collider (Selection Bias):
X → Z ← Y
Conditioning on collider \(Z\) induces association between \(X\) and \(Y\) even if none exists. Example: Selecting variants that are either pathogenic (\(X\)) or frequently studied (\(Y\)) into ClinVar (\(Z\)) creates spurious correlation.
d-Separation. Variables \(X\) and \(Y\) are d-separated by set \(Z\) (and thus conditionally independent given \(Z\)) if every path between them is blocked. A path is blocked if it:
- Contains a chain or fork whose middle node is in \(Z\), OR
- Contains a collider such that neither the collider node nor any of its descendants is in \(Z\)
These graphical criteria formalize when adjustment for a variable eliminates confounding versus when it introduces new bias.
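The collider case is the least intuitive of the three, so a small simulation helps. The sketch below uses purely synthetic data and hypothetical variable names: it generates independent indicators for "truly pathogenic" and "in a well-studied gene," admits a variant into a database when either holds, and shows that an association between the two appears only after conditioning on database membership.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two independent causes of database inclusion (synthetic illustration).
pathogenic = (rng.random(n) < 0.05).astype(float)     # X: variant is truly pathogenic
well_studied = (rng.random(n) < 0.10).astype(float)   # Y: variant sits in a well-studied gene

# Collider: a variant is curated into the database if either condition holds.
in_database = (pathogenic + well_studied) > 0          # Z in the structure X -> Z <- Y

corr_all = np.corrcoef(pathogenic, well_studied)[0, 1]
corr_given_z = np.corrcoef(pathogenic[in_database], well_studied[in_database])[0, 1]

print(f"correlation in the full population: {corr_all:+.3f}")    # approximately zero
print(f"correlation among curated variants: {corr_given_z:+.3f}")  # clearly negative
```

Restricting analysis to curated variants (conditioning on the collider) manufactures a negative association between pathogenicity and study intensity that does not exist in the full population, which is exactly the selection bias the diagram describes.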
For foundation models, these risks are magnified. Genomes encode ancestry, relatedness, and assay conditions in thousands of subtle features, even when those labels are never explicitly provided. Large transformers find shortcuts that smaller models would miss if those shortcuts improve the training objective. Complex training regimes involving pretraining on biobank-scale data, fine-tuning on curated labels, and evaluation on community benchmarks create many opportunities for direct and indirect leakage.
13.2 Sources of Confounding in Genomic Data
Confounders in genomic modeling cluster into several categories, though the same underlying variable (such as recruitment site) may simultaneously induce ancestry differences, batch effects, and label bias. These categories are not mutually exclusive; batch effects in single-cell data (Section 20.6.1) and multi-omic integration (Section 23.7.1) represent domain-specific manifestations of the same underlying challenge.
Before reading about specific confounding sources, consider: if you were designing a study to train a variant pathogenicity predictor, what variables might affect both the variants you observe (your features) and the pathogenicity labels you collect (your outcomes)? List three potential confounders and how they might create spurious associations.
13.2.1 Population Structure and Relatedness
Ancestry creates perhaps the most pervasive confounders. Continental and sub-continental population structure affects both genomic features and many phenotypes of interest, creating classic confounding. The portability failures of polygenic scores across ancestry groups (Section 3.7) represent one clinically consequential manifestation of this confounding. Family relationships (siblings, parent-offspring pairs, cryptic relatedness detectable only through genotype similarity) and founder effects that create local haplotype structure compound these issues. Relatedness creates a more subtle problem than population stratification: when close relatives appear in both training and test sets, models can memorize shared haplotype segments rather than learning generalizable patterns, producing inflated performance estimates that collapse for unrelated individuals.
13.2.2 Technical Batch Effects
Sequencing and analysis pipelines introduce their own systematic differences. Different instruments produce distinct error profiles. Library preparation protocols vary in GC bias, coverage uniformity, and adapter content. Capture kits determine which genomic regions receive adequate coverage. Alignment algorithms and variant callers make different decisions at ambiguous positions. When samples from a particular batch disproportionately represent a specific label class (cases sequenced at one center, controls at another), models learn to distinguish batches rather than biology.
13.2.3 Institutional and Recruitment Confounding
The institutions where patients receive care introduce additional confounding layers. Hospital systems use distinct coding practices, diagnostic thresholds, and follow-up schedules. The phenotype quality issues that result are examined in Section 2.7, with implications for how models learn from systematically biased labels. Population-based biobanks differ from referral-center cohorts in disease severity, comorbidity patterns, and demographic composition. Individuals who receive genomic testing may be more severely affected, more affluent, or preferentially drawn from particular ancestry groups, introducing selection bias that distorts apparent variant-phenotype relationships.
These sources of confounding trace back to data collection and curation processes. Training data inherit the biases present in the databases from which they derive: ClinVar’s overrepresentation of European ancestry variants (Section 2.8.1), gnomAD’s population composition (Section 2.2.3), and the tissue coverage decisions of consortia like ENCODE and GTEx (Section 2.4.1). Understanding data provenance is prerequisite to anticipating which confounders a model may have learned.
13.2.4 Label Generation Bias
The process of generating ground truth annotations itself creates biases. Clinical labels derived from billing codes or problem lists reflect documentation practices as much as underlying disease. Variant pathogenicity databases exhibit the systematic biases detailed in Section 2.8: ClinVar annotations over-represent European ancestry, well-studied genes, and variants submitted by high-volume clinical laboratories (Landrum et al. 2018). Expression, regulatory, or splicing labels derived from specific tissues or cell lines may not generalize to other biological contexts. The circularity problem identified in Section 4.5 persists into the foundation model era: when model predictions influence which variants receive expert review, and expert classifications become training labels, feedback loops amplify historical biases.
13.2.5 Temporal Drift
Clinical practice, diagnostic criteria, and coding conventions evolve over time. Sequencing technologies and quality control pipelines also change. A model trained on 2015 data may fail on 2024 data not because biology changed but because documentation practices, coding standards, and available treatments all evolved. This temporal drift affects both the features models learn and the labels they predict.
13.2.6 Resource Overlap and Indirect Leakage
Even the resources used for training and evaluation create leakage pathways. When databases like gnomAD or UK Biobank appear in both model training and evaluation, indirect information flows compromise apparent generalization. A foundation model pretrained on gnomAD allele frequencies, then evaluated on a benchmark that uses gnomAD for population filtering, faces indirect leakage even if specific variants do not overlap. Community benchmarks that reuse widely available variant sets across multiple publications create additional leakage pathways that accumulate over time as the field iterates.
The following table summarizes the major confounding sources, their mechanisms, and detection approaches:
| Source | Affects Features Via | Affects Labels Via | Detection Signal | Mitigation Approach |
|---|---|---|---|---|
| Ancestry | Allele frequencies, haplotypes, LD patterns | Healthcare access, environmental exposure | Performance stratified by ancestry; PCA-only baseline | Matching, PCs as covariates, invariance training |
| Relatedness | Shared haplotype segments | Shared environmental factors, ascertainment | Kinship matrix analysis; family-aware split sensitivity | Family-aware splitting |
| Batch effects | Coverage, error profiles, variant calling | Case/control imbalance across batches | Batch predicts phenotype; embedding clusters by batch | Batch covariates, harmonization, domain adaptation |
| Institution | Sequencing protocols, capture kits | Coding practices, diagnostic criteria | Performance varies by site | Multi-site training, cohort holdouts |
| Label generation | Features used in curation decisions | Circular dependency with prior predictions | Ablating predictive features degrades performance | Temporal splits, independent validation |
| Temporal drift | Technology evolution | Practice guideline changes | Performance degrades on newer data | Time-based splits, continuous monitoring |
13.3 Population Structure as a Shortcut
Population structure represents one of the most pervasive confounders in genomic modeling. The core issue is that ancestry simultaneously affects genomic features and many phenotypes through pathways that have nothing to do with direct genetic causation.
Human genetic variation is structured by ancestry: allele frequencies, haplotype blocks, and linkage disequilibrium patterns differ across populations in ways that reflect demographic history. Principal components computed from genome-wide genotypes provide a low-dimensional summary of this structure and have become standard in genome-wide association studies (GWAS) to correct for stratification (Patterson, Price, and Reich 2006; Price et al. 2006). Yet ancestry is not merely a statistical nuisance. It is intertwined with geography, environment, socioeconomic status, and access to healthcare, factors that directly impact disease risk, likelihood of receiving genetic testing, and the quality of phenotyping when testing occurs.
The statistical genetics community developed these corrections precisely because early genome-wide association studies produced spurious signals driven by ancestry differences between cases and controls rather than causal variant effects (see Section 3.1.4 for detailed treatment of population stratification in association testing). Foundation models face the same fundamental problem in a different guise: ancestry structure that confounded linear regression in GWAS now confounds neural network predictions, and the solutions require similar conceptual foundations even when the technical implementations differ.
Gordon’s heterogeneity framework (Gordon, Finch, and Kim 2020) offers additional perspective: when effect sizes differ across populations (G×E or G×ancestry interactions), no single model can simultaneously achieve optimal prediction in all groups. This fundamental limitation affects foundation models as much as linear PRS, though the mechanisms of failure differ. Foundation models may achieve apparent high performance by weighting populations differently, masking poor calibration in underrepresented groups behind strong discrimination in majority populations.
Consider a rare disease clinic serving primarily individuals of European ancestry. This clinic contributes most pathogenic variant submissions to ClinVar, while variants observed predominantly in other ancestries remain classified as variants of uncertain significance (Landrum et al. 2018). A model trained on ClinVar may learn that European-enriched variants tend to have pathogenic labels and non-European-enriched variants tend to have uncertain or benign labels, not because of any biological difference in pathogenicity but because of differential clinical characterization. The model appears to predict pathogenicity while actually predicting ancestry-correlated ascertainment.
Foundation models trained on nucleotide sequences see ancestry information directly: the distribution of k-mers and haplotypes differs by population. When such models are fine-tuned to predict disease risk or variant effects, they may leverage ancestry as a shortcut. Increasing model capacity does not solve this problem; it often makes it worse by enabling detection of increasingly subtle ancestry-linked features. The polygenic score portability literature provides stark evidence: risk scores derived from European ancestry cohorts show 40-75% reductions in prediction accuracy when applied to African ancestry individuals (Duncan et al. 2019). Similar patterns emerge for variant effect predictors and regulatory models, though they are often less thoroughly documented due to limited cross-ancestry evaluation.
A common misconception is that larger, more powerful models will “see through” confounding to the underlying biology. In reality, the opposite often occurs. A linear model might capture only the strongest ancestry-outcome correlations; a transformer with billions of parameters will find every ancestry-linked feature that improves the training objective, no matter how subtle. Model expressiveness is not a defense against confounding; it is an amplifier.
This mismatch between the populations used for model development and the populations that would benefit from genomic medicine creates a fundamental tension between current practice and equitable healthcare. Models that work primarily for European ancestry individuals perpetuate existing health disparities, regardless of their benchmark performance. The fairness implications are examined further in Section 13.10.
13.3.1 Addressing Ancestry Bias in Genomic Models
Several approaches have emerged to address ancestry confounding in genomic prediction. Amariuta et al. (2020) demonstrated that incorporating functional genomic annotations can improve PRS transferability across populations, with larger gains for traits with well-characterized regulatory mechanisms.
The importance of diverse representation extends beyond European-focused cohorts. Sohail et al. (2023) provides critical analysis of PRS performance in Latin American populations, revealing systematic biases that cannot be addressed by simple recalibration. The work highlights the need for foundation models to be evaluated explicitly on diverse populations rather than assuming that performance on European-ancestry cohorts will transfer.
A 2025 study provided the first systematic documentation of ancestry-stratified performance disparities across variant effect predictors. When evaluated separately within European, African, East Asian, and South Asian ancestry groups, all major VEP tools (CADD, REVEL, AlphaMissense) showed significantly degraded discrimination in non-European populations (Martin et al. 2025). The performance gap was not subtle: auROC differences ranged from 0.05 to 0.12 between European and African ancestry groups, with the gap widening further for rare variants that constitute the clinically actionable tier.
The mechanism is straightforward confounding: training data for VEP models derives predominantly from European-ancestry individuals. Biobanks oversample European populations (Section 2.3), and clinical genetic testing has historically served European-ancestry patients disproportionately. Models learn patterns specific to European haplotype backgrounds, which fail to generalize when LD structure, allele frequencies, and functional variant distributions differ in other populations. A variant that appears pathogenic because it occurs on a rare European haplotype may be benign when the same allele appears on a common African haplotype.
The clinical consequence follows directly: computational pathogenicity scores systematically overpredict pathogenicity for variants common in non-European populations, generating false positive flags that burden clinical interpretation pipelines and potentially delay diagnosis. The study established ancestry-stratified evaluation as a methodological requirement for any variant effect predictor claiming clinical utility, paralleling the fairness requirements for polygenic scores discussed in Section 3.7.2.
Before moving to technical artifacts, test your retention from the earlier sections:
- What is the difference between a confounder and data leakage? Give a concrete example of each.
- Why does ancestry act as a confounding variable in genomic prediction?
- What diagnostic would reveal that your model’s 0.85 auROC primarily reflects ancestry rather than biological mechanism?
A confounder is a variable that affects both features and labels (e.g., ancestry affects allele frequencies and disease risk through healthcare access), while data leakage occurs when test information influences training (e.g., the same variant appearing in both train and test sets). Both inflate performance, but leakage involves information that should not exist at prediction time, while confounding involves real but non-causal associations.
Ancestry affects genomic features through population-specific allele frequencies and haplotype structure, while simultaneously affecting disease labels through environmental factors, healthcare access, socioeconomic status, and clinical ascertainment practices, creating a spurious association pathway the model can exploit.
Train a confounder-only baseline using just ancestry principal components with no genomic features. If this baseline achieves performance close to your full model (e.g., 0.80 vs 0.85 auROC), ancestry confounding drives most of the signal.
13.4 Technical Artifacts as Biological Signal
Technical pipelines are complex, and each step from sample collection through final variant calls can introduce systematic differences that models may learn.
Sequencing centers differ in instruments, reagents, and quality control thresholds. Library preparation protocols produce distinct coverage profiles and GC bias patterns. Capture kits determine which genomic regions are well-covered and which have systematic dropout. Read length affects the ability to span repetitive regions and call structural variants. Alignment and variant calling algorithms make different decisions at ambiguous genomic positions.
When samples from a particular batch or platform are disproportionately drawn from a specific phenotype class, models learn to distinguish batches. Why does batch-phenotype correlation arise in real studies? Practical constraints drive the pattern: case samples are often collected at specialized disease centers with particular sequencing infrastructure, while controls come from population biobanks using different platforms. Temporal factors compound this: if cases were sequenced earlier when certain technologies dominated, and controls were added later with newer platforms, technology-phenotype correlation becomes embedded in the data. Studies rarely randomize case-control status across batches because retrospective collection is cheaper than prospective design, creating systematic confounding that standard quality control cannot detect.
In high-dimensional feature spaces, even subtle batch-specific artifacts (coverage dips at particular loci, variant density patterns reflecting caller behavior, residual adapter sequences) can become predictive. Foundation models that process raw reads, coverage tracks, or variant streams are particularly vulnerable because batch signatures may be encoded in features that preprocessing would typically remove.
A model achieves 0.88 auROC predicting disease status from whole-genome sequences. You discover that case samples were sequenced at Center A using Illumina NovaSeq, while control samples were sequenced at Center B using HiSeq. What would you predict about performance on a new cohort where cases and controls are equally distributed across both centers? What diagnostic would you run to test for batch confounding?
Performance would likely collapse to near-chance levels (close to 0.50 auROC) because the model learned to distinguish sequencing centers rather than disease biology. When cases and controls are equally distributed across both centers, the batch-disease correlation disappears and the learned shortcut becomes useless.
Diagnostics to run:
- Train a classifier to predict sequencing center from the model’s learned embeddings; high accuracy confirms batch encoding.
- Visualize embeddings colored by center and by disease status; clustering by center rather than disease reveals the problem.
- Evaluate performance stratified by center; if within-center performance is poor, the model relies on between-center differences.
Common patterns suggesting batch confounding include embedding spaces where samples cluster by sequencing center rather than phenotype, strong predictive performance that collapses when evaluated on data from a new platform, and models that can accurately predict batch identity (sequencing center, capture kit, processing date) from inputs that should be batch-independent. When a model designed to predict disease can also predict which laboratory processed the sample, something has gone wrong.
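Two of these diagnostics can be scripted directly. The sketch below assumes you already have a matrix of model embeddings plus per-sample center labels, disease labels, and predicted disease probabilities (all hypothetical names drawn from your own pipeline); it checks whether a linear probe can recover the sequencing center from the embeddings and computes disease auROC within each center.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

def batch_confounding_report(embeddings, center, disease, disease_scores):
    """Quick diagnostics for batch confounding.

    embeddings     : (n_samples, n_dims) learned representations
    center         : (n_samples,) sequencing-center labels
    disease        : (n_samples,) binary disease labels
    disease_scores : (n_samples,) model-predicted disease probabilities
    """
    # 1. Can a linear probe recover the sequencing center from the embeddings?
    #    Accuracy well above the majority-class rate indicates batch encoding.
    probe_acc = cross_val_score(
        LogisticRegression(max_iter=1000), embeddings, center, cv=5).mean()
    _, counts = np.unique(center, return_counts=True)
    majority = counts.max() / len(center)
    print(f"center probe accuracy: {probe_acc:.2f} (majority baseline {majority:.2f})")

    # 2. Within-center discrimination: if the model relies on between-center
    #    differences, auROC computed inside each center will sit near 0.5.
    for c in np.unique(center):
        mask = center == c
        if len(np.unique(disease[mask])) == 2:
            auc = roc_auc_score(disease[mask], disease_scores[mask])
            print(f"center {c}: within-center auROC = {auc:.2f}")
```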
13.5 Label Bias and Circularity
Labels in genomic applications rarely represent ground truth in any absolute sense. They represent the outputs of complex processes involving clinical documentation, expert review, computational prediction, and database curation. These processes introduce biases that models absorb and may amplify.
Clinical phenotypes derived from electronic health records inherit the limitations of medical documentation. Billing codes capture what was reimbursable, not necessarily what was present. Problem lists reflect what clinicians chose to document, which varies by specialty, institution, and individual practice patterns. Diagnostic criteria change over time, creating apparent temporal trends in disease prevalence that reflect evolving definitions rather than changing biology.
Variant pathogenicity labels illustrate the problem of circularity. ClinVar aggregates submissions from clinical laboratories, research groups, and expert panels (Landrum et al. 2018). The evidence underlying these submissions often includes computational predictions: a laboratory may cite CADD, REVEL, or other predictors as supporting evidence for a pathogenic classification. When the next generation of predictors trains on ClinVar, it learns to replicate the computational predictions that contributed to those labels. Performance on ClinVar-derived benchmarks thus reflects, in part, agreement with previous predictors rather than independent biological insight.
Why does circularity inflate validation metrics specifically? The mechanism is statistical: the new model’s task becomes predicting what previous models predicted, not predicting true pathogenicity. If CADD influenced 30% of pathogenic labels, and the new model learns to approximate CADD, it automatically achieves high agreement on those labels, regardless of whether the underlying biology was correctly captured. The inflation is proportional to the previous model’s influence on labeling: more circularity means more inflated benchmarks. Critically, this inflation is invisible within the circular ecosystem; only prospective validation on genuinely novel variants or independent functional assays reveals the gap between apparent and true performance.
This circularity extends across the ecosystem of genomic resources. gnomAD allele frequencies inform variant filtering in clinical pipelines. UK Biobank genotype-phenotype associations shape which variants receive functional follow-up. Structural annotations from ENCODE and Roadmap Epigenomics influence which regulatory regions are considered biologically important. Foundation models pretrained on these resources, then evaluated against benchmarks derived from the same resources, may achieve impressive scores while learning to reproduce the assumptions and biases of existing annotations rather than discovering new biology.
13.6 Data Splitting
Data splitting is among the primary tools for assessing generalization, yet naive splits can silently permit leakage that inflates apparent performance.
The following section introduces formal concepts about data splitting. The core intuition is simple: different splitting strategies test different types of generalization. Random splits test interpolation; structured splits test extrapolation to genuinely new contexts.
13.6.1 Random Individual-Level Splits
Random individual-level splits assign samples randomly to training, validation, and test sets. This approach fails when samples are not independent: family members may appear on both sides of a split, allowing models to memorize shared haplotypes. Rare variant analysis is particularly vulnerable because disease-causing variants may be private to specific families, and memorizing which families have which variants is far easier than learning generalizable sequence-function relationships.
13.6.2 Family-Aware Splits
Family-aware splits address relatedness by ensuring that all members of a family appear in the same split. This prevents direct memorization of family-specific variants but does not address population structure (ancestry groups may remain imbalanced across splits) or other confounders.
13.6.3 Locus-Level Splits
Locus-level splits hold out entire genomic positions, ensuring that no variant at a test position appears during training. This stringent approach prevents models from memorizing site-specific patterns and is essential for variant effect prediction where the goal is to score novel variants at positions the model has never seen.
Why do models memorize positions when batch effects or ascertainment biases exist? Gradient descent discovers whatever pathway most efficiently reduces loss. When certain genomic positions systematically appear in training with particular labels (because well-studied genes are overrepresented, or because sequencing centers focused on specific regions), the model can reduce loss by learning “position X tends to be pathogenic” rather than “variants disrupting this motif tend to be pathogenic.” In high-dimensional feature spaces, both pathways are viable; position memorization is often easier. Locus-level splits force the model to succeed on positions it has never seen, eliminating the memorization shortcut entirely.
Many published benchmarks fail to implement locus-level splitting, allowing models to achieve high scores by recognizing familiar positions rather than learning generalizable effects. The evaluation considerations in Section 12.4 address these issues in detail.
13.6.4 Region and Chromosome Splits
Region or chromosome splits hold out entire genomic regions, testing whether models learn biology that transfers across the genome rather than region-specific patterns. This is particularly relevant for regulatory prediction, where local chromatin context may differ between regions.
13.6.5 Cohort and Site Splits
Cohort or site splits hold out entire institutions, sequencing centers, or biobanks, directly testing robustness to the batch and cohort effects discussed above. Models that perform well only within their training cohort but fail on held-out cohorts have learned institution-specific patterns.
13.6.6 Temporal Splits
Time-based splits use temporal ordering, training on earlier data and evaluating on later data. This approach simulates prospective deployment and tests robustness to temporal drift. A model trained on 2018 data and evaluated on 2023 data faces realistic distribution shift that random splits would obscure.
13.6.7 Indirect Leakage Across Resources
Beyond explicit split design, indirect leakage remains a concern. A variant that appears in ClinVar may also appear in gnomAD (with population frequency information), in functional assay datasets (with splicing or expression effects), and in literature-derived databases (with disease associations). Pretraining on any of these resources while evaluating on another creates indirect information flow that standard deduplication would miss.
Before examining the table, consider these scenarios and predict which splitting strategy is most appropriate:
- You are building a variant effect predictor that must score novel variants at genomic positions never seen before
- Your model showed excellent performance in development but you need to test if it will work at other hospitals
- You are training on genotypes from families with rare diseases
- You want to simulate how your model would perform if deployed next year
Match each scenario to the splitting strategy it requires. Then check the table to see if your predictions align with the “When to Use” column.
The following table compares splitting strategies and their properties:
| Strategy | What It Holds Out | Leakage Addressed | Generalization Tested | When to Use |
|---|---|---|---|---|
| Random | Random samples | None | Interpolation only | Never for final evaluation |
| Family-aware | Family groups | Relatedness memorization | Across unrelated individuals | When pedigree structure exists |
| Locus-level | Genomic positions | Position memorization | Novel genomic positions | Variant effect prediction |
| Chromosome | Entire chromosomes | Regional patterns | Cross-genome transfer | Regulatory prediction |
| Cohort/Site | Institutions | Batch effects, coding practices | Cross-institution deployment | Clinical deployment validation |
| Temporal | Time periods | Future information | Prospective performance | Simulating real deployment |
13.7 Data Leakage as Confounding
Data leakage can be understood as a special case of confounding where the confounder is information that should not exist at prediction time. This framing clarifies why leakage inflates performance estimates and why leaked models fail in deployment: they have learned associations with variables that are unavailable when predictions must actually be made.
The detailed taxonomy of leakage types (label, feature, temporal, and benchmark leakage) along with detection strategies is provided in Section 12.4. Here we examine how each leakage type creates confounding structures that distort model evaluation.
13.7.1 Causal Structure of Leakage
In causal terms, leakage introduces a backdoor path between features and labels that does not represent the relationship we intend to model (see Chapter 26 for formal treatment of backdoor paths and causal identification). Consider a pathogenicity predictor trained with conservation scores that were computed using alignments incorporating known pathogenic variants. The causal structure includes: (1) the intended path from sequence features through biological mechanism to pathogenicity, and (2) a leaked path from pathogenicity labels through their influence on conservation databases back to conservation features. The model cannot distinguish signal flowing through these two paths, and performance estimates reflect both.
Label leakage creates confounding when the process that generated labels also influenced feature construction. The confounder is the shared information source: ClinVar curators who used computational predictions created a dependency between those predictions and subsequent labels. Feature leakage creates confounding when features correlate with labels through non-causal pathways, such as batch effects that happen to align with case-control status. Temporal leakage creates confounding through time-dependent information flow: future knowledge that influenced either features or labels introduces associations that would not exist in prospective application.
13.7.2 Compounding Effects
These leakage types interact and compound. A model suffering from multiple forms may achieve extraordinary benchmark performance while learning nothing transferable to prospective clinical use. The apparent signal is real within the leaked evaluation framework but spurious for the intended application.
Consider a variant effect predictor that: (1) uses conservation scores computed from databases that include known pathogenic variants, (2) was trained on ClinVar labels that were influenced by earlier predictors, and (3) is evaluated on a benchmark constructed using similar computational filtering methods.
How many distinct leakage pathways can you identify? For each pathway, what would the model learn that would inflate its apparent performance but fail in prospective deployment?
Three pathways: the conservation scores were computed from databases that include known pathogenic variants (feature leakage); the ClinVar training labels were influenced by earlier predictors (label leakage); and the benchmark was constructed with similar computational filtering (benchmark leakage). Each pathway independently inflates performance; together, they create an evaluation that measures agreement with the existing annotation ecosystem rather than prospective predictive ability.
13.7.3 Implications for Confounding Analysis
The confounding framework suggests that leakage detection methods (described in Section 12.4) can be understood as strategies for identifying and blocking backdoor paths. Feature ablation removes variables that may carry leaked signal. Temporal validation eliminates paths that depend on future information. Baseline analysis reveals when simple confounders explain most of the apparent performance.
This perspective also clarifies why some apparent leakage may be acceptable. If conservation scores will always be available at prediction time, the path through conservation represents legitimate signal rather than confounding. The distinction depends on the deployment context: what information will actually be available when the model must make predictions? Leakage is confounding by information that exists in evaluation but not in application.
13.8 Detecting Confounding
Confounding is often subtle, requiring systematic diagnostics rather than reliance on aggregate performance metrics.
13.8.1 Confounder-Only Baselines
The most direct diagnostic trains simple models using only potential confounders: ancestry principal components, batch indicators, sequencing center identifiers, recruitment site. If these confounder-only baselines approach the performance of complex genomic models, confounding likely drives a substantial portion of the signal. Reporting confounder-only baselines alongside genomic model results makes hidden shortcuts visible.
If a model using only ancestry principal components (no genomic features) achieves 0.75 auROC, and your full genomic model achieves 0.82 auROC, how much of that 0.82 reflects biology versus ancestry confounding? The baseline provides a floor: any performance attributable to your genomic features is the delta above this floor. Always compute and report confounder-only baselines.
The choice of baseline fundamentally shapes conclusions about model performance. A particularly insidious form of baseline weakness occurs when polygenic prediction studies use only clumping-and-thresholding (C+T) methods as comparators rather than LD-aware Bayesian approaches. C+T aggressively discards genetic signal by pruning correlated variants, creating an artificially weak baseline that inflates apparent deep learning gains by 16-60% compared to properly tuned alternatives (Ge et al. 2019; Vilhjálmsson et al. 2015).
Studies reporting neural network “improvements” over C+T baselines may be demonstrating only that neural networks implicitly model linkage disequilibrium, which LD-aware Bayesian methods like LDpred2 and PRS-CS already capture more efficiently. When compared against these stronger baselines, apparent neural network advantages often disappear or reverse. A 2025 Nature Communications analysis found that neural networks performed only 93-95% as well as properly tuned linear regression for polygenic prediction when appropriate baselines were used, with apparent “nonlinear advantages” reflecting joint-tagging effects rather than genuine epistasis detection.
Weak baselines function as hidden confounders in model evaluation: they create spurious associations between model complexity and performance improvement that do not reflect genuine capability gains. Always verify that published comparisons include LD-aware methods (LDpred2-auto, PRS-CS, SBayesR) rather than only C+T. Claims of substantial improvement over “state-of-the-art” warrant skepticism until baseline strength is confirmed.
13.8.2 Stratified Performance Analysis
Performance stratified by ancestry group, sequencing platform, institution, and time period reveals whether aggregate metrics mask heterogeneity. Both discrimination (auROC, area under the precision-recall curve (auPRC)) and calibration diagnostics should be computed for each subgroup. Models may achieve high overall auROC while being poorly calibrated or nearly useless for specific subpopulations. Performance that varies dramatically across subgroups suggests confounding or distribution shift even when overall metrics appear strong.
13.8.3 Residual Confounder Associations
Associations between model outputs and potential confounders can reveal encoding of ancestry or batch information beyond what the label requires. Plotting predictions against ancestry principal components, adjusting for true label status, shows residual confounding. Comparing mean predicted risk across batches or time periods within the same true label class identifies systematic biases. Formal association tests (regression, mutual information) between predictions and confounders that show strong residual associations indicate the model has learned confounder-related features that go beyond predicting the label itself.
13.8.4 Split Sensitivity Analysis
Varying the splitting strategy probes for leakage. Re-evaluating performance under locus-level splits, cohort holdouts, or temporal splits reveals whether initial results depended on memorization. A model that achieves 0.90 auROC with random splits but only 0.75 auROC with locus-level splits has likely memorized site-specific patterns. Large drops in performance under stricter splitting indicate inflated initial results.
13.8.5 Negative Control Outcomes
Using outcomes known to be unrelated to genomics as negative controls provides powerful confirmation of confounding. If a model trained to predict disease from genotypes can also predict administrative outcomes (insurance type, documentation completeness) with similar accuracy, it has learned confounders. Shuffling labels within batch or ancestry strata should eliminate predictive signal; if it does not, the model exploits structure that transcends any specific outcome.
You have trained a disease prediction model that achieves 0.88 auROC. Apply what you have learned about detection:
- Name three diagnostic tests you would run to detect confounding
- For each test, describe what result would indicate a problem
- If your confounder-only baseline achieves 0.82 auROC, what does this tell you about your model?
- Three key diagnostics:
  - Confounder-only baseline using ancestry PCs and batch indicators
  - Stratified performance analysis across ancestry groups and sequencing centers
  - Split sensitivity comparing random vs. locus-level vs. temporal splits
- Problem indicators:
  - Baseline approaches full model performance (e.g., 0.82 vs 0.88), indicating confounding drives most signal
  - Performance varies dramatically across subgroups (e.g., 0.90 in European ancestry but 0.60 in African ancestry), suggesting shortcuts
  - Performance drops substantially under stricter splits (e.g., from 0.88 to 0.70 with locus-level), indicating memorization
- If the confounder-only baseline achieves 0.82 auROC while your full model achieves 0.88, most of the model’s discrimination is achievable from ancestry and batch information alone; the genomic features add only a 0.06 increment, indicating the model has learned primarily confounders rather than biology.
13.9 Mitigation Strategies
No mitigation strategy eliminates confounding entirely, and each involves trade-offs between bias, variance, and coverage. The approaches described here are complementary: design-based methods constrain confounding before modeling begins, statistical adjustments handle residual confounding, invariance learning provides protection when confounders are incompletely measured, and rigorous benchmark construction ensures that evaluation reflects generalization rather than shortcut learning.
The following table provides a decision framework for selecting mitigation strategies based on your specific situation:
| Strategy | When Applied | Confounder Requirement | Main Trade-off | Best For |
|---|---|---|---|---|
| Matching | Study design | Known, measurable | Reduced sample size | Known major confounders |
| Covariate adjustment | Training | Known, measurable | May remove real signal | Ancestry, batch correction |
| Residualization | Preprocessing | Known, measurable | Information loss | Strong linear confounding |
| Adversarial invariance | Training | Known groups | Reduced accuracy | Unknown within-group variation |
| Group DRO | Training | Known groups | Worse average performance | Fairness-critical applications |
| Multi-site training | Data collection | None | Logistical complexity | Institution effects |
| Temporal splits | Evaluation | Time stamps | Smaller test set | Prospective deployment |
13.9.1 Study Design and Cohort Construction
Design-based approaches provide the most robust protection against confounding because they prevent the problem rather than attempting to correct it statistically. When cases and controls are matched on potential confounders before data collection, those variables cannot drive spurious associations regardless of model complexity.
Matching strategies balance cases and controls on age, sex, ancestry, recruitment site, and sequencing platform. For ancestry, matching can use self-reported categories, genetic principal components, or fine-scale population assignments depending on the granularity required. Why does matching work when statistical adjustment could handle the same variables? Matching eliminates confounding by design rather than by assumption. Statistical adjustment assumes the functional form relating confounders to outcomes is correctly specified; if the true relationship is nonlinear or involves interactions the model does not include, residual confounding persists. Matching makes no such assumptions: when cases and controls have identical confounder distributions, no functional form is needed because there is no confounder-outcome variation to model. Exact matching (requiring identical values) provides the strongest protection but may be infeasible when confounders are continuous or when the pool of potential controls is limited. PCA-based genetic matching, genetic similarity score matching, or coarsened exact matching offer practical alternatives that achieve approximate balance across multiple confounders simultaneously.
Genomic case-control studies have developed specialized matching approaches distinct from classical epidemiological methods:
- PCA-based matching (e.g., PCAmatchR): Matches cases to controls using weighted Mahalanobis distance on ancestry principal components, with weights proportional to variance explained by each PC (a simplified sketch appears after this list).
- Genetic similarity score (GSM) matching: Calculates pairwise genetic similarity directly from genome-wide genotypes, then matches based on similarity scores.
- Coarsened exact matching (CEM): Coarsens continuous confounders into strata and requires exact matching within strata. Works well for fewer than ~10 strong confounders but can produce extreme sample size losses in high-dimensional settings.
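To make the PCA-based idea concrete, the sketch below performs greedy one-to-one matching of cases to controls on ancestry PCs using a variance-weighted distance. The weighting and greedy pairing are simplifying assumptions standing in for PCAmatchR's weighted Mahalanobis procedure, not that package's actual API.

```python
import numpy as np

def pc_match(case_pcs, control_pcs, var_explained):
    """Greedy 1:1 matching of cases to controls on ancestry PCs.

    case_pcs, control_pcs : (n_cases, k) and (n_controls, k) PC coordinates
    var_explained         : (k,) variance explained by each PC, used as weights
    Returns an array mapping each case index to a matched control index.
    """
    w = np.asarray(var_explained) / np.sum(var_explained)
    available = np.ones(len(control_pcs), dtype=bool)
    matches = np.full(len(case_pcs), -1)

    for i, case in enumerate(case_pcs):
        # Weighted squared distance from this case to every available control.
        d2 = ((control_pcs - case) ** 2 * w).sum(axis=1)
        d2[~available] = np.inf
        j = int(np.argmin(d2))
        matches[i] = j
        available[j] = False  # each control is used at most once
    return matches
```

In practice, matching quality should be checked afterwards (e.g., standardized differences in each PC between matched cases and controls) before committing to the matched cohort.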
Why not propensity score matching? Propensity score matching estimates \(P(T=1 \mid X)\), the probability of receiving a treatment given confounder values. In case-control genomic studies, disease status is the outcome, not a treatment assignment. There is no “treatment” whose assignment probability can be modeled. When propensity score methods do appear in genomic literature (e.g., PCAPS), they typically address association testing rather than case-control matching at the study design stage.
Prognostic score matching is the conceptual mirror: it estimates \(\mathbb{E}[Y \mid X, T=0]\) and matches on predicted disease risk. This approach is appropriate when you want to balance expected outcomes rather than treatment assignment probabilities. See the sidebar below for the distinction.
These two matching approaches are often confused but address different causal structures:
| | Propensity Score | Prognostic Score |
|---|---|---|
| Definition | \(P(T=1 \mid X)\) | \(\mathbb{E}[Y \mid X, T=0]\) |
| What it models | Probability of receiving treatment | Expected outcome under control conditions |
| Estimated in | Full sample (all subjects) | Untreated/control group only |
| Balances | Treatment assignment mechanism | Outcome risk factors |
| Use case | Observational studies estimating treatment effects | Studies where outcome prediction matters more than treatment assignment |
When to use which:
- Propensity scores: When you have a clear treatment/exposure and want to estimate its causal effect. Classic examples include drug efficacy studies or policy interventions.
- Prognostic scores: When disease status is the outcome and you want to match on predicted risk. Appropriate for case-control studies where there is no “treatment assignment” to model.
Theoretical insight (Hansen 2008): Prognostic scores directly balance what matters for outcome estimation (the expected outcome under control conditions) rather than balancing treatment assignment. A strong confounder of treatment assignment may be a weak predictor of outcome, and vice versa. Prognostic matching ensures matched pairs have similar expected outcomes, making any observed difference more attributable to the exposure of interest.
Practical consideration: Propensity score matching is “outcome-blind,” allowing you to finalize matched cohorts without examining outcomes and reducing researcher degrees of freedom. Prognostic score matching requires modeling the outcome, which some view as violating the design/analysis separation. Doubly robust approaches combine both propensity and prognostic scores, providing valid causal estimates if either model (but not necessarily both) is correctly specified (Leacy and Stuart 2013).
Balanced sampling during training prevents models from optimizing primarily for majority patterns. When one ancestry group comprises 80% of training data, gradient updates predominantly reflect that group’s patterns, and minority group performance suffers. Down-sampling the majority group or up-sampling minority groups within mini-batches ensures that all groups contribute meaningfully to parameter updates. The trade-off is reduced effective sample size: discarding majority group samples wastes information, while up-sampling minority groups risks overfitting to limited examples.
Prospective collection with diversity targets ensures that training data represent the populations where models will be deployed. Retrospective matching can balance existing cohorts but cannot address variants or patterns that are absent from the original collection. The All of Us Research Program, Million Veteran Program, and similar initiatives that prioritize ancestral diversity from inception provide data that enable genuinely generalizable models, though the genomic AI field has yet to fully leverage these resources.
The limitation of design-based approaches is that they must anticipate which variables will confound. Unknown or unmeasured confounders cannot be matched, and over-matching (matching on variables that are consequences of the exposure) can introduce bias rather than remove it. Design and analysis approaches work best in combination: match on known confounders, then adjust for residual imbalances that matching did not eliminate.
13.9.2 Covariate Adjustment
Covariate adjustment explicitly models confounders rather than ignoring them, allowing estimation of outcome effects that account for confounding variables. The approach is familiar from genome-wide association studies, where including ancestry principal components as covariates in regression models reduces spurious associations driven by population structure.
For foundation models, covariate adjustment takes several forms. The simplest approach includes confounder variables (ancestry PCs, batch indicators, sequencing platform) as additional input features alongside genomic data. The model learns to use confounder information when predicting outcomes, and the genomic feature coefficients or attention weights reflect associations that remain after accounting for confounders. This approach assumes the model can learn the appropriate adjustment; for complex confounding patterns, explicit modeling may be preferable to implicit learning.
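A minimal sketch of this pattern appends measured confounders to the genomic embedding before the prediction head; the module name, dimensions, and covariate set below are illustrative, not a reference architecture.

```python
# Sketch: a prediction head that receives genomic embeddings plus measured confounders.
import torch
import torch.nn as nn

class CovariateAdjustedHead(nn.Module):
    """Concatenates confounders (ancestry PCs, batch indicators) with embeddings."""
    def __init__(self, embed_dim: int, n_covariates: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + n_covariates, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, genomic_embedding: torch.Tensor, covariates: torch.Tensor) -> torch.Tensor:
        # covariates might hold ancestry PCs, batch indicators, platform one-hots, etc.
        return self.net(torch.cat([genomic_embedding, covariates], dim=-1))

head = CovariateAdjustedHead(embed_dim=768, n_covariates=12)
logits = head(torch.randn(4, 768), torch.randn(4, 12))  # toy forward pass
```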
Residualization removes confounder-associated variance before training genomic models. Regressing features or phenotypes on confounders and retaining only the residuals ensures that subsequent models cannot exploit confounder-outcome associations. The risk is removing genuine biological signal when confounders correlate with causal variants. Ancestry principal components, for instance, capture population structure that includes both confounding (differential ascertainment) and biology (population-specific genetic architecture). Aggressive residualization may discard the latter along with the former.
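The sketch below shows residualization in its simplest form, regressing a phenotype on assumed ancestry PCs with scikit-learn and retaining the residuals; in practice the same operation can be applied to features rather than outcomes, and the simulated arrays stand in for real data.

```python
# Minimal residualization sketch: regress the phenotype on ancestry PCs, keep residuals.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
ancestry_pcs = rng.normal(size=(500, 10))   # assumed confounders
phenotype = rng.normal(size=500)            # outcome to be de-confounded

confounder_model = LinearRegression().fit(ancestry_pcs, phenotype)
phenotype_residual = phenotype - confounder_model.predict(ancestry_pcs)

# Downstream genomic models are trained on `phenotype_residual`; any variance the PCs
# can explain (including possibly genuine population-specific biology) has been removed.
```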
Mixed models and hierarchical structures treat institution, batch, or ancestry group as random effects, estimating genomic associations while accounting for clustering within groups. This approach is standard in genetic epidemiology and translates naturally to deep learning through hierarchical Bayesian frameworks or explicit modeling of group-level parameters. The key advantage is borrowing strength across groups while allowing group-specific intercepts or slopes, though computational costs increase substantially for large datasets with many groups.
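For a small-scale illustration, a random-intercept model with sequencing batch as the grouping factor can be fit with statsmodels; the simulated `dosage`, `batch`, and `phenotype` variables below are placeholders, and deep learning analogues would replace the linear predictor with a learned encoder.

```python
# Sketch: random-intercept model treating sequencing batch as a random effect.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "dosage": rng.integers(0, 3, size=300).astype(float),  # variant allele dosage
    "batch": rng.integers(0, 5, size=300).astype(str),     # sequencing batch label
})
df["phenotype"] = 0.3 * df["dosage"] + rng.normal(size=300)

# Fixed effect for the variant dosage, random intercept per batch.
model = smf.mixedlm("phenotype ~ dosage", data=df, groups=df["batch"])
result = model.fit()
print(result.summary())
```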
The fundamental limitation of covariate adjustment is that it requires measuring and correctly specifying confounders. Unmeasured confounders remain uncontrolled. Conditioning on colliders (variables caused by both exposure and outcome) introduces bias rather than removing it. Careful causal reasoning, often formalized through directed acyclic graphs, is essential for determining which variables should be adjusted and which should not.
13.9.3 Domain Adaptation and Invariance Learning
Domain adaptation methods aim to learn representations that do not encode confounders, achieving predictions that generalize across batches, institutions, or populations without explicitly modeling each source of variation (see Chapter 9 for broader treatment of transfer learning and domain adaptation techniques). These approaches are particularly valuable when confounders are numerous, incompletely measured, or difficult to specify.
Adversarial training adds a discriminator network that attempts to predict batch identity, ancestry, or other confounders from learned representations. The feature extractor is trained with two competing objectives: maximize prediction accuracy for the primary task while minimizing the discriminator's ability to recover confounder labels. Why does this adversarial setup produce invariant representations? The gradient reversal trick provides the key insight: during backpropagation, gradients from the discriminator are negated before flowing to the feature extractor, so instead of learning features that help predict the confounder (which standard backpropagation would produce), the feature extractor learns features that actively hinder confounder prediction. The result is a two-player game in which the discriminator tries to extract confounder information and the feature extractor tries to hide it. At equilibrium, the representations retain information useful for the primary task while encoding confounders no better than chance, discarding the signals that distinguish confounded groups. Domain adversarial neural networks and gradient reversal layers implement this approach efficiently within standard deep learning frameworks.
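A gradient reversal layer is only a few lines in PyTorch; the sketch below follows the standard domain-adversarial recipe and is illustrative rather than the reference implementation of any particular genomic model. The `encoder`, `task_head`, and `confounder_head` names in the usage comment are hypothetical.

```python
# Minimal gradient reversal layer: identity on the forward pass, negated (scaled)
# gradient on the backward pass, so the feature extractor is pushed to *hurt* the
# confounder discriminator rather than help it.
import torch

class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x: torch.Tensor, lam: float) -> torch.Tensor:
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output: torch.Tensor):
        return -ctx.lam * grad_output, None  # no gradient for lam

def grad_reverse(x: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    return GradientReversal.apply(x, lam)

# Illustrative usage inside a model's forward pass:
# features = encoder(x)
# task_logits = task_head(features)
# confounder_logits = confounder_head(grad_reverse(features, lam=0.5))
```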
The theoretical limitation is that perfect invariance and maximum accuracy cannot be achieved simultaneously when confounders correlate with the outcome through both causal and non-causal pathways. Enforcing strict invariance to ancestry, for instance, may remove genuine population-specific genetic effects along with confounding. Practitioners must balance the degree of invariance against task performance, typically through hyperparameters controlling the adversarial loss weight.
Group distributionally robust optimization (group DRO) targets worst-group performance rather than average performance, encouraging models that work for all subgroups rather than optimizing for the majority. The training objective minimizes the maximum loss across predefined groups (ancestry categories, sequencing platforms, institutions), ensuring that no group is systematically disadvantaged. This approach requires group labels during training and may sacrifice some average performance to improve worst-case outcomes.
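The simplest version of this objective computes the loss separately per group and backpropagates through the worst one, as in the sketch below; published group DRO implementations typically use smoothed, exponentially updated group weights rather than a hard maximum, so treat this as a conceptual illustration.

```python
# Sketch of a group DRO-style objective: optimize the worst group, not the average.
import torch
import torch.nn.functional as F

def worst_group_loss(logits: torch.Tensor, labels: torch.Tensor,
                     group_ids: torch.Tensor) -> torch.Tensor:
    """Return the maximum per-group cross-entropy loss."""
    losses = F.cross_entropy(logits, labels, reduction="none")
    group_losses = [losses[group_ids == g].mean() for g in torch.unique(group_ids)]
    return torch.stack(group_losses).max()

# Illustrative training step:
# loss = worst_group_loss(model(x), y, ancestry_group)
# loss.backward()
```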
Importance weighting and distribution matching align feature distributions across domains without explicit adversarial training. Samples from underrepresented domains receive higher weights during training, or feature distributions are explicitly matched through maximum mean discrepancy or optimal transport objectives. These methods can be combined with other approaches and are particularly useful when the target deployment distribution is known but differs from training data.
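As one example of an explicit distribution-matching penalty, the sketch below computes a simple (biased) RBF-kernel maximum mean discrepancy between two feature batches; the bandwidth and loss weighting are placeholders that would need tuning in any real application.

```python
# Sketch: RBF-kernel MMD penalty between source and target feature batches.
import torch

def rbf_mmd(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    def kernel(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))
    # Biased estimator: mean within-domain similarities minus cross-domain similarity.
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

# Illustrative combined objective:
# penalty = rbf_mmd(features_source, features_target)
# total_loss = task_loss + mmd_weight * penalty
```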
13.9.4 Data Curation and Benchmark Design
The signals available for learning depend entirely on how data are curated and how benchmarks are constructed. Careful attention to data provenance and evaluation design prevents many confounding problems that would otherwise require complex modeling solutions.
Deduplication across training and evaluation sets prevents direct memorization. For genomic data, deduplication must operate at multiple levels: individual samples (the same person appearing under different identifiers), family groups (relatives sharing haplotype segments), and genomic loci (the same variant position appearing in both training and test sets). Variant effect prediction requires particularly stringent locus-level deduplication; a model that has seen any variant at position chr1:12345 during training cannot be fairly evaluated on novel variants at that position.
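A locus-level overlap check can be as simple as comparing (chromosome, position) pairs between splits, as in the sketch below; the column names and toy variants are assumed rather than drawn from any standard schema.

```python
# Sketch: flag test variants whose genomic locus also appears in the training set.
import pandas as pd

train = pd.DataFrame({"chrom": ["chr1", "chr1"], "pos": [12345, 99999]})
test = pd.DataFrame({"chrom": ["chr1", "chr2"], "pos": [12345, 55555]})

train_loci = set(zip(train["chrom"], train["pos"]))
leaked_mask = [(c, p) in train_loci for c, p in zip(test["chrom"], test["pos"])]
leaked = test[leaked_mask]
print(f"{len(leaked)} test variants share a locus with the training set")

# Any variant at a shared position should be removed (or the position held out
# entirely) before reporting locus-level generalization.
```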
Splitting strategies determine what generalization is actually tested. Random splits assess interpolation within the training distribution. Locus-level splits test generalization to novel genomic positions. Chromosome holdouts test transfer across genomic regions. Cohort splits test robustness to institutional and demographic differences. Temporal splits simulate prospective deployment. Each strategy answers a different question, and benchmark performance under one splitting regime does not guarantee performance under others. Reporting results across multiple splitting strategies reveals which aspects of generalization a model has achieved. The comprehensive treatment of benchmark design in Chapter 11 addresses these considerations in detail.
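Group-aware splits are straightforward with off-the-shelf tooling; the sketch below uses scikit-learn's `GroupShuffleSplit` with a hypothetical `family_id` grouping variable, and the same pattern applies to cohort, chromosome, or locus groupings.

```python
# Sketch: family-aware split that keeps all relatives on the same side of the boundary.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = rng.integers(0, 2, size=1000)
family_id = rng.integers(0, 200, size=1000)  # assumed grouping variable

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=family_id))

# No family should appear in both partitions.
assert set(family_id[train_idx]).isdisjoint(family_id[test_idx])
```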
Benchmark diversity ensures that evaluation reflects the full range of deployment contexts. Benchmarks constructed from a single ancestry group, institution, or sequencing platform test only narrow generalization. Explicitly including diverse ancestries, multiple institutions, and varied technical platforms in evaluation sets reveals performance heterogeneity that homogeneous benchmarks would hide. The ProteinGym and CASP benchmarks in protein modeling demonstrate how thoughtfully constructed evaluation resources can drive genuine progress; genomic variant interpretation would benefit from similar community efforts.
Stating that a dataset is “diverse” is insufficient without quantification. The Vendi Score (Friedman and Dieng 2023) provides a principled metric for this purpose. Given a set of samples and a pairwise similarity function (e.g., a kernel on embeddings or genotype distances), the Vendi Score is the exponential of the Shannon entropy of the eigenvalues of the normalized similarity matrix, a quantity equivalent to the effective rank of that matrix. The score ranges from 1 (all samples identical) to \(n\) (all samples maximally distinct), providing an intuitive interpretation as the effective number of unique elements in the dataset.
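A from-scratch computation makes the definition concrete; the sketch below assumes a precomputed positive semidefinite similarity matrix with unit diagonal and is not tied to any particular similarity kernel.

```python
# Sketch: Vendi Score as exp(Shannon entropy of the eigenvalues of K / n).
import numpy as np

def vendi_score(K: np.ndarray) -> float:
    """K is an n x n positive semidefinite similarity matrix with K[i, i] = 1."""
    n = K.shape[0]
    eigvals = np.linalg.eigvalsh(K / n)
    eigvals = eigvals[eigvals > 1e-12]            # drop numerical zeros
    entropy = -np.sum(eigvals * np.log(eigvals))  # Shannon entropy of the spectrum
    return float(np.exp(entropy))

# Two illustrative extremes:
print(vendi_score(np.ones((5, 5))))  # all samples identical -> 1.0
print(vendi_score(np.eye(5)))        # all samples maximally distinct -> 5.0
```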
For genomic applications, the Vendi Score can quantify diversity along multiple axes: ancestry representation in a cohort (using genetic distance kernels), embedding coverage of a foundation model’s representation space, or phenotypic breadth of an evaluation benchmark. Tracking diversity scores across train/test splits and subsampling schemes helps ensure that data reduction preserves representativeness rather than collapsing onto a homogeneous subset. When benchmark diversity scores are low despite inclusion of nominally different populations, the benchmark may not test the generalization it claims to assess.
Documentation of overlaps between training resources and benchmarks enables readers to assess potential leakage. When a foundation model is pretrained on gnomAD, fine-tuned on ClinVar, and evaluated on a benchmark that filters variants using gnomAD frequencies, the information flow is complex and potentially circular. Explicit documentation of which resources contributed to which stages of model development allows appropriate skepticism about performance claims. Benchmark papers should catalog known overlaps with major training resources; model papers should acknowledge which benchmarks may be compromised by their pretraining choices.
13.9.5 Causal Inference Approaches
When observational confounding cannot be eliminated through design or statistical adjustment, causal inference frameworks offer principled alternatives that leverage the structure of genetic inheritance itself (see Chapter 26 for comprehensive treatment of causal inference in genomics).
The random assortment of alleles at meiosis creates natural experiments that Mendelian randomization exploits (Davey Smith and Ebrahim 2003). Because genotypes are assigned before birth and cannot be influenced by most environmental confounders, genetic variants that affect an exposure (such as a biomarker level or gene expression) can serve as instrumental variables for estimating causal effects on downstream outcomes. An instrumental variable is a variable that (1) affects the exposure of interest, (2) affects the outcome only through the exposure (no direct effect), and (3) is independent of unmeasured confounders. Why does this random assortment matter for causal inference? Consider the confounding that plagues observational studies: people with high LDL cholesterol may also smoke, exercise less, and have poorer diets, confounding any association between LDL and heart disease. But the genetic variants affecting LDL levels were randomly assigned at conception, before any lifestyle choices occurred. These variants cannot be confounded by lifestyle because they were fixed before lifestyle existed. By using genetic variants as instruments, Mendelian randomization asks: “Do people who were randomly assigned higher LDL (through genetic lottery) have higher heart disease risk?” This isolates the causal effect of LDL itself. A foundation model trained to predict expression levels can be evaluated for causal relevance by testing whether its predictions, instrumented through genetic variants, associate with disease outcomes in ways that survive Mendelian randomization assumptions. This approach has revealed that many observational biomarker-disease associations reflect confounding rather than causation, and similar logic applies to model-derived predictions.
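The logic can be seen in a toy simulation: a two-stage least squares estimate using a genetic instrument recovers the true causal effect even when an unmeasured confounder biases the naive regression. The effect sizes and variable names below are invented for illustration, and a single-instrument estimate like this is equivalent to the Wald ratio.

```python
# Toy Mendelian randomization sketch: instrument G -> exposure X -> outcome Y,
# with an unmeasured confounder U affecting X and Y but not G.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 20_000
G = rng.binomial(2, 0.3, size=n).astype(float)   # instrument (allele dosage)
U = rng.normal(size=n)                           # unmeasured confounder
X = 0.5 * G + 1.0 * U + rng.normal(size=n)       # exposure (e.g., biomarker level)
Y = 0.2 * X + 1.0 * U + rng.normal(size=n)       # outcome (true causal effect = 0.2)

# Naive regression of Y on X is biased upward by U.
naive = LinearRegression().fit(X[:, None], Y).coef_[0]

# Stage 1: predict the exposure from the instrument.
# Stage 2: regress the outcome on the predicted exposure.
x_hat = LinearRegression().fit(G[:, None], X).predict(G[:, None])
iv_estimate = LinearRegression().fit(x_hat[:, None], Y).coef_[0]

print(f"naive: {naive:.2f}, IV: {iv_estimate:.2f}, truth: 0.20")
```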
Directed acyclic graphs (DAGs) formalize assumptions about causal structure and clarify which variables should be adjusted, which should be left unadjusted, and which adjustments would introduce bias rather than remove it (Pearl 2009). Conditioning on a collider (a variable caused by both exposure and outcome) induces spurious associations; conditioning on a mediator blocks causal pathways of interest. Explicit DAG construction forces researchers to articulate their causal assumptions, making hidden confounding visible and enabling principled variable selection. For genomic models, DAGs clarify the relationships among ancestry, technical factors, biological mechanisms, and phenotypic outcomes, revealing which adjustment strategies address confounding versus which inadvertently condition on consequences of the outcome.
Outcomes and exposures known to be unrelated to the prediction target provide empirical tests of residual confounding without requiring complete causal knowledge (Lipsitch, Tchetgen Tchetgen, and Cohen 2010). A negative control outcome is one that should not be causally affected by the exposure of interest; if the model predicts it anyway, confounding is present. A negative control exposure is one that should not causally affect the outcome; association with the outcome again indicates confounding. For a variant effect predictor, administrative outcomes (insurance status, documentation completeness) serve as negative control outcomes that genotypes should not predict. Synonymous variants in non-conserved regions can serve as negative control exposures that should not affect protein function. Strong predictions for negative controls reveal that the model has learned confounders rather than biology.
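Operationally, the diagnostic is just an association test between model outputs and the negative control. The sketch below checks whether risk scores discriminate a simulated negative control outcome; values well above 0.5 auROC would indicate residual confounding. All variables are placeholders.

```python
# Sketch: negative-control diagnostic for a risk model's predictions.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
risk_scores = rng.uniform(size=2_000)                 # model predictions for a cohort
negative_control = rng.binomial(1, 0.5, size=2_000)   # e.g., documentation completeness

auc = roc_auc_score(negative_control, risk_scores)
print(f"negative-control auROC = {auc:.2f}")  # should hover near 0.5 if unconfounded
```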
These causal approaches do not replace careful study design and rigorous splitting, but they provide additional tools for distinguishing genuine biological signal from confounded associations, particularly when the same observational data must serve both training and evaluation purposes.
13.10 Fairness and External Validity
Confounding connects directly to fairness and health equity. Models that achieve high average performance while failing for specific populations may appear successful while exacerbating existing disparities.
Polygenic risk scores illustrate this tension. European ancestry-derived scores predict cardiovascular disease, diabetes, and breast cancer risk reasonably well within European ancestry populations. Applied to African or Asian ancestry individuals, the same scores show substantially worse discrimination and calibration (Duncan et al. 2019). Healthcare systems that deploy these scores without ancestry-specific validation risk providing inferior risk stratification to already underserved populations. The portability analysis framework in Section 3.7 quantifies these degradations, while clinical deployment frameworks (Section 28.8) address operational responses.
Variant interpretation exhibits similar patterns. ClinVar contains many more pathogenic variant classifications for European ancestry individuals than for other populations (Landrum et al. 2018). The data composition issues underlying this imbalance are examined in Section 2.8.1. Predictors trained on ClinVar inherit this imbalance, performing better for variants common in European populations and worse for variants enriched in other ancestries. Clinical deployment of such predictors may reduce diagnostic yield for non-European patients.
Pharmacogenomics provides a concrete example of ancestry-dependent clinical utility. Chung et al. (2004) demonstrated that CYP2D6 metabolizer phenotypes vary substantially across populations, with allele frequencies differing by orders of magnitude between African, Asian, and European ancestry groups. Foundation models predicting drug response should account for these population-specific patterns or risk systematic errors in dosing recommendations for underrepresented groups.
A hospital system proposes deploying a polygenic risk score for breast cancer screening prioritization. The score was developed and validated in a European ancestry cohort. The hospital serves a population that is 40% African ancestry and 25% Hispanic/Latino.
- What fairness concerns should be raised before deployment?
- What validation studies would you require?
- What monitoring should be implemented post-deployment?
Major fairness concerns: The score may show 40-75% reduced accuracy in non-European populations, potentially providing inferior risk stratification to already underserved groups. Differential performance could lead to missed diagnoses in minority populations or inappropriate screening recommendations.
Required validation: Stratified performance analysis by ancestry group in the target population; calibration assessment for each ancestry group (not just discrimination); comparison to ancestry-matched baselines; sensitivity analysis to understand performance degradation mechanisms.
Post-deployment monitoring: Track screening recommendations and cancer detection rates stratified by ancestry; monitor for disparities in false positive/negative rates; assess whether the tool improves or worsens existing outcome disparities; implement thresholds for stopping use if equity metrics deteriorate.
The uncertainty quantification approaches discussed in Chapter 24 provide partial mitigation: models that report high uncertainty for under-represented populations at least flag predictions that should not be trusted. Out-of-distribution detection methods (Section 24.7) specifically address when inputs fall outside the training distribution. The interpretability methods in Chapter 25 can reveal when models rely on ancestry-correlated features, with attribution analysis (Section 25.1) identifying which input features drive ancestry-dependent predictions. Yet technical solutions alone are insufficient. Addressing fairness requires intentional data collection that prioritizes under-represented populations, evaluation protocols that mandate subgroup analysis, and deployment decisions that consider equity alongside aggregate accuracy.
External validity asks whether a model’s performance in one setting predicts its performance in another. Confounding and distribution shift often cause dramatic external validity failures. A model that achieves excellent metrics in the development cohort may fail when deployed at a different institution, in a different healthcare system, or in a different country. The clinical risk prediction frameworks in Section 28.9 emphasize multi-site validation precisely because single-site performance frequently fails to generalize.
The fairness implications of confounding extend beyond technical model performance into questions of justice in healthcare resource allocation, diagnostic equity, and the distribution of benefits from genomic medicine. Governance frameworks for addressing these structural challenges are examined in Section 27.2.
13.11 A Practical Checklist
The following checklist synthesizes the diagnostics and mitigations discussed above. Systematic application during model development and evaluation surfaces confounding that would otherwise remain hidden.
This checklist should be applied at three stages: (1) during study design, to prevent confounding through matching and balanced sampling; (2) during model development, to detect and mitigate confounding through diagnostics and training modifications; and (3) during evaluation, to ensure that performance estimates reflect genuine generalization rather than shortcut learning. Document your responses to each item in your methods section.
Population structure and relatedness: Quantify ancestry via principal components and relatedness via kinship coefficients. Decide explicitly whether to match, stratify, or adjust for these factors, and document the justification. Report performance stratified by ancestry group. When family structure exists in the data, verify that relatives do not appear across train-test boundaries.
Data splits and leakage: Ensure individuals, families, and genomic loci do not cross the train-validation-test boundaries for target tasks. Implement stricter splits (locus-level, chromosome-level, cohort-based, time-based) and report the performance differences. Check for overlap with external databases or benchmarks used in evaluation and document any shared resources.
Batch, platform, and cohort effects: Catalog technical variables (sequencing center, instrument, protocol, assay) and cohort identifiers. Test whether these variables predict labels or align with subgroups of interest. Use embedding visualizations, principal components, or simple classifiers to detect batch signatures. Apply mitigation (design matching, covariate adjustment, domain adaptation) when batch effects are detected.
Label quality and curation bias: Document how labels were defined and what processes (billing codes, expert review, computational prediction, registry inclusion) produced them. Quantify label noise where possible. Consider robust training strategies when labels are noisy. Assess how curated resources like ClinVar reflect historical biases and whether those biases affect evaluation validity.
Cross-group performance and fairness: Report metrics for each major subgroup (ancestry, sex, age, cohort, platform) rather than only aggregate performance. Examine calibration across groups, not just discrimination. Discuss clinical implications of residual performance gaps and whether deployment might worsen existing disparities.
Reproducibility and transparency: Document dataset construction, inclusion criteria, and splitting strategies completely. Release preprocessing, training, and evaluation code when feasible. Describe which confounders were measured, how they were handled, and what limitations remain.
Models that pass this checklist provide more reliable evidence of genuine biological learning. Models that fail at multiple points may achieve benchmark success while learning shortcuts that will not transfer to new settings.
13.12 Rigor as Response
These confounding and bias problems are not reasons for despair. They are reasons for rigor. The same expressive capacity that enables foundation models to discover subtle shortcuts also enables them to learn complex biological patterns when training data and evaluation protocols are designed appropriately. The goal is not to abandon powerful models but to create conditions under which their power serves biological discovery rather than benchmark gaming.
Several trends support progress. Multi-ancestry biobanks and international collaborations expand the diversity of available training data. Benchmark developers implement stricter splitting protocols and require subgroup analyses. Pretraining strategies that explicitly promote invariance to technical factors are emerging. Uncertainty quantification methods (Chapter 24) provide mechanisms for models to express appropriate caution when inputs fall outside their training distribution. The problem of confounding is tractable with sustained attention to data provenance, evaluation design, and deployment monitoring. The benchmark catalog in Chapter 11 identifies which evaluation resources are most susceptible to particular confounders, while the evaluation methodology in Chapter 12 provides protocols for detecting leakage before it inflates reported performance.
Yet vigilance remains essential. New datasets bring new confounders. Novel architectures create new opportunities for shortcut learning. Community benchmarks accumulate indirect leakage as resources are reused across studies. Treating confounding as a first-order concern throughout model development, rather than an afterthought addressed only when reviewers complain, distinguishes models that actually work from models that merely perform well on convenient benchmarks. The interpretability methods in Chapter 25 provide tools for distinguishing genuine regulatory insight from sophisticated pattern matching, with mechanistic interpretability (Section 25.7) offering the strongest evidence about what models have actually learned. The uncertainty quantification approaches in Chapter 24 enable models to communicate when their predictions should not be trusted, with selective prediction (Section 24.8) providing operational frameworks for routing uncertain cases to human review. Together with rigorous evaluation, these capabilities move the field toward models that reveal genuine biology and behave reliably across the diverse clinical and scientific settings where they will be deployed.
Before reviewing the summary, test your recall:
- Distinguish between confounding, bias, data leakage, and distribution shift. Give a concrete genomic example of each.
- How does population structure create shortcuts that foundation models exploit, and why does increasing model capacity amplify rather than mitigate this problem?
- A confounder-only baseline using ancestry PCs achieves 0.75 auROC, while your full genomic model achieves 0.82 auROC. What does this reveal about your model’s learned features?
- Why does label circularity in ClinVar (where computational predictions influence pathogenicity annotations) inflate validation metrics for new predictors trained on ClinVar?
- What splitting strategies address different types of leakage (individual overlap, family structure, position memorization, temporal drift)?
Confounding, bias, leakage, and distribution shift: Confounding occurs when a variable affects both features and labels (example: ancestry influences allele frequencies through population history and disease risk through healthcare access pathways). Bias is systematic deviation from the target quantity (example: training on 50% disease prevalence but deploying at 5% prevalence causes systematic over-prediction). Data leakage occurs when test information influences training (example: the same variant appearing in both training and test sets, or ClinVar labels being influenced by CADD scores that later become training features). Distribution shift is mismatch between training and deployment distributions (example: a model trained on one hospital’s coding practices failing at a different institution with different documentation standards).
Population structure as exploitable shortcut: Population structure creates dual pathways where ancestry affects genetic features (through population-specific allele frequencies, haplotypes, and linkage disequilibrium patterns) and simultaneously affects disease labels through non-biological pathways (healthcare access, environmental exposures, clinical ascertainment practices). Foundation models can detect ancestry from local k-mer frequencies and haplotype patterns even in raw sequences, then exploit ancestry-label correlations as shortcuts. Increasing model capacity amplifies this problem because larger transformers with billions of parameters can discover increasingly subtle ancestry-linked features that smaller models would miss; model expressiveness is an amplifier of confounding, not a defense against it.
Interpreting confounder-only baseline performance: With chance discrimination at 0.50 auROC, a confounder-only baseline of 0.75 captures most of the discrimination the full model achieves at 0.82 (0.25 of the 0.32 auROC units above chance). Only the 0.07 delta represents signal genuinely attributable to genomic features beyond what ancestry alone provides. This indicates the model has primarily learned to exploit confounders rather than biological mechanisms, and it would likely fail when deployed in settings where ancestry-outcome relationships differ from the training distribution.
Label circularity inflates validation metrics: When clinical laboratories cite computational predictions like CADD or REVEL as supporting evidence for pathogenic classifications, and those classifications become ClinVar labels, new predictors trained on ClinVar face a circular task: they are learning to replicate what previous predictors said, not learning true pathogenicity. The inflation occurs because the new model achieves high agreement on labels that were influenced by computational predictions, essentially predicting what old models predicted rather than independent biological truth. This circularity is invisible within the circular ecosystem but becomes apparent during prospective validation on genuinely novel variants or independent functional assays where the feedback loop does not exist.
Splitting strategies for different leakage types:
- Individual overlap: Random individual-level splits at minimum, but this only works when samples are truly independent
- Family structure: Family-aware splits that keep all relatives together in the same partition to prevent haplotype memorization
- Position memorization: Locus-level splits that hold out entire genomic positions, ensuring the model has never seen any variant at test positions during training
- Temporal drift: Time-based splits that train on earlier data and test on later data, simulating prospective deployment and capturing evolution in sequencing technology, coding practices, and diagnostic criteria
Each splitting strategy tests a different aspect of generalization, and models should be evaluated under all relevant strategies to demonstrate genuine robustness.
Core Concepts:
- Confounding occurs when a variable affects both features and labels, creating spurious associations that models exploit as shortcuts
- Population structure is the most pervasive confounder in genomics, affecting both genetic features and phenotypes through non-biological pathways
- Batch effects from sequencing centers, capture kits, and analysis pipelines can become predictive signals that fail at deployment
- Label circularity occurs when computational predictions influence training labels, creating apparent validation that reflects agreement rather than insight
- Data leakage can be understood as confounding by information unavailable at prediction time
Key Diagnostics:
- Confounder-only baselines reveal how much signal comes from shortcuts versus biology
- Stratified performance analysis exposes hidden heterogeneity across subgroups
- Split sensitivity analysis (random vs. locus-level vs. temporal) tests for memorization
- Negative control outcomes confirm whether models learn confounders
Mitigation Hierarchy:
- Prevention through design: Matching, balanced sampling, prospective diverse collection
- Adjustment during training: Covariate inclusion, residualization, mixed models
- Invariance learning: Adversarial training, group DRO
- Rigorous evaluation: Locus-level splits, cohort holdouts, temporal validation
Connection to Other Chapters:
- Evaluation methodology (Chapter 12) provides detailed leakage detection protocols
- Uncertainty quantification (Chapter 24) flags predictions on out-of-distribution inputs
- Interpretability (Chapter 25) reveals what features models actually use
- Clinical deployment (Section 28.9) addresses operational fairness monitoring
Take-Home Message: High benchmark performance is not evidence of biological learning. Only rigorous evaluation design, systematic confounding diagnostics, and stratified subgroup analysis can distinguish models that have learned biology from models that have learned shortcuts. Treat confounding as a first-order concern throughout model development.