12 Evaluation Methods
The difference between valid and misleading evaluation often lies in methodological details that standard reporting obscures.
Estimated reading time: 35-40 minutes
Prerequisites: This chapter assumes familiarity with the benchmark landscape surveyed in Chapter 11. Understanding of basic machine learning evaluation (train/test splits, auROC) and genomic data characteristics (Chapter 2) is essential.
Learning Objectives: After completing this chapter, you should be able to:
- Identify why random train-test splits fail for genomic data and select appropriate splitting strategies
- Detect and prevent the four major types of data leakage (label, feature, temporal, benchmark)
- Choose metrics aligned with deployment objectives, distinguishing discrimination, calibration, and clinical utility
- Design ablation studies and baseline comparisons that isolate genuine model contributions
- Apply statistical rigor including significance testing, effect sizes, and confidence intervals
Chapter Structure: This chapter examines how to evaluate models properly, covering splitting strategies, leakage detection, metric selection, baseline comparison, and statistical rigor. The companion chapter on the Benchmark Landscape (Chapter 11) surveys what benchmarks exist.
Genomic data makes it exceptionally easy to fool yourself. Sequences share evolutionary history, so a model that memorizes training sequences may appear to generalize when tested on homologs. Variants cluster in families and populations, so ancestry-stratified performance can masquerade as genuine prediction. Experimental measurements carry batch effects invisible to the untrained eye, so a model can learn to distinguish sequencing centers rather than biological states. Training labels often derive from the very databases used for evaluation, creating circular validations that inflate performance without testing genuine predictive power. Every shortcut that simplifies evaluation in other machine learning domains becomes an opportunity for false confidence in genomics.
Random data splits that work perfectly well for natural images become actively misleading when applied to biological sequences. A protein held out for testing may share 90% sequence identity with a training protein, allowing the model to succeed through memorization rather than generalization. A variant classified as pathogenic in the test set may come from the same gene family as training variants, letting the model exploit gene-level signals rather than learning variant-specific effects. A cell line in the test set may have been processed at the same sequencing center as training samples, enabling the model to recognize batch signatures rather than biological patterns. These leakages are not hypothetical; they have inflated reported performance across the genomic machine learning literature.
The difference between valid and misleading evaluation often lies not in benchmark choice but in methodological details: data splitting strategies, metric selection, baseline comparisons, ablation designs, and statistical testing. Chapter 11 catalogs what benchmark tasks exist, how they are constructed, and what capabilities they probe. This chapter addresses the complementary question: given a benchmark, how do we apply it to produce trustworthy results? These principles apply across all benchmark categories, from chromatin state prediction to clinical variant classification. By mastering evaluation methodology, practitioners can distinguish genuine advances from artifacts that will not survive deployment.
The concepts of leakage, confounding, and proper experimental design are subtle but essential. A model developer who masters these principles will avoid the common pitfalls that produce misleading benchmark results. Take time with each section; the investment will pay dividends in every evaluation you conduct.
12.1 Why Random Splits Fail
The standard machine learning recipe calls for randomly partitioning data into training, validation, and test sets. For image classification or sentiment analysis, this approach works well because individual examples are approximately independent. A photograph of a cat shares no special relationship with another photograph of a different cat beyond their common label. Random assignment ensures that training and test distributions match, and performance on the test set provides an unbiased estimate of performance on new examples from the same distribution.
Genomic data violates these assumptions at every level. Consider a protein dataset where the goal is to predict stability from sequence. Proteins in the same family share evolutionary history and often similar structures. If a training set includes beta-lactamase variants from E. coli and the test set includes beta-lactamase variants from Klebsiella, the model may appear to generalize to “new” proteins while actually recognizing sequence patterns it saw during training. The test performance reflects memorization of family-specific features rather than general principles of protein stability.
The problem compounds when sequence identity is high. Two proteins sharing 80% sequence identity will typically have similar structures and functions. A model trained on one and tested on the other is not really being tested on a novel example; it is being asked to interpolate within a region of sequence space it has already explored. Even at 30% sequence identity, the so-called “twilight zone” of homology detection (Rost 1999), proteins often share structural and functional similarities that can be exploited by sufficiently powerful models.
Variant-level data presents analogous challenges. Variants within the same gene share genomic context, and variants affecting the same protein domain share structural environment. Variants from the same individual share haplotype background. Variants from the same population share allele frequency distributions shaped by demographic history. Each of these relationships creates opportunities for models to learn shortcuts that generalize within the training distribution but fail on genuinely novel examples.
The fundamental issue is that genomic data points are not independent. Random splits assume independence; when this assumption is violated, the test set no longer provides an unbiased estimate of generalization. The consequence is systematic overestimation of performance. A model that achieves 0.90 auROC with random splitting might achieve only 0.75 auROC when evaluated on truly held-out examples, with the gap reflecting how much the model learned about biology versus how much it learned about the structure of the training data.
12.2 Homology-Aware Splitting
The solution to homology-driven leakage is to explicitly account for sequence similarity when constructing data splits. Rather than random assignment, examples are clustered by sequence identity, and entire clusters are assigned to training, validation, or test sets. This ensures that no test example is “too similar” to any training example, forcing the model to demonstrate genuine generalization.
12.2.1 Clustering Tools and Workflows
Two tools dominate homology-aware splitting in practice. CD-HIT clusters sequences by greedy incremental clustering, assigning each sequence to an existing cluster if it exceeds a similarity threshold to the cluster representative, or creating a new cluster otherwise (Li and Godzik 2006). Why greedy incremental rather than optimal clustering? Computing all-versus-all pairwise similarities for millions of sequences would require billions of comparisons, making optimal clustering computationally infeasible. The greedy approach processes sequences one at a time, comparing each new sequence only to existing cluster representatives rather than to all sequences. This reduces complexity from quadratic to roughly linear in the number of sequences, enabling practical application to genomic-scale datasets. The tradeoff is that cluster assignments depend on input order and may not be globally optimal, but for splitting purposes this approximation suffices. The algorithm is fast and scales to millions of sequences. For proteins, a typical workflow clusters at 40% sequence identity for stringent splitting or 70% for moderate splitting. For nucleotide sequences, thresholds are typically higher (80-95%): the four-letter alphabet means that unrelated sequences already share roughly 25% identity by chance, and nucleotide alignments lose sensitivity at far higher identity levels than protein alignments, so protein-style thresholds of 30-40% would be meaningless for DNA.
MMseqs2 offers faster clustering with similar sensitivity, becoming essential for large-scale analyses (Steinegger and Söding 2017). The tool supports multiple clustering modes and can handle databases with hundreds of millions of sequences. For foundation model pretraining where deduplication affects billions of sequences, MMseqs2 is often the only practical option.
The choice of identity threshold involves trade-offs:
| Threshold | Proteins | Nucleotides | Trade-off |
|---|---|---|---|
| Stringent | 30-40% | 80-85% | Hardest test; may lack training data |
| Moderate | 50-70% | 85-90% | Balanced; typical benchmark choice |
| Permissive | 80-90% | 95%+ | Retains data but allows some leakage |
Rule of thumb for proteins: Use 40% identity for variant effect prediction, 30% for structure prediction (where even distant homologs share structure).
Rule of thumb for DNA: Use 80% for regulatory prediction, but consider gene-family splits instead of sequence identity alone.
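The split construction itself is simple once clustering is done. Below is a minimal sketch, assuming a two-column cluster table (representative, member) of the kind produced by MMseqs2 or CD-HIT clustering at the chosen identity threshold; whole clusters, never individual sequences, are assigned to splits. The file name in the comment is hypothetical.

```python
import random
from collections import defaultdict

def cluster_split(cluster_tsv, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole clusters to train/val/test so that no test sequence
    exceeds the clustering identity threshold with any training sequence."""
    clusters = defaultdict(list)
    with open(cluster_tsv) as fh:
        for line in fh:
            representative, member = line.rstrip("\n").split("\t")[:2]
            clusters[representative].append(member)

    cluster_ids = sorted(clusters)
    random.Random(seed).shuffle(cluster_ids)

    n_total = sum(len(members) for members in clusters.values())
    train_end, val_end = fractions[0], fractions[0] + fractions[1]

    assignment, assigned = {}, 0
    for cid in cluster_ids:
        share = assigned / n_total
        split = "train" if share < train_end else "val" if share < val_end else "test"
        for member in clusters[cid]:
            assignment[member] = split
        assigned += len(clusters[cid])
    return assignment

# e.g., assignment = cluster_split("proteins_cluster.tsv")  # hypothetical file name
```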
12.2.2 Practical Considerations
Several subtleties affect the quality of homology-aware splits. When one cluster contains half the data and is assigned to training, the remaining clusters may be too small or too biased to serve as representative test sets. This cluster size distribution problem can be mitigated through stratified sampling within clusters or careful balancing across splits, ensuring that test sets contain sufficient examples across the label distribution.
Pairwise clustering can miss hidden relationships that arise through transitive homology. Protein A may share 35% identity with protein B, and protein B may share 35% identity with protein C, yet A and C share only 20% identity. If A is in training and C is in testing, B serves as an indirect bridge that allows information to leak between splits despite no direct high-identity pair spanning them. Why does this transitive leakage matter when A and C are dissimilar? The model does not need to memorize specific sequences; it needs only to learn patterns. Information about C-like sequences can flow through B: patterns learned from B (which is similar enough to A to share features) may transfer to C (which is similar enough to B to benefit from those features). This chain of similarity creates a gradient of information flow even when the endpoints share little direct similarity. Connected component analysis or multi-step clustering can address these transitive relationships, though at increased computational cost.
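A minimal sketch of this idea, assuming a list of pairwise hits (ID pairs whose identity exceeds the splitting threshold) from an all-versus-all search: a union-find merges any sequences connected by a chain of hits, and the resulting groups, not individual sequences, are then assigned to splits.

```python
def homology_groups(pairs, all_ids):
    """Union-find over pairwise homology hits. Sequences linked by any chain
    of hits (A-B, B-C) end up in the same group, closing transitive bridges."""
    parent = {x: x for x in all_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for x in all_ids:
        groups.setdefault(find(x), []).append(x)
    return list(groups.values())

# hits = [("A", "B"), ("B", "C")]                 # A and C share only 20% identity
# homology_groups(hits, ["A", "B", "C", "D"])     # -> [["A", "B", "C"], ["D"]]
```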
Multi-domain proteins complicate whole-protein clustering because different domains may have different evolutionary histories. A protein may share one domain with training proteins and another domain with test proteins. Whether this represents leakage depends on the prediction task: for whole-protein function prediction, a shared domain leaks partial information about the test protein; for domain-level properties, a shared domain is effectively a training example sitting in the test set. Domain-aware splitting assigns domains rather than whole proteins to clusters, though this requires domain annotation that may not always be available.
For genomic (non-protein) sequences, repeat elements and transposable elements create analogous challenges. A model trained to predict chromatin state may learn features of LINE elements that recur throughout the genome. Excluding repetitive regions from evaluation or explicitly accounting for repeat content can clarify what the model has actually learned about regulatory sequences versus repetitive element patterns.
12.3 Splitting by Biological Axis
Beyond sequence homology, genomic data admits multiple axes along which splits can be constructed. The choice of axis determines what kind of generalization is being tested.
You are building a variant pathogenicity predictor for clinical use. You could split your data by: (A) random 80/20, (B) by chromosome, (C) by gene family, or (D) by patient ancestry. Each tests different generalization. Which splitting strategy would best simulate real clinical deployment? Why might you want to try multiple strategies?
Consider: In the clinic, which of these splits most closely resembles encountering a truly novel variant?
Best for clinical deployment simulation: (C) Gene family split, because clinical variants often occur in genes for which you have no training data. A disease gene identified next year may have no close homolog among your training genes, so testing generalization across gene families reveals whether your model learns transferable principles rather than gene-specific patterns.
Why multiple strategies? Each split tests a different failure mode:
- (A) Random overestimates performance by allowing related variants to leak across splits
- (B) Chromosome tests genomic region transfer but does not address gene homology
- (C) Gene family tests whether learned features transfer to novel genes
- (D) Ancestry reveals whether your model works equitably across populations
A model that passes all four strategies is more likely to succeed in deployment than one that only passes random splits. The stricter the split, the lower but more realistic the performance estimate.
12.3.1 Splitting by Individual
For tasks involving human genetic variation, ensuring that data from the same individual (or related individuals) does not appear in both training and test sets is essential. A variant effect predictor trained on variants from person A and tested on other variants from person A may learn individual-specific patterns, such as haplotype structure or ancestry-correlated allele frequencies, that do not generalize to new individuals.
Family structure creates subtler leakage. First-degree relatives share approximately 50% of their genomes identical by descent. Even distant relatives share genomic segments that can be exploited by sufficiently powerful models. Best practice involves computing kinship coefficients across all individuals and either excluding one member of each related pair or assigning entire family clusters to the same split. The UK Biobank provides pre-computed relatedness estimates; other cohorts may require explicit calculation using tools like KING or PLINK.
12.3.2 Splitting by Genomic Region
Chromosome-based splits assign entire chromosomes to training or testing. This approach is common in regulatory genomics, where models trained on chromosomes 1-16 are tested on chromosomes 17-22 (or similar partitions). The advantage is simplicity and reproducibility; the disadvantage is that chromosomes are not independent. Chromosome 6 contains the HLA region with its unusual patterns of variation and selection; chromosome 21 is small and gene-poor; sex chromosomes have distinct biology. Results may vary substantially depending on which chromosomes are held out.
Region-based splits hold out contiguous segments (e.g., 1 Mb windows) distributed across the genome. This provides more uniform coverage than chromosome splits but requires careful attention to boundary effects. If a regulatory element spans the boundary between training and test regions, parts of its context may leak into training.
12.3.3 Splitting by Gene or Protein Family
For variant effect prediction, holding out entire genes or protein families tests whether models learn general principles versus gene-specific patterns. A model that achieves high accuracy by memorizing that TP53 variants are often pathogenic has not demonstrated understanding of mutational mechanisms. Gene-level splits force models to generalize to genes they have never seen, providing stronger evidence of biological insight.
Family-level splits extend this logic to groups of related genes. Holding out all kinases or all GPCRs tests whether models can generalize across evolutionary families. This is particularly stringent for protein structure and function prediction, where family membership strongly predicts properties.
Consider a project to predict whether coding variants cause loss of protein function. You have variants from 1000 genes, with 50-100 variants per gene. Which splitting strategy would you choose, and why?
A. Random split (80/10/10)
B. Chromosome-based (train on chr1-18, test on chr19-22)
C. Gene-based (train on 800 genes, test on 200 held-out genes)
D. Individual-based (split by patient ID)
Consider: What would each split actually test? Which shortcuts could models exploit?
12.3.4 Splitting by Experimental Context
Multi-task models that predict chromatin marks across cell types can be split by cell type rather than genomic position. Training on liver, lung, and brain while testing on heart and kidney assesses whether learned regulatory logic transfers across tissues. This matters because cell-type-specific factors drive much of regulatory variation; a model that has simply learned which regions are accessible in the training cell types may fail on novel cell types even when sequence features should transfer.
Similarly, models can be split by assay type (e.g., training on ATAC-seq, testing on DNase-seq), laboratory (to assess batch effects), or time point (for longitudinal data). Each split tests a different axis of generalization.
12.3.5 Splitting by Ancestry
For human genomic applications, ancestry-stratified evaluation has become essential. Models trained predominantly on European ancestry cohorts often show degraded performance in African, East Asian, South Asian, and admixed populations. This degradation reflects both differences in allele frequency spectra and differences in linkage disequilibrium patterns that affect which variants are informative.
Best practice reports performance separately for each major ancestry group represented in the data. When held-out ancestry groups are available (e.g., training on Europeans and testing on Africans), this provides the strongest test of cross-population generalization. When only European data are available, this limitation should be explicitly acknowledged, and claims about generalization should be appropriately modest. The confounding effects of ancestry on genomic predictions are detailed in Chapter 13.
12.3.6 Splitting by Time
Temporal splits assign data to training and test sets based on when observations were collected, annotations were created, or variants were classified. This strategy tests whether models generalize forward in time, the actual deployment scenario for any predictive system.
For variant pathogenicity prediction, temporal splits are particularly revealing. ClinVar (Section 2.8.1) provides submission dates enabling clean temporal partitioning. Training on ClinVar annotations from 2018 and testing on variants first classified in 2022 asks whether the model can predict labels that did not yet exist during training. This avoids the circularity that arises when training and test labels were assigned by similar processes at similar times. Variants classified more recently may reflect updated curation standards, new functional evidence, or reclassifications of previously uncertain variants; a model that performs well on these genuinely new classifications demonstrates predictive validity rather than recapitulation of historical curation patterns.
Implementing temporal splits requires metadata that many datasets lack. UniProt tracks annotation dates for functional assignments. Clinical cohorts with longitudinal follow-up naturally admit temporal splits based on diagnosis dates. When temporal metadata is unavailable, publication dates of source literature can serve as proxies, though these may not perfectly reflect when information became available to model developers.
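As a sketch, assuming a variant table with a classification-date column (the column name `first_submitted` and the file name are illustrative), the split itself reduces to a cutoff comparison:

```python
import pandas as pd

def temporal_split(df, date_col="first_submitted", cutoff="2019-01-01"):
    """Variants classified before the cutoff are available for training;
    only variants first classified afterwards enter the test set."""
    dates = pd.to_datetime(df[date_col])
    train = df[dates < pd.Timestamp(cutoff)]
    test = df[dates >= pd.Timestamp(cutoff)]
    return train, test

# variants = pd.read_csv("clinvar_variants.csv")  # hypothetical export with dates
# train_df, test_df = temporal_split(variants, cutoff="2019-01-01")
```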
The key limitation of temporal splits is non-stationarity. The distribution of variants classified in 2022 may differ systematically from those classified in 2018, not because biology changed but because research priorities, sequencing technologies, and ascertainment patterns evolved. Performance degradation on temporally held-out data may reflect distribution shift rather than genuine failure to generalize. Combining temporal splits with stratified analysis (performance by variant type, gene category, or evidence strength) helps disentangle these factors.
12.4 Leakage Taxonomy and Detection
Even with careful splitting, leakage can enter evaluations through multiple pathways. A variant effect predictor that achieves 0.95 auROC on held-out test data may be exploiting information that would never exist for truly novel variants, rendering the performance estimate meaningless for clinical deployment. Understanding common leakage patterns helps practitioners design cleaner evaluations and critically assess published results.
Genomic machine learning faces four distinct leakage types, each creating different pathways for inflated performance: label leakage, feature leakage, temporal leakage, and benchmark leakage. These categories are not mutually exclusive; a single evaluation may suffer from multiple forms simultaneously, with compounding effects on apparent performance.
| Leakage Type | Definition | Example | Detection Strategy |
|---|---|---|---|
| Label leakage | Target labels derived from features the model can access | ClinVar classifications informed by SIFT/PolyPhen scores | Compare to baseline using only those features |
| Feature leakage | Input features encode future or target information | Conservation scores for pathogenicity prediction | Ablate suspicious features; measure degradation |
| Temporal leakage | Using future information to predict past | Training on 2023 labels to predict 2020 classifications | Strict temporal splits with date metadata |
| Benchmark leakage | Test set construction influenced by evaluated methods | Benchmark selected proteins with good sequence coverage | Check benchmark construction procedure |
12.4.1 Label Leakage
Label leakage occurs when target labels are derived from information that the model can access through its features. The classic example is training pathogenicity predictors on ClinVar annotations while using sequence features that contributed to those annotations. If ClinVar curators used SIFT and PolyPhen scores when classifying variants, and the new model uses similar sequence features, high performance may reflect recapitulation of curation criteria rather than independent predictive power.
The ClinVar circularity problem represents a particularly insidious form of label leakage. When computational predictions contributed to the pathogenicity classifications that later become training labels, new models learn to replicate their predecessors rather than discover independent signal. This circularity propagates through generations of models, each inheriting and reinforcing the biases of earlier predictors. The circularity problem for classical variant effect predictors is examined in Section 4.5, with broader treatment of how such label contamination creates confounded evaluations in Section 13.2.4.
Expression models face analogous challenges when features and labels are derived from the same samples. For example, if a model predicts gene expression from chromatin accessibility measured in the same cells, and those accessibility measurements were selected or normalized based on expression levels, the model may learn associations that reflect sample-specific technical variation rather than regulatory biology. Such circular dependencies inflate apparent performance because the model exploits information that would not be available when predicting expression in new samples.
12.4.2 Feature Leakage
Feature leakage occurs when input features encode information about the target that would not be available at prediction time. In genomics, conservation scores are a common source. If a model uses PhyloP scores as features and the target is pathogenicity, the model may learn that conserved positions are more likely pathogenic without learning anything about variant-specific biology. This would be appropriate if conservation scores are intended to be part of the prediction pipeline, but problematic if the goal is to develop a model that predicts pathogenicity from sequence alone.
Similarly, population allele frequency encodes selection pressure. A model that learns “rare variants are more likely pathogenic” has discovered a useful heuristic but not necessarily mechanistic understanding. Whether this counts as leakage depends on the intended use case. For clinical variant interpretation where allele frequency is always available, exploiting this feature is appropriate. For understanding variant biology, it may mask whether the model has learned anything beyond frequency-based priors.
Feature leakage also arises when features encode information about data partitions or batch structure rather than biology. Coverage patterns that differ systematically between cases and controls, quality metrics that correlate with sequencing center, or variant density profiles that reflect caller-specific behavior all constitute feature leakage of this form.
12.4.3 Temporal Leakage
Temporal leakage violates the causal structure of prediction by using future information to predict past events. A model trained on ClinVar annotations from 2023 and tested on variants that were uncertain in 2020 may perform well because: (1) the 2023 annotations were informed by model-like predictions made after 2020, creating circular validation, and (2) the model learns from reclassification patterns rather than intrinsic variant properties. In both cases, the model exploits the trajectory of scientific knowledge rather than underlying biology.
Clinical outcome prediction faces similar risks when laboratory values, imaging results, or clinical notes recorded after the prediction timepoint enter the feature set. A model predicting 30-day mortality that includes vital signs from day 15 has trivial access to outcome-correlated information. Proper temporal splits must respect not only when samples were collected but when each feature became available.
12.4.4 Benchmark Leakage
Benchmark leakage occurs when test set construction was influenced by methods similar to those being evaluated. If a protein function benchmark was created by selecting proteins with high-confidence annotations, and those annotations were partly derived from sequence similarity searches, sequence-based models may perform well by exploiting the same similarity that guided benchmark construction.
Foundation models face particular challenges with benchmark leakage. If a DNA language model is pretrained on all publicly available genomic sequence including ENCODE data, and then evaluated on ENCODE-derived benchmarks, the pretraining has exposed the model to information about the test distribution even if specific test examples were held out. The model may have learned statistical patterns in ENCODE data that transfer to ENCODE benchmarks without reflecting genuine biological understanding.
This form of leakage is especially difficult to detect because it operates at the level of distributional overlap rather than specific example memorization. A model that has never seen a particular test sequence may still have learned the statistical regularities that make that sequence predictable within the benchmark distribution.
12.4.5 Detecting Leakage
Several strategies help detect leakage, though none provides definitive proof of its absence. These approaches complement each other; rigorous evaluation employs multiple strategies, recognizing that each catches different leakage pathways while remaining blind to others.
When evaluating your own model or reviewing published results, work through these detection strategies:
- Baseline analysis: Does a simple model using only potentially leaky features (allele frequency, conservation) achieve similar performance?
- Feature ablation: Does removing suspicious features cause dramatic performance drops?
- Confounder analysis: Does performance remain after conditioning on potential confounders (gene, ancestry, batch)?
- Temporal validation: Does performance hold on prospectively collected data?
- Overlap audit: Has the overlap between pretraining data and benchmark test sets been documented and checked?
Simple models that could not plausibly have learned biology provide an essential baseline analysis. If a linear model using only allele frequency achieves 0.80 auROC on a pathogenicity benchmark, and a sophisticated deep model achieves 0.82, the marginal improvement may not justify claims of biological insight. The deep model’s performance is bounded by what simple confounders already explain.
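The comparison is cheap to run. A minimal sketch on simulated data (the frequency-pathogenicity relationship below is invented for illustration): a logistic regression on log allele frequency alone sets the bar that any sophisticated model must clear.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
log_af = rng.uniform(-6, -1, size=n)                  # log10 allele frequency
p_path = 1 / (1 + np.exp(2.0 * (log_af + 3.5)))       # rarer -> more often pathogenic (toy model)
y = rng.binomial(1, p_path)

X_tr, X_te, y_tr, y_te = train_test_split(log_af.reshape(-1, 1), y, random_state=0)
baseline = LogisticRegression().fit(X_tr, y_tr)
print("frequency-only auROC:", roc_auc_score(y_te, baseline.predict_proba(X_te)[:, 1]))
# Any deep model should be judged against this number, not against chance.
```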
Systematic feature ablation removes potentially leaky features and measures performance degradation. If removing conservation scores causes a 20-point drop in auROC, the model was heavily dependent on conservation rather than learning independent predictors. This approach identifies which features drive performance but cannot distinguish legitimate signal from leakage without domain knowledge about what information should be available at prediction time.
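A sketch of the ablation loop, assuming a tabular feature matrix; the column names in the example call are hypothetical. The same model is retrained with each named feature group removed and the auROC drop recorded.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def ablation_auroc(X: pd.DataFrame, y, feature_groups: dict, seed=0):
    """Retrain the same model with each named feature group removed and
    report the auROC attributable to the remaining features."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed, stratify=y)
    full = GradientBoostingClassifier(random_state=seed).fit(X_tr, y_tr)
    results = {"full model": roc_auc_score(y_te, full.predict_proba(X_te)[:, 1])}
    for name, cols in feature_groups.items():
        keep = [c for c in X.columns if c not in cols]
        model = GradientBoostingClassifier(random_state=seed).fit(X_tr[keep], y_tr)
        results[f"without {name}"] = roc_auc_score(y_te, model.predict_proba(X_te[keep])[:, 1])
    return results

# Example grouping (hypothetical column names):
# ablation_auroc(X, y, {"conservation": ["phylop", "phastcons"],
#                       "frequency": ["gnomad_af"]})
```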
Explicit confounder analysis models potential confounders and tests whether model predictions remain informative after conditioning. If a variant effect predictor’s scores become non-predictive after controlling for gene length and expression level, the model may have learned gene-level confounders rather than variant-level effects. Chapter 13 examines how leakage relates to these broader confounding structures.
Temporal validation evaluates models on data collected after the training data was frozen. If performance degrades substantially on newer data, the model may have been fitted to temporal artifacts in the original dataset. This approach is particularly valuable for detecting temporal leakage but requires access to prospectively collected data.
Finally, overlap auditing explicitly checks for sequence or sample overlap between pretraining corpora and evaluation benchmarks. For foundation models, this requires documenting pretraining data composition and comparing against benchmark construction procedures. The audit may reveal that apparent generalization is actually interpolation within seen distributions.
12.5 Benchmark Circularity
A pervasive problem in variant effect prediction benchmarks is circularity: the labels used for evaluation were often derived using methods that share information with the models being evaluated. Grimm et al. (2015) demonstrated this problem systematically, showing that apparent performance advantages of some methods disappeared when evaluation properly accounted for data contamination.
The circularity problem manifests in several forms:
- Label circularity: Pathogenicity labels derived from SIFT/PolyPhen scores advantage methods using similar conservation features
- Temporal leakage: Training on variants classified after model development inflates apparent performance
- Gene-level leakage: Variants in the same gene as training examples may share confounding features
When evaluating VEP methods, always ask: how were the labels generated? Do any label sources overlap with model features? Are training and test variants from the same genes? Apparent benchmark leaders may reflect circularity rather than genuine predictive improvement.
Examine a VEP benchmark you have encountered. How were the pathogenicity labels generated? Can you identify any potential sources of circularity with the prediction methods being evaluated?
12.5.1 Emerging Comprehensive Benchmarks
The limitations of existing benchmarks have motivated development of more rigorous evaluation frameworks. Wang et al. (2025) introduce Genomic Touchstone, a benchmark designed specifically for genomic foundation models with features intended to address circularity:
- Systematic holdout: Gene-level splits substantially reduce within-gene information leakage
- Multi-task evaluation: Performance across diverse tasks is designed to assess generalization rather than task-specific optimization
- Temporal splits: Training cutoffs ensure evaluation on variants classified after model development
Tanigawa et al. (2022) provide complementary resources for comparing foundation model approaches against traditional PRS methods, enabling principled assessment of when deep learning adds value over classical approaches.
For protein variant effect prediction, deep mutational scanning datasets provide gold-standard benchmarks with experimentally measured fitness effects for thousands of variants. Sarkisyan et al. (2016) systematically measured fitness effects of thousands of GFP (green fluorescent protein) variants, creating a dense fitness landscape that tests whether models capture epistatic interactions between mutations. The GFP dataset remains a key benchmark in ProteinGym and related evaluation suites, providing ground-truth fitness values free from the label circularity issues that affect clinical databases.
12.6 Metrics for Genomic Tasks
Metrics quantify model performance but different metrics answer different questions. Choosing appropriate metrics requires clarity about what aspect of performance matters for the intended application.
Before reading about specific metrics, consider: You are evaluating a variant pathogenicity predictor where only 0.5% of variants are truly pathogenic. Your model achieves auROC = 0.92. Is this good? What other information do you need to decide whether to use this model in practice?
Hint: Think about what happens when you apply a threshold to actually flag variants for follow-up.
12.6.1 Discrimination Metrics
For binary outcomes (pathogenic versus benign, bound versus unbound, accessible versus closed), discrimination metrics assess how well the model separates classes. The auROC measures the probability that a randomly selected positive example is ranked above a randomly selected negative example. auROC is threshold-independent and widely reported but can be misleading when classes are highly imbalanced.
The same metrics carry different names across machine learning and clinical statistics literature. When reading papers or communicating across disciplines, these equivalences help:
| Machine Learning | Clinical Statistics | Definition |
|---|---|---|
| Precision | Positive predictive value (PPV) | TP / (TP + FP) |
| Recall | Sensitivity, True positive rate | TP / (TP + FN) |
| Specificity | Specificity, True negative rate | TN / (TN + FP) |
| F1 score | (no common equivalent) | Harmonic mean of precision and recall |
| auROC | AUC, C-statistic | Area under ROC curve |
| auPRC | Average precision | Area under precision-recall curve |
The auPRC better reflects performance when positives are rare. For variant pathogenicity prediction, where perhaps 1% of variants are truly pathogenic, a model achieving 0.95 auROC might still have poor precision at useful recall levels. auPRC directly captures the precision-recall trade-off that matters for applications requiring both high sensitivity and manageable false positive rates.
The distinction between auROC and auPRC reflects a mathematical property with practical consequences:
- auROC is invariant to class imbalance: A model’s auROC remains identical whether 1% or 50% of examples are positive, because it measures pairwise ranking between one positive and one negative.
- This invariance becomes a liability in deployment: the same auROC can correspond to very different false positive burdens depending on how rare positives are, as the example below shows.
Rule of thumb: Report both auROC (for comparison across datasets) and auPRC (for realistic assessment of deployment utility). When in doubt, auPRC is the more honest metric for imbalanced problems.
This same invariance becomes a liability when evaluating for deployment. A model with 0.95 auROC applied to a dataset where 0.1% of variants are pathogenic might flag 100 false positives for every true pathogenic variant at a threshold capturing 80% of positives. The auROC provides no warning of this behavior because it treats a positive-to-negative pair the same regardless of how many negatives exist. For any application where false positives carry real costs (manual curation, clinical follow-up, unnecessary patient anxiety), auROC presents an optimistic picture that collapses upon deployment.
auPRC explicitly accounts for the negative class size. When positives are rare, achieving high precision requires a model that scores the vast majority of negatives lower than positives, not just a typical negative. This makes auPRC sensitive to class imbalance in exactly the way deployment is sensitive to class imbalance. A model moving from a balanced benchmark to a 1000:1 imbalanced application will show stable auROC but declining auPRC, mirroring the actual increase in false discovery rate users will experience. For this reason, auPRC (or equivalently, average precision) should be the primary metric when the deployment class distribution is known and imbalanced.
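The effect is easy to demonstrate with simulated scores in which the score distributions of positives and negatives are held fixed and only the prevalence changes; a minimal sketch:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

def simulate(prevalence, n=200_000):
    """Identical score distributions for both classes; only the balance changes."""
    y = rng.binomial(1, prevalence, size=n)
    scores = np.where(y == 1, rng.normal(2.0, 1.0, n), rng.normal(0.0, 1.0, n))
    return roc_auc_score(y, scores), average_precision_score(y, scores)

for prev in (0.5, 0.01, 0.001):
    auroc, auprc = simulate(prev)
    print(f"prevalence={prev:<6}: auROC={auroc:.3f}  auPRC={auprc:.3f}")
# auROC stays near 0.92 at every prevalence; auPRC collapses as positives become rare.
```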
Threshold-dependent metrics including sensitivity, specificity, positive predictive value, and negative predictive value require specifying a decision threshold. These metrics are more interpretable for specific use cases (e.g., “the model identifies 90% of pathogenic variants while flagging only 5% of benign variants as false positives”) but require choosing thresholds that may not generalize across settings.
12.6.2 Regression and Correlation Metrics
For continuous predictions (expression levels, effect sizes, binding affinities), correlation metrics assess agreement between predicted and observed values. Pearson correlation measures linear association; Spearman correlation measures rank association and is robust to nonlinear relationships. The coefficient of determination (\(R^2\)) measures variance explained, though interpretation requires care when baseline performance is near zero.
For predictions at genomic scale (e.g., predicted versus observed expression across thousands of genes), these metrics may obscure important patterns. A model might achieve high genome-wide correlation by correctly predicting which genes are highly expressed while failing on the genes where predictions matter most. Task-specific stratification, such as correlation within expression quantiles or among disease-relevant genes, provides more actionable information.
12.6.3 Ranking and Prioritization Metrics
Many genomic workflows care about ranking rather than absolute prediction. Variant prioritization pipelines rank candidates for follow-up; gene prioritization ranks targets for experimental validation. Top-k recall measures the fraction of true positives captured in the top \(k\) predictions. Enrichment at k compares the true positive rate in the top \(k\) to the background rate. Normalized discounted cumulative gain (NDCG) weights ranking quality by position, penalizing relevant items placed lower in the list more than those placed near the top. Why penalize by position rather than treating all rankings equally? The cost of ranking a true positive lower depends on where it falls in the list. A pathogenic variant ranked 5th instead of 1st will still be found quickly; the same variant ranked 500th instead of 495th has already been effectively lost in the noise. NDCG captures this intuition through logarithmic discounting: moving an item from position 10 to position 2 improves the score more than moving it from position 100 to position 92, because early ranks matter more for workflows with finite follow-up capacity.
These metrics align with how predictions are actually used. If experimental capacity permits validating only 20 variants per locus, top-20 recall matters more than global auROC. Reporting both global metrics and rank-aware metrics at relevant cutoffs provides a complete picture.
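A short sketch of rank-aware metrics on simulated scores; scikit-learn's `ndcg_score` expects two-dimensional arrays, hence the reshapes, and the score distributions are invented for illustration.

```python
import numpy as np
from sklearn.metrics import ndcg_score

def top_k_recall(y_true, scores, k):
    """Fraction of all true positives captured in the top-k predictions."""
    order = np.argsort(scores)[::-1]
    return y_true[order[:k]].sum() / max(y_true.sum(), 1)

rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.02, size=1000)                         # ~20 true positives
scores = y_true * rng.normal(1.5, 1.0, 1000) + rng.normal(0, 1.0, 1000)

print("top-20 recall:", top_k_recall(y_true, scores, k=20))
print("NDCG@20:      ", ndcg_score(y_true.reshape(1, -1), scores.reshape(1, -1), k=20))
```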
12.6.4 Clinical Utility Metrics
For clinical applications, discrimination and calibration are necessary but not sufficient. Decision curves plot net benefit across decision thresholds, where net benefit weighs the value of true positives against the cost of false positives at each threshold. A model may achieve high auROC but offer no net benefit at clinically relevant thresholds if it fails to discriminate in the region where decisions are actually made.
Net reclassification improvement (NRI) measures how often adding genomic features to a clinical model changes risk classifications in the correct direction. This directly addresses whether genomics adds clinical value beyond existing predictors. Chapter 28 provides detailed treatment of clinical evaluation frameworks.
12.7 Baseline Selection
Baseline comparisons determine the meaning of reported performance. A model achieving 0.85 auROC might represent a major advance if the best prior method achieved 0.70, or a trivial improvement if simple heuristics achieve 0.83. Choosing appropriate baselines is as important as choosing appropriate metrics.
12.7.1 Strong Baselines, Not Straw Men
The temptation to compare against weak baselines inflates apparent contributions. A deep learning model compared against a naive prior or a deliberately crippled baseline will appear impressive regardless of whether it offers genuine value. Strong baselines force honest assessment of improvement.
For sequence-based predictions, position weight matrices (PWMs) and k-mer logistic regression provide classical baselines that capture sequence composition without deep learning. If a convolutional model barely outperforms logistic regression on k-mer counts, the convolutional architecture may not be contributing as much as claimed.
For variant effect prediction, simple features like allele frequency, conservation scores, and amino acid properties provide baselines that any sophisticated model should substantially exceed. CADD (Section 4.3) serves as a well-calibrated baseline that combines many hand-crafted features; outperforming CADD demonstrates that learning provides value beyond feature engineering.
For foundation models, comparisons should include both randomly initialized models of similar architecture (to isolate the value of pretraining) and simpler pretrained models (to isolate the value of scale or architectural innovations). Claiming that pretraining helps requires demonstrating improvement over training from scratch on the same downstream data.
For each scenario, identify the appropriate baseline:
1. A new DNA language model claims to predict TF binding sites better than previous approaches. What baselines should it beat?
2. A variant pathogenicity predictor claims state-of-the-art performance. What would a “straw man” comparison look like, and what would a rigorous comparison include?
3. A foundation model claims that pretraining improves downstream performance. What comparison demonstrates the value of pretraining specifically?
Answers:
1. It should beat PWM/k-mer baselines, the best current deep learning model (like Enformer or DNABERT-2), and a randomly initialized model of the same architecture to show pretraining value.
2. A straw man would compare only to outdated methods or use different data; a rigorous comparison includes current best tools (CADD, REVEL, AlphaMissense), simple baselines (conservation scores, allele frequency), and ablations testing each component’s contribution.
3. Compare against a randomly initialized model with identical architecture trained from scratch on the same downstream task data; this isolates whether gains come from pretraining rather than just model size or architecture.
12.7.2 Historical Baselines and Progress Tracking
Comparing to methods from five years ago may demonstrate progress but overstates the contribution of any single method. Comparisons should include the best currently available alternatives, not just historically important ones. When prior work is not directly comparable (different data, different splits, different metrics), reimplementing baselines on common benchmarks provides fairer comparison.
Field-wide progress tracking benefits from persistent benchmarks with frozen test sets. Once test set results for a benchmark are published, that benchmark becomes less useful for future model development because the test set is no longer truly held out. Periodic benchmark refresh with new held-out data helps maintain evaluation integrity.
12.7.3 Non-Deep-Learning Baselines
Deep learning models should be compared against strong non-deep alternatives. Gradient-boosted trees, random forests, and regularized linear models often achieve competitive performance with far less computation. If a 100-million-parameter transformer barely outperforms XGBoost on tabular features, the complexity may not be justified.
This comparison is especially important for clinical deployment, where simpler models may be preferred for interpretability, computational efficiency, or regulatory approval. Demonstrating that deep learning provides substantial gains over strong non-deep baselines strengthens the case for adoption.
12.8 Experimental Design Principles
Rigorous evaluation requires applying principles from the design of experiments (DoE), a statistical framework for extracting maximum information from limited experimental resources.
12.8.1 Factorial Thinking for Ablations
When evaluating model components, researchers often vary one factor at a time. This misses interaction effects that can dominate performance.
Full Factorial Design. To study \(k\) binary factors (e.g., “use pretrained weights,” “use data augmentation”), a full factorial requires \(2^k\) experiments but reveals all interactions.
Example: Evaluating DNA Language Model Adaptation
| Pretrained | LoRA | Augmentation | Performance |
|---|---|---|---|
| No | No | No | 0.72 |
| Yes | No | No | 0.81 |
| No | Yes | No | 0.73 |
| No | No | Yes | 0.75 |
| Yes | Yes | No | 0.84 |
| Yes | No | Yes | 0.83 |
| No | Yes | Yes | 0.76 |
| Yes | Yes | Yes | 0.89 |
Without factorial design, varying one factor at a time from the default configuration misses the interaction structure: LoRA and augmentation add only +0.01 and +0.03 on their own, yet the full combination reaches 0.89, above the 0.85 that adding the individual effects to the baseline would predict.
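Main effects and interaction checks can be read directly off the factorial table; a brief sketch using the numbers above:

```python
import numpy as np

# Results of the 2^3 factorial above, keyed by (pretrained, LoRA, augmentation).
results = {
    (0, 0, 0): 0.72, (1, 0, 0): 0.81, (0, 1, 0): 0.73, (0, 0, 1): 0.75,
    (1, 1, 0): 0.84, (1, 0, 1): 0.83, (0, 1, 1): 0.76, (1, 1, 1): 0.89,
}
names = ("pretrained", "LoRA", "augmentation")

# Main effect: mean performance with the factor on minus mean with it off.
for i, name in enumerate(names):
    on = np.mean([v for k, v in results.items() if k[i] == 1])
    off = np.mean([v for k, v in results.items() if k[i] == 0])
    print(f"main effect of {name}: {on - off:+.3f}")

# Additive prediction for the all-on configuration: baseline plus each factor's
# individual gain. The observed value exceeding it indicates interaction.
baseline = results[(0, 0, 0)]
additive = baseline + sum(
    results[tuple(int(j == i) for j in range(3))] - baseline for i in range(3)
)
print(f"additive prediction: {additive:.2f}, observed: {results[(1, 1, 1)]:.2f}")
```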
12.8.2 Fractional Factorial for Efficiency
When full factorial is too expensive (e.g., \(2^{10} = 1024\) experiments for 10 hyperparameters), fractional factorial designs test a strategic subset.
Resolution III designs (e.g., \(2^{10-6} = 16\) experiments) estimate main effects but confound them with two-way interactions. Use when screening many hyperparameters, interactions are expected to be small, or computational budget is severely limited.
12.8.3 Blocking and Confounding Control
Blocking groups experiments to control known sources of variation:
Example: Multi-GPU Training
- Confound: Different GPUs have different memory/speed characteristics
- Block: Ensure each model variant runs on each GPU type
- Analyze: Use GPU as a blocking factor in statistical analysis
Example: Random Seed Variation
- Confound: Training stochasticity affects results
- Block: Run each configuration with multiple seeds (typically 3-5)
- Report: Mean ± standard deviation across seeds
If you test “pretraining” and “larger model” together (both on or both off), you cannot determine which factor caused improvement. The design aliases these main effects.
Always verify your experimental design does not alias effects you need to distinguish.
12.9 Ablation Studies
Ablation studies systematically remove or modify model components to understand their contributions. Where baselines compare across methods, ablations investigate within a method, revealing which design choices actually matter.
12.9.1 Component Isolation
Standard ablations remove individual components: attention layers, skip connections, normalization schemes, specific input features. If removing attention heads causes minimal performance degradation, the model may not be exploiting long-range dependencies as claimed. If removing a particular input modality has no effect, that modality may not be contributing useful information.
Ablations should be designed to test specific hypotheses. If the claim is that a foundation model learns biologically meaningful representations, ablating pretraining (comparing to random initialization) directly tests this claim. If the claim is that cross-attention between modalities enables integration, ablating cross-attention while retaining separate encoders tests whether integration provides value.
12.9.2 Hyperparameter Sensitivity
Reporting performance across hyperparameter ranges reveals robustness. A model that achieves state-of-the-art performance only at a narrow learning rate range with specific regularization may be overfit to the evaluation setup. Consistent performance across reasonable hyperparameter variations provides stronger evidence of genuine capability.
12.9.3 Architecture Search Confounds
When model development involves extensive architecture search, reported performance conflates the value of the final architecture with the value of search on the validation set. The validation set is no longer truly held out; it has been used to select among architectures. Final evaluation on a completely untouched test set, with the architecture fixed before test set examination, provides cleaner assessment.
12.9.4 Reporting Standards
Ablation tables should clearly indicate what was changed in each condition, the number of random seeds or runs, and measures of variance. Single-run ablations can produce misleading results due to training stochasticity. Reporting means and standard deviations across multiple runs reveals whether observed differences exceed random variation.
12.10 Statistical Rigor
Performance differences between models may reflect genuine capability differences or random variation in training and evaluation. Statistical analysis distinguishes signal from noise.
Readers from biostatistics may find this chapter’s evaluation paradigm unfamiliar. The distinction traces to what Leo Breiman called the “two cultures” of statistical modeling (Breiman 2001).
| Aspect | Inferential (Classical Statistics) | Predictive (Machine Learning) |
|---|---|---|
| Primary goal | Estimate parameters, test hypotheses | Generalize to new data |
| Key question | “Is this effect significant?” | “How well does this predict?” |
| Validation | p-values, confidence intervals | Held-out test sets, cross-validation |
| Model preference | Interpretable (linear, logistic) | Whatever predicts best |
| Typical application | GWAS effect sizes, clinical trials | Foundation models, risk prediction |
Inferential modeling asks whether an observed relationship is “real” (unlikely under the null hypothesis). A GWAS reports effect sizes with p-values because the goal is understanding which variants associate with disease and estimating their effects. Model complexity is constrained to enable interpretation: a coefficient in logistic regression has meaning; a weight in a neural network does not.
Predictive modeling asks whether a model generalizes beyond training data. A variant effect predictor reports auROC on held-out variants because the goal is accurate classification, not mechanistic understanding. Model complexity is constrained only by overfitting: if a billion-parameter transformer predicts better, use it.
Foundation models are fundamentally predictive. They optimize for generalization, not inference. This explains why this chapter emphasizes held-out test performance rather than hypothesis testing, and why readers trained in classical biostatistics may need to shift their evaluation intuitions.
The cultures are not opposed. Mendelian randomization (Section 26.2.2) uses predictive models (genetic instruments) to answer inferential questions (causal effects). Polygenic risk scores use inferential results (GWAS effect sizes) for predictive applications. But understanding which culture a method belongs to clarifies what its evaluation metrics actually measure.
The evaluation frameworks in this chapter draw from Steyerberg’s Clinical Prediction Models framework (Steyerberg 2019), which distinguishes three core performance dimensions:
- Discrimination: The model’s ability to separate classes (c-statistic, auROC)
- Calibration: Agreement between predicted probabilities and observed frequencies
- Clinical utility: Net benefit at actionable decision thresholds
Foundation models must satisfy these requirements if deployed clinically, as the core evaluation framework remains invariant to underlying architecture. A model with perfect discrimination but poor calibration provides unreliable probability estimates; one with good calibration but poor discrimination cannot stratify risk meaningfully. Clinical deployment requires both.
The TRIPOD statement (Collins et al. 2015) provides reporting guidelines ensuring evaluation results are complete and reproducible. When preparing manuscripts describing genomic foundation model evaluations, TRIPOD compliance should be the minimum standard.
12.10.1 Significance Testing
For classification metrics, significance tests ask whether observed differences exceed what would be expected from sampling variation. Bootstrap confidence intervals resample the test set with replacement, recompute metrics on each resample, and report the distribution of metric values. Non-overlapping 95% confidence intervals indicate a significant difference, though overlapping intervals do not rule one out; a confidence interval on the paired difference between models is more direct. Permutation tests shuffle predictions between the two models and measure how often the shuffled differences exceed the observed difference.
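A paired bootstrap sketch for the auROC difference between two models scored on the same test set:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auroc_diff(y, scores_a, scores_b, n_boot=2000, seed=0):
    """Paired bootstrap: resample test examples with replacement and recompute
    the auROC difference between two models on each resample."""
    rng = np.random.default_rng(seed)
    y, scores_a, scores_b = map(np.asarray, (y, scores_a, scores_b))
    diffs, n = [], len(y)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        if y[idx].min() == y[idx].max():     # resample lacked one class; skip
            continue
        diffs.append(roc_auc_score(y[idx], scores_a[idx])
                     - roc_auc_score(y[idx], scores_b[idx]))
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return float(np.mean(diffs)), (float(lo), float(hi))

# A 95% interval excluding zero suggests the difference is unlikely to be
# explained by test-set sampling variation alone.
```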
12.10.2 Multiple Testing Corrections
When comparing models across multiple benchmarks or testing multiple hypotheses, naive p-value thresholds inflate false positive rates.
The Problem. Testing \(m\) independent hypotheses at \(\alpha = 0.05\) yields:
\[P(\text{at least one false positive}) = 1 - (1-\alpha)^m \approx 1 - e^{-m\alpha}\]
For \(m = 20\) benchmarks: approximately 64% chance of at least one false positive.
Bonferroni Correction. The simplest fix divides the significance threshold by the number of tests: use \(\alpha/m\) as threshold.
- Pros: Controls family-wise error rate (FWER)
- Cons: Very conservative, high false negative rate
Benjamini-Hochberg (BH) Procedure. Controls false discovery rate (FDR), the expected proportion of false positives among rejections (Benjamini and Hochberg 1995):
- Order p-values: \(p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(m)}\)
- Find largest \(k\) where \(p_{(k)} \leq \frac{k}{m} \cdot \alpha\)
- Reject hypotheses \(1, \ldots, k\)
BH is less conservative than Bonferroni and appropriate when some false positives are tolerable (typical in benchmark comparisons).
Practical Recommendations:
- Few comparisons (≤5): Bonferroni is acceptable
- Many comparisons (>5): Use BH with FDR = 0.05 or 0.10
- Dependent tests: Use Benjamini-Yekutieli (BY) procedure
- Always report: Number of tests performed and correction method
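Both corrections described above are available in standard libraries. A sketch with statsmodels on illustrative p-values (not real benchmark results):

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Illustrative p-values from comparing a new model to a baseline on 20 benchmarks.
pvals = np.array([0.001, 0.004, 0.008, 0.012, 0.03, 0.04, 0.045, 0.05,
                  0.07, 0.11, 0.15, 0.21, 0.26, 0.31, 0.38, 0.44,
                  0.52, 0.61, 0.74, 0.88])

reject_bonf, _, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
reject_bh, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

print("significant after Bonferroni:", reject_bonf.sum())
print("significant after BH (FDR 5%):", reject_bh.sum())
# BH typically retains more discoveries than Bonferroni at the same nominal level.
```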
12.10.3 Effect Sizes
Statistical significance does not imply practical significance. A difference of 0.001 auROC might be statistically significant with millions of test examples while being practically meaningless. Effect sizes quantify the magnitude of differences independent of sample size. Cohen’s d for continuous outcomes and odds ratios for binary outcomes provide standardized measures of effect magnitude.
Reporting both significance tests and effect sizes provides complete information. A result that is statistically significant with a tiny effect size warrants different interpretation than one that is significant with a large effect size.
12.10.4 Confidence Intervals on Metrics
Point estimates of auROC or correlation should be accompanied by confidence intervals. DeLong’s method provides analytical confidence intervals for auROC (DeLong, DeLong, and Clarke-Pearson 1988); bootstrap methods provide distribution-free intervals for any metric. Reporting “auROC = \(0.85\) (95% CI: \(0.82\)–\(0.88\))” is more informative than “auROC = \(0.85\)” alone.
12.10.5 Power Analysis: How Many Samples?
Before running expensive experiments, power analysis determines whether you have enough data to detect meaningful effects.
Key Quantities:
- Effect size (\(d\)): Standardized difference you want to detect
- Power (\(1-\beta\)): Probability of detecting a true effect (typically 0.80)
- Significance level (\(\alpha\)): False positive rate (typically 0.05)
- Sample size (\(n\)): Number of test examples or experimental runs
For Model Comparison on a Shared Test Set (model A vs B):
A two-sample t-test approximation gives the sample size required to detect a standardized effect size \(d\):
\[n \approx \frac{2(z_{1-\alpha/2} + z_{1-\beta})^2}{d^2}\]
Because both models score the same examples, a paired analysis of per-example differences is usually more powerful; when the two models' scores are highly correlated, it needs substantially fewer examples than this conservative approximation suggests.
| Effect Size | Interpretation | Required \(n\) (power=0.80) |
|---|---|---|
| 0.2 (small) | ~1% AUC difference | ~400 |
| 0.5 (medium) | ~3% AUC difference | ~64 |
| 0.8 (large) | ~5% AUC difference | ~26 |
Implications for Genomic Benchmarks:
- Detecting small improvements requires large test sets
- With only 100 test variants, you can reliably detect only large effects
- Stratified analysis (by variant type, gene) reduces effective sample size
Before claiming “Model A outperforms Model B”:
- Estimate the effect size from your results
- Calculate post-hoc power given your test set size
- If power < 0.80, conclusions are unreliable; get more data or report as “suggestive but underpowered”
Tools: statsmodels.stats.power (Python), pwr package (R)
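A short sketch using statsmodels to reproduce the sample sizes in the table above and to compute post-hoc power; `TTestIndPower` applies the two-sample approximation, while a genuinely paired analysis would use `TTestPower` on standardized per-example differences (the `nobs1=150` in the post-hoc example is hypothetical).

```python
# Minimal sketch using statsmodels; the effect sizes mirror the table above.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.2, 0.5, 0.8):
    # Solve for the sample size per group at alpha=0.05 and power=0.80.
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80)
    print(f"d = {d}: n per group = {n:.0f}")

# Post-hoc power for an observed effect on an existing test set (hypothetical n).
power = analysis.solve_power(effect_size=0.3, nobs1=150, alpha=0.05)
print(f"power = {power:.2f}")
```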
12.10.6 Variance Across Random Seeds
Deep learning models are sensitive to initialization and optimization stochasticity. Training the same architecture with different random seeds can produce substantially different results. Best practice trains multiple runs and reports means and standard deviations. If the standard deviation across runs exceeds the difference between methods, claimed improvements may not be reproducible.
12.11 Evaluating Foundation Models
Genomic foundation models (Chapter 14) admit multiple evaluation paradigms, each testing different aspects of learned representations.
12.11.1 Zero-Shot Evaluation
In zero-shot evaluation, the pretrained model is applied without any task-specific training (Section 10.6.2 introduces the conceptual foundations). For masked language models, this typically means using token probabilities to score variants or classify sequences. A variant that disrupts a position the model predicts with high confidence may indicate functional importance.
Zero-shot performance tests whether pretraining captures task-relevant structure without explicit supervision. Strong zero-shot performance suggests the pretraining objective aligned with the evaluation task; weak zero-shot performance suggests misalignment. Comparing zero-shot performance to simple baselines (e.g., conservation scores for variant effects) calibrates whether the foundation model provides value beyond what simpler approaches achieve.
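As an illustration of token-probability scoring, a minimal sketch assuming an already-loaded Hugging Face masked language model and tokenizer with single-nucleotide tokens and one leading special token; the helper name, the position offset, and the tokenization assumptions are all illustrative, and real genomic models that use k-mer or BPE tokens require mapping the variant onto tokens rather than individual bases.

```python
# Minimal sketch of zero-shot variant scoring with a masked language model,
# assuming `model` and `tokenizer` are an already-loaded AutoModelForMaskedLM
# and its tokenizer with single-nucleotide tokens (an assumption; many genomic
# LMs use k-mer or BPE tokenization and need a token-level mapping instead).
import torch

def llr_score(sequence, position, ref, alt, model, tokenizer):
    """Log-likelihood ratio log P(alt) - log P(ref) at a masked position.
    Negative values mean the model finds the alternate allele less likely,
    often interpreted as evidence of functional disruption."""
    tokens = tokenizer(sequence, return_tensors="pt")
    # Offset by 1 assuming a [CLS]-style special token at the start (model-dependent).
    masked_index = position + 1
    tokens["input_ids"][0, masked_index] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(**tokens).logits
    log_probs = torch.log_softmax(logits[0, masked_index], dim=-1)
    ref_id = tokenizer.convert_tokens_to_ids(ref)
    alt_id = tokenizer.convert_tokens_to_ids(alt)
    return (log_probs[alt_id] - log_probs[ref_id]).item()
```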
12.11.2 Linear Probing
Linear probing freezes the foundation model and trains only a linear classifier on extracted embeddings. This isolates representation quality from fine-tuning capacity. If a linear probe on foundation model embeddings substantially outperforms a linear probe on random embeddings, the foundation model has learned useful features.
Layer-wise probing reveals where information is encoded. Early layers may capture local sequence features while later layers capture more abstract patterns. If the information needed for a task is extractable from early layers, the model may not require the full depth of the architecture for that application.
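A minimal sketch of a linear probe with its random-embedding control, assuming a precomputed embedding matrix `X_fm` of shape (n_examples, d) and binary labels `y` (both names are assumptions for illustration).

```python
# Minimal sketch of linear probing: train only a linear classifier on frozen
# embeddings and compare against a random-feature control of matched shape.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_auroc(X, y):
    clf = LogisticRegression(max_iter=2000)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

rng = np.random.default_rng(0)
X_random = rng.standard_normal(X_fm.shape)   # control: random features, same shape

print(f"foundation embeddings: {probe_auroc(X_fm, y):.3f}")
print(f"random embeddings:     {probe_auroc(X_random, y):.3f}")
```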
12.11.3 Fine-Tuning Evaluation
Full fine-tuning adapts all model parameters to the downstream task. This provides the best performance but conflates representation quality with adaptation capacity. A foundation model might achieve high fine-tuned performance through the capacity of its architecture rather than the quality of its pretrained representations.
Comparing fine-tuned foundation models to equivalently architected models trained from scratch isolates the value of pretraining. If both approaches converge to similar performance given sufficient downstream data, pretraining provides label efficiency (less data needed to reach a given performance level) rather than improved final performance. Data efficiency curves, plotting performance against downstream training set size, reveal this trade-off.
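A minimal sketch of a data-efficiency curve at the probing level, assuming downstream features `X` and labels `y` (illustrative names); a fine-tuning curve follows the same pattern with a training loop in place of the logistic regression fit.

```python
# Minimal sketch: performance of a frozen-embedding classifier as a function
# of downstream training set size, holding the test split fixed.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

for n in (100, 300, 1000, 3000, len(y_train)):
    n = min(n, len(y_train))
    clf = LogisticRegression(max_iter=2000).fit(X_train[:n], y_train[:n])
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"n_train = {n:5d}  auROC = {auc:.3f}")
```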
12.11.4 Transfer Across Tasks
Foundation models justify their “foundation” designation by transferring to diverse downstream tasks (Chapter 9 covers the theoretical foundations of transfer learning). Evaluating on a single task, however well-designed, cannot assess breadth of transfer. Multi-task evaluation across regulatory prediction, variant effects, protein properties, and other applications reveals whether foundation models provide general-purpose representations or excel only on tasks similar to their pretraining objective.
Transfer across species, tissues, and experimental modalities provides additional evidence of generalization. A DNA language model that transfers from human to mouse, or from blood cells to neurons, demonstrates that its representations capture biological principles rather than species-specific or tissue-specific patterns.
12.12 Calibration Essentials
Strong discrimination does not guarantee useful probability estimates. A model achieving 0.95 auROC might assign probability 0.99 to all positive examples and 0.98 to all negatives, ranking perfectly while providing meaningless confidence values. Clinical decision-making requires both: accurate ranking to identify high-risk variants and accurate probabilities to inform the weight of computational evidence. Calibration assesses whether predicted probabilities match observed frequencies, a property essential for rational integration of model outputs into diagnostic workflows.
The clinical prediction literature establishes rigorous standards for calibration assessment (Steyerberg 2019). Beyond reliability diagrams and expected calibration error (ECE; both described in Section 12.12.1), calibration can be characterized as:
- Calibration-in-the-large: Does the mean predicted probability equal the observed event rate? (intercept = 0 in calibration regression)
- Calibration slope: Are predictions appropriately spread? (slope = 1 indicates correct spread; slope < 1 indicates overfitting)
- Moderate calibration: Within meaningful risk strata, do predictions match outcomes?
For genomic foundation models using training data enriched for disease (e.g., case-control studies at ~50% prevalence), calibration-in-the-large commonly fails when applied to population screening contexts (1-5% prevalence). Recalibration techniques described in Chapter 24 address this mismatch.
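A minimal sketch of estimating the calibration slope and calibration-in-the-large intercept via logistic recalibration, assuming binary labels `y_true` and predicted probabilities `p_hat` as NumPy arrays (names introduced here for illustration).

```python
# Minimal sketch: calibration slope and calibration-in-the-large via logistic
# recalibration regression on the logit of the predicted probabilities.
import numpy as np
import statsmodels.api as sm

eps = 1e-6
p_clipped = np.clip(p_hat, eps, 1 - eps)
logit_p = np.log(p_clipped / (1 - p_clipped))

# Calibration slope: logistic regression of outcome on logit(p_hat).
slope_model = sm.GLM(y_true, sm.add_constant(logit_p),
                     family=sm.families.Binomial()).fit()
slope = slope_model.params[1]

# Calibration-in-the-large: intercept-only model with logit(p_hat) as an offset.
citl_model = sm.GLM(y_true, np.ones((len(logit_p), 1)), offset=logit_p,
                    family=sm.families.Binomial()).fit()
intercept = citl_model.params[0]

print(f"calibration slope = {slope:.2f} (ideal 1), intercept = {intercept:.2f} (ideal 0)")
```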
12.12.1 Assessing Calibration
The most intuitive assessment comes from reliability diagrams, which plot predicted probabilities against observed frequencies. The construction bins predictions into intervals (commonly ten bins spanning 0 to 0.1, 0.1 to 0.2, and so forth), computes the mean predicted probability within each bin, computes the fraction of positive examples within each bin, and plots these quantities against each other. Perfect calibration produces points along the diagonal; systematic deviations reveal overconfidence (points below the diagonal) or underconfidence (points above).
A single summary statistic, the expected calibration error (ECE), captures miscalibration as the weighted average absolute difference between predicted and observed probabilities across bins. Lower ECE indicates better calibration. The metric depends on binning choices; equal-width bins may place most examples in a few bins for models with concentrated predictions, while equal-mass bins ensure each bin contains the same number of examples but may span wide probability ranges. ECE should be reported alongside reliability diagrams for interpretability.
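A minimal sketch of computing ECE with equal-width bins, returning the per-bin (mean prediction, observed frequency) pairs that a reliability diagram plots; `y_true` and `p_hat` are assumed NumPy arrays.

```python
# Minimal sketch: expected calibration error with equal-width bins, plus the
# per-bin points needed to draw a reliability diagram against the diagonal.
import numpy as np

def expected_calibration_error(y_true, p_hat, n_bins=10):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(p_hat, bins[1:-1]), 0, n_bins - 1)
    ece, curve = 0.0, []
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        conf = p_hat[mask].mean()              # mean predicted probability in bin
        freq = y_true[mask].mean()             # observed positive fraction in bin
        curve.append((conf, freq))
        ece += mask.mean() * abs(conf - freq)  # weight by bin occupancy
    return ece, curve

# ece, curve = expected_calibration_error(y_true, p_hat)
```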
Aggregate calibration metrics can mask important heterogeneity. A model might achieve low aggregate ECE while being systematically overconfident for rare variant classes and underconfident for common ones, with opposite errors canceling in the aggregate statistic. Stratified calibration analysis across ancestry groups, variant classes, and gene categories identifies these disparities. For genomic models intended for diverse populations, subgroup-stratified calibration is not optional; aggregate metrics can mask clinically significant differential performance.
12.12.2 Recalibration Methods
Post-hoc recalibration adjusts predicted probabilities without retraining the underlying model. Methods range from single-parameter approaches like temperature scaling (Guo et al. 2017), which divides logits by a learned constant to compress overconfident distributions, to non-parametric transformations like isotonic regression, which fits a monotonic function mapping raw scores to calibrated probabilities. Platt scaling (Platt 1999) fits a logistic regression from model outputs to true labels, providing intermediate flexibility. Each method makes different assumptions about the structure of miscalibration and requires different amounts of calibration data. The mathematical details, theoretical foundations, and guidance for method selection are developed in Section 24.4.
All recalibration methods require held-out calibration data distinct from both training and test sets. Calibrating on test data and then evaluating calibration on the same test data produces overoptimistic estimates. For foundation models, the calibration set should be drawn from the deployment distribution; calibrating on ClinVar expert-reviewed variants may not transfer to variants in less-studied genes or underrepresented populations.
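A minimal sketch of temperature scaling fit on a held-out calibration split, assuming calibration-set logits `z_cal` and labels `y_cal` plus test logits `z_test` (all names are illustrative); because the transformation is monotonic, ranking metrics such as auROC are unchanged.

```python
# Minimal sketch: temperature scaling for a binary classifier. A single
# temperature T > 0 rescales logits; T is chosen to minimize negative
# log-likelihood on a held-out calibration set, leaving ranking unchanged.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import expit

def nll_at_temperature(T, z, y):
    p = np.clip(expit(z / T), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

result = minimize_scalar(nll_at_temperature, bounds=(0.05, 20.0),
                         method="bounded", args=(z_cal, y_cal))
T_star = result.x

# Apply to test logits; auROC is unchanged, calibration typically improves.
p_test_calibrated = expit(z_test / T_star)
```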
12.12.3 Calibration in Model Comparison
When comparing models, calibration metrics complement discrimination metrics. Two models with identical auROC may have dramatically different calibration, and the better-calibrated model will produce more reliable clinical evidence even though its ranking performance is equivalent. Reporting both discrimination (auROC, auPRC) and calibration (ECE, reliability diagrams) provides a complete picture of model performance.
Calibration can often be improved post-hoc without sacrificing discrimination. Temperature scaling preserves ranking while adjusting probability magnitudes, meaning a model can be recalibrated to improve ECE without changing auROC. This observation suggests that raw discrimination metrics may be more fundamental indicators of model quality, with calibration treated as an adjustable property. The comprehensive treatment of calibration theory is developed in Section 24.3, including its relationship to uncertainty quantification (Section 24.1) and methods for quantifying different sources of prediction uncertainty. Clinical deployment requires additional calibration considerations examined in Section 28.6.2.
12.13 Putting It All Together
When designing or evaluating a genomic model assessment, working through a systematic checklist helps identify gaps and potential problems. The following questions organize this review, though the specific considerations will vary by application.
Use this checklist when designing an evaluation or reviewing published work. It is organized into the following areas, each elaborated in the paragraphs below:
- Data splitting
- Baselines
- Metrics
- Ablations
- Statistical rigor
- Foundation-model-specific considerations
First, consider the level of decision the model is intended to support. A model intended for molecular prediction faces different evaluation requirements than one designed for variant prioritization, patient risk stratification, or clinical action. Metrics should align with the actual decision context: enrichment metrics suit variant ranking, while net benefit matters for clinical decisions.
Second, examine whether data splits adequately prevent leakage. Are individuals, genomic regions, gene families, and ancestries appropriately separated? Has homology-aware clustering been applied with appropriate identity thresholds? Is there any plausible pathway for leakage or circularity through shared labels, features, or distributional overlap?
Third, assess the baseline comparisons. Are comparisons made against the best available alternatives, not just historical or deliberately weak baselines? Do non-deep-learning baselines establish floors that justify architectural complexity? Does the improvement over baselines warrant the additional computational and interpretability costs?
Fourth, evaluate metric selection. Are multiple metrics reported to capture discrimination, calibration, and ranking quality? Are metrics computed with confidence intervals that convey uncertainty? Are subgroup-stratified metrics reported to assess whether performance varies across clinically relevant populations?
Fifth, examine whether ablation studies isolate component contributions. Have systematic ablations demonstrated which design choices drive performance? Is performance robust across hyperparameter ranges and random seeds, or does it depend on specific configurations?
Sixth, consider statistical rigor. Are significance tests applied with appropriate correction for multiple comparisons? Are effect sizes reported alongside p-values to distinguish statistical from practical significance? Are confidence intervals provided for key metrics?
For foundation models specifically, additional considerations apply. Is performance reported across zero-shot, probing, and fine-tuning regimes? Do data efficiency curves reveal where pretraining provides value? Has transfer been tested across diverse tasks to justify the “foundation” designation?
Finally, assess robustness to deployment conditions. How does performance vary across cohorts, platforms, and ancestries? How does the model behave under distribution shift, missing data, or label noise? Would the evaluation translate to realistic deployment scenarios?
This checklist is not exhaustive but covers the most common evaluation pitfalls. Working through it systematically at the design stage can prevent problems that are difficult to fix retrospectively. Reviewers and readers can use the same checklist to critically assess published work.
12.14 The Question Behind the Metric
The question is never simply “what is the auROC?” but rather “what has been demonstrated, and how much should we trust it?” A reported metric summarizes one aspect of model behavior on one dataset under one evaluation protocol. Whether that metric predicts performance in deployment depends on details that standard reporting obscures: how data were split, whether leakage occurred, which subgroups were evaluated, what baselines were compared, and whether statistical conclusions account for multiple comparisons and estimation uncertainty.
The shortcuts that accelerate research in other machine learning domains produce misleading conclusions when applied to genomic data. Random train-test splits ignore homology that creates pseudo-replication. Single-metric comparisons miss failure modes in clinically relevant subgroups. Significance tests without effect sizes conflate statistical and practical importance. Benchmark evaluation without temporal awareness allows indirect leakage through shared community resources. Homology, population structure, batch effects, and label circularity create countless opportunities for self-deception, and genomic data exhibit all of these in abundance.
Rigorous evaluation requires sustained effort at every stage, from experimental design through statistical analysis. The confounding and leakage structures examined in Chapter 13 detail how population stratification, batch effects, and ascertainment bias produce results that evaporate under deployment. Uncertainty quantification (Chapter 24) extends calibration assessment to epistemic versus aleatoric uncertainty (Section 24.1) and selective prediction (Section 24.8). Interpretability (Chapter 25) addresses whether models have learned genuine biology or exploited confounded patterns, with attribution methods in Section 25.1 providing specific diagnostic tools. For clinical applications specifically, risk prediction frameworks (Chapter 28) develop evaluation approaches tailored to decision-making, where net benefit and decision curves supplement discrimination metrics. Together, these perspectives provide the critical apparatus for engaging with genomic foundation model claims.
Before reviewing the summary, test your recall:
- What is homology leakage, and why does it make random train-test splits inadequate for protein and DNA sequence benchmarks?
- Explain the four types of data leakage (label, feature, temporal, benchmark) and give a concrete genomic example of each.
- A model achieves 0.85 Spearman correlation on a DMS benchmark with random splits, but only 0.60 with contiguous region splits. What does this performance gap reveal?
- Why does auROC alone fail to capture calibration quality, and when would a well-calibrated model matter more than high discrimination?
- How does label circularity in ClinVar (where computational predictions influence annotations) compromise benchmark validity for new variant effect predictors?
Suggested answers:
Homology leakage occurs when evolutionarily related sequences appear in both training and test sets, allowing models to succeed through memorization of family-specific patterns rather than learning general biological principles. Random splits fail because they ignore sequence similarity: a test protein at 80% identity to a training protein provides minimal generalization evidence, yet random splits routinely create such pairs. Homology-aware splitting (using tools like CD-HIT at 30-40% identity thresholds) ensures test sequences are evolutionarily distant from training sequences.
Four leakage types:
- Label leakage: Target labels are themselves derived from the kind of features the model uses. Example: ClinVar pathogenicity classifications that incorporated SIFT/PolyPhen predictions and are then used to evaluate new predictors built on similar features; the model learns to replicate curation criteria rather than discover independent signal.
- Feature leakage: Input features act as proxies for the label or encode information unavailable at prediction time. Example: Using population allele frequency as a feature for pathogenicity prediction captures selection pressure (rare = more likely pathogenic) without learning variant biology; appropriate for clinical use but problematic for mechanistic understanding.
- Temporal leakage: Using future information to predict past events. Example: Training on 2023 ClinVar annotations to predict variants that were uncertain in 2020, when those 2023 annotations may have used model-like predictions made after 2020.
- Benchmark leakage: Test set construction influenced by similar methods. Example: A protein benchmark selecting well-annotated proteins via sequence similarity searches, then evaluating sequence-based models that exploit the same similarity used in benchmark construction.
The 0.85 → 0.60 performance drop reveals the model relies heavily on local sequence context rather than learning positional effects or long-range constraints. Random splits allow nearby positions to appear in both training and test sets; the model learns that “positions near training variants tend to have similar effects.” Contiguous splits remove this crutch by placing entire sequence regions in test, forcing genuine spatial generalization. The 0.25 correlation gap represents how much performance came from interpolating between nearby measured positions versus understanding the protein’s functional landscape.
auROC measures ranking (whether positives score higher than negatives) but is invariant to probability calibration (whether a score of 0.8 actually means 80% probability). A model can achieve 0.95 auROC by assigning 0.99 to all pathogenic variants and 0.98 to all benign variants (perfect ranking, useless probabilities). Well-calibrated models matter most when:
- Making threshold-based decisions (reporting variants above some probability cutoff)
- Integrating model evidence with other information (Bayesian updating requires calibrated likelihoods)
- Communicating uncertainty to clinicians (a stated “80% probability” should be reliable)
- Comparing risks across different variant types or populations
Label circularity creates a feedback loop: if ClinVar curators used computational predictions (SIFT, PolyPhen, conservation scores) to classify variants, and those classifications become training labels for new predictors using similar features, the new model learns to replicate curation criteria rather than provide independent evidence. This inflates benchmark performance without improving biological insight; the model predicts what previous models predicted, not what biology determines. Detecting circularity requires: (1) comparing against baselines using only the potentially circular features, (2) checking whether performance exceeds what those features alone achieve, (3) temporal validation on prospectively classified variants, and (4) examining whether the model adds value beyond existing predictors in clinical adjudication.
This chapter covered the methodological foundations for proper model evaluation in genomic machine learning.
Key Takeaways:
- Random splits fail for genomic data because sequences share homology, individuals share ancestry, and samples share batch effects
- Homology-aware splitting (CD-HIT/MMseqs2 at appropriate thresholds) prevents the most common leakage pathway
- Four leakage types (label, feature, temporal, benchmark) require different detection strategies
- Metric selection must match deployment objectives: auPRC for imbalanced data, calibration for probability estimates, ranking metrics for prioritization
- Strong baselines and proper ablations distinguish genuine advances from benchmark-specific tuning
- Foundation model evaluation requires zero-shot, probing, and fine-tuning comparisons to isolate representation quality
Looking Ahead: The next chapter (Chapter 13) examines how confounding and leakage structures beyond homology create spurious performance claims, including population stratification, batch effects, and ascertainment bias.
Connections:
- Apply evaluation principles when assessing claims in later chapters on foundation models (Chapter 14 through Chapter 18)
- Calibration concepts developed here connect to uncertainty quantification (Chapter 24)
- Clinical utility metrics introduced here are expanded for clinical risk prediction (Chapter 28)