11 Benchmark Landscape
Every benchmark measures a proxy. No benchmark measures what you actually need to know.
Estimated reading time: 35-40 minutes
Prerequisites: This chapter assumes familiarity with basic machine learning concepts including train/test splits, common metrics (accuracy, auROC), and model evaluation (Chapter 5). Understanding of genomic data types (Chapter 2) and variant effect prediction concepts (Chapter 4) will help contextualize benchmark tasks.
Learning Objectives: After completing this chapter, you should be able to:
- Navigate the major benchmark suites for protein (TAPE, FLIP, ProteinGym), DNA (BEND, Genomic Benchmarks), and variant effect prediction (ClinVar, CAGI)
- Understand how benchmark construction choices shape what is measured and what is missed
- Identify the proxy-target gap between benchmark success and deployment value
- Critically evaluate benchmark claims by identifying saturation, staleness, and systematic biases
Chapter Structure: This chapter surveys what benchmarks exist across modalities, examining protein, DNA, and variant effect benchmarks before addressing the gap between benchmark success and deployment value. The companion chapter on Evaluation Methods (Chapter 12) examines how to evaluate properly.
ClinVar pathogenicity labels proxy clinical impact. Area under the receiver operating characteristic curve (auROC) on held-out variants proxies discrimination ability in deployment. Chromatin accessibility predictions proxy regulatory function. The gap between proxy and target varies across benchmarks, across variant types, and across populations. A model achieving state-of-the-art performance on ClinVar may systematically miscalibrate predictions for the rare variants that matter most clinically, because ClinVar’s composition does not reflect the distribution of variants clinicians actually encounter. A DNA language model excelling at enhancer classification may have learned GC content rather than regulatory grammar, because the benchmark’s negative examples differ from positives in ways that have nothing to do with enhancer function.
Understanding what benchmarks actually measure, and how that differs from what we need to know, is prerequisite to interpreting any leaderboard result. The genomic AI field has accumulated substantial evaluation infrastructure. Dozens of benchmark suites target different modalities: protein structure and function, DNA regulatory elements, variant pathogenicity, gene expression prediction, and more. Hundreds of individual tasks probe specific capabilities. Thousands of models have reported results, creating leaderboards that rank approaches and track progress over time. This infrastructure enables comparison and drives methodological improvement. Yet the relationship between benchmark success and deployment value remains poorly characterized. A foundation model that dominates protein benchmarks may fail on the specific protein family relevant to a drug discovery campaign. A variant effect predictor that leads regulatory benchmarks may provide no clinical utility for the variant classes that lack representation in evaluation data.
11.1 Protein Language Model Benchmarks
Protein language models (Chapter 16) benefit from the longest-established and most systematic evaluation ecosystem in genomic AI. This maturity reflects both the longer history of computational protein science and the relative tractability of protein structure and function prediction compared to the regulatory genomics tasks discussed in Chapter 17.
11.1.1 TAPE: Tasks Assessing Protein Embeddings
How do you know if one protein representation is better than another? The challenge parallels standardized testing in education: before SAT and GRE exams existed, comparing students from different schools was nearly impossible because each school used different grading scales and curricula. A 3.8 GPA from one school might mean something very different from a 3.8 at another. Standardized benchmarks solve this by providing a common yardstick that everyone agrees to use, enabling fair comparison across diverse candidates. Before 2019, comparing protein language models faced the same problem: different papers ran different evaluations with inconsistent protocols, making apples-to-apples comparison nearly impossible. The Tasks Assessing Protein Embeddings (TAPE) benchmark, introduced in 2019, established the template for systematic protein representation evaluation (Rao et al. 2019). TAPE frames protein language model assessment as transfer learning evaluation (Chapter 9): pretrained models generate embeddings (Section 5.6), which are then used as features for supervised prediction on downstream tasks. This framework decouples representation quality from task-specific modeling, enabling comparison across architectures that may have very different inductive biases.
TAPE comprises five tasks spanning different aspects of protein biology. Secondary structure prediction requires classifying each residue as helix, sheet, or coil, testing whether embeddings capture local structural preferences. Contact prediction asks whether residue pairs are spatially proximate in the folded structure, probing the representation’s ability to encode tertiary structure information from sequence alone. Remote homology detection requires classifying proteins into structural superfamilies, testing whether embeddings capture evolutionary relationships that transcend sequence similarity. Fluorescence prediction and stability prediction use data from deep mutational scanning experiments to assess whether embeddings encode fitness landscapes.
The benchmark’s design reflects deliberate methodological choices. Train, validation, and test splits enforce sequence identity thresholds to prevent homology-based leakage (Section 12.4). Evaluation uses simple linear or shallow neural network heads rather than complex task-specific architectures, isolating representation quality from modeling capacity. Standardized preprocessing and data loading eliminate confounds from inconsistent implementation.
TAPE established an important principle: evaluate representations separately from task-specific modeling. By using simple linear classifiers on top of frozen embeddings, TAPE isolates what the pretrained model learned from what a complex head might learn during fine-tuning. This approach became the standard template for foundation model evaluation across genomics, enabling fair comparison between models with different architectures and pretraining objectives.
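A minimal sketch of this frozen-embedding protocol, assuming a hypothetical `embed_sequence` function that maps each protein sequence to a fixed-length vector from a pretrained encoder:

```python
# TAPE-style linear probing: train a simple head on frozen embeddings.
# `embed_sequence` is a hypothetical stand-in for any pretrained encoder that
# maps a protein sequence to a fixed-length vector (e.g., mean-pooled residue embeddings).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def linear_probe(embed_sequence, train_seqs, train_labels, test_seqs, test_labels):
    # Embeddings are computed once and frozen; no gradients flow into the encoder.
    X_train = np.stack([embed_sequence(s) for s in train_seqs])
    X_test = np.stack([embed_sequence(s) for s in test_seqs])

    # A simple linear head isolates representation quality from head capacity.
    head = LogisticRegression(max_iter=1000)
    head.fit(X_train, train_labels)
    return accuracy_score(test_labels, head.predict(X_test))
```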
TAPE’s influence extended beyond its specific tasks. The benchmark established norms for protein representation evaluation: systematic coverage of diverse prediction targets, controlled transfer learning protocols, and explicit attention to data splitting. Subsequent benchmarks adopted and extended this framework.
11.1.2 FLIP: Fitness Landscape Inference for Proteins
Can a model predict what a protein actually does, not just what it looks like? TAPE’s labels include computationally inferred structure and conservation, but these are indirect proxies for function. A researcher wanting to know whether a mutation affects enzymatic activity needs predictions grounded in experimental measurements of that activity. The FLIP (Fitness Landscape Inference for Proteins) benchmark addresses this gap by focusing on experimentally measured functional properties (Dallago et al. 2022). Where TAPE includes structurally derived labels and computational annotations, FLIP emphasizes high-throughput experimental assays that directly measure protein fitness.
FLIP aggregates deep mutational scanning datasets across diverse proteins and functional readouts. The benchmark includes assays measuring enzymatic activity, binding affinity, thermostability, and expression level. Each dataset provides quantitative measurements for thousands of single-point mutations, enabling evaluation of fine-grained variant effect prediction.
The benchmark’s value lies in its experimental grounding. Computational structure predictions and evolutionary conservation scores, while useful, are indirect proxies for function. Deep mutational scanning provides direct measurements of how sequence changes affect the property of interest. Models that perform well on FLIP demonstrate the ability to predict experimentally validated functional consequences rather than computationally inferred annotations.
FLIP also introduced systematic evaluation of different splitting strategies. Random splits, where training and test variants are sampled uniformly from the same protein, represent the easiest setting. Contiguous splits, where training and test variants occupy different sequence regions, test spatial generalization. Modulo splits, which interleave training and test positions along the sequence, provide intermediate difficulty. Performance typically degrades from random to contiguous splits, revealing how much models rely on local sequence context versus genuine functional understanding.
Before reading on, consider: If a model achieves 0.85 correlation on FLIP with random splits but only 0.60 correlation with contiguous splits, what does this reveal about what the model has learned? What kind of information would be available in random splits but not contiguous splits?
Hint: Think about what information from nearby positions might “leak” across random splits.
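To make the three strategies concrete, the sketch below assigns single-substitution variants to splits by mutated position; the thresholds and modulus are illustrative choices, not FLIP’s exact configuration.

```python
# Illustrative split assignment for single-substitution variants of one protein,
# indexed by mutated position. Thresholds and modulus are arbitrary examples,
# not FLIP's exact configuration.
import random

def assign_split(position, protein_length, strategy, rng=random.Random(0)):
    if strategy == "random":
        # Uniform assignment: nearby positions freely mix between train and test.
        return "train" if rng.random() < 0.8 else "test"
    if strategy == "contiguous":
        # First 80% of positions train; the final stretch tests spatial generalization.
        return "train" if position < 0.8 * protein_length else "test"
    if strategy == "modulo":
        # Every fifth position is held out, interleaving train and test along the sequence.
        return "test" if position % 5 == 0 else "train"
    raise ValueError(f"unknown strategy: {strategy}")
```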
11.1.3 ProteinGym: Comprehensive Variant Effect Evaluation
Which variant effect predictor should you trust? With dozens of models claiming state-of-the-art performance on different proteins, the field needed a unified benchmark comprehensive enough to reveal which approaches genuinely generalize. ProteinGym has emerged as the most comprehensive benchmark for protein variant effect prediction, compiling 217 deep mutational scanning assays across diverse protein families (Notin et al. 2023). The benchmark’s scale enables statistically robust comparison across modeling approaches while its diversity reveals where different methods excel or struggle.
The primary evaluation metric is Spearman correlation between predicted and experimentally measured fitness effects. This rank-based metric is appropriate for deep mutational scanning data, where absolute fitness values depend on assay-specific calibration but relative rankings are more comparable across experiments. ProteinGym reports correlations for each assay individually and aggregated across the full benchmark, enabling both global comparison and identification of task-specific strengths.
ProteinGym distinguishes between zero-shot and supervised evaluation regimes. In zero-shot evaluation, models predict variant effects without any task-specific training, relying entirely on representations learned during pretraining. Models like ESM-1v (Section 16.1) compute effects as log-likelihood ratios under the pretrained language model, while structure-based methods like AlphaMissense (Section 18.2.3) incorporate predicted structural consequences. In supervised evaluation, models are fine-tuned on a subset of measured variants before predicting held-out effects. The gap between zero-shot and supervised performance indicates how much task-specific information improves over general-purpose representations.
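A sketch of the zero-shot workflow, assuming a hypothetical `log_prob` wrapper that returns per-position amino-acid log-probabilities from a pretrained protein language model:

```python
# Zero-shot variant scoring as a log-likelihood ratio, evaluated by Spearman correlation
# against deep mutational scanning measurements. `log_prob(seq, pos, aa)` is a hypothetical
# wrapper returning the pretrained model's log-probability of amino acid `aa` at position `pos`.
from scipy.stats import spearmanr

def llr_score(log_prob, wildtype_seq, position, mutant_aa):
    wildtype_aa = wildtype_seq[position]
    # Positive scores mean the mutant is more plausible under the model than the wild type.
    return log_prob(wildtype_seq, position, mutant_aa) - log_prob(wildtype_seq, position, wildtype_aa)

def evaluate_assay(log_prob, wildtype_seq, variants, measured_fitness):
    # `variants` is a list of (position, mutant_aa) pairs; rank agreement is what ProteinGym reports.
    predictions = [llr_score(log_prob, wildtype_seq, pos, aa) for pos, aa in variants]
    rho, _ = spearmanr(predictions, measured_fitness)
    return rho
```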
The benchmark reveals systematic patterns in model performance. Protein language models generally outperform conservation-based methods, particularly for variants in regions with sparse evolutionary sampling. Structure-aware models show advantages for variants affecting protein stability or buried residues. Ensemble methods that combine multiple predictors often achieve the highest correlations, suggesting that different approaches capture complementary information.
ProteinGym’s limitations mirror those of its constituent datasets. Deep mutational scanning experiments are biased toward well-studied proteins amenable to high-throughput screening. Assay-specific selection pressures affect which variants appear deleterious: a variant may strongly affect enzymatic activity while leaving thermostability unchanged, or vice versa. The benchmark measures correlation with specific experimental readouts rather than clinical pathogenicity, which integrates multiple functional consequences in complex ways.
| Benchmark | Tasks | Labels | Splitting | Strengths | Limitations |
|---|---|---|---|---|---|
| TAPE | 5 (structure, homology, fitness) | Mixed (experimental + computational) | Homology-aware | Established template; diverse tasks | Some computational labels; limited DMS coverage |
| FLIP | Multiple DMS datasets | Experimental only | Multiple strategies (random, contiguous, modulo) | Experimental grounding; systematic split analysis | Limited to proteins with DMS data |
| ProteinGym | 217 DMS assays | Experimental only | Zero-shot + supervised | Comprehensive scale; diverse protein families | Biased toward well-studied proteins |
11.1.4 Structure Prediction Benchmarks
Structure prediction has long served as the ultimate test of whether we truly understand proteins: if you can predict how a sequence folds, you have captured something fundamental about protein physics. Protein structure prediction benchmarks derive from the Critical Assessment of protein Structure Prediction (CASP) tradition, which has evaluated computational methods against experimentally determined structures since 1994 (Kryshtafovych et al. 2021). The dramatic success of AlphaFold2 at CASP14 in 2020 transformed the field, but structure prediction benchmarks remain relevant for evaluating single-sequence methods and assessing whether language model pretraining improves structural accuracy.
Structure prediction quality is typically assessed using the Global Distance Test (GDT-TS) and Template Modeling score (TM-score). GDT-TS measures the percentage of residues that can be superimposed within various distance thresholds, providing a single number between 0 and 100 that correlates well with visual assessment of structural similarity. TM-score normalizes by protein length, enabling comparison across proteins of different sizes.
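For reference, a standard formulation of these scores (independent of any particular benchmark): with \(d_i\) the distance in Å between the \(i\)-th pair of aligned residues after superposition,

\[
\text{TM-score} = \max\left[\frac{1}{L_{\text{target}}}\sum_{i=1}^{L_{\text{aligned}}}\frac{1}{1+\left(d_i/d_0\right)^2}\right],
\qquad d_0 = 1.24\,(L_{\text{target}}-15)^{1/3} - 1.8,
\]

where the maximum is taken over superpositions, and GDT-TS averages the percentage of residues superimposable within 1, 2, 4, and 8 Å distance thresholds.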
For protein language models, the relevant evaluation setting is single-sequence structure prediction, where the model receives only the target sequence without multiple sequence alignments. This tests whether pretraining on evolutionary sequence databases enables structure prediction without explicit evolutionary analysis at inference time. ESMFold (Section 16.4) demonstrated that single-sequence prediction can approach MSA-based methods for many proteins, though performance gaps remain for sequences with sparse evolutionary coverage.
Structure prediction benchmarks complement sequence-based evaluations by testing whether learned representations encode biophysical constraints. A model that achieves high accuracy on contact prediction or secondary structure classification may still fail to integrate these local predictions into globally consistent structures. The emergence of accurate single-sequence structure prediction from language model embeddings suggests that pretraining captures substantial structural information, even without explicit structural supervision.
11.2 DNA and Regulatory Benchmarks
DNA foundation models (Chapter 15) and regulatory models (Chapter 17) face a less mature but rapidly developing benchmark landscape compared to the protein ecosystem. Early deep learning work in genomics focused on individual tasks derived from ENCODE-style assays (Section 2.4.1), establishing evaluation paradigms that later benchmark suites would systematize. Recent efforts have introduced benchmark suites that attempt to standardize evaluation across multiple tasks, tissues, and species.
Before diving into DNA benchmarks, consider: Why might DNA benchmarks be “less mature” than protein benchmarks? What makes regulatory prediction fundamentally harder to benchmark than protein structure or function prediction?
Hint: Think about the length scales involved and whether regulatory activity is an intrinsic sequence property.
11.2.1 Classical Regulatory Prediction Tasks
Where in the genome does gene regulation actually happen? Early deep learning researchers needed benchmarks to test whether neural networks could learn to recognize promoters, enhancers, and transcription factor binding sites from DNA sequence alone. The earliest deep learning benchmarks for genomics framed regulatory prediction as classification over short sequence windows. Transcription factor binding prediction asks whether a specific TF ChIP-seq peak overlaps a given sequence window, typically around 1 kilobase centered on the binding site. Open chromatin prediction requires classifying regions as accessible or inaccessible based on DNase-seq or ATAC-seq signal. Histone mark prediction asks whether a chromatin modification peak (H3K27ac, H3K4me3, etc.) is present at each position.
These tasks derive from consortia like ENCODE and Roadmap Epigenomics (Section 2.4.1), which systematically profiled chromatin states across cell types. Benchmark construction typically involves defining positive regions from called peaks and sampling negative regions from elsewhere in the genome, extracting fixed-length sequences centered on each region, and evaluating binary classification using auROC or average precision.
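A sketch of this construction, including a composition-only control that any such benchmark should report; the helper functions (`load_peaks`, `sample_background_windows`, `fetch_sequence`) are hypothetical stand-ins for whatever genome tooling is in use.

```python
# Classical regulatory benchmark construction: positives from called peaks, negatives
# sampled elsewhere in the genome, binary classification scored by auROC. The GC-only
# baseline checks whether sequence composition alone separates the classes.
# `load_peaks`, `sample_background_windows`, and `fetch_sequence` are hypothetical helpers.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def gc_fraction(seq):
    return (seq.count("G") + seq.count("C")) / max(len(seq), 1)

def gc_baseline_auroc(load_peaks, sample_background_windows, fetch_sequence,
                      window=1000, seed=0):
    positives = [fetch_sequence(region, window) for region in load_peaks()]
    negatives = [fetch_sequence(region, window)
                 for region in sample_background_windows(len(positives))]
    labels = np.array([1] * len(positives) + [0] * len(negatives))

    # A model trained only on GC fraction should perform near chance (auROC ~0.5);
    # if it does not, the benchmark's negatives are confounded with composition.
    gc = np.array([[gc_fraction(s)] for s in positives + negatives])
    is_train = np.random.default_rng(seed).random(len(labels)) < 0.8
    baseline = LogisticRegression().fit(gc[is_train], labels[is_train])
    scores = baseline.predict_proba(gc[~is_train])[:, 1]
    return roc_auc_score(labels[~is_train], scores)
```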
Models such as DeepSEA, Basset, and DanQ established baseline performance on these tasks (Chapter 6 for architectural details). Their success demonstrated that convolutional networks could learn sequence features predictive of regulatory state without hand-crafted motifs. Modern foundation models still report performance on similar tasks as sanity checks, though these classical benchmarks have significant limitations.
The primary limitation of classical regulatory benchmarks is that binary classification over short windows fails to capture the quantitative, cell-type-specific, and long-range nature of transcriptional regulation. A region may be weakly accessible in some cell types and strongly accessible in others; binary labels collapse this continuous variation. Short windows cannot assess whether models capture distal regulatory interactions that span tens to hundreds of kilobases. Evaluation on curated peak regions may overestimate performance relative to genome-wide prediction, where the vast majority of positions are regulatory “background.”
11.2.2 Quantitative Regulatory Prediction
Beyond binary classification, benchmarks increasingly require prediction of quantitative regulatory readouts. Signal regression asks models to predict per-base or per-bin signal intensity from ChIP-seq, ATAC-seq, or related assays. Gene expression prediction requires predicting transcript abundance (TPM, counts) from promoter sequences or larger genomic contexts. Massively parallel reporter assays (MPRAs) provide systematic measurements of enhancer or promoter activity for thousands of sequences, enabling evaluation of quantitative activity prediction.
Hybrid architectures like Enformer (Section 17.2) popularized benchmarks combining large receptive fields with dense quantitative targets across many assays and cell types. Evaluation metrics shift from auROC to Pearson or Spearman correlation between predicted and observed profiles. Some benchmarks report correlation relative to replicate concordance, establishing an upper bound set by experimental reproducibility.
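One way to report performance relative to this ceiling, shown as a small sketch: normalize the model-versus-experiment correlation by the replicate-versus-replicate correlation, so values near 1.0 indicate the model approaches the assay’s reproducibility.

```python
# Correlation relative to replicate concordance: experimental reproducibility sets the ceiling.
# A normalized value near 1.0 means the model is close to the assay's own reproducibility.
from scipy.stats import pearsonr

def replicate_normalized_correlation(predicted, replicate_1, replicate_2):
    model_r, _ = pearsonr(predicted, replicate_1)        # model vs. one experimental replicate
    ceiling_r, _ = pearsonr(replicate_1, replicate_2)    # replicate vs. replicate
    return model_r / ceiling_r
```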
Quantitative benchmarks better reflect the continuous nature of regulatory activity but introduce new challenges. Heterogeneous noise across assays and laboratories complicates aggregation: should a model be penalized equally for poor performance on a low-quality assay versus a high-quality one? Cell-type diversity raises questions about how to weight performance across tissues: is accurate prediction in a rare cell type more or less important than in a common one? The relationship between predicted and observed signal depends on assay-specific calibration that may not transfer across experimental batches.
11.2.3 Genomic Benchmarks
Reproducibility in DNA modeling has been notoriously difficult: different papers used different data preprocessing, different train-test splits, and different evaluation metrics, making it nearly impossible to compare methods fairly. The Genomic Benchmarks resource addresses this fragmentation by providing standardized classification datasets for DNA sequence models (Grešová et al. 2023). The benchmark compiles tasks including enhancer identification, promoter recognition, splice site detection, and coding sequence classification across multiple species. Standardized train, validation, and test splits enable direct comparison of different architectures without confounds from inconsistent data processing.
Genomic Benchmarks emphasizes accessibility and reproducibility. Datasets are available in a unified format with documented preprocessing. Baseline results for multiple architectures provide reference points for new models. The benchmark includes tasks of varying difficulty, from relatively easy (distinguishing coding from non-coding sequence) to challenging (identifying tissue-specific enhancers).
The benchmark’s limitations reflect its design priorities. Focus on classification rather than regression excludes quantitative prediction tasks. Task difficulty varies substantially, with some tasks approaching saturation where gains become difficult to measure. Species coverage, while broader than many benchmarks, remains biased toward well-studied model organisms.
11.2.4 BEND: Benchmark for DNA Language Models
As DNA language models proliferated, each reporting impressive results on different tasks, a pressing question emerged: which model should you actually use for your application? BEND (Benchmark for Evaluating DNA Models) provides a unified framework for evaluating genomic foundation models across diverse tasks (Marin et al. 2024). The benchmark includes regulatory element classification, chromatin accessibility prediction, variant effect scoring, and gene expression prediction. Standardized splits and evaluation protocols enable fair comparison across model families.
BEND’s design reflects lessons learned from earlier benchmarks. Tasks span multiple biological scales, from nucleotide-level variant effects to kilobase-scale regulatory elements. Evaluation includes both zero-shot settings (using pretrained representations directly) and fine-tuned settings (adapting models to specific tasks). Performance is reported separately for each task rather than aggregated into a single score, acknowledging that different models may excel at different aspects of genomic prediction.
Comparative evaluations using BEND reveal that no single model dominates across all tasks. Architecture choices (CNN versus transformer versus state space model), tokenization schemes (single nucleotide versus k-mer versus BPE), and pretraining corpora all influence task-specific performance (Chapter 5). These patterns inform model selection for specific applications while highlighting the limitations of aggregate benchmarks that obscure such variation.
Test your understanding of the DNA benchmark landscape:
- What is the key difference between classical regulatory benchmarks (binary classification) and modern quantitative benchmarks?
- Why might a model perform well on Genomic Benchmarks but poorly on BEND?
- What does it mean when “no single model dominates across all tasks” in BEND?
Classical benchmarks focus on binary classification (promoter vs. non-promoter, enhancer vs. background) while modern quantitative benchmarks predict continuous values (expression levels, binding affinity, chromatin accessibility), requiring models to capture magnitude rather than just presence/absence.
A model might excel at binary classification tasks in Genomic Benchmarks by learning to distinguish broad sequence classes, but fail on BEND’s diverse tasks requiring different capabilities like long-range dependencies, quantitative prediction, or cross-species transfer.
When no single model dominates, it means different architectures excel at different task types, suggesting that architectural choices matter and that benchmark diversity successfully measures complementary capabilities rather than a single underlying skill.
11.2.5 Long-Range Benchmarks
Long-range regulatory interactions, where enhancers tens to hundreds of kilobases from their target genes influence expression, require benchmarks that specifically test extended context modeling. Consider the challenge of understanding a sentence versus understanding a novel: predicting the next word from the previous five words tests local grammar, while predicting plot resolution from earlier foreshadowing tests whether a model truly comprehends narrative structure across hundreds of pages. Similarly, short-context benchmarks test whether models recognize local motifs, while long-range benchmarks test whether they understand how distant regulatory elements coordinate gene expression: the “narrative structure” of the genome.
The Long Range Benchmark (LRB) evaluates models’ ability to integrate information across large genomic distances, with tasks including predicting distal enhancer-promoter interactions, modeling topologically associating domain (TAD) boundary effects, and identifying long-range regulatory dependencies. TADs are regions of the genome (typically 100 kb to 2 Mb) that preferentially interact with themselves rather than with neighboring regions, like chapters in a book where scenes within a chapter relate closely to each other but less to scenes in other chapters. TAD boundaries constrain which enhancers can reach which promoters; variants disrupting these boundaries can cause disease by allowing inappropriate regulatory crosstalk.
DNALongBench extends evaluation to ultra-long contexts spanning up to millions of base pairs. Tasks at this scale test whether models can use chromosome-level context for regulatory prediction, potentially capturing effects from 3D chromatin organization and large-scale chromatin domains.
These benchmarks are particularly relevant for evaluating efficient attention mechanisms, state space models, and other architectures designed to extend effective context length (Section 7.4). Performance on long-range benchmarks does not necessarily correlate with short-range task performance, indicating that different architectural choices optimize for different aspects of sequence modeling.
11.2.6 Cross-Species Evaluation
GenBench and related resources test whether models trained on one organism generalize to related species. Cross-species evaluation is important for several reasons. Many applications require predictions in non-human organisms (agricultural genomics, model organism research, comparative genomics). Multi-species training may improve within-species performance by providing additional evolutionary signal (Section 8.8.3). The ability to transfer across species indicates that models have learned general principles of genome organization rather than species-specific artifacts.
Cross-species benchmarks typically evaluate models on held-out species not seen during training. Performance degradation from training to held-out species indicates the degree to which learned representations depend on species-specific features. Some architectures show better cross-species transfer than others, suggesting differences in how well they capture conserved regulatory principles.
11.2.7 When Foundation Models Fail: The Single-Cell Reality Check
Can a foundation model trained on millions of cells predict perturbation responses better than a linear classifier? The assumption underlying much single-cell foundation model development is that self-supervised pretraining on massive cellular atlases captures regulatory logic that transfers to downstream prediction tasks. A 2025 Nature Methods study tested this assumption directly by comparing Geneformer and scGPT to simple baselines on perturbation prediction tasks (rood_geneformer_benchmark_2025?).
The results were sobering. For predicting gene expression changes following CRISPR perturbations, both foundation models were outperformed by a linear regression trained on the same task-specific data. The gap was not subtle: baseline methods achieved higher correlation with experimental measurements across multiple perturbation datasets, and the difference persisted even when foundation models were fine-tuned rather than used zero-shot. The finding echoes similar failures in protein language models, where zero-shot predictions frequently underperform domain-specific baselines despite impressive pretraining scale (see Section 16.8 for protein-specific examples).
What went wrong? The study identified three systematic issues. First, foundation models excelled at cell type classification and other tasks directly related to their pretraining objective (predicting masked genes from expression context). This capability did not transfer to perturbation prediction, which requires understanding causal interventions rather than observational associations. Second, the models struggled with out-of-distribution genes: perturbations targeting genes poorly represented in pretraining data produced unreliable predictions. Third, simple baselines leveraged task-specific inductive biases that foundation models lacked. A linear model predicting expression changes can directly learn which genes respond together to perturbations. A transformer pretrained on gene co-expression learns which genes co-occur in unperturbed states.
This benchmark establishes a critical methodological standard: all foundation model evaluations should include strong, task-specific baselines. Reporting that a model achieves 0.72 correlation on perturbation prediction is meaningless without knowing that a linear regression achieves 0.81. The “foundation” label does not exempt models from basic comparison requirements.
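A minimal sketch of what such a baseline comparison looks like in practice, with illustrative variable names; the point is that the baseline sees exactly the same downstream data as the foundation model.

```python
# The comparison every foundation-model evaluation should report: the same downstream data,
# fit with a simple task-specific model. Variable names are illustrative.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def linear_baseline_correlation(X, Y, seed=0):
    # X: features available to any method (e.g., perturbation indicators, control expression)
    # Y: observed post-perturbation expression changes (samples x genes)
    X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=seed)
    baseline = Ridge(alpha=1.0).fit(X_tr, Y_tr)
    predictions = baseline.predict(X_te)
    # Mean per-sample correlation between predicted and observed expression-change profiles;
    # a foundation model should beat this number to justify its added cost.
    return float(np.mean([pearsonr(p, y)[0] for p, y in zip(predictions, Y_te)]))
```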
The implications extend beyond single-cell models. Any claim that pretraining improves downstream performance requires demonstrating superiority over appropriately matched baselines trained on the same downstream data. Large scale and self-supervised objectives are necessary but not sufficient for transfer. The questions become: what inductive biases must models capture to generalize beyond their pretraining distribution, and how do we design pretraining objectives that encourage learning those biases rather than shortcuts? These questions remain open.
11.3 Variant Effect Prediction Benchmarks
Variant effect prediction (VEP) benchmarks connect sequence changes to molecular or phenotypic consequences, addressing the clinically central question of which variants matter. These benchmarks span multiple biological levels, from molecular function to clinical pathogenicity.
Before exploring VEP benchmarks, pause to consider: What makes variant effect prediction fundamentally different from other benchmark tasks we have covered? If a model achieves 0.95 auROC on a variant pathogenicity benchmark, what does that tell you, and what does it not tell you?
Hint: Consider the difference between ranking variants and making clinical decisions about individual patients.
11.3.1 Clinical Variant Databases
How do you know if a variant you have never seen before is pathogenic? Clinical laboratories face this question daily, and the answer increasingly depends on computational predictions. ClinVar provides the most widely used labels for clinical variant effect prediction, aggregating pathogenicity assertions from clinical laboratories and researchers worldwide (Section 2.8.1). Benchmarks derived from ClinVar frame variant interpretation as classification: given a variant, predict whether it is pathogenic, likely pathogenic, benign, or likely benign.
ClinVar’s value as a benchmark stems from its clinical relevance. Variants classified in ClinVar represent the actual population of variants encountered in clinical testing. Performance on ClinVar directly addresses whether a model can assist variant interpretation workflows. The database’s scale (over 2 million variant submissions as of 2024) enables statistically robust evaluation (Landrum et al. 2018).
ClinVar’s limitations as a benchmark are equally important. Submission heterogeneity means that label quality varies dramatically: expert-curated panels provide high-confidence classifications while single-laboratory submissions may reflect limited evidence. Version sensitivity means that benchmark composition changes over time as new submissions arrive and old classifications are updated. Most consequentially, circularity with computational predictors creates feedback loops: variants may have been classified using the very tools being evaluated, inflating apparent performance. This circularity problem, examined in detail for classical predictors in Section 4.5 and for its broader confounding implications in Chapter 13, represents one of the most insidious forms of benchmark contamination.
Ancestry and gene coverage biases profoundly shape what ClinVar benchmarks measure. Variants from European ancestry individuals and well-studied disease genes are heavily overrepresented. High performance on ClinVar demonstrates accuracy for this specific population rather than robust generalization across human genetic diversity (Section 3.7). Benchmarks stratified by ancestry reveal substantial performance gaps, with models typically performing worse on variants from underrepresented populations.
Best practices for using ClinVar as a benchmark include specifying the exact database version and download date, excluding variants with conflicting assertions, stratifying performance by evidence level and ancestry, and comparing to baselines using only allele frequency to detect circularity. These practices are detailed in Section 12.4, with specific guidance on detecting label leakage in Section 12.4.1.
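A sketch of the stratified-reporting practice, assuming a table of per-variant predictions with illustrative column names for ancestry and ClinVar review status:

```python
# Stratified ClinVar evaluation: report auROC per ancestry group and per review status
# rather than a single pooled number. Column names are illustrative, not a standard schema.
import pandas as pd
from sklearn.metrics import roc_auc_score

def stratified_auroc(variants: pd.DataFrame, score_col="model_score",
                     label_col="is_pathogenic", strata=("ancestry", "review_status")):
    results = {}
    for stratum in strata:
        for group, subset in variants.groupby(stratum):
            if subset[label_col].nunique() == 2:   # auROC requires both classes present
                results[(stratum, group)] = roc_auc_score(subset[label_col], subset[score_col])
    return pd.Series(results).sort_index()
```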
11.3.2 CAGI: Critical Assessment of Genome Interpretation
Can a benchmark ever be truly leak-proof? The fundamental problem with retrospective evaluation is that someone, somewhere, might have seen the test data. The Critical Assessment of Genome Interpretation (CAGI) challenges provide prospective evaluation of variant effect predictors on unpublished datasets (Brunak, Carter, and Moult 2023). Unlike retrospective benchmarks that evaluate models on historical data, CAGI distributes prediction targets before ground truth is available, preventing any possibility of overfitting to known labels.
CAGI challenges cover diverse prediction targets. Some challenges focus on molecular phenotypes: predicting the effect of variants on protein stability, binding affinity, or enzymatic activity. Others target clinical phenotypes: predicting disease risk, drug response, or clinical severity from individual genomes. The diversity of challenges tests whether models generalize across different types of variant effects.
The prospective design of CAGI provides several important advantages over retrospective benchmarks:
- Predictions must be made before labels are known, eliminating leakage from any source
- The timeline forces models to commit to predictions rather than post-hoc optimization
- Community participation enables fair comparison across many approaches under identical conditions
This prospective design represents the gold standard for benchmark validity. When evaluating any retrospective benchmark result, ask: “Would this performance hold up in a CAGI-style prospective evaluation?”
CAGI’s limitation is scale: challenges include hundreds to thousands of variants rather than the millions available in databases like ClinVar. Statistical power to detect small performance differences is correspondingly limited. The challenges also depend on experimental collaborators willing to withhold data until after the prediction deadline, limiting the range of phenotypes that can be assessed.
11.3.3 Deep Mutational Scanning Benchmarks
What if you could measure the effect of every possible mutation in a protein, not just the ones that happen to occur in patients? Deep mutational scanning (DMS) provides exactly this: systematic experimental measurement of variant effects across entire proteins or regulatory elements (Section 2.4.4). DMS benchmarks test whether models can predict these experimentally determined effects, providing direct validation against measured functional consequences rather than inferred clinical classifications.
MaveDB aggregates DMS datasets in a standardized format, enabling systematic benchmarking across diverse proteins and assays (Esposito et al. 2019). ProteinGym’s DMS component (discussed above) represents the most comprehensive benchmark in this space. For non-coding variants, MPRA datasets provide analogous systematic measurements of regulatory activity.
DMS benchmarks have distinct strengths and limitations compared to clinical databases. The experimental grounding means that labels reflect actual measured effects rather than clinical inference that may involve multiple assumptions. However, the relationship between DMS fitness and clinical pathogenicity is complex: a variant may substantially affect enzymatic activity without causing disease if the residual activity suffices for normal physiology. DMS benchmarks measure one component of the variant interpretation puzzle rather than the full clinical picture.
11.3.4 Regulatory and Non-Coding Variant Benchmarks
Non-coding variants require specialized benchmarks because their effects operate through different mechanisms than coding variants. The foundation model approaches to non-coding variant effect prediction are examined in Section 18.3, with the underlying regulatory models detailed in Chapter 17. Massively Parallel Reporter Assays (MPRAs) enable high-throughput measurement of regulatory element activity by testing thousands of sequences simultaneously in cell-based assays. MPRA-based benchmarks test whether models can predict the quantitative effect of variants on enhancer or promoter activity measured in these reporter assays. Expression quantitative trait locus (eQTL)-based benchmarks use naturally occurring variants associated with expression changes, treating the statistical evidence for eQTL status as a proxy for regulatory impact.
The challenge for non-coding benchmarks is connecting molecular effects to phenotypic consequences. A variant may alter chromatin accessibility without affecting any gene’s expression. A variant may affect expression without influencing disease risk. This gap between molecular and clinical effects complicates interpretation: high performance on MPRA prediction does not necessarily translate to accurate regulatory disease variant interpretation.
Fine-mapped genome-wide association study (GWAS) variants provide another benchmark source for non-coding VEP. Statistical fine-mapping identifies putatively causal variants within associated loci (Section 3.4), and models can be evaluated on their ability to prioritize these variants over nearby non-causal variants. Performance on fine-mapping tasks more directly assesses clinical relevance than molecular phenotype prediction, though fine-mapping itself has substantial uncertainty.
| Benchmark Type | Label Source | Strengths | Limitations | Best For |
|---|---|---|---|---|
| ClinVar | Clinical assertions | Clinical relevance; scale | Circularity; ancestry bias; heterogeneous quality | Clinical deployment validation (with caveats) |
| CAGI | Prospective experiments | No leakage possible; forces commitment | Limited scale; infrequent | Gold-standard validation |
| DMS/MaveDB | High-throughput assays | Direct experimental measurement | Assay-specific; fitness ≠ pathogenicity | Molecular mechanism understanding |
| MPRA | Reporter assays | Quantitative; regulatory focus | Reporter ≠ endogenous; context-dependent | Regulatory variant effects |
| eQTL/GWAS | Statistical associations | Population-level evidence | Correlation ≠ causation; LD confounding | Common variant prioritization |
Before proceeding, ensure you can explain:
- Why ClinVar benchmarks alone cannot validate a model for clinical deployment
- What makes CAGI the “gold standard” for benchmark validity
- The distinction between DMS fitness scores and clinical pathogenicity
If you cannot articulate these differences, revisit the benchmark type descriptions above.
11.4 Trait and Population-Level Benchmarks
At the individual and population level, benchmarks assess whether models improve prediction of complex traits and disease risk.
11.4.1 Polygenic Score Evaluation
Polygenic score (PGS) benchmarks evaluate how well genotype-derived scores predict disease risk or quantitative traits (Section 3.5). Common evaluation settings include within-biobank evaluation, where a single large cohort is partitioned into training and test sets, and cross-biobank evaluation, where models trained in one population are tested in another. The integration of foundation model features with PGS approaches represents an emerging research direction (Section 28.1).
Metrics depend on the phenotype. For quantitative traits, benchmarks report the coefficient of determination (\(R^2\)), which measures the proportion of variance explained by genetic factors. More informative is incremental \(R^2\), which measures how much additional variance the genetic score explains beyond non-genetic covariates (age, sex, principal components for population structure). An incremental \(R^2\) of 0.05 means the genetic score explains 5% of trait variance that covariates alone cannot capture, providing a clearer picture of the genetic model’s unique contribution. For binary disease outcomes, auROC and area under the precision-recall curve (auPRC) quantify discrimination. Calibration metrics assess whether predicted risks match observed event rates (Section 24.3). The clinical utility of PGS, discussed in Chapter 28, depends on all these properties: a score may discriminate well (high auROC) while being poorly calibrated (predicted risks do not match actual event rates).
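A sketch of the incremental \(R^2\) computation described above, with illustrative variable names: fit a covariate-only model and a covariates-plus-score model, then difference their test-set \(R^2\).

```python
# Incremental R^2: trait variance explained by the polygenic score beyond non-genetic covariates.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def incremental_r2(covariates, pgs, phenotype, train_idx, test_idx):
    # Covariate-only model (age, sex, genetic principal components, ...).
    covariate_model = LinearRegression().fit(covariates[train_idx], phenotype[train_idx])
    # Covariates plus the genetic score.
    full_features = np.column_stack([covariates, pgs])
    full_model = LinearRegression().fit(full_features[train_idx], phenotype[train_idx])

    r2_covariates = r2_score(phenotype[test_idx], covariate_model.predict(covariates[test_idx]))
    r2_full = r2_score(phenotype[test_idx], full_model.predict(full_features[test_idx]))
    return r2_full - r2_covariates   # e.g., 0.05 means the score adds 5% of trait variance
```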
Cross-population evaluation is particularly important because PGS portability is a major limitation of current methods (Section 3.7). Benchmarks stratified by ancestry typically reveal substantial performance degradation from European ancestry (where most GWAS have been conducted) to other populations. This degradation stems from multiple sources: different linkage disequilibrium patterns mean that tag SNPs identify different causal variants, population-specific variants are absent from training data, and effect sizes may differ across populations due to gene-environment interactions.
11.4.2 TraitGym
Do foundation models actually improve disease risk prediction, or do they just add computational overhead to methods that already work? Traditional polygenic scores use simple weighted sums of variant effects; the burden is on foundation models to prove they add value. TraitGym provides a framework specifically designed to assess complex trait prediction using genomic foundation models (Benegas, Eraslan, and Song 2025). The benchmark evaluates whether foundation model embeddings or variant scores improve prediction beyond traditional polygenic score methods.
TraitGym’s design addresses several limitations of standard PGS benchmarks. Ancestry stratification is built into the evaluation protocol, requiring models to report performance separately for different population groups, directly addressing the portability concerns documented in Section 3.7. This population-stratified evaluation approach parallels how ProteinGym (Section 11.1.3) systematically evaluates variant effect predictors across diverse protein families: just as ProteinGym reveals which models generalize across protein contexts, TraitGym reveals which models generalize across ancestry contexts. Multiple phenotypes spanning different genetic architectures (highly polygenic versus more oligogenic) test generalization across trait types. Comparison to appropriate baselines (standard PGS methods, clinical covariates alone) isolates the contribution of foundation model features.
The benchmark is particularly relevant for assessing claims that genomic foundation models add predictive value beyond classical statistical genetics. Foundation models incur substantial computational costs compared to linear PGS models; TraitGym helps determine whether these costs are justified by improved prediction.
11.4.3 EmbedGEM Framework
A foundation model embedding might correlate with disease outcomes for the wrong reasons: batch effects, population structure, or other confounders that happen to track with health status. How do you distinguish models that have discovered genuine biology from those that have merely learned to recognize data artifacts? The EmbedGEM framework evaluates whether foundation model embeddings capture biologically meaningful genetic signal, as opposed to technical artifacts or confounders (Mukherjee et al. 2024). The framework assesses embeddings along two axes: heritability (the proportion of phenotypic variance attributable to genetic factors, as introduced in Chapter 3) and disease relevance.
The heritability axis measures how much genetic signal an embedding captures. EmbedGEM counts the number of genome-wide significant loci associated with embedding components and quantifies the strength of association through mean chi-squared statistics. Higher values indicate that the embedding reflects heritable biology rather than noise.
The disease relevance axis measures whether embedding-associated variants predict clinically meaningful outcomes. Polygenic scores constructed from embedding GWAS hits are evaluated for their ability to predict disease in independent cohorts. Incremental predictive value over standard clinical models indicates that the embedding captures disease-relevant genetic information.
This two-axis evaluation addresses a critical question for foundation model deployment: do learned representations discover novel biology or merely recapitulate known associations with additional computational overhead? Embeddings that show high heritability but low disease relevance may capture biological signal that is not clinically actionable. Embeddings that show disease relevance without novel genetic discoveries may not add value beyond existing PGS methods.
11.5 Benchmark Construction and Hidden Assumptions
Beyond cataloging benchmark suites, understanding how benchmarks are constructed reveals assumptions that shape what they measure and what they miss.
11.5.1 Data Sources and Label Provenance
Benchmark labels derive from diverse sources with different properties. Experimental assays (ChIP-seq, DMS, MPRA) provide direct measurements but are limited by assay-specific artifacts and selection pressures. Computational annotations (gene calls, functional predictions, conservation scores) provide broader coverage but introduce circular dependencies if models are trained and evaluated on overlapping sources. Clinical classifications aggregate expert judgment but reflect the evidence available at classification time, which may include the very predictors being benchmarked.
The provenance of benchmark labels determines what success on that benchmark actually means. High performance on experimentally derived labels suggests the model captures the specific molecular process assayed. High performance on clinical labels may indicate genuine clinical utility or may reflect circularity with existing prediction tools. Understanding label provenance is prerequisite to interpreting benchmark results.
Consider a variant effect predictor that achieves 0.95 auROC on a ClinVar benchmark. Before interpreting this result, you should ask:
- What evidence types contributed to the ClinVar classifications? (Functional assays? Segregation? Computational predictions?)
- Did any of those computational predictions use similar features to your model?
- How would you detect whether circularity inflated your performance?
These questions apply to any benchmark with aggregated or curated labels.
11.5.2 Splitting Strategies and Leakage
How benchmarks partition data into training and test sets determines whether evaluation measures generalization or memorization (Chapter 12). Random splitting, where examples are assigned to splits uniformly at random, represents the weakest form of evaluation. In genomics, random splits often permit homology-based leakage: training and test sequences may share sufficient similarity that memorization suffices for good performance.
Homology-aware splitting clusters sequences by similarity before assigning clusters to splits, ensuring that test sequences are evolutionarily distant from training sequences. This approach is standard for protein benchmarks (using tools like CD-HIT or MMseqs2) but less consistently applied for DNA benchmarks.
Chromosome-based splitting holds out entire chromosomes for testing, preventing any position-based leakage within chromosomes. This approach is common for regulatory benchmarks but does not account for homologous sequences on different chromosomes. Temporal splitting reserves recent data for testing, appropriate when benchmarks derive from databases with submission timestamps. Each splitting strategy tests different aspects of generalization; the choice should match the intended deployment scenario.
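A sketch of homology-aware assignment, assuming cluster labels have already been computed by an external tool such as MMseqs2 or CD-HIT; the essential step is that whole clusters, not individual sequences, are assigned to splits.

```python
# Homology-aware splitting: whole similarity clusters are assigned to train or test,
# so no test sequence has a near-duplicate in training. Cluster labels are assumed to
# come from an upstream clustering step (e.g., MMseqs2 or CD-HIT output parsed elsewhere).
import random

def cluster_split(sequence_ids, cluster_of, test_fraction=0.2, seed=0):
    clusters = sorted({cluster_of[s] for s in sequence_ids})
    random.Random(seed).shuffle(clusters)
    n_test = max(1, int(test_fraction * len(clusters)))
    test_clusters = set(clusters[:n_test])
    train_ids = [s for s in sequence_ids if cluster_of[s] not in test_clusters]
    test_ids = [s for s in sequence_ids if cluster_of[s] in test_clusters]
    return train_ids, test_ids
```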
11.5.3 The Leakage Tax: Genomic Heterogeneity Inflates Performance
Variant effect prediction models achieve 0.975 auROC when evaluated on random test splits but only 0.697 auROC when evaluated at splice sites. The 28-point drop does not reflect splice site difficulty; it reflects data leakage. A 2025 Nature Methods study quantified how genomic heterogeneity (sequence similarity, haplotype structure, conserved regulatory elements) creates hidden connections between training and test sets that random splits fail to sever (data_leakage_guidelines_2025?).
The leakage mechanism differs from the family-level relatedness discussed in Section 12.4. Genomes contain repeated elements, conserved motifs, and long-range linkage disequilibrium that create subtle similarities between distant loci. A model trained to predict chromatin accessibility at enhancers in chromosome 1 may memorize GC-rich motifs that also appear in chromosome 12 enhancers. When test enhancers share these motifs, performance appears excellent despite the model learning sequence composition rather than regulatory grammar. The solution is homology-aware splitting: cluster sequences by similarity, assign entire clusters to train or test, and verify that no significant sequence overlap remains.
The paper documents three splitting strategies with progressively stricter leakage control. Sequence-identity clustering (90% identity threshold) prevents direct homology leakage. Locus-based splits assign entire genes or regulatory regions to train or test, blocking local linkage. Chromosome-based splits eliminate all within-genome structure sharing, though they create artificial distribution shifts that may underestimate real-world performance. The optimal strategy depends on the deployment scenario: if the model will score variants in previously studied genes, sequence-identity clustering suffices; if it must generalize to entirely novel loci, locus-based splits are required.
The hashFrag tool automates homology-aware splitting for genomic sequence data (hashfrag_2025?). It hashes sequence fragments, clusters by Jaccard similarity, and outputs train/test assignments that respect sequence homology boundaries. Integration with standard benchmark workflows requires only specifying the similarity threshold and cluster assignment strategy.
11.5.4 Metric Selection and Aggregation
Benchmark metrics determine what aspects of model performance are measured. Discrimination metrics (auROC, auPRC, correlation) assess whether models rank predictions correctly. Calibration metrics (expected calibration error, reliability diagrams) assess whether predicted probabilities match observed frequencies (Section 24.3). Clinical utility metrics (net benefit, decision curves) assess whether predictions improve decisions compared to treating all patients the same (Chapter 28).
Different metrics can yield different rankings of models. A model with superior discrimination may have poor calibration, predicting the right relative order but wrong absolute probabilities. Choosing which metric to optimize, and how to aggregate across multiple tasks or datasets, involves implicit decisions about what matters for downstream use.
Aggregation across tasks raises additional issues. Mean performance across many tasks weights each task equally, regardless of clinical importance or dataset quality. Median performance is robust to outliers but obscures variation. Reporting full distributions of task-level performance provides more information but complicates comparison. The choice of aggregation method can substantially affect which model appears best.
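A small illustration with invented per-task scores shows how the aggregation choice can reorder models:

```python
# Aggregation choices can reorder models. The per-task scores below are invented for illustration.
import numpy as np

task_scores = {
    "model_A": np.array([0.80, 0.79, 0.81, 0.78, 0.20]),  # strong on most tasks, one failure
    "model_B": np.array([0.70, 0.71, 0.69, 0.70, 0.71]),  # uniformly mediocre
}
for name, scores in task_scores.items():
    print(f"{name}: mean={scores.mean():.3f} median={np.median(scores):.3f}")
# Mean ranks model_B first (0.702 vs. 0.676); median ranks model_A first (0.790 vs. 0.700).
```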
11.5.5 Goodhart’s Law and Benchmark Gaming
Benchmarks create incentive structures, and incentive structures invite optimization. Goodhart’s Law, that a measure ceases to be a good measure once it becomes a target, applies with particular force to machine learning evaluation. When model development prioritizes leaderboard position, the benchmark becomes the optimization target rather than a proxy for the underlying capability it was designed to measure.
Gaming takes multiple forms in genomic AI. Architectural choices may be tuned specifically to benchmark characteristics: receptive fields sized to match benchmark sequence lengths, output heads designed for benchmark label distributions, hyperparameters selected through extensive benchmark-specific search. Such tuning improves benchmark performance without necessarily improving generalization to deployment scenarios that differ from benchmark conditions.
More subtle gaming arises from selective reporting. Models may be evaluated on many benchmarks with only favorable results published. Benchmark versions may be chosen to maximize apparent performance. Evaluation protocols may deviate from published standards in ways that inflate metrics. The cumulative effect is a literature where reported performance systematically overestimates deployment capability.
The circularity between predictors and databases creates particularly insidious gaming dynamics. When ClinVar classifications incorporate computational predictions, and those predictions are then benchmarked against ClinVar, the benchmark rewards models that resemble their predecessors rather than models that provide independent information (Chapter 13). This circularity is rarely acknowledged in benchmark reporting, yet it fundamentally compromises the validity of performance claims.
Mitigating gaming requires structural changes to evaluation practice: prospective benchmarks like CAGI (Section 11.3.2) where predictions precede labels, held-out evaluation consortia that resist optimization pressure, and reporting standards that require disclosure of all benchmarks attempted rather than only those where performance was favorable. The field’s maturation depends on developing evaluation cultures that reward honest assessment over leaderboard position.
11.6 Benchmark Saturation and Staleness
Benchmarks have finite useful lifetimes. As models improve, benchmarks saturate; as data and methods evolve, benchmarks become stale.
11.6.1 Saturation: When Benchmarks Stop Discriminating
A benchmark saturates when the best models achieve performance that cannot be meaningfully improved. Saturation may reflect fundamental limits (the benchmark approaches the Bayes error rate, the theoretical minimum error achievable by any classifier given the inherent overlap between classes), measurement noise (the benchmark’s labels are too noisy to support finer discrimination), or ceiling effects (the metric itself cannot distinguish between excellent and perfect performance). The Bayes error rate represents irreducible error: even a perfect model cannot do better because the input features simply do not contain enough information to distinguish all cases. This irreducible error is a form of aleatoric uncertainty (Section 24.1.3), arising from inherent randomness in the data rather than model limitations.
Saturation is problematic because it removes the benchmark’s value for model selection. When all reasonable models achieve 0.97 auROC, differences between 0.970 and 0.975 are unlikely to reflect meaningful capability differences. Yet benchmark reporting conventions often emphasize such decimal places, creating an illusion of progress.
Detecting saturation requires estimating the irreducible error. For benchmarks with replicate measurements, comparing model performance to replicate concordance provides an upper bound: models cannot systematically outperform the reproducibility of the underlying assay. For benchmarks without replicates, saturation is harder to diagnose. One heuristic is tracking the rate of improvement: when new methods provide diminishing gains despite substantial architectural innovations, saturation is likely.
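To make the replicate-concordance heuristic concrete, here is a minimal sketch, assuming paired replicate measurements and model predictions are available as numeric arrays; all data below are simulated and the variable names are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr

# Simulated example: two replicate measurements of the same assay and a
# model's predictions for the same set of benchmark examples.
rng = np.random.default_rng(0)
signal = rng.normal(size=2000)
replicate_1 = signal + rng.normal(scale=0.5, size=2000)
replicate_2 = signal + rng.normal(scale=0.5, size=2000)
model_pred = signal + rng.normal(scale=0.6, size=2000)

# Replicate concordance bounds what any model can hope to achieve:
# a model cannot systematically exceed the assay's own reproducibility.
r_replicates, _ = pearsonr(replicate_1, replicate_2)

# Evaluate the model against the averaged replicates (a common convention).
r_model, _ = pearsonr(model_pred, (replicate_1 + replicate_2) / 2)

print(f"replicate concordance (approx. ceiling): r = {r_replicates:.3f}")
print(f"model vs. replicate mean:                r = {r_model:.3f}")
if r_model >= 0.95 * r_replicates:
    print("Model performance is near the noise ceiling; the benchmark may be saturating.")
```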
Responding to saturation means moving to harder benchmarks that still discriminate between methods, developing new benchmarks that capture aspects of performance existing ones miss, and retiring saturated benchmarks from active leaderboard competition while retaining them as sanity checks.
11.6.2 Staleness: When Benchmarks Diverge from Practice
Benchmarks become stale when they no longer reflect current data, methods, or clinical practice. Assays evolve: a benchmark constructed from early ENCODE data may not represent current experimental protocols. Annotations improve: gene models, variant classifications, and functional element maps are continuously updated. Clinical practice shifts: treatment guidelines and diagnostic criteria change the meaning of historical labels.
Staleness is insidious because it erodes benchmark validity gradually rather than abruptly. A benchmark that accurately represented regulatory prediction in 2015 may systematically misrepresent it in 2025, yet the benchmark’s continued use perpetuates optimization for an outdated target.
Addressing staleness requires periodic benchmark refresh with updated data and annotations, version control that documents exactly what each benchmark version contains, and awareness that performance on historical benchmarks may not predict performance on current data.
11.6.3 Leakage from Scale
Modern foundation models are pretrained on corpora that may include most publicly available genomic data. This creates novel leakage risks distinct from classical train-test overlap, compounding the circularity concerns discussed for clinical databases (Section 11.3.1). A model pretrained on all ENCODE data may effectively have seen the exact experiments used in many regulatory benchmarks. A model pretrained on all UniRef may have seen sequences highly similar to protein benchmark test sets. This pretraining-benchmark overlap inflates performance in ways that are difficult to detect and even more difficult to correct.
Leakage from scale is particularly problematic because it is often undocumented. Model papers rarely enumerate exactly which datasets were included in pretraining corpora, and benchmark papers rarely specify which datasets should be excluded. The result is ambiguity about whether benchmark success reflects genuine generalization or memorization from pretraining.
Mitigating leakage from scale requires explicit documentation of pretraining corpora, tools or hashes that help identify overlap between pretraining data and benchmark test sets, and held-out evaluation consortia that reserve data specifically for assessment without any use in pretraining.
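A minimal version of such an overlap check, assuming both the pretraining corpus and the benchmark test set can be loaded as plain sequence strings (the file names below are hypothetical), hashes canonicalized sequences and counts exact matches; catching near-duplicates would additionally require k-mer or alignment-based comparison, which this sketch does not attempt.

```python
import hashlib

def sequence_fingerprint(seq: str) -> str:
    """Hash an uppercased, whitespace-stripped sequence for exact-match lookup."""
    canonical = "".join(seq.split()).upper()
    return hashlib.sha256(canonical.encode()).hexdigest()

def load_sequences(path: str):
    """Hypothetical loader: one sequence per line (replace with FASTA parsing as needed)."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]

# Hypothetical file paths; in practice these would be the documented
# pretraining corpus and the benchmark's held-out test split.
pretraining_hashes = {sequence_fingerprint(s) for s in load_sequences("pretraining_corpus.txt")}
test_sequences = load_sequences("benchmark_test.txt")

overlapping = sum(sequence_fingerprint(s) in pretraining_hashes for s in test_sequences)
print(f"{overlapping}/{len(test_sequences)} test sequences appear verbatim in pretraining data")
```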
11.7 Benchmark-Deployment Gap
High benchmark performance does not guarantee deployment success. Understanding why requires examining the systematic differences between benchmark settings and real-world applications.
11.7.1 Distribution Shift
Benchmark test sets sample from the same distribution as training sets. Deployment populations may differ systematically. For variant effect prediction, benchmark variants are typically common enough to appear in multiple databases, while deployment often targets rare variants seen in single individuals. For regulatory prediction, benchmarks derive from well-studied cell types and tissues, while deployment may require prediction in understudied contexts.
Distribution shift manifests as degraded performance, but the pattern of degradation varies. The transfer learning framework in Section 10.5 examines how models handle distribution shift from a methodological perspective, while Section 13.10 addresses the confounding implications when shift correlates with protected attributes. Some models degrade gracefully, maintaining reasonable accuracy across the distribution shift. Others degrade catastrophically, with confident predictions that prove systematically wrong. Benchmarks that include held-out subpopulations or out-of-distribution test sets provide some information about robustness, but cannot anticipate every deployment scenario.
11.7.2 Calibration Requirements
Clinical deployment requires not just accurate rankings but accurate probability estimates (Section 24.3). A variant classifier that assigns probability 0.9 to every pathogenic variant and 0.3 to every benign variant discriminates perfectly yet provides miscalibrated uncertainty. Clinical decisions that depend on thresholded predictions (reporting variants above a certain probability) will perform poorly if those probabilities do not reflect actual pathogenicity rates.
Most benchmark metrics emphasize discrimination over calibration. auROC is invariant to monotonic transformations of predicted probabilities, and rank correlations reward preserving the ordering of predictions rather than the accuracy of the probabilities themselves. As a result, models may be optimized for benchmark success through strategies that damage calibration. The benchmark-deployment gap for calibration can be large even when discrimination metrics are excellent.
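The divergence between discrimination and calibration is easy to demonstrate with synthetic labels and scores. In the sketch below (scikit-learn metrics, illustrative numbers only), a predictor that perfectly ranks positives above negatives but outputs compressed probabilities keeps a perfect auROC while its Brier score exposes the miscalibration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

# Synthetic example: 100 benign (label 0) and 100 pathogenic (label 1) variants.
y_true = np.array([0] * 100 + [1] * 100)

# A miscalibrated predictor: every benign variant gets 0.3, every pathogenic gets 0.9.
# Ranking is perfect, but the probabilities do not match observed frequencies.
y_miscal = np.array([0.3] * 100 + [0.9] * 100)

# A better-calibrated predictor for comparison: probabilities track the true labels closely.
y_calib = np.array([0.05] * 100 + [0.95] * 100)

for name, scores in [("miscalibrated", y_miscal), ("calibrated", y_calib)]:
    print(
        f"{name:>13}: auROC = {roc_auc_score(y_true, scores):.3f}, "
        f"Brier = {brier_score_loss(y_true, scores):.3f}"
    )
# Both achieve auROC = 1.0; only the Brier score separates them.
```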
11.7.3 Metric Mismatch
Benchmarks optimize specific metrics that may not align with deployment objectives. auROC weights errors equally regardless of where they occur on the score distribution, but clinical utility may depend primarily on performance at specific operating points. As discussed in the evaluation metrics coverage of Chapter 12, auROC and auPRC capture different aspects of model performance: auROC measures discrimination across all thresholds while auPRC is more sensitive to performance on the minority class, which matters greatly for rare pathogenic variants. Correlation rewards getting the overall pattern right but may not penalize systematic errors in clinically important regions.
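To illustrate the contrast, a small synthetic experiment (hypothetical 1% prevalence and Gaussian scores) shows how a strong-looking auROC can coexist with a much lower auPRC once positives are rare, which is the regime relevant to rare pathogenic variants.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# Synthetic, heavily imbalanced setting: 1% pathogenic variants.
n_neg, n_pos = 9900, 100
y_true = np.concatenate([np.zeros(n_neg), np.ones(n_pos)])

# Scores where positives are shifted upward but still overlap with negatives.
scores = np.concatenate([
    rng.normal(loc=0.0, scale=1.0, size=n_neg),
    rng.normal(loc=2.0, scale=1.0, size=n_pos),
])

print(f"auROC: {roc_auc_score(y_true, scores):.3f}")            # looks strong
print(f"auPRC: {average_precision_score(y_true, scores):.3f}")  # much lower under imbalance
```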
The gap between optimized metrics and deployment objectives creates misaligned incentives. Model developers optimize for benchmark success, which rewards specific metric improvements. Deployment success may require different tradeoffs: prioritizing calibration over discrimination, minimizing false negatives over false positives, or performing well on specific subpopulations rather than overall.
11.7.4 Practical Constraints
Deployment environments impose constraints that benchmarks typically ignore. Inference speed matters when predictions must be returned in clinical timescales. Model size matters when deployment hardware has limited memory. Interpretability matters when predictions must be explained to clinicians or patients (Chapter 25). Benchmarks that evaluate only accuracy miss these dimensions of deployment fitness.
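These dimensions are cheap to report alongside accuracy. The sketch below uses a hypothetical PyTorch model as a stand-in for any predictor and measures two of them: parameter count and wall-clock inference latency.

```python
import time
import torch
import torch.nn as nn

# Hypothetical stand-in model; substitute any trained predictor.
model = nn.Sequential(nn.Linear(4 * 1000, 512), nn.ReLU(), nn.Linear(512, 1))
model.eval()

n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.2f}M")

# Wall-clock latency for a single one-hot-encoded 1 kb sequence (batch of 1).
x = torch.randn(1, 4 * 1000)
with torch.no_grad():
    model(x)  # warm-up
    start = time.perf_counter()
    for _ in range(100):
        model(x)
    latency_ms = (time.perf_counter() - start) / 100 * 1000
print(f"mean inference latency: {latency_ms:.2f} ms per sequence")
```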
The benchmark-deployment gap is not merely a technical inconvenience. It represents a fundamental tension between evaluation tractability and deployment validity. Benchmarks are valuable precisely because they are standardized, reproducible, and comparable across methods. Deployment is valuable precisely because it addresses the specific needs of real-world applications. Bridging this gap requires benchmark designs that better approximate deployment conditions and deployment evaluations that provide feedback to benchmark development.
11.8 Systematic Gaps in Current Benchmarks
Despite the proliferation of benchmark suites, systematic gaps remain in the genomic evaluation landscape.
Variant types remain inadequately covered: structural variants, including inversions, copy number variants, and complex rearrangements, are rarely evaluated despite accounting for substantial genomic variation and disease burden (Section 1.5.4). Repeat regions are often excluded or masked. Multi-variant effects and haplotype-specific phenomena receive minimal attention; the phasing challenges that underlie compound heterozygosity interpretation (Section 1.4.1) rarely appear in benchmark settings.
Population representation shows profound disparities: non-European ancestry groups remain severely underrepresented (Section 3.7). The confounding implications of this underrepresentation extend beyond benchmark validity to fairness concerns examined in Section 13.10. Performance stratified by ancestry reveals gaps that aggregate metrics conceal. Environmental diversity (lifestyle, exposures, treatments) that shapes phenotypic expression is rarely incorporated.
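A minimal stratified evaluation, assuming each benchmark example carries an ancestry label (the column names and simulated data below are hypothetical), computes the same metric per group rather than only in aggregate, making visible the gaps that the pooled number conceals.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical evaluation table: per-variant label, ancestry group, and model score.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "label": rng.integers(0, 2, size=3000),
    "ancestry": rng.choice(["EUR", "AFR", "EAS"], size=3000, p=[0.8, 0.1, 0.1]),
})
# Simulated scores that are more informative for the majority group.
noise_scale = np.where(df["ancestry"] == "EUR", 0.5, 1.5)
df["score"] = df["label"] + rng.normal(scale=noise_scale)

print(f"aggregate auROC: {roc_auc_score(df['label'], df['score']):.3f}")
for group, sub in df.groupby("ancestry"):
    print(f"  {group}: auROC = {roc_auc_score(sub['label'], sub['score']):.3f} (n = {len(sub)})")
```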
Modality coverage remains uneven: long-read sequencing data is scarce in benchmarks despite its advantages for structural variants and phasing (Section 1.2.4). Single-cell benchmarks are emerging but remain limited compared to bulk assay benchmarks; the evaluation challenges specific to single-cell models are examined in Section 20.6.4. Spatial transcriptomics and other emerging modalities have minimal coverage, though multi-omic integration approaches (Chapter 23) are beginning to address cross-modality assessment.
Clinical endpoints are underrepresented: most benchmarks use molecular surrogates rather than hard clinical endpoints. Disease incidence, progression, treatment response, and patient-reported outcomes are rarely the direct prediction target. The gap between molecular proxy accuracy and clinical utility remains poorly characterized.
These gaps mean that strong benchmark performance may not predict utility for underserved populations, understudied variant classes, or clinical applications that depend on endpoints the benchmarks do not measure.
11.9 The Proxy Problem
Benchmarks structure the incentives of genomic AI development. The specific tasks, metrics, and leaderboards that the community adopts determine what models are optimized for, what claims of progress are evaluated against, and what capabilities receive attention versus neglect. A benchmark that emphasizes European-ancestry variants produces models tuned for European-ancestry performance. A benchmark that rewards discrimination (auROC) over calibration produces models that rank variants well but estimate probabilities poorly. A benchmark that reuses training data from widely available resources creates indirect leakage that inflates apparent performance. The benchmark landscape is not neutral infrastructure but an active force shaping what the field builds.
The landscape surveyed here spans protein benchmarks (TAPE, FLIP, ProteinGym), DNA and regulatory benchmarks (Genomic Benchmarks, BEND), variant effect benchmarks (ClinVar, CAGI, DMS), and trait-level benchmarks (TraitGym, EmbedGEM). Across all categories, persistent challenges emerge: saturation that reduces discriminative power as models approach ceiling performance, staleness that erodes validity as benchmarks age, leakage risks that inflate apparent capabilities, and systematic gaps in population diversity, variant type coverage, and clinical endpoint representation.
The benchmark-deployment gap represents perhaps the most consequential limitation. Strong performance on established benchmarks does not guarantee that models will behave reliably when deployed in clinical or research settings with different data distributions, patient populations, or outcome definitions. The companion chapter on Evaluation Methods (Chapter 12) examines the methodological foundations that determine whether benchmark results translate to deployment success: proper splitting strategies, leakage detection, metric selection, and experimental design. The confounding issues that plague both benchmark construction and model training receive dedicated treatment in Chapter 13, while uncertainty quantification methods (Chapter 24) provide tools for assessing when benchmark performance translates to deployment confidence. Interpretability approaches (Chapter 25) reveal whether benchmark success reflects genuine biological learning or exploitation of shortcuts. Together with this catalog of what benchmarks exist, these methodological principles provide the critical apparatus for evaluating genomic foundation model claims.
This chapter surveyed the benchmark landscape for genomic foundation models across protein, DNA, and variant effect prediction domains.
Key Takeaways:
- Protein benchmarks (TAPE, FLIP, ProteinGym) are most mature, with standardized transfer learning evaluation and homology-aware splitting
- DNA/regulatory benchmarks (Genomic Benchmarks, BEND) are rapidly developing but face challenges with quantitative targets and long-range dependencies
- Variant effect benchmarks span molecular (DMS) to clinical (ClinVar) levels, with critical circularity concerns for database-derived labels
- Trait-level benchmarks (TraitGym, EmbedGEM) assess whether foundation models add value beyond classical PGS methods
- Systematic gaps persist in population diversity, variant type coverage, and clinical endpoint representation
- The proxy-target gap means strong benchmark performance does not guarantee deployment success
Looking Ahead: The next chapter (Chapter 12) examines how to evaluate models properly: splitting strategies, leakage detection, metric selection, and experimental design that determine whether benchmark results translate to real-world utility.
Connections:
- Apply benchmark selection principles when assessing claims in later chapters on foundation models (Chapter 14 through Chapter 18)
- Clinical utility metrics introduced here are expanded for clinical risk prediction (Chapter 28)
- Population bias concerns connect to fairness and equity considerations (Section 13.10)