4 Deleteriousness Scores
4.1 The Variant Prioritization Challenge
Across a typical human genome there are millions of variants, most of them rare and the vast majority functionally neutral. For any individual with a suspected genetic condition, the central interpretation problem is to identify, against this enormous background, the handful of variants that plausibly contribute to disease.
The data resources in Chapter 2 give us multiple, complementary views:
- Population frequency from gnomAD and related resources tells us what variation survives in large cohorts of ostensibly healthy individuals (“The Genome Aggregation Database (gnomAD)” n.d.).
- Functional genomics from ENCODE, Roadmap, and other consortia indicates which regions are biochemically active.
- Clinical databases such as ClinVar and HGMD collect expert-curated variant classifications from case reports and diagnostic labs.
Each source is partial. Population databases are dominated by common, mostly tolerated variants. Functional genomics data is noisy and often context-specific. Clinical databases are sparse and heavily biased toward well-studied genes and variant types.
Before deep learning, many variant effect predictors tackled this problem by focusing on one narrow signal:
- Sequence conservation (e.g., phyloP (Siepel et al. 2005), GERP (Davydov et al. 2010))
- Protein-level features (e.g., SIFT (Ng and Henikoff 2003), PolyPhen (Adzhubei et al. 2010))
- Simple positional annotations (e.g., distance to splice sites)
Combined Annotation-Dependent Depletion (CADD) was a step change: it defined a general framework for genome-wide variant prioritization that integrates dozens of heterogeneous annotations and uses evolutionary depletion as a proxy label (Rentzsch et al. 2019). Rather than training directly on small sets of known pathogenic vs benign variants, CADD contrasts variants that survived purifying selection in humans with matched simulated variants that “could have occurred but did not.”
This chapter focuses on the CADD framework because it establishes design patterns that show up repeatedly in later deep learning models: proxy labels, large-scale training, integration of diverse features, and genome-wide coverage.
4.2 The Evolutionary Proxy Training Strategy
CADD’s most important conceptual move was to avoid the scarcity and bias of curated pathogenic variants by constructing a synthetic classification problem:
“How different do the annotations of observed human variants look from those of matched, simulated variants that have not survived evolution?”
4.2.1 Proxy-Neutral Variants
CADD defines a proxy-neutral class from variants that are actually observed in human populations:
- Single nucleotide variants (SNVs) and short indels from large sequencing datasets such as the 1000 Genomes Project and early gnomAD-like resources.
- Typically restricted to variants with high derived allele frequency (DAF), under the assumption that alleles that have drifted to high frequency are unlikely to be strongly deleterious over recent human evolutionary time.
These variants are not guaranteed to be benign, but, on average, they are enriched for tolerated alleles that have survived purifying selection.
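To make the filter concrete, here is a minimal sketch in Python, assuming a pandas table of variants with a derived-allele-frequency column; the table contents, column names, and the 0.90 cutoff are all illustrative rather than CADD's published values:

```python
import pandas as pd

# Hypothetical variant table; in practice this would be exported from a VCF
# or a population database, with chrom/pos/ref/alt plus derived allele frequency.
variants = pd.DataFrame({
    "chrom": ["1", "1", "2", "7"],
    "pos": [10_177, 55_516_888, 21_001_234, 117_559_590],
    "ref": ["A", "G", "C", "G"],
    "alt": ["AC", "A", "T", "A"],
    "daf": [0.42, 0.95, 0.003, 0.0001],
})

# Proxy-neutral set: keep only high-DAF variants, on the assumption that
# alleles near fixation have survived recent purifying selection.
DAF_THRESHOLD = 0.90  # illustrative cutoff
proxy_neutral = variants[variants["daf"] >= DAF_THRESHOLD]
print(proxy_neutral)
```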
4.2.2 Proxy-Deleterious Variants
To construct the opposing class, CADD simulates mutations across the genome:
- The simulation matches local trinucleotide (or more complex) sequence context so that the spectrum of simulated mutations reflects realistic mutation processes.
- Mutation rates are scaled in local genomic windows to approximate heterogeneous mutation rates (e.g., elevated at CpGs) (Rentzsch et al. 2019; Schubach et al. 2024).
- The key idea is that simulated variants represent changes that could occur but are generally not observed at high frequency in large cohorts.
These simulated variants are treated as proxy-deleterious: enriched for potential alleles that are disfavored by selection, even though many individual simulated variants will in fact be neutral.
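The toy sketch below illustrates the core of context-matched simulation: candidate substitutions are sampled in proportion to trinucleotide-specific rates, so the simulated spectrum mirrors an empirical mutation model. The sequence and rate table are made up, and the real CADD simulator additionally rescales rates in local genomic windows and handles indels:

```python
import random
from collections import Counter

random.seed(0)

def trinucleotide_contexts(sequence):
    """Yield (position, trinucleotide) for every internal base."""
    for i in range(1, len(sequence) - 1):
        yield i, sequence[i - 1 : i + 2]

def simulate_variants(sequence, context_rates, n_variants):
    """Sample SNVs whose trinucleotide spectrum follows `context_rates`,
    a dict mapping (trinucleotide, alt_base) -> relative mutation rate."""
    events, weights = [], []
    for pos, ctx in trinucleotide_contexts(sequence):
        for alt in "ACGT":
            if alt != ctx[1] and (ctx, alt) in context_rates:
                events.append((pos, ctx[1], alt))
                weights.append(context_rates[(ctx, alt)])
    return random.choices(events, weights=weights, k=n_variants)

# Toy rate table with an elevated C->T rate in a CpG-like context.
rates = Counter({("ACG", "T"): 10.0, ("ACG", "A"): 1.0, ("TAG", "C"): 1.0})
print(simulate_variants("TTACGTAGACGTT", rates, n_variants=5))
```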
4.2.3 Training Objective
With proxy-neutral and proxy-deleterious sets in hand, CADD trains a binary classifier:
- Input: a vector of annotations describing each variant (Section “Integration of Diverse Annotations”).
- Label: simulated (proxy-deleterious) vs observed (proxy-neutral).
- Objective: learn a scoring function that assigns higher scores to simulated variants.
Early CADD models used linear support vector machines trained on millions of simulated vs observed variants with dozens of annotations (Rentzsch et al. 2019). Later versions, including CADD v1.7, use logistic regression over an expanded feature set to assign deleteriousness scores to all possible SNVs and small indels (Schubach et al. 2024).
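A minimal sketch of this setup, with synthetic feature matrices standing in for the real annotation vectors and scikit-learn's logistic regression standing in for the production model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Stand-ins for the annotation matrices: rows are variants, columns are
# features such as conservation scores and regulatory signals.
X_observed = rng.normal(loc=0.0, size=(5000, 20))   # proxy-neutral
X_simulated = rng.normal(loc=0.3, size=(5000, 20))  # proxy-deleterious

X = np.vstack([X_observed, X_simulated])
y = np.concatenate([np.zeros(5000), np.ones(5000)])  # 1 = simulated

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)

# The raw-score analogue: higher decision values indicate annotation
# profiles resembling the simulated (proxy-deleterious) class.
raw_scores = clf.decision_function(X)
```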
This “evolutionary depletion” setup is conceptually similar to modern self-supervised learning:
- The labels are not clinical (pathogenic vs benign) but derived from a proxy signal (survival under selection).
- The training set is extremely large (tens of millions of variants), enabling complex models and rich decision boundaries.
- The resulting scores are then re-used downstream for many different tasks—rare disease gene discovery, variant filtration pipelines, and evaluation baselines.
4.3 Integration of Diverse Annotations
CADD’s second pillar is integrating many weak, noisy annotations into a single composite score. Where earlier tools might rely on one or a few scores, CADD combines more than 60 features in its original incarnation and substantially more in v1.7 (Rentzsch et al. 2019; Schubach et al. 2024).
Because Chapter 2 already surveys the underlying resources, we focus here on the categories of features and how they function in CADD. For details on individual databases (ENCODE, Roadmap, gnomAD, ClinVar, etc.), see Chapter 2.
4.3.1 Gene Model Annotations
These features describe the local gene and transcript context of each variant:
- Sequence consequence (e.g., synonymous, missense, nonsense, frameshift, splice site).
- Distance to exon–intron boundaries and untranslated regions.
- Gene-level attributes, such as known disease associations or gene constraint metrics (e.g., pLI, LOEUF), derived from population data (“The Genome Aggregation Database (gnomAD)” n.d.).
These annotations allow CADD to distinguish, for example, a synonymous variant in a tolerant gene from a truncating variant in a highly constrained gene.
4.3.2 Conservation and Constraint
A major feature block captures evolutionary conservation:
- Base-level conservation metrics such as GERP, phastCons, and phyloP, computed over multi-species alignments (Siepel et al. 2005; Davydov et al. 2010).
- Regional measures of evolutionary constraint and mutation rate.
Highly conserved positions tend to be functionally important. CADD uses these conservation scores as some of its strongest signals that a variant may be deleterious, particularly in non-coding regions where direct functional labels are scarce (Rentzsch et al. 2019).
4.3.3 Epigenetic and Regulatory Activity
CADD also incorporates regulatory annotations derived from functional genomics:
- DNase I hypersensitivity and ATAC-seq peaks.
- ChIP–seq signals for histone marks and transcription factors (from ENCODE, Roadmap, and related efforts).
- Chromatin state segmentations that summarize combinatorial patterns of marks.
These features provide a view of whether a variant lies in an active enhancer, promoter, or other regulatory element, and help prioritize non-coding variants that disrupt regulatory regions.
4.3.4 Additional Features
Additional features capture sequence and genomic context:
- Local GC content and CpG dinucleotide context.
- Segmental duplications and low-complexity regions.
- Protein-level features for coding variants, such as amino acid properties and legacy scores like SIFT and PolyPhen (Ng and Henikoff 2003; Adzhubei et al. 2010).
Not every individual annotation is informative in isolation. The power of CADD lies in learning how to weight and combine these heterogeneous signals.
4.4 Model Architecture and Scoring
4.4.1 Machine Learning Framework
CADD’s classifier operates on a high-dimensional feature vector \(x\) for each variant:
- In early versions, a linear SVM was trained on ~30 million observed and simulated variants with 63 annotations, plus some interaction terms (Rentzsch et al. 2019).
- In CADD v1.7, the framework uses a logistic regression–style model and an expanded annotation set but retains the same basic paradigm of contrasting simulated and observed variants (Schubach et al. 2024).
Conceptually, the classifier learns a score \(s(x)\) such that:
- Large positive values indicate variants whose annotation profiles resemble the proxy-deleterious class.
- Large negative values indicate variants whose profiles resemble the proxy-neutral class.
The raw output of this classifier is often referred to as the C-score or raw CADD score.
4.4.2 PHRED-Scaled Scores
Raw CADD scores are not directly interpretable as probabilities or effect sizes. To provide a more intuitive scale, CADD defines PHRED-like scaled scores based on the rank of each variant among all possible single-nucleotide substitutions in the reference genome (Rentzsch et al. 2019; Schubach et al. 2024):
- Scaled score 10 ≈ variant is in the top 10% most deleterious predicted substitutions.
- Scaled score 20 ≈ top 1%.
- Scaled score 30 ≈ top 0.1%.
In other words, the scaled score compresses the raw scores into a roughly 0–99 range that reflects percentile rank rather than an absolute effect size. This has several practical consequences:
- A simple, rank-based interpretation (“is this variant in the top x% of predicted deleteriousness?”).
- Comparability across CADD versions: a score of 20 always means “top 1%” even if the underlying model and features change.
- Coarser resolution in the bulk of the distribution: many common or mildly deleterious variants share similar scaled scores.
In rare disease pipelines, it is common to apply basic CADD filters—e.g., focusing on variants with scaled scores ≥ 15 or ≥ 20—to enrich for potentially deleterious hits before more detailed interpretation.
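The rank-based scaling can be written directly from the definition above; the sketch below is an illustrative reimplementation, not the CADD codebase:

```python
import numpy as np

def phred_scale(raw_scores):
    """PHRED-like scaling: -10 * log10 of each variant's proportional rank
    among all scored substitutions (rank 1 = highest raw score)."""
    raw_scores = np.asarray(raw_scores)
    order = raw_scores.argsort()[::-1]              # descending raw score
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(raw_scores) + 1)
    return -10 * np.log10(ranks / len(raw_scores))

scaled = phred_scale(np.random.default_rng(0).normal(size=1_000_000))
print((scaled >= 20).mean())   # ~0.01, i.e., scaled score 20 = top 1%
candidates = scaled >= 15      # a typical rare-disease pre-filter
```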
4.5 CADD v1.7: Integration of Deep Learning Predictions
CADD v1.7 illustrates how the original framework naturally accommodates deep learning outputs and modern sequence models (Schubach et al. 2024).
4.5.1 Protein Language Model Features
For protein-coding variants, CADD v1.7 integrates scores from protein language models such as ESM-1v (Meier et al. 2021). These models:
- Are trained self-supervised on hundreds of millions of protein sequences.
- Learn contextual embeddings and per-residue “likelihoods” that reflect evolutionary plausibility and functional constraints.
- Provide powerful per-variant scores for missense, nonsense, and frameshift events.
By embedding ESM-1v–derived features into its annotation set, CADD v1.7 effectively delegates part of the representation learning to a large protein foundation model, then uses its own classifier to recalibrate and integrate these signals with other annotations (Schubach et al. 2024).
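As an illustration of how such features can be derived, the sketch below scores a missense variant with the "wild-type marginals" heuristic of Meier et al. (2021), assuming the fair-esm package is installed; it is not guaranteed to match the exact extraction pipeline used by CADD v1.7:

```python
import torch
import esm  # fair-esm package; downloads model weights on first use

model, alphabet = esm.pretrained.esm1v_t33_650M_UR90S_1()
model.eval()
batch_converter = alphabet.get_batch_converter()

def wt_marginal_score(sequence, pos, wt, mut):
    """Score a missense variant as log p(mut) - log p(wt) at position `pos`
    (0-based), from one forward pass on the wild-type sequence."""
    assert sequence[pos] == wt
    _, _, tokens = batch_converter([("query", sequence)])
    with torch.no_grad():
        logits = model(tokens)["logits"]
    log_probs = torch.log_softmax(logits[0], dim=-1)
    row = log_probs[pos + 1]  # offset for the beginning-of-sequence token
    return (row[alphabet.get_idx(mut)] - row[alphabet.get_idx(wt)]).item()

# More negative scores suggest less evolutionarily plausible substitutions.
print(wt_marginal_score("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", pos=3, wt="A", mut="W"))
```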
4.5.2 Regulatory CNN Predictions
For non-coding variants, CADD v1.7 incorporates regulatory variant effect predictions from sequence-based convolutional neural networks trained on chromatin accessibility and related assays (Zhou and Troyanskaya 2015; Schubach et al. 2024). These CNNs:
- Take raw DNA sequence as input.
- Predict a battery of chromatin features (e.g., TF binding, histone marks).
- Provide per-variant delta scores reflecting the predicted impact of a mutation on these regulatory readouts.
In this way, CADD v1.7 uses early sequence-to-function CNNs (like those discussed in Chapters 5 and 6) as feature generators in its broader integrative framework.
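A sketch of the delta-score computation, with a small untrained network standing in for a real DeepSEA-style model (`chromatin_cnn` and its output dimension are illustrative):

```python
import torch

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    """Encode DNA as a (1, 4, length) tensor."""
    x = torch.zeros(1, 4, len(seq))
    for i, base in enumerate(seq):
        x[0, BASES[base], i] = 1.0
    return x

def delta_scores(chromatin_cnn, ref_seq, alt_seq):
    """Per-feature deltas: predicted chromatin readouts for the alternate
    allele minus those for the reference allele."""
    with torch.no_grad():
        return (chromatin_cnn(one_hot(alt_seq)) - chromatin_cnn(one_hot(ref_seq))).squeeze(0)

# Untrained stand-in; a real model would be trained on chromatin profiles.
chromatin_cnn = torch.nn.Sequential(
    torch.nn.Conv1d(4, 8, kernel_size=8),
    torch.nn.AdaptiveMaxPool1d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(8, 919),  # e.g., 919 chromatin features, as in DeepSEA
)
print(delta_scores(chromatin_cnn, "ACGT" * 25, "ACGT" * 24 + "ACTT").shape)
```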
4.5.3 Extended Conservation Scores
CADD v1.7 also updates its conservation and mutation-rate features to include:
- Deeper mammalian alignments from projects like Zoonomia.
- Improved models of genome-wide mutation rates and regional constraint (Schubach et al. 2024).
These updates help sharpen the evolutionary signal, particularly in non-coding regions and for variant types that were under-represented in earlier releases.
4.5.4 Performance Improvements
CADD v1.7 is evaluated on several benchmark sets (Schubach et al. 2024):
- ClinVar and gnomAD/ExAC variants, where pathogenic vs benign labels provide a coarse clinical ground truth.
- Deep mutational scanning (DMS) assays for coding variants, summarized in ProteinGym (Notin et al. 2023).
- Saturation mutagenesis reporter assays for promoters and enhancers, capturing regulatory variant effects.
Across these benchmarks, incorporating PLM scores, regulatory CNN predictions, and updated conservation features yields consistent improvements in classification and ranking performance compared to earlier CADD versions.
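Schematically, these two benchmark families are scored differently: classification metrics such as AUROC for binary clinical labels, and rank correlations for continuous functional readouts. The snippet below shows both on synthetic data (all arrays are made up):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-ins: binary ClinVar-style labels and continuous DMS
# effects, each paired with noisy model scores.
clinvar_labels = rng.integers(0, 2, size=500)
clinvar_scores = clinvar_labels + rng.normal(scale=1.0, size=500)
dms_effects = rng.normal(size=500)
dms_scores = dms_effects + rng.normal(scale=1.0, size=500)

print("AUROC:", roc_auc_score(clinvar_labels, clinvar_scores))      # classification
print("Spearman:", spearmanr(dms_scores, dms_effects).correlation)  # ranking
```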
4.6 Benchmarking Against Alternative Approaches
4.6.1 Coding Variants
For coding variants, CADD is one among many deleteriousness scores:
- Legacy tools such as SIFT and PolyPhen focus on protein sequence and structure (Ng and Henikoff 2003; Adzhubei et al. 2010).
- Ensemble scores (e.g., REVEL, MetaLR) combine multiple predictors.
- Modern methods exploit PLMs, structure prediction (AlphaFold), and other deep learning architectures.
In many comparisons, CADD’s combination of evolutionary, protein, and gene-context features yields performance that is competitive with or superior to specialized scores for Mendelian disease variant prioritization, especially when used as part of a broader interpretation pipeline (Rentzsch et al. 2019; Schubach et al. 2024).
4.6.2 Non-coding Variants
Non-coding variant interpretation is inherently more challenging:
- Ground-truth pathogenic variants are rarer and more biased toward specific genes and regulatory elements.
- Functional genomics assays often provide noisy, context-specific readouts.
Here, CADD’s integration of regulatory annotations and conservation allows it to rank plausible non-coding candidates genome-wide, particularly in promoters and enhancers covered by ENCODE/Roadmap-like data (Rentzsch et al. 2019). However, its performance depends heavily on the availability and quality of underlying annotations.
4.6.3 Population Frequency Correlation
Because CADD uses evolution as a training signal, its scores naturally correlate with population allele frequencies:
- Common variants in gnomAD tend to have low CADD scores.
- Very rare variants (especially singletons) show a broad distribution, with a subset in the extreme high-score tail (Rentzsch et al. 2019; “The Genome Aggregation Database (gnomAD)” n.d.).
This correlation is useful—high CADD scores often highlight variants under purifying selection—but it also means that CADD partially recapitulates frequency filtering. In downstream pipelines, it is important not to double-count this signal (e.g., by applying both aggressive frequency cutoffs and strict CADD thresholds).
4.6.4 Limitations and Circularity with ClinVar
CADD is now deeply embedded in variant interpretation workflows, and this success raises an important methodological issue: circularity between CADD and clinical databases such as ClinVar.
Two forms of circularity are particularly relevant:
Evaluation circularity
- CADD v1.7 is evaluated in part on datasets derived from ClinVar, ExAC/gnomAD, and 1000 Genomes variants (Schubach et al. 2024).
- However, ClinVar submissions increasingly incorporate in silico evidence, including CADD scores, as part of their classification process.
- When we evaluate CADD on ClinVar after ClinVar curation has already used CADD, we risk overestimating performance, because the model is partially being judged against labels that it helped create.
Feedback into training and model development
- Although the core CADD training labels are based on simulated vs observed variants, ClinVar and related resources still influence model development: they guide feature engineering, threshold selection, and choice of evaluation benchmarks.
- Over time, variants that are consistently prioritized by CADD are more likely to receive follow-up, be published, and enter ClinVar as “likely pathogenic,” reinforcing the underlying signal.
This kind of sociotechnical feedback loop is not unique to CADD; it is a general challenge for widely used predictive tools in genomics and medicine. For CADD and successor models, it motivates several best practices:
- Include evaluation datasets that are independent of clinical curation pipelines, such as DMS experiments, reporter assays, and population-based burden tests.
- Report performance separately on pre-CADD and post-CADD ClinVar subsets when possible.
- Treat performance on ClinVar as a sanity check, not the sole or primary measure of model quality.
These concerns foreshadow similar issues we will encounter later in the book when genomic foundation models are evaluated on benchmarks that themselves rely on older predictive tools.
4.7 Significance for Genomic Deep Learning
CADD sits at an important historical junction between hand-crafted feature integration and modern deep, self-supervised representation learning. Several aspects of its design resonate throughout the rest of this book:
Annotation integration as a precursor to multi-task deep models
CADD’s integration of dozens of annotations into a single score anticipates later deep learning models that predict many functional genomics readouts from sequence and then reuse these as building blocks (Chapters 5–7 and 11). In CADD v1.7, the boundary blurs: deep networks (ESM-1v, regulatory CNNs) now provide features that CADD integrates (Meier et al. 2021; Zhou and Troyanskaya 2015; Schubach et al. 2024).
Evolutionary proxy labels as a template for self-supervision
Training on simulated vs observed variants uses the signature of selection as a rich, weak supervisory signal (Rentzsch et al. 2019). This is conceptually similar to masked language modeling and other self-supervised tasks that exploit abundant unlabeled data (Chapters 8–10).
Genome-wide coverage and scalability
By scoring all possible SNVs in the reference genome, CADD demonstrated the feasibility and utility of precomputing genome-wide variant scores for use in downstream analyses. Many genomic foundation models now do something analogous: precompute embeddings or predictions for every base or variant and expose them as reusable resources.
Composable with deep learning
CADD is not a direct competitor to modern sequence-based deep models; instead, it increasingly incorporates them as features while providing a stable, interpretable interface (PHRED-like scores) to end users. This “deep features + shallow integrator” pattern appears repeatedly in practical deployments.
As we move into the CNN-based sequence-to-function models of Part II and the Transformer-based genomic foundation models of Parts III and IV, it is helpful to remember that CADD solved a difficult problem—variant prioritization under data scarcity and heterogeneity—using tools that were available at the time. The models that follow expand on CADD’s core ideas by learning representations directly from sequence and by tying those representations to richer experimental readouts, but they still rely on many of the same data resources and confront many of the same challenges around evaluation, bias, and circularity.