7  Splicing Prediction

7.1 The Splicing Challenge

While DeepSEA and ExPecto (Chapters 5–6) addressed chromatin state and gene expression, a distinct class of functional variants operates through a different mechanism: disruption of pre-mRNA splicing. The spliceosome—the cellular machinery that removes introns and joins exons—achieves remarkable precision, recognizing the correct splice sites among millions of potential candidates in the human transcriptome. Yet the sequence determinants underlying this specificity remained incompletely understood, limiting interpretation of variants that might alter splicing.

SpliceAI (Jaganathan et al. 2019) demonstrated that deep neural networks could learn the sequence rules governing splicing with near-spliceosomal precision. The model predicts splice site locations directly from pre-mRNA sequence, enabling identification of “cryptic splice” variants: mutations that create novel splice sites or disrupt existing ones in ways that evade traditional annotation-based detection.

The clinical implications are substantial: Jaganathan et al. estimate that 9–11% of pathogenic mutations in rare genetic disorders act through cryptic splicing, representing a previously underappreciated class of disease variation.

7.2 Prior Approaches and Limitations

Before SpliceAI, splice site prediction relied on methods with limited context:

  • MaxEntScan: Models core splice motifs using maximum entropy, limited to short windows around the splice site (9 nt for donors, 23 nt for acceptors)
  • GeneSplicer: Combines Markov models with decision trees
  • NNSplice: Early neural network approach with narrow receptive fields

These methods captured the essential GT (donor) and AG (acceptor) dinucleotides and surrounding consensus sequences, but could not model the long-range determinants—exon/intron length constraints, branch points, enhancers, and silencers—that contribute to splicing specificity. As a result, they produced many false positive predictions and missed variants acting through distal mechanisms.
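To make the false-positive problem concrete, the sketch below counts every GT and AG dinucleotide in a synthetic 10 kb sequence. The sequence is random and the helper name `candidate_sites` is hypothetical; the point is only that the core dinucleotides alone occur at roughly one position in sixteen, far more often than real splice sites.

```python
import random

def candidate_sites(seq, dinucleotide):
    """Return 0-based start positions of every occurrence of a dinucleotide."""
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinucleotide]

# Synthetic sequence for illustration only (uniform random bases, not real data).
random.seed(0)
toy_sequence = "".join(random.choice("ACGT") for _ in range(10_000))

donor_like = candidate_sites(toy_sequence, "GT")
acceptor_like = candidate_sites(toy_sequence, "AG")
print(len(donor_like), len(acceptor_like))  # each is on the order of 10,000 / 16
```

A motif-only scanner must then separate the handful of genuine sites from hundreds of such candidates using only the few flanking bases it can see.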

7.3 The SpliceAI Architecture

SpliceAI employs an ultra-deep residual convolutional network that integrates information across 10,000 nucleotides of sequence context—orders of magnitude more than prior methods.

7.3.1 Residual Block Design

The architecture’s fundamental unit is the residual block, comprising batch normalization, ReLU activation, and dilated convolutions. Residual connections address the vanishing gradient problem that had limited earlier deep networks:

\[ \text{output} = \text{input} + F(\text{input}) \]

where \(F\) represents the transformation learned by the convolutional layers. Skip connections from every fourth residual block feed directly to the penultimate layer, accelerating training convergence.
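The residual update can be sketched in a few lines of NumPy. This is a simplified illustration, not the released implementation: the batch normalization here has no learned scale or shift, the ordering (normalization, ReLU, convolution, applied twice) is an assumption, and the function names are hypothetical.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Per-channel normalization over the length axis (no learned parameters)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def dilated_conv1d(x, kernel, dilation):
    """'Same'-padded dilated convolution.
    x: (length, channels_in); kernel: (W, channels_in, channels_out), W odd."""
    w = kernel.shape[0]
    pad = (w - 1) * dilation // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((x.shape[0], kernel.shape[2]))
    for k in range(w):
        # Tap k reads the input shifted by k * dilation positions.
        out += xp[k * dilation : k * dilation + x.shape[0]] @ kernel[k]
    return out

def residual_block(x, k1, k2, dilation):
    """output = input + F(input), with F made of two dilated convolutions."""
    h = dilated_conv1d(np.maximum(batch_norm(x), 0), k1, dilation)
    h = dilated_conv1d(np.maximum(batch_norm(h), 0), k2, dilation)
    return x + h

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 4))
k1 = rng.normal(size=(11, 4, 4)) * 0.1
k2 = rng.normal(size=(11, 4, 4)) * 0.1
y = residual_block(x, k1, k2, dilation=4)
print(y.shape)  # same shape as the input
```

Because the block adds its transformation to the unchanged input, a block whose weights are near zero behaves like the identity, which is what keeps gradients flowing through very deep stacks.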

7.3.2 Dilated Convolutions for Long-Range Context

Each residual block uses dilated (atrous) convolutions parameterized by:

  • \(N\): Number of convolutional kernels
  • \(W\): Window size
  • \(D\): Dilation rate

A kernel with window size \(W\) and dilation rate \(D\) spans \((W-1) \cdot D\) neighboring positions. The total receptive field \(S\) of the network is:

\[ S = \sum_{i=1}^{K} 2 \cdot (W_i - 1) \cdot D_i \]

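where \(K\) is the number of residual blocks; the factor of 2 reflects the two convolutional layers in each block. By progressively increasing dilation rates through the network, SpliceAI reaches a 10,000 bp receptive field while still producing a prediction at every nucleotide, without the very wide kernels or pooling layers that would otherwise be needed to cover that span.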
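The receptive field formula is easy to check numerically. The sketch below assumes a configuration of four groups of four residual blocks with \((W, D)\) of (11, 1), (11, 4), (21, 10), and (41, 25); these hyperparameters are not stated in this chapter and should be verified against the original paper. The function name `receptive_span` is hypothetical.

```python
def receptive_span(blocks):
    """Sum 2 * (W - 1) * D over residual blocks given as (W, D) pairs."""
    return sum(2 * (w - 1) * d for w, d in blocks)

# Assumed block configuration (window size, dilation rate), four blocks per group.
assumed_10k_blocks = (
    [(11, 1)] * 4 + [(11, 4)] * 4 + [(21, 10)] * 4 + [(41, 25)] * 4
)

print(receptive_span(assumed_10k_blocks))       # 10000
print(receptive_span(assumed_10k_blocks[:12]))  # 2000
print(receptive_span(assumed_10k_blocks[:8]))   # 400
print(receptive_span(assumed_10k_blocks[:4]))   # 80
```

Under this assumption the last group of blocks, with the widest kernels and largest dilation, contributes 8,000 of the 10,000 positions.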

7.3.3 Architecture Variants

Four architectures were developed with different context windows:

| Model | Flanking Sequence | Total Context | Residual Blocks |
|---|---|---|---|
| SpliceAI-80nt | 40 bp each side | 80 bp | 4 |
| SpliceAI-400nt | 200 bp each side | 400 bp | 8 |
| SpliceAI-2k | 1,000 bp each side | 2,000 bp | 12 |
| SpliceAI-10k | 5,000 bp each side | 10,000 bp | 16 |

The SpliceAI-10k model (16 residual blocks, 32 convolutional layers) substantially outperformed shorter-context variants, demonstrating that long-range sequence features contribute meaningfully to splice site prediction.

7.3.4 Output Format

For each nucleotide position, SpliceAI outputs three probabilities summing to one:

  1. Probability of being a splice acceptor (first nucleotide of an exon)
  2. Probability of being a splice donor (last nucleotide of an exon)
  3. Probability of being neither

The model operates in sequence-to-sequence mode: given an input of length \(S/2 + l + S/2\), it outputs predictions for the central \(l\) positions. This enables efficient batch processing where overlapping computations are shared.
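The input layout can be illustrated with plain strings. This sketch splits a sequence into blocks of \(l\) positions, each flanked by \(S/2\) positions of context and padded with `N` beyond the sequence ends. The names `make_windows`, `context`, and `block_len` are hypothetical, and a real pipeline would one-hot encode each window before passing it to the network.

```python
def make_windows(seq, context, block_len):
    """Return windows of length context + block_len whose central block_len
    positions tile seq from left to right. Assumes context is even."""
    half = context // 2
    remainder = (-len(seq)) % block_len  # extra padding so the last block is full
    padded = "N" * half + seq + "N" * (half + remainder)
    n_blocks = (len(seq) + remainder) // block_len
    return [
        padded[i * block_len : i * block_len + block_len + context]
        for i in range(n_blocks)
    ]

seq = "ACGTACGTAC"
windows = make_windows(seq, context=4, block_len=5)
print(windows)  # ['NNACGTACG', 'TACGTACNN']

# The central block_len positions of each window reassemble the original sequence.
centers = "".join(w[2:2 + 5] for w in windows)
print(centers == seq)  # True
```

Adjacent windows overlap in their flanks, which is why predicting many central positions per window is cheaper than scoring one position at a time.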

7.4 Training and Evaluation

7.4.1 Training Data

SpliceAI was trained on 20,287 protein-coding genes from GENCODE V24, selecting principal transcripts when multiple isoforms existed. Genes were assigned to training and test sets by chromosome:

  • Training: Chromosomes 2, 4, 6, 8, 10–22, X, Y (13,384 genes, 130,796 donor-acceptor pairs)
  • Testing: Chromosomes 1, 3, 5, 7, 9—excluding genes with paralogs on training chromosomes (1,652 genes, 14,289 donor-acceptor pairs)

The paralog exclusion prevents information leakage through sequence homology.
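A chromosome-based split with paralog filtering might look like the following sketch. The gene names and paralog map are made up, and `split_genes` is a hypothetical helper; only the chromosome lists come from the text above.

```python
TRAIN_CHROMS = {"2", "4", "6", "8", *map(str, range(10, 23)), "X", "Y"}
TEST_CHROMS = {"1", "3", "5", "7", "9"}

def split_genes(gene_to_chrom, paralogs):
    """gene_to_chrom: dict gene -> chromosome name.
    paralogs: dict gene -> set of paralogous gene names.
    Test genes with any paralog in the training set are dropped."""
    train = {g for g, c in gene_to_chrom.items() if c in TRAIN_CHROMS}
    test = {
        g for g, c in gene_to_chrom.items()
        if c in TEST_CHROMS and not (paralogs.get(g, set()) & train)
    }
    return train, test

# Toy annotations for illustration only.
gene_to_chrom = {"GENE_A": "2", "GENE_B": "1", "GENE_C": "3", "GENE_D": "12"}
paralogs = {"GENE_B": {"GENE_A"}}

train, test = split_genes(gene_to_chrom, paralogs)
print(sorted(train))  # ['GENE_A', 'GENE_D']
print(sorted(test))   # ['GENE_C']
```

Here `GENE_B` sits on a test chromosome but is excluded because its paralog `GENE_A` is in the training set.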

For variant effect prediction, training was augmented with novel splice junctions commonly observed in GTEx RNA-seq data (adding ~67,000 donor and ~63,000 acceptor annotations), improving sensitivity for detecting splice-altering variants, particularly in deep intronic regions.

7.4.2 Splice Site Prediction Performance

SpliceAI-10k achieved:

  • Top-k accuracy: 95% (fraction of predicted sites that are correct when the score threshold is set so that the number of predicted sites equals the number of annotated sites)
  • PR-AUC: 0.98

For comparison, MaxEntScan achieved only 57% top-k accuracy under equivalent conditions. The dramatic improvement reflects SpliceAI’s ability to reject false positive splice sites by considering sequence context beyond the core motif.

Notably, performance improved substantially with context length (80 bp → 400 bp → 2,000 bp → 10,000 bp), confirming that distal sequence features contribute to splice site recognition.
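Top-k accuracy as described above can be computed as in this sketch, where \(k\) is the number of annotated sites and the \(k\) highest-scoring positions are taken as predictions. The function name and the toy scores are illustrative.

```python
def top_k_accuracy(scores, labels):
    """scores: predicted probability per position.
    labels: 1 for an annotated splice site, 0 otherwise."""
    k = sum(labels)
    if k == 0:
        raise ValueError("labels contain no annotated splice sites")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in ranked[:k]) / k

labels = [1, 0, 0, 0, 1]
print(top_k_accuracy([0.9, 0.1, 0.8, 0.3, 0.7], labels))  # 0.5
print(top_k_accuracy([0.9, 0.1, 0.2, 0.3, 0.7], labels))  # 1.0
```

In the first call a false positive at index 2 outranks the true site at index 4, so only one of the two predictions is correct.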

7.5 Variant Effect Prediction

7.5.1 The Delta Score

SpliceAI predicts variant effects by comparing splice site predictions for reference and alternative sequences:

\[ \Delta\text{score} = \max_{|p - v| \leq 50} \left| P_{\text{alt}}(p) - P_{\text{ref}}(p) \right| \]

where \(v\) is the variant position and \(p\) ranges over positions within 50 bp of the variant. The maximum change across all positions captures variants that strengthen existing sites, weaken existing sites, or create entirely new splice sites.
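The Δ score formula translates directly into code. The sketch below assumes a single-nucleotide substitution, so the reference and alternative probability tracks have the same length, and it works on one probability track at a time; the toy arrays stand in for model output.

```python
def delta_score(p_ref, p_alt, variant_pos, window=50):
    """Largest absolute change in predicted probability within
    `window` positions of the variant (0-based coordinates)."""
    lo = max(0, variant_pos - window)
    hi = min(len(p_ref), variant_pos + window + 1)
    return max(abs(p_alt[p] - p_ref[p]) for p in range(lo, hi))

# Toy probability tracks (not model output).
p_ref = [0.0] * 201
p_ref[120] = 0.9            # annotated site 20 bp from the variant
p_alt = list(p_ref)
p_alt[120] = 0.2            # the variant weakens that site
p_alt[170] = 0.95           # a gained site 70 bp away, outside the window

print(round(delta_score(p_ref, p_alt, variant_pos=100), 2))              # 0.7
print(round(delta_score(p_ref, p_alt, variant_pos=100, window=100), 2))  # 0.95
```

The second call shows how the window size determines which changes count: widening it to 100 bp picks up the gained site that the default window ignores.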

Critically, the model was trained only on reference transcript sequences and splice junction annotations—it never saw variant data during training. Variant effect prediction is thus a challenging test of whether the network learned genuine sequence determinants of splicing.

7.5.2 Cryptic Splice Variant Classes

SpliceAI detects several classes of splice-altering variants:

  • Donor/acceptor loss: Disruption of annotated splice sites
  • Donor/acceptor gain: Creation of novel splice sites
  • Exon skipping: Variants causing an exon to be spliced out
  • Intron retention: Variants causing an intron to remain in mature mRNA
  • Cryptic exon activation: Deep intronic variants creating novel exons

Traditional annotation-based methods can identify variants in the essential GT/AG dinucleotides but miss the broader landscape of cryptic splice variants operating through more subtle mechanisms.

7.6 Validation on GTEx RNA-seq

The authors validated SpliceAI predictions using RNA-seq data from 149 GTEx individuals with matched whole-genome sequencing. Private variants (present in only one individual) predicted to alter splicing were tested for association with aberrant splice junctions.

7.6.1 Validation Rates

At a Δ score threshold of ≥0.5, cryptic splice variants validated at three-quarters the rate of essential GT/AG splice disruptions:

| Variant Class | Validation Rate |
|---|---|
| Essential GT/AG disruption | ~100% (by definition) |
| Cryptic splice (Δ ≥ 0.8) | ~85% |
| Cryptic splice (Δ ≥ 0.5) | ~75% |
| Cryptic splice (Δ ≥ 0.2) | ~50% |

Validation rate and effect size both tracked closely with Δ score, confirming that the model’s confidence correlates with functional impact.

7.6.2 Position-Dependent Sensitivity

Sensitivity varied by genomic location:

  • Near exons (≤50 bp from exon-intron boundaries): 71% sensitivity at Δ ≥ 0.5
  • Deep intronic (>50 bp from boundaries): 41% sensitivity at Δ ≥ 0.5

Deep intronic variants are harder to detect because sequence far from exons contains fewer of the specificity determinants that selection has maintained near exon boundaries. Nevertheless, SpliceAI substantially outperformed prior methods in both regions.

7.6.3 Comparison to Prior Methods

Benchmarking against MaxEntScan, GeneSplicer, and NNSplice demonstrated SpliceAI’s superior performance across all operating points. At matched sensitivity, SpliceAI achieved higher validation rates; at matched validation rates, SpliceAI achieved higher sensitivity.

7.7 Population Genetics Evidence

Beyond RNA-seq validation, the authors assessed whether predicted cryptic splice variants show signatures of negative selection in human populations.

7.7.1 Allele Frequency Depletion

Using ExAC/gnomAD data, high-confidence cryptic splice variants (Δ ≥ 0.8) showed 78% depletion at common allele frequencies compared to expectation—comparable to the 82% depletion observed for frameshift, stop-gain, and essential splice-disrupting variants. This indicates that most confidently predicted cryptic splice variants are functional and deleterious.

The depletion was stronger for variants predicted to cause frameshifts versus in-frame alterations, consistent with the expectation that frameshift-causing splice variants have more severe fitness consequences.
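One common way to express depletion is to compare the observed number of common variants in a class with the number expected if the class had the same common-to-rare ratio as presumed-neutral variants. The sketch below uses that definition with made-up counts; it is an illustration of the arithmetic, and the authors' exact procedure may differ.

```python
def common_variant_depletion(class_common, class_rare, neutral_common, neutral_rare):
    """Fraction of expected common variants that are missing from a class,
    using the neutral common-to-rare ratio to set the expectation."""
    expected_common = class_rare * (neutral_common / neutral_rare)
    return 1.0 - class_common / expected_common

# Hypothetical counts chosen for easy arithmetic, not taken from any dataset.
depletion = common_variant_depletion(
    class_common=30, class_rare=1_000, neutral_common=1_000, neutral_rare=10_000
)
print(round(depletion, 2))  # 0.7
```

A value near 0 means the class looks neutral, while a value near 1 means almost all such variants are kept rare by selection.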

7.7.2 Rare Variant Burden

The average human genome carries approximately:

  • 11 rare protein-truncating variants (allele frequency <0.1%)
  • 5 rare functional cryptic splice variants

Cryptic splice variants outnumber essential GT/AG splice-disrupting variants roughly 2:1, highlighting the substantial mutational target space beyond canonical splice sites.

7.8 De Novo Mutations in Rare Disease

The central clinical finding of the SpliceAI study is that cryptic splice mutations constitute a major, previously underappreciated cause of rare genetic disorders.

7.8.1 Case-Control Analysis

The authors analyzed de novo mutations in:

  • 4,293 individuals with intellectual disability (Deciphering Developmental Disorders cohort)
  • 3,953 individuals with autism spectrum disorders (Simons Simplex Collection + Autism Sequencing Consortium)
  • 2,073 unaffected sibling controls

De novo mutations predicted to disrupt splicing (Δ ≥ 0.1) were significantly enriched in affected individuals:

| Cohort | Enrichment vs. Controls | p-value |
|---|---|---|
| Intellectual disability (DDD) | 1.51-fold | 4.2×10⁻⁴ |
| Autism spectrum disorder | 1.30-fold | 0.020 |

The enrichment remained significant when restricting to synonymous and intronic mutations, excluding the possibility that results were driven solely by variants with dual protein-coding and splicing effects.
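Fold enrichment and the excess of mutations in cases can be computed from per-cohort counts, as in this sketch. The counts and cohort sizes below are hypothetical, and the comparison is a plain per-individual rate ratio; the published analysis may normalize differently.

```python
def fold_enrichment(case_hits, n_cases, control_hits, n_controls):
    """Ratio of per-individual mutation rates in cases versus controls."""
    return (case_hits / n_cases) / (control_hits / n_controls)

def excess_hits(case_hits, n_cases, control_hits, n_controls):
    """Mutations in cases beyond what the control rate would predict."""
    expected = n_cases * control_hits / n_controls
    return case_hits - expected

# Hypothetical counts for illustration only.
counts = dict(case_hits=300, n_cases=4_000, control_hits=100, n_controls=2_000)
print(round(fold_enrichment(**counts), 2))  # 1.5
print(round(excess_hits(**counts), 1))      # 100.0
```

The excess is the quantity that matters for attributing cases to a mutation class: it estimates how many of the observed mutations are plausibly pathogenic rather than background.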

7.8.2 Fraction of Pathogenic Mutations

Based on the excess of de novo mutations in cases versus controls:

  • 9% of pathogenic de novo mutations in intellectual disability act through cryptic splicing
  • 11% of pathogenic de novo mutations in autism act through cryptic splicing

In absolute terms, ~250 cases across the cohorts could be explained by de novo cryptic splice mutations, compared to ~909 cases explained by de novo protein-truncating variants.

7.8.3 Clinical Penetrance

Cryptic splice mutations showed roughly 50% of the clinical penetrance of classic protein-truncating mutations (stop-gain, frameshift, essential splice). This reduced penetrance reflects that many cryptic splice variants are hypomorphic—producing a mixture of normal and aberrant transcripts rather than complete loss of function.

Well-characterized examples from Mendelian disease support this interpretation: the c.315-48T>C variant in FECH and c.-32-13T>G in GAA are both hypomorphic cryptic splice alleles associated with milder phenotype or later age of onset.

7.8.4 Novel Gene Discovery

Including cryptic splice mutations in gene discovery analyses identified:

  • 5 additional candidate genes for intellectual disability
  • 2 additional candidate genes for autism

These genes would have fallen below the discovery threshold (FDR <0.01) when considering only protein-coding mutations.

7.9 Experimental Validation in Autism Patients

To directly validate predicted cryptic splice effects, the authors performed deep RNA-seq (~350 million reads per sample, ~10× GTEx coverage) on lymphoblastoid cell lines from 36 autism probands harboring predicted de novo cryptic splice mutations.

7.9.1 Validation Results

Among 28 cases with adequate RNA-seq coverage at the gene of interest:

  • 21 (75%) showed unique aberrant splicing events associated with the predicted de novo variant
  • 7 (25%) showed no aberrant splicing in lymphoblastoid cells

The 75% validation rate is remarkable given that the relevant tissue (developing brain) was not accessible—some cryptic splice effects may be tissue-specific and not observable in blood-derived cells.

7.9.2 Aberrant Splicing Classes

Among the 21 validated cases:

| Splicing Aberration | Count |
|---|---|
| Novel junction creation | 9 |
| Exon skipping | 8 |
| Intron retention | 4 |

These aberrant events were absent from all other samples (the remaining 35 probands and 149 GTEx individuals), confirming their association with the predicted de novo variants.

7.10 What SpliceAI Learns

Analysis of SpliceAI’s learned representations revealed that the network captures known splicing biology:

7.10.1 Core Splice Motifs

The model correctly learned the essential GT donor and AG acceptor dinucleotides, plus surrounding consensus sequences. In silico mutagenesis of these positions produced the largest predicted effects.
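In silico mutagenesis scores every possible single-base substitution and records the change relative to the reference. The sketch below shows the loop with a deliberately trivial stand-in scorer; `toy_donor_score` is hypothetical and is not SpliceAI, which would be substituted as `score_fn` in practice.

```python
def in_silico_mutagenesis(seq, score_fn):
    """Return {(position, alt_base): score(mutant) - score(reference)}
    for every single-nucleotide substitution in seq."""
    ref_score = score_fn(seq)
    effects = {}
    for pos, ref_base in enumerate(seq):
        for alt in "ACGT":
            if alt != ref_base:
                mutant = seq[:pos] + alt + seq[pos + 1:]
                effects[(pos, alt)] = score_fn(mutant) - ref_score
    return effects

def toy_donor_score(seq):
    """Stand-in scorer: 1.0 if GT occupies positions 3-4, else 0.0."""
    return 1.0 if seq[3:5] == "GT" else 0.0

effects = in_silico_mutagenesis("CAGGTAAGT", toy_donor_score)
disruptive = sorted({pos for (pos, _), delta in effects.items() if delta < 0})
print(len(effects))  # 27 substitutions (9 positions x 3 alternatives)
print(disruptive)    # [3, 4]
```

With a real model the effects are graded rather than all-or-nothing, and plotting them by position recovers the consensus motif around each site.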

7.10.2 Branch Point Recognition

Introducing the optimal branch point sequence (TACTAAC) at varying distances from splice acceptors showed that SpliceAI learned the expected distance constraints (20–45 bp upstream of acceptors). At distances <20 bp, the branch point disrupts the polypyrimidine tract, and SpliceAI correctly predicted reduced acceptor strength.

7.10.3 Exonic Splicing Enhancers

The SR-protein binding motif GAAGAA, introduced at various positions, enhanced splice site strength when placed in expected locations within exons, demonstrating that SpliceAI learned the contribution of exonic splicing enhancers.
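The motif-insertion experiments in this section and the previous one follow the same pattern: write a motif into the sequence at a series of positions and record how the predicted site strength changes. The sketch below shows that scan. The scorer is a toy that has the 20–45 bp branch point rule built in by construction, so it only exercises the scanning logic; the acceptor position of 80 and all function names are hypothetical.

```python
def motif_insertion_scan(seq, motif, positions, score_fn):
    """Overwrite len(motif) bases at each position with the motif and
    return {position: score(mutant) - score(reference)}."""
    ref_score = score_fn(seq)
    changes = {}
    for pos in positions:
        mutant = seq[:pos] + motif + seq[pos + len(motif):]
        changes[pos] = score_fn(mutant) - ref_score
    return changes

def toy_acceptor_score(seq, acceptor_pos=80):
    """Stand-in scorer: 1.0 if TACTAAC starts 20-45 bp upstream of the
    acceptor, else 0.0. Not a trained model."""
    start = seq.find("TACTAAC")
    while start != -1:
        if 20 <= acceptor_pos - start <= 45:
            return 1.0
        start = seq.find("TACTAAC", start + 1)
    return 0.0

background = "G" * 100  # featureless toy sequence
changes = motif_insertion_scan(background, "TACTAAC", range(20, 75, 5), toy_acceptor_score)
print(sorted(pos for pos, delta in changes.items() if delta > 0))
# [35, 40, 45, 50, 55, 60]
```

Replacing the toy scorer with a trained model turns the same scan into a test of whether the model has learned a positional constraint rather than having it supplied.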

7.10.4 Nucleosome Positioning

Novel exon-creation events (where variants activate cryptic exons in introns) were significantly associated with existing nucleosome positioning, supporting a causal role for nucleosome occupancy in exon definition. SpliceAI implicitly captures this relationship despite not being trained on chromatin data.

7.11 Limitations and Considerations

7.11.1 Tissue Specificity

SpliceAI predicts splice sites based on sequence alone, without modeling tissue-specific alternative splicing. The same variant may have different effects across tissues depending on the expression of splicing factors and regulatory RNAs.

7.11.2 Incomplete Penetrance

Many cryptic splice variants produce partial shifts in splicing (alternative splicing) rather than complete disruption. The Δ score correlates with penetrance, but precise quantification of isoform ratios requires experimental validation.

7.11.3 Deep Intronic Predictions

While SpliceAI substantially improves deep intronic variant prediction over prior methods, sensitivity remains lower than for variants near exons. The 41% sensitivity (Δ ≥ 0.5) in deep intronic regions suggests that additional sequence features beyond the 10 kb context may contribute to splicing.

7.11.4 Training on Canonical Transcripts

Training on principal transcripts may not fully capture the diversity of alternative splicing. Augmentation with RNA-seq-derived junctions improved performance, suggesting that expanded training data could further enhance predictions.

7.12 Significance for the Field

SpliceAI established several important contributions:

  1. Clinical impact quantification: The estimate that 9–11% of pathogenic mutations act through cryptic splicing fundamentally changed understanding of the noncoding disease mutation landscape

  2. Deep context matters: The 32-layer, 10 kb context architecture demonstrated that splicing involves long-range sequence integration, motivating similar approaches in other genomic prediction tasks

  3. Genome-wide variant scoring: Precomputed Δ scores for all possible single nucleotide substitutions (available at https://github.com/Illumina/SpliceAI) enable routine clinical annotation

  4. Validation standards: The combination of RNA-seq validation, population genetics evidence, and case-control analysis established a rigorous framework for evaluating variant effect predictors

  5. Specialized versus general models: SpliceAI’s success demonstrated that task-specific deep learning models could outperform general-purpose approaches by focusing computational capacity on a well-defined prediction problem

SpliceAI has become a standard component of clinical variant interpretation pipelines, complementing protein-effect predictors and regulatory variant scores. The approach has influenced subsequent work on tissue-specific splicing prediction and integration of splicing effects into comprehensive variant effect models like Borzoi (Chapter 11).

The model’s code and precomputed scores are publicly available (https://github.com/Illumina/SpliceAI), enabling widespread adoption in both research and clinical settings.
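For readers working with annotated VCF files, the sketch below parses a SpliceAI INFO value into named fields. It assumes the pipe-delimited layout `ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL` described in the repository README, which should be checked against the version in use. The gene symbol in the example is a placeholder, and missing values (such as `.`) are not handled.

```python
FIELDS = ["ALLELE", "SYMBOL", "DS_AG", "DS_AL", "DS_DG", "DS_DL",
          "DP_AG", "DP_AL", "DP_DG", "DP_DL"]
SCORE_KEYS = FIELDS[2:6]      # delta scores: acceptor/donor gain and loss
POSITION_KEYS = FIELDS[6:]    # offsets of the affected sites from the variant

def parse_spliceai(info_value):
    """Parse the value of a SpliceAI INFO key (one entry per comma)."""
    records = []
    for entry in info_value.split(","):
        record = dict(zip(FIELDS, entry.split("|")))
        for key in SCORE_KEYS:
            record[key] = float(record[key])
        for key in POSITION_KEYS:
            record[key] = int(record[key])
        # A single summary score: the largest of the four delta scores.
        record["max_delta"] = max(record[key] for key in SCORE_KEYS)
        records.append(record)
    return records

example = "T|GENE1|0.00|0.00|0.91|0.08|-28|-46|-2|-31"
record = parse_spliceai(example)[0]
print(record["SYMBOL"], record["max_delta"], record["DP_DG"])  # GENE1 0.91 -2
```

Taking the maximum of the four scores is a common way to apply a single threshold such as Δ ≥ 0.5, while the individual fields indicate whether a site is gained or lost.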