7 Splicing Prediction
7.1 The Splicing Challenge
While DeepSEA and ExPecto (Chapters 5–6) addressed chromatin state and gene expression, a distinct class of functional variants operates through a different mechanism: disruption of pre-mRNA splicing. The spliceosome—the cellular machinery that removes introns and joins exons—achieves remarkable precision, recognizing the correct splice sites among millions of potential candidates in the human transcriptome. Yet the sequence determinants underlying this specificity remained incompletely understood, limiting interpretation of variants that might alter splicing.
SpliceAI, introduced by Jaganathan et al. in 2019, demonstrated that deep neural networks could learn the sequence rules governing splicing with near-spliceosomal precision (Jaganathan et al. 2019). The model predicts splice site locations directly from pre-mRNA sequence, enabling identification of “cryptic splice” variants—mutations that create novel splice sites or disrupt existing ones in ways that evade traditional annotation-based detection.
The clinical implications are substantial: the authors estimate that 9–11% of pathogenic de novo mutations in intellectual disability and autism cohorts act through cryptic splicing, a previously underappreciated class of disease variation.
7.2 Prior Approaches and Limitations
Before SpliceAI, splice site prediction relied on methods with limited context:
- MaxEntScan: Models core splice motifs using maximum entropy, limited to short windows around the splice site (9 bp for donors, 23 bp for acceptors)
- GeneSplicer: Combines Markov models with decision trees
- NNSplice: Early neural network approach with narrow receptive fields
These methods captured the essential GT (donor) and AG (acceptor) dinucleotides and surrounding consensus sequences, but could not model the long-range determinants—exon/intron length constraints, branch points, enhancers, and silencers—that contribute to splicing specificity. As a result, they produced many false positive predictions and missed variants acting through distal mechanisms.
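To illustrate why motif-only scanning over-predicts, the minimal sketch below lists every GT and AG dinucleotide in a short sequence. The function name and the toy sequence are made up for illustration and do not come from any of the tools above.

```python
def candidate_sites(seq: str) -> dict:
    """Return 0-based start positions of every GT and AG dinucleotide in seq."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return {"GT": donors, "AG": acceptors}


# Toy sequence, invented for illustration only.
toy = "CAGGTAAGTCTTAGGTGCCAGGTATG"
hits = candidate_sites(toy)
print(len(hits["GT"]), "GT and", len(hits["AG"]), "AG candidates in", len(toy), "bp")
```

With uniform base composition, any given dinucleotide is expected roughly once every 16 positions, so a gene spanning tens of kilobases contains thousands of GT and AG candidates, almost none of which are real splice sites.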
7.3 The SpliceAI Architecture
SpliceAI employs an ultra-deep residual convolutional network that integrates information across 10,000 nucleotides of sequence context—orders of magnitude more than prior methods.
7.3.1 Residual Block Design
The architecture’s fundamental unit is the residual block, comprising batch normalization, ReLU activation, and dilated convolutions. Residual connections address the vanishing gradient problem that had limited earlier deep networks:
\[ \text{output} = \text{input} + F(\text{input}) \]
where \(F\) represents the transformation learned by the convolutional layers. Skip connections from every fourth residual block feed directly to the penultimate layer, accelerating training convergence.
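The residual update can be sketched in a few lines of plain Python. This is a single-channel simplification under stated assumptions: batch normalization is omitted, kernels have odd length, and all function names are hypothetical rather than taken from the SpliceAI code.

```python
from typing import List

Vector = List[float]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def dilated_conv1d(x: Vector, kernel: Vector, dilation: int) -> Vector:
    """Same-length 1D convolution with zero padding; kernel length must be odd."""
    half = (len(kernel) - 1) // 2
    out = []
    for i in range(len(x)):
        total = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - half) * dilation  # taps are spaced `dilation` apart
            if 0 <= j < len(x):
                total += w * x[j]
        out.append(total)
    return out


def residual_block(x: Vector, kernel1: Vector, kernel2: Vector, dilation: int) -> Vector:
    """output = input + F(input), where F is two (ReLU -> dilated conv) steps."""
    h = dilated_conv1d(relu(x), kernel1, dilation)
    h = dilated_conv1d(relu(h), kernel2, dilation)
    return [a + b for a, b in zip(x, h)]
```

If both kernels are all zeros, \(F\) vanishes and the block returns its input unchanged. Because each block can fall back to the identity in this way, stacking many of them does not make the network harder to optimize.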
7.3.2 Dilated Convolutions for Long-Range Context
Each residual block uses dilated (atrous) convolutions parameterized by:
- \(N\): Number of convolutional kernels
- \(W\): Window size
- \(D\): Dilation rate
A kernel with window size \(W\) and dilation rate \(D\) spans \((W-1) \cdot D\) neighboring positions. The total receptive field \(S\) of the network is:
\[ S = \sum_{i=1}^{K} 2 \cdot (W_i - 1) \cdot D_i \]
where \(K\) is the number of residual blocks; the factor of 2 reflects the two convolutional layers in each block. By progressively increasing dilation rates through the network, SpliceAI reaches a 10,000 bp receptive field without the very wide kernels or far greater depth that undilated convolutions would need.
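The receptive-field formula can be checked numerically. The stage configuration below (window size, dilation rate, and four residual blocks per stage) is an assumption based on my reading of the published architecture and is not stated in this chapter, so verify it against the original supplement before relying on it.

```python
# (W, D, number of residual blocks) per stage. Assumed values, see lead-in.
STAGES = [(11, 1, 4), (11, 4, 4), (21, 10, 4), (41, 25, 4)]


def receptive_span(stages):
    """S = sum over residual blocks of 2 * (W - 1) * D."""
    return sum(2 * (w - 1) * d * n for w, d, n in stages)


# Truncating the stack after 1, 2, 3, or 4 stages gives the four model sizes.
for depth in range(1, len(STAGES) + 1):
    print(depth, "stage(s):", receptive_span(STAGES[:depth]), "nt")
```

Under these assumptions the four truncations give spans of 80, 400, 2,000, and 10,000 nt, matching the four context sizes named in this chapter.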
7.3.3 Architecture Variants
Four architectures were developed with different context windows:
| Model | Flanking Sequence | Total Context | Residual Blocks |
|---|---|---|---|
| SpliceAI-80nt | 40 bp each side | 80 bp | 4 |
| SpliceAI-400nt | 200 bp each side | 400 bp | 8 |
| SpliceAI-2k | 1,000 bp each side | 2,000 bp | 12 |
| SpliceAI-10k | 5,000 bp each side | 10,000 bp | 16 |
The 32-layer SpliceAI-10k model (16 residual blocks, each containing two convolutional layers) substantially outperformed shorter-context variants, demonstrating that long-range sequence features contribute meaningfully to splice site prediction.
7.3.4 Output Format
For each nucleotide position, SpliceAI outputs three probabilities summing to one:
- Probability of being a splice acceptor (first nucleotide of an exon)
- Probability of being a splice donor (last nucleotide of an exon)
- Probability of being neither
The model operates in sequence-to-sequence mode: given an input of length \(S/2 + l + S/2\), it outputs predictions for the central \(l\) positions. This enables efficient batch processing where overlapping computations are shared.
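The input/output length bookkeeping can be sketched as follows. Padding with `N` and encoding `N` as an all-zero vector are assumptions about how flanks are typically handled, and the helper names are hypothetical.

```python
def pad_for_context(seq: str, context: int) -> str:
    """Flank seq with context // 2 'N' characters on each side."""
    flank = "N" * (context // 2)
    return flank + seq + flank


def one_hot(seq: str):
    """Encode A, C, G, T as unit vectors; any other symbol (e.g. N) as zeros."""
    table = {
        "A": [1, 0, 0, 0],
        "C": [0, 1, 0, 0],
        "G": [0, 0, 1, 0],
        "T": [0, 0, 0, 1],
    }
    return [table.get(base, [0, 0, 0, 0]) for base in seq.upper()]


def output_length(input_length: int, context: int) -> int:
    """Predictions are produced only for the central l = input - S positions."""
    return input_length - context


padded = pad_for_context("ACGTACGT", context=10_000)
print(len(padded), "input positions ->", output_length(len(padded), 10_000), "predictions")
```

Each output position sees \(S/2\) nucleotides on either side. Longer inputs amortize the flank cost, because neighboring predictions reuse most of the same intermediate activations.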
7.4 Training and Evaluation
7.4.1 Training Data
SpliceAI was trained on 20,287 protein-coding genes from GENCODE V24, selecting principal transcripts when multiple isoforms existed. The training/test split was made by chromosome, holding out five chromosomes for testing:
- Training: Chromosomes 2, 4, 6, 8, 10–22, X, Y (13,384 genes, 130,796 donor-acceptor pairs)
- Testing: Chromosomes 1, 3, 5, 7, 9—excluding genes with paralogs on training chromosomes (1,652 genes, 14,289 donor-acceptor pairs)
The paralog exclusion prevents information leakage through sequence homology.
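A chromosome-level split with paralog filtering might look like the sketch below. The gene records and the set of genes with training-chromosome paralogs are hypothetical inputs; how paralogs are actually identified is outside the scope of this example.

```python
TEST_CHROMS = {"1", "3", "5", "7", "9"}


def split_genes(genes, has_training_paralog):
    """Split (gene_id, chrom) pairs into train and test gene ID lists.

    Genes on test chromosomes that have a paralog on a training chromosome
    are dropped from both sets to avoid leakage through homology.
    """
    train, test = [], []
    for gene_id, chrom in genes:
        if chrom not in TEST_CHROMS:
            train.append(gene_id)
        elif gene_id not in has_training_paralog:
            test.append(gene_id)
    return train, test


# Toy records, invented for illustration.
genes = [("geneA", "2"), ("geneB", "1"), ("geneC", "1"), ("geneD", "X")]
train, test = split_genes(genes, has_training_paralog={"geneC"})
print(train, test)
```

Splitting by chromosome rather than by random gene keeps overlapping and neighboring genes on the same side of the split, since their flanking context would otherwise be shared between training and test examples.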
For variant effect prediction, training was augmented with novel splice junctions commonly observed in GTEx RNA-seq data (adding ~67,000 donor and ~63,000 acceptor annotations), improving sensitivity for detecting splice-altering variants, particularly in deep intronic regions.
7.4.2 Splice Site Prediction Performance
SpliceAI-10k achieved:
- Top-k accuracy: 95% (fraction of true sites recovered when the score threshold is set so that the number of predicted sites equals the number of actual sites)
- PR-AUC: 0.98
For comparison, MaxEntScan achieved only 57% top-k accuracy under equivalent conditions. The dramatic improvement reflects SpliceAI’s ability to reject false positive splice sites by considering sequence context beyond the core motif.
Notably, performance improved substantially with context length (80 bp → 400 bp → 2,000 bp → 10,000 bp), confirming that distal sequence features contribute to splice site recognition.
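Top-k accuracy as used here can be computed with a short function. This is a sketch of the metric as described in this section, applied to one site type at a time, and not the authors' evaluation code.

```python
def top_k_accuracy(scores, labels):
    """Fraction of the k top-scoring positions that are true sites,
    where k is the number of true sites in `labels` (1 = site, 0 = not)."""
    k = sum(labels)
    if k == 0:
        raise ValueError("labels contain no true sites")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in ranked[:k]) / k


# Toy example: two true sites, one of which is ranked in the top two.
scores = [0.9, 0.1, 0.8, 0.7, 0.05]
labels = [1, 0, 0, 1, 0]
print(top_k_accuracy(scores, labels))
```

Because k is fixed to the number of true sites, precision and recall coincide at this threshold. The metric is therefore informative when true sites are a tiny fraction of all positions, as they are for splice sites.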
7.5 Variant Effect Prediction
7.5.1 The Delta Score
SpliceAI predicts variant effects by comparing splice site predictions for reference and alternative sequences:
\[ \Delta\text{score} = \max_{|p - v| \leq 50} \left| P_{\text{alt}}(p) - P_{\text{ref}}(p) \right| \]
where \(v\) is the variant position, \(p\) ranges over positions within 50 bp of the variant, and \(P\) denotes the predicted acceptor or donor probability, with the maximum also taken over both site types. The maximum change across all positions captures variants that strengthen existing sites, weaken existing sites, or create entirely new splice sites.
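The Δ score formula translates directly into code. The sketch assumes a single-nucleotide substitution, so reference and alternate predictions align position by position, and it takes per-position (acceptor, donor) probabilities as given. The function name is hypothetical.

```python
def delta_score(p_ref, p_alt, variant_index, window=50):
    """Largest absolute change in acceptor or donor probability
    at any position within `window` of the variant.

    p_ref, p_alt: lists of (acceptor_prob, donor_prob), index-aligned.
    """
    lo = max(0, variant_index - window)
    hi = min(len(p_ref), variant_index + window + 1)
    best = 0.0
    for i in range(lo, hi):
        for channel in (0, 1):  # 0 = acceptor, 1 = donor
            best = max(best, abs(p_alt[i][channel] - p_ref[i][channel]))
    return best


# Toy predictions: the variant at index 2 weakens a donor there
# and creates a weak acceptor two positions downstream.
ref = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.75), (0.0, 0.0), (0.0, 0.0)]
alt = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.25), (0.0, 0.0), (0.625, 0.0)]
print(delta_score(ref, alt, variant_index=2, window=2))
```

Tracking the sign and channel of the largest change, instead of only its magnitude, would distinguish gain from loss and acceptor from donor events.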
Critically, the model was trained only on reference transcript sequences and splice junction annotations—it never saw variant data during training. Variant effect prediction is thus a challenging test of whether the network learned genuine sequence determinants of splicing.
7.5.2 Cryptic Splice Variant Classes
SpliceAI detects several classes of splice-altering variants:
- Donor/acceptor loss: Disruption of annotated splice sites
- Donor/acceptor gain: Creation of novel splice sites
- Exon skipping: Variants causing an exon to be spliced out
- Intron retention: Variants causing an intron to remain in mature mRNA
- Cryptic exon activation: Deep intronic variants creating novel exons
Traditional annotation-based methods can identify variants in the essential GT/AG dinucleotides but miss the broader landscape of cryptic splice variants operating through more subtle mechanisms.
7.6 Validation on GTEx RNA-seq
The authors validated SpliceAI predictions using RNA-seq data from 149 GTEx individuals with matched whole-genome sequencing. Private variants (present in only one individual) predicted to alter splicing were tested for association with aberrant splice junctions.
7.6.1 Validation Rates
At a Δ score threshold of ≥0.5, cryptic splice variants validated at three-quarters the rate of essential GT/AG splice disruptions:
| Variant Class | Validation Rate |
|---|---|
| Essential GT/AG disruption | ~100% (by definition) |
| Cryptic splice (Δ ≥ 0.8) | ~85% |
| Cryptic splice (Δ ≥ 0.5) | ~75% |
| Cryptic splice (Δ ≥ 0.2) | ~50% |
Validation rate and effect size both tracked closely with Δ score, confirming that the model’s confidence correlates with functional impact.
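In practice the thresholds in the table are often used as bins. A small helper, with a hypothetical name and using only the three cutoffs listed above, returns the strictest threshold a variant passes:

```python
def strictest_threshold(delta, thresholds=(0.8, 0.5, 0.2)):
    """Return the highest cutoff that `delta` meets, or None if it meets none.

    `thresholds` must be sorted from strictest to most lenient.
    """
    for cutoff in thresholds:
        if delta >= cutoff:
            return cutoff
    return None


for score in (0.91, 0.5, 0.31, 0.05):
    print(score, "->", strictest_threshold(score))
```

Choosing a cutoff trades validation rate against the number of variants retained, so the appropriate bin depends on whether the downstream use favors precision or recall.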
7.6.2 Position-Dependent Sensitivity
Sensitivity varied by genomic location:
- Near exons (≤50 bp from exon-intron boundaries): 71% sensitivity at Δ ≥ 0.5
- Deep intronic (>50 bp from boundaries): 41% sensitivity at Δ ≥ 0.5
Deep intronic variants are more challenging because deep intronic sequence carries fewer of the specificity determinants that selection has concentrated near exons. Nevertheless, SpliceAI substantially outperformed prior methods in both regions.
7.6.3 Comparison to Prior Methods
Benchmarking against MaxEntScan, GeneSplicer, and NNSplice demonstrated SpliceAI’s superior performance across all operating points. At matched sensitivity, SpliceAI achieved higher validation rates; at matched validation rates, SpliceAI achieved higher sensitivity.
7.7 Population Genetics Evidence
Beyond RNA-seq validation, the authors assessed whether predicted cryptic splice variants show signatures of negative selection in human populations.
7.7.1 Allele Frequency Depletion
Using ExAC/gnomAD data, high-confidence cryptic splice variants (Δ ≥ 0.8) showed 78% depletion at common allele frequencies compared to expectation—comparable to the 82% depletion observed for frameshift, stop-gain, and essential splice-disrupting variants. This indicates that most confidently predicted cryptic splice variants are functional and deleterious.
The depletion was stronger for variants predicted to cause frameshifts versus in-frame alterations, consistent with the expectation that frameshift-causing splice variants have more severe fitness consequences.
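Whether a splicing change shifts the reading frame follows from simple arithmetic on the change in spliced length. The sketch assumes the affected region is entirely coding sequence; the function name is hypothetical.

```python
def frame_consequence(length_change: int) -> str:
    """Classify a change in spliced mRNA length (in nt) by reading-frame effect.

    Negative values represent loss of sequence (e.g. exon skipping),
    positive values represent gain (e.g. a cryptic exon or retained intron).
    """
    return "in-frame" if length_change % 3 == 0 else "frameshift"


print(frame_consequence(-120))  # skipping a 120 nt exon
print(frame_consequence(+47))   # including a 47 nt cryptic exon
```

An in-frame event can still be damaging, for example if the inserted sequence contains a stop codon, so this check gives only a first-pass classification.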
7.7.2 Rare Variant Burden
The average human genome carries approximately:
- 11 rare protein-truncating variants (allele frequency <0.1%)
- 5 rare functional cryptic splice variants
Cryptic splice variants outnumber essential GT/AG splice-disrupting variants roughly 2:1, highlighting the substantial mutational target space beyond canonical splice sites.
7.8 De Novo Mutations in Rare Disease
The central clinical finding of SpliceAI is that cryptic splice mutations constitute a major, previously underappreciated cause of rare genetic disorders.
7.8.1 Case-Control Analysis
The authors analyzed de novo mutations in:
- 4,293 individuals with intellectual disability (Deciphering Developmental Disorders cohort)
- 3,953 individuals with autism spectrum disorders (Simons Simplex Collection + Autism Sequencing Consortium)
- 2,073 unaffected sibling controls
De novo mutations predicted to disrupt splicing (Δ ≥ 0.1) were significantly enriched in affected individuals:
| Cohort | Enrichment vs. Controls | p-value |
|---|---|---|
| Intellectual disability (DDD) | 1.51-fold | 4.2×10⁻⁴ |
| Autism spectrum disorder | 1.30-fold | 0.020 |
The enrichment remained significant when restricting to synonymous and intronic mutations, excluding the possibility that results were driven solely by variants with dual protein-coding and splicing effects.
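The enrichment statistic is a ratio of per-individual mutation rates, and the excess in cases is the observed count minus the count expected at the control rate. The sketch below is a simplified version that ignores the adjustments a real analysis would make. The mutation counts in the usage example are made up, and only the cohort sizes come from the text.

```python
def enrichment(case_count, case_n, control_count, control_n):
    """Ratio of per-individual mutation rates, cases over controls."""
    if control_count == 0:
        raise ValueError("control count must be positive")
    return (case_count * control_n) / (control_count * case_n)


def excess_in_cases(case_count, case_n, control_count, control_n):
    """Observed case count minus the count expected at the control rate."""
    expected = control_count * case_n / control_n
    return case_count - expected


# Hypothetical counts of predicted splice-disrupting de novo mutations.
print(enrichment(case_count=300, case_n=4293, control_count=100, control_n=2073))
print(excess_in_cases(case_count=300, case_n=4293, control_count=100, control_n=2073))
```

The excess approximates how many case mutations are attributable to the disorder rather than to background, and it is the quantity that underlies estimates of the fraction of pathogenic mutations.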
7.8.2 Fraction of Pathogenic Mutations
Based on the excess of de novo mutations in cases versus controls:
- 9% of pathogenic de novo mutations in intellectual disability act through cryptic splicing
- 11% of pathogenic de novo mutations in autism act through cryptic splicing
In absolute terms, ~250 cases across the cohorts could be explained by de novo cryptic splice mutations, compared to ~909 cases explained by de novo protein-truncating variants.
7.8.3 Clinical Penetrance
Cryptic splice mutations showed roughly 50% of the clinical penetrance of classic protein-truncating mutations (stop-gain, frameshift, essential splice). This reduced penetrance reflects that many cryptic splice variants are hypomorphic—producing a mixture of normal and aberrant transcripts rather than complete loss of function.
Well-characterized examples from Mendelian disease support this interpretation: the c.315-48T>C variant in FECH and c.-32-13T>G in GAA are both hypomorphic cryptic splice alleles associated with milder phenotype or later age of onset.
7.8.4 Novel Gene Discovery
Including cryptic splice mutations in gene discovery analyses identified:
- 5 additional candidate genes for intellectual disability
- 2 additional candidate genes for autism
These genes would have fallen below the discovery threshold (FDR <0.01) when considering only protein-coding mutations.
7.9 Experimental Validation in Autism Patients
To directly validate predicted cryptic splice effects, the authors performed deep RNA-seq (~350 million reads per sample, ~10× GTEx coverage) on lymphoblastoid cell lines from 36 autism probands harboring predicted de novo cryptic splice mutations.
7.9.1 Validation Results
Among 28 cases with adequate RNA-seq coverage at the gene of interest:
- 21 (75%) showed unique aberrant splicing events associated with the predicted de novo variant
- 7 (25%) showed no aberrant splicing in lymphoblastoid cells
The 75% validation rate is remarkable given that the relevant tissue (developing brain) was not accessible—some cryptic splice effects may be tissue-specific and not observable in blood-derived cells.
7.9.2 Aberrant Splicing Classes
Among the 21 validated cases:
| Splicing Aberration | Count |
|---|---|
| Novel junction creation | 9 |
| Exon skipping | 8 |
| Intron retention | 4 |
These aberrant events were absent from all other samples (the remaining 35 probands and 149 GTEx individuals), confirming their association with the predicted de novo variants.
7.10 What SpliceAI Learns
Analysis of SpliceAI’s learned representations revealed that the network captures known splicing biology:
7.10.1 Core Splice Motifs
The model correctly learned the essential GT donor and AG acceptor dinucleotides, plus surrounding consensus sequences. In silico mutagenesis of these positions produced the largest predicted effects.
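In silico mutagenesis is model-agnostic: substitute every base at every position and record the change in score. In the sketch below, `toy_donor_score` is a hypothetical stand-in for a trained model and simply checks for GT at a fixed position.

```python
def in_silico_mutagenesis(seq, score_fn):
    """Return {(position, alt_base): score(alt_seq) - score(ref_seq)}
    for every possible single-nucleotide substitution."""
    ref_score = score_fn(seq)
    effects = {}
    for i, ref_base in enumerate(seq):
        for alt in "ACGT":
            if alt == ref_base:
                continue
            mutated = seq[:i] + alt + seq[i + 1:]
            effects[(i, alt)] = score_fn(mutated) - ref_score
    return effects


def toy_donor_score(seq):
    """Hypothetical scorer: 1.0 if 'GT' occupies positions 3-4, else 0.0."""
    return 1.0 if seq[3:5] == "GT" else 0.0


effects = in_silico_mutagenesis("CAGGTAAG", toy_donor_score)
damaging = sorted(pos for (pos, _), delta in effects.items() if delta < 0)
print(sorted(set(damaging)))
```

With a real model in place of the toy scorer, averaging the absolute effect over the three substitutions at each position gives a per-nucleotide importance profile.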
7.10.2 Branch Point Recognition
Introducing the optimal branch point sequence (TACTAAC) at varying distances from splice acceptors showed that SpliceAI learned the expected distance constraints (20–45 bp upstream of acceptors). At distances <20 bp, the branch point disrupts the polypyrimidine tract, and SpliceAI correctly predicted reduced acceptor strength.
7.10.3 Exonic Splicing Enhancers
The SR-protein binding motif GAAGAA, introduced at various positions, enhanced splice site strength when placed in expected locations within exons, demonstrating that SpliceAI learned the contribution of exonic splicing enhancers.
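The motif-insertion experiments in 7.10.2 and 7.10.3 share one pattern: place a motif at a series of positions and score the splice site each time. The sketch below assumes the distance is measured from the end of the motif to the acceptor index, which is my convention and not necessarily the authors'. `score_fn` stands for any model that maps a sequence to an acceptor score.

```python
def place_motif(seq, motif, start):
    """Overwrite seq[start:start + len(motif)] with motif, keeping length fixed."""
    if start < 0 or start + len(motif) > len(seq):
        raise ValueError("motif does not fit at this position")
    return seq[:start] + motif + seq[start + len(motif):]


def distance_scan(seq, acceptor_index, motif, distances, score_fn):
    """Score the sequence after placing the motif so that it ends
    `d` nucleotides upstream of the acceptor, for each d in `distances`."""
    results = {}
    for d in distances:
        start = acceptor_index - d - len(motif)
        results[d] = score_fn(place_motif(seq, motif, start))
    return results


# Usage with a placeholder scorer that just reports where the motif landed.
background = "A" * 60
print(distance_scan(background, 50, "TACTAAC", [20, 40], lambda s: s.index("TACTAAC")))
```

Overwriting rather than inserting keeps every other nucleotide at a fixed distance from the splice site, so any score change can be attributed to the motif and its position.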
7.10.4 Nucleosome Positioning
Novel exon-creation events (where variants activate cryptic exons in introns) were significantly associated with existing nucleosome positioning, supporting a causal role for nucleosome occupancy in exon definition. SpliceAI implicitly captures this relationship despite not being trained on chromatin data.
7.11 Limitations and Considerations
7.11.1 Tissue Specificity
SpliceAI predicts splice sites based on sequence alone, without modeling tissue-specific alternative splicing. The same variant may have different effects across tissues depending on the expression of splicing factors and regulatory RNAs.
7.11.2 Incomplete Penetrance
Many cryptic splice variants produce partial shifts in splicing (alternative splicing) rather than complete disruption. The Δ score correlates with penetrance, but precise quantification of isoform ratios requires experimental validation.
7.11.3 Deep Intronic Predictions
While SpliceAI substantially improves deep intronic variant prediction over prior methods, sensitivity remains lower than for variants near exons. The 41% sensitivity (Δ ≥ 0.5) in deep intronic regions suggests that determinants the model does not capture, whether sequence beyond the 10 kb window or non-sequence factors such as splicing-factor expression, may contribute to splicing.
7.11.4 Training on Canonical Transcripts
Training on principal transcripts may not fully capture the diversity of alternative splicing. Augmentation with RNA-seq-derived junctions improved performance, suggesting that expanded training data could further enhance predictions.
7.12 Significance for the Field
SpliceAI established several important contributions:
- Clinical impact quantification: The estimate that 9–11% of pathogenic mutations act through cryptic splicing fundamentally changed understanding of the noncoding disease mutation landscape
- Deep context matters: The 32-layer, 10 kb context architecture demonstrated that splicing involves long-range sequence integration, motivating similar approaches in other genomic prediction tasks
- Genome-wide variant scoring: Precomputed Δ scores for all possible single nucleotide substitutions (available at https://github.com/Illumina/SpliceAI) enable routine clinical annotation
- Validation standards: The combination of RNA-seq validation, population genetics evidence, and case-control analysis established a rigorous framework for evaluating variant effect predictors
- Specialized versus general models: SpliceAI’s success demonstrated that task-specific deep learning models could outperform general-purpose approaches by focusing computational capacity on a well-defined prediction problem
SpliceAI has become a standard component of clinical variant interpretation pipelines, complementing protein-effect predictors and regulatory variant scores. The approach has influenced subsequent work on tissue-specific splicing prediction and integration of splicing effects into comprehensive variant effect models like Borzoi (Chapter 11).
The model’s code and precomputed scores are publicly available (https://github.com/Illumina/SpliceAI), enabling widespread adoption in both research and clinical settings.
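Annotated VCF files carry the scores in an INFO entry. The parser below assumes the pipe-delimited field order `ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL`, which is my understanding of the tool's output and should be checked against the repository's README. The function name and the `DS_MAX` key are my own additions.

```python
FIELDS = ["ALLELE", "SYMBOL", "DS_AG", "DS_AL", "DS_DG", "DS_DL",
          "DP_AG", "DP_AL", "DP_DG", "DP_DL"]


def parse_spliceai(info_value: str):
    """Parse the value of a SpliceAI INFO entry into a list of dicts,
    one per comma-separated allele/gene record. Missing values ('.') become None."""
    records = []
    for entry in info_value.split(","):
        parts = entry.split("|")
        if len(parts) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
        rec = dict(zip(FIELDS, parts))
        for key in FIELDS[2:6]:   # delta scores
            rec[key] = None if rec[key] == "." else float(rec[key])
        for key in FIELDS[6:]:    # positions relative to the variant
            rec[key] = None if rec[key] == "." else int(rec[key])
        scores = [rec[k] for k in FIELDS[2:6] if rec[k] is not None]
        rec["DS_MAX"] = max(scores) if scores else None  # convenience key, not in the file
        records.append(rec)
    return records


record = parse_spliceai("T|RYR1|0.00|0.00|0.91|0.08|-28|-46|-2|-31")[0]
print(record["SYMBOL"], record["DS_MAX"])
```

Taking the maximum of the four delta scores gives the single Δ value used for thresholding throughout this chapter.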