19  RNA Structure and Function

A “silent” mutation that changes nothing can change everything.

Scope: Molecular RNA, Not Transcriptome Readout

This chapter examines RNA as a molecular entity (its sequence, secondary structure, folding, and design) rather than RNA as a transcriptomic readout of cellular state. For single-cell transcriptomics, expression profiling, and cell-state inference from RNA measurements, see Chapter 20. The distinction matters: molecular RNA models take RNA sequence as input and predict structure or function; transcriptomic models treat expression levels as readouts of upstream regulatory processes.

Chapter Overview

Estimated reading time: 35-45 minutes

Prerequisites: Understanding of DNA and protein foundation models from Part III (Chapter 15, Chapter 16), familiarity with self-supervised pretraining objectives (Chapter 8), and basic knowledge of transformer architectures (Chapter 7).

You will learn:

  • Why RNA modeling presents unique challenges distinct from DNA and protein modeling
  • How RNA secondary structure prediction differs fundamentally from protein folding
  • The current state of RNA foundation models and why they lag behind protein counterparts
  • How codon-level models capture translation dynamics invisible to protein language models
  • Practical applications in UTR optimization and therapeutic mRNA design

Key insight: RNA occupies a unique position in the central dogma: it is both the output of DNA transcription and the input to protein translation, yet its structure and regulation follow rules that neither DNA nor protein models capture. The flat energy landscapes, conformational dynamics, and sparse structural data that characterize RNA create modeling challenges that no other molecular domain faces at comparable scale.

A synonymous mutation changes the DNA codon but preserves the amino acid. By the logic of protein-centric biology, such mutations should be functionally neutral: same protein sequence, same structure, same function. Yet synonymous variants can dramatically alter gene expression, affect protein folding, and cause disease. The mechanisms operate at the RNA level: altered codon optimality changes translation speed, modified mRNA secondary structure affects ribosome processivity, disrupted regulatory motifs change transcript stability. A model that sees only DNA sequence or only protein sequence misses these effects entirely. DNA foundation models learn regulatory sequence patterns; protein language models learn amino acid constraints. Neither captures RNA-level biology: secondary structure stability, RBP binding accessibility, or the coupling between splicing and decay.

RNA occupies a distinct position in the central dogma, essential to every step from transcription to translation, yet historically receiving less computational attention than its neighbors. The disparity reflects data availability more than biological importance. Protein sequences accumulate over billions of years of evolution, providing the massive corpora that enabled ESM to learn structure from sequence (Chapter 16). DNA benefits from reference genomes, population sequencing, and functional genomics consortia generating petabytes of data (Chapter 2). RNA databases remain comparatively sparse, structural annotations cover only well-characterized families, and no equivalent of AlphaFold’s crystallographic training set exists for RNA tertiary structure. The result is a modeling landscape where RNA foundation models exist but remain immature relative to protein and DNA counterparts.

The foundation models examined previously all manifest their predictions through RNA intermediates: Enformer predicts RNA-seq coverage (Chapter 17), protein models predict translation products (Chapter 16), SpliceAI models spliceosome recognition of RNA (Section 6.5). RNA-specific models add a distinct layer, treating RNA not merely as a readout of DNA or a precursor to protein, but as a structured molecule with its own sequence constraints, folding landscapes, and functional roles. We examine secondary structure prediction, RNA foundation models, codon-level mRNA models, and noncoding RNA classification, while confronting the data limitations that constrain current approaches.

19.1 RNA as Molecule Versus Transcriptome Readout

Two complementary perspectives frame computational approaches to RNA. The molecular view treats RNA as a physical object with primary sequence, secondary structure through base pairing, tertiary organization in three-dimensional space, and chemical modifications that alter its properties. In this view, modeling goals include predicting which bases pair with which, how the molecule folds, which proteins bind to it, and how synthetic RNAs might be designed with desired properties. The transcriptomic view treats RNA as a cellular readout: coverage profiles along the genome, splice junction usage, isoform abundances, expression levels that vary across cell types and conditions. Here the goal is explaining how genomic sequence and chromatin state give rise to these measurements.

Models that predict transcriptomic signals from DNA sequence (Enformer, Borzoi, and related architectures covered in Chapter 17) operate in the second paradigm. They take genomic sequence as input and output RNA-seq or CAGE profiles as predictions. These models never see RNA sequence directly; they learn the mapping from DNA context to transcriptional output. The molecular perspective treats RNA sequence as input and predicts structure, function, or design properties.

The distinction parallels the difference between protein language models and proteomics prediction models. ESM takes amino acid sequences and learns structural representations (Chapter 16). A model predicting protein abundance from genomic features would be solving a different problem. Both perspectives are valuable, and both ultimately concern RNA, but they operate at different levels of the biological hierarchy and require different architectures and training strategies.

Table 19.1: Two perspectives on RNA modeling, distinguished by input modality and biological question.
| Perspective | Input | Output | Example Models | Key Questions |
|---|---|---|---|---|
| Molecular | RNA sequence | Structure, binding, design | RNA-FM, SPOT-RNA, structure probing models | How does this RNA fold? What binds it? |
| Transcriptomic | DNA sequence + context | Expression, splicing, coverage | Enformer, Borzoi, SpliceAI | How much transcript is produced? Where is it spliced? |

The transcriptomic perspective becomes particularly important when RNA expression serves as the primary readout of cellular state. Single-cell foundation models (Chapter 20) use gene expression vectors (fundamentally RNA measurements) as their input representation, learning to embed cells based on transcriptomic profiles. RNA-level regulation also intersects with chromatin architecture: 3D genome organization influences which enhancers contact which promoters, thereby determining which RNAs are transcribed and at what levels (Chapter 21).

19.2 Why Secondary Structure Creates a Distinct Modeling Challenge

Stop and Think

Proteins fold into stable three-dimensional structures, and protein language models have learned to predict these folds with remarkable accuracy. Before reading on, consider: why might RNA structure prediction be fundamentally harder than protein structure prediction, despite RNA having only 4 bases compared to 20 amino acids?

RNA structure prediction is harder than protein structure prediction for three key reasons:

  1. Flat energy landscapes: While proteins fold into a single deep energy minimum, RNA sequences often have multiple competing structures with similar free energies; the sequence-to-structure mapping is many-to-many rather than one-to-one.

  2. Dynamic conformations: RNA can switch between alternative structures in response to cellular conditions, whereas most proteins maintain stable folds.

  3. Long-range base pairing: RNA secondary structure involves pairs spanning hundreds of nucleotides, creating dependencies that standard algorithms struggle to capture efficiently, especially when pseudoknots violate the nested structure assumption required for efficient dynamic programming.

The apparent “simplicity” of 4 bases versus 20 amino acids is misleading; the real complexity lies in conformational dynamics, not alphabet size.

RNA secondary structure prediction differs fundamentally from protein structure prediction in ways that shape every modeling choice. Where DNA language models learn from linear sequence patterns (Chapter 15) and protein models exploit evolutionary constraints across homologs (Chapter 16), RNA models must contend with conformational dynamics that neither domain faces at comparable scale. Three interrelated challenges define the problem: thermodynamic landscapes with multiple competing minima, base-pairing interactions that span hundreds of nucleotides, and pseudoknots (configurations where bases in a loop pair with bases outside that loop, creating interleaved patterns that violate the nested-structure assumptions of efficient algorithms). Understanding these challenges clarifies why RNA structure remains harder to predict than protein structure despite the apparent simplicity of four bases versus twenty amino acids.

19.2.1 Flat Energy Landscape Problem

Stop and Think

Before reading about RNA’s energy landscape, predict: if an RNA sequence can fold into multiple different structures with similar energies, what would this mean for predicting “the” structure of that RNA? How does this differ from protein folding?

RNA’s defining computational challenge emerges from thermodynamics. Proteins fold into stable three-dimensional structures because their energy landscapes contain deep minima: the native state sits in a pronounced funnel that guides the folding process. RNA energy landscapes are substantially flatter, like a marble on a dinner plate with multiple shallow dents rather than a deep bowl. The marble settles in one dent but can easily roll to another with a small nudge. Multiple conformations compete for occupancy, with free energy differences often smaller than thermal fluctuations at cellular temperatures. A given RNA sequence may adopt several alternative structures with similar stabilities, and the dominant conformation can shift in response to ion concentrations, temperature, protein binding, or chemical modifications.

Key Insight

RNA’s flat energy landscape means that the same sequence can adopt multiple different structures with similar stabilities. This creates a fundamental many-to-many relationship between sequence and structure that protein modeling largely avoids; most proteins have a single dominant fold. RNA models must somehow represent or accommodate this conformational heterogeneity.

This conformational plasticity has biological functions (riboswitches that change structure in response to ligand binding, RNA thermometers that regulate translation at different temperatures) but creates modeling difficulties. Minimum free energy (MFE) predictions, which identify the single lowest-energy structure, may miss functionally relevant alternative conformations. Partition function calculations that consider the full ensemble are more complete but computationally expensive and harder to interpret. Deep learning models that predict structure from sequence must somehow capture this many-to-many relationship between sequence and conformation, a challenge that protein structure prediction largely avoided because the sequence-to-structure mapping for most proteins is effectively one-to-one.

Figure 19.1: Energy landscape comparison between protein and RNA folding. (A) Proteins fold into deep energy minima, typically reaching a single stable native structure. (B) RNA energy landscapes are flatter, with multiple conformations competing at similar free energies, explaining why RNA can adopt alternative structures under different conditions.

19.2.2 Base Pairing and Long-Range Dependencies

Secondary structure arises from Watson-Crick base pairing (A-U, G-C) and wobble pairs (G-U) that create stems, loops, bulges, and internal loops. Unlike protein secondary structure, where alpha helices and beta sheets are local motifs determined by nearby residues, RNA secondary structure involves long-range contacts. A base at position i may pair with a base at position j hundreds of nucleotides away. The intervening sequence must accommodate this pairing without introducing steric clashes or thermodynamically unfavorable arrangements.

This long-range dependency in secondary structure differs from protein structure prediction in a specific way: while proteins also have long-range tertiary contacts (and indeed, protein contact prediction is a key challenge), protein secondary structure elements are primarily local. RNA secondary structure, by contrast, is inherently non-local: you cannot determine whether a base is paired without considering the entire sequence. RNA structure prediction must consider all possible pairings across the entire sequence, evaluate their compatibility, and identify the globally optimal (or near-optimal) arrangement. The number of possible secondary structures grows exponentially with sequence length, making exhaustive enumeration intractable for long RNAs.

Worked Example: Consider a 100-nucleotide RNA. Position 10 might pair with position 90, creating a stem. But positions 10-15 might alternatively pair with positions 45-50, creating a different stem. These alternatives are mutually exclusive (a base can only pair with one partner), but both might have similar free energies. The algorithm must evaluate all such possibilities and find the globally best combination, not just locally optimal stems.
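To see what "evaluate all such possibilities" means algorithmically, here is a minimal sketch of the classic Nussinov dynamic program, which maximizes the number of base pairs rather than minimizing free energy. It is didactic, not a production folder: real tools use calibrated nearest-neighbor energy parameters, and the sequence and minimum loop length below are illustrative.

```python
# Nussinov-style dynamic programming: maximize Watson-Crick/wobble pairs.
# A didactic stand-in for energy minimization; real tools (Mfold, ViennaRNA)
# minimize free energy with calibrated nearest-neighbor parameters.

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
MIN_LOOP = 3  # hairpin constraint: i and j cannot pair unless j - i > 3

def max_pairs(seq: str) -> int:
    n = len(seq)
    dp = [[0] * n for _ in range(n)]  # dp[i][j] = max pairs in seq[i..j]
    for span in range(MIN_LOOP + 1, n):          # widen the subsequence
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]                  # case 1: j stays unpaired
            for k in range(i, j - MIN_LOOP):     # case 2: j pairs with some k
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1] if n else 0

print(max_pairs("GGGAAAUCCC"))  # 3: the G-C stem; A-U is blocked by MIN_LOOP
```

The triple loop is where the \(O(n^3)\) cost of standard secondary structure prediction comes from: every subsequence is scored, and every possible partner for its last base is tried.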

19.2.3 Pseudoknots and Tertiary Complexity

Stop and Think

Standard RNA structure prediction algorithms use dynamic programming with O(n³) complexity by assuming nested base pairs. What happens to this assumption when bases in a loop pair with bases outside that loop? Why would this dramatically increase computational complexity?

Pseudoknots occur when bases in a loop pair with bases outside that loop, creating interleaved base-pairing patterns that violate the nested structure assumed by standard secondary structure algorithms. A typical pseudoknot involves two stem regions whose base pairs cross each other when drawn in standard notation. These structures are functionally important (the telomerase RNA catalytic core contains a pseudoknot essential for activity) but algorithmically challenging. Standard dynamic programming approaches for secondary structure prediction exclude pseudoknots because their inclusion increases computational complexity from \(O(n^3)\) to \(O(n^6)\) or worse. The complexity increase arises because nested base pairs can be solved by recursive decomposition: if positions \(i\) and \(j\) pair, the structure between them is independent of the structure outside. Pseudoknots violate this independence by creating dependencies between the "inside" and "outside" regions. The algorithm must therefore consider all possible ways the crossing pairs might interleave, leading to a combinatorially larger state space.
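The nesting assumption is easy to state in code: a structure is pseudoknot-free exactly when no two pairs interleave. A small sketch (pair coordinates are illustrative):

```python
def crossing_pairs(pairs):
    """Return the pairs that cross, i.e. interleave as i < k < j < l.

    Nested structures (what O(n^3) dynamic programming assumes) have no
    crossings; a pseudoknot shows up as at least one crossing."""
    crossings = []
    for a in range(len(pairs)):
        i, j = sorted(pairs[a])
        for b in range(a + 1, len(pairs)):
            k, l = sorted(pairs[b])
            if i < k < j < l or k < i < l < j:
                crossings.append((pairs[a], pairs[b]))
    return crossings

nested = [(0, 30), (5, 25), (10, 20)]   # pairs nest inside one another
pseudoknot = [(0, 15), (5, 25)]         # (0,15) and (5,25) interleave
print(crossing_pairs(nested))       # [] -> dynamic programming applies
print(crossing_pairs(pseudoknot))   # one crossing -> pseudoknot
```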

Figure 19.2: RNA secondary structure vocabulary. (A) Basic structural elements: stems, hairpin loops, internal loops, bulges, and multi-loop junctions. (B) Long-range base pairing: RNA pairs can span hundreds of nucleotides, unlike protein secondary structure. (C) Pseudoknots: interleaved base pairs that increase prediction complexity from O(n³) to O(n⁶). (D) Common notation systems for representing secondary structure.

Tertiary structure involves the three-dimensional arrangement of secondary structure elements in space, including long-range interactions mediated by non-Watson-Crick base pairs, metal ion coordination, and RNA-RNA kissing loops. Predicting RNA tertiary structure remains far less developed than protein tertiary structure prediction. No RNA equivalent of AlphaFold exists, and the training data situation is dire: the Protein Data Bank contains over 200,000 protein structures but fewer than 2,000 RNA structures, many of which are ribosomal RNA fragments or tRNA variants from the same structural families.

19.3 Classical Approaches to Structure Prediction

Before deep learning entered the field, two complementary paradigms dominated RNA structure prediction. Thermodynamic approaches compute minimum free energy structures from experimentally calibrated energy parameters, while comparative methods infer structure from patterns of compensatory mutations across homologous sequences. Both approaches remain valuable, and understanding their strengths and limitations illuminates what deep learning models must learn to surpass them.

19.3.1 Thermodynamic Folding Models

The dominant classical paradigm for RNA secondary structure prediction relies on nearest-neighbor thermodynamic models. These approaches assign free energy contributions to each base pair and structural element (loops, bulges, internal loops, multiloops) based on experimentally calibrated parameters. Given these parameters, dynamic programming algorithms identify the minimum free energy structure or compute the partition function over all possible structures.

Mfold and the ViennaRNA package represent the most widely used implementations. These programs use dynamic programming to fill a table of optimal substructure scores: for each possible base pair \((i,j)\), they compute the minimum free energy achievable for the subsequence between \(i\) and \(j\), building up from small subsequences to the full molecule. The algorithm works because RNA secondary structure is "nested": if \((i,j)\) pairs, then any other pair \((k,l)\) either lies entirely within \((i,j)\) or entirely outside it, never crossing. This nesting property enables the \(O(n^3)\) dynamic programming solution.
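For readers who want to run these methods, ViennaRNA ships Python bindings. A brief sketch, assuming the `ViennaRNA` package is installed and using its documented `fold_compound` interface; the sequence is illustrative:

```python
import RNA  # ViennaRNA Python bindings (assumes the package is installed)

seq = "GGGCUAUUAGCUCAGUUGGUUAGAGCGCACCCCUGAUAAGGGUG"

fc = RNA.fold_compound(seq)
structure, mfe = fc.mfe()          # minimum free energy structure
print(structure, f"{mfe:.2f} kcal/mol")

# Ensemble quantities: because RNA landscapes are flat, base-pair
# probabilities are often more informative than the single MFE structure.
_, ensemble_energy = fc.pf()       # partition function (must precede bpp)
bpp = fc.bpp()                     # 1-indexed base-pair probability matrix
print(f"ensemble free energy: {ensemble_energy:.2f} kcal/mol")
print(f"P(pair 1-{len(seq)}): {bpp[1][len(seq)]:.3f}")
```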

These methods achieve reasonable accuracy for short, well-behaved RNAs where the thermodynamic parameters are most reliable. Limitations emerge for longer RNAs where the flat energy landscape means many structures have similar energies, for RNAs in complex cellular environments where proteins and other factors alter folding, and for RNAs with modifications or non-canonical interactions not captured by standard parameter sets. These methods also assume equilibrium conditions that may not hold for co-transcriptional folding or kinetically trapped states.

Knowledge Check

Thermodynamic folding methods find the minimum free energy structure. Given what you have learned about RNA’s flat energy landscape, what is the main limitation of this approach? (Hint: think about alternative conformations.)

The main limitation is that MFE methods predict only the single lowest-energy structure, missing functionally relevant alternative conformations that may have similar free energies. RNA’s flat energy landscape means multiple structures can coexist with comparable stability, so the global minimum may not be the only biologically important state.

19.3.2 Comparative and Covariation Methods

For RNAs with sufficient homologous sequences, comparative approaches provide an orthogonal route to structure inference. If two positions exhibit compensatory mutations (G-C changing to A-U while maintaining complementarity), those positions likely base-pair. Databases like Rfam curate consensus secondary structures for RNA families based on these covariation signals (Kalvari et al. 2021).

Comparative methods are powerful but require multiple sequence alignments of homologous RNAs. Novel RNAs, rapidly evolving regulatory elements, or species-specific transcripts may lack sufficient homologs for reliable inference. The approach also assumes that structure is conserved across the aligned sequences, which breaks down for RNAs that have diverged in function or that adopt condition-specific alternative structures.
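The covariation signal itself is simple to compute from an alignment. A toy sketch, assuming a gapless alignment of homologs (sequences are illustrative): two columns support pairing if they stay complementary while both vary.

```python
COMPLEMENT = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def covariation_support(alignment, i, j):
    """Fraction of sequences whose columns i and j are complementary, plus
    the number of distinct pair types observed (compensatory variation)."""
    pairs = [(seq[i], seq[j]) for seq in alignment]
    compatible = sum(p in COMPLEMENT for p in pairs)
    variation = len({p for p in pairs if p in COMPLEMENT})
    return compatible / len(pairs), variation

# Toy alignment: columns 0 and 5 covary (G-C in one homolog, A-U in
# another), the signature that comparative methods exploit.
aln = ["GAAAAC",
       "AAAAAU",
       "GUUUUC"]
frac, n_types = covariation_support(aln, 0, 5)
print(frac, n_types)  # 1.0, 2 pair types -> strong evidence of pairing
```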

Table 19.2: Comparison of RNA structure prediction approaches. Each has complementary strengths.
| Approach | Strengths | Limitations | Best Use Case |
|---|---|---|---|
| Thermodynamic (MFE) | No homologs needed; well-understood parameters | Misses alternative structures; struggles with long RNAs | Short RNAs, initial prediction |
| Partition function | Considers ensemble; provides base-pair probabilities | Computationally expensive; hard to interpret | Assessing structural confidence |
| Comparative | High accuracy when homologs available; reveals conserved structure | Requires homologs; assumes structure conservation | Well-characterized RNA families |
| Deep learning | Learns complex patterns; can handle pseudoknots | Requires training data; may not generalize | Structure probing integration |

19.4 Deep Learning for Secondary Structure Prediction

Deep learning reframes secondary structure prediction as sequence-to-structure mapping, learning the relationship directly from data rather than encoding it through thermodynamic parameters. These models can capture patterns that classical approaches miss, particularly for complex structures and pseudoknots, though they require training data that remains limited compared to protein structure prediction. Two complementary training strategies have emerged: supervised learning from experimentally determined structures and semi-supervised approaches using structure probing data.

19.4.1 From Thermodynamics to Learned Patterns

Deep learning models for RNA structure prediction frame the task as sequence-to-structure mapping, analogous to protein contact prediction (Chapter 16). Given an RNA sequence, the model predicts base-pairing probabilities for all position pairs, contact maps indicating which bases interact, or per-nucleotide structural states (paired, unpaired, in loop, in stem).

Models like SPOT-RNA use convolutional or attention-based architectures to capture long-range dependencies in sequence (Singh et al. 2019). Some approaches directly predict pairing matrices as dense outputs; others output per-position classifications that are post-processed into structures. Training typically uses experimentally determined structures from databases like RNAstralign or bpRNA, supplemented by computationally predicted structures from thermodynamic models.
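A minimal PyTorch sketch of this general design pattern: encode the sequence, form pairwise features by combining per-position embeddings, and emit a symmetric matrix of pairing probabilities. Dimensions, layer counts, and names are illustrative, not taken from SPOT-RNA or any published model.

```python
import torch
import torch.nn as nn

class PairingHead(nn.Module):
    """Per-position embeddings -> L x L base-pairing probabilities."""

    def __init__(self, d_model=64, n_tokens=4):
        super().__init__()
        self.embed = nn.Embedding(n_tokens, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.pair_proj = nn.Linear(2 * d_model, 1)

    def forward(self, tokens):                       # tokens: (batch, L) ints
        h = self.encoder(self.embed(tokens))         # (batch, L, d)
        L = h.size(1)
        hi = h.unsqueeze(2).expand(-1, L, L, -1)     # row features
        hj = h.unsqueeze(1).expand(-1, L, L, -1)     # column features
        logits = self.pair_proj(torch.cat([hi, hj], dim=-1)).squeeze(-1)
        logits = 0.5 * (logits + logits.transpose(1, 2))  # enforce symmetry
        return torch.sigmoid(logits)                 # (batch, L, L)

probs = PairingHead()(torch.randint(0, 4, (1, 30)))
print(probs.shape)  # torch.Size([1, 30, 30]) pairing probabilities
```

Because the output is a dense matrix rather than a nested bracket string, nothing in this formulation forbids crossing pairs, which is one reason learned models handle pseudoknots more gracefully than dynamic programming.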

Performance on benchmark datasets often exceeds classical thermodynamic methods, particularly for RNAs with complex structures or pseudoknots where dynamic programming approaches struggle. The learned models can capture patterns beyond nearest-neighbor rules, potentially encoding longer-range sequence dependencies that contribute to folding but were not parameterized in classical approaches.

19.4.2 Structure Probing as Supervision

High-throughput structure probing experiments provide an alternative source of supervision. SHAPE (selective 2’-hydroxyl acylation analyzed by primer extension), DMS-seq, and icSHAPE measure nucleotide accessibility or flexibility across entire transcriptomes. Positions that are base-paired or buried in tertiary structure show lower reactivity than exposed positions (Spitale et al. 2015).

These data offer several advantages for model training. They cover far more RNAs than crystal structures, extending beyond well-characterized families to regulatory elements and novel transcripts. They capture structure in cellular context, reflecting the influence of proteins, modifications, and physiological conditions. And they provide soft constraints rather than binary pairing assignments, potentially better matching the conformational heterogeneity of real RNA populations.

Looking Forward: RNA Structure in Variant Interpretation

RNA secondary structure predictions feed directly into clinical variant interpretation. Variants that disrupt splice sites or create aberrant RNA structures can cause disease even when amino acid sequence is unchanged. These structural effects are integrated into foundation model-based variant interpretation pipelines (Chapter 18) and are particularly important in rare disease diagnosis (Section 29.1.4), where splice-altering variants explain a substantial fraction of previously undiagnosed cases.

Models trained on structure probing data learn to predict accessibility profiles from sequence. These predictions can be integrated with thermodynamic models (using predicted accessibility as constraints) or used directly for downstream tasks like predicting RNA-protein binding or designing stable constructs.

19.5 RNA Foundation Models

Stop and Think

Protein language models like ESM-2 trained on 65 million sequences and learned to predict protein structure from sequence alone. Before reading on, consider: what would an RNA foundation model need to achieve a similar breakthrough? What resources or data might be missing?

The success of protein language models naturally prompted attempts to apply the same paradigm to RNA: train large transformers on massive sequence corpora following the scaling principles examined in Section 14.3, learn representations through self-supervised objectives (Chapter 8), then transfer to downstream tasks. RNA foundation models exist and show promise, but they have not yet achieved the transformative impact of their protein counterparts. The reasons illuminate fundamental differences between protein and RNA modeling.

19.5.1 Scale Gap with Protein Language Models

Stop and Think

ESM-2 learned to predict protein structure from sequence alone, with attention patterns that correspond to 3D contacts, an emergent capability that surprised even its creators. Why have RNA foundation models not achieved a similar breakthrough? Before reading on, consider what differences between protein and RNA data might explain this gap.

RNA foundation models attempt to replicate the protein language model paradigm: train large transformers on massive sequence corpora using self-supervised objectives, then transfer learned representations to downstream tasks. The approach has produced working models, but the results lag substantially behind protein language models in both scale and demonstrated capabilities.

The comparison with ESM illustrates the gap. ESM-2 trained on over 65 million protein sequences from UniRef, spanning the known diversity of protein families (Chapter 16). RNA-FM, one of the more successful RNA foundation models, trained on approximately 23 million noncoding RNA sequences from RNAcentral (Chen et al. 2022). This is not a trivial corpus, but it is roughly three-fold smaller, and (more importantly) the RNA sequences span a narrower range of structural and functional diversity than proteins. The consequences appear in downstream performance: RNA-FM improves over baselines on secondary structure prediction and family classification, but shows nothing like the emergent structure prediction of ESM-2, whose attention patterns predict contact maps without supervision.

Key Insight

The gap between RNA and protein foundation models is not primarily about architecture or training objectives; it is about data. While RNA is at least as ancient as protein (indeed, the “RNA world” hypothesis posits that RNA preceded proteins), protein sequence databases are vastly larger because proteins are more amenable to high-throughput sequencing, annotation, and structural determination. RNA databases are smaller, biased toward well-characterized families, and lack equivalent structural training sets. Until this data gap closes, RNA foundation models will struggle to achieve protein-like breakthroughs.

Several factors explain the disparity. Protein databases contain sequences spanning billions of years of evolution across all domains of life, with each functional protein family represented by thousands of homologs, and critically, these sequences have been systematically collected and annotated. RNA databases are biased toward well-characterized structural families (tRNAs, rRNAs, ribozymes) with sparser coverage of regulatory ncRNAs and lineage-specific transcripts. Many functionally important RNAs (particularly regulatory ncRNAs) evolve rapidly and show poor sequence conservation, making them harder to identify across species. The epitranscriptomic modifications that alter RNA function are invisible in sequence databases, unlike protein post-translational modifications that at least occur at predictable sequence motifs.

Figure 19.3: The scale gap between protein and RNA foundation models. (A) Key metrics comparison showing protein models train on approximately 3× more sequences with much greater diversity. (B) Training data composition: protein data span diverse families while RNA databases are dominated by well-characterized classes. (C) Emergent capabilities comparison showing protein models achieve structure prediction that RNA models lack. (D) The fundamental structural data challenge: as of 2025, the Protein Data Bank (PDB; https://www.rcsb.org) contains over 240,000 protein structures but fewer than 2,000 RNA-only tertiary structures, with most RNA structures being ribosomal fragments or tRNAs.

19.5.2 Architectures and Objectives

Most RNA foundation models follow the masked language modeling (MLM) paradigm established by BERT (Chapter 5). RNA-FM uses a transformer encoder with nucleotide-level tokenization, predicting masked bases from surrounding context. The learned embeddings show some correspondence to secondary structure when probed with downstream tasks, though the correspondence is weaker than the structure-function relationship learned by protein language models.
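The masking objective itself is straightforward. A minimal sketch of BERT-style nucleotide masking for MLM training data; the 15% rate and the vocabulary indices are conventional but illustrative, not taken from RNA-FM:

```python
import random

VOCAB = {"A": 0, "C": 1, "G": 2, "U": 3, "[MASK]": 4}  # illustrative indices

def mask_for_mlm(seq, mask_rate=0.15, seed=0):
    """BERT-style masking: hide ~15% of positions; the model must
    reconstruct them from context. Labels are -100 (ignored) elsewhere."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for base in seq:
        tok = VOCAB[base]
        if rng.random() < mask_rate:
            inputs.append(VOCAB["[MASK]"])
            labels.append(tok)       # supervise only the masked positions
        else:
            inputs.append(tok)
            labels.append(-100)
    return inputs, labels

print(mask_for_mlm("GGGAAAUCCC"))
```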

Alternative architectures explore different design choices. Some models incorporate explicit structure tokens or operate on sequence-structure graphs, learning joint representations over both modalities. Others use codon-level tokenization for coding RNAs or explore state-space models and other efficient attention variants to handle longer sequences. RNAErnie and related models experiment with multi-task objectives that combine MLM with auxiliary predictions for structural features or family classification. The field remains in active development, with no clear consensus on optimal architecture, tokenization, or training strategy. Unlike protein modeling, where ESM established a dominant paradigm that subsequent work has refined, RNA modeling still explores fundamental design choices.

19.5.3 Downstream Applications

RNA foundation model embeddings support various downstream tasks. Secondary structure prediction fine-tunes the model to output pairing probabilities or SHAPE reactivity profiles. RNA-protein binding prediction uses CLIP-seq data to predict interactions with RNA-binding proteins. Family classification assigns sequences to Rfam families or functional categories (tRNA, rRNA, miRNA, lncRNA). Expression and stability tasks predict transcript half-life or steady-state levels from UTR sequences.

Performance varies substantially across tasks. For structurally constrained RNAs like tRNAs and rRNAs, where sequence motifs strongly determine structure and function, foundation model embeddings provide useful features. For regulatory lncRNAs that often lack stable secondary structures and conserved motifs, improvement over baseline methods is more modest. The diversity of RNA types and tasks complicates benchmarking (Chapter 12), and models that excel on one task may struggle on others.

19.6 Codon-Level Models for Coding RNA

Stop and Think

Consider two mRNA sequences that encode the exact same protein but differ in their codon choices (synonymous codons). A protein language model sees identical sequences. What biological information exists in the mRNA that the protein model completely misses?

Coding sequences present a modeling opportunity that neither DNA nor protein foundation models fully exploit. The genetic code’s synonymous redundancy means that mRNA sequence carries information beyond amino acid identity: codon choice affects translation speed, mRNA stability, and co-translational folding. Codon-level foundation models tokenize mRNA into three-nucleotide units, learning representations that capture these codon-specific signals invisible to protein language models.

19.6.1 Beyond Nucleotide Tokenization

Coding sequences occupy a special niche where protein and nucleic acid constraints intersect. The genetic code assigns 61 sense codons to 20 amino acids, creating synonymous redundancy where multiple codons encode the same amino acid. This redundancy is not functionally neutral: synonymous codons differ in tRNA availability, translation speed, co-translational folding effects, and mRNA stability. Protein language models, which operate on amino acid sequences, cannot capture these codon-level signals.

Stop and Think

A protein language model like ESM sees only amino acid sequence. A DNA language model like DNABERT sees nucleotide sequence but treats all positions equally. What information is lost by each approach when modeling a coding sequence? What could a codon-level model capture that neither can?

Codon-level foundation models address this gap by tokenizing mRNA into codons rather than nucleotides. Models like cdsFM, EnCodon, and DeCodon treat each three-nucleotide codon as a single token, training on masked codon prediction and related objectives (Naghipourfar et al. 2024). (Note: CodonFM, a successor to EnCodon with improved architecture and training, has been announced but was not yet published at time of writing.) This tokenization encodes a biological prior: codons are the fundamental units of translation, and mutations at the codon level determine amino acid changes while mutations within synonymous codons affect expression without changing protein sequence.

The codon vocabulary contains 61 tokens (excluding stop codons) plus special tokens for noncoding regions and boundaries. This intermediate vocabulary size (between character-level nucleotide tokenization and typical BPE vocabularies of thousands of tokens) balances resolution with context length (Chapter 5). A 300-amino-acid protein corresponds to 900 nucleotides or 300 codons, making whole-gene modeling tractable within standard transformer context windows. The therapeutic implications of codon-level modeling are examined in Section 31.4, where these representations guide mRNA vaccine and protein replacement therapy design.
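A codon tokenizer is a few lines of Python. The sketch below builds the 61-sense-codon vocabulary plus illustrative special tokens; the index assignments are arbitrary, not taken from any published model.

```python
from itertools import product

STOPS = {"UAA", "UAG", "UGA"}
# 61 sense codons + special tokens; index assignment is illustrative.
SENSE = [c for c in ("".join(p) for p in product("ACGU", repeat=3))
         if c not in STOPS]
VOCAB = {codon: i for i, codon in enumerate(SENSE)}
VOCAB.update({"<stop>": 61, "<pad>": 62, "<unk>": 63})

def tokenize_cds(cds: str):
    """Split an in-frame coding sequence into codon tokens (3 nt each)."""
    assert len(cds) % 3 == 0, "CDS must be in frame"
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return [VOCAB["<stop>"] if c in STOPS else VOCAB.get(c, VOCAB["<unk>"])
            for c in codons]

# A 900-nt CDS becomes 300 tokens, not 900: 3x shorter model inputs.
print(tokenize_cds("AUGGCUGAAUAA"))  # Met-Ala-Glu-stop
```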

Figure 19.4: Codon-level modeling of mRNA. (A) Multiple mRNA sequences can encode the same protein through synonymous codon choices. (B) Codon selection affects tRNA availability, translation speed, mRNA structure, and stability. (C) Codon tokenization reduces sequence length while encoding biological priors about translation units. (D) Comparison of what protein, DNA, and codon language models can capture from coding sequences.

19.6.2 What Codon Models Add

Stop and Think

You are designing a therapeutic mRNA to express a protein in human cells. The amino acid sequence is fixed (you need that exact protein). But you have roughly 3^300 possible mRNA sequences to choose from for a 300-amino-acid protein. What criteria would guide your choice among synonymous alternatives?

Compared to protein language models, codon-level models enable direct modeling of mRNA design problems where amino acid sequence is fixed but codon choice is variable. They capture codon usage bias and its relationship to expression, model translation elongation dynamics that affect co-translational folding, and distinguish synonymous variants that are neutral at the protein level but affect mRNA properties.

Life-Code extends this approach into a central-dogma-wide framework, linking DNA, RNA, and protein representations through shared or aligned embedding spaces (Liu et al. 2025). CodonBERT specifically targets mRNA design for vaccines and therapeutics, training on over 10 million mRNA sequences to learn representations that predict expression, stability, and immunogenicity (Li et al. 2023).

Codon models typically ignore mRNA secondary structure and modifications. Local structure affects ribosome access and translation rate; modifications like m6A influence stability and localization. Combining codon-aware tokenization with structure-aware representations remains an open direction, less mature than the parallel integration of sequence and structure in protein modeling (Chapter 16).

Table 19.3: Comparison of model types for coding sequences. Codon models fill a gap between DNA and protein approaches.
| Model Type | Input | Captures Codon Bias | Captures Structure | Use Case |
|---|---|---|---|---|
| Protein LM | Amino acids | No | Protein structure | Variant effects, function |
| DNA LM | Nucleotides | Partially (no codon boundaries) | No | Regulatory sequence |
| Codon LM | Codons | Yes | No (typically) | mRNA design, translation |
| Hybrid | Codon + structure | Yes | Yes | Future direction |

Knowledge Check: Which Model for This Coding Sequence Task?

For each scenario, which model type (Protein LM, DNA LM, or Codon LM) would you choose?

  1. Predicting whether a missense mutation (Val→Ile) disrupts protein function.

  2. Optimizing codon usage for a therapeutic mRNA vaccine while keeping the amino acid sequence identical.

  3. Assessing whether a synonymous SNP in a coding region affects a nearby splice enhancer motif.

  4. Predicting translation speed and ribosome stalling patterns across a transcript.

  1. Protein LM (ESM-2, AlphaMissense) - Missense effects depend on amino acid properties and protein structure, not codon choice. Protein LMs capture evolutionary constraints and structural context that determine functional impact.

  2. Codon LM (CodonBERT, cdsFM) - This is precisely what codon models are designed for: exploring the ~3^300 synonymous sequence space while predicting expression and stability. Protein LMs cannot distinguish synonymous alternatives.

  3. DNA LM (DNABERT, Nucleotide Transformer) - Splice enhancer motifs operate at the nucleotide level across exon-intron boundaries. DNA LMs see the nucleotide context; codon LMs would miss the splice signal since it spans codon boundaries.

  4. Codon LM - Translation dynamics depend on codon-level features: tRNA availability, codon pair effects, and ribosome decoding speed. DNA LMs lack codon boundary information; protein LMs lose synonymous codon information entirely.

19.7 UTR Models and Translation Regulation

Stop and Think

Two mRNAs encode identical proteins with identical codon sequences, but one produces 100-fold more protein than the other. Where would you look to explain this difference, and what mechanisms might be responsible?

The untranslated regions flanking a coding sequence determine how much protein an mRNA produces and how long the message survives in the cell. These regulatory effects operate through distinct mechanisms in the 5’ and 3’ UTRs, creating opportunities for both understanding endogenous regulation and engineering synthetic mRNAs with desired expression properties.

19.7.1 Why UTRs Dominate Expression Control

The protein output of an mRNA depends as much on its untranslated regions as on its coding sequence. A transcript’s 5’ UTR determines whether ribosomes find and engage the start codon; its 3’ UTR controls how long the message survives and where in the cell it localizes, like an expiration date stamped on perishable goods, except here the “date” is encoded in sequence features that cellular machinery reads to decide when to degrade the transcript. Two mRNAs encoding identical proteins can differ by orders of magnitude in expression if their UTRs differ. This regulatory leverage makes UTR modeling essential for both understanding endogenous gene regulation and designing synthetic mRNAs for therapeutic applications.

The 5’ UTR spans from the transcription start site to the start codon, typically 50 to 200 nucleotides in human mRNAs. Within this region, secondary structure can occlude the start codon and impede ribosome scanning, upstream open reading frames (uORFs) can capture ribosomes before they reach the main coding sequence, and internal ribosome entry sites (IRES) can enable cap-independent translation under stress conditions.

The Kozak consensus sequence (GCCRCCAUGG in vertebrates, where R denotes purine) surrounding the start codon critically influences initiation efficiency. The ribosome scans the 5’ UTR from the cap until it encounters an AUG in favorable context; a strong Kozak sequence ensures the ribosome recognizes this AUG as the start codon rather than continuing to scan. Weak Kozak sequences lead to “leaky scanning” where some ribosomes skip the intended start codon, reducing protein output or initiating at downstream AUGs. The positions at -3 (purine, ideally A) and +4 (G) relative to the AUG are most critical. Context extending dozens of nucleotides in either direction further modulates this effect. Predicting translation efficiency from 5’ UTR sequence requires integrating these overlapping signals.
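The two critical Kozak positions can be checked directly. A toy classifier implementing just the -3 purine and +4 G rules described above; real initiation efficiency depends on broader context and structure, so treat this as a sketch:

```python
def kozak_strength(utr5: str, cds: str) -> str:
    """Classify the Kozak context of the annotated start codon.

    Convention: position +1 is the A of AUG, so -3 is the third base
    before the start codon and +4 is the first base after it."""
    assert cds.startswith("AUG")
    minus3 = utr5[-3] if len(utr5) >= 3 else None
    plus4 = cds[3] if len(cds) > 3 else None
    strong_minus3 = minus3 in ("A", "G")     # purine, ideally A
    strong_plus4 = plus4 == "G"
    if strong_minus3 and strong_plus4:
        return "strong"
    if strong_minus3 or strong_plus4:
        return "adequate (some leaky scanning possible)"
    return "weak (expect leaky scanning)"

print(kozak_strength("GGCGCCACC", "AUGGCU"))  # -3 = A, +4 = G -> strong
```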

The 3’ UTR extends from the stop codon to the poly-A tail, ranging from under 100 nucleotides to over 10 kilobases. This region harbors binding sites for RNA-binding proteins and microRNAs that collectively determine mRNA half-life, localization, and translational status. AU-rich elements (AREs) recruit decay machinery in response to cellular signals. Pumilio and other RNA-binding proteins recognize specific motifs to repress or activate translation. The density and arrangement of miRNA binding sites create combinatorial regulatory logic that varies across cell types depending on which miRNAs are expressed.

Practical Guidance: UTR Design Considerations

When designing or optimizing mRNAs for expression:

  • 5’ UTR: Avoid strong secondary structure near start codon; eliminate upstream AUGs that create uORFs; ensure Kozak consensus (GCCRCCAUGG); consider GC content for stability vs. structure trade-offs
  • 3’ UTR: Consider borrowing UTRs from highly expressed endogenous genes (alpha-globin, beta-globin are common choices); avoid sequences resembling miRNA binding sites for target tissues; balance length (longer = more regulatory sites) against manufacturing constraints
  • Integration: Remember that 5’ and 3’ effects interact; optimize iteratively, not independently

19.7.2 Sequence-to-Expression Models

High-throughput reporter assays have enabled systematic modeling of UTR function. Massively parallel reporter assays (MPRAs) measure expression driven by thousands of UTR variants in a single experiment, providing training data at scales previously unavailable. Sample et al. used such data to train Optimus 5-Prime, a convolutional model that predicts ribosome load from 5’ UTR sequence with accuracy sufficient to guide synthetic UTR design (Sample et al. 2019). The model learned interpretable features corresponding to known regulatory elements (uORF presence, Kozak strength, secondary structure) while also capturing context-dependent interactions invisible to element-counting approaches.
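A minimal convolutional regressor in this spirit (one-hot 5’ UTR in, scalar ribosome load out); the layer sizes and names are illustrative, not the published Optimus 5-Prime architecture:

```python
import torch
import torch.nn as nn

class UTRRegressor(nn.Module):
    """One-hot 5' UTR -> predicted ribosome load (illustrative sketch)."""

    def __init__(self, utr_len=50):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(4, 64, kernel_size=8, padding="same"), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=8, padding="same"), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * utr_len, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x):            # x: (batch, 4, utr_len) one-hot
        return self.head(self.conv(x)).squeeze(-1)

model = UTRRegressor()
x = torch.randn(2, 4, 50)            # stand-in for one-hot encoded UTRs
print(model(x).shape)                # torch.Size([2]) ribosome-load scores
```

Trained on MPRA measurements, a model of this shape can then score arbitrary candidate UTRs, which is what makes it useful for design rather than only for explanation.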

For 3’ UTRs, models must contend with greater length and combinatorial complexity. A 2-kilobase 3’ UTR may contain dozens of potential regulatory sites whose effects depend on spacing, secondary structure context, and the expression levels of cognate binding proteins. Approaches range from motif-based models that score individual elements and sum contributions, to deep learning architectures that process entire UTR sequences and learn nonlinear interactions. Agarwal and Kelley trained models on endogenous mRNA stability measurements, demonstrating that 3’ UTR sequence features explain substantial variance in half-life across the transcriptome (Agarwal and Shendure 2020).

Transfer learning from RNA foundation models offers a complementary approach. Rather than training UTR-specific models from scratch, pretrained representations from RNA-FM or similar models can be fine-tuned on expression prediction tasks (Chapter 5). The pretrained embeddings encode sequence context and potential structural features that may transfer to UTR function prediction, though systematic comparisons between foundation model transfer and task-specific training remain limited.

19.7.3 Integration with mRNA Design

UTR optimization represents a distinct component of therapeutic mRNA design, complementing codon optimization. For a vaccine or protein replacement therapy, the coding sequence determines what protein is made while the UTRs determine how much protein is made and for how long. Current mRNA therapeutics typically use UTRs borrowed from highly expressed endogenous genes (human alpha-globin and beta-globin UTRs are common choices) rather than computationally optimized sequences.

Model-guided UTR design could improve on this approach by optimizing for specific objectives: maximizing expression in target tissues, extending mRNA half-life to reduce dosing frequency, or minimizing immunogenicity by avoiding sequences that trigger innate immune sensors. The challenge lies in the combinatorial interaction between UTRs and coding sequence. Secondary structures can span the UTR-CDS boundary, and the optimal 5’ UTR for one coding sequence may perform poorly for another. Integrated models that jointly optimize UTRs and coding sequence represent an active research direction, though experimental validation of computationally designed UTRs remains limited compared to the extensive optimization of coding sequences. The design principles and optimization strategies for therapeutic mRNAs, including COVID-19 vaccine development, are detailed in Section 31.4.2.

19.8 mRNA Design and Optimization

Cross-Reference: Part VII Applications

This section introduces the molecular principles of mRNA design. For clinical and therapeutic applications, see Chapter 30 (target validation and drug development pipelines) and Section 31.4 (detailed mRNA optimization strategies for vaccines and therapeutics).

Therapeutic mRNA design requires navigating a vast sequence space where multiple objectives compete. Expression, stability, immunogenicity, and manufacturability all depend on sequence choices that interact in complex ways. The COVID-19 vaccines demonstrated that rational mRNA design can achieve clinical efficacy, while also revealing how much of current practice remains empirical rather than model-driven.

19.8.1 Design Objectives and Trade-offs

Mathematical Note

The following section discusses the vast combinatorial space of possible mRNA sequences. The key intuition is that for any protein, there are astronomically many different mRNA sequences that could encode it (because multiple codons encode each amino acid). This space is far too large for exhaustive search, motivating model-guided optimization.

mRNA sequence design selects nucleotide sequences that encode a desired protein while optimizing expression, stability, safety, and manufacturability. For a 300-amino-acid protein, there are approximately \(3^{300}\) possible synonymous mRNA sequences (roughly the number of synonymous codons raised to the protein length). This astronomical space cannot be exhaustively searched, motivating both classical heuristics and modern machine learning approaches.

Key objectives include high protein expression in target tissues, mRNA stability during manufacturing and in vivo, controlled translation kinetics that influence co-translational folding, and low immunogenicity for therapeutic applications. These objectives often conflict: increasing GC content may improve stability but introduce unwanted secondary structure, while avoiding rare codons may reduce expression if tRNA pools are limiting. The conflict between stability and expression illustrates a broader principle: GC-rich sequences form more stable base pairs (G-C has three hydrogen bonds versus two for A-U), which extends mRNA half-life but can also create secondary structures that impede ribosome scanning and translation initiation. The design challenge is finding the Pareto frontier where no objective can be improved without sacrificing another.

Worked Example: Consider designing an mRNA to express a therapeutic enzyme in liver cells. You might:

  1. Start with the wild-type human codon usage (baseline)
  2. Replace rare codons with more frequent synonymous alternatives (increases expression)
  3. Check for unwanted secondary structure in 5’ UTR region (may impede translation)
  4. Verify absence of cryptic splice sites or premature polyadenylation signals
  5. Assess GC content (>60% may cause aggregation; <40% may reduce stability)
  6. Use models to predict expression and iterate

Each choice affects multiple objectives; there is no single “optimal” sequence, only trade-offs.
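Steps 2 and 5 of this loop can be sketched in a few lines. The codon frequencies below are placeholders for illustration, not a real human codon usage table:

```python
# Toy codon optimizer: replace each codon with its most frequent synonym,
# then report GC content. Frequencies are illustrative placeholders.
USAGE = {  # amino acid -> {codon: relative frequency} (placeholder values)
    "M": {"AUG": 1.00},
    "A": {"GCU": 0.26, "GCC": 0.40, "GCA": 0.23, "GCG": 0.11},
    "E": {"GAA": 0.42, "GAG": 0.58},
}
CODON_TO_AA = {c: aa for aa, codons in USAGE.items() for c in codons}

def optimize(cds: str) -> str:
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return "".join(max(USAGE[CODON_TO_AA[c]], key=USAGE[CODON_TO_AA[c]].get)
                   for c in codons)   # most frequent synonym per position

def gc_content(seq: str) -> float:
    return (seq.count("G") + seq.count("C")) / len(seq)

wt = "AUGGCAGAA"                      # Met-Ala-Glu with suboptimal codons
opt = optimize(wt)                    # -> "AUGGCCGAG", same protein
print(opt, f"GC: {gc_content(wt):.2f} -> {gc_content(opt):.2f}")
```

Note how the "optimized" sequence also raises GC content, exactly the kind of coupled side effect (stability versus structure) that makes single-objective optimization misleading.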

19.8.2 Lessons from COVID-19 Vaccines

The COVID-19 mRNA vaccines provided a high-profile demonstration of mRNA design principles at unprecedented scale. The Pfizer-BioNTech and Moderna vaccines incorporated several design elements: N1-methylpseudouridine modification throughout the sequence to reduce innate immune activation, codon optimization to enhance expression in human cells, optimized 5’ and 3’ UTRs from highly expressed genes, and sequence modifications to stabilize the prefusion spike conformation. These choices drew on decades of basic research but were refined through empirical optimization rather than systematic model-based design.

The vaccines’ success demonstrated that rationally designed mRNAs can achieve therapeutic efficacy at scale. It also revealed limitations in current understanding: the optimal combination of modifications, codons, and UTRs for a given protein target remains partly empirical, and transferring designs across proteins or therapeutic applications requires substantial optimization.

19.8.3 Model-Based Design Strategies

RNA and codon foundation models enable several approaches to systematic design. Scoring and screening use pretrained models to evaluate large candidate sets for predicted expression or stability, selecting top designs for experimental validation. When models are differentiable with respect to input embeddings, gradient-based methods can guide sequence optimization toward desired objectives. Generative approaches sample diverse high-scoring sequences subject to constraints like fixed amino acid sequence or avoided motifs.

Empirical results suggest that deep models trained on high-throughput reporter assays or ribosome profiling can outperform classical codon adaptation indices, particularly for context-specific expression prediction. The Codon Adaptation Index (CAI) quantifies how closely a gene’s codon usage matches the pattern of highly expressed genes in the same organism; the assumption is that highly expressed genes have evolved to use codons with abundant cognate tRNAs. The tRNA Adaptation Index (tAI) takes a more mechanistic approach, weighting codons by the copy number and wobble-pairing efficiency of their cognate tRNAs. Both indices provide single-number scores summarizing codon optimality, but they rely on genome-wide averages that may not reflect tissue-specific tRNA pools or the local sequence context of individual codons. Deep models can learn local effects of codon pairs, mRNA structure, and regulatory elements that these aggregate indices miss. However, these models require substantial training data and may not generalize across organisms or synthetic constructs far from natural sequences.
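Concretely, CAI is the geometric mean of each codon's relative adaptiveness \(w_c = f_c / \max_{c'} f_{c'}\), where the maximum runs over synonymous codons. A minimal sketch, again using placeholder frequencies rather than a real usage table:

```python
import math

# Relative adaptiveness w_c = freq(codon) / freq(best synonymous codon).
# The usage table is an illustrative placeholder, not real data.
USAGE = {
    "M": {"AUG": 1.00},
    "A": {"GCU": 0.26, "GCC": 0.40, "GCA": 0.23, "GCG": 0.11},
    "E": {"GAA": 0.42, "GAG": 0.58},
}
W = {c: f / max(codons.values())
     for codons in USAGE.values() for c, f in codons.items()}

def cai(cds: str) -> float:
    """Geometric mean of relative adaptiveness over the coding sequence."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return math.exp(sum(math.log(W[c]) for c in codons) / len(codons))

print(f"{cai('AUGGCAGAA'):.2f}")  # suboptimal codons -> CAI < 1
print(f"{cai('AUGGCCGAG'):.2f}")  # preferred codons  -> CAI = 1
```

The single aggregate score is exactly what deep models improve on: CAI assigns the same value to a codon regardless of its neighbors, local mRNA structure, or tissue context.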

Figure 19.5: Therapeutic mRNA design pipeline. Starting from a target protein, the design process optimizes codon usage (selecting from roughly \(3^{300}\) possible synonymous sequences), engineers 5’ UTR elements for translation initiation, designs the 3’ UTR for stability and localization, and selects chemical modifications (such as N1-methylpseudouridine) to reduce immunogenicity. Inset shows key design choices made for COVID-19 mRNA vaccines.

19.9 Noncoding RNA Classification and Function

RNA that does not encode protein encompasses an extraordinary range of structures, functions, and regulatory mechanisms. Classifying these transcripts and predicting their functions presents challenges that differ from coding sequence analysis: the relevant features vary across RNA classes, functional annotations remain incomplete, and the boundary between functional ncRNA and transcriptional noise is often unclear.

19.9.1 Diversity of Noncoding RNA

RNA that does not encode protein spans an enormous functional and structural range. Housekeeping RNAs (tRNAs, rRNAs, snRNAs, snoRNAs) perform essential cellular functions with well-characterized structures. Regulatory RNAs (miRNAs, siRNAs, piRNAs, lncRNAs) control gene expression through diverse mechanisms. Structural and catalytic RNAs (ribozymes, riboswitches) adopt complex folds that enable enzymatic activity or ligand sensing. Circular RNAs (circRNAs) and other noncanonical species continue to expand the catalog of RNA diversity.

Each class has characteristic lengths, structural motifs, genomic contexts, and functional mechanisms. tRNAs are approximately 76 nucleotides with a conserved cloverleaf structure. miRNAs are approximately 22 nucleotides processed from longer hairpin precursors. lncRNAs span thousands of nucleotides with poorly conserved sequence and often no stable secondary structure. Unifying these classes under a single modeling framework is challenging, and models that excel on one class may fail on others.

Table 19.4: Diversity of noncoding RNA classes. Each presents distinct modeling challenges.
| ncRNA Class | Typical Length | Structure | Conservation | Function | Modeling Challenge |
|---|---|---|---|---|---|
| tRNA | ~76 nt | Cloverleaf (conserved) | High | Amino acid delivery | Well-characterized; good models exist |
| miRNA | ~22 nt | Processed from hairpin | Moderate | Post-transcriptional silencing | Target prediction remains noisy |
| lncRNA | 200 to >10,000 nt | Variable or none | Low | Diverse | Functional annotation sparse |
| circRNA | Variable | Circular backbone | Variable | miRNA sponge, other | Detection and quantification |
| rRNA | ~1,500-5,000 nt | Complex, conserved | Very high | Ribosome structure | Well-characterized |

Figure 19.6: Diversity of noncoding RNA classes. (A) ncRNA classes vary dramatically in length (from ~22 nt miRNAs to >10 kb lncRNAs) and structural complexity (from conserved tRNA cloverleafs to largely unstructured lncRNAs). (B) Functional mechanisms range from direct catalysis (ribozymes) to regulatory roles through target binding (miRNAs, lncRNAs) to structural scaffolding (rRNAs, snRNAs).

19.9.2 From Handcrafted Features to Learned Representations

Classical ncRNA classification relied on engineered features: k-mer frequencies, GC content, minimum free energy of predicted secondary structure, structural motif counts, and genomic context features like proximity to coding genes or chromatin marks. These features fed conventional classifiers (SVMs, random forests, shallow neural networks) that achieved reasonable performance for well-studied classes with strong sequence and structure signatures.

The limits of handcrafted features emerge most clearly for lncRNAs. These transcripts are defined partly by what they lack (no long open reading frame) rather than what they possess. Many lncRNAs show poor conservation, lack stable secondary structures, and have diverse, poorly characterized functions. Distinguishing functional lncRNAs from transcriptional noise remains difficult, and classical feature sets often collapse to generic statistics like length and GC content.

Foundation model embeddings offer a more flexible approach. Per-nucleotide representations can be pooled into fixed-dimensional vectors that support classification with simple downstream heads. For ncRNAs without strong sequence motifs, the pretrained embeddings may capture subtle distributional patterns learned during self-supervised training. Few-shot learning becomes possible: given a handful of newly characterized RNAs, their embeddings can seed new clusters in representation space, guiding annotation of related sequences.
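A sketch of this pooling-plus-nearest-neighbor recipe, with a stand-in featurizer (3-mer frequencies) where pooled embeddings from a pretrained model such as RNA-FM would go; the sequences and labels are illustrative:

```python
import numpy as np
from itertools import product

KMERS = ["".join(p) for p in product("ACGU", repeat=3)]

def embed(seq: str) -> np.ndarray:
    """Stand-in featurizer: overlapping 3-mer frequencies. In practice,
    per-nucleotide embeddings from a pretrained model would be
    mean-pooled into a fixed-dimensional vector instead."""
    counts = np.array([sum(seq[i:i + 3] == k for i in range(len(seq) - 2))
                       for k in KMERS], dtype=float)
    return counts / max(counts.sum(), 1.0)

def nearest_label(query: str, labeled: list[tuple[str, str]]) -> str:
    """Few-shot annotation: adopt the label of the nearest example."""
    q = embed(query)
    return min(labeled, key=lambda sl: np.linalg.norm(q - embed(sl[0])))[1]

labeled = [("GGCGCGCCGCGGCGC", "GC-rich structured"),
           ("AUAUUAAUAUUAUAU", "AU-rich unstructured")]
print(nearest_label("GCGGCGCCGC", labeled))  # -> "GC-rich structured"
```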

19.10 miRNA Target Prediction

MicroRNAs regulate gene expression by guiding the RNA-induced silencing complex (RISC) to complementary sites in target mRNAs, typically in the 3’ UTR. A single miRNA can regulate hundreds of transcripts, and a single transcript can harbor binding sites for dozens of miRNAs. This regulatory network influences virtually every cellular process, and dysregulation of miRNA-target interactions contributes to cancer, cardiovascular disease, and neurodegeneration. Predicting which transcripts a given miRNA targets (and vice versa) has been a persistent computational challenge since the discovery of miRNA-mediated regulation.

Knowledge Check

miRNA target prediction relies heavily on “seed complementarity” (perfect base pairing between nucleotides 2-7 of the miRNA and the target site). Why might focusing only on seed matches miss many functional targets? What other factors could influence whether a site is actually regulated?

Seed-only approaches miss functional targets because actual regulation depends on additional context: local RNA secondary structure may occlude seed matches, competing RNA-binding proteins may block access, miRNA and target abundance vary across cell types, and non-canonical binding modes exist without perfect seed pairing. The cellular context determines whether a seed match produces functional regulation.

The dominant paradigm centers on seed complementarity. Nucleotides 2 through 7 of the miRNA (the seed region) typically form perfect Watson-Crick pairs with target sites, while the remaining nucleotides contribute variably to binding affinity and regulatory effect. Classical algorithms like TargetScan identify conserved seed matches in 3’ UTRs and rank targets by evolutionary conservation, site type (8mer, 7mer-m8, 7mer-A1), and local sequence context (Agarwal and Shendure 2020). Additional features including AU content flanking the site, position within the UTR, and proximity to other miRNA sites improve prediction accuracy.
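The site-type definitions are simple enough to state in code. The sketch below scans a 3’ UTR for canonical seed matches and labels each as 8mer, 7mer-m8, or 7mer-A1; it illustrates only the matching logic, not TargetScan's conservation and context scoring, and the miRNA and UTR are toy inputs.

```python
def revcomp(rna: str) -> str:
    """Reverse complement of an RNA sequence (5'->3' in, 5'->3' out)."""
    return rna.translate(str.maketrans("ACGU", "UGCA"))[::-1]


def seed_sites(mirna: str, utr: str):
    """Yield (position, site_type) for canonical seed matches in a 3' UTR.
    Sequences are RNA, 5'->3'; miRNA positions are 1-based."""
    m7 = revcomp(mirna[1:7])  # target match to the seed (miRNA nt 2-7)
    p8 = revcomp(mirna[7])    # target nucleotide pairing miRNA position 8
    for i in range(len(utr) - 5):
        if utr[i:i + 6] != m7:
            continue
        has_m8 = i > 0 and utr[i - 1] == p8   # pairing extends to nt 8
        has_a1 = utr[i + 6:i + 7] == "A"      # A opposite miRNA position 1
        if has_m8 and has_a1:
            yield i - 1, "8mer"
        elif has_m8:
            yield i - 1, "7mer-m8"
        elif has_a1:
            yield i, "7mer-A1"


# Toy example with a let-7-like miRNA.
mirna = "UGAGGUAGUAGGUUGUAUAGUU"
utr = "AAACUACCUCAAAAUACCUCAUU"
print(list(seed_sites(mirna, utr)))  # [(3, '8mer'), (14, '7mer-A1')]
```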

Despite decades of refinement, target prediction remains noisy. Experimental validation rates for top predictions rarely exceed 50%, and many functional targets lack canonical seed matches. The disconnect arises partly from context dependence: a site may be accessible in one cell type but occluded by RNA structure or competing protein binding in another. It arises partly from the limitations of reporter assays that measure binding in artificial contexts rather than endogenous regulatory effects. And it arises from the biology itself, where weak individual sites combine additively and miRNA-target interactions are probabilistic rather than deterministic.

Deep learning approaches attempt to improve on seed-based methods by learning complex sequence features from high-throughput binding data. Models trained on CLIP-seq experiments (which crosslink miRNA-target complexes and identify bound sites transcriptome-wide) can capture non-canonical binding modes and context effects invisible to seed-matching algorithms. In practice, however, these models often overfit to cell-type-specific binding patterns and generalize poorly across contexts (Chapter 13). The fundamental challenge is that miRNA targeting depends on factors beyond sequence: miRNA and target abundance, competition among targets for limiting RISC, and cellular state variables that no sequence-based model can capture.

For clinical applications, target prediction informs both the mechanism of disease-associated miRNAs and the design of therapeutic interventions. AntimiR oligonucleotides that sequester specific miRNAs have entered clinical trials for hepatitis C (targeting miR-122) and other indications. Predicting off-target effects of such therapeutics requires understanding the full network of targets that will be derepressed when a miRNA is inhibited. Similarly, miRNA mimics designed to replace lost tumor-suppressor miRNAs must be evaluated for potential regulation of unintended targets. In both cases, computational target prediction provides a starting point that experimental validation must refine.

19.11 Splicing and Transcript Processing Models

Splicing models predict how pre-mRNA is processed into mature transcripts, a problem intimately connected to RNA biology even when the models operate on genomic DNA sequence. SpliceAI established the paradigm, but extensions address tissue specificity, branchpoint prediction, and quantitative splicing outcomes that the original model does not capture.

19.11.1 Beyond SpliceAI

SpliceAI demonstrated that deep convolutional networks could predict splice sites with near-spliceosomal precision (Section 6.5). The model’s success in identifying cryptic splice variants has made it a standard tool in clinical variant interpretation (Chapter 18). Yet splicing involves more than splice site recognition, and several extensions address aspects that SpliceAI does not fully capture.

Tissue-specific splicing patterns vary substantially across cell types and developmental stages. A splice site may be used in brain but skipped in liver due to differential expression of splicing factors. Models like Pangolin extend splice prediction by training on tissue-specific RNA-seq data, learning to predict not just whether a site is splice-competent but whether it is used in specific cellular contexts. These models enable variant interpretation that accounts for tissue-relevant splicing patterns rather than generic predictions. The integration of tissue-specific splice predictions into clinical variant interpretation workflows is addressed in Chapter 29.

Branchpoint prediction identifies the adenosine residue where the lariat intermediate forms during splicing. While SpliceAI focuses on donor and acceptor sites, branchpoint recognition involves distinct sequence features (typically a degenerate YURAY motif 18-40 nucleotides upstream of the acceptor) that specialized models can capture. Combined analysis of donor, acceptor, and branchpoint predictions provides more complete characterization of splice-altering variants.
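As an illustration of the branchpoint search space, the sketch below scans the window upstream of an acceptor for adenosines in a YURAY context. The regex, window handling, and toy intron are illustrative assumptions; specialized branchpoint models score far richer context than this consensus.

```python
import re

# Y = C/U, R = A/G; group 1 captures the branch adenosine.
YURAY = re.compile(r"[CU]U[AG](A)[CU]")


def branchpoint_candidates(intron: str, near: int = 18, far: int = 40):
    """Return distances from the acceptor (3' end of the intron) of
    adenosines matching YURAY within the branchpoint search window."""
    start = max(0, len(intron) - far)
    end = len(intron) - near
    return [len(intron) - m.start(1)
            for m in YURAY.finditer(intron, start, end)]


# Toy intron: GU donor ... branchpoint context ... polypyrimidine tract, AG.
intron = "GUAAGU" + "A" * 40 + "CUAACAUCUUUUUUUUUUCUCUUUCAG"
print(branchpoint_candidates(intron))  # [24]
```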

Alternative splicing prediction moves beyond binary splice site identification to model exon inclusion rates and isoform usage. Models in this space attempt to predict not just whether an exon can be included but quantitative measures of inclusion across conditions, enabling analysis of splicing quantitative trait loci (sQTLs) and their effects on transcript diversity.
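The quantitative target here is usually percent spliced-in (PSI). The minimal sketch below computes PSI for a cassette exon from junction read counts; the counts for the two tissues are hypothetical.

```python
def psi(inclusion_up: int, inclusion_down: int, skipping: int) -> float:
    """Percent spliced-in for a cassette exon. Inclusion junction reads
    (upstream->exon, exon->downstream) are averaged because each included
    transcript contributes two junctions; skipping contributes one."""
    inclusion = (inclusion_up + inclusion_down) / 2
    total = inclusion + skipping
    return inclusion / total if total > 0 else float("nan")


# Hypothetical junction counts: included in brain, mostly skipped in liver.
print(f"brain PSI: {psi(90, 110, 10):.2f}")  # 0.91
print(f"liver PSI: {psi(15, 25, 80):.2f}")   # 0.20
```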

19.12 Limitations and Open Challenges

RNA modeling faces constraints that do not apply to protein or DNA foundation models. Data scarcity limits what can be learned from self-supervised training, functional annotations remain incomplete for most ncRNA classes, and the field has not yet achieved the breakthrough moment that AlphaFold represented for proteins. These limitations define the current frontier and point toward the advances needed for RNA foundation models to mature.

19.12.1 Sparse Structural Data

The fundamental limitation of RNA modeling is data scarcity. Protein structure prediction benefits from over 200,000 experimentally determined structures; RNA has fewer than 2,000, heavily biased toward ribosomal RNA and tRNA. This scarcity limits supervised learning for tertiary structure prediction and constrains the emergence of structural knowledge from self-supervised pretraining. Until high-throughput methods generate RNA structures at scale comparable to protein crystallography and cryo-EM, RNA tertiary structure prediction will remain a frontier problem rather than a solved one.

Secondary structure data is more abundant but still limited. Experimentally validated structures cover mainly well-characterized families, while computational predictions for novel sequences rely on thermodynamic models whose accuracy degrades for long RNAs and complex folds. Structure probing experiments provide genome-wide coverage but measure accessibility rather than pairing directly, requiring inference to convert reactivity profiles into structural models.
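The inference step can be sketched in its simplest form: thresholding reactivities into soft pairing constraints. Real pipelines instead incorporate reactivities as pseudo-free-energy terms in thermodynamic folding, so the thresholds and values below are purely illustrative.

```python
def reactivity_to_constraints(reactivities, high=0.7, low=0.3):
    """Map per-nucleotide probing reactivity to soft structural calls:
    high reactivity suggests an unpaired base; low reactivity is only
    weakly informative (paired or otherwise protected)."""
    calls = []
    for r in reactivities:
        if r is None:                 # no coverage at this position
            calls.append("uncertain")
        elif r >= high:
            calls.append("unpaired")
        elif r <= low:
            calls.append("likely_paired")
        else:
            calls.append("uncertain")
    return calls


print(reactivity_to_constraints([0.05, 0.1, 0.9, 1.2, None, 0.5, 0.2]))
```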

19.12.2 Functional Annotation Gaps

For many ncRNA classes, function remains poorly characterized. LncRNA annotations often specify only genomic location and expression pattern without mechanistic understanding. Circular RNA functions are emerging but incompletely cataloged. Even for better-characterized classes like miRNAs, target prediction remains noisy and context-dependent.

This annotation gap limits supervised learning for function prediction and complicates evaluation (Chapter 12). When ground truth is uncertain, it becomes difficult to assess whether a model’s predictions reflect genuine biological insight or artifacts of incomplete training data. The field needs both experimental advances to characterize ncRNA function and computational approaches that can learn from weak or partial supervision.

19.12.3 Maturity Gap

Key Insight

The maturity gap between RNA and protein foundation models represents both a limitation and an opportunity. The protein modeling roadmap (large-scale self-supervised learning, attention mechanisms, scaling laws) exists and has been proven. Applying that roadmap to RNA requires addressing data scarcity through structure probing and synthetic data, developing architectures that handle conformational flexibility, and building comprehensive benchmarks covering RNA’s diversity.

RNA foundation models exist but have not achieved the transformative impact of protein language models. ESM-2 enabled ESMFold, providing structure prediction from single sequences that nearly matches AlphaFold. No comparable RNA breakthrough has occurred. The reasons include data scarcity, the conformational complexity of RNA, and the diversity of RNA classes that makes unified modeling difficult.

This maturity gap represents both a limitation and an opportunity. The techniques that succeeded for proteins (large-scale self-supervised learning, attention mechanisms, scaling laws) provide a roadmap (Chapter 15). Applying that roadmap to RNA requires addressing the data challenge through structure probing, synthetic data generation, or more efficient use of limited experimental structures. It requires architectural innovations that handle RNA’s long-range base pairing and conformational flexibility. It requires benchmarks and evaluation frameworks that cover the full diversity of RNA types and tasks, following the rigorous evaluation principles established in Chapter 12 and the benchmark construction guidelines in Chapter 11.

19.13 Bridge Between Sequence and Cell

RNA occupies a distinctive position in genomic AI, bridging the sequence-level models of Part III with the cellular perspectives that follow. Splicing models like SpliceAI operate on pre-mRNA and predict transcript processing outcomes (Section 6.5). Codon-level models capture translation dynamics invisible to protein language models. mRNA therapeutic design demonstrates practical value through codon optimization, UTR engineering, and stability prediction. These applications proceed despite the absence of the structure prediction breakthrough that transformed protein modeling; secondary structure prediction has advanced through deep learning, but tertiary structure accuracy lags protein structure by a wide margin.

The relationship between RNA models and other modalities reflects RNA’s position in the central dogma. RNA is the product of transcription that regulatory models predict (Chapter 17), the substrate for translation that protein models assume (Chapter 16), and the primary measurement that single-cell models use to represent cellular state (Chapter 20). Foundation models that learn from RNA sequence capture patterns distinct from those in DNA or protein: codon usage biases, secondary structure constraints, and post-transcriptional regulatory elements that neither genomic nor protein models directly represent.

Beyond sequence, biological understanding requires cellular and tissue context. Single-cell models treat RNA expression as the primary readout of cellular state, learning representations that capture cell type identity and perturbation response (Chapter 20). Three-dimensional genome models add spatial context that influences transcription. Network models integrate gene relationships that transcend individual sequences. RNA models provide sequence-level representations that feed into these higher-level frameworks, completing the molecular arc from DNA through RNA to protein while opening the path to systems-level integration.

Test Yourself

Before reviewing the summary, test your recall:

  1. Why is RNA secondary structure prediction fundamentally harder than protein structure prediction, despite RNA having only 4 bases compared to 20 amino acids?
  2. What information do codon-level foundation models capture that both DNA language models and protein language models miss?
  3. Explain why the 5’ UTR and 3’ UTR both matter for therapeutic mRNA design, but through different mechanisms.
  4. What is the primary data limitation preventing RNA foundation models from achieving breakthroughs comparable to protein language models like ESM-2?
  1. RNA structure is harder because: RNA has flat energy landscapes where multiple conformations have similar free energies (creating many-to-many sequence-structure relationships), long-range base pairing that creates global dependencies, and pseudoknots that violate the nested structure assumptions enabling efficient algorithms. Proteins typically fold to a single stable structure with a deep energy minimum.

  2. Codon models capture: Synonymous codon usage patterns that affect translation speed, mRNA stability, and co-translational folding. DNA language models see nucleotides but do not know codon boundaries; protein language models see only amino acids and miss synonymous variation entirely. Codon models operate at the biologically meaningful unit of translation.

  3. UTR mechanisms differ: The 5’ UTR controls translation initiation through secondary structure accessibility, upstream ORFs, Kozak consensus strength, and ribosome recruitment. The 3’ UTR controls mRNA stability and localization through RBP binding sites, miRNA target sites, and AU-rich elements. Both affect expression but through distinct molecular mechanisms: initiation versus decay/localization.

  4. Primary data limitation: RNA has fewer than 2,000 experimentally determined 3D structures (versus over 200,000 for proteins), and sequence databases are approximately 3× smaller with less functional diversity. Without massive structural training data, RNA foundation models cannot learn the sequence-to-structure mappings that enabled ESM-2’s breakthrough.

Chapter Summary

What we covered:

  • RNA occupies a unique position between DNA and protein, with distinct modeling challenges stemming from flat energy landscapes, long-range base pairing, and sparse structural data
  • Classical RNA structure prediction uses thermodynamic (MFE) or comparative (covariation) approaches; deep learning methods can capture patterns beyond nearest-neighbor rules and handle pseudoknots
  • RNA foundation models lag behind protein counterparts primarily due to data scarcity: fewer sequences, less structural diversity, and heavily biased training sets
  • Codon-level models fill a gap between DNA and protein approaches by capturing synonymous codon effects on translation and stability
  • UTR sequences dominate expression control; model-guided UTR design is an active area with therapeutic applications
  • The COVID-19 vaccines demonstrated mRNA design principles at scale, though much optimization remains empirical
  • miRNA target prediction remains noisy despite decades of work, limited by context dependence and the probabilistic nature of targeting
  • Splicing models extend beyond SpliceAI to tissue-specific predictions and quantitative alternative splicing

Key connections:

  • Backward: RNA foundation models apply the same self-supervised paradigm as protein LMs (Chapter 16) and DNA LMs (Chapter 15), but with different data constraints
  • Forward: Single-cell models (Chapter 20) use RNA expression as primary cellular readout; therapeutic mRNA design (Section 31.4) applies the principles covered here

Open questions:

  • Can structure probing data compensate for limited crystallographic training sets?
  • What architectural innovations will enable RNA models to achieve protein-like breakthroughs?
  • How should models handle RNA’s inherent conformational heterogeneity?