17 Regulatory Models
Gene regulation happens at distances that most models cannot see.
Estimated reading time: 35-45 minutes
Prerequisites: Understanding of convolutional neural networks (Chapter 6), attention mechanisms (Chapter 7), and functional genomics assays (Section 2.4.1, Section 2.4). Familiarity with variant effect prediction concepts (Chapter 4) is helpful but not required.
Learning Objectives: After completing this chapter, you should be able to:
- Explain why short-context models fail to capture distal regulatory relationships and quantify the scale of the problem
- Describe how hybrid CNN-transformer architectures balance computational efficiency with long-range modeling
- Compare Enformer, Borzoi, Sei, and AlphaGenome in terms of their outputs, strengths, and appropriate use cases
- Apply regulatory model predictions to variant effect interpretation and understand their limitations
- Recognize the boundary conditions where regulatory models succeed and where they fail
Key Insight: Hybrid architectures solve a fundamental trade-off: convolutions efficiently compress sequence to manageable length, then attention enables direct information flow across regulatory distances that would otherwise be computationally prohibitive.
An enhancer 80 kilobases from a promoter can determine whether a gene is expressed in liver or brain. An insulator 50 kilobases downstream can block inappropriate activation from a neighboring regulatory domain. A disease-associated variant in an intergenic region may exert its effect by disrupting a distal element that contacts its target gene through chromatin looping. Mammalian gene regulation operates across distances that dwarf the context windows of most sequence models. The convolutional architectures examined in Chapter 6 excel at detecting local motifs but cannot span these distances. A model that processes sequences in kilobase windows treats regulatory elements tens of kilobases away as if they do not exist. For understanding human gene regulation, they effectively do not.
The attention mechanisms introduced in Chapter 7 could theoretically model arbitrary-range dependencies, but naive transformer application to hundred-kilobase windows is computationally prohibitive. Attention scales quadratically with sequence length: doubling context length quadruples memory and computation. Processing 200 kilobases at single-nucleotide resolution would require attending over 200,000 positions simultaneously, far beyond practical limits. The field needed architectures that could span regulatory distances without the quadratic penalty.
Hybrid architectures resolve this tension by combining the strengths of both paradigms. A convolutional front-end efficiently extracts local sequence features and compresses the input to a manageable length, reducing 200 kilobases of sequence to a few thousand feature vectors. A transformer backbone then propagates information across this compressed representation through attention. The result is a new class of regulatory models that can capture enhancer-promoter interactions, predict the effects of distal variants on gene expression, and provide mechanistic hypotheses about long-range regulation. Enformer, the first widely adopted model in this class, processes 200-kilobase windows and predicts chromatin state, transcription initiation, and gene expression from sequence alone.
17.1 Long-Range Regulation Problem
Consider a canonical mammalian gene with complex tissue-specific expression. The promoter sits at the transcription start site, but the sequences that determine when and where the gene is expressed may be scattered across a 200 kilobase neighborhood. Think of it like a marionette puppet: the puppet itself (the gene) sits in one location, but the strings controlling its movement (enhancers, silencers) may attach from positions far above, each string pulled by a different puppeteer (transcription factor) to produce coordinated motion. Multiple enhancers drive expression in different tissues; silencers suppress expression in inappropriate contexts; insulators demarcate regulatory domains. Chromatin looping brings these distal elements into physical proximity with the promoter, but the loops themselves are dynamic and cell-type-specific.
Before reading further, consider: if you wanted to predict whether a gene is expressed in liver versus brain, what genomic features beyond the promoter sequence might you need to examine? How far from the gene might relevant information lie?
Short-context models face an information-theoretic barrier in this setting. A model with a 2 kilobase receptive field cannot distinguish a variant in an enhancer 50 kilobases upstream from a variant in neutral sequence at the same distance. Both fall outside the model’s effective context. Stacking more convolutional layers or using dilated convolutions can expand the receptive field, but the computational path between distant positions grows long, and gradients attenuate over many layers. Models like Basenji2 pushed convolutional receptive fields to tens of kilobases through aggressive pooling, but purely convolutional architectures struggle to propagate information across hundreds of kilobases without impractical depth, a limitation examined in Section 6.6.
The scale of the problem becomes concrete when examining enhancer-promoter distances in the human genome. Median enhancer-promoter distances in many tissues span 20 to 50 kilobases, with substantial fractions exceeding 100 kilobases (Gasperini et al. 2019). Topologically associating domains (TADs), which define the neighborhoods within which regulatory elements typically interact, range from hundreds of kilobases to several megabases. A model that cannot span these distances cannot fully capture the regulatory grammar of the genome.
The biological reality of long-range regulation was established by systematic studies mapping GWAS variants to functional elements. Maurano et al. (2012) demonstrated that disease-associated variants are overwhelmingly enriched in regulatory regions rather than coding sequences, with enrichment patterns specific to cell types relevant to each disease. This foundational observation motivates the entire enterprise of regulatory modeling: if most disease variants act through regulatory mechanisms, understanding those mechanisms requires models that can capture the regulatory grammar of the genome.
The scale mismatch between model context and regulatory biology is not a minor inconvenience; it is a fundamental barrier. A 2 kb context window can see only 1% of a 200 kb regulatory neighborhood. Models that work well for detecting individual motifs systematically fail at tasks requiring integration of distal information.
Attention mechanisms offer a direct solution: by computing pairwise interactions between all positions, attention can model dependencies across arbitrary distances in a single layer. The cost is quadratic scaling with sequence length. A naive transformer operating on 200,000 base pairs at single-nucleotide resolution would require attention matrices with 40 billion entries, far exceeding practical memory limits. Hybrid architectures sidestep this constraint by using convolutions to compress the sequence before attention, reducing the effective sequence length to a few thousand tokens while preserving the information needed for long-range modeling.
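The arithmetic is easy to verify. The short sketch below compares the size of a full pairwise attention matrix at single-nucleotide resolution against the same window after 128 bp pooling; the token count after pooling is illustrative and exact numbers depend on the architecture.

```python
# Back-of-the-envelope check of the attention-matrix sizes quoted above.
# The pooled token count is illustrative (Enformer-style ~128 bp bins).

def attention_entries(positions: int) -> int:
    """Entries in a full pairwise attention matrix."""
    return positions * positions

bases = 200_000       # single-nucleotide resolution over a 200 kb window
tokens = 1_500        # after ~128 bp convolutional pooling

naive, compressed = attention_entries(bases), attention_entries(tokens)
print(f"naive:      {naive:.1e} entries")          # ~4.0e10 ("40 billion")
print(f"compressed: {compressed:.1e} entries")     # ~2.3e6
print(f"reduction:  {naive / compressed:,.0f}x")   # ~17,800x
```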
Before viewing the table below, make a prediction: Which architectural approach do you expect will achieve the best balance between long-range modeling capability and computational efficiency? Consider the trade-offs between context window size, computational scaling, and the ability to capture distant regulatory interactions. What constraints might limit pure CNNs versus pure transformers, and how might hybrid approaches address these limitations?
| Architecture Type | Context Window | Effective Resolution | Computational Scaling | Long-Range Modeling |
|---|---|---|---|---|
| Pure CNN (DeepSEA, Basset) | 1-2 kb | Single nucleotide | Linear | None (outside receptive field) |
| Dilated CNN (Basenji2) | 40-130 kb | ~128 bp bins | Linear | Indirect (many layers) |
| Pure Transformer | Limited by memory | Single nucleotide | Quadratic | Direct (single layer) |
| Hybrid CNN-Transformer (Enformer) | 200 kb | ~128 bp bins | Quadratic on compressed | Direct (attention) |
| Efficient Attention (AlphaGenome) | ~1 Mb | Variable | Sub-quadratic | Direct (optimized) |
17.2 Enformer: Attention Meets Regulatory Genomics
Enformer (Ž. Avsec et al. 2021) demonstrated that combining convolutional compression with transformer attention could dramatically improve expression prediction from sequence. The model processes 200 kilobase windows of DNA and predicts thousands of chromatin and transcription tracks across cell types and species, establishing a template that subsequent models have extended and refined.
17.2.1 Architecture
The Enformer architecture consists of three stages that progressively transform raw sequence into multi-task predictions.
The following section describes Enformer’s architecture in detail. Understanding the precise mechanics helps when interpreting predictions and troubleshooting unexpected results, but the core concept is straightforward: compress with convolutions, then model long-range interactions with attention.
The convolutional stem takes one-hot encoded DNA (four channels for A, C, G, T) and applies a series of convolutional blocks with residual connections. Each block includes convolutions that detect local patterns, batch normalization and nonlinearities, and pooling operations that reduce sequence length while increasing channel depth. By the end of the stem, a 200 kilobase input has been compressed to roughly 1,500 tokens, each representing approximately 128 base pairs of underlying sequence. This compression strategy resembles how a city planner might study traffic patterns. Rather than tracking every individual car (each nucleotide), they aggregate traffic into neighborhood-level summaries (128 bp bins), preserving the essential flow information while making the analysis tractable. This compression is essential: it reduces the attention computation from quadratic in 200,000 to quadratic in 1,500, a reduction of roughly 17,000-fold in memory requirements.
Why is this particular compression strategy effective? The convolutional stem preserves local motif information while discarding positional precision that regulatory biology does not require. A transcription factor binding site is roughly 6-12 base pairs; whether it occurs at position 50,127 or 50,135 rarely matters for function. The 128 bp bins are large enough to contain complete motifs while small enough to preserve the spatial relationships between regulatory elements. Whether an enhancer is 50 kb or 51 kb from a promoter makes little functional difference, but whether it is 10 kb or 100 kb away matters greatly. The hierarchical compression also allows the model to learn features at multiple scales: early convolutional layers detect individual motifs, while later layers combine motifs into composite regulatory patterns before attention integrates across long distances.
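A minimal sketch of this compression stage, written in PyTorch with illustrative layer sizes rather than the published hyperparameters: each block applies a convolution, normalization, a nonlinearity, and 2x pooling, so seven blocks turn a ~200 kb one-hot input into 1,536 tokens of 128 bp each.

```python
import torch
import torch.nn as nn

# Illustrative Enformer-style convolutional stem (not the published architecture):
# each block detects local patterns, then halves the sequence length.
# Seven 2x pooling steps give 128 bp per output position.

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=15, padding=7),
        nn.BatchNorm1d(out_ch),
        nn.GELU(),
        nn.MaxPool1d(kernel_size=2),   # halve the sequence length
    )

channels = [4, 64, 96, 128, 160, 192, 224, 256]        # 4 = one-hot A/C/G/T
stem = nn.Sequential(*[conv_block(c_in, c_out)
                       for c_in, c_out in zip(channels[:-1], channels[1:])])

x = torch.randn(1, 4, 196_608)   # (batch, channels, ~200 kb one-hot sequence)
tokens = stem(x)
print(tokens.shape)              # torch.Size([1, 256, 1536]) -> ~1,500 tokens of 128 bp
```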
The transformer trunk operates on the compressed sequence through a stack of self-attention layers. Each layer computes attention scores between all pairs of positions, allowing information to flow directly between any two locations in the 200 kilobase window. Relative positional encodings preserve information about the distances between elements, which matters for regulatory biology where the spacing between motifs often carries functional significance. The combination of multi-head attention and feed-forward layers enables the model to learn complex, position-dependent relationships across the full window.
Why does Enformer use relative positional encodings in the transformer rather than absolute positional encodings? Consider what information about regulatory elements matters for their function.
Relative positional encodings preserve information about the distance between regulatory elements (e.g., an enhancer is 50 kb from a promoter) rather than their absolute positions in the genome. This matters because regulatory function depends on spacing relationships: whether two motifs are 100 bp or 10 kb apart affects their interaction, whereas whether they sit at genomic coordinate 10,000 or 10,000,000 does not. This design choice reflects biological reality: evolution preserves regulatory spacing more than absolute position.
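The toy layer below illustrates the idea: a single simplified attention layer whose scores receive a learned bias indexed by the separation |i - j| between token positions. Enformer's actual relative positional encodings are richer (multiple basis functions, multiple heads), so treat this as a sketch of why distance, not coordinate, enters the computation.

```python
import torch
import torch.nn as nn

# Toy distance-biased attention over the compressed tokens. The bias depends only on
# how far apart two positions are (in 128 bp bins), not on their absolute location.

class DistanceBiasedAttention(nn.Module):
    def __init__(self, dim: int, max_len: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.dist_bias = nn.Parameter(torch.zeros(max_len))   # one bias per separation

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (batch, tokens, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        idx = torch.arange(x.shape[1])
        separation = (idx[None, :] - idx[:, None]).abs()      # |i - j| for every token pair
        scores = scores + self.dist_bias[separation]          # same bias for 50 kb apart anywhere
        return torch.softmax(scores, dim=-1) @ v

layer = DistanceBiasedAttention(dim=256, max_len=1536)
print(layer(torch.randn(1, 1536, 256)).shape)                 # torch.Size([1, 1536, 256])
```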
Task-specific output heads branch from the shared transformer backbone. Separate heads predict different types of outputs: DNase accessibility and ATAC-seq signal (chromatin openness), histone modifications including H3K4me3, H3K27ac, and other marks, CAGE signal reflecting transcription initiation, and additional functional genomics readouts where training data is available. Each head consists of convolutional and linear layers that transform the shared representation into track-specific predictions.
The multi-task design serves multiple purposes. Different assays provide complementary supervision: chromatin accessibility reflects regulatory potential, histone marks indicate active enhancers and promoters, and CAGE captures transcriptional output. Training on all assays jointly encourages the backbone to learn representations that capture the full regulatory cascade from accessible chromatin through enhancer activation to transcription initiation.
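A schematic of the branching heads is sketched below. The track counts are illustrative placeholders (the real model predicts thousands of tracks across human and mouse), and real heads include additional convolutional layers and per-track output scaling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Schematic of task-specific heads branching from the shared trunk representation.
# Track counts are illustrative, not the published numbers.

backbone_dim, positions = 256, 1536
heads = nn.ModuleDict({
    "accessibility": nn.Linear(backbone_dim, 700),    # DNase / ATAC tracks
    "chip":          nn.Linear(backbone_dim, 4_000),  # histone-mark and TF ChIP tracks
    "cage":          nn.Linear(backbone_dim, 600),    # transcription-initiation tracks
})

shared = torch.randn(1, positions, backbone_dim)       # transformer trunk output
predictions = {name: F.softplus(head(shared))          # softplus keeps coverage non-negative
               for name, head in heads.items()}
for name, pred in predictions.items():
    print(name, tuple(pred.shape))                     # per-position predictions per track family
```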
17.2.2 Training Data and Cross-Species Learning
Where does a model learn what an enhancer looks like, or that H3K27ac marks active regulatory regions? The answer shapes everything the model can and cannot do. Enformer trains on functional genomics data from both human and mouse, spanning hundreds of assays and cell types. The chromatin accessibility, histone modification, and transcription initiation assays introduced in Section 2.4 and Section 2.4.1 provide the supervision signals: DNase-seq and ATAC-seq measure regulatory potential, ChIP-seq for histone marks identifies active enhancers and promoters, and CAGE captures where transcription begins. Human training data derives largely from the ENCODE and Roadmap Epigenomics consortia, supplemented by CAGE data from FANTOM and additional chromatin profiling studies. Mouse data from analogous consortia provides complementary supervision.
Sequence-to-function models learn from experimental measurements of regulatory activity. Understanding what each assay captures clarifies what models can and cannot learn.
Chromatin accessibility assays identify regions where DNA is not tightly wrapped around nucleosomes, making it available for transcription factor binding:
- DNase-seq uses DNase I enzyme to cut accessible DNA; sequencing reads pile up at open chromatin regions
- ATAC-seq uses Tn5 transposase, which preferentially inserts into accessible regions; faster and requires fewer cells than DNase-seq
Transcription factor binding assays identify where specific proteins bind DNA:
- ChIP-seq (Chromatin Immunoprecipitation followed by sequencing) uses antibodies to pull down DNA bound by a target protein; widely used for transcription factors and histone modifications
- CUT&RUN and CUT&Tag are newer methods that require fewer cells and produce less background noise
Histone modification assays reveal chromatin states associated with regulatory function:
- H3K4me3 marks active promoters
- H3K27ac marks active enhancers and promoters
- H3K27me3 marks repressed regions (Polycomb silencing)
- H3K36me3 marks transcribed gene bodies
Transcription assays measure where and how much transcription occurs:
- CAGE (Cap Analysis of Gene Expression) captures transcript 5’ ends, precisely mapping transcription start sites
- RNA-seq measures steady-state RNA levels across exons and introns
- PRO-seq and GRO-seq measure nascent transcription by capturing RNA polymerase-associated RNA
Regulatory activity assays directly measure enhancer function:
- STARR-seq (Self-Transcribing Active Regulatory Region sequencing) tests thousands of sequences for enhancer activity by measuring their ability to drive transcription of themselves
- MPRA (Massively Parallel Reporter Assays) tests sequences in reporter gene constructs
Chromatin conformation assays reveal 3D genome organization:
- Hi-C captures all chromatin contacts genome-wide
- Micro-C provides higher resolution contact maps
- HiChIP and PLAC-seq enrich for contacts involving specific proteins
Each assay type provides different supervision signal. Accessibility assays indicate regulatory potential; binding assays identify specific factors; activity assays measure functional output. Models trained on combinations learn richer regulatory representations than those trained on any single assay type.
Consider Enformer’s training setup. If the model is trained on data primarily from ENCODE and Roadmap Epigenomics consortia, which cell types do you expect it will predict well for, and which might it struggle with? Think about what cell types these consortia focused on.
Enformer will predict well for common cell lines like K562, HepG2, GM12878, and major tissue types extensively profiled by these consortia. It will likely struggle with rare cell types, disease-specific cell states, developmental stages not represented in the data, and tissue types that are difficult to culture or obtain. Predictions for underrepresented contexts should be interpreted with appropriate caution.
Cross-species training confers several advantages. Regulatory sequences that are functionally constrained evolve more slowly than neutral sequence, so mouse and human share many regulatory motifs and principles despite 80 million years of divergence. Training on both species helps the model distinguish conserved regulatory logic from species-specific noise, reduces overfitting to idiosyncrasies of human data, expands the effective training set without requiring additional human samples, and implicitly emphasizes evolutionarily conserved patterns that are more likely to be functionally important.
The training objective combines losses across all tracks, positions, and species. Count-based likelihoods (Poisson or negative binomial) handle sequencing-derived signals, while correlation-based objectives ensure the model captures the overall shape of coverage profiles. Per-track weighting prevents abundant assays from dominating gradients.
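As a concrete illustration, a per-track Poisson negative log-likelihood with per-track weights might look like the sketch below. This is a simplified stand-in for the published objectives, which also include profile-shape terms and handle human and mouse outputs separately.

```python
import torch

# Illustrative weighted Poisson objective for count-valued coverage tracks.

def weighted_poisson_nll(pred: torch.Tensor,            # (batch, positions, tracks), > 0
                         target: torch.Tensor,          # observed counts, same shape
                         track_weights: torch.Tensor) -> torch.Tensor:   # (tracks,)
    nll = pred - target * torch.log(pred + 1e-8)        # Poisson NLL up to a constant
    return (nll.mean(dim=(0, 1)) * track_weights).sum()

pred = torch.rand(2, 1536, 10) + 0.1
target = torch.poisson(torch.full((2, 1536, 10), 3.0))
weights = torch.full((10,), 0.1)                        # e.g. downweight abundant assay types
print(weighted_poisson_nll(pred, target, weights))
```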
17.2.3 Variant Effect Prediction
The clinical and scientific value of Enformer lies substantially in its ability to predict how sequence variants alter regulatory activity. The procedure follows a straightforward logic: extract a 200 kilobase window containing the variant, compute predictions for the reference allele, compute predictions for the alternative allele, and compare the outputs across all tracks and positions.
Worked Example: Scoring a Regulatory Variant
Consider an intergenic variant 45 kilobases upstream of gene X, identified in a GWAS for liver enzyme levels. To interpret this variant with Enformer:
- Extract window: Center a 200kb window on the variant position, ensuring both the variant and the X promoter fall within the window
- Generate sequences: Create reference and alternative versions differing only at the variant position
- Forward pass: Run both sequences through Enformer, obtaining predictions for ~5,000 tracks at ~1,500 positions each
- Compute delta: Subtract reference predictions from alternative predictions
- Interpret: Focus on relevant tracks (liver cell types) and positions (enhancer region around variant, promoter region of gene X)
- Quantify: If CAGE signal at the X promoter decreases by 0.3 log-fold in HepG2 but not in other cell types, this suggests a liver-specific regulatory effect consistent with the GWAS phenotype
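The loop itself is short. The sketch below uses a random-output placeholder where a trained model's forward pass would go, and hypothetical bin and track indices for the gene X promoter and liver CAGE tracks; with a real model you would substitute its prediction function and look up indices from its track metadata.

```python
import hashlib
import numpy as np

# Sketch of the ref/alt scoring loop above. `predict_tracks` is a placeholder so the
# example runs; promoter bins and liver CAGE track indices are hypothetical.

WINDOW, BIN = 196_608, 128            # model input length and output bin size (bp)
N_BINS, N_TRACKS = WINDOW // BIN, 5_000

def one_hot(seq: str) -> np.ndarray:
    lut = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq):
        if base in lut:
            x[i, lut[base]] = 1.0
    return x

def predict_tracks(x: np.ndarray) -> np.ndarray:
    """Placeholder for a trained model: deterministic pseudo-random tracks per sequence."""
    seed = int.from_bytes(hashlib.sha256(x.tobytes()).digest()[:8], "little")
    return np.random.default_rng(seed).random((N_BINS, N_TRACKS), dtype=np.float32)

def score_variant(window_seq: str, offset: int, alt: str) -> np.ndarray:
    """Per-bin, per-track delta (alt - ref) for a single-nucleotide variant."""
    alt_seq = window_seq[:offset] + alt + window_seq[offset + 1:]
    return predict_tracks(one_hot(alt_seq)) - predict_tracks(one_hot(window_seq))

rng = np.random.default_rng(0)
window = "".join(rng.choice(list("ACGT"), WINDOW))      # stand-in for reference sequence
mid = WINDOW // 2
alt_base = "A" if window[mid] != "A" else "G"
delta = score_variant(window, mid, alt_base)            # shape: (N_BINS, N_TRACKS)

promoter_bins = slice(1100, 1110)     # hypothetical bins covering the gene X promoter
liver_cage_tracks = [4821, 4822]      # hypothetical indices of liver CAGE tracks
print(delta[promoter_bins][:, liver_cage_tracks].mean())   # mean predicted promoter effect
```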
The resulting variant effect scores span thousands of dimensions, one for each assay and cell type. A variant might increase predicted DNase accessibility in one cell type while decreasing predicted CAGE signal in another, suggesting context-dependent regulatory effects. By aggregating predictions around gene promoters, researchers can estimate variant effects on gene expression in specific tissues.
Validation against GTEx expression quantitative trait loci (eQTLs) demonstrated that Enformer’s predictions correlate with observed genetic effects on expression (Ž. Avsec et al. 2021). Variants with large predicted effects on promoter-proximal CAGE signal were enriched among significant eQTLs. Notably, this correlation extended to distal variants: sequence changes 50 kilobases or more from a gene’s transcription start site still showed predictive power when they fell in regions of predicted regulatory activity. This long-range predictive capacity distinguishes Enformer from short-context models and validates the architectural investment in extended context windows. These predictions integrate with classical variant effect methods (Chapter 4) and foundation model approaches (Section 18.3.2) to provide comprehensive variant interpretation, with clinical workflow integration detailed in Section 29.1.4.
17.3 Borzoi: From Chromatin to Transcriptome
While Enformer predicts transcription initiation through CAGE, RNA-seq captures a richer picture of gene expression: not just where transcription begins, but how the transcript is spliced, which isoforms dominate, where transcription terminates, and how stable the resulting mRNA is. Borzoi (Linder et al. 2025) extends the hybrid architecture paradigm to predict full RNA-seq coverage profiles, enabling a unified view of how sequence variation affects the entire transcriptional program.
17.3.1 Beyond Transcription Initiation
CAGE measures where transcription begins, but what aspects of gene expression does it miss? Consider what happens to an RNA molecule after transcription initiates. What regulatory events occur post-transcriptionally that could affect protein output?
A single gene can produce multiple transcript isoforms through alternative promoter usage, alternative splicing, and alternative polyadenylation. These isoforms may have different stabilities, different translation efficiencies, and different functions. A variant that shifts isoform ratios without changing total expression could have substantial phenotypic consequences: a switch from a cytoplasmic to a nuclear isoform, for instance, or inclusion of a premature stop codon in the predominant transcript.
CAGE and chromatin assays cannot capture these complexities. They measure where transcription might begin and what the chromatin environment looks like, but they do not reveal how RNA polymerase traverses the gene body, where splicing occurs, or which 3’ end is selected. RNA-seq coverage profiles encode all of this information: exon boundaries appear as coverage drops at intron junctions, alternative splicing manifests as variable junction usage, and polyadenylation site choice appears in the coverage pattern near gene 3’ ends.
17.3.2 Predicting Coverage at Nucleotide Resolution
Teaching a model to predict where reads pile up sounds straightforward, but RNA-seq coverage varies by orders of magnitude (from thousands of reads on highly expressed exons to near-zero in introns), and the sharp transitions at splice junctions encode critical information. How do you train a model to capture both the forest and the trees? Borzoi addresses this challenge by building on an Enformer-style backbone with modifications tailored to RNA-seq prediction. The convolutional stem and transformer trunk follow similar principles, compressing long input windows and propagating information through attention. Output heads predict stranded RNA-seq coverage across the window, with additional heads for complementary signals like PRO-seq (nascent transcription), CAGE, and other assays when available.
Training on RNA-seq coverage imposes different demands than training on chromatin marks. Coverage varies over orders of magnitude between introns and exons; the model must capture both the overall expression level and the fine structure of the coverage profile. Junction reads that span splice sites provide particularly informative supervision, as they directly constrain the model to learn splicing patterns. The loss function balances accurate prediction of coverage levels with faithful reproduction of the coverage shape, including sharp transitions at exon boundaries.
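One common way to express this balance, sketched below under the assumption of non-negative per-position coverage predictions, is to score the total read count with a Poisson term and the distribution of that count across positions with a multinomial (shape) term. The published losses follow this spirit but differ in detail and weighting.

```python
import torch

# Level-plus-shape objective for a single coverage track: the Poisson term captures
# overall expression level, the multinomial term captures the coverage profile
# (where the sharp exon/intron transitions live).

def level_plus_shape_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred, target: (batch, positions) non-negative coverage for one track."""
    pred_total, target_total = pred.sum(-1), target.sum(-1)
    level = (pred_total - target_total * torch.log(pred_total + 1e-8)).mean()
    profile = pred / (pred_total[..., None] + 1e-8)      # predicted per-position fractions
    shape = -(target * torch.log(profile + 1e-8)).sum(-1).mean()
    return level + shape      # the relative weighting of the two terms is a modeling choice

pred = torch.rand(2, 1024) * 5.0
target = torch.poisson(torch.rand(2, 1024) * 5.0)
print(level_plus_shape_loss(pred, target))
```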
Why do junction reads (reads spanning splice sites) provide particularly informative training signal for Borzoi? What would happen if the model were trained only on exonic coverage without junction information?
Junction reads directly constrain the model to learn splicing patterns by providing explicit supervision about which exons are connected in mature transcripts. Without junction information, the model could learn to predict high coverage on exons and low coverage on introns, but it would not learn which exons are spliced together, alternative splicing patterns, or splice site recognition sequences. Junction reads force the model to understand the combinatorial logic of exon inclusion and exclusion, not just expression levels.
17.3.3 Applications Beyond Expression Level
A variant that causes exon skipping does not change how much gene expression you see; it changes which transcript you get. If your model only predicts total expression levels, this pathogenic variant looks harmless. What can we do with full coverage predictions that simple expression quantification misses? By predicting full RNA-seq coverage, Borzoi enables analyses that go beyond simple expression quantification. Splicing variant effects can be assessed by comparing predicted coverage at exons and junctions under reference and alternative alleles. A variant that reduces predicted junction reads for a particular exon suggests exon skipping; increased junction reads to a cryptic splice site suggests aberrant splicing. These predictions complement specialized splicing models like SpliceAI (Section 6.5), providing additional context about how splicing changes fit within the broader transcriptional program. The integration of Borzoi splicing predictions with SpliceAI scores is examined in Section 18.3.1.
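A coverage-based exon-skipping signal can be as simple as comparing exon coverage, normalized to a flanking constitutive exon, between alleles. The sketch below uses synthetic coverage and hypothetical bin coordinates to show the shape of such a comparison; with real predictions you would map exon coordinates into the model's output bins.

```python
import numpy as np

# Sketch: exon coverage relative to a flanking exon, compared between alleles.
# Coverage values are synthetic and bin coordinates hypothetical.

def exon_inclusion(coverage: np.ndarray, exon: slice, flank: slice) -> float:
    """Exon coverage relative to flanking-exon coverage."""
    return float(coverage[exon].mean() / (coverage[flank].mean() + 1e-8))

ref_cov = np.random.default_rng(1).random(6_144) + 0.5   # predicted coverage, ref allele
alt_cov = ref_cov.copy()
alt_cov[2_000:2_010] *= 0.4                               # pretend the alt allele weakens the exon

exon_bins, flank_bins = slice(2_000, 2_010), slice(1_900, 1_950)
delta = (exon_inclusion(alt_cov, exon_bins, flank_bins)
         - exon_inclusion(ref_cov, exon_bins, flank_bins))
print(f"inclusion change: {delta:+.3f}")                  # negative suggests exon skipping
```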
Alternative promoter usage becomes visible through coverage patterns near transcription start sites. A variant that increases coverage downstream of one TSS while decreasing it downstream of another suggests a shift in promoter preference. Such shifts can alter the 5’ UTR of the resulting transcript, affecting translation efficiency and regulatory motif content.
Polyadenylation site choice affects 3’ UTR length and content. Shorter 3’ UTRs may escape microRNA-mediated repression; longer ones may include additional regulatory elements. Borzoi’s coverage predictions around annotated polyadenylation sites can reveal variants that shift site usage, potentially explaining effects on mRNA stability and translation that would be invisible to chromatin-based models.
Borzoi shifts the paradigm from predicting regulatory potential (chromatin state) to predicting regulatory outcome (RNA processing). This matters clinically because many disease-causing variants act through splicing or other post-transcriptional mechanisms that chromatin-focused models cannot detect.
17.4 Sei: A Regulatory Vocabulary from Sequence
While Enformer and Borzoi predict continuous coverage tracks, Sei (Chen et al. 2022) takes a complementary approach: learning a discrete vocabulary of sequence classes that capture distinct regulatory activities. Rather than predicting thousands of individual assays, Sei maps sequences to a reduced set of regulatory states, each associated with characteristic chromatin and transcription patterns.
17.4.1 Discrete Regulatory States
A promoter that is active in liver but silent in brain is not just “partially open”; it occupies a fundamentally different regulatory state in each tissue. But how many distinct states exist, and what sequence features define them? Tracking thousands of individual assay predictions makes these questions hard to answer. Sei builds on observations that chromatin states cluster into interpretable categories: active promoters, strong enhancers, poised enhancers, heterochromatin, and so forth. Previous methods like ChromHMM defined such states from observed chromatin marks in specific cell types. Sei learns to predict sequence class membership directly from DNA, asking what regulatory identity a sequence carries based on its intrinsic properties.
The model predicts 40 sequence classes derived from clustering patterns across chromatin accessibility, histone modifications, and transcription factor binding. Each class corresponds to a recognizable regulatory state: promoter-like sequences, enhancer-like sequences, CTCF binding sites, repressed regions, and various intermediate states. The output is not a single class assignment but a probability distribution over classes, reflecting the observation that many sequences have context-dependent regulatory potential.
17.4.2 Complementary to Track Prediction
When a clinician asks “what kind of regulatory element did this variant hit?”, answering “it reduced H3K27ac by 0.3 log-fold in HepG2 cells” is technically precise but practically opaque. What they want to know is: did the variant disrupt an enhancer, a promoter, or something else entirely? Sei and Enformer-style models serve complementary purposes. Enformer provides detailed, quantitative predictions across specific assays and cell types; Sei provides a compressed, interpretable summary of regulatory identity. For variant interpretation, both perspectives can be valuable. Enformer might reveal that a variant reduces predicted H3K27ac signal in liver but not heart; Sei might reveal that the same variant shifts sequence class membership from “strong enhancer” toward “weak enhancer,” a more immediately interpretable characterization.
The regulatory vocabulary approach also facilitates systematic analysis across many variants. Rather than tracking changes in thousands of individual tracks, researchers can ask how a set of variants affects the distribution of regulatory classes, identifying patterns that might be obscured in high-dimensional track space.
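The sketch below shows what such a summary might look like. A placeholder stands in for the Sei forward pass and generic labels stand in for the named regulatory states, but the aggregation logic (score each variant's class-probability shift, then average across the set) carries over directly.

```python
import numpy as np

# Summarizing many variants by regulatory-class shifts instead of thousands of tracks.
# `class_scores` is a placeholder for a Sei-style model; class names are generic stand-ins.

CLASSES = [f"class_{i:02d}" for i in range(40)]

def class_scores(seed: int) -> np.ndarray:
    """Placeholder: a probability distribution over 40 sequence classes."""
    return np.random.default_rng(seed).dirichlet(np.ones(40))

def class_shift(ref_seed: int, alt_seed: int) -> np.ndarray:
    return class_scores(alt_seed) - class_scores(ref_seed)

shifts = np.stack([class_shift(2 * i, 2 * i + 1) for i in range(100)])   # 100 variants
mean_shift = shifts.mean(axis=0)
for i in np.argsort(-np.abs(mean_shift))[:3]:
    print(CLASSES[i], f"{mean_shift[i]:+.4f}")   # classes most shifted across the variant set
```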
| Model | Output Type | Number of Outputs | Best For |
|---|---|---|---|
| Enformer | Continuous tracks | ~5,000 (cell type × assay) | Quantitative, cell-type-specific predictions |
| Borzoi | RNA-seq coverage | Per-position, stranded | Splicing, isoform, RNA processing effects |
| Sei | Discrete classes | 40 probability scores | Interpretable regulatory state changes |
| AlphaGenome | Multi-modal | Chromatin + RNA + contacts | Comprehensive, unified predictions |
For each scenario below, which regulatory model would you choose and why?
A clinical geneticist wants to explain to a patient why their intronic variant likely disrupts a liver enhancer. They need output the patient can understand.
A researcher suspects a variant affects alternative splicing of a transcript but has no experimental data. They need quantitative predictions of exon inclusion changes.
A diagnostic lab needs to score 50,000 variants from a biobank cohort. They require locally-runnable predictions with version control for reproducibility.
Sei - The discrete regulatory vocabulary (“strong enhancer” → “weak enhancer”) is more interpretable for non-specialists than continuous track predictions. The clinician can say “this variant shifts the sequence from acting like an enhancer to acting like background.”
Borzoi - Unlike Enformer, which predicts transcription initiation, Borzoi predicts RNA-seq coverage including splicing effects. For splicing-specific questions, Borzoi provides quantitative exon inclusion predictions that complement SpliceAI.
Enformer - For high-throughput clinical applications requiring reproducibility, Enformer’s open weights enable local deployment without sending patient data externally. Version-controlled predictions are essential for diagnostic validation.
17.5 AlphaGenome: Unifying Modalities at Megabase Scale
AlphaGenome (Z. Avsec, Latysheva, and Cheng 2025) extends the hybrid modeling paradigm in two directions: longer context windows (approximately one megabase) and broader output modalities spanning chromatin, expression, splicing, and three-dimensional contacts. The goal is a single model that provides a comprehensive view of how sequence determines regulatory state.
17.5.1 From 200kb to One Megabase
Enformer processes 200kb windows. AlphaGenome extends to approximately 1Mb. What additional biological features could a model capture with this extended context? Think about the sizes of topologically associating domains (TADs) and super-enhancers.
The megabase context window pushes against computational limits even with hybrid architectures. AlphaGenome addresses this through efficient attention mechanisms that reduce the quadratic cost, hierarchical processing that handles different output modalities at appropriate resolutions, and architectural refinements accumulated from Enformer and Borzoi development.
The output repertoire spans chromatin accessibility and histone modifications (following Enformer), gene expression and RNA coverage (following Borzoi), splicing predictions including exon inclusion and junction usage, and contact predictions reflecting three-dimensional chromatin organization.
Unifying these modalities in a single model offers several advantages. The backbone representation must capture information relevant to all outputs, encouraging learning of features that connect chromatin state to transcription to RNA processing. Variant effect predictions become coherent across modalities: a single forward pass reveals how a variant affects chromatin, expression, splicing, and contacts, rather than requiring separate runs through independent models.
17.5.2 Closed Weights, Open Questions
The most capable model is useless if you cannot use it on your data. When patient genomic sequences cannot leave your institution’s servers, API-only access becomes a barrier, not a convenience. Who controls the model matters as much as what the model can do. AlphaGenome is primarily available through an API interface rather than as a downloadable model. This arrangement simplifies use for many applications: researchers can score variants without managing large model weights or specialized hardware. It also introduces constraints around data privacy, customization, and integration with local pipelines. Clinical applications that cannot send patient sequence data to external services may be unable to use API-only models directly, motivating interest in openly available alternatives.
Use Enformer when:
- You need open-source, locally runnable predictions
- Chromatin state and transcription initiation are your primary interests
- You want to fine-tune or adapt the model for custom tasks
- You require reproducible, version-controlled predictions
Use Borzoi when:
- Splicing effects are important for your variants
- You need RNA-level predictions beyond transcription initiation
- You are integrating with other splicing predictors like SpliceAI
- Post-transcriptional regulation is clinically relevant
Use Sei when:
- You need interpretable regulatory state classifications
- You are analyzing many variants and want compressed summaries
- Communicating results to non-specialists is important
- You want to identify variants that shift regulatory identity
Use AlphaGenome when:
- You need the most comprehensive multi-modal predictions
- Data privacy constraints allow API usage
- 3D contact predictions are relevant to your question
- You want unified predictions across chromatin, RNA, and structure
From the perspective of variant interpretation workflows, AlphaGenome provides a comprehensive set of predictions from a single query. A variant can be assessed for effects on local chromatin state, expression of nearby genes, splicing of overlapping transcripts, and potential disruption of chromatin contacts, all from the same underlying model. The challenge lies in synthesizing these multiple outputs into actionable conclusions, a topic addressed in Section 18.3.4, with practical workflow integration in Section 18.4.3.
17.6 What Hybrid Architectures Accomplish
The progression from DeepSEA through Enformer, Borzoi, and AlphaGenome reflects accumulating solutions to specific limitations. Each model addresses constraints that bounded its predecessor’s utility.
17.6.1 Spanning Enhancer-Promoter Distances
An enhancer 80 kilobases from a promoter was invisible to earlier models: not uncertain, not weakly detected, but completely absent from the prediction. What do we gain when we can finally see it? The most direct contribution is enabling long-range interaction modeling. A 200 kilobase context window encompasses the distances over which most cis-regulatory interactions occur. Attention mechanisms allow the model to learn direct relationships between enhancers and promoters without requiring information to propagate through many intermediate layers. Empirically, this translates to improved prediction of expression and better correlation with eQTLs, particularly for variants in distal regulatory elements.
17.6.2 Multi-Task Regularization
Why train one model on thousands of different assays instead of training specialized models for each? The naive expectation might be that a model focused solely on predicting gene expression would outperform one distracted by chromatin accessibility and histone modifications. Yet the opposite proves true. Training on hundreds of assays jointly constrains the model to learn representations that generalize across regulatory modalities. A feature useful only for predicting H3K4me3 in one cell type provides less gradient signal than a feature useful across chromatin, transcription, and accessibility. This multi-task pressure steers the model toward learning fundamental regulatory logic rather than assay-specific artifacts.
Why does multi-task training produce better representations than single-task training? The answer lies in what the shared backbone must learn to succeed across all tasks. Predicting chromatin accessibility requires recognizing transcription factor binding motifs. Predicting gene expression requires recognizing how those motifs combine into functional enhancers. Predicting histone modifications requires recognizing the sequence features that recruit chromatin modifiers. No single feature set suffices for all tasks, so the backbone must learn a rich vocabulary of regulatory features, precisely the vocabulary needed for robust variant effect prediction. In contrast, a model trained only on one assay might exploit assay-specific artifacts (batch effects, mapping biases, cell-line idiosyncrasies) that happen to correlate with the training signal but do not reflect genuine regulatory biology.
17.6.3 Cross-Species Constraints
If human and mouse regulatory sequences have diverged over 80 million years of evolution, why would training on both species together help rather than hurt? The answer reveals something fundamental about what we want these models to learn. Training on human and mouse together further regularizes the model. Species-specific binding site variants, repetitive elements, and technical artifacts in training data affect one species but not the other. Features that generalize across species are more likely to reflect conserved regulatory mechanisms. This provides a form of evolutionary validation built into the training process.
Why does cross-species training work when human and mouse regulatory sequences have diverged substantially? The key insight is that core regulatory logic is more conserved than specific sequences. The transcription factors that drive liver expression in humans are largely the same as those in mice, even if the exact positions and sequences of binding sites have shifted. A model trained on both species must learn what HNF4A binding looks like in general, not just where HNF4A binds in one particular genome. This abstraction makes the model robust to sequence variation while preserving sensitivity to functional features, exactly the properties needed for predicting effects of human genetic variants, which by definition differ from the reference sequence the model was trained on.
17.6.4 Unified Variant Effect Prediction
Running a variant through five different models and reconciling contradictory outputs is tedious at best and misleading at worst. What if one model could give you chromatin effects, expression changes, and splicing consequences in a single coherent prediction? Perhaps most practically valuable, hybrid models provide a unified framework for variant effect prediction on expression and related phenotypes. Rather than assembling scores from multiple specialized models, researchers can query a single model for comprehensive predictions. The outputs span cell types and assays, enabling tissue-specific interpretation of regulatory variants. This capability integrates naturally with the variant interpretation workflows described in Section 18.4 and the clinical applications examined in Chapter 29. The calibration of these multi-track predictions for clinical use is addressed in Section 18.5.
17.7 Limitations and Open Challenges
Despite their power, long-context regulatory models face fundamental limitations that bound their current utility and define directions for future development.
Before reading about limitations, reflect: based on what you know about how these models are trained, what categories of failure would you expect? Consider training data, model architecture, and the biology of gene regulation.
Expected failure categories include: training data bias (poor performance on underrepresented cell types and populations), finite context windows missing trans-acting factors and distant regulatory elements, inability to model 3D chromatin structure from linear sequence, and correlation-based learning that may not capture true causal mechanisms. The models learn associations between sequence and functional readouts but cannot distinguish which patterns are causally important versus merely correlated.
17.7.1 Training Data Constraints
Your patient has a variant in a regulatory element active in pancreatic beta cells. Will the model’s prediction be reliable? That depends entirely on whether beta cells were well-represented in training, and the model will not tell you if they were not. Functional genomics data is biased in coverage, overrepresenting well-studied cell types (embryonic stem cells, K562, HepG2, lymphoblastoid cell lines) while leaving many tissue types and disease-relevant cell states poorly covered. Models trained on available data will perform better in represented contexts and may fail silently in underrepresented ones. Ancestry bias compounds the problem: most functional genomics studies derive from individuals of European descent, limiting the diversity of haplotypes and regulatory variants represented in training data. These data gaps are examined more comprehensively in Section 2.4.1 and Section 2.4.
These biases propagate to variant effect predictions. A variant in a regulatory element active primarily in pancreatic beta cells may receive poor predictions if beta cell data is sparse in training. A variant on a haplotype common in African populations but rare in Europeans may fall outside the model’s effective training distribution. Users must recognize that prediction confidence varies with representation in training data, a consideration that current models do not explicitly communicate. Chapter 13 examines how such biases can compromise model validity.
17.7.2 Finite Context
A transcription factor binding 50 megabases away can determine whether your gene of interest is expressed. A structural variant duplicating a distant super-enhancer can drive oncogene activation. No fixed window, however large, can see everything that matters. Even megabase-scale windows capture only local regulation. Trans-acting factors, three-dimensional contacts spanning multiple megabases, and whole-chromosome organization fall outside model context. Structural variants that rearrange large genomic segments, duplicate enhancers, or create novel fusion genes cannot be modeled within fixed-window architectures. The reference genome assumption underlying these models further limits their applicability to complex haplotypes and populations with substantial structural variation relative to the reference.
17.7.3 Missing Three-Dimensional Context
Two enhancers equidistant from a promoter in linear sequence may have completely different effects: one brought into contact through a chromatin loop, the other sequestered in a different nuclear compartment. The genome is folded, but these models read it flat. Linear sequence models treat DNA as a one-dimensional string, but gene regulation occurs in three-dimensional nuclear space. Chromatin loops bring distal elements into proximity; nuclear compartmentalization segregates active and repressed regions; phase-separated condensates concentrate regulatory factors. While AlphaGenome predicts some contact features, current hybrid models do not fully integrate three-dimensional chromatin organization. The relationship between linear sequence, three-dimensional structure, and regulatory output remains incompletely captured. Chapter 21 examines models that explicitly address chromatin architecture.
17.7.4 Correlation Versus Causation
A model that perfectly predicts expression levels might still be completely wrong about why genes are expressed. It could be learning batch effects, GC content, or any signal correlated with the training labels without understanding regulatory mechanism. How do we know when a model has learned biology versus artifacts? Hybrid models learn correlations between sequence and functional readouts, not causal mechanisms. A variant might receive a high predicted effect score because it disrupts a motif correlated with expression in training data, not because the motif causally drives expression. Attribution methods can identify which sequence features contribute to predictions, but attribution is not validation. High-confidence predictions require experimental confirmation through approaches like massively parallel reporter assays, CRISPR perturbation, or allelic series analysis.
Prediction accuracy and mechanistic understanding are different things. A model can achieve high correlation with experimental measurements by learning any predictive signal, even confounded ones. The fact that a model accurately predicts expression does not mean it has learned the correct causal mechanism.
17.7.5 Interpretability Challenges
When Enformer predicts that a variant disrupts gene expression, can we ask why? What motif was disrupted? What regulatory logic was broken? With hundreds of millions of parameters, the answer is rarely straightforward. The scale of these models (hundreds of millions of parameters) makes mechanistic interpretation difficult. Attention patterns provide some insight into which positions the model considers related, but attention weights are not guaranteed to reflect the model’s actual computational strategy. Attribution methods (saliency maps, integrated gradients) can highlight important input positions, but the features the model constructs from those positions remain opaque. Chapter 25 examines these interpretability methods and their limitations in detail.
17.8 Relationship to Foundation Models
Long-context regulatory models occupy an interesting position in the genomic foundation model landscape. They share key characteristics with foundation models: large scale, broad training data, strong performance across tasks, and utility as feature extractors for downstream applications. Yet they differ from self-supervised DNA language models (Chapter 15) in their heavy reliance on supervised, task-specific training signals.
Enformer and its descendants can be viewed as highly specialized foundation models, pretrained on the specific task of regulatory prediction and adaptable to related applications. Their representations encode regulatory logic learned from functional genomics supervision, complementing the sequence patterns learned by self-supervised models from raw DNA. In practice, the two approaches may prove most powerful in combination: self-supervised models provide sequence representations from evolutionary context, while supervised regulatory models provide representations from functional genomics context. Integrating these representations for tasks like variant effect prediction is an active area of development, explored further in Chapter 18.
Consider the difference between Enformer (supervised on functional genomics data) and DNA language models like Nucleotide Transformer (self-supervised on sequence alone). What types of regulatory patterns might each approach learn well? What might each miss?
From a practical standpoint, hybrid regulatory models remain among the most directly useful genomic deep learning systems for variant interpretation. They provide quantitative, tissue-specific predictions for regulatory variants, outperform short-context alternatives on distal regulatory elements, and integrate naturally into variant prioritization workflows. Their limitations are real but understood; their strengths are substantial and empirically validated.
17.9 Prediction Without Explanation
Long-range regulatory prediction from sequence is tractable. Enformer established that hybrid convolutional neural network (CNN)-transformer architectures could span 200 kilobases and predict expression-related chromatin features. Borzoi extended coverage to the full transcriptome with improved quantitative accuracy. AlphaGenome unified multiple regulatory modalities at megabase scale, predicting chromatin accessibility, histone modifications, transcription factor binding, and gene expression from a single architecture. Each generation captures more of the regulatory landscape with greater fidelity to experimental measurements.
Yet these models predict regulatory outcomes without explaining regulatory mechanism. They learn that certain sequence patterns associate with certain expression levels, but they do not represent enhancer-promoter contacts, transcription factor cascades, or the causal chain from sequence to phenotype. The attention patterns that span long distances may correspond to genuine regulatory interactions or may reflect confounded sequence features that happen to predict expression. Interpretability methods (Chapter 25) can probe what patterns models have learned, but high prediction accuracy does not guarantee mechanistic insight.
This distinction shapes how regulatory model predictions should be used. For variant effect prediction (Chapter 18), regulatory models provide one input among several: they predict whether a variant alters chromatin accessibility or expression, while protein language models (Chapter 16) assess coding consequences and evolutionary models quantify constraint. The clinical integration of these signals (Chapter 29) requires understanding what each model contributes and where each is likely to fail. Regulatory models excel at predicting noncoding variant effects when the relevant cell type is represented in training data; they struggle with cell types absent from training and with variants acting through mechanisms not captured by the output tracks they predict.
You need to predict the regulatory impact of a variant located 50kb from the nearest gene. Based on what you have learned in this chapter, which model architecture would you choose and why? What input features would be most important?
Before reviewing the summary, test your recall:
Why do short-context models fundamentally fail at modeling mammalian gene regulation? What is the scale of the problem (provide specific distance ranges)?
Explain the hybrid CNN-transformer architecture strategy. How does Enformer achieve a 200kb context window without requiring quadratic attention over 200,000 positions?
Compare Enformer, Borzoi, and Sei in terms of what they predict and what applications they are best suited for. What are the key differences in their outputs?
What does it mean that regulatory models provide “prediction without explanation”? Why is high accuracy at predicting expression not the same as understanding regulatory mechanism?
Identify three critical limitations of current regulatory models that prevent them from capturing the full complexity of gene regulation. For each, explain what biological information is missing.
Scale mismatch: Short-context models have receptive fields of 1-2 kb or at most tens of kb through dilated convolutions, but mammalian enhancers are typically 20-50 kb from their target promoters, with substantial fractions exceeding 100 kb. Topologically associating domains (TADs) range from hundreds of kb to several megabases. A model with a 2kb window can see only 1% of a 200kb regulatory neighborhood, treating distant enhancers as if they do not exist.
Hybrid architecture efficiency: Enformer uses a convolutional stem to compress 200kb of sequence down to ~1,500 tokens (each representing ~128bp bins), reducing the sequence length by roughly 130×. This makes attention computationally tractable because attention scales quadratically with sequence length: operating on 1,500 positions instead of 200,000 reduces memory requirements by roughly 17,000-fold. The convolutions preserve local motif information while discarding positional precision that biology does not require; transformer attention then enables direct information flow between any positions in the compressed representation.
Model output comparison: Enformer predicts ~5,000 continuous chromatin and transcription tracks (DNase, histone marks, CAGE) across cell types, best for quantitative tissue-specific predictions. Borzoi predicts full RNA-seq coverage profiles at nucleotide resolution, best for splicing effects and post-transcriptional regulation. Sei predicts probabilities across 40 discrete regulatory state classes (promoter-like, enhancer-like, etc.), best for interpretable regulatory identity changes and compressed variant summaries. AlphaGenome unifies chromatin, RNA, and 3D contact predictions at megabase scale via API.
Prediction vs. explanation: High accuracy means the model has learned sequence patterns that correlate with expression measurements, but correlation does not imply the model understands causal mechanisms. A variant might score high because it disrupts a motif correlated with expression in training data rather than because that motif causally drives expression. The model does not represent enhancer-promoter contacts, transcription factor cascades, or the mechanistic chain from sequence to phenotype; it learns statistical associations that predict outcomes without explaining how regulation actually works.
Three critical limitations:
Training data bias: models overrepresent well-studied cell types (K562, HepG2, lymphoblastoid lines) and European ancestry, causing poor performance on underrepresented tissues and populations where prediction confidence is unknown.
Finite context windows: even 1Mb contexts cannot capture trans-acting factors encoded elsewhere in the genome, structural variants spanning multiple megabases, or whole-chromosome organization effects.
Missing 3D chromatin structure: models treat DNA as linear sequence but regulation occurs in 3D nuclear space where chromatin loops bring distant elements into physical proximity; current models do not fully integrate how linear sequence determines 3D organization and how that structure mediates regulatory function.
Key Concepts:
- Long-range regulation problem: Mammalian gene regulation operates across 20-100+ kb, beyond the reach of short-context models
- Hybrid architectures: CNN compression + transformer attention balances computational efficiency with long-range modeling
- Multi-task learning: Training on diverse functional genomics assays encourages learning of generalizable regulatory features
- Cross-species training: Human and mouse data together emphasize evolutionarily conserved patterns
Model Comparison:
| Model | Context | Key Outputs | Access |
|---|---|---|---|
| Enformer | 200 kb | Chromatin, CAGE (~5,000 tracks) | Open weights |
| Borzoi | ~500 kb | RNA-seq coverage, splicing | Open weights |
| Sei | ~4 kb | 40 regulatory class probabilities | Open weights |
| AlphaGenome | ~1 Mb | Chromatin + RNA + contacts | API only |
Critical Limitations:
- Training data bias toward well-studied cell types and European ancestry
- Finite context misses trans-acting factors and structural variants
- Linear sequence representation ignores 3D chromatin organization
- Correlation-based learning does not guarantee causal understanding
- Model scale (hundreds of millions of parameters) impedes interpretation
Clinical Applications:
- Variant effect prediction for noncoding/regulatory variants
- Tissue-specific expression predictions
- Integration with classical VEP methods and protein language models
- See Chapter 18 for detailed variant interpretation workflows
Looking Forward:
- Chapter 18: How to combine regulatory model predictions with other evidence
- Chapter 21: Models that explicitly address chromatin architecture
- Chapter 25: Methods for probing what these models learn
- Chapter 29: Clinical application in diagnostic workflows