31 Sequence Design
Reading genomes is hard. Writing them is harder.
Estimated reading time: 35-45 minutes
Prerequisites: Before reading this chapter, you should be familiar with:
- Sequence representations and embedding spaces (Chapter 5)
- Masked language modeling and pretraining objectives (Chapter 8)
- Protein language models and structure prediction (Chapter 16)
- Uncertainty quantification and out-of-distribution detection (Chapter 24)
- Benchmark evaluation principles (Chapter 11)
Learning Objectives: After completing this chapter, you will be able to:
- Explain why sequence design inverts the prediction problem and the mathematical frameworks that formalize design objectives
- Compare sequence-based and structure-aware approaches to protein design
- Apply design principles to regulatory elements, mRNA therapeutics, and antibodies
- Evaluate generative models using novelty, validity, diversity, and functionality metrics
- Design closed-loop experimental workflows that integrate foundation model predictions with high-throughput validation
- Identify characteristic failure modes of model-guided design and mitigation strategies
Key Insight: The transition from prediction to design fundamentally changes the role of foundation models. Prediction asks “what does this sequence do?” while design asks “what sequence achieves this function?” This inversion exposes limitations invisible during prediction: models must remain accurate in regions of sequence space far from their training data.
Genomic foundation models predict the consequences of sequence variation with increasing accuracy. A protein language model estimates whether a missense variant disrupts function. A regulatory model forecasts how a promoter mutation alters expression across cell types. These predictive capabilities represent genuine advances. Yet prediction alone cannot create a therapeutic protein that nature never evolved, design a promoter that drives expression only in diseased tissue, or engineer an mRNA vaccine against a novel pathogen. The gap between reading genomes and writing them defines one of the central challenges in translational biology: we can characterize biological sequences with unprecedented resolution, but translating that understanding into designed molecules remains largely empirical, expensive, and slow.
This asymmetry reflects a fundamental mismatch between what evolution produced and what therapeutics require. Evolution optimizes for reproductive fitness over geological timescales, producing sequences that satisfied survival constraints under ancestral conditions. Therapeutic applications demand sequences optimized for entirely different objectives: binding a specific epitope with high affinity, expressing at therapeutic levels in a particular tissue, or evading immune recognition while retaining function. The sequences we need often lie far from natural evolutionary trajectories, in regions of sequence space that foundation models have never observed during training. Navigating this terra incognita requires not just accurate oracles that score candidate sequences, but principled strategies for proposing, testing, and refining designs where model reliability is uncertain.
Foundation models have begun to address this challenge by providing both generative priors over plausible sequences and differentiable oracles that guide optimization. Protein language models sample novel sequences respecting the statistical patterns of natural proteins. Structure-aware diffusion models generate backbones and sequences simultaneously, enabling design of proteins with specified geometries. Regulatory sequence models predict expression outcomes across thousands of candidate promoters, enabling gradient-based optimization toward desired activity profiles. When coupled with high-throughput experimental assays in closed-loop design cycles, these capabilities are transforming biological engineering.
31.1 Design Formalism
Before reading about the design formalism, consider: if you wanted to find a protein sequence with a specific binding property, why couldn't you simply enumerate all possible sequences and pick the best one? What makes this approach impractical, and what alternative strategies might you consider?
Sequence design inverts the standard prediction problem. Where prediction maps from sequence to function (given sequence \(x\), estimate property \(f(x)\)), design maps from desired function to sequence (given target property \(y^\star\), find sequence \(x^\star\) such that \(f(x^\star) \approx y^\star\)). This inversion is computationally challenging because biological sequence spaces are astronomically large. A 200-residue protein admits \(20^{200}\) possible sequences, vastly exceeding the number of atoms in the observable universe. Even a modest 500-base-pair regulatory element spans \(4^{500}\) possibilities. Exhaustive enumeration is impossible; intelligent search strategies are essential.
The design objective can take several mathematical forms depending on the application. Optimization problems seek sequences that maximize (or minimize) a scalar objective, such as finding \(x^\star = \arg\max_x f_\theta(x)\) where \(f_\theta\) might represent predicted binding affinity, expression level, or stability. Conditional generation problems sample sequences from a distribution conditioned on desired properties, drawing \(x \sim p_\theta(x \mid y)\) where \(y\) specifies structural constraints, functional requirements, or context. Constrained optimization problems combine objective maximization with explicit constraints, seeking \(x^\star = \arg\max_x f_\theta(x)\) subject to \(c(x) \leq 0\), where constraints \(c\) might enforce GC content limits, avoid restriction sites, or maintain similarity to natural sequences.
The table below summarizes how these different formulations apply to common design scenarios.
| Design Formulation | Mathematical Form | Typical Applications | Key Challenge |
|---|---|---|---|
| Optimization | \(x^\star = \arg\max_x f_\theta(x)\) | Maximize binding affinity, expression level | May find adversarial sequences |
| Conditional generation | \(x \sim p_\theta(x \mid y)\) | Generate sequences with specified structure | Requires well-calibrated conditional models |
| Constrained optimization | \(\max_x f_\theta(x)\) s.t. \(c(x) \leq 0\) | Optimize function while avoiding restriction sites | Constraint satisfaction adds complexity |
| Multi-objective | Pareto frontier of \((f_1, f_2, \ldots)\) | Balance affinity, stability, immunogenicity | No single optimal solution exists |
Foundation models contribute to design through multiple mechanisms. As generative priors, they assign higher probability to sequences resembling natural biology, regularizing optimization toward plausible regions of sequence space. As differentiable oracles, they enable gradient-based optimization where sequence modifications are guided by gradients of predicted properties. As embedding functions, they map discrete sequences into continuous spaces where interpolation and optimization become tractable (Section 5.6 for representation fundamentals; Section 8.1.2 for how pretraining shapes these spaces). The challenge lies in searching enormous combinatorial spaces while remaining within regimes where these model-based estimates remain reliable.
31.2 Protein Design with Language Models
Protein language models trained on evolutionary sequence databases (Chapter 16) have emerged as effective tools for protein design, providing both generative sampling capabilities and fitness estimation for candidate sequences. The masked language modeling objectives that enable fitness estimation are detailed in Section 8.1. The success of these approaches stems from a key insight: evolution has conducted billions of years of experiments on protein sequence space, and models trained on the surviving sequences implicitly encode constraints on what works.
31.2.1 Sequence Generation from Language Model Priors
Before reading about protein language model generation, recall from Chapter 8: What is the difference between autoregressive and masked language modeling objectives? How would each approach support sequence generation differently?
Autoregressive models (like GPT-style) predict each token given all previous tokens, making them naturally suited for sequential generation. Masked models (like BERT-style) predict masked tokens given surrounding context bidirectionally, supporting iterative refinement by masking and resampling positions. For protein design, autoregressive models generate sequences left-to-right, while masked models enable position-specific refinement of existing sequences.
Autoregressive protein language models such as ProGen and ProtGPT2 generate novel protein sequences by sampling tokens sequentially from learned distributions (Madani et al. 2023; Ferruz, Schmidt, and Höcker 2022). Given a partial sequence, the model predicts probability distributions over the next amino acid, enabling iterative extension until a complete protein emerges. This generation process can be unconditional (sampling from the full learned distribution) or conditional on control signals such as protein family annotations, organism of origin, or functional keywords.
Consider the temperature parameter in sequence generation. If you sample at T=0 (deterministic, always picking highest-probability amino acid), what kind of sequences would you expect to generate? What about at very high temperature? What tradeoff does this create for protein design?
At T=0, you generate the single most probable sequence, likely a “consensus” protein similar to highly abundant natural proteins. At very high T, you sample broadly including low-probability amino acids, producing diverse but potentially nonfunctional sequences. The tradeoff: low temperature exploits known-good sequence space (safe but unoriginal), high temperature explores novel space (creative but risky). Practical workflows sample across temperatures and use downstream filters.
The quality of generated sequences depends critically on how closely the sampling distribution matches functional proteins. Sequences sampled at low temperature (more deterministic) tend to resemble common protein families but may lack novelty. Sequences sampled at high temperature (more stochastic) exhibit greater diversity but risk straying into nonfunctional regions. The temperature parameter controls the entropy of the sampling distribution: at temperature T=0, the model deterministically selects the highest-probability token at each position; as T increases, lower-probability tokens become increasingly likely to be sampled. This creates a fundamental exploration-exploitation tradeoff: low temperatures exploit the model’s knowledge of natural sequences but may miss novel functional solutions, while high temperatures explore more broadly but venture into regions where the model’s predictions become unreliable. Practical design workflows often generate large libraries of candidates across temperature ranges, then filter using downstream oracles for structure, stability, or function.
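The sketch below illustrates temperature scaling for next-residue sampling. The logits are random stand-ins for the output of an autoregressive protein language model; only the temperature logic reflects the procedure described above.

```python
import numpy as np

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # 20 canonical residues

def sample_next_residue(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> str:
    """Sample one amino acid from temperature-scaled logits.

    Temperature -> 0 approaches argmax (greedy, consensus-like choices);
    higher temperature flattens the distribution and samples rarer residues.
    """
    if temperature <= 0:
        return AMINO_ACIDS[int(np.argmax(logits))]
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())      # numerically stable softmax
    probs /= probs.sum()
    return AMINO_ACIDS[rng.choice(len(AMINO_ACIDS), p=probs)]

# Illustrative use: sample repeatedly from one (fake) set of model logits.
rng = np.random.default_rng(0)
fake_logits = rng.normal(size=20)              # stand-in for a model's output
for T in (0.0, 0.7, 1.5):
    residues = [sample_next_residue(fake_logits, T, rng) for _ in range(5)]
    print(f"T={T}: {''.join(residues)}")
```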
Masked language models like ESM-2 support design through a different mechanism. Rather than generating sequences de novo, these models estimate the probability of each amino acid at each position given the surrounding context. Design proceeds by iterative refinement: starting from an initial sequence, positions are masked and resampled according to model predictions, gradually shifting the sequence toward higher-likelihood regions. This Gibbs-sampling-like procedure can be biased toward specific objectives by combining model likelihoods with scores from downstream predictors.
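A minimal sketch of this iterative refinement loop appears below. The `masked_marginals` function is a placeholder returning arbitrary probabilities; a real workflow would obtain per-position distributions from a masked model such as ESM-2 and could combine them with scores from downstream predictors before resampling.

```python
import numpy as np

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

def masked_marginals(sequence: list[str], position: int) -> np.ndarray:
    """Placeholder for a masked language model: return a probability distribution
    over amino acids at `position` given the rest of the sequence. A real workflow
    would run a forward pass with this position masked."""
    rng = np.random.default_rng(hash(("".join(sequence), position)) % (2**32))
    p = rng.random(20)
    return p / p.sum()

def gibbs_refine(sequence: str, n_sweeps: int = 5, seed: int = 0) -> str:
    """Iteratively mask and resample positions, drifting toward higher-likelihood sequences."""
    rng = np.random.default_rng(seed)
    seq = list(sequence)
    for _ in range(n_sweeps):
        for pos in rng.permutation(len(seq)):      # visit positions in random order
            probs = masked_marginals(seq, int(pos))
            seq[int(pos)] = AMINO_ACIDS[rng.choice(20, p=probs)]
    return "".join(seq)

print(gibbs_refine("MKTAYIAKQR"))
```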
The key advantage of protein language model-based design lies in data efficiency. Because models are pretrained on millions of natural sequences, they generalize to design tasks with minimal task-specific data. A model fine-tuned on a few hundred functional variants can propose candidates across sequence space, extrapolating far beyond the training examples. This contrasts with traditional directed evolution approaches that require extensive experimental screening to navigate sequence space.
31.2.2 Structure-Aware Design with Diffusion Models
Understanding structure-aware design requires some familiarity with how diffusion models operate in three-dimensional coordinate space. If you are unfamiliar with diffusion models (progressive denoising from noise to signal), you may wish to review the diffusion model literature or focus on the conceptual workflow rather than the mathematical details.
Structure-aware design addresses a fundamental limitation of sequence-only approaches: proteins function through three-dimensional structures, and sequence optimization without structural guidance may produce sequences that fail to fold correctly. The advent of accurate structure prediction (AlphaFold2, ESMFold; Section 16.4) enables new design paradigms that jointly consider sequence and structure.
RFdiffusion exemplifies this approach by generating protein backbones through a diffusion process in three-dimensional coordinate space (Watson et al. 2023). Starting from random noise, the model iteratively denoises toward plausible backbone geometries, conditioned on design specifications such as target binding interfaces, desired topology, or symmetric assembly requirements. The resulting backbones represent novel structures not observed in nature but predicted to be physically realizable.
Converting designed backbones to sequences requires inverse folding models that predict amino acid sequences likely to adopt a given structure. ProteinMPNN and ESM-IF operate on this principle, taking backbone coordinates as input and outputting probability distributions over sequences predicted to fold onto that backbone (Dauparas et al. 2022; Hsu et al. 2022). ESM-IF was trained largely on predicted structures for millions of sequences, connecting the inverse folding task to the scale of data that underlies the protein language model paradigm. These models can generate thousands of candidate sequences for a single backbone, enabling selection based on additional criteria such as expression likelihood or immunogenicity.
The power of structure-aware design lies in using 3D structure as an intermediate representation between function and sequence. Rather than searching directly in the astronomically large space of sequences, design first identifies a structure that would achieve the desired function, then finds sequences that fold to that structure. This factorization dramatically constrains the search space.
This two-stage pipeline (structure diffusion followed by inverse folding) has proven effective for creating novel proteins. Designed binders targeting challenging therapeutic targets, de novo enzymes with specified active site geometries, and symmetric protein assemblies with precise nanoscale dimensions have all been realized experimentally.
The table below compares sequence-based and structure-aware approaches across key design considerations.
| Consideration | Sequence-Based (PLM) | Structure-Aware (Diffusion + Inverse Folding) |
|---|---|---|
| Prior knowledge required | Protein family or starting sequence | Target structure or binding interface |
| Novel structure capability | Limited to known folds | Can generate entirely new topologies |
| Computational cost | Lower (sequence operations only) | Higher (3D coordinate generation) |
| Output diversity | Depends on sampling temperature | High (many sequences per backbone) |
| Experimental success rate | 30-50% express | 30-70% express; 5-30% functional |
| Best applications | Variant optimization, library design | De novo binders, enzymes, assemblies |
31.2.3 Functional Conditioning and Multi-Objective Optimization
Consider a therapeutic antibody design project. List at least four properties you would need to optimize simultaneously, and explain why optimizing for just one property (e.g., binding affinity) would be insufficient for a successful therapeutic.
Key properties include: (1) binding affinity to target, (2) specificity to avoid off-target effects, (3) manufacturability and expression levels, (4) stability during storage, (5) solubility to prevent aggregation, and (6) low immunogenicity to minimize immune responses. Optimizing only affinity could yield an antibody that binds excellently but aggregates during manufacturing, triggers immune responses in patients, or binds unintended targets causing toxicity.
Real therapeutic or industrial applications rarely optimize a single objective. A designed enzyme must not only be catalytically active but also stable at process temperatures, expressible in the production host, and resistant to proteolytic degradation. A therapeutic antibody must bind its target with high affinity while avoiding off-target interactions, maintaining solubility, and minimizing immunogenicity. These competing demands create multi-objective optimization problems where no single sequence optimizes all criteria simultaneously.
Multi-objective design produces Pareto frontiers, the set of solutions where no objective can be improved without worsening another, representing different trade-offs among objectives. A sequence might achieve exceptional binding affinity at the cost of reduced stability, while another balances moderate affinity with excellent developability properties. Practitioners must select among Pareto-optimal solutions based on application-specific priorities, and foundation models increasingly support this selection by providing diverse oracles across multiple property dimensions.
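The following sketch shows how a Pareto frontier can be extracted from a set of scored candidates. The affinity and stability values are arbitrary illustrative numbers, not outputs of any particular model.

```python
def pareto_front(candidates: list[dict]) -> list[dict]:
    """Return candidates not dominated on (affinity, stability); higher is better for both.
    A candidate is dominated if another is at least as good on every objective
    and strictly better on at least one."""
    front = []
    for a in candidates:
        dominated = any(
            b["affinity"] >= a["affinity"] and b["stability"] >= a["stability"]
            and (b["affinity"] > a["affinity"] or b["stability"] > a["stability"])
            for b in candidates
        )
        if not dominated:
            front.append(a)
    return front

# Illustrative scores (arbitrary units): no single design wins on both axes.
designs = [
    {"id": "seq_A", "affinity": 9.1, "stability": 2.0},
    {"id": "seq_B", "affinity": 7.5, "stability": 6.5},
    {"id": "seq_C", "affinity": 7.0, "stability": 6.0},   # dominated by seq_B
    {"id": "seq_D", "affinity": 4.2, "stability": 8.8},
]
for d in pareto_front(designs):
    print(d["id"], d["affinity"], d["stability"])
```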
Foundation models contribute to multi-objective design in three ways. Generative priors propose candidate sequences that satisfy basic plausibility constraints (foldability, expressibility) before optimization begins. Multiple differentiable oracles (for binding, stability, immunogenicity) enable gradient-based optimization toward Pareto frontiers. Embedding spaces support interpolation between sequences with different property profiles, enabling exploration of intermediate trade-offs. The combination of these capabilities makes foundation models central to modern protein design pipelines.
31.3 Regulatory Sequence Design
Genomic foundation models trained on chromatin accessibility, transcription factor binding, and gene expression data enable design of synthetic regulatory elements with specified activity profiles. Unlike protein design where the sequence-to-function mapping operates through three-dimensional structure, regulatory design must account for the genomic and cellular context in which elements function. The functional genomics resources described in Section 2.4 provide training data for these models, while the interpretability methods from Section 25.1 inform design strategies by revealing which sequence features drive predictions.
31.3.1 Promoter and Enhancer Engineering
Gradient-based design for regulatory elements uses the same saliency computations described in Section 25.1.2 for interpretation, but runs them “in reverse.” Before reading on, consider: if a saliency map tells you which nucleotides most affect the current prediction, how might you use this information to increase predicted expression in a target cell type?
Massively parallel reporter assays (MPRAs) have generated training data for models that predict expression levels from promoter and enhancer sequences (Section 2.4.4; Boer et al. 2019). These models learn sequence determinants of regulatory activity, including transcription factor binding sites, spacing constraints between elements, and context-dependent interactions. Once trained, the same models serve as oracles for design: by evaluating expression predictions across millions of candidate sequences, optimization algorithms can identify synthetic regulatory elements with desired properties.
Gradient-based design treats the sequence-to-expression model as a differentiable function. Starting from an initial sequence, gradients of predicted expression with respect to input positions indicate which mutations would increase (or decrease) activity. Because sequences are discrete while gradients are continuous, optimization requires relaxation strategies that operate on “soft” sequence representations before projecting back to discrete nucleotides. These approaches use the same saliency map computations used for model interpretation (Section 25.1.2), running the analysis in reverse to guide design rather than explain predictions.
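The sketch below illustrates the relax-optimize-project loop on a toy problem. The linear weight matrix `W` stands in for gradients obtained by backpropagating through a trained sequence-to-expression model, and the renormalization step is a crude stand-in for a proper simplex projection.

```python
import numpy as np

BASES = "ACGT"
rng = np.random.default_rng(1)
L = 50                                   # length of the candidate regulatory element

# Stand-in oracle: a linear model over soft one-hot features. A real design loop
# would obtain gradients by differentiating through a trained expression predictor.
W = rng.normal(size=(L, 4))

def predicted_expression(soft_seq: np.ndarray) -> float:
    return float((W * soft_seq).sum())

# Soft sequence: one probability distribution over A/C/G/T per position.
soft_seq = np.full((L, 4), 0.25)

for step in range(200):
    grad = W                             # d(score)/d(soft_seq) for the linear stand-in
    soft_seq += 0.05 * grad              # gradient ascent on predicted expression
    soft_seq = np.clip(soft_seq, 1e-6, None)
    soft_seq /= soft_seq.sum(axis=1, keepdims=True)   # crude projection back to the simplex

discrete = "".join(BASES[i] for i in soft_seq.argmax(axis=1))  # argmax projection to DNA
print(predicted_expression(soft_seq), discrete)
```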
Design objectives for regulatory elements extend beyond maximizing expression in a target context. Cell-type-specific enhancers should drive high expression in desired tissues while remaining inactive elsewhere. Inducible promoters should respond to specific signals while maintaining low basal activity. Compact regulatory elements are preferred for gene therapy applications where vector capacity is limited. These constraints transform simple optimization into multi-objective problems requiring careful balancing of competing requirements.
Generative models trained directly on regulatory sequences offer an alternative to optimization-based approaches. Autoregressive or diffusion models learn to sample novel enhancers and promoters that match the statistical properties of natural regulatory elements. Conditioning on cell type labels, chromatin state annotations, or other metadata enables generation of elements with targeted activity profiles. The advantage of generative approaches lies in their ability to produce diverse candidate libraries for experimental screening, rather than converging on a single optimized sequence that may exploit model artifacts rather than genuine biology.
31.3.2 Splicing and RNA Processing Elements
Models trained on splicing outcomes (SpliceAI and related architectures described in Chapter 6; see also Chapter 19 for RNA-specific foundation models) enable design of sequences that modulate RNA processing. Therapeutic applications include correcting pathogenic splice site mutations by strengthening weak splice sites or weakening aberrant ones, designing antisense oligonucleotides that redirect splicing to skip exons containing disease-causing mutations, and engineering alternative splicing outcomes to produce desired protein isoforms.
The design space for splicing elements encompasses splice site sequences themselves (the canonical GT-AG dinucleotides and surrounding intronic and exonic enhancers and silencers), branch point sequences, and auxiliary sequences that recruit splicing regulatory proteins. Foundation models that predict splicing patterns from local sequence context serve as oracles for evaluating candidate modifications, while gradient-based optimization identifies changes predicted to shift splicing toward therapeutic outcomes.
Design of splicing modulators requires particular attention to off-target effects. The splicing code is highly context-dependent, and sequence modifications intended to affect one splice site may inadvertently alter recognition of others. Genome-wide splicing models that predict effects across all splice sites provide essential off-target assessment, flagging candidate designs that would disrupt normal splicing at unintended locations.
31.4 mRNA Design and Optimization
Consider what you learned about protein design in Section 31.2. How does mRNA design differ fundamentally? What additional constraints does an mRNA therapeutic face compared to a designed protein?
Key differences:
- mRNA must encode the same protein sequence (constrained by genetic code), limiting design to synonymous codon choices and UTRs
- mRNA faces immune recognition as a foreign nucleic acid, requiring evasion strategies
- mRNA degrades rapidly, requiring stability optimization
- mRNA must be manufactured and delivered, adding constraints proteins do not face
Protein design optimizes amino acid sequence directly; mRNA design optimizes the nucleotide encoding while keeping the protein constant.
The clinical success of mRNA vaccines has intensified interest in systematic approaches to mRNA sequence design. Unlike protein or regulatory element design where the primary challenge is achieving desired function, mRNA design must simultaneously optimize translation efficiency, molecular stability, immune evasion, and manufacturing tractability. Foundation models increasingly contribute to each of these objectives.
31.4.1 Codon Optimization Principles
Before diving into codon optimization, recall the genetic code’s degeneracy. For a 100-amino-acid protein, roughly how many different mRNA sequences could encode the same protein? Why does this create both an opportunity and a challenge for mRNA design?
Most amino acids have 2-6 synonymous codons (Met and Trp have only 1). For 100 amino acids with ~3 average synonymous options per position, there are roughly 3^100 ≈ 10^47 possible encodings, an astronomical number. This creates opportunity because we can search for optimal encodings, but also creates a challenge because the search space is impossibly large, requiring smart optimization strategies rather than exhaustive search.
The genetic code is degenerate: sixty-one sense codons encode twenty amino acids, meaning that any protein sequence can be encoded by many different mRNA sequences. These synonymous sequences differ in translation efficiency, mRNA stability, and immunogenicity despite producing identical proteins. Codon optimization exploits this redundancy to improve therapeutic mRNA performance.
Although synonymous codons produce identical proteins, the mRNA sequences differ in ways that profoundly affect therapeutic outcomes. A single synonymous mutation can alter: (1) translation speed at that position, (2) mRNA secondary structure affecting stability, (3) recognition by innate immune sensors, and (4) ribosome pausing that affects co-translational folding. Codon optimization must navigate all these effects simultaneously.
Traditional codon optimization relied on codon adaptation indices derived from highly expressed genes in target organisms. Codons frequently used in abundant proteins were assumed to be efficiently translated, leading to optimization strategies that maximize use of preferred codons. This approach oversimplifies the complex relationship between codon choice and expression. Translation elongation rate varies with codon-anticodon interactions, tRNA abundance, mRNA secondary structure, and ribosome queuing effects. Local codon context matters: rare codons following abundant ones may be translated efficiently, while runs of preferred codons can cause ribosome collisions.
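As a point of reference, the codon adaptation index underlying traditional optimization is a geometric mean of relative adaptiveness weights, as in the sketch below. The usage fractions shown are illustrative rather than taken from any real organism's codon table; a real pipeline would load a usage table for the intended expression host.

```python
import math

# Illustrative (not organism-accurate) codon usage fractions for three amino acids.
CODON_USAGE = {
    "GCU": 0.26, "GCC": 0.40, "GCA": 0.22, "GCG": 0.12,   # Ala
    "AAA": 0.42, "AAG": 0.58,                              # Lys
    "GAA": 0.68, "GAG": 0.32,                              # Glu
}
SYNONYMS = {
    "Ala": ["GCU", "GCC", "GCA", "GCG"],
    "Lys": ["AAA", "AAG"],
    "Glu": ["GAA", "GAG"],
}

# Relative adaptiveness: each codon's usage divided by its most-used synonym.
WEIGHTS = {
    codon: CODON_USAGE[codon] / max(CODON_USAGE[c] for c in codons)
    for codons in SYNONYMS.values()
    for codon in codons
}

def codon_adaptation_index(codons: list[str]) -> float:
    """Geometric mean of relative adaptiveness across the coding sequence."""
    return math.exp(sum(math.log(WEIGHTS[c]) for c in codons) / len(codons))

print(codon_adaptation_index(["GCC", "AAG", "GAG"]))   # all preferred codons -> CAI = 1.0
print(codon_adaptation_index(["GCG", "AAA", "GAA"]))   # includes less-preferred codons -> lower CAI
```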
Machine learning models trained on ribosome profiling data and reporter assays have begun to capture these context-dependent effects. These models predict translation efficiency from sequence features including codon frequencies, local secondary structure, and amino acid properties. Using such models as oracles, optimization algorithms can search for mRNA sequences that maximize predicted translation while avoiding problematic sequence features. The resulting designs often differ substantially from simple codon-frequency optimization, incorporating rare codons at specific positions to optimize local translation dynamics.
31.4.2 Stability Engineering and UTR Design
Think back to regulatory element design (Section 31.3). How is UTR design similar to enhancer/promoter design? How is it different? What properties must a 5’ UTR balance that a promoter does not?
Similarities: Both are non-coding regulatory sequences where foundation models predict activity; both use gradient-based or generative design approaches. Key differences: (1) UTRs are transcribed into RNA (must consider RNA structure, not DNA), (2) 5’ UTRs must balance ribosome recruitment (high translation) against secondary structure that blocks scanning (low translation), (3) 3’ UTRs affect mRNA half-life via RNA-binding protein sites, while promoters do not face degradation. UTRs operate post-transcriptionally; promoters control transcription initiation.
mRNA stability in the cytoplasm determines the duration of protein production and thus the dose required for therapeutic effect. Stability is governed by multiple sequence features: the 5’ and 3’ untranslated regions (UTRs) that flank the coding sequence, the presence of destabilizing sequence motifs recognized by RNA-binding proteins, and secondary structures that protect against or expose the molecule to nucleases.
UTR engineering represents a particularly active area of foundation model application. Natural UTRs contain binding sites for regulatory proteins and microRNAs, sequences that affect ribosome recruitment, and structures that influence mRNA localization and stability. Foundation models trained on expression data across diverse UTR sequences learn which features promote stability and efficient translation. Design algorithms then search for synthetic UTRs that maximize these properties while avoiding sequences that trigger immune recognition or rapid degradation.
Chemical modifications of mRNA (pseudouridine, N1-methylpseudouridine, and other nucleoside analogs) dramatically improve stability and reduce immunogenicity. These modifications alter the sequence-function relationship in ways that current foundation models, trained primarily on natural RNA, may not fully capture. Emerging models that incorporate modification information promise to enable joint optimization of sequence and modification patterns.
31.4.3 Immunogenicity Considerations
Recall multi-objective optimization from protein design (Section 31.2.3). For an mRNA therapeutic, you need to balance: (1) high translation efficiency, (2) long mRNA half-life, (3) low immunogenicity, and (4) manufacturability. Why can't you simply maximize each property independently? What is the likely tradeoff between translation efficiency and immunogenicity?
These objectives create competing constraints: (1) High GC content can improve stability but increases immunogenicity via TLR recognition; (2) Rare codons reduce translation but may reduce immune detection; (3) Strong secondary structures protect from nucleases but can block ribosome scanning; (4) Chemical modifications reduce immunogenicity but increase manufacturing cost. Translation efficiency often requires features (abundant codons, specific motifs) that immune sensors recognize. Optimization must find Pareto-optimal solutions balancing these tradeoffs, not single-objective maxima.
Exogenous mRNA triggers innate immune responses through pattern recognition receptors including Toll-like receptors (TLR3, TLR7, TLR8) and cytosolic sensors (RIG-I, MDA5). While some immune activation may be beneficial for vaccine applications, excessive inflammation limits dosing and causes adverse effects. For protein replacement therapies where repeated dosing is required, minimizing immunogenicity is essential.
The immunostimulatory potential of mRNA depends on sequence features including GC content, specific sequence motifs recognized by pattern receptors, and secondary structures that resemble viral replication intermediates. Foundation models that predict immunogenicity from sequence enable design of mRNAs that evade innate immune detection. These predictions must be balanced against other objectives: modifications that reduce immunogenicity may also reduce translation efficiency, creating multi-objective trade-offs that characterize mRNA design more broadly.
31.5 Antibody and Vaccine Design
Antibody engineering represents one of the most commercially significant applications of computational protein design. The modular architecture of antibodies (framework regions that maintain structural integrity surrounding hypervariable complementarity-determining regions (CDRs) that mediate antigen recognition) creates a well-defined design problem: optimize CDR sequences to achieve desired binding properties while maintaining framework stability and developability.
31.5.1 CDR Optimization and Humanization
Antibodies discovered through animal immunization or phage display often require optimization before therapeutic use. Non-human framework sequences may trigger immune responses in patients, necessitating humanization that replaces framework residues with human equivalents while preserving antigen binding. CDR sequences may require affinity maturation to achieve therapeutic potency or specificity optimization to reduce off-target binding.
Foundation models support antibody optimization through multiple mechanisms. Antibody-specific language models trained on paired heavy and light chain sequences learn the structural and functional constraints on CDR sequences. These models predict which mutations are compatible with the antibody fold and which are likely to disrupt structure. Given a parental antibody sequence, the models can propose libraries of variants enriched for functional candidates, reducing the experimental screening burden required to identify improved variants.
Structure-aware approaches enable more targeted design. Given a structure of the antibody-antigen complex (determined experimentally or predicted computationally via methods discussed in Section 16.4), optimization focuses on residues at the binding interface. Computational saturation mutagenesis predicts the effect of every possible amino acid substitution at each interface position, identifying combinations expected to improve affinity. These predictions guide the construction of focused libraries that explore the most promising region of sequence space.
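A minimal sketch of in silico saturation mutagenesis is shown below. The `predict_binding_score` function is a dummy placeholder; a real pipeline would call a trained affinity or interface-energy predictor at that point, and the interface positions and parental sequence are purely illustrative.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def predict_binding_score(sequence: str) -> float:
    """Placeholder oracle returning an arbitrary score. A real pipeline would call a
    trained affinity predictor or interface-energy model here."""
    return (hash(sequence) % 1000) / 1000.0

def saturation_mutagenesis(parent: str, interface_positions: list[int]) -> list[tuple]:
    """Score every single-residue substitution at the specified interface positions."""
    baseline = predict_binding_score(parent)
    results = []
    for pos in interface_positions:
        for aa in AMINO_ACIDS:
            if aa == parent[pos]:
                continue
            mutant = parent[:pos] + aa + parent[pos + 1:]
            delta = predict_binding_score(mutant) - baseline
            results.append((f"{parent[pos]}{pos + 1}{aa}", delta))   # mutation label, predicted change
    return sorted(results, key=lambda r: r[1], reverse=True)         # best predicted gains first

top = saturation_mutagenesis("QVQLVQSGAEVKKPGASVKV", interface_positions=[3, 7, 11])
print(top[:5])   # candidate substitutions to include in a focused library
```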
31.5.2 Vaccine Antigen Design
Vaccine development increasingly employs computational design to create immunogens that elicit protective immune responses. The challenge differs from therapeutic protein design: rather than optimizing for direct biological activity, vaccine antigens must be recognized by the immune system and induce antibodies or T cells that protect against pathogen challenge.
Foundation models contribute to vaccine design in several ways. Epitope prediction models identify regions of pathogen proteins most likely to be recognized by antibodies or T cells, guiding selection of vaccine targets. Structural models predict how mutations affect epitope conformation, enabling design of stabilized antigens that maintain native epitope structure during manufacturing and storage. Glycan shielding analysis predicts which epitopes will be accessible on the pathogen surface versus hidden by glycosylation, focusing vaccine design on exposed regions.
The rapid development of mRNA vaccines against SARS-CoV-2 demonstrated the potential of computational approaches to accelerate vaccine design. Structure-guided stabilization of the prefusion spike conformation, optimization of mRNA sequences for expression and stability, and prediction of variant effects on vaccine efficacy all benefited from computational modeling. Future vaccine development will increasingly integrate foundation model predictions throughout the design process.
31.6 Closed-Loop Design-Build-Test-Learn Cycles
Foundation models achieve their full potential when integrated into iterative experimental workflows. The design-build-test-learn (DBTL) paradigm treats computational predictions as hypotheses to be tested experimentally, with results feeding back to improve both the designed molecules and the models that guide design. This closed-loop approach connects to the lab-in-the-loop concepts introduced in Section 30.5.3.
31.6.1 Active Learning for Efficient Exploration
Imagine you have a budget to experimentally test 100 protein variants, but your foundation model proposes 10,000 candidates. Some candidates have high predicted fitness but the model is uncertain; others have moderate predictions but high confidence. How would you decide which 100 to test? What are the tradeoffs between “exploiting” high predictions versus “exploring” uncertain regions?
Experimental validation remains the bottleneck in biological design. Even high-throughput assays can test at most thousands to millions of variants, a tiny fraction of possible sequences. Active learning strategies select which experiments to perform by balancing two competing objectives: exploiting current model predictions to test sequences likely to succeed, and exploring regions of uncertainty to gather data that will improve the model.
Bayesian optimization provides a principled framework for this trade-off. A surrogate model (typically a Gaussian process or ensemble neural network) approximates the sequence-to-fitness mapping. Acquisition functions such as expected improvement or upper confidence bound combine predicted function values with uncertainty estimates to select informative test sequences. The expected improvement acquisition function, for example, computes the probability-weighted average improvement over the current best sequence, naturally balancing regions of high predicted fitness (likely to improve) against regions of high uncertainty (potentially hiding superior solutions). Upper confidence bound adds a tunable exploration parameter that explicitly controls how much to favor uncertain regions. After each experimental round, the surrogate model is updated with new data, and the process repeats. This iterative refinement concentrates experimental resources on the most promising and informative regions of sequence space rather than uniformly sampling the combinatorially vast possibilities.
Foundation models enhance active learning by providing informative priors and features. Rather than learning sequence-to-function mappings from scratch, surrogate models can operate on protein language model embeddings that capture evolutionary relationships and structural constraints. These embeddings provide a meaningful notion of sequence similarity even before any task-specific data is available, accelerating the early rounds of optimization when labeled data is scarce.
Consider these factors when selecting an active learning approach:
- When labeled data is scarce: Use foundation model embeddings as features for your surrogate model; they provide useful priors even with few labels
- When experimental costs are high: Favor acquisition functions that emphasize exploration (e.g., upper confidence bound with high exploration parameter) to maximize information gain per experiment
- When you need quick wins: Favor exploitation-heavy strategies that test sequences with highest predicted fitness, accepting that you may miss better optima
- When model reliability is uncertain: Use ensemble disagreement as an additional uncertainty measure; avoid testing sequences where all models confidently agree (may be exploiting shared artifacts)
31.6.2 High-Throughput Experimentation Integration
Modern experimental platforms generate data at scales well-matched to foundation model training. Deep mutational scanning (DMS) systematically characterizes thousands of single-mutant variants of a protein, mapping the functional landscape around a parental sequence (see Section 2.4.4 for discussion of DMS data resources). Massively parallel reporter assays test tens of thousands of regulatory element variants in a single experiment. CRISPR screens introduce perturbations across the genome and measure phenotypic consequences.
These assays generate dense local maps of sequence-function relationships that complement the global patterns captured by foundation models. The integration is bidirectional: model predictions prioritize which variants to include in experimental libraries, and experimental results fine-tune models for improved accuracy in relevant sequence neighborhoods. After several DBTL cycles, the combined system (fine-tuned model plus accumulated experimental data) can often design sequences that substantially outperform the parental molecule.
The design of experiments themselves benefits from computational guidance. Rather than testing all possible single mutants, active learning identifies the most informative subset. Rather than random library construction, computational analysis identifies epistatic interactions that should be explored through combinatorial variants. The cost of DNA synthesis and high-throughput assays makes efficient experimental design increasingly important as design ambitions grow.
31.7 Validation Requirements and Failure Modes
Computational design generates hypotheses; experimental validation determines whether those hypotheses are correct. The gap between predicted and observed performance represents the ultimate test of design methods, and understanding where predictions fail is essential for improving both models and design strategies. The evaluation principles discussed in Chapter 11 and uncertainty quantification from Chapter 24 apply directly to design validation.
31.7.1 Validation Hierarchy
Designed sequences must pass through multiple validation stages before achieving real-world impact. Computational validation confirms that designs satisfy specified constraints and achieve predicted scores, filtering obvious failures before synthesis. In vitro validation tests whether designed proteins express, fold, and exhibit predicted activities in simplified experimental systems. In vivo validation assesses function in cellular or animal contexts where additional complexity may reveal unanticipated problems. Clinical validation, for therapeutic applications, determines whether designs are safe and effective in human patients.
Success rates decline at each stage of this hierarchy. Computationally promising designs often fail to express or fold correctly. Designs that succeed in vitro may lose activity in cellular contexts due to incorrect localization, unexpected degradation, or off-target interactions. Molecules that perform well in model organisms may fail in human clinical trials due to immunogenicity, toxicity, or pharmacokinetic limitations. The attrition from computational design to clinical success remains substantial, motivating continued improvement in predictive accuracy and earlier identification of failure modes.
31.7.2 Characteristic Failure Patterns
Practitioners should be aware of these systematic failure modes in model-guided design:
- Distribution shift: Optimization pushes sequences into regions where model predictions are unreliable
- Mode collapse: Generative models produce variants of training sequences rather than genuinely novel molecules
- Reward hacking: Optimization exploits model artifacts rather than genuine sequence-function relationships
- Missing properties: Models cannot predict properties absent from training data (e.g., aggregation under manufacturing conditions)
Mitigation strategies include ensemble methods, novelty filters, uncertainty quantification, and experimental validation in application-relevant conditions.
Foundation model-guided design exhibits systematic failure modes that practitioners must recognize and mitigate. Distribution shift occurs when optimization pushes sequences into regions where model predictions are unreliable (Section 11.7.1 for detailed discussion of distribution shift in genomic models; Section 24.7 for detection methods). A model trained on natural proteins may produce confident but incorrect predictions for designed sequences that lie far from training data. Regularization toward natural sequence statistics and uncertainty quantification help identify when designs have strayed beyond reliable prediction regimes.
Mode collapse in generative models produces designs that are variants of training sequences rather than genuinely novel molecules. When generated sequences can be matched to close homologs in training data, the design process has failed to create anything new. Novelty filters and diversity requirements during generation help ensure that computational design adds value beyond database retrieval.
Reward hacking occurs when optimization exploits model artifacts rather than genuine sequence-function relationships. A model might predict high expression for sequences containing spurious features that happen to correlate with expression in training data but have no causal effect. Ensemble methods, where designs must score highly across multiple independently trained models, provide some protection against hacking individual model weaknesses.
The most insidious failures involve properties that models cannot predict because they were absent from training data. A designed protein might aggregate under manufacturing conditions never encountered during model development. A regulatory element might be silenced by chromatin modifications specific to the therapeutic context. These failures can only be identified through experimental validation in relevant conditions, motivating the closed-loop DBTL approach that continuously tests designs in application-relevant settings.
31.8 Practical Design Constraints
Beyond achieving desired function, practical design must satisfy numerous constraints arising from manufacturing, safety, and deployment requirements.
31.8.1 Manufacturing and Developability
Designed proteins must be producible at scale in expression systems such as bacteria, yeast, or mammalian cells. Expression levels, solubility, and purification behavior determine manufacturing feasibility and cost. Foundation models trained on expression data can predict which sequences are likely to express well, enabling design pipelines that optimize not only for function but for manufacturability. For therapeutic proteins, developability encompasses additional properties including stability during storage, compatibility with formulation requirements, and behavior during analytical characterization. Aggregation propensity, chemical degradation sites (oxidation, deamidation), and glycosylation patterns all affect developability. Computational tools increasingly predict these properties from sequence, enabling their incorporation as design constraints.
31.8.2 Safety and Biosecurity Considerations
The same capabilities that enable beneficial design applications also raise biosecurity concerns. Generative models trained on pathogen sequences might in principle be used to design enhanced pathogens or reconstruct dangerous organisms. The dual-use potential of biological design technology requires ongoing attention to safety practices and governance frameworks.
Current foundation models do not provide straightforward paths to bioweapon development; designing a functional pathogen requires capabilities far beyond predicting sequence properties. As models improve and integrate with automated synthesis and testing platforms, the barrier to misuse may decrease. Responsible development practices, including careful consideration of training data, model access policies, and monitoring for concerning use patterns, are essential components of the foundation model ecosystem. These considerations connect to the broader discussion of safety and ethics in Chapter 27.
31.9 Algorithmic Search and Optimization
Design algorithms must navigate vast sequence spaces to identify candidates with desired properties. Several algorithmic paradigms have proven effective, each with characteristic strengths and limitations.
For each of the following design scenarios, which algorithmic approach would you choose and why?
- Optimizing a single position in an enzyme active site for catalytic activity
- Designing a library of 10,000 diverse antibody variants for experimental screening
- Finding a sequence that maximizes binding affinity while maintaining stability above a threshold
- Exploring the fitness landscape around a well-characterized parental protein
(1) Gradient-based optimization or exhaustive search - a single position allows exhaustive evaluation of all 20 amino acids, or gradient methods if using soft encodings. (2) Monte Carlo or evolutionary algorithms - generate diverse populations naturally and avoid mode collapse. (3) Constrained optimization or multi-objective evolutionary algorithms - explicit constraint handling for the stability threshold. (4) Bayesian optimization - sample-efficient exploration when starting from a known good sequence; balances exploitation and exploration naturally.
Gradient-based optimization treats foundation models as differentiable functions and computes gradients of objectives with respect to input sequence representations. Because sequences are discrete while gradients are continuous, optimization operates on relaxed representations (probability distributions over nucleotides or amino acids) that are projected back to discrete sequences for evaluation. The relaxation step is necessary because gradient descent requires continuous inputs: a “soft” one-hot encoding represents each position as a probability distribution over amino acids (e.g., 0.7 Ala, 0.2 Gly, 0.1 Ser) rather than a discrete choice, allowing gradients to flow and guide optimization. Projection to discrete sequences occurs either through argmax (selecting the highest-probability amino acid) or through stochastic sampling from the learned distribution. This approach efficiently navigates high-dimensional spaces but can produce adversarial sequences that exploit model weaknesses rather than achieving genuine biological function.
Evolutionary algorithms maintain populations of candidate sequences that undergo mutation, recombination, and selection based on fitness scores from foundation model oracles or experimental assays. This approach naturally handles discrete sequence spaces and can maintain diversity to avoid local optima. Multi-objective evolutionary algorithms explicitly construct Pareto frontiers of solutions trading off competing objectives.
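A minimal mutate-and-select loop (omitting recombination for brevity) is sketched below. The `fitness` function is a placeholder that rewards similarity to an arbitrary target string; a real loop would score candidates with a foundation model oracle or experimental measurements.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def fitness(seq: str) -> float:
    """Placeholder oracle: fraction of positions matching an arbitrary target motif."""
    target = "MKTAYIAKQRQISFVKSHFS"
    return sum(a == b for a, b in zip(seq, target)) / len(target)

def mutate(seq: str, rate: float = 0.05) -> str:
    return "".join(random.choice(AMINO_ACIDS) if random.random() < rate else aa for aa in seq)

def evolve(pop_size: int = 50, generations: int = 30, elite_frac: float = 0.2) -> str:
    population = ["".join(random.choices(AMINO_ACIDS, k=20)) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        elite = ranked[: int(pop_size * elite_frac)]                        # selection
        population = elite + [mutate(random.choice(elite)) for _ in range(pop_size - len(elite))]
    return max(population, key=fitness)

random.seed(0)
best = evolve()
print(best, fitness(best))
```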
Bayesian optimization models the sequence-to-fitness mapping with a probabilistic surrogate (typically a Gaussian process or ensemble neural network) and uses acquisition functions to balance exploration of uncertain regions with exploitation of predicted optima. This approach is particularly effective when experimental evaluations are expensive and each design round must be carefully chosen.
Monte Carlo methods sample sequences from distributions defined by foundation model likelihoods, optionally biased toward high-scoring regions through importance weighting or Markov chain Monte Carlo. These approaches naturally integrate foundation model priors with task-specific objectives and can generate diverse candidate sets for experimental screening.
The table below summarizes when to use each algorithmic approach.
| Algorithm | Best When | Strengths | Limitations |
|---|---|---|---|
| Gradient-based | Differentiable oracle available; continuous relaxation feasible | Fast; high-dimensional | Adversarial solutions; local optima |
| Evolutionary | Need diversity; multi-objective | Handles discrete spaces; Pareto fronts | Slower convergence |
| Bayesian optimization | Expensive experiments; need uncertainty | Sample-efficient; principled exploration | Scales poorly to high dimensions |
| Monte Carlo | Foundation model provides good prior; want library diversity | Natural uncertainty; diverse outputs | May be slow to find optima |
The choice among algorithmic approaches depends on the specific design problem, available computational resources, and experimental constraints. Many practical pipelines combine multiple approaches: generative sampling to produce initial candidate pools, gradient-based refinement to optimize specific objectives, and active learning to select informative experimental tests.
31.10 Evaluating Generative Design
Assessing whether a generative model produces useful designs requires metrics that capture multiple dimensions of quality. Unlike discriminative models evaluated by accuracy on held-out data, generative models must produce outputs that are simultaneously novel (not merely retrieving training examples), valid (satisfying basic biological constraints), diverse (exploring the design space rather than collapsing to narrow modes), and functional (achieving desired biological properties). No single metric captures all these requirements; comprehensive evaluation demands a suite of complementary assessments.
31.10.1 Computational Quality Metrics
Perplexity and likelihood measure how well generated sequences match the statistical patterns of natural biology. A generative model trained on natural proteins should assign higher likelihood to generated sequences that resemble natural ones. Perplexity (the exponentiated average negative log-likelihood) provides a scalar summary: lower perplexity indicates that generated sequences appear more natural according to the model’s learned distribution. However, low perplexity alone does not guarantee functional designs; a model might generate highly probable but biologically inert sequences that closely mimic common motifs without capturing rare functional features.
Novelty quantifies how different generated sequences are from training data. Sequence identity to nearest training neighbors provides a simple measure: sequences with less than 30% identity to any training protein clearly represent novel designs. More sophisticated approaches compute distances in embedding space, identifying generated sequences that occupy regions unrepresented in training data. The challenge lies in balancing novelty against validity: sequences too similar to training data offer limited design value, while sequences too different may fail to fold or function.
Diversity measures the variety within a set of generated sequences. Internal diversity metrics quantify pairwise distances within generated batches; low diversity indicates mode collapse where the model repeatedly generates similar sequences. Coverage metrics assess what fraction of the natural sequence space is represented by generated samples. Diversity matters for practical applications: a design campaign benefits from exploring multiple solutions rather than converging on a single candidate, since experimental validation will reveal unpredicted failures among computationally promising designs.
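A minimal sketch of novelty and internal diversity computations appears below, using simple positional identity as a stand-in for alignment-based sequence identity; the training and generated sequences are illustrative.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two sequences
    (a stand-in for proper alignment-based identity)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def novelty(generated: list[str], training: list[str]) -> list[float]:
    """Novelty of each generated sequence: 1 minus identity to its nearest training neighbor."""
    return [1.0 - max(identity(g, t) for t in training) for g in generated]

def internal_diversity(generated: list[str]) -> float:
    """Mean pairwise distance within the generated batch; values near zero signal mode collapse."""
    pairs = [(i, j) for i in range(len(generated)) for j in range(i + 1, len(generated))]
    return sum(1.0 - identity(generated[i], generated[j]) for i, j in pairs) / len(pairs)

training_set = ["MKTAYIAKQR", "MKSAYIAKQR", "MATAYLAKQR"]
generated_set = ["MKTAYIAKQR",   # exact copy of a training sequence: zero novelty
                 "MQTPYVGKHR",
                 "MQTPYVGKHK"]
print(novelty(generated_set, training_set))
print(internal_diversity(generated_set))
```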
Validity assesses whether generated sequences satisfy basic biological constraints. For proteins, validity might require that sequences are predicted to fold (using AlphaFold2 or ESMFold structure prediction), contain no forbidden amino acid patterns, and have appropriate length distributions. For regulatory elements, validity might require balanced GC content, absence of restriction enzyme sites, and compatibility with delivery vectors. Validity filters identify obvious failures before expensive experimental testing, but passing validity checks does not guarantee function.
31.10.2 Functional Assessment
Computational metrics provide necessary but insufficient evidence of design success. The ultimate test is whether generated sequences achieve their intended biological function, which requires experimental validation that can only partially be predicted computationally.
Structure prediction offers an intermediate level of assessment between purely computational metrics and experimental validation. AlphaFold2 and ESMFold predict whether generated protein sequences fold into well-defined structures, with predicted local distance difference test (pLDDT) scores providing residue-level confidence estimates (Section 16.4). Designs with low predicted confidence likely fail to fold correctly. For generated regulatory elements, models like Enformer predict expression levels and chromatin state, providing functional estimates without wet-lab experiments. These predictions inherit the limitations of the underlying models: they may overestimate success for sequences that exploit model artifacts rather than achieving genuine biological function.
Oracle model evaluation uses trained predictors to estimate functional properties of generated sequences. Binding affinity predictors assess designed antibodies; stability predictors evaluate protein designs; expression models score regulatory elements. When oracle models are distinct from the generative model, this evaluation provides independent evidence of quality. However, oracle models themselves have limited accuracy, particularly for sequences far from their training distributions. A design might score highly on an oracle that has never seen similar sequences, yet fail experimentally.
Experimental success rates provide ground truth that computational metrics can only approximate. Published design studies report widely varying success rates depending on the design target and evaluation criteria. De novo protein design achieves expression rates of 30-70% for well-designed sequences, with functional activity observed in 5-30% of expressed candidates (Huang, Boyken, and Baker 2016). Designed antibodies targeting challenging epitopes may yield functional binders from 1-10% of tested sequences. Regulatory element design success rates vary enormously depending on the complexity of the specification and the stringency of activity requirements.
These success rates reflect the combined limitations of generative models, oracle predictors, and experimental systems. Improving any component, whether by training better generative priors, developing more accurate oracles, or refining experimental assays, can increase overall design success. The closed-loop DBTL approach (Section 31.6) systematically addresses these limitations by using experimental failures to improve models for subsequent design rounds.
31.10.3 Benchmarking Generative Models
Standardized benchmarks enable comparison across generative approaches, though constructing appropriate benchmarks for design presents unique challenges. Unlike prediction benchmarks where held-out data provides unambiguous ground truth, design benchmarks must assess open-ended generation where many valid solutions exist.
Retrospective benchmarks evaluate whether models can recover known functional sequences when given appropriate conditioning. Given a protein structure, can an inverse folding model generate a sequence that folds to that structure? Given a desired expression profile, can a regulatory model generate an element that achieves it? These evaluations test necessary capabilities but may not predict performance on genuinely novel design targets where the correct answer is unknown.
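For inverse folding, the standard retrospective metric is native sequence recovery: the fraction of positions at which a designed sequence reproduces the native residue for the target structure. The helper below computes it for pre-aligned sequences; the commented model call is a hypothetical placeholder for whatever inverse folding interface is in use.

```python
def sequence_recovery(designed: str, native: str) -> float:
    """Fraction of identical residues between a designed and a native sequence
    of the same (aligned) length."""
    assert len(designed) == len(native), "sequences must be aligned to the same length"
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)

# recovery = [
#     sequence_recovery(inverse_folding_model.design(structure), native_seq)  # hypothetical API
#     for structure, native_seq in benchmark_pairs
# ]
# print(f"mean native sequence recovery: {sum(recovery) / len(recovery):.2f}")
```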
Prospective experimental validation provides the strongest benchmark but requires substantial resources. Community efforts like CASP (Critical Assessment of Structure Prediction) have driven progress in prediction; analogous competitions for design could similarly accelerate the field. Current efforts to establish design benchmarks include collections of deep mutational scanning data for evaluating predicted fitness landscapes and standardized assays for comparing designed proteins to natural sequences.
Meta-evaluation assesses whether computational metrics predict experimental outcomes. If high novelty correlates with experimental failure while low perplexity correlates with success, practitioners can use computational metrics to prioritize candidates. Establishing these correlations requires accumulating paired computational-experimental data across diverse design campaigns. As more groups publish both metrics and validation results, the field develops better understanding of which computational assessments matter for practical success.
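A minimal form of meta-evaluation is a rank correlation between a computational metric and measured activity across a campaign. The sketch below uses SciPy's Spearman correlation; the input arrays stand in for paired computational-experimental records accumulated across designs.

```python
import numpy as np
from scipy.stats import spearmanr

def metric_vs_outcome(metric_values: np.ndarray, experimental_activity: np.ndarray):
    """Rank correlation between a computational metric and measured activity."""
    rho, pvalue = spearmanr(metric_values, experimental_activity)
    return rho, pvalue

# Example: if lower perplexity tracks higher activity, rho will be negative.
# rho, p = metric_vs_outcome(perplexities, measured_activities)
```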
31.11 From Understanding to Creating
Sequence design represents the frontier where foundation models transition from tools for understanding biology to engines for creating it. The field has advanced from designing individual stable proteins to engineering complex molecular machines, from optimizing isolated regulatory elements to programming cellular behavior, from incremental improvement of existing sequences to de novo creation of functions not found in nature. The constraints of natural evolution no longer bound the sequences we can consider; the statistical patterns of existing biology provide priors that guide exploration of novel territory.
The validation bottleneck persists as perhaps the most fundamental limitation. Computational design can propose candidates faster than experiments can test them, creating pressure to improve both predictive accuracy (reducing false positives that waste experimental resources) and experimental throughput (enabling more designs to be evaluated). Automated laboratories, standardized assay platforms, and improved experimental design methods all contribute to accelerating the design-build-test-learn cycle, but the gap between computational proposal and experimental validation remains substantial.
The transition from prediction to design amplifies both the potential benefits and the risks of these technologies. A model that predicts protein function enables analysis; a model that designs protein function enables creation. Ensuring that designed biology serves human flourishing while minimizing potential harms requires not just technical advances but thoughtful governance, inclusive deliberation about applications, and ongoing attention to safety. These broader considerations connect sequence design to regulatory, ethical, and societal dimensions (Chapter 27), where the technical capabilities developed throughout genomic AI meet the human systems that will determine how they are used.
Before reviewing the summary, test your recall:
- Why is the design problem fundamentally harder than the prediction problem? What changes when you invert from “sequence → function” to “function → sequence”?
Prediction evaluates one sequence (computable in milliseconds); design must search astronomically large spaces (20^200 for a 200-residue protein). Prediction stays near training data where models are accurate; design optimizes toward extremes where models become unreliable (distribution shift). Prediction has ground truth for validation; design creates novel sequences with unknown true function, requiring expensive experimental validation.
- Compare sequence-based protein design (using language models) with structure-aware design (using diffusion and inverse folding). What are the advantages of each approach?
Sequence-based (PLMs): Advantages include computational efficiency, strong priors from evolutionary data, and suitability for variant optimization. Limitations: largely restricted to known fold families. Structure-aware (RFdiffusion + ProteinMPNN): Advantages include the ability to generate entirely novel topologies, target specific binding geometries, and design function-first. Limitations: higher computational cost and the need for a structural specification. Choose PLMs for optimizing existing proteins, structure-aware methods for de novo creation.
- What are the four key dimensions for evaluating generative sequence models (novelty, validity, diversity, functionality)? Why is no single metric sufficient?
- Novelty: distance from training data (avoid memorization)
- Validity: satisfies basic constraints (foldable, correct chemistry)
- Diversity: variation among outputs (avoid mode collapse)
- Functionality: achieves intended purpose (ultimate test)
These can trade off: highly novel sequences may have lower validity. High validity does not guarantee function. High diversity is useless if all designs fail. Comprehensive evaluation requires assessing all dimensions.
- Describe a design-build-test-learn cycle for protein engineering. How do foundation models contribute at each stage, and where does the bottleneck typically occur?
Design: FM scores candidates, optimization identifies promising sequences (seconds-minutes). Build: DNA synthesis and assembly create physical constructs (days). Test: expression and functional assays measure performance (days-weeks). Learn: results update models for next cycle. FMs contribute in Design (generative priors, fitness prediction) and Learn (incorporating experimental data to improve predictions). Bottleneck: Build-Test phases (days-weeks) vs. fast Design (seconds), limiting iteration speed.
- What characteristic failure modes should you watch for in model-guided design (distribution shift, mode collapse, reward hacking)? Give an example of each.
Distribution shift: optimization pushes into regions where FM predictions are unreliable (e.g., designed protein with 15% identity to any training sequence; model never saw such distant sequences). Mode collapse: generative model produces minor variants of training sequences rather than novel designs (all generated antibodies are >95% identical to natural sequences in database). Reward hacking: optimization exploits model artifacts (e.g., promoter design finds sequences with spurious features correlating with high expression in training data but having no causal effect).
This chapter explored how foundation models enable the transition from predicting sequence function to designing sequences with desired properties.
Key Topics Covered:
- Design formalism: Mathematical frameworks (optimization, conditional generation, constrained optimization) that formalize the inverse prediction problem
- Protein design: Sequence-based approaches using language models and structure-aware methods using diffusion and inverse folding
- Regulatory design: Engineering promoters, enhancers, and splicing elements using gradient-based optimization and generative models
- mRNA optimization: Balancing translation efficiency, stability, and immunogenicity through codon and UTR design
- Closed-loop workflows: Design-Build-Test-Learn cycles with active learning for efficient experimental exploration
- Validation and failure modes: Understanding distribution shift, mode collapse, and reward hacking in model-guided design
- Generative evaluation: Metrics for novelty, validity, diversity, and functionality in design assessment
Key Takeaways:
- Design inverts prediction: instead of asking “what does this sequence do?”, we ask “what sequence achieves this function?” This inversion exposes model limitations invisible during prediction.
- Structure provides a powerful intermediate representation for protein design, constraining the search from astronomical sequence space to physically realizable geometries.
- Closed-loop DBTL cycles are essential because computational design generates hypotheses, not certainties. Models improve through iterative experimental feedback.
Looking Ahead:
- Chapter 32 explores emerging directions in genomic AI, including many-body effects, temporal dynamics, and increasingly integrated models
- The biosecurity considerations raised here connect to the broader governance discussion in Chapter 27
- The lab-in-the-loop concepts from Section 30.5.3 provide the experimental infrastructure for scaling design campaigns