23 Multi-Omics Integration
More data can mean worse predictions. Understanding why is prerequisite to making multi-omics work.
Estimated reading time: 35-45 minutes
Prerequisites: This chapter builds on foundation model concepts from Chapter 14, single-cell integration methods from Chapter 20, and assumes familiarity with attention mechanisms (Chapter 7) and transfer learning (Chapter 9). Understanding of GWAS (Chapter 3) and expression quantitative trait loci helps but is not essential.
Learning Objectives: After completing this chapter, you should be able to:
- Explain why naive concatenation of multi-omics features often degrades prediction performance
- Compare and contrast early, late, and intermediate fusion strategies, selecting the appropriate approach for different data characteristics
- Design multi-omics integration pipelines that handle missing modalities gracefully
- Trace information flow from genetic variants through molecular layers to clinical phenotypes
- Identify when multi-omics integration will likely help versus hurt prediction accuracy
Combining data types should improve prediction. If genomic variants provide one signal and transcriptomic measurements provide another, their combination ought to be more informative than either alone. This intuition, while reasonable, frequently proves wrong in practice. Naive concatenation of multi-omics features often degrades performance relative to single-modality models. Noise from uninformative features overwhelms signal from informative ones. Batch effects between modalities create spurious correlations that models exploit. The curse of dimensionality intensifies when features from multiple assays are stacked without principled integration. The paradox is real: more data can mean worse predictions, and understanding why is prerequisite to making multi-omics integration work.
Each molecular layer captures part of the biological story but not all of it. Genomic variants identify predisposition; transcriptomics reveals which genes respond; proteomics shows which proteins change; metabolomics measures downstream biochemical consequences. A patient with a BRCA1 variant may show altered DNA repair gene expression, deficient homologous recombination protein activity, and characteristic metabolic signatures. No single layer provides the complete picture. Effective integration traces this causal cascade from genetic variation through molecular intermediates to clinical phenotype, distinguishing primary effects from downstream consequences and noise from signal.
The integration strategy matters as much as the data itself. Early fusion concatenates features before modeling, intermediate fusion learns joint representations across modalities, and late fusion combines predictions from modality-specific models. Each approach carries distinct tradeoffs for different applications and data characteristics. Multi-omics foundation models attempt to learn unified representations across genomics, transcriptomics, proteomics, and other modalities simultaneously, while clinical integration extends further still, combining electronic health records, imaging data, and molecular measurements for patient-level prediction. The practical challenges are substantial: missing modalities when not every patient has every assay, batch effects from technical variation between measurement platforms, and a persistent gap between multi-omics potential and deployment reality.
23.1 Limits of Single-Modality Models
Each molecular layer tells an incomplete story. DNA sequence is static; it encodes potential but not state. A variant’s presence says nothing about whether the gene is expressed, whether the protein is active, or whether the pathway is perturbed. Transcriptomic data captures expression state but misses post-transcriptional regulation, protein modifications, and metabolic flux. Proteomic measurements reveal protein abundance but not necessarily activity or localization. Methylation profiles indicate epigenetic state but require expression data to understand functional consequences.
The incompleteness becomes concrete when modeling complex traits. Genome-wide association studies (see Chapter 3 for GWAS methodology) explain a fraction of total heritability for most common diseases through currently identified variants, with the gap between explained and estimated heritability forming the missing heritability problem (Manolio et al. 2009). Adding expression quantitative trait loci (eQTLs) improves fine-mapping by suggesting which variants affect gene expression (see Section 3.4), but many causal mechanisms operate through splicing, translation, or post-translational modification rather than expression level. Single-cell RNA sequencing reveals cellular heterogeneity invisible to bulk measurements, but the same cell cannot simultaneously undergo RNA-seq and assay for transposase-accessible chromatin sequencing (ATAC-seq), forcing computational integration across modalities measured in different cells (see Chapter 20 for approaches to this challenge).
Before reading further, consider a patient with type 2 diabetes. Which molecular layers would you expect to provide independent information about their disease state? Would genomic variants, gene expression, protein levels, and metabolite concentrations all be equally informative, or would some layers be redundant with others?
Consider the challenge of predicting drug response. Germline variants in drug-metabolizing enzymes explain some inter-individual variation (see Section 2.8.4), but tumor-specific somatic mutations, expression programs, and microenvironment all influence therapeutic efficacy. A genomics-only model sees the inherited component; a transcriptomics-only model sees the current expression state; neither captures the full picture. Multi-omics integration promises to bridge these gaps by learning representations that span molecular layers.
Foundation models address each molecular layer individually: sequence models predict regulatory effects from DNA (see Chapter 17), expression models capture transcriptional programs (see Chapter 19), and protein language models predict structure and function from amino acid sequence (see Chapter 16). Multi-omics integration asks how these modality-specific representations can be combined into unified patient or cell representations.
The promise comes with caveats. Adding modalities increases the number of parameters that must be estimated, potentially worsening overfitting when sample sizes are limited. Different modalities have different noise characteristics, batch structures, and missingness patterns. The same patient’s measurements across platforms may not align perfectly due to sample handling, timing, or technical variation. Naive concatenation of features often performs worse than single-modality models because the signal-to-noise ratio degrades when noisy features outnumber informative ones.
These challenges motivate careful consideration of integration strategy. The question is not whether to integrate, but how.
23.2 Integration Strategies and Their Tradeoffs
Three broad strategies have emerged for combining multi-omics data, each with distinct strengths and limitations.
23.2.1 Early Fusion
If you concatenate genomic, transcriptomic, and epigenomic data into one giant feature vector, you are betting that their interactions matter from the very first layer. When does this bet pay off, and when does it fail?
Early fusion concatenates features from multiple modalities before any modeling, creating a single high-dimensional input vector that contains genomic variants, expression values, methylation levels, and any other available measurements. A classifier or regressor then learns directly from this concatenated representation.
The appeal of early fusion lies in its simplicity and flexibility. Any downstream model architecture can operate on concatenated features, from linear regression to deep neural networks. The model can learn arbitrary interactions between features from different modalities, since all information is present in the input. Implementation requires only normalization and alignment of features across samples.
The limitations become apparent at scale. Dimensionality explodes when combining genome-wide variants (millions of features), gene expression (tens of thousands of genes), methylation (hundreds of thousands of CpG sites), and protein abundance (thousands of proteins). Most samples have far fewer observations than features, creating severe overfitting risk. Why does high dimensionality cause overfitting even with regularization? Consider the geometry: with p features and n samples (p >> n), the training points lie in a low-dimensional subspace of the feature space. The model can perfectly fit training data by assigning any prediction to training points and interpolating arbitrarily between them; countless solutions achieve zero training error, and most generalize poorly. Regularization constrains the solution space but cannot eliminate the fundamental problem that vastly more parameters exist than constraints. When each modality adds millions of features while sample counts remain in the thousands, the curse of dimensionality overwhelms any regularization scheme.
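A small simulation makes the geometry tangible. The features below are pure noise containing no signal whatsoever, yet an L2-penalized classifier fits the training labels perfectly while generalizing at chance; the sample and feature counts are arbitrary choices meant only to put p far above n.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 200, 20_000  # far more features than samples, as in naive early fusion

# Pure-noise "concatenated omics" features and random labels: no signal exists.
X_train, X_test = rng.normal(size=(n, p)), rng.normal(size=(n, p))
y_train, y_test = rng.integers(0, 2, n), rng.integers(0, 2, n)

# Even with an L2 penalty, the model separates the training set perfectly.
clf = LogisticRegression(C=1.0, max_iter=5000).fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))  # typically 1.0
print("test accuracy:", clf.score(X_test, y_test))     # ~0.5, i.e., chance
```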
Missing data creates additional complications. If any modality is missing for a sample, early fusion requires either excluding that sample (reducing effective sample size) or imputing the missing modality (introducing noise and potential bias). Since multi-omics studies often have incomplete overlap between modalities, with some patients having genomics and transcriptomics but not proteomics, early fusion frequently operates on substantially reduced cohorts.
Scale differences between modalities pose another challenge. Expression values span orders of magnitude; methylation beta values range from zero to one; variant encodings are typically binary. Without careful normalization, modalities with larger variance can dominate the learned representation regardless of biological relevance. Batch effects within each modality add further complexity, since batch correction must precede concatenation but may interact with cross-modal relationships.
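A minimal sketch of per-modality standardization before concatenation, using synthetic matrices whose scales mimic the mismatch described above (all dimensions are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 100  # samples; rows must be aligned across all three matrices

expression = rng.lognormal(size=(n, 5000))                      # spans orders of magnitude
methylation = rng.beta(2.0, 2.0, size=(n, 8000))                # beta values in [0, 1]
variants = rng.binomial(2, 0.3, size=(n, 10000)).astype(float)  # 0/1/2 allele dosages

# Standardize each modality on its own scale before concatenating so that no
# single assay dominates by variance alone. In a real pipeline, fit the
# scalers on training samples only and reuse them at test time.
blocks = [StandardScaler().fit_transform(np.log1p(expression)),
          StandardScaler().fit_transform(methylation),
          StandardScaler().fit_transform(variants)]
X_early = np.concatenate(blocks, axis=1)
```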
Despite these limitations, early fusion remains appropriate when sample sizes are large relative to feature counts, when all modalities are available for all samples, and when the downstream task is well-defined enough to guide feature selection. Biobank-scale studies with thousands of participants and focused feature sets can succeed with early fusion approaches.
23.2.2 Late Fusion
What if each modality truly provides independent information, like separate witnesses to the same event? Late fusion bets that combining final verdicts outperforms forcing witnesses to deliberate together from the start. This independence assumption buys you robustness to missing data but costs you the ability to detect when witnesses are describing the same thing from different angles.
Late fusion trains separate models for each modality and combines their predictions at the output level. A genomics model produces a risk score; a transcriptomics model produces another risk score; these modality-specific predictions are then combined into a final output.
Late fusion is related to but distinct from ensemble learning. Traditional ensembles combine multiple models trained on the same data to reduce variance and improve robustness: bagging averages diverse trees, boosting trains weak learners sequentially, and stacking learns optimal combination weights. Late fusion, by contrast, combines models trained on different data modalities. Each model sees fundamentally different features (genomic variants vs. expression levels vs. protein abundance), not just different views or subsets of the same features. The modality-specific models may themselves be ensembles, and the combination layer may use ensemble techniques (weighted averaging, stacking), but the defining characteristic of late fusion is the separation of modalities during model training. This distinction matters because ensemble theory (bias-variance decomposition, diversity requirements) does not directly transfer: late fusion’s value comes from information complementarity across modalities rather than from variance reduction through model diversity.
This approach handles missing modalities gracefully. If a patient lacks proteomic data, the proteomics model simply does not contribute to the ensemble. Sample sizes for each modality-specific model can differ, since training requires only samples with that modality rather than complete multi-omics profiles. Each modality can use whatever architecture works best for its data type: deep networks for imaging, gradient boosting for tabular omics, convolutional architectures for sequence.
Late fusion cannot capture cross-modal interactions at the feature level. If a variant’s effect on disease depends on expression level of a regulatory gene, neither the genomics model nor the transcriptomics model alone can detect this interaction. The ensemble sees only the modality-specific predictions, not the underlying features. This limitation is fundamental: late fusion assumes that each modality provides independent signal that can be additively combined. Mathematically, if the true risk depends on an interaction term like \(\text{variant} \times \text{expression}\), late fusion can only approximate this with \(f_1(\text{variant}) + f_2(\text{expression})\), which cannot capture the non-additive structure regardless of how sophisticated the individual models become.
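A synthetic example illustrates the limitation. Suppose risk depends only on an interaction in which each feature is marginally uninformative (an XOR-like structure, deliberately extreme); late fusion then performs at chance while a joint model recovers the rule. The setup below is purely illustrative, not drawn from any real dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 4000
variant = rng.integers(0, 2, n).astype(float)  # carrier status (0/1)
expression = rng.normal(size=n)                # regulator expression level
# Outcome depends ONLY on the interaction; each feature is marginally useless.
y = ((variant == 1) ^ (expression > 0)).astype(int)

Xv, Xe = variant.reshape(-1, 1), expression.reshape(-1, 1)
tr, te = train_test_split(np.arange(n), random_state=0)

# Late fusion: independent single-modality models, predictions averaged.
f1 = GradientBoostingClassifier().fit(Xv[tr], y[tr])
f2 = GradientBoostingClassifier().fit(Xe[tr], y[tr])
late = 0.5 * (f1.predict_proba(Xv[te])[:, 1] + f2.predict_proba(Xe[te])[:, 1])
print("late fusion:", ((late > 0.5) == y[te]).mean())   # ~0.5, chance

# A joint model sees both features and recovers the interaction.
X = np.hstack([Xv, Xe])
joint = GradientBoostingClassifier().fit(X[tr], y[tr])
print("joint model:", joint.score(X[te], y[te]))        # ~1.0
```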
The assumption of independence often fails in biological systems. Gene expression depends on genetic variants through eQTLs. Protein levels depend on both transcription and post-transcriptional regulation. Methylation states influence and are influenced by transcription. The molecular layers are not independent information sources but coupled components of a dynamic system. Late fusion ignores this coupling.
Calibration presents a practical challenge. For ensemble predictions to be meaningful, the modality-specific models must produce well-calibrated probability estimates (see Section 24.3 for calibration methods). If the genomics model is overconfident and the transcriptomics model is underconfident, naive averaging produces biased predictions. Calibration techniques help but add complexity to the modeling pipeline.
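One common remedy is to wrap each modality-specific model in a cross-validated calibrator before averaging, as in this minimal sketch; the data and feature dimensions are synthetic placeholders.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
X_gen = rng.normal(size=(n, 50))    # hypothetical genomic features
X_expr = rng.normal(size=(n, 200))  # hypothetical expression features
y = (X_gen[:, 0] + X_expr[:, 0] + rng.normal(size=n) > 0).astype(int)

# Calibrate each modality-specific model so that a predicted 0.8 means the
# same thing from either model before the scores are averaged.
gen_model = CalibratedClassifierCV(
    RandomForestClassifier(), method="isotonic", cv=5).fit(X_gen, y)
expr_model = CalibratedClassifierCV(
    RandomForestClassifier(), method="isotonic", cv=5).fit(X_expr, y)

risk = 0.5 * (gen_model.predict_proba(X_gen)[:, 1]
              + expr_model.predict_proba(X_expr)[:, 1])
```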
Late fusion works well when modalities genuinely provide independent signals, when sample sizes for each modality differ substantially, or when interpretability requires understanding each modality’s contribution separately. Clinical deployment often favors late fusion because it gracefully handles the reality that not all patients will have all measurements.
23.2.3 Intermediate Fusion
Early fusion demands complete data and cannot scale. Late fusion ignores cross-modal interactions entirely. Is there a middle path that learns how modalities relate to each other while gracefully handling the messy reality of incomplete measurements?
You are designing a multi-omics model that must work even when patients are missing some data types. Early fusion requires all modalities; late fusion cannot capture cross-modal interactions. What properties would an ideal intermediate approach have? How might you encode different modalities into a common representation while preserving modality-specific structure?
Intermediate fusion learns modality-specific encoders that map each data type into a shared latent space, then operates on the aligned representations for downstream tasks. This approach combines the flexibility of early fusion with the robustness of late fusion.
Each modality has its own encoder architecture tailored to its characteristics. A variational autoencoder might encode single-cell expression data, handling sparsity and dropout noise. A convolutional network might process methylation profiles along chromosomal coordinates. A graph neural network might encode protein interaction data (see Section 22.2). These diverse architectures share nothing except their output dimensionality: all encoders produce embeddings in a common latent space.
Alignment between modalities is encouraged through multiple mechanisms. Reconstruction losses require each encoder’s latent representation to support decoding back to the original features, ensuring that the embeddings retain modality-specific information. Contrastive terms pull together representations of the same biological entity across modalities: the expression embedding for a cell should be similar to the ATAC-seq embedding for the same cell. Graph constraints enforce consistency with known biological relationships: genes connected in interaction networks should have similar embeddings.
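The sketch below shows one minimal instantiation of these ideas in PyTorch, assuming paired RNA and ATAC measurements and a simple MSE alignment term; real systems such as GLUE or totalVI use variational encoders and richer objectives, and every layer size and feature count here is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT = 32  # shared embedding dimensionality; all sizes here are illustrative

class ModalityAutoencoder(nn.Module):
    """Encoder/decoder pair mapping one modality into the shared latent space."""
    def __init__(self, n_features: int):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_features, 256), nn.ReLU(),
                                 nn.Linear(256, LATENT))
        self.dec = nn.Sequential(nn.Linear(LATENT, 256), nn.ReLU(),
                                 nn.Linear(256, n_features))
    def encode(self, x): return self.enc(x)
    def decode(self, z): return self.dec(z)

rna, atac = ModalityAutoencoder(2000), ModalityAutoencoder(5000)

def fusion_loss(x_rna, x_atac, align_weight=1.0):
    """Rows of x_rna and x_atac are paired: the same cell in both assays."""
    z_rna, z_atac = rna.encode(x_rna), atac.encode(x_atac)
    # Reconstruction: each latent must still explain its own modality.
    recon = (F.mse_loss(rna.decode(z_rna), x_rna)
             + F.mse_loss(atac.decode(z_atac), x_atac))
    # Alignment: the same cell should land at the same point in latent space.
    align = F.mse_loss(z_rna, z_atac)
    return recon + align_weight * align

loss = fusion_loss(torch.randn(64, 2000), torch.randn(64, 5000))
```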
The power of intermediate fusion lies in the shared latent space. By forcing different modalities to project into the same embedding space, the model learns that expression patterns and chromatin accessibility in the same cell should map to nearby points. This alignment enables cross-modal reasoning: a classifier operating on the shared space can learn interactions between genomic and transcriptomic features, since both are present in the same representation, and can still predict for samples with only one modality available. Transfer becomes possible: a model trained on expression data can be applied to samples with only ATAC-seq by encoding through the ATAC-seq encoder into the shared space.
Missing modalities no longer require imputation or exclusion. If a sample lacks proteomics, only the available encoders fire, producing a partial representation in the shared space. The downstream model operates on whatever representation is available, degrading gracefully as modalities are missing rather than failing entirely.
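Continuing the sketch above, inference with partial data reduces to pooling whichever embeddings exist; averaging is one simple pooling choice among several.

```python
import torch

def partial_embedding(x_rna=None, x_atac=None):
    """Pool whichever modality embeddings exist (reuses the hypothetical
    `rna` and `atac` encoders from the sketch above)."""
    parts = [enc.encode(x) for enc, x in ((rna, x_rna), (atac, x_atac))
             if x is not None]
    if not parts:
        raise ValueError("at least one modality is required")
    return torch.stack(parts).mean(dim=0)  # degrades gracefully, never fails

z = partial_embedding(x_rna=torch.randn(4, 2000))  # RNA-only sample still embeds
```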
GLUE, introduced in Section 20.5.2 for single-cell multi-omics integration, exemplifies this approach. Separate variational autoencoders encode RNA-seq and ATAC-seq data into a shared cell embedding space. A feature graph links ATAC-seq peaks to genes based on genomic proximity and transcription factor binding, providing biological constraints on the alignment. The result enables integration of measurements from different cells, not just different modalities in the same cell.
Intermediate fusion dominates modern multi-omics deep learning because it balances flexibility with robustness. The modality-specific encoders can be pretrained on large single-modality datasets, then fine-tuned for alignment (see Chapter 9 for transfer learning strategies). New modalities can be added by training new encoders without retraining existing components. The shared space provides a natural target for interpretation and visualization.
The approach is not without limitations. The quality of alignment depends heavily on the training objective and the availability of paired samples where multiple modalities are measured in the same biological entity. Without sufficient anchoring, the shared space may fail to capture true biological correspondence. Hyperparameter choices for balancing reconstruction against alignment losses require careful tuning.
Before viewing the comparison table below, make a prediction: Which fusion strategy (early, late, or intermediate) do you think would handle missing modalities best? Which would be most computationally expensive? Which would be best for learning interactions between genomic variants and gene expression levels? Write down your predictions, then check them against the table.
| Strategy | Cross-Modal Interactions | Missing Data Handling | Computational Cost | Best When |
|---|---|---|---|---|
| Early Fusion | Can learn arbitrary interactions | Poor: requires complete data | Low to moderate | Large samples, complete data, focused features |
| Late Fusion | None: predictions combined only | Excellent: uses available modalities | Low: independent models | Independent signals, variable coverage, interpretability needed |
| Intermediate Fusion | Learns in shared space | Good: graceful degradation | High: alignment training | Coupled modalities, transfer learning, paired training data |
23.3 Multi-Omics Foundation Models
The foundation model paradigm, introduced in Chapter 14, extends naturally to multi-omics settings. Rather than training task-specific models that integrate modalities for a single downstream application, multi-omics foundation models learn general-purpose representations that transfer across tasks.
23.3.1 Factor-Based Integration
Before diving into deep learning, consider a simpler question: can we explain the variation across modalities with a small number of shared factors? If inflammation drives coordinated changes in both gene expression and DNA methylation, a single factor might capture both effects. Factor-based methods test whether multi-omics complexity reduces to interpretable dimensions.
Multi-Omics Factor Analysis (MOFA and its successor MOFA+) provides a probabilistic framework for learning shared and modality-specific factors from multi-omics data (Argelaguet et al. 2018). The approach assumes that observed measurements across modalities can be explained by a small number of latent factors, some shared across modalities and others specific to individual data types.
MOFA+ extends this framework to handle multiple sample groups (such as different tissues or conditions), non-Gaussian likelihoods appropriate for count data, and scalable inference for large datasets. The factors learned by MOFA+ capture sources of variation that span modalities, enabling biological interpretation: a factor that loads heavily on inflammatory genes in expression data and on hypomethylation at immune loci in methylation data suggests coordinated epigenetic-transcriptional regulation of inflammation.
While MOFA+ is not a deep learning method in the strict sense, its factor-based decomposition provides a foundation for understanding what multi-omics integration should capture. The shared factors correspond to biological processes that manifest across molecular layers; the modality-specific factors capture technical variation or layer-specific biology.
23.3.2 Deep Generative Multi-Omics Models
RNA and protein measurements have fundamentally different noise characteristics: RNA suffers from dropout, proteins from background binding. Should a model treat them identically, or explicitly account for how each modality was generated? Deep generative approaches choose the latter, building the measurement process into the model itself.
totalVI (Total Variational Inference) integrates protein abundance from CITE-seq with gene expression in single-cell data through a hierarchical Bayesian model (Gayoso et al. 2021). The approach learns a joint latent space that captures cell state while properly modeling the distinct noise characteristics of RNA and protein measurements. Protein abundance follows a negative binomial distribution with technical factors including background binding; RNA counts follow a zero-inflated negative binomial accounting for dropout. The choice of these specific likelihood functions reflects the data-generating process: RNA-seq suffers from technical dropout where some transcripts fail to be captured despite being present (hence zero-inflation), while protein measurements from antibody-based methods have background binding noise but less zero-inflation. Using the wrong likelihood would force the model to distort its latent space to accommodate systematic errors, degrading biological interpretability.
The generative model structure enables imputation of missing modalities. Given RNA expression alone, totalVI can predict expected protein abundance by sampling from the learned joint distribution. This imputation is not mere correlation-based prediction but reflects the full posterior distribution over protein levels given expression.
MultiVI extends this framework to integrate gene expression with chromatin accessibility (Ashuach et al. 2023). The model learns to align measurements from different cells, enabling construction of unified cell atlases from studies that measured different modalities. The alignment relies on the biological assumption that gene expression and chromatin state reflect the same underlying cell state, even when measured in different cells.
These Bayesian deep generative models exemplify intermediate fusion with principled uncertainty quantification. The posterior distributions over latent variables capture not just point estimates but confidence in the learned representations (see Chapter 24 for uncertainty quantification methods). This property becomes important for clinical applications where prediction uncertainty must inform decision-making.
Consider totalVI integrating RNA and protein measurements from single-cell CITE-seq data. The model learns a joint latent space where cells map based on both modalities.
- Why does totalVI use different likelihood functions for RNA (zero-inflated negative binomial) versus protein (negative binomial)?
- If you had a new sample with only RNA measurements, how would the model generate a protein abundance prediction?
- What advantage does the Bayesian approach provide over simply training a regression from RNA to protein?
RNA-seq suffers from technical dropout where transcripts fail to be captured despite being present (requiring zero-inflation), while protein measurements from antibody-based methods have background binding noise but less zero-inflation. Using the correct likelihood prevents the model from distorting its latent space to accommodate systematic errors.
The model maps the RNA measurements into the shared latent space, then samples from the learned posterior distribution over protein levels conditioned on that latent representation.
The Bayesian approach provides full posterior distributions capturing uncertainty in predictions, not just point estimates, critical for clinical decisions where knowing confidence matters as much as the prediction itself.
23.3.3 Contrastive Multi-Modal Learning
What if you do not need to model the full data-generating process, but only need to learn that a cell’s expression profile and its methylation profile should map to nearby points? Contrastive learning takes this shortcut: instead of reconstructing measurements, it simply learns to recognize which observations came from the same biological entity across modalities.
Contrastive learning provides another path to multi-omics integration. The CLIP model for vision-language demonstrated that contrastive objectives can align embeddings from fundamentally different data types (images and text) into a shared space (Radford et al. 2021). Similar approaches apply to biological modalities.
The contrastive objective is straightforward: embeddings of the same biological entity across modalities should be similar, while embeddings of different entities should be dissimilar. A cell’s expression embedding should be close to its methylation embedding and far from other cells’ methylation embeddings. A patient’s genomic embedding should be close to their transcriptomic embedding across the cohort.
Why does this objective produce biologically meaningful alignment rather than arbitrary correspondence? The key is that biological state is the common cause underlying both modalities. A cell’s expression profile and its methylation profile are both consequences of the same underlying regulatory state (active enhancers, bound transcription factors, chromatin accessibility). By forcing the encoders to map the same cell to similar embeddings across modalities, the contrastive loss encourages representations that capture this shared biological state rather than modality-specific artifacts. Features that vary randomly between modalities (technical noise, batch effects) cannot satisfy the objective; only features reflecting genuine cellular identity survive the alignment pressure.
This objective requires paired samples for training: the same cells or patients measured across modalities. Anchor pairs define the positive examples; negative examples come from non-matching pairs within a batch. The encoders learn to produce embeddings where cross-modal correspondence emerges from training dynamics rather than explicit feature engineering.
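A minimal PyTorch version of the symmetric InfoNCE objective used in CLIP-style training makes the positive/negative construction concrete; the temperature value and embedding sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of paired embeddings: row i of z_a and
    row i of z_b come from the same cell or patient (positives); every other
    pairing within the batch serves as a negative."""
    z_a, z_b = F.normalize(z_a, dim=1), F.normalize(z_b, dim=1)
    logits = z_a @ z_b.T / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(z_a.size(0))  # diagonal entries are the true matches
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

# e.g., expression vs. methylation embeddings for the same 128 cells:
loss = clip_style_loss(torch.randn(128, 64), torch.randn(128, 64))
```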
Contrastive approaches scale well and can incorporate foundation model encoders pretrained on single modalities. An expression encoder pretrained on millions of cells via masked gene prediction can be fine-tuned with contrastive objectives to align with an ATAC-seq encoder. The pretraining provides rich initial representations; the contrastive fine-tuning establishes cross-modal correspondence (see Section 8.5 for contrastive pretraining strategies).
23.3.4 Language Models for Multi-Omics
Recent work extends the language model paradigm to multi-omics integration, treating different molecular modalities as distinct “languages” amenable to joint modeling. Zhu et al. (2024) apply GPT-style autoregressive modeling to learn unified representations across genomic, transcriptomic, and proteomic data. Similarly, Hwang et al. (2024) demonstrate that generative language models can capture cross-modal dependencies that escape modality-specific approaches.
This paradigm shift reframes multi-omics integration from a feature engineering problem to a representation learning problem, potentially discovering latent biological relationships that manual feature design would miss.
The multi-omics language model paradigm remains nascent. Key open questions include optimal tokenization across modalities, handling missing data, and whether joint pretraining outperforms modality-specific models with late fusion.
23.4 Clinical Integration: EHR, Imaging, and Molecular Data
The ultimate goal of multi-omics modeling for many applications is patient-level prediction: disease risk, treatment response, prognosis. Achieving this goal requires integrating molecular measurements with clinical data that directly captures patient state and outcomes.
23.4.1 Electronic Health Records as a Modality
Molecular measurements capture mechanism; clinical records capture manifestation. A patient’s genomic risk score tells you their inherited predisposition, but their ten-year history of diagnoses, medications, and lab values tells you what actually happened. When should you trust the molecular signal, and when does the clinical trajectory override it?
Electronic health records contain decades of longitudinal observations for millions of patients: diagnoses, procedures, medications, laboratory values, vital signs, clinical notes. This wealth of phenotypic information complements molecular data by capturing disease manifestation rather than molecular mechanism.
Integrating EHR with genomics poses distinct challenges. The data types differ fundamentally: structured codes, continuous lab values, free-text notes, and time-stamped events versus static or slowly-changing molecular measurements. Temporal structure matters: the sequence of diagnoses and treatments contains prognostic information that static snapshots miss. Missingness is informative: the absence of a laboratory test may indicate that a clinician deemed it unnecessary, which itself conveys information about patient state (Section 2.7.2). The phenotype quality challenges introduced there cascade through multi-omics integration, where EHR-derived labels may introduce systematic biases that Section 13.2.4 examines in detail.
Foundation models for EHR data learn representations from the longitudinal event sequences. These models, often based on transformer architectures that process sequences of medical codes (see Chapter 7), capture temporal dependencies and co-occurrence patterns in clinical trajectories. The resulting patient embeddings encode disease state and prognosis in a form amenable to integration with molecular data.
EHR data presents distinct time series challenges that affect multi-omics integration:
Irregular sampling. Clinical events occur at irregular intervals driven by patient visits, not a fixed measurement schedule. Lab values might be measured daily during hospitalization, monthly during stable periods, and not at all when patients feel healthy. Standard time series methods assuming regular intervals require adaptation.
Event sequences vs. continuous signals. EHR contains both discrete events (diagnoses, procedures) and continuous measurements (vital signs, lab values). Effective architectures must handle both: transformers that process event sequences, recurrent networks that interpolate continuous signals, or hybrid approaches.
Variable-length histories. Patients have clinical histories spanning days to decades. Encoding must handle this variability, whether through truncation to recent windows, hierarchical summarization, or attention mechanisms that learn which distant events remain relevant.
Censoring and outcome timing. For survival analysis and risk prediction, the timing of outcome events matters as much as their occurrence. Integration with molecular data requires careful alignment: when was the molecular sample collected relative to the clinical trajectory? Predictions should not use future clinical events that post-date the molecular measurement.
For specialized EHR modeling architectures including BEHRT, Med-BERT, and clinical transformer variants, see the clinical risk modeling discussion in Section 28.4.
Combining EHR embeddings with genomic features requires handling different temporal scales. Genetic variants are constant throughout life; EHR observations accumulate over years. The integration must determine which clinical observations are relevant to a given molecular measurement, accounting for the time between sample collection and clinical events. A patient’s genomic risk for a disease does not change, but their clinical trajectory unfolds over time, and the relevance of past clinical events depends on when the molecular sample was collected.
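In code, this alignment reduces to filtering each patient's events to those preceding sample collection, which also prevents future clinical information from leaking into the prediction. The table layout below is a hypothetical long-format EHR extract.

```python
import pandas as pd

# Hypothetical long-format event table and per-patient sample-collection dates.
events = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(["2019-03-01", "2021-06-15", "2023-01-10",
                            "2020-05-20", "2022-11-02"]),
    "code": ["E11.9", "I10", "N18.3", "E78.5", "I25.10"],
})
sample_dates = pd.Series(pd.to_datetime(["2022-01-01", "2021-01-01"]),
                         index=[1, 2], name="sample_date")

# Keep only events that precede each patient's molecular sample: anything
# later would post-date the measurement and must not inform the model.
events = events.join(sample_dates, on="patient_id")
history = events[events["date"] < events["sample_date"]]
```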
23.4.2 Imaging Integration
Molecular assays homogenize tissue into measurements that lose spatial context. Imaging preserves that context: where is the tumor, how heterogeneous is it, what does the tissue architecture reveal? The tradeoff is clear, but how do you align a three-dimensional scan with a bulk RNA-seq measurement that averaged over a small biopsy region?
Medical imaging provides spatial information that molecular assays lack. A CT scan reveals tumor location, size, and heterogeneity; histopathology slides show cellular morphology and tissue architecture; MRI captures organ structure and function. These spatial data complement molecular measurements that aggregate over dissected tissue regions.
Radiogenomics links imaging features to genetic and molecular characteristics. Glioblastoma tumors with specific imaging signatures have distinct methylation patterns and expression programs. Radiomic features extracted from CT scans correlate with mutational burden and immune infiltration in lung cancer. These associations enable prediction of molecular state from non-invasive imaging, potentially guiding treatment decisions when biopsy is impractical.
Foundation models for medical imaging, pretrained on millions of scans through self-supervised objectives, provide rich representations for downstream tasks. Integrating these imaging embeddings with molecular data follows the intermediate fusion paradigm: modality-specific encoders produce representations in a shared latent space where multi-modal classifiers operate.
The integration must account for correspondence between imaging regions and molecular samples. A tumor may be molecularly heterogeneous, with different subclones in different spatial locations. A biopsy samples one location; imaging captures the entire lesion. Alignment requires either spatial registration of biopsy location to imaging coordinates or acceptance that the correspondence is imperfect.
23.4.3 Multi-Modal Clinical Prediction Models
In clinical practice, patients arrive with whatever data they have: complete molecular workups for some, imaging only for others, extensive EHR histories for many. A practical clinical model cannot demand all modalities. How do you build a unified prediction system that improves with more data but still works with less?
Combining EHR, imaging, and molecular data for clinical prediction follows the intermediate fusion pattern. Each data type has a specialized encoder: a transformer for longitudinal EHR events, a vision encoder for imaging, domain-specific encoders for expression, methylation, and other molecular modalities. All encoders produce embeddings in a common patient representation space.
The training objective typically combines modality-specific reconstruction losses with alignment terms that encourage consistency across data types. A patient’s EHR embedding should be predictive of their molecular state; their imaging embedding should be consistent with their clinical trajectory. Downstream classifiers for outcomes like survival, treatment response, or disease progression operate on the combined representation.
Missing modalities are common in clinical settings. Not all patients have genomic data; imaging may be unavailable for some conditions; the depth of EHR history varies by healthcare system and patient engagement. Multi-modal clinical models must handle this missingness gracefully, producing useful predictions from whatever data are available while leveraging cross-modal information when present.
The clinical deployment path for such models requires validation on external cohorts, prospective evaluation, and regulatory clearance. These practical considerations, addressed in Chapter 28, shape model development from the outset: a model that performs well on a research cohort but requires modalities unavailable in clinical workflows provides little value. Deployment constraints, including feature availability and model calibration requirements, are examined in Section 28.3.
When designing clinical multi-modal models, start from deployment constraints:
- Identify available modalities: Which data types will be available for the target patient population? Not all patients in a research cohort have all modalities available in routine clinical care.
- Design for graceful degradation: Choose intermediate fusion architectures that can produce useful predictions even with incomplete data.
- Validate across missingness patterns: Test performance not just on complete cases but on realistic subsets reflecting clinical data availability.
- Consider timing: When is each modality available relative to the clinical decision point? A model requiring data not yet collected provides no clinical value.
23.5 Systems View: From Variant to Phenotype
Multi-omics integration gains conceptual clarity from a systems biology perspective that traces information flow from genetic variation through molecular intermediates to clinical phenotypes. This cascade view organizes the molecular layers into a causal hierarchy and identifies where integration should occur.
23.5.1 Information Cascade
Genetic variants are the starting point: heritable differences in DNA sequence that perturb downstream molecular processes. Some variants directly alter protein structure through missense or nonsense mutations. Others affect regulation: promoter variants change expression level, splice site variants alter transcript isoforms, enhancer variants modulate tissue-specific expression.
These primary effects propagate through molecular layers. Expression changes alter the cellular protein complement. Protein level changes affect enzyme activity, signaling cascades, and transcriptional feedback. Metabolic flux shifts in response to enzyme availability. Cell behavior changes as the integrated molecular state crosses thresholds for proliferation, differentiation, or death.
Tissue-level phenotypes emerge from cellular behavior aggregated across the organ. Tumor growth reflects altered cell proliferation; fibrosis reflects aberrant extracellular matrix deposition; inflammation reflects immune cell recruitment and activation. These tissue phenotypes manifest as clinical symptoms, laboratory abnormalities, and imaging findings.
The cascade view suggests where different modalities provide information. Genomics captures the inherited potential and somatic alterations. Transcriptomics and epigenomics capture the current regulatory state. Proteomics and metabolomics capture the functional molecular complement. Clinical data captures the phenotypic consequences.
23.5.2 Bottleneck Modalities
Consider two different variants: (1) a missense variant in TP53 that disrupts DNA binding, and (2) an enhancer variant that reduces TP53 expression by 30%. For which variant would expression data provide more information beyond genomic sequence alone? Why?
Not all modalities are equally informative for all questions. The concept of bottleneck modalities identifies which molecular layers most directly mediate the relationship between genetic variation and phenotype.
For many coding variants, protein structure is the bottleneck. A missense variant’s effect on disease depends primarily on how it alters protein function, which depends on how the amino acid substitution affects folding, stability, and activity. Expression level matters less than structural consequence. Protein language models that predict structural effects from sequence directly address this bottleneck (see Chapter 16).
For regulatory variants, expression is closer to the bottleneck. An enhancer variant affects disease through its effect on target gene expression, which affects downstream processes. Chromatin accessibility and transcription factor binding are intermediate steps; expression level is the more proximal readout. Models that predict expression effects from sequence address this bottleneck (see Chapter 17).
For some phenotypes, the bottleneck may lie downstream of molecular measurements entirely. Behavioral traits depend on neural circuit function that emerges from complex cellular and network dynamics. Metabolic traits depend on flux through interconnected pathways that may not be apparent from enzyme abundance alone. These cases suggest that molecular measurements provide incomplete information regardless of integration sophistication.
The bottleneck concept provides practical guidance for integration strategy. If you know that a phenotype is driven primarily by coding variants affecting protein structure, investing in proteomics may be less valuable than deploying state-of-the-art protein structure prediction from sequence. Conversely, if regulatory variants dominate, expression or chromatin accessibility measurements add substantial information beyond genomic sequence. Understanding the causal architecture helps prioritize which modalities to measure and integrate.
23.5.3 Causal vs. Correlational Integration
Multi-omics data are pervasively correlated. Genes in the same pathway have correlated expression. Methylation and expression are anti-correlated at many promoters. Clinical variables cluster by disease category. These correlations can improve prediction even without causal understanding.
When integrating high-dimensional multi-omics features, controlling false discoveries becomes critical. Model-X knockoffs provide a framework for variable selection with false discovery rate (FDR) control, applicable to any covariate matrix regardless of the response type (Candès et al. 2018). The approach constructs synthetic “knockoff” variables that mimic the correlation structure of original features without containing signal, then selects only features whose importance exceeds their knockoffs. This is particularly valuable for multi-omics integration where thousands of correlated features across modalities could generate spurious associations. For genomics specifically, knockoffs have been applied to fine-mapping, identifying causal variants with FDR guarantees. The statistical foundations and genomic applications are discussed in Section 26.2.3.
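For intuition, here is a compact sketch of second-order Gaussian model-X knockoffs on synthetic correlated features, using lasso coefficient gaps as importance statistics and the knockoff+ threshold. All parameters are illustrative; production analyses would rely on dedicated, well-tested implementations rather than this hand-rolled version.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, q = 600, 100, 0.1  # samples, correlated features, target FDR
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # AR(1)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
beta = np.zeros(p); beta[:10] = 1.0  # ten true signals
y = X @ beta + rng.normal(size=n)

# Equicorrelated second-order Gaussian knockoffs: X_ko mimics X's covariance
# and cross-covariance except on the diagonal, where the "s" gap lives.
s = min(1.0, 2 * np.linalg.eigvalsh(Sigma).min()) * 0.999
Sinv_s = np.linalg.solve(Sigma, np.eye(p) * s)   # = Sigma^{-1} S with S = sI
mean_ko = X - X @ Sinv_s
cov_ko = 2 * s * np.eye(p) - s * Sinv_s          # = 2S - S Sigma^{-1} S
X_ko = mean_ko + rng.multivariate_normal(np.zeros(p), cov_ko, size=n)

# Importance statistics: real-feature lasso magnitude minus knockoff magnitude.
coef = Lasso(alpha=0.05).fit(np.hstack([X, X_ko]), y).coef_
W = np.abs(coef[:p]) - np.abs(coef[p:])

# Knockoff+ threshold guaranteeing FDR <= q.
ts = np.sort(np.abs(W[W != 0]))
tau = next((t for t in ts
            if (1 + (W <= -t).sum()) / max(1, (W >= t).sum()) <= q), np.inf)
print("selected features:", np.where(W >= tau)[0])
```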
Causal integration seeks to identify the mechanistic relationships between molecular layers. If a variant causes reduced expression, which causes protein deficiency, which causes metabolic dysfunction, this causal chain suggests intervention targets: expression restoration or enzyme supplementation might address the downstream effects. Correlational integration might achieve the same predictive performance without identifying this chain, since all layers correlate with the phenotype.
Distinguishing causal from correlational relationships requires experimental perturbation or careful causal inference from observational data. Mendelian randomization uses genetic variants as instruments to infer causal effects of expression on outcomes (see Section 3.9 for integration of GWAS with mechanism). CRISPR screens directly perturb gene function and measure consequences. Multi-omics integration methods increasingly incorporate causal assumptions or validation against perturbation data.
The distinction matters for interpretation and intervention. A predictive model based on correlations may fail when the data distribution shifts (see Chapter 13) or when interventions alter the causal structure. A causally informed model captures mechanism that persists across contexts.
23.6 Handling Missing Modalities
Handling missing modalities requires understanding both the mathematical framework of intermediate fusion and the biological assumptions underlying cross-modal imputation. If the shared latent space concepts from Section 23.2.3 are not yet clear, review that section before proceeding.
Real-world multi-omics data are incomplete. Different studies measure different modalities. Within studies, technical failures, sample limitations, and cost constraints create missing data. Clinical deployment must handle patients with incomplete molecular profiles. Robust multi-omics methods must address missingness directly.
23.6.1 Training with Incomplete Data
Intermediate fusion architectures handle missing modalities naturally during inference: only the available encoders contribute to the shared representation. Training is more complex because alignment terms require paired measurements across modalities.
One approach trains on the subset of samples with complete data, then applies the trained encoders to samples with partial data during inference. This wastes information from the samples with incomplete profiles and may learn representations that fail to generalize to the missing-modality setting.
A better approach incorporates missingness into training. Modality dropout randomly masks modalities during training, forcing the model to learn representations robust to missing inputs. The mechanism works analogously to standard dropout: by training the model to succeed even when some information is unavailable, modality dropout encourages the shared latent space to encode biological state redundantly across modalities rather than relying on any single data type. The reconstruction and alignment losses are computed only for available modalities, so samples with partial data can still contribute to training.
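Modality dropout itself is only a few lines; the batch layout here (a dict mapping modality names to tensors) is a hypothetical convention rather than any particular library's API.

```python
import random
import torch

def modality_dropout(batch: dict, p_drop: float = 0.3) -> dict:
    """Randomly hide modalities for this training step. At least one
    modality always survives so the step remains well-defined."""
    keep = [name for name in batch if random.random() > p_drop]
    if not keep:
        keep = [random.choice(list(batch))]
    return {name: batch[name] for name in keep}

# Losses for the step are then computed only over the surviving modalities:
batch = {"rna": torch.randn(64, 2000), "atac": torch.randn(64, 5000)}
visible = modality_dropout(batch)
```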
Curriculum learning strategies may first train with complete data to establish alignment, then gradually increase modality dropout to improve robustness. The curriculum matters because alignment and robustness have opposing requirements: strong alignment needs paired data to learn which expression patterns correspond to which accessibility patterns, but robustness needs the model to practice predicting with missing inputs. Starting with complete data establishes the cross-modal correspondences, then increasing dropout teaches the model to function without them. The balance between alignment quality (which benefits from complete data) and robustness (which requires training on partial data) requires empirical tuning.
23.6.2 Cross-Modal Imputation
Intermediate fusion enables principled imputation of missing modalities. Given a sample’s available modalities encoded into the shared latent space, decoders for missing modalities can predict expected values. If a patient has expression data but not methylation, the expression encoder produces a latent embedding, and the methylation decoder generates predicted methylation values from that embedding.
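With trained encoders and decoders, imputation is a single encode-decode pass through the shared space, as in this sketch reusing the hypothetical autoencoders from the intermediate fusion example in Section 23.2.3.

```python
import torch

x_rna_only = torch.randn(8, 2000)     # samples measured by RNA-seq alone
with torch.no_grad():
    z = rna.encode(x_rna_only)        # available modality into the shared space
    atac_imputed = atac.decode(z)     # expected accessibility, never measured
```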
The imputation quality depends on how well the shared space captures the biological factors underlying both modalities. If expression and methylation reflect the same cell state, the imputation may be accurate. If they capture distinct aspects of biology, imputation will smooth over true variation.
Uncertainty in imputation matters for downstream use. Point estimates of missing values provide no indication of confidence. Generative models that produce distributions over missing values enable propagation of uncertainty through downstream analyses (see Section 24.5). A risk prediction that depends heavily on imputed values should have wider confidence intervals than one based entirely on measured data. The selective prediction and uncertainty communication strategies that could implement this appropriate caution are developed in Section 24.8.
23.6.3 Zero-Shot Cross-Modal Transfer
The most ambitious application of multi-omics integration is zero-shot prediction across modalities: using a model trained on one set of modalities to make predictions for samples measured with entirely different modalities.
This transfer relies on the shared latent space capturing biological state independently of measurement modality. If the space truly represents cell state, then a classifier trained on expression-derived embeddings should work on ATAC-seq-derived embeddings, since both encoders map to the same biological meaning. The alignment training enables this transfer by ensuring that the same biological entity maps to the same latent location regardless of which modality was measured.
Zero-shot transfer is rarely perfect. The modalities may capture somewhat different aspects of biology, and the alignment may be imprecise. But partial transfer can still be valuable: a model achieving 80% of supervised performance without any labeled examples in the new modality saves substantial annotation effort (see Section 10.6.2 for zero-shot transfer in other contexts).
23.7 Practical Challenges
The gap between multi-omics potential and deployed reality reflects obstacles that compound across modalities. Technical variation that is manageable within a single assay type becomes intractable when batch structures differ across genomics, transcriptomics, and proteomics. Sample sizes that support single-modality analysis may be insufficient when the effective dimensionality grows with each added data type. Interpretability, already challenging for deep learning on individual modalities, becomes harder still when attributions must be compared across features with different scales and semantics. These practical challenges determine whether integration improves predictions or merely adds complexity.
23.7.1 Batch Effects Across Modalities
Batch effects, systematic technical variation between experimental batches, are endemic in high-throughput biology. Multi-omics integration faces compounded batch effects: each modality may have its own batch structure, batches may be correlated or anti-correlated across modalities, and batch correction methods designed for single modalities may not extend to multi-modal settings.
Consider a study where expression data were generated at three sequencing centers and proteomics data were generated at two mass spectrometry facilities. The batch effects in each modality are independent. Samples from expression batch 1 are spread across proteomics batches. Correcting expression batch effects does not address proteomics batch effects, and vice versa.
Integration must either correct batch effects within each modality before combining (risking removal of real biology that correlates with batch) or incorporate batch as a covariate in the integrated model (requiring that batch structure be known and modeled correctly). Domain adaptation techniques treat batches as domains and learn representations invariant to domain while retaining biological signal. The systematic strategies for detecting batch-driven confounding appear in Section 13.8, while mitigation approaches including adversarial domain adaptation are detailed in Section 13.9.
23.7.2 Sample Size and Power
Multi-omics studies typically have smaller sample sizes than single-modality studies due to cost constraints. Each additional modality increases per-sample cost, trading breadth for depth. This tradeoff has implications for statistical power and model complexity.
The effective sample size for multi-omics integration may be smaller than for any single modality. If 1000 patients have expression data and 800 have methylation data but only 600 have both, intermediate fusion sees 600 fully informative samples. Late fusion can use all 1000 expression samples and all 800 methylation samples, avoiding the intersection penalty.
Power analyses for multi-omics studies must account for the specific integration strategy and the expected missingness pattern. A study designed for early fusion needs larger sample sizes (relative to feature count) than one designed for late fusion. Grant applications and study planning should explicitly consider how integration choices affect required sample sizes.
23.7.3 Interpretability Across Modalities
Multi-omics models compound the interpretability challenges inherent in deep learning. When a model predicts disease risk from integrated genomic, transcriptomic, and proteomic features, clinicians need to understand which modalities and which features drive the prediction. A black-box risk score, however accurate, provides little guidance for understanding mechanism or identifying intervention targets.
Attribution methods that work for single-modality models do not automatically extend to multi-modal settings (see Chapter 25 for attribution methods). Gradient-based attribution can identify important features within each modality, but comparing importance across modalities requires careful normalization. A genomic variant and an expression value operate on different scales with different effect size distributions; raw attribution scores are not directly comparable.
The intermediate fusion architecture offers some interpretability advantages. The shared latent space can be visualized to understand how samples cluster and which modalities contribute to separation. Attention weights in cross-modal transformers indicate which features from each modality the model considers when making predictions. Modality ablation studies quantify each data type’s contribution to overall performance.
Biological interpretability requires connecting learned representations to known biology. Do the latent dimensions correspond to pathways, cell types, or disease processes? Are cross-modal attention patterns consistent with known regulatory relationships? These questions demand validation against external biological knowledge, not just introspection of model parameters.
23.7.4 Evaluation Complexity
Evaluating multi-omics models is more complex than evaluating single-modality models. Multiple dimensions of performance matter: prediction accuracy, calibration, cross-modality transfer, robustness to missing modalities, biological plausibility of learned representations, and clinical utility.
A model might achieve high prediction accuracy by memorizing batch effects or leveraging shortcuts in the data. Evaluation should include cross-batch and cross-cohort validation to assess generalization (Chapter 12), with particular attention to homology-aware splitting strategies (Section 12.2) that prevent information leakage across data partitions. Ablation studies that remove each modality quantify the contribution of each data type and identify whether the model genuinely integrates information or relies predominantly on one modality.
Biological validation through comparison to known biology provides another evaluation axis. Do the learned factors correspond to known pathways? Are attention patterns consistent with regulatory relationships? Do imputed values match held-out measurements? These checks assess whether the model captures biological signal rather than technical artifacts.
Clinical evaluation, addressed in Chapter 28, requires prospective validation in real deployment settings. A model that improves prediction in research cohorts may not improve clinical decisions if the predictions do not change management or if the required modalities are unavailable in clinical workflows.
A research team trains a multi-omics model integrating genomics, transcriptomics, and proteomics for cancer prognosis. The model achieves excellent cross-validation performance but fails when tested on data from a different hospital.
- What are three possible reasons for this failure?
- What evaluation strategies should have been included to detect this problem earlier?
- How might the fusion strategy choice (early, late, intermediate) affect robustness to this kind of distribution shift?
Three likely reasons: batch effects that differ between hospitals (different sequencing platforms, sample processing), population differences (ancestry, age distribution, disease subtypes), and hospital-specific technical artifacts that the model memorized in place of biological signal.
Useful strategies include cross-batch validation, cross-cohort validation against held-out external datasets, and ablation studies that verify each modality contributes genuine biological information rather than batch-correlated shortcuts.
Late fusion may be more robust because each modality is processed independently before combination, limiting how batch effects in one modality can corrupt others. Early fusion concatenates raw features, allowing batch effects to propagate across modalities. Intermediate fusion balances these; shared latent spaces can either amplify or mitigate batch effects depending on training objectives.
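A toy simulation makes the propagation argument concrete: a constant batch shift applied only to the expression modality drags every early-fusion prediction with it, while the genomics half of a late-fusion ensemble is untouched, so the ensemble's calibration drifts less. The data, shift size, and model choices below are assumptions made purely for illustration.

```python
# Toy illustration: a site-specific shift in one modality corrupts the mean
# predicted risk of an early-fusion model more than a late-fusion ensemble,
# because the genomics-only component never sees the shifted features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
geno = rng.normal(size=(n, 30))
expr = rng.normal(size=(n, 30))
y = ((geno[:, 0] + expr[:, 0]) > 0).astype(int)

early = LogisticRegression(max_iter=1000).fit(np.hstack([geno, expr]), y)
m_geno = LogisticRegression(max_iter=1000).fit(geno, y)
m_expr = LogisticRegression(max_iter=1000).fit(expr, y)

expr_shift = expr + 2.0                     # external site shifts expression only
p_early = early.predict_proba(np.hstack([geno, expr_shift]))[:, 1]
p_late = 0.5 * (m_geno.predict_proba(geno)[:, 1]
                + m_expr.predict_proba(expr_shift)[:, 1])

print(f"true prevalence:        {y.mean():.2f}")
print(f"early fusion mean risk: {p_early.mean():.2f}")   # drifts with the shift
print(f"late fusion mean risk:  {p_late.mean():.2f}")    # genomics half is unaffected
```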
23.8 Integration as Means, Not End
Multi-omics integration is not an end in itself but a means to improved prediction, understanding, and intervention. The integration strategies and foundation models surveyed here produce representations; downstream applications convert those representations to actionable outputs. Risk prediction combines multi-omic embeddings with clinical variables for individualized prognosis. Treatment response models predict which patients will benefit from specific therapies based on their integrated molecular profiles. Drug discovery uses multi-omics to inform target identification and patient stratification for clinical trials (see Chapter 30). In each case, integration provides the substrate; clinical or scientific goals provide the purpose.
The systems view that multi-omics enables shapes how predictions should be interpreted. A risk prediction based on integrated features inherits explanatory power from the causal relationships linking molecular layers to phenotype. Understanding which modalities drive predictions, and how those modalities relate to underlying biology, supports clinical reasoning about mechanism and intervention. This explanatory capacity distinguishes multi-omics from single-modality approaches that may predict equally well but provide less insight into why predictions succeed or fail.
The path from research models to clinical deployment requires addressing practical challenges that intensify with integration: batch effects across modalities and institutions, missing measurements that differ systematically across patients, sample size limitations that grow with feature dimensionality, and evaluation complexity when outcomes depend on multiple data types. The clinical applications examined in Chapter 28 and Chapter 29 confront these realities. As the field advances toward whole-patient foundation models that jointly encode genomics, transcriptomics, proteomics, imaging, and clinical data, the integration principles established here provide the foundation. The tradeoffs between fusion strategies, the importance of shared latent spaces, the challenge of missing modalities, and the systems biology perspective on information flow will remain relevant as scale and scope expand. The interpretability challenges that compound across modalities (Chapter 25) and the calibration requirements for clinical deployment (Section 24.3) add further dimensions that shape how multi-omics models should be developed and evaluated.
Before reviewing the summary, test your recall:
- What is the “integration paradox” and why can more data sometimes lead to worse predictions in multi-omics?
- Compare the three fusion strategies (early, intermediate, late) and explain when each is most appropriate.
- Why does intermediate fusion dominate modern multi-omics approaches?
- What is meant by “bottleneck modalities” and why should you identify them when designing multi-omics integration?
- Explain the difference between causal and correlational integration, and why this distinction matters for clinical interventions.
Integration Paradox: The paradox is that naive concatenation of multi-omics features often degrades performance relative to single-modality models. This occurs because noise from uninformative features overwhelms signal from informative ones, batch effects between modalities create spurious correlations that models exploit, and the curse of dimensionality intensifies when features from multiple assays are stacked without principled integration, resulting in more parameters to estimate than training samples can constrain.
Fusion Strategy Comparison: Early fusion concatenates features before modeling, enabling arbitrary cross-modal interactions but requiring complete data and suffering from dimensionality issues; best for large samples with complete data. Late fusion trains separate models per modality and combines predictions, handling missing data excellently but unable to learn feature-level interactions; best when modalities provide independent signals. Intermediate fusion uses modality-specific encoders projecting to a shared latent space, balancing interaction learning with missing data robustness; best when modalities are coupled and paired training data exists.
Intermediate Fusion Dominance: Intermediate fusion dominates because it balances flexibility with robustness by learning modality-specific encoders that can be pretrained on large single-modality datasets, then aligned through shared latent spaces that enable cross-modal reasoning. The architecture handles missing modalities gracefully (encoders fire only for available data), allows new modalities to be added without retraining existing components, and provides a natural target for interpretation and visualization, addressing the key limitations of both early fusion (dimensionality, missing data) and late fusion (no feature-level interactions).
Bottleneck Modalities: Bottleneck modalities are the molecular layers that most directly mediate the relationship between genetic variation and phenotype for a specific question. For coding variants affecting protein structure, the bottleneck is at the protein level (structure matters more than expression). For regulatory variants, expression is closer to the bottleneck (enhancer effects operate through gene expression changes). Identifying bottlenecks guides which modalities to prioritize for measurement and integration; investing in proteomics provides little value if protein structure prediction from sequence already captures the critical information.
Causal vs. Correlational Integration: Causal integration identifies mechanistic relationships between molecular layers (e.g., variant causes reduced expression, which causes protein deficiency, which causes metabolic dysfunction), suggesting intervention targets like expression restoration or enzyme supplementation. Correlational integration exploits statistical associations to improve prediction without identifying mechanism, achieving similar predictive performance but lacking explanatory power. The distinction matters clinically because causal models capture mechanisms that persist across contexts and guide interventions, while correlational models may fail when data distributions shift or when interventions alter the causal structure.
Core Concepts:
- The integration paradox: more data can mean worse predictions when noise overwhelms signal, batch effects create spurious correlations, or dimensionality outpaces sample size
- Three fusion strategies (early, intermediate, late) offer different tradeoffs between cross-modal interaction learning and missing data robustness
- Intermediate fusion dominates modern approaches by learning modality-specific encoders projecting to a shared latent space
Key Methods:
- MOFA+ for probabilistic factor-based integration
- totalVI and MultiVI for deep generative single-cell integration
- Contrastive multi-modal learning for aligning embeddings across modalities
Design Principles:
- Identify bottleneck modalities: which molecular layers most directly mediate genotype-phenotype relationships for your question
- Design for graceful degradation: models should produce useful predictions even with incomplete modality coverage
- Distinguish causal from correlational integration: causal understanding enables intervention, correlation enables prediction
Practical Considerations:
- Batch effects compound across modalities and require per-modality correction or domain adaptation
- Effective sample size shrinks with each required modality due to incomplete overlap
- Interpretability requires cross-modal attribution normalization and validation against known biology
Looking Ahead: Chapter 24 addresses how to quantify and communicate uncertainty in these complex models, while Chapter 28 examines the deployment path from research to clinical practice.