28 Clinical Risk Prediction
A risk prediction has clinical value only if it changes what happens next.
Estimated reading time: 40-50 minutes
Prerequisites: This chapter builds on concepts from multiple earlier chapters. You should be familiar with polygenic risk scores (Section 3.5), foundation model architectures for DNA and protein sequences (Chapter 15, Chapter 16), transfer learning approaches (Chapter 9), and uncertainty quantification principles (Chapter 24). Understanding of evaluation metrics (Section 12.6) and interpretability methods (Chapter 25) will help you assess clinical deployment requirements.
Learning Objectives: After completing this chapter, you should be able to:
- Explain why traditional polygenic scores often fail clinical translation and how foundation model features address these limitations
- Compare early, intermediate, and late fusion architectures for integrating genomic and clinical data
- Distinguish discrimination, calibration, and clinical utility as distinct but complementary evaluation criteria
- Design validation studies appropriate for different levels of the evidence hierarchy
- Identify sources of bias in genomic and EHR data and describe mitigation strategies for equitable deployment
- Articulate the workflow, monitoring, and governance requirements for clinical integration of foundation model-based risk tools
Clinical prediction model development follows an established methodology, codified in Steyerberg's Clinical Prediction Models (Steyerberg 2019), that genomic foundation models must also satisfy before clinical deployment:
Development Phase:
- Clear specification of target population and outcome
- Appropriate predictor selection with clinical rationale
- Parameter estimation with safeguards against overfitting
- Internal validation (cross-validation, bootstrap)
Validation Phase:
- Discrimination assessment (c-statistic, auROC)
- Calibration assessment (reliability diagrams, calibration slope)
- Clinical utility analysis (decision curves, net benefit)
- External validation in independent cohorts
Deployment Phase:
- Presentation of predictions (risk categories, absolute risks)
- Decision support integration
- Monitoring and updating
Foundation models streamline much of the development phase through pretraining and fine-tuning; however, clinical deployment still requires the complete validation and deployment phases. The TRIPOD statement (Collins et al. 2015) provides a 22-item checklist to ensure complete reporting of model development and validation. Clinical deployment of genomic foundation models requires TRIPOD compliance.
Maria is 52, with moderately elevated cholesterol and a family history that keeps her cardiologist alert: her father’s heart attack at 49, her brother’s stent at 54. Her physician orders a polygenic risk score and receives a number: 0.84, placing Maria in the top 5% of genetic risk for coronary artery disease. The cardiologist pauses. What should she do differently because of this score? Prescribe the statin she was already considering? Recommend earlier imaging she would have suggested anyway? Counsel Maria about lifestyle modifications she has heard a hundred times before?
A risk prediction has clinical value only if it changes what happens next. If Maria with her high polygenic risk score receives the same statin prescription, lifestyle counseling, and follow-up schedule as a patient without genetic testing, the score added nothing to her care regardless of its statistical validity. The fundamental challenge is not generating genomic predictions but translating them into actions that improve outcomes. This translation requires more than discrimination between who will and will not develop disease; it requires that the prediction reach clinicians in a usable form, at a decision point where alternatives exist, for a patient population where the prediction performs equitably.
Traditional polygenic scores, despite their scientific validity, often fail this translation test. They reduce entire genomes to single numbers that provide little mechanistic insight. They transfer poorly across ancestries because training data overrepresent European populations. They exist outside the electronic health records where clinical decisions actually happen, requiring manual lookup that busy clinicians rarely perform. Most fundamentally, the clinical actions available in response to a polygenic risk score (PRS), such as lifestyle modification, earlier screening, or preventive medication, are often the same actions recommended for patients with conventional risk factors, leaving unclear what the genetic information specifically enables.
Genomic foundation models offer capabilities that may address some of these limitations. Rather than collapsing genetic information into scalar risk scores, foundation models produce embeddings that capture sequence context, regulatory grammar, and functional consequences. These representations can integrate with clinical data through fusion architectures (Chapter 23), adapt to diverse prediction tasks through transfer learning (Chapter 9), and provide feature attributions that connect predictions to biological mechanisms (Chapter 25). Whether these capabilities translate into tools that change practice remains the open question. Foundation models improve representation quality, but they do not automatically solve the translation problem. Clinicians must still have time and incentive to interpret richer representations, interventions must exist that respond to the mechanistic insights models provide, and computational evidence must meet evidentiary standards for clinical decision-making. Having better predictions does not guarantee better outcomes if those predictions do not reach clinicians in actionable form, at moments when decisions can change, for populations where the predictions perform equitably. The path from these capabilities to tools that change practice runs through electronic health record integration, evidence standards for clinical deployment, fairness considerations that determine whether genomic AI reduces or amplifies health disparities, and the practical realities of care delivery.
28.1 From Polygenic Scores to Foundation Model Features
The limitations of classical polygenic risk scores define the opportunity for foundation model approaches. As discussed in Section 3.5, polygenic scores aggregate the effects of common variants into weighted sums, with weights derived from genome-wide association study effect sizes. This framework has demonstrated that common genetic variation contributes substantially to risk for conditions including coronary artery disease, type 2 diabetes, and breast cancer. A patient in the top percentile of polygenic risk for coronary disease faces roughly threefold higher lifetime risk than one in the bottom percentile, a gradient comparable to traditional risk factors like smoking or hyperlipidemia.
Before reading further, consider the three main limitations of traditional polygenic scores mentioned in the introduction: lack of mechanistic insight, poor cross-ancestry portability, and disconnection from clinical workflows. For each limitation, what kind of technical capability would be needed to address it? How might richer representations from foundation models help with each?
Several limitations constrain the clinical impact of this approach. The linear additive model cannot capture epistatic interactions where one variant’s effect depends on others, nor can it represent the nonlinear relationships between genetic variation and disease that emerge from regulatory networks and cellular pathways. Polygenic scores derived from European-ancestry genome-wide association studies substantially underperform in other populations, with effect sizes often attenuating by half or more in African or East Asian ancestries due to differences in linkage disequilibrium structure and allele frequencies (Section 13.2.1; Section 3.7). Beyond these technical constraints, a single scalar provides no mechanistic insight: a high polygenic score for diabetes does not indicate whether risk stems from impaired insulin secretion, insulin resistance, or altered satiety signaling, information that might guide intervention selection.
Foundation models address these limitations through richer representations. Instead of treating variants as independent weighted features, models like Delphi and G2PT learn genome-wide embeddings that encode sequence context, regulatory annotations, and cross-variant interactions (Georgantas, Kutalik, and Richiardi 2024; Lee et al. 2025). These approaches can capture nonlinear structure in genetic risk, leverage functional priors that transfer across ancestries, and provide attention-based attributions that highlight which genomic regions contribute most to predictions. Fine-mapping models like MIFM estimate posterior probabilities for variants most likely driving associations within loci, enabling prioritization based on statistical evidence (Rakowski and Lippert 2025). Whether these probabilistic rankings identify true causal variants (rather than variants in tight linkage with causal variants) remains a distinct question requiring experimental validation; fine-mapping provides refined association signals, not causal proof.
The core advancement of foundation model approaches is not simply “better prediction” but a fundamental shift in what the model produces. A polygenic score yields a single number; a foundation model yields a high-dimensional embedding that preserves information about which genomic regions contribute, how they interact, and why they matter biologically. This richer output enables downstream tasks (pathway attribution, cross-ancestry transfer, mechanistic interpretation) that scalar scores cannot support.
The practical architecture of a foundation model-enabled risk system typically involves three components: pretrained encoders that transform genomic data into embeddings, aggregation modules that summarize variant-level or region-level representations into patient-level features, and prediction heads that map these features (combined with clinical covariates) to risk estimates. This modular design separates the computationally expensive foundation model inference from the task-specific prediction layer, enabling updates to either component while maintaining clear interfaces for validation.
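To make this concrete, here is a minimal sketch of the three-component design in PyTorch; the class name, dimensions, and mean-pooling aggregation are illustrative assumptions rather than a reference implementation of any published system.

```python
import torch
import torch.nn as nn

class FMRiskModel(nn.Module):
    """Illustrative three-component risk system: frozen pretrained
    encoder -> aggregation module -> task-specific prediction head."""
    def __init__(self, pretrained_encoder: nn.Module,
                 embed_dim: int = 512, n_clinical: int = 32):
        super().__init__()
        self.encoder = pretrained_encoder
        for p in self.encoder.parameters():
            p.requires_grad = False            # freeze: embeddings precomputable
        self.aggregate = nn.Sequential(        # variant-level -> patient-level
            nn.Linear(embed_dim, 128), nn.ReLU())
        self.head = nn.Linear(128 + n_clinical, 1)

    def forward(self, variant_seqs, clinical_covariates):
        # Assumes the encoder returns (batch, n_variants, embed_dim);
        # mean pooling is the simplest aggregation choice.
        emb = self.encoder(variant_seqs)
        patient = self.aggregate(emb.mean(dim=1))
        x = torch.cat([patient, clinical_covariates], dim=-1)
        return torch.sigmoid(self.head(x))     # absolute risk estimate
```

Because the encoder is frozen, its embeddings can be computed once in batch and cached, so only the lightweight head runs at prediction time and either component can be revalidated independently when updated.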
| Approach | Representation | Ancestry Transfer | Mechanistic Insight | Computational Cost |
|---|---|---|---|---|
| Traditional PRS | Scalar (single number) | Poor (LD-dependent) | None | Low |
| PRS + annotations | Scalar + categorical | Moderate | Limited | Low |
| FM embeddings | High-dimensional vector | Better (sequence-based) | Via attribution | High (precompute) |
| FM + fine-mapping | Weighted embeddings | Best (causal priors) | Strong | High |
28.1.1 Non-Linear and Deep Learning PRS Approaches
Recent work has explored whether deep learning can capture genetic interactions missed by linear PRS. Dibaeinia et al. (2025) introduce PRSformer, a transformer-based architecture designed to model variant interactions directly, potentially capturing epistatic effects that linear methods cannot represent.
| Method Class | Architecture | Captures Epistasis? | Interpretability | Computational Cost |
|---|---|---|---|---|
| Linear PRS (LDpred2, PRS-CS) | Weighted sum | No | High (weights) | Low |
| Non-linear ML (XGBoost) | Decision trees | Limited (pairwise) | Moderate (SHAP) | Moderate |
| Deep learning (PRSformer) | Transformer | Potentially | Low | High |
Rigorous comparison with well-tuned linear methods is essential before concluding that deep learning adds value (Section 12.7). Elgart et al. (2022) systematically compared linear and non-linear PRS approaches across multiple phenotypes, finding that non-linear models provide modest but consistent improvements (typically 2-5% AUC gain) for phenotypes with complex genetic architectures. As that comparison underscores, improvements from deep learning must exceed well-tuned linear baselines to justify the added complexity: apparent advantages of complex models may reflect improper comparison rather than genuine epistatic signal.
28.2 Defining Clinical Risk Prediction
A risk prediction model is only as useful as the decision it informs. Effective clinical risk prediction requires precise specification of four elements: the outcome being predicted, the time horizon over which prediction applies, the target population for whom the model is intended, and the clinical action the prediction will trigger.
Return to Maria, the 52-year-old with moderately elevated cholesterol and a family history of early coronary disease. Her cardiologist must decide whether to initiate statin therapy, a decision traditionally guided by 10-year cardiovascular risk estimates from tools like the Pooled Cohort Equations. A genomic foundation model could augment this decision in several ways. It might refine her absolute risk estimate by incorporating polygenic information that the traditional calculator ignores. It might identify whether her genetic risk concentrates in pathways amenable to specific interventions (LDL metabolism favoring statins versus inflammatory pathways suggesting alternative approaches). It might flag pharmacogenomic variants affecting statin metabolism that influence dose selection or drug choice.
Each of these applications represents a different prediction task with distinct requirements. The 10-year risk estimate for major adverse cardiovascular events is an individual-level incident risk problem where discrimination and calibration matter most. The pathway-level attribution is an interpretability challenge requiring mechanistic grounding. The pharmacogenomic prediction is a treatment selection problem where the relevant outcome is adverse drug reaction risk conditional on therapy initiation.
For technical readers: three distinct concepts determine whether a test should enter clinical practice:
Analytical validity: Does the test accurately measure what it claims to measure?
- Example: Does the genotyping array correctly call the variant genotypes?
- Metrics: Call rate, concordance with sequencing, reproducibility
Clinical validity: Is the measurement associated with the clinical outcome of interest?
- Example: Does the polygenic score correlate with disease risk?
- Metrics: Discrimination (auROC), calibration, hazard ratios
Clinical utility: Does using the test improve patient outcomes?
- Example: Does knowing the polygenic score lead to interventions that reduce disease incidence?
- Metrics: Net reclassification, decision curve analysis, clinical trial outcomes
Why the distinction matters:
| Level | Can Be High While Others Are Low? | Example |
|---|---|---|
| Analytical validity | Yes | Perfectly accurate test for a biomarker that does not predict disease |
| Clinical validity | Yes | Highly predictive test that does not change management |
| Clinical utility | Requires others | Test must be valid and actually improve care |
A model can achieve excellent discrimination (clinical validity) yet provide no clinical utility if the resulting predictions do not change clinical decisions, if the population tested differs from training, or if interventions triggered by the prediction are ineffective.
Clinical risk prediction tasks cluster into several archetypes. Incident risk concerns whether a currently disease-free individual will develop disease within a specified window, such as 10-year diabetes risk for prediabetic patients. Progression risk asks which patients with existing disease will develop complications, for instance nephropathy in diabetes or heart failure after myocardial infarction. Survival and prognosis involve time-from-diagnosis to events like death, recurrence, or transplant, often requiring survival models that handle censoring and competing risks. Treatment response and toxicity concerns whether a patient will benefit from one therapy versus another and their probability of experiencing serious adverse effects.
| Task Type | Example Question | Time Structure | Key Metrics | FM Role |
|---|---|---|---|---|
| Incident risk | Will this patient develop T2D in 10 years? | Fixed window | Discrimination, calibration | Risk stratification features |
| Progression | Will this diabetic develop nephropathy? | Conditional on disease | Time-to-event | Trajectory modeling |
| Survival/prognosis | How long after cancer diagnosis? | Open-ended, censored | C-index, survival curves | Tumor embeddings |
| Treatment response | Will this patient respond to drug X? | Conditional on treatment | Relative benefit | Drug-gene interactions |
| Toxicity | Adverse event risk with drug X? | Conditional on treatment | auPRC (rare outcomes) | Pharmacogenomic features |
Foundation models enter these problems as feature generators. They transform raw sequence data into structured representations that downstream prediction models combine with clinical covariates. The architectural choices for this combination, and the evidence required to trust the resulting predictions, constitute the core methodological challenges of clinical translation.
28.3 Feature Integration Architectures
The features available for clinical risk models draw on multiple foundation model families, each capturing different aspects of genetic and molecular risk.
DNA-level foundation models provide variant effect predictions without requiring trait-specific training. Systems like Nucleotide Transformer, HyenaDNA, and GPN compute sequence-based deleteriousness scores that reflect how mutations disrupt regulatory grammar, splice sites, or protein-coding sequences (Dalla-Torre et al. 2023; Nguyen et al. 2023; Benegas, Batra, and Song 2023). These zero-shot predictions transfer across traits and ancestries because they derive from sequence properties rather than population-specific association statistics (Chapter 15). Fine-mapping models like MIFM integrate these functional priors with association evidence to estimate which variants within a locus are likely causal, providing principled weights for aggregation (Section 3.4; Rakowski and Lippert 2025).
Protein language models add coding variant interpretation. AlphaMissense and related systems predict pathogenicity for missense mutations based on evolutionary conservation patterns learned from millions of protein sequences, as discussed in Chapter 16. For conditions with strong coding variant contributions (Mendelian cardiomyopathies, cancer predisposition syndromes), these predictions provide important signal beyond what noncoding regulatory models capture.
Multi-omics foundation models extend beyond germline sequence. Cell-type-resolved representations from GLUE, scGLUE, and CpGPT capture regulatory state across chromatin accessibility, methylation, and expression (Chapter 20) (Cao and Gao 2022; Camillo et al. 2024). Rare variant burden scores from DeepRVAT aggregate predicted effects across genes into pathway-level impairment measures (Clarke et al. 2024). For oncology applications, tumor embedding models like SetQuence and graph neural network-based subtypers encode complex somatic mutation landscapes into patient-level representations (Jurenaite et al. 2024; Li et al. 2022).
Electronic health record features provide the clinical context without which genomic predictions lack meaning. Demographics, vital signs, laboratory values, medication lists, problem codes, and procedure histories characterize the patient’s current state and trajectory. Time-varying biomarker trajectories (estimated glomerular filtration rate trends, hemoglobin A1c patterns, tumor marker dynamics) capture disease evolution that static snapshots miss.
The architectural question is how to combine these heterogeneous inputs. Three fusion strategies offer different tradeoffs.
Early fusion concatenates all features into a single input vector and trains a unified model (neural network, gradient boosting, survival regression) on the combined representation. This approach allows the model to learn arbitrary interactions between genomic and clinical features but requires all inputs to be present for every patient, handles scale differences between modalities poorly, and can be dominated by whichever input provides the most features or strongest signal.
Intermediate fusion trains separate encoders for each modality, producing genomic embeddings, clinical embeddings, and multi-omic embeddings that a fusion module then combines. The fusion module might use attention mechanisms to weight modality contributions dynamically, cross-modal transformers that allow features from one modality to attend to features from another, or simpler concatenation with learned combination weights. This approach offers modularity (foundation model encoders can be swapped as new versions become available) while still enabling learned cross-modal interactions.
Late fusion trains independent models for each modality and combines their predictions through ensemble methods or meta-learning. A polygenic score model, an electronic health record model, and a multi-omic model each produce risk estimates that a final layer integrates. This approach handles missing modalities gracefully and allows modality-specific architectures but may underutilize cross-modal structure since interactions can only be captured at the final combination stage.
Why does late fusion handle missing data so effectively while intermediate fusion struggles? In late fusion, each modality-specific model is trained independently and produces valid predictions whether or not other modalities are available. The genomic model outputs a valid risk score using only genomic data; the EHR model outputs a valid risk score using only clinical data. The combination layer learns optimal weighting when all modalities are present but can fall back to available inputs without architectural modification: if EHR data is missing, the combination layer simply uses the genomic and multi-omic scores. In contrast, intermediate fusion architectures learn cross-modal interactions during training that assume all data streams are present. When a patient lacks methylation data, for example, the fusion layer’s learned weights encode expectations about relationships between methylation features and other modalities that cannot be satisfied, potentially producing undefined or degraded outputs.
| Fusion Strategy | Cross-Modal Interactions | Missing Data Handling | Modularity | Best For |
|---|---|---|---|---|
| Early | Full (learned jointly) | Poor (all required) | Low | Dense, complete data |
| Intermediate | Moderate (fusion layer) | Moderate (graceful degradation) | High | Evolving FM ecosystems |
| Late | Limited (output only) | Excellent (independent) | Moderate | Heterogeneous availability |
For clinical deployment, intermediate fusion often provides the best balance. It enables modular updates as foundation models improve, allows graceful degradation when modalities are missing, and captures cross-modal interactions that late fusion misses. The specific fusion mechanism (attention, concatenation, cross-modal transformer) matters less than ensuring the architecture supports the operational requirements of clinical deployment: batch computation, uncertainty quantification, and interpretable feature attribution.
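The missing-modality behavior that distinguishes late fusion can be made explicit in code. The sketch below, assuming per-patient (unbatched) inputs and a masked weighted average as the combination rule, is one simple realization among many; all names are illustrative.

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Independent per-modality risk models; the combination layer
    falls back gracefully to whichever predictions are available."""
    def __init__(self, genomic_model, ehr_model, omics_model):
        super().__init__()
        self.models = nn.ModuleList([genomic_model, ehr_model, omics_model])
        self.weights = nn.Parameter(torch.ones(3))  # learned combination weights

    def forward(self, inputs):
        # inputs: list with one tensor per modality, or None if missing
        preds, mask = [], []
        for model, x in zip(self.models, inputs):
            missing = x is None
            preds.append(torch.zeros(()) if missing else model(x).squeeze())
            mask.append(0.0 if missing else 1.0)
        preds = torch.stack(preds)                               # (3,)
        w = torch.softmax(self.weights, dim=0) * torch.tensor(mask)
        return (w * preds).sum() / w.sum().clamp(min=1e-8)
```

An intermediate-fusion model has no equivalent fallback: its fusion layer's learned weights presume all encoder outputs are present, which is precisely the failure mode described above.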
28.4 EHR Integration and Phenotype Embeddings
Polygenic risk scores condense genetic information into scalar predictions, but clinical decision-making occurs in the context of rich electronic health records that capture diagnoses, procedures, medications, laboratory values, and clinical narratives. A PRS for coronary artery disease exists as an isolated number until integrated with a patient’s history of hypertension, diabetes, smoking, and lipid measurements. The question is not merely whether to combine genetic and clinical information, but how to do so in ways that improve prediction, maintain interpretability, and avoid introducing new sources of bias.
Traditional approaches treat EHR data as additional covariates in regression models that already include the PRS. Age, sex, smoking status, and existing diagnoses enter as predictors alongside the genetic score, with effect sizes learned from training data. This additive framework has clear interpretation but limited capacity: it assumes that genetic risk and clinical risk contribute independently, missing interactions where genetic predisposition matters more or less depending on clinical context. A patient with elevated LDL cholesterol and high coronary disease PRS may face multiplicative risk that additive models underestimate.
28.4.1 EEPRS Framework
The EHR-embedding-enhanced PRS (EEPRS) framework addresses these limitations by integrating phenotype embeddings derived from EHR data with GWAS summary statistics to construct improved polygenic scores (Ruan et al. 2022). Rather than using expert-defined phenotype covariates, EEPRS learns vector representations of clinical phenotypes from their patterns of co-occurrence in patient records. These embeddings capture relationships among diseases, symptoms, and risk factors that expert definitions may miss.
The framework proceeds in stages. Embedding models (Word2Vec trained on ICD-10 code sequences, or GPT-based embeddings of code descriptions) transform each patient’s diagnostic history into a low-dimensional vector representation. GWAS conducted on these embedding dimensions identify genetic variants associated with each dimension of clinical phenotype space. The resulting summary statistics enable construction of embedding-based polygenic scores that capture genetic predisposition to the phenotypic patterns encoded in each dimension. Integration with traditional disease-specific PGS through weighted combination yields final risk predictions.
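A sketch of the first stage using an off-the-shelf Word2Vec implementation (gensim); the toy code sequences and hyperparameters are illustrative, and a real pipeline would use far larger corpora and a higher `min_count`.

```python
import numpy as np
from gensim.models import Word2Vec

# Each patient's record is an ordered list of ICD-10 codes; co-occurrence
# within the context window drives the learned phenotype geometry.
patient_histories = [
    ["I10", "E11.9", "I25.10"],      # hypertension, T2D, coronary disease
    ["J45.909", "J30.9", "I10"],     # asthma, allergic rhinitis, hypertension
    # ... one list per patient, typically hundreds of thousands of records
]

w2v = Word2Vec(sentences=patient_histories, vector_size=64,
               window=10, min_count=1, workers=4)

def embed_patient(codes, model):
    """Patient-level embedding: mean of code vectors (simplest choice)."""
    vecs = [model.wv[c] for c in codes if c in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

# Each of the 64 embedding dimensions then serves as a quantitative
# phenotype for GWAS, yielding per-dimension variant summary statistics.
```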
Validation in UK Biobank demonstrated consistent improvement over single-trait polygenic scores across 41 clinical traits. Cardiovascular conditions showed the largest gains: ischemic stroke improved by 66%, heart failure by 32%, and peripheral artery disease by 25%. These improvements concentrate in traits where related phenotypes share genetic architecture, allowing the embedding-based scores to leverage cross-phenotype genetic correlation. For isolated traits without strong embedding-dimension associations, improvements were modest or absent.
28.4.2 Understanding When Embeddings Help
The EEPRS framework showed large improvements for cardiovascular conditions but minimal improvement for breast cancer. Before reading the explanation, can you hypothesize why? Consider what phenotype embeddings capture and how genetic architecture might differ between these conditions.
Cardiovascular conditions cluster together in clinical practice and share genetic architecture, allowing embeddings to capture cross-phenotype genetic correlations. Breast cancer has largely distinct genetic architecture from cardiovascular diseases, so embedding-based scores derived from cardiovascular-weighted dimensions provide no additional predictive signal for cancer risk.
The pattern of improvement across traits reveals when EHR embeddings add value to polygenic prediction. Conditions that cluster together in clinical space, co-occurring in patients and sharing risk factors, benefit most. The cardiovascular cluster (coronary artery disease, ischemic stroke, peripheral artery disease, heart failure, angina, type 2 diabetes) forms a coherent group in both clinical practice and genetic architecture. Embeddings trained on EHR data capture this clustering, and GWAS on embedding dimensions identify variants associated with the shared liability across the cluster. These variants provide additional prediction signal beyond what single-trait GWAS can detect.
Conversely, conditions with distinct genetic architectures that do not cluster with other phenotypes show minimal improvement. Breast cancer and coronary artery disease, despite both being common conditions well-represented in biobanks, did not benefit from embedding integration in external validation. Their genetic architectures are largely distinct; embedding-based scores derived from cardiovascular-weighted dimensions provide no additional signal for cancer prediction.
This selectivity has important implications for clinical deployment. EEPRS offers greatest value for conditions where conventional polygenic scores remain underpowered despite adequate GWAS sample sizes. Heart failure, peripheral artery disease, and asthma showed substantial improvements precisely because their polygenic scores have historically underperformed relative to heritability estimates. Embedding integration effectively borrows strength across genetically correlated phenotypes, amplifying signal that single-trait analyses struggle to detect.
Beyond diagnostic code embeddings, clinical measurements themselves can be embedded for improved prediction. Yun et al. (2023) introduce REGLE, which learns embeddings from raw clinical data such as spirograms and electrocardiograms rather than from ICD codes. These physiological embeddings capture aspects of phenotype not reflected in diagnostic labels, potentially enabling transfer learning across related conditions and improving prediction for phenotypes with limited direct training data.
28.4.3 PRS-PheWAS for Clinical Interpretation
Clinical deployment requires interpretability: why does this score predict disease risk, and what biological mechanisms does it capture? PRS-based phenome-wide association studies provide one answer by systematically testing association between the polygenic score and hundreds of clinical phenotypes (Section 3.8). For embedding-enhanced scores, PRS-PheWAS reveals which clinical manifestations the genetic risk predicts.
The EEPRS framework’s cardiovascular improvements became interpretable through PRS-PheWAS analysis. Embedding-based scores derived from ICA-transformed dimensions showed strong associations (adjusted \(p < 10^{-20}\)) with hypertension, atrial fibrillation, and cardiac dysrhythmias. These associations explain the improvement: the embeddings capture genetic variation that influences multiple cardiovascular endpoints, and aggregating across these endpoints provides stronger risk stratification than targeting any single outcome.
PRS-PheWAS also reveals unexpected associations that warrant clinical attention. Different embedding methods capture different aspects of phenotypic structure, with GPT-based embeddings uniquely identifying associations with infectious diseases and mental disorders that Word2Vec embeddings missed. These method-specific patterns may reflect differences in what the embedding approaches learn from clinical data, or they may indicate opportunities for method combination that leverages complementary signals.
28.4.4 Implementation Considerations
Translating EEPRS from research demonstration to clinical deployment requires addressing several practical challenges. The embedding models must be trained on EHR data representative of the deployment population; embeddings learned from UK Biobank may not transfer to health systems with different patient populations, coding practices, or documentation patterns. The integration weights that combine embedding-based and single-trait scores require calibration in the target population, not just the discovery cohort.
Computational requirements are modest once embeddings are pretrained. Scoring new patients requires computing their embedding from available ICD codes (a lookup operation), then calculating weighted sums across precomputed variant weights. The workflow integrates with existing PGS calculation pipelines, adding embedding score computation and integration as additional steps. Summary statistics for embedding-based GWAS can be distributed like conventional GWAS results, enabling score construction without sharing individual-level data.
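The scoring step itself reduces to a few matrix products once weights are distributed. A sketch under the stated assumptions; the combination weights `beta` and `alpha` are hypothetical names for quantities that must be calibrated in the target population.

```python
import numpy as np

def eeprs_score(dosages, trait_weights, embed_weights, beta, alpha):
    """Illustrative EEPRS-style scoring from precomputed weights.
    dosages:       (n_variants,) allele counts in {0, 1, 2}
    trait_weights: (n_variants,) conventional single-trait PGS weights
    embed_weights: (n_variants, k) weights from GWAS on k embedding dims
    beta:          (k,) combination weights over embedding-based scores
    alpha:         trait-vs-embedding integration weight in [0, 1]"""
    trait_score = dosages @ trait_weights     # standard PGS: weighted sum
    embed_scores = dosages @ embed_weights    # (k,) embedding-based scores
    return alpha * trait_score + (1 - alpha) * (embed_scores @ beta)
```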
The deeper challenge is population representativeness. EHR-based embeddings inherit the documentation patterns, coding practices, and healthcare access disparities of the health systems where they were trained. An embedding that positions diabetes near cardiovascular disease reflects the co-occurrence pattern in patients who access both cardiology and endocrinology care; patients who lack access to specialty care may show different patterns. Multi-ancestry validation revealed that EEPRS improvements varied across populations, with gains concentrated in conditions where the underlying genetic correlation structure held across ancestries.
28.4.5 Integration with Foundation Model Features
The EEPRS framework operates on GWAS summary statistics and phenotype embeddings, both derived from classical statistical approaches. Foundation models offer an alternative integration strategy where learned sequence representations replace or augment summary statistics. Rather than weighting variants by GWAS effect sizes, foundation model approaches can score variants by their predicted functional impact, regulatory consequence, or embedding similarity to known pathogenic variants (Chapter 18).
Attention-based integration, graph neural networks for pathway aggregation (Chapter 22), and transformer encoders for sequence context can all incorporate EHR embeddings as additional input features. A patient’s clinical embedding provides context that may modify interpretation of their genetic variants: a variant of uncertain significance (VUS) in a cardiovascular gene carries different implications for a patient whose clinical embedding places them in the cardiovascular risk cluster versus one with an unremarkable clinical profile. This contextualization moves beyond additive combination toward models that learn interactions between genetic and clinical risk.
The combination of phenotype embeddings and foundation model features remains largely unexplored. EEPRS demonstrated that phenotype embeddings capture heritable variation beyond single-trait GWAS; foundation models demonstrate that sequence context improves variant effect prediction beyond simple annotations (Section 18.7). Whether these approaches provide complementary signal, and whether their combination improves clinical prediction beyond either alone, represents an open research question with substantial clinical implications.
Before proceeding to temporal modeling, ensure you understand:
- Why embedding-based approaches improve some conditions (cardiovascular cluster) but not others (breast cancer)
- How PRS-PheWAS provides interpretability for enhanced polygenic scores
- What population representativeness challenges affect EHR-based embedding methods
If these concepts are unclear, review the preceding sections before continuing.
28.5 Temporal Modeling Architectures
Clinical risk prediction spans diverse temporal structures, and the choice of modeling framework must match the prediction task. A screening tool estimating whether a patient will develop diabetes within ten years faces different statistical challenges than a monitoring system tracking whether a patient’s kidney function trajectory signals imminent decline. Foundation model features can integrate into each framework, but the integration patterns differ.
The following section introduces survival analysis concepts (hazard functions, censoring, proportional hazards assumptions) and longitudinal modeling frameworks (joint models, time-varying coefficients). Readers unfamiliar with these statistical foundations may benefit from reviewing standard biostatistics references before proceeding. The key conceptual distinction is between models that predict whether an event occurs within a fixed window versus models that track how risk evolves over time.
Survival models address time-to-event outcomes where patients are followed until an event occurs or observation ends. The Cox proportional hazards model remains the workhorse of clinical risk prediction, estimating hazard ratios for features while making minimal assumptions about baseline hazard shape. Foundation model embeddings enter as covariates alongside clinical variables, with the proportional hazards assumption requiring that genomic risk effects remain constant over time. When this assumption fails (as when genetic effects on cancer recurrence differ between early and late periods), stratified or time-varying coefficient extensions accommodate the violation.
Deep survival models extend this framework through neural network architectures that learn nonlinear feature interactions. DeepSurv replaces the linear Cox predictor with a multilayer network while preserving the partial likelihood objective (Katzman et al. 2018). Deep Survival Machines model the survival distribution as a mixture of parametric components, enabling richer distributional assumptions than the semiparametric Cox approach (Nagpal, Li, and Dubrawski 2021). These architectures naturally accommodate the high-dimensional embeddings that foundation models produce, though the risk of overfitting increases and careful regularization becomes essential.
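The objective that makes such networks survival models is the Cox partial likelihood evaluated on network outputs. A minimal PyTorch version is sketched below, with risk sets formed by sorting (a Breslow-style approximation when event times are tied).

```python
import torch

def neg_cox_partial_log_likelihood(risk, time, event):
    """risk:  (n,) predicted log-hazards from the network
    time:  (n,) follow-up times
    event: (n,) 1.0 if the event was observed, 0.0 if censored"""
    order = torch.argsort(time, descending=True)    # prefix = risk set
    risk, event = risk[order], event[order]
    log_risk_set = torch.logcumsumexp(risk, dim=0)  # log sum over {j: t_j >= t_i}
    ll = (risk - log_risk_set) * event              # terms only at observed events
    return -ll.sum() / event.sum().clamp(min=1.0)
```

Training any network under this loss in place of the linear Cox predictor recovers the DeepSurv recipe; foundation model embeddings simply enter as additional covariates.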
Longitudinal models address a different challenge: patients observed repeatedly over time, with measurements that evolve and interact. A patient’s hemoglobin A1c trajectory over five years contains information that a single baseline measurement cannot capture. Whether values are stable, rising, or fluctuating conveys prognostic signal beyond their current level. Joint longitudinal-survival models connect these repeated measurements to event outcomes, modeling how biomarker trajectories associate with hazard while accounting for informative dropout when sicker patients are measured more frequently or die before later observations.
Foundation model features integrate into longitudinal frameworks at multiple levels. Static genomic embeddings (computed once from germline sequence) serve as time-invariant covariates influencing both trajectory shape and event hazard. Time-varying molecular features (expression profiles, methylation states, circulating tumor DNA levels) can be encoded through foundation models at each measurement occasion, producing sequences of embeddings that recurrent or attention-based architectures process into trajectory representations. The computational cost of re-encoding molecular data at each timepoint is substantial, making efficient inference strategies essential for deployment.
Transformer architectures designed for irregularly sampled time series offer a natural framework for clinical trajectories. Models like STraTS and similar clinical transformers handle the variable timing and missing measurements characteristic of real-world healthcare data (Tipirneni and Reddy 2022). Position encodings based on actual timestamps rather than sequence position accommodate irregular sampling. Attention mechanisms identify which historical measurements most inform current predictions. Foundation model embeddings at each timepoint provide richer input representations than raw laboratory values alone.
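One generic way to realize timestamp-based position encodings is a sinusoidal map over actual measurement times; this sketch shows the pattern, though specific models such as STraTS learn their time embeddings rather than fixing them.

```python
import torch

def time_encoding(timestamps, dim=64, max_period=10000.0):
    """Sinusoidal encoding of real-valued times (e.g., days since
    baseline), so irregular sampling enters the model directly."""
    half = dim // 2
    freqs = torch.exp(-torch.arange(half) *
                      torch.log(torch.tensor(max_period)) / half)
    angles = timestamps.unsqueeze(-1) * freqs      # (n_obs, half)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

visit_days = torch.tensor([0.0, 14.0, 90.0, 365.0])  # irregular visit times
enc = time_encoding(visit_days)                      # (4, 64), added to features
```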
The fundamental design choice in temporal genomic risk modeling is how to combine time-invariant genetic features (germline sequence does not change) with time-varying clinical context (laboratory values, disease progression, treatment response). The most effective architectures treat genetics as a “prior” that sets baseline risk trajectory, then update predictions as clinical observations accumulate. This mirrors clinical reasoning: genetic predisposition establishes susceptibility, but current clinical state determines immediate risk.
The choice between survival and longitudinal frameworks depends on the clinical question and available data. When the goal is baseline risk stratification (identifying high-risk patients at a single decision point), survival models with static genomic features often suffice. When the goal is dynamic monitoring (detecting deterioration as it develops), longitudinal models that update predictions as new measurements arrive become necessary. Hybrid approaches that initialize with genomic risk and update based on clinical trajectory combine the strengths of both paradigms.
28.6 Evaluation for Clinical Deployment
High performance on held-out test sets is necessary but far from sufficient for clinical deployment. Risk models must satisfy multiple evidence standards that typical machine learning papers do not address, and teams planning translation must understand these requirements from the outset rather than discovering them after development is complete.
Imagine you have developed a foundation model-based cardiovascular risk predictor that achieves auROC of 0.82 on your test set, substantially better than the 0.76 of the traditional Pooled Cohort Equations. A health system is interested in deploying it. What questions should they ask before integration? What evidence would you need beyond test set performance? Think about this before reading the evaluation framework below.
28.6.1 Discrimination
Discrimination measures how well a model ranks patients by risk, distinguishing those who will experience outcomes from those who will not. For binary endpoints like disease occurrence within a fixed time window, the area under the receiver operating characteristic curve (auROC) summarizes discrimination across all classification thresholds (Section 12.6). When outcomes are rare (severe adverse drug reactions, specific disease subtypes), the area under the precision-recall curve (auPRC) better reflects how well the model identifies true positives among many negatives. For survival tasks with censoring, the concordance index and time-dependent auROC generalize these metrics to the time-to-event setting.
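All three metric families are available in standard libraries; a toy example assuming scikit-learn and lifelines are installed.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score
from lifelines.utils import concordance_index

y_true = np.array([0, 0, 1, 0, 1])
risk = np.array([0.10, 0.30, 0.80, 0.20, 0.60])

auroc = roc_auc_score(y_true, risk)              # threshold-free ranking quality
auprc = average_precision_score(y_true, risk)    # preferred for rare outcomes

# Censored survival data: lifelines expects higher scores to mean longer
# survival, so pass negated risk scores.
times = np.array([5.0, 8.0, 2.0, 9.0, 3.0])
events = np.array([0, 0, 1, 0, 1])
cindex = concordance_index(times, -risk, events)
```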
Strong discrimination is necessary but not sufficient. A model that correctly ranks patients but systematically overestimates or underestimates absolute risk magnitudes will lead to inappropriate clinical decisions. If a model predicts 5% risk for patients who actually experience 15% event rates, physicians using those predictions will undertreat. Conversely, systematically inflated predictions lead to overtreatment with attendant harms and costs.
28.6.2 Calibration
Calibration asks whether predicted probabilities match observed frequencies. If a model assigns 20% risk to a group of patients, approximately 20% should experience the outcome. Well-calibrated predictions can be interpreted at face value and used directly for clinical decision-making; miscalibrated predictions mislead regardless of discrimination quality.
Assessment involves calibration plots comparing predicted risk deciles to observed event rates, statistical tests like the Hosmer-Lemeshow test, and proper scoring rules like the Brier score that combine calibration and discrimination (Section 12.12). The methodological foundations for these assessments, including temperature scaling and isotonic regression approaches, are detailed in Section 24.4. These assessments must be stratified by clinically relevant subgroups (ancestry, sex, age, comorbidity burden) because a model well-calibrated overall may be systematically miscalibrated for specific populations.
For polygenic score-informed models, calibration requires particular attention. Raw polygenic scores are typically centered and scaled rather than calibrated to absolute risk. Why are raw scores uncalibrated? A PGS is constructed by summing effect sizes across variants, producing a relative ranking rather than an absolute probability. The score distribution depends on the training population’s allele frequencies and LD structure; the same score percentile maps to different absolute risks depending on baseline disease incidence, age, sex, and environmental exposures. Without anchoring to an external incidence rate, the score carries no inherent probability interpretation. Mapping a score to an absolute event probability requires post-hoc models incorporating baseline incidence and clinical covariates. Foundation models can shift score distributions as architectures evolve, meaning recalibration may be necessary when updating encoders. The connection to Chapter 24 is direct: calibration is one form of uncertainty quantification, assessing whether model confidence aligns with actual outcome frequencies.
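A sketch of that post-hoc mapping, using a synthetic calibration cohort in place of real data; the covariates and effect sizes below are invented purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000                                        # synthetic calibration cohort
pgs = rng.standard_normal(n)                    # standardized PGS (z-scores)
age = rng.uniform(40, 70, n)
logit = -2.6 + 0.5 * pgs + 0.03 * (age - 55)    # modest baseline incidence
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Logistic recalibration anchors the relative score to absolute incidence.
recal = LogisticRegression().fit(np.column_stack([pgs, age]), y)
abs_risk = recal.predict_proba([[1.8, 52.0]])[:, 1]   # PGS z = 1.8, age 52
print(f"Estimated absolute risk: {abs_risk[0]:.1%}")
```

The mapping holds only while the score distribution and baseline incidence in deployment match the calibration cohort, which is why encoder updates can force recalibration.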
28.6.3 Clinical Utility
Beyond discrimination and calibration, clinical utility asks whether using the model will change decisions beneficially. Net reclassification improvement quantifies how many patients are appropriately moved across risk thresholds compared to a baseline model. Decision curve analysis estimates net benefit across threshold probabilities, accounting for the relative costs of false positives and false negatives in specific clinical contexts.
Net benefit analysis (Vickers and Elkin 2006) quantifies clinical value by weighing true positives against false positives at each decision threshold:
\[\text{Net Benefit} = \frac{\text{TP}}{N} - \frac{\text{FP}}{N} \times \frac{p_t}{1-p_t}\]
where \(p_t\) is the intervention threshold probability. Decision curves plot net benefit across thresholds, comparing the model to “treat all” and “treat none” strategies. For genomic risk prediction, net benefit analysis answers the question clinical utility ultimately requires: across what range of risk thresholds does using this model improve outcomes compared to simpler strategies? A model may achieve high auROC but provide no net benefit if its discrimination does not translate to actionable risk stratification at clinically relevant thresholds (Steyerberg 2019).
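The formula translates directly into code; the sketch below evaluates a decision curve on synthetic, well-calibrated predictions for illustration.

```python
import numpy as np

def net_benefit(y_true, y_prob, t):
    """Net benefit of 'treat if predicted risk >= t' (Vickers & Elkin 2006)."""
    n = len(y_true)
    treat = y_prob >= t
    tp = np.sum(treat & (y_true == 1))
    fp = np.sum(treat & (y_true == 0))
    return tp / n - (fp / n) * t / (1 - t)

def net_benefit_treat_all(y_true, t):
    prev = np.mean(y_true)
    return prev - (1 - prev) * t / (1 - t)

rng = np.random.default_rng(1)
y_prob = rng.random(1000)
y_true = (rng.random(1000) < y_prob).astype(int)   # calibrated by construction
thresholds = np.linspace(0.05, 0.50, 10)
curve = [net_benefit(y_true, y_prob, t) for t in thresholds]
# Compare against treat-all and treat-none (net benefit 0) at each threshold.
```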
For foundation model-based tools, these analyses must demonstrate incremental value over existing alternatives. If a complex genomic foundation model provides only marginal improvement over a traditional polygenic score plus standard clinical calculator, the additional complexity, cost, and implementation burden may not be justified. The relevant comparison is not “better than nothing” but “better than what clinicians can already access.”
| Evaluation Dimension | Key Question | Primary Metrics | Subgroup Requirements |
|---|---|---|---|
| Discrimination | Does the model rank patients correctly? | auROC, auPRC, C-index | By ancestry, sex, age |
| Calibration | Do predicted probabilities match reality? | Calibration slope, ECE, Brier | By ancestry, comorbidity |
| Clinical utility | Does using the model improve decisions? | NRI, decision curves, net benefit | By decision threshold |
| Incremental value | Is it better than existing tools? | Delta-metrics vs. baseline | Across care settings |
Claims that foundation model-based risk tools improve upon “state-of-the-art” polygenic prediction require verification of baseline strength. The minimum baseline battery for rigorous evaluation should include:
- LDpred2-auto or LDpred2-grid: LD-aware Bayesian method that estimates polygenicity and heritability directly
- PRS-CS or PRS-CS-auto: Continuous shrinkage prior accommodating highly polygenic architectures
- SBayesR or SBayesRC: Mixture model approach; annotation-integrated version provides additional benchmark
- Published PGS from PGS Catalog for the specific trait, representing community-validated scores
- XGBoost or random forest: Non-deep-learning ML alternative establishing whether neural network complexity is necessary
Using only clumping-and-thresholding (C+T) as baseline artificially inflates apparent foundation model gains by 16-60%. When properly tuned linear methods are included, neural networks for polygenic prediction often perform only 93-95% as well, with apparent nonlinear advantages reflecting implicit LD modeling rather than genuine epistasis detection or representation learning (Ge et al. 2019).
The diagnostic question: does your foundation model approach outperform the best available linear method, or only weak linear methods? Incremental improvement over strong baselines justifies deployment complexity; improvement only over weak baselines does not.
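Operationally, the diagnostic is a paired comparison on the same test individuals; a bootstrap sketch for the AUC difference (the same pattern applies to calibration and net-benefit deltas).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_delta_auc(y, score_fm, score_baseline, n_boot=2000, seed=0):
    """Paired bootstrap CI for AUC(foundation model) - AUC(strongest
    linear baseline, e.g., LDpred2 or PRS-CS), on the same test set."""
    rng = np.random.default_rng(seed)
    idx, deltas = np.arange(len(y)), []
    for _ in range(n_boot):
        b = rng.choice(idx, size=len(y), replace=True)
        if y[b].min() == y[b].max():          # single-class resample; skip
            continue
        deltas.append(roc_auc_score(y[b], score_fm[b]) -
                      roc_auc_score(y[b], score_baseline[b]))
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    return float(np.mean(deltas)), (float(lo), float(hi))
```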
28.6.4 Validation Hierarchy
Evidence strength depends critically on validation design. Internal validation through cross-validation or temporal splits within development data is useful but insufficient due to potential overfitting and subtle data leakage issues discussed in Section 12.4.1. External validation across institutions and ancestries tests the same locked model in independent health systems and diverse populations. This step is essential for assessing whether performance reflects genuine biological signal versus idiosyncratic features of the development dataset.
Prospective observational validation runs the model silently alongside clinical care without influencing decisions, measuring real-time performance and drift in deployment conditions. Prospective interventional trials use randomized or quasi-experimental designs to assess whether model-guided care actually improves outcomes, equity, and cost-effectiveness compared to usual care.
For most foundation model-based tools, regulators and health systems expect robust external validation at minimum. High-stakes applications (cancer prognosis affecting treatment intensity, pharmacogenomic predictions affecting drug choice) may require prospective interventional evidence. The investment required increases at each level of the hierarchy, but so does the confidence that deployment will produce benefit rather than harm.
28.7 Uncertainty Quantification
In clinical settings, models must know when they do not know. A risk prediction offered with false confidence is more dangerous than one accompanied by appropriate uncertainty bounds, because the former invites unwarranted action while the latter prompts appropriate caution or additional evaluation.
Two sources of uncertainty require distinction. Aleatoric uncertainty reflects irreducible noise in the outcome: even with perfect input features, some patients with identical measured characteristics will experience different outcomes due to unmeasured variables, stochastic biology, or measurement error. Epistemic uncertainty reflects model limitations: insufficient training data, architectural constraints, or distributional shift between training and deployment conditions. Aleatoric uncertainty cannot be reduced by collecting more data or improving models; epistemic uncertainty can (Section 24.1).
Practical uncertainty quantification methods include ensemble approaches, where multiple models trained with different random seeds provide prediction intervals based on their disagreement (Section 24.5.1). Monte Carlo dropout approximates Bayesian uncertainty by averaging predictions across stochastic forward passes (Section 24.5.2). Conformal prediction provides principled prediction intervals with guaranteed coverage under exchangeability assumptions, avoiding the distributional assumptions required by parametric methods (Section 24.6). Temperature scaling post-hoc adjusts model outputs to improve calibration without retraining (Section 24.4).
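Of these, split conformal prediction is the simplest to retrofit onto an already-trained model; a minimal sketch using absolute-residual nonconformity scores.

```python
import numpy as np

def split_conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Prediction intervals with (1 - alpha) marginal coverage under
    exchangeability; no distributional assumptions on the model."""
    scores = np.abs(cal_true - cal_pred)                  # calibration residuals
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample correction
    q = np.quantile(scores, level, method="higher")
    return test_pred - q, test_pred + q
```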
A foundation model-based risk tool provides a prediction of 35% 10-year cardiovascular risk for a patient of Nigerian ancestry, along with a wide confidence interval of 15-55%. The model was trained primarily on European-ancestry data. Is this wide interval more likely reflecting aleatoric or epistemic uncertainty? What would you recommend the clinician do with this prediction?
This wide interval primarily reflects epistemic uncertainty due to distributional shift: the model has limited training data from African-ancestry populations and thus low confidence in its predictions. The clinician should interpret this prediction with caution, consider it alongside other risk factors, and potentially order additional testing rather than making treatment decisions based solely on this uncertain estimate.
For foundation model-based systems, uncertainty decomposes into genomic and clinical components. Genomic uncertainty reflects confidence in variant effect predictions, fine-mapping probabilities, or embedding reliability; it increases for variants from underrepresented populations, rare variants with limited training examples, or sequences falling outside the distribution seen during pretraining. Clinical uncertainty reflects extrapolation to new care settings, practice patterns, or patient populations not represented in development data.
Selective prediction allows models to abstain when uncertainty exceeds thresholds, flagging cases for human review rather than providing potentially misleading predictions (Section 24.8). This is particularly important for patients from rare ancestries underrepresented in training data or with unusual clinical presentations. The tension between coverage (providing predictions for all patients) and reliability (ensuring predictions are trustworthy) must be navigated thoughtfully, ideally with input from the clinicians who will use the system.
28.8 Fairness and Health Equity
Many genomic and electronic health record datasets encode historical inequities in who gets genotyped, which populations are recruited into biobanks, and how healthcare is documented and delivered. Risk models trained on such data can amplify disparities if not carefully evaluated and designed.
The structural biases that genomic datasets inherit, from sequencing cohort recruitment to biobank composition to ClinVar submission patterns, create cascading effects on model performance. These biases manifest not as random noise but as systematic underperformance for populations historically excluded from genomic research (Section 13.2.1).
Underrepresentation of non-European ancestries compounds at each stage of the genomic AI pipeline. GWAS discovery is underpowered. Fine-mapping resolution is reduced due to different LD patterns. Variant effect predictors have fewer training examples. PRS portability suffers. Foundation model embeddings are less well-calibrated. Each layer of the stack inherits and potentially amplifies the biases of preceding layers, making end-to-end equity evaluation essential rather than optional.
The ancestry bias in genome-wide association studies persists in foundation model applications. As discussed in Section 3.7, polygenic scores derived from European-ancestry data substantially underperform in other populations. Foundation models have the opportunity but not the guarantee to improve portability by leveraging functional priors that transfer across ancestries (sequence-based deleteriousness does not depend on population-specific linkage disequilibrium) and by incorporating multi-ancestry training data. Whether they succeed depends on training data composition, evaluation practices, and explicit attention to cross-ancestry performance throughout development.
Electronic health record features introduce additional bias sources. Which patients receive genetic testing, which laboratory tests are ordered, how diagnoses are coded, and how thoroughly clinical notes are documented all differ systematically across patient populations, care settings, and health systems. A model trained on one institution’s data may encode those institutional patterns rather than underlying biology.
Health equity evaluation requires disparity metrics measuring performance differences in discrimination, calibration, and clinical utility across subgroups defined by ancestry, sex, socioeconomic proxies, and care site. Access metrics assess whether financial, geographic, or systemic barriers limit which patients can benefit from genomic risk tools. Outcome metrics evaluate whether clinical actions triggered by predictions differ across groups and whether benefits accrue equitably or concentrate among already-advantaged populations.
Technical mitigation strategies include reweighting training data to reduce representation disparities, group-wise calibration ensuring equitable performance across subgroups, and localized fine-tuning using deployment-site data. These approaches are discussed further in Section 13.9. Technical interventions alone cannot overcome structural inequities. Non-technical approaches including expanding sequencing access, subsidizing testing for underserved populations, and designing workflows that accommodate diverse care settings are equally essential.
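Group-wise calibration, mentioned above, can be sketched as per-subgroup isotonic recalibration; adequate per-group sample sizes are assumed, and small groups may require pooled or hierarchical alternatives.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_groupwise_calibrators(y_prob, y_true, groups):
    """One isotonic map per subgroup, so each group's predictions are
    recalibrated against its own observed event rates."""
    return {g: IsotonicRegression(out_of_bounds="clip")
                 .fit(y_prob[groups == g], y_true[groups == g])
            for g in np.unique(groups)}

def apply_groupwise_calibration(calibrators, y_prob, groups):
    out = np.empty_like(y_prob, dtype=float)
    for g, cal in calibrators.items():
        mask = groups == g
        out[mask] = cal.predict(y_prob[mask])
    return out
```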
The core principle is that equity cannot be an afterthought addressed during final evaluation. It must inform pretraining data selection, benchmark choice, validation study design, and deployment planning from the outset. A model that appears well-calibrated overall but is miscalibrated for specific populations will exacerbate rather than reduce health disparities.
The governance frameworks, regulatory considerations, and responsible development practices for ensuring equitable clinical AI are examined in Chapter 27.
28.9 Clinical Integration
Even a comprehensively validated model can fail in practice if it does not integrate into clinical workflows. Genomic risk predictions must reach clinicians at decision points, in formats that support rather than disrupt care delivery, with appropriate interpretability and uncertainty communication.
28.9.1 Workflow Integration Patterns
Clinical genomics has established pathways for returning results through CLIA-certified laboratories, structured reports, and genetic counseling. Foundation model-based risk tools can augment these pathways in two primary ways. Laboratory interpretation augmentation uses foundation model predictions to prioritize variants for manual review, provide richer functional annotations, and suggest likely disease mechanisms supporting differential diagnosis. Direct risk embedding in electronic health records precomputes risk scores for patients with genomic data, surfaces them in structured fields or clinical dashboards, and triggers alerts when thresholds are crossed.
Design choices include batch versus on-demand computation (batch overnight processing is often preferable given foundation model computational costs and the relative stability of genomic data), synchronous alerts at order entry versus asynchronous reports in clinical inboxes, and whether high-impact predictions require human-in-the-loop review before reaching front-line clinicians.
The specifics vary by clinical context. Pharmacogenomic alerts might appear synchronously at prescription order entry, providing immediate guidance on drug selection or dosing. Cardiometabolic risk scores might appear in primary care dashboards updated weekly, informing prevention discussions at annual visits. Oncology prognosis estimates might be generated at diagnosis and reviewed in tumor board settings where multidisciplinary teams make treatment decisions.
- Pharmacogenomics: synchronous alerts at prescription entry. Pre-compute patient drug-gene interaction profiles. Alerts must include actionable alternatives, not just warnings (a sketch of such an alert payload follows this list).
- Primary care prevention: batch scoring with dashboard display. Weekly updates are sufficient given slow-changing genomic risk. Integrate with existing cardiovascular risk calculators.
- Oncology prognosis: generate at diagnosis for tumor board review. Include uncertainty bounds and subgroup performance caveats. Require human geneticist review before clinical action.
- Rare disease diagnosis: on-demand for active diagnostic workups. Prioritize interpretability (which variants, what mechanisms) over pure risk scores.
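As a concrete illustration of the pharmacogenomic pattern, a synchronous order-entry alert might be delivered as a decision-support card loosely following the HL7 CDS Hooks card schema. The payload below is a hypothetical sketch, and all field values are illustrative:

```python
# Hypothetical decision-support card for a carbamazepine order in an
# HLA-B*15:02 carrier, loosely following the HL7 CDS Hooks card schema.
alert_card = {
    "summary": "High risk of severe cutaneous adverse reaction",
    "indicator": "critical",
    "detail": (
        "Patient carries HLA-B*15:02, associated with carbamazepine-induced "
        "Stevens-Johnson syndrome and toxic epidermal necrolysis. "
        "Model-estimated adverse event probability and its uncertainty "
        "interval would be rendered here."
    ),
    "source": {"label": "Pharmacogenomic risk service (hypothetical)"},
    # Actionable alternatives, not just a warning:
    "suggestions": [
        {"label": "Switch to an alternative anticonvulsant without known "
                  "HLA-B*15:02 interaction"},
        {"label": "Order confirmatory HLA typing before dispensing"},
    ],
}
```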
28.9.2 System Architecture
From an engineering perspective, foundation model-based clinical tools typically require a secure model-serving endpoint handling inference requests, input adapters transforming laboratory and electronic health record data into model-ready formats, output adapters mapping predictions to structured clinical concepts or user-facing text, and logging infrastructure providing audit trails and enabling drift detection.
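A minimal sketch of this adapter pattern, assuming a Python service; all class and function names are hypothetical, and the model and calibrator are injected dependencies:

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass
class RiskPrediction:
    risk: float          # calibrated absolute risk
    model_version: str   # recorded for audit trails and reproducibility

class InputAdapter:
    """Transform laboratory and EHR payloads into model-ready features."""
    def transform(self, genomic_features: dict, ehr_record: dict) -> dict:
        # A real adapter would normalize variants, harmonize units,
        # impute missing fields, and attach QC flags.
        return {**genomic_features, "age": ehr_record["age"]}

class OutputAdapter:
    """Map raw model scores to structured clinical outputs."""
    def __init__(self, calibrator):
        self.calibrator = calibrator   # e.g., a fitted isotonic map
    def to_clinical(self, raw_score: float, version: str) -> RiskPrediction:
        return RiskPrediction(self.calibrator(raw_score), version)

def handle_request(genomic, ehr, model, in_ad, out_ad, audit_log):
    features = in_ad.transform(genomic, ehr)
    pred = out_ad.to_clinical(model.predict(features), model.version)
    # Audit trail: store an input hash rather than raw genomic data.
    digest = hashlib.sha256(json.dumps(features, sort_keys=True).encode())
    audit_log.append({"input_sha256": digest.hexdigest(),
                      "output": asdict(pred)})
    return pred
```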
Regulated settings impose additional requirements: versioning of models, data pipelines, and reference genomes with complete reproducibility; access controls and network segmentation protecting genomic data; and validation environments separated from production for safe testing of updates. Practical guidance on hardware requirements, deployment patterns, and cost estimation appears in Appendix B.
28.9.3 Post-Deployment Monitoring
Clinical deployment begins rather than ends the model lifecycle. Practice patterns evolve as new treatments and guidelines emerge. Patient populations shift as screening programs expand or contract. Laboratory assays and sequencing pipelines change, introducing distributional shifts in input features.
Monitoring systems should track input distributions (genotype frequencies, electronic health record feature patterns) to detect when current patients differ from training populations. Output distributions (risk score histograms, threshold-crossing rates) reveal whether model behavior is changing. Performance metrics computed via rolling windows or periodic audits detect calibration or discrimination degradation before clinical consequences accumulate.
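One widely used distribution check is the population stability index (PSI), which compares the current score distribution to a training-time reference. A minimal sketch follows; the 0.1/0.2 interpretation bands are conventional rules of thumb, not universal standards:

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10):
    """PSI between a reference and a current score distribution.

    Bin edges are fixed from the reference distribution's quantiles so
    that the comparison stays stable as new data arrive.
    """
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    current = np.clip(current, edges[0], edges[-1])   # keep in range
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    ref_frac = np.clip(ref_frac, 1e-6, None)          # avoid log(0)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Rule-of-thumb interpretation: < 0.1 stable, 0.1-0.2 moderate shift,
# > 0.2 investigate and consider recalibration.
```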
When drift is detected, responses range from recalibration (adjusting the score-to-probability mapping while preserving ranking behavior) through partial retraining (updating prediction heads while keeping foundation model weights fixed) to full model updates (retraining encoders, requiring renewed validation). The modular separation between foundation model backbones and clinical prediction heads facilitates this maintenance: encoders can be versioned and swapped with compatibility testing while prediction heads adapt to local deployment conditions.
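Recalibration, the lightest-touch response, refits only the score-to-probability map on recent outcome-labeled deployment data. A minimal Platt-scaling sketch with scikit-learn; because the transform is monotone in the raw score (whenever the fitted coefficient is positive, the typical case), patient ranking and hence discrimination are preserved:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def _logit(p, eps=1e-6):
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    return np.log(p / (1 - p))

def fit_recalibrator(scores, outcomes):
    """Platt scaling: logistic regression on the logit of the raw score."""
    lr = LogisticRegression()
    lr.fit(_logit(scores).reshape(-1, 1), outcomes)
    return lambda s: lr.predict_proba(_logit(s).reshape(-1, 1))[:, 1]

# Usage on recent, outcome-labeled deployment data:
# recalibrate = fit_recalibrator(recent_scores, recent_outcomes)
# adjusted = recalibrate(new_scores)   # ranking unchanged, risks shifted
```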
Incident response processes allow clinicians to report surprising or harmful predictions, triggering root-cause analysis and potential remediation. Governance structures including AI oversight committees review models periodically and establish clear criteria for deprecation when performance degrades below acceptable thresholds.
28.10 Regulatory and Quality Frameworks
Foundation model-based clinical tools exist on a spectrum from research-only applications supporting hypothesis generation through clinical decision support tools informing diagnosis or management to regulated medical devices subject to formal oversight. The regulatory classification depends on intended use, risk level, and the claims made for the tool.
Jurisdictions differ in specifics, but common expectations include transparent descriptions of training data and known limitations, quantitative performance evidence across relevant subgroups, plans for post-market surveillance and incident reporting, and change management procedures for model updates. Beyond formal regulation, health systems typically require standard operating procedures for deployment and decommissioning, model cards describing training data and limitations, validation reports documenting evaluation evidence, and governance structures reviewing and approving new tools (Chapter 27).
Foundation models introduce additional documentation requirements. Descriptions of pretraining corpora must specify which genomes, assays, and populations were included. Fine-tuning datasets and label definitions require detailed documentation. Procedures for updating to new genome builds, reference panels, or assay types must be established and tested. The modular separation between pretrained encoders and clinical prediction heads can ease regulatory management by allowing independent updates to each component, but this requires careful version control and compatibility testing to ensure that updating one component does not degrade performance of the combined system.
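This documentation burden can be made machine-readable. Below is a hypothetical model card skeleton capturing the foundation-model-specific fields described above; the field names and example values are illustrative, not a regulatory standard:

```python
from dataclasses import dataclass, field

@dataclass
class GenomicModelCard:
    model_name: str
    encoder_version: str            # pretrained backbone, versioned separately
    head_version: str               # clinical prediction head
    genome_build: str               # e.g., "GRCh38"
    pretraining_corpora: list[str]  # genomes, assays, populations included
    finetuning_dataset: str         # cohort and label definitions
    intended_use: str               # population, decision point, claims
    subgroup_performance: dict      # e.g., auROC / calibration slope per group
    known_limitations: list[str] = field(default_factory=list)
    update_procedure: str = ""      # genome build / panel / assay migrations

# Illustrative values only:
card = GenomicModelCard(
    model_name="cad-risk-fusion",
    encoder_version="dna-encoder-2.1",
    head_version="cad-head-0.4",
    genome_build="GRCh38",
    pretraining_corpora=["multi-ancestry WGS panel (hypothetical)"],
    finetuning_dataset="biobank cohort v3, incident CAD labels (hypothetical)",
    intended_use="primary-care statin decision support",
    subgroup_performance={"EUR": {"auroc": 0.74}, "AFR": {"auroc": 0.70}},
    known_limitations=["reduced calibration for underrepresented ancestries"],
    update_procedure="revalidate on build migration; compatibility-test heads",
)
```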
28.11 Case Studies
Three stylized case studies illustrate how foundation model features integrate into clinical risk prediction across different disease contexts, time horizons, and decision types.
As you read each case study, consider: (1) What specific clinical decision does the prediction inform? (2) What evidence would be required before deployment? (3) What equity considerations apply? These questions connect the abstract principles discussed earlier to concrete clinical scenarios.
28.11.1 Cardiometabolic Risk Stratification
A 52-year-old man presents to his primary care physician for an annual wellness visit. His LDL cholesterol is 145 mg/dL, blood pressure is 138/88 mmHg, and hemoglobin A1c is 5.9%, placing him in the prediabetic range. His father had a myocardial infarction at age 58. The standard Pooled Cohort Equations estimate his 10-year atherosclerotic cardiovascular disease risk at 8.2%, just below the threshold where guidelines recommend statin therapy.
A foundation model-augmented risk system could refine this assessment. Variant effect scores from DNA foundation models annotate variants in cardiometabolic risk loci with predicted regulatory and coding impacts, combining sequence-based scores with fine-mapping probabilities to prioritize likely causal variants (Section 18.3; Section 18.4). A polygenic embedding model like Delphi or G2PT produces a genome-wide representation capturing nonlinear risk structure beyond simple effect size sums (Georgantas, Kutalik, and Richiardi 2024; Lee et al. 2025). This genomic embedding combines with electronic health record features through an intermediate fusion architecture, producing an updated 10-year risk estimate of 11.4%, above the treatment threshold.
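A minimal sketch of such an intermediate fusion head in PyTorch, assuming precomputed per-region genomic tokens from a frozen encoder and tabular EHR features; all dimensions and architecture choices are illustrative:

```python
import torch
import torch.nn as nn

class IntermediateFusionHead(nn.Module):
    """Fuse frozen genomic region embeddings with encoded EHR features."""
    def __init__(self, genomic_dim=512, ehr_dim=64, hidden=128):
        super().__init__()
        self.genomic_proj = nn.Linear(genomic_dim, hidden)
        self.ehr_encoder = nn.Sequential(nn.Linear(ehr_dim, hidden), nn.ReLU())
        # Cross-attention: the EHR representation queries genomic regions.
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        self.risk_head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, genomic_tokens, ehr_feats):
        # genomic_tokens: (B, L, genomic_dim), one token per genomic region
        g = self.genomic_proj(genomic_tokens)               # (B, L, H)
        e = self.ehr_encoder(ehr_feats).unsqueeze(1)        # (B, 1, H)
        fused, attn_w = self.attn(query=e, key=g, value=g)  # (B, 1, H)
        combined = torch.cat([fused.squeeze(1), e.squeeze(1)], dim=-1)
        risk = torch.sigmoid(self.risk_head(combined))      # absolute risk
        return risk, attn_w  # attention weights support region attributions
```

Returning the attention weights alongside the risk estimate is one way to support the region-level attributions discussed below.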
The clinical value depends on what this refined estimate enables. If genomic foundation model features merely replicate traditional polygenic score information with higher computational cost, the benefit is marginal. But if the embedding captures pathway-level structure that identifies this patient’s risk as concentrating in LDL metabolism pathways rather than inflammatory or thrombotic mechanisms, that information might strengthen the indication for statin therapy specifically. Attention-based attributions highlighting which genomic regions contribute most to the elevated risk could inform counseling about heritability and family screening.
External validation across multiple health systems and ancestries would need to demonstrate that the foundation model approach provides calibrated predictions and meaningful reclassification improvement over traditional tools. Equity analysis would verify that performance holds across the diverse populations the health system serves rather than degrading for non-European ancestries underrepresented in training data.
28.11.2 Oncology Prognosis
A 64-year-old woman has undergone surgical resection for stage II colorectal cancer with microsatellite stable tumor characteristics. Her oncology team must decide whether adjuvant chemotherapy is warranted given the balance between recurrence risk reduction and treatment toxicity. Traditional staging provides prognostic information, but substantial heterogeneity exists within stage categories.
Foundation models can enrich prognostic assessment through multiple channels. Tumor mutation profiles encoded through models like SetQuence or SetOmic produce embeddings capturing the specific constellation of somatic alterations beyond simple mutation counts (Jurenaite et al. 2024). Transcriptomic profiling integrated through GLUE-style latent spaces adds expression context reflecting tumor microenvironment and pathway activity (Cao and Gao 2022). Graph neural network-based subtyping assigns the tumor to a molecular subtype with characteristic prognosis and treatment response patterns (Li et al. 2022).
These tumor-level representations combine with germline pharmacogenomic features (variants affecting fluoropyrimidine metabolism that influence toxicity risk) and clinical features (performance status, comorbidities, patient preferences) in a survival model predicting two-year recurrence hazard. A high-risk prediction might favor more intensive adjuvant therapy, while low-risk predictions might support observation with close surveillance.
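A minimal sketch of such a survival head using the Cox proportional hazards model from the lifelines library; the column names are hypothetical, and the fused features (tumor embedding components, a DPYD toxicity flag, clinical covariates) are assumed to be precomputed:

```python
import pandas as pd
from lifelines import CoxPHFitter

def fit_recurrence_model(df: pd.DataFrame) -> CoxPHFitter:
    """Fit a Cox model for recurrence on fused tumor/germline/clinical features.

    df must contain follow-up time ("months"), an event indicator
    ("recurred"), and covariate columns such as tumor embedding
    components, a DPYD toxicity flag, and performance status.
    """
    cph = CoxPHFitter(penalizer=0.1)  # ridge penalty stabilizes embedding terms
    cph.fit(df, duration_col="months", event_col="recurred")
    return cph

# Two-year recurrence risk for new patients:
# model = fit_recurrence_model(train_df)
# surv = model.predict_survival_function(new_df, times=[24])
# risk_2yr = 1.0 - surv.loc[24]
```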
The validation requirements are stringent. Retrospective analysis of institutional cohorts establishes proof of concept, but prospective validation in cohorts receiving contemporary treatment regimens is necessary given the rapid evolution of oncology care. Interpretability connecting predictions to specific mutations, pathways, or molecular subtypes supports clinical adoption by providing rationale beyond a black-box hazard estimate.
28.11.3 Pharmacogenomic Adverse Event Prediction
A 45-year-old man with newly diagnosed epilepsy requires anticonvulsant therapy. Carbamazepine is a common first-line choice, but it carries risk of severe cutaneous adverse reactions including Stevens-Johnson syndrome and toxic epidermal necrolysis. The HLA-B*15:02 allele is strongly associated with carbamazepine hypersensitivity in patients of Asian ancestry, and FDA-approved drug labeling recommends genetic testing before initiating therapy in at-risk populations.
This established pharmacogenomic association illustrates both the potential and the limitations of current approaches. Single-variant associations with high effect sizes enable straightforward clinical implementation, but they cover a small fraction of drug-gene interactions. Many patients who do not carry HLA-B*15:02 still experience adverse reactions, suggesting additional genetic (and non-genetic) risk factors that single-variant testing misses.
Foundation models could extend pharmacogenomic prediction beyond established single-gene associations. Resources like PharmGKB (Whirl-Carrillo et al. 2012) provide curated drug-gene-variant associations that serve as ground truth for model development and validation, though coverage remains incomplete for many drug classes. Variant effect scores across HLA genes, drug metabolism enzymes, and immune-related loci provide features reflecting the patient’s overall pharmacogenetic landscape (Section 2.8.4). These features aggregate into a polygenic adverse event risk score that captures contributions from many variants rather than relying on individual high-effect alleles. Combined with clinical features (renal function affecting drug clearance, concomitant medications with interaction potential, prior adverse reaction history), the model predicts adverse event probability specific to the proposed drug.
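A minimal sketch of this aggregation, combining variant-level effect scores into a drug-specific polygenic adverse event score and then fusing it with clinical covariates in a logistic model; weights, shapes, and feature names are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def polygenic_ae_score(variant_effects: np.ndarray,
                       weights: np.ndarray) -> np.ndarray:
    """Weighted sum of per-variant effect scores, (n_patients, n_variants).

    Unlike a single-allele test, this aggregates contributions across HLA,
    drug metabolism, and immune loci; weights would come from curated
    association data such as PharmGKB plus model-derived effect scores.
    """
    return variant_effects @ weights

def fit_ae_model(variant_effects, weights, clinical_feats, ae_outcomes):
    """Combine the polygenic score with clinical covariates (renal function,
    interacting co-medications, prior reactions) in a logistic model."""
    pgs = polygenic_ae_score(variant_effects, weights).reshape(-1, 1)
    X = np.hstack([pgs, clinical_feats])
    # class_weight="balanced" partially offsets the rarity of serious
    # adverse events; it does not solve the ascertainment problem.
    return LogisticRegression(class_weight="balanced").fit(X, ae_outcomes)
```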
The validation challenge is severe. Serious adverse drug reactions are rare, making endpoint ascertainment difficult and validation studies underpowered. Case-control designs enriched for adverse events may overestimate model performance compared to prospective deployment. Multi-site validation across healthcare systems with different prescribing patterns and population ancestry compositions is essential.
Clinical implementation requires integration at the point of prescribing, providing actionable information when drug selection decisions are being made. This argues for pre-computed pharmacogenomic profiles that alert at order entry rather than reactive testing after a prescription is written. The interpretability requirement is particularly acute: clinicians must understand why a model flags a patient as high-risk for a specific drug to make informed risk-benefit decisions.
28.12 Translation as the Test
Success for genomic foundation models in clinical medicine will depend less on model scale and more on rigorous translation. Problem definition, evidence generation, equity evaluation, regulatory compliance, workflow integration, and post-deployment monitoring each introduce opportunities for failure. Models that clear all hurdles are rare; models that skip stages fail in deployment regardless of their technical sophistication.
The representational advances that foundation models provide become valuable only when they flow through validated, equitable, well-integrated clinical tools into decisions that improve patient outcomes. A pathogenicity score with state-of-the-art discrimination adds nothing to care if it reaches clinicians at the wrong moment, in the wrong format, without appropriate uncertainty communication. A risk prediction that performs excellently on average but fails systematically for underrepresented populations may widen health disparities rather than narrow them. Technical capability is necessary but not sufficient for clinical impact.
Rare disease diagnosis illustrates these translation principles in a particularly high-stakes context: whereas risk prediction addresses population-level stratification, variant interpretation addresses individual patients, with different evidence requirements, clinical workflows, and definitions of success (Chapter 29).
Before reviewing the summary, test your recall:
- What are the three main limitations of traditional polygenic risk scores that foundation model features aim to address?
- Compare early, intermediate, and late fusion architectures for integrating genomic and clinical features. When would you choose each approach?
- Why is calibration distinct from discrimination, and why does it matter for clinical deployment?
- How does bias compound through the genomic AI pipeline, from GWAS to foundation model predictions?
- What is the difference between analytical validity, clinical validity, and clinical utility? Give an example where a model has one but not the others.
PRS Limitations: Traditional polygenic scores have three main limitations: (1) they lack mechanistic insight by reducing genomes to single numbers without indicating which biological pathways drive risk, (2) they show poor cross-ancestry portability because they are derived from European-dominated GWAS data and depend on population-specific linkage disequilibrium patterns, and (3) they remain disconnected from clinical workflows by existing outside electronic health records where decisions actually happen. Foundation model embeddings address these by preserving information about which genomic regions contribute to risk, leveraging sequence-based functional priors that transfer across ancestries, and enabling integration architectures that combine genomic and clinical data within EHR systems.
Fusion Architectures: Early fusion concatenates all features into a single input and trains a unified model, allowing arbitrary interactions but requiring complete data for all patients and risking dominance by the strongest signal source. Intermediate fusion trains separate encoders for each modality then combines their embeddings through attention or cross-modal transformers, offering modularity and graceful degradation while still capturing cross-modal interactions. Late fusion trains independent models per modality and combines their predictions through ensemble methods, handling missing data excellently but potentially underutilizing cross-modal structure. Choose early fusion for dense complete datasets, intermediate fusion for evolving foundation model ecosystems requiring modularity, and late fusion when data availability varies substantially across patients.
Calibration vs Discrimination: Discrimination measures how well a model ranks patients by risk (whether those who develop disease score higher than those who do not), typically assessed via auROC. Calibration measures whether predicted probabilities match observed frequencies (whether patients assigned 20% risk actually experience events at 20% rate). A model can have excellent discrimination but poor calibration if it correctly ranks patients but systematically over- or underestimates absolute risk magnitudes. This matters for clinical deployment because treatment decisions depend on absolute risk thresholds: miscalibrated predictions lead to inappropriate treatment (undertreatment if risks are underestimated, overtreatment if inflated) even when patient ranking is correct.
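The distinction is straightforward to quantify on a held-out set. A minimal sketch computing auROC and the calibration slope (the latter estimated, as is standard practice, by regressing observed outcomes on the logit of the predicted probabilities; a slope of 1.0 is ideal):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression

def discrimination_and_calibration(y_true, y_prob, eps=1e-6):
    auroc = roc_auc_score(y_true, y_prob)       # ranking quality only
    p = np.clip(y_prob, eps, 1 - eps)
    logits = np.log(p / (1 - p)).reshape(-1, 1)
    # Slope < 1: predictions too extreme (overfitting); slope > 1: too
    # conservative. C=1e6 makes the fit effectively unpenalized.
    slope = LogisticRegression(C=1e6).fit(logits, y_true).coef_[0, 0]
    return auroc, slope

# A model can reach auroc = 0.85 with slope = 0.6: it ranks patients well
# but overstates how extreme their absolute risks are.
```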
Bias Compounding: Bias accumulates at each stage of the genomic AI pipeline due to European ancestry overrepresentation in genomic datasets. GWAS discovery power is reduced for non-European populations, leading to weaker effect size estimates. Fine-mapping resolution suffers because linkage disequilibrium patterns differ across ancestries. Variant effect predictors have fewer training examples from underrepresented populations. Polygenic scores built from these components show poor portability and reduced accuracy. Foundation model embeddings trained on biased data produce less calibrated representations for non-European ancestries. Each layer inherits and potentially amplifies the biases of preceding layers, making ancestry-stratified evaluation essential at every stage rather than only at final deployment.
Validity Levels: Analytical validity means the test accurately measures what it claims (e.g., genotyping array correctly calls variants). Clinical validity means the measurement associates with the clinical outcome (e.g., polygenic score correlates with disease risk). Clinical utility means using the test improves patient outcomes (e.g., knowing the score leads to interventions that reduce disease incidence). A model can have high analytical and clinical validity but no clinical utility if the resulting predictions do not change clinical decisions: for example, a perfectly accurate test for a biomarker that strongly predicts disease but for which no effective interventions exist would have excellent validity but provide no utility since knowing the result does not enable actions that improve outcomes.
This chapter examined the translation of genomic foundation models into clinical risk prediction tools that can improve patient outcomes.
Key Concepts:
From PRS to embeddings: Traditional polygenic scores are scalar summaries with limited mechanistic insight and poor cross-ancestry portability. Foundation models produce rich embeddings that preserve information about which variants matter, how they interact, and why they contribute to risk.
Fusion architectures: Early, intermediate, and late fusion strategies offer different tradeoffs for combining genomic and clinical features. Intermediate fusion typically provides the best balance of modularity, cross-modal learning, and graceful degradation for missing data.
Evaluation beyond discrimination: Clinical deployment requires not just good ranking (discrimination) but also accurate probability estimates (calibration) and demonstrated benefit over existing tools (clinical utility). Each must be evaluated across clinically relevant subgroups.
Validation hierarchy: Evidence strength increases from internal validation through external validation, prospective observational studies, and prospective interventional trials. Most foundation model tools require at minimum robust external validation.
Uncertainty quantification: Distinguishing aleatoric (irreducible) from epistemic (model-limited) uncertainty enables appropriate clinical responses. Selective prediction allows abstention when confidence is low.
Equity as design principle: Bias compounds through the genomic AI pipeline. Equity evaluation must span discrimination, calibration, and clinical utility across ancestry, sex, and socioeconomic subgroups. Technical mitigation alone cannot overcome structural inequities.
Workflow integration: Valid models fail in practice without appropriate integration into clinical workflows, monitoring for drift, and governance structures for oversight and incident response.
Looking Ahead: Chapter 29 applies these translation principles to rare disease diagnosis, where the stakes are individual patients rather than population-level risk stratification, and where foundation models offer particularly compelling advantages over traditional approaches.