18  Clinical Risk Prediction

Modern genomic foundation models (GFMs) give us increasingly rich representations of DNA, RNA, proteins, and multi-omic context (Parts II–IV). The natural next question is: how do we turn these representations into actionable predictions for individual patients?

This chapter focuses on clinical risk prediction and decision support: models that estimate the probability, timing, or trajectory of outcomes such as incident disease, progression, recurrence, or adverse drug reactions. We emphasize how GFMs and related deep models supply the inputs to these estimates.

We end with case studies in cardiometabolic risk, oncology risk and recurrence, and pharmacogenomics / adverse drug reactions, illustrating how GFMs move from bench to bedside.


18.1 Problem Framing: What Is Clinical Risk Prediction?

Clinical risk prediction is the task of mapping patient data—including genotypes, family history, clinical measurements, imaging, and environmental factors—to probabilistic statements about future outcomes. Concretely, a model might answer questions like:

  • What is this patient’s 10-year risk of coronary artery disease if treated with standard of care?
  • Given current tumor characteristics and therapy, what is the hazard of recurrence within 2 years?
  • If we start this medication, what is the probability of a severe adverse drug reaction (ADR) in the next 6 months?

In practice, these tasks fall into a few archetypes:

  • Individual-level incident risk
    • Will a currently disease-free individual develop disease within a specified time window (e.g., 10-year type 2 diabetes risk)?
  • Progression and complication risk
    • Among individuals with an existing condition, who will develop complications (e.g., nephropathy in diabetes, heart failure after myocardial infarction)?
  • Prognosis and survival
    • Time from baseline to events such as death, recurrence, or transplant, often with censoring and competing risks.
  • Treatment response and toxicity
    • Will a patient benefit from therapy A vs B, and what is their risk of severe toxicity or ADR?

GFMs enter these problems as feature generators: they transform raw genomic and multi-omic data into structured embeddings, variant effect scores, or region-level functional annotations. These representations then feed classic supervised learning tasks—classification, regression, and survival modeling—alongside clinical covariates.


18.2 Task Types and Loss Functions

Although GFMs provide sophisticated inputs, the prediction tasks themselves often re-use well-understood statistical frameworks.

18.2.1 Binary and Multi-label Classification

For many screening or triage problems, risk prediction is posed as:

Will outcome Y occur within horizon T?

Examples include incident atrial fibrillation within 5 years, or hospitalization for heart failure in the next 12 months. Models output a risk score $ \hat{p} = P(Y = 1 \mid x) $ and are trained with cross-entropy or focal losses.
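As a minimal sketch of the two losses named above, the following computes per-patient cross-entropy and focal loss for a predicted risk of the positive class. The function names, the focusing parameter `gamma=2.0`, and the toy predictions are illustrative assumptions; the focal loss here omits the optional class-weighting term.

```python
import math

def binary_cross_entropy(p_hat, y, eps=1e-12):
    """Cross-entropy for one predicted risk p_hat = P(Y=1 | x) and label y in {0, 1}."""
    p_hat = min(max(p_hat, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(p_hat) + (1 - y) * math.log(1.0 - p_hat))

def focal_loss(p_hat, y, gamma=2.0, eps=1e-12):
    """Focal loss: scales cross-entropy by (1 - p_t)^gamma to down-weight easy cases."""
    p_hat = min(max(p_hat, eps), 1.0 - eps)
    p_t = p_hat if y == 1 else 1.0 - p_hat  # probability assigned to the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Toy predictions (illustrative values only).
risks = [0.05, 0.10, 0.80, 0.30]
labels = [0, 0, 1, 1]
mean_bce = sum(binary_cross_entropy(p, y) for p, y in zip(risks, labels)) / len(labels)
mean_focal = sum(focal_loss(p, y) for p, y in zip(risks, labels)) / len(labels)
```

With `gamma=0` the focal loss reduces to cross-entropy; larger values shift the training signal toward hard, misclassified patients, which can help when events are rare.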

Extensions:

  • Multi-label classification: Predict multiple outcomes (e.g., myocardial infarction, stroke, heart failure) simultaneously; share a common representation but separate output heads.
  • Ordinal endpoints: Disease stages or severity scores modeled with ordinal losses instead of strictly binary outcomes.

GFMs contribute by providing richer genetic features than traditional hand-crafted burden scores or PGS (e.g., variant-level embeddings from Nucleotide Transformer, GPN, or regulatory LMs) (Dalla-Torre et al. 2023; Benegas, Batra, and Song 2023).

18.2.2 Survival and Time-to-Event Modeling

Risk is often more naturally expressed as time-to-event:

  • Time from baseline to myocardial infarction or revascularization.
  • Time from surgery to cancer recurrence.
  • Time from first exposure to drug to severe toxicity.

These require models that handle censoring (patients lost to follow-up or event-free at study end). Approaches include:

  • Cox proportional hazards models with genomic and GFM-derived features as covariates.
  • Deep survival models that use neural networks to parameterize hazard functions, survival curves, or discrete-time hazards.
  • Competing risks models for mutually exclusive outcomes (e.g., cancer-specific vs non-cancer mortality).

GFMs naturally provide high-dimensional, possibly non-linear features; deep survival architectures can exploit these features while learning flexible hazard structures.
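To make the handling of censoring concrete, here is a small sketch of the negative log partial likelihood that Cox-style models minimize, whether the score comes from a linear predictor or from a neural network. It assumes no tied event times; the function and argument names are illustrative.

```python
import math

def cox_neg_log_partial_likelihood(times, events, scores):
    """Negative log partial likelihood, assuming no tied event times.

    times  : follow-up time per patient
    events : 1 if the event was observed, 0 if the patient was censored
    scores : model output per patient, interpreted as a log relative hazard
    """
    nll = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if e_i != 1:
            continue  # censored patients contribute only through risk sets
        # Risk set: everyone still under observation at time t_i.
        risk_set = [scores[j] for j, t_j in enumerate(times) if t_j >= t_i]
        log_denominator = math.log(sum(math.exp(s) for s in risk_set))
        nll -= scores[i] - log_denominator
    return nll
```

A censored patient never appears as a numerator term but still enters the denominator for every event that occurred while they were being followed, which is how partial follow-up is used without assuming an event time.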

18.2.3 Multi-task Risk and Shared Representations

Large health systems increasingly estimate risk for dozens of outcomes simultaneously (e.g., hospital readmission, multiple cardiovascular endpoints, medication-specific ADRs). This motivates multi-task frameworks:

  • A shared encoder (combining EHR, genomic, and multi-omic encoders) produces a patient-level embedding.
  • Multiple output heads estimate risks for different endpoints or time horizons.

Such models can exploit cross-task correlations and share statistical strength (e.g., overlapping genetic architectures between lipids, CAD, and stroke). Deep polygenic architectures like Delphi and G2PT already adopt multi-trait ideas for genomic risk representations (Georgantas, Kutalik, and Richiardi 2024; Lee et al. 2025).


18.3 From PGS to GFM-Enabled Risk Scores

Polygenic scores (PGS) are a natural starting point for genomically informed clinical prediction.

18.3.1 Classical PGS: Strengths and Limitations

Traditional PGS typically:

  1. Use GWAS summary statistics to estimate per-variant weights.
  2. Construct a score $ S = \sum_j w_j g_j $, where $ g_j $ is genotype at variant $ j $ and $ w_j $ is its estimated effect size.
  3. Plug PGS into regression or survival models alongside clinical covariates (age, sex, BMI, labs).
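The additive score in step 2 is a weighted sum of allele dosages. The sketch below computes it for a toy cohort and then standardizes it, as is commonly done before step 3; the weights and genotypes are made-up illustrative values, not estimates from any GWAS.

```python
def polygenic_score(weights, dosages):
    """S = sum_j w_j * g_j for one individual (g_j in {0, 1, 2} or an imputed dosage)."""
    if len(weights) != len(dosages):
        raise ValueError("weights and dosages must cover the same variants")
    return sum(w * g for w, g in zip(weights, dosages))

def standardize(scores):
    """Center and scale scores against a reference cohort (here, the input itself)."""
    n = len(scores)
    mean = sum(scores) / n
    sd = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5
    if sd == 0.0:
        raise ValueError("scores have zero variance")
    return [(s - mean) / sd for s in scores]

# Toy per-variant weights and a three-person cohort (illustrative only).
weights = [0.12, -0.05, 0.30]
cohort = [[0, 1, 2], [1, 1, 0], [2, 0, 1]]
raw_scores = [polygenic_score(weights, g) for g in cohort]
z_scores = standardize(raw_scores)
```

The standardized score is what typically enters the downstream regression or survival model as a covariate; it is a relative quantity and not yet an absolute risk.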

Despite many successes, classical PGS have well-known limitations:

  • Limited modeling of epistasis and non-linearities: Additive models struggle with higher-order interactions and context-dependent effects.
  • Challenge in integrating functional priors: Annotation-aware methods exist, but rarely leverage full GFMs.
  • Portability gaps: Performance often drops in under-represented ancestries due to LD structure and GWAS ascertainment.

These limitations motivate deep learning-based PGS that better exploit structure in both genotype and functional annotation space.

18.3.2 Deep Polygenic Risk: Delphi and G2PT

Recent methods push beyond additive scores by using deep sequence and genotype models:

  • Delphi: A deep-learning method for polygenic risk prediction that jointly models variant-level features and higher-order patterns across the genome (Georgantas, Kutalik, and Richiardi 2024).
    • Can incorporate variant annotations and linkage structure more flexibly than linear PGS.
    • Supports multi-phenotype prediction, effectively performing task-conditioned PGS.
  • G2PT (Genotype-to-Phenotype Transformer): A transformer-based architecture that treats an individual’s genotype as a sequence of variant “tokens” and learns polygenic risk representations with attention-based context (Lee et al. 2025).
    • Naturally captures epistatic interactions via attention, not just additive effects.
    • Emphasizes interpretability by tying attention patterns back to loci and pathways.

Both systems can optionally use GFM-derived variant features (e.g., scores from sequence-level LMs such as Nucleotide Transformer, HyenaDNA, or GPN) (Dalla-Torre et al. 2023; Nguyen et al. 2023; Benegas, Batra, and Song 2023). In this view:

  1. A GFM maps variant and local sequence context to variant effect features (e.g., predicted impact on chromatin, expression, motifs).
  2. A polygenic risk model (Delphi, G2PT, or related) aggregates these features across the genome to produce a patient-level risk embedding.
  3. A clinical head uses this embedding plus EHR covariates to output risk for specific outcomes or time horizons.
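The three stages above can be sketched as a toy data flow. Everything here is a hypothetical stand-in: the lookup table replaces a real GFM, the dosage-weighted sum replaces a learned aggregator such as Delphi or G2PT, and the logistic head uses hand-picked weights. Only the interfaces between the stages are the point.

```python
import math

# Hypothetical stand-in for a GFM: fixed two-dimensional toy features per variant.
TOY_VARIANT_FEATURES = {
    "var_a": [0.9, 0.1],
    "var_b": [0.2, 0.7],
    "var_c": [0.0, 0.3],
}

def variant_features(variant_id):
    """Stage 1: look up GFM-style variant effect features (toy values)."""
    return TOY_VARIANT_FEATURES[variant_id]

def genomic_embedding(dosages):
    """Stage 2: dosage-weighted sum of variant features (simple stand-in aggregator)."""
    dim = len(next(iter(TOY_VARIANT_FEATURES.values())))
    embedding = [0.0] * dim
    for variant_id, dosage in dosages.items():
        for k, value in enumerate(variant_features(variant_id)):
            embedding[k] += dosage * value
    return embedding

def clinical_head(embedding, covariates, weights, bias):
    """Stage 3: logistic head over the concatenated embedding and covariates."""
    inputs = list(embedding) + list(covariates)
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-z))

patient_dosages = {"var_a": 2, "var_b": 1, "var_c": 0}
embedding = genomic_embedding(patient_dosages)
# Covariates: a standardized age and a binary smoking flag (illustrative).
risk = clinical_head(embedding, [0.5, 1.0], weights=[0.4, 0.2, 0.3, 0.6], bias=-3.0)
```

Keeping each stage behind its own function mirrors the modularity argued for in the text: a better variant-level model or aggregator can replace a stage without changing the clinical head's interface.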

18.3.3 Fine-mapping and Causal Variants: MIFM

Polygenic risk ultimately hinges on causal variants, not just associated markers. MIFM (Multiple Instance Fine-Mapping) exemplifies how deep sequence models can refine the link between variant effects and risk:

  • MIFM uses a deep sequence model in a multiple-instance learning framework to identify causal regulatory variants within associated loci (Rakowski and Lippert 2025).
  • By modeling sets (bags) of variants per locus, it distinguishes causal variants from passengers in tight LD blocks.
  • The outputs—posterior probabilities or importance scores for candidate variants—can inform both mechanistic studies and more parsimonious, interpretable PGS.
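For intuition about the bag-level view, the sketch below shows noisy-OR pooling, one generic way to turn per-variant (instance) probabilities into a per-locus (bag) probability in multiple-instance learning. This is not taken from MIFM; its actual pooling and training objective may differ, and the function name is illustrative.

```python
def bag_probability(instance_probs):
    """Noisy-OR pooling: the locus is positive if at least one variant is causal.

    instance_probs : per-variant probabilities of being causal, each in [0, 1]
    """
    prob_none_causal = 1.0
    for p in instance_probs:
        prob_none_causal *= 1.0 - p
    return 1.0 - prob_none_causal

# Three variants in tight LD; the model is confident about only one of them.
locus_probability = bag_probability([0.05, 0.80, 0.10])
```

Because only the bag label (an associated locus) is supervised, the per-variant probabilities are learned indirectly, which is what lets such a model separate likely causal variants from correlated neighbors.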

Together, Delphi, G2PT, and MIFM illustrate a pattern that recurs throughout this chapter:

Use GFMs and deep sequence models to transform raw genotype into rich, structured features, then plug those features into prediction and decision-support architectures that live closer to the clinic.


18.4 Beyond Genotype: Fusing GFMs with EHR and Multi-omics

Clinical risk prediction rarely depends on genetics alone. Real-world deployment typically requires fusing genomic features with EHR, imaging, and other omics, mirroring the multi-omics integration strategies from Chapter 14.

18.4.1 Feature Sources

  1. Genomics and regulatory features
  2. Multi-omics and systems context
  3. Clinical covariates and EHR
    • Demographics, vitals, lab results, medication history.
    • Problem lists, procedures, imaging-derived features.
    • Time-varying trajectories of biomarkers (e.g., eGFR, HbA1c, tumor markers).

18.4.2 Fusion Patterns

Architecturally, risk models usually adopt one of the fusion strategies introduced in Chapter 14:

  • Early fusion
    • Concatenate GFM-derived genomic embeddings with static clinical covariates and feed into a single MLP or survival model.
    • Simple but sensitive to scaling, missingness, and modality imbalance.
  • Intermediate fusion
    • Separate encoders for genomics, EHR, and multi-omics produce modality-specific embeddings.
    • A fusion layer (attention, cross-modal transformer, or graph-based integration) combines them into a patient embedding, which downstream heads use for risk prediction.
  • Late fusion / ensembling
    • Independent models per modality (e.g., a PGS-only model, an EHR-only model).
    • Meta-model or decision rule combines predictions (e.g., “treat if either PGS or EHR risk is high”).

From a practical standpoint, intermediate fusion is often most attractive: it allows modularity (swap encoders as GFMs improve) while enabling cross-modal interactions.
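The simplest two patterns can be written in a few lines. This sketch contrasts early fusion (one concatenated feature vector) with two late-fusion rules (an "either is high" decision and a weighted average of per-modality risks). The threshold of 0.2 and all function names are illustrative assumptions; intermediate fusion needs learned encoders and is not shown.

```python
def early_fusion(genomic_embedding, clinical_covariates):
    """Early fusion: concatenate modalities into one input vector for a single model."""
    return list(genomic_embedding) + list(clinical_covariates)

def late_fusion_or_rule(pgs_risk, ehr_risk, threshold=0.2):
    """Late fusion as a decision rule: flag if either modality-specific risk is high."""
    return pgs_risk >= threshold or ehr_risk >= threshold

def late_fusion_weighted(risks, weights):
    """Late fusion as a weighted average of per-modality risk estimates."""
    if len(risks) != len(weights) or sum(weights) <= 0:
        raise ValueError("need one positive-sum weight per risk estimate")
    return sum(r * w for r, w in zip(risks, weights)) / sum(weights)

fused_input = early_fusion([0.1, 0.2], [63, 1])
flagged = late_fusion_or_rule(pgs_risk=0.05, ehr_risk=0.30)
blended = late_fusion_weighted([0.1, 0.3], [1.0, 1.0])
```

The concatenated vector mixes an embedding on a unit scale with an age in years, which illustrates why early fusion is sensitive to feature scaling.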


18.5 Evaluation: Discrimination, Calibration, and Clinical Utility

High performance on test sets is not enough for clinical deployment. Risk models must be discriminative, well-calibrated, robust, and clinically useful.

18.5.1 Discrimination

Discrimination measures how well the model ranks individuals by risk:

  • AUROC (AUC) for binary endpoints.
  • AUPRC when outcomes are rare (e.g., severe ADRs).
  • C-index and time-dependent AUC for survival tasks.

Strong discrimination is necessary but not sufficient; poorly calibrated models can still achieve high AUROC.
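Both AUROC and the C-index are pairwise ranking statistics, which the following sketch makes explicit. It uses quadratic-time loops for clarity and assumes the illustrative names shown; production code would normally rely on an established library.

```python
def auroc(scores, labels):
    """Probability that a random case outranks a random control (ties count 1/2)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            wins += 1.0 if c > k else 0.5 if c == k else 0.0
    return wins / (len(cases) * len(controls))

def concordance_index(times, events, scores):
    """Harrell's C: among comparable pairs, how often the higher score fails first."""
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if events[i] != 1:
            continue  # a pair is comparable only if the earlier time is an observed event
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

Neither statistic changes under a monotone transformation of the scores, which is why a model can rank patients well and still report badly miscalibrated probabilities.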

For a cross-cutting discussion of how these metrics are used across molecular, variant, and trait-level tasks, see Chapter 15.

18.5.2 Calibration and Risk Stratification

Calibration asks whether predicted probabilities match observed frequencies:

  • If a group of patients is assigned ~20% risk of an event, do ~20% actually experience it?
  • Calibration is assessed with calibration plots, Hosmer–Lemeshow tests, and Brier scores, often stratified by subgroups (e.g., ancestry, sex, age).

For PGS-informed models, calibration is especially important because:

  • Raw PGS are often centered and scaled rather than calibrated; mapping PGS to absolute risk usually requires post-hoc models that incorporate baseline incidence and covariates.
  • GFMs can shift score distributions as architectures evolve; recalibration may be required when swapping or updating encoders.
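A minimal sketch of two of the calibration checks mentioned above: the Brier score and the binned table behind a calibration plot. The equal-width binning and the function names are illustrative choices; quantile bins are a common alternative.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def calibration_table(probs, labels, n_bins=5):
    """Return (mean predicted risk, observed event rate, count) per non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((p, y))
    table = []
    for members in bins:
        if not members:
            continue
        mean_predicted = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        table.append((mean_predicted, observed_rate, len(members)))
    return table
```

Running the same table separately for each ancestry, sex, or age stratum gives the subgroup view described above; a well-calibrated model has mean predicted risk close to the observed rate in every row.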

18.5.3 Uncertainty Estimation and “When Not to Predict”

In high-stakes settings, models should know when they do not know. Common strategies include:

  • Ensemble variance or Monte Carlo dropout as uncertainty proxies.
  • Conformal prediction to output risk intervals or prediction sets with finite-sample marginal coverage guarantees, provided calibration and deployment data are exchangeable.
  • Selective prediction / abstention: allow models to abstain on cases where uncertainty is high or inputs are out-of-distribution (e.g., rare ancestries missing from training, novel tumor subtypes).
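Conformal prediction and abstention combine naturally. The sketch below uses split conformal prediction for a binary outcome: a held-out calibration set fixes a threshold on the nonconformity score (one minus the probability given to the true label), and a new patient receives the set of labels that fall under it. Treating any set that is not a single label as an abstention is an illustrative policy, as are the names and the choice of `alpha`.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal quantile of nonconformity scores s = 1 - P(true label)."""
    scores = sorted(
        1.0 - (p if y == 1 else 1.0 - p) for p, y in zip(cal_probs, cal_labels)
    )
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))  # finite-sample corrected rank
    if rank > n:
        return 1.0  # too few calibration points: every label is always included
    return scores[rank - 1]

def prediction_set(p, threshold):
    """Labels whose nonconformity score is within the threshold, for P(Y=1 | x) = p."""
    labels = []
    if p <= threshold:        # score for label 0 is 1 - P(Y=0 | x) = p
        labels.append(0)
    if 1.0 - p <= threshold:  # score for label 1 is 1 - P(Y=1 | x)
        labels.append(1)
    return labels

def should_abstain(p, threshold):
    """Abstain when the set is empty or contains both labels."""
    return len(prediction_set(p, threshold)) != 1

cal_probs = [0.9, 0.8, 0.2, 0.1, 0.7, 0.3, 0.95, 0.05, 0.6]
cal_labels = [1, 1, 0, 0, 1, 0, 1, 0, 1]
threshold = conformal_threshold(cal_probs, cal_labels, alpha=0.2)
```

The coverage guarantee is marginal and rests on exchangeability between calibration and deployment patients, so it can fail for exactly the out-of-distribution cases listed above unless calibration is repeated per population.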

For GFM-based systems, uncertainty can be decomposed:

  • Genomic uncertainty: confidence in variant effect predictions or fine-mapping (e.g., MIFM probabilities).
  • Clinical uncertainty: extrapolation to new care settings, practice patterns, or patient populations.

Communicating uncertainty transparently is a core part of decision support.

18.5.4 Fairness, Bias, and Health Equity

Many genomic and EHR datasets reflect historical and structural inequities. Risk models can amplify these biases if not carefully evaluated.

Key considerations:

  • Ancestry and PGS portability: Classical PGS underperform in under-represented ancestries due to GWAS design; GFM-based methods such as Delphi and G2PT have the opportunity—but not the guarantee—to improve this by leveraging functional priors and cross-ancestry information (Georgantas, Kutalik, and Richiardi 2024; Lee et al. 2025).
  • Measurement and access bias: EHR-derived features may differ systematically across groups (e.g., who gets genotyped, which labs are ordered).
  • Group-wise calibration: Evaluate calibration and discrimination separately by ancestry, sex, socio-economic proxies, and care site.
  • Fairness metrics and constraints: When necessary, enforce group-level constraints (e.g., equalized odds) or design affirmative models targeting historically disadvantaged groups.
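Group-wise evaluation can start with per-group error rates. This sketch computes true and false positive rates by group and summarizes their largest difference as an equalized-odds gap; the function names and the choice of the maximum difference as the summary are illustrative assumptions.

```python
def rates_by_group(y_true, y_pred, groups):
    """Return {group: (true positive rate, false positive rate)} for binary decisions."""
    rates = {}
    for g in sorted(set(groups)):
        tp = fp = tn = fn = 0
        for y, y_hat, member in zip(y_true, y_pred, groups):
            if member != g:
                continue
            if y == 1 and y_hat == 1:
                tp += 1
            elif y == 1:
                fn += 1
            elif y_hat == 1:
                fp += 1
            else:
                tn += 1
        if tp + fn == 0 or fp + tn == 0:
            raise ValueError(f"group {g!r} lacks cases or controls")
        rates[g] = (tp / (tp + fn), fp / (fp + tn))
    return rates

def equalized_odds_gap(rates):
    """Largest between-group difference in TPR or FPR (0 means equalized odds holds)."""
    tprs = [r[0] for r in rates.values()]
    fprs = [r[1] for r in rates.values()]
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))

y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
group_rates = rates_by_group(y_true, y_pred, groups)
gap = equalized_odds_gap(group_rates)
```

In this toy example the decisions are perfect for group A and much worse for group B, a disparity that a pooled accuracy figure would mask.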

Equity is not an afterthought; for GFMs, it should inform what data to pretrain on, which benchmarks to report, and how to deploy models in practice.


18.6 Prospective Validation, Trials, and Regulation

Retrospective AUCs are not enough to justify clinical use. Clinical risk models typically require:

  • Prospective validation: Evaluate model performance in a temporally held-out cohort, ideally in multiple health systems with different population structures and practice patterns.
  • Impact studies: Measure whether using the model actually changes clinician behavior and improves outcomes (e.g., better statin targeting, fewer ADRs, reduced unnecessary imaging).
  • Randomized or pragmatic trials when models materially influence treatment decisions, to guard against hidden confounding in observational evaluations.

Regulatory landscapes (e.g., device approvals, software-as-a-medical-device frameworks) increasingly recognize learning systems and continuous updates. GFMs complicate this further:

  • A “fixed” risk model may rely on a GFM backbone that improves over time; updates may change risk rankings and calibration.
  • Regulatory strategies include locked models with explicit versions, change control plans, or adaptive approvals for constrained forms of continual learning.

Regardless of the framework, clear documentation of data provenance, GFM versions, training procedures, and validation results is essential.


18.7 Monitoring, Drift, and Continual Learning

Once deployed, GFMs and downstream risk models operate in non-stationary environments:

  • Clinical practice patterns change (new treatments, guidelines).
  • Patient populations drift (e.g., new screening programs).
  • Lab assays and sequencing pipelines evolve.

Monitoring should track:

  • Input distributions (e.g., genotype frequencies, EHR feature patterns).
  • Output distributions (risk score histograms, fraction of patients above decision thresholds).
  • Performance over time (calibration, discrimination), often via rolling windows or periodic audits.
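One common heuristic for tracking input or output distributions is the population stability index (PSI), which compares binned fractions between a reference window and a current window. PSI is not prescribed by this chapter; it is used here as one illustrative drift statistic, and the bin edges and names are assumptions.

```python
import math

def bin_fractions(values, edges, eps=1e-6):
    """Fraction of values per bin; eps floors empty bins so the log stays finite.

    Values outside [edges[0], edges[-1]] are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for b in range(len(counts)):
            is_last = b == len(counts) - 1
            if edges[b] <= v < edges[b + 1] or (is_last and v == edges[-1]):
                counts[b] += 1
                break
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the bin edges")
    return [max(c / total, eps) for c in counts]

def population_stability_index(reference, current, edges):
    """PSI = sum over bins of (current - reference) * ln(current / reference)."""
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))

edges = [0.0, 0.25, 0.5, 0.75, 1.0]
reference_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.8]
current_scores = [0.6, 0.7, 0.8, 0.9, 0.95, 0.3]
drift = population_stability_index(reference_scores, current_scores, edges)
```

A value of zero means the binned distributions match. Any alerting threshold should be chosen and validated locally; a shift in risk scores signals that something changed but not whether performance degraded.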

When drift is detected:

  • Recalibration may suffice (e.g., refitting a calibration layer to current data).
  • Partial retraining of heads or fusion layers can adapt to new environments while keeping GFM weights fixed.
  • Full continual learning—including updating GFM backbones—requires careful safeguards to avoid catastrophic forgetting and maintain regulatory compliance.
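The lightest-weight response, recalibration, can be as simple as shifting every prediction by a constant on the logit scale so that the mean predicted risk matches the current event rate. The sketch below finds that shift by bisection. It is one simple option (sometimes called recalibration-in-the-large) and an illustrative assumption here; it requires probabilities strictly between 0 and 1 and an observed event rate strictly between 0 and 1.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_intercept_shift(probs, labels, lo=-10.0, hi=10.0, iterations=60):
    """Find delta so that the mean of sigmoid(logit(p) + delta) equals the event rate."""
    target = sum(labels) / len(labels)
    logits = [logit(p) for p in probs]
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        mean_predicted = sum(sigmoid(z + mid) for z in logits) / len(logits)
        if mean_predicted < target:
            lo = mid  # predictions still too low: shift further up
        else:
            hi = mid
    return (lo + hi) / 2.0

def recalibrate(probs, delta):
    """Apply the fitted shift; the ranking of patients is unchanged."""
    return [sigmoid(logit(p) + delta) for p in probs]

recent_probs = [0.1, 0.2, 0.3, 0.4]
recent_labels = [1, 1, 0, 1]
delta = fit_intercept_shift(recent_probs, recent_labels)
updated = recalibrate(recent_probs, delta)
```

Because the shift is monotone, discrimination is untouched and the foundation model and prediction head stay frozen, which makes this kind of update easier to justify under a locked-model regime than retraining.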

Design patterns from Chapter 14’s systems models (e.g., modular encoders, robust interfaces between GFMs and clinical layers) are crucial for maintainable, updatable decision support.


18.8 Case Studies

To make these ideas concrete, we outline three stylized case studies that build on models and concepts from earlier chapters.

18.8.1 Cardiometabolic Risk Stratification

Goal: Identify individuals at high risk of major adverse cardiovascular events (MACE)—e.g., myocardial infarction, stroke, cardiovascular death—over a 10-year horizon.

Inputs:

Model design:

  1. Use a DNA GFM to compute variant-level annotations (e.g., predicted enhancer disruption in cardiomyocyte or hepatocyte contexts).
  2. Feed annotations and genotypes into Delphi or G2PT to obtain a patient-level genomics embedding tuned for cardiometabolic outcomes.
  3. Fuse the genomics embedding with EHR covariates via an intermediate fusion network (e.g., MLP or transformer over structured features).
  4. Train the model to predict 10-year MACE risk using survival or discrete-time hazard losses.
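Step 4's discrete-time option can be sketched as follows: the model emits one hazard per yearly interval, the survival curve is the running product of one minus each hazard, and the 10-year risk is one minus survival at year 10. The yearly interval width, the constant toy hazards, and the function names are illustrative assumptions.

```python
import math

def survival_curve(hazards):
    """S(t_k) = product over m <= k of (1 - h_m) for per-interval hazards h_m."""
    curve, surviving = [], 1.0
    for h in hazards:
        surviving *= 1.0 - h
        curve.append(surviving)
    return curve

def discrete_time_nll(hazards, last_interval, event_observed, eps=1e-12):
    """Negative log-likelihood for one patient.

    last_interval  : 0-based index of the interval with the event, or of the
                     last fully observed interval if the patient was censored
    event_observed : True if the event occurred in last_interval
    """
    nll = 0.0
    for k in range(last_interval + 1):
        h = min(max(hazards[k], eps), 1.0 - eps)
        if event_observed and k == last_interval:
            nll -= math.log(h)        # the event happened in this interval
        else:
            nll -= math.log(1.0 - h)  # the patient survived this interval
    return nll

yearly_hazards = [0.01] * 10  # toy model output: 1% hazard in each of 10 years
ten_year_risk = 1.0 - survival_curve(yearly_hazards)[-1]
```

A censored patient contributes survival terms only up to their last observed interval, so incomplete follow-up is used without assuming whether an event would have happened later.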

Clinical use:

  • Stratify patients into risk categories (e.g., low, intermediate, high) that inform statin initiation, PCSK9 inhibitor consideration, or intensive lifestyle intervention.
  • Provide individual-level explanations: highlight variants and pathways (via G2PT attention or Delphi variant contributions) that most contributed to risk—bridging Chapters 9 and 15.
  • Evaluate equity: ensure performance and calibration hold across ancestries and care sites.

18.8.2 Oncology: Risk and Recurrence Prediction

Goal: Predict recurrence risk and treatment benefit for patients with solid tumors after surgery or first-line therapy.

Inputs:

  • Somatic landscapes from whole-exome or whole-genome tumor sequencing.
  • Tumor representations from deep set or transformer architectures such as SetQuence/SetOmic (Jurenaite et al. 2024).
  • Multi-omics: tumor expression, methylation, and chromatin accessibility from integrated frameworks (GLUE, CpGPT) (Cao and Gao 2022; Camillo et al. 2024).
  • GNN-based subtyping: embeddings or cluster assignments from cancer subtyping models like MoGCN and CGMega (X. Li et al. 2022; H. Li et al. 2024).
  • Clinical features: stage, grade, performance status, treatment regimen.

Model design:

  1. Encode somatic mutation sets with SetQuence/SetOmic to obtain tumor-variant embeddings (Jurenaite et al. 2024).
  2. Integrate transcriptomic and epigenomic profiles via GLUE-like latent spaces and CpGPT methylation embeddings (Cao and Gao 2022; Camillo et al. 2024).
  3. Combine these with GNN-based subtype embeddings (MoGCN/CGMega) to capture tumor–microenvironment and histopathological context (X. Li et al. 2022; H. Li et al. 2024).
  4. Fuse tumor-level representations with clinical features in a time-to-recurrence model (e.g., flexible deep survival network).

Clinical use:

  • Provide risk estimates that guide adjuvant therapy decisions (e.g., intensifying chemotherapy or adding targeted agents for high-risk patients).
  • Suggest candidate biomarkers or pathways for trial stratification, based on GFM-derived importance scores and attention maps.
  • Monitor drift as treatment standards evolve; update models to reflect new targeted therapies and immune checkpoint inhibitors.

18.8.3 Pharmacogenomics and Adverse Drug Reaction Risk

Goal: Predict which patients are at high risk of severe ADRs (e.g., myopathy on statins, severe cutaneous reactions to certain drugs, cardiotoxicity of oncology agents).

Inputs:

  • Germline variation in pharmacogenes (e.g., CYP family, HLA alleles) and broader genome.
  • Variant effect scores from both DNA and protein LMs for coding and regulatory variants in drug metabolism and immune genes (see Chapters 2–3, 9–10).
  • Clinical context: co-medications, comorbidities, organ function (liver, kidney), prior adverse reactions.

Model design:

  1. Use GFMs to derive mechanistically meaningful features for variants in pharmacogenes (e.g., predicted impact on protein stability, binding, or gene regulation).
  2. Aggregate these features across loci into a pharmacogenomic risk embedding, possibly using a G2PT-style transformer restricted to relevant genes (Lee et al. 2025).
  3. Combine this with EHR data in a multi-task classification model that predicts ADR risk for multiple drugs or drug classes.

Clinical use:

  • Flag patients at high risk before initiating therapy, prompting genotype-guided drug choice or dose adjustment.
  • Generate reports that tie risk back to specific variants and pharmacogenes, aligned with existing clinical pharmacogenomics guidelines.
  • Evaluate performance across ancestries to avoid exacerbating disparities in access to safe and effective therapy.

18.9 Practical Design Patterns and Outlook

Across these examples, several design patterns for GFM-enabled clinical prediction recur:

  • Treat GFMs as modular feature extractors:
    • Keep a clear separation between foundation encoders and clinical prediction heads, easing updates and regulatory management.
  • Embrace multi-modal fusion:
  • Prioritize calibration, uncertainty, and fairness as first-class citizens, not post-hoc add-ons.
  • Bridge interpretability and mechanism:
    • Use tools from Chapter 14 to connect individual risk predictions to variants, regions, and pathways, enabling mechanistic hypotheses and clinician trust.
  • Design for continual learning and monitoring:
    • Assume that clinical practice and data distributions will change; build pipelines that can adapt responsibly.

In the broader story of this book, clinical risk prediction and decision support represent a key translation layer: they connect the representational gains of genomic foundation models to the realities of patient care. The next chapters will extend these ideas to other application domains (e.g., rare disease diagnosis, discovery of pathogenic variants, and drug/biotech innovation), further exploring how GFMs reshape translational genomics.