22 Graph and Network Models
Sequence models encode what molecules can do. Network models encode what they do together.
Prerequisites: This chapter builds on foundation model concepts from Part 3 (especially protein language models in Chapter 16 and DNA models in Chapter 15) and attention mechanisms from Chapter 7. Familiarity with single-cell representations (Chapter 20) is helpful but not required.
Learning Objectives: After completing this chapter, you should be able to:
- Explain why graph neural networks complement rather than replace sequence-based foundation models
- Describe the message passing framework and how it propagates information across network neighborhoods
- Compare the design tradeoffs among GCN, GraphSAGE, GAT, and graph transformer architectures
- Design an integration strategy combining foundation model embeddings with GNN layers for a given biological task
- Evaluate the limitations of network-based predictions, including ascertainment bias and causality concerns
Estimated Reading Time: 45-55 minutes
Graph topology refers to the structural arrangement and connectivity patterns in a network. Key concepts include:
- Nodes (vertices): The entities in a graph (e.g., proteins, genes, cells)
- Edges (links): Connections between nodes representing relationships (e.g., physical binding, regulation)
- Degree: The number of edges connected to a node; high-degree nodes are called hubs
- Path: A sequence of edges connecting two nodes; path length counts the edges traversed
- Neighborhood (k-hop): All nodes reachable within k edges from a given node
- Connected component: A subgraph where any node can reach any other through some path
- Directed vs. undirected: Whether edges have direction (A → B differs from B → A) or not
These structural properties determine how information flows through networks and what patterns GNNs can learn. Unlike sequences (linear) or images (grid), graphs have irregular topology: variable node degrees, no inherent ordering, and arbitrary connectivity.
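These definitions are easy to ground in code. The sketch below builds a toy undirected graph with networkx (node names are hypothetical proteins) and reads off degree, path length, k-hop neighborhoods, and connected components:

```python
import networkx as nx

# Toy undirected graph; node names are hypothetical proteins
G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("E", "F")])

print(G.degree("C"))                         # degree of C: 3 (a local hub)
print(nx.shortest_path_length(G, "A", "D"))  # path length A -> D: 2 edges
print(nx.single_source_shortest_path_length(G, "A", cutoff=2))  # 2-hop neighborhood of A
print(list(nx.connected_components(G)))      # two components: {A,B,C,D} and {E,F}
```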
Graph neural networks are not alternatives to foundation models; they are consumers of them. Foundation models produce rich representations of individual biological entities: protein language models encode evolutionary constraint and structural propensity (Chapter 16), DNA models capture regulatory grammar (Chapter 15), RNA models represent transcript-level features (Chapter 19). These representations are powerful but operate on isolated sequences. A protein embedding captures what ESM learned about that protein’s sequence; it says nothing about which other proteins it binds, which pathways it participates in, or which disease phenotypes result from its disruption. Graph neural networks operate at a higher level of abstraction, taking foundation model representations as node features and learning to propagate information across relational structure. The combination yields capabilities that neither approach achieves alone.
This architectural relationship reflects a biological reality: organisms are not collections of independent molecules but systems of interacting components. A transcription factor affects its target genes through regulatory edges. Proteins assemble into functional complexes through physical binding. Signaling cascades propagate perturbations across cellular networks. These relationships exist at a level of abstraction above sequence, requiring a different mathematical framework to represent. Graphs provide precisely this framework. In a protein-protein interaction network, proteins become nodes and physical binding creates edges. In a gene regulatory network, directed edges connect transcription factors to their targets. In spatial transcriptomics data, cells become nodes with edges capturing physical proximity. Each graph encodes relational structure that sequence models cannot directly capture.
Think of the relationship between foundation models and GNNs as a division of labor: foundation models answer “What can this protein do?” based on its sequence, while GNNs answer “What does this protein do in context?” based on its network neighborhood. Neither question subsumes the other. A transcription factor’s DNA-binding properties (captured by sequence models) matter, but so does which genes it actually regulates in a given cell type (captured by network structure).
Graph machine learning encompasses several distinct prediction paradigms:
- Node-level prediction: Classify or regress properties of individual nodes (e.g., predicting whether a gene is disease-associated)
- Edge-level prediction (link prediction): Predict whether edges exist or their properties (e.g., drug-target binding affinity)
- Graph-level prediction: Classify or regress properties of entire graphs (e.g., predicting molecular toxicity from a chemical structure graph)
Underlying these tasks are two learning paradigms:
- Representation learning: Learn low-dimensional embeddings that capture node/graph structure for downstream tasks
- Message passing: Propagate and aggregate information across the graph to compute node representations
Most biological applications involve node prediction (gene function, disease association) or link prediction (protein interactions, regulatory edges). Graph-level prediction is more common in molecular property prediction where each molecule is a separate graph.
For readers seeking a deeper mathematical foundation, the concepts below underpin both classical network analysis and modern GNN architectures.
Matrix Representations. For a graph \(G\) with \(n\) nodes:
The adjacency matrix \(A\) has \(A_{ij} = 1\) if edge \((v_i, v_j)\) exists and 0 otherwise. Key property: \((A^k)_{ij}\) counts walks of length \(k\) from node \(i\) to node \(j\).
The degree matrix \(D\) is diagonal with \(D_{ii} = \sum_j A_{ij}\) (node degree).
The Laplacian \(L = D - A\) has special spectral properties: eigenvalues \(\geq 0\), smallest is 0 with eigenvector \(\mathbf{1}\), and the number of zero eigenvalues equals the number of connected components. The spectral gap (second-smallest eigenvalue) measures connectivity. These matrices appear directly in GNN message-passing formulas.
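A short numpy sketch makes these objects concrete; the graph is a 4-node path, chosen only for illustration:

```python
import numpy as np

# Adjacency matrix for a 4-node path graph: 0-1-2-3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))   # degree matrix
L = D - A                    # graph Laplacian

# (A^k)_{ij} counts walks of length k: here, walks of length 2 from node 0 to node 2
print(np.linalg.matrix_power(A, 2)[0, 2])   # 1 (the walk 0-1-2)

eigvals = np.linalg.eigvalsh(L)             # L is symmetric positive semidefinite
print(np.round(eigvals, 4))                 # smallest eigenvalue is 0
print(np.sum(np.isclose(eigvals, 0)))       # zero eigenvalues = # connected components (1)
```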
Centrality Measures. Different notions of node importance:
- Degree centrality: \(C_d(v) = \deg(v)/(n-1)\); high-degree “hub” nodes tend to be essential genes
- Betweenness centrality: Fraction of shortest paths passing through a node; identifies pathway “bottlenecks”
- Eigenvector centrality: A node is central if connected to other central nodes; PageRank is a widely used damped variant
Small-World Property. Biological networks typically exhibit small diameter (~4-6 for PPI) and high clustering (friends of friends are friends). This means information propagates in few hops, motivating shallow GNNs with 2-3 layers.
Community Structure. Densely connected node groups with sparse between-group connections. The modularity score \(Q\) measures partition quality. Communities often correspond to protein complexes, functional modules, or pathway components.
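For readers who want to experiment, networkx implements all of these measures directly. The sketch below uses the built-in karate club graph as a small stand-in for a biological network:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

G = nx.karate_club_graph()  # classic small test graph standing in for a biological network

deg_c = nx.degree_centrality(G)        # deg(v)/(n-1)
btw_c = nx.betweenness_centrality(G)   # fraction of shortest paths through each node
eig_c = nx.eigenvector_centrality(G)   # central if connected to other central nodes

hub = max(deg_c, key=deg_c.get)        # the highest-degree "hub"
print(hub, round(btw_c[hub], 3), round(eig_c[hub], 3))

comms = greedy_modularity_communities(G)            # densely connected groups
print(len(comms), round(modularity(G, comms), 3))   # modularity Q of the partition
```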
The practical implications are substantial. Disease gene prioritization leverages the observation that genes causing similar diseases cluster in network neighborhoods. A GNN can learn to propagate disease signals across protein interaction networks, but effectiveness depends critically on node feature quality. When those features come from protein language models encoding evolutionary constraint and structural propensity, the GNN gains access to sequence-level biological knowledge unavailable from simpler features like expression levels alone. Drug-target interaction prediction similarly benefits: ESM embeddings capture what makes a protein druggable, while network context reveals which targets sit in therapeutically relevant pathways.
22.1 Biological Networks and Data Resources
Graph neural networks can only learn from relationships encoded in their input graphs. The choice of network, its source, and its inherent biases determine what a model can discover and what it will miss. Understanding the landscape of available biological networks, their construction methods, and their systematic limitations is therefore prerequisite to effective graph-based modeling.
22.1.1 Landscape of Biological Graphs
Before examining graph neural network architectures, it is essential to understand what biological networks exist and where they come from. The choice of network fundamentally determines what topology a model can exploit, influencing what patterns it can learn. The biases inherent in network construction propagate through all downstream analyses.
Current protein-protein interaction databases are estimated to capture only 20-30% of true human interactions. Before reading further, predict: How might this incompleteness affect which types of proteins have more documented interactions? What factors might determine whether a protein’s interactions get discovered and catalogued?
Physical associations between proteins constitute perhaps the most widely used network type for GNN applications. Major databases include STRING, which integrates experimental data with computational predictions and text mining to assign confidence scores to interactions; BioGRID, which focuses on curated experimental interactions; and IntAct, which provides detailed interaction metadata from direct molecular experiments. These protein-protein interaction networks are incomplete (current estimates suggest only 20-30% of human PPIs are catalogued) and biased toward well-studied proteins in well-characterized pathways (Szklarczyk et al. 2023; Oughtred et al. 2020; Orchard et al. 2014; Venkatesan et al. 2008; Hart, Ramani, and Marcotte 2006). A gene involved in cancer or a common disease may have hundreds of documented interactions, while an uncharacterized protein in a specialized tissue may have none, not because it lacks interactions but because no one has looked.
Transcriptional control relationships require a different network structure. Unlike PPIs, gene regulatory networks are inherently directed: a transcription factor activates or represses its targets, not vice versa. Sources include chromatin immunoprecipitation sequencing (ChIP-seq) experiments that identify transcription factor binding sites, chromatin accessibility data (assay for transposase-accessible chromatin sequencing (ATAC-seq), DNase-seq) that reveals active regulatory regions, and chromosome conformation capture (Hi-C) that maps enhancer-promoter contacts (Chapter 21). Databases like ENCODE and the Roadmap Epigenomics Project provide regulatory annotations across cell types, though coverage varies dramatically by tissue (Chapter 2). Computational methods infer regulatory edges from expression correlations or sequence motifs, but such predictions contain substantial false positives and miss context-specific interactions.
Organized biochemical knowledge takes yet another form. KEGG (Kyoto Encyclopedia of Genes and Genomes) provides comprehensive pathway maps covering metabolism, genetic information processing, environmental information processing, cellular processes, and disease-specific pathways across thousands of organisms. Reactome offers deeply curated human pathway data with explicit reaction-level detail. WikiPathways provides community-curated pathways with particular strength in specialized and emerging biology. Together, these databases curate reactions, enzymatic steps, and signaling cascades into hierarchical pathway and metabolic networks where nodes can represent genes, proteins, metabolites, or abstract pathway concepts. These networks encode decades of molecular biology knowledge but reflect historical research priorities: metabolism and signal transduction are well-characterized, while more recently discovered processes like autophagy or RNA modification have sparser coverage.
Biological networks come in several structural varieties:
- Simple graphs: One edge type, one node type (e.g., basic PPI networks)
- Directed graphs: Edges have direction (e.g., gene regulatory networks where TF → target)
- Weighted graphs: Edges carry numerical values (e.g., interaction confidence scores)
- Heterogeneous graphs: Multiple node and/or edge types (e.g., knowledge graphs with genes, diseases, drugs)
- Multi-graphs: Multiple edges between the same node pair (e.g., proteins connected by both physical binding and co-expression)
- Bipartite graphs: Two distinct node sets with edges only between sets (e.g., drug-target networks)
- Hypergraphs: Edges connect arbitrary numbers of nodes (e.g., protein complexes, pathways)
The choice of graph type encodes assumptions about biological relationships. Multi-relational and heterogeneous graphs capture richer semantics but require specialized GNN architectures.
Beyond molecular interactions, relationships among genes, diseases, drugs, phenotypes, and other biomedical entities require heterogeneous representations. Unlike protein interaction networks, which contain a single node type and edge type, knowledge graphs are inherently heterogeneous: nodes represent diverse entity classes, and edges capture semantically distinct relationship types. This heterogeneity enables richer reasoning but demands architectures capable of handling multiple node and edge embeddings.
Several large-scale biomedical knowledge graphs have become standard resources. Hetionet integrates 47,031 nodes across 11 types (genes, diseases, compounds, anatomies, and others) with 2.25 million edges spanning 24 relationship types, providing a comprehensive substrate for computational drug repurposing (Himmelstein et al. 2017). The Unified Medical Language System (UMLS) aggregates over 200 biomedical vocabularies into a metathesaurus linking millions of concepts through hierarchical and associative relationships. PrimeKG consolidates 17 biological databases into a precision medicine knowledge graph with over 4 million relationships connecting diseases, drugs, genes, pathways, and phenotypes, explicitly designed for machine learning applications (Chandak, Huang, and Zitnik 2023).
Disease-gene association databases provide critical edges for clinical applications. DisGeNET curates over one million gene-disease associations from expert-reviewed sources, GWAS catalogs (Chapter 3), and text mining, assigning evidence scores that enable confidence-based filtering (Piñero et al. 2020). OMIM (Online Mendelian Inheritance in Man) provides authoritative curation of Mendelian disease genes, while OrphaNet focuses on rare diseases with detailed phenotypic annotations (Chapter 29). The Clinical Genome Resource (ClinGen) adds expert-reviewed gene-disease validity assessments using standardized evidence frameworks.
Drug-centric resources complete the translational picture. DrugBank provides comprehensive drug-target annotations with mechanism and pharmacology details. ChEMBL aggregates bioactivity data from medicinal chemistry literature, linking compounds to protein targets through binding affinity measurements. The Drug Gene Interaction Database (DGIdb) consolidates druggable gene categories and known interactions to support target prioritization (Chapter 30).
| Network Type | Example Databases | Node Types | Edge Semantics | Key Limitation |
|---|---|---|---|---|
| Protein-Protein Interaction | STRING, BioGRID, IntAct | Proteins | Physical binding, co-complex | 20-30% coverage; study bias |
| Gene Regulatory | ENCODE, Roadmap, JASPAR | TFs, genes, enhancers | Activation/repression (directed) | Cell-type specificity |
| Pathway/Metabolic | KEGG, Reactome, WikiPathways | Genes, metabolites, reactions | Enzymatic, signaling | Historical research bias |
| Knowledge Graph | Hetionet, PrimeKG, UMLS | Multiple entity types | Multiple relationship types | Integration quality varies |
| Spatial/Cell-Cell | Spatial transcriptomics data | Cells, spots | Proximity, communication | Emerging; sparse coverage |
The power of knowledge graphs lies in their support for multi-hop reasoning. A query asking whether a drug might treat a disease can traverse multiple edge types: drug inhibits protein A, protein A interacts with protein B, protein B is implicated in disease. Each hop contributes evidence, and the combination of paths provides signal that no single edge contains. Graph neural networks learn to aggregate across such paths, weighting different relationship types and path lengths according to their predictive value for specific tasks.
Spatially resolved transcriptomics and imaging data give rise to graphs capturing tissue organization invisible to bulk or even single-cell measurements (Chapter 20). In these spatial and cell-cell interaction graphs, nodes represent cells or spatial locations, while edges encode relationships derived from spatial coordinates. Common edge construction strategies include:
- k-nearest neighbors (kNN): Connect each cell to its k closest neighbors by Euclidean distance
- Delaunay triangulation: Create edges between cells that share a face in the Voronoi tessellation, producing a planar graph that respects local density
- Distance threshold: Connect cells within a fixed radius (e.g., 50 μm), reflecting biologically meaningful interaction ranges
- Ligand-receptor communication: Infer edges from co-expression of cognate ligand-receptor pairs between spatially adjacent cells
The resulting graphs capture tissue architecture: tumor-stroma boundaries, immune infiltration patterns, and spatial niches where specific cell types co-localize. Edge weights often encode distance or communication strength, enabling GNNs to learn that nearby cells influence each other more than distant ones. Such graphs enable questions about how spatial context influences cell behavior and how tissue microenvironment affects cellular phenotypes.
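As an illustration of these construction strategies, the sketch below builds kNN, distance-threshold, and Delaunay graphs from hypothetical cell centroids using scikit-learn and scipy; the neighbor count and radius are arbitrary choices, not recommendations:

```python
import numpy as np
from scipy.spatial import Delaunay
from sklearn.neighbors import kneighbors_graph, radius_neighbors_graph

rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, size=(500, 2))  # hypothetical cell centroids (um)

# kNN: fixed degree regardless of local density
knn = kneighbors_graph(coords, n_neighbors=6, mode="distance")

# Distance threshold: degree varies with density; 15 um is an illustrative radius
rad = radius_neighbors_graph(coords, radius=15.0, mode="distance")

# Delaunay: planar graph respecting local density; edges come from triangle sides
tri = Delaunay(coords)

print(knn.nnz, rad.nnz, len(tri.simplices))
```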
Consider a protein that has no documented interactions in STRING or BioGRID. Answer these questions:
- Does this mean it truly has no binding partners?
- What factors might explain why some proteins have hundreds of documented interactions while others have none?
- How might this asymmetry affect a GNN trained on such networks?
No - The absence of documented interactions likely reflects lack of study rather than biological truth. Most uncharacterized proteins do have binding partners.
Research priorities and historical focus - Proteins involved in cancer, common diseases, or conserved pathways attract more experimental attention. High-throughput screens are biased toward abundant, easily expressed proteins. Tissue-specific or condition-specific proteins may be understudied.
The GNN will preferentially propagate signals toward well-connected hub genes - This creates a rich-get-richer dynamic where well-studied genes dominate predictions, potentially missing novel biology in peripheral network regions. The model may learn to recapitulate existing knowledge rather than discover new patterns.
22.1.2 Biases and Limitations
All biological networks share systematic biases that affect downstream modeling. Well-studied genes appear as highly connected hubs not necessarily because they have more interactions but because researchers have investigated them more thoroughly. This ascertainment bias means that GNNs trained on network structure may primarily learn to propagate signals toward well-characterized genes, potentially missing novel biology in peripheral network regions.
Network incompleteness creates particular challenges for message passing algorithms. If a critical interaction is missing, information cannot flow across that gap. If a spurious interaction is present, noise propagates where it should not. These issues are especially acute for less-studied organisms, tissues, or disease contexts where network coverage is sparse.
The distinction between physical and functional associations matters for interpretation. A protein-protein interaction might represent stable complex membership, transient signaling, or indirect association through shared binding partners. Different edge types may warrant different treatment by graph models, but many databases conflate these categories or provide insufficient metadata to distinguish them.
22.2 Graph Neural Network Fundamentals
The mathematical machinery underlying graph neural networks differs fundamentally from the convolutional networks (Chapter 6) and transformer architectures (Chapter 7) examined in previous chapters. Where those models operate on regular structures (sequences, grids), GNNs must handle irregular topology with variable-degree nodes, no inherent ordering, and arbitrary connectivity. This section develops the message passing framework that addresses these challenges, then surveys the canonical architectures that have become standard tools for biological applications.
22.2.1 Message Passing Principles
The challenge of learning from graph-structured data lies in the irregular topology: unlike images (regular grids) or sequences (linear chains), graphs have variable-degree nodes, no inherent ordering, and complex connectivity patterns. Classical approaches computed hand-crafted features such as degree centrality, clustering coefficients, or shortest path statistics, then fed these to standard machine learning models. Such features capture useful properties but cannot adapt to task-specific patterns.
Message passing provides a learnable alternative. The core intuition is local information exchange: each node should update its representation based on what its neighbors know. By iterating this process across multiple layers, information propagates across the graph, allowing nodes to incorporate signals from increasingly distant parts of the network.
Think of message passing as a controlled diffusion process. Just as heat diffuses from hot regions to cold ones, information in a GNN flows from nodes to their neighbors. After one layer, each node knows about its immediate neighbors. After two layers, it knows about neighbors of neighbors. After L layers, information has spread across L-hop neighborhoods. The learned weights control how information mixes, not just that it spreads.
Formally, at layer \(\ell\), each node \(i\) maintains a hidden state \(\mathbf{h}_i^{(\ell)}\). A message passing layer computes, for each edge from neighbor \(j\) to node \(i\), a message:
\[ \mathbf{m}_{ij}^{(\ell)} = \phi_m\left(\mathbf{h}_i^{(\ell)}, \mathbf{h}_j^{(\ell)}, \mathbf{e}_{ij}\right) \]
where \(\phi_m\) is a learned function and \(\mathbf{e}_{ij}\) represents edge features. The node then aggregates messages from all neighbors and updates its state:
\[ \mathbf{h}_i^{(\ell+1)} = \phi_h\left(\mathbf{h}_i^{(\ell)}, \bigoplus_{j \in \mathcal{N}(i)} \mathbf{m}_{ij}^{(\ell)}\right) \]
where \(\mathcal{N}(i)\) denotes neighbors of node \(i\) and \(\bigoplus\) is a permutation-invariant aggregation (sum, mean, max, or attention-weighted combination). The aggregation must be permutation-invariant because neighbors have no inherent ordering.
After \(L\) layers, a node’s representation incorporates information from all nodes within \(L\) hops. For biological networks, this means a gene’s learned embedding can reflect not only its own features but signals from interaction partners, their partners, and so on, capturing pathway-level and module-level context.
Consider a small PPI network where protein A interacts with proteins B and C.
Initial embeddings (from a foundation model):
- \(\mathbf{h}_A = [0.8, 0.2]\) (kinase signature)
- \(\mathbf{h}_B = [0.3, 0.7]\) (receptor signature)
- \(\mathbf{h}_C = [0.5, 0.5]\) (adapter protein)
Step 1 - Message computation: Using a simple linear transformation \(W\):
- Message from B to A: \(\mathbf{m}_{BA} = W \cdot \mathbf{h}_B = [0.4, 0.6]\)
- Message from C to A: \(\mathbf{m}_{CA} = W \cdot \mathbf{h}_C = [0.5, 0.5]\)
Step 2 - Aggregation (mean): \[\text{aggregated} = \frac{1}{2}([0.4, 0.6] + [0.5, 0.5]) = [0.45, 0.55]\]
Step 3 - Update (residual connection + ReLU): \[\mathbf{h}_A' = \text{ReLU}(\mathbf{h}_A + \text{aggregated}) = \text{ReLU}([1.25, 0.75]) = [1.25, 0.75]\]
After this layer, protein A’s representation now incorporates information about its interaction partners. If B is a known receptor for a disease-relevant ligand, that signal has begun propagating to A.
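The arithmetic above can be verified in a few lines. The text leaves \(W\) unspecified; the sketch assumes \(W = \begin{bmatrix}0.75 & 0.25\\ 0.25 & 0.75\end{bmatrix}\), one matrix consistent with the messages shown:

```python
import numpy as np

h = {"A": np.array([0.8, 0.2]),   # kinase signature
     "B": np.array([0.3, 0.7]),   # receptor signature
     "C": np.array([0.5, 0.5])}   # adapter protein

# One W consistent with the messages in the text (the text leaves W unspecified)
W = np.array([[0.75, 0.25],
              [0.25, 0.75]])

msgs = [W @ h[j] for j in ("B", "C")]     # m_BA = [0.4, 0.6], m_CA = [0.5, 0.5]
agg = np.mean(msgs, axis=0)               # permutation-invariant mean: [0.45, 0.55]
h_A_new = np.maximum(h["A"] + agg, 0.0)   # residual update + ReLU: [1.25, 0.75]
print(msgs, agg, h_A_new)
```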
Before reading further, test your understanding of message passing:
- If a GNN has 3 message passing layers, how many hops away can information travel from any given node?
- Why must the aggregation function be permutation-invariant?
- What happens to node representations if you stack many message passing layers without any mechanism to prevent it?
3 hops - Each layer extends the receptive field by one hop, so after L layers, a node has incorporated information from all nodes within L hops.
Because graph neighbors have no inherent ordering - Unlike sequence positions or grid locations, the set of neighbors in a graph has no canonical order. The aggregation function must produce the same result regardless of how neighbors are enumerated.
They converge (over-smoothing), losing discriminative signal - Repeated averaging causes node representations to become increasingly similar, eventually converging toward the same mean representation within connected components. This is why most practical GNNs use only 2-4 layers.
22.2.2 Canonical Architectures
Several GNN architectures have become standard tools for biological applications, each with distinct design choices that reflect different tradeoffs between computational efficiency, expressive power, and scalability.
Before reading about over-smoothing, consider this question: If you stack 10 graph convolutional layers that average neighbor representations at each step, what do you think happens to the node embeddings in a connected graph? Will they become more diverse and specialized, or more similar? Why?
The simplest approach performs normalized neighborhood averaging followed by linear transformation and nonlinearity. Graph convolutional networks (GCN) (Kipf and Welling 2017) introduced spectral graph convolutions that aggregate information from neighboring nodes, enabling models that respect graph structure. GCNs are computationally efficient and conceptually straightforward but suffer from over-smoothing when stacked deeply: repeated averaging causes node representations to converge, losing the discriminative signal that distinguishes different network positions (Li, Han, and Wu 2018; Oono and Suzuki 2020). The mathematical reason is intuitive: averaging is a low-pass filter that removes high-frequency variation. After many rounds of averaging, all nodes in a connected component converge toward the same mean representation, much like repeatedly blurring an image eventually produces uniform gray.
The over-smoothing problem is subtle but critical. Intuitively, if you repeatedly average a node’s representation with its neighbors, eventually all nodes in a connected component converge to similar representations. This means deeper GNNs are not always better. In practice, most GNNs use only 2-4 layers. Understanding when and why to limit depth is essential for effective GNN design.
Scalability to large graphs requires a different strategy. GraphSAGE learns aggregation functions that operate on sampled neighborhoods rather than the full neighbor set (Hamilton, Ying, and Leskovec 2017). The key insight is that full-batch GCN requires storing all node representations simultaneously, which becomes prohibitive for graphs with millions of nodes. By sampling a fixed number of neighbors (say, 10) at each layer rather than using all neighbors, GraphSAGE bounds memory requirements and enables mini-batch training. The sampling introduces variance but enables scaling to graphs that would be impossible otherwise. Crucially, GraphSAGE also provides inductive capability: the model can generate embeddings for nodes not seen during training by applying learned aggregators to their neighborhoods. For biological networks that grow as new genes are characterized, this generalization is valuable.
When some neighbors matter more than others, attention-weighted aggregation provides a learnable solution. Graph attention networks (GAT) compute attention scores between each node and its neighbors, allowing the model to focus on the most informative interactions (Veličković et al. 2018). This is analogous to attention in transformers but operates over graph neighborhoods rather than sequence positions (Chapter 7).
Finally, the boundary between sequence and graph models blurs when transformer architectures extend to graphs. Graph transformers replace local message passing with structured or global attention. Some variants attend over all node pairs with positional encodings derived from graph structure (shortest paths, Laplacian eigenvectors); others restrict attention to k-hop neighborhoods (Ying et al. 2021; Dwivedi and Bresson 2021). These architectures potentially capture long-range dependencies that multi-layer message passing struggles to propagate.
| Architecture | Aggregation Method | Scalability | Inductive? | Key Strength | Key Limitation |
|---|---|---|---|---|---|
| GCN | Normalized mean | Limited (full-batch) | No | Simple, efficient | Over-smoothing with depth |
| GraphSAGE | Sampled aggregators | High (mini-batch) | Yes | Scales to large graphs | Sampling variance |
| GAT | Attention-weighted | Moderate | Yes | Learns edge importance | Quadratic in neighbors |
| Graph Transformer | Global/structured attention | Variable | Yes | Long-range dependencies | Computational cost |
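In PyTorch Geometric, these architectures are near drop-in replacements for one another, which makes empirical comparison straightforward. A minimal two-layer sketch, with illustrative dimensions:

```python
import torch
from torch_geometric.nn import GCNConv, SAGEConv, GATConv

class TwoLayerGNN(torch.nn.Module):
    """Swap `conv` to compare architectures under identical training conditions."""
    def __init__(self, in_dim, hidden, out_dim, conv=GCNConv):
        super().__init__()
        self.conv1 = conv(in_dim, hidden)
        self.conv2 = conv(hidden, out_dim)

    def forward(self, x, edge_index):
        x = torch.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)

# Attention-weighted aggregation instead of normalized averaging:
model = TwoLayerGNN(in_dim=1280, hidden=256, out_dim=2, conv=GATConv)
```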
For each biological scenario, which GNN architecture would you choose?
Predicting function for newly characterized proteins not seen during training, using a PPI network with 2 million edges.
A small regulatory network (500 nodes) where you suspect certain transcription factor interactions are more important than others.
Propagating GWAS signals through pathways where causal genes may be 5+ hops from the nearest significant association.
Real-time drug-target interaction prediction in a clinical setting requiring fast inference on patient-specific networks.
GraphSAGE - Inductive capability is essential for new proteins, and sampling-based aggregation scales to large graphs. GCN’s transductive nature cannot handle unseen nodes.
GAT - When certain edges matter more than others, attention weights learn which interactions are most informative. For small networks, GAT’s computational overhead is manageable.
Graph Transformer - Long-range dependencies spanning 5+ hops are difficult for standard message passing (over-smoothing). Graph transformers with global attention or positional encodings can capture distant relationships directly.
GCN - Simplest and fastest inference. For real-time clinical applications, the efficiency of normalized mean aggregation outweighs the benefits of more sophisticated architectures.
The expressiveness of GNNs is bounded by their ability to distinguish different graph structures. Theoretical analysis connects standard message passing to the Weisfeiler-Lehman (WL) graph isomorphism test, a classical algorithm for determining whether two graphs have identical structure.
The WL test iteratively refines node labels by aggregating neighbor labels. Initially, all nodes receive the same label. At each iteration, a node’s new label is computed by hashing its current label together with the multiset of its neighbors’ labels. Two graphs are considered potentially isomorphic if they produce identical label histograms after sufficient iterations.
Standard message passing GNNs are provably no more powerful than the 1-WL test: any two graphs that 1-WL cannot distinguish will also produce identical GNN embeddings, regardless of depth or learned parameters. This means certain symmetric structures (e.g., regular graphs where all nodes have identical neighborhoods) cannot be differentiated. More expressive variants (higher-order WL tests, subgraph GNNs) exist but increase computational cost.
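The 1-WL procedure is simple enough to implement directly. The sketch below runs color refinement on a 6-cycle and on two disjoint triangles: two structurally different graphs that nonetheless receive identical label histograms, illustrating the expressiveness bound.

```python
from collections import Counter

def wl_colors(adj, rounds=3):
    """1-WL color refinement. adj: dict mapping node -> list of neighbors."""
    color = {v: 0 for v in adj}  # all nodes start with the same label
    for _ in range(rounds):
        # New label = (own label, multiset of neighbor labels), then compress
        sig = {v: (color[v], tuple(sorted(color[u] for u in adj[v]))) for v in adj}
        relabel = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        color = {v: relabel[sig[v]] for v in adj}
    return Counter(color.values())  # label histogram

cycle6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}           # one 6-cycle
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
                 3: [4, 5], 4: [3, 5], 5: [3, 4]}                    # two 3-cycles
print(wl_colors(cycle6) == wl_colors(two_triangles))  # True: 1-WL cannot tell them apart
```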
For most biological applications, this theoretical limitation is less constraining than practical issues of data quality, training efficiency, and interpretability (Chapter 25). Biological networks typically have sufficient heterogeneity in node features and local structure that the WL expressiveness bound rarely limits performance (Xu et al. 2019; Morris et al. 2019).
22.3 Foundation Model Embeddings as Node Features
The power of combining foundation models with graph neural networks lies in their complementary strengths. Foundation models extract rich biological information from sequence, but they operate on isolated entities without relational context. Graph neural networks reason over relationships, but they require informative node features to propagate meaningful signal. This section examines how to integrate these approaches effectively, from the architectural principle underlying the combination to practical patterns for implementation.
22.3.1 Integration Principle
The central architectural insight for genomic graph learning is that foundation models and graph neural networks operate at complementary levels of abstraction. Sequence-based foundation models excel at extracting biological information from linear sequences: ESM-2 learns evolutionary constraints and structural propensities from protein sequences (Chapter 16); DNABERT and its successors capture regulatory motifs and sequence grammar (Chapter 15); single-cell foundation models like scGPT learn cell state representations from expression profiles (Chapter 20). These representations encode rich biological knowledge but operate on individual entities without explicit relational information. The principles of feature extraction from pretrained models, which underpin this integration pattern, are developed in Section 9.3.
Graph neural networks excel at learning from relational structure but require informative node features to propagate. When node features are uninformative (simple one-hot encodings or scalar expression values), message passing can only learn from network topology. When node features carry substantial biological signal, message passing can refine and contextualize that information based on network position.
Combining these approaches follows a natural two-stage pattern. First, apply a foundation model to each entity in the graph to generate initial node embeddings. For a protein-protein interaction network, run ESM-2 on each protein sequence; for a gene regulatory network, use DNA embeddings for regulatory elements and protein embeddings for transcription factors; for a cell graph, apply scGPT to generate cell state representations. Second, train a GNN on these embeddings using the biological graph structure, allowing message passing to integrate entity-level representations with relational context.
This combination yields capabilities that neither component achieves alone. The foundation model provides rich, transferable features that would require massive labeled datasets to learn from scratch. The GNN provides relational reasoning that sequence models cannot perform. A protein’s druggability depends both on intrinsic properties (binding pocket geometry, expression pattern) that ESM captures and on network context (pathway position, interaction partners) that the GNN integrates.
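A minimal sketch of the two-stage pattern for a PPI network, assuming ESM-2 embeddings have been precomputed and cached (the file name, sizes, and stand-in tensors are all illustrative):

```python
import torch
from torch_geometric.nn import SAGEConv

# Stage 1 (offline): run ESM-2 once per protein and cache mean-pooled embeddings.
# esm_embeddings = torch.load("esm2_node_features.pt")   # hypothetical cache file
esm_embeddings = torch.randn(18_000, 1280)               # stand-in: one vector per protein

# Stage 2: a GNN over the PPI graph, with ESM vectors as node features
class PPIModel(torch.nn.Module):
    def __init__(self, in_dim=1280, hidden=256, n_classes=2):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden)
        self.conv2 = SAGEConv(hidden, hidden)
        self.head = torch.nn.Linear(hidden, n_classes)

    def forward(self, x, edge_index):
        x = torch.relu(self.conv1(x, edge_index))
        x = torch.relu(self.conv2(x, edge_index))
        return self.head(x)

ppi_edge_index = torch.randint(0, 18_000, (2, 300_000))  # stand-in for STRING/BioGRID edges
logits = PPIModel()(esm_embeddings, ppi_edge_index)      # [18000, 2] per-gene logits
```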
Imagine you are building a model to predict which genes are essential for cancer cell survival. You have:
- ESM-2 embeddings for each protein
- A protein-protein interaction network from STRING
- CRISPR knockout screens identifying essential genes in 10 cancer cell lines
Sketch out your approach. Would you freeze the ESM-2 embeddings or fine-tune them? How many GNN layers would you use? What might go wrong if you only used the network structure without the sequence embeddings?
22.3.2 Practical Integration Patterns
Several integration patterns have emerged in practice, each suited to different constraints and objectives. The simplest approach freezes foundation model weights and treats embeddings as fixed features, training only the GNN layers. This is computationally efficient and prevents catastrophic forgetting of pretrained knowledge but limits the model’s ability to adapt representations to the specific task (Chapter 9).
When sufficient task-specific data is available, allowing gradients to flow through both the GNN and (parts of) the foundation model enables end-to-end optimization. This joint fine-tuning typically requires careful learning rate scheduling, with smaller updates to foundation model parameters and larger updates to GNN layers. The approach can improve performance but risks overfitting and requires substantially more computation.
A middle ground inserts small trainable modules between foundation model layers or at the interface between foundation model outputs and GNN inputs. Adapter-based integration provides task adaptation with modest parameter overhead, avoiding full fine-tuning costs while retaining flexibility (Chapter 9).
The granularity of representations also offers flexibility. For proteins, one might extract both per-residue embeddings (capturing local structure) and sequence-level embeddings (capturing global properties), concatenating these as node features. For regulatory networks, one might combine nucleotide-level DNA embeddings with region-level chromatin accessibility predictions. This multi-scale integration uses foundation model representations at multiple granularities to capture different aspects of biological function.
Start simple: Begin with frozen embeddings and a 2-layer GNN. This establishes a strong baseline with minimal hyperparameter tuning.
Add complexity when justified: If frozen embeddings underperform and you have substantial labeled data (thousands of examples), try adapter-based integration before full fine-tuning.
Monitor for overfitting: Watch the gap between training and validation performance. Joint fine-tuning on small datasets often memorizes rather than generalizes.
Consider compute budget: Pre-computing embeddings for all nodes once is much cheaper than computing them on-the-fly during training. For large graphs, this can reduce training time by 10-100x.
The choice of integration pattern depends on data availability, computational resources, and the degree of distribution shift between foundation model pretraining and the target application (Chapter 13). For well-characterized systems with substantial labeled data, joint fine-tuning may be warranted. For novel organisms or rare diseases with limited labels, frozen embeddings with simple GNN layers often generalize better.
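As a concrete illustration of the middle-ground pattern, the sketch below shows a residual bottleneck adapter inserted between frozen foundation model outputs and the GNN input; the bottleneck width and tensor sizes are arbitrary choices:

```python
import torch

class Adapter(torch.nn.Module):
    """Residual bottleneck between frozen foundation model output and GNN input."""
    def __init__(self, dim=1280, bottleneck=64):
        super().__init__()
        self.down = torch.nn.Linear(dim, bottleneck)
        self.up = torch.nn.Linear(bottleneck, dim)

    def forward(self, x):
        # Residual connection preserves the pretrained signal;
        # only ~165k parameters train, versus full fine-tuning
        return x + self.up(torch.relu(self.down(x)))

frozen = torch.randn(18_000, 1280).detach()  # stand-in for cached, frozen ESM-2 vectors
node_features = Adapter()(frozen)            # feed these to the GNN instead of raw embeddings
```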
22.3.3 Evidence for the Integration Benefit
Empirical studies consistently demonstrate that foundation model embeddings improve GNN performance on biological tasks. In protein function prediction, ESM embeddings combined with PPI network GNNs substantially outperform either sequence-only or network-only baselines (Lin et al. 2025). The improvement is particularly pronounced for proteins with few characterized interaction partners, where network structure alone provides limited signal but sequence features carry evolutionary information.
For disease gene prioritization, combining DNA and protein foundation model embeddings with multi-relational GNNs over heterogeneous biological networks improves ranking of causal genes from GWAS loci (Saadat and Fellay 2024; Mastropietro, De Carlo, and Anagnostopoulos 2023). The foundation model features help distinguish genes with similar network positions based on sequence-level functional signals.
In single-cell analysis, scGPT embeddings combined with cell-cell communication graphs enable more accurate prediction of perturbation effects than either component alone (Cui et al. 2024; Su et al. 2023; Yan, Zhang, and Sun 2026). The cell embeddings capture transcriptional state, while the graph structure encodes spatial and molecular interaction context.
These results suggest that the integration principle generalizes across biological domains. The specific foundation models and graph types vary, but the architectural pattern (rich entity embeddings + relational structure + message passing) consistently outperforms simpler alternatives.
22.4 Applications
The integration of foundation model embeddings with graph neural networks enables applications across the translational pipeline. Disease gene prioritization leverages network context to identify causal genes from GWAS loci. Drug-target prediction exploits both sequence-derived druggability features and pathway positioning. Knowledge graph reasoning supports multi-hop inference for drug repurposing. Pathway analysis identifies dysregulated modules in patient-specific contexts. Each application follows the same architectural pattern (rich node features from foundation models, relational structure from biological networks, refinement through message passing) while addressing distinct biological questions.
22.4.1 Disease Gene Prioritization
Genome-wide association studies identify genomic loci associated with disease risk but rarely pinpoint causal genes (Chapter 3). A typical GWAS locus contains dozens of genes, most of which are passengers linked to the true causal variant through linkage disequilibrium. Identifying which gene(s) mediate the association requires integrating functional evidence with genetic signal.
Network-based prioritization leverages the observation that disease genes cluster in biological networks. If a GWAS locus contains genes A, B, and C, and gene B interacts with five known disease genes while A and C interact with none, gene B becomes a stronger causal candidate. Graph neural networks formalize and extend this intuition, learning to propagate disease labels through networks and score candidate genes based on their network context.
Classical network analysis uses “guilt by association”—genes near known disease genes are likely disease genes. GNNs go further by learning which associations matter. Not all neighbors are equally informative. A GNN trained with foundation model embeddings can learn that a neighbor sharing functional domains (detected through sequence similarity) is more informative than a neighbor connected only through high-throughput screening artifacts.
The integration with foundation models strengthens this approach. Rather than relying solely on network topology, which favors well-studied hub genes, the model can assess each candidate’s intrinsic functional properties through sequence embeddings. A gene with protein features characteristic of disease-relevant functions (membrane localization, DNA binding, signaling domains) receives higher scores even if its network position is peripheral. This helps mitigate the ascertainment bias toward well-characterized genes that plagues purely topological methods.
Clinical applications include rare disease diagnosis, where patient exome sequencing identifies hundreds of candidate variants and network-based scoring helps prioritize which genes to investigate further (Chapter 29). The approach also supports drug target identification by highlighting genes whose network position and functional properties make them amenable to therapeutic modulation (Chapter 30). For rare disease diagnosis, network-based prioritization integrates with the variant filtering pipelines in Section 29.1, where foundation model embeddings and network context jointly inform gene ranking.
22.4.2 Drug-Target Interaction Prediction
Identifying which proteins a drug binds is fundamental to understanding mechanism and predicting side effects. Experimental screening of drug-target pairs is expensive and incomplete; computational prediction can prioritize candidates for validation.
Drug-target interaction prediction naturally fits a graph framework. Construct a heterogeneous graph with drug nodes, protein nodes, and edges representing known interactions. Node features for proteins come from sequence foundation models; node features for drugs come from molecular encodings (fingerprints, learned representations from molecular graphs). Train a GNN to predict missing edges, learning which drug and protein features, combined with network context, indicate likely binding.
The foundation model integration is critical here. Protein embeddings from ESM capture binding pocket characteristics, domain structure, and evolutionary constraint that influence druggability. The graph structure provides context: if a drug binds protein A, and protein A participates in complex with protein B, then the drug may also affect protein B’s function. Multi-relational GNNs can learn different propagation patterns for different edge types (physical binding versus pathway membership versus sequence similarity), improving prediction accuracy.
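One common formulation, sketched below under illustrative dimensions, scores candidate drug-protein pairs with a learned bilinear form over the two embedding spaces and trains with binary cross-entropy against known interactions:

```python
import torch

class DTIScorer(torch.nn.Module):
    """Bilinear link-prediction decoder over drug and protein embeddings."""
    def __init__(self, d=256):
        super().__init__()
        self.W = torch.nn.Parameter(torch.randn(d, d) * 0.01)

    def forward(self, drug_z, prot_z, pairs):
        # pairs: [m, 2] indices of candidate (drug, protein) edges
        return (drug_z[pairs[:, 0]] @ self.W * prot_z[pairs[:, 1]]).sum(-1)  # logits

# drug_z from a molecular encoder; prot_z from ESM features refined by GNN layers
drug_z, prot_z = torch.randn(100, 256), torch.randn(500, 256)
pairs = torch.tensor([[0, 3], [5, 42]])      # one known positive, one sampled negative
logits = DTIScorer()(drug_z, prot_z, pairs)
loss = torch.nn.functional.binary_cross_entropy_with_logits(
    logits, torch.tensor([1.0, 0.0]))        # unobserved pairs are noisy negatives
```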
This application connects to broader drug discovery workflows (Chapter 30), where target identification is one component of a multi-stage pipeline. GNN-based predictions provide hypotheses for experimental validation, accelerating the search for novel therapeutic targets.
A pharmaceutical company wants to predict off-target effects of a new kinase inhibitor. They have:
- The drug’s binding affinity to 50 kinases (experimentally measured)
- A kinase family tree based on sequence similarity
- ESM-2 embeddings for all human kinases
How would you structure this as a graph learning problem? What would be your nodes, edges, and prediction target? What might the model learn that simple sequence similarity would miss?
22.4.3 Knowledge Graph Reasoning and Drug Repurposing
Drug repurposing seeks new therapeutic applications for existing compounds, exploiting the observation that drugs often affect multiple targets and pathways beyond their original indication. Knowledge graphs provide a natural framework for repurposing by encoding the relationships through which a drug’s effects might propagate to new disease contexts.
The repurposing problem can be framed as link prediction in a heterogeneous graph: given a knowledge graph with drugs, diseases, genes, and pathways as nodes, predict missing drug-treats-disease edges. Unlike direct drug-target prediction, this task requires reasoning across multiple relationship types. A candidate repurposing hypothesis might involve a chain such as: drug D binds protein P1, P1 regulates pathway W, pathway W is dysregulated in disease X, therefore D may treat X. Graph neural networks designed for heterogeneous graphs learn to aggregate evidence across such chains, weighting different metapaths (sequences of edge types) according to their predictive reliability. The drug repurposing applications that exploit this reasoning are detailed in Section 30.2.2.
Foundation model embeddings strengthen knowledge graph reasoning in several ways. For gene and protein nodes, ESM embeddings encode functional properties that influence druggability and pathway membership. For disease nodes, embeddings derived from clinical text or phenotype ontologies capture symptom patterns and comorbidity relationships. For drug nodes, molecular representations from chemical language models or graph neural networks over molecular structure encode binding properties and pharmacokinetics. These rich node features allow the GNN to assess not just whether a path exists but whether the entities along that path have compatible functional characteristics.
Empirical results demonstrate the value of this integration. Models combining knowledge graph structure with foundation model embeddings outperform both topology-only approaches (which ignore node semantics) and embedding-only approaches (which ignore relational structure) on standard drug repurposing benchmarks (Dang et al. 2025; Zhao et al. 2025). The improvement is particularly pronounced for drugs and diseases with sparse direct evidence, where multi-hop reasoning through well-characterized intermediate entities provides the primary signal.
Clinical translation of knowledge graph predictions requires careful interpretation. A high-scoring drug-disease prediction indicates that multiple lines of computational evidence converge, not that efficacy has been established. The paths contributing to predictions provide mechanistic hypotheses that can guide experimental validation: if the model relies heavily on a drug-protein-pathway-disease chain, that pathway becomes a candidate biomarker for patient selection or treatment response monitoring. Several repurposing candidates identified through knowledge graph methods have entered clinical trials, though the approach remains most valuable for hypothesis generation rather than definitive target validation (Stebbing et al. 2020; Richardson et al. 2020).
22.4.4 Pathway and Module Analysis
Individual genes rarely act alone; biological function emerges from coordinated activity of gene sets organized into pathways and functional modules. Patient-specific pathway analysis identifies which modules show coordinated dysregulation, providing mechanistic insight beyond single-gene associations.
GNNs enable pathway analysis that respects network structure rather than treating gene sets as independent members. By propagating patient-specific expression or mutation signals through pathway graphs, models can identify which subnetworks show coherent perturbation. This differs from classical gene set enrichment, which tests for overrepresentation without considering internal pathway topology.
Foundation model features strengthen pathway analysis by providing functional context for each gene. A gene with features indicating chromatin regulation may contribute to pathway dysfunction through different mechanisms than one with features indicating membrane signaling. The GNN learns to weight these contributions based on network position and functional annotation, identifying pathway perturbations that purely topological or purely gene-set methods miss.
22.4.5 Cell Type and State Annotation
Single-cell foundation models generate rich representations of individual cells (Chapter 20), but many biological questions involve relationships between cells: which cells communicate, how spatial neighborhoods influence behavior, which cell types co-occur in disease states.
Graph neural networks over cell-cell interaction graphs enable several applications. Cell type annotation propagates labels from well-characterized cells to ambiguous ones based on expression similarity and spatial proximity. Perturbation response prediction models how signals from perturbed cells propagate to neighbors. Tissue region classification identifies coherent spatial domains (tumor, stroma, immune infiltrate) based on local cell compositions.
The foundation model integration follows the standard pattern: scGPT or similar models generate cell embeddings, spatial proximity or inferred ligand-receptor interactions define edges, and GNN message passing refines cell representations based on neighborhood context. The resulting embeddings capture both intrinsic cell state and extrinsic spatial/communicative context, enabling predictions that purely expression-based or purely spatial models cannot make.
22.5 Practical Considerations
The network modeling principles you learn here power clinical applications in later chapters. In Chapter 29, network propagation prioritizes disease genes from GWAS loci; the same message passing mechanics you have learned enable identifying which genes in a locus are most likely causal. In Chapter 30, drug-target interaction prediction combines PLM embeddings (Chapter 16) with network proximity to disease genes. The ascertainment bias discussed above directly affects these applications: understudied genes appear less connected than they truly are, potentially causing prioritization methods to miss important candidates.
Deploying graph neural networks on biological data requires navigating choices that profoundly affect model behavior. Graph construction determines what relationships the model can exploit. Scalability strategies determine whether training is feasible on large networks. Robustness techniques determine whether predictions generalize beyond well-characterized network regions. Interpretation methods determine whether outputs provide actionable biological insight. The following subsections address each consideration in turn.
22.5.1 Graph Construction Quality
Graph construction involves a fundamental tradeoff: curated databases like BioGRID provide high-confidence interactions but limited coverage, while computational predictions from STRING are comprehensive but noisier. Before reading the discussion, predict: For disease gene prioritization, would you prefer a high-precision network (fewer edges, high confidence) or a high-recall network (more edges, lower confidence)? What would change for a safety-critical application like predicting drug side effects?
The impact of graph construction choices cannot be overstated. A GNN can only learn from relationships encoded in its input graph; missing edges prevent information flow, spurious edges introduce noise, and biased edge sets propagate ascertainment artifacts.
Source selection involves tradeoffs between precision and completeness. Curated databases like BioGRID provide high-confidence interactions but miss most true relationships. Computational predictions from STRING or co-expression analysis are more comprehensive but noisier. The appropriate choice depends on the downstream task: high-precision networks may be preferable when false positives are costly, while high-recall networks enable discovery of novel biology at the risk of chasing artifacts.
Thresholding decisions determine network density. Confidence scores or distance metrics allow continuous edge weights, but many GNN implementations require discrete edges or work better with relatively sparse graphs. Cross-validation over threshold values or principled selection criteria (target edge density, ensure graph connectivity) help navigate this choice.
For heterogeneous graphs, schema design (which node types exist, which edge types connect them) encodes strong assumptions about relevant biology. A knowledge graph that separates genes, transcripts, and proteins as distinct node types enables fine-grained reasoning but requires more training data than a simpler gene-only representation.
Before training a GNN, verify your graph construction:
- Edge confidence: does your score threshold produce the intended density, and does it retain the interactions your task depends on?
- Connectivity: how many connected components exist, and what fraction of nodes sit in the largest one? Message passing cannot cross component boundaries.
- Degree distribution: do extreme hubs reflect biology or study bias toward well-characterized genes?
- Edge semantics: are physical and functional associations conflated, and should they be?
Mistakes in graph construction often matter more than model architecture choices. A sketch of basic programmatic checks follows.
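A minimal sanity-check routine, assuming a networkx graph whose edges may carry a `confidence` attribute (the threshold and printed diagnostics are illustrative):

```python
import networkx as nx

def sanity_check(G, min_conf=0.7):
    """Illustrative pre-training checks for an undirected networkx graph."""
    comps = sorted(nx.connected_components(G), key=len, reverse=True)
    n = G.number_of_nodes()
    print(f"nodes={n} edges={G.number_of_edges()} components={len(comps)}")
    print(f"largest component covers {len(comps[0]) / n:.0%} of nodes")
    degs = [d for _, d in G.degree()]
    print(f"max degree={max(degs)} (inspect top hubs for study bias)")
    print(f"self-loops={nx.number_of_selfloops(G)} isolated={sum(d == 0 for d in degs)}")
    weak = sum(1 for _, _, d in G.edges(data=True) if d.get("confidence", 1.0) < min_conf)
    print(f"edges below confidence {min_conf}: {weak}")
```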
22.5.2 Scalability and Mini-Batching
Biological graphs range from thousands of nodes (a single-patient cell graph) to millions (a comprehensive knowledge graph or large spatial transcriptomics dataset). Full-batch training, where the entire graph is processed simultaneously, becomes infeasible at scale due to memory constraints.
Mini-batching strategies partition computation into manageable pieces. Neighborhood sampling (GraphSAGE-style) restricts message passing to a fixed sample of neighbors per node, enabling node-level mini-batches. Subgraph sampling trains on induced subgraphs corresponding to meaningful units (individual pathways, tissue regions, patient subsets). Cluster-based training partitions the graph into communities, processes each independently, and handles cross-cluster edges in a second pass.
For foundation model integration, computational cost compounds: generating embeddings for millions of proteins or cells may itself be expensive. Pre-computing and caching embeddings is often practical, decoupling the foundation model forward pass from GNN training. When embeddings must be computed on-the-fly (for dynamic features or joint fine-tuning), careful batching and gradient checkpointing become essential.
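In PyTorch Geometric, GraphSAGE-style neighborhood sampling is available through `NeighborLoader`; the sketch below (with illustrative sizes) bounds per-batch memory by sampling at most 10 neighbors per node at each of two hops:

```python
import torch
from torch_geometric.data import Data
from torch_geometric.loader import NeighborLoader

data = Data(x=torch.randn(100_000, 128),                        # cached node features
            edge_index=torch.randint(0, 100_000, (2, 1_000_000)))

# Two-hop sampling: each batch touches a bounded subgraph, not the full network
loader = NeighborLoader(data, num_neighbors=[10, 10], batch_size=1024, shuffle=True)

batch = next(iter(loader))  # a sampled subgraph; the first batch.batch_size nodes are seeds
```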
22.5.3 Robustness to Noise and Missingness
All biological networks contain errors. Experimental methods for detecting interactions have false positive and false negative rates; computational predictions rely on imperfect proxies; even curated databases contain mistakes. GNNs must tolerate this noise to be practically useful.
Randomly masking edges during training forces the model to avoid relying on any single interaction. This edge dropout improves robustness to missing or incorrect edges and serves as a form of regularization. The mechanism works because dropout during training creates an implicit ensemble: the model must learn to make correct predictions across many different subgraphs, which encourages it to rely on redundant signals rather than any single edge. Similarly, masking node features or entire nodes through node dropout prevents overfitting to well-connected hubs by forcing the model to make predictions even when the most informative hub nodes are unavailable.
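Recent versions of PyTorch Geometric expose this as a utility; a minimal sketch of edge dropout inside a training loop (graph and rate are illustrative):

```python
import torch
from torch_geometric.utils import dropout_edge

edge_index = torch.randint(0, 100, (2, 400))   # illustrative graph

# Each forward pass sees a different random subgraph (20% of edges removed),
# forming an implicit ensemble that discourages reliance on any single edge
dropped_edges, edge_mask = dropout_edge(edge_index, p=0.2, training=True)
# out = model(x, dropped_edges)
```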
Ensemble methods train multiple GNNs on different network subsamples or with different random initializations, aggregating predictions to reduce variance from network noise. Bayesian GNNs provide uncertainty estimates that flag low-confidence predictions for manual review (Chapter 24).
Evaluation should explicitly assess robustness by testing on held-out edges, nodes from poorly characterized network regions, or networks constructed from different data sources than training. A model that performs well only on hub genes or well-characterized interactions may fail in precisely the scenarios where computational prediction is most needed (Chapter 12).
You have trained a GNN for disease gene prioritization and achieved 0.85 AUC on your test set. Before celebrating, what additional evaluations should you perform to assess whether this performance is meaningful?
Consider:
1. How would you check if the model is simply learning node degree?
2. How would you assess performance on understudied genes?
3. How would you test generalization to a new disease not in training?
Compare performance to degree-only baseline - Train a simple model using only node degree as a feature. If your GNN performs only marginally better, it may be learning degree rather than biological mechanism. Additionally, stratify test performance by degree quartiles to check whether accuracy is uniform across hub and peripheral genes. A minimal sketch of this baseline appears after these answers.
Stratify evaluation by publication count or study bias metrics - Split test genes into well-characterized (many publications) versus understudied (few publications) categories. Compute AUC separately for each group. A model that performs well only on well-studied genes is recapitulating existing knowledge, not discovering new biology.
Temporal holdout or leave-one-disease-out cross-validation - Train on diseases characterized before year X, test on diseases characterized after. Or use cross-validation where each fold holds out a full disease and all its genes. This tests whether the model learns generalizable disease biology rather than memorizing specific disease-gene pairs.
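A minimal version of the degree-only baseline from the first answer might look like the sketch below; the random network and labels are stand-ins, so only the pattern (not the numbers) carries over to real data.

```python
# Degree-only baseline for disease gene prioritization; the graph and
# labels are synthetic stand-ins for a real PPI network and gene labels.
import numpy as np
import networkx as nx
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

G = nx.barabasi_albert_graph(2000, 3)        # hub-heavy toy network
degree = np.array([G.degree(n) for n in G.nodes()]).reshape(-1, 1)
y = np.random.binomial(1, 0.05, size=2000)   # stand-in disease labels

X_tr, X_te, y_tr, y_te = train_test_split(degree, y, stratify=y,
                                          random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"degree-only AUC: {auc:.2f}")
# If the GNN's 0.85 is only marginally above this, suspect degree leakage.
```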
22.5.4 Interpretation and Validation
A key advantage of graph models is interpretability: the graph structure itself provides a scaffold for understanding predictions (Chapter 25). Several techniques extract biological insight from trained GNNs.
Attention weights in GAT and graph transformer models indicate which neighbors most influenced each node’s prediction. Aggregating attention across predictions can highlight critical edges or subgraphs, suggesting which interactions drive model behavior and revealing which neighbors the model has learned to prioritize.
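With PyTorch Geometric's GATConv, per-edge attention coefficients can be requested directly, as in this sketch with a toy graph; dimensions and head count are illustrative.

```python
# Extracting attention weights from a GAT layer for edge-importance analysis.
import torch
from torch_geometric.nn import GATConv

x = torch.randn(50, 32)                      # 50 nodes, 32-dim features
edge_index = torch.randint(0, 50, (2, 200))  # toy edge list

conv = GATConv(32, 16, heads=4)
out, (att_edge_index, alpha) = conv(x, edge_index,
                                    return_attention_weights=True)
# alpha: one coefficient per edge (self-loops included) per head;
# averaging over heads gives a single importance score per edge.
edge_importance = alpha.mean(dim=1)
```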
Computing how predictions change with respect to node or edge features identifies which parts of the input most affect outputs. Gradient-based attribution methods such as integrated gradients provide smoother, more reliable attributions than raw gradients.
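A bare-bones gradient saliency routine is sketched below; integrated gradients would average such gradients along a path from a baseline input rather than taking a single raw gradient. The model and target choices are placeholders.

```python
# Gradient saliency for node features: how strongly does each node's
# feature vector influence a chosen prediction?
import torch

def feature_saliency(model, x, edge_index, node_idx, target_class):
    x = x.clone().requires_grad_(True)
    logits = model(x, edge_index)
    logits[node_idx, target_class].backward()
    # Per-node importance: gradient magnitude summed over feature dims.
    return x.grad.abs().sum(dim=1)
```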
Systematically removing edges, masking nodes, or perturbing features and observing prediction changes reveals which graph elements are necessary for specific predictions. This counterfactual analysis can identify model vulnerabilities and generate testable hypotheses about which interactions are essential.
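A brute-force version of edge ablation, feasible only for small graphs or pre-filtered candidate edges, might look like the following sketch; all names are illustrative.

```python
# Counterfactual edge ablation: drop one edge at a time and record how the
# prediction for a node of interest shifts. Requires O(E) forward passes.
import torch

@torch.no_grad()
def edge_ablation(model, x, edge_index, node_idx, target_class):
    model.eval()
    base = model(x, edge_index).softmax(-1)[node_idx, target_class]
    deltas = []
    for e in range(edge_index.size(1)):
        keep = torch.arange(edge_index.size(1)) != e
        p = model(x, edge_index[:, keep]).softmax(-1)[node_idx, target_class]
        deltas.append((base - p).item())   # large drop => necessary edge
    return deltas
```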
Projecting learned node representations into two dimensions using Uniform Manifold Approximation and Projection (UMAP) or t-distributed stochastic neighbor embedding (t-SNE) reveals clusters that may correspond to functional categories, cell types, or disease subtypes. Comparing embedding visualizations across conditions identifies network regions that show context-specific changes.
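With the umap-learn package, the projection step itself is only a few lines, as in this sketch with stand-in embeddings.

```python
# 2D projection of learned node embeddings for visual inspection;
# the random matrix stands in for embeddings from a trained GNN.
import numpy as np
import umap

embeddings = np.random.randn(500, 64)
coords = umap.UMAP(n_components=2, random_state=0).fit_transform(embeddings)
# Scatter coords colored by functional annotation, cell type, or disease
# status to see whether clusters track biological categories.
```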
Interpretation is not an afterthought but a central goal. The most impactful applications are those where GNN predictions generate testable hypotheses about biological mechanism, ultimately validated by experiment. Attention weights highlighting a regulatory edge or gradient attribution implicating a signaling pathway should prompt follow-up experiments, not immediate clinical action.
22.6 Limitations and Open Challenges
Graph neural networks inherit the biases and limitations of their input networks. Network incompleteness means critical relationships may be absent. Ascertainment bias means well-studied genes dominate predictions. Correlational structure may not reflect causal mechanisms. These limitations do not invalidate the approach but constrain its appropriate use and interpretation.
22.6.1 Study Bias Problem
Network-based methods inherit the biases of their input networks. Well-studied genes appear as hubs; poorly characterized genes are peripheral or disconnected. GNNs trained on such networks tend to propagate signals toward well-characterized genes, effectively recapitulating rather than extending existing knowledge.
This creates particular problems for disease gene discovery, where the goal is often to identify previously unrecognized genes. A model that consistently ranks known disease genes highly may simply be exploiting their network prominence rather than learning generalizable disease biology. Careful evaluation on temporal holdouts (genes characterized after training data was assembled) or stratified by network degree can reveal whether models truly generalize (Chapter 12). The systematic approaches for detecting and quantifying such confounding patterns are detailed in Section 13.8.
Mitigation strategies include degree-corrected training objectives, explicit modeling of ascertainment bias, or alternative network constructions that reduce dependence on historical research focus. None fully solves the problem, which reflects fundamental data limitations rather than algorithmic shortcomings.
22.6.2 Causality Versus Association
Network edges typically represent associations (two proteins bind, two genes correlate) rather than causal relationships (perturbing gene A changes gene B). GNNs learn to exploit correlational patterns, which may not correspond to causal mechanisms.
For applications like drug target identification, the causality limitation matters enormously. A gene that correlates with disease through confounding may be a poor target despite high network-based prioritization scores. A GNN might learn that genes in the “inflammation” module are associated with autoimmune disease, but this does not mean that targeting any gene in that module will be therapeutic. Experimental validation remains essential before clinical translation.
Integrating causal inference methods with graph learning is an active research area, but current GNN applications should be interpreted as identifying associations worthy of experimental follow-up rather than establishing causal relationships.
22.6.3 Negative Data and Class Imbalance
Most biological network datasets encode only positive relationships: known interactions, confirmed regulatory edges, documented associations. The absence of an edge may indicate true non-interaction or simply lack of evidence. This creates severe class imbalance for edge prediction tasks and makes negative sampling strategies critical (Chapter 13).
Random negative sampling (assuming absent edges represent non-interactions) is common but biologically unrealistic. More sophisticated approaches sample negatives with matched properties (same degree distribution, similar node features) to create harder and more meaningful contrasts. Evaluation should report performance separately on different negative sampling schemes to assess whether models generalize beyond easily discriminated negatives (Chapter 11).
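One simple implementation buckets nodes by degree and, for each positive edge, samples a non-edge whose endpoints come from the same buckets. The sketch below uses exact-degree buckets for brevity; binned degrees are more robust in practice.

```python
# Degree-matched negative sampling: negatives resemble positives in
# endpoint degree, producing harder contrasts than uniform sampling.
import random
import networkx as nx

def degree_matched_negatives(G, positive_edges, tries=100):
    by_degree = {}
    for n in G.nodes():
        by_degree.setdefault(G.degree(n), []).append(n)
    negatives = []
    for u, v in positive_edges:
        # Retry until a valid non-edge is found; silently skip this
        # positive if no match appears within `tries` draws.
        for _ in range(tries):
            a = random.choice(by_degree[G.degree(u)])
            b = random.choice(by_degree[G.degree(v)])
            if a != b and not G.has_edge(a, b):
                negatives.append((a, b))
                break
    return negatives
```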
22.6.4 Distribution Shift
A GNN trained on one biological network (human PPI from STRING) may not transfer to another (mouse regulatory network, patient-specific spatial graph). Foundation model embeddings help by providing transferable features, but network structure differences can still break performance.
Applying models across species is particularly challenging: network topology, edge type distributions, and gene function may all differ between organisms. Cross-tissue or cross-disease transfer poses similar challenges. Explicit domain adaptation methods, multi-task training across related networks, or foundation model fine-tuning on target domains can help but add complexity (Chapter 9).
22.7 Sequence Encodes, Structure Connects
Graph neural networks operate at a complementary level of abstraction to sequence-based foundation models. Foundation models learn rich representations of biological entities from sequence data; graph neural networks learn to reason about relationships between those entities. Combining them follows a natural pattern: generate embeddings with foundation models, then refine them through message passing over graph structure. This integration yields capabilities that neither component achieves alone, propagating information across protein interaction networks, regulatory pathways, and spatial neighborhoods in ways that sequence models cannot represent.
The central insight is that biological knowledge exists at multiple scales. Sequence encodes what individual genes and proteins can do; networks encode how they interact to produce cellular function. GNNs translate the relational structure of biological networks into learnable inductive biases, enabling disease gene prioritization through network propagation, drug target prediction through pathway context, and spatial analysis through tissue graphs. The improvements over sequence-only approaches are consistent across applications, demonstrating that relational context adds genuine information beyond what sequence representations capture.
Yet network structure carries its own biases. Protein interaction databases are enriched for well-studied genes and disease-relevant pathways; less-characterized genes have fewer annotated interactions regardless of their biological importance. Correlation between genes does not imply regulatory relationship. Class imbalance between known disease genes and the genome-wide background reflects research history as much as biology. These biases propagate through GNN predictions, creating systematic patterns in what the models emphasize and what they miss. The multi-omics integration examined in Chapter 23 extends graph-based reasoning to additional modalities. Clinical applications in Chapter 28 leverage network-derived features for risk stratification, while Chapter 29 applies network propagation to gene prioritization workflows. Both depend on understanding where network-derived predictions are trustworthy and where they inherit the limitations of their inputs, challenges that the uncertainty quantification methods in Section 24.5 help address.
Before reviewing the summary, test your recall:
- Explain the relationship between foundation models and graph neural networks. Do GNNs replace foundation models or complement them?
- What is the over-smoothing problem in GNNs, and how does it limit network depth?
- Why do protein interaction networks inherit study bias, and how does this affect disease gene predictions?
- Describe how message passing works in GNNs and what happens with each additional layer.
- What is the fundamental difference between edges representing association versus causation in biological networks?
GNNs complement foundation models, not replace them - Foundation models extract rich representations from sequence (what a protein can do), while GNNs add relational reasoning (what it does in network context). The division of labor: FMs provide node features, GNNs perform message passing to integrate neighborhood information.
Over-smoothing occurs when repeated averaging converges node representations - Each GNN layer averages neighbor representations. After many layers, all nodes in a connected component converge toward similar embeddings, losing discriminative power. This is why most practical GNNs use only 2-4 layers rather than the deep stacking common in sequence models.
Well-studied genes have more documented interactions due to research focus - Cancer genes, disease genes, and conserved pathway members attract experimental attention. This ascertainment bias means GNNs trained on such networks learn to propagate signals toward already-known genes, potentially missing novel biology in peripheral regions.
Message passing iteratively updates node representations based on neighbors - At each layer: (1) compute messages from neighbors, (2) aggregate messages using permutation-invariant operations (sum/mean/max/attention), (3) update node representation by combining aggregated messages with current state. After L layers, each node has incorporated information from its L-hop neighborhood.
Association means correlation; causation means perturbation effects - An edge in a PPI network indicates proteins bind (association) but not that perturbing one changes the other (causation). Co-expression correlations may reflect shared regulation rather than direct regulatory relationship. This matters for drug targets: correlation does not guarantee that modulating a gene will have the desired therapeutic effect.
Core Concepts:
- Graph neural networks consume foundation model embeddings, not replace them. FMs provide rich node features; GNNs add relational reasoning.
- Message passing iteratively updates node representations based on neighbor information, with each layer extending the receptive field by one hop.
- Over-smoothing limits GNN depth; most practical models use 2-4 layers.
Architecture Choices:
| Need | Choose |
|---|---|
| Simple baseline | GCN with frozen FM embeddings |
| Large graphs | GraphSAGE with neighborhood sampling |
| Variable edge importance | GAT for attention-weighted aggregation |
| Long-range dependencies | Graph transformers (but higher compute) |
Key Applications:
- Disease gene prioritization: Propagate disease labels through PPI networks
- Drug-target prediction: Heterogeneous graphs with drug and protein nodes
- Drug repurposing: Multi-hop reasoning through knowledge graphs
- Spatial analysis: Cell graphs with FM embeddings as node features
Critical Limitations to Remember:
- Networks inherit study bias (well-studied genes are over-connected)
- Edges represent association, not causation
- Missing edges block information flow
- Performance on hub genes may not generalize to understudied genes
Connection to Later Chapters:
- Chapter 23: Extending GNNs to multi-modal data
- Chapter 24: Uncertainty quantification for network predictions
- Chapter 25: Interpreting what GNNs learn
- Chapter 28: Clinical applications of network features
- Chapter 29: Network-based gene prioritization for diagnosis