Appendix B — Deployment and Compute
This appendix covers practical considerations for deploying genomic foundation models, from hardware requirements and cloud platforms to inference optimization and production deployment. The goal is to help practitioners translate model capabilities into working systems that can process real genomic data at scale.
B.1 Hardware Landscape
Genomic foundation models span a wide range of computational requirements. Understanding hardware options helps practitioners match resources to their specific needs.
B.1.1 GPU Computing
Graphics Processing Units (GPUs) are the workhorses of deep learning, providing thousands of parallel cores optimized for matrix operations. Key specifications:
| Metric | Description | Relevance |
|---|---|---|
| VRAM | GPU memory | Determines maximum model/batch size |
| Compute (TFLOPS) | Floating-point operations per second | Determines training/inference speed |
| Memory bandwidth | Data transfer rate | Critical for transformer attention |
| Tensor cores | Specialized matrix units | Accelerate mixed-precision operations |
B.1.2 Consumer vs. Data Center GPUs
| GPU Class | Examples | VRAM | Use Case |
|---|---|---|---|
| Consumer | RTX 4090 | 24 GB | Small model inference, development |
| Workstation | RTX A6000 | 48 GB | Medium model training/inference |
| Data center | A100 | 40/80 GB | Large model training |
| Latest generation | H100 | 80 GB | Foundation model training |
Memory is typically the bottleneck. A 3-billion parameter model in FP16 requires approximately 6 GB just for weights, plus additional memory for activations, gradients (if training), and KV cache (for transformers). The A100 80GB enables training models that would require multi-GPU setups on smaller cards.
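As a quick sanity check, the weight footprint can be estimated directly from parameter count and precision. The helper below is a back-of-the-envelope sketch (the function name and the example numbers are illustrative, not part of any library):

```python
def estimate_weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Weight-only memory estimate: FP32 = 4 bytes, FP16/BF16 = 2, INT8 = 1."""
    return n_params * bytes_per_param / 1e9

# 3B-parameter model in FP16: ~6 GB of weights before activations,
# KV cache, or (for training) gradients and optimizer states.
print(estimate_weight_memory_gb(3e9, bytes_per_param=2))  # -> 6.0
```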
B.1.3 TPUs
Tensor Processing Units (TPUs) are Google’s custom accelerators, available through Google Cloud. They offer:
- High memory bandwidth optimized for matrix operations
- Efficient multi-device scaling through dedicated interconnects
- Cost-effective for large-scale training
Many DeepMind models (AlphaFold, Enformer) were trained on TPUs. The JAX framework provides the best TPU support.
B.1.4 Multi-GPU and Distributed Training
Large models require multiple GPUs:
Data parallelism replicates the model across GPUs, each processing different batches. Gradients are synchronized after each step. Scales batch size but not model size.
Model parallelism splits the model across GPUs:

- Tensor parallelism: Splits individual layers across GPUs
- Pipeline parallelism: Assigns different layers to different GPUs
Fully Sharded Data Parallel (FSDP) and DeepSpeed ZeRO combine approaches, sharding model states across GPUs to train models larger than any single GPU’s memory.
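A minimal FSDP training loop in PyTorch might look like the sketch below. It assumes launch via `torchrun`, and `build_model` and `dataloader` are placeholders for your own model constructor and data pipeline; the loss computation is likewise schematic.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes launch via `torchrun --nproc_per_node=<gpus> train.py`, which sets
# the RANK / WORLD_SIZE / LOCAL_RANK environment variables.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = build_model().cuda()   # placeholder: your genomic model constructor
model = FSDP(model)            # shards parameters and gradients across ranks

# Create the optimizer after wrapping so its state is sharded as well.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for batch in dataloader:       # placeholder dataloader yielding token tensors
    loss = model(batch).mean() # placeholder loss; real models return richer outputs
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```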
B.1.5 CPU Inference
For smaller models or low-throughput applications, CPU inference may suffice:
- Avoids GPU procurement and maintenance
- Enables deployment on standard servers
- Suitable for models with <1B parameters
- Can be accelerated with ONNX Runtime or Intel MKL (see the sketch below)
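A minimal ONNX Runtime CPU-inference sketch is shown below. It assumes a model already exported to ONNX (see B.4.4); the file name and the `input_ids` tensor name are illustrative assumptions.

```python
import numpy as np
import onnxruntime as ort

# Load an exported model and pin execution to CPU.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

token_ids = np.array([[1, 5, 2, 7, 3]], dtype=np.int64)   # toy encoded sequence
outputs = session.run(None, {"input_ids": token_ids})      # None = return all outputs
print(outputs[0].shape)
```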
B.2 Cloud Platforms
Cloud computing provides on-demand access to GPU resources without capital expenditure.
B.2.1 Major Providers
| Provider | GPU Options | Strengths |
|---|---|---|
| AWS | A100, H100, Trainium | Broadest ecosystem, SageMaker |
| Google Cloud | A100, TPU v4/v5 | TPU access, Vertex AI |
| Azure | A100, H100 | Enterprise integration, Azure ML |
| Lambda Labs | A100, H100 | ML-focused, simpler pricing |
| CoreWeave | A100, H100 | GPU-specialized, Kubernetes-native |
B.2.2 Cost Considerations
GPU costs vary significantly:
| Resource | Approximate Cost (2024) |
|---|---|
| A100 40GB (on-demand) | $3–4/hour |
| A100 80GB (on-demand) | $4–5/hour |
| H100 (on-demand) | $5–8/hour |
| A100 (spot/preemptible) | $1–2/hour |
Spot instances offer 60–80% discounts but can be interrupted. Suitable for:

- Checkpointed training runs
- Batch inference jobs
- Non-time-critical workloads
Reserved instances provide discounts for committed usage but require upfront planning.
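A back-of-the-envelope cost comparison using the approximate 2024 rates above; the GPU count and run length are illustrative placeholders, not recommendations.

```python
def run_cost(gpu_hours: float, rate_per_hour: float) -> float:
    """Total cost of a run given cumulative GPU-hours and an hourly rate."""
    return gpu_hours * rate_per_hour

gpu_hours = 8 * 72                      # 8 GPUs for a 72-hour training run
print(run_cost(gpu_hours, 4.5))         # on-demand A100 80GB: ~$2,592
print(run_cost(gpu_hours, 1.5))         # spot A100 with checkpointing: ~$864
```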
B.2.3 Managed ML Platforms
| Platform | Features |
|---|---|
| AWS SageMaker | Training, hosting, MLOps pipelines |
| Google Vertex AI | Training, prediction, feature store |
| Azure ML | Training, deployment, monitoring |
| HuggingFace Inference Endpoints | One-click model deployment |
These platforms handle infrastructure but add cost overhead. They are valuable for production deployments requiring reliability, monitoring, and scaling.
B.3 Model Deployment
Deploying a model for real-world use requires careful consideration of latency, throughput, reliability, and cost.
B.3.1 Inference Servers
Specialized inference servers optimize model serving:
| Server | Features |
|---|---|
| NVIDIA Triton | Multi-framework, dynamic batching, model ensemble |
| vLLM | Optimized for LLM inference, PagedAttention |
| TGI (Text Generation Inference) | HuggingFace’s optimized inference server |
| TorchServe | PyTorch-native, simple deployment |
These servers provide:

- Dynamic batching: Combine requests for efficient GPU utilization
- Model warmup: Pre-load models to reduce cold start
- Health checks: Monitor model availability
- Metrics: Track latency, throughput, errors
B.3.2 API Design
A typical genomic model API accepts sequences and returns predictions:
```
# Request
{
  "sequences": ["ATCGATCG...", "GCTAGCTA..."],
  "return_embeddings": false
}

# Response
{
  "predictions": [
    {"pathogenicity": 0.87, "confidence": 0.92},
    {"pathogenicity": 0.12, "confidence": 0.95}
  ],
  "model_version": "v2.1.0",
  "processing_time_ms": 145
}
```

Design considerations:

- Batch endpoints for throughput-critical applications
- Streaming for large outputs (embeddings, long sequences)
- Versioning to manage model updates
- Input validation to catch malformed sequences early (illustrated in the sketch below)
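A minimal sketch of such an endpoint using FastAPI and Pydantic; the route path, the `predict_batch` model call, and the validation regex are illustrative assumptions rather than a prescribed implementation.

```python
import re
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
VALID_SEQ = re.compile(r"^[ACGTN]+$")   # reject malformed sequences early

class PredictRequest(BaseModel):
    sequences: list[str]
    return_embeddings: bool = False

@app.post("/v2/predict")
def predict(req: PredictRequest):
    for seq in req.sequences:
        if not VALID_SEQ.fullmatch(seq):
            raise HTTPException(status_code=422, detail="Non-ACGTN characters in sequence")
    preds = predict_batch(req.sequences, req.return_embeddings)  # placeholder model call
    return {"predictions": preds, "model_version": "v2.1.0"}
```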
B.3.3 Containerization
Docker containers package models with dependencies:
```dockerfile
FROM pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY model/ /app/model/
COPY serve.py /app/
EXPOSE 8080
CMD ["python", "/app/serve.py"]
```

Containers provide:

- Reproducible environments
- Easy deployment across platforms
- Isolation from host system
- Simplified scaling with Kubernetes
B.3.4 Kubernetes Deployment
Kubernetes orchestrates containerized model deployments:
- Horizontal scaling: Add/remove replicas based on load
- GPU scheduling: Allocate GPUs to pods
- Rolling updates: Deploy new versions without downtime
- Resource limits: Prevent runaway memory/compute usage
Example deployment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: variant-predictor
spec:
  replicas: 3
  selector:
    matchLabels:
      app: variant-predictor
  template:
    metadata:
      labels:
        app: variant-predictor
    spec:
      containers:
      - name: model
        image: variant-predictor:v2.1
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
```

B.4 Inference Optimization
Optimizing inference reduces latency and cost.
B.4.1 Quantization
Quantization reduces numerical precision to decrease memory and computation:
| Precision | Bits | Memory | Speed | Quality |
|---|---|---|---|---|
| FP32 | 32 | 1× | 1× | Baseline |
| FP16/BF16 | 16 | 0.5× | ~2× | Minimal loss |
| INT8 | 8 | 0.25× | ~4× | Small loss |
| INT4 | 4 | 0.125× | ~8× | Moderate loss |
Post-training quantization converts trained models without retraining. Works well for INT8; INT4 may require calibration data or quality monitoring.
Quantization-aware training incorporates quantization during training for better INT4/INT8 quality.
For genomic models (sketched below):

- FP16/BF16 is standard and nearly lossless
- INT8 is often acceptable for classification tasks
- INT4 requires careful validation, especially for regression outputs
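The sketch below shows two common post-training options in PyTorch, assuming `model` is an already-loaded FP32 genomic model; it is illustrative, not a complete quantization pipeline.

```python
import copy
import torch

# FP16 for GPU inference: halves weight memory, usually near-lossless.
model_fp16 = copy.deepcopy(model).half().cuda()

# Post-training dynamic INT8 quantization of linear layers (CPU inference path);
# validate downstream task metrics before adopting.
model_int8 = torch.quantization.quantize_dynamic(
    copy.deepcopy(model), {torch.nn.Linear}, dtype=torch.qint8
)
```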
B.4.2 Model Pruning
Pruning removes unimportant weights:
- Magnitude pruning: Remove weights below threshold
- Structured pruning: Remove entire neurons/attention heads
- Movement pruning: Remove weights based on training dynamics
Pruning can achieve 50–90% sparsity with minimal accuracy loss on some tasks, but requires model-specific tuning.
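As a concrete example, magnitude pruning can be applied with PyTorch's built-in pruning utilities; the sketch below assumes `model` is an already-loaded PyTorch module and uses an arbitrary 50% sparsity level.

```python
import torch
import torch.nn.utils.prune as prune

# Zero out the 50% smallest-magnitude weights in every Linear layer.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")   # make pruning permanent (weights stay zeroed)
```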
B.4.3 Knowledge Distillation
Distillation trains a smaller “student” model to mimic a larger “teacher”:
- Run teacher model on large unlabeled corpus
- Train student to match teacher outputs (soft labels)
- Student learns compressed version of teacher’s knowledge
Effective for:

- Deploying on resource-constrained devices
- Reducing inference cost for high-volume applications
- Creating task-specific lightweight models
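A common soft-label distillation loss is sketched below; this is the generic temperature-scaled KL formulation, not tied to any particular genomic model.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Train the student to match the teacher's temperature-smoothed distribution."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # T^2 rescaling keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature**2
```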
B.4.4 ONNX and TensorRT
ONNX (Open Neural Network Exchange) provides a portable model format:
```python
import torch.onnx

# `model` and `sample_input` are assumed to be already defined;
# the model is traced on the sample input during export.
torch.onnx.export(model, sample_input, "model.onnx")
```

TensorRT optimizes ONNX models for NVIDIA GPUs:

- Layer fusion
- Kernel auto-tuning
- Precision calibration
TensorRT can provide 2–5× speedup over naive PyTorch inference.
B.4.5 Caching and Batching
KV-cache stores attention key/value pairs for autoregressive generation, avoiding recomputation.
Speculative decoding uses a small draft model to propose tokens, verified in parallel by the main model.
Dynamic batching groups incoming requests (see the sketch below):

- Increases GPU utilization
- Trades latency for throughput
- Configurable wait time and batch size limits
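Production servers such as Triton and vLLM implement dynamic batching internally; the toy asyncio sketch below only illustrates the wait-time/batch-size trade-off, with a placeholder `model_fn`.

```python
import asyncio

class ToyDynamicBatcher:
    """Illustrative only: collect requests until max_batch_size is reached
    or max_wait_s elapses, then run one batched model call."""

    def __init__(self, model_fn, max_batch_size=32, max_wait_s=0.01):
        self.model_fn = model_fn          # callable: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, item):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut                  # resolves once the batch is processed

    async def run(self):
        while True:
            batch = [await self.queue.get()]     # block until the first request arrives
            deadline = asyncio.get_running_loop().time() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - asyncio.get_running_loop().time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            results = self.model_fn([item for item, _ in batch])
            for (_, fut), result in zip(batch, results):
                fut.set_result(result)
```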
B.5 Benchmarking and Monitoring
B.5.1 Performance Metrics
| Metric | Description | Target |
|---|---|---|
| Latency (p50) | Median response time | Application-dependent |
| Latency (p99) | 99th percentile response time | Critical for SLAs |
| Throughput | Requests/second | Scale with load |
| GPU utilization | GPU compute usage | >80% for efficiency |
| Memory utilization | GPU memory usage | Monitor for OOM |
B.5.2 Monitoring Stack
```
Metrics → Prometheus → Grafana (visualization)
                     → AlertManager (alerts)
Logs    → Elasticsearch → Kibana (search/analysis)
Traces  → Jaeger/Zipkin (request tracing)
```
Key monitoring points (instrumentation sketched below):

- Request latency distribution
- Error rates by error type
- Queue depth (if using async processing)
- Model prediction distribution (detect drift)
- Input sequence characteristics (length, composition)
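A minimal instrumentation sketch using the Python `prometheus_client` library; the metric names, port, and `run_model` call are illustrative assumptions.

```python
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram("inference_latency_seconds", "Model inference latency")
REQUEST_ERRORS = Counter("inference_errors_total", "Inference errors", ["error_type"])

start_http_server(8000)   # expose /metrics for Prometheus to scrape

@REQUEST_LATENCY.time()   # records the duration of every call as a histogram sample
def handle_request(batch):
    try:
        return run_model(batch)            # placeholder model call
    except ValueError:
        REQUEST_ERRORS.labels(error_type="bad_input").inc()
        raise
```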
B.5.3 Load Testing
Before production deployment:
```bash
# Example with locust
locust -f load_test.py --host=http://model-api:8080 \
       --users=100 --spawn-rate=10 --run-time=10m
```

Test scenarios:

- Sustained load at expected traffic
- Burst traffic (10× normal)
- Long sequences (stress memory)
- Concurrent batch requests
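An illustrative sketch of what such a locustfile might contain; the endpoint path and payload mirror the example API sketched in B.3.2 and are assumptions, not a prescribed contract.

```python
# load_test.py -- illustrative locustfile
import random
from locust import HttpUser, task, between

BASES = "ACGT"

class ModelUser(HttpUser):
    wait_time = between(0.1, 1.0)   # pause between requests per simulated user

    @task
    def predict(self):
        # Random 512-bp sequences as a stand-in for realistic inputs.
        seqs = ["".join(random.choices(BASES, k=512)) for _ in range(4)]
        self.client.post("/v2/predict", json={"sequences": seqs})
```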
B.6 Cost Optimization
B.6.1 Right-Sizing
Match hardware to workload:

- Do not use an A100 for models that fit on an RTX 4090
- Use CPU inference for low-throughput applications
- Consider spot instances for batch processing
B.6.2 Autoscaling
Scale resources with demand:
```yaml
# Kubernetes HPA example
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: variant-predictor-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: variant-predictor
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

Scaling to zero during idle periods yields significant savings; note that a standard HPA cannot scale below one replica, so this typically requires an event-driven autoscaler (e.g., KEDA) or a serverless platform.
B.6.3 Batch Processing
For non-real-time workloads:

- Accumulate requests and process in batches
- Use spot instances with checkpointing
- Schedule during off-peak hours for lower costs
B.6.4 Model Selection
Choose appropriately sized models:

- 110M-parameter DNABERT vs. 2.5B-parameter Nucleotide Transformer
- Evaluate whether the larger model's accuracy gain justifies its cost
- Consider distilled/pruned versions for production
B.7 Security Considerations
B.7.1 Data Privacy
Genomic data is sensitive:

- Process in compliant environments (HIPAA, GDPR)
- Encrypt data at rest and in transit
- Implement access controls and audit logging
- Consider on-premises deployment for sensitive data
B.7.2 Model Security
- Input validation: Reject malformed sequences
- Rate limiting: Prevent abuse
- Authentication: Require API keys/tokens
- Model versioning: Track deployed versions for reproducibility
B.7.3 Federated Learning
For multi-institution collaboration:

- Train on distributed data without centralization
- Share only model updates, not raw data
- Enables learning from diverse populations
- See Section 10.6.3 for details
B.8 Reference Architecture
A production genomic model deployment might include:
```
                                   ┌─────────────┐
                                   │    Model    │
                                   │  Registry   │
                                   └──────┬──────┘
                                          │
┌──────────┐    ┌──────────┐    ┌─────────▼─────────┐    ┌──────────┐
│  Client  │───►│   API    │───►│ Inference Server  │───►│  Cache   │
│   App    │    │ Gateway  │    │  (Triton/vLLM)    │    │ (Redis)  │
└──────────┘    └────┬─────┘    └─────────┬─────────┘    └──────────┘
                     │                    │
                     ▼                    ▼
              ┌────────────┐        ┌──────────┐
              │  Metrics   │        │   GPU    │
              │(Prometheus)│        │ Cluster  │
              └────────────┘        └──────────┘
```
Components:

- API Gateway: Authentication, rate limiting, routing
- Inference Server: Model hosting, batching, optimization
- GPU Cluster: Kubernetes-managed GPU nodes
- Cache: Store frequent predictions
- Model Registry: Version and track deployed models
- Metrics: Monitor performance and health
B.9 Checklist for Production Deployment
Before deploying a genomic model to production, verify each of the following areas:

- Model Validation: accuracy confirmed on representative held-out data; any quantized, pruned, or distilled variant compared against the full-precision baseline (Section B.4)
- Infrastructure: hardware right-sized for the workload, container images built and tested, autoscaling and update strategy configured (Sections B.3, B.6)
- Monitoring: latency, throughput, error-rate, and drift metrics wired into the monitoring stack; load tests passed at expected and burst traffic (Section B.5)
- Security: input validation, authentication, rate limiting, encryption, and compliance requirements addressed (Section B.7)
- Documentation: deployed model version, API contract, and operational procedures recorded