Appendix B — Deployment and Compute
This appendix covers practical considerations for deploying genomic foundation models, from hardware requirements and cloud platforms to inference optimization and production deployment. The goal is to help practitioners translate model capabilities into working systems that can process real genomic data at scale.
B.1 Hardware Landscape
Genomic foundation models span a wide range of computational requirements. Understanding hardware options helps practitioners match resources to their specific needs.
B.1.1 GPU Computing
Graphics Processing Units (GPUs) are the workhorses of deep learning, providing thousands of parallel cores optimized for matrix operations. Key specifications:
| Metric | Description | Relevance |
|---|---|---|
| VRAM | GPU memory | Determines maximum model/batch size |
| Compute (TFLOPS) | Floating-point operations per second | Determines training/inference speed |
| Memory bandwidth | Data transfer rate | Critical for transformer attention |
| Tensor cores | Specialized matrix units | Accelerate mixed-precision operations |
B.1.2 Consumer vs. Data Center GPUs
| GPU Class | Examples | VRAM | Use Case |
|---|---|---|---|
| Consumer | RTX 4090 | 24 GB | Small model inference, development |
| Workstation | RTX A6000 | 48 GB | Medium model training/inference |
| Data center | A100 | 40/80 GB | Large model training |
| Latest generation | H100 | 80 GB | Foundation model training |
Memory is typically the bottleneck. A 3-billion parameter model in FP16 requires approximately 6 GB just for weights, plus additional memory for activations, gradients (if training), and KV cache (for transformers). The A100 80GB enables training models that would require multi-GPU setups on smaller cards.
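As a quick sanity check, the weight footprint can be estimated directly from parameter count and precision. The helper below is a back-of-the-envelope sketch (the function name and the example numbers are illustrative, not part of any library):

```python
def estimate_weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Weight-only memory estimate: FP32 = 4 bytes, FP16/BF16 = 2, INT8 = 1."""
    return n_params * bytes_per_param / 1e9

# 3B-parameter model in FP16: ~6 GB of weights before activations,
# KV cache, or (for training) gradients and optimizer states.
print(estimate_weight_memory_gb(3e9, bytes_per_param=2))  # -> 6.0
```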
B.1.3 TPUs
Tensor Processing Units (TPUs) are Google’s custom accelerators, available through Google Cloud. They offer:
- High memory bandwidth optimized for matrix operations
- Efficient multi-device scaling through dedicated interconnects
- Cost-effective for large-scale training
Many DeepMind models (AlphaFold, Enformer) were trained on TPUs. The JAX framework provides the best TPU support.
B.1.4 Multi-GPU and Distributed Training
Large models require multiple GPUs:
Data parallelism replicates the model across GPUs, each processing different batches. Gradients are synchronized after each step. Scales batch size but not model size.
Model parallelism splits the model across GPUs:

- Tensor parallelism: Splits individual layers across GPUs
- Pipeline parallelism: Assigns different layers to different GPUs
Fully Sharded Data Parallel (FSDP) and DeepSpeed ZeRO combine approaches, sharding model states across GPUs to train models larger than any single GPU’s memory.
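A minimal FSDP training loop in PyTorch might look like the sketch below. It assumes launch via `torchrun`, and `build_model` and `dataloader` are placeholders for your own model constructor and data pipeline; the loss computation is likewise schematic.

```python
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes launch via `torchrun --nproc_per_node=<gpus> train.py`, which sets
# the RANK / WORLD_SIZE / LOCAL_RANK environment variables.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = build_model().cuda()   # placeholder: your genomic model constructor
model = FSDP(model)            # shards parameters and gradients across ranks

# Create the optimizer after wrapping so its state is sharded as well.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
for batch in dataloader:       # placeholder dataloader yielding token tensors
    loss = model(batch).mean() # placeholder loss; real models return richer outputs
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```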
B.1.5 CPU Inference
For smaller models or low-throughput applications, CPU inference may suffice:
- Avoids GPU procurement and maintenance
- Enables deployment on standard servers
- Suitable for models with <1B parameters
- Can be accelerated with ONNX Runtime or Intel MKL (see the sketch below)
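A minimal ONNX Runtime CPU-inference sketch is shown below. It assumes a model already exported to ONNX (see B.4.4); the file name and the `input_ids` tensor name are illustrative assumptions.

```python
import numpy as np
import onnxruntime as ort

# Load an exported model and pin execution to CPU.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

token_ids = np.array([[1, 5, 2, 7, 3]], dtype=np.int64)   # toy encoded sequence
outputs = session.run(None, {"input_ids": token_ids})      # None = return all outputs
print(outputs[0].shape)
```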
B.2 Cloud Platforms
Cloud computing provides on-demand access to GPU resources without capital expenditure.
B.2.1 Major Providers
| Provider | GPU Options | Strengths |
|---|---|---|
| AWS | A100, H100, Trainium | Broadest ecosystem, SageMaker |
| Google Cloud | A100, TPU v4/v5 | TPU access, Vertex AI |
| Azure | A100, H100 | Enterprise integration, Azure ML |
| Lambda Labs | A100, H100 | ML-focused, simpler pricing |
| CoreWeave | A100, H100 | GPU-specialized, Kubernetes-native |
B.2.2 Cost Considerations
GPU costs vary significantly:
| Resource | Approximate Cost (2024) |
|---|---|
| A100 40GB (on-demand) | $3–4/hour |
| A100 80GB (on-demand) | $4–5/hour |
| H100 (on-demand) | $5–8/hour |
| A100 (spot/preemptible) | $1–2/hour |
Spot instances offer 60–80% discounts but can be interrupted. Suitable for:

- Checkpointed training runs
- Batch inference jobs
- Non-time-critical workloads
Reserved instances provide discounts for committed usage but require upfront planning.
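A back-of-the-envelope cost comparison using the approximate 2024 rates above; the GPU count and run length are illustrative placeholders, not recommendations.

```python
def run_cost(gpu_hours: float, rate_per_hour: float) -> float:
    """Total cost of a run given cumulative GPU-hours and an hourly rate."""
    return gpu_hours * rate_per_hour

gpu_hours = 8 * 72                      # 8 GPUs for a 72-hour training run
print(run_cost(gpu_hours, 4.5))         # on-demand A100 80GB: ~$2,592
print(run_cost(gpu_hours, 1.5))         # spot A100 with checkpointing: ~$864
```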
B.2.3 Managed ML Platforms
| Platform | Features |
|---|---|
| AWS SageMaker | Training, hosting, MLOps pipelines |
| Google Vertex AI | Training, prediction, feature store |
| Azure ML | Training, deployment, monitoring |
| HuggingFace Inference Endpoints | One-click model deployment |
These platforms handle infrastructure but add cost overhead. They are valuable for production deployments requiring reliability, monitoring, and scaling.
B.3 Model Deployment
Deploying a model for real-world use requires careful consideration of latency, throughput, reliability, and cost.
B.3.1 Inference Servers
Specialized inference servers optimize model serving:
| Server | Features |
|---|---|
| NVIDIA Triton | Multi-framework, dynamic batching, model ensemble |
| vLLM | Optimized for LLM inference, PagedAttention |
| TGI (Text Generation Inference) | HuggingFace’s optimized inference server |
| TorchServe | PyTorch-native, simple deployment |
These servers provide:

- Dynamic batching: Combine requests for efficient GPU utilization
- Model warmup: Pre-load models to reduce cold start
- Health checks: Monitor model availability
- Metrics: Track latency, throughput, errors
B.3.2 API Design
A typical genomic model API accepts sequences and returns predictions:
```
# Request
{
  "sequences": ["ATCGATCG...", "GCTAGCTA..."],
  "return_embeddings": false
}

# Response
{
  "predictions": [
    {"pathogenicity": 0.87, "confidence": 0.92},
    {"pathogenicity": 0.12, "confidence": 0.95}
  ],
  "model_version": "v2.1.0",
  "processing_time_ms": 145
}
```

Design considerations:

- Batch endpoints for throughput-critical applications
- Streaming for large outputs (embeddings, long sequences)
- Versioning to manage model updates
- Input validation to catch malformed sequences early (illustrated in the sketch below)
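A minimal sketch of such an endpoint using FastAPI and Pydantic; the route path, the `predict_batch` model call, and the validation regex are illustrative assumptions rather than a prescribed implementation.

```python
import re
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
VALID_SEQ = re.compile(r"^[ACGTN]+$")   # reject malformed sequences early

class PredictRequest(BaseModel):
    sequences: list[str]
    return_embeddings: bool = False

@app.post("/v2/predict")
def predict(req: PredictRequest):
    for seq in req.sequences:
        if not VALID_SEQ.fullmatch(seq):
            raise HTTPException(status_code=422, detail="Non-ACGTN characters in sequence")
    preds = predict_batch(req.sequences, req.return_embeddings)  # placeholder model call
    return {"predictions": preds, "model_version": "v2.1.0"}
```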
B.3.3 Containerization
Docker containers package models with dependencies:
```dockerfile
FROM pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY model/ /app/model/
COPY serve.py /app/
EXPOSE 8080
CMD ["python", "/app/serve.py"]
```

Containers provide:

- Reproducible environments
- Easy deployment across platforms
- Isolation from host system
- Simplified scaling with Kubernetes
B.3.4 Kubernetes Deployment
Kubernetes orchestrates containerized model deployments:
- Horizontal scaling: Add/remove replicas based on load
- GPU scheduling: Allocate GPUs to pods
- Rolling updates: Deploy new versions without downtime
- Resource limits: Prevent runaway memory/compute usage
Example deployment:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: variant-predictor
spec:
  replicas: 3
  selector:
    matchLabels:
      app: variant-predictor
  template:
    metadata:
      labels:
        app: variant-predictor
    spec:
      containers:
      - name: model
        image: variant-predictor:v2.1
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
```

B.4 Inference Optimization
Optimizing inference reduces latency and cost.
B.4.1 Quantization
Quantization reduces numerical precision to decrease memory and computation:
| Precision | Bits | Memory | Speed | Quality |
|---|---|---|---|---|
| FP32 | 32 | 1× | 1× | Baseline |
| FP16/BF16 | 16 | 0.5× | ~2× | Minimal loss |
| INT8 | 8 | 0.25× | ~4× | Small loss |
| INT4 | 4 | 0.125× | ~8× | Moderate loss |
Post-training quantization converts trained models without retraining. Works well for INT8; INT4 may require calibration data or quality monitoring.
Quantization-aware training incorporates quantization during training for better INT4/INT8 quality.
For genomic models (sketched below):

- FP16/BF16 is standard and nearly lossless
- INT8 is often acceptable for classification tasks
- INT4 requires careful validation, especially for regression outputs
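The sketch below shows two common post-training options in PyTorch, assuming `model` is an already-loaded FP32 genomic model; it is illustrative, not a complete quantization pipeline.

```python
import copy
import torch

# FP16 for GPU inference: halves weight memory, usually near-lossless.
model_fp16 = copy.deepcopy(model).half().cuda()

# Post-training dynamic INT8 quantization of linear layers (CPU inference path);
# validate downstream task metrics before adopting.
model_int8 = torch.quantization.quantize_dynamic(
    copy.deepcopy(model), {torch.nn.Linear}, dtype=torch.qint8
)
```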
B.4.2 Model Pruning
Pruning removes unimportant weights:
- Magnitude pruning: Remove weights below threshold
- Structured pruning: Remove entire neurons/attention heads
- Movement pruning: Remove weights based on training dynamics
Pruning can achieve 50–90% sparsity with minimal accuracy loss on some tasks, but requires model-specific tuning.
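As a concrete example, magnitude pruning can be applied with PyTorch's built-in pruning utilities; the sketch below assumes `model` is an already-loaded PyTorch module and uses an arbitrary 50% sparsity level.

```python
import torch
import torch.nn.utils.prune as prune

# Zero out the 50% smallest-magnitude weights in every Linear layer.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")   # make pruning permanent (weights stay zeroed)
```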
B.4.3 Knowledge Distillation
Distillation trains a smaller “student” model to mimic a larger “teacher”:
- Run teacher model on large unlabeled corpus
- Train student to match teacher outputs (soft labels)
- Student learns compressed version of teacher’s knowledge
Effective for:

- Deploying on resource-constrained devices
- Reducing inference cost for high-volume applications
- Creating task-specific lightweight models
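A common soft-label distillation loss is sketched below; this is the generic temperature-scaled KL formulation, not tied to any particular genomic model.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Train the student to match the teacher's temperature-smoothed distribution."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # T^2 rescaling keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature**2
```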
B.4.4 ONNX and TensorRT
ONNX (Open Neural Network Exchange) provides a portable model format:
```python
import torch.onnx

# `model` and `sample_input` are assumed to be already defined;
# the model is traced on the sample input during export.
torch.onnx.export(model, sample_input, "model.onnx")
```

TensorRT optimizes ONNX models for NVIDIA GPUs:

- Layer fusion
- Kernel auto-tuning
- Precision calibration
TensorRT can provide 2–5× speedup over naive PyTorch inference.
B.4.5 Caching and Batching
KV-cache stores attention key/value pairs for autoregressive generation, avoiding recomputation.
Speculative decoding uses a small draft model to propose tokens, verified in parallel by the main model.
Dynamic batching groups incoming requests (see the sketch below):

- Increases GPU utilization
- Trades latency for throughput
- Configurable wait time and batch size limits
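Production servers such as Triton and vLLM implement dynamic batching internally; the toy asyncio sketch below only illustrates the wait-time/batch-size trade-off, with a placeholder `model_fn`.

```python
import asyncio

class ToyDynamicBatcher:
    """Illustrative only: collect requests until max_batch_size is reached
    or max_wait_s elapses, then run one batched model call."""

    def __init__(self, model_fn, max_batch_size=32, max_wait_s=0.01):
        self.model_fn = model_fn          # callable: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, item):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut                  # resolves once the batch is processed

    async def run(self):
        while True:
            batch = [await self.queue.get()]     # block until the first request arrives
            deadline = asyncio.get_running_loop().time() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - asyncio.get_running_loop().time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            results = self.model_fn([item for item, _ in batch])
            for (_, fut), result in zip(batch, results):
                fut.set_result(result)
```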
B.5 Benchmarking and Monitoring
B.5.1 Performance Metrics
| Metric | Description | Target |
|---|---|---|
| Latency (p50) | Median response time | Application-dependent |
| Latency (p99) | 99th percentile response time | Critical for SLAs |
| Throughput | Requests/second | Scale with load |
| GPU utilization | GPU compute usage | >80% for efficiency |
| Memory utilization | GPU memory usage | Monitor for OOM |
B.5.2 Monitoring Stack
```
Metrics → Prometheus → Grafana (visualization)
                     → AlertManager (alerts)
Logs    → Elasticsearch → Kibana (search/analysis)
Traces  → Jaeger/Zipkin (request tracing)
```
Key monitoring points (instrumentation sketched below):

- Request latency distribution
- Error rates by error type
- Queue depth (if using async processing)
- Model prediction distribution (detect drift)
- Input sequence characteristics (length, composition)
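A minimal instrumentation sketch using the Python `prometheus_client` library; the metric names, port, and `run_model` call are illustrative assumptions.

```python
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram("inference_latency_seconds", "Model inference latency")
REQUEST_ERRORS = Counter("inference_errors_total", "Inference errors", ["error_type"])

start_http_server(8000)   # expose /metrics for Prometheus to scrape

@REQUEST_LATENCY.time()   # records the duration of every call as a histogram sample
def handle_request(batch):
    try:
        return run_model(batch)            # placeholder model call
    except ValueError:
        REQUEST_ERRORS.labels(error_type="bad_input").inc()
        raise
```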
B.5.3 Load Testing
Before production deployment:
```bash
# Example with locust
locust -f load_test.py --host=http://model-api:8080 \
       --users=100 --spawn-rate=10 --run-time=10m
```

Test scenarios:

- Sustained load at expected traffic
- Burst traffic (10× normal)
- Long sequences (stress memory)
- Concurrent batch requests
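An illustrative sketch of what such a locustfile might contain; the endpoint path and payload mirror the example API sketched in B.3.2 and are assumptions, not a prescribed contract.

```python
# load_test.py -- illustrative locustfile
import random
from locust import HttpUser, task, between

BASES = "ACGT"

class ModelUser(HttpUser):
    wait_time = between(0.1, 1.0)   # pause between requests per simulated user

    @task
    def predict(self):
        # Random 512-bp sequences as a stand-in for realistic inputs.
        seqs = ["".join(random.choices(BASES, k=512)) for _ in range(4)]
        self.client.post("/v2/predict", json={"sequences": seqs})
```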
B.6 Cost Optimization
B.6.1 Right-Sizing
Match hardware to workload:

- Do not use an A100 for models that fit on an RTX 4090
- Use CPU inference for low-throughput applications
- Consider spot instances for batch processing
B.6.2 Autoscaling
Scale resources with demand:
```yaml
# Kubernetes HPA example
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: variant-predictor-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: variant-predictor
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

Scaling to zero during idle periods yields significant savings; note that a standard HPA cannot scale below one replica, so this typically requires an event-driven autoscaler (e.g., KEDA) or a serverless platform.
B.6.3 Batch Processing
For non-real-time workloads:

- Accumulate requests and process in batches
- Use spot instances with checkpointing
- Schedule during off-peak hours for lower costs
B.6.4 Model Selection
Choose appropriately sized models:

- 110M-parameter DNABERT vs. 2.5B-parameter Nucleotide Transformer
- Evaluate whether the larger model's accuracy gain justifies its cost
- Consider distilled/pruned versions for production
B.7 Security Considerations
B.7.1 Data Privacy
Genomic data is sensitive:

- Process in compliant environments (HIPAA, GDPR)
- Encrypt data at rest and in transit
- Implement access controls and audit logging
- Consider on-premises deployment for sensitive data
B.7.2 Model Security
- Input validation: Reject malformed sequences
- Rate limiting: Prevent abuse
- Authentication: Require API keys/tokens
- Model versioning: Track deployed versions for reproducibility
B.7.3 Federated Learning
For multi-institution collaboration:

- Train on distributed data without centralization
- Share only model updates, not raw data
- Enables learning from diverse populations
- See Section 10.6.3 for details
B.8 Reference Architecture
A production genomic model deployment might include:
```
                                   ┌─────────────┐
                                   │    Model    │
                                   │  Registry   │
                                   └──────┬──────┘
                                          │
┌──────────┐    ┌──────────┐    ┌─────────▼─────────┐    ┌──────────┐
│  Client  │───►│   API    │───►│ Inference Server  │───►│  Cache   │
│   App    │    │ Gateway  │    │  (Triton/vLLM)    │    │ (Redis)  │
└──────────┘    └────┬─────┘    └─────────┬─────────┘    └──────────┘
                     │                    │
                     ▼                    ▼
              ┌────────────┐        ┌──────────┐
              │  Metrics   │        │   GPU    │
              │(Prometheus)│        │ Cluster  │
              └────────────┘        └──────────┘
```
Components:

- API Gateway: Authentication, rate limiting, routing
- Inference Server: Model hosting, batching, optimization
- GPU Cluster: Kubernetes-managed GPU nodes
- Cache: Store frequent predictions
- Model Registry: Version and track deployed models
- Metrics: Monitor performance and health
B.9 Checklist for Production Deployment
Before deploying a genomic model to production, verify each of the following areas:

- Model Validation: accuracy confirmed on representative held-out data; any quantized, pruned, or distilled variant compared against the full-precision baseline (Section B.4)
- Infrastructure: hardware right-sized for the workload, container images built and tested, autoscaling and update strategy configured (Sections B.3, B.6)
- Monitoring: latency, throughput, error-rate, and drift metrics wired into the monitoring stack; load tests passed at expected and burst traffic (Section B.5)
- Security: input validation, authentication, rate limiting, encryption, and compliance requirements addressed (Section B.7)
- Documentation: deployed model version, API contract, and operational procedures recorded