References
Adzhubei, Ivan A., Steffen Schmidt, Leonid Peshkin, Vasily E. Ramensky,
Anna Gerasimova, Peer Bork, Alexey S. Kondrashov, and Shamil R. Sunyaev.
2010. “A Method and Server for Predicting Damaging Missense
Mutations.” Nature Methods 7 (4): 248–49. https://doi.org/10.1038/nmeth0410-248.
All of Us Research Program Investigators. 2019. “The
‘All of Us’ Research
Program.” New England Journal of Medicine
381 (7): 668–76. https://doi.org/10.1056/NEJMsr1809937.
Auton, Adam, Gonçalo R. Abecasis, David M. Altshuler, Richard M. Durbin,
David R. Bentley, Aravinda Chakravarti, et al.
2015. “A Global Reference for Human Genetic Variation.”
Nature 526 (7571): 68–74. https://doi.org/10.1038/nature15393.
Avsec, Žiga, Vikram Agarwal, Daniel Visentin, Joseph R. Ledsam, Agnieszka
Grabska-Barwinska, Kyle R. Taylor, Yannis Assael, John Jumper, Pushmeet
Kohli, and David R. Kelley. 2021. “[Enformer]
Effective Gene Expression Prediction from Sequence by
Integrating Long-Range Interactions.” Nature Methods 18
(October): 1196–1203. https://doi.org/10.1038/s41592-021-01252-x.
Avsec, Žiga, Natasha Latysheva, and Jun Cheng. 2025.
“AlphaGenome: AI for Better
Understanding the Genome.” Google DeepMind. https://deepmind.google/discover/blog/alphagenome-ai-for-better-understanding-the-genome/.
Benegas, Gonzalo, Carlos Albors, Alan J. Aw, Chengzhong Ye, and Yun S.
Song. 2024. “GPN-MSA: An Alignment-Based
DNA Language Model for Genome-Wide Variant Effect
Prediction.” bioRxiv, April, 2023.10.10.561776. https://doi.org/10.1101/2023.10.10.561776.
Benegas, Gonzalo, Sanjit Singh Batra, and Yun S. Song. 2023.
“[GPN] DNA Language Models Are Powerful
Predictors of Genome-Wide Variant Effects.” Proceedings of
the National Academy of Sciences 120 (44): e2311219120. https://doi.org/10.1073/pnas.2311219120.
Benegas, Gonzalo, Gökcen Eraslan, and Yun S. Song. 2025.
“[TraitGym] Benchmarking
DNA Sequence Models for
Causal Regulatory Variant
Prediction in Human
Genetics.” bioRxiv. https://doi.org/10.1101/2025.02.11.637758.
Benegas, Gonzalo, Chengzhong Ye, Carlos Albors, Jianan Canal Li, and Yun
S. Song. 2024. “Genomic Language Models:
Opportunities and Challenges.” arXiv.
https://doi.org/10.48550/arXiv.2407.11435.
Bommasani, Rishi, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran
Arora, Sydney von Arx, Michael S. Bernstein, et al. 2022. “On the
Opportunities and Risks of
Foundation Models.” arXiv. https://doi.org/10.48550/arXiv.2108.07258.
Brandes, Nadav, Grant Goldman, Charlotte H. Wang, Chun Jimmie Ye, and
Vasilis Ntranos. 2023. “Genome-Wide Prediction of Disease Variant
Effects with a Deep Protein Language Model.” Nature
Genetics 55 (9): 1512–22. https://doi.org/10.1038/s41588-023-01465-0.
Brixi, Garyk, Matthew G. Durrant, Jerome Ku, Michael Poli, Greg
Brockman, Daniel Chang, Gabriel A. Gonzalez, et al. 2025.
“[Evo 2] Genome Modeling and Design
Across All Domains of Life with Evo 2.” bioRxiv. https://doi.org/10.1101/2025.02.18.638918.
Browning, Brian L., Xiaowen Tian, Ying Zhou, and Sharon R. Browning.
2021. “Fast Two-Stage Phasing of Large-Scale Sequence
Data.” American Journal of Human Genetics 108 (10):
1880–90. https://doi.org/10.1016/j.ajhg.2021.08.005.
Bycroft, Clare, Colin Freeman, Desislava Petkova, Gavin Band, Lloyd T.
Elliott, Kevin Sharp, Allan Motyer, et al. 2018. “The
UK Biobank Resource with Deep Phenotyping and
Genomic Data.” Nature 562 (7726): 203–9. https://doi.org/10.1038/s41586-018-0579-z.
Camillo, Lucas Paulo de Lima, Raghav Sehgal, Jenel Armstrong, Albert T.
Higgins-Chen, Steve Horvath, and Bo Wang. 2024.
“CpGPT: A Foundation Model
for DNA Methylation.” bioRxiv. https://doi.org/10.1101/2024.10.24.619766.
Cao, Zhi-Jie, and Ge Gao. 2022. “[GLUE]
Multi-Omics Single-Cell Data Integration and Regulatory
Inference with Graph-Linked Embedding.” Nature
Biotechnology 40 (10): 1458–66. https://doi.org/10.1038/s41587-022-01284-4.
Chandak, Payal, Kexin Huang, and Marinka Zitnik. 2023.
“[PrimeKG] Building a Knowledge Graph to
Enable Precision Medicine.” Scientific Data 10 (1): 67.
https://doi.org/10.1038/s41597-023-01960-3.
Chen, Kathleen M., Aaron K. Wong, Olga G. Troyanskaya, and Jian Zhou.
2022. “[DeepSEA Sei] A
Sequence-Based Global Map of Regulatory Activity for Deciphering Human
Genetics.” Nature Genetics 54 (7): 940–49. https://doi.org/10.1038/s41588-022-01102-2.
Cheng, Jun, Guido Novati, Joshua Pan, Clare Bycroft, Akvilė Žemgulytė,
Taylor Applebaum, Alexander Pritzel, et al. 2023.
“[AlphaMissense] Accurate Proteome-Wide
Missense Variant Effect Prediction with
AlphaMissense.” Science 381 (6664):
eadg7492. https://doi.org/10.1126/science.adg7492.
Choi, Shing Wan, Timothy Shin-Heng Mak, and Paul F. O’Reilly. 2020.
“[PRS] Tutorial: A Guide to Performing
Polygenic Risk Score Analyses.” Nature Protocols 15 (9):
2759–72. https://doi.org/10.1038/s41596-020-0353-1.
Chung, Wen-Hung, Shuen-Iu Hung, Hong-Shang Hong, Mo-Song Hsih, Li-Cheng
Yang, Hsin-Chun Ho, Jer-Yuarn Wu, and Yuan-Tsong Chen. 2004. “A
Marker for Stevens–Johnson Syndrome.”
Nature 428 (6982): 486. https://doi.org/10.1038/428486a.
Clarke, Brian, Eva Holtkamp, Hakime Öztürk, Marcel Mück, Magnus
Wahlberg, Kayla Meyer, Felix Munzlinger, et al. 2024.
“[DeepRVAT] Integration of Variant
Annotations Using Deep Set Networks Boosts Rare Variant Association
Testing.” Nature Genetics 56 (10): 2271–80. https://doi.org/10.1038/s41588-024-01919-z.
Dabernig-Heinz, Johanna, Mara Lohde, Martin Hölzer, Adriana Cabal, Rick
Conzemius, Christian Brandt, Matthias Kohl, et al. 2024. “A
Multicenter Study on Accuracy and Reproducibility of Nanopore
Sequencing-Based Genotyping of Bacterial Pathogens.” Journal
of Clinical Microbiology 62 (9): e00628-24. https://doi.org/10.1128/jcm.00628-24.
Dalla-Torre, Hugo, Liam Gonzalez, Javier Mendoza-Revilla, Nicolas Lopez
Carranza, Adam Henryk Grzywaczewski, Francesco Oteri, Christian Dallago,
et al. 2025. “Nucleotide Transformer: Building and
Evaluating Robust Foundation Models for Human Genomics.”
Nature Methods 22 (2): 287–97. https://doi.org/10.1038/s41592-024-02523-z.
Davydov, Eugene V., David L. Goode, Marina Sirota, Gregory M. Cooper,
Arend Sidow, and Serafim Batzoglou. 2010. “Identifying a
High Fraction of the Human
Genome to Be Under Selective
Constraint Using GERP++.”
PLOS Computational Biology 6 (12): e1001025. https://doi.org/10.1371/journal.pcbi.1001025.
DePristo, Mark A., Eric Banks, Ryan Poplin, Kiran V. Garimella, Jared R.
Maguire, Christopher Hartl, Anthony A. Philippakis, et al. 2011.
“A Framework for Variation Discovery and Genotyping Using
Next-Generation DNA Sequencing Data.” Nature
Genetics 43 (5): 491–98. https://doi.org/10.1038/ng.806.
Fishman, Veniamin, Yuri Kuratov, Aleksei Shmelev, Maxim Petrov, Dmitry
Penzar, Denis Shepelin, Nikolay Chekanov, Olga Kardymon, and Mikhail
Burtsev. 2025. “GENA-LM: A Family of
Open-Source Foundational DNA Language Models for Long
Sequences.” Nucleic Acids Research 53 (2): gkae1310. https://doi.org/10.1093/nar/gkae1310.
Frankish, Adam, Mark Diekhans, Anne-Maud Ferreira, Rory Johnson, Irwin
Jungreis, Jane Loveland, Jonathan M. Mudge, et al. 2019.
“GENCODE Reference Annotation for the Human and Mouse
Genomes.” Nucleic Acids Research 47 (D1): D766–73. https://doi.org/10.1093/nar/gky955.
Gao, Ziqi, Chenran Jiang, Jiawen Zhang, Xiaosen Jiang, Lanqing Li,
Peilin Zhao, Huanming Yang, Yong Huang, and Jia Li. 2023.
“[HIGH-PPI] Hierarchical
Graph Learning for Protein–Protein Interaction.” Nature
Communications 14 (1): 1093. https://doi.org/10.1038/s41467-023-36736-1.
Garrison, Erik, Jouni Sirén, Adam M. Novak, Glenn Hickey, Jordan M.
Eizenga, Eric T. Dawson, William Jones, et al. 2018. “Variation
Graph Toolkit Improves Read Mapping by Representing Genetic Variation in
the Reference.” Nature Biotechnology 36 (9): 875–79. https://doi.org/10.1038/nbt.4227.
Georgantas, Costa, Zoltán Kutalik, and Jonas Richiardi. 2024.
“Delphi: A Deep-Learning
Method for Polygenic Risk
Prediction.” medRxiv. https://doi.org/10.1101/2024.04.19.24306079.
Goodwin, Sara, John D. McPherson, and W. Richard McCombie. 2016.
“Coming of Age: Ten Years of Next-Generation Sequencing
Technologies.” Nature Reviews Genetics 17 (6): 333–51.
https://doi.org/10.1038/nrg.2016.49.
Guo, Fei, Renchu Guan, Yaohang Li, Qi Liu, Xiaowo Wang, Can Yang, and
Jianxin Wang. 2025. “Foundation Models in Bioinformatics.”
National Science Review 12 (4): nwaf028. https://doi.org/10.1093/nsr/nwaf028.
He, Shujun, Baizhen Gao, Rushant Sabnis, and Qing Sun. 2023.
“Nucleic Transformer: Classifying
DNA Sequences with
Self-Attention and
Convolutions.” ACS Synthetic Biology 12
(11): 3205–14. https://doi.org/10.1021/acssynbio.3c00154.
Hudaiberdiev, Sanjarbek, D. Leland Taylor, Wei Song, Narisu Narisu,
Redwan M. Bhuiyan, Henry J. Taylor, Xuming Tang, et al. 2023.
“[TREDNet] Modeling Islet Enhancers
Using Deep Learning Identifies Candidate Causal Variants at Loci
Associated with T2D and Glycemic Traits.”
Proceedings of the National Academy of Sciences 120 (35):
e2206612120. https://doi.org/10.1073/pnas.2206612120.
Jaganathan, Kishore, Sofia Kyriazopoulou Panagiotopoulou, Jeremy F.
McRae, Siavash Fazel Darbandi, David Knowles, Yang I. Li, Jack A.
Kosmicki, et al. 2019. “[SpliceAI]
Predicting Splicing from Primary
Sequence with Deep
Learning.” Cell 176 (3): 535–548.e24. https://doi.org/10.1016/j.cell.2018.12.015.
Ji, Yanrong, Zhihan Zhou, Han Liu, and Ramana V. Davuluri. 2021.
“DNABERT: Pre-Trained Bidirectional
Encoder Representations from
Transformers Model for DNA-Language in
Genome.” Bioinformatics 37 (15): 2112–20. https://doi.org/10.1093/bioinformatics/btab083.
Jiang, Tao, Yongzhuang Liu, Yue Jiang, Junyi Li, Yan Gao, Zhe Cui,
Yadong Liu, Bo Liu, and Yadong Wang. 2020. “Long-Read-Based Human
Genomic Structural Variation Detection with cuteSV.” Genome Biology 21 (1):
189. https://doi.org/10.1186/s13059-020-02107-y.
Jurenaite, Neringa, Daniel León-Periñán, Veronika Donath, Sunna Torge,
and René Jäkel. 2024. “SetQuence &
SetOmic: Deep Set Transformers for Whole
Genome and Exome Tumour Analysis.” BioSystems 235
(January): 105095. https://doi.org/10.1016/j.biosystems.2023.105095.
Kagda, Meenakshi S., Bonita Lam, Casey Litton, Corinn Small, Cricket A.
Sloan, Emma Spragins, Forrest Tanaka, et al. 2025. “Data
Navigation on the ENCODE Portal.” Nature
Communications 16 (1): 9592. https://doi.org/10.1038/s41467-025-64343-9.
Karczewski, Konrad J., Laurent C. Francioli, Grace Tiao, Beryl B.
Cummings, Jessica Alföldi, Qingbo Wang, Ryan L. Collins, et al. 2020.
“The Mutational Constraint Spectrum Quantified from Variation in
141,456 Humans.” Nature 581 (7809): 434–43. https://doi.org/10.1038/s41586-020-2308-7.
Krusche, Peter, Len Trigg, Paul C. Boutros, Christopher E. Mason,
Francisco M. De La Vega, Benjamin L. Moore, Mar Gonzalez-Porta, et al.
2019. “Best Practices for Benchmarking
Germline Small Variant
Calls in Human Genomes.”
Nature Biotechnology 37 (5): 555–60. https://doi.org/10.1038/s41587-019-0054-x.
Kundaje, Anshul, Wouter Meuleman, Jason Ernst, Misha Bilenky, Angela
Yen, Alireza Heravi-Moussavi, Pouya Kheradpour, et al. 2015.
“Integrative Analysis of 111 Reference Human Epigenomes.”
Nature 518 (7539): 317–30. https://doi.org/10.1038/nature14248.
Kurki, Mitja I., Juha Karjalainen, Priit Palta, Timo P. Sipilä, Kati
Kristiansson, Kati M. Donner, Mary P. Reeve, et al. 2023.
“FinnGen Provides Genetic Insights from a
Well-Phenotyped Isolated Population.” Nature 613 (7944):
508–18. https://doi.org/10.1038/s41586-022-05473-8.
Lambert, Samuel A., Laurent Gil, Simon Jupp, Scott C. Ritchie, Yu Xu,
Annalisa Buniello, Aoife McMahon, et al. 2021. “The
Polygenic Score Catalog as an
Open Database for Reproducibility and Systematic Evaluation.”
Nature Genetics 53 (4): 420–25. https://doi.org/10.1038/s41588-021-00783-5.
Lee, Ingoo, Zachary S. Wallace, Yuqi Wang, Sungjoon Park, Hojung Nam,
Amit R. Majithia, and Trey Ideker. 2025. “[G2PT]
A Genotype-Phenotype Transformer to Assess and Explain
Polygenic Risk.” bioRxiv. https://doi.org/10.1101/2024.10.23.619940.
Li, Hao, Zebei Han, Yu Sun, Fu Wang, Pengzhen Hu, Yuang Gao, Xuemei Bai,
et al. 2024. “CGMega: Explainable Graph Neural
Network Framework with Attention Mechanisms for Cancer Gene Module
Dissection.” Nature Communications 15 (1): 5997. https://doi.org/10.1038/s41467-024-50426-6.
Li, Heng. 2013. “Aligning Sequence Reads, Clone Sequences and
Assembly Contigs with BWA-MEM.” arXiv.
https://doi.org/10.48550/arXiv.1303.3997.
———. 2014. “Towards Better Understanding
of Artifacts in Variant Calling
from High-Coverage
Samples.” Bioinformatics 30 (20): 2843–51.
https://doi.org/10.1093/bioinformatics/btu356.
———. 2018. “Minimap2: Pairwise Alignment for Nucleotide
Sequences.” Bioinformatics 34 (18): 3094–3100. https://doi.org/10.1093/bioinformatics/bty191.
Li, Xiao, Jie Ma, Ling Leng, Mingfei Han, Mansheng Li, Fuchu He, and
Yunping Zhu. 2022. “MoGCN: A
Multi-Omics Integration
Method Based on Graph
Convolutional Network for Cancer
Subtype Analysis.” Frontiers in
Genetics 13 (February). https://doi.org/10.3389/fgene.2022.806842.
Li, Zehui, Akashaditya Das, William A. V. Beardall, Yiren Zhao, and
Guy-Bart Stan. 2023. “Genomic Interpreter:
A Hierarchical Genomic
Deep Neural Network with
1D Shifted Window
Transformer.” arXiv. https://doi.org/10.48550/arXiv.2306.05143.
Liao, Wen-Wei, Mobin Asri, Jana Ebler, Daniel Doerr, Marina Haukness,
Glenn Hickey, Shuangjia Lu, et al. 2023. “A Draft Human Pangenome
Reference.” Nature 617 (7960): 312–24. https://doi.org/10.1038/s41586-023-05896-x.
Lin, Zeming, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting
Lu, Allan dos Santos Costa, et al. 2022. “[ESM-2]
Language Models of Protein Sequences at the Scale of
Evolution Enable Accurate Structure Prediction.” bioRxiv. https://doi.org/10.1101/2022.07.20.500902.
Linder, Johannes, Divyanshi Srivastava, Han Yuan, Vikram Agarwal, and
David R. Kelley. 2025. “[Borzoi]
Predicting RNA-Seq Coverage from
DNA Sequence as a Unifying Model of Gene
Regulation.” Nature Genetics 57 (4): 949–61. https://doi.org/10.1038/s41588-024-02053-6.
Liu, Zicheng, Siyuan Li, Zhiyuan Chen, Fang Wu, Chang Yu, Qirong Yang,
Yucheng Guo, Yujie Yang, Xiaoming Zhang, and Stan Z. Li. 2025.
“Life-Code: Central Dogma
Modeling with Multi-Omics
Sequence Unification.” arXiv. https://doi.org/10.48550/arXiv.2502.07299.
Loh, Po-Ru, Petr Danecek, Pier Francesco Palamara, Christian
Fuchsberger, Yakir A. Reshef, Hilary K. Finucane, Sebastian Schoenherr, et
al. 2016. “Reference-Based Phasing Using the
Haplotype Reference Consortium
Panel.” Nature Genetics 48 (11): 1443–48. https://doi.org/10.1038/ng.3679.
Ma, Jiani, Jiangning Song, Neil D. Young, Bill C. H. Chang, Pasi K.
Korhonen, Tulio L. Campos, Hui Liu, and Robin B. Gasser. 2023.
“‘Bingo’-a Large Language Model- and Graph Neural
Network-Based Workflow for the Prediction of Essential Genes from
Protein Data.” Briefings in Bioinformatics 25 (1):
bbad472. https://doi.org/10.1093/bib/bbad472.
Mallal, Simon, Elizabeth Phillips, Giampiero Carosi, Jean-Michel Molina,
Cassy Workman, Janez Tomažič, Eva Jägel-Guedes, et al. 2008.
“HLA-B*5701 Screening for
Hypersensitivity to Abacavir.” New
England Journal of Medicine 358 (6): 568–79. https://doi.org/10.1056/NEJMoa0706135.
Manzo, Gaetano, Kathryn Borkowski, and Ivan Ovcharenko. 2025.
“Comparative Analysis of Deep
Learning Models for Predicting
Causative Regulatory
Variants.” bioRxiv, June, 2025.05.19.654920. https://doi.org/10.1101/2025.05.19.654920.
Marees, Andries T., Hilde de Kluiver, Sven Stringer, Florence Vorspan,
Emmanuel Curis, Cynthia Marie-Claire, and Eske M. Derks. 2018.
“[GWAS] A Tutorial on Conducting
Genome-Wide Association Studies: Quality Control and
Statistical Analysis.” International Journal of Methods in
Psychiatric Research 27 (2): e1608. https://doi.org/10.1002/mpr.1608.
Marquet, Céline, Julius Schlensok, Marina Abakarova, Burkhard Rost, and
Elodie Laine. 2024. “[VespaG]
Expert-Guided Protein Language Models Enable Accurate and
Blazingly Fast Fitness Prediction.” Bioinformatics 40
(11): btae621. https://doi.org/10.1093/bioinformatics/btae621.
McKenna, Aaron, Matthew Hanna, Eric Banks, Andrey Sivachenko, Kristian
Cibulskis, Andrew Kernytsky, Kiran Garimella, et al. 2010. “The
Genome Analysis Toolkit:
A MapReduce Framework for Analyzing
Next-Generation DNA Sequencing Data.” Genome
Research 20 (9): 1297–1303. https://doi.org/10.1101/gr.107524.110.
Medvedev, Aleksandr, Karthik Viswanathan, Praveenkumar Kanithi, Kirill
Vishniakov, Prateek Munjal, Clément Christophe, Marco A. F. Pimentel,
Ronnie Rajan, and Shadab Khan. 2025. “BioToken and
BioFM – Biologically-Informed
Tokenization Enables Accurate and
Efficient Genomic Foundation
Models.” bioRxiv. https://doi.org/10.1101/2025.03.27.645711.
Meier, Joshua, Roshan Rao, Robert Verkuil, Jason Liu, Tom Sercu, and
Alexander Rives. 2021. “[ESM-1v]
Language Models Enable Zero-Shot Prediction of the Effects
of Mutations on Protein Function.” bioRxiv. https://doi.org/10.1101/2021.07.09.450648.
Morales, Joannella, Shashikant Pujar, Jane E. Loveland, Alex Astashyn,
Ruth Bennett, Andrew Berry, Eric Cox, et al. 2022. “A Joint
NCBI and EMBL-EBI Transcript Set
for Clinical Genomics and Research.” Nature 604 (7905):
310–15. https://doi.org/10.1038/s41586-022-04558-8.
Mountjoy, Edward, Ellen M. Schmidt, Miguel Carmona, Jeremy
Schwartzentruber, Gareth Peat, Alfredo Miranda, Luca Fumis, et al. 2021.
“An Open Approach to Systematically Prioritize Causal Variants and
Genes at All Published Human GWAS Trait-Associated
Loci.” Nature Genetics 53 (11): 1527–33. https://doi.org/10.1038/s41588-021-00945-5.
Naghipourfar, Mohsen, Siyu Chen, Mathew K. Howard, Christian B.
Macdonald, Ali Saberi, Timo Hagen, Mohammad R. K. Mofrad, Willow
Coyote-Maestas, and Hani Goodarzi. 2024. “[cdsFM - EnCodon/DeCodon]
A Suite of Foundation
Models Captures the Contextual
Interplay Between Codons.”
bioRxiv. https://doi.org/10.1101/2024.10.10.617568.
Ng, Pauline C., and Steven Henikoff. 2003. “SIFT:
Predicting Amino Acid Changes That Affect Protein
Function.” Nucleic Acids Research 31 (13): 3812–14. https://doi.org/10.1093/nar/gkg509.
Nguyen, Eric, Michael Poli, Marjan Faizi, Armin Thomas, Callum
Birch-Sykes, Michael Wornow, Aman Patel, et al. 2023.
“HyenaDNA: Long-Range
Genomic Sequence Modeling at
Single Nucleotide
Resolution.” arXiv. https://doi.org/10.48550/arXiv.2306.15794.
Nielsen, Rasmus, Joshua S. Paul, Anders Albrechtsen, and Yun S. Song.
2011. “Genotype and SNP Calling from Next-Generation
Sequencing Data.” Nature Reviews Genetics 12 (6):
443–51. https://doi.org/10.1038/nrg2986.
Notin, Pascal, Aaron Kollasch, Daniel Ritter, Lood van Niekerk,
Steffanie Paul, Han Spinner, Nathan Rollins, et al. 2023.
“ProteinGym: Large-Scale
Benchmarks for Protein Fitness
Prediction and Design.” Advances in
Neural Information Processing Systems 36 (December): 64331–79. https://papers.nips.cc/paper_files/paper/2023/hash/cac723e5ff29f65e3fcbb0739ae91bee-Abstract-Datasets_and_Benchmarks.html.
Nurk, Sergey, Sergey Koren, Arang Rhie, Mikko Rautiainen, Andrey V.
Bzikadze, Alla Mikheenko, Mitchell R. Vollger, et al. 2022. “The
Complete Sequence of a Human Genome.” Science 376
(6588): 44–53. https://doi.org/10.1126/science.abj6987.
O’Connell, Jared, Deepti Gurdasani, Olivier Delaneau, Nicola Pirastu,
Sheila Ulivi, Massimiliano Cocca, Michela Traglia, et al. 2014. “A
General Approach for Haplotype
Phasing Across the Full Spectrum
of Relatedness.” PLOS Genetics 10 (4):
e1004234. https://doi.org/10.1371/journal.pgen.1004234.
O’Leary, Nuala A., Mathew W. Wright, J. Rodney Brister, Stacy Ciufo,
Diana Haddad, Rich McVeigh, Bhanu Rajput, et al. 2016. “Reference
Sequence (RefSeq) Database at NCBI: Current
Status, Taxonomic Expansion, and Functional Annotation.”
Nucleic Acids Research 44 (D1): D733–45. https://doi.org/10.1093/nar/gkv1189.
Padyukov, Leonid. 2022. “Genetics of Rheumatoid Arthritis.”
Seminars in Immunopathology 44 (1): 47–62. https://doi.org/10.1007/s00281-022-00912-0.
Pasaniuc, Bogdan, and Alkes L. Price. 2016. “Dissecting the
Genetics of Complex Traits Using Summary Association Statistics.”
Nature Reviews Genetics 18 (2): 117–27. https://doi.org/10.1038/nrg.2016.142.
Poplin, Ryan, Pi-Chuan Chang, David Alexander, Scott Schwartz, Thomas
Colthurst, Alexander Ku, Dan Newburger, et al. 2018.
“[DeepVariant] A Universal
SNP and Small-Indel Variant Caller Using Deep Neural
Networks.” Nature Biotechnology 36 (10): 983–87. https://doi.org/10.1038/nbt.4235.
Rakowski, Alexander, and Christoph Lippert. 2025.
“[MIFM] Multiple Instance Fine-Mapping:
Predicting Causal Regulatory Variants with a Deep Sequence
Model.” medRxiv. https://doi.org/10.1101/2025.06.13.25329551.
Rentzsch, Philipp, Daniela Witten, Gregory M. Cooper, Jay Shendure, and
Martin Kircher. 2019. “CADD: Predicting the
Deleteriousness of Variants Throughout the Human Genome.”
Nucleic Acids Research 47 (D1): D886–94. https://doi.org/10.1093/nar/gky1016.
Rives, Alexander, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin,
Jason Liu, Demi Guo, et al. 2021. “[ESM-1b]
Biological Structure and Function Emerge from Scaling
Unsupervised Learning to 250 Million Protein Sequences.”
Proceedings of the National Academy of Sciences 118 (15): e2016239118. https://doi.org/10.1073/pnas.2016239118.
Robinson, James, Dominic J. Barker, Xenia Georgiou, Michael A. Cooper,
Paul Flicek, and Steven G. E. Marsh. 2020.
“IPD-IMGT/HLA
Database.” Nucleic Acids Research 48 (D1):
D948–55. https://doi.org/10.1093/nar/gkz950.
Sakaue, Saori, Saisriram Gurajala, Michelle Curtis, Yang Luo, Wanson
Choi, Kazuyoshi Ishigaki, Joyce B. Kang, et al. 2023. “Tutorial: A
Statistical Genetics Guide to Identifying HLA Alleles
Driving Complex Disease.” Nature Protocols 18 (9):
2625–41. https://doi.org/10.1038/s41596-023-00853-4.
Sanabria, Melissa, Jonas Hirsch, Pierre M. Joubert, and Anna R. Poetsch.
2024. “[GROVER] DNA Language Model
GROVER Learns Sequence Context in the Human Genome.”
Nature Machine Intelligence 6 (8): 911–23. https://doi.org/10.1038/s42256-024-00872-0.
Schubach, Max, Thorben Maass, Lusiné Nazaretyan, Sebastian Röner, and
Martin Kircher. 2024. “CADD V1.7: Using Protein
Language Models, Regulatory CNNs and Other Nucleotide-Level
Scores to Improve Genome-Wide Variant Predictions.” Nucleic
Acids Research 52 (D1): D1143–54. https://doi.org/10.1093/nar/gkad989.
Shafin, Kishwar, Trevor Pesout, Pi-Chuan Chang, Maria Nattestad, Alexey
Kolesnikov, Sidharth Goel, Gunjan Baid, et al. 2021.
“Haplotype-Aware Variant Calling with
PEPPER-Margin-DeepVariant Enables
High Accuracy in Nanopore Long-Reads.” Nature Methods 18
(11): 1322–32. https://doi.org/10.1038/s41592-021-01299-w.
Sherry, S. T., M.-H. Ward, M. Kholodov, J. Baker, L. Phan, E. M.
Smigielski, and K. Sirotkin. 2001. “dbSNP: The NCBI Database of Genetic
Variation.” Nucleic Acids Research 29 (1): 308–11. https://doi.org/10.1093/nar/29.1.308.
Siepel, Adam, Gill Bejerano, Jakob S. Pedersen, Angie S. Hinrichs,
Minmei Hou, Kate Rosenbloom, Hiram Clawson, et al. 2005.
“[PhastCons] Evolutionarily Conserved
Elements in Vertebrate, Insect, Worm, and Yeast Genomes.”
Genome Research 15 (8): 1034–50. https://doi.org/10.1101/gr.3715005.
Sirugo, Giorgio, Scott M. Williams, and Sarah A. Tishkoff. 2019.
“The Missing Diversity in
Human Genetic Studies.”
Cell 177 (1): 26–31. https://doi.org/10.1016/j.cell.2019.02.048.
Smolka, Moritz, Luis F. Paulin, Christopher M. Grochowski, Dominic W.
Horner, Medhat Mahmoud, Sairam Behera, Ester Kalef-Ezra, et al. 2024.
“Detection of Mosaic and Population-Level Structural Variants with
Sniffles2.” Nature Biotechnology 42 (10):
1571–80. https://doi.org/10.1038/s41587-023-02024-y.
Sollis, Elliot, Abayomi Mosaku, Ala Abid, Annalisa Buniello, Maria
Cerezo, Laurent Gil, Tudor Groza, et al. 2023. “The
NHGRI-EBI GWAS
Catalog: Knowledgebase and Deposition Resource.”
Nucleic Acids Research 51 (D1): D977–85. https://doi.org/10.1093/nar/gkac1010.
Song, Li, Gali Bai, X. Shirley Liu, Bo Li, and Heng Li. 2022.
“T1K: Efficient and Accurate KIR and
HLA Genotyping with Next-Generation Sequencing
Data.” bioRxiv. https://doi.org/10.1101/2022.10.26.513955.
“The Genome Aggregation
Database (gnomAD).” n.d.
Accessed July 3, 2025. https://www.nature.com/immersive/d42859-020-00002-x/index.html.
Trop, Evan, Yair Schiff, Edgar Mariano Marroquin, Chia Hsiang Kao, Aaron
Gokaslan, McKinley Polen, Mingyi Shao, et al. 2024. “The
Genomics Long-Range
Benchmark: Advancing DNA
Language Models,” October. https://openreview.net/forum?id=8O9HLDrmtq.
Van der Auwera, Geraldine A., Mauricio O. Carneiro, Christopher Hartl,
Ryan Poplin, Guillermo del Angel, Ami Levy-Moonshine, Tadeusz Jordan, et
al. 2013. “From FastQ Data to
High-Confidence Variant
Calls: The Genome
Analysis Toolkit Best
Practices Pipeline.” Current
Protocols in Bioinformatics 43 (1): 11.10.1–33. https://doi.org/10.1002/0471250953.bi1110s43.
Verma, Anurag, Jennifer E. Huffman, Alex Rodriguez, Mitchell Conery,
Molei Liu, Yuk-Lam Ho, Youngdae Kim, et al. 2024. “Diversity and
Scale: Genetic Architecture of 2068 Traits in the
VA Million Veteran
Program.” Science 385 (6706): eadj1182. https://doi.org/10.1126/science.adj1182.
Vilhjálmsson, Bjarni J., Jian Yang, Hilary K. Finucane, Alexander Gusev,
Sara Lindström, Stephan Ripke, Giulio Genovese, et al. 2015.
“Modeling Linkage Disequilibrium
Increases Accuracy of Polygenic
Risk Scores.” American Journal of
Human Genetics 97 (4): 576–92. https://doi.org/10.1016/j.ajhg.2015.09.001.
Wenger, Aaron M., Paul Peluso, William J. Rowell, Pi-Chuan Chang,
Richard J. Hall, Gregory T. Concepcion, Jana Ebler, et al. 2019.
“Accurate Circular Consensus Long-Read Sequencing Improves Variant
Detection and Assembly of a Human Genome.” Nature
Biotechnology 37 (10): 1155–62. https://doi.org/10.1038/s41587-019-0217-9.
Wu, Yang, Zhili Zheng, Loic Thibaut, Michael E. Goddard, Naomi R. Wray,
Peter M. Visscher, and Jian Zeng. 2024. “Genome-Wide Fine-Mapping
Improves Identification of Causal Variants.” Research
Square, August, rs.3.rs-4759390. https://doi.org/10.21203/rs.3.rs-4759390/v1.
Yan, Binghao, Yunbi Nam, Lingyao Li, Rebecca A. Deek, Hongzhe Li, and
Siyuan Ma. 2025. “Recent Advances in Deep Learning and Language
Models for Studying the Microbiome.” Frontiers in
Genetics 15 (January). https://doi.org/10.3389/fgene.2024.1494474.
Yuan, Qiuyue, and Zhana Duren. 2025. “[LINGER]
Inferring Gene Regulatory Networks from Single-Cell
Multiome Data Using Atlas-Scale External Data.” Nature
Biotechnology 43 (2): 247–57. https://doi.org/10.1038/s41587-024-02182-7.
Yun, Taedong, Helen Li, Pi-Chuan Chang, Michael F. Lin, Andrew Carroll,
and Cory Y. McLean. 2021. “Accurate, Scalable Cohort Variant Calls
Using DeepVariant and GLnexus.”
Bioinformatics 36 (24): 5582–89. https://doi.org/10.1093/bioinformatics/btaa1081.
Zheng, Rongbin, Changxin Wan, Shenglin Mei, Qian Qin, Qiu Wu, Hanfei
Sun, Chen-Hao Chen, et al. 2019. “Cistrome Data
Browser: Expanded Datasets and New Tools for Gene
Regulatory Analysis.” Nucleic Acids Research 47 (D1):
D729–35. https://doi.org/10.1093/nar/gky1094.
Zheng, Zhenxian, Shumin Li, Junhao Su, Amy Wing-Sze Leung, Tak-Wah Lam,
and Ruibang Luo. 2022. “Symphonizing Pileup and Full-Alignment for
Deep Learning-Based Long-Read Variant Calling.” Nature
Computational Science 2 (12): 797–803. https://doi.org/10.1038/s43588-022-00387-x.
Zhou, Jian, Chandra L. Theesfeld, Kevin Yao, Kathleen M. Chen, Aaron K.
Wong, and Olga G. Troyanskaya. 2018. “[Expecto]
Deep Learning Sequence-Based Ab Initio Prediction of
Variant Effects on Expression and Disease Risk.” Nature
Genetics 50 (8): 1171–79. https://doi.org/10.1038/s41588-018-0160-6.
Zhou, Jian, and Olga G. Troyanskaya. 2015. “[DeepSEA]
Predicting Effects of Noncoding Variants with Deep
Learning–Based Sequence Model.” Nature Methods 12 (10):
931–34. https://doi.org/10.1038/nmeth.3547.
Zhou, Zhihan, Yanrong Ji, Weijian Li, Pratik Dutta, Ramana Davuluri, and
Han Liu. 2024. “DNABERT-2: Efficient
Foundation Model and Benchmark
For Multi-Species
Genome.” arXiv. https://doi.org/10.48550/arXiv.2306.15006.
Zook, Justin M., Jennifer McDaniel, Nathan D. Olson, Justin Wagner,
Hemang Parikh, Haynes Heaton, Sean A. Irvine, et al. 2019. “An
Open Resource for Accurately Benchmarking Small Variant and Reference
Calls.” Nature Biotechnology 37 (5): 561–66. https://doi.org/10.1038/s41587-019-0074-6.