
This is a preprint. It has not yet been peer reviewed by a journal.


[Preprint]. 2025 Jun 18:arXiv:2408.16245v5. [Version 5]

Large-Scale Multi-omic Biosequence Transformers for Modeling Protein-Nucleic Acid Interactions

Sully F Chen †,*, Robert J Steele ‡,*, Glen M Hocky §, Beakal Lemeneh, Shivanand P Lad, Eric K Oermann **
PMCID: PMC11998858  PMID: 40236839

Abstract

The transformer architecture has revolutionized bioinformatics and driven progress in the understanding and prediction of the properties of biomolecules. To date, most biosequence transformers have been trained on single-omic data (either proteins or nucleic acids) and have seen incredible success in downstream tasks in each domain, with particularly noteworthy breakthroughs in protein structural modeling. However, single-omic pre-training limits the ability of these models to capture cross-modal interactions. Here we present OmniBioTE, the largest open-source multi-omic model trained on over 250 billion tokens of mixed protein and nucleic acid data. We show that despite only being trained on unlabeled sequence data, OmniBioTE learns joint representations mapping genes to their corresponding protein sequences. We further demonstrate that OmniBioTE achieves state-of-the-art results in predicting the change in Gibbs free energy (ΔG) of the binding interaction between a given nucleic acid and protein. Remarkably, we show that multi-omic biosequence transformers emergently learn useful structural information without any a priori structural training, allowing us to predict which protein residues are most involved in the protein-nucleic acid binding interaction. Lastly, compared to single-omic controls trained with identical compute, OmniBioTE demonstrates superior performance-per-FLOP across both multi-omic and single-omic benchmarks, highlighting the power of a unified modeling approach for biological sequences.

1. Introduction

It has long been a fundamental goal of bioinformatics to derive functional and structural insights directly from primary biomolecular sequences. High-throughput sequencing technologies now enable routine acquisition of vast quantities of nucleic acid and protein data, yet translating these linear sequences into mechanistic understanding remains challenging. Recent breakthroughs in natural language processing (NLP), particularly the transformer architecture [1], have demonstrated exceptional capacity to model complex sequential dependencies in text. The majority of research applying transformers to biosequences has focused on single-omics, typically nucleic acid distributions (genomics, transcriptomics, epigenetics, etc.) or proteomics. These efforts have yielded astonishing successes in several tasks, most notably the prediction of the 3D structure of proteins from their primary sequences [2–9]. Other work has focused on developing models that produce useful representations of single-omic biosequences for various downstream tasks. There exist numerous protein foundation models [10–20], and this class shows the greatest variety of model architectures. Notably, there are many generative models [21–23], encoder-decoder models [17, 18], and even a diffusion model [21]. Several genomics foundation models have been trained as well, primarily on human genomics data [24–27]. Other genomic foundation models have been trained on human and murine data [28], multi-species genomes [29], prokaryotic genomes [30], and even metagenomic scaffolds [31]. Notably, very few models integrate broad, multi-species training data, with the exception of DNABERT-2 [29], though this dataset notably lacks genomes from the domain Archaea and consists of only 32 billion base pairs.
To date, the largest DNA foundation model consists of 40 billion parameters [32]; it was trained on multi-species genomes and found to be successful at multiple downstream tasks. Genomic models augmented with epigenetic data have also demonstrated great success in downstream tasks such as predicting epigenetic markers [33–36], detecting splice sites and promoter regions [27], modeling the histone code [37], and modeling the phosphorylation of protein kinases [38]. Other foundation models focus on transcriptomics, primarily single-cell RNA (scRNA) [39–43]. Foundation models for mRNA [44] and general RNA [45] have also been trained. Transcriptomic foundation models have successfully predicted transcriptome-to-proteome translations [46], gene rankings [47], cell type annotation [48], and drug response [43, 48].

Despite these advances, cellular biology is inherently multi-omic, with proteins and nucleic acids engaging in dynamic and reciprocal interactions underpinning gene regulation, replication, and repair. Single-omic transformers, by design, lack the capacity to capture cross-modal dependencies in their fundamental representations to model tasks such as transcription factor binding, RNA-mediated translational control, and chromatin remodeling. Only three existing models incorporate both nucleic acid and protein information: AlphaFold3 [4], a closed-source proprietary model, RosettaFoldNA [6], and LucaOne [49]. Furthermore, the former two of these models are focused on structure prediction rather than generally learning from multi-omic sequences, while the latter model's nucleic acid sources included only DNA and RNA. We hypothesized that integrating protein sequences with nucleic acid sequences of all types, drawn from multiple sequencing modalities, into a unified modeling framework may uncover joint representations that more faithfully reflect the complexity of multi-omic interactions and enable direct prediction of multi-omic phenomena from sequence alone.

Here, we introduce OmniBioTE, the first large-scale, open-source multi-omic transformer pre-trained on 250 billion tokens drawn from GenBank nucleic-acid entries and UniRef100 protein sequences. We explore four model sizes (88M–2.3B parameters) and compare performance against matched single-omic controls trained with identical compute on only nucleic acid data (NucBioTE) or only protein data (ProtBioTE). We evaluate on tasks spanning: (1) predicting binding free energies (ΔG) for protein–nucleic acid complexes on ProNAB [50], (2) emergent contact prediction via attention-based probing, (3) nucleic acid specificity assessment on JASPAR [51], and (4) state-of-the-art performance on standard single-omic benchmarks (GUE [29], TAPE [52]). Our results demonstrate that multi-omic pre-training yields embeddings that inherently align gene and protein modalities, outperform single-omic models in both multi-omic and single-omic tasks, and exhibit emergent structural knowledge without explicit supervision. OmniBioTE sets a new paradigm for foundation modeling in biology by unifying sequence modalities within a single transformer framework.

2. Results

2.1. Emergent Joint Representations

We first tested whether OmniBioTE embeddings encode modality-invariant features linking genes and proteins. A low-rank linear projector trained on frozen embeddings produced by OmniBioTE via a contrastive loss objective with only 5% of ground-truth data generalizes to the remaining 95% of held-out data (Fig. 2a,b). In comparison, two separate low-rank linear projections trained with identical objectives and data splits on the single-omic models fail to generalize. Despite OmniBioTE never being explicitly (or even implicitly) taught a correspondence between genes and their corresponding translated protein sequences, the model naturally learns these associations from the underlying distributions.
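A minimal sketch of such a contrastive probe, with random vectors standing in for the frozen embeddings (the symmetric InfoNCE loss, temperature, and all dimensions below are illustrative assumptions, not the paper's exact objective):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, rank, n_pairs = 64, 8, 32

# Stand-ins for frozen embeddings of paired gene / protein sequences.
gene_emb = torch.randn(n_pairs, d_model)
prot_emb = torch.randn(n_pairs, d_model)

# One low-rank linear projector; the backbone itself stays frozen.
projector = torch.nn.Linear(d_model, rank, bias=False)
opt = torch.optim.Adam(projector.parameters(), lr=1e-2)

losses = []
for _ in range(200):
    g = F.normalize(projector(gene_emb), dim=-1)
    p = F.normalize(projector(prot_emb), dim=-1)
    logits = g @ p.T / 0.07                      # cosine similarity / temperature
    labels = torch.arange(n_pairs)
    # Symmetric InfoNCE: matched gene/protein pairs should score highest.
    loss = (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.T, labels)) / 2
    opt.zero_grad(); loss.backward(); opt.step()
    losses.append(loss.item())
```

With real paired embeddings, generalization of the learned projector to held-out pairs is what distinguishes a genuine joint representation from memorization.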

Figure 2:

a. The distribution of cosine similarity between feature vectors produced by OmniBioTE via a low-rank feature extractor on the 95% held-out data. b. The analogous plot produced by NucBioTE and ProtBioTE with two separate feature extractors with identical methodology. c. The increase in F1-score on the contact-prediction task using frozen attention maps from OmniBioTE models fine-tuned to predict binding affinity compared to frozen attention maps from the base models. d. An example of predicted contact probability for Zinc finger and BTB domain-containing protein 7A (ZBTB7A) bound to a DNA duplex computed from the attention maps produced by the fine-tuned OmniBioTE models. Darker red colors indicate a stronger predicted probability of contact. All box-and-whisker plots are constructed via the median value as the central line, the interquartile range (IQR) as the box, and the whiskers denoting the minimum and maximum value of the distribution. Outliers are defined as points that lie outside of ± 1.5 × IQR and were excluded from (a) for clarity.

2.2. Multi-omic Task Performance

We demonstrated OmniBioTE's potential as a foundation model for natively multi-omic tasks by fine-tuning each OmniBioTE model to predict the ΔG of protein-nucleic acid binding interactions. OmniBioTE-XL achieved a Pearson correlation coefficient of 0.41 and MAE = 1.56 kcal/mol, exceeding single-omic controls (ΔPCC = +0.33) (Fig. 3a,b). Additionally, mutation scans of JASPAR consensus sequences confirm that predicted ΔΔG increases, on average, upon subtle disruption of the consensus sequence, with the effect scaling with model size (Fig. 3c).

Figure 3:

a. Performance on 10-fold cross-validation over the ProNAB dataset as measured by the Pearson correlation coefficient (PCC) as a function of pre-training compute. b. Mean absolute error in ΔG prediction over the 10-fold cross-validation set. c. The predicted ΔΔG of mutated consensus sequences as a function of pre-training compute. Error bars represent the standard error of the mean of all 10 folds. LucaOne and DeePNAP baselines are omitted for clarity, as both achieve performance indistinguishable from random chance (ΔΔG = 0). d. Performance on the supervised contact evaluation task trained at various contact thresholds. The positive-to-negative ratio of the dataset is 0.29, 0.16, 0.09, and the maximum F1-score achievable with random guessing is 0.37, 0.247, and 0.157, for 8Å, 6Å, and 4Å, respectively. (*) represents the top-performing model in each evaluation.

In these tasks, we found superior performance of OmniBioTE compared to recent, purpose-built, deep learning-based methods [53], likely owing to the rich sequence information gleaned from the large-scale multi-omic pretraining (Fig. 3a). We compared our approach to using AlphaFold3-derived structures combined with molecular dynamics simulations and found that AlphaFold3-based simulations were notably more computationally intensive and yielded worse results (Sec. S1, Extended Figure S1). Notably, empirical work has found that the maximum possible Pearson correlation coefficient is around 0.81, and the minimum possible mean absolute error is around 0.6 kcal/mol [54].

We next confirmed that the multi-omic approach is considerably more performant and compute-efficient than using two identically trained single-omic models (Fig. 3a,b). We find a clear trend of increasing performance with model scale, as opposed to over-fitting with greater parameter count, indicating the robustness of the approach and potential for further performance gains with greater scale in both compute and data. We find that on the protein-nucleic acid contact prediction task (measured in F1-score), our per-residue/per-nucleotide OmniBioTE-XL model outperforms a genomic/proteomic baseline, LucaOne, which had considerably more pre-training compute invested (Fig. 3d). We hypothesize that this advantage stems from training OmniBioTE on a wide variety of nucleic acid data, in addition to genomics. We find that the byte-pair encoded OmniBioTE model underperforms compared to the LucaOne baseline and the per-residue/per-nucleotide OmniBioTE models, which we attribute to lower-resolution predictions (each token predicts the contact for multiple residues at a time). Additionally, we find similar improvements with scale on the contact prediction task (Table S14).

2.3. Attention-based Structural Interpretability

We assessed whether learned attention maps encoded structural information regarding nucleotide–residue contacts. A simple convolutional probe was trained on frozen attention maps from ΔG-fine-tuned OmniBioTE and compared to an identical convolutional probe trained on frozen attention maps produced by base models. Critically, all model parameters were frozen while training the probes, ensuring that no structural information leaked into either model’s attention maps. The model probe trained on attention maps from the OmniBioTE models trained to predict ΔG yielded consistently higher F1 scores on the contact prediction task at larger model scales (Fig. 2c), indicating that more latent structural information is present in the attention maps produced by models trained to predict binding affinity. This is particularly striking, as this structural information is not explicitly present in the binding affinity task and must instead be inferred. An example of contact predictions projected onto a Zinc finger protein is shown in Fig. 2d.
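A minimal sketch of such a probe, with random tensors standing in for the frozen attention maps (the channel count, sequence lengths, and two-layer convolutional architecture are illustrative assumptions, not the paper's exact probe):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_attn_channels, L_prot, L_na = 12, 48, 16  # illustrative sizes, not the paper's

# Stand-in for frozen attention maps: one channel per (layer, head) pair,
# cropped to the protein-rows x nucleic-acid-columns block.
attn = torch.rand(4, n_attn_channels, L_prot, L_na)
contacts = (torch.rand(4, L_prot, L_na) < 0.1).float()  # synthetic 8 A labels

# A simple convolutional probe: only these parameters would train;
# the backbone producing `attn` stays frozen throughout.
probe = nn.Sequential(
    nn.Conv2d(n_attn_channels, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 1, kernel_size=1),
)

logits = probe(attn).squeeze(1)              # (batch, L_prot, L_na)
loss = nn.functional.binary_cross_entropy_with_logits(logits, contacts)
```

Because the probe has no access to structure beyond the attention maps, any F1 gain over the base-model probe must come from structural information latent in the fine-tuned attention patterns.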

2.4. Single-omic Benchmarks

We hypothesized that our multi-omic model may be more performant on single-omic benchmarks. For each benchmark across all tasks, multi-omic pre-training demonstrates superior or comparable performance to single-omic pre-training in terms of performance-per-FLOP even with vastly different compute budgets for the GUE, TAPE, and ProteinGLUE benchmarks (Fig. 4a,c,e). This improvement in performance-per-FLOP is even more striking when considering that significantly less data per-modality was seen by the model in the multi-omic training runs, since the token budget was fixed in all training runs regardless of modality. In the GUE benchmarks (Fig. 4b), OmniBioTE models set a new state-of-the-art in all categories, with the exception of human transcription factor classification, and lie well above the previous compute Pareto frontier. In the TAPE evaluations (Fig. 4d), OmniBioTE does not achieve any state-of-the-art results in terms of absolute performance, but the per-residue OmniBioTE models begin to trend above the previous compute Pareto frontier set by ESM. Results are mixed between all models on ProteinGLUE (Fig. 4f), with the Pareto frontier difficult to ascertain; more scaling experiments are likely needed to elucidate the true frontier. The new compute Pareto frontier highlights the benefits of multi-omic data for efficient model scaling.

Figure 4: Model performance and scaling across single-omic benchmarks.


Aggregate benchmark performance for each model plotted as a function of pre-training FLOPs for the (a) GUE, (b) TAPE, and (c) ProteinGLUE benchmarks, demonstrating superior performance per pre-training FLOP of multi-omic pre-training compared to single-omic pre-training. GUE epigenetic mark prediction benchmarks were averaged to form a single category. (*) represents the top-performing model in each evaluation.

Notably, results on protein evaluation tasks differed depending on whether tokenization was per-residue/nucleotide or byte-pair encoded. This difference is likely driven by the stronger performance of per-residue tokenization on tasks that require predictions at individual residues.

3. Discussion

OmniBioTE is a series of first-of-its-kind multi-omic models (MOMs) pre-trained jointly on a diverse set of nucleic acid sequences and proteomic data. We analyzed the properties of these models across a wide range of scales and tasks. We found that these models not only achieve state-of-the-art performance on single-omic tasks measured in performance-per-FLOP, but also unlock novel multi-omic tasks such as modeling protein-nucleic acid interactions by predicting the change in Gibbs free energy of protein-nucleic acid binding. We also showed that as a result of this fine-tuning process, OmniBioTE learns meaningful structural information without any explicit structural training, allowing one to estimate how strongly a given protein residue or nucleotide participates in binding interactions.

We found that OmniBioTE emergently learned a joint representation between nucleic acid and protein sequences despite never explicitly being trained on a joint objective, demonstrating that biosequence transformer models trained on multi-omic data can learn non-trivial representations across sequences even with a simple masked language model objective. We attribute this emergence from self-supervised pre-training to the efficient coding hypothesis [55]. We hypothesize that considerably richer representations could be learned if auxiliary training objectives were introduced, such as structure/property prediction, cross-attention between different modalities, or the addition of multiple sequence alignment data. Beyond additional learning objectives, we note that there has been a considerable amount of research into multi-modal vision-language modeling using novel model architectural components including cross-attention and cross-modal projectors [56–59], and that many of these approaches may be of interest in multi-modal biosequence modeling as well.

We additionally found that multi-omic pre-trained models are superior or comparable at scale to identical models trained on single-omics data with identical compute budgets. Furthermore, we find that our multi-omic models set a new compute Pareto frontier across GUE and TAPE benchmarks, even before factoring in the lower amount of per-modality data each model sees during training. Despite the difference in datasets, we found no downsides to mixing in other modalities during pre-training for our biosequence foundation models in this project. In fact, our MOMs set new state-of-the-art performance numbers for several of the downstream nucleic acid tasks. Our MOMs also considerably outperformed a combination of single-omic models on the multi-omic task of binding affinity prediction, and outperformed molecular dynamics methods in conjunction with structural predictions from AlphaFold3, despite the latter being a considerably more computationally intensive baseline. Lastly, we showed that these results robustly transfer to completely unseen and unrelated datasets by testing our models on the JASPAR dataset.

There are several notable limitations to this work that deserve special mention. Most notably, we only scratched the surface on multi-omic biosequence modeling. As noted earlier, there are many popular ways of training multi-omic sequence models, and we elected for a simple approach using a masked language modeling task. We additionally investigated scaling over only roughly two orders of magnitude of compute, and leave the training of larger models on larger datasets as a future research direction that seems reasonably likely to yield performance benefits consistent with the scaling results found in this work. Lastly, we only investigated masked language modeling for pre-training rather than the more popular autoregressive training framework, again leaving this approach open as a viable future research direction.

Many of biology's most significant interactions occur between proteins and nucleic acids, and we demonstrate the first large-scale attempt at building and scaling foundation models to specifically learn these critical molecular interactions. Beyond their biological significance, modeling the interactions between nucleic acids and proteins is of great pharmaceutical and clinical importance; models that can assist with the development of nucleic acids that modify the function of naturally occurring proteins would greatly accelerate pharmaceutical development. Many notable pharmaceutical drugs and candidate drugs that function via nucleic acid-protein interaction have already shown great promise, such as pegaptanib [60], an RNA aptamer targeting vascular endothelial growth factor, as well as RNA sequences that target nucleolin [61], coagulation factors [62–65], CCL2 [66], CXCL12 [67], and hepcidin [68]. While our methodology does not explore aptamer design or property prediction, we believe that this methodology could be extended to aptamers with the right dataset. Foundational biosequence models have the promise of dramatically improving our ability to both understand and predict biology, and we hope that our work with OmniBioTE presents the first of many efforts to build multi-omic models that can capture the full richness of biomolecular interactions.

4. Methods

Broadly, we train dense, non-causal encoder transformer models of varying sizes using the masked-language-modeling (MLM) objective [69] on 250 billion tokens of nucleic acid and protein sequences of varying types. We additionally train control models consisting of only nucleic acid or protein sequences with equal compute budgets to evaluate the effect of training on additional sequence types. We demonstrate that our MOMs emergently learn joint representations between nucleic acid and protein sequences by showing that there exist meaningful features roughly invariant to sequence modality, and that such features do not exist in single-omic models.
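The MLM objective described above can be sketched with a minimal masking routine. The 15% mask rate, BERT-style ignore-index convention, and single mask id below are illustrative assumptions, not the paper's stated configuration:

```python
import torch

torch.manual_seed(0)

MASK_ID, VOCAB = 0, 4096              # illustrative token ids / vocab size
tokens = torch.randint(1, VOCAB, (2, 128))

# Hide a random 15% of positions (the exact recipe here is our assumption).
mask = torch.rand(tokens.shape) < 0.15
inputs = tokens.clone()
inputs[mask] = MASK_ID

# Labels use -100 (PyTorch's cross_entropy ignore_index) at unmasked
# positions, so loss is computed only where tokens were hidden.
labels = tokens.clone()
labels[~mask] = -100
```

A non-causal encoder then receives `inputs`, and its per-token logits are scored against `labels` with cross-entropy; because attention is bidirectional, every prediction can condition on the full unmasked context.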

We evaluate our suite of models by fine-tuning on several single-omics datasets that assess performance on various downstream tasks relevant to molecular biology, structural biology, and biochemistry. Additionally, we design two novel multi-omic tasks that require inference on both protein and nucleotide sequences simultaneously. Lastly, we show via simple convolutional probes that the models’ attention maps encode structural information that is learned without any a priori structural training.

4.1. Training Data

We source our nucleic acid data from GenBank [70], a collection compiled by the National Center for Biotechnology Information. We preprocessed the entire GenBank archive by first removing all metadata from each sequence, with the exception of sequence type (DNA, mRNA, tRNA, etc.). This produced 242,855,368 sequences with a total of 312,190,748,151 base pairs, primarily composed of general DNA, general RNA, mRNA, cRNA, and single-stranded RNA. A full breakdown of nucleic acid sequence data can be seen in Table S1. We source our protein data from Uniref100 [71], a dataset maintained by UniProt. Similarly to the nucleic acid data, we remove all metadata from each sequence, yielding 369,597,671 sequences with a total of 1,739,747,047 residues.

We take a subset of 10¹¹ base pairs and protein residues total to train a byte-pair encoding tokenizer [72] using the Sentencepiece library [73], with a vocabulary size of 2¹¹ for protein sequences and 2¹¹ for nucleic acid sequences (2¹² unique tokens total). Our choice of tokenizer and vocabulary size was chosen based on previous work [29]. Additionally, we train a multi-omic per-residue/nucleotide model at each size, where each token is simply a single base pair or residue.
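For illustration, the greedy pair-merging at the heart of byte-pair encoding can be sketched in a few lines. This is a toy pure-Python loop on nucleotide strings, not the SentencePiece trainer the paper actually uses:

```python
from collections import Counter

def bpe_merges(seqs, n_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent symbol
    pair across the corpus. Toy illustration of vocabulary building."""
    toks = [list(s) for s in seqs]
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for t in toks:
            pairs.update(zip(t, t[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # Re-tokenize every sequence with the new merged symbol.
        for i, t in enumerate(toks):
            out, j = [], 0
            while j < len(t):
                if j + 1 < len(t) and t[j] == a and t[j + 1] == b:
                    out.append(a + b); j += 2
                else:
                    out.append(t[j]); j += 1
            toks[i] = out
    return merges, toks

merges, toks = bpe_merges(["ACGTACGT", "ACGTTTAC", "ACACACGT"], n_merges=3)
# merges -> ["AC", "ACG", "ACGT"]: frequent motifs become single tokens.
```

Run at scale with a 2¹¹-entry vocabulary per modality, the same principle lets a single token represent several bases or residues, trading sequence length for per-token resolution.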

4.2. Architecture and Training

OmniBioTE is based on the GPT-2 architecture [74] and the LLaMA-2 architecture [75]. We replace learned positional embeddings [76] with rotary positional embeddings (RoPE) [77] and replace the causal self-attention mechanism [74, 76] with a full, non-causal attention operation [69]. We additionally scale the pre-softmax attention logits by 1/width rather than 1/√width in accordance with maximal update parameterization (μP) [78]. We use an aspect ratio (the ratio of model width to depth) of 128. We modify Karpathy's NanoGPT [79] for a lightweight and simple model implementation. We train four OmniBioTE variants, OmniBioTE-small (88 million non-embedding parameters), OmniBioTE-medium (675 million), OmniBioTE-large (1.3 billion) and OmniBioTE-XL (2.3 billion). Additionally, we train controls for each model on only nucleic acid data or only protein data (henceforth referred to as “NucBioTE-[size]” and “ProtBioTE-[size]”). For experiments requiring fine-grained, single-nucleotide/residue inference, we also train an OmniBioTE model of each size that uses a single-character tokenizer rather than a byte-pair encoding (BPE). In total, we train 16 models.
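The μP attention-scaling change can be sketched as follows (sizes are illustrative; the point of interest is the 1/width logit scale versus the standard 1/√width):

```python
import math
import torch

torch.manual_seed(0)
d_head, L = 64, 10
q = torch.randn(L, d_head)
k = torch.randn(L, d_head)

# Standard parameterization scales pre-softmax logits by 1/sqrt(d);
# muP prescribes 1/d so logit magnitudes stay controlled as width grows.
logits_sp  = (q @ k.T) / math.sqrt(d_head)   # standard
logits_mup = (q @ k.T) / d_head              # maximal update parameterization

# Full (non-causal) attention: no causal mask before the softmax.
attn_mup = torch.softmax(logits_mup, dim=-1)
```

Under μP, this scaling (together with width-dependent learning rates) is what allows hyperparameters swept at small width to transfer to the larger models.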

We train each model for 250 billion tokens with a context length of 1024 tokens for the BPE-tokenized models and a context length of 2048 characters for the single-character models (to accommodate the decreased amount of data per token). We train at a batch size of 786,432, 1,032,192, or 1,048,576 tokens (chosen based on available compute and memory and to maximize throughput) with the masked language modeling objective [69]. We use AdamW [80] (β1 = 0.9, β2 = 0.95, ϵ = 10⁻⁸, weight decay = 10⁻²), employing μP for stable hyperparameter transfer. For the parameters with fixed learning rate under μP (the embedding and unembedding parameters), we set the learning rate to 0.05, and scale the learning rates of the rest of the parameters via 32/width. These hyperparameters were determined empirically with sweeps at the 10⁶-parameter scale. Finally, all learning rates are decayed with PyTorch's OneCycleLR [81], with a warmup period of 1 billion tokens and a starting and ending learning-rate scale of 10⁻⁵.
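A minimal sketch of this optimizer-and-schedule setup in PyTorch. The step counts and the single parameter group are illustrative stand-ins for the token-based, μP-grouped schedule described above; `div_factor = 1e5` encodes the 10⁻⁵ starting/ending scale:

```python
import torch

model = torch.nn.Linear(16, 16)          # stand-in for the transformer
opt = torch.optim.AdamW(
    model.parameters(), lr=1e-3,
    betas=(0.9, 0.95), eps=1e-8, weight_decay=1e-2,
)

# OneCycleLR warms up to max_lr, then anneals; div_factor = 1e5 makes the
# starting LR 1e-5 of the peak, and final_div_factor=1.0 ends there too.
total_steps = 100                        # illustrative; the paper warms up over 1B tokens
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=1e-3, total_steps=total_steps,
    pct_start=0.1,                       # warmup fraction (illustrative)
    div_factor=1e5, final_div_factor=1.0,
)

lrs = []
for _ in range(total_steps):
    opt.step()                           # a real run computes a loss first
    sched.step()
    lrs.append(sched.get_last_lr()[0])
```

In a real μP run, the embedding/unembedding tensors and the remaining parameters would sit in separate parameter groups so each can carry its own (width-scaled) learning rate under the same schedule.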

4.3. Evaluations

We design our own multi-omic benchmark to assess our model's ability to accurately characterize protein-nucleic acid interactions, along with several novel benchmarks assessing the performance and interpretability of our models on protein-nucleic acid tasks. In addition to these multi-omic tasks, we evaluate our approach on several popular single-omic benchmarks spanning a variety of nucleic acid and protein-based tasks, to assess the baseline single-omic capabilities of our models before multi-omic task-specific fine-tuning. All fine-tuning optimization is performed via AdamW [80] with hyperparameters identical to those described in the pre-training step unless otherwise specified.

4.3.1. Protein-Nucleic Acid Binding Evaluation

To showcase the native multimodality of our generalist model, we designed a novel evaluation task using the ProNAB dataset [50]. ProNAB consists of 20,090 samples comprised of 14,606 protein-DNA complexes, 5,323 protein-RNA complexes, and 161 protein-DNA-RNA complexes. These samples are composed of 798 unique DNA-binding proteins and 340 unique RNA-binding proteins. We refer to the original work for a detailed description of the dataset composition [50]. The objective of our task is as follows: given the primary sequence of a nucleic acid-binding protein and a nucleic acid sequence, predict the ΔG of the binding interaction. This task is of particular interest in the prediction of unknown DNA/RNA-binding protein interactions with the human genome.

We assemble our dataset by first filtering the ProNAB dataset, rejecting any nucleic acid or protein sequences with non-standard residues (we use only the standard 20 amino acids and the 5 standard nucleotide bases), leaving 850 unique proteins and 15,994 protein-nucleic acid complexes. We then split the data into 10 cross-validation sets. Ultimately, we end up with 752 unique proteins and 12,282 total protein-nucleic acid interactions.

The ProNAB dataset often has multiple nucleic acid sequences per protein, so the number of unique proteins is vastly outweighed by the number of unique nucleic acids. To avoid data leakage between the train and test sets, we group samples by protein sequence, then randomly assign these protein groups to folds such that no protein appears in more than one fold. Furthermore, we conduct sequence similarity analysis on the protein sequences in the train and test sets via sequence alignment with the BLOSUM62 substitution matrix [82] to ensure minimal train/test leakage. We found that the average alignment score between identical protein sequences in our dataset was 5.20 ± 0.15 (alignment scores between identical sequences vary with sequence composition under BLOSUM62), while over 99.4% of pair-wise comparisons in our train/test set had an alignment score below 0.0, and 99.9% had a score below 1.0, suggesting that our results are not purely a result of sequence homology. As an extra precaution, we keep any protein that has a sequence similarity score over 1.5 with any other protein sequence in the dataset strictly in the train set of all cross-validation splits to guarantee there is no significant sequence homology in any cross-validation fold. As a result, 13 unique proteins and 232 protein-nucleic acid interactions were always kept in the train set to avoid any significant sequence homology in the validation sets.
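The protein-grouped fold construction can be sketched as follows (a minimal grouping routine; the sample layout and round-robin fold assignment are our illustration, not the paper's exact code):

```python
from collections import defaultdict
import random

def protein_grouped_folds(samples, n_folds=10, seed=0):
    """Split (protein, nucleic_acid, dG) samples into folds such that all
    complexes sharing a protein sequence land in the same fold, so no
    protein ever appears in both a train and a validation split."""
    by_protein = defaultdict(list)
    for s in samples:
        by_protein[s[0]].append(s)
    proteins = sorted(by_protein)
    random.Random(seed).shuffle(proteins)
    folds = [[] for _ in range(n_folds)]
    for i, p in enumerate(proteins):
        folds[i % n_folds].extend(by_protein[p])  # whole group to one fold
    return folds

# Toy data: 7 proteins, 3 complexes each, with hypothetical dG labels.
samples = [(f"P{i % 7}", f"N{i}", -1.0 * i) for i in range(21)]
folds = protein_grouped_folds(samples, n_folds=3)
```

The additional BLOSUM62 homology screen described above would then run on top of splits like these, relocating near-duplicate proteins into the permanent train set.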

To compute a ΔG value, we first concatenate a primary protein sequence and nucleic acid sequence pair and run a forward pass through OmniBioTE. We then take the embedding produced by the first token and apply a linear projection which produces a single ΔG value. If a complex is composed of a protein and a double-stranded DNA or RNA molecule, we append the second nucleic acid sequence as well. We fine-tune our model to predict ΔG from the protein-nucleic acid pairs in the train set, with mean-squared error (MSE) as our loss target. As a single-omic control, we compute the embeddings of the protein and nucleic acid sequences separately with the corresponding ProtBioTE and NucBioTE models. We then concatenate these embeddings and use a linear projection head to produce the ΔG value.
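A minimal sketch of this first-token readout, with a toy embedding table standing in for the frozen transformer backbone (shapes, vocabulary size, and the target values are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 32

# Stand-in for the transformer: any module returning per-token hidden states.
backbone = nn.Embedding(4096, d_model)
head = nn.Linear(d_model, 1)   # projects the first-token embedding to a scalar dG

# Concatenated protein + nucleic-acid token ids (illustrative batch of 2).
pair_tokens = torch.randint(0, 4096, (2, 50))

hidden = backbone(pair_tokens)                 # (batch, seq, d_model)
dg_pred = head(hidden[:, 0, :]).squeeze(-1)    # first token -> one dG per sample

target = torch.tensor([-9.2, -7.5])            # made-up dG values (kcal/mol)
loss = nn.functional.mse_loss(dg_pred, target)
```

In fine-tuning, gradients from this MSE loss flow through both the head and the backbone, which is how binding-relevant structure ends up encoded in the attention maps probed later.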

Our primary evaluation metrics are the Pearson correlation coefficient of ΔG prediction with the ground-truth measured value, as well as the mean absolute error of the predicted ΔG values. We begin with a pre-trained OmniBioTE model, then further train our models for 64 epochs with a batch size of 256 on the ΔG prediction task. The projection-head learning rate is initialized to 10⁻², the embedding learning rate to 10⁻³, and the non-embedding parameter learning rate to 10⁻⁴ · 1024/width. All learning rates are decayed with PyTorch's OneCycleLR, an implementation of the learning rate schedule first described in [81].

As a baseline, we train a recent deep-learning-based architecture, DeePNAP [83], on the identical cross-validation dataset as our model. We train the DeePNAP architecture for 64 epochs with a batch size of 256, using AdamW (β1 = 0.9, β2 = 0.999, ϵ = 10⁻⁸, weight decay = 10⁻²), starting at a learning rate of 10⁻³ and decaying linearly to 0.0. Additionally, we fine-tune a recently released genome-protein model, LucaOne [49], in a similar manner. Specifically, we set the embedding learning rate to 10⁻⁴, the non-embedding parameter learning rates to 2.5·10⁻⁵, and the projection head learning rate to 10⁻². We train LucaOne with identical AdamW hyperparameters, batch size, and epochs.

Lastly, we compare against a baseline that is more representative of current computational methods. First, we predict the structure of the protein-nucleic acid complex with AlphaFold3 [4] and use molecular dynamics simulations to predict the ΔG of the binding interaction.

4.3.2. Nucleic Acid Binding Specificity

To further validate the robustness of the OmniBioTE models fine-tuned to predict binding affinity, we evaluate whether the models can correctly predict the specificity of various DNA-binding proteins (DBPs) to their consensus sequences. First, we gather a set of 2,145 DBPs and their position-frequency matrices (PFMs) from JASPAR [51]. Using the same sequence similarity rejection technique described in the ProNAB experiment, we filter out all DBPs from the JASPAR dataset that have any significant overlap with the ProNAB dataset used in the cross-validation evaluation. We then use our fine-tuned OmniBioTE model to compute the ΔG for each DBP-nucleic-acid pair, where the consensus sequence is defined by the most frequent nucleotide in each position of the PFM. Next, we mutate each consensus sequence by randomly substituting each nucleotide with probability 5%. This produces a mutated nucleic acid sequence that would have a reduced binding affinity to the DBP as empirically known from the PFM, but would still be “in distribution” of the plausible binding nucleic acids. We generate 8 unique mutated nucleic acid sequences per DBP. We predict the ΔG for these mutated interactions and compute the difference relative to the predicted ΔG of the consensus sequence. If the fine-tuned model has correctly learned to model the specificity of the binding interaction, we should expect the predicted ΔG to increase after the consensus sequence is mutated.
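The consensus extraction and 5% mutation scan can be sketched as follows (the toy PFM and the helper names are our illustration, not JASPAR data or the paper's code):

```python
import random

random.seed(0)
BASES = "ACGT"

def consensus(pfm):
    """Most frequent base per position of a position-frequency matrix
    (rows = positions, columns = counts for A, C, G, T)."""
    return "".join(BASES[row.index(max(row))] for row in pfm)

def mutate(seq, p=0.05, rng=random):
    """Substitute each nucleotide with probability p, drawing uniformly
    from the other three bases."""
    return "".join(
        rng.choice([b for b in BASES if b != c]) if rng.random() < p else c
        for c in seq
    )

# Hypothetical 4-position PFM whose consensus is "ACGT".
pfm = [[8, 1, 1, 0], [0, 9, 1, 0], [0, 0, 10, 0], [1, 0, 0, 9]]
cons = consensus(pfm)
mutants = [mutate(cons) for _ in range(8)]   # 8 mutated sequences per DBP
```

Each mutant, paired with its DBP, would then be run through the fine-tuned ΔG head, and the sign of ΔΔG relative to the consensus prediction tests whether the model has internalized binding specificity.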

4.3.3. Protein-Nucleotide Contact Prediction

We gather all structures from the Research Collaboratory for Structural Bioinformatics Protein Data Bank [84] that contain strictly one protein chain and either one or two nucleic acid chains. For each residue in the protein-nucleic acid complex, we compute the distance to the nearest nucleotide and label the residue as “contacting a nucleotide” if it is within 8 Å of a nucleotide. Next, we group the data by primary protein sequence and create 10 cross-validation splits by protein grouping to avoid data leakage. To fine-tune OmniBioTE, we concatenate the protein and nucleic acid sequences and compute a forward pass through the model as usual. Instead of unembedding the hidden states of the final layer, we compute a linear projection to a single scalar, to which a sigmoid function is applied to yield a contact prediction. Although the nucleic acid sequence is included in the forward pass, contact predictions are computed only for the protein residues. We train the model against a binary cross-entropy loss for 32 epochs on each fold with a batch size of 256, with a training setup otherwise identical to the runs in the protein-nucleic acid binding evaluation. We additionally test other contact thresholds (4 Å and 6 Å) to evaluate the robustness of our approach, and we run the same training procedure on LucaOne with the embedding learning rate set to 10⁻⁴, the non-embedding parameter learning rates set to 2.5·10⁻⁵, and the projection head learning rate set to 10⁻², with identical AdamW hyperparameters.
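The contact-labelling rule can be sketched as follows. For simplicity this assumes one representative coordinate per residue and per nucleotide (`protein_xyz` and `na_xyz` are hypothetical arrays); the actual labelling uses the full atomic coordinates from each PDB structure:

```python
import numpy as np

def contact_labels(protein_xyz, na_xyz, threshold=8.0):
    """1 if a residue is within `threshold` Å of any nucleotide, else 0.

    protein_xyz: (R, 3) residue coordinates; na_xyz: (M, 3) nucleotide coordinates.
    """
    # Pairwise residue-to-nucleotide distances via broadcasting: shape (R, M)
    d = np.linalg.norm(protein_xyz[:, None, :] - na_xyz[None, :, :], axis=-1)
    # A residue is a contact if its nearest nucleotide is within the threshold.
    return (d.min(axis=1) <= threshold).astype(np.int64)

protein_xyz = np.array([[0.0, 0.0, 0.0], [20.0, 0.0, 0.0]])
na_xyz = np.array([[3.0, 0.0, 0.0]])
print(contact_labels(protein_xyz, na_xyz))  # [1 0]
```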

4.3.4. Genome Understanding Evaluation

To evaluate OmniBioTE’s generalizability to a variety of domain-specific nucleic acid tasks, we employ the Genome Understanding Evaluation (GUE) suite [29]. GUE consists of several genetic and epigenetic classification tasks over human, mouse, yeast, and coronaviridae genomes: core promoter detection, transcription factor prediction, promoter detection, splice site detection, epigenetic mark prediction, and COVID variant classification. The promoter detection task is a binary classification task, where the goal is to determine whether a sequence of DNA is or is not a promoter. The promoter task is divided into several subcategories: proximal promoter detection, core promoter detection, and TATA/non-TATA motif promoter detection. The proximal promoter task contains the entire promoter sequence (including the core promoter), while the core promoter task only includes the sequence in close proximity to the transcription start site. The TATA class is composed of promoters that contain a TATA motif, while the non-TATA class is composed of promoters that do not. Transcription factor detection is another binary classification task, where the goal is to determine whether a DNA sequence is the binding site of a transcription factor; this task is divided into human and murine datasets. Splice site detection is a classification task where the goal is to determine whether a DNA sequence contains a splice donor or acceptor site. The goal of the epigenetic tasks is to determine whether a nucleic acid sequence taken from a yeast genome is likely to contain a given epigenetic modification. Lastly, the COVID variant task is a multi-class classification task where the goal is to predict which variant (Alpha, Beta, Delta, Eta, Gamma, Iota, Kappa, Lambda, or Zeta) a 1,000 base pair snippet was sequenced from. We refer to the original work for a full characterization of the evaluation set.
All tasks use Matthews correlation coefficient as the primary metric, with the exception of the COVID variant classification task, which uses F1-score.

For each classification task, we fine-tune a base OmniBioTE or NucBioTE model. A class prediction is generated by taking the first token’s final embedding and applying a linear projection down to the number of classes in place of the original final projection head, followed by a softmax operation. We set the embedding parameter learning rate to 10⁻³, the learning rate of the transformer weight matrices to 1024·(model width)⁻¹·10⁻⁴, and, lastly, the learning rate of the projection head to 10⁻² for all model sizes. Hyperparameters were determined with sweeps over the validation sets. All learning rates are decayed with PyTorch’s OneCycleLR. The small and medium models are trained for 15,000 steps with a batch size of 32 over the training data, while the large and XL models are trained for 30,000 steps with a batch size of 32. We find that final validation performance is relatively robust to the number of epochs over each dataset; these training parameters were therefore chosen to yield a reasonable training time. The model that performs best on the validation set is evaluated on the test set. We additionally fine-tune LucaOne as a multi-omic baseline, using the exact optimizer hyperparameters described for LucaOne in the protein-nucleic acid binding evaluation above and training with a batch size of 32 for 30,000 iterations on each task.
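The first-token classification head can be sketched as follows, with toy dimensions; `hidden` stands in for the model's final-layer output and `W`, `b` for the learned projection (all names here are illustrative, not from the released code):

```python
import numpy as np

def classify(hidden, W, b):
    """Project the first token's final embedding to class probabilities."""
    logits = hidden[0] @ W + b           # first token only: (width,) -> (n_classes,)
    z = np.exp(logits - logits.max())    # numerically stable softmax
    return z / z.sum()

rng = np.random.default_rng(0)
hidden = rng.normal(size=(16, 32))       # toy sequence of 16 tokens, model width 32
W = rng.normal(size=(32, 2))             # binary task, e.g. promoter vs. non-promoter
b = np.zeros(2)
probs = classify(hidden, W, b)           # probabilities summing to 1
```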

4.3.5. Tasks Assessing Protein Embeddings

We employ the Tasks Assessing Protein Embeddings (TAPE) suite [52] to evaluate OmniBioTE’s ability to generalize to unseen protein-based tasks. TAPE consists of five challenges: secondary structure prediction, residue contact prediction, remote homology detection, fluorescence prediction, and stability prediction. Secondary structure prediction is a per-residue classification challenge, where the goal is to determine which type of secondary structure each residue belongs to; the secondary structures are split into either 3 or 8 classes, depending on the task. Residue contact prediction involves generating an N × N mask, where N is the length of the protein, with each element of the mask predicting the probability that a residue pair is within 8 Å. Remote homology detection involves mapping a primary protein sequence to one of 1,195 homology classes, with the aim of classifying primary sequences into meaningful structural families. Fluorescence prediction is a regression task, where the goal is to predict the log fluorescence intensity of a protein from a given primary structure. Finally, stability prediction is a regression task that aims to predict the maximum concentration at which a protein is still structurally stable. All classification tasks are measured by accuracy, while all regression tasks are measured by Spearman’s correlation coefficient. We train each task (excluding the contact evaluation, which is discussed below) for 64 epochs over the dataset with a batch size of 32, with initial learning rate parameters and schedule identical to the GUE tasks [29], though we set the non-embedding model parameter learning rate to 1024·(model width)⁻¹·10⁻⁴, the embedding learning rate to 10⁻⁴, and the projection head learning rate to 10⁻² for all model sizes.

The residue contact evaluation task involves predicting an L × L matrix of values between 0 and 1, with each element (i, j) representing the probability that residue i in the primary sequence is within 8 Å of residue j. To generate this prediction matrix, embeddings are generated from a transformer model [76], and a learned linear projection head transforms each embedding into a 128-dimensional vector. Inspired by previous work [85], a tensor of shape 256 × L × L is constructed, where item [:, i, j] corresponds to the ith 128-dimensional vector concatenated with the jth 128-dimensional vector. This tensor is transformed via an 8-layer ResNet [86] to yield a final (1 × L × L) matrix, which, after transformation by the sigmoid function, produces the desired probability matrix. Binary cross-entropy is used as the loss target, with masks applied so that the loss is computed only on residue pairs separated by at least 12 residues (excluding “short” contacts). Fine-tuning is performed for 128 epochs with a batch size of 128. The learning rate of non-embedding transformer parameters was set to 1024·(model width)⁻¹·10⁻⁴, with the projection head and ResNet using a learning rate of 10⁻³. Learning rates were warmed up and decayed via the PyTorch OneCycleLR [81] scheduler as mentioned previously.
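The pairwise-concatenation step can be sketched with NumPy broadcasting; this is an illustrative reimplementation of the tensor construction described above (toy dimensions, not the released code):

```python
import numpy as np

def pairwise_features(v):
    """v: (L, d) per-token vectors -> (2d, L, L) pair tensor.

    Entry [:, i, j] is the concatenation of v[i] and v[j], matching the
    256 x L x L construction for d = 128.
    """
    L, d = v.shape
    vi = np.broadcast_to(v[:, None, :], (L, L, d))  # v[i] repeated along j
    vj = np.broadcast_to(v[None, :, :], (L, L, d))  # v[j] repeated along i
    pair = np.concatenate([vi, vj], axis=-1)        # (L, L, 2d)
    return np.transpose(pair, (2, 0, 1))            # channels-first: (2d, L, L)

v = np.arange(12, dtype=float).reshape(3, 4)        # toy L = 3, d = 4
t = pairwise_features(v)
print(t.shape)  # (8, 3, 3)
```

For d = 128 this yields the 256-channel tensor consumed by the ResNet head.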

We fine-tune a series of ESM2 models [9] to compare both absolute performance and scaling behavior against a state-of-the-art single-omic protein model. Specifically, we fine-tune the 8 million, 35 million, 150 million, 650 million, and 3 billion parameter ESM2 models in an identical fashion to the OmniBioTE models above. For brevity, we hereafter refer to the ESM models as ESM2-XS (8 million), ESM2-S (35 million), ESM2-M (150 million), ESM2-L (650 million), and ESM2-XL (3 billion). We use the same embedding and head learning rates as the OmniBioTE fine-tuning runs, and set the non-embedding parameter learning rate to 640·(model width)⁻¹·10⁻⁴. Additionally, we evaluate LucaOne with the same hyperparameters described in the protein-nucleic acid binding evaluation, with the same number of iterations and batch size for each task. We use AdamW (β₁ = 0.9, β₂ = 0.999, ϵ = 10⁻⁸, weight decay = 0.01) as the optimizer for all models.

4.3.6. Protein General Language of Life Evaluation

To explore per-residue tasks (i.e., tasks that require a prediction for every residue in the protein), we employ the Protein General Language of Life Evaluation (ProteinGLUE) [87]. We refer to the original work for a full description of ProteinGLUE, but briefly, ProteinGLUE consists of several tasks:

Secondary structure prediction: the task is identical to the TAPE secondary structure task discussed above [52]. Accuracy is the primary metric.

Solvent accessibility: this comprises a binary classification task, in which the goal is to determine whether a residue has less than 7% solvent accessibility, and a regression task, in which the goal is to predict the actual solvent accessibility value. Accuracy is the primary metric for the classification task, and the Pearson correlation coefficient is the primary metric for the regression task.

Protein-protein interaction: the task is to predict which residues interact in either homodimers or heterodimers. Area under the receiver operating characteristic curve (AUCROC) is used as the primary metric.

Epitope region detection: the task is to predict which regions of a protein are antigenic epitopes. The performance of this task is measured in AUCROC.

Hydrophobic patch prediction: the goal of this task is to predict the rank of the largest hydrophobic patch that a residue belongs to. This task is measured via the Pearson correlation coefficient.

Each task was trained with a batch size of 32 for 16 epochs, except for protein-protein interaction, for which 64 epochs were used owing to its smaller dataset size. We use the same initial learning rates and schedules as in the TAPE evaluation above. We compare against ESM models in a similar manner to the TAPE evaluations, namely with an embedding learning rate of 10⁻⁴, a projection head learning rate of 10⁻², and a non-embedding parameter learning rate of 640·(model width)⁻¹·10⁻⁴, using the same optimizers and hyperparameters as described in the TAPE evaluations. We evaluate LucaOne on these tasks with hyperparameters identical to the TAPE evaluation.

4.4. Per-Residue Evaluations

Because the protein and nucleic acid datasets were tokenized with byte-pair encoding [72], most tokens contain several base pairs or residues. Evaluations that require a per-residue prediction, such as secondary structure, are therefore not directly compatible with this tokenization scheme. To solve this issue, we apply two simple strategies at train and test time. At train time, we compute the target of a single token as the mode of the residues it contains (for classification tasks) or the mean of their values (for regression tasks), so that the input and target sequence lengths match. At test time, we simply duplicate the predicted value at each token by the number of residues that token contains, allowing us to construct a prediction with the same length as the ground truth. This method places an upper bound on the performance our model can achieve on any per-residue task, but in practice, this upper bound is higher than previously reported state-of-the-art results. This is likely because nearby residues often share similar values in per-residue prediction tasks (e.g., if a residue is in a beta strand, its adjacent residues are likely to be in a beta strand as well). We note that our evaluation results are still directly comparable to previous per-residue methods, as we duplicate our predictions to match the ground truth dimensionality rather than decreasing the ground truth dimensionality to match the token sequence length (as is done at train time).
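The two label-alignment strategies can be sketched as follows, assuming a hypothetical `token_sizes` list giving the number of residues each BPE token covers:

```python
from statistics import mode

def residues_to_tokens(labels, token_sizes, task="classification"):
    """Collapse per-residue labels to one target per token (mode or mean)."""
    out, pos = [], 0
    for n in token_sizes:
        chunk = labels[pos:pos + n]
        out.append(mode(chunk) if task == "classification" else sum(chunk) / n)
        pos += n
    return out

def tokens_to_residues(preds, token_sizes):
    """Duplicate each token-level prediction over the residues it covers."""
    return [p for p, n in zip(preds, token_sizes) for _ in range(n)]

labels = [0, 0, 1, 1, 1, 2]                     # per-residue classes
sizes = [2, 3, 1]                               # token 0 covers 2 residues, etc.
targets = residues_to_tokens(labels, sizes)     # train-time targets: [0, 1, 2]
restored = tokens_to_residues(targets, sizes)   # test-time expansion: [0, 0, 1, 1, 1, 2]
```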

For the contact evaluations, the non-uniform number of residues encoded by each token presented a significant challenge. We remedy this issue by transforming prediction targets from residue to token space for training, and transforming predictions from token to residue space for evaluation. Transformation of prediction maps from residue space to token space was accomplished by assigning the (i, j)-token pair as a true contact if any of the residues contained within token i contact any of the residues within token j. Similarly, the (i, j)-token pair of the contact mask, used to ignore short-range contacts in the loss function, was assigned a positive value if any of the residues contained within token i are at least 12 residues apart from any of the residues contained in token j. Transforming from token space to residue space for evaluation is simpler: residue pair (n, m) is assigned the value of the token pair (i, j), where i is the token containing residue n and j is the token containing residue m. The per-residue/nucleotide models were evaluated normally.
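Both directions of the contact-map transform can be sketched as follows, assuming a hypothetical `spans` list giving each token's (start, end) residue range:

```python
import numpy as np

def contacts_to_token_space(residue_contacts, spans):
    """Token pair (i, j) is a contact if any residue in token i contacts any in token j."""
    T = len(spans)
    token_contacts = np.zeros((T, T), dtype=bool)
    for i, (a0, a1) in enumerate(spans):
        for j, (b0, b1) in enumerate(spans):
            token_contacts[i, j] = residue_contacts[a0:a1, b0:b1].any()
    return token_contacts

def token_preds_to_residue_space(token_preds, spans, n_residues):
    """Residue pair (n, m) inherits the value of its containing token pair (i, j)."""
    out = np.zeros((n_residues, n_residues), dtype=token_preds.dtype)
    for i, (a0, a1) in enumerate(spans):
        for j, (b0, b1) in enumerate(spans):
            out[a0:a1, b0:b1] = token_preds[i, j]
    return out

rc = np.zeros((4, 4), dtype=bool)
rc[0, 3] = True                         # residue 0 contacts residue 3
spans = [(0, 2), (2, 4)]                # two tokens of two residues each
tc = contacts_to_token_space(rc, spans)
print(tc[0, 1])  # True: residue 0 (token 0) contacts residue 3 (token 1)
```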

4.5. Interpretability

4.5.1. Protein-Nucleic Acid Interactions

To show that OmniBioTE learns semantically meaningful features, we demonstrate that when trained to predict the binding affinity between a nucleic acid and a protein sequence, OmniBioTE implicitly learns structural information despite exclusively being trained on primary sequence data. We fine-tune one OmniBioTE model of each size, in an identical fashion as described for the protein-nucleic acid binding evaluation, though we use all available data rather than cross-validation splits, as the goal is to fine-tune OmniBioTE models to be highly capable of predicting binding interactions, then investigate their mechanics.

Next, we gather all structures from the Research Collaboratory for Structural Bioinformatics Protein Data Bank [84] that contain strictly one protein chain and either one or two nucleic acid chains. For each residue in the protein-nucleic acid complex, we classify the residue as making contact with a nucleotide if it is within 8 Å of any nucleotide (in the same manner as described in the protein-nucleic acid contact prediction task). We then compute a forward pass through either the OmniBioTE model fine-tuned to predict ΔG or the base OmniBioTE model (control) and collect the attention maps produced by each head in each layer (this results in N² attention maps, where N is the number of layers). Next, we concatenate these attention maps along the channel dimension to produce an N² × L × L tensor, where L is the length of the input sequence. We then train a small convolutional network consisting of four layers: the first layer takes the N² channels and applies a 3 × 3 convolution to produce 64 channels, the next two layers each apply a 3 × 3 convolution producing 64 channels, and the final layer applies a 3 × 3 convolution producing a single channel. The output of the convolutional network is an L × L tensor, which we average across the last dimension to produce L logits that, after a sigmoid operation, yield the predicted probability that a given residue makes contact with a nucleotide (this task is identical to the protein-nucleic acid contact prediction task described above). We train this convolutional network via AdamW with a learning rate of 10⁻³, β₁ = 0.9, β₂ = 0.999, weight decay of 10⁻², and ϵ = 10⁻⁸ for 1,000 steps with a batch size of 256, linearly decaying the learning rate to zero over the course of training. Critically, the weights of the underlying OmniBioTE model remain frozen throughout training, meaning that the convolutional network must extract this structural information strictly from the attention maps produced by the underlying model.
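The probe architecture can be sketched in PyTorch as follows. The layer and head counts here are toy values (e.g. 4 layers × 4 heads giving 16 maps), and the class is an illustrative reimplementation of the description above, not the released code:

```python
import torch
from torch import nn

class AttentionProbe(nn.Module):
    """Four-layer CNN mapping stacked attention maps to per-residue contact probabilities."""

    def __init__(self, n_maps):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(n_maps, 64, 3, padding=1), nn.ReLU(),  # N^2 channels -> 64
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 3, padding=1),                  # collapse to one channel
        )

    def forward(self, attn):                     # attn: (B, N^2, L, L)
        logits = self.net(attn).squeeze(1)       # (B, L, L)
        # Average over the last dimension, then sigmoid -> per-residue probabilities.
        return torch.sigmoid(logits.mean(dim=-1))  # (B, L)

probe = AttentionProbe(n_maps=16)                # toy: 4 layers x 4 heads
attn = torch.rand(2, 16, 10, 10)                 # batch of 2, sequence length 10
p = probe(attn)
print(p.shape)  # torch.Size([2, 10])
```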
We compare the F1-score on each of the 10 folds for the attention maps produced by the base OmniBioTE model and those produced by the OmniBioTE model fine-tuned to predict binding affinity. If the fine-tuned model has learned meaningful structural information from the fine-tuning process, we would expect the F1-score for convolutional networks trained on these attention maps to be higher than those of the base model.

4.5.2. Shared Representations Between Modalities

We aim to test whether OmniBioTE effectively learns a joint representation space between nucleic acid and protein sequences rather than simply learning to represent both modalities separately. In this case, we want to test whether OmniBioTE has learned representations of gene sequences (DNA, both coding and non-coding regions) and their corresponding protein sequences that reflect shared functional or structural properties, independent of the sequence modality.

We first formalize the notion of invariance under transcription and translation. Let x ∈ X be a gene (DNA) sequence, and let y ∈ Y be the corresponding protein sequence produced by a mapping G : X → Y, such as the standard transcription and translation process. Suppose that our pre-trained multimodal model outputs embeddings z_x for x and z_y for y, where z_x, z_y ∈ ℝᵈ. We define a feature extractor ϕ : ℝᵈ → ℝ that maps an embedding to a scalar feature value. A feature is called invariant under the mapping G if

ϕ(z_x) = ϕ(z_y)

for all xX and y = G(x). In practical terms, such an invariant feature may correspond to the biological function or identity of a gene–protein pair, that is, a characteristic that remains constant regardless of the modality.

To test whether the model has indeed learned such invariant features, we conduct a contrastive learning experiment employing a strictly linear transformation. In this experiment, we first obtain pairs of gene sequences (including both intronic and exonic regions) and their corresponding translated protein sequences. Using our pre-trained multimodal model, we compute the embeddings z_x and z_y for each gene and protein sequence, respectively. We then introduce a learnable linear map W ∈ ℝᵏˣᵈ with low rank k ≪ d to project the embeddings into a shared subspace, yielding Wz_x and Wz_y. The map W is optimized via a contrastive objective that simultaneously maximizes the cosine similarity between corresponding pairs Wz_x and Wz_y while minimizing the similarity between non-corresponding pairs.

Specifically, we employ a contrastive loss function similar to the CLIP framework [88] to learn our feature extractor: let X ∈ ℝᴺˣᵈ and Y ∈ ℝᴺˣᵈ denote two batches of embeddings (with N samples and embedding dimension d), where each row x_i of X is a gene’s feature vector and each row y_i of Y is the feature vector of the corresponding protein sequence. Any given pair x_i and y_j is unrelated if i ≠ j. To compute the contrastive loss, each embedding in X and Y is normalized to unit length. The normalized embeddings are then used to compute a similarity matrix S ∈ ℝᴺˣᴺ whose entries are given by

S_ij = ⟨x̂_i, ŷ_j⟩ / τ,

where τ is a temperature parameter that controls the scaling of the cosine similarities.

In this setup, the diagonal elements S_ii represent the cosine similarity between corresponding pairs, while the off-diagonal elements S_ij for i ≠ j represent the similarities between non-corresponding pairs. Our final loss is composed of two terms: the first treats each row of S as the logits of a classification task in which the correct label for x_i is i, and the second treats each column as the logits for the corresponding y_i. The two terms are averaged to compute the final scalar loss. This approach is identical to the original CLIP loss proposed by Radford et al. [88]. For our experiments, we use τ = 0.07 and a projected dimension k = 16.
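The symmetric contrastive loss described above can be sketched in NumPy; this is an illustrative reimplementation (the low-rank projection W is omitted, and rows of `X` and `Y` stand in for matched gene/protein embedding pairs):

```python
import numpy as np

def clip_loss(X, Y, tau=0.07):
    """CLIP-style symmetric cross-entropy over a cosine-similarity matrix."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize rows
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    S = (Xn @ Yn.T) / tau                               # (N, N) similarity logits

    def xent(logits):
        # Cross-entropy where the correct class for row i is column i (the diagonal).
        logits = logits - logits.max(axis=1, keepdims=True)
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average of the row-wise (gene -> protein) and column-wise (protein -> gene) terms.
    return 0.5 * (xent(S) + xent(S.T))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))
loss_random = clip_loss(X, rng.normal(size=(8, 16)))    # unaligned pairs
loss_matched = clip_loss(X, X)                          # perfectly aligned pairs: lower loss
```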

We minimize this loss via the AdamW optimizer, with a learning rate of 0.01, linearly decayed to 0.0 over 10,000 steps, β = (0.9, 0.95), and ϵ = 10⁻⁸. We optimize strictly over the projection matrix and leave the model parameters frozen, as the goal is to test whether joint features have already been learned, not whether they can be learned.

After learning ϕ, we apply this transformation to a held-out set of gene-protein pairs and compute the dot product between their feature representations. If ϕ is a generalizable feature extractor, we should see high dot product scores between corresponding held-out pairs and low dot product scores between non-corresponding held-out pairs.

Critically, we assess the generalization capability of the invariant features under very strict conditions; we train on only 5% of the available paired data and test on the remaining 95%. Strong performance in this setting indicates that the model’s embeddings encode a shared subspace that captures the desired invariances.

For further validation, we perform a control experiment using two separately trained single-omic models—one trained solely on genes and the other solely on proteins. In this case, the embedding spaces of these models are learned independently, and there is no inherent guarantee of alignment between them. We attempt to learn two distinct feature extractors, ϕx and ϕy, for the gene and protein modalities, respectively, with the goal of minimizing the same contrastive loss.

Figure 1: The OmniBioTE Pipeline.

a. First, we gather large-scale datasets consisting of proteomic data and nucleic acid modalities such as DNA, many types of RNA, synthetic constructs, and more. b. Next, we employ large-scale pretraining over these sequences via an encoder transformer and the masked language-modeling objective. c. Finally, we fine-tune this foundation model with a task-specific head to tackle a wide variety of tasks.

Acknowledgements

The authors would like to thank Michael Retchin for his insightful comments and broad literature knowledge on protein-nucleic acid interactions. The authors would like to thank Douglas Kondziolka for his feedback on the manuscript. The authors would also like to thank Vincent D’Anniballe for his helpful discussion surrounding biosequence datasets. Lastly, we would like to thank Michael Costantino and the NYU Langone High Performance Computing team for their assistance with maintaining the state-of-the-art computing infrastructure necessary for this research.

GMH was supported by the National Institutes of Health through the award R35GM138312, and MD simulations were performed on NYU High Performance Computing resources, using GPUs purchased by the Simons Center for Computational Physical Chemistry (SCCPC) at NYU (SF Grant No. 839534).

Footnotes

Inclusion and Ethics

SFC, RJS, and GMH designed the experiments, wrote the code, and wrote the manuscript. BL assisted in data collection and analysis. SPL and EKO assisted in the design of the experiments, the writing of the manuscript, and the direction of the research.

Data and Code Availability

Datasets can be found at their respective open sources, specifically the National Center for Biotechnology Information (GenBank) and UniProt (for UniRef100). Additionally, we maintain code for downloading and preprocessing this data on our GitHub. We release all foundation models on HuggingFace (https://huggingface.co/WeiHua/OmniBioTE) to accelerate the development of novel downstream use cases built on top of our foundation model. Additionally, the code for training and evaluating our models is available in our GitHub repository (https://github.com/nyuolab/OmniBioTE). Code and data for predictions combining AlphaFold3 with MD simulations are available from Zenodo (https://zenodo.org/records/15098577).

References

  • [1].Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N., Kaiser Łukasz, and Polosukhin Illia. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pages 6000–6010, Red Hook, NY, USA, 2017. Curran Associates Inc.
  • [2].Senior Andrew W, Evans Richard, Jumper John, Kirkpatrick James, Sifre Laurent, Green Tim, Qin Chongli, Žídek Augustin, Nelson Alexander WR, Bridgland Alex, et al. Improved protein structure prediction using potentials from deep learning. Nature, 577(7792):706–710, 2020.
  • [3].Jumper John, Evans Richard, Pritzel Alexander, Green Tim, Figurnov Michael, Ronneberger Olaf, Tunyasuvunakool Kathryn, Bates Russ, Žídek Augustin, Potapenko Anna, Bridgland Alex, Meyer Clemens, Kohl Simon A. A., Ballard Andrew J., Cowie Andrew, Romera-Paredes Bernardino, Nikolov Stanislav, Jain Rishub, Adler Jonas, Back Trevor, Petersen Stig, Reiman David, Clancy Ellen, Zielinski Michal, Steinegger Martin, Pacholska Michalina, Berghammer Tamas, Bodenstein Sebastian, Silver David, Vinyals Oriol, Senior Andrew W., Kavukcuoglu Koray, Kohli Pushmeet, and Hassabis Demis. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, aug 2021.
  • [4].Abramson Josh, Adler Jonas, Dunger Jack, Evans Richard, Green Tim, Pritzel Alexander, Ronneberger Olaf, Willmore Lindsay, Ballard Andrew J., Bambrick Joshua, Bodenstein Sebastian W., Evans David A., Hung Chia-Chun, O’Neill Michael, Reiman David, Tunyasuvunakool Kathryn, Wu Zachary, Žemgulytė Akvilė, Arvaniti Eirini, Beattie Charles, Bertolli Ottavia, Bridgland Alex, Cherepanov Alexey, Congreve Miles, Cowen-Rivers Alexander I., Cowie Andrew, Figurnov Michael, Fuchs Fabian B., Gladman Hannah, Jain Rishub, Khan Yousuf A., Low Caroline M. R., Perlin Kuba, Potapenko Anna, Savy Pascal, Singh Sukhdeep, Stecula Adrian, Thillaisundaram Ashok, Tong Catherine, Yakneen Sergei, Zhong Ellen D., Zielinski Michal, Žídek Augustin, Bapst Victor, Kohli Pushmeet, Jaderberg Max, Hassabis Demis, and Jumper John M. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, 630(8016):493–500, jun 2024.
  • [5].Baek Minkyung, DiMaio Frank, Anishchenko Ivan, Dauparas Justas, Ovchinnikov Sergey, Lee Gyu Rie, Wang Jue, Cong Qian, Kinch Lisa N., Schaeffer R. Dustin, Millán Claudia, Park Hahnbeom, Adams Carson, Glassman Caleb R., DeGiovanni Andy, Pereira Jose H., Rodrigues Andria V., van Dijk Alberdina A., Ebrecht Ana C., Opperman Diederik J., Sagmeister Theo, Buhlheller Christoph, Pavkov-Keller Tea, Rathinaswamy Manoj K., Dalwadi Udit, Yip Calvin K., Burke John E., Garcia K. Christopher, Grishin Nick V., Adams Paul D., Read Randy J., and Baker David. Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373(6557):871–876, 2021.
  • [6].Baek Minkyung, McHugh Ryan, Anishchenko Ivan, Jiang Hanlun, Baker David, and DiMaio Frank. Accurate prediction of protein–nucleic acid complexes using RoseTTAFoldNA. Nature Methods, 21(1):117–121, jan 2024.
  • [7].Ahdritz Gustaf, Bouatta Nazim, Floristean Christina, Kadyan Sachin, Xia Qinghui, Gerecke William, O’Donnell Timothy J., Berenberg Daniel, Fisk Ian, Zanichelli Niccolò, Zhang Bo, Nowaczynski Arkadiusz, Wang Bei, Stepniewska-Dziubinska Marta M., Zhang Shang, Ojewole Adegoke, Guney Murat Efe, Biderman Stella, Watkins Andrew M., Ra Stephen, Lorenzo Pablo Ribalta, Nivon Lucas, Weitzner Brian, Ban Yih-En Andrew, Chen Shiyang, Zhang Minjia, Li Conglong, Song Shuaiwen Leon, He Yuxiong, Sorger Peter K., Mostaque Emad, Zhang Zhao, Bonneau Richard, and AlQuraishi Mohammed. OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. Nature Methods, may 2024.
  • [8].Wu Ruidong, Ding Fan, Wang Rui, Shen Rui, Zhang Xiwen, Luo Shitong, Su Chenpeng, Wu Zuofan, Xie Qi, Berger Bonnie, Ma Jianzhu, and Peng Jian. High-resolution de novo structure prediction from primary sequence. bioRxiv, 2022.
  • [9].Lin Zeming, Akin Halil, Rao Roshan, Hie Brian, Zhu Zhongkai, Lu Wenting, Smetanin Nikita, Verkuil Robert, Kabeli Ori, Shmueli Yaniv, dos Santos Costa Allan, Fazel-Zarandi Maryam, Sercu Tom, Candido Salvatore, and Rives Alexander. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023.
  • [10].Elnaggar Ahmed, Essam Hazem, Salah-Eldin Wafaa, Moustafa Walid, Elkerdawy Mohamed, Rochereau Charlotte, and Rost Burkhard. Ankh: Optimized protein language model unlocks general-purpose modelling. arXiv preprint arXiv:2301.06568, 2023.
  • [11].Geffen Yaron, Ofran Yanay, and Unger Ron. DistilProtBert: a distilled protein language model used to distinguish between real proteins and their randomly shuffled counterparts. Bioinformatics, 38(Supplement_2):ii95–ii98, 2022.
  • [12].Nambiar Ananthan, Heflin Maeve, Liu Simon, Maslov Sergei, Hopkins Mark, and Ritz Anna. Transforming the language of life: transformer neural networks for protein prediction tasks. In Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pages 1–8, 2020.
  • [13].Elnaggar Ahmed, Heinzinger Michael, Dallago Christian, Rehawi Ghalia, Wang Yu, Jones Llion, Gibbs Tom, Feher Tamas, Angerer Christoph, Steinegger Martin, et al. ProtTrans: Toward understanding the language of life through self-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):7112–7127, 2021.
  • [14].Rives Alexander, Meier Joshua, Sercu Tom, Goyal Siddharth, Lin Zeming, Liu Jason, Guo Demi, Ott Myle, Zitnick C Lawrence, Ma Jerry, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 118(15):e2016239118, 2021.
  • [15].Meier Joshua, Rao Roshan, Verkuil Robert, Liu Jason, Sercu Tom, and Rives Alex. Language models enable zero-shot prediction of the effects of mutations on protein function. Advances in Neural Information Processing Systems, 34:29287–29303, 2021.
  • [16].Rao Roshan M, Liu Jason, Verkuil Robert, Meier Joshua, Canny John, Abbeel Pieter, Sercu Tom, and Rives Alexander. MSA transformer. In International Conference on Machine Learning, pages 8844–8856. PMLR, 2021.
  • [17].Heinzinger Michael, Weissenow Konstantin, Sanchez Joaquin Gomez, Henkel Adrian, Steinegger Martin, and Rost Burkhard. ProstT5: Bilingual language model for protein sequence and structure. bioRxiv, 2023.
  • [18].Chen Bo, Cheng Xingyi, Geng Yangli-ao, Li Shen, Zeng Xin, Wang Boyan, Gong Jing, Liu Chiming, Zeng Aohan, Dong Yuxiao, Tang Jie, and Song Le. xTrimoPGLM: Unified 100B-scale pre-trained transformer for deciphering the language of protein. bioRxiv, 2023.
  • [19].Su Jin, Han Chenchen, Zhou Yuyang, Shan Junjie, Zhou Xibin, and Yuan Fajie. SaProt: Protein language modeling with structure-aware vocabulary. bioRxiv, pages 2023–10, 2023.
  • [20].Notin Pascal, Dias Mafalda, Frazer Jonathan, Marchena-Hurtado Javier, Gomez Aidan N, Marks Debora, and Gal Yarin. Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval. In International Conference on Machine Learning, pages 16990–17017. PMLR, 2022.
  • [21].Alamdari Sarah, Thakkar Nitya, van den Berg Rianne, Lu Alex Xijie, Fusi Nicolo, Amini Ava Pardis, and Yang Kevin K. Protein generation with evolutionary diffusion: sequence is all you need. bioRxiv, pages 2023–09, 2023.
  • [22].Madani Ali, Krause Ben, Greene Eric R, Subramanian Subu, Mohr Benjamin P, Holton James M, Olmos Jose Luis, Xiong Caiming, Sun Zachary Z, Socher Richard, et al. Large language models generate functional protein sequences across diverse families. Nature Biotechnology, 41(8):1099–1106, 2023.
  • [23].Ferruz Noelia, Schmidt Steffen, and Höcker Birte. ProtGPT2 is a deep unsupervised language model for protein design. Nature Communications, 13(1):4348, 2022.
  • [24].Fishman Veniamin, Kuratov Yuri, Petrov Maxim, Shmelev Aleksei, Shepelin Denis, Chekanov Nikolay, Kardymon Olga, and Burtsev Mikhail. GENA-LM: A family of open-source foundational models for long DNA sequences. bioRxiv, pages 2023–06, 2023.
  • [25].Nguyen Eric, Poli Michael, Faizi Marjan, Thomas Armin, Wornow Michael, Birch-Sykes Callum, Massaroli Stefano, Patel Aman, Rabideau Clayton, Bengio Yoshua, et al. HyenaDNA: Long-range genomic sequence modeling at single nucleotide resolution. Advances in Neural Information Processing Systems, 36, 2024.
  • [26].Dalla-Torre Hugo, Gonzalez Liam, Mendoza-Revilla Javier, Carranza Nicolas Lopez, Grzywaczewski Adam Henryk, Oteri Francesco, Dallago Christian, Trop Evan, de Almeida Bernardo P, Sirelkhatim Hassan, et al. The nucleotide transformer: Building and evaluating robust foundation models for human genomics. bioRxiv, pages 2023–01, 2023.
  • [27].Dudnyk Kseniia, Cai Donghong, Shi Chenlai, Xu Jian, and Zhou Jian. Sequence basis of transcription initiation in the human genome. Science, 384(6694):eadj0116, 2024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [28].Avsec Žiga, Agarwal Vikram, Visentin Daniel, Ledsam Joseph R., Grabska-Barwinska Agnieszka, Taylor Kyle R., Assael Yannis, Jumper John, Kohli Pushmeet, and Kelley David R.. Effective gene expression prediction from sequence by integrating long-range interactions. Nature Methods, 18(10):1196–1203, oct 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [29].Zhou Zhihan, Ji Yanrong, Li Weijian, Dutta Pratik, Davuluri Ramana, and Liu Han. Dnabert-2: Efficient foundation model and benchmark for multi-species genome, 2024.
  • [30].Zvyagin Maxim, Brace Alexander, Hippe Kyle, Deng Yuntian, Zhang Bin, Bohorquez Cindy Orozco, Clyde Austin, Kale Bharat, Perez-Rivera Danilo, Ma Heng, et al. Genslms: Genome-scale language models reveal sars-cov-2 evolutionary dynamics. The International Journal of High Performance Computing Applications, 37(6):683–705, 2023. [Google Scholar]
  • [31].Hwang Yunha, Cornman Andre L, Kellogg Elizabeth H, Ovchinnikov Sergey, and Girguis Peter R. Genomic language model predicts protein co-regulation and function. Nature communications, 15(1):2880, 2024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [32].Brixi Garyk, Durrant Matthew G, Ku Jerome, Poli Michael, Brockman Greg, Chang Daniel, Gonzalez Gabriel A, King Samuel H, Li David B, Merchant Aditi T, Naghipourfar Mohsen, Nguyen Eric, Ricci-Tam Chiara, Romero David W, Sun Gwanggyu, Taghibakshi Ali, Vorontsov Anton, Yang Brandon, Deng Myra, Gorton Liv, Nguyen Nam, Wang Nicholas K, Adams Etowah, Baccus Stephen A, Dillmann Steven, Ermon Stefano, Guo Daniel, Ilango Rajesh, Janik Ken, Lu Amy X, Mehta Reshma, Mofrad Mohammad R.K., Ng Madelena Y, Pannu Jaspreet, Re Christopher, Schmok Jonathan C, St. John John, Sullivan Jeremy, Zhu Kevin, Zynda Greg, Balsam Daniel, Collison Patrick, Costa Anthony B., Hernandez-Boussard Tina, Ho Eric, Liu Ming-Yu, McGrath Tom, Powell Kimberly, Burke Dave P., Goodarzi Hani, Hsu Patrick D, and Hie Brian. Genome modeling and design across all domains of life with evo 2. bioRxiv, 2025. [Google Scholar]
  • [33].Tsukiyama Sho, Hasan Md Mehedi, Deng Hong-Wen, and Kurata Hiroyuki. BERT6mA: prediction of DNA N6-methyladenine site using deep learning-based approaches. Briefings in Bioinformatics, 23(2):bbac053, February 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [34].De Waele Gaetan, Clauwaert Jim, Menschaert Gerben, and Waegeman Willem. CpG Transformer for imputation of single-cell methylomes. Bioinformatics, 38(3):597–603, October 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [35].Jin Junru, Yu Yingying, Wang Ruheng, Zeng Xin, Pang Chao, Jiang Yi, Li Zhongshen, Dai Yutong, Su Ran, Zou Quan, et al. idnaabf: multi-scale deep biological language learning model for the interpretable prediction of dna methylations. Genome biology, 23(1):219, 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [36].Zhou Jiyun, Chen Qiang, Braun Patricia R, Perzel Mandell Kira A, Jaffe Andrew E, Tan Hao Yang, Hyde Thomas M, Kleinman Joel E, Potash James B, Shinozaki Gen, et al. Deep learning predicts dna methylation regulatory variants in the human brain and elucidates the genetics of psychiatric disorders. Proceedings of the National Academy of Sciences, 119(34):e2206069119, 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [37].Lee Dohoon, Yang Jeewon, and Kim Sun. Learning the histone codes with large genomic windows and three-dimensional chromatin interactions using transformer. Nature Communications, 13(1):6678, 2022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [38].Zhou Zhongliang, Yeung Wayland, Gravel Nathan, Salcedo Mariah, Soleymani Saber, Li Sheng, and Kannan Natarajan. Phosformer: an explainable transformer model for protein kinase-specific phosphorylation predictions. Bioinformatics, 39(2):btad046, January 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [39].Fu Xi, Mo Shentong, Buendia Alejandro, Laurent Anouchka, Shao Anqi, del Mar Alvarez-Torres Maria, Yu Tianji, Tan Jimin, Su Jiayu, Sagatelian Romella, et al. Get: a foundation model of transcription across human cell types. bioRxiv, pages 2023–09, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [40].Yang Fan, Wang Wenchuan, Wang Fang, Fang Yuan, Tang Duyu, Huang Junzhou, Lu Hui, and Yao Jianhua. scbert as a large-scale pretrained deep language model for cell type annotation of single-cell rna-seq data. Nature Machine Intelligence, 4(10):852–866, 2022. [Google Scholar]
  • [41].Hao Minsheng, Gong Jing, Zeng Xin, Liu Chiming, Guo Yucheng, Cheng Xingyi, Wang Taifeng, Ma Jianzhu, Zhang Xuegong, and Song Le. Large-scale foundation model on single-cell transcriptomics. Nature Methods, pages 1–11, 2024. [DOI] [PubMed] [Google Scholar]
  • [42].Cui Haotian, Wang Chloe, Maan Hassaan, Pang Kuan, Luo Fengning, Duan Nan, and Wang Bo. scgpt: toward building a foundation model for single-cell multi-omics using generative ai. Nature Methods, pages 1–11, 2024. [DOI] [PubMed] [Google Scholar]
  • [43].Theodoris Christina V, Xiao Ling, Chopra Anant, Chaffin Mark D, Al Sayed Zeina R, Hill Matthew C, Mantineo Helene, Brydon Elizabeth M, Zeng Zexian, Liu X Shirley, et al. Transfer learning enables predictions in network biology. Nature, 618(7965):616–624, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [44].Li Sizhen, Moayedpour Saeed, Li Ruijiang, Bailey Michael, Riahi Saleh, Kogler-Anele Lorenzo, Miladi Milad, Miner Jacob, Zheng Dinghai, Wang Jun, et al. Codonbert: Large language models for mrna design and optimization. bioRxiv, pages 2023–09, 2023. [Google Scholar]
  • [45].Celaj Albi, Gao Alice Jiexin, Lau Tammy TY, Holgersen Erle M, Lo Alston, Lodaya Varun, Cole Christopher B, Denroche Robert E, Spickett Carl, Wagih Omar, et al. An rna foundation model enables discovery of disease mechanisms and candidate therapeutics. bioRxiv, pages 2023–09, 2023. [Google Scholar]
  • [46].Liu Linjing, Li Wei, Wong Ka-Chun, Yang Fan, and Yao Jianhua. A pre-trained large generative model for translating single-cell transcriptome to proteome. bioRxiv, pages 2023–07, 2023. [DOI] [PubMed] [Google Scholar]
  • [47].Shen Hongru, Liu Jilei, Hu Jiani, Shen Xilin, Zhang Chao, Wu Dan, Feng Mengyao, Yang Meng, Li Yang, Yang Yichen, et al. Generative pretraining from large-scale transcriptomes for single-cell deciphering. Iscience, 26(5), 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [48].Gong Jing, Hao Minsheng, Cheng Xingyi, Zeng Xin, Liu Chiming, Ma Jianzhu, Zhang Xuegong, Wang Taifeng, and Song Le. xtrimogene: an efficient and scalable representation learner for single-cell rna-seq data. Advances in Neural Information Processing Systems, 36, 2024. [Google Scholar]
  • [49].He Yong, Fang Pan, Shan Yongtao, Pan Yuanfei, Wei Yanhong, Chen Yichang, Chen Yihao, Liu Yi, Zeng Zhenyu, Zhou Zhan, et al. Lucaone: generalized biological foundation model with unified nucleic acid and protein language. bioRxiv, pages 2024–05, 2024. [Google Scholar]
  • [50].Harini Kannan, Srivastava Ambuj, Kulandaisamy Arulsamy, and Gromiha M Michael. ProNAB: database for binding affinities of protein–nucleic acid complexes and their mutants. Nucleic Acids Research, 50(D1):D1528–D1534, October 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [51].Rauluseviciute Ieva, Riudavets-Puig Rafael, Blanc-Mathieu Romain, Castro-Mondragon Jaime A, Ferenc Katalin, Kumar Vipin, Lemma Roza Berhanu, Lucas Jérémy, Chèneby Jeanne, Baranasic Damir, et al. Jaspar 2024: 20th anniversary of the open-access database of transcription factor binding profiles. Nucleic acids research, 52(D1):D174–D182, 2024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [52].Rao Roshan, Bhattacharya Nicholas, Thomas Neil, Duan Yan, Chen Peter, Canny John, Abbeel Pieter, and Song Yun. Evaluating protein transfer learning with tape. Advances in neural information processing systems, 32, 2019. [PMC free article] [PubMed] [Google Scholar]
  • [53].Pandey Uddeshya, Behara Sasi M., Sharma Siddhant, Patil Rachit S., Nambiar Souparnika, Koner Debasish, and Bhukya Hussain. Deepnap: A deep learning method to predict protein–nucleic acid binding affinity from their sequences. Journal of Chemical Information and Modeling, 64(6):1806–1815, 2024. [DOI] [PubMed] [Google Scholar]
  • [54].Kramer Christian, Kalliokoski Tuomo, Gedeck Peter, and Vulpetti Anna. The experimental uncertainty of heterogeneous public k i data. Journal of medicinal chemistry, 55(11):5165–5173, 2012. [DOI] [PubMed] [Google Scholar]
  • [55].Loh Lay Kuan and Bartulovic Mihovil. Efficient coding hypothesis and an introduction to information theory. Retrieved from users. ece.cmu.edu/˜pgrover/teaching/files/InfoTheoryEfficientCodingHypothesis.pdf. Homayoun Shahri, 2014. [Google Scholar]
  • [56].Jaegle Andrew, Gimeno Felix, Brock Andy, Vinyals Oriol, Zisserman Andrew, and Carreira Joao. Perceiver: General perception with iterative attention. In International conference on machine learning, pages 4651–4664. PMLR, 2021. [Google Scholar]
  • [57].Alayrac Jean-Baptiste, Donahue Jeff, Luc Pauline, Miech Antoine, Barr Iain, Hasson Yana, Lenc Karel, Mensch Arthur, Millican Katherine, Reynolds Malcolm, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022. [Google Scholar]
  • [58].Radford Alec, Kim Jong Wook, Hallacy Chris, Ramesh Aditya, Goh Gabriel, Agarwal Sandhini, Sastry Girish, Askell Amanda, Mishkin Pamela, Clark Jack, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. [Google Scholar]
  • [59].Liu Haotian, Li Chunyuan, Wu Qingyang, and Lee Yong Jae. Visual instruction tuning. Advances in neural information processing systems, 36, 2024. [PMC free article] [PubMed] [Google Scholar]
  • [60].Gragoudas Evangelos S., Adamis Anthony P., Cunningham Emmett T., Feinsod Matthew, and Guyer David R.. Pegaptanib for neovascular age-related macular degeneration. New England Journal of Medicine, 351(27):2805–2816, 2004. [DOI] [PubMed] [Google Scholar]
  • [61].Carvalho Josué, Paiva Artur, Campello Maria Paula Cabral, Paulo António, Mergny Jean-Louis, Salgado Gilmar F., Queiroz João A., and Cruz Carla. Aptamer-based targeted delivery of a g-quadruplex ligand in cervical cancer cells. Scientific Reports, 9(1):7945, 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [62].Tanaka Kenichi A., Szlam Fania, Rusconi Christopher P., and Levy Jerrold H.. In-vitro evaluation of anti-factor ixa aptamer on thrombin generation, clotting time, and viscoelastometry. Thrombosis and Haemostasis, 101(5):827–833, May 2009. [PubMed] [Google Scholar]
  • [63].Chan M. Y., Rusconi C. P., Alexander J. H., Tonkens R. M., Harrington R. A., and Becker R. C.. A randomized, repeat-dose, pharmacodynamic and safety study of an antidote-controlled factor ixa inhibitor. Journal of Thrombosis and Haemostasis, 6(5):789–796, May 2008. [DOI] [PubMed] [Google Scholar]
  • [64].Riccardi Claudia, Meyer Albert, Vasseur Jean-Jacques, Cavasso Domenico, Krauss Irene Russo, Paduano Luigi, Morvan François, and Montesarchio Daniela. Design, synthesis and characterization of cyclic nu172 analogues: A biophysical and biological insight. International Journal of Molecular Sciences, 21(11):3860, May 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [65].Jilma-Stohlawetz Petra, Knöbl Paul, Gilbert James C., and Jilma Bernd. The anti-von willebrand factor aptamer arc1779 increases von willebrand factor levels and platelet counts in patients with type 2b von willebrand disease. Thrombosis and Haemostasis, 108(2):284–290, August 2012. [DOI] [PubMed] [Google Scholar]
  • [66].Menne Jan, Eulberg Dirk, Beyer Diana, Baumann Matthias, Saudek Frantisek, Valkusz Zsuzsanna, Więcek Andrzej, and Haller Hermann. C-c motif-ligand 2 inhibition with emapticap pegol (nox-e36) in type 2 diabetic patients with albuminuria. Nephrology, Dialysis, Transplantation, 32(2):307–315, 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [67].Giordano Frank A., Layer Julian P., Leonardelli Sonia, Friker Lea L., Turiello Roberta, Corvino Dillon, Zeyen Thomas, Schaub Christina, Müller Wolf, Sperk Elena, Schmeel Leonard Christopher, Sahm Katharina, Oster Christoph, Kebir Sied, Hambsch Peter, Pietsch Torsten, Bisdas Sotirios, Platten Michael, Glas Martin, Seidel Clemens, Herrlinger Ulrich, and Hölzel Michael. L-rna aptamer-based cxcl12 inhibition combined with radiotherapy in newly-diagnosed glioblastoma: dose escalation of the phase i/ii gloria trial. Nature Communications, 15(1):4210, 2024. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [68].Schwoebel Frank, van Eijk Lucas T., Zboralski Dirk, Sell Simone, Buchner Klaus, Maasch Christian, Purschke Werner G., Humphrey Martin, Zöllner Stefan, Eulberg Dirk, Morich Frank, Pickkers Peter, and Klussmann Sven. The effects of the anti-hepcidin spiegelmer nox-h94 on inflammation-induced anemia in cynomolgus monkeys. Blood, 121(12):2311–2315, 2013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [69].Devlin Jacob, Chang Ming-Wei, Lee Kenton, and Toutanova Kristina. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019.
  • [70].Sayers Eric W, Cavanaugh Mark, Clark Karen, Ostell James, Pruitt Kim D, and Karsch-Mizrachi Ilene. GenBank. Nucleic Acids Research, 47(D1):D94–D99, October 2018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [71].Suzek Baris E., Huang Hongzhan, McGarvey Peter, Mazumder Raja, and Wu Cathy H.. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics, 23(10):1282–1288, March 2007. [DOI] [PubMed] [Google Scholar]
  • [72].Sennrich Rico, Haddow Barry, and Birch Alexandra. Neural machine translation of rare words with subword units, 2016.
  • [73].Kudo Taku and Richardson John. Sentence-piece: A simple and language independent subword tokenizer and detokenizer for neural text processing, 2018.
  • [74].Radford Alec, Wu Jeff, Child Rewon, Luan David, Amodei Dario, and Sutskever Ilya. Language models are unsupervised multitask learners. 2019.
  • [75].Touvron Hugo, Martin Louis, Stone Kevin, Albert Peter, Almahairi Amjad, Babaei Yasmine, Bashlykov Nikolay, Batra Soumya, Bhargava Prajjwal, Bhosale Shruti, Bikel Dan, Blecher Lukas, Ferrer Cristian Canton, Chen Moya, Cucurull Guillem, Esiobu David, Fernandes Jude, Fu Jeremy, Fu Wenyin, Fuller Brian, Gao Cynthia, Goswami Vedanuj, Goyal Naman, Hartshorn Anthony, Hosseini Saghar, Hou Rui, Inan Hakan, Kardas Marcin, Kerkez Viktor, Khabsa Madian, Kloumann Isabel, Korenev Artem, Koura Punit Singh, Lachaux Marie-Anne, Lavril Thibaut, Lee Jenya, Liskovich Diana, Lu Yinghai, Mao Yuning, Martinet Xavier, Mihaylov Todor, Mishra Pushkar, Molybog Igor, Nie Yixin, Poulton Andrew, Reizenstein Jeremy, Rungta Rashi, Saladi Kalyan, Schelten Alan, Silva Ruan, Smith Eric Michael, Subramanian Ranjan, Tan Xiaoqing Ellen, Tang Binh, Taylor Ross, Williams Adina, Kuan Jian Xiang, Xu Puxin, Yan Zheng, Zarov Iliyan, Zhang Yuchen, Fan Angela, Kambadur Melanie, Narang Sharan, Rodriguez Aurelien, Stojnic Robert, Edunov Sergey, and Scialom Thomas. Llama 2: Open foundation and fine-tuned chat models, 2023.
  • [76].Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N., Kaiser Lukasz, and Polosukhin Illia. Attention is all you need, 2023.
  • [77].Su Jianlin, Lu Yu, Pan Shengfeng, Murtadha Ahmed, Wen Bo, and Liu Yunfeng. Roformer: Enhanced transformer with rotary position embedding, 2023.
  • [78].Yang Greg, Hu Edward J., Babuschkin Igor, Sidor Szymon, Liu Xiaodong, Farhi David, Ryder Nick, Pachocki Jakub, Chen Weizhu, and Gao Jianfeng. Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer, 2022.
  • [79].Karpathy Andrej. NanoGPT. https://github.com/karpathy/nanoGPT, 2022.
  • [80].Loshchilov Ilya and Hutter Frank. Decoupled weight decay regularization, 2019.
  • [81].Smith Leslie N. and Topin Nicholay. Superconvergence: Very fast training of neural networks using large learning rates, 2018.
  • [82].Henikoff Steven and Henikoff Jorja G. Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences, 89(22):10915–10919, 1992. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [83].Pandey Uddeshya, Behara Sasi M, Sharma Siddhant, Patil Rachit S, Nambiar Souparnika, Koner Debasish, and Bhukya Hussain. Deepnap: A deep learning method to predict protein–nucleic acid binding affinity from their sequences. Journal of Chemical Information and Modeling, 64(6):1806–1815, 2024. [DOI] [PubMed] [Google Scholar]
  • [84].Berman Helen M., Westbrook John, Feng Zukang, Gilliland Gary, Bhat T. N., Weissig Helge, Shindyalov Ilya N., and Bourne Philip E.. The Protein Data Bank. Nucleic Acids Research, 28(1):235–242, January 2000. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [85].Xu Mingxue, Xu Yao Lei, and Mandic Danilo P.. Tensorgpt: Efficient compression of the embedding layer in llms based on the tensor-train decomposition, 2023.
  • [86].He Kaiming, Zhang Xiangyu, Ren Shaoqing, and Sun Jian. Deep residual learning for image recognition, 2015.
  • [87].Capel Henriette, Weiler Robin, Dijkstra Maurits, Vleugels Reinier, Bloem Peter, and Feenstra K. Anton. Proteinglue: A multi-task benchmark suite for self-supervised protein modeling. bioRxiv, 2021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [88].Radford Alec, Kim Jong Wook, Hallacy Chris, Ramesh Aditya, Goh Gabriel, Agarwal Sandhini, Sastry Girish, Askell Amanda, Mishkin Pamela, Clark Jack, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PmLR, 2021. [Google Scholar]
  • [89].Eastman Peter, Galvelis Raimondas, Peláez Raúl P, Abreu Charlles RA, Farr Stephen E, Gallicchio Emilio, Gorenko Anton, Henry Michael M, Hu Frank, Huang Jing, et al. Openmm 8: molecular dynamics simulation with machine learning potentials. J. Phys. Chem. B, 128(1):109–116, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [90].Wang Ercheng, Sun Huiyong, Wang Junmei, Wang Zhe, Liu Hui, Zhang John ZH, and Hou Tingjun. End-point binding free energy calculation with mm/pbsa and mm/gbsa: strategies and applications in drug design. Chem. Rev., 119(16):9478–9508, 2019. [DOI] [PubMed] [Google Scholar]
  • [91].Roux Benoît and Chipot Christophe. Editorial guidelines for computational studies of ligand binding using mm/pbsa and mm/gbsa approximations wisely. J. Phys. Chem. B, 128(49):12027–12029, 2024. [DOI] [PubMed] [Google Scholar]
  • [92].James A Maier Carmenza Martinez, Kasavajhala Koushik, Wickstrom Lauren, Hauser Kevin E, and Simmerling Carlos. ff14sb: improving the accuracy of protein side chain and backbone parameters from ff99sb. J. Chem. Theor. Comput., 11(8):3696–3713, 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [93].Nguyen Hai, Roe Daniel R, and Simmerling Carlos. Improved generalized born solvent model parameters for protein simulations. J. Chem. Theor. Comput., 9(4):2020–2034, 2013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [94].Leimkuhler Benedict and Matthews Charles. Robust and efficient configurational molecular sampling via langevin dynamics. J Chem Phys, 138(17), 2013. [DOI] [PubMed] [Google Scholar]
  • [95].Zhang Zhijun, Liu Xinzijian, Yan Kangyu, Tuckerman Mark E, and Liu Jian. Unified efficient thermostat scheme for the canonical ensemble with holonomic or isokinetic constraints via molecular dynamics. J Phys Chem A, 123(28):6056–6079, 2019. [DOI] [PubMed] [Google Scholar]
  • [96].Dalla-Torre Hugo, Gonzalez Liam, Revilla Javier Mendoza, Carranza Nicolas Lopez, Grywaczewski Adam Henryk, Oteri Francesco, Dallago Christian, Trop Evan, Sirelkhatim Hassan, Richard Guillaume, et al. The nucleotide transformer: Building and evaluating robust foundation models for human genomics. bioRxiv, pages 2023–01, 2023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [97].Liang Chaoqi, Qiao Lifeng, Ye Peng, Dong Nanqing, Sun Jianle, Bai Weiqiang, Ren Yuchen, Ma Xinzhu, Yan Hongliang, Song Chunfeng, et al. Toward understanding bert-like pre-training for dna foundation models. arXiv preprint arXiv:2310.07644, 2023. [Google Scholar]

Associated Data


Data Availability Statement

Datasets are available from their respective open sources, specifically the National Center for Biotechnology Information (GenBank) and UniProt (for UniRef100). We maintain code for downloading and preprocessing these data, as well as for training and evaluating our models, in our GitHub repository (https://github.com/nyuolab/OmniBioTE). We release all foundation models on Hugging Face (https://huggingface.co/WeiHua/OmniBioTE) to accelerate the development of novel downstream use cases built on top of our foundation model. Code and data for the predictions combining AlphaFold3 with MD simulations are available from Zenodo (https://zenodo.org/records/15098577).

