Abstract
Computational genomics increasingly relies on machine learning methods for genome interpretation, and the recent adoption of neural sequence-to-function models highlights the need for rigorous model specification and controlled evaluation, problems familiar to other fields of AI. Research strategies that have greatly benefited other fields — including benchmarking, auditing, and algorithmic fairness — are also needed to advance the field of genomic AI and to facilitate model development. Here we propose a genomic AI benchmark, GUANinE, for evaluating model generalization across a number of distinct genomic tasks. Compared to existing task formulations in computational genomics, GUANinE is large-scale, de-noised, and suitable for evaluating pretrained models. GUANinE v1.0 primarily focuses on functional genomics tasks such as functional element annotation and gene expression prediction, and it also draws upon connections to evolutionary biology through sequence conservation tasks. The current GUANinE tasks provide insight into the performance of existing genomic AI models and non-neural baselines, with opportunities to be refined, revisited, and broadened as the field matures. Finally, the GUANinE benchmark allows us to evaluate new self-supervised T5 models and explore the tradeoffs between tokenization and model performance, while showcasing the potential for self-supervision to complement existing pretraining procedures.
1. Introduction
Genomes are fundamental characteristics of organisms that encode the molecular machinery and regulatory functions that define cellular organization and response to stimuli. Modern statistical machinery to analyze genomes has grown in complexity, from sophisticated tree models to reconstruct demographic and evolutionary histories [57, 5] to neural network-based sequence-to-function maps for [epi]genomic annotation, imputation, and understanding [86, 7, 87, 38, 6]. Due to this increased reliance on high-complexity, difficult-to-interpret models, there is a need for centralized benchmarks and benchmark development tools to maximize research efficacy. Benchmarks offer new perspectives on model evaluation, assess the progress of a field over time, and provide standardized comparability between new and existing models.
Here, we curate large-scale preprocessed genome interpretation tasks for establishing the Genome Understanding and ANnotation in silico Evaluation, or GUANinE1, benchmark. While predictions from genome interpretation models would ideally be confirmed by comprehensive wetlab experiments, such experiments present a significant and costly research bottleneck for model development. Because gold-standard human evaluation [61] for genomic tasks is infeasible, the design of large-scale benchmarking tasks is necessary to develop competitive baselines. Our goal is to provide a set of benchmarking tasks for genomic AI that
(1) allow for direct, supervised training of high-complexity models from scratch (tabula rasa) for comparison to pretraining regimes for transfer learning, and
(2) ensure statistical significance while having limited or quantifiable confounders (e.g. batch effects or socioeconomic factors [77]), a requirement of any evaluation dataset [24].
GUANinE v1.0 prioritizes functional genomic annotation and understanding on short-to-moderate length sequences (between 80 and 512 nucleotides), rather than exploring long sequence inputs or distal effects [38, 20, 36]. Although GUANinE does not cover every domain of genomic AI, e.g. taxonomic classification or comparative biology [47, 71], it emphasizes tasks that require a complex understanding of genome regulation, and we hope it will encourage further task design and benchmarking in the field, similar to the diversity of tasks in NLP from cloze completion [67] to Natural Language Understanding (NLU) [12, 80].
Additionally, we make use of the standardized performance metrics of GUANinE to evaluate a variety of non-neural baselines and existing genomic AI models across diverse tasks, and we demonstrate the power of self-supervised pretraining in the genome while exploring key hyperparameter implications with numerous and varied T5 models. These experiments confirm the perplexity benefits of longer sequences [14] while demonstrating the tradeoff of reduced representation sizes for fixed-length tasks at higher levels of tokenization.
1.1. Background — Benchmarking
Benchmarking has a long history across the AI fields of natural language processing (NLP, “language”) and computer vision (“vision”), from ImageNet [18], which inspired AlexNet [41] and ResNets [27] in vision, to the bilingual parallel corpora that led to the transformer architecture [78, 19] and modern benchmarks of question-answering [60] and comprehensive evaluation [80, 81, 66] in language. The potential for model development, designing new tasks, and evaluating models enabled by benchmarking is difficult to overstate. On the other hand, reliance on benchmarks is not without risks — benchmarking is an intrinsically normative process that can entrench suboptimal priorities [11], perpetuate cultural biases [11, 9], limit model expressivity [51], or yield inaccurate metrics due to duplication errors [8]. Given present and historical biases in genomics [55, 1, 77] and medicine [48], benchmarking biomedical tasks based on clinical or volunteer-based data is challenging. GUANinE relies on experimentally determined and evolutionary data for its tasks to reduce socioeconomic confounders, although in vitro and evolutionary data come with their own biases as we discuss.
1.2. Related Work
Previous evaluation efforts for genomic AI models and noncoding variation have utilized the Eukaryotic Promoter Database [56] for promoter annotation [50, 33, 83], gene expression data from the Roadmap [73], GTEx [72], and Geuvadis [43] consortia for promoter understanding [63, 31, 3, 87, 38], and GWAS or eQTL variant association datasets [87, 31, 20] or curated clinical variant annotations from HGMD or ClinVar [30, 79, 40, 31, 38] for variant interpretation. Recent small-scale experimental validations make use of wetlab techniques such as CRISPRi [6]. Most of these evaluations are designed by the model authors themselves and are heavily tailored to the model or task of interest, rather than being explicitly intended or designed as benchmarking tasks for follow-up comparisons with other models. The Critical Assessment of Genome Interpretation (CAGI) challenges [29], in contrast, involve benchmarks created for specific multi-submission challenges, but are typically limited in scope, with submissions tailored specifically to each individual challenge. The recent GenomicBenchmarks paper [26] is notably distinct from other work and is the most comparable to GUANinE, although GUANinE involves a wider scope of tasks with over 60M (~70x) training examples, a rigorous approach to task construction (e.g. repeat-downsampling and GC-content balancing), and comprehensive baselining.
2. GUANinE Tasks
The GUANinE benchmark is designed to be supervised, human (eukaryote)-centric, and well-controlled, with a focus on large training sets. Compared to other AI disciplines, the relative infeasibility of manual human labelling for genome annotation is a clear limitation. For GUANinE, great consideration was given to cleaning, limiting obvious confounders, and (where applicable) selecting negatives. Our suite of tasks is meant to broadly characterize human genomic complexity and span several domains of functional genomics.
2.1. Functional Elements
Endogenous functional element annotation is commonly used for the training and evaluation of genomic AI methods [50, 26]. In GUANinE, we label finite spans of nucleotides (centered at an experimentally called peak) with a scalar output corresponding to a functional ‘propensity’ across a catalogue of cell types. This propensity is based on a weighted sum of the number of cell types displaying signal for the consensus peak in the experiment of interest; see Appendix B for details. This scalar target subsumes the canonical vector output across multiple cell types [86, 87, 38, 6] into a concise, interpretable metric of cell type specificity. Compared to existing functional element datasets, these tasks have a relatively low (< 20%) proportion of negatives (zeros) and stricter downsampling in repeat-masked regions.
dnase-propensity
This task asks a model to estimate the ubiquity of a DNase Hypersensitive Site (DHS) across cell types for sequences with some non-zero DHS signal, alongside negative sequences from the rest of the genome. It is constructed from the DNase hypersensitivity subset of the SCREEN v2 database [69], a collection of several hundred cell type tracks from ENCODE [70]. We label a 511 bp hg38 reference sequence with an aggregate propensity score, where a ‘positive-class’ score of 1 through 4 represents the relative number of cell type tracks with DNase hypersensitivity signal at the peak locus (4 being nearly ubiquitous), while a ‘negative-class’ score of 0 represents a partially GC-balanced negative randomly sampled from the genome. In effect, the y-value is a low-dimensional summary of the binarized accessibility across the 727 cell types. Compared to the ccre-propensity task below, this is a simpler, annotative task of DHS ubiquity. The y-values are integers ranging from 0 to 4, and we use Spearman rho in evaluations.
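All scalar GUANinE tasks are scored with Spearman rho. Below is a minimal scoring sketch using scipy, assuming (as the results tables suggest) that scores are reported on a 0-100 scale; the official scoring is performed by the benchmark's evaluation tools, not this snippet.

```python
# Illustrative scoring sketch for GUANinE's scalar tasks (not the official scorer).
from scipy.stats import spearmanr

def guanine_score(y_true, y_pred):
    """Spearman rho between integer-binned targets and real-valued predictions,
    scaled by 100 to match the convention apparently used in the results tables."""
    rho, _ = spearmanr(y_true, y_pred)
    return 100.0 * rho

# Example: propensity classes 0-4 versus model outputs.
print(guanine_score([0, 1, 4, 2, 3], [0.1, 0.9, 3.7, 2.2, 2.8]))  # 100.0 (perfect ranking)
```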
ccre-propensity
This task asks a model to estimate DHS functionality across cell types among the subset of sequences annotated as candidate cis-Regulatory Elements (cCREs) in ENCODE’s Registry of candidate cis-Regulatory Elements, as used in the SCREEN v2 cCRE database [69]. We start with the ‘positive’ DHSes from the dnase-propensity task, and we label them with the corresponding signal from each of four peak-called epigenetic markers: H3K4me3, H3K27ac, CTCF, and DNase hypersensitivity. As before, this propensity corresponds to a weighted sum of binarized signal over different cell types for each experiment type; see Appendix B for details. Each example in the ccre-propensity task has 509 bp of hg38 context centered at the middle of a DHS2. Compared to the dnase-propensity task, this is a more complex, understanding-based task of DHS function. The y-values are integers ranging from 0 (non-marker DHS sites from the ‘positives’ of the dnase-propensity task above) to 4, indicating the highest number of cell types in which a signal was detected (e.g. 100 of the 198 cell types with a CTCF experiment).
Although our dnase- and ccre- propensity tasks both reflect patterns of ubiquity versus cell type specificity, neither explicitly asks for the specific cell types in which a DHS or cCRE is active. This choice to bin (or quantize) our scores allows for clearly defined negatives without worrying about zero-inflation in the loss function, provides universal post-hoc groupings (e.g. 0 vs all, 4 vs all, etc) that reflect different standardized constructions of ubiquity and activity, and reduces the risk of inflated performance due to correlation structures or from prioritizing certain tracks to the detriment of others [64].
2.2. Conservation
Sequence conservation across evolutionarily related organisms suggests the presence of negative selection against deleterious variation and thus biological function [40, 79]. Many per-base or per-element conservation scores can be directly computed from multiple sequence alignments across related organisms [5, 54] (though these alignments may induce bias due to evolutionary divergence). Human Accelerated Regions (HARs [53]) are distinct for their markedly unconserved nature yet high impact on human physiology, as evidenced by their recent positive selection since our last common ancestor with chimpanzees and bonobos.
cons30 and cons100
These tasks are constructed by labeling 512 bp contiguous segments of the hg38 reference genome with the mean phyloP30 or phyloP100 multiple sequence alignment conservation score, respectively, as reported in Pollard et al. [54]. The 30-way alignment roughly corresponds to primates/mammals, while the 100-way corresponds to vertebrates. We minimize the inclusion of HARs by removing high-variance, weakly conserved segments, as these have undergone sharp positive selection in contrast to (ultra)conserved elements. All sequences used for task construction have 95+% coverage across the species in their respective MSAs. The y-values are integer quantile bins of the mean conservation score, ranging from 0 (least conserved) to 24 (most conserved), and we use Spearman rho in evaluations.
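As a rough illustration of how such quantile-binned conservation targets can be derived, the sketch below uses pandas.qcut over placeholder segment means; the variable names and binning call are our own assumptions, not the authors' pipeline.

```python
# Bin per-segment mean phyloP scores into 25 integer quantiles
# (0 = least conserved, 24 = most conserved).
import numpy as np
import pandas as pd

mean_phylop = np.random.default_rng(0).normal(size=10_000)   # placeholder 512 bp segment means
y = pd.qcut(mean_phylop, q=25, labels=False)                  # integer quantile bins
print(y.min(), y.max())                                       # 0 24
```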
2.3. Gene Expression
Gene expression is central to cellular identity and function, and genomic AI represents a promising advance for understanding regulatory genomics and the sequence determinants of mRNA abundance and decay [87, 3, 63, 2]. However, the space of naturally occurring promoter sequences in any one organismal context is limited [76, 17], both in sample size relative to complexity (~30,000 in humans [56]) and in diversity of sequences (e.g. phylogenetically related or constrained promoters, and similar GC-gradient patterning across genes [71]). Experimental techniques such as MPRAs or oligonucleotide assembly offer the means to perturb regulatory element motif grammars and add sequence diversity [76, 16] to ensure that genomic AI models learn causal determinants of gene expression rather than simple sequence features or correlates. The tasks below benefit from substantial size and sequence diversity, though we note that because the experiments are conducted in yeast with exogenous sequences, the performance of models designed for the human genome will be affected by the distribution shift between human and yeast.
gpra-c and gpra-d
The Gigantic Parallel Reporter Assay (GPRA) tasks are large corpora of short, 80 random (+ 30 scaffold) bp promoter sequences in yeast labeled by a gene expression value measured via dual reporter single-cell fluorescence [17]. We re-process and sanitize datasets collected in both the ‘complex’ (gpra-c) and ‘defined’ (gpra-d) growth media as originally reported by Vaishnav et al. [76]. The y-values correspond to the (experimentally) binned fluorescence value of the promoter sequence expression between 0 (lowest) and 17 (highest), and we use Spearman rho in evaluations.
3. Baselines
Rigorous baselining for genomic AI in the context of benchmarks is critical [23, 49], due to the presence of confounding sequence (e.g. dinucleotide frequency [7]) and measurement (e.g. batch effect) factors. We evaluate a selection of neural and non-neural baselines in GUANinE to provide insight into performance, and we encourage subsequent work to include similar baselines. Our neural baselines include few-shot performance of a handful of commonly-used existing convolutional models in genomic AI, as well as a specialized transformer architecture that we evaluate both pretrained and tabula rasa to explore the benefits of pretraining.
3.1. Non-neural Baselines
Our non-neural baselines range from simple GC-content, a strong predictor due to its correlation with sequence function in vertebrates [71], to k-mer frequency models, which have an established record in genomics [25].
GC-content
From its original prominence in isochore theory [10] to the identification of conserved GC-rich patterns in large-scale evolutionary genome databases [5, 71], GC-content (relative to sequence length, also known as G+C%) is highly indicative of function, in part because of its prevalence in and around coding regions. We compute it as the percentage of guanine and cytosine bases in the input sequence.
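A minimal sketch of this baseline feature:

```python
# GC-content (G+C%) of a DNA sequence.
def gc_content(seq: str) -> float:
    seq = seq.upper()
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)

print(gc_content("ACGTGGCC"))  # 75.0
```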
k-mer frequency SVR
We adopt a straightforward linear-kernel SVR on 5-mer frequencies computed from the input sequences [25]. As support vector machines are a special case of perceptrons, this model can be interpreted as a one-layer CNN with a kernel size of five. We use the liblinear scikit-learn implementation [52, 22] with the maximum training sample size supported on a moderate machine.
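A sketch of this baseline under our own simplifying assumptions (random stand-in sequences and labels, default LinearSVR hyperparameters); the paper's training subsets and settings may differ.

```python
# 5-mer frequency features + liblinear-backed linear SVR (sklearn.svm.LinearSVR).
from itertools import product
import random
import numpy as np
from sklearn.svm import LinearSVR

KMERS = ["".join(p) for p in product("ACGT", repeat=5)]        # 4^5 = 1024 features
KMER_INDEX = {k: i for i, k in enumerate(KMERS)}

def kmer_frequencies(seq: str) -> np.ndarray:
    counts = np.zeros(len(KMERS), dtype=np.float32)
    for i in range(len(seq) - 4):
        idx = KMER_INDEX.get(seq[i:i + 5].upper())
        if idx is not None:                                    # skip k-mers containing N
            counts[idx] += 1
    return counts / max(len(seq) - 4, 1)

rng = random.Random(0)
train_sequences = ["".join(rng.choice("ACGT") for _ in range(511)) for _ in range(200)]
train_targets = [rng.randint(0, 4) for _ in range(200)]        # stand-in propensity labels

X = np.stack([kmer_frequencies(s) for s in train_sequences])
svr = LinearSVR(C=1.0, max_iter=5000).fit(X, train_targets)
```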
k-mer frequency kNN
We also evaluate a more complex albeit less statistically efficient class of models, nearest neighbor graphs, on the same 5-mer frequency representations of sequences as above. We use the GPU-accelerated FAISS library [34] to permit the kNN model to scale to millions of training points per task.
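A sketch of this baseline using the core FAISS calls, with an exact CPU index shown for simplicity (the GPU path via faiss.index_cpu_to_gpu follows the same pattern); feature matrices and labels are random stand-ins.

```python
# k-nearest-neighbor regression on 5-mer frequency vectors with FAISS exact L2 search.
import numpy as np
import faiss

d = 1024                                                # 5-mer frequency dimension
xb = np.random.rand(100_000, d).astype("float32")       # stand-in training features
yb = np.random.randint(0, 5, size=100_000)              # stand-in labels
xq = np.random.rand(16, d).astype("float32")            # query features

index = faiss.IndexFlatL2(d)
index.add(xb)
distances, neighbors = index.search(xq, 20)             # 20 nearest neighbors per query
predictions = yb[neighbors].mean(axis=1)                # average neighbor label as prediction
```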
3.2. Existing Neural Baselines
As GUANinE is a supervised benchmark, we include a handful of pretrained multi-task genomic AI models in our baselines. We perform feature extraction on the output of each model, where each element of that output is a predicted experimental measurement (e.g. DNase hypersensitivity, TF binding) for that sequence in one cell type (or line). We pass these features into an L2-regularized layer, as in linear evaluation [84, 13]. Since we do not fine-tune the models, this performance metric is meant to evaluate GUANinE’s tasks and the models’ few-shot performance, rather than to comprehensively score their architectures or fine-tuning potential. 3
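A sketch of this linear-evaluation protocol, with placeholder features standing in for actual DeepSEA/Beluga/Basenji2 outputs; the alpha grid matches Appendix C, while the use of Ridge and Spearman-based model selection is our assumption about a reasonable implementation.

```python
# Freeze the pretrained model, treat its per-track outputs as features, and fit an
# L2-regularized linear head, selecting the regularization strength on the dev set.
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge

X_train, y_train = np.random.rand(1000, 919), np.random.rand(1000)   # placeholder features
X_dev, y_dev = np.random.rand(200, 919), np.random.rand(200)

alphas = [0.1, 0.3, 1, 3, 10, 30]
best_alpha = max(
    alphas,
    key=lambda a: spearmanr(
        y_dev, Ridge(alpha=a).fit(X_train, y_train).predict(X_dev)
    )[0],
)
head = Ridge(alpha=best_alpha).fit(X_train, y_train)
```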
DeepSEA
DeepSEA takes in a 1000-bp sequence and maps it to a vector of 919 output tracks. The model is fast and shallow, with the learned representation from the convolutional layers unraveled into a 50880-dimensional vector and fed into a large, dense layer with over 89% of DeepSEA’s 52.8 M parameters.
Beluga
Beluga is an enhanced version of DeepSEA. It has more layers, an increased input size of 2000 bp, and uses 2002 experimental tracks for training. Similar to DeepSEA, the learned representation from the convolutional layers is fed into a dense layer that contains over 90% of Beluga’s 149.5 M parameters.
Basenji2
Basenji2 is a deeper model prioritizing longer sequence contexts than DeepSEA or Beluga, and it relies heavily on wide, dilated convolutions to reduce its parameter and FLOP counts. Basenji2 was trained on 5313 human and 1643 mouse tracks, including many gene expression (CAGE) tasks unlike the purely epigenomic tracks of DeepSEA and Beluga. We calculate Basenji2’s effective input size to be 54,784 bp4. The dense output layer contains over 26% of Basenji2’s 30.1 M parameters.
3.3. Novel Neural Baselines
We also train large (770 M parameter) transformers on each task, providing a low-context comparison to the longer length baselines above. We use the text-to-text transfer transformer (T5) architecture, an encoder-decoder transformer designed for efficient self-supervised pretraining and transfer learning [59]. We evaluate the impact of tokenization and language modeling in T5 variants as outlined below.
T5
Compared to the more common encoder-only BERT and RoBERTa [19, 45], the T5 utilizes a more efficient span-corruption task for language modeling enabled by decoupling the input and output length via decoding. We use single nucleotide token inputs for all results marked ‘T5.’
T5-515, -2051, -8195
These models correspond to a T5 architecture with tokenized inputs via a unigram language model [42]. The vocabulary sizes evaluated are 512+3, 2048+3, and 8192+3 tokens, where the +3 represents the <PAD>, <EOS>, and <UNK> tokens.
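A sketch of how such a unigram tokenizer can be trained with SentencePiece; the corpus path, model prefix, and exact vocabulary accounting (SentencePiece's built-in specials versus T5's <PAD>/<EOS>/<UNK>) are assumptions rather than the exact hgT5 recipe.

```python
# Train a unigram-LM tokenizer over a DNA corpus (one segment per line).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="hg38_corpus.txt",            # assumed path: repeat-downsampled hg38 segments
    model_prefix="dna_unigram_2048",
    vocab_size=2048,
    model_type="unigram",
    character_coverage=1.0,             # keep every base symbol
)

tok = spm.SentencePieceProcessor(model_file="dna_unigram_2048.model")
print(tok.encode("ACGTACGTAGGCTTAAACGT", out_type=str))
```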
hgT5-515, -2051, -8195
These models are identical to the T5-515, -2051, and -8195 above except that they undergo self-supervised pretraining on a repeat-downsampled subset of hg38 (1.64 Gbp) before fine-tuning. Because context size is a vital factor in language model perplexity [19, 14], we stipulate tokenization [42] for pretraining. See Appendix E for details.
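For intuition, here is a minimal sketch of T5-style span corruption on a token sequence (15% noise density, mean span length 3, as described in Appendix E); sentinel handling is simplified relative to the real T5 preprocessing.

```python
# Corrupt ~15% of tokens in short spans; the model is trained to reconstruct the spans.
import numpy as np

def span_corrupt(tokens, noise_density=0.15, mean_span=3, seed=0):
    rng = np.random.default_rng(seed)
    n = len(tokens)
    n_spans = max(1, round(n * noise_density / mean_span))
    starts = sorted(rng.choice(n - mean_span, size=n_spans, replace=False))
    inputs, targets, cursor, sentinel = [], [], 0, -1
    for s in starts:
        if s < cursor:                      # skip spans overlapping a previous one
            continue
        inputs.extend(tokens[cursor:s])
        inputs.append(sentinel)             # placeholder for <extra_id_k>
        targets.append(sentinel)
        targets.extend(tokens[s:s + mean_span])
        cursor = s + mean_span
        sentinel -= 1
    inputs.extend(tokens[cursor:])
    return inputs, targets

inp, tgt = span_corrupt(list("ACGTACGTAGGCTTAAACGT"))
print(inp, tgt)
```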
4. Results and Key Findings
We briefly provide key findings from and discussion of our supervised baselines in Tables 2 and 3. A comprehensive table of all model performance metrics, including context ablation studies, is included in Table 5 of the Appendix. We have loosely ordered results in the tables by model complexity/input length, sectioned by pretraining regime, if any. DeepSEA, Beluga, and Basenji2 are progressively more heavily supervised (in terms of using more tracks during training) and have successively larger input context sizes.
Table 2:
Performance of non-neural and supervised baselines on GUANinE’s test set. Best score per task along with close scores (if any) are bolded. These results are also presented visually in Figure 1.
Model | dnase-propensity ρ | ccre-propensity ρ | cons30 ρ | cons100 ρ | gpra-c ρ | gpra-d ρ |
---|---|---|---|---|---|---|
GC-content | 20.5533 | 12.9891 | 38.1307 | 39.5828 | 52.9688 | 56.9824 |
5-mer LinSVR | 36.3022 | 26.3708 | 49.3504 | 45.0548 | 69.4348 | 73.6469 |
5-mer kNN | 31.9876 | 23.3579 | 52.5776 | 44.5351 | 58.5608 | 62.1082 |
* T5 | 33.0548 | 27.4643 | 52.5576 | 44.0542 | 84.6738 | 88.5912 |
DeepSEA | 34.6927 | 23.2158 | 49.6163 | 43.4004 | 68.6607 | 72.9729 |
Beluga | 64.9577 | 49.6688 | 58.6143 | 48.8816 | 66.5479 | 69.9363 |
Basenji2 | 60.5149 | 56.4568 | 71.0417 | 59.1229 | 69.5484 | 72.9872 |
Table 3:
Comparison of dnase-propensity with the DHS-subtask of ccre-propensity. Best and best low-context (512 bp) scores are bolded.
Model | dnase-propensity ρ | DHS-subtask ρ |
---|---|---|
GC-content | 20.5533 | 13.0236 |
DeepSEA | 34.6927 | 22.7526 |
- low-context | 34.1037 | 19.5812 |
Beluga | 64.9577 | 44.5226 |
- low-context | 62.4819 | 39.9648 |
Basenji2 | 60.5149 | 56.4814 |
- low-context | 58.3123 | 36.1738 |
T5-2051 | 33.8244 | 28.5074 |
hgT5-2051 | 56.2698 | 39.2990 |
4.1. Functional element tasks
Our dnase-propensity and ccre-propensity tasks demonstrate gradated performance with increasing model complexity and context size. In Table 2 we see that Beluga (most parameters) and Basenji2 (largest context size) have the strongest performance for both the dnase- and ccre-propensity tasks. The middling performance of DeepSEA, the Linear k-mer frequency SVR, and the T5 suggests a high degree of intrinsic difficulty for these tasks, with adequate complexity for benchmarking pretrained models. Compared to common metrics such as average auPRC or auROC [86, 87, 38, 6], our propensity score metrics allow for tissue-agnostic predictions from sequence and face few issues of variable class-balance across tracks5 [39].
Annotation vs understanding
Digging deeper into our functional element tasks (Table 3), we note an interesting distinction between the dnase-propensity task, for which Beluga has the top performance despite its relatively limited 2 kbp input length, and the DHS-subtask of our ccre-propensity task, on which Basenji2 outperforms it due to its additional context. Interestingly, the dnase-propensity task performance is less dependent on context size than the DHS-subtask. Recall that our dnase-propensity task asks a model to predict relative accessibility for an arbitrary genomic sequence, while our ccre-propensity task asks for relative accessibility of candidate cis-regulatory elements (a more difficult task focused on understanding the cell type specificity of sequences that all have some propensity for accessibility by virtue of being cCREs). Beluga’s model complexity allows it to recognize and annotate local accessibility accurately, but compared to Basenji2, it lacks the distal regulatory context that informs potential cell type specificity.
We also see the benefits of additional supervision (i.e. multi-task training) when comparing DeepSEA, Beluga, and Basenji2 to the T5-2051 model on these tasks. Despite its much larger parameter count (770 M), the T5 is on par with the smaller DeepSEA (50 M) for dnase-propensity, although the additional parameters aid in the DHS-subtask. The significant jump in performance for both tasks with self-supervised pretraining (hgT5-2051) suggests the issue is not with the T5 architecture, but rather a data bottleneck on the original task without additional (self-)supervision. In fact, with self-supervised pretraining, hgT5-2051 nearly competes on these functional element tasks with Beluga’s highly informative 2002-track multi-task training.
4.2. Conservation tasks
We see limited improvement with additional supervision or model complexity on the cons100 task, where Basenji2 (with the largest sequence context) has the best (modest) performance, suggesting a performance bottleneck, or possibly even a hard ceiling, on this task. It may be that limited conservation information at the vertebrate scale can be inferred from human genome sequence alone (and thus, if desired for downstream tasks, it should be passed as ancillary input [20]). The cons30 task appears more tractable, with both Beluga and Basenji2 outperforming less complex (non-neural and DeepSEA) or less supervised (T5) models. Basenji2, in particular, appears to have a strong implicit representation of primate sequence conservation, perhaps from its joint training on the mouse genome, which may boost its overall performance. We view conservation estimation as a promising area for benchmarking and model development in genomic AI; however, additional formulations of conservation or an emphasis on conserved element comprehension are needed.
4.3. Gene expression GPRA tasks
DeepSEA, Beluga, and Basenji2 have underwhelming performance relative to baselines, which may be a consequence of organismal or technical transfer and distribution shift (short, exogenous yeast sequences rather than long, endogenous human ones). These models were also trained for inter-sequence annotation rather than intra-sequence (variation) understanding, perhaps limiting their maximum performance. The T5’s success here confirms that these tasks are tractable despite the esoteric data distribution. 6
5. Reflections on Models
5.1. Non-Neural Baselines
The 5-mer frequency Linear SVR performs remarkably well on several tasks, outperforming even the T5 on cons100 and dnase-propensity. We believe its success on dnase-propensity is due to its statistical efficiency7. Optimizations to k-mer frequency SVRs would likely boost performance slightly higher [25], and we encourage inclusion of similar non-neural baselines in future benchmarks. The 5-mer frequency kNN performs comparably on the cons30 and cons100 tasks, with moderate but generally lower performance on the other tasks.
5.2. DeepSEA, Beluga, and Basenji2
Unsurprisingly, the ~50 kbp of additional context in Basenji2 gives it a strong advantage over DeepSEA, Beluga, and the T5/hgT5 models on GUANinE, especially for ccre-propensity and the conservation tasks. However, Basenji2 is dependent on this context, and underperforms Beluga on several tasks when input size is ablated. Basenji2 retains moderate performance on the conservation tasks without context, possibly due to its bi-organismal training. In contrast, Beluga, while less performant, is less sensitive to input size ablation (Table 5), and it maintains a faster runtime on its moderate context size (2 kbp). DeepSEA is the shallowest of the pretrained models and underperforms relative to the Linear SVR on dnase-propensity and to both the Linear SVR and the T5 on ccre-propensity, highlighting the need for rigorous non-neural baselining in the field. It does, however, perform decently on the GPRA tasks, which do not benefit from large context sizes.
5.3. T5
The T5 models are the largest tested models by parameter count, which may explain their strong performance on the GPRA tasks. However, we note that performance could be further optimized through a more extensive hyperparameter search, or adjustments to our output tokenization (designed for transfer learning). Hyperparameter search for large models is a known problem in AI [35, 59], and until genomic AI matures, there may be little prior information about which hyperparameters are effective for large models.
6. hgT5 and the impact of self-supervised pretraining
Pretraining with self-supervised language modeling (span corruption) has a strong, positive impact on T5 performance in our benchmark, as seen in the hgT5 scores of Table 4. On GPRA tasks, the loss incurred by tokenization (i.e. T5 vs T5-515, etc) is largely offset by pretraining, while on conservation tasks, particularly the 30-way (primate/mammal) alignment, self-supervision is able to overcome the performance bottleneck seen in the non-pretrained models of Table 2. Language modeling may have an implicit connection to sequence conservation, if conserved elements or motifs tend to be most imputable. On the dnase- and ccre-propensity tasks, pretraining strongly improves performance as well — notably, this improvement is obtained without additional sequence context, in contrast to other pretrained models. This suggests that combining self-supervision with supervised pretraining or fine-tuning may yield greater performance gains.
Table 4:
T5 and self-supervised hgT5 variant performance on GUANinE’s test set. Best reported score per task along with close scores (if any) are bolded. The hgT5-8195 is also presented visually in Figure 1.
Model | dnase-propensity ρ | ccre-propensity ρ | cons30 ρ | cons100 ρ | gpra-c ρ | gpra-d ρ |
---|---|---|---|---|---|---|
*T5 | 33.0548 | 27.4643 | 52.5576 | 44.0542 | 84.6738 | 88.5912 |
T5-515 | 34.3998 | 26.5567 | 49.7183 | 41.6013 | 83.1336 | 87.4443 |
T5-2051 | 33.8244 | 30.2823 | 49.8113 | 39.5893 | 84.4885 | 86.9079 |
T5-8195 | 32.3782 | 30.0196 | 50.2905 | 40.1706 | 81.8273 | 85.7500 |
hgT5-515 | 57.0310 | 36.1465 | 58.8254 | 45.6393 | 84.4092 | 88.5267 |
hgT5-2051 | 56.2698 | 39.3459 | 58.6516 | 46.7293 | 85.1381 | 88.1346 |
hgT5-8195 | 55.5411 | 36.7083 | 58.4973 | 46.9948 | 83.5787 | 87.5641 |
Finally, our pretraining corpus made use of only the human genome; however, the near-identity of the human and other primate reference genomes, and the relative uniqueness of primates [65] versus other genera [71], may yield diminishing returns from pretraining on distantly related organisms. For human- (and mouse-)specific tasks, the significant amount of supervised annotations may also limit the benefits of self-supervision, but significant future work on language models in genomic AI is still warranted. Compared to other DNA language models [47, 15, 33], our hg38 pretraining corpus is repeat-downsampled [44] and contains significant human variation [68], which may improve its utility. See Appendix D for implementation details.
7. Future Work
Functional elements
We envision a plethora of follow-ups and revisions to our dnase- and ccre-propensity tasks, ranging from more informative summaries of experiments (e.g. SVD) to ‘metatissue’ propensity scores. We do not anticipate functional element understanding to be ‘solved’ in the near future, and we plan to include the more difficult task of variant interpretation in functional elements in the creation of future benchmarks.
Conservation
The use of alternative metrics such as background selection [46] or explicitly controlling for non-local drivers of conservation (e.g. chromosome size [82], etc) may make this type of task more tractable. The combination of evolutionary data with regions of mutational constraint [79, 65], or observed selection [40], may yield richer notions of conservation, particularly if our dial of evolutionary scale were directly tunable.
Gene expression
The GPRA tasks, even after data cleaning, were by far the largest in GUANinE, and strategic subsetting of ‘difficult’ or ‘informative’ sequences [62] or multi-task training on gpra-c and -d may make for more efficient tasks. The development of high throughput exogenous promoter expression methods for mammalian cells will also provide benchmarks for transfer learning from human-centric models.
Potential future tasks
As genomic AI and our in silico evaluation capabilities advance, so too will our ability to rely on smaller scale benchmarks or finer-grained tasks, or on tasks subject to significant gene-environment interaction or environmental confounding, e.g. GWAS lead SNPs or eQTLs [1]. Splicing [32], taxonomic classification and comparative (meta)genomics [47, 71], mRNA degradation [2], oncogenic or loss-of-function potential, phylogenetic or evolutionary distance estimation [28], distal effects comprehension [36], CpG methylation [4], 3D conformation modeling [85], and promoter expression plasticity [21], among numerous others, are all readily or near-readily available tasks that may prove valuable for future benchmarks.
8. Conclusion
Machine learning in genomics has greatly advanced since the days of ORF detection via Fourier transform [74], and with the increasing use of genomic AI models, we face the need for rigorous model selection and standardized evaluation, as in other AI fields. Centralized, well-documented benchmarks are essential for such practice, and here we present such a prototypical benchmark for genomic AI, GUANinE, with future expansions and refinement in mind. GUANinE offers concise evaluation across multiple tasks for learned DNA sequence representations, and our curation of tasks with large-scale training sets offers ample opportunity for rigorous baselining, fine-tuning, and model selection. We expect benchmarking, alongside efforts in model training, interpretation, and auditing, with the vital critique of both bioethicists and AI ethicists, will continue to shape the field and enable the development of models for biomedical applications and genome interpretation.
Benchmark and Model Availability
GUANinE, as well as baseline models, are available at https://github.com/ni-lab/guanine. Training sets are provided in both full and few-shot (1%) versions. To reduce the risk of overfitting and ensure identical evaluation, the test set is provided without labels, and final scores can be calculated via a prediction server; see the repository for instructions. Our hg38 corpus and the T5 and hgT5 models can be found there for fine-tuning or auditing.
Figure 1:
Performance of different methods. Note the T5 and hgT5-8195 use ≤ 512 bp of input sequence.
Table 1:
Benchmark summary statistics.
Task | dnase-prop (Functional Elements) | ccre-prop (Functional Elements) | cons30 (Conservation) | cons100 (Conservation) | gpra-c (Gene Expression) | gpra-d (Gene Expression) |
---|---|---|---|---|---|---|
Train set | 2.8 M | 7.1 M | 5.4 M | 5.0 M | 23 M | 16 M |
Dev set | 35 k | 91 k | 55 k | 51 k | 230 k | 170 k |
Test set | 44 k | 101 k | 55 k | 51 k | 230 k | 170 k |
Split | chr21 (dev) / chr22 (test) | chr21 (dev) / chr22 (test) | Random | Random | Random | Random |
Organism | Human | Human | Human | Human | Yeast | Yeast |
Target | Scalar | Multi-task | Scalar | Scalar | Scalar | Scalar |
Output range | 0 - 4 | 0 - 4 | 0 - 24 | 0 - 24 | 0 - 17 | 0 - 17 |
Acknowledgments
Thanks to Pooja Kathail, Ruchir Rastogi, Ryan Keivanfar, Aniketh Reddy, and other members of the Ioannidis lab for invaluable recommendations and feedback. Special thanks to Ryan Chung for guidance on genomic pretraining. Compute was generously supported by Google’s TPU Research Cloud (TRC, formerly TFRC). This work was partially supported by an NIH Training Grant (T32HG000047), an Okawa Foundation Research Grant, and a grant from the UC Noyce Initiative for Computational Transformation.
Appendices
A. Extended Performance Metrics
Table 5:
Omnibus test set scores, including subtask breakout metrics for the ccre-propensity multi-task, and including low-context versions of DeepSEA, Beluga, and Basenji2. Low-context sequences were centered (to a bin, if necessary), truncated to 512 bp of input sequence, and padded with zeroes, which were typically seen during training as Ns. Best reported scores per column, along with the best low-context scores if different, are bolded.
Model | dnase-propensity ρ | ccre-propensity ρ | ccre: dnase ρ | ccre: ctcf ρ | ccre: h3k27ac ρ | ccre: h3k4me3 ρ | cons30 ρ | cons100 ρ | gpra-c ρ | gpra-d ρ |
---|---|---|---|---|---|---|---|---|---|---|
GC-content | 20.5533 | 12.9891 | 13.0236 | 12.6525 | 13.2115 | 13.0690 | 38.1307 | 39.5828 | 52.9688 | 56.9824 |
5-mer LinSVR | 36.3022 | 26.3708 | 23.3577 | 27.6775 | 27.7758 | 26.6722 | 49.3504 | 45.0548 | 69.4348 | 73.6469 |
5-mer kNN | 31.9876 | 23.3579 | 22.0112 | 24.5605 | 23.8530 | 23.0069 | 52.5776 | 44.5351 | 58.5608 | 62.1082 |
* T5 | 33.0548 | 27.4643 | 26.8736 | 28.2986 | 27.1943 | 27.4908 | 52.5576 | 44.0542 | 84.6738 | 88.5912 |
DeepSEA | 34.6927 | 23.2158 | 22.7526 | 24.0649 | 23.3712 | 22.6745 | 49.6163 | 43.4004 | 68.6607 | 72.9729 |
- low-context | 34.1037 | 20.4241 | 19.5812 | 21.2250 | 20.9838 | 19.9063 | 48.1403 | 42.7819 | 67.8070 | 71.9632 |
Beluga | 64.9577 | 49.6688 | 44.5226 | 51.6006 | 52.5047 | 50.0473 | 58.6143 | 48.8816 | 66.5479 | 69.9363 |
- low-context | 62.4819 | 39.8753 | 39.9648 | 40.5294 | 39.6200 | 39.3870 | 54.9345 | 45.6777 | 65.6781 | 69.0738 |
Basenji2 | 60.5149 | 56.4568 | 56.4814 | 56.3126 | 57.3374 | 56.3651 | 71.0417 | 59.1229 | 69.5484 | 72.9872 |
- low-context | 58.3123 | 35.8789 | 36.1738 | 36.6209 | 35.7707 | 34.9501 | 55.4415 | 46.8741 | 68.6921 | 72.3155 |
T5-515 | 34.3998 | 26.5567 | 23.8884 | 28.957 | 26.7359 | 26.6455 | 49.7183 | 41.6013 | 83.1336 | 87.4443 |
T5-2051 | 33.8244 | 30.2823 | 28.5074 | 31.263 | 30.9719 | 30.3869 | 49.8113 | 39.5893 | 84.4885 | 86.9079 |
T5-8195 | 32.3782 | 30.0196 | 28.9806 | 32.3107 | 27.7989 | 30.9883 | 50.2905 | 40.1706 | 81.8273 | 85.7500 |
hgT5-515 | 57.0310 | 36.1465 | 33.9715 | 36.9684 | 37.0397 | 36.6064 | 58.8254 | 45.6393 | 84.4092 | 88.5267 |
hgT5-2051 | 56.2698 | 39.3459 | 39.2990 | 40.6559 | 39.3374 | 38.0914 | 58.6516 | 46.7293 | 85.1381 | 88.1346 |
hgT5-8195 | 55.5411 | 36.7083 | 35.5757 | 37.7378 | 35.9764 | 37.5434 | 58.4973 | 46.9948 | 83.5787 | 87.5641 |
Figure 2:
GC-content distribution of sequences before and after downsampling negatives (orange) to reduce the difference in GC-content versus positives (blue).
B. Additional preprocessing information
B.1. dnase-propensity
We downloaded SCREEN v2 DHS locations from ENCODE file ENCFF503GCK. We removed 6 assays corresponding to legacy HeLa and A549 cell lines due to their age at the time of the experiment, leaving 727 DNase hypersensitivity assays.8 We extracted 3.1 M non-zero signal DHSes, then downsampled sequences with more than 25% repeat elements in proportion to their percentage of nucleotides masked by RepeatMasker (higher repeat percentages were more heavily downsampled), which left us with 2.3 M sequences. We augmented these ‘positive’ sequences with ≈ 400 k ‘negative’ sequences from the rest of the genome, which were downsampled to reduce the difference in GC-content between positive and negative sequences. The pre- and post-downsampling GC-content distributions of positives and negatives can be seen in Figure 2.
We next constructed our propensity score for each DHS sequence. Specifically, we counted the number of cell types in which each consensus DHS was peak-called, downweighting cancer and immortalized cell lines by ½ to prioritize primary tissues and other biosamples, and summed this weighted count into a propensity score. Negative class sequences were given a propensity of 0, as these regions did not report DNase hypersensitivity. Positive class propensity scores were binned into levels 1 (highly cell-type-specific) through 4 (near-ubiquitous). Our end distribution of classes is approximately 18.8% class 0 (negatives), 31.2% class 1, 25.0% class 2, 20.0% class 3, and 5.0% class 4.
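A rough sketch of this construction is shown below; the class bin edges are placeholders chosen for illustration, not the thresholds used to build the released task.

```python
# Weighted count of peak-called cell types per consensus DHS, binned into classes 1-4.
import numpy as np

def propensity_class(peak_called, is_cancer_line, bin_edges=(10, 50, 200)):
    """peak_called, is_cancer_line: boolean arrays over the 727 DNase tracks."""
    weights = np.where(is_cancer_line, 0.5, 1.0)      # down-weight cancer/immortalized lines
    score = float((np.asarray(peak_called) * weights).sum())
    return 1 + int(np.digitize(score, bin_edges))     # 1 (cell-type-specific) ... 4 (near-ubiquitous)

print(propensity_class(np.ones(727, bool), np.zeros(727, bool)))  # 4
```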
B.2. ccre-propensity
To construct our ccre-propensity task, we began with our 2.3 M repeat-downsampled DHS ‘positive’ sequences from the dnase-propensity task and constructed propensity scores for additional epigenetic signals as well. In particular, we annotated each sequence with a vector of its raw experimental measurements from the SCREEN v2 cCRE registry (available from ENCODE) and constructed propensity scores per epigenetic signal by summing these weighted tracks (as in the dnase-propensity task) for 527 DHS, 198 CTCF, 314 H3K4Me3, and 224 H3K27ac experiments. Compared to the dnase-propensity task, a smaller number of cell types (those for which a second epigenetic experiment was available) are present in the cCRE dataset, so many of the DHSes have zero (0) representative cell lines for each epigenetic signal. We targeted a similar class balance as the dnase-propensity task for classes 2, 3, and 4 for each of the epigenetic signals, with up to 20% of the data for each subtask being negative, 0-labeled DHSes without any signal. This negative set was obtained by downsampling those with low GC-content, as in the dnase-propensity negatives.
B.3. gpra-c and gpra-d
We downloaded the data from Vaishnav et al. [76] and preprocessed it as follows. Most pertinently, we found that 20-25% of sequences had variable length (non-80 bp), and that length was strongly associated with differences in observed gene expression. As the T5 and other architectures can easily detect sequence length, we chose to prune oligonucleotides that were not 80 (+ 30 scaffold) bp long to reduce inflated performance due to length in our benchmark. These sequences may in fact contain biological meaning, but our goal for this benchmark was to reduce spurious factors [24]. Additionally, while Vaishnav et al. [76] reported floating point y-values (the average per DNA barcode across multiple observations), we found that integerizing the y-values into gene expression bins preserved ~99.8% of the observed variance, possibly indicative of technical biases during data collection [75]. Future work on statistical correction may be invaluable for finer-grained experiments.
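A sketch of this cleaning step; the file name, column layout, and simple rounding-based integerization are assumptions about the released data rather than the exact pipeline.

```python
# Keep only 80 bp (+ 30 bp scaffold) promoters and integerize expression into bins.
import pandas as pd

df = pd.read_csv("gpra_complex_media.tsv", sep="\t",
                 names=["sequence", "expression"])           # assumed file and columns
df = df[df["sequence"].str.len() == 110]                     # 80 random + 30 scaffold bp
df["expression_bin"] = df["expression"].round().astype(int)  # integer bins (~0-17)
```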
C. Baseline methods
Baseline model performances (not including hgT5 language models) are presented in Table 2. Non-neural models were trained using the maximum dataset size on a moderate machine (51 GB). Note that the cons30 and cons100 tasks have relatively dense 5-mer frequency tables, so we used a 512-dimensional PCA projection before fitting the kNN or Linear SVR on those tasks. The GPRA task representations are sparser, although due to memory constraints we used only half of the sequences during training (11.5 M for gpra-c and 8 M for gpra-d).
Features from the pretrained supervised models (DeepSEA, Beluga, Basenji2) were extracted from 1% of the training data, and the final L2-regularized linear model for each was selected based on performance on the development set with validation choices of α = [0.1, 0.3, 1, 3, 10, 30]. Basenji2 features were computed as the average of predictions from three consecutive bins centered on the inputs, which led to slightly improved performance on GUANinE compared to predictions from a single bin. For the four human tasks (non-GPRA), full-context sequence from hg38 was input to each model. Apples-to-apples model performance on ablated input sequence lengths (512 bp) is presented in omnibus Table 5. For the yeast (GPRA) tasks, whose exogenous sequences do not come with a natural context, we cross-validated 10 model-length-appropriate fixed scaffolds chosen from promoter contexts in s288c (yeast) and hg38 (human) genomes with 1% of the training data (0.5% for Basenji2). The best performing scaffold was used on our development set and during testing. Finally, we used the experimental track outputs after standardization for each of the DeepSEA, Beluga, and Basenji2 models, although other combinations of layers or intermediate representations may yield higher performance. For the Basenji2 model, we only used the 5313 outputs from the human output head, as GUANinE is meant to primarily evaluate model performance on human sequences, but the combined use of the human and mouse heads of the model may result in higher performance, particularly on the conservation tasks.
D. hg38 pretraining corpus
The impact of language modeling on model performance is displayed in Table 4. All T5 scores reported in Tables 3 and 4 are the average of two independent runs to reduce the impact of stochasticity.
For our language modeling, we aimed to:
use tokenization to increase our context length within 512-token segments [42, 14];
reduce reference bias introduced by training on the reference genome;
reduce the frequency of LINE1 and other repeat elements (and phylogenetically proximal genes), as our models prioritize functional genomics rather than mapping or assembly-related tasks, and repeated words or phrases in language corpora are known to reduce training efficacy and model quality [59, 44];
augment the limited size (relative to language modeling) of the human genome.
As such, we took the approach of narrowing down the human genome by first removing N-rich sequences (2 or more contiguous Ns), and then further removing repeat-rich regions (> 50% over a 768-bp sliding window) from our corpus. We then removed resulting sequences that were less than 1024 bp in length, as these would not meet our augmentation procedures (below). We finally employed a combination of locality-sensitive hashing (LSH) and blastn deduplication to remove segments with greater than 90% identity (dropping the shorter of the two sequences).
These procedures left us with 1.64 Gbp of genomic sequence before augmentation. Separately, we obtained genetic variants from unrelated individuals in the 1000 Genomes Project Phase 3 [68] with an allele frequency of 1e-3 or greater. We then created 4x and 64x upsampled versions of our 1.64 Gbp corpus (via offset sweeps through segments for each tokenization size), and we augmented each sampling by upsampling minor alleles at random. The upsampled frequency of a minor allele was increased relative to its population allele frequency: any given 10% frequency minor allele had an approximately 30% chance of appearing, independent of other variants, with similar upsampling when multiple minor alleles were present to increase entropy (maximum entropy for minor alleles would be 50% frequency for one minor allele, 33% each for two minor alleles, etc, such that the major allele is nearly on par in terms of frequency). The 4x versions were used for training the ULM tokenizer [42]. After tokenization, we removed any sequences less than 128 tokens in length, which made our higher-token corpora slightly smaller and more prone to overfitting. In the future, we suggest including additional naturally occurring human variation, perhaps via training on pangenomes, with the intent to reduce possible algorithmic bias of pretrained models in genomic AI [11, 55].
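As a small illustration of the repeat filtering step above, the sketch below assumes RepeatMasker soft-masking (lowercase bases) as the repeat signal and flags any segment in which a 768 bp window exceeds 50% repeat content; the real pipeline may have consumed RepeatMasker annotations directly.

```python
# Sliding-window repeat filter over a (soft-masked) genomic segment.
def repeat_rich(seq: str, window: int = 768, max_repeat_frac: float = 0.5) -> bool:
    """True if any length-`window` window is more than `max_repeat_frac` repeat-masked."""
    is_repeat = [c.islower() for c in seq]
    running = sum(is_repeat[:window])
    if running > max_repeat_frac * min(window, len(seq)):
        return True
    for i in range(window, len(seq)):
        running += is_repeat[i] - is_repeat[i - window]
        if running > max_repeat_frac * window:
            return True
    return False

print(repeat_rich("ACGT" * 500))            # False: no masked bases
print(repeat_rich("acgt" * 500))            # True: fully repeat-masked
```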
E. T5 and hgT5 (pre)training
All T5 models and pretrained hgT5 models were fine-tuned on each task separately, with a batch size of 2^16 tokens and a learning rate of 1e-4 for 2^18 steps. We used the checkpoint corresponding to the best development set performance for testing. Given the interrelatedness of human genome sequences, larger batches and/or diversity-based learning rates per minibatch may be warranted.
Self-supervised pretraining was done with a 15% noise density and mean span length of 3 tokens. We tested multiple distinct configurations for hgT5 self-supervised pretraining, and we converged on a batch size of 2^17 tokens for 2^17 steps with the standard learning rate schedule, although we did observe some overfitting in later steps with higher tokenization (likely due to seeing the corpus a second time, something uncommon in language applications [59]). During pretraining, we created three replicates of each hgT5 model, but to reduce the risk of overfitting, we discarded the lowest perplexity model for each tokenization size.
Our hgT5-515, -2051, and -8195 models achieve bits per character (BPC) scores of 1.55, 1.51, and 1.48, respectively, on our development set. At the full length of 512 tokens of input, the hgT5-8195 model achieves a BPC of 1.46. For comparison, a tetragram model with backoff achieved a BPC of 1.76 [37], while Zaheer et al. [83] report better BPC, albeit on unreleased, large-context language models with greater tokenization and numerous repeats in their corpus.
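For reference, BPC can be computed from a model's total token-level negative log-likelihood by converting nats to bits and normalizing by the number of DNA characters (not tokens), which keeps differently tokenized models comparable; a sketch, not the authors' evaluation script:

```python
# Bits per character from a summed cross-entropy measured in nats.
import math

def bits_per_character(total_nll_nats: float, n_characters: int) -> float:
    return total_nll_nats / (n_characters * math.log(2))

# A model averaging 1.0 nat per base corresponds to ~1.44 BPC; 2.0 BPC is the
# entropy of uniformly random A/C/G/T.
print(bits_per_character(total_nll_nats=1.0e6, n_characters=1_000_000))
```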
Worth noting is that our choice of Unigram Language Modeling [42] instead of the more popular byte-pair encoding was motivated by conservatism regarding benign variation; the ULM tokenizers feature slight redundancy as a form of regularization. We see this in our corpus, as tokenization increases the BPC of input DNA sequences from 2.0 to ≈ 2.09 across lengths (the starting point for our BPC during pretraining).
Footnotes
1. GUANinE data and evaluation tools are available at https://github.com/ni-lab/guanine
2. The 2 bp decrease allows for length-512 models to pass a ‘task-code’ token as input [58]
3. Because DeepSEA, Beluga, and Basenji2 all had chromosome 22 present in their training data, while we use it for evaluation of the dnase- and ccre-propensity tasks, their performance on these tasks may be exaggerated (as our targets are roughly summary statistics of their training data across cell types).
4. During runtime, Basenji2 parallelizes its receptive field over a much longer, 131 kbp sequence.
5. auROC is insensitive to class balance and can be misleading for highly imbalanced datasets (such as epigenomic tracks), while auPRC is sensitive but incomparable across different class balances.
6. Vaishnav et al. [76] included a transformer-esque model that, after extensive training time, scored more highly.
7. SVRs are much more data efficient than neural nets and perform well on tasks with smaller effective sample sizes.
8. Future benchmarking work may consider removing the K562 line due to its age as well.
Contributor Information
eyes s. robson, Center for Computational Biology, UC Berkeley, Berkeley, CA 94720
Nilah M. Ioannidis, Department of Electrical Engineering and Computer Sciences, UC Berkeley, Berkeley, CA 94720
References
- [1] Abdellaoui A., Dolan C. V., Verweij K. J. H., and Nivard M. G. Gene–environment correlations across geographic regions affect genome-wide association studies. Nature Genetics, 54(9):1345–1354, Sep 2022. doi: 10.1038/s41588-022-01158-0.
- [2] Agarwal V. and Kelley D. R. The genetic and biochemical determinants of mRNA degradation rates in mammals. Genome Biology, 23(1):245, Nov 2022. doi: 10.1186/s13059-022-02811-x.
- [3] Agarwal V. and Shendure J. Predicting mRNA abundance directly from genomic sequence using deep convolutional neural networks. Cell Rep, 31(7):107663, May 2020.
- [4] Angermueller C., Lee H. J., Reik W., and Stegle O. DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning. Genome Biology, 18(1):67, Apr 2017. doi: 10.1186/s13059-017-1189-z.
- [5] Armstrong J., Hickey G., Diekhans M., Fiddes I. T., Novak A. M., Deran A., Fang Q., Xie D., Feng S., Stiller J., Genereux D., Johnson J., Marinescu V. D., Alföldi J., Harris R. S., Lindblad-Toh K., Haussler D., Karlsson E., Jarvis E. D., Zhang G., and Paten B. Progressive Cactus is a multiple-genome aligner for the thousand-genome era. Nature, 587(7833):246–251, Nov 2020. doi: 10.1038/s41586-020-2871-y.
- [6] Avsec Ž., Agarwal V., Visentin D., Ledsam J. R., Grabska-Barwinska A., Taylor K. R., Assael Y., Jumper J., Kohli P., and Kelley D. R. Effective gene expression prediction from sequence by integrating long-range interactions. Nature Methods, 18(10):1196–1203, Oct 2021. doi: 10.1038/s41592-021-01252-x.
- [7] Avsec Ž., Weilert M., Shrikumar A., Krueger S., Alexandari A., Dalal K., Fropf R., McAnany C., Gagneur J., Kundaje A., and Zeitlinger J. Base-resolution models of transcription-factor binding reveal soft motif syntax. Nature Genetics, 53(3):354–366, Mar 2021. doi: 10.1038/s41588-021-00782-6.
- [8] Barz B. and Denzler J. Do we train on test data? Purging CIFAR of near-duplicates. Journal of Imaging, 6(6), 2020. doi: 10.3390/jimaging6060041.
- [9] Bender E. M., Gebru T., McMillan-Major A., and Shmitchell S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT '21), pages 610–623, New York, NY, USA, 2021. doi: 10.1145/3442188.3445922.
- [10] Bernardi G. Misunderstandings about isochores. Part 1. Gene, 276(1):3–13, 2001. doi: 10.1016/S0378-1119(01)00644-8.
- [11] Blodgett S. L., Barocas S., Daumé III H., and Wallach H. Language (technology) is power: A critical survey of "bias" in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454–5476, Online, July 2020. doi: 10.18653/v1/2020.acl-main.485.
- [12] Bowman S. R., Angeli G., Potts C., and Manning C. D. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal, Sept. 2015. doi: 10.18653/v1/D15-1075.
- [13] Chen T., Kornblith S., Norouzi M., and Hinton G. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1597–1607. PMLR, 2020.
- [14] Dai Z., Yang Z., Yang Y., Carbonell J., Le Q., and Salakhutdinov R. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, Florence, Italy, July 2019. doi: 10.18653/v1/P19-1285.
- [15] Dalla-Torre H., Gonzalez L., Mendoza-Revilla J., Carranza N. L., Grzywaczewski A. H., Oteri F., Dallago C., Trop E., de Almeida B. P., Sirelkhatim H., Richard G., Skwark M., Beguir K., Lopez M., and Pierrot T. The Nucleotide Transformer: Building and evaluating robust foundation models for human genomics. bioRxiv, 2023. doi: 10.1101/2023.01.11.523679.
- [16] de Boer C. G. and Taipale J. Hold out the genome: A roadmap to solving the cis-regulatory code. bioRxiv, 2023. doi: 10.1101/2023.04.20.537701.
- [17] de Boer C. G., Vaishnav E. D., Sadeh R., Abeyta E. L., Friedman N., and Regev A. Deciphering eukaryotic gene-regulatory logic with 100 million random promoters. Nat Biotechnol, 38(1):56–65, Jan 2020.
- [18] Deng J., Dong W., Socher R., Li L.-J., Li K., and Fei-Fei L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
- [19] Devlin J., Chang M.-W., Lee K., and Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. doi: 10.18653/v1/N19-1423.
- [20] Dey K. K., van de Geijn B., Kim S. S., Hormozdiari F., Kelley D. R., and Price A. L. Evaluating the informativeness of deep learning annotations for human complex diseases. Nature Communications, 11(1):4703, Sep 2020. doi: 10.1038/s41467-020-18515-4.
- [21] Einarsson H., Salvatore M., Vaagensø C., Alcaraz N., Bornholdt J., Rennie S., and Andersson R. Promoter sequence and architecture determine expression variability and confer robustness to genetic variants. eLife, 11:e80943, Nov 2022. doi: 10.7554/eLife.80943.
- [22] Fan R.-E., Chang K.-W., Hsieh C.-J., Wang X.-R., and Lin C.-J. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
- [23] Ferrari Dacrema M., Cremonesi P., and Jannach D. Are we really making much progress? A worrying analysis of recent neural recommendation approaches. In Proceedings of the 13th ACM Conference on Recommender Systems (RecSys '19), pages 101–109, New York, NY, USA, 2019. doi: 10.1145/3298689.3347058.
- [24] Gebru T., Morgenstern J., Vecchione B., Vaughan J. W., Wallach H., Daumé III H., and Crawford K. Datasheets for datasets. Commun. ACM, 64(12):86–92, Nov 2021. doi: 10.1145/3458723.
- [25] Ghandi M., Lee D., Mohammad-Noori M., and Beer M. A. Enhanced regulatory sequence prediction using gapped k-mer features. PLOS Computational Biology, 10(7):1–15, Jul 2014. doi: 10.1371/journal.pcbi.1003711.
- [26] Grešová K., Martinek V., Čechák D., Šimeček P., and Alexiou P. Genomic Benchmarks: a collection of datasets for genomic sequence classification. BMC Genomic Data, 24(1):25, May 2023. doi: 10.1186/s12863-023-01123-8.
- [27] He K., Zhang X., Ren S., and Sun J. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. doi: 10.1109/CVPR.2016.90.
- [28] Holmes I. H. Historian: accurate reconstruction of ancestral sequences and evolutionary rates. Bioinformatics, 33(8):1227–1229, Jan 2017. doi: 10.1093/bioinformatics/btw791.
- [29] Hoskins R. A., Repo S., Barsky D., Andreoletti G., Moult J., and Brenner S. E. Reports from CAGI: The Critical Assessment of Genome Interpretation. Hum Mutat, 38(9):1039–1041, Sep 2017.
- [30] Ioannidis N. M., Rothstein J. H., Pejaver V., et al. REVEL: An ensemble method for predicting the pathogenicity of rare missense variants. Am J Hum Genet, 99(4):877–885, Oct 2016.
- [31] Ioannidis N. M., Davis J. R., DeGorter M. K., Larson N. B., McDonnell S. K., French A. J., Battle A. J., Hastie T. J., Thibodeau S. N., Montgomery S. B., Bustamante C. D., Sieh W., and Whittemore A. S. FIRE: functional inference of genetic variants that regulate gene expression. Bioinformatics, 33(24):3895–3901, Aug 2017. doi: 10.1093/bioinformatics/btx534.
- [32] Jaganathan K., Kyriazopoulou Panagiotopoulou S., McRae J. F., et al. Predicting splicing from primary sequence with deep learning. Cell, 176(3):535–548.e24, 2019. doi: 10.1016/j.cell.2018.12.015.
- [21].Einarsson H., Salvatore M., Vaagensø C., Alcaraz N., Bornholdt J., Rennie S., and Andersson R.. Promoter sequence and architecture determine expression variability and confer robustness to genetic variants. eLife, 11:e80943, nov 2022. ISSN 2050-084X. doi: 10.7554/eLife.80943. URL 10.7554/eLife.80943. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [22].Fan R.-E., Chang K.-W., Hsieh C.-J., Wang X.-R., and Lin C.-J.. Liblinear: A library for large linear classification. Journal of machine learning research, 9(Aug):1871–1874, 2008. [Google Scholar]
- [23].Ferrari Dacrema M., Cremonesi P., and Jannach D.. Are we really making much progress? a worrying analysis of recent neural recommendation approaches. In Proceedings of the 13th ACM Conference on Recommender Systems, RecSys ’19, page 101–109, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450362436. doi: 10.1145/3298689.3347058. URL 10.1145/3298689.3347058. [DOI] [Google Scholar]
- [24].Gebru T., Morgenstern J., Vecchione B., Vaughan J. W., Wallach H., III H. D., and Crawford K.. Datasheets for datasets. Commun. ACM, 64(12):86–92, nov 2021. ISSN 0001-0782. doi: 10.1145/3458723. URL 10.1145/3458723. [DOI] [Google Scholar]
- [25].Ghandi M., Lee D., Mohammad-Noori M., and Beer M. A.. Enhanced regulatory sequence prediction using gapped k-mer features. PLOS Computational Biology, 10(7):1–15, 07 2014. doi: 10.1371/journal.pcbi.1003711. URL 10.1371/journal.pcbi.1003711. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [26].Grešová K., Martinek V., Čechák D., Šimeček P., and Alexiou P.. Genomic benchmarks: a collection of datasets for genomic sequence classification. BMC Genomic Data, 24(1):25, May 2023. ISSN 2730-6844. doi: 10.1186/s12863-023-01123-8. URL 10.1186/s12863-023-01123-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [27].He K., Zhang X., Ren S., and Sun J.. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. doi: 10.1109/CVPR.2016.90. [DOI] [Google Scholar]
- [28].Holmes I. H.. Historian: accurate reconstruction of ancestral sequences and evolutionary rates. Bioinformatics, 33(8):1227–1229, January 2017. ISSN 1367-4803. doi: 10.1093/bioinformatics/btw791. URL 10.1093/bioinformatics/btw791. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [29].Hoskins R. A., Repo S., Barsky D., Andreoletti G., Moult J., and Brenner S. E.. Reports from CAGI: The Critical Assessment of Genome Interpretation. Hum Mutat, 38(9):1039–1041, Sep 2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [30].Ioannidis N. M., Rothstein J. H., Pejaver V., Middha S., McDonnell S. K., Baheti S., Musolf A., Li Q., Holzinger E., Karyadi D., Cannon-Albright L. A., Teerlink C. C., Stanford J. L., Isaacs W. B., Xu J., Cooney K. A., Lange E. M., Schleutker J., Carpten J. D., Powell I. J., Cussenot O., Cancel-Tassin G., Giles G. G., MacInnis R. J., Maier C., Hsieh C. L., Wiklund F., Catalona W. J., Foulkes W. D., Mandal D., Eeles R. A., Kote-Jarai Z., Bustamante C. D., Schaid D. J., Hastie T., Ostrander E. A., Bailey-Wilson J. E., Radivojac P., Thibodeau S. N., Whittemore A. S., and Sieh W.. REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am J Hum Genet, 99(4):877–885, Oct 2016. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [31].Ioannidis N. M., Davis J. R., DeGorter M. K., Larson N. B., McDonnell S. K., French A. J., Battle A. J., Hastie T. J., Thibodeau S. N., Montgomery S. B., Bustamante C. D., Sieh W., and Whittemore A. S.. FIRE: functional inference of genetic variants that regulate gene expression. Bioinformatics, 33(24):3895–3901, August 2017. ISSN 1367-4803. doi: 10.1093/bioinformatics/btx534. URL 10.1093/bioinformatics/btx534. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [32].Jaganathan K., Kyriazopoulou Panagiotopoulou S., McRae J. F., Darbandi S. F., Knowles D., Li Y. I., Kosmicki J. A., Arbelaez J., Cui W., Schwartz G. B., Chow E. D., Kanterakis E., Gao H., Kia A., Batzoglou S., Sanders S. J., and Farh K. K.-H.. Predicting splicing from primary sequence with deep learning. Cell, 176 (3):535–548.e24, 2019. ISSN 0092-8674. doi: 10.1016/j.cell.2018.12.015. URL https://www.sciencedirect.com/science/article/pii/S0092867418316295. [DOI] [PubMed] [Google Scholar]
- [33].Ji Y., Zhou Z., Liu H., and Davuluri R. V.. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics, 37(15):2112–2120, February 2021. ISSN 1367-4803. doi: 10.1093/bioinformatics/btab083. URL 10.1093/bioinformatics/btab083. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [34].Johnson J., Douze M., and Jégou H.. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2019. [Google Scholar]
- [35].Kaplan J., McCandlish S., Henighan T., Brown T. B., Chess B., Child R., Gray S., Radford A., Wu J., and Amodei D.. Scaling laws for neural language models. CoRR, abs/2001.08361, 2020. URL https://arxiv.org/abs/2001.08361. [Google Scholar]
- [36].Karollus A., Mauermeier T., and Gagneur J.. Current sequence-based models capture gene expression determinants in promoters but mostly ignore distal enhancers. Genome Biology, 24(1):56, Mar 2023. ISSN 1474-760X. doi: 10.1186/s13059-023-02899-9. URL 10.1186/s13059-023-02899-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [37].Katz S.. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3):400–401, 1987. doi: 10.1109/TASSP.1987.1165125. [DOI] [Google Scholar]
- [38].Kelley D. R.. Cross-species regulatory sequence activity prediction. PLOS Computational Biology, 16(7):1–27, July 2020. doi: 10.1371/journal.pcbi.1008050. URL 10.1371/journal.pcbi.1008050. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [39].Kim M. and Hwang K.-B.. An empirical evaluation of sampling methods for the classification of imbalanced data. PLOS ONE, 17(7):1–22, July 2022. doi: 10.1371/journal.pone.0271260. URL 10.1371/journal.pone.0271260. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [40].Kircher M., Witten D. M., Jain P., O’Roak B. J., Cooper G. M., and Shendure J.. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet, 46(3):310–315, Mar 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
- [41].Krizhevsky A., Sutskever I., and Hinton G. E.. Imagenet classification with deep convolutional neural networks. In Pereira F., Burges C., Bottou L., and Weinberger K., editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf. [Google Scholar]
- [42].Kudo T.. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1007. URL https://aclanthology.org/P18-1007. [DOI] [Google Scholar]
- [43] Lappalainen T., Sammeth M., Friedländer M. R., ’t Hoen P. A., Monlong J., Rivas M. A., Gonzàlez-Porta M., Kurbatova N., Griebel T., Ferreira P. G., et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature, 501(7468):506–511, Sep 2013.
- [44] Lee K., Ippolito D., Nystrom A., Zhang C., Eck D., Callison-Burch C., and Carlini N. Deduplicating training data makes language models better. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8424–8445, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.577. URL https://aclanthology.org/2022.acl-long.577.
- [45] Liu Y., Ott M., Goyal N., Du J., Joshi M., Chen D., Levy O., Lewis M., Zettlemoyer L., and Stoyanov V. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692, 2019. URL http://arxiv.org/abs/1907.11692.
- [46] McVicker G., Gordon D., Davis C., and Green P. Widespread genomic signatures of natural selection in hominid evolution. PLOS Genetics, 5(5):1–16, May 2009. doi: 10.1371/journal.pgen.1000471. URL https://doi.org/10.1371/journal.pgen.1000471.
- [47] Mock F., Kretschmer F., Kriese A., Böcker S., and Marz M. Taxonomic classification of DNA sequences beyond sequence similarity using deep neural networks. Proceedings of the National Academy of Sciences, 119(35):e2122636119, 2022. doi: 10.1073/pnas.2122636119. URL https://www.pnas.org/doi/abs/10.1073/pnas.2122636119.
- [48] Mukwende M. Mind the gap: A clinical handbook of signs and symptoms in black and brown skin. Wounds UK, pages 16–16, 2020.
- [49] Musgrave K., Belongie S., and Lim S.-N. A metric learning reality check. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV, pages 681–699, Berlin, Heidelberg, 2020. Springer-Verlag. ISBN 978-3-030-58594-5. doi: 10.1007/978-3-030-58595-2_41. URL https://doi.org/10.1007/978-3-030-58595-2_41.
- [50] Oubounyt M., Louadi Z., Tayara H., and Chong K. T. DeePromoter: Robust promoter predictor using deep learning. Frontiers in Genetics, 10, 2019. ISSN 1664-8021. doi: 10.3389/fgene.2019.00286. URL https://www.frontiersin.org/articles/10.3389/fgene.2019.00286.
- [51] Paik C., Aroca-Ouellette S., Roncone A., and Kann K. The World of an Octopus: How Reporting Bias Influences a Language Model’s Perception of Color. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 823–835, Online and Punta Cana, Dominican Republic, Nov. 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.63. URL https://aclanthology.org/2021.emnlp-main.63.
- [52] Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M., and Duchesnay E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- [53] Pollard K. S., Salama S. R., Lambert N., Lambot M.-A., Coppens S., Pedersen J. S., Katzman S., King B., Onodera C., Siepel A., Kern A. D., Dehay C., Igel H., Ares M., Vanderhaeghen P., and Haussler D. An RNA gene expressed during cortical development evolved rapidly in humans. Nature, 443(7108):167–172, Sep 2006. ISSN 1476-4687. doi: 10.1038/nature05113. URL https://doi.org/10.1038/nature05113.
- [54] Pollard K. S., Hubisz M. J., Rosenbloom K. R., and Siepel A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res, 20(1):110–121, Jan 2010.
- [55] Popejoy A. B. and Fullerton S. M. Genomics is failing on diversity. Nature, 538(7624):161–164, Oct 2016. ISSN 1476-4687. doi: 10.1038/538161a. URL https://doi.org/10.1038/538161a.
- [56] Périer R. C., Praz V., Junier T., Bonnard C., and Bucher P. The Eukaryotic Promoter Database (EPD). Nucleic Acids Research, 28(1):302–303, January 2000. ISSN 0305-1048. doi: 10.1093/nar/28.1.302. URL https://doi.org/10.1093/nar/28.1.302.
- [57] Racimo F., Sankararaman S., Nielsen R., and Huerta-Sánchez E. Evidence for archaic adaptive introgression in humans. Nature Reviews Genetics, 16(6):359–371, Jun 2015. ISSN 1471-0064. doi: 10.1038/nrg3936. URL https://doi.org/10.1038/nrg3936.
- [58] Radford A., Wu J., Child R., Luan D., Amodei D., and Sutskever I. Language models are unsupervised multitask learners. 2019.
- [59] Raffel C., Shazeer N., Roberts A., Lee K., Narang S., Matena M., Zhou Y., Li W., and Liu P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1), Jan 2020. ISSN 1532-4435.
- [60] Rajpurkar P., Zhang J., Lopyrev K., and Liang P. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas, Nov. 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1264. URL https://aclanthology.org/D16-1264.
- [61] Reiter E. A structured review of the validity of BLEU. Computational Linguistics, 44(3):393–401, Sept. 2018. doi: 10.1162/coli_a_00322. URL https://aclanthology.org/J18-3002.
- [62] Schreiber J., Bilmes J., and Noble W. S. apricot: Submodular selection for data summarization in Python. Journal of Machine Learning Research, 21(161):1–6, 2020. URL http://jmlr.org/papers/v21/19-467.html.
- [63] Singh R., Lanchantin J., Robins G., and Qi Y. DeepChrome: deep-learning for predicting gene expression from histone modifications. Bioinformatics, 32(17):i639–i648, August 2016. ISSN 1367-4803. doi: 10.1093/bioinformatics/btw427. URL https://doi.org/10.1093/bioinformatics/btw427.
- [64] Standley T., Zamir A., Chen D., Guibas L., Malik J., and Savarese S. Which tasks should be learned together in multi-task learning? In Proceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org, 2020.
- [65] Sundaram L., Gao H., Padigepati S. R., McRae J. F., Li Y., Kosmicki J. A., Fritzilas N., Hakenberg J., Dutta A., Shon J., Xu J., Batzoglou S., Li X., and Farh K. K. Predicting the clinical impact of human mutation with deep neural networks. Nat Genet, 50(8):1161–1170, Aug 2018.
- [66] Tay Y., Dehghani M., Abnar S., Shen Y., Bahri D., Pham P., Rao J., Yang L., Ruder S., and Metzler D. Long Range Arena: A benchmark for efficient transformers. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=qVyeW-grC2k.
- [67] Taylor W. L. “Cloze procedure”: A new tool for measuring readability. Journalism Quarterly, 30(4):415–433, 1953. doi: 10.1177/107769905303000401. URL https://doi.org/10.1177/107769905303000401.
- [68] The 1000 Genomes Project Consortium et al. A global reference for human genetic variation. Nature, 526(7571):68–74, Oct 2015. ISSN 1476-4687. doi: 10.1038/nature15393. URL https://doi.org/10.1038/nature15393.
- [69] The ENCODE Project Consortium. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature, 583(7818):699–710, Jul 2020. ISSN 1476-4687. doi: 10.1038/s41586-020-2493-4. URL https://doi.org/10.1038/s41586-020-2493-4.
- [70] The ENCODE Project Consortium, Moore J., Purcaro M., et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature, 583(7818):699–710, Jul 2020. ISSN 1476-4687. doi: 10.1038/s41586-020-2493-4. URL https://doi.org/10.1038/s41586-020-2493-4.
- [71] The G10K Consortium et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature, 592(7856):737–746, Apr 2021. ISSN 1476-4687. doi: 10.1038/s41586-021-03451-0. URL https://doi.org/10.1038/s41586-021-03451-0.
- [72] The GTEx Consortium et al. Genetic effects on gene expression across human tissues. Nature, 550(7675):204–213, Oct 2017. ISSN 1476-4687. doi: 10.1038/nature24277. URL https://doi.org/10.1038/nature24277.
- [73] The Roadmap Epigenomics Consortium et al. Integrative analysis of 111 reference human epigenomes. Nature, 518(7539):317–330, Feb 2015. ISSN 1476-4687. doi: 10.1038/nature14248. URL https://doi.org/10.1038/nature14248.
- [74] Tiwari S., Ramachandran S., Bhattacharya A., Bhattacharya S., and Ramaswamy R. Prediction of probable genes by Fourier analysis of genomic sequences. Bioinformatics, 13(3):263–270, June 1997. ISSN 1367-4803. doi: 10.1093/bioinformatics/13.3.263. URL https://doi.org/10.1093/bioinformatics/13.3.263.
- [75] Trippe B. L., Huang B., DeBenedictis E. A., Coventry B., Bhattacharya N., Yang K. K., Baker D., and Crawford L. Randomized gates eliminate bias in sort-seq assays. Protein Science, 31(9):e4401, 2022. doi: 10.1002/pro.4401. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/pro.4401.
- [76] Vaishnav E. D., de Boer C. G., Molinet J., Yassour M., Fan L., Adiconis X., Thompson D. A., Levin J. Z., Cubillos F. A., and Regev A. The evolution, evolvability and engineering of gene regulatory DNA. Nature, 603(7901):455–463, Mar 2022. ISSN 1476-4687. doi: 10.1038/s41586-022-04506-6. URL https://doi.org/10.1038/s41586-022-04506-6.
- [77] van Alten S., Domingue B. W., Faul J., Galama T., and Marees A. T. Correcting for volunteer bias in GWAS uncovers novel genetic variants and increases heritability estimates. medRxiv, 2022. URL https://api.semanticscholar.org/CorpusID:262032876.
- [78] Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., Kaiser Ł., and Polosukhin I. Attention is all you need. In Guyon I., Luxburg U. V., Bengio S., Wallach H., Fergus R., Vishwanathan S., and Garnett R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
- [79] Vitsios D., Dhindsa R. S., Middleton L., Gussow A. B., and Petrovski S. Prioritizing non-coding regions based on human genomic constraint and sequence context with deep learning. Nature Communications, 12(1):1504, Mar 2021. ISSN 2041-1723. doi: 10.1038/s41467-021-21790-4. URL https://doi.org/10.1038/s41467-021-21790-4.
- [80] Wang A., Singh A., Michael J., Hill F., Levy O., and Bowman S. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium, Nov. 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5446. URL https://aclanthology.org/W18-5446.
- [81] Wang A., Pruksachatkun Y., Nangia N., Singh A., Michael J., Hill F., Levy O., and Bowman S. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. In Wallach H., Larochelle H., Beygelzimer A., d’Alché-Buc F., Fox E., and Garnett R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf.
- [82] Waters P. D., Patel H. R., Ruiz-Herrera A., Álvarez González L., Lister N. C., Simakov O., Ezaz T., Kaur P., Frere C., Grützner F., Georges A., and Graves J. A. M. Microchromosomes are building blocks of bird, reptile, and mammal chromosomes. Proceedings of the National Academy of Sciences, 118(45):e2112494118, 2021. doi: 10.1073/pnas.2112494118. URL https://www.pnas.org/doi/abs/10.1073/pnas.2112494118.
- [83] Zaheer M., Guruganesh G., Dubey K. A., Ainslie J., Alberti C., Ontanon S., Pham P., Ravula A., Wang Q., Yang L., and Ahmed A. Big Bird: Transformers for longer sequences. In Larochelle H., Ranzato M., Hadsell R., Balcan M., and Lin H., editors, Advances in Neural Information Processing Systems, volume 33, pages 17283–17297. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/c8512d142a2d849725f31a9a7a361ab9-Paper.pdf.
- [84] Zhang R., Isola P., and Efros A. A. Colorful image colorization. In Leibe B., Matas J., Sebe N., and Welling M., editors, Computer Vision – ECCV 2016, pages 649–666, Cham, 2016. Springer International Publishing. ISBN 978-3-319-46487-9.
- [85] Zhang Y., An L., Xu J., Zhang B., Zheng W. J., Hu M., Tang J., and Yue F. Enhancing Hi-C data resolution with deep convolutional neural network HiCPlus. Nature Communications, 9(1):750, Feb 2018. ISSN 2041-1723. doi: 10.1038/s41467-018-03113-2. URL https://doi.org/10.1038/s41467-018-03113-2.
- [86] Zhou J. and Troyanskaya O. G. Predicting effects of noncoding variants with deep learning-based sequence model. Nature Methods, 12(10):931–934, Oct 2015. ISSN 1548-7105. doi: 10.1038/nmeth.3547. URL https://doi.org/10.1038/nmeth.3547.
- [87] Zhou J., Theesfeld C. L., Yao K., Chen K. M., Wong A. K., and Troyanskaya O. G. Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk. Nat Genet, 50(8):1171–1179, Aug 2018.