Nature Communications. 2025 Dec 2;17:186. doi: 10.1038/s41467-025-66861-y

Discovering the complete enhancer map of human herpesviruses using a natural language processing model

Nilabja Roy Chowdhury 1,#, Deepanway Ghosal 2,#, Vyacheslav Gurevich 1, Meir Shamay 1
PMCID: PMC12780183  PMID: 41330946

Abstract

Enhancers are distal cis-regulatory elements that dictate complex transcriptional repertoire. Herpes viruses exhibit programmed latent and lytic gene expression depending on the infected tissue and physiological state. Previously, using a systematic functional assay, we identified six enhancers within the genome of Kaposi’s sarcoma-associated herpesvirus (KSHV). In this study, we present a natural language processing model (NLP)-based tool, ENHAvir, that is trained with these six enhancers and non-enhancer control sequences from the KSHV genome. ENHAvir identifies known enhancers and predicted novel enhancer elements in human herpesviruses. The activity of the predicted enhancers in HSV-2, HCMV, HHV-6, HHV-7, and EBV is confirmed in an enhancer reporter assay. The terminal repeats of all the herpes viruses also serve as a strong enhancer. Comparing herpesvirus enhancers with human enhancers reveals conserved enhancer signatures and the involvement of Alu elements. Here, we present an AI tool that successfully predicts enhancers in both viral and human genomes.

Subject terms: Machine learning, Herpes virus, Gene expression


Artificial intelligence (AI) can detect and predict patterns that are hidden from the human eye and from conventional homology-detection tools. Enhancers are distal DNA cis-regulatory elements that regulate complex transcriptional repertoire. A natural language processing (NLP) model was trained using only six enhancer sequences from the Kaposi’s sarcoma-associated herpesvirus (KSHV) genome. This tool, termed ENHAvir, can identify known enhancers and predict novel enhancer elements in other herpesviruses, different viruses, and the human genome. The activity of the predicted enhancers in HSV-2, HCMV, HHV-6, HHV-7, and EBV was confirmed experimentally, enabling the creation of a comprehensive enhancer map of human herpesviruses. All human herpesviruses contain terminal repeats (TRs), which play important roles in cleaving the viral genome into genome-size units, genome encapsidation, and genome circularization following entry into the nucleus of a newly infected cell. This study adds another role for the TR of all human herpesviruses, a strong enhancer with the features of a “viral super enhancer”. Comparing herpesvirus enhancers with human enhancers revealed conserved enhancer signatures and the involvement of Alu elements. Here, an AI tool is presented that successfully predicts enhancers in both viral and human genomes.

Introduction

Enhancers are cis-regulatory elements (CREs) that enhance transcription from their target gene promoters in an orientation- and position-independent manner1–3. These CREs are thus responsible for tissue-specific gene regulation and physiological homeostasis4–6. They regulate distal promoters by interacting with the respective gene promoters through genomic loops and, therefore, are key players in the successful formation of active Topologically Associating Domains (TADs)7–9. Silencer elements have the opposite function to enhancers, i.e., they repress target gene expression by recruiting repressors and interacting physically with the respective gene promoters10,11. However, recent studies have shown that an enhancer can act as a silencer and vice versa, depending on the cellular context12,13.

Since enhancers can be located at variable distances, upstream or downstream of a gene, or even within introns3, predicting their presence is challenging. Several methods can be applied to identify these regulatory sequences. Chromatin Immunoprecipitation followed by sequencing (ChIP-seq) of different activating histone marks (like H3K27Ac, H3K4Me1, H3K4Me3)14,15, assays for identifying open chromatin regions (like ATAC-seq16, FAIRE-seq17, and DNase-seq18–20), and combinations of these methods might indicate the presence of an enhancer. Enhancer-promoter contacts might also be identified using chromosome conformation capture (3C)21,22 and its genome-wide versions like 4C23, HiC24, micro-C25, and HiChIP26. Specific histone marks such as H3K27Ac are enriched within active enhancer regions27. However, repressed enhancers are devoid of this mark. Additional marks, such as H3K122Ac, H3K4Me1/2, etc., can also signify an enhancer28–30. Although promoters and enhancers were previously regarded as separate entities, recent studies have shown that bidirectional active promoters can also serve as enhancers, termed ‘e-promoters’, which show promoter-specific H3K4Me3 histone marks31–34. The picture becomes more complicated because the promoter of one gene can act as an enhancer for another gene35–38, and active enhancers frequently give rise to transcripts known as ‘enhancer-derived RNAs’39,40. The search for enhancers becomes even more difficult in the case of viral genomes, where every genomic sequence of the compact genome has multiple roles. Self-transcribing active Regulatory Regions sequencing (STARR-seq)41,42 is a powerful functional assay that exploits a downstream enhancer’s ability to induce mRNA transcription from an upstream weak promoter. However, all the aforementioned techniques are both laborious and time-consuming. Enhancer-prediction tools have been developed to identify and indicate possible enhancer elements in a given genome. 
Many such tools have been developed based on de novo k-mers, nucleotide, and syntax rules43–48. There are also tools that consider known biological data of transcription factor binding sites (TFBS) or histone mark occupancy instances49–51. Although these models perform reasonably well, they often produce false-positive or false-negative predictions because of biases towards the biological data on which they were built52,53.

Transformer-based Natural Language Processing (NLP) models employ a self-attention mechanism to assign weights to the importance of different parts of an input sequence. This makes them very precise in identifying the patterns of long-range dependencies of certain characters in the text. These models can learn the micro-patterns of a given text sentence without any prior knowledge of the data and can be fine-tuned for downstream tasks of interest. Although many enhancer-predicting tools are available, one that focuses on the viral genome is still missing. Viral enhancers contribute to the transcriptional repertoire of viral genes, affecting pathogenesis, infectivity, and oncogenesis (in the case of oncogenic viruses). Ongoing efforts to identify and repress or activate viral enhancers promise to pave the way for effective treatment of viral diseases. In this study, we present ENHAvir, an NLP model-based tool that can successfully predict the enhancers in a viral genome. The tool was trained with the previously published KSHV enhancer data54. ENHAvir learned the sequence micro-patterns of these enhancers and was able to identify known and predict novel enhancer sequences for other human herpesviruses: Epstein-Barr virus (EBV), Herpes simplex virus 1 (HSV-1) and 2 (HSV-2), Varicella zoster virus (VZV), Human cytomegalovirus (HCMV), Human herpesvirus 6 (HHV-6A and 6B) and 7 (HHV-7), and Murine gammaherpesvirus 68 (MHV68). The novel enhancers in the HSV-2, HCMV, HHV-6, HHV-7 and EBV genomes predicted by ENHAvir were validated using the STARR-seq-based enhancer reporter approach. The EBV Terminal Repeats (TR) present features of a “viral super-enhancer” responsive to the viral activator ZTA and lytic inducers, suggesting a role in the latency to lytic transition. When a virus unrelated to herpes viruses, such as SV40, was analyzed, ENHAvir successfully located the known enhancer in the SV40 genome. 
ENHAvir also successfully predicted enhancers in the human genome, and examples for the Fos, Jun, DPPA3, and Myc gene loci are presented. In summary, ENHAvir, trained with only six KSHV enhancer sequences and their activity values, not only precisely predicts viral enhancers but also points to conserved, cross-species enhancer signatures shared between a virus and its host genome.

Results

ENHAvir trained on KSHV data can predict enhancers of related ɣ-herpesviruses

Enhancers are enriched with different histone marks and are commonly found in the open chromatin regions of the genome17,20,55. However, predicting enhancers is a challenging task since not all open chromatin or active chromatin histone marks can predict enhancers. Specifically, in the compacted genomes of viruses, promoters can also serve as enhancers. In a recent study, we identified several enhancers in the KSHV genome and their regulatory activity on the viral transcriptional repertoire54. Here, we hypothesized that the KSHV enhancer data may be used to train an NLP model-based tool to predict enhancers in other viruses. We used the DeBERTa v3 language model56 in our framework. DeBERTa v3 is an encoder-style language model, making it a suitable candidate for extracting representations that could be used for new tasks. This state-of-the-art model is pre-trained with the replaced token detection objective. The model also uses a gradient-disentangled embedding-sharing method that improves the efficiency of training and the quality of the language model. It reached the top of the leaderboard in the GLUE benchmark57, a widely used task suite consisting of various challenging language understanding problems. We used the six KSHV enhancer sequences and their respective activation values to train an NLP model (Fig. 1). We also included non-enhancer open chromatin regions from the KSHV genome as negative controls to train the NLP model. Then we asked the tool to predict the known enhancers in KSHV (Supplementary Fig. 1). The enhancer-predicting tool successfully predicted the six known enhancers in the KSHV genome. We termed this tool ENHAvir, as it is an enhancer-predicting tool based on viral sequences.
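As a rough illustration of how a trained sequence model could be turned into a genome-wide prediction track, the sketch below scans a sequence with fixed-size windows and scores each one. The window size, the step, and the GC-content scorer are all placeholders chosen for illustration: the text here does not describe how ENHAvir segments a genome, and a real run would replace `gc_fraction` with the trained model's predicted activation value.

```python
from typing import Callable, List, Tuple


def sliding_windows(seq: str, window: int, step: int) -> List[Tuple[int, str]]:
    """Return (start, subsequence) pairs that tile seq with fixed-size windows."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    last_start = max(len(seq) - window, 0)
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the end of the sequence is covered
    return [(s, seq[s:s + window]) for s in starts]


def gc_fraction(subseq: str) -> float:
    """Placeholder scorer (NOT the ENHAvir model): fraction of G/C bases."""
    if not subseq:
        return 0.0
    return sum(base in "GCgc" for base in subseq) / len(subseq)


def score_track(
    seq: str,
    scorer: Callable[[str], float] = gc_fraction,
    window: int = 200,  # assumed window size, for illustration only
    step: int = 50,     # assumed step, for illustration only
) -> List[Tuple[int, int, float]]:
    """Return (start, end, score) per window in 0-based, half-open coordinates."""
    return [
        (start, start + len(sub), scorer(sub))
        for start, sub in sliding_windows(seq, window, step)
    ]
```

The resulting list of intervals and scores has the same shape as a browser track, so peaks can be compared against annotated regions.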

Fig. 1. Workflow of the ENHAvir pipeline.

Fig. 1

Enhancer sequences and activation values of the KSHV genome were used to train the Language Model of ENHAvir (as mentioned in the methods section) on the KSHV genome. Following that, ENHAvir could predict enhancers and their potential activation values of a given DNA sequence. The DNA loop portion of the figure was adopted from IsadoraofIbiza’s work in Wikimedia commons (https://commons.wikimedia.org/wiki/File:Transcription_Factors.svg#/media/File:Transcription_Factors.svg/2) under CC BY 3.0 license.

The Epstein-Barr virus (EBV) belongs to the human ɣ-herpesviruses together with KSHV. Since KSHV and EBV belong to the same subfamily of herpesviruses, we decided to start predicting enhancers using the EBV genome. Performing prediction with ENHAvir on the EBV genome revealed several enhancer peaks in the following regions: BamHI-W repeats, OriLyt-L, BPLF-1, BALF-3, EBNA1 last exon, BILF-2/OriLyt-R, and Terminal Repeats (Fig. 2a). Among them, the OriLyt-L and BILF-2/OriLyt-R regions have been reported to have enhancer activity58–61, indicating the ability of ENHAvir to predict enhancers. In agreement with our identification of these regions as enhancers, they show HiC connections with several sites in the EBV genome and carry active histone marks (H3K4Me1/2/3 and H3K27Ac) (Fig. 2a & Supplementary Fig. 2).

Fig. 2. ENHAvir trained with the KSHV data can predict EBV enhancers.

Fig. 2

a ENHAvir-predicted enhancer peaks (black colored track) along with HiC connections (black curves: latency I and teal curves: Latency III), H3K27Ac (dark green), and H3K4Me-1/2/3 ChIP-seq data (purple, blue, and caramel colors, respectively) for the EBV genome109 are presented on the UCSC genome browser view158. The known enhancers are marked in red boxes. b Locations of the ENHAvir-predicted enhancers in the EBV genome, along with the cloning strategy and c expected outcomes for functional enhancers and silencers.

To validate the enhancer activity of the newly predicted enhancers, the predicted enhancer peaks, including ~200 bp upstream and downstream of each of the regions, were cloned downstream to the luciferase reporter gene under the control of a weak promoter (vector_ORI_empty) (Fig. 2b, c). All these clones resulted in an ENHAvir predicted EBV reporter library (as depicted in the materials and methods section and Fig. 2b). We co-transfected this library into Akata, an EBV-infected B-cell line, and performed reporter assays during the latency and the lytic phases. A reporter plasmid containing the ZTA promoter62 and the OriP (a known EBV enhancer) reporter plasmids were used as positive controls. The same plasmid from the EBV reporter library containing an inert (non-enhancer) genomic region, previously used in STARR-seq63, served as a negative control. The activity of the predicted enhancers was measured as relative activity to the negative control. A Renilla reporter plasmid was also co-transfected and served as an internal control to determine transfection efficiency.
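The cloning step above extends each predicted peak by roughly 200 bp on either side. A minimal sketch of that coordinate arithmetic follows; the function name, the 0-based half-open coordinate convention, and the clamping to the ends of a linear sequence are assumptions made for illustration, not details given in the text.

```python
from typing import Tuple


def flanked_region(
    peak_start: int,
    peak_end: int,
    genome_length: int,
    flank: int = 200,
) -> Tuple[int, int]:
    """Extend a predicted peak by `flank` bp on each side.

    Coordinates are 0-based and half-open (an assumed convention).
    The result is clamped so it never runs past either end of the sequence.
    """
    if not 0 <= peak_start < peak_end <= genome_length:
        raise ValueError("peak must lie within the genome")
    start = max(peak_start - flank, 0)
    end = min(peak_end + flank, genome_length)
    return start, end
```

For example, a hypothetical peak at positions 1000 to 1500 would be cloned as the region 800 to 1700, while a peak near either end of the sequence is simply truncated at the boundary.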

Reporter assays in Akata cells revealed that BALF-3 (BALF), BPLF-1 (BPLF), EBNA1 last exon (EBNA1_Lex), and TRs, along with the positive controls, showed a significantly high (at least 3-fold) reporter signal, which supports their enhancer activity (Fig. 3a, g–k). The only region that did not show any enhancer activity in Akata cells was the BamHI-W repeats (BamHI-W). To test the activity of EBV enhancers during the lytic cycle, we induced the cells with phorbol ester (TPA) and sodium butyrate (NaB). Upon lytic induction, all the enhancers, except for BamHI-W, were further activated. EBV gene expression during latency can be divided into several latency programs. Latency I is a silent program that expresses only EBNA1, BARF0, and EBER and is found in Burkitt’s lymphoma. Latency II expresses, in addition, the latent membrane proteins LMP-1, LMP-2A, and LMP-2B, and has been associated with Hodgkin’s disease, T-cell non-Hodgkin’s lymphoma, and nasopharyngeal carcinoma. Latency III is the most active latency program and usually expresses all forms of EBNAs, EBERs, and LMPs, and therefore is usually restricted to immunocompromised individuals and associated malignancies, such as post-transplant and AIDS-related lymphoproliferative disorders, and lymphoblastoid cell lines64–67. Akata cells present latency I. To test the enhancer activity in latency III, we repeated the experiment in Raji cells. The same enhancers, BALF, BPLF, EBNA1_Lex, and TRs, were active in Raji cells as well (Fig. 3b, g–k).

Fig. 3. Functional validation of the novel EBV enhancers predicted by ENHAvir.

Fig. 3

Enhancer reporter plasmids containing the predicted EBV enhancers were transfected into EBV-positive Akata (a) and Raji (b) cells or the EBV-negative BJAB cells (c), and dual luciferase reporter assay was performed. Cells were either left untreated following the transfection (uninduced) or treated with TPA and NaB for 24 h, followed by lysate collection for the reporter assay. Enhancer reporter plasmids containing the predicted EBV enhancers were transfected into EBV-negative cell lines, B-cell lymphoma BJAB cells (d), and epithelial HEK293T (e) and HCT-116 (f) cells. The cells were co-transfected with an empty vector or ZTA expression vector, and 48 h later, lysates were prepared for dual luciferase reporter assays. Relative Luciferase Units (RLU) represent the firefly luciferase units divided by the internal control Renilla units. The RLU values are presented as fold activation relative to the untransfected/uninduced negative control. Black bars represent the uninduced (a–c) or empty vector-transfected cells (d–f), red bars indicate their induced (a–c) counterparts, and teal bars indicate ZTA expression (d–f). The black and red, or teal, dotted lines represent their respective uninduced and induced, or ZTA-expressing states. g–k Summary of the enhancer reporter assays as a fold change compared to the negative control in untreated/uninduced cells. Data represent mean ± S.D. of n = 9, three independent experiments, each with three biological replicates. Values in the graph indicate P values from 2-tailed T-tests. (ns) denotes non-significant values. Source data are provided as a Source Data file.
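The normalization defined in this legend (firefly units divided by Renilla units, then expressed as fold activation over the negative control and summarized as mean ± S.D.) can be sketched as follows. The function names and the example readings are hypothetical, and dividing every replicate by the mean control RLU is an assumption about how replicates are combined.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple

Reading = Tuple[float, float]  # (firefly units, Renilla units) for one well


def rlu(firefly: float, renilla: float) -> float:
    """Relative Luciferase Units: firefly signal normalized to Renilla."""
    if renilla <= 0:
        raise ValueError("Renilla signal must be positive")
    return firefly / renilla


def fold_activation(
    sample: Sequence[Reading],
    negative_control: Sequence[Reading],
) -> Tuple[float, float]:
    """Return (mean, S.D.) of sample RLU relative to the mean control RLU."""
    control_mean = mean(rlu(f, r) for f, r in negative_control)
    folds = [rlu(f, r) / control_mean for f, r in sample]
    spread = stdev(folds) if len(folds) > 1 else 0.0
    return mean(folds), spread


# Hypothetical readings, for illustration only.
enhancer_wells = [(300.0, 10.0), (330.0, 10.0), (270.0, 10.0)]
control_wells = [(100.0, 10.0), (100.0, 10.0)]
avg_fold, sd_fold = fold_activation(enhancer_wells, control_wells)
```

With these made-up numbers the construct comes out at about 3-fold over the negative control, the threshold the text uses when describing reporter signals as significantly high.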

Performing enhancer reporter assays in infected cells treated with TPA and NaB presents two possibilities: first, that the drugs directly induce reporter activity; and second, that the drugs trigger the viral lytic cycle, which in turn enhances reporter activity. To distinguish between these possibilities, we repeated the experiment in BJAB, an EBV-negative B-cell line (Fig. 3c). In BJAB cells, TPA and NaB were unable to increase reporter activity, except for TR, which was further activated in treated cells (Fig. 3c). This suggests that viral lytic induction, but not TPA and NaB, leads to increased BALF, BPLF, and EBNA1_Lex enhancer activity, while TR is responsive both to TPA and NaB and to viral lytic induction. Interestingly, the basal activity of TR was much higher in the EBV-infected cells compared to uninfected BJAB cells (Fig. 3k). Previously, we have shown that the KSHV lytic enhancers are responsive to the Replication and Transcription transactivator (RTA) protein54. This protein is solely responsible for initiating the KSHV lytic phase68–71. Similarly, EBV encodes BamHI Z Epstein-Barr virus replication activator (BZLF1/ZEBRA/ZTA), which drives its lytic transcription initiation72. We asked if the EBV enhancers were responsive to the ZTA protein. To test this, we co-transfected the EBV enhancer reporter together with the ZTA expression plasmid into BJAB cells. ZTA expression strongly induced the TR of EBV and the two positive controls (Fig. 3d), indicating that additional EBV lytic factors are required for lytic enhancer activity for BALF, BPLF, and EBNA1_Lex. Reanalysis of ZTA ChIP-seq does not detect any significant peak on these enhancers (Supplementary Fig. 2). In many cases, enhancers are tissue-specific, including the ORF29 Intron enhancer we identified in KSHV54. Since EBV also infects epithelial cells, we co-transfected the reporter plasmids with or without ZTA into epithelial HEK293T cells. 
The enhancer activity in epithelial cells is similar to that of B-cells for all tested enhancers (Fig. 3e). To check the interferon responsiveness of the EBV enhancers, we repeated the enhancer reporter assay in HCT-116, a cell line with no interferon response mediated by the cGAS-STING pathway73. In HCT-116, the EBNA1_Lex, BPLF, and TR were still active, and TR was further induced by ZTA. In contrast, BALF and BamHI-W were not active (Fig. 3f) in HCT-116, suggesting its activity seems to be controlled by interferon-mediated response. The activity of the EBV enhancers in different cell lines is summarized in Fig. 3g-k. We also tested the TR enhancer in the context of OriP, and found that it is able to further enhance OriP (Supplementary Fig. 3a). To show the specificity of ENHAvir and the enhancer reporter plasmid for enhancers, we tested ZTA promoter (ZTAp) region62. This promoter is enriched with active histone marks and shows a prominent open chromatin peak in the ATAC-seq (Supplementary Fig. 2a, dark rose colored triangle). ENHAvir does not predict any enhancer activity for ZTAp, and this region does not show any enhancer activity when it is cloned downstream to the Luciferase reporter gene in reporter assay (Supplementary Fig. 3b, Luc-ZTAp construct). This experiment supported the notion that ENHAvir is specific for enhancers, and not for any active chromatin. In summary, the newly identified EBV enhancers, BALF, BPLF, EBNA1_Lex, and TR, are active during latency and can be further activated during the lytic phase. In addition, the TR has the strongest enhancer activity, specifically in EBV-infected cells.

Following the ability of ENHAvir to predict enhancers in EBV, we searched for enhancers in the genome of MHV-68, another ɣ-herpesvirus that infects mice and is often used as an EBV/KSHV homolog in animal models74–76 (Supplementary Fig. 4). ENHAvir predicted prominent peaks in the OriLyt-L and OriLyt-R, sequences that are identical to those of KSHV77. In addition, the MHV68 OriLyts were previously reported to have ‘cis’ activity78,79. ENHAvir also predicts that TR is an enhancer, in agreement with the high homology between the TRs of MHV-68 and KSHV80.

EBV Terminal Repeat enhancer regulates both latent and lytic genes

To test the activity of the EBV TR enhancer in the context of the viral genome, possible targets were identified based on HiC connections with the EBV TR in latency I and III (Fig. 4a: black and teal curves, respectively). Interestingly, we saw that the EBV TR shows contacts with the key gene promoters for latency maintenance (EBNA-1), latency-III genes (LMP1 and LMP2A), and immediate-early lytic genes (ZTA and RTA) (Fig. 4a). To test the ability of the TR enhancer to regulate these genes, we performed CRISPR activation (CRISPRa) targeting the EBV TR. Reporter assay for the TR enhancer in EBV-infected BC-1 cells revealed lower activity relative to Akata and Raji cells (Supplementary Fig. 3c); therefore, we decided to perform the CRISPRa in these cells. BC-1 cells were transduced with lentiviruses expressing dCas9 that tethers a strong activation domain (dCas9-10xSunTag and ScFv-2ERT2-VPH)81, and specific sgRNAs targeting TR or sgControl. Subsequently, the cells were treated with 4-hydroxytamoxifen to induce CRISPRa activity for 4 and 24 h, and RNA was subjected to RT-qPCR82. CRISPRa for EBV TR led to upregulation of EBNA1, LMP1, LMP2A, ZTA, and RTA expression (Fig. 4b). These gene targets were already upregulated following 4 h of CRISPRa, and the effect was even more pronounced at 24 h (Fig. 4c). In contrast to these gene promoters that had HiC connections with TR, the two late lytic genes BCLF and BLLF1 did not present any HiC connections, and indeed these genes were not upregulated during 4 and 24 h of CRISPRa (Fig. 4b, c). This suggests that EBNA-1, LMP1, LMP2A, ZTA and RTA are direct targets of the TR enhancer.

Fig. 4. EBV Terminal Repeats regulate both latent and lytic viral genes.

Fig. 4

a Re-analysis of EBV HiC data (black and teal curves for EBV Latency I and III respectively)139. Genes written in brick red color indicate the ones having direct HiC connection with the EBV Terminal Repeats (TR). The ones written in vivid violet denote two viral genes that have no HiC connections with the EBV TR. RT-qPCR of different viral genes following 4 h (b) and 24 h (c) of EBV TR CRISPRa induction with 50 nM 4-hydroxytamoxifen (4OHT) vs control (EtOH: Ethanol). Brick red and vivid violet graphs indicate the genes that have or do not have HiC connection with EBV TR, respectively. Data represent mean ± S.D. of n = 3 biological replicates. Values in the graph indicate P values from 2-tailed T-tests. (ns) denotes non-significant values. Source data are provided as a Source Data file.

Prediction of enhancers of α-herpesviruses

Since ENHAvir could identify known and predict novel EBV enhancers, we wanted to move further to the α-herpesviruses. We proceeded with the Herpes Simplex Virus-1 (HSV-1) genome, a relatively well-studied virus. ENHAvir prediction resulted in detecting enhancer peaks on the Latency-Associated Transcript promoter (LATp) region and Immediate Early mRNA 3 (IE mRNA3) regions (Fig. 5a). Both these regions are known HSV-1 enhancer elements83–85. Thus, ENHAvir identified all the known enhancer sequences present in the HSV-1 genome. ENHAvir also predicted the same regions in HSV-2 as enhancers (Fig. 5b).

Fig. 5. ENHAvir enhancer prediction for α-herpesviruses.

Fig. 5

ENHAvir-predicted enhancer peaks (black colored track) for the HSV-1 (a), HSV-2 (b), and VZV (c) genomes are presented on the UCSC genome browser view158. The known enhancers are marked in red boxes.

Enhancer prediction analysis on the third human α-herpesvirus, the Varicella-Zoster virus (VZV), revealed numerous peaks on its genome. ENHAvir revealed major peaks in the IR and TR regions86–89. Among them, peaks were found in the bidirectional promoter of ORF62 and ORF6386 (Fig. 5c). Another bidirectional promoter serving ORF28 and ORF29 also showed a significant peak on the VZV genome87–89. Apart from them, most of the peaks coincide with the ATAC-seq peaks reported previously90. These findings add another possible layer to the VZV transcriptional repertoire.

Predicting enhancers of β-herpesviruses

Next, we applied ENHAvir to predict the enhancers on the human β-herpesviruses. The analysis of the Human Cytomegalovirus (HCMV) predicted enhancer peaks in the RNA2.7 promoter, Ori-Lyt, IRS1, and TRS1 regions (Fig. 6a). Among them, RNA2.7 promoter and OriLyt regions are known HCMV enhancers91. A small peak was also observed in the immediate early gene region, known as the major immediate early enhancer92–94. In addition to the known enhancers detected for HCMV, our analysis predicts possible enhancers at the IRS1 and TR regions. To validate that these two regions are enhancers, they were cloned into our enhancer reporter plasmid. Reporter experiments revealed ~3-fold and ~4-fold upregulation of reporter activity by IRS1 relative to the negative control in HEK293T and HCT-116 cells, respectively (Fig. 6e). The TR of HCMV presented higher activity, around 20-fold in both cell lines. These results indicate that the two peaks in IRS1 and TR identified by ENHAvir are indeed functional enhancers.

Fig. 6. ENHAvir enhancer prediction for β-herpesviruses.

Fig. 6

ENHAvir-predicted enhancer peaks (black colored track) for HCMV (a), HHV-6A (b), HHV-6B (c), and HHV-7 (d) genomes presented on the UCSC genome browser view158. Red and Purple boxes indicate known and novel enhancer predictions (subjected to enhancer reporter validation), respectively. Enhancer reporter assay of HCMV IRS1 and TR (e) and HHV-6A/6B/7 TR (f) in HEK293T (blue) and HCT-116 (brown) cell lines. Relative Luciferase Units (RLU) are the firefly luciferase units divided by the internal control Renilla units. The RLU values presented are fold change compared to the negative control. Data represent mean ± S.D. of n = 9, three independent experiments, each with three biological replicates. Values in the graph indicate P values from 2-tailed T-tests. (ns) denotes non-significant values. Source data are provided as a Source Data file.

The roseolovirus subfamily includes three highly related viruses: HHV-6A, HHV-6B, and HHV-7. For all three viral genomes, ENHAvir predicted enhancers in the upstream region of the U95 gene (Fig. 6b–d). For HHV-6 viruses, the region is known as the R3 enhancer, while for HHV-7, it is termed the R2 enhancer95–98. Interestingly, the ENHAvir prediction on these viruses showed a very similar enhancer pattern. For all of them, prominent peaks were visible in their Direct Repeats (DR)/Terminal Repeats (TR) region (Fig. 6b–d). Here again, cloning of a conserved region shared by HHV-6A/6B and HHV-7 in the TR into the enhancer reporter plasmid presented enhancer function in reporter assay (Fig. 6f).

ENHAvir precisely predicts enhancers of polyomavirus

Given that ENHAvir precisely identified enhancers within the herpes genomes, we extended our investigation to a virus from a different family. For this purpose, we selected the Simian Vacuolating virus 40 (SV40). SV40, a small dsDNA virus that belongs to the polyomaviridae family, is a relatively well-studied virus. The two 72 bp repeats within the SV40 genome were the first enhancer to be identified99–101. Analysis with ENHAvir on the SV40 viral genome revealed a major peak in the two 72 bp repeats (Supplementary Fig. 5). It is important to note that a BLAST search could not detect homology between the SV40 enhancer and KSHV enhancers, or between other herpesviral enhancers and KSHV, as we have analyzed previously. Nevertheless, our ENHAvir tool, trained only with KSHV enhancer sequences, can precisely locate the known enhancer sequences of a phylogenetically distant viral genome.

Training with the human herpesvirus enhancers improves the prediction accuracy of ENHAvir

ENHAvir was able to identify both known and novel viral enhancers. However, it missed some key, previously reported enhancers in the EBV genome. To address this, we trained ENHAvir with enhancers from all human herpesviruses (Supplementary File T1). We named this newly trained model ENHAvir 2.0. The EBV OriP enhancer, which was not identified by ENHAvir trained with only KSHV enhancers (ENHAvir 1.0), was successfully identified by ENHAvir 2.0 (Fig. 7a).

Fig. 7. ENHAvir 2.0 outperforms its predecessor and other enhancer prediction tools in viral enhancer prediction.

Fig. 7

a–d ENHAvir 2.0 (light blue colored track) shows more specificity and less background noise than ENHAvir 1.0 (black colored track) in enhancer prediction of γ-herpesviruses (a), α-herpesviruses (b), β-herpesviruses (c), and polyomavirus (d). Brick red colored track and teal colored track represent the viral enhancer prediction by EnhancerDetector and iEnhancer-ELM model, respectively107,108. Red and blue colored boxes indicate known enhancer and silencer sequences, respectively. Purple colored boxes indicate the novel enhancers validated in this study with the enhancer reporter assay. All the tracks are viewed in UCSC genome browser158.

In the case of the α-herpesviruses, the enhancer prediction for HSV-1 was the same with ENHAvir 2.0 (Fig. 7b). In contrast, for HSV-2, ENHAvir 2.0 predicted one more peak in the VP16 promoter, other than the LAT and IE mRNAp enhancers. Although the HSV-2 genome has 80% sequence homology with that of HSV-1102, there is a significant difference in the promoter region of VP16/UL48. In the case of HSV-2, the promoter contains 13 direct tandem repeats in this region harboring multiple Sp1/Egr1 sites103, which are not present in HSV-1 (Fig. 8a). A study by Thompson and Sawtell showed that adding this sequence to the HSV-1 VP16 promoter increased reporter activity (compared to the wild-type HSV-1 VP16 promoter), and that an engineered HSV-1 virus containing this HSV-2 sequence acquired increased lytic entry and virulence103. We cloned both the HSV-1 and HSV-2 VP16 promoter regions into the enhancer reporter plasmid. In support of the ENHAvir 2.0 prediction, the region from HSV-2 is an enhancer, while that from HSV-1 is not (Fig. 8b). To show that the additional sequence in HSV-2 confers the enhancer activity, we added this sequence to the HSV-1 VP16 promoter, which turned it into an enhancer.

Fig. 8. HSV-2 VP16 promoter region harbors an enhancer.

Fig. 8

a ENHAvir 2.0 prediction (light blue colored track) of HSV-1 and HSV-2 genomes viewed in the UCSC genome browser158. Schematic representation of the VP16 promoter region of HSV-1 and HSV-2. Sequence alignment of HSV-1 and HSV-2 in the predicted enhancer peak is presented in a zoomed view using Jalview174. b VP16 promoter region from HSV-1 and HSV-2, and a chimeric sequence comprising the VP16 promoter of HSV-1 and the predicted enhancer sequence from HSV-2, were cloned downstream to the luciferase gene in the enhancer reporter plasmid, and an enhancer reporter assay was performed in HEK293T cells. Relative Luciferase Units (RLU) represent the firefly luciferase units divided by the internal control Renilla units. The RLU values presented are fold change compared to the negative control. Data represent mean ± S.D. of n = 9, three independent experiments, three biological replicates each. Values in the graph indicate P values from 2-tailed T-tests. (ns) denotes non-significant values. Source data are provided as a Source Data file.

Re-analysis of the HCMV genome, a β-herpesvirus, now showed a prominent peak at the MIE enhancer. For HHV-6A and −6B, ENHAvir 2.0 showed two new prominent peaks in the R1 and R2 regions, which were previously reported to serve as a silencer and an enhancer element, respectively (Fig. 7c)104–106. Notably, the training data for ENHAvir 2.0 included neither the R1 and R2 sequences nor any sequence similar to them. This gives us more confidence that a minimal number of sequences is enough to train ENHAvir for enhancer prediction. The prediction results also indicate that enhancers and silencers share enough sequence syntax for ENHAvir to identify silencer sequences as well, at least for KSHV and the roseoloviruses (Fig. 7a, c). For the SV40 genome, ENHAvir 2.0 predicted the 72 bp repeat enhancer, like the previous version (Fig. 7d). Altogether, the new version, trained with the known enhancers of all herpesviruses, improved the ability of ENHAvir to predict enhancers.

To compare ENHAvir predictions with those of other available enhancer prediction tools, we performed a benchmark analysis. We utilized two proven enhancer prediction tools: iEnhancer-ELM107, a BERT-based enhancer prediction model, and the EnhancerDetector model108, which can predict enhancers in both fly and non-fly systems. iEnhancer-ELM failed to specifically detect any enhancer on the viral genomes, while the EnhancerDetector model either showed background noise in the promoter regions or could not detect the correct enhancers in the viruses analyzed (Fig. 7, brick red and teal tracks; Supplementary Table 4). This comparison highlights the importance of training ENHAvir with viral enhancers to predict enhancers in viral genomes.

ENHAvir successfully predicts enhancers in the human genome

DNA viruses utilize host transcription factors for the transcription of their viral genes. Therefore, some sequence similarity should be present between the viral and host enhancers. This understanding led us to ask whether we can predict human enhancers with ENHAvir trained with KSHV enhancer data. For a proof-of-concept analysis, we chose to analyze four gene loci from the human genome: Fos, Jun, DPPA3, and Myc.

Enhancer prediction with ENHAvir on the Fos locus (hg38; chr14:75245513-75332436) revealed several peaks (Fig. 9a). The most prominent peaks were seen on the Fos promoter. Interestingly, this promoter not only harbors an ‘enhancer signature’ (ENCODE ccREs)109 but is also shown to regulate various upstream genes, such as ABCD4 and NEK9, in the GeneHancer110 regulatory network track. This indicates that the Fos promoter can serve as an ‘e-promoter’. Re-analysis of published STARR-seq data from HepG2 (ref. 111) and K562 (ref. 112) cells further supported this observation. ENHAvir also predicted a prominent peak in the LINC01220 promoter region, another STARR-seq peak, which has been shown to regulate the Fos promoter. This suggests that the LINC01220 promoter is an enhancer and that this lncRNA may act as an ‘enhancer-derived lncRNA’ (eRNA). Some other peaks were also identified upstream of the Fos gene, which have been shown to regulate genes further upstream (GeneHancer track, Fig. 9a & Fos loci attribute, Supplementary Table 1). All the enhancers predicted in the Fos locus had H3K27Ac enrichment, which is consistent with enhancer activity.

Fig. 9. ENHAvir identifies enhancers on the human genome.

The genome browser view displays ENHAvir-predicted enhancers on the human genome (black track) for the Fos (a), Jun (b), and DPPA3 (c) gene loci, together with genome-wide STARR-seq data in HepG2 (ref. 111; in rose) and K562 (ref. 112; in teal) cells, the GeneHancer110 track (in magenta), ENCODE ccREs109 (enhancers in yellow and orange, promoters in red, and CTCF sites in blue), and H3K27Ac tracks158 (in blue).

A similar observation was made when we predicted enhancers in the Jun (hg38; chr1:58734921-58827674) (Fig. 9b), DPPA3 (hg38; chr12:7542820-7626019) (Fig. 9c), and Myc (hg38; chr8:127729054-128193780) loci (Supplementary Fig. 6). Our prediction is in agreement with the ENCODE ccREs, the STARR-seq tracks, and the GeneHancer network.

Altogether, our tool, which was only trained with the enhancer data of KSHV, was not only able to identify most of the known enhancers on the human genome but also pointed out some novel enhancer regions that perfectly fit with the ENCODE ccRE enhancer signatures.

Viral enhancers and Alu elements share a common ancestral origin

Since ENHAvir was able to precisely detect enhancers in both viral and host (human) genomes, we asked what common signature ENHAvir picked up to identify them. To find out, we first analyzed all the viral enhancer sequences using MEME113. All of the top four motifs for the viral enhancers analyzed represented a poly-A motif (Fig. 10a, Supplementary File 1). We then asked whether this motif was also found in the human enhancers we analyzed in this study. While the Fos, Jun, and DPPA3 loci enhancers showed different conserved motifs (Supplementary Fig. 7a–c, Supplementary Files 2–4), analysis of the Myc locus enhancers detected a poly-A stretch (Supplementary Fig. 7d, Supplementary File 5). Combining viral enhancers with the human genomic enhancers from each of these loci again led to poly-A stretch motifs (Supplementary Fig. 7a–d, Supplementary Files 6–10). We then randomly took ~200 human enhancers, and MEME analysis revealed that two of the top four motifs were again poly-A stretches (Fig. 10b & Supplementary Fig. 7e, Supplementary File 11). These motifs were present in ~88% and ~99% of the analyzed sequences (Fig. 10b). Combining all the enhancers (human and viral) also revealed poly-A stretches that were present in ~80% of viral enhancers and ~89% of human enhancers, substantially more than the other enhancer motifs (Fig. 10c, Supplementary File 12).

Fig. 10. A poly-A signature within viral and human enhancers leads to Alu elements.

MEME analysis113 for common motif search involving only viral enhancers (a), 186 human enhancers (b), and a combination of the viral and human enhancers (c), along with the percentage of the common motif incidence. d Multiple alignment of representative viral and human enhancer sequences reveals a poly-A stretch motif similar to that of Alu elements. e Maximum likelihood tree170,171 for Alu sequences and the viral enhancers. The brown circle denotes the evolutionary age of the Alu elements122. TSD stands for Target Site Duplications. 0.17 scale bar represents the units of branch length calculated for nucleotide substitutions per site.

Repeat sequences have previously been reported to harbor enhancer sequences114–116, especially the transposable elements comprising Alus and LINEs117–122. The Alu and the active LINE (L1HS and L1PA2) elements possess 3’ poly-A stretches in their sequences. We thus searched for these elements in the analyzed human enhancers. While we found 45, 15, 10, and 270 Alus in the Fos, Jun, DPPA3, and Myc loci enhancers, respectively, we found only truncated LINE elements in these regions (Supplementary Fig. 7f, Supplementary Table 2). Alignment of the viral and human enhancers with the Alu sequences also revealed a remarkable similarity of the poly-A stretch among the sequences (Fig. 10d, Supplementary Fig. 7g, h, Supplementary Files 13, 14). The multiple alignment failed in the case of LINE-1 elements and viral enhancers, as the sequences were too diverse for Clustal W to align. A phylogenetic analysis of viral enhancers and Alu elements revealed that the viral enhancers fall in the same clade as the older Alu subfamilies (AluSx and AluJo) (Fig. 10e, Supplementary File 15). This result indicates that the viruses might have acquired Alu sequences and modified them to work as ‘e-promoters’ during their co-evolution with their host, or that both share a common ancestral origin. This evolutionary sequence similarity may explain the effectiveness of ENHAvir in identifying both viral and human enhancer sequences.

Discussion

Locating enhancers and characterizing their regulatory activity is crucial to understanding cellular and viral gene expression. Enhancers can be identified with the help of functional assays like STARR-seq42, or by their association with open chromatin marks16,17. However, these techniques are technically demanding and time-consuming, and are limited to the activity tested in a specific cell type. Several enhancer prediction tools have been developed that utilize either nucleotide similarity syntax models or chromatin modification models, and often combine both. Recently, several models were also developed that focus on the position weight matrix (PWM) of the TF binding site on a given DNA sequence116,123. However, when an enhancer is functional, not only is the core sequence important, but the sequences surrounding the core enhancer also play an important role124–126. Further limitations of the current enhancer prediction tools are that they often cannot load viral sequences for analysis or analyze relatively small genomes. Based on our previous study54, we trained our Natural Language Processing (NLP) model-based tool with the KSHV enhancer data only. The ENHAvir model, trained on the KSHV genome, was able to predict all of these enhancers on the KSHV genome. We then predicted enhancers on all human herpesvirus genomes. ENHAvir successfully predicted most of the known enhancers; the only known enhancers ENHAvir did not detect were the EBV BamHI-C127, OriP59, and TPA-responsive enhancer MSTRE-I128, while HCMV MIE presented only a very small peak. To improve the enhancer prediction by ENHAvir, we decided to train it with the known and predicted herpesvirus enhancers. This improved tool, termed ENHAvir2.0, detected all herpesvirus enhancers except EBV BamHI-C. BamHI-C was reported to be a B-cell-specific enhancer that is responsive to EBNA2 (ref. 127).
We performed a benchmark analysis to compare ENHAvir with two available NLM enhancer prediction tools: iEnhancer-ELM107 and the EnhancerDetector model108. This comparison shows that both ENHAvir and ENHAvir2.0 outperformed these tools in predicting viral enhancers. In addition to viral enhancers, ENHAvir was able to predict enhancers on the human genome. Moreover, it was able to detect tissue-specific enhancers. This suggests that ENHAvir is able to detect some hidden motifs that are common to many enhancers regardless of their tissue specificity. When we tried the other way around, training the NLP model with human enhancers to predict viral enhancers, it failed to do so (Supplementary Fig. 8), suggesting that viral enhancers capture common motifs for all enhancers, including the host enhancers. This is also supported by our failed attempt to predict enhancers of Drosophila sp. with ENHAvir. This might be due to the large taxonomic distance between Drosophila and the human host, highlighting the intimate relations between viruses and their hosts. It will be interesting in the future to train the tool with enhancers from viruses that infect Drosophila and test its ability to predict enhancers on the Drosophila genome.

HSV-1 and HSV-2 are very similar viruses with similar tissue tropism129–131; the higher prevalence of HSV-1 in oral epithelium and of HSV-2 in genital epithelium is attributed to human behavior. When they reside in the same tissue, HSV-2 presents higher reactivation than HSV-1 (ref. 132). Analysis of HSV-1 and −2 with ENHAvir precisely identified the enhancers at the LATp85 and IE mRNA83,84. The immediate early promoter of VP16 is very similar between HSV-1 and HSV-2, but additional repeats in HSV-2 make it more active. Inserting this region from HSV-2 into the VP16 promoter of HSV-1 results in an HSV-2-like reactivation pattern103. Interestingly, ENHAvir2.0 predicts that this region in HSV-2, but not in HSV-1, is an enhancer. Using the enhancer reporter assay, we validated that the HSV-2 promoter region is indeed an enhancer, whereas the HSV-1 promoter is not. Moreover, inserting the enhancer sequence from HSV-2 next to the HSV-1 promoter made it an enhancer. Therefore, the added enhancer in HSV-2 might contribute to its unique reactivation pattern.

Another α-herpesvirus, VZV, has a complex transcriptional mechanism that is still understudied. It is reported that bidirectional promoters often lead to enhancer activity133. In the VZV genome, two such promoters were previously reported in the OriS regions, which not only regulate viral replication but also drive transcription of two pairs of viral genes, ORF62-63 (ref. 86) and ORF28-29 (refs. 87–89). Our enhancer prediction for these two regions supports the notion that they may also function as enhancers. In addition, ENHAvir pointed out some other novel enhancer candidates on the VZV genome; these regions present active histone marks and open chromatin90.

Amongst the β-herpesviruses, ENHAvir predicted all the known enhancers on the HCMV genome, including the RNA2.7 promoters and OriLyt91. The MIE enhancer presented only a small peak with ENHAvir, but a prominent peak was observed with the improved ENHAvir2.0. Interestingly, ENHAvir predicted two additional enhancers, at the IRS1 and terminal repeat (TR) regions. These two regions in HCMV are known to contribute to viral replication134. Using the enhancer reporter plasmid, we found that both IRS1 and TR are functional enhancers.

For the HHV-6 viruses, peaks were observed upstream of U95. This region is already known as the R3 enhancer of HHV-6 (ref. 135). Likewise, a similar repeat sequence, R2, in the HHV-7 genome is also known to serve as an enhancer95–97. The terminal repeats of HHV-6 and −7 are known as the DR region, and ENHAvir predicted that the DR is an enhancer. We cloned the common repeat sequence of HHV-6 and −7 into the enhancer reporter plasmid and found that the DR is an enhancer.

In EBV (a ɣ-herpesvirus), ENHAvir predicted the known enhancers: OriP59, OriLyts58,60, and BILF-2 (ref. 61). ENHAvir did not predict the known BamHI-C127 and TPA-responsive enhancer MSTRE-I in the EBV genome128. In addition to the known enhancers, ENHAvir predicted enhancers at the BamHI-W repeats/BWRF1, BPLF-1, BALF-3, EBNA1_Lex/BKRF1, and TRs. Functional reporter assays indicate that BPLF-1, BALF-3, EBNA1_Lex/BKRF1, and TR are active enhancers during latency and can be further activated during the lytic phase. Treatment with TPA and NaB to initiate the lytic cycle induced the activity of BPLF-1, BALF-3, and EBNA1_Lex in infected cells but not in uninfected cells, suggesting that they are responsive to EBV lytic gene expression. Transfection of the ZTA expression vector in uninfected cells was not sufficient to induce these enhancers, and even repressed them, indicating that additional EBV factors are needed. It has been shown that ZTA both activates and represses transcription136–138. Since these enhancer regions do not contain ZTA binding sites, ZTA may repress their transcription indirectly. In contrast, the TR enhancer was upregulated by TPA and NaB in both infected and uninfected cells, suggesting that it is responsive to TPA and NaB and might serve a role in the induction of the EBV lytic cycle by these drugs. ZTA was sufficient to upregulate the TR enhancer in uninfected cells. In the case of EBV TR, we also performed CRISPRa and found that the promoters that contact TR in HiC experiments (EBNA1, LMP1, LMP2A, ZTA, and RTA) are regulated by the TR enhancer. Previous studies detected loop interactions via HiC for BamHI-W repeats/BWRF1, BPLF-1, BALF-3, EBNA1_Lex/BKRF1, and TRs61,139, supporting their enhancer function. A previous study that cloned the TR upstream of a reporter gene observed that two TR repeats had much stronger activity than 17 repeats140, which somewhat contradicts our observations.
There are two differences between the two studies: the previous study cloned the TR upstream of a gene, testing its function as a promoter and not an enhancer; the second difference is that this experiment was performed in uninfected cells, and we found that the TR enhancer is highly active only in infected cells. The BamHI-W repeats/BWRF1 presented a different pattern, high activity in uninfected cells, low activity in infected cells, and no activity in HCT-116 cells. Latent herpes viruses, including EBV, inhibit the cellular anti-viral response141, and HCT-116 is defective in interferon response73. These data suggest that the BamHI-W repeats/BWRF1 is an interferon-responsive enhancer, similar to the vIRF1 promoter region in the genome of KSHV54.

The terminal repeats (TR or DR) are recognized by the portal protein complex for genome encapsidation and genome concatemer cleavage. Following infection of a new cell, the linear genome is injected into the nucleus, where it circularizes via the TR. It has been shown that HSV-1 and HSV-2 ie-mRNA, which reside in the TR84,142, and KSHV TR54,143,144 are enhancers. In this study, we attributed enhancer activity to TR in HCMV and EBV, and DR in HHV-6A/6B/7. Therefore, we propose to add an enhancer as a novel function common to all TR/DR in herpesviruses. Cellular super-enhancers are characterized by large size, high abundance of transcription factors, high H3K27Ac and BRD2/BRD4, which translates to very strong enhancer activity and sensitivity to perturbations145. ChIP-seq experiments detected a high abundance of H3K27Ac and BRD2/BRD4 on EBV TR61,146. Our functional assay detected very high enhancer activity, 70-fold in Akata and 27-fold in Raji for EBV TR, compared to only 5–10-fold activation for the rest of the enhancers. These observations suggest that similar to KSHV TR54, EBV TR is a viral super-enhancer. The possible role of EBV TR in latency and lytic gene expression should be further explored in the future.

Following successful enhancer prediction by ENHAvir for herpesviral genomes, we asked whether ENHAvir can predict enhancers outside the herpesvirus family. SV40 is a polyomavirus that primarily infects monkeys and is unrelated to the Herpesviridae family. Nevertheless, ENHAvir successfully predicted the 72 bp repeat enhancers in the SV40 genome99,100. This demonstrates the tool’s robustness in identifying viral enhancers beyond herpesviruses and points to the conservation of regulatory element micro-signatures.

DNA viruses that reside in the nucleus exploit their host machinery for transcription. Due to their relatively small genome size, they should adopt the most effective elements from their host to make their enhancers active. We hypothesized that these shared enhancer signatures, which are not visible in traditional analysis methods, might be detected by ENHAvir. We thus employed ENHAvir, trained with KSHV enhancer data only, to predict enhancers for four human gene loci: Fos, Jun, DPPA3, and Myc. Most of the enhancers of these loci were identified with high accuracy. The predicted enhancers were also identified by the functional enhancer assay, STARR-seq111,112. Importantly, even tissue-specific enhancers that were active only in one of the tissues analyzed by STARR-seq were still detected by ENHAvir. To use their limited genome size most efficiently, most viral enhancers are bifunctional promoters that adopt enhancer functions, or ‘e-promoters’. It seems that ENHAvir is most sensitive in detecting these ‘e-promoters’ in the human genome.

Although ENHAvir precisely predicted enhancers on viral and human genomes, some points of concern remain. ENHAvir2.0 still could not detect the EBV BamHI-C enhancer147, and the small peak at the BORF2 promoter (just to the right of the BPLF enhancer) did not exhibit enhancer activity in the functional assay (Supplementary Fig. 9). Other than that, all the predicted enhancers by ENHAvir exhibited enhancer activity in the functional reporter assay.

Once we had the sequences of 30 human herpesvirus enhancers, we were interested in revealing the common enhancer micro-signature. We performed MEME analysis for a common motif search among all the herpesviral enhancers. Interestingly, all four top motifs contained a stretch of poly-A. The same MEME analysis on a panel of 186 randomly picked human enhancers again revealed a motif containing a poly-A stretch. The same MEME analysis for the human and viral enhancers combined found that the poly-A stretch is the most abundant motif shared by viral and human enhancers. We speculate that these poly-A stretches may recruit RNA polymerase II and facilitate the opening of the double helix, or prevent the positioning of nucleosomes148, thereby assisting in the transcription of enhancer-derived RNAs. Alu elements are ubiquitous, comprising 11% of the human genome149, and possess a 3’ poly-A tail. During evolution, these retrotransposons acquired enhancer function in the host genome122. A phylogenetic analysis revealed that the viral enhancers are closely related to the older Alus. This might have occurred through the virus’s tendency to incorporate its host’s genomic sequences during co-evolution150–152, through random retrotransposition of an Alu into a viral genome, or because both co-evolved to adapt to the host transcription machinery.

In summary, ENHAvir is an example of how we may adopt natural language models trained with a minimal set of high-quality data to advance human knowledge. ENHAvir is an application of AI in genomics that can pick up hidden motifs leading to functional biological control.

Methods

Notations for the Natural Language Processing model

The following notations were introduced for our framework: $\widehat{LM}$ denotes the DeBERTa v3 language model56. $X$ is the input text sequence of $n$ tokens $x_1, x_2, \ldots, x_n$. The language model $\widehat{LM}$ takes $X$ as the input and produces a representation $Y$ as the output from its last layer. $Y$ is a two-dimensional array of size $n \times h_d$, where $h_d$ denotes the internal hidden dimension of the language model. In other words, the language model is a complex function that produces a final output representation for each token sent as the input:

$$Y = \widehat{LM}(X) \tag{1}$$

where $Y \in \mathbb{R}^{n \times h_d}$, for $X = [x_1, x_2, \ldots, x_n]$.

For completeness, we also denoted the token-wise representation of $Y$ by $[y_1, y_2, \ldots, y_n]$, where each $y_i$ is an $h_d$-dimensional vector. As DeBERTa v3 is a self-attention-based model, every $y_i$ functionally depends on every $x_j$ for $i, j = 1, 2, \ldots, n$.

Since our objective is to take a nucleotide sequence $X$ and predict an activation amount $z$, we used the representation $Y$ obtained from the language model to predict $z$. In the literature, one widely used method for such sequence modeling tasks is a linear transformation of the 1st token representation of $Y$ (refs. 56, 153, 154). We followed this method as follows:

$$z = W \cdot y_1 + b \tag{2}$$

where $W$ is an $h_d$-dimensional vector, ‘$\cdot$’ denotes the dot product of the two vectors, and $b$ is a scalar. Here $z$ is a scalar such as 0.05 or 1. Altogether, the full pipeline of the language model with the linear transformation layer was denoted as ENHAvir. This model takes the nucleotide sequence $X$ as input and outputs the activation value $z$:

$$z = \mathrm{ENHAvir}(X) \tag{3}$$
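The prediction head described above (a linear map applied to the first-token representation) can be illustrated with a minimal sketch. It does not load DeBERTa v3; the toy matrix `Y_toy` merely stands in for the language-model output, and the names `linear_head`, `Y_toy`, and `z_toy` are hypothetical.

```python
from typing import List, Sequence


def linear_head(Y: Sequence[Sequence[float]], W: Sequence[float], b: float) -> float:
    """Return z = W . y_1 + b, using only the first token's representation."""
    y_first = Y[0]  # representation of the 1st token, length h_d
    if len(y_first) != len(W):
        raise ValueError("W must have the same length as the hidden dimension")
    return sum(w * y for w, y in zip(W, y_first)) + b


# Toy stand-in for the language-model output (assumption): n = 3 tokens, h_d = 2.
Y_toy: List[List[float]] = [[0.5, -1.0], [0.1, 0.2], [0.3, 0.4]]
z_toy = linear_head(Y_toy, W=[2.0, 1.0], b=0.25)
```

Only the first row of `Y_toy` contributes to `z_toy`; the remaining token representations influence the output only indirectly, through self-attention inside the language model.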

Training of ENHAvir

The parameters of ENHAvir consist of $W$, $b$, and the original parameters of $\widehat{LM}$. All the parameters of the model were tuned to fit some given nucleotide sequences $X$ and their corresponding $z$. In other words, the parameters of the model can be adapted in such a way that its predictions $\bar{z}$ for input $X$ will very closely match the gold output data $z$.

Data collection

We used both the DNA sequences and the enhancer activity, corresponding to the Relative Luciferase Units (RLU) as determined previously in the enhancer reporter assay for the KSHV enhancers54. The RLUs from the last two independent experiments were ‘normalized’ to the 1st experiment with respect to the negative control. The adjusted mean ($\mu_{adj}$) was calculated as follows:

$$\mu_{adj} = \mu \pm k \times \sigma \tag{4}$$

where $\mu$ is the average of the samples from each replicate within each independent experiment, $\sigma$ is the standard deviation of the values of each sample, and $k$ is a constant factor (in this case we used $k = 1$).
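As a small numerical illustration of Eq. 4, the sketch below computes both bounds $\mu - k\sigma$ and $\mu + k\sigma$ for one sample. The text does not state which sign was applied in which case, nor whether the sample or population standard deviation was used; the function name `adjusted_mean_bounds`, the use of the sample standard deviation, and the example values are assumptions.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def adjusted_mean_bounds(values: Sequence[float], k: float = 1.0) -> Tuple[float, float]:
    """Return (mu - k*sigma, mu + k*sigma) for one sample's replicate values."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (assumption)
    return mu - k * sigma, mu + k * sigma


# Hypothetical RLU values for one sample.
lower, upper = adjusted_mean_bounds([2.0, 4.0, 6.0], k=1.0)
```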

For ENHAvir 2.0, the herpesviral enhancers we used to train the model are mentioned in Zenodo (Supplementary File T1).

Data processing

Training the NLP model requires both positive and negative data155. Specifically, sequences ($X_{high}$) are needed for which the activation $z$ is higher, as well as sequences ($X_{low}$) for which the activation $z$ is lower. Positive data (high values of $z$) and negative data (low values of $z$) were collected as per the method described above.

Typically, genomic enhancers can be 100–1000 bp in length, and multiple enhancers might form clusters working as a ‘super-enhancer’156. The KSHV genomic enhancers also fall within this length range. We denoted an enhancer region by $X_{enh}$ and its activation value by $z_{enh}$. The $z_{enh}$ values from the six KSHV enhancers were first collected to calculate the maximum activation value $z_{enh.max}$. Likewise, we denoted a non-enhancer region by $X_{non\text{-}enh}$ and its activation value by $z_{non\text{-}enh}$.

We then used a sliding window technique to create the (X, z) data pairs from the original data as follows:

  1. We created sub-sequences of length $n$ by starting from the beginning of $X_{enh}$ and sliding $s$ positions at a time: $X_{enh}[0:n-1]$, $X_{enh}[s:s+n-1]$, $X_{enh}[2s:2s+n-1]$, etc. These sub-sequences form the input sequence data $X$.

  2. We drew a value from the uniform distribution $U[0.95 \cdot z_{enh}, 1.05 \cdot z_{enh}]$ and divided it by $z_{enh.max}$ to use as the corresponding output activation value $z$. Note that we drew a new sample for each of the sub-sequences of length $n$. The division ensures that the output values are close to the 0–1 range.

  3. We then repeated all the steps above for all the enhancer regions present in the provided data.

Positive data instances were created following the method described above. A similar process was followed with Xnon-enh and znon-enh to create the negative data instances. These two groups were merged and shuffled to finally obtain ≈21.2k instances of (X,z) data pairs for training (Dtrain) and ≈2.2k instances of (X,z) data pairs for validation (Dvalid).

We created and validated the data using a sliding value of s = 5. All the sequences in X are of length n = 200 tokens to balance token length and data quantity.
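The sliding-window construction of (X, z) pairs described above can be sketched as follows. This is an illustration only: it treats each nucleotide character as one token, which the text does not confirm, and the names `make_pairs`, `z_region`, `z_max`, and `toy_seq` are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def make_pairs(seq: str, z_region: float, z_max: float,
               n: int = 200, s: int = 5,
               rng: Optional[random.Random] = None) -> List[Tuple[str, float]]:
    """Cut seq into length-n windows every s positions and attach a jittered label."""
    rng = rng or random.Random(0)
    pairs = []
    for start in range(0, len(seq) - n + 1, s):
        window = seq[start:start + n]
        # A fresh draw from U[0.95*z, 1.05*z] per window, scaled by the maximum activation.
        z = rng.uniform(0.95 * z_region, 1.05 * z_region) / z_max
        pairs.append((window, z))
    return pairs


# Hypothetical 220-nt region with activation 10 and a maximum activation of 20.
toy_seq = "ACGT" * 55
toy_pairs = make_pairs(toy_seq, z_region=10.0, z_max=20.0)
```

With `s = 5` and `n = 200`, neighbouring windows overlap by 195 positions, which is how a handful of regions can yield thousands of training instances.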

Training ENHAvir

We trained all the parameters of ENHAvir on the $D_{train}$ data. We used the mean absolute error (MAE) between the predicted activation $\bar{z}$ and the original activation $z$ as the loss function. We used the Adafactor optimizer157 to tune the parameters of ENHAvir with this MAE loss. We used a batch size of 4 instances and trained for 20 epochs with a learning rate of 1e-5. After training, we picked the epoch checkpoint with the lowest validation loss on the $D_{valid}$ set. The MAE on a percentage scale for this checkpoint was 0.04%, implying that the trained ENHAvir tool can closely approximate the real activation data $z$ for various values of $X$ in the validation set. This checkpoint was used to generate all the results in this paper in which viral and human genome predictions were made using ENHAvir trained only on experimentally measured KSHV enhancer data.
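For readers unfamiliar with the loss and the checkpoint-selection rule, a minimal sketch is given below. It shows only the MAE computation and the choice of the epoch with the lowest validation loss, not the optimizer or the model; the function names and the example numbers are hypothetical.

```python
from typing import Sequence


def mean_absolute_error(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Average of |z_bar - z| over all instances."""
    if len(pred) != len(gold) or not pred:
        raise ValueError("pred and gold must be non-empty and of equal length")
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(pred)


def best_epoch(val_losses: Sequence[float]) -> int:
    """Index (0-based) of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=lambda i: val_losses[i])


# Hypothetical predictions, labels, and per-epoch validation losses.
toy_mae = mean_absolute_error([0.1, 0.5], [0.2, 0.3])
toy_best = best_epoch([0.30, 0.10, 0.20])
```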

Enhancer prediction using ENHAvir

The framework to obtain predictions using ENHAvir uses the following notation: $\bar{X}$ is the test sequence for prediction, and $l$ is its length (number of nucleotides). This notation is used in the following steps:

Data pre-processing

Similar to the training data pre-processing described before, we followed a sliding-window-based approach to chunk the test sequence. With a sequence length of $n$ and a sliding step of $s$, we created smaller sub-sequences: $\bar{X}[0:n-1]$, $\bar{X}[s:s+n-1]$, $\bar{X}[2s:2s+n-1]$, etc. We denoted these sub-sequences concisely as $\bar{X}_0$, $\bar{X}_s$, $\bar{X}_{2s}$, etc.

Prediction

We then individually predicted the activation value $\bar{z}$ for each of these sub-sequences with the trained ENHAvir model. We denoted these predictions as $\bar{z}_0$, $\bar{z}_s$, $\bar{z}_{2s}$, etc. We used the values of $s = 100$ and $n = 200$ to generate predictions for the test data. Hence, we obtained the activation values $\bar{z}_0, \bar{z}_{100}, \bar{z}_{200}, \ldots$, i.e., at positions with intervals of 100. In summary, we had predictions at various starting points over the test sequences at regular intervals $s$.
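The chunking of a test sequence into overlapping windows can be sketched as below. The sketch assumes that a trailing remainder shorter than `n` is dropped, which the text does not specify; `chunk_sequence` and `toy_chunks` are hypothetical names, and the model call itself is omitted.

```python
from typing import List, Tuple


def chunk_sequence(seq: str, n: int = 200, s: int = 100) -> List[Tuple[int, str]]:
    """Return (start, sub-sequence) pairs for windows of length n taken every s positions."""
    return [(start, seq[start:start + n])
            for start in range(0, len(seq) - n + 1, s)]


# Hypothetical 450-nt test sequence: windows start at positions 0, 100, and 200.
toy_chunks = chunk_sequence("A" * 450)
toy_starts = [start for start, _ in toy_chunks]
```

Each start position would then receive one predicted activation value from the trained model, giving one prediction every `s` nucleotides along the genome.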

Post-processing

The raw prediction values were then further processed for a denoised cleaner version using the following post-processing steps:

  1. We first considered all the predicted activation values from the entire sequence, $\bar{z}_0, \bar{z}_s, \ldots, \bar{z}_{k \cdot s}, \ldots$, to find the $p$-th percentile value $z_p$.

  2. We defined a window of size $w$ around a particular predicted position $k \cdot s$ to be a consecutive sequence of $w$ predicted values that includes the prediction $\bar{z}_{k \cdot s}$. For example, a window that ended at position $k \cdot s$ is $\{\bar{z}_{(k-w) \cdot s}, \bar{z}_{(k-w+1) \cdot s}, \ldots, \bar{z}_{k \cdot s}\}$. Similarly, the window that began at position $k \cdot s$ is $\{\bar{z}_{k \cdot s}, \bar{z}_{(k+1) \cdot s}, \ldots, \bar{z}_{(k+w) \cdot s}\}$.

  3. We denoted the number of windows of size $w$ for position $k \cdot s$ as $b$. If at least $a$% of these windows had a maximum predicted value above $z_p$, then we set the final activation prediction at position $k \cdot s$ to be $\max(\bar{z}_{k \cdot s} - z_p, 0)$. Otherwise, we set the final prediction at position $k \cdot s$ to be 0, based on the fact that enhancers need a certain length for TFs to bind and for the enhancers to exhibit their regulatory functions.

We used the values of $p = 80$, $w = 4$, and $a = 25$. The value of $b$ depended on the position $k \cdot s$. When $k \cdot s$ pointed towards the very beginning or the very end of the original sequence $\bar{X}$, the value of $b$ could be between 1 and $w - 1$. Otherwise, the value of $b$ stays equal to $w$ for the rest of the sequence.

In summary, in the post-processing step, we used a sliding-window-based technique to modify the values of the predicted activation using a percentile-based filtering strategy.
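One possible reading of the three post-processing steps is sketched below. Several details are assumptions rather than the published implementation: the percentile uses linear interpolation between order statistics, each window contains exactly `w` consecutive predictions, and only windows that fit entirely inside the sequence are counted. The names `percentile` and `postprocess` are hypothetical.

```python
from typing import List, Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """p-th percentile with linear interpolation between order statistics (assumption)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must be non-empty")
    rank = (len(ordered) - 1) * p / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def postprocess(preds: Sequence[float], p: float = 80, w: int = 4, a: float = 25) -> List[float]:
    """Percentile-based filtering of raw predictions, one value per window start."""
    z_p = percentile(preds, p)
    m = len(preds)
    out = []
    for k in range(m):
        # Start indices of all length-w windows that contain position k and fit in the sequence.
        starts = range(max(0, k - w + 1), min(k, m - w) + 1)
        b = len(starts)
        hits = sum(1 for st in starts if max(preds[st:st + w]) > z_p)
        keep = b > 0 and 100.0 * hits / b >= a
        out.append(max(preds[k] - z_p, 0.0) if keep else 0.0)
    return out


toy_out = postprocess([1.0, 2.0, 3.0, 4.0, 5.0], p=50, w=2, a=25)
```

Under this literal reading, a position whose own value exceeds $z_p$ always passes the window test, because every window containing it has a maximum above $z_p$. The window criterion would therefore only change the output under a different interpretation (for example, a window minimum or mean), which the text does not specify.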

Genome assembly

ENHAvir prediction values for each viral genome and human genomic locus were reformatted in ‘wiggle’ format and visualized in the UCSC genome browser158. All the tracks, along with the viral assemblies and the human genomic coordinates, are listed in Supplementary Table 1.
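As an illustration of the reformatting step, the sketch below writes a list of prediction values as a fixedStep wiggle track. The chromosome name, the track name, and the choice of `span` equal to `step` are assumptions made for the example; `to_fixedstep_wiggle` is a hypothetical helper, not part of ENHAvir.

```python
from typing import Sequence


def to_fixedstep_wiggle(chrom: str, values: Sequence[float],
                        step: int = 100, span: int = 100,
                        name: str = "ENHAvir") -> str:
    """Format predictions as a fixedStep wiggle track (wiggle coordinates are 1-based)."""
    lines = [
        f'track type=wiggle_0 name="{name}"',
        f"fixedStep chrom={chrom} start=1 step={step} span={span}",
    ]
    lines.extend(f"{v:.4f}" for v in values)
    return "\n".join(lines) + "\n"


# Hypothetical chromosome name and two prediction values.
toy_wig = to_fixedstep_wiggle("chrTest", [0.0, 0.5])
```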

STARR-seq analysis

Raw STARR-seq reads of K562 cells were derived from Reddy et al., 2024 in ENCODE112 (Identifier No. ENCSR926NDZ). STARR-seq reads of HepG2 cells were derived from Sahu et al.111 (GEO: GSE180158). All the raw sequencing reads were first adapter-trimmed with Trimmomatic159. The trimmed files were aligned to the human reference genome (hg38) using BowTie2160 with the --very-sensitive-local mapping parameter. The generated STARR-seq BAM files were compared with the respective control files using DeepTools2161 and then visualized in the UCSC Genome Browser158.

HiC data analysis

The publicly available HiC dataset GSE160973 was reanalyzed using the pipeline described by Morgan et al.139. Paired-end reads of 75 bp were separately aligned to the EBV reference genome (NC_007605.1) using BowTie2160. Redundant reads derived from PCR bias, reads with low quality (MapQ <30), and reads from self-ligation and undigested products were removed. The reference genome was divided into 5 kb windows with a 1 kb sliding step. The paired reads, assigned to two 5 kb windows, were used to produce raw contact matrices, from which biases were removed with ICE162. ICE contact normalization was done 30 times. Significant associations were determined based on the distance between two 5 kb windows, and were then categorized into 20 groups. HiC scores were assumed to follow a Poisson distribution with a λ parameter matching the mean score. Contacts with FDR-adjusted p < 0.05 were considered significant.
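The significance test described above can be illustrated with a short sketch that computes an upper-tail Poisson p value for a contact score and then adjusts a set of p values. The text does not name the FDR procedure; Benjamini-Hochberg is used here as an assumption, and the function names and example values are hypothetical.

```python
import math
from typing import List, Sequence


def poisson_sf(k: int, lam: float) -> float:
    """Upper-tail probability P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def benjamini_hochberg(pvals: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p values for three contacts in one distance group.
toy_adj = benjamini_hochberg([0.01, 0.04, 0.03])
toy_significant = [q < 0.05 for q in toy_adj]
```

In practice, λ would be estimated separately for each distance group, so that a contact is compared only with contacts spanning a similar genomic distance.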

ChIP-seq analysis

Publicly available ChIP-seq raw sequence reads (FASTQ files) were used: E-MTAB-7788 (for ZTA)163, GSE29611 (for H3K27Ac, H3K4Me1/2/3)109, GSE281522 (for RNApolII)164. The raw read files were first adapter-trimmed using Trimmomatic159. The trimmed reads were then aligned to the EBV reference genome (NC_007605.1) using BowTie2 with the --very-sensitive-local mapping parameter160. PCR duplicates were then removed using SAMTools RmDup165. DeepTools2 was used to compare the treated and input BAM files161. The normalized ChIP-seq files were visualized in the UCSC Genome Browser158.

ATAC-seq analysis

Publicly available ATAC-seq raw sequence reads (FASTQ files) from GSE170245 were used109. The raw read files were trimmed using Trimmomatic159 and aligned to the EBV reference genome (NC_007605.1) using BowTie2 with the --very-sensitive-local mapping parameter160. The alignment file was visualized in the UCSC Genome Browser158.

Cell lines

HEK293T and HCT-116 cells were maintained in DMEM and McCoy’s 5A medium, respectively, supplemented with 10% fetal bovine serum (FBS, Gibco), L-glutamine (2 mM, Gibco), penicillin-streptomycin (100 IU/ml and 100 µg/ml, respectively, Gibco), and sodium pyruvate (1 mM, Gibco) at 37 °C under a 5% CO2 atmosphere. Raji and Akata cells were maintained in RPMI-1640 medium with 10% FBS and the same supplements. BJAB and BC-1 cells were maintained in RPMI-1640 medium with 20% FBS, along with the same supplements. Raji, Akata, and BC-1 cells were treated with 20 ng/mL phorbol ester (TPA) and 1.25 mM sodium butyrate (NaB) to induce lytic reactivation of EBV. BJAB cells were also treated with the same concentrations of TPA and NaB, where indicated. HEK293T, BJAB, Akata, Raji, and BC-1 cells were kindly provided by Richard F. Ambinder. The HCT-116 cell line was a kind gift from Bert Vogelstein.

Cloning the predicted enhancers into the enhancer reporter plasmid

For validation, we used the sequence of the predicted enhancer. When a peak was shorter than 1000 bp, we added 200 bp on both sides from the respective viral genome. Gibson assembly-specific primers were designed (Supplementary Table 3) for the ENHAvir-predicted enhancers on the EBV genome and for a STARR-seq negative (non-enhancer) human genome sequence (hg19, chr3: 55198740-55200041)63. Whole genomic DNA was isolated from Akata cells using the DNeasy Blood and Tissue Kit (Qiagen, Cat. No. 69506). PCR was performed on the isolated DNA with Phusion Hot Start Flex polymerase (NEB, Cat. No. M0535S), and amplicons were purified using the Wizard SV Gel and PCR purification system (Promega, Cat. No. A9281). The amplicons were cloned into the STARR-seq luciferase reporter vector between the XbaI and FseI sites using Gibson assembly master mix (EURx, Cat. No. ERX-E1050-01). The STARR-seq luciferase reporter plasmid (vector_ORI_empty) was a gift from Alexander Stark (Addgene plasmid #99297; https://www.addgene.org/99297; RRID: Addgene_99297)63. The ZTA promoter reporter plasmid, used as the positive control for the EBV reporter assays, was a gift from Honglin Chen and S. Diane Hayward62. The EBV TR (TAR11 Plus) vector was a gift from Prashant Dessai. The 11X EBV TR was cloned into the XbaI site of the Addgene #99297 plasmid using T4 DNA ligase (NEB, Cat. No. M0202). Suitable primers were designed (Supplementary Table 3) for cloning the OriP sequence into the HindIII-NcoI site, upstream of the luciferase gene in the STARR-seq luciferase reporter plasmid, using T4 DNA ligase. The EBV TR was subsequently cloned downstream of the luciferase gene in the OriP-Luciferase reporter plasmid, as described above.
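The padding rule for short peaks can be written as a small helper. This is a sketch under stated assumptions: coordinates are taken as 1-based and inclusive, and clipping at the genome ends is an added safeguard that is not described in the text. The function name and example coordinates are hypothetical.

```python
def pad_peak(start, end, genome_length, min_length=1000, flank=200):
    """Extend a peak by `flank` bp on each side if it is shorter than `min_length`.

    Coordinates are 1-based inclusive and are clipped to the genome ends.
    """
    if end - start + 1 < min_length:
        start = max(1, start - flank)
        end = min(genome_length, end + flank)
    return start, end


# A 600 bp peak is padded; a 1401 bp peak is left unchanged.
print(pad_peak(5000, 5599, 170000))
print(pad_peak(100, 1500, 170000))
```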

ENHAvir-predicted peaks in HCMV (TB40/E, kindly provided by Noam Stern-Ginossar), the IRS1 and TR sequences, and the similar TR sequences of HHV-6A, -6B, and -7 were cloned into the XbaI site of the STARR-seq luciferase reporter using T4 DNA ligase (Supplementary Table 3). The VP16 wild-type promoters of HSV-1 (NC_001806.2) and HSV-2 (strain 186), kindly provided by Oren Kobiler, and the HSV-1 VP16 promoter with added tandem Sp1/Egr1 binding sites were cloned into the reporter plasmid via Gibson assembly, as for the EBV enhancers (Supplementary Table 3).

Transfection of plasmids

Akata and Raji cell lines were chemically transfected with the ENHAvir-predicted enhancer constructs and the internal control for transfection efficiency, the pGL4.74 (Renilla) plasmid, at a 3:1 ratio using FuGENE HD reagent (Promega, Cat. No. E2311). BJAB cells were chemically transfected with the predicted enhancer library, the pGL4.74 control, and the ZTA expression plasmid147 or empty vector (pSG5) at a 3:1:1 ratio using the FuGENE HD reagent. Briefly, 4.5 × 10⁵ cells per well were seeded in 6-well plates, and 3 μg DNA was transfected in each well using 13.5 μl FuGENE HD reagent (4.5 μl FuGENE per 1 μg DNA). BC-1 cells were transfected in the same manner as BJAB cells, except that only the EBV TR reporter plasmid was used instead of the EBV enhancer library. HEK293T and HCT-116 cells were transfected using PolyJet (Signagen, Cat. No. SL100688), following the manufacturer’s protocol. Briefly, 150 ng of each firefly reporter plasmid from the EBV library, 50 ng control Renilla plasmid, and 50 ng ZTA expression plasmid (a kind gift from Gary Hayward147) or pSG5 were co-transfected into 50,000 HEK293T or HCT-116 cells per well in a 24-well plate. For the OriP-Luciferase-EBV_TR plasmid, FuGENE HD was used to transfect HEK293T and BJAB cells at a ratio of 3:1 and 6:1, respectively, with the reporter plasmids.
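As a worked example of the quantities above, the sketch below splits 3 μg of DNA according to a 3:1:1 ratio and computes the matching FuGENE HD volume at 4.5 μl per μg. Interpreting the ratio as a mass ratio is an assumption, and the helper names are hypothetical.

```python
from fractions import Fraction


def split_dna(total_ug, ratio):
    """Split a total DNA mass (in ug) among plasmids according to `ratio`."""
    total_parts = sum(ratio)
    return [Fraction(total_ug) * part / total_parts for part in ratio]


def fugene_volume_ul(total_ug, ul_per_ug=Fraction(9, 2)):
    """Reagent volume in ul at 4.5 ul per ug of DNA."""
    return Fraction(total_ug) * ul_per_ug


# Reporter : Renilla control : ZTA expression plasmid (or empty vector).
masses = split_dna(3, (3, 1, 1))
print([float(m) for m in masses])
print(float(fugene_volume_ul(3)))
```

Under this reading, each well would receive 1.8 μg reporter, 0.6 μg Renilla control, and 0.6 μg ZTA or empty vector, with 13.5 μl of reagent.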

Dual-luciferase reporter assay

Dual-luciferase reporter assays were performed on cell lysates collected 48 h post-transfection, using in-house prepared substrate solutions containing CycLuc1 (Calbiochem, Cat. No. 530650) for firefly luciferase and coelenterazine (NBT New Biotechnology, Cat. No. BTM-10110-1) for Renilla luciferase. Reporter signals were measured using a microplate reader (Varioskan™ LUX Multimode Microplate Reader, ThermoFisher Scientific). Relative Luciferase Units (RLU) were calculated by dividing the firefly luciferase units of the tested reporter plasmids by the Renilla luciferase units of the internal control.
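The RLU calculation amounts to a per-well ratio followed by summary statistics across replicates. A minimal sketch with hypothetical readings follows; the function name and the sample values are illustrative only.

```python
from statistics import mean, stdev


def relative_luciferase_units(firefly, renilla):
    """Firefly signal normalized to the Renilla internal control."""
    if renilla <= 0:
        raise ValueError("Renilla signal must be positive")
    return firefly / renilla


# Hypothetical raw readings from three replicate wells.
firefly_reads = [1200.0, 1500.0, 900.0]
renilla_reads = [300.0, 300.0, 300.0]

rlu = [relative_luciferase_units(f, r)
       for f, r in zip(firefly_reads, renilla_reads)]
print(rlu, mean(rlu), stdev(rlu))
```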

Generation of CRISPR-activation plasmids and lentiviral transduction

HEK293T cells were transfected with the ScFv-2ERT2-VPH plasmid, a gift from Yu Wang (Addgene plasmid #120556; https://www.addgene.org/120556/; RRID: Addgene_120556)81 and pHRdSV40-dCas9-10xGCN4_v4-P2A-BFP (Addgene plasmid # 60903; https://www.addgene.org/60903/; RRID: Addgene_60903), a gift from Ron Vale166, where the BFP was replaced with the hygromycin resistance gene and specific sgRNAs targeting EBV-TR were cloned (Supplementary Table 3), pMD2.G (Addgene plasmid #12259; https://www.addgene.org/12259/; RRID: Addgene_12259) and psPAX2 (Addgene plasmid #12260; https://www.addgene.org/12260/; RRID: Addgene_12260), gifts from Didier Trono, to generate lentiviral particles. BC-1 cells were transduced with lentiviruses resuspended in RPMI media supplemented with 8 µg/ml polybrene. The cells were selected with 750 µg/ml G418 and 600 µg/ml hygromycin for 7 days. The selected cells were treated with 50 nM 4-hydroxytamoxifen for 4 and 24 h to induce CRISPRa.

RNA isolation and reverse transcription-quantitative PCR

Total RNA was isolated from BC-1 cells following CRISPRa using the RNeasy Mini Kit (Qiagen, Cat. No. 74106) according to the manufacturer’s protocol. cDNA was synthesized from the RNA using the High-Capacity cDNA Reverse Transcription Kit (Applied Biosystems, Cat. No. 4374966). Subsequently, qPCR was performed using 2x qPCRBIO SyGreenBlue Mix Hi-ROX (PCR Biosystems, Cat. No. PB20.16-50) and analyzed with the CFX96 Touch Real-Time PCR Detection System (Bio-Rad Laboratories). Relative expression of target genes (Supplementary Table 3) was calculated with suitable primers167,168 by the ΔΔCt method169, using β-Actin as an internal control.
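For reference, the ΔΔCt calculation can be sketched as follows, assuming ideal amplification efficiency (a doubling per cycle). The function name and the Ct values are hypothetical.

```python
def fold_change_ddct(ct_target_treated, ct_ref_treated,
                     ct_target_control, ct_ref_control):
    """Relative expression by the delta-delta-Ct method (2 ** -ddCt)."""
    delta_treated = ct_target_treated - ct_ref_treated  # normalize to reference gene
    delta_control = ct_target_control - ct_ref_control
    ddct = delta_treated - delta_control
    return 2.0 ** (-ddct)


# Target Ct drops from 25 to 22 while the reference gene stays at 16.
print(fold_change_ddct(22.0, 16.0, 25.0, 16.0))
```

A ΔΔCt of −3 corresponds to an eight-fold increase in the treated sample relative to the control.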

Motif and phylogenetic analysis

FASTA sequences of the human Alu elements and 186 randomly curated enhancer sequences from the human genome were retrieved from the hg38 reference genome using the UCSC Genome Browser158. Only Alu sequences longer than 300 bp were considered in the downstream analyses. The FASTA sequences of the viral and the analyzed human enhancers, along with the human Alu sequences and random human enhancer sequences, were concatenated and searched for common motifs using the MEME suite113. To assess evolutionary conservation among these sequences, multiple alignment was performed with Clustal W170 using the slow (accurate) dynamic programming option. The alignment was then used to produce a maximum likelihood tree with the Tamura-Nei distance model in the MEGA software171, which was plotted using the Microreact online server172. Only branches and clades with bootstrap values above 80 were used to build the phylogenetic tree. Data on the age of the human Alu elements were adopted from Su et al.122.
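The length filter applied to the Alu sequences can be sketched with a small FASTA parser. The helper names and the toy records are hypothetical; real input would be the FASTA files retrieved from the genome browser.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping record name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # keep the first token of the header
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def filter_by_length(records, min_length=300):
    """Keep only sequences strictly longer than `min_length` bp."""
    return {n: s for n, s in records.items() if len(s) > min_length}


# Toy input: one 301 bp record and one 300 bp record.
toy = ">alu_a\n" + "A" * 301 + "\n>alu_b\n" + "C" * 300 + "\n"
kept = filter_by_length(read_fasta(toy))
print(sorted(kept))
```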

Quantification and statistical analyses

Reporter assay data represent the mean ± S.D. of n = 9 (three independent experiments with three biological replicates each). For RT-qPCR assays, the sample size was n = 3. GraphPad Prism (v9.0.1) for macOS (GraphPad Software, Boston, Massachusetts, USA; www.graphpad.com) was used to quantify data, generate graphs, and perform statistical analyses with two-tailed Student’s t tests. The exact P values of these t tests are reported in the relevant figures and the source data file.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Acknowledgements

We would like to thank Alexander Stark, Yu Wang, Ron Vale, Didier Trono, Prashant Dessai, Gary S. Hayward, Honglin Chen and S. Diane Hayward for kindly providing the plasmids, Noam Stern-Ginossar and Oren Kobiler for viral DNA, Bert Vogelstein, Richard F. Ambinder, and S. Diane Hayward for cell lines, and Oren Kobiler for fruitful discussions. This work was supported by grants from the Israel Science Foundation (https://www.isf.org.il) to M.S. (1365/21), and the Israel Cancer Research Fund (https://www.icrfonline.org/) to M.S. (23-101-PG). We are grateful for the support of the Elias, Genevieve, and Georgianna Atol Charitable Trust to the Daniella Lee Casper Laboratory in Viral Oncology. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author contributions

These authors contributed equally: Nilabja Roy Chowdhury, Deepanway Ghosal. Conceived and designed the experiments: N.R.C., D.G., V.G., M.S. Performed the experiments: N.R.C., D.G. Analyzed the data: N.R.C., D.G., M.S. Generated tools & reagents: N.R.C., D.G., V.G. Wrote the paper: N.R.C., D.G., and M.S.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.

Data availability

The HiC dataset by Morgan et al.139 used in this study is publicly available in the NCBI GEO under accession no. GSE160973. ChIP-seq raw sequence reads used in this study are publicly available in repositories under the following accession nos.: E-MTAB-7788 (for ZTA)163, GSE29611 (for H3K27Ac, H3K4Me1/2/3)109, GSE281522 (for RNApolII)164. Publicly available ATAC-seq raw sequence reads from NCBI GEO accession no. GSE170245 were used109. Raw STARR-seq reads of K562 cells from Reddy et al., 2024 in ENCODE112 are publicly available under Identifier no. ENCSR926NDZ. STARR-seq reads of HepG2 cells from Sahu et al.111 are available in the NCBI GEO under accession no. GSE180158. Source data are provided with this paper.

Code availability

All source code for the ENHAvir framework has been deposited in the GitHub repository (https://github.com/shamay-lab/ENHAvir). All supplementary files, along with the ENHAvir code, have been deposited in Zenodo (10.5281/zenodo.17500116)173.

Competing interests

M.S. and N.R.C. applied for a patent entitled “Natural Language Model-based tool ENHAvir predicts viral and host enhancers” discussed in this paper (Application number 63/791,758; filing date: Apr. 21, 2025). D.G. and V.G. are collaborators on the same patent application.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

The online version contains supplementary material available at 10.1038/s41467-025-66861-y.

References

  • 1. Imperiale, M. J. & Nevins, J. R. Adenovirus 5 E2 transcription unit: an E1A-inducible promoter with an essential element that functions independently of position or orientation. Mol. Cell Biol. 4, 875–882 (1984).
  • 2. Jones, N. C., Rigby, P. W. & Ziff, E. B. Trans-acting protein factors and the regulation of eukaryotic transcription: lessons from studies on DNA tumor viruses. Genes Dev. 2, 267–281 (1988).
  • 3. Pennacchio, L. A., Bickmore, W., Dean, A., Nobrega, M. A. & Bejerano, G. Enhancers: five essential questions. Nat. Rev. Genet. 14, 288–295 (2013).
  • 4. Maniatis, T., Goodbourn, S. & Fischer, J. A. Regulation of inducible and tissue-specific gene expression. Science 236, 1237–1245 (1987).
  • 5. Smith, A. D., Sumazin, P., Xuan, Z. & Zhang, M. Q. DNA motifs in human and mouse proximal promoters predict tissue-specific expression. Proc. Natl. Acad. Sci. 103, 6275–6280 (2006).
  • 6. Yu, X., Lin, J., Zack, D. J. & Qian, J. Identification of tissue-specific cis-regulatory modules based on interactions between transcription factors. BMC Bioinforma. 8, 437 (2007).
  • 7. Matharu, N. & Ahituv, N. Minor loops in major folds: enhancer–promoter looping, chromatin restructuring, and their association with transcriptional regulation and disease. PLoS Genet. 11, e1005640 (2015).
  • 8. Symmons, O. et al. The Shh topological domain facilitates the action of remote enhancers by reducing the effects of genomic distances. Dev. Cell 39, 529–543 (2016).
  • 9. Matityahu, A. & Onn, I. Hit the brakes – a new perspective on the loop extrusion mechanism of cohesin and other SMC complexes. J. Cell Sci. 134, jcs247577 (2021).
  • 10. Brand, A. H., Breeden, L., Abraham, J., Sternglanz, R. & Nasmyth, K. Characterization of a “silencer” in yeast: A DNA sequence with properties opposite to those of a transcriptional enhancer. Cell 41, 41–48 (1985).
  • 11. Pang, B., van Weerd, J. H., Hamoen, F. L. & Snyder, M. P. Identification of non-coding silencer elements and their regulation of gene expression. Nat. Rev. Mol. Cell Biol. 10.1038/s41580-022-00549-9 (2022).
  • 12. Gisselbrecht, S. S. et al. Transcriptional silencers in drosophila serve a dual role as transcriptional enhancers in alternate cellular contexts. Mol. Cell 77, 324–337.e8 (2020).
  • 13. Huang, D. & Ovcharenko, I. Enhancer–silencer transitions in the human genome. Genome Res. 32, 437–448 (2022).
  • 14. Heintzman, N. D. et al. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat. Genet. 39, 311–318 (2007).
  • 15. Sharifi-Zarchi, A. et al. DNA methylation regulates discrimination of enhancers from promoters through a H3K4me1-H3K4me3 seesaw mechanism. BMC Genom. 18, 964 (2017).
  • 16. Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y. & Greenleaf, W. J. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat. Methods 10, 1213–1218 (2013).
  • 17. Giresi, P. G., Kim, J., McDaniell, R. M., Iyer, V. R. & Lieb, J. D. FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates active regulatory elements from human chromatin. Genome Res. 17, 877–885 (2007).
  • 18. Keene, M. A., Corces, V., Lowenhaupt, K. & Elgin, S. C. DNase I hypersensitive sites in Drosophila chromatin occur at the 5’ ends of regions of transcription. Proc. Natl. Acad. Sci. 78, 143–146 (1981).
  • 19. McGhee, J. A 200 base pair region at the 5′ end of the chicken adult β-globin gene is accessible to nuclease digestion. Cell 27, 45–55 (1981).
  • 20. Boyle, A. P. et al. High-resolution mapping and characterization of open chromatin across the genome. Cell 132, 311–322 (2008).
  • 21. Dekker, J., Rippe, K., Dekker, M. & Kleckner, N. Capturing chromosome conformation. Science 295, 1306–1311 (2002).
  • 22. Dekker, J. The three ‘C’ s of chromosome conformation capture: controls, controls, controls. Nat. Methods 3, 17–21 (2006).
  • 23. Simonis, M. et al. Nuclear organization of active and inactive chromatin domains uncovered by chromosome conformation capture–on-chip (4C). Nat. Genet. 38, 1348–1354 (2006).
  • 24. Belton, J.-M. et al. Hi–C: a comprehensive technique to capture the conformation of genomes. Methods 58, 268–276 (2012).
  • 25. Hsieh, T.-H. S. et al. Mapping nucleosome resolution chromosome folding in yeast by micro-C. Cell 162, 108–119 (2015).
  • 26. Mumbach, M. R. et al. HiChIP: efficient and sensitive analysis of protein-directed genome architecture. Nat. Methods 13, 919–922 (2016).
  • 27. Creyghton, M. P. et al. Histone H3K27ac separates active from poised enhancers and predicts developmental state. Proc. Natl. Acad. Sci. 107, 21931–21936 (2010).
  • 28. Zentner, G. E., Tesar, P. J. & Scacheri, P. C. Epigenetic signatures distinguish multiple classes of enhancers with distinct cellular functions. Genome Res. 21, 1273–1283 (2011).
  • 29. Pradeepa, M. M. et al. Histone H3 globular domain acetylation identifies a new class of enhancers. Nat. Genet. 48, 681–686 (2016).
  • 30. Pradeepa, M. M. Causal role of histone acetylations in enhancer function. Transcription 8, 40–47 (2017).
  • 31. Mattaj, I. W., Lienhard, S., Jiricny, J. & De Robertis, E. M. An enhancer-like sequence within the Xenopus U2 gene promoter facilitates the formation of stable transcription complexes. Nature 316, 163–167 (1985).
  • 32. Pikaard, C. S. Ribosomal gene promoter domains can function as artificial enhancers of RNA polymerase I transcription, supporting a promoter origin for natural enhancers in Xenopus. Proc. Natl. Acad. Sci. 91, 464–468 (1994).
  • 33. Mikhaylichenko, O. et al. The degree of enhancer or promoter activity is reflected by the levels and directionality of eRNA transcription. Genes Dev. 32, 42–57 (2018).
  • 34. Dao, L. T. M. & Spicuglia, S. Transcriptional regulation by promoters with enhancer function. Transcription 9, 307–314 (2018).
  • 35. Marsman, J. & Horsfield, J. A. Long distance relationships: Enhancer–promoter communication and dynamic gene transcription. Biochim. Biophys. Acta Gene Regulat. Mech. 1819, 1217–1227 (2012).
  • 36. Kim, T.-K. & Shiekhattar, R. Architectural and functional commonalities between enhancers and promoters. Cell 162, 948–959 (2015).
  • 37. Medina-Rivera, A., Santiago-Algarra, D., Puthier, D. & Spicuglia, S. Widespread enhancer activity from core promoters. Trends Biochem. Sci. 43, 452–468 (2018).
  • 38. Zhu, I., Song, W., Ovcharenko, I. & Landsman, D. A model of active transcription hubs that unifies the roles of active promoters and enhancers. Nucleic Acids Res. 49, 4493–4505 (2021).
  • 39. Schaukowitch, K. et al. Enhancer RNA facilitates NELF release from immediate early genes. Mol. Cell 56, 29–42 (2014).
  • 40. Kim, T.-K., Hemberg, M. & Gray, J. M. Enhancer RNAs: a class of long noncoding RNAs synthesized at enhancers. Cold Spring Harb. Perspect. Biol. 7, a018622 (2015).
  • 41. Muerdter, F., Boryń, Ł. M. & Arnold, C. D. STARR-seq — principles and applications. Genomics 106, 145–150 (2015).
  • 42. Arnold, C. D. et al. Genome-wide quantitative enhancer activity maps identified by STARR-seq. Science 339, 1074–1077 (2013).
  • 43. Hung, P. V. & Phuong, T. M. Discriminative prediction of enhancers with word combinations as features. In Knowledge and Systems Engineering, 35–47, 10.1007/978-3-319-11680-8_4 (2015).
  • 44. Jia, C. & He, W. EnhancerPred: a predictor for discovering enhancers based on the combination and selection of multiple features. Sci. Rep. 6, 38741 (2016).
  • 45. de Almeida, B. P., Reiter, F., Pagani, M. & Stark, A. DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers. Nat. Genet. 54, 613–624 (2022).
  • 46. Liu, B. iEnhancer-PsedeKNC: Identification of enhancers and their subgroups based on Pseudo degenerate kmer nucleotide composition. Neurocomputing 217, 46–52 (2016).
  • 47. Zhu, D. et al. A deep learning based two-layer predictor to identify enhancers and their strength. Methods 211, 23–30 (2023).
  • 48. Lim, L. W. K., Chung, H. H., Chong, Y. L. & Lee, N. K. A survey of recently emerged genome-wide computational enhancer predictor tools. Comput. Biol. Chem. 74, 132–141 (2018).
  • 49. Dogan, N. et al. Occupancy by key transcription factors is a more accurate predictor of enhancer activity than histone modifications or chromatin accessibility. Epigenet. Chromatin 8, 16 (2015).
  • 50. Fang, Y., Wang, Y., Zhu, Q., Wang, J. & Li, G. In silico identification of enhancers on the basis of a combination of transcription factor binding motif occurrences. Sci. Rep. 6, 32476 (2016).
  • 51. Kumar, S. & Bucher, P. Predicting transcription factor site occupancy using DNA sequence intrinsic and cell-type specific chromatin features. BMC Bioinforma. 17, S4 (2016).
  • 52. Herman-Izycka, J., Wlasnowolski, M. & Wilczynski, B. Taking promoters out of enhancers in sequence based predictions of tissue-specific mammalian enhancers. BMC Med Genomics 10, 34 (2017).
  • 53. Belokopytova, P. S., Nuriddinov, M. A., Mozheiko, E. A., Fishman, D. & Fishman, V. Quantitative prediction of enhancer–promoter interactions. Genome Res. 30, 72–84 (2020).
  • 54. Roy Chowdhury, N., Gurevich, V. & Shamay, M. KSHV genome harbors both constitutive and lytically induced enhancers. J. Virol. 98, e0017924 (2024).
  • 55. Kumar, V. et al. Uniform, optimal signal processing of mapped deep-sequencing data. Nat. Biotechnol. 31, 615–622 (2013).
  • 56. He, P., Gao, J. & Chen, W. DeBERTaV3: improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. arXiv, https://arxiv.org/abs/2111.09543 (2023).
  • 57. Wang, A. et al. GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP 353–355 (Association for Computational Linguistics, 2018). 10.18653/v1/W18-5446.
  • 58. Chavrier, P., Gruffat, H., Chevallier-Greco, A., Buisson, M. & Sergeant, A. The Epstein-Barr virus (EBV) early promoter DR contains a cis-acting element responsive to the EBV transactivator EB1 and an enhancer with constitutive and inducible activities. J. Virol. 63, 607–614 (1989).
  • 59. Wysokenski, D. A. & Yates, J. L. Multiple EBNA1-binding sites are required to form an EBNA1-dependent enhancer and to activate a minimal replicative origin within oriP of Epstein-Barr virus. J. Virol. 63, 2657–2666 (1989).
  • 60. AuCoin, D. P. et al. Amplification of the Kaposi’s sarcoma-associated herpesvirus/human herpesvirus 8 lytic origin of DNA replication is dependent upon a cis-acting AT-rich region and an ORF50 response element and the trans-acting factors ORF50 (K-Rta) and K8 (K-bZIP). Virology 318, 542–555 (2004).
  • 61. Ding, W. et al. The Epstein-Barr virus enhancer interaction landscapes in virus-associated cancer cell lines. J. Virol. 96, e0073922 (2022).
  • 62. Huang, J. et al. Contribution of C/EBP proteins to Epstein-Barr virus lytic gene expression and replication in epithelial cells. J. Virol. 80, 1098–1109 (2006).
  • 63. Muerdter, F. et al. Resolving systematic errors in widely used enhancer activity assays in human cells. Nat. Methods 15, 141–149 (2018).
  • 64. Rowe, M. et al. Differences in B cell growth phenotype reflect novel patterns of Epstein-Barr virus latent gene expression in Burkitt’s lymphoma cells. EMBO J. 6, 2743–2751 (1987).
  • 65. Cesarman, E. & Mesri, E. A. Virus-associated lymphomas. Curr. Opin. Oncol. 11, 322 (1999).
  • 66. Sbih-Lammali, F. et al. Transcriptional expression of Epstein-Barr virus genes and proto-oncogenes in North African nasopharyngeal carcinoma. J. Med Virol. 49, 7–14 (1996).
  • 67. Thompson, M. P. & Kurzrock, R. Epstein-Barr virus and cancer. Clin. Cancer Res. 10, 803–821 (2004).
  • 68. Chen, J. et al. Activation of latent Kaposi’s sarcoma-associated herpesvirus by demethylation of the promoter of the lytic transactivator. Proc. Natl. Acad. Sci. 98, 4119–4124 (2001).
  • 69. Xu, Y. et al. A Kaposi’s sarcoma-associated herpesvirus/human herpesvirus 8 ORF50 deletion mutant is defective for reactivation of latent virus and DNA replication. J. Virol. 79, 3479–3487 (2005).
  • 70. Gradoville, L. et al. Kaposi’s sarcoma-associated herpesvirus open reading frame 50/Rta protein activates the entire viral lytic cycle in the HH-B2 primary effusion lymphoma cell line. J. Virol. 74, 6207–6212 (2000).
  • 71. Nishimura, K. et al. Functional analysis of Kaposi’s sarcoma-associated herpesvirus RTA in an RTA-depressed cell line. J. Hum. Virol. 4, 296–305 (2001).
  • 72. Lieberman, P. M., Hardwick, J. M., Sample, J., Hayward, G. S. & Hayward, S. D. The zta transactivator involved in induction of lytic cycle gene expression in Epstein-Barr virus-infected lymphocytes binds to both AP-1 and ZRE sites in target promoter and enhancer regions. J. Virol. 64, 1143–1155 (1990).
  • 73. Neumayr, C., Pagani, M., Stark, A. & Arnold, C. D. STARR−seq and UMI-STARR-seq: assessing enhancer activities for genome-wide-, high-, and low-complexity candidate libraries. Curr. Protoc. Mol. Biol. 128, e105 (2019).
  • 74. Stewart, J. P., Janjua, N. J., Sunil-Chandra, N. P., Nash, A. A. & Arrand, J. R. Characterization of murine gammaherpesvirus 68 glycoprotein B (gB) homolog: similarity to Epstein-Barr virus gB (gp110). J. Virol. 68, 6496–6504 (1994).
  • 75. Stevenson, P. G. & Efstathiou, S. Immune mechanisms in murine gammaherpesvirus-68 infection. Viral Immunol. 18, 445–456 (2005).
  • 76. Dong, S., Forrest, J. C. & Liang, X. Murine gammaherpesvirus 68: a small animal model for gammaherpesvirus-associated diseases. Adv. Exp. Med. Biol. 225–236, 10.1007/978-981-10-5765-6_14 (2017).
  • 77. Sattler, C., Steer, B. & Adler, H. Multiple lytic origins of replication are required for optimal gammaherpesvirus fitness in vitro and in vivo. PLoS Pathog. 12, e1005510 (2016).
  • 78. Deng, H., Chu, J. T., Park, N.-H. & Sun, R. Identification of cis sequences required for lytic DNA replication and packaging of murine gammaherpesvirus 68. J. Virol. 78, 9123–9131 (2004).
  • 79. Qi, J., Gong, D. & Deng, H. CCAAT/enhancer binding proteins play a role in oriLyt-dependent genome replication during MHV-68 de novo infection. Protein Cell 2, 463–469 (2011).
  • 80. Ponnusamy, R. et al. KSHV but not MHV-68 LANA induces a strong bend upon binding to terminal repeat viral DNA. Nucleic Acids Res. gkv987, 10.1093/nar/gkv987 (2015).
  • 81.Lu, J. et al. Multimode drug inducible CRISPR/Cas9 devices for transcriptional activation and genome editing. Nucleic Acids Res.46, e25–e25 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82.Sui, M. et al. Novel drug-inducible CRISPRa/i systems for rapid and reversible manipulation of gene transcription. Cell. Mol. Life Sci.82, 249 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83.Ostler, J. B., Thunuguntla, P., Hendrickson, B. Y. & Jones, C. Transactivation of Herpes Simplex Virus 1 (HSV-1) infected cell protein 4 enhancer by glucocorticoid receptor and stress-induced transcription factors requires overlapping Krüppel-like transcription factor 4/Sp1 binding sites. J. Virol. 95, e01776-20 (2021). [DOI] [PMC free article] [PubMed]
  • 84.Lang, J. C., Spandidos, D. A. & Wilkie, N. M. Transcriptional regulation of a herpes simplex virus immediate early gene is mediated through an enhancer-type sequence. EMBO J.3, 389–395 (1984). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85.Berthomme, H., Thomas, J., Texier, P., Epstein, A. & Feldman, L. T. Enhancer and long-term expression functions of herpes simplex virus type 1 latency-associated promoter are both located in the same region. J. Virol.75, 4386–4393 (2001). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 86.Khalil, M. I., Sommer, M. H., Hay, J., Ruyechan, W. T. & Arvin, A. M. Varicella-zoster virus (VZV) origin of DNA replication oriS influences origin-dependent DNA replication and flanking gene transcription. Virology481, 179–186 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 87.Stow, N. D. & Davison, A. J. Identification of a Varicella-Zoster virus origin of DNA replication and its activation by herpes simplex virus type 1 gene products. J. Gen. Virol.67, 1613–1623 (1986). [DOI] [PubMed] [Google Scholar]
  • 88. Meier, J. L. & Straus, S. E. Varicella-zoster virus DNA polymerase and major DNA-binding protein genes have overlapping divergent promoters. J. Virol. 67, 7573–7581 (1993).
  • 89. Yang, M., Hay, J. & Ruyechan, W. T. The DNA element controlling expression of the Varicella-Zoster virus open reading frame 28 and 29 genes consists of two divergent unidirectional promoters which have a common USF site. J. Virol. 78, 10939–10952 (2004).
  • 90. Wang, J. et al. Repression of varicella zoster virus gene expression during quiescent infection in the absence of detectable histone deposition. PLoS Pathog. 21, e1012367 (2025).
  • 91. Forte, E. et al. Critical role for the human cytomegalovirus major immediate early proteins in recruitment of RNA polymerase II and H3K27Ac to an enhancer-like element in Ori Lyt. Microbiol. Spectr. 11, e0314422 (2023).
  • 92. Boshart, M. et al. A very strong enhancer is located upstream of an immediate early gene of human cytomegalovirus. Cell 41, 521–530 (1985).
  • 93. Thomsen, D. R., Stenberg, R. M., Goins, W. F. & Stinski, M. F. Promoter-regulatory region of the major immediate early gene of human cytomegalovirus. Proc. Natl. Acad. Sci. 81, 659–663 (1984).
  • 94. Hale, A. E. & Moorman, N. J. The ends dictate the means: promoter switching in herpesvirus gene expression. Annu. Rev. Virol. 8, 201–218 (2021).
  • 95. Isegawa, Y. et al. Comparison of the complete DNA sequences of human herpesvirus 6 variants A and B. J. Virol. 73, 8053–8063 (1999).
  • 96. Nicholas, J. Determination and analysis of the complete nucleotide sequence of human herpesvirus 7. J. Virol. 70, 5975–5989 (1996).
  • 97. Gu, W. et al. Genomic organization and molecular characterization of porcine cytomegalovirus. Virology 460–461, 165–172 (2014).
  • 98. Martin, M. E., Nicholas, J., Thomson, B. J., Newman, C. & Honess, R. W. Identification of a transactivating function mapping to the putative immediate-early locus of human herpesvirus 6. J. Virol. 65, 5381–5390 (1991).
  • 99. Treisman, R. & Maniatis, T. Simian virus 40 enhancer increases number of RNA polymerase II molecules on linked DNA. Nature 315, 72–75 (1985).
  • 100. Herr, W. The SV40 enhancer: transcriptional regulation through a hierarchy of combinatorial interactions. Semin. Virol. 4, 3–13 (1993).
  • 101. Moreau, P. et al. The SV40 72 base repair repeat has a striking effect on gene expression both in SV40 and other chimeric recombinants. Nucleic Acids Res. 9, 6047–6068 (1981).
  • 102. Schmid, D. S. Mixing it up: new insights into interspecies recombination between herpes simplex virus type 1 and 2. J. Infect. Dis. 10.1093/infdis/jiz200 (2019).
  • 103. Thompson, R. L. & Sawtell, N. M. Targeted promoter replacement reveals that herpes simplex virus type-1 and 2 specific VP16 promoters direct distinct rates of entry into the lytic program in sensory neurons in vivo. Front. Microbiol. 10, 1624 (2019).
  • 104. Tomoiu, A., Gravel, A. & Flamand, L. Mapping of human herpesvirus 6 immediate–early 2 protein transactivation domains. Virology 354, 91–102 (2006).
  • 105. Kobayashi, N. et al. Identification of a strong genetic risk factor for major depressive disorder in the human virome. iScience 27, 109203 (2024).
  • 106. Dominguez, G. et al. Human herpesvirus 6B genome sequence: coding content and comparison with human herpesvirus 6A. J. Virol. 73, 8040–8052 (1999).
  • 107. Li, J. et al. iEnhancer-ELM: improve enhancer identification by extracting position-related multiscale contextual information based on enhancer language models. Bioinformatics Adv. 3, vbad043 (2023).
  • 108. Solis, L. M., Sterling-Lentsch, G., Halfon, M. S. & Girgis, H. Z. EnhancerDetector: enhancer discovery from human to fly via interpretable deep learning. Preprint at 10.1101/2025.05.28.656532 (2025).
  • 109. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
  • 110. Fishilevich, S. et al. GeneHancer: genome-wide integration of enhancers and target genes in GeneCards. Database 2017, bax028 (2017).
  • 111. Sahu, B. et al. Sequence determinants of human gene regulatory elements. Nat. Genet. 54, 283–294 (2022).
  • 112. Fabiha, T. et al. A consensus variant-to-function score to functionally prioritize variants for disease. Preprint at 10.1101/2024.11.07.622307 (2024).
  • 113. Bailey, T. L. & Elkan, C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol. 2, 28–36 (1994).
  • 114. Liao, X. et al. Repetitive DNA sequence detection and its role in the human genome. Commun. Biol. 6, 954 (2023).
  • 115. Buffry, A. D., Mendes, C. C. & McGregor, A. P. The functionality and evolution of eukaryotic transcriptional enhancers. Adv. Genet. 143–206, 10.1016/bs.adgen.2016.08.004 (2016).
  • 116. Yáñez-Cuna, J. O. et al. Dissection of thousands of cell type-specific enhancers identifies dinucleotide repeat motifs as general enhancer features. Genome Res. 24, 1147–1156 (2014).
  • 117. Karttunen, K. et al. Transposable elements as tissue-specific enhancers in cancers of endodermal lineage. Nat. Commun. 14, 5313 (2023).
  • 118. Pérez-Rico, Y. A. et al. Transcriptional perturbation of LINE-1 elements reveals their cis-regulatory potential. Preprint at 10.1101/2024.02.20.581275 (2024).
  • 119. Yang, Z., Boffelli, D., Boonmark, N., Schwartz, K. & Lawn, R. Apolipoprotein(a) Gene Enhancer Resides within a LINE Element. J. Biol. Chem. 273, 891–897 (1998).
  • 120. Roller, M. et al. LINE retrotransposons characterize mammalian tissue-specific and evolutionarily dynamic regulatory regions. Genome Biol. 22, 62 (2021).
  • 121. Norris, J. et al. Identification of a new subclass of Alu DNA repeats which can function as estrogen receptor-dependent transcriptional enhancers. J. Biol. Chem. 270, 22777–22782 (1995).
  • 122. Su, M., Han, D., Boyd-Kirkup, J., Yu, X. & Han, J.-D. J. Evolution of Alu elements toward enhancers. Cell Rep. 7, 376–385 (2014).
  • 123. King, D. M. et al. Synthetic and genomic regulatory elements reveal aspects of cis-regulatory grammar in mouse embryonic stem cells. Elife 9, e41279 (2020).
  • 124. Corces, V. G. Keeping enhancers under control. Nature 376, 462–463 (1995).
  • 125. Small, S., Arnosti, D. N. & Levine, M. Spacing ensures autonomous expression of different stripe enhancers in the even-skipped promoter. Development 119, 762–772 (1993).
  • 126. Guéroult-Bellone, M. et al. Spacer sequences separating transcription factor binding motifs set enhancer quality and strength. Preprint at 10.1101/098830 (2017).
  • 127. Sung, N. S., Kenney, S., Gutsch, D. & Pagano, J. S. EBNA-2 transactivates a lymphoid-specific enhancer in the BamHI C promoter of Epstein-Barr virus. J. Virol. 65, 2164–2169 (1991).
  • 128. Liu, Q. & Summers, W. C. Identification of the 12-O-tetradecanoylphorbol-13-acetate-responsive enhancer of the MS gene of the Epstein-Barr virus. J. Biol. Chem. 267, 12049–12054 (1992).
  • 129. McKendall, R. R. Comparative neurovirulence and latency of HSV 1 and HSV 2 following footpad inoculation in mice. J. Med. Virol. 5, 25–32 (1980).
  • 130. Löwhagen, G.-B., Tunbäck, P. & Bergström, T. Proportion of herpes simplex virus (HSV) type 1 and type 2 among genital and extragenital HSV isolates. Acta Derm. Venereol. 82, 118–120 (2002).
  • 131. Mertz, G. J., Rosenthal, S. L. & Stanberry, L. R. Is herpes simplex virus type 1 (HSV-1) now more common than HSV-2 in first episodes of genital herpes? Sex. Transm. Dis. 30, 801–802 (2003).
  • 132. Brazão, C. et al. Six-year study on mucocutaneous herpes simplex virus infections at the largest tertiary teaching hospital in Portugal. Actas Dermosifiliogr. 116, 897–901 (2025).
  • 133. Tippens, N. D., Vihervaara, A. & Lis, J. T. Enhancer transcription: what, where, when, and why? Genes Dev. 32, 1–3 (2018).
  • 134. Hamzeh, F. M., Lietman, P. S., Gibson, W. & Hayward, G. S. Identification of the lytic origin of DNA replication in human cytomegalovirus by a novel approach utilizing ganciclovir-induced chain termination. J. Virol. 64, 6184–6195 (1990).
  • 135. Takemoto, M., Shimamoto, T., Isegawa, Y. & Yamanishi, K. The R3 region, one of three major repetitive regions of human herpesvirus 6, is a strong enhancer of immediate-early gene U95. J. Virol. 75, 10149–10160 (2001).
  • 136. Balan, N., Osborn, K. & Sinclair, A. J. Repression of CIITA by the Epstein–Barr virus transcription factor Zta is independent of its dimerization and DNA binding. J. Gen. Virol. 97, 725–732 (2016).
  • 137. Li, D. et al. Down-regulation of MHC class II expression through inhibition of CIITA transcription by lytic transactivator Zta during Epstein-Barr virus reactivation. J. Immunol. 182, 1799–1809 (2009).
  • 138. Sato, H., Takeshita, H., Furukawa, M. & Seiki, M. Epstein-Barr virus BZLF1 transactivator is a negative regulator of Jun. J. Virol. 66, 4732–4736 (1992).
  • 139. Morgan, S. M. et al. The three-dimensional structure of Epstein-Barr virus genome varies by latency type and is regulated by PARP1 enzymatic activity. Nat. Commun. 13, 187 (2022).
  • 140. Repic, A. M., Shi, M., Scott, R. S. & Sixbey, J. W. Augmented latent membrane protein 1 expression from Epstein-Barr virus episomes with minimal terminal repeats. J. Virol. 84, 2236–2244 (2010).
  • 141. Albanese, M., Tagawa, T. & Hammerschmidt, W. Strategies of Epstein-Barr virus to evade innate antiviral immunity of its human host. Front. Microbiol. 13, 955603 (2022).
  • 142. Sandri-Goldin, R. M. Replication of the herpes simplex virus genome: Does it really go around in circles? Proc. Natl. Acad. Sci. 100, 7428–7429 (2003).
  • 143. Izumiya, Y. et al. Kaposi’s sarcoma-associated herpesvirus terminal repeat regulates inducible lytic gene promoters. J. Virol. 10.1128/jvi.01386-23 (2024).
  • 144. Garber, A. C., Shu, M. A., Hu, J. & Renne, R. DNA binding and modulation of gene expression by the latency-associated nuclear antigen of Kaposi’s sarcoma-associated herpesvirus. J. Virol. 75, 7882–7892 (2001).
  • 145. Lovén, J. et al. Selective inhibition of tumor oncogenes by disruption of super-enhancers. Cell 153, 320–334 (2013).
  • 146. Keck, K. M. et al. Bromodomain and extraterminal inhibitors block the Epstein-Barr virus lytic cycle at two distinct steps. J. Biol. Chem. 292, 13284–13295 (2017).
  • 147. Sarisky, R. T. & Hayward, G. S. Evidence that the UL84 gene product of human cytomegalovirus is essential for promoting oriLyt-dependent DNA replication and formation of replication compartments in cotransfection assays. J. Virol. 70, 7398–7413 (1996).
  • 148. Sahrhage, M., Paul, N. B., Beißbarth, T. & Haubrock, M. The importance of DNA sequence for nucleosome positioning in transcriptional regulation. Life Sci. Alliance 7, e202302380 (2024).
  • 149. Deininger, P. Alu elements: know the SINEs. Genome Biol. 12, 236 (2011).
  • 150. Koonin, E. V., Dolja, V. V. & Krupovic, M. The logic of virus evolution. Cell Host Microbe 30, 917–929 (2022).
  • 151. Weiss, R. A. & Stoye, J. P. Our Viral Inheritance. Science 340, 820–821 (2013).
  • 152. Weiss, R. A. Exchange of genetic sequences between viruses and hosts. Curr. Top. Microbiol. Immunol. 1–29, 10.1007/82_2017_21 (2017).
  • 153. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv 10.48550/arXiv.1810.04805 (2018).
  • 154. Liu, Y. et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. ArXiv 10.48550/arXiv.1907.11692 (2019).
  • 155. Bekker, J. & Davis, J. Learning from positive and unlabeled data: a survey. Mach. Learn. 109, 719–760 (2020).
  • 156. Whyte, W. A. et al. Master transcription factors and mediator establish super-enhancers at key cell identity genes. Cell 153, 307–319 (2013).
  • 157. Shazeer, N. & Stern, M. Adafactor: Adaptive Learning Rates with Sublinear Memory Cost. ArXiv 10.48550/arXiv.1804.04235 (2018).
  • 158. Kent, W. J. et al. The human genome browser at UCSC. Genome Res. 12, 996–1006 (2002).
  • 159. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).
  • 160. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
  • 161. Ramírez, F. et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 44, W160–W165 (2016).
  • 162. Imakaev, M. et al. Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nat. Methods 9, 999–1003 (2012).
  • 163. Schaeffner, M. et al. BZLF1 interacts with chromatin remodelers promoting escape from latent infections with EBV. Life Sci. Alliance 2, e201800108 (2019).
  • 164. Tian, S. Z. et al. Landscape of the Epstein-Barr virus-host chromatin interactome and gene regulation. EMBO J. 44, 3872–3915 (2025).
  • 165. Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
  • 166. Tanenbaum, M. E., Gilbert, L. A., Qi, L. S., Weissman, J. S. & Vale, R. D. A protein-tagging system for signal amplification in gene expression and fluorescence imaging. Cell 159, 635–646 (2014).
  • 167. Yu, X., McCarthy, P. J., Wang, Z., Gorlen, D. A. & Mertz, J. E. Shutoff of BZLF1 gene expression is necessary for immortalization of primary B cells by Epstein-Barr virus. J. Virol. 86, 8086–8096 (2012).
  • 168. Sugimoto, A. et al. Different distributions of Epstein-Barr virus early and late gene transcripts within viral replication compartments. J. Virol. 87, 6693–6699 (2013).
  • 169. Livak, K. J. & Schmittgen, T. D. Analysis of relative gene expression data using real-time quantitative PCR and the 2^(−ΔΔCT) method. Methods 25, 402–408 (2001).
  • 170. Larkin, M. A. et al. Clustal W and Clustal X version 2.0. Bioinformatics 23, 2947–2948 (2007).
  • 171. Kumar, S. et al. MEGA12: molecular evolutionary genetic analysis version 12 for adaptive and green computing. Mol. Biol. Evol. 41, msae263 (2024).
  • 172. Argimón, S. et al. Microreact: visualizing and sharing data for genomic epidemiology and phylogeography. Microb. Genom. 2, e000093 (2016).
  • 173. Roy Chowdhury, N., Ghosal, D., Gurevich, V. & Shamay, M. Discovering the complete enhancer map of human herpesviruses using a natural language processing model: supplemental files. Zenodo 10.5281/zenodo.17500116 (2025).
  • 174. Waterhouse, A. M., Procter, J. B., Martin, D. M. A., Clamp, M. & Barton, G. J. Jalview Version 2—a multiple sequence alignment editor and analysis workbench. Bioinformatics 25, 1189–1191 (2009).

Associated Data

Supplementary Materials

Reporting Summary (92.1KB, pdf)
Source Data (74.3KB, xlsx)

Data Availability Statement

The Hi-C dataset by Morgan et al.139 used in this study is publicly available in the NCBI GEO under accession no. GSE160973. ChIP-seq raw sequence reads used in this study are publicly available in repositories under the following accession nos.: E-MTAB-7788 (for ZTA)163, GSE29611 (for H3K27Ac, H3K4Me1/2/3)109, and GSE281522 (for RNApolII)164. Publicly available ATAC-seq raw sequence reads from NCBI GEO accession no. GSE170245 were used109. Raw STARR-seq reads of K562 cells, derived from Reddy et al., 2024 in ENCODE112, are publicly available under identifier no. ENCSR926NDZ. STARR-seq reads of HepG2 cells were taken from Sahu et al.111 and are available in the NCBI GEO under accession no. GSE180158. Source data are provided with this paper.

All source code for the ENHAvir framework has been deposited in the GitHub repository (https://github.com/shamay-lab/ENHAvir). All supplementary files, along with the ENHAvir code, have been deposited in Zenodo (10.5281/zenodo.17500116)173.


Articles from Nature Communications are provided here courtesy of Nature Publishing Group
