PLOS One. 2020 Apr 14;15(4):e0223203. doi: 10.1371/journal.pone.0223203

Unexpected predicted length variation for the coding sequence of the sleep related gene, BHLHE41 in gorilla amidst strong purifying selection across mammals

Krishna Unadkat 1, Justen B Whittall 1,*
Editor: Marc Robinson-Rechavi 2
PMCID: PMC7156063  PMID: 32287315

Abstract

There is a molecular basis for many sleep patterns and disorders involving circadian clock genes. In humans, “short-sleeper” behavior has been linked to specific amino acid substitutions in BHLHE41 (DEC2), yet little is known about variation at these sites and across this gene in mammals. We compare BHLHE41 coding sequences for 27 mammals. Approximately half of the coding sequence was invariable at the nucleotide level and close to three-quarters of the amino acid alignment was identical. No other mammals had the same “short-sleeper” amino acid substitutions previously described from humans. Phylogenetic analyses based on the nucleotides of the coding sequence alignment are consistent with established mammalian relationships confirming orthology among the sampled sequences. Significant purifying selection was detected in about two-thirds of the variable codons and no codons exhibited significant signs of positive selection. Unexpectedly, the gorilla BHLHE41 sequence has a 318 bp insertion at the 5’ end of the coding sequence and a deletion of 195 bp near the 3’ end of the coding sequence (including the two short sleeper variable sites). Given the strong signal of purifying selection across this gene, phylogenetic congruence with expected relationships and generally conserved function among mammals investigated thus far, we suggest the indels predicted in the gorilla BHLHE41 may represent an annotation error and warrant experimental validation.

Introduction

Sleep serves a vital function for survival in animals [1–3], especially vertebrates and even some invertebrates [4]. It is essential in maintaining both physical and mental health, especially in humans, where sleep deprivation is linked to diabetes, high blood pressure, obesity, and decreased immune function [5,6,7]. The timing and duration of sleep vary widely among mammals [8] and are regulated by a plethora of intricate mechanisms including many circadian clock genes [9].

Among the genes responsible for circadian regulation in mammals is the basic helix-loop-helix family member e41 [5, 10, 11], also known as “differentially expressed in chondrocytes protein 2” (DEC2). It is an essential clock protein that acts as a transcription factor which maintains the negative feedback loop in the circadian clock by repressing E-box-mediated transcription [5]. Specifically, by binding to the promoter region on the prepro-orexin gene, BHLHE41 acts as a repressor of orexin expression in mammals. Furthermore, disabling orexin results in narcolepsy in mammals, confirming that orexin plays a vital role in sleep regulation [5].

BHLHE41 has several conserved functional domains including a bHLH region and the “orange” domain. As a member of the bHLH family, BHLHE41 contains a ~60 amino acid bHLH conserved domain that promotes dimerization and DNA binding [10]. Specifically, the bHLH domain is composed of a DNA-binding region, E-box/N-box specificity site, and a dimerization interface for polypeptide binding. The DNA-binding region is followed by two alpha-helices surrounding a variable loop region. As a member of the group E bHLH family, this protein specifically binds to an N-box sequence (CACGCG or CACGAG) based on BHLHE41 amino acid site 53 (glutamate) [12]. The other well studied conserved domain in BHLHE41 is the orange domain which provides specificity as a transcriptional repressor [13]. These domains are conserved between humans and zebrafish in both their amino acid composition and function [14]. Unfortunately, there is no 3D structure described for a mammalian BHLHE41 in Genbank’s Protein Data Bank [15] to determine the spatial effects of amino acid variants.

Because clock genes are essential to sleep regulation, anomalies in them can lead to abnormal patterns of sleep that manifest in a wide variety of ways, ranging from insomnia to oversleeping [1]. A rare point mutation in the BHLHE41 gene of Homo sapiens (P384R in NM_030762, also referred to as P385R as in [10]) confers a “short-sleeper phenotype”. The mutation involves a transversion from a C to G in the DNA sequence of BHLHE41, which results in a non-synonymous substitution from proline to arginine at amino acid position 385 of the BHLHE41 protein. Since proline (nonpolar) and arginine (electrically charged, basic) have chemically dissimilar structures and since substituting these amino acids is relatively rare (BLOSUM62 value of -2), it is not surprising that this mutation has a substantial phenotypic effect. Subjects with this allele reported shorter daily sleep patterns than those with the wild type allele, without reporting any other adverse effects [10]. The function of BHLHE41 in controlling sleep and circadian clocks is conserved between humans and mice, but untested in most other mammals [10]. In zebrafish, BHLHE41 has a similar structure (five exons separated by four introns) and high sequence similarity to the human homologue [14], but no variation at this residue. In Drosophila melanogaster, the most similar gene to BHLHE41 is CG17100 (Clockwork Orange), but it is only weakly similar (<11% amino acid identity; [16]). However, transgenically introducing the short-sleeper allele P385R into Drosophila still resulted in the short-sleeper phenotype [10], suggesting the existence of a similar regulatory network. Another nonsynonymous substitution in BHLHE41 that correlates with altered sleep behavior in humans is Y362H [17]. This mutation reduced the ability of BHLHE41 to suppress CLOCK/BMAL1 and NPAS2/BMAL1 transactivation in vitro [17].

These short-sleeper variants could provide adaptive functions in other mammals. In such a case, we may detect the signature of positive selection on those codons. However, genes such as BHLHE41 are essential for survival and reproduction and are therefore often highly conserved and more likely to show patterns of purifying selection. Purifying selection can be manifested as higher rates of synonymous substitutions (dS) compared to rates of non-synonymous substitutions (dN), giving a negative difference (dN-dS) [18]. Negative overall dN-dS values indicate purifying selection and are often evidence that a gene is involved in some essential function (like the circadian clock), yet a codon-by-codon dN-dS analysis can detect signs of positive selection (e.g., adaptation at the molecular level) on specific codons. To date, no one has examined patterns of selection in BHLHE41.
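
To make the synonymous versus nonsynonymous distinction concrete, the following Python sketch counts both kinds of codon differences between two aligned coding sequences using the standard genetic code. It is only a teaching illustration: the function name and the toy sequences are hypothetical, and it does not implement the Nei-Gojobori estimator or any correction for multiple substitutions.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; amino acids are listed in TCAG order at each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def count_codon_differences(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) counts of differing codons.

    Both inputs are aligned, in-frame coding sequences of equal length.
    Codons containing a gap character are skipped.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of three long")
    synonymous = nonsynonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b or "-" in codon_a or "-" in codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


# Hypothetical toy sequences: CCG -> CCA keeps proline, CCG -> CGG gives arginine.
print(count_codon_differences("ATGCCGTAT", "ATGCCATAT"))
print(count_codon_differences("ATGCCGTAT", "ATGCGGCAT"))
```

A C to G change in the second position of a proline codon (CCG to CGG) is the kind of nonsynonymous change described for the short-sleeper allele, whereas a third-position change such as CCG to CCA leaves the amino acid unchanged.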

In fact, very few nucleotide or amino acid comparisons have been made in mammals beyond human vs. mouse. With the rapid accumulation of mammalian genome sequences, a plethora of homologous sequences likely exists (see [12] for a phylogenetic analysis of all bHLH family members that includes only two mammals, human and mouse; see [14] for a comparison of zebrafish and human that calls for further sampling of mammals). Furthermore, the well-resolved mammalian phylogeny [19, 20] provides a robust foundation on which to test for homology and confirm orthology. For most non-model mammalian species with whole-genome sequences, genes are predicted using algorithms that locate open reading frames (e.g., [21]), yet the predicted genes are rarely validated experimentally [22, 23]. Some algorithms compare putative open reading frames with those of model species to confirm length and expected sequence variation. Accounting for any differences in the length of coding sequences can be a challenge, due to both the existence of alternative mRNA isoforms and increasing divergence times [24]. A comparative approach across a diversity of lineages can help elucidate any unusual patterns of sequence variation.

In order to further explore the function of the BHLHE41 gene, we analyzed the evolutionary relationships among the BHLHE41 coding sequence in humans and other mammals. There are two clear aims of this study: (1) to utilize pre-existing data in Genbank to determine whether any mammals other than humans have the “short-sleeper” allele or exhibit variation at amino acid sites P385R and Y362H, and (2) to assess the degree of biochemical changes at all amino acid substitutions and search for the footprints of selection (dN-dS). To address these goals, we compared BHLHE41 sequences from 27 species of mammals and a reptilian outgroup that came from sequenced cDNA and full genome sequencing projects. After creating a multiple sequence alignment, we used Bayesian and maximum likelihood analyses to investigate the evolutionary relationships underlying this gene among mammals to confirm orthology. Finally, we used the multiple sequence alignment to test for purifying and positive selection across codons.

Materials and methods

Query sequence search

In order to find the complete mRNA coding sequence for BHLHE41 from H. sapiens, we searched ENTREZ using the “RefSeq” filter with the following query to the “Gene” database: “DEC2 AND H. sapiens [organism]”. We confirmed that the same sequence was obtained when searching for “BHLHE41 AND H. sapiens [organism]”. We found a single hit for the Homo sapiens BHLHE41 gene with the RefSeq accession number NM_030762 [25]. The coding sequence for this gene is 1449 base pairs long. According to EMBL (ENSGGOT00000015550.3), there are five introns (yet see [14] where they report only four introns). All subsequent analyses are based solely on the coding sequence as determined by EMBL.

Locating homologous sequences with BLAST

After locating the accession number for our sequence of interest from H. sapiens, we used NCBI’s nucleotide BLAST [26] to find other mammalian homologues to the H. sapiens BHLHE41 mRNA. We searched the nucleotide collection (nr/nt) using Megablast with default parameters (Word Size: 28, Match/Mismatch: 1, -2, Gap Costs: Linear). We downloaded sequences with E-values < 10^-3, local percent identity > 70%, and query coverages ~100% as Genbank complete flatfiles.
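
The cut-offs above amount to a simple filter over BLAST hits, sketched below in Python. The `BlastHit` record and `passes_cutoffs` function are hypothetical names introduced for illustration, not part of any BLAST tool. The coverage cut-off is left as a parameter because the text gives it only approximately. The first two example hits use values from Table 1; the third is invented to show a rejected hit.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    """Hypothetical container for one row of a BLAST report."""
    accession: str
    evalue: float
    identity_pct: float
    query_coverage_pct: float


def passes_cutoffs(hit, max_evalue=1e-3, min_identity=70.0, min_coverage=0.0):
    """Apply the E-value and identity cut-offs, plus an optional coverage cut-off."""
    return (
        hit.evalue < max_evalue
        and hit.identity_pct > min_identity
        and hit.query_coverage_pct >= min_coverage
    )


hits = [
    BlastHit("XM_031000846.1", 0.0, 98.92, 89),   # gorilla, values from Table 1
    BlastHit("XM_007446307", 0.0, 93.63, 38),     # Lipotes vexillifer, Table 1
    BlastHit("HYPOTHETICAL_1", 0.5, 65.0, 20),    # invented weak hit
]
kept = [h.accession for h in hits if passes_cutoffs(h)]
print(kept)
```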

In order to find an outgroup sequence, we performed another BLAST search using Discontiguous Megablast with default parameters (Word size: 11, Match/Mismatch: 2, -3, Gap Existence/Extension Costs: 5, 2) but excluded mammals from the search results. We included the reptile Pelodiscus sinensis (Chinese Softshell Turtle) as our outgroup based on the aforementioned E-value, identity, and query coverage cut-offs. GenBank flatfiles for each species’ coding sequence were downloaded and imported into Geneious Prime (Biomatters, New Zealand).

Multiple sequence alignment

In order to create an alignment with sequences that represent homology to the H. sapiens BHLHE41 mRNA, we used the Geneious Aligner within Geneious Prime. To prevent single nucleotide gaps and ensure all remaining nucleotide gaps were in multiples of three, since this is coding sequence, we applied a cost matrix of 70% similarity (match/mismatch of 5.0/-4.5), a gap open penalty of 90, a gap extension penalty of one, and two refinement iterations.
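
Whether every gap in an aligned coding sequence has a length that is a multiple of three can be checked with a few lines of Python. This is a minimal sketch with hypothetical function names and toy sequences; it inspects gap lengths only and does not verify that gaps begin on codon boundaries.

```python
import re


def gap_lengths(aligned_seq):
    """Return the length of each run of '-' characters in an aligned sequence."""
    return [len(match.group()) for match in re.finditer(r"-+", aligned_seq)]


def gaps_preserve_frame(aligned_seq):
    """True if every gap run has a length divisible by three."""
    return all(length % 3 == 0 for length in gap_lengths(aligned_seq))


# Hypothetical toy sequences.
print(gap_lengths("ATG---CCG------TGA"))
print(gaps_preserve_frame("ATG---CCG------TGA"))
print(gaps_preserve_frame("ATG-CCGTGA"))
```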

Phylogenetic analyses

In order to test for homology and confirm we were comparing orthologous sequences, we conducted maximum likelihood and Bayesian phylogenetic analyses. If the evolutionary relationships of the BHLHE41 coding sequences reflect the known relationships among mammals, then we can conclude homology and proceed with the tests for selection. In order to construct a maximum likelihood tree, we used the RAxML v.4.0 [27] plugin in Geneious Prime. We applied the GTR+CAT+I model of sequence evolution, with the Rapid Bootstrapping algorithm, 1,000 bootstrap replicates, and a Parsimony Random Seed of one. This is the most complex model of sequence evolution available for the RAxML plugin in Geneious Prime. It accounts for six rates of nucleotide substitution with categories for rate variation instead of a gamma distribution for efficiency, while simultaneously estimating the proportion of invariant sites [27].

To compare our maximum likelihood results with another method, we constructed a phylogenetic tree using the MrBayes v.2.2.4 [28] plugin from within Geneious Prime. For this analysis, we used the GTR (General Time Reversible) model of sequence evolution with “gamma” rate variation. The search ran for 2,000,000 generations, subsampling every 1,000 generations after 1,000,000 generations of burn-in. Two parallel runs were conducted using four chains each with a heated chain temp of 0.2. In order to confirm that a sufficient number of generations was sampled in the Bayesian analysis, we recorded the standard deviation of split frequencies comparing the two runs. Furthermore, we examined the trace depicting the maximum likelihood value at each generation to ensure there was no slope (S1 Fig). After both maximum likelihood and Bayesian trees were generated, we rooted them with the reptilian outgroup, P. sinensis (Chinese Softshell Turtle).
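
As a quick arithmetic check on these settings, the Python sketch below computes how many post-burn-in samples each run would contribute. The function name is hypothetical, and the calculation ignores whether the sample at the burn-in boundary itself is retained, so the actual count may differ by one.

```python
def retained_samples(total_generations, sample_every, burnin_generations):
    """Number of samples kept after discarding the burn-in portion of a run."""
    if burnin_generations > total_generations:
        raise ValueError("burn-in cannot exceed the total number of generations")
    return (total_generations - burnin_generations) // sample_every


per_run = retained_samples(2_000_000, 1_000, 1_000_000)
print(per_run)        # samples per run
print(2 * per_run)    # samples across the two parallel runs
```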

Molecular evolution

By comparing the rates of nonsynonymous (dN) and synonymous (dS) substitutions, we tested for selection at the molecular scale. In MEGA7 [29], we used the codon-based Z-test of selection to test for pairwise dN-dS values, using “In Sequence Pairs” as the scope, “Positive Selection” as the test hypothesis, the “Nei-Gojobori method (Proportional)” as the model [30], and “Pairwise Deletion” to account for gaps without removing sites entirely. We then repeated this process using “Purifying Selection” as the test hypothesis. Purifying selection was represented by negative dN-dS values, positive selection by positive dN-dS values, and neutrality by dN-dS values of zero. For the codon-based Z-test of selection, p-values under 0.05 were considered significant.
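
The logic of a one-sided Z-test on dN-dS can be sketched as follows. This assumes the usual form of the statistic, Z = (dN - dS) / sqrt(Var(dN) + Var(dS)), with a standard normal reference distribution. The function name and input values are hypothetical, and in practice the variances would be estimated by the software (for example by bootstrapping) rather than supplied by hand.

```python
from math import sqrt
from statistics import NormalDist


def codon_z_test(dn, ds, var_dn, var_ds):
    """Return (Z, p for purifying selection, p for positive selection).

    p_purifying tests the alternative dN < dS; p_positive tests dN > dS.
    """
    z = (dn - ds) / sqrt(var_dn + var_ds)
    p_purifying = NormalDist().cdf(z)
    p_positive = 1.0 - NormalDist().cdf(z)
    return z, p_purifying, p_positive


# Hypothetical pairwise estimates, not values from this study.
z, p_purifying, p_positive = codon_z_test(dn=0.01, ds=0.05, var_dn=0.0001, var_ds=0.0003)
print(round(z, 2), p_purifying < 0.05, p_positive < 0.05)
```

With a strongly negative Z, the test for purifying selection is significant while the p-value for positive selection approaches one.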

In order to determine if there was directional selection on any specific codons, we used HyPhy [31] from within MEGA7. We used a “Neighbor-Joining tree”, “Maximum Likelihood” statistical method, “Syn-Nonsynonymous” substitution, and the “General Time Reversible” model of sequence evolution to analyze the alignment codon-by-codon. We applied the partial deletion option if <70% of the sequences had a gap. After running HyPhy, we removed invariant codons where dN and dS could not be calculated and examined the remaining codons with significant P-values. Values greater than 0.95 were considered significant evidence of purifying selection. We estimated the average dN-dS values for both conserved domains compared to the remaining codons outside the conserved domains.

Results

Locating homologous sequences with BLAST

From the results of the BLAST search using BHLHE41 from H. sapiens, we downloaded 27 mammalian sequences (including the query) and one reptile sequence as an outgroup for a total alignment of 28 species. The E-values for all sequences were 0.0 and the local identity scores from the BLAST report ranged from 87.50% to 100% (Table 1). The coding sequences ranged in length from 1368 (P. sinensis) to 1569 (Gorilla gorilla gorilla) base pairs. The query coverage from the BLAST report ranged from 38% to 100% (Table 1).

Table 1. Sequences used in creating the multiple sequence alignment and their BLAST scores using the mRNA from the human basic helix-loop-helix family member e41 as the query.

Latin name                  Accession number  Query coverage (%)  Identity (%)
Bos indicus x Bos taurus    XM_027541573      64                  91.97
Callorhinus ursinus         XM_025879601      82                  92.94
Cebus capucinus             XM_017507035      94                  96.60
Cercocebus atys             XM_012093655      46                  98.21
Chlorocebus sabaeus         XM_007967990      99                  97.96
Delphinapterus leucas       XM_022577811      95                  87.54
Homo sapiens¹               NM_030762         100                 100
Gorilla gorilla gorilla     XM_031000846.1    89                  98.92
Lagenorhynchus obliquidens  XM_027129408      89                  87.27
Lipotes vexillifer          XM_007446307      38                  93.63
Macaca fascicularis         XM_005570417      100                 98.25
Macaca mulatta              XM_015151321      49                  98.13
Macaca nemestrina           XM_011759130      100                 98.08
Marmota flaviventris        XM_027934162      52                  92.03
Microcebus murinus          XM_012739537      93                  93.71
Orcinus orca                XM_004270956      51                  89.93
Ovis aries                  XM_015093964      67                  92.73
Panthera pardus             XM_019452268      60                  92.65
Pan troglodytes             XM_520805         49                  99.15
Pelodiscus sinensis²        XM_006127674      49                  88.53
Physeter catodon            XM_024128992      85                  87.50
Piliocolobus tephrosceles   XM_023209042      100                 97.17
Pongo abelii                XM_002823045      48                  98.98
Rousettus aegyptiacus       XM_016119294      92                  91.98
Sus scrofa                  XM_003355541      79                  92.49
Theropithecus gelada        XM_025402281      94                  97.23
Tursiops truncatus          XM_019936346      95                  87.54
Zalophus californianus      XM_027593397      87                  92.89

¹ Query sequence

² Reptile outgroup

Multiple sequence alignment

All sequences in the multiple sequence alignment are complete from start codon (AUG) to stop codon (all use TGA) (S3 Table). Indels ranged from three base pairs to 318 base pairs—always in multiples of three. Of the 1,794 bp nucleotide alignment for mammals, 986 bp were identical (55.0%). The average nucleotide pairwise identity among the mammalian sequences was 92.2%. At the amino acid level, of the 598 residues for mammals, 71.7% were identical. The pairwise percent identity in amino acids was 94.8% (S4 Table).
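
The proportion of identical alignment columns reported here is a simple column-wise count, illustrated by the Python sketch below. The function name and the three-sequence toy alignment are hypothetical. Columns containing a gap are not counted as identical, which is an assumption about how such columns are scored rather than a documented setting.

```python
def identical_column_fraction(alignment):
    """Fraction of alignment columns in which every sequence has the same residue."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    identical = sum(
        1 for column in zip(*alignment)
        if len(set(column)) == 1 and column[0] != "-"
    )
    return identical / length


# Hypothetical toy alignment: only the last column varies.
toy = ["ATGCCG", "ATGCCA", "ATGCCC"]
print(identical_column_fraction(toy))

# The reported figure: 986 identical sites out of 1,794 aligned positions.
print(round(100 * 986 / 1794, 1))
```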

There are no amino acid substitutions in our alignment at either residue previously described to confer alternative sleep behaviors in humans (Y362H and P385R, site numbers refer to human sequence). In our multiple sequence alignment, Y362H is at amino acid alignment position 476 (S4 Table) and nucleotide alignment positions 1423-1425bp (S3 Table). There is also no nucleotide variation for this codon. Alternatively, P385R is at amino acid alignment position 498 (S4 Table) and nucleotide alignment positions 1489-1491bp (S3 Table). Although there are no amino acid substitutions at this residue, there is synonymous variation. All but four sequences have the codon CCG, which codes for proline. The exceptions are synonymous substitutions in Sus scrofa (CCA), Rousettus aegyptiacus (CCC), and P. sinensis (CCC)—all of which still code for proline. However, in the G. gorilla gorilla sequence, both residues 362 and 385 fall within the 195 base pair deletion described above.
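
Because the alignment contains gaps, a residue number in the human protein does not equal its alignment column. The Python sketch below shows one way to convert between the two by counting non-gap characters. The function name and toy sequence are hypothetical and are not taken from the actual alignment.

```python
def alignment_column(aligned_seq, ungapped_pos):
    """Map a 1-based position in the ungapped sequence to its 1-based alignment column."""
    seen = 0
    for column, char in enumerate(aligned_seq, start=1):
        if char != "-":
            seen += 1
            if seen == ungapped_pos:
                return column
    raise ValueError("position lies beyond the end of the sequence")


# Hypothetical aligned protein: three leading gap columns and one internal gap.
aligned = "---MEP-Y"
print(alignment_column(aligned, 1))  # first residue
print(alignment_column(aligned, 4))  # fourth residue, after the internal gap
```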

Sequence length variation in gorilla

There are two large indels in the gorilla sequence (Fig 1; S3 and S4 Tables). The first 318 base pairs are only present in accession XM_031000846.1, a predicted protein from the G. gorilla gorilla genome sequence [32]. Additionally, the sequence for G. gorilla gorilla has a 195 base pair deletion starting at nucleotide alignment site 1,360 and ending at 1,555bp. Both of these indels are multiples of three and therefore maintain the reading frame throughout the coding sequence, yielding a predicted G. gorilla gorilla BHLHE41 amino acid sequence of 522 residues. The average non-gorilla mammalian amino acid sequence is 482aa long.
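
The relationship between the reported coding sequence lengths and protein lengths can be verified with simple arithmetic, as in the Python sketch below. The function name is hypothetical, and the calculation assumes that the reported coding sequence lengths include the stop codon.

```python
def protein_length(cds_bp, includes_stop=True):
    """Number of amino acids encoded by a coding sequence of the given length."""
    if cds_bp % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of three")
    codons = cds_bp // 3
    return codons - 1 if includes_stop else codons


print(protein_length(1569))  # gorilla coding sequence length
print(protein_length(1449))  # human coding sequence length
print(318 // 3, 195 // 3)    # codons in the 5' insertion and the 3' deletion
```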

Fig 1. Multiple sequence alignment of the BHLHE41 mRNA for 27 mammals and one reptile outgroup.

Sequence identity is shown immediately below the consensus (green = 100% identical; gold = 25–99% identical; red < 25% identical). The two amino acid variants known to affect sleep behavior in humans (P385R and Y362H) are indicated with arrows. The alignment shows two unexpected findings in the gorilla sequence: a 318 base pair insertion on the 5’ end and a 195 base pair gap starting at bp 1360.

We searched the Gorilla gorilla gorilla chromosome 12 whole genome shotgun sequence (NC_018436) between bp 58,885,949 and 58,889,015 and found that although the unusual 318bp upstream from the mammalian start codon exists, the gorilla annotation actually identified the correct start codon (no 318bp insertion on the 5’ end). Yet, regarding the 195bp deletion near the 3’ end, we found 224 N’s between exon 5 and exon 6 which likely includes both intron 5 and the missing 195bps of exon 6. In this case, the gorilla annotation is clearly different from the predicted mRNA.
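
Runs of unresolved bases (N) in a genomic sequence, such as the one described above, can be located with a short regular expression search. The Python sketch below uses a hypothetical function name and an invented toy sequence and reports 1-based inclusive coordinates.

```python
import re


def n_runs(sequence, min_length=1):
    """Return (start, end, length) for each run of N bases, using 1-based inclusive coordinates."""
    runs = []
    for match in re.finditer(r"[Nn]+", sequence):
        length = match.end() - match.start()
        if length >= min_length:
            runs.append((match.start() + 1, match.end(), length))
    return runs


# Hypothetical toy sequence with a 224-base run of N between two short flanks.
toy_genomic = "ACGT" + "N" * 224 + "ACGT"
print(n_runs(toy_genomic))
```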

Phylogenetic analyses

The maximum likelihood and Bayesian phylogenetic analyses were highly congruent. There were 20 significantly supported branches in both the maximum likelihood phylogenetic analysis (Fig 2) and the Bayesian phylogenetic analysis (S2 Fig). In both trees, H. sapiens and Pan troglodytes are strongly supported sister species (bootstrap = 97%, posterior probability = 0.99). Additionally, the Great Apes are monophyletic in both phylogenetic analyses. While the two trees support the same evolutionary relationships, they have one minor difference in terms of support values. In the tree generated using Bayesian analysis, two of the Old World monkeys (Cercocebus atys and Theropithecus gelada) are sister species with a strong posterior probability value of 0.99, while in the tree generated using maximum likelihood, this relationship has a bootstrap value of 68%, which is just below the frequently used cut-off for reliability of 70% [33].

Fig 2. Maximum likelihood phylogenetic analysis of mammalian BHLHE41 coding sequence.

We used the GTR+CAT+I parameter settings with 100 bootstrap replicates which are indicated next to the branches. The tree is rooted with the reptilian outgroup, Pelodiscus sinensis.

Molecular evolution and variation around conserved domains

Among the species, there were no significant pairwise dN-dS values in the test for positive selection (all comparisons had p = 1.0). On the other hand, the Z-test for purifying selection revealed 96.8% of the pairwise species comparisons had dN-dS values significantly less than zero (S1 Table). The mean dN-dS value was -6.13 suggesting strong purifying selection.

After removing invariant codons and those with a gap in >70% of the sequences, we found 227 of 343 codons had significantly higher dS than dN values (66.2%), indicating strong purifying selection (S2 Table; Fig 3). The codon for the “short-sleeper” allele (P385R [10]) had a dN-dS value of -3.02 (p < 0.01), consistent with strong purifying selection. When compared to all 343 codons, P385R had the 45th most negative dN-dS value. Another variant known to affect sleep behavior in humans, Y362H [17], exhibited no variation in the codon and therefore no p-value could be calculated (S2 Table).

Fig 3. Codon by codon comparison of dN-dS across the mammalian alignment of BHLHE41.

Positive dN-dS values represent positive selection, negative dN-dS values represent purifying selection, and zero dN-dS values represent neutrality. Codon # comes from the HyPhy output and does not include codons removed because >70% of sequences in the alignment had gaps (e.g., the first 106 amino acids in gorilla). All the codons with significant p-values (red) have dN < dS. Blue points have dN-dS that are not significantly different from zero. There are no codons with significant dN > dS. Conserved domains in the Homo sapiens BHLHE41 protein are indicated with black bars above the graph representing the codon positions for bHLH and the orange domain. Invariant codons are not shown because a p-value could not be calculated.

Although 3D structures are an integral part of determining a protein’s function, there was no known 3D structure for H. sapiens BHLHE41 protein. To confirm that there were no homologous sequences with known 3D structures in other species or sequences with alternative gene annotations, we conducted a BLAST search using the human BHLHE41 sequence and filtered for results with known 3D structures. The best-hit had an E-value of 0.042, which is substantially above the commonly used threshold for homology (<10^-3; [34]), thus we conclude that no 3D structures for BHLHE41 are currently available.

Instead, we compared variation in the two conserved domains from the H. sapiens BHLHE41 protein GenBank flat file: the bHLH domain and the orange domain. There is no variation in the amino acid alignment across the 59 amino acids in the bHLH domain (S4 Table). The average dN-dS value for these codons is -0.891, suggesting purifying selection. All the p-values of variable codons in this region are <0.01 (S4 Table; Fig 3). The orange domain spans amino acids 129–175 in the human coding sequence (amino acids 235–281 in our alignment, S4 Table). There are two variable sites at human amino acid sites S147A (variants appear in pig, whales, dolphins, sheep and cow; dN-dS = 0.50) and P157Q (variants appear in leopard, northern fur seal and California sea lion; dN-dS = -0.141). The average dN-dS for the 47 codons in the orange domain is -0.811 (Fig 3). Of the 27 p-values that could be calculated in the orange domain, all but five are <0.05. In general, there are large stretches of invariant amino acids among the mammalian samples (e.g., residues 148 to 252 of our alignment). Furthermore, there are seven poly-alanine tracts ranging from four to 16 amino acids in length between amino acid alignment positions 407 and 547.

Discussion

Strong purifying selection on BHLHE41 in mammals

Through this study, we explored patterns of molecular evolution in the sleep-related, circadian clock gene BHLHE41 in mammals. Overall, this gene is highly conserved among mammals consistent with its essential function. For example, the bHLH conserved domain shows no amino acid variation among mammals (and even the reptilian outgroup) (amino acid alignment positions 152–210 in S4 Table). Furthermore, the evolutionary history of this gene among mammals is consistent with well-established species-level phylogenetic relationships [19, 20].

In general, the consequences of purifying selection (aka background selection) have been described as “poorly understood” [35]. We know that this type of selection arises when the rate of nonsynonymous substitutions (dN) is substantially lower than the rate of synonymous substitutions (dS) [18]. The difference in substitution rates occurs because most nonsynonymous substitutions in genes under purifying selection are deleterious and are removed from the population in order to preserve the biological function of the protein. Purifying selection explains amino acid sequence conservation across long evolutionary time periods [35]. Purifying selection by definition reduces genetic diversity at both the codons under direct selection and those linked to codons under selection [35]. Genes under purifying selection tend to be essential for biological function, highly expressed, and employed in vital developmental pathways [36] like sleep regulation. The strong footprint of purifying selection that we detected in the sleep related gene, BHLHE41, is consistent with its essential role in sleep regulation [5]. Yet, the expectations under purifying selection lie in stark contrast with observed non-synonymous substitutions recorded in humans [10,17] that originally inspired this study (i.e., the “short sleeper” allele). For example, at least two nonsynonymous substitutions found in humans (P385R and Y362H) are not lethal and in fact confer altered sleep patterns that may even be adaptive under certain circumstances [10,17].

Unexpected length variation in gorilla BHLHE41

Unexpectedly, we found two large indels in the gorilla homologue for BHLHE41. The 318 base pair insertion at the 5’ end of the coding sequence suggests a start codon 106 amino acids upstream from the remaining mammalian start codons. It is noteworthy that the gorilla sequence still contains AUG at the site where the remaining sequences start translation. Additionally, the gorilla sequence contains a 195 base pair deletion near the 3’ end of the coding sequence. This predicted deletion includes both short-sleeper variants previously described (Y362H and P385R)—essential amino acids for proper circadian clock function [10, 14]. Although these indels may reflect novel function of BHLHE41 in gorilla, these animals are not known to have particularly unusual sleep patterns, nor disrupted circadian clocks as would be expected from the addition of 106 amino acids on the 5’ end and the deletion of 65 amino acids from near the 3’ end.

The existence of these indels seems especially unlikely given the widespread pattern of purifying selection on this gene across mammals (~97% of pairwise species comparisons) and across codons (~50% of codons). The 5’ insertion is especially suspicious because it is unique among the 27 mammal sequences investigated and, without it, the sequence aligns perfectly with the rest of the mammalian start codons. Although this insertion does not immediately affect the bHLH conserved domain, such a large insertion within 50 residues of it seems very likely to disrupt protein folding in this region. Without a known 3D structure, the effects of these indels on protein structure, and therefore function, remain unknown.

The gorilla BHLHE41 sequence was produced during whole-genome sequencing and was predicted using an annotation pipeline [32]. There is no literature discussing this unusual gorilla BHLHE41 sequence. Unfortunately, there is no cDNA sequence for this gene from gorilla in Genbank release 233.0 (April 2019). Therefore, we suggest that there may have been an error in the identification of the start codon by the open reading frame search algorithm [37].

An error in the open reading frame detection algorithm may account for the incorrectly identified start codon. It is noteworthy that He et al. [10] suggested only four introns, yet EMBL identified five introns. Furthermore, EMBL indicates this gorilla amino acid sequence is only 419aa long compared to the Genbank accession which measures 522 residues (S3 Fig). Experimental determination of the length of the gorilla BHLHE41 protein by sequencing cDNA or RNA-Seq will be necessary in order to determine the true start codon in gorilla (or start codons if there are multiple isoforms of this gene) and the validity of the 195bp deletion near the 3’ end of the coding sequence.

There is no evidence of alternative splice variants for BHLHE41 in Gorilla according to EMBL (ENSGGOG00000015498; accessed June 13, 2019). Furthermore, although there are 11 paralogues in EMBL, all are less than 37% identical to BHLHE41 indicating significant sequence divergence and unlikely to be mistaken for BHLHE41. If these were paralogous sequences, they would most likely show incongruent relationships with the well-established mammalian phylogeny. The status in the UNIPROTKB database indicates it is still only a predicted protein with an Annotation Score of 2/5 (G3RHJ7_GORGO). It is noteworthy that the EMBL transcript protein sequence contains neither the early start codon, nor the 195bp deletion in the coding sequence (ENSGGOT00000015550.3). However, the Genbank Annotation Release 101 of the Gorilla gorilla gorilla genome (Nov 4 2016) still contains these two large indels. A very recent, new genome sequence of a different gorilla individual (Kamilah, GCA_000151905.3, Aug. 28, 2019) no longer exhibits the 195bp deletion near the 3’ end of the coding sequence. No annotations were available for this genome, but hopefully it eventually includes a start codon that matches the rest of mammals.

The annotation of protein-coding genes is currently based on gene prediction algorithms [37]. Gene prediction algorithms have been through several revolutions since their initial application [38,39]. Majoros et al. [40] evaluated the quality of gene prediction algorithms. Knapp & Chen [41] evaluated gene finders based on hidden Markov models (HMMs) and reported that no significant improvement in the quality of de novo gene prediction methods had occurred during the previous 5 years. Bakke et al. [42] evaluated three second-generation gene annotation systems on the genome of the archaeon Halorhabdus utahensis, from the performance of the gene-prediction models to the functional assignments of genes and pathways. Comparison of gene-calling methods showed that 90% of all three annotations share exact stop sites with the other annotations, but only 48% of identified genes share both start and stop sites [42]. Palleja et al. [43] investigated overlapping CDS in prokaryotic genomes. They compared overlapping genes with their corresponding orthologues and found that more than 900 reported overlaps larger than 60 bp were not real overlaps, but annotation errors. Given that BHLHE41 is just one of the 46,653 coding sequences predicted in gorilla, we are cautious about making any widespread conclusions about the remaining loci.

To avoid annotation mistakes, Armengaud [44] recommends using proteomics in association with translations in all six reading frames. Prasad et al. [38] provide a method that combines transcriptomic and proteomic data to aid genome annotation. However, genes that are expressed only under special conditions or in rarely sampled tissues, or whose expression is below the detection level, pose a challenge even for proteomic and cDNA validation.
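Translation in all six reading frames, as mentioned above, can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard genetic code and an unambiguous DNA sequence; the function names and the short example sequence are hypothetical and are not taken from the gorilla genome. Peptides identified by proteomics could then be searched against each frame, which is not shown here.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna: str) -> dict:
    """Translations of the three forward and three reverse-strand frames."""
    dna = dna.upper()
    rev = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rev[offset:])
    return frames


if __name__ == "__main__":
    # Toy sequence (hypothetical); '*' marks a stop codon
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```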

Conclusions

We sought to determine whether BHLHE41 carries a footprint of positive selection in mammals, given its effect on sleep behaviors. In particular, if adaptive sleep behaviors are conferred by BHLHE41, we predicted that residues 362 and/or 385 would show a history of positive selection. Instead, both sites were invariant across the sampled mammals, consistent with strong purifying selection on the underlying codons. More broadly, the majority of the BHLHE41 coding sequence exhibits a history of purifying selection (especially the conserved domains), indicating that the gene has an essential function for survival and reproduction. The evolutionary history of BHLHE41 is largely congruent with the well-established mammalian phylogeny, indicating that the comparisons are among homologous sequences. From the single sequence we used per species for a limited number of mammals, we found no other species (besides humans) that exhibited the two “short-sleeper” variants [10]. Additional population-level sampling across a broader diversity of mammals would be required to determine whether these variants are truly unique to humans. During our investigation, we discovered an unusually annotated sequence for G. gorilla gorilla. We suggest that the early start codon and the deletion near the 3’ end are annotation errors that warrant experimental verification.

Supporting information

S1 Fig. Verification that the Bayesian MCMC phylogenetic search reached stationarity.

(DOCX)

S2 Fig. Phylogenetic tree of Euteleostomi BHLHE41 mRNA using Bayesian analysis.

(DOCX)

S3 Fig. EMBL structure of the transcript for BHLHE41 from Gorilla gorilla gorilla with conserved domains indicated.

(DOCX)

S1 Table. Pairwise codon-based test of purifying selection for mammalian BHLHE41.

(DOCX)

S2 Table. Codon-by-codon test for selection.

(DOCX)

S3 Table. BHLHE41 mammalian nucleotide alignment with reptile outgroup.

(DOCX)

S4 Table. BHLHE41 mammalian amino acid alignment with reptile outgroup.

(DOCX)

Acknowledgments

The authors thank Santa Clara University’s Department of Biology for providing access to the computer lab running Geneious software, especially Daryn Baker and Steve Hines. We acknowledge the students in BIOL178 who provided valuable feedback during the early stages of this research. Aleezah Salmaan provided valuable editorial advice on an earlier draft of this study.

Data Availability

All relevant data are within the paper and its Supporting Information files.

Funding Statement

The author(s) received no specific funding for this work.

References

  • 1. Allada R, Siegel JM. Unearthing the phylogenetic roots of sleep. Current biology. 2008. August 5;18(15):R670–9. doi: 10.1016/j.cub.2008.06.033
  • 2. Joiner WJ. Unraveling the evolutionary determinants of sleep. Current biology. 2016. October 24;26(20):R1073–87. doi: 10.1016/j.cub.2016.08.068
  • 3. Miyazaki S, Liu CY, Hayashi Y. Sleep in vertebrate and invertebrate animals, and insights into the function and evolution of sleep. Neuroscience research. 2017. May 1;118:3–12. doi: 10.1016/j.neures.2017.04.017
  • 4. Tosches MA, Bucher D, Vopalensky P, Arendt D. Melatonin signaling controls circadian swimming behavior in marine zooplankton. Cell. 2014. September 25;159(1):46–57. doi: 10.1016/j.cell.2014.07.042
  • 5. Hirano A, Hsu PK, Zhang L, Xing L, McMahon T, Yamazaki M, et al. DEC2 modulates orexin expression and regulates sleep. Proceedings of the National Academy of Sciences. 2018. March 27;115(13):3434–9.
  • 6. American Academy of Sleep Medicine. “Sleep Deprivation.” American Academy of Sleep Medicine–Association for Sleep Clinicians and Researchers, AASM, 2008, aasm.org/.
  • 7. Medic G, Wille M, Hemels ME. Short-and long-term health consequences of sleep disruption. Nature and science of sleep. 2017;9:151. doi: 10.2147/NSS.S134864
  • 8. Capellini I, Barton RA, McNamara P, Preston BT, Nunn CL. Phylogenetic analysis of the ecology and evolution of mammalian sleep. Evolution. 2008. July;62(7):1764–76. doi: 10.1111/j.1558-5646.2008.00392.x
  • 9. Albrecht U. Invited review: regulation of mammalian circadian clock genes. Journal of Applied Physiology. 2002. March 1;92(3):1348–55. doi: 10.1152/japplphysiol.00759.2001
  • 10. He Y, Jones CR, Fujiki N, Xu Y, Guo B, Holder JL, et al. The transcriptional repressor DEC2 regulates sleep length in mammals. Science. 2009. August 14;325(5942):866–70. doi: 10.1126/science.1174443
  • 11. Kato Y, Kawamoto T, Fujimoto K, Noshiro M. DEC1/STRA13/SHARP2 and DEC2/SHARP1 coordinate physiological processes, including circadian rhythms in response to environmental stimuli. In Current topics in developmental biology. 2014. January 1 (Vol. 110, pp. 339–372). Academic Press.
  • 12. Ledent V, Paquet O, Vervoort M. Phylogenetic analysis of the human basic helix-loop-helix proteins. Genome biology. 2002. May 1;3(6):research0030–1. doi: 10.1186/gb-2002-3-6-research0030
  • 13. Dawson SR, Turner DL, Weintraub H, Parkhurst SM. Specificity for the hairy/enhancer of split basic helix-loop-helix (bHLH) proteins maps outside the bHLH domain and suggests two separable modes of transcriptional repression. Molecular and cellular biology. 1995. December 1;15(12):6923–31. doi: 10.1128/mcb.15.12.6923
  • 14. Abe T, Ishikawa T, Masuda T, Mizusawa K, Tsukamoto T, Mitani H, et al. Molecular analysis of Dec1 and Dec2 in the peripheral circadian clock of zebrafish photosensitive cells. Biochemical and biophysical research communications. 2006. December 29;351(4):1072–7. doi: 10.1016/j.bbrc.2006.10.172
  • 15. Parasuraman S. Protein data bank. Journal of pharmacology & pharmacotherapeutics. 2012. October;3(4):351.
  • 16. Kadener S, Stoleru D, McDonald M, Nawathean P, Rosbash M. Clockwork Orange is a transcriptional repressor and a new Drosophila circadian pacemaker component. Genes & development. 2007. July 1;21(13):1675–86.
  • 17. Pellegrino R, Kavakli IH, Goel N, Cardinale CJ, Dinges DF, Kuna ST, et al. A novel BHLHE41 variant is associated with short sleep and resistance to sleep deprivation in humans. Sleep. 2014. August 1;37(8):1327–36. doi: 10.5665/sleep.3924
  • 18. Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Molecular biology and evolution. 2007. August 1;24(8):1586–91. doi: 10.1093/molbev/msm088
  • 19. Kemp TS, Kemp TS. The origin and evolution of mammals. Oxford University Press on Demand; 2005.
  • 20. Tarver JE, Dos Reis M, Mirarab S, Moran RJ, Parker S, O’Reilly JE, et al. The interrelationships of placental mammals and the limits of phylogenetic inference. Genome biology and evolution. 2016. February 1;8(2):330–44. doi: 10.1093/gbe/evv261
  • 21. Swiss Institute for Bioinformatics (SIB). ExPASY Translate tool. website: https://web.expasy.org/translate/ [accessed August 30, 2019].
  • 22. Parker, J. Encyclopedia of Genetics, 2001, Page 1375.
  • 23. Sieber P, Platzer M, Schuster S. The definition of open reading frame revisited. Trends in Genetics. 2018. March 1;34(3):167–70. doi: 10.1016/j.tig.2017.12.009
  • 24. Wang Y, Liu J, Huang BO, Xu YM, Li J, Huang LF, et al. Mechanism of alternative splicing and its regulation. Biomedical reports. 2015. March 1;3(2):152–8. doi: 10.3892/br.2014.407
  • 25. Grottke C, Mantwill K, Dietel M, Schadendorf D, Lage H. Identification of differentially expressed genes in human melanoma cells with acquired resistance to various antineoplastic drugs. International journal of cancer. 2000. November 15;88(4):535–46.
  • 26. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research. 1997. September 1;25(17):3389–402. doi: 10.1093/nar/25.17.3389
  • 27. Stamatakis A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics. 2006. November 1;22(21):2688–90. doi: 10.1093/bioinformatics/btl446
  • 28. Huelsenbeck JP, Ronquist F, Nielsen R, Bollback JP. Bayesian inference of phylogeny and its impact on evolutionary biology. Science. 2001. December 14;294(5550):2310–4. doi: 10.1126/science.1065889
  • 29. Kumar S, Stecher G, Tamura K. MEGA7: molecular evolutionary genetics analysis version 7.0 for bigger datasets. Molecular biology and evolution. 2016. March 22;33(7):1870–4. doi: 10.1093/molbev/msw054
  • 30. Nei M, Gojobori T. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Molecular biology and evolution. 1986. September 1;3(5):418–26. doi: 10.1093/oxfordjournals.molbev.a040410
  • 31. Pond SL, Muse SV. HyPhy: hypothesis testing using phylogenies. In Statistical methods in molecular evolution 2005. (pp. 125–181). Springer, New York, NY.
  • 32. Scally A, Dutheil JY, Hillier LW, Jordan GE, Goodhead I, Herrero J, et al. Insights into hominid evolution from the gorilla genome sequence. Nature. 2012. March;483(7388):169–75. doi: 10.1038/nature10842
  • 33. Hillis DM, Bull JJ. An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analysis. Systematic biology. 1993. June 1;42(2):182–92.
  • 34. Butler T, Dick C, Carlson ML, Whittall JB. Transcriptome analysis of a petal anthocyanin polymorphism in the arctic mustard, Parrya nudicaulis. PloS one. 2014;9(7).
  • 35. Cvijović I, Good BH, Desai MM. The effect of strong purifying selection on genetic diversity. Genetics. 2018. August 1;209(4):1235–78. doi: 10.1534/genetics.118.301058
  • 36. Lawrie DS, Messer PW, Hershberg R, Petrov DA. Strong purifying selection at synonymous sites in D. melanogaster. PLoS genetics. 2013. May;9(5).
  • 37. Trimble WL, Keegan KP, D’Souza M, Wilke A, Wilkening J, Gilbert J, et al. Short-read reading-frame predictors are not created equal: sequence error causes loss of signal. BMC bioinformatics. 2012. December 1;13(1):183.
  • 38. Prasad TK, Mohanty AK, Kumar M, Sreenivasamurthy SK, Dey G, Nirujogi RS, et al. Integrating transcriptomic and proteomic data for accurate assembly and annotation of genomes. Genome research. 2017. January 1;27(1):133–44. doi: 10.1101/gr.201368.115
  • 39. Poptsova MS, Gogarten JP. Using comparative genome analysis to identify problems in annotated microbial genomes. Microbiology. 2010. July 1;156(7):1909–17.
  • 40. Majoros WH, Pertea M, Antonescu C, Salzberg SL. GlimmerM, Exonomy and Unveil: three ab initio eukaryotic genefinders. Nucleic acids research. 2003. July 1;31(13):3601–4. doi: 10.1093/nar/gkg527
  • 41. Knapp K, Chen YP. An evaluation of contemporary hidden Markov model genefinders with a predicted exon taxonomy. Nucleic acids research. 2007. January 1;35(1):317–24. doi: 10.1093/nar/gkl1026
  • 42. Bakke P, Carney N, DeLoache W, Gearing M, Ingvorsen K, Lotz M, et al. Evaluation of three automated genome annotations for Halorhabdus utahensis. PloS one. 2009;4(7).
  • 43. Pallejà A, Harrington ED, Bork P. Large gene overlaps in prokaryotic genomes: result of functional constraints or mispredictions?. BMC genomics. 2008. December;9(1):335.
  • 44. Armengaud J. A perfect genome annotation is within reach with the proteomics and genomics alliance. Current opinion in microbiology. 2009. June 1;12(3):292–300. doi: 10.1016/j.mib.2009.03.005

Decision Letter 0

Marc Robinson-Rechavi

17 Dec 2019

PONE-D-19-25843

Unexpected predicted length variation for the coding sequence of the sleep related gene, BHLHE41 in gorilla amidst strong purifying selection across mammals

PLOS ONE

Dear Dr. Whittall,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Reviewer #2 makes a number of very good suggestions, which I invite you to follow. While PLOS One does not have any criteria of significance, it is important that the presentation of the article be consistent with the results, as suggested. It is also important to compare fairly species with or without population data, especially in the discussion. Finally, please do use the new Gorilla genome reference sequence.

We would appreciate receiving your revised manuscript by Jan 31 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Marc Robinson-Rechavi

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements:

  1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at http://www.plosone.org/attachments/PLOSOne_formatting_sample_main_body.pdf and http://www.plosone.org/attachments/PLOSOne_formatting_sample_title_authors_affiliations.pdf

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: In this study, BHLH41 genes related with circadian clock was aligned and compared in 27 mammals. The result shows that the gorilla BHLHE 41 sequence has indels which were not found in other mammals. However, this variation in sequence was not verified. Overall, this study is superficial, no much information can we get from this study. The only interesting point about gorilla BLHLH variation isn’t supported by experimental validation as well.

Reviewer #2: Summary

In this manuscript, the authors investigated the evolutionary history of a circadian clock gene BHLHE41 in mammals. In humans, this gene is known to harbor two variants linked to a short-sleeper behavior. The authors compared homologous nucleotide and protein sequences to the human BHLHE41 sequences for 27 mammals and 1 reptilian outgroup. They performed phylogenetic reconstruction of the gene evolutionary history in these species and showed it was consistent with the species phylogeny. Using sequence alignments for the 27 mammals, they showed that most of the gene was under purifying selection, with no codon under positive selection. Finally, they found a 318bp insertion at the beginning of BHLHE41 sequence and a 195bp deletion near the end of the sequence in Gorilla gorilla gorilla, which they suggest to be errors from the automatic annotation process in this species.

Comments

Overall, I think the analyses performed in this study are technically correct and yielded results that could be of interest to other researchers in the field. However, I have several concerns about the manuscript’s structure and the way results are presented. I also have minor comments about several aspects of the manuscripts, which will be listed after the main concerns.

- To me, the manuscript is not very clear in its current form. This could be addressed by better defining the aims of the study, namely which results is the manuscript focusing on. At the moment, there are two main results presented in the paper: 1) conservation of BHLHE41 among mammals and no sign of selection for short-sleeper variants, and 2) potential annotation errors in Gorilla gorilla gorilla from the automated annotation pipeline. The introduction mostly focuses on analyses related to the first results, while the discussion is almost entirely dedicated to the second result. I personally think that the study of conservation of BHLHE41’s sequence among mammals is of higher interest to the community than the annotation error, which is not present in ENSEMBL’s annotation, and I suggest that the paper be reorganized so that the different sections are more homogeneous and the logical flow is easier to follow overall.

- If the authors would like to keep the discovery of uncharacteristic indels in G. gorilla gorilla as a main result, I suggest dedicating a clear section in the results to the analyses supporting their claims. I think this section should include some of the analyses mentioned in the discussion, e.g. the analysis of G. gorilla gorilla genome and the comparison with the ENSEMBL annotation.

- A new Gorilla gorilla gorilla genome is available since 2019/08/28 and is now the reference genome for this species on NCBI Genbank/RefSeq. This genome is briefly mentioned in the discussion, but the manuscript was not updated accordingly. In particular, accession numbers and genomic positions have changed for any Gorilla sequence, including chr12 in the discussion. The accession number for gorilla BHLHE41 is now XM_031000846.1 (XM_019037881 does not yield any results). Fortunately, it seems the sequence of BHLHE41 has not changed at all, but this should be double-checked. Other analyses involving the Gorilla genome (e.g. discussion) should be updated with this new assembly.

- In the conclusion, page 21, the authors claim “From the available mammalian sequences, it appears that the “short-sleeper” variant is only present in humans”. I do not think you can reach this conclusion without population data from other species; the variants are also absent from the human reference genome.

- Overall, I strongly suggest that the manuscript be revised for language and structure within each section. The manuscript is overall well structured, but the organization within each section is not always easy to follow and lacks a clear logical flow.

In addition to these concerns, I have several minor comments:

- I think the results section of the abstract is too detailed. I would recommend simplifying the summary of results to make the abstract more engaging to the reader.

- There are several issues with formatting (things like e.g. and i.e. should be in italic)

- Some sections of the introduction seem outside the scope of a research article, e.g. explaining dN/dS. In the introduction, I would suggest talking about “estimating selection” rather than “comparing dN and dS”.

- The last paragraph of the introduction is not very clear; I expected a clearer “aims, methods, results” structure to facilitate reading the rest of the manuscript.

Methods:

- Why did you blast the human BHLHE41 sequence instead of using already existing annotation? All the sequences you found were annotated as BHLHE41 already.

- Why did you restrict the comparison to these species? Searching “(BHLHE41) AND "mammals"[porgn:__txid40674]” in the proteins database on NCBI yields 165 results. There may be a good reason for choosing these 28 species but then it should be clearly stated.

Results:

- Querying sequence and BLAST are not results in my opinion. I think they would fit better in the methods section as a description of the sequences used for the comparisons.

- It is not clear to me what the phylogenetic analyses bring to the paper. In the discussion, these results are used to justify that the history of the gene follows that of the species, but this result is not really connected to the rest of the paper. I think it could be moved to supplementary information, or better integrated in the study.

- In the section “Molecular Evolution and Variation around Conserved Domains”, page 16: the sentence “Additionally, a BLAST search revealed there were no sequences that had known protein structures in NCBI’s protein data bank with E-values below 0.042, which is above the commonly used threshold for homology (<10⁻³; [33])” is not clear. Do you mean there were no sequences homologous to that of BHLHE41 with known protein structure?

- The legend for Figure 1 is not very clear.

- When referring to the Gorilla genome (in the discussion at the moment), give the accession number of the assembly (GCA_000151905.3) instead of the bioproject.

- If both names refer to the same domain, I think it would be better to use consistent names, e.g. bHLH vs HLH

- In the discussion, pages 18-19: “We searched the Gorilla gorilla gorilla chromosome 12 whole genome shotgun sequence (NC_018436) between bp 58,885,949 and 58,889,015 and found that although the unusual 318bp upstream from the mammalian start codon exists, the gorilla annotation actually identified the correct start codon (no 318bp insertion on the 5’ end)”. If the gorilla annotation is different from the predicted mRNA, it should be stated clearly.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS One. 2020 Apr 14;15(4):e0223203. doi: 10.1371/journal.pone.0223203.r002

Author response to Decision Letter 0


9 Mar 2020

March 9, 2020

Dear PLOS ONE Editor,

Thank you for sharing the two reviews of our manuscript (PONE-D-19-25843) entitled, “Unexpected predicted length variation for the coding sequence of the sleep related gene, BHLHE41 in gorilla amidst strong purifying selection across mammals.” Below you will find our responses to your comments and both reviewers’ comments in bold font. Page and line numbers refer to those in the WORD version of this revised manuscript.

Editor’s Comments extracted from decision email:

Reviewer #2 makes a number of very good suggestions, which I invite you to follow.

We agree. See our responses below.

While PLOS One does not have any criteria of significance, it is important that the presentation of the article be consistent with the results, as suggested.

We have made substantial changes to re-align the paper with the results spanning the Abstract, Introduction, Results, Discussion and References sections. Our specific changes have been itemized in our response to Reviewer #2 below.

It is also important to compare fairly species with or without population data, especially in the discussion.

We agree that population-level data might be necessary to determine variation within a species. However, we chose to sample broadly across mammals (27 species) to look for species-level changes or above. Unfortunately, due to limited population-level data available for most of these species, we were unable to conduct thorough searches for variants for these diverse mammal species. Therefore, we have qualified our results based on the sampling in the revised Discussion.

Finally, please do use the new Gorilla genome reference sequence.

Done & updated, but this did not change the results.

Reviewer #1: In this study, BHLH41 genes related with circadian clock was aligned and compared in 27 mammals. The result shows that the gorilla BHLHE 41 sequence has indels which were not found in other mammals. However, this variation in sequence was not verified. Overall, this study is superficial, no much information can we get from this study. The only interesting point about gorilla BLHLH variation isn’t supported by experimental validation as well.

We thank the reviewer for identifying the gorilla BHLH variation as an “interesting point”. This finding is now highlighted with a unique subheader within the Results section entitled, “Sequence Length Variation in Gorilla” (P13, L322). With regard to the reviewer’s concern that the study is “superficial” and doesn’t have much to offer, I would point the reviewer to the seven “Criteria for Publication” at PLOS ONE, none of which concern the general interest of the contribution. My understanding of the PLOS ONE philosophy is to publish manuscripts that satisfy the criteria for publication (rigorous and responsible science) and then let the readership determine whether a study is “superficial” and how “interesting” it really is. We note that bioinformatics is an established field that does not require “experimental validation”. In fact, this report will hopefully stimulate such follow-up studies once it is part of the scientific record. Finally, I point to Reviewer #2, who clearly indicated that the “analyses performed in this study are technically correct and yielded results that could be of interest to other researchers in the field”.

Reviewer #2: Summary

In this manuscript, the authors investigated the evolutionary history of a circadian clock gene BHLHE41 in mammals. In humans, this gene is known to harbor two variants linked to a short-sleeper behavior. The authors compared homologous nucleotide and protein sequences to the human BHLHE41 sequences for 27 mammals and 1 reptilian outgroup. They performed phylogenetic reconstruction of the gene evolutionary history in these species and showed it was consistent with the species phylogeny. Using sequence alignments for the 27 mammals, they showed that most of the gene was under purifying selection, with no codon under positive selection. Finally, they found a 318bp insertion at the beginning of BHLHE41 sequence and a 195bp deletion near the end of the sequence in Gorilla gorilla gorilla, which they suggest to be errors from the automatic annotation process in this species.

Comments

Overall, I think the analyses performed in this study are technically correct and yielded results that could be of interest to other researchers in the field. However, I have several concerns about the manuscript’s structure and the way results are presented. I also have minor comments about several aspects of the manuscripts, which will be listed after the main concerns.

We would like to thank Reviewer #2 for taking the time to carefully read, assess and make thoughtful, constructive comments for improving the manuscript. Few Reviewers would go so far as to repeat our initial BLAST to identify that there were more than 27 mammalian sequences available. We appreciate this and have integrated nearly every one of their suggestions.

- To me, the manuscript is not very clear in its current form. This could be addressed by better defining the aims of the study, namely which results the manuscript focuses on.

We have clarified the aims by restructuring the last paragraph of the Introduction (P1, L121-125). It now states, “There are two clear aims of this study…”. Of course, during our investigation we discovered an unexpected length difference, which was not one of the original aims but which we felt deserved reporting and discussion later in the manuscript (see changes made in this regard in our responses below).

At the moment, there are two main results presented in the paper: 1) conservation of BHLHE41 among mammals and no sign of selection for short-sleeper variants, and 2) potential annotation errors in Gorilla gorilla gorilla from the automated annotation pipeline. The introduction mostly focuses on analyses related to the first results, while the discussion is almost entirely dedicated to the second result. I personally think that the study of conservation of BHLHE41’s sequence among mammals is of higher interest to the community than the annotation error, which is not present in ENSEMBL’s annotation…

We thank the Reviewer for suggesting that the sequence evolution is of “higher interest” than the length variation in the Gorilla sequence (although not one of the Criteria for Publication as mentioned in our response to Reviewer #1). We agree and have kept the Introduction focused on the sequence variation and expanded the Discussion of the remarkable sequence conservation across mammals. In fact, in the original draft, there were seven paragraphs discussing the possibility of an annotation error in the gorilla genome, while at the same time there was only one paragraph dedicated to the discussion of purifying selection discovered in the study.

As suggested by Reviewer #2, we have further developed the Discussion by giving it a subheading “Strong Purifying Selection on BHLHE41 in Mammals” and elaborating on our purifying selection discovery across mammals therein. This includes an additional paragraph in the Discussion and additional references that align with our specific results (P18, L532-548).

I suggest that the paper be reorganized so that the different sections are more homogeneous and the logical flow is easier to follow overall.

Agreed. Per Reviewer #2’s comment, we have added two descriptive headers to the Discussion in order to maintain a clear distinction between the two main findings of our study. The new headers are “Strong Purifying Selection on BHLHE41 in Mammals” and “Unexpected Length Variation in Gorilla BHLHE41”.

- If the authors would like to keep the discovery of uncharacteristic indels in G. gorilla gorilla as a main result, I suggest dedicating a clear section in the results to the analyses supporting their claims. I think this section should include some of the analyses mentioned in the discussion, e.g. the analysis of G. gorilla gorilla genome and the comparison with the ENSEMBL annotation.

Great idea! We moved half of the third paragraph in the Discussion regarding length variation in the Gorilla gene and appended it as a second paragraph to a newly dedicated section in the Results called “Sequence Length Variation in Gorilla.” I think this helps structure the Results and the Discussion.

- A new Gorilla gorilla gorilla genome has been available since 2019/08/28 and is now the reference genome for this species on NCBI Genbank/RefSeq. This genome is briefly mentioned in the discussion, but the manuscript was not updated accordingly. In particular, accession numbers and genomic positions have changed for any Gorilla sequence, including chr12 in the discussion. The accession number for gorilla BHLHE41 is now XM_031000846.1 (XM_019037881 does not yield any results). Fortunately, it seems the sequence of BHLHE41 has not changed at all, but this should be double-checked. Other analyses involving the Gorilla genome (e.g. discussion) should be updated with this new assembly.

Thank you for informing us about this update. We have replaced the old IDs with the new ones in Table 1 and in the “Sequence Length Variation in Gorilla” section (P13, L278).

- In the conclusion, page 21, the authors claim “From the available mammalian sequences, it appears that the “short-sleeper” variant is only present in humans”. I do not think you can reach this conclusion without population data from other species; the variants are also absent from the human reference genome.

We have qualified our conclusion in light of our limited sampling (one sequence per species for only 27 mammal species). This conclusion sentence now reads, “From the single sequences we used per species for a limited number of mammals, we found no other species (besides humans) that exhibits the “short-sleeper” variants…” (P22, L642-648). This is followed by the suggestion for future research in the following sentence, “Additional population-level sampling across a broader diversity of mammals would be required to accurately determine if these variants are truly unique to humans.” (P22, L649-650).

- Overall, I strongly suggest that the manuscript be revised for language and structure within each section. The manuscript is overall well structured, but the organization within each section is not always easy to follow and lacks a clear logical flow.

We appreciate the suggestion and made numerous small restructuring edits throughout the manuscript as can be seen from the track changes version.

In addition to these concerns, I have several minor comments:

- I think the results section of the abstract is too detailed. I would recommend simplifying the summary of results to make the abstract more engaging to the reader.

Done.

- There are several issues with formatting (things like e.g. and i.e. should be in italic)

Fixed what we could find.

- Some sections of the introduction seem outside the scope of a research article, e.g. explaining dN/dS. In the introduction, I would suggest talking about “estimating selection” rather than “comparing dN and dS”.

The third-to-last paragraph of the Introduction that the reviewer is referring to starts with a description of selection (as the reviewer recommends). We don’t mention dN-dS until the end of the third sentence and, in that case, only to provide the necessary background for the reader to later interpret our results, which are reported in those terms (e.g., Fig 3 and Table S2). Therefore, we have not changed this specific section. However, we did tighten up the Introduction in several other paragraphs where the background material was superfluous (such as two general sentences in the first paragraph of the Intro; P3, L57-58).

- The last paragraph of the introduction is not very clear; I expected a clearer “aims, methods, results” structure to facilitate reading the rest of the manuscript.

Thank you for pointing this out. We agree that this “signpost” paragraph is essential to cue readers for what’s to come. We now present two clear “Aims” and a brief description of our Methods hinting at the type of Results we will present (P6, L137-146).

Methods:

- Why did you blast the human BHLHE41 sequence instead of using already existing annotation? All the sequences you found were annotated as BHLHE41 already.

Great question. We did this intentionally. Using BLAST ensures we find all sequences with significant similarity regardless of annotation errors. If we had only relied on the annotations, we could accidentally pick up sequences that were mistakenly named BHLHE41 or, more likely, we would have missed sequences that were annotated with a different name (or parts of genomes that remain unannotated). As we mention in the Introduction, this gene is also known as DEC2 (2nd paragraph), so using BLAST casts a wider net since it relies on the sequence content, not the annotation, which can sometimes come with ambiguity or even mistakes.
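The distinction drawn above between retrieving sequences by name and retrieving them by similarity can be illustrated with a toy example. The sketch below is not BLAST and does not reproduce the search used in the study; it assumes pre-aligned, equal-length sequences and uses simple percent identity as a stand-in for a similarity score. The record labels, sequences, function names, and the 80% cutoff are all hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of matching positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def select_by_name(records, gene_name):
    """Annotation-based retrieval: keep records whose label equals gene_name."""
    return [r["id"] for r in records if r["label"] == gene_name]


def select_by_similarity(records, query, min_identity=80.0):
    """Similarity-based retrieval: keep records resembling the query, whatever their label."""
    return [r["id"] for r in records
            if percent_identity(query, r["seq"]) >= min_identity]


# Hypothetical toy data (made-up sequences, not real BHLHE41 records).
QUERY = "ATGGACGAAGGT"
RECORDS = [
    {"id": "rec1", "label": "BHLHE41", "seq": "ATGGACGAAGGT"},      # correct name, similar
    {"id": "rec2", "label": "DEC2", "seq": "ATGGACGAGGGT"},         # synonym, similar
    {"id": "rec3", "label": "BHLHE41", "seq": "TTTCCCAAAGGG"},      # mislabelled, dissimilar
    {"id": "rec4", "label": "unannotated", "seq": "ATGGATGAAGGT"},  # unnamed, similar
]

by_name = select_by_name(RECORDS, "BHLHE41")
by_similarity = select_by_similarity(RECORDS, QUERY)
```

In this toy set, the name-based search returns the mislabelled record and misses both the synonym and the unannotated record, whereas the similarity-based search does the opposite.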

- Why did you restrict the comparison to these species? Searching “(BHLHE41) AND "mammals"[porgn:__txid40674]” in the proteins database on NCBI yields 165 results. There may be a good reason for choosing these 28 species but then it should be clearly stated.

Yes, there are more sequences, but many are partial coding sequences that we chose to exclude so that the sequence evolution analysis would rest on full-length coding sequences. Furthermore, many of those additional sequences are identical sequences from the same species. After a preliminary analysis of all full-length CDS (including multiple sequences for some species), we confirmed that those duplicates were not informative. Therefore, we arbitrarily selected one sequence per species and that is how we arrived at the 27 mammalian sequences included herein.
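The filtering described above (drop partial coding sequences, then keep one sequence per species) can be sketched as follows. This is only an illustration under stated assumptions: the completeness test (start codon, stop codon, length divisible by three) and the "first seen" rule for picking one sequence are choices made here for the example, not criteria stated in the response, and the records are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_full_length_cds(seq: str) -> bool:
    """Crude completeness check (an assumption for this sketch):
    starts with ATG, ends with a stop codon, length divisible by 3."""
    seq = seq.upper()
    return (
        len(seq) >= 6
        and len(seq) % 3 == 0
        and seq.startswith("ATG")
        and seq[-3:] in STOP_CODONS
    )


def one_per_species(records):
    """Keep the first full-length CDS seen for each species, preserving input order."""
    chosen = {}
    for species, accession, seq in records:
        if is_full_length_cds(seq) and species not in chosen:
            chosen[species] = accession
    return chosen


# Hypothetical records: (species, accession, sequence). Accessions are placeholders.
TOY_RECORDS = [
    ("Homo sapiens", "acc1", "ATGGCCTAA"),   # full length
    ("Homo sapiens", "acc2", "ATGGCCTAA"),   # identical duplicate, skipped
    ("Mus musculus", "acc3", "GGCCTAA"),     # partial, skipped
    ("Mus musculus", "acc4", "ATGGCATGA"),   # full length
]

selected = one_per_species(TOY_RECORDS)
```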

Results:

- Querying sequence and BLAST are not results in my opinion. I think they would fit better in the methods section as a description of the sequences used for the comparisons.

Good suggestion. Moved (P7, L167-162).

- It is not clear to me what the phylogenetic analyses bring to the paper. In the discussion, these results are used to justify that the history of the gene follows that of the species, but this result is not really connected to the rest of the paper. I think it could be moved to supplementary information, or better integrated in the study.

Thank you for this feedback. We feel it is essential and have attempted to justify its placement in the manuscript. Briefly, in order to test for selection on the genes and individual codons, we have to compare orthologues (not gene duplicates = paralogues, since they often show different patterns of dN/dS following silencing or neo-functionalization). For an individual gene in mammals, we can assess orthology using phylogenetic analysis since most relationships are well established in this lineage (for example, see Kemp’s The Origin and Evolution of Mammals from 2005).

We have updated the phylogenetic analysis with the following justification in the Methods section entitled, “Phylogenetic Analysis”. It reads, “In order to test for homology and confirm that we were comparing orthologous sequences, we conducted maximum likelihood and Bayesian phylogenetic analyses. If the evolutionary relationships of the BHLHE41 coding sequence reflects the known relationships among mammals, then we can conclude homology and proceed with the tests for selection.” (P8, L205-208).
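The logic of this orthology check, comparing the gene tree against the accepted species relationships, can be sketched with a simple clade comparison. The example below is a toy illustration only: it assumes rooted trees written as nested tuples, ignores branch lengths and support values, and is not the maximum likelihood or Bayesian analysis described in the manuscript. The four-taxon trees and function names are hypothetical.

```python
def clades(tree):
    """Return the set of clades (frozensets of leaf names) in a rooted nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):          # a leaf is a taxon name
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)                 # every internal node defines a clade
        return members

    leaves(tree)
    return found


def conflicting_clades(gene_tree, species_tree):
    """Clades present in the gene tree but absent from the species tree."""
    return clades(gene_tree) - clades(species_tree)


# Toy species tree: human and chimp are sisters, then gorilla, with mouse as outgroup.
SPECIES_TREE = ((("human", "chimp"), "gorilla"), "mouse")

# A gene tree with the same topology (child order does not matter).
CONGRUENT_GENE_TREE = ("mouse", ("gorilla", ("chimp", "human")))

# A gene tree that groups human with mouse, which the species tree does not support.
DISCORDANT_GENE_TREE = ((("human", "mouse"), "gorilla"), "chimp")

no_conflict = conflicting_clades(CONGRUENT_GENE_TREE, SPECIES_TREE)
conflict = conflicting_clades(DISCORDANT_GENE_TREE, SPECIES_TREE)
```

An empty result is consistent with the sequences being orthologues; a non-empty result flags groupings that would need further scrutiny before testing for selection.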

- In the section “Molecular Evolution and Variation around Conserved Domains”, page 16: the sentence “Additionally, a BLAST search revealed there were no sequences that had known protein structures in NCBI’s protein data bank with E-values below 0.042, which is above the commonly used threshold for homology (<10⁻³; [33])” is not clear. Do you mean there were no sequences homologous to that of BHLHE41 with known protein structure?

Yes, there are no homologous sequences with 3D structures. We have clarified this Result.
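The reasoning in this exchange reduces to a threshold comparison: the best (lowest) E-value reported for a sequence with a known structure was 0.042, which does not fall below the 10⁻³ cutoff quoted from the manuscript. A minimal sketch, assuming hits are given as (name, E-value) pairs; the hit names and the second E-value are hypothetical.

```python
HOMOLOGY_EVALUE_THRESHOLD = 1e-3  # cutoff quoted in the manuscript sentence


def homologous_hits(hits, threshold=HOMOLOGY_EVALUE_THRESHOLD):
    """Return names of hits whose E-value falls below the homology threshold."""
    return [name for name, evalue in hits if evalue < threshold]


# Hypothetical hit list; only the 0.042 value comes from the text above.
STRUCTURE_HITS = [
    ("structure_A", 0.042),
    ("structure_B", 1.7),
]

passing = homologous_hits(STRUCTURE_HITS)
```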

- The legend for Figure 1 is not very clear.

Thank you for the feedback. We reordered the description of the figure to start with the explanation of the overall aspects (sequence identity graph and the location of the short sleeper variants as arrows) and then listed the two unexpected findings in Gorilla.

- When referring to the Gorilla genome (in the discussion at the moment), give the accession number of the assembly (GCA_000151905.3) instead of the bioproject.

Done.

- If both names refer to the same domain, I think it would be better to use consistent names, e.g. bHLH vs HLH

Done, we have changed to bHLH throughout.

- In the discussion, pages 18-19: “We searched the Gorilla gorilla gorilla chromosome 12 whole genome shotgun sequence (NC_018436) between bp 58,885,949 and 58,889,015 and found that although the unusual 318bp upstream from the mammalian start codon exists, the gorilla annotation actually identified the correct start codon (no 318bp insertion on the 5’ end)”. If the gorilla annotation is different from the predicted mRNA, it should be stated clearly.

Done.
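For reference, the two indel lengths discussed for gorilla (a 318 bp insertion and a 195 bp deletion) are both multiples of three, so neither would shift the reading frame on its own. The short sketch below checks that arithmetic and the size of the genomic window quoted above; the function and variable names are illustrative only.

```python
def codon_change(indel_bp: int):
    """Return the number of whole codons an indel spans, or None if it shifts the frame."""
    if indel_bp % 3 != 0:
        return None
    return indel_bp // 3


INSERTION_BP = 318   # predicted 5' insertion in gorilla
DELETION_BP = 195    # predicted 3' deletion in gorilla

inserted_codons = codon_change(INSERTION_BP)
deleted_codons = codon_change(DELETION_BP)
net_codons = inserted_codons - deleted_codons

# Inclusive length of the chromosome 12 window quoted in the comment above.
WINDOW_START, WINDOW_END = 58_885_949, 58_889_015
window_bp = WINDOW_END - WINDOW_START + 1
```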

Attachment

Submitted filename: rebuttal letter.200309.pdf

Decision Letter 1

Marc Robinson-Rechavi

26 Mar 2020

Unexpected predicted length variation for the coding sequence of the sleep related gene, BHLHE41 in gorilla amidst strong purifying selection across mammals

PONE-D-19-25843R1

Dear Dr. Whittall,

We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements.

Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication.

Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

With kind regards,

Marc Robinson-Rechavi

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: I would like to thank the authors for considering and implementing my suggestions and for answering in a constructive manner. I also appreciate that the authors have made extra changes following the "spirit" of the comments, and not just answered point by point. I think the manuscript is now suitable for publication.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: Yes: Romain Feron

Acceptance letter

Marc Robinson-Rechavi

30 Mar 2020

PONE-D-19-25843R1

Unexpected predicted length variation for the coding sequence of the sleep related gene, BHLHE41 in gorilla amidst strong purifying selection across mammals

Dear Dr. Whittall:

I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

For any other questions or concerns, please email plosone@plos.org.

Thank you for submitting your work to PLOS ONE.

With kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Prof. Marc Robinson-Rechavi

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Verification that the Bayesian MCMC phylogenetic search reached stationarity.

    (DOCX)

    S2 Fig. Phylogenetic tree of Euteleostomi BHLHE41 mRNA using Bayesian analysis.

    (DOCX)

    S3 Fig. EMBL structure of the transcript for BHLHE41 from Gorilla gorilla gorilla with conserved domains indicated.

    (DOCX)

    S1 Table. Pairwise codon-based test of purifying selection for mammalian BHLHE41.

    (DOCX)

    S2 Table. Codon-by-codon test for selection.

    (DOCX)

    S3 Table. BHLHE41 mammalian nucleotide alignment with reptile outgroup.

    (DOCX)

    S4 Table. BHLHE41 mammalian amino acid alignment with reptile outgroup.

    (DOCX)

    Data Availability Statement

    All relevant data are within the paper and its Supporting Information files.


    Articles from PLoS ONE are provided here courtesy of PLOS
