RNA Biology. 2019 Mar 19;16(6):830–837. doi: 10.1080/15476286.2019.1591035

Using high-resolution annotation of insect mitochondrial DNA to decipher tandem repeats in the control region

Haishuo Ji a,*, Xiaofeng Xu b,*, Xiufeng Jin a, Hong Yin b,c, Jianxun Luo b, Guangyuan Liu b, Qiang Zhao a, Ze Chen b, Wenjun Bu a, Shan Gao a
PMCID: PMC6546396  PMID: 30870076

ABSTRACT

In this study, we used a small RNA sequencing (sRNA-seq) based method to annotate the mitochondrial genome of the insect Erthesina fullo Thunberg at 1 bp resolution. The high-resolution annotations cover both entire strands of the mitochondrial genome without any gaps or overlaps. Most of the new annotations were consistent with the previous annotations, which had been obtained using PacBio full-length transcripts. Two important findings were that animals transcribe both entire strands of mitochondrial genomes and that the tandem repeats in the control region of the E. fullo mitochondrial genome contain the repeated Transcription Initiation Sites (TISs) of the heavy strand. In addition, we found that the copy numbers of tandem repeats showed great diversity within an individual, suggesting that mitochondrial DNA recombination occurs in an individual. In conclusion, the sRNA-seq based method uses 5′ and 3′ end small RNAs to annotate nuclear non-coding and mitochondrial genes at 1 bp resolution, and can be used to identify new steady RNAs, particularly long non-coding RNAs (lncRNAs). The high-resolution annotations of mitochondrial genomes can also be used to study the molecular phylogenetics and evolution of animals or to investigate mitochondrial gene transcription, RNA processing, RNA maturation and several other related topics. The complete mitochondrial genome sequence of E. fullo with the new annotations using the sRNA-seq based method is available at the NCBI GenBank database under the accession number MK374364. We publish our theories, methods, and the high quality sRNA-seq and RNA-seq data (SRA: SRP174926) for extensive use.

KEYWORDS: Mitochondrial DNA, annotation, sRNA-seq, control region, tandem repeat

Introduction

Animal mitochondrial DNA is a small, circular, and extrachromosomal genome, typically approximately 16 kbp in size. One animal mitochondrial genome usually contains 37 genes: two for rRNAs, 13 for mRNAs (coding genes), 22 for tRNAs and at least one control region (CR) with a few possible exceptions (e.g. Asymmetron [1]). The CR is also called the Displacement-loop (D-loop) region, although the control region and the D-loop region are not the same in some species [2]. The CR had not been considered a transcriptional region until 2017, when Gao et al. reported that this region encodes two long non-coding RNAs (lncRNAs) [3]. The annotation of animal mitochondrial genomes usually uses blastx or structure-based covariance models [4]. Although the annotation of mitochondrial mRNAs, tRNAs and rRNAs can be easily performed using web servers (e.g. MITOS [4]), the current methods result in gaps and overlaps in the annotations of mitochondrial genomes and the annotation resolution is still limited. Gao et al. constructed the first quantitative transcription map of animal mitochondrial genomes by sequencing the full-length transcriptome of the insect Erthesina fullo Thunberg [5] on the PacBio platform [6] and established a straightforward and concise methodology to improve genome annotation, resulting in several findings. These findings included the 3′ polyadenylation and possible 5′ m7G caps of rRNAs, the polycistronic transcripts, the antisense transcripts of 37 mitochondrial genes [5], two novel lncRNAs [3] and a novel 31-nt ncRNA [7] in human mitochondrial DNA.

Recently, 5′ and 3′ end small RNAs (5′ and 3′ sRNAs) were reported to exist in all nuclear non-coding and mitochondrial genes [7]. 5′ and 3′ sRNAs can be used to annotate nuclear non-coding and mitochondrial genes at 1 bp resolution and identify new steady RNAs, which are usually transcribed from functional genes. The high-resolution annotation of animal mitochondrial genomes can be used to investigate RNA processing, maturation, degradation and even gene expression regulation [7]. In this study, we used this sRNA-seq based method to annotate the E. fullo mitochondrial genome. New findings were obtained to prove the value of high-resolution annotations of animal mitochondrial genomes and enrich the fundamental knowledge in mitochondrial biology.

Results

Using 5ʹ and 3ʹ sRNAs to annotate the E. fullo mitochondrial genome

The two strands of one mitochondrial genome are differentiated by their nucleotide content. They are a guanine-rich strand referred to as the heavy strand (H-strand) and a cytosine-rich strand referred to as the light strand (L-strand). Using the sRNA-seq based method (Materials and Methods), all genes on both strands of the E. fullo mitochondrial genome have been annotated (Fig. 1) to update the previous version of annotations [5], which were obtained using PacBio full-length transcripts. The complete mitochondrial genome sequence of E. fullo with the new annotations (Table 1) is available at the NCBI GenBank database under the accession number MK374364. Using the new annotations, we found that E. fullo transcribes both entire strands of its mitochondrial genome to produce two primary transcripts, which updated our finding that the E. fullo mitochondrial genome contained five transcriptional regions [5]. The quantitative transcription map of the E. fullo mitochondrial genome exhibited four transcriptional regions (Fig. 1A) due to the rapid degradation of transient RNAs between these four transcriptional regions. To validate this finding, we used high-depth RNA-seq data from a strand-specific library (Materials and Methods) to cover 315× (4,604,253/14,624) of the E. fullo mitochondrial genome. The alignments of the high-depth RNA-seq data supported the finding that E. fullo transcribes both entire strands.

Figure 1.


Annotation of the E. fullo mitochondrial genome at 1 bp resolution.

This reference sequence is available at the NCBI GenBank database under the accession number MK374364. The mitochondrial genome exhibits four transcriptional regions including region 1 (MDL1, ND2, COI, COII, ATP8/6, COIII and ND3), region 2 (ND5 and ND4/4L), region 3 (ND6 and Cytb) and region 4 (ND1, 16S rRNA, 12S rRNA and MDL1AS). However, E. fullo transcribes both entire strands to produce two primary transcripts. (a) Alignments of full-length transcripts on the H-strand in red color are piled along the positive y-axis. Alignments of full-length transcripts on the L-strand in blue color are piled along the negative y-axis. All tRNAs are represented using their amino acids in green color [5]. The TIS of the H-strand (TISH) and the TIS of the L-strand (TISL) are at positions 15,093 and 15,044 bp, respectively. The detailed information of the control region is provided in Figure 3. (b) Alignments of sRNA-seq reads on the H-strand in red color are piled along the positive y-axis. Alignments of sRNA-seq reads on the L-strand in blue color are piled along the negative y-axis. In our previous study [7], we defined a new file format, named ‘5-end format’, to easily identify the 5ʹ ends of mature RNAs. As two examples, the 5ʹ ends of COII and 16S rRNA are annotated at positions 2,975 and 13,742 bp, respectively.

Table 1.

Annotation of the E. fullo mitochondrial genome.

Gene Strand Start End Length
tRNA-Ile J(+) 1 67* 67
tRNA-GlnAS J(+) 68* 134* 67
tRNA-Met J(+) 135 200 66
ND2 J(+) 201 1,182* 982
tRNA-Trp J(+) 1,183 1,249 67
AS1(+) J(+) 1,250* 1,372* 123
COI J(+) 1,373 2,909* 1,537
tRNA-Leu J(+) 2,910 2,974 65
COII J(+) 2,975 3,653 679
tRNA-Lys J(+) 3,654 3,724* 71
tRNA-Asp J(+) 3,725 3,789 65
ATP8/6 J(+) 3,790 4,618* 829
COIII J(+) 4,619 5,410* 792
tRNA-Gly J(+) 5,411 5,475 65
ND3 J(+) 5,476 5,827 352
tRNA-Ala J(+) 5,828 5,896 69
tRNA-Arg J(+) 5,897* 5,964* 68
tRNA-Asn J(+) 5,965 6,030 66
tRNA-Ser J(+) 6,031* 6,100* 70
tRNA-Glu J(+) 6,101 6,166 66
AS2(+) J(+) 6,167* 9,634* 3,468
tRNA-Thr J(+) 9,635 9,698 64
tRNA-ProAS J(+) 9,699* 9,762* 64
ND6 J(+) 9,763 10,240* 478
Cytb J(+) 10,241 11,386* 1,146
tRNA-Ser J(+) 11,387 11,455 69
AS3(+) J(+) 11,456* 16,485* 5,030
MDL1 J(+) 15,093* 16,485* 1,393
tRNA-Gln N(-) 64 132 69
AS1(-) N(-) 133* 1,241* 1,109
tRNA-Cys N(-) 1,242 1,302 61
tRNA-Tyr N(-) 1,303 1,366 64
AS2(-) N(-) 1,367* 6,164* 4,798
tRNA-Phe N(-) 6,165 6,231 67
ND5 N(-) 6,232* 7,949 1,718
tRNA-His N(-) 7,950* 8,019* 70
ND4/4L N(-) 8,020* 9,632* 1,613
tRNA-ThrAS N(-) 9,633* 9,698* 66
tRNA-Pro N(-) 9,699 9,760 62
AS3(-) N(-) 9,761* 11,479* 1,719
ND1 N(-) 11,480 12,400* 921
tRNA-Leu N(-) 12,401 12,465 65
16S rRNA N(-) 12,466 13,742 1,277
tRNA-Val N(-) 13,743 13,810 68
12S rRNA N(-) 13,811 14,624* 814
MDL1AS N(-) 14,625* 15,044* 420
MDL2AS N(-) 14,625* 63* 1,924
TIS(H)   15,093* ATACAAGTATCATAA
TIS(L)   15,044* AATGATTCATCTTAT

This reference sequence is available at the NCBI GenBank database under the accession number MK374364. J(+) and N(-) represent the major and minor coding strands of the mitochondrial genome, respectively. * represents the new annotations using the sRNA-seq based method. TIS(H) and TIS(L) represent the TIS of the H-strand (TISH) and the TIS of the L-strand (TISL), respectively. AS1(+), AS2(+) and AS3(+) represent tRNACysAS/tRNATyrAS, tRNAPheAS/ND5AS/tRNAHisAS/ND4/4LAS and ND1AS/tRNALeuAS/16S rRNAAS/tRNAValAS/12S rRNAAS/CR, respectively. AS1(-), AS2(-) and AS3(-) represent tRNAMetAS/ND2AS/tRNATrpAS, COIAS/tRNALeuAS/COIIAS/tRNALysAS/tRNAAspAS/ATP8/6AS/COIIIAS/tRNAGlyAS/ND3AS/tRNAAlaAS/tRNAArgAS/tRNAAsnAS/tRNASerAS/tRNAGluAS and ND6AS/CytbAS/tRNASerAS, respectively. In total, four short antisense genes (tRNAGlnAS, tRNAProAS, tRNAThrAS and AS1(+)) and five long antisense genes (AS2(+), AS3(+), AS1(-), AS2(-) and AS3(-)) are annotated.
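The claim that the annotations cover each strand without gaps or overlaps can be checked directly from the coordinates in Table 1. The following is a minimal sketch in Python, not the authors' pipeline: the function name `tiles_circle` is hypothetical, the genome length of 16,485 bp is taken from the last coordinate in Table 1, and the nested entries MDL1 and MDL1AS are left out because they lie inside AS3(+) and MDL2AS rather than next to them.

```python
GENOME_LEN = 16_485  # assumed from Table 1: AS3(+) ends at 16,485

# (name, start, end) copied from Table 1; MDL1 is omitted (nested in AS3(+)).
J_PLUS = [
    ("tRNA-Ile", 1, 67), ("tRNA-GlnAS", 68, 134), ("tRNA-Met", 135, 200),
    ("ND2", 201, 1182), ("tRNA-Trp", 1183, 1249), ("AS1(+)", 1250, 1372),
    ("COI", 1373, 2909), ("tRNA-Leu", 2910, 2974), ("COII", 2975, 3653),
    ("tRNA-Lys", 3654, 3724), ("tRNA-Asp", 3725, 3789), ("ATP8/6", 3790, 4618),
    ("COIII", 4619, 5410), ("tRNA-Gly", 5411, 5475), ("ND3", 5476, 5827),
    ("tRNA-Ala", 5828, 5896), ("tRNA-Arg", 5897, 5964), ("tRNA-Asn", 5965, 6030),
    ("tRNA-Ser", 6031, 6100), ("tRNA-Glu", 6101, 6166), ("AS2(+)", 6167, 9634),
    ("tRNA-Thr", 9635, 9698), ("tRNA-ProAS", 9699, 9762), ("ND6", 9763, 10240),
    ("Cytb", 10241, 11386), ("tRNA-Ser", 11387, 11455), ("AS3(+)", 11456, 16485),
]

# MDL1AS is omitted (nested in MDL2AS); MDL2AS wraps past the origin.
N_MINUS = [
    ("tRNA-Gln", 64, 132), ("AS1(-)", 133, 1241), ("tRNA-Cys", 1242, 1302),
    ("tRNA-Tyr", 1303, 1366), ("AS2(-)", 1367, 6164), ("tRNA-Phe", 6165, 6231),
    ("ND5", 6232, 7949), ("tRNA-His", 7950, 8019), ("ND4/4L", 8020, 9632),
    ("tRNA-ThrAS", 9633, 9698), ("tRNA-Pro", 9699, 9760), ("AS3(-)", 9761, 11479),
    ("ND1", 11480, 12400), ("tRNA-Leu", 12401, 12465), ("16S rRNA", 12466, 13742),
    ("tRNA-Val", 13743, 13810), ("12S rRNA", 13811, 14624), ("MDL2AS", 14625, 63),
]


def tiles_circle(features, genome_len):
    """Return True if the features abut exactly and cover the circle once."""
    total = 0
    for i, (_, start, end) in enumerate(features):
        next_start = features[(i + 1) % len(features)][1]
        # The next feature must start one base after this one ends,
        # wrapping from genome_len back to position 1.
        if next_start != end % genome_len + 1:
            return False
        total += (end - start) % genome_len + 1
    return total == genome_len


print(tiles_circle(J_PLUS, GENOME_LEN), tiles_circle(N_MINUS, GENOME_LEN))
```

Both strands pass the check with the coordinates as listed, including the wrap of MDL2AS from 14,625 through the origin to 63.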

The new annotation of the E. fullo mitochondrial genome was confirmed by the ‘mitochondrial cleavage’ model that we proposed in our previous study [3]. This model is based on the fact that RNA cleavage is processed: 1) at 5′ and 3′ ends of tRNAs, 2) between mRNAs and mRNAs except fusion genes (e.g. ATP8/6 and ND4/4L), 3) between antisense tRNAs and mRNAs and 4) between mRNAs and antisense tRNAs; but is not processed: 1) between mRNAs and antisense mRNAs or 2) between antisense RNAs. In particular, this model helps the annotation of long antisense genes (e.g. ND1AS/tRNALeuAS/16S rRNAAS/tRNAValAS/12S rRNAAS/CR in E. fullo) in animal mitochondrial genomes, as all of them are transcribed as transient RNAs and are usually not well covered by aligned reads from sRNA-seq or RNA-seq data due to their rapid degradation.
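The rules of the ‘mitochondrial cleavage’ model listed above can be written as a small lookup. This is only an illustrative sketch of one reading of those rules: the feature-kind labels (`tRNA`, `mRNA`, `tRNA_AS`, `mRNA_AS`, `rRNA`) and the function names are hypothetical, fusion genes such as ATP8/6 are assumed to be entered as a single mRNA feature, and junctions the model does not mention return `None` rather than a guess.

```python
def cleaved_between(left, right):
    """Apply the cleavage rules to two adjacent feature kinds.

    Returns True (cleaved), False (not cleaved) or None (not covered
    by the rules as stated in the text).
    """
    pair = {left, right}
    if "tRNA" in pair:                       # 5' and 3' ends of every tRNA
        return True
    if pair == {"mRNA"}:                     # mRNA next to mRNA
        return True
    if pair == {"tRNA_AS", "mRNA"}:          # antisense tRNA next to mRNA
        return True
    if pair == {"mRNA", "mRNA_AS"}:          # mRNA next to antisense mRNA
        return False
    if all(kind.endswith("_AS") for kind in pair):  # between antisense RNAs
        return False
    return None


def split_transcript(features):
    """Group (name, kind) features into units, cutting only where cleaved."""
    units = [[features[0][0]]]
    for (_, prev_kind), (name, kind) in zip(features, features[1:]):
        if cleaved_between(prev_kind, kind) is True:
            units.append([name])
        else:
            units[-1].append(name)
    return units


# A stretch of the H-strand around AS1(+) from Table 1.
segment = [
    ("tRNA-Trp", "tRNA"),
    ("tRNACysAS", "tRNA_AS"),
    ("tRNATyrAS", "tRNA_AS"),
    ("COI", "mRNA"),
]
print(split_transcript(segment))
```

On this stretch the sketch keeps tRNACysAS/tRNATyrAS together as one unit between tRNA-Trp and COI, which matches the AS1(+) entry in Table 1.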

Annotation of mitochondrial RNAs at 1 bp resolution

The new annotations (GenBank: MK374364) cover both entire strands of the mitochondrial genome without gaps or overlaps at 1 bp resolution. Most of the new annotations are consistent with the annotations of the only public E. fullo mitochondrial genome JQ743673.1 in the NCBI GenBank database. The sequence of our E. fullo mitochondrial genome (GenBank: MK374364) shares an identity of 99.23% (14,498/14,611) with that of JQ743673.1, which covers all the mitochondrial genes except the control region. In total, 9 of 11 mRNA genes (ATP8/6 and ND4/4L are polycistronic transcripts), 1 of 2 rRNA genes and 5 of 22 tRNA genes have been re-annotated. In addition, 11 ncRNA genes (9 antisense genes, MDL1 and MDL1AS) have been annotated for the first time (Table 1). Here, we use JQ743673.1 as a contrast to demonstrate the improvement of the new annotations in MK374364.

Four typical examples (COI, ND6, ND1 and 12S rRNA) are used to show the improved annotations of mRNA and rRNA genes. In the new annotations, five nucleotide residues TCTAA were removed from the annotated coding sequence (CDS) of COI, resulting in the removal of one amino acid residue F. As for ND6, 47 nucleotide residues ATGAACAAACCAATACGAAAAAATCACCCTCTAATTAAAATTATTAA were removed from the CDS, resulting in the removal of 15 amino acid residues YEQTNTKKSPSNQNY. Six nucleotide residues ATAGTA were removed from the CDS of ND1, corresponding to two amino acid residues MV. These examples demonstrated that the mitochondrial coding genes cannot be annotated accurately by their CDS regions based on the analysis of protein codons. The mistakes in the annotation of coding genes are usually attributed to the irregularity in the start and stop codons. For example, the start codons of COI, ATP8, ND6 and ND1 in the E. fullo mitochondrial genome are TGG, and the stop codons of ND2, COI, COII, ND3 and ND6 are completed by polyadenylation of the last nucleotide residues (T or TA). In the new annotations, the first 15-bp sequence GATTTATTATGAAAT was added to obtain the complete 12S rRNA sequence, which was missed in JQ743673.1. This example demonstrated that the mitochondrial rRNA genes and control regions cannot be correctly annotated without our method if they do not have neighbouring mRNAs or tRNAs upstream or downstream.

Three typical examples (tRNAIle, tRNAArg and tRNAHis) are used to show the improved annotations of tRNA genes (Fig. 2). The annotation resolution of mitochondrial tRNAs is limited due to the complexity of tRNA processing. In our previous study, we annotated the mitochondrial tRNAs of human at 1 bp resolution using 5′ and 3′ sRNAs [7]. In particular, we improved the annotation of consecutive tRNAs (e.g. tRNATyr/tRNACys/tRNAAsn/tRNAAla in human) and found a novel 31-nt ncRNA named non-coding mitochondrial RNA 1 (ncMT1) between tRNACys and tRNAAsn. Based on these results, we proposed a mitochondrial tRNA processing model [7]. Using our method, mitochondrial tRNAs are annotated between two cleavage sites and the information of the trimming sites can be derived using the mitochondrial tRNA processing model.

Figure 2.


Corrected annotations of mitochondrial tRNAs.

The new annotations are available at the NCBI GenBank database under the accession number MK374364. Based on our model [7], one mitochondrial tRNA is cleaved from a primary transcript into a precursor, and then the acceptor stem of the precursor is adenylated or trimmed to contain a 1-bp overhang at the 3ʹ end. Finally, CCAs (for most tRNAs) or CAs (e.g. tRNAIle and tRNAHis in E. fullo) are post-transcriptionally added to the 3ʹ ends of tRNAs, one nucleotide at a time. (a) Mitochondrial tRNAs are annotated between two cleavage sites using the sRNA-seq based method, while they are annotated between two trimming sites using other existing methods. Red and green colors with circles are used to indicate nucleic acids which need to be added and removed in the new annotations, respectively. (b) Mature tRNAs are identified using 5ʹ and 3ʹ end sRNAs.

Using pan RNA-seq analysis, we confirmed that nuclear mitochondrial DNA segments (NUMTs) in the human genome are not transcribed into RNAs [3]. This finding simplified the analysis of animal mitochondrial genes (e.g. variation detection or quantification) using transcriptome (RNA-seq) data. As the complete mitochondrial genome of E. fullo (GenBank: MK374364) used the consensus DNA sequence of one individual, we were able to detect single nucleotide substitutions between individuals. Comparing the sequence of MK374364 except the control region with those of 18 other E. fullo individuals, the maximum genetic variation is 101 single nucleotide substitutions in 13 mRNAs, two rRNAs and one tRNA (Table 2). Among 101 single nucleotide substitutions, 65 substitutions in coding genes result in 20 amino acid substitutions. This maximum genetic variation between two individuals is close to the variation between MK374364 and JQ743673.1, which includes 111 single nucleotide substitutions and 2 single nucleotide InDels. This suggested that E. fullo has great diversity between individuals. Among 164,346,320 cleaned RNA-seq reads, 2.8% (4,604,253/164,346,320) were aligned to the E. fullo mitochondrial genome except the control region to obtain a very high average depth of 315× (4,604,253/14,624). These cleaned reads covered 96.89% (15,973/16,485) and 99.37% (16,381/16,485) of the H-strand and L-strand, respectively. For each of the 101 detected variations, the major allele frequency was more than 99.9%, suggesting the minor allele frequency (noise) was less than 0.1%. These results proved that RNA-seq improves the variation detection of mitochondrial genes, ruling out the variations falsely detected in NUMTs by DNA-seq.
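The ratios quoted in this paragraph can be reproduced with a few lines of arithmetic. The sketch below uses only the counts given in the text; the variable names are illustrative. It computes the 315× figure exactly as the text writes it (aligned reads divided by the 14,624 bp outside the control region) and makes no assumption about whether read length enters the authors' definition of depth.

```python
# Counts taken from the text.
cleaned_reads = 164_346_320   # cleaned RNA-seq reads
aligned_reads = 4_604_253     # reads aligned outside the control region
non_cr_length = 14_624        # bp of the genome outside the control region
genome_length = 16_485        # bp of the complete genome (MK374364)
h_covered = 15_973            # bp of the H-strand covered by reads
l_covered = 16_381            # bp of the L-strand covered by reads

fraction_aligned = aligned_reads / cleaned_reads
reads_per_bp = aligned_reads / non_cr_length
h_coverage = h_covered / genome_length
l_coverage = l_covered / genome_length

print(f"aligned fraction: {fraction_aligned:.1%}")
print(f"reads per bp:     {reads_per_bp:.0f}x")
print(f"H-strand covered: {h_coverage:.2%}")
print(f"L-strand covered: {l_coverage:.2%}")
```

The four printed values round to 2.8%, 315×, 96.89% and 99.37%, in agreement with the figures in the paragraph.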

Table 2.

Single nucleotide substitutions between E. fullo individuals.

Position Feature NT(Ref) NT(Change) Codon(Ref) Codon(Change) AA(Ref) AA(Change)
226 ND2 A T TAC TTC Y F
362 ND2 G T GCG GCT A A
503 ND2 A G GCA GCG A A
741 ND2 T C TTA CTA L L
761 ND2 A G GTA GTG V V
798 ND2 T A TTA ATA L M
832 ND2 G A AGT AAT S N
920 ND2 G A GGG GGA G G
923 ND2 G A GGG GGA G G
970 ND2 A T AAT ATT N I
1115 ND2 C T AAC AAT N N
1121 ND2 A T TTA TTT L F
1124 ND2 A G ATA ATG M M
1487 COI C T CTA TTA L L
1630 COI A G GGA GGG G G
1753 COI C T TAC TAT Y Y
1921 COI G A GTG GTA V V
2095 COI A G GTA GTG V V
2233 COI A G GCA GCG A A
2329 COI T C TTT TTC F F
2419 COI T C GGT GGC G G
2507 COI C T CTA TTA L L
2686 COI T C TAT TAC Y Y
2707 COI T C TAT TAC Y Y
2737 COI A G GGA GGG G G
2792 COI C T CTT TTA L L
2794 COI T A CTT TTA L L
3088 COII A G GTA GTG V V
3110 COII C A CTT ATT L I
3187 COII C T ATC ATT I I
3861 ATP8 T C ACT ACC T T
3991 ATP6 A T TAC TTC Y F
4328 ATP6 G A GTG GTA V V
4337 ATP6 G A GGG GGA G G
4350 ATP6 C T CTA TTA L L
4673 COIII C T CTA TTA L L
5060 COIII A G ACT GCT T A
5085 COIII C T TCA TTA S L
5230 COIII T C TTT TTC F F
5482 ND3 C T CTA TTA L L
6309 ND5 G A GGC GGT G G
6311 ND5 C T GGC AGC G S
6396 ND5 A G AGT AGC S S
6404 ND5 G A CGT TGT R C
6584 ND5 A G TTA CTA L L
6617 ND5 T C AGT GGT S G
6635 ND5 A G TTA CTA L L
6768 ND5 G A CGC CGT R R
6940 ND5 T C TAT TGT Y C
7068 ND5 T C TCA TCG S S
7077 ND5 C T GAG GAA E E
7096 ND5 G A GCT GTT A V
7362 ND5 G A GCC GCT A A
7490 ND5 T C ATT GTT I V
7506 ND5 G A GGC GGT G G
7526 ND5 C T GTT ATT V I
7556 ND5 T C ATA GTA M V
7611 ND5 T C TTA TTG L L
7701 ND5 C T TTG TTA L L
7704 ND5 A G GAT GAC D D
7785 ND5 C T TGG TGA W W
7998 tRNA-His C A        
8058 ND4 G A CTA TTA L L
8095 ND4 C T TTG TTA L L
8124 ND4 C T GGT AGT G S
8197 ND4 C T ATG ATA M M
8206 ND4 C T TTG TTA L L
8251 ND4 T C TCA TCG S S
8296 ND4 G T TCC TCA S S
8530 ND4 A G GTT GTC V V
8619 ND4 A C TTA GTA L V
8686 ND4 A G GGT GGC G G
8868 ND4 G A CAT TAT H Y
8922 ND4 A G TTA CTA L L
8926 ND4 A G TAT TAC Y Y
9112 ND4 G A TAC TAT Y Y
9476 ND4L A G TTA CTA L L
9575 ND4L C T GCA ACA A T
9912 ND6 C T TAC TAT Y Y
9920 ND6 C T ACC ATC T I
9957 ND6 T C ATT ATC I I
10276 Cytb T C ATT ATC I I
10363 Cytb C T TGC TGT C C
10708 Cytb T C TAT TAC Y Y
10795 Cytb T C CTT CTC L L
10796 Cytb C T CTA TTA L L
10807 Cytb C T ATC ATT I I
11110 Cytb A G GGA GGG G G
11332 Cytb T C ATT ATC I I
11518 ND1 G A CTA TTA L L
11542 ND1 A G TTA CTA L L
12140 ND1 G A TGC TGT C C
12194 ND1 A T TTT TTA F L
12760 16S rRNA G T        
13051 16S rRNA T A        
13225 16S rRNA T C        
13241 16S rRNA A C        
13245 16S rRNA T C        
13703 16S rRNA G A        
14524 12S rRNA A T        
14532 12S rRNA C T        

This reference sequence (GenBank: MK374364) uses the consensus DNA sequence of one individual collected from Jiangxi province of China. All these single nucleotide substitutions were detected in another individual collected from Tianjin of China. Feature represents gene or CDS, while Ref, NT and AA represent reference, nucleotide and amino acid.
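Rows of Table 2 can be sorted into amino-acid-changing, silent and non-coding substitutions by comparing the two amino acid columns. The following sketch parses a six-row excerpt of the table in its whitespace-separated form; the function names and the category labels are illustrative choices, and the excerpt is too small to reproduce the totals reported in the text.

```python
from collections import Counter

# Six rows copied from Table 2 (coding rows have eight fields,
# tRNA and rRNA rows have no codon or amino acid columns).
ROWS = """\
226 ND2 A T TAC TTC Y F
362 ND2 G T GCG GCT A A
798 ND2 T A TTA ATA L M
9575 ND4L C T GCA ACA A T
7998 tRNA-His C A
12760 16S rRNA G T
"""


def parse_row(line):
    """Split one table row into named fields."""
    tokens = line.split()
    coding = len(tokens) >= 8 and len(tokens[-3]) == 3 and len(tokens[-4]) == 3
    if coding:
        feature = " ".join(tokens[1:-6])
        nt_ref, nt_alt, codon_ref, codon_alt, aa_ref, aa_alt = tokens[-6:]
    else:
        # Feature names such as "16S rRNA" contain a space.
        feature = " ".join(tokens[1:-2])
        nt_ref, nt_alt = tokens[-2:]
        codon_ref = codon_alt = aa_ref = aa_alt = None
    return {
        "position": int(tokens[0]), "feature": feature,
        "nt_ref": nt_ref, "nt_alt": nt_alt,
        "codon_ref": codon_ref, "codon_alt": codon_alt,
        "aa_ref": aa_ref, "aa_alt": aa_alt,
    }


def classify(row):
    """Label a substitution by its effect on the encoded amino acid."""
    if row["aa_ref"] is None:
        return "non-coding"
    return "synonymous" if row["aa_ref"] == row["aa_alt"] else "nonsynonymous"


rows = [parse_row(line) for line in ROWS.splitlines() if line.strip()]
counts = Counter(classify(row) for row in rows)
print(counts)
```

Applied to the full table, the same two functions would give the per-gene counts of amino acid substitutions directly from the published columns.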

The MDL1 and MDL1AS genes

In this study, we propose that any animal mitochondrial genome that contains one control region transcribes both entire strands. However, the situation in animal mitochondrial genomes that contain more than one control region [8] is unknown. One control region contains at least two Transcription Initiation Sites (TISs), which are the TIS of the H-strand (TISH) and the TIS of the L-strand (TISL). The investigation of more than two TISs for one strand (e.g. TISH1 and TISH2 in human [7]) is beyond the scope of this study. One control region is involved in at least four genomic regions (Fig. 3A). The first and second regions, which cover the complete control region, are between the two nearest cleavage sites on the H-strand and L-strand, respectively. In E. fullo, they are ND1AS/tRNALeuAS/16S rRNAAS/tRNAValAS/12S rRNAAS/CR (mtDNA:11456–16485) on the H-strand and CR/tRNAIleAS (mtDNA:14625–63) on the L-strand. The third region (mtDNA:15093–16485) starts at the TISH and ends at the downstream cleavage site on the H-strand. The fourth region (mtDNA:14625–15044) starts at the TISL and ends at the downstream cleavage site on the L-strand. As the first region overlaps the third and the second region overlaps the fourth, we propose these criteria to define four regions as potential lncRNAs for all the animal mitochondrial genomes. On the H-strand, the shorter region (mtDNA:15093–16485) is defined as Mitochondrial D-loop 1 (MDL1) and the longer one (mtDNA:11456–16485) is defined as MDL2. On the L-strand, the shorter region (mtDNA:14625–15044) is defined as Mitochondrial D-loop 1 antisense gene (MDL1AS) and the longer one (mtDNA:14625–63) is defined as MDL2AS (Fig. 3A). In our previous study, the longer region on the H-strand was defined as the human MDL1 (hsa-MDL1), as the shorter one (NC_012920: 561–576) is only 16 bp in length, which is not likely to have biological functions. Thus, hsa-MDL1 (NC_012920: 15956–576) contains tRNAProAS and the control region, and hsa-MDL1AS (NC_012920: 16024–407) starts at the TISL and ends at the end of the control region. As the MDL2AS in E. fullo (eft-MDL2AS) and hsa-MDL2AS (NC_012920: 16024–4328) were identified as transient RNAs, it is not necessary to define them as lncRNAs.
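Because the genome is circular, a region such as eft-MDL2AS (mtDNA:14625–63) has an end coordinate smaller than its start. The sketch below shows one way to compute lengths and nesting for the four E. fullo regions under that convention; the helper names `circular_length` and `contains` are hypothetical, and the genome length of 16,485 bp is assumed from Table 1.

```python
GENOME_LEN = 16_485  # assumed length of MK374364, from Table 1

# Coordinates of the four regions as given in the text (1-based, inclusive).
REGIONS = {
    "MDL1": (15093, 16485),
    "MDL2": (11456, 16485),
    "MDL1AS": (14625, 15044),
    "MDL2AS": (14625, 63),   # wraps past the origin
}


def circular_length(start, end, genome_len=GENOME_LEN):
    """Length of an inclusive interval that may wrap around the origin."""
    return (end - start) % genome_len + 1


def contains(outer, inner, genome_len=GENOME_LEN):
    """True if `inner` lies entirely within `outer` on the circle."""
    offset = (inner[0] - outer[0]) % genome_len
    return offset + circular_length(*inner, genome_len) <= circular_length(
        *outer, genome_len
    )


for name, (start, end) in REGIONS.items():
    print(name, circular_length(start, end))

print(contains(REGIONS["MDL2"], REGIONS["MDL1"]))
print(contains(REGIONS["MDL2AS"], REGIONS["MDL1AS"]))
```

With these coordinates the lengths come out as 1,393, 5,030, 420 and 1,924 bp, matching Table 1, and each shorter region lies inside the longer one on the same strand.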

Figure 3.


Deciphering tandem repeats in the control region.

(a) The control region is involved in eft-MDL1 (mtDNA:15093–16485), eft-MDL2 (mtDNA:11456–16485), eft-MDL1AS (mtDNA: 14625–15044) and eft-MDL2AS (mtDNA: 14625–63). The black and red arrows represent the 83-bp repeat unit and the insert sequence, respectively. (b) eft-MDL1 (mtDNA:15093–16485) is a repeat region which is composed of multiple 83-bp repeat units. Three frequently occurring polymorphic sites in 83-bp repeat units have alleles T/C, ATA/GTAGTA and G/A. The 83-bp repeat units and the insert sequences are indicated by black and red colors, respectively. More 83-bp repeat units are indicated by the thick black lines.

Deciphering tandem repeats in the control region

Many, but not all, control regions contain tandem repeats, which appear to be conserved in position across a range of (mostly mammalian) species and are hypothesized to be involved in the termination of transcription by nature of their complex secondary structure [9]. Although tandem repeats in control regions have been reported in more than 150 species, including chicken, cat, rabbit, pig, sheep, horse, Japanese monkey and human (only gastric cancer [10]), their genetic diversity and biological functions still require extensive research. In the previous study [5], we obtained the tandem repeats in the E. fullo mitochondrial genome using the PacBio full-length transcriptome data and found that eft-MDL1 (mtDNA:15093–16485) was a repeat region which was composed of multiple 83-bp repeat units (noted as R) and several insert sequences (noted as A, B and so on). Most repeat regions detected using E. fullo insects in this study follow the pattern RxARy (Fig. 3B), while a few follow the pattern RxARyBRz, where x, y and z represent copy numbers of R. In addition, we found that TISH was at the 5′ end of the 83-bp repeat unit. In this way, TISs are repeated n times in RxARy (x + y = n). Further study showed three frequently occurring polymorphic sites in the 83-bp repeat unit and that RxARy was further cleaved into two RNAs (Fig. 3B). To validate these polymorphic sites, both the short and long DNA segments in the repeat region (Materials and Methods) were amplified using PCR to obtain the sequences of Rx and Ry using 18 E. fullo adults (Supplementary file 1). Sixteen short and five long segments were successfully sequenced using Sanger sequencing to determine x and y. Using our data, we found that x ranged from 1 to 3 and y ranged from 8 to 13. The total copy numbers n (x + y) in most sequenced segments were more than 10. Another important finding related to tandem repeats was that the copy numbers of tandem repeats show great diversity within an individual. This was also validated by Sanger sequencing of 18 E. fullo adults (Supplementary file 1). These results suggested that mitochondrial DNA recombination [11] occurs in an individual.
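The RxARy and RxARyBRz notation can be made concrete by writing a repeat region as a string of unit symbols and reading the copy numbers back from it. This is a toy sketch only: the symbolic strings stand in for real sequence, the function name `copy_numbers` is hypothetical, and the pattern handles just the two layouts named in the text.

```python
import re

# R = one 83-bp repeat unit, A and B = insert sequences (symbols only).
LAYOUT = re.compile(r"^(R+)A(R+)(?:B(R+))?$")


def copy_numbers(structure):
    """Return ([x, y] or [x, y, z], n) for a layout such as 'RRARRRR'."""
    match = LAYOUT.match(structure)
    if match is None:
        raise ValueError(f"not an RxARy or RxARyBRz layout: {structure!r}")
    counts = [len(group) for group in match.groups() if group is not None]
    # TISH sits at the 5' end of every unit, so n is also the TIS count.
    return counts, sum(counts)


# x = 2 and y = 10 fall inside the ranges reported in the text.
print(copy_numbers("RR" + "A" + "R" * 10))
print(copy_numbers("R" + "A" + "R" * 8 + "B" + "RR"))
```

For the first example the function returns x = 2, y = 10 and n = 12, so under the paper's observation that each unit starts with TISH the region would carry 12 copies of the initiation site.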

Conclusion and discussion

In this study, we demonstrated that the sRNA-seq based method can be used to annotate mitochondrial genomes at 1 bp resolution. This improved method brought new findings, which updated our understanding of the conservation and polymorphism in mitochondrial genomes. We propose that any animal mitochondrial genome which contains one control region transcribes both entire strands. One control region contains at least one TISH and one TISL, which initiate the H-strand and L-strand primary transcripts, respectively. The transcription and cleavage of primary transcripts synchronize to maintain a high efficiency of mitochondrial gene expression. Although all antisense transcripts (Table 1) are produced as ncRNAs, long antisense transcripts (e.g. ND1AS/tRNALeuAS/16S rRNAAS/tRNAValAS/12S rRNAAS/CR in E. fullo) are identified as transient RNAs, which are not likely to perform specific functions. This conclusion does not rule out the possibility that sRNAs degraded from transient RNAs could have detrimental effects on the regulation of gene expression [7]. Short antisense transcripts (e.g. tRNAGlnAS and tRNACysAS/tRNATyrAS in E. fullo) could be steady RNAs. However, their functions are still unknown. Therefore, we concluded that animal mitochondrial genomes containing one control region encode two steady lncRNAs (MDL1 and MDL1AS), while all other reported mitochondrial lncRNAs could be degraded fragments of transient RNAs or random breaks during experimental processing [7].

The mechanisms for the termination of mitochondrial transcription are still unclear, as the sRNA-seq based method cannot be used to determine the Transcription Termination Sites (TTSs) of primary transcripts. Based on our findings and hypotheses, the existence of TTSs is not necessary for the transcription of mitochondrial genes and an uninterrupted transcription could result in the highest efficiency. In our previous study, we were interested in whether RNA polymerase had the ability to read through the TTS after the H-strand primary transcript had been completely synthesized. Surprisingly, we found two long PacBio sequences to support this ‘read through’ model. As cells usually prefer economic and efficient ways of existence, the ‘read through’ model and the possible uninterrupted transcription of mitochondrial genomes merit further research.

As the control regions in the mitochondrial genomes are less conserved than the coding genes in evolution, they are not well investigated, compared to other mitochondrial genes. However, the discovery of repeated TISs suggested that control regions contain DNA elements which are highly conserved in evolution. Our findings proved that the annotations of control regions provide additional information for the study of the molecular phylogenetics and evolution of animals, particularly between or within individuals. Future work needs to be performed to investigate the copy number distribution of tandem repeats in the E. fullo mitochondrial genome using insects from a wide distribution. The diversity of the copy numbers within an individual can also be used to study insect development and aging, as the occurrence rate of mitochondrial DNA recombination increases with the animal survival time. As the copy numbers of tandem repeats detected in this study are more than 10, future work needs to be performed to investigate whether evolution filters out E. fullo insects with fewer repeat units. Our hypothesis is that a large number of repeat units in an individual could prevent the deleterious effects of mutations on TISs, and that E. fullo insects containing a large number of repeat units have an advantage in evolution. However, excessive repeat units reduce the efficiency of mitochondrial gene expression. If mitochondrial DNA recombination only accumulates repeat units, what mechanism exists to restrict the copy numbers of tandem repeats? Another question is whether the accumulation of repeat units has detrimental effects in aging.

Materials and methods

The PacBio full-length transcriptome data was collected from our previous study [5]. The sRNA-seq and RNA-seq data were obtained by sequencing three libraries and one library, respectively. Total RNA was isolated from thoracic muscles of three, one and one E. fullo adults to construct three sRNA-seq libraries, respectively, which were sequenced two times as technical replicates using a 50-bp single-end strategy on the Illumina HiSeq 2500 platform. Total RNA was isolated from thoracic muscles of one E. fullo adult to construct one RNA-seq library with the size of 210 bp, which was sequenced two times as technical replicates using a 150-bp paired-end strategy on the Illumina HiSeq X Ten platform. These sRNA-seq and RNA-seq libraries were constructed following the protocols in our previous studies ([12] and [13]), respectively. Finally, six runs of sRNA-seq data and four runs of RNA-seq data were submitted to the NCBI SRA database under the project accession number SRP174926.

The cleaning and quality control of sRNA-seq and RNA-seq data were performed using the pipeline Fastq_clean [14], which was optimized to clean the raw reads from Illumina platforms. In total, 291,941,849 cleaned sRNA-seq reads and 164,346,320 cleaned RNA-seq reads were used to perform and validate the new annotation of the E. fullo mitochondrial genome, respectively. Using the software bowtie, sRNA-seq and RNA-seq reads were aligned to the E. fullo mitochondrial genome with one mismatch and two mismatches, respectively. Statistics and plotting were conducted using the software R v2.15.3 with the Bioconductor packages [15]. The complete mitochondrial genome sequence of E. fullo with the new annotations using the sRNA-seq based method [7] is available at the NCBI GenBank database under the accession number MK374364.

To investigate the genetic diversity between and within individuals, 18 E. fullo adults were collected from Hebei province and Tianjin of China. Using the forward and reverse primers CTATTCCTAGCTCACATTTAAGTTCG and TGCAGGAGGAGCAATACTTG, the first DNA segment (mtDNA:14779–15176), named the short segment, was amplified to obtain the sequence of Rx. Using the forward and reverse primers CTATTCCTAGCTCACATTTAAGTTCG and CCAGGGTATGAACCTGTTAGC, the second DNA segment (mtDNA:14779–156), named the long segment, was amplified to obtain the sequences of Rx and Ry. The PCR reaction mixture for the short segment was incubated at 94°C for 3 min, followed by 40 PCR cycles (10 s at 94°C, 10 s at 49°C and 20 s at 72°C for each cycle) using Taq PCR Mix (Sino-Novel, China). The PCR reaction mixture for the long segment was incubated at 98°C for 3 min, followed by 40 PCR cycles (10 s at 98°C, 30 s at 52°C and 3 min at 72°C for each cycle) using LA Taq (TaKaRa, China).

Funding Statement

This work was supported by National Key Research and Development Program of China (2016YFC0502304-03) to Defu Chen, National Natural Science Foundation of China (31871992) to Bingjun He and National Natural Science Foundation of China (31471967) to Ze Chen.

Acknowledgments

We appreciate the help equally from the people listed below. They are Professor Defu Chen, Bingjun He, Guoqing Liu, Dawei Huang and the graduate student Siyu Li from College of Life Sciences, Nankai University.

Disclosure statement

No potential conflict of interest was reported by the authors.

Supplementary material

Supplementary data for the article can be accessed here.


References

  • [1] Igawa T, Nozawa M, Suzuki DG, et al. Evolutionary history of the extant amphioxus lineage with shallow-branching diversification. Sci Rep. 2017;7:1157.
  • [2] Boore JL. Animal mitochondrial genomes. Nucleic Acids Res. 1999;27(8):1767–1780.
  • [3] Gao S, Tian X, Chang H, et al. Two novel lncRNAs discovered in human mitochondrial DNA using PacBio full-length transcriptome data. Mitochondrion; 2017.
  • [4] Bernt M, Donath A, Jühling F, et al. MITOS: improved de novo metazoan mitochondrial genome annotation. Mol Phylogen Evol. 2013;69(2):313–319.
  • [5] Gao S, Ren Y, Sun Y, et al. PacBio full-length transcriptome profiling of insect mitochondrial gene expression. RNA Biol. 2016;13(9):820–825.
  • [6] Ren Y, Jiaqing Z, Sun Y, et al. Full-length transcriptome sequencing on PacBio platform (in Chinese). Chinese Sci Bull. 2016;61(11):1250–1254.
  • [7] Xu X, Ji H, Jin X, et al. Using pan RNA-seq analysis to reveal the ubiquitous existence of 5ʹ and 3ʹ end small RNAs. Front Genet. 2019;10:1–11.
  • [8] Kumazawa Y, Ota H, Nishida M, et al. The complete nucleotide sequence of a snake (Dinodon semicarinatus) mitochondrial genome with two identical control regions. Genetics. 1998;150(1):313.
  • [9] Lunt DH, Whipple LE, Hyman BC. Mitochondrial DNA variable number tandem repeats (VNTRs): utility and problems in molecular ecology. Mol Ecol. 2010;7(11):1441–1455.
  • [10] Hung WY, Lin JC, Lee LM, et al. Tandem duplication/triplication correlated with poly-cytosine stretch variation in human mitochondrial DNA D-loop region. Mutagenesis. 2008;23(2):137–142.
  • [11] Rokas A, Ladoukakis E, Zouros E. Animal mitochondrial DNA recombination revisited. Trends Ecol Evol. 2003;18(8):411–417.
  • [12] Chen Z, Sun Y, Yang X, et al. Two featured series of rRNA-derived RNA fragments (rRFs) constitute a novel class of small RNAs. PLoS One. 2017;12(4):e0176458.
  • [13] Xu Y, Gao S, Yang Y, et al. Transcriptome sequencing and whole genome expression profiling of chrysanthemum under dehydration stress. BMC Genomics. 2013;14(1):662.
  • [14] Zhang M, Zhan F, Sun H, et al. Fastq_clean: an optimized pipeline to clean the Illumina sequencing data with quality control. In: Bioinformatics and Biomedicine (BIBM), 2014 IEEE International Conference on; Belfast; 2014. IEEE.
  • [15] Gao S, Ou J, Xiao K. R language and Bioconductor in bioinformatics applications (Chinese Edition). Tianjin: Tianjin Science and Technology Translation Publishing Ltd; 2014.


Articles from RNA Biology are provided here courtesy of Taylor & Francis
