Skip to main content
GigaByte logoLink to GigaByte
. 2021 Oct 8;2021:gigabyte30. doi: 10.46471/gigabyte.30

A chromosome-scale draft genome sequence of horsegram (Macrotyloma uniflorum)

Kenta Shirasawa 1, Rakesh Chahota 2, Hideki Hirakawa 1, Soichiro Nagano 1,4, Hideki Nagasaki 1, Tilak Sharma 3, Sachiko Isobe 1,*
PMCID: PMC9650294  PMID: 36824333

Abstract

Horsegram (Macrotyloma uniflorum [Lam.] Verdc.) is an underutilized warm-season diploid legume (2n = 20, 22). Because of its ability to grow under water-deficient and marginal soil conditions, horsegram is a preferred choice in the era of global climate change. In recognition of its potential as a crop species, we generated and analyzed a draft genome sequence for a horsegram variety, HPK-4. Ten chromosome-scale pseudomolecules were created by aligning Illumina scaffold sequences onto a linkage map. The total length of the ten pseudomolecules was 259.2 Mbp, covering 89% of the total length of the assembled sequences. A total of 36,105 genes were predicted on the assembled sequences. Diversity analysis of 89 horsegram accessions by dd-RAD-Seq identified 277 single nucleotide polymorphisms (SNPs), suggesting narrow genetic diversity among the horsegram accessions. This is the first attempt to generate a draft genome sequence of horsegram and will provide a reference for sequence-based analysis of horsegram germplasm.

Data description

Background

Horsegram (Macrotyloma uniflorum [Lam.] Verdc.) (NCBI:txid271171), is an underutilized warm-season diploid legume (2n = 20, 22). It belongs to the Fabaceae family of the Phaseoleae tribe, and is cultivated mainly in semi-arid regions of the world. On the Indian subcontinent, horsegram is consumed primarily as a food legume, whereas in Africa and Australia it is grown mainly for use as a concentrated animal feed and fodder. This self-pollinating plant is thought to have originated in Africa because most of its 32 wild species exist there [1], and the Northwestern Himalayan region is considered its secondary center of origin [2]. Horsegram may have been domesticated as M. uniflorum var. uniflorum in the southern part of India, but its probable progenitor, M. axillare, has not been reported in India. Therefore, the process by which cultivated horsegram was domesticated from its wild ancestors has not yet been established [3].

Because of its ability to grow under water-deficient and marginal soil conditions, horsegram is a preferred choice in the era of global climate change. Horsegram contains 16.0–30.4% protein [4], and constitutes an important source of dietary protein for the undernourished population in south Asia. In addition, the seeds are a rich source of lysine and vitamins [5], and its antioxidant, antimicrobial, and unique antilithiatic properties make it a food of nutraceutical importance [68]. As a result of horsegram’s medicinal importance and ability to thrive under drought-like conditions, the US National Academy of Sciences has identified this legume as a potential food source for the future [9].

Context

The existence of many wild and unsolicited characteristics makes horsegram a less favorable legume for commercial cultivation, although it does possess numerous attributes that make it a potential food legume for warm arid regions. In addition, there is a lack of genetic and molecular tools with which to genetically enhance horsegram. To elucidate the potential of this food legume species, we generated and analyzed a draft genome sequence for HPK-4, a horsegram cultivar commercially released by CSK Himachal Pradesh Agricultural University (HPAU), Palampur, India. This variety, which has dark-brown seeds, is under cultivation in many parts of the Northwestern Indian Himalayan region. It is resistant to anthracnose (Colletotrichum truncatum) and tolerant to abiotic stresses such as drought, salinity, and heavy metals. This is the first attempt to generate a draft genome sequence of this ‘orphan’, but it is an important food legume species and will provide a reference for sequence-based analysis of horsegram germplasm to elucidate the genetic bases of important traits.

Methods

Whole genome sequencing and assembly of horsegram

The genome sequences of a horsegram variety, HPK-4, bred at CSK-HPAU, were generated from a paired-end (PE) library by Illumina HiSeq 2000 with a total length of 37.9 Gbp (gigabase pairs) [10]. All data analysis for this study was performed on Linux servers running Red Hat Enterprise Linux Server 7.1. Using the Jellyfish v1.1.6 program (RRID: SCR_005491) [11], the genome size of HPK-4 was estimated to be approximately 343.6 Mbp (megabase pairs) (Figure 1). The parameters used in the analysis are listed in Table 1.

Figure 1.

Figure 1.

Genome size estimation using Jellyfish with the distribution of the number of distinct k-mers (k = 17) with the given multiplicity values.

Table 1.

Parameters used in each program.

Program name Parameters or BUSCO data set Comments
Jellyfish v1.1.6 –m 17 –s 1000000000 –t 32 –C
SOAPdenovo2 r223 –K (61 71 81 91) –R –F –p 8
SSPACE v2.0 –x 0 –z 0 –k 3 –a 0.7 –n 15 –T 8 –g 0 –v 1
GapFiller v1.10 –m 30 –o 5 –r 0.7 –n 10 –d 50 –t 10 –g 0 –T 8 –i 1
Platanus v1.2.1 –t 12 –m 300
MaSuRCA v2.3.2 Default Parameters
TruSPAdes v3.6.2 Default Parameters
RepeatMasker v3.2.9 –poly –x –lib
RepeatScout v1.0.5 Default Parameters
BRAKER1 v1.9 Default Parameters
BUSCO v3.0 Embryophyta, odb10
Samtools 0.1.19 samtools mpileup mpileup –d 10000000 –D –u
bcftools 0.1.19 bcftools view –c –g –v
vctools 0.1.12b vcftools_0.1.12b/bin/vcftools –remove-indels –min-alleles 2 –max-alleles 2 –minDP 5 –minQ 214 –max-missing 1 SNP filterfing for 8 F2 WGS
vctools 0.1.12b vcftools_0.1.12b/bin/vcftools –remove-indels –min-alleles 2 –max-alleles 2 –minDP 10 –minQ 50 –max-missing 0.2 SNP filterfing for 214 F2TAS
vctools 0.1.12b vcftools_0.1.12b/bin/vcftools –remove-indels –min-alleles 2 –max-alleles 2 –minDP 5 –minQ 999 –max-missing 0.5 –maf 0.05 SNP filtering for 89 population
JoinMap 4 Kosambi’s mapping function, linkage with rec. frec. Smaller than 0.4 and a LOD lather than 1.0, Goodness-of-fit for removal of loci  = 5.0, Number of added loci after which to perform a ripple  = 1, Third round  = yes

The Illumina PE reads were assembled by SOAPdenovo2 r223 (RRID: SCR_014986) [12] with k-mers of 61 and 81, and contigs were generated with total lengths of 352.2 Mbp (k-mer = 81) and 389.3 Mbp (k-mer = 61) (Table 2). The contigs constructed with k-mer = 81 were selected and scaffolded with mate-pair (MP) reads with insert sizes of 2, 5, 10, and 15 Kbp (kilobase pairs) by using SSPACE v2.0 (RRID: SCR_005056) [13]. The number of generated scaffolds was 6227 after gap filling by GapFiller [14] and excluding contamination. The total length of the scaffolds was 297.1 Mbp (Assembly 1, Table 2), which was approximately 55–92 Mbp shorter in length than the estimated genome size of HPK-4.

Table 2.

Statistics of de novo whole genome assembly.

File name SOAPdenovo Contigs Assembly 1 SOAPdenovo/SSPACE/GapFiller (k-mer  = 81) Assembly 2 Platanus Assembly 3 MaSuRCA Assembly 4 TruSPAdes Assembly 5 GMcloser MUN_r1.1 MUN_r1.11 MUN_r1.11 pseudomolecule
Input reads PE PE PE + MP PE + MP PE + MP PE + MP SLR Assembly 1 + Assembly 4 Assembly 5 MUN_r1.1 MUN_r1.11 + Linkage map
Comments k-mer  = 61 k-mer  = 81 Include contamination Exclude contamination Include contamination Include contamination   Exclude contamination ≥500bp Scaffolds revised
All

Number of sequences 1,534,576 779,101 7,123 6,228 62,323 17,400 374,253 6,228 3,495 3,497 10
Total length (bp) 389,388,347 352,263,669 297,816,217 297,127,168 287,695,252 313,146,882 1,357,659,302 295,740,202 294,688,765 294,688,765 259,245,825
Average length (bp) 254 452 41,811 47,708 4,616 17,997 3,628 47,486 84,317 84,269 25,924,583
Max length (bp) 43,864 86,786 13,495,995 13,495,995 13,114,378 9,397,721 79,948 13,482,853 9,844,273 9,844,273 33,386,276
Min length (bp) 62 82 300 300 100 71 1,500 146 500 500 15,505,026
N50 length (bp) 2,602 6,108 3,571,813 3,571,813 4,221,442 2,147,735 4,120 3,568,883 2,818,555 2,818,555 28,154,654
A 137,634,352 123,511,855 98,936,270 98,713,635 95,878,438 103,574,375 440,896,819 100,065,275 99,718,915 99,718,915 88,615,763
T 128,359,494 116,788,629 98,000,497 97,829,343 96,128,274 103,392,204 440,199,341 99,104,678 98,784,555 98,784,555 88,538,203
G 61,810,612 56,330,394 43,804,355 43,679,038 42,484,816 46,002,912 238,337,484 44,442,744 44,254,557 44,254,557 38,863,122
C 61,583,889 55,632,791 43,617,622 43,524,196 42,490,698 46,133,757 238,219,308 44,268,956 44,083,227 44,083,227 38,986,862
N 0 0 13,457,473 13,380,956 10,713,026 14,043,634 6,350 7,858,549 7,847,511 7,847,511 4,241,875
Total (ATGC, bp) 389,388,347 352,263,669 284,358,744 283,746,212 276,982,226 299,103,248 1,357,652,952 287,881,653 286,841,254 286,841,254 255,003,950
GC% (ATGC) 31.7 0 30.7 30.7 30.7 30.8 35.1 30.8 30.8 30.8 30.5
≥300 bp
Number of sequences 96,834 85,229 7,123 6,228 13,045 17,107 374,253 6,226 - - -
Total length (bases) 255,385,256 270,010,759 297,816,217 297,127,168 281,104,166 313,093,559 1,357,659,302 295,739,758 - - -
Average length (bases) 2,637 3,168 41,811 47,708 21,549 18,302 3,628 47,501 - - -
≥500 bp
Number of sequences 72,097 56,065 3,945 3,468 8,514 13,654 374,253 3,469 3,495 3,497
Total length (bases) 245,981,429 258,994,765 296,598,533 296,074,149 279,298,951 311,725,703 1,357,659,302 294,688,765 294,688,765 294,688,765
Average length (bases) 3,412 4,619 75,183 85,373 32,805 22,830 3,628 84,949 84,317 84,269
≥1 Kbp
Number of sequences 52,710 39,176 1,976 1,842 2,716 9,787 374,253 1,836 1,862 1,864
Total length (bases) 232,331,716 247,369,100 295,266,774 294,976,239 275,198,779 308,936,166 1,357,659,302 293,585,247 293,585,247 293,585,247
Average length (bases) 4,408 6,314 149,427 160,139 101,325 31,566 3,628 159,905 157,672 157,503
≥2 Kbp
Number of sequences 27,007 23,688 1,205 1,186 511 4,079 190,853 1,176 1,202 1,204
Total length (bases) 185,671,228 219,707,018 294,019,523 293,893,601 272,047,316 299,058,101 960,490,176 292,496,469 292,496,469 292,496,469
Average length (bases) 6,875 9,275 244,000 247,802 532,382 73,317 5,033 248,721 243,341 242,937
≥3 Kbp
Number of sequences 15,668 16,505 1,084 1,073 395 3,261 70,396 1,056 1,082 1,084
Total length (bases) 141,491,218 191,540,893 293,553,905 293,455,161 271,611,649 295,920,059 495,770,522 292,032,037 292,032,037 292,032,037
Average length (bases) 9,031 11,605 270,806 273,490 687,624 90,745 7,043 276,545 269,900 269,402

We speculated that the shorter observed length of the total scaffolds may have been caused by misintegration of repeat sequences by SSPACE v2.0. Therefore, we performed subsequent assemblies using two programs, Platanus v1.2.1 (RRID: SCR_015531) [15] and MaSuRCA v2.3.2 (RRID: SCR_010691) [16]. The total ATGC lengths of the scaffolds were not significantly different among the three assemblies: 284.4 Mbp in SOAPdenovo-SSPACE (Assembly 1, before excluding contamination; Table 2), 277.0 Mbp in Platanus (Assembly 2), and 299.1 Mbp in MaSuRCA (Assembly 3).

Meanwhile, an Illumina synthetic long-reads (SLR) library was constructed with high-molecular-weight cellular DNA using a TruSeq synthetic long-read DNA library prep kit (Illumina). Sequences were generated by Illumina HiSeq 2000 and MiSeq systems with read lengths of 93 nt and 251 nt, respectively. The SLR reads were synthesized through the TruSPAdes v3.6.2 pipeline [17]. Among the three assemblies with PE and MP reads, Assembly 1 was used for subsequent analysis, and gaps were closed with Illumina SLRs by GMcloser (RRID: SCR_000646) [18]. Potentially contaminated sequences were excluded using BLASTN searches against the chloroplast and mitochondrial genome sequences of Arabidopsis thaliana (accession numbers NC_000932.1 and NC_001284.2), human genome sequences (hg19 [19]), fungal genome sequences registered with the National Center for Biotechnology Information (NCBI) [20], bacterial genome sequences registered with [21], vector sequences in UniVec [22], and PhiX (NC_001422.1) [23] sequences with E-value cutoffs of 1 × 10−10 and length coverage >10%. The total length of the resultant assembly (Assembly 5) was 295.7 Mbp.

The results of benchmarking universal single-copy ortholog (BUSCO) analysis (RRID: SCR_015008) [24] identified that 93.1% of BUSCOs were found as complete genes in Assembly 5. We therefore considered that Assembly 5 covered most of the coding regions of the horsegram genome. Sequences shorter than 500 bp were excluded from Assembly 5, and the remaining sequences were designated as MUN_r1.1.

Linkage map and pseudomolecule construction

To construct chromosome-scale genome sequences, a SNP linkage map was created with the 214 F2 progenies. SNPs segregating in the F2 population were detected by mapping Illumina re-sequence reads of the eight F2 individuals onto the assembled genome using Bowtie2 (RRID: SCR_016368) [25], and by calling variants using SAMtools 0.1.19 (RRID: SCR_002105) [26] and vcftools 0.1.12 (RRID: SCR_001235) [27]. Target amplicon sequencing (TAS) was performed to genotype the identified SNPs according to the methods described in Shirasawa et al. [28].

The linkage map was constructed using JoinMap 4 with Kosambi’s mapping function (RRID: SCR_009248) [29]. The assembled genome sequence scaffolds were aligned onto the linkage map for pseudomolecule construction. The female parent of the F2 progenies was HPK-4. The male parent was initially considered to be HPKM-193, but this assignment was later found to be wrong when the whole genome sequences of HPK-4, HPKM-193, and the eight F2 progenies were compared. Candidate SNPs segregating in the F2 progenies were identified by mapping the whole genome Illumina sequences of the eight randomly selected F2 progenies onto MUN_r1.1.

A total of 2942 SNPs were identified, and 1378 SNPs were successfully genotyped by TAS analysis in 214 F2 progenies. Of these, 1263 SNPs were mapped onto the ten linkage groups with a total length of 980 cM (Table 3). A total of 219 scaffolds in MUN_r1.1 were then aligned onto the linkage map (Figure 2; Table 3; and in GigaDB [10]). During the process of alignment, two scaffolds were discovered to be misscaffoldings and split. The revised set of scaffolds was designated as MUN_r1.11 (Table 4; Table 2). The number of sequences of MUN_r1.11 was 3,495, with a total length of 294.7 Mbp and an N50 length of 2.8 Mbp. The aligned scaffolds on the linkage map were connected to 10,000 Ns for the construction of chromosome-scale pseudomolecules. The total length of the ten pseudomolecules was 259.2 Mbp, with an N50 length of 28.2 Mbp (Table 4; Table 5). When the total length of the A, G, T, and C bases was compared, the 10 pseudomolecules were found to cover 89% of the scaffolds in MUN_r1.11. The ratios of complete BUSCOs identified in MUN_r1.11 and the 10 scaffolds were 93.1% and 87.4%, respectively. Most of the complete BUSCOs were identified as single copies, suggesting a slow rate of duplication in the coding regions of the assembled genomes.

Table 3.

Statistics of a SNP linkage map and numbers of anchored scaffolds.

  Linkage map Number of anchored scaffolds (MUN_r1.1)
Number of mapped SNPs Length (cM) Mean distance between SNPs (cM) Segregation distortion ratio (%)
Chr1 148 97.2 0.66 4.05 29
Chr2 128 73.1 0.57 74.22 26
Chr3 84 120.3 1.43 3.57 14
Chr4 131 123.1 0.94 14.50 18
Chr5 124 100.8 0.81 3.23 22
Chr6 76 88.9 1.17 14.57 16
Chr7 148 119.1 0.80 12.84 22
Chr8 185 117.1 0.63 2.70 19
Chr9 87 51.7 0.59 81.61 25
Chr10 152 88.8 0.58 3.95 28
Total 1,263 980 0.78   219

Figure 2.

Figure 2.

Anchoring the horsegram genome assembly to the genetic linkage map. The linkage groups (left black bars) and 219 anchored MUN_r1.1 scaffolds (right blue bars) with 1263 SNPs. The crossbars on the linkage groups show the positions of mapped SNPs. Blue, aqua, pink, and red colors represent the numbers of mapped SNPs per cM of 1–5, 6–10, 10–15, and ≧16, respectively.

Table 4.

Statistics on the horsegram genome assembly and CDS.

  MUN_r1.11 MUN_r1.11 MUN_r1.1_cds
Genome/Scaffolds Genome/Pseudomolecules CDS
Number of sequences 3,497 10 36,105
Total length (bp) 294,688,765 259,245,825 38,820,013
Average length (bp) 84,269 25,924,583 1,075
Maximum length (bp) 9,844,273 33,386,276 15,732
Minimum length (bp) 500 15,505,026 150
N50 length (bp) 2,818,555 28,154,654 1,488
Total length of AGTC (bp) 286,841,254 255,003,950
Gaps (bp) 7,847,511 4,241,875 -
GC% 30.8 30.5 43.8
Repeat % 28.99497136 - -
Number of complete genes - - 35,508
Number of partial genes - - 597

Table 5.

Assembly statistics of MUN_r1.11 pseudomolecules.

  MUN_chr01 MUN_chr02 MUN_chr03 MUN_chr04 MUN_chr05 MUN_chr06 MUN_chr07 MUN_chr08 MUN_chr09 MUN_chr10
Total length of sequences (bp) 28,154,654 24,194,727 23,973,329 31,423,847 25,354,798 18,013,159 28,753,260 33,386,276 15,505,026 30,486,749
A 9,658,539 8,292,913 8,231,116 10,813,335 8,677,934 6,151,137 9,805,582 11,440,621 5,247,870 10,296,716
T 9,689,700 8,302,615 8,197,375 10,734,237 8,667,169 6,161,839 9,816,935 11,432,900 5,212,370 10,323,063
G 4,127,487 3,524,028 3,621,541 4,739,623 3,794,983 2,702,465 4,354,257 5,051,796 2,292,810 4,654,132
C 4,128,504 3,558,607 3,614,862 4,779,430 3,830,211 2,738,375 4,343,032 5,030,174 2,307,484 4,656,183
N 550,424 516,564 308,435 357,222 384,501 259,343 433,454 430,785 444,492 556,655
Total (ATGC) 27,604,230 23,678,163 23,664,894 31,066,625 24,970,297 17,753,816 28,319,806 32,955,491 15,060,534 29,930,094
GC% (ATGC) 29.9 29.9 30.6 30.6 30.5 30.6 30.7 30.6 30.5 31.1
Number of anchored scaffolds 29 26 14 18 22 16 22 19 25 28
Total length of scaffolds 27,874,654 23,944,727 23,843,329 31,253,847 25,144,798 17,863,159 28,543,260 33,206,276 15,265,026 30,216,749
Total length of inserted Ns (N10000) 280,000 250,000 130,000 170,000 210,000 150,000 210,000 180,000 240,000 270,000

Repetitive sequences

Repetitive sequences in the assembled genome were identified by RepeatMasker v3.2.9 (RRID: SCR_012954) [30] for known repetitive sequences registered in Repbase (RRID: SCR_021169) [31], and de novo repetitive sequences were defined by RepeatScout v1.0.5 (RRID: SCR_014653) [32]. A total of 50.2 Mbp of repetitive sequences were identified on the assembled genome, occupying 29% of the total length (Table 6). Of the identified repetitive sequences, the sequences registered in Repbase were found on 12.0% of the assembled genome, whereas unique repetitive sequences, i.e., those not registered in Repbase, were located on 17.0% of the assembled genome. Simple sequence repeat (SSR) motifs were identified by MISA mode in SciRoKo software with the default parameters (RRID: SCR_000941) [33]. A total of 74,362 SSRs were identified in MUN_r1.11 with an average frequency of 0.21 SSR per 100 Kbp [10]. The highest SSR frequency, 0.66 SSR per 100 Kbp, was observed in chr06, and this value was almost three times higher than that in chr03 and chr08 (0.22 SSR per 100 Kbp).

Table 6.

Length and ratio of repetitive sequences.

          MUN_r1.11 MUN_chr01 MUN_chr02 MUN_chr03 MUN_chr04 MUN_chr05 MUN_chr06 MUN_chr07 TSUd_chr08 MUN_chr09 MUN_chr10
          Length occupied (bp) % of Whole genomeb) Length occupied (bp) % of Whole line-specific genomeb) Length occupied (bp) % of Whole line-specific genomeb) Length occupied (bp) % of Whole line-specific genomeb) Length occupied (bp) % of Whole line-specific genomeb) Length occupied (bp) % of Whole line-specific genomeb) Length occupied (bp) % of Whole line-specific genomeb) Length occupied (bp) % of Whole line-specific genomeb) Length occupied (bp) % of Whole line-specific genomeb) Length occupied (bp) % of Whole line-specific genomeb) Length occupied (bp) % of Whole line-specific genomeb)
Known repeats in Pepbase Interspersed repeats Class I SINEs 33,194 0.0 1,893 0.0 2,704 0.0 2,330 0.0 3,998 0.0 3,017 0.0 2,162 0.0 5,597 0.0 3,606 0.0 769 0.0 3,640 0.0
LINEs 758,450 0.3 79,135 0.3 60,158 0.2 78,223 0.3 67,284 0.2 61,531 0.2 35,312 0.2 76,948 0.3 95,328 0.3 32,464 0.2 88,614 0.3
LTR elements Total 18,946,454 6.4 2,502,167 8.9 1,905,724 7.9 1,430,860 6.0 1,626,139 5.2 1,474,156 5.8 938,839 5.2 1,864,330 6.5 2,080,923 6.2 1,268,235 8.2 1,796,168 5.9
Copia 12,243,368 4.2 1,611,136 5.7 1,257,513 5.2 902,900 3.8 1,167,096 3.7 949,455 3.7 658,107 3.7 1,176,391 4.1 1,417,106 4.2 740,132 4.8 1,173,567 3.8
Gypsy 6,273,336 2.1 820,553 2.9 601,000 2.5 500,201 2.1 417,277 1.3 496,833 2.0 254,838 1.4 654,972 2.3 613,122 1.8 491,823 3.2 594,170 1.9
Class II DNA elements 3,100,519 1.1 380,164 1.4 280,403 1.2 227,524 0.9 294,636 0.9 206,575 0.8 121,303 0.7 294,742 1.0 307,336 0.9 198,994 1.3 289,814 1.0
Unclassified 420 0.0 0 0.0 77 0.0 0 0.0 0 0.0 0 0.0 0 0.0 0 0.0 243 0.0 0 0.0 63 0.0
Helitrons 182,492 0.1 24,503 0.1 51,573 0.2 15,763 0.1 8,828 0.0 12,915 0.1 16,031 0.1 8,133 0.0 17,183 0.1 8,945 0.1 13,311 0.0
Low complexitya) 3,615,499 1.2 222,186 0.8 208,479 0.9 203,104 0.8 238,959 0.8 194,268 0.8 134,128 0.7 269,659 0.9 257,639 0.8 119,470 0.8 256,405 0.8
Simple repeat 8,038,181 2.7 744,057 2.6 655,110 2.7 581,825 2.4 793,431 2.5 634,665 2.5 493,067 2.7 687,667 2.4 783,434 2.3 367,990 2.4 753,700 2.5
Unknown 13,270 0.0 366 0.0 993 0.0 642 0.0 1,053 0.0 520 0.0 1,094 0.0 1,491 0.0 884 0.0 859 0.0 2,820 0.0
Subtotal 35,247,906 12.0 4,044,950 14.4 3,229,413 13.3 2,581,448 10.8 3,088,969 9.8 2,627,752 10.4 1,775,449 9.9 3,250,692 11.3 3,614,610 10.8 2,038,420 13.1 3,241,129 10.6
Unique repeats Unknown 49,554,792 16.8 6,234,995 22.1 4,672,589 19.3 3,435,357 14.3 4,094,114 13.0 3,347,931 13.2 2,243,739 12.5 4,226,300 14.7 5,021,333 15.0 3,433,883 22.1 5,105,098 16.7
Simple repeat 642,225 0.2 67,981 0.2 58,761 0.2 47,815 0.2 66,279 0.2 51,673 0.2 40,270 0.2 61,328 0.2 65,386 0.2 31,004 0.2 61,449 0.2
Subtotal 50,197,017 17.0 6,302,976 22.4 4,731,350 19.6 3,483,172 14.5 4,160,393 13.2 3,399,604 13.4 2,284,009 12.7 4,287,628 14.9 5,086,719 15.2 3,464,887 22.3 5,166,547 16.9
Total         85,444,923 29.0 10,347,926 36.8 7,960,763 32.9 6,064,620 25.3 7,249,362 23.1 6,027,356 23.8 4,059,458 22.5 7,538,320 26.2 8,701,329 26.1 5,503,307 35.5 8,407,676 27.6

aPrimarily poly-purine/poly-pyrimidine stretches, or regions of extremely high AT or GC content. Stretches of DNA (100 bp) were masked when they were >87% AT or >89% GC, and 30 bp stretches were masked when they contained 29 A/T (or GC) nucleotides.bN bases were excluded from the calculation.

Transcript sequencing, gene prediction, and annotation

Total RNA of HPK-4 was extracted from seedlings, leaves, roots, flowers, and young pods using the RNeasy Plant Mini Kit (QIAGEN). RNA libraries were constructed by using a TruSeq standard mRNA HT sample prep kit (Illumina). Library sequencing was performed by an Illumina HiSeq system with a read length of 93 nt. Assembly was performed by Trinity [34]. A total of 485 million transcript Illumina reads were obtained from seedlings, leaves, roots, flowers, and young pods of HPK-4 (Figure 3) [10].

Figure 3.

Figure 3.

Plant materials used for transcript sequences. The seedlings (A), leaves (B), roots (C), flowers (D), and young pods (E) of HPK-4 used for Illumina transcript sequencing.

Ab initio gene prediction was performed by BRAKER1 v1.9 (RRID: SCR_018964) [35] with the obtained transcript sequences. Transposable elements (TEs) were detected by BLASTP searches against the NCBI NR protein database [36] with an E-value cutoff of 1 × 10−10. Domain search was performed by InterProScan against the InterPro database with an E-value cutoff of 1.0 (RRID: SCR_005829) [37].

A total of 46,095 gene sequences were predicted on the assembled genome with a total length of 48.3 Mbp (Table 7). After removal of TEs and both pseudo and short gene sequences, 36,105 gene sequences remained, and this set of sequences was designated as MUN_r1.1_cds (Table 4). The ratio of complete BUSCOs identified on MUN_r1.1_cds was 91.2%. Of the 36,105 sequences, 35,508 were classified as complete genes and 597 as partial. The coding sequences (CDSs) were further tagged with “f” (full similarity), “p” (partial similarity), and “d” (domain) according to the similarity level against the non-redundant database (f: E-values ≤1 × 10−20 and identity ≥70%; p: E-values ≤1 × 10−20 and identity <70%) and the InterPro database (d: E-values ≤1.0; Table 8). Of the 36,105 sequences, 21,471 (59.4%) were tagged with “f” and 6,692 (18.5%) with “p”. The number of gene sequences tagged with “d” was 24,575 (68.1%).

Table 7.

Statistics of candidate genes predicted by BRAKER1 v1.9.

  All predicted genes MUN_r1.1_cds
Exclude TE, pseudo and short genes
Number of sequences 46,095 36,105
Total length (bp) 48,277,179 38,820,013
Average length (bp) 1,047 1,075
Max length (bp) 15,732 15,732
Min length (bp) 60 150
N50 length (bp) 1,443 1,488
GC% 43.3 43.8

Table 8.

Number of CDSs showing significant similarity by BLASTP and domain searches against NCBI NR and InterPro.

Number of CDSs % to Total Classification Tag All predicted genes MUN_r1.1_cds
All predicted genes MIN_r1.1_cds Similarity against NR Domain
19,874 43 55 Complete f d Included Included
1554 3 4 Complete f - Included Included
3574 8 10 Complete p d Included Included
2996 6 8 Complete p - Included Included
1052 2 3 Complete - d Included Included
6458 14 18 Complete - - Included Included
26 0 0 Partial f d Included Included
17 0 0 Partial f - Included Included
49 0 0 Partial p d Included Included
73 0 0 Partial p - Included Included
124 0 0 Partial - d Included Included
308 1 1 Partial - - Included Included
126 0.3 - Pseudo   Included Not included
107 0.2 - Short   Included Not included
9757 21.2 - TE   Included Not included

BLASTP Search against NCBI NR database f: E-value ≦1 × 10−20 and similarity ≧70% p: E-value ≦1 × 10−20 and similarity <70% InterProScan against the InterPro database d: E-value ≦1.0

Transfer RNA genes were predicted using tRNAscan-SE ver. 1.23 with the default parameters [38], and compared with the numbers on the genomes of Phaseolus vulgaris (Pvulgaris_218_v1.0, 681) [39], Vigna angularis (Vangularis_v1.a1) [40], Lotus japonicus (Lj3.0) [41], and A. thaliana (Araport11 [42]). Total number of putative tRNA genes in the assembled genomes (MUN_r1.11) was 690, almost the same as the numbers for the genomes of P. vulgaris (681), V. angularis (667), and A. thaliana (699, Table 9). rRNA genes were predicted by BLAST searches (E-value cutoff of 1 × 10−10) with query sequences of A. thaliana 5.8S and 25S rRNAs (X52320.1) and 18S rRNA (X16077.1). The total number of putative rRNA genes identified in the genome was 139, which was again the same as the number in the P. vulgaris genome.

Table 9.

Numbers of putative tRNA and rRNA encoding genes identified in MUN_r1.1 and other legume species.

tRNA
Encode M. uniflorum (MUN_r1.11) P. vulgaris (Pvulgaris_218_v1.0) V. angularis (Vangularis_v1.genome) L. japonicus (Lj3.0_pseudomol) A. thaliana (TAIR10_genome)
Ala 40 41 44 40 33
Arg 34 43 52 54 39
Asn 17 22 30 28 19
Asp 27 30 32 32 28
Cys 12 16 20 76 17
Gln 21 21 24 23 19
Glu 31 28 32 40 27
Gly 46 51 44 48 43
His 13 15 14 19 12
Ile 101 30 32 29 25
Leu 49 56 53 57 45
Lys 34 43 35 41 33
Met 38 39 43 48 31
Phe 21 28 20 22 17
Pro 43 48 36 46 68
Ser 50 50 51 44 72
Thr 690 28 24 36 26
Trp 15 16 17 21 16
Tyr 15 17 18 22 83
Val 36 41 37 34 32
Subtotal 669 663 658 760 685
Subtotal (%) 97.0 97.4 98.7 88.6 98.0
Pseudo 19 12 7 80 13
SeC 0 1 0 12 0
Sup 0 1 0 0 0
Undet 2 4 2 6 1
Total 690 681 667 858 699
rRNA
Encode gene M. uniflorum (MUN_r1.1) P. vulgaris (Pvulgaris_218_v1.0) V. angularis (Vangularis_v1.genome) L. japonicus (Lj3.0_pseudomol) A. thaliana (TAIR10_genome)
18S 40 48 224 27 4
25S 87 83 421 64 5
5.8S 12 8 139 6 2
Total 139 139 784 97 11

aPseudo, SeC, Sup, and Undet represent pseudogenes, selenocysteine tRNAs, possible suppressor tRNAs, and tRNAs with undetermined/unknown isotypes, respectively.bAccession numbers used for identification or 5.8S, 18S, and 25S rRNAs were X52320.1, X16077.1, and X52320.1, respectively.

Diversity analysis in genetic resources

Only two species in the genus Macrotyloma, i.e., horsegram and M. geocarpum, are used as crops. It was speculated that horsegram domestication occurred in India twice: once in northwestern India at 4000 years before present, and once on the Indian Peninsula at 3500 years before present [43]. In addition, horsegram has narrow genetic diversity, as revealed by molecular analysis [44]. The genetic diversity of 91 cultivated horsegram accessions and one M. axellare accession, a wild relative of horsegram that is maintained at CSK-HPAU, were investigated based on dd-RAD-Seq analysis [10]. Library construction and variant calling were performed according to Shirasawa et al. [45]. The ddRAD-Seq reads were generated by an Illumina HiSeq 2000 system with a read length of 93 nt and mapped onto the assembled genome sequences. The two accessions, IC139449 and IC547543, were excluded from further analysis because of the small number of obtained reads. The mapped ratio of the reads onto the genome (MUN_r1.11) ranged from 80–90% in most of the accessions (Figure 4). However, M. axellare and one horsegram accession (IC313367) showed low mapping ratios of 17% and 55%, respectively. M. axellare was excluded from further analysis because of its low mapping ratio.

Figure 4.

Figure 4.

Mapped ratios of the dd-RAD-Seq reads of 92 accessions.

A total of 277 SNPs were identified in the remaining 89 accessions across the genome [10]. The Jaccard similarity coefficients of the 277 SNPs were calculated using GGT 2.0 [46], and a neighbor-joining (NJ) phylogenetic tree was constructed using MEGA ver 10.1.8 (RRID: SCR_000667) [47]. The NJ tree classified the 89 accessions into two clusters (Figure 5). Cluster 1 included varieties bred in the CSK-HPAU, which are prefixed with “HPK”. Most of the HPK varieties showed very close genetic relations and formed a subcluster (HPK cluster); the single exception was HPK-4.

Figure 5.

Figure 5.

A phylogenetic tree of the 89 horsegram accessions based on 277 SNPs. HPK-4, used in the reference genome construction; HPK-4, used in the reference genome construction, and HPKM-193; the obtained whole genome sequences of the accessions are circled with blue and green lines, respectively.

Whole genome structure in horsegram

Figure 6 shows a graphical view of the horsegram genome structure with a graph drawn by Circos (Figure 6; RRID: SCR_011798) [48]. Repetitive sequences were frequently observed in the midsection of each chromosome, and the tendency was more pronounced in horsegram-specific sequences (Figure 6A). The ratio of repetitive sequences commonly observed in all five species was quite low, suggesting the uniqueness of repetitive sequences compared to the gene sequences. The gene sequences commonly observed between horsegram and the other compared species tended to be distributed to the two end regions of the chromosomes (Figure 6B). On the other hand, horsegram-specific gene sequences were distributed more uniformly across the genome, suggesting the unique structure of the horsegram genome.

Figure 6.

Figure 6.

Graphical view of the horsegram genome structure. (A) Ratios of repetitive sequences in 1-Mbp windows. Blue bars represent horsegram-specific sequences. Green bars show sequences commonly observed in horsegram and three other legume species: P. vulgaris, V angularis, and L. japonicus. Red bars show sequences commonly observed in the four legume species and A. thaliana. (B) Numbers of the predicted horsegram genes (MUN_r1.1_cds) in 1-Mbp windows. The bar colors are the same as in (A). (C) CNV distribution in a 1-Mbp window. Green and red dots show log2 ratio plus and minus values, respectively. (D) Pi values and positions of the 255 SNPs identified in the 89 horsegram accessions by using dd-RAD-Seq. The distance between horizontal lines represents a Pi value of 0.1. (E) SNP density identified among the F2 mapping population based on whole genome re-sequencing of the eight F2 progenies. Blue, pale blue, pale pink, and pink indicate numbers of SNPs ≦50, ≦100, ≦150, and 150 < in 1-Mbp windows, respectively.

Copy number variations (CNVs) of one horsegram accession, HPKM-193, were detected against the HPK-4 genome (Figure 6C) based on the whole genome sequence reads of HPKM-193 using CNV-Seq (RRID: SCR_013357) [49] with a 1-Mbp window. CNVs with a minus log2 ratio were particularly observed on chr09 and chr02.

Of the 277 SNPs identified among the 89 horsegram accessions, 255 were located across the genome sequences of 10 chromosomes (Figure 6D). In each chromosome, SNPs were mostly identified in the regions where common putative genes of horsegram and the other compared species were located, particularly for chr04, chr07, chr08, and chr10. The differing trends in variable distribution is thought to reflect the presence of varying degrees of selection pressure in the horsegram germplasm resources in Himachal Pradesh.

SNP density mapped on the linkage map is illustrated in Figure 6E. As in the case of the CNVs, distribution bias was observed in the SNPs of HPKM-193; however, this bias was not like that in CNVs. A higher SNP density was observed in the midsection in most of the chromosomes. Chr06 showed less variation than the other chromosomes.

Genes related to drought tolerance

Horsegram is considered one of the most drought-tolerant legume crop species. Personal investigation showed that plants can survive for more than 20 days without water under controlled conditions. A study by Bhardwaj et al. [50] described a transcriptome analysis of eight shoot and root tissues of a drought-sensitive (M-191) genotype and a drought-tolerant (M-249) genotype of horsegram under controlled and drought stress conditions. This study identified some important genic regions responsible for drought tolerance.

To estimate genes related to drought tolerance in the horsegram genome, a BLASTP search of the 36,105 putative genes was performed against amino acid sequences of A. thaliana (Araport11), and hit genes were further used in BLAST searches against DroughtDB [51], the NCBI NR protein database, and Plant Stress Gene Database [52]. A total of 158 horsegram genes showed significant similarity to the 78 genes in DroughtDB [10]. The most frequently hit gene was ABCG40, which encodes a protein that functions as an ABC transporter, and showed significant similarity to 14 horsegram genes. OST1/SRK2E and AtrbohF were also frequently identified, with hits to seven and six horsegram genes, respectively. Of the 158 genes, 93 showed the same domain sequences as the A. thaliana gene, and 52 were like the genes registered in the PSGD. These genes were indicated to have a greater likelihood of being candidate genes related to drought tolerance.

Comparative and phylogenetic analyses with other legume species

Horsegram belongs to the subtribe Phaseolinae in the millettioid clade, along with P. vulgaris and V. angularis. The genome structure of horsegram was compared with those of P. vulgaris (Pvulgaris_218_v1.0), V. angularis (Vangularis_v1.a1), and L. japonicus (Lr_r3.0).

The predicted gene sequences in MUN_r1.1_cds were clustered with other plant species (P. vulgaris, V. angularis, L. japonicus, and A. thaliana) for comparison at the protein sequence level. A total of 73,457 clusters were generated using the program CD-HIT (RRID: SCR_007105) [53] (Table 10). Of the 36,105 putative gene sequences, 21,369 (59.2%) genes were clustered with other plant species and 14,736 (40.8%) were considered horsegram-specific genes (Figure 7). A total of 3738 (10.4%) horsegram gene sequences were clustered with 3,864 P. vulgaris and 3,713 V. angularis genes, which were considered millettioid-specific genes. Common genes in legumes were identified for 6550 (18.1%) horsegram gene sequences, based on clusters with P. vulgaris, V. angularis, andL. japonicus.

Table 10.

Number of gene clusters in horsegram and the four plant species, P. vulgaris (Pvulgaris_218_v1.0), V. angularis (Vangularis_v1.a1), L. japonicus (Lj_r3.0), and A. thaliana (Araport11).

Clustered species Number of clustered species Number of clusters Number of clustered genes
Horsegram P. vulgaris (Pv) V. angularis (Va) L. japonicus (Lj) A. thaliana (At)
Horsegram 1 9,578 14,736 0 0 0 0
P. vulgaris (Pv) 1 3,449 0 4,114 0 0 0
V. angularis (Va) 1 8,306 0 0 9,545 0 0
L. japonicus (Lj) 1 17,544 0 0 0 21,677 0
A. thaliana (At) 1 15,177 0 0 0 0 18,129
Horsegram + Pv 2 897 1,003 1,028 0 0 0
Horsegram + Va 2 471 531 0 530 0 0
Horsegram + Lj 2 362 391 0 0 484 0
Horsegram + At 2 111 116 0 0 0 147
Pv + Va 2 1,031 0 1,248 1,159 0 0
Pv + Lj 2 262 0 293 0 333 0
Pv + At 2 88 0 97 0 0 130
Va + Lj 2 290 0 0 317 382 0
Va + At 2 92 0 0 95 0 113
Lj + At 2 326 0 0 0 398 415
Horsegram + Pv + Va (Common in Millettioids) 3 3,189 3,738 3,864 3,713 0 0
Horsegram + Pv + Lj 3 458 519 521 0 594 0
Horsegram + Pv + At 3 87 89 92 0 0 110
Horsegram + Va + Lj 3 230 272 0 255 292 0
Horsegram + Va + At 3 49 53 0 54 0 58
Horsegram + Lj + At 3 56 59 0 0 72 74
Pv + Va + Lj 3 581 0 649 664 725 0
Pv + Va + At 3 94 0 101 99 0 118
Pv + Lj + At 3 66 0 79 0 79 89
Va + Lj + At 3 41 0 0 45 51 53
Horsegram + Pv + Va + Lj (common in legumes) 4 5,011 6,550 6,591 6,472 6,847 0
Horsegram + Pv + Va + At 4 670 853 857 845 0 873
Horsegram + Pv + Lj + At 4 173 201 196 0 222 208
Horsegram + Va + Lj + At 4 76 82 0 83 102 98
Pv + Va + Lj + At 4 302 0 370 365 420 385
Common in all 5 4,390 6,912 7,097 7,000 7,056 6,638
Number of genes 36,105 27,197 31,241 39,734 27,621
Number of clustered genes 36,105 27,197 31,241 39,734 27,638
Number of non-clustered genes 0 0 0 0 17

Figure 7.

Figure 7.

Ratios of genes of horsegram (MUN_r1.1_cds) clustered with those of four other plant species. Pv, Va, Li, and At represent genes of P. vulgaris (Pvulgaris_218_v1.0), V. angularis (Vangularis_v1.a1), L. japonicus (Lj_r3.0), and A. thaliana (Araport11), respectively.

Functional analysis was performed for MUN_r1.1_cds by classifying 36,105 putative genes into the Gene Ontology (GO) and euKaryotic clusters of Orthologous Groups (KOG) databases [54]. A total of 24,699 (68.4%) putative genes were annotated with GO categories including 9086 (25.2%) genes involved in biological processes, 4127 (11.4%) genes coding for cellular components, and 1377 (38.7%) genes associated with molecular functions (Figure 8). The ratio of annotated horsegram genes was smaller than those of the other species. The species with a ratio of classified GO categories most like that of horsegram was L. japonicus. A total of 18,630 (51.6%) putative genes showed significant similarity to genes in the KOG database (Figure 9). As in the results for GO, the ratio of hit genes was lower than for the other four species.

Figure 8.

Figure 8.

Comparison of genes annotated by the GO database in horsegram (MUN_r1.1_cds), P. vulgaris (Pvulgaris_218_v1.0), V. angularis (Vangularis_v1.a1), L. japonicus (Lj_r3.0), and A. thaliana (Araport11). (A) Numbers and ratios of genes annotated by GO database. (B) Ratios of the classified GO categories in the predicted genes.

Figure 9.

Figure 9.

Comparison of genes annotated by the KOG database in horsegram (MUN_r1.1_cds), P. vulgaris (Pvulgaris_218_v1.0), V. angularis (Vangularis_v1.a1), L. japonicus (Lj_r3.0), and A. thaliana (Araport11). (A) Numbers and ratios of genes annotated by the KOG database. (B) Ratio of the classified KOG categories in hit genes.

Clear relationships were observed with a warm-season legume, V. angularis, and one-on-one relationships were observed between horsegram chr02 (Mun_chr02) and V. angularis chr09 (Va_chr09), Mun_chr04 and Va_chr02, Mun_chr06 and Va_chr10, Mun_chr07 and Va_chr08, Mun_chr08 and Va_chr04, and Mun_chr09 and Va_chr05 (Figure 10A). The syntenic relations with P. vulgaris were slightly more complex than those with V. angularis, and those with the cool-season legume L. japonicus were more fragmented.

Figure 10.

Figure 10.

Comparative and phylogenetic analyses with other legume species. (A) Graphical view of syntenic relationships between horsegram and P. vulgaris (left), V. anagularis (middle), and L. japonicus (right). Pink and blue dots show homologous sequences of MUN_r1.11 with forward and reverse directions against the reference sequences. (B) Distribution of Ks values of orthologous gene pairs in horsegram (Mu) and the four plant species: P. vulgaris (Pv), V. anagularis (Va), L. japonicus (Lj), and A. thaliana (At). C: Phylogenetic tree of 4,154 common single-copy genes of the six legume species: P. vulgaris, V. anagularis, G. max, L. japonicus, M. truncatula, and A. thaliana.

Synonymous substitutions per site (Ks) were estimated by comparing gene pairs in each combination of horsegram, P. vulgaris, V. angularis, L. japonicus, and A. thaliana (Araport11) by KaKs Calculator [55] based on the clustered genes using the CD-HIT program (Figure 10B). The similar distributions of horsegram, P. vulgaris, and V. angularis indicated the close relations among the three species. The ratios of gene pairs showing Ks values less than 0.1% were 21.1% between horsegram and P. vulgaris and 8.4% between horsegram and V. angularis, suggesting that there was a closer relationship between horsegram and P. vulgaris at the gene level.

The phylogenetic analysis was performed with Medica truncatula (r5.0) [56] and Glycine max (Glma4 [57]) in addition to P. vulgaris, V. angularis, L. japonicus, and A. thaliana. A total of 978 common single-copy genes were identified for horsegram and the six species by clustering the genes using OrthoFinder (RRID: SCR_017118) [58]. Multiple alignment was performed for the 978 single-copy genes using Muscle (RRID: SCR_011812) [59], and gaps were excluded by Gblock [60]. An NJ tree was created with the 4154 single-copy orthologous genes identified in the four legume species by MEGA 7.0.9 beta (RRID: SCR_000667) [61] and TIMETREE (RRID: SCR_021162) [62]. A. thaliana was used for the outgroup (Figure 10C). When the divergence time between M. truncatula and G. max was considered to be 53 million years ago, it was estimated that horsegram diverged from P. vulgaris and V. angularis 20.75 million years ago (Figure 10C). Among the four legume species in millettioids, P. vulgaris and V. angularis shared closer relations with each other than with horsegram, and horsegram was closer to P. vulgaris and V. angularis than to G. max. The results are in consonance with a previous study based on a comparison of eight chloroplast regions [63].

Data validation and quality control

The quality of assembled genome and gene sequences was investigated by using 1375 Embryophyta BUSCOs (v 3.0, obd10; RRID: SCR_015008) [24]. A total of 1340 (93.1%) BUSCOs were identified on the assembled scaffolds (MUN_r1.11), while 1259 (87.4%) and 1313 (91.2%) were converted by the pseudomolecules and CDS sequences (Table 11).

Table 11.

Statistics of the horsegram genome assembly and CDS.

  MUN_r1.11 MUN_r1.11 MUN_r1.1_cds
Genome/Scaffolds Genome/Pseudomolecules CDS
BUSCOs
Complete 1340 (93.1%) 1259 (87.4%) 1313 (91.2%)
Complete single-copy 1252 (86.9%) 1181 (82%) 1208 (83.9%)
Complete duplicated 88 (6.1%) 78 (5.4%) 105 (7.3%)
Fragmented 26 (1.8%) 31 (2.2%) 23 (1.6%)
Missing 74 (5.1%) 150 (10.4%) 104 (7.2%)

Reuse potential

In this study, we have provided a first-draft genome assembly of horsegram cultivar (HPK-4) and investigated features of the horsegram genome and gene sequences as well as the genetic diversity of the accessions. This information will help to establish an efficient breeding program for horsegram by integrating conventional breeding with marker-based biotechnological tools. Finally, the genomic information revealed in this study can be applied to the improvement of other disadvantageous food legumes.

Funding Statement

This work was supported by the Bilateral Joint Research Projects from the Japan Society for the Promotion of Science and the Department of Science and Technology of the Government of India, and by funds from the Kazusa DNA Research Institute Foundation.

Data availability

The genome assembly data, annotations, and gene models are available at the Horsegram Database [64]. The obtained genome sequence reads are available from the DNA Databank of Japan (DDBJ) Sequence Read Archive (DRA) under the BioProject accession number PRJDB5374.

Data sets supporting the results of this article are available in GigaScience Database [10].

Declarations

List of abbreviations

BUSCO: benchmarking universal single-copy ortholog; CDS: coding sequence; CNV: copy number variation; CSK-HPAU: CSK Himachal Pradesh Agricultural University; GO: Gene Ontology; Kbp: kilobase pairs; KOG: euKaryotic clusters of Orthologous Groups; Mbp: megabase pairs; MP: mate-pair; NCBI: National Center for Biotechnology Information; NJ: neighbor-joining; PE: paired-end; SLR: synthetic long read; SNP: single nucleotide polymorphism; SSR: simple sequence repeat; TAS: target amplicon sequencing; TE: transposable element.

Ethical approval

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Funding

This work was supported by the Bilateral Joint Research Projects from the Japan Society for the Promotion of Science and the Department of Science and Technology of the Government of India, and by funds from the Kazusa DNA Research Institute Foundation.

Author contributions

The project was designed by S.I., R.C., and T.S. K.S. contributed to the genome sequencing, linkage map, and pseudomolecule construction. R.C. contributed to the creation of the mapping population. H.H. contributed to the genome assembly and gene prediction, annotation, and phylogenetic analysis. S.N. contributed to the transcript sequencing. H.N. contributed to the pseudomolecule construction. S.I. contributed to the entire process of data analysis.

Acknowledgements

We thank A. Watanabe, S. Nakayama, Y. Kishida, M. Kohara, H. Tsuruoka, C. Minami, and S. Sasamoto (KDRI) for their technical assistance. We also thank the National Bureau of Plant Genetic Resources (NBPGR), New Delhi, for providing the germplasm lines for diversity analysis.

References

  • 1.Gillett J, Polhill R, Verdcourt B, . Flora of tropical East Africa. Leguminosae (Part 3) Subfamily Papilionoideae (1); Leguminosae (Part 4) Subfamily Papilionoideae (2). London: Crown Copyright, 1971. [Google Scholar]
  • 2.Arora R, Chandel K, . Botanical source areas of wild herbage legumes in India. Trop Grasslands, 1972; 6: 213–221. [Google Scholar]
  • 3.Chahota RK, Sharma TR, Sharma SK, Kumar N, Rana JC, . 12 - Horsegram. In: Singh M, Upadhyaya HD, Bisht IS (eds), Genetic and genomic resources of grain legume improvement. Oxford: Elsevier, 2013; pp. 293–305. [Google Scholar]
  • 4.Patel DP, Dabas BS, Sapra RS, Mandal S, . Evaluation of horsegram (Macrotyloma uniflorum) (Lam.) germplasm. New Delhi: National Bureau of Plant Genetic Resources Publication, 1995. [Google Scholar]
  • 5.Gopalan C, Ramashastri BV, Balasubramanyan SC, . Nutritive value of Indian foods. Hyderabad: National Institute of Nutrition, ICMR, Offset Press, 1999; p. 156. [Google Scholar]
  • 6.Reddy B, Brijitha N, Raghavender C, . Aflatoxin contamination in insect damaged seeds of horsegram under storage. Mycotoxin Res., 2005; 21: 187–191. [DOI] [PubMed] [Google Scholar]
  • 7.Gautam M, Katoch S, Chahota RK, . Comprehensive nutritional profiling and activity directed identification of lead antioxidant, antilithiatic agent from Macrotyloma uniflorum (Lam.) Verdc. Food Res. Int., 2020; 137: 109600. [DOI] [PubMed] [Google Scholar]
  • 8.Gautam M, Datt N, Chahota RK, . Assessment of calcium oxalate crystal inhibition potential, antioxidant activity and amino acid profiling in horse gram (Macrotyloma uniflorum): high altitude farmer’s varieties. 3 Biotech., 2020; 10: 402. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.National Research Council. . Tropical legumes: resources for the future. Washington, DC: National Academies Press, 1979; 10.17226/19836. [DOI] [Google Scholar]
  • 10.Shirasawa K, Chahota RK, Hirakawa H, Nagano S, Nagasaki H, Sharma TR, Isobe SN, . Supporting data for “A Chromosome-scale draft genome sequence of horsegram (Macrotyloma uniflorum)”. GigaScience Database. 2021; 10.5524/100932. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Marcais G, Kingsford C, . A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 2011; 27: 764–770. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Luo R, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience, 2012; 1: 18. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W, . Scaffolding pre-assembled contigs using SSPACE. Bioinformatics, 2011; 27: 578–579. [DOI] [PubMed] [Google Scholar]
  • 14.Boetzer M, Pirovano W, . Toward almost closed genomes with GapFiller. Genome Biol., 2012; 13: 1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Kajitani R, et al. Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Genome Res., 2014; 24: 1384–1395. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Zimin AV, Marçais G, Puiu D, Roberts M, Salzberg SL, Yorke JA, . The MaSuRCA genome assembler. Bioinformatics, 2013; 29: 2669–2677. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Bankevich A, Pevzner PA, . TruSPAdes: barcode assembly of TruSeq synthetic long reads. Nat. Methods, 2016; 13: 248–250. [DOI] [PubMed] [Google Scholar]
  • 18.Kosugi S, Hirakawa H, Tabata S, . GMcloser: closing gaps in assemblies accurately with a likelihood-based selection of contig or long-read alignments. Bioinformatics, 2015; 31: 3733–3741. [DOI] [PubMed] [Google Scholar]
  • 19.University of California Santa Cruz Genomics Institute. . UCSC Genome Browser on Human February 2009 (GRCh37/hg19) Assembly. https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13/. Accessed 13 April 2016.
  • 20.National Center for Biotechnology Information . https://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/. Accessed 13 April 2016.
  • 21.National Center for Biotechnology Information . https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/. Accessed 13 April 2016.
  • 22.UniVec . Bethesda, MD: National Center for Biotechnology Information. 2016; http://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/. Accessed 13 April 2016.
  • 23.Air GM, Els MC, Brown LE, Laver WG, Webster RG, . Location of antigenic sites on the three-dimensional structure of the influenza N2 virus neuraminidase. Virology, 1985; 145: 237–248. [DOI] [PubMed] [Google Scholar]
  • 24.Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM, . BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics, 2015; 31: 3210–3212. [DOI] [PubMed] [Google Scholar]
  • 25.Langmead B, Wilks C, Antonescu V, Charles R, . Scaling read aligners to hundreds of threads on general-purpose processors. Bioinformatics, 2019; 35: 421–432. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H, . Twelve years of SAMtools and BCFtools. GigaScience, 2021; 10(2): giab008. 10.1093/gigascience/giab008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Danecek P, et al. The variant call format and VCFtools. Bioinformatics, 2011; 27: 2156–2158. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Shirasawa K, Kuwata C, Watanabe M, Fukami M, Hirakawa H, Isobe S, . Target amplicon sequencing for genotyping genome-wide single nucleotide polymorphisms identified by whole-genome resequencing in peanut. Plant Genome, 2016; 9: 10.3835/plantgenome2016.06.0052. [DOI] [PubMed] [Google Scholar]
  • 29.Van Ooijen J, . JoinMap® 4, Software for the calculation of genetic linkage maps in experimental populations. Wageningen: Kyazma BV, 2006; p. 33. [Google Scholar]
  • 30.Smit AFA, Hubley R, Green P, . RepeatMasker Open-3.0. 2013–2015, http://www.repeatmasker.org.
  • 31.Bao W, Kojima KK, Kohany O, . Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA, 2015; 6: 11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Price AL, Jones NC, Pevzner PA, . De novo identification of repeat families in large genomes. Bioinformatics, 2005; 21(Suppl 1): i351–i358. [DOI] [PubMed] [Google Scholar]
  • 33.Kofler R, Schlötterer C, Lelley T, . SciRoKo: a new tool for whole genome microsatellite search and investigation. Bioinformatics, 2007; 23: 1683–1685. [DOI] [PubMed] [Google Scholar]
  • 34.Henschel R, Lieber M, Wu LS, Nista PM, Haas BJ, LeDuc RD., . Trinity RNA-Seq assembler performance optimization. In: Proceedings of the 1st conference of the extreme science and engineering discovery environment: bridging from the eXtreme to the campus and beyond. 2012; pp. 1–8. [Google Scholar]
  • 35.Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M, . BRAKER1: unsupervised RNA-seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics, 2016; 32: 767–769. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.National Center for Biotechnology Information . Available from: https://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/. Accessed 13 November 2016.
  • 37.Mulder NJ, et al. New developments in the InterPro database. Nucleic Acids Res., 2007; 35: D224–D228. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Lowe TM, Chan PP, . tRNAscan-SE On-line: integrating search and context for analysis of transfer RNA genes. Nucleic Acids Res., 2016; 44: W54–W57. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Schmutz J, et al. A reference genome for common bean and genome-wide analysis of dual domestications. Nat. Genet., 2014; 46: 707–713. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Sakai H, et al. The Vigna Genome Server, “Vig GS”: A genomic knowledge base of the genus Vigna based on high-quality, annotated genome sequence of the Azuki Bean, Vigna angularis (Willd.) Ohwi & Ohashi. Plant Cell Physiol., 2016; 57: e2. [DOI] [PubMed] [Google Scholar]
  • 41.Sato S, et al. Genome structure of the legume, Lotus japonicus. DNA Res., 2008; 15: 227–239. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.The Arabidopsis Information Resource (TAIR) . Araport11 genome release. https://arabidopsis.org/download/index-auto.jsp?dir=%2Fdownload_files%2FGenes%2FAraport11_genome_release. Accessed 26 November 2016.
  • 43.Fuller DQ, Murphy C, . The origins and early dispersal of horsegram (Macrotyloma uniflorum), a major crop of ancient India. Genet. Resour. Crop. Evol., 65: 285–305. [Google Scholar]
  • 44.Sharma V, Sharma TR, Rana JC, Chahota RK, . Analysis of genetic diversity and population structure in horsegram (Macrotyloma uniflorum) using RAPD and ISSR markers. Agri. Res., 2015; 4: 221–230. [Google Scholar]
  • 45.Shirasawa K, Hirakawa H, Isobe S, . Analytical workflow of double-digest restriction site-associated DNA sequencing based on empirical and in silico optimization in tomato. DNA Res., 2016; 23: 145–153. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Van Berloo R, . GGT 2.0: versatile software for visualization and analysis of genetic data. J. Hered., 2008; 99: 232–236. [DOI] [PubMed] [Google Scholar]
  • 47.Kumar S, Stecher G, Li M, Knyaz C, Tamura K, . MEGA X: molecular evolutionary genetics analysis across computing platforms. Mol. Biol. Evol., 2018; 35: 1547–1549. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Krzywinski M, et al. Circos: an information aesthetic for comparative genomics. Genome Res., 2009; 19: 1639–1645. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Xie C, Tammi MT, . CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinform., 2009; 10: 1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Bhardwaj J, et al. Comprehensive transcriptomic study on horse gram (Macrotyloma uniflorum): De novo assembly, functional characterization and comparative analysis in relation to drought stress. BMC Genom., 2013; 14: 1–18. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Alter S, et al. DroughtDB: an expert-curated compilation of plant drought stress genes and their homologs in nine species. Database, 2015; 2015: bav046. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Plant Stress Gene Database . New Delhi; Jawaharlal Nehru University. http://ccbb.jnu.ac.in/stressgenes/frontpage.html. Accessed 12 October 2020.
  • 53.Fu L, Niu B, Zhu Z, Wu S, Li W, . CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 2012; 28: 3150–3152. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Tatusov RL, et al. The COG database: an updated version includes eukaryotes. BMC Bioinform., 2003; 4: 1–14. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Zhang Z, Li J, Zhao XQ, Wang J, Wong GKS, Yu J, . KaKs_Calculator: calculating Ka and Ks through model selection and model averaging. Genomics Proteomics Bioinform., 2006; 4: 259–263. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Pecrix Y, et al. Whole-genome landscape of Medicago truncatula symbiotic genes. Nat. Plants, 2018; 4: 1017–1025. [DOI] [PubMed] [Google Scholar]
  • 57.Soybase . US Department of Agriculture–Agricultural Research Service, Iowa State University. https://soybase.org/. Accessed 13 August 2021.
  • 58.Emms DM, Kelly S, . OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol., 2019; 20: 1–14. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.European Molecular Biology Laboratory–European Bioinformatics Institute (EMBI–EBI) . MUSCLE. Multiple sequence alignment. Hinxton: EMBL–EBI, 2019https://www.ebi.ac.uk/Tools/msa/muscle. Accessed 13 August 2021. [Google Scholar]
  • 60.Castresana J, . Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol., 2000; 17: 540–552. [DOI] [PubMed] [Google Scholar]
  • 61.Kumar S, Nei M, Dudley J, Tamura K, . MEGA: a biologist-centric software for evolutionary analysis of DNA and protein sequences. Brief Bioinform., 2008; 9: 299–306. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Hedges SB, Dudley J, Kumar S, . TimeTree: a public knowledge-base of divergence times among organisms. Bioinformatics, 2006; 22: 2971–2972. [DOI] [PubMed] [Google Scholar]
  • 63.Stefanović S, Pfeil BE, Palmer JD, Doyle JJ, . Relationships among phaseoloid legumes based on sequences from eight chloroplast regions. System Bot., 2009; 34: 115–128. [Google Scholar]
  • 64.Horsegram Database . Kazusa DNA Research Institute. 2017; http://horsegram.kazusa.or.jp/. Accessed 19 January 2021.
GigaByte. 2021 Oct 8;2021:gigabyte30.

Article Submission

Sachiko Isobe
GigaByte.

Assign Handling Editor

Editor: Scott Edmunds
GigaByte.

Editor Assess MS

Editor: Hongfang Zhang
GigaByte.

Curator Assess MS

Editor: Mary-Ann Tuli
GigaByte.

Review MS

Editor: Tianzuo Wang

Reviewer name and names of any other individual's who aided in reviewer Tianzuo Wang
Do you understand and agree to our policy of having open and named reviews, and having your review included with the published papers. (If no, please inform the editor that you cannot review this manuscript.) Yes
Is the language of sufficient quality? Yes
Please add additional comments on language quality to clarify if needed It can be improved better.
Are all data available and do they match the descriptions in the paper? Yes
Additional Comments
Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples &lt;a href="http://gigadb.org/site/guide" target="_blank"&gt;http://gigadb.org/site/guide&lt;/a&gt; Yes
Additional Comments
Is the data acquisition clear, complete and methodologically sound? Yes
Additional Comments
Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes
Additional Comments
Is there sufficient data validation and statistical analyses of data quality? Yes
Additional Comments
Is the validation suitable for this type of data? Yes
Additional Comments
Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes
Additional Comments
Any Additional Overall Comments to the Author Shirasawa et al. reported a Chromosome-scale draft genome sequence of horsegram, and performed the analysis of comparative genomics. 1.If Pacbio data was used, the quality of genome can be improved much. 2.Only genomes of P. vulgaris, V. angularis and L. japonicus in the legume were used for phylogenetic analysis. Soybean and Medicago, as the model legume plants, should be added at last. 3.The section of Whole genome structure in horsegram should be introduced before Diversity analysis in genetic resources, Genes related to drought tolerance, and Transcript sequencing, gene prediction and annotation. Because genome information is the foundation of other analysis.
Recommendation Major Revision
GigaByte.

Review MS

Editor: Penghao Wang

Reviewer name and names of any other individual's who aided in reviewer Penghao Wang
Do you understand and agree to our policy of having open and named reviews, and having your review included with the published papers. (If no, please inform the editor that you cannot review this manuscript.) Yes
Is the language of sufficient quality? Yes
Please add additional comments on language quality to clarify if needed
Are all data available and do they match the descriptions in the paper? Yes
Additional Comments
Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples &lt;a href="http://gigadb.org/site/guide" target="_blank"&gt;http://gigadb.org/site/guide&lt;/a&gt; Yes
Additional Comments
Is the data acquisition clear, complete and methodologically sound? Yes
Additional Comments
Is there sufficient detail in the methods and data-processing steps to allow reproduction? Yes
Additional Comments
Is there sufficient data validation and statistical analyses of data quality? Yes
Additional Comments
Is the validation suitable for this type of data? Yes
Additional Comments
Is there sufficient information for others to reuse this dataset or integrate it with other data? Yes
Additional Comments
Any Additional Overall Comments to the Author Authors presented a paper on describing a new pseudo-chromosome draft genome sequences of a legume plant horsegram and some bioinformatics analyses based on the data. The presented assembly is of good quality and the bioinformatics analysis performed is sound. The resources made available by the study should prove valuable to researchers working on the plant and legume community on a whole. The paper is generally well written and I personally found out the paper is quite easy to follow. Few grammatic errors can be found. The bioinformatics methodology that has been utilised in the study is sound and the software used fit the goals of the study. However, authors need to present more details on some analysis components, e.g. the parameter set used for the software, the version of the software, the OS, etc, so that the analysis can be better reproduced. For example, in Methods section, line 76 the Jellyfish program was used to estimate the genome size, the parameter, version, OS of running the software were not mentioned. Line 78 SOAPdenovo2 apart from Kmer the most important parameter, what about the rest? SSPACE 2.0 was used for scaffolding, the insert sizes? Platanus, MaSuRCA, TruSPAdes, RepeatMasker, augustus, all these software involve a number of parameters, and the details on how they were used need to be provided. Because the results can be sharply different with different parameters. Some figures appear to be created by using some tools, and these tools need to be acknowledged and referenced. For example, is Circus used to generate the circular plot in Fig 5? In addition, I could not find captions for all the main figures.
Recommendation Minor Revision
GigaByte.

Editor Decision

Editor: Hongfang Zhang
GigaByte. 2021 Oct 8;2021:gigabyte30.

Minor Revision

Sachiko Isobe
GigaByte.

Assess Revision

Editor: Hongfang Zhang
GigaByte.

Re-Review MS

Editor: Tianzuo Wang

Comments on revised manuscript I advise gigabyte journal to accept this version.
GigaByte.

Editor Decision

Editor: Hongfang Zhang
GigaByte.

Final Data Preparation

Editor: Mary-Ann Tuli
GigaByte.

Editor Decision

Editor: Hongfang Zhang
GigaByte.

Accept

Editor: Scott Edmunds

Comments to the Author Looking forward to seeing your work published soon.
GigaByte.

Export to Production

Editor: Scott Edmunds

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Data Availability Statement

    The genome assembly data, annotations, and gene models are available at the Horsegram Database [64]. The obtained genome sequence reads are available from the DNA Databank of Japan (DDBJ) Sequence Read Archive (DRA) under the BioProject accession number PRJDB5374.

    Data sets supporting the results of this article are available in GigaScience Database [10].


    Articles from GigaByte are provided here courtesy of Gigascience Press

    RESOURCES