A chromosome-scale draft genome sequence of horsegram (Macrotyloma uniflorum)

Kenta Shirasawa; Rakesh Chahota; Hideki Hirakawa; Soichiro Nagano; Hideki Nagasaki; Tilak Sharma; Sachiko Isobe

doi:10.46471/gigabyte.30

2021 Oct 8;2021:gigabyte30. doi: 10.46471/gigabyte.30

PMCID: PMC9650294 PMID: 36824333

Abstract

Horsegram (Macrotyloma uniflorum [Lam.] Verdc.) is an underutilized warm-season diploid legume (2n = 20, 22). Because of its ability to grow under water-deficient and marginal soil conditions, horsegram is a preferred choice in the era of global climate change. In recognition of its potential as a crop species, we generated and analyzed a draft genome sequence for a horsegram variety, HPK-4. Ten chromosome-scale pseudomolecules were created by aligning Illumina scaffold sequences onto a linkage map. The total length of the ten pseudomolecules was 259.2 Mbp, covering 89% of the total length of the assembled sequences. A total of 36,105 genes were predicted on the assembled sequences. Diversity analysis of 89 horsegram accessions by dd-RAD-Seq identified 277 single nucleotide polymorphisms (SNPs), suggesting narrow genetic diversity among the horsegram accessions. This is the first attempt to generate a draft genome sequence of horsegram and will provide a reference for sequence-based analysis of horsegram germplasm.

Data description

Background

Horsegram (Macrotyloma uniflorum [Lam.] Verdc.) (NCBI:txid271171) is an underutilized warm-season diploid legume (2n = 20, 22). It belongs to the tribe Phaseoleae of the family Fabaceae, and is cultivated mainly in semi-arid regions of the world. On the Indian subcontinent, horsegram is consumed primarily as a food legume, whereas in Africa and Australia it is grown mainly for use as a concentrated animal feed and fodder. This self-pollinating plant is thought to have originated in Africa because most of its 32 wild species exist there [1], and the Northwestern Himalayan region is considered its secondary center of origin [2]. Horsegram may have been domesticated as M. uniflorum var. uniflorum in the southern part of India, but its probable progenitor, M. axillare, has not been reported in India. Therefore, the process by which cultivated horsegram was domesticated from its wild ancestors has not yet been established [3].

Because of its ability to grow under water-deficient and marginal soil conditions, horsegram is a preferred choice in the era of global climate change. Horsegram contains 16.0–30.4% protein [4], and constitutes an important source of dietary protein for the undernourished population in South Asia. In addition, its seeds are a rich source of lysine and vitamins [5], and its antioxidant, antimicrobial, and unique antilithiatic properties make it a food of nutraceutical importance [6–8]. As a result of horsegram’s medicinal importance and ability to thrive under drought-like conditions, the US National Academy of Sciences has identified this legume as a potential food source for the future [9].

Context

Many wild-type and undesirable characteristics make horsegram a less favorable legume for commercial cultivation, although it possesses numerous attributes that make it a potential food legume for warm arid regions. In addition, genetic and molecular tools with which to genetically enhance horsegram are lacking. To help realize the potential of this food legume species, we generated and analyzed a draft genome sequence for HPK-4, a horsegram cultivar commercially released by CSK Himachal Pradesh Agricultural University (CSK-HPAU), Palampur, India. This variety, which has dark-brown seeds, is under cultivation in many parts of the Northwestern Indian Himalayan region. It is resistant to anthracnose (Colletotrichum truncatum) and tolerant to abiotic stresses such as drought, salinity, and heavy metals. This is the first attempt to generate a draft genome sequence of this ‘orphan’ but important food legume species, and the sequence will provide a reference for sequence-based analysis of horsegram germplasm to elucidate the genetic bases of important traits.

Methods

Whole genome sequencing and assembly of horsegram

The genome sequences of a horsegram variety, HPK-4, bred at CSK-HPAU, were generated from a paired-end (PE) library by Illumina HiSeq 2000 with a total length of 37.9 Gbp (gigabase pairs) [10]. All data analysis for this study was performed on Linux servers running Red Hat Enterprise Linux Server 7.1. Using the Jellyfish v1.1.6 program (RRID: SCR_005491) [11], the genome size of HPK-4 was estimated to be approximately 343.6 Mbp (megabase pairs) (Figure 1). The parameters used in the analysis are listed in Table 1.

Figure 1. — Genome size estimation using Jellyfish with the distribution of the number of distinct k-mers (k = 17) with the given multiplicity values.
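
The k-mer approach behind this estimate can be sketched briefly. The following is a minimal illustration only, assuming the commonly used relation in which genome size is approximated by the total number of k-mers divided by the depth of the main k-mer peak. The function name and the toy histogram are hypothetical, and the exact procedure used to derive the 343.6 Mbp estimate from the Jellyfish output is not described in the text.

```python
def estimate_genome_size(histogram):
    """Estimate genome size from a k-mer histogram.

    histogram: iterable of (multiplicity, number_of_distinct_kmers) pairs,
    for example parsed from a two-column histogram file.
    """
    counts = dict(histogram)
    mults = sorted(counts)
    # Low-multiplicity k-mers are mostly sequencing errors: walk down the
    # initial error peak until the counts start to rise again.
    valley = mults[0]
    for prev, cur in zip(mults, mults[1:]):
        if counts[cur] > counts[prev]:
            valley = prev
            break
    solid = [m for m in mults if m >= valley]
    # Depth of the main (homozygous) peak among the retained k-mers.
    peak = max(solid, key=lambda m: counts[m])
    total_kmers = sum(m * counts[m] for m in solid)
    return total_kmers / peak


# Toy histogram (hypothetical values, not from the horsegram data).
toy = [(1, 1000), (2, 200), (3, 50), (4, 80), (5, 150), (6, 80), (7, 20)]
print(estimate_genome_size(toy))
```

Real histograms usually need more care than this sketch takes, for instance with heterozygous peaks and highly repeated k-mers.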

Table 1.

Parameters used in each program.

Program name	Parameters or BUSCO data set	Comments
Jellyfish v1.1.6	-m 17 -s 1000000000 -t 32 -C
SOAPdenovo2 r223	-K (61 71 81 91) -R -F -p 8
SSPACE v2.0	-x 0 -z 0 -k 3 -a 0.7 -n 15 -T 8 -g 0 -v 1
GapFiller v1.10	-m 30 -o 5 -r 0.7 -n 10 -d 50 -t 10 -g 0 -T 8 -i 1
Platanus v1.2.1	-t 12 -m 300
MaSuRCA v2.3.2	Default parameters
TruSPAdes v3.6.2	Default parameters
RepeatMasker v3.2.9	-poly -x -lib
RepeatScout v1.0.5	Default parameters
BRAKER1 v1.9	Default parameters
BUSCO v3.0	Embryophyta, odb10
SAMtools 0.1.19	samtools mpileup -d 10000000 -D -u
bcftools 0.1.19	bcftools view -c -g -v
vcftools 0.1.12b	vcftools_0.1.12b/bin/vcftools --remove-indels --min-alleles 2 --max-alleles 2 --minDP 5 --minQ 214 --max-missing 1	SNP filtering for 8 F₂ WGS
vcftools 0.1.12b	vcftools_0.1.12b/bin/vcftools --remove-indels --min-alleles 2 --max-alleles 2 --minDP 10 --minQ 50 --max-missing 0.2	SNP filtering for 214 F₂ TAS
vcftools 0.1.12b	vcftools_0.1.12b/bin/vcftools --remove-indels --min-alleles 2 --max-alleles 2 --minDP 5 --minQ 999 --max-missing 0.5 --maf 0.05	SNP filtering for the 89-accession population
JoinMap 4	Kosambi’s mapping function; linkages with a recombination frequency smaller than 0.4 and a LOD larger than 1.0; goodness-of-fit for removal of loci = 5.0; number of added loci after which to perform a ripple = 1; third round = yes

The Illumina PE reads were assembled by SOAPdenovo2 r223 (RRID: SCR_014986) [12] with k-mers of 61 and 81, and contigs were generated with total lengths of 352.2 Mbp (k-mer = 81) and 389.3 Mbp (k-mer = 61) (Table 2). The contigs constructed with k-mer = 81 were selected and scaffolded with mate-pair (MP) reads with insert sizes of 2, 5, 10, and 15 Kbp (kilobase pairs) by using SSPACE v2.0 (RRID: SCR_005056) [13]. The number of scaffolds was 6,228 after gap filling by GapFiller [14] and exclusion of contamination. The total length of the scaffolds was 297.1 Mbp (Assembly 1, Table 2), which was approximately 55–92 Mbp shorter than the total lengths of the contigs and 46.5 Mbp shorter than the estimated genome size of HPK-4.
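
The N50 values reported in the assembly tables follow a standard definition: with sequences sorted from longest to shortest, N50 is the length of the sequence at which the cumulative length first reaches half of the total. A minimal sketch is given here, using the ten pseudomolecule lengths listed in Table 5 as input. The function name is illustrative, and the script the authors used to compute their statistics is not described.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first sequence that brings the cumulative
        # length to at least half of the total assembly length.
        if running * 2 >= total:
            return length
    raise ValueError("lengths must not be empty")


# Pseudomolecule lengths (bp) for MUN_chr01 to MUN_chr10, from Table 5.
pseudomolecules = [
    28_154_654, 24_194_727, 23_973_329, 31_423_847, 25_354_798,
    18_013_159, 28_753_260, 33_386_276, 15_505_026, 30_486_749,
]
print(sum(pseudomolecules))  # total length, 259,245,825 bp
print(n50(pseudomolecules))  # 28,154,654 bp, as reported in Table 2
```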

Table 2.

Statistics of de novo whole genome assembly.

File name	SOAPdenovo Contigs		Assembly 1 SOAPdenovo/SSPACE/GapFiller (k-mer = 81)		Assembly 2 Platanus	Assembly 3 MaSuRCA	Assembly 4 TruSPAdes	Assembly 5 GMcloser	MUN_r1.1	MUN_r1.11	MUN_r1.11 pseudomolecule
Input reads	PE	PE	PE + MP	PE + MP	PE + MP	PE + MP	SLR	Assembly 1 + Assembly 4	Assembly 5	MUN_r1.1	MUN_r1.11 + Linkage map
Comments	k-mer = 61	k-mer = 81	Include contamination	Exclude contamination	Include contamination	Include contamination		Exclude contamination	≥500bp	Scaffolds revised
All

Number of sequences	1,534,576	779,101	7,123	6,228	62,323	17,400	374,253	6,228	3,495	3,497	10
Total length (bp)	389,388,347	352,263,669	297,816,217	297,127,168	287,695,252	313,146,882	1,357,659,302	295,740,202	294,688,765	294,688,765	259,245,825
Average length (bp)	254	452	41,811	47,708	4,616	17,997	3,628	47,486	84,317	84,269	25,924,583
Max length (bp)	43,864	86,786	13,495,995	13,495,995	13,114,378	9,397,721	79,948	13,482,853	9,844,273	9,844,273	33,386,276
Min length (bp)	62	82	300	300	100	71	1,500	146	500	500	15,505,026
N50 length (bp)	2,602	6,108	3,571,813	3,571,813	4,221,442	2,147,735	4,120	3,568,883	2,818,555	2,818,555	28,154,654
A	137,634,352	123,511,855	98,936,270	98,713,635	95,878,438	103,574,375	440,896,819	100,065,275	99,718,915	99,718,915	88,615,763
T	128,359,494	116,788,629	98,000,497	97,829,343	96,128,274	103,392,204	440,199,341	99,104,678	98,784,555	98,784,555	88,538,203
G	61,810,612	56,330,394	43,804,355	43,679,038	42,484,816	46,002,912	238,337,484	44,442,744	44,254,557	44,254,557	38,863,122
C	61,583,889	55,632,791	43,617,622	43,524,196	42,490,698	46,133,757	238,219,308	44,268,956	44,083,227	44,083,227	38,986,862
N	0	0	13,457,473	13,380,956	10,713,026	14,043,634	6,350	7,858,549	7,847,511	7,847,511	4,241,875
Total (ATGC, bp)	389,388,347	352,263,669	284,358,744	283,746,212	276,982,226	299,103,248	1,357,652,952	287,881,653	286,841,254	286,841,254	255,003,950
GC% (ATGC)	31.7	31.8	30.7	30.7	30.7	30.8	35.1	30.8	30.8	30.8	30.5
≥300 bp
Number of sequences	96,834	85,229	7,123	6,228	13,045	17,107	374,253	6,226	-	-	-
Total length (bases)	255,385,256	270,010,759	297,816,217	297,127,168	281,104,166	313,093,559	1,357,659,302	295,739,758	-	-	-
Average length (bases)	2,637	3,168	41,811	47,708	21,549	18,302	3,628	47,501	-	-	-
≥500 bp
Number of sequences	72,097	56,065	3,945	3,468	8,514	13,654	374,253	3,469	3,495	3,497
Total length (bases)	245,981,429	258,994,765	296,598,533	296,074,149	279,298,951	311,725,703	1,357,659,302	294,688,765	294,688,765	294,688,765
Average length (bases)	3,412	4,619	75,183	85,373	32,805	22,830	3,628	84,949	84,317	84,269
≥1 Kbp
Number of sequences	52,710	39,176	1,976	1,842	2,716	9,787	374,253	1,836	1,862	1,864
Total length (bases)	232,331,716	247,369,100	295,266,774	294,976,239	275,198,779	308,936,166	1,357,659,302	293,585,247	293,585,247	293,585,247
Average length (bases)	4,408	6,314	149,427	160,139	101,325	31,566	3,628	159,905	157,672	157,503
≥2 Kbp
Number of sequences	27,007	23,688	1,205	1,186	511	4,079	190,853	1,176	1,202	1,204
Total length (bases)	185,671,228	219,707,018	294,019,523	293,893,601	272,047,316	299,058,101	960,490,176	292,496,469	292,496,469	292,496,469
Average length (bases)	6,875	9,275	244,000	247,802	532,382	73,317	5,033	248,721	243,341	242,937
≥3 Kbp
Number of sequences	15,668	16,505	1,084	1,073	395	3,261	70,396	1,056	1,082	1,084
Total length (bases)	141,491,218	191,540,893	293,553,905	293,455,161	271,611,649	295,920,059	495,770,522	292,032,037	292,032,037	292,032,037
Average length (bases)	9,031	11,605	270,806	273,490	687,624	90,745	7,043	276,545	269,900	269,402

We speculated that the shorter total length of the scaffolds may have been caused by misintegration of repeat sequences by SSPACE v2.0. Therefore, we performed additional assemblies using two programs, Platanus v1.2.1 (RRID: SCR_015531) [15] and MaSuRCA v2.3.2 (RRID: SCR_010691) [16]. The total ATGC lengths of the scaffolds were similar among the three assemblies: 284.4 Mbp in SOAPdenovo-SSPACE (Assembly 1, before excluding contamination; Table 2), 277.0 Mbp in Platanus (Assembly 2), and 299.1 Mbp in MaSuRCA (Assembly 3).

Meanwhile, an Illumina synthetic long-reads (SLR) library was constructed with high-molecular-weight cellular DNA using a TruSeq synthetic long-read DNA library prep kit (Illumina). Sequences were generated by Illumina HiSeq 2000 and MiSeq systems with read lengths of 93 nt and 251 nt, respectively. The SLR reads were synthesized through the TruSPAdes v3.6.2 pipeline [17]. Among the three assemblies with PE and MP reads, Assembly 1 was used for subsequent analysis, and gaps were closed with Illumina SLRs by GMcloser (RRID: SCR_000646) [18]. Potentially contaminated sequences were excluded using BLASTN searches against the chloroplast and mitochondrial genome sequences of Arabidopsis thaliana (accession numbers NC_000932.1 and NC_001284.2), human genome sequences (hg19 [19]), fungal genome sequences registered with the National Center for Biotechnology Information (NCBI) [20], bacterial genome sequences registered with [21], vector sequences in UniVec [22], and PhiX (NC_001422.1) [23] sequences with E-value cutoffs of 1 × 10⁻¹⁰ and length coverage >10%. The total length of the resultant assembly (Assembly 5) was 295.7 Mbp.
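
The contamination criteria above (E-value cutoff of 1 × 10⁻¹⁰ and length coverage >10%) can be expressed as a simple filter. The sketch below is an assumption-laden illustration: the dictionary keys, the function name, and the choice to measure coverage on the query (scaffold) sequence from a single alignment are all hypothetical, since the text does not state how coverage was computed or whether multiple alignments were merged.

```python
def is_contaminant(hit, query_length, max_evalue=1e-10, min_coverage=0.10):
    """Flag a scaffold as a potential contaminant from one BLASTN hit.

    hit: dict with 'qstart', 'qend' (1-based query coordinates) and 'evalue'.
    query_length: length of the scaffold in bp.
    """
    aligned = abs(hit["qend"] - hit["qstart"]) + 1
    coverage = aligned / query_length
    return hit["evalue"] <= max_evalue and coverage > min_coverage


# Hypothetical hit covering 200 bp of a 1,000-bp scaffold.
hit = {"qstart": 1, "qend": 200, "evalue": 1e-50}
print(is_contaminant(hit, query_length=1000))  # True: 20% coverage
print(is_contaminant(hit, query_length=5000))  # False: 4% coverage
```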

Benchmarking universal single-copy ortholog (BUSCO) analysis (RRID: SCR_015008) [24] showed that 93.1% of BUSCOs were found as complete genes in Assembly 5. We therefore considered that Assembly 5 covered most of the coding regions of the horsegram genome. Sequences shorter than 500 bp were excluded from Assembly 5, and the remaining sequences were designated as MUN_r1.1.

Linkage map and pseudomolecule construction

To construct chromosome-scale genome sequences, a SNP linkage map was created with 214 F₂ progenies. SNPs segregating in the F₂ population were detected by mapping Illumina resequencing reads of eight F₂ individuals onto the assembled genome using Bowtie2 (RRID: SCR_016368) [25], and by calling variants using SAMtools 0.1.19 (RRID: SCR_002105) [26] and vcftools 0.1.12b (RRID: SCR_001235) [27]. Target amplicon sequencing (TAS) was performed to genotype the identified SNPs according to the methods described in Shirasawa et al. [28].

The linkage map was constructed using JoinMap 4 with Kosambi’s mapping function (RRID: SCR_009248) [29]. The assembled genome sequence scaffolds were aligned onto the linkage map for pseudomolecule construction. The female parent of the F₂ progenies was HPK-4. The male parent was initially considered to be HPKM-193, but this assignment was later found to be wrong when the whole genome sequences of HPK-4, HPKM-193, and the eight F₂ progenies were compared. Candidate SNPs segregating in the F₂ progenies were identified by mapping the whole genome Illumina sequences of the eight randomly selected F₂ progenies onto MUN_r1.1.
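
Kosambi’s mapping function converts a recombination frequency r into a map distance d that accounts for crossover interference: d = (1/4) ln[(1 + 2r)/(1 − 2r)] Morgans. The short sketch below implements this formula and its inverse. It illustrates the conversion only, not JoinMap’s map-construction procedure, and the function names are illustrative.

```python
import math


def kosambi_cm(r):
    """Map distance in centimorgans for a recombination frequency r."""
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination frequency must be in [0, 0.5)")
    # d (Morgans) = 0.25 * ln((1 + 2r) / (1 - 2r)); multiply by 100 for cM.
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


def kosambi_r(cm):
    """Inverse function: recombination frequency for a distance in cM."""
    return 0.5 * math.tanh(2.0 * cm / 100.0)


print(round(kosambi_cm(0.1), 2))
print(round(kosambi_r(kosambi_cm(0.1)), 3))
```

Under this function, the recombination frequency threshold of 0.4 listed in Table 1 corresponds to a map distance of about 54.9 cM.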

A total of 2942 SNPs were identified, and 1378 SNPs were successfully genotyped by TAS analysis in 214 F₂ progenies. Of these, 1263 SNPs were mapped onto the ten linkage groups with a total length of 980 cM (Table 3). A total of 219 scaffolds in MUN_r1.1 were then aligned onto the linkage map (Figure 2; Table 3; and in GigaDB [10]). During the alignment, two scaffolds were found to be misassembled and were split. The revised set of scaffolds was designated as MUN_r1.11 (Table 2; Table 4). MUN_r1.11 comprised 3,497 sequences, with a total length of 294.7 Mbp and an N50 length of 2.8 Mbp. The scaffolds aligned on the linkage map were joined with 10,000 Ns between adjacent scaffolds to construct chromosome-scale pseudomolecules. The total length of the ten pseudomolecules was 259.2 Mbp, with an N50 length of 28.2 Mbp (Table 4; Table 5). In terms of the total length of A, G, T, and C bases, the ten pseudomolecules covered 89% of the scaffolds in MUN_r1.11. The ratios of complete BUSCOs identified in MUN_r1.11 and the ten pseudomolecules were 93.1% and 87.4%, respectively. Most of the complete BUSCOs were identified as single copies, suggesting a low level of duplication in the coding regions of the assembled genome.
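
The joining step can be illustrated with a short sketch, assuming that a run of 10,000 Ns is placed between each pair of adjacent scaffolds and that the scaffolds are already ordered and oriented. The function names are hypothetical, and ordering and orientation along the linkage map are outside the scope of this example. The length check uses the MUN_chr01 figures from Table 5.

```python
GAP = "N" * 10_000  # spacer inserted between adjacent anchored scaffolds


def build_pseudomolecule(scaffolds):
    """Join ordered, oriented scaffold sequences with N spacers."""
    return GAP.join(scaffolds)


def pseudomolecule_length(n_scaffolds, scaffold_total_bp, gap_bp=10_000):
    """Expected length: scaffold bases plus one spacer per junction."""
    return scaffold_total_bp + (n_scaffolds - 1) * gap_bp


# MUN_chr01: 29 anchored scaffolds totalling 27,874,654 bp (Table 5).
print(pseudomolecule_length(29, 27_874_654))  # 28,154,654 bp, as in Table 5

# Toy sequences (hypothetical) to show the joining itself.
toy = build_pseudomolecule(["ACGT", "GG", "T"])
print(len(toy))  # 7 bases + 2 spacers of 10,000 Ns = 20,007
```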

Table 3.

Statistics of a SNP linkage map and numbers of anchored scaffolds.

	Linkage map				Number of anchored scaffolds (MUN_r1.1)
	Number of mapped SNPs	Length (cM)	Mean distance between SNPs (cM)	Segregation distortion ratio (%)
Chr1	148	97.2	0.66	4.05	29
Chr2	128	73.1	0.57	74.22	26
Chr3	84	120.3	1.43	3.57	14
Chr4	131	123.1	0.94	14.50	18
Chr5	124	100.8	0.81	3.23	22
Chr6	76	88.9	1.17	14.47	16
Chr7	148	119.1	0.80	12.84	22
Chr8	185	117.1	0.63	2.70	19
Chr9	87	51.7	0.59	81.61	25
Chr10	152	88.8	0.58	3.95	28
Total	1,263	980	0.78		219

Figure 2. — Anchoring the horsegram genome assembly to the genetic linkage map. The linkage groups (left black bars) and 219 anchored MUN_r1.1 scaffolds (right blue bars) with 1263 SNPs. The crossbars on the linkage groups show the positions of mapped SNPs. Blue, aqua, pink, and red colors represent the numbers of mapped SNPs per cM of 1–5, 6–10, 11–15, and ≥16, respectively.

Table 4.

Statistics on the horsegram genome assembly and CDS.

	MUN_r1.11	MUN_r1.11	MUN_r1.1_cds
	Genome/Scaffolds	Genome/Pseudomolecules	CDS
Number of sequences	3,497	10	36,105
Total length (bp)	294,688,765	259,245,825	38,820,013
Average length (bp)	84,269	25,924,583	1,075
Maximum length (bp)	9,844,273	33,386,276	15,732
Minimum length (bp)	500	15,505,026	150
N50 length (bp)	2,818,555	28,154,654	1,488
Total length of AGTC (bp)	286,841,254	255,003,950	-
Gaps (bp)	7,847,511	4,241,875	-
GC%	30.8	30.5	43.8
Repeat %	29.0	-	-
Number of complete genes	-	-	35,508
Number of partial genes	-	-	597

Table 5.

Assembly statistics of MUN_r1.11 pseudomolecules.

	MUN_chr01	MUN_chr02	MUN_chr03	MUN_chr04	MUN_chr05	MUN_chr06	MUN_chr07	MUN_chr08	MUN_chr09	MUN_chr10
Total length of sequences (bp)	28,154,654	24,194,727	23,973,329	31,423,847	25,354,798	18,013,159	28,753,260	33,386,276	15,505,026	30,486,749
A	9,658,539	8,292,913	8,231,116	10,813,335	8,677,934	6,151,137	9,805,582	11,440,621	5,247,870	10,296,716
T	9,689,700	8,302,615	8,197,375	10,734,237	8,667,169	6,161,839	9,816,935	11,432,900	5,212,370	10,323,063
G	4,127,487	3,524,028	3,621,541	4,739,623	3,794,983	2,702,465	4,354,257	5,051,796	2,292,810	4,654,132
C	4,128,504	3,558,607	3,614,862	4,779,430	3,830,211	2,738,375	4,343,032	5,030,174	2,307,484	4,656,183
N	550,424	516,564	308,435	357,222	384,501	259,343	433,454	430,785	444,492	556,655
Total (ATGC)	27,604,230	23,678,163	23,664,894	31,066,625	24,970,297	17,753,816	28,319,806	32,955,491	15,060,534	29,930,094
GC% (ATGC)	29.9	29.9	30.6	30.6	30.5	30.6	30.7	30.6	30.5	31.1
Number of anchored scaffolds	29	26	14	18	22	16	22	19	25	28
Total length of scaffolds	27,874,654	23,944,727	23,843,329	31,253,847	25,144,798	17,863,159	28,543,260	33,206,276	15,265,026	30,216,749
Total length of inserted Ns (N10000)	280,000	250,000	130,000	170,000	210,000	150,000	210,000	180,000	240,000	270,000

Repetitive sequences

Repetitive sequences in the assembled genome were identified by RepeatMasker v3.2.9 (RRID: SCR_012954) [30] for known repetitive sequences registered in Repbase (RRID: SCR_021169) [31], and de novo repetitive sequences were defined by RepeatScout v1.0.5 (RRID: SCR_014653) [32]. A total of 85.4 Mbp of repetitive sequences were identified in the assembled genome, occupying 29% of the total length (Table 6). Of the identified repetitive sequences, those registered in Repbase covered 12.0% of the assembled genome, whereas unique repetitive sequences, i.e., those not registered in Repbase, covered 17.0% (50.2 Mbp) of the assembled genome. Simple sequence repeat (SSR) motifs were identified by MISA mode in SciRoKo software with the default parameters (RRID: SCR_000941) [33]. A total of 74,362 SSRs were identified in MUN_r1.11 with an average frequency of 0.21 SSR per 100 Kbp [10]. The highest SSR frequency, 0.66 SSR per 100 Kbp, was observed in chr06, and this value was three times that in chr03 and chr08 (0.22 SSR per 100 Kbp).

Table 6.

Length and ratio of repetitive sequences.

					MUN_r1.11		MUN_chr01		MUN_chr02		MUN_chr03		MUN_chr04		MUN_chr05		MUN_chr06		MUN_chr07		MUN_chr08		MUN_chr09		MUN_chr10
					Length occupied (bp)	% of whole genome^b	Length occupied (bp)	% of chromosome^b	Length occupied (bp)	% of chromosome^b	Length occupied (bp)	% of chromosome^b	Length occupied (bp)	% of chromosome^b	Length occupied (bp)	% of chromosome^b	Length occupied (bp)	% of chromosome^b	Length occupied (bp)	% of chromosome^b	Length occupied (bp)	% of chromosome^b	Length occupied (bp)	% of chromosome^b	Length occupied (bp)	% of chromosome^b
Known repeats in Repbase	Interspersed repeats	Class I	SINEs		33,194	0.0	1,893	0.0	2,704	0.0	2,330	0.0	3,998	0.0	3,017	0.0	2,162	0.0	5,597	0.0	3,606	0.0	769	0.0	3,640	0.0
			LINEs		758,450	0.3	79,135	0.3	60,158	0.2	78,223	0.3	67,284	0.2	61,531	0.2	35,312	0.2	76,948	0.3	95,328	0.3	32,464	0.2	88,614	0.3
			LTR elements	Total	18,946,454	6.4	2,502,167	8.9	1,905,724	7.9	1,430,860	6.0	1,626,139	5.2	1,474,156	5.8	938,839	5.2	1,864,330	6.5	2,080,923	6.2	1,268,235	8.2	1,796,168	5.9
				Copia	12,243,368	4.2	1,611,136	5.7	1,257,513	5.2	902,900	3.8	1,167,096	3.7	949,455	3.7	658,107	3.7	1,176,391	4.1	1,417,106	4.2	740,132	4.8	1,173,567	3.8
				Gypsy	6,273,336	2.1	820,553	2.9	601,000	2.5	500,201	2.1	417,277	1.3	496,833	2.0	254,838	1.4	654,972	2.3	613,122	1.8	491,823	3.2	594,170	1.9
		Class II	DNA elements		3,100,519	1.1	380,164	1.4	280,403	1.2	227,524	0.9	294,636	0.9	206,575	0.8	121,303	0.7	294,742	1.0	307,336	0.9	198,994	1.3	289,814	1.0
		Unclassified			420	0.0	0	0.0	77	0.0	0	0.0	0	0.0	0	0.0	0	0.0	0	0.0	243	0.0	0	0.0	63	0.0
	Helitrons	182,492	0.1	24,503	0.1	51,573	0.2	15,763	0.1	8,828	0.0	12,915	0.1	16,031	0.1	8,133	0.0	17,183	0.1	8,945	0.1	13,311	0.0
	Low complexity^a)	3,615,499	1.2	222,186	0.8	208,479	0.9	203,104	0.8	238,959	0.8	194,268	0.8	134,128	0.7	269,659	0.9	257,639	0.8	119,470	0.8	256,405	0.8
	Simple repeat	8,038,181	2.7	744,057	2.6	655,110	2.7	581,825	2.4	793,431	2.5	634,665	2.5	493,067	2.7	687,667	2.4	783,434	2.3	367,990	2.4	753,700	2.5
	Unknown	13,270	0.0	366	0.0	993	0.0	642	0.0	1,053	0.0	520	0.0	1,094	0.0	1,491	0.0	884	0.0	859	0.0	2,820	0.0
	Subtotal	35,247,906	12.0	4,044,950	14.4	3,229,413	13.3	2,581,448	10.8	3,088,969	9.8	2,627,752	10.4	1,775,449	9.9	3,250,692	11.3	3,614,610	10.8	2,038,420	13.1	3,241,129	10.6
Unique repeats	Unknown	49,554,792	16.8	6,234,995	22.1	4,672,589	19.3	3,435,357	14.3	4,094,114	13.0	3,347,931	13.2	2,243,739	12.5	4,226,300	14.7	5,021,333	15.0	3,433,883	22.1	5,105,098	16.7
	Simple repeat	642,225	0.2	67,981	0.2	58,761	0.2	47,815	0.2	66,279	0.2	51,673	0.2	40,270	0.2	61,328	0.2	65,386	0.2	31,004	0.2	61,449	0.2
	Subtotal	50,197,017	17.0	6,302,976	22.4	4,731,350	19.6	3,483,172	14.5	4,160,393	13.2	3,399,604	13.4	2,284,009	12.7	4,287,628	14.9	5,086,719	15.2	3,464,887	22.3	5,166,547	16.9
Total					85,444,923	29.0	10,347,926	36.8	7,960,763	32.9	6,064,620	25.3	7,249,362	23.1	6,027,356	23.8	4,059,458	22.5	7,538,320	26.2	8,701,329	26.1	5,503,307	35.5	8,407,676	27.6

^a Primarily poly-purine/poly-pyrimidine stretches, or regions of extremely high AT or GC content. Stretches of DNA (100 bp) were masked when they were >87% AT or >89% GC, and 30 bp stretches were masked when they contained 29 A/T (or GC) nucleotides.
^b N bases were excluded from the calculation.

Transcript sequencing, gene prediction, and annotation

Total RNA of HPK-4 was extracted from seedlings, leaves, roots, flowers, and young pods using the RNeasy Plant Mini Kit (QIAGEN). RNA libraries were constructed using a TruSeq Stranded mRNA HT Sample Prep Kit (Illumina). Library sequencing was performed on an Illumina HiSeq system with a read length of 93 nt. A total of 485 million Illumina transcript reads were obtained from the five tissues (Figure 3) [10], and the reads were assembled by Trinity [34].

Figure 3. — Plant materials used for transcript sequences. The seedlings (A), leaves (B), roots (C), flowers (D), and young pods (E) of HPK-4 used for Illumina transcript sequencing.

Ab initio gene prediction was performed by BRAKER1 v1.9 (RRID: SCR_018964) [35] with the obtained transcript sequences. Transposable elements (TEs) were detected by BLASTP searches against the NCBI NR protein database [36] with an E-value cutoff of 1 × 10⁻¹⁰. Domain search was performed by InterProScan against the InterPro database with an E-value cutoff of 1.0 (RRID: SCR_005829) [37].

A total of 46,095 gene sequences were predicted on the assembled genome with a total length of 48.3 Mbp (Table 7). After removal of TEs and both pseudo and short gene sequences, 36,105 gene sequences remained, and this set of sequences was designated as MUN_r1.1_cds (Table 4). The ratio of complete BUSCOs identified on MUN_r1.1_cds was 91.2%. Of the 36,105 sequences, 35,508 were classified as complete genes and 597 as partial. The coding sequences (CDSs) were further tagged with “f” (full similarity), “p” (partial similarity), and “d” (domain) according to the similarity level against the non-redundant database (f: E-values ≤1 × 10⁻²⁰ and identity ≥70%; p: E-values ≤1 × 10⁻²⁰ and identity <70%) and the InterPro database (d: E-values ≤1.0; Table 8). Of the 36,105 sequences, 21,471 (59.4%) were tagged with “f” and 6,692 (18.5%) with “p”. The number of gene sequences tagged with “d” was 24,575 (68.1%).
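
The tagging rules can be written down compactly. The sketch below applies the thresholds stated above (E-value ≤1 × 10⁻²⁰ with identity ≥70% for “f”, identity <70% for “p”, and InterPro E-value ≤1.0 for “d”). The function names and the use of `None` for a missing hit are assumptions made for illustration, and the authors’ annotation pipeline is not described at this level of detail.

```python
def similarity_tag(evalue, identity):
    """Return 'f', 'p', or '-' from the best BLASTP hit against NR.

    evalue/identity are None when the CDS has no hit; identity is in percent.
    """
    if evalue is None or evalue > 1e-20:
        return "-"
    return "f" if identity >= 70.0 else "p"


def domain_tag(evalue):
    """Return 'd' when an InterPro domain hit passes the E-value cutoff."""
    return "d" if evalue is not None and evalue <= 1.0 else "-"


# Hypothetical CDS records: (BLASTP E-value, identity %, InterPro E-value).
records = [(1e-80, 92.0, 1e-5), (1e-30, 55.0, None), (None, None, 0.5)]
for blast_e, ident, ipr_e in records:
    print(similarity_tag(blast_e, ident), domain_tag(ipr_e))
```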

Table 7.

Statistics of candidate genes predicted by BRAKER1 v1.9.

	All predicted genes	MUN_r1.1_cds
		Exclude TE, pseudo and short genes
Number of sequences	46,095	36,105
Total length (bp)	48,277,179	38,820,013
Average length (bp)	1,047	1,075
Max length (bp)	15,732	15,732
Min length (bp)	60	150
N50 length (bp)	1,443	1,488
GC%	43.3	43.8

Table 8.

Number of CDSs showing significant similarity by BLASTP and domain searches against NCBI NR and InterPro.

Number of CDSs	% to Total		Classification	Tag		All predicted genes	MUN_r1.1_cds
	All predicted genes	MUN_r1.1_cds		Similarity against NR	Domain
19,874	43	55	Complete	f	d	Included	Included
1554	3	4	Complete	f	-	Included	Included
3574	8	10	Complete	p	d	Included	Included
2996	6	8	Complete	p	-	Included	Included
1052	2	3	Complete	-	d	Included	Included
6458	14	18	Complete	-	-	Included	Included
26	0	0	Partial	f	d	Included	Included
17	0	0	Partial	f	-	Included	Included
49	0	0	Partial	p	d	Included	Included
73	0	0	Partial	p	-	Included	Included
124	0	0	Partial	-	d	Included	Included
308	1	1	Partial	-	-	Included	Included
126	0.3	-	Pseudo			Included	Not included
107	0.2	-	Short			Included	Not included
9757	21.2	-	TE			Included	Not included

BLASTP search against the NCBI NR database:
f: E-value ≤1 × 10⁻²⁰ and similarity ≥70%
p: E-value ≤1 × 10⁻²⁰ and similarity <70%
InterProScan against the InterPro database:
d: E-value ≤1.0

Transfer RNA genes were predicted using tRNAscan-SE ver. 1.23 with the default parameters [38], and their numbers were compared with those in the genomes of Phaseolus vulgaris (Pvulgaris_218_v1.0) [39], Vigna angularis (Vangularis_v1.a1) [40], Lotus japonicus (Lj3.0) [41], and A. thaliana (Araport11) [42]. The total number of putative tRNA genes in the assembled genome (MUN_r1.11) was 690, similar to the numbers for the genomes of P. vulgaris (681), V. angularis (667), and A. thaliana (699; Table 9). rRNA genes were predicted by BLAST searches (E-value cutoff of 1 × 10⁻¹⁰) with query sequences of A. thaliana 5.8S and 25S rRNAs (X52320.1) and 18S rRNA (X16077.1). The total number of putative rRNA genes identified in the genome was 139, identical to the number in the P. vulgaris genome.

Table 9.

Numbers of putative tRNA and rRNA encoding genes identified in MUN_r1.1 and other legume species.

tRNA
Encode	M. uniflorum (MUN_r1.11)	P. vulgaris (Pvulgaris_218_v1.0)	V. angularis (Vangularis_v1.genome)	L. japonicus (Lj3.0_pseudomol)	A. thaliana (TAIR10_genome)
Ala	40	41	44	40	33
Arg	34	43	52	54	39
Asn	17	22	30	28	19
Asp	27	30	32	32	28
Cys	12	16	20	76	17
Gln	21	21	24	23	19
Glu	31	28	32	40	27
Gly	46	51	44	48	43
His	13	15	14	19	12
Ile	101	30	32	29	25
Leu	49	56	53	57	45
Lys	34	43	35	41	33
Met	38	39	43	48	31
Phe	21	28	20	22	17
Pro	43	48	36	46	68
Ser	50	50	51	44	72
Thr	26	28	24	36	26
Trp	15	16	17	21	16
Tyr	15	17	18	22	83
Val	36	41	37	34	32
Subtotal	669	663	658	760	685
Subtotal (%)	97.0	97.4	98.7	88.6	98.0
Pseudo	19	12	7	80	13
SeC	0	1	0	12	0
Sup	0	1	0	0	0
Undet	2	4	2	6	1
Total	690	681	667	858	699
rRNA
Encode gene	M. uniflorum (MUN_r1.1)	P. vulgaris (Pvulgaris_218_v1.0)	V. angularis (Vangularis_v1.genome)	L. japonicus (Lj3.0_pseudomol)	A. thaliana (TAIR10_genome)
18S	40	48	224	27	4
25S	87	83	421	64	5
5.8S	12	8	139	6	2
Total	139	139	784	97	11

^a Pseudo, SeC, Sup, and Undet represent pseudogenes, selenocysteine tRNAs, possible suppressor tRNAs, and tRNAs with undetermined/unknown isotypes, respectively.
^b Accession numbers used for identification of 5.8S, 18S, and 25S rRNAs were X52320.1, X16077.1, and X52320.1, respectively.

Diversity analysis in genetic resources

Only two species in the genus Macrotyloma, i.e., horsegram and M. geocarpum, are used as crops. Horsegram domestication is speculated to have occurred twice in India: once in northwestern India at 4000 years before present, and once on the Indian Peninsula at 3500 years before present [43]. In addition, horsegram has narrow genetic diversity, as revealed by molecular analysis [44]. The genetic diversity of 91 cultivated horsegram accessions and one accession of M. axillare, a wild relative of horsegram that is maintained at CSK-HPAU, was investigated by dd-RAD-Seq analysis [10]. Library construction and variant calling were performed according to Shirasawa et al. [45]. The dd-RAD-Seq reads were generated by an Illumina HiSeq 2000 system with a read length of 93 nt and mapped onto the assembled genome sequences. Two accessions, IC139449 and IC547543, were excluded from further analysis because of the small number of reads obtained. The ratio of reads mapped onto the genome (MUN_r1.11) ranged from 80% to 90% in most of the accessions (Figure 4). However, M. axillare and one horsegram accession (IC313367) showed low mapping ratios of 17% and 55%, respectively. M. axillare was excluded from further analysis because of its low mapping ratio.

Figure 4. — Mapped ratios of the dd-RAD-Seq reads of 92 accessions.

A total of 277 SNPs were identified in the remaining 89 accessions across the genome [10]. The Jaccard similarity coefficients of the 277 SNPs were calculated using GGT 2.0 [46], and a neighbor-joining (NJ) phylogenetic tree was constructed using MEGA ver 10.1.8 (RRID: SCR_000667) [47]. The NJ tree classified the 89 accessions into two clusters (Figure 5). Cluster 1 included varieties bred in the CSK-HPAU, which are prefixed with “HPK”. Most of the HPK varieties showed very close genetic relations and formed a subcluster (HPK cluster); the single exception was HPK-4.
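
For reference, the Jaccard similarity coefficient between two accessions is the number of shared features divided by the number of features present in either accession. The sketch below is an illustration under stated assumptions: each accession is represented as the set of SNP loci at which it carries the non-reference allele, and the accession names are hypothetical. How GGT 2.0 encodes genotypes internally is not described in the text.

```python
def jaccard_similarity(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| for two collections of loci."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention: two empty profiles are treated as identical
    return len(a & b) / len(union)


def distance_matrix(profiles):
    """Pairwise distances (1 - similarity), e.g. as input for NJ clustering."""
    names = sorted(profiles)
    return {
        (x, y): 1.0 - jaccard_similarity(profiles[x], profiles[y])
        for x in names
        for y in names
    }


# Hypothetical accessions and SNP loci carrying the alternative allele.
profiles = {"acc1": {1, 2, 3}, "acc2": {2, 3, 4}, "acc3": {1, 2, 3}}
print(jaccard_similarity(profiles["acc1"], profiles["acc2"]))  # 0.5
print(distance_matrix(profiles)[("acc1", "acc3")])  # 0.0
```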

Figure 5. — A phylogenetic tree of the 89 horsegram accessions based on 277 SNPs. HPK-4, used in the reference genome construction, and HPKM-193, for which whole genome sequences were also obtained, are circled with blue and green lines, respectively.

Whole genome structure in horsegram

Figure 6 shows a graphical view of the horsegram genome structure drawn by Circos (RRID: SCR_011798) [48]. Repetitive sequences were frequently observed in the midsection of each chromosome, and the tendency was more pronounced in horsegram-specific sequences (Figure 6A). The ratio of repetitive sequences commonly observed in all five species was quite low, suggesting that repetitive sequences are more species-specific than gene sequences. The gene sequences shared between horsegram and the other compared species tended to be distributed toward the two end regions of the chromosomes (Figure 6B). On the other hand, horsegram-specific gene sequences were distributed more uniformly across the genome, suggesting the unique structure of the horsegram genome.

Copy number variations (CNVs) of one horsegram accession, HPKM-193, were detected against the HPK-4 genome (Figure 6C) based on the whole genome sequence reads of HPKM-193 using CNV-Seq (RRID: SCR_013357) [49] with a 1-Mbp window. CNVs with a negative log2 ratio were observed particularly on chr09 and chr02.

Of the 277 SNPs identified among the 89 horsegram accessions, 255 were located on the genome sequences of the 10 chromosomes (Figure 6D). In each chromosome, SNPs were mostly identified in the regions where putative genes common to horsegram and the other compared species were located, particularly for chr04, chr07, chr08, and chr10. The differing trends in variant distribution are thought to reflect varying degrees of selection pressure in the horsegram germplasm resources in Himachal Pradesh.

SNP density mapped on the linkage map is illustrated in Figure 6E. As in the case of the CNVs, distribution bias was observed in the SNPs of HPKM-193; however, the pattern of this bias differed from that of the CNVs. A higher SNP density was observed in the midsection of most chromosomes. Chr06 showed less variation than the other chromosomes.

Genes related to drought tolerance

Horsegram is considered one of the most drought-tolerant legume crop species. Personal investigation showed that plants can survive for more than 20 days without water under controlled conditions. A study by Bhardwaj et al. [50] described a transcriptome analysis of eight shoot and root tissues of a drought-sensitive (M-191) genotype and a drought-tolerant (M-249) genotype of horsegram under control and drought stress conditions. This study identified some important genic regions responsible for drought tolerance.

To identify candidate genes related to drought tolerance in the horsegram genome, a BLASTP search of the 36,105 putative genes was performed against amino acid sequences of A. thaliana (Araport11), and the genes with hits were then used in BLAST searches against DroughtDB [51], the NCBI NR protein database, and the Plant Stress Gene Database (PSGD) [52]. A total of 158 horsegram genes showed significant similarity to the 78 genes in DroughtDB [10]. The most frequently hit gene was ABCG40, which encodes a protein that functions as an ABC transporter, and showed significant similarity to 14 horsegram genes. OST1/SRK2E and AtrbohF were also frequently identified, with hits to seven and six horsegram genes, respectively. Of the 158 genes, 93 showed the same domain sequences as the corresponding A. thaliana genes, and 52 were similar to genes registered in the PSGD. These genes are therefore more likely to be candidate genes related to drought tolerance.
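
The tally of horsegram genes per DroughtDB gene described above can be sketched as follows. The gene identifiers and the hit list are hypothetical, and the similarity filter that would produce such a list (for example, an E-value cutoff) is not specified here and is left outside the sketch.

```python
from collections import Counter


def hits_per_reference_gene(hit_pairs):
    """Count distinct query genes assigned to each reference gene."""
    distinct = set(hit_pairs)  # ignore repeated query/reference pairs
    return Counter(ref for _query, ref in distinct)


# Hypothetical (horsegram gene ID, DroughtDB gene name) pairs that are
# assumed to have already passed a similarity filter.
hits = [
    ("gene_0001", "ABCG40"),
    ("gene_0002", "ABCG40"),
    ("gene_0002", "ABCG40"),  # duplicate record of the same pair
    ("gene_0003", "OST1/SRK2E"),
    ("gene_0004", "AtrbohF"),
    ("gene_0005", "ABCG40"),
]

counts = hits_per_reference_gene(hits)
for name, n in counts.most_common():
    print(name, n)
```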

Comparative and phylogenetic analyses with other legume species

Horsegram belongs to the subtribe Phaseolinae in the millettioid clade, along with P. vulgaris and V. angularis. The genome structure of horsegram was compared with those of P. vulgaris (Pvulgaris_218_v1.0), V. angularis (Vangularis_v1.a1), and L. japonicus (Lj_r3.0).

The predicted gene sequences in MUN_r1.1_cds were clustered with those of other plant species (P. vulgaris, V. angularis, L. japonicus, and A. thaliana) for comparison at the protein sequence level. A total of 73,457 clusters were generated using the program CD-HIT (RRID: SCR_007105) [53] (Table 10). Of the 36,105 putative gene sequences, 21,369 (59.2%) genes were clustered with other plant species and 14,736 (40.8%) were considered horsegram-specific genes (Figure 7). A total of 3,738 (10.4%) horsegram gene sequences were clustered with 3,864 P. vulgaris and 3,713 V. angularis genes, which were considered millettioid-specific genes. Common genes in legumes were identified for 6,550 (18.1%) horsegram gene sequences, based on clusters with P. vulgaris, V. angularis, and L. japonicus.
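
The percentages in this paragraph follow directly from the horsegram gene counts reported in Table 10. The short sketch below recomputes them; the variable and function names are illustrative only.

```python
TOTAL_HORSEGRAM_GENES = 36_105

# Horsegram gene counts taken from Table 10.
horsegram_specific = 14_736      # clustered with horsegram genes only
millettioid_common = 3_738       # Horsegram + Pv + Va
legume_common = 6_550            # Horsegram + Pv + Va + Lj


def share(count, total=TOTAL_HORSEGRAM_GENES):
    """Percentage of all horsegram genes, rounded to one decimal place."""
    return round(100 * count / total, 1)


clustered_with_others = TOTAL_HORSEGRAM_GENES - horsegram_specific

print("clustered with other species:", clustered_with_others, share(clustered_with_others))
print("horsegram-specific:", horsegram_specific, share(horsegram_specific))
print("millettioid-specific:", millettioid_common, share(millettioid_common))
print("common in legumes:", legume_common, share(legume_common))
```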

Table 10.

Number of gene clusters in horsegram and the four plant species, P. vulgaris (Pvulgaris_218_v1.0), V. angularis (Vangularis_v1.a1), L. japonicus (Lj_r3.0), and A. thaliana (Araport11).

Clustered species	Number of clustered species	Number of clusters	Number of clustered genes
			Horsegram	P. vulgaris (Pv)	V. angularis (Va)	L. japonicus (Lj)	A. thaliana (At)
Horsegram	1	9,578	14,736	0	0	0	0
P. vulgaris (Pv)	1	3,449	0	4,114	0	0	0
V. angularis (Va)	1	8,306	0	0	9,545	0	0
L. japonicus (Lj)	1	17,544	0	0	0	21,677	0
A. thaliana (At)	1	15,177	0	0	0	0	18,129
Horsegram + Pv	2	897	1,003	1,028	0	0	0
Horsegram + Va	2	471	531	0	530	0	0
Horsegram + Lj	2	362	391	0	0	484	0
Horsegram + At	2	111	116	0	0	0	147
Pv + Va	2	1,031	0	1,248	1,159	0	0
Pv + Lj	2	262	0	293	0	333	0
Pv + At	2	88	0	97	0	0	130
Va + Lj	2	290	0	0	317	382	0
Va + At	2	92	0	0	95	0	113
Lj + At	2	326	0	0	0	398	415
Horsegram + Pv + Va (Common in Millettioids)	3	3,189	3,738	3,864	3,713	0	0
Horsegram + Pv + Lj	3	458	519	521	0	594	0
Horsegram + Pv + At	3	87	89	92	0	0	110
Horsegram + Va + Lj	3	230	272	0	255	292	0
Horsegram + Va + At	3	49	53	0	54	0	58
Horsegram + Lj + At	3	56	59	0	0	72	74
Pv + Va + Lj	3	581	0	649	664	725	0
Pv + Va + At	3	94	0	101	99	0	118
Pv + Lj + At	3	66	0	79	0	79	89
Va + Lj + At	3	41	0	0	45	51	53
Horsegram + Pv + Va + Lj (common in legumes)	4	5,011	6,550	6,591	6,472	6,847	0
Horsegram + Pv + Va + At	4	670	853	857	845	0	873
Horsegram + Pv + Lj + At	4	173	201	196	0	222	208
Horsegram + Va + Lj + At	4	76	82	0	83	102	98
Pv + Va + Lj + At	4	302	0	370	365	420	385
Common in all	5	4,390	6,912	7,097	7,000	7,056	6,638
Number of genes			36,105	27,197	31,241	39,734	27,621
Number of clustered genes			36,105	27,197	31,241	39,734	27,638
Number of non-clustered genes			0	0	0	0	17

Figure 7. — Ratios of genes of horsegram (MUN_r1.1_cds) clustered with those of four other plant species. Pv, Va, Lj, and At represent genes of *P. vulgaris* (Pvulgaris_218_v1.0), *V. angularis* (Vangularis_v1.a1), *L. japonicus* (Lj_r3.0), and *A. thaliana* (Araport11), respectively.

Functional analysis was performed for MUN_r1.1_cds by classifying the 36,105 putative genes using the Gene Ontology (GO) and euKaryotic clusters of Orthologous Groups (KOG) databases [54]. A total of 24,699 (68.4%) putative genes were annotated with GO categories, including 9,086 (25.2%) genes involved in biological processes, 4,127 (11.4%) genes coding for cellular components, and 13,977 (38.7%) genes associated with molecular functions (Figure 8). The ratio of annotated horsegram genes was smaller than those of the other species. The species whose ratio of classified GO categories was most similar to that of horsegram was L. japonicus. A total of 18,630 (51.6%) putative genes showed significant similarity to genes in the KOG database (Figure 9). As in the results for GO, the ratio of hit genes was lower than for the other four species.

Figure 8. — Comparison of genes annotated by the GO database in horsegram (MUN_r1.1_cds), *P. vulgaris* (Pvulgaris_218_v1.0), *V. angularis* (Vangularis_v1.a1), *L. japonicus* (Lj_r3.0), and *A. thaliana* (Araport11). (A) Numbers and ratios of genes annotated by GO database. (B) Ratios of the classified GO categories in the predicted genes.

Figure 9. — Comparison of genes annotated by the KOG database in horsegram (MUN_r1.1_cds), *P. vulgaris* (Pvulgaris_218_v1.0), *V. angularis* (Vangularis_v1.a1), *L. japonicus* (Lj_r3.0), and *A. thaliana* (Araport11). (A) Numbers and ratios of genes annotated by the KOG database. (B) Ratio of the classified KOG categories in hit genes.

Clear syntenic relationships were observed with a warm-season legume, V. angularis, and one-to-one relationships were observed between horsegram chr02 (Mun_chr02) and V. angularis chr09 (Va_chr09), Mun_chr04 and Va_chr02, Mun_chr06 and Va_chr10, Mun_chr07 and Va_chr08, Mun_chr08 and Va_chr04, and Mun_chr09 and Va_chr05 (Figure 10A). The syntenic relations with P. vulgaris were slightly more complex than those with V. angularis, and those with the cool-season legume L. japonicus were more fragmented.

Figure 10. — Comparative and phylogenetic analyses with other legume species. (A) Graphical view of syntenic relationships between horsegram and *P. vulgaris* (left), *V. angularis* (middle), and *L. japonicus* (right). Pink and blue dots show homologous sequences of MUN_r1.11 with forward and reverse directions against the reference sequences. (B) Distribution of Ks values of orthologous gene pairs in horsegram (Mu) and the four plant species: *P. vulgaris* (Pv), *V. angularis* (Va), *L. japonicus* (Lj), and *A. thaliana* (At). (C) Phylogenetic tree of 4,154 common single-copy genes of horsegram and the six species: *P. vulgaris*, *V. angularis*, *G. max*, *L. japonicus*, *M. truncatula*, and *A. thaliana*.

Synonymous substitutions per site (Ks) were estimated with KaKs_Calculator [55] for gene pairs in each combination of horsegram, P. vulgaris, V. angularis, L. japonicus, and A. thaliana (Araport11), using the genes clustered by the CD-HIT program (Figure 10B). The similar Ks distributions of horsegram, P. vulgaris, and V. angularis indicated close relations among the three species. The ratios of gene pairs showing Ks values less than 0.1% were 21.1% between horsegram and P. vulgaris and 8.4% between horsegram and V. angularis, suggesting that there was a closer relationship between horsegram and P. vulgaris at the gene level.
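
The ratio of gene pairs below a Ks cutoff, as compared above, is a simple proportion. The sketch below assumes that per-pair Ks estimates are already available (for example, parsed from the output of a Ks estimation tool); the Ks values and the cutoff passed in the example are hypothetical and chosen only for illustration.

```python
def fraction_below(ks_values, cutoff):
    """Fraction of gene pairs whose Ks estimate is strictly below the cutoff."""
    if not ks_values:
        raise ValueError("no Ks values supplied")
    return sum(1 for ks in ks_values if ks < cutoff) / len(ks_values)


# Hypothetical Ks estimates for five gene pairs of one species combination.
ks_pairs = [0.02, 0.08, 0.15, 0.30, 0.95]

low_ks_fraction = fraction_below(ks_pairs, cutoff=0.1)  # illustrative cutoff
print(f"{100 * low_ks_fraction:.1f}% of pairs fall below the cutoff")
```

A larger fraction of low-Ks pairs between two species is read in the text as a sign of a closer relationship at the gene level.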

The phylogenetic analysis was performed with Medicago truncatula (r5.0) [56] and Glycine max (Glma4 [57]) in addition to P. vulgaris, V. angularis, L. japonicus, and A. thaliana. A total of 978 common single-copy genes were identified for horsegram and the six species by clustering the genes using OrthoFinder (RRID: SCR_017118) [58]. Multiple alignment was performed for the 978 single-copy genes using MUSCLE (RRID: SCR_011812) [59], and gaps were removed with Gblocks [60]. An NJ tree was created with the 4154 single-copy orthologous genes identified in the four legume species using MEGA 7.0.9 beta (RRID: SCR_000667) [61] and TIMETREE (RRID: SCR_021162) [62]. A. thaliana was used as the outgroup (Figure 10C). Assuming a divergence time of 53 million years ago between M. truncatula and G. max, horsegram was estimated to have diverged from P. vulgaris and V. angularis 20.75 million years ago (Figure 10C). Among the four millettioid legume species, P. vulgaris and V. angularis were more closely related to each other than to horsegram, and horsegram was closer to P. vulgaris and V. angularis than to G. max. These results are consistent with a previous study based on a comparison of eight chloroplast regions [63].

Data validation and quality control

The quality of the assembled genome and gene sequences was assessed using 1375 Embryophyta BUSCOs (v3.0, odb10; RRID: SCR_015008) [24]. A total of 1340 (93.1%) complete BUSCOs were identified in the assembled scaffolds (MUN_r1.11), while 1259 (87.4%) and 1313 (91.2%) were identified in the pseudomolecules and CDS sequences, respectively (Table 11).

Table 11.

Statistics of the horsegram genome assembly and CDS.

	MUN_r1.11	MUN_r1.11	MUN_r1.1_cds
	Genome/Scaffolds	Genome/Pseudomolecules	CDS
BUSCOs
Complete	1340 (93.1%)	1259 (87.4%)	1313 (91.2%)
Complete single-copy	1252 (86.9%)	1181 (82.0%)	1208 (83.9%)
Complete duplicated	88 (6.1%)	78 (5.4%)	105 (7.3%)
Fragmented	26 (1.8%)	31 (2.2%)	23 (1.6%)
Missing	74 (5.1%)	150 (10.4%)	104 (7.2%)
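
The percentages in Table 11 can be recomputed from the counts. The sketch below takes the denominator for each column as the sum of the Complete, Fragmented, and Missing counts, which is an assumption about how the percentages were derived; the dictionary keys and function name are illustrative.

```python
# BUSCO counts copied from Table 11.
BUSCO_COUNTS = {
    "scaffolds": {"complete": 1340, "single_copy": 1252, "duplicated": 88,
                  "fragmented": 26, "missing": 74},
    "pseudomolecules": {"complete": 1259, "single_copy": 1181, "duplicated": 78,
                        "fragmented": 31, "missing": 150},
    "cds": {"complete": 1313, "single_copy": 1208, "duplicated": 105,
            "fragmented": 23, "missing": 104},
}


def summarize(counts):
    """Return the column total and each category as a percentage of it."""
    # Complete BUSCOs are the sum of single-copy and duplicated ones.
    if counts["single_copy"] + counts["duplicated"] != counts["complete"]:
        raise ValueError("single-copy + duplicated does not equal complete")
    total = counts["complete"] + counts["fragmented"] + counts["missing"]
    percentages = {k: round(100 * v / total, 1) for k, v in counts.items()}
    return total, percentages


summary = {name: summarize(counts) for name, counts in BUSCO_COUNTS.items()}
for name, (total, pct) in summary.items():
    print(name, total, pct)
```

With the counts in Table 11, each column sums to 1,440, and the recomputed values match the percentages listed in the table.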

Reuse potential

In this study, we have provided a first draft genome assembly of the horsegram cultivar HPK-4 and investigated features of the horsegram genome and gene sequences as well as the genetic diversity of the accessions. This information will help to establish an efficient breeding program for horsegram by integrating conventional breeding with marker-based biotechnological tools. Finally, the genomic information revealed in this study can be applied to the improvement of other underutilized food legumes.

Funding Statement

This work was supported by the Bilateral Joint Research Projects from the Japan Society for the Promotion of Science and the Department of Science and Technology of the Government of India, and by funds from the Kazusa DNA Research Institute Foundation.

Data availability

The genome assembly data, annotations, and gene models are available at the Horsegram Database [64]. The obtained genome sequence reads are available from the DNA Data Bank of Japan (DDBJ) Sequence Read Archive (DRA) under the BioProject accession number PRJDB5374.

Data sets supporting the results of this article are available in GigaScience Database [10].

Declarations

List of abbreviations

BUSCO: benchmarking universal single-copy ortholog; CDS: coding sequence; CNV: copy number variation; CSK-HPAU: CSK Himachal Pradesh Agricultural University; GO: Gene Ontology; Kbp: kilobase pairs; KOG: euKaryotic clusters of Orthologous Groups; Mbp: megabase pairs; MP: mate-pair; NCBI: National Center for Biotechnology Information; NJ: neighbor-joining; PE: paired-end; SLR: synthetic long read; SNP: single nucleotide polymorphism; SSR: simple sequence repeat; TAS: target amplicon sequencing; TE: transposable element.

Ethical approval

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Author contributions

The project was designed by S.I., R.C., and T.S. K.S. contributed to the genome sequencing, linkage map, and pseudomolecule construction. R.C. contributed to the creation of the mapping population. H.H. contributed to the genome assembly and gene prediction, annotation, and phylogenetic analysis. S.N. contributed to the transcript sequencing. H.N. contributed to the pseudomolecule construction. S.I. contributed to the entire process of data analysis.

Acknowledgements

We thank A. Watanabe, S. Nakayama, Y. Kishida, M. Kohara, H. Tsuruoka, C. Minami, and S. Sasamoto (KDRI) for their technical assistance. We also thank the National Bureau of Plant Genetic Resources (NBPGR), New Delhi, for providing the germplasm lines for diversity analysis.

References

1. Gillett J, Polhill R, Verdcourt B. Flora of tropical East Africa. Leguminosae (Part 3) Subfamily Papilionoideae (1); Leguminosae (Part 4) Subfamily Papilionoideae (2). London: Crown Copyright, 1971.
2. Arora R, Chandel K. Botanical source areas of wild herbage legumes in India. Trop Grasslands, 1972; 6: 213–221.
3. Chahota RK, Sharma TR, Sharma SK, Kumar N, Rana JC. 12 - Horsegram. In: Singh M, Upadhyaya HD, Bisht IS (eds), Genetic and genomic resources of grain legume improvement. Oxford: Elsevier, 2013; pp. 293–305.
4. Patel DP, Dabas BS, Sapra RS, Mandal S. Evaluation of horsegram (Macrotyloma uniflorum) (Lam.) germplasm. New Delhi: National Bureau of Plant Genetic Resources Publication, 1995.
5. Gopalan C, Ramashastri BV, Balasubramanyan SC. Nutritive value of Indian foods. Hyderabad: National Institute of Nutrition, ICMR, Offset Press, 1999; p. 156.
6. Reddy B, Brijitha N, Raghavender C. Aflatoxin contamination in insect damaged seeds of horsegram under storage. Mycotoxin Res., 2005; 21: 187–191.
7. Gautam M, Katoch S, Chahota RK. Comprehensive nutritional profiling and activity directed identification of lead antioxidant, antilithiatic agent from Macrotyloma uniflorum (Lam.) Verdc. Food Res. Int., 2020; 137: 109600.
8. Gautam M, Datt N, Chahota RK. Assessment of calcium oxalate crystal inhibition potential, antioxidant activity and amino acid profiling in horse gram (Macrotyloma uniflorum): high altitude farmer’s varieties. 3 Biotech., 2020; 10: 402.
9. National Research Council. Tropical legumes: resources for the future. Washington, DC: National Academies Press, 1979; 10.17226/19836.
10. Shirasawa K, Chahota RK, Hirakawa H, Nagano S, Nagasaki H, Sharma TR, Isobe SN. Supporting data for “A Chromosome-scale draft genome sequence of horsegram (Macrotyloma uniflorum)”. GigaScience Database. 2021; 10.5524/100932.
11. Marcais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 2011; 27: 764–770.
12. Luo R, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience, 2012; 1: 18.
13. Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics, 2011; 27: 578–579.
14. Boetzer M, Pirovano W. Toward almost closed genomes with GapFiller. Genome Biol., 2012; 13: 1–9.
15. Kajitani R, et al. Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Genome Res., 2014; 24: 1384–1395.
16. Zimin AV, Marçais G, Puiu D, Roberts M, Salzberg SL, Yorke JA. The MaSuRCA genome assembler. Bioinformatics, 2013; 29: 2669–2677.
17. Bankevich A, Pevzner PA. TruSPAdes: barcode assembly of TruSeq synthetic long reads. Nat. Methods, 2016; 13: 248–250.
18. Kosugi S, Hirakawa H, Tabata S. GMcloser: closing gaps in assemblies accurately with a likelihood-based selection of contig or long-read alignments. Bioinformatics, 2015; 31: 3733–3741.
19. University of California Santa Cruz Genomics Institute. UCSC Genome Browser on Human February 2009 (GRCh37/hg19) Assembly. https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13/. Accessed 13 April 2016.
20. National Center for Biotechnology Information. https://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/. Accessed 13 April 2016.
21. National Center for Biotechnology Information. https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/. Accessed 13 April 2016.
22. UniVec. Bethesda, MD: National Center for Biotechnology Information. 2016; http://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/. Accessed 13 April 2016.
23. Air GM, Els MC, Brown LE, Laver WG, Webster RG. Location of antigenic sites on the three-dimensional structure of the influenza N2 virus neuraminidase. Virology, 1985; 145: 237–248.
24. Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics, 2015; 31: 3210–3212.
25. Langmead B, Wilks C, Antonescu V, Charles R. Scaling read aligners to hundreds of threads on general-purpose processors. Bioinformatics, 2019; 35: 421–432.
26. Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H. Twelve years of SAMtools and BCFtools. GigaScience, 2021; 10(2): giab008. 10.1093/gigascience/giab008.
27. Danecek P, et al. The variant call format and VCFtools. Bioinformatics, 2011; 27: 2156–2158.
28. Shirasawa K, Kuwata C, Watanabe M, Fukami M, Hirakawa H, Isobe S. Target amplicon sequencing for genotyping genome-wide single nucleotide polymorphisms identified by whole-genome resequencing in peanut. Plant Genome, 2016; 9: 10.3835/plantgenome2016.06.0052.
29. Van Ooijen J. JoinMap® 4, Software for the calculation of genetic linkage maps in experimental populations. Wageningen: Kyazma BV, 2006; p. 33.
30. Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 2013–2015, http://www.repeatmasker.org.
31. Bao W, Kojima KK, Kohany O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA, 2015; 6: 11.
32. Price AL, Jones NC, Pevzner PA. De novo identification of repeat families in large genomes. Bioinformatics, 2005; 21(Suppl 1): i351–i358.
33. Kofler R, Schlötterer C, Lelley T. SciRoKo: a new tool for whole genome microsatellite search and investigation. Bioinformatics, 2007; 23: 1683–1685.
34. Henschel R, Lieber M, Wu LS, Nista PM, Haas BJ, LeDuc RD. Trinity RNA-Seq assembler performance optimization. In: Proceedings of the 1st conference of the extreme science and engineering discovery environment: bridging from the eXtreme to the campus and beyond. 2012; pp. 1–8.
35. Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M. BRAKER1: unsupervised RNA-seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics, 2016; 32: 767–769.
36. National Center for Biotechnology Information. Available from: https://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/. Accessed 13 November 2016.
37. Mulder NJ, et al. New developments in the InterPro database. Nucleic Acids Res., 2007; 35: D224–D228.
38. Lowe TM, Chan PP. tRNAscan-SE On-line: integrating search and context for analysis of transfer RNA genes. Nucleic Acids Res., 2016; 44: W54–W57.
39. Schmutz J, et al. A reference genome for common bean and genome-wide analysis of dual domestications. Nat. Genet., 2014; 46: 707–713.
40. Sakai H, et al. The Vigna Genome Server, “Vig GS”: A genomic knowledge base of the genus Vigna based on high-quality, annotated genome sequence of the Azuki Bean, Vigna angularis (Willd.) Ohwi & Ohashi. Plant Cell Physiol., 2016; 57: e2.
41. Sato S, et al. Genome structure of the legume, Lotus japonicus. DNA Res., 2008; 15: 227–239.
42. The Arabidopsis Information Resource (TAIR). Araport11 genome release. https://arabidopsis.org/download/index-auto.jsp?dir=%2Fdownload_files%2FGenes%2FAraport11_genome_release. Accessed 26 November 2016.
43. Fuller DQ, Murphy C. The origins and early dispersal of horsegram (Macrotyloma uniflorum), a major crop of ancient India. Genet. Resour. Crop. Evol., 65: 285–305.
44. Sharma V, Sharma TR, Rana JC, Chahota RK. Analysis of genetic diversity and population structure in horsegram (Macrotyloma uniflorum) using RAPD and ISSR markers. Agri. Res., 2015; 4: 221–230.
45. Shirasawa K, Hirakawa H, Isobe S. Analytical workflow of double-digest restriction site-associated DNA sequencing based on empirical and in silico optimization in tomato. DNA Res., 2016; 23: 145–153.
46. Van Berloo R. GGT 2.0: versatile software for visualization and analysis of genetic data. J. Hered., 2008; 99: 232–236.
47. Kumar S, Stecher G, Li M, Knyaz C, Tamura K. MEGA X: molecular evolutionary genetics analysis across computing platforms. Mol. Biol. Evol., 2018; 35: 1547–1549.
48. Krzywinski M, et al. Circos: an information aesthetic for comparative genomics. Genome Res., 2009; 19: 1639–1645.
49. Xie C, Tammi MT. CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinform., 2009; 10: 1–9.
50. Bhardwaj J, et al. Comprehensive transcriptomic study on horse gram (Macrotyloma uniflorum): De novo assembly, functional characterization and comparative analysis in relation to drought stress. BMC Genom., 2013; 14: 1–18.
51. Alter S, et al. DroughtDB: an expert-curated compilation of plant drought stress genes and their homologs in nine species. Database, 2015; 2015: bav046.
52. Plant Stress Gene Database. New Delhi; Jawaharlal Nehru University. http://ccbb.jnu.ac.in/stressgenes/frontpage.html. Accessed 12 October 2020.
53. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 2012; 28: 3150–3152.
54. Tatusov RL, et al. The COG database: an updated version includes eukaryotes. BMC Bioinform., 2003; 4: 1–14.
55. Zhang Z, Li J, Zhao XQ, Wang J, Wong GKS, Yu J. KaKs_Calculator: calculating Ka and Ks through model selection and model averaging. Genomics Proteomics Bioinform., 2006; 4: 259–263.
56. Pecrix Y, et al. Whole-genome landscape of Medicago truncatula symbiotic genes. Nat. Plants, 2018; 4: 1017–1025.
57. Soybase. US Department of Agriculture–Agricultural Research Service, Iowa State University. https://soybase.org/. Accessed 13 August 2021.
58. Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol., 2019; 20: 1–14.
59. European Molecular Biology Laboratory–European Bioinformatics Institute (EMBL–EBI). MUSCLE. Multiple sequence alignment. Hinxton: EMBL–EBI, 2019; https://www.ebi.ac.uk/Tools/msa/muscle. Accessed 13 August 2021.
60. Castresana J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol., 2000; 17: 540–552.
61. Kumar S, Nei M, Dudley J, Tamura K. MEGA: a biologist-centric software for evolutionary analysis of DNA and protein sequences. Brief Bioinform., 2008; 9: 299–306.
62. Hedges SB, Dudley J, Kumar S. TimeTree: a public knowledge-base of divergence times among organisms. Bioinformatics, 2006; 22: 2971–2972.
63. Stefanović S, Pfeil BE, Palmer JD, Doyle JJ. Relationships among phaseoloid legumes based on sequences from eight chloroplast regions. System Bot., 2009; 34: 115–128.
64. Horsegram Database. Kazusa DNA Research Institute. 2017; http://horsegram.kazusa.or.jp/. Accessed 19 January 2021.


Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

Data sets supporting the results of this article are available in GigaScience Database [10].

[ref1] 1.Gillett J, Polhill R, Verdcourt B, . Flora of tropical East Africa. Leguminosae (Part 3) Subfamily Papilionoideae (1); Leguminosae (Part 4) Subfamily Papilionoideae (2). London: Crown Copyright, 1971. [Google Scholar]

[ref2] 2.Arora R, Chandel K, . Botanical source areas of wild herbage legumes in India. Trop Grasslands, 1972; 6: 213–221. [Google Scholar]

[ref3] 3.Chahota RK, Sharma TR, Sharma SK, Kumar N, Rana JC, . 12 - Horsegram. In: Singh M, Upadhyaya HD, Bisht IS (eds), Genetic and genomic resources of grain legume improvement. Oxford: Elsevier, 2013; pp. 293–305. [Google Scholar]

[ref4] 4.Patel DP, Dabas BS, Sapra RS, Mandal S, . Evaluation of horsegram (Macrotyloma uniflorum) (Lam.) germplasm. New Delhi: National Bureau of Plant Genetic Resources Publication, 1995. [Google Scholar]

[ref5] 5.Gopalan C, Ramashastri BV, Balasubramanyan SC, . Nutritive value of Indian foods. Hyderabad: National Institute of Nutrition, ICMR, Offset Press, 1999; p. 156. [Google Scholar]

[ref6] 6.Reddy B, Brijitha N, Raghavender C, . Aflatoxin contamination in insect damaged seeds of horsegram under storage. Mycotoxin Res., 2005; 21: 187–191. [DOI] [PubMed] [Google Scholar]

[ref7] 7.Gautam M, Katoch S, Chahota RK, . Comprehensive nutritional profiling and activity directed identification of lead antioxidant, antilithiatic agent from Macrotyloma uniflorum (Lam.) Verdc. Food Res. Int., 2020; 137: 109600. [DOI] [PubMed] [Google Scholar]

[ref8] 8.Gautam M, Datt N, Chahota RK, . Assessment of calcium oxalate crystal inhibition potential, antioxidant activity and amino acid profiling in horse gram (Macrotyloma uniflorum): high altitude farmer’s varieties. 3 Biotech., 2020; 10: 402. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref9] 9.National Research Council. . Tropical legumes: resources for the future. Washington, DC: National Academies Press, 1979; 10.17226/19836. [DOI] [Google Scholar]

[ref10] 10.Shirasawa K, Chahota RK, Hirakawa H, Nagano S, Nagasaki H, Sharma TR, Isobe SN, . Supporting data for “A Chromosome-scale draft genome sequence of horsegram (Macrotyloma uniflorum)”. GigaScience Database. 2021; 10.5524/100932. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref11] 11.Marcais G, Kingsford C, . A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 2011; 27: 764–770. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref12] 12.Luo R, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience, 2012; 1: 18. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref13] 13.Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W, . Scaffolding pre-assembled contigs using SSPACE. Bioinformatics, 2011; 27: 578–579. [DOI] [PubMed] [Google Scholar]

[ref14] 14.Boetzer M, Pirovano W, . Toward almost closed genomes with GapFiller. Genome Biol., 2012; 13: 1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref15] 15.Kajitani R, et al. Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Genome Res., 2014; 24: 1384–1395. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref16] 16.Zimin AV, Marçais G, Puiu D, Roberts M, Salzberg SL, Yorke JA, . The MaSuRCA genome assembler. Bioinformatics, 2013; 29: 2669–2677. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref17] 17.Bankevich A, Pevzner PA, . TruSPAdes: barcode assembly of TruSeq synthetic long reads. Nat. Methods, 2016; 13: 248–250. [DOI] [PubMed] [Google Scholar]

[ref18] 18.Kosugi S, Hirakawa H, Tabata S, . GMcloser: closing gaps in assemblies accurately with a likelihood-based selection of contig or long-read alignments. Bioinformatics, 2015; 31: 3733–3741. [DOI] [PubMed] [Google Scholar]

[ref19] 19.University of California Santa Cruz Genomics Institute. . UCSC Genome Browser on Human February 2009 (GRCh37/hg19) Assembly. https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.13/. Accessed 13 April 2016.

[ref20] 20.National Center for Biotechnology Information . https://ftp.ncbi.nlm.nih.gov/genomes/refseq/fungi/. Accessed 13 April 2016.

[ref21] 21.National Center for Biotechnology Information . https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/. Accessed 13 April 2016.

[ref22] 22.UniVec . Bethesda, MD: National Center for Biotechnology Information. 2016; http://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/. Accessed 13 April 2016.

[ref23] 23.Air GM, Els MC, Brown LE, Laver WG, Webster RG, . Location of antigenic sites on the three-dimensional structure of the influenza N2 virus neuraminidase. Virology, 1985; 145: 237–248. [DOI] [PubMed] [Google Scholar]

[ref24] 24.Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM, . BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics, 2015; 31: 3210–3212. [DOI] [PubMed] [Google Scholar]

[ref25] 25.Langmead B, Wilks C, Antonescu V, Charles R, . Scaling read aligners to hundreds of threads on general-purpose processors. Bioinformatics, 2019; 35: 421–432. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref26] 26.Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H, . Twelve years of SAMtools and BCFtools. GigaScience, 2021; 10(2): giab008. 10.1093/gigascience/giab008. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref27] 27.Danecek P, et al. The variant call format and VCFtools. Bioinformatics, 2011; 27: 2156–2158. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref28] 28.Shirasawa K, Kuwata C, Watanabe M, Fukami M, Hirakawa H, Isobe S, . Target amplicon sequencing for genotyping genome-wide single nucleotide polymorphisms identified by whole-genome resequencing in peanut. Plant Genome, 2016; 9: 10.3835/plantgenome2016.06.0052. [DOI] [PubMed] [Google Scholar]

[ref29] 29.Van Ooijen J, . JoinMap® 4, Software for the calculation of genetic linkage maps in experimental populations. Wageningen: Kyazma BV, 2006; p. 33. [Google Scholar]

[ref30] 30. Smit AFA, Hubley R, Green P. RepeatMasker Open-3.0. 2013–2015, http://www.repeatmasker.org.

[ref31] 31. Bao W, Kojima KK, Kohany O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA, 2015; 6: 11.

[ref32] 32. Price AL, Jones NC, Pevzner PA. De novo identification of repeat families in large genomes. Bioinformatics, 2005; 21(Suppl 1): i351–i358.

[ref33] 33. Kofler R, Schlötterer C, Lelley T. SciRoKo: a new tool for whole genome microsatellite search and investigation. Bioinformatics, 2007; 23: 1683–1685.

[ref34] 34. Henschel R, Lieber M, Wu LS, Nista PM, Haas BJ, LeDuc RD. Trinity RNA-Seq assembler performance optimization. In: Proceedings of the 1st conference of the extreme science and engineering discovery environment: bridging from the eXtreme to the campus and beyond. 2012; pp. 1–8.

[ref35] 35. Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M. BRAKER1: unsupervised RNA-seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics, 2016; 32: 767–769.

[ref36] 36. National Center for Biotechnology Information. Available from: https://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/. Accessed 13 November 2016.

[ref37] 37. Mulder NJ, et al. New developments in the InterPro database. Nucleic Acids Res., 2007; 35: D224–D228.

[ref38] 38. Lowe TM, Chan PP. tRNAscan-SE On-line: integrating search and context for analysis of transfer RNA genes. Nucleic Acids Res., 2016; 44: W54–W57.

[ref39] 39. Schmutz J, et al. A reference genome for common bean and genome-wide analysis of dual domestications. Nat. Genet., 2014; 46: 707–713.

[ref40] 40. Sakai H, et al. The Vigna Genome Server, “Vig GS”: A genomic knowledge base of the genus Vigna based on high-quality, annotated genome sequence of the Azuki Bean, Vigna angularis (Willd.) Ohwi & Ohashi. Plant Cell Physiol., 2016; 57: e2.

[ref41] 41. Sato S, et al. Genome structure of the legume, Lotus japonicus. DNA Res., 2008; 15: 227–239.

[ref42] 42. The Arabidopsis Information Resource (TAIR). Araport11 genome release. https://arabidopsis.org/download/index-auto.jsp?dir=%2Fdownload_files%2FGenes%2FAraport11_genome_release. Accessed 26 November 2016.

[ref43] 43. Fuller DQ, Murphy C. The origins and early dispersal of horsegram (Macrotyloma uniflorum), a major crop of ancient India. Genet. Resour. Crop. Evol., 2018; 65: 285–305.

[ref44] 44. Sharma V, Sharma TR, Rana JC, Chahota RK. Analysis of genetic diversity and population structure in horsegram (Macrotyloma uniflorum) using RAPD and ISSR markers. Agri. Res., 2015; 4: 221–230.

[ref45] 45. Shirasawa K, Hirakawa H, Isobe S. Analytical workflow of double-digest restriction site-associated DNA sequencing based on empirical and in silico optimization in tomato. DNA Res., 2016; 23: 145–153.

[ref46] 46. Van Berloo R. GGT 2.0: versatile software for visualization and analysis of genetic data. J. Hered., 2008; 99: 232–236.

[ref47] 47. Kumar S, Stecher G, Li M, Knyaz C, Tamura K. MEGA X: molecular evolutionary genetics analysis across computing platforms. Mol. Biol. Evol., 2018; 35: 1547–1549.

[ref48] 48. Krzywinski M, et al. Circos: an information aesthetic for comparative genomics. Genome Res., 2009; 19: 1639–1645.

[ref49] 49. Xie C, Tammi MT. CNV-seq, a new method to detect copy number variation using high-throughput sequencing. BMC Bioinform., 2009; 10: 1–9.

[ref50] 50. Bhardwaj J, et al. Comprehensive transcriptomic study on horse gram (Macrotyloma uniflorum): De novo assembly, functional characterization and comparative analysis in relation to drought stress. BMC Genom., 2013; 14: 1–18.

[ref51] 51. Alter S, et al. DroughtDB: an expert-curated compilation of plant drought stress genes and their homologs in nine species. Database, 2015; 2015: bav046.

[ref52] 52. Plant Stress Gene Database. New Delhi; Jawaharlal Nehru University. http://ccbb.jnu.ac.in/stressgenes/frontpage.html. Accessed 12 October 2020.

[ref53] 53. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 2012; 28: 3150–3152.

[ref54] 54. Tatusov RL, et al. The COG database: an updated version includes eukaryotes. BMC Bioinform., 2003; 4: 1–14.

[ref55] 55. Zhang Z, Li J, Zhao XQ, Wang J, Wong GKS, Yu J. KaKs_Calculator: calculating Ka and Ks through model selection and model averaging. Genomics Proteomics Bioinform., 2006; 4: 259–263.

[ref56] 56. Pecrix Y, et al. Whole-genome landscape of Medicago truncatula symbiotic genes. Nat. Plants, 2018; 4: 1017–1025.

[ref57] 57. Soybase. US Department of Agriculture–Agricultural Research Service, Iowa State University. https://soybase.org/. Accessed 13 August 2021.

[ref58] 58. Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol., 2019; 20: 1–14.

[ref59] 59. European Molecular Biology Laboratory–European Bioinformatics Institute (EMBL–EBI). MUSCLE. Multiple sequence alignment. Hinxton: EMBL–EBI, 2019; https://www.ebi.ac.uk/Tools/msa/muscle. Accessed 13 August 2021.

[ref60] 60. Castresana J. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Mol. Biol. Evol., 2000; 17: 540–552.

[ref61] 61. Kumar S, Nei M, Dudley J, Tamura K. MEGA: a biologist-centric software for evolutionary analysis of DNA and protein sequences. Brief Bioinform., 2008; 9: 299–306.

[ref62] 62. Hedges SB, Dudley J, Kumar S. TimeTree: a public knowledge-base of divergence times among organisms. Bioinformatics, 2006; 22: 2971–2972.

[ref63] 63. Stefanović S, Pfeil BE, Palmer JD, Doyle JJ. Relationships among phaseoloid legumes based on sequences from eight chloroplast regions. System Bot., 2009; 34: 115–128.

[ref64] 64. Horsegram Database. Kazusa DNA Research Institute. 2017; http://horsegram.kazusa.or.jp/. Accessed 19 January 2021.
