Chromosome-level assembly of the common vetch (Vicia sativa) reference genome

Hangwei Xi; Vy Nguyen; Christopher Ward; Zhipeng Liu; Iain R Searle

doi:10.46471/gigabyte.38

2022 Jan 31;2022:gigabyte38. doi: 10.46471/gigabyte.38

Hangwei Xi¹, Vy Nguyen¹, Christopher Ward¹, Zhipeng Liu²*, Iain R Searle¹*

PMCID: PMC9650280 PMID: 36824524

Abstract

Vicia sativa L. (common vetch, n = 6) is an annual, herbaceous, climbing legume that originated in the Fertile Crescent of the Middle East and is now widespread in the Mediterranean basin, West, Central and Eastern Asia, and North and South America. V. sativa is economically important as a forage legume in countries such as Australia, China, and the USA, and contributes valuable nitrogen to agricultural rotation cropping systems. To accelerate precision genome breeding and genomics-based selection of this legume, we present a chromosome-level reference genome sequence for V. sativa, constructed using a combination of long-read Oxford Nanopore sequencing, short-read Illumina sequencing, and high-throughput chromosome conformation data (CHiCAGO and Hi-C). The chromosome-level assembly of six pseudo-chromosomes has a total genome length of 1.65 Gbp, with a contig N50 of 684 Kbp. BUSCO analysis of the assembly indicated high completeness, recovering 98% of the dicotyledonous orthologs. RNA-seq analysis and gene modelling enabled the annotation of 53,218 protein-coding genes. This V. sativa assembly will provide insights into vetch genome evolution and be a valuable resource for genomic breeding, studies of genetic diversity, and understanding adaptation to diverse arid environments.

Data Description

Background

Vicia sativa L. (common vetch, NCBI:txid3908) (Figure 1) is an annual legume belonging to the family Fabaceae and the genus Vicia [1]. The Vicia genus contains about 180–210 species, including the economically important crop broad bean (Vicia faba) [2]. To date, no chromosome-level genome assembly has been reported within the Vicia genus. Interestingly, V. sativa has at least three different reported haploid chromosome numbers (n = 5, 6 or 7) [3], of which n = 6 is the best-characterized karyotype.

V. sativa is thought to have originated in the Fertile Crescent of the Middle East and is now widespread on every continent as both a crop and a weed [4]. V. sativa is a multipurpose legume; the plants are often grown for forage, and the seeds can be used safely as feed for ruminant animals. V. sativa seed contains up to 30% crude protein and is rich in essential amino acids and unsaturated fatty acids [5]. However, only a small amount of the seed can be safely fed to monogastric animals such as chickens and pigs because of the presence of the neurotoxic non-proteinaceous amino acids β-cyano-L-alanine and γ-glutamyl-β-cyano-alanine [6].

V. sativa is often used in crop rotation systems to increase nitrogen input to the soil. In a 4-year study of a V. sativa/wheat rotation, cultivation of V. sativa during autumn increased soil water storage and subsequently increased the biological yield and grain yield of wheat. Both yields doubled in the third year compared with the second year of the rotation [7]. Furthermore, the symbiosis between soil rhizobia and V. sativa roots allows the plant to fix atmospheric nitrogen and later supply nitrogen to the following crop, reducing the need for expensive nitrogen fertilizer [8]. V. sativa also exhibits excellent drought tolerance and is suitable for cultivation in arid areas. In one drought tolerance study, V. sativa withstood a month of drought stress, with no significant decrease in leaf weight compared with the non-drought control [9]. With these multiple uses, V. sativa is a valuable crop in sustainable agricultural systems [10].

Given the agricultural value of V. sativa, vetch breeders have primarily selected for high yield, pod-shattering resistance, flowering time, and resistance to diseases caused by Ascochyta fabae, Uromyces viciae-fabae (rust) and Sclerotinia sclerotiorum [11]. Recently published transcriptome data have allowed agriculturally important traits, such as pod-shattering resistance [12] and drought tolerance [13], to be examined at the gene expression level in V. sativa. However, the lack of a high-quality reference genome currently impedes the genetic mapping of important genes and hinders further applications, such as genome editing, that are already available in other crops.

Context

In this study, we assembled a high-quality chromosome-level reference genome for V. sativa, which is the first chromosome-level reference genome in the Vicia genus. We performed genome annotation using RNA-seq data from five tissues to ensure most of the expressed genes were captured. We also included a phylogenetic analysis of V. sativa and legume relatives. We envisage that our V. sativa genome will be an important resource for evolutionary studies of this species. The well-annotated chromosome-level genome will also provide important information to facilitate genetic mapping, gene discovery and functional gene studies.

Methods

Sampling and sequencing

To prepare V. sativa for whole genome sequencing (WGS) using long-read and short-read data, seeds of cultivar Studenica (V. sativa subsp. sativa) were obtained from the South Australian Research and Development Institute (SARDI, South Australia, Australia). Seeds were sterilized and germinated in vitro on half-strength Murashige & Skoog (1/2 MS) basal medium with 1% sucrose for 3 days at 25 °C in the dark. Bulk 3-mm-long primary root tips were then harvested and snap-frozen in liquid nitrogen for subsequent DNA extraction. DNA was extracted using the phenol:chloroform method [14], with an additional high-salt low-ethanol wash to improve DNA purity [15]. DNA quality was confirmed by electrophoresis on a 1% agarose gel. The DNA was sent to the Australian Genome Research Facility (AGRF, Melbourne, Australia) and Novogene Co., Ltd (Hong Kong, China) for library preparation and sequencing on a PromethION (PromethION, RRID:SCR_017987) and a NovaSeq 6000 (Illumina NovaSeq 6000 Sequencing System, RRID:SCR_016387), respectively. We obtained 72 gigabase pairs (Gbp) of Nanopore long-read data and 205 Gbp of paired-end short-read data (150 base pairs [bp] read length).

To produce V. sativa CHiCAGO sequencing data [16] and Hi-C sequencing data [17], 2 g of young leaf tissue was snap-frozen in liquid nitrogen and sent to Dovetail Genomics (USA) for library preparation and sequencing. CHiCAGO and Hi-C libraries were sequenced on an Illumina HiSeq X (Illumina HiSeq X Ten, RRID:SCR_016385) to produce 162 Gbp of CHiCAGO and 148 Gbp of Hi-C sequencing data, respectively.

To prepare V. sativa RNA sequencing (RNA-seq) data, RNA was purified using the Spectrum™ Plant Total RNA Kit (Sigma Aldrich) from the first two fully expanded leaves, shoot apexes with young leaves up to 1 cm long from 4-week-old plants, roots from 5-day-old seedlings, and 4-week-old leaf-derived callus tissues. Additional DNase I treatment was used to remove DNA contamination (On-Column DNase I Digestion, Sigma Aldrich), and ribosomal RNA depletion was used to enrich for the non-ribosomal RNA fraction (Ribo-Zero rRNA Removal Kit for Plant Leaf or Plant Seed/Root, Illumina) [18]. Directional RNA libraries were prepared for each tissue using the NEBNext Ultra™ Directional RNA Library Prep Kit for Illumina (New England Biolabs) following the manufacturer’s protocol. Libraries were sent to Novogene Co., Ltd (Hong Kong, China) for sequencing on a NovaSeq 6000 (Illumina) to obtain 150-bp paired-end read data. In total, we obtained 74.6 Gbp of RNA-seq data. A summary of the long- and short-read sequencing data is provided in Table 1.

Table 1.

Overview of sequencing data generated in this study.

Libraries	Insert size (bp)	Raw data (Gbp)	Clean data (Gbp)	Mean read length (bp)	Coverage (×)*
WGS Illumina short-reads	300	205.13	200.28	150	124.32
Nanopore reads	N/A	72.12	N/A	9094	43.71
CHiCAGO	350	162.00	N/A	150	98.18
Hi-C	350	147.60	N/A	150	89.45
Illumina RNA-seq reads	300	74.60	66.49	150	45.21
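
The coverage column of Table 1 can be reproduced by dividing each raw data volume by the genome size. The footnote that defines coverage is not reproduced here, so the 1.65 Gbp denominator below is an assumption taken from the total assembly length reported in the abstract; the function and variable names are illustrative only.

```python
# Raw data volumes (Gbp) copied from Table 1.
RAW_GBP = {
    "WGS Illumina short-reads": 205.13,
    "Nanopore reads": 72.12,
    "CHiCAGO": 162.00,
    "Hi-C": 147.60,
    "Illumina RNA-seq reads": 74.60,
}

# Assumption: coverage is expressed relative to the 1.65 Gbp assembly length.
GENOME_SIZE_GBP = 1.65


def coverage(raw_gbp: float, genome_gbp: float = GENOME_SIZE_GBP) -> float:
    """Return sequencing depth (x) as raw bases divided by genome size."""
    return round(raw_gbp / genome_gbp, 2)


for library, gbp in RAW_GBP.items():
    print(f"{library}: {coverage(gbp)}x")
```

Under this assumption, the computed depths agree with the tabulated coverage values to two decimal places.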

BUSCO analysis	No polishing (%)	1st polishing (%)	2nd polishing (%)
Complete	69.9	97.7	97.8
Complete and single-copy	63	87.3	88.9
Complete and duplicated	6.9	10.4	8.9
Fragmented	3.5	0.3	0.3
Missing	26.6	2.0	1.9
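
As a quick sanity check on the BUSCO table above, the categories should satisfy two identities: complete = single-copy + duplicated, and complete + fragmented + missing = 100%. The sketch below tests both with a small tolerance for rounding; the tolerance value and all names are assumptions made for illustration.

```python
# BUSCO percentages copied from the table above, in the order:
# (complete, single-copy, duplicated, fragmented, missing)
BUSCO = {
    "No polishing": (69.9, 63.0, 6.9, 3.5, 26.6),
    "1st polishing": (97.7, 87.3, 10.4, 0.3, 2.0),
    "2nd polishing": (97.8, 88.9, 8.9, 0.3, 1.9),
}

# Assumed tolerance (percentage points) to absorb rounding in reported values.
TOLERANCE = 0.15


def is_consistent(row) -> bool:
    """Check the two BUSCO category identities for one polishing stage."""
    complete, single, duplicated, fragmented, missing = row
    parts_ok = abs(single + duplicated - complete) <= TOLERANCE
    total_ok = abs(complete + fragmented + missing - 100.0) <= TOLERANCE
    return parts_ok and total_ok


for stage, row in BUSCO.items():
    print(stage, "consistent" if is_consistent(row) else "inconsistent")
```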

Pseudo-chromosome	Length (bp)
1	324,818,257
2	324,640,943
3	290,752,327
4	290,123,409
5	272,590,232
6	148,681,034
Total	1,651,606,202
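
The pseudo-chromosome total can be checked by summing the six lengths and comparing the sum with the total assembly length (1,653,553,227 bp) given in the assembly statistics. The following is a minimal sketch; treating the difference as sequence not placed on pseudo-chromosomes is an interpretation, not something stated in the tables.

```python
# Pseudo-chromosome lengths (bp) copied from the table above.
PSEUDO_CHROMOSOME_BP = {
    1: 324_818_257,
    2: 324_640_943,
    3: 290_752_327,
    4: 290_123_409,
    5: 272_590_232,
    6: 148_681_034,
}

# "Total length (bp)" from the assembly statistics table.
ASSEMBLY_TOTAL_BP = 1_653_553_227

anchored_bp = sum(PSEUDO_CHROMOSOME_BP.values())
# Interpretation (assumption): the remainder is not placed on a pseudo-chromosome.
unplaced_bp = ASSEMBLY_TOTAL_BP - anchored_bp
anchored_pct = round(100 * anchored_bp / ASSEMBLY_TOTAL_BP, 2)

print(f"Anchored: {anchored_bp:,} bp ({anchored_pct}%)")
print(f"Remainder: {unplaced_bp:,} bp")
```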

Feature	Value
Total length (bp)	1,653,553,227
No. of contigs	9,990
Contig N50 length (bp)	684,593
Scaffold N50 length (bp)	290,126,875
GC content (%)	35.6
Predicted protein-coding genes	53,218
Predicted noncoding genes	3,966
Content of repetitive sequences (%)	83.92
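
The contig and scaffold N50 values in the table above are length-weighted statistics, not medians: N50 is the length L such that sequences of length L or longer contain at least half of the assembled bases. The sketch below illustrates the definition on a small set of hypothetical lengths (not taken from this assembly) and contrasts it with the median.

```python
from statistics import median


def n50(lengths):
    """Return the length L such that sequences >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical contig lengths, chosen only to illustrate the definition.
toy_lengths = [2, 3, 4, 5, 6]

print("N50:", n50(toy_lengths))
print("Median:", median(toy_lengths))
```

Because long sequences contribute more bases, N50 is generally larger than the median length for assemblies containing many short contigs.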

Parameters	Percentage (%)
Reads mapping rate	99.7
Genome coverage	84.1
Coverage at least 5×	81.9
Coverage at least 10×	78.3
Coverage at least 20×	76.7

Class	Subclass	Number of elements	Length occupied (bp)	% of genome
Retroelements		1,361,823	1,064,507,557	64.4
	LINEs	5,620	2,743,407	0.2
	LTR elements	1,356,203	1,061,764,150	64.2
DNA transposons		704,467	242,003,507	14.6
	Mutator TIR transposon	209,091	116,510,919	7.0
	hobo-Activator	88	34,340	0.0
	Tourist/Harbinger	318	212,845	0.01
Unclassified		319,392	69,154,926	4.2
Simple repeats		174,030	10,230,793	0.6
Low complexity		29,826	1,557,616	0.1
Total		2,589,538	1,387,454,399	83.9
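
The "% of genome" column can be recomputed from the occupied lengths. The sketch below assumes the denominator is the total assembly length of 1,653,553,227 bp from the assembly statistics; that choice is an assumption, although it reproduces the tabulated percentages for the top-level classes.

```python
# Assumed denominator: "Total length (bp)" from the assembly statistics table.
ASSEMBLY_TOTAL_BP = 1_653_553_227

# Occupied lengths (bp) of the top-level repeat classes, copied from the table above.
REPEAT_BP = {
    "Retroelements": 1_064_507_557,
    "DNA transposons": 242_003_507,
    "Unclassified": 69_154_926,
    "Simple repeats": 10_230_793,
    "Low complexity": 1_557_616,
}


def pct_of_genome(bp: int) -> float:
    """Return the percentage of the assembly occupied by `bp` bases."""
    return round(100 * bp / ASSEMBLY_TOTAL_BP, 1)


total_repeat_bp = sum(REPEAT_BP.values())

for repeat_class, bp in REPEAT_BP.items():
    print(f"{repeat_class}: {pct_of_genome(bp)}%")
print(f"Total: {total_repeat_bp:,} bp ({pct_of_genome(total_repeat_bp)}%)")
```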

Database		Annotated number	Annotated percentage (%)
NCBI-NR		44,400	83.4
Swiss-Prot		31,071	58.4
InterPro	All	43,549	81.8
	Pfam	30,264	56.9
	GO	8,983	16.9
eggNOG	Pfam	34,527	64.9
	KEGG_pathway	10,777	20.3
	KEGG_ko	16,898	31.8
	GO	17,987	33.8
Annotated		47,580	89.4
Total		53,218	—

Type		Copy number	Average length (bp)	Total length (bp)	% of genome
miRNA		158	111.3	17,579	0.001
tRNA		1,382	73.7	101,891	0.006
rRNA	rRNA	649	440.1	285,638	0.017
	18S	32	1,763.5	56,431	0.003
	28S	39	4,249.9	165,745	0.010
	5S	578	109.8	63,462	0.003
snRNA	snRNA	1,777	107.5	191,047	0.011
	C/D-box	1,551	102.4	158,835	0.010
	H/ACA-box	69	126.7	8,740	0.001
	splicing	157	149.5	23,472	0.001

Species	Abbreviation name	Source of data	Data version
Vicia sativa	V. sat	This project
Pisum sativum	P. sat	URGI	V1a
Medicago truncatula	M. tru	INRA	MtA17 r5
Trifolium pratense	T. pra	Phytozome	v2
Phaseolus vulgaris	P. vul	Phytozome	v2.1
Phaseolus lunatus	P. lun	Phytozome	v1
Vigna unguiculata	V. ung	Phytozome	v1.2
Chamaecrista fasciculata	C. fas	GigaDB	v1
Faidherbia albida	F. alb	GigaDB	N/A
Cercis canadensis	C. can	GigaDB	v1
Carya illinoinensis	C. ill	Phytozome	v1.1
Arabidopsis thaliana	A. tha	Phytozome	TAIR10

Species	Number of genes	Number of orthogroups	Number of genes in orthogroups	Number of species-specific orthogroups	Number of genes in species-specific orthogroups	Single copy genes
V. sat	53,218	19,096	48,028	1,774	8,594	10,009
P. sat	57,835	19,012	51,576	2,203	10,289	8,131
M. tru	44,618	18,528	38,693	909	3,180	10,755
T. pra	39,943	18,366	36,476	791	2,558	10,686
P. vul	27,433	16,521	26,884	47	137	10,660
P. lun	43,997	16,918	42,007	408	7,518	10,730
V. ung	31,948	16,741	30,176	336	1,463	10,297
C. fas	32,832	14,944	31,229	472	4,336	9,630
F. alb	28,979	15,695	26,573	450	1,666	9,883
C. can	34,023	16,165	32,407	694	3,767	12,289
C. ill	31,911	15,424	30,007	528	2,501	7,830
A. tha	27,416	14,171	24,887	870	4,286	8,851
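
One way to read the orthogroup table above is to derive, for each species, how many genes were not assigned to any orthogroup and what share of genes were assigned. The sketch below does this for a subset of the species; the derived quantities and the names used are illustrative and are not reported in the source.

```python
# (total genes, genes in orthogroups) copied from the table above
# for a subset of the species.
GENE_COUNTS = {
    "V. sat": (53_218, 48_028),
    "P. sat": (57_835, 51_576),
    "M. tru": (44_618, 38_693),
    "A. tha": (27_416, 24_887),
}


def assignment_summary(total: int, in_orthogroups: int):
    """Return (unassigned gene count, percentage of genes in orthogroups)."""
    unassigned = total - in_orthogroups
    pct_assigned = round(100 * in_orthogroups / total, 1)
    return unassigned, pct_assigned


for species, (total, assigned) in GENE_COUNTS.items():
    unassigned, pct = assignment_summary(total, assigned)
    print(f"{species}: {pct}% in orthogroups, {unassigned:,} unassigned")
```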

Node	Definition	Fossil	Age (Ma)
Yellow	SG Brassicales	Flowers of Dressiantha bicarpellata; USA	89.3
Red	SG Leguminosae	Seedpods and leaflets; USA	65.3
Blue	SG Caesalpinioideae	Bipinnate leaves; Colombia	58
Green	SG Papilionoideae	Flowers of Barnebyanthus buchananensis; USA	55

Reviewer name and names of any other individuals who aided in the review	Jonathan Kreplak
Do you understand and agree to our policy of having open and named reviews, and having your review included with the published papers? (If no, please inform the editor that you cannot review this manuscript.)	Yes
Is the language of sufficient quality?	Yes
Are all data available and do they match the descriptions in the paper?	Yes
Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples: http://gigadb.org/site/guide	Yes
Is the data acquisition clear, complete and methodologically sound?	Yes
Is there sufficient detail in the methods and data-processing steps to allow reproduction?	No
Additional Comments	For "Phylogenetic tree construction and divergence time estimation", 64 single-copy orthologs were selected; they should be listed in a supplementary table so that the analysis can be fully reproduced. Also, Supplementary Table S9 should relate to the fossil calibrations but instead shows chromosome lengths.
Is there sufficient data validation and statistical analyses of data quality?	Yes
Is the validation suitable for this type of data?	Yes
Is there sufficient information for others to reuse this dataset or integrate it with other data?	Yes
Any Additional Overall Comments to the Author	This work is a state-of-the-art assembly for Vicia sativa and will be useful for understanding how the Vicia genome size has expanded. The scientific part lacks a few elements that could strengthen the manuscript. In Vicia faba, satellites associated with centromeres are well studied; it would be interesting to check for them here, which could confirm that the assembly is robust. The functional annotation seems solid, but I was surprised by the low number of genes annotated with a GO term using InterPro (Table S6); I would have expected a higher number. Is there an error? eggNOG-mapper also provides GO assignments, but the percentage is not reported in the table. Are they similar? More disappointing for me, the ortholog analysis between species is well done and sufficient but is missing a few sequenced legume genomes, such as P. sativum. Also, M. truncatula has a newer version than the one used. A resource like the Legume Federation (https://www.legumefederation.org/) could have been helpful for selecting the most suitable genomes for this analysis. Figures: one of the main figures (Fig. 2) does not seem right. LTRs (V) represent 60% of the genome sequence but appear as abundant as TIRs (VI), which are a subclass of transposons (less than 16% of the genome). This is because the scale is different for each track. A common scale (such as a density) must be used. Also, you should add the TIR percentage to Table S3. For me, this figure is not informative enough and must be reworked. The legend for the green line in Fig. 3(D) is incorrect and should be changed.
Recommendation	Minor Revision

Reviewer name and names of any other individuals who aided in the review	Jianbo Jian
Do you understand and agree to our policy of having open and named reviews, and having your review included with the published papers? (If no, please inform the editor that you cannot review this manuscript.)	Yes
Is the language of sufficient quality?	Yes
Are all data available and do they match the descriptions in the paper?	Yes
Are the data and metadata consistent with relevant minimum information or reporting standards? See GigaDB checklists for examples: http://gigadb.org/site/guide	Yes
Is the data acquisition clear, complete and methodologically sound?	Yes
Is there sufficient detail in the methods and data-processing steps to allow reproduction?	Yes
Is there sufficient data validation and statistical analyses of data quality?	Yes
Is the validation suitable for this type of data?	Yes
Is there sufficient information for others to reuse this dataset or integrate it with other data?	Yes
Any Additional Overall Comments to the Author	In this manuscript, Xi et al. report a chromosome-level genome of the common vetch (Vicia sativa) that integrates Oxford Nanopore sequencing, Illumina sequencing, CHiCAGO and Hi-C. Gene annotation and evolutionary analyses were then performed based on the reference genome. These genomic resources are valuable for evolutionary research, genetic diversity studies and genomic breeding. I think this manuscript is suitable for publication in Gigabyte. Some minor comments and suggestions follow: 1) Line numbers are missing from this manuscript, which makes detailed comments inconvenient. 2) Page 6, “resequenced short-reads” should be “De novo sequencing” or “sequencing”. 3) The 1.93 Gb assembled genome size is a little larger than the sizes estimated by flow cytometry (1.77 Gb) and GenomeScope (1.61 Gb). There may be some duplicated sequences in this version of the assembled genome. Redundancy-removal software such as Haplotigs or Purge_dups can address this. 4) For evaluation of the genome, the LTR Assembly Index (LAI) is suggested for quality assessment. 5) In Table S2, the mapping rate is very good, but the genome coverage is just 76%, which looks a little low. What is the reason? 6) In Table S4, the gene set was combined by Augustus. However, in the methods, the annotation software is BRAKER v2.1.6.
Recommendation	Minor Revision
